Removing comments with a sliding window without nested while loops

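I'm trying to remove comments and strings from a C file using C code. I'll stick to comments for the examples. I have a sliding window, so at any given moment I only have characters n and n-1. I'm trying to find an algorithm that avoids nested whiles if possible, although I will need one while to getchar through the input. My first thought was to loop until n=* and (n-1)=/, then loop again until n=/ and (n-1)=*, but since that nests one while inside another it feels inefficient. I can do it this way if I have to, but I was wondering whether anyone has a better solution.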

An algorithm written with a single while loop could look like this:

    while ((c = getchar()) != EOF)
    {
        ... // looking at the byte that was just read

        if (...) // the symbol is not inside a comment
        {
            putchar(c);
        }
    }

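To decide whether an input char belongs to a comment, you can use a state machine. In the following example it has 4 states, together with rules for moving to the next state.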

    int c; // int, not char, so that EOF can be told apart from data
    int state = 0;
    int next_state;
    while ((c = getchar()) != EOF)
    {
        switch (state)
        {
            case 0: next_state = (c == '/' ? 1 : 0); break;
            case 1: next_state = (c == '*' ? 2 : c == '/' ? 1 : 0); break;
            case 2: next_state = (c == '*' ? 3 : 2); break;
            case 3: next_state = (c == '/' ? 0 : c == '*' ? 3 : 2); break;
            default: next_state = state; // will never happen
        }

        if (state == 1 && next_state != 2)
        {
            putchar('/'); // the pending slash was not followed by a star
        }
        if (next_state == 0 && state <= 1)
        {
            putchar(c); // ordinary character outside any comment
        }
        state = next_state;
    }
    if (state == 1)
    {
        putchar('/'); // input ended on a lone slash
    }

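The example above is very simple: it does not handle /* correctly in non-comment contexts such as C string literals, it does not support // comments, and so on.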
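As a rough sketch of how those gaps could be closed while still using a single loop, the switch-based machine can be given extra states for `//` comments and for string and character literals. The function name `strip_comments` and the state names are hypothetical, and the sketch works on an in-memory string rather than `getchar()` so that it is easy to test. It assumes no trigraphs and no backslash-newline line splicing, and like the example above it simply drops comments without inserting a space.

```c
#include <stddef.h>

/* Hypothetical state names, not taken from the answer above. */
enum state { CODE, SLASH, BLOCK, BLOCK_STAR, LINE, STR, STR_ESC, CHR, CHR_ESC };

/* Copies src to dst with comments removed, using one loop and one state
 * variable. dst must have room for strlen(src) + 1 bytes.
 * Returns the number of bytes written, not counting the terminator. */
size_t strip_comments(const char *src, char *dst)
{
    enum state st = CODE;
    size_t n = 0;

    for (; *src != '\0'; ++src) {
        char c = *src;
        switch (st) {
        case CODE:
            if (c == '/') { st = SLASH; break; }  /* hold the slash back */
            if (c == '"') st = STR;
            else if (c == '\'') st = CHR;
            dst[n++] = c;
            break;
        case SLASH:
            if (c == '*') { st = BLOCK; break; }
            if (c == '/') { st = LINE; break; }
            dst[n++] = '/';                       /* false start: emit held slash */
            dst[n++] = c;
            st = (c == '"') ? STR : (c == '\'') ? CHR : CODE;
            break;
        case BLOCK:
            if (c == '*') st = BLOCK_STAR;
            break;
        case BLOCK_STAR:
            if (c == '/') st = CODE;
            else if (c != '*') st = BLOCK;
            break;
        case LINE:
            if (c == '\n') { st = CODE; dst[n++] = c; }  /* keep the newline */
            break;
        case STR:
            dst[n++] = c;
            if (c == '\\') st = STR_ESC;
            else if (c == '"') st = CODE;
            break;
        case STR_ESC:                             /* escaped char, copied as-is */
            dst[n++] = c;
            st = STR;
            break;
        case CHR:
            dst[n++] = c;
            if (c == '\\') st = CHR_ESC;
            else if (c == '\'') st = CODE;
            break;
        case CHR_ESC:
            dst[n++] = c;
            st = CHR;
            break;
        }
    }
    if (st == SLASH)
        dst[n++] = '/';                           /* input ended on a lone slash */
    dst[n] = '\0';
    return n;
}
```

Calling `strip_comments("a/*x*/b", buf)` with a large enough `buf` leaves `"ab"` in it. Holding the slash back in its own state is what lets the loop stay flat: the decision about whether it opens a comment is simply deferred to the next iteration.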

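Doing this correctly is more complicated than one might at first think, as the other comments here point out. I would strongly recommend writing a table-driven FSM, using a state transition diagram to get the transitions right. Trying to handle more than a few states with case statements is horribly error-prone, IMO.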

Here is a diagram in dot/graphviz format from which you could probably code a state table directly. Note that I haven't tested this at all, so YMMV.

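The semantics of the diagram are that an edge with an empty label is the fall-through, taken when none of the other inputs in that state match. End of file is an error in any state except S0, and so is any character that is neither explicitly listed nor covered by a fall-through edge. Every character scanned is printed, except inside a comment (S4 and S5) and while detecting the start of a comment (S1). You will have to buffer characters while detecting a comment start, print them if it turns out to be a false start, and throw them away once you are sure it really is a comment.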

In the dot graph, sq is a single quote ' and dq is a double quote ".

    digraph state_machine {
        rankdir=LR;
        size="8,5";

        node [shape=doublecircle]; S0 /* init */;
        node [shape=circle];

        S0 /* init */      -> S1 /* begin_cmt */ [label = "'/'"];
        S0 /* init */      -> S2 /* in_str */    [label = dq];
        S0 /* init */      -> S3 /* in_ch */     [label = sq];
        S0 /* init */      -> S0 /* init */      [label = ""];
        S1 /* begin_cmt */ -> S4 /* in_slc */    [label = "'/'"];
        S1 /* begin_cmt */ -> S5 /* in_mlc */    [label = "'*'"];
        S1 /* begin_cmt */ -> S0 /* init */      [label = ""];
        S1 /* begin_cmt */ -> S1 /* begin_cmt */ [label = "'\\n'"]; // handle "/\n/" and "/\n*"
        S2 /* in_str */    -> S0 /* init */      [label = dq]; // closing quote ends the string
        S2 /* in_str */    -> S6 /* str_esc */   [label = "'\\'"];
        S2 /* in_str */    -> S2 /* in_str */    [label = ""];
        S3 /* in_ch */     -> S0 /* init */      [label = sq];
        S4 /* in_slc */    -> S4 /* in_slc */    [label = ""];
        S4 /* in_slc */    -> S0 /* init */      [label = "'\\n'"];
        S5 /* in_mlc */    -> S7 /* end_mlc */   [label = "'*'"];
        S5 /* in_mlc */    -> S5 /* in_mlc */    [label = ""];
        S7 /* end_mlc */   -> S7 /* end_mlc */   [label = "'*'|'\\n'"];
        S7 /* end_mlc */   -> S0 /* init */      [label = "'/'"];
        S7 /* end_mlc */   -> S5 /* in_mlc */    [label = ""];
        S6 /* str_esc */   -> S8 /* oct */       [label = "[0-3]"];
        S6 /* str_esc */   -> S9 /* hex */       [label = "'x'"];
        S6 /* str_esc */   -> S2 /* in_str */    [label = ""];
        S8 /* oct */       -> S10 /* o1 */       [label = "[0-7]"];
        S10 /* o1 */       -> S2 /* in_str */    [label = "[0-7]"];
        S9 /* hex */       -> S11 /* h1 */       [label = hex];
        S11 /* h1 */       -> S2 /* in_str */    [label = hex];
        S3 /* in_ch */     -> S12 /* ch_esc */   [label = "'\\'"];
        S3 /* in_ch */     -> S13 /* out_ch */   [label = ""];
        S13 /* out_ch */   -> S0 /* init */      [label = sq];
        S12 /* ch_esc */   -> S3 /* in_ch */     [label = sq];
        S12 /* ch_esc */   -> S12 /* ch_esc */   [label = ""];
    }
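To give an idea of what a table-driven version might look like, here is a minimal sketch covering only a subset of the diagram: the initial state, comment start, both comment kinds, and string literals with a single escape state. It does not reproduce the diagram edge for edge; character literals and the octal/hex escape states are left out, and the newline handling in the comment-start and comment-end states is simplified. All names (`strip_table`, `classify`, the `S_*`, `C_*` and action constants) are hypothetical, and the function reads from a string rather than a stream to keep it testable.

```c
#include <stddef.h>

enum state { S_CODE, S_SLASH, S_LINE, S_BLOCK, S_STAR, S_STR, S_ESC, N_STATES };
enum cls   { C_SLASH, C_STAR, C_NL, C_DQ, C_BSL, C_OTHER, N_CLASSES };
enum act   { EMIT,    /* print the current character                  */
             HOLD,    /* buffer a '/' that may start a comment        */
             DROP,    /* print nothing                                */
             FLUSH }; /* false start: print the held '/' then current */

struct cell { unsigned char next, act; };

/* One row per state, one column per character class. */
static const struct cell table[N_STATES][N_CLASSES] = {
/*            '/'              '*'              '\n'             '"'              '\\'             other          */
/* CODE  */ {{S_SLASH, HOLD}, {S_CODE,  EMIT}, {S_CODE, EMIT},  {S_STR,  EMIT},  {S_CODE, EMIT},  {S_CODE, EMIT}},
/* SLASH */ {{S_LINE,  DROP}, {S_BLOCK, DROP}, {S_CODE, FLUSH}, {S_STR,  FLUSH}, {S_CODE, FLUSH}, {S_CODE, FLUSH}},
/* LINE  */ {{S_LINE,  DROP}, {S_LINE,  DROP}, {S_CODE, EMIT},  {S_LINE, DROP},  {S_LINE, DROP},  {S_LINE, DROP}},
/* BLOCK */ {{S_BLOCK, DROP}, {S_STAR,  DROP}, {S_BLOCK, DROP}, {S_BLOCK, DROP}, {S_BLOCK, DROP}, {S_BLOCK, DROP}},
/* STAR  */ {{S_CODE,  DROP}, {S_STAR,  DROP}, {S_BLOCK, DROP}, {S_BLOCK, DROP}, {S_BLOCK, DROP}, {S_BLOCK, DROP}},
/* STR   */ {{S_STR,   EMIT}, {S_STR,   EMIT}, {S_STR,  EMIT},  {S_CODE, EMIT},  {S_ESC,  EMIT},  {S_STR,  EMIT}},
/* ESC   */ {{S_STR,   EMIT}, {S_STR,   EMIT}, {S_STR,  EMIT},  {S_STR,  EMIT},  {S_STR,  EMIT},  {S_STR,  EMIT}},
};

/* Maps a character to its column in the table. */
static enum cls classify(char c)
{
    switch (c) {
    case '/':  return C_SLASH;
    case '*':  return C_STAR;
    case '\n': return C_NL;
    case '"':  return C_DQ;
    case '\\': return C_BSL;
    default:   return C_OTHER;
    }
}

/* Copies src to dst with comments removed; dst needs strlen(src) + 1 bytes.
 * Returns the number of bytes written, not counting the terminator. */
size_t strip_table(const char *src, char *dst)
{
    unsigned char st = S_CODE;
    size_t n = 0;

    for (; *src != '\0'; ++src) {
        const struct cell *t = &table[st][classify(*src)];
        if (t->act == FLUSH)
            dst[n++] = '/';
        if (t->act == FLUSH || t->act == EMIT)
            dst[n++] = *src;
        st = t->next;
    }
    if (st == S_SLASH)
        dst[n++] = '/';   /* input ended while a slash was still held */
    dst[n] = '\0';
    return n;
}
```

The loop body never changes when states are added; only the table grows, which is the main attraction of this layout. A fuller version could also report an error when the input ends in any state other than the initial one, as the diagram's semantics require.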

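Since you only want to use two characters for the buffer and a single while loop, I would suggest a third char to track your state (whether you are skipping text or not). I've put together a test program for you, with inline comments explaining the logic: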

    // Program to strip comments and strings from a C file
    //
    //      Build:
    //         gcc -o strip-comments strip-comments.c
    //
    //      Test:
    //         ./strip-comments strip-comments.c

    #include <stdio.h>

    /* The following is a block of strings, and comments for testing
     * the code.
     */

    /* test if three comments *//* chained together */// will be removed.
    static int value = 128 /* test comment within valid code *// 2;
    const char * test1 = "This is a test of \" processing"; /* testing inline comment */
    const char * test2 = "this is a test of \n within strings."; // testing inline comment
    // this is the last test

    int strip_c_code(FILE * in, FILE * out)
    {
       int  c;        // fgetc() returns int so EOF can be told apart from data
       char buff[2];  // buff[0] is the newest character, buff[1] the previous one
       char skipping; // '\0' when copying, else the kind of block being skipped

       skipping = '\0';
       buff[0]  = '\0';
       buff[1]  = '\0';

       // loop through the file
       while((c = fgetc(in)) != EOF)
       {
          buff[0] = (char) c;

          // checking for start of comment or string block
          if (!(skipping))
          {
             // start skipping in "//" comments
             if ((buff[1] == '/') && (buff[0] == '/'))
                skipping = '/';

             // start skipping in "/*" comments
             else if ((buff[1] == '/') && (buff[0] == '*'))
                skipping = '*';

             // start skipping at start of strings, but not character assignments
             else if ( ((buff[1] != '\'') && (buff[0] == '"')) &&
                       ((buff[1] != '\\') && (buff[0] == '"')) )
             {
                if (buff[1])
                   fputc(buff[1], out);
                skipping = '"';
             }

             // clear buffer so that processed characters are not interpreted as
             // end of skip characters.
             if ((skipping))
             {
                buff[0] = '\0';
                buff[1] = '\0';
             }
          }

          // check for characters which terminate skip block
          switch(skipping)
          {
             // if skipping "//" comments, look for new line
             case '/':
             if (buff[1] == '\n')
                skipping = '\0';
             break;

             // if skipping "/*" comments, look for "*/" terminating string
             case '*':
             if ((buff[1] == '*') && (buff[0] == '/'))
             {
                buff[0]  = '\0';
                buff[1]  = '\0';
                skipping = '\0';
             }
             break;

             // if skipping strings, look for terminating '"' character
             case '"':
             if ((buff[1] != '\\') && (buff[0] == '"'))
             {
                skipping = '\0';
                buff[0]  = '\0';
                buff[1]  = '\0';
                fprintf(out, "NULL"); // replace string with NULL
             }
             break;

             default:
             break;
          }

          // if not skipping, write character out
          if ( (!(skipping)) && ((buff[1])) )
             fputc(buff[1], out);

          // shift new character to old character position
          buff[1] = buff[0];
       }

       // verify that the comment or string was terminated properly
       if ((skipping))
       {
          fprintf(stderr, "Unterminated comment or string\n");
          return(-1);
       }

       // write last character, if one is still pending
       if (buff[1])
          fputc(buff[1], out);

       return(0);
    }

    int main(int argc, char * argv[])
    {
       FILE * fs;

       if (argc != 2)
       {
          fprintf(stderr, "Usage: %s <file>\n", argv[0]);
          return(1);
       }

       if ((fs = fopen(argv[1], "r")) == NULL)
       {
          perror("fopen()");
          return(1);
       }

       strip_c_code(fs, stdout);

       fclose(fs);

       return(0);
    }

    /* end of source file */

I've also posted this code on Github to make it easier to download and compile:

https://gist.github.com/syzdek/5417109