Removing comments with a sliding window without nested while loops

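I'm trying to remove comments and strings from a C file using C code. I'll stick to comments for the examples. I have a sliding window, so at any given moment I only have characters n and n-1. I'm trying to figure out an algorithm that does not use nested whiles if possible, although I will need one while to getchar through the input. My first thought was to loop until n=* and (n-1)=/, then loop again until n=/ and (n-1)=*, but since that nests whiles I feel it is inefficient. I can do it this way if I have to, but I was wondering if anyone has a better solution.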

An algorithm written with a single while loop could look like this:

    while ((c = getchar()) != EOF)
    {
        ... // looking at the byte that was just read
        if (...) // the symbol is not inside a comment
        {
            putchar(c);
        }
    }

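To decide whether an input char belongs to a comment, you can use a state machine. In the following example it has 4 states, plus rules for moving to the next state.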

    int c; // int, not char, so that EOF can be distinguished from data
    int state = 0;
    int next_state;

    while ((c = getchar()) != EOF)
    {
        switch (state)
        {
            case 0: next_state = (c == '/' ? 1 : 0); break;
            case 1: next_state = (c == '*' ? 2 : c == '/' ? 1 : 0); break;
            case 2: next_state = (c == '*' ? 3 : 2); break;
            case 3: next_state = (c == '/' ? 0 : c == '*' ? 3 : 2); break;
            default: next_state = state; // will never happen
        }

        if (state == 1 && next_state != 2)
        {
            putchar('/'); // for correct output when a slash is not followed by a star
        }
        if (next_state == 0 && state <= 1)
        {
            putchar(c); // ordinary character outside any comment
        }
        state = next_state;
    }
    if (state == 1)
    {
        putchar('/'); // the input ended right after a lone slash
    }

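The example above is very simple: it does not work correctly for /* in non-comment contexts such as C strings, it does not support // comments, and so on.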
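To make the first limitation concrete, here is a minimal sketch that wraps the same four states in a function working on a string instead of stdin. The name `strip_block_comments` and the buffer-based interface are my own choices for illustration, not part of the answer above; the output rule is written so that a slash that does not open a comment is passed through together with the character after it.

```c
#include <stddef.h>

/* Hypothetical helper: runs the four-state machine over `in` and writes the
 * result to `out`, which must have room for strlen(in) + 1 bytes. */
void strip_block_comments(const char *in, char *out)
{
    /* 0 = code, 1 = saw '/', 2 = inside comment, 3 = saw '*' inside comment */
    int state = 0;
    size_t n = 0;

    for (; *in != '\0'; in++) {
        char c = *in;
        int next_state;

        switch (state) {
        case 0:  next_state = (c == '/') ? 1 : 0; break;
        case 1:  next_state = (c == '*') ? 2 : (c == '/') ? 1 : 0; break;
        case 2:  next_state = (c == '*') ? 3 : 2; break;
        default: next_state = (c == '/') ? 0 : (c == '*') ? 3 : 2; break;
        }

        if (state == 1 && next_state != 2)
            out[n++] = '/';   /* pending slash did not open a comment */
        if (next_state == 0 && state <= 1)
            out[n++] = c;     /* ordinary character outside comments */
        state = next_state;
    }
    if (state == 1)
        out[n++] = '/';       /* input ended on a lone slash */
    out[n] = '\0';
}
```

Feeding it `a = b / c; /* gone */ d;` removes only the comment, but feeding it `s = "/* text */";` yields `s = "";`, because the machine has no notion of being inside a string literal.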

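Doing this correctly is more complicated than one might think at first, as the other comments here have pointed out. I would strongly recommend writing a table-driven FSM, using a state transition diagram to get the transitions right. Trying to handle more than a few states with case statements is horribly error-prone, IMO.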

Here is a diagram in dot / graphviz format from which you could probably code a state table directly. Note that I haven't tested this at all, so YMMV.

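The semantics of the diagram are that an edge with an empty label ("") is a fall-through: it is taken when none of the other inputs listed for that state match. End of file is an error in any state except S0, and so is any character that is neither listed explicitly nor covered by a fall-through edge. Every character scanned is printed, except while in a comment (S4 and S5) and while detecting the start of a comment (S1). You will have to buffer characters while detecting the start of a comment, print them if it turns out to be a false start, and otherwise throw them away once you are sure it really is a comment.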

In the dot graph, sq is a single quote ' and dq is a double quote ".

    digraph state_machine {
        rankdir=LR;
        size="8,5";

        node [shape=doublecircle]; S0 /* init */;
        node [shape=circle];

        S0  /* init */      -> S1  /* begin_cmt */ [label = "'/'"];
        S0  /* init */      -> S2  /* in_str */    [label = dq];
        S0  /* init */      -> S3  /* in_ch */     [label = sq];
        S0  /* init */      -> S0  /* init */      [label = ""];
        S1  /* begin_cmt */ -> S4  /* in_slc */    [label = "'/'"];
        S1  /* begin_cmt */ -> S5  /* in_mlc */    [label = "'*'"];
        S1  /* begin_cmt */ -> S0  /* init */      [label = ""];
        S1  /* begin_cmt */ -> S1  /* begin_cmt */ [label = "'\\n'"]; // handle "/\n/" and "/\n*"
        S2  /* in_str */    -> S0  /* init */      [label = dq];
        S2  /* in_str */    -> S6  /* str_esc */   [label = "'\\'"];
        S2  /* in_str */    -> S2  /* in_str */    [label = ""];
        S3  /* in_ch */     -> S0  /* init */      [label = sq];
        S4  /* in_slc */    -> S4  /* in_slc */    [label = ""];
        S4  /* in_slc */    -> S0  /* init */      [label = "'\\n'"];
        S5  /* in_mlc */    -> S7  /* end_mlc */   [label = "'*'"];
        S5  /* in_mlc */    -> S5  /* in_mlc */    [label = ""];
        S7  /* end_mlc */   -> S7  /* end_mlc */   [label = "'*'|'\\n'"];
        S7  /* end_mlc */   -> S0  /* init */      [label = "'/'"];
        S7  /* end_mlc */   -> S5  /* in_mlc */    [label = ""];
        S6  /* str_esc */   -> S8  /* oct */       [label = "[0-3]"];
        S6  /* str_esc */   -> S9  /* hex */       [label = "'x'"];
        S6  /* str_esc */   -> S2  /* in_str */    [label = ""];
        S8  /* oct */       -> S10 /* o1 */        [label = "[0-7]"];
        S10 /* o1 */        -> S2  /* in_str */    [label = "[0-7]"];
        S9  /* hex */       -> S11 /* h1 */        [label = hex];
        S11 /* h1 */        -> S2  /* in_str */    [label = hex];
        S3  /* in_ch */     -> S12 /* ch_esc */    [label = "'\\'"];
        S3  /* in_ch */     -> S13 /* out_ch */    [label = ""];
        S13 /* out_ch */    -> S0  /* init */      [label = sq];
        S12 /* ch_esc */    -> S3  /* in_ch */     [label = sq];
        S12 /* ch_esc */    -> S12 /* ch_esc */    [label = ""];
    }
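As a rough illustration of what a table-driven version might look like, here is a sketch in C. It is not a transcription of the graph above: it uses a smaller, hypothetical set of states, treats every backslash escape as "skip one character" instead of tracking octal and hex digits, keeps string and character literals in the output, and ignores backslash-newline continuations. All names (`next_state`, `classify`, `strip_comments`) are made up for this example.

```c
#include <stddef.h>

enum state { CODE, SLASH, LINE, BLOCK, STAR, STR, STR_ESC, CHR, CHR_ESC, NSTATES };
enum cls   { C_SLASH, C_STAR, C_DQ, C_SQ, C_BSLASH, C_NL, C_OTHER, NCLASSES };

/* next_state[current state][input class]; one row per state. */
static const unsigned char next_state[NSTATES][NCLASSES] = {
    /*              '/'    '*'    '"'    '\''   '\\'     '\n'   other */
    /* CODE    */ { SLASH, CODE,  STR,   CHR,   CODE,    CODE,  CODE  },
    /* SLASH   */ { LINE,  BLOCK, STR,   CHR,   CODE,    CODE,  CODE  },
    /* LINE    */ { LINE,  LINE,  LINE,  LINE,  LINE,    CODE,  LINE  },
    /* BLOCK   */ { BLOCK, STAR,  BLOCK, BLOCK, BLOCK,   BLOCK, BLOCK },
    /* STAR    */ { CODE,  STAR,  BLOCK, BLOCK, BLOCK,   BLOCK, BLOCK },
    /* STR     */ { STR,   STR,   CODE,  STR,   STR_ESC, STR,   STR   },
    /* STR_ESC */ { STR,   STR,   STR,   STR,   STR,     STR,   STR   },
    /* CHR     */ { CHR,   CHR,   CHR,   CODE,  CHR_ESC, CHR,   CHR   },
    /* CHR_ESC */ { CHR,   CHR,   CHR,   CHR,   CHR,     CHR,   CHR   },
};

/* Map an input character to its column in the table. */
static enum cls classify(char c)
{
    switch (c) {
    case '/':  return C_SLASH;
    case '*':  return C_STAR;
    case '"':  return C_DQ;
    case '\'': return C_SQ;
    case '\\': return C_BSLASH;
    case '\n': return C_NL;
    default:   return C_OTHER;
    }
}

/* States whose characters belong in the output. */
static int visible(int s) { return s == CODE || s >= STR; }

/* `out` must have room for strlen(in) + 1 bytes. */
void strip_comments(const char *in, char *out)
{
    int s = CODE;
    size_t n = 0;

    for (; *in != '\0'; in++) {
        int t = next_state[s][classify(*in)];

        if (s == SLASH && t != LINE && t != BLOCK)
            out[n++] = '/';   /* buffered slash was a false start */
        if (visible(t) && (visible(s) || s == SLASH || s == LINE))
            out[n++] = *in;   /* LINE -> CODE keeps the newline itself */
        s = t;
    }
    if (s == SLASH)
        out[n++] = '/';       /* input ended on a buffered slash */
    out[n] = '\0';
}
```

The point of the table is that each transition is a single cell you can check against the diagram, and adding a state means adding a row rather than threading another case through nested conditionals. The only buffering needed is the one slash held in the `SLASH` state.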

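Since you only want to use two characters for the buffer and only one while loop, I would suggest a third char to track your state (whether you are skipping text or not). I've put together a test program for you, with inline comments explaining the logic: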

    // Program to strip comments and strings from a C file
    //
    // Build:
    //    gcc -o strip-comments strip-comments.c
    //
    // Test:
    //    ./strip-comments strip-comments.c

    #include <stdio.h>

    /* The following is a block of strings, and comments for testing
     * the code.
     */
    /* test if three comments *//* chained together */// will be removed.
    static int value = 128 /* test comment within valid code *// 2;
    const char * test1 = "This is a test of \" processing"; /* testing inline comment */
    const char * test2 = "this is a test of \n within strings."; // testing inline comment
    // this is the last test

    int strip_c_code(FILE * in, FILE * out)
    {
       int  buff[2];   // int, not char, so that the comparison with EOF is reliable
       char skipping;

       skipping = '\0';
       buff[0]  = '\0';
       buff[1]  = '\0';

       // loop through the file
       while((buff[0] = fgetc(in)) != EOF)
       {
          // checking for start of comment or string block
          if (!(skipping))
          {
             // start skipping in "//" comments
             if ((buff[1] == '/') && (buff[0] == '/'))
                skipping = '/';

             // start skipping in "/*" comments
             else if ((buff[1] == '/') && (buff[0] == '*'))
                skipping = '*';

             // start skipping at start of strings, but not character assignments
             else if ((buff[0] == '"') && (buff[1] != '\'') && (buff[1] != '\\'))
             {
                if ((buff[1]))   // nothing to flush if the buffer was just cleared
                   fputc(buff[1], out);
                skipping = '"';
             }

             // clear buffer so that processed characters are not interpreted as
             // end of skip characters.
             if ((skipping))
             {
                buff[0] = '\0';
                buff[1] = '\0';
             }
          }

          // check for characters which terminate skip block
          switch(skipping)
          {
             // if skipping "//" comments, look for new line
             case '/':
             if (buff[1] == '\n')
                skipping = '\0';
             break;

             // if skipping "/*" comments, look for "*/" terminating string
             case '*':
             if ((buff[1] == '*') && (buff[0] == '/'))
             {
                buff[0]  = '\0';
                buff[1]  = '\0';
                skipping = '\0';
             }
             break;

             // if skipping strings, look for terminating '"' character
             case '"':
             if ((buff[1] != '\\') && (buff[0] == '"'))
             {
                skipping = '\0';
                buff[0]  = '\0';
                buff[1]  = '\0';
                fprintf(out, "NULL"); // replace string with NULL
             }
             break;

             default:
             break;
          }

          // if not skipping, write character out
          if ( (!(skipping)) && ((buff[1])) )
             fputc(buff[1], out);

          // shift new character to old character position
          buff[1] = buff[0];
       }

       // verify that the comment or string was terminated properly
       if ((skipping))
       {
          fprintf(stderr, "Unterminated comment or string\n");
          return(-1);
       }

       // write last character, if one is still buffered
       if ((buff[1]))
          fputc(buff[1], out);

       return(0);
    }

    int main(int argc, char * argv[])
    {
       FILE * fs;

       if (argc != 2)
       {
          fprintf(stderr, "Usage: %s <file>\n", argv[0]);
          return(1);
       }

       if ((fs = fopen(argv[1], "r")) == NULL)
       {
          perror("fopen()");
          return(1);
       }

       strip_c_code(fs, stdout);

       fclose(fs);

       return(0);
    }

    /* end of source file */

I've also posted this code on Github to make it easier to download and compile:

https://gist.github.com/syzdek/5417109
