Parsing text in C

I have a file like this:

    ...
    words 13
    more words 21
    even more words 4
    ...

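(The general format is a non-numeric string, then a space, then any number of digits and a newline.)

I want to parse every line, putting the words into one field of a structure and the number into another. Right now I am using an ugly hack: I read the line while the characters are not digits, then read the rest. I believe there is a cleaner way.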

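Edit: You can use pNum - buf to get the length of the alphabetic part of the string, and use strncpy() to copy that part into another buffer. Be sure to add a '\0' to the end of the destination buffer. I would insert this code before the pNum++.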

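    int len = pNum - buf;          /* number of characters before the last space */
    strncpy(newBuf, buf, len);     /* copy exactly those characters */
    newBuf[len] = '\0';            /* strncpy does not terminate the copy here */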

You can read the entire line into a buffer and then use:

    char *pNum;
    if ((pNum = strrchr(buf, ' ')) != NULL)
    {
        pNum++;
    }

to get a pointer to the number field.
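As a rough illustration of how the strrchr() pointer and the copy step fit together, here is a minimal sketch. The struct `entry`, its 128-byte field, and the function name `split_line` are assumptions made for this example and are not part of the answer; strtol() is used so that a non-numeric last field can be detected.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type; the 128-byte word field is an assumed size. */
struct entry {
    char word[128];
    long number;
};

/* Split "some words 42" at the last space.
 * Returns 0 on success, -1 if there is no space, the text part does not
 * fit, or the last field is not a plain decimal number. */
int split_line(const char *buf, struct entry *out)
{
    const char *pNum = strrchr(buf, ' ');
    if (pNum == NULL)
        return -1;

    size_t len = (size_t)(pNum - buf);      /* length of the text part */
    if (len >= sizeof out->word)
        return -1;

    char *end;
    long n = strtol(pNum + 1, &end, 10);
    if (end == pNum + 1 || (*end != '\0' && *end != '\n'))
        return -1;                          /* nothing converted, or junk after the digits */

    memcpy(out->word, buf, len);
    out->word[len] = '\0';
    out->number = n;
    return 0;
}
```

Called on a line read with fgets(), such as "more words 21\n", this should leave "more words" in `word` and 21 in `number`.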

 fscanf(file, "%s %d", word, &value);

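This converts the values directly into a string and an integer, and copes with variations in whitespace, number formats, and so on.

Edit

Oops, I forgot that there are spaces between the words. In that case I would do the following. (Note that it truncates the original text in 'line'.)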
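A slightly more defensive form of the same call might look like the sketch below. The function name `read_pair` and the 128-byte buffer size are assumptions for illustration. The field width keeps %s from overrunning the buffer, and the return value of fscanf() tells you whether both fields were converted.

```c
#include <stdio.h>

/* Read one "word number" pair from fp.
 * Returns 1 on success, 0 on end of file or malformed input.
 * word must point to at least 128 bytes (an assumed size). */
int read_pair(FILE *fp, char *word, int *value)
{
    /* %127s stores at most 127 non-whitespace characters plus the NUL;
     * fscanf returns the number of fields it converted. */
    return fscanf(fp, "%127s %d", word, value) == 2;
}
```

Because %s stops at the first whitespace character, this only suits lines with a single word before the number.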

    // Scan to find the last space in the line
    char *p = line;
    char *lastSpace = NULL;
    while (*p != '\0')
    {
        if (*p == ' ')
            lastSpace = p;
        p++;
    }

    if (lastSpace == NULL)
        return("parse error");

    // Replace the last space in the line with a NUL
    *lastSpace = '\0';

    // Advance past the NUL to the first character of the number field
    lastSpace++;

    // 'line' now ends where the last space used to be
    char *word = line;
    int number = atoi(lastSpace);

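You could probably solve this with stdlib functions, but the above is likely to be more efficient because you only search for the character you are interested in.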

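You could try using strtok() to tokenize each line, then check whether each token is a number or a word (a fairly trivial check once you have the token string: just look at the first character of the token).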
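A minimal sketch of that idea follows, assuming the word tokens should be rejoined with single spaces. The function name `tokenize_line` and its parameters are made up for this example.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Tokenize line in place (strtok overwrites the delimiters with NULs).
 * Tokens starting with a digit are treated as the number; all other
 * tokens are appended to words, separated by single spaces.
 * Returns 1 if a number was found, 0 otherwise or if words is too small. */
int tokenize_line(char *line, char *words, size_t size, long *number)
{
    int found = 0;
    size_t used = 0;

    if (size == 0)
        return 0;
    words[0] = '\0';

    for (char *tok = strtok(line, " \t\n"); tok != NULL;
         tok = strtok(NULL, " \t\n")) {
        if (isdigit((unsigned char)tok[0])) {
            *number = strtol(tok, NULL, 10);
            found = 1;
        } else {
            size_t n = strlen(tok);
            size_t sep = used > 0 ? 1 : 0;
            if (used + sep + n >= size)
                return 0;                   /* words would not fit */
            if (sep)
                words[used++] = ' ';
            memcpy(words + used, tok, n + 1);   /* copy the token and its NUL */
            used += n;
        }
    }
    return found;
}
```

Note that the first-character test would misclassify a word such as "3rd", which is acceptable only if the words really contain no digits, as the question states.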

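Assuming the number is immediately followed by '\n', you can read each line into a char buffer, use sscanf("%d") on the entire line to get the number, and then calculate the number of characters that this number takes up at the end of the text string.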
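One way to read this suggestion is sketched below. Since sscanf() with "%d" converts nothing when the line begins with words, this version first walks backwards over the trailing digits and then applies sscanf() at that position. The function name `trailing_number` is an assumption, and the number is assumed to fit in an int.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Find the run of digits at the end of line (ignoring one trailing '\n').
 * Stores the number in *value and returns the offset where the digits
 * start, or -1 if the line does not end in digits. */
long trailing_number(const char *line, int *value)
{
    size_t end = strlen(line);
    if (end > 0 && line[end - 1] == '\n')
        end--;                              /* ignore the newline */

    size_t start = end;
    while (start > 0 && isdigit((unsigned char)line[start - 1]))
        start--;                            /* walk back over the digits */

    if (start == end)
        return -1;                          /* no digits at the end */
    if (sscanf(line + start, "%d", value) != 1)
        return -1;
    return (long)start;
}
```

The number occupies `end - start` characters, so under the stated format the text part is everything before offset `start - 1`, the position of the separating space.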

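Depending on how complex your strings become, you may want to use the PCRE library. At least that way you can compile a Perl-style regular expression to split your lines. It may be overkill, though.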
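To give a feel for the regular-expression route without pulling in an external library, the sketch below uses the POSIX <regex.h> functions rather than PCRE; the pattern and the function name `regex_split` are assumptions for illustration. A PCRE version would follow the same compile, match, extract shape with that library's own calls.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Split line into its text part and trailing number with a POSIX
 * extended regular expression. Returns 1 on a match, 0 otherwise. */
int regex_split(const char *line, char *word, size_t wordSize, long *number)
{
    regex_t re;
    regmatch_t m[3];
    int ok = 0;

    /* Group 1: everything up to the last space; group 2: the digits.
     * An optional newline is allowed before the end of the string. */
    if (regcomp(&re, "^(.*) ([0-9]+)\n?$", REG_EXTENDED) != 0)
        return 0;

    if (regexec(&re, line, 3, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len < wordSize) {
            memcpy(word, line + m[1].rm_so, len);
            word[len] = '\0';
            *number = strtol(line + m[2].rm_so, NULL, 10);
            ok = 1;
        }
    }
    regfree(&re);
    return ok;
}
```

In real code you would compile the expression once and reuse it for every line instead of recompiling on each call.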

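Given the description, here's what I'd do: read each line as a single string using fgets() (making sure the target buffer is large enough), then split the line using strtok(). To determine whether each token is a word or a number, I'd use strtol() to attempt the conversion and check the error condition. Example: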

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Limits are illustrative values; the original left them undefined. */
    #define MAX_LINE_LENGTH 256
    #define MAX_NUM_STRINGS 16
    #define MAX_STR_SIZE    64

    /**
     * Read the next line from the file, splitting the tokens into
     * multiple strings and a single integer. Assumes input lines
     * never exceed MAX_LINE_LENGTH and each individual string never
     * exceeds MAX_STR_SIZE. Otherwise things get a little more
     * interesting. Also assumes that the integer is the last
     * thing on each line.
     */
    int getNextLine(FILE *in, char (*strs)[MAX_STR_SIZE], int *numStrings, int *value)
    {
        char buffer[MAX_LINE_LENGTH];
        int rval = 1;

        if (fgets(buffer, sizeof buffer, in))
        {
            char *token = strtok(buffer, " ");
            *numStrings = 0;
            while (token)
            {
                char *chk;
                /* The last token is the number, so *value ends up holding it */
                *value = (int) strtol(token, &chk, 10);
                if (*chk != 0 && *chk != '\n')
                {
                    /* Conversion stopped early: the token is a word */
                    strcpy(strs[(*numStrings)++], token);
                }
                token = strtok(NULL, " ");
            }
        }
        else
        {
            /**
             * fgets() hit either EOF or error; either way return 0
             */
            rval = 0;
        }
        return rval;
    }

    /**
     * sample main
     */
    int main(void)
    {
        FILE *input;
        char strings[MAX_NUM_STRINGS][MAX_STR_SIZE];
        int numStrings;
        int value;

        input = fopen("datafile.txt", "r");
        if (input)
        {
            while (getNextLine(input, strings, &numStrings, &value))
            {
                /**
                 * Do something with strings and value here
                 */
            }
            fclose(input);
        }
        return 0;
    }

Given the description, I think I'd use a variant of this (now tested) C99 code:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct word_number
    {
        char word[128];
        long number;
    };

    int read_word_number(FILE *fp, struct word_number *wnp)
    {
        char buffer[140];
        if (fgets(buffer, sizeof(buffer), fp) == 0)
            return EOF;
        size_t len = strlen(buffer);
        if (len == 0 || buffer[len-1] != '\n')  // Error if line too long to fit
            return EOF;
        buffer[--len] = '\0';
        if (len == 0)  // Empty line
            return EOF;
        char *num = &buffer[len-1];
        while (num > buffer && !isspace((unsigned char)*num))
            num--;
        if (num == buffer)  // No space in input data
            return EOF;
        char *end;
        wnp->number = strtol(num+1, &end, 0);
        if (end == num+1 || *end != '\0')  // Invalid number as last word on line
            return EOF;
        *num = '\0';
        if ((size_t)(num - buffer) >= sizeof(wnp->word))  // Non-number part too long
            return EOF;
        memcpy(wnp->word, buffer, num - buffer + 1);  // Copy the word and its terminating NUL
        return(0);
    }

    int main(void)
    {
        struct word_number wn;
        while (read_word_number(stdin, &wn) != EOF)
            printf("Word <<%s>> Number %ld\n", wn.word, wn.number);
        return(0);
    }

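You could improve the error reporting by returning different values for different problems. You could use dynamically allocated memory for the word portion of the lines. You could handle lines longer than I allow. You could scan backwards over digits instead of non-spaces, but the current approach lets the user write "abc 0x123" and have the hex value handled correctly. You might prefer to ensure there are no digits in the word part; this code does not care.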
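As one possible shape for the first of those improvements, the sketch below returns a distinct status for each failure. The enum `parse_status`, its member names, and the function `parse_word_number` are hypothetical; the function works on a line already read into memory and keeps base 0 in strtol() so that "abc 0x123" still parses as hex.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical status codes, one per failure mode. */
enum parse_status {
    PARSE_OK,
    PARSE_NO_SPACE,
    PARSE_BAD_NUMBER,
    PARSE_WORD_TOO_LONG
};

enum parse_status parse_word_number(const char *line, char *word,
                                    size_t wordSize, long *number)
{
    size_t len = strlen(line);
    if (len > 0 && line[len - 1] == '\n')
        len--;                              /* ignore the trailing newline */

    /* Scan backwards; sp ends up one past the last space, or 0 if none. */
    size_t sp = len;
    while (sp > 0 && line[sp - 1] != ' ')
        sp--;
    if (sp == 0)
        return PARSE_NO_SPACE;

    char *end;
    long n = strtol(line + sp, &end, 0);    /* base 0 accepts 0x... and 0... */
    if (end == line + sp || end != line + len)
        return PARSE_BAD_NUMBER;            /* nothing converted, or trailing junk */

    if (sp - 1 >= wordSize)
        return PARSE_WORD_TOO_LONG;

    memcpy(word, line, sp - 1);             /* text before the last space */
    word[sp - 1] = '\0';
    *number = n;
    return PARSE_OK;
}
```

A caller can then switch on the result to print a specific message for each case instead of treating every failure as end of input.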
