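Is it possible to "force" UTF-8 in a C program?

Usually, when I want my program to use UTF-8 encoding, I write setlocale (LC_ALL, "");. But today I found out that this only sets the locale to the environment's default locale, and I cannot know whether the environment uses UTF-8 by default.

Is there any way to force the character encoding to be UTF-8? Also, is there any way to check whether my program is using UTF-8?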

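It is possible, but it is the completely wrong thing to do.

First of all, the current locale is for the user to decide. It is not just the character set, but also the language, date and time formats, and so on. Your program has absolutely no "right" to mess with it.

If you cannot localize your program, just tell the user what environmental requirements your program has, and let them worry about it.

Really, you should not rely on UTF-8 being the current encoding, but use wide character support instead, including functions like wctype(). POSIXy systems also provide the iconv_open() and iconv() family of functions in their C libraries to convert between encodings (which should always include conversion to and from wchar_t); on Windows, you need a separate libiconv library. This is how, for example, the GCC compiler handles different character sets. (Internally, it uses Unicode/UTF-8, but if you ask it to, it can do the necessary conversions to work with other character sets.)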
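To make the iconv_open() and iconv() mention a bit more concrete, here is a minimal sketch of a one-shot conversion between two named encodings. The helper name convert_encoding() and its buffer convention are my own assumptions for illustration, and which encoding names are accepted depends on the iconv implementation on your system.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any library): converts the
 * NUL-terminated string 'src' from encoding 'from' to encoding 'to',
 * storing the NUL-terminated result in 'dst' (room for 'dstsize' bytes).
 * Returns the number of bytes stored (excluding the NUL),
 * or (size_t)-1 if the conversion fails or does not fit. */
size_t convert_encoding(const char *from, const char *to,
                        const char *src, char *dst, size_t dstsize)
{
    iconv_t cd;
    char   *in  = (char *)src;  /* iconv() takes char **, but does not modify the input */
    char   *out = dst;
    size_t  inleft, outleft, rc;

    if (!from || !to || !src || !dst || dstsize < 1)
        return (size_t)-1;

    inleft  = strlen(src);
    outleft = dstsize - 1;      /* reserve room for the terminating NUL */

    /* Note the argument order: target encoding first. */
    cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &in, &inleft, &out, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &out, &outleft); /* flush any shift state */

    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;

    *out = '\0';
    return (size_t)(out - dst);
}
```

For example, convert_encoding("ISO-8859-1", "UTF-8", input, buffer, sizeof buffer) should turn Latin-1 input into UTF-8, assuming both names are known to your iconv. On some systems you need to link with -liconv.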

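I am personally a strong proponent of using UTF-8 everywhere, but overriding the user's locale in a program is horrific. Abhorrent. Distasteful; like a desktop applet changing the display resolution because the programmer is particularly fond of a certain one.

I would be happy to write some example code to show how to correctly handle any situation where the character set matters, but there are so many that I do not know where to start.

If the OP amends their question to state exactly what problem overriding the character set is supposed to solve, I am willing to show how to use the aforementioned utilities and POSIX facilities (or equivalent freely available libraries on Windows) to solve it correctly.

If this seems harsh to someone, it is, but only because taking the easy route here (overriding the user's locale setting) is wrong, purely on technical grounds. Even doing nothing is better, and actually quite acceptable, as long as you document that your application only handles UTF-8 input and output.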
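As for checking whether UTF-8 is in use without overriding anything, one option on POSIXy systems is to ask for the character set of the current locale with nl_langinfo(CODESET). The following is only a sketch: the helper names are hypothetical, and the exact spelling of the codeset name ("UTF-8", "utf8", and so on) varies between systems, which is why the comparison ignores case and dashes.

```c
#include <ctype.h>
#include <langinfo.h>
#include <stddef.h>

/* Hypothetical helper: returns nonzero if 'name' spells UTF-8,
 * ignoring letter case and dashes ("UTF-8", "utf8", "Utf-8", ...). */
int codeset_is_utf8(const char *name)
{
    static const char want[] = "utf8";
    size_t i = 0;

    if (!name)
        return 0;

    for (; *name != '\0'; name++) {
        if (*name == '-')
            continue;   /* ignore dashes */
        if (want[i] == '\0' || tolower((unsigned char)*name) != want[i])
            return 0;   /* too long, or a mismatching character */
        i++;
    }

    return want[i] == '\0';
}

/* Hypothetical helper: returns nonzero if the locale currently in
 * effect uses UTF-8. Call setlocale(LC_ALL, "") first, so that the
 * user's locale (and not the default "C" locale) is the one checked. */
int current_locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A program that only handles UTF-8 could call current_locale_is_utf8() right after setlocale(LC_ALL, "") and print a warning to standard error if it returns zero, leaving the locale itself alone.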


Example 1. Localized Happy New Year!

    #include <stdlib.h>
    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* We wish to use the user's current locale. */
        setlocale(LC_ALL, "");

        /* We intend to use wide functions on standard output. */
        fwide(stdout, 1);

        /* For Windows compatibility, print out a Byte Order Mark.
         * If you save the output to a file, this helps tell Windows
         * applications that the file is Unicode.
         * Other systems don't need it nor use it.
        */
        fputwc(L'\uFEFF', stdout);

        wprintf(L"Happy New Year!\n");
        wprintf(L"С новым годом!\n");
        wprintf(L"新年好!\n");
        wprintf(L"賀正!\n");
        wprintf(L"¡Feliz año nuevo!\n");
        wprintf(L"Hyvää uutta vuotta!\n");

        return EXIT_SUCCESS;
    }

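Note that wprintf() takes a wide string (wide string constants are of the form L"", wide character constants L'', as opposed to the normal/narrow counterparts "" and ''). The formats are still the same; %s prints a normal/narrow string, and %ls a wide string.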
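To illustrate the %s versus %ls distinction in isolation, here is a small sketch that formats one narrow and one wide string into a wide buffer using swprintf(). The function name describe() is made up for this example.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: formats a narrow string and a wide string into
 * the wide buffer 'buf' (room for 'n' wide characters).
 * %s consumes a narrow (char *) argument, %ls a wide (wchar_t *) one.
 * Returns the number of wide characters written, or negative on error. */
int describe(wchar_t *buf, size_t n, const char *narrow, const wchar_t *wide)
{
    return swprintf(buf, n, L"narrow=%s wide=%ls", narrow, wide);
}
```

The narrow argument is converted using the current locale, so for anything beyond plain ASCII you want setlocale(LC_ALL, "") to have been called first.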


Example 2. Read input lines from standard input, and optionally save them to a file. The file name is supplied on the command line.

    #include <stdlib.h>
    #include <locale.h>
    #include <string.h>
    #include <stdio.h>
    #include <wchar.h>
    #include <wctype.h>
    #include <errno.h>

    typedef enum {
        TRIM_LEFT     = 1,  /* Remove leading whitespace and control characters */
        TRIM_RIGHT    = 2,  /* Remove trailing whitespace and control characters */
        TRIM_NEWLINE  = 4,  /* Remove newline at end of line */
        TRIM          = 7,  /* Remove leading and trailing whitespace and control characters */
        OMIT_NUL      = 8,  /* Skip NUL characters (embedded binary zeros, L'\0') */
        OMIT_CONTROLS = 16, /* Skip control characters */
        CLEANUP       = 31, /* All of the above. */
        COMBINE_LWS   = 32, /* Combine all whitespace into a single space */
    } trim_opts;

    /* Read an unlimited-length line from a wide input stream.
     *
     * This function takes a pointer to a wide string pointer,
     * pointer to the number of wide characters dynamically allocated for it,
     * the stream to read from, and a set of options on how to treat the line.
     *
     * If an error occurs, this will return 0 with errno set to nonzero error number.
     * Use strerror(errno) to obtain the error description (as a narrow string).
     *
     * If there is no more data to read from the stream,
     * this will return 0 with errno 0, and feof(stream) will return true.
     *
     * If an empty line is read,
     * this will return 0 with errno 0, but feof(stream) will return false.
     *
     * Typically, you initialize variables like
     *     wchar_t *line = NULL;
     *     size_t   size = 0;
     * before calling this function, so that subsequent calls reuse the same,
     * dynamically allocated buffer for the line, and it is automatically grown
     * if necessary. There are no built-in limits to line lengths this way.
    */
    size_t getwline(wchar_t **const lineptr, size_t *const sizeptr,
                    FILE *const in, trim_opts const trimming)
    {
        wchar_t *line;
        size_t   size;
        size_t   used = 0;
        wint_t   wc;
        fpos_t   startpos;
        int      seekable;
        int      at_eol;

        if (lineptr == NULL || sizeptr == NULL || in == NULL) {
            errno = EINVAL;
            return 0;
        }

        if (*lineptr != NULL) {
            line = *lineptr;
            size = *sizeptr;
        } else {
            line = NULL;
            size = 0;
            *sizeptr = 0;
        }

        /* In error cases, we can try and get back to this position
         * in the input stream, as we cannot really return the data
         * read thus far. However, some streams like pipes are not seekable,
         * so in those cases we should not even try.
         * Use (seekable) as a flag to remember if we should try.
        */
        seekable = (fgetpos(in, &startpos) == 0);

        while (1) {

            /* When we read a wide character from a wide stream,
             * fgetwc() will return WEOF with errno set if an error occurs.
             * However, fgetwc() will return WEOF with errno *unchanged*
             * if there is no more input in the stream.
             * To detect which of the two happened, we need to clear errno
             * first.
            */
            errno = 0;
            wc = fgetwc(in);
            if (wc == WEOF) {
                if (errno) {
                    const int saved_errno = errno;
                    if (seekable)
                        fsetpos(in, &startpos);
                    errno = saved_errno;
                    return 0;
                }
                if (ferror(in)) {
                    if (seekable)
                        fsetpos(in, &startpos);
                    errno = EIO;
                    return 0;
                }
                break;
            }

            /* Remember whether this character ends the line,
             * because COMBINE_LWS may replace it with a space below. */
            at_eol = (wc == L'\n');

            /* Dynamically grow line buffer if necessary.
             * We need room for the current wide character,
             * plus at least the end-of-string mark, L'\0'.
            */
            if (used + 2 > size) {
                /* Size policy. This can be anything you see fit,
                 * as long as it yields size >= used + 2.
                 *
                 * This one increments size to next multiple of
                 * 1024 (minus 16). It works well in practice,
                 * but do not think of it as the "best" way.
                 * It is just a robust choice.
                */
                size = (used | 1023) + 1009;
                line = realloc(line, size * sizeof line[0]);
                if (!line) {
                    /* Memory allocation failed. */
                    if (seekable)
                        fsetpos(in, &startpos);
                    errno = ENOMEM;
                    return 0;
                }
                *lineptr = line;
                *sizeptr = size;
            }

            /* Append character to buffer. */
            if (!trimming)
                line[used++] = wc;
            else {
                /* Check if we have reasons to NOT add the character to buffer. */
                do {

                    /* Omit NUL if asked to. */
                    if (trimming & OMIT_NUL)
                        if (wc == L'\0')
                            break;

                    /* Omit controls if asked to. */
                    if (trimming & OMIT_CONTROLS)
                        if (iswcntrl(wc))
                            break;

                    /* If we are at start of line, and we are left-trimming,
                     * only graphs (printable non-whitespace characters) are added. */
                    if (used == 0 && (trimming & TRIM_LEFT))
                        if (wc == L'\0' || !iswgraph(wc))
                            break;

                    /* Combine whitespaces if asked to. */
                    if (trimming & COMBINE_LWS)
                        if (iswspace(wc)) {
                            if (used > 0 && line[used-1] == L' ')
                                break;
                            else
                                wc = L' ';
                        }

                    /* Okay, add the character to buffer. */
                    line[used++] = wc;

                } while (0);
            }

            /* End of the line? */
            if (at_eol)
                break;
        }

        /* The above loop will only end (break out)
         * if end of line or end of input was found,
         * and no error occurred.
        */

        /* Trim right if asked to. */
        if (trimming & TRIM_RIGHT)
            while (used > 0 && iswspace(line[used-1]))
                --used;
        else
        if (trimming & TRIM_NEWLINE)
            while (used > 0 && (line[used-1] == L'\r' || line[used-1] == L'\n'))
                --used;

        /* Ensure we have room for end-of-string L'\0'. */
        if (used >= size) {
            size = used + 1;
            line = realloc(line, size * sizeof line[0]);
            if (!line) {
                if (seekable)
                    fsetpos(in, &startpos);
                errno = ENOMEM;
                return 0;
            }
            *lineptr = line;
            *sizeptr = size;
        }

        /* Add end of string mark. */
        line[used] = L'\0';

        /* Successful return. */
        errno = 0;
        return used;
    }

    /* Counts the number of wide characters in 'alpha' class. */
    size_t count_letters(const wchar_t *ws)
    {
        size_t count = 0;
        if (ws)
            while (*ws != L'\0')
                if (iswalpha(*(ws++)))
                    count++;
        return count;
    }

    int main(int argc, char *argv[])
    {
        FILE    *out;

        wchar_t *line = NULL;
        size_t   size = 0;
        size_t   len;

        setlocale(LC_ALL, "");

        /* Standard input and output should use wide characters. */
        fwide(stdin, 1);
        fwide(stdout, 1);

        /* Check if the user asked for help. */
        if (argc < 2 || argc > 3 ||
            strcmp(argv[1], "-h") == 0 ||
            strcmp(argv[1], "--help") == 0 ||
            strcmp(argv[1], "/?") == 0) {
            fprintf(stderr, "\n");
            fprintf(stderr, "Usage: %s [ -h | --help | /? ]\n", argv[0]);
            fprintf(stderr, "       %s FILENAME [ PROMPT ]\n", argv[0]);
            fprintf(stderr, "\n");
            fprintf(stderr, "The program will read input lines until an only '.' is supplied.\n");
            fprintf(stderr, "If you do not want to save the output to a file,\n");
            fprintf(stderr, "use '-' as the FILENAME.\n");
            fprintf(stderr, "\n");
            return EXIT_SUCCESS;
        }

        /* Open file for output, unless it is "-". */
        if (strcmp(argv[1], "-") == 0)
            out = NULL; /* No output to file */
        else {
            out = fopen(argv[1], "w");
            if (out == NULL) {
                fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
                return EXIT_FAILURE;
            }

            /* The output file is used with wide strings. */
            fwide(out, 1);
        }

        while (1) {

            /* Prompt? Note: our prompt string is narrow, but stdout is wide. */
            if (argc > 2) {
                wprintf(L"%s\n", argv[2]);
                fflush(stdout);
            }

            len = getwline(&line, &size, stdin, CLEANUP);
            if (len == 0) {
                if (errno) {
                    fprintf(stderr, "Error reading standard input: %s.\n", strerror(errno));
                    break;
                }
                if (feof(stdin))
                    break;
            }

            /* The user does not wish to supply more lines? */
            if (wcscmp(line, L".") == 0)
                break;

            /* Print the line to the file. */
            if (out != NULL) {
                fputws(line, out);
                fputwc(L'\n', out);
            }

            /* Tell the user what we read. */
            wprintf(L"Received %lu wide characters, %lu of which were letterlike.\n",
                    (unsigned long)len, (unsigned long)count_letters(line));
            fflush(stdout);
        }

        /* The line buffer is no longer needed, so we can discard it.
         * Note that free(NULL) is safe, so we do not need to check.
        */
        free(line);

        /* I personally also like to reset the variables.
         * It helps with debugging, and to avoid reuse-after-free() errors. */
        line = NULL;
        size = 0;

        /* Close the output file, and report if saving it failed. */
        if (out != NULL && fclose(out) != 0) {
            fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
            return EXIT_FAILURE;
        }

        return EXIT_SUCCESS;
    }

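The getwline() function above is pretty much at the most complicated end of the functions you might need when dealing with localized wide character support. It lets you read localized input lines without length restrictions, and optionally trims and cleans up the returned string (removing control codes and embedded binary zeros). It also works fine with both LF and CR-LF (\n and \r\n) newline encodings.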
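If all you need is the newline handling, the idea can be shown in a few lines. This is a simplified sketch, not a replacement for getwline(); the helper name strip_newline() is made up here, and it operates on a wide string you already have in memory.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: removes any trailing L'\r' and L'\n' characters
 * from the wide string 'ws' in place, so both LF and CR-LF line endings
 * are handled. Returns the new length in wide characters. */
size_t strip_newline(wchar_t *ws)
{
    size_t len;

    if (!ws)
        return 0;

    len = wcslen(ws);
    while (len > 0 && (ws[len - 1] == L'\n' || ws[len - 1] == L'\r'))
        len--;

    ws[len] = L'\0';
    return len;
}
```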

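Try:

    setlocale(LC_ALL, "en_US.UTF-8");

You can run locale -a in the terminal to get the full list of locales supported by your system ("en_US.UTF-8" should be supported by most, if not all, systems that support UTF-8).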

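EDIT 1 (alternative spelling)

In the comments, Lee pointed out that some systems have an alternative spelling, "en_US.utf8" (which surprised me, but we learn new stuff every day).

Since setlocale returns NULL when it fails, you can chain these calls:

    if(!setlocale(LC_ALL, "en_US.UTF-8") && !setlocale(LC_ALL, "en_US.utf8"))
        printf("failed to set locale to UTF-8");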
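If the list of candidate spellings grows, the chained calls can be turned into a loop. The following is just a sketch of that idea; the function name first_available_locale() is hypothetical, and which names actually succeed depends entirely on the locales installed on the system.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: tries each locale name in the NULL-terminated
 * array 'names' in order, and stops at the first one setlocale() accepts.
 * Returns the string setlocale() returned for that name,
 * or NULL if none of the names could be set. */
const char *first_available_locale(int category, const char *const names[])
{
    size_t i;

    if (!names)
        return NULL;

    for (i = 0; names[i] != NULL; i++) {
        const char *result = setlocale(category, names[i]);
        if (result != NULL)
            return result;
    }

    return NULL;
}
```

With the two spellings above, you would pass an array containing "en_US.UTF-8", "en_US.utf8", and a terminating NULL, and treat a NULL return value as "failed to set locale to UTF-8".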

EDIT 2 (finding out if we're using UTF-8)

To find out whether the locale is set to UTF-8 (after attempting to set it), you can either check the returned value (NULL means the call failed) or check the locale in use.

Option 1:

    char * result;
    if((result = setlocale (LC_ALL, "en_US.UTF-8")) == NULL)
        printf("failed to set locale to UTF-8");

Option 2:

    setlocale (LC_ALL, "en_US.UTF-8"); // set
    char * result = setlocale (LC_ALL, NULL); // review
    if(result == NULL || !strstr(result, "UTF-8"))
        printf("failed to set locale to UTF-8");

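This is not an answer, but a third, quite complex example of how to use wide character I/O. It was too long to add to my actual answer to this question.

This example shows how to read and process CSV files (RFC-4180 format, optionally with limited backslash escape support) using wide strings.

The following code is CC0/public domain, so you are free to use it any way you wish, even in your own proprietary projects, but if it breaks anything, you get to keep all the pieces and not complain to me. (I'll be happy to include any bug fixes if you find and report them in a comment below, though.)

The logic of the code is robust, however. In particular, it supports universal newlines, meaning all four common newline types: Unix-like LF (\n), old CR LF (\r\n), old Mac CR (\r), and the occasionally encountered weird LF CR (\n\r). There are no built-in limits on the length of a field, the number of fields in a record, or the number of records in a file. It works very nicely if you need to convert CSV or process CSV input as a stream (field by field, or record by record), without having to have more than one field in memory at any point. If you want to construct structures that describe the records and fields in memory, you'll need to add some scaffolding code for that.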
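The universal newline rule itself is small enough to show on its own. The sketch below applies it to a wide string already in memory rather than to a stream; the helper name newline_length() is made up for this illustration. A CR and an LF in either order count as one newline, while two identical characters in a row count as two.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: returns the number of wide characters
 * (0, 1, or 2) in the newline sequence at the start of 'ws'.
 *   LF    -> 1      CR    -> 1
 *   LF CR -> 2      CR LF -> 2
 * Anything else is not a newline, and yields 0. */
size_t newline_length(const wchar_t *ws)
{
    if (!ws)
        return 0;

    if (ws[0] == L'\n')
        return (ws[1] == L'\r') ? 2 : 1;

    if (ws[0] == L'\r')
        return (ws[1] == L'\n') ? 2 : 1;

    return 0;
}
```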

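Because of the universal newline support, when reading input interactively, this program might require two consecutive end-of-inputs (Ctrl+Z in Windows and MS-DOS, Ctrl+D everywhere else), because the first one is usually "consumed" by the csv_next_field() or csv_skip_field() function, and the csv_next_record() function needs to read it again to actually detect it. However, you do not normally ask the user to type in CSV data interactively, so this should be an acceptable quirk.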

    #include <stdlib.h>
    #include <locale.h>
    #include <string.h>
    #include <stdio.h>
    #include <wchar.h>
    #include <wctype.h>
    #include <errno.h>

    /* RFC-4180 -format CSV file processing using wide input streams.
     *
     * #define BACKSLASH_ESCAPES if you additionally wish to have
     * \\, \a, \b, \t, \n, \v, \f, \r, \", and \, de-escaped to their
     * C string equivalents when reading CSV fields.
    */

    typedef enum {
        CSV_OK                 =  0,
        CSV_END                =  1,
        CSV_INVALID_PARAMETERS = -1,
        CSV_FORMAT_ERROR       = -2,
        CSV_CHARSET_ERROR      = -3,
        CSV_READ_ERROR         = -4,
        CSV_OUT_OF_MEMORY      = -5,
    } csv_status;

    const char *csv_error(const csv_status code)
    {
        switch (code) {
        case CSV_OK:                 return "No error";
        case CSV_END:                return "At end";
        case CSV_INVALID_PARAMETERS: return "Invalid parameters";
        case CSV_FORMAT_ERROR:       return "Bad CSV format";
        case CSV_CHARSET_ERROR:      return "Illegal character in CSV file (incorrect locale?)";
        case CSV_READ_ERROR:         return "Read error";
        case CSV_OUT_OF_MEMORY:      return "Out of memory";
        default:                     return "Unknown csv_status code";
        }
    }

    /* Start the next record. Automatically skips any remaining fields in current record.
     * Returns CSV_OK if successful, CSV_END if no more records, or a negative CSV_ error code. */
    csv_status csv_next_record(FILE *const in);

    /* Skip the next field. Returns CSV_OK if successful, CSV_END if no more fields in current record,
     * or a negative CSV_ error code. */
    csv_status csv_skip_field(FILE *const in);

    /* Read the next field. Returns CSV_OK if successful, CSV_END if no more fields in current record,
     * or a negative CSV_ error code.
     * If this returns CSV_OK, then *dataptr is a dynamically allocated wide string to the field
     * contents, space allocated for *sizeptr wide characters; and if lengthptr is not NULL, then
     * *lengthptr is the number of wide characters in said wide string.
    */
    csv_status csv_next_field(FILE *const in, wchar_t **const dataptr,
                              size_t *const sizeptr, size_t *const lengthptr);

    static csv_status internal_skip_quoted(FILE *const in)
    {
        while (1) {
            wint_t wc;

            errno = 0;
            wc = fgetwc(in);
            if (wc == WEOF) {
                if (errno == EILSEQ) return CSV_CHARSET_ERROR;
                if (errno) return CSV_READ_ERROR;
                if (ferror(in)) { errno = EIO; return CSV_READ_ERROR; }
                errno = 0;
                return CSV_FORMAT_ERROR;
            }

            if (wc == L'"') {
                errno = 0;
                wc = fgetwc(in);
                if (wc == L'"')
                    continue;

                while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc)) {
                    errno = 0;
                    wc = fgetwc(in);
                }

                if (wc == WEOF) {
                    if (errno == EILSEQ) return CSV_CHARSET_ERROR;
                    if (errno) return CSV_READ_ERROR;
                    if (ferror(in)) { errno = EIO; return CSV_READ_ERROR; }
                    errno = 0;
                    return CSV_END;
                }
                if (wc == L',') {
                    errno = 0;
                    return CSV_OK;
                }
                if (wc == L'\n' || wc == L'\r') {
                    ungetwc(wc, in);
                    errno = 0;
                    return CSV_END;
                }

                ungetwc(wc, in);
                errno = 0;
                return CSV_FORMAT_ERROR;
            }

    #ifdef BACKSLASH_ESCAPES
            if (wc == L'\\') {
                errno = 0;
                wc = fgetwc(in);
                if (wc == L'"')
                    continue;
                if (wc == WEOF) {
                    if (errno == EILSEQ) return CSV_CHARSET_ERROR;
                    if (errno) return CSV_READ_ERROR;
                    if (ferror(in)) { errno = EIO; return CSV_READ_ERROR; }
                    errno = 0;
                    return CSV_END;
                }
            }
    #endif
        }
    }

    static csv_status internal_skip_unquoted(FILE *const in, wint_t wc)
    {
        while (1) {
            if (wc == WEOF) {
                if (errno == EILSEQ) return CSV_CHARSET_ERROR;
                if (errno) return CSV_READ_ERROR;
                if (ferror(in)) { errno = EIO; return CSV_READ_ERROR; }
                errno = 0;
                return CSV_END;
            }
            if (wc == L',') {
                errno = 0;
                return CSV_OK;
            }
            if (wc == L'\n' || wc == L'\r') {
                ungetwc(wc, in);
                errno = 0;
                return CSV_END;
            }

    #ifdef BACKSLASH_ESCAPES
            if (wc == L'\\') {
                errno = 0;
                wc = fgetwc(in);
                if (wc == WEOF) {
                    if (errno == EILSEQ) return CSV_CHARSET_ERROR;
                    if (errno) return CSV_READ_ERROR;
                    if (ferror(in)) { errno = EIO; return CSV_READ_ERROR; }
                    errno = 0;
                    return CSV_END;
                }
            }
    #endif

            errno = 0;
            wc = fgetwc(in);
        }
    }

    csv_status csv_next_record(FILE *const in)
    {
        while (1) {
            wint_t     wc;
            csv_status status;

            /* Skip leading whitespace, but not newlines. */
            do {
                errno = 0;
                wc = fgetwc(in);
            } while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc));

            if (wc == WEOF) {
                if (errno == EILSEQ) return CSV_CHARSET_ERROR;
                if (errno) return CSV_READ_ERROR;
                if (ferror(in)) { errno = EIO; return CSV_READ_ERROR; }
                errno = 0;
                return CSV_END;
            }

            if (wc == L'\n' || wc == L'\r') {
                wint_t next_wc;

                errno = 0;
                next_wc = fgetwc(in);
                if (next_wc == WEOF) {
                    if (errno == EILSEQ) return CSV_CHARSET_ERROR;
                    if (errno) return CSV_READ_ERROR;
                    if (ferror(in)) { errno = EIO; return CSV_READ_ERROR; }
                    errno = 0;
                    return CSV_END;
                }

                /* CR LF and LF CR are each a single newline. */
                if ((wc == L'\n' && next_wc == L'\r') ||
                    (wc == L'\r' && next_wc == L'\n')) {
                    errno = 0;
                    return CSV_OK;
                }

                ungetwc(next_wc, in);
                errno = 0;
                return CSV_OK;
            }

            /* Skip the rest of this field. */
            if (wc == L'"')
                status = internal_skip_quoted(in);
            else
                status = internal_skip_unquoted(in, wc);

            if (status < 0)
                return status;
        }
    }

    csv_status csv_skip_field(FILE *const in)
    {
        wint_t wc;

        if (!in) {
            errno = EINVAL;
            return CSV_INVALID_PARAMETERS;
        } else if (ferror(in)) {
            errno = EIO;
            return CSV_READ_ERROR;
        }

        /* Skip leading whitespace. */
        do {
            errno = 0;
            wc = fgetwc(in);
        } while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc));

        if (wc == L'"')
            return internal_skip_quoted(in);
        else
            return internal_skip_unquoted(in, wc);
    }

    csv_status csv_next_field(FILE *const in, wchar_t **const dataptr,
                              size_t *const sizeptr, size_t *const lengthptr)
    {
        wchar_t *data;
        size_t   size;
        size_t   used = 0; /* length */
        wint_t   wc;

        if (lengthptr)
            *lengthptr = 0;

        if (!in || !dataptr || !sizeptr) {
            errno = EINVAL;
            return CSV_INVALID_PARAMETERS;
        } else if (ferror(in)) {
            errno = EIO;
            return CSV_READ_ERROR;
        }

        if (*dataptr) {
            data = *dataptr;
            size = *sizeptr;
        } else {
            data = NULL;
            size = 0;
            *sizeptr = 0;
        }

        /* Skip leading whitespace. */
        do {
            errno = 0;
            wc = fgetwc(in);
        } while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc));

        if (wc == WEOF) {
            if (errno == EILSEQ) return CSV_CHARSET_ERROR;
            if (errno) return CSV_READ_ERROR;
            if (ferror(in)) { errno = EIO; return CSV_READ_ERROR; }
            errno = 0;
            return CSV_END;
        }
        if (wc == L'\n' || wc == L'\r') {
            ungetwc(wc, in);
            errno = 0;
            return CSV_END;
        }

        if (wc == L'"')
            while (1) {
                /* Quoted field. */
                errno = 0;
                wc = getwc(in);

                if (wc == WEOF) {
                    if (errno == EILSEQ) return CSV_CHARSET_ERROR;
                    if (errno) return CSV_READ_ERROR;
                    if (ferror(in)) { errno = EIO; return CSV_READ_ERROR; }
                    errno = 0;
                    return CSV_FORMAT_ERROR;

                } else if (wc == L'"') {
                    errno = 0;
                    wc = getwc(in);
                    if (wc != L'"') {
                        /* Not an escaped doublequote. */
                        while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc)) {
                            errno = 0;
                            wc = getwc(in);
                        }
                        if (wc == WEOF) {
                            if (errno == EILSEQ) return CSV_CHARSET_ERROR;
                            if (errno) return CSV_READ_ERROR;
                            if (ferror(in)) { errno = EIO; return CSV_READ_ERROR; }
                        } else if (wc == L'\n' || wc == L'\r') {
                            ungetwc(wc, in);
                        } else if (wc != L',') {
                            errno = 0;
                            return CSV_FORMAT_ERROR;
                        }
                        break;
                    }

    #ifdef BACKSLASH_ESCAPES
                } else if (wc == L'\\') {
                    errno = 0;
                    wc = getwc(in);
                    if (wc == L'\0')
                        continue;
                    else if (wc == WEOF) {
                        if (errno == EILSEQ) return CSV_CHARSET_ERROR;
                        if (errno) return CSV_READ_ERROR;
                        if (ferror(in)) { errno = EIO; return CSV_READ_ERROR; }
                        break;
                    } else
                        switch (wc) {
                        case L'a':  wc = L'\a'; break;
                        case L'b':  wc = L'\b'; break;
                        case L't':  wc = L'\t'; break;
                        case L'n':  wc = L'\n'; break;
                        case L'v':  wc = L'\v'; break;
                        case L'f':  wc = L'\f'; break;
                        case L'r':  wc = L'\r'; break;
                        case L'\\': wc = L'\\'; break;
                        case L'"':  wc = L'"';  break;
                        case L',':  wc = L',';  break;
                        default:
                            ungetwc(wc, in);
                            wc = L'\\';
                        }
    #endif
                }

                if (used + 2 > size) {
                    /* Allocation policy.
                     * Anything that yields size >= used + 2 is acceptable.
                     * This one allocates in roughly 1024 byte chunks,
                     * and is known to be robust (but not optimal) in practice. */
                    size = (used | 1023) + 1009;
                    data = realloc(data, size * sizeof data[0]);
                    if (!data) {
                        errno = ENOMEM;
                        return CSV_OUT_OF_MEMORY;
                    }
                    *dataptr = data;
                    *sizeptr = size;
                }

                data[used++] = wc;
            }
        else
            while (1) {
                /* Unquoted field. */
                if (wc == L',')
                    break;
                if (wc == L'\n' || wc == L'\r') {
                    ungetwc(wc, in);
                    break;
                }

    #ifdef BACKSLASH_ESCAPES
                if (wc == L'\\') {
                    errno = 0;
                    wc = fgetwc(in);
                    if (wc == WEOF) {
                        if (errno == EILSEQ) return CSV_CHARSET_ERROR;
                        if (errno) return CSV_READ_ERROR;
                        if (ferror(in)) { errno = EIO; return CSV_READ_ERROR; }
                        wc = L'\\';
                    } else
                        switch (wc) {
                        case L'a':  wc = L'\a'; break;
                        case L'b':  wc = L'\b'; break;
                        case L't':  wc = L'\t'; break;
                        case L'n':  wc = L'\n'; break;
                        case L'v':  wc = L'\v'; break;
                        case L'f':  wc = L'\f'; break;
                        case L'r':  wc = L'\r'; break;
                        case L'"':  wc = L'"';  break;
                        case L',':  wc = L',';  break;
                        case L'\\': wc = L'\\'; break;
                        default:
                            ungetwc(wc, in);
                            wc = L'\\';
                        }
                }
    #endif

                if (used + 2 > size) {
                    /* Allocation policy.
                     * Anything that yields size >= used + 2 is acceptable.
                     * This one allocates in roughly 1024 byte chunks,
                     * and is known to be robust (but not optimal) in practice. */
                    size = (used | 1023) + 1009;
                    data = realloc(data, size * sizeof data[0]);
                    if (!data) {
                        errno = ENOMEM;
                        return CSV_OUT_OF_MEMORY;
                    }
                    *dataptr = data;
                    *sizeptr = size;
                }

                data[used++] = wc;

                errno = 0;
                wc = getwc(in);
                if (wc == WEOF) {
                    if (errno == EILSEQ) return CSV_CHARSET_ERROR;
                    if (errno) return CSV_READ_ERROR;
                    if (ferror(in)) { errno = EIO; return CSV_READ_ERROR; }
                    break;
                }
            }

        /* Ensure there is room for the end-of-string mark. */
        if (used >= size) {
            size = used + 1;
            data = realloc(data, size * sizeof data[0]);
            if (!data) {
                errno = ENOMEM;
                return CSV_OUT_OF_MEMORY;
            }
            *dataptr = data;
            *sizeptr = size;
        }

        data[used] = L'\0';

        if (lengthptr)
            *lengthptr = used;

        errno = 0;
        return CSV_OK;
    }

    /* Helper function: print a wide string as if in quotes, but backslash-escape special characters. */
    static void wquoted(FILE *const out, const wchar_t *ws, const size_t len)
    {
        if (out) {
            size_t i;
            for (i = 0; i < len; i++)
                if (ws[i] == L'\0')
                    fputws(L"\\0", out);
                else if (ws[i] == L'\a')
                    fputws(L"\\a", out);
                else if (ws[i] == L'\b')
                    fputws(L"\\b", out);
                else if (ws[i] == L'\t')
                    fputws(L"\\t", out);
                else if (ws[i] == L'\n')
                    fputws(L"\\n", out);
                else if (ws[i] == L'\v')
                    fputws(L"\\v", out);
                else if (ws[i] == L'\f')
                    fputws(L"\\f", out);
                else if (ws[i] == L'\r')
                    fputws(L"\\r", out);
                else if (ws[i] == L'"')
                    fputws(L"\\\"", out);
                else if (ws[i] == L'\\')
                    fputws(L"\\\\", out);
                else if (iswprint(ws[i]))
                    fputwc(ws[i], out);
                else if (ws[i] < 65535)
                    fwprintf(out, L"\\x%04x", (unsigned int)ws[i]);
                else
                    fwprintf(out, L"\\x%08lx", (unsigned long)ws[i]);
        }
    }

    static int show_csv(FILE *const in, const char *const filename)
    {
        wchar_t      *field_contents = NULL;
        size_t        field_allocated = 0;
        size_t        field_length = 0;
        unsigned long record = 0UL;
        unsigned long field;
        csv_status    status;

        while (1) {

            /* First field in this record. */
            field = 0UL;
            record++;

            while (1) {

                status = csv_next_field(in, &field_contents, &field_allocated, &field_length);
                if (status == CSV_END)
                    break;
                if (status < 0) {
                    fprintf(stderr, "%s: %s.\n", filename, csv_error(status));
                    free(field_contents);
                    return -1;
                }

                field++;

                wprintf(L"Record %lu, field %lu is \"", record, field);
                wquoted(stdout, field_contents, field_length);
                wprintf(L"\", %lu characters.\n", (unsigned long)field_length);
            }

            status = csv_next_record(in);
            if (status == CSV_END) {
                free(field_contents);
                return 0;
            }
            if (status < 0) {
                fprintf(stderr, "%s: %s.\n", filename, csv_error(status));
                free(field_contents);
                return -1;
            }
        }
    }

    static int usage(const char *argv0)
    {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help | /? ]\n", argv0);
        fprintf(stderr, "       %s CSV-FILE [ ... ]\n", argv0);
        fprintf(stderr, "\n");
        fprintf(stderr, "Use special file name '-' to read from standard input.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    int main(int argc, char *argv[])
    {
        FILE *in;
        int   arg;

        setlocale(LC_ALL, "");
        fwide(stdin, 1);
        fwide(stdout, 1);

        /* No file names given? */
        if (argc < 2)
            return usage(argc > 0 ? argv[0] : "(this)");

        for (arg = 1; arg < argc; arg++) {
            if (!strcmp(argv[arg], "-h") || !strcmp(argv[arg], "--help") || !strcmp(argv[arg], "/?"))
                return usage(argv[0]);

            if (!strcmp(argv[arg], "-")) {
                if (show_csv(stdin, "(standard input)"))
                    return EXIT_FAILURE;
            } else {
                in = fopen(argv[arg], "r");
                if (!in) {
                    fprintf(stderr, "%s: %s.\n", argv[arg], strerror(errno));
                    return EXIT_FAILURE;
                }
                if (show_csv(in, argv[arg])) {
                    fclose(in);
                    return EXIT_FAILURE;
                }
                if (ferror(in)) {
                    fprintf(stderr, "%s: %s.\n", argv[arg], strerror(EIO));
                    fclose(in);
                    return EXIT_FAILURE;
                }
                if (fclose(in)) {
                    fprintf(stderr, "%s: %s.\n", argv[arg], strerror(EIO));
                    return EXIT_FAILURE;
                }
            }
        }

        return EXIT_SUCCESS;
    }

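Using the csv_next_field(), csv_skip_field(), and csv_next_record() functions above is quite straightforward.

  1. Open the CSV file normally, then call fwide(stream, 1) on it to tell the C library you intend to use the wide string variants instead of the standard narrow string I/O functions.

  2. Create four variables, and initialize the first two:

      wchar_t   *field = NULL;
      size_t     allocated = 0;
      size_t     length;
      csv_status status;

    field is a pointer to the dynamically allocated contents of each field you read. It is allocated automatically; essentially, you do not need to worry about it at all. allocated holds the currently allocated size (in wide characters, including the terminating L'\0'), and we will use length and status later.

  3. At this point, you are ready to read or skip the first field in the first record.

    You do not want to call csv_next_record() at this point, unless you wish to skip the very first record in the file entirely.

  4. Call status = csv_skip_field(stream); to skip the next field, or status = csv_next_field(stream, &field, &allocated, &length); to read it.

    If status == CSV_OK, you have the field contents in the wide string field. It has length wide characters.

    If status == CSV_END, there are no more fields in the current record. (field is unchanged, and you should not examine it.)

    Otherwise, status < 0, and it describes an error code. You can use csv_error(status) to obtain a (narrow) string describing it.

  5. At any point, you can move (skip) to the start of the next record by calling status = csv_next_record(stream);.

    If it returns CSV_OK, there might be a new record available. (We only know when you try to read or skip the first field. This is similar to how the standard C library function feof() only tells you whether you have already tried to read past the end of input; it does not tell you whether there is more data available.)

    If it returns CSV_END, you have already processed the last record, and there are no more records.

    Otherwise, it returns a negative error code, status < 0. You can use csv_error(status) to obtain a (narrow) string describing it.

  6. After you are done, discard the field buffer:

      free(field);
      field = NULL;
      allocated = 0;

    You do not actually need to reset the variables to NULL and zero, but I recommend it. In fact, you can do the above at any point (when you are no longer interested in the contents of the current field), as csv_next_field() will automatically allocate a new buffer as necessary.

    Note that free(NULL); is always safe and does nothing. You do not need to check whether field is NULL before freeing it. This is also the reason why I recommend initializing the variables right when you declare them. It just makes everything so much easier to handle.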
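The buffer convention used in steps 2 and 6 (a pointer that starts out NULL, an allocated size that starts out zero, and automatic growth) can be tried out in isolation. Below is a minimal sketch using the same sizing policy as the code above; append_wchar() is a hypothetical helper written for this illustration, not one of the CSV functions.

```c
#include <errno.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: appends 'wc' to the dynamically allocated wide
 * string *dataptr, which has room for *sizeptr wide characters and
 * currently holds *usedptr of them. The buffer is grown as needed and
 * kept L'\0'-terminated. Start with *dataptr == NULL, *sizeptr == 0,
 * and *usedptr == 0. Returns 0 if successful, -1 with errno set otherwise;
 * on failure the existing buffer is left untouched. */
int append_wchar(wchar_t **dataptr, size_t *sizeptr, size_t *usedptr, wchar_t wc)
{
    if (!dataptr || !sizeptr || !usedptr) {
        errno = EINVAL;
        return -1;
    }

    if (*dataptr == NULL)
        *sizeptr = 0;

    /* Need room for this character plus the end-of-string mark. */
    if (*usedptr + 2 > *sizeptr) {
        const size_t size = (*usedptr | 1023) + 1009;
        wchar_t     *data = realloc(*dataptr, size * sizeof (wchar_t));
        if (!data) {
            errno = ENOMEM;
            return -1;
        }
        *dataptr = data;
        *sizeptr = size;
    }

    (*dataptr)[(*usedptr)++] = wc;
    (*dataptr)[*usedptr] = L'\0';
    return 0;
}
```

Because the buffer starts out as NULL and free(NULL) is safe, the caller can release it with a plain free() whether or not anything was ever appended.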

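The example program, when compiled, takes one or more CSV file names as command-line parameters, then reads the files and reports the contents of each field in them. If you have a particularly fiendishly complex CSV file, this is a good way to check whether this approach reads all the fields correctly.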