How to improve performance when working with files in C

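I have implemented the Naive Bayes algorithm on a large data set of 410k rows. All my records are now classified correctly, but the program takes almost an hour to write the records to their corresponding files. What is the best way to improve the performance of my code? The code is below; it writes the 410k records to the corresponding files. Thanks.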

    fp=fopen("sales_ok_fraud.txt","r");
    while(fgets(line,80,fp)!=NULL) //Reading each line from file to calculate the file size.
    {
        token = strtok(line,",");
        token = strtok(NULL,",");
        token = strtok(NULL,",");
        token = strtok(NULL,",");
        token = strtok(NULL,",");
        token = strtok(NULL,",");
        token1 = strtok(token,"\n");
        memcpy(mystr,&token1[0],strlen(token1)-1);
        mystr[strlen(token1)-1] = '\0';
        if( strcmp(mystr,"ok") == 0 )
            counter_ok++;
        else
            counter_fraud++;
    }
    printf("The no. of records with OK label are %f\n",counter_ok);
    printf("The no. of records with FRAUD label are %f\n",counter_fraud);
    prblty_ok = counter_ok/(counter_ok+counter_fraud);
    prblty_fraud = counter_fraud/(counter_ok+counter_fraud);
    printf("The probability of OK records is %f\n",prblty_ok);
    printf("The probability of FRAUD records is %f\n",prblty_fraud);
    fclose(fp);

    fp=fopen("sales_unknwn.txt","r");
    fp2=fopen("sales_unknown_ok_classified.txt","a");
    fp3=fopen("sales_unknown_fraud_classified.txt","a");
    while(fgets(line1,80,fp)!=NULL) //Reading each line from file to calculate the file size.
    {
        unknwn_attr1 = strtok(line1,",");
        unknwn_attr2 = strtok(NULL,",");
        unknwn_attr3 = strtok(NULL,",");
        unknwn_attr4 = strtok(NULL,",");
        unknwn_attr5 = strtok(NULL,",");
        //printf("%s-%s-%s-%s-%s\n",unknwn_attr1,unknwn_attr2,unknwn_attr3,unknwn_attr4,unknwn_attr5);

        fp1=fopen("sales_ok_fraud.txt","r");
        while(fgets(line,80,fp1)!=NULL) //Reading each line from file to calculate the file size.
        {
            ok_fraud_attr1 = strtok(line,",");
            ok_fraud_attr2 = strtok(NULL,",");
            ok_fraud_attr3 = strtok(NULL,",");
            ok_fraud_attr4 = strtok(NULL,",");
            ok_fraud_attr5 = strtok(NULL,",");
            ok_fraud_attr6 = strtok(NULL,",");
            memcpy(ok_fraud_attr6_str,&ok_fraud_attr6[0],strlen(ok_fraud_attr6)-2);
            ok_fraud_attr6_str[strlen(ok_fraud_attr6)-2] = '\0';
            //ok_fraud_attr6[strlen(ok_fraud_attr6)-2] = '\0';
            //printf("Testing ok_fraud_attr6 - %s-%d\n",ok_fraud_attr6_str,strlen(ok_fraud_attr6_str));
            if( strcmp(ok_fraud_attr6_str,"ok") == 0 )
            {
                if( strcmp(unknwn_attr2,ok_fraud_attr2) == 0 ) counter_ok_attr2++;
                if( strcmp(unknwn_attr3,ok_fraud_attr3) == 0 ) counter_ok_attr3++;
                if( strcmp(unknwn_attr4,ok_fraud_attr4) == 0 ) counter_ok_attr4++;
                if( strcmp(unknwn_attr5,ok_fraud_attr5) == 0 ) counter_ok_attr5++;
            }
            if( strcmp(ok_fraud_attr6_str,"fraud") == 0 )
            {
                if( strcmp(unknwn_attr2,ok_fraud_attr2) == 0 ) counter_fraud_attr2++;
                if( strcmp(unknwn_attr3,ok_fraud_attr3) == 0 ) counter_fraud_attr3++;
                if( strcmp(unknwn_attr4,ok_fraud_attr4) == 0 ) counter_fraud_attr4++;
                if( strcmp(unknwn_attr5,ok_fraud_attr5) == 0 ) counter_fraud_attr5++;
            }
        }
        fclose(fp1);

        if(counter_ok_attr2 == 0)
            prblty_attr2_given_ok = (counter_ok_attr2+arbitrary_value*prblty_ok)/(counter_ok+arbitrary_value);
        else
            prblty_attr2_given_ok = (counter_ok_attr2)/(counter_ok);
        if(counter_ok_attr3 == 0)
            prblty_attr3_given_ok = (counter_ok_attr3+arbitrary_value*prblty_ok)/(counter_ok+arbitrary_value);
        else
            prblty_attr3_given_ok = (counter_ok_attr3)/(counter_ok);
        if(counter_ok_attr4 == 0)
            prblty_attr4_given_ok = (counter_ok_attr4+arbitrary_value*prblty_ok)/(counter_ok+arbitrary_value);
        else
            prblty_attr4_given_ok = (counter_ok_attr4)/(counter_ok);
        if(counter_ok_attr5 == 0)
            prblty_attr5_given_ok = (counter_ok_attr5+arbitrary_value*prblty_ok)/(counter_ok+arbitrary_value);
        else
            prblty_attr5_given_ok = (counter_ok_attr5)/(counter_ok);

        if(counter_fraud_attr2 == 0)
            prblty_attr2_given_fraud = (counter_fraud_attr2+arbitrary_value*prblty_fraud)/(counter_fraud+arbitrary_value);
        else
            prblty_attr2_given_fraud = (counter_fraud_attr2)/(counter_fraud);
        if(counter_fraud_attr3 == 0)
            prblty_attr3_given_fraud = (counter_fraud_attr3+arbitrary_value*prblty_fraud)/(counter_fraud+arbitrary_value);
        else
            prblty_attr3_given_fraud = (counter_fraud_attr3)/(counter_fraud);
        if(counter_fraud_attr4 == 0)
            prblty_attr4_given_fraud = (counter_fraud_attr4+arbitrary_value*prblty_fraud)/(counter_fraud+arbitrary_value);
        else
            prblty_attr4_given_fraud = (counter_fraud_attr4)/(counter_fraud);
        if(counter_fraud_attr5 == 0)
            prblty_attr5_given_fraud = (counter_fraud_attr5+arbitrary_value*prblty_fraud)/(counter_fraud+arbitrary_value);
        else
            prblty_attr5_given_fraud = (counter_fraud_attr5)/(counter_fraud);

        total_prblty_ok = prblty_ok*prblty_attr2_given_ok*prblty_attr3_given_ok*prblty_attr4_given_ok*prblty_attr5_given_ok;
        total_prblty_fraud = prblty_fraud*prblty_attr2_given_fraud*prblty_attr3_given_fraud*prblty_attr4_given_fraud*prblty_attr5_given_fraud;

        // printf("Testing counts for OK - %f - %f - %f - %f\n",counter_ok_attr2,counter_ok_attr3,counter_ok_attr4,counter_ok_attr5);
        // printf("Testing counts for FRAUD - %f - %f - %f - %f\n",counter_fraud_attr2,counter_fraud_attr3,counter_fraud_attr4,counter_fraud_attr5);
        // printf("Testing attribute probabilities for OK - %f - %f - %f - %f\n",prblty_attr2_given_ok,prblty_attr3_given_ok,prblty_attr4_given_ok,prblty_attr5_given_ok);
        // printf("Testing attribute probabilities for FRAUD- %f - %f - %f - %f\n",prblty_attr2_given_fraud,prblty_attr3_given_fraud,prblty_attr4_given_fraud,prblty_attr5_given_fraud);
        // printf("The final probabilities are %f - %f\n",total_prblty_ok,total_prblty_fraud);

        if(total_prblty_ok > total_prblty_fraud)
        {
            fprintf(fp2,"%s,%s,%s,%s,%s,ok\n",unknwn_attr1,unknwn_attr2,unknwn_attr3,unknwn_attr4,unknwn_attr5);
        }
        else
        {
            fprintf(fp3,"%s,%s,%s,%s,%s,fraud\n",unknwn_attr1,unknwn_attr2,unknwn_attr3,unknwn_attr4,unknwn_attr5);
        }
        counter_ok_attr2=counter_ok_attr3=counter_ok_attr4=counter_ok_attr5=0;
        counter_fraud_attr2=counter_fraud_attr3=counter_fraud_attr4=counter_fraud_attr5=0;
    }
    fclose(fp);
    fclose(fp2);
    fclose(fp3);

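Here are a few things I can see right away, in the order I would try them:

  1. Stop the repeated open-write-close, open-write-close pattern on the output files. Their names are fixed and finite. Open them appropriately at the beginning, then flush and close them when you are done.
  2. Several of the logic constructs can be simplified considerably.
  3. Your strlen() rampage needs to be cut back substantially. The best optimizing compilers will detect an unchanged source and optimize away subsequent calls on a char pointer known to be unchanged, so I would do this last (but honestly I would still do it, because calling strlen() repeatedly on the same data is bad practice).
  4. Added after a conversation with the OP: you are re-parsing the same data file (sales_ok_fraud.txt) over and over, once for every line of data in sales_unknwn.txt. If sales_ok_fraud.txt can fit in memory, then 12 GB / avg-line-length is a lot of needless repeated parsing. Load that data once, compute its basic statistics, and use the data and statistics from memory for the rest of the number crunching.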
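
Point 3 can be sketched as follows. This is only a minimal illustration, not the OP's code: the helper names `rtrim_in_place` and `label_is_ok` are hypothetical, and it assumes the label field may end in "\n" or "\r\n". The length is measured once and the field is trimmed in place, so no second buffer and no repeated strlen() calls are needed.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: strip trailing whitespace (including "\r\n")
   from s in place, calling strlen() exactly once.
   Returns the length of the trimmed string. */
size_t rtrim_in_place(char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);              /* measured once, then reused */
    while (len > 0 && isspace((unsigned char)s[len - 1]))
        s[--len] = '\0';
    return len;
}

/* Hypothetical helper: returns 1 if the label field reads "ok"
   once its line ending has been removed, 0 otherwise. */
int label_is_ok(char *label)
{
    rtrim_in_place(label);
    return strcmp(label, "ok") == 0;
}
```

Because the trim does not depend on whether the file uses Unix or Windows line endings, it also avoids hard-coding an offset such as `strlen(...)-1` or `strlen(...)-2`.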

Logic Reduction

You can cut out a significant amount of work in one place by changing this:

    if(strcmp(unknwn_attr2,ok_fraud_attr2) == 0 && strcmp(ok_fraud_attr6_str,"ok") == 0)
        counter_ok_attr2++;
    if(strcmp(unknwn_attr3,ok_fraud_attr3) == 0 && strcmp(ok_fraud_attr6_str,"ok") == 0)
        counter_ok_attr3++;
    if(strcmp(unknwn_attr4,ok_fraud_attr4) == 0 && strcmp(ok_fraud_attr6_str,"ok") == 0)
        counter_ok_attr4++;
    if(strcmp(unknwn_attr5,ok_fraud_attr5) == 0 && strcmp(ok_fraud_attr6_str,"ok") == 0)
        counter_ok_attr5++;
    if(strcmp(unknwn_attr2,ok_fraud_attr2) == 0 && strcmp(ok_fraud_attr6_str,"fraud") == 0)
        counter_fraud_attr2++;
    if(strcmp(unknwn_attr3,ok_fraud_attr3) == 0 && strcmp(ok_fraud_attr6_str,"fraud") == 0)
        counter_fraud_attr3++;
    if(strcmp(unknwn_attr4,ok_fraud_attr4) == 0 && strcmp(ok_fraud_attr6_str,"fraud") == 0)
        counter_fraud_attr4++;
    if(strcmp(unknwn_attr5,ok_fraud_attr5) == 0 && strcmp(ok_fraud_attr6_str,"fraud") == 0)
        counter_fraud_attr5++;

to this:

    if (strcmp(ok_fraud_attr6_str, "ok") == 0)
    {
        if(strcmp(unknwn_attr2,ok_fraud_attr2) == 0)
            counter_ok_attr2++;
        if(strcmp(unknwn_attr3,ok_fraud_attr3) == 0)
            counter_ok_attr3++;
        if(strcmp(unknwn_attr4,ok_fraud_attr4) == 0)
            counter_ok_attr4++;
        if(strcmp(unknwn_attr5,ok_fraud_attr5) == 0)
            counter_ok_attr5++;
    }
    else if (strcmp(ok_fraud_attr6_str,"fraud") == 0)
    {
        if(strcmp(unknwn_attr2,ok_fraud_attr2) == 0)
            counter_fraud_attr2++;
        if(strcmp(unknwn_attr3,ok_fraud_attr3) == 0)
            counter_fraud_attr3++;
        if(strcmp(unknwn_attr4,ok_fraud_attr4) == 0)
            counter_fraud_attr4++;
        if(strcmp(unknwn_attr5,ok_fraud_attr5) == 0)
            counter_fraud_attr5++;
    }

Front-Loading sales_ok_fraud.txt

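The following relies on the sanctity of the data format of the sales_ok_fraud.txt statistics file, while trying to be as pedantic as possible in validating that format. It allocates one chunk of memory large enough to hold the entire file plus one char, and treats the whole thing as a single null-terminated string. That buffer is then sliced up using the same general algorithm you had before. The result is a table of fixed-length arrays of char pointers, which can be iterated at the same place where you currently (and repeatedly) open, parse, use, and discard all of this content.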

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    // declare an array of six string pointers
    typedef char *OFAttribs[6];

    // loads a table consisting of the following format:
    //
    //   str1,str2,str3,str4,str5,str6\n
    //   str1,str2,str3,str4,str5,str6\n
    //   ...
    //   str1,str2,str3,str4,str5,str6
    //
    // any deviation from the above will cause premature termination of the loop
    // but will return whatever was able to be parsed up to the point of failure.
    // the caller should therefore always `free()` the resulting table and data
    // pointers.
    size_t init_ok_fraud_data(const char *fname, OFAttribs **ppTable, char **ppTableData)
    {
        if (!fname || !*fname)
            return 0;

        // check file open for thumbs up
        FILE *fp = fopen(fname, "rb");
        if (!fp)
            return 0;

        // determine the file size
        fseek(fp, 0, SEEK_END);
        long len = ftell(fp);
        fseek(fp, 0, SEEK_SET);
        if (len < 0)
        {
            fclose(fp);
            return 0;
        }

        // allocate enough ram for the entire file plus terminator
        OFAttribs *pTable = NULL;
        size_t nTableLen = 0;
        char *pTableData = malloc((size_t)len + 1);
        if (NULL != pTableData)
        {
            // terminate at the number of bytes actually read
            size_t nRead = fread(pTableData, 1, (size_t)len, fp);
            pTableData[nRead] = 0;
        }

        // no longer need the file
        fclose(fp);

        // without a buffer there is nothing to parse (strtok(NULL, ...) here
        // would be undefined behavior)
        if (NULL == pTableData)
            return 0;

        // prime first token
        char *token = strtok(pTableData, ",");
        while (token)
        {
            // read next line of tokens
            OFAttribs attribs = { NULL };
            for (int i=0;i<4 && token; ++i)
            {
                attribs[i] = token;
                token = strtok(NULL, ",");
            }

            // filled 0..3, set last token and move on
            if (attribs[3] && token)
            {
                // next-to-last entry set
                attribs[4] = token;

                // line end is only terminated by newline
                token = strtok(NULL, "\n");
                if (token)
                {
                    // proper format. 6 parms, 5 commas, one new-line.
                    attribs[5] = token;

                    // trim trailing whitespace (e.g. '\r') without running
                    // off the front of the string
                    size_t slen = strlen(token);
                    while (slen > 0 && isspace((unsigned char)token[slen-1]))
                        token[--slen] = 0;

                    // make space on the master list for another.
                    OFAttribs *tmp = realloc(pTable, sizeof(*tmp) * (nTableLen+1));
                    if (NULL != tmp)
                    {
                        pTable = tmp;
                        memcpy(pTable + nTableLen++, attribs, sizeof(attribs));
                    }
                    else
                    {   // allocation failure.
                        printf("Error allocating memory for expanding OKFraud data set");
                        exit(EXIT_FAILURE);
                    }
                }
                else
                {   // not good.
                    printf("Invalid line format detected. Expected ok/fraud\\n");
                    break;
                }

                // next token of new line
                token = strtok(NULL, ",");
            }
        }

        // set output variables
        *ppTable = pTable;
        *ppTableData = pTableData;
        return nTableLen;
    }

Putting It Together

Incorporating all of the above has the following effect on your code base:

    // load the ok_fraud table ONCE.
    OFAttribs *okfr = NULL;
    char *okfr_data = NULL;
    size_t okfr_len = init_ok_fraud_data("sales_ok_fraud.txt", &okfr, &okfr_data);

    // walk table to determine probabilities of ok and fraud states.
    // note: this really should be done as part of the loader.
    for (size_t i=0;i [...] total_prblty_fraud)
        {
            fprintf(fp2,"%s,%s,%s,%s,%s,ok\n",unknwn_attr1,unknwn_attr2,unknwn_attr3,unknwn_attr4,unknwn_attr5);
        }
        else
        {
            fprintf(fp3,"%s,%s,%s,%s,%s,fraud\n",unknwn_attr1,unknwn_attr2,unknwn_attr3,unknwn_attr4,unknwn_attr5);
        }
        counter_ok_attr2=counter_ok_attr3=counter_ok_attr4=counter_ok_attr5=0;
        counter_fraud_attr2=counter_fraud_attr3=counter_fraud_attr4=counter_fraud_attr5=0;
    }

    // free the table data and dynamic pointer array
    free(okfr);
    free(okfr_data);

    fclose(fp);
    fclose(fp2);
    fclose(fp3);
    return 0;
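
As a rough illustration of what scanning an in-memory table looks like for one unknown record, here is a minimal, self-contained sketch. It is not the answer's code: the `Record` and `MatchCounts` types and the `count_matches` function are hypothetical names, and it assumes six fields per training row with the label ("ok" or "fraud") in the last one. The label is tested once per row, and the counters are doubles to mirror the `%f` counters in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical row layout: six fields, attr[5] holds the label. */
typedef struct {
    const char *attr[6];
} Record;

/* Per-class match counts for attributes 2..5 (indices 1..4). */
typedef struct {
    double ok[4];
    double fraud[4];
} MatchCounts;

/* Scan the in-memory training table once for a single unknown record.
   No file is opened and nothing is re-parsed inside this loop. */
MatchCounts count_matches(const Record *table, size_t n, const Record *unknown)
{
    MatchCounts c = { {0}, {0} };
    assert(unknown != NULL);

    for (size_t i = 0; i < n; ++i) {
        /* decide the class once per row, not once per attribute */
        double *bucket;
        if (strcmp(table[i].attr[5], "ok") == 0)
            bucket = c.ok;
        else if (strcmp(table[i].attr[5], "fraud") == 0)
            bucket = c.fraud;
        else
            continue;                    /* unrecognized label: skip row */

        for (int k = 0; k < 4; ++k)
            if (strcmp(unknown->attr[k + 1], table[i].attr[k + 1]) == 0)
                bucket[k] += 1.0;
    }
    return c;
}
```

The caller would run this once per line of sales_unknwn.txt and feed the counts into the existing probability calculation; the cost per record is then a pass over memory rather than a pass over the disk file.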

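These are just a few ideas. There is certainly more that could be done, but these help a great deal when processing a file in a single forward scan with continuous output, which is about as efficient as you are going to get in these situations. Without question, the combination of the big three (a single open and close per file, the logic reduction, and a single parse-and-cache of sales_ok_fraud.txt) will give a large boost in performance, especially the first and the last.

EDIT: Assisted the OP in updating this processor to front-load the contents of sales_ok_fraud.txt, which eliminates repeatedly loading, parsing, and immediately discarding 15000+ lines of text (once per input line of the main source). The answer above has been updated accordingly.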

@m02ph3u5 is right. Keep the files open: take the fopen calls out of the loop.

    inputFile = fopen("sales_unknwn.txt","r");
    okayFile = fopen("sales_ok_fraud.txt","r");
    unknownOkayFile = fopen("sales_unknown_ok_classified.txt","a");
    unknownFraudFile = fopen("sales_unknown_fraud_classified.txt","a");

    // your loops go here

    fclose(inputFile);
    fclose(okayFile);
    fclose(unknownOkayFile);
    fclose(unknownFraudFile);

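If it is still slow, run a sampling profiler on your app, using a subset of the test data as input to keep the turnaround fast. That will tell you where the program is spending its time. You may be surprised. If you don't know of a profiler to use, you can emulate a poor man's sampling profiler by running the app under a debugger, breaking into the debugger repeatedly, and noting which function it is executing. If you find it in a particular function, or on a particular line, most of the time, that is probably a hot spot you can optimize.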
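
If no profiler is at hand, a cruder alternative is to time suspect sections by hand with the standard `clock()` function. To be clear, this is manual instrumentation and not a sampling profiler, and the names `time_section` and `sum_to_n` below are hypothetical; it is only a sketch of how one might confirm where the hour is going.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: run fn(arg) once, print the CPU time it used
   to stderr under the given label, and return that time in seconds. */
double time_section(const char *label, void (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    clock_t start = clock();
    fn(arg);
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    fprintf(stderr, "%s: %.3f s CPU\n", label, secs);
    return secs;
}

/* Example workload: box[0] holds n on entry, box[1] receives
   the sum 0 + 1 + ... + (n-1). Stands in for a real processing step. */
void sum_to_n(void *arg)
{
    long *box = arg;
    long total = 0;
    for (long i = 0; i < box[0]; ++i)
        total += i;
    box[1] = total;
}
```

Note that `clock()` measures processor time, so time the program spends blocked waiting on the disk may not show up; for an I/O-bound run, compare against wall-clock time as well.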

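Some suggestions:

• Repeatedly opening files for append, closing them, and reopening them is very expensive. I/O is much slower than memory access, and you are forcing the disk to open each file and seek to its end on every write. It is better to open the files at the start and close them at the end, unless you are worried that the program will crash and you will lose the data written so far.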

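• You can replace the lines

    memcpy(ok_fraud_attr6_str, &ok_fraud_attr6[0], strlen(ok_fraud_attr6)-2);
    ok_fraud_attr6_str[strlen(ok_fraud_attr6)-2] = '\0';

with

    ok_fraud_attr6[strlen(ok_fraud_attr6)-2] = '\0';

and then use ok_fraud_attr6 in the tests. Since strtok is destructive (a quick search on why using it is usually a bad idea will be worth your time), you don't have to worry about preserving the contents of line or ok_fraud_attr6.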

• When you find yourself writing the same code over and over, it is often a sign that your algorithm is inefficient. Instead of

    if ((some_unique_test) && (a_common_test))
        do_some_stuff;
    if ((some_other_unique_test) && (a_common_test))
        do_other_stuff;

you can write

    if (a_common_test)
    {
        if (some_unique_test)
            do_some_stuff;
        if (some_other_unique_test)
            do_other_stuff;
    }

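Note, though, that only the first suggestion is likely to have a noticeable effect on your program's execution time, although all of them are good habits to learn.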
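
For the first suggestion, the crash concern can be addressed without reopening files: keep each output stream open for the whole run and call fflush() every so often. The sketch below assumes a hypothetical helper named `write_records` and a caller-chosen `flush_every` interval; it is one possible arrangement, not the only one.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write n lines to a stream that is ALREADY open,
   flushing after every flush_every lines (0 means never flush here).
   Returns the number of lines written successfully. */
size_t write_records(FILE *out, const char *const *lines, size_t n,
                     size_t flush_every)
{
    assert(out != NULL);
    size_t written = 0;
    for (size_t i = 0; i < n; ++i) {
        if (fprintf(out, "%s\n", lines[i]) < 0)
            break;                       /* stop on the first write error */
        ++written;
        /* bound how much buffered data a crash could lose */
        if (flush_every != 0 && written % flush_every == 0)
            fflush(out);
    }
    return written;
}
```

The caller opens the stream once before the main loop and closes it once afterwards; a larger `flush_every` means fewer system calls, a smaller one means less data at risk.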

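Jason's suggestion to use a profiler is very good advice and cannot be overemphasized. Programmers, even experienced ones, are notoriously bad at predicting where the bottlenecks in their code are. Next to the debugger, the profiler is your best friend.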