OpenMP令人尴尬地并行循环，没有加速

我有一个非常简单的并行for循环，它只是将零写入整数数组。但事实certificate线程越多，循环越慢。我认为这是由于一些缓存抖动所以我玩了调度，块大小， __restrict__ ，在并行块内嵌套并行，并刷新。然后我注意到读取数组进行缩减也比较慢。

这应该显然非常简单，并且应该几乎线性加速。我在这里想念的是什么？

完整代码：

 #include  #include  #include  #include  void tic(), toc(); int main(int argc, const char *argv[]) { const int COUNT = 100; const size_t sz = 250000 * 200; std::vector vec(sz, 1); std::cout << "max threads: " << omp_get_max_threads()<< std::endl; std::cout << "serial reduction" << std::endl; tic(); for(int c = 0; c < COUNT; ++ c) { double sum = 0; for(size_t i = 0; i < sz; ++ i) sum += vec[i]; } toc(); int *const ptr = vec.data(); const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int std::cout << "parallel reduction" << std::endl; tic(); for(int c = 0; c < COUNT; ++ c) { double sum = 0; #pragma omp parallel for default(none) reduction(+:sum) for(int i = 0; i < sz_i; ++ i) sum += ptr[i]; } toc(); std::cout << "serial memset" << std::endl; tic(); for(int c = 0; c < COUNT; ++ c) { for(size_t i = 0; i < sz; ++ i) vec[i] = 0; } toc(); std::cout << "parallel memset" << std::endl; tic(); for(int c = 0; c < COUNT; ++ c) { #pragma omp parallel for default(none) for(int i = 0; i < sz_i; ++ i) ptr[i] = 0; } toc(); return 0; } static clock_t ClockCounter; void tic() { ClockCounter = std::clock(); } void toc() { ClockCounter = std::clock() - ClockCounter; std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl; }

运行此产生：

 g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1 ./omp_test max threads: 12 serial reduction elapsed clock ticks: 1790000 parallel reduction elapsed clock ticks: 19690000 serial memset elapsed clock ticks: 3860000 parallel memset elapsed clock ticks: 20800000

如果我使用-O2运行，g ++能够优化串行减少，我得到零时间，因此-O1 。另外，放omp_set_num_threads(1); 使时间更相似，尽管仍有一些差异：

 g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1 ./omp_test max threads: 1 serial reduction elapsed clock ticks: 1770000 parallel reduction elapsed clock ticks: 7370000 serial memset elapsed clock ticks: 2290000 parallel memset elapsed clock ticks: 3550000

这应该是相当明显的，我觉得我没有看到一些非常基本的东西。我的CPU是具有超线程的英特尔（R）Xeon（R）CPU E5-2640 0 @ 2.50GHz，但在具有4个内核且没有超线程的同事的i5中观察到相同的行为。我们都在运行Linux。

编辑

似乎一个错误是在时间方面，运行：

 static double ClockCounter; void tic() { ClockCounter = omp_get_wtime();//std::clock(); } void toc() { ClockCounter = omp_get_wtime()/*std::clock()*/ - ClockCounter; std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl; }

产生更“合理”的时间：

 g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1 ./omp_test max threads: 12 serial reduction elapsed clock ticks: 1.80974 parallel reduction elapsed clock ticks: 2.07367 serial memset elapsed clock ticks: 2.37713 parallel memset elapsed clock ticks: 2.23609

但是，仍然没有加速，它只是不再慢 。

编辑2 ：

正如user8046所建议的那样，代码严重受内存限制。并且正如Z boson所建议的那样，串行代码很容易被优化掉，并且不确定这里测量的是什么。所以我做了一个小的改变，将总和放在循环之外，这样它在c每次迭代都不会为零。我还用sum+=F(vec[i])和memset操作替换了还原操作，其中vec[i] = F(i) 。运行方式：

 g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1 -D"F(x)=sqrt(double(x))" ./omp_test max threads: 12 serial reduction elapsed clock ticks: 23.9106 parallel reduction elapsed clock ticks: 3.35519 serial memset elapsed clock ticks: 43.7344 parallel memset elapsed clock ticks: 6.50351

计算平方根为线程增加了更多工作，最终有一些合理的加速（大约7x ，这是有意义的，因为超线程内核共享内存通道）。

你发现了时间错误。仍然没有加速，因为两个测试用例都是严重的内存限制。在典型的消费者硬件上，所有内核共享一个内存总线，因此使用更multithreading不会为您带来更多带宽，因为这是瓶颈，加速。如果减小问题大小，这可能会改变，因此它适合缓存，或者确保增加每个数据的计算次数，例如，如果计算减少exp（vec [i]）或1 / vec [一世]。对于memset：你可以用一个线程使内存饱和，你永远不会看到加速。（仅当您可以访问具有更multithreading的第二个内存总线时，与某些多插槽体系结构一样）。关于减少的一个评论，这很可能不是用锁来实现的，这将是非常低效的，但使用一个没有那么糟糕的对数加速的加法树。

除了您在Linux中使用时钟function时的错误，您可以通过阅读这些问题/答案来回答您的其余问题。

在-AN-的openmp并行化码将-存在待任何效益换memset的将要运行-IN-P / 11579987

测量存储器带宽-从- -点积-的两arrays

memset的function于平行与线程绑定到每个物理核

因此，您应该看到使用memset的多个线程带来的显着优势，即使在单个套接字系统上也可以进行减少。我已经编写了自己的工具来测量带宽。您可以从我的i5-4250U（Haswell）找到一些结果，其中2核（GCC 4.8，Linux 3.13，EGLIBC 2.19）超过1 GB。 vsum是你的减少。请注意，即使在这两个核心系统上也有显着的改进。

一个线程

 C standard library GB time(s) GB/s GFLOPS efficiency memset: 0.50 0.80 6.68 0.00 inf % memcpy: 1.00 1.35 7.93 0.00 inf % Agner Fog's asmlib GB time(s) GB/s GFLOPS efficiency memset: 0.50 0.71 7.53 0.00 inf % memcpy: 1.00 0.93 11.51 0.00 inf % my_memset 0.50 0.71 7.53 0.00 inf % FMA3 reduction tests GB time(s) GB/s GFLOPS efficiency vsum: 0.50 0.53 10.08 2.52 inf % vmul: 0.50 0.68 7.93 1.98 inf % vtriad: 0.50 0.70 7.71 3.85 inf % dot 1.00 1.08 9.93 2.48 inf %

两个线程

 C standard library GB time(s) GB/s GFLOPS efficiency memset: 0.50 0.64 8.33 0.00 inf % memcpy: 1.00 1.10 9.76 0.00 inf % Agner Fog's asmlib GB time(s) GB/s GFLOPS efficiency memset: 0.50 0.36 14.98 0.00 inf % memcpy: 1.00 0.66 16.30 0.00 inf % my_memset 0.50 0.36 15.03 0.00 inf % FMA3 tests standard sum tests with OpenMP: 2 threads GB time(s) GB/s GFLOPS efficiency vsum: 0.50 0.41 13.03 3.26 inf % vmul: 0.50 0.39 13.67 3.42 inf % vtriad: 0.50 0.44 12.20 6.10 inf % dot 1.00 0.97 11.11 2.78 inf %

这是我的自定义memset函数（我还有其他几个测试）。

 void my_memset(int *s, int c, size_t n) { int i; __m128i v = _mm_set1_epi32(c); #pragma omp parallel for for(i=0; i


 编辑： 
 您应该使用-O3和-ffast-math编译。 定义外环外部的总和，然后将其打印出来，这样GCC就不会对其进行优化。  GCC不会自动向量化减少因为浮点运算不是关联的，并且对循环进行向量化可能会破坏IEEE浮点规则。 使用-ffast-math允许浮点arithemetic是关联的，这允许GCC对减少进行矢量化。 应该指出的是，已经在OpenMP中进行了减少，假设浮点运算是关联的，因此它已经破坏了IEEE浮点规则。 
 double sum = 0; tic(); for(int c = 0; c < COUNT; ++ c) { #pragma omp parallel for reduction(+:sum) for(int i = 0; i < sz_i; ++ i) sum += ptr[i]; } toc(); printf("sum %f\n", sum); 
 编辑： 
 我测试了你的代码并进行了一些修改。 使用多个线程，使用简化和memset可以获得更快的时间 
 max threads: 4 serial reduction dtime 1.86, sum 705032704 parallel reduction dtime 1.39 s, sum 705032704 serial memset dtime 2.95 s parallel memset dtime 2.44 s serial my_memset dtime 2.66 s parallel my_memset dtime 1.35 s 
 这是我使用的代码（g ++ foo.cpp -fopenmp -O3 -ffast-math） 
 #include  #include  #include  #include  #include  #include  void my_memset(int *s, int c, size_t n) { int i; __m128i v = _mm_set1_epi32(c); for(i=0; i vec(sz, 1); std::cout << "max threads: " << omp_get_max_threads()<< std::endl; std::cout << "serial reduction" << std::endl; double dtime; int sum; dtime = -omp_get_wtime(); sum = 0; for(int c = 0; c < COUNT; ++ c) { for(size_t i = 0; i < sz; ++ i) sum += vec[i]; } dtime += omp_get_wtime(); printf("dtime %.2f, sum %d\n", dtime, sum); int *const ptr = vec.data(); const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int std::cout << "parallel reduction" << std::endl; dtime = -omp_get_wtime(); sum = 0; for(int c = 0; c < COUNT; ++ c) { #pragma omp parallel for default(none) reduction(+:sum) for(int i = 0; i < sz_i; ++ i) sum += ptr[i]; } dtime += omp_get_wtime(); printf("dtime %.2f s, sum %d\n", dtime, sum); std::cout << "serial memset" << std::endl; dtime = -omp_get_wtime(); for(int c = 0; c < COUNT; ++ c) { for(size_t i = 0; i < sz; ++ i) vec[i] = 0; } dtime += omp_get_wtime(); printf("dtime %.2f s\n", dtime); std::cout << "parallel memset" << std::endl; dtime = -omp_get_wtime(); for(int c = 0; c < COUNT; ++ c) { #pragma omp parallel for default(none) for(int i = 0; i < sz_i; ++ i) ptr[i] = 0; } dtime += omp_get_wtime(); printf("dtime %.2f s\n", dtime); std::cout << "serial my_memset" << std::endl; dtime = -omp_get_wtime(); for(int c = 0; c < COUNT; ++ c) my_memset(ptr, 0, sz_i); dtime += omp_get_wtime(); printf("dtime %.2f s\n", dtime); std::cout << "parallel my_memset" << std::endl; dtime = -omp_get_wtime(); for(int c = 0; c < COUNT; ++ c) my_memset_omp(ptr, 0, sz_i); dtime += omp_get_wtime(); printf("dtime %.2f s\n", dtime); return 0; }



		      	 您正在使用std :: clock来报告使用的CPU时间，而不是实时。 因此，每个处理器的时间加起来并且总是高于单线程时间（由于开销）。 
  http://en.cppreference.com/w/cpp/chrono/c/clock



  使用gcc中的内联汇编从stdin扫描并打印到stdout
  为什么C平台依赖？
	在C中，+ = b比a = a + b更有效吗？
在没有ATL的情况下实现COM IDispatch
ARM程序集访问C全局变量
在C中读取和打印.txt文件行的最清晰方法
在信号处理的上下文中，究竟哪些变量需要是sig_atomic_t？
为什么无法访问的代码不是C ++中的错误？
C / C ++ / Assembly以编程方式检测Windows，Mac和Linux上的超线程是否处于活动状态
线程具有相同的ID
如何从系统函数调用中删除后台进程

OpenMP令人尴尬地并行循环，没有加速

当我尝试运行测试时，为什么会出现“Segmentation Fault”错误？

IDG PNG文件格式块

获得一个双字母列表的信件

Openmpi mpmd获得通信大小

scanf（）跳过变量

裸机环境下的stdlib式库？（内存管理和希望pthread支持）

如何使用转义键编程窗口关闭

将hex地址传递给指针变量

输入缓冲区刷新

glibc的fprintf（）线程安全实现吗？

OpenMP令人尴尬地并行循环，没有加速

当我尝试运行测试时，为什么会出现“Segmentation Fault”错误？

IDG PNG文件格式块

获得一个双字母列表的信件

Openmpi mpmd获得通信大小

scanf（）跳过变量

裸机环境下的stdlib式库？ （内存管理和希望pthread支持）

如何使用转义键编程窗口关闭

将hex地址传递给指针变量

输入缓冲区刷新

glibc的fprintf（）线程安全实现吗？

裸机环境下的stdlib式库？（内存管理和希望pthread支持）