如何优化这个Langton的antsim？

我正在写一个Langton的antsim（用于规则字符串RLR），我正在尝试优化它以提高速度。这是相关的代码：

#define AREA_X 65536 #define AREA_Y 65536 #define TURN_LEFT 3 #define TURN_RIGHT 1 int main() { uint_fast8_t* state; uint_fast64_t ant=((AREA_Y/2)*AREA_X) + (AREA_X/2); uint_fast8_t ant_orientation=0; uint_fast8_t two_pow_five=32; uint32_t two_pow_thirty_two=0;/*not fast, relying on exact width for overflow*/ uint_fast8_t change_orientation[4]={0, TURN_RIGHT, TURN_LEFT, TURN_RIGHT}; int_fast64_t* move_ant={AREA_X, 1, -AREA_X, -1}; ... initialise empty state while(1) { while(two_pow_five--)/*removing this by doing 32 steps per inner loop, ~16% longer*/ { while(--two_pow_thirty_two) { /*one iteration*/ /* 54 seconds for init + 2^32 steps ant_orientation = ( ant_orientation + (117>>((++state[ant])*2 )) )&3; state[ant] = (36 >> (state[ant] *2) ) & 3; ant+=move_ant[ant_orientation]; */ /* 47 seconds for init + 2^32 steps ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3; state[ant] += (state[ant]==2)?-2:1; ant+=move_ant[ant_orientation]; */ /* 46 seconds for init + 2^32 steps ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3; if(state[ant]==2) { --state[ant]; --state[ant]; } else ++state[ant]; ant+=move_ant[ant_orientation]; */ /* 44 seconds for init + 2^32 steps ant_orientation = ( ant_orientation + ((++state[ant])==2?3:1) )&3; if(state[ant]==3)state[ant]=0; ant+=move_ant[ant_orientation]; */ // 37 seconds for init + 2^32 steps // handle every situation with nested switches and constants switch(ant_orientation) { case 0: switch(state[ant]) { case 0: ant_orientation=1; state[ant]=1; ++ant; break; case 1: ant_orientation=3; state[ant]=2; --ant; break; case 2: ant_orientation=1; state[ant]=0; ++ant; break; } break; case 1: switch(state[ant]) { ... } break; case 2: switch(state[ant]) { ... } break; case 3: switch(state[ant]) { ... } break; } } } two_pow_five=32; ... dump to file every 2^37 steps } return 0; }

我有两个问题：

我试图通过试验和错误测试尽可能地优化c，有没有我没有利用的技巧？请尝试在c中说话而不是汇编，虽然我可能会在某个时候尝试组装。
是否有更好的方法来模拟问题以提高速度？

更多信息：便携性无关紧要。我使用的是64位Linux，使用gcc，i5-2500k和16 GB内存。目前的状态arrays使用4GiB，程序可以使用12GiB的ram。的sizeof（uint_fast8_t）= 1。故意不存在边界检查，很容易从转储中手动发现损坏。

编辑：也许反直觉地说，在switch语句上堆积而不是消除它们已经产生了迄今为止最好的效率。

编辑：我已经重新建模了这个问题，并且每次迭代得到的东西比单步更快。之前，每个状态元素使用两位并描述Langtonant网格中的单个单元格。新方法使用全部8位，并描述网格的2×2部分。通过查找当前状态+方向的步数，新方向和新状态的预先计算值，每次迭代完成可变数量的步骤。假设一切都是平等的，每次迭代平均达到2步。作为奖励，它使用1/4的内存来模拟相同的区域：

 while(--iteration) { // roughly 31 seconds per 2^32 steps table_offset=(state[ant]*24)+(ant_orientation*3); it+=twoxtwo_table[table_offset+0]; state[ant]=twoxtwo_table[table_offset+2]; ant+=move_ant2x2[(ant_orientation=twoxtwo_table[table_offset+1])]; }

尚未尝试优化它，接下来要尝试的是消除偏移方程并使用嵌套开关和常量进行查找（但是使用648个内部情况而不是12个）。

或者，您可以使用单个无符号字节常量作为人工寄存器而不是分支：

 value: 1 3 1 1 bits: 01 11 01 01 ---->101 decimal value for an unsigned byte index 3 2 1 0 ---> get first 2 bits to get "1" (no shift) --> get second 2 bits to get "1" (shifting for 2 times) --> get third 2 bits to get "3" (shifting for 4 times) --> get last 2 bits to get "1" (shifting for 6 times) Then "AND" the result with binary(11) or decimal(3) to get your value. (101>>( (++state[ant])*2 ) ) & 3 would give you the turnright or turnleft Example: ++state[ant]= 0: ( 101>>( (0)*2 ) )&3 --> 101 & 3 = 1 ++state[ant]= 1: ( 101>>( (1)*2 ) )&3 --> 101>>2 & 3 = 1 ++state[ant]= 2: ( 101>>( (2)*2 ) )&3 --> 101>>4 & 3 = 3 -->turn left ++state[ant]= 3: ( 101>>( (3)*2 ) )&3 --> 101>>6 & 3 = 1 Maximum six-shifting + one-multiplication + one-"and" may be better. Dont forget constant can be auto-promoted so you may add some suffixes or something else.

由于您对％4模数使用“unsigned int”，因此可以使用“和”运算。

  state[ant]=state[ant]&3; instead of state[ant]=state[ant]%4;

对于不熟练的编译器，这应该提高速度。

最难的部分：modulo-3

  C = A % B is equivalent to C = A – B * (A / B) We need state[ant]%3 Result = state[ant] - 3 * (state[ant]/3) state[ant]/3 is always <=1 for your valid direction states. Only when state[ant] is 3 then state[ant]/3 is 1, other values give 0. When multiplied by 3, that part is 0 or 3 (only 3 when state[ant] is 3 otherwise 0) Result = state[ant] - (0 or 3) Lets look at all possibilities: state[ant]=0: 0 - 0 ---> 0 ----> 00100100 shifted by 0 times &3 --> 00000000 state[ant]=1: 1 - 0 ---> 1 ----> 00100100 shifted by 2 times &3 --> 00000001 state[ant]=2: 2 - 0 ---> 2 ----> 00100100 shifted by 4 times &3 --> 00000010 state[ant]=3: 3 - 3 ---> 0 ----> 00100100 shifted by 6 times &3 --> 00000000 00100100 is 36 in decimal. (36 >> (state[ant] *2) ) & 3 will give you state[ant]%3 for your valid states (0,1,2,3) Example: state[ant]=0: 36 >> 0 --> 36 ----> 36& 3 ----> 0 satisfies 0%3 state[ant]=1: 36 >> 2 --> 9 -----> 9 & 3 ----> 1 satisfies 1%3 state[ant]=2: 36 >> 4 --> 2 -----> 2 & 3 ----> 2 satisfies 2%3 state[ant]=3: 36 >> 6 --> 0 -----> 0 & 3 ----> 0 satisfies 3%3

如何优化这个Langton的antsim？

通过函数指针使用的函数可以内联吗？

非阻塞连接OpenSSL

GCC：严格别名警告的准确性

对超过2种类型的查询使用延迟传播

如果新块大小小于初始块，我应该强制执行realloc检查吗？

数组和数组有什么区别？

C / C ++旋转BMP图像

数组随机排序

为什么C / C ++字符串文字声明必须是单行的？

由C中的SIGSEGV信号终止