All of the code in this article can be found under test-cuda.
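The diagram below shows a classic algorithmic pattern in CUDA programming: parallel reduction. In short, its goal is to "fold" all the elements of a large array into a single value (for example a sum, a maximum, or a minimum).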
block-beta
columns 25
space
a0 a1 a2 a3 a4 a5 a6 a7
b0 b1 b2 b3 b4 b5 b6 b7
c0 c1 c2 c3 c4 c5 c6 c7
r1("r1")
sum0("sum"):2 sum1("sum"):2 sum2("sum"):2 sum3("sum"):2
sum4("sum"):2 sum5("sum"):2 sum6("sum"):2 sum7("sum"):2
sum8("sum"):2 sum9("sum"):2 sum10("sum"):2 sum11("sum"):2
r2("r2")
sum12("sum"):4 sum13("sum"):4
sum14("sum"):4 sum15("sum"):4
sum16("sum"):4 sum17("sum"):4
r3("r3")
sum18("sum"):8
sum19("sum"):8
sum20("sum"):8
space:25
r4("r4")
s("cross"):24
sum18 --"cross-block communication"--> s
sum19 --"cross-block communication"--> s
sum20 --"cross-block communication"--> s
class a0,a1,a2,a3,a4,a5,a6,a7 green
class b0,b1,b2,b3,b4,b5,b6,b7 blue
class c0,c1,c2,c3,c4,c5,c6,c7 orange
class sum0,sum1,sum2,sum3,sum12,sum13 green
class sum4,sum5,sum6,sum7,sum14,sum15 blue
class sum8,sum9,sum10,sum11,sum16,sum17 orange
class sum18,sum19,sum20 gray
class r1,r2,r3,r4 purple
class s error
%% Style definitions
classDef content fill:#fff,stroke:#ccc;
classDef animate stroke:#666,stroke-dasharray: 8 4,stroke-dashoffset: 900,animation: dash 20s linear infinite;
classDef yellow fill:#FFEB3B,stroke:#333,color:#000,font-weight:bold;
classDef blue fill:#489,stroke:#333,color:#fff,font-weight:bold;
classDef pink fill:#FFCCCC,stroke:#333,color:#333,font-weight:bold;
classDef light_green fill:#e8f5e9,stroke:#695;
classDef green fill:#695,color:#fff,font-weight:bold;
classDef purple fill:#968,stroke:#333,color:#fff,font-weight:bold;
classDef gray fill:#ccc,stroke:#333,font-weight:bold;
classDef error fill:#bbf,stroke:#f65,stroke-width:2px,color:#fff,stroke-dasharray: 5 5;
classDef coral fill:#f8f,stroke:#333,stroke-width:4px;
classDef orange fill:#fff3e0,stroke:#ef6c00,color:#ef6c00,font-weight:bold;
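
The pattern in the diagram can be sketched as a small CUDA program. This is only a minimal illustration under stated assumptions, not necessarily the code found in test-cuda: the names `reduce_sum_kernel`, `reduce_sum`, and `kBlockSize` are hypothetical, the block size is fixed at 8 to match the diagram, the cross-block step (r4) is done with `atomicAdd` (one of several possible choices), and error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumption: 8 threads per block, matching the 8 elements per block in the diagram.
// The tree loop below requires this to be a power of two.
constexpr int kBlockSize = 8;

// Each block folds its kBlockSize elements into one partial sum (rounds r1..r3),
// then thread 0 adds that partial sum to *out (the cross-block step, r4).
__global__ void reduce_sum_kernel(const int* in, int* out, int n) {
    __shared__ int sdata[kBlockSize];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Load one element per thread; pad with 0 past the end of the array.
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // stride = 1, 2, 4: pairs, then pairs of pairs, and so on.
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();  // every round must finish before the next one starts
    }

    // Blocks cannot synchronize through shared memory, so combine via global memory.
    if (tid == 0) {
        atomicAdd(out, sdata[0]);
    }
}

// Hypothetical host wrapper: copies the data to the GPU, launches the kernel,
// and returns the total.
int reduce_sum(const std::vector<int>& host_in) {
    int n = static_cast<int>(host_in.size());
    if (n == 0) return 0;

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    int blocks = (n + kBlockSize - 1) / kBlockSize;
    reduce_sum_kernel<<<blocks, kBlockSize>>>(d_in, d_out, n);

    int result = 0;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main() {
    // 24 elements = 3 blocks of 8, as in the diagram; values 1..24.
    std::vector<int> data(24);
    for (int i = 0; i < 24; ++i) data[i] = i + 1;
    printf("sum = %d\n", reduce_sum(data));  // 1 + 2 + ... + 24 = 300
    return 0;
}
```

The `tid % (2 * stride)` test keeps the code close to the picture, but it leaves many threads in each warp idle; practical kernels usually restructure the indexing, which is beyond this sketch.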
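First level (top): the threads inside each block start by reading the data in pairs. For example, thread 0 reads the values at positions 0 and 1 and adds them together.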
Level-by-level folding: