Introduction to CUDA Programming: Implementing Some Common Operators
Matrix Multiplication
TILE and THREAD
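First, we use TILE and THREAD to partition the input tensor into blocks. There are two partitioning strategies: contiguous partitioning and interleaved partitioning.
Under contiguous partitioning, each thread owns a consecutive run of tiles, and the memory layout looks like this: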
block-beta
columns 10
block:header0:10
columns 2
space:2
t0("thread (0, 0)")
t1("thread (0, 1)")
end
block:t00:5
columns 4
b00t00("tile(0, 0)") b00t01("tile(0, 1)") b00t02("tile(0, 2)") b00t03("...")
b00t10("tile(1, 0)") b00t11("tile(1, 1)") b00t12("tile(1, 2)") b00t13("...")
end
block:t01:5
columns 4
b01t00("tile(0, 0)") b01t01("tile(0, 1)") b01t02("tile(0, 2)") b01t03("...")
b01t10("tile(1, 0)") b01t11("tile(1, 1)") b01t12("tile(1, 2)") b01t13("...")
end
block:header1:10
columns 2
space:2
t2("thread (1, 0)")
t3("thread (1, 1)")
end
block:t02:5
columns 4
b02t00("tile(0, 0)") b02t01("tile(0, 1)") b02t02("tile(0, 2)") b02t03("...")
b02t10("tile(1, 0)") b02t11("tile(1, 1)") b02t12("tile(1, 2)") b02t13("...")
end
block:t03:5
columns 4
b03t00("tile(0, 0)") b03t01("tile(0, 1)") b03t02("tile(0, 2)") b03t03("...")
b03t10("tile(1, 0)") b03t11("tile(1, 1)") b03t12("tile(1, 2)") b03t13("...")
end
class t00,t01,t02,t03 animate
class t0,t1,t2,t3 purple
class header0,header1 transparent
classDef transparent fill:none,stroke:none,color:inherit;
classDef content fill:#fff,stroke:#ccc;
classDef animate stroke:#666,stroke-dasharray: 8 4,stroke-dashoffset: 900,animation: dash 20s linear infinite;
classDef yellow fill:#FFEB3B,stroke:#333,color:#000,font-weight:bold;
classDef blue fill:#489,stroke:#333,color:#fff,font-weight:bold;
classDef pink fill:#FFCCCC,stroke:#333,color:#333,font-weight:bold;
classDef light_green fill:#e8f5e9,stroke:#695;
classDef green fill:#695,color:#fff,font-weight:bold;
classDef purple fill:#968,stroke:#333,color:#fff,font-weight:bold;
classDef gray fill:#ccc,stroke:#333,font-weight:bold;
classDef error fill:#bbf,stroke:#f65,stroke-width:2px,color:#fff,stroke-dasharray: 5 5;
classDef coral fill:#f8f,stroke:#333,stroke-width:4px;
classDef orange fill:#fff3e0,stroke:#ef6c00,color:#ef6c00,font-weight:bold;
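To make the two strategies concrete, here is a minimal one-dimensional sketch of how a thread's local tile number could map to a global tile position. It is written as plain host-side C++ so it can be run without a GPU; in a real kernel the `thread` argument would typically come from `threadIdx`. The helper names `contiguous_index` and `interleaved_index` are hypothetical, and the interleaved formula assumes the usual meaning of the term (threads take turns, one tile at a time), which this section has not yet spelled out.

```cpp
#include <cassert>

// Hypothetical helpers for illustration; the names are not from the article.

// Contiguous partitioning: thread t owns the consecutive tiles
// [t * tiles_per_thread, (t + 1) * tiles_per_thread).
int contiguous_index(int thread, int tile, int tiles_per_thread) {
    assert(thread >= 0 && tile >= 0 && tile < tiles_per_thread);
    return thread * tiles_per_thread + tile;
}

// Interleaved partitioning (assumed definition): thread t owns the tiles
// t, t + num_threads, t + 2 * num_threads, ...
int interleaved_index(int thread, int tile, int num_threads) {
    assert(tile >= 0 && thread >= 0 && thread < num_threads);
    return tile * num_threads + thread;
}
```

With 2 threads and 3 tiles per thread, thread 1's tile 0 lands at global tile 3 under the contiguous scheme but at global tile 1 under the interleaved scheme. For a 2D tensor the same mapping would be applied independently to the row and column coordinates.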