高性能计算程序设计（5）秋季2024

提交格式说明

按照实验报告模板填写报告，需要提供源代码及代码描述至https://easyhpc.net/course/212。实验报告模板使用PDF格式，命名方式为高性能计算程序设计_学号_姓名。如果有问题，请发邮件至<zhudp3@mail2.sysu.edu.cn、liux276@mail2.sysu.edu.cn>询问细节。

任务1：

通过CUDA实现通用矩阵乘法（Lab1）的并行版本，CUDA Thread Block size从32增加至512，矩阵规模从512增加至8192。

通用矩阵乘法（GEMM）通常定义为：

\[C = AB\] \[C_{m,n} = \sum_{n = 1}^{N}{A_{m,n}B_{n,k}}\]

输入：M , N, K三个整数（512 ~8192）

问题描述：随机生成M*N和N*K的两个矩阵A,B,对这两个矩阵做乘法得到矩阵C。

输出：A,B,C三个矩阵以及矩阵计算的时间

代码解释:

? 《grid, block》将矩阵分割到每一个cuda thread，然后每一个thread只有其坐标在(0,0)到(m,k)（也就是目标矩阵的大小内）才有效，这样的thread遍历a的行和b的列，对应相乘加入到本地temp,然后赋值到目标矩阵，由于一一对应，所以不需要对目标矩阵加锁。

/**
 * @brief  implemenntation of the GEMM cuda in global memory
 * 
 * @param a  the first matrix in m*n in device memory
 * @param b  the second matrix in n*k in device memory
 * @param c  the result matrix in m*k in device memory
 * @param m  matrix size
 * @param n  matrix size
 * @param k  matrix size 
 * @return the result in ptr c
 */
__global__ void gpu_matrix_mult_gm(float *a, float *b, float *c, int m, int n, int k)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y
    int col = blockIdx.x * blockDim.x + threadIdx.x
    float temp = 0
    if (row < m && col < k) // Ensure bounds are within the matrix dimensions
    {
        for (int i = 0 i < n i++)
        {
            temp += a[row * n + i] * b[i * k + col]
        }
        c[row * k + col] = temp
    }
}

进一步优化，可以考虑分块矩阵，以及共享内存，分块矩阵乘法在缓存命中率上有提高，cuda的cache模型是一行128字节,也就是32个float，将矩阵划分为32*32的小块，可以增加缓存读写的使用率，同时共享内存的读写也比全局内存快很多。

/**
 * @brief  implemenntation of the GEMM cuda in shared memory
 *
 * @param a  the first matrix in m*n in device memory
 * @param b  the second matrix in n*k in device memory
 * @param c  the result matrix in m*k in device memory
 * @param m  matrix size
 * @param n  matrix size
 * @param k  matrix size
 * @return the result in ptr c
 */
__global__ void gpu_matrix_mult_sm(float *a, float *b, float *c, int m, int n, int k)
{
    __shared__ float tile_a[BLOCK_SIZE][BLOCK_SIZE]
    __shared__ float tile_b[BLOCK_SIZE][BLOCK_SIZE]
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x
    float temp = 0

    int block_num = (n + BLOCK_SIZE - 1) / BLOCK_SIZE // Divide the n dimension by block size
    for (int i = 0 i < block_num ++i)
    {
        // Copy the ith block into shared memory
        if (row < m && i * BLOCK_SIZE + threadIdx.x < n)
        {
            int a_index = row * n + i * BLOCK_SIZE + threadIdx.x
            tile_a[threadIdx.y][threadIdx.x] = a[a_index]
        }
        else
            tile_a[threadIdx.y][threadIdx.x] = 0 // Handle edge case for smaller matrix

        if (col < k && i * BLOCK_SIZE + threadIdx.y < n)
        {
            int b_index = (i * BLOCK_SIZE + threadIdx.y) * k + col
            tile_b[threadIdx.x][threadIdx.y] = b[b_index]
        }
        else
            tile_b[threadIdx.y][threadIdx.x] = 0

        __syncthreads() // Synchronize threads in the block before computation

        // Compute the contribution for c[row][col]
        for (int j = 0 j < BLOCK_SIZE j++)
        {
            temp += tile_a[threadIdx.y][j] * tile_b[j][threadIdx.x]
        }
        __syncthreads() // Ensure all threads finish their computation before moving on
    }
    if (row < m && col < k)
    {
        c[row * k + col] = temp
    }
}

奇怪的问题：

        // Copy the ith block into shared memory
        if (row < m && i * BLOCK_SIZE + threadIdx.x < n)
        {
            int a_index = row * n + i * BLOCK_SIZE + threadIdx.x
            tile_a[threadIdx.y][threadIdx.x] = a[a_index]
        }
        else
            tile_a[threadIdx.y][threadIdx.x] = 0 // Handle edge case for smaller matrix

        if (col < k && i * BLOCK_SIZE + threadIdx.y < n)
        {
            int b_index = (i * BLOCK_SIZE + threadIdx.y) * k + col
            tile_b[threadIdx.x][threadIdx.y] = b[b_index]
        }
        else
            tile_b[threadIdx.x][threadIdx.y] = 0

        __syncthreads() // Synchronize threads in the block before computation

        // Compute the contribution for c[row][col]
        for (int j = 0 j < BLOCK_SIZE j++)
        {
            temp += tile_a[threadIdx.y][j] * tile_b[threadIdx.x][j]
        }

? 使用加速方法，将b矩阵转置后，本来觉得可以加快缓存读取，但是发现程序速度反而变慢了

Testting the diy_cuda_global_mem implementation of matrix multiply
510.3349 520.4258 ... 514.6938 517.7357
508.9028 ...... ... ...... 520.2448
...... ...... ... ...... ......
507.9561 ...... ... ...... 513.7372
521.5385 526.4379 ... 520.4982 516.9260
Custom Kernel Time: 3.424504 ms Performance: 5016.746651 GFLOPS

Testting the diy_cuda_shared_mem implementation of matrix multiply
510.3349 520.4258 ... 514.6938 517.7357
508.9028 ...... ... ...... 520.2448
...... ...... ... ...... ......
507.9561 ...... ... ...... 513.7372
521.5385 526.4379 ... 520.4982 516.9260
Custom Kernel Time: 2.797537 ms Performance: 6141.069320 GFLOPS

Testting the cublas implementation of matrix multiply
506.5235 519.2048 ... 503.0756 527.3256
513.7955 ...... ... ...... 526.7343
...... ...... ... ...... ......
502.1938 ...... ... ...... 514.4681
490.7425 508.1088 ... 486.1488 503.2191
cuBLAS Time: 2.810711 ms        Performance: 6112.286092 GFLOPS
(base) hongjie@kemove-ESC8000-G4:~/ml/cuda/MM$ make
nvcc compare.cu -lcublas -o main
(base) hongjie@kemove-ESC8000-G4:~/ml/cuda/MM$ ./main 2048 2048 2048 32

Testting the diy_cuda_global_mem implementation of matrix multiply
510.3349 520.4258 ... 514.6938 517.7357
508.9028 ...... ... ...... 520.2448
...... ...... ... ...... ......
507.9561 ...... ... ...... 513.7372
521.5385 526.4379 ... 520.4982 516.9260
Custom Kernel Time: 3.419901 ms Performance: 5023.499135 GFLOPS

Testting the diy_cuda_shared_mem implementation of matrix multiply
510.3349 520.4258 ... 514.6938 517.7357
508.9028 ...... ... ...... 520.2448
...... ...... ... ...... ......
507.9561 ...... ... ...... 513.7372
521.5385 526.4379 ... 520.4982 516.9260
Custom Kernel Time: 8.448617 ms Performance: 2033.453430 GFLOPS

Testting the cublas implementation of matrix multiply
506.5235 519.2048 ... 503.0756 527.3256
513.7955 ...... ... ...... 526.7343
...... ...... ... ...... ......
502.1938 ...... ... ...... 514.4681
490.7425 508.1088 ... 486.1488 503.2191
cuBLAS Time: 2.828008 ms        Performance: 6074.901333 GFLOPS

? 后来想想,确实更改后会更慢,因为本来的运行模型是32个线程作为一个warp来进行SIMT,然后如果是之前的代码

temp += tile_a[threadIdx.y][j] * tile_b[j][threadIdx.x]

? 虽然在循环的时候涉及了列优先的访问,但是对于整个warp来说是行优先的访问(根据Idx.x)

时间测试：最后一起给出

任务2：

通过NVDIA的矩阵计算函数库CUBLAS计算矩阵相乘，矩阵规模从512增加至8192，并与任务1和任务2的矩阵乘法进行性能比较和分析，如果性能不如CUBLAS，思考并文字描述可能的改进方法（参考《计算机体系结构-量化研究方法》第四章）。

CUBLAS参考资料《CUBLAS_Library.pdf》，CUBLAS矩阵乘法参考第70页内容。

CUBLAS矩阵乘法例子，参考附件《matrixMulCUBLAS》

? 以下是比较矩阵乘法的表格，元素左边是计算时间ms，右边是浮点性能GFLOPS

代码解释：

//   m*k k*n = m*n
	cublasH     cublasHandle_t handle
    cublasCreate(&handle)
    // Configure cuBLAS operations
    cublasOperation_t opA = CUBLAS_OP_N // No transpose for A
    cublasOperation_t opB = CUBLAS_OP_N // No transpose for B
	cublasSgemm(handle, opA, opB, n, m, k, &alpha, d_B, n, d_A, k, &beta, d_C, n) // 我日，cublas 居然是列优先的

? 因为cublas使用列优先，所以输入为行优先分配的矩阵相当于先转置，BT*AT=CT， CT会列优先存到内存，再行优先就是C（这也太搞了）

测试运行（连同任务1）

? 包含三个算法，全局内存，共享内存，cublas, 测试了 blocksize从8~32，矩阵维度从512到8192

run:
	@for n in 8 16 24 32 do \
		for size in 512 1024 2048 4096 8192 do \
			echo "Running program with size $$size and $$n BLOCK" | tee -a output.txt \
			./$(PROGRAM) $$size $$size $$size $$n | tee -a output.txt \
		done \
		echo "End of program with $$n BLOCK SIZE" | tee -a output.txt \
		echo "" | tee -a output.txt \
	done \
	echo "End of program" | tee -a output.txt \
	echo "" | tee -a output.txt

使用的4090显卡数据如下：

Device 0: NVIDIA GeForce RTX 4090
  Clock Rate (GHz): 2.52 GHz
  CUDA Cores: 16384
  Theoretical Peak Performance (FP32): 82.5754 TFLOPS
Shared memory per block:  49152  bytes

diy_cuda_global_mem

矩阵维度	Blocksize:8*8	Blocksize:16*16	Blocksize:24*24	Blocksize:32*32
512	0.1123 / 2389.79	0.0872 / 3078.11	0.0912 / 2944.79	0.0835 / 3216.22
1024	0.5735 / 3744.41	0.4539 / 4731.19	0.4659 / 4609.26	0.4403 / 4877.43
2048	4.2440 / 4048.06	3.4547 / 4972.90	3.4974 / 4912.13	3.4195 / 5024.12
4096	39.1658 / 3509.16	27.0892 / 5073.57	27.4131 / 5013.62	27.1107 / 5069.54
8192	1079.691 / 1018.36	297.402 / 3697.06	247.026 / 4450.99	243.413 / 4517.06

diy_cuda_shared_mem

矩阵维度	Blocksize:8*8	Blocksize:16*16	Blocksize:24*24	Blocksize:32*32
512	0.0527 / 5091.72	0.0466 / 5764.87	0.0526 / 5104.11	0.0522 / 5142.15
1024	0.3210 / 6690.96	0.2939 / 7306.50	0.3231 / 6646.07	0.3384 / 6346.85
2048	2.4581 / 6989.17	2.3404 / 7340.62	2.4892 / 6901.76	2.7261 / 6302.00
4096	19.5173 / 7041.92	18.5687 / 7401.64	19.5364 / 7035.03	23.4161 / 5869.42
8192	149.465 / 7356.34	141.881 / 7749.54	158.358 / 6943.19	188.046 / 5847.05

cublas

矩阵维度	Blocksize:8*8	Blocksize:16*16	Blocksize:24*24	Blocksize:32*32
512	0.1119 / 2399.42	0.1038 / 2586.78	0.1022 / 2626.80	0.1051 / 2554.46
1024	0.4111 / 5224.19	0.3969 / 5411.12	0.3960 / 5422.38	0.3964 / 5418.05
2048	2.8099 / 6114.08	2.8116 / 6110.45	2.8080 / 6118.18	2.8110 / 6111.61
4096	22.0671 / 6228.24	22.0636 / 6229.22	22.0639 / 6229.13	22.0593 / 6230.42
8192	169.151 / 6500.17	169.021 / 6505.19	169.004 / 6505.82	169.397 / 6490.72

结果分析：

? 1. cublas库的运行不受blocksize的影响；

? 2. 对于普通的全局内存的函数，增大blocksize有助于提高运行速度；

? 3. 对于共享内存的版本，由于每个SM上的共享内存块是48KB,当blocksize增多到一定限度的时候（16*16），如果在增加线程数量，一个SM上的warps数量会继续增加，导致寄存器调度（切换warp）需要更多时间，同时在任务量不变得情况下，需要的SM数量也会下降，导致并行效率下降。

? 4. 对于规整的矩阵乘法，在部分任务上，共享内内存版本能超越cublas，但是也能看到cublas对于不同问题规模的运算的效率都保持在高水平。

? 5. 在cublas和共享内存的对种，只有32*32block_size的4096和8192乘法，共享内存函数是不如cublas的，在计算强度不变的情况下，根据屋顶线模型，考虑是内存访存的限制，在访问共享内存的时候出现了stall（How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog在kernel 3介绍了原因），所以可以考虑增加计算强度，在访存不变的情况下，同一个thread可以做更多的计算任务，之前是一个thread负责一个对应元素，现在可以负责一个小区域的元素（将结果矩阵分块）。

任务3：

在信号处理、图像处理和其他工程/科学领域，卷积是一种使用广泛的技术。在深度学习领域，卷积神经网络(CNN)这种模型架构就得名于这种技术。在本实验中，我们将在GPU上实现卷积操作，注意这里的卷积是指神经网络中的卷积操作，与信号处理领域中的卷积操作不同，它不需要对Filter进行翻转，不考虑bias。

任务一通过CUDA实现直接卷积（滑窗法），输入从256增加至4096或者输入从32增加至512.

输入：Input和Kernel(3x3)

问题描述：

用直接卷积的方式对Input进行卷积，这里只需要实现2D, height*width，通道channel(depth)设置为3，Kernel (Filter)大小设置为3*3，步幅(stride)分别设置为1，2，3，可能需要通过填充(padding)配合步幅(stride)完成CNN操作。注：实验的卷积操作不需要考虑bias(b)，bias设置为0.

输出：输出卷积结果以及计算时间

代码解释：

? 以常用的张量描述（N, C, H,W）,其中N是batchsize，也就是张量数目， C是张量通道数，这里是RGB也就是3通道， H是高度也就是矩阵的行数, W是宽度也就是张量的列数，

? 首先需要对输入矩阵进行padding. paddedBlock是矩阵pdding之后的大小，（i+padding）遍历是为了0到padding-1行都是0，然后（j+padding）是为了让左边的列padding为0.

// m n, depth分别是矩阵的行，列，通道数
float *padMatrix(const float *matrix, float *padded_matrix, int m, int n, int depth, int padding)
{
    int rows = m
    int cols = n
    int paddedCols = n + 2 * padding
    int paddedRows = m + 2 * padding
    int paddedBlcok = paddedRows * paddedCols
    int originblock = rows * cols
    for (int k = 0 k < depth k++)
    {
        for (int i = 0 i < rows ++i)
        {
            for (int j = 0 j < cols ++j)
            {
                padded_matrix[k * paddedBlcok + (i + padding) * paddedCols + j + padding] = matrix[k * originblock + i * n + j]
            }
        }
    }
    return padded_matrix
}   
   memset(temp, 0, sizeof(float) * paddedRows * paddedCols * input.matrix_nums)  //需要把原来的内存设置为0，否则padding的时候可能会出错
    padMatrix(h_input, temp, m, n, input.matrix_nums, padding)

然后是滑窗卷积函数：

[人工智能-深度学习-27]：卷积神经网络CNN - 核心概念（卷积、滑动、填充、参数共享、通道）_cnn 本地连接,参数共享-CSDN博客

(row % stride == 0 && col % stride == 0是为了步长滑动，相当于滑动卷积窗口
row < paddedm - kernelSet.rows + 1 && col < paddedn - kernelSet.cols + 1是为了卷积窗口在边界的时候不会越界（避免滑出矩阵边界）
然后就是三次遍历，分别是通道，行，列，将卷积核和对应的局部矩阵作向量点积，最后化为一个数字，temp
因为是每一个thread负责一个对应输出的元素，所以不存在共享变量读取竞争的问题

//输入m,n是原本的行列数
__global__ void conv2d_cal(float *input, float *output, int output_block, int m, int n, const Kernel kernelSet, int stride, int padding)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y
    int col = blockIdx.x * blockDim.x + threadIdx.x
    int paddedm = m + 2 * padding
    int paddedn = n + 2 * padding

    int conved_n = (n - kernelSet.cols + 2 * padding) / stride + 1  //计算卷积后的列数
    
    // extern __shared__ float temp[] // 使用动态共享内存，适配 kernelSet.numKernels 注意会出现竞争！！！
    float temp

    if (row % stride == 0 && col % stride == 0 && row < paddedm - kernelSet.rows + 1 && col < paddedn - kernelSet.cols + 1)
    {
        for (int k = 0 k < kernelSet.numKernels k++)
        {
            for (int i = 0 i < kernelSet.rows i++)
            {
                for (int j = 0 j < kernelSet.cols j++)
                {
                    int input_row = row + i
                    int input_col = col + j
                    int kernel_idx = k * kernelSet.rows * kernelSet.cols + i * kernelSet.cols + j
                    temp += input[input_row * paddedn + input_col] * kernelSet.deviceKernels[kernel_idx]
                }
            }
            // 写入输出           
        }
        output[row * conved_n + col] = temp
    }
}

测试运行：

正确性

对于卷积任务（1，3，5，5）（1，3，3，3）stride =x padding =1验证

Kernel 0 values:  	  
2 3 
5 6 
8 9 
Kernel 1 values:  	  
2 3 
5 6 
8 9
Kernel 2 values:  	  
2 3 
5 6 
8 9
atfer padding:
000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
000000 1.000000 2.000000 3.000000 4.000000 5.000000 0.000000 
000000 6.000000 7.000000 8.000000 9.000000 0.000000 0.000000 
000000 1.000000 2.000000 3.000000 4.000000 5.000000 0.000000 
000000 6.000000 7.000000 8.000000 9.000000 0.000000 0.000000 
000000 1.000000 2.000000 3.000000 4.000000 5.000000 0.000000 
000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 

验证stride =1的正确性，打印输出矩阵(0,2)元素的一个通道的计算过程，可以看到shape正确，计算过程也正确(3个通道要乘3)

row: 0, col: 2, i: 0, j:0, input:0.000000, kernel:1.000000, temp: 0.000000
row: 0, col: 2, i: 0, j:1, input:0.000000, kernel:2.000000, temp: 0.000000
row: 0, col: 2, i: 0, j:2, input:0.000000, kernel:3.000000, temp: 0.000000
row: 0, col: 2, i: 1, j:0, input:2.000000, kernel:4.000000, temp: 8.000000
row: 0, col: 2, i: 1, j:1, input:3.000000, kernel:5.000000, temp: 23.000000
row: 0, col: 2, i: 1, j:2, input:4.000000, kernel:6.000000, temp: 47.000000
row: 0, col: 2, i: 2, j:0, input:7.000000, kernel:7.000000, temp: 96.000000
row: 0, col: 2, i: 2, j:1, input:8.000000, kernel:8.000000, temp: 160.000000
row: 0, col: 2, i: 2, j:2, input:9.000000, kernel:9.000000, temp: 241.000000
Custom Kernel Time for sliding conv: 4.198948 ms
Printing matrix of printing 0th matrix
384.000000 606.000000 723.000000 570.000000 312.000000 
318.000000 513.000000 648.000000 603.000000 354.000000 
483.000000 738.000000 873.000000 648.000000 339.000000 
318.000000 513.000000 648.000000 603.000000 354.000000 
150.000000 228.000000 291.000000 264.000000 150.000000

? 验证stride =2的正确性,打印出输出矩阵的(0,1)元素一个通道的计算过程，可以看到，shape是正确的

row: 0, col: 2, i: 0, j:0, input:0.000000, kernel:1.000000, temp: 0.000000
row: 0, col: 2, i: 0, j:1, input:0.000000, kernel:2.000000, temp: 0.000000
row: 0, col: 2, i: 0, j:2, input:0.000000, kernel:3.000000, temp: 0.000000
row: 0, col: 2, i: 1, j:0, input:2.000000, kernel:4.000000, temp: 8.000000
row: 0, col: 2, i: 1, j:1, input:3.000000, kernel:5.000000, temp: 23.000000
row: 0, col: 2, i: 1, j:2, input:4.000000, kernel:6.000000, temp: 47.000000
row: 0, col: 2, i: 2, j:0, input:7.000000, kernel:7.000000, temp: 96.000000
row: 0, col: 2, i: 2, j:1, input:8.000000, kernel:8.000000, temp: 160.000000
row: 0, col: 2, i: 2, j:2, input:9.000000, kernel:9.000000, temp: 241.000000
Custom Kernel Time for sliding conv: 4.179901 ms
Printing matrix of printing 0th matrix
384.000000 723.000000 312.000000 
483.000000 873.000000 339.000000 
150.000000 291.000000 150.000000 

? 验证stride =3的正确性,打印出输出矩阵的(0,1)元素一个通道的计算过程，可以看到，shape是正确的结果也正确

row: 0, col: 3, i: 0, j:0, input:0.000000, kernel:1.000000, temp: 0.000000
row: 0, col: 3, i: 0, j:1, input:0.000000, kernel:2.000000, temp: 0.000000
row: 0, col: 3, i: 0, j:2, input:0.000000, kernel:3.000000, temp: 0.000000
row: 0, col: 3, i: 1, j:0, input:3.000000, kernel:4.000000, temp: 12.000000
row: 0, col: 3, i: 1, j:1, input:4.000000, kernel:5.000000, temp: 32.000000
row: 0, col: 3, i: 1, j:2, input:5.000000, kernel:6.000000, temp: 62.000000
row: 0, col: 3, i: 2, j:0, input:8.000000, kernel:7.000000, temp: 118.000000
row: 0, col: 3, i: 2, j:1, input:9.000000, kernel:8.000000, temp: 190.000000
row: 0, col: 3, i: 2, j:2, input:0.000000, kernel:9.000000, temp: 190.000000
Custom Kernel Time for sliding conv: 4.153343 ms
Printing matrix of printing 0th matrix
384.000000 570.000000 
318.000000 603.000000 

时间测试

计算复杂度：O(Hout?Wout?Cout?Cin?K2)

Matrix Size	Stride 1 Time (ms)	Stride 2 Time (ms)	Stride 3 Time (ms)
512	7.709874	7.900139	9.811203
1024	26.421349	34.015903	30.129812
2048	104.506943	108.413010	105.430489
4096	444.949066	417.461884	434.954102
8192	1763.79	1575.86	1716.98

任务4：

使用im2col方法结合任务1实现的GEMM（通用矩阵乘法）实现卷积操作。输入从256增加至4096或者输入从32增加至512，具体实现的过程可以参考下面的图片和参考资料。

输入：Input和Kernel (Filter)

问题描述：

用im2col的方式对Input进行卷积，这里只需要实现2D, height*width，通道channel(depth)设置为3，Kernel (Filter)大小设置为3*3。注：实验的卷积操作不需要考虑bias(b)，bias设置为0，步幅(stride)分别设置为1，2，3。

输出：卷积结果和时间。

代码解释：

? 首先还是先补充padding,同上省略

? 然后是最麻烦的，将张量展开为矩阵

先计算出展开后的形状，高度（行数）是卷积核的大小 filter_size，宽度是列数，是卷积后矩阵的大小converted_n

然后五层展开，第一二层移动卷积窗口，里面三层是复制原矩阵的内容到展开后的矩阵

    int padded_m = h_input.rows + 2 * padding  //padding
    int padded_n = h_input.cols + 2 * padding
    int convable_m = (h_input.rows + 2 * padding - kernelSet.rows) / stride + 1  //after convolution
    int convable_n = (h_input.cols + 2 * padding - kernelSet.cols) / stride + 1

    int converted_n = convable_m * convable_n  // converted to a flat matrix
    int converted_m = kernelSet.rows * kernelSet.cols

    int block = converted_m * converted_n
    int padded_block = padded_m * padded_n

    float *temp
    // printf("hello\n")
    cudaMallocHost((void **)&temp, sizeof(float) * padded_block * h_input.matrix_nums)  //create a temp to get the padded matrix in host mem

    padMatrix(h_input.h_input, temp, h_input.rows, h_input.cols, h_input.matrix_nums, padding)


    float *h_temp
    cudaMallocHost((void **)&h_temp, sizeof(float) * h_input.matrix_nums * block)  //for converted matrix
    for (int i = 0 i +kernelSet.rows-1 < padded_m i+=stride) // the next 2 iterations are for sliding the conv window
    {
        for (int j = 0  j+kernelSet.cols-1<padded_n j+=stride)
        {
            for (int a = 0 a < kernelSet.rows a++) // the naxt 2 iterations are for inside the conv window
            {
                for (int b = 0 b < kernelSet.cols b++)
                {
                    for (int k = 0 k < h_input.matrix_nums k++) // the k iteration for the input matrixs(could be more than 1)
                    {
                        int li = k * block + (a * kernelSet.cols + b) * converted_n + (i/stride) * convable_n + (j/stride)
                        int ri = k * padded_block + (i + a) * padded_m + j + b
                        h_temp[li] = temp[ri]
                    }
                }
            }
        }
    }

? 然后是卷积核的展开，然后直接调用之前实现的矩阵乘算法，就可以实现结果的计算.

? Img2col展开后的矩阵乘法是不规整的,对于普通实现的矩阵乘法压力很大,所以我在矩阵乘法阶段直接使用了cublas的矩阵乘法库函数

void gpu_gemm_cublas(const float *d_a, const float *d_b, float *d_result, int m, int n, int k)
{
    //   m*n x n*k =m *k
    cublasHandle_t handle
    cublasCreate(&handle)
    // Configure cuBLAS operations
    const float alpha = 1.0, beta = 0.0
    cublasOperation_t opA = CUBLAS_OP_N                                           // No transpose for A
    cublasOperation_t opB = CUBLAS_OP_N                                           // No transpose for B
    cublasSgemm(handle, opA, opB, k, m, n, &alpha, d_b, k, d_a, n, &beta, d_result, k) // 我日，cublas 居然是列优先的
    // cudaMemcpy(h_output.h_input, d_result, sizeof(float)*converted_n, cudaMemcpyDeviceToHost)
}

? 另外在实验中发现,程序的大量时间都用在了展开矩阵上,后续优化可以考虑将展开写为核函数.

问题调试

在实现这个Img2col算法的时候，出现了奇怪的问题，核函数只能算出32个数字，根据对核函数，也就是之前实现的矩阵乘法函数的分析：

__global__ void gpu_matrix_mult_gm(const float *a, float *b, float *c, int m, int n, int k)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y
    int col = blockIdx.x * blockDim.x + threadIdx.x
    float temp = 0
    if (row < m && col < k) // Ensure bounds are within the matrix dimensions
    {
        for (int i = 0 i < n i++)
        {
            temp += a[row * n + i] * b[i * k + col]
        }
        c[row * k + col] = temp
        printf("%d %d index:%d, reuslt: %d\n", row, col, row * k + col ,temp)
    }
}

以及切分矩阵的语句：

    dim3 blockDim(32,32) 
    dim3 gridDim(1, 1) //这里我以为1024的线程足够计算简单任务，没有设置

? 可以发现对于瘦高的矩阵乘法，之前对方块矩阵的切分方式可能导致有的元素没有参加运算，比如（1， 3 ， 1024 ，1024）的图像矩阵和(1, 3, 3, 3 )卷积核的卷积，等价(1, 27)的卷积核向量和（27， 1024*1024）的矩阵相乘，之前的矩阵分割算法，是对右边矩阵的列和左边矩阵的行分别进行分割，然后对应相乘计算出结果，这里行是1，列是1024x1024，如果使用32，32的block shape，就会有行方向31组线程浪费，列方向又远远不足以计算任务。

解决的办法：重新设置切割语句，适应不同矩阵乘法的需求

    int blockx = 256, blocky = 1  //和warp对应，还是需要blocksize是32的倍数， 但这个时候，设置为矮宽的形状
    dim3 blockDim(blockx, blocky)
    int gridx = (converted_n + blockx - 1) / blockx  //计算网格
    int gridy = 1  //由于左边向量的高度还是1，所以考虑块的排布也都是扁平的
    dim3 gridDim(gridx, gridy)

这个时候出现了另外一个问题，就是使用滑动窗口算法，和使用img2col算法（共享内存版本），他们都只能算矩阵维度最大是1022的，由于1022+2=1024是32的倍数，这个时候我考虑是blocksize大小的问题，我还需要重新检查这两个算法，确定他们的切分方式和blocksize的关系，同时共享内存之前的写法也是和blocksize关联的，但是内存是有限的，不能让blocksize太大，所以需要进一步检查。

? 好吧，原来是我错误的将block和grid写反了，而4090gpu的threadperblock的限制是1024，当维度一大，将grid和block搞反就以为着会超过1024的限制，导致报错。

<<<BLock, Grid>> kernel_function()//我在这里搞反了顺序,Grid应该在前面

测试运行：

正确性:

和上面的任务使用同样的输入确定正确性

stride :1
Custom Kernel Time for img2col conv: 450.193542 ms
Printing matrix of printing 0th matrix
384.000000 606.000000 723.000000 570.000000 312.000000 
318.000000 513.000000 648.000000 603.000000 354.000000 
483.000000 738.000000 873.000000 648.000000 339.000000 
318.000000 513.000000 648.000000 603.000000 354.000000 
150.000000 228.000000 291.000000 264.000000 150.000000
stride:2
Custom Kernel Time for img2col conv: 446.803680 ms
Printing matrix of printing 0th matrix
384.000000 723.000000 312.000000 
483.000000 873.000000 339.000000 
150.000000 291.000000 150.000000 
stride:3
Custom Kernel Time for img2col conv: 424.988342 ms
Printing matrix of printing 0th matrix
384.000000 570.000000 
318.000000 603.000000

时间测试

Img2Col Convolution 时间统计表

Matrix Size	Stride 1 Time (ms)	Stride 2 Time (ms)	Stride 3 Time (ms)
512	2971.511719	1231.348145	1313.612671
1024	975.632629	688.659485	920.661926
2048	6211.638184	2188.381104	2943.849365
4096	24702.511719	10237.832031	13833.249023
8192	N/A	N/A	N/A

任务5：

问题描述

NVIDIA cuDNN是用于深度神经网络的GPU加速库。它强调性能、易用性和低内存开销。

使用cuDNN提供的卷积方法进行卷积操作，记录其相应Input的卷积时间，与自己实现的卷积操作进行比较。如果性能不如cuDNN，用文字描述可能的改进方法。

CNN参考资料，见实验发布网站

斯坦福人工智能课件Convolutional Neural Networks，by Fei-Fei Li & Andrej Karpathy & Justin Johnson

其他参考资料（搜索以下关键词）

[1]如何理解卷积神经网络（CNN）中的卷积和池化

[2] Convolutional Neural Networks (CNNs / ConvNets) https://cs231n.github.io/convolutional-networks/

[3]im2col的原理和实现

[4]cuDNN安装教程

[5] convolutional-neural-networks

https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks

代码解释:

void conv_cudnn(float* h_input, float* h_output, float* kernelSet, int rows ,int cols, int channels, int kernel_row, int kernel_col, int stride, int padding)
{
    int N = 1 // Batch size
    int C = h_input.matrix_nums
    int H = h_input.rows
    int W = h_input.cols

    int R = kernelSet.rows
    int S = kernelSet.cols

    int conved_H = h_output.rows
    int conved_W = h_output.cols

    // Check if the kernel count matches the input channels
    if (kernelSet.numKernels != h_input.matrix_nums)
    {
        printf("Err: not paired kernels and input\n")
        return
    }

    // Check if the output dimensions match the expected sizes
    if ((H - R + 2 * padding) / stride + 1 != conved_H || (W - S + 2 * padding) / stride + 1 != conved_W)
    {
        printf("Err: not paired conv src and result space\n")
        return
    }

    cudnnHandle_t cudnn
    CHECK_CUDNN_ERR(cudnnCreate(&cudnn))

    // Allocate device memory for input, output, and kernels
    float *d_input, *d_output, *d_kernels
    cudaMalloc((void **)&d_input, C * H * W * sizeof(float))
    cudaMalloc((void **)&d_output, h_output.matrix_nums * conved_H * conved_W * sizeof(float))
    cudaMalloc((void **)&d_kernels, kernelSet.numKernels * C * R * S * sizeof(float))

    // Copy input and kernel data from host to device
    cudaMemcpy(d_input, h_input.h_input, C * H * W * sizeof(float), cudaMemcpyHostToDevice)
    cudaMemcpy(d_kernels, kernelSet.deviceKernels, kernelSet.numKernels * C * R * S * sizeof(float), cudaMemcpyHostToDevice)

    cudnnTensorDescriptor_t input_desc, output_desc
    cudnnFilterDescriptor_t kernel_desc
    cudnnConvolutionDescriptor_t conv_desc

    // Create cuDNN descriptors
    CHECK_CUDNN_ERR(cudnnCreateTensorDescriptor(&input_desc))
    CHECK_CUDNN_ERR(cudnnCreateTensorDescriptor(&output_desc))
    CHECK_CUDNN_ERR(cudnnCreateFilterDescriptor(&kernel_desc))
    CHECK_CUDNN_ERR(cudnnCreateConvolutionDescriptor(&conv_desc))

    // Set input tensor descriptor (NCHW format)
    CHECK_CUDNN_ERR(cudnnSetTensor4dDescriptor(input_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, N, C, H, W))

    // Set output tensor descriptor (NCHW format)
    CHECK_CUDNN_ERR(cudnnSetTensor4dDescriptor(output_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, N, h_output.matrix_nums, conved_H, conved_W))

    // Set kernel descriptor (filter)
    CHECK_CUDNN_ERR(cudnnSetFilter4dDescriptor(kernel_desc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, kernelSet.numKernels, C, R, S))

    // Set convolution descriptor
    CHECK_CUDNN_ERR(cudnnSetConvolution2dDescriptor(conv_desc, padding, padding, stride, stride, 1, 1, CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT))

    // Get the output dimensions from cuDNN
    int n, c, h, w
    CHECK_CUDNN_ERR(cudnnGetConvolution2dForwardOutputDim(conv_desc, input_desc, kernel_desc, &n, &c, &h, &w))

    // Check the computed dimensions
    if (n != 1 || c != h_output.matrix_nums || h != conved_H || w != conved_W)
    {
        printf("Err: computed output dimensions don't match expected ones. Got (%d, %d, %d, %d)\n", n, c, h, w)
        return
    }

    // Set the output tensor descriptor with the computed dimensions
    CHECK_CUDNN_ERR(cudnnSetTensor4dDescriptor(output_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, N, c, h, w))

    // Perform the convolution
    float alpha = 1.0f, beta = 0.0f
    CHECK_CUDNN_ERR(cudnnConvolutionForward(cudnn, &alpha, input_desc, d_input, kernel_desc, d_kernels, conv_desc, CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM, NULL, 0, &beta, output_desc, d_output))

    // Copy the result back to host memory
    cudaMemcpy(h_output., d_output, h_output.matrix_nums * conved_H * conved_W * sizeof(float), cudaMemcpyDeviceToHost)

    // Free device memory and cuDNN descriptors
    cudaFree(d_input)
    cudaFree(d_output)
    cudaFree(d_kernels)
    cudnnDestroyTensorDescriptor(input_desc)
    cudnnDestroyTensorDescriptor(output_desc)
    cudnnDestroyFilterDescriptor(kernel_desc)
    cudnnDestroyConvolutionDescriptor(conv_desc)
    cudnnDestroy(cudnn)
}

测试运行:

时间测试:

Kernel Performance Statistics

Matrix Size	Stride 1 Time (ms)	Stride 2 Time (ms)	Stride 3 Time (ms)
512	826.092407	811.012817	739.268188
1024	917.171265	731.723877	780.229431
2048	872.473267	805.256348	774.350159
4096	1059.660156	972.454163	950.795410
8192	4997.216797	1728.338379	1832.116211

结果分析

? 使用滑窗卷积法在大部分任务上和cudnn的计算时间在同一个数量级,甚至计算时间更少

? 使用Img2col算法,由于在展开的时候使用的是cpu展开,对于大型矩阵是非常慢的,后续考虑将展开函数分配到核函数进行, 然后生成的矩阵使用分块矩阵乘法加速计算,比如这里是1x27和27x4096的矩阵计算,可以对4096列分块计算,每一个线程负责一个列的计算.

本作品采用知识共享署名-非商业性使用-相同方式共享 4.0 国际许可协议进行许可。
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

HPC lab5 cuda编程的并行计算

高性能计算程序设计（5）秋季2024

任务1：

代码解释:

时间测试：最后一起给出

任务2：

代码解释：

测试运行（连同任务1）

diy_cuda_global_mem

diy_cuda_shared_mem

cublas

结果分析：

任务3：

问题描述：

代码解释：

测试运行：

正确性

时间测试

任务4：

问题描述：

代码解释：

问题调试

测试运行：

正确性:

时间测试

任务5：

问题描述

代码解释:

测试运行:

时间测试:

结果分析

CATALOG

FEATURED TAGS

高性能计算程序设计（5） 秋季2024

任务1：

代码解释:

时间测试：最后一起给出

任务2：

代码解释：

测试运行（连同任务1）

diy_cuda_global_mem

diy_cuda_shared_mem

cublas

结果分析：

任务3：

问题描述：

代码解释：

测试运行：

正确性

时间测试

任务4：

问题描述：

代码解释：

问题调试

测试运行：

正确性:

时间测试

任务5：

问题描述

代码解释:

测试运行:

时间测试:

结果分析

CATALOG

FEATURED TAGS

高性能计算程序设计（5）秋季2024