中山大学计算机院本科生实验报告
(2024学年秋季学期)
课程名称:高性能计算程序设计 批改人:
实验 | 高性能计算程序设计基础(0) | 专业(方向) | 信息与计算科学 |
---|---|---|---|
学号 | 22336173 | 姓名 | 罗弘杰 |
3133974071@qq.com | 完成日期 | 9/12 |
实验目的
- 比较三种语言(C,python, java)编写矩阵乘法运算程序的计算时间;
- 通过编译器优化来减少运行时间;
- 通过调整循环操作顺序来减少运行时间。
实验过程和核心代码
编写代码
c语言
? 源代码c_m.c,输入M,N,K作为两个矩阵形状MxN, NxK,然后按i,j,k的顺序运行矩阵乘法。在这里,a矩阵行优先,b矩阵列优先.注意使用双精度浮点数。
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
//定义随机矩阵
for (int i = 0; i < M_SIZE * N_SIZE; i++) {
a[i] = (double)rand() / RAND_MAX * 10000.0;
}
for (int i = 0; i < N_SIZE * K_SIZE; i++) {
b[i] = (double)rand() / RAND_MAX * 10000.0;
}
for (int i = 0; i < M_SIZE * K_SIZE; i++) {
c[i] = 0.0;
}
start = clock(); //计时器
//定义i,j,k顺序的矩阵乘法
for (int i = 0; i < M_SIZE; i++) {
for (int j = 0; j < K_SIZE; j++) {
for (int k = 0; k < N_SIZE; k++) {
c[i * K_SIZE + j] += a[i * N_SIZE + k] * b[k * K_SIZE + j];
}
}
}
end = clock();
python
? 同理
0
1
2
3
4
5
6
7
8
9
10
11
12
13
a = np.random.uniform(0, 10000, (m, n))
b = np.random.uniform(0, 10000, (n, k))
c = np.zeros((m, k), dtype=np.float64)
start = time.time()
for i in range(m):
for j in range(k):
for l in range(n):
c[i, j] += a[i, l] * b[l, j]
end = time.time()
time_use = end - start
java
? 同理
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
import java.util.Random;
public class j_m {
public static void j_m(int mSize, int nSize, int kSize) {
double[][] a = new double[mSize][nSize];
double[][] b = new double[nSize][kSize];
double[][] c = new double[mSize][kSize];
Random random = new Random();
for (int i = 0; i < mSize; i++) {
for (int j = 0; j < nSize; j++) {
a[i][j] = random.nextDouble() * 10000;
}
}
for (int i = 0; i < nSize; i++) {
for (int j = 0; j < kSize; j++) {
b[i][j] = random.nextDouble() * 10000;
}
}
long startTime = System.currentTimeMillis();
for (int i = 0; i < mSize; i++) {
for (int j = 0; j < kSize; j++) {
for (int k = 0; k < nSize; k++) {
c[i][j] += a[i][k] * b[k][j];
}
}
}
long endTime = System.currentTimeMillis();
double timeUsed = (endTime - startTime) / 1000.0;
System.out.printf("Java Matrix Multiplying of %dx%d and %dx%d took %.6f seconds\n",
mSize, nSize, nSize, kSize, timeUsed);
}
public static void main(String[] args) {
if (args.length != 3) {
System.err.println("Usage: java JMatrixMultiplication M_SIZE N_SIZE K_SIZE");
return;
}
int mSize = Integer.parseInt(args[0]);
int nSize = Integer.parseInt(args[1]);
int kSize = Integer.parseInt(args[2]);
j_m(mSize, nSize, kSize);
}
}
运行脚本
? run_matrix_multiply.sh,使用./run_matrix_multiply.sh M,N,K ;来输入矩阵形状。结果保存在results.txt
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
#!/bin/bash
# Read matrix dimensions from the command line arguments
M_SIZE=$1
N_SIZE=$2
K_SIZE=$3
# Ensure that the required arguments are provided
if [ -z "$M_SIZE" ] || [ -z "$N_SIZE" ] || [ -z "$K_SIZE" ]; then
echo "Usage: $0 M_SIZE N_SIZE K_SIZE"| tee -a"results.txt"
exit 1
fi
# Define output file
OUTPUT_FILE="results.txt"
echo "ijk " |tee -a "results.txt" #这里记录了遍历顺序
echo "$M_SIZE $N_SIZE $K_SIZE " |tee -a "results.txt"
# Run Python matrix multiplication
python3 p_m.py $M_SIZE $N_SIZE $K_SIZE | tee -a "results.txt"
# Compile and run Java matrix multiplication
javac j_m.java | tee -a "OUTPUT_FILE"
java j_m $M_SIZE $N_SIZE $K_SIZE | tee -a "results.txt"
# Compile and run C matrix multiplication
gcc -O c_m.c -o c_m | tee -a "results.txt"
./c_m $M_SIZE $N_SIZE $K_SIZE | tee -a "results.txt" #不开启优化
gcc -O1 c_m.c -o c_m1 | tee -a "results.txt"
./c_m1 $M_SIZE $N_SIZE $K_SIZE | tee -a "results.txt"
gcc -O2 c_m.c -o c_m2 | tee -a "results.txt"
./c_m2 $M_SIZE $N_SIZE $K_SIZE | tee -a "results.txt"
gcc -O3 c_m.c -o c_m3 | tee -a "results.txt" #最高级编译优化
./c_m3 $M_SIZE $N_SIZE $K_SIZE | tee -a "results.txt"
echo "Results have been saved to results.txt" | tee -a "results.txt"
echo " " >> temp | tee -a "results.txt"
更换循环顺序
以c语言为例:
? 本来默认的循环次序是ijk,可以考虑别的循环执行次序比如ikj,比较两者之间的差距。
0
1
2
3
4
5
6
7
//定义i,k,j顺序的矩阵乘法
for (int i = 0; i < M_SIZE; i++) {
for (int k = 0; k < N_SIZE; k++) {
for (int j = 0; j < K_SIZE; j++){
c[i * K_SIZE + j] += a[i * N_SIZE + k] * b[k * K_SIZE + j];
}
}
}
? a矩阵行优先;b矩阵行优先
计算浮点性能和峰值性能
? 计算程序浮点计算次数:
? 一共有3套循环,1024的三次方个运算,每次是乘法配一次加法,一共是2,147,483,648次双精度浮点运算。
? 本cpu i5-12500h desktop的浮点性能如下:
? 一共有16个核心,单核3.3GHz,根据英特尔官网,单个周期执行的浮点计算次数是16次
? 那么总共是16x3.11Gx16也就是796GFlops;
?
实验结果
? 可见results.txt
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
ijk
1024 1024 1024
Python Matrix Multiplying of 1024x1024 and 1024x1024 used time of 327.518531 seconds
Java Matrix Multiplying of 1024x1024 and 1024x1024 took 3.049000 seconds
C Matrix Multiplying of 1024*1024 and 1024*1024 use time of 2.853793 seconds #0优化
C Matrix Multiplying of 1024*1024 and 1024*1024 use time of 2.735906 seconds
C Matrix Multiplying of 1024*1024 and 1024*1024 use time of 4.114278 seconds
C Matrix Multiplying of 1024*1024 and 1024*1024 use time of 2.828579 seconds #最高优化
Results have been saved to results.txt
ikj
1024 1024 1024
Python Matrix Multiplying of 1024x1024 and 1024x1024 used time of 348.964143 seconds
Java Matrix Multiplying of 1024x1024 and 1024x1024 took 1.116000 seconds
C Matrix Multiplying of 1024*1024 and 1024*1024 use time of 0.970858 seconds
C Matrix Multiplying of 1024*1024 and 1024*1024 use time of 0.501914 seconds
C Matrix Multiplying of 1024*1024 and 1024*1024 use time of 0.569884 seconds
C Matrix Multiplying of 1024*1024 and 1024*1024 use time of 0.265389 seconds
Results have been saved to results.txt
版本 | 实现 | 运行时间 (s) | 相对加速比 (相对前一版本) | 绝对加速比 (相对版本1) | 浮点性能 (GFLOPS) | 达到峰值性能(796GFLOPS)的百分比 |
---|---|---|---|---|---|---|
1 | Python | 327.518531 | 1 | 0.006557001 | 0.00082% | |
2 | Java | 3.049000 | 107 | 107.38 | 0.7041 | 0.0884% |
3 | C | 2.853793 | 1.07 | 114.92 | 0.7535 | 0.0946% |
4 | +调整循环顺序 | 0.970858 | 2.94 | 337.64 | 2.2139 | 0.2780% |
5 | +编译优化 | 2.828579 | 1.01 | 115.79 | 0.7594 | 0.0953% |
6 | +调整循环顺序 和编译优化 | 0.265389 | 10.8 | 1404.11 | 8.094 | 1.012% |
实验感想
1.对于矩阵计算更改循环顺序可以优化的分析:应该是和cpu的cache命中率有关,用ikj的顺序遍历,保证在寻找矩阵数据的时候按行优先,这和c语言的数组内存分配规定相符合,会提高cache命中率,减少I/O时间,是浮点性能提高,计算时间减少。如果用ijk,那么在寻找矩阵B数据的时候,由于行长度太大,会跳出cache存储的区间,导致cache miss,或者要到二级cache找。
使用valgrind来监视cache命中情况
i, j, k顺序循环
i , k, j顺序循环
? 对比可以看到在D1(一级缓存的数据读取)缓存这一栏,原始循环顺序的miss率是48.8%,使用ikj循环顺序的miss rate是4.1%,可见更换循环顺序确实有利于降低数据读取的缓存命中率,从而提高程序运行速度。
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.