OpenCV Optimization - Stage I Update
After discuss with my instructor, I realized that my OpenCV optimization stage I is not sufficient. There are the list of tasks I need to add for stage I:
- Test Host - Specify: which test host I will use for benchmaking.
- Test Plan - Safety: What is the result of the test. Is it always returning same&correct result?
- Test Plan - Performance: Display the detailed performance of the test.
- Optimization Plan: Explan why and how I am going to optimize target functions.
- Other Architecture: What is the preformance of the test in other platform
Build and Banchmark
Test Host:
The host for initial test is Aarch64
named Archie with following features:
[wpan17@aarchie ~]$ lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 12
NUMA node(s): 1
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
BogoMIPS: 200.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-23
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
Test Data:
As I mentioned in Stage I, the original testing data is very simple, just need to use cvRound, cvCeil and cvFloor functions for a single float for one million times. But now I realize this test data is not enough. I should try multiple input.
I used 2 pair of test data, one is float and one is double. Each pair contains 2 numbers, one positive and one negative. These numbers are all randomly selected using formula (max - min) * ((float)rand() / RAND_MAX) + min;
. The ranges (min - max) are between -1000 to 0 for positive and 0 to 1000 for negative number.
float float_a = 975.256165;
float float_b = -363.565369;
double double_a = 553.034485;
double double_b = -634.781189;
Optimization and Testing Plan
In previous Stage I blog, we already discovered that the performance of cvCeil
and cvFloor
is much worse than cvRound
.
Now I need to redo the Benchmark with new test data and plan.
First of all, I create 6 test file for cvCeil
, cvFloor
and cvRound
.
cvRound | cvCeil | cvFloor | |
---|---|---|---|
float | cvRound_test_f.cpp | cvCeil_test_f.cpp | cvFloor_test_f.cpp |
double | cvRound_test_d.cpp | cvCeil_test_d.cpp | cvFloor_test_d.cpp |
The main body for these test files are similar like:
int main(int argc,char** argv){
if( argc != 2 ){
#if __aarch64__
printf("__aarch64__ cvRound(float)\n");
#endif
float float_a = 975.256165;
float float_b = -363.565369;
long int total_fra = 0; //total number for float
long int total_frb = 0; //total number for double
for (int i=0;i<10000000;i++){
total_fra += cvRound(975.256165);
total_frb += cvRound(-363.565369);
}
printf("total cvRound float a: %ld\n",total_fra);
printf("correct answer: %ld\n\n",(long int)round(float_a)*10000000);
printf("total cvRound float b: %ld\n", total_frb);
printf("correct answer: %ld\n\n",(long int)round(float_b)*10000000);
return 0;
}
}
Test Plan
In above script, we will call cvRound()
function one million time with both positive and negative float. The result of will be added up to long integer total_fra and total_frb and will be printed out. The expected result will be calculated use regular round()
function. The test plan for other two functions will follow the same method and same test data.
Test Plan - Safety
cvRound_test_f.cpp:
total cvRound float a: 9750000000
correct answer: 9750000000
total cvRound float b: -3640000000
correct answer: -3640000000
cvRound_test_d.cpp:
total cvRound double a: 5530000000
correct answer: 5530000000
total cvRound double b: -6350000000
correct answer: -6350000000
cvCeil_test_f.cpp:
total cvCeil float a: 9760000000
correct answer: 9760000000
total cvCeil float b: -3630000000
correct answer: -3630000000
cvCeil_test_d.cpp:
total cvCeil double a: 5540000000
correct answer: 5540000000
total cvCeil double b -6340000000
correct answer: -6340000000
cvFloor_test_f.cpp
total cvFloor float a: 9750000000
correct answer: 9750000000
total cvFloor float b: -3640000000
correct answer: -3640000000
cvFloor_test_d.cpp
total cvFloor double a 5530000000
correct answer: 5530000000
total cvFloor double b -6350000000
correct answer: -6350000000
So far the result of thse three functions are all correct. After we modify the cvCeil
and cvFloor
in next stage, this result will be used as an reference to check whether the optimization implementation is safe.
Test Plan - Performance
Time Usage with same test condition
cvRound | cvCeil | cvFloor | ||
---|---|---|---|---|
float | real | 0m1.535s | 0m1.755s | 0m1.756s |
user | 0m1.501s | 0m1.702 | 0m1.712s | |
sys | 0m0.032s | 0m0.051 | 0m0.041s | |
double | real | 0m1.535s | 0m1.755s | 0m1.758s |
user | 0m1.503s | 0m1.712s | 0m1.704s | |
sys | 0m0.030s | 0m0.041s | 0m0.051s |
Perf with same test conditions
The perf report of shows us the runtime usage percentage of cvRound and other functions:
57.58% cvRound_float libc-2.27.so [.] _mcount@@GLIBC_2.18
11.77% cvRound_float cvRound_float [.] main
10.92% cvRound_float cvRound_float [.] cvRound
7.32% cvRound_float libm-2.27.so [.] __lrint
59.63% cvRound_double libc-2.27.so [.] _mcount@@GLIBC_2.18
11.11% cvRound_double cvRound_double [.] main
10.29% cvRound_double cvRound_double [.] cvRound
6.59% cvRound_double libm-2.27.so [.] __lrint
Compare with cvRound, we can see that the percentage of cvCeil and cvFloor is much higher:
51.79% cvCeil_float libc-2.27.so [.] _mcount@@GLIBC_2.18
30.13% cvCeil_float cvCeil_float [.] cvCeil
9.49% cvCeil_float cvCeil_float [.] main
3.82% cvCeil_float ld-2.27.so [.] do_lookup_x
50.39% cvCeil_double libc-2.27.so [.] _mcount@@GLIBC_2.18
31.32% cvCeil_double cvCeil_double [.] cvCeil
9.81% cvCeil_double cvCeil_double [.] main
3.81% cvCeil_double ld-2.27.so [.] do_lookup_x
50.91% cvFloor_float libc-2.27.so [.] _mcount@@GLIBC_2.18
30.07% cvFloor_float cvFloor_float [.] cvFloor
10.33% cvFloor_float cvFloor_float [.] main
3.62% cvFloor_float ld-2.27.so [.] do_lookup_x
49.91% cvFloor_double libc-2.27.so [.] _mcount@@GLIBC_2.18
31.29% cvFloor_double cvFloor_double [.] cvFloor
10.20% cvFloor_double cvFloor_double [.] main
3.78% cvFloor_double ld-2.27.so [.] do_lookup_x
Usage Table
cvRound | cvCeil | cvFloor | |
---|---|---|---|
float | 18.24% | 30.13% | 30.07% |
double | 16.88% | 31.32% | 31.29% |
According to the time and perf result, we can see that the performance of cvCeil
and cvFloor
is much worse than cvRound
. There should be some room for us to to improve cvCeil
and cvFloor
, to make their performance approach to cvRound
.
Optimization Plan
After looking into these three function, I realize that the reason why cvRound performs better is that it will call a special GNUC function lrint()
inside of its body. And lrint()
is an optimized function which will round the input to the nearest integer.
│ _ZL7cvRoundd():
12.91 │ stp x29, x30, [sp, #-32]!
11.98 │ mov x29, sp
│ mov x0, x30
26.16 │ str d0, [sp, #24]
│ → bl _mcount@plt
11.80 │ ldr d0, [sp, #24]
│ → bl lrint@plt
11.80 │ ldp x29, x30, [sp], #32
25.35 │ ← ret
│ 0000000000044218 <lrint>:
│ lrint():
20.48 │ frintx d0, d0
79.52 │ fcvtzs x0, d0
│ ← ret
But for cvCeil()
and cvRound
, there is no such optimization
Percent│ 000000000040692c <cvCeil(double)>:
│ _ZL6cvCeild():
4.13 │ stp x29, x30, [sp, #-48]!
4.37 │ mov x29, sp
│ mov x0, x30
7.31 │ str d0, [sp, #24]
│ → bl _mcount@plt
4.12 │ ldr d0, [sp, #24]
12.34 │ fcvtzs w0, d0
3.79 │ str w0, [sp, #44]
3.56 │ ldr w0, [sp, #44]
14.57 │ scvtf d0, w0
│ ldr d1, [sp, #24]
15.80 │ fcmpe d1, d0
4.37 │ cset w0, gt
3.70 │ and w0, w0, #0xff
3.99 │ mov w1, w0
│ ldr w0, [sp, #44]
7.45 │ add w0, w1, w0
3.42 │ ldp x29, x30, [sp], #48
7.07 │ ← ret
So the test plan is to find optimized GNUC funtions which are similar with lrint()
for cvCeil
and cvFloor
, to imporve the performance of these two functions.
Other Architecture
Performence Comparison between different architecture with same test condition
Aarch64 - Betty
[wpan17@bbetty test]$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 4
NUMA node(s): 1
Vendor ID: ARM
Model: 2
Model name: Cortex-A57
Stepping: r1p2
BogoMIPS: 500.00
NUMA node0 CPU(s): 0-7
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
Time
cvRound | cvCeil | cvFloor | ||
---|---|---|---|---|
float | real | 0m0.559s | 0m0.654s | 0m0.656s |
user | 0m0.548s | 0m0.633s | 0m0.645s | |
sys | 0m0.010s | 0m0.020s | 0m0.010s | |
double | real | 0m0.541s | 0m0.655s | 0m0.651s |
user | 0m0.530s | 0m0.634s | 0m0.630s | |
sys | 0m0.010s | 0m0.020s | 0m0.021s |
Usage
cvRound | cvCeil | cvFloor | |
---|---|---|---|
float | 7.79% | 9.33% | 9.86% |
double | 7.68% | 12.61% | 10.83% |
The performance of these three functions on Betty seems much better than Archie. But sill, the performence of cvRound is still better than other two functions.
x86_64 xerxes
I plan to use run the same test on xerxes, but there is no engough space for me to install OpenCV on it.
[wpan17@xerxes project]$ git clone https://github.com/opencv/opencv.git
Cloning into 'opencv'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
fatal: write error: No space left on device55 MiB | 12.94 MiB/s
fatal: index-pack failed