OpenCV Optimization - Stage II Update

Here is the list of tasks I need to complete for Stage II:
- Perform: Details of how the optimization plan was carried out
- Safety Test: Do the optimized functions pass the safety test?
- Performance Test: Compare the performance of the target functions before and after optimization
- Other Architecture: How do the target functions perform on other platforms?
Perform Optimization Plan on Archie
After reading the source code of fast_math.cpp, I realized that the OpenCV developers call lrint(value) inside the body of cvRound(), but fall back to the generic cast-and-adjust code for cvCeil() and cvFloor() on Archie:
CV_INLINE int
cvRound( double value )
{
    ...
#elif defined CV_ICC || defined __GNUC__
    return (int)(lrint(value));
    ...
}

CV_INLINE int cvCeil( float value )
{
    ...
#else
    int i = (int)value;
    return i + (i < value);
#endif
}
As I mentioned before, the optimization plan is to find functions similar to lrint(value) and see whether the performance of cvCeil() and cvFloor() can approach that of cvRound().
Looking for similar functions
According to the Linux Programmer’s Manual, ceil(), ceilf(), floor() and floorf() are declared in <math.h> alongside lrint() and have similar signatures:
long int lrint(double x);
long int lrintf(float x);
double ceil(double x);
float ceilf(float x);
double floor(double x);
float floorf(float x);
The only difference is the return type. lrint() returns a long int, which (int)lrint(value) casts to int. ceil() and floor() return a double, and ceilf() and floorf() return a float, so their results also have to be cast to int.
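The return types and the casts can be checked with a short sketch. The helper names round_to_int, ceil_to_int and floor_to_int are hypothetical and are not part of OpenCV; the sketch only assumes the <math.h> prototypes listed above.

```cpp
#include <cassert>
#include <math.h>
#include <type_traits>

// Compile-time checks of the return types given in the manual pages.
static_assert(std::is_same<decltype(lrint(0.0)), long>::value, "lrint returns long int");
static_assert(std::is_same<decltype(ceilf(0.0f)), float>::value, "ceilf returns float");
static_assert(std::is_same<decltype(floor(0.0)), double>::value, "floor returns double");

// Hypothetical helpers (not OpenCV functions): each one casts the
// library result to int, the same way cvRound() casts lrint().
inline int round_to_int(double v) { return (int)lrint(v); }
inline int ceil_to_int(float v)   { return (int)ceilf(v); }
inline int floor_to_int(float v)  { return (int)floorf(v); }
```

For example, ceil_to_int(-2.1f) gives -2 and floor_to_int(-2.1f) gives -3. The casts are only well defined while the rounded value fits in an int.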
Code Example after Optimization:
CV_INLINE int cvCeil( float value )
{
#if (defined CV__FASTMATH_ENABLE_GCC_MATH_BUILTINS || defined CV__FASTMATH_ENABLE_CLANG_MATH_BUILTINS) \
    && ( \
        defined(__PPC64__) \
    )
    return __builtin_ceilf(value);
#elif defined(__aarch64__)
    // Added for AArch64: call ceilf() and cast the float result to int.
    return (int)(ceilf(value));
#else
    int i = (int)value;
    return i + (i < value);
#endif
}
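cvFloor() can be changed in the same way. The following is a minimal standalone sketch, not the actual patch: the name floor_sketch is hypothetical, the generic branch mirrors the cast-based pattern shown for cvCeil() with the adjustment reversed, and the AArch64 guard is assumed to match the one used above.

```cpp
#include <cassert>
#include <math.h>

// Hypothetical stand-in for cvFloor(float); not the OpenCV source.
inline int floor_sketch(float value)
{
#if defined(__aarch64__)
    // AArch64 path: call floorf() and cast the float result to int.
    return (int)floorf(value);
#else
    // Generic path: truncate toward zero, then step down by one
    // when truncation moved the value up (negative non-integers).
    int i = (int)value;
    return i - (i > value);
#endif
}
```

Both branches return the same result for in-range inputs, for example floor_sketch(-2.7f) gives -3 and floor_sketch(2.7f) gives 2.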
Safety Test
In the results below, every total produced by the optimized cvCeil and cvFloor matches the expected answer, so both functions pass the safety test and the optimization does not change their results.
__aarch64__ cvCeil(float)
total cvCeil float a: 9760000000
correct answer: 9760000000
total cvCeil float b: -3630000000
correct answer: -3630000000
__aarch64__ cvCeil(double)
total cvCeil double a: 5540000000
correct answer: 5540000000
total cvCeil double b -6340000000
correct answer: -6340000000
__aarch64__ cvFloor(float)
total cvFloor float a: 9750000000
correct answer: 9750000000
total cvFloor float b: -3640000000
correct answer: -3640000000
__aarch64__ cvFloor(double)
total cvFloor double a 5530000000
correct answer: 5530000000
total cvFloor double b -6350000000
correct answer: -6350000000
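A check of this kind can be sketched as follows. This is not the test program that produced the output above; the function names, the input range and the Totals struct are assumptions made for illustration. It sums the results of the cast-based reference and the ceilf()-based candidate over the same inputs and also counts individual mismatches, since equal totals alone could hide errors that cancel out.

```cpp
#include <cassert>
#include <math.h>

// Reference: the generic cast-based pattern quoted earlier.
inline int ceil_reference(float value) { int i = (int)value; return i + (i < value); }

// Candidate: the ceilf()-based version used on AArch64.
inline int ceil_candidate(float value) { return (int)ceilf(value); }

// Hypothetical result record for one comparison run.
struct Totals {
    long long reference;   // sum of reference results
    long long candidate;   // sum of candidate results
    long mismatches;       // inputs where the two versions disagree
};

// Evaluate both versions on start, start + step, ... (count inputs).
inline Totals compare_ceil(float start, float step, int count)
{
    Totals t = {0, 0, 0};
    for (int k = 0; k < count; ++k) {
        float v = start + step * (float)k;
        int r = ceil_reference(v);
        int c = ceil_candidate(v);
        t.reference += r;
        t.candidate += c;
        if (r != c) ++t.mismatches;
    }
    return t;
}
```

Calling compare_ceil(-1000.0f, 0.37f, 10000) covers negative and positive inputs; the candidate passes when mismatches is zero and the two totals are equal.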
Performance Test
Note: these are median results after multiple runs.
Time Before Optimization
| Type | Metric | cvRound | cvCeil | cvFloor |
|---|---|---|---|---|
| float | real | 0m1.535s | 0m1.755s | 0m1.756s |
| float | user | 0m1.501s | 0m1.702s | 0m1.712s |
| float | sys | 0m0.032s | 0m0.051s | 0m0.041s |
| double | real | 0m1.535s | 0m1.755s | 0m1.758s |
| double | user | 0m1.503s | 0m1.712s | 0m1.704s |
| double | sys | 0m0.030s | 0m0.041s | 0m0.051s |
Time After Optimization
| Type | Metric | cvRound | cvCeil | cvFloor |
|---|---|---|---|---|
| float | real | 0m1.535s | 0m1.542s | 0m1.534s |
| float | user | 0m1.501s | 0m1.506s | 0m1.501s |
| float | sys | 0m0.032s | 0m0.033s | 0m0.030s |
| double | real | 0m1.535s | 0m1.535s | 0m1.539s |
| double | user | 0m1.503s | 0m1.502s | 0m1.503s |
| double | sys | 0m0.030s | 0m0.030s | 0m0.031s |
Table of Time Improvement Percentage
| Type | Metric | cvCeil | cvFloor |
|---|---|---|---|
| float | real | 12.14% | 12.64% |
| float | cpu (usr+sys) | 12.21% | 13.16% |
| double | real | 12.54% | 12.46% |
| double | cpu (usr+sys) | 12.61% | 12.59% |
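The percentages are consistent with the usual relative-improvement formula, (before - after) / before × 100. A minimal sketch, assuming that formula and using the hypothetical function name improvement_percent:

```cpp
#include <cassert>
#include <math.h>

// Relative time improvement in percent.
// before_s and after_s are run times in seconds; before_s must be non-zero.
inline double improvement_percent(double before_s, double after_s)
{
    return (before_s - after_s) / before_s * 100.0;
}
```

With the float cvCeil real times from the tables, improvement_percent(1.755, 1.542) is about 12.14. For the CPU rows, the user and sys times are added before the call, for example improvement_percent(1.702 + 0.051, 1.506 + 0.033), which is about 12.21.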
Usage Before Optimization
| Type | cvRound | cvCeil | cvFloor |
|---|---|---|---|
| float | 18.24% | 30.13% | 30.07% |
| double | 16.88% | 31.32% | 31.29% |
Usage After Optimization
| Type | cvRound | cvCeil | cvFloor |
|---|---|---|---|
| float | 18.24% | 19.64% | 20.56% |
| double | 16.88% | 19.71% | 19.28% |
Optimization Result
According to the tables above, the run times of cvRound, cvCeil and cvFloor are now very close. The real time of cvCeil and cvFloor is about 0.2 seconds shorter than before, and the CPU execution time (usr+sys) of these two functions is also approximately 0.2 seconds shorter.
The usage of cvCeil and cvFloor after optimization is about 10 percentage points lower than before (from roughly 30% to roughly 20%) and is very close to that of cvRound.
Other Architecture
AArch64 - Betty
The safety test also passes on Betty.
Time Before Optimization
| Type | Metric | cvRound | cvCeil | cvFloor |
|---|---|---|---|---|
| float | real | 0m0.559s | 0m0.654s | 0m0.656s |
| float | user | 0m0.548s | 0m0.633s | 0m0.645s |
| float | sys | 0m0.010s | 0m0.020s | 0m0.010s |
| double | real | 0m0.541s | 0m0.655s | 0m0.651s |
| double | user | 0m0.530s | 0m0.634s | 0m0.630s |
| double | sys | 0m0.010s | 0m0.020s | 0m0.021s |
Time After Optimization
| Type | Metric | cvRound | cvCeil | cvFloor |
|---|---|---|---|---|
| float | real | 0m0.559s | 0m0.562s | 0m0.533s |
| float | user | 0m0.548s | 0m0.521s | 0m0.645s |
| float | sys | 0m0.010s | 0m0.011s | 0m0.012s |
| double | real | 0m0.541s | 0m0.560s | 0m0.566s |
| double | user | 0m0.530s | 0m0.555s | 0m0.630s |
| double | sys | 0m0.010s | 0m0.010s | 0m0.010s |
Usage Before Optimization
| Type | cvRound | cvCeil | cvFloor |
|---|---|---|---|
| float | 7.79% | 9.33% | 9.86% |
| double | 7.68% | 12.61% | 10.83% |
Usage After Optimization
| Type | cvRound | cvCeil | cvFloor |
|---|---|---|---|
| float | 7.9% | 4.9% | 8.98% |
| double | 7.21% | 6.52% | 7.03% |
Conclusion
Overall, we can see a performance improvement for cvCeil and cvFloor on both Archie and Betty. On Archie, the run time of these two functions dropped by about 0.2 seconds (roughly 12%) and their usage dropped by roughly 10 to 12 percentage points compared with the results before optimization. On Betty, the run time of cvCeil and cvFloor dropped by about 0.1 seconds and their usage decreased by 10% to 50% in relative terms. The time and usage of cvCeil and cvFloor are now very close to those of cvRound.