WP Software Developer

OpenCV Optimization Practice - Stage II Update



Here is the list of tasks I need to cover in Stage II:

  • Perform: details of how the optimization plan was carried out
  • Safety Test: do the optimized functions pass the safety test?
  • Performance Test: compare the performance of the target functions before and after optimization
  • Other Architecture: how do the target functions perform on other platforms?

Perform Optimization Plan on Archie

After reading the source code of fast_math.hpp, I realized that the OpenCV developers call lrint(value) inside the body of cvRound(), but fall back to a plain cast-and-compare implementation for cvCeil() and cvFloor() on Archie:

CV_INLINE int
cvRound( double value )
{
    ...
#elif defined CV_ICC || defined __GNUC__
    return (int)(lrint(value));
    ...
}
CV_INLINE int cvCeil( float value )
{
    ...
#else
    int i = (int)value;
    return i + (i < value);
#endif
}


As I mentioned before, the optimization plan is to find functions similar to lrint(value) and see whether the performance of cvCeil() and cvFloor() can approach that of cvRound().
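To make the starting point concrete, here is a minimal standalone sketch of the two strategies quoted above. The names round_via_lrint and ceil_via_cast are hypothetical stand-ins, not OpenCV functions; their bodies mirror the lrint() call in cvRound() and the cast-and-compare fallback in cvCeil().

```cpp
#include <cassert>
#include <math.h>

// Hypothetical stand-in for cvRound(): delegates to lrint(), which rounds
// according to the current rounding mode (round-to-nearest by default).
inline int round_via_lrint(double value) {
    return (int)lrint(value);
}

// Hypothetical stand-in for the generic cvCeil() fallback: truncate toward
// zero, then add 1 when the truncated value is below the input.
inline int ceil_via_cast(double value) {
    int i = (int)value;
    return i + (i < value);
}

// Spot checks on a few easy values.
inline void check_rounding_helpers() {
    assert(round_via_lrint(2.4) == 2);
    assert(ceil_via_cast(2.1) == 3);    // truncates to 2, then 2 < 2.1 adds 1
    assert(ceil_via_cast(-2.1) == -2);  // truncates to -2, no adjustment
    assert(ceil_via_cast(3.0) == 3);    // already integral
}
```

The fallback needs a conversion, a comparison and an addition, while the lrint() path is a single library call that the compiler may be able to lower to very few instructions. That gap is what the plan tries to close.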

Looking for similar functions

According to the Linux Programmer's Manual, ceil(), ceilf(), floor() and floorf() are the closest counterparts to lrint(value): like lrint(), they are declared in <math.h> and round their argument to an integral value.

long int lrint(double x);
long int lrintf(float x);
double ceil(double x);
float ceilf(float x);
double floor(double x);
float floorf(float x);

The only difference is the return type. lrint() returns a long int, so (int)lrint(value) narrows it to int. ceil() and floor() return a double, and ceilf() and floorf() return a float, so their results must be cast to int as well.
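The conversions can be sketched as follows. This is an illustration only: the helper names are hypothetical, the default rounding mode is assumed, and the inputs are assumed to fit in an int (converting an out-of-range floating-point value to int is undefined behaviour in C++).

```cpp
#include <cassert>
#include <math.h>

// lrint() already yields an integer type (long int); the cast only narrows it.
inline int int_from_lrint(double x)  { return (int)lrint(x); }

// ceil() yields a double that is already integral; the cast changes the type
// without dropping a fractional part.
inline int int_from_ceil(double x)   { return (int)ceil(x); }

// floorf() is the float variant and yields an integral float.
inline int int_from_floorf(float x)  { return (int)floorf(x); }

// Spot checks under the default rounding mode.
inline void check_conversions() {
    assert(int_from_lrint(2.5) == 2);      // ties go to the even neighbour
    assert(int_from_lrint(3.5) == 4);
    assert(int_from_ceil(2.1) == 3);
    assert(int_from_floorf(-2.1f) == -3);
}
```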

Code Example after Optimization:

CV_INLINE int cvCeil( float value )
{
#if (defined CV__FASTMATH_ENABLE_GCC_MATH_BUILTINS || defined CV__FASTMATH_ENABLE_CLANG_MATH_BUILTINS) \
    && ( \
        defined(__PPC64__) \
    )
    return __builtin_ceilf(value);
#elif defined(__aarch64__)
    return (int)(ceilf(value));
#else
    int i = (int)value;
    return i + (i < value);
#endif
}
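Only the modified cvCeil(float) is shown above. By analogy, the cvFloor() change presumably swaps ceilf() for floorf(). The sketch below is an assumption about that counterpart, not code taken from the OpenCV tree or from my patch as posted; the name floor_to_int is hypothetical and chosen to avoid clashing with OpenCV's own cvFloor().

```cpp
#include <cassert>
#include <math.h>

// Hypothetical float counterpart of the cvCeil() change shown above.
inline int floor_to_int(float value) {
#if defined(__aarch64__)
    return (int)floorf(value);      // library call on AArch64
#else
    int i = (int)value;             // generic path: truncate toward zero,
    return i - (i > value);         // then subtract 1 if that overshot
#endif
}

// Hypothetical double counterpart, using floor() instead of floorf().
inline int floor_to_int(double value) {
#if defined(__aarch64__)
    return (int)floor(value);
#else
    int i = (int)value;
    return i - (i > value);
#endif
}

// Both branches must agree on these values.
inline void check_floor_to_int() {
    assert(floor_to_int(2.9f) == 2);
    assert(floor_to_int(-2.1f) == -3);
    assert(floor_to_int(-2.0) == -2);
    assert(floor_to_int(5.5) == 5);
}
```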

Safety Test

According to the following results, cvCeil and cvFloor pass the safety test on the inputs tested, so this optimization does not change the results of these two functions.

__aarch64__ cvCeil(float)
total cvCeil float a: 9760000000
correct answer: 9760000000

total cvCeil float b: -3630000000
correct answer: -3630000000
__aarch64__ cvCeil(double)
total cvCeil double a: 5540000000
correct answer: 5540000000

total cvCeil double b -6340000000
correct answer: -6340000000
__aarch64__ cvFloor(float)
total cvFloor float a: 9750000000
correct answer: 9750000000

total cvFloor float b: -3640000000
correct answer: -3640000000
__aarch64__ cvFloor(double)
total cvFloor double a 5530000000
correct answer: 5530000000

total cvFloor double b -6350000000
correct answer: -6350000000
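The test program itself is not listed here. As a rough sketch of how a check of this kind could be structured, the code below sums the results of a ceilf()-based candidate over a set of inputs and compares the total with the cast-and-compare reference. Every name and the input set are hypothetical, and the totals will not match the figures printed above.

```cpp
#include <cassert>
#include <cstdint>
#include <math.h>
#include <vector>

// Reference: the cast-and-compare fallback quoted from cvCeil().
inline int ceil_reference(float value) {
    int i = (int)value;
    return i + (i < value);
}

// Candidate: the ceilf()-based variant under test.
inline int ceil_candidate(float value) {
    return (int)ceilf(value);
}

// Sum f(x) over all inputs in 64 bits so the total cannot overflow an int.
template <typename F>
std::int64_t total(const std::vector<float>& inputs, F f) {
    std::int64_t sum = 0;
    for (float x : inputs) sum += f(x);
    return sum;
}

// Hypothetical inputs: -500.0 to 500.0 in steps of 0.25, which mixes
// negative, positive, integral and fractional values.
inline std::vector<float> make_inputs() {
    std::vector<float> v;
    for (int k = -2000; k <= 2000; ++k) v.push_back(k * 0.25f);
    return v;
}

// True when the candidate agrees with the reference on every input
// and on the overall total.
inline bool ceil_matches_reference() {
    const std::vector<float> inputs = make_inputs();
    for (float x : inputs) {
        if (ceil_candidate(x) != ceil_reference(x)) return false;
    }
    return total(inputs, ceil_candidate) == total(inputs, ceil_reference);
}
```

Comparing element by element as well as by total guards against two errors that happen to cancel out in the sum.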

Performance Test

Note: these are median results after multiple runs.
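For reference, a median over several runs can be computed as in the sketch below. The helper name median_seconds is hypothetical, and the sketch is not the procedure used to collect the figures in this post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a set of run times in seconds. For an even number of runs the
// result is the mean of the two middle values. The vector is taken by value
// so the caller's ordering is left untouched.
inline double median_seconds(std::vector<double> runs) {
    assert(!runs.empty());
    std::sort(runs.begin(), runs.end());
    const std::size_t mid = runs.size() / 2;
    return (runs.size() % 2 == 1) ? runs[mid]
                                  : (runs[mid - 1] + runs[mid]) / 2.0;
}
```

The median is less sensitive than the mean to a single slow run caused by other activity on a shared machine.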

Time Before Optimization

                cvRound     cvCeil      cvFloor
float   real    0m1.535s    0m1.755s    0m1.756s
        user    0m1.501s    0m1.702s    0m1.712s
        sys     0m0.032s    0m0.051s    0m0.041s
double  real    0m1.535s    0m1.755s    0m1.758s
        user    0m1.503s    0m1.712s    0m1.704s
        sys     0m0.030s    0m0.041s    0m0.051s


Time After Optimization

                cvRound     cvCeil      cvFloor
float   real    0m1.535s    0m1.542s    0m1.534s
        user    0m1.501s    0m1.506s    0m1.501s
        sys     0m0.032s    0m0.033s    0m0.030s
double  real    0m1.535s    0m1.535s    0m1.539s
        user    0m1.503s    0m1.502s    0m1.503s
        sys     0m0.030s    0m0.030s    0m0.031s


Table of Time Improvement Percentage

                        cvCeil      cvFloor
float   real            12.14%      12.64%
        cpu (usr+sys)   12.21%      12.66%
double  real            12.54%      12.46%
        cpu (usr+sys)   12.61%      12.59%
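Each percentage is the relative reduction in time, (before - after) / before * 100. A small sketch of the calculation, using a hypothetical helper name and the cvCeil float figures from the tables above:

```cpp
#include <cassert>
#include <cmath>

// Relative improvement in percent: how much of the original time was saved.
inline double improvement_percent(double before_s, double after_s) {
    return (before_s - after_s) / before_s * 100.0;
}

// cvCeil(float) on Archie, taken from the time tables above.
inline void check_improvement() {
    // real: 1.755 s before, 1.542 s after
    assert(std::fabs(improvement_percent(1.755, 1.542) - 12.14) < 0.01);
    // cpu (usr+sys): 1.702 + 0.051 before, 1.506 + 0.033 after
    assert(std::fabs(improvement_percent(1.702 + 0.051, 1.506 + 0.033) - 12.21) < 0.01);
}
```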


Usage Before Optimization

        cvRound     cvCeil      cvFloor
float   18.24%      30.13%      30.07%
double  16.88%      31.32%      31.29%


Usage After Optimization

        cvRound     cvCeil      cvFloor
float   18.24%      19.64%      20.56%
double  16.88%      19.71%      19.28%

Optimization Result

According to the tables above, the times for cvRound, cvCeil and cvFloor are now very close. The real time of cvCeil and cvFloor dropped by about 0.2 seconds, and their CPU time (usr+sys) dropped by about the same amount. Their usage fell by roughly 10 to 12 percentage points and is now close to that of cvRound.

Other Architecture

AArch64 - Betty

The safety test also passes on Betty.

Time Before Optimization

                cvRound     cvCeil      cvFloor
float   real    0m0.559s    0m0.654s    0m0.656s
        user    0m0.548s    0m0.633s    0m0.645s
        sys     0m0.010s    0m0.020s    0m0.010s
double  real    0m0.541s    0m0.655s    0m0.651s
        user    0m0.530s    0m0.634s    0m0.630s
        sys     0m0.010s    0m0.020s    0m0.021s

Time After Optimization

                cvRound     cvCeil      cvFloor
float   real    0m0.559s    0m0.562s    0m0.533s
        user    0m0.548s    0m0.521s    0m0.645s
        sys     0m0.010s    0m0.011s    0m0.012s
double  real    0m0.541s    0m0.560s    0m0.566s
        user    0m0.530s    0m0.555s    0m0.630s
        sys     0m0.010s    0m0.010s    0m0.010s

Usage Before Optimization

        cvRound     cvCeil      cvFloor
float   7.79%       9.33%       9.86%
double  7.68%       12.61%      10.83%

Usage After Optimization

        cvRound     cvCeil      cvFloor
float   7.9%        4.9%        8.98%
double  7.21%       6.52%       7.03%

Conclusion

Overall, cvCeil and cvFloor show a performance improvement on both Archie and Betty. On Archie, the time of these two functions dropped by ~0.2 seconds and their usage dropped by roughly 10 to 12 percentage points compared with the results before optimization. On Betty, the time for cvCeil and cvFloor dropped by ~0.1 seconds and their usage decreased by roughly 10% to 50% in relative terms. The time and usage of cvCeil and cvFloor are now very close to those of cvRound.

