OpenCV Optimization - Stage II Update


These are the tasks I need to cover for Stage II:

Perform: Details of how the optimization plan was carried out
Safety Test: Do the optimized functions pass the safety test?
Performance Test: Compare the performance of the target functions before and after optimization
Other Architecture: How do the target functions perform on another platform?

Perform Optimization Plan on Archie

After reading the source code of fast_math.hpp, I realized that on Archie the OpenCV developers call lrint(value) inside the body of cvRound(), but cvCeil() and cvFloor() fall back to the generic code path, which truncates to int and then adjusts the result.

CV_INLINE int
cvRound( double value )
{
    ...
#elif defined CV_ICC || defined __GNUC__
    return (int)(lrint(value));
    ...
}

CV_INLINE int cvCeil( float value )
{
    ...
#else
    int i = (int)value;
    return i + (i < value);
#endif
}
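To see why the generic branch gives the right answer, here is a minimal sketch of the truncate-and-adjust idea. The function names are hypothetical, and the floor variant is my assumption of the matching pattern, since only the cvCeil fallback is quoted above.

```cpp
#include <cassert>

// Same arithmetic as the generic #else branch of cvCeil shown above.
inline int ceil_by_truncation(float value)
{
    int i = (int)value;      // the cast truncates toward zero
    return i + (i < value);  // add 1 only when truncation landed below the input
}

// Assumed floor counterpart: step down by 1 when truncation landed above the input.
inline int floor_by_truncation(float value)
{
    int i = (int)value;
    return i - (i > value);
}

// A few spot checks covering positive and negative non-integers.
inline void check_truncation_rounding()
{
    assert(ceil_by_truncation(2.1f) == 3);
    assert(ceil_by_truncation(-2.1f) == -2);
    assert(floor_by_truncation(2.1f) == 2);
    assert(floor_by_truncation(-2.1f) == -3);
}
```

The extra comparison and add after the cast is the work that a single library call could replace, which is what the plan tries to exploit.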

As I mentioned before, the optimization plan is to find functions similar to lrint(value) and see whether the performance of cvCeil() and cvFloor() can approach that of cvRound().

Looking for similar functions

According to the Linux Programmer’s Manual, ceil(), ceilf(), floor() and floorf() have interfaces similar to lrint(value).

long int lrint(double x);
long int lrintf(float x);

double ceil(double x);
float ceilf(float x);

double floor(double x);
float floorf(float x);

The only difference is the return type. lrint() returns a long int, so (int)lrint(value) only narrows one integer type to another. ceil() and floor() return a double, and ceilf() and floorf() return a float, so their results must be converted from floating point to int by the cast.
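The return-type difference can be checked at compile time. This is a small sketch with hypothetical wrapper names, not code from OpenCV; it only uses the standard <math.h> functions listed above.

```cpp
#include <cassert>
#include <math.h>
#include <type_traits>

// lrint already yields an integer type; ceilf and floor yield floating point.
static_assert(std::is_same<decltype(lrint(0.0)), long>::value, "lrint returns long");
static_assert(std::is_same<decltype(ceilf(0.0f)), float>::value, "ceilf returns float");
static_assert(std::is_same<decltype(floor(0.0)), double>::value, "floor returns double");

// Integer-to-integer narrowing after lrint.
inline int round_via_lrint(double v) { return (int)lrint(v); }

// Floating-point-to-integer conversion after ceilf / floor.
inline int ceil_via_ceilf(float v) { return (int)ceilf(v); }
inline int floor_via_floor(double v) { return (int)floor(v); }

inline void check_conversions()
{
    assert(round_via_lrint(2.6) == 3);
    assert(ceil_via_ceilf(-2.5f) == -2);
    assert(floor_via_floor(-2.5) == -3);
}
```

In both cases the final cast to int is only well defined when the value fits in an int, so very large inputs are outside what this sketch covers.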

Code Example after Optimization:

CV_INLINE int cvCeil( float value )
{
#if (defined CV__FASTMATH_ENABLE_GCC_MATH_BUILTINS || defined CV__FASTMATH_ENABLE_CLANG_MATH_BUILTINS) \
    && ( \
        defined(__PPC64__) \
    )
    return __builtin_ceilf(value);
#elif __aarch64__
    return (int)(ceilf(value));
#else
    int i = (int)value;
    return i + (i < value);
#endif
}
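Only the cvCeil change is shown above. Assuming the cvFloor change mirrors it, the float version might look like the sketch below. The function name floor_sketch is hypothetical and this is not the upstream OpenCV source.

```cpp
#include <cassert>
#include <math.h>

// Hypothetical stand-in for cvFloor(float).
inline int floor_sketch(float value)
{
#if defined(__aarch64__)
    // Assumed AArch64 branch, mirroring the cvCeil change:
    // call floorf, then convert the float result to int.
    return (int)floorf(value);
#else
    // Generic fallback: truncate toward zero, then step down by one
    // when truncation landed above the input (negative non-integers).
    int i = (int)value;
    return i - (i > value);
#endif
}

inline void check_floor_sketch()
{
    assert(floor_sketch(2.9f) == 2);
    assert(floor_sketch(-2.1f) == -3);
    assert(floor_sketch(-3.0f) == -3);
}
```

Both branches return the same value for inputs that fit in an int, so the sketch behaves identically on AArch64 and on other platforms; only the instructions generated differ.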

Safety Test

According to the following results, cvCeil and cvFloor pass the safety test: in every case the total produced by the optimized function matches the expected answer. So this optimization does not change the results of these two functions.

__aarch64__ cvCeil(float)
total cvCeil float a: 9760000000
correct answer: 9760000000

total cvCeil float b: -3630000000
correct answer: -3630000000

__aarch64__ cvCeil(double)
total cvCeil double a: 5540000000
correct answer: 5540000000

total cvCeil double b -6340000000
correct answer: -6340000000

__aarch64__ cvFloor(float)
total cvFloor float a: 9750000000
correct answer: 9750000000

total cvFloor float b: -3640000000
correct answer: -3640000000

__aarch64__ cvFloor(double)
total cvFloor double a 5530000000
correct answer: 5530000000

total cvFloor double b -6350000000
correct answer: -6350000000
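The test program itself is not shown here. The output suggests that it sums each function's results over many inputs and compares the total with a known answer. Under that assumption, a comparable check could be sketched as follows; all names are hypothetical, and the truncation-based code is used as the reference.

```cpp
#include <cassert>
#include <math.h>

// Candidate under test: the libm-based variant.
inline int ceil_candidate(float v) { return (int)ceilf(v); }

// Reference: the truncation-based fallback.
inline int ceil_reference(float v) { int i = (int)v; return i + (i < v); }

struct Totals {
    long long candidate;  // sum of results from the candidate
    long long reference;  // sum of results from the reference
    int mismatches;       // number of inputs where the two disagree
};

// Evaluate both versions on start, start + step, ... (count inputs).
inline Totals compare_ceil(float start, float step, int count)
{
    assert(count >= 0);
    Totals t = {0, 0, 0};
    for (int k = 0; k < count; ++k) {
        float v = start + step * (float)k;
        int c = ceil_candidate(v);
        int r = ceil_reference(v);
        t.candidate += c;
        t.reference += r;
        t.mismatches += (c != r);
    }
    return t;
}
```

Comparing totals alone can hide errors that cancel each other out, which is why the sketch also counts per-input mismatches.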

Performance Test

Note: these are median results after multiple runs.
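The real/user/sys figures below are in the format that the shell's time command reports. As an illustration of taking a median over several runs, here is a sketch that times a callable in-process. It measures wall-clock time only, and the helper names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Median of a set of samples; takes a copy because it sorts.
inline double median(std::vector<double> samples)
{
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    std::size_t n = samples.size();
    return (n % 2 == 1) ? samples[n / 2]
                        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// Run fn `runs` times and return the median wall-clock time in seconds.
template <typename Fn>
double median_seconds(Fn fn, int runs)
{
    assert(runs > 0);
    std::vector<double> samples;
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    return median(samples);
}
```

The median is less sensitive than the mean to a single slow run caused by other activity on the machine.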

Time Before Optimization

		cvRound	cvCeil	cvFloor
float	real	0m1.535s	0m1.755s	0m1.756s
	user	0m1.501s	0m1.702s	0m1.712s
	sys	0m0.032s	0m0.051s	0m0.041s
double	real	0m1.535s	0m1.755s	0m1.758s
	user	0m1.503s	0m1.712s	0m1.704s
	sys	0m0.030s	0m0.041s	0m0.051s

Time After Optimization

		cvRound	cvCeil	cvFloor
float	real	0m1.535s	0m1.542s	0m1.534s
	user	0m1.501s	0m1.506s	0m1.501s
	sys	0m0.032s	0m0.033s	0m0.030s
double	real	0m1.535s	0m1.535s	0m1.539s
	user	0m1.503s	0m1.502s	0m1.503s
	sys	0m0.030s	0m0.030s	0m0.031s

Table of Time Improvement Percentage

		cvCeil	cvFloor
float	real	12.14%	12.64%
float	cpu(usr+sys)	12.21%	12.66%
double	real	12.54%	12.46%
double	cpu(usr+sys)	12.61%	12.59%
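The percentages are consistent with the usual relative-improvement formula, (before - after) / before * 100, applied to the two timing tables. A minimal sketch with a hypothetical helper name:

```cpp
#include <cassert>
#include <math.h>

// Relative improvement in percent: (before - after) / before * 100.
inline double improvement_percent(double before_s, double after_s)
{
    assert(before_s > 0.0);
    return (before_s - after_s) / before_s * 100.0;
}
```

For example, the float cvCeil real time goes from 1.755 s to 1.542 s, and improvement_percent(1.755, 1.542) gives about 12.14. For the cpu rows, the user and sys times are added together before applying the formula.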

Usage Before Optimization

	cvRound	cvCeil	cvFloor
float	18.24%	30.13%	30.07%
double	16.88%	31.32%	31.29%

Usage After Optimization

	cvRound	cvCeil	cvFloor
float	18.24%	19.64%	20.56%
double	16.88%	19.71%	19.28%

Optimization Result

According to the tables above, the times for cvRound, cvCeil and cvFloor are now very close. The real time for cvCeil and cvFloor is about 0.2 seconds (roughly 12%) lower than before, and the CPU execution time (usr+sys) is also approximately 0.2 seconds lower. Usage for cvCeil and cvFloor dropped by roughly 10 to 12 percentage points and is now close to that of cvRound.

Other Architecture

AArch64 - Betty

The safety test also passes on Betty.

Time Before Optimization

		cvRound	cvCeil	cvFloor
float	real	0m0.559s	0m0.654s	0m0.656s
	user	0m0.548s	0m0.633s	0m0.645s
	sys	0m0.010s	0m0.020s	0m0.010s
double	real	0m0.541s	0m0.655s	0m0.651s
	user	0m0.530s	0m0.634s	0m0.630s
	sys	0m0.010s	0m0.020s	0m0.021s

Time After Optimization

		cvRound	cvCeil	cvFloor
float	real	0m0.559s	0m0.562s	0m0.533s
	user	0m0.548s	0m0.521s	0m0.645s
	sys	0m0.010s	0m0.011s	0m0.012s
double	real	0m0.541s	0m0.560s	0m0.566s
	user	0m0.530s	0m0.555s	0m0.630s
	sys	0m0.010s	0m0.010s	0m0.010s

Usage Before Optimization

	cvRound	cvCeil	cvFloor
float	7.79%	9.33%	9.86%
double	7.68%	12.61%	10.83%

Usage After Optimization

	cvRound	cvCeil	cvFloor
float	7.9%	4.9%	8.98%
double	7.21%	6.52%	7.03%

Conclusion

Overall, cvCeil and cvFloor show a performance improvement on both Archie and Betty. On Archie, the time of these two functions dropped by about 0.2 seconds (roughly 12%) and usage dropped by roughly 10 to 12 percentage points compared with the results before optimization. On Betty, the time for cvCeil and cvFloor dropped by about 0.1 seconds and usage decreased by roughly 10% to 50% relative to the values before optimization. The time and usage of cvCeil and cvFloor are now very close to those of cvRound.
