Software Optimization - SIMD Practice (C Intrinsics)

SIMD C Intrinstics

SIMD C Intrinstics

C Intrinsics are function-like extensions to the C language. Although they look like functions, they are compiled inline. Because C Intrinstics are not provided by C language itself, its function is not portable. We need to be very careful if we want to use it.

The source code for Intrinstic version:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>
#include "vol.h"

int main() {

	int16_t*		data;		// data array
	int16_t*		limit;		// end of input array

	register int16_t*	cursor 		asm("r20");	// array cursor (pointer)
	register int16_t	vol_int		asm("r22");	// volume as int16_t

	int			x;		// array interator
	int			ttl =  0;	// array total

	data=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

	srand(-1);
	printf("Generating sample data.\n");
	for (x = 0; x < SAMPLES; x++) {
		data[x] = (rand()%65536)-32768;
	}

// --------------------------------------------------------------------

	cursor = data; // Pointer to start of array
	limit = data + SAMPLES ;

	vol_int = (int16_t) (0.75 * 32767.0);

	printf("Scaling samples.\n");

	while ( cursor < limit ) {
		// Q: What do these intrinsic functions do? 
		// (See gcc intrinsics documentation)
	
		// A: Signed Saturating Doubling Multiply returning High Half
		// Multiply cursor vector and vol_int vector(a vector with all 
		// vol_int), double the result, then place the highest 16 bits
		// to the cursor. 
		vst1q_s16(cursor, vqdmulhq_s16(vld1q_s16(cursor), vdupq_n_s16(vol_int)));

		// Q: Why is the increment below 8 instead of 16 or some other value?
		// A: In AArch64 Advanced SIMD, there are thirty two 128-bit wide vector register, here is (16 bits)*8(lanes)=128 bits
		// It will execute 8 lanes in parallel per time and then move to next 8 lines. 

		// Q: Why is this line not needed in the inline assembler version
		// Because the "str q0, [%[cursor]],#16" in inline assembler version will do post-increment cursor by 16 bytes
		// of this program?
		cursor += 8;
	}

// --------------------------------------------------------------------

	printf("Summing samples.\n");
	for (x = 0; x < SAMPLES; x++) {
		ttl=(ttl+data[x])%1000;
	}

	// Q: Are the results usable? Are they accurate?
	// A: It is same as the inline version result
	// And its sys time is faster than inline version.
	printf("Result: %d\n", ttl);

	return 0;

}

Result for C Intrinstic version

$ time ./vol_intrinsics 
Generating sample data.
Scaling samples.
Summing samples.
Result: 930

real	0m0.519s
user	0m0.508s
sys	0m0.010s

Software Optimization - SIMD Practice (C Intrinsics)

SIMD C Intrinstics

Comments