Software Optimization - SIMD Practice (C Inline)

SIMD C Inline
Example:
Lab
Result

SIMD C Inline

First, let’s look at how C inline assembler works.

The syntax of a C inline assembly statement is

__asm__ ("assembly code template" : outputs : inputs : clobbers);

It has 4 parameters:

The assembler template (mandatory)
Output operands (optional)
Input operands (optional)
Clobbers: registers or memory that the code overwrites (optional)

Example:

#include <stdio.h>

int main() {
	int a = 3;
	int b = 19;
	int c;

	// __asm__ ("assembly code template" : outputs : inputs : clobbers)
	__asm__ ("add %0, %1, %2"  // %0 = %1 + %2
        : "=r"(c)              // output: the value in temp register %0 is moved to c
        : "r"(a),"r"(b) );     // inputs: the values of a and b are placed in temp registers %1 and %2

	printf("%d\n", c);         // prints 22
	return 0;
}

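To calculate c = b % a in AArch64 C inline assembly, we can divide and then multiply-subtract:

    // %w selects the 32-bit W view of each register, which matches the int operands
    __asm__("udiv %w0, %w1, %w2" : "=r"(c) : "r"(b), "r"(a) );                 //c = b/a = 19/3 = 6
    __asm__("msub %w0, %w1, %w2, %w3" : "=r"(c) : "r"(c), "r"(a), "r"(b)  );   //c = b-(c*a) = 19-(6*3) = 1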
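
The two instructions above can be wrapped in a small function. The following is a minimal sketch, not part of the original lab: the name remainder_u32 is hypothetical, the operands are assumed to be unsigned 32-bit values (udiv is an unsigned divide), and a plain C fallback is included so the same file also builds on machines that are not AArch64.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: returns b % a for unsigned 32-bit values.
static uint32_t remainder_u32(uint32_t b, uint32_t a) {
    assert(a != 0);  // reject a zero divisor so both code paths agree
#if defined(__aarch64__)
    uint32_t q, r;
    // q = b / a (unsigned divide on the 32-bit W registers)
    __asm__("udiv %w0, %w1, %w2" : "=r"(q) : "r"(b), "r"(a));
    // r = b - (q * a); msub computes %3 - (%1 * %2)
    __asm__("msub %w0, %w1, %w2, %w3" : "=r"(r) : "r"(q), "r"(a), "r"(b));
    return r;
#else
    // Portable fallback that follows the same two steps in plain C.
    uint32_t q = b / a;
    return b - q * a;
#endif
}
```

Calling remainder_u32(19, 3) follows the same arithmetic as the comments above: 19 / 3 = 6, then 19 - (6 * 3) = 1.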

That is a quick start to AArch64 assembly language and C inline assembly.

Lab

// vol_inline.c :: volume scaling in C using AArch64 SIMD
// Chris Tyler 2017.11.29-2019.10.02 - Licensed under GPLv3.
// For the SIMD lab in the Seneca College SPO600 Course

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"

int main() {

	int16_t*		data;		// input array
	int16_t*		limit;		// end of input array

	// these variables will be used in our assembler code, so we're going
	// to hand-allocate which register they are placed in
	// Q: what is an alternate approach?
	// A: Let the compiler choose the registers. Declare ordinary local variables
	//    and pass them to the asm statement as "r" operands, so no register
	//    has to be named by hand.
	
    register int16_t*	cursor 		asm("r20");	// input cursor
	register int16_t	vol_int		asm("r22");	// volume as int16_t

	int			x;		// array iterator
	int			ttl =0 ;	// array total

	data=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

	srand(-1);
	printf("Generating sample data.\n");
	for (x = 0; x < SAMPLES; x++) {
		data[x] = (rand()%65536)-32768;
	}

// --------------------------------------------------------------------

	cursor = data;
	limit = data+ SAMPLES ;

	// set vol_int to fixed-point representation of 0.75
	// Q: should we use 32767 or 32768 in next line? why?
	// A: 32767. An int16_t holds -32768 to 32767, so a volume of 1.0 scaled by
	//    32768 would not fit; scaling by 32767 keeps every factor from 0.0 to 1.0 in range.
	vol_int = (int16_t) (0.75 * 32767.0);

	printf("Scaling samples.\n");

	// Q: what does it mean to "duplicate" values in the next line?
	// A: dup copies the single vol_int value into all 8 lanes of vector register v1
	//    (16 bits, one halfword, per lane), so every sample is multiplied by the same factor
	__asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

	while ( cursor < limit ) {
		__asm__ (
			"ldr q0, [%[cursor]], #0 	\n\t"
			// load eight samples into q0 (v0.8h)
			// from [cursor]

			"sqdmulh v0.8h, v0.8h, v1.8h	\n\t"
			// multiply each lane in v0 by v1*2
			// saturate results
			// store upper 16 bits of results into
			// the corresponding lane in v0
		
			// Q: Why is #16 included in the str line
			// but not in the ldr line?	
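			// A: The load and the store must use the same address, so ldr leaves
			//    cursor unchanged (#0) and str post-increments it by 16 bytes
			//    (eight 16-bit samples, one 128-bit register) after writing back.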

			"str q0, [%[cursor]],#16		\n\t"
			// store eight samples to [cursor]
			// post-increment cursor by 16 bytes
			// and store back into the pointer register

			// Q: What do these next three lines do?
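			// A: They are the operand lists of the asm statement: outputs, inputs,
			//    and clobbers. "+r" marks cursor as both read and written, and
			//    "memory" tells the compiler that the code modifies memory.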
			: [cursor]"+r"(cursor) //output operand
			: "r"(cursor)          //input operand 
			: "memory"             //clobber, overwrite memory
			);
	}

// --------------------------------------------------------------------

	printf("Summing samples.\n");
	for (x = 0; x < SAMPLES; x++) {
		ttl=(ttl+data[x])%1000;
	}

	// Q: are the results usable? are they correct?
	// A: The result differs from the one vol1 prints, which suggests that the
	//    scaled samples are not identical between the two versions.

	printf("Result: %d\n", ttl);

	return 0;

}
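
To make the fixed-point arithmetic in the lab easier to follow, here is a scalar sketch of what dup and sqdmulh do to each 16-bit lane. It is an illustration only and not part of the original lab: the helper names (volume_to_fixed, dup_lanes, sqdmulh_lane, scale_samples) are hypothetical, and the loop handles one sample at a time instead of eight per instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8  // a 128-bit vector register holds eight 16-bit lanes

// Convert a volume in [0.0, 1.0] to a signed 16-bit fixed-point factor,
// using the same expression the lab uses for vol_int.
static int16_t volume_to_fixed(double volume) {
    assert(volume >= 0.0 && volume <= 1.0);
    return (int16_t)(volume * 32767.0);
}

// Model of "dup v1.8h, w0": copy one value into every lane.
static void dup_lanes(int16_t lanes[LANES], int16_t value) {
    for (int i = 0; i < LANES; i++) {
        lanes[i] = value;
    }
}

// Model of one lane of sqdmulh: double the product, saturate it to
// 32 bits, and keep the upper 16 bits.
static int16_t sqdmulh_lane(int16_t a, int16_t b) {
    int64_t doubled = 2 * (int64_t)a * (int64_t)b;
    if (doubled > INT32_MAX) {
        doubled = INT32_MAX;  // only reachable when a == b == -32768
    }
    // Floor division by 65536 gives the upper 16 bits without relying on
    // the implementation-defined right shift of a negative value.
    int64_t high = (doubled >= 0) ? doubled / 65536
                                  : -((-doubled + 65535) / 65536);
    return (int16_t)high;
}

// Scalar stand-in for the lab's while loop over the sample array.
static void scale_samples(int16_t *data, size_t count, int16_t vol_int) {
    int16_t factor[LANES];
    dup_lanes(factor, vol_int);
    for (size_t i = 0; i < count; i++) {
        data[i] = sqdmulh_lane(data[i], factor[i % LANES]);
    }
}
```

With a volume of 0.75 the factor is 24575, and a sample of 1000 is scaled to 749 rather than 750, because the factor is slightly below 0.75 * 32768 and the low bits of the product are discarded.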

Result

The result of the gcc-compiled C version (vol1) is

$ time ./vol1
Result: -906

real	0m0.477s
user	0m0.446s
sys	0m0.030s

The result of the C inline assembly version (vol_inline) is

$ time ./vol_inline
Generating sample data.
Scaling samples.
Summing samples.
Result: 930

real	0m0.520s
user	0m0.499s
sys	0m0.020s

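We can see that the gcc-compiled version uses less user time than the C inline version (0.446s against 0.499s), while the situation for sys time is the opposite (0.030s against 0.020s). In real time, the gcc-compiled version is slightly faster overall (0.477s against 0.520s).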
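
The time command measures the whole program, including sample generation and summing. One way to compare only the scaling step is to measure CPU time around that call. The following is a minimal sketch under stated assumptions: cpu_seconds, scale_fn, and halve_samples are hypothetical names, and halve_samples is only a stand-in workload, not the lab's scaling code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

// Hypothetical signature for any routine that scales samples in place.
typedef void (*scale_fn)(int16_t *data, size_t count);

// Stand-in workload: halves every sample (not the lab's scaling code).
static void halve_samples(int16_t *data, size_t count) {
    for (size_t i = 0; i < count; i++) {
        data[i] = (int16_t)(data[i] / 2);
    }
}

// Return the CPU time, in seconds, spent in one call to fn(data, count).
// clock() reports processor time, which is closest to user + sys above.
static double cpu_seconds(scale_fn fn, int16_t *data, size_t count) {
    assert(fn != NULL);
    clock_t start = clock();
    fn(data, count);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A call such as cpu_seconds(halve_samples, data, SAMPLES) would time the stand-in; passing each real scaling routine in turn would give figures that exclude the setup and summing work.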
