Modern computers are typically 64-bit: the CPU works with 64-bit (8-byte) words, and memory is byte-addressable. Ideally, we can read or write any byte in memory by specifying its address.
The following structure is very basic and only suits a single-tasking system.
When we move to a multitasking system, we need to make sure that each program/process cannot access the memory of other programs. Memory abstraction (virtual memory) is a way to solve this problem. It also lets every process start at the same address when it launches, and lets the system predict and allocate memory space for that process.
Virtual memory is logical memory (software-based, not physical). It comprises a virtual address space that can exceed the address space of the RAM. The virtual address space is divided into pages, which are contiguous regions of memory of the same size.
A translation table (the page table) controls the mapping between virtual addresses and actual addresses in physical memory. When software tries to access a page, the page table shows where that page is located in RAM. The software can then access the data at the corresponding address in RAM.
I found a very simple explanation of the relationship between the CPU, cache, and memory.
Different caches can talk to each other to keep track of which one holds the latest data.
C intrinsics are function-like extensions to the C language. Although they look like functions, they are compiled inline. Because C intrinsics are not provided by the C language itself, code that uses them is not portable. We need to be very careful if we want to use them.
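First of all, let's look at how the C inline assembler works.
The syntax of C inline assembly is:
__asm__ ("assembly code template" : outputs : inputs : clobbers);
It has 4 parameters: the assembly code template, the output operands, the input operands, and the clobber list. For example: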
#include <stdio.h>

int main() {
    int a = 3;
    int b = 19;
    int c;
    // __asm__ ("assembly code template" : outputs : inputs : clobbers)
    __asm__ ("add %0, %1, %2"   // %0 = %1 + %2
        : "=r"(c)               // the temporary output register %0 is moved to c
        : "r"(a), "r"(b) );     // the values of a and b are placed into the temporary input registers %1 and %2
    printf("%d\n", c);
}
To calculate c = b % a in AArch64 C inline assembly, we can use:
__asm__("udiv %0, %1, %2" : "=r"(c) : "r"(b), "r"(a) ); //c = b/a = 6
__asm__("msub %0, %1, %2, %3" : "=r"(c) : "r"(c), "r"(a), "r"(b) ); //c = 19-(6*3) = 1
As discussed in the previous blog post, Software-Optimization-With-SIMD, we are going to use three methods to see the performance of SIMD vectorization.
SIMD is an acronym for “Single Instruction, Multiple Data”.
When we need to execute a long loop, under certain conditions we can use SIMD to vectorize the loop into lanes. The lanes are then processed in parallel by the same instruction, which makes the execution more efficient.