ep:labs:01:contents:ex4 [2019/09/18 16:19]
radu.mantu created
— (current)
Line 1: Line 1:
-==== 04. [20p] SSE & gcc intrinsics ==== 
- 
-**Streaming SIMD Extensions** (SSE) is an x86 instruction set extension that focuses on vectorized operations. By packing several values into a single 128-bit register, the CPU is able to apply the same instruction to all of them at once.
- 
-In order to take advantage of this feature, we will use **gcc intrinsics**. Intrinsics are built-in functions that the compiler is intimately familiar with and can use in building highly optimized machine code. In fact, gcc supports two sets of built-in functions for SIMD: one native and one defined by Intel. In this lab we are going to use the latter since there is much more documentation available. In particular, we will consult the [[https://software.intel.com/sites/landingpage/IntrinsicsGuide/|Intel Intrinsics Guide]].
- 
-As we mentioned before, data is packed in 128-bit registers. However, the same 128 bits can be split into lanes of different types and widths, and this is reflected both in the declared data types and in the names of the intrinsics.
-Some examples of data types:
-  * <color blue>**%%__%%m128**</color>  : holds 4 single precision (32-bit) floating point values
-  * <color blue>**%%__%%m128i**</color> : holds integer values (16 x 8-bit, 8 x 16-bit, 4 x 32-bit or 2 x 64-bit)
-  * <color blue>**%%__%%m128d**</color> : holds 2 double precision (64-bit) floating point values
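
The three types above can be sketched in a few lines of C. This is only an illustration: ''unpack_demo'' is a hypothetical helper that is not part of sse.zip, and the lane values are arbitrary. It builds one vector of each type and copies the lanes back into ordinary arrays.

```c
#include <assert.h>     /* static_assert (C11) */
#include <emmintrin.h>  /* SSE2; also provides the SSE type __m128 */
#include <stdint.h>

/* Every SSE vector type is 128 bits wide; only the lane layout differs. */
static_assert(sizeof(__m128)  == 16, "4 x 32-bit float");
static_assert(sizeof(__m128d) == 16, "2 x 64-bit double");
static_assert(sizeof(__m128i) == 16, "integer lanes of 8/16/32/64 bits");

/* Hypothetical helper: build one vector of each type and copy the lanes
 * back to ordinary arrays so that they can be inspected. */
void unpack_demo(float f_out[4], double d_out[2], int32_t i_out[4])
{
    /* _mm_set_* list the lanes from the highest one down to lane 0 */
    __m128  f = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128d d = _mm_set_pd(2.0, 1.0);
    __m128i n = _mm_set_epi32(4, 3, 2, 1);

    /* the "u" stores accept any address, 16-byte aligned or not */
    _mm_storeu_ps(f_out, f);
    _mm_storeu_pd(d_out, d);
    _mm_storeu_si128((__m128i *)i_out, n);
}
```

After a call to ''unpack_demo'', lane 0 of each vector ends up at index 0 of the corresponding array, which is why the arguments of the ''_mm_set_*'' intrinsics appear reversed.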
- 
-The function naming convention is\\ _mm_**<intrinsic_operation>**_**<suffix>**. For example:
-  * //<color blue>%%__%%m128</color> _mm_**add**_**ps** (<color blue>%%__%%m128</color> a, <color blue>%%__%%m128</color> b)//
-    * **add** : addition operation
-    * **ps**  : **p**acked **s**ingle precision (4 floats of 4 bytes each)
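
As a minimal sketch of the naming convention, the two functions below perform the same **add** operation with two different suffixes. The helper names ''add_ps_array'' and ''add_pd_array'' are hypothetical, and the sketch assumes the element count is a multiple of the lane count instead of handling a scalar tail.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 header; includes the SSE intrinsics as well */
#include <stddef.h>

/* Hypothetical helper: dst[i] = a[i] + b[i] for floats.
 * "ps" = packed single precision: 4 floats are added per call. */
void add_ps_array(float *dst, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);  /* sketch only: no scalar tail loop */
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
}

/* Same operation, different suffix.
 * "pd" = packed double precision: 2 doubles are added per call. */
void add_pd_array(double *dst, const double *a, const double *b, size_t n)
{
    assert(n % 2 == 0);  /* sketch only: no scalar tail loop */
    for (size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(dst + i, _mm_add_pd(va, vb));
    }
}
```

The unaligned load/store variants are used here only so that the helpers work on any buffer; the operation and the suffix are the parts that follow the convention described above.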
- 
-=== [15p] Task A - Implementation === 
-Starting from the files in {{:ep:labs:sse.zip|sse.zip}}, implement //sqrt(x[]) / y[]// using SSE intrinsics. How does the execution time compare to that of the normal (scalar) implementation? Note that the data must be loaded from the //x// and //y// buffers into the 128-bit registers and the result stored back to a buffer.
- 
-=== [5p] Task B - Questions === 
-Answer the following questions: 
-    - What functions would you use to load/store the data, were the buffers not 16-byte aligned? Would it matter? 
-    - What registers are used in the code that you wrote? Where else do you usually encounter them? (Hint: **objdump**, code is compiled with **-g**)
-    - How could you further optimize the division by //y[]//? 
- 
-<solution -hidden>
-**Task A:** 
-<code C>
-/* start SSE benchmark */
-t = clock();
-/* assumes VEC_SZ is a multiple of 4 (one iteration handles 4 floats) */
-for (int i = 0; i < VEC_SZ; i += 4) {
-    /* load 128-bit vectors of packed single precision floats
-     * NOTE: _mm_load_ps requires 16-byte aligned addresses */
-    __m128 vec_x = _mm_load_ps(x + i);
-    __m128 vec_y = _mm_load_ps(y + i);
-
-    /* compute sqrt(x) / y
-     * NOTE: to save time, we multiply sqrt(x) by y's reciprocal instead
-     *       of dividing by y (which is slower but more accurate)
-     *       maximum relative error of the reciprocal < 1.5 * 2^(-12) */
-    __m128 sqrt_x  = _mm_sqrt_ps(vec_x);
-    __m128 rcp_y   = _mm_rcp_ps(vec_y);
-    __m128 vec_res = _mm_mul_ps(sqrt_x, rcp_y);
-
-    /* store result back in memory */
-    _mm_store_ps(res1 + i, vec_res);
-}
-t = clock() - t;
-
-printf("SSE operation:    %.3f\n", ((float)t) / CLOCKS_PER_SEC);
-</code>
- 
-**Task B:** 
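-  - **_mm_loadu_ps** / **_mm_storeu_ps** instead of **_mm_load_ps** / **_mm_store_ps**. It matters for correctness: the aligned variants may raise a fault (usually seen as a segmentation fault) when given an address that is not 16-byte aligned. As for speed, aligned accesses should theoretically be faster than unaligned ones. Practically (and surprisingly), the difference is negligible on more modern CPUs.
-  - If we look with objdump at the disassembled code, we see that the XMM registers are the ones used. In the System V x86-64 calling convention they are also used for passing floating point arguments to functions (when calling a variadic function such as printf, AL is set to an upper bound on the number of vector registers used).
-  - Calculate the reciprocal of //y[]// with **_mm_rcp_ps** and multiply it by the result of //sqrt(x[])//. The result is not as accurate but it is faster.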
-</solution>
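
The Task B answers on unaligned buffers and on the reciprocal trick can be checked with a small sketch. The helper names below (''sqrt_div_exact'', ''sqrt_div_rcp'', ''max_rel_err'') are hypothetical and not part of sse.zip. The sketch assumes the element count is a multiple of 4 and that no exact result is zero, and it only compares accuracy; it does not measure execution time.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* res[i] = sqrt(x[i]) / y[i] using a true division.
 * The "u" load/store variants work on buffers of any alignment. */
void sqrt_div_exact(float *res, const float *x, const float *y, size_t n)
{
    assert(n % 4 == 0);  /* sketch only: no scalar tail loop */
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(res + i, _mm_div_ps(_mm_sqrt_ps(vx), vy));
    }
}

/* Same computation, but multiplying by the approximate reciprocal of y. */
void sqrt_div_rcp(float *res, const float *x, const float *y, size_t n)
{
    assert(n % 4 == 0);  /* sketch only: no scalar tail loop */
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(res + i, _mm_mul_ps(_mm_sqrt_ps(vx), _mm_rcp_ps(vy)));
    }
}

/* Largest relative difference between two result buffers.
 * Assumes that exact[i] is never zero. */
float max_rel_err(const float *approx, const float *exact, size_t n)
{
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float err = (approx[i] - exact[i]) / exact[i];
        if (err < 0.0f)
            err = -err;
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

Passing pointers such as ''buf + 1'' to these helpers exercises the unaligned path, and comparing the two result buffers with ''max_rel_err'' gives a feel for how much accuracy the reciprocal approximation gives up.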
- 
  
ep/labs/01/contents/ex4.1568812751.txt.gz · Last modified: 2019/09/18 16:19 by radu.mantu
CC Attribution-Share Alike 3.0 Unported