Using NEON for Parallel Data Processing Zynq-7000 Hardware Architecture Speaker: Leon Qin Title: Processor Specialist Date: Oct, 2012 © Copyright 2009Xilinx Xilinx Copyright 2012
Zynq-7000 Block Diagram Processing System
2x I2C
ARM® CoreSight™ Multi-core & Trace Debug
2x CAN 2x UART I/O MUX
AMBA® Switches
GPIO
NEON™/ FPU Engine
S_AXI_HP2
Cortex™-A9 MPCore™ MIO 32/32 KB I/D Caches
Cortex™-A9 MPCore™ 32/32 KB I/D Caches
S_AXI_HP3
512 KB L2 Cache
2x USB with DMA
General Interrupt Controller
Snoop Control Unit (SCU)
Timer Counters
2x GigE with DMA
S_AXI_ACP
256 KB On-Chip Memory DMA
Configuration
® Switches ® Switches AMBA AMBA
XADC
S_AXI_GP0/1
M_AXI_GP0/1
Multi-Standards I/Os (3.3V & High Speed 1.8V)
Page 2
S_AXI_HP1
NEON™/ FPU Engine
2x SDIO with DMA
EMIO
System Gates, DSP, RAM S_AXI_HP0
PCIe
Multi Gigabit Transceivers
© Copyright 2009Xilinx Xilinx Copyright 2011
Multi-Standards I/Os (3.3V & High Speed 1.8V)
AMBA® Switches
2x SPI
Programmable Logic:
Dynamic Memory Controller DDR3, DDR2, LPDDR2
Static Memory Controller Quad-SPI, NAND, NOR
ARM Architecture Evolution Key Technology Additions by Architecture Generation
Thumb-EE
Execution Environments: Improved memory use
VFPv3 ARM11
NEON™ Adv SIMD
Improved Media and DSP
Thumb®-2 ARM9
TrustZone™
ARM10 SIMD VFPv2 Thumb-2 Only
Jazelle® V5 Page 3
V6
V7 A&R © Copyright 2009Xilinx Xilinx Copyright 2011
V7 M
Why NEON?
General purpose SIMD processing useful for many applications Supports widest range multimedia codecs used for internet applications – Many soft codec standards: MPEG-4, H.264, On2 VP6/7/8, Real, AVS, … – Ideal solution for normal size „internet streaming‟ decode of various formats
Fewer cycles needed – Neon will give 60-150% performance boost on complex video codecs – Simple DSP algorithms can show larger performance boost (4x-8x) – Balance of computation and memory access is required – Processor can sleep sooner => overall dynamic power saving
Page 4
© Copyright 2009Xilinx Xilinx Copyright 2011
Why NEON?
NEON is a mature advanced SIMD technology. – SIMD exist on many 32-bit arch • PowerPC has AltiVec, while x86 has MMX/SSE/AVX
– Can significantly accelerate parallelable repetitive operations on large data sets.
Beneficial to many DSP or multimedia algorithms – Clean orthogonal vector architecture, applicable to a wide range of data intensive computation – audio, video, and image processing codecs. – Not just for codecs – also applicable to 2D & 3D graphics etc – Color-space conversion. – Physics simulations. – Error correction(such as Reed Solomon codecs, CRCs), elliptic curve cryptography, etc. Page 5
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON vs. DSP/FPGA offload NEON advantages – Easy programming & debug – Fully coherent with CPU, no cache maintenance operations – Part of ARM arch - no hardware or software integration required – Ecosystem support off-the-shelf, no porting required
DSP/FPGA advantages – Runs parallel with CPU, few CPU cycles required – More „realtime‟ - no OS/cache variability – Fixed function or limited codec support – Potentially higher performance (e.g. 1080p Full HD video)
Page 6
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON Agenda
NEON Hardware overview
NEON Instruction set overview NEON Software Support NEON improves performance
Page 7Page 7
© Copyright 2009Xilinx Xilinx Copyright 2011
What is NEON? NEON is a wide SIMD data processing architecture – Extension of the ARM instruction set – 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)
NEON Instructions perform “Packed SIMD” processing – Registers are considered as vectors of elements of the same data type – Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float – Instructions perform the same operation in all lanes
Source Source Registers Registers Dn
Elements
Dm
Dd
Lane Page 8
© Copyright 2009Xilinx Xilinx Copyright 2011
Operation Destination Register
Example SIMD instruction – Vector ADD Larger register size Register split into equal size any type elements Operation performed on same element of each register VADD.U16 D2, D1, D0 63
0x1001
+ 0xFF0
= 0x1FF1
Page 9Page 9
31
47
0x1234
+ 0x5678
= 0x68AC
0
15
0x7
+ 0xFFF8
= 0xFFFF
© Copyright 2009Xilinx Xilinx Copyright 2011
0xAB
D0
+ 0xCD
D1
= 0x178
D2
Neon Data Types NEON natively supports a set of common data types – Integer and Fixed-Point: 8-bit, 16-bit, 32-bit and 64-bit – 32-bit Single-precision Floating-point; 8 and 16-bit polynomial 8/16-bitUnsigned Signed, Signed, Unsigned Integers; Integers; Polynomials
.8
.S8 .U8
.I8 .P8
.16
.S16 .U16
.I16 .P16
32-bit Signed, Unsigned Integers; Floats
.32
.S32 .U32
.I32 .F32
.64
.I64
.S64 .U64
64-bit Signed, Unsigned Integers;
Data types are represented using a bit-size and format letter VADD.U16 D2, D1, D0
Not all data types available in all sizes Page 10
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON Registers NEON provides a 256-byte register file – Distinct from the core registers – Extension to the VFPv2 register file (VFPv3) D0
Two explicitly aliased views
D1
– 32 x 64-bit registers (D0-D31)
D2
– 16 x 128-bit registers (Q0-Q15)
D3
Enables register trade-off
:
– Vector length
D30
– Available registers
D31
Q0
Q1
:
Q15
Also uses the summary flags in the VFP FPSCR – Adds a QC integer saturation summary flag – No per-lane flags, so „carry‟ handled using wider result (16bit+16bit -> 32-bit)
Page 11
© Copyright 2009Xilinx Xilinx Copyright 2011
Vectors and Scalars Registers hold one or more elements of the same data type – Vn can be used to reference either a 64-bit Dn or 128-bit Qn register – A register, data type combination describes a vector of elements 63
0
127
0
Qn
Dn D0
I64 S32
S32
D7
F32
F32
F32
S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8
64-bit
128-bit
Some instructions can reference individual scalar elements – Scalar elements are referenced using the array notation Vn[x] F32
Q0[3]
F32
Q0[2]
F32
F32
Q0[1]
Q0
Q0[0]
Array ordering is always from the least significant bit Page 12
F32
© Copyright 2009Xilinx Xilinx Copyright 2011
Q0 Q7
NEON Agenda
NEON Hardware overview
NEON Instruction set overview NEON Software Support NEON improves performance
Page 13 Page 13
© Copyright 2009Xilinx Xilinx Copyright 2011
Instruction Syntax V{
}{}{}{.}(}, src1, src2 - Instruction Modifiers Q indicates the operation uses saturating arithmetic (e.g. VQADD) H indicates the operation halves the result (e.g. VHADD) D indicates the operation doubles the result (e.g. VQDMUL) R indicates the operation performs rounding (e.g. VRHADD)
- Instruction Operation (e.g. ADD,MUL, MLA, MAX, SHR, SHL, MOV) - Shape L – The result is double the width of both operands W – The result and first operand are double the width of the last operand N – The result is half the width of both operands
- Conditional, used with IT instruction <.dt> - Data type - Destination, - Source operand 1, Page 14
© Copyright 2009Xilinx Xilinx Copyright 2011
Instruction Modifiers and Shapes Q Modifier – Saturating instructions that saturate set the “Cumulative saturation” flag (QC bit) in the Floating-point Status and Control Register (FPSCR) – The QC flag is sticky • Use VMRS and VMSR instructions to read and to clear the flag
H Modifier – Halves the result – Can only be used on addition and subtraction instructions • VHADD, VHSUB and VRHADD
D Modifier – Only available for saturating variants “long” and “high half” multiplies – VQDMLAL, VQDMLSL, VQDMULH, VQDMULL, VQRDMULH
R Modifier – Always “Round to Nearest”, as defined in the IEEE 754 standard – Available on instructions that include a right shift • Including Halving and “high half” instructions Page 15
© Copyright 2009Xilinx Xilinx Copyright 2011
Instruction Shapes Long and Wide – Input elements are promoted before operation Narrow – Input elements are demoted before operation L Shape Dn Dm
Qd
N Shape
W Shape Qn
Qn
Dm
Qm
Qd
Page 16
Dd
© Copyright 2009Xilinx Xilinx Copyright 2011
Multiple 1-Element Structure Access VLD1, VST1 provide standard array access – An array of structures containing a single component is a basic array – List can contain 1, 2, 3 or 4 consecutive registers – Transfer multiple consecutive 8, 16, 32 or 64-bit elements [R1] +2
x1
x0
+4
x2
+2
x1
+6
x3
+4
x2
+8
x4
+6
x3
+10
x5
+12
x6
+14
x7
[R4] +R3
x0
:
x3
x2
x1
D 7
x0
VLD1.16 {D7}, [R4], R3
:
x3
x2
x1
x0
x7
x6
x5
x4
VST1.16 {D3,D4}, [R1]
Page 17
© Copyright 2009Xilinx Xilinx Copyright 2011
D 3 D 4
Addition: Basic NEON supports various useful forms of basic addition
Normal Addition - VADD, VSUB – Floating-point – Integer (8-bit to 64-bit elements) – 64-bit and 128-bit registers
Long Addition - VADDL, VSUBL – Promotes both inputs before operation – Signed/unsigned (8-bit to 32-bit source elements)
VADD.I16
D0,
D1,
D2
VSUB.F32
Q7,
Q1,
Q4
VADD.I8
Q15, Q14, Q15
VSUB.I64
D0,
D30, D5
VADDL.U16
Q1, D7, D8
VSUBL.S32
Q8, D1, D5
VADDW.U8
Q1, Q7, D8
VSUBW.S16
Q8, Q1, D5
Wide Addition - VADDW, VSUBW – Promotes one input before operation – Signed/unsigned (8-bit 32-bit source elements)
Page 18
© Copyright 2009Xilinx Xilinx Copyright 2011
Example – adding all lanes Input in Q0 (D0 and D1) u16 input values
DO
D1
DO
D1
VPADDL.U16
Q0,
Q0
DO
Now Q0 contains 4x u32 values (with 15 headroom bits)
D1 DO
Reducing/folding operation needs 1 bit of headroom
VPADD.U32
D0, D0, D1
DO DO
Result is u64 in D0 Page 19
VPADDL.U32 © Copyright 2009Xilinx Xilinx Copyright 2011
D0,
D0
Summing a vector + + +
+
+
+
+
+
+
+ + DO
+ +
DO
+ +
DO
+ Page 20
D1
© Copyright 2009Xilinx Xilinx Copyright 2011
Some NEON clever features
Some NEON clever features
Page 21
© Copyright 2009Xilinx Xilinx Copyright 2011
Data Movement: Table Lookup Uses byte indexes to control byte look up in a table – Table is a list of 1,2,3 or 4 adjacent registers 11
0
p
o
n
m
l
4
k
l
8
j
e
13
i
i
26
h
n
8
g
0
0
f
i
D3
3
e
a
d
d
c
b
D0
VTBL.8 D0, {D1, D2}, D3
VTBL : out of range indexes generate 0 result VTBX : out of range indexes leave destination unchanged Page 22
© Copyright 2009Xilinx Xilinx Copyright 2011
a
{D1,D2}
Element Load Store Instructions All treat memory as an array of structures (AoS) – SIMD registers are treated as structure of arrays (SoA) – Enables interleaving/de-interleaving for efficient SIMD processing – Transfer up to 256-bits in a single instruction x3 z2 y2 x2 z1 y1 x1 z0 y0 x0 element
3-element structure
Three forms of Element Load Store instructions are provided Forms distinguished by type of register list provided – Multiple Structure Access e.g. {D0, D1} – Single Structure Access e.g. {D0[2], D1[2]} – Single Structure Load to all lanes e.g. {D0[], D1[]}
Page 23
© Copyright 2009Xilinx Xilinx Copyright 2011
Multiple 2-Element Structure Access VLD2, VST2 provide access to multiple 2-element structures – List can contain 2 or 4 registers – Transfer multiple consecutive 8, 16, or 32-bit 2-element structures
[R3]
x0 +2
y0
x0
+4
x1
+2
y0
+6
y1
+4
x1
+8
x2
+6
y1
+8
x2
+10
y2
+12
x3
+14
y3
[R1]
:
x3
x2
x1
x0
y3
y2
y1
y0
VLD2.16 {D2,D3}, [R1] Page 24
D 0 +12 x3 x7 x6 x5 x4 D : 1 +28 y3 y2 y1 y0 x7 D 2 +30 y7 y7 y6 y5 y4 D : 3 VLD2.16 {D0,D1,D2,D3}, [R3]!
!
D 2 D 3
+10
© Copyright 2009Xilinx Xilinx Copyright 2011
y2
x3
x2
x1
x0
Multiple 3/4-Element Structure Access VLD3/4, VST3/4 provide access to 3 or 4-element structures – Lists contain 3/4 registers; optional space for building 128-bit vectors – Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures [R1]
x0
+2
y0
+4
z0
+6
x1
+8
y1
+10
z1
+12
x2
: +20
y3
+22
z3
[R1]
! x3
x2
x1
x0
y3
y2
y1
y0
z3
z2
z1
z0
:
D 3 D 4 D 5
VST3.16 {D3,D4,D5}, [R1] Page 25
© Copyright 2009Xilinx Xilinx Copyright 2011
x0
+2
y0
+4
z0
+6
x1
+8
y1
D 0 +10 z1 D 1 +12 x2 y3 y2 y1 y0 D : 2 +20 y3 D 3 +22 z3 z3 z2 z1 z0 D : 4 VLD3.16 {D0,D2,D4}, [R1]! x3
x2
x1
x0
Logical NEON supports bitwise logical operations VAND, VBIC, VEORR, VORN, VORR – Bitwise logical operation – Independent of data type
VAND
D0,
D0,
D1
VORR
Q0,
Q1,
Q15
VEOR
Q7,
Q1,
Q15
VORN
D15, D14, D1
VBIC
D0,
D30, D2
– 64-bit and 128-bit registers
D0
VBIT, VBIF, VBSL
D1
– Bitwise multiplex operations
0 1 0 1 1 0
D2
– Insert True, Insert False, Select
– 3 versions overwrite different registers
D1
– 64-bit and 128-bit registers – Used with masks to provide selection Page 26
© Copyright 2009Xilinx Xilinx Copyright 2011
VBIT D1, D0, D2
NEON instruction summary A comprehensive set of data prcoessing instructions Form a general purpose SIMD instruction set suitable for compilers
NEON operations fall in to the following categories Addition / Subtraction ( Saturating, Halving, Rounding) MIN, MAX, NEG, MOV, ABS, ABD, … Multiplication (MUL, MLA, MLS, …)
Comparison and Selection Logic (AND , ORR, EOR, BIC, ORN, …) Bitfield Reciprocal Estimate/Step, Reciprocal Square Root Estimate/Step
Miscellaneous (DUP, EXT, CLZ, CLS, TBL, REV, ZIP. TRN, …)
Many more… Page 27
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON instruction reference Official NEON instruction Set reference is “Advanced SIMD” in ARM Architecture Reference Manual v7 A & R edition
Available to partners on www.arm.com
Page 28
© Copyright 2009Xilinx Xilinx Copyright 2011
Further Reading Documentation – ARM “ARM” v7 A&R – “NEON Support in the Realview compiler” white paper – “NEON optimizations in Android” white paper – Realview Compiler Guide (for intrinsics) – ARM Cortex-A Programmers‟ Guide (from www.arm.com downloads) – Software blogs on www.arm.com
Page 29
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON Agenda
NEON Hardware overview
NEON Instruction set overview NEON Software Support NEON improves performance
Page 30 Page 30
© Copyright 2009Xilinx Xilinx Copyright 2011
How to use NEON NOEN optimized Open source Libraries – OpenMAX DL (Development Layer): APIs contain a comprehensive set of audio, video and imaging functions that can be used for a wide range of accelerated codec functionality such as MPEG-4, H.264, MP3, AAC and JPEG. – Broad open source support for NEON
Vectorizing Compilers – Exploits NEON SIMD automatically with existing C source code
NEON Intrinsics – C function call interface to NEON operations – Supports all data types and operations supported by NEON
Assembler Code – For those who really want to optimize at the lowest level
Page 31
© Copyright 2009Xilinx Xilinx Copyright 2011
OpenMAX DL v1.0 Library Summary Video Domain – MPEG-4 simple profile
Audio Domain
– H.264 baseline
Still Image Domain
MP3 AAC
Signal Processing Domain
– JPEG
Image Processing Domain – Colorspace conversion – De-blocking / de-ringing
FIR IIR FFT Dot Product
– Rotation, scaling, compositing Spec from: http://www.khronos.org/openmax/
Opensource implementation for ARM11 & NEON available from: http://www.arm.com/zh/community/multimedia/standards-apis.php NOTE: OpenMax DL provides low level data processing functions, not the complete codecs Page 32
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON in opensource Google – WebM – 11,000 lines NEON assembler! Bluez – official Linux Bluetooth protocol stack – NEON sbc audio encoder
Pixman (part of cairo 2D graphics library) – Compositing/alpha blending
ffmpeg – libavcodec – LGPL media player used in many Linux distros – NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3, Theora – NEON Audio: AAC, Vorbis, WMA
x264 – Google Summer Of Code 2009 – GPL H.264 encoder – e.g. for video conferencing
Android – NEON optimizations – Skia library, S32A_D565_Opaque 5x faster using NEON
Eigen2 – C++ vector math / linear algebra template library Theorarm – libtheora NEON version (optimized by Google) libjpeg – optimized JPEG decode (IJG library) FFTW – NEON enabled FFT library
LLVM – code generation backend used by Android Renderscript Copyright 2011 Xilinx
Page 33
© Copyright 2009 Xilinx
Automatic Vectorization: ARM Compiler Automatic vectorization can generate code targeted for NEON from ordinary C source code – Less effort to produce efficient code – Portable - no compiler-specific source code features need to be used
To enable automatic vectorization on armcc, use these options together:
--vectorize
- enable vectorization
--cpu Cortex-A9
- provide a CPU option with NEON support
-O2 or -O3
- select high optimization level.
-Otime
- optimize for speed over space
--fpmode fast
- the precision of vectorized floating-point operations is the same as VFP RunFast mode
--diag_warning=optimizations to obtain useful diagnostics from the compiler on what it could or could not optimize/vectorize
Note: – When you specify --vectorize, automatic vectorization is enabled only if you also specify Otime and an optimization level of -O2 or -O3. Page 34
© Copyright 2009Xilinx Xilinx Copyright 2011
Automatic Vectorization: GNU tools To enable automatic vectorization on GCC/g++, use these options together: -mcpu=cortex-a9
-Specify a suitable ARMv7-A processor
-mfpu=neon
- enable NEON support
-ftree-vectorize
- support SIMD on many arch - O3 implies -ftree-vectorize
-mvectorize-with-neon-quad
- By default, GCC 4.4 only vectorize for doubleword
-mfloat-abi=softfp
-Can use “hard” for more efficient floating point parameter passing, but all code must be compiled with this option
Understand more with -ftree-vectorizer-verbose Takes an integer value specifying the level of detail to provide, where 1 enables additional printouts and higher values add even more information. What vectorization the compiler is performing, or what is unable to perform because of possible dependencies Page 35
© Copyright 2009Xilinx Xilinx Copyright 2011
Automatic Vectorization - how it works 2. Unroll the loop to the appropriate number of iterations, and perform other transformations like pointerization
void add_int(int * __restrict pa, int * __restrict pb, unsigned int n, int x) { unsigned int i; for(i = 0; i < (n & ~3); i++) pa[i] = pb[i] + x; }
void add_int(int *pa, int *pb, unsigned n, int x) { unsigned int i; for (i = ((n & ~3) >> 2); i; i--) { *(pa + 0) = *(pb + 0) + x; *(pa + 1) = *(pb + 1) + x; *(pa + 2) = *(pb + 2) + x; *(pa + 3) = *(pb + 3) + x; pa += 4; pb += 4; } }
1. Analyze each loop: Are pointer accesses safe for vectorization? What data types are being used? How do they map onto NEON vector registers? How many loop iterations are there? 3. Map each unrolled operation onto a NEON vector lane, and generate corresponding NEON instructions Page 36
© Copyright 2009Xilinx Xilinx Copyright 2011
ARM RVDS & gcc vectorising compiler
int a[256], b[256], c[256]; foo () { int i; for (i=0; i<256; i++){ a[i] = b[i] + c[i]; } }
Page 37
armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c
|L1.16| VLD1.32 {d0,d1},[r0]! SUBS r3,r3,#1 VLD1.32 {d2,d3},[r1]! VADD.I32 q0,q0,q1 VST1.32 {d0,d1},[r2]! BNE |L1.16|
.L2:
gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c
© Copyright 2009Xilinx Xilinx Copyright 2011
add r1, r0, ip add r3, r0, lr add r2, r0, r4 add r0, r0, #8 cmp r0, #1024 fldd d7, [r3, #0] fldd d6, [r2, #0] vadd.i32 d7, d7, d6 fstd d7, [r1, #0] bne .L2
Tuning C/C++ Code for Vectorizing The goal is to try to make the code simple, straight forward, and parallel, so that the compiler can easily convert the code to NEON assembly Loops can be modified for better vectorizing: – Short, simple loops work the best (even if it means multiple loops in your code) – Avoid breaks / loop-carried dependencies / conditions inside loops – Try to make the number of iteration a power of 2 – Try to make sure the number of iteration is known to the compiler
– Functions called inside a lop should be inlined
Pointer issues: – Using arrays with indexing vectorizes better than using pointer – Indirect addressing (multiple indexing or de-reference) doesn‟t vectorize
– Use __restricet key word to tell the compiler that pointers does not reference overlapping areas of memory
Use suitable data types – For best performance, always use the smallest data type that can hold the required values Page 38
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON Intrinsics Available in armcc, GCC/g++, and llvm. Syntax is the same, so source code that uses intrinsics can be compiled by any of these compilers. Advantage: – Provide low-level access to NEON instructions. Compiler do hard work like: Register allocation. Code scheduling, or re-ordering instructions. The C compilers can reorder code to ensure the minimum number of stalls according to a specific processor.
Disadvantage: – Possibly the compiler output is not exactly the code you want, so there is still some possibility of improvement when moving to NEON assembler code.
Page 39
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON Intrinsics - Example Include intrinsics header file #include Use special NEON data types which correspond to D and Q registers, e.g. int8x8_t D-register containing 8x 8-bit elements int16x4_t D-register containing 4x 16-bit elements int32x4_t Q-register containing 4x 32-bit elements Use special intrinsics versions of NEON instructions vin1 = vld1q_s32(ptr); vout = vaddq_s32(vin1, vin2); vst1q_s32(vout, ptr); Strongly typed! – Use vreinterpret_s16_s32( ) to change the type
Page 40
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON Intrinsics - Example NEON intrinsic #include uint32x4_t double_elements(uint32x4_t input) { return(vaddq_u32(input, input)); }
Command line with GCC arm-none-linux-gnueabi-gcc -mfpu=neon intrinsic.c
Command line with RVCT armcc --cpu=Cortex-A9 intrinsic.c
Page 41
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON Intrinsics : reference For information about the intrinsic functions and vector data types, see the: RealView Compilation Tools Compiler Reference Guide, available from – http://infocenter.arm.com
GCC documentation, available from – http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
Page 42
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON Assembler Code Advantage: – Careful hand-scheduling is recommended to get the best out of any NEON assembler code you write, especially for performance-critical applications
Disadvantage: – Need to be aware of some underlying hardware features, like pipelining and scheduling issues, memory access behavior and scheduling hazards. – Optimization is processor dependent.
Page 43
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON Assembler Code - Example Gas .text .arm .global double_elements double_elements: vadd.i32 q0,q0,q0 bx lr .end Command: arm-none-linux-gnueabi-as -mfpu=neon asm.s
RVCT AREA RO, CODE, READONLY ARM EXPORT double_elements double_elements VADD.I32 Q0, Q0, Q0 BX LR END Command: armasm --cpu=Cortex-A9 asm.s Page 44
© Copyright 2009Xilinx Xilinx Copyright 2011
Enabling NEON before using NEON data engine is disabled at reset – Needs to be enabled in software before use
Enabling NEON requires 2 steps 1.Enable access to coprocessors 10 and 11 and allow Neon instructions
MRC ORR MCR ISB
p15, 0x0, r0, c1, c0, 2 r0, r0, #(0x0f << 20) p15, 0x0, r0, c1, c0, 2
; Read CP15 CPACR ; Full access rights ; Write CP15 CPACR
2.Enable NEON and VFP MOV r0, #0x40000000 ; set bit 30 VMSR FPEXC, r0 ; write r0 to Floating Point Exception ; Register Page 45
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON Agenda
NEON Hardware overview
NEON Instruction set overview NEON Software Support NEON improves performance
Page 46 Page 46
© Copyright 2009Xilinx Xilinx Copyright 2011
NEON in Audio FFT: 256-point, 16-bit signed complex numbers – FFT is a key component of AAC, Voice/pattern recognition etc. – Hand optimized assembler in both cases
FFT time
No NEON (v6 SIMD asm)
With NEON (v7 NEON asm)
Cortex-A8 500MHz Actual silicon
15.2 us
3.8 us (x 4.0 performance)
Extreme example: FFT in ffmpeg: 12x faster – C code -> handwitten asm
– Scalar -> vector processing – Single-precision floating point on Cortex-A8 (VFPlite -> NEON)
Page 47
© Copyright 2009Xilinx Xilinx Copyright 2011
ARM RVDS vectorizing compiler • RVDS 4.0 professional includes auto-vectorizing armcc – armcc --vectorize --cpu=Cortex-A8 x.c • Up to 4x performance increase for benchmarks, with no source code changes
(no source code changes are permitted for benchmarking) ARM vs NEON (Vectorize) on Cortex-A8 169%
170% 135%
120%
100%
100%
70%
Improved vectorization in latest RVDS 4.0
20% Telecom
Consumer
ARM
NEON
• Simple source code changes can yield significant improvements above this – Use C „__restrict‟ keyword to work around C pointer aliasing issues – Make loops clearly multiple of 2n (e.g. use 4*n as loop end) to aid vectorization Page 48
© Copyright 2009Xilinx Xilinx Copyright 2011
FFmpeg/libav performance ffmpeg performance (relative to realtime)
git.ffmpeg.org snapshot 21-Sep-09
3 2.5 2
YouTube HQ video decode 480x270, 30fps Including AAC audio Real silicon measurements
v7vfp
1.5
v7neon
1 0.5 0 Cortex-A8 256KB L2 500MHz
– OMAP3 Beagleboard – ARM A9TC
NEON ~2x overall performance
Page 49
© Copyright 2009Xilinx Xilinx Copyright 2011
Cortex-A9 512KB L2 400MHz
AC3 decode MHz Using ffmpeg (LGPL opensource codec) with extensive NEON optimizations Dolby reference code also available optimized from Dolby and other vendors 60.0
50.7
46.4 50.0
MHz required (lower is better)
40.0 22.4 30.0 blade_runner (48KHz 192Kbit/s stereo) 20.0
16.5
Broadway-5.1-48KHz-448kbit.ac3
17.8
8.2 10.0
ffmpeg from git.libav.org Checkout from 14-Nov-2011 ./configure --extra-cflags=“-mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp” Benchmarked on 500MHz Cortex-A9 (ARM Versatile Express) Samples from: samples.mplayerhq.hu/A-codecs
0.0 v7neon v7fpu v7fpu --disableasm
Page 50
© Copyright 2009Xilinx Xilinx Copyright 2011
JPEG decode android_external_jpeg optimized for v6 and v7neon
16Mpixel digital camera test image - takes only 1.7s to decode on 500MHz A9+NEON (improved from 4.5s unoptimized) cycles/pixel - lower is better 160
140
140 120
83
100 80
52
60 40
20
Code from: https://github.com/mansr ARM Versatile Express 500MHz Cortex-A9 Ubuntu 11.04 Linux djpeg -outfile /dev/null testfile.jpg
0 djpeg
Page 51
djpeg.v6.opt
djpeg.v7neon.opt
© Copyright 2009Xilinx Xilinx Copyright 2011
Also libjpeg-turbo via Linaro http://libjpeg-turbo.virtualgl.org/
NEON Summary NEON will become standard on general purpose apps & mediacentric devices. – NEON option across the Cortex-A roadmap – NEON ideal for use by open OS systems with downloaded apps
Full enabling technology to support Neon – Compilers, profilers, debuggers, libraries all available now – Key differentiator: easy to program, popular with software engineers
Strong ARM NEON ecosystem
Complementary to DSP/FPGA Page 52
© Copyright 2009Xilinx Xilinx Copyright 2011
Zynq-7000 Hardware Architecture
© Copyright 2009Xilinx Xilinx Copyright 2012