Using NEON for Parallel Data Processing - All Programmable

Author: Greg Lara Subject: The First Extensible Processing Platform Family Created Date: 10/22/2012 4:04:44 PM...

6 downloads 337 Views 905KB Size
Using NEON for Parallel Data Processing Zynq-7000 Hardware Architecture Speaker: Leon Qin Title: Processor Specialist Date: Oct, 2012 © Copyright 2009Xilinx Xilinx Copyright 2012

Zynq-7000 Block Diagram Processing System

2x I2C

ARM® CoreSight™ Multi-core & Trace Debug

2x CAN 2x UART I/O MUX

AMBA® Switches

GPIO

NEON™/ FPU Engine

S_AXI_HP2

Cortex™-A9 MPCore™ MIO 32/32 KB I/D Caches

Cortex™-A9 MPCore™ 32/32 KB I/D Caches

S_AXI_HP3

512 KB L2 Cache

2x USB with DMA

General Interrupt Controller

Snoop Control Unit (SCU)

Timer Counters

2x GigE with DMA

S_AXI_ACP

256 KB On-Chip Memory DMA

Configuration

® Switches ® Switches AMBA AMBA

XADC

S_AXI_GP0/1

M_AXI_GP0/1

Multi-Standards I/Os (3.3V & High Speed 1.8V)

Page 2

S_AXI_HP1

NEON™/ FPU Engine

2x SDIO with DMA

EMIO

System Gates, DSP, RAM S_AXI_HP0

PCIe

Multi Gigabit Transceivers

© Copyright 2009Xilinx Xilinx Copyright 2011

Multi-Standards I/Os (3.3V & High Speed 1.8V)

AMBA® Switches

2x SPI

Programmable Logic:

Dynamic Memory Controller DDR3, DDR2, LPDDR2

Static Memory Controller Quad-SPI, NAND, NOR

ARM Architecture Evolution Key Technology Additions by Architecture Generation

Thumb-EE

Execution Environments: Improved memory use

VFPv3 ARM11

NEON™ Adv SIMD

Improved Media and DSP

Thumb®-2 ARM9

TrustZone™

ARM10 SIMD VFPv2 Thumb-2 Only

Jazelle® V5 Page 3

V6

V7 A&R © Copyright 2009Xilinx Xilinx Copyright 2011

V7 M

Why NEON?

General purpose SIMD processing useful for many applications Supports widest range multimedia codecs used for internet applications – Many soft codec standards: MPEG-4, H.264, On2 VP6/7/8, Real, AVS, … – Ideal solution for normal size „internet streaming‟ decode of various formats

Fewer cycles needed – Neon will give 60-150% performance boost on complex video codecs – Simple DSP algorithms can show larger performance boost (4x-8x) – Balance of computation and memory access is required – Processor can sleep sooner => overall dynamic power saving

Page 4

© Copyright 2009Xilinx Xilinx Copyright 2011

Why NEON?

NEON is a mature advanced SIMD technology. – SIMD exist on many 32-bit arch • PowerPC has AltiVec, while x86 has MMX/SSE/AVX

– Can significantly accelerate parallelable repetitive operations on large data sets.

Beneficial to many DSP or multimedia algorithms – Clean orthogonal vector architecture, applicable to a wide range of data intensive computation – audio, video, and image processing codecs. – Not just for codecs – also applicable to 2D & 3D graphics etc – Color-space conversion. – Physics simulations. – Error correction(such as Reed Solomon codecs, CRCs), elliptic curve cryptography, etc. Page 5

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON vs. DSP/FPGA offload NEON advantages – Easy programming & debug – Fully coherent with CPU, no cache maintenance operations – Part of ARM arch - no hardware or software integration required – Ecosystem support off-the-shelf, no porting required

DSP/FPGA advantages – Runs parallel with CPU, few CPU cycles required – More „realtime‟ - no OS/cache variability – Fixed function or limited codec support – Potentially higher performance (e.g. 1080p Full HD video)

Page 6

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON Agenda

 NEON Hardware overview

 NEON Instruction set overview  NEON Software Support  NEON improves performance

Page 7Page 7

© Copyright 2009Xilinx Xilinx Copyright 2011

What is NEON?  NEON is a wide SIMD data processing architecture – Extension of the ARM instruction set – 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)

 NEON Instructions perform “Packed SIMD” processing – Registers are considered as vectors of elements of the same data type – Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float – Instructions perform the same operation in all lanes

Source Source Registers Registers Dn

Elements

Dm

Dd

Lane Page 8

© Copyright 2009Xilinx Xilinx Copyright 2011

Operation Destination Register

Example SIMD instruction – Vector ADD  Larger register size  Register split into equal size any type elements  Operation performed on same element of each register  VADD.U16 D2, D1, D0 63

0x1001

+ 0xFF0

= 0x1FF1

Page 9Page 9

31

47

0x1234

+ 0x5678

= 0x68AC

0

15

0x7

+ 0xFFF8

= 0xFFFF

© Copyright 2009Xilinx Xilinx Copyright 2011

0xAB

D0

+ 0xCD

D1

= 0x178

D2

Neon Data Types  NEON natively supports a set of common data types – Integer and Fixed-Point: 8-bit, 16-bit, 32-bit and 64-bit – 32-bit Single-precision Floating-point; 8 and 16-bit polynomial 8/16-bitUnsigned Signed, Signed, Unsigned Integers; Integers; Polynomials

.8

.S8 .U8

.I8 .P8

.16

.S16 .U16

.I16 .P16

32-bit Signed, Unsigned Integers; Floats

.32

.S32 .U32

.I32 .F32

.64

.I64

.S64 .U64

64-bit Signed, Unsigned Integers;

 Data types are represented using a bit-size and format letter VADD.U16 D2, D1, D0

 Not all data types available in all sizes Page 10

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON Registers  NEON provides a 256-byte register file – Distinct from the core registers – Extension to the VFPv2 register file (VFPv3) D0

 Two explicitly aliased views

D1

– 32 x 64-bit registers (D0-D31)

D2

– 16 x 128-bit registers (Q0-Q15)

D3

 Enables register trade-off

:

– Vector length

D30

– Available registers

D31

Q0

Q1

:

Q15

 Also uses the summary flags in the VFP FPSCR – Adds a QC integer saturation summary flag – No per-lane flags, so „carry‟ handled using wider result (16bit+16bit -> 32-bit)

Page 11

© Copyright 2009Xilinx Xilinx Copyright 2011

Vectors and Scalars  Registers hold one or more elements of the same data type – Vn can be used to reference either a 64-bit Dn or 128-bit Qn register – A register, data type combination describes a vector of elements 63

0

127

0

Qn

Dn D0

I64 S32

S32

D7

F32

F32

F32

S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8

64-bit

128-bit

 Some instructions can reference individual scalar elements – Scalar elements are referenced using the array notation Vn[x] F32

Q0[3]

F32

Q0[2]

F32

F32

Q0[1]

Q0

Q0[0]

 Array ordering is always from the least significant bit Page 12

F32

© Copyright 2009Xilinx Xilinx Copyright 2011

Q0 Q7

NEON Agenda

 NEON Hardware overview

 NEON Instruction set overview  NEON Software Support  NEON improves performance

Page 13 Page 13

© Copyright 2009Xilinx Xilinx Copyright 2011

Instruction Syntax V{}{}{}{.
}(}, src1, src2 - Instruction Modifiers Q indicates the operation uses saturating arithmetic (e.g. VQADD) H indicates the operation halves the result (e.g. VHADD) D indicates the operation doubles the result (e.g. VQDMUL) R indicates the operation performs rounding (e.g. VRHADD)

- Instruction Operation (e.g. ADD,MUL, MLA, MAX, SHR, SHL, MOV) - Shape L – The result is double the width of both operands W – The result and first operand are double the width of the last operand N – The result is half the width of both operands

- Conditional, used with IT instruction <.dt> - Data type - Destination, - Source operand 1, Page 14

© Copyright 2009Xilinx Xilinx Copyright 2011

Instruction Modifiers and Shapes Q Modifier – Saturating instructions that saturate set the “Cumulative saturation” flag (QC bit) in the Floating-point Status and Control Register (FPSCR) – The QC flag is sticky • Use VMRS and VMSR instructions to read and to clear the flag

H Modifier – Halves the result – Can only be used on addition and subtraction instructions • VHADD, VHSUB and VRHADD

D Modifier – Only available for saturating variants “long” and “high half” multiplies – VQDMLAL, VQDMLSL, VQDMULH, VQDMULL, VQRDMULH

R Modifier – Always “Round to Nearest”, as defined in the IEEE 754 standard – Available on instructions that include a right shift • Including Halving and “high half” instructions Page 15

© Copyright 2009Xilinx Xilinx Copyright 2011

Instruction Shapes Long and Wide – Input elements are promoted before operation Narrow – Input elements are demoted before operation L Shape Dn Dm

Qd

N Shape

W Shape Qn

Qn

Dm

Qm

Qd

Page 16

Dd

© Copyright 2009Xilinx Xilinx Copyright 2011

Multiple 1-Element Structure Access  VLD1, VST1 provide standard array access – An array of structures containing a single component is a basic array – List can contain 1, 2, 3 or 4 consecutive registers – Transfer multiple consecutive 8, 16, 32 or 64-bit elements [R1] +2

x1

x0

+4

x2

+2

x1

+6

x3

+4

x2

+8

x4

+6

x3

+10

x5

+12

x6

+14

x7

[R4] +R3

x0

:

x3

x2

x1

D 7

x0

VLD1.16 {D7}, [R4], R3

:

x3

x2

x1

x0

x7

x6

x5

x4

VST1.16 {D3,D4}, [R1]

Page 17

© Copyright 2009Xilinx Xilinx Copyright 2011

D 3 D 4

Addition: Basic  NEON supports various useful forms of basic addition

 Normal Addition - VADD, VSUB – Floating-point – Integer (8-bit to 64-bit elements) – 64-bit and 128-bit registers

 Long Addition - VADDL, VSUBL – Promotes both inputs before operation – Signed/unsigned (8-bit to 32-bit source elements)

VADD.I16

D0,

D1,

D2

VSUB.F32

Q7,

Q1,

Q4

VADD.I8

Q15, Q14, Q15

VSUB.I64

D0,

D30, D5

VADDL.U16

Q1, D7, D8

VSUBL.S32

Q8, D1, D5

VADDW.U8

Q1, Q7, D8

VSUBW.S16

Q8, Q1, D5

 Wide Addition - VADDW, VSUBW – Promotes one input before operation – Signed/unsigned (8-bit 32-bit source elements)

Page 18

© Copyright 2009Xilinx Xilinx Copyright 2011

Example – adding all lanes  Input in Q0 (D0 and D1)  u16 input values

DO

D1

DO

D1

VPADDL.U16

Q0,

Q0

DO

 Now Q0 contains 4x u32 values (with 15 headroom bits)

D1 DO

 Reducing/folding operation needs 1 bit of headroom

VPADD.U32

D0, D0, D1

DO DO

 Result is u64 in D0 Page 19

VPADDL.U32 © Copyright 2009Xilinx Xilinx Copyright 2011

D0,

D0

Summing a vector + + +

+

+

+

+

+

+

+ + DO

+ +

DO

+ +

DO

+ Page 20

D1

© Copyright 2009Xilinx Xilinx Copyright 2011

Some NEON clever features

Some NEON clever features

Page 21

© Copyright 2009Xilinx Xilinx Copyright 2011

Data Movement: Table Lookup  Uses byte indexes to control byte look up in a table – Table is a list of 1,2,3 or 4 adjacent registers 11

0

p

o

n

m

l

4

k

l

8

j

e

13

i

i

26

h

n

8

g

0

0

f

i

D3

3

e

a

d

d

c

b

D0

VTBL.8 D0, {D1, D2}, D3

 VTBL : out of range indexes generate 0 result  VTBX : out of range indexes leave destination unchanged Page 22

© Copyright 2009Xilinx Xilinx Copyright 2011

a

{D1,D2}

Element Load Store Instructions  All treat memory as an array of structures (AoS) – SIMD registers are treated as structure of arrays (SoA) – Enables interleaving/de-interleaving for efficient SIMD processing – Transfer up to 256-bits in a single instruction x3 z2 y2 x2 z1 y1 x1 z0 y0 x0 element

3-element structure

 Three forms of Element Load Store instructions are provided  Forms distinguished by type of register list provided – Multiple Structure Access e.g. {D0, D1} – Single Structure Access e.g. {D0[2], D1[2]} – Single Structure Load to all lanes e.g. {D0[], D1[]}

Page 23

© Copyright 2009Xilinx Xilinx Copyright 2011

Multiple 2-Element Structure Access  VLD2, VST2 provide access to multiple 2-element structures – List can contain 2 or 4 registers – Transfer multiple consecutive 8, 16, or 32-bit 2-element structures

[R3]

x0 +2

y0

x0

+4

x1

+2

y0

+6

y1

+4

x1

+8

x2

+6

y1

+8

x2

+10

y2

+12

x3

+14

y3

[R1]

:

x3

x2

x1

x0

y3

y2

y1

y0

VLD2.16 {D2,D3}, [R1] Page 24

D 0 +12 x3 x7 x6 x5 x4 D : 1 +28 y3 y2 y1 y0 x7 D 2 +30 y7 y7 y6 y5 y4 D : 3 VLD2.16 {D0,D1,D2,D3}, [R3]!

!

D 2 D 3

+10

© Copyright 2009Xilinx Xilinx Copyright 2011

y2

x3

x2

x1

x0

Multiple 3/4-Element Structure Access  VLD3/4, VST3/4 provide access to 3 or 4-element structures – Lists contain 3/4 registers; optional space for building 128-bit vectors – Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures [R1]

x0

+2

y0

+4

z0

+6

x1

+8

y1

+10

z1

+12

x2

: +20

y3

+22

z3

[R1]

! x3

x2

x1

x0

y3

y2

y1

y0

z3

z2

z1

z0

:

D 3 D 4 D 5

VST3.16 {D3,D4,D5}, [R1] Page 25

© Copyright 2009Xilinx Xilinx Copyright 2011

x0

+2

y0

+4

z0

+6

x1

+8

y1

D 0 +10 z1 D 1 +12 x2 y3 y2 y1 y0 D : 2 +20 y3 D 3 +22 z3 z3 z2 z1 z0 D : 4 VLD3.16 {D0,D2,D4}, [R1]! x3

x2

x1

x0

Logical  NEON supports bitwise logical operations  VAND, VBIC, VEORR, VORN, VORR – Bitwise logical operation – Independent of data type

VAND

D0,

D0,

D1

VORR

Q0,

Q1,

Q15

VEOR

Q7,

Q1,

Q15

VORN

D15, D14, D1

VBIC

D0,

D30, D2

– 64-bit and 128-bit registers

D0

 VBIT, VBIF, VBSL

D1

– Bitwise multiplex operations

0 1 0 1 1 0

D2

– Insert True, Insert False, Select

– 3 versions overwrite different registers

D1

– 64-bit and 128-bit registers – Used with masks to provide selection Page 26

© Copyright 2009Xilinx Xilinx Copyright 2011

VBIT D1, D0, D2

NEON instruction summary  A comprehensive set of data prcoessing instructions  Form a general purpose SIMD instruction set suitable for compilers

 NEON operations fall in to the following categories  Addition / Subtraction ( Saturating, Halving, Rounding)  MIN, MAX, NEG, MOV, ABS, ABD, …  Multiplication (MUL, MLA, MLS, …)

 Comparison and Selection  Logic (AND , ORR, EOR, BIC, ORN, …)  Bitfield  Reciprocal Estimate/Step, Reciprocal Square Root Estimate/Step

 Miscellaneous (DUP, EXT, CLZ, CLS, TBL, REV, ZIP. TRN, …)

 Many more… Page 27

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON instruction reference  Official NEON instruction Set reference is “Advanced SIMD” in ARM Architecture Reference Manual v7 A & R edition

 Available to partners on www.arm.com

Page 28

© Copyright 2009Xilinx Xilinx Copyright 2011

Further Reading  Documentation – ARM “ARM” v7 A&R – “NEON Support in the Realview compiler” white paper – “NEON optimizations in Android” white paper – Realview Compiler Guide (for intrinsics) – ARM Cortex-A Programmers‟ Guide (from www.arm.com downloads) – Software blogs on www.arm.com

Page 29

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON Agenda

 NEON Hardware overview

 NEON Instruction set overview  NEON Software Support  NEON improves performance

Page 30 Page 30

© Copyright 2009Xilinx Xilinx Copyright 2011

How to use NEON NOEN optimized Open source Libraries – OpenMAX DL (Development Layer): APIs contain a comprehensive set of audio, video and imaging functions that can be used for a wide range of accelerated codec functionality such as MPEG-4, H.264, MP3, AAC and JPEG. – Broad open source support for NEON

Vectorizing Compilers – Exploits NEON SIMD automatically with existing C source code

NEON Intrinsics – C function call interface to NEON operations – Supports all data types and operations supported by NEON

Assembler Code – For those who really want to optimize at the lowest level

Page 31

© Copyright 2009Xilinx Xilinx Copyright 2011

OpenMAX DL v1.0 Library Summary  Video Domain – MPEG-4 simple profile

 Audio Domain  

– H.264 baseline

 Still Image Domain

MP3 AAC

 Signal Processing Domain    

– JPEG

 Image Processing Domain – Colorspace conversion – De-blocking / de-ringing

FIR IIR FFT Dot Product

– Rotation, scaling, compositing Spec from: http://www.khronos.org/openmax/

Opensource implementation for ARM11 & NEON available from: http://www.arm.com/zh/community/multimedia/standards-apis.php NOTE: OpenMax DL provides low level data processing functions, not the complete codecs Page 32

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON in opensource Google – WebM – 11,000 lines NEON assembler! Bluez – official Linux Bluetooth protocol stack – NEON sbc audio encoder

Pixman (part of cairo 2D graphics library) – Compositing/alpha blending

ffmpeg – libavcodec – LGPL media player used in many Linux distros – NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3, Theora – NEON Audio: AAC, Vorbis, WMA

x264 – Google Summer Of Code 2009 – GPL H.264 encoder – e.g. for video conferencing

Android – NEON optimizations – Skia library, S32A_D565_Opaque 5x faster using NEON

Eigen2 – C++ vector math / linear algebra template library Theorarm – libtheora NEON version (optimized by Google) libjpeg – optimized JPEG decode (IJG library) FFTW – NEON enabled FFT library

LLVM – code generation backend used by Android Renderscript Copyright 2011 Xilinx

Page 33

© Copyright 2009 Xilinx

Automatic Vectorization: ARM Compiler  Automatic vectorization can generate code targeted for NEON from ordinary C source code – Less effort to produce efficient code – Portable - no compiler-specific source code features need to be used

 To enable automatic vectorization on armcc, use these options together:



--vectorize

- enable vectorization

--cpu Cortex-A9

- provide a CPU option with NEON support

-O2 or -O3

- select high optimization level.

-Otime

- optimize for speed over space

--fpmode fast

- the precision of vectorized floating-point operations is the same as VFP RunFast mode

--diag_warning=optimizations to obtain useful diagnostics from the compiler on what it could or could not optimize/vectorize

 Note: – When you specify --vectorize, automatic vectorization is enabled only if you also specify Otime and an optimization level of -O2 or -O3. Page 34

© Copyright 2009Xilinx Xilinx Copyright 2011

Automatic Vectorization: GNU tools  To enable automatic vectorization on GCC/g++, use these options together: -mcpu=cortex-a9

-Specify a suitable ARMv7-A processor

-mfpu=neon

- enable NEON support

-ftree-vectorize

- support SIMD on many arch - O3 implies -ftree-vectorize

-mvectorize-with-neon-quad

- By default, GCC 4.4 only vectorize for doubleword

-mfloat-abi=softfp

-Can use “hard” for more efficient floating point parameter passing, but all code must be compiled with this option

 Understand more with -ftree-vectorizer-verbose  Takes an integer value specifying the level of detail to provide, where 1 enables additional printouts and higher values add even more information.  What vectorization the compiler is performing, or what is unable to perform because of possible dependencies Page 35

© Copyright 2009Xilinx Xilinx Copyright 2011

Automatic Vectorization - how it works 2. Unroll the loop to the appropriate number of iterations, and perform other transformations like pointerization

void add_int(int * __restrict pa, int * __restrict pb, unsigned int n, int x) { unsigned int i; for(i = 0; i < (n & ~3); i++) pa[i] = pb[i] + x; }

void add_int(int *pa, int *pb, unsigned n, int x) { unsigned int i; for (i = ((n & ~3) >> 2); i; i--) { *(pa + 0) = *(pb + 0) + x; *(pa + 1) = *(pb + 1) + x; *(pa + 2) = *(pb + 2) + x; *(pa + 3) = *(pb + 3) + x; pa += 4; pb += 4; } }

1. Analyze each loop:  Are pointer accesses safe for vectorization?  What data types are being used? How do they map onto NEON vector registers?  How many loop iterations are there? 3. Map each unrolled operation onto a NEON vector lane, and generate corresponding NEON instructions Page 36

© Copyright 2009Xilinx Xilinx Copyright 2011

ARM RVDS & gcc vectorising compiler

int a[256], b[256], c[256]; foo () { int i; for (i=0; i<256; i++){ a[i] = b[i] + c[i]; } }

Page 37

armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c

|L1.16| VLD1.32 {d0,d1},[r0]! SUBS r3,r3,#1 VLD1.32 {d2,d3},[r1]! VADD.I32 q0,q0,q1 VST1.32 {d0,d1},[r2]! BNE |L1.16|

.L2:

gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c

© Copyright 2009Xilinx Xilinx Copyright 2011

add r1, r0, ip add r3, r0, lr add r2, r0, r4 add r0, r0, #8 cmp r0, #1024 fldd d7, [r3, #0] fldd d6, [r2, #0] vadd.i32 d7, d7, d6 fstd d7, [r1, #0] bne .L2

Tuning C/C++ Code for Vectorizing The goal is to try to make the code simple, straight forward, and parallel, so that the compiler can easily convert the code to NEON assembly Loops can be modified for better vectorizing: – Short, simple loops work the best (even if it means multiple loops in your code) – Avoid breaks / loop-carried dependencies / conditions inside loops – Try to make the number of iteration a power of 2 – Try to make sure the number of iteration is known to the compiler

– Functions called inside a lop should be inlined

Pointer issues: – Using arrays with indexing vectorizes better than using pointer – Indirect addressing (multiple indexing or de-reference) doesn‟t vectorize

– Use __restricet key word to tell the compiler that pointers does not reference overlapping areas of memory

Use suitable data types – For best performance, always use the smallest data type that can hold the required values Page 38

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON Intrinsics  Available in armcc, GCC/g++, and llvm. Syntax is the same, so source code that uses intrinsics can be compiled by any of these compilers.  Advantage: – Provide low-level access to NEON instructions. Compiler do hard work like:  Register allocation.  Code scheduling, or re-ordering instructions. The C compilers can reorder code to ensure the minimum number of stalls according to a specific processor.

 Disadvantage: – Possibly the compiler output is not exactly the code you want, so there is still some possibility of improvement when moving to NEON assembler code.

Page 39

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON Intrinsics - Example  Include intrinsics header file #include  Use special NEON data types which correspond to D and Q registers, e.g. int8x8_t D-register containing 8x 8-bit elements int16x4_t D-register containing 4x 16-bit elements int32x4_t Q-register containing 4x 32-bit elements  Use special intrinsics versions of NEON instructions vin1 = vld1q_s32(ptr); vout = vaddq_s32(vin1, vin2); vst1q_s32(vout, ptr);  Strongly typed! – Use vreinterpret_s16_s32( ) to change the type

Page 40

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON Intrinsics - Example  NEON intrinsic #include uint32x4_t double_elements(uint32x4_t input) { return(vaddq_u32(input, input)); }

 Command line with GCC  arm-none-linux-gnueabi-gcc -mfpu=neon intrinsic.c

 Command line with RVCT  armcc --cpu=Cortex-A9 intrinsic.c

Page 41

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON Intrinsics : reference For information about the intrinsic functions and vector data types, see the:  RealView Compilation Tools Compiler Reference Guide, available from – http://infocenter.arm.com

 GCC documentation, available from – http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html

Page 42

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON Assembler Code  Advantage: – Careful hand-scheduling is recommended to get the best out of any NEON assembler code you write, especially for performance-critical applications

 Disadvantage: – Need to be aware of some underlying hardware features, like pipelining and scheduling issues, memory access behavior and scheduling hazards. – Optimization is processor dependent.

Page 43

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON Assembler Code - Example  Gas .text .arm .global double_elements double_elements: vadd.i32 q0,q0,q0 bx lr .end  Command: arm-none-linux-gnueabi-as -mfpu=neon asm.s

 RVCT AREA RO, CODE, READONLY ARM EXPORT double_elements double_elements VADD.I32 Q0, Q0, Q0 BX LR END  Command: armasm --cpu=Cortex-A9 asm.s Page 44

© Copyright 2009Xilinx Xilinx Copyright 2011

Enabling NEON before using NEON data engine is disabled at reset – Needs to be enabled in software before use

Enabling NEON requires 2 steps 1.Enable access to coprocessors 10 and 11 and allow Neon instructions

MRC ORR MCR ISB

p15, 0x0, r0, c1, c0, 2 r0, r0, #(0x0f << 20) p15, 0x0, r0, c1, c0, 2

; Read CP15 CPACR ; Full access rights ; Write CP15 CPACR

2.Enable NEON and VFP MOV r0, #0x40000000 ; set bit 30 VMSR FPEXC, r0 ; write r0 to Floating Point Exception ; Register Page 45

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON Agenda

 NEON Hardware overview

 NEON Instruction set overview  NEON Software Support  NEON improves performance

Page 46 Page 46

© Copyright 2009Xilinx Xilinx Copyright 2011

NEON in Audio FFT: 256-point, 16-bit signed complex numbers – FFT is a key component of AAC, Voice/pattern recognition etc. – Hand optimized assembler in both cases

FFT time

No NEON (v6 SIMD asm)

With NEON (v7 NEON asm)

Cortex-A8 500MHz Actual silicon

15.2 us

3.8 us (x 4.0 performance)

Extreme example: FFT in ffmpeg: 12x faster – C code -> handwitten asm

– Scalar -> vector processing – Single-precision floating point on Cortex-A8 (VFPlite -> NEON)

Page 47

© Copyright 2009Xilinx Xilinx Copyright 2011

ARM RVDS vectorizing compiler • RVDS 4.0 professional includes auto-vectorizing armcc – armcc --vectorize --cpu=Cortex-A8 x.c • Up to 4x performance increase for benchmarks, with no source code changes

(no source code changes are permitted for benchmarking) ARM vs NEON (Vectorize) on Cortex-A8 169%

170% 135%

120%

100%

100%

70%

Improved vectorization in latest RVDS 4.0

20% Telecom

Consumer

ARM

NEON

• Simple source code changes can yield significant improvements above this – Use C „__restrict‟ keyword to work around C pointer aliasing issues – Make loops clearly multiple of 2n (e.g. use 4*n as loop end) to aid vectorization Page 48

© Copyright 2009Xilinx Xilinx Copyright 2011

FFmpeg/libav performance ffmpeg performance (relative to realtime)

git.ffmpeg.org snapshot 21-Sep-09

3 2.5 2

YouTube HQ video decode 480x270, 30fps Including AAC audio Real silicon measurements

v7vfp

1.5

v7neon

1 0.5 0 Cortex-A8 256KB L2 500MHz

– OMAP3 Beagleboard – ARM A9TC

NEON ~2x overall performance

Page 49

© Copyright 2009Xilinx Xilinx Copyright 2011

Cortex-A9 512KB L2 400MHz

AC3 decode MHz Using ffmpeg (LGPL opensource codec) with extensive NEON optimizations Dolby reference code also available optimized from Dolby and other vendors 60.0

50.7

46.4 50.0

MHz required (lower is better)

40.0 22.4 30.0 blade_runner (48KHz 192Kbit/s stereo) 20.0

16.5

Broadway-5.1-48KHz-448kbit.ac3

17.8

8.2 10.0

ffmpeg from git.libav.org Checkout from 14-Nov-2011 ./configure --extra-cflags=“-mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp” Benchmarked on 500MHz Cortex-A9 (ARM Versatile Express) Samples from: samples.mplayerhq.hu/A-codecs

0.0 v7neon v7fpu v7fpu --disableasm

Page 50

© Copyright 2009Xilinx Xilinx Copyright 2011

JPEG decode android_external_jpeg optimized for v6 and v7neon

16Mpixel digital camera test image - takes only 1.7s to decode on 500MHz A9+NEON (improved from 4.5s unoptimized) cycles/pixel - lower is better 160

140

140 120

83

100 80

52

60 40

20

Code from: https://github.com/mansr ARM Versatile Express 500MHz Cortex-A9 Ubuntu 11.04 Linux djpeg -outfile /dev/null testfile.jpg

0 djpeg

Page 51

djpeg.v6.opt

djpeg.v7neon.opt

© Copyright 2009Xilinx Xilinx Copyright 2011

Also libjpeg-turbo via Linaro http://libjpeg-turbo.virtualgl.org/

NEON Summary NEON will become standard on general purpose apps & mediacentric devices. – NEON option across the Cortex-A roadmap – NEON ideal for use by open OS systems with downloaded apps

Full enabling technology to support Neon – Compilers, profilers, debuggers, libraries all available now – Key differentiator: easy to program, popular with software engineers

Strong ARM NEON ecosystem

Complementary to DSP/FPGA Page 52

© Copyright 2009Xilinx Xilinx Copyright 2011

Zynq-7000 Hardware Architecture

© Copyright 2009Xilinx Xilinx Copyright 2012