Scalable and Energy-Efficient Architecture Lab (SEAL)

PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory

Ping Chi*, Shuangchen Li*, Tao Zhang†, Cong Xu‡, Jishen Zhaoδ, Yu Wang#, Yongpan Liu#, Yuan Xie*

*Electrical and Computer Engineering Department, University of California, Santa Barbara
†Nvidia, ‡HP Labs, δUniversity of California, Santa Cruz
#Tsinghua University, Beijing, China


Motivation
• Challenges
  – Data movement is expensive
  – Applications demand large memory bandwidth
• Processing-in-memory (PIM)
  – Minimize data movement by placing computation near data or in memory
  – 3D stacking revives PIM
    • Embrace the large internal data transfer bandwidth
    • Reduce the overheads of data movement
Micron, “Hybrid Memory Cube”, HC’11

Motivation
• Neural network (NN) and deep learning (DL)
  – Provide solutions to various applications
  – Acceleration requires high memory bandwidth
    • PIM is a promising solution
• The size of NN increases
  • e.g., 1.32GB synaptic weights for YouTube video object recognition
• NN acceleration
  • GPU, FPGA, ASIC
  • ReRAM crossbar
Deng et al, “Reduced-Precision Memory Value Approximation for Deep Learning”, HPL Report, 2015

Motivation
• Resistive Random Access Memory (ReRAM)
  – Data storage: alternative to DRAM and flash
  – Computation: matrix-vector multiplication (NN)
• Use DPE to accelerate pattern recognition on MNIST
  • No accuracy degradation vs. software approach (99% accuracy) with only 4-bit DAC and ADC requirement
  • 1,000X ~ 10,000X speed-efficiency product vs. custom digital ASIC
Hu et al, “Dot-Product Engine (DPE) for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication”, DAC’16.
Shafiee et al, “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars”, ISCA’16.

Key idea
• PRIME: processing in ReRAM main memory
  – Based on ReRAM main memory design [1]

[Figure: a ReRAM array in memory mode (store data) and in computation mode (store weights w1,1 to w3,2, with inputs a1, a2, a3 and outputs b1, b2)]

[1] Xu et al, “Overcoming the challenges of crossbar resistive memory architectures,” in HPCA’15.

ReRAM Basics

[Figure: (a) Conceptual view of a ReRAM cell (top electrode, metal oxide, bottom electrode); (b) I-V curve of bipolar switching (SET to LRS (‘1’), RESET to HRS (‘0’)); (c) schematic view of a crossbar architecture (wordline, cell)]

ReRAM Based NN Computation
• Require specialized peripheral circuit design
  • DAC, ADC etc.

[Figure: (a) An ANN with one input and one output layer (inputs a1, a2; weights w1,1, w1,2, w2,1, w2,2; outputs b1, b2); (b) using a ReRAM crossbar array for neural computation]
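The computation the crossbar figure depicts can be sketched in software as a matrix-vector product: each output is the sum of the inputs weighted by one column of the crossbar. The sketch below is only a numerical model under that assumption. The function name `crossbar_mvm` and the example values are illustrative, and analog effects, DAC/ADC quantization, and any activation function are not modeled.

```python
def crossbar_mvm(inputs, weights):
    """Model of a crossbar: outputs[j] = sum_i inputs[i] * weights[i][j].

    inputs  -- values applied to the rows (a1, a2, ...)
    weights -- weights[i][j] is the cell at row i, column j (w_i,j)
    """
    rows = len(weights)
    cols = len(weights[0])
    assert len(inputs) == rows, "one input per crossbar row"
    outputs = [0.0] * cols
    for i in range(rows):
        for j in range(cols):
            # Each cell contributes input * weight to its column's sum.
            outputs[j] += inputs[i] * weights[i][j]
    return outputs


# Hypothetical 2x2 example matching the shape of the figure.
a = [1.0, 2.0]
w = [[0.5, 0.25],
     [0.125, 1.0]]
b = crossbar_mvm(a, w)
print(b)  # b1 = 1*0.5 + 2*0.125, b2 = 1*0.25 + 2*1.0
```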

PRIME Architecture Details

[Figure: organization of a PRIME bank. Labeled blocks: Mem subarrays, FF subarrays, and a Buffer subarray; mats containing ReRAM crossbars with wordline decoders and drivers (WDD), voltage sources (Vol.), column multiplexers (col Mux.), and sense amplifiers (SA); global row decoder, GWL, GDL, global I/O row buffer, data and address paths, and the controller. Labels A to E mark the components described below.]

A. Wordline decoder and driver with multi-level voltage sources
B. Column multiplexer with analog subtraction and sigmoid circuitry
C. Reconfigurable SA with counters for multi-level outputs, and added ReLU and 4-1 max pooling function units
D. Connection between the FF and Buffer subarrays
E. PRIME controller


Overcome Precision Challenges
• Technology limitation
  – Input precision (input voltage, DAC)
  – Weight precision (MLC level)
  – Output precision (analog computation, ADC)
• Propose input and synapse composing scheme
  – Compose multiple low-precision input signals for one high-precision input
  – Compose multiple cells for one weight
  – Compose multiple phases for one computation
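The arithmetic behind composing can be illustrated with a small integer sketch: split each input and each weight into a high and a low part, compute the four low-precision dot products, and recombine them with shifts. The bit widths used here (a 6-bit input as two 3-bit parts, an 8-bit weight as two 4-bit cells) and all function names are assumptions chosen for illustration. The sketch keeps every result bit, whereas real hardware would be limited by output precision.

```python
def split(value, low_bits):
    """Split a non-negative integer into (high part, low part)."""
    return value >> low_bits, value & ((1 << low_bits) - 1)


def dot(xs, ws):
    """Plain dot product, standing in for one low-precision crossbar phase."""
    return sum(x * w for x, w in zip(xs, ws))


def composed_dot(inputs, weights, in_low_bits=3, w_low_bits=4):
    """Full-precision dot product built from four low-precision ones.

    Assumed widths (illustrative): inputs split at 3 bits, weights at 4 bits.
    """
    in_hi, in_lo = zip(*(split(x, in_low_bits) for x in inputs))
    w_hi, w_lo = zip(*(split(w, w_low_bits) for w in weights))
    # x = x_hi * 2^a + x_lo and w = w_hi * 2^b + w_lo, so
    # x * w = x_hi*w_hi * 2^(a+b) + x_hi*w_lo * 2^a + x_lo*w_hi * 2^b + x_lo*w_lo
    return ((dot(in_hi, w_hi) << (in_low_bits + w_low_bits))
            + (dot(in_hi, w_lo) << in_low_bits)
            + (dot(in_lo, w_hi) << w_low_bits)
            + dot(in_lo, w_lo))


inputs = [45, 7, 63]       # hypothetical 6-bit inputs
weights = [200, 17, 255]   # hypothetical 8-bit weights
assert composed_dot(inputs, weights) == dot(inputs, weights)
```

The recombination is exact for non-negative integers; how many of the resulting bits are actually kept is a separate design decision that this sketch does not model.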

Implementing MLP/CNN Algorithms
• MLP / Fully-connected Layer
  – Matrix-vector multiplication
  – Activation functions (sigmoid, ReLU)
• Convolution Layer
• Pooling Layer
  – Max pooling, mean pooling
• Local Response Normalization (LRN) Layer
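As a software reference for the layer types listed above, the following sketch shows a fully-connected layer with a selectable activation, plus max and mean pooling over fixed windows. It models only the arithmetic, not the circuits. The function names, the window size of 4, and the choice to write mean pooling as a weighted sum with weights 1/window are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def fully_connected(inputs, weights, activation):
    """outputs[j] = activation(sum_i inputs[i] * weights[i][j])."""
    cols = len(weights[0])
    return [activation(sum(inputs[i] * weights[i][j]
                           for i in range(len(inputs))))
            for j in range(cols)]


def max_pool(values, window=4):
    """Keep the largest value of each non-overlapping window."""
    assert len(values) % window == 0
    return [max(values[k:k + window])
            for k in range(0, len(values), window)]


def mean_pool(values, window=4):
    """Mean of each window, written as a dot product with weights 1/window."""
    assert len(values) % window == 0
    return [sum(v * (1.0 / window) for v in values[k:k + window])
            for k in range(0, len(values), window)]


hidden = fully_connected([1.0, -2.0], [[1.0, 0.5], [1.0, -0.5]], relu)
print(hidden)                                  # column sums are -1.0 and 1.5
print(max_pool([1, 5, 2, 3, 8, 0, 4, 4]))      # one maximum per window of 4
print(mean_pool([1.0, 5.0, 2.0, 4.0]))         # one mean per window of 4
```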

System-Level Design

[Figure: software/hardware flow in three stages.
Stage 1: Program. Offline NN training (NN param. file); Target Code Segment; Modified Code: Map_Topology(); Program_Weight(); Config_Datapath(); Run(input_data); Post_Proc();
Stage 2: Compile. Opt. I: NN Map; Opt. II: Data Place; Synaptic Weights, Datapath Config, Data Flow Ctrl.
Stage 3: Execute. Memory, controller, mat, PRIME FF subarray.]

• NN mapping optimization
  – Small-scale NN: Replication
  – Medium-scale NN: Split-Merge
  – Large-scale NN: Inter-bank Communication
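One way to picture the three mapping cases is as a size check on a layer's weight matrix against the available crossbars. The sketch below is a guess at such a rule, not the compiler's actual policy. The thresholds (fits in one crossbar, fits in one bank, otherwise spans banks), the function name `choose_mapping`, and the `crossbars_per_bank=64` default are all hypothetical; only the 256x256 crossbar size comes from the evaluation configuration.

```python
import math


def choose_mapping(rows, cols, crossbar_rows=256, crossbar_cols=256,
                   crossbars_per_bank=64):
    """Pick a mapping strategy for a rows x cols weight matrix.

    crossbars_per_bank is a hypothetical capacity, not a PRIME parameter.
    Returns (strategy, count), where count is the number of replicas,
    crossbar tiles, or banks, depending on the strategy.
    """
    tiles = math.ceil(rows / crossbar_rows) * math.ceil(cols / crossbar_cols)
    if tiles == 1:
        # Small: the layer fits in one crossbar, so spare cells can hold copies.
        copies = (crossbar_rows // rows) * (crossbar_cols // cols)
        return ("replication", copies)
    if tiles <= crossbars_per_bank:
        # Medium: split across crossbars of one bank, then merge partial results.
        return ("split-merge", tiles)
    # Large: the layer spans several banks, which must exchange data.
    return ("inter-bank communication", math.ceil(tiles / crossbars_per_bank))


print(choose_mapping(128, 128))    # small layer
print(choose_mapping(784, 500))    # medium layer
print(choose_mapping(4096, 4096))  # large layer
```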

Evaluation
• Benchmarks
  – 3 MLPs (S, M, L), 3 CNNs (1, 2, VGG-D)
  – MNIST, ImageNet
• PRIME configurations
  – 2 FF subarrays and 1 Buffer subarray per bank
  – Each crossbar: 256x256 cells
  – Area overhead: 5.6%

Evaluation
• Comparisons
  – Baseline CPU-only, pNPU-co, pNPU-pim [1]

[1] T. Chen et al., “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ASPLOS’14.

Performance results

[Figure: speedup normalized to CPU (log scale, 1E+00 to 1E+05) for pNPU-co, pNPU-pim-x1, pNPU-pim-x64, and PRIME on CNN-1, CNN-2, MLP-S, MLP-M, MLP-L, VGG, and gmean]

• PRIME is even 4x better than pNPU-pim-x64

Performance results

[Figure: latency normalized to pNPU-co, split into Compute + Buffer and Memory, for pNPU-co, pNPU-pim, and PRIME on CNN-1, CNN-2, MLP-S, MLP-M, MLP-L, and VGG-D]

• PRIME reduces ~90% memory access overhead

Energy results

[Figure: energy saving normalized to CPU (log scale, 1E+00 to 1E+06) for pNPU-co, pNPU-pim-x64, and PRIME on CNN-1, CNN-2, MLP-S, MLP-M, MLP-L, VGG, and gmean]

• PRIME is even 200x better than pNPU-pim-x64

Executive Summary
• Challenges:
  – Data movement is expensive
  – Apps demand high memory bandwidth, e.g., neural networks
• Solutions:
  – Processing-in-memory (PIM)
  – ReRAM crossbar accelerates NN computation
• Our proposal:
  – A PIM architecture for NN acceleration in ReRAM-based main memory, including a set of circuit and microarchitecture designs to enable NN computation and a software/hardware interface for developers to implement various NNs
• Improves energy efficiency significantly, achieves better system performance and scalability for MLP and CNN workloads, requires no extra processing units, and has low area overhead

Thank you!