Scalable and Energy-Efficient Architecture Lab (SEAL)
PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory
Ping Chi*, Shuangchen Li*, Tao Zhang†, Cong Xu‡, Jishen Zhaoδ, Yu Wang#, Yongpan Liu#, Yuan Xie*
*Electrical and Computer Engineering Department, University of California, Santa Barbara
†Nvidia, ‡HP Labs, δUniversity of California, Santa Cruz, #Tsinghua University, Beijing, China
Motivation
• Challenges
  – Data movement is expensive
  – Applications demand large memory bandwidth
• Processing-in-memory (PIM)
  – Minimize data movement by placing computation near data or in memory
  – 3D stacking revives PIM
    • Embrace the large internal data transfer bandwidth
    • Reduce the overheads of data movement
Micron, "Hybrid Memory Cube", HC'11
Motivation
• Neural network (NN) and deep learning (DL)
  – Provide solutions to various applications
  – Acceleration requires high memory bandwidth
    • PIM is a promising solution
  – The size of NNs keeps increasing
    • e.g., 1.32 GB of synaptic weights for YouTube video object recognition
  – NN acceleration: GPU, FPGA, ASIC, ReRAM crossbar
Deng et al, "Reduced-Precision Memory Value Approximation for Deep Learning", HPL Report, 2015
Motivation
• Resistive Random Access Memory (ReRAM)
  – Data storage: alternative to DRAM and flash
  – Computation: matrix-vector multiplication (NN)
    • The DPE accelerates pattern recognition on MNIST: no accuracy degradation vs. a software approach (99% accuracy) with only 4-bit DAC and ADC requirements, and a 1,000x~10,000x speed-efficiency product vs. a custom digital ASIC
Hu et al, "Dot-Product Engine (DPE) for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication", DAC'16.
Shafiee et al, "ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars", ISCA'16.
Key idea
• PRIME: processing in ReRAM main memory
  – Based on the ReRAM main memory design [1]
[Figure: the same crossbar array operates in two modes. Memory mode: the cells store data. Computation mode: the cells store synaptic weights (w1,1 ... w3,2); inputs a1, a2, a3 drive the wordlines and outputs b1, b2 appear on the bitlines.]
[1] Xu et al, "Overcoming the challenges of crossbar resistive memory architectures," in HPCA'15.
ReRAM Basics
[Figure: (a) conceptual view of a ReRAM cell (top electrode, metal oxide, bottom electrode); (b) I-V curve of bipolar switching, where SET switches the cell to the low-resistance state (LRS, '1') and RESET to the high-resistance state (HRS, '0'); (c) schematic view of a crossbar architecture of wordlines and cells.]
ReRAM-Based NN Computation
• Requires specialized peripheral circuit design: DAC, ADC, etc.
[Figure: (a) an ANN with one input and one output layer: inputs a1, a2; weights w1,1, w1,2, w2,1, w2,2; summed outputs b1, b2. (b) Using a ReRAM crossbar array for neural computation: weights map to cell conductances, inputs a1, a2 drive the wordlines, and the bitline currents yield b1 and b2.]
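The mapping in figure (b) can be stated numerically: each weight w_i,j is programmed as a cell conductance, each input a_j is applied as a wordline voltage, and by Kirchhoff's law each bitline current accumulates the corresponding dot product. Below is a minimal NumPy sketch of this idealized analog behavior (the weight and input values are illustrative, and the sigmoid stands in for the activation applied by the peripheral circuitry):

```python
import numpy as np

# Weights stored as cell conductances in a 2x2 ReRAM crossbar
# (rows = wordlines / inputs, columns = bitlines / outputs).
W = np.array([[0.8, 0.3],   # w1,1  w1,2
              [0.2, 0.9]])  # w2,1  w2,2

# Inputs a1, a2 applied as wordline voltages.
a = np.array([0.5, 1.0])

# Kirchhoff's law: each bitline current sums voltage * conductance
# down that column, i.e. an analog matrix-vector multiplication.
bitline_currents = a @ W          # [b1, b2] before activation

# The peripheral circuitry applies the activation to the sensed output.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(bitline_currents))  # outputs b1, b2
```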
PRIME Architecture Details
[Figure: one PRIME bank, consisting of memory (Mem) subarrays, full-function (FF) subarrays, and a Buffer subarray. Each mat contains ReRAM crossbars with wordline decoders and drivers (WDD), voltage sources (Vol.), column multiplexers (col Mux.), and sense amplifiers (SA); mats connect through global data lines (GDL) to the global I/O row buffer, with a global row decoder driving the global wordlines (GWL). Labels A-E mark the circuit blocks modified for PRIME, detailed below.]
A. Wordline decoder and driver with multi-level voltage sources;
B. Column multiplexer with analog subtraction and sigmoid circuitry;
C. Reconfigurable SA with counters for multi-level outputs, and added ReLU and 4-1 max pooling function units;
D. Connection between the FF and Buffer subarrays;
E. PRIME controller.
Overcome Precision Challenges
• Technology limitations
  – Input precision (input voltage, DAC)
  – Weight precision (MLC level)
  – Output precision (analog computation, ADC)
• Proposed input and synapse composing scheme (sketched below)
  – Compose multiple low-precision input signals for one input
  – Compose multiple cells for one weight
  – Compose multiple phases for one computation
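As a concrete illustration of the composing scheme, the sketch below rebuilds a full-precision dot product from low-precision pieces: each wide weight is split across two cells, each wide input across two phases, and the low-precision partial results are merged with shifts. The specific bit widths and all names here are illustrative assumptions, not PRIME's exact design parameters:

```python
import numpy as np

CELL_BITS = 4   # assumed precision of one multi-level cell
IN_BITS = 3     # assumed precision of one input voltage signal

def composed_mvm(weights8, inputs6):
    """8-bit x 6-bit dot product composed from low-precision pieces."""
    # Compose multiple cells for one weight: split each 8-bit weight
    # across a high and a low 4-bit cell.
    w_hi, w_lo = weights8 >> CELL_BITS, weights8 & (2**CELL_BITS - 1)
    # Compose multiple input signals/phases: split each 6-bit input
    # into a high and a low 3-bit part, applied in separate phases.
    x_hi, x_lo = inputs6 >> IN_BITS, inputs6 & (2**IN_BITS - 1)

    total = 0
    # One low-precision analog pass per (cell, phase) pair; the
    # partial sums are recombined with the appropriate shifts.
    for w, w_shift in ((w_hi, CELL_BITS), (w_lo, 0)):
        for x, x_shift in ((x_hi, IN_BITS), (x_lo, 0)):
            total += int(w @ x) << (w_shift + x_shift)
    return total

w = np.array([200, 17, 90])              # 8-bit weights
x = np.array([33, 5, 60])                # 6-bit inputs
assert composed_mvm(w, x) == int(w @ x)  # matches full precision
```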
Implementing MLP/CNN Algorithms
• MLP / fully-connected layer
  – Matrix-vector multiplication
  – Activation functions (sigmoid, ReLU)
• Convolution layer (lowered to matrix-vector multiplication; see the sketch below)
• Pooling layer
  – Max pooling, mean pooling
• Local Response Normalization (LRN) layer
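Since the crossbar natively computes matrix-vector products, a convolution layer can be lowered onto it by flattening each kernel into one crossbar column and presenting each input window as one input vector. A minimal sketch of this lowering (names and sizes are illustrative, not PRIME's exact dataflow):

```python
import numpy as np

def conv_via_crossbar(image, kernels):
    """Lower a convolution onto crossbar-style matrix-vector products:
    one flattened kernel per crossbar column, one input window per
    input-voltage vector."""
    n_k, kh, kw = kernels.shape
    H, W = image.shape
    G = kernels.reshape(n_k, -1).T          # (kh*kw, n_k) "conductances"
    out = np.empty((n_k, H - kh + 1, W - kw + 1))
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            window = image[i:i+kh, j:j+kw].ravel()  # one input vector
            out[:, i, j] = window @ G               # one crossbar MVM
    return out

image = np.random.rand(8, 8)
kernels = np.random.rand(4, 3, 3)               # 4 kernels of size 3x3
print(conv_via_crossbar(image, kernels).shape)  # (4, 6, 6)
```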
System-Level Design
• Stage 1: Program
  – Offline NN training produces an NN parameter file
  – Target code segment is modified to: Map_Topology(); Program_Weight(); Config_Datapath(); Run(input_data); Post_Proc();
• Stage 2: Compile
  – Opt. I: NN mapping; Opt. II: data placement
  – Outputs: synaptic weights, datapath configuration, data flow control
  – NN mapping optimization (see the sketch below):
    • Small-scale NN: Replication
    • Medium-scale NN: Split-Merge
    • Large-scale NN: Inter-bank communication
• Stage 3: Execute
  – The memory controller drives the PRIME FF subarrays (mats)
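For context, a minimal sketch of the Split-Merge idea for a medium-scale NN whose weight matrix exceeds one 256x256 crossbar: the matrix is split into crossbar-sized tiles, each tile performs one matrix-vector product, and the partial results along the input dimension are merged by addition. All names are illustrative:

```python
import numpy as np

XBAR = 256  # crossbar dimension from the evaluation configuration

def split_merge_mvm(W, x):
    """Tile a weight matrix larger than one 256x256 crossbar across
    several crossbars (Split), run one matrix-vector product per tile,
    and sum the partial results along the input dimension (Merge)."""
    n_out, n_in = W.shape
    y = np.zeros(n_out)
    for r in range(0, n_out, XBAR):        # tiles of output neurons
        for c in range(0, n_in, XBAR):     # tiles of input neurons
            tile = W[r:r+XBAR, c:c+XBAR]   # one crossbar's weights
            y[r:r+XBAR] += tile @ x[c:c+XBAR]  # partial MVM, merged
    return y

W = np.random.rand(500, 700)   # larger than a single crossbar
x = np.random.rand(700)
assert np.allclose(split_merge_mvm(W, x), W @ x)
```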
Evaluation
• Benchmarks
  – 3 MLPs (S, M, L), 3 CNNs (1, 2, VGG-D)
  – MNIST, ImageNet
• PRIME configuration
  – 2 FF subarrays and 1 Buffer subarray per bank
  – Each crossbar: 256x256 cells
  – Area overhead: 5.6%
Evaluation
• Comparisons
  – Baseline CPU-only, pNPU-co, pNPU-pim [1]
[1] T. Chen et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ASPLOS'14.
Performance results
[Figure: speedup normalized to CPU (log scale, 1E+00 to 1E+05) for pNPU-co, pNPU-pim-x1, pNPU-pim-x64, and PRIME on CNN-1, CNN-2, MLP-S, MLP-M, MLP-L, VGG, and gmean.]
• PRIME is even 4x better than pNPU-pim-x64
Performance results
[Figure: latency breakdown (Compute + Buffer vs. Memory), normalized to pNPU-co, for pNPU-co, pNPU-pim, and PRIME on CNN-1, CNN-2, MLP-S, MLP-M, MLP-L, and VGG-D.]
• PRIME reduces ~90% of the memory access overhead
Energy results
[Figure: energy savings normalized to CPU (log scale, 1E+00 to 1E+06) for pNPU-co, pNPU-pim-x64, and PRIME on CNN-1, CNN-2, MLP-S, MLP-M, MLP-L, VGG, and gmean.]
• PRIME is even 200x better than pNPU-pim-x64
Executive Summary
• Challenges: data movement is expensive; applications such as neural networks demand high memory bandwidth
• Solutions: processing-in-memory (PIM); ReRAM crossbars accelerate NN computation
• Our proposal: a PIM architecture for NN acceleration in ReRAM-based main memory, including a set of circuit and microarchitecture designs to enable NN computation and a software/hardware interface for developers to implement various NNs
• Results: significantly improved energy efficiency, better system performance and scalability for MLP and CNN workloads, no extra processing units, and low area overhead
Thank you!