NVIDIA Pascal GP100 GPU Expected To Feature 12 TFLOPs of Single Precision Compute, 4 TFLOPs of Double Precision Compute Performance

Hassan Mujtaba • Feb 18, 2016 12:04 AM EST


New details on NVIDIA's Pascal GPU have been dug up by 3DCenter (via Beyond3D), showing the total compute performance of the upcoming FinFET based chip. At CES 2016, NVIDIA announced the Pascal based Drive PX 2 module, an automotive supercomputer that uses the processing power of GPUs to drive cars autonomously. The presentation didn't mention the flagship chip, but we expect to hear a more detailed session on it at GTC in April 2016.

NVIDIA Pascal GP100 Flagship GPU Might Come With 12 TFLOPs of Single Precision, 4 TFLOPs of Double Precision Compute

The one GPU that everyone has their eyes on right now, whether enterprise or mainstream audience, is the flagship Pascal GPU known as GP100 (the naming scheme for the chip is not confirmed yet). This is going to be the flagship chip of the lineup and will be featured on Tesla, Quadro and GeForce graphics cards. The chip is based on the 16nm FinFET process, which leads to efficiency improvements and better performance per watt, but with Pascal, double precision compute also returns with a bang. Maxwell, NVIDIA's current gen architecture, made some serious gains in the performance per watt department and Pascal is expected to carry that tradition forward.

The information today comes from slides which have long existed but which most people haven't had access to. The slides were found by iMacmatican, a Beyond3D forum member who has compiled a good list of details over at the forum. Since most of these slides date back to 2014-2015, there are bound to be some changes to the GPU design, which we will also explain in a bit. First of all, a slide from a presentation in March 2014 detailed GFLOPs per watt for various NVIDIA GPUs. The approximate double precision GFLOPs per watt values for NVIDIA's CUDA generations of GPUs are listed below:

Tesla: 0.5
Fermi: 2
Kepler: 5.5
Pascal: 14
Volta: 22

[Slides: NVIDIA Pascal GPU double precision and single precision performance per watt]

Slide Credits: Beyond3D Forum

The slide shows that Pascal is rated at 14 GFLOPs per watt while Volta is rated at 22 GFLOPs per watt. The slide states that these approximations are for Double Precision or DGEMM (Double precision General Matrix Multiply) GFLOPs/W and not single precision, which is why Maxwell is absent from this slide: it has very little FP64 hardware under the hood. The fastest Kepler based Tesla K40X comes with 6.1 GFLOPs/W and the dual-chip Tesla K80X comes with 6.2 GFLOPs/W. Pascal is expected to take this to around 14 GFLOPs/W, which is more than twice Kepler's Double Precision GFLOPs/W.
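
The generational jumps implied by those figures can be worked out with a few lines of arithmetic. The following is a minimal Python sketch that only uses the approximate DGEMM GFLOPs/W values quoted above; the dictionary and function names are hypothetical, chosen for illustration.

```python
# Approximate DGEMM GFLOPs/W values as quoted from the March 2014 slide.
DGEMM_GFLOPS_PER_WATT = {
    "Tesla": 0.5,
    "Fermi": 2.0,
    "Kepler": 5.5,
    "Pascal": 14.0,
    "Volta": 22.0,
}


def generational_gains(values):
    """Return (previous, current, ratio) for each consecutive generation."""
    names = list(values)
    return [
        (prev, cur, values[cur] / values[prev])
        for prev, cur in zip(names, names[1:])
    ]


for prev, cur, ratio in generational_gains(DGEMM_GFLOPS_PER_WATT):
    print(f"{prev} -> {cur}: {ratio:.2f}x")

# Pascal's 14 GFLOPs/W against the 6.1 GFLOPs/W quoted for the Kepler Tesla part.
print(f"Pascal vs fastest single-chip Kepler Tesla: {14.0 / 6.1:.2f}x")
```

By these numbers, Pascal lands at roughly 2.3 to 2.5 times Kepler's double precision efficiency, depending on whether the slide's rounded Kepler value or the shipping Tesla figure is used as the baseline.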

Coming to Single Precision or SGEMM (Single precision General Matrix Multiply), Pascal is rated at 42 GFLOPs/W. Maxwell is rated at 23 GFLOPs/W, with the dual-chip offering pushing that up to 25 GFLOPs/W, while Volta is rated at 73 GFLOPs/W. There's also a slide that details HGEMM (Half precision General Matrix Multiply). We know that Pascal and later generations of GPUs will come with mixed precision compute, which allows users to get twice the compute performance in FP16 workloads compared to FP32 by computing at 16-bit, with half the precision of FP32. Compared to Maxwell, which has just 26 half precision GFLOPs/W, Pascal will take that up to 85 GFLOPs/W while Volta will do up to 145 GFLOPs/W.
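
As a rough consistency check, the per-watt figures from the three slides can be compared across precisions. This Python sketch uses only the numbers quoted above; the table layout and the `ratio` helper are hypothetical names introduced for illustration.

```python
# GFLOPs/W figures quoted from the DGEMM, SGEMM and HGEMM slides.
PER_WATT = {
    "Maxwell": {"FP32": 23.0, "FP16": 26.0},
    "Pascal": {"FP64": 14.0, "FP32": 42.0, "FP16": 85.0},
    "Volta": {"FP64": 22.0, "FP32": 73.0, "FP16": 145.0},
}


def ratio(arch, faster, slower):
    """Ratio of two precision ratings for one architecture."""
    return PER_WATT[arch][faster] / PER_WATT[arch][slower]


for arch in ("Pascal", "Volta"):
    print(
        f"{arch}: FP16/FP32 = {ratio(arch, 'FP16', 'FP32'):.2f}, "
        f"FP32/FP64 = {ratio(arch, 'FP32', 'FP64'):.2f}"
    )

# Maxwell gains little from FP16 because it lacks double-rate FP16.
print(f"Maxwell: FP16/FP32 = {ratio('Maxwell', 'FP16', 'FP32'):.2f}")
```

For Pascal, the FP32 to FP64 ratio works out to exactly 3, which matches the 12 TFLOPs single precision and 4 TFLOPs double precision split in the headline, and the FP16 to FP32 ratio is close to the 2x expected from mixed precision.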

Coming to the meatier part, the 2014 slides are full of useful data on Pascal GPUs. Of course, these slides predate the time frame when Pascal GPUs actually taped out and entered NVIDIA's labs for testing, which NVIDIA stated themselves during SC15 and several months before that at a GTC session in Japan. It is known that at some point NVIDIA changed their designs from HMC (Hybrid Memory Cube) based solutions to HBM2 based solutions, and they presented the updated design to the audience at GTC 2015 in Japan.

The prototype Pascal board that was showcased back at GTC 2014 was actually based on an HMC implementation, and that changed in 2015. From details mentioned in the slides, NVIDIA is claiming that they have integrated the memory (HBM2) to be part of the actual GPU die. This could mean one of two things: either NVIDIA has actually managed to integrate HBM2 and a 16nm GPU on the same die, or they are using a design similar to AMD's Fury cards, which place the GPU and HBM chips on a single interposer to make them a single chip solution, sort of like an SoC.

What we know so far about NVIDIA’s flagship Pascal GP100 GPU:

Pascal graphics architecture.
2x performance per watt estimated improvement over Maxwell.
To launch in 2016, purportedly the second half of the year.
DirectX 12 feature level 12_1 or higher.
Successor to the GM200 GPU found in the GTX Titan X and GTX 980 Ti.
Built on the 16nm FinFET manufacturing process from TSMC.
Allegedly has a total of 17 billion transistors, more than twice that of GM200.
Will feature four 4-Hi HBM2 stacks for a total of 16GB of VRAM, with 8-Hi stacks for up to 32GB on the professional compute SKUs.
Features a 4096-bit memory bus interface, same as AMD’s Fiji GPU powering the Fury series.
Features NVLink (only compatible with next generation IBM POWER server processors).
Supports half precision FP16 compute at twice the rate of full precision FP32.
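
The memory figures in the list above follow from simple stack arithmetic. Below is a minimal Python sketch; the 8 Gb per DRAM die and the per-pin data rates are assumptions that are not stated in the article, and the function names are hypothetical.

```python
def hbm2_capacity_gb(stacks, dies_per_stack, gbit_per_die=8):
    """Total capacity in GB. The 8 Gb per-die density is an assumption."""
    return stacks * dies_per_stack * gbit_per_die / 8


def bus_width_bits(stacks, bits_per_stack=1024):
    """Total bus width, with each HBM stack exposing a 1024-bit interface."""
    return stacks * bits_per_stack


def bandwidth_gb_s(width_bits, gbps_per_pin):
    """Peak bandwidth in GB/s for a bus width and an assumed per-pin rate."""
    return width_bits * gbps_per_pin / 8


width = bus_width_bits(4)
print(width)                        # 4096
print(hbm2_capacity_gb(4, 4))       # 16.0 (four 4-Hi stacks)
print(hbm2_capacity_gb(4, 8))       # 32.0 (four 8-Hi stacks)
print(bandwidth_gb_s(width, 2.0))   # 1024.0, i.e. about 1 TB/s at an assumed 2 Gb/s per pin
```

Under these assumptions, four stacks give the 4096-bit bus, 16GB or 32GB of capacity, and about 1 TB/s of peak bandwidth; a lower assumed per-pin rate such as 1.4 Gb/s would give roughly 717 GB/s instead.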

GPU Architecture	NVIDIA Fermi	NVIDIA Kepler	NVIDIA Maxwell	NVIDIA Pascal
GPU Process	40nm	28nm	28nm	16nm (TSMC FinFET)
Flagship Chip	GF110	GK210	GM200	GP100
GPU Design	SM (Streaming Multiprocessor)	SMX (Streaming Multiprocessor)	SMM (Streaming Multiprocessor Maxwell)	SMP (Streaming Multiprocessor Pascal)
Maximum Transistors	3.00 Billion	7.08 Billion	8.00 Billion	15.3 Billion
Maximum Die Size	520 mm²	561 mm²	601 mm²	610 mm²
Stream Processors Per Compute Unit	32 SPs	192 SPs	128 SPs	64 SPs
Maximum CUDA Cores	512 CCs (16 CUs)	2880 CCs (15 CUs)	3072 CCs (24 CUs)	3840 CCs (60 CUs)
FP32 Compute	1.33 TFLOPs (Tesla)	5.10 TFLOPs (Tesla)	6.10 TFLOPs (Tesla)	~12 TFLOPs (Tesla)
FP64 Compute	0.66 TFLOPs (Tesla)	1.43 TFLOPs (Tesla)	0.20 TFLOPs (Tesla)	~6 TFLOPs (Tesla)
Maximum VRAM	1.5 GB GDDR5	6 GB GDDR5	12 GB GDDR5	16 / 32 GB HBM2
Maximum Bandwidth	192 GB/s	336 GB/s	336 GB/s	720 GB/s - 1 TB/s
Maximum TDP	244W	250W	250W	300W
Launch Year	2010 (GTX 580)	2014 (GTX Titan Black)	2015 (GTX Titan X)	2016
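
The FP32 row in the table can be related to the core counts with the usual peak throughput estimate of cores × 2 FLOPs per clock (one fused multiply-add) × clock speed. The Python sketch below inverts that formula to get the clock each table entry would imply. The clocks are derived values under that assumption, not figures from the article, and the function names are hypothetical.

```python
def peak_tflops(cuda_cores, clock_ghz, flops_per_core_per_clock=2):
    """Peak TFLOPs, assuming one fused multiply-add (2 FLOPs) per core per clock."""
    return cuda_cores * flops_per_core_per_clock * clock_ghz / 1000.0


def implied_clock_ghz(cuda_cores, tflops, flops_per_core_per_clock=2):
    """Clock in GHz that a given core count would need to reach a TFLOPs figure."""
    return tflops * 1000.0 / (cuda_cores * flops_per_core_per_clock)


# (CUDA cores, FP32 TFLOPs) pairs taken from the table above.
TABLE = {
    "Fermi": (512, 1.33),
    "Kepler": (2880, 5.10),
    "Maxwell": (3072, 6.10),
    "Pascal": (3840, 12.0),
}

for arch, (cores, tflops) in TABLE.items():
    print(f"{arch}: ~{implied_clock_ghz(cores, tflops):.2f} GHz implied")
```

For the Pascal column, 3840 cores and roughly 12 TFLOPs would imply a clock a little above 1.5 GHz, well beyond what the Maxwell column implies, so the rumored figure leans on higher clocks as much as on the larger core count.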

We have seen several slides, but one comes from an independent researcher, also a CUDA Fellow, who posted the compute performance for several platforms in his presentation. The slide lists the NVIDIA Pascal GPU with stacked DRAM (1 TB/s) at up to 4 TFLOPs of Double Precision (FP64) and 12 TFLOPs of Single Precision (FP32) compute performance. In a slide from 2014, after the launch of the second generation Maxwell GPUs, an NVIDIA presentation also mentioned a GPU known as Pascal-Solo (not to be mistaken with Han Solo) in a slide showcasing the Tesla GPU accelerator roadmap. The Pascal-Solo part features just one GPU and has a 235W TDP. It comes in both active and passive cooled PCIe options and is expected to launch in 2016. The Beyond3D forum member estimated that the Tesla GPU could launch in Q2 of 2016.

There's no doubt that Pascal GPUs will feature a lot of compute performance aimed at the Tesla and Quadro markets. The next generation FinFET based graphics cards will have a lot of muscle to flex on the complex tasks found in HPC workloads. Expect GTC 2016 to bring a lot of new information on Pascal based Tesla solutions.

GPU Family	AMD Vega	AMD Navi	NVIDIA Pascal	NVIDIA Volta
Flagship GPU	Vega 10	Navi 10	NVIDIA GP100	NVIDIA GV100
GPU Process	14nm FinFET	7nm FinFET	TSMC 16nm FinFET	TSMC 12nm FinFET
GPU Transistors	15-18 Billion	TBC	15.3 Billion	21.1 Billion
GPU Cores (Max)	4096 SPs	TBC	3840 CUDA Cores	5376 CUDA Cores
Peak FP32 Compute	13.0 TFLOPs	TBC	12.0 TFLOPs	>15.0 TFLOPs (Full Die)
Peak FP16 Compute	25.0 TFLOPs	TBC	24.0 TFLOPs	120 Tensor TFLOPs
VRAM	16 GB HBM2	TBC	16 GB HBM2	16 GB HBM2
Memory (Consumer Cards)	HBM2	HBM3	GDDR5X	GDDR6
Memory (Dual-Chip Professional/ HPC)	HBM2	HBM3	HBM2	HBM2
HBM2 Bandwidth	484 GB/s (Frontier Edition)	>1 TB/s?	732 GB/s (Peak)	900 GB/s
Graphics Architecture	Next Compute Unit (Vega)	Next Compute Unit (Navi)	5th Gen Pascal CUDA	6th Gen Volta CUDA
Successor of (GPU)	Radeon RX 500 Series	Radeon RX 600 Series	GM200 (Maxwell)	GP100 (Pascal)
Launch	2017	2019	2016	2017
