NVIDIA Pascal GP100 GPU Expected To Feature 12 TFLOPs of Single Precision Compute, 4 TFLOPs of Double Precision Compute Performance

Hassan Mujtaba

New details on NVIDIA's Pascal GPU have been dug up by 3DCenter (via Beyond3D) which showcase the total compute performance of the upcoming FinFET based chip. At CES 2016, NVIDIA announced their Pascal based Drive PX 2 module, an automotive supercomputer that uses the graphics processing power of GPUs to drive cars autonomously. The presentation didn't mention the flagship chip, but we expect a more detailed session on it at GTC in April 2016.

NVIDIA Pascal GP100 Flagship GPU Might Come With 12 TFLOPs of Single Precision, 4 TFLOPs of Double Precision Compute

The one GPU that everyone has their eyes on right now, whether they are enterprise or mainstream audiences, is the flagship Pascal GPU known as GP100 (the naming scheme for the chip is not confirmed yet). This is going to be the flagship chip of the lineup and will be featured on Tesla, Quadro and GeForce graphics cards. The chip is based on the 16nm FinFET process, which leads to efficiency improvements and better performance per watt, but with Pascal, double precision compute returns with a bang. Maxwell, NVIDIA's current gen architecture, made some serious gains in the performance per watt department, and Pascal is expected to carry that tradition forward.

The information today comes from slides which have long existed but which most people haven't had access to. The slides were found by iMacmatican, a Beyond3D forum member who has compiled a good list of details over at the forum. Since most of these slides date back to 2014-2015, there are bound to be some changes to the GPU design, which we will also explain in a bit. First of all, a slide from a presentation in March 2014 detailed double precision GFLOPs per watt for various NVIDIA GPUs. The approximate values for NVIDIA's CUDA generations of GPUs are listed below:

  • Tesla: 0.5
  • Fermi: 2
  • Kepler: 5.5
  • Pascal: 14
  • Volta: 22
[Slides: NVIDIA Pascal GPU double precision, single precision and half precision performance, plus VRAM buffer details]

Slide Credits: Beyond3D Forum

The slide clearly shows that Pascal is rated at 14 GFLOPs per watt while Volta is rated at 22 GFLOPs per watt. The slide states that these approximations are for Double Precision or DGEMM (Double-precision General Matrix Multiply) GFLOPs/W rather than single precision, which is why Maxwell is absent from that slide, since it featured only minimal FP64 hardware under the hood. The fastest Kepler based Tesla K40X comes in at 6.1 GFLOPs/W and the dual-chip Tesla K80X comes in at 6.2 GFLOPs/W. Pascal is expected to take this to around 14 GFLOPs/W, more than twice Kepler's double precision GFLOPs/W.

Coming to single precision, SGEMM (Single-precision General Matrix Multiply) throughput is rated at 42 GFLOPs/W for Pascal. Maxwell is rated at 23 GFLOPs/W, with the dual-chip offering pushing that up to 25 GFLOPs/W, while Volta is rated at 73 GFLOPs/W. There's also a slide that details HGEMM (Half-precision General Matrix Multiply). We know that Pascal and later generations of GPUs will come with mixed precision compute, which lets users get twice the compute performance in FP16 workloads compared to FP32 by processing two 16-bit operations in the time of a single 32-bit operation, trading numerical precision for throughput. Compared to Maxwell, which manages just 26 half precision GFLOPs/W, Pascal will take that up to 85 GFLOPs/W while Volta will do up to 145 GFLOPs/W.
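
To see how these per-watt figures line up with the headline numbers, here is a minimal back-of-the-envelope sketch in Python. The GFLOPs/W values are taken from the slides above; the 300W board power is our own assumption based on typical flagship Tesla TDPs, not a confirmed figure.

```python
# Rough conversion of the leaked GFLOPs/W figures into absolute peak throughput.
# The efficiency values come from the 2014 slides; the TDP is an assumption
# (flagship Tesla boards have historically shipped in the 235-300 W range).

gflops_per_watt = {
    "DGEMM (FP64)": 14,
    "SGEMM (FP32)": 42,
    "HGEMM (FP16)": 85,
}

assumed_tdp_watts = 300  # hypothetical flagship board power, not confirmed

for workload, efficiency in gflops_per_watt.items():
    peak_tflops = efficiency * assumed_tdp_watts / 1000  # GFLOPs -> TFLOPs
    print(f"{workload}: ~{peak_tflops:.1f} TFLOPs at {assumed_tdp_watts} W")
```

At 300W, the 42 GFLOPs/W single precision figure works out to roughly 12.6 TFLOPs and the 14 GFLOPs/W double precision figure to roughly 4.2 TFLOPs, which is right in line with the 12 TFLOPs / 4 TFLOPs numbers being reported.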

[Slides: NVIDIA Pascal GPU for Tesla and Pascal GPU performance]

Coming to the meatier part, the 2014 slides are full of useful data on Pascal GPUs. Of course, these slides predate the point when Pascal GPUs actually taped out and entered NVIDIA's labs for testing, which NVIDIA themselves confirmed during SC15 and, several months before that, at their GTC Japan session. At some point, NVIDIA switched their designs from HMC (Hybrid Memory Cube) based solutions to HBM2 based solutions, and they presented the updated design to the audience at GTC Japan 2015.

The prototype Pascal board showcased back at GTC 2014 was actually based on an HMC implementation, and that changed in 2015. From details mentioned in the slides, NVIDIA is claiming that they have integrated the memory (HBM2) to be part of the actual GPU die. Now this could mean one of two things: either NVIDIA has actually managed to integrate HBM2 and a 16nm GPU on the same die, or they could be using a design similar to AMD's Fury cards, which fuse the GPU and HBM chips on a single interposer, making them a single-chip solution, sort of like an SoC.

What we know so far about NVIDIA's flagship Pascal GP100 GPU:

  • Pascal graphics architecture.
  • An estimated 2x improvement in performance per watt over Maxwell.
  • To launch in 2016, purportedly the second half of the year.
  • DirectX 12 feature level 12_1 or higher.
  • Successor to the GM200 GPU found in the GTX Titan X and GTX 980 Ti.
  • Built on the 16nm FinFET manufacturing process from TSMC.
  • Allegedly has a total of 17 billion transistors, more than twice that of GM200.
  • Will feature four 4-Hi HBM2 stacks for a total of 16 GB of VRAM, and 8-Hi stacks for up to 32 GB on the professional compute SKUs.
  • Features a 4096-bit memory bus interface, the same as AMD's Fiji GPU powering the Fury series (see the bandwidth sketch after this list).
  • Features NVLink (only compatible with next generation IBM PowerPC server processors)
  • Supports half precision FP16 compute at twice the rate of full precision FP32.
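
As referenced in the list above, here is a similar hedged sketch for the rumored configuration. The 3840 CUDA cores and the 4096-bit bus come from the rumors; the boost clock and the HBM2 per-pin data rate are illustrative assumptions only, not leaked or confirmed values.

```python
# Rough peak-throughput estimates for the rumored GP100 configuration.
# Core count and bus width are from the rumors above; the clock speed and
# HBM2 per-pin data rate are assumptions for illustration only.

cuda_cores = 3840                  # rumored full GP100 configuration
assumed_boost_clock_ghz = 1.5      # hypothetical, not confirmed
flops_per_core_per_clock = 2       # one fused multiply-add counts as 2 FLOPs

peak_fp32_tflops = cuda_cores * flops_per_core_per_clock * assumed_boost_clock_ghz / 1000
peak_fp16_tflops = peak_fp32_tflops * 2    # FP16 rumored to run at twice the FP32 rate

bus_width_bits = 4096              # four HBM2 stacks at 1024 bits each
assumed_pin_rate_gbps = 1.4        # hypothetical HBM2 data rate per pin
peak_bandwidth_gbs = bus_width_bits * assumed_pin_rate_gbps / 8

print(f"Peak FP32: ~{peak_fp32_tflops:.2f} TFLOPs")       # ~11.52 TFLOPs
print(f"Peak FP16: ~{peak_fp16_tflops:.2f} TFLOPs")       # ~23.04 TFLOPs
print(f"Peak bandwidth: ~{peak_bandwidth_gbs:.0f} GB/s")  # ~717 GB/s
```

Under those assumed clocks, the math lands at roughly 11.5 TFLOPs FP32 and around 717 GB/s of bandwidth, close to the ~12 TFLOPs and 720 GB/s figures floating around, which is why these rumored specifications hang together.
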
| GPU Architecture | NVIDIA Fermi | NVIDIA Kepler | NVIDIA Maxwell | NVIDIA Pascal |
|---|---|---|---|---|
| GPU Process | 40nm | 28nm | 28nm | 16nm (TSMC FinFET) |
| Flagship Chip | GF110 | GK210 | GM200 | GP100 |
| GPU Design | SM (Streaming Multiprocessor) | SMX (Streaming Multiprocessor) | SMM (Streaming Multiprocessor Maxwell) | SMP (Streaming Multiprocessor Pascal) |
| Maximum Transistors | 3.00 Billion | 7.08 Billion | 8.00 Billion | 15.3 Billion |
| Maximum Die Size | 520mm2 | 561mm2 | 601mm2 | 610mm2 |
| Stream Processors Per Compute Unit | 32 SPs | 192 SPs | 128 SPs | 64 SPs |
| Maximum CUDA Cores | 512 CCs (16 CUs) | 2880 CCs (15 CUs) | 3072 CCs (24 CUs) | 3840 CCs (60 CUs) |
| FP32 Compute | 1.33 TFLOPs (Tesla) | 5.10 TFLOPs (Tesla) | 6.10 TFLOPs (Tesla) | ~12 TFLOPs (Tesla) |
| FP64 Compute | 0.66 TFLOPs (Tesla) | 1.43 TFLOPs (Tesla) | 0.20 TFLOPs (Tesla) | ~6 TFLOPs (Tesla) |
| Maximum VRAM | 1.5 GB GDDR5 | 6 GB GDDR5 | 12 GB GDDR5 | 16 / 32 GB HBM2 |
| Maximum Bandwidth | 192 GB/s | 336 GB/s | 336 GB/s | 720 GB/s - 1 TB/s |
| Maximum TDP | 244W | 250W | 250W | 300W |
| Launch Year | 2010 (GTX 580) | 2014 (GTX Titan Black) | 2015 (GTX Titan X) | 2016 |

We have seen several slides, but there's one from an independent researcher, also a CUDA fellow, that lists the compute performance for several platforms in his presentation. The slide puts the NVIDIA Pascal GPU with stacked DRAM (1 TB/s) at up to 4 TFLOPs of Double Precision (FP64) and 12 TFLOPs of Single Precision (FP32) compute performance. A slide from 2014, released after the launch of the second generation Maxwell GPUs, also mentioned a GPU known as Pascal-Solo (not to be mistaken for Han Solo) in an NVIDIA presentation showcasing their Tesla GPU accelerator roadmap. The Pascal-Solo board features a single GPU and has a 235W TDP. The part comes in PCIe form factors with both active and passive cooling options and is expected to launch in 2016. The Beyond3D forum member estimated that the Tesla GPU could launch in Q2 2016.

There's no doubt that Pascal GPUs will pack a lot of compute performance aimed at the Tesla and Quadro markets. The next generation FinFET based graphics cards will have plenty of muscle to flex at the complex tasks found in HPC workloads. Expect GTC 2016 to bring a lot of new information on Pascal based Tesla solutions.

| GPU Family | AMD Vega | AMD Navi | NVIDIA Pascal | NVIDIA Volta |
|---|---|---|---|---|
| Flagship GPU | Vega 10 | Navi 10 | NVIDIA GP100 | NVIDIA GV100 |
| GPU Process | 14nm FinFET | 7nm FinFET | TSMC 16nm FinFET | TSMC 12nm FinFET |
| GPU Transistors | 15-18 Billion | TBC | 15.3 Billion | 21.1 Billion |
| GPU Cores (Max) | 4096 SPs | TBC | 3840 CUDA Cores | 5376 CUDA Cores |
| Peak FP32 Compute | 13.0 TFLOPs | TBC | 12.0 TFLOPs | >15.0 TFLOPs (Full Die) |
| Peak FP16 Compute | 25.0 TFLOPs | TBC | 24.0 TFLOPs | 120 Tensor TFLOPs |
| VRAM | 16 GB HBM2 | TBC | 16 GB HBM2 | 16 GB HBM2 |
| Memory (Consumer Cards) | HBM2 | HBM3 | GDDR5X | GDDR6 |
| Memory (Dual-Chip Professional/HPC) | HBM2 | HBM3 | HBM2 | HBM2 |
| HBM2 Bandwidth | 484 GB/s (Frontier Edition) | >1 TB/s? | 732 GB/s (Peak) | 900 GB/s |
| Graphics Architecture | Next Compute Unit (Vega) | Next Compute Unit (Navi) | 5th Gen Pascal CUDA | 6th Gen Volta CUDA |
| Successor of (GPU) | Radeon RX 500 Series | Radeon RX 600 Series | GM200 (Maxwell) | GP100 (Pascal) |
| Launch | 2017 | 2019 | 2016 | 2017 |