Intel’s New E-Core (Gracemont) and P-Core (Golden Cove) Architectural Deep Dive

Usman Pirzada

Intel deep-dived into its upcoming E-core and P-core architectural designs (what it previously referred to as small and big cores), and they look like a true generational leap in power and efficiency. To contextualize this for our readers, these are the designs formerly known as Gracemont and Golden Cove. We will let the Intel experts explain the architectures this time:


Deep diving into Intel's Efficient Core (E-Core) by Stephen Robinson

Our primary goal was to build the world's most efficient x86 CPU core. We wanted to do that while still delivering more IPC than Intel's most prolific CPU microarchitecture to date: Skylake. We also set an aggressive silicon-area target so that multi-core workloads could be scaled out with as many cores as necessary. With these architectural anchors in place, we also wanted to deliver a wide frequency range. This allows us to save power by running at low voltage, and it creates headroom to increase frequency and ramp up performance for more demanding workloads.

Finally, we wanted to provide rich ISA features, such as advanced vector and AI instructions that accelerate modern workloads. I am pleased to say that we delivered on all of our goals, and it is my honor to introduce Intel's newest efficient x86 core microarchitecture. Thanks to a deep front end, a wide back end, and a design optimized to take advantage of Intel 7, this CPU core delivers a breakthrough in multi-core performance. Let's now dive deeper into the details, starting with the front end.

The first aspect of driving efficient IPC is making sure we can process instructions as quickly as possible. This starts with accurate branch prediction; without it, much of the work ends up being thrown away, which is wasteful. We implemented a 5,000-entry branch target cache and complemented it with long-history-based branch prediction. This helps us quickly generate accurate instruction pointers. With accurate branch prediction, things like instruction cache misses can be discovered and remedied early, before they become critical to program execution. Workloads like web browsers, databases, and packet processing all benefit from these capabilities.
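To see why prediction accuracy matters so much in practice, here is a minimal C sketch (our illustration, not Intel code): the same loop runs over the same data twice, and sorting the data first turns an unpredictable branch into a nearly free one.

```c
/* Minimal sketch: why branch prediction accuracy matters.
 * Summing over sorted data makes the branch below highly predictable;
 * random data defeats the predictor and stalls the pipeline. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static long sum_above(const int *v, size_t n, int threshold) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] >= threshold)   /* ~50/50 and random: hard to predict */
            sum += v[i];
    return sum;
}

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    enum { N = 1 << 24 };
    int *v = malloc(N * sizeof *v);
    for (size_t i = 0; i < N; i++) v[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = sum_above(v, N, 128);          /* random order */
    clock_t t1 = clock();
    qsort(v, N, sizeof *v, cmp_int);
    clock_t t2 = clock();
    long s2 = sum_above(v, N, 128);          /* sorted: predictable */
    clock_t t3 = clock();

    printf("random: %ld (%.0f ms), sorted: %ld (%.0f ms)\n",
           s1, 1000.0 * (t1 - t0) / CLOCKS_PER_SEC,
           s2, 1000.0 * (t3 - t2) / CLOCKS_PER_SEC);
    free(v);
    return 0;
}
```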

We also have a 64-kilobyte instruction cache that keeps the most useful instructions close without expending power in the memory subsystem. This microarchitecture features Intel's first on-demand instruction length decoder, which generates pre-decode information that is stored alongside the instruction cache. This gives us the best combination of characteristics: code that has never been seen before is decoded quickly, yet the next time it is executed, we bypass the length decoder and save energy. The new core also features Intel's revolutionary clustered out-of-order decoder, which enables decoding up to six instructions per cycle while maintaining the energy efficiency of a much narrower core.

The second main aspect of achieving performance is ensuring you extract any parallelism inherent in the program. With five-wide allocation, a wide retire, a 256-entry out-of-order window, and 17 execution ports, this microarchitecture delivers more general IPC than the Intel Skylake core while consuming a fraction of the power. The execution ports are scaled to the unique requirements of each unit, which maximizes both performance and energy efficiency.
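To illustrate the kind of instruction-level parallelism a wide out-of-order window can extract, consider this C sketch (ours, not from the presentation): the first loop is one serial dependency chain, while the second gives the scheduler four independent chains to spread across the integer ports.

```c
/* Sketch: exposing instruction-level parallelism.
 * The single-accumulator loop forms one long dependency chain, so the
 * wide out-of-order core cannot overlap the additions. Splitting the
 * work into four independent accumulators lets the scheduler issue
 * them in parallel across the integer ports. */
#include <stddef.h>

long sum_serial(const long *v, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];                  /* each add waits on the previous */
    return s;
}

long sum_parallel(const long *v, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {    /* four independent chains */
        s0 += v[i + 0];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++) s0 += v[i];  /* remainder */
    return s0 + s1 + s2 + s3;
}
```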

Four general-purpose integer execution ports are complemented by dual integer multipliers and dividers, and we can resolve two branches per cycle. For vector operations, we have three SIMD ALUs, and the integer multiplier supports Intel's Vector Neural Network Instructions (VNNI). Two symmetric floating-point pipelines allow executing two independent add or multiply operations.
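For a sense of what those integer AI instructions do, here is a minimal sketch of an int8 dot product built on the VPDPBUSD instruction from AVX-VNNI, the VEX-encoded form these efficient cores expose. This is our illustration, assuming a compiler with AVX-VNNI support (GCC 11+/Clang 12+, built with -mavxvnni); it is not Intel sample code.

```c
/* Sketch: int8 dot product with AVX-VNNI. Each VPDPBUSD multiplies
 * 32 unsigned-by-signed byte pairs and accumulates into eight 32-bit
 * sums in a single instruction. */
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

int32_t dot_u8s8(const uint8_t *a, const int8_t *b, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        acc = _mm256_dpbusd_avx_epi32(acc, va, vb);  /* VPDPBUSD */
    }
    /* horizontal sum of the eight 32-bit partial sums */
    int32_t tmp[8];
    _mm256_storeu_si256((__m256i *)tmp, acc);
    int32_t sum = 0;
    for (int k = 0; k < 8; k++) sum += tmp[k];
    for (; i < n; i++) sum += (int32_t)a[i] * b[i];  /* remainder */
    return sum;
}
```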

Thanks to Advanced Vector Extensions, we can also execute two floating-point multiply-add instructions per cycle. Advanced crypto units round out the vector stack, providing AES and SHA acceleration. The final aspect of achieving efficient performance is a fast memory subsystem. Two load pipelines plus two store pipelines enable 32 bytes of read and 32 bytes of write bandwidth per cycle, simultaneously. The L2 cache, which is shared among four cores, can be 2 or 4 megabytes depending on product-level requirements. This large L2 provides high performance and power efficiency for single-threaded workloads by keeping data close.
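Returning to the two symmetric floating-point pipelines for a moment: code benefits from them only when it keeps independent multiply-add chains in flight. A minimal C sketch (ours, not Intel's), assuming AVX2 and FMA support (-mavx2 -mfma): a dot product with a single accumulator is bound by FMA latency, while two accumulators can keep both pipes busy.

```c
/* Sketch: keeping both floating-point pipes busy. A single
 * accumulator chain runs at the FMA latency; two independent
 * accumulators let the core sustain two fused multiply-adds
 * per cycle. */
#include <immintrin.h>
#include <stddef.h>

float dot_fma(const float *a, const float *b, size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {   /* two independent FMA chains */
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                               _mm256_loadu_ps(b + i), acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                               _mm256_loadu_ps(b + i + 8), acc1);
    }
    __m256 acc = _mm256_add_ps(acc0, acc1);
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);
    float sum = 0.0f;
    for (int k = 0; k < 8; k++) sum += tmp[k];
    for (; i < n; i++) sum += a[i] * b[i];  /* remainder */
    return sum;
}
```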

It also provides enough bandwidth to serve all four cores: the L2 can deliver 64 bytes per cycle with a 17-cycle latency. The memory subsystem has deep buffering, and each four-core module can have up to 64 outstanding misses to the last-level cache and beyond. Advanced prefetchers exist at all cache levels to automatically detect a wide variety of streaming behavior. Intel Resource Director Technology ensures that software can control resource allocation among the cores.
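The hardware prefetchers handle regular streams on their own; for irregular, data-dependent access patterns, software can still lend a hand. Here is a minimal sketch using the standard _mm_prefetch hint (our illustration; the prefetch distance is a hypothetical tuning value, not an Intel number).

```c
/* Sketch: a non-binding software prefetch for irregular access that
 * hardware stream prefetchers cannot anticipate. PF_DIST is a
 * hypothetical tuning parameter; the right value depends on miss
 * latency and loop cost. */
#include <immintrin.h>
#include <stddef.h>

#define PF_DIST 8  /* elements ahead; assumed value for illustration */

long gather_sum(const long *table, const int *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            _mm_prefetch((const char *)&table[idx[i + PF_DIST]],
                         _MM_HINT_T0);
        sum += table[idx[i]];   /* data-dependent, irregular access */
    }
    return sum;
}
```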

A robust set of security features, along with an ISA that supports a wide range of data types, is important for every new microarchitecture. We support features like Intel Control-flow Enforcement Technology and Intel Virtualization Technology Redirect Protection. We put additional focus on security validation and developed several novel techniques to harden against certain attack vectors. We also implemented the AVX ISA along with new extensions to support integer AI operations.

In addition to choosing what to include, one of the most important aspects of designing a new microarchitecture is deciding what not to include. We balanced this trade-off by focusing on the features that were needed and leaving out the rest. This results in area efficiency, which in turn allows products to scale out the number of cores, and it also helps reduce energy per instruction. Minimizing power is the biggest design challenge for today's processors. Power is a combination of multiple factors, of which voltage is the most important. This microarchitecture and our focused design effort allow us to run at low voltage to reduce power consumption, while at the same time creating the headroom to operate at higher frequencies.

Okay, now let's take a look at the results of this new design, first looking at latency. If we compare our core to a single Skylake core, we can achieve 40% more performance at the same power level, or deliver the same performance at less than 40% of the power. To say it differently, a Skylake core would consume two and a half times the power to achieve the same performance. This is a tremendous achievement.

However, we're even more excited about the throughput results. If we compare four of our new CPU cores against two Skylake cores running four threads, we deliver 80 percent more performance while still consuming less power. Alternatively, we deliver the same throughput while consuming 80% less power. Again, this means Skylake would need to consume five times the power to reach the same performance. As you can imagine, these are very exciting results for us, and they are all the more incredible when you consider that we can deliver four of our new cores in a similar footprint to a single Skylake core. In conclusion, we are extremely proud of our new, highly scalable microarchitecture.

Deep diving into Intel's Performance Core (P-Core) by Adi Yoaz

I'm pleased to announce that we delivered on all of our objectives, and it is my privilege to introduce Intel's new performance x86 core architecture: designed for speed, pushing the limits of low latency and single-threaded application performance. To keep driving general-purpose performance, we have architected the machine to be wider, deeper, and smarter. It has deeper out-of-order schedulers and buffers, more physical registers, wider allocation, and more execution ports. Making the machine wider and deeper can expose higher degrees of parallelism and provide higher performance, but only if it is fed with instructions from the correct path and with data arriving in time for execution.

To make this new wider and deeper machine effective, we also made it smarter, with features that improve branch prediction and instruction supply, collapse dependency chains, and bring data closer to the time when it is needed. On top of the baseline features that speed up most common workloads, we added dedicated features for workloads with particular properties. For example, in order to better support applications with large code footprints, we now track many more branch prediction targets.

For emerging workloads with large, irregular datasets, the machine can simultaneously service four page table walks. And for the evolving trends in machine learning, we added dedicated hardware and new ISA to perform matrix multiplication operations, for an order-of-magnitude performance increase in AI acceleration. This is architected for software ease of use, leveraging the x86 programming model.
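The dedicated matrix hardware and ISA referenced here is Intel AMX. Below is a minimal, heavily hedged sketch of its int8 tile multiply (TDPBSSD). It assumes a Linux kernel that gates AMX state behind arch_prctl, a compiler with -mamx-tile -mamx-int8, and a B matrix pre-packed in the 4-bytes-per-row VNNI layout; it illustrates the programming model and is not Intel reference code.

```c
/* Sketch: one TDPBSSD performs a 16x64 by 64x16 int8 matrix
 * multiply-accumulate into an int32 tile. */
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ARCH_REQ_XCOMP_PERM 0x1023
#define XFEATURE_XTILEDATA  18

/* 64-byte tile configuration consumed by LDTILECFG */
struct tile_cfg {
    uint8_t  palette_id;     /* 1 = the baseline tile palette */
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];      /* bytes per row, per tile register */
    uint8_t  rows[16];       /* rows, per tile register */
};

/* C[16][16] (int32) += A[16][64] (int8) * B_packed[16][64] (int8) */
void amx_matmul_16x64(int32_t C[16][16],
                      const int8_t A[16][64],
                      const int8_t B_packed[16][64]) {
    struct tile_cfg cfg __attribute__((aligned(64)));
    memset(&cfg, 0, sizeof cfg);
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;   /* tmm0: C, 16 int32 cols */
    cfg.rows[1] = 16; cfg.colsb[1] = 64;   /* tmm1: A                */
    cfg.rows[2] = 16; cfg.colsb[2] = 64;   /* tmm2: B (VNNI-packed)  */
    _tile_loadconfig(&cfg);

    _tile_loadd(0, C, 64);                 /* strides in bytes */
    _tile_loadd(1, A, 64);
    _tile_loadd(2, B_packed, 64);
    _tile_dpbssd(0, 1, 2);                 /* tmm0 += tmm1 * tmm2 */
    _tile_stored(0, C, 64);
    _tile_release();
}

int main(void) {
    /* Ask the kernel for permission to use the AMX tile state. */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM,
                XFEATURE_XTILEDATA) != 0)
        return 1;
    static int32_t C[16][16];
    static int8_t  A[16][64], B[16][64];
    memset(A, 1, sizeof A);
    memset(B, 2, sizeof B);
    amx_matmul_16x64(C, A, B);
    return C[0][0] == 128 ? 0 : 2;         /* 64 products of 1*2 */
}
```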

Additional performance is achieved through core-autonomous, fine-grained power management technology. The performance core integrates a new microcontroller that can capture and account for events at a granularity of microseconds instead of milliseconds, and tighten power budget utilization based on actual application behavior. The result is a higher average frequency for any given application. This is our largest architectural shift in over a decade. And with that as an introduction, let me now dive deeper.

The first step in building a balanced, wider core is to widen and enhance the core’s front end. Instruction supply was improved both on the decoder side and on the micro-op cache path. Instruction fetch has doubled, now running at 32 bytes per cycle, and two more decoders were added, enabling six decoded micro-ops per cycle from the decoders. When delivering micro-operations out of the micro-op cache, we can now get eight micro-ops per cycle, and the micro-op cache itself has grown to 4K micro-operations, up from 2.25K.

This allows us to better feed the out-of-order engine, deliver higher micro-op bandwidth, and do so with a lower-latency, shorter pipeline. To better support software with large code footprints, we doubled the number of 4K pages and large pages stored in the ITLBs. We have a smarter code prefetch mechanism that hides much of the instruction cache miss latency, and we improved branch prediction accuracy to reduce jump mispredicts.

The branch target buffer is more than 2x larger than the one in the previous generation, which greatly improves performance for workloads with a lot of code. It uses a machine learning algorithm to dynamically grow and shrink its effective size: it shuts off excess capacity when it is not needed to save power, and turns on extra capacity when it is needed to improve performance. With a wider and smarter front end, we now turn to the out-of-order part of the machine. The out-of-order engine is where the magic happens, and this is what separates CPU architectures from all other architectures. We are widening the machine by going from five-wide to six-wide rename/allocation and from ten to twelve execution ports.

The machine is also becoming significantly deeper, with a 512-entry reorder buffer, wider physical register files, and a deeper, distributed, per-operation-type scheduling window, all tuned for performance and power efficiency. To further improve both performance and power efficiency, smart features enable collapsing dependency chains by executing some simple instructions at the allocation stage, thereby saving resources down the pipe. This allows other operations residing on the critical path to run faster, while better utilizing execution bandwidth and saving power.

With a wider, deeper, and smarter out-of-order engine, we also wanted to enhance our execution units significantly. Let me start with the integer ports: we added a 1-cycle LEA on all five ports. These LEA units can also be used for generic arithmetic calculations like additions and subtractions, as well as for fast multiplication by certain fixed constants. The newly added LEAs can also perform scaled operations in a single cycle, similar to the LEAs we have on ports 1 and 5. Similarly, on the floating-point vector side, we have added new fast adders (FADD) on ports 1 and 5.
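As a concrete example of the work these LEA units absorb, compilers already turn small constant multiplies and address arithmetic into single LEA instructions. The assembly in the comments below is typical GCC/Clang x86-64 output, shown for illustration only.

```c
/* Sketch: the kind of simple operations the 1-cycle LEA units execute.
 * A scaled-index LEA folds a shift, an add, and a displacement into
 * one instruction. */
long scale_by_5(long x) {
    return x * 5;             /* lea rax, [rdi + rdi*4] */
}

long index_addr(long base, long i) {
    return base + i * 8 + 16; /* lea rax, [rdi + rsi*8 + 16] */
}
```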

The L2 cache subsystem has more than doubled the number of demand and prefetch operations that can be serviced in parallel. A completely new L2 prefetch engine was developed, leveraging a deeper understanding of program behavior: it observes the running program in order to estimate the likelihood of future memory access patterns.
