Copyleft Jonathan Riddell 2001 May be copied under the terms of the GNU Free Documentation license only

31V4 Assignment 2: Systems Investigation

Student 9957824

The Evolution of the Pentium Processor

The Pentium 4 processor, which has recently become available, is the latest in a long family of processors from Intel, that started with the 8088. Describe the development of this family. Compare the internal structures of the various members of the family and discuss the reasons for change.

The Pentium 4 processor was announced in a blaze of publicity in late 2000 by Intel. It is the newest member of the most successful (in terms of units sold) family of computer processors ever. Its history can be traced back to what is generally considered the first ever microprocessor - Intel's 4004, designed in 1969 - and, of course, to the development of transistor chips which led to it. When IBM and its competitors used the 8088 in the first PCs, the success of the architecture which Intel now likes to call IA-32 (for Intel Architecture 32 bit, although it began as a 16 bit architecture) was assured. Recently there has been some question over the future of IA-32, whose insistence on backwards compatibility has created chips which are less efficient than rivals with more modern designs.

The 8086, launched in mid 1978, was the first processor in the x86 family. All later members are backwards compatible with it. It used 16-bit registers and a 16-bit data bus with 20-bit addressing and had a clock speed of 5MHz. Like its (incompatible) predecessors from Intel, it was mostly used for calculators and specialised but simple systems such as controlling traffic lights. A year later the 8088 was launched. It was identical to the 8086 except for its restricted 8-bit external data bus (like the 8086, it used 16 bits internally). The smaller data bus allowed the chip itself to be cheaper, and the hardware attached to it could also be simpler, making it the chip of choice for IBM when they hastily put together the PC in response to the rapidly growing personal computing market.

Because the 8086/8088 can only process 16 bits of data at one time, a technique had to be developed to work with a 20-bit address bus. A 16-bit segment register selects the start of a region of memory, and an offset from that start of zero to 64 kilobytes is held in another register. The actual memory address is formed by multiplying the segment value by 16 (shifting it left by four bits) and adding the offset, giving a total of 1 megabyte of addressable memory.
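The way the segment and offset combine can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the segment is shifted left by four bits (multiplied by 16) and the offset is added. The function and constant names are invented for the example and do not come from any Intel document.

```python
# Illustrative sketch of 8086 real-mode address formation.
# The names below are made up for this example.

ADDRESS_MASK = 0xFFFFF  # 20 address lines, so addresses wrap at 1 megabyte


def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # Shifting left by four bits multiplies the segment value by 16.
    return ((segment << 4) + offset) & ADDRESS_MASK


if __name__ == "__main__":
    print(hex(physical_address(0x1234, 0x0010)))  # 0x12350
    print(hex(physical_address(0xB800, 0x0000)))  # 0xb8000
```

Note that many different segment and offset pairs name the same byte of memory, which is one reason this scheme was awkward to program.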

The 286 was backwards compatible with the 8086 and quickly became the standard for PCs. It introduced a layer of memory abstraction called protected mode, in which the segment register selects an entry in a descriptor table rather than pointing at memory directly. These descriptor tables hold 24-bit pointers to memory, upping physical memory to a possible 16 megabytes. They also gave some limited virtual memory support, which allows a partition (some operating systems use a file) on a hard disk to act as an extension of memory, giving a possible 1 gigabyte of memory, plus protection mechanisms to prevent programs in user space overwriting kernel space memory. These features gave the first hope that multi-tasking would be possible on PCs.
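The 16 megabyte and 1 gigabyte figures follow from the widths involved, and the descriptor lookup itself can be sketched roughly as below. This is a simplified model written for this essay, not Intel's definition: the Descriptor class and the translate_286 function are hypothetical names, and real descriptors also carry access rights and a present bit which are left out here. The selector layout assumed is a 13-bit table index, one bit choosing between the global and local table, and two privilege bits.

```python
from dataclasses import dataclass

# Sizes implied by the field widths.
PHYSICAL_BYTES = 2 ** 24            # 24-bit base addresses: 16 megabytes
VIRTUAL_BYTES = 2 ** (13 + 1 + 16)  # 13-bit index, 1 table bit, 16-bit offset


@dataclass
class Descriptor:
    """Simplified segment descriptor (hypothetical, for illustration only)."""
    base: int   # 24-bit physical address where the segment starts
    limit: int  # highest valid offset inside the segment


def translate_286(selector: int, offset: int, gdt: dict, ldt: dict) -> int:
    """Turn a selector:offset pair into a 24-bit physical address."""
    index = selector >> 3             # bits 3-15 pick the table entry
    use_local = (selector >> 2) & 1   # bit 2 picks the local or global table
    table = ldt if use_local else gdt
    descriptor = table[index]
    if offset > descriptor.limit:
        # The protection mechanism: a program cannot reach past its segment.
        raise MemoryError("offset beyond segment limit")
    return (descriptor.base + offset) & 0xFFFFFF


if __name__ == "__main__":
    print(PHYSICAL_BYTES // 2 ** 20, "megabytes of physical memory")
    print(VIRTUAL_BYTES // 2 ** 30, "gigabyte of virtual memory")
    gdt = {1: Descriptor(base=0x200000, limit=0x0FFF)}
    print(hex(translate_286(0x0008, 0x0010, gdt, {})))  # 0x200010
```

Because the program only ever sees the selector, the operating system is free to move a segment or swap it out to disk by changing the descriptor.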

By the mid 1980s the processing limitations of 16-bit processors were clear. Operating system features such as a good graphical interface and multi-tasking between several programs at once simply demand too much when you can only handle 16 bits at one time. So Intel worked on extending the 286 to use a 32-bit data bus, a 32-bit address bus, 32-bit registers, and a 32-bit internal architecture. The low 16 bits of the registers in the 386 kept exactly the same function as in the 286, providing the essential backwards compatibility.
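The register compatibility can be pictured with the accumulator, where the 16-bit AX register is simply the low half of the 32-bit EAX register. The sketch below models that relationship with plain integers; the function names are invented for the illustration.

```python
# Sketch of how a 16-bit register lives inside its 32-bit extension.
# read_ax and write_ax are hypothetical helper names.


def read_ax(eax: int) -> int:
    """Return the low 16 bits of EAX, which old code sees as AX."""
    return eax & 0xFFFF


def write_ax(eax: int, value: int) -> int:
    """Write AX: replace the low 16 bits and leave the upper 16 untouched."""
    return (eax & 0xFFFF0000) | (value & 0xFFFF)


if __name__ == "__main__":
    eax = 0x12345678
    print(hex(read_ax(eax)))           # 0x5678
    print(hex(write_ax(eax, 0xABCD)))  # 0x1234abcd
```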

The 386 was released in 1985 but 32-bit operating systems were not available for some time (around 1991 for Linux), so it had to be able to run 16-bit programs efficiently. Operating systems could choose on start-up to run the chip in real-address mode, allowing them to run on the new chip with little alteration. Even with 32-bit operating systems, legacy code needs to be able to run in a multi-tasking environment, so the 386 featured virtual-8086 mode, which runs under the normal protected mode and emulates the 16-bit environment.

As a 32-bit processor, the 386 required many enhancements to the existing instruction set and many completely new instructions (including, at last, bit manipulation instructions). The processor also introduced paging - the basis for advanced virtual memory. Finally the 386 introduced some measures for pipelining: while one instruction was being executed, the next was already being fetched from memory.
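Paging on the 386 divides memory into 4 kilobyte pages and splits a 32-bit linear address into three fields: 10 bits to choose an entry in a page directory, 10 bits to choose an entry in a page table, and 12 bits of offset within the page. The following is a minimal sketch of that lookup, assuming the tables are modelled as Python dictionaries; the function names are made up and the flag bits found in real table entries are ignored.

```python
PAGE_SIZE = 4096  # 4 kilobyte pages, so 12 bits of offset


def split_linear_address(linear: int) -> tuple:
    """Split a 32-bit linear address into (directory, table, offset)."""
    directory_index = (linear >> 22) & 0x3FF  # top 10 bits
    table_index = (linear >> 12) & 0x3FF      # middle 10 bits
    offset = linear & 0xFFF                   # low 12 bits
    return directory_index, table_index, offset


def translate_linear(linear: int, page_directory: dict) -> int:
    """Walk the two-level tables; a missing entry stands in for a page fault."""
    directory_index, table_index, offset = split_linear_address(linear)
    page_table = page_directory.get(directory_index)
    if page_table is None or table_index not in page_table:
        # A real operating system would now fetch the page from disk.
        raise LookupError("page fault")
    frame_number = page_table[table_index]
    return frame_number * PAGE_SIZE + offset


if __name__ == "__main__":
    directory = {1: {3: 0x80}}  # one mapped page, purely for demonstration
    print(split_linear_address(0x00403004))              # (1, 3, 4)
    print(hex(translate_linear(0x00403004, directory)))  # 0x80004
```

The page fault is what makes virtual memory work: pages that are not in the tables can live on disk until a program touches them.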

The 386 was a radical improvement over previous generations and eventually it allowed much more advanced operating systems to be run on desktop machines. By 1991 Linux offered reliable multi-tasking on desktop machines, shortly followed by the release of Windows NT. Eventually 32-bit extensions were added to MS-DOS (already a 16-bit extension to the 8-bit 8088 version, itself a hack on the original 4-bit Quick and Dirty Operating System), bringing (slightly less) reliable multi-tasking to the masses for the first time.

Until 1989 x86 processors had no direct support for floating point calculations. Intel produced a chip called the 387 (the companion to the 386) which could be added to motherboards to provide this support; otherwise the CPU had to process floating point calculations using several integer arithmetic instructions, making them comparatively slow. A better solution would be to have the floating point unit as part of the main chip, which is exactly what the 486 had. The main reason for adding the FPU to the processor chip at this time was that FPUs were now cheap enough to be bought by desktop consumers. The pipelining introduced in the 386 was extended in the 486 so that up to 5 instructions could be in progress at any one time, and a first-level on-chip memory cache was added, meaning many memory access instructions could perform at the same speed as the CPU - a significant speed increase given that over 90% of all data accessed is repeat data. The result was that for the first time Intel processors could perform (on average) one instruction per clock cycle.
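Both effects can be put into rough numbers. The sketch below uses the textbook idealisations: a pipeline with no stalls finishes n instructions in (stages + n - 1) cycles, and the average memory access time is the hit rate weighted between cache and main memory. The 10 cycle main memory latency is an assumed figure chosen only to make the example concrete; it is not a measured 486 value.

```python
def pipelined_cycles(instructions: int, stages: int) -> int:
    """Cycles for an ideal pipeline: fill it once, then one result per cycle."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


def unpipelined_cycles(instructions: int, stages: int) -> int:
    """Cycles when every instruction passes through all stages alone."""
    return instructions * stages


def average_access_cycles(hit_rate: float, cache_cycles: float,
                          memory_cycles: float) -> float:
    """Average cost of a memory access given a cache hit rate."""
    return hit_rate * cache_cycles + (1.0 - hit_rate) * memory_cycles


if __name__ == "__main__":
    print(unpipelined_cycles(100, 5))  # 500
    print(pipelined_cycles(100, 5))    # 104
    # Assumed latencies: 1 cycle for the cache, 10 cycles for main memory.
    print(round(average_access_cycles(0.90, 1, 10), 2))
```

With a 90% hit rate the average access costs roughly two cycles rather than ten under these assumptions, which is why even a small cache pays for itself.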

Intel also saw an increase in the laptop computer market and they released the 386 SL and 486 SL to fill the gap. These chips were the first to come with power management, essential for long battery life on notebooks. These features came as standard on future generations.

The single most significant change for the next generation of x86 processors, as far as consumers were concerned, was the name - Pentium. Using a name which can be trademarked (numbers cannot in the US) is a sign of Intel's recognition that its competition had begun to catch up. For programmers, however, the Pentium introduced something of a challenge - superscalar execution. Rather than a simple fetch-decode-execute cycle, a superscalar processor has additional steps: fetch two instructions, decode both instructions, execute the first instruction and, if the second instruction doesn't depend on the first, execute it at the same time. In theory the Pentium can do an average of 2 instructions per clock cycle, twice that of the 486. At the time it only gave about a 20% increase over the 486, but modern compilers can put instructions which don't depend on each other adjacently, giving a substantial increase in speed. Some of the limitations of the 386 were also beginning to show, so the external data bus was widened to 64 bits and internally much of the processor was modified to use 128 and even 256 bit data paths.
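The effect of instruction ordering on a two-way superscalar processor can be shown with a toy model. The sketch below issues two adjacent instructions in the same cycle only when the second neither reads nor overwrites the result of the first. This is a deliberate simplification: the real Pentium has further pairing rules that are not modelled here, and the Instruction type and count_cycles function are invented for the example.

```python
from typing import NamedTuple


class Instruction(NamedTuple):
    dest: str       # register written by the instruction
    sources: tuple  # registers read by the instruction


def depends_on(second: Instruction, first: Instruction) -> bool:
    """True if the second instruction must wait for the first."""
    return first.dest in second.sources or first.dest == second.dest


def count_cycles(program: list) -> int:
    """Count issue cycles for an in-order, two-wide toy processor."""
    cycles = 0
    position = 0
    while position < len(program):
        has_next = position + 1 < len(program)
        if has_next and not depends_on(program[position + 1], program[position]):
            position += 2  # both instructions issue together
        else:
            position += 1  # the second must wait for the next cycle
        cycles += 1
    return cycles


if __name__ == "__main__":
    a = Instruction("a", ("x", "y"))
    b = Instruction("b", ("a", "z"))  # needs a
    c = Instruction("c", ("p", "q"))
    d = Instruction("d", ("c", "r"))  # needs c
    print(count_cycles([a, b, c, d]))  # 3
    print(count_cycles([a, c, b, d]))  # 2
```

The two programs compute the same values, but the second ordering, which is the kind a scheduling compiler would produce, keeps both halves of the processor busy.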

Three years later Intel added around 50 extra instructions and a few extra registers to the Pentium, dedicated to speeding up the processing of multimedia data such as images, video and sound (also data compression), and marketed the result as MMX (which may or may not have stood for Multimedia Extensions). It was an acknowledgement of the abilities PCs were gaining, and the statistics of a 70% improvement in multimedia applications were impressive, but applications had to be compiled to take advantage of the new extensions. Since this would mean missing the majority of the market, which had not upgraded, few programs were released for it (most programs are still compiled for 386s).
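The idea behind these multimedia instructions is to treat one 64-bit register as several small values and operate on all of them at once. The sketch below imitates, in ordinary Python, a packed addition of eight unsigned bytes in which each result is clamped at 255 rather than wrapping around, which is the behaviour wanted when brightening pixels. It models the concept only; the function names are invented and nothing here executes real MMX instructions.

```python
LANES = 8  # eight 8-bit values fit in one 64-bit register


def unpack_bytes(value: int) -> list:
    """Split a 64-bit integer into eight bytes, lowest byte first."""
    return [(value >> (8 * lane)) & 0xFF for lane in range(LANES)]


def pack_bytes(lanes) -> int:
    """Join eight bytes, lowest first, back into one 64-bit integer."""
    result = 0
    for lane, byte in enumerate(lanes):
        result |= (byte & 0xFF) << (8 * lane)
    return result


def packed_add_saturating(a: int, b: int) -> int:
    """Add eight byte pairs in one step, clamping each sum at 255."""
    sums = (min(x + y, 255) for x, y in zip(unpack_bytes(a), unpack_bytes(b)))
    return pack_bytes(sums)


if __name__ == "__main__":
    pixels = pack_bytes([200, 10, 0, 255, 1, 2, 3, 4])
    brighten = pack_bytes([100, 20, 0, 1, 1, 1, 1, 1])
    print(unpack_bytes(packed_add_saturating(pixels, brighten)))
    # [255, 30, 0, 255, 2, 3, 4, 5]
```

On the real hardware the eight additions happen in a single instruction, where ordinary integer code would need a loop with a comparison for each byte.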

If extensions to the IA-32 architecture would be difficult to add then the surest way to improve the standard would be by improving the existing architecture, which is what Intel did with the P6 family, first launched in the Pentium Pro.

The P6 standard is a three way superscalar pipelined architecture. Besides the buzzword compliance, this means it can do the superscalar execution done by the original Pentium but with three instructions instead of two. This places a further burden on the programmer but the processor has techniques to help:

Data throughout the processor is continuously analysed for opportunities to perform out-of-order execution, making full use of the execution units even when cache misses occur.

The diagram below shows the P6 micro-architecture.

The Pentium 2 added the MMX extensions to the P6 architecture, which the Pentium Pro had missed out on due to Intel's forked development, as well as cache increases. It was released in 3 versions: the Pentium 2; the Celeron, with less second level cache (making it the slowest chip on the market, since the second level cache was an important part of the P6 architecture); and the Xeon, with 2 megabytes of second level cache (giving a substantial speed increase) and the ability to work as part of a two, four or eight processor computer. The Pentium 3 was released a couple of years later as a marketing upgrade - it's the same chip with some new MMX-style extensions, marketed as SSE. A Pentium 3 based Celeron was also released, this time with a respectable second level cache, making it the one to go for - same chip, lower price.

P6 is a good design: it's lasted Intel for over 5 years. But it has a few bottlenecks which mean it's come to the end of its life:

There are a few directions the consumer can go from the P6, and Intel's answer is the Pentium 4.

The Pentium 4 uses a new internal architecture marketed as NetBurst (don't worry, it doesn't mean anything). It has some novel features over P6: ALUs which run at twice the clock speed; a new system of cache called trace cache, which caches the output of the decoder rather than the original instructions; even more pipelining, apparently twice that of P6; and even more MMX/SSE extensions.

However, benchmarks for the Pentium 4 indicate that it is not as fast as competing systems and is often slower than P6 processors, and that any speed it has over its predecessors is due to a simple increase in clock speed rather than improved internal design - something which will keep the Pentium 4 an expensive and power hungry processor.

A number of problems evident in the Pentium 4 explain these results. The new trace cache mechanism isn't as good as it first sounds, as it only helps for instructions which are already decoded. Because of their reliance on trace cache Intel only used an 8 kilobyte level one cache, a return to the days of the 486. Even with trace cache there is little or no improvement over P6. Furthermore it has no level 3 cache - Intel's original specifications featured a level 3 cache of around 1 megabyte, which was left out of the final design.

The Pentium 4 also suffers from a bad choice of execution units. 5 of the 7 execution units handle integers, giving poor performance for floating point instructions and the much hyped MMX extensions, and bit shift operations are slow. Certain calculations are often optimised using bit shifts, but the new shift unit is much slower than the old barrel shifter. Finally, the fix for the partial register stall bug uses bit shift operations, making the fix slower than keeping the problem.
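As an example of the kind of optimisation that relies on fast shifts, a multiplication by a constant can be replaced by shifts and an addition, since shifting left by n bits multiplies by 2 to the power n. The sketch below shows the trick for multiplying by ten; the function name is invented, and Python is used only to demonstrate the arithmetic, not to measure speed.

```python
def times_ten(x: int) -> int:
    """Multiply by ten using shifts: 10x = 8x + 2x."""
    return (x << 3) + (x << 1)


if __name__ == "__main__":
    print(times_ten(7))  # 70
    # Check the identity over a range of values.
    print(all(times_ten(n) == n * 10 for n in range(1000)))  # True
```

On a processor with a slow shifter this rewriting can make the code slower rather than faster, which is the complaint made against the Pentium 4 above.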

It seems that, with the trouble of upping P6 and the delays to their 64-bit architecture, Intel rushed the Pentium 4 and the result is a poor processor. However there is now increasing choice in the x86 processor market.

After years of producing x86 clones that were later, slower and no cheaper than Intel's chips, AMD has gained the upper hand. The Athlon processor has successfully solved most of the bottlenecks of the P6 design and is proving to be much more scalable, with speeds in excess of 1GHz on the market.

It features

While AMD have been busy improving their processors, VIA have re-released the Cyrix brand with a completely new internal architecture. Exploiting the fact that 90% of the instructions executed come from only 10% of the total IA-32 instruction set, VIA has designed the chip to perform these instructions extremely fast and to implement the rest of the instruction set as micro-code, building each one from several of the optimised instructions. The result is a low power chip that performs much faster than rivals of a similar clock speed. Any chip which receives Alan Cox's approval is worth a second look.

Finally Transmeta have taken a radical new approach with their Crusoe chip. It uses a completely new and non IA-32 architecture designed to be very low power for portable computers. x86 compatibility is achieved by using a software emulator on the chip, similar to the JIT compilers used for Java bytecode. Compared to the power consumption of its rivals this is a very promising chip for notebooks.

In summary, while most consumers simply see the annual upping of clock speed, a great deal of research and design has been put into the long development of IA-32 chips. Intel has achieved an impressive amount of innovation in the 20 year history of the architecture. Due to Intel's recent problems there is increasing choice in the processors available, but the future is quite unpredictable: with a new 64 bit architecture from Intel and a 64 bit extension to IA-32 from AMD, it is far from clear which will be the lead to follow.

References

For exhaustive details of Intel's chips

IA-32 Intel Architecture Software Developer’s Manual Volume 1: Basic Architecture
http://developer.intel.com/design/pentium4/manuals/24547003.pdf
IA-32 Intel Architecture Software Developer’s Manual Volume 2: instruction reference
http://developer.intel.com/design/pentium4/manuals/24547103.pdf
IA-32 Intel Architecture Software Developer’s Manual Volume 3: system programming guide
http://developer.intel.com/design/pentium4/manuals/24547203.pdf
IA-32 Hardware Developer's Manual
http://developer.intel.com/design/pentiumII/manuals/24400101.pdf

And other IA-32 chips

AMD Athlon Manual
http://www.amd.com/products/cpg/athlon/pdf/architecture_wp.pdf
VIA Cyrix III datasheet
http://www.cyrix.com/products/cyr3.htm
Crusoe product brief
http://www.transmeta.com/crusoe/download/pdf/TM5600_ProductBrief_8-2-00.pdf

Other

The raw specifications and dates - History of the Intel microprocessor
http://www.i-probe.com/i-probe/ip_intel_8.html
Thorough criticism of the Pentium 4
http://www.emulators.com/pentium4.htm