May 27, 2009
By Laiq Chughtai, Altera Corp.
What is the purpose of this paper?
This paper gives potential users a readable overview of the capabilities of Altera's Stratix IV FPGAs. It describes the feature set, the architectural innovations, and the process techniques that together make the Stratix IV FPGA the industry leader in both power and performance. The paper describes the functionality of these devices in far more detail than the data sheet, but avoids the minute implementation details covered in the Stratix IV FPGA Device Handbook.
Designers considering Stratix IV FPGAs may face a hurdle or two. The data sheet provides a very condensed overview of the complete device family, but does not describe the capabilities in enough detail. By comparison, the Device Handbook provides all the details that the designer needs, but at more than 1,200 pages it will probably take several weeks to read and understand.
This paper describes the capabilities (what you can do) in detail, but leaves out the implementation details (how to utilize the capabilities). The idea is to give the designer enough information to evaluate the capabilities, without requiring weeks of study. Altera believes that system architects and designers who are in the early stages of FPGA device planning and evaluation will find this paper a valuable resource before beginning a Stratix IV design.
Stratix IV Product Overview
Gigabit Transceivers
Stratix IV GT devices provide up to 24 transceivers supporting 9.95 to 11.3 Gbps, with a physical coding sublayer (PCS). Up to 8 additional transceivers support 2.5 to 8.5 Gbps, with PCS, and up to 16 more support 2.5 to 6.5 Gbps, without PCS. The Stratix IV GT transceivers eliminate the need for an external 10G PHY device and thus enable the recommended configuration for implementing 802.3ba (40G/100G).
Stratix IV GX devices provide up to 32 full-duplex CDR-based transceivers with PCS, PMA and PCI Express Hard IP blocks (Figure 1). These transceivers support serial data rates between 600 Mbps and 8.5 Gbps. Additionally, up to 16 full-duplex CDR-based transceivers, supporting serial data rates between 600 Mbps and 6.5 Gbps are provided.
Figure 1. Stratix IV GX Transceiver Block Diagram
Stratix IV transceivers support the stringent jitter requirements of protocols such as PCI Express Gen2 and CEI-6G for Interlaken implementation. In addition, they support PCI Express Gen1; XAUI (3.125 Gbps to 3.75 Gbps for HiGig support); Gigabit Ethernet (GIGE, 1.25 Gbps); Serial RapidIO (up to 3.125 Gbps); SONET/SDH up to OC-96; and both HD and 3G Serial Digital Interface. The transceiver channels also support basic single-width (600 Mbps to 3.75 Gbps) and basic double-width (1 Gbps to 8.5 Gbps) flexible functional modes to implement proprietary protocols.
The Stratix IV GX transceivers are structured into full-duplex (transmitter and receiver) six-channel groups called transceiver blocks that vary in count from device to device. Channels can be dynamically reprogrammed to support multiple protocols and data rates without disturbing the operation of any other part of the FPGA. Each transceiver has dynamically programmable differential output voltage (VOD) and pre-emphasis settings for improved signal integrity. To compensate for frequency-dependent losses in the physical medium, each transceiver supports adaptive 4-stage receiver equalization with up to 17 dB of gain. In addition, selectable on-chip termination resistors help improve signal integrity on a variety of transmission media.
The programmable transceiver-to-FPGA interface supports data transfers in a wide variety of widths from 8 to 40 bits. Receiver rate-matching FIFO buffers resynchronize the received data with the local reference clock while phase compensation FIFO buffers perform clock domain translation between the transceiver block and the logic array.
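As a rough illustration of what the interface width means for fabric timing, the parallel clock on the FPGA side is approximately the serial data rate divided by the interface width. The sketch below assumes the deserialized bits pass through unchanged (no 8B/10B decoding, which would change the ratio); the function name and the rate/width pairs are illustrative, not taken from the handbook.

```python
def parallel_clock_mhz(serial_rate_gbps: float, width_bits: int) -> float:
    """Approximate FPGA-side clock (MHz) for a serial rate and interface width.

    Assumption: every serial bit appears on the parallel interface, so the
    parallel clock is simply serial_rate / width.
    """
    if width_bits <= 0:
        raise ValueError("width_bits must be positive")
    return serial_rate_gbps * 1000.0 / width_bits


if __name__ == "__main__":
    # Illustrative pairs only; valid combinations depend on the functional mode.
    for rate, width in [(3.125, 10), (6.5, 20), (8.5, 40)]:
        clock = parallel_clock_mhz(rate, width)
        print(f"{rate} Gbps over {width} bits: {clock:.1f} MHz")
```

The wider settings exist precisely so that the highest serial rates still map to a clock the logic array can comfortably meet.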
PCI Express Hard IP
Stratix IV GX devices support PCI Express Gen1 and Gen2 protocols in x1, x4 and x8 configurations. This support is enabled by up to 4 PCI Express hard IP blocks that embed all layers of the PCI Express protocol stack including the transceiver modules, physical layer, data link layer, and transaction layer (see Figure 2). Hard implementation enables fast compile times and high performance. It frees up resources within the FPGA fabric for user logic and eliminates the cost of soft IP.
Figure 2. PCI-Express Hard IP Block Diagram
Each PCI Express Hard IP block is compliant with both Rev. 1.1 and Rev. 2.0 specifications of the PCI-SIG and supports both endpoint and root port functionality with a user datapath width of 128 bits (x8, x4) and 64 bits (x8, x4, x2, x1). The transaction layer interface supports two virtual channels, single function, and vendor-defined message pass-through. Each hard IP block supports all PCI Express memory, I/O, configuration, and message transactions with 64 outstanding request message tags; configurable maximum payload size up to 2,048 bytes; a maximum read request size up to 4,096 bytes; retry buffer size of 16 Kbytes; and configurable receive buffer size of 16 Kbytes per virtual channel.
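To put the Gen1/Gen2 and x1/x4/x8 options in perspective, the sketch below computes raw per-direction link bandwidth. It relies on the standard PCI Express figures of 2.5 GT/s (Gen1) and 5.0 GT/s (Gen2) per lane with 8b/10b line coding; packet and protocol overhead are ignored, so achievable throughput is lower. The function name is illustrative.

```python
# Per-lane line rates in megatransfers per second (PCI Express Gen1 / Gen2).
LINE_RATE_MTPS = {1: 2500, 2: 5000}


def raw_bandwidth_mbytes(gen: int, lanes: int) -> int:
    """Raw per-direction bandwidth in MB/s after 8b/10b coding.

    Ignores TLP/DLLP overhead, so real payload throughput is lower.
    """
    if gen not in LINE_RATE_MTPS:
        raise ValueError("only Gen1 and Gen2 are modeled")
    data_mbits = LINE_RATE_MTPS[gen] * 8 // 10  # 8 data bits per 10 line bits
    return data_mbits * lanes // 8              # bits to bytes


if __name__ == "__main__":
    for gen in (1, 2):
        for lanes in (1, 4, 8):
            print(f"Gen{gen} x{lanes}: {raw_bandwidth_mbytes(gen, lanes)} MB/s")
```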
Each block supports configuration space registers included with the transaction layer and serial read/write access for reconfiguration of initial core parameters. The non-intrusive local management interface provides access to configuration space in endpoint mode. Up to 32 message-signaled interrupts and 2048 MSI-X are configurable. Other configurability options include completion timeout control and capabilities registers and up to 6 base address registers plus expansion ROM.
To assist in system debug, each block includes a synchronous status and debug interface that provides access to critical test signals. Error reporting features include ECRC generation, and reporting and handling of surprise down errors, receiver overflow errors, completer abort errors and flow control protocol errors. Power management features include all power states (emulate D1, D2, and L2), software-initiated link power management, legacy PCI power management support, native active state power management support and block level power down when not in use.
Differential and Single Ended IO
Stratix IV FPGAs support up to 132 full-duplex, DC-coupled LVDS channels on the side I/O banks, each performing at up to 1.6 Gbps. 288 additional pseudo-LVDS channels are provided on top and bottom I/O banks. Stratix IV FPGA LVDS channels support interface standards such as SPI-4.2, SFI-4, SGMII, Utopia IV, 10 GbE XSBI, the RapidIO standard, and SerialLite II. The Stratix IV FPGA LVDS features include a hard DPA block with serializer/deserializer (SERDES) and clock-forwarding capability for soft-CDR; programmable pre-emphasis and differential output voltage (VOD); and differential on-chip termination (OCT).
Stratix IV devices support up to 1104 single-ended user I/Os with key features such as programmable slew rate and drive strength. Variable delay chains on inputs and outputs compensate for board trace mismatch, while each I/O supports both series and parallel dynamic OCT.
Stratix IV FPGAs include signal integrity I/O features like an 8:1:1 user I/O to power/ground ratio, signal return path optimization, staggered output delay control, and on-die/on-package de-coupling capacitance.
DDR Memory Interface
All banks on each Stratix IV FPGA support double data rate (DDR) memory interfaces. The top and bottom I/O banks support data rates up to 1067 Mbps while the side I/O banks support data rates up to 667 Mbps. DDR support is enabled by up to 31 hard I/O registers behind each DQ pin; up to 4 delay-locked loop (DLL) circuits that dynamically control the clock delay needed by the DQS/CQ and CQn pins and compensate for PVT variations; read and write leveling circuitry to resynchronize CK and DQS signal timing for DDR3 interfaces; and dynamic OCT.
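The per-pin rates above translate into peak interface bandwidth once a data-bus width is chosen. The sketch below assumes a 64-bit data bus, which is an illustrative choice rather than a figure from this paper, and it reports the theoretical peak with no allowance for refresh, turnaround, or controller efficiency.

```python
def peak_bandwidth_gbytes(rate_mbps_per_pin: float, data_pins: int) -> float:
    """Theoretical peak bandwidth in GB/s for a DDR interface.

    rate_mbps_per_pin: data rate per DQ pin (for example 1067 or 667).
    data_pins: number of DQ pins (64 here is an assumption for illustration).
    """
    return rate_mbps_per_pin * data_pins / 8 / 1000


if __name__ == "__main__":
    for bank, rate in [("top/bottom", 1067), ("side", 667)]:
        print(f"{bank} banks, 64-bit bus: {peak_bandwidth_gbytes(rate, 64):.3f} GB/s")
```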
The ALTMEMPHY megafunction, a part of the Quartus II design software, creates the datapath between the memory device and the memory controller and user logic in the Stratix IV FPGA. The GUI helps the user configure multiple variations of a memory datapath and interface including DDR3, DDR2, DDR SDRAM, and QDRII+/QDRII SRAM interfaces.
Power Management
Stratix IV FPGAs were designed with specific process, architectural, and system design features to address power concerns at the 40-nm process node.
On the process side, Stratix IV FPGAs benefit from multi-threshold, variable gate-length transistors to optimize power consumption against transistor function. Low-k inter-metal dielectric reduces cross talk, while three different gate oxide thicknesses ensure leakage reduction in transistors whose performance is not critical. In addition, strained silicon helps improve channel mobility to enhance performance.
Considering the device architecture, Altera's Programmable Power Technology makes substrate bias voltage of transistors within each logic block programmable. This means that the substrate bias voltage can be adjusted to reduce power or increase performance. Without requiring specific designer input, Quartus II design software automatically sets individual logic blocks in high-performance state if they are in a timing critical path, while leaving all other logic blocks in low power state. This translates to higher performance without the power consumption penalty.
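The selection rule described above can be sketched as a toy model. The LogicBlock type, the slack values, and the zero-slack threshold are all hypothetical; the actual Quartus II algorithm is not described in this paper.

```python
from dataclasses import dataclass


@dataclass
class LogicBlock:
    name: str
    slack_ns: float  # hypothetical timing slack when run in low-power mode


def assign_power_modes(blocks, slack_threshold_ns=0.0):
    """Put blocks that would miss timing into high-performance mode.

    Every other block stays in low-power mode, which is the default.
    """
    return {
        block.name: "high_performance"
        if block.slack_ns < slack_threshold_ns
        else "low_power"
        for block in blocks
    }


if __name__ == "__main__":
    design = [LogicBlock("lab_a", -0.2), LogicBlock("lab_b", 1.5)]
    print(assign_power_modes(design))
```

Because only the blocks on critical paths pay the higher-leakage cost, the bulk of the design can remain in the low-power state.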
On the interface side, Stratix IV FPGAs support DDR3 at 1.5 V, which reduces memory interface power consumption compared with the older DDR2 memory interface. Additionally, series and parallel OCT are dynamically turned on and off during data transfers to further reduce interface power.
Altera's PowerPlay Power Analysis technology enables power estimation at each stage of the design flow. The Early Power Estimator uses early estimates of device resource usage, clock frequencies and toggle rates as input by the designer to provide a gross estimate of the power requirements of a design. This estimate is progressively refined as a design is compiled and simulated. By using simulation inputs, the Quartus II software reports a detailed power estimate using actual fitter results, timing constraints, and chip interface settings (Figure 3).
Figure 3. Quartus II Power Optimizing Compilation Flow
Project level settings direct the Quartus II design software to compile for the lowest power, the highest performance or the smallest area. To minimize power consumption, Quartus II automatically performs various power optimizations including powering down unused clock nets as well as memory, DSP and LAB blocks.
Logic Fabric
All variants of the Stratix IV device family use the same logic fabric. The basic unit of logic is the Adaptive Logic Module (ALM). To optimally utilize silicon area, the ALM offers enhanced flexibility compared with the traditional 4-input look-up table. In addition to the look-up tables (LUTs), each ALM contains two programmable registers with data, clock, enable, synchronous and asynchronous clear inputs, two dedicated full adders, a carry chain, a shared arithmetic chain and a register chain as shown in Figure 4. Ten ALM blocks stacked vertically form a Logic Array Block (LAB). Each ALM can drive any of the available types of local and global interconnects.
Figure 4. Stratix IV ALM Block Diagram
Depending on the area-performance settings selected by the designer, the Quartus II design software can engage an ALM in either Normal mode, Extended LUT Mode, Arithmetic Mode, Shared Arithmetic Mode or LUT-Register Mode (see Figure 5). In Normal mode the ALM can implement either a single 6-input function or two functions with varying numbers of shared inputs between them. The 7-input Extended LUT mode efficiently implements "if-else" code structures. It consists of two 5-input functions with four shared inputs and a select input that propagates one of these function outputs forward.
Figure 5. Summary of ALM Configurations
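The Extended LUT mode described above can be modeled in a few lines. This is a behavioral sketch only: the truth-table encoding (one output bit per input combination, packed into an integer mask) and the function names are assumptions made for illustration, not the device's configuration format.

```python
def lut(mask: int, inputs: tuple) -> int:
    """Evaluate a look-up table; 'mask' holds one output bit per input combination."""
    index = 0
    for position, bit in enumerate(inputs):
        index |= (bit & 1) << position
    return (mask >> index) & 1


def extended_lut(mask0, mask1, shared, e0, e1, select):
    """7-input function: two 5-input LUTs sharing four inputs, plus a select.

    Inputs: four shared (a, b, c, d), one private input per LUT (e0, e1),
    and the select, for seven in total. This mirrors an if-else structure.
    """
    a, b, c, d = shared
    out0 = lut(mask0, (a, b, c, d, e0))
    out1 = lut(mask1, (a, b, c, d, e1))
    return out1 if select else out0


AND5 = 1 << 31         # output is 1 only when all five inputs are 1
OR5 = (1 << 32) - 2    # output is 1 unless all five inputs are 0

if __name__ == "__main__":
    print(extended_lut(AND5, OR5, (1, 1, 1, 1), 1, 0, select=0))
    print(extended_lut(AND5, OR5, (0, 0, 0, 0), 0, 1, select=1))
```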
The Arithmetic Mode uses the dedicated full adders in combination with the look-up tables to efficiently implement adders, counters, accumulators, wide parity functions, and comparators. The dedicated adders allow the LUTs to be available to perform pre-adder logic; therefore, each adder can add the output of 2 four-input functions with carry-in. The carry chain provides a fast carry function between the dedicated adders.
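A behavioral sketch of one arithmetic-mode bit slice follows: two 4-input LUTs feed a full adder together with a carry-in, and slices are chained through the carry. The mask encoding and the pass-through LUT constant are illustrative assumptions, not the device's actual configuration bits.

```python
def lut4(mask: int, inputs: tuple) -> int:
    """Evaluate a 4-input LUT; 'mask' holds 16 output bits, one per input combination."""
    index = 0
    for position, bit in enumerate(inputs):
        index |= (bit & 1) << position
    return (mask >> index) & 1


def arithmetic_bit(mask_x, mask_y, inputs_x, inputs_y, carry_in):
    """One adder slice: add the outputs of two 4-input functions plus carry-in."""
    total = lut4(mask_x, inputs_x) + lut4(mask_y, inputs_y) + carry_in
    return total & 1, total >> 1  # (sum bit, carry out)


PASS_INPUT0 = 0xAAAA  # LUT that simply forwards its first input


def ripple_add(a: int, b: int, width: int):
    """Chain slices through the carry to add two unsigned numbers."""
    result, carry = 0, 0
    for i in range(width):
        bit_a, bit_b = (a >> i) & 1, (b >> i) & 1
        s, carry = arithmetic_bit(PASS_INPUT0, PASS_INPUT0,
                                  (bit_a, 0, 0, 0), (bit_b, 0, 0, 0), carry)
        result |= s << i
    return result, carry


if __name__ == "__main__":
    print(ripple_add(9, 5, 4))
```

Replacing the pass-through masks with other functions models the pre-adder logic that the LUTs remain free to perform.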
In Shared Arithmetic Mode, the ALM can implement three-input addition. The ALM is configured with four 4-input LUTs, each of which computes either the sum or the carry of three inputs. The output of the carry computation is fed to the next adder using a dedicated shared arithmetic chain. This shared arithmetic chain reduces the number of summation stages required to implement an adder tree, thus improving performance.
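The three-input addition described above is a form of carry-save compression: per bit, one LUT produces the sum of three inputs (XOR) and another produces their carry (majority), and the carry is handed to the next bit position. The sketch below is a behavioral model under that reading; the function name is illustrative.

```python
def add3(a: int, b: int, c: int, width: int) -> int:
    """Add three unsigned numbers using per-bit sum and carry functions.

    Sum LUT:   x ^ y ^ z             (sum of three inputs)
    Carry LUT: majority(x, y, z)     (carry of three inputs)
    The carry of bit i is shifted to bit i + 1, modeling the shared
    arithmetic chain, and a final adder combines the two vectors.
    """
    sum_bits = 0
    carry_bits = 0
    for i in range(width):
        x, y, z = (a >> i) & 1, (b >> i) & 1, (c >> i) & 1
        sum_bits |= (x ^ y ^ z) << i
        carry_bits |= ((x & y) | (x & z) | (y & z)) << (i + 1)
    return sum_bits + carry_bits


if __name__ == "__main__":
    print(add3(5, 6, 7, 4))
```

Compressing three operands into two in a single level is what lets an adder tree use fewer summation stages.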
In LUT-Register Mode, two internal feedback loops stitch together the LUT resources to implement a master-slave latch that forms the core of a third register within the ALM. This LUT register shares its clock, clock enable, and asynchronous clear sources with the top dedicated register.
Clock Distribution Network
To enable effective implementation of designs for diverse applications, Stratix IV devices can support up to 104 distinct clock domains, each capable of supporting clock rates of 600 MHz. These clock domains are enabled by a hierarchical clock distribution structure consisting of 16 dedicated global clock networks (GCLKs), up to 88 regional clock networks (RCLKs), and 132 periphery clock networks (PCLKs). These clock networks can be driven by up to 71 unique clock sources per device quadrant. Stratix IV devices have up to 32 dedicated single-ended clock pins or 16 dedicated differential clock pins distributed evenly on all sides that can each drive 4 GCLK or RCLK networks. GCLK and RCLK networks can also be driven by PLL outputs and internal logic. Clock sources for PCLK networks include clock outputs from the DPA block, PLD-Transceiver interface clocks, horizontal I/O pins, and internal logic.
Each GCLK and RCLK has its own clock control block that supports static clock source selection for RCLK networks, glitch-free dynamic source selection for GCLK networks, global clock multiplexing, and clock power down including dynamic clock enable or disable.
Phase-Locked Loops
Each Stratix IV device includes up to 12 PLLs. The VCO at the heart of each PLL operates from 600 MHz to 1300 MHz. The reference clock input to each PLL comes from either 4 dedicated clock input pins or other PLLs, using either the GCLK and RCLK networks or dedicated connections between adjacent PLLs. Stratix IV PLLs can track spread-spectrum frequency variation in input clocks as long as the clocks comply with the input clock jitter specification.
Each PLL has either 7 or 10 output counters that can drive up to 4 GCLK and 20 RCLK networks and up to 6 single-ended output pins, two of which can be configured as a differential pair. In addition, the side PLLs drive the Digital Phase Alignment circuitry to support DDR memory interfaces. Each PLL supports programmable duty cycle and can achieve a phase shift resolution down to 96.125 ps. The PLL can drive clock frequencies of up to 717 MHz on to an internal clock network or an external clock output.
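The VCO range and output limit quoted above constrain which counter settings are usable. The sketch below uses the generic integer-PLL relationships f_vco = f_ref * M / N and f_out = f_vco / C, and searches for settings that keep the VCO between 600 and 1300 MHz. The counter names M, N, C and their search ranges are assumptions for illustration; real devices impose further limits that this paper does not give.

```python
from fractions import Fraction

VCO_MIN_MHZ, VCO_MAX_MHZ = 600, 1300  # VCO range quoted in the text
FOUT_MAX_MHZ = 717                    # maximum PLL output quoted in the text


def find_pll_settings(fref_mhz, fout_mhz, max_m=64, max_n=8, max_c=64):
    """Return (M, N, C) giving fout exactly, or None if no setting fits.

    The counter limits (max_m, max_n, max_c) are hypothetical search bounds.
    """
    fref, fout = Fraction(fref_mhz), Fraction(fout_mhz)
    if fout > FOUT_MAX_MHZ:
        return None
    for n in range(1, max_n + 1):
        for m in range(1, max_m + 1):
            fvco = fref * m / n
            if not (VCO_MIN_MHZ <= fvco <= VCO_MAX_MHZ):
                continue
            c = fvco / fout
            if c.denominator == 1 and 1 <= c <= max_c:
                return m, n, int(c)
    return None


if __name__ == "__main__":
    print(find_pll_settings(100, 250))
```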
Tri-Matrix Memory
Stratix IV devices support three types of embedded memory blocks called MLABs, M9Ks and M144Ks. With different sizes and densities, each block type is suitable for a different application role as shown in Figure 6. The Quartus II design software infers the appropriate memory block to meet a user's size and functionality requirements. Each memory block type is capable of performing at 600 MHz.
Figure 6. Tri-Matrix Memory Hierarchy
With 640 bits per block, the MLAB is the smallest and most pervasive memory type. Half of the Logic Array Blocks (LABs) in the device can be used to implement MLAB memory blocks. At 20 bits wide and 32 deep, the MLAB is suitable for implementing small shift registers, FIFO buffers, and filter delay lines.
Each M9K is a discrete memory block with a maximum data width of 36 bits and a depth of 256 addresses. The block data width and depth is configurable and Quartus II design software can instantiate multiple blocks to implement wider and/or deeper memories.
The M144K is the largest discrete embedded memory block type. It supports various width and depth configurations up to 72 wide x 2048 deep. With its size, the M144K is suitable for applications such as processor code storage, packet buffers, and video frame buffers. It includes single-error-correct, double-error-detect (SECDED) circuitry to detect and correct soft errors.
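The block geometries quoted above can be checked and used for a first-pass sizing estimate. The sketch below assumes each block is used only in the single width-by-depth configuration given in the text; since the blocks support other aspect ratios, the counts are a coarse estimate rather than what Quartus II would actually infer.

```python
# (width in bits, depth in words) as quoted in the text.
BLOCKS = {"MLAB": (20, 32), "M9K": (36, 256), "M144K": (72, 2048)}


def block_bits(name: str) -> int:
    """Total bits in one block at the quoted geometry."""
    width, depth = BLOCKS[name]
    return width * depth


def blocks_needed(name: str, width: int, depth: int) -> int:
    """Blocks required to tile a width x depth memory with one block type."""
    block_width, block_depth = BLOCKS[name]
    across = -(-width // block_width)  # ceiling division
    down = -(-depth // block_depth)
    return across * down


if __name__ == "__main__":
    for name in BLOCKS:
        print(f"{name}: {block_bits(name)} bits per block, "
              f"{blocks_needed(name, 64, 1024)} blocks for a 64 x 1024 memory")
```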
The M9K and M144K are true dual-port memory blocks that support simultaneous read and write from both ports to the same address. In simultaneous operation, the read can be configured to provide old data or new data as long as both ports use the same clock. Other features include pre-initialization/ROM mode, mixed clocking, byte enables, and address clock enables.
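The old-data versus new-data choice for a simultaneous read and write to the same address can be illustrated with a small behavioral model. The class and method names are hypothetical, and the model covers only the single-clock case mentioned above.

```python
class DualPortRam:
    """Single-clock model: one port writes while the other reads."""

    def __init__(self, depth: int, read_during_write: str = "old"):
        if read_during_write not in ("old", "new"):
            raise ValueError("read_during_write must be 'old' or 'new'")
        self.mem = [0] * depth
        self.mode = read_during_write

    def clock(self, write_addr: int, write_data: int, read_addr: int) -> int:
        """Apply one clock edge and return the data seen by the read port."""
        old_value = self.mem[read_addr]
        self.mem[write_addr] = write_data
        if read_addr == write_addr and self.mode == "new":
            return write_data  # read port sees the value being written
        return old_value       # read port sees the pre-write contents


if __name__ == "__main__":
    for mode in ("old", "new"):
        ram = DualPortRam(16, read_during_write=mode)
        print(mode, hex(ram.clock(write_addr=3, write_data=0xAA, read_addr=3)))
```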
DSP Blocks
To efficiently implement the digital signal processing requirements of such complex systems as WiMAX, 3GPP WCDMA, high-performance computing (HPC), voice over Internet protocol (VoIP), H.264 video compression, medical imaging, and HDTV, Stratix IV devices feature programmable digital signal processing (DSP) blocks. Each block provides eight 18 x 18 multipliers, registers, adders, subtractors, accumulators, and a summation unit, functions that are frequently required in typical DSP algorithms (see Figure 7). The total number of 18 x 18 multipliers ranges from 384 in the smallest GX device to 1360 in the largest E device.
Figure 7. DSP Block Diagram
Each 18 x 18 multiplier can also support word lengths of 9 and 12 bits. Two 18 x 18 multipliers can be combined to support 36-bit word length. Each DSP block supports both single precision (24-bit) and double precision (53-bit) floating-point arithmetic formats.
Each DSP block supports completely variable bit-widths and various rounding and saturation modes to meet the requirements of various applications such as filtering, transformation, modulation, compression, scaling, and equalization. The Stratix IV DSP blocks are rated to operate at 550 MHz.
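As an illustration of what rounding and saturation do to a multiplier or accumulator result, the sketch below applies round-to-nearest (ties rounded up) when dropping low-order bits and then clamps to a signed output width. The specific rounding and saturation modes of the Stratix IV DSP block are not listed in this paper, so this shows one common combination rather than the device's exact behavior.

```python
def round_and_saturate(value: int, drop_bits: int, out_bits: int) -> int:
    """Drop 'drop_bits' LSBs with round-half-up, then clamp to signed 'out_bits'."""
    if drop_bits > 0:
        # Adding half an LSB before the arithmetic shift rounds to nearest.
        value = (value + (1 << (drop_bits - 1))) >> drop_bits
    highest = (1 << (out_bits - 1)) - 1
    lowest = -(1 << (out_bits - 1))
    return max(lowest, min(highest, value))


if __name__ == "__main__":
    print(round_and_saturate(100, 3, 8))      # scaled and rounded
    print(round_and_saturate(80000, 0, 16))   # clamped to the 16-bit maximum
```

Saturation avoids the wrap-around that would otherwise turn a large positive overflow into a negative value, which matters in filters and accumulators.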
Altera's DSP Builder technology allows users to take system definition and simulation information from The MathWorks Simulink tools and generate timing-optimized register transfer level (RTL) code that can be synthesized by the Quartus II design software. The DSP Builder Signal Compiler reads Simulink Model Files (.mdl) that are built using DSP Builder and MegaCore blocks and generates VHDL files and tool command language (Tcl) scripts for synthesis, hardware implementation, and simulation.
Configuration
Stratix IV devices use SRAM cells to store configuration data. The volatile SRAM memory must be configured each time the device powers up. Stratix IV devices can be configured using fast passive parallel (FPP), fast active serial (AS), passive serial (PS), or Joint Test Action Group (JTAG) configuration modes. All configuration schemes use an external controller (for example, a MAX II device or microprocessor), a configuration device, or a download cable. Stratix IV configuration bit streams are compressed and can optionally be encrypted using the AES algorithm with a 256-bit security key. The bit stream is decompressed and, if necessary, decrypted within the device during device configuration. Stratix IV devices support both volatile and non-volatile storage for the encryption security key.
To detect soft errors in device configuration due to single event upset (SEU), dedicated circuitry is built into Stratix IV devices that continuously and automatically performs cyclic redundancy check (CRC) error detection.
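The principle behind CRC-based upset detection can be shown in software. The sketch below uses the CRC-32 from Python's zlib module as a stand-in; the polynomial and frame structure used by the Stratix IV circuitry are not given in this paper, and the image bytes are made up for the demonstration.

```python
import zlib


def crc_of(image: bytes) -> int:
    """CRC-32 of a configuration image (stand-in for the device's own CRC)."""
    return zlib.crc32(image) & 0xFFFFFFFF


def flip_bit(image: bytes, bit_index: int) -> bytes:
    """Return a copy of the image with one bit inverted, modeling an SEU."""
    data = bytearray(image)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)


def has_upset(image: bytes, expected_crc: int) -> bool:
    """True when the recomputed CRC no longer matches the stored value."""
    return crc_of(image) != expected_crc


if __name__ == "__main__":
    golden = bytes(range(256))          # made-up configuration data
    stored_crc = crc_of(golden)
    print(has_upset(golden, stored_crc))
    print(has_upset(flip_bit(golden, 1000), stored_crc))
```

In the device this comparison runs continuously in dedicated hardware, so no user logic or processor time is consumed.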
To facilitate in-field updates to the device configuration, the Stratix IV devices include dedicated circuitry to support remote configuration updates. Soft logic (either the Nios II embedded processor or user logic) implemented in a Stratix IV device can download a new configuration image from a remote location, store it in configuration memory, and direct the dedicated remote system upgrade circuitry to initiate a reconfiguration cycle. The dedicated circuitry performs error detection during and after the configuration process, recovers from any error condition by reverting back to a safe configuration image, and provides error status information.
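The update-with-fallback behavior described above can be summarized as a small decision routine. This is a conceptual sketch only: the function signature, the validity callback, and the status strings are hypothetical and do not correspond to the registers or interfaces of the remote system upgrade circuitry.

```python
def run_remote_update(new_image, image_is_valid, safe_image):
    """Return (active_image, status) after an update attempt.

    new_image: downloaded image, or None if the download failed.
    image_is_valid: callable that models error detection during and
        after configuration.
    safe_image: known-good image to revert to on any error.
    """
    if new_image is None:
        return safe_image, "download_failed"
    if not image_is_valid(new_image):
        return safe_image, "configuration_error"
    return new_image, "ok"


if __name__ == "__main__":
    def is_valid(image):
        return image.startswith(b"CFG")  # made-up validity rule

    print(run_remote_update(b"CFG-v2", is_valid, b"CFG-factory"))
    print(run_remote_update(b"corrupt", is_valid, b"CFG-factory"))
```

The key property is that every failure path ends in the safe image, so a bad download cannot leave a remote system unconfigured.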
About Altera Stratix IV FPGAs
The Stratix IV family of FPGAs represents the fourth generation in the Stratix series. It is built upon Taiwan Semiconductor Manufacturing Company's (TSMC's) 40-nm process technology. This process utilizes 193-nm immersion lithography, extreme low-k dielectrics, variable channel lengths and oxide thicknesses, and strained silicon to enhance device performance and power efficiency.
The high-density, high-performance adaptive logic module (ALM) logic structure provides the most efficient logic fabric in any FPGA. The ALM logic structure is fully integrated in Quartus II design software to easily deliver the highest performance, highest logic utilization, and lowest compile times, as demonstrated by Stratix IV FPGAs on OpenCore designs.
The Stratix IV family includes three device variants:
- Stratix IV GT FPGAs with transceivers: Up to 530K logic elements (LEs) and 48 full-duplex CDR-based transceivers at up to 11.3 Gbps
- Stratix IV GX FPGAs with transceivers: Up to 530K LEs and 48 full-duplex CDR-based transceivers at up to 8.5 Gbps
- Stratix IV E (enhanced devices) FPGAs: Up to 680K LEs, 22.4-Mbit RAM, and 1,360 18 x 18-bit multipliers
Stratix IV GT FPGAs are available in 1517- and 1932-pin flip-chip packages. The Stratix IV GX devices are available in flip-chip packages with pin counts ranging from 780 to 1932, while the Stratix IV E devices are available in flip-chip packages with pin counts ranging from 780 to 1760. Stratix IV FPGAs offer vertical migration within each family variant, providing flexibility in device selection. In addition, a vertical migration path exists between Stratix III and Stratix IV E devices.
The Stratix IV device family offers more than twice the resources of prior generations in almost all feature categories. It offers a highly compelling mix of features, performance, and power to meet the demanding needs of diverse applications. Production devices are available and shipping. They may be purchased online or through Altera distributors. Visit http://www.altera.com/products/devices/stratix-fpgas/stratix-iv/stxiv-index.jsp for complete device information.
Stratix IV FPGAs have been recognized worldwide for their technical innovation:
- EN-Genius Network selected the 40-nm Stratix IV FPGA and HardCopy IV ASIC families as the Best High-End FPGA Family for its annual "Product of the Year" award.
- Electronic Products China magazine selected Altera's 40-nm Stratix IV FPGAs for its "Product of the Year" award.
- EDN Innovation Awards: Altera's Stratix IV FPGA won the "Innovation of the Year" award, and the Stratix IV FPGA 40-nm design team took home "Innovator of the Year" honors.
- China Electronics News selected Stratix IV FPGAs as its "2008 Editor's Choice" award winner in the FPGA category. Winners of this award demonstrated a significant leap in innovation.
- EDN named Stratix IV FPGAs to its annual list of "Hot 100 Electronic Products". This list encompasses the 100 most significant products of 2008, as determined by the magazine's editors and readers.
- Stratix IV FPGAs received EDN China's "Leading Products Award" in the digital IC and programmable logic category. Winners of this award were chosen by a panel of technical experts and professors that selects products having the greatest impact on the electronics industry.
- Stratix IV FPGAs received Electronic Products 2008 "Product of the Year" award in the digital ICs category. Award winners are selected by the magazine's editors on the basis of innovative design, significant advancement in technology or application and substantial achievement in price and performance.
About the Author
Laiq Chughtai
Supervisor of product engineering, Altera Corp.
Laiq Chughtai manages a team of engineers responsible for bringing Altera's leading FPGA and custom logic device families to market. He joined Altera in 1999 and has participated in the development and introduction of each FPGA product generation. His expertise covers embedded memories, semiconductor test, characterization, yield enhancement and tools development. Mr. Chughtai is currently pursuing an MBA at the Haas School of Business at UC Berkeley. He has a BS in Computer Engineering from the University of Wisconsin in Madison.