Friday, July 31, 2009

Xilinx Virtex-6 FPGA User Guide Lite

 
July 22, 2009 (1:44 PM EDT)
 
 
By Peter Alfke, Xilinx Inc.
 
Editor's Note: As with similar "lite" user guides published by Programmable Logic DesignLine previously, this guide is intended to bridge the gap between a datasheet and a full, 1,000+ page user guide.

What is the purpose of this paper?
This paper gives potential users an easy-to-grasp idea of the device functions of Xilinx Virtex-6 FPGAs. It describes the functionality of these devices in far more detail than in the data sheet—but avoids the minute implementation details covered in the various Virtex-6 FPGA user guides.

In traditional product documentation, a data sheet provides concentrated information about the whole family, without describing the capabilities in great detail. On the other hand, user guides give all the details that the designer needs, but — at more than a thousand pages — they may require weeks of work to read and understand all the details.

This paper describes the capabilities (what you can do) in detail but leaves out the implementation details (how to utilize the capabilities). The idea is to give the designer enough information to evaluate the capabilities, without requiring weeks of study. This paper should create significant enthusiasm in many designers, who before did not have the patience or the motivation to study entire user guides.

General

Virtex-6 FPGA Data Sheet: DC and Switching Characteristics

Virtex-6 FPGAs build on the success of the Virtex-5 family. The more advanced 40 nm process makes it technically and economically possible to more than double the logic capacity of the largest family member (760,000 logic cells, 948,000 flip-flops, and 38 Mb of block RAM, compared to 360,000 logic cells, 207,000 flip-flops, and 18 Mb of block RAM in the largest Virtex-5 FPGA).

Advanced processing, innovative architecture and circuit design, and a lower supply voltage reduce static and dynamic power consumption by over 30%, a surprising feat that is highly appreciated by the user.

Higher performance is also the combined result of better processing, architecture, and tools. Advanced 40 nm processing offers transistors with three different oxide thicknesses and multiple threshold voltages as well as low-K dielectric between interconnect lines. Architectural improvements center around enhanced LUTs, with more flip-flops and better routing, better clock generation, low-skew clock distribution, faster I/O, and significantly faster 6.5 Gb/s transceivers. Many dedicated system-level blocks offer ASIC-like performance, size, and low power, while they are tightly integrated in a versatile FPGA structure. Finally, improved tools offer much faster synthesis, and place and route of the user's design.

CLBs, Slices, and LUTs

Virtex-6 FPGA Configurable Logic Block User Guide

The look-up tables (LUTs) in Virtex-6 FPGAs can be configured as either one 6-input LUT (64-bit ROM) with one output, or as two 5-input LUTs (32-bit ROMs) with separate outputs but common addresses or logic inputs. Each LUT output can optionally be registered in a flip-flop. Four such LUTs and their eight flip-flops as well as multiplexers and arithmetic carry logic form a slice, and two slices form a configurable logic block (CLB). Four flip-flops per slice (one per LUT) can optionally be configured as latches. In that case, the remaining four flip-flops in that slice must remain unused.

Between 25% and 50% of all slices can also use their LUTs as distributed 64-bit RAM or as 32-bit shift registers (SRL32) or as two SRL16s. Modern synthesis tools take advantage of these highly efficient logic, arithmetic, and memory features. Expert designers can also instantiate them.

Clock Management

Virtex-6 FPGA Clocking Resources User Guide

Each Virtex-6 FPGA has up to nine clock management tiles (CMTs), each consisting of two mixed-mode clock managers (MMCMs), which are PLL based.

Phase-Locked Loop
The MMCM can serve as a frequency synthesizer for a wider range of frequencies and as a jitter filter for incoming clocks. The heart of the MMCM is a voltage-controlled oscillator (VCO) with a frequency from ~400 MHz up to 1600 MHz, spanning more than one octave. There are three sets of programmable frequency dividers (D, M, and O).

The pre-divider D (programmable by configuration) reduces the input frequency and feeds one input of the traditional PLL phase/frequency comparator. The feedback divider M (programmable by configuration) acts as a multiplier because it divides the VCO output frequency before feeding the other input of the phase comparator. D and M must be chosen appropriately to keep the VCO within its specified frequency range.

The VCO has eight equally-spaced output phases (0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°). Each can be selected to drive one of the seven output dividers, O0 to O6 (each programmable by configuration to divide by any integer from 1 to 128).
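
The divider arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not a model of the actual tool flow: the function name is hypothetical, the legal ranges of D and M are not given here and are therefore not checked, and the VCO limits are the approximate 400 MHz and 1600 MHz figures quoted in the text.

```python
from fractions import Fraction

VCO_MIN_MHZ = 400    # approximate lower VCO limit quoted in the text
VCO_MAX_MHZ = 1600   # upper VCO limit quoted in the text


def mmcm_output_mhz(f_in_mhz, d, m, o):
    """Return (f_vco, f_out) in MHz for pre-divider D, feedback divider M, output divider O.

    Hypothetical helper: the VCO runs at f_in * M / D and each output at f_vco / O.
    """
    if not 1 <= o <= 128:
        raise ValueError("output divider O must be an integer from 1 to 128")
    f_vco = Fraction(f_in_mhz) * m / d
    if not VCO_MIN_MHZ <= f_vco <= VCO_MAX_MHZ:
        raise ValueError(f"VCO at {float(f_vco)} MHz is outside the 400-1600 MHz range")
    return f_vco, f_vco / o


# Example: a 100 MHz input with D = 1, M = 10 puts the VCO at 1000 MHz;
# an output divider of 4 then yields 250 MHz.
f_vco, f_out = mmcm_output_mhz(100, d=1, m=10, o=4)
```

Using exact fractions avoids rounding surprises when several outputs are derived from the same VCO frequency.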

MMCM Programmable Features
The MMCM has three input-jitter filter options: low bandwidth, high bandwidth, or optimized mode. Low-bandwidth mode has the best jitter attenuation but not the smallest phase offset. High-bandwidth mode has the best phase offset, but not the best jitter attenuation. Optimized mode allows the tools to find the best setting.

The MMCM can have a fractional counter in either the feedback path (acting as a multiplier) or in one output path. Fractional counters allow non-integer increments of 1/8 and can thus increase frequency synthesis capabilities by a factor of 8.

The MMCM can also provide fixed or dynamic phase shift in small increments that depend on the VCO frequency. At 400 MHz the phase-shift timing increment is 44 ps; at 1600 MHz, it is 11.5 ps.

Clock Distribution
Each Virtex-6 FPGA provides five different types of clock lines (BUFG, BUFR, BUFIO, BUFH, and the high-performance clock) to address the different clocking requirements of high fanout, short propagation delay, and extremely low skew.

Global Clock Lines
In each Virtex-6 FPGA, 32 global-clock lines have the highest fanout and can reach every flip-flop clock, clock enable, set/reset, as well as many logic inputs. There are 12 global clock lines within any region. Global clock lines can be driven by global clock buffers, which can also perform glitchless clock multiplexing and the clock enable function. Global clocks are often driven from the CMT, which can completely eliminate the basic clock distribution delay.

Regional Clocks
Regional clocks can drive all clock destinations in their region as well as the region above and below. A region is defined as any area that is 40 I/O and 40 CLB high and half the chip wide. Virtex-6 FPGAs have between 6 and 18 regions. There are 6 regional clock tracks in every region. Each regional clock buffer can be driven from any of four clock-capable input pins, and its frequency can optionally be divided by any integer from 1 to 8.

I/O Clocks
I/O clocks are especially fast and serve only I/O logic and serializer/deserializer (SerDes) circuits, as described in the I/O Logic section. Virtex-6 devices have a high-performance direct connection from the MMCM to the I/O for low-jitter, high-performance interfaces.

Block RAM

Virtex-6 FPGA Memory Resources User Guide

Every Virtex-6 FPGA has between 156 and 1064 dual-port block RAMs, each storing 36 Kbits. Each block RAM has two completely independent ports that share nothing but the stored data.

Synchronous Operation
Each memory access, read and write, is controlled by the clock. All inputs, data, address, clock enables, and write enables are registered. Nothing happens without a clock. The input address is always clocked, retaining data until the next operation. An optional output data pipeline register allows higher clock rates at the cost of an extra cycle of latency. During a write operation, the data output can reflect either the previously stored data, the newly written data, or remain unchanged.

Programmable Data Width

  • Each port can be configured as 32K x 1, 16K x 2, 8K x 4, 4K x 9 (or 8), 2K x 18 (or 16), 1K x 36 (or 32), or 512 x 72 (or 64). The two ports can have different aspect ratios, without any constraints.
  • Each block RAM can be divided into two completely independent 18 Kb block RAMs that can each be configured to any aspect ratio from 16K x 1 to 512 x 36. Everything described previously for the full 36 Kb block RAM also applies to each of the smaller 18 Kb block RAMs.
  • In 18 Kb block RAMs, only simple dual-port (SDP) mode can provide a data width of >36 bits. In this mode, one port is dedicated to read operations and the other to write operations. In SDP mode, one side (read or write) can be variable while the other is fixed to 32/36 or 64/72. There is no read output during write. In the dual-port 36 Kb RAM, both sides can be of variable width.
  • Two adjacent 36 Kb block RAMs can be configured as one cascaded 64K x 1 dual-port RAM without any additional logic.
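
The aspect ratios listed above all follow from dividing the same 32 Kbit of data storage by the port width, with one extra parity bit per byte on ports that are at least 8 bits wide. A small Python sketch, with hypothetical names, makes the pattern explicit:

```python
DATA_BITS = 32 * 1024  # 32 Kbit of data storage in one 36 Kb block RAM


def aspect_ratios():
    """Yield (depth, data_width, width_with_parity) for one port of a 36 Kb block RAM."""
    for width in (1, 2, 4, 8, 16, 32, 64):
        depth = DATA_BITS // width
        # One parity bit accompanies each full byte; narrower ports have none.
        parity = width // 8
        yield depth, width, width + parity


for depth, width, total in aspect_ratios():
    print(f"{depth} x {total} (or {width})" if total != width else f"{depth} x {width}")
```

The remaining 4 Kbit of the 36 Kb array is what holds the parity bits in the wider configurations.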

    Error Detection and Correction
    Each 64 bit-wide block RAM can generate, store, and utilize eight additional Hamming-code bits, and perform single-bit error correction and double-bit error detection (ECC) during the read process. The ECC logic can also be used when writing to, or reading from external 64/72-wide memories. This works in simple dual-port mode and does not support read-during-write.
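
The figure of eight additional bits for 64 data bits is what a standard single-error-correcting, double-error-detecting Hamming code requires. The Python sketch below only counts the check bits; it does not reproduce the actual bit assignment used in the device, and the function name is hypothetical.

```python
def secded_check_bits(data_bits):
    """Number of check bits for a SEC-DED Hamming code over data_bits of data."""
    r = 0
    # Single-error correction needs 2**r >= data_bits + r + 1 distinct syndromes.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One additional overall parity bit adds double-error detection.
    return r + 1


check_bits = secded_check_bits(64)  # 7 Hamming bits + 1 overall parity bit
```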

    FIFO Controller
    The built-in FIFO controller for single-clock (synchronous) or dual-clock (asynchronous or multirate) operation increments the internal addresses and provides four handshaking flags: full, empty, almost full, and almost empty. The almost full and almost empty flags are freely programmable. Similar to the block RAM, the FIFO width and depth are programmable, but the write and read ports always have identical width. First-word fall-through mode presents the first-written word on the data output even before the first read operation. After the first word has been read, there is no difference between this mode and the standard mode.
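
The four flags can be illustrated with a purely behavioral Python model. This is a sketch under stated assumptions, not a description of the hardware: the class name and the two offset parameters are hypothetical, clocking is ignored, and the thresholds are interpreted as "within N words of full" and "N words or fewer stored".

```python
from collections import deque


class FifoModel:
    """Behavioral FIFO with programmable almost-full and almost-empty thresholds."""

    def __init__(self, depth, almost_full_offset, almost_empty_offset):
        self.depth = depth
        self.almost_full_offset = almost_full_offset    # hypothetical parameter
        self.almost_empty_offset = almost_empty_offset  # hypothetical parameter
        self._words = deque()

    def write(self, word):
        if self.full:
            raise OverflowError("write to a full FIFO")
        self._words.append(word)

    def read(self):
        if self.empty:
            raise IndexError("read from an empty FIFO")
        return self._words.popleft()

    @property
    def empty(self):
        return len(self._words) == 0

    @property
    def full(self):
        return len(self._words) == self.depth

    @property
    def almost_empty(self):
        return len(self._words) <= self.almost_empty_offset

    @property
    def almost_full(self):
        return len(self._words) >= self.depth - self.almost_full_offset
```

A producer would typically stop writing when almost full is asserted, leaving a margin for words already in flight.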

    Digital Signal Processing—DSP48E1 Slice

    Virtex-6 FPGA DSP48E1 Slice User Guide

    DSP applications use many binary multipliers and accumulators, best implemented in dedicated DSP slices. All Virtex-6 FPGAs have many dedicated, full-custom, low-power DSP slices combining high speed with small size, while retaining system design flexibility. Each DSP48E1 slice fundamentally consists of a dedicated 25 x 18 bit two's complement multiplier and a 48-bit accumulator, both capable of operating at 600 MHz. The multiplier can be dynamically bypassed, and two 48-bit inputs can feed a single-instruction-multiple-data (SIMD) arithmetic unit (dual 24-bit add/subtract/accumulate or quad 12-bit add/subtract/accumulate), or a logic unit that can generate any one of 10 different logic functions of the two operands.
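
The arithmetic of the slice can be mimicked in Python to show the operand widths and the SIMD lane behavior. This is a bit-accurate sketch of the arithmetic only, with hypothetical function names; pipelining, opmodes, and the logic unit are not modeled.

```python
def wrap_signed(value, bits):
    """Wrap an integer into the two's complement range of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, a, b):
    """One multiply-accumulate step: 25-bit a times 18-bit b, added to a 48-bit accumulator."""
    if not -(1 << 24) <= a < (1 << 24):
        raise ValueError("a does not fit in 25 signed bits")
    if not -(1 << 17) <= b < (1 << 17):
        raise ValueError("b does not fit in 18 signed bits")
    return wrap_signed(acc + a * b, 48)


def simd_add(x, y, lanes=4):
    """Add two 48-bit patterns lane by lane (4 x 12 bits or 2 x 24 bits), no carry between lanes."""
    width = 48 // lanes
    mask = (1 << width) - 1
    result = 0
    for i in range(lanes):
        shift = i * width
        lane_sum = ((x >> shift) + (y >> shift)) & mask  # keep only this lane's bits
        result |= lane_sum << shift
    return result
```

In quad mode, `simd_add(0xFFF, 0x001)` returns 0 because the overflow of the lowest 12-bit lane is discarded, whereas in dual mode (`lanes=2`) the same operands give 0x1000.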

    The DSP48E1 includes an additional pre-adder, typically used in symmetrical filters. This new pre-adder improves performance in densely packed designs and reduces the logic slice count by up to 50%.
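
Why a pre-adder helps symmetrical filters can be seen from a short Python comparison. In a filter whose coefficients are mirror-symmetric, the two samples that share a coefficient can be added first and multiplied once, which halves the number of multiplications. The function names are hypothetical and the sketch covers only the even-length case.

```python
def fir_direct(coeffs, window):
    """Direct form: one multiplication per tap."""
    return sum(h * x for h, x in zip(coeffs, window))


def fir_symmetric(half_coeffs, window):
    """Even-length symmetric FIR: pre-add mirrored samples, then one multiply per pair."""
    n = len(window)
    return sum(h * (window[k] + window[n - 1 - k]) for k, h in enumerate(half_coeffs))


coeffs = [1, 2, 3, 3, 2, 1]   # symmetric 6-tap filter
window = [5, -1, 4, 2, 7, 0]  # six most recent samples
assert fir_direct(coeffs, window) == fir_symmetric(coeffs[:3], window)
```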

    The DSP48E1 slice provides extensive pipelining and extension capabilities that enhance speed and efficiency of many applications, even beyond digital signal processing, such as wide dynamic bus shifters, memory address generators, wide bus multiplexers, and memory-mapped I/O register files. The accumulator can also be used as a synchronous up/down counter. The multiplier can perform logic functions (AND, OR) and barrel shifting.

    Input/Output

    Virtex-6 FPGA SelectIO Resources User Guide

    The number of I/O pins varies from 240 to 1200 depending on device and package size. Each I/O pin is configurable and can comply with a large number of standards, using up to 2.5V. The Virtex-6 FPGA SelectIO Resources User Guide describes the I/O compatibilities of the various I/O options. With the exception of supply pins and a few dedicated configuration pins, all other package pins have the same I/O capabilities, constrained only by certain banking rules.

    All I/O pins are organized in banks, with 40 pins per bank. Each bank has one common VCCO output supply-voltage pin, which also powers certain input buffers. Some single-ended input buffers require an externally applied reference voltage (VREF). There are two VREF pins per bank (except configuration bank 0). A single bank can have only one VREF voltage value.

    I/O Electrical Characteristics
    Single-ended outputs use a conventional CMOS push/pull output structure driving High towards VCCO or Low towards ground, and can be put into high-Z state. The system designer can specify the slew rate and the output strength. The input is always active but is usually ignored while the output is active. Each pin can optionally have a weak pull-up or a weak pull-down resistor.

    Any signal pin pair can be configured as differential input pair or output pair. Differential input pin pairs can optionally be terminated with a 100 Ω internal resistor. All Virtex-6 devices support differential standards beyond LVDS: HT, RSDS, BLVDS, differential SSTL, and differential HSTL.

    Digitally Controlled Impedance
    Digitally controlled impedance (DCI) can control the output drive impedance (series termination) or can provide parallel termination of input signals to VCCO, or split (Thevenin) termination to VCCO/2. DCI uses two pins per bank as reference pins, but one such pair can also control multiple banks. VRN must be resistively pulled to VCCO, while VRP must be resistively connected to ground. The resistor must be either 1x or 2x the characteristic trace impedance, typically close to 50 Ω.

    I/O Logic Input and Output Delay

    This section describes the available logic resources connected to the I/O interfaces. All inputs and outputs can be configured as either combinatorial or registered. Double data rate (DDR) is supported by all inputs and outputs. Any input or output can be individually delayed by up to 32 increments of ≈78 ps each. This is implemented as IODELAY. The number of delay steps can be set by configuration and can also be incremented or decremented while in use.

    To use IODELAY, the system designer must instantiate the IODELAY control block and clock it with a frequency close to 200 MHz. The total delay of each 32-tap IODELAY is controlled by that frequency and is thus unaffected by temperature, supply voltage, and processing variations.
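
The relationship between the reference clock and the tap delay can be sketched numerically. The formula below, one tap equal to 1/(64 x reference frequency), is an assumption that reproduces the ≈78 ps figure at 200 MHz; the function names are hypothetical and the limit of 32 increments is the one quoted in the text.

```python
MAX_TAPS = 32  # maximum number of increments quoted in the text


def tap_delay_ps(f_ref_mhz=200.0):
    """Delay of one tap in picoseconds, assuming tap = 1 / (64 * f_ref)."""
    return 1e6 / (64 * f_ref_mhz)


def total_delay_ps(taps, f_ref_mhz=200.0):
    """Delay in picoseconds for a given number of taps."""
    if not 0 <= taps <= MAX_TAPS:
        raise ValueError("tap count out of range")
    return taps * tap_delay_ps(f_ref_mhz)
```

Under this assumption the full 32 increments at 200 MHz span 2.5 ns, half of the 5 ns reference period.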

    ISERDES and OSERDES
    Many applications combine high-speed bit-serial I/O with slower parallel operation inside the device. This requires a serializer and deserializer (SerDes) inside the I/O structure. Each input has access to its own deserializer (serial-to-parallel converter) with programmable parallel width of 2, 3, 4, 5, 6, 7, 8, or 10 bits. Each output has access to its own serializer (parallel-to-serial converter) with a programmable parallel width of up to 8 bits for single data rate (SDR), or up to 10 bits for double data rate (DDR).
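
The conversion itself is simple to model in Python. The sketch below is behavioral only: the function names are hypothetical, the bit order (least significant bit first) is an assumption, and word alignment, which real designs must establish explicitly, is taken for granted.

```python
def serialize(words, width):
    """Parallel-to-serial: emit each word's bits, least significant bit first."""
    return [(w >> i) & 1 for w in words for i in range(width)]


def deserialize(bits, width):
    """Serial-to-parallel: regroup an aligned bit stream into width-bit words."""
    if len(bits) % width:
        raise ValueError("bit stream is not a whole number of words")
    return [
        sum(bit << i for i, bit in enumerate(bits[k:k + width]))
        for k in range(0, len(bits), width)
    ]


words = [0b1011, 0b0110]
assert deserialize(serialize(words, 4), 4) == words
```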

    System Monitor

    Virtex-6 FPGA System Monitor User Guide

    Every Virtex-6 FPGA contains a System Monitor circuit providing thermal and power supply status information. Sensor outputs are digitized by a 10-bit 200kSPS analog-to-digital converter (ADC). This fully tested and specified ADC can also be used to digitize up to 17 external analog input channels. The System Monitor ADC utilizes an on-chip reference circuit thereby eliminating the need for any external active components. On-chip temperature and power supplies are monitored with a measurement accuracy of ±4°C and ±1% respectively.

    By default the System Monitor continuously digitizes the output of all on-chip sensors. The most recent measurement results together with maximum and minimum readings are stored in dedicated registers for access at any time through the DRP or JTAG interfaces. Alarm limits can automatically indicate over-temperature events and unacceptable power supply variation. A specified limit (for example: 125°C) can be used to initiate an automatic power down.

    The System Monitor does not require explicit instantiation in a design. Once the appropriate power supply connections are made, measurement data can be accessed at any time, even pre-configuration or during power down, through the JTAG test access port (TAP).

    Low-Power Gigabit Transceiver

    Virtex-6 FPGA GTX Transceivers User Guide

    Ultra-fast serial data transmission between ICs, over the backplane, or over longer distances is becoming increasingly popular and important. It requires specialized dedicated on-chip circuitry and differential I/O capable of coping with the signal integrity issues at these high data rates.

    All but one Virtex-6 device have between 8 and 36 gigabit transceiver circuits. Each GTX transceiver is a combined transmitter and receiver capable of operating at a data rate between 155 Mb/s and 6.5 Gb/s. The transmitter and receiver are independent circuits that use separate PLLs to multiply the reference frequency input by certain programmable numbers between 2 and 25, to become the bit-serial data clock. Each GTX transceiver has a large number of user-definable features and parameters. All of these can be defined during device configuration, and many can also be modified during operation.

    Transmitter
    The transmitter is fundamentally a parallel-to-serial converter with a conversion ratio of 8, 10, 16, 20, 32, or 40. The transmitter output drives the PC board with a single-channel differential current-mode logic (CML) output signal.

    TXOUTCLK is the appropriately divided serial data clock and can be used directly to register the parallel data coming from the internal logic. The incoming parallel data is fed through a small FIFO and can optionally be modified with the 8B/10B, 64B/66B, or the 64B/67B algorithm to guarantee a sufficient number of transitions. The bit-serial output signal drives two package pins with complementary CML signals. This output signal pair has programmable signal swing as well as programmable pre-emphasis to compensate for PC board losses and other interconnect characteristics.
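
The relationship between line rate, parallel width, and encoding overhead can be worked through in Python. This is a simplified sketch with hypothetical function names: it treats the parallel clock as the line rate divided by the parallel width and ignores any internal datapath details.

```python
LINE_RATE_MIN_MBPS = 155    # lower data-rate limit quoted in the text
LINE_RATE_MAX_MBPS = 6500   # upper data-rate limit quoted in the text
WIDTHS = (8, 10, 16, 20, 32, 40)


def parallel_clock_mhz(line_rate_mbps, width):
    """Parallel-side clock in MHz for a given line rate and conversion ratio."""
    if width not in WIDTHS:
        raise ValueError("unsupported conversion ratio")
    if not LINE_RATE_MIN_MBPS <= line_rate_mbps <= LINE_RATE_MAX_MBPS:
        raise ValueError("line rate outside the 155 Mb/s to 6.5 Gb/s range")
    return line_rate_mbps / width


def payload_rate_mbps(line_rate_mbps, data_bits=8, coded_bits=10):
    """Usable data rate after line coding, 8B/10B by default."""
    return line_rate_mbps * data_bits / coded_bits
```

For example, a 6.5 Gb/s link with a 20-bit interface needs a 325 MHz parallel clock, and with 8B/10B coding it carries 5.2 Gb/s of payload.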

    Receiver
    The receiver is fundamentally a serial-to-parallel converter, changing the incoming bit serial differential signal into a parallel stream of words, each 8, 10, 16, 20, 32, or 40 bits wide. The receiver takes the incoming differential data stream, feeds it through a programmable equalizer (to compensate for PC board and other interconnect characteristics), and uses the FREF input to initiate clock recognition. There is no need for a separate clock line. The data pattern uses non-return-to-zero (NRZ) encoding and optionally guarantees sufficient data transitions by using the selected encoding scheme. Parallel data is then transferred into the FPGA logic using the RXUSRCLK clock.

    Out-of-Band Signaling
    The GTX transceivers provide Out-of-Band (OOB) signaling, often used to send low-speed signals from the transmitter to the receiver, while high-speed serial data transmission is not active, typically when the link is in a power-down state or has not been initialized.

    Integrated Interface Blocks for PCI Express Designs

    The PCI Express standard is a packet-based, point-to-point serial interface standard. The differential signal transmission uses an embedded clock, which eliminates the clock-to-data skew problems of traditional wide parallel buses.

    The PCI Express Base Specification Revision 2.0 is backwards compatible with Revision 1.1 and defines a configurable raw data rate of 2.5 Gb/s, or 5.0 Gb/s per lane in each direction. To scale bandwidth, the specification allows multiple lanes to be joined to form a larger link between PCI Express devices.

    All Virtex-6 LXT and SXT devices include an integrated interface block for PCI Express technology that can be configured as an Endpoint or Root Port, designed to the PCI Express Base Specification Revision 2.0. The Root Port can be used:

  • To build the basis for a compatible Root Complex
  • To allow custom FPGA-FPGA communication via the PCI Express protocol
  • To attach ASSP Endpoint devices such as Fibre-channel HBAs to the FPGA

    This block is highly configurable to system design requirements and can operate 1, 2, 4, or 8 lanes at the 2.5 Gb/s data rate and the 5.0 Gb/s data rate. For high-performance applications, advanced buffering techniques of the block offer a flexible maximum payload size of up to 1024 bytes. The integrated block interfaces to the GTX transceivers for serial connectivity, and to block RAMs for data buffering. Combined, these elements implement the Physical Layer, Data Link Layer, and Transaction Layer of the PCI Express protocol.
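
The link bandwidth that follows from these lane counts and data rates can be estimated in Python. The sketch accounts only for the 8b/10b line coding used at 2.5 Gb/s and 5.0 Gb/s; packet and protocol overhead are ignored, and the function name is hypothetical.

```python
RAW_RATE_GBPS = {1: 2.5, 2: 5.0}  # raw per-lane rate at the two supported speeds
LANES = (1, 2, 4, 8)


def pcie_link_gbps(lanes, gen):
    """Data rate per direction in Gb/s after 8b/10b coding, before protocol overhead."""
    if lanes not in LANES:
        raise ValueError("unsupported lane count")
    return lanes * RAW_RATE_GBPS[gen] * 8 / 10


x8_gen2 = pcie_link_gbps(8, 2)  # 32.0 Gb/s, i.e. 4 GB/s per direction
```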

    Xilinx provides a lightweight, configurable, easy-to-use LogiCORE wrapper that ties the various building blocks (the integrated block for PCI Express, the GTX transceivers, block RAM, and clocking resources) into an Endpoint or Root Port solution. The system designer has control over many configurable parameters: lane width, maximum payload size, FPGA logic interface speeds, reference clock frequency, and base address register decoding and filtering.

    10/100/1000 Mb/s Ethernet Controller (2500 Mb/s Supported)
    An integrated tri-mode Ethernet MAC (TEMAC) block is easily connected to the FPGA logic, the GTX transceivers, and the SelectIO resources. This TEMAC block saves logic resources and design effort. The Virtex-6 LXT and SXT devices have four TEMAC blocks, implementing the link layer of the OSI protocol stack.

    The CORE Generator software GUI helps to configure flexible interfaces to GTX transceiver or SelectIO technology, to the FPGA logic, and to a microprocessor (when required). The TEMAC is designed to the IEEE Std 802.3-2005 specification. 2500 Mb/s support is also available.

    Configuration

    Virtex-6 FPGA Configuration User Guide

    Virtex-6 FPGAs store their customized configuration in SRAM-type internal latches. The number of configuration bits is between 16 Mb and 160 Mb (2 to 20 MB), depending on device size but independent of the specific user-design implementation, unless compression mode is used. The configuration storage is volatile and must be reloaded whenever the FPGA is powered up. This storage can also be reloaded at any time by pulling the PROGRAM_B pin Low. Several methods and data formats for loading configuration are available, determined by the three mode pins.

    Bit-serial configurations can be either master serial mode where the FPGA generates the configuration clock (CCLK) signal, or slave serial mode where the external configuration data source also clocks the FPGA. For byte- and word-wide configurations, master SelectMAP mode generates the CCLK signal while slave SelectMAP mode receives the CCLK signal for the 8-, 16-, or 32-bit-wide transfer. Alternatively, serial-peripheral interface (SPI) and byte-peripheral interface (BPI) modes are used with industry-standard flash memories and are clocked by the CCLK output of the FPGA. JTAG mode uses boundary-scan protocols to load bit-serial configuration data.
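
A rough lower bound on configuration time follows from the bitstream size, the CCLK frequency, and the bus width. The Python sketch below uses a hypothetical function name and a hypothetical 50 MHz CCLK; it ignores start-up time, flash latency, and any other overhead.

```python
BUS_WIDTHS = (1, 8, 16, 32)  # bit-serial and SelectMAP widths named in the text


def config_time_ms(bitstream_bits, cclk_mhz, bus_width=1):
    """Minimum time in milliseconds to shift in a bitstream, one bus word per CCLK cycle."""
    if bus_width not in BUS_WIDTHS:
        raise ValueError("unsupported bus width")
    cycles = bitstream_bits / bus_width
    return cycles / (cclk_mhz * 1000)


# Largest bitstream quoted in the text (160 Mb) at an assumed 50 MHz CCLK:
serial_ms = config_time_ms(160_000_000, 50)         # 3200.0 ms bit-serial
parallel_ms = config_time_ms(160_000_000, 50, 32)   # 100.0 ms with a 32-bit bus
```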

    The bitstream configuration information is generated by the ISE software using a program called BitGen. The configuration process typically executes the following sequence:

  • Detects power-up (power-on reset) or a Low level on PROGRAM_B.
  • Clears the whole configuration memory.
  • Samples the mode pins to determine the configuration mode: master or slave, bit-serial or parallel, or bus width.
  • Loads the configuration data starting with the bus-width detection pattern followed by a synchronization word, checks for the proper device code, and ends with a cyclic redundancy check (CRC) of the complete bitstream.
  • Start-up executes a user-defined sequence of events: releasing the internal reset (or preset) of flip-flops, optionally waiting for the phase-locked loops (PLLs) to lock and/or the DCI to match, activating the output drivers, and driving the DONE pin High.

    Dynamic Reconfiguration Port
    The dynamic reconfiguration port (DRP) gives the system designer easy access to configuration bits and status registers for three block types: 32 locations for each clock tile, 128 locations for the System Monitor, and 128 locations for each serial GTX transceiver. The DRP behaves like memory-mapped registers and can access and modify block-specific configuration bits as well as status and control registers.

    Encryption, Readback, and Partial Reconfiguration
    As a special option, the bitstream can be AES-encrypted to prevent unauthorized copying of the design. The Virtex-6 FPGA performs the decryption using the internally stored 256-bit key that can use battery backup or alternative non-volatile storage. Most configuration data can be read back without affecting the system's operation. Typically, configuration is an all-or-nothing operation, but the Virtex-6 FPGA also supports partial reconfiguration. When applicable in certain designs, partial reconfiguration can greatly improve the versatility of the FPGA. It is even possible to reconfigure a portion of the FPGA while the rest of the logic remains active; this is known as active partial reconfiguration.

    About the author

     
    Peter Alfke joined Xilinx in 1988 as director of applications engineering. He currently serves as Distinguished Engineer in the Advanced Products Group.

    Peter graduated in electronic engineering from the Technical University in Hannover, Germany in 1957. He went on to work in telecom and computer design with LM Ericsson and Litton Industries before moving to California in 1968. He has spent forty years in Applications Engineering with Fairchild, Zilog, AMD, and now Xilinx. Peter holds more than 30 patents, has authored many application notes, and given worldwide seminars on digital integrated circuits. He is active in the newsgroup comp.arch.fpga.

  • Source: http://www.pldesignline.com/howto/218600159
