Tuesday, January 16, 2007

Emailing: Programmable Logic DesignLine | How to maximize FPGA performance

http://www.pldesignline.com/196900804




January 15, 2007

How to maximize FPGA performance

The more that can be done upfront with good coding styles, timing constraints definition, and resource planning, the easier it will be for the downstream tools to achieve timing requirements.

Editor's Note: See also the related Product Release Article.

As FPGAs push the envelope of performance, understanding how to design for maximum performance requires knowledge of the device architecture and design software. Today's FPGAs resemble a true System-on-a-Chip (SoC) with many more sophisticated features than the glue logic FPGAs of the past. To maximize system performance, designers need to use proper design techniques such as defining timing constraints and selecting options in synthesis and implementation that work best for their design. This article describes how to achieve faster timing in the fewest design iterations.

Understanding the architecture
When evaluating a new FPGA architecture, it is important to understand the hardware features and the tradeoffs that can be made in the architecture. Datasheets, user guides, and technical papers on the architectural features should be thoroughly reviewed before moving forward with a design.

The first thing to learn about any FPGA is what makes up the basic fabric of logic. For example, each of the configurable logic blocks (CLBs) in a Xilinx Virtex-5 FPGA contains two slices, and each slice contains four 6-input look-up tables (LUTs), four registers, and dedicated carry logic. To get the maximum utilization out of each slice, take into consideration the width of the LUTs, the connectivity between the basic elements, and any shared resources.

Many FPGA architectures also contain hard-IP blocks, such as embedded memory and blocks used for DSP functions. If a hard-IP block continuously shows up as the source or destination of your critical path, there are a few things that can be analyzed to improve the performance. First, check that the design is making the most of the block's features and that the synthesis tool is inferring the features you expected from your RTL code. Second, use the dedicated pipeline registers inside the blocks to reduce the setup and clock-to-out timing. Third, evaluate the tradeoff between using dedicated blocks and implementing the same function in slices, which allows more placement flexibility. This can be especially important when a high percentage of the hard-IP blocks is in use.

The clocking resources that are utilized in a design can also affect its performance. For example, Xilinx Virtex-5 FPGAs have I/O, regional, and global clocking resources. These devices are divided into clock regions, each of which can contain at most 4 regional clocks and 10 global clocks. During design planning, it is important to analyze how many clock regions are going to be used as well as which specific clocks are needed within each clock region. Placing your I/Os so that their interface logic does not require all the clock resources in a clock region gives the implementation tools greater placement flexibility.

Define timing requirements
Synthesis and implementation tools are driven by the performance goals that a user specifies with timing constraints. It is important to constrain all internal clock domains, input and output (I/O) paths, multi-cycle paths, and false paths. Define realistic timing constraints in synthesis in order to prevent excessive replication.
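
As a rough illustration, the four kinds of constraints named above might look like this in Xilinx constraint syntax. All net names, instance names, group names, and timing values are hypothetical placeholders, and the sketch assumes "clk_in" is the clock pad net; treat it as a starting point rather than a template for any particular design.

```ucf
# Hypothetical names and values throughout; adapt to the actual design.

# Internal clock domain: 5 ns period, 50% duty cycle
NET "clk_in" TNM_NET = "clk_in";
TIMESPEC "TS_clk_in" = PERIOD "clk_in" 5 ns HIGH 50%;

# Input and output paths, measured relative to the clock pad
OFFSET = IN 2 ns BEFORE "clk_in";
OFFSET = OUT 4 ns AFTER "clk_in";

# Multi-cycle path: allow two clock periods between these groups
INST "slow_src_reg*" TNM = "mc_src";
INST "slow_dst_reg*" TNM = "mc_dst";
TIMESPEC "TS_multicycle" = FROM "mc_src" TO "mc_dst" TS_clk_in * 2;

# False path: ignore timing between these groups
INST "cfg_reg*" TNM = "cfg_regs";
INST "status_reg*" TNM = "status_regs";
TIMESPEC "TS_false" = FROM "cfg_regs" TO "status_regs" TIG;
```

Expressing the multi-cycle requirement as a multiple of the PERIOD timespec means it follows the clock constraint automatically if the period is later changed.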

In your synthesis report, check for any replicated registers and ensure that timing constraints that apply to the original register also cover the replicated registers in implementation. When writing timing constraints for implementation, first group as many paths with the same timing requirement as possible, and only then write more specific timing constraints. Consolidating constraints in this way minimizes implementation runtime and memory usage.

Example of non-consolidated constraints (Xilinx constraint syntax)
TIMESPEC "TS_firsttimespec" = FROM "flopa" TO "flopb" 10ns;
TIMESPEC "TS_secondtimespec" = FROM "flopc" TO "flopb" 10ns;
TIMESPEC "TS_thirdtimespec" = FROM "flopd" TO "flopb" 10ns;

Consolidation of constraints using grouping
INST "flopa" TNM = "flopgroup";
INST "flopc" TNM = "flopgroup";
INST "flopd" TNM = "flopgroup";

TIMESPEC "TS_consolidated" = FROM "flopgroup" TO "flopb" 10ns;
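
Grouping can also help keep replicated registers covered. The line below is a sketch that assumes the synthesis tool names each replica by appending a suffix to the original register name (for example "flopa_rep1", a hypothetical name); the actual naming depends on the tool, so check the synthesis report before relying on a wildcard.

```ucf
# Assumption: replicas of "flopa" keep "flopa" as a name prefix.
# The wildcard places the original register and its replicas in the group.
INST "flopa*" TNM = "flopgroup";
```

A wildcard this broad will also match unrelated instances whose names happen to start with "flopa", so a narrower pattern may be needed in a real design.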

Driving synthesis
For a synthesis tool to create a high-performance circuit, the tool needs to be properly driven by the designer. The first thing a designer needs to consider is proper coding techniques to ensure that inference of behavioral RTL made by the synthesis tool leads to the maximum usage of the architectural features. For example, Xilinx ISE Project Navigator's language templates – available in both Verilog and VHDL – are a great place to get coding examples.

Next, make sure that the synthesis tool has a complete picture of the design. If a design contains IP netlists or any other lower-level black-boxed netlists, these netlists should be included in the synthesis project. Although the synthesis tool won't optimize any logic within the netlists, it will have a better understanding of how to optimize the HDL that interfaces to them.

The tool also needs to understand the performance goals of a design using the timing constraints supplied by the designer. If there are critical paths in the implementation that are not seen as critical in synthesis, try Synplicity Synplify PRO's -route constraint to force synthesis to focus on those paths. Finally, there are a variety of tool settings in synthesis that should be explored. Refer to Fig 1 for suggested tool settings for Synplify PRO.


Figure 1. Suggested tool settings for Synplify PRO.

It is important to start off with a baseline set of tool options and incrementally add new switches to understand their effects. Also note that there are a variety of attribute settings that can affect how logic is inferred and how synthesis is optimized. These attributes are an easy way to affect synthesis without having to re-code (see Table 1).


Table 1. Helpful Synthesis Attributes.*

* For a complete listing of attributes and their functionality, please see the synthesis tool's documentation.

Although timing performance might be enhanced, options that lead to the replication of logic, such as retiming in Synplify PRO, can impact area. If the design is affected by high-fanout nets and you want the synthesis tool to reduce that fanout, apply a fanout attribute to the specific net rather than globally specifying a maximum fanout limit. If hierarchical boundaries are maintained, a designer should make sure that ports are registered at those boundaries. If critical paths cross hierarchical boundaries, certain optimizations will not be allowed by the synthesis tool. This can lead both to lower performance and higher area utilization.

Before moving on to implementation, always review the warnings in the synthesis report. It is also beneficial to check the RTL schematic view to see how the synthesis tool is interpreting the HDL, and the technology schematic to understand how the HDL is mapping to the specific FPGA architecture.

Choosing implementation options
Having obtained an acceptable timing estimate from the synthesis tool, use the implementation tools to determine the true performance of the design. The implementation options that can be used are unique for each design depending on the performance goals of the design, the synthesis flow used, and its overall structure. Once the majority of the functionality is defined in the design's HDL and the effort is focused on timing closure, it is beneficial to run a series of different implementations with different sets of options to determine which is the best combination for the design.

ISE Xplorer is an example of a tool that will allow a designer to determine which options work best. ISE Xplorer has been tuned for each Xilinx FPGA architecture to try the best set of combinations. Although initial runtime can be longer because multiple implementations need to be run, once the design has the right set of options, it will likely reduce the number of design iterations to achieve timing closure.

Physical synthesis options in implementation can be used to re-optimize and pack logic based on knowledge of the critical paths of a design, leading to better placement and routing. Note that physical synthesis can lead to increased area due to replication of logic. As with synthesis, if hierarchy is maintained on a design but the critical path crosses those hierarchical boundaries, physical synthesis will not be able to optimize that path, and inefficient packing may occur. To evaluate whether keeping hierarchy is affecting the performance of the design, turn off hierarchy preservation with an attribute or option during implementation. If it does prove to have an impact, reconsider how the hierarchical boundaries are defined.
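
One way to run this experiment in Xilinx constraint syntax is with the KEEP_HIERARCHY constraint on a single block. This is a sketch only: "u_datapath" is a hypothetical instance name, and whether the constraint takes effect depends on how hierarchy was handled in synthesis, so confirm the behavior in the tool documentation.

```ucf
# Hypothetical instance name. Lets the implementation tools optimize
# across the boundary of this one block for a trial run.
INST "u_datapath" KEEP_HIERARCHY = FALSE;
```

Changing one block at a time makes it easier to tell which boundary is actually limiting the critical path.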

Evaluating critical paths
By understanding the characteristics of the critical path, a designer can make better decisions on what to do for the next design iteration. The delay of a data path comprises both logic delay and interconnect delay. The individual component delays that make up logic delay are fixed, so logic delay can only be reduced if the number of logic levels is reduced or the structure of the logic is changed. By comparison, interconnect delay is much more variable and depends on the placement of the logic, routing congestion, and the competition between nets for the fastest routing resources.

Before routing the design, a quick timing analysis after placement is recommended. Although this timing report will only have estimates for the routing delays, it will give an idea of the critical paths the implementation tools are working on. If the critical paths have a high number of logic levels, designers may want to work on reducing the logic levels rather than running the design through place and route (PAR). When the design has an excessive number of logic levels that lead to many routing interconnects:

  1. Try the different physical synthesis options to see if logic levels can be reduced.
  2. Go back to synthesis and verify that the critical paths reported in implementation match what is reported in synthesis. If they do not, use constraints like Synplify PRO's -route to have the synthesis tool focus on these paths.
  3. Review the HDL code to ensure that it takes advantage of the hardware.

In the case where there are few logic levels but certain data paths are not meeting the performance requirement:

  1. Evaluate fan-out on routes with long delay.
  2. If a critical path contains hard-IP blocks such as RAMBs or DSP48Es, verify the design is taking full advantage of the embedded registers. Also understand when to make the tradeoff between using these hard blocks versus using slice logic.
  3. Analyze clock skew. Large clock skew can be caused by inefficient use of the clock resources in the design.
  4. Perform a placement analysis. If related logic appears to be placed far apart, floorplanning of critical blocks may be required. Only floorplan where necessary. Over-floorplanning gives the tool less flexibility and could lead to worse performance.
  5. If area groups were created for a design with a previous version of software or prior to many design changes, consider removing those area groups to evaluate whether or not they are negatively affecting placement.
  6. Consider manually placing hard-IP blocks such as embedded memory and DSP blocks.
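
For items 4 and 6, a minimal floorplanning sketch in Xilinx constraint syntax might look like the following. The instance names, slice range, and site coordinates are all hypothetical, and the RAMB36 and DSP48 site names are assumed to follow the Virtex-5 naming scheme; look up the real site names for the target device before using anything like this.

```ucf
# Hypothetical instance names, ranges, and site coordinates.

# Item 4: confine only the critical block to a region of slices
INST "u_critical" AREA_GROUP = "AG_critical";
AREA_GROUP "AG_critical" RANGE = SLICE_X0Y0:SLICE_X23Y39;

# Item 6: lock a block RAM and a DSP block to specific sites
INST "u_critical/u_buf_ram" LOC = RAMB36_X0Y4;
INST "u_critical/u_mac" LOC = DSP48_X0Y10;
```

For item 5, removing or commenting out the AREA_GROUP lines for one run is a quick way to check whether an old area group is still helping.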

Conclusion
Today's FPGAs have a variety of high-performance features. In order to take full advantage of these features, a few things need to be considered. The more that can be done upfront with good coding styles, timing constraints definition, and resource planning, the easier it will be for the downstream tools to achieve timing requirements. It is equally important to know what to do next when design requirements are not met in the first iteration.

Michelle Fernandez is a technical marketing engineer in the Software Product Marketing Group at Xilinx. Based on the analysis of customer designs, Michelle provides recommendations aimed at improving FPGA design performance and ease of use to the development teams at Xilinx. Michelle joined Xilinx in 1999 and has held a variety of positions in customer support and field applications engineering. She holds a bachelor of science degree in electrical engineering from the University of California at Davis. Michelle can be contacted at: michelle.fernandez@xilinx.com.

