Generating Configurable Hardware From Parallel Patterns

1 Introduction

Due to power and energy constraints, conventional general-purpose processors are no longer able to sustain performance and energy improvements in commercial datacenters. To overcome the inefficiency of homogeneous multicore systems, heterogeneous architectures that feature specialized hardware accelerators are widely considered to be a promising paradigm. In particular, field-programmable gate arrays (FPGAs), which offer the potential of orders-of-magnitude performance/watt gains for a broad class of applications while retaining reconfigurability, are attracting increasing attention as a mainstream acceleration technology. For instance, both Microsoft and Baidu have incorporated FPGA-based accelerators in their datacenters to accelerate large-scale production workloads such as search engines [29, 10] and neural networks [24, 25]. Amazon also introduced the F1 instance [4], a compute instance equipped with FPGA boards, in its Elastic Compute Cloud (EC2). Moreover, with the $16.7 billion acquisition of Altera, Intel recently announced the Heterogeneous Architecture Research Platform (HARP) [3], which provides an FPGA and a Xeon processor in a single semiconductor package. Predictions have been made that as much as 30% of datacenter servers will have FPGAs by 2020 [6]. This suggests that FPGAs could become a common component in future servers and could play an important role as primary computing resources [23].

On the other hand, a major challenge in FPGA-based acceleration is programmability. FPGA programming is generally recognized as an RTL (register-transfer level) design practice, which requires notable hardware expertise in designing accelerator microarchitectures such as controls, data paths, and finite state machines [9]. This makes the effort of FPGA programming prohibitive to most datacenter application developers. It is even more challenging when the mainstream algorithm in an application domain is constantly evolving; i.e., an algorithm may already be obsolete by the end of the development process of its hardware accelerator.

Decades of research have focused on improving FPGA programmability. High-level synthesis (HLS) [12], which allows hardware designs to be described in high-level programming languages like C/C++ (such C/C++ programs for hardware designs are generally called hardware behavioral descriptions), is recognized as an encouraging approach. In fact, a C program can even be compiled by state-of-the-art HLS tools like Xilinx SDAccel into a working FPGA circuit without any modification of the program itself. However, a high-quality software program is generally far away from a high-quality hardware behavioral description due to the lack of proper consideration of the underlying FPGA architecture. Our experiments show that a software program, if naively treated as a hardware behavioral description, almost always leads to an FPGA accelerator that performs orders of magnitude worse than running the program on a modern CPU. This is because HLS still leaves programmers to face the challenge of identifying the optimal design configuration among a tremendous number of choices, which in turn requires intimate knowledge of hardware intricacies to efficiently reduce the design space and obtain a high-quality solution in a reasonable time. Consequently, to programmers HLS still presents a significant gap between a software program and a high-quality hardware behavioral description, which prevents the FPGA programmability from being further improved.

This paper presents a comprehensive approach to pave the path from a software program to a high-quality hardware behavioral description that 1) is functionally equivalent to the software program, and 2) leads to a high-performance FPGA accelerator. The approach consists of three main stages. The first stage, design space reduction, aims to reduce the tremendous design space. Specifically, we introduce the composable, parallel and pipeline (CPP) microarchitecture, a template of accelerator designs, as a specification of the program-to-behavioral-description transformation. Such a carefully designed template fits a variety of computation kernels and guarantees the quality of accelerator designs. Also, with the CPP microarchitecture as the transformation specification, the design space is restricted to only configurations of that specific microarchitecture. The second stage, automatic design space exploration, realizes a near-optimal CPP microarchitecture configuration automatically with an analytical model and a machine-learning-based search engine. With this near-optimal configuration, the third stage, automated accelerator generation, organizes a collection of code transformation primitives to transform the software program into the behavioral description of the desired CPP microarchitecture. We develop the AutoAccel framework to implement the proposed approach and make the entire accelerator generation process automatic. In summary, this paper makes the following contributions:

  • The CPP microarchitecture. By introducing this broadly applicable accelerator design template as the specification of the program-to-behavioral-description transformation, we achieve the objective of drastically reducing the design space while preserving accelerator design quality.

  • The analytical model. This proposed model captures the performance and resource trade-offs among all design configurations of the CPP microarchitecture, laying the foundation for fast, automated design space exploration.

  • The AutoAccel framework. AutoAccel automates the entire accelerator generation process, provides datacenter application developers with a nearly push-button experience of FPGA programming, and thus substantially improves the FPGA programmability.

  • Detailed evaluation. We evaluate AutoAccel via the MachSuite [30] benchmark suite by proposing a metric to measure whether the qualities of AutoAccel-generated accelerators reach optimality. We also evaluate the accuracy of the proposed analytical model using Xilinx SDAccel and on-board execution.

Our experiments show that the AutoAccel-generated accelerators outperform their respective software implementations by an average of 72x for the MachSuite computation kernels.

2 Background

A field-programmable gate array (FPGA) is an integrated circuit that contains an array of reprogrammable logic and memory blocks: lookup tables (LUTs), flip-flops (FFs), digital signal processing slices (DSPs) and block RAMs (BRAMs). Connected through a hierarchy of reconfigurable interconnects, these blocks can be customized into different circuits to solve various computation problems. Such hardware customizability allows FPGA circuits to avoid the significant overhead of general-purpose microprocessors, resulting in orders-of-magnitude performance/watt gains for a broad class of workloads.

However, the FPGA programmability issue is a serious impediment to its adoption by datacenter application developers. Section 2.1 briefly describes state-of-the-art commercial HLS tools that represent the latest effort in improving FPGA programmability through HLS. The fact that such tools leave programmers with full responsibility for performance optimization motivates our work. In Section 2.2 we then introduce the Merlin compiler [2, 13, 14], a compilation framework that attempts to alleviate the burden of manual code optimization by providing a library of automated code transformation primitives. While the Merlin compiler still relies on programmers to determine the optimal combination and parameters of the transformation operations, and thus does not substantially relieve the burden, its transformation library serves as a good preliminary tool for us to agilely implement automatic generation of the CPP microarchitecture.

2.1 Commercial HLS Tools

Commercial HLS tools such as Xilinx SDAccel [7] and the Intel FPGA SDK for OpenCL [5] have been widely used to rapidly prototype user-defined functionalities expressed in high-level languages (e.g., C/C++ and OpenCL) on FPGAs without involving register-transfer level (RTL) descriptions. The typical design flow used by common commercial HLS tools is shown in Fig. 1.

Figure 1: Common Commercial HLS Tool Design Flow

Commercial HLS tools usually provide a set of language extensions for users, such as C pragmas, that give guidance on memory organization and task scheduling to complement the missing information of static analysis while optimizing the design. The language extensions are specified by the user at the source code level, but the core HLS code transformation and optimization happens at the intermediate representation (IR) level, meaning that the effectiveness of user guidance highly depends on the IR structure and the front-end compiler. This implies that two programs with the same functionality but different coding styles (leading to different IR structures) might result in a significant performance deviation. In fact, this difference can be up to several orders of magnitude based on our experience. As a consequence, programmers have to pay attention to every detail that may impact the generated IR structure, which often requires a profound understanding of the FPGA architecture and circuit design.
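To make this concrete, the sketch below shows the kind of source-level guidance a designer typically supplies; the pragma spelling follows common Vivado HLS usage and is illustrative rather than taken from this paper:

/* Illustrative only: the same loop with and without this pragma can yield  */
/* very different circuits, depending on how the front end builds the IR.   */
void scale(const int in[1024], int out[1024], int k) {
  for (int i = 0; i < 1024; i++) {
#pragma HLS pipeline II=1   /* request one new loop iteration per cycle */
    out[i] = k * in[i];
  }
}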

2.2 Merlin Compiler

The Merlin compiler [2, 13, 14] is a source-to-source transformation tool for FPGA acceleration based on the CMOST [34] compilation flow. It provides a transformation library and a set of pragmas with the prefix "#pragma Accel" for developers to perform design optimization at the source-code level. Each pragma corresponds to a code transformation primitive, as listed in Table 1.
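For instance, a loop can be parallelized at the source level with a single pragma; the form below mirrors the parallel pragma that appears later in Code 1 (the factor value is only illustrative):

/* Illustrative sketch of Merlin source-level optimization. */
void vec_add(const int a[1024], const int b[1024], int c[1024]) {
  for (int i = 0; i < 1024; i++) {
#pragma Accel parallel factor=8   /* Merlin duplicates the loop body 8 times
                                     and partitions a, b and c to match     */
    c[i] = a[i] + b[i];
  }
}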

Table 1: Merlin Compiler Code Transformations
Figure 2: The Merlin Compiler Execution Flow

Based on the transformation library, Fig. 2 presents the Merlin compiler execution flow. It leverages the ROSE compiler infrastructure [1] and the polyhedral framework [38] for abstract syntax tree (AST) analysis and transformation. The front-end stage analyzes the user program and separates the host from the computation kernel. The kernel code transformation stage then applies multiple code transformations according to user-specified pragmas. Note that the Merlin compiler performs all necessary code reconstructions to make a transformation effective. For example, when performing loop unrolling, the Merlin compiler not only unrolls the loop but also conducts memory partitioning to avoid bank conflicts [15]. Finally, the back-end stage takes the transformed kernel and uses the HLS tool to generate the FPGA bitstream.

Compared to the pure HLS solution, the Merlin compiler further improves FPGA programmability by making design optimization "semiautomatic": instead of manually reconstructing the code to make one optimization operation effective, programmers can now simply place a pragma and let the Merlin compiler make the necessary changes. However, programmers still have to identify the best combination and parameters among these operations, i.e., manually search an exponential design space.

3 Accelerator Design Template

This section presents the details of the design space reduction stage of the proposed approach. In general, our solution is to introduce an accelerator design template as the specification of the transformation from software programs to hardware behavioral descriptions. A software program will only be transformed into a hardware behavioral description of this template, so the design space is restricted to configurations of the template alone. As a result, the design space is drastically reduced (see Section 4 for the design space definition). Meanwhile, this template ought to be applicable to a variety of computation kernels, and it guarantees the accelerator design quality once a kernel fits into the template. Sections 3.1 and 3.2 present our proposed accelerator design template, the composable, parallel and pipeline (CPP) microarchitecture, and show how the CPP microarchitecture is derived. Section 3.3 discusses the applicability of the CPP microarchitecture to diverse computation kernels.

3.1 Obstacles Towards Efficient Behavioral Description

We derive the CPP microarchitecture by conducting an analysis of the major obstacles from a software program towards an efficient hardware behavioral description. Specifically, we start from a collection of computation kernels (the kernels and their software implementations are from the MachSuite benchmark suite [30]; see Section 6.1), straightforwardly treat their software implementations as behavioral descriptions, feed such naive behavioral descriptions into Xilinx SDAccel, and identify the microarchitectural inefficiencies of the generated FPGA accelerators. Such inefficiencies represent the obstacles towards efficient behavioral descriptions.

We use the NW (Needleman-Wunsch algorithm) benchmark (see Section 6.1) as an example for demonstration and discussion. The NW benchmark processes a series of genome sequence alignment jobs, each with a pair of 128-entry sequences as input and a pair of 256-entry sequences as output. The alignment engine applies the Needleman-Wunsch algorithm, a dynamic programming algorithm with quadratic time complexity, to the input sequences, and generates the optimal post-aligned sequences given a predefined scoring system [22].

void engine (...) {
  int M[129][129];
  ...
  loop1: for (i = 0; i < 129; i++) { M[0][i] = ...; }
  loop2: for (j = 0; j < 129; j++) { M[j][0] = ...; }
  loop3: for (i = 1; i < 129; i++) {
    for (j = 1; j < 129; j++) { ... M[i][j] = ...; }
  }
  ...
}

void kernel (char seqAs[], char seqBs[], char alignedAs[], char alignedBs[]) {
  for (int i = 0; i < NUM_PAIRS; i++) {
    engine(seqAs + i*128, seqBs + i*128,
           alignedAs + i*256, alignedBs + i*256);
  }
}
Figure 3: NW Kernel and the Corresponding Architecture

Fig. 3 presents the NW code snippet and the microarchitecture of the FPGA accelerator generated by naively feeding the NW code into Xilinx SDAccel. Our experiments show that this accelerator performs 92x slower than a single CPU core. We examine the implementation inefficiencies of the NW benchmark that cause such poor performance as follows.

Inefficiency #1: Inefficient off-chip transactions. The kernel function is the top-level function of the NW benchmark and defines the entire accelerator. Its arguments—seqAs, seqBs, alignedAs and alignedBs, which correspond to the original sequence pairs and the aligned sequence pairs—define the input and output buffers that reside in the off-chip DRAM of the FPGA board. The FPGA accelerator connects to these off-chip buffers through AXI channels. The data width of each AXI channel is 8 bits, inferred from the data type of the corresponding argument (8-bit char type in the NW case). As a result, the off-chip data transfer throughput is only 1 byte/cycle for each channel, or 4 bytes/cycle in aggregate, while state-of-the-art CPU-FPGA platforms typically support 64 bytes/cycle of off-chip communication throughput.

Inefficiency #2: No data caching. No data caching module is present in the microarchitecture, with the result that every data access goes through the off-chip DRAM.

Inefficiency #3: Sequential loop scheduling. The kernel function body is a loop statement that iteratively traverses every sequence pair through the engine function that defines the hardware engine module. In the presented microarchitecture, the engine module accepts and processes only one sequence pair at a time, despite the fact that these sequence pairs are independent of each other and thus could be processed in parallel or in a pipeline. Worse still, all loops presented in the NW kernel are scheduled to be processed sequentially, regardless of whether a loop could be mapped to a parallel or pipeline circuit. (The latest Xilinx flow starts to perform loop pipelining automatically, but only for simple loop statements.)

Inefficiency #4: Inefficient on-chip memory utilization. The major computation of the NW algorithm is to generate a two-dimensional score matrix. The engine function therefore includes a local two-dimensional array, M, to store the matrix, and some loop statements to calculate the values of the matrix elements. In the presented microarchitecture, the array M is mapped to an on-chip BRAM buffer that has only one write port, implying that even if the algorithm has the potential to generate multiple matrix element values per cycle, the BRAM buffer cannot fulfill this potential because only one value can be written into the buffer in each cycle.

These inefficiencies, though demonstrated only in the NW case, are present in all MachSuite benchmarks and represent the major obstacles from software programs to high-quality hardware behavioral descriptions. The CPP microarchitecture is thus derived to resolve these inefficiencies.

3.2 CPP Microarchitecture

The composable, parallel and pipeline (CPP) microarchitecture is proposed as a template of accelerator designs and a specification of the program-to-behavioral-description transformation. It includes a series of features to address the inefficiencies described in the previous section. In the following text we continue to use the NW benchmark as an example to demonstrate the CPP microarchitecture along with its key features, as shown in Fig. 4.

Feature #1: Coarse-grained pipeline with data caching. Fig. 4 illustrates the NW accelerator design under the CPP microarchitecture. The overall CPP microarchitecture is a coarse-grained pipeline that consists of three stages: load, compute and store. The kernel function in the NW source code now corresponds only to the compute module instead of defining the entire accelerator. The input sequence pairs are processed tile by tile, i.e., iteratively loading a certain number of sequence pairs into on-chip buffers (stage load), aligning these pairs (stage compute), and storing the post-aligned pairs back to DRAM (stage store). Different tiles are processed in a pipeline since they are independent of each other. This feature addresses inefficiency #2 because off-chip data movement only happens in the load and store stages, leaving the data accesses of the computation completely on chip.

Figure 4: NW Accelerator under the CPP Microarchitecture

The load and store modules connect to the input and output DRAM buffers, respectively, through AXI channels. The data widths of the AXI channels are decoupled from the type sizes of the top-level function arguments. Hence, the off-chip bandwidth can potentially reach the highest physical bandwidth of the CPU-FPGA platform. Also, the load-compute-store pipeline improves the effective bandwidth of the accelerator by overlapping communication with computation. Consequently, inefficiency #1 is addressed as well.
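As an illustration, a minimal load-stage sketch is shown below; the names (load_tile, seqAs_ddr, word512_t) are hypothetical, and the memcpy-over-a-wide-type idiom is a common way to let an HLS tool infer a 512-bit AXI burst, not necessarily AutoAccel's exact output:

#include <string.h>

typedef struct { unsigned char b[64]; } word512_t;   /* one 512-bit word */

/* Copy one 16-pair tile (16 * 128 = 2048 bytes) from DRAM into an on-chip buffer. */
void load_tile(const word512_t *seqAs_ddr, unsigned char seqAs_buf[2048], int tile_idx) {
  /* A burst of 32 wide words moves 64 bytes per cycle instead of 1. */
  memcpy(seqAs_buf, &seqAs_ddr[tile_idx * 32], 2048);
}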

Feature #2: Loop scheduling. The CPP microarchitecture tries to map every loop statement in the computation kernel function to either 1) a circuit that processes different loop iterations in parallel, 2) a pipeline where the loop body corresponds to the pipeline stages, or 3) a combination of both. For the NW example, the loop statement in the kernel function is mapped to a set of engine modules that process the sequence pairs in parallel. Moreover, the loop statements in the engine function are mapped to parallel and pipeline circuits as well. This resolves inefficiency #3.

Feature #3: On-chip buffer reorganization. In the CPP microarchitecture, all the on-chip BRAM buffers are partitioned to meet the port requirements of the parallel circuits, where the number of partitions of each buffer is determined by the duplication factor of the parallel circuit that connects to the buffer. This feature is used to resolve inefficiency #4. In the NW case, the on-chip buffers that cache the input and output sequence pairs are partitioned into multiple segments, each segment feeding one engine module. The local buffer M that stores the score matrix is also partitioned to allow parallel read and write transactions.
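A minimal sketch of this kind of partitioning, written with Xilinx-style HLS pragmas (the factor is hypothetical, and this is not AutoAccel's exact output):

/* Bank M along its second dimension so that unrolled iterations each reach */
/* a different physical BRAM port.                                          */
void engine_sketch(void) {
  int M[129][129];
#pragma HLS array_partition variable=M cyclic factor=4 dim=2
  for (int i = 0; i < 129; i++) {
#pragma HLS unroll factor=4     /* four writes per cycle are now feasible */
    M[0][i] = 0;                /* boundary initialization (placeholder)  */
  }
  /* ... remaining NW loops ... */
}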

In summary, the CPP microarchitecture guarantees the quality of accelerator designs by providing corresponding features to address the inefficiencies. However, it is not applicable to all kinds of computation kernels with diverse data processing patterns. The following section discusses the applicability of the CPP microarchitecture to various computation kernels.

3.3 Applicability to Computation Kernels

The CPP microarchitecture features a load-compute-store coarse-grained pipeline, which requires the computation kernel to process input data block by block. Meanwhile, the size of each block is required to be less than a few megabytes in order to be entirely cached on chip. As a consequence, the CPP microarchitecture favors computation kernels with regular data-level parallelism, like streaming or batch processing programs with the MapReduce [16] pattern. On the contrary, it does not fit well for computation kernels featuring extensive random accesses on a large memory footprint, such as PageRank [26] and the breadth-first search (BFS) algorithm.

4 Analytical Model

Another advantage of using the CPP microarchitecture is having a clear design space. This section presents our CPP microarchitecture analytical model that estimates the execution cycles and resource consumption of these configurations; this lays the foundation for the automatic design space exploration stage of the proposed approach.

Unlike most existing models [18, 20, 28, 32, 36] that analyze the source program directly, many parameters of our proposed model are obtained from the HLS synthesis reports of a few design points. This feature enables our model to capture most scheduling optimizations performed by the HLS tool. As we will show in Section 6, the proposed model produces estimates that closely match the HLS reports.

4.1 Performance Modeling

The performance model estimates an accelerator's overall execution cycle count through Eq. 1, which combines the cycle counts of the load, compute and store modules. Since the load and store modules share the off-chip bandwidth and are together overlapped with the compute module on our experimental platform, Eq. 1 takes the maximum of the load/store cycles and the compute cycles.

The execution cycle count of the load, compute and store modules, as well as all of their submodules, can be quantified as the total cycles of all of their loops, submodules and standalone logic, as shown in Eq. 2, where the cycle count of the standalone logic is obtained from the HLS report.

Next we model the loop execution. Although a loop statement can be scheduled as a pipeline, in parallel, or as a combination of both, the first two schedules can be treated as special cases of the last one, and all of them can be modeled together as Eq. 3 in terms of the iteration latency, initiation interval, trip count and unroll factor. The iteration latency and initiation interval are obtained from the HLS report, while the unroll factor is a design parameter that needs to be explored.

Subsequently, we break down and model the loop iteration in Eq. 4, where the loop iteration latency is composed of the total cycles of all of its sub-loops, submodules and standalone logic.

Eq. 2 and Eq. 4 reflect the architecture hierarchy with nested modules and loops. The proposed model recursively traverses all the loops and modules until a loop or module does not contain any sub-structures. In addition, we can see that Eq. 2 and Eq. 4 are almost identical. This is because a loop iteration can be treated as a special "module" and modeled in the same way for both performance and resources. Hence, we omit the loop iteration breakdowns in the following resource models.
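The equation bodies did not survive extraction. A plausible reconstruction, consistent with the description above (the symbols C, IL, II, TC and UF are ours, not necessarily the paper's), is:

$C_{total} \approx \max(C_{load} + C_{store},\; C_{compute})$   (Eq. 1)
$C_{m} = \sum_{l \in loops(m)} C_{l} + \sum_{s \in sub(m)} C_{s} + C_{logic}(m)$   (Eq. 2)
$C_{l} = IL_{l} + II_{l} \cdot \left(\lceil TC_{l} / UF_{l} \rceil - 1\right)$   (Eq. 3)
$IL_{l} = \sum_{l' \in subloops(l)} C_{l'} + \sum_{s \in sub(l)} C_{s} + C_{logic}(l)$   (Eq. 4)

Eq. 3 is the standard cycle count of a pipelined loop whose iterations are unrolled by a factor UF.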

4.2 Resource Modeling

The resource models estimate the consumption of the four FPGA on-chip resources: BRAMs, LUTs, DSPs and FFs. As the DSP model is relatively straightforward and the FF model is similar to the LUT model, we only demonstrate the BRAM and LUT models in this section.

BRAM modeling: The BRAM consumption of a hardware module consists of the BRAM blocks used by all of its local buffers and those used by all of its submodules, as shown in Eq. 5, where each submodule's consumption is scaled by its duplication factor, which is equivalent to the unroll factor of the loop that contains the submodule. We use "duplication factor" instead of "unroll factor" since the former is a better fit for describing hardware modules while the latter is more suitable for describing loop statements.

Next we model the BRAM consumption of on-chip buffers. A buffer's BRAM consumption is determined by three factors: 1) the partitioning factors on all dimensions, 2) the size of each partition, and 3) the bit-width of the buffer, as shown in Eq. 6. Eq. 6 uses a helper function to calculate the BRAM consumption of a single partition; its two parameters are the size and the bit-width of the partition, and its expression is given in Eq. 7 in terms of the size of a BRAM block, a platform-dependent constant, and a second function that calculates the minimum number of BRAM blocks needed to compose a BRAM buffer of a given bit-width. Eq. 8 shows the expression of this second function, where the largest supported bit-width of a BRAM building block is another platform-dependent constant.
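The BRAM equations are likewise missing from the extracted text; one plausible reconstruction under the description above (notation ours, and Eq. 7 in particular is only a guess at the exact form) is:

$BRAM_{m} = \sum_{b \in buffers(m)} BRAM_{b} + \sum_{s \in sub(m)} DF_{s} \cdot BRAM_{s}$   (Eq. 5)
$BRAM_{b} = \Bigl(\prod_{d} PF_{d}\Bigr) \cdot f\!\left(S_{b} \big/ \prod_{d} PF_{d},\; BW_{b}\right)$   (Eq. 6)
$f(S, BW) = \max\!\left(\lceil S \cdot BW / S_{BRAM} \rceil,\; g(BW)\right)$   (Eq. 7)
$g(BW) = \lceil BW / BW_{max} \rceil$   (Eq. 8)

Here DF is a duplication factor, PF a partitioning factor, S_BRAM the capacity of one BRAM block, and BW_max its widest supported port.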

LUT modeling: The LUT consumption of a hardware module is composed of the number of LUTs used by all of its loops, submodules, BRAM buffers (for control logic) and standalone logic, as shown in Eq. 9. The loop term captures the LUT consumption of the loop iteration, which is, again, treated and modeled as a special "module"; the LUT consumption of the standalone logic is obtained from two HLS reports.

We then model the LUT consumption of on-chip buffers. It can be decoupled into two parts: 1) the control and data signals of each BRAM partition, and 2) the multiplexer that selects the desired data from all the partitions, as shown in Eq. 10. The per-partition signal costs are obtained from the HLS report, and the multiplexer cost can be calculated via Eq. 11. We can also see that the LUT consumption of a buffer depends on its BRAM usage.
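Again, the original expressions are missing; a heavily hedged sketch of Eqs. 9-11 that matches the prose (notation ours; the multiplexer cost model in Eq. 11 is only an assumption) is:

$LUT_{m} = \sum_{l \in loops(m)} UF_{l} \cdot LUT^{iter}_{l} + \sum_{s \in sub(m)} DF_{s} \cdot LUT_{s} + \sum_{b \in buffers(m)} LUT_{b} + LUT_{logic}(m)$   (Eq. 9)
$LUT_{b} = P_{b} \cdot (LUT_{ctrl} + LUT_{data}) + LUT_{mux}(P_{b}, BW_{b})$   (Eq. 10)
$LUT_{mux}(P, BW) \approx c_{mux} \cdot BW \cdot (P - 1)$   (Eq. 11)

where P_b is the number of partitions of buffer b and c_mux is a platform-dependent constant.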

Based on the proposed model, the design space of the CPP microarchitecture is composed of 1) the capacity and bit-width of every on-chip buffer, and 2) the unroll factor of every loop, as indicated in Table 2. Unfortunately, the proposed model is neither linear nor convex, and therefore cannot be mathematically solved in polynomial time. Hence, we implement automatic design space exploration by leveraging a machine-learning-based search engine that greatly reduces the number of search iterations needed to reach a near-optimal solution. This, together with the AutoAccel framework, is presented in the following section.

Table 2: The CPP Microarchitecture Design Space

5 AutoAccel Framework

In this section we present the AutoAccel framework, which takes a nested loop in C as input (computation kernels with multiple nested loops can be decoupled into multiple sub-kernels, each corresponding to a CPP microarchitecture; existing work [19] has extensively studied how to connect multiple accelerators through FIFO channels with efficient inter-accelerator communication) and performs a series of transformations to produce a high-quality FPGA accelerator with the CPP microarchitecture. AutoAccel is built on top of the Merlin compiler and uses its transformation library to construct the CPP microarchitecture.

Figure 5: AutoAccel Framework Overview

Fig. 5 illustrates the overall flow of the AutoAccel framework. The input program is first evaluated by legalization checking to determine whether it fits into the CPP microarchitecture. Next, the CPP microarchitecture constructor refactors the input program into a hardware behavioral description of the CPP microarchitecture. Subsequently, a design space builder identifies the design space via static code analysis. After the design space has been built, a design space explorer with our proposed analytical model finds the best design specification in minutes. Finally, we refactor the behavioral description code again by applying the best design specification to generate the desired accelerator design. This design can be directly fed into Xilinx SDAccel to derive a high-quality accelerator bitstream. In the rest of this section we present the detailed implementation of each component.

5.1 Legalization Checking

Since AutoAccel does not require any user modification of the input computation kernel code, the goal of legalization checking is to evaluate whether the input kernel can be mapped to the CPP microarchitecture. We briefly describe the evaluation points of the AutoAccel built-in legalization checking algorithm as follows:

Kernel size. The resource requirements of generated designs cannot exceed the capacity of a single FPGA fabric. This can be evaluated by running HLS with the basic configuration.

Task-dependent data chunk length. A task-dependent array is an array that is traversed by the PE-loop so that every PE will use a different chunk of data. To achieve the most efficient parallelism and pipeline scheduling in the CPP microarchitecture, the on-chip scratchpad memory is partitioned for every PE to avoid write conflicts. For example, the string length of the NW kernel in Fig. 3 is always 128, so it can be processed by AutoAccel. However, if the size of the task data chunk is determined dynamically, AutoAccel cannot statically allocate a certain memory size to each PE; this results in the failure of legalization checking.

Task-independent data size. A task-independent array, on the other hand, is an array that is accessed by all PEs. For instance, in the breadth-first search (BFS) implementation of the MachSuite benchmark [30], the array that stores the tree is task-independent, because every PE might access any part of the array, so it cannot be partitioned regularly. As a result, it is better to duplicate task-independent arrays in on-chip memory for each PE to guarantee efficiency. In case the array is too large to be stored in on-chip memory, the kernel fails to pass the legalization checking.

We perform legalization checking by traversing an abstract syntax tree (AST). We analyze the iteration domain to reason about the kernel's accessed data size using the polyhedral analysis from [27].

5.2 CPP Microarchitecture Construction and Design Space Establishment

AutoAccel makes use of the transformation library of the Merlin compiler to preprocess the user input code to fit the CPP microarchitecture. To constrain a design space when constructing the CPP microarchitecture, we use static analysis and a polyhedral model to collect the necessary information (e.g., loop trip count, maximal buffer size, bit-width, etc.). Instead of specifying an integer number in Merlin pragmas for a certain configuration (e.g., data tiling size), we define an expression "auto(min, max, inc)" to represent a set of design points. In the expression, min and max indicate the range while inc specifies the incremental operator from the minimum value to the maximum value. We currently support two incremental operators: 1) seq, which increments by one, and 2) pow2, which multiplies by two. This expression will be replaced with a specific integer of the best configuration after the design space exploration (DSE).
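For example, the following pragma (illustrative values, hypothetical helper process()) declares a tunable unroll factor whose candidates are the powers of two between 1 and 128:

int process(int v);               /* hypothetical per-element kernel */

void stage(const int in[128], int out[128]) {
  for (int i = 0; i < 128; i++) {
    /* auto(1,128,pow2) -> candidate factors {1, 2, 4, ..., 128};
       auto(1,128,seq)  -> candidate factors {1, 2, 3, ..., 128}  */
#pragma Accel parallel factor = auto(1, 128, pow2)
    out[i] = process(in[i]);
  }
}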

We now introduce the transformation operations used to construct the CPP microarchitecture. Again, the NW benchmark is used as an example to demonstrate the transformation flow, as shown in Code 1. The first three transformations are data tiling, coarse-grained pipeline and processing element duplication.

1. Data tiling: The transformation first tiles a sub-loop in the nested loop and creates a set of on-chip buffers for data caching. Then it instruments the code to establish efficient off-chip data communication by enabling memory bursts. The transformed code corresponds to lines 33-51 in Code 1.

Since the CPP microarchitecture decouples the off-chip memory communication from the computation, the analytical model does not cover design points with different data tiling granularities. To solve this problem, we find the best design point for each of the possible data tiling granularities reported by the legalization checking algorithm in parallel. We plan to include the data tiling granularity in the design space in the future.

2. Coarse-grained pipeline: After the data tiling, we apply the coarse-grained pipeline transformation that encapsulates load, compute and store into three functions to depict the boundaries between pipeline stages (lines 41-51). Afterwards, the transformation duplicates the on-chip buffers created by step 1 and interleaves all of them by enabling double buffering.

3. Processing element duplication: The next step is to enable parallel computing. We apply the parallelism transformation to the compute stage in the tiled nested loop (lines 20-24). This creates multiple homogeneous processing elements (PEs) to process the loop iterations in parallel.

Up to this point, we have constructed a microarchitecture with a coarse-grained pipeline and a PE array, which covers feature #1 and part of feature #2 of the CPP microarchitecture. Next, we focus on loop scheduling within PEs.

4. Small loop flattening: Based on our experience, it is usually better to flatten the in-PE loops with fixed, small trip counts. The reason is that 1) flattening loops with small trip counts provides more opportunities for HLS to generate a more efficient schedule, and 2) flattening such loops does not affect the overall resource utilization considerably. As a result, we adopt an ad hoc strategy to fully unroll in-PE loops with trip counts less than 16.

1  void NW (...) {
2    int M[129][129];
3
4    ...
5    loop1: for (i = 0; i < 129; i++) {
6      #pragma Accel parallel factor = auto(1, 128, seq)
7      M[0][i] = ...;
8    }
9    loop2: for (j = 0; j < 129; j++) {
10     #pragma Accel parallel factor = auto(1, 128, seq)
11     M[j][0] = ...;
12   }
13   loop3: for (i = 1; i < 129; i++) {
14     for (j = 1; j < 129; j++) { ...
15       #pragma Accel parallel factor = auto(1, 128, seq)
16       M[i][j] = ...
17   }}
18   ...
19 }
20 void compute (char seqAs[], char seqBs[], char alignedAs[], char alignedBs[]) {
21   for (int i = 0; i < TILE_PAIRS; i++) {
22     #pragma Accel parallel factor = auto(1, NUM_PAIRS, seq)
23     NW(seqAs + i*128, seqBs + i*128, alignedAs + i*256, alignedBs + i*256);
24 }}
25 void load (...) { ... }
26 void store (...) { ... }
27 void kernel (char seqAs[], char seqBs[], char alignedAs[], char alignedBs[]) {
28   #pragma Accel bitwidth variable = seqAs factor = auto(8, 512, pow2)
29   #pragma Accel bitwidth variable = seqBs factor = auto(8, 512, pow2)
30   #pragma Accel bitwidth variable = alignedAs factor = auto(8, 512, pow2)
31   #pragma Accel bitwidth variable = alignedBs factor = auto(8, 512, pow2)
32
33   char seqAs_buf_x[128 * TILE_PAIRS]; char seqAs_buf_y[128 * TILE_PAIRS];
34
35
36
37   ...
38   #pragma AutoAccel variable = TILE_PAIRS value = auto(1, NUM_PAIRS, seq)
39   const int TILE_PAIRS = 16;
40   int num_tiles = NUM_PAIRS / TILE_PAIRS;
41   for (int i = 0; i < num_tiles + 2; i++) {
42     if (i % 2 == 0) {
43       load( );
44       compute(seqAs_buf_y, seqBs_buf_y, alignedAs_buf_y, alignedBs_buf_y);
45       store( );
46     }
47     else {
48       load( );
49       compute(seqAs_buf_x, seqBs_buf_x, alignedAs_buf_x, alignedBs_buf_x);
50       store( );
51 }}}

Code 1: NW Code with the CPP Microarchitecture

5. Fine-grained parallel/pipeline: If an in-PE loop cannot be fully unrolled by step 4, it must satisfy one of the following conditions: 1) its trip count is either unknown or larger than 16, 2) it has a loop-carried dependency, or 3) it contains one or more sub-loops that cannot be fully unrolled. In the first condition, we apply fine-grained parallelism and explore the best partial unroll factor (lines 6, 10, and 15). In the other two conditions, we apply a fine-grained pipeline to improve the throughput and resource efficiency, as illustrated in the sketch below.
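As an illustration, the two in-PE cases might look as follows; the parallel pragma mirrors lines 6, 10 and 15 of Code 1, while the pipeline pragma spelling is our assumption of the analogous Merlin directive:

/* Case 1: independent iterations with a large trip count -> partial unroll,
   with the factor left to the DSE via auto() (cf. lines 6, 10 and 15).      */
void init_row(int M[129][129]) {
  for (int i = 0; i < 129; i++) {
#pragma Accel parallel factor = auto(1, 128, seq)
    M[0][i] = 0;
  }
}

/* Case 2: loop-carried dependency -> fine-grained pipeline instead. */
void prefix_sum(const int x[1024], int acc[1024]) {
  int s = 0;
  for (int i = 0; i < 1024; i++) {
#pragma Accel pipeline            /* assumed spelling of the pipeline pragma */
    s += x[i];
    acc[i] = s;
  }
}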

The above two transformations cover the remaining part of feature #2. Finally, we apply step 6 to cover feature #3.

6. On-chip buffer reorganization: We finally apply memory coalescing to reorganize the on-chip buffers (lines 28-31). We analyze the data type to determine the minimal bit-width, and always set the maximal bit-width to 512 bits since this is the maximum supported by the experimental platform. In addition, we only set power-of-two bit-width values as DSE candidates, because HLS tools round BRAM sizes up to a power of two. As a result, this reduced design space can still cover the optimal solution of the original design space.

By applying the above code transformations, we are able to generate a transformed kernel code with the CPP microarchitecture and a design space. As can be seen in Code 1, the design space of the NW example contains an enormous number of design points. Therefore, an efficient DSE component is essential for the AutoAccel framework.

5.3 Design Space Exploration

The DSE flow of AutoAccel, as shown in Fig. 6, is implemented using OpenTuner [8], an open-source framework for building domain-specific program tuners. The OpenTuner runtime has a search technique library that contains a collection of machine learning algorithms to cover as many customized tuning problems as possible. In order to assemble all search techniques, OpenTuner adopts a multi-armed bandit algorithm [17] as a meta technique to judge the effectiveness of each search technique and allocate design points according to the judgment. Specifically, a search technique that efficiently finds high-quality design points is rewarded and allocated more design points. In contrast, a technique that performs poorly at discovering high-quality design points is allocated fewer points and eventually disabled. By harnessing OpenTuner, our DSE flow is able to find the best design point efficiently and effectively.

Figure 6: Design Space Exploration Flow

In Figure 6, the model initialization stage first parses the HLS reports of a few design points and generates the values of the design constants. While most values are obtained by running HLS once, the LUT consumption of the standalone logic of a loop iteration (in Eq. 9) is calculated via two HLS reports. In particular, we run HLS twice with two consecutive unroll factors of a loop to calculate the increase of the LUT consumption. This increase is the LUT consumption of the loop's standalone logic. Next, we analyze the kernel source code to 1) establish the architecture hierarchy, and 2) fetch the design parameters and their value ranges from the auto pragmas inserted during design space establishment (Section 5.2). After the model is initialized, we simply feed the parameter sets of the remaining design points to the model and collect the performance and resource estimations.

6 Experimental Evaluation

In this section we first describe our experimental setup, including the hardware platform, software environment, and benchmarks. Then we evaluate AutoAccel by analyzing the results of the design space exploration (DSE), the analytical model, and the overall performance and energy efficiency.

6.1 Experimental Setup

The evaluation of AutoAccel is performed on the mainstream PCIe-based CPU-FPGA platform with the Xilinx SDAccel design flow. Table 3 lists the detailed hardware and software configuration. A Xeon CPU is connected to a Xilinx Virtex-7 FPGA board through the PCIe interface. For a fair comparison, both the CPU and the FPGA fabric were launched in 2012. On top of the platform hardware, we use Xilinx SDAccel to provide a hardware-software co-design environment.

Table 4 lists the benchmarks used in our experiment. We use MachSuite [30], a benchmark suite that contains a broad class of computational kernels programmed as C functions for accelerator study, to evaluate the AutoAccel framework. For each kernel, MachSuite provides at least one implementation that is based on a commonly used algorithm in software programming, which makes it a natural fit for demonstrating AutoAccel.

Table 3: Configuration of Hardware and Software
Table 4: Benchmark Description

6.2 Design Space Exploration Evaluation

Fig. 7 illustrates the process of finding the optimal design point using the learning-based DSE approach with the analytical model to evaluate performance and resource consumption. Thanks to the multi-armed bandit algorithm, the DSE process is able to find the correct direction to the optimal solution efficiently, so the DSE time limit is set to only 180 seconds after the model initialization. As can be seen in Fig. 7, the execution cycles drop significantly in the first 20 seconds except for KMP. We analyze the process log of KMP in detail and find that the DSE spends some iterations attempting to improve the performance of the compute stage, because KMP has a relatively large design space inside the compute module. However, the performance of KMP is heavily bounded by memory bandwidth, so reduced compute latency does not benefit the overall performance. Despite this, the DSE process for KMP is still able to converge in time.

Figure 7: Process of Finding the Optimal Design via DSE

Based on the best configuration found by the DSE, Table 5 presents the performance and resource utilization for each benchmark. (We set 80% as the resource constraint based on the resources available to users.) Note that the C2C metric in the second column is calculated by the following equation:
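The equation itself is missing from the extracted text; a plausible form, consistent with the description below (our notation, with the cycle counts being those of the compute stage and of the communication, i.e., load plus store, stages), is:

$\mathrm{C2C} = \dfrac{C_{compute}}{C_{load} + C_{store}}$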

The concept of C2C shares the merits of the CTC ratio used in [33]. We use C2C to analyze whether a design has achieved optimality. A design is identified as computation bound if its C2C is larger than 1; otherwise it is communication bound.

Table 5: C2C and Resource Utilization

According to Table 5, the overall performance of AES, SPMV, KMP, and STENCIL is bounded by the off-chip bandwidth, because those four designs need to input or output a large amount of data, so the memory transaction time cannot be hidden by the computation time even though the AutoAccel DSE has successfully found the design point with the largest bit-width. In fact, the memory-bound designs may potentially be further optimized by introducing data reuse analysis. For example, [11] leverages polyhedral analysis to identify and optimize the data access pattern, which results in a much lower external memory transaction volume for stencil computation. However, the impact of this kind of transformation cannot be estimated by our analytical model and is beyond the scope of this paper. Future work will extend the model to cover those transformations.

Among the other four designs, VITERBI and NW are bounded by LUTs. We can see that the C2C of both designs is higher than 2. This means that the PE in both designs consumes many LUTs; even though the overall design could still be further improved by duplicating more PEs, there are no more available LUTs to use.

On the other hand, FFT and GEMM are bounded by BRAM. Since their PE logic is relatively simple, BRAM becomes the major resource bottleneck. In this case, the DSE balances the computation and communication cycles by adjusting the PE number and buffer bit-width. As a consequence, the BRAM-bound designs have a C2C value larger than but close to 1.

6.3 Analytical Model Evaluation

We conduct two experiments to evaluate the accuracy of the analytical model. The first experiment aims to evaluate whether the model-generated results are consistent with those collected from HLS reports. In particular, we randomly select 20 design points for each benchmark, and compare the performance and resource usage for each design point between the model estimation and the HLS report. Table 6 presents the average absolute difference rates for all cases.

We can see that the proposed model aligns accurately with the HLS report on performance and BRAM/DSP usage, and it results in only moderate differences on LUT/FF usage. The differences are caused by the fact that the HLS tool adopts resource-efficient implementations for its building blocks when a design requires a large proportion of on-board resources. For example, VITERBI includes a loop statement with an initiation interval (II) equal to 40. The hardware circuit for this loop has some 25-to-1 multiplexers to select one floating-point number from 25 numbers. We observe that when the number of PEs in the VITERBI design grows, the HLS tool automatically replaces a fully pipelined multiplexer implementation that consumes over 500 LUTs with an implementation that consumes just 32 LUTs to 1) meet the II=40 restriction, and 2) save on-board resources. Since such dynamic optimization strategies are hard to capture with a static analytical model, a few percentage points of difference on LUT/FF usage are inevitable.

Table 6: Differences Between Model and HLS Reports

The second experiment evaluates the performance difference between the HLS report and the actual on-board result. Table 7 presents the absolute performance difference rate of the optimal design point identified by AutoAccel. We can see that the average difference among all the benchmarks is just 6.2%, which shows that the cycle estimation from the HLS tool matches the actual on-board execution time for the proposed microarchitecture. Note that the actual frequency of the generated designs does not vary dramatically, for the following two reasons. First, Xilinx SDAccel 2016.3 optimizes the timing prior to optimizing other factors, so it might sacrifice resource efficiency (e.g., enlarge the II) to preserve the frequency. Second, all of our designs reserve sufficient resources for the tool to avoid strict timing constraints. As a result, the impact of frequency on the performance difference is moderate.

Table 7: Differences Between Model and On-board Results

In addition, we further analyze the benchmarks with over 10% performance difference, i.e., AES and KMP. We find that such a relatively large difference is mainly because the accelerator designs for these benchmarks have a very small execution time (under roughly 10 ms). In such time frames, the start-up and termination overheads bias the measured time significantly. On the contrary, we also find that the error rate of the model against on-board execution is always less than 5% when a design has an execution time over 100 milliseconds. Hence, the proposed model is able to accurately predict the on-board execution time of a design given that its execution time is tens of milliseconds or larger.

6.4 Performance and Energy Evaluation

We finally evaluate the performance speedup and energy efficiency improvement of the generated FPGA accelerator designs. Figure 8 compares the performance of the naive MachSuite implementations, the manual HLS designs and the AutoAccel-generated accelerator designs, all of which are normalized to the performance of the corresponding software implementations. We can clearly see that the AutoAccel-generated accelerators drastically outperform the naive implementations by 27,000x, indicating that AutoAccel has effectively addressed the gap from software programs to high-quality hardware behavioral descriptions. Meanwhile, the AutoAccel-generated accelerators also outperform the software implementations by 72x, indicating that our approach does lead to high-quality accelerator designs.

We can also see from the experimental results that the manual designs outperform the AutoAccel-generated designs by an average of only 2.5x, even after we spent several days to weeks applying more behavioral-level transformations to achieve the optimal performance. In particular, for the benchmarks with C2C<1 — AES, SPMV, KMP and STENCIL — the generated designs achieve the same optimal performance as the manual designs on the experimental platform, because these benchmarks are all of linear time complexity, and their PEs run faster than the off-chip communication. On the other hand, the performance of the benchmarks with super-linear time complexity — FFT, NW, VITERBI and GEMM — is bounded by FPGA on-chip resources. As a result, the performance can potentially be further improved by using application-specific accelerator circuits to improve resource efficiency. For example, we employ the systolic array microarchitecture to improve the GEMM accelerator design and achieve the optimal performance with all on-chip DSPs. Although such specialized architectures cannot be covered by AutoAccel, AutoAccel still preserves high accelerator quality while substantially improving the FPGA programmability.

Figure 8: Speedup over an Intel Xeon CPU Core

Finally, we analyze the energy efficiency gain of the AutoAccel-generated designs. We estimate the energy efficiency (performance per watt) in our experiments by considering execution time and thermal design power (TDP). The TDP of the Intel Xeon CPU and the Xilinx FPGA used in this comparison is 80W and 25W, respectively. Accordingly, AutoAccel-generated designs achieve up to a 1677.9x energy efficiency improvement, and 260.4x on average.
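Since the energy estimate is simply execution time multiplied by TDP, the efficiency gain over the CPU reduces to the speedup scaled by the TDP ratio:

$\dfrac{EE_{FPGA}}{EE_{CPU}} = \dfrac{T_{CPU} \cdot TDP_{CPU}}{T_{FPGA} \cdot TDP_{FPGA}} = \text{speedup} \times \dfrac{80\ \mathrm{W}}{25\ \mathrm{W}} = 3.2 \times \text{speedup}$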

7 Related Work

In this section we discuss related work on analytical models and automated frameworks for FPGA design optimization.

Analytical Modeling: Fast performance estimation for FPGAs has become popular in recent years. In general, performance analysis is mainly performed at either the IR level [32, 36, 28, 20, 18] or the source code level [37]. Since most of the existing work performs analysis without explicitly considering the back-end design flow [32, 36, 18, 28, 20], their analysis cannot reflect the optimization done by the commercial tool. On the other hand, similar to this paper, [37] builds the performance model with the help of the commercial tool, but [37] provides neither a resource model nor automated code transformation, so users still need to manually modify the kernel code while considering the FPGA resource limitations.

Automated Frameworks: Some projects aim to provide an automated framework that performs code generation and design space exploration [28, 20, 31]. The framework presented in [28, 20] accepts parallel patterns (e.g., map, groupBy, filter, reduce, etc.) and performs FPGA accelerator generation with analytical DSE. Different from this paper, which automatically applies the CPP microarchitecture, the FPGA architecture generated by [28, 20] is composed of predefined hardware components (i.e., memories, controllers, and primitive operations) to guarantee efficiency. However, the selection of these components highly depends on the semantic information of the user-specified parallel patterns. Furthermore, the performance model for DSE in [28, 20] is built only for the predefined hardware components.

In addition, Melia [31] is a MapReduce framework that supports automated code generation from user-written C code to OpenCL. Melia asks users to provide the best configuration by leveraging the model from [32] to generate the OpenCL code. Consequently, Melia only generates FPGA accelerator designs under a MapReduce programming model, and it lacks automatic design space exploration.

Finally, some frameworks also focus on general-purpose programming languages such as C/C++ [21, 35, 18]. SOAP3 [18] is a framework that analyzes a kernel at the metasemantic intermediate representation (MIR) graph level and transforms it according to the result of design space exploration. However, SOAP3 adopts regression models for resource estimation, so the model is not general enough to cover nonlinear resource consumption. The framework in [21] uses an analytical model based on HLS results (like this paper) to maximize throughput under resource constraints. However, it considers only loop pipelining and ignores the design space of coarse- and fine-grained parallelism. Lin-analyzer [35] is a framework that identifies the performance bottleneck of C/C++ programs, but it does not involve code transformation and only focuses on fine-grained parallelism.

8 Conclusion

While FPGA-based heterogeneous architectures are becoming a promising paradigm for continued performance and energy improvement in modern datacenters, accelerator programming remains a serious challenge to application developers. In this paper we propose the AutoAccel framework to provide a nearly push-button experience of mapping C functions into high-quality FPGA accelerator designs. Featuring the CPP microarchitecture, a fast analytical-model-based design space exploration and automated code transformation, AutoAccel achieves a 72x speedup and a 260.4x energy improvement for a broad class of computation kernels.

Furthermore, we believe that the design principles of AutoAccel can be further generalized to stimulate more research on the adoption of FPGAs in datacenters. For example, the CPP microarchitecture serves as a proof of concept that using an accelerator design template as the specification of the program-to-behavioral-description transformation drastically reduces the design space while preserving the accelerator quality. Therefore, more microarchitectures could be added to AutoAccel to improve the coverage of computation kernels. Also, more sophisticated, high-level code transformations (e.g., loop permutation) could be supported in the future, along with polyhedral analysis, to form a larger design space and create more optimization opportunities.

References

  • [1] ROSE Compiler Infrastructure, 2000. http://rosecompiler.org/.
  • [2] Merlin Compiler, 2015. http://www.falcon-computing.com/index.php/solutions/merlin-compiler.
  • [3] Xeon+FPGA Platform for the Data Center. https://www.ece.cmu.edu/~calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf, 2015.
  • [4] Amazon EC2 F1 Instances, 2016. https://aws.amazon.com/ec2/instance-types/f1/.
  • [5] Intel SDK for OpenCL Applications, 2016. https://software.intel.com/en-us/intel-opencl.
  • [6] Intel to Start Shipping Xeons With FPGAs in Early 2016. http://www.eweek.com/servers/intel-to-start-shipping-xeons-with-fpgas-in-early-2016.html, 2016.
  • [7] SDAccel Development Environment. http://www.xilinx.com/products/design-tools/software-zone/sdaccel.html, 2016.
  • [8] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman Amarasinghe. OpenTuner: An extensible framework for program autotuning. In PACT, 2014.
  • [9] Pierre Bricaud. Reuse methodology manual: for system-on-a-chip designs. Springer Science & Business Media, 2012.
  • [10] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. A cloud-scale acceleration architecture. In MICRO-49, 2016.
  • [11] J. Cong, P. Li, B. Xiao, and P. Zhang. An optimal microarchitecture for stencil computation acceleration based on nonuniform partitioning of data reuse buffers. TCAD, 2016.
  • [12] J. Cong, Bin Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Zhiru Zhang. High-level synthesis for FPGAs: From prototyping to deployment. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 2011.
  • [13] Jason Cong, Muhuan Huang, Peichen Pan, Yuxin Wang, and Peng Zhang. Source-to-source optimization for HLS. In FPGAs for Software Programmers. Springer International Publishing, 2016.
  • [14] Jason Cong, Muhuan Huang, Peichen Pan, Di Wu, and Peng Zhang. Software infrastructure for enabling FPGA-based accelerations in data centers: Invited paper. In ISLPED, 2016.
  • [15] Jason Cong, Wei Jiang, Bin Liu, and Yi Zou. Automatic memory partitioning and scheduling for throughput and power optimization. TODAES, 2011.
  • [16] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 2008.
  • [17] Álvaro Fialho, Luis Da Costa, Marc Schoenauer, and Michèle Sebag. Analyzing bandit-based adaptive operator selection mechanisms. Annals of Mathematics and Artificial Intelligence, 2010.
  • [18] Xitong Gao, John Wickerson, and George A. Constantinides. Automatically optimizing the latency, area, and accuracy of C programs for high-level synthesis. In FPGA, 2016.
  • [19] Muhuan Huang, Kevin Lim, and Jason Cong. A scalable, high-performance customized priority queue. In FPL, 2014.
  • [20] D. Koeplinger, R. Prabhakar, Y. Zhang, C. Delimitrou, C. Kozyrakis, and K. Olukotun. Automatic generation of efficient accelerators for reconfigurable hardware. In ISCA-43, 2016.
  • [21] Peng Li, Peng Zhang, Louis-Noel Pouchet, and Jason Cong. Resource-aware throughput optimization for high-level synthesis. In FPGA, 2015.
  • [22] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 1970.
  • [23] E. Nurvitadhi, Jaewoong Sim, D. Sheffield, A. Mishra, S. Krishnan, and D. Marr. Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC. In FPL, 2016.
  • [24] Jian Ouyang, Shiding Lin, Wei Qi, Yong Wang, Bo Yu, and Song Jiang. SDA: Software-defined accelerator for large-scale DNN systems. In Hot Chips, 2014.
  • [25] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric S. Chung. Toward accelerating deep learning at scale using specialized hardware in the datacenter. In Hot Chips, 2015.
  • [26] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
  • [27] Louis-Noel Pouchet, Peng Zhang, P. Sadayappan, and Jason Cong. Polyhedral-based data reuse optimization for configurable computing. In FPGA, 2013.
  • [28] Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun. Generating configurable hardware from parallel patterns. In ASPLOS-XXI, 2016.
  • [29] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In ISCA-41, 2014.
  • [30] Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. MachSuite: Benchmarks for accelerator design and customized architectures. In IISWC, 2014.
  • [31] Z. Wang, S. Zhang, B. He, and W. Zhang. Melia: A MapReduce framework on OpenCL-based FPGAs. TPDS, 2016.
  • [32] Zeke Wang, Bingsheng He, Wei Zhang, and Shunning Jiang. A performance analysis framework for optimizing OpenCL applications on FPGAs. In HPCA-22, 2016.
  • [33] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In FPGA, 2015.
  • [34] Peng Zhang, Muhuan Huang, Bingjun Xiao, Hui Huang, and Jason Cong. CMOST: A system-level FPGA compilation framework. In DAC-52, 2015.
  • [35] G. Zhong, A. Prakash, Y. Liang, T. Mitra, and S. Niar. Lin-analyzer: A high-level performance analysis tool for FPGA-based accelerators. In DAC-53, 2016.
  • [36] Guanwen Zhong, Alok Prakash, Siqi Wang, Yun Liang, Tulika Mitra, and Smail Niar. Design space exploration of FPGA-based accelerators with multi-level parallelism. In DATE, 2017.
  • [37] Hamid Reza Zohouri, Naoya Maruyama, Aaron Smith, Motohiko Matsuda, and Satoshi Matsuoka. Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs. In SC, 2016.
  • [38] Wei Zuo, Peng Li, Deming Chen, Louis-Noël Pouchet, Shunan Zhong, and Jason Cong. Improving polyhedral code generation for high-level synthesis. In CODES+ISSS, 2013.
