Key Notes – BNN-FPGA


  1. Studying FPGA acceleration for very-low-precision (binarized) CNNs;
  2. Ensuring full throughput and hardware utilization across different input feature map sizes;
  3. Implementing a BNN classifier on a ZedBoard;


  1. Comparison of CNNs and BNNs
  2. Introduction of convolutional, pooling, and FC layers;
  3. Introduction of Binarized Neural Networks (BNNs);

    y = \frac{x-\mu}{\sqrt{\sigma ^{2} + \epsilon}}\gamma + \beta

  4. CIFAR-10 BNN model;
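A quick sketch of what binarization buys: with ±1 activations and weights, a dot product reduces to XNOR plus popcount. Helper names are illustrative, and `__builtin_popcount` assumes GCC/Clang:

```cpp
#include <cstdint>

// Binarize: map a real value to +1/-1, encoded as bit 1/0.
inline uint32_t binarize(float x) { return x >= 0.0f ? 1u : 0u; }

// Dot product of two 32-element +/-1 vectors packed one bit per element.
// A +/-1 multiply is an XNOR on the bit encoding; the sum of N products
// with P matching bits is P - (N - P) = 2*popcount - N.
inline int bdot(uint32_t a, uint32_t b) {
    uint32_t xnor = ~(a ^ b);
    return 2 * __builtin_popcount(xnor) - 32;
}
```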

★ 3. FPGA Accelerator Design

3.1 Hardware Optimized BNN Model

  1. Removing the biases from the model;
  2. Folding the batch-norm calculation into y = kx + h :

    k = \frac{\gamma}{\sqrt{\sigma ^{2} + \epsilon}}

    h = \beta - \frac{\mu \gamma}{\sqrt{\sigma ^{2} + \epsilon}}

    This reduces the number of operations per output and cuts the stored batch-norm parameters from four (μ, σ, γ, β) to two (k, h).

  3. Empirical testing shows that k and h can be quantized to 16 bits, and the floating-point BNN inputs to 20 bits;
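The folding of batch norm into y = kx + h can be checked numerically; a minimal sketch with illustrative parameter values:

```cpp
#include <cmath>

// Fold batch-norm parameters (gamma, beta, mu, sigma^2) into y = k*x + h.
// k and h are computed once offline; inference keeps only a multiply-add.
struct FoldedBN { float k, h; };

inline FoldedBN fold_bn(float gamma, float beta, float mu,
                        float sigma2, float eps) {
    float inv_std = 1.0f / std::sqrt(sigma2 + eps);
    return { gamma * inv_std, beta - mu * gamma * inv_std };
}

// Reference four-parameter batch norm for comparison.
inline float bn_ref(float x, float gamma, float beta, float mu,
                    float sigma2, float eps) {
    return (x - mu) / std::sqrt(sigma2 + eps) * gamma + beta;
}
```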

3.2 Retraining for +1 Edge-Padding

  1. The original BNN pads feature map edges with 0, a third value outside {-1, +1}; retraining with +1 padding keeps the data binary;

3.3 System Architecture

  1. Ping-pong data buffers A and B (one feeds the compute units while the other is filled);

3.4 Compute Unit Architectures

  1. Utilizing the line buffer architecture for 2D Conv;
  2. Replacing the multiplies in the Conv operation with sign inversions;
  3. Parallelizing across the three input channels (one output bit per cycle);
  4. The design targets fmap widths of 8, 16, or 32, and can support larger power-of-two widths with minor changes;
  5. A standard line buffer is unsuitable for this task:
    a. It must be sized for the largest input fmap (width 32);
    b. It is designed to shift one pixel per cycle to store the most recent rows;
  6. BitSel & variable-width line buffer (VWLB):
    a. The BitSel reorders the input bits so that the Convolver logic can be agnostic of the fmap width;
    b. Each Convolver applies f_{out} 3×3 Conv filters per cycle to the data in the VWLB;
    c. For a 32-wide input, each word contains exactly one line. Each cycle the VWLB shifts up and the new 32-bit line is written to the bottom row; the 3×3 conv windows slide across the VWLB to generate one 32-bit line of conv outputs;
    d. For an 8-wide input, each word contains four lines. Each VWLB row is split into four banks, and each input line is mapped to one or more VWLB banks;
  7. The VWLB achieves full hardware utilization regardless of input width;
  8. A new input word can be buffered every cycle;
  9. The BitSel handles the various input widths.
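The multiplier-free Conv above can be sketched as follows: with ±1 weights, each tap either passes the input through or inverts its sign (the weight encoding and function name are illustrative, not the paper's RTL):

```cpp
// 3x3 convolution with +/-1 weights: no multipliers needed.
// wbits packs the nine weights one bit each (bit set => weight +1).
inline int conv3x3_binary_weights(const int window[3][3], unsigned wbits) {
    int acc = 0;
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c) {
            int idx = r * 3 + c;
            int x = window[r][c];
            acc += ((wbits >> idx) & 1u) ? x : -x;  // sign inversion, no multiply
        }
    return acc;
}
```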

★ 4. HLS Accelerator Implementation

  1. In the BNN accelerator, the basic atom of processing is not a pixel but a word (one word per cycle);
  2. To increase the number of parallel input streams, they tile the loop and unroll the inner loop body;
  3. 64 convolutions per cycle;
  4. Data buffers A and B are sized at 2048 words.
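The loop tiling and unrolling in item 2 might look like this in HLS-style C++ (a sketch under assumed sizes; the loop body is a stand-in for the real conv datapath):

```cpp
#include <cstdint>

constexpr int N_WORDS = 64;        // illustrative buffer size
constexpr int WORDS_PER_TILE = 8;  // illustrative tile width

// Tile the word loop and unroll the inner body so one tile of words is
// processed in parallel each cycle (in Vivado HLS the inner loop would
// carry a "#pragma HLS UNROLL").
void conv_words(const uint64_t in[N_WORDS], uint64_t out[N_WORDS]) {
    for (int t = 0; t < N_WORDS; t += WORDS_PER_TILE) {   // tiled outer loop
        for (int u = 0; u < WORDS_PER_TILE; ++u) {        // unrolled inner body
            // #pragma HLS UNROLL
            out[t + u] = ~in[t + u];  // stand-in for the per-word conv datapath
        }
    }
}
```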