To see more or discuss possible work let's talk >>
(Click to enlarge)
Binary search tree implementation & acceleration on a heterogenous system.

September 2016 - November 2016 |  CMU Pittsburgh

(18-643 Reconfigurable Logic project)

Project | 03


  • Non-embarrassingly-parallel problem to accelerate. Inherently sequential.

  • Initialized a 16 million node binary tree on Xilinx ZedBoard, and implemented six 255-node search kernels on the FPGA for 6 parallel searches. Bench marked against sequential software search by ARM processor. 

  • Detailed analysis of DMA transfer bottleneck to choose the kernel size for FPGA.

  • Kernel synthesized from Vivado HLS featured stage skipping using parallel comparators making it faster than sequential search, but full tree search on FPGA was slowed down by DMA communication costs, sequential nature of binary search, and log(n) nature of search problem making ARM processor fast.


  • Parallelization thought process-How to parallelize a non-inherently-parallel problem like binary search, which is sequentially fast(log(n))-identify memory bound or computation bound.

  • Defining hardware through software – High Level Synthesis, generate optimized RTL from C, through compiler directive.

  • DMA transfer time depends on the kernel size and number of times a tree block is transferred to FPGA(determined by main tree size). Kernel size selected on a tradeoff -small enough to reduce individual transfer time, but large enough to reduce the number of transfers.


  • Designed the schematic of the datapath and register file of a 16 bit 4-stage pipelined processor, using 45nm transistor technology in Cadence.

  • Estimated the wire capacitances from rough floorplan idea, and included in the schematics for near-accurate simulation.

  • Processor supported 6 basic instructions- ADD, SHIFT(L&R), LOAD, STORE & AND.

  • Datapath critical paths were analyzed to adhere to 15FO4 delay specification.

  • Class of Adder used- Ripple Carry Select Adder. The 5 datapath buses were multiplexed to a common output bus, through optimized 5:1 tristate-based MUX for each lane.

Project | 02

16-bit processor design

October 2016-December 2016 | CMU, Pittsburgh

(18-622 Digital IC Design Project)

(Click to enlarge)


  • Involved setting up a wireless link between a PC, and a brain implant PCB designed for epilepsy detection and termination.

  • Developed a custom firmware, in C language, for the USB dongle plugged to a PC, wrote Verilog code for the FPGA that controlled the RFIC on brain implant PCB, and MATLAB script for real-time plot of recorded brain signals.


  • Implementing SPI protocol in Verilog for FPGA to communicate with SPI-compliant RFIC.

  • Working with USB 2.0 CDC class libraries, MRFI API(data link layer software by TI) and knowledge of code organization.

  • Extensive handling of test instruments :oscilloscopes, logic analyzer, and signal generators, to perform systematic debugging.

  • Experience of working in multicultural team environment.

Challenges solved:

  • Maximized overall system throughput and minimized packet loss by handling bottleneck created by RFIC data rate limitation, through implementation of FIFO buffers after an in-depth signal timing analysis.

Impact and significance:

  • The implant PCB is now capable of transmitting the recorded neural data, wirelessly to the dongle, and graph it in real time on MATLAB.

  • The default firmware on USB dongle allows it to only receive data. However my new modified firmware allows the dongle to receive as well as send data from PC back to implant PCB, as a COM port device, allowing for a closed-loop neural recording and stimulation system.

Project | 01

Wireless Radio Connectivity for responsive neurostimulator brain implants

June 2015-August 2015 | Univ. of Toronto, Canada

(Mitacs Globalink Research Internship Project)

(Click to enlarge)

Binary tree structure in linear memory

FPGA floorplan

Runtime stats

Challenges solved:

  • Expensive cache flush time- due to ARM core generating 255-node tree from the main tree, and pushing it into the DRAM, for DMA-ing into the FPGA SRAM-solved by pre-processing the main tree, and storing it in a blocked fashion(instead of usual linear array storage form of binary tree) to avoid on-the-fly tree generation and hence cache flush too.

  • Interleave DMA start signals, and FPGA kernel start signals, as well as DMA and FPGA completion wait times, to ensure pipeline-like operation and least overall execution time.

Impact and significance:

  • The FPGA kernel execution time for a leaf node search in an 8-level tree was 0.43 us, while ARM core took 0.6us for same search.

  • Despite this fast kernel, the FPGA-assisted solution was ~4.6 times SLOWER than pure sequential search by ARM processor (70us vs 15 us), for 6 leaf node searches on the main tree.


  1. Memory Bound problem. There is hardly any computation, that can be parallelized, it is the fast access to data residing in main memory which is critical. A leaf node search in a 16 million node tree, would require 24 sequential comparisons-Nothing computationally intensive to be beaten

  2. ARM runs at 600Mhz, FPGA at 250 Mhz. ARM also enjoys direct cached-access to the main memory (Disabling cache, resulted in ARM taking almost twice as much time)

  3. DMA communication costs from DRAM to FPGA space hurt the FPGA based solution to the maximum (~68% of total execution time) along with the AXI-interconnect latencies incurred in ARM-to-FPGA control signaling (Why cry for this? It’s all in the microseconds range to beat the software, every microsecond of overhead hurts!)

Limiting factors for improving performance:

  1. Currently the FPGA kernel made 2-stage (3 tree element) comparisons in parallel. Doing more stage comparisons in parallel, would make the kernel faster- But this was limited by the dual-ported SRAM of the FPGA, which could service only 2 tree data elements per clock cycle.

  2. Put more kernels in the fabric, to run more searches in parallel- The 6 kernels are limited by 4 General Purpose Ports and 2 High Performance AXI Ports to interface with the ARM processor.

  3. So this is a sequential problem, and speedup is limited by sequential bottleneck (Amdahl’s Law!). Solution- Increase the data set size(Gustafson’s Law!) .Problem? 512MB maximum DRAM in the Zedboard.

But amidst all this, FPGA heterogenous system are just not about speedup right? The critical evaluation parameter should have been speedup/Joule, which is where FPGAs prove beneficial with their very low static runtime power consumption. Accurate power consumption measurement for an embedded heterogenous board like Zedboard was tough, and so this parameter remained un-evaluated.

1 bit adder circuit.

Follow me

© 2016 by Shreedutt C. Hegde



T: +1 412 499 5803


  • Facebook Clean
  • Twitter Clean
  • White Google+ Icon