A RISC-V Based RoCC Convolution Accelerator

Repository: MyConvAccel

Project Goal and Challenges

This project builds a hardware accelerator for a fixed 5×5 convolution over a 32×32 input matrix. The goal is not only to make a standalone Chisel datapath work in simulation, but to turn it into a processor-controlled RoCC accelerator that can be launched from C code running on Rocket.

The main difficulty is that convolution is not just a multiply-add problem. A useful accelerator must also handle sliding-window data reuse, DMA memory traffic, fixed-point arithmetic, backpressure between pipeline stages, and the software-visible control protocol.

The design therefore connects three paths:

Compute path: LineBuffer, ShiftWindow, and a pipelined 5×5 MAC datapath.
Memory path: reading input/kernel data and writing output back to software-visible memory.
Control path: Rocket configures, starts, and polls the accelerator through RoCC custom instructions.

The expected result is a complete flow where the CPU only issues a few custom instructions, while the accelerator performs data movement, window construction, convolution, and result writeback in hardware.

Phase 0: Project Planning & Protocol Design

Before writing any code, we first settled the fundamental questions.

Protocol Design: RoCC Instruction Encoding

At a high level, the RoCC protocol is simple: the CPU passes register values to the accelerator via custom instructions, and the accelerator responds with a result.

Rocket Core (CPU)                      Accelerator
      |                                      |
      |  custom inst (rs1=addr, rs2=data)    |   ← RoCC interface: control
      | -----------------------------------> |
      |                                      |
      |  DMA read / write                    |   ← TileLink bus: data
      | <==================================> |
      |                                      |
      |  resp.data = status / result         |   ← RoCC interface: status
      | <----------------------------------- |

This accelerator uses a three-step flow: set addresses → trigger → poll. The CPU tells the accelerator where the input matrix, kernel, and output reside in memory, fires a start signal, then polls a status register for completion.

funct7	Instruction	rs1	Description
0	SET_ADDR_IN	base addr	Base address of input matrix
1	SET_ADDR_KER	base addr	Base address of kernel
2	SET_ADDR_OUT	base addr	Base address of output matrix
3	START_ACCEL	—	Non-blocking start
4	POLL_STATUS	—	Read status register to rd

The status register packs four bits for software polling:

Bit	Name	Meaning
0	busy	Accelerator is computing
1	done	Computation complete
2	overflow	Accumulator overflow
3	addr_err	Address check failed
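
The set addresses → trigger → poll flow and the two tables above can be sketched from the software side. This is a minimal host-testable sketch, not the project's driver: `rocc_issue_fn` is a hypothetical hook standing in for whatever wraps the custom instruction on Rocket, the return codes are invented for illustration, and `mock_issue` is an invented stand-in for the accelerator that reports busy for two polls.

```c
#include <stdint.h>

/* funct7 values and status bits, as listed in the tables above. */
enum { SET_ADDR_IN = 0, SET_ADDR_KER = 1, SET_ADDR_OUT = 2,
       START_ACCEL = 3, POLL_STATUS = 4 };
enum { ST_BUSY = 1 << 0, ST_DONE = 1 << 1,
       ST_OVERFLOW = 1 << 2, ST_ADDR_ERR = 1 << 3 };

/* Hypothetical hook: issues one custom instruction, returns the rd value. */
typedef uint64_t (*rocc_issue_fn)(unsigned funct7, uint64_t rs1);

/* Illustrative return codes: 0 ok, -1 address error, -2 overflow, -3 timeout. */
int conv_run(rocc_issue_fn issue, uint64_t in, uint64_t ker, uint64_t out,
             unsigned max_polls)
{
    issue(SET_ADDR_IN, in);
    issue(SET_ADDR_KER, ker);
    issue(SET_ADDR_OUT, out);
    issue(START_ACCEL, 0);              /* non-blocking: only means "accepted" */

    for (unsigned i = 0; i < max_polls; i++) {
        uint64_t st = issue(POLL_STATUS, 0);
        if (st & ST_ADDR_ERR) return -1;
        if (st & ST_OVERFLOW) return -2;
        if (st & ST_DONE)     return 0;
    }
    return -3;
}

/* Host-side mock (assumption): busy for two polls after START, then done. */
static unsigned mock_polls_left;
uint64_t mock_issue(unsigned funct7, uint64_t rs1)
{
    (void)rs1;
    if (funct7 == START_ACCEL) { mock_polls_left = 2; return 0; }
    if (funct7 == POLL_STATUS) {
        if (mock_polls_left > 0) { mock_polls_left--; return ST_BUSY; }
        return ST_DONE;
    }
    return 0;
}
```

With the mock, `conv_run(mock_issue, 0x1000, 0x2000, 0x3000, 10)` sees busy twice and then done; capping the poll count at 2 hits the timeout path instead.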

Design Rationale

Why split addresses into three separate instructions?

A single RoCC instruction only carries two source operands (rs1 and rs2), not enough to pass three base addresses at once. The alternative is passing a struct pointer, but that forces the accelerator to issue its own DMA read just to fetch the configuration, adding complexity and latency. Three independent instructions pass one address each, keeping the hardware interface simple.
Why non-blocking START + polling instead of blocking for the result?

Non-blocking lets the CPU issue START_ACCEL and immediately switch to other work. Polling is preferred over interrupts because a 5×5 convolution completes in a short, predictable window: the overhead of an interrupt controller and context switch would outweigh the benefit.
Why contiguous funct7 encoding instead of sparse assignment?

A single funct7 <= 4 check covers all valid instructions, keeping the hardware decoder minimal. Sparse encoding would add combinational logic with no upside.

Key Design Decisions

Is the convolution kernel size fixed or configurable?

Supporting variable kernel sizes increases control logic complexity significantly. A fixed 5×5 hardware datapath is used. Kernels smaller than 5×5 are pre-zero-padded by software.
How to handle boundary pixels?

Zero-padding: no special edge-case logic is needed in the sliding-window FSM.
Fixed-point precision trade-off.
- Why not floating-point? IEEE 754 multipliers are expensive, consuming significant area and pipeline stages. A Q8.8 fixed-point multiplier is just a 16-bit × 16-bit integer multiplier, producing a result in a single cycle with no extra hardware.
- Why not plain integer? Convolution weights are naturally fractional (e.g. 0.125, -0.5). Plain integers can’t represent them. Q8.8 splits 16 bits evenly: 8 bits for the integer part (±128 range), 8 bits for the fractional part (1/256 precision).
- Why a 32-bit accumulator? A 16-bit × 16-bit multiply produces a 32-bit product in Q16.16, so 32 bits is the natural accumulator width, but it does not cover the absolute worst case. With Q8.8 values in [-128, 128), a single product can reach 2^14 in real value, and 25 of them can reach 25 × 2^14 ≈ 2^18.6, whereas a signed Q16.16 accumulator only spans [-2^15, 2^15). The headroom argument therefore rests on the kernel: the sum is bounded by 128 × Σ|w|, which stays in range whenever the absolute weights sum to less than 256 in real value, and typical smoothing or edge kernels sum to a few units at most. Inputs outside that envelope are what the overflow status bit is for.
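
The fixed-point bounds can be checked with a few lines of C. This is a sketch of the arithmetic only, under the assumption that Q8.8 values are stored as raw `int16_t` (value = raw / 256); the helper names are illustrative and do not come from the repository.

```c
#include <stdint.h>

/* Full-precision product of two Q8.8 values; the result is Q16.16. */
int32_t q88_mul(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* Largest possible |sum| of `taps` products, in raw Q16.16 units.
 * Computed in 64 bits so the bound itself cannot overflow. */
int64_t worst_case_sum(int taps)
{
    int64_t max_prod = (int64_t)INT16_MIN * INT16_MIN;   /* 2^30 */
    return taps * max_prod;
}

/* Kernel-specific bound: max |pixel| (32768 raw) times the sum of |w|. */
int64_t kernel_bound(const int16_t *w, int n)
{
    int64_t sum_abs = 0;
    for (int i = 0; i < n; i++)
        sum_abs += (w[i] < 0) ? -(int64_t)w[i] : (int64_t)w[i];
    return 32768 * sum_abs;
}
```

`worst_case_sum(25)` exceeds `INT32_MAX`, while `kernel_bound` for a box filter with 25 weights of raw value 10 (about 0.039 each) stays far below it.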
Why alignment constraints?

TileLink DMA works best when addresses are aligned to the transfer size. An unaligned address forces the bus to split a single transfer into multiple chunks, and the hardware must shift and stitch bytes back together. The accelerator instead validates the configured base addresses before a run starts (the checks are listed under START_ACCEL in Phase 2) and rejects unaligned ones. This keeps the hardware simple and pushes alignment responsibility to software.

Parameter	Value
Input Matrix	32×32
Max Kernel	5×5
Data Format	Q8.8 Fixed Point (16-bit)
Accumulator	32-bit
Output	Same size, zero-padding
Performance Target	<2500 cycles, ≥40× speedup

Phase 1: Architecture Overview

The Problem: Running Convolution on a CPU

A 5×5 convolution over a 32×32 matrix means sliding a 5×5 window across 1024 output positions. Each output pixel requires 25 multiplies and 24 additions: 25,600 multiply-accumulates in total. On a general-purpose CPU, the arithmetic itself is not the bottleneck. The real bottleneck is the sliding. Every slide requires computing address offsets, updating loop counters, loading pixels and weights from memory, and storing results back. Most instructions are spent on loop control and address bookkeeping, not math.

A rough estimate: on a simple in-order RISC-V core, each output pixel costs ~100–150 cycles (two loads, one store, ~50 arithmetic ops, plus loop branches). Across 1024 pixels, that adds up to 100,000–150,000 cycles. In RTL simulation, the accelerator completes the same 32×32 case in 2428 cycles. Against the rough CPU estimate above, that suggests a speedup of roughly 40–60×. A measured bare-metal baseline is left for Phase 8.
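
Working the numbers through, using only the figures quoted above (the per-pixel cost is the rough estimate, not a measurement):

```latex
\begin{align*}
\text{MACs} &= 32 \times 32 \times 25 = 25{,}600 \\
\text{CPU cycles} &\approx 1024 \times 100 = 102{,}400
  \quad\text{to}\quad 1024 \times 150 = 153{,}600 \\
\text{speedup} &\approx \frac{102{,}400}{2428} \approx 42
  \quad\text{to}\quad \frac{153{,}600}{2428} \approx 63
\end{align*}
```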

The Fix: Offload to an Accelerator

The CPU is not built for sliding-window number crunching, so that job is handed to a RoCC accelerator. The accelerator does three things: fetch data from SRAM, run the convolution, and write results back. The CPU’s role shrinks to three steps: set addresses, issue START, and poll.

What does the accelerator look like inside?

The naive approach is a serial pipeline: DMA load → MAC compute → DMA store. Simple, but wasteful: the DMA waits for the MAC, and the MAC waits for the DMA. Only one piece of hardware is active at any moment.

What we actually built overlaps the pieces. An inputQueue deep enough to hold the entire matrix buffers the DMA stream, so LineBuffer can start filling while DMA is still loading later rows. The ConvEngine runs a 6-stage pipeline, producing one result per cycle once full. A storeQueue absorbs results during computation, and DMA drains the queue to write them back as the pipeline drains. The result is partial overlap: input loading overlaps with LineBuffer filling and early compute, while the compute tail overlaps with output writeback, for 2428 cycles end to end.

Module Map

Module	Role	Phase
ConvControl	RoCC decode + 4-state control FSM + status register	2
ConvDMA	7-state DMA engine over TileLink	3
LineBuffer + ConvEngine	Sliding window + MAC pipeline, 6-stage compute datapath	4
InputQueue + StoreQueue	Elastic buffers with backpressure	5

The following phases walk through each module in detail.

Phase 2: ConvControl — Instruction Decode & FSM

Phase 1 drew the architecture. The top line, RoCC control, is what Phase 2 is about. The CPU issues a custom instruction carrying funct7. ConvControl decodes it, executes it, and responds via rd.

Interface

ConvControl talks to the CPU through five signals and a valid/ready handshake: instrCmd.valid / funct7 / rs1 / rd arrive from the CPU, and instrReady goes back. When both valid and ready are high on the same cycle, the instruction fires.

Instruction Decode

Decoding uses five comparators: funct7 === 0.U through funct7 === 4.U. No priority encoder, no lookup table. The contiguous encoding from Phase 0 pays off: funct7 is literally the instruction number.

Not every instruction is welcome at every moment. The rules:

SET (0–2): blocked in sBusy. Changing addresses mid-computation would corrupt the run.
START (3): only accepted in sIdle or sDone. You cannot start an already-running accelerator.
POLL (4): always accepted. It is a pure read: it cannot interfere with anything.

SET_ADDR: Storing Three Base Addresses

Depending on funct7, rs1 is written into addrIn, addrKer, or addrOut. If the accelerator is in sError, any SET clears it back to sIdle.

START_ACCEL: Validate and Fire

Address non-zero check

If any base address (addrIn, addrKer, addrOut) is zero, the state goes to sError.
Address alignment check

Input and output matrices require 8-byte alignment (addrIn(2,0) === 0.U). The kernel requires 2-byte alignment (addrKer(0) === 0.U). The DMA bus is 64 bits wide: each transfer moves 4 pixels. The kernel has only 25 coefficients, so 2-byte alignment is sufficient.

If all checks pass: sIdle → sBusy, and computation begins. If any check fails: sError, and the addr_err bit in the status register goes high.

FSM: Four States

         START pass                done
┌─── sIdle ────────────────► sBusy ─────────────► sDone
│      ▲                       ▲                    │
│      │                       │ START              │ SET
│      │                       └────────────────────┘
│      │
│      │ SET
│      │
│ START fail
│      │
│      ▼
└──► sError

sIdle: reset default. Waits for START.
sBusy: only accepts POLL. Counts down to zero, then raises done and moves to sDone.
sDone: holds done = 1. CPU can re-START (→ sBusy) or SET to reconfigure (→ sIdle).
sError: holds addrErr = 1. Only SET pulls it back to sIdle: no direct path to sBusy.
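
The acceptance rules, the START checks, and the four states can be condensed into a small behavioural model. This is a C sketch of the behaviour described above, not a translation of the Chisel source: the names are illustrative, a rejected command is modelled as a `false` return (in hardware it would simply not fire), a START from sDone is assumed to re-run the same checks, the overflow bit is left out, and `ctrl_finish` stands in for whatever signals completion.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { S_IDLE, S_BUSY, S_DONE, S_ERROR } ctrl_state_t;

typedef struct {
    ctrl_state_t state;
    uint64_t addr_in, addr_ker, addr_out;
} conv_ctrl_t;

/* START checks: non-zero, 8-byte aligned in/out, 2-byte aligned kernel. */
bool ctrl_addrs_ok(const conv_ctrl_t *c)
{
    if (c->addr_in == 0 || c->addr_ker == 0 || c->addr_out == 0) return false;
    return (c->addr_in & 7) == 0 && (c->addr_out & 7) == 0
        && (c->addr_ker & 1) == 0;
}

/* Status word: bit 0 busy, bit 1 done, bit 3 addr_err. */
uint64_t ctrl_status(const conv_ctrl_t *c)
{
    return (uint64_t)(c->state == S_BUSY)
         | (uint64_t)(c->state == S_DONE)  << 1
         | (uint64_t)(c->state == S_ERROR) << 3;
}

/* Apply one command; returns true if it is accepted in the current state. */
bool ctrl_cmd(conv_ctrl_t *c, unsigned funct7, uint64_t rs1)
{
    if (funct7 > 4) return false;              /* single range check */
    if (funct7 <= 2) {                         /* SET_ADDR_* */
        if (c->state == S_BUSY) return false;
        if (funct7 == 0)      c->addr_in  = rs1;
        else if (funct7 == 1) c->addr_ker = rs1;
        else                  c->addr_out = rs1;
        c->state = S_IDLE;                     /* sDone and sError return to sIdle */
        return true;
    }
    if (funct7 == 3) {                         /* START_ACCEL */
        if (c->state != S_IDLE && c->state != S_DONE) return false;
        c->state = ctrl_addrs_ok(c) ? S_BUSY : S_ERROR;
        return true;
    }
    return true;                               /* POLL_STATUS: always accepted */
}

/* Stand-in for the end of computation: sBusy -> sDone. */
void ctrl_finish(conv_ctrl_t *c)
{
    if (c->state == S_BUSY) c->state = S_DONE;
}
```

A START with unset (zero) addresses lands in sError with status 8, a second START is refused there, and any SET brings the model back to sIdle.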

Phase 3: ConvDMA — TileLink DMA Engine

Key idea: ConvDMA turns memory operations into a valid/ready data stream. The serial version is easy to verify; the pipelined version improves throughput by overlapping request issue with response unpacking.

The convolution engine constantly reads pixels and weights from SRAM, then writes results back. A dedicated DMA engine manages these transfers. Phase 3 starts with a strictly serial DMA, one request in flight at a time. Once verified, we add pipelining so multiple requests can be issued without waiting for the prior response.

DMA Interface

Three Bundles connect the DMA to the L1 data cache: signals only, no logic. Valid/ready handshakes come from Decoupled.

SimpleMemReq: DMA’s request to L1. Carries 64-bit address, 64-bit write data, byte mask, a read/write flag, and a 4-bit tag. Tag is 0 in the serial DMA; reserved so a pipelined version can match out-of-order responses.
SimpleMemResp: L1’s reply. Returns 64-bit read data and echoes the request’s tag.
SimpleMemIO: bundles req (DMA→L1) and resp (L1→DMA, Flipped) into one port.

  ConvDMA                  SimpleMemIO            L1 data cache
┌─────────┐              ┌─────────────┐          ┌──────────┐
│         │── req ──────►│ req (output) │────────►│          │
│  FSM    │              │             │          │  L1 D$   │
│         │◄─ resp ──────│ resp (input) │◄────────│          │
└─────────┘              └─────────────┘          └──────────┘

Serial FSM

ConvDMA moves data between SRAM and the compute unit through two paths:

Load path (sIdle → sIssue → sWaitResp → sUnpack → loop): reads 64-bit words from memory, unpacks each into four 16-bit elements, pushes them into elemQueue.
Store path (sIdle → sGather → sIssue → loop): collects four 16-bit elements, packs them into a 64-bit word, writes it to memory.

Both paths share sIssue, branching on opReg. At most one request is in flight at any time.

Load path:   sIdle → sIssue → sWaitResp → sUnpack(×4) ─┐
              ▲                                        │
              └────────────────────────────────────────┘

Store path:  sIdle → sGather(×4) → sIssue ─┐
             ▲                             │
             └─────────────────────────────┘

Error path:  sIdle ──► sError  (SET addr ⇒ sIdle)

State	Responsibility	Path
sIdle	Accept command, latch baseAddr and length, check alignment	common
sIssue	Fire mem.req (read or write, depending on opReg)	shared
sWaitResp	Wait for mem.resp.valid, latch response data	load
sUnpack	4 cycles: slice 64-bit word into 4 × 16-bit elements, push into elemQueue	load
sGather	4 cycles: collect 4 × 16-bit elements from elemQueue, pack into 64-bit word	store
sDone	Transfer complete, waiting for upper layer to consume result	common
sError	Address misaligned, waiting for a valid address	common
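
What sUnpack and sGather do to a word can be written down directly. The sketch below assumes element 0 occupies bits [15:0] of the 64-bit word (little-endian element order); the text does not state the ordering, so treat that as an assumption. The function names are illustrative.

```c
#include <stdint.h>

/* sUnpack: slice one 64-bit word into four 16-bit elements.
 * Assumption: element i lives in bits [16*i+15 : 16*i]. */
void unpack_word(uint64_t word, int16_t elems[4])
{
    for (int i = 0; i < 4; i++)
        elems[i] = (int16_t)(uint16_t)(word >> (16 * i));
}

/* sGather: pack four 16-bit elements back into one 64-bit word. */
uint64_t pack_word(const int16_t elems[4])
{
    uint64_t word = 0;
    for (int i = 0; i < 4; i++)
        word |= (uint64_t)(uint16_t)elems[i] << (16 * i);
    return word;
}
```

Packing the result of an unpack returns the original word, which is a convenient round-trip check for a scratchpad testbench.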

Bottleneck: 6 cycles/word

Take the load path. One word goes through sIssue → sWaitResp → sUnpack×4 → back to sIssue: exactly 6 cycles. Memory sits idle for 4 of those 6 cycles during unpack. The bottleneck is not memory latency, it’s the FSM refusing to overlap issue with unpack.

cyc |     state |     mem.req    |   mem.resp   | loadStream
  0 |    sIssue | fire(rd)       |              |
  1 | sWaitResp |                | fire         |
  2 |   sUnpack |                |              |
  3 |   sUnpack |                |              | deq elem[0]
  4 |   sUnpack |                |              | deq elem[1]
  5 |   sUnpack |                |              | deq elem[2]
  6 |    sIssue | fire(rd)       |              | deq elem[3]

256 words × 6 + 1 cycle overhead = 1537 cycles. The next section fixes this.

Pipelined DMA

The core idea: decouple issue and unpack into two concurrent hardware processes.

Serial:
  Issue:    [sIssue]              [sIssue]              [sIssue]
  Response:    [sWait]               [sWait]               [sWait]
  Unpack:         [sUnpack×4]            [sUnpack×4]            [sUnpack×4]
             ↑── 6 cycles/word ──↑

Pipelined:
  Issue:    [sIssue][sIssue][sIssue][sIssue][sIssue]...
  Response:    [1 cycle]  [1 cycle]  [1 cycle]  [1 cycle]  [1 cycle]  ← enqueue FIFO
  Unpack:         [sUnpack×4][sUnpack×4][sUnpack×4]...               ← dequeue FIFO
             ↑── 4 cycles/word (unpack is the bottleneck) ──↑

A response FIFO absorbs the rate mismatch: the issue engine fires at 1 word/cycle (up to the inflight limit), responses drop into the FIFO automatically, and the unpack engine drains the FIFO at 4 cycles/word. Neither blocks the other. 1537 cycles → ~1033 cycles, a 33% improvement. Step-by-step implementation details are in a follow-up post.

Changes from Serial to Pipelined

1. Register → Queue

The serial DMA uses a single respWord register to latch one response at a time. The pipelined version replaces it with a response FIFO. Data drops into the FIFO automatically; the FSM only reads from the FIFO when it needs data.

2. The response wait disappears

The FSM no longer waits for mem.resp.valid or latches response data itself. All that is left of sWaitResp is waiting until the FIFO has a word to pop.

3. inflightCount: credit-based flow control

The issue engine fires faster (1 word/cycle) than the unpack engine consumes (1 element/cycle → 0.25 word/cycle). Without a cap, the FIFO overflows. inflightCount acts as a credit account: issue spends a credit per request, unpack returns a credit per word consumed. inflightMax = 4 caps the window at 4 in-flight words.

4. sLoadActive: concurrent FSM

The three serial states sIssue, sWaitResp, and sUnpack merge into a single sLoadActive state. Inside, two independent when blocks run concurrently: the issue engine fires requests up to inflightMax, and the unpack engine drains the FIFO. Neither engine waits for the other.

A trace of the first ten cycles:

cyc |      issue engine       |      unpack engine      | inflight | what happened
  0 | fire read → addr 0x1000 |                         |    1     | burst starts
  1 | fire read → addr 0x1008 |                         |    2     |
  2 | fire read → addr 0x1010 | deq elem[0] from word 0 |    3     | unpack begins, both active
  3 | fire read → addr 0x1018 | deq elem[1]             |    4     | inflight window full
  4 |           —             | deq elem[2]             |    4     | issue blocked
  5 | fire read → addr 0x1020 | deq elem[3] (last)      |    4     | both fire: +1/−1 cancel
  6 |           —             | deq elem[4] from word 1 |    4     | inflight full again
  7 |           —             | deq elem[5]             |    4     |
  8 |           —             | deq elem[6]             |    4     |
  9 | fire read → addr 0x1028 | deq elem[7] (last)      |    4     | credit returned, issue fires

Key observations:

Cycles 0–3: The issue engine fires four read requests back-to-back, with unpack joining at cycle 2. By cycle 3, inflight hits the cap of 4.
Cycles 5 and 9: Both engines advance in the same cycle. Issue fires (+1) while unpack finishes a word (−1). The net change to inflightCount is zero.
Cycles 4 and 6–8: Issue is blocked because inflightCount == inflightMax.

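The only shared state is inflightCount, and both engines update it. Two separate assignments (one incrementing, one decrementing) would conflict in a cycle where both fire, because the later assignment overrides the earlier one and an update is lost. The fix is a single expression: inflightCount + issueFired − unpackWordDone. Both boolean events feed one arithmetic operation: +1 and −1 cancel cleanly when they coincide, and neither update is lost.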
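
The credit mechanism can be replayed in software to reproduce the trace above. This is a cycle-level C sketch, not the RTL, and it rests on two assumptions read off the trace: a word issued in cycle t can be unpacked from cycle t+2, and the issue engine may use a credit that is returned in the same cycle. All names are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

#define INFLIGHT_MAX 4

/* The single update expression: both events feed one add/subtract. */
unsigned next_inflight(unsigned count, bool issue_fired, bool unpack_word_done)
{
    return count + (issue_fired ? 1u : 0u) - (unpack_word_done ? 1u : 0u);
}

/* Simulates `cycles` cycles of the load path for `total_words` words.
 * issued[t]   : whether a read request fired in cycle t
 * inflight[t] : inflightCount after cycle t */
void simulate_load(unsigned cycles, unsigned total_words,
                   bool issued[], unsigned inflight[])
{
    unsigned count = 0, words_issued = 0, words_done = 0, elem = 0;
    unsigned issue_cycle[INFLIGHT_MAX] = {0};   /* ring: at most 4 in flight */

    for (unsigned t = 0; t < cycles; t++) {
        /* Unpack engine: one element per cycle once the head word arrived. */
        bool head_ready = words_done < words_issued &&
                          issue_cycle[words_done % INFLIGHT_MAX] + 2 <= t;
        bool word_done = false;
        if (head_ready) {
            elem++;
            if (elem == 4) { elem = 0; word_done = true; }
        }

        /* Issue engine: needs a free credit, or one returned this cycle. */
        bool fire = words_issued < total_words &&
                    (count < INFLIGHT_MAX || word_done);
        if (fire) {
            issue_cycle[words_issued % INFLIGHT_MAX] = t;
            words_issued++;
        }
        if (word_done) words_done++;

        count = next_inflight(count, fire, word_done);
        issued[t] = fire;
        inflight[t] = count;
    }
}
```

Running it for ten cycles gives requests in cycles 0, 1, 2, 3, 5 and 9 and an inflight count that climbs 1, 2, 3, 4 and then stays at 4, matching the table.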

Phase 4: Compute Datapath — LineBuffer & ConvEngine

Key idea: DMA provides a row-major pixel stream, while convolution needs a 5×5 window. LineBuffer reuses rows, ShiftWindow assembles columns, and ConvUnit runs the pipelined MAC tree.

DMA delivers pixels one at a time in row-major order: row 0 left to right, then row 1, then row 2. But a 5×5 convolution at output pixel (r, c) needs 25 pixels arranged as a 5×5 neighbourhood centred at (r, c), which is (r+2, c+2) in the zero-padded image. A single pixel from the DMA is useless on its own: the compute datapath must assemble an entire 5×5 window before the MAC unit can start.

This is a data reshape problem. It splits naturally into two dimensions, so two modules are paired: LineBuffer handles the vertical direction (rows), ShiftWindow handles the horizontal direction (columns). LineBuffer collects 5 rows of 32 pixels from the DMA stream, then outputs one column of 5 vertically adjacent pixels per cycle. ShiftWindow buffers 5 consecutive columns from LineBuffer, shifting right each cycle, and outputs a full 5×5 window to the MAC unit.

Part A: LineBuffer

Why not read directly from SRAM? Each output pixel needs values from 5 different rows. Fetching them directly would require 5 independent reads per cycle targeting 5 different addresses: five memory ports. LineBuffer replaces that with one write port (DMA feeds in one pixel per cycle) and one 5-wide read port (5 rows × same column). A 160-entry register file is far cheaper than a 5-port SRAM.

sIdle ──► sPrime (load 5 rows) ──► sActive (32 output rows) ──► sDone

In sActive, while the buffer emits 36 columns per row, DMA loads the next input row into a separate tmpRow register. When the row ends, the buffer shifts up: row 0 is discarded, rows 1..4 move up, and tmpRow enters as row 4. Without tmpRow, DMA would overwrite rows still being output, and load and output could not overlap.

Zero-padding covers all four image borders through two mechanisms:

Top / bottom: Handled by what the five buffer rows hold. At output row 0, the top two slots are zero; at output row 31, the bottom two are zero. As the window slides down, real rows rotate in and out: no extra control logic.
Left / right: Each output row emits 36 columns (2 padding + 32 data + 2 padding). The colValid signal marks data columns. When low, ShiftWindow fills zeros regardless of colOut.
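
The two padding mechanisms can be expressed as one function that says what LineBuffer emits at each position of the 36-column stream. This is a behavioural C sketch with illustrative names, assuming the image is stored row-major as a flat array; it models what comes out, not the register rotation that produces it.

```c
#include <stdint.h>
#include <stdbool.h>

#define N   32   /* image is N x N */
#define PAD 2    /* (5 - 1) / 2 */

/* Column emitted for output row r at stream position j (0 .. N + 2*PAD - 1).
 * col[k] is the pixel from image row r - PAD + k; rows outside the image
 * read as zero (top/bottom padding).
 * Returns colValid: false for the 2 + 2 padding columns, where ShiftWindow
 * substitutes zeros regardless of col[]. */
bool linebuffer_column(const int16_t *img, int r, int j, int16_t col[5])
{
    int c = j - PAD;                          /* image column, may be out of range */
    bool col_valid = (c >= 0 && c < N);

    for (int k = 0; k < 5; k++) {
        int row = r - PAD + k;
        bool row_valid = (row >= 0 && row < N);
        col[k] = (row_valid && col_valid) ? img[row * N + c] : 0;
    }
    return col_valid;
}
```

For output row 0 the first two entries of every column are zero, and for output row 31 the last two are, which is the top/bottom rule stated above.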

Part B: ShiftWindow → KernelROM → ConvUnit

LineBuffer outputs one 5-pixel column per cycle. The MAC unit needs a full 5×5 window. Three modules bridge the gap:

ShiftWindow: a 5×5 register array. Each cycle, all columns shift right by one: the oldest (c4) is dropped, and the new column from LineBuffer enters c0. When colValid is low, c0 gets zeros instead, implementing left/right padding. The whole window is exposed combinationally: 400 bits of registers, cheaper than BRAM and with zero read latency.

KernelROM: weight storage. 25-entry register file. Weights are written once before computation starts and remain read-only throughout. Combinational 5×5 output with zero latency, so ConvUnit receives both window and kernel operands in the same cycle.

ConvUnit: 5-stage MAC pipeline. A single combinational multiply-accumulate chain (25 multiplies + tree reduction) would have a critical path too long to meet timing. The fix: slice the pairwise addition tree into 5 pipeline stages, each doing one 32-bit addition.

Stage 0 (combinational): 25 parallel 16×16→32 multiplies
Stage 1–5 (registered):  25→13→7→4→2→1  pairwise reduction
                          Stage 5 also does rounding (+0x80), >>8, saturate to 16-bit

Why pairwise instead of Wallace tree? The binary tree is regular, depth is exactly ceil(log₂ 25) = 5 levels, and registers slip naturally between levels: critical path becomes a single 32-bit add.
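
The stage structure can be mirrored in a bit-accurate software model, which is also handy as a golden reference for the testbench. This is a C sketch with illustrative names; unlike the hardware it sums in 64 bits, so the model itself never overflows, and it counts reduction levels instead of registering them.

```c
#include <stdint.h>

/* Floor shift right by 8, written without relying on the
 * implementation-defined >> of negative values. */
static int64_t asr8(int64_t x)
{
    return (x >= 0) ? (x >> 8) : -((-x + 255) >> 8);
}

/* Returns the Q8.8 result; *levels receives the number of reduction levels. */
int16_t conv_unit(const int16_t win[25], const int16_t ker[25], int *levels)
{
    int64_t v[25];
    int n = 25;

    /* Stage 0: 25 parallel 16x16 -> 32 multiplies (Q16.16 products). */
    for (int i = 0; i < 25; i++)
        v[i] = (int32_t)win[i] * (int32_t)ker[i];

    /* Stages 1..5: pairwise reduction 25 -> 13 -> 7 -> 4 -> 2 -> 1. */
    *levels = 0;
    while (n > 1) {
        int m = (n + 1) / 2;
        for (int i = 0; i < n / 2; i++)
            v[i] = v[2 * i] + v[2 * i + 1];
        if (n & 1)
            v[m - 1] = v[n - 1];        /* odd element passes through */
        n = m;
        (*levels)++;
    }

    /* Final stage: round (+0x80), drop 8 fraction bits, saturate to 16 bits. */
    int64_t r = asr8(v[0] + 0x80);
    if (r > INT16_MAX) r = INT16_MAX;
    if (r < INT16_MIN) r = INT16_MIN;
    return (int16_t)r;
}
```

An identity kernel (centre weight 1.0, raw 256) returns the centre pixel, and a window of 1.0 against weights of 0.5 returns raw 3200, which is 12.5 in Q8.8.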

ConvEngine: top-level glue. Instantiates ShiftWindow, KernelROM, and ConvUnit, connecting colIn / colValid to ShiftWindow and kernel / window to ConvUnit. inValid is delayed one cycle (RegNext) to align with ShiftWindow’s registered output. A stall input gates colValid to freeze the entire pipeline under backpressure. outValid is inValid delayed 5 cycles via ShiftRegister: when high, result holds a valid convolution output.

Phase 5: Top-Level Integration & Master FSM

Key idea: The top-level FSM controls phases, not individual pixels. Once a phase is enabled, queues and valid/ready handshakes decide when data actually moves.

Phase 5 wires the modules from Phases 3–4 into a single ConvAccelTop. It has three jobs:

Instantiate ConvDMA, LineBuffer, ConvEngine, and two elastic Queues in one module.
Add a 5-state execution FSM that orchestrates the three phases (load kernel, load input, compute+store).
Expose a SimpleMemIO port to the outside, connecting to a simulated scratchpad for testing or HellaCache for Chipyard integration.

Note on ConvControl. ConvControl (Phase 2) handles RoCC instruction decode and the 4-state control FSM (sIdle/sBusy/sDone/sError). In the standalone test setup, ConvAccelTop uses its own built-in execution FSM with a simple start/done handshake. ConvControl will be re-integrated as the RoCC-facing wrapper in Phase 6.

ConvAccelTop: Skeleton & IO

ConvAccelTop is a standalone Module with a simple start/done handshake plus three address input ports. The SimpleMemIO bundle carries all memory traffic to a simulated scratchpad or, later, to HellaCache.

                ┌────────────────────────────────────────────┐
         start ─┤                                            ├─ done
    kernelAddr ─┤                                            ├─ state[2:0]
     inputAddr ─┤               ConvAccelTop                 │
    outputAddr ─┤                                            ├─ mem.req.valid
                │                                            ├─ mem.req.bits.addr
                │                                            ├─ mem.req.bits.op
                │                                            ├─ mem.req.bits.data
                │                                            │
mem.resp.valid ─┤                                            │
 mem.resp.data ─┤                                            │
                └────────────────────────────────────────────┘

Signal	Width	Direction	Role
start	1	Input	Assert to launch a convolution run
kernelAddr	64	Input	Base address of 5×5 kernel in SRAM
inputAddr	64	Input	Base address of 32×32 input image
outputAddr	64	Input	Base address for 32×32 output
mem.req	—	Output	Memory read/write request (valid, addr, op, data)
mem.resp	—	Input	Memory response (valid, data), driven by testbench scratchpad
done	1	Output	High when convolution completes (FSM reaches sDone)
state	3	Output	Current FSM state

The three address ports are sampled on the start pulse and held in internal registers: this prevents external changes from corrupting an in-progress run.

Submodule Instantiation & Wiring

Five submodules are instantiated: three built in earlier phases, plus two Chisel built-in Queues.

val dma        = Module(new ConvDMA)                        // Phase 3
val lineBuf    = Module(new LineBuffer)                     // Phase 4
val engine     = Module(new ConvEngine)                     // Phase 4
val storeQueue = Module(new Queue(SInt(16.W), 2048))       // Chisel built-in FIFO
val inputQueue = Module(new Queue(UInt(16.W), 1024))       // Chisel built-in FIFO

Queue is Chisel’s standard FIFO: it manages read/write pointers internally and applies backpressure automatically when full or empty.

1. io.mem ↔ DMA

io.mem <> dma.io.mem

<> is Chisel’s bulk-connection operator. Both io.mem and dma.io.mem are SimpleMemIO bundles, each containing multiple signals (req.valid, req.bits.addr, resp.bits.data, etc.). <> connects every like-named signal in one line.

2. DMA loadStream fanout

DMA reads data back through a single loadStream port, but the data heads to two consumers depending on FSM state:

sLoadKernel: loadStream → engine.io.kernelData, writing 25 weights into the kernel ROM.
sLoadInput: loadStream → inputQueue.io.enq, buffering all 1024 pixels.

Only one state is active at a time, so a when / elsewhen branch suffices: no arbiter is needed.

3. Compute pipeline (three daisy chains)

Three segments, each using standard valid/ready handshakes:

inputQueue → LineBuffer: inputQueue.io.deq connects to lineBuf.io.in. Data advances only when the Queue has data (deq.valid) and LineBuffer is ready (in.ready). This path is only active during sLoadInput and sCompute.
LineBuffer → ConvEngine: lineBuf.io.colOut feeds engine.io.colIn. colValid carries an extra condition: when engine.stall is asserted, colValid drops, freezing the ConvEngine pipeline.
ConvEngine → storeQueue: engine.io.outValid drives storeQueue.io.enq.valid. Each result is pushed into the output queue.

4. storeQueue → DMA (writeback path)

dma.io.storeStream.valid := storeQueue.io.deq.valid
storeQueue.io.deq.ready  := dma.io.storeStream.ready

DMA reads results from storeQueue and writes them back to memory. When DMA is mid-burst and cannot accept more data, storeStream.ready drops: the Queue stops dequeuing, and backpressure propagates all the way up the pipeline.

InputQueue & StoreQueue: Elastic Buffers

DMA and ConvEngine work at different rhythms. DMA transfers in bursts: fast but irregular. ConvEngine produces and consumes one pixel per cycle: steady but inflexible. Without buffering, every speed mismatch would stall the pipeline or drop data.

A Queue is a standard FIFO with a ring buffer, read/write pointers, and a fill counter. It exposes two ports, enq (write side) and deq (read side), and manages the valid/ready handshake automatically:

Queue is empty: deq.valid = 0 (no data to read).
Queue is full: enq.ready = 0 (no room to write).
Queue is neither: both enq.ready and deq.valid are 1, and data can flow in and out simultaneously.
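
The three cases above follow from a few lines of state. The sketch below is a plain C model of a ring-buffer FIFO, not Chisel's Queue implementation: depth is shrunk to 4 for illustration, names are invented, and readiness is evaluated from the state at the start of the cycle, so a full queue does not accept a new element in the same cycle it dequeues one.

```c
#include <stdint.h>
#include <stdbool.h>

#define QDEPTH 4   /* tiny depth for illustration; the design uses 1024 and 2048 */

typedef struct {
    int16_t buf[QDEPTH];
    unsigned head, tail, count;
} fifo_t;

bool fifo_enq_ready(const fifo_t *q) { return q->count < QDEPTH; }  /* not full  */
bool fifo_deq_valid(const fifo_t *q) { return q->count > 0; }       /* not empty */

/* One clock cycle: a port transfers only when its valid and ready are both high.
 * Returns true if an element was dequeued into *out. */
bool fifo_cycle(fifo_t *q, bool enq_valid, int16_t in,
                bool deq_ready, int16_t *out)
{
    bool do_enq = enq_valid && fifo_enq_ready(q);
    bool do_deq = deq_ready && fifo_deq_valid(q);

    if (do_deq) {
        *out = q->buf[q->head];
        q->head = (q->head + 1) % QDEPTH;
        q->count--;
    }
    if (do_enq) {
        q->buf[q->tail] = in;
        q->tail = (q->tail + 1) % QDEPTH;
        q->count++;
    }
    return do_deq;
}
```

Filling the model and then offering one more element shows the backpressure case: the enqueue is refused while the dequeue on the other port still goes through.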

inputQueue

val inputQueue = Module(new Queue(UInt(16.W), 1024))

Depth 1024 = one full 32×32 image. DMA fills the queue during sLoadInput, while LineBuffer may already begin draining pixels from it. After the input DMA finishes, any remaining pixels continue draining during sCompute.

storeQueue

val storeQueue = Module(new Queue(SInt(16.W), 2048))

Depth 2048 = 1088 results + 960 slots of headroom. ConvEngine pushes one per cycle; DMA drains the queue and bursts results to memory. When DMA is busy, the queue absorbs the slack until DMA catches up.

Backpressure chain

Backpressure in hardware is a direct wire, not a message-passing protocol. When storeQueue fills up, its internal enq.ready drops from 1 to 0. Two modules are wired directly to this signal:

engine.io.stall  := !storeQueue.io.enq.ready
lineBuf.io.stall := !storeQueue.io.enq.ready

The moment enq.ready falls, both modules see it in the same cycle. From there the stall propagates upstream through the queues:

storeQueue full
→ storeQueue.io.enq.ready = 0
→ ConvEngine stalls (outValid has nowhere to go)
→ LineBuffer stalls (no new window consumed)
→ inputQueue stops draining, fills up
→ inputQueue.io.enq.ready = 0
→ DMA load stream stalls

When DMA catches up and frees a slot, enq.ready returns to 1, and the pipeline restarts on its own. No extra protocol, no notification, no software.

Master Execution FSM

The top-level FSM does not schedule individual pixels or convolution windows cycle by cycle. Instead, it enables a group of modules for each phase and lets the valid/ready handshakes move data through the pipeline whenever both sides are ready. In other words, it controls the phase, not every micro-operation.

The five states and their transitions:

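// Inside ConvAccelTop; needs import chisel3._ and chisel3.util._ (for Enum).
// state is the FSM's state register.
val sIdle :: sLoadKernel :: sLoadInput :: sCompute :: sDone :: Nil = Enum(5)

// One transition condition per phase boundary
val goLoadKernel = (state === sIdle)       && io.start
val goLoadInput  = (state === sLoadKernel) && dma.io.done
val goCompute    = (state === sLoadInput)  && dma.io.done
val goDone       = (state === sCompute)    && (resultCnt >= 1088.U) && dma.io.done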

sLoadKernel: DMA load stream is routed to ConvEngine’s kernel write port. Each valid loadStream element writes one kernel weight. After all 25 weights are loaded, DMA asserts done, and the FSM advances to sLoadInput.
sLoadInput: DMA load stream is routed into InputQueue. At the same time, the queue begins draining into LineBuffer: input loading and line-buffer filling overlap. Once LineBuffer has enough pixels to form valid windows, it asserts valid output toward ConvEngine. The compute pipeline is already running while the input DMA is still loading later pixels.
sCompute: Input DMA has completed, and the DMA command switches to store mode. The remaining data in the pipeline continues to drain. Meanwhile, StoreQueue feeds the DMA store stream: compute tail and output writeback overlap.

sLoadKernel:  DMA load kernel → ConvEngine kernel ROM

sLoadInput:   DMA load input → InputQueue → LineBuffer → ConvEngine → StoreQueue

sCompute:     InputQueue → LineBuffer → ConvEngine → StoreQueue → DMA store output
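
The phase sequencing is small enough to model as a pure next-state function. This is a C sketch of the transitions listed above with illustrative names; how the FSM leaves sDone is not described in the text, so the model simply stays there.

```c
#include <stdbool.h>

typedef enum { M_IDLE, M_LOAD_KERNEL, M_LOAD_INPUT, M_COMPUTE, M_DONE } top_state_t;

#define RESULT_TARGET 1088u   /* result count used in the goDone condition */

/* Next state of the master FSM for one clock cycle. */
top_state_t top_next(top_state_t s, bool start, bool dma_done, unsigned result_cnt)
{
    switch (s) {
    case M_IDLE:        return start    ? M_LOAD_KERNEL : M_IDLE;
    case M_LOAD_KERNEL: return dma_done ? M_LOAD_INPUT  : M_LOAD_KERNEL;
    case M_LOAD_INPUT:  return dma_done ? M_COMPUTE     : M_LOAD_INPUT;
    case M_COMPUTE:
        return (dma_done && result_cnt >= RESULT_TARGET) ? M_DONE : M_COMPUTE;
    case M_DONE:        return M_DONE;   /* exit from sDone not specified */
    }
    return s;
}

/* Busy whenever a run is in progress: neither idle nor done. */
bool top_busy(top_state_t s)
{
    return s != M_IDLE && s != M_DONE;
}
```

Note that sCompute needs both conditions: a finished store DMA with only 1087 results counted keeps the FSM in sCompute.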

End-to-End Data Flow Walkthrough

The full lifecycle of one convolution run, aligned in time:

time ───────────────────────────────────────────────────────────────────────────────>

io.start          ┌─┐
                  └─┘

state             sIdle ──> sLoadKernel ──> sLoadInput ───────────> sCompute ──> sDone

dma.cmd                    load_kernel      load_input              store_output

dma.loadStream             [ kernel data ]  [ input pixels ........ ]              idle

InputQueue.enq                              [ input pixels ........ ]              idle
InputQueue.deq                                   [ pixels -> LineBuffer ........... ][drain]

LineBuffer                                            [ initial fill ][ colValid active ........ ][drain]

ConvEngine                                                       [ compute valid windows .... ][drain]

StoreQueue.enq                                                     [ results ............... ]
StoreQueue.deq                                                                  [ results -> DMA .... ]

dma.storeStream                                                                 [ output results .... ]

io.done                                                                                              ┌──
                                                                                                     └──

Three overlaps account for the performance gain:

Overlap	What happens	When
Load / compute	DMA fills InputQueue while LineBuffer drains it	sLoadInput
Compute / store	ConvEngine produces results while DMA drains StoreQueue	sCompute
Pipeline drain	DMA load done, compute still flushing residual data	sCompute tail

RoCC Response Protocol Preparation

This section describes how the top-level signals will connect to the RoCC interface during Phase 6 Chipyard integration. The current standalone ConvAccelTop uses a simple start/done handshake: the RoCC io.cmd/io.resp protocol is not yet wired in.

The instruction encoding and software-visible status fields were defined in Phase 0 and implemented in ConvControl (Phase 2). At the top level, the integration work is to connect ConvAccelTop’s execution signals to the RoCC response channel.

A RoCC response is considered transferred when io.resp.valid && io.resp.ready are both high, carrying:

io.resp.bits.rd    // rd from the original command
io.resp.bits.data  // acknowledgement or status value

Three response patterns:

SET_ADDR_*: Respond immediately after the address register is updated. Returned data is an acknowledgement value: the command only changes configuration state.
START_ACCEL: Respond immediately, but this only means the accelerator accepted the start request, not that convolution has finished. After acceptance, the master FSM enters the active states and io.busy remains high until the run completes.
POLL_STATUS: Software uses this to observe completion. Response data comes from the status register (busy, done, overflow, addr_err bits).
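
On the software side, the POLL_STATUS response can be decoded with a few masks. In the sketch below only the done position (bit 1) comes from the benchmark's polling loop, which tests status & 0x2; the positions of busy, overflow, and addr_err, and all helper names, are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 1 = done is taken from the polling loop (status & 0x2).
 * The other bit positions are assumed, not confirmed by the text. */
#define CONV_STATUS_BUSY     (1u << 0)  /* assumed position */
#define CONV_STATUS_DONE     (1u << 1)  /* from the text    */
#define CONV_STATUS_OVERFLOW (1u << 2)  /* assumed position */
#define CONV_STATUS_ADDR_ERR (1u << 3)  /* assumed position */

/* True once the accelerator reports completion. */
bool conv_status_done(uint64_t status) {
    return (status & CONV_STATUS_DONE) != 0;
}

/* True while a run is in flight (under the assumed bit layout). */
bool conv_status_busy(uint64_t status) {
    return (status & CONV_STATUS_BUSY) != 0;
}

/* True when the run finished and neither error flag is set. */
bool conv_status_ok(uint64_t status) {
    uint64_t errors = CONV_STATUS_OVERFLOW | CONV_STATUS_ADDR_ERR;
    return conv_status_done(status) && (status & errors) == 0;
}
```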

At the top level, io.busy is driven by the master FSM:

io.busy := state =/= sIdle && state =/= sDone

The current design uses polling rather than interrupts, so io.interrupt stays low. If interrupt support is added later, it can be asserted when the FSM enters sDone.

Debug: Pipeline Drain & tmpRow Corruption

These two bugs are kept here because they capture the most important timing lesson in this project: putting data in the right register is not enough; the valid signal must be aligned with that data all the way through the pipeline.

Bug 1: colValid Shuts Off 2 Cycles Too Early

The data path.

LineBuffer outputs 36 columns per row. colValid controls whether ShiftWindow shifts in real data or zeros. ShiftWindow is a 5-column register array: new columns enter at reg(0), old columns exit at reg(4), and the window center is fixed at reg(2).

When colValid = true: the entire window shifts right, colIn → reg(0), reg(4) is discarded.
When colValid = false: the window still shifts right, but 0 → reg(0) instead.

val inImage = outputCol >= 2.U && outputCol <= 33.U // 32 image columns

inImage serves double duty: it controls both colOut (which data to output) and colValid (whether to mark it valid). The root cause sits right here: these two things should not be tied to one signal.

Tracing the 2-cycle offset.

Data takes 2 cycles to slide from reg(0) to the window center at reg(2):

outputCol=2:  reg = [img_0,      0,      0,      0,     0   ]  center=0       colValid=true
outputCol=3:  reg = [img_1,  img_0,      0,      0,     0   ]  center=0       colValid=true
outputCol=4:  reg = [img_2,  img_1,  img_0,      0,     0   ]  center=img_0   colValid=true
  ...
outputCol=33: reg = [img_31, img_30, img_29, img_28, img_27]  center=img_29  colValid=true
outputCol=34: reg = [0,      img_31, img_30, img_29, img_28]  center=img_30  colValid=false ✗
outputCol=35: reg = [0,      0,      img_31, img_30, img_29]  center=img_31  colValid=false ✗

The original colValid logic:

when (inImage) {
  io.colValid := true.B          // outputCol 2..33 → true
}.otherwise {
  io.colValid := false.B         // outputCol 0-1, 34-35 → false
}

colValid drops right after outputCol=33, that is, at outputCol=34. But img_30 and img_31 are still in the window, sliding toward reg(2). When they reach the center, colValid is already false: the results are computed but never marked valid. 2 lost outputs per row × 32 rows = 64 lost outputs.

Root cause.

colValid = inImage conflates two things: whether colOut holds image data vs. whether the window center holds image data. At outputCol=34-35, colOut is zero (correct, these are padding columns), but img_30 and img_31 only reach reg(2) in those two cycles, on their way into the MAC pipeline. colValid should stay high until those pixels drain.

Fix: extend colValid 2 cycles to drain the pipeline.

}.otherwise {
  // colOut set to zero, not read from buffer
  io.colOut   := VecInit.fill(5)(0.S(16.W))
  io.colValid := outputCol >= 34.U && outputCol <= 35.U    // ← extend 2 cycles
}

Why not just stretch inImage to 35? Because bufCol = (outputCol - 2.U)(4,0) maps outputCol to a buffer index. At outputCol=34, bufCol = 32, which wraps to 0 under the 5-bit truncation: colOut would read buffer(row)(0) instead of zero. Extending colValid alone, while keeping colOut at zero in the padding region, separates “what comes out of the buffer” from “whether the pipeline keeps running.”
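
The effect of the fix can be checked with a small behavioral model. The C sketch below is not the Chisel source: it tracks only which image column sits in each ShiftWindow register, uses -1 as a hypothetical marker for a zero column, and counts the cycles in which the center reg(2) holds an image column while colValid is high. All function and macro names are assumptions for illustration.

```c
#include <stdbool.h>

#define OUT_COLS 36    /* columns emitted per row, including padding */
#define WIN      5     /* ShiftWindow depth                          */
#define ZERO_COL (-1)  /* marker for a zero (padding) column         */

/* For one row, counts cycles where reg(2) holds an image column
 * and colValid is high. extend_valid selects the fixed logic. */
int centered_valid_outputs(bool extend_valid) {
    int reg[WIN] = { ZERO_COL, ZERO_COL, ZERO_COL, ZERO_COL, ZERO_COL };
    int captured = 0;

    for (int col = 0; col < OUT_COLS; col++) {
        bool in_image  = (col >= 2 && col <= 33);
        bool drain     = (col >= 34 && col <= 35);
        bool col_valid = in_image || (extend_valid && drain);

        /* Shift right: new column enters reg[0], reg[4] is discarded. */
        for (int i = WIN - 1; i > 0; i--) {
            reg[i] = reg[i - 1];
        }
        reg[0] = in_image ? (col - 2) : ZERO_COL;

        if (col_valid && reg[2] != ZERO_COL) {
            captured++;
        }
    }
    return captured;
}

/* bufCol = (outputCol - 2)(4,0): keep only the low 5 bits. */
unsigned buf_col(unsigned output_col) {
    return (output_col - 2u) & 0x1Fu;
}
```

Under these assumptions, centered_valid_outputs(false) models the original colValid = inImage logic and yields 30 centered outputs per row, while centered_valid_outputs(true) yields all 32. buf_col(34) evaluates to 0, which is the wrap-around that rules out simply stretching inImage.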

Bug 2: tmpRow Overwritten During Row Switch

Phenomenon. 737 mismatches. Not sporadic: systematic. Rows 0–4 pass, row 5 and beyond are entirely wrong.

Locating the fault. Initially only the first two output rows were printed, and both looked correct. Expanding the print to all 32 rows revealed the break: row 5’s first pixel was 0x00C0, the identity of row 6, col 0. Incrementing test data makes each pixel its own row identifier:

0x0000 = row 0 first pixel
0x0020 = row 1 first pixel  (32)
0x00A0 = row 5 first pixel  (160)
0x00C0 = row 6 first pixel  (192)

row0 out: 0000 0000 0000 0001 0002 ... 001d  ← correct
row1 out: 0000 0000 0020 0021 0022 ... 003d  ← correct
row2 out: 0000 0000 0040 0041 0042 ... 005d  ← correct
row3 out: 0000 0000 0060 0061 0062 ... 007d  ← correct
row4 out: 0000 0000 0080 0081 0082 ... 009d  ← correct
row5 out: 0000 0000 00c0 00c1 00c2 ... 00bd  ← row6 data!!
row6 out: 0000 0000 00e0 00e1 00e2 ...       ← mixed

Row 6 data shifted into row 5’s position: the error jumped a full row at once.
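
The incrementing pattern makes this kind of decoding mechanical. A minimal sketch, assuming the pattern is value = row × 32 + col as the listed first-pixel values suggest (helper names are hypothetical):

```c
#include <stdint.h>

#define IMG_W 32  /* pixels per image row */

/* Test-pattern value stored at (row, col). */
uint16_t pattern_value(unsigned row, unsigned col) {
    return (uint16_t)(row * IMG_W + col);
}

/* Recover the source row of a pattern value. */
unsigned pattern_row(uint16_t value) {
    return value / IMG_W;
}

/* Recover the source column of a pattern value. */
unsigned pattern_col(uint16_t value) {
    return value % IMG_W;
}
```

With these helpers, 0x00C0 decodes to row 6, column 0, and the trailing 0x00BD on the row 5 line decodes to row 5, column 29, so only the start of that row came from the wrong source row.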

Tracing the data source. LineBuffer shifts its 5-row buffer up at the end of every output row:

when (outputRow >= 2.U) {
  buffer(0) := buffer(1)
  buffer(1) := buffer(2)
  buffer(2) := buffer(3)
  buffer(3) := buffer(4)
  buffer(4) := tmpRow      // only entry point for new data
}

New data enters the buffer through exactly one path: tmpRow. If buffer holds wrong data, tmpRow was wrong first.

tmpRow is populated pixel-by-pixel during sActive:

when (io.in.valid && io.in.ready) {
  tmpRow(loadCol) := io.in.bits.asSInt
  when (loadCol === 31.U) {
    loadCol := 0.U   // 32 pixels loaded, wrap
  }.otherwise {
    loadCol := loadCol + 1.U
  }
}

io.in.ready is gated by needLoad, which checks row number but not column number:

val needLoad = outputRow >= 2.U && outputRow + 3.U < 32.U
io.in.ready := needLoad && !io.stall   // ← no column gating

Cycle by cycle at outputRow=2: once DMA has streamed row 5’s 32 pixels it simply continues, because it has no concept of padding columns:

outputCol:  0   1   2   3  ...  31  32  33  34  35
           pad pad img img     img img img pad pad
needLoad:   T   T   T   T  ...  T   T   T   T   T

loadCol:    0   1   2   3  ...  31   0   1   2   3   ← wraps after 31!
DMA sends: R5  R5  R5  R5      R5  R6  R6  R6  R6   ← R5=row5, R6=row6
                                   ↑
                row6 overwrites tmpRow(0) through tmpRow(3)!

At columns 32–35, loadCol has wrapped back to 0, but needLoad is still true. DMA is already sending row 6 pixels: they overwrite the first four slots of tmpRow. At the row-end shift, buffer(4) := tmpRow pulls the corrupted data into the buffer. After a few rows of shifts, the damage climbs through the buffer and surfaces at the output.

Root cause. loadCol wraps modulo 32. outputCol wraps modulo 36. The 4 padding columns per row create a window where DMA has advanced to the next image row but needLoad hasn’t stopped, and with no column gating, loadCol wraps to 0 and the first slots of tmpRow get overwritten.

Fix. Restrict DMA loading to image columns only:

// Before
io.in.ready := needLoad && !io.stall

// After
io.in.ready := needLoad && inImage && !io.stall

At the padding columns (0–1 and 34–35), inImage is false: loading stops, so exactly 32 pixels are accepted per output row, tmpRow stays intact, and the row-end shift propagates correct data into the buffer.
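
The corruption and the fix can both be reproduced with a short host-side model. This C sketch is an assumption-laden simplification, not the LineBuffer source: it simulates a single 36-column output row, treats needLoad as always true, ignores stall, assumes the DMA stream always has a pixel available, and feeds the incrementing test pattern. The function name and parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define IMG_W    32  /* pixels per image row                         */
#define OUT_COLS 36  /* output columns per row, 4 of them padding    */

/* Simulates one output row of tmpRow loading. The stream delivers
 * first_pixel, first_pixel + 1, ... and one pixel is accepted on
 * every cycle in which ready is high. */
void load_tmp_row(bool gate_with_in_image, uint16_t first_pixel,
                  uint16_t tmp_row[IMG_W]) {
    unsigned load_col = 0;
    uint16_t next = first_pixel;

    for (unsigned col = 0; col < OUT_COLS; col++) {
        bool in_image  = (col >= 2 && col <= 33);
        bool need_load = true;  /* row condition assumed satisfied */
        bool ready = need_load && (!gate_with_in_image || in_image);

        if (ready) {
            tmp_row[load_col] = next++;
            /* loadCol wraps after 31, as in the original logic. */
            load_col = (load_col == IMG_W - 1) ? 0 : load_col + 1;
        }
    }
}
```

Starting the stream at 0x00A0 (row 5, column 0), the ungated version accepts 36 pixels and leaves 0x00C0 through 0x00C3 in the first four slots, which matches the row 6 values seen at the start of the row 5 output. The gated version accepts 32 pixels and leaves row 5 intact.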

Phase 6: Chipyard Integration & Verilator Build

Key idea: The RoCC wrapper keeps the standalone datapath intact. It only adds the CPU command path and translates the accelerator’s simple memory interface into Rocket’s DCache interface.

Up to Phase 5, the accelerator was still a standalone Chisel module. The testbench drove start, provided base addresses directly, and connected SimpleMemIO to a fake scratchpad memory. That was enough to verify the datapath, but the design was not yet a processor-controlled accelerator.

Phase 6 moves the design into Chipyard. The goal is to let a real Rocket core execute a bare-metal C program, issue custom RoCC instructions, and control the convolution accelerator through the same software-visible interface defined in Phase 0.

[C program]
     |
     | custom0
     v
[Rocket CPU] -- RoCC cmd/resp --> [ConvAccelRoCC]
                                      |-- [ConvControl]
                                      |     decode / address / status
                                      |
                                      `-- [ConvAccelTop]
                                            standalone core
                                             |
                                             | RoCC mem
                                             v
                                        [DCache] -> [Memory]

What Chipyard Provides

Chipyard is not just a simulator for a single Chisel module. It generates a complete RISC-V SoC around Rocket:

Rocket CPU, which runs the C benchmark.
L1 instruction and data caches.
A memory system behind the cache.
Peripheral support for things like printf.
A RoCC interface for custom accelerators.
A Verilator flow for simulating the full system.

For this project, the most important pieces are Rocket, RoCC, DCache, and Verilator. Rocket executes the C benchmark. RoCC carries custom accelerator commands. DCache is used by the accelerator to access memory. Verilator simulates the whole system cycle by cycle.

Registering the Accelerator

The accelerator is attached to Rocket through a Chipyard config fragment:

class WithConvAccel extends Config((site, here, up) => {
  case BuildRoCC =>
    up(BuildRoCC) ++ Seq(
      (p: Parameters) => {
        val accel = LazyModule(new ConvAccelRoCC(OpcodeSet.custom0)(p))
        accel
      }
    )
})

BuildRoCC is the hook Rocket uses to decide which RoCC accelerators are attached to the tile. up(BuildRoCC) keeps any accelerators already defined by the base config, and ++ Seq(...) appends this convolution accelerator.

OpcodeSet.custom0 is the bridge between software and hardware. In the C program, ROCC_INSTRUCTION_SS(0, ...) emits a custom0 instruction. In hardware, OpcodeSet.custom0 tells Rocket to route those custom0 instructions to ConvAccelRoCC.

LazyRoCC Wrapper

The RoCC wrapper has two layers. ConvAccelRoCC is the outer LazyRoCC declaration: it tells Rocket that this accelerator exists and which opcode set it listens to. ConvAccelRoCCModule is the actual hardware implementation.

class ConvAccelRoCC(opcodes: OpcodeSet)(implicit p: Parameters)
  extends LazyRoCC(opcodes) { 

  override lazy val module = new ConvAccelRoCCModule(this)
}

Inside the module, two important blocks are instantiated:

val control = Module(new ConvControl)
val accel   = Module(new ConvAccelTop)

ConvControl handles command decoding, address registers, and status bits. ConvAccelTop is the standalone convolution datapath built in the earlier phases.

Command Path

When Rocket decodes a custom0 instruction, it packages the decoded fields into io.cmd.bits:

io.cmd.bits.inst.funct  // funct7 command number
io.cmd.bits.rs1         // rs1 value, usually an address
io.cmd.bits.rs2         // rs2 value, unused here
io.cmd.bits.inst.rd     // destination register for response

The command is only accepted when both valid and ready are high:

io.cmd.ready := control.io.instrReady
val cmdFire = io.cmd.valid && io.cmd.ready

cmdFire means the RoCC command has actually been transferred into the accelerator. The wrapper then forwards the decoded command to ConvControl:

control.io.instrCmd.valid  := cmdFire
control.io.instrCmd.funct7 := io.cmd.bits.inst.funct
control.io.instrCmd.rs1    := io.cmd.bits.rs1
control.io.instrCmd.rd     := io.cmd.bits.inst.rd

For funct7 = 0, 1, 2, ConvControl stores the input, kernel, and output addresses. For funct7 = 3, the wrapper pulses accel.io.start and launches the standalone accelerator.
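
The decode step can be summarized as a dispatch on funct7. The following C sketch is a behavioral model of that dispatch, not the ConvControl implementation: the struct layout, the function name, and the handling of unknown funct7 values are assumptions. The funct7 numbering itself follows the text (0 to 2 for addresses, 3 for start, 4 for status polling).

```c
#include <stdbool.h>
#include <stdint.h>

/* funct7 command numbers as used in the text. */
enum {
    CMD_SET_ADDR_IN  = 0,
    CMD_SET_ADDR_KER = 1,
    CMD_SET_ADDR_OUT = 2,
    CMD_START_ACCEL  = 3,
    CMD_POLL_STATUS  = 4
};

/* Hypothetical model of the control state. */
typedef struct {
    uint64_t addr_in;
    uint64_t addr_ker;
    uint64_t addr_out;
    bool     start_pulse;  /* models the one-cycle accel.io.start pulse */
} conv_ctrl_t;

/* Applies one accepted command (cmdFire) to the model.
 * Returns false for a funct7 value the model does not know. */
bool conv_ctrl_apply(conv_ctrl_t *c, unsigned funct7, uint64_t rs1) {
    c->start_pulse = false;  /* the pulse lasts for one command only */
    switch (funct7) {
    case CMD_SET_ADDR_IN:  c->addr_in  = rs1;     return true;
    case CMD_SET_ADDR_KER: c->addr_ker = rs1;     return true;
    case CMD_SET_ADDR_OUT: c->addr_out = rs1;     return true;
    case CMD_START_ACCEL:  c->start_pulse = true; return true;
    case CMD_POLL_STATUS:                         return true; /* read-only */
    default:                                      return false;
    }
}
```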

Memory Path

The standalone accelerator uses SimpleMemIO, while Rocket exposes a DCache memory interface through io.mem. The RoCC wrapper translates between these two interfaces.

io.mem.req.valid := accel.io.mem.req.valid
accel.io.mem.req.ready := io.mem.req.fire

io.mem.req.bits.addr := accel.io.mem.req.bits.addr
io.mem.req.bits.cmd  := Mux(accel.io.mem.req.bits.isWrite, M_XWR, M_XRD)
io.mem.req.bits.size := log2Ceil(8).U          // fixed 64-bit transfer
io.mem.req.bits.data := accel.io.mem.req.bits.data

This is the key integration boundary. ConvDMA still thinks it is talking to a simple 64-bit memory port. The wrapper converts those requests into Rocket DCache requests. valid only means the accelerator has a request to send; fire means the DCache accepted it in this cycle. Feeding io.mem.req.fire back as accel.io.mem.req.ready prevents ConvDMA from advancing until the real cache-side handshake has completed.
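
The consequence of feeding fire back as ready can be illustrated with a cycle-level model. This C sketch is a simplification with hypothetical names: it assumes the DMA keeps valid high while it has requests left and advances to its next request only in a cycle where fire is high.

```c
#include <stdbool.h>
#include <stddef.h>

/* Counts how many requests are handed to the cache over n_cycles.
 * cache_ready[i] is the cache-side ready value in cycle i. */
size_t requests_accepted(const bool *cache_ready, size_t n_cycles,
                         size_t n_requests) {
    size_t sent = 0;

    for (size_t cycle = 0; cycle < n_cycles; cycle++) {
        bool valid = sent < n_requests;           /* request pending    */
        bool fire  = valid && cache_ready[cycle]; /* real handshake     */

        /* Models accel.io.mem.req.ready := fire: the DMA side only
         * moves on when the cache actually took the request. */
        if (fire) {
            sent++;
        }
    }
    return sent;
}
```

In this model a request is never dropped while the cache is busy: the count of handed-over requests equals the number of ready cycles seen while a request was pending.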

The response path goes in the opposite direction:

accel.io.mem.resp.valid := io.mem.resp.valid
accel.io.mem.resp.bits.data := io.mem.resp.bits.data
accel.io.mem.resp.bits.tag  := io.mem.resp.bits.tag

After this connection, the accelerator can read the input and kernel arrays allocated by the C program, then write the output results back into the software-visible output buffer.

Phase 7: Bare-Metal C Test Program

Key idea: The C program is the software side of the same protocol. It writes input data, emits custom0 RoCC instructions, waits for done, then checks the hardware output against a software reference.

After the accelerator is registered in Chipyard, the next step is to drive it from software. The test program is a bare-metal C benchmark running on Rocket. It does not go through a device driver or a memory-mapped register file; instead, it emits RoCC custom instructions directly.

RoCC Software Wrappers

The C side wraps the five accelerator commands in small helper functions:

static inline void set_addr_in(uint64_t addr) {
    ROCC_INSTRUCTION_SS(0, addr, 0, 0);
}

static inline void set_addr_ker(uint64_t addr) {
    ROCC_INSTRUCTION_SS(0, addr, 0, 1);
}

static inline void set_addr_out(uint64_t addr) {
    ROCC_INSTRUCTION_SS(0, addr, 0, 2);
}

static inline void start_accel(void) {
    ROCC_INSTRUCTION_SS(0, 0, 0, 3);
}

static inline uint64_t poll_status(void) {
    uint64_t status;
    ROCC_INSTRUCTION_DSS(0, status, 0, 0, 4);
    return status;
}

The first argument, 0, selects the custom0 opcode. The last argument is the funct7 field decoded by ConvControl:

C helper	Opcode set	funct7	Meaning
set_addr_in(addr)	custom0	0	Store input base address
set_addr_ker(addr)	custom0	1	Store kernel base address
set_addr_out(addr)	custom0	2	Store output base address
start_accel()	custom0	3	Start the accelerator
poll_status()	custom0	4	Read the status register

This is the software-visible side of the same protocol defined in Phase 0 and decoded in Phase 6.

Benchmark Flow

The benchmark first computes a software reference, then launches the accelerator on the same input and kernel buffers:

sw_start = rdcycle();
software_conv_5x5_same_q88();
sw_end = rdcycle();
sw_cycles = sw_end - sw_start;

set_addr_in((uint64_t)(uintptr_t)input);
set_addr_ker((uint64_t)(uintptr_t)kernel);
set_addr_out((uint64_t)(uintptr_t)hw_out);

fence_rw();

acc_start = rdcycle();
start_accel();

for (poll_count = 0; poll_count < MAX_POLL; poll_count++) {
    status = poll_status();

    if (poll_count > 10 && (status & 0x2)) {
        break;
    }
}

fence_rw();

acc_end = rdcycle();
acc_cycles = acc_end - acc_start;

rdcycle reads the RISC-V cycle counter. The accelerator timing window starts after address setup and the first fence, then includes START, polling, completion wait, and the final fence.

The fence rw, rw is important because the CPU and accelerator share memory. Before starting the accelerator, the benchmark has already written the input and kernel arrays. The fence makes those writes visible to the memory system before the accelerator begins reading through DCache.
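
The benchmark's fence_rw helper is not shown in this write-up. One plausible definition, given here only as a sketch under that assumption, wraps the fence rw, rw instruction in inline assembly; the function name and the host fallback (a GCC/Clang builtin, included so the snippet also compiles off-target) are hypothetical.

```c
#include <stdint.h>

/* Hypothetical definition of a full read/write fence helper.
 * The "memory" clobber also stops the compiler from reordering
 * memory accesses across the fence. */
static inline void fence_rw_sketch(void) {
#if defined(__riscv)
    __asm__ volatile ("fence rw, rw" ::: "memory");
#else
    /* Host fallback for compiling and testing off-target. */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}
```

Placed after the input and kernel arrays are written and before START_ACCEL, a helper like this orders the CPU's stores ahead of the accelerator launch; placed after polling, it orders the CPU's later loads of the output buffer behind the completion check.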

The polling loop waits for bit 1 of the status register:

status & 0x2 -> done

Once done is observed, the program compares the hardware output buffer against the software reference. This phase therefore checks two things at once: Rocket can control the accelerator through RoCC, and the accelerator can read and write C-visible memory correctly.
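
The body of software_conv_5x5_same_q88 is not reproduced here. As a rough idea of what such a reference can look like, the sketch below computes a zero-padded 5×5 "same" convolution on Q8.8 data. Several details are assumptions rather than facts about the project: the function name, the flat row-major layout, the absence of a kernel flip, truncation toward zero when dropping the 8 fractional bits, and saturation to the 16-bit range.

```c
#include <stdint.h>

#define IMG_N 32            /* image is IMG_N x IMG_N   */
#define KER_K 5             /* kernel is KER_K x KER_K  */
#define KER_PAD (KER_K / 2) /* zero padding on each side */

/* Saturate to the int16_t range (assumed overflow policy). */
static int16_t sat16(int64_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Zero-padded "same" convolution, Q8.8 in, Q8.8 out, row-major. */
void ref_conv_5x5_same_q88(const int16_t *in, const int16_t *ker,
                           int16_t *out) {
    for (int r = 0; r < IMG_N; r++) {
        for (int c = 0; c < IMG_N; c++) {
            int64_t acc = 0;  /* Q16.16 products, wide accumulator */
            for (int kr = 0; kr < KER_K; kr++) {
                for (int kc = 0; kc < KER_K; kc++) {
                    int rr = r + kr - KER_PAD;
                    int cc = c + kc - KER_PAD;
                    /* Taps outside the image contribute zero. */
                    if (rr < 0 || rr >= IMG_N || cc < 0 || cc >= IMG_N) {
                        continue;
                    }
                    acc += (int64_t)in[rr * IMG_N + cc] *
                           (int64_t)ker[kr * KER_K + kc];
                }
            }
            /* Drop 8 fractional bits (truncation toward zero assumed). */
            out[r * IMG_N + c] = sat16(acc / 256);
        }
    }
}
```

A quick sanity check for a reference like this is an identity kernel: with only the center tap set to 1.0 (256 in Q8.8), the output should equal the input.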

Phase 8: Performance Report & Summary

Key idea: The measured speedup comes from moving the whole convolution dataflow into hardware, not from replacing one instruction with one faster instruction.

The final benchmark compares the software convolution against the RoCC accelerator path.

Version	Cycles
Software convolution	586,893
RoCC accelerator	3,312
Speedup	177.20×
Cycle reduction	99.44%
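
The two derived rows follow directly from the measured cycle counts. A minimal sketch of that arithmetic, with hypothetical helper names:

```c
#include <stdint.h>

/* Speedup = software cycles divided by accelerator cycles. */
double speedup(uint64_t sw_cycles, uint64_t acc_cycles) {
    return (double)sw_cycles / (double)acc_cycles;
}

/* Cycle reduction in percent = 100 * (1 - accelerator / software). */
double cycle_reduction_percent(uint64_t sw_cycles, uint64_t acc_cycles) {
    return 100.0 * (1.0 - (double)acc_cycles / (double)sw_cycles);
}
```

Calling speedup(586893, 3312) and cycle_reduction_percent(586893, 3312) reproduces the derived figures from the two measured values.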

The software version spends most of its time in nested loops: address calculation, boundary checks, loads, multiplies, adds, stores, and branches. The accelerator removes that loop overhead from the CPU. The sliding-window pattern, 5×5 MAC pipeline, buffering, and memory traffic are all handled in hardware.

The speedup is not just from a faster multiplier. It comes from moving the whole convolution dataflow into a dedicated datapath:

LineBuffer and ShiftWindow reuse pixels instead of repeatedly loading overlapping windows.
ConvUnit pipelines the 25-way MAC tree.
InputQueue and StoreQueue absorb rate mismatches between DMA and compute.
RoCC reduces CPU involvement to a few setup, start, and polling instructions.

At the end of this project, the design is no longer just a standalone Chisel module. It is a Rocket-controlled RoCC accelerator integrated through Chipyard, launched from bare-metal C, accessing memory through the DCache interface, and reporting completion through the same custom instruction protocol used for configuration.

References

[1] Chipyard, “Adding a RoCC Accelerator,” Chipyard Documentation. [Online]. Available: https://chipyard.readthedocs.io/en/stable/Customization/RoCC-Accelerators.html. Accessed: Jun. 25, 2026.

[2] CHIPS Alliance, “Rocket Chip Generator,” GitHub repository. [Online]. Available: https://github.com/chipsalliance/rocket-chip. Accessed: Jun. 25, 2026.

[3] RISC-V International, “RISC-V Instruction Set Manual,” GitHub repository. [Online]. Available: https://github.com/riscv/riscv-isa-manual. Accessed: Jun. 25, 2026.

[4] Chisel, “Interfaces and Connections,” Chisel Documentation. [Online]. Available: https://www.chisel-lang.org/docs/explanations/interfaces-and-connections. Accessed: Jun. 25, 2026.

[5] Verilator, “Verilator User’s Guide,” Verilator Documentation. [Online]. Available: https://verilator.org/guide/latest/. Accessed: Jun. 25, 2026.

[6] S. Eldridge, “rocket-rocc-examples,” GitHub repository. [Online]. Available: https://github.com/seldridge/rocket-rocc-examples. Accessed: Jun. 25, 2026.
