Phase 0: Project Planning & Protocol Design

Before writing any code, we first settled the fundamental questions.

Protocol Design: RoCC Instruction Encoding

At a high level, the RoCC protocol is simple: the CPU passes register values to the accelerator via custom instructions, and the accelerator responds with a result.

Rocket Core (CPU)                          Accelerator
      |                                         |
      |  custom inst (rs1=addr, rs2=data)       |   ← RoCC interface: control
      | --------------------------------------> |
      |                                         |
      |  DMA read / write                       |   ← TileLink bus: data
      | <=====================================> |
      |                                         |
      |  resp.data = status / result            |   ← RoCC interface: status
      | <-------------------------------------- |

This accelerator uses a three-step flow: set addresses → trigger → poll. The CPU tells the accelerator where the input matrix, kernel, and output reside in memory, fires a start signal, then polls a status register for completion.

funct7 | Instruction  | rs1       | Description
0      | SET_ADDR_IN  | base addr | Base address of input matrix
1      | SET_ADDR_KER | base addr | Base address of kernel
2      | SET_ADDR_OUT | base addr | Base address of output matrix
3      | START_ACCEL  | -         | Non-blocking start
4      | POLL_STATUS  | -         | Read status register to rd

The status register packs four bits for software polling:

Bit | Name     | Meaning
0   | busy     | Accelerator is computing
1   | done     | Computation complete
2   | overflow | Accumulator overflow
3   | addr_err | Address check failed
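The bit layout above maps directly onto masks. The following is a minimal plain-Scala sketch of how software might interpret the value returned by POLL_STATUS. The object StatusBits and its helper names are illustrative assumptions, not part of the accelerator's RTL or driver.

```scala
// Sketch of the status word layout from the table above.
// StatusBits and its method names are hypothetical, chosen for illustration.
object StatusBits {
  val Busy     = 1 << 0 // accelerator is computing
  val Done     = 1 << 1 // computation complete
  val Overflow = 1 << 2 // accumulator overflow
  val AddrErr  = 1 << 3 // address check failed

  /** Pack the four flags into the value POLL_STATUS would place in rd. */
  def pack(busy: Boolean, done: Boolean, overflow: Boolean, addrErr: Boolean): Int =
    (if (busy) Busy else 0) | (if (done) Done else 0) |
      (if (overflow) Overflow else 0) | (if (addrErr) AddrErr else 0)

  /** Software keeps polling only while busy is set and no address error is flagged. */
  def shouldKeepPolling(status: Int): Boolean =
    (status & Busy) != 0 && (status & AddrErr) == 0

  /** A run counts as clean when done is set and neither error bit is. */
  def succeeded(status: Int): Boolean =
    (status & Done) != 0 && (status & (Overflow | AddrErr)) == 0
}
```

A polling loop would call shouldKeepPolling on each returned value and check succeeded once it returns false.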

Design Rationale

  1. Why split addresses into three separate instructions?

    A single RoCC instruction only carries two source operands (rs1 and rs2), not enough to pass three base addresses at once. The alternative is passing a struct pointer, but that forces the accelerator to issue its own DMA read just to fetch the configuration, adding complexity and latency. Three independent instructions pass one address each, keeping the hardware interface simple.

  2. Why non-blocking START + polling instead of blocking for the result?

    Non-blocking lets the CPU issue START_ACCEL and immediately switch to other work. Polling is preferred over interrupts because a 5×5 convolution completes in a short, predictable window: the overhead of an interrupt controller and context switch would outweigh the benefit.

  3. Why contiguous funct7 encoding instead of sparse assignment?

    A single funct7 <= 4 check covers all valid instructions, keeping the hardware decoder minimal. Sparse encoding would add combinational logic with no upside.

Key Design Decisions

  1. Is the convolution kernel size fixed or configurable?

    Supporting variable kernel sizes increases control logic complexity significantly. A fixed 5×5 hardware datapath is used. Kernels smaller than 5×5 are pre-zero-padded by software.

  2. How to handle boundary pixels?

    Zero-padding: no special edge-case logic is needed in the sliding-window FSM.

  3. Fixed-point precision trade-off.

    • Why not floating-point? IEEE 754 multipliers are expensive, consuming significant area and pipeline stages. A Q8.8 fixed-point multiplier is just a 16-bit × 16-bit integer multiplier, producing a result in a single cycle with no extra hardware.
    • Why not plain integer? Convolution weights are naturally fractional (e.g. 0.125, -0.5). Plain integers can’t represent them. Q8.8 splits 16 bits evenly: 8 bits for the integer part (±128 range), 8 bits for the fractional part (1/256 precision).
    • Why a 32-bit accumulator? A signed 16-bit × 16-bit multiply produces a 32-bit product in Q16.16, so a 32-bit accumulator holds any single product exactly. The sum is a different matter. With Q8.8 values in [-128, 128), one product can reach 2^14 in real value, and 25 of them can reach 25 × 2^14 ≈ 2^18.6. That worst case needs 19 integer magnitude bits, 16 fractional bits, and a sign bit: 36 bits in total, more than a signed Q16.16 accumulator (real range [-2^15, 2^15)) can hold. The 32-bit width therefore relies on realistic data rather than the theoretical worst case. For example, a kernel whose absolute weights sum to at most 1 (a normalized blur) keeps every sum inside the Q8.8 input range. Sums that exceed 32 bits are reported through the overflow bit in the status register.

  4. Why alignment constraints?

    TileLink DMA works best when addresses are aligned to the transfer size. An unaligned address forces the bus to split a single transfer into multiple chunks, and the hardware must shift and stitch bytes back together. Instead, the accelerator checks the base addresses when START_ACCEL is issued and refuses to run with an unaligned one. This keeps the hardware simple and pushes alignment responsibility to software.

Parameter          | Value
Input Matrix       | 32×32
Max Kernel         | 5×5
Data Format        | Q8.8 Fixed Point (16-bit)
Accumulator        | 32-bit
Output             | Same size, zero-padding
Performance Target | <2500 cycles, ≥40× speedup

Phase 1: Architecture Overview

The Problem: Running Convolution on a CPU

A 5×5 convolution over a 32×32 matrix means sliding a 5×5 window across 1024 output positions. Each output pixel requires 25 multiplies and 24 additions: 1024 × 25 = 25,600 multiply-accumulates in total. On a general-purpose CPU, the arithmetic itself is not the bottleneck. The real bottleneck is the sliding. Every slide requires computing address offsets, updating loop counters, loading pixels and weights from memory, and storing results back. Most instructions are spent on loop control and address bookkeeping, not math.

A rough estimate: on a simple in-order RISC-V core, each output pixel costs ~100–150 cycles (two loads, one store, ~50 arithmetic ops, plus loop branches). Across 1024 pixels, that adds up to roughly 100,000–150,000 cycles. In RTL simulation, the accelerator completes the same 32×32 case in 2428 cycles. Against the rough CPU estimate above, that suggests a speedup on the order of 40–60× (100,000 / 2428 ≈ 41, 150,000 / 2428 ≈ 62). A measured bare-metal baseline is left for Phase 8.

The Fix: Offload to an Accelerator

The CPU is not built for sliding-window number crunching, so that job is handed to a RoCC accelerator. The accelerator does three things: fetch data from SRAM, run the convolution, and write results back. The CPU’s role shrinks to three steps: set addresses, issue START, and poll.

What does the accelerator look like inside?

The naive approach is a serial pipeline: DMA load → MAC compute → DMA store. Simple, but wasteful: the DMA waits for the MAC, and the MAC waits for the DMA. Only one piece of hardware is active at any moment.

What we actually built overlaps the pieces. An inputQueue buffers the incoming matrix so that LineBuffer can start filling while DMA is still loading later rows. The ConvEngine runs a 6-stage pipeline, producing one result per cycle once full. A storeQueue absorbs results during computation, and DMA drains the queue to write them back as the pipeline drains. The result is partial overlap: input loading overlaps with LineBuffer filling and early compute, while the compute tail overlaps with output writeback. End to end, the run takes 2428 cycles.

Module Map

Module                  | Role                                                    | Phase
ConvControl             | RoCC decode + 4-state control FSM + status register     | 2
ConvDMA                 | 7-state DMA engine over TileLink                        | 3
LineBuffer + ConvEngine | Sliding window + MAC pipeline, 6-stage compute datapath | 4
InputQueue + StoreQueue | Elastic buffers with backpressure                       | 5

The following phases walk through each module in detail.


Phase 2: ConvControl — Instruction Decode & FSM

Phase 1 drew the architecture. The top line, RoCC control, is what Phase 2 is about. The CPU issues a custom instruction carrying funct7. ConvControl decodes it, executes it, and responds via rd.

Interface

ConvControl talks to the CPU through five signals and a valid/ready handshake: instrCmd.valid / funct7 / rs1 / rd arrive from the CPU, and instrReady goes back. When both valid and ready are high on the same cycle, the instruction fires.

Instruction Decode

Decoding uses five comparators: funct7 === 0.U through funct7 === 4.U. No priority encoder, no lookup table. The contiguous encoding from Phase 0 pays off: funct7 is literally the instruction number.

Not every instruction is welcome at every moment. The rules:

  • SET (0–2): blocked in sBusy. Changing addresses mid-computation would corrupt the run.
  • START (3): only accepted in sIdle or sDone. You cannot start an already-running accelerator.
  • POLL (4): always accepted. It is a pure read: it cannot interfere with anything.

SET_ADDR: Storing Three Base Addresses

Depending on funct7, rs1 is written into addrIn, addrKer, or addrOut. If the accelerator is in sError, any SET clears it back to sIdle.

START_ACCEL: Validate and Fire

  1. Address non-zero check

    If any base address (addrIn, addrKer, addrOut) is zero, the state goes to sError.

  2. Address alignment check

    Input and output matrices require 8-byte alignment (addrIn(2,0) === 0.U). The kernel requires 2-byte alignment (addrKer(0) === 0.U). The DMA bus is 64 bits wide: each transfer moves 4 pixels. The kernel has only 25 coefficients, so 2-byte alignment is sufficient.

    If all checks pass: sIdle → sBusy, and computation begins. If any check fails: sError, and the addr_err bit in the status register goes high.
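The two START checks reduce to a handful of mask comparisons. Below is a plain-Scala sketch that mirrors the rules as stated (non-zero, 8-byte alignment for the matrices, 2-byte alignment for the kernel). The object StartCheck and its result names are hypothetical and stand in for the hardware's sBusy / sError outcome.

```scala
// Software model of the START_ACCEL validation described above.
// StartCheck, Ok and AddrError are illustrative names, not RTL identifiers.
object StartCheck {
  sealed trait Result
  case object Ok extends Result        // corresponds to moving into sBusy
  case object AddrError extends Result // corresponds to sError with addr_err set

  def validate(addrIn: Long, addrKer: Long, addrOut: Long): Result = {
    // Check 1: no base address may be zero
    val nonZero = addrIn != 0L && addrKer != 0L && addrOut != 0L
    // Check 2a: matrices need 8-byte alignment, i.e. addr(2,0) === 0
    val matricesAligned = (addrIn & 0x7L) == 0L && (addrOut & 0x7L) == 0L
    // Check 2b: the kernel needs 2-byte alignment, i.e. addr(0) === 0
    val kernelAligned = (addrKer & 0x1L) == 0L
    if (nonZero && matricesAligned && kernelAligned) Ok else AddrError
  }
}
```

For instance, a kernel at 0x2002 passes because only bit 0 is checked, while an input matrix at 0x1004 fails the 8-byte rule.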

FSM: Four States

              START pass               done
    sIdle ─────────────────► sBusy ────────────► sDone
    │ ▲ ▲                      ▲                  │ │
    │ │ │                      └───── START ──────┘ │
    │ │ └─────────────────── SET ───────────────────┘
    │ │
    │ └─── SET ───┐
    │             │
    │ START fail  │
    ▼             │
    sError ───────┘

  • sIdle: reset default. Waits for START.
  • sBusy: only accepts POLL. Counts down to zero, then raises done and moves to sDone.
  • sDone: holds done = 1. CPU can re-START (→ sBusy) or SET to reconfigure (→ sIdle).
  • sError: holds addrErr = 1. Only SET pulls it back to sIdle: no direct path to sBusy.
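The acceptance rules and the four states can be captured in a small transition function. This is a plain-Scala behavioural sketch, not the Chisel source: the object ControlFsmModel and its event names are made up for illustration. One transition is an assumption on my part: a failing START issued from sDone is sent to sError, which the text implies but the diagram does not draw.

```scala
// Behavioural model of the 4-state control FSM described above.
object ControlFsmModel {
  sealed trait State
  case object SIdle extends State
  case object SBusy extends State
  case object SDone extends State
  case object SError extends State

  sealed trait Event
  case object SetAddr extends Event                         // funct7 0..2
  final case class Start(checksPass: Boolean) extends Event // funct7 3
  case object Poll extends Event                            // funct7 4
  case object ComputeDone extends Event                     // internal, not an instruction

  /** Next state for one event; events that are not accepted leave the state unchanged. */
  def next(s: State, e: Event): State = (s, e) match {
    case (_, Poll)                     => s      // pure read, never changes state
    case (SBusy, ComputeDone)          => SDone
    case (SBusy, _)                    => SBusy  // SET and START are blocked while busy
    case (SIdle | SDone, Start(true))  => SBusy
    case (SIdle | SDone, Start(false)) => SError // from sDone: assumed, see lead-in
    case (SDone | SError, SetAddr)     => SIdle
    case _                             => s      // e.g. START in sError is not accepted
  }
}
```

Folding a command sequence such as three SETs, a passing START, and a ComputeDone over next, starting from SIdle, ends in SDone.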

Phase 3: ConvDMA

The convolution engine constantly reads pixels and weights from SRAM, then writes results back. A dedicated DMA engine manages these transfers. Phase 3 starts with a strictly serial DMA, one request in flight at a time. Once verified, we add pipelining so multiple requests can be issued without waiting for the prior response.

DMA Interface

Three Bundles connect the DMA to the L1 data cache: signals only, no logic. Valid/ready handshakes come from Decoupled.

  • SimpleMemReq: DMA’s request to L1. Carries 64-bit address, 64-bit write data, byte mask, a read/write flag, and a 4-bit tag. Tag is 0 in the serial DMA; reserved so a pipelined version can match out-of-order responses.
  • SimpleMemResp: L1’s reply. Returns 64-bit read data and echoes the request’s tag.
  • SimpleMemIO: bundles req (DMA→L1) and resp (L1→DMA, Flipped) into one port.

  ConvDMA              SimpleMemIO                L1 data cache
┌─────────┐          ┌──────────────┐           ┌──────────┐
│         │── req ──►│ req (output) │──────────►│          │
│   FSM   │          │              │           │  L1 D$   │
│         │◄─ resp ──│ resp (input) │◄──────────│          │
└─────────┘          └──────────────┘           └──────────┘

Serial FSM

ConvDMA moves data between SRAM and the compute unit through two paths:

  • Load path (sIdle → sIssue → sWaitResp → sUnpack → loop): reads 64-bit words from memory, unpacks each into four 16-bit elements, pushes them into elemQueue.
  • Store path (sIdle → sGather → sIssue → loop): collects four 16-bit elements, packs them into a 64-bit word, writes it to memory.

Both paths share sIssue, branching on opReg. At most one request is in flight at any time.
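The unpack and gather steps are inverse bit manipulations on a 64-bit word. Below is a plain-Scala sketch of both. The element order (element 0 in bits 15..0) is an assumption, since the text does not state the lane order, and WordPacking is an illustrative name.

```scala
// Software model of the load-path unpack and store-path gather steps.
// Lane order is assumed little-endian: element i occupies bits 16*i+15 .. 16*i.
object WordPacking {
  /** Slice a 64-bit word into four 16-bit elements. */
  def unpack(word: Long): Vector[Short] =
    Vector.tabulate(4)(i => ((word >>> (16 * i)) & 0xFFFFL).toShort)

  /** Pack four 16-bit elements into one 64-bit word (inverse of unpack). */
  def pack(elems: Seq[Short]): Long = {
    require(elems.length == 4, "a store word gathers exactly four elements")
    elems.zipWithIndex.foldLeft(0L) { case (acc, (e, i)) =>
      // Mask before shifting so negative elements do not sign-extend into other lanes
      acc | ((e.toLong & 0xFFFFL) << (16 * i))
    }
  }
}
```

The mask in pack matters for negative Q8.8 values: without it, sign extension of one element would overwrite the lanes above it.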

Load path:   sIdle → sIssue → sWaitResp → sUnpack(×4) ─┐
                       ▲                               │
                       └───────────────────────────────┘

Store path:  sIdle → sGather(×4) → sIssue ─┐
                       ▲                   │
                       └───────────────────┘

Error path:  sIdle ──► sError (SET addr ⇒ sIdle)

State     | Responsibility                                                                  | Path
sIdle     | Accept command, latch baseAddr and length, check alignment                      | common
sIssue    | Fire mem.req (read or write, depending on opReg)                                | shared
sWaitResp | Wait for mem.resp.valid, latch response data                                    | load
sUnpack   | 4 cycles: slice 64-bit word into 4 × 16-bit elements, push into elemQueue       | load
sGather   | 4 cycles: collect 4 × 16-bit elements from elemQueue, pack into 64-bit word     | store
sDone     | Transfer complete, waiting for upper layer to consume result                    | common
sError    | Address misaligned, waiting for a valid address                                 | common

Bottleneck: 6 cycles/word

Take the load path. One word goes through sIssue → sWaitResp → sUnpack×4 → back to sIssue: exactly 6 cycles. Memory sits idle for 4 of those 6 cycles during unpack. The bottleneck is not memory latency, it’s the FSM refusing to overlap issue with unpack.

cyc | state     | mem.req  | mem.resp | loadStream
0   | sIssue    | fire(rd) |          |
1   | sWaitResp |          | fire     |
2   | sUnpack   |          |          |
3   | sUnpack   |          |          | deq elem[0]
4   | sUnpack   |          |          | deq elem[1]
5   | sUnpack   |          |          | deq elem[2]
6   | sIssue    | fire(rd) |          | deq elem[3]

256 words × 6 + 1 cycle overhead = 1537 cycles. The next section fixes this.
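The 1537-cycle figure can be reproduced by stepping a model of the serial load FSM. This is a plain-Scala sketch under two assumptions: memory responds exactly one cycle after the request, as in the trace, and the single overhead cycle is the initial sIdle cycle that accepts the command. SerialLoadModel is an illustrative name.

```scala
// Cycle-count model of the serial load path: sIssue -> sWaitResp -> sUnpack x4 per word.
object SerialLoadModel {
  sealed trait State
  case object SIdle extends State
  case object SIssue extends State
  case object SWaitResp extends State
  case object SUnpack extends State
  case object SDone extends State

  /** Count cycles until `words` 64-bit words have been issued, received and unpacked. */
  def cyclesFor(words: Int): Int = {
    var state: State = SIdle
    var wordsLeft = words
    var elemIdx = 0
    var cycles = 0
    while (state != SDone) {
      cycles += 1 // one cycle spent in the current state
      state = state match {
        case SIdle     => if (wordsLeft > 0) SIssue else SDone // assumed overhead cycle
        case SIssue    => SWaitResp // the read request fires
        case SWaitResp => SUnpack   // response arrives after one cycle (assumed)
        case SUnpack   =>
          if (elemIdx < 3) { elemIdx += 1; SUnpack }
          else {
            elemIdx = 0; wordsLeft -= 1
            if (wordsLeft > 0) SIssue else SDone
          }
        case SDone     => SDone
      }
    }
    cycles
  }
}
```

A 32×32 image of 16-bit pixels is 1024 / 4 = 256 words, so the call of interest is SerialLoadModel.cyclesFor(256).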

Pipelined DMA

The core idea: decouple issue and unpack into two concurrent hardware processes.

Serial:
  Issue:     [sIssue]                  [sIssue]                  [sIssue]
  Response:          [sWait]                   [sWait]                   [sWait]
  Unpack:                   [sUnpack×4]               [sUnpack×4]               [sUnpack×4]
             ↑───── 6 cycles/word ─────↑

Pipelined:
  Issue:     [sIssue][sIssue][sIssue][sIssue][sIssue]...
  Response:          [1 cycle][1 cycle][1 cycle][1 cycle][1 cycle]    ← enqueue FIFO
  Unpack:                     [sUnpack×4][sUnpack×4][sUnpack×4]...    ← dequeue FIFO
                              ↑── 4 cycles/word (unpack is the bottleneck) ──↑

A response FIFO absorbs the rate mismatch: the issue engine fires at 1 word/cycle (up to the inflight limit), responses drop into the FIFO automatically, and the unpack engine drains the FIFO at 4 cycles/word. Neither blocks the other. 1537 cycles → ~1033 cycles, a 33% improvement. Step-by-step implementation details are in a follow-up post.

Changes from Serial to Pipelined

1. Register → Queue

The serial DMA uses a single respWord register to latch one response at a time. The pipelined version replaces it with a response FIFO. Data drops into the FIFO automatically; the FSM only reads from the FIFO when it needs data.

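2. sWaitResp disappears

The FSM no longer has a state that waits for mem.resp.valid and latches the response: responses land in the FIFO on their own. The only waiting left is on the consumer side, where the unpack engine holds off until the FIFO has data to pop.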

3. inflightCount: credit-based flow control

The issue engine fires faster (1 word/cycle) than the unpack engine consumes (1 element/cycle → 0.25 word/cycle). Without a cap, the FIFO overflows. inflightCount acts as a credit account: issue spends a credit per request, unpack returns a credit per word consumed. inflightMax = 4 caps the window at 4 in-flight words.

4. sLoadActive: concurrent FSM

The three serial states sIssue, sWaitResp, and sUnpack merge into a single sLoadActive state. Inside, two independent when blocks run concurrently: the issue engine fires requests up to inflightMax, and the unpack engine drains the FIFO. Neither engine waits for the other.

A trace of the first ten cycles (0–9):

cyc |      issue engine       |      unpack engine      | inflight | what happened
0 | fire read → addr 0x1000 | | 1 | burst starts
1 | fire read → addr 0x1008 | | 2 |
2 | fire read → addr 0x1010 | deq elem[0] from word 0 | 3 | unpack begins, both active
3 | fire read → addr 0x1018 | deq elem[1] | 4 | inflight window full
4 | — | deq elem[2] | 4 | issue blocked
5 | fire read → addr 0x1020 | deq elem[3] (last) | 4 | both fire: +1/−1 cancel
6 | — | deq elem[4] from word 1 | 4 | inflight full again
7 | — | deq elem[5] | 4 |
8 | — | deq elem[6] | 4 |
9 | fire read → addr 0x1028 | deq elem[7] (last) | 4 | credit returned, issue fires

Key observations:

  • Cycles 0–3: The issue engine fires four read requests back-to-back, and unpack joins in at cycle 2 once the first response is available. By cycle 3, inflight hits the cap of 4.
  • Cycles 5 and 9: Both engines advance in the same cycle. Issue fires (+1) while unpack finishes a word (−1). The net change to inflightCount is zero.
  • Cycles 4 and 6–8: Issue is blocked because inflightCount == inflightMax.

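The only shared state is inflightCount, and both engines can update it in the same cycle. Two separate assignments (one incrementing, one decrementing) would collide: under Chisel’s last-connect semantics the later assignment wins and the other update is lost. The fix is a single expression: inflightCount + issueFired − unpackWordDone. Both boolean events feed one arithmetic operation: +1 and −1 cancel cleanly when they coincide, and neither update is lost.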
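The trace can be reproduced with a few lines of simulation built around that single update expression. This plain-Scala sketch makes two assumptions read off the trace rather than the RTL: unpack starts at cycle 2, and the issue engine may fire in a cycle where the window is full if a word completes in that same cycle. InflightModel and its field names are illustrative.

```scala
// Cycle-by-cycle model of the credit counter shared by the issue and unpack engines.
object InflightModel {
  val inflightMax = 4
  val unpackStartCycle = 2 // assumed from the trace: first response is usable at cycle 2

  final case class Row(cycle: Int, issueFired: Boolean, unpackWordDone: Boolean, inflight: Int)

  def simulate(cycles: Int): Vector[Row] = {
    var inflight = 0 // credits currently spent
    var elemIdx = 0  // running index of the element being dequeued
    val rows = Vector.newBuilder[Row]
    for (cyc <- 0 until cycles) {
      val unpacking = cyc >= unpackStartCycle && inflight > 0
      val unpackWordDone = unpacking && (elemIdx % 4 == 3) // last element of a word
      // Assumed from cycles 5 and 9: a credit returned this cycle can be spent this cycle
      val issueFired = inflight < inflightMax || unpackWordDone
      // The single-expression update: both events feed one arithmetic operation
      inflight = inflight + (if (issueFired) 1 else 0) - (if (unpackWordDone) 1 else 0)
      if (unpacking) elemIdx += 1
      rows += Row(cyc, issueFired, unpackWordDone, inflight)
    }
    rows.result()
  }
}
```

Running simulate(10) gives the inflight column 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, with issue firing at cycles 0–3, 5 and 9, matching the table.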


Phase 4: Compute Datapath — LineBuffer & ConvEngine

DMA delivers pixels one at a time in row-major order: row 0 left to right, then row 1, then row 2. But a 5×5 convolution at output pixel (r, c) needs 25 pixels arranged as a 5×5 neighbourhood centred on (r, c), which is position (r+2, c+2) in the zero-padded image. A single pixel from the DMA is useless on its own: the compute datapath must assemble an entire 5×5 window before the MAC unit can start.

This is a data reshape problem. It splits naturally into two dimensions, so two modules are paired: LineBuffer handles the vertical direction (rows), ShiftWindow handles the horizontal direction (columns). LineBuffer collects 5 rows of 32 pixels from the DMA stream, then outputs one column of 5 vertically adjacent pixels per cycle. ShiftWindow buffers 5 consecutive columns from LineBuffer, shifting right each cycle, and outputs a full 5×5 window to the MAC unit.

Part A: LineBuffer

Why not read directly from SRAM? Each output pixel needs values from 5 different rows. Fetching them directly would require 5 independent reads per cycle targeting 5 different addresses: five memory ports. LineBuffer replaces that with one write port (DMA feeds in one pixel per cycle) and one 5-wide read port (5 rows × same column). A 160-entry register file is far cheaper than a 5-port SRAM.

sIdle ──► sPrime (load 5 rows)  ──► sActive (32 output rows)  ──► sDone

In sActive, while the buffer emits 36 columns per row, DMA loads the next input row into a separate tmpRow register. When the row ends, the buffer shifts up: row 0 is discarded, rows 1..4 move up by one, and tmpRow enters as row 4. Without tmpRow, DMA would overwrite rows still being output, and load and output could not overlap.

Zero-padding covers all four image borders through two mechanisms:

  • Top / bottom: Handled by what the five buffer rows hold. At output row 0, the top two slots are zero; at output row 31, the bottom two are zero. As the window slides down, real rows rotate in and out: no extra control logic.
  • Left / right: Each output row emits 36 columns (2 padding + 32 data + 2 padding). The colValid signal marks data columns. When low, ShiftWindow fills zeros regardless of colOut.

Part B: ShiftWindow → KernelROM → ConvUnit

LineBuffer outputs one 5-pixel column per cycle. The MAC unit needs a full 5×5 window. Three modules bridge the gap:

ShiftWindow: 5×5 register window. Each cycle, all columns shift right by one: the oldest (c4) is dropped, and the new column from LineBuffer enters c0. When colValid is low, c0 gets zeros instead, implementing left/right padding. The whole 5×5 array is exposed combinationally: 400 bits of registers, cheaper than BRAM and with zero read latency.

KernelROM: weight storage. 25-entry register file. Weights are written once before computation starts and remain read-only throughout. Combinational 5×5 output with zero latency, so ConvUnit receives both window and kernel operands in the same cycle.

ConvUnit: 5-stage MAC pipeline. A single combinational multiply-accumulate chain (25 multiplies + tree reduction) would have a critical path too long to meet timing. The fix: slice the pairwise addition tree into 5 pipeline stages, each doing one 32-bit addition.

Stage 0 (combinational): 25 parallel 16×16→32 multiplies
Stage 1–5 (registered): 25→13→7→4→2→1 pairwise reduction
Stage 5 also does rounding (+0x80), >>8, saturate to 16-bit

Why pairwise instead of Wallace tree? The binary tree is regular, depth is exactly ceil(log₂ 25) = 5 levels, and registers slip naturally between levels: critical path becomes a single 32-bit add.
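What the pipeline computes, independent of how it is staged, is a 25-term Q8.8 multiply-accumulate followed by rounding and saturation. The following plain-Scala sketch is a reference model of that arithmetic. It accumulates in a 64-bit Long for simplicity, so it does not model the 32-bit accumulator or the overflow flag, and the name Q88Mac is illustrative.

```scala
// Reference model of the ConvUnit arithmetic: 25 products, sum, round, shift, saturate.
object Q88Mac {
  /** window and kernel hold 25 Q8.8 values each (raw 16-bit patterns). */
  def mac(window: Seq[Short], kernel: Seq[Short]): Short = {
    require(window.length == 25 && kernel.length == 25, "a 5x5 window needs 25 values")
    // Each product is Q16.16; the Long accumulator is a simplification of the 32-bit one
    val acc: Long = window.zip(kernel).map { case (p, w) => p.toLong * w.toLong }.sum
    // Round to nearest by adding half an output LSB (0x80), then drop 8 fractional bits
    val rounded: Long = (acc + 0x80L) >> 8
    // Saturate to the signed 16-bit Q8.8 output range
    math.max(Short.MinValue.toLong, math.min(Short.MaxValue.toLong, rounded)).toShort
  }
}
```

With an identity kernel (centre weight 0x0100, which is 1.0 in Q8.8, and zeros elsewhere), mac returns the centre pixel unchanged, which makes a convenient first check for a testbench.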

ConvEngine: top-level glue. Instantiates ShiftWindow, KernelROM, and ConvUnit, connecting colIn / colValid to ShiftWindow and kernel / window to ConvUnit. inValid is delayed one cycle (RegNext) to align with ShiftWindow’s registered output. A stall input gates colValid to freeze the entire pipeline under backpressure. outValid is inValid delayed 5 cycles via ShiftRegister: when high, result holds a valid convolution output.


Phase 5: Top-Level Integration & Master FSM

Phase 5 wires the modules from Phases 3–4 into a single ConvAccelTop. It has three jobs:

  • Instantiate ConvDMA, LineBuffer, ConvEngine, and two elastic Queues in one module.
  • Add a 5-state execution FSM that orchestrates the three phases (load kernel, load input, compute+store).
  • Expose a SimpleMemIO port to the outside, connecting to a simulated scratchpad for testing or HellaCache for Chipyard integration.

Note on ConvControl. ConvControl (Phase 2) handles RoCC instruction decode and the 4-state control FSM (sIdle/sBusy/sDone/sError). In the standalone test setup, ConvAccelTop uses its own built-in execution FSM with a simple start/done handshake. ConvControl will be re-integrated as the RoCC-facing wrapper in Phase 6.

ConvAccelTop: Skeleton & IO

ConvAccelTop is a standalone Module with a simple start/done handshake plus three address input ports. The SimpleMemIO bundle carries all memory traffic to a simulated scratchpad or, later, to HellaCache.

               ┌──────────────────────────────┐
        start ─┤                              ├─ done
        kAddr ─┤                              ├─ state[2:0]
        iAddr ─┤         ConvAccelTop         │
        oAddr ─┤                              ├─ mem.req.valid
               │                              ├─ mem.req.bits.addr
               │                              ├─ mem.req.bits.op
               │                              ├─ mem.req.bits.data
               │                              │
mem.rsp.valid ─┤                              │
 mem.rsp.data ─┤                              │
               └──────────────────────────────┘

Signal     | Width | Direction | Role
start      | 1     | Input     | Assert to launch a convolution run
kernelAddr | 64    | Input     | Base address of 5×5 kernel in SRAM
inputAddr  | 64    | Input     | Base address of 32×32 input image
outputAddr | 64    | Input     | Base address for 32×32 output
mem.req    | -     | Output    | Memory read/write request (valid, addr, op, data)
mem.rsp    | -     | Input     | Memory response (valid, data), driven by testbench scratchpad
done       | 1     | Output    | High when convolution completes (FSM reaches sDone)
state      | 3     | Output    | Current FSM state

The three address ports are sampled on the start pulse and held in internal registers: this prevents external changes from corrupting an in-progress run.

Submodule Instantiation & Wiring

Five submodules are instantiated: three built in earlier phases, plus two Chisel built-in Queues.

val dma        = Module(new ConvDMA)                  // Phase 3
val lineBuf    = Module(new LineBuffer)               // Phase 4
val engine     = Module(new ConvEngine)               // Phase 4
val storeQueue = Module(new Queue(SInt(16.W), 2048))  // Chisel built-in FIFO
val inputQueue = Module(new Queue(UInt(16.W), 1024))  // Chisel built-in FIFO

Queue is Chisel’s standard FIFO: it manages read/write pointers internally and applies backpressure automatically when full or empty.

1. io.mem ↔ DMA

io.mem <> dma.io.mem

<> is Chisel’s bulk-connection operator. Both io.mem and dma.io.mem are SimpleMemIO bundles, each containing multiple signals (req.valid, req.bits.addr, rsp.data, etc.). <> connects every like-named signal in one line.

2. DMA loadStream fanout

DMA reads data back through a single loadStream port, but the data heads to two consumers depending on FSM state:

  • sLoadKernel: loadStream → engine.io.kernelData, writing 25 weights into the kernel ROM.
  • sLoadInput: loadStream → inputQueue.io.enq, buffering all 1024 pixels.

Only one state is active at a time, so a when / elsewhen branch suffices: no arbiter or mux is needed.

3. Compute pipeline (three daisy chains)

Three segments, each using standard valid/ready handshakes:

  • inputQueue → LineBuffer: inputQueue.io.deq connects to lineBuf.io.in. Data advances only when the Queue has data (deq.valid) and LineBuffer is ready (in.ready). This path is only active during sLoadInput and sCompute.
  • LineBuffer → ConvEngine: lineBuf.io.colOut feeds engine.io.colIn. colValid carries an extra condition: when engine.stall is asserted, colValid drops, freezing the ConvEngine pipeline.
  • ConvEngine → storeQueue: engine.io.outValid drives storeQueue.io.enq.valid. Each result is pushed into the output queue.

4. storeQueue → DMA (writeback path)

dma.io.storeStream.valid := storeQueue.io.deq.valid
storeQueue.io.deq.ready := dma.io.storeStream.ready

DMA reads results from storeQueue and writes them back to memory. When DMA is mid-burst and cannot accept more data, storeStream.ready drops: the Queue stops dequeuing, and backpressure propagates all the way up the pipeline.

InputQueue & StoreQueue: Elastic Buffers

DMA and ConvEngine work at different rhythms. DMA transfers in bursts: fast but irregular. ConvEngine produces and consumes one pixel per cycle: steady but inflexible. Without buffering, every speed mismatch would stall the pipeline or drop data.

A Queue is a standard FIFO with a ring buffer, read/write pointers, and a fill counter. It exposes two ports, enq (write side) and deq (read side), and manages the valid/ready handshake automatically:

  • Queue is empty: deq.valid = 0 (no data to read).
  • Queue is full: enq.ready = 0 (no room to write).
  • Queue is neither: both enq.ready and deq.valid are 1, and data can flow in and out simultaneously.

inputQueue

val inputQueue = Module(new Queue(UInt(16.W), 1024))

Depth 1024 = one full 32×32 image. DMA fills the queue during sLoadInput, while LineBuffer may already begin draining pixels from it. After the input DMA finishes, any remaining pixels continue draining during sCompute.

storeQueue

val storeQueue = Module(new Queue(SInt(16.W), 2048))

Depth 2048 = 1088 results + 960 slots of headroom. ConvEngine pushes one per cycle; DMA drains the queue and bursts results to memory. When DMA is busy, the queue absorbs the slack until DMA catches up.

Backpressure chain

Backpressure in hardware is a direct wire, not a message-passing protocol. When storeQueue fills up, its internal enq.ready drops from 1 to 0. Two modules are wired directly to this signal:

engine.io.stall  := !storeQueue.io.enq.ready
lineBuf.io.stall := !storeQueue.io.enq.ready

The moment enq.ready falls, both modules see it in the same cycle. The chain reaction completes in two cycles:

storeQueue full
→ storeQueue.io.enq.ready = 0
→ ConvEngine stalls (outValid has nowhere to go)
→ LineBuffer stalls (no new window consumed)
→ inputQueue stops draining, fills up
→ inputQueue.io.enq.ready = 0
→ DMA load stream stalls

When DMA catches up and frees a slot, enq.ready returns to 1, and the pipeline restarts on its own. No handshake, no notification, no software.

Master Execution FSM

The top-level FSM does not schedule individual pixels or convolution windows cycle by cycle. Instead, it enables a group of modules for each phase and lets the valid/ready handshakes move data through the pipeline whenever both sides are ready. In other words, it controls the phase, not every micro-operation.

The five states and their transitions:

val sIdle :: sLoadKernel :: sLoadInput :: sCompute :: sDone :: Nil = Enum(5)
val state = RegInit(sIdle)

// Transition conditions, one per arrow in the state sequence
val goLoadKernel = (state === sIdle)       && io.start
val goLoadInput  = (state === sLoadKernel) && dma.io.done
val goCompute    = (state === sLoadInput)  && dma.io.done
val goDone       = (state === sCompute)    && resultCnt >= 1088.U && dma.io.done

  • sLoadKernel: DMA load stream is routed to ConvEngine’s kernel write port. Each valid DMA word writes one kernel element. After the required kernel elements are loaded, DMA asserts done, and the FSM advances to sLoadInput.
  • sLoadInput: DMA load stream is routed into InputQueue. At the same time, the queue begins draining into LineBuffer: input loading and line-buffer filling overlap. Once LineBuffer has enough pixels to form valid windows, it asserts valid output toward ConvEngine. The compute pipeline is already running while the input DMA is still loading later pixels.
  • sCompute: Input DMA has completed, and the DMA command switches to store mode. The remaining data in the pipeline continues to drain. Meanwhile, StoreQueue feeds the DMA store stream: compute tail and output writeback overlap.

sLoadKernel:  DMA load kernel → ConvEngine kernel ROM

sLoadInput: DMA load input → InputQueue → LineBuffer → ConvEngine → StoreQueue

sCompute: InputQueue → LineBuffer → ConvEngine → StoreQueue → DMA store output
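The phase sequencing above amounts to a five-state next-state function. Here is a plain-Scala behavioural sketch of it, useful as a reference when checking waveforms. MasterFsmModel is an illustrative name, and since the text does not say how the FSM leaves sDone, the model simply holds that state.

```scala
// Behavioural model of the master execution FSM: one transition per phase boundary.
object MasterFsmModel {
  sealed trait State
  case object SIdle extends State
  case object SLoadKernel extends State
  case object SLoadInput extends State
  case object SCompute extends State
  case object SDone extends State

  val ExpectedResults = 1088 // result count the FSM waits for in sCompute

  def next(s: State, start: Boolean, dmaDone: Boolean, resultCnt: Int): State = s match {
    case SIdle if start                                      => SLoadKernel
    case SLoadKernel if dmaDone                              => SLoadInput
    case SLoadInput if dmaDone                               => SCompute
    case SCompute if resultCnt >= ExpectedResults && dmaDone => SDone
    case other                                               => other // hold otherwise
  }

  /** Mirrors a busy flag that is high in every state except sIdle and sDone. */
  def busy(s: State): Boolean = s != SIdle && s != SDone
}
```

Note that sCompute needs both conditions: enough results counted and the store DMA finished, so a late writeback keeps the FSM in sCompute.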

End-to-End Data Flow Walkthrough

The full lifecycle of one convolution run, aligned in time:

time            ────────────────────────────────────────────────────────────────────────────────────────►

io.start              ┌─┐
                ──────┘ └────────────────────────────────────────────────────────────────────────────────

state           sIdle ──► sLoadKernel ──► sLoadInput ─────────────────────► sCompute ─────────────► sDone

dma.cmd                   load_kernel     load_input                        store_output

dma.loadStream            [ kernel data ] [ input pixels ................ ] idle

InputQueue.enq                            [ input pixels ................ ] idle
InputQueue.deq                              [ pixels -> LineBuffer ..................... ][drain]

LineBuffer                                  [ initial fill ][ colValid active .......... ][drain]

ConvEngine                                                  [ compute valid windows .... ][drain]

StoreQueue.enq                                                [ results ....................... ]
StoreQueue.deq                                                              [ results -> DMA .... ]

dma.storeStream                                                             [ output results .... ]

io.done                                                                                             ┌────
                ────────────────────────────────────────────────────────────────────────────────────┘

Three overlaps account for the performance gain:

Overlap         | What happens                                          | When
Load / compute  | DMA fills InputQueue while LineBuffer drains it       | sLoadInput
Compute / store | ConvEngine produces results while DMA drains StoreQueue | sCompute
Pipeline drain  | DMA load done, compute still flushing residual data   | sCompute tail

RoCC Response Protocol Preparation

This section describes how the top-level signals will connect to the RoCC interface during Phase 6 Chipyard integration. The current standalone ConvAccelTop uses a simple start/done handshake: the RoCC io.cmd/io.resp protocol is not yet wired in.

The instruction encoding and software-visible status fields were defined in Phase 0 and implemented in ConvControl (Phase 2). At the top level, the integration work is to connect ConvAccelTop’s execution signals to the RoCC response channel.

A RoCC response is considered transferred when io.resp.valid && io.resp.ready are both high, carrying:

io.resp.bits.rd   // rd from the original command
io.resp.bits.data // acknowledgement or status value

Three response patterns:

  • SET_ADDR_*: Respond immediately after the address register is updated. Returned data is an acknowledgement value: the command only changes configuration state.
  • START_ACCEL: Respond immediately, but this only means the accelerator accepted the start request, not that convolution has finished. After acceptance, the master FSM enters the active states and io.busy remains high until the run completes.
  • POLL_STATUS: Software uses this to observe completion. Response data comes from the status register (busy, done, overflow, addr_err bits).

At the top level, io.busy is driven by the master FSM:

io.busy := state =/= sIdle && state =/= sDone

The current design uses polling rather than interrupts, so io.interrupt stays low. If interrupt support is added later, it can be asserted when the FSM enters sDone.

Debug: Pipeline Drain & tmpRow Corruption

Bug 1: colValid Shuts Off 2 Cycles Too Early

The data path.

LineBuffer outputs 36 columns per row. colValid controls whether ShiftWindow shifts in real data or zeros. ShiftWindow is a 5-column register array: new columns enter at reg(0), old columns exit at reg(4), and the window center is fixed at reg(2).

When colValid = true: the entire window shifts right, colIn → reg(0), reg(4) is discarded.
When colValid = false: the window still shifts right, but 0 → reg(0) instead.

val inImage = outputCol >= 2.U && outputCol <= 33.U   // 32 image columns

inImage serves double duty: it controls both colOut (which data to output) and colValid (whether to mark it valid). The root cause sits right here: these two things should not be tied to one signal.

Tracing the 2-cycle offset.

Data takes 2 cycles to slide from reg(0) to the window center at reg(2):

outputCol=2:  reg = [img_0,      0,      0,      0,     0   ]  center=0       colValid=true
outputCol=3: reg = [img_1, img_0, 0, 0, 0 ] center=0 colValid=true
outputCol=4: reg = [img_2, img_1, img_0, 0, 0 ] center=img_0 colValid=true
...
outputCol=33: reg = [img_31, img_30, img_29, img_28, img_27] center=img_29 colValid=true
outputCol=34: reg = [0, img_31, img_30, img_29, img_28] center=img_30 colValid=false ✗
outputCol=35: reg = [0, 0, img_31, img_30, img_29] center=img_31 colValid=false ✗

The original colValid logic:

when (inImage) {
io.colValid := true.B // outputCol 2..33 → true
}.otherwise {
io.colValid := false.B // outputCol 0-1, 34-35 → false
}

colValid is last high at outputCol=33 and drops at outputCol=34. But img_30 and img_31 are still queued in the pipeline, sliding toward reg(2). By the time they arrive, colValid is already false: the results are computed but never marked valid. That loses 2 outputs per row, and 2 × 32 rows = 64 lost outputs.
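
The 2-cycle offset can be reproduced with a small software model. This is a sketch in plain Scala rather than the real Chisel module: the object and helper names are hypothetical, and None stands in for a zero padding column. It records which image columns reach the window center reg(2) while colValid is high, once with colValid = inImage and once with a colValid that stays high through outputCol 34-35.

```scala
object ColValidModel {
  val Cols = 36
  def inImage(c: Int): Boolean = c >= 2 && c <= 33

  // Image-column indices that sit at the window center reg(2) while colValid is high.
  def validCenters(colValid: Int => Boolean): Vector[Int] = {
    var reg = Vector.fill(5)(Option.empty[Int])   // None models a zero (padding) column
    val seen = Vector.newBuilder[Int]
    for (c <- 0 until Cols) {
      val colIn: Option[Int] = if (inImage(c)) Some(c - 2) else None
      reg = colIn +: reg.init                     // colIn -> reg(0), old reg(4) dropped
      if (colValid(c)) reg(2).foreach(i => seen += i)
    }
    seen.result()
  }

  val original = validCenters(inImage)                     // colValid = inImage
  val extended = validCenters(c => c >= 2 && c <= 35)      // high through 34-35
  val lostPerFrame = (extended.size - original.size) * 32  // 32 rows per frame
}
```

Under these assumptions, original covers img_0 through img_29 only, extended covers all 32 image columns, and lostPerFrame evaluates to 64, matching the count above.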

Root cause.

colValid = inImage conflates two things: whether colOut holds image data and whether the window center holds image data. At outputCol=34-35, colOut is zero (correct, these are padding columns), but img_30 and img_31 are only now reaching reg(2) on their way through the MAC pipeline. colValid should stay high until those pixels drain.

Fix: extend colValid 2 cycles to drain the pipeline.

}.otherwise {
  // colOut set to zero, not read from buffer
  io.colOut   := VecInit.fill(5)(0.S(16.W))
  io.colValid := outputCol >= 34.U && outputCol <= 35.U   // ← extend 2 cycles
}

Why not just stretch inImage to 35? Because bufCol = (outputCol - 2.U)(4,0) maps outputCol to a buffer index. At outputCol=34, bufCol = 32, which wraps to 0 under the 5-bit truncation: colOut would read buffer(row)(0) instead of zero. Extending colValid alone, while keeping colOut at zero in the padding region, separates “what comes out of the buffer” from “whether the pipeline keeps running.”
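
The wraparound is easy to check by hand. The sketch below models the 5-bit slice (4,0) as a bitwise AND with 0x1F in plain Scala; the object name is hypothetical and the function is a software stand-in for the hardware expression, not the module itself.

```scala
object BufColWrap {
  // Mirrors bufCol = (outputCol - 2.U)(4,0): keep only the low 5 bits.
  def bufCol(outputCol: Int): Int = (outputCol - 2) & 0x1F
}
```

With this model, outputCol 2 and 33 map to the first and last buffer columns (0 and 31), while 34 and 35 alias back onto columns 0 and 1, which is exactly the read that a stretched inImage would trigger.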

Bug 2: tmpRow Overwritten During Row Switch

Phenomenon. 737 mismatches. Not sporadic: systematic. Rows 0–4 pass, row 5 and beyond are entirely wrong.

Locating the fault. Initially only the first two output rows were printed, and both looked correct. Expanding the print to all 32 rows revealed the break: row 5’s first pixel was 0x00C0, the value that belongs to row 6, col 0. The test image holds incrementing values (32 per row), so each pixel value identifies the row it came from:

0x0000 = row 0 first pixel
0x0020 = row 1 first pixel (32)
0x00A0 = row 5 first pixel (160)
0x00C0 = row 6 first pixel (192)

row0 out: 0000 0000 0000 0001 0002 ... 001d  ← correct
row1 out: 0000 0000 0020 0021 0022 ... 003d  ← correct
row2 out: 0000 0000 0040 0041 0042 ... 005d  ← correct
row3 out: 0000 0000 0060 0061 0062 ... 007d  ← correct
row4 out: 0000 0000 0080 0081 0082 ... 009d  ← correct
row5 out: 0000 0000 00c0 00c1 00c2 ... 00bd  ← row6 data!!
row6 out: 0000 0000 00e0 00e1 00e2 ...       ← mixed

Row 6 data shifted into row 5’s position: the error jumped a full row at once.
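
Because the test data increments by 32 per row, any output value can be decoded back to its source position. A minimal helper in plain Scala, assuming the 32-pixel row stride seen in the values above (the object and function names are hypothetical):

```scala
object PixelId {
  val RowStride = 32 // pixels per image row in the incrementing test pattern

  // Recover (row, col) of the source pixel from its test value.
  def decode(value: Int): (Int, Int) = (value / RowStride, value % RowStride)
}
```

Decoding the row-5 dump this way gives (6, 0) for 0x00C0 but (5, 29) for the trailing 0x00BD, which suggests that only the start of the row was replaced while the tail still holds genuine row 5 data.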

Tracing the data source. LineBuffer shifts its 5-row buffer up at the end of every output row:

when (outputRow >= 2.U) {
  buffer(0) := buffer(1)
  buffer(1) := buffer(2)
  buffer(2) := buffer(3)
  buffer(3) := buffer(4)
  buffer(4) := tmpRow   // only entry point for new data
}

New data enters the buffer through exactly one path: tmpRow. If buffer holds wrong data, tmpRow was wrong first.

tmpRow is populated pixel-by-pixel during sActive:

when (io.in.valid && io.in.ready) {
  tmpRow(loadCol) := io.in.bits.asSInt
  when (loadCol === 31.U) {
    loadCol := 0.U   // 32 pixels loaded, wrap
  }.otherwise {
    loadCol := loadCol + 1.U
  }
}

io.in.ready is gated by needLoad, which checks row number but not column number:

val needLoad = outputRow >= 2.U && outputRow + 3.U < 32.U
io.in.ready := needLoad && !io.stall   // ← no column gating

Cycle-by-cycle at outputRow=2. DMA has finished streaming row 5’s 32 pixels and continues: it has no concept of padding columns:

outputCol:  0    1    2    3   ...  31   32   33   34   35
            pad  pad  img  img ...  img  img  img  pad  pad
needLoad:   T    T    T    T   ...  T    T    T    T    T

loadCol:    0    1    2    3   ...  31   0    1    2    3    ← wraps after 31!
DMA sends:  R5   R5   R5   R5  ...  R5   R6   R6   R6   R6   ← R5=row5, R6=row6

row6 overwrites tmpRow(0) through tmpRow(3)!

At columns 32–35, loadCol has wrapped back to 0, but needLoad is still true. DMA is already sending row 6 pixels: they overwrite the first four slots of tmpRow. At the row-end shift, buffer(4) := tmpRow pulls the corrupted data into the buffer. After a few rows of shifts, the damage climbs through the buffer and surfaces at the output.

Root cause. loadCol wraps modulo 32. outputCol wraps modulo 36. The 4 padding columns per row create a window where DMA has advanced to the next image row but needLoad is still true, and with no column gating, loadCol wraps to 0 and the start of tmpRow gets overwritten.

Fix. Restrict DMA loading to image columns only:

// Before
io.in.ready := needLoad && !io.stall

// After
io.in.ready := needLoad && inImage && !io.stall

At the padding columns (0–1 and 34–35), inImage is false: loading stops, so exactly 32 pixels are accepted per row, tmpRow stays intact, and the row-end shift propagates correct data into the buffer.
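
The effect of the gating can be checked with a small software model of one output row. This is a sketch in plain Scala under simplifying assumptions: DMA presents a valid pixel every cycle, there is no stall, loadCol starts at 0, and the stream begins at row 5's first pixel (0xA0). The object and function names are hypothetical.

```scala
object TmpRowModel {
  val Width = 32
  def inImage(c: Int): Boolean = c >= 2 && c <= 33

  // Simulates the 36 cycles of one output row; `ready` plays the role of io.in.ready.
  def loadOneOutputRow(firstPixel: Int, ready: Int => Boolean): Vector[Int] = {
    val tmpRow = Array.fill(Width)(-1)
    var loadCol = 0
    var next = firstPixel                 // incrementing test data from DMA
    for (c <- 0 until 36 if ready(c)) {
      tmpRow(loadCol) = next
      next += 1
      loadCol = if (loadCol == Width - 1) 0 else loadCol + 1
    }
    tmpRow.toVector
  }

  val ungated = loadOneOutputRow(0xA0, _ => true)   // before: ready on all 36 cycles
  val gated   = loadOneOutputRow(0xA0, inImage)     // after: ready on image columns only
}
```

In this model the ungated run accepts 36 pixels, so the first four slots end up holding 0xC0 through 0xC3 (row 6), while the gated run accepts exactly 32 and tmpRow holds 0xA0 through 0xBF (row 5) untouched.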


Phase 6: Chipyard Integration & Verilator Build

TODO


Phase 7: Bare-Metal C Test Program

TODO


Phase 8: Performance Report & Summary

TODO