A RISC-V Based RoCC Convolution Accelerator
Contents
- 1. Phase 0: Project Planning & Protocol Design
- 2. Phase 1: Architecture Overview
- 3. Phase 2: ConvControl — Instruction Decode & FSM
- 4. Phase 3: ConvDMA — TileLink DMA Engine
- 5. Phase 4: Compute Datapath — LineBuffer & ConvEngine
- 6. Phase 5: Top-Level Integration & Master FSM
- 7. Phase 6: Chipyard Integration & Verilator Build
- 8. Phase 7: Bare-Metal C Test Program
- 9. Phase 8: Performance Report & Summary
- Repository: MyConvAccel
Phase 0: Project Planning & Protocol Design
Before writing any code, we first settled the fundamental questions.
Protocol Design: RoCC Instruction Encoding
At a high level, the RoCC protocol is simple: the CPU passes register values to the accelerator via custom instructions, and the accelerator responds with a result.
[Diagram: Rocket Core (CPU) and Accelerator]
This accelerator uses a three-step flow: set addresses → trigger → poll. The CPU tells the accelerator where the input matrix, kernel, and output reside in memory, fires a start signal, then polls a status register for completion.
| funct7 | Instruction | rs1 | Description |
|---|---|---|---|
| 0 | SET_ADDR_IN | base addr | Base address of input matrix |
| 1 | SET_ADDR_KER | base addr | Base address of kernel |
| 2 | SET_ADDR_OUT | base addr | Base address of output matrix |
| 3 | START_ACCEL | — | Non-blocking start |
| 4 | POLL_STATUS | — | Read status register to rd |
The status register packs four bits for software polling:
| Bit | Name | Meaning |
|---|---|---|
| 0 | busy | Accelerator is computing |
| 1 | done | Computation complete |
| 2 | overflow | Accumulator overflow |
| 3 | addr_err | Address check failed |
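As a rough illustration of how polling software might interpret the returned word, the sketch below decodes the four bits in plain Scala. Only the bit positions come from the table above; the names `StatusReg`, `Status`, `decode`, and `encode` are hypothetical and are not part of the accelerator source.

```scala
// Software-side model of the status word. Bit positions follow the table above.
// StatusReg, Status, decode and encode are illustrative names only.
object StatusReg {
  final case class Status(busy: Boolean, done: Boolean, overflow: Boolean, addrErr: Boolean)

  private def bit(word: Long, n: Int): Boolean = ((word >> n) & 1L) == 1L

  // Split a polled status word into its four flags.
  def decode(word: Long): Status =
    Status(busy = bit(word, 0), done = bit(word, 1), overflow = bit(word, 2), addrErr = bit(word, 3))

  // Inverse of decode, handy when building expected values in a testbench.
  def encode(s: Status): Long =
    (if (s.busy) 1L else 0L) | (if (s.done) 2L else 0L) |
      (if (s.overflow) 4L else 0L) | (if (s.addrErr) 8L else 0L)
}
```

A polling loop would typically spin while `busy` is set and then inspect `done`, `overflow`, and `addrErr` once.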
Design Rationale
Why split addresses into three separate instructions?
A single RoCC instruction only carries two source operands (`rs1` and `rs2`), not enough to pass three base addresses at once. The alternative is passing a struct pointer, but that forces the accelerator to issue its own DMA read just to fetch the configuration, adding complexity and latency. Three independent instructions pass one address each, keeping the hardware interface simple.
Why non-blocking START + polling instead of blocking for the result?
Non-blocking lets the CPU issue `START_ACCEL` and immediately switch to other work. Polling is preferred over interrupts because a 5×5 convolution completes in a short, predictable window: the overhead of an interrupt controller and context switch would outweigh the benefit.
Why continuous funct7 encoding instead of sparse assignment?
A single `funct7 <= 4` check covers all valid instructions, keeping the hardware decoder minimal. Sparse encoding would add combinational logic with no upside.
Key Design Decisions
Is the convolution kernel size fixed or configurable?
Supporting variable kernel sizes increases control logic complexity significantly. A fixed 5×5 hardware datapath is used. Kernels smaller than 5×5 are pre-zero-padded by software.
How to handle boundary pixels?
Zero-padding: no special edge-case logic is needed in the sliding-window FSM.
Fixed-point precision trade-off.
- Why not floating-point? IEEE 754 multipliers are expensive, consuming significant area and pipeline stages. A Q8.8 fixed-point multiplier is just a 16-bit × 16-bit integer multiplier, producing a result in a single cycle with no extra hardware.
- Why not plain integer? Convolution weights are naturally fractional (e.g. 0.125, -0.5). Plain integers can’t represent them. Q8.8 splits 16 bits evenly: 8 bits for the integer part (±128 range), 8 bits for the fractional part (1/256 precision).
- Why a 32-bit accumulator? A raw 16-bit × 16-bit multiply produces a 32-bit product in Q16.16. With Q8.8 values in the range [-128, 128), each product is at most 2^14 in magnitude, so a sum of 25 products stays below 2^19. Holding that worst case in Q16.16 would take about 20 integer bits plus 16 fractional bits, more than a signed 32-bit accumulator offers: 32 bits of Q16.16 cover sums in [-32768, 32768). That range is ample for image data convolved with small fractional weights, which is what this fixed 5×5 datapath targets, and sums that fall outside it are what the overflow bit in the status register reports.
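The fixed-point arithmetic in the list above can be tried out in software. The following is a minimal sketch in plain Scala, not the hardware implementation: the names `Q88`, `toQ88`, `fromQ16_16`, and `mac25` are hypothetical, and the overflow rule used here (the sum leaves the signed 32-bit range) is an assumption about how the status bit would be derived.

```scala
// Q8.8 helpers and a 25-term multiply-accumulate with overflow detection.
// All names are illustrative; the overflow rule is an assumption, not taken from the RTL.
object Q88 {
  // Real value -> 16-bit Q8.8 raw value (rounded, saturated to the signed 16-bit range).
  def toQ88(x: Double): Int = {
    val raw = math.round(x * 256.0)
    math.max(-32768L, math.min(32767L, raw)).toInt
  }

  // Q16.16 raw value -> real value.
  def fromQ16_16(raw: Long): Double = raw / 65536.0

  // Sum of pixel * weight products, kept in Q16.16.
  // Returns (sum, overflowed), where overflowed means the sum does not fit in 32 signed bits.
  def mac25(pixels: Seq[Int], weights: Seq[Int]): (Long, Boolean) = {
    require(pixels.length == 25 && weights.length == 25)
    val sum = pixels.zip(weights).map { case (p, w) => p.toLong * w.toLong }.sum
    (sum, sum > Int.MaxValue || sum < Int.MinValue)
  }
}
```

For example, 25 pixels of value 1.0 weighted by 0.125 each accumulate to 3.125, and `fromQ16_16` recovers exactly that value from the raw sum.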
Why alignment constraints?
TileLink DMA works best when addresses are word-aligned. An unaligned address forces the bus to split a single transfer into multiple chunks, and the hardware must shift and stitch bytes back together. The accelerator checks the address on every `SET_ADDR_*` and rejects unaligned ones. This keeps the hardware simple and pushes alignment responsibility to software.
| Parameter | Value |
|---|---|
| Input Matrix | 32×32 |
| Max Kernel | 5×5 |
| Data Format | Q8.8 Fixed Point (16-bit) |
| Accumulator | 32-bit |
| Output | Same size, zero-padding |
| Performance Target | <2500 cycles, ≥40× speedup |
Phase 1: Architecture Overview
The Problem: Running Convolution on a CPU
A 5×5 convolution over a 32×32 matrix means sliding a 5×5 window across 1024 output positions. Each output pixel requires 25 multiplies and 24 additions: 25,600 multiply-accumulates in total. On a general-purpose CPU, the arithmetic itself is not the bottleneck. The real bottleneck is the sliding. Every slide requires computing address offsets, updating loop counters, loading pixels and weights from memory, and storing results back. Most instructions are spent on loop control and address bookkeeping, not math.
A rough estimate: on a simple in-order RISC-V core, each output pixel costs ~100–150 cycles (two loads, one store, ~50 arithmetic ops, plus loop branches). Across 1024 pixels, that adds up to 100,000–150,000 cycles. In RTL simulation, the accelerator completes the same 32×32 case in 2428 cycles. Against the rough CPU estimate above, that suggests a speedup of roughly 40–60×. A measured bare-metal baseline is left for Phase 8.
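For reference, the computation the CPU would have to perform can be written as a short software model. This is a sketch under stated assumptions rather than the project's actual golden model: the name `RefConv` is hypothetical, the window is applied without flipping the kernel, inputs are raw Q8.8 values in row-major order, and the Q16.16 accumulator is converted back to Q8.8 with an arithmetic shift right by 8.

```scala
// Reference sketch of a 5x5 same-size convolution with zero padding.
// Assumptions: raw Q8.8 inputs, no kernel flip, result = accumulator >> 8.
object RefConv {
  val K = 5
  val Pad = K / 2

  def conv(image: Array[Array[Int]], kernel: Array[Array[Int]]): Array[Array[Int]] = {
    val rows = image.length
    val cols = image(0).length
    Array.tabulate(rows, cols) { (r, c) =>
      var acc = 0L
      for (i <- 0 until K; j <- 0 until K) {
        val rr = r + i - Pad
        val cc = c + j - Pad
        // Pixels outside the image contribute zero (zero padding).
        if (rr >= 0 && rr < rows && cc >= 0 && cc < cols)
          acc += image(rr)(cc).toLong * kernel(i)(j).toLong
      }
      (acc >> 8).toInt
    }
  }
}
```

The two nested loops per output pixel, plus the bounds checks, are exactly the bookkeeping that dominates the CPU cycle count.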
The Fix: Offload to an Accelerator
The CPU is not built for sliding-window number crunching, so that job is handed to a RoCC accelerator. The accelerator does three things: fetch data from SRAM, run the convolution, and write results back. The CPU’s role shrinks to three steps: set addresses, issue START, and poll.
What does the accelerator look like inside?
The naive approach is a serial pipeline: DMA load → MAC compute → DMA store. Simple, but wasteful: the DMA waits for the MAC, and the MAC waits for the DMA. Only one piece of hardware is active at any moment.
What we actually built overlaps the pieces. An inputQueue preloads the entire matrix so that LineBuffer can start filling while DMA is still loading later rows. The ConvEngine runs a 6-stage pipeline, producing one result per cycle once full. A storeQueue absorbs results during computation, and DMA drains the queue to write them back as the pipeline drains. The result is partial overlap: input loading overlaps with LineBuffer filling and early compute, while the compute tail overlaps with output writeback — 2428 cycles end to end.
Module Map
| Module | Role | Phase |
|---|---|---|
| ConvControl | RoCC decode + 4-state control FSM + status register | 2 |
| ConvDMA | 7-state DMA engine over TileLink | 3 |
| LineBuffer + ConvEngine | Sliding window + MAC pipeline, 6-stage compute datapath | 4 |
| InputQueue + StoreQueue | Elastic buffers with backpressure | 5 |
The following phases walk through each module in detail.
Phase 2: ConvControl — Instruction Decode & FSM
Phase 1 drew the architecture. The top line, RoCC control, is what Phase 2 is about. The CPU issues a custom instruction carrying funct7. ConvControl decodes it, executes it, and responds via rd.
Interface
ConvControl talks to the CPU through five signals and a valid/ready handshake: instrCmd.valid / funct7 / rs1 / rd arrive from the CPU, and instrReady goes back. When both valid and ready are high on the same cycle, the instruction fires.
Instruction Decode
Decoding uses five comparators: funct7 === 0.U through funct7 === 4.U. No priority encoder, no lookup table. The continuous encoding from Phase 0 pays off: funct7 is literally the instruction number.
Not every instruction is welcome at every moment. The rules:
- SET (0–2): blocked in sBusy. Changing addresses mid-computation would corrupt the run.
- START (3): only accepted in sIdle or sDone. You cannot start an already-running accelerator.
- POLL (4): always accepted. It is a pure read: it cannot interfere with anything.
SET_ADDR: Storing Three Base Addresses
Depending on funct7, rs1 is written into addrIn, addrKer, or addrOut. If the accelerator is in sError, any SET clears it back to sIdle.
START_ACCEL: Validate and Fire
Address non-zero check
If any base address (`addrIn`, `addrKer`, `addrOut`) is zero, the state goes to sError.
Address alignment check
Input and output matrices require 8-byte alignment (`addrIn(2,0) === 0.U`). The kernel requires 2-byte alignment (`addrKer(0) === 0.U`). The DMA bus is 64 bits wide: each transfer moves 4 pixels. The kernel has only 25 coefficients, so 2-byte alignment is sufficient.
If all checks pass: sIdle → sBusy, and computation begins. If any check fails: sError, and the `addr_err` bit in the status register goes high.
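The two START checks reduce to a handful of bit tests. The sketch below mirrors them in plain Scala so they can be exercised from a testbench; `AddrCheck` and `startAllowed` are hypothetical names, and applying the 8-byte rule to the output base address follows the prose above rather than any code shown in this post.

```scala
// Software mirror of the START_ACCEL address checks described above.
// AddrCheck and startAllowed are illustrative names.
object AddrCheck {
  // Returns true when all three base addresses pass the non-zero and alignment checks.
  def startAllowed(addrIn: Long, addrKer: Long, addrOut: Long): Boolean = {
    val nonZero = addrIn != 0L && addrKer != 0L && addrOut != 0L
    val inAligned = (addrIn & 0x7L) == 0L   // addrIn(2,0) === 0.U
    val outAligned = (addrOut & 0x7L) == 0L // same 8-byte rule for the output matrix
    val kerAligned = (addrKer & 0x1L) == 0L // addrKer(0) === 0.U
    nonZero && inAligned && outAligned && kerAligned
  }
}
```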
FSM: Four States
[State diagram; surviving edge labels: START pass, done]
- sIdle: reset default. Waits for START.
- sBusy: only accepts POLL. Counts down to zero, then raises `done` and moves to sDone.
- sDone: holds `done = 1`. CPU can re-START (→ sBusy) or SET to reconfigure (→ sIdle).
- sError: holds `addrErr = 1`. Only SET pulls it back to sIdle: no direct path to sBusy.
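The four states and the acceptance rules can be captured as a pure transition function, which is convenient as an oracle when testing the Chisel module. This is a sketch only: `ConvControlModel`, the event names, and the choice to send a failed START from sDone to sError are assumptions made for illustration.

```scala
// Transition-function sketch of the ConvControl FSM.
// State and event names are illustrative; Finished stands for the internal countdown reaching zero.
object ConvControlModel {
  sealed trait State
  case object Idle extends State
  case object Busy extends State
  case object Done extends State
  case object Error extends State

  sealed trait Event
  case object SetAddr extends Event
  final case class Start(addrsValid: Boolean) extends Event
  case object Poll extends Event
  case object Finished extends Event

  def next(s: State, e: Event): State = (s, e) match {
    case (Busy, Finished)            => Done
    case (Busy, _)                   => Busy  // SET and START are refused while busy
    case (Idle | Done, Start(true))  => Busy
    case (Idle | Done, Start(false)) => Error // an address check failed
    case (Done | Error, SetAddr)     => Idle
    case (state, _)                  => state // POLL and everything else leave the state alone
  }
}
```

Note that there is no case taking `Error` to `Busy`: a START issued in the error state falls through to the last line and changes nothing.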
Phase 3: ConvDMA — TileLink DMA Engine
The convolution engine constantly reads pixels and weights from SRAM, then writes results back. A dedicated DMA engine manages these transfers. Phase 3 starts with a strictly serial DMA, one request in flight at a time. Once verified, we add pipelining so multiple requests can be issued without waiting for the prior response.
DMA Interface
Three Bundles connect the DMA to the L1 data cache: signals only, no logic. Valid/ready handshakes come from Decoupled.
- SimpleMemReq: DMA’s request to L1. Carries 64-bit address, 64-bit write data, byte mask, a read/write flag, and a 4-bit tag. Tag is 0 in the serial DMA; reserved so a pipelined version can match out-of-order responses.
- SimpleMemResp: L1’s reply. Returns 64-bit read data and echoes the request’s tag.
- SimpleMemIO: bundles `req` (DMA→L1) and `resp` (L1→DMA, `Flipped`) into one port.
[Diagram: ConvDMA, SimpleMemIO, L1 data cache]
Serial FSM
ConvDMA moves data between SRAM and the compute unit through two paths:
- Load path (sIdle → sIssue → sWaitResp → sUnpack → loop): reads 64-bit words from memory, unpacks each into four 16-bit elements, pushes them into `elemQueue`.
- Store path (sIdle → sGather → sIssue → loop): collects four 16-bit elements, packs them into a 64-bit word, writes it to memory.
Both paths share sIssue, branching on opReg. At most one request is in flight at any time.
```
Load path: sIdle → sIssue → sWaitResp → sUnpack(×4)
```
| State | Responsibility | Path |
|---|---|---|
| sIdle | Accept command, latch baseAddr and length, check alignment | common |
| sIssue | Fire mem.req (read or write, depending on opReg) | shared |
| sWaitResp | Wait for mem.resp.valid, latch response data | load |
| sUnpack | 4 cycles: slice 64-bit word into 4 × 16-bit elements, push into elemQueue | load |
| sGather | 4 cycles: collect 4 × 16-bit elements from elemQueue, pack into 64-bit word | store |
| sDone | Transfer complete, waiting for upper layer to consume result | common |
| sError | Address misaligned, waiting for a valid address | common |
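What sUnpack and sGather do to a word can be shown in a few lines of plain Scala. This is a sketch, and the lane order is an assumption: element 0 is taken to occupy bits 15..0 of the 64-bit word, which the text does not state. `WordPacking`, `unpack`, and `pack` are hypothetical names.

```scala
// Sketch of the 64-bit <-> 4 x 16-bit conversion done by sUnpack and sGather.
// Assumed lane order: element i sits in bits 16*i+15 .. 16*i.
object WordPacking {
  // Split one 64-bit word into four 16-bit elements (returned as unsigned values 0..65535).
  def unpack(word: Long): Seq[Int] =
    (0 until 4).map(i => ((word >>> (16 * i)) & 0xFFFFL).toInt)

  // Inverse of unpack: combine four 16-bit elements into one 64-bit word.
  def pack(elems: Seq[Int]): Long = {
    require(elems.length == 4)
    elems.zipWithIndex
      .map { case (e, i) => (e.toLong & 0xFFFFL) << (16 * i) }
      .foldLeft(0L)(_ | _)
  }
}
```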
Bottleneck: 6 cycles/word
Take the load path. One word goes through sIssue → sWaitResp → sUnpack×4 → back to sIssue: exactly 6 cycles. Memory sits idle for 4 of those 6 cycles during unpack. The bottleneck is not memory latency, it’s the FSM refusing to overlap issue with unpack.
[Cycle trace; columns: cyc, state, mem.req, mem.resp, loadStream]
256 words × 6 + 1 cycle overhead = 1537 cycles. The next section fixes this.
Pipelined DMA
The core idea: decouple issue and unpack into two concurrent hardware processes.
A response FIFO absorbs the rate mismatch: the issue engine fires at 1 word/cycle (up to the inflight limit), responses drop into the FIFO automatically, and the unpack engine drains the FIFO at 4 cycles/word. Neither blocks the other. 1537 cycles → ~1033 cycles, a 33% improvement. Step-by-step implementation details are in a follow-up post.
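The cycle counts quoted in this section follow from simple arithmetic, sketched below. The object name `DmaCycles` and its helpers are hypothetical. The pipelined figure computed here is only a lower bound set by the unpack rate (4 cycles per word); the ~1033 cycles quoted above includes start-up overhead that this sketch does not model.

```scala
// Back-of-the-envelope cycle counts for loading a 32x32 image of 16-bit pixels.
object DmaCycles {
  val pixels = 32 * 32
  val pixelsPerWord = 4                // 64-bit word / 16-bit pixel
  val words = pixels / pixelsPerWord   // 256

  // Serial FSM: 1 issue + 1 wait + 4 unpack cycles per word, plus one cycle of overhead.
  def serial(words: Int): Int = words * 6 + 1

  // With issue overlapped, unpack at 4 cycles/word is the limiting rate.
  def pipelinedLowerBound(words: Int): Int = words * 4

  // Fractional reduction in cycle count.
  def improvement(before: Int, after: Int): Double = (before - after).toDouble / before
}
```

With the quoted figures, `improvement(1537, 1033)` comes out at roughly 0.33.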
Changes from Serial to Pipelined
1. Register → Queue
The serial DMA uses a single respWord register to latch one response at a time. The pipelined version replaces it with a response FIFO. Data drops into the FIFO automatically; the FSM only reads from the FIFO when it needs data.
2. sWaitResp disappears
The FSM no longer waits for mem.resp.valid or latches response data itself. All that remains of the old sWaitResp is waiting until the FIFO has data to pop.
3. inflightCount: credit-based flow control
The issue engine fires faster (1 word/cycle) than the unpack engine consumes (1 element/cycle → 0.25 word/cycle). Without a cap, the FIFO overflows. inflightCount acts as a credit account: issue spends a credit per request, unpack returns a credit per word consumed. inflightMax = 4 caps the window at 4 in-flight words.
4. sLoadActive: concurrent FSM
The three serial states sIssue, sWaitResp, and sUnpack merge into a single sLoadActive state. Inside, two independent when blocks run concurrently: the issue engine fires requests up to inflightMax, and the unpack engine drains the FIFO. Neither engine waits for the other.
A trace of the first nine cycles:
[Cycle trace; columns: cyc, issue engine, unpack engine, inflight, what happened]
Key observations:
- Cycles 0–3: The issue engine fires four read requests back-to-back before unpack starts. By cycle 3, inflight hits the cap of 4.
- Cycles 5 and 9: Both engines advance in the same cycle. Issue fires (+1) while unpack finishes a word (−1). The net change to `inflightCount` is zero.
- Cycles 4 and 6–8: Issue is blocked because `inflightCount == inflightMax`.
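The only shared state is inflightCount, and both engines can change it in the same cycle. If each engine assigned the register in its own when block, the later assignment would override the earlier one and a credit would be lost. The fix is a single expression: inflightCount + issueFired − unpackWordDone. Both boolean events feed one arithmetic operation: +1 and −1 cancel cleanly when they coincide, and neither update is lost.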
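The difference between the single-expression update and two independent assignments is easy to see in a software model. The sketch below is plain Scala, not the Chisel source; `InflightCredit`, `next`, and `nextLastWins` are hypothetical names, and `nextLastWins` imitates what happens when two separate conditional assignments target the same register and the later one takes precedence.

```scala
// Model of the inflightCount update rule.
object InflightCredit {
  val inflightMax = 4

  private def b(x: Boolean): Int = if (x) 1 else 0

  // Single-expression update: both events contribute in the same cycle.
  def next(count: Int, issueFired: Boolean, unpackWordDone: Boolean): Int =
    count + b(issueFired) - b(unpackWordDone)

  // Contrast case: two independent conditional assignments, later one wins.
  // When both events coincide, the +1 from the issue engine is lost.
  def nextLastWins(count: Int, issueFired: Boolean, unpackWordDone: Boolean): Int = {
    var n = count
    if (issueFired) n = count + 1
    if (unpackWordDone) n = count - 1
    n
  }
}
```

With a count of 4 and both events in the same cycle, `next` returns 4 while `nextLastWins` returns 3, leaving one request unaccounted for.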
Phase 4: Compute Datapath — LineBuffer & ConvEngine
DMA delivers pixels one at a time in row-major order: row 0 left to right, then row 1, then row 2. But a 5×5 convolution at output pixel (r, c) needs 25 pixels arranged as a 5×5 neighbourhood centred at (r+2, c+2). A single pixel from the DMA is useless on its own: the compute datapath must assemble an entire 5×5 window before the MAC unit can start.
This is a data reshape problem. It splits naturally into two dimensions, so two modules are paired: LineBuffer handles the vertical direction (rows), ShiftWindow handles the horizontal direction (columns). LineBuffer collects 5 rows of 32 pixels from the DMA stream, then outputs one column of 5 vertically adjacent pixels per cycle. ShiftWindow buffers 5 consecutive columns from LineBuffer, shifting right each cycle, and outputs a full 5×5 window to the MAC unit.
Part A: LineBuffer
Why not read directly from SRAM? Each output pixel needs values from 5 different rows. Fetching them directly would require 5 independent reads per cycle targeting 5 different addresses: five memory ports. LineBuffer replaces that with one write port (DMA feeds in one pixel per cycle) and one 5-wide read port (5 rows × same column). A 160-entry register file is far cheaper than a 5-port SRAM.
```
sIdle ──► sPrime (load 5 rows) ──► sActive (32 output rows) ──► sDone
```
In sActive, while the buffer emits 36 columns per row, DMA loads the next input row into a separate tmpRow register. When the row ends, the buffer shifts up: row 0 is discarded, rows 1..4 move up one slot, and tmpRow enters as row 4. Without tmpRow, DMA would overwrite rows still being output, and load and output could not overlap.
Zero-padding covers all four image borders through two mechanisms:
- Top / bottom: Handled by what the five buffer rows hold. At output row 0, the top two slots are zero; at output row 31, the bottom two are zero. As the window slides down, real rows rotate in and out: no extra control logic.
- Left / right: Each output row emits 36 columns (2 padding + 32 data + 2 padding). The `colValid` signal marks data columns. When low, ShiftWindow fills zeros regardless of `colOut`.
Part B: ShiftWindow → KernelROM → ConvUnit
LineBuffer outputs one 5-pixel column per cycle. The MAC unit needs a full 5×5 window. Three modules bridge the gap:
ShiftWindow: 5×5 register window. A 5×5 register array. Each cycle, all columns shift right by one: the oldest (c4) is dropped, and the new column from LineBuffer enters c0. When colValid is low, c0 gets zeros instead, implementing left/right padding. The 5×5 array is a combinational output: 400 bits of registers, cheaper than BRAM and zero read latency.
KernelROM: weight storage. 25-entry register file. Weights are written once before computation starts and remain read-only throughout. Combinational 5×5 output with zero latency, so ConvUnit receives both window and kernel operands in the same cycle.
ConvUnit: 5-stage MAC pipeline. A single combinational multiply-accumulate chain (25 multiplies + tree reduction) would have a critical path too long to meet timing. The fix: slice the pairwise addition tree into 5 pipeline stages, each doing one 32-bit addition.
```
Stage 0 (combinational): 25 parallel 16×16→32 multiplies
```
Why pairwise instead of Wallace tree? The binary tree is regular, depth is exactly ceil(log₂ 25) = 5 levels, and registers slip naturally between levels: critical path becomes a single 32-bit add.
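The five-level depth can be checked with a small software model of the pairwise tree. This is a sketch in plain Scala with hypothetical names (`AdderTree`, `level`, `reduce`); it assumes that an odd element at the end of a level is passed through unchanged, which gives level widths of 25 → 13 → 7 → 4 → 2 → 1.

```scala
// Pairwise reduction tree: each level adds neighbours two at a time.
object AdderTree {
  // One tree level; a trailing odd element passes through unchanged.
  def level(xs: Seq[Long]): Seq[Long] = xs.grouped(2).map(_.sum).toSeq

  // Reduce to a single value, returning (sum, number of levels used).
  def reduce(xs: Seq[Long]): (Long, Int) = {
    require(xs.nonEmpty)
    var cur = xs
    var levels = 0
    while (cur.length > 1) {
      cur = level(cur)
      levels += 1
    }
    (cur.head, levels)
  }
}
```

In hardware, a register after each call to `level` is what turns the five levels into five pipeline stages.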
ConvEngine: top-level glue. Instantiates ShiftWindow, KernelROM, and ConvUnit, connecting colIn / colValid to ShiftWindow and kernel / window to ConvUnit. inValid is delayed one cycle (RegNext) to align with ShiftWindow’s registered output. A stall input gates colValid to freeze the entire pipeline under backpressure. outValid is inValid delayed 5 cycles via ShiftRegister: when high, result holds a valid convolution output.
Phase 5: Top-Level Integration & Master FSM
Phase 5 wires the four modules from Phases 2–4 into a single ConvAccelTop. It has three jobs:
- Instantiate ConvDMA, LineBuffer, ConvEngine, and two elastic Queues in one module.
- Add a 5-state execution FSM that orchestrates the three phases (load kernel, load input, compute+store).
- Expose a `SimpleMemIO` port to the outside, connecting to a simulated scratchpad for testing or HellaCache for Chipyard integration.
Note on ConvControl. ConvControl (Phase 2) handles RoCC instruction decode and the 4-state control FSM (sIdle/sBusy/sDone/sError). In the standalone test setup, ConvAccelTop uses its own built-in execution FSM with a simple `start`/`done` handshake. ConvControl will be re-integrated as the RoCC-facing wrapper in Phase 6.
ConvAccelTop: Skeleton & IO
ConvAccelTop is a standalone Module with a simple start/done handshake plus three memory-mapped address ports. The SimpleMemIO bundle carries all memory traffic to a simulated scratchpad or, later, to HellaCache.
| Signal | Width | Direction | Role |
|---|---|---|---|
| start | 1 | Input | Assert to launch a convolution run |
| kernelAddr | 64 | Input | Base address of 5×5 kernel in SRAM |
| inputAddr | 64 | Input | Base address of 32×32 input image |
| outputAddr | 64 | Input | Base address for 32×32 output |
| mem.req | — | Output | Memory read/write request (valid, addr, op, data) |
| mem.rsp | — | Input | Memory response (valid, data), driven by testbench scratchpad |
| done | 1 | Output | High when convolution completes (FSM reaches sDone) |
| state | 3 | Output | Current FSM state |
The three address ports are sampled on the start pulse and held in internal registers: this prevents external changes from corrupting an in-progress run.
Submodule Instantiation & Wiring
Five submodules are instantiated: three built in earlier phases, plus two Chisel built-in Queues.
```scala
val dma = Module(new ConvDMA) // Phase 3
```
Queue is Chisel’s standard FIFO: it manages read/write pointers internally and applies backpressure automatically when full or empty.
1. io.mem ↔ DMA
```scala
io.mem <> dma.io.mem
```
<> is Chisel’s bulk-connection operator. Both io.mem and dma.io.mem are SimpleMemIO bundles, each containing multiple signals (req.valid, req.bits.addr, rsp.data, etc.). <> connects every like-named signal in one line.
2. DMA loadStream fanout
DMA reads data back through a single loadStream port, but the data heads to two consumers depending on FSM state:
- `sLoadKernel`: `loadStream` → `engine.io.kernelData`, writing 25 weights into the kernel ROM.
- `sLoadInput`: `loadStream` → `inputQueue.io.enq`, buffering all 1024 pixels.
Only one state is active at a time, so a when / elsewhen branch suffices: no arbiter or mux is needed.
3. Compute pipeline (three daisy chains)
Three segments, each using standard valid/ready handshakes:
- inputQueue → LineBuffer: `inputQueue.io.deq` connects to `lineBuf.io.in`. Data advances only when the Queue has data (`deq.valid`) and LineBuffer is ready (`in.ready`). This path is only active during `sLoadInput` and `sCompute`.
- LineBuffer → ConvEngine: `lineBuf.io.colOut` feeds `engine.io.colIn`. `colValid` carries an extra condition: when `engine.stall` is asserted, colValid drops, freezing the ConvEngine pipeline.
- ConvEngine → storeQueue: `engine.io.outValid` drives `storeQueue.io.enq.valid`. Each result is pushed into the output queue.
4. storeQueue → DMA (writeback path)
```scala
dma.io.storeStream.valid := storeQueue.io.deq.valid
```
DMA reads results from storeQueue and writes them back to memory. When DMA is mid-burst and cannot accept more data, storeStream.ready drops: the Queue stops dequeuing, and backpressure propagates all the way up the pipeline.
InputQueue & StoreQueue: Elastic Buffers
DMA and ConvEngine work at different rhythms. DMA transfers in bursts: fast but irregular. ConvEngine produces and consumes one pixel per cycle: steady but inflexible. Without buffering, every speed mismatch would stall the pipeline or drop data.
A Queue is a standard FIFO with a ring buffer, read/write pointers, and a fill counter. It exposes two ports, enq (write side) and deq (read side), and manages the valid/ready handshake automatically:
- Queue is empty: `deq.valid` = 0 (no data to read).
- Queue is full: `enq.ready` = 0 (no room to write).
- Queue is neither: both `enq.ready` and `deq.valid` are 1, and data can flow in and out simultaneously.
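The three cases above can be reproduced with a small behavioural model. The sketch below is plain Scala and only imitates the handshake at cycle granularity; it is not Chisel's `Queue` implementation. `ElasticQueue` and `step` are hypothetical names, and the model assumes that a full queue refuses a write even if a read happens in the same cycle.

```scala
// Behavioural sketch of a bounded FIFO with ready/valid semantics.
final class ElasticQueue[T](depth: Int) {
  private val buf = scala.collection.mutable.Queue.empty[T]

  def enqReady: Boolean = buf.size < depth // room to write
  def deqValid: Boolean = buf.nonEmpty     // data to read
  def size: Int = buf.size

  // One clock cycle. Both handshakes are evaluated on the state at the start of the cycle.
  // Returns the element that left the queue this cycle, if any.
  def step(enq: Option[T], deqReady: Boolean): Option[T] = {
    val doEnq = enq.isDefined && enqReady
    val doDeq = deqReady && deqValid
    val out = if (doDeq) Some(buf.dequeue()) else None
    if (doEnq) buf.enqueue(enq.get)
    out
  }
}
```

A producer that sees `enqReady` low simply holds its data and retries next cycle, which is the backpressure behaviour the rest of this section relies on.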
inputQueue
```scala
val inputQueue = Module(new Queue(UInt(16.W), 1024))
```
Depth 1024 = one full 32×32 image. DMA fills the queue during sLoadInput, while LineBuffer may already begin draining pixels from it. After the input DMA finishes, any remaining pixels continue draining during sCompute.
storeQueue
```scala
val storeQueue = Module(new Queue(SInt(16.W), 2048))
```
Depth 2048 = 1088 results + 960 slots of headroom. ConvEngine pushes one per cycle; DMA drains the queue and bursts results to memory. When DMA is busy, the queue absorbs the slack until DMA catches up.
Backpressure chain
Backpressure in hardware is a direct wire, not a message-passing protocol. When storeQueue fills up, its internal enq.ready drops from 1 to 0. Two modules are wired directly to this signal:
```scala
engine.io.stall := !storeQueue.io.enq.ready
```
The moment enq.ready falls, both modules see it in the same cycle. The chain reaction completes in two cycles:
```
storeQueue full
```
When DMA catches up and frees a slot, enq.ready returns to 1, and the pipeline restarts on its own. No handshake, no notification, no software.
Master Execution FSM
The top-level FSM does not schedule individual pixels or convolution windows cycle by cycle. Instead, it enables a group of modules for each phase and lets the valid/ready handshakes move data through the pipeline whenever both sides are ready. In other words, it controls the phase, not every micro-operation.
The five states and their transitions:
```scala
val sIdle :: sLoadKernel :: sLoadInput :: sCompute :: sDone :: Nil = Enum(5)
```
- sLoadKernel: DMA load stream is routed to `ConvEngine`'s kernel write port. Each valid DMA word writes one kernel element. After the required kernel elements are loaded, DMA asserts `done`, and the FSM advances to `sLoadInput`.
- sLoadInput: DMA load stream is routed into `InputQueue`. At the same time, the queue begins draining into `LineBuffer`: input loading and line-buffer filling overlap. Once `LineBuffer` has enough pixels to form valid windows, it asserts valid output toward `ConvEngine`. The compute pipeline is already running while the input DMA is still loading later pixels.
- sCompute: Input DMA has completed, and the DMA command switches to store mode. The remaining data in the pipeline continues to drain. Meanwhile, `StoreQueue` feeds the DMA store stream: compute tail and output writeback overlap.
```
sLoadKernel: DMA load kernel → ConvEngine kernel ROM
```
End-to-End Data Flow Walkthrough
The full lifecycle of one convolution run, aligned in time:
[Timeline figure: one convolution run along a time axis]
Three overlaps account for the performance gain:
| Overlap | What happens | When |
|---|---|---|
| Load / compute | DMA fills InputQueue while LineBuffer drains it | sLoadInput |
| Compute / store | ConvEngine produces results while DMA drains StoreQueue | sCompute |
| Pipeline drain | DMA load done, compute still flushing residual data | sCompute tail |
RoCC Response Protocol Preparation
This section describes how the top-level signals will connect to the RoCC interface during Phase 6 Chipyard integration. The current standalone `ConvAccelTop` uses a simple `start`/`done` handshake: the RoCC `io.cmd`/`io.resp` protocol is not yet wired in.
The instruction encoding and software-visible status fields were defined in Phase 0 and implemented in ConvControl (Phase 2). At the top level, the integration work is to connect ConvAccelTop’s execution signals to the RoCC response channel.
A RoCC response is considered transferred when io.resp.valid && io.resp.ready are both high, carrying:
```scala
io.resp.bits.rd // rd from the original command
```
Three response patterns:
- `SET_ADDR_*`: Respond immediately after the address register is updated. Returned data is an acknowledgement value: the command only changes configuration state.
- `START_ACCEL`: Respond immediately, but this only means the accelerator accepted the start request, not that convolution has finished. After acceptance, the master FSM enters the active states and `io.busy` remains high until the run completes.
- `POLL_STATUS`: Software uses this to observe completion. Response data comes from the status register (busy, done, overflow, addr_err bits).
At the top level, io.busy is driven by the master FSM:
```scala
io.busy := state =/= sIdle && state =/= sDone
```
The current design uses polling rather than interrupts, so io.interrupt stays low. If interrupt support is added later, it can be asserted when the FSM enters sDone.
Debug: Pipeline Drain & tmpRow Corruption
Bug 1: colValid Shuts Off 2 Cycles Too Early
The data path.
LineBuffer outputs 36 columns per row. colValid controls whether ShiftWindow shifts in real data or zeros. ShiftWindow is a 5-column register array: new columns enter at reg(0), old columns exit at reg(4), and the window center is fixed at reg(2).
When colValid = true: the entire window shifts right, colIn → reg(0), reg(4) is discarded.
When colValid = false: the window still shifts right, but 0 → reg(0) instead.
```scala
val inImage = outputCol >= 2.U && outputCol <= 33.U // 32 image columns
```
inImage serves double duty: it controls both colOut (which data to output) and colValid (whether to mark it valid). The root cause sits right here: these two things should not be tied to one signal.
Tracing the 2-cycle offset.
Data takes 2 cycles to slide from reg(0) to the window center at reg(2):
```
outputCol=2: reg = [img_0, 0, 0, 0, 0 ] center=0 colValid=true
```
The original colValid logic:
```scala
when (inImage) {
```
colValid drops at outputCol=33. But img_30 and img_31 are still queued in the pipeline, sliding toward reg(2). When they arrive, colValid is already false: the results are computed but never marked valid. 2 lost per row × 32 rows = 64 lost.
Root cause.
colValid = inImage conflates two things: whether colOut holds image data vs. whether the window center holds image data. At outputCol=34-35, colOut is zero (correct, these are padding columns), but img_30 and img_31 still sit in reg(2), mid-flight through the MAC pipeline. colValid should stay high until those pixels drain.
Fix: extend colValid 2 cycles to drain the pipeline.
```scala
}.otherwise {
```
Why not just stretch inImage to 35? Because bufCol = (outputCol - 2.U)(4,0) maps outputCol to a buffer index. At outputCol=34, bufCol = 32, which wraps to 0 under the 5-bit truncation: colOut would read buffer(row)(0) instead of zero. Extending colValid alone, while keeping colOut at zero in the padding region, separates “what comes out of the buffer” from “whether the pipeline keeps running.”
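The wrap-around is quick to confirm in software. The sketch below models the `(outputCol - 2.U)(4,0)` expression in plain Scala as a subtraction followed by a 5-bit mask; `BufColWrap` and `bufCol` are hypothetical names used only for this illustration.

```scala
// Model of bufCol = (outputCol - 2.U)(4,0): subtract, then keep the low 5 bits.
object BufColWrap {
  def bufCol(outputCol: Int): Int = (outputCol - 2) & 0x1F
}
```

Columns 2 through 33 map to buffer indices 0 through 31 as intended, while columns 34 and 35 alias back onto indices 0 and 1, which is why stretching `inImage` alone would read real pixels where zeros are expected.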
Bug 2: tmpRow Overwritten During Row Switch
Phenomenon. 737 mismatches. Not sporadic: systematic. Rows 0–4 pass, row 5 and beyond are entirely wrong.
Locating the fault. Initially only the first two output rows were printed, and both looked correct. Expanding the print to all 32 rows revealed the break: row 5’s first pixel was 0x00C0, the identity of row 6, col 0. Incrementing test data makes each pixel its own row identifier:
```
0x0000 = row 0 first pixel
```
```
row0 out: 0000 0000 0000 0001 0002 ... 001d ← correct
```
Row 6 data shifted into row 5’s position: the error jumped a full row at once.
Tracing the data source. LineBuffer shifts its 5-row buffer up at the end of every output row:
```scala
when (outputRow >= 2.U) {
```
New data enters the buffer through exactly one path: tmpRow. If buffer holds wrong data, tmpRow was wrong first.
tmpRow is populated pixel-by-pixel during sActive:
```scala
when (io.in.valid && io.in.ready) {
```
io.in.ready is gated by needLoad, which checks row number but not column number:
```scala
val needLoad = outputRow >= 2.U && outputRow + 3.U < 32.U
```
Cycle-by-cycle at outputRow=2. DMA has finished streaming row 5’s 32 pixels and continues: it has no concept of padding columns:
```
outputCol: 0 1 2 3 ... 31 32 33 34 35
```
At columns 32–35, loadCol has wrapped back to 0, but needLoad is still true. DMA is already sending row 6 pixels: they overwrite the first four slots of tmpRow. At the row-end shift, buffer(4) := tmpRow pulls the corrupted data into the buffer. After a few rows of shifts, the damage climbs through the buffer and surfaces at the output.
Root cause. loadCol wraps modulo 32. outputCol wraps modulo 36. The 4 padding columns per row create a window where DMA has advanced to the next image row but needLoad hasn’t stopped, and with no column gating, loadCol wraps back to 0 and the start of tmpRow gets overwritten.
Fix. Restrict DMA loading to image columns only:
```scala
// Before
```
At the padding columns (0–1 and 34–35), inImage is false: loading stops, so exactly 32 pixels are accepted per output row, tmpRow stays intact, and the row-end shift propagates correct data into the buffer.
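The overwrite can be reproduced with a short software model of one output row. This is a sketch under simplifying assumptions, not the LineBuffer source: the DMA stream is assumed to have a pixel available on every cycle, `needLoad` is assumed true for the row being simulated, and pixel values are simply their index in the image so that each value identifies its origin. `TmpRowModel` and `loadRow` are hypothetical names.

```scala
// One output row (36 cycles) of tmpRow loading, with and without column gating.
object TmpRowModel {
  val Width = 32
  val ColsPerRow = 36

  // firstPixel: index of the next pixel the DMA will deliver.
  // Returns (tmpRow contents at row end, number of pixels accepted).
  def loadRow(firstPixel: Int, gateOnImageCols: Boolean): (Vector[Int], Int) = {
    val tmpRow = Array.fill(Width)(-1)
    var loadCol = 0
    var accepted = 0
    for (outputCol <- 0 until ColsPerRow) {
      val inImage = outputCol >= 2 && outputCol <= 33
      val ready = if (gateOnImageCols) inImage else true
      if (ready) {
        tmpRow(loadCol) = firstPixel + accepted
        loadCol = (loadCol + 1) % Width // loadCol wraps modulo 32
        accepted += 1
      }
    }
    (tmpRow.toVector, accepted)
  }
}
```

Starting at pixel 160 (row 5, column 0), the ungated version accepts 36 pixels and leaves 192, which is 0x00C0, in slot 0: the same value observed at the start of the broken output row. The gated version accepts exactly 32 pixels and leaves row 5 intact.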
Phase 6: Chipyard Integration & Verilator Build
TODO
Phase 7: Bare-Metal C Test Program
TODO
Phase 8: Performance Report & Summary
TODO