Aman Desai
Projects

baby-gpu

SystemVerilog · FPGA · RTL · Verification

Context

baby-gpu is a work-in-progress graphics accelerator project. The repo calls the first architecture UrbanaGPU-1: a small GPU-like core written in portable SystemVerilog, with the RealDigital Urbana FPGA board as the first hardware target.

The goal is not to jump straight to a full GPU. The first useful target is a minimal programmable pipeline that can accept command streams, run small kernels, write a framebuffer, and eventually display a 160x120 RGB565 image scaled to 640x480.
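The display numbers above can be checked with a short sketch. It assumes the conventional 5-6-5 bit layout for RGB565 and plain integer pixel replication for the 160x120 to 640x480 step; the function names are illustrative and not taken from the repo.

```python
SRC_W, SRC_H = 160, 120
SCALE = 4  # 160 * 4 = 640 and 120 * 4 = 480 (assumed integer replication)


def pack_rgb565(r, g, b):
    """Pack 8-bit R, G, B into one 16-bit RGB565 word (5 red, 6 green, 5 blue bits)."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def source_pixel(out_x, out_y):
    """Map a 640x480 output coordinate to the 160x120 source pixel it replicates."""
    return out_x // SCALE, out_y // SCALE


# Two bytes per pixel, so the source framebuffer needs 38,400 bytes.
FRAMEBUFFER_BYTES = SRC_W * SRC_H * 2
```

At two bytes per pixel the source image is small enough to fit comfortably in a simple memory model, which fits a first bring-up target.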

Approach

The repo is split so the GPU core stays separate from board-specific details. Portable RTL lives under rtl/, while platform wrappers for simulation, FPGA, and future ASIC work live outside the core. That keeps the design from being tied prematurely to Vivado, the Urbana board, DDR3, or a specific video-output path.

The current architecture is centered on a small programmable SIMD core rather than on fixed-function draw blocks alone. The first shape is intentionally narrow: one core, four SIMD lanes, a shared program counter, private lane register files, and a blocking load/store path into global memory.

  • Command FIFO and command processor for host-driven control
  • Register file and launch state for kernel setup
  • Fixed-function clear and rectangle-fill blocks for bring-up smoke tests
  • Programmable core path with scheduler, instruction decode, SIMD ALU, and LSU
  • Global-memory framebuffer model where pixel writes are ordinary stores
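As a rough behavioral illustration of that execution model, the sketch below steps four lanes in lockstep from one shared program counter, with a private register file per lane and loads and stores that complete before the next instruction issues. The instruction names, tuple encoding, and the lane ID preloaded into r0 are all hypothetical simplifications; they do not represent the repo's ISA.

```python
NUM_LANES = 4


def run(program, memory, num_regs=8):
    """Execute `program` on all lanes in lockstep and return the lane register files."""
    # One private register file per lane. r0 is preloaded with the lane ID
    # purely for illustration (an assumption, not the real launch mechanism).
    regs = [[0] * num_regs for _ in range(NUM_LANES)]
    for lane in range(NUM_LANES):
        regs[lane][0] = lane

    pc = 0  # a single program counter shared by every lane
    while pc < len(program):
        op, dst, a, b = program[pc]
        for r in regs:  # every lane executes the same instruction
            if op == "add":
                r[dst] = r[a] + r[b]
            elif op == "load":    # blocking load: r[dst] = memory[r[a] + b]
                r[dst] = memory[r[a] + b]
            elif op == "store":   # blocking store: memory[r[a] + b] = r[dst]
                memory[r[a] + b] = r[dst]
            else:
                raise ValueError(f"unknown op: {op}")
        pc += 1
    return regs


# Word-addressed memory: A at 0..3, B at 4..7, C at 8..11.
memory = [1, 2, 3, 4, 10, 20, 30, 40, 0, 0, 0, 0]
program = [
    ("load", 1, 0, 0),   # r1 = A[lane]
    ("load", 2, 0, 4),   # r2 = B[lane]
    ("add", 3, 1, 2),    # r3 = r1 + r2
    ("store", 3, 0, 8),  # C[lane] = r3
]
lane_regs = run(program, memory)
```

A model like this is only a reading aid for the architecture; it ignores timing, the scheduler, and the command path entirely.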

Current status

The project is already past the skeleton stage. The repo includes RTL for the core blocks, a small ISA envelope, kernel-level simulations, formal harnesses for selected datapath and control blocks, and synthesis smoke tests through Yosys. The fixed-function graphics blocks are still useful, but mainly as bring-up infrastructure.

The implemented instruction subset is small on purpose. It covers basic integer operations, loads and stores, special-register reads, convergent branches, predicated stores, and 16-bit stores for RGB565 framebuffer writes.
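To make the last two items concrete, here is a minimal sketch of a predicated 16-bit store into byte-addressed memory. It assumes little-endian byte order and 2-byte alignment; neither is stated above, and store16 is a hypothetical helper name rather than an instruction mnemonic from the ISA.

```python
def store16(memory, addr, value, pred=True):
    """Write the low 16 bits of `value` at byte address `addr` when `pred` is set."""
    if not pred:
        return  # predicate is false: this lane performs no write
    if addr % 2:
        raise ValueError("unaligned 16-bit store")  # alignment rule is an assumption
    memory[addr] = value & 0xFF             # low byte first (little-endian assumption)
    memory[addr + 1] = (value >> 8) & 0xFF  # then the high byte


# Four lanes each target one RGB565 pixel; lane 1 is masked off by its predicate.
mem = bytearray(8)
lane_values = [0xF800, 0x07E0, 0x001F, 0xFFFF]
lane_preds = [True, False, True, True]
for lane in range(4):
    store16(mem, 2 * lane, lane_values[lane], lane_preds[lane])
```

The masked lane leaves its pixel untouched, which is the behavior a kernel needs when only some lanes fall inside the region being drawn.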

The useful milestone is getting simple kernels to run through the whole path. The vector_add kernel exercises launch, IDs, loads, ALU work, stores, and completion. The framebuffer_gradient kernel shows that the same path can write pixels without turning the framebuffer into a special case inside the execution core.
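A software reference model is one way to check a kernel like framebuffer_gradient in simulation. The sketch below treats each pixel as an ordinary 16-bit store at base + 2 * (y * width + x). The base address, row-major layout, little-endian byte order, and the gradient formula itself are all assumptions made for illustration and may differ from the actual kernel.

```python
WIDTH, HEIGHT = 160, 120
FB_BASE = 0x1000      # hypothetical framebuffer base address in global memory
BYTES_PER_PIXEL = 2   # RGB565


def pixel_addr(x, y):
    """Byte address of pixel (x, y), assuming a row-major framebuffer."""
    return FB_BASE + (y * WIDTH + x) * BYTES_PER_PIXEL


def gradient_color(x, y):
    """Illustrative gradient: red ramps left to right, green ramps top to bottom."""
    r = x * 31 // (WIDTH - 1)    # 5-bit red channel
    g = y * 63 // (HEIGHT - 1)   # 6-bit green channel
    return (r << 11) | (g << 5)


def render(memory):
    """Fill the framebuffer region of `memory` using plain 16-bit stores."""
    for y in range(HEIGHT):
        for x in range(WIDTH):
            addr = pixel_addr(x, y)
            memory[addr:addr + 2] = gradient_color(x, y).to_bytes(2, "little")


memory = bytearray(FB_BASE + WIDTH * HEIGHT * BYTES_PER_PIXEL)
render(memory)
```

Because the model only computes addresses and stores values, the region below FB_BASE stays untouched, and the resulting bytes could be compared against a memory dump from simulation.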

Current focus

  • Harden the programmable path before adding speculative GPU features
  • Keep block-level formal checks close to the RTL as the core changes
  • Run an optional Vivado synthesis smoke test before making FPGA platform claims
  • Keep the framebuffer path as ordinary global memory instead of a separate architecture