baby-gpu | Aman Desai

Context

baby-gpu is a work-in-progress graphics accelerator project. The repo calls the first architecture UrbanaGPU-1: a small GPU-like core written in portable SystemVerilog, with the RealDigital Urbana FPGA board as the first hardware target.

The goal is not to jump straight to a full GPU. The first useful target is a minimal programmable pipeline that can accept command streams, run small kernels, write a framebuffer, and eventually display a 160x120 RGB565 image scaled to 640x480.
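The display numbers above work out to an exact 4x scale in both directions. Here is a minimal Python sketch of that arithmetic. It assumes a row-major framebuffer and scaling by plain pixel replication, neither of which is confirmed by the repo, and the function names are hypothetical.

```python
# Resolutions taken from the text; everything else here is an assumption.
SRC_W, SRC_H = 160, 120      # framebuffer size in pixels
OUT_W, OUT_H = 640, 480      # display size in pixels
SCALE = OUT_W // SRC_W       # 4, and OUT_H // SRC_H gives the same factor


def pack_rgb565(r, g, b):
    """Pack 8-bit R, G, B into one 16-bit RGB565 word (5 red, 6 green, 5 blue bits)."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def source_pixel_index(out_x, out_y):
    """Map a display coordinate to a framebuffer pixel index.

    Assumes nearest-neighbour replication and a row-major layout.
    """
    return (out_y // SCALE) * SRC_W + (out_x // SCALE)


# Two bytes per RGB565 pixel.
FRAMEBUFFER_BYTES = SRC_W * SRC_H * 2
```

At two bytes per pixel the whole 160x120 frame is 38,400 bytes.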

Approach

The repo is split so the GPU core stays separate from board-specific details. Portable RTL lives under rtl/, while platform wrappers for simulation, FPGA, and future ASIC work live outside the core. That keeps the design from being tied too early to Vivado, Urbana, DDR3, or a specific video-output path.

The current architecture is centered on a small programmable SIMD core rather than on fixed-function draw blocks alone. The first shape is intentionally narrow: one core, four SIMD lanes, a shared program counter, private lane register files, and a blocking load/store path into global memory.

Command FIFO and command processor for host-driven control
Register file and launch state for kernel setup
Fixed-function clear and rectangle-fill blocks for bring-up smoke tests
Programmable core path with scheduler, instruction decode, SIMD ALU, and LSU
Global-memory framebuffer model where pixel writes are ordinary stores
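The execution shape described above can be sketched as a toy behavioral model in Python. This is not the project's RTL or ISA. The mnemonics, the eight-register file, the word-addressed memory, and the framebuffer base address are all made up for illustration. What it does reflect from the text is four lanes, one shared program counter, private per-lane registers, memory accesses that finish before the next instruction issues, and pixel writes that are just stores.

```python
NUM_LANES = 4        # four SIMD lanes, per the text
REGS_PER_LANE = 8    # assumption: register count is not stated in the text


class ToySimdCore:
    """Toy model: every lane runs the same instruction at the shared PC."""

    def __init__(self, mem_words):
        self.pc = 0                                              # shared by all lanes
        self.regs = [[0] * REGS_PER_LANE for _ in range(NUM_LANES)]  # private per lane
        self.mem = [0] * mem_words                               # flat global memory

    def step(self, program):
        op, *args = program[self.pc]
        for lane in range(NUM_LANES):
            r = self.regs[lane]
            if op == "lane_id":        # rd <- this lane's index
                r[args[0]] = lane
            elif op == "addi":         # rd <- rs + immediate
                r[args[0]] = r[args[1]] + args[2]
            elif op == "add":          # rd <- rs1 + rs2
                r[args[0]] = r[args[1]] + r[args[2]]
            elif op == "load":         # rd <- mem[rs]
                r[args[0]] = self.mem[r[args[1]]]
            elif op == "store":        # mem[rs_addr] <- rs_val
                self.mem[r[args[0]]] = r[args[1]]
            else:
                raise ValueError(f"unknown op {op!r}")
        # The PC advances only after every lane's access is done (blocking).
        self.pc += 1

    def run(self, program):
        self.pc = 0
        while self.pc < len(program):
            self.step(program)


# Hypothetical program: each lane writes one "pixel" at FB_BASE + lane.
FB_BASE = 16
PROGRAM = [
    ("lane_id", 0),        # r0 = lane index
    ("addi", 1, 0, FB_BASE),  # r1 = framebuffer address for this lane
    ("addi", 2, 0, 100),   # r2 = value to write
    ("store", 1, 2),       # ordinary store; no framebuffer-specific instruction
]

core = ToySimdCore(mem_words=32)
core.run(PROGRAM)
```

The framebuffer here is only a base address inside the same memory array, so the core model needs no graphics-specific path.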

Current status

The project is already more than a skeleton. The repo includes RTL for the core blocks, a small ISA envelope, kernel-level simulations, formal harnesses for selected datapath and control blocks, and synthesis smoke tests through Yosys. The fixed-function graphics blocks are still useful, but mainly as bring-up infrastructure.

The implemented instruction subset is small on purpose. It covers basic integer operations, loads and stores, special-register reads, convergent branches, predicated stores, and 16-bit stores for RGB565 framebuffer writes.
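A predicated 16-bit store is the piece that makes RGB565 writes work per lane. The Python sketch below shows the idea under stated assumptions: byte-addressed memory, little-endian byte order, and a 2-byte alignment requirement. None of these are confirmed by the text, and the function name is hypothetical.

```python
def store16_predicated(mem, addrs, values, predicate):
    """Write one 16-bit value per enabled lane into byte-addressed memory.

    mem:       bytearray standing in for global memory
    addrs:     per-lane byte addresses
    values:    per-lane values (only the low 16 bits are stored)
    predicate: per-lane booleans; a False lane leaves memory untouched
    """
    for addr, value, enabled in zip(addrs, values, predicate):
        if not enabled:
            continue
        if addr % 2 != 0:
            # Assumption: 16-bit stores must be 2-byte aligned.
            raise ValueError(f"unaligned 16-bit store at {addr:#x}")
        mem[addr] = value & 0xFF             # low byte first (little-endian assumed)
        mem[addr + 1] = (value >> 8) & 0xFF  # then the high byte


# Four lanes write adjacent pixels; lane 1 is masked off.
memory = bytearray(8)
store16_predicated(
    memory,
    addrs=[0, 2, 4, 6],
    values=[0xF800, 0x07E0, 0x001F, 0xFFFF],
    predicate=[True, False, True, True],
)
```

Masking is useful at image edges, where a group of four lanes may cover fewer than four valid pixels.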

The useful milestone is getting simple kernels to run through the whole path. vector_add proves launch, IDs, loads, ALU work, stores, and completion. framebuffer_gradient proves the same path can write pixels without turning the framebuffer into a special case inside the execution core.
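Kernels like these are easy to check against a software reference. The Python sketch below shows what such golden models could look like. The 32-bit wraparound and the specific gradient pattern (red ramps with x, green ramps with y) are assumptions, since the text does not describe the repo's actual kernels or test harness.

```python
def vector_add_reference(a, b):
    """One logical thread per element: c[tid] = a[tid] + b[tid].

    Results wrap to 32 bits (assumed register width).
    """
    return [(x + y) & 0xFFFFFFFF for x, y in zip(a, b)]


def gradient_reference(width=160, height=120):
    """Build a row-major RGB565 gradient image as a list of 16-bit words.

    Assumed pattern: red ramps left to right, green ramps top to bottom.
    """
    framebuffer = []
    for y in range(height):
        for x in range(width):
            red5 = x * 31 // (width - 1)      # 0..31 across the row
            green6 = y * 63 // (height - 1)   # 0..63 down the column
            framebuffer.append((red5 << 11) | (green6 << 5))
    return framebuffer
```

A simulation run can then dump global memory and compare it word for word against these lists.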

Current focus

Harden the programmable path before adding speculative GPU features
Keep block-level formal checks close to the RTL as the core changes
Run an optional Vivado synthesis smoke test before making FPGA platform claims
Keep the framebuffer path as ordinary global memory instead of a separate architecture