baby-gpu
Context
baby-gpu is a work-in-progress graphics accelerator project. The repo calls the first architecture UrbanaGPU-1: a small GPU-like core written in portable SystemVerilog, with the RealDigital Urbana FPGA board as the first hardware target.
The goal is not to jump straight to a full GPU. The first useful target is a minimal programmable pipeline that can accept command streams, run small kernels, write a framebuffer, and eventually display a 160x120 RGB565 image upscaled 4x to 640x480.
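As a rough illustration of the display target, the sketch below packs a color into RGB565 and maps a 640x480 output pixel back to the 160x120 source. The function names are hypothetical, and nearest-neighbor pixel replication is an assumption; the repo may scale the image differently.

```python
# Hypothetical reference model, not code from the repo.
SRC_W, SRC_H = 160, 120
SCALE = 4  # 160 * 4 = 640, 120 * 4 = 480

def pack_rgb565(r, g, b):
    """Pack 8-bit R, G, B into one 16-bit word: 5 red, 6 green, 5 blue bits."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def source_index(x_out, y_out):
    """Map a 640x480 output pixel to its linear index in the 160x120 source.

    Assumes nearest-neighbor replication: each source pixel covers a 4x4 block.
    """
    return (y_out // SCALE) * SRC_W + (x_out // SCALE)
```

With this mapping the scan-out logic only needs integer division by four (a two-bit shift in hardware) to find the source pixel.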
Approach
The repo is split so the GPU core stays separate from board-specific details.
Portable RTL lives under rtl/, while platform wrappers for simulation, FPGA,
and future ASIC work live outside the core. That keeps the design from becoming
too tied to Vivado, Urbana, DDR3, or a specific video-output path too early.
The current architecture is centered on a small programmable SIMD core rather than only fixed-function draw blocks. The first shape is intentionally narrow: one core, four SIMD lanes, a shared program counter, private lane register files, and a blocking load/store path into global memory.
The main blocks are:
- Command FIFO and command processor for host-driven control
- Register file and launch state for kernel setup
- Fixed-function clear and rectangle-fill blocks for bring-up smoke tests
- Programmable core path with scheduler, instruction decode, SIMD ALU, and LSU
- Global-memory framebuffer model where pixel writes are ordinary stores
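The execution model described above (four lanes, one shared program counter, private register files, blocking memory access) can be sketched as a small behavioral model. Everything here is an assumption made for illustration: the opcode names, the tuple encoding, the eight 32-bit registers per lane, and the word-addressed memory are hypothetical and do not reflect the actual UrbanaGPU-1 ISA.

```python
LANES = 4  # four SIMD lanes, as described in the text

class SimdCore:
    """Behavioral sketch of a lockstep SIMD core (hypothetical ISA)."""

    def __init__(self, program, memory):
        self.program = program   # list of (op, *operands) tuples
        self.mem = memory        # shared global memory, word-addressed here
        self.pc = 0              # one program counter shared by all lanes
        # Private register file per lane (8 registers is an assumption).
        self.regs = [[0] * 8 for _ in range(LANES)]

    def step(self):
        """Execute one instruction on every lane, then advance the shared PC."""
        op, *args = self.program[self.pc]
        for lane in range(LANES):
            r = self.regs[lane]
            if op == "lane_id":          # special-register read
                r[args[0]] = lane
            elif op == "addi":
                rd, rs, imm = args
                r[rd] = (r[rs] + imm) & 0xFFFFFFFF
            elif op == "add":
                rd, ra, rb = args
                r[rd] = (r[ra] + r[rb]) & 0xFFFFFFFF
            elif op == "load":           # blocking: completes before PC moves
                rd, raddr = args
                r[rd] = self.mem[r[raddr]]
            elif op == "store":
                rsrc, raddr = args
                self.mem[r[raddr]] = r[rsrc]
            else:
                raise ValueError(f"unknown op: {op}")
        self.pc += 1

    def run(self):
        while self.pc < len(self.program):
            self.step()
```

A short usage example: each lane loads one word, adds 10, and stores the result four words further on.

```python
mem = [1, 2, 3, 4, 0, 0, 0, 0]
program = [
    ("lane_id", 0),       # r0 = lane index
    ("load", 1, 0),       # r1 = mem[r0]
    ("addi", 1, 1, 10),   # r1 += 10
    ("addi", 2, 0, 4),    # r2 = r0 + 4
    ("store", 1, 2),      # mem[r2] = r1
]
SimdCore(program, mem).run()
```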
Current status
The project is already more than a skeleton. The repo includes RTL for the core blocks, a small ISA envelope, kernel-level simulations, formal harnesses for selected datapath and control blocks, and synthesis smoke coverage through Yosys. The fixed-function graphics blocks are still useful, but mainly as bring-up infrastructure.
The implemented instruction subset is small on purpose. It covers basic integer operations, loads and stores, special-register reads, convergent branches, predicated stores, and 16-bit stores for RGB565 framebuffer writes.
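To make the last two items concrete, here is a minimal sketch of a predicated 16-bit store across the lanes. The function name, the byte-addressed memory, and the little-endian byte order are assumptions for illustration; the source does not state the core's endianness or store semantics in detail.

```python
def store16_predicated(mem, addrs, values, pred):
    """Write one 16-bit value per lane, skipping lanes whose predicate is false.

    mem    : bytearray modeling byte-addressed global memory (assumption)
    addrs  : per-lane byte addresses
    values : per-lane values; only the low 16 bits are stored
    pred   : per-lane booleans; a false entry suppresses that lane's write
    """
    for addr, val, active in zip(addrs, values, pred):
        if not active:
            continue
        mem[addr] = val & 0xFF             # low byte first (little-endian assumed)
        mem[addr + 1] = (val >> 8) & 0xFF  # high byte
```

Predication of this kind lets a kernel mask off lanes that fall outside the image bounds without needing a divergent branch.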
The useful milestone is getting simple kernels to run through the whole path.
vector_add proves launch, IDs, loads, ALU work, stores, and completion.
framebuffer_gradient proves the same path can write pixels without turning the
framebuffer into a special case inside the execution core.
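A host-side golden model for these two kernels might look like the sketch below. It is illustrative only: the function names, the one-thread-per-element mapping, the 32-bit wraparound, the little-endian byte order, and the particular gradient pattern are all assumptions, not the repo's actual kernels.

```python
FB_W, FB_H = 160, 120  # framebuffer size from the project goal

def vector_add_ref(a, b):
    """Expected vector_add output: elementwise sum, wrapped to 32 bits (assumed)."""
    return [(x + y) & 0xFFFFFFFF for x, y in zip(a, b)]

def gradient_pixel(x, y):
    """Hypothetical gradient: red ramps with x, green ramps with y, blue is 0."""
    r5 = (x * 31) // (FB_W - 1)
    g6 = (y * 63) // (FB_H - 1)
    return (r5 << 11) | (g6 << 5)

def framebuffer_gradient_ref(mem, fb_base):
    """Write the gradient into mem as ordinary 16-bit stores.

    The framebuffer is just a region of global memory starting at fb_base;
    thread tid owns pixel (tid % FB_W, tid // FB_W).
    """
    for tid in range(FB_W * FB_H):
        x, y = tid % FB_W, tid // FB_W
        addr = fb_base + 2 * tid       # 2 bytes per RGB565 pixel
        px = gradient_pixel(x, y)
        mem[addr] = px & 0xFF          # little-endian assumed
        mem[addr + 1] = px >> 8
```

A simulation testbench could compare the memory image produced by the RTL against the output of a model like this.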
Current focus
- Harden the programmable path before adding speculative GPU features
- Keep block-level formal checks close to the RTL as the core changes
- Run an optional Vivado synthesis smoke test before making FPGA platform claims
- Keep the framebuffer path as ordinary global memory instead of a separate architecture