Monday, January 14, 2013

storing pixels

So, as output is in theory solved by having the DMA send lines of 16-bit halfwords, we need to focus on storage and memory.
As with horizontal resolution, different choices lead to different modes. I’ll show some generic principles that you can adapt to create your own mode, or you can reuse an existing one.
There are several general ways to store pixels in order to output them.

The first, direct approach is a frame buffer.

A frame buffer is a chunk of memory storing pixels exactly as they will be output, so that what you output to the screen is what you’ve got in memory. Sometimes things get a little messier, with color planes (i.e. one frame buffer each for R, G and B) and banks.

What resolution can we afford with a frame buffer, given the memory we have ?

We can, of course, use Flash for static images. But using only static images is quite restrictive for a game console.

Let’s say we use 2 bytes per pixel. We have 192 kB of RAM on the STM32F4, but we can only use 128 kB of it, as the other 64 kB is core-coupled memory that the DMA cannot see. So we can store 64k pixels (128 kB halved, since each pixel takes 16 bits).

For a 4/3 aspect ratio, that’s sqrt(3/4*64*1024) = 221 lines.

We could thus store one screen of 294x221 pixels, or two screens of (only!) 208x156 if we want double buffering !
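The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name `largest_4_3_screen` is made up for illustration, and it simply rounds down to whole lines and whole pixels.

```python
import math

def largest_4_3_screen(pixel_budget):
    """Largest (width, lines) with a 4/3 aspect ratio fitting in pixel_budget pixels."""
    # width = 4/3 * lines, so lines^2 * 4/3 <= budget  =>  lines <= sqrt(3/4 * budget)
    lines = math.isqrt(pixel_budget * 3 // 4)
    width = lines * 4 // 3
    return width, lines

USABLE_RAM = 128 * 1024      # bytes reachable by the DMA (from the text)
BYTES_PER_PIXEL = 2          # 16-bit halfwords

budget = USABLE_RAM // BYTES_PER_PIXEL        # 65536 pixels
single = largest_4_3_screen(budget)           # one frame buffer: (294, 221)
double = largest_4_3_screen(budget // 2)      # each of two frame buffers
print(single, double)
```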

Which begins to enter the domain of lame for a 32-bit game console.

For a better resolution of 640x480 @ 12bpp (stored as 16-bit halfwords), we would need 640x480x2 = 614 400 bytes, which is about 5 times the usable RAM and more than half the Flash size.
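For the record, here is the same sum spelled out, using only the sizes quoted in these posts (128 kB of DMA-visible RAM, 1 MB of Flash):

```python
WIDTH, HEIGHT = 640, 480
BYTES_PER_PIXEL = 2                 # 12 bits of color stored in a 16-bit halfword

DMA_RAM = 128 * 1024                # bytes of RAM the DMA can see
FLASH = 1024 * 1024                 # bytes of Flash

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL   # 614400
ram_ratio = frame_bytes / DMA_RAM                # 4.6875, "about 5 times the RAM"
flash_ratio = frame_bytes / FLASH                # ~0.59, "more than half the Flash"
print(frame_bytes, ram_ratio, flash_ratio)
```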

For this, we need to compress the image we keep in RAM and decompress it at a 25 MHz pixel clock.

So we will need a translation function that prepares the pixels of the next line from the compressed frame data while the DMA outputs the current line buffer, and then swaps those front and back line buffers, exactly timed, about 31 000 times per second.

Needless to say we won’t use jpeg.

Note that our game needs to manipulate this compressed data, so PNG and the like (really LZ77/LZW-style compressed pixel storage in memory) would also be too CPU intensive, as well as impractical to manipulate for non-static images.

First, we can use fewer bits per pixel by using indexed colors. For example, we could use a 256-color palette with 1 byte per pixel, giving nice colors and practical output. See the Wikipedia article about it: it’s well explained and has nice parrot pictures. Arrr !
The display function will translate each pixel color id through a table of pixel colors and store the result in the line buffer. This is generally done in hardware by a RAMDAC (in: pixel ids, out: VGA signal, inside: some RAM for the palette and a DAC, hence RAMDAC; see the Wikipedia article). But those chips are becoming hard to find and expensive, it would be one more chip on the board, and it would force us to use indexed color, so no.
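To make the translation step concrete, here is a minimal sketch of what such a display function does for one line. It is written in Python for readability, not speed; the name `expand_line` and the palette contents are made up for illustration, and the color layout (4 bits each of R, G, B in the low 12 bits of a halfword) is an assumption.

```python
def expand_line(pixel_ids, palette):
    """Translate one line of 8-bit palette ids into 16-bit colors for the line buffer."""
    return [palette[pixel_id] for pixel_id in pixel_ids]

# Hypothetical 256-entry palette; only a few entries are filled in here.
palette = [0x000] * 256
palette[1] = 0xF00      # red   (assumed 0x0RGB layout)
palette[2] = 0x0F0      # green

line_ids = bytes([0, 1, 1, 2])           # 1 byte per pixel in the frame data
line_buffer = expand_line(line_ids, palette)
print([hex(c) for c in line_buffer])
```

The frame data costs 1 byte per pixel instead of 2, at the price of one table lookup per pixel on every line.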

Is it feasible by software ?

Let’s calculate how many lines per second we could output.
Let’s consider a naive one-pixel-per-byte algorithm, not using any 32-bit word-aware method.
  • Excluding loop housekeeping, that’s around 6 clocks per pixel, if I read the ARM reference latencies correctly:
    • 1 read per pixel for the pixel id,
    • 1 read to fetch the palette color from memory,
    • 1 store per pixel to the buffer.
So the STM32 @ 168 MHz can output 168e6/(6*640) = 43750 lines/sec. That’s more than the 31 kHz horizontal refresh rate, so it’s possible (note that the line period includes the hblank) !

(That takes about 70% of the CPU power, not counting the vblank periods where we’re not outputting video, compared to none when we’re using a frame buffer. That’s a serious memory/CPU tradeoff, but if we can do better, having 50% of a 168 MHz CPU left isn’t so bad after all.)
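The budget can be recomputed in a few lines. The 6 cycles per pixel is the rough estimate from above, not a measured figure, and 31 000 is the rounded line rate used in these posts.

```python
CPU_HZ = 168_000_000
CYCLES_PER_PIXEL = 6        # rough estimate, loop housekeeping excluded
PIXELS_PER_LINE = 640
LINE_RATE_HZ = 31_000       # approximate VGA horizontal rate

max_lines_per_sec = CPU_HZ / (CYCLES_PER_PIXEL * PIXELS_PER_LINE)   # 43750.0
cpu_share = LINE_RATE_HZ / max_lines_per_sec                        # ~0.71
print(max_lines_per_sec, round(cpu_share, 2))
```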

Note that with a 16-color palette, that’s half the RAM again (4 bits per pixel), and with 4 colors it’s 2bpp (with a quarter as many pixel reads, and the whole palette kept in registers so no palette reads … which, combined with word writes, can make it much faster).
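As an illustration of the 16-color case, here is a sketch of unpacking two pixels per byte. The nibble order (high nibble = left pixel) and the name `expand_line_4bpp` are assumptions made for this example.

```python
def expand_line_4bpp(packed, palette16):
    """Expand a line stored at 4 bits per pixel (2 pixels per byte) into 16-bit colors."""
    out = []
    for byte in packed:
        out.append(palette16[byte >> 4])      # left pixel: high nibble (assumed order)
        out.append(palette16[byte & 0x0F])    # right pixel: low nibble
    return out

# Hypothetical 16-entry palette: entry i holds the value 0x100 + i.
palette16 = list(range(0x100, 0x110))
pixels = expand_line_4bpp(bytes([0x01, 0xF2]), palette16)
print([hex(p) for p in pixels])
```

One byte read now feeds two pixels, which is where the saving in pixel reads comes from.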

Note also that with a palette you can manipulate the palette independently of the pixel data, so you can do fade-outs quite easily by switching palettes (not for free, but you’re already doing the translation work anyway).
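A fade-out then only touches the palette, never the pixel data. The sketch below darkens every entry of a palette; the function name `fade_palette`, the 0 to 16 brightness scale and the 0x0RGB color layout are all assumptions for illustration.

```python
def fade_palette(palette, level):
    """Return a darkened copy of a 4-4-4 palette; level 16 = unchanged, 0 = black."""
    faded = []
    for color in palette:
        r = ((color >> 8) & 0xF) * level // 16
        g = ((color >> 4) & 0xF) * level // 16
        b = (color & 0xF) * level // 16
        faded.append((r << 8) | (g << 4) | b)
    return faded

palette = [0xFFF, 0x840]
half = fade_palette(palette, 8)     # half brightness
print([hex(c) for c in half])
```

Calling it with a decreasing level on each frame, and handing the result to the line translation, gives the fade.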

But that is quite expensive, and another method will be used first.

Tiled backgrounds
Another technique, often used for backgrounds and very similar to text mode, goes further : instead of having a palette of pixels, let’s have a palette of sub-images, composed to make a bigger image : one for a tree top, one for a tree bottom, one for grass. Repeat them many times and you have a big forest with 3 small images + a map.

It’s similar to text modes in that instead of doing it with letters (buffer of characters on screen + small bitmaps representing letters), you do it with color images (which can be letters).

Nice editors exist for tiled data, and we will use one to compose our images. 

Storing such an image requires storing the tiles + a tilemap referencing them. The bigger the tiles, the fewer bits you need to store the tilemap, and the more you need to store the tiles. Note that tiles can also be stored with a palette.

Many other choices can be made, and combining them is possible, but we have few cycles to spare, so let's consider only tiles for now.
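Here is a toy sketch of how one screen line is rebuilt from tiles and a tilemap. The tiny 2x2 tiles, the data layout (each tile as a list of rows) and the name `render_line` are assumptions chosen to keep the example readable, not the bitbox format.

```python
def render_line(y, tilemap, tiles, tile_w, tile_h):
    """Build the pixels of screen line y from a tilemap (rows of tile ids) and the tiles."""
    map_row = tilemap[y // tile_h]      # which row of the map this line falls in
    tile_y = y % tile_h                 # which row inside each tile
    line = []
    for tile_id in map_row:
        line.extend(tiles[tile_id][tile_y][:tile_w])
    return line

# Two hypothetical 2x2 tiles; pixel values could be colors or palette ids.
tiles = [
    [[1, 1], [1, 1]],     # tile 0: plain "grass"
    [[2, 3], [3, 2]],     # tile 1: a pattern
]
tilemap = [
    [0, 1, 0],
    [1, 0, 1],
]
first_line = render_line(0, tilemap, tiles, 2, 2)
last_line = render_line(3, tilemap, tiles, 2, 2)
print(first_line, last_line)
```

The 6x4 pixel image here is described by 6 tile ids plus 2 small tiles, and the saving grows quickly as tiles get reused.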

bitbox VGA generation

While the preceding post was about generic video generation, this post specifies what the bitbox console uses for video generation.

First, the DAC : it will be a simple DAC made of resistors. An R-2R ladder could be used; it’s nice to have only a few resistor values when manufacturing. But for now we’ll use fewer resistors, since we will manufacture by hand (duh), so a plain resistor DAC will be used. I first tried an 8-bit RRRGGGBB one.

That’s what the uzebox (the 8-bit homebrew console, it’s great and has been a great inspiration) uses with an 8-bit microcontroller, but here we have the capacity (CPU and memory wise) to do a little more.

How many colors should we be able to display ?

It’s a question of balance : more bits in the DAC look better, but more bits mean more CPU to build the signal, more RAM / Flash to store the nice graphics, and more hardware complexity.

I finally settled for 4096 colors, which is 4-4-4 = 12 bits + 4 unused bits on a 16-bit output bus. The use of a palette will be defined by the software, so let’s not talk about that now.
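As a sketch of what a 4-4-4 color looks like as a 16-bit value: the bit order below (red in bits 11-8, green in 7-4, blue in 3-0, top 4 bits unused) is an assumption for illustration, since the actual wiring of the resistors decides it. The name `rgb444` is made up as well.

```python
def rgb444(r, g, b):
    """Pack three 4-bit channels into one 16-bit halfword (assumed 0x0RGB layout)."""
    for channel in (r, g, b):
        if not 0 <= channel <= 15:
            raise ValueError("each channel must fit in 4 bits")
    return (r << 8) | (g << 4) | b

NUM_COLORS = 16 ** 3            # 4096 distinct colors

white = rgb444(15, 15, 15)      # 0x0FFF
red = rgb444(15, 0, 0)          # 0x0F00
print(hex(white), hex(red), NUM_COLORS)
```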

15 bits could also have been done, but I think 12 bits will provide nice colors anyway. The games won’t be photorealistic, so the aim is vivid colors, not realistic ones.

Then, how many pixels should we be able to output ? That’s a software thing ! Nothing in hardware sets the number of pixels : vertically it’s how often we fire the h-sync, and horizontally it’s how fast we make the pixel vary.

Let's try defining a first video mode (all by software).

We should try to build on standard VGA timings, which should be easier for VGA screens to sync on, as well as being compatible with many screens.

The universal resolution is 640x480 at 60 Hz, which is supported by almost everything (even HDMI supports it - but of course we are not generating HDMI with a few resistors).

Note, however, that this is only the resolution the screen thinks it gets. For example, there is no difference between changing the pixel levels at half the rate and having pixels twice as wide : it’s the same thing.
Likewise, if you output each line twice, you effectively get half the vertical resolution. That gives you, for example, 320x240 @ 12 bits if you also halve the pixel rate to output 320 pixels per line.

You can also "forget" to send anything for 20 lines before and 20 lines after your signal, so you’ll have black lines and 320x200, which has the nice property of needing a 64k frame buffer if we use 1 byte per pixel. 128k for double buffering… but more on that later.
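The three tricks (wider pixels, repeated lines, blank lines) can be summed up in a small calculation. The helper `derived_mode` is hypothetical, just a way to show how the software-defined mode follows from the 640x480 signal the screen sees.

```python
def derived_mode(h_repeat, v_repeat, blank_top, blank_bottom, base=(640, 480)):
    """Effective resolution when each pixel and line is repeated and some lines stay black."""
    width = base[0] // h_repeat
    height = base[1] // v_repeat - blank_top - blank_bottom
    return width, height

width, height = derived_mode(h_repeat=2, v_repeat=2, blank_top=20, blank_bottom=20)
frame_bytes = width * height * 1        # 1 byte per pixel
print(width, height, frame_bytes)       # 320 200 64000
```

At 64 000 bytes, one such frame fits in a 64k buffer, and two of them fit in the 128 kB the DMA can reach.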

Outputting pixels

The next thing to consider is how to store pixels in memory and how to output them.

Outputting can be done by bitbanging, i.e. writing the pixels out clocked by the instruction clock of the processor.

The problem is that we won’t have much time left to do anything else. While the main CPU is perfectly good at outputting bytes or halfwords, it is really much more powerful than that, so all those cycles could be spent doing more useful things, such as adding 4 bytes in parallel or running nice effects. It would be nice if we had a small bit of silicon on the MCU able to move data from memory to a peripheral (here GPIOs).

As a matter of fact, we do! It’s called a DMA, for direct memory access. The STM32F4 has two of them.

The only thing we need to do is :
  • generate line hsyncs at 31 kHz with a clock-based interrupt (see VGA generation posts and VGA timings references)
  • for some of those lines, generate vsync
  • for the actual lines,
    • point the DMA to a part of memory, tell it the pace / width of output,
    • let it run in the background
    • fill another place of memory with the next line of pixels (or the whole screen)
    • return from the interrupt ASAP, letting the processor do interesting things in the foreground.
  • In the foreground, process user input, calculate the next frame or decompress a nice purple tentacle from a PNG to RAM, ...
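The per-line decision made in that interrupt can be sketched as a small classifier. The vertical timings below are the commonly published ones for 640x480 at 60 Hz (480 visible lines, 10 front porch, 2 sync, 33 back porch, 525 in total); they are not taken from these posts, so check them against a VGA timings reference. The function name `line_kind` is made up, and the real thing would of course live in a C interrupt handler.

```python
VISIBLE_LINES = 480
FRONT_PORCH = 10
VSYNC_LINES = 2
BACK_PORCH = 33
TOTAL_LINES = VISIBLE_LINES + FRONT_PORCH + VSYNC_LINES + BACK_PORCH   # 525

def line_kind(line):
    """Tell the line interrupt what to do for a given line counter value."""
    line %= TOTAL_LINES
    if line < VISIBLE_LINES:
        return "active"     # start the DMA on the current line buffer, fill the next one
    if line < VISIBLE_LINES + FRONT_PORCH:
        return "blank"      # nothing to output
    if line < VISIBLE_LINES + FRONT_PORCH + VSYNC_LINES:
        return "vsync"      # assert the vertical sync signal
    return "blank"          # back porch

lines_per_second = TOTAL_LINES * 60     # 31500, the "31 kHz" line interrupt rate
print(lines_per_second, line_kind(0), line_kind(490))
```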
The next post will focus on storing & generating pixels.

Friday, January 11, 2013

VGA software generation

The VGA software generation from a chip is quite simple as well as quite tricky to achieve.
Simply said, to output a VGA signal, you should think of the screen as a cathode ray tube, scanning from top left to bottom right in lines, with the beam shut off while getting back to the left or back up to the first position, in a Z pattern (let’s think progressive scan here).

Then, to output a VGA signal, you need to generate three varying red, green and blue signals (0 to 0.7 volts, 0 meaning black and 0.7 full color), as well as H sync (to tell the tube to go back to the left) and V sync (to tell the tube to go back to the top left).

Nice tutorials are available, so instead of copying and paraphrasing them here, I’ll just link to them. Great links for VGA and Video signal generation are :
- and finally a GREAT tutorial for video generation :
- A search engine using “VGA signal timings” as search terms, for example
Composite is a little trickier with separate luma+chroma.
The principle is very simple; what can be tricky is getting the timing perfectly right (or not too badly wrong), because you’re trying to generate three 20 MHz signals on a microcontroller … as well as (hopefully) running a simple game !

Hardware considerations

So the idea is to deliver a simple, cheap hardware base that is home-reproducible and versatile to hack.
Video signals and sound generation and processing will be software-generated, so the exact characteristics (screen resolution, tile-based engine, frame buffer or even 3D raster, number of sound voices) will be defined by kernel software and will evolve as the hardware is pushed by the software.

Kernels are just sets of drivers that allow simpler game development by abstracting the lower-level VGA generation (graphics signal generation) into libs.
The aim is to be simple and cheap, while using up-to-date hardware (not in the sense of powerful - that’s not the point - but easy to find and cheap).

The main CPU will be the STM32F4 from STmicro, which is a quite powerful platform to build on.

Running at 168 MHz, with 192 kB RAM and 1 MB Flash memory, fast DMAs and the 32-bit Thumb-2 Cortex-M4F instruction set with SIMD and float instructions, this little beast seems to have what it takes to bring us to the world of homemade SNES (not ) ! It’s also about $10-15, even if the whole platform will be more expensive (whole car vs engine).

meet the bitbox console

Hi, this is a personal blog relating my adventures in developing a simple DIY console based on an ARM chip. The base of it will be a single chip, the STM32F4 from STMicroelectronics.
The minimal hardware design will hopefully allow for hackability, as almost everything will be based on this chip + software rendering of the video signal.
More on this later !