Wednesday, April 22, 2015

DIY FPGA-based HDMI ambient lighting

Ambient lighting is a technique that creates light effects around the television that correspond to the video content. It has been pioneered by Philips under the brand Ambilight. In this project we will create a basic FPGA-based ambient lighting system that reads the video signal over HDMI. This means we are not limited to computer output. We can use it together with DVD players, video game consoles, etc.

The complete source code for the entire project is available on GitHub.

Design outline

Ideally we would like to snag the signal off the HDMI cable without disturbing it. However, the signals are really really fast current mode logic differential pairs, and signal integrity issues make any kind of passive tap a non-starter. We are stuck with doing the next best thing: decoding and recoding the signal. This has the advantage that we gain the ability to modify the signal on the way through the system to add debug information etc.

Thanks to our "man in the middle" position between the player and the display we can see all the pixel values, aggregate them in properly sized and positioned boxes and use this signal to drive a strand of LEDs over their SPI interface. These LEDs will be arranged in a ring around the back of the display, resulting in an ambient lighting effect.

Due to the limitations of the FPGA boards currently available on a hobby budget, we will not be able to process video signals at resolutions higher than 720p. That's unfortunate, but I expect this limitation to go away as series 7 FPGAs drop in price.

Meet the components

At the heart of the system is the FPGA. We are going to save ourselves a lot of trouble, and use a board that comes with an FPGA suitable for HDMI deserialisation/serialisation and two HDMI connectors already affixed. We are going to use the wonderful Scarab miniSpartan6+.

For the leds we are going to use a strand of 25 digitally addressable LEDs based on the WS2801 controller. There are denser, brighter, cheaper addressable LED strands out there now, but I have these lying around from another project, so that's what we will use.

Finally we need a way to do logic level conversions. The FPGA operates at 3.3V, but the WS2801 expects to see 5V as a high logic level. Therefore we will need to use a level shifter circuit. We are going to use a really simple, cheap MOSFET-based level shifter by adafruit. We could do our own level shifting, but this unit is so simple and cheap, there's no reason not to use it.

HDMI Decoding/Re-encoding

The HDMI specification is huge, sprawling and complex and contains lots of odd video modes and assorted features. For this project, we shall limit ourselves to the 24 bits per pixel RGB modes, as these are both the most common modes and the sanest ones to support. We will also ignore HDCP encryption, EDID (which means we need to configure the proper modes on the HDMI source), CEC, and all other sorts of frills.

Electrically, the video signal is sent over four shielded differential pairs. Three are used for the color channels, and one is used for the pixel clock. Note that the pixel clock beats once for every pixel, not once every bit of every pixel. Also, the HDMI specification allows the channels to have significant delays with respect to each other. This means that we will need to perform clock recovery to receive the individual bits, and we will need to delay the individual channels to get them to line up.
The actual bits are encoded using a special form of 8b/10b encoding. Decoding the channel data yields the pixel values and the control signals. For re-encoding, this procedure needs to be reversed.

Xilinx has released a whole bunch of documentation on implementing video interfaces using the Spartan 6 family of FPGAs. The most relevant one is the application note Implementing a TMDS Video Interface in the Spartan-6 FPGA. As we can see the procedure is fairly involved, but not overly complicated, and reference designs for both the HDMI receiver and transmitter are provided along with the application note, that target  Digilent's Atlys board. The Spartan 6 FPGA series contains a bunch of hard IP that assists greatly in implementing HDMI receivers and transmitters.

The app note includes the following diagram of an HDMI receiver:

Practically this means the following:

  1. We need to send all four input signals (three data/color channels R,G,B and the clock signal CLK) to differential input buffers, configured to TDMS operation.
  2. CLK is fed into a phase-locked loop (PLL) and used to derive CLKx10 and CLKx2 signals. 
  3. CLKx10 has the same frequency as the bitstream, so it's used to clock three input de-serializer blocks (ISERDES2) and associated delay circuitry (IODELAY2). These blocks capture five bits at a time and output them every beat of the CLKx2 clock.
  4. Two five-bit blocks are glued back together in the 5:10 Gear Box, yielding a potentially correct 10 bit symbol.
  5. Looking for known values in the data stream, we sync up the symbols for each channel. If we don't see the known values, we shift the window by one bit and try again. Eventually every channel should end up in sync, and we should be receiving valid symbols from every channels.
  6. Then we bring the channels in sync with each other, so that the three symbols of each channel are asserted on each beat of the pixel clock.
  7. Finally the symbols are fed through the TMDS decoder to yield the actual RGB pixel values and control signals.

Re-encoding and transmission follows a similar process but in reverse. The pixel values are passed through a TMDS encoder yielding 10 bit symbols. The symbols are hacked up in two halves of five bits each and are shoved onto the differential pairs using an output serialization block (OSERDES2) connected to a differential output buffer configured for TMDS operation.

Standing on the shoulders of ... hamsters?

Normally we would try and port the reference designs to the miniSpartan6+. However, it turns out we can steal borrow most of the HDMI handling from Mike Fields of Hamsterworks. His project MiniSpartan6+ DVID Logo performs exactly the kind of decoding/re-encoding we need. It's solid, legible code that happens to target the specific hardware we are using. Therefore we are going to use it as a basis for this project. Mike, if you are reading this, you're the man. Seriously.

Averaging the HDMI signal

The decoded video stream is presented as an old fashioned VGA signal, and consists of the following signals:
  1. The pixel clock. We can sample the pixel color values on the rising edge of this clock.
  2. Three 8-bit values for the red, green and blue channels of the current pixel
  3. HSync and VSynch, signals that alert us to the start of a line or screen respectively
  4. Blank, that signals the blanking period. Originally this signal was used in CRT monitors to turn off the electron beam while it returned to the left edge of the screen after a line.
Using the three control signals and a few counters we can determine the (x,y) position of the current pixel. The value of x is increased for every pixel clock. The value of x is set to zero and the value of y is increased at the start of the blanking period (or at hsync). At vsync, both x and y are set to zero.

Using red, green, blue, x and y, we can do on the fly averaging. By comparing with x and y, we can determine what sampling box (connected to an LED) each individual pixel should contribute to, if any. Then we sum the color values into an accumulation register per LED. To obtain the average value, we need to divide the value in the register by the number of pixels. We are going to make sure that the sampling boxes contain a number of pixels that is a power of two. Then we can just wire the MSBs of the registers to the output to get the average values without performing expensive divisions.
The process for a single sampling box is illustrated in the figure below.

We are going to be using sampling boxes of 128x128 pixels. This means that our accumulation registers will need to be able to hold the sum of  2**7 x 2**7 8-bit values for every (red, green, blue) channel, and will therefore have to be 22 bit wide. In total, we will require 1650 bits of accumulators. However, the accumulator only contains the proper value at the end of the frame. Therefore we need another 600 bits to act as a "double buffer" to keep the values of the MSBs stable as we continuously drive the LEDs over their SPI interface. In total we require 2250 bits of state, which is a puny amount, even for a small FPGA like the LX9.

Driving the LEDs

Our LED string uses a SPI-like system to set the color values for the individual pixels. It's made to run at 5V and it uses the WS2801 controller. We would like to drive it directly from the digital output pins of the FPGA, but as soon as we open the datasheet of the WS2801 controller, we notice we've hit a pretty serious snag:

In order for the inputs to register as logic level high, they need to be at 0.8Vdd. Since the LEDs are powered at 5V, this means that we need to get above 4V for the controller to register a high signal. The FPGA is a 3.3V device. This means we will be needing to perform level shifting.

We are going to use a really simple, cheap MOSFET-based level shifter by adafruit that I have lying around. It's designed to work with I2C, SPI and serial and supports bidirectional operation. The circuit follows the suggested schematic from the NXP I2C level shifting app note, but with four channels instead of two:

By passing the signals through this device we can convert between 3.3V and 5V logic levels. We must ensure every line is not driven from both ends at the same time, but in this application that is not a concern because the WS2801s in our design have no user-accessible outputs (they are wired to the LEDs).

Now that the electrical part is taken care of, we still need to write the actual values to the LEDs. For the sake of simplicity, we just continuously send the values in the SPI buffer to the LEDs, instead of trying to sync with the HDMI source frame rate. This may lead to some tearing artifact during rapid transitions, but in practice this is not a problem.

From the data sheet, the timing diagram for the WS2801 is as follows:

A new frame starts when the clock is low for more than 0.5 ms. Then we can output the RGB values R-first, MSB-first without pausing between the LEDs. Theoretically the WS2801 can operate at up to 25MHz, but since we're dealing with lots of wires and breadboard prototyping techniques, we are going to settle for a much more modest clock speed.

We generate a 25 kHz clock by dividing the on-board 50 MHz MEMS oscillator of the miniSpartan6+ board. We output all 600 bits of the SPI buffer in order on the falling edge of the clock, so that the WS2801 can sample them on the following rising edge. When we are done, we bring the clock line low and wait for the equivalent of 40 bits to reset the data frame. This means that we perform about 40 updates of the LEDs every second, This slower than the data is pouring in from the HDMI averager module, but it does not suffer from visible lag.

If we blindy output the averages to the LEDs, the colors will be very washed out. We need to perform gamma correction in order to match the colors of the LEDs to the colors on screen. Experimentally, we get the best results using a gamma value of 2.5. We create the VHDL code for the gamma lookup table using a tiny python script:

for i in range(256):
    g = int(math.pow(float(i) / 255.0, 2.5) * 254.0 + 1.5)
    print 'X"%02X",' % (g),

Notice how we map the values to the interval 1-255, instead of 0-255. We do this because at intensity 0 the individual "subpixel" LEDs actually turn off completely, and this results in significant color artifacts.

Putting everything together

Now can assemble everything and wire all the components together.

In the picture above we can see the LEDs affixed to a piece of cardboard arranged in the proper pattern. The level shifter is placed on a solder-less breadboard that also contains the 5V power supply. The FPGA board is powered from the 5V power supply and drives the leds through the level shifter. The cardboard support can be attached to the back of the television using M4 screws threaded through the VESA mount, or just by using more masking tape.

Now we can hook up an HDMI source (such as my trusty old WD TV Live HD) and enjoy the show!

Tuesday, October 14, 2014

Zero-confirmation bitcoin transactions

Bitcoin transactions and confirmations

Bitcoin transactions are incorporated into the bitcoin blockchain, that acts as a shared ledger. Once the transaction has been part of the blockchain for a long enough period of time, it is accepted as truth by all the nodes in the network and becomes an unalterable part of the bitcoin permanent record. This makes it irreversible and resistant to both network events and deliberate tampering in the form of double-spending attacks.

A block that is added to the blockchain that contains the transaction is called a confirmation. So a newly broadcast transaction is said to have zero confirmations. When it's included in a block, it has one confirmation. Another block is mined on top of the block containing the transaction, we have two confirmations. Generally, a transaction is not considered definitive unless it has six confirmations. On average this takes about an hour.

In this post we will develop a crude system for handling zero-confirmation transactions without incurring into any appreciable risk.

BTC-mirror, a zero-confirmation application

Let's assume, for the sake of argument, that you really hate bitcoins, and that you want to get rid of them as soon as possible. Unimaginable, I know, but bear with me. So, what we want to do is to send those bitcoins back to the sender as quickly as humanly possible.

We cannot immediately create a new bitcoin transaction to send an equal amount of coins back to the sending address. If the original transaction gets double-spent or whatever, the new transaction will still be incorporated into the blockchain and we will lose money. Of course, we could just wait for six confirmations, but that would mean holding onto those coins for over an hour.

Instead we will craft a bitcoin transaction by hand that includes the original transaction as one of the inputs. If the original transaction ends up being rejected, our new transaction will be invalid as well, and will not be incorporated into the blockchain. Because of this mechanism we can broadcast our transaction immediately into the network, before the original transaction is even incorporated in a block. Zero confirmations.

First things first

In order to interact with the bitcoin network we are going to run a full node. Yes, I know, the blockchain is really large and it takes a long time to sync. Suck it up. We also need a full transaction index. Set "txindex=1" in your bitcoin.conf before starting your node as re-indexing everything is a huge pain.

If you feel uncomfortable sending around real money, or if you want to cut down on the server load in terms of network traffic and disk space, you can also set your node up on the test network (testnet=1). It's up to you.
For this project I used bitcoind version, your mileage may vary using different versions.

Set up an RPC endpoint, username and password (rpcport, rpcuser, rpcpassword options). Pick a proper password.

We will also be using python 2.7 and python-bitcoinrpc. Go install them now.

Bitcoin shield activated

Now we're ready to implement our application. Using the walletnotify option in bitcoin.conf we can run a script every time a transaction is received that affects our wallet:

walletnotify=/home/drx/ %s

The script will run with a single parameter, the transaction identifier (txid) of the transaction that affects our wallet. It may fire multiple time for a single transaction, so you should take care to process every txid just once. In our case, we don't really have to since the transaction input trick will protect us, as we would effectively be double-spending the original transaction. In other applications this can be a disastrous oversight that will cost you a lot of money.

After all the required preliminary python incantations (imports, main function definitions, etc), it's time to get to work. First we use bitcoinrpc to obtain a proxy to our bitcoin node's RPC service. Then we ask for the transaction details of the transaction we're being called to process:

access = AuthServiceProxy("http://:@")
txid = sys.argv[1]
txinfo = access.gettransaction(txid)

We can now use the transaction information txinfo to find out which addresses in our wallets were affected and how. We are only going to mirror transactions that affect our wallet only as a receiver. This prevents us from mirroring send transactions and change consolidations that would create infinite loops:

myaddresses = set()
for details in txinfo["details"]:
    if details["category"] != "receive":

Now we need to find the actual transaction outputs that credited these addresses, and figure out the total amount we need to send back. For that we need to dive into the transaction itself:

tx = access.decoderawtransaction(txinfo["hex"])
newtx_inputs = []
total_amount = Decimal(0)
for vout in tx["vout"]:
    for address in vout["scriptPubKey"]["addresses"]:
        if address in myaddresses:
            total_amount += vout["value"]

We also need to find the sender's address, so we know where to send the money. This is done by looking at the inputs of the original transaction. The transaction can have multiple inputs that are all under the control of the sending party. We just send everything back to one of the input addresses at random. This is not just laziness on our part, it also results in a smaller transaction, which in turn results in a smaller blockchain on every single node.

The sender address does not appear in this transaction, it appears in the transactions that created the inputs. Therefore we need to look back in the ledger. This is the part that needs the transaction index to be enabled on your node.

addresses = []
for vin in tx["vin"]:
    # For every input, find the txid and output number used
    intxid = vin["txid"]
    invout = vin["vout"]
    # Get the outputs of the input transaction
    intx = access.getrawtransaction(intxid, verbose=1)
    vout = intx["vout"][invout]
to_address = random.choice(addresses)

Now we are finally ready to put everything back together into a new bitcoin transaction, sign it and send it to the network:

newtx_hex = access.createrawtransaction(newtx_inputs,{to_address:float(total_amount)})
newtx_hex = access.signrawtransaction(newtx_hex)["hex"]

Congratulations, you are now immune to bitcoin.

Wednesday, July 30, 2014

Retro Pinball: Reverse Engineering the Bally AS-2518-51 Sound Module Assembly

One of my friends has a beautiful 1980 Nitro Ground Shaker pinball machine. It's in great condition, but there is absolutely no sound output. As I was trying to figure out a way to troubleshoot the sound module, I found myself falling deeper and deeper into the 8-bit pinball rabbit hole. In this post I will try and explain the inner workings of the amazing Bally AS-2518-51 Sound Module Assembly.


The 2518-51 Sound Module is based on the famous Motorola 6800 series microprocessor (U3 in the picture and in the schematic). Sound synthesis is performed by a General Instrument AY-3-8910 (U1). There's also a 6820 Peripheral Interface Adapter (PIA) (U2) to interface the AY38910 to the MPU, a ROM (U4) that contains game-specific code and data and a RAM chip (U10). The microprocessor is clocked by a 3.58 MHz crystal. Due to the built-in clock divider the 6800 operates at 895 kHz.

The Bally AS-2518-51 Sound Module Assembly as found in Nitro Ground Shaker
The pinball system interfaces with the sound module using a 15 pin connector (J1) that provides both power and input. The only output of the sound module happens through the speaker connector (J2).

The manual contains schematics for all the electronics in the pinball machine. The schematic for the sound assembly module is reproduced below.

Schematic of the AS-2518-51 Sound Module
The sound module's basic operation is as follows. The pinball's main processing board (known as the MPU Module Assembly) asserts the number of the sound it wants to play on the sound module's address lines on the J1 connector (pins 1, 2, 3, 4, 12 and optionally 13). Then it pulses the sound interrupt line on pin 8. The 6800 picks up this signal (through the PIA) and jumps into action. It reads the number of the sound to play (from the IO port on the AY-3-8910, through the PIA) and generates commands for the AY-3-8910 (again, through the PIA) to synthesize the requested sound.

Reconstructing the Memory Map

The 6800 accesses its on-bus peripherals (RAM, ROM and PIA) by mapping pieces of its 64 kiB address space to each individual peripheral. This is all done in hardware by wiring some of the lines of the address bus to pins of the peripheral chips that cause the chip to turn on or off. Pins that cause a chip to turn on are usually called CS (for chip select). Pins that cause a chip to turn off are usually called /CS (the negation of CS). A chip is enabled if all CS lines are pulled high and all /CS lines are pulled low.
All chips on the bus are only allowed to be active when the address bus contains a valid address. When this is the case the 6800 pulls the VMA line high (for Valid Memory Address). This line is usually connected to a CS pin, fulfilling this requirement.

RAM chip connections
The RAM (U10) chip's data lines and address lines A0 through A6 are connected to the data and the address bus of the 6800 respectively. Two address lines (A7 and A12) are also wired to /CS pins. This means that the RAM chip is only active if the A7 and A12 lines are both low.

PIA chip connections
The PIA (U2) is active when A7 is high (CS) and A12 is low (/CS). Also note that the chip is only wired to address lines A0 and A1.
ROM chip connections

The ROM chip is enabled when NAND(VMA, A12) is low (/CS). This is the case when A12 (and VMA) are high. The ROM chip is connected to the address bus through lines A0 to A10. Line A11 can be enabled with a jumper when the ROM chip is 4 kiB instead of 2kiB, but this is not the case for Nitro Ground Shaker.

Based on the above information we can reconstruct the memory map as seen by the 6800.

StartEndA7A12DeviceAddress LinesActual Range
0x00000x007F00RAMA0 - A60x0000 - 0x007F
0x00800x0FFF10PIAA0 - A10x0080 - 0x0083
1ROMA0 - A100x1000 - 0x17FF

Disassembling the ROM contents

ROM images can be obtained from the Internet Pinball Database for use with the PinMAME emulator. We are interested in the roms for the regular (6-digit) version of the pinball machine. The ROM bundle consists of four files. Three are for the MPU Module Assembly, and one is for U4 on the Sound Module Assembly. We are interested in the latter.

The game configuration in the PinMAME source code contains the following lines:

BY35_ROMSTART888(ngndshkr,"776-17_1.716",CRC(f2d44235) SHA1(282106767b5ec5180fa8e7eb2eb5b4766849c920),
                          "776-11_2.716",CRC(b0396b55) SHA1(2d10c4af7ecfa23b64ffb640111b582f44256fd5),
                          "720-35_6.716",CRC(78d6d289) SHA1(47c3005790119294309f12ea68b7e573f360b9ef))
BY51_SOUNDROM8(           "776-15_4.716",CRC(63c80c52) SHA1(3350919fce237b308b8f960948f70d01d312e9c0))

The ROM image for U4 (SOUNDROM) is contained in the file 776-15_4.716.

We disassemble the ROM contents using DASMx, a two-pass disassembler that supports the 6800 family of processors. When disassembling we must make sure to specify the correct origin address to ensure that the ROM contents line up with the addresses as seen by the processor. Since the ROM is mapped at address 0x1000 we must specify this when we invoke the disassembler:

dasmx -c6800 -o0x1000 -w ngndshkr\776-15_4.716

The result is a listing of data and instructions for the addresses in ROM.

Reverse Engineering the Boot Procedure

According to the datasheet of the 6800 family of processors, when the 6800 is started it will start execution at the address located at 0xFFFF and 0xFFFE. Other memory locations stored at the top of memory are used to react to other events:

Table 1 of the datasheet of the 6800 as published by Hitachi.
These addressess are mapped to addresses 0x1F8 and up on the actual ROM, since only address lines A0-A10 are used.

The ROM contains the following values:
17F8 : 10     
17F9 : 23 
17FA : 10     
17FB : 00     
17FC : 10     
17FD : 00     
17FE : 10     
17FF : 00

On (re)start, software interrupt or non-maskable interrupt (NMI, pin 6 connected to pushbutton SW1), execution starts at address 0x1000, which is the first address in ROM. On interrupt request (IRQ, pin 4 connected to the PIA) execution starts at address 0x1023.

First we initialize the stack. The stack pointer is set to 0x007F. The stack grows downward (toward lower addresses) and 0x007F is the highest address available in RAM.

1000 : 8E 00 7F    "   " [3] lds #$007F

Next up is basic PIA configuration. Refer tot the datasheet for the PIA for the meaning of the various registers.
1003 : 4F    "O" [2] clra
1004 : 97 83    "  " [4] staa X0083

Set PIA control register B (CRB) to zero. This is mainly to set bit CRB-2 to zero to allow us to write to the direction register using address 0x0082.

1006 : 43    "C" [2] coma
1007 : 97 82    "  " [4] staa X0082

Write 0xFF to the direction register for port B. This configures all pins of port B (PB) on the PIA as outputs.
1009 : 86 34    " 4" [2] ldaa #$34
100B : 97 83    "  " [4] staa X0083

Write 0x34 (00110100) to the control register for port B (CRB). This means:

 CRB-0/1:  Disable \IRQB by CB1
 CRB-2: Read/writes on 0x82 will now affect I/O port B. This enables output to the PB pins of the PIA.
 CRB-3/5:  (110) CB2 is output, CB2 set high. CB2 switches the analog output.

100D : 86 07    "  " [2] ldaa #$07
100F : 97 81    "  " [4] staa X0081

Set PIA Control Register A (CRA) to 0x07 (00000111):

CRA-0/1: Enable \IRQA by CA1, Set low-to-high transition. This is the main interrupt: when the pinball wants to play a sound it will trigger this interrupt. The PIA will then in turn lower the \IRQ line of the 6800 and an interrupt request will take place.
CRA-2: Select peripheral A register. Read/writes to 0x80 will now affect I/O port A. This enables us to read/write data from the AY-3-8910.
CRA-3/5: (000) CA2 setup does not matter as that pin is not connected.

1011 : 7F 00 3F    "  ?" [6] clr X003F

Clear memory location 0x003F. As of yet unexplained.

Then we launch a busy loop, presumably with the intention of giving the pinball machine as a whole a chance to come online. Interrupts request presented during this time as not handled, as interrupts are still masked.
1014 : 86 32    " 2" [2] ldaa #$32
1016 L1016:
1016 : CE 3D 2D    " =-" [3] ldx #$3D2D
1019 L1019:
1019 : 09    " " [4] dex
101A : 26 FD    "& " [4] bne L1019
101C : 4A    "J" [2] deca
101D : 26 F7    "& " [4] bne L1016

The inner loop at L1019 is executed 0x3D2D (15661) times and it consumes 8 cycles (see the timing information between square brackets), for a total of 125288 cycles per inner loop, or about 139 ms. This loop is executed 50 (0x32) times, yielding a delay of about 7 seconds.

101F : 96 80     "  " [3] ldaa X0080

Reading from 0x0080 gets us the value currently on port A of the PIA. This is the value presented to it by the AY-3-8910.

1021 : 0E     " " [2] cli
1022 : 3E     ">" [9] wai

Now we're ready to work. We unmask the interrupt requests and wait for an interrupt.

Playing a Sound: Overview

When the pinball machine wants to play a sound, it triggers the IRQ line on the 6800 through the interrupt logic of the PIA. This triggers the execution of the interrupt service routine (ISR) located at address 0x1023. The ISR performs some initialization and then reads the value asserted by the pinball machine on the AY-3-8910's IO port A to determine the sound that is being requested. Sounds are played with interrupts unmasked, which means that an incoming interrupt request can interrupt the currently playing sound. Since the ISR resets the stack at every invocation, the newer sound completely overrides the older one. An interesting note is that (at least on Nitro Ground Shaker) sound 3 is always played with interrupts masked, rendering it uninterruptible. Sound requests will be ignored while sound 3 is playing.

Sounds 0 through 4 are played using specialized subroutines that issue commands to the AY-3-8910 programmable sound generator using native 6800 code. Sounds 5 through 32 on the other hand rely on a custom virtual machine (VM) implementation that executes a domain specific bytecode.
Overview of sound playing logic

The bytecode can contain arithmetic operations on the AY-3-8910's registers, can perform both conditional and unconditional branches and even jump to, and return from, subroutines. It has a small stack and an explicit program counter. The sound VM is therefore a complete virtual computer optimized for the task of driving the sound generator.

Virtual Machine Bytecode

The VM code takes the form of  simple bytecode instructions where the high nibble of the first byte contains the instruction opcode to be executed. Execution is handled on a per-opcode basis using a table of pointers to opcode handler subroutines. The first parameter (places in the low nibble) is passed to the handler routine in accumulator A. Accumulator B contains the byte that follows the opcode as the second parameter. Whether this is actually used by the instruction is determined by the opcode handler, as it determines the instruction length by increasing the VM program counter VM_PC. By indexing relative to VM_PC and incrementing it accordingly afterward, the VM can execute instructions of arbitrary length. We have observed instructions of length 3.


By disassembling the opcode handlers we can understand the purpose of the different opcodes. All opcodes 0-F are in use, but I have not yet figured out the purpose and usage of some of them. I assigned the mnemonics based on my understanding of the instructions. The mnemonics used in the original documentation/source code are likely very different. Also, this table reflects my current understanding of the bytecode and the bytecode interpreter and it is likely to contain major mistakes and omissions.

Store VAL in AY-3-8910 register REG

REG += 1

REG -= 1
Branch to (relative) offset
Branch if REG is zero
Branch if REG is not zero
Pause. Use DELAY if A!=0, otherwise use default delay.
91 ?REG

A3CALL0?OFFS_HIGHOFFS_LOWCall bytecode (A=0) or native (A!=0) subroutine.

Return from bytecode subroutine
Write value in RAM to REG. Purpose unclear.
Read value in REG to RAM. Purpose unclear.

Freeze execution of sound (until next interrupt) (?)

End execution of sound. (?)


The  Bally AS-2518-51 Sound Module Assembly is an extremely interesting piece of kit. In this post we have described the general functionality of the Sound Module Assembly, we have reconstructed the memory map as seen by the 6800 processor, we have disassembled the firmware and taken a first stab at reverse engineering the the sound playing logic. 

The sounds are generated, either directly by the 6800 microprocessor or using a custom virtual machine programmed in a specially designed bytecode. Because of this, it is far from trivial to extract the sound effects directly from the ROM images into a description of the AY-3-8910 register states that can be dumped into an YM file or similar. It is much easier to instrument a full emulator (like PinMAME) if this is the goal.

It should however be possible to craft a program that "compiles" a YM file into either a native or a bytecode routine so that it can be added to a pinball machine as one of the sounds available. Whether it can be done within the hardware constraints and whether it can be cleanly patched into the firmware remains to be seen.

Wednesday, August 8, 2012

Complex CPU Addressing modes, for free!

One of the most powerful features of a central processing units is the ability to access memory using complex indexing modes. These modes are essential to implement compound data structures such as arrays, structures and classes in an efficient manner. Our FPGA soft-processor core will need to do all these things too.
The instruction set architecture (ISA) that we have developed in the previous post support only basic register indirection, like this:

MOV [R1], R2
MOV R1,[R2]

where the [R1] denotes the memory address pointed to by the register R1.

However, as we shall see, we can support rather complex indexing modes essentially for free!
Recall the layout of our execution unit:

Not that the memory address register (MAR) can only be loaded using bus3, which means taking a trip through the ALU. We can therefore load MAR with the result of an ALU operation at no additional cost. This gives us e.g. the following indirect addressing options:

MOV [R1 + R2], R3
MOV R1, [R2 - R3]

The additional information needed in the instruction is minimal, we just need to specify an ALU operation and the bus2 driver just like we do for regular ALU operations.

Note that the above also holds true for reading/writing the memory block register (MBR) itself! We can only write to the MBR by going through the ALU. Also, when we want to store the contents of MBR in a register, we have to go through the ALU too! This gives us the opportunity to replace the stores with ALU operations, yielding instructions of the form

MOV [R1 + R2] - R3, R4
MOV R1 + R2, [R3 - R4]

To achieve this, we will need to store an additional ALU operation and bus2 driver in the instruction. Fortunately, we have room to spare in these instructions! Below we see the instruction words for load and store operations as they are currently in use.

load0010unused[src reg]unuseddst reg
store0011unusedsrc reg[dst reg]unused

To support the new indexing modes, we need to modify the instruction words to contain two complete ALU specification instead of one. We shall call them the "top alu" (suffix T) and "bottom alu" (suffix B) configuration, and they both consist of an ALU operation (OP, 4 bits) and the driver specifiers for bus1 (B1, 4 bits) and bus2 (B2, 4 bits).

opcodetop alubottom aludst reg
load0010OPTB1TB2TOPBunusedB2Bdst reg

The load operation always uses the contents of the addressed memory location as input for bus1 during bottom alu, leaving B1B unused. The store operation stores the result into the addressed memory location, leaving the destination register unused.

The code than handles the instruction in the control unit has not become significantly more complex, even through we need to juggle a few more parameters:

when X"2" =>;
        -- Indirect register load
        case PHASE is
                when X"0" =>;
                        -- Tranfer bottom ALU operation into MAR        
                        CONTR <= X"0000" & IR(15 downto 4) & X"C";      
                        PHASE <= unsigned(phase) + 1;
                when others =>
                        -- Transfer MBR + top ALU into output register
                        CONTR <= X"0000" & IR(27 downto 24) & X"D" & IR(19 downto 16) & IR(3 downto 0);
                        -- end of instruction, load the next instruction
                        PHASE <= (others => '0');
                        PC <= unsigned(PC) + 1; 
                end case;
when X"3" =>;
        -- Indirect register store
        -- (Store the contents of rN at the address in rM)
        case PHASE is
                when X"0" =>
                        -- Tranfer bottom ALU operation into MAR        
                        CONTR <= X"0000" & IR(15 downto 4) & X"C";      
                        PHASE <= unsigned(phase) + 1;
                when others =>
                        -- Transfer top ALU operation into MBR
                        CONTR <= X"0000" & IR(27 downto 16) & X"D";                                                     
                        -- end of instruction, load the next instruction
                        PHASE <= (others => '0');
                        PC <= unsigned(PC) + 1; 
                end case;

As always, the complete, synthesizable VHDL for this project is available on bitbucket.

Of course, with all these funky addressing modes, assembling instructions by hand is starting to become quite a chore. What we need, is an assembler!

Tuesday, July 24, 2012


XENTRAL is a simple Harvard Architecture CPU. It targets the Spartan6 FPGA present on the Digilent's Nexys3 board, but it's all written using portable behavioral VHDL1. You should be able to use your favorite FPGA, synthesizer and simulator.

All project files are available on github under the Creative Commons Attribution Non-Commercial (CC-by-NC) license.

Architecture Overview

XENTRAL is a Harvard Architecture processor. This means that code and data are handled separately.
The picture above is a very high-level schematic representation of the XENTRAL architecture. Instructions are fetched from ROM3. Based on the instruction, the control unit manages the execution unit. The execution unit performs the actual execution of the instruction, by e.g. adding the contents of two CPU registers, writing a value in RAM or performing I/O.

XENTRAL is a Harvard Architecture processor. This means that code and data are handled separately. This effectively precludes the use of self-modifying code. It also prevents the processor from loading its own program. The program must be programmed into the instruction memory. This is a common approach for embedded devices and microcontrollers, but it does not play nice with operating systems and boot loaders. As the project grows, we will re-visit this limitation.

Overview of the Execution Unit

Schematic representation of the execution unit

Tuesday, May 22, 2012

LightDHT - A lightweight python implementation of the Bittorrent distributed hashtable

I have always been fascinated by distributed data structures, but my efforts have always been dampened by the fact that they are really awkward to play with. In order to get anything that resembles real-life usage, you need to be running hundreds of nodes. Needless to say, wrangling hundreds of processes gets rather cumbersome. Fortunately, there are already thousands of people out there that are participating in one of the largest distributed data structures in operation today: the bittorrent distributed hashtable.
In order to facilitate experimentation with this wonderful system, I have written a simple python implementation of a DHT node. It is mostly (with a few exceptions -- see below) compliant with BEP0005. It should be able to act as a full member of the DHT, processing requests without disrupting the network.
Interested in playing around? Get the code on github!
From the documentation:
The aim of LightDHT is to provide a simple, flexible implementation of the Bittorrent DHT for use in research applications. If you want to trade files, you have come to the wrong place. LightDHT does not implement the actual file transfer parts of the bittorrent protocol. It only takes part in the DHT. 
The main philosophy of LightDHT comes down to two considerations: ease of use over performance and adaptability over scalability. 
In order to keep LightDHT easy to use, all DHT RPC calls are performed synchronously. This means that when you call a DHT method, your program will block until you have an answer to your request. That answer will be the return value of the function. This has the advantage that it keeps the logical program flow intact, and makes it more comfortable to use. 
In order to maintain O(log N) scaling across the network, BEP0005 (the standard governing the DHT) mandates that implementations use a bucket-based approach to the routing table. This enables the node to fulfill all requests in constant time and (more or less) constant memory. In LightDHT, we throw that recommendation to the wind. 
Since the main focus of LightDHT is reseach, we are going to keep around all the data we can. This means that we keep around every single node we know about. Since in practice the number of nodes is limited and the request rates are rather low, we do not bother keeping the routing table organized in a tree structure for quick lookups. Instead we keep it in a dictionary and sort on-demand. The performance penalty is well worth the reduced complexity of the implementation, and the flexibility of having all nodes in an easy to use data structure.

Saturday, December 10, 2011

Festive and Nerdy

You might have seen those big outdoor displays where every pixel is composed out of a little three-color LED light. In this post I'll tell you how I turned a string of those pixel lights into Christmas tree lights!

Adafruit industries, one of my favorite suppliers, recently started selling strings of a particularly advanced model of these pixel lights. Have a look at the product image, and tell me if it doesn't already look like a string a Christmas tree lights..
"12mm Diffused Digital RGB LED Pixels"
As you might have seen in my previous post, these pixels are rather easy to control from a micro-controller. However, stuffing a breadboard in your tree is not exactly festive. Thankfully, you can get these handy little tins that are just about the right size for a wall-wart plug, a teensy 2.0 with female headers (courtesy of Pieter Floris) and a big mess of wires.

A wall-wart plug, a Teensy2.0 and a big mess of wires.
Note that, thanks to the female-pin Teensy, everything is wired up using jumper wires. That's right - this is a no-solder project. Because the underside of the Teensy is not isolated, and because I have a number of wires in there that might come loose, I have isolated the interior of the tin using electrical tape. This is most likely not compliant with any code enacted after 1953 or so, so any imitation is purely at your own risk.

The only remaining question is how to fasten the plugs and wires onto the enclosure. Ideally you would use specially molded plastic caps and bits, but the odds of finding them in the right size and shape at your local hardware store are quite slim. Instead, I liberally applied black sugru around the holes I had cut into the tin, and then pressed the wires and the plug into it. The sugru just molds around the objects. After a quick touch-up I left it to cure overnight, and the result was really really good.

Now we just pop closed the tin and voila, a presentable programmable Christmas tree lights driver!

The controller and a single string of LED pixels, running a test pattern.
And here's the tree, all festive and nerdy:

Festive and Nerdy