Wednesday, August 8, 2012

Complex CPU Addressing modes, for free!

One of the most powerful features of a central processing unit is the ability to access memory using complex indexing modes. These modes are essential for implementing compound data structures such as arrays, structures and classes efficiently. Our FPGA soft-processor core will need to do all these things too.
 
The instruction set architecture (ISA) that we developed in the previous post supports only basic register indirection, like this:

MOV [R1], R2
MOV R1,[R2]

where [R1] denotes the memory address pointed to by register R1.

However, as we shall see, we can support rather complex indexing modes essentially for free!
 
Recall the layout of our execution unit:


Note that the memory address register (MAR) can only be loaded from bus3, which means taking a trip through the ALU. We can therefore load the MAR with the result of an ALU operation at no additional cost. This gives us, for example, the following indirect addressing options:

MOV [R1 + R2], R3
MOV R1, [R2 - R3]

The additional information needed in the instruction is minimal: we just need to specify an ALU operation and the bus2 driver, just like we do for regular ALU operations.
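To see why this is free, here is a tiny Python sketch of the data path (the register values and ALU operation names are illustrative only, not the actual XENTRAL encoding): because the MAR can only be written from the ALU output, a two-operand address computation costs exactly as much as a plain register transfer.

```python
# Toy model: the MAR is loaded from bus3, i.e. from the ALU output,
# so any ALU result can serve directly as a memory address.
def alu(op, bus1, bus2):
    ops = {"add": bus1 + bus2, "sub": bus1 - bus2, "pass1": bus1}
    return ops[op]

regs = {"R1": 0x100, "R2": 0x40, "R3": 0x08}

# MOV [R1], ...       -- plain indirection is just a pass-through ALU op
mar_plain = alu("pass1", regs["R1"], 0)       # 0x100

# MOV [R1 + R2], R3   -- base + index comes for free
mar_sum = alu("add", regs["R1"], regs["R2"])  # 0x140

# MOV R1, [R2 - R3]   -- and so does base - offset
mar_diff = alu("sub", regs["R2"], regs["R3"])  # 0x38
```

In every case the control unit issues exactly one ALU micro-operation with the MAR as destination; only the operand selectors and operation code differ.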

Note that the above also holds true for reading and writing the memory buffer register (MBR) itself! We can only write to the MBR by going through the ALU, and when we want to store the contents of the MBR in a register, we have to go through the ALU too. This gives us the opportunity to replace these transfers with ALU operations, yielding instructions of the form

MOV [R1 + R2] - R3, R4
MOV R1 + R2, [R3 - R4]


To achieve this, we need to store an additional ALU operation and bus2 driver in the instruction. Fortunately, we have room to spare in these instructions! Below are the instruction words for load and store operations as they are currently in use.

bits     31-28   27-24   23-20   19-16   15-12    11-8        7-4      3-0
load     0010    unused  unused  unused  unused   [src reg]   unused   dst reg
store    0011    unused  unused  unused  src reg  [dst reg]   unused   unused

To support the new indexing modes, we modify the instruction words to contain two complete ALU specifications instead of one. We shall call them the "top ALU" (suffix T) and "bottom ALU" (suffix B) configurations; each consists of an ALU operation (OP, 4 bits) and the driver specifiers for bus1 (B1, 4 bits) and bus2 (B2, 4 bits).

         opcode  |------ top ALU ------|  |---- bottom ALU ----|
bits     31-28   27-24   23-20   19-16    15-12   11-8    7-4     3-0
load     0010    OPT     unused  B2T      OPB     B1B     B2B     dst reg
store    0011    OPT     B1T     B2T      OPB     B1B     B2B     unused

The load operation always feeds the contents of the addressed memory location (via the MBR) into bus1 of the top ALU, leaving B1T unused. The store operation writes the top-ALU result into the addressed memory location, leaving the destination register field unused.
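Since every field is a single nibble, packing an instruction word is straightforward. The following Python sketch (a hypothetical helper, not part of the project sources) packs the fields in the order the control unit slices them out of IR:

```python
def encode(opcode, opt=0, b1t=0, b2t=0, opb=0, b1b=0, b2b=0, dst=0):
    """Pack one 32-bit instruction word out of 4-bit nibbles:
    opcode IR(31..28), top ALU IR(27..16), bottom ALU IR(15..4),
    destination register IR(3..0)."""
    fields = (opcode, opt, b1t, b2t, opb, b1b, b2b, dst)
    assert all(0 <= f <= 0xF for f in fields)
    word = 0
    for f in fields:
        word = (word << 4) | f
    return word

# A load (opcode 0x2): the bottom ALU computes the address, and the
# top ALU combines the MBR with the B2T operand into the destination.
word = encode(0x2, opt=0x1, b2t=0x3, opb=0x1, b1b=0x4, b2b=0x5, dst=0x6)
print(hex(word))  # 0x21031456
```

An assembler would only need a table mapping mnemonics and register names onto these nibble values.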

The code that handles these instructions in the control unit has not become significantly more complex, even though we need to juggle a few more parameters:

when X"2" =>
        -- Indirect register load
        case PHASE is
                when X"0" =>
                        -- Transfer bottom ALU result into MAR
                        CONTR <= X"0000" & IR(15 downto 4) & X"C";
                        PHASE <= unsigned(PHASE) + 1;
                when others =>
                        -- Transfer MBR + top ALU into destination register
                        CONTR <= X"0000" & IR(27 downto 24) & X"D" & IR(19 downto 16) & IR(3 downto 0);
                        -- End of instruction, load the next instruction
                        PHASE <= (others => '0');
                        PC <= unsigned(PC) + 1;
        end case;
when X"3" =>
        -- Indirect register store
        -- (Store the top ALU result at the address computed by the bottom ALU)
        case PHASE is
                when X"0" =>
                        -- Transfer bottom ALU result into MAR
                        CONTR <= X"0000" & IR(15 downto 4) & X"C";
                        PHASE <= unsigned(PHASE) + 1;
                when others =>
                        -- Transfer top ALU result into MBR
                        CONTR <= X"0000" & IR(27 downto 16) & X"D";
                        -- End of instruction, load the next instruction
                        PHASE <= (others => '0');
                        PC <= unsigned(PC) + 1;
        end case;

As always, the complete, synthesizable VHDL for this project is available on bitbucket.

Of course, with all these funky addressing modes, assembling instructions by hand is becoming quite a chore. What we need is an assembler!

Tuesday, July 24, 2012

XENTRAL: A Simple FPGA CPU


XENTRAL is a simple Harvard Architecture CPU. It targets the Spartan-6 FPGA present on Digilent's Nexys3 board, but it is all written in portable behavioral VHDL. You should be able to use your favorite FPGA, synthesizer and simulator.

All project files are available on github under the Creative Commons Attribution Non-Commercial (CC-by-NC) license.

Architecture Overview

XENTRAL is a Harvard Architecture processor. This means that code and data are handled separately.
The picture above is a very high-level schematic representation of the XENTRAL architecture. Instructions are fetched from ROM. Based on the instruction, the control unit manages the execution unit, which performs the actual execution of the instruction by, for example, adding the contents of two CPU registers, writing a value to RAM or performing I/O.

This effectively precludes the use of self-modifying code. It also prevents the processor from loading its own program: the program must be written into the instruction memory ahead of time. This is a common approach for embedded devices and microcontrollers, but it does not play nice with operating systems and boot loaders. As the project grows, we will revisit this limitation.

Overview of the Execution Unit

Schematic representation of the execution unit

Tuesday, May 22, 2012

LightDHT - A lightweight python implementation of the Bittorrent distributed hashtable

I have always been fascinated by distributed data structures, but my efforts have always been hampered by the fact that they are really awkward to play with. In order to get anything that resembles real-life usage, you need to be running hundreds of nodes. Needless to say, wrangling hundreds of processes gets rather cumbersome. Fortunately, there are already thousands of people out there participating in one of the largest distributed data structures in operation today: the bittorrent distributed hashtable.
In order to facilitate experimentation with this wonderful system, I have written a simple python implementation of a DHT node. It is mostly (with a few exceptions -- see below) compliant with BEP0005. It should be able to act as a full member of the DHT, processing requests without disrupting the network.
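To give a flavor of what BEP0005 compliance involves on the wire: KRPC messages are bencoded dictionaries, and a minimal bencoder (a sketch, independent of the LightDHT sources) takes only a few lines and reproduces the ping query from the spec (the 20-byte node ID below is a placeholder):

```python
def bencode(obj):
    # Bencoding as used by bittorrent/KRPC: byte strings, integers,
    # lists, and dictionaries with lexicographically sorted keys.
    if isinstance(obj, bytes):
        return str(len(obj)).encode() + b":" + obj
    if isinstance(obj, int):
        return b"i" + str(obj).encode() + b"e"
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(obj.items())) + b"e"
    raise TypeError("cannot bencode %r" % (obj,))

# The BEP0005 ping query, with a placeholder node ID:
ping = {b"t": b"aa", b"y": b"q", b"q": b"ping", b"a": {b"id": b"x" * 20}}
wire = bencode(ping)
```

Decoding is the mirror image; every DHT RPC the node answers is one of these dictionaries sent over UDP.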
Interested in playing around? Get the code on github!
From the documentation:
The aim of LightDHT is to provide a simple, flexible implementation of the Bittorrent DHT for use in research applications. If you want to trade files, you have come to the wrong place. LightDHT does not implement the actual file transfer parts of the bittorrent protocol. It only takes part in the DHT. 
The main philosophy of LightDHT comes down to two considerations: ease of use over performance and adaptability over scalability. 
In order to keep LightDHT easy to use, all DHT RPC calls are performed synchronously. This means that when you call a DHT method, your program will block until you have an answer to your request. That answer will be the return value of the function. This has the advantage that it keeps the logical program flow intact, and makes it more comfortable to use. 
In order to maintain O(log N) scaling across the network, BEP0005 (the standard governing the DHT) mandates that implementations use a bucket-based approach to the routing table. This enables the node to fulfill all requests in constant time and (more or less) constant memory. In LightDHT, we throw that recommendation to the wind. 
Since the main focus of LightDHT is research, we keep around all the data we can. This means that we keep every single node we know about. Since in practice the number of nodes is limited and the request rates are rather low, we do not bother keeping the routing table organized in a tree structure for quick lookups. Instead, we keep it in a dictionary and sort on demand. The performance penalty is well worth the reduced implementation complexity and the flexibility of having all nodes in an easy-to-use data structure.
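The on-demand sort mentioned above boils down to ordering node IDs by XOR distance, as in this Python 3 sketch (function and variable names are illustrative, not the actual LightDHT API):

```python
def xor_distance(a, b):
    # Kademlia metric: interpret the node IDs as big-endian integers
    # and XOR them; a smaller result means "closer".
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

def closest_nodes(routing_table, target, k=8):
    # routing_table maps node-ID bytes to contact info; sort the keys
    # by XOR distance to the target and keep the k nearest.
    return sorted(routing_table, key=lambda nid: xor_distance(nid, target))[:k]
```

A bucket-based table would give each lookup in O(log N), but for a research node that answers a handful of requests per second, an O(N log N) sort over a plain dictionary is perfectly adequate.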

Saturday, December 10, 2011

Festive and Nerdy

You might have seen those big outdoor displays where every pixel is a little three-color LED light. In this post I'll tell you how I turned a string of those pixel lights into Christmas tree lights!


Adafruit Industries, one of my favorite suppliers, recently started selling strings of a particularly advanced model of these pixel lights. Have a look at the product image, and tell me if it doesn't already look like a string of Christmas tree lights.
"12mm Diffused Digital RGB LED Pixels"
As you might have seen in my previous post, these pixels are rather easy to control from a micro-controller. However, stuffing a breadboard in your tree is not exactly festive. Thankfully, you can get these handy little tins that are just about the right size for a wall-wart plug, a teensy 2.0 with female headers (courtesy of Pieter Floris) and a big mess of wires.

A wall-wart plug, a Teensy2.0 and a big mess of wires.
Note that, thanks to the female-pin Teensy, everything is wired up using jumper wires. That's right - this is a no-solder project. Because the underside of the Teensy is not insulated, and because I have a number of wires in there that might come loose, I have insulated the interior of the tin using electrical tape. This is most likely not compliant with any code enacted after 1953 or so, so any imitation is purely at your own risk.

The only remaining question is how to fasten the plugs and wires onto the enclosure. Ideally you would use specially molded plastic caps and bits, but the odds of finding them in the right size and shape at your local hardware store are quite slim. Instead, I liberally applied black Sugru around the holes I had cut into the tin, and then pressed the wires and the plug into it. The Sugru just molds around the objects. After a quick touch-up I left it to cure overnight, and the result was really good.

Now we just pop closed the tin and voila, a presentable programmable Christmas tree lights driver!

The controller and a single string of LED pixels, running a test pattern.
And here's the tree, all festive and nerdy:

Festive and Nerdy

Sunday, December 4, 2011

It's the season! (preview)

A while ago my eye fell on this string of individually addressable RGB LEDs, and I realized they are practically ready-made Christmas lights. A quick order, a long wait, and an evening with my Teensy++ later, I have a sneak preview of our new Christmas tree decoration!

A WS2801 LED string is practically already a Christmas tree decoration!
Since this creation will actually be used "in production", I need to spend some time making it "living-room acceptable". I can tell you this will involve some patience and a lot of Sugru!


 

Wednesday, November 23, 2011

PCB as a Capacitive Soil Moisture Sensor

Electronics enthusiasts have been designing ways of automating plant care for decades, with mixed results. A traditional weak point of these installations is the resistive moisture sensor, which is inaccurate and prone to degradation. In this post we will design and fabricate an inexpensive capacitive soil moisture sensor out of a printed circuit board that exhibits none of the weaknesses of its resistive brethren.


Friday, October 14, 2011

Selectively Unchecking Iterators: the key to High Performance C++ in Visual Studio

In Visual Studio 2005, Microsoft introduced checked iterators for STL containers. This means that every single operation that uses these iterators is range-checked to prevent memory accesses outside the container's allocated storage. If a range violation occurs, the program terminates with an exception at the location of the violation, instead of trashing potentially unrelated memory as C++ has been happily doing for decades.

This safety net comes with a price: every single iterator change undergoes scrutiny by the runtime to determine whether it is safe. For most programs this is a paltry price to pay for the peace of mind of safe container access. However, for performance-sensitive code that includes tight loops iterating over STL containers, the penalties can be massive.

The folks at Microsoft realized this, and implemented a way to disable this functionality. If you define _SECURE_SCL to 0, the checking goes away. You are back at the level of performance (and unsafety) of C++ that you know and love. Unfortunately, there is a little snag. You can't exchange STL containers between modules that are compiled with _SECURE_SCL set to 0 and modules that are not. It may look like it's working, but it will blow up in your face when you least expect it. Also note that std::string is an STL container, and it features prominently in most modern C++ interfaces. The reality of the matter is this: if you define _SECURE_SCL to 0, you cannot use any C++ code that you did not compile yourself. No libraries, no DLLs, no nothing. For any non-trivial program, that's a pretty tall order.

Also, by setting _SECURE_SCL to 0 you are disabling iterator checking for the entire application. In any given program there are maybe hundreds of places where STL iterators are used, but only a handful (one function, two iterators in my current code) where these checks hinder performance to any real degree. Disabling iterator checking globally, even if it is possible given the limitations discussed previously, is rarely desirable.

It turns out that, as of VS2010, there is a way to selectively disable iterator checking. It is a fairly obscure piece of Visual Studio trivia, and I will share it with you so that it is preserved for posterity:

std::vector<int> v;
for (int j = 0; j < 1024 * 1024 * 5; ++j) {
    v.push_back(j);
}
int tot = 0;
std::vector<int>::iterator::_Unchecked_type ui;
std::vector<int>::iterator::_Unchecked_type uiend = _Unchecked(v.end());
for (ui = _Unchecked(v.begin()); ui != uiend; ++ui) {
    tot += *ui;
}
Note that we can uncheck the end iterator outside the loop, as the size of the vector does not change during the iteration. I encourage you to try it with checked and unchecked iterators to get a feel for the difference in performance. Be aware, however, that the inner loop here is very thin: performance gains will be inflated, since the inner loop goes from doing mainly range checking to doing mainly nothing!

Readers are encouraged to contribute a set of macros that abstract iterator unchecking in a convenient way.