Tuesday, May 22, 2012

LightDHT - A lightweight python implementation of the Bittorrent distributed hashtable

I have always been fascinated by distributed data structures, but my efforts have always been dampened by the fact that they are really awkward to play with. In order to get anything that resembles real-life usage, you need to be running hundreds of nodes. Needless to say, wrangling hundreds of processes gets rather cumbersome. Fortunately, there are already thousands of people out there that are participating in one of the largest distributed data structures in operation today: the bittorrent distributed hashtable.
In order to facilitate experimentation with this wonderful system, I have written a simple python implementation of a DHT node. It is mostly (with a few exceptions -- see below) compliant with BEP0005. It should be able to act as a full member of the DHT, processing requests without disrupting the network.
Interested in playing around? Get the code on github!
From the documentation:
The aim of LightDHT is to provide a simple, flexible implementation of the Bittorrent DHT for use in research applications. If you want to trade files, you have come to the wrong place. LightDHT does not implement the actual file transfer parts of the bittorrent protocol. It only takes part in the DHT. 
The main philosophy of LightDHT comes down to two considerations: ease of use over performance and adaptability over scalability. 
In order to keep LightDHT easy to use, all DHT RPC calls are performed synchronously. This means that when you call a DHT method, your program will block until you have an answer to your request. That answer will be the return value of the function. This has the advantage that it keeps the logical program flow intact, and makes it more comfortable to use. 
In order to maintain O(log N) scaling across the network, BEP0005 (the standard governing the DHT) mandates that implementations use a bucket-based approach to the routing table. This enables the node to fulfill all requests in constant time and (more or less) constant memory. In LightDHT, we throw that recommendation to the wind. 
Since the main focus of LightDHT is reseach, we are going to keep around all the data we can. This means that we keep around every single node we know about. Since in practice the number of nodes is limited and the request rates are rather low, we do not bother keeping the routing table organized in a tree structure for quick lookups. Instead we keep it in a dictionary and sort on-demand. The performance penalty is well worth the reduced complexity of the implementation, and the flexibility of having all nodes in an easy to use data structure.