Writing linux file system
Really just an excuse to play with B-trees.
Data structures (and algorithms) is a classic topic that was woefully under-addressed at university: in first and second year we covered only quick sort and hash tables. From programming experience, I knew that arrays and brute force for-loops were not an adequate way to manage data in real applications. But the only resource (this was before Wikipedia and the ubiquity of free software) for learning what real programmers did were expensive textbooks prescribed for classes in which they were subsequently neglected. One such is File Structures by Folk, Zoellick and Riccardi.
I devoured this book. I borrowed it from a friend and read it cover to cover on my summer holiday. However I did not much like C++, and did not attempt the exercises. And no sufficiently small projects involving the structures occurred to me. Windows 95 was the environment of the time, so a file system was not feasible without significant hacking of Windows internals. A database project might have been suitable,
When I got my netbook in 2009 I started learning Unix seriously. I’m ashamed that it took this long! I’d used it in the university CS labs, and I’ve long used Cygwin for getting real work done on Windows systems. But I had had only sporadic attempts at maintaining my own Linux system.
Linux supports myriad file systems. This had long tempted me — all kinds of data management tricks could be done with a virtual file system à la /proc . With FUSE, user-mode file systems are supported, too! No kernel hacking required.
- Efficient, read-only storage of small files. I’ve accumulated enormous amounts of these (like documentation in HTML format), and they were not handled efficiently on NTFS (internal and external fragmentation, overlarge directories, etc.)
- Tags for files, using paths like /tag/starcon2 as a view on all starcon2-related files.
- Full text index storage; the index would require a file for each word, which again is inefficient on NTFS.
- File system interfaces to databases.
- Proper file system access to a Subversion repository.
At some point I discovered the Windows equivalent, Dokan. (I should confess once more that I may have been insufficiently Linux-indoctrinated to have started work on my file system with FUSE. I don’t actually remember which I tried first, but it may not have been FUSE! :-/)
I needed a simple project just to get the feel of file system implementation. So instead of going with any of the interesting ideas above, I started with a simple, vanilla, hierarchical file system. Based on b-trees! B-trees are naturally for hierarchical file systems, and seem to be the way of the future: btrfs (which far outshines my own attempt) is on course to being the new, efficient, reliable standard file system for Linux.
It has been over a year since I did any significant work on it, but it’s kind of apposite that I post about it while still jetlagged from the London trip, because I did most of the development during my previous big overseas trip. I remember writing the code for the special subfilesystem while sweltering in a basement room in Athens. Unfortunately on my last trip I did not do any project work. But it did at least remind me of my file system project, so hopefully I’ll find time to resume work on this. Or perhaps start a new file system, that’s less an inferior rehash of existing systems and more a useful prototype of something new.
Features
- log n directory lookups, and free in-order traversals (as long as the order you care about is by name).
- Hard links — even hardlinks to directories! Unfortunately I could never test them because the POSIX spec and the Linux file system interface prevents them. Of course there is the major hazard of cycles in the directory graph.
- Special introspective subfilesystem, .fs, which contains (read only) representations of tree nodes, blocks and other metadata.
- Runs on Windows and Linux. Admittedly it is somewhat bug-ridden on each. And there is the outstanding question of case-sensitivity.
Architecture
This project has a distinct modularity, with several layers of functionality.
Architecture overview of file system
OS-interface layer
Drivers for FUSE on Linux, and Dokan on Windows. File system is contained in a growable file — maintainance of this file is handled by the lowest layer.
This layer handles file system startup and integration with the FUSE and Dokan drivers.
Command-layer
Supports basic file and directory operations. FUSE and Dokan specify a number of operations that file system implementations can support. These correspond closely with the OS calls that applications can make for file manipulation.
This layer has knowledge of how file data and metadata is stored in the file system. Files and directories are stored as collection of keys. Each file or directory has a unique id; this becomes the prefix for the keys of its metadata.
File data is stored in extents (sequences of contiguous blocks), each of which is recorded by a key of the form prefix/Xstart-end. This indicates that blocks start to end of file prefix are located in file system blocks starting at that key’s value.
Tree-layer
The file system contains a single B-tree. Each key (a variable-length sequence of bytes), has a type and a value (a single 4-byte datum).
This layer provides a generic tree-traversal function. Operations on individual keys are atomic; the traversal function takes a callback that operates on the key when found. If the key is altered, maintenance operations such as splitting and merging are performed automatically after the callback is run.
Several predefined callbacks exist, for adding, deleting, and updating an existing key.
Block-layer
The file system file is divided into fixed-size blocks. The block types are:
- super block – always at position 0.
- tree block – these form a B-tree, containing all keys in the file system in lexicographical order. Root is referenced in super block.
- data block – file data. Referenced by extent keys for each file.
- bitmap block – form a bitmap of all blocks in the file system, indicating which are free. Referenced by a list in the super block.
Blocks in the cache are indexed two ways: in a hash table by position in the file, and in a LRU linked list. A block can be pinned, indicating it’s in use by some operation and should not be discarded from the cache.
Free blocks are recorded in bitmaps. The super block contains a list of addresses of bitmap blocks. Allocation of a new block begins at a target address (typically closest to related blocks, such as neighbouring parts of the file or tree blocks). The bitmap is searched from that point until a free block is found. Additionally, blocks can be allocated at the end of the file system data file.
Special layer
This layer implements the .FS virtual directory in the file system root. Under this directory are virtual items that indicate the underlying structure. For instance .FS/keys contains a directory for each prefix, and in each a file for each key starting with that prefix containing the key’s value.
File system layer
This layer simply handled opening and closing of the file system data file. When the data file is created, it bootstraps the super block and root tree block.