2024-02-06
The year is 2024. I am on vacation and dream up a couple of toy programs I would like to build. It has been a few years since I built a standalone toy; I have been busy. So instead of actually building any of the toys I think of, I spend my time researching whether anything has changed since the last time I did it. Should I pick up new tools or techniques?
It turns out lots of things have changed! There’s some great stuff out there, including decent quorum-write regional cloud databases now. Oh and the ability to have a fascinating hour-long novel conversation with transistors. But things are still awkward for small fast tinkering.
Going back in time, I struggled constantly rewriting the database for the prototype for Tailscale, so I ended up writing my in-memory objects out as a JSON file. It went far further than I planned. Somewhere in the intervening years I convinced myself it must have been a bad idea even for toys, given all the pain migrating away from it caused. But now that I find myself in an empty text editor wanting to write a little web server, I am not so sure. The migration was painful, and a lot of that pain was borne by others (which is unfortunate, I find handing a mess to someone else deeply unpleasant). Much of that pain came from the brittle design of the caching layers on top (also my doing), which came from not moving to an SQL system soon enough.
I suspect, considering the process in retrospect, that a great deal of that pain can be avoided by committing to migrate directly to an SQL system the moment you need an index. You can pay down a lot of exploratory design work in a prototype before you need an index: while n is small, full scans are fine. But you don't make it very far into production before one of your values of n crosses something around a thousand and you long for an index.
With a clear exit strategy for avoiding big messes, the JSON file as database is still a valid technique for prototyping. And having spent a couple of days remembering what a misery it is to write a unit test for software that uses postgresql (mocks? docker?? for a database program I first ran on a computer with less power than my 2024 wrist watch?) and struggling to figure out how to make my cgo sqlite cross-compile to Windows, I'm firmly back to thinking a JSON file can be a perfectly adequate database for a 200-line toy.
Before you jump into this and discover it won't work, or just as bad, dismiss the small and unscaling as always a bad idea, consider the requirements of your software and what using a JSON file as a database will demand of it.
Programming is the art of tradeoffs. You have to decide what matters and what does not. Some of those decisions need to be made early, usually with imperfect information. You may very well need a powerful SQL DBMS from the moment you start programming, depending on the kind of program you’re writing!
An implementation of jsonfile (which Brad called JSONMutexDB, which is cooler because it has an x in it, but requires more typing) can fit in about 70 lines of Go. But there are a couple of lessons we learned in the early days of Tailscale whose cost can be paid down relatively easily, growing the implementation to 85 lines. (More with comments!) I think it's worth describing the interesting things we ran into, both in code and here.
You can find the implementation of jsonfile here: https://github.com/crawshaw/jsonfile/blob/main/jsonfile.go. The interface is:
type JSONFile[Data any] struct { … }
func New[Data any](path string) (*JSONFile[Data], error)
func Load[Data any](path string) (*JSONFile[Data], error)
func (p *JSONFile[Data]) Read(fn func(data *Data))
func (p *JSONFile[Data]) Write(fn func(*Data) error) error
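To make that shape concrete, here is a rough usage sketch. The Data type, its Todos field, and the file name are invented for illustration; only the jsonfile calls come from the interface above.

package main

import (
	"fmt"

	"github.com/crawshaw/jsonfile"
)

type Data struct {
	Todos []string
}

func main() {
	db, err := jsonfile.New[Data]("app.json") // use Load[Data] for an existing file
	if err != nil {
		panic(err)
	}
	// All edits happen inside the Write callback; returning an error discards them.
	err = db.Write(func(d *Data) error {
		d.Todos = append(d.Todos, "water the plants")
		return nil
	})
	if err != nil {
		panic(err)
	}
	// Reads take a callback too, so the lock is always released.
	db.Read(func(d *Data) {
		fmt.Println(len(d.Todos), "todos")
	})
}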
There is some experience behind this design. In no particular order:
One of the early pain points in the transition was figuring out the equivalent of when to BEGIN, COMMIT, and ROLLBACK. The first version exposed the mutex directly (it was later converted to an RWMutex). There is no advantage to paying this transition cost later. It is easy to box up read/write transactions with a callback. This API does that, and provides a great point to include other safety mechanisms.
There are two forms of partial failure to guard against. The first is if the write fn fails half-way through, having edited the db object in some way. To avoid this, the implementation first creates an entirely new copy of the DB before applying the edit, so the entire change set can be thrown away on error. Yes, this is inefficient. No, it doesn't matter. Inefficiency in this design is dominated by the I/O required to write the entire database on every edit. If you are concerned about the duplicate-on-write cost, you are not modeling I/O cost appropriately (which is important, because if I/O matters, switch to SQL).
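Concretely, the copy-then-commit shape looks roughly like this. It is a simplified sketch: the field names and the writeAtomic helper are my own stand-ins (writeAtomic is sketched in the rename discussion below), and the real jsonfile differs in detail. It assumes the "encoding/json" and "sync" imports.

type JSONFile[Data any] struct {
	mu   sync.RWMutex
	path string
	data *Data
}

func (p *JSONFile[Data]) Write(fn func(*Data) error) error {
	p.mu.Lock() // BEGIN
	defer p.mu.Unlock()

	// Copy the whole DB by round-tripping through JSON: crude, but the cost is
	// dwarfed by the file write below.
	b, err := json.Marshal(p.data)
	if err != nil {
		return err
	}
	next := new(Data)
	if err := json.Unmarshal(b, next); err != nil {
		return err
	}

	if err := fn(next); err != nil {
		return err // ROLLBACK: half-made edits die with the copy
	}

	out, err := json.Marshal(next)
	if err != nil {
		return err
	}
	if err := writeAtomic(p.path, out); err != nil {
		return err
	}
	p.data = next // COMMIT: readers now see the new data
	return nil
}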
The second is from a full disk. The easy way to write a file in Go is to call os.WriteFile, which the first implementation did. But that means opening the existing file with O_TRUNC, then write(2), then close(2). A failure can occur in any of those system calls, resulting in a corrupt DB.
So this implementation creates a new file, writes the DB into it, and when that has all succeeded, uses rename(2) to move it into place. It is not a panacea; our operating systems do not make all the promises we wish they would about rename. But it is much better than the default.
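A sketch of that create-then-rename dance, filling in the writeAtomic helper assumed above. The sibling ".tmp" name is my own choice and error paths are simplified; the real code may differ. It assumes the "os" import.

func writeAtomic(path string, data []byte) error {
	tmp := path + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // push the bytes to disk before the rename
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// rename(2): readers see either the old file or the new one, never half of each.
	return os.Rename(tmp, path)
}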
A nasty issue I have run into twice is aliasing memory. This involves doing something like:
list := []int{1, 2, 3}
db.Write(func(d *Data) error {
	d.List = list // d.List now shares list's backing array
	return nil
})
list[0] = 10 // editing the database!
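One cheap habit that avoids this (my suggestion, not something the library enforces) is to copy slices and maps as you store them:

db.Write(func(d *Data) error {
	d.List = append([]int(nil), list...) // store a copy, not the caller's slice
	return nil
})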
Backups. An intermediate version of this code kept the previous database file on write. But there's an easier and even more robust strategy: never rename the file back to the original. Always create a new file, mydb.json.<timestamp>. On starting, load the most recent file. Then when your data is worth backing up (if ever), have a separate program prune down the number of files and send them somewhere robust.
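A rough sketch of the "load the most recent file" step, assuming names like mydb.json.20240206T150405 so that lexical order is timestamp order (the name and format are illustrative; it uses the "os", "path/filepath", and "sort" imports):

func newestDB(dir string) (string, error) {
	matches, err := filepath.Glob(filepath.Join(dir, "mydb.json.*"))
	if err != nil {
		return "", err
	}
	if len(matches) == 0 {
		return "", os.ErrNotExist
	}
	sort.Strings(matches) // oldest to newest
	return matches[len(matches)-1], nil
}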
Constant memory. Not in this implementation, but something you may want to consider, is removing the risk of a Read function editing memory. You can do that with View* types generated by the viewer tool. It's neat, but it more than quadruples the complexity of JSONFileDB, complicates the build system, and isn't very important in the sorts of programs I write. I have found several memory aliasing bugs in all the code I've written on top of a JSON file, but have yet to accidentally write when reading. Still, for large code bases Views are quite pleasant and well worth considering, around the point when a project should move to real SQL.
There is some room for performance improvements too (using cloner instead of unmarshalling a fresh copy of the data for writing), though I must point out again that needing more performance is a good sign it is time to move on to SQLite, or something bigger.
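For example, reusing the toy Data type from the usage sketch above, a hand-written (or generated) deep copy would avoid the JSON round trip entirely. Purely illustrative:

func (d *Data) Clone() *Data {
	c := *d
	c.Todos = append([]string(nil), d.Todos...) // copy every slice, map, and pointer field
	return &c
}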
It’s a tiny library. Copy and edit as needed. It is an all-new implementation so I will be fixing bugs as I find them.
(As a bonus: this was my first time using a Go generic! 👴 It went fine. Parametric polymorphism is ok.)
Why go out of my way to devise an inadequate replacement for a database?
Most projects fail before they start. They fail because the activation energy is too high. Our dreams are big and usually too much, as dreams should be.
But software is not building a house or traveling the world. You can realize a dream with the tools you have on you now, in a few spare hours. This is the great joy of it, you are free from physical and economic constraint.
If you start. Be willing to compromise almost everything to start.