Over the past few weeks I’ve been working to pack the entire Crossref database into a distributable SQLite file. While this sounds somewhat insane – the resulting file is 900GB – it’s quite a cool project for, say, embedded systems work in situations where no internet connection is available. It also provides speedy local indexed lookups, working faster than the internet-dependent API ever could.
There were some snags I hit along the way:
- I ended up writing the whole thing in Rust. You can do it in Python, but I got much better throughput doing it all in Rust.
- An ORM approach in Python was far too slow.
- Setting the DOI as the PRIMARY KEY was a bad idea. DOIs share so many common prefixes that the B-tree just grew and grew until the whole program was abort(3)-ed with an out-of-memory error, taking down the terminal emulator in which it was running, too. Rust does not recover well from OOM errors.
- It’s best to build the index of DOIs at the end, once all the data is in place (see the first sketch after this list).
- In Rust, you can use a bounded channel to read data and block until it has been processed in a writer thread. This was a neat way of reading the tar.gz files at maximum speed without exceeding the memory capacity of the host machine (the second sketch below shows the pattern).
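
To illustrate the indexing point, here is a minimal sketch using the rusqlite crate. The table and column names are purely illustrative, not the actual schema of the Crossref build; the idea is just that the DOI is stored as an ordinary column during the bulk load and only indexed once everything is inserted.

```rust
// Minimal sketch with rusqlite; table/column names are hypothetical.
use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open("crossref.db")?;

    // Rely on SQLite's implicit integer rowid as the key; the DOI is a
    // plain TEXT column, so inserts stay cheap during the bulk load.
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS works (
             doi      TEXT NOT NULL,
             metadata TEXT NOT NULL
         );",
    )?;

    // ... bulk-load all records here ...

    // Build the DOI index once, after all the data is in place.
    conn.execute_batch("CREATE INDEX IF NOT EXISTS idx_works_doi ON works(doi);")?;
    Ok(())
}
```

And here is a sketch of the bounded-channel pattern for the reader/writer split. The crate choices (flate2 and tar), the channel capacity, and the file name are assumptions for the example, not necessarily what the real pipeline uses; the point is that the sender blocks once the channel is full, so the decompressing reader can never get more than a fixed number of records ahead of the SQLite writer.

```rust
// Sketch of a bounded channel between a tar.gz reader and a writer thread.
use std::fs::File;
use std::io::Read;
use std::sync::mpsc::sync_channel;
use std::thread;

use flate2::read::GzDecoder;
use tar::Archive;

fn main() -> std::io::Result<()> {
    // Bounded channel: send() blocks once 100 records are in flight,
    // capping how far the reader can run ahead of the writer.
    let (tx, rx) = sync_channel::<Vec<u8>>(100);

    let writer = thread::spawn(move || {
        for record in rx {
            // Insert `record` into SQLite here; while this falls behind,
            // the reader thread is blocked on send().
            let _ = record;
        }
    });

    let file = File::open("crossref-dump.tar.gz")?; // hypothetical file name
    let mut archive = Archive::new(GzDecoder::new(file));
    for entry in archive.entries()? {
        let mut entry = entry?;
        let mut buf = Vec::new();
        entry.read_to_end(&mut buf)?;
        tx.send(buf).expect("writer thread hung up");
    }
    drop(tx); // close the channel so the writer's loop terminates
    writer.join().expect("writer thread panicked");
    Ok(())
}
```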
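The same pattern works with crossbeam's bounded channels if you need multiple readers; with the standard library's sync_channel, a single reader feeding a single writer is the simplest arrangement and keeps memory usage proportional to the channel capacity times the typical record size.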