Many books!

Calibre is an awesome tool that lets you catalog your books, convert them, and much more. I use Kavita for the books I actually read, but Calibre helps me store a huge number of books (500k+) in a way that keeps them searchable.

So I got about half a million books. They all came in a non-searchable form - basically files named from 0.fb2 to 555555.fb2. The challenge was to preserve all those books in some searchable format without making it a nightmare for system I/O - just imagine unarchiving 500k small files! Or even worse - storing them raw. I had an idea.

Challenges.

The first one, like I said before, is working with hundreds of thousands of small files. Sure, you can just throw more hardware at any issue. But what if one day I need to use this enormous library on something like a laptop or an RPi? Sounds like a nightmare. The typical go-to solution when you have many files is to split them into chunks (directories). This worked out well - the books already came to me as a set of ~2 GB archives. Keeping them in the same groups makes it easy to add new archives - I just add each new archive as its own "chunk". If I ever decided to merge them, I'd lose track of where each book came from.

The second one is, obviously, search. Each archive ends up being a single Calibre library. I tried adding all 500k books to a single library, and it didn't go well. Calibre doesn't allow you to search across multiple libraries at once. And even if it did, I'd still compress each group back, so it's easier to store and also enables checksumming in my case. The solution is dead simple - store each Calibre database in a text format. It's super easy to search through text! Calibre allows exporting its database as a CSV file, which lets me use something like grep (or ripgrep in my case) to easily look up which group the book I need is located in. Neat.

The third one is optional, but storing one big file is much easier than storing thousands of smaller ones. This is where archival tools come in handy.

Solutions.

As for grouping, I tried to keep things looking the same as when they came to me. The archives I got were named in ranges, like 000000-134445.zip. So I kept this numbering with a little change. I was horrified by the thought that one day books numbered past 1 million would be released, adding an extra digit and ruining alphabetical sorting!! So I just added prefixes like 000001_000000-134445. I can process and store a million books, but there's no way I'll ever end up with a million such groups! Even this collection is about 400 GB compressed. This is "good enough" for the foreseeable future.
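Here's a rough sketch of how that renaming could be scripted - the ./incoming and ./groups paths are just made up for the example, not my actual layout:

```
# A sketch, not my exact commands: prefix each range-named archive
# with a zero-padded group index so alphabetical sorting stays stable.
mkdir -p ./groups
i=1
for archive in ./incoming/*.zip; do
    prefix=$(printf '%06d' "$i")         # e.g. 000001
    base=$(basename "$archive" .zip)     # e.g. 000000-134445
    mv "$archive" "./groups/${prefix}_${base}.zip"
    i=$((i + 1))
done
```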

The second issue is search. You can export a Calibre database as CSV by running this command inside your Calibre library: $ calibredb catalog index.csv. This will put a file called index.csv inside it. Rename it to the group's name, like 000001_000000-134445.csv, and store it in a separate directory. Later you can do something like $ rg "Война и Мир" (that's "War and Peace"), and it'll show you something like this:

$ rg "Война и Мир"
000003_074392-091839.csv
3835:"Толстой, Лев Николаевич","Лев Николаевич Толстой","Лев Толстой Война и Мир Том 1","","fb2","3834","","","rus","000003_074392-091839","0101-01-01T03:00:00+03:00","","","Война и мир","1.0","1324920","prose_classic","2023-09-29T23:11:18+03:00","Война и мир. Том 1","Война и мир. Том 1","67c9d4ef-fe03-48c0-a897-a819f6d21b37"

See! It gave us all the info about the book and also the name of the group where this book is located! Now I can just unarchive that library, use Calibre to find the book there, and export it in the desired format.
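The export step is easy to script too. Here's a minimal sketch, assuming each group lives as its own Calibre library directory under ./libraries and the CSVs go into ./index - those paths are just placeholders:

```
# A sketch: export every group's Calibre catalog as CSV, named after the group.
mkdir -p ./index
for lib in ./libraries/*/; do
    group=$(basename "$lib")                 # e.g. 000001_000000-134445
    calibredb catalog "./index/${group}.csv" --library-path "$lib"
done
```

After that, running rg inside ./index answers the "which group is this book in?" question without unpacking a single archive.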

As for the third issue, I just used tar with maximum xz compression and sha1 for checksumming. I actually have bash functions to automate this process here. I call archive ./dir or unarchive ./dir_hash.tar.xz, and they create an archive and append its hash to the name, or check the hash and extract, respectively. Maximum xz compression reduced an average ~2 GB incoming zip archive to a ~1.3 GB output tar archive. Compression is one of the things that actually makes me happy.
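If you want to roll your own, a minimal version of such functions could look like this (a sketch under my naming scheme, not my exact functions):

```
# archive ./dir                  -> dir_<sha1>.tar.xz (maximum xz compression)
# unarchive ./dir_<sha1>.tar.xz  -> verifies the checksum, then extracts
archive() {
    local dir=${1%/}
    XZ_OPT=-9e tar -cJf "${dir}.tar.xz" "$dir"
    local hash
    hash=$(sha1sum "${dir}.tar.xz" | cut -d' ' -f1)
    mv "${dir}.tar.xz" "${dir}_${hash}.tar.xz"
}

unarchive() {
    local file=$1
    local expected=${file##*_}           # the hash is the part after the last "_"
    expected=${expected%.tar.xz}
    local actual
    actual=$(sha1sum "$file" | cut -d' ' -f1)
    if [ "$actual" != "$expected" ]; then
        echo "checksum mismatch for $file" >&2
        return 1
    fi
    tar -xJf "$file"
}
```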

Conclusion.

Just like in life, you can solve big problems by decomposing them into smaller ones. Now I have an enormous private ebook library that is easy to search and copy to backup drives. A friendly reminder - make a backup!