File Formats and Implementation Details
Concepts
Every backup starts by cutting files into content-defined chunks, roughly 1 MB[^1] in size, using the FastCDC algorithm. Chunks are then ID'd by their SHA-224 hash.
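To make that concrete, here's a minimal sketch of the cut-and-hash step using the fastcdc, sha2, and hex crates. The chunk size parameters and the helper itself are illustrative assumptions, not Backpak's actual code:

```rust
use fastcdc::v2020::FastCDC;
use sha2::{Digest, Sha224};

/// Hypothetical helper: split a buffer into content-defined chunks and
/// ID each one by its SHA-224 hash. The sizes here (512 KB minimum,
/// 1 MB average, 4 MB maximum) are guesses for illustration.
fn chunk_and_id(data: &[u8]) -> Vec<(String, Vec<u8>)> {
    FastCDC::new(data, 512 * 1024, 1024 * 1024, 4 * 1024 * 1024)
        .map(|chunk| {
            let bytes = &data[chunk.offset..chunk.offset + chunk.length];
            let id = Sha224::digest(bytes);
            // Backpak renders IDs in its own text encoding; hex is fine for a demo.
            (hex::encode(id.as_slice()), bytes.to_vec())
        })
        .collect()
}
```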
Next, we need to organize lists of chunks back into their respective files, and files back into their directories. Let's represent each directory as a tree, where each node is a file made of chunks:
"PXL_20240804_202813830.jpg": {
"chunks": [
"oo98aq2o7ma75pmgmu6qc40jm8ds5blod7ne3ooendmqe",
"73rqnbmg905r3sv77eqcpvgjodbsv6m8mon6kdobj8vfq"
],
"metadata": {
"type": "posix",
"mode": 33188,
"size": 1097373,
"uid": 1000,
"gid": 100,
"atime": "2024-08-17T19:38:42.334637269Z",
"mtime": "2024-08-06T01:40:45.36797951Z"
}
}
A node can also be a subdirectory, whose ID is the SHA-224 of its serialized tree.
"Camera": {
"tree": "cti2sslfl8i9j3kvvfqkv2bust1pd1oiks0n2nhkg6ecu",
"metadata": {
"type": "posix",
"mode": 16877,
"uid": 1000,
"gid": 100,
"atime": "2024-08-17T08:13:52.026693074Z",
"mtime": "2024-08-16T07:35:05.949493629Z"
}
Note that we save basic metadata (owners, permissions, etc.)
but omit things we can't easily restore, or which depend on particular filesystems
(inode numbers, change times, extended attributes, etc.).
Backpak focuses on saving your files in a space-efficient format, not trying to make an exact image of a POSIX filesystem a la tar or rsync.
Special files like dev nodes and sockets are skipped for this same reason.
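Putting the concepts together: a tree is just a map from names to nodes, and each node either lists chunk IDs (a file) or points at another tree (a subdirectory). A rough sketch of such types (field and type names here are illustrative guesses, not Backpak's actual definitions):

```rust
use std::collections::BTreeMap;
use serde::{Deserialize, Serialize};

/// Illustrative stand-in for a blob or tree ID (a SHA-224 digest).
type ObjectId = String;

#[derive(Serialize, Deserialize)]
struct Metadata {
    r#type: String, // e.g. "posix"
    mode: u32,
    #[serde(skip_serializing_if = "Option::is_none")]
    size: Option<u64>, // files only
    uid: u32,
    gid: u32,
    atime: String,
    mtime: String,
}

/// A node is either a file (a list of chunk IDs) or a subdirectory
/// (the ID of its serialized tree), plus its metadata.
#[derive(Serialize, Deserialize)]
#[serde(untagged)]
enum Node {
    File { chunks: Vec<ObjectId>, metadata: Metadata },
    Directory { tree: ObjectId, metadata: Metadata },
}

/// A tree maps entry names to nodes, like the JSON examples above.
type Tree = BTreeMap<String, Node>;
```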
Files
Packs
Saving each chunk and tree as a separate file would make the backup larger than its source material. Instead, let's group them into larger files, which we'll call packs. We aim for 100 MB per pack, though compression shenanigans can cause it to overshoot.[^2]
Each pack contains:
- The magic bytes `MKBAKPAK`
- The file version number (currently 1)
- A Zstandard-compressed stream of either chunks or trees (which we'll collectively call blobs)
- A manifest of what's in the pack, as (blob type, length, ID) tuples
- The manifest length, in bytes, as a 32-bit big-endian integer. This lets a reader quickly seek to the manifest.
Since a pack's manifest uniquely identifies all the blobs inside (and, for what it's worth, the order in which they're stored), the SHA-224 of the manifest is the pack's ID.
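Here's a sketch of that framing, assuming the blob stream has already been compressed and the manifest already serialized. The helper and the field widths it assumes are illustrative, not Backpak's actual API:

```rust
use std::io::{self, Write};

/// Hypothetical framing for a pack file, matching the layout above:
/// magic bytes, version, compressed blob stream, manifest, manifest length.
fn write_pack<W: Write>(
    mut out: W,
    compressed_blobs: &[u8], // Zstandard stream of chunks or trees
    manifest: &[u8],         // serialized (blob type, length, ID) tuples
) -> io::Result<()> {
    out.write_all(b"MKBAKPAK")?;
    out.write_all(&[1])?; // file version (assumed here to be a single byte)
    out.write_all(compressed_blobs)?;
    out.write_all(manifest)?;
    // Trailer: the manifest length as a 32-bit big-endian integer,
    // so a reader can find the manifest without scanning the blobs.
    out.write_all(&(manifest.len() as u32).to_be_bytes())?;
    Ok(())
}
```

A reader does the reverse: read the last four bytes, interpret them as the big-endian manifest length, and seek back that far to parse the manifest, touching the blob stream only when it actually needs blob contents.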
Indexes
Reading each pack every time to rediscover its contents would be a huge slowdown, and for cloud-stored repositories, a huge bandwidth suck. As we make a backup, let's build an index of all the packs we've built. Each index contains:
- The magic bytes `MKBAKIDX`
- The file version number (currently 1)
- A Zstandard-compressed map of each pack's ID to its manifest
We can also use the index for resumable backups! As we finish each pack, we write a work-in-progress index to disk. If the backup is interrupted and restarted, we read the WIP index and resume from wherever the last pack left off.
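In memory, the index boils down to a map from pack IDs to their manifests, and resuming means skipping any blob the work-in-progress index already accounts for. A hand-wavy sketch, with types and names that are assumptions rather than Backpak's own:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Illustrative types: pack and blob IDs are SHA-224 digests.
type PackId = String;
type BlobId = String;

/// One manifest entry per blob stored in a pack.
struct ManifestEntry {
    blob_id: BlobId,
    length: u64,
    // blob type (chunk or tree) omitted for brevity
}

/// The index: every pack we know about, and what's inside it.
type Index = BTreeMap<PackId, Vec<ManifestEntry>>;

/// When resuming an interrupted backup, collect the blobs already
/// safely packed so we don't pack and upload them a second time.
fn already_packed(index: &Index) -> BTreeSet<BlobId> {
    index
        .values()
        .flat_map(|manifest| manifest.iter().map(|e| e.blob_id.clone()))
        .collect()
}
```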
Snapshots
After packing all our blobs and writing the index, the last step of a backup is to upload a snapshot. Each contains:
- The magic bytes `MKBAKSNP`
- The file version number (currently 1)
- A CBOR file containing snapshot metadata (author, tags, and time), the absolute paths that the snapshot backed up, and the root tree of the backup.
We don't bother with compressing snapshots since they're so small.
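The CBOR payload might be modeled with a struct along these lines. Field names and types are guesses for illustration; serialize it with any serde CBOR backend:

```rust
use std::path::PathBuf;
use serde::{Deserialize, Serialize};

/// Illustrative shape of a snapshot's CBOR payload.
#[derive(Serialize, Deserialize)]
struct Snapshot {
    author: String,
    tags: Vec<String>,
    /// Snapshot time; an RFC 3339 string keeps the sketch simple.
    time: String,
    /// Absolute paths this snapshot backed up.
    paths: Vec<PathBuf>,
    /// SHA-224 ID of the backup's root tree.
    tree: String,
}
```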
[^1]: Smaller chunks mean better deduplication, but more to keep track of. 1 MB was chosen as a hopefully-reasonable compromise — each gigabyte of chunks gives about 30 kilobytes of chunk IDs.
[^2]: It's hard to know how large a compressed stream will be without flushing it, and flushing often can hurt the overall compression ratio. Backpak tries not to do that, but this means it often overshoots packs' target size.