Backpak: deduplicating backups done simply

Backpak is a backup and archiving program that offers:

Content-Addressed Storage: Files are split into chunks based on what's inside them, and each chunk is tracked with a unique ID. This gives us some huge advantages:
1. Only new chunks are added to the backup, so files are deduplicated even if they are moved or renamed between backups!
2. Because chunks are split based on their contents, small changes to large files (e.g., disk images) don't cause the entire file to be recopied.
3. Because IDs are a cryptographic hash (SHA-224), they double as verification that the bytes inside haven't rotted.
Compression: In the bad old days, you had to choose between leaving data uncompressed or incurring massive slowdowns on already-compressed files (videos, ZIP archives, etc.). Today, we have Zstandard, and it rips through high-entropy data at several gigabytes a second. Backpak uses it almost everywhere.
Bring Your Own Encryption: The first rule of crypto club is "don't roll your own crypto" — Backpak uses GPG by default, and can be configured to encrypt your data with anything else you'd like.
Support for multiple backends: Backpak was designed to support many different backup targets, starting with local filesystems (or anything mounted as such, like SSHFS) and Backblaze B2. Additional backends like rsync or S3 are planned next.

Backpak ships as a simple CLI for Linux and macOS. Windows support is a work in progress.

Why another backup system?

There are lots of good choices when it comes to open-source backup software! Restic and BorgBackup are close contenders, but neither had everything I wanted in one place.

I hope you find Backpak useful!

Getting Started

Creating a repository

Backpak saves backups in a repository. We can make one in a local folder:

$ backpak --repository ~/myrepo init filesystem

Or, if you'd like to upload to Backblaze B2, the -r/--repository flag just sets the repo's config file:

$ backpak -r ~/myrepo.toml \
    init --gpg MY_FAVORITE_GPG_KEY \
    backblaze \
        --key-id "deadbeef" \
        --application-key "SOMEBASE64" \
        --bucket "matts-bakpak"

With --gpg, Backpak will run a quick check that it can round-trip data with

gpg --encrypt --recipient <KEY>

then encrypt all files in the repo using that command. You can edit the repo config file to use a different, arbitrary command.

More backends to follow.

Backing up

Let's make a backup!

$ backpak -r ~/myrepo backup ~/src/backpak/src
Walking {"/home/me/src/backpak/src"} to see what we've got...
/ 297 KB
Opening repository srctest
Building a master index
Finding a parent snapshot
Running backup...
/ P 17 KB + 7 KB | R 281 KB | Z 8 KB | U 9 KB
I 2 packs indexed
D 20 KB downloaded
/home/me/src/backpak/src

Snapshot afe4ajdi done

We print updates as we go:

How much we Packed into this backup (files + metadata)
How much we Reused from previous backups
How much Zstandard ensmallened the data
How much we Uploaded

If interrupted, the incomplete backup will leave behind a backpak-wip.index and a handful of other files. This allows Backpak to resume where it left off.

You can also:

Pass multiple paths to backup.
Specify a backup author with --author (otherwise the machine's hostname is used).
Annotate your backup with --tag.
Skip over files and folders (matching regular expressions) with --skip.
Dereference symbolic links with -L.
See what you'd backup with --dry-run. (Most commands have this!)

Your new backup is saved as a snapshot. You can view a list of the repository's snapshots with... snapshots:

$ backpak -r ~/myrepo snapshots
...
snapshot afe4ajdifcgfkghmq2tivqlsjnptvri5inb8inn99k0k2
Author: my-desktop
Date:   Thu Nov 7 2024 22:55:36 US/Pacific

  - /home/me/src/backpak/src

By default, we see the snapshot ID, the author, any tags, the date, and the paths backed up. We can get some additional info by passing more flags:

--sizes will calculate how much data each snapshot adds to the repo.
--file-sizes breaks this down further, showing which files added data, sorted largest to smallest.
--stat shows the changes each backup made compared to the previous — what was added, removed, etc. (Kinda like git log --stat.) Add --metadata to see changes to that as well.

Each snapshot can be referenced by a few digits of its ID (enough to be unique), or relative to the most recent snapshot — LAST is the latest, followed by LAST~, then LAST~2, LAST~3, and so on.¹
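
To make those rules concrete, here is a minimal Python sketch of how such references could be resolved. It is not Backpak's actual code: the `resolve_snapshot` helper is hypothetical, and it assumes the snapshot IDs are already sorted from oldest to newest.

```python
import re

def resolve_snapshot(ref: str, ids_oldest_first: list[str]) -> str:
    """Resolve a snapshot reference to a full ID (hypothetical helper)."""
    # Relative form: LAST, LAST~ (one back), LAST~2, ... HEAD is an alias.
    m = re.fullmatch(r"(?:LAST|HEAD)(?:~(\d*))?", ref)
    if m:
        if m.group(1) is None:
            back = 0                      # plain LAST
        elif m.group(1) == "":
            back = 1                      # LAST~ with no number
        else:
            back = int(m.group(1))        # LAST~N
        if back >= len(ids_oldest_first):
            raise LookupError(f"{ref}: only {len(ids_oldest_first)} snapshots")
        return ids_oldest_first[-1 - back]
    # Otherwise treat the reference as an ID prefix, which must be unique.
    matches = [i for i in ids_oldest_first if i.startswith(ref)]
    if len(matches) != 1:
        raise LookupError(f"{ref}: matches {len(matches)} snapshots")
    return matches[0]
```

An ambiguous prefix raises an error rather than guessing, which is the behavior you'd want from anything that feeds into restore or forget.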

Using these, we can do some routine things, like list the files in the snapshot:

$ backpak -r ~/myrepo ls LAST
src/
src/backend/
src/backend/backblaze.rs
src/backend/cache.rs
...
src/ui/snapshots.rs
src/ui/usage.rs
src/ui.rs
src/upload.rs

Or compare the snapshot to whatever's in the directory currently:

$ backpak -r ~/myrepo diff ra8o
   + src/some-new-thing
   + src/some-other-new-thing

Restoring data

To restore a snapshot:

$ backpak -r ~/myrepo restore LAST

By default, restore doesn't delete anything. To also remove files that aren't in the snapshot, pass --delete:

$ backpak -r ~/myrepo restore --delete LAST
- /home/me/src/backpak/src/some-new-thing
- /home/me/src/backpak/src/some-other-new-thing

Additional flags like --times and --permissions can restore metadata, and --output can restore the snapshot to a different directory than where it came from.

If you'd like to dump an individual file from a snapshot, you can do that too:

$ backpak -r ~/myrepo dump LAST src/lib.rs
//! Some big dumb backup system.
//!
//! See the [`backup`] module for an overview and a crappy block diagram.

pub mod backend;
pub mod backup;
pub mod blob;
...

Deleting snapshots

Sometimes you want to remove old snapshots, or you backed up the wrong things. You can remove a snapshot from your repository with

$ backpak -r ~/myrepo forget <ID>

This only deletes the snapshot itself, not the data it points to. (After all, many snapshots can reference the same data!) To run garbage collection on the repo and remove files that aren't referenced by any snapshot anymore, run

$ backpak -r ~/myrepo prune
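
The split between forget and prune is the classic mark-and-sweep idea: walk everything still reachable from the remaining snapshots, then anything left over is garbage. The Python sketch below illustrates only that reachability step. It ignores packs and indexes entirely, and the data model (a dict from tree ID to child blob IDs) and function names are assumptions for illustration.

```python
def reachable_blobs(snapshot_roots, trees):
    """Collect every blob ID reachable from the given root trees."""
    seen = set()
    stack = list(snapshot_roots)
    while stack:
        blob = stack.pop()
        if blob in seen:
            continue                      # shared subtree, already visited
        seen.add(blob)
        # Trees list their children; chunks have none.
        stack.extend(trees.get(blob, ()))
    return seen

def garbage(snapshot_roots, trees, stored):
    """Blobs in the repo that no remaining snapshot references."""
    return set(stored) - reachable_blobs(snapshot_roots, trees)

# Two snapshots share the "docs" tree; forgetting one frees only its own data.
trees = {"rootA": ["docs", "c1"], "docs": ["c2", "c3"], "rootB": ["docs", "c4"]}
stored = {"rootA", "rootB", "docs", "c1", "c2", "c3", "c4"}
print(sorted(garbage(["rootA"], trees, stored)))   # ['c4', 'rootB']
```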

Repository health

If you'd like to know how much space a repository is using, try usage:

$ backpak -r photo-backup.toml usage
2 snapshots, from 2024-08-17T12:39:15 to 2024-08-17T12:57:30
16.48 GB unique data
16.48 GB reused (deduplicated)

2 indexes reference 165 packs

Backblaze usage after zstd compression and gpg:
snapshots: 1 KB
indexes:   448 KB
packs:     16.29 GB
total:     16.29 GB

Like any sane backup system, Backpak tries very hard to make sure data is always left in a consistent state — packs are always uploaded before the index that references them, which is uploaded before its snapshot, etc. But if you're the "trust but verify" type:

$ backpak -r photo-backup.toml check

This reads the indexes and ensures that every pack they mention is present. check --read-packs will go a step further and verify the contents of each pack! To state the obvious, expect this to take a while since it's reading every byte in the repo.
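
In spirit, the two levels of checking look something like the Python sketch below. The function names are hypothetical, indexes are modeled as plain dicts keyed by pack ID, and IDs are written as hex digests purely for convenience.

```python
import hashlib

def missing_packs(indexes, packs_present):
    """Pack IDs that some index mentions but the backend doesn't have."""
    wanted = set()
    for index in indexes:
        wanted.update(index)              # iterating a dict yields its keys
    return sorted(wanted - set(packs_present))

def blob_is_intact(blob_id_hex, data):
    """The deeper check: rehash the bytes and compare against the ID."""
    return hashlib.sha224(data).hexdigest() == blob_id_hex
```

The first function only needs a directory listing from the backend. The second needs every byte, which is why reading packs takes so much longer.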

Read up on the implementation details if you're wondering what the hell an index or a pack is.

Other commands

backpak copy will copy snapshots between repositories. You can add --skip to leave files you don't want out of the new one.
backpak filter-snapshot creates a copy of a snapshot in the same repo, but with certain files skipped. (--skip is mandatory!)
backpak cat will print objects in the repo as JSON. It's mostly meant for debugging.

¹ If your Git habits die hard, HEAD, HEAD~1, HEAD~2, etc. also work.

File Formats and Implementation Details

Concepts

Every backup starts by cutting files into content-defined chunks, roughly 1 MB¹ in size, using the FastCDC algorithm. Chunks are then ID'd by their SHA-224 hash.
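
To see why content-defined chunks survive edits, here is a deliberately simplified Python sketch. It is not FastCDC and not Backpak's code: it uses a bare gear-style rolling hash with no minimum or maximum chunk size, targets tiny chunks (about 1 KB on average) so the effect shows up on small inputs, and writes IDs as hex digests. All names are made up for illustration.

```python
import hashlib
import random

_rng = random.Random(0)                   # fixed seed: same table every run
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1

def split(data: bytes, avg_bits: int = 10) -> list[bytes]:
    """Cut wherever the top `avg_bits` bits of a rolling hash are all zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Each byte's influence is shifted out after 64 steps, so a cut
        # depends only on the bytes just before it, never on file offsets.
        h = ((h << 1) + GEAR[byte]) & MASK64
        if h >> (64 - avg_bits) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])       # whatever is left after the last cut
    return chunks

def chunk_ids(data: bytes) -> list[str]:
    """ID each chunk by the SHA-224 of its contents."""
    return [hashlib.sha224(c).hexdigest() for c in split(data)]
```

Insert a byte near the start of a file and only the chunk around the edit gets a new ID. Every cut point further along lands on the same content as before, so those chunks hash to the same IDs and never need to be stored again. Fixed-size blocks would shift by one byte and all change.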

Next, we need to organize lists of chunks back into their respective files, and files back into their directories. Let's represent each directory as a tree, where each node is a file made of chunks:

"PXL_20240804_202813830.jpg": {
  "chunks": [
    "oo98aq2o7ma75pmgmu6qc40jm8ds5blod7ne3ooendmqe",
    "73rqnbmg905r3sv77eqcpvgjodbsv6m8mon6kdobj8vfq"
  ],
  "metadata": {
    "type": "posix",
    "mode": 33188,
    "size": 1097373,
    "uid": 1000,
    "gid": 100,
    "atime": "2024-08-17T19:38:42.334637269Z",
    "mtime": "2024-08-06T01:40:45.36797951Z"
  }
}

A node can also be a subdirectory, whose ID is the SHA-224 of its serialized tree.

"Camera": {
  "tree": "cti2sslfl8i9j3kvvfqkv2bust1pd1oiks0n2nhkg6ecu",
  "metadata": {
    "type": "posix",
    "mode": 16877,
    "uid": 1000,
    "gid": 100,
    "atime": "2024-08-17T08:13:52.026693074Z",
    "mtime": "2024-08-16T07:35:05.949493629Z"
  }
}
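
Because a directory's ID is a hash over its serialized contents, and those contents include the IDs of its subdirectories, a change anywhere bubbles up to the root. The Python sketch below shows the idea. The serialization here (JSON with sorted keys) and the hex IDs are assumptions made so the example runs on its own; Backpak's real on-disk encoding may well differ.

```python
import hashlib
import json

def tree_id(tree: dict) -> str:
    # Assumption: canonical JSON stands in for the real serialization.
    serialized = json.dumps(tree, sort_keys=True, separators=(",", ":"))
    return hashlib.sha224(serialized.encode()).hexdigest()

def root_id(photo_chunks: list[str]) -> str:
    """Build a two-level tree (root -> Camera -> one photo) and ID it."""
    camera = {"PXL_1.jpg": {"chunks": photo_chunks}}
    root = {"Camera": {"tree": tree_id(camera)}}
    return tree_id(root)

# Same contents, same ID; change one chunk and the root ID changes too.
print(root_id(["chunk-a", "chunk-b"]) == root_id(["chunk-a", "chunk-b"]))  # True
print(root_id(["chunk-a", "chunk-b"]) == root_id(["chunk-a", "chunk-c"]))  # False
```

The flip side is that an untouched directory keeps its ID from one backup to the next, so its tree is deduplicated just like a chunk.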

Note that we save basic metadata (owners, permissions, etc.) but omit things we can't easily restore, or which depend on particular filesystems (inode numbers, change times, extended attributes, etc.). Backpak focuses on saving your files in a space-efficient format, not trying to make an exact image of a POSIX filesystem a la tar or rsync. Special files like dev nodes and sockets are skipped for this same reason.

Files

Packs

Saving each chunk and tree as a separate file would make the backup larger than its source material. Instead, let's group them into larger files, which we'll call packs. We aim for 100 MB per pack, though compression shenanigans can cause it to overshoot.¹

Each pack contains:

The magic bytes MKBAKPAK
The file version number (currently 1)
A Zstandard-compressed stream of either chunks or trees (which we'll collectively call blobs)
A manifest of what's in the pack, as (blob type, length, ID) tuples.
The manifest length, in bytes, as a 32-bit big-endian integer. This lets a reader quickly seek to the manifest.

Since a pack's manifest uniquely identifies all the blobs inside (and, for what it's worth, the order in which they're stored), the SHA-224 of the manifest is the pack's ID.
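
The trailing length is what makes the manifest cheap to find: read the last four bytes, then seek backwards by that amount. Here is a rough Python sketch of the layout. It treats the compressed stream and the manifest as opaque bytes, and it assumes the version is a single byte after the magic, which the description above doesn't actually specify.

```python
import io
import struct

MAGIC = b"MKBAKPAK"

def write_pack(body: bytes, manifest: bytes, version: int = 1) -> bytes:
    # Assumption: one version byte. The body and manifest are opaque here.
    footer = struct.pack(">I", len(manifest))   # 32-bit big-endian length
    return MAGIC + bytes([version]) + body + manifest + footer

def read_manifest(f) -> bytes:
    """Fetch the manifest without reading the (much larger) blob stream."""
    if f.read(len(MAGIC)) != MAGIC:
        raise ValueError("not a pack file")
    f.seek(-4, io.SEEK_END)                     # last 4 bytes: manifest length
    (length,) = struct.unpack(">I", f.read(4))
    f.seek(-4 - length, io.SEEK_END)            # jump straight to the manifest
    return f.read(length)

pack = write_pack(b"...compressed blobs...", b"manifest!")
print(read_manifest(io.BytesIO(pack)))          # b'manifest!'
```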

Indexes

Reading each pack every time to rediscover its contents would be a huge slowdown, and for cloud-stored repositories, a huge bandwidth suck. As we make a backup, let's build an index of all the packs we've built. Each index contains:

The magic bytes MKBAKIDX
The file version number (currently 1)
A Zstandard-compressed map of each pack's ID to its manifest

We can also use the index for resumable backups! As we finish each pack, we write a work-in-progress index to disk. If the backup is interrupted and restarted, we read the WIP index and resume from wherever the last pack left off.
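
Once the indexes are loaded, answering "do we already have this chunk?" is a dictionary lookup. A minimal Python sketch, assuming each index is a dict from pack ID to a list of (blob type, length, ID) tuples as described above; the function name and sample IDs are made up:

```python
def build_blob_map(indexes):
    """Merge indexes into one lookup table: blob ID -> pack that holds it."""
    blob_to_pack = {}
    for index in indexes:
        for pack_id, manifest in index.items():
            for _blob_type, _length, blob_id in manifest:
                # Keep the first pack seen if a blob appears more than once.
                blob_to_pack.setdefault(blob_id, pack_id)
    return blob_to_pack

indexes = [
    {"pack1": [("chunk", 1048576, "aaa"), ("chunk", 524288, "bbb")]},
    {"pack2": [("tree", 312, "ccc"), ("chunk", 524288, "bbb")]},
]
known = build_blob_map(indexes)
print("bbb" in known, "zzz" in known)   # True False
```

During a backup, any chunk whose ID is already in the table can be skipped, which is where the "Reused" number in the progress output would come from.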

Snapshots

After packing all our blobs and writing the index, the last step of a backup is to upload a snapshot. Each contains:

The magic bytes MKBAKSNP
The file version number (currently 1)
A CBOR file containing snapshot metadata (author, tags, and time), the absolute paths that the snapshot backed up, and the root tree of the backup.

We don't bother with compressing snapshots since they're so small.

Smaller chunks mean better deduplication, but more to keep track of. 1 MB was chosen as a hopefully-reasonable compromise: each gigabyte of chunks gives about 30 kilobytes of chunk IDs.
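
The 30 KB figure falls out of simple arithmetic, sketched below in Python. It counts raw digest bytes only and pretends every chunk is exactly 1 MiB; real chunks vary in size, and IDs stored as text take a bit more room.

```python
CHUNK_SIZE = 1024 * 1024      # ~1 MB target chunk size, from the text
ID_BYTES = 224 // 8           # a SHA-224 digest is 28 bytes

def id_overhead(data_bytes: int) -> int:
    """Bytes of chunk IDs needed to describe `data_bytes` of file data."""
    chunks = -(-data_bytes // CHUNK_SIZE)   # ceiling division
    return chunks * ID_BYTES

print(id_overhead(1024 ** 3))   # 28672 bytes, i.e. 28 KiB per GiB
```

Halving the chunk size would double that bookkeeping, which is the trade-off being described.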

It's hard to know how large a compressed stream will be without flushing it, and flushing often can hurt the overall compression ratio. Backpak tries not to do that, but this means it often overshoots packs' target size.
