This blog post is highly technical. I cover details about how our email storage system works and how it was impacted by a complex server corruption and failure. I explain why our normal procedures failed in this instance, the ways in which our system helped us to track down and restore almost all the impacted emails, and some improvements we are making for our system to be more resilient in future.
Background
To show how things fit together, I’m going to start with an overview of our infrastructure and internal terminology. Some details have been shared before, but the exact setup changes over time. If you want to get straight on with the story, skip to the heading “The failure” below.
A slot
Each mail server is a standalone 2U or 4U storage server. Storage on each server is split into multiple slots (currently 15, 16 or 40 per machine depending on hardware model).
All current slots are called “teraslots”, consisting of:
- A 1TB partition for long-term email storage.
- Space on a shared high-speed SSD for indexes, caches, recently delivered email, etc.
- Space on a separate partition for long-term search indexes.
- Space on a RAM filesystem for search indexes of recently delivered email.
All partitions are encrypted, either with LUKS or directly on the hardware if supported.
Every slot runs an entirely separate instance of the Cyrus IMAP server, complete with its own configuration files and internal IP address. Instances can be started and stopped independently. Configuration files are generated by a template system driven from a master configuration file, for consistency and ease of management.
A store
A store is the logical “mail server” on which a user’s email is stored. A store is made up of a number of slots, using Cyrus’ built-in replication to keep them synchronised. At any time, one slot is the master for a store, and it replicates changes to all the other slots.
A normal store currently consists of three slots, two in New York and one in Iceland. To move a slot between machines, it’s easiest to configure a new replica slot, wait until all data is replicated and the new slot is up to date, and then remove the unwanted replica slot. In the case where we’re moving things around, a store may consist of more slots – there’s no real limit to how many replicas you can run at once.
We spread the related slots of different stores such that no one machine has too many “pairs” to another machine. This means that if a single server fails, the load spreads to many machines, rather than doubling the load on one other box.
For each store, one slot is marked in our database as the “master slot”. In the past we used to bind an IP address to the master slot and use magic IP failover, but no more. Instead, all access is either via a perl library (which knows which slot is the master) or via the nginx frontend proxy, which selects a backend using the same perl library during login.
Replication
The cyrus replication system doesn’t record the actual changes to mailboxes: it just writes “something changed in mailbox X” to a log file (or in our multi-replica configuration, to a separate log file per replica).
A separate process, sync_client, runs on the master slot. It takes a working copy of the replication log files that other cyrus processes create. If the sync_client process fails or is interrupted, it always starts from the last log file it was processing and re-runs all the mailboxes. This means that a repeating failure stops all new replication for that replica, which becomes relevant later.
The sync_client process combines duplicate events, then connects to a sync_server on the other end and requests some information about each mailbox named in the log. It compares the state on the replica with the state on the master, and determines which commands need to be run to bring the replica into sync, including renaming mailboxes. Mailboxes have a unique identifier as well as their name, and that identifier persists through renames.
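To illustrate the "combines duplicate events" step, here is a minimal Python sketch. It is not the Cyrus code (which is written in C); the function name and the idea of passing mailbox names as a plain list are assumptions made for the example.

```python
def coalesce_events(mailbox_names):
    """Collapse repeated mailbox names, keeping first-seen order.

    Hypothetical helper: the input stands in for the mailbox names
    read from one replication log file.
    """
    seen = set()
    unique = []
    for name in mailbox_names:
        if name not in seen:
            seen.add(name)
            unique.append(name)
    return unique
```

However many times a mailbox appears in the log, it only needs to be compared with the replica once, because the log records that something changed rather than what changed.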
We have extensively tested this replication process. Over the years we have fixed many bugs and added several features, including the start of progress towards a true master-master replication system. We still run a task every week which compares the state of all folders on the master and replica, to ensure that replication is working correctly.
Failover
Our failover process is fairly seamless. If everything goes cleanly, the only thing users see is a disconnection of their IMAP/POP connection, followed by a slow reconnection. There’s no connection error and no visible downtime. Ideally it takes under 10 seconds. In practice it’s sometimes up to 30 seconds, because we give long-running commands 10 seconds to complete, and replication can take a moment to catch up as well.
Failover works like this (failover from slot A to slot B):
- Check the size of log files in replica-channel directories on slot A. If any are more than a couple of KB in size, abort. We know that applying a log file of a few KB usually only takes a few seconds. If they’re much bigger than that, replication has fallen behind and we should wait until it catches up, or work out what’s wrong. It’s possible to override this with a “FORCE” flag. We also check that there’s a valid target slot to move to, and a bunch of other sanity stuff.
- Mark the store as “moving” in the database.
- Wait 2 seconds. Every proxy re-reads the store status every second from a local status daemon which gets updates from the database, so within 1 second, all connection attempts to the store from our web UI or from an external IMAP/POP connection will be replaced by a sleep loop. The sleep loop just holds off responding until the moving flag gets cleared, and it can get a connection to the new backend.
- Shut down master slot A. At this point, all existing IMAP/POP connections are dropped. It will wait for up to 10 seconds to let them shut down cleanly before force closing the remaining processes.
- Inspect all channel directories for log files on slot A again. Run sync_client on each log file to ensure that they sync completely. If there are any failures, bail out by restarting with the master still on slot A.
- Shut down the target slot B.
- Change the database to label the new slot B as the master for this store.
- Start up slot A as a replica.
- Start up slot B as the master (this also marks the store as “up”).
Within a second, the proxies know that the store is available again and they continue the login process for waiting connections.
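The first check in that list can be sketched roughly as follows. This is an illustration only: the 4096-byte threshold is an assumed stand-in for "a couple of KB", the function and parameter names are hypothetical, and I have assumed that FORCE overrides the log-size check but not the need for a valid target slot.

```python
MAX_LOG_BYTES = 4096  # assumption: stand-in for "a couple of KB"


def can_fail_over(channel_log_sizes, has_valid_target, force=False):
    """Decide whether a failover may start.

    channel_log_sizes: dict of replica channel name -> pending log size in bytes.
    """
    if not has_valid_target:
        return False  # assumed: FORCE cannot bypass the missing-target check
    if force:
        return True  # operator accepts that replication may be behind
    # Abort if any replica channel has more than a small backlog.
    return all(size <= MAX_LOG_BYTES for size in channel_log_sizes.values())
```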
Unclean failover
In the case of a clean failover, all the log files are run, and the replica is 100% up-to-date.
If for some reason we need to do a forced failover (say, a machine has failed completely and we need to get new hardware in), then we can have what’s called a “split brain”, where changes were written to one machine, but have not been seen by the other machine, so it makes its own changes without knowledge of what has already happened.
Split-brain is a common problem in distributed systems. It’s particularly nasty with IMAP because there are many counters with very strict semantics. There’s the UID number, which must increase without gaps (at least in theory) and there’s the MODSEQ number, which similarly must increase without changes ever being made “in the past”, otherwise clients will be confused.
Recovering from split brain without breaking any of the guarantees to clients was a major goal of the rewrite of both the replication protocol and mailbox storage which I did in 2008-2009. These changes eventually led to Cyrus version 2.4.
Anti-corruption measures
We also want to be safe against random corruption. We have found that corruption occurs a couple of times per year across all our servers. The cause is hard to pin down, though hard drive or RAID controller issues seem most likely from what we’ve seen. We were also bitten hard in the past by a particularly nasty Linux kernel bug which wrote a few zeros into our metadata files during repack.
Since then we have added checksums everywhere: a separate crc32 for each record in a Cyrus index, cache or db, and a sha1 of each message file. Cyrus also sends a crc32 of the “sync state” along with every mailbox, allowing it to determine if the set of changes did actually create the same mailbox state during a sync. A sync_crc mismatch triggers a full comparison of all data in that mailbox, allowing sync_client to repair the error and resync the mailbox.
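For readers unfamiliar with these checksums, the sketch below shows the two primitives using Python's standard library. The function names are made up for illustration, and the record layout Cyrus actually checksums is not shown here.

```python
import hashlib
import zlib


def message_guid(message_bytes):
    """sha1 of the raw message file, as a hex string."""
    return hashlib.sha1(message_bytes).hexdigest()


def record_crc(record_bytes):
    """crc32 of one record, as an unsigned 32-bit integer."""
    return zlib.crc32(record_bytes) & 0xFFFFFFFF


def file_matches_guid(message_bytes, expected_guid):
    """True if the bytes on disk still hash to the stored value."""
    return message_guid(message_bytes) == expected_guid
```

A single flipped bit changes the sha1, so a stored value that no longer matches the file means that either the file or the stored value has been damaged.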
And now you know enough about our architecture to understand what happened!
The failure
On Thursday 27th February, at 4:30am Melbourne time, Rob N (the on-call admin for the night) was paged by our monitoring system because one of our servers (imap21) was not responding. You can read his initial blog post about the incident as well.
He initially thought it was just a crashed server – there are various reasons why complex systems crash, and it wasn’t obvious which one it was. Our normal procedure is to restart the server and restart all the IMAP slots, replicate any remaining changes off the machine, then fail over to replicas while the search indexes catch up.
Unfortunately, things were worse than that. It started up with significant corruption on the operating system partition, and one of the 15 IMAP partitions failed to mount. At this point he force-failed-over (meaning moving the IMAP server master to its New York replica without doing a full sync first) and went back to bed to get some sleep and deal with things in the morning.
Initial cleanup
I was away during the first couple of days of this, on school camp with my daughter in one of the few places in the world where you can’t even get phone reception, but I have notes from the others and log files to go on.
On Thursday morning, there were a number of support tickets from people who were missing a few days of recent email. A quick look at the logs showed that one of the 11 master slots (of the 15 on this server, two were empty and two were replicas) had not been replicating for a couple of days.
The cause of this was an edge case issue with our replication system. If there is a loop in mailbox renames (i.e. folder ‘A’ was renamed to ‘B’ at the same time that folder ‘B’ was renamed to ‘A’, or a more complex circle of names), then the replication system bails out and tells us it needs to be fixed.
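To make the "loop in mailbox renames" case concrete, here is a small sketch that finds such a loop in a set of pending renames. The dict representation and the function name are assumptions for the example; this is not how Cyrus itself detects the condition.

```python
def find_rename_loop(renames):
    """Return the names forming a rename loop, or None if there is none.

    renames: dict mapping old folder name -> new folder name.
    """
    for start in renames:
        path = [start]
        current = renames[start]
        # Follow the chain of renames until it ends or revisits a name.
        while current in renames and current not in path:
            path.append(current)
            current = renames[current]
        if current == start:
            return path
    return None
```

With `{"A": "B", "B": "A"}` this returns `["A", "B"]`: neither rename can be applied first without clobbering the other folder, which is why a human has to step in.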
Normally I would notice this, and the others knew about the possibility as well, but over the previous few weeks I had been moving slots around in Iceland as part of our efforts to split entirely from the Opera network while retaining full offsite redundancy. There had been a lot of cyrus noise in the notification channel, and nobody picked up that this noise pointed to a more significant issue.
As mentioned above, this single issue in the working log file means that no new events can be processed. Interestingly, changes to other mailboxes in the same file might get replicated, because it will re-visit them over and over, but other newly-changed mailboxes will not be noticed.
To make things worse, the failed channel was also the “new user server”. Every week a task runs, calculates which store is least loaded, and sets a database field to direct newly created users to that store. So there were brand new users, and some of them had used our import facility to bring large amounts of email in.
Re-replication
The usual fix in this sort of case is to re-run the log files, even though the slot is now a replica. Because we keep a sha1 of the content of each message, the replication system can detect different files at the different ends. It can then fix up mailboxes by re-injecting BOTH messages with new UIDs, higher than any yet used, and then expunging the original UID at both ends. This means that no matter which one a client saw first, it now gets a repaired view of the mailbox with both messages in it. I wrote a long justification for this logic, and why we do it, which you can read at https://lists.andrew.cmu.edu/pipermail/cyrus-devel/2010-July/001500.html.
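A simplified model of that repair, assuming each end's view of a mailbox is just a dict of UID to sha1. The function name, the choice to copy across messages that exist at only one end, and the ordering of the two re-injected messages are all assumptions of this sketch rather than Cyrus behaviour.

```python
def repair_split_brain(master, replica, last_uid):
    """Merge two diverged views of one mailbox.

    master, replica: dict of uid -> guid (sha1 hex) at each end.
    last_uid: highest UID ever used at either end.
    Returns (repaired view shared by both ends, new last_uid).
    """
    repaired = {}
    for uid in sorted(set(master) | set(replica)):
        guids = {g for g in (master.get(uid), replica.get(uid)) if g is not None}
        if len(guids) == 1:
            # Same message at both ends, or present at only one end.
            repaired[uid] = guids.pop()
        else:
            # Conflict: re-inject BOTH messages above every used UID.
            # The original UID is left out of the result (expunged).
            for guid in sorted(guids):
                last_uid += 1
                repaired[last_uid] = guid
    return repaired, last_uid
```

For example, if UID 2 is `bbb` on the master and `ccc` on the replica with `last_uid` 2, the result has UID 2 expunged and both messages present at UIDs 3 and 4.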
We also use the same method in reconstruct to repair from a damaged mailbox on disk, but I’m about to re-visit that based on the below.
So, for messages within an existing mailbox, almost all were restored by this, except where things were broken for other reasons.
Bogus reconstructs
Because sync wasn’t working, the team in Melbourne started running reconstructs to repair broken folders. In theory, this was a great idea – it’s what I would have recommended.
In practice though, the data on the machine was badly broken in subtle ways, and this just made things worse. Looking through the log files, I see numerous cases where reconstruct determined that the stored sha1 of the message was not the same as the sha1 of the file on disk, and so injected the message again with a new UID. This should never happen unless the file is actually corrupted. Looking at the files on disk now, their sha1 is actually the old value, not the new one that it calculated during the reconstruct.
So the machine either has a faulty CPU, faulty memory, dodgy RAID card, intermittently faulty disk… whatever it is, it’s pretty horrible! We will be able to test it more once we’re happy that all the salvageable data is copied off.
Due to the faulty hardware, the attempts to make things better actually made things worse!
The worst cases were where reconstruct decided that a folder didn’t exist, and wiped it from the mailboxes database. The replication engine then wiped the copy on the replica. Likewise, where mailboxes didn’t exist at the other end due to not being replicated yet (see above), they got wiped from the original copy. These are the cases where, for 49 users, we can see that some emails were lost entirely.
Thankfully, repeated checksum mismatches caused replication to bail out in many places, saving us from a lot worse pain.
Time to cut our losses
On the morning of Friday 28th when I got back into phone range, pulled out my laptop on the bus home and inspected the damage, sync was still bailing out everywhere.
First thing I said was “turn off all the replication and reconstructs – let’s stop making things worse”.
I had an empty machine (imap14) which I had just finished clearing out in New York (see above about all the noise – lots of reconfiguring to get everything into consistent teraslots). I configured new replica pairs on imap14 for all the slots that used to be paired with imap21. It took about 4 days to get everything fully replicated again (moving that much data around, including sanity checking and building new search databases, takes time), but we’re now fully replicated again, with no fewer than 3 slots for each store, one of them offsite.
I left imap21 up, but with nothing talking to it or trusting it any more.
Examining the mess
The first job was to determine the extent of the damage. Thankfully, the log partition on imap21 wasn’t damaged, so we had backup copies of all the log files from everywhere on our log server by the time I got home. These aren’t the ‘sync log’ files from above; they’re the general syslog used to monitor and audit the everyday operations of the system. We keep them around for a few months to help track down the history of problems.
The first step was to identify what had happened, and locate any messages that hadn’t been replicated. We syslog important actions within Cyrus, in a consistent format with the label ‘auditlog:’.
Here’s an example of a recent message appended to my mailbox and then moved to Trash:
2014-03-10T03:11:04.544356-04:00 imap20 sloti20t12/lmtp[23349]: auditlog: append sessionid=<sloti20t12-23349-1394435464-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.374972-04:00 imap20 sloti20t12/imap[22916]: auditlog: append sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong.Trash> uniqueid=<24bbf0f44475158a> uid=<1157077> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.390532-04:00 imap20 sloti20t12/imap[22916]: auditlog: expunge sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
(The sessionid field can be used to track the exact login time, IP address, etc – and for deliveries it can link through the logs to find the server which sent it to us originally).
It was appended to my INBOX via LMTP, then appended to my Trash folder and expunged from my Inbox. The ‘guid’ field is the sha1 of the underlying message.
Over on my New York replica slot:
2014-03-10T03:11:04.544356-04:00 imap20 sloti20t12/lmtp[23349]: auditlog: append sessionid=<sloti20t12-23349-1394435464-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.374972-04:00 imap20 sloti20t12/imap[22916]: auditlog: append sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong.Trash> uniqueid=<24bbf0f44475158a> uid=<1157077> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:31.390532-04:00 imap20 sloti20t12/imap[22916]: auditlog: expunge sessionid=<sloti20t12-22916-1394436442-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
And of course, the same in Iceland:
2014-03-10T03:11:06.127098-04:00 timap5 slotti5t02/syncserver[6838]: auditlog: append sessionid=<slotti5t02-6838-1394336626-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:33.575200-04:00 timap5 slotti5t02/syncserver[6838]: auditlog: expunge sessionid=<slotti5t02-6838-1394336626-1> mailbox=<user.brong> uniqueid=<6af857f64475158a> uid=<950415> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
2014-03-10T03:27:33.833070-04:00 timap5 slotti5t02/syncserver[6838]: auditlog: append sessionid=<slotti5t02-6838-1394336626-1> mailbox=<user.brong.Trash> uniqueid=<24bbf0f44475158a> uid=<1157077> guid=<d0744eaad56ecc0342dd22da2a2187ee16935ecd> cid=<d0744eaad56ecc03>
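Lines in this format are easy to process mechanically. As an illustration (this is not the tooling we used, and the function name is made up), a short Python parser for the `key=<value>` fields might look like this:

```python
import re

# Matches the action word and the run of key=<value> fields after "auditlog:".
AUDIT_RE = re.compile(r"auditlog: (?P<action>\w+) (?P<fields>.*)$")
FIELD_RE = re.compile(r"(\w+)=<([^>]*)>")


def parse_auditlog(line):
    """Return a dict of the auditlog fields plus 'action', or None."""
    match = AUDIT_RE.search(line)
    if match is None:
        return None  # not an auditlog line
    record = dict(FIELD_RE.findall(match.group("fields")))
    record["action"] = match.group("action")
    return record
```

Applied to the first line above, it yields a record whose `action` is `append`, whose `mailbox` is `user.brong` and whose `uid` is the string `950415`.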
Anyway, we found something else interesting in the log files. For nearly 24 hours before the hard crash, there had been random transient failures in the replication logs. It would attempt to replicate a change, get a checksum error, try to replicate the exact same change, and the second time it would work! The machine was obviously intermittently failing well before it finally crashed.
We also found that two of the other master slots had encountered replication errors before the hard crash. The errors had caused replication to repeatedly bail out and they were also falling behind – this was enough noise that anyone would have noticed something was wrong – but it started about 2am our time, and everyone was asleep.
Based on user tickets and the state of the machine before the crash, we knew that there was still email missing. It was time to try to identify it. We determined the following algorithm to find messages that needed to be recovered.
- If there was an ‘append’ on imap21 via something that was NOT sync_client or syncserver, but no matching append on the replica with the same guid, that’s bad.
- If there was an ‘expunge’ on the replica by a sync program that can’t be matched to an expunge by an imap or pop3 process, then that was a message which shouldn’t have been wiped (but see above about the re-UIDing process; if we find another append to the same folder with the same GUID in either case, then the lower-UID one isn’t interesting any more).
- If a folder was deleted by a sync tool without a matching intended delete, then we have a problem.
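The first rule can be sketched as a set difference. This assumes each log line has already been turned into a dict with the hypothetical keys `action`, `program`, `uniqueid` and `guid`. Matching on the folder's uniqueid as well as the guid is my simplification for the example.

```python
SYNC_PROGRAMS = {"sync_client", "syncserver"}


def unreplicated_appends(source_records, replica_records):
    """Appends made by real clients on the source with no match on the replica."""
    replicated = {
        (r["uniqueid"], r["guid"])
        for r in replica_records
        if r["action"] == "append"
    }
    return [
        r for r in source_records
        if r["action"] == "append"
        and r["program"] not in SYNC_PROGRAMS  # ignore replication traffic
        and (r["uniqueid"], r["guid"]) not in replicated
    ]
```

The second and third rules follow the same pattern, comparing expunges and folder deletes by sync programs against matching actions by imap or pop3 processes.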
Replication – the bad bits
I discovered, to my horror, that when replicating an expunge event from the master, the sync protocol ignored the sha1 of the message. It should either have checked that it was the same message file that was expunged on the replica as well, or just bailed out. This meant it would expunge a never-seen message from the replica if the same UID had been seen and then deleted on the master. Ouch.
This is also a flaw with folder deletes. When a deleted folder is replicated, it just removes the folder and message files from the disk on the replica. That’s wrong and should never be allowed. To ensure we can always recover accidentally deleted mail, any deleted folders are supposed to be moved to a special DELETED namespace. They’re only permanently deleted a week later by a cleanup process.
The end result of all this: some new folders created after the failover from imap21 were irreversibly deleted when we tried to sync over the missing messages from imap21.
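The missing expunge check described above amounts to something like the following sketch. The dict model of a mailbox, the exception class and the function name are all hypothetical, chosen to show the idea rather than the real protocol code.

```python
class GuidMismatch(Exception):
    """The replica holds a different message at this UID."""


def apply_expunge(replica, uid, master_guid):
    """Expunge uid on the replica only if it is the same message.

    replica: dict of uid -> guid (sha1 hex), modified in place.
    Returns True if a message was expunged, False if the UID was absent.
    """
    local_guid = replica.get(uid)
    if local_guid is None:
        return False  # nothing to expunge
    if local_guid != master_guid:
        # Bail out rather than destroy a message the master never saw.
        raise GuidMismatch(f"uid {uid}: {local_guid} != {master_guid}")
    del replica[uid]
    return True
```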
Recovery of messages
In either case (an append that never synced, or an expunge that wasn’t justifiable), the fix is the same. Find a file with the same sha1 in any of that user’s mailboxes, or their backup, or on disk on imap21, and that’s your message. We did that for nearly half a million emails in total, over half of those belonging to 3 users who had just imported a large amount of email from external services.
After that, we were left trying to recover a few folders which got wiped. The fix in that case is to recover from backup (hopefully most of the messages) and then search for any appends since the highest UID in the backup, and append those as well (unless a justifiable expunge can be found for them).
The great thing about the log file containing a GUID which is the sha1 is that you can determine with cryptographic certainty that you’ve found the right file. We pulled files from imap21 (no need to care about corruption, since the sha1 will catch it), from the replicas, and even from the user’s backups to look for matching messages.
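In code, the search is little more than hashing every candidate and comparing. A minimal sketch, with a made-up function name and with candidates passed in as in-memory bytes for simplicity:

```python
import hashlib


def find_message(guid, candidates):
    """Return (source, data) for the first candidate whose sha1 equals guid.

    candidates: iterable of (source_label, message_bytes) pairs, e.g.
    copies from the faulty machine, the replicas and the backups.
    """
    for source, data in candidates:
        if hashlib.sha1(data).hexdigest() == guid:
            return source, data
    return None  # no intact copy found anywhere
```

A corrupted copy simply fails the comparison, so it is safe to include files from a machine you no longer trust.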
Some statistics
- 370 affected users
- 458471 found messages
- 29837 lost messages (no sha1 file able to be found)
- 32 lost folders (deleted, nothing in backup)
- 49 users with SOMETHING lost
We have emailed all affected users with details of exactly what was lost and recovered for their account.
Making sure this does not happen again
We take our uptime and reliability very seriously. Losing anything is a major failure for us, and we are determined to take steps to ensure this situation can not happen again.
Operational changes
- We now page the on-duty admin if replication to all replicas falls more than 5 minutes behind.
- We alert (non-emergency) if ANY replica falls more than 5 minutes behind. This alert is made to look different from the Cyrus notices that can happen during regular maintenance, so everybody will be aware that there is an issue.
- We will no longer attempt to reconstruct or keep replication going if we suspect a faulty server (at least until replication is super-safe). We’ll be much quicker to just declare the machine faulty and recover messages independently.
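The two thresholds in the first two points could be expressed like this. It is a sketch only: the function name, the return values and the way lag is supplied are assumptions, not our monitoring code.

```python
LAG_LIMIT_SECONDS = 5 * 60  # "more than 5 minutes behind"


def alert_level(replica_lag_seconds):
    """Map per-replica lag to 'page', 'alert' or 'ok'.

    replica_lag_seconds: dict of replica name -> seconds behind the master.
    """
    lags = list(replica_lag_seconds.values())
    if lags and all(lag > LAG_LIMIT_SECONDS for lag in lags):
        return "page"   # every replica is behind: wake somebody up
    if any(lag > LAG_LIMIT_SECONDS for lag in lags):
        return "alert"  # at least one replica is behind: non-emergency
    return "ok"
```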
Software changes
- The immediate fix is to do the GUID test even if the message was expunged on the master. This avoids the nasty case of expunging an unseen message during split brain.
- The larger fix is two-pronged. First, make reconstruct a lot less likely to damage existing mailboxes in case of corruption. There’s no point re-injecting a corrupt message file: nobody wants a file full of random blocks off disk. If the sha1 doesn’t match, then abort and get the admin to fix the permission problems or check the disk first.
- The second prong is changes to the replication system, so that it’s impossible to make the replica delete message files immediately via the replication protocol. The most you will be able to do is make it move them aside for later removal on the scheduled rotation (we run on Saturdays and delete things that have been marked for deletion for at least one week at that point – so between 1 and 2 weeks).
- Finally, we want to integrate the ability to fetch files from a replica to repair a corrupt local copy. We already have a separate magic perl script that can do this, but it runs outside Cyrus, with its own ugly locking tricks to force Cyrus to accept what it’s doing. It would be much neater to have this integrated into reconstruct, so that it’s replication-aware.
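The "between 1 and 2 weeks" window for delayed deletion falls out of simple date arithmetic. The sketch below works at whole-day granularity and uses a made-up function name, both of which are simplifications for the example.

```python
from datetime import date, timedelta


def purge_date(marked_on):
    """First Saturday on which an item marked on `marked_on` is a week old."""
    earliest = marked_on + timedelta(days=7)
    # date.weekday(): Monday is 0, so Saturday is 5.
    days_until_saturday = (5 - earliest.weekday()) % 7
    return earliest + timedelta(days=days_until_saturday)
```

Something marked for deletion on a Saturday is removed exactly 7 days later, while something marked on a Sunday waits 13 days.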
Finally
This has been a long saga (2 weeks from failure to restoring everything we could), and a learning experience for everyone involved. Our detailed logging and checksums meant we were able to recover the vast majority of messages affected by the corruption, but we are obviously unhappy that we lost anything, and we will be taking the steps outlined above to prevent this issue happening again.
We apologise to the 370 users who have been without some email for over a week while we took our time making sure we fully understood what had happened and what needed to be recovered. We’re especially sorry to the small number of people for whom we lost emails irretrievably. We have contacted these people individually.
We’re proud of our reliability track record at FastMail and we respect the trust that people place in us to store their email. We are working hard to restore and maintain that trust, both by being open about exactly what happened with this incident and by updating the system and processes we have in place to improve our resilience in the future.