Recently we stumbled upon a very strange bug that took us several hours to find. We were losing data in one of the MySQL tables. Let’s put some background on the whole setup.
The Project
We have two MySQL tables: “actions” and “actions_archive”. The first one is holding some temporary data which is constantly pushed to, updated and removed from this table. The second one is a mirror of the first one. The difference is that we don’t remove anything from the archive table. It’s an archive 🙂 I know what you’re thinking – this could be designed differently like using soft deletes. Yeah, I know. But when the system was being built we didn’t know we’ll need the archive. It was introduced much later and for legacy purposes we left the initial “actions” table intact. Let’s just ignore the design part and focus on the architecture.
The Problem
One beautiful day (or maybe it was raining, I can’t remember), while browsing the archive table’s contents, we noticed that we have had duplicate entries. After looking into it we found out that several identical records had different primary keys. This is something that should have never happened! This led us to the conclusion that the system created a new entry in the archive table with an ID of the original record, which was already in the archive table. Ok, this is somewhat complicated, so let me clear this up. Let’s say the proper “actions_archive” table looks like this:
|---------------------------------------------------|
| id_actions_archive | id_action | some_data |
|---------------------------------------------------|
| 100 | 150 | foo |
| 101 | 151 | bar |
| 102 | 152 | baz |
|---------------------------------------------------|
Now let’s take a look on a corrupted version of this table:
|---------------------------------------------------|
| id_actions_archive | id_action | some_data |
|---------------------------------------------------|
| 100 | 150 | foo |
| 101 | 151 | bar |
(...)
| 332 | 151 | bar |
(...)
| 637 | 151 | bar |
| 638 | 152 | baz |
|---------------------------------------------------|
As you can see in the latter version of the table, we have duplicated entries. It’s even worse because we probably lost some data. That’s because we also update records in the archive table when we update the data in the original “actions” table. We use the “id_action” key for the update, so when we have several entries bound to that ID, all of them get updated (actually – overwritten). That’s not good.
The Bug Search
We started searching for the problem and found nothing. It was obviously very frustrating. All the inserts and updates were bulletproof so they couldn’t corrupt the data. We even created a logging mechanism that would record the moment when the duplicates are being created. But that did not help us diagnose the problem.
The Discovery
Finally, after a lot of brainstorming, searching, reading and banging the wall with our heads we found the bug. It was not the project – it was MySQL.
The problem could be fixed by removing data from the “actions” table. After removing for example last 10 records from the table and restarting MySQL, new records received ID’s that were already assigned before.
So, let’s say the last “id_action” was 500, then we deleted 100 records, restarted MySQL and inserted a new record to the table. The newly added record’s ID turned out to be 401.
“Wait, what?!” Yeah, that was exactly what we said. It was totally unexpected and weird.
I asked myself, “What’s going on?’ ‘What’s the problem?”
The thing is that after restarting MySQL it determines the auto_increment value for each table using this algorithm (it regards only InnoDB tables):
SELECT MAX(ai_col) FROM t FOR UPDATE;
That’s very bizarre behaviour, at least for me. Such an important thing as an auto_increment value should be saved somewhere. The irony here is that we’re talking about MySQL (DB, duh!) and it keeps the auto_increment value in the memory… When MySQL is restarted it takes the last entry from a table and sets the auto_increment value equal to the ID of the record found + 1.
The documentation actually states that this is a “feature” but in my opinion, this is a major bug. You can read more about it in MySQL Reference Manual. Knowing this we quickly found a solution to the problem but no one will ever know how much tears we shed.
The Conclusions
You can stumble upon bugs that make you want to jump out through a window. Don’t give up and always try to find a solution. If you don’t know the bug’s origin try to do all you can to track it down. Walk around the bug, try to nail it another way. We used a logging mechanism and even if it didn’t help us directly, it eliminated some possible holes.
Brainstorms really help. Even the most irrelevant (and potentially dumb) ideas may be of help. Listen to every opinion and talk it through with the team. Sometimes one thing leads to another and suddenly you find the solution.
Don’t fully rely on third-party software. PHP, MySQL and others are also full of bugs. Even when the dev team says it’s a feature.