July 28, 2011: 09:37PM
Why
I build large websites. We don't use transactions too often. One subsystem I work on, MogileFS, uses transactions in one spot, and that trivial use causes occasional deadlocks. A while back I put in two quick workarounds; today I revisited those changes and found some interesting transactional behavior in the process.
What.
MogileFS is a distributed file manager. I would say it is consistent, persistent, and eventually redundant (asynchronous replicated copies). That asynchronicity requires a queue system, so I wrote a lightweight internal one on top of MySQL, which MogileFS already uses to store file locations.
How.
Sadly I suck at info graphics, so in bullet points:
- Client uploads a new file, a replication request is INSERT'ed into the queue.
- Each tracker machine (the file managers) has one process (JobMaster), which pulls batches of rows to work on.
- N rows are pulled from the DB, scheduled for future retry, and then distributed internally. This is the transaction.
- 'replication workers' then work on the rows, and DELETE completed rows.
This lets each tracker work on rows that no other tracker is already working on, while efficiently pulling in large batches of rows and distributing them to the local workers.
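For reference, a rough sketch of the queue table involved. The table and index names match the lock dump further down; the exact column definitions are an assumption and may differ from the real MogileFS schema.
pre. -- Sketch of the replication queue table (column details are assumptions).
CREATE TABLE file_to_replicate (
    fid        INT UNSIGNED NOT NULL PRIMARY KEY,
    nexttry    INT UNSIGNED NOT NULL,   -- epoch seconds: when to (re)try this row
    fromdevid  INT UNSIGNED,
    failcount  TINYINT UNSIGNED NOT NULL DEFAULT 0,
    flags      SMALLINT UNSIGNED NOT NULL DEFAULT 0,
    INDEX (nexttry)
) ENGINE=InnoDB;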
The transaction is simple: BEGIN; SELECT (rows to work on) ORDER BY nexttry FOR UPDATE; UPDATE nexttry = now() + 600; COMMIT;
Everything else related to these tables is INSERTs and DELETEs. There is no processing between the SELECT, UPDATE, and COMMIT; they run back to back. It's completely fine even if two trackers get the same work, it's just faster if they don't.
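Spelled out, the transaction looks roughly like this. A minimal sketch: the ORDER BY and the 600-second reschedule come straight from the description above, while the WHERE clause and LIMIT are assumptions.
pre. BEGIN;
-- Grab a batch of due rows, oldest first, and lock them.
SELECT fid FROM file_to_replicate
 WHERE nexttry <= UNIX_TIMESTAMP()
 ORDER BY nexttry
 LIMIT 1000
 FOR UPDATE;
-- Push the claimed rows into the future so other trackers skip them.
UPDATE file_to_replicate SET nexttry = UNIX_TIMESTAMP() + 600
 WHERE fid IN (1, 2, 3);  -- the fids returned by the SELECT
COMMIT;
-- Workers later DELETE completed rows, outside the transaction:
DELETE FROM file_to_replicate WHERE fid = 1;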
Original Fix.
Unfortunately the original system had a bad case of deadlocks. For new file uploads we set nexttry to "0" (immediate), and for fixing a file we set it to NOW (immediate, but less so). I made two changes (sketched below):
- New files get scheduled for UNIX_TIMESTAMP(), so they are more distant in the index.
- Internally retry on deadlocks during the SELECT routine.
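In SQL terms the scheduling change amounts to the following; illustrative only, with the value list mirroring the INSERT that shows up in the lock dump later. The deadlock retry lives in the tracker code, not in SQL.
pre. -- Before: new uploads were immediately eligible, sorting at the very start of the nexttry index.
INSERT INTO file_to_replicate VALUES (1401, 0, NULL, 0, 0);
-- After: schedule at the current epoch second, more distant in the index.
INSERT INTO file_to_replicate VALUES (1401, UNIX_TIMESTAMP(), NULL, 0, 0);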
The rabbit hole.
So now I've decided to revisit this, as it's not ideal to schedule new uploads at UNIX_TIMESTAMP(). This is where everything goes weird.
I wrote this handy test script to emulate the usage pattern MogileFS uses and turn it into a load test. This was wildly successful, as I immediately saw failures.
MySQL 5.0
I started with 5.0, since my local test instance is somehow still 5.0.83... and to be honest a lot of MogileFS users are probably still stuck on 5.0, despite it being EOL'd.
With 5.0, setting nexttry to UNIX_TIMESTAMP() created a handful of deadlocks, while anything else created fuckloads of deadlocks.
'./transactiontest.pl 0 1 200 6 0 300': sleep 0 seconds between SELECTs, sleep 1 second between injects, inject 200 rows at once, fork 6 SELECTors, run test mode zero, for 300 seconds. Each mode changes only the nexttry value being inserted (see the sketch after this list):
- mode 0: nexttry of 0
- mode 1: nexttry set to epoch seconds manually
- mode 2: nexttry set to UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 MONTH))
- mode 3: nexttry set to UNIX_TIMESTAMP('20110101')
- mode 4: nexttry set to UNIX_TIMESTAMP() (the original workaround)
- mode 5: nexttry set to UNIX_TIMESTAMP() - 86400 * 60
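Concretely, each mode boils down to a different nexttry expression in the injector's INSERT. Sketch only: the non-nexttry values are placeholders, and the mode 1 literal is just an example epoch computed by the script.
pre. INSERT INTO file_to_replicate VALUES (?, 0, NULL, 0, 0);                          -- mode 0
INSERT INTO file_to_replicate VALUES (?, 1311900000, NULL, 0, 0);                 -- mode 1: literal epoch from the script (example value)
INSERT INTO file_to_replicate VALUES                                              -- mode 2
    (?, UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 MONTH)), NULL, 0, 0);
INSERT INTO file_to_replicate VALUES (?, UNIX_TIMESTAMP('20110101'), NULL, 0, 0); -- mode 3
INSERT INTO file_to_replicate VALUES (?, UNIX_TIMESTAMP(), NULL, 0, 0);           -- mode 4
INSERT INTO file_to_replicate VALUES (?, UNIX_TIMESTAMP() - 86400 * 60, NULL, 0, 0);  -- mode 5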
Results for 5.0
Mode 4 resulted in 650 deadlocks over 300 seconds. This was the "normal" case, and in all cases the deadlock was between the SELECT call and a DELETE call, with the SELECT call being rolled back every time. Very rarely (once) an INSERT would die.
All other modes resulted in 7,000+ deadlocks, always in the SELECT, against the DELETE. It's possible this wasn't always the case, but every time I checked the deadlock history it was between a SELECT and a DELETE. However, changing the way the INSERT worked profoundly changed the deadlock rate. UNIX_TIMESTAMP() on its own is clearly magical. This is most of my puzzle. Do you know why? I'd love to know why.
MySQL 5.1
Being a good OSS engineer, I also tested at least one other version (heh), and went with MySQL 5.1. I've not managed to deploy 5.5 in production yet, so this sounded like a good place to start.
The tests and script were all the same, except I lowered the test time to 60 seconds. The results were very similar across runs. I also varied the sleep time between injection rounds from 0 to 1 second; it made no difference.
Results for 5.1 (5.1.58)
The results are completely different! Failures were very rare, usually 6 or so deadlocks per run. Occasionally a burst of SELECT failures would happen, but that averaged out over multiple runs.
The big change: the deadlock is now primarily against the DELETE commands! The same deadlock happens, SELECT vs DELETE, except in 5.1 MySQL rolls back the DELETE 90% of the time. This time all modes came out roughly the same. There's no more magic UNIX_TIMESTAMP().
The glitch in 5.1
It seems one of the reasons so few deadlocks happen in 5.1 is that I managed to tickle a bug. On repeated runs, after about the same amount of time each run, MySQL would hang one of the INSERTs. It would hang for much longer than innodb_lock_wait_timeout. MySQL doesn't seem to think the transaction is waiting on a lock at all, but it is! Across many refreshes of the InnoDB status, the transaction is reported as waiting for 0 seconds on the same lock.
pre. ---TRANSACTION 0 2953910, ACTIVE 32 sec, process no 9261, OS thread id 140008481404672 inserting
mysql tables in use 1, locked 1
LOCK WAIT 910 lock struct(s), heap size 227312, 909 row lock(s), undo log entries 1
MySQL thread id 2, query id 32148 192.168.69.154 mfs update
INSERT INTO file_to_replicate VALUES (1401, UNIX_TIMESTAMP() - 86400 * 60, NULL, 0, 0)
------- TRX HAS BEEN WAITING 0 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2 page no 4 n bits 1272 index `nexttry` of table `mfs`.`file_to_replicate` trx id 0 2953910 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 602 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 4; hex 4e321be5; asc N2  ;; 1: len 4; hex 0000006e; asc    n;;
The transaction hangs for a bit over 90 seconds (I'm not sure at all which timeout it's hitting), then dies. It'll go a few thousand more rounds before hanging again. The SELECTs fly by over and over during this time, and their cycling should let the INSERT complete instantly. This is running on an SSD with InnoDB tuned, and the system is doing no IO. What gives?
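For anyone who wants to poke at this themselves, here is one way to watch the hang from the mysql client. These are standard commands; the INNODB_LOCK_WAITS view is an assumption about your setup, since it only exists on 5.1 when the InnoDB plugin is enabled.
pre. -- Confirm the timeout the stuck INSERT should be hitting.
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
-- The TRANSACTIONS section is where the "waiting 0 sec" output above comes from.
SHOW ENGINE INNODB STATUS\G
-- With the 5.1 InnoDB plugin, the lock wait graph is also queryable:
SELECT * FROM information_schema.INNODB_LOCK_WAITS;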
What now?
I'm looking for the simplest possible solution that gives full control over scheduling when events should run, without causing any deadlocks. I have several perfectly good workarounds, so I'm not going to complicate the process just to improve it. That means I'm unlikely to throw in Gearman (no batching), or ZeroMQ (big dependency), or write a hugely more complicated transaction, or use a much larger table. This works, but I would love to know why MySQL is so bizarre, and whether I'm missing something obvious.