Skip to main content

SQLite concurrency and why you should care about it

· 8 min read

SQLite is a powerful database engine, but due to its design, it has limitations that should not be overlooked.

Jellyfin has used a SQLite-based database for storing most of its data for years, but it has also encountered issues on many systems. In this blog post, I will explain how we address these limitations and how developers using SQLite can apply the same solutions.

This will be a technical blog post intended for developers and everyone wanting to learn about concurrency.

Also Jellyfin's implementation of locking for SQLite should be fairly easy to be implemented into another EF Core application if you are facing the same issue.

- JPVenson

The Premise

SQLite is a file-based database engine running within your application and allows you to store data in a relational structure. Overall it gives your application the means of storing structured data as a single file and without having to depend on another application to do so. Naturally this also comes at a price. If your application fully manages this file, the assumption must be made that your application is the sole owner of this file, and nobody else will tinker with it while you are writing data to it.

So an application that wants to use SQLite as its database needs to be the only one accessing it. Having established this fact, an important thought arises: if only a single write operation should be performed on a single file at a time, this rule must also apply to operations within the same application.

The W-A-L mode

SQLite has a feature that tries to get around this limitation: the Write-Ahead-Log (WAL). The WAL is a separate file that acts as a journal of operations that should be applied to an SQLite file. This allows multiple parallel writes to take place and get enqueued into the WAL. When another part of the application wants to read data, it reads from the actual database, then scans the WAL for modifications and applies them on the fly. This is not a foolproof solution; there are still scenarios where WAL does not prevent locking conflicts.

SQLite transactions

A transaction is supposed to ensure two things. Modifications made within a transaction can be reverted, either when something goes wrong or when the application decides it should and optionally a transaction may also block other readers from reading data that is modified within a transaction. This is where it gets spicy and we come to the real reason why I am writing this blog post. For some reason on some systems that run Jellyfin when a transaction takes place the SQLite engine reports the database is locked and instead of waiting for the transaction to be resolved the engine refuses to wait and just crashes. This seems to be a not uncommon issue and there are many reports to be found on the issue.

The factor that makes this issue so bad is that it does not happen reliably. So far we only have one team member where this can be (somewhat) reliably be reproduced which makes this an even worse a bug. From the reports this issue happens across all operating systems, drive speeds and with or without virtualization. So we do not have any deciding factor identified that even contributes to the likelihood of the issue happening.

The Jellyfin factor

Having established the general theory on how SQLite behaves, we also have to look at the specifics of Jellyfins usage of SQLite. During normal operations on a recommended setup (Non-Networked Storage and preferably SSD) its unusual for any problems to arise, however the way Jellyfin utilises the SQLite db up to 10.11 is very suboptimal. In versions prior to 10.11 Jellyfin had a bug in its parallel task limit which resulted in exponential overscheduling of library scan operations which hammered the database engine with thousands of parallel write requests that an SQLite engine is simply not able to handle. While most SQLite engine implementations have retry behavior, they also have timeouts and checks in place to prevent limitless waiting so if we stress the engine enough, it just fails with an error. That and very long running and frankly unoptimized transactions could lead to the database just being overloaded with requests and flaking out.

The solution

Since we moved the codebase over to EF Core proper, we have the tools to actually do something about this as EF Core gives us a structured abstraction level. EF Core supports a way of hooking into every command execution or transaction by creating (Interceptors)[https://learn.microsoft.com/en-us/ef/core/logging-events-diagnostics/interceptors]. With an interceptor we can finally do the straight forward idea of just "not" writing to the database in parallel in a transparent way to the caller. The overall idea is to have multiple strategies of locking. Because all levels of synchronization will inevitably come at the cost of performance, we only want to do it when it is really necessary. So, I decided on three locking strategies:

  1. No-Lock
  2. Optimistic locking
  3. Pessimistic locking

As a default, the no-lock behavior does exactly what the name implies. Nothing. This is the default because my research shows that for 99% all of this is not an issue and every interaction at this level will slow down the whole application.

Both the optimistic and pessimistic behaviors use two interceptors—one for transactions and one for commands—and override SaveChanges in JellyfinDbContext.

Optimistic locking behavior

Optimistic locking means to assume the operation in question will succeed and only handle issues afterwards. In essence this can be boiled down to "Try and Retry and Retry ..." for a set number of times until either we succeed with the operation or fail entirely. This still leaves the possibility that we will not actually be able to perform a write, but the introduced overhead is far less than the Pessimistic locking behavior.

The idea behind how this works is simple: every time two operations try to write to the database, one will always win. The other will fail, wait some time, then retry a few times.

Jellyfin uses the Polly library perform the retry behavior and will only retry operations it will find have been locked due to this exact issue.

Pessimistic locking behavior

Pessimistic locking always locks when a write to SQLite should be performed. Essentially every time an transaction is started or a write operation on the database is done though EF Core, Jellyfin will wait until all other read operations are finished and then block all other operations may they be read or write until the write in question has been performed. This however means, that Jellyfin can only ever perform a single write to the database, even if it would technically does not need to.

In theory, an application should have no issue reading from table "Alice" while writing to table "Bob" however to eliminate all possible sources of concurrency related locking, Jellyfin will only ever allow a single write performed on its database in this mode. While this will absolutely result in the most stable operation, it will undoubtedly also be the slowest.

Jellyfin uses a ReaderWriterLockSlim to lock the operations, that means we allow an unlimited number of reads to happen concurrently while only one write may ever be done on the database.

The future Smart locking behavior

In the future we might also consider combining both modes, to get the best of both worlds.

The result

Initial testing showed that with both modes, we had great success in handling the underlying issue. While we are not yet sure why this happens only on some systems when others work, we at least now have an option for users previously left out of using Jellyfin.

When I was researching this topic, I found many reports all over the internet facing the same error but nobody was able to provide a conclusive explanation whats really happening here. There have been similar proposals made to handle it but there wasn't a "ready to drop in" solution that handles all the different cases or only code that required massive modifications to every EF Core query. Jellyfin's implementation of the locking behaviors should be a copy-paste solution for everyone having the same issues as its using interceptors and the caller has no idea of the actual locking behavior.

Best of luck,

- JPVenson