May 01 2015

Why does Wikipedia use MySQL as data store rather than a NoSQL database?

Category: Uncategorized | FractalizeR @ 5:48 pm

Answer by Domas Mituzas:

Billions of views hit the cache layer, not the application: the application generates only about 50,000 queries a second.

Now, there are two kinds of databases serving the wikis: 'core', which has all the metadata, and text storage.

There are six shards for the core database, split by language (so requests don't really have to hit multiple shards). They're all under 500GB, and the active dataset for each shard fits into 64GB systems (for some, into 32GB) and doesn't thrash the disks much.
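As a rough illustration of why splitting by language keeps a request on one shard, here is a minimal Python sketch. The shard names, the wiki-to-shard assignments, and the functions `shard_for` and `plan_request` are all hypothetical; this is not Wikimedia's actual configuration.

```python
# Hypothetical mapping of wiki database names to core shards.
# Names and assignments are illustrative only.
SHARD_BY_WIKI = {
    "enwiki": "shard1",
    "dewiki": "shard2",
    "frwiki": "shard3",
    "jawiki": "shard3",
}
DEFAULT_SHARD = "shard6"  # assumed catch-all shard for smaller wikis


def shard_for(wiki: str) -> str:
    """Return the single shard assumed to hold all core data for one wiki."""
    return SHARD_BY_WIKI.get(wiki, DEFAULT_SHARD)


def plan_request(wiki: str, queries: list[str]) -> dict[str, list[str]]:
    """Group the queries of one page request by the shard they must run on."""
    plan: dict[str, list[str]] = {}
    for query in queries:
        plan.setdefault(shard_for(wiki), []).append(query)
    return plan


plan = plan_request("dewiki", ["load page row", "load links", "load watchers"])
print(plan)
```

Because the shard is chosen from the wiki alone, every query issued while rendering a page on one wiki lands in the same group, so no cross-shard joins or fan-out are needed.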

Text storage is a bit of a different beast: essentially it is append-only chunks of data that later go through a differential-concatenated-compression batch process and end up being part of larger append-only chunks. 🙂 Technically this is the only place where we do key:value data storage and that could benefit from NoSQL solutions, but we don't touch it much and it doesn't take too much attention (well, neither do the other databases, TBH).
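The append-only, batch-recompressed layout described above can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions: the class `AppendOnlyTextStore` and its methods are hypothetical, it uses plain `zlib` over concatenated revisions, and it leaves out the differential step entirely.

```python
import json
import zlib


class AppendOnlyTextStore:
    """Toy key:value text store: blobs are only appended, never rewritten."""

    def __init__(self) -> None:
        self.blobs: list[bytes] = []                 # append-only chunks
        self.index: dict[int, tuple[int, int]] = {}  # key -> (blob number, position)

    def _append_blob(self, texts: list[str]) -> int:
        payload = json.dumps(texts).encode("utf-8")
        self.blobs.append(zlib.compress(payload))
        return len(self.blobs) - 1

    def put(self, text: str) -> int:
        """Store one text as its own small compressed chunk."""
        key = len(self.index)
        self.index[key] = (self._append_blob([text]), 0)
        return key

    def get(self, key: int) -> str:
        blob_no, position = self.index[key]
        return json.loads(zlib.decompress(self.blobs[blob_no]))[position]

    def recompress(self, keys: list[int]) -> int:
        """Batch step: pack several texts into one larger chunk and repoint keys."""
        texts = [self.get(key) for key in keys]
        blob_no = self._append_blob(texts)
        for position, key in enumerate(keys):
            self.index[key] = (blob_no, position)
        return blob_no


store = AppendOnlyTextStore()
base = " ".join(f"Sentence {n} of the article." for n in range(40))
keys = [store.put(f"{base} Edit number {i}.") for i in range(10)]
size_before = sum(len(blob) for blob in store.blobs)
merged = store.recompress(keys)
size_after = len(store.blobs[merged])
print(size_before, size_after)
```

Successive revisions of a page are mostly identical, so compressing them together lets the compressor reuse the earlier text, which is why the merged chunk comes out smaller than the individual ones combined.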

Now, the 'economic' part is different: MediaWiki relies a lot on transactions, consistent read snapshots and whatnot (and on being able to recover after a crash too 🙂), and it also has a few heavyweight queries.

MySQL is remarkably fast at these kinds of operations, and if you look at per-node performance, there are no NoSQL solutions that match it while also providing, e.g., transactions and underlying storage that isn't naive.
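To make the reliance on transactions concrete, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for MySQL. The two-table schema and the `save_edit` function are hypothetical simplifications, not MediaWiki's real schema or code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        title TEXT UNIQUE,
        latest_rev INTEGER
    );
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        page_id INTEGER NOT NULL,
        text_key TEXT NOT NULL
    );
    INSERT INTO page (page_id, title, latest_rev) VALUES (1, 'MySQL', NULL);
""")


def save_edit(conn: sqlite3.Connection, page_id: int, text_key: str) -> int:
    """Insert a revision and repoint the page at it, atomically."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "INSERT INTO revision (page_id, text_key) VALUES (?, ?)",
            (page_id, text_key),
        )
        rev_id = cur.lastrowid
        updated = conn.execute(
            "UPDATE page SET latest_rev = ? WHERE page_id = ?",
            (rev_id, page_id),
        )
        if updated.rowcount != 1:
            raise LookupError(f"no such page: {page_id}")
        return rev_id


first_rev = save_edit(conn, 1, "text:1")
try:
    save_edit(conn, 99, "text:2")  # unknown page: the whole edit is rolled back
except LookupError:
    pass
```

The failed edit leaves no orphaned revision row behind. Without transactions in the data layer, the application would have to detect and clean up such half-finished writes itself.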

You can see query distribution at:
http://noc.wikimedia.org/cgi-bin/ng/report.py?db=enwiki&prefix=query
(don't judge too much, it hasn't been reviewed for a while 🙂)

Technically, the main cost of serving the wiki database is not only in serving a single page object (where data locality doesn't matter much), but also in maintaining lots of cross-indexing (page links, image links, template links, metadata, etc.), as well as providing reporting: different views for recent changes, watchlists, per-user contributions, per-page contributions, etc.
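The kind of cross-indexing and reporting meant here can be sketched with a toy schema. The example below uses Python's built-in `sqlite3` in place of MySQL; the table layout, index names, sample rows, and the functions `what_links_here` and `user_contributions` are assumptions made for illustration and are much simpler than MediaWiki's real schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, title TEXT UNIQUE);
    CREATE TABLE pagelinks (from_id INTEGER, to_title TEXT);
    CREATE INDEX pl_target ON pagelinks (to_title, from_id);
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        page_id INTEGER,
        username TEXT,
        ts INTEGER
    );
    CREATE INDEX rev_user_ts ON revision (username, ts);
""")
db.executemany("INSERT INTO page VALUES (?, ?)",
               [(1, "MySQL"), (2, "NoSQL"), (3, "Database")])
db.executemany("INSERT INTO pagelinks VALUES (?, ?)",
               [(1, "Database"), (2, "Database"), (2, "MySQL")])
db.executemany("INSERT INTO revision (page_id, username, ts) VALUES (?, ?, ?)",
               [(1, "alice", 100), (2, "bob", 110), (3, "alice", 120), (1, "bob", 130)])


def what_links_here(title: str) -> list[str]:
    """Reverse link lookup, served by the (to_title, from_id) index."""
    rows = db.execute(
        "SELECT p.title FROM pagelinks pl "
        "JOIN page p ON p.page_id = pl.from_id "
        "WHERE pl.to_title = ? ORDER BY p.title",
        (title,),
    )
    return [row[0] for row in rows]


def user_contributions(username: str, limit: int = 50) -> list[tuple[str, int]]:
    """Per-user report, newest first, served by the (username, ts) index."""
    return db.execute(
        "SELECT p.title, r.ts FROM revision r "
        "JOIN page p ON p.page_id = r.page_id "
        "WHERE r.username = ? ORDER BY r.ts DESC LIMIT ?",
        (username, limit),
    ).fetchall()


print(what_links_here("Database"))
print(user_contributions("alice"))
```

Each report is the same rows read through a different secondary index. In a plain key:value store, every such view would need its own denormalized copy, kept in sync by the application.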

Moving such features to a NoSQL solution would lose multiple efficiency factors, starting with data locality at both the page level and the server level, and it would move certain problems from the data layer to the application layer (though I can't complain: that would probably mean a more efficient implementation of watchlists, for example).

Still, data would have to be cached and served from disks too in order to have a decent economic solution, and I really doubt those ratios could improve a lot with a different storage solution.

Anyway, I'm not saying that NoSQL wouldn't work. The database isn't a problem at all, and there's a big chance that replacing it with a different technology wouldn't introduce a problem either. There's simply no need to chase the fashion.

OTOH, Wikipedia has been using Lucene and memcached for eons; don't these count as NoSQL? 🙂

P.S. I do this a few days a year on a volunteer basis, so the system really, really doesn't have too many problems.
