Dec 27 2009

WARNING: mismatch_cnt is not 0 on /dev/md0

Category: Articles, Linux administration
FractalizeR @ 12:37 pm

Today I got a cron log from one of my servers that said:

 WARNING: mismatch_cnt is not 0 on /dev/md0

That worried me a little and I decided to investigate.

A quick look at

cat /proc/mdstat

showed that the array was all right. The mismatch_cnt value is the number of sectors found to be out of sync between the RAID-1 (mirrored) drives. On my server the value was

cat /sys/block/md0/md/mismatch_cnt

and the array was md0, which is the boot partition. So I quickly found a way to repair the problem by using

echo repair >/sys/block/md0/md/sync_action
watch cat /proc/mdstat

and, after the repair has completed,

echo check >/sys/block/md0/md/sync_action
watch cat /proc/mdstat
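The two steps above can also be chained so that nobody has to sit in front of watch. What follows is only a rough sketch: the function names and the POLL_SECONDS variable are made up for illustration, and it assumes that sync_action reads back "idle" once a pass has finished.

```shell
#!/bin/sh
# Sketch: run "repair" and then "check" on one md array, waiting for each pass.
# Function names and POLL_SECONDS are hypothetical; each function takes the
# array's sysfs directory (e.g. /sys/block/md0/md) as its first argument.

# Block until the kernel reports that no sync action is running.
wait_idle() {
    while [ "$(cat "$1/sync_action")" != "idle" ]; do
        sleep "${POLL_SECONDS:-10}"
    done
}

# Start one action (repair or check) and wait for it to finish.
run_action() {
    echo "$2" > "$1/sync_action"
    wait_idle "$1"
}

# Repair first, then check, then print the resulting counter.
scrub() {
    run_action "$1" repair
    run_action "$1" check
    echo "mismatch_cnt: $(cat "$1/mismatch_cnt")"
}
```

Run as root, the call for the array in this post would be scrub /sys/block/md0/md. Nothing is started when the file is merely sourced.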

After those operations the count of mismatched sectors went back to 0:

cat /sys/block/md0/md/mismatch_cnt
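On a server with more than one array it may be handy to read the counter for all of them at once. A minimal sketch, assuming the usual /sys/block/mdN/md layout; report_mismatches is a hypothetical helper name, and its optional argument only exists so that a different root directory can be passed in.

```shell
# Sketch: print "<array> <mismatch_cnt>" for every md array under a sysfs root.
# The root defaults to /sys/block; the argument is an illustrative override.
report_mismatches() {
    root="${1:-/sys/block}"
    for f in "$root"/md*/md/mismatch_cnt; do
        [ -r "$f" ] || continue      # skip when the glob matched nothing
        dev=${f#"$root"/}            # strip the root prefix
        dev=${dev%%/*}               # keep only "md0", "md1", ...
        echo "$dev $(cat "$f")"
    done
}
```

Calling report_mismatches without arguments reads the live values from /sys/block.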

I came across this post on the CentOS forum, which says that the problem is quite common for RAID-1 arrays, especially for swap partitions:

... that fills me with dread.  The whole point of RAID-1 is supposed to
be that data that gets written to one drive also gets written to the
other drive.  But yes, apparently will see this on systems where the
file is being constantly written to.
(this is a post from 2007 that discusses the issue)

Apparently, a non-zero number is common on RAID-1 and RAID-10 due to
various (harmless?) issues like aborted writes in a swap file.

Also mentions that it can happen with VMWare VM files.

And lastly, "please explain mismatch_cnt so I can sleep better at night".

So my take on all of that is, if you see it on RAID-5 or RAID-6, you
should worry.  But if it's on an array with memory mapped files or swap
files/partitions that is RAID-1 or RAID-10, it's less of a worry.

That calmed me down a little. And after I found this explanation, I calmed down completely 😉

Suppose I memory-map a file and often modify the mapped memory.
The system will at some point decide to write that block of the file
to the device. It will send a request to raid1, which will send one
request each to two different devices. They will each DMA the data
out of that memory to the controller at different times so they could
quite possibly get different data (if I changed the mapped memory
between those two DMA request). So the data on the two drives in a
mirror can easily be different. If a ‘check’ happens at exactly this
time it will notice.
Normally that block will be written out again (as it is still ‘dirty’)
and again and again if necessary as long as I keep writing to the
memory. Once I stop writing to the memory (e.g. close the file,
unmount the filesystem) a final write will be made with the same data
going to both devices. During this time we will never read that block
from the filesystem, so the filesystem will never be able to see any
difference between the two devices in a raid1.

So: if you are actively writing to a file while ‘check’ is running on
a raid1, it could show up as a difference in mismatch_cnt. But you
have to get the timing just right (or wrong).

I think it is possible in the above scenario to truncate the file
while a write is underway but with new data in memory. If you do
this, the system might not write out that last ‘new’ data, so the last
write to the particular block on storage may have written different
data to the two different drives, and this difference will not be
corrected by the filesystem e.g on unmount. Note that the inconsistent
data will never be read by the filesystem (the file has been
truncated, remember) so there is no risk of data corruption.
In this case the difference could remain for some time until later
when a ‘check’ or ‘repair’ notices it.

So, running repair and check is necessary. Until they find something bad, there is no need to worry.

UPDATE 24/05/2010:

About a month or two ago (several months after the mismatch_cnt problem) I received a smartd report saying that one of the drives in the array had reallocated sectors. Several days later a MySQL database crashed (one of its tables, which contained a search index, was somehow damaged). So I had to replace the damaged drive.

If you see this damn “mismatch_cnt is not 0” warning on your system, I highly recommend running a long SMART self-test on each drive in the array:

smartctl --test long /dev/sda

This does not require any system downtime; the drive runs the test itself in the background. Just replace /dev/sda with your device and, after the test has completed, view its results with

smartctl -a /dev/sda
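In that output, the self-test log and the reallocated sector count are the parts that matter here. The snippet below is a small sketch for pulling out the latter. reallocated_count is a hypothetical helper, and it assumes an ATA drive whose attribute table names the attribute Reallocated_Sector_Ct and puts the raw value in the tenth column; other drives may report it differently.

```shell
# Sketch: extract the raw Reallocated_Sector_Ct value from "smartctl -A" output.
# Assumes the usual ATA attribute table, where the raw value is column 10.
reallocated_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Usage (as root), with /dev/sda as an example device:
#   smartctl -l selftest /dev/sda            # log of finished self-tests
#   smartctl -A /dev/sda | reallocated_count # raw reallocated sector count
```

A value above zero, or one that keeps growing between runs, is the kind of sign that preceded the drive failure described above.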


3 Responses to “WARNING: mismatch_cnt is not 0 on /dev/md0”

  1. Mismatch_cnt is not Zero | Nico's Blog says:

    […] FractializeR reports in his post that he calmed down, as soon as he read the memory-mapped file explanation. However, I cannot agree to his assessment, because up to that point to where the check run first time, I – for sure – did not execute any memory mapped operation (dd should not do so). So there still must be some different explanation for this out there. Thus, the post of Chris Siebenmann brings this unclearity to the core: unless you know from where it comes running the commands above remains risky. […]

  2. WARNING: mismatch_cnt is not 0 on /dev/md1 — Have You Tried IT says:

    […] disks that make up the RAID1 array. I found a detailed description of this problem and its solution here (in […]
