Where’s the money, Lebowski: bypassing the page cache

Posted in Uncategorized on March 2, 2013 by voline

Into the wild

I had a bit of a scare today.  I’ve been rearranging some partitions on my laptop hard drive without a backup (I don’t have the extra storage).  Not for the faint of heart!  I would have much more peace of mind using a partition editor like gparted, which I assume has been fairly well tested.  However, gparted won’t move a partition unless it contains one of its supported file systems, and it won’t move a partition across another partition, both of which I need to do.  These would be great additions to the tool!

Using ddpt

I’ve been using a variant of dd called ddpt to do the low-level copying of data blocks.  Ddpt can speak to the drive via SCSI, so I guessed I could get better read/write rates (turns out maybe not that much better).  There were several moves to complete and some partition juggling to be done.  I was on the last move, which was shifting a 500GB+ partition closer to the beginning of the disk.  Before doing the move I verified that the destination block address held the content I was expecting, just so I was sure I was overwriting what I meant to overwrite.  I then set it off and left, to come back in 5 hours when it should have completed.
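
The pre-flight check above can be sketched in a few lines of Python.  This is only an illustration, not what I actually ran: the names read_sector and starts_with are made up for the example, and the 512-byte sector size is an assumption (many drives use 4096).

```python
import os

SECTOR_SIZE = 512  # assumed logical sector size; check your drive


def read_sector(path, lba, sector_size=SECTOR_SIZE):
    """Return the raw bytes of the sector at logical block address `lba`.

    This is an ordinary buffered read, so the data is served through the
    kernel page cache rather than fetched straight from the drive.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, sector_size, lba * sector_size)
    finally:
        os.close(fd)


def starts_with(path, lba, signature, offset=0, sector_size=SECTOR_SIZE):
    """True if the sector at `lba` holds `signature` at byte `offset`."""
    sector = read_sector(path, lba, sector_size)
    return sector[offset:offset + len(signature)] == signature
```

For example, starts_with(device, lba, expected_bytes) would confirm that the sector about to be overwritten really contains what you think it does, where device, lba and expected_bytes are whatever applies to your own disk layout.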

On overlapping partition moves

Incidentally, this process is a bit scarier because it is an overlapping move (and remember, with dd and friends this can only be a shift to the left, else you corrupt your data).  That is to say, some of the input will be overwritten as the copy takes place.  If for some reason the copy is interrupted after some of the input has been overwritten, you’ve got a huge mess on your hands.  If you just restart the copy, you will corrupt your data.  The situation is not hopeless, but it’s potentially very time consuming to fix.

Basically, you need to slide a window, as wide as the distance between the read and write offsets, sector by sector from the beginning of the destination, comparing the sectors at its two ends.  If the sectors at the beginning are equal (make sure to check the first several sectors), you can start the move from the beginning as before.  Otherwise, if the sectors at the ends of the window are not equal, then the copy has started overwriting the original partition (and thus your partition is currently in a useless state).  Just continue moving the window down until the sectors start matching.  While the sectors are matching, continue moving the window until they stop matching.  The length of the matching run should be equal to the size of the window (by coincidence, matching sectors at the ends of the window could make the run longer than the window, but that doesn’t matter; it should never be shorter than the window).  When you come to mismatching sectors again, this is the point where you may resume your sector move.
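
Here is a minimal Python sketch of that window scan, run against an in-memory image rather than a real disk.  The function names, arguments and sector size are all invented for the illustration, and it assumes sector contents do not repeat: long runs of identical sectors (zeroed space, say) produce coincidental matches that can fool it, so treat it as a way to follow the idea and not as a recovery tool.

```python
def shift_left(disk, dst, shift, count, start=0, stop=None, sector_size=512):
    """Copy sectors [start, stop) of a partition `shift` sectors leftwards.

    Copying front to back is what makes a left shift safe: every source
    sector is read before the advancing write position reaches it.
    """
    stop = count if stop is None else stop
    for i in range(start, stop):
        src = (dst + shift + i) * sector_size
        out = (dst + i) * sector_size
        disk[out:out + sector_size] = disk[src:src + sector_size]


def find_resume_sector(disk, dst, shift, count, sector_size=512):
    """Return the sector index at which an interrupted shift may resume.

    `dst` is the destination start sector, `shift` the window width in
    sectors (source start minus destination start), `count` the
    partition length in sectors.  0 means "restart from the beginning".
    """
    def same(i):
        # Compare the sectors at the two ends of the window.
        a = (dst + i) * sector_size
        b = (dst + shift + i) * sector_size
        return disk[a:a + sector_size] == disk[b:b + sector_size]

    i = 0
    while i < count and not same(i):   # source already overwritten here
        i += 1
    run_start = i
    while i < count and same(i):       # copied, source still intact
        i += 1

    if run_start == count:
        return 0                       # no matching run: nothing was copied
    if run_start == 0:
        return 0                       # matches from the start: source untouched
    if i - run_start < shift:
        raise ValueError("matching run shorter than the window; inspect by hand")
    return i
```

With a 20-sector partition shifted left by 5 sectors and interrupted after 12 sectors, the scan sees mismatches for indices 0 to 6, a matching run for 7 to 11 (exactly the window size), and returns 12; calling shift_left again with start=12 then completes the move.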

Heart attack!

When I came back to the completed move, I immediately ran some sanity tests before assuming everything went well and potentially making things worse.  The first thing I did was check the first destination block to make sure it had the filesystem header I expected.  This shouldn’t really be necessary… but wait, it hasn’t changed!  Ok, remain calm with your seats in the upright position.  Don’t panic.  Let’s see if the first sector of the source has changed… it hasn’t!  Ok, what’s going on here?  My first thought was that if nothing has changed, I can just restart the move from the beginning.  But don’t be too hasty: why didn’t ddpt report any errors and exit?  Instead it ran for the full length and exited normally.  So ddpt thinks everything is good, which means that the drive must not have reported an error.  Thus the drive must have executed all those writes successfully.

Caching

Then it hit me: caching!  When reading the blocks, I was not running ddpt in pt mode (i.e. not using the SCSI layer).  So I was getting blocks from the kernel’s page cache, which might have those blocks cached if I’d recently read them (and I had).  When writing I was using pt mode, which necessarily bypasses the kernel page cache.  Searching through the ddpt documentation, I found the fua and fua_nv bits.  The description wasn’t detailed enough for me to fully understand the implications of using them, but I could tell they might be useful.  Time to dust off the SCSI spec (SBC-3 5.8 table 40) and see what it says.  Since I wasn’t completely sure that the volatile cache on the disk was right either, I set FUA=0 and FUA_NV=1 to get the block from non-volatile cache or directly from the media.  Lo and behold!  The sectors were as they should be according to the drive!  Ok, but where do we go from here?
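
To make those two bits a little more concrete, here is a sketch of how a READ(10) command block carries them.  The bit positions (FUA as bit 3 and FUA_NV as bit 1 of byte 1) are my reading of SBC-3 and worth checking against the spec before relying on them; the function name is made up, and this only builds the ten command bytes, it does not send anything to a drive.

```python
import struct

READ_10 = 0x28  # SCSI READ(10) operation code


def build_read10_cdb(lba, blocks, fua=False, fua_nv=False, dpo=False):
    """Build a 10-byte READ(10) CDB (layout assumed from SBC-3).

    Byte 1 flags as assumed here: DPO = bit 4, FUA = bit 3, FUA_NV = bit 1.
    READ(10) has a 32-bit LBA and a 16-bit transfer length.
    """
    if not 0 <= lba <= 0xFFFFFFFF:
        raise ValueError("LBA does not fit in READ(10)")
    if not 0 <= blocks <= 0xFFFF:
        raise ValueError("transfer length does not fit in READ(10)")
    flags = (int(dpo) << 4) | (int(fua) << 3) | (int(fua_nv) << 1)
    # opcode, flags, LBA (big-endian), group number, length, control
    return struct.pack(">BBIBHB", READ_10, flags, lba, 0, blocks, 0)
```

Under that layout, build_read10_cdb(2048, 8, fua_nv=True) corresponds to the FUA=0, FUA_NV=1 combination I used: eight blocks starting at LBA 2048, with the second byte set to 0x02.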

Dumping the cache (the cops are on our tail!)

After a few quick Google searches, I found that you can tell Linux to drop the pages in its cache (if they aren’t dirty!).  My biggest concern now was, what if it causes the blocks to be written to disk before dropping them?  Then you end up with blood all over your money, which makes it completely unusable.  Now, I wouldn’t expect pages to get written back to disk from the cache unless they were dirty (if the kernel thinks nothing has changed, why would it write the same data to disk that’s already there?).  Some more looking around led me to /proc/sys/vm/dirty_writeback_centisecs, which sets how often the kernel wakes up to write old dirty pages back to disk.  By default this is every 5 seconds (a page also has to be older than dirty_expire_centisecs, 30 seconds by default, before it qualifies).  So by the time I was running these sanity check commands, any dirty block should already have been written to disk.  In fact, since the only thing writing to the disk was not going through the page cache, it should have been a very long time since there was a dirty page destined for the sectors I cared about.
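
If you want to check these settings on your own machine, something like the following Python sketch will do.  The helper names are invented for the example; the two files are the standard Linux writeback tunables, both expressed in hundredths of a second, where dirty_writeback_centisecs is the flusher wakeup interval and dirty_expire_centisecs is the age at which a dirty page becomes eligible for writeback.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")  # location of the Linux vm sysctls

TUNABLES = ("dirty_writeback_centisecs", "dirty_expire_centisecs")


def centisecs_to_seconds(text):
    """Convert the text of a *_centisecs sysctl file to seconds."""
    return int(text.strip()) / 100.0


def read_writeback_tunables(vm_dir=VM_DIR):
    """Return both writeback tunables, in seconds, keyed by file name."""
    return {
        name: centisecs_to_seconds((Path(vm_dir) / name).read_text())
        for name in TUNABLES
    }
```

On a system with stock settings I would expect read_writeback_tunables() to report 5 seconds for the wakeup interval and 30 seconds for the expiry age, but distributions and laptop power-saving tools do change them, so it is worth looking rather than assuming.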

Time to pull the trigger.  Done and done.  After telling Linux to drop the pages in the page cache (no need to have it drop dentries or inodes, since there was no filesystem associated with those blocks), the sectors from the disk were returning what they should when going through the page cache.  Mount read-only and run fsck, everything is fine…  Whew!
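
For the record, dropping the page cache comes down to writing a single value to /proc/sys/vm/drop_caches as root.  A minimal Python sketch, with a made-up function name, looks like this; the value 1 asks for the page cache only, while 2 would drop dentries and inodes and 3 would drop both.

```python
DROP_CACHES = "/proc/sys/vm/drop_caches"  # Linux only, needs root


def drop_page_cache(path=DROP_CACHES):
    """Ask the kernel to drop clean pages from the page cache.

    Writing "1" drops the page cache only.  Dirty pages are not dropped,
    and this sketch deliberately does not call sync first, because
    forcing writeback is exactly what I wanted to avoid here.
    """
    with open(path, "w") as f:
        f.write("1\n")
```

The path argument exists only so the function can be pointed at a scratch file when trying it out without root.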

Looking back…

I don’t think the disk’s volatile cache could have been inconsistent with the media.  So I shouldn’t have needed the FUA or FUA_NV bits.  Running in pt mode should have been sufficient.

 
