Moving data with mirrors

The main data volume on the system at work ran out of PEs: with the old default PE size of 4 MB and a limit of 64k PEs per volume group, the group tops out at 256 GB. There’s no way to change that without recreating the volume group and blowing away everything. So I have to move the 250 gigs of crap somewhere else, delete the old volume group, recreate it, then move everything back. Since moving that much data takes several hours, and hence downtime for users, that’s not good.
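
To make the limit concrete, here is a rough sketch of the arithmetic. The function name max_vg_gib is made up for illustration; the only inputs are the extent size and the 64k (65536) extent limit mentioned above.

```shell
# Maximum volume group size in GiB for a given extent size (in MiB)
# and a given limit on the number of extents per volume group.
max_vg_gib() {
    pe_mib=$1
    pe_limit=$2
    echo $(( pe_mib * pe_limit / 1024 ))
}

max_vg_gib 4 65536     # old 4 MB default: prints 256
max_vg_gib 32 65536    # 32 MB extents:    prints 2048
```

So going from 4 MB to 32 MB extents raises the ceiling from 256 GB to 2 TB for the same number of extents.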

So, the idea is to move the data around live. This requires making a RAID 1 mirror on top of lvm, which is not how it is usually done. It also requires learning enough about mdadm to be able to create a mirror without destroying the good data (ouch). Initially creating the mirror requires that the regular lvm device be unmounted and the RAID (md) device mounted in its place. After that, the syncing of the mirror can happen live. Hence downtime is kept to an absolute minimum.

There’s a real dearth of information out there for doing this type of thing, and maybe for good reason. I’m also disappointed that this is quite a bit of hassle compared to how I used to do this sort of thing on a DG/UX (Data General) system back in the mid 90s.

  1. Create a new phys volume on temporary disk
  2. Create a new volume group on new disk with larger extent size (32 megs)
  3. Mirror the old data and new data together. NOTE: This requires RAID on top of lvm, not lvm on top of RAID
  4. Break the mirror after it syncs, leaving data on the temporary disk
  5. Create a new PV on the original disk with 32 meg PEs
  6. Extend the volume group onto the new PV
  7. Move the PEs from the temporary disk to the original disk
  8. Shrink the volume group so it lives just on the original disk
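
Steps 5 through 8 are not spelled out as commands in this post, so the following is only my reading of them as a dry run. The function plan_move_back is a made-up helper that prints the standard LVM commands (pvcreate, vgextend, pvmove, vgreduce) without running anything. It assumes the data ended up in the temporary volume group and that the original disk has already been freed.

```shell
# Print (do not run) the LVM commands for steps 5 through 8.
# Arguments: volume group name, temporary disk, original disk.
plan_move_back() {
    vg=$1
    temp_pv=$2
    orig_pv=$3
    echo "pvcreate $orig_pv"          # step 5: the extent size comes from the VG it joins
    echo "vgextend $vg $orig_pv"      # step 6: grow the VG onto the new PV
    echo "pvmove $temp_pv $orig_pv"   # step 7: migrate extents off the temporary disk
    echo "vgreduce $vg $temp_pv"      # step 8: drop the temporary disk from the VG
}

plan_move_back san2 /dev/sdc /dev/sdd
```

Printing the plan first makes it easy to eyeball the device names before anything touches a disk.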

Of course, before doing any of this, testing and documentation are required, hence this post. Also, this procedure was used and tested on RHEL 3. For the love of your job, data, sanity, and all that is holy, do not trust what I am saying here. Use it as a guide with other docs and test it on a non-production box with data you can afford to lose. Then before doing it on a production box, make sure you have safe backups, preferably multiples.

Some commands to use to get a feel for what is on your disk already

  • pvscan
  • vgdisplay -v vg_group_name

For purposes of the demo, the good data is on vg san0 and physical device /dev/sdd. The temporary disk will be named san2 on /dev/sdc.

Initial test set up

pvcreate /dev/sdc
vgcreate -s 16m san2 /dev/sdc # NOTE: Using 16m as extent size just for testing
lvcreate -l 60 --name newhome san2
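
Since lvcreate -l takes a count of extents rather than a size, it helps to check what that count works out to. A tiny sketch (lv_size_mib is a made-up name):

```shell
# Size in MiB of a logical volume given its extent count and the
# volume group's extent size in MiB.
lv_size_mib() {
    extents=$1
    pe_mib=$2
    echo $(( extents * pe_mib ))
}

lv_size_mib 60 16    # the test volume above: prints 960
```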

RAID 1 (mirror) background

The idea is to create a RAID 1 mirror consisting of the main data disk and a “missing” disk. This will start it off in a degraded mode. After that we can add the temporary (new) disk to it as a hot add and it will rebuild the mirror onto it. After that is done, we mark the original disk as bad and leave ourselves with data on the new disk. Then we kill the mirror and go back to naked lvm.

Finding information on whether you can create a mirror while preserving the existing data, let alone on top of lvm, was difficult: my googling could not find anyone who had done it.

Creating the mirror

THE ORDER OF THE DEVICES BELOW IS CRITICAL. The first device will be the master and will copy to the second device.

mdadm --create /dev/md0 -l 1 -n 2 /dev/san0/home /dev/san2/newhome
mdadm --detail /dev/md0
mount /dev/md0 /mnt/home

If that bothers you (and it really should) you can create the array with just the original disk, mount it, then hot add the second disk into the array after you know the data is there. Also, if you’re paranoid after reading the various notes below and intend to fsck the md device before mounting, it’ll save you time if you do it like this since the fsck won’t be beating the disk at the same time as the mirror sync is going.

mdadm --create /dev/md0 -l 1 -n 2 /dev/san0/home missing
mdadm --detail /dev/md0
fsck -f /dev/md0
resize2fs /dev/md0
mount /dev/md0 /mnt/home
mdadm /dev/md0 -a /dev/san2/newhome

At this point the mirror should start rebuilding. It will now say it has three devices and the missing one will still be listed, but once the rebuild is done it only shows the two devices (although the total # of devices still says three). So maybe there’s a better way to do this, but it works.

Run the --detail option again to monitor when the drive is done rebuilding. At this point we should be able to break off the original disk and be left with our data on the new disk.

But before doing that, just in case, we should create an mdadm.conf file so if the array has to be stopped and restarted, we don’t have to scan for it.

echo -e 'DEVICE\t/dev/san0/home /dev/san2/newhome' >> /etc/mdadm.conf
mdadm --detail --scan >> /etc/mdadm.conf
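
The DEVICE line is just the keyword, a tab, and the component devices. Because echo -e behaves differently between shells, here is a small sketch that builds the same line with printf instead. The helper name device_line is made up, and it only prints the line; appending it to /etc/mdadm.conf is left to you.

```shell
# Build an mdadm.conf DEVICE line from a list of component devices.
# "$*" joins all arguments with single spaces.
device_line() {
    printf 'DEVICE\t%s\n' "$*"
}

device_line /dev/san0/home /dev/san2/newhome
```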

Stopping / Restarting the array

mdadm --stop --scan # will stop the array
mdadm --assemble --scan # will start the array iff mdadm.conf file above was appended with correct info
mdadm -Ac partitions -m 0 /dev/md0 # will start it back up (if there is no config file)

Breaking the mirror

After (and only after, verify first) the mirror syncing is done (e.g. mdadm --detail /dev/md0), the mirror can be broken and we’ll have two identical mountable ext3 lvm partitions: one on the original disk, one on the temporary (new) disk. Before killing the mirror, be sure to unmount any file systems using the mirror. When checking the rebuild status with mdadm, look for a line that says “Rebuild Status :”. If that line is not there, it’s rebuilt.
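
Since breaking the mirror too early loses data, that check can be wrapped in a small guard. This is only a sketch: rebuild_finished is a made-up helper that takes the text of mdadm --detail as its argument and applies the rule above (no “Rebuild Status” line means done). It does not look at anything else in the output, so still read the full --detail output yourself.

```shell
# Succeeds (exit 0) when the given `mdadm --detail` output contains no
# "Rebuild Status" line; fails (exit 1) while a rebuild is still reported.
rebuild_finished() {
    if printf '%s\n' "$1" | grep -q 'Rebuild Status'; then
        return 1
    fi
    return 0
}

# Possible use (not run here):
#   rebuild_finished "$(mdadm --detail /dev/md0)" || echo "still rebuilding, do not break the mirror"
```

With the check passing, the mirror can be broken: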

umount /dev/md0
mdadm --manage --set-faulty /dev/md0 /dev/san0/home
mdadm --manage --set-faulty /dev/md0 /dev/san2/newhome
mdadm --stop --scan

Once this is done we can actually mount and use the partition, but the raid “superblock” will still be associated with each partition and that could lead to it accidentally being restarted. To really destroy the mirror, the raid superblock needs to be zeroed out.

mdadm --misc --zero-superblock /dev/san2/newhome
mdadm --misc --zero-superblock /dev/san0/home


NOTE: This procedure minimizes downtime at the risk of file system safety. Before starting this you really should fsck the disk you are going to mirror, offline. Also, read the notes below.

NOTE: The set-faulty commands above aren’t really needed if you’re just going to zero the superblocks right away but if you don’t, you could technically mount up both partitions, make changes, then unmount them and restart the mirror with probably really bad effects.

NOTE: fsck will fail on the mirror device because the size of the device shrinks a wee bit. However, once the mirror is broken, a fsck should work and pass OK.

NOTE: When you import a disk with an ext3 file system into a mirror like this, the physical size of the partition shrinks a bit because the md superblock is stored at the end of the partition. Hence while it will mount, it will fail a fsck and there’s the risk (I think) that real data might overwrite the raid superblock, especially if the partition is filled. Hence to be very safe, one should run ‘fsck -f /dev/md0 ; resize2fs /dev/md0’ against the new raid partition before remounting. However, once you do this, when you destroy the mirror the filesystem probably should be resized back out to its original size.
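
To put a number on “shrinks a bit”, here is a sketch of the size calculation. It assumes the old version 0.90 md superblock layout, where the device size is rounded down to a 64 KiB boundary and the last 64 KiB block is then reserved for the superblock. The function name md_usable_kib is made up, and other metadata versions use different layouts.

```shell
# Usable size in KiB of an md component device, assuming version 0.90
# metadata: round down to a 64 KiB multiple, then reserve 64 KiB.
md_usable_kib() {
    size_kib=$1
    echo $(( size_kib / 64 * 64 - 64 ))
}

md_usable_kib 983040   # a 960 MiB volume: prints 982976
```

That is, somewhere between 64 and 128 KiB disappears from the end of the device, which is exactly the region a full file system might be using.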

5 thoughts on “Moving data with mirrors”

  1. Actually pulled off the first part of this plan last night. Total downtime was 60 minutes. I went paranoid and ran fsck and resize2fs before starting the mirror rebuild which added to the downtime. The time required to fsck the 250 gig, 650k inode disk, was 15 minutes each time. The rest of the time was spent shutting down services in an attempt to get the disk to umount. The mirror syncing took 7 hours.

    Also of note was the procedure to rescan the scsi bus after adding the disk (from the Apple xraid we have).

    dd if=/dev/zero of=/dev/sdd bs=512 count=1
    pvcreate /dev/sdd
    vgcreate san2 /dev/sdd
    lvcreate -l 7500 --name emoh san2
    mkfs.ext3 /dev/san2/emoh
    service httpd stop
    service smb stop
    service sendmail stop
    service naviagent stop
    service ldap stop
    service mysqld stop
    chkconfig ipop3 off
    chkconfig pop3s off
    chkconfig imap off
    chkconfig imaps off

    … then various killing of procs in an attempt to umount /u. Also had to umount /u from other boxes (nfs mounted) as well. “fuser -m /u” helped find them. Then stopped nfs and nfslock services too.

    umount /u
    fsck -f -y /dev/san1/home
    mdadm --create /dev/md0 -l 1 -n 2 /dev/san1/home missing
    fsck -f /dev/md0
    resize2fs /dev/md0
    mount /dev/md0 /u
    service ldap start
    service naviagent start
    service nfslock start
    service nfs start
    service sendmail start
    service smb start
    service mysqld start
    chkconfig imap on
    chkconfig imaps on
    chkconfig ipop3 on
    chkconfig pop3s on
    service httpd start
    mdadm --detail /dev/md0
    mdadm /dev/md0 -a /dev/san2/emoh
    mdadm --detail /dev/md0

    Note, for some reason, xinetd was not running after all of this, so I had to restart it too.

  2. This process was duplicated for two other san partitions, so in all there are now three raid devices mirrored between the EMC and XRAID.

    Next step is to break the mirror and leave just the data on the xraid, then redo the vol group on the emc, then mirror the data back across.

    First step, break mirror…

    mdadm --detail /dev/md1
    mdadm --manage --set-faulty /dev/md1 /dev/san0/mail
    mdadm -r /dev/md1 /dev/san0/mail
    mdadm --detail /dev/md2
    mdadm --manage --set-faulty /dev/md2 /dev/san0/local
    mdadm -r /dev/md2 /dev/san0/local
    mdadm --detail /dev/md0
    mdadm --manage --set-faulty /dev/md0 /dev/san1/home
    mdadm -r /dev/md0 /dev/san1/home

    Next step, destroy the old volume groups (ouch ouch ouch) and recreate them with proper PE size of 32M, matching same size as the existing mirror one (we can grow them later)

    vgdisplay -v san0
    vgdisplay -v san1
    lvchange -an /dev/san0/mail /dev/san0/local /dev/san1/home
    lvremove /dev/san0/mail /dev/san0/local /dev/san1/home
    vgchange -an san0 san1
    vgremove san0 san1

    Next step, create a new volume group spanning the two raid sets, then create lv’s to match the ones on the xraid.

    vgcreate san0 /dev/sda /dev/sdb
    lvcreate -l 1600 --name local san0
    lvcreate -l 1200 --name mail san0
    lvcreate -l 7500 --name home san0
    mkfs.ext3 /dev/san0/mail
    mkfs.ext3 /dev/san0/local
    mkfs.ext3 /dev/san0/home

    Now that is done, mirror the data back to the EMC side of the house, being sure to wait between each hot spare add until the previous mirror rebuild is complete to keep from killing the box.

    mdadm /dev/md1 -a /dev/san0/mail
    mdadm /dev/md2 -a /dev/san0/local
    mdadm /dev/md0 -a /dev/san0/home
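
The waiting between adds can be scripted rather than watched by hand. A minimal sketch, assuming /proc/mdstat shows a “recovery” or “resync” line while any array is rebuilding; md_busy and wait_until_idle are made-up helper names, and the file path and polling interval are optional arguments so the logic can be tried against a saved copy of mdstat.

```shell
# Succeeds (exit 0) while the mdstat file reports a recovery or resync.
# Argument 1: file to read (defaults to /proc/mdstat).
md_busy() {
    grep -Eq 'recovery|resync' "${1:-/proc/mdstat}"
}

# Block until no rebuild is reported.
# Argument 1: file to read; argument 2: seconds between checks (default 60).
wait_until_idle() {
    while md_busy "$1"; do
        sleep "${2:-60}"
    done
}

# Possible use (not run here):
#   mdadm /dev/md1 -a /dev/san0/mail  && wait_until_idle
#   mdadm /dev/md2 -a /dev/san0/local && wait_until_idle
#   mdadm /dev/md0 -a /dev/san0/home  && wait_until_idle
```

Note that an unreadable mdstat file counts as idle here, so check the final state with mdadm --detail anyway.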

  3. Final step of the process is to break the mirror again, destroy it, check the partitions that will remain just on the EMC array, and grow them as needed. First, check to make sure all mirrors are in sync and OK, then unmount the disks.

    mdadm --detail /dev/md0
    mdadm --detail /dev/md1
    mdadm --detail /dev/md2
    service httpd stop
    service smb stop
    service sendmail stop
    service naviagent stop
    service ldap stop
    service mysqld stop
    chkconfig ipop3 off
    chkconfig pop3s off
    chkconfig imap off
    chkconfig imaps off

    After this I fsck’ed each lvm partition, then ran resize2fs and remounted except for the home one which we need to expand first. However, being paranoid I fsck’ed, resized and remounted home just to make sure all was well. Probably an unnecessary step and theoretically if something bad happened and one of these gets trashed, the other side of the mirror is still there (and the full backup done last night on tape as well).

    Critical bits that resized the partition…

    lvextend -l +2500 /dev/san0/home
    resize2fs /dev/san0/home
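
For picking the number passed to lvextend -l, a small sketch that converts a desired amount of growth into extents, rounding up. The name extents_for_mib is made up, and the 32 MiB extent size is taken from the PE size mentioned earlier in this thread.

```shell
# Number of extents needed to cover a given size in MiB, rounded up.
extents_for_mib() {
    want_mib=$1
    pe_mib=$2
    echo $(( (want_mib + pe_mib - 1) / pe_mib ))
}

extents_for_mib 80000 32   # prints 2500, i.e. the +2500 used above
```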

  4. Your first paragraph is wrong … well, not entirely.

    Under LVM1, there are a maximum of 65,536 PEs/VG. With the default PE size of 4MB (except in version 1.0.7, where the default is 32MB/PE; SUSE Linux 9.0 was the only distribution to ever ship that version), this means that you have a total of 256GB of manageable space. In practice, however, you could use 255.9GB, as a small portion was used for the VG meta-data itself.

    With LVM2, which shows up in the 2.6.0 and later kernels, you have 2^24 PEs/VG. The default PE size is 4MB under LVM2, as well. This means that by default, one can have up to 64TB (that’s right, terabytes) of manageable storage in a single VG.

    So, how does this relate to your first paragraph? Well, you can simply convert an existing LVM1 VG to LVM2, without having to reformat anything. Just run (substituting the name of the volume group for the “vg_name”, of course):

    # vgconvert -M2 vg_name [another_vg_name […]]

    Consult the vgconvert(8) man page, if you like (but it’s a very simple command).

    BTW, as far as I know, LVM2 still supports a maximum of 99 VGs/system, just like LVM1 did, but I haven’t actually dug far enough into the code to find that out for sure.
