Thursday, April 09, 2009

RAIDZ On-Disk Format

A while back, I came up with a way to examine the ZFS on-disk format using a modified mdb and zdb (see the paper and slides). I also used the method described there to recover a removed file (see here in my blog). During the past week, I decided to try to understand the layout of raidz, in other words, how raidz organizes data on disk. It is easy to say that raidz on disk is basically raid5 with variable-length stripes, but what does that really mean?

To do this, I once again used the modified mdb, and made a further modification to zdb. In addition, I implemented a new command (zuncompress) which allows me to uncompress ZFS data that exists in a regular file. Since I suspect that most of the 10 or so people who read this will not want to read a long description of how I determined the layout, I'll just give a summary here. If anyone really wants the details, reply to the blog and maybe I'll go into them.

First, some general characteristics:

- Each disk contains the 4MB label areas at the beginning and end of the disk. For information on these, please see the ZFS On-Disk Specification paper. Any walk of the ZFS on-disk format starts with an uberblock_t, which is found in this area.

- The metadata used for raidz is the same as for other ZFS objects. In other words, the uberblock_t contains the location of an objset_phys_t, which in turn contains the location of the meta-object set (MOS), and so on. The difference is that, physically, an individual structure on disk may be spread across several disks, and not necessarily all of them. For example, take a MOS (basically an array of dnode_phys_t structures) on a 5-disk raidz volume. It might be compressed to 0xa00 (2560 decimal) bytes, and be organized on the raidz disks as follows:
- 512 bytes on disk 0
- 512 bytes on disk 1
- 512 bytes on disk 2
- 1024 bytes on disk 3
- 1024 bytes on disk 4
If you do the arithmetic, you'll find this is 0xe00 bytes (3584 decimal, or 3.5KB), not 0xa00 (2.5KB) bytes. The actual allocated size may be larger still. The reason for the extra 1KB is the next point.

- Each metadata object (as well as the data itself) has its own parity. The extra 1KB in the previous point is for parity. If the parity in the above example is on disk 4, it must be 1024 bytes, since the largest of the blocks containing the object is 1KB. Even a metadata structure that only takes up 512 bytes (for instance, an objset_phys_t) will take up 1024 bytes on the disks: one disk containing the 512-byte structure, and another containing 512 bytes of parity.

- Block offsets as reported by zdb (and described in the ZFS On-Disk Specification) are for the entire space (i.e., if you have 5 100GB disks making up a raidz pool, the block offsets start at 0 and go to 500GB).

- Since block offsets cover the entire pool, you cannot simply look at the offsets reported by zdb and map them to locations on disk. The kernel routine vdev_raidz_map_alloc() (see http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c#644) converts an offset and size into locations on the disks. I have added an option to zdb that, given a raidz pool, offset, and size (as reported by zdb), calls this routine and prints out the values of the returned map. This shows the locations on the disk(s) and the sizes for both the data itself and the parity. A small sketch of the arithmetic follows this point.
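To make that mapping concrete, below is a small user-space sketch of the arithmetic, based on my reading of vdev_raidz_map_alloc(). The function and names (raidz_map_columns, SECT_SHIFT) are mine, not the kernel's, it assumes 512-byte sectors and single parity, and it leaves out details such as the parity-device rotation the kernel does, so treat it as an illustration rather than the real code.

    #include <stdio.h>
    #include <stdint.h>

    #define SECT_SHIFT 9    /* assume 512-byte sectors */

    /*
     * Rough model of how a raidz pool offset/size pair is split into
     * per-disk (column) offsets and sizes.  Simplified, single parity.
     */
    static void
    raidz_map_columns(uint64_t offset, uint64_t size, uint64_t dcols,
        uint64_t nparity)
    {
        uint64_t b = offset >> SECT_SHIFT;        /* starting sector in the pool */
        uint64_t s = size >> SECT_SHIFT;          /* size in sectors */
        uint64_t f = b % dcols;                   /* first column (disk) used */
        uint64_t o = (b / dcols) << SECT_SHIFT;   /* byte offset on that disk */
        uint64_t q = s / (dcols - nparity);       /* full rows of data */
        uint64_t r = s % (dcols - nparity);       /* leftover data sectors */
        uint64_t bc = (r == 0 ? 0 : r + nparity); /* columns that get an extra sector */
        uint64_t acols = (q == 0 ? bc : dcols);   /* columns actually used */

        for (uint64_t c = 0; c < acols; c++) {
            uint64_t col = f + c;
            uint64_t coff = o;
            if (col >= dcols) {                   /* wrap around to the next row */
                col -= dcols;
                coff += 1ULL << SECT_SHIFT;
            }
            uint64_t csize = (q + (c < bc ? 1 : 0)) << SECT_SHIFT;
            printf("%s column: disk %llu, offset 0x%llx, size 0x%llx\n",
                c < nparity ? "parity" : "data  ",
                (unsigned long long)col, (unsigned long long)coff,
                (unsigned long long)csize);
        }
    }

    int
    main(void)
    {
        /* The MOS example above: 0xa00 bytes on a 5-disk, single-parity raidz. */
        raidz_map_columns(0, 0xa00, 5, 1);
        return (0);
    }

For the 0xa00-byte MOS example, this prints one 1KB parity column, one 1KB data column, and three 512-byte data columns, 0xe00 bytes in all, matching the layout above. Which physical disk ends up with the parity depends on where the block starts (the f value), and the printed offsets are relative to the start of the allocatable area of each disk, after the label area.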

- I recently saw someone on #opensolaris IRC state that writing a 1-byte file results in a write to all disks in a raidz pool. That may be true (I haven't checked), but only 1024 bytes are used for the 1-byte file: one 512-byte block containing the 1 byte of data, and another 512-byte block on a different disk containing parity. It is not using space on all n disks for a 1-byte file.

- ZFS basically uses a scatter-gather approach to reading and writing data on a raidz pool. The disks are read at the correct offsets into a buffer large enough to contain the data. So on a read, data from the first disk is read into the beginning of the buffer, data from the second disk is read into the same buffer immediately following the data from the first disk, and so on. The resulting buffer is then decompressed, and the data is returned to the requestor.
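As a rough sketch of that gather step, assuming the per-disk offsets and sizes are already known (for example from a map like the one printed above), the read side amounts to something like the following. The structure, the device paths, and the 4MB front-of-disk adjustment are my own illustration, not ZFS code:

    #include <stdlib.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <fcntl.h>

    /* One data column of a raidz block: which disk, where on it, how much. */
    typedef struct {
        const char *dev_path;   /* path to one raidz child disk */
        uint64_t    offset;     /* byte offset within the allocatable area */
        uint64_t    size;       /* bytes to read from this disk */
    } rz_col_t;

    #define LABEL_RESERVE (4ULL * 1024 * 1024)  /* assumed 4MB label/boot area at the front */

    /*
     * Gather the data columns of one raidz block into a single contiguous
     * buffer.  The caller would then decompress the buffer to get the
     * logical block back.
     */
    static void *
    rz_gather(const rz_col_t *cols, int ncols, size_t *lenp)
    {
        size_t total = 0, filled = 0;

        for (int i = 0; i < ncols; i++)
            total += cols[i].size;

        char *buf = malloc(total);
        if (buf == NULL)
            return (NULL);

        for (int i = 0; i < ncols; i++) {
            int fd = open(cols[i].dev_path, O_RDONLY);
            if (fd == -1) {
                free(buf);
                return (NULL);
            }
            /* Read this disk's piece directly after the previous one. */
            ssize_t n = pread(fd, buf + filled, cols[i].size,
                cols[i].offset + LABEL_RESERVE);
            (void) close(fd);
            if (n != (ssize_t)cols[i].size) {
                free(buf);
                return (NULL);
            }
            filled += cols[i].size;
        }
        *lenp = total;
        return (buf);
    }

For the MOS example, cols would list the four data columns (skipping the parity column), and the gathered 0xa00 bytes could then be handed to something like the zuncompress command mentioned above to get back the array of dnode_phys_t structures.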

So, that's the basics. I was going to turn my attention to the in-memory representation of ZFS, but now I think I'll instead take a stab at automating the techniques I am using. Once I have that done, I'll try automating recovery of a removed file. From there, we'll see.

Saturday, April 04, 2009

More information about Free One Day OpenSolaris Internals Training

I thought I would say a few words about what is planned for the free one day OpenSolaris Internals training class (see http://sl.osunix.org/FreeKernelTrainingDay for a list of topics, and to sign up).

Regardless of the topics covered, I want to make this as close to a "classroom" setting as possible. For me, this means that attendees should be able to follow along with anything I am doing on OpenSolaris by doing it themselves. So, for instance, if I am using mdb to examine some data structure, students should be able to do the same on their machines. For some topics, notably ZFS, this will require students either to build the mdb dmod and the modified mdb and zdb that I use, or to load a version of OpenSolaris that contains these (to be provided by osunix.org). Source for the modified mdb, zdb, and rawzfs mdb dmod is available for download at ftp://ftp.bruningsystems.com/mdb.tar.Z, ftp://ftp.bruningsystems.com/zdb.tar.Z, and ftp://ftp.bruningsystems.com/raw_dmods.tar.Z. If we do a kmdb session, students will need either to run OpenSolaris in a VM (VirtualBox), or to have 2 machines connectable via tip or a terminal server for console access.

Currently, the plan is to give attendees access to some slides, to use IRC, and to give them a view of a window on my machine where they can see what I am doing and try the same on their own machines. Best would be a way for everyone to "see" my desktop, but I'm still looking into the best way to do that (any suggestions are welcome). It would be great to have audio, preferably conferencing, but this may cost money, and... the class is free. That should mean free for me as well. If anyone has a suggestion for free, conferenced audio, I would appreciate it.

I would like to decide on topics to be covered in the next week or so. So, if you are interested in attending, please go to http://sl.osunix.org/FreeKernelTrainingDay, take a look at the topics, and sign up. If you have ideas for other kernel-related topics, please let me know. Depending on how this goes, I may do more of these in the future.

Thursday, April 02, 2009

Free One-day OpenSolaris Internals class

I am holding a free, one-day OpenSolaris Internals class on-line on April 18 or 19. We'll cover 2 topics, chosen by a vote on the list of possible topics. For more information, see http://sl.osunix.org/FreeKernelTrainingDay. I hope to see you there!

OpenSolaris Internals class

I am teaching an OpenSolaris Internals class at Systemics in Warsaw, Poland, during the week of May 4-8. The course will be taught in English. For a detailed topic outline, see here. For pricing, location information, and availability, please send email to magdalena.sternick@systemics.pl. If you have questions about course content, please email me at max@bruningsystems.com.