A few things, mostly technical notes...

Tuesday, January 09, 2007

Speeding up ext3 filesystems

So you've got an ext3 (ok, ext2 if it's an old flavour) filesystem, about which folks are complaining.

There are a few tricks by which you can increase the throughput of such a filesystem.

But before revealing the hacks, let's see what goes on behind the scenes.

ext3 is a journaling filesystem. Wikipedia defines a journaling filesystem as one that logs changes to a journal, usually a circular log in a specially allotted area, before actually writing the changes to the main filesystem.

ext3 handles journaling through a special API called the Journalling Block Device, or JBD. If you have an ext3 filesystem mounted, chances are high that the jbd kernel module is already loaded:

root@linUX:> lsmod |grep jbd
jbd 71385 1 ext3
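
The same check can be scripted. The sketch below assumes the usual /proc/modules layout, with the module name in the first column. jbd_loaded is a hypothetical helper name, and the jbd2 match is an assumption added because newer kernels ship the journaling layer under that name.

```shell
# jbd_loaded [MODULES_FILE]
# Succeeds if a jbd or jbd2 module is listed in a /proc/modules-style table.
# jbd_loaded is a hypothetical helper, not part of any standard tool.
# The table path is a parameter so the function can be tried on a sample file.
jbd_loaded() {
    awk '$1 == "jbd" || $1 == "jbd2" { found = 1 } END { exit !found }' \
        "${1:-/proc/modules}"
}
```

Something like `jbd_loaded && echo "journaling layer is loaded"` would then do the job. Note that a journaling layer compiled directly into the kernel does not show up in the module list, so a failure here is not proof that journaling is unavailable.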

JBD's job is to implement journals on any kind of block device. ext3 code will work with the JBD API. ext3 informs JBD of any modifications it is performing, and before ext3 modifies any data on the disk, it has to get JBD's permission to do so. JBD accordingly does the journalling for ext3.

JBD can implement the actual journaling for ext3 in three main ways, selected with the data= mount option:

(1) journal
(2) writeback
(3) ordered

Here is what each mode does:

Journal mode

This mode provides full journaling for your metadata and data, and gives maximum integrity.

In this mode, ext3 invokes JBD to journal all changes to the filesystem, whether the changes are made to data or to metadata. Because both data and metadata pass through the journal, JBD can bring both back to a consistent state after a crash, thereby providing the best integrity.

Now, journaling data as well as metadata has a performance cost, because every data block is written twice: once to the journal and once to its final location. The cost can be reduced by keeping a larger journal. Here's why: if your journal is relatively small, it fills up quickly and JBD has to wait until the journaled changes are written out to the filesystem. If your journal is big enough, it does not have to wait as often.
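
As a rough sketch of how a larger journal could be set up: tune2fs can drop the existing journal and create a new one of a given size in megabytes. The helper below only prints the commands and does not run them. journal_resize_cmds is a hypothetical name, and the device and size you pass in are your own choices; nothing here is a recommended value.

```shell
# journal_resize_cmds DEVICE SIZE_MB
# Print (not run) the commands that would replace an ext3 journal with one
# of SIZE_MB megabytes. Hypothetical helper; review the output before
# running any of it as root, and only on an unmounted, cleanly shut down
# filesystem.
journal_resize_cmds() {
    dev=$1
    size_mb=$2
    printf 'umount %s\n' "$dev"
    # Remove the existing journal ...
    printf 'tune2fs -O ^has_journal %s\n' "$dev"
    # ... and create a new one of the requested size.
    printf 'tune2fs -J size=%s %s\n' "$size_mb" "$dev"
}
```

For example, `journal_resize_cmds yourdevice 128` prints the three steps for a 128 MB journal. Between the second and third step the filesystem has no journal at all, so this is not something to try on a disk you cannot afford to restore.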

ordered mode

This is the default mode on most Linux flavours, and by far the most widely used.

Updates to metadata and data blocks (a block being 1K, 2K, or 4K in size) are bundled into "transactions". When it is time for the filesystem to write a transaction to disk, the data blocks of that transaction are written to the filesystem first. The metadata is committed to the journal only after the data blocks have been written out.

Generally, the integrity provided by ordered mode is sufficient. That being said, if you are sure that most of your continuous writes to a certain filesystem overwrite existing files rather than append to them, this is not the right journaling mode for you. Here's why: data blocks never pass through the journal in ordered mode, so there is no persistent record of which blocks were written and which weren’t. It is left to your drive's write cache to write out the list of blocks supplied, whenever the cache gets around to it. In other words, nothing ensures that the data blocks actually reach the drive in an orderly manner.

What does this mean? Well, if your system craps out while it is overwriting a file, and your drive's cache has not finished updating all the blocks it was handed, you may end up with blocks that were not updated at all. In short, you may have some old data in your file after a system crash.

writeback mode

As in ordered mode, only metadata is journaled in this mode. The difference is that no ordering is enforced: data may be written to the filesystem before or after its metadata has been committed to the journal.

To quote the man page, it is rumoured to be the highest-throughput option for an ext3 filesystem.

Because data writes are not ordered with respect to the journal, the problem that we saw for "ordered" mode, old data appearing in files after a crash and journal recovery, is much more likely in writeback mode.
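
To see which of the three modes a mounted filesystem is using, you can look at its option string in /proc/mounts. A minimal sketch, assuming the mode appears there as a data= option and that a missing option means the ordered default; data_mode_of is a hypothetical helper name.

```shell
# data_mode_of OPTIONS
# Print the journaling mode named in a comma-separated mount option string.
# Falls back to "ordered" (assumed default) when no data= option is present.
# Hypothetical helper, written for sh/bash word splitting.
data_mode_of() {
    mode=ordered
    old_ifs=$IFS
    IFS=,
    for opt in $1; do
        case $opt in
            data=*) mode=${opt#data=} ;;
        esac
    done
    IFS=$old_ifs
    printf '%s\n' "$mode"
}
```

It could be fed from the live mount table with something like `data_mode_of "$(awk '$2 == "/slowfs" { print $4 }' /proc/mounts)"`.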

Enough Theory, where's my fix?

Option 1)

Assuming that you're ready to live with the problems of writeback mode, mount your slow filesystem in writeback mode.

A word of caution: to change the journalling mode, you have to unmount the filesystem and mount it again with the mode you want. You cannot change modes while the filesystem is mounted, not even with a "mount -o remount" carrying the new options.

# umount /slowfs
# mount yourdevice /slowfs -o data=writeback

To make this permanent, modify your /etc/fstab:

yourdevice /slowfs ext3 data=writeback 1 2

See man mount for answers to your possible questions.

Option 2)

By default, the filesystem has to update the inode access time (atime) of a file every time that file is accessed, so even a read causes a write. Disabling this also speeds up access to your /slowfs, provided your applications do not need atime to be updated on your files.

This option can be added from the command line as:

# umount /slowfs
# mount yourdevice /slowfs -o noatime,data=writeback

Entry for /etc/fstab would look like:

yourdevice /slowfs ext3 data=writeback,noatime 1 2
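
If you would rather not edit the file by hand, the change can be sketched as a small filter. This is only an illustration: add_fstab_opts is a hypothetical helper, it reads fstab lines on standard input and prints the result, it does not touch /etc/fstab itself, and it does not check whether an option is already present.

```shell
# add_fstab_opts MOUNTPOINT OPTS
# Filter: append OPTS to the options field of the ext3 entry for MOUNTPOINT.
# A lone "defaults" is replaced rather than extended. Other lines pass
# through unchanged. Hypothetical helper; the edited line is re-joined
# with single spaces.
add_fstab_opts() {
    awk -v mp="$1" -v add="$2" '
        $1 !~ /^#/ && $2 == mp && $3 == "ext3" {
            $4 = ($4 == "defaults") ? add : $4 "," add
        }
        { print }
    '
}
```

For example, `add_fstab_opts /slowfs data=writeback,noatime < /etc/fstab` prints the modified table, so you can compare it with the original before copying anything back.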

Disclaimer: Sure, this worked for me, but your mileage may vary. Hence, use at your own risk. Don't flame me, if your txt files look like hexdump after this change :)

1 comment:

Rajeesh || നമ്പ്യാര്‍ said...

I took quite some time to find this blog out, hence a comment much later its posting ;-)
Read about noatime and writeback here too: http://kerneltrap.org/node/14148

Someone has submitted a patch replacing noatime with relatime.


This work is licensed under a Creative Commons License.