
Large data storage in FreeBSD


Purpose and background

The UFS filesystem

When the UFS filesystem was introduced to BSD in 1982, its use of 32-bit offsets and counters to address the storage was considered to be ahead of its time. Since most fixed-disk storage devices use 512-byte sectors, 32 bits allowed for 2 Terabytes of storage. That was an almost unimaginable quantity for the time. But now that 250 and 400 Gigabyte disks are available at consumer prices, it's trivial to build a hardware- or software-based storage array that can exceed 2TB for a few thousand dollars.
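The 2 Terabyte figure follows directly from the arithmetic: 2^32 sectors of 512 bytes each. The C sketch below is illustrative only; the function names are invented here and are not taken from the UFS sources. It computes the ceiling for a given counter width and shows what happens when a larger sector number is forced into a 32-bit field.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per sector, as stated in the text for most fixed disks. */
#define SECTOR_SIZE 512u

/*
 * Largest capacity, in bytes, reachable with an n-bit sector counter.
 * The arithmetic is done in 64 bits so the result itself cannot wrap
 * for the counter widths accepted here.
 */
uint64_t
max_capacity_bytes(unsigned counter_bits)
{
	assert(counter_bits > 0 && counter_bits <= 32);
	return ((uint64_t)1 << counter_bits) * SECTOR_SIZE;
}

/*
 * What a 32-bit field actually stores when handed a 64-bit sector
 * number: the upper 32 bits are silently dropped.
 */
uint32_t
truncate_sector(uint64_t sector)
{
	return (uint32_t)sector;
}
```

With these definitions, max_capacity_bytes(32) evaluates to 2199023255552 (2^41 bytes, or 2TB), and truncate_sector() maps sector 2^32 + 5 to sector 5. That silent wraparound is the kind of failure a 64-bit audit is meant to catch.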

The UFS2 filesystem was introduced in 2003 as a replacement for the original UFS and provides 64-bit counters and offsets. This allows files and filesystems to grow to 2^73 bytes (2^64 * 512) in size, which will hopefully be sufficient for quite a long time. UFS2 largely solved the storage size limits imposed by the filesystem. Unfortunately, many tools and storage mechanisms still use or assume 32-bit values, often keeping FreeBSD limited to 2TB.

We need to ensure that FreeBSD supports large storage sizes and that the benefits of UFS2 can actually be realized so that FreeBSD can remain relevant in the enterprise world. This page describes known issues and limits and provides a focus for further auditing, validation, and fixing.

Limits on disk partitioning

The first limit that is encountered is in disk partitioning. For x86 and amd64 PCs, the FDISK MBR table is used by the BIOS to partition the disk into logical extents and identify which partition ('slice' in FreeBSD terms) to boot from. The MBR is defined to use 32-bit disk offsets, and since it's an industry standard and interoperability is required, there is nothing that can be done to change this. As long as booting a PC requires the MBR, the boot slice in FreeBSD is going to be limited to 2TB.
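To make the 32-bit constraint concrete, the sketch below lays out one 16-byte MBR partition table entry and refuses any start or length that does not fit its 4-byte fields. The struct and function names are illustrative and are not FreeBSD's own definitions; only the on-disk field widths and little-endian byte order come from the MBR format itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* One of the four 16-byte entries in the MBR partition table. */
struct mbr_entry {
	uint8_t boot_flag;	/* 0x80 marks the active (boot) entry */
	uint8_t chs_start[3];	/* legacy cylinder/head/sector start */
	uint8_t type;		/* partition type, e.g. 0xa5 for FreeBSD */
	uint8_t chs_end[3];	/* legacy cylinder/head/sector end */
	uint8_t lba_start[4];	/* first sector, little-endian */
	uint8_t lba_count[4];	/* length in sectors, little-endian */
};

/* Store a 32-bit value in little-endian byte order. */
void
le32_put(uint8_t out[4], uint32_t v)
{
	out[0] = v & 0xff;
	out[1] = (v >> 8) & 0xff;
	out[2] = (v >> 16) & 0xff;
	out[3] = (v >> 24) & 0xff;
}

/*
 * Fill in the LBA fields of an entry. Returns false, leaving the entry
 * untouched, when either value needs more than 32 bits.
 */
bool
mbr_entry_set(struct mbr_entry *e, uint64_t start, uint64_t count)
{
	if (start > UINT32_MAX || count > UINT32_MAX)
		return false;
	le32_put(e->lba_start, (uint32_t)start);
	le32_put(e->lba_count, (uint32_t)count);
	return true;
}
```

With 512-byte sectors, a 3TB slice needs 6442450944 sectors, which exceeds UINT32_MAX, so mbr_entry_set() rejects it. No encoding trick can get around this, because the field is simply four bytes wide.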

The GPT partitioning scheme was introduced with the ia64 architecture as an MBR replacement. It provides 64-bit offsets and allows for an arbitrary number of partitions. It also provides a compatibility mode with MBR where it can generate an MBR-compatible structure on the disk for use with systems that don't understand GPT. However, to get the full benefits for boot storage, the BIOS and the FreeBSD loader must understand it. For secondary storage, GPT can be used by any architecture regardless of BIOS or boot support.

Many systems don't require an MBR or GPT, and even PCs don't require one if booting and interoperating with other operating systems is not required. The next limit, though, is the BSD disklabel. This label defines up to 8 partitions on a disk, MBR slice, or other storage extent for filesystems and swap space. Unfortunately, the on-disk format of the disklabel again uses 32-bit quantities, so it is also limited to 2TB. Fixing this would require creating a new format that is incompatible with the old one and would require an update to the FreeBSD boot loader. This would complicate interoperability and the upgrade path. Also, if a new format is going to be created, it should address the 8-partition limit that exists now. Given these requirements, it's tempting to just adopt the GPT format instead for secondary storage partitioning.

Testing large capacities

Even though large drives are cheap, it still isn't always feasible or economical to test on real hardware. Swap-backed memory disks, via the md(4) driver, can provide a good substitute for some of the testing. Backing with swap means that only the pages that are dirtied by data are actually allocated, so multi-terabyte storage can be simulated with a minimal amount of physical RAM and swap. Note that this is less true with UFS1, since it initializes all of the inode blocks during newfs, which dirties quite a bit of data. But for UFS2, swap-backed md has the potential to work well. Unfortunately, the kernel md driver has a number of 32-bit size limits of its own that need to be fixed. Details are provided below.

It is still possible to avoid disklabels and MBRs for testing by using newfs directly on the raw disk or md disk. Sysinstall can be tested from a running system by selecting Expert mode and performing only the MBR and disklabel steps. Beware that sysinstall might have other bugs that will wipe out your existing system, so care must be taken here!

Userland Tool Status

The following userland tools need auditing and testing for 64-bit cleanliness:

| Task | Responsible | Last updated | Status | Details |
|---|---|---|---|---|
| newfs | Pawel Jakub Dawidek | 19 Sept 2004 | Done | Handling of the '-s' option was fixed. Newfs should now be fully usable for large file systems. |
| df | | | Not done | An audit is needed to make sure that all reported fields are 64-bit clean. There are reports of certain fields being incorrect or negative with NFS volumes, which could be either an NFS or a df problem. |
| du | Pawel Jakub Dawidek | 7 Jan 2005 | Done | Handling of big files/directories was broken. It was fixed, and du should now be fully usable on large file systems with large files/directories. |
| growfs | | 12 Sept 2004 | In progress | Growfs has problems with expanding to new cylinder groups. It also initializes UFS2 inode blocks instead of leaving them for lazy initialization. It also needs a 64-bit audit. |
| sysinstall | | | Not done | A full audit is needed. Reports exist of problems with >1TB partitions. |
| fsck_ffs | Pierre Beyssac | 15 Jan 2005 | In progress | A full audit is needed. At least some printf format changes are necessary. |
| dump/restore | | | Not done | A full audit is needed. At least some printf format changes are necessary in dump(8). |
| fsdb | | | Not done | A full audit is needed. At least some printf format changes are necessary. |
| quota tools | Dag-Erling C. Smørgrav & Kirk McKusick | | Done | Extensive changes are needed. Disk quotas are currently handled as 32-bit quantities, which limits the maximum possible quota to 2TB. Two tasks are needed: 1) have the current tools (kernel+userland, edquota for example) fail gracefully when presented with 64-bit quantities and 2) extend the quota file format and tools to 64-bit while providing a compatibility mode and/or migration tools. |
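Several entries above mention printf format changes. The sketch below shows, under assumptions, what such a change typically looks like: a 64-bit block count is cast to intmax_t and printed with %jd, rather than being narrowed to a 32-bit int first. Both function names are hypothetical and do not come from fsck_ffs, dump, or fsdb.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Format a 64-bit block count into buf. Casting to intmax_t and using
 * %jd stays correct whatever the width of int or long on the platform.
 * Returns the number of characters written, as snprintf() does.
 */
int
format_blocks(char *buf, size_t len, int64_t nblocks)
{
	return snprintf(buf, len, "%jd", (intmax_t)nblocks);
}

/*
 * The failure mode being audited for: the count is narrowed to 32 bits
 * before printing, so values of 2^32 and above wrap around.
 */
int
format_blocks_narrow(char *buf, size_t len, int64_t nblocks)
{
	uint32_t low = (uint32_t)nblocks;	/* upper 32 bits are lost here */

	return snprintf(buf, len, "%u", (unsigned)low);
}
```

For a count of 5000000000 blocks, format_blocks() produces "5000000000" while format_blocks_narrow() produces "705032704". If the narrowed value were printed as a signed int instead, counts between 2^31 and 2^32 would typically appear negative, which resembles the df symptom described in the table.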

Kernel Driver Status

Many storage peripherals simply are not designed to handle >2TB capacities. For those that are, an audit should be done to verify that their drivers handle the sizes correctly and pass those sizes correctly to the rest of the kernel.

| Task | Responsible | Last updated | Status | Details |
|---|---|---|---|---|
| md | Pawel Jakub Dawidek | 17 Sept 2004 | Done | Swap-backed disks can now be created up to 16TB in size on i386. This corresponds to 2^32*4096. |

Subsystem Status

Some filesystem-related subsystems require testing with >2TB volumes, or need to be adapted. The following areas have been identified:

| Task | Responsible | Last updated | Status | Details |
|---|---|---|---|---|
| snapshots | Pierre Beyssac | 15 Jan 2004 | In progress | Taking snapshots fails on filesystems >2TB, returning EFBIG (on a 5TB filesystem) and subsequently crashing the system in softupdates. |
| quotas | Dag-Erling C. Smørgrav & Kirk McKusick | | Done | The quota subsystem handles 32-bit quantities, which limits quotas to 2TB. The syncer has been observed to block while attempting to set quotas over that limit (try 4000000000 KBytes as a hard limit in edquota(8) for some uid, then create some files owned by that uid). See also the userland entry for quota tools. |
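The "fail gracefully" task for quotas can be sketched as a range check performed before a limit is stored. The code below is a minimal illustration, not the actual quota implementation: the function name is invented, and the 512-byte accounting unit is an assumption inferred from the 2TB limit quoted above (2^32 * 512).

```c
#include <errno.h>
#include <stdint.h>

#define QUOTA_BLOCK_SIZE 512u	/* assumed quota accounting unit */

/*
 * Convert a limit given in kilobytes to a 32-bit count of quota blocks.
 * Returns 0 on success, or ERANGE when the limit cannot be represented,
 * instead of silently storing a wrapped value. *blocks is only written
 * on success.
 */
int
kb_to_quota_blocks32(uint64_t kbytes, uint32_t *blocks)
{
	const uint64_t per_kb = 1024 / QUOTA_BLOCK_SIZE;
	uint64_t n;

	if (kbytes > UINT64_MAX / per_kb)
		return ERANGE;	/* the multiplication itself would wrap */
	n = kbytes * per_kb;
	if (n > UINT32_MAX)
		return ERANGE;	/* does not fit a 32-bit quota field */
	*blocks = (uint32_t)n;
	return 0;
}
```

Under that assumption, the 4000000000 KByte limit from the table corresponds to 8000000000 blocks, which is above the 32-bit maximum of 4294967295, so the function returns ERANGE rather than storing a truncated limit.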