|Introduced||November 2005 with OpenSolaris|
|Directory contents||Extensible hash table|
|Max. volume size||256 trillion yobibytes (2^128 bytes)|
|Max. file size||16 exbibytes (2^64 bytes)|
|Max. number of files|
|Max. filename length||255 ASCII characters (fewer for multibyte character encodings such as Unicode)|
|Forks||Yes (called "extended attributes", but they are full-fledged streams)|
|File system permissions||POSIX, NFSv4 ACLs|
|Supported operating systems||Solaris, OpenSolaris, illumos distributions, OpenIndiana, FreeBSD, Mac OS X Server 10.5 (only read-only support), NetBSD, Linux via third-party kernel module or ZFS-FUSE, OSv|
ZFS is a combined file system and logical volume manager designed by Sun Microsystems. The features of ZFS include protection against data corruption, support for high storage capacities, efficient data compression, integration of the concepts of filesystem and volume management, snapshots and copy-on-write clones, continuous integrity checking and automatic repair, RAID-Z, and native NFSv4 ACLs.
The ZFS name is registered as a trademark of Oracle Corporation; although it was briefly given the retrofitted expanded name "Zettabyte File System", it is no longer considered an initialism. Originally, ZFS was proprietary, closed-source software developed internally by Sun as part of Solaris, with a team led by the CTO of Sun's storage business unit and Sun Fellow Jeff Bonwick. In 2005, the bulk of Solaris, including ZFS, was licensed as open-source software under the Common Development and Distribution License (CDDL), as the OpenSolaris project. ZFS became a standard feature of Solaris 10 in June 2006.
In 2010, Oracle stopped releasing source code for new OpenSolaris and ZFS development, effectively forking its closed-source development from the open-source branch. In response, OpenZFS was created as a new open-source umbrella project, aiming to bring together individuals and companies that use the ZFS filesystem in an open-source manner.
Overview and ZFS design goals
ZFS compared to most other file systems
Historically, the management of stored data has involved two aspects: the physical management of block devices such as hard drives and SD cards, together with devices such as RAID controllers that present a single logical device based upon multiple physical devices (often undertaken by a volume manager, array manager, or suitable device driver); and the management of files stored as logical units on these logical block devices (a file system).
- Example: A RAID array of 2 hard drives and an SSD caching disk is controlled by Intel's RST system, part of the chipset and firmware built into a desktop computer. The user sees this as a single volume, containing an NTFS-formatted drive of their data, and NTFS is not necessarily aware of the manipulations that may be required (such as rebuilding the RAID array if a disk fails). The management of the individual devices and their presentation as a single device, is distinct from the management of the files held on that apparent device.
ZFS is unusual because, unlike most other storage systems, it unifies both of these roles and acts as both the volume manager and the file system. Therefore, it has complete knowledge of both the physical disks and volumes (including their condition, status, and logical arrangement into volumes) and of all the files stored on them. ZFS is designed to ensure (subject to suitable hardware) that data stored on disks cannot be lost due to physical error or misprocessing by the hardware or operating system, or due to bit rot events and data corruption which may happen over time. Its complete control of the storage system is used to ensure that every step, whether related to file management or disk management, is verified, confirmed, corrected if needed, and optimized, in a way that storage controller cards and separate volume and file managers cannot achieve.
ZFS also includes a mechanism for snapshots and replication, including snapshot cloning; the former is described by the FreeBSD documentation as one of its "most powerful features", having features that "even other file systems with snapshot functionality lack". Very large numbers of snapshots can be taken, without degrading performance, allowing snapshots to be used prior to risky system operations and software changes, or an entire production ("live") file system to be fully snapshotted several times an hour, in order to mitigate data loss due to user error or malicious activity. Snapshots can be rolled back "live" or the file system at previous points in time viewed, even on very large file systems, leading to "tremendous" savings in comparison to formal backup and restore processes, or cloned "on the spot" to form new independent file systems.
Summary of key differentiating features
Examples of features specific to ZFS which facilitate its objective include:
- Designed for long term storage of data, and indefinitely scaled datastore sizes with zero data loss, and high configurability.
- Hierarchical checksumming of all data and metadata, ensuring that the entire storage system can be verified on use, and confirmed to be correctly stored, or remedied if corrupt. Checksums are stored with a block's parent block, rather than with the block itself. This contrasts with many file systems where checksums (if held) are stored with the data so that if the data is lost or corrupt, the checksum is also likely to be lost or incorrect.
- Can store a user-specified number of copies of data or metadata, or selected types of data, to improve the ability to recover from data corruption of important files and structures.
- Automatic rollback of recent changes to the file system and data, in some circumstances, in the event of an error or inconsistency.
- Automated and (usually) silent self-healing of data inconsistencies and write failure when detected, for all errors where the data is capable of reconstruction. Data can be reconstructed using all of the following: error detection and correction checksums stored in each block's parent block; multiple copies of data (including checksums) held on the disk; write intentions logged on the SLOG (ZIL) for writes that should have occurred but did not occur (after a power failure); parity data from RAID/RAIDZ disks and volumes; copies of data from mirrored disks and volumes.
- Native handling of standard RAID levels and additional ZFS RAID layouts ("RAIDZ"). The RAIDZ levels stripe data across only the disks required, for efficiency (many RAID systems stripe indiscriminately across all devices), and checksumming allows rebuilding of inconsistent or corrupted data to be minimised to those blocks with defects;
- Native handling of tiered storage and caching devices, which is usually a volume related task. Because it also understands the file system, it can use file-related knowledge to inform, integrate and optimize its tiered storage handling which a separate device cannot;
- Native handling of snapshots and backup/replication which can be made efficient by integrating the volume and file handling. ZFS can routinely take snapshots several times an hour of the data system, efficiently and quickly. (Relevant tools are provided at a low level and require external scripts and software for utilization).
- Native data compression and deduplication, although the latter is largely handled in RAM and is memory hungry.
- Efficient rebuilding of RAID arrays — a RAID controller often has to rebuild an entire disk, but ZFS can combine disk and file knowledge to limit any rebuilding to data which is actually missing or corrupt, greatly speeding up rebuilding;
- Ability to identify data that would have been found in a cache but has been discarded recently instead; this allows ZFS to reassess its caching decisions in light of later use and facilitates very high cache hit levels;
- Alternative caching strategies can be used for data that would otherwise cause delays in data handling. For example, synchronous writes which are capable of slowing down the storage system can be converted to asynchronous writes by being written to a fast separate caching device, known as the SLOG (sometimes called the ZIL – ZFS Intent Log).
- Highly tunable – many internal parameters can be configured for optimal functionality.
- Can be used for high availability clusters and computing, although not fully designed for this use.
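The idea of remembering recently discarded cache entries, mentioned in the list above, can be illustrated with a small sketch. This is not the ZFS ARC algorithm; it is a plain LRU cache with a hypothetical "ghost list" of evicted keys, and the class and parameter names are assumptions made for illustration. A request that finds its key in the ghost list tells the cache that the entry was dropped too early.

```python
from collections import OrderedDict


class GhostListCache:
    """LRU cache that remembers the keys it evicted most recently.

    Hypothetical illustration only; this is not the ZFS ARC algorithm.
    """

    def __init__(self, capacity, ghost_capacity):
        self.capacity = capacity
        self.ghost_capacity = ghost_capacity
        self.data = OrderedDict()    # key -> value, least recently used first
        self.ghosts = OrderedDict()  # evicted keys only; no data is kept
        self.hits = 0
        self.ghost_hits = 0
        self.misses = 0

    def get(self, key, load):
        """Return the cached value, calling load(key) when it is not cached."""
        if key in self.data:
            self.hits += 1
            self.data.move_to_end(key)
            return self.data[key]
        if key in self.ghosts:
            # The entry was cached until recently: evidence that the cache
            # was too small or evicted the wrong entry.
            self.ghost_hits += 1
            del self.ghosts[key]
        else:
            self.misses += 1
        value = load(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            evicted, _ = self.data.popitem(last=False)
            self.ghosts[evicted] = None
            if len(self.ghosts) > self.ghost_capacity:
                self.ghosts.popitem(last=False)
        return value


cache = GhostListCache(capacity=2, ghost_capacity=2)
for block in ["a", "b", "c", "a"]:
    cache.get(block, load=lambda k: k.upper())
print(cache.hits, cache.ghost_hits, cache.misses)  # 0 1 3
```

A real adaptive cache would use the ghost-hit count to change its eviction policy or size; the sketch only records it.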
Inappropriately specified systems
Unlike many file systems, ZFS is intended to work in a specific way and towards specific ends. It is designed with the assumption of a specific kind of hardware environment. If the system is not suitable for ZFS, then ZFS may underperform significantly. Calomel stated in their 2017 ZFS benchmarks that:
- "On mailing lists and forums there are posts which state ZFS is slow and unresponsive. We have shown in the previous section you can get incredible speeds out of the file system if you understand the limitations of your hardware and how to properly setup your raid. We suspect that many of the objectors of ZFS have setup their ZFS system using slow or otherwise substandard I/O subsystems."
Common system design failures:
- Inadequate RAM — ZFS may use a large amount of memory in many scenarios;
- Inadequate disk free space – ZFS uses copy-on-write for data storage; its performance may suffer if the disk pool gets too close to full. Around 70% is a recommended limit for good performance. Above a certain percentage, typically set to around 80%, ZFS switches to a space-conserving rather than speed-oriented approach, and performance plummets as it focuses on preserving working space on the volume;
- No efficient dedicated SLOG device, when synchronous writing is prominent — this is notably the case for NFS and ESXi; even SSD based systems may need a separate SLOG device for expected performance. The SLOG device is only used for writing apart from when recovering from a system error. It can often be small (for example, in FreeNAS, the SLOG device only needs to store the largest amount of data likely to be written in about 10 seconds (or the size of two 'transaction groups'), although it can be made larger to allow longer lifetime of the device). SLOG is therefore unusual in that its main criteria are pure write functionality, low latency, and loss protection – usually little else matters.
- Lack of suitable caches, or misdesigned caches — for example, ZFS can cache read data in RAM ("ARC") or a separate device ("L2ARC"); in some cases adding extra ARC is needed, in other cases adding extra L2ARC is needed, and in some situations adding extra L2ARC can even degrade performance, by forcing RAM to be used for lookup data for the slower L2ARC, at the cost of less room for data in the ARC.
- Use of hardware RAID cards, perhaps in the mistaken belief that these will 'help' ZFS. While routine for other filing systems, ZFS handles RAID natively, and is designed to work with a raw and unmodified low level view of storage devices, so it can fully use its functionality. A separate RAID card may leave ZFS less efficient and reliable. For example, ZFS checksums all data, but most RAID cards will not do this as effectively, or for cached data. Separate cards can also mislead ZFS about the state of data, for example after a crash, or by mis-signalling exactly when data has safely been written, and in some cases this can lead to issues and data loss. Separate cards can also slow down the system, sometimes greatly, by adding latency to every data read/write operation, or by undertaking full rebuilds of damaged arrays where ZFS would have only needed to do minor repairs of a few seconds.
- Use of poor quality components – Calomel identify poor quality RAID and network cards as common culprits for low performance.
- Poor configuration/tuning – ZFS options allow for a wide range of tuning, and mis-tuning can affect performance. For example, suitable memory caching parameters for file shares on NFS are likely to be different from those required for block access shares using iSCSI and Fibre Channel. A memory cache that would be appropriate for the former can cause timeout errors and start-stop issues on the latter as data caches are flushed: because the time permitted for a response is likely to be much shorter on these kinds of connections, the client may believe the connection has failed if there is a delay due to "writing out" a large cache. Similarly, an inappropriately large in-memory write cache can cause "freezing" (without timeouts) on file share protocols, even when the connection does not time out.
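Two of the rules of thumb in the list above lend themselves to simple arithmetic: the pool fill thresholds (about 70% and about 80%) and the guideline that a SLOG device need only hold roughly 10 seconds of writes. The following is a minimal sketch under those stated figures; the function names and the 10 Gbit/s ingest rate are assumptions chosen for illustration, not values taken from ZFS itself.

```python
def fill_level_advice(used_bytes, total_bytes):
    """Classify pool fullness using the approximate thresholds quoted above."""
    fraction = used_bytes / total_bytes
    if fraction > 0.80:
        return "space-conserving allocation likely"
    if fraction > 0.70:
        return "above recommended limit"
    return "ok"


def slog_size_bytes(max_write_bytes_per_s, window_s=10.0):
    """Estimate SLOG capacity as the data written during window_s seconds."""
    return int(max_write_bytes_per_s * window_s)


# Hypothetical system: writes arrive over a 10 Gbit/s link (1.25e9 bytes/s).
print(fill_level_advice(used_bytes=75, total_bytes=100))
print(slog_size_bytes(1.25e9) / 1e9, "GB")
```

Under these assumptions the estimate comes to 12.5 GB, which is consistent with the observation that a SLOG device can often be small.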
ZFS terminology and storage structure
Because ZFS acts as both volume manager and file system, the terminology and layout of ZFS storage covers two aspects:
- How physical devices such as hard drives are organized into vdevs (virtual devices - ZFS's fundamental "blocks" of redundant storage) which are used to create redundant storage for a ZFS pool or zpool (the top level of data container in a ZFS system); and
- How datasets (file systems) and volumes (also known as zvols, a block device) - the two kinds of data structures which ZFS is capable of presenting to a user - are held within a pool, and the features and capabilities they present to the user.
ZFS commands allow examination of the physical storage in terms of devices, vdevs they are organized into, data pools stored across those vdevs, and in various other ways.
Physical storage structure: devices and vdevs
The physical devices used by ZFS (such as hard drives (HDDs) and SSDs) are organized into vdevs ("virtual devices") before being used to store data. The vdev is a fundamental part of ZFS, and the main method by which ZFS ensures redundancy against physical device failure. ZFS stores the data in a pool striped across all the vdevs allocated to that pool, for efficiency, and each vdev must have sufficient disks to maintain the integrity of the data stored on that vdev. If a vdev were to become unreadable (due to disk errors or otherwise) then the entire pool will also fail.
Therefore, it is easiest to describe ZFS physical storage by first looking at vdevs.
Each vdev can be one of:
- a single device, or
- multiple devices in a mirrored configuration, or
- multiple devices in a ZFS RAID ("RaidZ") configuration.
ZFS exposes the individual disks within the system, but manages its in-use data storage capacity at the level of vdevs, with each vdev acting as an independent unit of redundant storage. Devices might not be in a vdev if they are unused spare disks, offline disks, or cache devices.
Each vdev that the user defines is completely independent from every other vdev, so different types of vdev can be mixed arbitrarily in a single ZFS system. If data redundancy is required (so that data is protected against physical device failure), then this is ensured by the user when they organize devices into vdevs, either by using a mirrored vdev or a RaidZ vdev. Data on a single-device vdev may be lost if the device develops a fault. Data on a mirrored or RaidZ vdev will only be lost if enough disks fail at the same time (or before the system has resilvered any replacements due to recent disk failures). A ZFS vdev will continue to function in service if it is capable of providing at least one copy of the data stored on it, although it may become slower due to error fixing and resilvering, as part of its self-repair and data integrity processes. However, ZFS is designed not to become unreasonably slow due to self-repair (unless directed to do so by an administrator), since one of its goals is to be capable of uninterrupted continual use even during self-checking and self-repair.
Since ZFS device redundancy is at vdev level, this also means that if a pool is stored across several vdevs, and one of these vdevs completely fails, then the entire pool content will be lost. This is similar to other RAID and redundancy systems, which require the data to be stored or capable of reconstruction from enough other devices to ensure data is unlikely to be lost due to physical devices failing. Therefore, it is intended that vdevs should be made of either mirrored devices or a RaidZ array of devices, with sufficient redundancy, for important data, so that ZFS can automatically limit and where possible avoid data loss if a device fails. Backups and replication are also an expected part of data protection.
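The relationship between device failures, vdevs and the pool can be sketched as a small model. This is a simplification for illustration only: the `Vdev` class, its field names and the three vdev kinds as strings are assumptions, and the model ignores resilvering and partial failures. It captures two rules from the text: redundancy is decided per vdev, and a pool is lost if any one of its vdevs is lost.

```python
from dataclasses import dataclass


@dataclass
class Vdev:
    kind: str        # "single", "mirror", or "raidz" (labels are assumptions)
    devices: int     # number of physical devices in the vdev
    parity: int = 0  # parity devices; only meaningful for "raidz"
    failed: int = 0  # devices that have currently failed

    def tolerated_failures(self):
        """How many simultaneous device failures this vdev can survive."""
        if self.kind == "mirror":
            return self.devices - 1  # one surviving copy is enough
        if self.kind == "raidz":
            return self.parity
        return 0                     # a single device has no redundancy

    def is_readable(self):
        return self.failed <= self.tolerated_failures()


def pool_is_readable(vdevs):
    """A pool stripes data across all vdevs, so every vdev must survive."""
    return all(v.is_readable() for v in vdevs)


pool = [Vdev("mirror", 2), Vdev("raidz", 5, parity=2), Vdev("single", 1)]
pool[0].failed = 1             # mirror still has one good copy
pool[1].failed = 2             # double parity covers two failures
print(pool_is_readable(pool))  # True
pool[2].failed = 1             # the non-redundant vdev fails
print(pool_is_readable(pool))  # False
```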
Vdevs can be manipulated while in active use. A single disk can have additional devices added to create a mirrored vdev, and a mirrored vdev can have physical devices added or removed to leave a larger or smaller number of mirrored devices, or a single device. A RaidZ vdev cannot be converted to or from a mirror, although additional vdevs can always be added to expand storage capacity (which can be any kind including RaidZ). A device in any vdev can be marked for removal, and ZFS will de-allocate data from it to allow it to be removed or replaced.
Of note, the devices in a vdev do not have to be the same size, but ZFS may not use the full capacity of all disks in a vdev if some are larger than others. This only applies to devices within a single vdev. As vdevs are independent, ZFS does not care if different vdevs have different sizes or are built from different devices.
Also, as a vdev cannot be shrunk in size, it is common to set aside a small amount of unused space (for example 1–2 GB on a multi-TB disk), so that if a disk needs replacing, it is possible to allow for slight manufacturing variances and replace it with another disk of the same nominal capacity but slightly smaller actual capacity.
In addition to devices used for main data storage, ZFS also allows and manages devices used for caching purposes. These can be single devices or multiple mirrored devices, and are fully dedicated to the type of cache designated. Cache usage and its detailed settings can be fully deleted, created and modified without limit during live use. A list of ZFS cache types is given later in this article.
ZFS can handle devices formatted into partitions for certain purposes, but this is not common use. Generally caches and data pools are each given a complete device (or multiple complete devices).
Data structures: Pools, datasets and volumes
The top level of data management is a ZFS pool (or zpool). A ZFS system can have multiple pools defined. The vdevs to be used for a pool are specified when the pool is created, and ZFS will use all of the specified vdevs to maximize performance when storing data – a form of striping across the vdevs. Therefore, it is important to ensure that each vdev is sufficiently redundant, as loss of any vdev in a pool would cause loss of the pool, as with any other striping.
A ZFS pool can be expanded at any time by adding new vdevs, including when the system is 'live'. The storage space / vdevs already allocated to a pool cannot be shrunk, as data is stored across all vdevs in the pool (even if it is not yet full). However, as explained above, the individual vdevs can each be modified at any time (within stated limits), and new vdevs added at any time, since the addition or removal of mirrors, or marking of a redundant disk as offline, do not affect the ability of that vdev to store data.
Within pools, ZFS recognizes two types of data store:
- A pool can contain datasets, which are containers storing a native ZFS file system. Datasets can contain other datasets ("nested datasets"), which are transparent for file system purposes. A dataset within another dataset is treated much like a directory for the purposes of file system navigation, but it allows a branch of a file system to have different settings for compression, deduplication and other settings. This is because file system settings are per-dataset (and can be inherited by nested datasets).
- A pool can also contain volumes (also known as zvols), which can be used as block storage devices by other systems. An example of a volume would be an iSCSI or Fibre Channel target for another system, used to create NAS, a SAN, or any other ZFS-backed raw block storage capability. The volume will be seen by other systems as a bare storage device which they can use as they like. Capabilities such as snapshots, redundancy, "scrubbing" (data integrity and repair checks), deduplication, compression, cache usage, and replication are operational but not exposed to the remote system, which "sees" only a bare file storage device. Because ZFS does not create a file storage system on the block device or control how the storage space is used, it cannot create nested ZFS datasets or volumes within a volume.
Since volumes are presented as block devices, they can also be formatted with any other file system, to add ZFS features to that file system, although this is not usual practice. For example, a ZFS volume can be created, and then the block device it presents can be partitioned and formatted with a file system such as ext4 or NTFS. This can be done either locally or over a network (using iSCSI or similar). The resulting file system will be accessible as normal, but will also gain ZFS benefits such as data resilience, data integrity/scrubbing, snapshots, and the additional option of data compression.
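The per-dataset settings described above, which nested datasets inherit unless they override them, can be modelled with a short sketch. The `Dataset` class is hypothetical and is not a ZFS API; the pool name `tank` and the dataset paths are assumptions used for illustration.

```python
class Dataset:
    """Toy model of a dataset whose settings fall back to its parent's."""

    def __init__(self, name, parent=None, **local_properties):
        self.name = name
        self.parent = parent
        self.local = dict(local_properties)  # settings set on this dataset

    def get(self, prop, default=None):
        """Return the nearest value of prop, searching up through parents."""
        node = self
        while node is not None:
            if prop in node.local:
                return node.local[prop]
            node = node.parent
        return default


tank = Dataset("tank", compression="lz4")
home = Dataset("tank/home", parent=tank)
media = Dataset("tank/home/media", parent=home, compression="off")

print(home.get("compression"))   # lz4 (inherited from tank)
print(media.get("compression"))  # off (overridden locally)
```

Here the nested dataset holding already-compressed media turns compression off for its own branch without affecting its parent.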
- Scrub / scrubbing – ZFS can periodically or on demand check all data and all copies of that data, held in the entire of any pool, dataset or volume (including nested datasets and volumes they contain), to confirm that all copies match the expected integrity checksums, and correct them if not. This is an intensive process and can run in the background, adjusting its activity to match how busy the system is.
- (Re-)silver / (re-)silvering – ZFS automatically remedies any defects found, and regenerates its data onto any new or replacement disks added to a vdev, or to multiple vdevs. (Re-)silvering is the ZFS equivalent of rebuilding a RAID array, but as ZFS has complete knowledge of how storage is being used, and which data is reliable, it can often avoid the full rebuild that other RAID rebuilds require, and copy and verify only the minimum data needed to restore the array to full operation.
Resizing of vdevs, pools, datasets and volumes
Generally ZFS does not expect to reduce the size of a pool, and does not have tools to reduce the set of vdevs that a pool is stored on. Therefore, to remove an entire vdev that is in active use, or to reduce the size of a pool, the data stored on it must be moved to another pool or a temporary copy made (or if easier, it can be deleted and later restored from backups/copies) so that the devices making up the vdev can be freed for other use or the pool deleted and recreated using fewer vdevs or a smaller size.
Additional capacity can be added to a pool at any time, simply by adding more devices if needed, defining the unused devices into vdevs and adding the new vdevs to the pool.
The capacity of an individual vdev is generally fixed when it is defined. There is one exception to this rule: single drive and mirrored vdevs can be expanded to larger (but not smaller) capacities, without affecting the vdev's operation, by adding larger disks and replacing/removing smaller disks, as shown in the example below.
A pool can be expanded into unused space, and the datasets and volumes within a pool can be likewise expanded to use any unused pool space. Datasets do not need a fixed size and can dynamically grow as data is stored, but volumes, being block devices, need to have their size defined by the user, and must be manually resized as required (which can be done 'live').
- A vdev is made up initially from a single 4TB hard drive, and data stored on it. (Note: not recommended in practice due to risk of data loss).
- Two 6TB drives are added to the vdev while 'live'. The vdev is now configured as a 3-way mirror. Its size is still limited to 4TB (the extra 2TB on each of the new disks is unusable). ZFS will automatically copy data to the new disks (resilvering).
- The original disk is removed, again while 'live'. The vdev that remains contains two 6TB disks and is now a 2-way 6TB mirror, of which 4TB is being used. The pool using that vdev can now use the extra space and expand by 2TB. Any datasets or volumes in the pool can now use the extra space.
- If desired a further disk can be removed, leaving a single device vdev of 6TB (not recommended). Alternatively a set of disks can be added, either configured as a new vdev (to add to the pool or use for a second pool), or added as extra mirrors for the existing vdev.
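The steps above follow from a single rule: the usable capacity of a mirrored vdev is bounded by its smallest member device. A minimal sketch of that arithmetic, using a hypothetical helper function and the disk sizes from the example:

```python
def mirror_capacity_tb(device_sizes_tb):
    """Usable capacity of a mirror: every device holds a full copy."""
    return min(device_sizes_tb)


vdev = [4]                       # single 4 TB drive
vdev += [6, 6]                   # attach two 6 TB drives: 3-way mirror
print(mirror_capacity_tb(vdev))  # 4, the extra 2 TB per new disk is unusable
vdev.remove(4)                   # detach the original drive
print(mirror_capacity_tb(vdev))  # 6, the vdev can now grow by 2 TB
```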
ZFS will automatically allocate data storage across all vdevs in a pool (and all devices in each vdev) in a way that generally maximises the performance of the pool. ZFS will also update its write strategy to take account of new disks added to a pool, when they are added.
As a general rule, ZFS allocates writes across vdevs based on the free space in each vdev. This ensures that vdevs which have proportionately less data already are given more writes when new data is to be stored. This helps to ensure that, as the pool becomes more used, some vdevs do not become full and force writes to occur on a limited number of devices. It also means that when data is read (and reads are much more frequent than writes in most uses), different parts of the data can be read from as many disks as possible at the same time, giving much higher read performance. Therefore, as a general rule, pools and vdevs should be managed, and new storage added, so that some vdevs in a pool do not end up almost full while others are almost empty, as this will make the pool less efficient.
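The general rule can be illustrated with a deliberately simplified model in which each vdev receives a share of new writes proportional to its free space. This is an assumption made for illustration, not a description of the actual ZFS allocator, and the function and vdev names are hypothetical.

```python
def distribute_writes(free_bytes, write_bytes):
    """Split write_bytes across vdevs in proportion to their free space.

    free_bytes maps a vdev name to its free space. Simplified model only.
    """
    total_free = sum(free_bytes.values())
    if write_bytes > total_free:
        raise ValueError("not enough free space in the pool")
    return {
        name: write_bytes * free / total_free
        for name, free in free_bytes.items()
    }


# A nearly full vdev next to a newly added, mostly empty one.
shares = distribute_writes({"vdev0": 100, "vdev1": 300}, write_bytes=40)
print(shares)  # {'vdev0': 10.0, 'vdev1': 30.0}
```

In this model the emptier vdev absorbs three quarters of the new data, so the two vdevs drift towards similar fill levels over time.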
One major feature that distinguishes ZFS from other file systems is that it is designed with a focus on data integrity by protecting the user's data on disk against silent data corruption caused by data degradation, current spikes, bugs in disk firmware, phantom writes (the previous write did not make it to disk), misdirected reads/writes (the disk accesses the wrong block), DMA parity errors between the array and server memory or from the driver (since the checksum validates data inside the array), driver errors (data winds up in the wrong buffer inside the kernel), accidental overwrites (such as swapping to a live file system), etc.
Research published in 2012 showed that none of the then-major and widespread filesystems (such as UFS, Ext, XFS, JFS, or NTFS), nor hardware RAID (which has some issues with data integrity), provided sufficient protection against data corruption problems. Initial research indicates that ZFS protects data better than earlier efforts. It is also faster than UFS and can be seen as its replacement.
ZFS data integrity
For ZFS, data integrity is achieved by using a Fletcher-based checksum or a SHA-256 hash throughout the file system tree. Each block of data is checksummed and the checksum value is then saved in the pointer to that block—rather than at the actual block itself. Next, the block pointer is checksummed, with the value being saved at its pointer. This checksumming continues all the way up the file system's data hierarchy to the root node, which is also checksummed, thus creating a Merkle tree. In-flight data corruption or phantom reads/writes (the data written/read checksums correctly but is actually wrong) are undetectable by most filesystems as they store the checksum with the data. ZFS stores the checksum of each block in its parent block pointer so the entire pool self-validates.
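The parent-stored checksum scheme described above can be sketched as a small Merkle tree. This is an illustrative model, not ZFS's on-disk format: the fan-out of two, the use of SHA-256 throughout, and the representation of a pointer block as the concatenation of its children's checksums are all assumptions.

```python
import hashlib

DIGEST_SIZE = 32  # bytes in a SHA-256 digest


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(blocks, fanout=2):
    """Build pointer levels above the data blocks.

    levels[0] holds the data blocks; each block in a higher level is the
    concatenated checksums of its children. Returns (levels, root_checksum).
    """
    if not blocks:
        raise ValueError("at least one block is required")
    level = list(blocks)
    levels = [level]
    while len(level) > 1:
        level = [
            b"".join(checksum(child) for child in level[i:i + fanout])
            for i in range(0, len(level), fanout)
        ]
        levels.append(level)
    return levels, checksum(level[0])


def verify(levels, root_checksum, fanout=2):
    """Walk from the root down; return (level, index) of the first block
    that does not match the checksum held by its parent, or None."""
    if checksum(levels[-1][0]) != root_checksum:
        return (len(levels) - 1, 0)
    for depth in range(len(levels) - 1, 0, -1):
        for p_index, pointer_block in enumerate(levels[depth]):
            first = p_index * fanout
            children = levels[depth - 1][first:first + fanout]
            for c, child in enumerate(children):
                expected = pointer_block[c * DIGEST_SIZE:(c + 1) * DIGEST_SIZE]
                if checksum(child) != expected:
                    return (depth - 1, first + c)
    return None


levels, root = build_tree([b"alpha", b"beta", b"gamma", b"delta"])
print(verify(levels, root))  # None: every block matches its parent
levels[0][2] = b"gamma!"     # simulate silent corruption of one data block
print(verify(levels, root))  # (0, 2): located via the parent's checksum
```

Because the checksum lives in the parent, damage to a block cannot also damage the value it is checked against.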
When a block is accessed, regardless of whether it is data or meta-data, its checksum is calculated and compared with the stored checksum value of what it "should" be. If the checksums match, the data are passed up the programming stack to the process that asked for it; if the values do not match, then ZFS can heal the data if the storage pool provides data redundancy (such as with internal mirroring), assuming that the copy of data is undamaged and with matching checksums. It is optionally possible to provide additional in-pool redundancy by specifying copies=2 (or copies=3 or more), which means that data will be stored twice (or three times) on the disk, effectively halving (or, for copies=3, reducing to one third) the storage capacity of the disk. Additionally some kinds of data used by ZFS to manage the pool are stored multiple times by default for safety, even with the default copies=1 setting.
If other copies of the damaged data exist or can be reconstructed from checksums and parity data, ZFS will use a copy of the data (or recreate it via a RAID recovery mechanism), and recalculate the checksum—ideally resulting in the reproduction of the originally expected value. If the data passes this integrity check, the system can then update all faulty copies with known-good data and redundancy will be restored.
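The read-verify-repair sequence described above can be sketched for the mirrored case. This is a minimal model, assuming each mirror device's copy of a block is held as a bytes object in a list and that the expected checksum comes from the parent block; the function name is hypothetical.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def self_healing_read(copies, expected):
    """Return verified data and repair any bad copies in place.

    copies: mutable list with one copy of the block per mirror device.
    expected: checksum recorded for this block in its parent.
    Returns (data, indexes_of_repaired_copies).
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("no copy matches the expected checksum")
    repaired = []
    for i, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[i] = good  # overwrite the faulty copy with good data
            repaired.append(i)
    return good, repaired


original = b"payload"
expected = checksum(original)
copies = [b"pay1oad", original]   # the first device returned corrupt data
data, repaired = self_healing_read(copies, expected)
print(data, repaired)             # b'payload' [0]
```

If no copy passes verification, the sketch raises an error rather than returning data it cannot vouch for.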
ZFS and hardware RAID
If the disks are connected to a RAID controller, it is most efficient to configure it as an HBA in JBOD mode (i.e. turn off the RAID function). If a hardware RAID card is used, ZFS always detects all data corruption but cannot always repair data corruption because the hardware RAID card will interfere. Therefore, the recommendation is to not use a hardware RAID card, or to flash a hardware RAID card into JBOD/IT mode. For ZFS to be able to guarantee data integrity, it needs to either have access to a RAID set (so all data is copied to at least two disks), or, if a single disk is used, ZFS needs to enable redundancy (copies), which duplicates the data on the same logical drive. Using ZFS copies is a good feature to use on notebooks and desktop computers, since the disks are large and it at least provides some limited redundancy with just a single drive.
There are several reasons why it is better to rely solely on ZFS by using several independent disks and RAID-Z or mirroring.
When using hardware RAID, the controller usually adds controller-dependent data to the drives which prevents software RAID from accessing the user data. While it is possible to read the data with a compatible hardware RAID controller, this inconveniences consumers as a compatible controller usually isn't readily available. Using the JBOD/RAID-Z combination, any disk controller can be used to resume operation after a controller failure.
Note that hardware RAID configured as JBOD may still detach drives that do not respond in time (as has been seen with many energy-efficient consumer-grade hard drives), and as such, may require TLER/CCTL/ERC-enabled drives to prevent drive dropouts.
Software RAID using ZFS
ZFS offers software RAID through its RAID-Z and mirroring organization schemes.
RAID-Z is a data/parity distribution scheme like RAID-5, but uses dynamic stripe width: every block is its own RAID stripe, regardless of blocksize, resulting in every RAID-Z write being a full-stripe write. This, when combined with the copy-on-write transactional semantics of ZFS, eliminates the write hole error. RAID-Z is also faster than traditional RAID 5 because it does not need to perform the usual read-modify-write sequence.
As all stripes are of different sizes, RAID-Z reconstruction has to traverse the filesystem metadata to determine the actual RAID-Z geometry. This would be impossible if the filesystem and the RAID array were separate products, whereas it becomes feasible when there is an integrated view of the logical and physical structure of the data. Going through the metadata means that ZFS can validate every block against its 256-bit checksum as it goes, whereas traditional RAID products usually cannot do this.
In addition to handling whole-disk failures, RAID-Z can also detect and correct silent data corruption, offering "self-healing data": when reading a RAID-Z block, ZFS compares it against its checksum, and if the data disks did not return the right answer, ZFS reads the parity and then figures out which disk returned bad data. Then, it repairs the damaged data and returns good data to the requestor.
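The way a checksum lets single-parity redundancy identify which disk returned bad data can be sketched as follows. This is a simplified model, not the RAID-Z on-disk layout: it assumes equal-sized chunks, one XOR parity chunk, and a SHA-256 checksum over the whole block. Each data chunk in turn is rebuilt from the parity and the remaining chunks, and the result is accepted only if the block checksum then matches.

```python
import hashlib
from functools import reduce


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(chunks):
    """XOR parity over equal-length chunks."""
    return reduce(xor_bytes, chunks)


def read_with_repair(chunks, parity, expected):
    """Return (data, bad_index); bad_index is None if nothing was wrong.

    chunks: list of data chunks as returned by the disks.
    parity: XOR parity written alongside the original chunks.
    expected: SHA-256 digest of the original block.
    """
    if hashlib.sha256(b"".join(chunks)).digest() == expected:
        return b"".join(chunks), None
    for i in range(len(chunks)):
        others = chunks[:i] + chunks[i + 1:]
        candidate = parity_of(others + [parity])  # rebuild chunk i
        trial = b"".join(chunks[:i] + [candidate] + chunks[i + 1:])
        if hashlib.sha256(trial).digest() == expected:
            return trial, i
    raise IOError("block cannot be reconstructed with single parity")


chunks = [b"AAAA", b"BBBB", b"CCCC"]
parity = parity_of(chunks)
expected = hashlib.sha256(b"".join(chunks)).digest()

damaged = [b"AAAA", b"BxBB", b"CCCC"]  # the second disk returned bad data
data, bad = read_with_repair(damaged, parity, expected)
print(data, bad)                       # b'AAAABBBBCCCC' 1
```

Parity alone cannot say which chunk is wrong; it is the independent checksum that confirms the correct reconstruction.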
RAID-Z does not require any special hardware: it does not need NVRAM for reliability, and it does not need write buffering for good performance. With RAID-Z, ZFS provides fast, reliable storage using cheap, commodity disks.
There are three different RAID-Z modes: RAID-Z1 (similar to RAID 5, allows one disk to fail), RAID-Z2 (similar to RAID 6, allows two disks to fail), and RAID-Z3 (also referred to as RAID 7, allows three disks to fail). The need for RAID-Z3 arose because RAID configurations with larger disks (say, 6–10 TB) may take a long time to repair, in the worst case weeks. During that time, the remaining disks in the array are stressed by the intensive repair process and might subsequently fail too. RAID-Z3 reduces the risk involved in disk replacement.
Mirroring, the other ZFS RAID option, is essentially the same as RAID 1, allowing any number of disks to be mirrored. Like RAID 1 it also allows faster read and resilver/rebuild speeds since all drives can be used simultaneously and parity data is not calculated separately, and mirrored vdevs can be split to create identical copies of the pool.
Resilvering and scrub
ZFS has no tool equivalent to fsck (the standard Unix and Linux data checking and repair tool for file systems). Instead, ZFS has a built-in scrub function which regularly examines all data and repairs silent corruption and other problems. Some differences are:
- fsck must be run on an offline filesystem, which means the filesystem must be unmounted and is not usable while being repaired, while scrub is designed to be used on a mounted, live filesystem, and does not need the ZFS filesystem to be taken offline.
- fsck usually only checks metadata (such as the journal log) but never checks the data itself. This means, after an fsck, the data might still not match the original data as stored.
- fsck cannot always validate and repair data when checksums are stored with data (often the case in many file systems), because the checksums may also be corrupted or unreadable. ZFS always stores checksums separately from the data they verify, improving reliability and the ability of scrub to repair the volume. ZFS also stores multiple copies of data – metadata in particular may have upwards of 4 or 6 copies (multiple copies per disk and multiple disk mirrors per volume), greatly improving the ability of scrub to detect and repair extensive damage to the volume, compared to fsck.
- scrub checks everything, including metadata and the data. The effect can be observed by comparing fsck to scrub times – sometimes a fsck on a large RAID completes in a few minutes, which means only the metadata was checked. Traversing all metadata and data on a large RAID takes many hours, which is exactly what scrub does.
The official recommendation from Sun/Oracle is to scrub enterprise-level disks once a month, and cheaper commodity disks once a week.
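The idea behind scrub can be illustrated with a small model of a two-way mirror. The data layout here (a dictionary of expected checksums kept apart from per-device dictionaries of blocks) and the function names are assumptions for illustration only; they do not reflect the on-disk format.

```python
import hashlib


def sha256(data):
    return hashlib.sha256(data).digest()


def scrub(block_pointers, devices):
    """Verify every copy of every block; repair bad copies from a good one.

    block_pointers -- {block_id: expected checksum}, stored apart from the data
    devices        -- list of {block_id: bytes}, one dict per mirrored device
    Returns (repaired copies, unrecoverable blocks).
    """
    repaired, unrecoverable = 0, 0
    for block_id, expected in block_pointers.items():
        # Find any copy whose contents still match the stored checksum.
        good = next((dev[block_id] for dev in devices
                     if block_id in dev and sha256(dev[block_id]) == expected),
                    None)
        if good is None:
            unrecoverable += 1
            continue
        # Overwrite every copy that is missing or differs from the good one.
        for dev in devices:
            if dev.get(block_id) != good:
                dev[block_id] = good
                repaired += 1
    return repaired, unrecoverable


blocks = {"a": b"hello", "b": b"world"}
pointers = {k: sha256(v) for k, v in blocks.items()}
disk0, disk1 = dict(blocks), dict(blocks)
disk1["b"] = b"w0rld"                      # silent corruption on one side
result = scrub(pointers, [disk0, disk1])
```

Because every block, data as well as metadata, is read and hashed, the running time of such a pass grows with the amount of stored data, which matches the many-hour scrub times mentioned above.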
ZFS is a 128-bit file system, so it can address 1.84 × 10^19 times more data than 64-bit systems such as Btrfs. The maximum limits of ZFS are designed to be so large that they should never be encountered in practice. For instance, fully populating a single zpool with 2^128 bits of data would require roughly 1.4 × 10^25 hard disk drives of 3 TB each.
Some theoretical limits in ZFS are:
- 2^48: number of entries in any individual directory
- 16 exbibytes (2^64 bytes): maximum size of a single file
- 16 exbibytes: maximum size of any attribute
- 256 quadrillion zebibytes (2^128 bytes): maximum size of any zpool
- 2^56: number of attributes of a file (actually constrained to 2^48 for the number of files in a directory)
- 2^64: number of devices in any zpool
- 2^64: number of zpools in a system
- 2^64: number of file systems in a zpool
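The powers of two behind these limits can be checked with a few lines of integer arithmetic. The variable names are illustrative; the only inputs are the exponents quoted above and the standard binary prefixes (1 EiB = 2^60 bytes, 1 ZiB = 2^70 bytes).

```python
ratio = 2**128 // 2**64            # address-space ratio, 128-bit vs 64-bit
max_file_eib = 2**64 // 2**60      # maximum file size in exbibytes
max_pool_zib = 2**128 // 2**70     # maximum pool size in zebibytes

print(f"{ratio:.3e}")              # 1.845e+19
print(max_file_eib)                # 16
print(f"{max_pool_zib:.3e}")       # 2.882e+17
```

The last figure equals 256 × 2^50 zebibytes, which is where the "256 quadrillion" wording comes from when 2^50 is read loosely as a quadrillion.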
With Oracle Solaris, the encryption capability in ZFS is embedded into the I/O pipeline. During writes, a block may be compressed, encrypted, checksummed and then deduplicated, in that order. The policy for encryption is set at the dataset level when datasets (file systems or ZVOLs) are created. The wrapping keys provided by the user/administrator can be changed at any time without taking the file system offline. The default behaviour is for the wrapping key to be inherited by any child datasets. The data encryption keys are randomly generated at dataset creation time. Only descendant datasets (snapshots and clones) share data encryption keys. A command is provided to switch to a new data encryption key, either for a clone or at any other time; this does not re-encrypt existing data, and instead relies on an encrypted master-key mechanism.
Storage devices, spares, and quotas
Pools can have hot spares to compensate for failing disks. When mirroring, block devices can be grouped according to physical chassis, so that the filesystem can continue in the case of the failure of an entire chassis.
Storage pool composition is not limited to similar devices, but can consist of ad-hoc, heterogeneous collections of devices, which ZFS seamlessly pools together, subsequently doling out space to diverse filesystems as needed. Arbitrary storage device types can be added to existing pools to expand their size.
The storage capacity of all vdevs is available to all of the file system instances in the zpool. A quota can be set to limit the amount of space a file system instance can occupy, and a reservation can be set to guarantee that space will be available to a file system instance.
Caching mechanisms: ARC (L1), L2ARC, Transaction groups, SLOG (ZIL)
ZFS uses different layers of disk cache to speed up read and write operations. Ideally, all data should be stored in RAM, but that is usually too expensive. Therefore, data is automatically cached in a hierarchy to optimize performance versus cost; these are often called "hybrid storage pools". Frequently accessed data is stored in RAM, and less frequently accessed data can be stored on slower media, such as solid state drives (SSDs). Data that is not often accessed is not cached and is left on the slow hard drives. If old data is suddenly read a lot, ZFS will automatically move it to SSDs or to RAM.
ZFS caching mechanisms include one each for reads and writes, and in each case, two levels of caching can exist, one in computer memory (RAM) and one on fast storage (usually solid state drives (SSDs)), for a total of four caches.
|Cache level||Where stored||Read cache||Write cache|
|First level cache||In RAM||Known as ARC, due to its use of a variant of the adaptive replacement cache (ARC) algorithm. RAM will always be used for caching, thus this level is always present. The efficiency of the ARC algorithm means that disks will often not need to be accessed, provided the ARC size is sufficiently large. If RAM is too small there will hardly be any ARC at all; in this case, ZFS always needs to access the underlying disks which impacts performance considerably.||Handled by means of "transaction groups" – writes are collated over a short period (typically 5 – 30 seconds) up to a given limit, with each group being written to disk ideally while the next group is being collated. This allows writes to be organized more efficiently for the underlying disks at the risk of minor data loss of the most recent transactions upon power interruption or hardware fault. In practice the power loss risk is avoided by ZFS write journaling and by the SLOG/ZIL second tier write cache pool (see below), so writes will only be lost if a write failure happens at the same time as a total loss of the second tier SLOG pool, and then only when settings related to synchronous writing and SLOG use are set in a way that would allow such a situation to arise. If data is received faster than it can be written, data receipt is paused until the disks can catch up.|
|Second level cache||On fast storage devices (which can be added or removed from a "live" system without disruption in current versions of ZFS, although not always in older versions)||Known as L2ARC ("Level 2 ARC"), optional. ZFS will cache as much data in L2ARC as it can, which can be tens or hundreds of gigabytes in many cases. L2ARC will also considerably speed up deduplication if the entire deduplication table can be cached in L2ARC. It can take several hours to fully populate the L2ARC from empty (before ZFS has decided which data are "hot" and should be cached). If the L2ARC device is lost, all reads will go out to the disks, which slows down performance, but nothing else will happen (no data will be lost).||Known as SLOG or ZIL ("ZFS Intent Log"), optional but an SLOG will be created on the main storage devices if no cache device is provided. This is the second tier write cache, and is often misunderstood. Strictly speaking, ZFS does not use the SLOG device to cache its disk writes. Rather, it uses SLOG to ensure writes are captured to a permanent storage medium as quickly as possible, so that in the event of power loss or write failure, no data which was acknowledged as written will be lost. The SLOG device allows ZFS to speedily store writes and quickly report them as written, even for storage devices such as HDDs that are much slower. In the normal course of activity, the SLOG is never referred to or read, and it does not act as a cache; its purpose is to safeguard data in flight during the few seconds taken for collation and "writing out", in case the eventual write were to fail. If all goes well, then the storage pool will be updated at some point within the next 5 to 60 seconds, when the current transaction group is written out to disk (see above), at which point the saved writes on the SLOG will simply be ignored and overwritten. If the write eventually fails, or the system suffers a crash or fault preventing its writing, then ZFS can identify all the writes that it has confirmed were written, by reading back the SLOG (the only time it is read from), and use this to completely repair the data loss.|
This becomes crucial if a large number of synchronous writes take place (such as with ESXi, NFS and some databases), where the client requires confirmation of successful writing before continuing its activity; the SLOG allows ZFS to confirm writing is successful much more quickly than if it had to write to the main store every time, without the risk involved in misleading the client as to the state of data storage. If there is no SLOG device then part of the main data pool will be used for the same purpose, although this is slower.
If the log device itself is lost, it is possible to lose the latest writes, therefore the log device should be mirrored. In earlier versions of ZFS, loss of the log device could result in loss of the entire zpool, although this is no longer the case. Therefore, one should upgrade ZFS if planning to use a separate log device.
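The interplay of transaction groups and the intent log can be sketched with a toy in-memory model. The class and method names are hypothetical, and the model ignores timing, ordering and device failures; it only shows why an acknowledged synchronous write survives a crash while an unflushed asynchronous one does not.

```python
class TinyPool:
    """Toy model: RAM buffer (open transaction group), intent log, main pool."""

    def __init__(self):
        self.main = {}        # data committed to the main pool
        self.open_txg = {}    # writes collated in RAM, lost on a crash
        self.intent_log = []  # records on the log device, survive a crash

    def write(self, key, value, sync=False):
        if sync:
            # Persist to the log before acknowledging the write.
            self.intent_log.append((key, value))
        self.open_txg[key] = value
        return "ack"

    def commit_txg(self):
        """Flush the collated group to the main pool; log records become obsolete."""
        self.main.update(self.open_txg)
        self.open_txg.clear()
        self.intent_log.clear()

    def crash_and_recover(self):
        """Lose RAM contents, then replay the log (the only time it is read)."""
        self.open_txg.clear()
        for key, value in self.intent_log:
            self.main[key] = value
        self.intent_log.clear()


pool = TinyPool()
pool.write("a", 1, sync=True)   # acknowledged only after reaching the log
pool.write("b", 2)              # asynchronous: held in RAM only
pool.crash_and_recover()
```

After the simulated crash, `pool.main` contains only the synchronous write; in normal operation `commit_txg` would have flushed both writes and the log would never have been read.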
Copy-on-write transactional model
ZFS uses a copy-on-write transactional object model. All block pointers within the filesystem contain a 256-bit checksum or 256-bit hash (currently a choice between Fletcher-2, Fletcher-4, or SHA-256) of the target block, which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, then any metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and the ZIL (intent log) write cache is used when synchronous write semantics are required. The blocks are arranged in a tree, as are their checksums (see Merkle signature scheme).
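A minimal sketch of this model, assuming a two-level tree, JSON-encoded blocks and SHA-256 checksums held in the parent's pointers. All names (`store`, `put`, `get`, `update`) are illustrative and are not ZFS interfaces.

```python
import hashlib
import json

store = []   # append-only "disk": blocks are never overwritten in place


def put(payload):
    """Allocate a new block and return a pointer [address, checksum]."""
    raw = json.dumps(payload).encode()
    store.append(raw)
    return [len(store) - 1, hashlib.sha256(raw).hexdigest()]


def get(pointer):
    """Read a block and verify it against the checksum in its pointer."""
    address, expected = pointer
    raw = store[address]
    if hashlib.sha256(raw).hexdigest() != expected:
        raise IOError(f"checksum mismatch at block {address}")
    return json.loads(raw)


def build(leaves):
    """Write leaf blocks, then one parent block holding their pointers."""
    return put([put(leaf) for leaf in leaves])


def update(root, index, new_leaf):
    """Copy-on-write: write a new leaf and a new parent; the old root stays valid."""
    children = get(root)
    children[index] = put(new_leaf)
    return put(children)


def read_all(root):
    return [get(child) for child in get(root)]


old_root = build(["alpha", "beta", "gamma"])
new_root = update(old_root, 1, "BETA")
```

Reading through `old_root` still yields the original three leaves, while `new_root` sees the modified one; the two trees share the unchanged leaf blocks. Because each pointer carries the checksum of the block it references, a corrupted block is detected by its parent rather than by data stored next to the corruption.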
Snapshots and clones
An advantage of copy-on-write is that, when ZFS writes new data, the blocks containing the old data can be retained, allowing a snapshot version of the file system to be maintained. ZFS snapshots are consistent (they reflect the entire data as it existed at a single point in time), and can be created extremely quickly, since all the data composing the snapshot is already stored; an entire storage pool is often snapshotted several times per hour. They are also space efficient, since any unchanged data is shared among the file system and its snapshots. Snapshots are inherently read-only, ensuring they will not be modified after creation, although they should not be relied on as a sole means of backup. Entire snapshots can be restored, as can individual files and directories within snapshots.
Writeable snapshots ("clones") can also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks are created to reflect those changes, but any unchanged blocks continue to be shared, no matter how many clones exist. This is an implementation of the copy-on-write principle.
Sending and receiving snapshots
ZFS file systems can be moved to other pools, including pools on remote hosts over the network, because the send command creates a stream representation of the file system's state. This stream can either describe the complete contents of the file system at a given snapshot, or it can be a delta between snapshots. Computing the delta stream is very efficient, and its size depends on the number of blocks changed between the snapshots. This provides an efficient strategy, e.g., for synchronizing offsite backups or high availability mirrors of a pool.
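The difference between a full stream and a delta stream can be illustrated with snapshots modelled as dictionaries from block number to contents. This is only a sketch of the concept: it compares two in-memory dictionaries and does not model how ZFS actually locates changed blocks or encodes its stream; the function names are hypothetical.

```python
def incremental_stream(old_snap, new_snap):
    """Blocks to write and block numbers to free when going old_snap -> new_snap."""
    writes = {k: v for k, v in new_snap.items() if old_snap.get(k) != v}
    frees = sorted(k for k in old_snap if k not in new_snap)
    return {"writes": writes, "frees": frees}


def receive(target, stream):
    """Apply a stream to a copy of the target's current state."""
    result = dict(target)
    result.update(stream["writes"])
    for block in stream["frees"]:
        result.pop(block, None)
    return result


snap1 = {0: b"a", 1: b"b", 2: b"c"}
snap2 = {0: b"a", 1: b"B", 3: b"d"}        # block 1 changed, 2 freed, 3 added

delta = incremental_stream(snap1, snap2)   # size depends only on the changes
full = incremental_stream({}, snap2)       # a full stream is a delta from nothing
replica = receive(snap1, delta)
```

Applying `delta` to a replica that already holds `snap1` brings it to `snap2` while transferring two blocks instead of three, which is the property that makes repeated offsite synchronization cheap.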
Dynamic striping across all devices to maximize throughput means that as additional devices are added to the zpool, the stripe width automatically expands to include them; thus, all disks in a pool are used, which balances the write load across them.
Variable block sizes
ZFS uses variable-sized blocks, with 128 KB as the default size. Available features allow the administrator to tune the maximum block size which is used, as certain workloads do not perform well with large blocks. If data compression is enabled, variable block sizes are used. If a block can be compressed to fit into a smaller block size, the smaller size is used on the disk to use less storage and improve IO throughput (though at the cost of increased CPU use for the compression and decompression operations).
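The effect of compression on the allocated block size can be sketched as follows. The 512-byte allocation unit and the use of `zlib` are assumptions made for the example; `zlib` is only a stand-in and is not meant to represent the compression algorithms ZFS offers.

```python
import os
import zlib

SECTOR = 512            # assumed allocation unit
RECORD = 128 * 1024     # default block size mentioned in the text


def allocated_size(data, compress=True):
    """Bytes allocated on disk for one logical block, rounded up to sectors."""
    payload = zlib.compress(data) if compress else data
    if len(payload) >= len(data):
        payload = data                      # compression did not help; store raw
    sectors = -(-len(payload) // SECTOR)    # ceiling division
    return sectors * SECTOR


text_like = b"ZFS " * (RECORD // 4)         # highly compressible
random_like = os.urandom(RECORD)            # effectively incompressible

small = allocated_size(text_like)
large = allocated_size(random_like)
```

Here `small` is a small multiple of the sector size, while `large` stays at the full 128 KB because the random block does not shrink; the compressible block costs CPU time but less I/O.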
Lightweight filesystem creation
In ZFS, filesystem manipulation within a storage pool is easier than volume manipulation within a traditional filesystem; the time and effort required to create or expand a ZFS filesystem is closer to that of making a new directory than it is to volume manipulation in some other systems.
Pools and their associated ZFS file systems can be moved between different platform architectures, including systems implementing different byte orders. The ZFS block pointer format stores filesystem metadata in an endian-adaptive way; individual metadata blocks are written with the native byte order of the system writing the block. When reading, if the stored endianness does not match the endianness of the system, the metadata is byte-swapped in memory.
This does not affect the stored data; as is usual in POSIX systems, files appear to applications as simple arrays of bytes, so applications creating and reading data remain responsible for doing so in a way independent of the underlying system's endianness.
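The endian-adaptive idea can be sketched with Python's `struct` module. The one-byte order flag and the layout of 64-bit fields are a hypothetical format invented for this example, not the ZFS block pointer layout.

```python
import struct
import sys

NATIVE = "<" if sys.byteorder == "little" else ">"


def write_metadata(values, order=NATIVE):
    """Serialize 64-bit fields in the writer's byte order, tagged with a flag."""
    flag = b"L" if order == "<" else b"B"
    return flag + struct.pack(f"{order}{len(values)}Q", *values)


def read_metadata(blob):
    """Decode using the recorded order, whatever the reading host's order is."""
    order = "<" if blob[:1] == b"L" else ">"
    count = (len(blob) - 1) // 8
    return list(struct.unpack(f"{order}{count}Q", blob[1:]))


fields = [1, 2**40, 7]
from_little = write_metadata(fields, "<")   # as written by a little-endian host
from_big = write_metadata(fields, ">")      # as written by a big-endian host
```

Both blobs decode to the same field values on any host, even though their bytes differ. The writer never pays a conversion cost; only a reader with the opposite byte order does, which is the trade-off the text describes.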
Data deduplication capabilities were added to the ZFS source repository at the end of October 2009, and relevant OpenSolaris ZFS development packages have been available since December 3, 2009 (build 128).
Effective use of deduplication may require large RAM capacity; recommendations range between 1 and 5 GB of RAM for every TB of storage. Insufficient physical memory or lack of ZFS cache can result in virtual memory thrashing when using deduplication, which can either lower performance or result in complete memory starvation. Solid-state drives (SSDs) can be used to cache deduplication tables, thereby speeding up deduplication performance.
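Why deduplication needs memory can be seen from a toy block store keyed by checksum: every unique block needs a table entry that must be consulted on each write. The class, its fields and the helper below are illustrative assumptions; the RAM helper simply applies the 1 to 5 GB per TB rule of thumb quoted above.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplication keyed by SHA-256."""

    def __init__(self):
        self.table = {}   # checksum -> [block data, reference count]

    def write(self, block):
        key = hashlib.sha256(block).digest()
        entry = self.table.setdefault(key, [block, 0])
        entry[1] += 1     # duplicate writes only bump the reference count
        return key

    def physical_bytes(self):
        return sum(len(data) for data, _ in self.table.values())


def dedup_ram_range_gb(pool_tb):
    """RAM range in GB implied by the 1 to 5 GB per TB rule of thumb."""
    return (1 * pool_tb, 5 * pool_tb)


dedup = DedupStore()
for block in (b"AAAA", b"BBBB", b"AAAA"):
    dedup.write(block)
```

Three 4-byte blocks were written but only 8 bytes are stored, and the table holds two entries. For a 20 TB pool, `dedup_ram_range_gb(20)` gives `(20, 100)`, which shows how quickly the table can outgrow the memory of a small server.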
Other storage vendors use modified versions of ZFS to achieve very high data compression ratios. Two examples in 2012 were GreenBytes and Tegile. In May 2014, Oracle bought GreenBytes for its ZFS deduplication and replication technology.
And you are assuming that you will be able to replace the offlined drive and that the pool will be resilvered before another failure (assuming raidz1) or before additional drives get offlined (assuming raidz2, raidz3 or mirrors). I don't know about you, but I'd rather not gamble on being able to replace the drive and resilver fast enough.
No, I'm merely pointing out that ZFS won't just go and happily corrupt all the data on the drive as the forum post claims; once the error threshold is reached, the drive will be offlined and ZFS won't attempt to correct any more data on it.
I would agree with you that the likelihood of this occurring is low on a PER BLOCK basis, but when your system is reading TB of data, I would say the likelihood of you experiencing this at some point if you have bad memory is much higher. I would also say that the effect of bad memory on ZFS is worse than the effect of bad memory on UFS, EXT, or NTFS (see my previous post and the FreeNAS forum post). Now... if you are doing regular backups, rsyncs, replications, or scrubbing, then you are reading those TB of data regularly (as all good ZFS systems should), which leads to regular corruption of data.
Like I said earlier, that still requires correction to be performed in-place or between two bad memory locations, and that the faults are static but not caught by POST checks.