
Merge pull request #697 from trapexit/readme

add README segment on benchmarking
trapexit committed 5 years ago (commit 4c55eb0cc0)

README.md (141 changes)

@ -148,7 +148,7 @@ FUSE applications communicate with the kernel over a special character device: `
In Linux 4.20 a new feature was added allowing the negotiation of the max message size. Since the size is in multiples of [pages](https://en.wikipedia.org/wiki/Page_(computer_memory)) the feature is called `max_pages`. There is a maximum `max_pages` value of 256 (1MiB) and minimum of 1 (4KiB). The default used by Linux >=4.20, and hardcoded value used before 4.20, is 32 (128KiB). In mergerfs it's referred to as `fuse_msg_size` to make it clear what it impacts and provide some abstraction.
Since there should be no downsides to increasing `fuse_msg_size` / `max_pages`, outside a minor bump in RAM usage due to larger message buffers, mergerfs defaults the value to 256. On kernels before 4.20 the value has no effect. The reason the value is configurable is to enable experimentation and benchmarking. See the `nullrw` section for benchmarking examples.
Since there should be no downsides to increasing `fuse_msg_size` / `max_pages`, outside a minor bump in RAM usage due to larger message buffers, mergerfs defaults the value to 256. On kernels before 4.20 the value has no effect. The reason the value is configurable is to enable experimentation and benchmarking. See the BENCHMARKING section for examples.
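As a rough illustration (not from the original text; branch and mount paths are placeholders), the option is passed like any other mount option:
```
# Hypothetical example: set fuse_msg_size explicitly at mount time.
# /mnt/disk1, /mnt/disk2 and /mnt/pool are placeholder paths.
mergerfs -o allow_other,fuse_msg_size=256 /mnt/disk1:/mnt/disk2 /mnt/pool
```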
### symlinkify
@ -166,30 +166,7 @@ Due to how FUSE works there is an overhead to all requests made to a FUSE filesy
By enabling `nullrw` mergerfs will work as it always does **except** that all reads and writes will be no-ops. A write will succeed (the size of the write will be returned as if it were successful) but mergerfs does nothing with the data it was given. Similarly a read will return the size requested but won't touch the buffer.
Example:
```
$ dd if=/dev/zero of=/path/to/mergerfs/mount/benchmark ibs=1M obs=512 count=1024 iflag=dsync,nocache oflag=dsync,nocache conv=fdatasync status=progress
1024+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 15.4067 s, 69.7 MB/s
$ dd if=/dev/zero of=/path/to/mergerfs/mount/benchmark ibs=1M obs=1M count=1024 iflag=dsync,nocache oflag=dsync,nocache conv=fdatasync status=progress
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.219585 s, 4.9 GB/s
$ dd if=/path/to/mergerfs/mount/benchmark of=/dev/null bs=512 count=102400 iflag=dsync,nocache oflag=dsync,nocache conv=fdatasync status=progress
102400+0 records in
102400+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 0.757991 s, 69.2 MB/s
$ dd if=/path/to/mergerfs/mount/benchmark of=/dev/null bs=1M count=1024 iflag=dsync,nocache oflag=dsync,nocache conv=fdatasync status=progress
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.18405 s, 5.8 GB/s
```
It's important to test with different `obs` (output block size) values since the relative overhead is greater with smaller values. As you can see above the size of a read or write can massively impact theoretical performance. If an application performs much worse through mergerfs it could very well be that it doesn't optimally size its read and write requests. In such cases contact the mergerfs author so it can be investigated.
See the BENCHMARKING section for suggestions on how to test.
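A minimal sketch of enabling it (paths are placeholders; `nullrw=true` is the option described above):
```
# Sketch: mount a single branch with nullrw enabled, then benchmark the mount.
# Reads and writes through /mnt/pool become no-ops, isolating mergerfs overhead.
mergerfs -o allow_other,nullrw=true /mnt/disk1 /mnt/pool
```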
### xattr
@ -656,13 +633,75 @@ done
```
# PERFORMANCE
mergerfs is at its core just a proxy and therefore its theoretical max performance is that of the underlying devices. However, given it is a FUSE filesystem working from userspace there is an increase in overhead relative to kernel based solutions. That said the performance can match the theoretical max but it depends greatly on the system's configuration. Especially when adding network filesystems into the mix there are many variables which can impact performance. Drive speeds and latency, network speeds and latency, general concurrency, read/write sizes, etc. Unfortunately, given the number of variables it has been difficult to find a single set of settings which provide optimal performance. If you're having performance issues please look over the suggestions below.
NOTE: be sure to read about these features before changing them
* enable (or disable) `splice_move`, `splice_read`, and `splice_write`
* increase cache timeouts `cache.attr`, `cache.entry`, `cache.negative_entry`
* enable (or disable) page caching (`cache.files`)
* enable `cache.open`
* enable `cache.statfs`
* enable `cache.symlinks`
* enable `cache.readdir`
* change the number of worker threads
* disable `security_capability` and/or `xattr`
* disable `posix_acl`
* disable `async_read`
* test theoretical performance using `nullrw` or mounting a ram disk
* use `symlinkify` if your data is largely static
* use tiered cache drives
* use lvm and lvm cache to place an SSD in front of your HDDs (howto coming)
If you come across a setting that significantly impacts performance please contact trapexit so he may investigate further.
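For illustration only, here is a mount line combining several of the options above. The option names come from the list; the values and paths are placeholders and each option should be read about before use:
```
# Illustrative combination of tuning options; adjust or drop any of them.
# Continuation lines must not be indented so the option string stays intact.
mergerfs -o allow_other,use_ino,cache.files=partial,dropcacheonclose=true,\
cache.attr=120,cache.entry=120,cache.negative_entry=30,\
cache.statfs=10,cache.symlinks=true,cache.readdir=true \
/mnt/disk1:/mnt/disk2 /mnt/pool
```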
# BENCHMARKING
Filesystems are complicated. They do many things and many of those are interconnected. Additionally, the OS, drivers, hardware, etc. all can impact performance. Therefore, when benchmarking, it is **necessary** that the test focus as narrowly as possible.
For most, throughput is the key benchmark. To test throughput, `dd` is useful but **must** be used with the correct settings in order to ensure the filesystem or device is actually being tested. The OS can and will cache data. Without forcing synchronous reads and writes and/or disabling caching the values returned will not be representative of the device's true performance.
When benchmarking through mergerfs ensure you only use 1 branch to remove any possibility of the policies complicating the situation. Benchmark the underlying filesystem first and then mount mergerfs over it and test again. If you're experiencing speeds below your expectations you will need to narrow down precisely which component is leading to the slowdown. Preferably test the following in the order listed (but not combined).
1. Enable `nullrw` mode with `nullrw=true`. This will effectively make reads and writes no-ops. Removing the underlying device / filesystem from the equation. This will give us the top theoretical speeds.
2. Mount mergerfs over `tmpfs`. `tmpfs` is a RAM disk. Extremely high speed and very low latency. This is a more realistic best case scenario. Example: `mount -t tmpfs -o size=2G tmpfs /tmp/tmpfs`
3. Mount mergerfs over a local drive. NVMe, SSD, HDD, etc. If you have more than one I'd suggest testing each of them as drives and/or controllers (their drivers) could impact performance.
4. Finally, if you intend to use mergerfs with a network filesystem, either as the source of data or to combine with another through mergerfs, test each of those alone as above.
Once you find the component which has the performance issue you can do further testing with different options to see if they impact performance. For reads and writes the most relevant would be: `cache.files`, `async_read`, `splice_move`, `splice_read`, `splice_write`. Less likely but relevant when using NFS or with certain filesystems would be `security_capability`, `xattr`, and `posix_acl`. If you find a specific system, drive, filesystem, controller, etc. that performs poorly contact trapexit so he may investigate further.
Sometimes the problem is really the application accessing or writing data through mergerfs. Some software uses small buffer sizes which can lead to more requests and therefore greater overhead. You can test this out yourself by replacing `bs=1M` in the examples below with `ibs` or `obs` and using a size of `512` instead of `1M`. In one example test using `nullrw` the write speed dropped from 4.9GB/s to 69.7MB/s when moving from `1M` to `512`. Similar results were seen when testing reads. The overhead of small writes may be reduced by leveraging a write cache but in casual tests little gain was found. More tests will need to be done before this feature would become available. If you have an app that appears slow with mergerfs it could be due to this. Contact trapexit so he may investigate further.
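One way to put the steps above together, assuming a scratch setup with a single `tmpfs` branch (paths and sizes are placeholders):
```
# Steps 1-2 in practice: a single-branch pool backed by tmpfs.
mkdir -p /tmp/tmpfs /mnt/bench
mount -t tmpfs -o size=2G tmpfs /tmp/tmpfs
mergerfs -o allow_other /tmp/tmpfs /mnt/bench

# Benchmark the branch directly, then the same workload through mergerfs.
dd if=/dev/zero of=/tmp/tmpfs/1GB.file bs=1M count=1024 oflag=nocache conv=fdatasync status=progress
dd if=/dev/zero of=/mnt/bench/1GB.file bs=1M count=1024 oflag=nocache conv=fdatasync status=progress
```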
### write benchmark
With synchronized IO
```
$ dd if=/dev/zero of=/mnt/mergerfs/1GB.file bs=1M count=1024 oflag=dsync,nocache conv=fdatasync status=progress
```
Without synchronized IO
```
$ dd if=/dev/zero of=/mnt/mergerfs/1GB.file bs=1M count=1024 oflag=nocache conv=fdatasync status=progress
```
### read benchmark
```
$ dd if=/mnt/mergerfs/1GB.file of=/dev/null bs=1M count=1024 iflag=nocache conv=fdatasync status=progress
```
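Not part of the original examples, but if you want extra assurance the read is not served from the page cache you can flush it first (Linux only, requires root):
```
# Optional: drop the page cache so the read benchmark hits the underlying device.
sync
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/mergerfs/1GB.file of=/dev/null bs=1M count=1024 iflag=nocache conv=fdatasync status=progress
```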
# TIPS / NOTES
* **use_ino** will only work when used with mergerfs 2.18.0 and above.
* Run mergerfs as `root` (with **allow_other**) unless you're merging paths which are owned by the same user otherwise strange permission issues may arise.
* https://github.com/trapexit/backup-and-recovery-howtos : A set of guides / howtos on creating a data storage system, backing it up, maintaining it, and recovering from failure.
* If you don't see some directories and files you expect in a merged point or policies seem to skip drives be sure the user has permission to all the underlying directories. Use `mergerfs.fsck` to audit the drive for out of sync permissions.
* Do **not** use `cache.files=off` or `direct_io` if you expect applications (such as rtorrent) to [mmap](http://linux.die.net/man/2/mmap) files. Shared mmap is not currently supported in FUSE w/ `direct_io` enabled. Enabling `dropcacheonclose` is recommended when `cache.files=partial|full|auto-full` or `direct_io=false`.
* Do **not** use `cache.files=off` (or `direct_io`) if you expect applications (such as rtorrent) to [mmap](http://linux.die.net/man/2/mmap) files. Shared mmap is not currently supported in FUSE w/ `direct_io` enabled. Enabling `dropcacheonclose` is recommended when `cache.files=partial|full|auto-full` or `direct_io=false`.
* Since POSIX functions give only a singular error or success it's difficult to determine the proper behavior when applying the function to multiple targets. **mergerfs** will return an error only if all attempts of an action fail. Any success will lead to a success returned. This means however that some odd situations may arise.
* [Kodi](http://kodi.tv), [Plex](http://plex.tv), [Subsonic](http://subsonic.org), etc. can use directory [mtime](http://linux.die.net/man/2/stat) to more efficiently determine whether to scan for new content rather than simply performing a full scan. If using the default **getattr** policy of **ff** it's possible those programs will miss an update on account of it returning the first directory found's **stat** info and it's a later directory on another mount which had the **mtime** recently updated. To fix this you will want to set **func.getattr=newest**. Remember though that this is just **stat**. If the file is later **open**'ed or **unlink**'ed and the policy is different for those then a completely different file or directory could be acted on.
* Some policies mixed with some functions may result in strange behaviors. Not that some of these behaviors and race conditions couldn't happen outside **mergerfs** but that they are far more likely to occur on account of the attempt to merge together multiple sources of data which could be out of sync due to the different policies.
@ -680,7 +719,7 @@ The reason this is the default is because any other policy would be more expensi
If you always want the directory information from the one with the most recent mtime then use the `newest` policy for `getattr`.
#### `mv /mnt/pool/foo /mnt/disk1/foo` removes `foo`
#### 'mv /mnt/pool/foo /mnt/disk1/foo' removes 'foo'
This is not a bug.
@ -735,14 +774,14 @@ There appears to be a bug in the OpenVZ kernel with regard to how it handles ioc
#### Plex doesn't work with mergerfs
It does. If you're trying to put Plex's config / metadata on mergerfs you have to leave `direct_io` off because Plex is using sqlite which apparently needs mmap. mmap doesn't work with `direct_io`. To fix this place the data elsewhere or disable `direct_io` (with `dropcacheonclose=true`).
It does. If you're trying to put Plex's config / metadata on mergerfs you have to leave `direct_io` off because Plex is using sqlite3 which apparently needs mmap. mmap doesn't work with `direct_io`. To fix this place the data elsewhere or disable `direct_io` (with `dropcacheonclose=true`). Sqlite3 does not need mmap but the developer needs to fall back to standard IO if mmap fails.
If the issue is that scanning doesn't seem to pick up media then be sure to set `func.getattr=newest` as mentioned above.
#### mmap performance is really bad
There [is a bug](https://lkml.org/lkml/2016/3/16/260) in caching which affects overall performance of mmap through FUSE in Linux 4.x kernels. It is fixed in [4.4.10 and 4.5.4](https://lkml.org/lkml/2016/5/11/59).
There [is/was a bug](https://lkml.org/lkml/2016/3/16/260) in caching which affects overall performance of mmap through FUSE in Linux 4.x kernels. It is fixed in [4.4.10 and 4.5.4](https://lkml.org/lkml/2016/5/11/59).
#### When a program tries to move or rename a file it fails
@ -782,15 +821,23 @@ Due to the overhead of [getgroups/setgroups](http://linux.die.net/man/2/setgroup
The gid cache uses fixed storage to simplify the design and be compatible with older systems which may not have C++11 compilers. There is enough storage for 256 users' supplemental groups. Each user is allowed up to 32 supplemental groups. Linux >= 2.6.3 allows up to 65535 groups per user but most other *nixs allow far less. NFS allowing only 16. The system does handle overflow gracefully. If the user has more than 32 supplemental groups only the first 32 will be used. If more than 256 users are using the system when an uncached user is found it will evict an existing user's cache at random. So long as there aren't more than 256 active users this should be fine. If either value is too low for your needs you will have to modify `gidcache.hpp` to increase the values. Note that doing so will increase the memory needed by each thread.
While not a bug some users have found when using containers that supplemental groups defined inside the container don't work properly with regard to permissions. This is expected as mergerfs lives outside the container and therefore is querying the host's group database. There might be a hack to work around this (make mergerfs read the /etc/group file in the container) but it is not yet implemented and would be limited to Linux and the /etc/group DB. Preferably users would mount the host's group file into the containers or use a standard shared user & groups technology like NIS or LDAP.
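A hypothetical illustration of that workaround using Docker (Docker is an assumption here; any container runtime with bind mounts works similarly):
```
# Bind mount the host's group (and passwd) databases read-only into the container
# so IDs resolve the same way mergerfs sees them on the host.
docker run --rm -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /mnt/pool:/data alpine ls -l /data
```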
#### mergerfs or libfuse crashing
**NOTE:** as of mergerfs 2.22.0 it includes the most recent version of libfuse (or requires libfuse-2.9.7) so any crash should be reported. For older releases continue reading...
First... always upgrade to the latest version unless told otherwise.
If using mergerfs below 2.22.0:
If suddenly the mergerfs mount point disappears and `Transport endpoint is not connected` is returned when attempting to perform actions within the mount directory **and** the version of libfuse (use `mergerfs -v` to find the version) is older than `2.9.4` it's likely due to a bug in libfuse. Affected versions of libfuse can be found in Debian Wheezy, Ubuntu Precise and others.
In order to fix this please install newer versions of libfuse. If using a Debian based distro (Debian, Ubuntu, Mint) you can likely just install newer versions of [libfuse](https://packages.debian.org/unstable/libfuse2) and [fuse](https://packages.debian.org/unstable/fuse) from the repo of a newer release.
If using mergerfs at or above 2.22.0:
First upgrade if possible, check the known bugs section, and contact trapexit.
#### mergerfs appears to be crashing or exiting
@ -852,7 +899,7 @@ Using the **hard_remove** option will make it so these temporary files are not u
#### How well does mergerfs scale? Is it "production ready?"
Users have reported running mergerfs on everything from a Raspberry Pi to dual socket Xeon systems with >20 cores. I'm aware of at least a few companies which use mergerfs in production. [Open Media Vault](https://www.openmediavault.org) includes mergerfs as its sole solution for pooling drives. The only reports of data corruption have been due to a kernel bug.
Users have reported running mergerfs on everything from a Raspberry Pi to dual socket Xeon systems with >20 cores. I'm aware of at least a few companies which use mergerfs in production. [Open Media Vault](https://www.openmediavault.org) includes mergerfs as its sole solution for pooling drives.
#### Can mergerfs be used with drives which already have data / are in use?
@ -968,6 +1015,15 @@ When combined with something like [SnapRaid](http://www.snapraid.it) and/or an o
MergerFS is not intended to be a replacement for ZFS. MergerFS is intended to provide flexible pooling of arbitrary drives (local or remote), of arbitrary sizes, and arbitrary filesystems. For `write once, read many` usecases such as bulk media storage. Where data integrity and backup is managed in other ways. In that situation ZFS can introduce major maintenance and cost burdens as described [here](http://louwrentius.com/the-hidden-cost-of-using-zfs-for-your-home-nas.html).
#### What should mergerfs NOT be used for?
* databases: Even if the database stored data in separate files (mergerfs wouldn't offer much otherwise) the higher latency of the indirection will kill performance. If it is a lightly used SQLite database then it may be fine but you'll need to test.
* VM images: For the same reasons as databases. VM images are accessed very aggressively and mergerfs will introduce too much latency (if it works at all).
* As replacement for RAID: mergerfs is just for pooling branches. If you need that kind of device performance aggregation or high availability you should stick with RAID.
#### Can drives be written to directly? Outside of mergerfs while pooled?
Yes, however it's not recommended to use the same file from within the pool and from without at the same time. Especially if using caching of any kind (cache.files, cache.entry, cache.attr, cache.negative_entry, cache.symlinks, cache.readdir, etc.).
@ -1045,29 +1101,6 @@ In Linux setreuid syscalls apply only to the thread. GLIBC hides this away by us
For non-Linux systems mergerfs uses a read-write lock and changes credentials only when necessary. If multiple threads are to be user X then only the first one will need to change the process's credentials. So long as the other threads need to be user X they will take a readlock allowing multiple threads to share the credentials. Once a request comes in to run as user Y that thread will attempt a write lock and change to Y's credentials when it can. If the ability to give writers priority is supported then that flag will be used so threads trying to change credentials don't starve. This isn't the best solution but should work reasonably well assuming there are few users.
# PERFORMANCE
mergerfs is at its core just a proxy and therefore its theoretical max performance is that of the underlying devices. However, given it is a FUSE filesystem working from userspace there is an increase in overhead relative to kernel based solutions. That said the performance can match the theoretical max but it depends greatly on the system's configuration. Especially when adding network filesystems into the mix there are many variables which can impact performance. Drive speeds and latency, network speeds and lattices, general concurrency, read/write sizes, etc. Unfortunately, given the number of variables it has been difficult to find a single set of settings which provide optimal performance. If you're having performance issues please look over the suggestions below.
NOTE: be sure to read about these features before changing them
* enable (or disable) `splice_move`, `splice_read`, and `splice_write`
* increase cache timeouts `cache.attr`, `cache.entry`, `cache.negative_entry`
* enable (or disable) page caching (`cache.files`)
* enable `cache.open`
* enable `cache.statfs`
* enable `cache.symlinks`
* enable `cache.readdir`
* change the number of worker threads
* disable `security_capability` and/or `xattr`
* disable `posix_acl`
* disable `async_read`
* test theoretical performance using `nullrw` or mounting a ram disk
* use `symlinkify` if your data is largely static
* use tiered cache drives
* use lvm and lvm cache to place a SSD in front of your HDDs (howto coming)
# SUPPORT
Filesystems are very complex and difficult to debug. mergerfs, while being just a proxy of sorts, is also very difficult to debug given the large number of possible settings it can have itself and the massive number of environments it can run in. When reporting on a suspected issue **please, please** include as much of the below information as possible otherwise it will be difficult or impossible to diagnose. Also please make sure to read all of the above documentation as it includes nearly every known system or user issue previously encountered.

man/mergerfs.1 (294 changes)

@ -365,7 +365,7 @@ message buffers, mergerfs defaults the value to 256.
On kernels before 4.20 the value has no effect.
The reason the value is configurable is to enable experimentation and
benchmarking.
See the \f[C]nullrw\f[] section for benchmarking examples.
See the BENCHMARKING section for examples.
.SS symlinkify
.PP
Due to the levels of indirection introduced by mergerfs and the
@ -405,39 +405,7 @@ were successful) but mergerfs does nothing with the data it was given.
Similarly a read will return the size requested but won\[aq]t touch the
buffer.
.PP
Example:
.IP
.nf
\f[C]
$\ dd\ if=/dev/zero\ of=/path/to/mergerfs/mount/benchmark\ ibs=1M\ obs=512\ count=1024\ iflag=dsync,nocache\ oflag=dsync,nocache\ conv=fdatasync\ status=progress
1024+0\ records\ in
2097152+0\ records\ out
1073741824\ bytes\ (1.1\ GB,\ 1.0\ GiB)\ copied,\ 15.4067\ s,\ 69.7\ MB/s
$\ dd\ if=/dev/zero\ of=/path/to/mergerfs/mount/benchmark\ ibs=1M\ obs=1M\ count=1024\ iflag=dsync,nocache\ oflag=dsync,nocache\ conv=fdatasync\ status=progress
1024+0\ records\ in
1024+0\ records\ out
1073741824\ bytes\ (1.1\ GB,\ 1.0\ GiB)\ copied,\ 0.219585\ s,\ 4.9\ GB/s
$\ dd\ if=/path/to/mergerfs/mount/benchmark\ of=/dev/null\ bs=512\ count=102400\ iflag=dsync,nocache\ oflag=dsync,nocache\ conv=fdatasync\ status=progress
102400+0\ records\ in
102400+0\ records\ out
52428800\ bytes\ (52\ MB,\ 50\ MiB)\ copied,\ 0.757991\ s,\ 69.2\ MB/s
$\ dd\ if=/path/to/mergerfs/mount/benchmark\ of=/dev/null\ bs=1M\ count=1024\ iflag=dsync,nocache\ oflag=dsync,nocache\ conv=fdatasync\ status=progress
1024+0\ records\ in
1024+0\ records\ out
1073741824\ bytes\ (1.1\ GB,\ 1.0\ GiB)\ copied,\ 0.18405\ s,\ 5.8\ GB/s
\f[]
.fi
.PP
It\[aq]s important to test with different \f[C]obs\f[] (output block
size) values since the relative overhead is greater with smaller values.
As you can see above the size of a read or write can massively impact
theoretical performance.
If an application performs much worse through mergerfs it could very
well be that it doesn\[aq]t optimally size its read and write requests.
In such cases contact the mergerfs author so it can be investigated.
See the BENCHMARKING section for suggestions on how to test.
.SS xattr
.PP
Runtime extended attribute support can be managed via the \f[C]xattr\f[]
@ -1383,6 +1351,160 @@ do
done
\f[]
.fi
.SH PERFORMANCE
.PP
mergerfs is at its core just a proxy and therefore its theoretical max
performance is that of the underlying devices.
However, given it is a FUSE filesystem working from userspace there is
an increase in overhead relative to kernel based solutions.
That said the performance can match the theoretical max but it depends
greatly on the system\[aq]s configuration.
Especially when adding network filesystems into the mix there are many
variables which can impact performance.
Drive speeds and latency, network speeds and latency, general
concurrency, read/write sizes, etc.
Unfortunately, given the number of variables it has been difficult to
find a single set of settings which provide optimal performance.
If you\[aq]re having performance issues please look over the suggestions
below.
.PP
NOTE: be sure to read about these features before changing them
.IP \[bu] 2
enable (or disable) \f[C]splice_move\f[], \f[C]splice_read\f[], and
\f[C]splice_write\f[]
.IP \[bu] 2
increase cache timeouts \f[C]cache.attr\f[], \f[C]cache.entry\f[],
\f[C]cache.negative_entry\f[]
.IP \[bu] 2
enable (or disable) page caching (\f[C]cache.files\f[])
.IP \[bu] 2
enable \f[C]cache.open\f[]
.IP \[bu] 2
enable \f[C]cache.statfs\f[]
.IP \[bu] 2
enable \f[C]cache.symlinks\f[]
.IP \[bu] 2
enable \f[C]cache.readdir\f[]
.IP \[bu] 2
change the number of worker threads
.IP \[bu] 2
disable \f[C]security_capability\f[] and/or \f[C]xattr\f[]
.IP \[bu] 2
disable \f[C]posix_acl\f[]
.IP \[bu] 2
disable \f[C]async_read\f[]
.IP \[bu] 2
test theoretical performance using \f[C]nullrw\f[] or mounting a ram
disk
.IP \[bu] 2
use \f[C]symlinkify\f[] if your data is largely static
.IP \[bu] 2
use tiered cache drives
.IP \[bu] 2
use lvm and lvm cache to place an SSD in front of your HDDs (howto
coming)
.PP
If you come across a setting that significantly impacts performance
please contact trapexit so he may investigate further.
.SH BENCHMARKING
.PP
Filesystems are complicated.
They do many things and many of those are interconnected.
Additionally, the OS, drivers, hardware, etc.
all can impact performance.
Therefore, when benchmarking, it is \f[B]necessary\f[] that the test
focus as narrowly as possible.
.PP
For most, throughput is the key benchmark.
To test throughput \f[C]dd\f[] is useful but \f[B]must\f[] be used with
the correct settings in order to ensure the filesystem or device is
actually being tested.
The OS can and will cache data.
Without forcing synchronous reads and writes and/or disabling caching
the values returned will not be representative of the device\[aq]s true
performance.
.PP
When benchmarking through mergerfs ensure you only use 1 branch to
remove any possibility of the policies complicating the situation.
Benchmark the underlying filesystem first and then mount mergerfs over
it and test again.
If you\[aq]re experiencing speeds below your expectations you will need to
narrow down precisely which component is leading to the slowdown.
Preferably test the following in the order listed (but not combined).
.IP "1." 3
Enable \f[C]nullrw\f[] mode with \f[C]nullrw=true\f[].
This will effectively make reads and writes no\-ops.
Removing the underlying device / filesystem from the equation.
This will give us the top theoretical speeds.
.IP "2." 3
Mount mergerfs over \f[C]tmpfs\f[].
\f[C]tmpfs\f[] is a RAM disk.
Extremely high speed and very low latency.
This is a more realistic best case scenario.
Example: \f[C]mount\ \-t\ tmpfs\ \-o\ size=2G\ tmpfs\ /tmp/tmpfs\f[]
.IP "3." 3
Mount mergerfs over a local drive.
NVMe, SSD, HDD, etc.
If you have more than one I\[aq]d suggest testing each of them as drives
and/or controllers (their drivers) could impact performance.
.IP "4." 3
Finally, if you intend to use mergerfs with a network filesystem, either
as the source of data or to combine with another through mergerfs, test
each of those alone as above.
.PP
Once you find the component which has the performance issue you can do
further testing with different options to see if they impact
performance.
For reads and writes the most relevant would be: \f[C]cache.files\f[],
\f[C]async_read\f[], \f[C]splice_move\f[], \f[C]splice_read\f[],
\f[C]splice_write\f[].
Less likely but relevant when using NFS or with certain filesystems
would be \f[C]security_capability\f[], \f[C]xattr\f[], and
\f[C]posix_acl\f[].
If you find a specific system, drive, filesystem, controller, etc.
that performs poorly contact trapexit so he may investigate further.
.PP
Sometimes the problem is really the application accessing or writing
data through mergerfs.
Some software uses small buffer sizes which can lead to more requests and
therefore greater overhead.
You can test this out yourself by replacing \f[C]bs=1M\f[] in the examples
below with \f[C]ibs\f[] or \f[C]obs\f[] and using a size of \f[C]512\f[]
instead of \f[C]1M\f[].
In one example test using \f[C]nullrw\f[] the write speed dropped from
4.9GB/s to 69.7MB/s when moving from \f[C]1M\f[] to \f[C]512\f[].
Similar results were seen when testing reads.
The overhead of small writes may be reduced by leveraging a write cache
but in casual tests little gain was found.
More tests will need to be done before this feature would become
available.
If you have an app that appears slow with mergerfs it could be due to
this.
Contact trapexit so he may investigate further.
.SS write benchmark
.PP
With synchronized IO
.IP
.nf
\f[C]
$\ dd\ if=/dev/zero\ of=/mnt/mergerfs/1GB.file\ bs=1M\ count=1024\ oflag=dsync,nocache\ conv=fdatasync\ status=progress
\f[]
.fi
.PP
Without synchronized IO
.IP
.nf
\f[C]
$\ dd\ if=/dev/zero\ of=/mnt/mergerfs/1GB.file\ bs=1M\ count=1024\ oflag=nocache\ conv=fdatasync\ status=progress
\f[]
.fi
.SS read benchmark
.IP
.nf
\f[C]
$\ dd\ if=/mnt/mergerfs/1GB.file\ of=/dev/null\ bs=1M\ count=1024\ iflag=nocache\ conv=fdatasync\ status=progress
\f[]
.fi
.SH TIPS / NOTES
.IP \[bu] 2
\f[B]use_ino\f[] will only work when used with mergerfs 2.18.0 and
@ -1402,7 +1524,7 @@ all the underlying directories.
Use \f[C]mergerfs.fsck\f[] to audit the drive for out of sync
permissions.
.IP \[bu] 2
Do \f[B]not\f[] use \f[C]cache.files=off\f[] or \f[C]direct_io\f[] if
Do \f[B]not\f[] use \f[C]cache.files=off\f[] (or \f[C]direct_io\f[]) if
you expect applications (such as rtorrent) to
mmap (http://linux.die.net/man/2/mmap) files.
Shared mmap is not currently supported in FUSE w/ \f[C]direct_io\f[]
@ -1460,7 +1582,7 @@ value based on all found would require a scan of all drives.
.PP
If you always want the directory information from the one with the most
recent mtime then use the \f[C]newest\f[] policy for \f[C]getattr\f[].
.SS \f[C]mv\ /mnt/pool/foo\ /mnt/disk1/foo\f[] removes \f[C]foo\f[]
.SS \[aq]mv /mnt/pool/foo /mnt/disk1/foo\[aq] removes \[aq]foo\[aq]
.PP
This is not a bug.
.PP
@ -1532,18 +1654,21 @@ yet fixed.
.PP
It does.
If you\[aq]re trying to put Plex\[aq]s config / metadata on mergerfs you
have to leave \f[C]direct_io\f[] off because Plex is using sqlite which
have to leave \f[C]direct_io\f[] off because Plex is using sqlite3 which
apparently needs mmap.
mmap doesn\[aq]t work with \f[C]direct_io\f[].
To fix this place the data elsewhere or disable \f[C]direct_io\f[] (with
\f[C]dropcacheonclose=true\f[]).
Sqlite3 does not need mmap but the developer needs to fall back to
standard IO if mmap fails.
.PP
If the issue is that scanning doesn\[aq]t seem to pick up media then be
sure to set \f[C]func.getattr=newest\f[] as mentioned above.
.SS mmap performance is really bad
.PP
There is a bug (https://lkml.org/lkml/2016/3/16/260) in caching which
affects overall performance of mmap through FUSE in Linux 4.x kernels.
There is/was a bug (https://lkml.org/lkml/2016/3/16/260) in caching
which affects overall performance of mmap through FUSE in Linux 4.x
kernels.
It is fixed in 4.4.10 and 4.5.4 (https://lkml.org/lkml/2016/5/11/59).
.SS When a program tries to move or rename a file it fails
.PP
@ -1649,11 +1774,23 @@ fine.
If either value is too low for your needs you will have to modify
\f[C]gidcache.hpp\f[] to increase the values.
Note that doing so will increase the memory needed by each thread.
.PP
While not a bug some users have found when using containers that
supplemental groups defined inside the container don\[aq]t work properly
with regard to permissions.
This is expected as mergerfs lives outside the container and therefore
is querying the host\[aq]s group database.
There might be a hack to work around this (make mergerfs read the
/etc/group file in the container) but it is not yet implemented and
would be limited to Linux and the /etc/group DB.
Preferably users would mount the host\[aq]s group file into the containers
or use a standard shared user & groups technology like NIS or LDAP.
.SS mergerfs or libfuse crashing
.PP
\f[B]NOTE:\f[] as of mergerfs 2.22.0 it includes the most recent version
of libfuse (or requires libfuse\-2.9.7) so any crash should be reported.
For older releases continue reading...
First...
always upgrade to the latest version unless told otherwise.
.PP
If using mergerfs below 2.22.0:
.PP
If suddenly the mergerfs mount point disappears and
\f[C]Transport\ endpoint\ is\ not\ connected\f[] is returned when
@ -1669,6 +1806,11 @@ install newer versions of
libfuse (https://packages.debian.org/unstable/libfuse2) and
fuse (https://packages.debian.org/unstable/fuse) from the repo of a
newer release.
.PP
If using mergerfs at or above 2.22.0:
.PP
First upgrade if possible, check the known bugs section, and contact
trapexit.
.SS mergerfs appears to be crashing or exiting
.PP
There seems to be an issue with Linux version \f[C]4.9.0\f[] and above
@ -1758,7 +1900,6 @@ I\[aq]m aware of at least a few companies which use mergerfs in
production.
Open Media Vault (https://www.openmediavault.org) includes mergerfs as
its sole solution for pooling drives.
The only reports of data corruption have been due to a kernel bug.
.SS Can mergerfs be used with drives which already have data / are in
use?
.PP
@ -1970,6 +2111,21 @@ Where data integrity and backup is managed in other ways.
In that situation ZFS can introduce major maintenance and cost burdens
as described
here (http://louwrentius.com/the-hidden-cost-of-using-zfs-for-your-home-nas.html).
.SS What should mergerfs NOT be used for?
.IP \[bu] 2
databases: Even if the database stored data in separate files (mergerfs
wouldn\[aq]t offer much otherwise) the higher latency of the indirection
will kill performance.
If it is a lightly used SQLite database then it may be fine but
you\[aq]ll need to test.
.IP \[bu] 2
VM images: For the same reasons as databases.
VM images are accessed very aggressively and mergerfs will introduce too
much latency (if it works at all).
.IP \[bu] 2
As replacement for RAID: mergerfs is just for pooling branches.
If you need that kind of device performance aggregation or high
availability you should stick with RAID.
.SS Can drives be written to directly? Outside of mergerfs while pooled?
.PP
Yes, however it\[aq]s not recommended to use the same file from within the
@ -2128,58 +2284,6 @@ If the ability to give writers priority is supported then that flag will
be used so threads trying to change credentials don\[aq]t starve.
This isn\[aq]t the best solution but should work reasonably well
assuming there are few users.
.SH PERFORMANCE
.PP
mergerfs is at its core just a proxy and therefore its theoretical max
performance is that of the underlying devices.
However, given it is a FUSE filesystem working from userspace there is
an increase in overhead relative to kernel based solutions.
That said the performance can match the theoretical max but it depends
greatly on the system\[aq]s configuration.
Especially when adding network filesystems into the mix there are many
variables which can impact performance.
Drive speeds and latency, network speeds and lattices, general
concurrency, read/write sizes, etc.
Unfortunately, given the number of variables it has been difficult to
find a single set of settings which provide optimal performance.
If you\[aq]re having performance issues please look over the suggestions
below.
.PP
NOTE: be sure to read about these features before changing them
.IP \[bu] 2
enable (or disable) \f[C]splice_move\f[], \f[C]splice_read\f[], and
\f[C]splice_write\f[]
.IP \[bu] 2
increase cache timeouts \f[C]cache.attr\f[], \f[C]cache.entry\f[],
\f[C]cache.negative_entry\f[]
.IP \[bu] 2
enable (or disable) page caching (\f[C]cache.files\f[])
.IP \[bu] 2
enable \f[C]cache.open\f[]
.IP \[bu] 2
enable \f[C]cache.statfs\f[]
.IP \[bu] 2
enable \f[C]cache.symlinks\f[]
.IP \[bu] 2
enable \f[C]cache.readdir\f[]
.IP \[bu] 2
change the number of worker threads
.IP \[bu] 2
disable \f[C]security_capability\f[] and/or \f[C]xattr\f[]
.IP \[bu] 2
disable \f[C]posix_acl\f[]
.IP \[bu] 2
disable \f[C]async_read\f[]
.IP \[bu] 2
test theoretical performance using \f[C]nullrw\f[] or mounting a ram
disk
.IP \[bu] 2
use \f[C]symlinkify\f[] if your data is largely static
.IP \[bu] 2
use tiered cache drives
.IP \[bu] 2
use lvm and lvm cache to place a SSD in front of your HDDs (howto
coming)
.SH SUPPORT
.PP
Filesystems are very complex and difficult to debug.
