diff --git a/README.md b/README.md
index 08405133..ae718f5c 100644
--- a/README.md
+++ b/README.md
@@ -172,11 +172,11 @@ See the BENCHMARKING section for suggestions on how to test.
 
 ### xattr
 
-Runtime extended attribute support can be managed via the `xattr` option. By default it will passthrough any xattr calls. Given xattr support is rarely used and can have significant performance implications mergerfs allows it to be disabled at runtime.
+Runtime extended attribute support can be managed via the `xattr` option. By default it will passthrough any xattr calls. Given xattr support is rarely used and can have significant performance implications mergerfs allows it to be disabled at runtime. The performance problems mostly come when file caching is enabled. The kernel will send a `getxattr` for `security.capability` *before every single write*. It doesn't cache the responses to any `getxattr`. This might be addressed in the future but for now mergerfs can really only offer the following workarounds.
 
 `noattr` will cause mergerfs to short circuit all xattr calls and return ENOATTR where appropriate. mergerfs still gets all the requests but they will not be forwarded on to the underlying filesystems. The runtime control will still function in this mode.
 
-`nosys` will cause mergerfs to return ENOSYS for any xattr call. The difference with `noattr` is that the kernel will cache this fact and itself short circuit future calls. This will be more efficient than `noattr` but will cause mergerfs' runtime control via the hidden file to stop working.
+`nosys` will cause mergerfs to return ENOSYS for any xattr call. The difference with `noattr` is that the kernel will cache this fact and itself short circuit future calls. This is more efficient than `noattr` but will cause mergerfs' runtime control via the hidden file to stop working.
 
 # FUNCTIONS / POLICIES / CATEGORIES
 
@@ -584,7 +584,17 @@ With #2 one could use dm-cache as well but there is another solution which requi
 1. Create 2 mergerfs pools. One which includes just the slow drives and one which has both the fast drives (SSD,NVME,etc.) and slow drives.
 2. The 'cache' pool should have the cache drives listed first.
 3. The best `create` policies to use for the 'cache' pool would probably be `ff`, `epff`, `lfs`, or `eplfs`. The latter two under the assumption that the cache drive(s) are far smaller than the backing drives. If using path preserving policies remember that you'll need to manually create the core directories of those paths you wish to be cached. Be sure the permissions are in sync. Use `mergerfs.fsck` to check / correct them. You could also tag the slow drives as `=NC` though that'd mean if the cache drives fill you'd get "out of space" errors.
-4. Enable `moveonenospc` and set `minfreespace` appropriately. Perhaps setting `minfreespace` to the size of the largest cache drive.
+4. Enable `moveonenospc` and set `minfreespace` appropriately. To make sure there is enough room on the "slow" pool you might want to set `minfreespace` to at least the size of the largest cache drive, if not larger. This way, in the worst case, the whole of the cache drive(s) can be moved to the other drives. (See the example setup after this list.)
 5. Set your programs to use the cache pool.
 6. Save one of the below scripts or create you're own.
 7. Use `cron` (as root) to schedule the command at whatever frequency is appropriate for your workflow.
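+
+A rough sketch of such a setup (paths and sizes are hypothetical; here the SSDs are mounted under `/mnt/ssd*`, the HDDs under `/mnt/hdd*`, and the largest cache drive is assumed to be 512G):
+
+```
+# /etc/fstab
+# "slow" pool: just the HDDs
+/mnt/hdd*            /mnt/slow   fuse.mergerfs  allow_other,use_ino  0 0
+# "cache" pool: SSDs listed first; minfreespace leaves room to flush the largest SSD
+/mnt/ssd*:/mnt/hdd*  /mnt/cache  fuse.mergerfs  allow_other,use_ino,category.create=ff,moveonenospc=true,minfreespace=512G  0 0
+```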
@@ -642,13 +652,14 @@ done
 
 # PERFORMANCE
 
-mergerfs is at its core just a proxy and therefore its theoretical max performance is that of the underlying devices. However, given it is a FUSE filesystem working from userspace there is an increase in overhead relative to kernel based solutions. That said the performance can match the theoretical max but it depends greatly on the system's configuration. Especially when adding network filesystems into the mix there are many variables which can impact performance. Drive speeds and latency, network speeds and lattices, general concurrency, read/write sizes, etc. Unfortunately, given the number of variables it has been difficult to find a single set of settings which provide optimal performance. If you're having performance issues please look over the suggestions below.
+mergerfs is at its core just a proxy and therefore its theoretical max performance is that of the underlying devices. However, given it is a FUSE filesystem working from userspace there is an increase in overhead relative to kernel based solutions. That said the performance can match the theoretical max but it depends greatly on the system's configuration. Especially when adding network filesystems into the mix there are many variables which can impact performance. Drive speeds and latency, network speeds and latency, general concurrency, read/write sizes, etc. Unfortunately, given the number of variables it has been difficult to find a single set of settings which provide optimal performance. If you're having performance issues please look over the suggestions below.
 
 NOTE: be sure to read about these features before changing them
 
 * enable (or disable) `splice_move`, `splice_read`, and `splice_write`
 * increase cache timeouts `cache.attr`, `cache.entry`, `cache.negative_entry`
 * enable (or disable) page caching (`cache.files`)
+* enable `cache.writeback`
 * enable `cache.open`
 * enable `cache.statfs`
 * enable `cache.symlinks`
@@ -660,7 +671,7 @@ NOTE: be sure to read about these features before changing them
 * test theoretical performance using `nullrw` or mounting a ram disk
 * use `symlinkify` if your data is largely static
 * use tiered cache drives
-* use lvm and lvm cache to place a SSD in front of your HDDs (howto coming)
+* use lvm and lvm cache to place an SSD in front of your HDDs
 
 If you come across a setting that significantly impacts performance please contact trapexit so he may investigate further.
 
@@ -685,12 +696,6 @@ Sometimes the problem is really the application accessing or writing data throug
 
 ### write benchmark
 
-With synchronized IO
-```
-$ dd if=/dev/zero of=/mnt/mergerfs/1GB.file bs=1M count=1024 oflag=dsync,nocache conv=fdatasync status=progress
-```
-
-Without synchronized IO
 ```
 $ dd if=/dev/zero of=/mnt/mergerfs/1GB.file bs=1M count=1024 oflag=nocache conv=fdatasync status=progress
 ```
@@ -943,11 +948,20 @@ That said, for the average person, the following should be fine:
 
 #### Why are all my files ending up on 1 drive?!
 
-Did you start with empty drives? Did you explicitly configure a `category.create` policy?
+Did you start with empty drives? Did you explicitly configure a `category.create` policy? Are you using a path preserving policy?
 
-The default create policy is `epmfs`. That is a path preserving algorithm. With such a policy for `mkdir` and `create` with a set of empty drives it will naturally select only 1 drive when the first directory is created. Anything, files or directories, created in that first directory will be placed on the same branch because it is preserving paths.
+The default create policy is `epmfs`. That is a path preserving algorithm. With such a policy for `mkdir` and `create` with a set of empty drives it will select only 1 drive when the first directory is created. Anything, files or directories, created in that first directory will be placed on the same branch because it is preserving paths.
 
-This catches a lot of new users off guard but changing the default would break the setup for many existing users. If you do not care about path preservation and wish your files to be spread across all your drives change to `mfs` or similar policy as described above.
+This catches a lot of new users off guard but changing the default would break the setup for many existing users. If you do not care about path preservation and wish your files to be spread across all your drives change to `mfs` or similar policy as described above. If you do want path preservation you'll need to perform the manual act of creating paths on the drives you want the data to land on before transferring your data. Setting `func.mkdir=epall` can simplify managing path preservation for `create`. Or use `func.mkdir=rand` if you're interested in just grouping together directory content by drive.
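+
+For example, a sketch of changing these policies at runtime via mergerfs' control file (assumes the pool is mounted at `/mnt/pool`; `setfattr` comes from the `attr` package):
+
+```
+# spread new files across branches instead of preserving paths
+$ setfattr -n user.mergerfs.category.create -v mfs /mnt/pool/.mergerfs
+# or keep `create` path preserving but have `mkdir` clone paths onto all branches
+$ setfattr -n user.mergerfs.func.mkdir -v epall /mnt/pool/.mergerfs
+```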
 
 #### Do hard links work?
 
@@ -971,13 +985,6 @@ Whenever you run into a split permission issue (seeing some but not all files) t
 
 If using a network filesystem such as NFS, SMB, CIFS (Samba) be sure to pay close attention to anything regarding permissioning and users. Root squashing and user translation for instance has bitten a few mergerfs users. Some of these also affect the use of mergerfs from container platforms such as Docker.
 
-#### Why is only one drive being used?
-
-Are you using a path preserving policy? The default policy for file creation is `epmfs`. That means only the drives with the path preexisting will be considered when creating a file. If you don't care about where files and directories are created you likely shouldn't be using a path preserving policy and instead something like `mfs`.
-
-This can be especially apparent when filling an empty pool from an external source. If you do want path preservation you'll need to perform the manual act of creating paths on the drives you want the data to land on before transferring your data. Setting `func.mkdir=epall` can simplify managing path preservation for `create`.
-
-
 #### Is my OS's libfuse needed for mergerfs to work?
 
 No. Normally `mount.fuse` is needed to get mergerfs (or any FUSE filesystem to mount using the `mount` command but in vendoring the libfuse library the `mount.fuse` app has been renamed to `mount.mergerfs` meaning the filesystem type in `fstab` can simply be `mergerfs`.
@@ -1042,7 +1049,7 @@ MergerFS is not intended to be a replacement for ZFS. MergerFS is intended to pr
 
 #### Can drives be written to directly? Outside of mergerfs while pooled?
 
-Yes, however its not recommended to use the same file from within the pool and from without at the same time. Especially if using caching of any kind (cache.files, cache.entry, cache.attr, cache.negative_entry, cache.symlinks, cache.readdir, etc.).
+Yes, however it's not recommended to use the same file from within the pool and from without at the same time (particularly writing). Especially if using caching of any kind (cache.files, cache.entry, cache.attr, cache.negative_entry, cache.symlinks, cache.readdir, etc.) as there could be a conflict between the cached and actual data.
 
 #### Why do I get an "out of space" / "no space left on device" / ENOSPC error even though there appears to be lots of space available?
 
@@ -1093,11 +1100,26 @@ and the kernel use internally (also called the "nodeid").
 
 Generally collision, if it occurs, shouldn't be a problem. You can turn off the calculation by not using `use_ino`. In the future it might be worth creating different strategies for users to select from.
 
-#### I notice massive slowdowns of writes over NFS
-
-Due to how NFS works and interacts with FUSE when not using `cache.files=off` or `direct_io` its possible that a getxattr for `security.capability` will be issued prior to any write. This will usually result in a massive slowdown for writes. Using `cache.files=off` or `direct_io` will keep this from happening (and generally good to enable unless you need the features it disables) but the `security_capability` option can also help by short circuiting the call and returning `ENOATTR`.
-
-You could also set `xattr` to `noattr` or `nosys` to short circuit or stop all xattr requests.
+#### I notice massive slowdowns of writes when enabling cache.files
+
+When file caching is enabled in any form (`cache.files!=off` or `direct_io=false`) the kernel will issue `getxattr` requests for `security.capability` prior to *every single write*. This will usually result in a performance degradation, especially when using a network filesystem (such as NFS or CIFS/SMB/Samba). Unfortunately at this moment the kernel is not caching the responses.
+
+To work around this situation mergerfs offers a few solutions (an example of the first follows the list).
+
+1. Set `security_capability=false`. It will short circuit any call and return `ENOATTR`. This still means mergerfs will receive the request before every write but at least it will not be passed through to the underlying filesystem.
+2. Set `xattr=noattr`. Same as above but applies to *all* `getxattr` calls, not just `security.capability`. These will not be cached by the kernel either but mergerfs' runtime config system will still function.
+3. Set `xattr=nosys`. Results in mergerfs returning `ENOSYS` which *will* be cached by the kernel. No future xattr calls will be forwarded to mergerfs. The downside is that this also means the xattr based config and query functionality won't work either.
+4. Disable file caching. If you aren't using applications which use `mmap` it's probably simpler to just disable it altogether. The kernel won't send the requests when caching is disabled.
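+
+An illustration of option 1 (a sketch; branch and mount paths are hypothetical):
+
+```
+# at mount time
+$ mergerfs -o cache.files=partial,security_capability=false '/mnt/disk*' /mnt/pool
+# or toggled at runtime via the control file
+$ setfattr -n user.mergerfs.security_capability -v false /mnt/pool/.mergerfs
+```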
 
 #### What are these .fuse_hidden files?
 
diff --git a/man/mergerfs.1 b/man/mergerfs.1
index 5d0d25e0..fc862e22 100644
--- a/man/mergerfs.1
+++ b/man/mergerfs.1
@@ -416,6 +416,12 @@ option.
 By default it will passthrough any xattr calls.
 Given xattr support is rarely used and can have significant performance
 implications mergerfs allows it to be disabled at runtime.
+The performance problems mostly come when file caching is enabled.
+The kernel will send a \f[C]getxattr\f[] for
+\f[C]security.capability\f[] \f[I]before every single write\f[].
+It doesn\[aq]t cache the responses to any \f[C]getxattr\f[].
+This might be addressed in the future but for now mergerfs can really
+only offer the following workarounds.
 .PP
 \f[C]noattr\f[] will cause mergerfs to short circuit all xattr calls
 and return ENOATTR where appropriate.
 mergerfs still gets all the requests but they will not be forwarded on
 to the underlying filesystems.
 The runtime control will still function in this mode.
@@ -426,8 +432,8 @@
 .PP
 \f[C]nosys\f[] will cause mergerfs to return ENOSYS for any xattr call.
 The difference with \f[C]noattr\f[] is that the kernel will cache this
 fact and itself short circuit future calls.
-This will be more efficient than \f[C]noattr\f[] but will cause
-mergerfs\[aq] runtime control via the hidden file to stop working.
+This is more efficient than \f[C]noattr\f[] but will cause mergerfs\[aq]
+runtime control via the hidden file to stop working.
 .SH FUNCTIONS / POLICIES / CATEGORIES
 .PP
 The POSIX filesystem API is made up of a number of functions.
@@ -1317,8 +1323,11 @@ mean if the cache drives fill you\[aq]d get "out of space" errors.
 .IP "4." 3
 Enable \f[C]moveonenospc\f[] and set \f[C]minfreespace\f[]
 appropriately.
-Perhaps setting \f[C]minfreespace\f[] to the size of the largest cache
-drive.
+To make sure there is enough room on the "slow" pool you might want to
+set \f[C]minfreespace\f[] to at least the size of the largest cache
+drive, if not larger.
+This way, in the worst case, the whole of the cache drive(s) can be
+moved to the other drives.
 .IP "5." 3
 Set your programs to use the cache pool.
 .IP "6." 3
@@ -1392,7 +1401,7 @@ That said the performance can match the theoretical max but it depends
 greatly on the system\[aq]s configuration.
 Especially when adding network filesystems into the mix there are many
 variables which can impact performance.
-Drive speeds and latency, network speeds and lattices, general
+Drive speeds and latency, network speeds and latency, general
 concurrency, read/write sizes, etc.
 Unfortunately, given the number of variables it has been difficult to
 find a single set of settings which provide optimal performance.
@@ -1409,6 +1418,8 @@ increase cache timeouts \f[C]cache.attr\f[], \f[C]cache.entry\f[],
 .IP \[bu] 2
 enable (or disable) page caching (\f[C]cache.files\f[])
 .IP \[bu] 2
+enable \f[C]cache.writeback\f[]
+.IP \[bu] 2
 enable \f[C]cache.open\f[]
 .IP \[bu] 2
 enable \f[C]cache.statfs\f[]
@@ -1432,8 +1443,7 @@ use \f[C]symlinkify\f[] if your data is largely static
 .IP \[bu] 2
 use tiered cache drives
 .IP \[bu] 2
-use lvm and lvm cache to place a SSD in front of your HDDs (howto
-coming)
+use lvm and lvm cache to place an SSD in front of your HDDs
 .PP
 If you come across a setting that significantly impacts performance
 please contact trapexit so he may investigate further.
@@ -1513,16 +1523,6 @@ If you have an app that appears slow with mergerfs it could be due to
 this.
 Contact trapexit so he may investigate further.
 .SS write benchmark
-.PP
-With synchronized IO
-.IP
-.nf
-\f[C]
-$\ dd\ if=/dev/zero\ of=/mnt/mergerfs/1GB.file\ bs=1M\ count=1024\ oflag=dsync,nocache\ conv=fdatasync\ status=progress
-\f[]
-.fi
-.PP
-Without synchronized IO
 .IP
 .nf
 \f[C]
@@ -1998,12 +1998,13 @@ That said, for the average person, the following should be fine:
 .PP
 Did you start with empty drives?
 Did you explicitly configure a \f[C]category.create\f[] policy?
+Are you using a path preserving policy?
 .PP
 The default create policy is \f[C]epmfs\f[].
 That is a path preserving algorithm.
 With such a policy for \f[C]mkdir\f[] and \f[C]create\f[] with a set of
-empty drives it will naturally select only 1 drive when the first
-directory is created.
+empty drives it will select only 1 drive when the first directory is
+created.
 Anything, files or directories, created in that first directory will be
 placed on the same branch because it is preserving paths.
 .PP
@@ -2012,6 +2013,13 @@ break the setup for many existing users.
 If you do not care about path preservation and wish your files to be
 spread across all your drives change to \f[C]mfs\f[] or similar policy
 as described above.
+If you do want path preservation you\[aq]ll need to perform the manual
+act of creating paths on the drives you want the data to land on before
+transferring your data.
+Setting \f[C]func.mkdir=epall\f[] can simplify managing path
+preservation for \f[C]create\f[].
+Or use \f[C]func.mkdir=rand\f[] if you\[aq]re interested in just
+grouping together directory content by drive.
 .SS Do hard links work?
 .PP
 Yes.
@@ -2058,23 +2066,6 @@ Root squashing and user translation for instance has bitten a few
 mergerfs users.
 Some of these also affect the use of mergerfs from container platforms
 such as Docker.
-.SS Why is only one drive being used?
-.PP
-Are you using a path preserving policy?
-The default policy for file creation is \f[C]epmfs\f[].
-That means only the drives with the path preexisting will be considered
-when creating a file.
-If you don\[aq]t care about where files and directories are created you
-likely shouldn\[aq]t be using a path preserving policy and instead
-something like \f[C]mfs\f[].
-.PP
-This can be especially apparent when filling an empty pool from an
-external source.
-If you do want path preservation you\[aq]ll need to perform the manual
-act of creating paths on the drives you want the data to land on before
-transferring your data.
-Setting \f[C]func.mkdir=epall\f[] can simplify managing path
-preservation for \f[C]create\f[].
 .SS Is my OS\[aq]s libfuse needed for mergerfs to work?
 .PP
 No.
@@ -2180,9 +2171,10 @@ availability you should stick with RAID.
 .SS Can drives be written to directly? Outside of mergerfs while pooled?
 .PP
 Yes, however its not recommended to use the same file from within the
-pool and from without at the same time.
+pool and from without at the same time (particularly writing).
 Especially if using caching of any kind (cache.files, cache.entry,
-cache.attr, cache.negative_entry, cache.symlinks, cache.readdir, etc.).
+cache.attr, cache.negative_entry, cache.symlinks, cache.readdir, etc.)
+as there could be a conflict between the cached and actual data.
 .SS Why do I get an "out of space" / "no space left on device" / ENOSPC
 error even though there appears to be lots of space available?
 .PP
@@ -2278,20 +2270,40 @@ Generally collision, if it occurs, shouldn\[aq]t be a problem.
 You can turn off the calculation by not using \f[C]use_ino\f[].
 In the future it might be worth creating different strategies for users
 to select from.
-.SS I notice massive slowdowns of writes over NFS
-.PP
-Due to how NFS works and interacts with FUSE when not using
-\f[C]cache.files=off\f[] or \f[C]direct_io\f[] its possible that a
-getxattr for \f[C]security.capability\f[] will be issued prior to any
-write.
-This will usually result in a massive slowdown for writes.
-Using \f[C]cache.files=off\f[] or \f[C]direct_io\f[] will keep this from
-happening (and generally good to enable unless you need the features it
-disables) but the \f[C]security_capability\f[] option can also help by
-short circuiting the call and returning \f[C]ENOATTR\f[].
-.PP
-You could also set \f[C]xattr\f[] to \f[C]noattr\f[] or \f[C]nosys\f[]
-to short circuit or stop all xattr requests.
+.SS I notice massive slowdowns of writes when enabling cache.files
+.PP
+When file caching is enabled in any form (\f[C]cache.files!=off\f[] or
+\f[C]direct_io=false\f[]) the kernel will issue \f[C]getxattr\f[]
+requests for \f[C]security.capability\f[] prior to \f[I]every single write\f[].
+This will usually result in a performance degradation, especially when
+using a network filesystem (such as NFS or CIFS/SMB/Samba).
+Unfortunately at this moment the kernel is not caching the responses.
+.PP
+To work around this situation mergerfs offers a few solutions.
+.IP "1." 3
+Set \f[C]security_capability=false\f[].
+It will short circuit any call and return \f[C]ENOATTR\f[].
+This still means mergerfs will receive the request before every
+write but at least it will not be passed through to the underlying
+filesystem.
+.IP "2." 3
+Set \f[C]xattr=noattr\f[].
+Same as above but applies to \f[I]all\f[] \f[C]getxattr\f[] calls, not
+just \f[C]security.capability\f[].
+These will not be cached by the kernel either but mergerfs\[aq] runtime
+config system will still function.
+.IP "3." 3
+Set \f[C]xattr=nosys\f[].
+Results in mergerfs returning \f[C]ENOSYS\f[] which \f[I]will\f[] be
+cached by the kernel.
+No future xattr calls will be forwarded to mergerfs.
+The downside is that this also means the xattr based config and query
+functionality won\[aq]t work either.
+.IP "4." 3
+Disable file caching.
+If you aren\[aq]t using applications which use \f[C]mmap\f[] it\[aq]s
+probably simpler to just disable it altogether.
+The kernel won\[aq]t send the requests when caching is disabled.
 .SS What are these .fuse_hidden files?
 .PP
 NOTE: mergerfs >= 2.26.0 will not have these temporary files.