
add more info on different caching techniques

pull/558/head
Antonio SJ Musumeci 6 years ago
parent
commit
3a1213435f
  1. 49 README.md
  2. 124 man/mergerfs.1

49 README.md

@@ -473,25 +473,50 @@ A B C
# CACHING
MergerFS does not natively support any sort of tiered caching. Most users have no use for such a feature and it would complicate the code. However, there are a few situations where a cache drive could help with a typical mergerfs setup.
#### page caching
The kernel performs caching of data pages on all files not opened with `O_DIRECT`. Because mergerfs uses FUSE and is therefore a userland process, the kernel can double cache the content being read through mergerfs: once for the underlying filesystem and once for mergerfs. Using `direct_io` and/or `dropcacheonclose` helps minimize the double caching. `direct_io` will instruct the kernel to bypass the page cache for files opened through mergerfs. `dropcacheonclose` will cause mergerfs to instruct the kernel to flush a file's page cache when that file is closed. If most data is read only once it's probably best to enable both (read above for details and limitations).
If a cache is desired for mergerfs do not enable `direct_io` and instead consider using `auto_cache` or `kernel_cache`. By default FUSE will invalidate cached pages when a file is opened. With `auto_cache` it will instead use `getattr` to check if a file has changed when the file is opened and, if so, flush the cache. `ac_attr_timeout` is the timeout for keeping said cache. Alternatively `kernel_cache` will keep the cache across opens unless invalidated through other means. You should only use these if you do not plan to write/modify the same files through mergerfs and the underlying filesystem at the same time, as doing so could lead to corruption. Then again, doing so without caching can also cause issues.
It's a difficult balance between memory usage, cache bloat & duplication, and performance. Ideally mergerfs would be able to disable caching for the files it reads/writes but allow page caching for itself. That would limit the FUSE overhead. However, there isn't a good way to achieve this.
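As a rough sketch of the two approaches above, the mounts might look like the following. The branch paths and mount point are assumptions, and the option set should be adapted to your setup.
```
# Read-once workloads: bypass the page cache and drop cached pages on close.
mergerfs -o defaults,allow_other,direct_io,dropcacheonclose=true \
  /mnt/disk0:/mnt/disk1 /media/pool

# Cache-friendly workloads: keep the page cache and revalidate it on open.
mergerfs -o defaults,allow_other,auto_cache,ac_attr_timeout=1 \
  /mnt/disk0:/mnt/disk1 /media/pool
```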
#### entry & attribute caching
Given the relatively high cost of FUSE due to the kernel <-> userspace round trips there are kernel side caches for file entries and attributes. The entry cache limits the `lookup` calls to mergerfs which ask if a file exists. The attribute cache limits the need to make `getattr` calls to mergerfs which provide file attributes (mode, size, type, etc.). As with the page cache these should not be used if the underlying filesystems are being manipulated at the same time as it could lead to odd behavior or data corruption. The options for setting these are `entry_timeout` and `negative_timeout` for the entry cache and `attr_timeout` for the attribute cache. `negative_timeout` refers to the timeout for negative responses to lookups (non-existent files).
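For example (a sketch only; the timeout values are illustrative, not recommendations):
```
# Cache positive lookups and attributes for 60 seconds and
# negative lookups (missing files) for 5 seconds.
mergerfs -o defaults,allow_other,entry_timeout=60,attr_timeout=60,negative_timeout=5 \
  /mnt/disk0:/mnt/disk1 /media/pool
```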
#### writeback caching
Writeback caching is a technique for improving write speeds by batching writes at a faster device and then bulk writing to the slower device. With FUSE the kernel will wait for a number of writes to be made and then send them to the filesystem as one request. mergerfs currently uses a slightly modified and vendored libfuse 2.9.7 which does not support writeback caching. However, a prototype port to libfuse 3.x has been made and the writeback cache appears to work as expected (though performance improvements greatly depend on the way the client app writes data). Once the port is complete and thoroughly tested, writeback caching will be available.
#### tiered caching
Some storage technologies support what is sometimes called "tiered" caching: the placing of usually smaller, faster storage as a transparent cache in front of larger, slower storage. NVMe, SSD, or Optane in front of traditional HDDs, for instance.
MergerFS does not natively support any sort of tiered caching. Most users have no use for such a feature and its inclusion would complicate the code. However, there are a few situations where a cache drive could help with a typical mergerfs setup.
1. Fast network, slow drives, many readers: You've a 10+Gbps network with many readers and your regular drives can't keep up.
2. Fast network, slow drives, small'ish bursty writes: You have a 10+Gbps network and wish to transfer amounts of data less than your cache drive but wish to do so quickly.
The below will mostly address use case #2. It will also work for #1 assuming the data is regularly accessed and was placed into the system via this method. Otherwise a similar script may need to be written to populate the cache from the backing pool.
With #1 it's arguable whether you should be using mergerfs at all. RAID would probably be the better solution. If you're going to use mergerfs there are other tactics that may help: spreading the data across drives (see the mergerfs.dup tool) and setting `func.open=rand`, using `symlinkify`, or using dm-cache or a similar technology to add a tiered cache to the underlying device.
With #2 one could use dm-cache as well but there is another solution which requires only mergerfs and a cronjob.
1. Create 2 mergerfs pools. One which includes just the backing drives and one which has both the cache drives (SSD,NVME,etc.) and backing drives.
1. Create 2 mergerfs pools. One which includes just the slow drives and one which has both the fast drives (SSD, NVMe, etc.) and slow drives. (A sketch of these mounts follows this list.)
2. The 'cache' pool should have the cache drives listed first.
3. The best policies to use for the 'cache' pool would probably be `ff`, `epff`, `lfs`, or `eplfs`. The latter two under the assumption that the cache drive(s) are far smaller than the backing drives. If using path preserving policies remember that you'll need to manually create the core directories of those paths you wish to be cached. (Be sure the permissions are in sync. Use `mergerfs.fsck` to check / correct them.)
4. Enable `moveonenospc` and set `minfreespace` appropriately.
3. The best `create` policies to use for the 'cache' pool would probably be `ff`, `epff`, `lfs`, or `eplfs`. The latter two under the assumption that the cache drive(s) are far smaller than the backing drives. If using path preserving policies remember that you'll need to manually create the core directories of those paths you wish to be cached. Be sure the permissions are in sync. Use `mergerfs.fsck` to check / correct them. You could also tag the slow drives as `=NC`, though that would mean that if the cache drives fill up you'd get "out of space" errors.
4. Enable `moveonenospc` and set `minfreespace` appropriately. Perhaps setting `minfreespace` to the size of the largest cache drive.
5. Set your programs to use the cache pool.
6. Save one of the below scripts.
7. Use `crontab` (as root) to schedule the command at whatever frequency is appropriate for your workflow.
6. Save one of the below scripts or create your own.
7. Use `cron` (as root) to schedule the command at whatever frequency is appropriate for your workflow.
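A minimal sketch of steps 1 through 4 (the drive paths, mount points, and `minfreespace` value here are assumptions for illustration, not recommendations):
```
# Backing pool: just the slow drives. Expired cache files land here.
mergerfs -o defaults,allow_other,use_ino \
  /mnt/slow0:/mnt/slow1 /media/slow

# 'cache' pool: fast drive listed first with a create policy (lfs) that
# favors it; moveonenospc relocates a file mid-write if a branch fills.
# See step 4 above for choosing minfreespace; 100G is just a placeholder.
mergerfs -o defaults,allow_other,use_ino,category.create=lfs,moveonenospc=true,minfreespace=100G \
  /mnt/fast0:/mnt/slow0:/mnt/slow1 /media/cache
```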
### Time based expiring
##### time based expiring
Move files from cache to backing pool based only on the last time the file was accessed.
Move files from cache to backing pool based only on the last time the file was accessed. Replace `-atime` with `-amin` if you want minutes rather than days. You may want to use the `fadvise` / `--drop-cache` version of rsync or run rsync with the tool "nocache".
```
#!/bin/bash
@@ -506,11 +531,11 @@ BACKING="${2}"
N=${3}
find "${CACHE}" -type f -atime +${N} -printf '%P\n' | \
  rsync --files-from=- -aq --remove-source-files "${CACHE}/" "${BACKING}/"
  rsync --files-from=- -axqHAXWES --preallocate --remove-source-files "${CACHE}/" "${BACKING}/"
```
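As an example of step 7, assuming the script above were saved as `/usr/local/bin/mergerfs.time-expire` (a hypothetical name) and the pools laid out as sketched earlier, a root crontab entry might look like:
```
# At 03:00 every night move files not accessed in 7 days
# from the cache drive to the backing pool.
0 3 * * * /usr/local/bin/mergerfs.time-expire /mnt/fast0 /media/slow 7
```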
### Percentage full expiring
##### percentage full expiring
Move the oldest file from the cache to the backing pool. Continue till below percentage threshold.
@@ -534,7 +559,7 @@ do
                  head -n 1 | \
                  cut -d' ' -f2-)
    test -n "${FILE}"
    rsync -aq --remove-source-files "${CACHE}/./${FILE}" "${BACKING}/"
    rsync -axqHAXWES --preallocate --remove-source-files "${CACHE}/./${FILE}" "${BACKING}/"
done
```
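The percentage based script can be scheduled the same way, assuming the same `<cache> <backing> <threshold>` argument order and a hypothetical name of `/usr/local/bin/mergerfs.percent-expire`:
```
# Every hour move the oldest files off the cache drive until it is
# below 70% full.
0 * * * * /usr/local/bin/mergerfs.percent-expire /mnt/fast0 /media/slow 70
```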

124 man/mergerfs.1

@@ -974,10 +974,82 @@ https://github.com/trapexit/bbf
bbf (bad block finder): a tool to scan for and \[aq]fix\[aq] hard drive
bad blocks and find the files using those blocks
.SH CACHING
.SS page caching
.PP
The kernel performs caching of data pages on all files not opened with
\f[C]O_DIRECT\f[].
Because mergerfs uses FUSE and is therefore a userland process the
kernel can double cache the content being read through mergerfs: once
for the underlying filesystem and once for mergerfs.
Using \f[C]direct_io\f[] and/or \f[C]dropcacheonclose\f[] helps
minimize the double caching.
\f[C]direct_io\f[] will instruct the kernel to bypass the page cache
for files opened through mergerfs.
\f[C]dropcacheonclose\f[] will cause mergerfs to instruct the kernel to
flush a file\[aq]s page cache when that file is closed.
If most data is read only once it\[aq]s probably best to enable both
(read above for details and limitations).
.PP
If a cache is desired for mergerfs do not enable \f[C]direct_io\f[] and
instead consider using \f[C]auto_cache\f[] or \f[C]kernel_cache\f[].
By default FUSE will invalidate cached pages when a file is opened.
With \f[C]auto_cache\f[] it will instead use \f[C]getattr\f[] to check
if a file has changed when the file is opened and, if so, flush the
cache.
\f[C]ac_attr_timeout\f[] is the timeout for keeping said cache.
Alternatively \f[C]kernel_cache\f[] will keep the cache across opens
unless invalidated through other means.
You should only use these if you do not plan to write/modify the same
files through mergerfs and the underlying filesystem at the same time,
as doing so could lead to corruption.
Then again, doing so without caching can also cause issues.
.PP
It\[aq]s a difficult balance between memory usage, cache bloat &
duplication, and performance.
Ideally mergerfs would be able to disable caching for the files it
reads/writes but allow page caching for itself.
That would limit the FUSE overhead.
However, there isn\[aq]t a good way to achieve this.
.SS entry & attribute caching
.PP
Given the relatively high cost of FUSE due to the kernel <\-> userspace
round trips there are kernel side caches for file entries and
attributes.
The entry cache limits the \f[C]lookup\f[] calls to mergerfs which ask
if a file exists.
The attribute cache limits the need to make \f[C]getattr\f[] calls to
mergerfs which provide file attributes (mode, size, type, etc.).
As with the page cache these should not be used if the underlying
filesystems are being manipulated at the same time as it could lead to
odd behavior or data corruption.
The options for setting these are \f[C]entry_timeout\f[] and
\f[C]negative_timeout\f[] for the entry cache and \f[C]attr_timeout\f[]
for the attribute cache.
\f[C]negative_timeout\f[] refers to the timeout for negative responses
to lookups (non\-existent files).
.SS writeback caching
.PP
Writeback caching is a technique for improving write speeds by batching
writes at a faster device and then bulk writing to the slower device.
With FUSE the kernel will wait for a number of writes to be made and
then send them to the filesystem as one request.
mergerfs currently uses a slightly modified and vendored libfuse 2.9.7
which does not support writeback caching.
However, a prototype port to libfuse 3.x has been made and the writeback
cache appears to work as expected (though performance improvements
greatly depend on the way the client app writes data).
Once the port is complete and thoroughly tested, writeback caching will
be available.
.SS tiered caching
.PP
Some storage technologies support what is sometimes called "tiered"
caching: the placing of usually smaller, faster storage as a
transparent cache in front of larger, slower storage.
NVMe, SSD, or Optane in front of traditional HDDs, for instance.
.PP
MergerFS does not natively support any sort of tiered caching.
Most users have no use for such a feature and it would complicate the
code.
Most users have no use for such a feature and its inclusion would
complicate the code.
However, there are a few situations where a cache drive could help with
a typical mergerfs setup.
.IP "1." 3 .IP "1." 3
@@ -988,41 +1060,55 @@ Fast network, slow drives, small\[aq]ish bursty writes: You have a
10+Gbps network and wish to transfer amounts of data less than your
cache drive but wish to do so quickly.
.PP
The below will mostly address use case #2.
It will also work for #1 assuming the data is regularly accessed and was
placed into the system via this method.
Otherwise a similar script may need to be written to populate the cache
from the backing pool.
With #1 it\[aq]s arguable whether you should be using mergerfs at all.
RAID would probably be the better solution.
If you\[aq]re going to use mergerfs there are other tactics that may
help: spreading the data across drives (see the mergerfs.dup tool) and
setting \f[C]func.open=rand\f[], using \f[C]symlinkify\f[], or using
dm\-cache or a similar technology to add a tiered cache to the
underlying device.
.PP
With #2 one could use dm\-cache as well but there is another solution
which requires only mergerfs and a cronjob.
.IP "1." 3 .IP "1." 3
Create 2 mergerfs pools. Create 2 mergerfs pools.
One which includes just the backing drives and one which has both the
cache drives (SSD,NVME,etc.) and backing drives.
One which includes just the slow drives and one which has both the fast
drives (SSD,NVME,etc.) and slow drives.
.IP "2." 3 .IP "2." 3
The \[aq]cache\[aq] pool should have the cache drives listed first. The \[aq]cache\[aq] pool should have the cache drives listed first.
.IP "3." 3 .IP "3." 3
The best policies to use for the \[aq]cache\[aq] pool would probably be
\f[C]ff\f[], \f[C]epff\f[], \f[C]lfs\f[], or \f[C]eplfs\f[].
The best \f[C]create\f[] policies to use for the \[aq]cache\[aq] pool
would probably be \f[C]ff\f[], \f[C]epff\f[], \f[C]lfs\f[], or
\f[C]eplfs\f[].
The latter two under the assumption that the cache drive(s) are far
smaller than the backing drives.
If using path preserving policies remember that you\[aq]ll need to
manually create the core directories of those paths you wish to be
cached.
(Be sure the permissions are in sync.
Use \f[C]mergerfs.fsck\f[] to check / correct them.)
Be sure the permissions are in sync.
Use \f[C]mergerfs.fsck\f[] to check / correct them.
You could also tag the slow drives as \f[C]=NC\f[], though that would
mean that if the cache drives fill up you\[aq]d get "out of space"
errors.
.IP "4." 3 .IP "4." 3
Enable \f[C]moveonenospc\f[] and set \f[C]minfreespace\f[] Enable \f[C]moveonenospc\f[] and set \f[C]minfreespace\f[]
appropriately. appropriately.
Perhaps setting \f[C]minfreespace\f[] to the size of the largest cache
drive.
.IP "5." 3 .IP "5." 3
Set your programs to use the cache pool. Set your programs to use the cache pool.
.IP "6." 3 .IP "6." 3
Save one of the below scripts.
Save one of the below scripts or create your own.
.IP "7." 3 .IP "7." 3
Use \f[C]crontab\f[] (as root) to schedule the command at whatever
Use \f[C]cron\f[] (as root) to schedule the command at whatever
frequency is appropriate for your workflow.
.SS Time based expiring
.SS time based expiring
.PP
Move files from cache to backing pool based only on the last time the
file was accessed.
Replace \f[C]\-atime\f[] with \f[C]\-amin\f[] if you want minutes rather
than days.
You may want to use the \f[C]fadvise\f[] / \f[C]\-\-drop\-cache\f[]
version of rsync or run rsync with the tool "nocache".
.IP
.nf
\f[C]
@@ -1038,10 +1124,10 @@ BACKING="${2}"
N=${3}
find\ "${CACHE}"\ \-type\ f\ \-atime\ +${N}\ \-printf\ \[aq]%P\\n\[aq]\ |\ \\
\ \ rsync\ \-\-files\-from=\-\ \-aq\ \-\-remove\-source\-files\ "${CACHE}/"\ "${BACKING}/"
\ \ rsync\ \-\-files\-from=\-\ \-axqHAXWES\ \-\-preallocate\ \-\-remove\-source\-files\ "${CACHE}/"\ "${BACKING}/"
\f[]
.fi
.SS Percentage full expiring
.SS percentage full expiring
.PP
Move the oldest file from the cache to the backing pool.
Continue till below percentage threshold.
@@ -1067,7 +1153,7 @@ do
\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ head\ \-n\ 1\ |\ \\
\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ cut\ \-d\[aq]\ \[aq]\ \-f2\-)
\ \ \ \ test\ \-n\ "${FILE}"
\ \ \ \ rsync\ \-aq\ \-\-remove\-source\-files\ "${CACHE}/./${FILE}"\ "${BACKING}/"
\ \ \ \ rsync\ \-axqHAXWES\ \-\-preallocate\ \-\-remove\-source\-files\ "${CACHE}/./${FILE}"\ "${BACKING}/"
done
\f[]
.fi
