Directories and files
===========================

When talking about file systems, many people expect directories, listing the files under a directory, and so on. These features are needed if we want to hook up Seaweed File System with Linux via FUSE, with Hadoop, etc.

Sample usage
#####################

Two ways to start a weed filer:

.. code-block:: bash

  # assuming you already started weed master and weed volume
  weed filer

  # Or assuming you have nothing started yet,
  # this command starts master server, volume server, and filer in one shot.
  # It's strictly the same as starting them separately.
  weed server -filer=true

Now you can add and delete files, and even browse the sub directories and files:

.. code-block:: bash

  # POST a file and read it back
  curl -F "filename=@README.md" "http://localhost:8888/path/to/sources/"
  curl "http://localhost:8888/path/to/sources/README.md"

  # POST a file with a new name and read it back
  curl -F "filename=@Makefile" "http://localhost:8888/path/to/sources/new_name"
  curl "http://localhost:8888/path/to/sources/new_name"

  # list sub folders and files
  curl "http://localhost:8888/path/to/sources/?pretty=y"

  # if there are lots of files under this folder, here is a way to efficiently paginate through all of them
  curl "http://localhost:8888/path/to/sources/?lastFileName=abc.txt&limit=50&pretty=y"

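
The pagination pattern above can be sketched as a loop that keeps passing the last name it received as ``lastFileName``. This is only an illustration under stated assumptions: ``fetch_page`` is a hypothetical callable standing in for the HTTP request, ``make_fake_fetch`` emulates a server that returns sorted names strictly after the given name, and the filer's actual response format is not modelled here.

.. code-block:: python

  import bisect


  def list_all(fetch_page, limit=50):
      """Collect every name by repeatedly asking for the page after the last name seen."""
      names = []
      last = ""
      while True:
          page = fetch_page(last, limit)
          names.extend(page)
          if len(page) < limit:
              # A short (or empty) page means there is nothing left to fetch.
              return names
          last = page[-1]


  def make_fake_fetch(sorted_names):
      """Hypothetical stand-in for the server: names after last_file_name, up to limit."""
      def fetch_page(last_file_name, limit):
          start = bisect.bisect_right(sorted_names, last_file_name)
          return sorted_names[start:start + limit]
      return fetch_page


  all_names = sorted("file%03d.txt" % i for i in range(120))
  listed = list_all(make_fake_fetch(all_names), limit=50)
  print(len(listed))

Because each request only needs the last name from the previous page, the client keeps no other state between requests.
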
Design
############

A common file system uses inodes to store metadata for each folder and file. The folder tree structure is usually linked, and sub folders and files are usually organized as an on-disk B+tree or a similar variation. This scales well in terms of storage, but not for fast file retrieval, because multiple disk accesses are needed just for the file metadata before even trying to get the file content.

Seaweed-FS wants to make as few disk accesses as possible, yet still be able to store a lot of file metadata. So we need to think very differently.

We can take the following steps to map a full file path to the actual data block:

.. code-block:: bash

  file_parent_directory => directory_id
  directory_id+fileName => file_id
  file_id => data_block

Because default Seaweed-FS only provides the file_id=>data_block mapping, only the first 2 steps need to be implemented.

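
As a minimal sketch of the first two steps, the two mappings can be pictured as plain lookup tables. The dictionaries, directory ids, and the file id value below are made up for illustration; they are not the filer's actual storage layout.

.. code-block:: python

  # Hypothetical in-memory tables standing in for the two filer mappings.
  directory_ids = {"/": 1, "/path": 2, "/path/to": 3, "/path/to/sources": 4}
  file_ids = {(4, "README.md"): "3,01637037d6"}  # made-up file id


  def lookup_file_id(full_path):
      """Map a full file path to a file_id using steps 1 and 2."""
      parent, _, file_name = full_path.rpartition("/")
      directory_id = directory_ids[parent or "/"]   # step 1: parent directory => directory_id
      return file_ids[(directory_id, file_name)]    # step 2: directory_id+fileName => file_id


  file_id = lookup_file_id("/path/to/sources/README.md")
  print(file_id)

Step 3, file_id => data_block, is left to Seaweed-FS itself and is not part of the sketch.
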
There are several data features I noticed:

* the number of directories is usually small, or very small
* the number of files can be small, medium, large, or very large

This leads to a novel (as far as I know now) approach: organize the metadata for directories and files separately.

A "weed filer" server provides the two missing mappings, parent_directory=>directory_id and directory_id+filename=>file_id, completing the "common" file storage interface.

Assumptions
###############

I believe these are reasonable assumptions:

* The number of directories is smaller than the number of files by one or more orders of magnitude.
* Very likely for big systems, the number of files under one particular directory can be very high, ideally unlimited, far exceeding the number of directories.
* Directory metadata is accessed very often.

Data structure
#################

These assumed differences between directories and files lead to the design that the metadata for directories and files should have different data structures.

* Store directories in memory

  * all directories should hopefully fit in memory
  * efficient to move/rename/list_directories

* Store files in a sorted string table in <dir_id/filename, file_id> format

  * efficient to list_files, just a simple iterator
  * efficient to locate files, binary search

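
The file side of this design can be sketched with a toy sorted table. This is an assumption-laden illustration, not LevelDB and not the filer's code: ``FileTable`` and its methods are hypothetical names, keys are built as ``dir_id/filename`` strings, and the file ids are made up.

.. code-block:: python

  import bisect


  class FileTable:
      """Toy sorted table of "dir_id/filename" keys; a stand-in, not LevelDB itself."""

      def __init__(self):
          self.keys = []      # kept sorted at all times
          self.file_ids = []  # file_ids[i] belongs to keys[i]

      def put(self, dir_id, filename, file_id):
          key = "%d/%s" % (dir_id, filename)
          i = bisect.bisect_left(self.keys, key)
          if i < len(self.keys) and self.keys[i] == key:
              self.file_ids[i] = file_id          # overwrite an existing entry
          else:
              self.keys.insert(i, key)
              self.file_ids.insert(i, file_id)

      def locate(self, dir_id, filename):
          """Binary search for one file; returns None when it is missing."""
          key = "%d/%s" % (dir_id, filename)
          i = bisect.bisect_left(self.keys, key)
          if i < len(self.keys) and self.keys[i] == key:
              return self.file_ids[i]
          return None

      def list_files(self, dir_id):
          """Scan forward from the first key that carries the directory's prefix."""
          prefix = "%d/" % dir_id
          i = bisect.bisect_left(self.keys, prefix)
          names = []
          while i < len(self.keys) and self.keys[i].startswith(prefix):
              names.append(self.keys[i][len(prefix):])
              i += 1
          return names


  table = FileTable()
  table.put(4, "README.md", "3,01637037d6")   # file ids here are made up
  table.put(4, "Makefile", "3,02a1b2c3d4")
  table.put(40, "other.txt", "5,03e5f6a7b8")
  print(table.locate(4, "README.md"))
  print(table.list_files(4))

Because every key of one directory shares the same prefix, its files sit next to each other in sorted order, which is what makes listing a simple forward scan.
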
Complexity
###################

For one file retrieval, if the path to the parent directory includes n folders, it will take n steps to navigate from the root to the file's folder. However, this O(n) traversal is all in memory, so in practice it will be very fast.

For one file retrieval, the dir_id+filename=>file_id lookup will be O(logN) using LevelDB, a log-structured merge (LSM) tree implementation. The complexity is the same as a B-Tree.

For file listing under a particular directory, the listing in LevelDB is just a simple scan, since the records in LevelDB are already sorted. For a B-Tree, this may involve multiple disk seeks to jump through.

For directory renaming, it's just a trivial change to the name or parent of the directory. Since the directory_id stays the same, there is no change to the files' metadata.

For file renaming, it's just a trivial delete and then add of a row in LevelDB.

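
The renaming behaviour can be illustrated with a small in-memory sketch. All names here (``Directory``, ``find_dir``, ``rename_dir``, ``rename_file``) are hypothetical, a plain dict stands in for the file table, and the file id is made up.

.. code-block:: python

  class Directory:
      """In-memory directory node; the id never changes after creation."""

      def __init__(self, dir_id, name, parent=None):
          self.dir_id = dir_id
          self.name = name
          self.parent = parent
          self.children = {}
          if parent is not None:
              parent.children[name] = self


  def find_dir(root, path):
      """Walk from the root one component at a time: n in-memory steps for n folders."""
      node = root
      for part in path.strip("/").split("/"):
          if part:
              node = node.children[part]
      return node


  def rename_dir(directory, new_name):
      """Only the directory node changes; file rows keyed by dir_id are untouched."""
      parent = directory.parent
      del parent.children[directory.name]
      directory.name = new_name
      parent.children[new_name] = directory


  def rename_file(files, dir_id, old_name, new_name):
      """Delete the old row and add a new one that carries the same file_id."""
      files[(dir_id, new_name)] = files.pop((dir_id, old_name))


  root = Directory(1, "")
  path_dir = Directory(2, "path", root)
  to_dir = Directory(3, "to", path_dir)
  sources = Directory(4, "sources", to_dir)
  files = {(4, "README.md"): "3,01637037d6"}  # keyed by (directory_id, filename)

  rename_dir(sources, "src")
  moved = find_dir(root, "/path/to/src")
  print(moved.dir_id, files[(moved.dir_id, "README.md")])

After the directory rename, the file is still found under the same directory_id, so no file row had to be rewritten.
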
Details
########################

In the current first version, the path_to_file=>file_id mapping is stored in an efficient embedded LevelDB. Being embedded, it runs on a single machine, so it's not linearly scalable yet. However, it can handle LOTS AND LOTS of files on Seaweed-FS on other master/volume servers.

Switching from the embedded LevelDB to an external distributed database is very feasible. Your contribution is welcome!

The in-memory directory structure can improve on memory efficiency. The current simple in-memory map works when the number of directories is less than 1 million, which will use about 500MB of memory (roughly 500 bytes per directory). But I would expect the common use case to have only a few directories, not even more than 100.

Use Cases
#########################

Clients can access one "weed filer" via HTTP: list files under a directory, create files via HTTP POST, and read files via HTTP GET directly.

Although one "weed filer" can only sit on one machine, you can start multiple "weed filer" instances on several machines, each instance running in its own collection with its own namespace, but sharing the same Seaweed-FS storage.

Future
###################

In a future version, the parent_directory=>directory_id and directory_id+filename=>file_id mappings will be refactored to support different storage systems.

The directory metadata may be switched to some other in-memory database.

The LevelDB implementation may be switched underneath to external data storage, e.g. MySQL, TokyoCabinet, etc. Preferably some pure-Go implementation.

Also, an HA feature will be added, so that multiple "weed filer" instances can share the same view of files.

Later, FUSE or HCFS plugins will be created, to really integrate Seaweed-FS into existing systems.

Help Wanted
########################

This is a big step towards more interesting Seaweed-FS usage and integration with existing systems.

If you can help to refactor and implement other directory metadata or file metadata storage, please do so.