Filesystems in the Linux kernel
This under-development manual will, some glorious day, provide comprehensive information on how the Linux virtual filesystem (VFS) layer works, along with the filesystems that sit below it. For now, what we have can be found below.
Core VFS documentation
See these manuals for documentation about the VFS layer itself and how its algorithms work.
- Overview of the Linux Virtual File System
- Introduction
- Registering and Mounting a Filesystem
- The Superblock Object
- The Inode Object
- The Address Space Object
- The File Object
- Directory Entry Cache (dcache)
- Mount Options
- Resources
- Introduction to pathname lookup
- RCU-walk — faster pathname lookup in Linux
- A walk among the symlinks
- The Linux VFS
- The proc filesystem
- Events based on file descriptors
- eventpoll (epoll) interfaces
- The Filesystem for Exporting Kernel Objects
- The debugfs filesystem
- splice API
- pipes API
- dentry_operations
- inode_operations
- xattr_handler operations
- super_operations
- file_system_type
- address_space_operations
- file_lock_operations
- lock_manager_operations
- buffer_head
- block_device_operations
- file_operations
- dquot_operations
- vm_operations_struct
- Implementation expectations (features and bugs :-))
- Configuration
- Example
- NOTE
- Request Basics
- Extent Mapping
- VFS -> File System Implementation
- 1. What’s New?
- Overview
- The Filesystem context
- The Filesystem Context Operations
- Filesystem context Security
- VFS Filesystem context API
- Superblock Creation Helpers
- Parameter Description
- Parameter Helper Functions
- Quota netlink interface
- Deprecated create_proc_entry
- The iterator interface
- Formatted output
- Making it all work
- seq_list
- The extra-simple version
- 1) Overview
- 2) Features
- 3) Setting mount states
- 4) Use cases
- 5) Detailed semantics
- 6) Quiz
- 7) FAQ
- 8) Implementation
- Formal notes
- General notes
- Translation algorithms
- Idmappings when creating filesystem objects
- Idmappings on idmapped mounts
- In-Kernel Automounting
- Automatic Mountpoint Expiry
- Userspace Driven Expiry
- General Filesystem Caching
- Network Filesystem Caching API
- Cache Backend API
- Cache on Already Mounted Filesystem
Filesystem support layers
Documentation for the support code within the filesystem layer for use in filesystem implementations.
- The Linux Journalling API
- Overview
- Data Types
- Functions
- See also
- Introduction
- Threat model
- Key hierarchy
- Encryption modes and usage
- User API
- Access semantics
- Encryption policy enforcement
- Inline encryption support
- Direct I/O support
- Implementation details
- Tests
- Introduction
- Use cases
- User API
- Accessing verity files
- File digest computation
- Built-in signature verification
- Filesystem support
- Implementation details
- Userspace utility
- Tests
- FAQ
- Overview
- Per-Inode Context
- Buffered Read Helpers
- API Function Reference
Filesystems
Documentation for filesystem implementations.
- v9fs: Plan 9 Resource Sharing for Linux
- About
- Usage
- Options
- Behavior
- Resources
- Filesystems supported by ADFS
- Mount options for ADFS
- Mapping of ADFS permissions to Linux permissions
- RISC OS file type suffix
- Mount options for the AFFS
- Handling of the Users/Groups and protection flags
- Symbolic links
- Examples
- IMPORTANT NOTE
- Bugs, Restrictions, Caveats
- Overview
- Compilation
- Usage
- Mountpoints
- Dynamic Root
- Proc Filesystem
- The Cell Database
- Security
- The @sys Substitution
- Purpose
- Context
- Content
- Mount Traps
- Mountpoint expiry
- Communicating with autofs: detecting the daemon
- Communicating with autofs: the event pipe
- Communicating with autofs: root directory ioctls
- Communicating with autofs: char-device ioctls
- Catatonic mode
- The "ignore" mount option
- autofs, name spaces, and shared mounts
- The problem
- The Solution
- autofs Miscellaneous Device mount control interface
- The ioctls
- Warning
- License
- Author
- What is this Driver?
- Which is it, BFS or BEFS?
- How to Install
- Using BFS
- Mount Options
- How to Get Latest Version
- Any Known Bugs?
- Special Thanks
- Mount Syntax
- Mount Options
- More Information
- 1. Introduction
- 2. Servicing Coda filesystem calls
- 3. The message layer
- 4. The interface at the call level
- 5. The minicache and downcalls
- 6. Initialization and cleanup
- What is configfs?
- Using configfs
- Configuring FakeNBD: an Example
- Coding With configfs
- struct config_item
- struct config_item_type
- struct configfs_attribute
- struct configfs_bin_attribute
- struct config_group
- struct configfs_subsystem
- An Example
- Hierarchy Navigation and the Subsystem Mutex
- Item Aggregation Via symlink(2)
- Automatically Created Subgroups
- Dependent Subsystems
- Usage Notes
- Memory Mapped cramfs image
- Tools
- For /usr/share/magic
- Hacker Notes
- Motivation
- Usage
- Enabling DAX on ext2 and erofs
- Enabling DAX on xfs and ext4
- Summary
- Details
- Enabling DAX on virtiofs
- Implementation Tips for Block Driver Writers
- Implementation Tips for Filesystem Writers
- Handling Media Errors
- Shortcomings
- Credits
- Caveats
- Mount options
- Usage
- Setup
- Locking
- See Also
- Mount-wide Passphrase
- Notes
- Overview
- Mount options
- Sysfs Entries
- On-disk details
- Options
- Specification
- References
- 1. About this Book
- 2. High Level Design
- 3. Global Structures
- 4. Dynamic Structures
- Background and Design issues
- Key Features
- Mount Options
- Debugfs Entries
- Sysfs Entries
- Usage
- Design
- A list of GFS2 uevents
- Information common to all GFS2 uevents (uevent environment variables)
- Glock Statistics
- Mount options
- Writing to HFS Filesystems
- Creating HFS filesystems
- Credits
- Mount options
- References
- Credits
- File names
- Extended attributes
- Symlinks
- Codepages
- Known bugs
- What does "unbalanced tree" message mean?
- Bugs in OS/2
- Codepage bugs described above
- History
- Definitions
- What is FUSE?
- Filesystem type
- Mount options
- Control filesystem
- Aborting a filesystem connection
- How do non-privileged mounts work?
- How are requirements fulfilled?
- I think these limitations are unacceptable?
- Kernel — userspace interface
- Caveats
- Mount options
- Ioctls
- NILFS2 usage
- Disk format
- NFSv4 client identifier
- See Also
- Making Filesystems Exportable
- Reference counting in pnfs
- RPC Cache
- rpcsec_gss support for kernel RPC servers
- NFSv4.1 Server Implementation
- Kernel NFS Server Statistics
- Reexporting NFS filesystems
- Overview
- Web site
- Features
- Supported mount options
- Known bugs and (mis-)features
- Using NTFS volume and stripe sets
- Summary and Features
- Mount Options
- Todo list
- References
- Credits
- Caveats
- Mount options
- Introduction
- Scope
- User interface
- Fixing stuff
- Overview
- Options
- Disk format
- Mailing List Archives
- Mailing List Submissions
- Documentation
- Running ORANGEFS On a Single Server
- Userspace Filesystem Source
- Building ORANGEFS on a Single Server
- Running xfstests
- Options
- Debugging
- Protocol between Kernel Module and Userspace
- Overlay objects
- Upper and Lower
- Directories
- whiteouts and opaque directories
- readdir
- renaming directories
- Non-directories
- Permission model
- Multiple lower layers
- Metadata only copy up
- Data-only lower layers
- Sharing and copying layers
- Non-standard behavior
- Changes to underlying filesystems
- NFS export
- Volatile mount
- User xattr
- Testsuite
- Preface
- Chapter 1: Collecting System Information
- Chapter 2: Modifying System Parameters
- Chapter 3: Per-process Parameters
- Chapter 4: Configuring procfs
- Chapter 5: Filesystem behavior
- Option
- Specification
- What is ramfs?
- ramfs and ramdisk:
- ramfs and tmpfs:
- What is rootfs?
- What is initramfs?
- Populating initramfs:
- External initramfs images:
- Contents of initramfs:
- Why cpio rather than tar?
- Future directions:
- Semantics
- klog and relay-apps example code
- The relay interface user space API
- The relay interface kernel API
- Resources
- Credits
- KSMBD — SMB3 Kernel Server
- Mounting root file system via SMB (cifs.ko)
- spufs
- spu_create
- spu_run
- 1. Filesystem Features
- 2. Using Squashfs
- 3. Squashfs Filesystem Design
- 3.1 Compression options
- 3.2 Inodes
- 3.3 Directories
- 3.4 File data
- 3.5 Fragment lookup table
- 3.6 Uid/gid lookup table
- 3.7 Export table
- 3.8 Xattr table
- 4. TODOs and Outstanding Issues
- 4.1 TODO list
- 4.2 Squashfs Internal Cache
- What it is
- Using sysfs
- Directory Creation
- Attributes
- Subsystem-Specific Callbacks
- Reading/Writing Attribute Data
- Top Level Directory Layout
- Current Interfaces
- Documentation
- Introduction
- Mount options
- Quick usage instructions
- References
- Introduction
- UBIFS Authentication
- Future Extensions
- References
- Introduction
- Usage
- Internals
- USING VFAT
- VFAT MOUNT OPTIONS
- LIMITATION
- TODO
- POSSIBLE PROBLEMS
- TEST SUITE
- NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
- Preamble
- Introduction
- Transactions in XFS
- Transactions are Asynchronous
- Transaction Reservations
- Log Space Accounting
- Re-logging Explained
- Delayed Logging: Concepts
- Delayed Logging: Design
- Introduction
- Self Describing Metadata
- Runtime Validation
- Structures
- Inodes and Dquots
- 1. What is a Filesystem Check?
- 2. Theory of Operation
- 3. Testing Plan
- 4. User Interface
- 5. Kernel Algorithms and Data Structures
- 6. Userspace Algorithms and Data Structures
- 7. Conclusion and Future Work
- Introduction
- Zonefs Overview
- Zonefs User Space Tools
File management in the Linux kernel
This document describes how locking works for files (struct file) and for the file descriptor table (struct files_struct).
Up until 2.6.12, the file descriptor table was protected by a lock (files->file_lock) and a reference count (files->count). ->file_lock protected accesses to all the file-related fields of the table. ->count was used for sharing the file descriptor table between tasks cloned with the CLONE_FILES flag. Typically this would be the case for POSIX threads. As with the common refcounting model in the kernel, the last task doing a put_files_struct() frees the file descriptor (fd) table. The files (struct file) themselves are protected using a reference count (->f_count).
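The "last put frees the table" rule described above can be sketched in userspace C11. This is only an illustration under stated assumptions: struct demo_files and the demo_* functions are hypothetical names, not kernel APIs, and the real files_struct carries far more state than a single counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for files_struct: only the sharing count is modelled. */
struct demo_files {
    atomic_int count;   /* number of tasks sharing this table */
};

/* Allocate a table owned by one task; returns NULL on allocation failure. */
struct demo_files *demo_files_alloc(void)
{
    struct demo_files *files = malloc(sizeof(*files));

    if (files)
        atomic_init(&files->count, 1);
    return files;
}

/* A new task sharing the table (as with CLONE_FILES) takes a reference. */
void demo_get_files(struct demo_files *files)
{
    atomic_fetch_add(&files->count, 1);
}

/*
 * Drop one reference. The caller that drops the last one frees the table.
 * The return value exists only so the sketch can be checked: it is true
 * when this call freed the table.
 */
bool demo_put_files(struct demo_files *files)
{
    if (atomic_fetch_sub(&files->count, 1) == 1) {
        free(files);
        return true;
    }
    return false;
}
```

With two sharers, the first demo_put_files() call returns false and the second returns true, after which the pointer must no longer be used.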
In the new lock-free model of file descriptor management, the reference counting is similar, but the locking is based on RCU. The file descriptor table contains multiple elements: the fd sets (open_fds and close_on_exec), the array of file pointers, the sizes of the sets and the array, and so on. In order for updates to appear atomic to a lock-free reader, all the elements of the file descriptor table are kept in a separate structure, struct fdtable. files_struct contains a pointer to struct fdtable through which the actual fd table is accessed. Initially the fdtable is embedded in files_struct itself. On a subsequent expansion of the fdtable, a new fdtable structure is allocated and files->fdt points to the new structure. The old fdtable structure is freed with RCU, and lock-free readers see either the old fdtable or the new fdtable, making the update appear atomic. Here are the locking rules for the fdtable structure:
- All references to the fdtable must be done through the files_fdtable() macro:
    struct fdtable *fdt;

    rcu_read_lock();

    fdt = files_fdtable(files);
    ...
    if (n <= fdt->max_fds)
        ...
    ...
    rcu_read_unlock();
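- To look up the file structure for a given fd, a reader must use lookup_fd_rcu() or files_lookup_fd_rcu() inside an RCU read-side critical section:

    struct file *file;

    rcu_read_lock();
    file = lookup_fd_rcu(fd);
    if (file) {
        ...
    }
    ...
    rcu_read_unlock();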
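- Handling of the file structures is special. Because the fd lookup is lock-free, it can race with the last put operation on the file structure. This is avoided by using atomic_long_inc_not_zero() on ->f_count:

    rcu_read_lock();
    file = files_lookup_fd_rcu(files, fd);
    if (file) {
        if (atomic_long_inc_not_zero(&file->f_count))
            *fput_needed = 1;
        else
            /* Didn't get the reference, someone's freed */
            file = NULL;
    }
    rcu_read_unlock();
    ...
    return file;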
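- While updating, the fdtable pointer must be looked up while holding files->file_lock, and it must be reloaded after any call that may have expanded the table:

    spin_lock(&files->file_lock);
    fd = locate_fd(files, file, start);
    if (fd >= 0) {
        /* locate_fd() may have expanded fdtable, load the ptr */
        fdt = files_fdtable(files);
        __set_open_fd(fd, fdt);
        __clear_close_on_exec(fd, fdt);
        spin_unlock(&files->file_lock);
    ...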
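The "take a reference only if the count is not already zero" step used in the lock-free lookup can be approximated in userspace C11 with a compare-and-exchange loop. This is a minimal sketch, not the kernel implementation: struct demo_file, struct demo_fdtable, DEMO_MAX_FDS, demo_inc_not_zero() and demo_fget() are hypothetical names, the table is fixed-size, and there is no RCU here, so nothing in the sketch keeps a file's memory alive while a racing reader inspects it. In the kernel that guarantee comes from the RCU read-side critical section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: only the reference count is modelled. */
struct demo_file {
    atomic_long f_count;
};

/* Hypothetical fixed-size table; the real fdtable is resizable and RCU-managed. */
#define DEMO_MAX_FDS 4
struct demo_fdtable {
    _Atomic(struct demo_file *) fd[DEMO_MAX_FDS];
};

/*
 * Userspace analogue of atomic_long_inc_not_zero(): increment the counter
 * unless it is zero. Returns true if the increment happened.
 */
bool demo_inc_not_zero(atomic_long *v)
{
    long old = atomic_load(v);

    while (old != 0) {
        /* On failure, old is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}

/*
 * Look up fd and take a reference. Returns NULL if fd is out of range,
 * the slot is empty, or the file's count has already dropped to zero.
 */
struct demo_file *demo_fget(struct demo_fdtable *fdt, unsigned int fd)
{
    struct demo_file *file;

    if (fd >= DEMO_MAX_FDS)
        return NULL;

    file = atomic_load_explicit(&fdt->fd[fd], memory_order_acquire);
    if (file && !demo_inc_not_zero(&file->f_count))
        file = NULL;    /* lost the race with the last put */
    return file;
}
```

The design point is that a plain increment would resurrect a file whose count has already reached zero. The conditional increment makes the lookup fail instead, which is the behaviour the lookup code above relies on.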