Demystifying the Linux Kernel Socket File System (Sockfs)
All Linux networking starts with a system call that creates a network socket (the socket system call). The socket system call returns an integer (the socket descriptor).
“Writing” to or “reading” from that socket descriptor (as though it were a file) using the generic system calls write / read respectively creates network traffic (TCP traffic, for a connected stream socket) rather than file-system writes/reads.
Note: Had the descriptor been a “regular” file descriptor, it would have been created by the open system call and intended for “regular” file-system writes and reads (via system calls write/read respectively) to files etc.
Further Note: This implies that the socket descriptor created by the socket system call will be used by the systems programmer to write/read using the same system calls write/read used for “regular” file-system writes/reads (system calls that would, under normal and other circumstances, write/read data to/from a file).
Further further Note: A write system call (to the descriptor that was created by the socket system call) must translate “magically” into a TCP transaction that “writes” the data across the network (ostensibly to the peer on the other end), with the data “written” encapsulated within the payload section of a TCP segment.
This process of adapting and hijacking the kernel file-system infrastructure to incorporate network/socket operations is called SOCKFS (Socket File System).
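To make the notes above concrete, here is a minimal user-space sketch, assuming a Linux (or other POSIX) host with a working loopback interface. The helper name loopback_write_read is hypothetical and not part of any library. It opens a TCP connection to itself on 127.0.0.1 and moves data with nothing but the generic write and read system calls.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): sends msg, including its
 * terminating NUL, over a loopback TCP connection using plain
 * write()/read(). Returns the number of bytes read, or -1 on error. */
ssize_t loopback_write_read(const char *msg, char *out, size_t outlen)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    ssize_t n = -1;
    int lfd, cfd = -1, afd = -1;

    lfd = socket(AF_INET, SOCK_STREAM, 0);   /* listening socket */
    if (lfd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* kernel picks a free port */

    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0)
        goto out;

    cfd = socket(AF_INET, SOCK_STREAM, 0);   /* client socket */
    if (cfd < 0 ||
        connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    afd = accept(lfd, NULL, NULL);           /* server side of the connection */
    if (afd < 0)
        goto out;

    /* Generic file I/O system calls, applied to socket descriptors. */
    if (write(cfd, msg, strlen(msg) + 1) < 0)
        goto out;
    n = read(afd, out, outlen);

out:
    if (afd >= 0)
        close(afd);
    if (cfd >= 0)
        close(cfd);
    close(lfd);
    return n;
}
```

Calling loopback_write_read("hello", buf, sizeof(buf)) from a main( ) of your own should leave the string in buf. Nothing in the sketch calls send or recv; the socket descriptors are treated exactly like file descriptors.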
So how does the Linux kernel accomplish this process, where a file-system write is “faked” into a network-system “write”, if indeed it can be called that?
Well… as is usually the case, the Linux kernel’s method begins at system/kernel initialization, when a special socket file system (the statically defined sock_fs_type) for networks is “registered” by register_filesystem. This happens in sock_init. File systems are normally registered so that disk partitions can be mounted with that file-system type.
The kernel registers the file-system type sock_fs_type so that it can create a fake mount point using kern_mount (for the file system sock_fs_type). This mount point is necessary if the kernel is to later create a “fake file” (a struct file) using the existing, generic mechanisms and infrastructure of the Virtual File System (VFS), which assume that a mount point is available.
Note: No “actual” mount point exists in the directory tree. The mount is internal to the kernel and never shows up as a path that a user can browse.
We will blog on file systems later.
Then, when the socket system call is invoked, the kernel executes sock_create to create the new socket (a struct socket). The kernel then executes sock_map_fd, which reserves an unused descriptor (the socket descriptor) and creates a “fake file” for the socket. The “fake” file’s ops (file->f_op) are initialized to socket_file_ops (statically defined at compile time in net/socket.c).
The kernel maps the socket descriptor to the new “fake” file using fd_install.
This socket descriptor is returned by the socket system call (as required by the man page of the socket system call) to the user program.
I only call it a “fake” file because a write system call executed against that socket descriptor will use the VFS infrastructure created, but the data will not be written into a disk file anywhere. It will, instead, be translated into a network operation because of the f_op assigned to the “fake” file (socket_file_ops).
The kernel is now set up to create network traffic when the system calls write/read are executed on the “fake” file’s descriptor (the socket descriptor), which was returned to the user when the socket system call was executed.
In point of fact, a write system call on the “fake” file’s socket descriptor will translate into a call to __sock_sendmsg within the kernel instead of a write into the “regular” file system, because that is how socket_file_ops is statically defined before assignment to the “fake” file.
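The dispatch described above is easier to see in miniature. What follows is a user-space model, not kernel source: every name prefixed with model_ is hypothetical, and the structures are heavily simplified stand-ins for struct file and struct file_operations. It only shows how one generic write entry point ends up in different code depending on the ops table attached to the file.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct model_file;

/* Simplified stand-in for the kernel's struct file_operations. */
struct model_file_ops {
    long (*write)(struct model_file *f, const char *buf, size_t len);
};

/* Simplified stand-in for the kernel's struct file. */
struct model_file {
    const struct model_file_ops *f_op;
    char log[32];                 /* records which path handled the write */
};

/* What a "regular" file's write handler would do (modelled). */
static long model_disk_write(struct model_file *f, const char *buf, size_t len)
{
    (void)buf;
    snprintf(f->log, sizeof(f->log), "disk:%zu", len);
    return (long)len;
}

/* What the socket's write handler would do (modelled). */
static long model_sock_sendmsg(struct model_file *f, const char *buf, size_t len)
{
    (void)buf;
    snprintf(f->log, sizeof(f->log), "net:%zu", len);
    return (long)len;
}

/* Ops tables fixed at compile time, like socket_file_ops in the text. */
static const struct model_file_ops model_regular_file_ops = {
    .write = model_disk_write,
};
static const struct model_file_ops model_socket_file_ops = {
    .write = model_sock_sendmsg,
};

/* The generic entry point never asks what kind of file it was given. */
long model_vfs_write(struct model_file *f, const char *buf, size_t len)
{
    return f->f_op->write(f, buf, len);
}
```

A model_file whose f_op points at model_socket_file_ops records "net:2" after a two-byte model_vfs_write, while one pointing at model_regular_file_ops records "disk:2". The caller’s code is identical in both cases, which is the whole trick.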
And then we are into networking space. And the promised land of milk, honey, TCP traffic, SOCKFS and file systems.
No one said understanding the kernel was easy. But extreme gratification awaits those who work on it, and it also creates enormous opportunities for innovation. I explain Linux kernel concepts and more in my classes (Advanced Linux Kernel Programming @UCSC-Extension, and also in other classes that I teach independently).
As always, feedback, questions and comments are appreciated and will be responded to. I would like to listen to gripes, especially if you also PayPal me some. Thanks.
About Anand
Anand is a veteran of Silicon Valley with development experience and patents that span Processors, Operating systems, Networking and Systems development. Anand has been working for the past few years with Service Providers and large Enterprises developing e and Training systems.
Understanding the Linux Kernel, Second Edition by
System Calls Related to Networking
We won’t be able to discuss all system calls related to networking. However, we shall examine the basic ones, namely those needed to send a UDP datagram.
In most Unix-like systems, the User Mode code fragment that sends a datagram looks like the following:
    int sockfd; /* socket descriptor */
    struct sockaddr_in addr_local, addr_remote; /* IPv4 address descriptors */
    const char mesg[] = "Hello, how are you?";

    sockfd = socket(PF_INET, SOCK_DGRAM, 0);

    addr_local.sin_family = AF_INET;
    addr_local.sin_port = htons(50000);
    addr_local.sin_addr.s_addr = htonl(0xc0a050f0); /* 192.160.80.240 */
    bind(sockfd, (struct sockaddr *) &addr_local, sizeof(struct sockaddr_in));

    addr_remote.sin_family = AF_INET;
    addr_remote.sin_port = htons(49152);
    inet_pton(AF_INET, "192.160.80.110", &addr_remote.sin_addr);
    connect(sockfd, (struct sockaddr *) &addr_remote, sizeof(struct sockaddr_in));

    write(sockfd, mesg, strlen(mesg)+1);
Obviously, this listing does not represent the complete source code of the program. For instance, we have not defined a main( ) function, we have omitted the proper #include directives for loading the header files, and we have not checked the return values of the system calls. However, the listing includes all network-related system calls issued by the program to send a UDP datagram.
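For readers who want something they can compile, here is one possible fuller version of the same socket / bind / connect / write sequence, with the header files and return-value checks filled in. It is a sketch under stated assumptions rather than the book’s code: the helper name udp_loopback_send is hypothetical, the addresses are replaced by the loopback address with kernel-chosen ports, and a second local socket stands in for the remote host so that the datagram can be read back.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): sends mesg, including its
 * terminating NUL, as one UDP datagram over loopback and reads it back.
 * Returns the number of bytes received, or -1 on error. */
ssize_t udp_loopback_send(const char *mesg, char *out, size_t outlen)
{
    struct sockaddr_in addr_local, addr_remote;
    socklen_t len = sizeof(addr_remote);
    ssize_t n = -1;
    int sockfd = -1, recvfd;

    /* Receiver socket standing in for the remote host (assumption). */
    recvfd = socket(PF_INET, SOCK_DGRAM, 0);
    if (recvfd < 0)
        return -1;
    memset(&addr_remote, 0, sizeof(addr_remote));
    addr_remote.sin_family = AF_INET;
    addr_remote.sin_port = 0;                /* kernel picks a free port */
    if (inet_pton(AF_INET, "127.0.0.1", &addr_remote.sin_addr) != 1)
        goto out;
    if (bind(recvfd, (struct sockaddr *)&addr_remote, sizeof(addr_remote)) < 0 ||
        getsockname(recvfd, (struct sockaddr *)&addr_remote, &len) < 0)
        goto out;

    /* Sender: the same four calls as in the listing, now checked. */
    sockfd = socket(PF_INET, SOCK_DGRAM, 0);
    if (sockfd < 0)
        goto out;
    memset(&addr_local, 0, sizeof(addr_local));
    addr_local.sin_family = AF_INET;
    addr_local.sin_port = 0;
    addr_local.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(sockfd, (struct sockaddr *)&addr_local, sizeof(addr_local)) < 0)
        goto out;
    if (connect(sockfd, (struct sockaddr *)&addr_remote, sizeof(addr_remote)) < 0)
        goto out;
    if (write(sockfd, mesg, strlen(mesg) + 1) < 0)
        goto out;

    n = recv(recvfd, out, outlen, 0);        /* one datagram, one recv */

out:
    if (sockfd >= 0)
        close(sockfd);
    close(recvfd);
    return n;
}
```

Note that connect on a UDP socket sends nothing. It only records the default destination, which is why the plain write that follows knows where to send the datagram.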
Let’s describe the system calls in the order the program uses them.
The socket( ) System Call
The socket( ) system call creates a new endpoint for a communication between two or more processes.