Tag matching logic
The MPI standard defines a set of rules, known as tag matching, for matching source send operations to destination receives. For a send to match a receive, the following envelope parameters must agree:
- Communicator
- User tag – a wild card may be specified by the receiver
- Source rank – a wild card may be specified by the receiver
- Destination rank
The ordering rules require that when more than one pair of send and receive message envelopes could match, the pair that includes the earliest posted send and the earliest posted receive must be used to satisfy the matching operation. However, this doesn't imply that tags are consumed in the order they are created; e.g., a later-generated tag may be consumed if earlier tags cannot be used to satisfy the matching rules.
When a message is sent from the sender to the receiver, the communication library may process the operation either before or after the corresponding matching receive is posted. If a matching receive has already been posted, the message is an expected message; otherwise it is called an unexpected message. Implementations frequently use different matching schemes for these two cases.
To keep the MPI library's memory footprint down, MPI implementations typically use two different protocols for this purpose:
1. The Eager protocol – the complete message is sent when the send is processed by the sender. A send completion is received on the send CQ (send_cq), notifying that the buffer can be reused.
2. The Rendezvous protocol – the sender sends the tag-matching header, and perhaps a portion of the data, when first notifying the receiver. When the corresponding buffer is posted, the responder uses the information from the header to initiate an RDMA READ operation directly into the matching buffer. A FIN message needs to be received in order for the buffer to be reused.
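As a small illustration of these matching rules, here is a minimal sketch assuming any standard MPI implementation (build with an MPI compiler wrapper such as mpicc and run with two ranks). The receiver uses wildcards for the source rank and tag, and the earliest matching posted send satisfies the wildcard receive.

/* Minimal tag-matching sketch; assumes a standard MPI implementation.
 * Build with an MPI compiler wrapper (e.g. mpicc) and run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data[2] = { 10, 20 };
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Envelope = (communicator, destination rank, tag). */
        MPI_Send(&data[0], 1, MPI_INT, 1, 100, MPI_COMM_WORLD);
        MPI_Send(&data[1], 1, MPI_INT, 1, 200, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int recvd;
        /* Wildcards on source and tag; the communicator must still match.
         * The earliest matching posted send (tag 100) satisfies this receive. */
        MPI_Recv(&recvd, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        printf("got %d from rank %d, tag %d\n",
               recvd, status.MPI_SOURCE, status.MPI_TAG);

        /* A receive restricted to tag 200 matches the second send. */
        MPI_Recv(&recvd, 1, MPI_INT, 0, 200, MPI_COMM_WORLD, &status);
        printf("got %d from rank %d, tag %d\n",
               recvd, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}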
Tag matching implementation
There are two types of matching objects used: the posted receive list and the unexpected message list. The application posts receive buffers to the posted receive list through calls to the MPI receive routines, and sends messages using the MPI send routines. The head of the posted receive list may be maintained by the hardware, with the software expected to shadow this list.
When a send is initiated and arrives at the receive side, if there is no pre-posted receive for this arriving message, it is passed to the software and placed in the unexpected message list. Otherwise the match is processed, including rendezvous processing if appropriate, delivering the data to the specified receive buffer. This allows overlapping receive-side MPI tag matching with computation.
When a receive message is posted, the communication library first checks the software unexpected message list for a matching send. If a match is found, data is delivered to the user buffer using a software-controlled protocol; the UCX implementation uses either an eager or a rendezvous protocol, depending on data size. If no match is found, the entire pre-posted receive list is maintained by the hardware, and there is space to add one more pre-posted receive to this list, this receive is passed to the hardware. Software is expected to shadow this list, to help with processing MPI cancel operations. In addition, because hardware and software are not expected to be tightly synchronized with respect to the tag-matching operation, this shadow list is used to detect the case in which a pre-posted receive is passed to the hardware while the matching unexpected message is being passed from the hardware to the software.
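As a rough illustration of this flow, here is a simplified, hypothetical C sketch of the two lists. Every name in it is invented for illustration and does not correspond to the UCX code, to any MPI library, or to a hardware interface.

/* Hypothetical sketch of receive-side tag matching; all names are illustrative. */
#include <stdbool.h>
#include <stddef.h>

#define ANY_TAG    (-1)
#define ANY_SOURCE (-1)

struct msg {
    int comm_id;
    int tag;                       /* ANY_TAG allowed on posted receives    */
    int source;                    /* ANY_SOURCE allowed on posted receives */
    struct msg *next;
};

/* Both lists are kept in posting/arrival order, so the earliest entry wins. */
static struct msg *posted_recv_list;   /* receives posted by the application */
static struct msg *unexpected_list;    /* arrived sends with no match yet    */

static bool match(const struct msg *recv, const struct msg *send)
{
    return recv->comm_id == send->comm_id &&
           (recv->tag == ANY_TAG || recv->tag == send->tag) &&
           (recv->source == ANY_SOURCE || recv->source == send->source);
}

/* Remove and return the earliest entry of *list that matches key, or NULL. */
static struct msg *take_first_match(struct msg **list, struct msg *key,
                                    bool key_is_recv)
{
    for (struct msg **pp = list; *pp; pp = &(*pp)->next) {
        struct msg *recv = key_is_recv ? key : *pp;
        struct msg *send = key_is_recv ? *pp : key;
        if (match(recv, send)) {
            struct msg *found = *pp;
            *pp = found->next;
            return found;
        }
    }
    return NULL;
}

static void append(struct msg **list, struct msg *m)
{
    m->next = NULL;
    while (*list)
        list = &(*list)->next;
    *list = m;
}

/* A send arrives: satisfy the earliest matching posted receive, otherwise
 * park the message on the unexpected list. */
void handle_arriving_send(struct msg *send)
{
    struct msg *recv = take_first_match(&posted_recv_list, send, false);
    if (recv) {
        /* deliver the data into recv's buffer (eager or rendezvous) */
    } else {
        append(&unexpected_list, send);
    }
}

/* A receive is posted: check the unexpected list first, otherwise append the
 * receive to the posted list (which may be offloaded to hardware, with a
 * software shadow kept for MPI cancel handling and race detection). */
void handle_posted_receive(struct msg *recv)
{
    struct msg *send = take_first_match(&unexpected_list, recv, true);
    if (send) {
        /* deliver send's data into recv's buffer */
    } else {
        append(&posted_recv_list, recv);
    }
}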
Linux kernel header files to match the current kernel
I’ve seen a few questions about linux-headers packages but couldn’t find anything to address my specific issue. I’m on Kubuntu 16.04, and I got the following error (from VirtualBox):
Please install the Linux kernel "header" files matching the current kernel for adding new hardware support to the system. The distribution packages containing the headers are probably: linux-headers-generic linux-headers-4.13.0-43-generic
I was surprised to see that linux-headers-generic was not installed, although I’m not really sure if it’s supposed to be there by default. In any case, while the kernel is 4.13.0-43-generic, the corresponding headers are, indeed, not installed:
$ uname -r
4.13.0-43-generic
$ aptitude search linux-headers | grep ^i
id  linux-headers-4.13.0-32         - Header files related to Linux kernel versi
id  linux-headers-4.13.0-32-generic - Linux kernel headers for version 4.13.0 on
i A linux-headers-4.13.0-37         - Header files related to Linux kernel versi
i A linux-headers-4.13.0-37-generic - Linux kernel headers for version 4.13.0 on
i A linux-headers-4.13.0-38         - Header files related to Linux kernel versi
i A linux-headers-4.13.0-38-generic - Linux kernel headers for version 4.13.0 on
i A linux-headers-4.13.0-39         - Header files related to Linux kernel versi
i A linux-headers-4.13.0-39-generic - Linux kernel headers for version 4.13.0 on
The linux-headers-generic package "will always depend on the latest generic kernel headers available", so I thought that installing it would install the latest packages (in this case, linux-headers-4.13.0-43-generic as required by VirtualBox) and keep them up to date. However, if I try that, I'm asked to install what appear to be really old packages:
$ sudo aptitude install linux-headers-generic
The following NEW packages will be installed:
  linux-headers-4.4.0-127 linux-headers-4.4.0-127-generic linux-headers-generic
0 packages upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 10.8 MB of archives. After unpacking 78.4 MB will be used.
- Should either of these linux-headers packages (the generic meta-package or the versioned one) have been there by default? Which one?
- Do I need to install either of them in my case?
- If I install the necessary linux-headers-4.13.0-43-generic package directly, what happens when the kernel is upgraded?
Kernel Network Stack Made Easy
For Curious Linux Kernel Programmers and TCP/IP stack hackers.
Tuesday, 28 October 2014
The Routing Engine In Linux Kernel
In this discussion we will zoom into the Linux kernel code to understand what really happens when we work with Internet routing.
We use kernel 3.17.1, the latest at the time of writing, as the reference for this discussion. It is worth mentioning that the well-known traditional 'routing cache' was removed from kernel 3.6 onwards, and the routing database is now only the FIB TRIE.
This article doesn't talk about routing protocols like RIP, OSPF, BGP, EGP, etc. We will also not focus on the commands used for routing table configuration and management.
Routing is the brain of the Internet Protocol; it allows packets to cross LAN boundaries. Let's not spend much time discussing the details you can already find at http://en.wikipedia.org/wiki/Routing; instead, let's peek into the implementation. A successful route lookup essentially gives us the following:
- The next hop: the directly connected node (gateway or final host) to which the packet must be handed over.
- The output interface: the interface of this device through which the next hop is reached.
- The type of the route (based on the destination address in the IP header of the packet in question). A few of the important routing flags are listed below:
- RTCF_LOCAL: the destination address is local and the packet should terminate on this device. The packet will be given to the kernel method ip_local_deliver().
- RTCF_BROADCAST and RTCF_MULTICAST: used when the destination address is a broadcast or a multicast address respectively.
- Refer to include/uapi/linux/in_route.h for the rest of the flags.
- Additionally, a route entry also gives some more information such as the MTU, priority, protocol ID, metrics, etc.
Based on the route lookup result, the packet will be given to ip_local_deliver() in the case of local delivery, or to ip_forward() in the case of forwarding. In the forwarding case, the packet is sent to the next hop (found in the route lookup) via the output interface, and the packet continues its journey towards the destination address.
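In code, this dispatch is simply a call through the destination entry attached to the skb. The helper below is essentially what include/net/dst.h provides in kernels of this era (abridged; the corresponding output-side helper has changed its signature across kernel versions):

/* Abridged from include/net/dst.h (3.x era). The route lookup sets
 * skb_dst(skb)->input to ip_local_deliver() or ip_forward(); the ->output
 * pointer (not shown here, its signature varies between kernel versions)
 * typically points to ip_output(). */
static inline int dst_input(struct sk_buff *skb)
{
        return skb_dst(skb)->input(skb);
}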
Essential kernel data structures
Forwarding Information Base (FIB): For any given destination address, the routing subsystem is expected to give a route entry with the information discussed above. Such routing information (either statically configured or dynamically populated by the routing protocols) is stored in a kernel database called the FIB.
At boot time, two FIB tables are created by default, RT_TABLE_MAIN and RT_TABLE_LOCAL, with table IDs 254 and 255 respectively. More FIB tables are created when policy-based routing is enabled.
A route entry in the kernel is a 'struct fib_info', which holds the routing information we discussed above (output device, next hop, scope, priority, metrics, etc.).
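An abridged sketch of the structure is below (taken from include/net/ip_fib.h; many fields are omitted here and the exact layout varies between kernel versions):

/* Abridged sketch of struct fib_info from include/net/ip_fib.h; many fields
 * are omitted and names/layout differ across kernel versions. */
struct fib_info {
        unsigned char   fib_protocol;   /* who installed the route (kernel, boot, routing daemon) */
        unsigned char   fib_scope;      /* host, link, universe ... */
        __be32          fib_prefsrc;    /* preferred source address */
        u32             fib_priority;   /* route priority (metric) */
        u32             *fib_metrics;   /* MTU, window, RTT ... */
        int             fib_nhs;        /* number of nexthops */
        struct fib_nh   fib_nh[0];      /* nexthop(s): nh_dev, nh_oif, nh_gw ... */
};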
The destination cache used by the routing engine: This improves the performance of the Linux routing engine. A 'struct dst_entry' holds the critical information required to process a packet going to a given destination. It is created after the route lookup, and the packet (struct sk_buff) itself holds a pointer to this entry, but not to the fib_info. So, after the route lookup, at any point during the packet's journey through the kernel network stack, the destination cache can be referred to using the skb_dst() API.
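An abridged sketch of the destination cache entry (from include/net/dst.h; most fields are omitted and the layout varies across kernel versions):

/* Abridged sketch of struct dst_entry from include/net/dst.h; most fields are
 * omitted and the exact layout varies across kernel versions. */
struct dst_entry {
        struct net_device       *dev;   /* device used to reach this destination */
        struct dst_ops          *ops;
        int                     (*input)(struct sk_buff *skb);
        /* The output() function pointer follows; its exact signature has
         * changed across kernel versions. Metrics, refcounting and many
         * other fields are omitted here. */
};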
Here the input() and output() function pointers point to ip_local_deliver()/ip_forward() and ip_output() respectively. They are filled in based on the route lookup result, as we discussed above.
As listed further below, there are quite a few kernel APIs available to do the route lookup for us. We can choose the appropriate API based on the available input data and the direction of the packet (ingress or egress). All of them boil down to the core FIB lookup API, fib_lookup(). On a successful route lookup, this method returns a 'struct fib_result', shown below.
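An abridged sketch (from include/net/ip_fib.h; some fields are omitted and the layout varies across kernel versions):

/* Abridged sketch of struct fib_result from include/net/ip_fib.h. */
struct fib_result {
        unsigned char    prefixlen;     /* prefix length of the matched route */
        unsigned char    nh_sel;        /* which nexthop was selected */
        unsigned char    type;          /* RTN_UNICAST, RTN_LOCAL, RTN_BROADCAST ... */
        unsigned char    scope;
        struct fib_info  *fi;           /* the matched route entry */
        struct fib_table *table;        /* the table the match came from */
};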
The fib_lookup() method searches both default tables for a matching route: it first searches RT_TABLE_LOCAL, and if there is no match there, the lookup is done in the main table (RT_TABLE_MAIN). Using the 'fib_result', the destination cache entry (dst_entry) is created for this destination, and the input() and output() function pointers are assigned appropriately, as we discussed above.
fib_lookup() expects a 'struct flowi4' as the input argument for the table lookup. This object carries the source and destination addresses, the ToS value and more, as shown below.
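An abridged sketch (from include/net/flow.h; the flowi4_* names are really accessor macros onto an embedded struct flowi_common, the ports live in a union, and several fields are omitted here):

/* Abridged sketch of struct flowi4 from include/net/flow.h. */
struct flowi4 {
        int     flowi4_oif;     /* output interface (egress lookups)  */
        int     flowi4_iif;     /* input interface (ingress lookups)  */
        __u32   flowi4_mark;    /* skb mark, usable by policy routing */
        __u8    flowi4_tos;     /* ToS value from the IP header       */
        __u8    flowi4_scope;
        __u8    flowi4_proto;   /* L4 protocol                        */
        __be32  saddr;          /* source address                     */
        __be32  daddr;          /* destination address                */
        __be16  fl4_sport;      /* source port                        */
        __be16  fl4_dport;      /* destination port                   */
};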
A set of well known kernel APIs for route lookup
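A few of the commonly used entry points are sketched below (declarations abridged from include/net/route.h and include/net/ip_fib.h; the exact signatures vary somewhat across kernel versions):

/* Abridged declarations; exact signatures vary across kernel versions. */

/* Ingress: route an incoming skb, typically called from ip_rcv_finish(). */
int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
                         u8 tos, struct net_device *dev);

/* Egress: look up a route for locally generated traffic. */
struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp);
struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp,
                                    struct sock *sk);

/* Core lookup used by the wrappers above: searches the FIB tables and
 * fills in a struct fib_result. */
int fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res);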
Route configuration, behind the curtain!
Kernel routing tables can be managed using the standard tools provided in the iproute2 ('ip' command) and net-tools ('route', 'netstat', etc.) packages. The iproute2 package uses NETLINK sockets to talk to the kernel, while the net-tools package uses IOCTLs.
As you know, NETLINK is an extension of the generic socket framework (I will discuss it in a separate article). NETLINK_ROUTE is the netlink family used to carry admin commands to the routing subsystem. The most important rtnetlink routing commands and their corresponding kernel handlers are noted below.
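For IPv4, these handlers are registered at initialization time in net/ipv4/fib_frontend.c, roughly as sketched below (abridged; the rtnl_register() calling convention differs in newer kernels):

/* Abridged from ip_fib_init() in net/ipv4/fib_frontend.c (3.x era kernels). */
rtnl_register(PF_INET, RTM_NEWROUTE, inet_rtm_newroute, NULL, NULL);  /* ip route add  */
rtnl_register(PF_INET, RTM_DELROUTE, inet_rtm_delroute, NULL, NULL);  /* ip route del  */
rtnl_register(PF_INET, RTM_GETROUTE, NULL, inet_dump_fib, NULL);      /* ip route show */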
Policy based routing
As we discussed, only two FIB tables (LOCAL and MAIN) are created at network stack boot time. Policy routing is a feature that allows us to create up to 255 routing tables, which extends the control and flexibility of routing decisions. The admin has to attach each table to a specific 'rule', so that when a packet arrives at the routing framework, it searches for a matching rule and the corresponding table is picked for the route lookup (fib_rules_lookup() is the API that does this). For example, with a rule that matches ToS value 0x02 and points to table 190, any packet arriving with ToS 0x02 will be looked up in routing table 190.