Linux zero copy file

How can I achieve zero copy mechanism in POSIX?

I want to share or transfer data between two processes, either locally or over the network. General IPC mechanisms such as shared memory and message queues can be used to transfer data, but these mechanisms involve multiple copies. I came across the zero-copy mechanism, which reduces the copying overhead on the CPU. Linux supports this through sendfile and splice, but these APIs are not in POSIX. How can I achieve zero-copy using only POSIX APIs?

1 Answer

Shared memory between two processes is zero-copy if you keep the shared data in the shared memory. Otherwise there has to be a copy somewhere (e.g. into and out of shared mem). You can reduce this to one copy if one of the processes keeps the shared data in shared memory, and the other process just reads it from there.
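
A minimal sketch of the "keep the data in shared memory" case, using only POSIX calls (shm_open, ftruncate, mmap). The object name /zero_copy_demo, the 4096-byte size, and the function names are hypothetical, and waitpid stands in for the real synchronisation (a semaphore or similar) that two independent processes would need:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define SHM_NAME "/zero_copy_demo"   /* hypothetical object name */
#define SHM_SIZE 4096
#define MESSAGE  "built in place by the producer"

/* Producer: opens the object by name and builds its data directly inside it. */
static int produce(void)
{
    int fd = shm_open(SHM_NAME, O_RDWR, 0);
    if (fd < 0) return -1;
    char *mem = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping outlives the descriptor */
    if (mem == MAP_FAILED) return -1;
    strcpy(mem, MESSAGE);             /* data is created in shared memory */
    munmap(mem, SHM_SIZE);
    return 0;
}

/* Consumer: creates the object, lets a child process produce into it, then
 * inspects the result where it lies. Returns 0 if the data is visible. */
int shm_zero_copy_demo(void)
{
    int rc = -1, status = 0;
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, SHM_SIZE) < 0) {
        close(fd);
        shm_unlink(SHM_NAME);
        return -1;
    }
    const char *mem = mmap(NULL, SHM_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (mem == MAP_FAILED) {
        shm_unlink(SHM_NAME);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0)
        _exit(produce() == 0 ? 0 : 1);

    /* Waiting for the child doubles as synchronisation in this sketch. */
    if (pid > 0 && waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
        strcmp(mem, MESSAGE) == 0)    /* read in place, no copy out */
        rc = 0;

    munmap((void *)mem, SHM_SIZE);
    shm_unlink(SHM_NAME);
    return rc;
}
```

Neither side copies the payload: the producer writes it where the consumer reads it. On older glibc versions this may need to be linked with -lrt.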

The Linux man pages for sendfile(2) and vmsplice(2) don’t mention a POSIX alternative, so I doubt there is one. To send data between processes with only one copy, set up a pipe between them and use vmsplice to put pages into the pipe with zero-copy. On the receiving end, I think just use read(2) to get the pages out of the pipe.
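
The pipe approach might look roughly like the sketch below. It is Linux-specific (vmsplice is not POSIX), the function name and message are made up for illustration, and a real sender would also loop on short writes and keep the buffer untouched until the receiver has drained the pipe:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent maps its pages into a pipe with vmsplice; the child fetches them
 * with plain read(2). Returns 0 if the child received the exact bytes. */
int vmsplice_pipe_demo(void)
{
    static char msg[] = "pages handed to the pipe by vmsplice";
    int pfd[2], status = 0;

    if (pipe(pfd) < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }
    if (pid == 0) {                          /* receiver */
        char buf[sizeof msg];
        size_t got = 0;
        close(pfd[1]);
        while (got < sizeof msg) {
            /* This read is the one remaining copy (pipe -> buf). */
            ssize_t n = read(pfd[0], buf + got, sizeof msg - got);
            if (n <= 0) _exit(1);
            got += (size_t)n;
        }
        _exit(memcmp(buf, msg, sizeof msg) == 0 ? 0 : 1);
    }

    /* Sender: the pipe references msg's pages instead of copying them,
     * so msg must not be modified until the receiver has read it. */
    close(pfd[0]);
    struct iovec iov = { .iov_base = msg, .iov_len = sizeof msg };
    ssize_t sent = vmsplice(pfd[1], &iov, 1, 0);
    close(pfd[1]);

    if (waitpid(pid, &status, 0) != pid) return -1;
    if (sent != (ssize_t)sizeof msg) return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```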

Over the network, zero-copy is even harder. The question "Why no zero-copy networking in linux kernel?" has some comments and answers. The receive side would be hard to implement on top of the usual socket API, unless it only worked when a thread was blocked on read(2) on the socket. Otherwise, how would it know where in the process’s virtual memory to put the packet?

Does Linux have zero-copy? splice or sendfile?

When splice was introduced it was discussed on the kernel list that sendfile was re-implemented based off of splice. The documentation for the splice flag SPLICE_F_MOVE states:

Attempt to move pages instead of copying. This is only a hint to the kernel: pages may still be copied if the kernel cannot move the pages from the pipe, or if the pipe buffers don’t refer to full pages. The initial implementation of this flag was buggy: therefore starting in Linux 2.6.21 it is a no-op (but is still permitted in a splice() call); in the future, a correct implementation may be restored.

So does that mean Linux has no zero-copy method for writing to sockets? Or was this fixed at some point and nobody updated the documentation for years? Does either of sendfile or splice have a zero copy implementation in any of the latest 3.x kernel versions? Since Google has no answer to this query, I’m creating a stackoverflow question for the next poor schmuck who wants to know if there’s any benefit to using vmsplice and splice or sendfile over plain old write.

I don’t know much about splice, but if you’re interested in zero-copy sockets specifically, you should take a look at memory mapped sockets: kernel.org/doc/Documentation/networking/packet_mmap.txt

Under "NOTES" the splice(2) manpage says "Though we talk of copying, actual copies are generally avoided." So very likely things are zero-copy when possible, but the kernel will not error if it cannot do things zero-copy.

@Eloff: The sparkling-new AF_XDP achieves true zero-copy for raw packets, using ideas borrowed from Infiniband (RDMA) and DPDK.

2 Answers

sendfile has always been zero-copy and still is (assuming the hardware allows for it, which is usually the case). Being zero-copy was the entire point of having this syscall in the first place. sendfile is nowadays implemented as a wrapper around splice.

That suggests that splice, too, is zero-copy, and this is indeed the case, at least in theory and at least in some cases. The problem is figuring out how to use it correctly so that it works reliably and actually is zero-copy. The documentation is... sparse, to say the least.

In particular, splice only works zero-copy if the pages were given as a "gift", i.e. you don’t own them any more (formally, but in reality you still do). That is a non-issue if you simply splice a file descriptor onto a socket, but it is a big issue if you want to splice data from your application’s address space, or from one pipe to another. It is unclear what to do with the pages afterwards (and when). The documentation states that you may not touch the pages afterwards or do anything with them, never, not ever. So if you follow the letter of the documentation, you must leak the memory.
That’s obviously not correct (it can’t be), but there is no good way of knowing (for you at least!) when it’s safe to reuse or release that memory. The kernel doing a sendfile would know, since as soon as it receives the TCP ACK, it knows that the data is never needed again. The problem is, you don’t ever get to see an ACK. All you know when splice has returned is that data has been accepted to be sent (but you have no idea whether it has already been sent or received, nor when this will happen).
Which means you need to figure this out somehow on an application layer, either by doing manual ACKs (comes for free with reliable UDP), or by assuming that if the other side sends an answer to your request, they obviously must have gotten the request.
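
One way to picture the application-level acknowledgement is the sketch below: the sender pushes its buffer with vmsplice and splice, then leaves the buffer alone until the peer's reply arrives. The one-byte 'A' reply, the function names, and the AF_UNIX socketpair standing in for a real network connection are all assumptions made for illustration:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Peer: read len bytes of request, then answer with a single 'A' byte. */
static int peer(int sock, size_t len)
{
    char in[64];
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(sock, in + got, len - got);
        if (n <= 0) return 1;
        got += (size_t)n;
    }
    return write(sock, "A", 1) == 1 ? 0 : 1;
}

/* Returns 0 if the request was sent and the buffer was only reused
 * after the peer's reply had been received. */
int send_then_wait_for_ack(void)
{
    char request[] = "request held in user pages";
    const ssize_t len = (ssize_t)sizeof request;
    int sv[2], pfd[2], rc = -1, status = 0;
    char ack = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;
    if (pipe(pfd) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        close(sv[0]);
        close(pfd[0]);
        close(pfd[1]);
        _exit(peer(sv[1], sizeof request));
    }
    close(sv[1]);

    struct iovec iov = { .iov_base = request, .iov_len = sizeof request };
    if (pid > 0 &&
        vmsplice(pfd[1], &iov, 1, 0) == len &&
        splice(pfd[0], NULL, sv[0], NULL, sizeof request, SPLICE_F_MOVE) == len &&
        /* splice returning only means "accepted for sending", so the
         * buffer stays untouched until the peer has answered. */
        read(sv[0], &ack, 1) == 1 && ack == 'A') {
        memset(request, 0, sizeof request);   /* now considered reusable */
        rc = 0;
    }

    close(sv[0]);
    close(pfd[0]);
    close(pfd[1]);
    if (pid > 0 && (waitpid(pid, &status, 0) != pid ||
                    !WIFEXITED(status) || WEXITSTATUS(status) != 0))
        rc = -1;
    return rc;
}
```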

Another thing you have to manage is the finite pipe space. The default is very small, but even if you increase the size, you can’t just naively splice a file of any size. sendfile on the other hand will just let you do that, which is cool.
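
To illustrate the pipe-space bookkeeping, here is a rough sketch of moving data from one descriptor to another through an intermediate pipe in pipe-sized chunks. The function names, the optional pipe_size parameter, and the 200000-byte temp-file self-check are hypothetical; error handling (EINTR, EAGAIN) is left out:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Move up to len bytes from in_fd to out_fd via a pipe. If pipe_size > 0,
 * ask the kernel for a larger pipe first (a failed request is ignored).
 * Returns the number of bytes moved, or -1 on error. */
ssize_t splice_through_pipe(int in_fd, int out_fd, size_t len, int pipe_size)
{
    int pfd[2];
    ssize_t total = 0;

    if (pipe(pfd) < 0) return -1;
    if (pipe_size > 0)
        (void)fcntl(pfd[1], F_SETPIPE_SZ, pipe_size);

    while (len > 0) {
        /* Fills the pipe with at most what currently fits into it. */
        ssize_t in = splice(in_fd, NULL, pfd[1], NULL, len, SPLICE_F_MOVE);
        if (in < 0) { total = -1; break; }
        if (in == 0) break;                       /* end of input */

        ssize_t left = in;
        while (left > 0) {                        /* drain before refilling */
            ssize_t out = splice(pfd[0], NULL, out_fd, NULL,
                                 (size_t)left, SPLICE_F_MOVE);
            if (out <= 0) { total = -1; break; }
            left -= out;
        }
        if (total < 0) break;
        total += in;
        len -= (size_t)in;
    }
    close(pfd[0]);
    close(pfd[1]);
    return total;
}

/* Self-check: the data is larger than a default-sized pipe, so the loop
 * above has to run several times. Returns 0 on success. */
int splice_demo(void)
{
    enum { SIZE = 200000 };
    static char data[SIZE], back[SIZE];
    char src[] = "/tmp/splice_src_XXXXXX", dst[] = "/tmp/splice_dst_XXXXXX";
    int in = mkstemp(src), out = mkstemp(dst), rc = -1;

    if (in >= 0) unlink(src);
    if (out >= 0) unlink(dst);
    if (in < 0 || out < 0) goto done;

    memset(data, 'z', SIZE);
    if (write(in, data, SIZE) != SIZE || lseek(in, 0, SEEK_SET) != 0)
        goto done;
    if (splice_through_pipe(in, out, SIZE, 0) == SIZE &&
        pread(out, back, SIZE, 0) == SIZE &&
        memcmp(data, back, SIZE) == 0)
        rc = 0;
done:
    if (in >= 0) close(in);
    if (out >= 0) close(out);
    return rc;
}
```

Passing a pipe_size (for example 1 << 20) only reduces the number of iterations; the chunking loop is still needed for inputs larger than the pipe.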

All in all, sendfile is nice because it just works, and it works well, and you don’t need to care about any of the above details. It’s not a panacea, but it sure is a great addition.
I would, personally, stay away from splice and its family until the whole thing is greatly overhauled and until it is 100% clear what you have to do (and when) and what you don’t have to do.
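
For comparison, the sendfile path needs very little code. The sketch below sends a whole file to a socket; the helper names are made up, and an AF_UNIX socketpair stands in for a real TCP connection so the example can check itself:

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send the entire file behind in_fd to out_fd. Returns 0 on success. */
int send_whole_file(int out_fd, int in_fd)
{
    struct stat st;
    off_t offset = 0;

    if (fstat(in_fd, &st) < 0) return -1;
    while (offset < st.st_size) {
        /* sendfile advances offset by the number of bytes it sent. */
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0) return -1;     /* real code would retry on EINTR/EAGAIN */
        if (n == 0) break;        /* file shrank underneath us */
    }
    return offset == st.st_size ? 0 : -1;
}

/* Self-check: write a small temp file, send it over a socket pair and
 * compare what arrives. Returns 0 on success. */
int sendfile_demo(void)
{
    char path[] = "/tmp/sendfile_demo_XXXXXX";
    const char msg[] = "payload served by sendfile";
    char buf[sizeof msg] = {0};
    int sv[2], rc = -1;

    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path);

    if (write(fd, msg, sizeof msg) != (ssize_t)sizeof msg) goto out;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) goto out;

    if (send_whole_file(sv[0], fd) == 0 &&
        read(sv[1], buf, sizeof buf) == (ssize_t)sizeof buf &&
        memcmp(buf, msg, sizeof msg) == 0)
        rc = 0;

    close(sv[0]);
    close(sv[1]);
out:
    close(fd);
    return rc;
}
```

There is no intermediate pipe to size and no user buffer whose lifetime has to be tracked, which is the convenience the answer describes.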

The real, effective gains over plain old write are marginal for most applications, anyway. I recall some less than polite comments by Mr. Torvalds a few years ago (when BSD had a form of write that would do some magic with remapping pages to get zero-copy, and Linux didn’t) which pointed out that making a copy usually isn’t any issue, but playing tricks with pages is [won’t repeat that here].
