Fast ways to copy files on Linux

Fastest way to copy a large file locally

I was asked this in an interview. I said let's just use cp. Then I was asked to mimic the implementation of cp itself. So I thought, okay, let's open the file, read it piece by piece, and write it to another file. Then I was asked to optimize it further. I thought, let's read in chunks and write those chunks, but I didn't have a good answer for what a good chunk size would be. Please help me out with that.

Then I was asked to optimize even further. I thought maybe we could read from different threads in parallel and write in parallel. But I quickly realized that reading in parallel is OK, but writing will not work in parallel (without locking, I mean), since data from one thread might overwrite another's. So I thought, okay, let's read in parallel, put the chunks in a queue, and have a single thread take them off the queue and write them to the file one by one. Does that even improve performance? (Not for small files, I mean; that would just be overhead, but maybe for large files.)

Also, is there an OS trick where I could just point two files at the same data on disk? I know there are symlinks, but apart from that?

"But I quickly realized reading in parallel is OK but writing will not work in parallel (without locking I mean) since data from one thread might overwrite others." What is that based on? There are numerous ways to write to a file from multiple threads that require no locking: you can use pwrite(), or open() the file multiple times. The real problem with parallel writes to most files is the extra seeks required of the physical disk heads. If the filesystem is a high-end HPC filesystem, though, files can be spread over multiple disks and parallel writes can be much faster.
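As a quick illustration of the pwrite() point (my sketch, not the commenter's code): each call names its own file offset, so writers targeting disjoint ranges of the same open file need no lock and no shared file position.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("out.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char a[] = "chunk written by worker A";
    const char b[] = "chunk written by worker B";

    /* These two calls could run from different threads: each carries its
     * own offset, so neither disturbs the other or the shared file
     * position (the gap between the two ranges simply stays a hole). */
    if (pwrite(fd, a, sizeof a - 1, 0) < 0)
        perror("pwrite A");
    if (pwrite(fd, b, sizeof b - 1, 4096) < 0)
        perror("pwrite B");

    close(fd);
    return 0;
}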

2 Answers

«The fastest way to copy a file» is going to depend on the system — all the way from the storage media to the CPUs. The most likely bottleneck will be the storage media — but it doesn’t have to be. Imagine high-end storage that can move data faster than your system can create physical page mappings to read the data into.

In general, the fastest way to move a lot of data is to make as few copies of it as possible, and to avoid any extra operations, especially S-L-O-W ones such as physical disk head seeks.

So for a local copy on a common single-rotating-disk workstation/desktop/laptop system, the biggest thing to do is minimize physical disk seeks. That means read and write single-threaded, in large chunks (1 MB, for example) so the system can do whatever optimization it can, such as read-ahead or write coalescing.
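A minimal sketch of that single-threaded, large-chunk loop (the 1 MB chunk size is the figure from the paragraph above, not a magic number):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)   /* 1 MiB read/write chunks */

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); return 1; }

    int in  = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char *buf = malloc(CHUNK);
    if (in < 0 || out < 0 || buf == NULL) { perror("setup"); return 1; }

    for (;;) {
        ssize_t n = read(in, buf, CHUNK);
        if (n == 0) break;                       /* end of file */
        if (n < 0) { perror("read"); return 1; }
        /* write() may return a short count, so loop until the chunk is out */
        for (ssize_t done = 0; done < n; ) {
            ssize_t w = write(out, buf + done, n - done);
            if (w < 0) { perror("write"); return 1; }
            done += w;
        }
    }

    free(buf);
    close(in);
    close(out);
    return 0;
}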

That will likely get you to 95% or even better of the system’s maximum copy performance. Even standard C buffered fopen() / fread() / fwrite() probably gets at least 80-90% of the best possible performance.

You can get the last few percentage points in a few ways. First, by matching your IO block size to a multiple of the file system’s block size so that you’re always reading full blocks from the filesystem. Second, you can use direct IO to bypass copying your data through the page cache. It will be faster to go disk->userspace or userspace->disk than it is to go disk->page cache->userspace and userspace->page cache->disk, but for single-spinning-disk copy that’s not going to matter much, if it’s even measurable.
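A hedged sketch of the direct-IO variant on Linux follows. O_DIRECT requires the buffer, the file offset, and the transfer size to be suitably aligned, which is why the buffer comes from posix_memalign() and the chunk is a block multiple; the 4096-byte alignment here is an assumption, and real code should query the filesystem or device.

#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGN 4096             /* assumed block size */
#define CHUNK (1024 * 1024)    /* multiple of ALIGN */

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); return 1; }

    int in  = open(argv[1], O_RDONLY | O_DIRECT);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    void *buf;
    if (in < 0 || out < 0 || posix_memalign(&buf, ALIGN, CHUNK) != 0) {
        perror("setup");
        return 1;
    }

    ssize_t n;
    while ((n = read(in, buf, CHUNK)) > 0) {
        /* With O_DIRECT, a final chunk that is not a block multiple may need
         * padding or a fallback to buffered IO; that handling is omitted. */
        if (write(out, buf, n) != n) { perror("write"); return 1; }
    }
    if (n < 0) perror("read");

    free(buf);
    close(in);
    close(out);
    return 0;
}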

You can use various dd options to test copying a file like this. Try iflag=direct or oflag=direct to use direct IO, or conv=notrunc to avoid truncating the destination first.

You can also try using sendfile() to avoid copying data into userspace entirely. Depending on the implementation, that might be faster than using direct IO.
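A rough sendfile() sketch, assuming a Linux kernel new enough (2.6.33 and later) to accept a regular file as the destination; the data is moved inside the kernel and never passes through a userspace buffer.

#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); return 1; }

    int in  = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct stat st;
    if (in < 0 || out < 0 || fstat(in, &st) < 0) { perror("setup"); return 1; }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* sendfile() advances 'offset' by the number of bytes it moved and
         * may transfer less than requested, hence the loop. */
        ssize_t n = sendfile(out, in, &offset, st.st_size - offset);
        if (n < 0) { perror("sendfile"); return 1; }
        if (n == 0) break;
    }

    close(in);
    close(out);
    return 0;
}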

Pre-allocating the destination file may or may not improve copy performance — that will depend on the filesystem. If the filesystem doesn’t support sparse files, though, preallocating the file to a specific length might very well be very, very slow.
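If you do want to try preallocation, posix_fallocate() is the portable way to ask the filesystem to reserve the blocks up front; a small sketch:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s file bytes\n", argv[0]); return 1; }

    off_t length = (off_t)strtoll(argv[2], NULL, 10);
    int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve the blocks now; returns 0 on success or an error number
     * (not -1/errno) on failure. */
    int err = posix_fallocate(fd, 0, length);
    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

    close(fd);
    return err ? 1 : 0;
}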

There just isn’t all that much you can do to dramatically improve performance of a copy from and to the same single spinning physical disk — those disk heads will dance, and that will take time.

SSDs are much easier — to get maximal IO rates, just use parallel IO via multiple threads. But again, the «normal» IO will probably be at 80-90% of maximal.
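A sketch of that multi-threaded approach (illustrative thread count and chunk size, not a tuned implementation): each thread copies its own byte range with pread()/pwrite(), so there is no shared file position and no locking. Compile with -pthread.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 4
#define CHUNK (1024 * 1024)

struct range { int in, out; off_t start, end; };

static void *copy_range(void *arg)
{
    struct range *r = arg;
    char *buf = malloc(CHUNK);
    if (buf == NULL) return NULL;
    for (off_t pos = r->start; pos < r->end; ) {
        off_t left = r->end - pos;
        size_t want = left < CHUNK ? (size_t)left : CHUNK;
        ssize_t n = pread(r->in, buf, want, pos);
        if (n <= 0) break;
        if (pwrite(r->out, buf, n, pos) != n) break;
        pos += n;
    }
    free(buf);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s src dst\n", argv[0]); return 1; }

    int in  = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct stat st;
    if (in < 0 || out < 0 || fstat(in, &st) < 0) { perror("setup"); return 1; }

    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];
    off_t per = (st.st_size + NTHREADS - 1) / NTHREADS;   /* bytes per thread */

    for (int i = 0; i < NTHREADS; i++) {
        r[i].in = in;
        r[i].out = out;
        r[i].start = (off_t)i * per;
        r[i].end = (off_t)(i + 1) * per < st.st_size ? (off_t)(i + 1) * per : st.st_size;
        pthread_create(&tid[i], NULL, copy_range, &r[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    close(in);
    close(out);
    return 0;
}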

Things get a lot more interesting and complex optimizing IO performance for other types of storage systems such as large RAID arrays and/or complex filesystems that can stripe single files across multiple underlying storage devices. Maximizing IO on such systems involves matching the software’s IO patterns to the characteristics of the storage, and that can be quite complex.

Finally, one important part of maximizing IO rates is not doing things that dramatically slow things down. It’s really easy to drag a physical disk down to a few KB/sec IO rates — read/write small chunks from/to random locations all over the disk. If your write process drops 16-byte chunks to random locations, the disk will spend almost all its time seeking and it won’t move much data at all while doing that.

In fact, not «killing yourself» with bad IO patterns is a lot more important than spending a lot of effort attempting to get four or five percentage points faster in optimal cases.

Because if IO is a bottleneck on a simple system, just go buy a faster disk.

Faster and safer ways to copy files in Linux than cp

Have you ever copied large files on Linux and found it takes ages? I was doing exactly that when I thought there must be a faster and better way to copy files in Linux. So I started searching and came across these commands, which can offer better copying speed.

The plain cp command is very useful, but sometimes it can slow the process down. The commands below should help you get your copying done faster.

For copying, the tar command can sometimes be a better choice, offering a faster and safer transfer. Here is how to use tar.

How to copy files faster using tar command in Linux

To copy files, open a terminal (generally Ctrl + Alt + T) and change the current directory to the folder you want to copy files from.

Now just run the command below to copy files.

tar cf - . | (cd /output/directory/ && tar xvf -)

When running the command, just replace /output/directory with the directory you want to copy the files into. All files and subfolders in the current directory are copied to /output/directory.

Now if you want, you can also use pv to help you monitor the progress of copying files. For example:

tar cf - . | pv | (cd /output/directory && tar xvf -)

cp vs tar: why the speed difference?

In a cp vs tar comparison, the tar pipeline can sometimes copy noticeably faster than cp, especially for trees of many small files. The reason is that cp handles each file in one sequential open-read-write-close loop, while the piped tar splits the work across two processes: one reads and packs the data while the other unpacks and writes it, so reading and writing overlap. tar also streams many small files as one continuous archive, which reduces per-file overhead on the writing side.

That is why tar often comes out ahead in the cp vs tar comparison: it moves the data in a more pipelined and efficient way.

Here are some other alternatives you can use to copy files quickly on Linux.

Another alternative command

Another fast and very versatile command for copying files between two locations is rsync. It can copy between local as well as remote locations.

To copy files using rsync you need to enter the command below.

rsync -a Downloads/songs/abc.zip Downloads/music/

To view progress while copying large files, you can use the command below.

rsync --info=progress2 -auvz Downloads/songs/abc.zip Downloads/music/

If you are wondering, here is what -auvz stands for.

  • a: archive mode; preserves permissions, ownership, and timestamps while copying directories recursively.
  • u: skip files that already have a newer copy on the destination.
  • v: Verbose output.
  • z: Compress data during the transfer.

In the above example, copying is being done locally, but you can use rsync for copying over remote locations also.

You can also use options like -n for a dry run (a trial run without actually synchronizing) and -r for recursive (sync files and directories recursively). If you are transferring to or from a remote location, you can also use -e ssh to run the transfer over SSH.

If your system doesn’t come preinstalled with rsync, then you can install using the commands below.

cp vs rsync: which one is better?

rsync is generally not faster than cp for a plain one-off copy, but because it only transfers files that are new or modified, it can be much faster when synchronizing directories that already share content. rsync also has many advanced options that are not available in cp.

How to install rsync

rsync comes pre-installed on most Linux distros, but if it isn't, you can install it with the commands below.

On Debian and Ubuntu-based systems use the command below.

sudo apt-get install rsync

On CentOS/RHEL-based systems, use the command below.

sudo yum install rsync

For SUSE/openSUSE-based systems, use the command below.

sudo zypper install rsync

These commands will install rsync on your system. Now you can try copying files at a better speed.

SCP for copying

Secure copy, also known as scp, can also be used for copying. It is not built for raw speed; rather, it provides secure transmission of files between a local host and a remote host, or between two remote hosts. So when you need secure file transfer, this is the method to use.

Here is how you can use scp to transfer a file from the local host to a remote host.

scp /local/directory/file_name.zip user@remote_host:/remote/directory/

To transfer a file from a remote host to the local host, use the command below.

scp user@remote_host:file_name.zip /local/directory/

To transfer a file from one remote host to another, use the command below.

scp user@remote_host_1:/remote/directory/file_name.zip user@remote_host_2:/remote/directory/

I hope these commands help you speed up your copies.

Can I copy large files faster without using the file cache?

After adding the preload package, my applications seem to speed up, but if I copy a large file, the file cache grows by more than double the size of the file. When I transfer a single 3-4 GB VirtualBox image or video file to an external drive, this huge cache seems to evict all the preloaded applications from memory, leading to increased load times and general performance drops. Is there a way to copy large, multi-gigabyte files without caching them (i.e. bypassing the file cache)? Or a way to whitelist or blacklist specific folders from being cached?

Make sure you’re not suffering from too high write buffering. Try echo 50000000 > /proc/sys/vm/dirty_background_bytes and echo 200000000 > /proc/sys/vm/dirty_bytes . For more details, see lonesysadmin.net/2013/12/22/… for details.

3 Answers

There is the nocache utility, which can be prepended to a command just like ionice and nice. It works by preloading a library that adds a posix_fadvise call with the POSIX_FADV_DONTNEED flag to the files the command opens.

In simple terms, it advises the kernel that caching is not needed for that particular file, so the kernel will normally not keep the file's data in the page cache.
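For reference, here is a minimal sketch of the same idea nocache implements, using posix_fadvise() directly; the assumption is that you have a descriptor for a file you have just read or written, and that you flush it first, since dirty pages cannot be dropped from the cache.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Flush any dirty pages first; only clean pages can be evicted. */
    fdatasync(fd);

    /* Advise the kernel that this file's cached pages are not needed;
     * length 0 means "from the offset to the end of the file". */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (err != 0)
        fprintf(stderr, "posix_fadvise failed with error %d\n", err);

    close(fd);
    return err ? 1 : 0;
}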

It does wonders for huge copy jobs, e.g. if you want to back up a multi-terabyte disk in the background with the least possible impact on your running system, you can do something along the lines of nice -n19 ionice -c3 nocache cp -a /vol /vol2.

A package will be available in Ubuntu 13.10 and up. If you are on a previous release you can either install the 13.10 package or opt for this 12.04 backport by François Marier.
