Linux split binary files

What is the fastest and most reliable way to split a 50GB binary file into chunks of 5GB or less, and then reassemble it later?

Our servers are running Ubuntu Linux, and the binary file is a BSON dump of a large MongoDB collection. How reliable is a tool like split? Is there a faster or better way to do this?

5 Answers

split is very reliable. We use it for shipping large log files, and it has worked well for files up to a few GB (though not 50 GB in our case).

I believe split will work for your requirement; give it a try and let us know.

Split into 5GB files
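The command itself didn't survive the copy; with GNU split (as shipped on Ubuntu) it was presumably along these lines. The demo file below is a small stand-in for the 50 GB dump so the example actually runs:

```shell
# Small stand-in for the real 50 GB dump, so this is runnable anywhere
head -c 1000000 /dev/urandom > dump.bson

# Split into pieces of at most 5 GB each; with no prefix argument,
# split names the pieces xaa, xab, xac, ...
split -b 5G dump.bson
```

On the real 50 GB file this yields ten 5 GB pieces; split never alters the bytes, it only cuts the stream.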

It will split the file into multiple 5 GB pieces, named xaa, xab, xac, and so on.

Concatenate

This way you can concatenate the pieces back into a single file on the other end.
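A runnable sketch of the round trip, assuming the default xaa/xab/... names that split produces:

```shell
# Demo: split a small file, then concatenate the pieces back together
head -c 100000 /dev/urandom > original.bson
split -b 30000 original.bson              # -> xaa xab xac xad

cat xa* > reassembled.bson                # glob expands in lexical order
cmp original.bson reassembled.bson && echo OK
```

The glob works because split's suffixes sort lexically in the same order the pieces were cut.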

On OS X I have to give the output file name explicitly: split -b3m file.zip file.zip; otherwise split creates files called xaa, xab, etc.


AFAIK they are completely reliable, and I doubt there is anything more efficient.

split & cat are totally reliable. You can additionally compress in-line, like this. Suppose your input file is dump.bson:
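The split half of the pipeline is missing from the scrape; given the dump.bson.gz.* names in the reconstitute command, it was presumably something like this (the demo file is a small stand-in for the real dump):

```shell
# Small stand-in for the real dump
head -c 100000 /dev/urandom > dump.bson

# Compress on the fly and split the compressed stream into 5 GB pieces
# named dump.bson.gz.aa, dump.bson.gz.ab, ...
gzip -c dump.bson | split -b 5G - dump.bson.gz.
```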

And then reconstitute with this:

cat dump.bson.gz.* | gunzip > dump.bson 

Tip: this works just as well with xz/xzdec in place of gzip/gunzip.

If you have rar installed, it has worked very well for me:

To Separate

rar a -m0 -v5000m newfilename giantfile.foo 
  • a = add files to archive
  • m0 = no compression
  • v5000m = split into chunks of 5000 megabytes

To Reassemble

Start with the first part. The rest will be found automatically:

unrar x newfilename.part1.rar 

Additional benefits of rar:
  • CRC checks on the content of the split archive,
  • split-file ordering is kept automatically,
  • multiple files and directories can be included.

From your question one could assume that between the split phase and the reassemble phase there is a transport phase; in that case, a process that involves compression could be quite beneficial.

The standard GNU tar has the -M option to create multi-volume archives and the -L option to set the length of a volume (in units of 1024 bytes). One caveat: GNU tar does not allow combining -M with its compression options (-z, -j, ...), so compression has to happen as a separate step or via a pipe through split. Either way, you end up with a toolchain that

  • comes with every Linux distro I know of
  • provides splitting, reassembly, compression and decompression
  • has the additional benefit of being able to pack more than one file into a single archive sequence (e.g. the MongoDB data folder as-is)
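Because GNU tar won't combine -M with compression, in practice this often ends up as a tar-through-split pipeline. A sketch with placeholder names (datadir stands in for e.g. the MongoDB data folder):

```shell
# Stand-in data directory (the real one might be the MongoDB data folder)
mkdir -p datadir
head -c 100000 /dev/urandom > datadir/dump.bson

# Pack, compress, and split into 5 GB pieces in one pipeline
tar -czf - datadir | split -b 5G - mongodb.tgz.

# On the other end: reassemble, decompress, and unpack
mkdir -p restored
cat mongodb.tgz.* | tar -xzf - -C restored
```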

Splitting a binary into chunks on Linux, and re-combining them on Windows

Recently, I needed to transfer a binary over a very limited network connection allowing only small packets to be sent. I ended up splitting the binary into pieces on my Linux box and reassembled the pieces on the target Windows host.

If, for some reason, you cannot rely on easier means such as IP fragmentation or a smaller maximum transmission unit (MTU), here's how to do the splitting and re-combining.

Split binary into pieces on Linux

Splitting a file into pieces on Linux is very straightforward – just use the split program (see its man page).

The following command will split evil.exe into pieces of 1000 bytes, prefix them with chunk and use a numeric suffix for each chunk.

split -b 1000 -d evil.exe chunk 

So we will end up with something like this:

chunk00 chunk16 chunk32 chunk48
chunk01 chunk17 chunk33 chunk49
chunk02 chunk18 chunk34 chunk50
chunk03 chunk19 chunk35 chunk51
chunk04 chunk20 chunk36 chunk52
chunk05 chunk21 chunk37 chunk53
chunk06 chunk22 chunk38 chunk54
chunk07 chunk23 chunk39 chunk55
chunk08 chunk24 chunk40 chunk56
chunk09 chunk25 chunk41 chunk57
chunk10 chunk26 chunk42 chunk58
chunk11 chunk27 chunk43 chunk59
chunk12 chunk28 chunk44
chunk13 chunk29 chunk45
chunk14 chunk30 chunk46
chunk15 chunk31 chunk47

Now that we have our chunks, we can host them for the Windows machine to download.

Download from Windows

To download the individual chunks to the Windows host, let’s use a quick PowerShell one-liner with Invoke-WebRequest:

0..59 | % { $chunk = 'chunk{0:d2}' -f $_; iwr 1.2.3.4/$chunk -OutFile $chunk }

If all you have is a command prompt and cannot download the chunks directly, one idea is to convert the binary chunks into hex strings and then send these strings through the prompt of the shell you might have.
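On the Linux side that conversion could look like this with xxd (which ships with vim). The Windows-side decode might be certutil -decodehex, but that counterpart is an assumption, so the round trip below is shown entirely with xxd:

```shell
head -c 1000 /dev/urandom > chunk00       # an example chunk

xxd -p chunk00 > chunk00.hex              # binary -> plain hex text
xxd -r -p chunk00.hex > chunk00.bin       # hex text -> binary again
cmp chunk00 chunk00.bin && echo intact
```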

Combine the chunks

Now that we have all the pieces of the puzzle, let's assemble them into the self-contained binary we actually want, using Get-ChildItem , Get-Content and Set-Content :

gci -Filter "chunk*" | gc -Enc Byte -Read 1000 | sc evil.exe -Enc Byte 

Doing a Get-FileHash evil.exe on the Windows host should now return the same hash as shasum -a 256 evil.exe on Linux.

Updated: September 09, 2020

© 2023 David Hamann GmbH.

Splitting a binary file on binary delimiter?

I’m working on a shell script to convert MPO stereographic 3D images into standard JPEG images. A MPO file is just two JPEG images, concatenated together. As such, you can split out the JPEG files by finding the byte offset of the second JPEG’s magic number header (0xFFD8FFE1). I’ve done this manually using hexdump/xxd, grep, head, and tail. The problem here is grep: what can I use to search a binary directly for a specific magic number, and get back a byte offset? Or should I not use a shell script for this at all? Thanks.
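One way to get that byte offset directly is GNU grep (the -P flag is GNU-specific). The sample file here is a tiny stand-in with the magic number placed at offset 4:

```shell
# Tiny stand-in: 4 filler bytes, then the JPEG APP1 magic number
printf 'AAAA\377\330\377\341BBBB' > sample.mpo

# -b: print byte offset, -o: one line per match, -a: treat binary as text,
# -P: Perl regex so \xNN escapes denote raw bytes
LC_ALL=C grep -boaP '\xFF\xD8\xFF\xE1' sample.mpo | cut -d: -f1
# prints: 4
```

The offset can then feed head/tail or dd to cut the file at that point.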

5 Answers

You can do this using bbe (http://bbe-.sourceforge.net/), which is a sed-like program for binary files:

In order to extract the first JPEG, use:

bbe -b '/\xFF\xD8\xFF\xE1/:' -e 'D 2' -o first_jpeg mpo_file 

And the second JPEG:

bbe -b '/\xFF\xD8\xFF\xE1/:' -e 'D 1' -o second_jpeg mpo_file 

Note that this will not work if the JPEG’s magic number occurs somewhere else in the MPO file.

Thanks for bringing bbe to my attention! By the way, your Sourceforge link is broken, and bbe.sf.net is a different project.

I think that Bart is on to your biggest problem: if that binary sequence repeats during the process, you will get partial JPEGs.

I did a quick test by concatenating some JPEGs and then extracting them with awk (please note that the magic number in my files ended in 0xE0 and not 0xE1):

 # for i in *.jpg ; do cat $i ; done > test.mpo
 # awk 'BEGIN { RS="\xff\xd8\xff\xe0" } { FILENUM++; FILENAME="image0" FILENUM ".jpg"; printf "%s", RS $0 > FILENAME }' test.mpo
 # file image0*.jpg
 image01.jpg:  JPEG image data, JFIF standard 1.01
 image010.jpg: JPEG image data, JFIF standard 1.01
 image011.jpg: JPEG image data, JFIF standard 1.01

This seemed to work OK for me, but the above-mentioned issues are still unhandled and very real.

I guess 0xE1 in the magic number indicates that it’s the second image in the sequence and you never have more than 2 images. Adjust as needed. =)

I’m giving you the answer check because awk is available on every Unix system (and it reminds me that I need to learn more about it :). My shell script is currently doing some rudimentary checks for an image03.jpg or lack of image02.jpg and aborting, which helps to handle the magic number problem. I could also do some checking for EXIF headers. Unfortunately, I’m only aware of one program that can read these files natively — the Fujifilm Windows app — although Wikipedia claims that Digikam supports MPO. Will have to look at their source, and my camera’s documentation. For now, this is good.

Glad it worked out for you.. en.wikipedia.org/wiki/Magic_number_%28programming%29#Examples says that 0xFF 0xD8 is the beginning of the JPEG magic number, so it stands to reason that what follows it is up to the implementation.

Since the blog entry now 404s, here is the script that I wrote based on it. I haven’t used it in many years.

#!/usr/bin/env bash
# Script to convert 3D MPO files, as used in the Fuji FinePix series of 3D
# cameras, into standard JPEG files.
# Based on work by David Glover, posted at
# http://www.davidglover.org/2010/09/using-the-fuji-finepix-real-3d-w3-camera-on-a-mac-or-linuxunix.html
# This script requires exiftool and ImageMagick.

FULLNAME="$1"
FILENAME="$(basename "$FULLNAME")"
DIRNAME="$(dirname "$FULLNAME")"
BASENAME="${FILENAME%.*}"

# Create output directories
mkdir -p "$DIRNAME"/stereoscopic-rl/
mkdir -p "$DIRNAME"/stereoscopic-mpo/
mkdir -p "$DIRNAME"/stereoscopic-anaglyph/
mkdir -p "$DIRNAME"/monoscopic-l/
mkdir -p "$DIRNAME"/monoscopic-r/

# Create separate left and right images
exiftool -trailer:all= "$FULLNAME" -o "$DIRNAME"/monoscopic-l/"$BASENAME"-left.jpg
exiftool "$FULLNAME" -mpimage2 -b > "$DIRNAME"/monoscopic-r/"$BASENAME"-right.jpg

# Move the MPO file to its new home
mv "$FULLNAME" "$DIRNAME"/stereoscopic-mpo/

# Determine parallax value and create cropped images for stereo generation
# 36 is only appropriate for 4:3 or 3:2 images
parallax=$(exiftool -b -Parallax "$DIRNAME"/monoscopic-r/"$BASENAME"-right.jpg)
parallax=$(echo "$parallax"*36+0.5 | bc | cut -d . -f 1)

# The above pipeline can't deal with a parallax of zero
# In theory, this fix doesn't cover values between zero and -1
# TODO improve the calculation
if [ ! "$parallax" ]; then
    parallax=0
fi
echo $parallax

if [ $parallax -ge 0 ]; then
    convert "$DIRNAME"/monoscopic-l/"$BASENAME"-left.jpg -crop +"$parallax"+0 "$DIRNAME"/monoscopic-l/"$BASENAME"-left-cropped.jpg
    convert "$DIRNAME"/monoscopic-r/"$BASENAME"-right.jpg -crop -"$parallax"+0 "$DIRNAME"/monoscopic-r/"$BASENAME"-right-cropped.jpg
else
    convert "$DIRNAME"/monoscopic-l/"$BASENAME"-left.jpg -crop -"$((-1*$parallax))"+0 "$DIRNAME"/monoscopic-l/"$BASENAME"-left-cropped.jpg
    convert "$DIRNAME"/monoscopic-r/"$BASENAME"-right.jpg -crop +"$((-1*$parallax))"+0 "$DIRNAME"/monoscopic-r/"$BASENAME"-right-cropped.jpg
fi

# Create stereoscopic images for cross-eye (right-left) and anaglyph (red-cyan) viewing
convert "$DIRNAME"/monoscopic-r/"$BASENAME"-right-cropped.jpg "$DIRNAME"/monoscopic-l/"$BASENAME"-left-cropped.jpg +append "$DIRNAME"/stereoscopic-rl/"$BASENAME"-stereoscopic-rl.jpg
composite -stereo 0 "$DIRNAME"/monoscopic-r/"$BASENAME"-right-cropped.jpg "$DIRNAME"/monoscopic-l/"$BASENAME"-left-cropped.jpg "$DIRNAME"/stereoscopic-anaglyph/"$BASENAME"-stereoscopic-anaglyph.jpg

# Clean up separated parallax-corrected images
rm "$DIRNAME"/monoscopic-l/"$BASENAME"-left-cropped.jpg
rm "$DIRNAME"/monoscopic-r/"$BASENAME"-right-cropped.jpg

exit 0
