How to dump part of binary file
I have a binary file and want to extract part of it, starting from a known byte string (i.e. FF D8 FF D0) and ending with another known byte string (AF FF D9). In the past I’ve used dd to cut a part off the beginning or end of a binary file, but that command doesn’t seem to support what I’m asking. What terminal tool can do this?
7 Answers
Locate the start/end position, then extract the range.
$ xxd -g0 input.bin | grep -im1 FFD8FFD0 | awk -F: '{print $1}'
0000cb0
$ ^FFD8FFD0^AFFFD9^
0009590
$ dd ibs=1 count=$((0x9590-0xcb0+1)) skip=$((0xcb0)) if=input.bin of=output.bin
(The ^FFD8FFD0^AFFFD9^ is shell history substitution: it re-runs the previous command with FFD8FFD0 replaced by AFFFD9.)
I found that «count=$((0x9590-0xcb0+2)) skip=$((0xcb0+1))» matches exactly, starting from «FFD8..» and ending at «AFFF..». Thank you for your nice procedure. Cheers
After a couple of extractions I noticed that this is only an approximate solution. The +1, +2 all depend on content. For example, 007d820: 74290068656c6c6f2e6a706700ffd8ff gives 007d820 for both '74 29 00 68' and '00 ff d8 ff', so something slightly different has to be done.
This does not work. If the pattern to match is split across two lines of xxd output it will never be found (by default xxd -g0 puts 16 bytes per line). For a pattern 4 bytes long, the probability of a split is 25%. Also, the grep|awk prints the address of the beginning of the line where the pattern occurs, so you can be off by up to a line’s worth of bytes and end up with more data than you really want.
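To see the split happen (a made-up demo; it assumes bash’s printf \x escapes and xxd’s default 16 bytes per line): a 4-byte pattern starting at offset 14 straddles two dump lines, so even a case-insensitive line-based grep misses it:
$ printf '00000000000000\xff\xd8\xff\xd0' | xxd -g0
00000000: 3030303030303030303030303030ffd8  00000000000000..
00000010: ffd0                              ..
$ printf '00000000000000\xff\xd8\xff\xd0' | xxd -g0 | grep -ic ffd8ffd0
0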
We’re not talking about probability here, but certainty! Even with -c160 (the max is 256 for xxd), the probability of a split is more than 2%, which is huge. If you automate this, you need a script that works all the time, not 98% of the time. See my answer below for a proposal that works all the time.
xxd -c1 -p file |
awk -v b="ffd8ffd0" -v e="aaffd9" '
    found == 1 {
        print $0
        str = str $0
        if (str == e) { found = 0; exit }
        if (length(str) == length(e)) { str = substr(str, 3) }
    }
    found == 0 {
        str = str $0
        if (str == b) { found = 1; print str; str = "" }
        if (length(str) == length(b)) { str = substr(str, 3) }
    }
    END { exit found }
' |
xxd -r -p > new_file
test ${PIPESTATUS[1]} -eq 0 || rm new_file
The idea is to use awk between two xxd calls to select the part of the file that is needed. Once the 1st pattern is found, awk prints the bytes until the 2nd pattern is found, then exits.
The case where the 1st pattern is found but the 2nd is not must be taken into account. It is handled in the END part of the awk script, which returns a non-zero exit status. This is caught by bash’s ${PIPESTATUS[1]}, at which point I decided to delete the new file.
Note that an empty file also means that nothing has been found.
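To sanity-check the pipeline (a made-up smoke test; it requires bash’s printf \x escapes, and the file names are arbitrary):
printf '\x00\xff\xd8\xff\xd0\x01\x02\x03\xaa\xff\xd9\x00' > file
# run the xxd | awk | xxd pipeline above, then:
xxd -p new_file    # expect: ffd8ffd0010203aaffd9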
Yet another mark reassignment: lOranger’s solution fails if the 2nd pattern can be found before the 1st, giving $len a negative value. This solution searches only after the 1st pattern match, so it doesn’t have that problem, nor does it generate an intermediate triple-size file.
After testing this more, I found it works without issues, but it’s rather slow on larger files. Does anyone see room for optimisation, or is this the best one can get from xxd/awk?
Try the new sed version that I just posted. This one could be optimized by replacing the string concatenation and extraction with rotating indexes in arrays, but it would be less readable, and I do not want to do it if not needed ;-).
This should work with standard tools (xxd, tr, grep, awk, dd). It correctly handles the «pattern split across lines» issue, and it only looks for the pattern aligned at a byte offset (not at a nibble).
file=<yourfile>
outfile=<outfile>
startpattern="ff d8 ff d0"
endpattern="af ff d9"
xxd -g0 -c1 -ps ${file} | tr '\n' ' ' > ${file}.hex
start=$((($(grep -bo "${startpattern}" ${file}.hex \
    | head -1 | awk -F: '{print $1}')-1)/3))
len=$((($(grep -bo "${endpattern}" ${file}.hex \
    | head -1 | awk -F: '{print $1}')-1)/3-${start}))
dd ibs=1 count=${len} skip=${start} if=${file} of=${outfile}
Note: The script above uses a temporary file to avoid doing the binary-to-hex conversion twice. A space/time trade-off is to pipe the result of xxd directly into the two greps. A one-liner is also possible, at the expense of clarity.
One could also use tee and named pipes to avoid storing a temporary file and converting the output twice, but I’m not sure it would be faster (xxd is fast) and it is certainly more complex to write.
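If the hex dump fits in memory, the temporary file can also be avoided with a plain shell variable (an untested bash sketch using the same variables as the script above; since each byte occupies exactly three characters in the dump, dividing a grep -bo offset by 3 gives the byte index):
hex=$(xxd -g0 -c1 -ps "${file}" | tr '\n' ' ')
start=$(($(printf '%s' "${hex}" | grep -bo "${startpattern}" | head -1 | awk -F: '{print $1}')/3))
len=$(($(printf '%s' "${hex}" | grep -bo "${endpattern}" | head -1 | awk -F: '{print $1}')/3-start))
dd ibs=1 count=${len} skip=${start} if="${file}" of="${outfile}"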
lOranger, I used -c64 to compensate a bit, plus cut and sed to calculate the correct address, but -c1 should be the real solution. I’ll mark your solution once I manage to make it work. First I needed to swap the places of grep’s pattern and filename to make grep work, but regardless I get dd: invalid number. I imagine the problem is in the start/len calculation syntax. Also, can’t we exclude the spaces and save 1/3 of the .hex file, which would then be double the input file size instead of triple, as it is now?
Sorry, there was a typo in the script: the grep pattern should come before the filename. I also added a | head -1 to cover the case where the pattern appears multiple times in the input, which can happen. Concerning your question, the space between hex bytes is necessary, otherwise you hit the «nibble» issue (a pattern matching not aligned on byte boundaries).
I’m afraid it still doesn’t work: I get the input file as the result. I used my -c64 script and got the expected dump, but I was unwilling to post it here as it was fragile on boundaries (better than the one provided, but still..).
Please note that you have to convert your hex pattern to lowercase (or add option -i to grep). I’ve just tested the script here with a big binary file and it works fine. Please print the values of ${start} and ${len} to debug (you can check that start and len > 0) to prevent cases where the pattern is not found in the input.
Just in case: pastebin.com/raw.php?i=hZ5UqAF9. Patterns are in lower case. It simply returns the input file as the dump, so the start and end positions are 0 and the input file length.
See this link for a way to do binary grep. Once you have the start and end offset, you should be able with dd to get what you need.
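For instance, GNU grep can do the binary search directly (a hypothetical sketch: -P enables PCRE so the \xff escapes work, -b and -o print the byte offset of each match, -a treats the binary file as text, and LC_ALL=C avoids UTF-8 interpretation):
start=$(LC_ALL=C grep -obaP '\xff\xd8\xff\xd0' input.bin | head -1 | cut -d: -f1)
end=$(LC_ALL=C grep -obaP '\xaf\xff\xd9' input.bin | head -1 | cut -d: -f1)
dd ibs=1 skip=${start} count=$((end - start + 3)) if=input.bin of=output.bin   # +3 includes the 3-byte end pattern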
A variation on the awk solution that assumes that your binary file, once converted in hex with spaces, fits in memory:
xxd -c1 -p file | tr "\n" " " | sed -n -e 's/.*\(ff d8 ff d0.*aa ff d9\).*/\1/p' | xxd -r -p > new_file
WOW, this is so sweet and looks so easy. Couldn’t be better than this. I’ll leave the mark on lOranger’s answer as it is correct and was posted earlier, but this is by far my favourite snippet.
Too bad the quickest gets the mark, not the shortest. Anyway, it can still be optimized by removing the tr, replacing it inside sed by -e '1h' -e '2,$H' -e '$…
Thanks. I tested this on a laptop with 1 GB of RAM, and it was fine for a 5 MB file, but it made my system inaccessible on a 50 MB file. Is there maybe some general rule for determining the «limit» file size based on available RAM, in your opinion?
A 50 MB file means 150 MB once decoded and once bytes are separated by spaces. It is not that much, but could cause sed to behave very slowly: a line of 150 MB is a lot! You could try the -u option to sed to remove buffering, but it could just worsen the problem. It is difficult to give an opinion on the limit: I do not know sed’s implementation. The best is to do many tries. Sorry not to be able to help more.
Another solution in sed, but using less memory:
xxd -c1 -p file |
sed -n -e '1{N;N;N}' -e '/ff\nd8\nff\nd0/{:a;p;n;ba}' -e 'N;D' |
sed -n -e '1{N;N}' -e '/aa\nff\nd9/{p;Q 1}' -e 'P;N;D' |
xxd -r -p > new_file
test ${PIPESTATUS[2]} -eq 1 || rm new_file
The 1st sed prints from ff d8 ff d0 till the end of file. Note that you need as many N commands in -e '1{N;N;N}' as there are bytes in your 1st pattern, less one.
The 2nd sed prints from the beginning of the file to aa ff d9. Note again that you need as many N commands in -e '1{N;N}' as there are bytes in your 2nd pattern, less one.
Again, a test is needed to check if the 2nd pattern is found, and delete the file if it is not.
Note that the Q command is a GNU extension to sed. If you do not have it, you need to trash the rest of the file once the pattern is found (in a loop like the one in the 1st sed, but not printing), and check after the hex-to-binary conversion that new_file ends with the right pattern.
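If you need such a final check, something like this would do (an untested sketch; 3 is the length of the end pattern in bytes):
tail -c3 new_file | xxd -p | grep -q aaffd9 || rm new_file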
How do I extract a single chunk of bytes from within a file?
On a Linux desktop (RHEL4) I want to extract a range of bytes (typically less than 1000) from within a large file (>1 Gig). I know the offset into the file and the size of the chunk. I can write code to do this but is there a command line solution? Ideally, something like:
magicprogram --offset 102567 --size 253 < input.binary >output.binary
6 Answers
dd skip=102567 count=253 if=input.binary of=output.binary bs=1
The option bs=1 sets the block size, making dd read and write one byte at a time. The default block size is 512 bytes.
The value of bs also affects the behavior of skip and count since the numbers in skip and count are the numbers of blocks that dd will skip and read/write, respectively.
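For example, both of the following copy the same 512 bytes starting at byte offset 1024 (hypothetical file names):
dd if=input.binary of=output.binary bs=512 skip=2 count=1
dd if=input.binary of=output.binary bs=1 skip=1024 count=512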
Here is an example using hex offsets: dd if=in.bin bs=1 status=none skip=$((0x88)) count=$((0x80)) of=out.bin.
Is there a specific reason why you use bs=1 and count=253 and not the other way round? Would the larger block size make the command more efficient?
@rexford: The skip number is also given in blocks, and 102567 is not a multiple of 253. And given that the OS does its own buffering when reading from a normal file on a file system, in this case efficiency will not be as bad as when reading from a device.
This is an old question, but I’d like to add another version of the dd command that is better-suited for large chunks of bytes:
dd if=input.binary of=output.binary skip=$offset count=$bytes iflag=skip_bytes,count_bytes
where $offset and $bytes are numbers in byte units.
The difference with Thomas’s accepted answer is that bs=1 does not appear here. bs=1 sets the input and output block size to 1 byte, which makes it terribly slow when the number of bytes to extract is large.
This means we leave the block size ( bs ) at its default of 512 bytes. Using iflag=skip_bytes,count_bytes , we tell dd to treat the values after skip and count as byte amount instead of block amount.
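With the numbers from the question, that gives for instance:
dd if=input.binary of=output.binary skip=102567 count=253 iflag=skip_bytes,count_bytes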
@Timmmm GNU dd can be used for iflag support (brew install coreutils). Note: by default the utilities are installed with a g prefix (e.g. gdd instead of dd).
head -c + tail -c
Not sure how it compares to dd in efficiency, but it is fun:
printf "123456789" | tail -c+2 | head -c3
picks 3 bytes, starting at the 2nd one:
234
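In general, to take $size bytes starting at a 0-based byte offset $offset (a sketch with made-up variable names; tail counts from 1, hence the +1):
tail -c +$((offset + 1)) input.binary | head -c ${size} > output.binary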
@elvis.dukaj yes, there should be no difference. Just give it a try with printf '\x01\x02' > f and hd.
Much faster than dd with bs=1, thank you! Please note that tail counts bytes from 1, not from 0. Also, tail exits with error code 1 when its output is closed prematurely by head. Make sure to ignore that error when using «set -e».
dd bs=<chunk size> count=1 skip=<offset in chunks> if=input.binary of=output.binary
It is a detail for the executor, and still better than the above; true, you’d need to re-calc, like: req_offset=$(bc …
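The recalculation could look like this (an untested sketch with made-up numbers, using shell arithmetic instead of bc; the single-block trick only works when the offset is an exact multiple of the chunk size):
size=512
offset=102400          # must be a multiple of $size
dd bs=${size} count=1 skip=$((offset / size)) if=input.binary of=output.binary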
I have had the same problem, trying to cut parts of a RAW disk image. dd with bs=1 is unusable, therefore I have made a simple C program for the task.
// usage:
//   ./cutfile srcfile destfile offset length
//   ./cutfile my.image movie.avi 4524 20412452
// compile, presuming it is saved as cutfile.cc:
//   gcc cutfile.cc -o cutfile -std=c11 -pedantic -W -Wall -Werror
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    if (argc != 5) {
        printf("error, need 4 arguments!\n");
        return 1;
    }
    const unsigned blocksize = 16*512; // can adjust
    unsigned char buffer[blocksize];
    FILE *f = fopen(argv[1], "rb");
    FILE *fout = fopen(argv[2], "wb");
    long offset = atol(argv[3]);
    long length = atol(argv[4]);
    if (f == NULL || fout == NULL) {
        perror("cannot open file");
        return 1;
    }
    // seek to the start of the chunk, then copy it block by block
    fseek(f, offset, SEEK_SET);
    while (length > blocksize) {
        fread(buffer, 1, blocksize, f);
        fwrite(buffer, 1, blocksize, fout);
        length -= blocksize;
    }
    if (length > 0) { // copy the rest
        fread(buffer, 1, length, f);
        fwrite(buffer, 1, length, fout);
    }
    fclose(fout);
    fclose(f);
    return 0;
}
How to cut a file to a given size under Linux?
I want to shrink a file’s size by brute force; that is, I don’t care about the rest, I just want to cut the file, say in half, and discard the rest. The first thing that comes to mind is Perl’s truncate. I’m following the example on that page and did exactly the same thing:
seq 9 > test.txt
ls -l test.txt
perl -we 'open( FILE, "< ./test.txt" ) && truncate( FILE, 8 ) && close(FILE);'
But the file is not shrunk to 8 bytes:
$ ls -lgG test.txt
-rw-rw---- 1 18 2013-08-08 09:49 test.txt
5 Answers
Use the truncate command: truncate -s SIZE filename. SIZE can be specified as bytes, KB, K, MB, M, etc. I assume you can calculate the desired size by hand; if not, you could probably use the stat command to get information about the file's current size.
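For example, to chop a file to half its current size (assuming GNU coreutils, where stat -c %s prints the size in bytes):
size=$(stat -c %s test.txt)
truncate -s $((size / 2)) test.txt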
Note that this chops off from the bottom, i.e. if used on log files you will lose the most recent log lines.
The "<" in open( FILE, "< ./test.txt" ) opens the file for reading. However, to truncate the file you need to modify it, so a read-only file handle isn't going to work. You need to use the read-write "modify" mode ( "+<" ).
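With that mode, the one-liner from the question would become (untested):
perl -we 'open( FILE, "+< ./test.txt" ) && truncate( FILE, 8 ) && close(FILE);'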
As a side issue, it always amazes me when people let system calls fail silently and then ask what went wrong. An essential part of diagnosing a problem is looking at the error message produced; even if you don't understand it, it makes life much easier for those you ask for help.
The following would have been somewhat more helpful:
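For instance (one possible form, printing the error from $! when one of the calls fails):
perl -we 'open( FILE, "< ./test.txt" ) && truncate( FILE, 8 ) && close(FILE) or print "Error: $!\n";'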
although admittedly that would only have reported "invalid argument". Still, that is useful information and might well have led you to the conclusion that the open mode was wrong (as it did for me).