Compare two files line by line and generate the difference in another file
I want to compare file1 with file2 and generate a file3 which contains the lines in file1 which are not present in file2.
I tried diff, but it puts numbers and other symbols in front of the differing lines, which makes it difficult for me to compare the files.
diff(1) is not the answer, but comm(1) is.
NAME
    comm - compare two sorted files line by line
SYNOPSIS
    comm [OPTION]... FILE1 FILE2
...
    -1    suppress lines unique to FILE1
    -2    suppress lines unique to FILE2
    -3    suppress lines that appear in both files
comm -2 -3 file1 file2 > file3
The input files must be sorted. If they are not, sort them first. This can be done with a temporary file or, provided that your shell supports process substitution (bash does), by sorting both files on the fly.
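Both ways of sorting can be sketched as follows. This is only an illustration: the sample contents and the names file1.sorted, file2.sorted and file3.procsub are made up, and the second form assumes bash or another shell with process substitution.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two unsorted sample files (illustrative contents).
printf 'cherry\napple\nbanana\n' > file1
printf 'banana\ndate\napple\n' > file2

# Option 1: sort into temporary files, then compare.
sort file1 > file1.sorted
sort file2 > file2.sorted
comm -2 -3 file1.sorted file2.sorted > file3

# Option 2: sort on the fly with process substitution.
comm -2 -3 <(sort file1) <(sort file2) > file3.procsub

cat file3    # prints: cherry
```

Both variants should produce the same output, since comm only ever sees sorted input.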
What does «sorted» mean? That the lines have the same order? Then it’s probably fine for most use cases — as in, checking for what lines have been added by comparing with a backed-up older version. If newly added lines cannot be between the existing lines, that’s more of an issue.
@EgorHans: if the file has e.g. lines containing integers such as «3\n1\n3\n2\n», the lines must first be reordered into the order that sort produces, e.g. «1\n2\n3\n3\n», with duplicates adjacent. That is «sorted», and both files must be sorted in the same manner. When the newer file has new lines it does not matter if they are «between existing lines» because after the sort they are not; they’re in sorted order.
The Unix utility diff is meant for exactly this purpose.
$ diff -u file1 file2 > file3
See the manual and the Internet for options, different output formats, etc.
That does not do the job requested; it inserts a bunch of extra characters, even with the use of commandline switches suggested in other answers.
You can find the difference with:
diff -a --suppress-common-lines -y a.txt b.txt
You can redirect the output to a file (c.txt) using:
diff -a --suppress-common-lines -y a.txt b.txt > c.txt
This will answer your question:
«... which contains the lines in file1 which are not present in file2.»
There are two limitations to this answer: (1) it only works for short lines (fewer than 80 characters by default, although this can be modified) and, more importantly, (2) it adds a »
In many cases, you’ll also want to use -d , which will make diff do its best to find the smallest possible diff. -i , -E , -w , -B and --suppress-blank-empty can also be useful occasionally, although not always. If you don’t know what fits your use case, try diff --help first (which is generally a good idea when you don’t know what a command can do).
Also, using --line-format=%L, you keep diff from generating any extra characters (at least, that is what the help says; I have yet to try it out).
- Lines that exist only in file2:
grep -Fxvf file1 file2 > file3
- Lines that exist only in file1:
grep -Fxvf file2 file1 > file3
- Lines that exist in both files:
grep -Fxf file1 file2 > file3
Switches description (see also man grep ):
- The -F tells grep to interpret PATTERNS as fixed strings, not regular expressions.
- The -x tells grep to select only those matches that exactly match the whole line, not partial matches.
- With the -f , grep obtains the patterns from FILE, one per line.
- The -v just inverts the sense of matching, to select non-matching lines.
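To see why -x matters, here is a small sketch with made-up sample files. The output file names only_in_file1 and without_x are hypothetical and used just for this illustration.

```shell
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'apple\nbanana\ncherry pie\n' > file1
printf 'banana\ncherry\ndate\n' > file2

# Lines of file1 that do not appear, as whole lines, in file2.
grep -Fxvf file2 file1 > only_in_file1

# Same command without -x: "cherry pie" is dropped as well,
# because it merely contains the pattern "cherry".
grep -Fvf file2 file1 > without_x

cat only_in_file1
```

With -x the result keeps both apple and cherry pie; without it only apple survives, which is rarely what you want when comparing whole lines.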
Sometimes diff is the utility you need, but sometimes join is more appropriate. The files need to be pre-sorted or, if you are using a shell which supports process substitution such as bash, ksh or zsh, you can do the sort on the fly.
Join is really useful and fast. It can be used in many cases, such as finding differences like this one, or finding the lines two files have in common.
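The join command itself is not shown in this answer. A minimal sketch, assuming the intended form uses -v 1 (print the lines of the first file that have no match in the second); the sample contents and the *.sorted file names are made up:

```shell
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'cherry\napple\nbanana\n' > file1
printf 'banana\ndate\napple\n' > file2

# join expects both inputs sorted on the join field.
sort file1 > file1.sorted
sort file2 > file2.sorted

# -v 1: print only the unpairable lines from the first file.
join -v 1 file1.sorted file2.sorted > file3

cat file3    # prints: cherry
```

Note that join compares the first whitespace-separated field by default, not the whole line, so this sketch deliberately uses single-word lines.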
sdiff usually works much better for me. You may want to sort the files first if the order of lines is not important (e.g. some text config files).
sdiff -w 185 file1.cfg file2.cfg
Nice utility! I love how it marks the differentiating lines. Makes it much easier to compare configs. This together with sort is a deadly combo (e.g. sdiff <(sort file1) <(sort file2) )
You could use diff with following output formatting:
diff --old-line-format='' --unchanged-line-format='' file1 file2
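--old-line-format='' suppresses output for lines that appear only in file1.
--unchanged-line-format='' suppresses output for lines that are the same in both files.
What remains is the lines that appear only in file2; to keep the lines unique to file1 instead, use --new-line-format='' in place of --old-line-format=''.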
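A small sketch of both directions, assuming GNU diff (these line-format options may be missing from other diff implementations). The sample contents and the output names only_in_file1 and only_in_file2 are made up:

```shell
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'apple\nbanana\ncherry\n' > file1
printf 'apple\ncherry\ndate\n' > file2

# The command from the answer: drop file1-only and unchanged lines,
# which leaves the lines that appear only in file2.
diff --old-line-format='' --unchanged-line-format='' file1 file2 > only_in_file2

# Mirror image: drop file2-only and unchanged lines,
# which leaves the lines that appear only in file1.
diff --new-line-format='' --unchanged-line-format='' file1 file2 > only_in_file1

# diff exits with status 1 when the files differ; that is not an error here.
cat only_in_file1    # prints: banana
```

Unlike comm, this keeps the original line order and needs no sorting, but diff compares by position, so reordered lines count as differences.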
I’m surprised nobody mentioned diff -y to produce a side-by-side output, for example:
diff -y file1 file2 > file3
And in file3 (different lines have a symbol | in middle):
If you need to solve this with coreutils the accepted answer is good:
You can also use sd (stream diff), which doesn’t require sorting nor process substitution and supports infinite streams, like so:
cat file1 | sd 'cat file2' > file3
Probably not that much of a benefit on this example, but still consider it; in some cases you won’t be able to use comm nor grep -F nor diff .
Here’s a blogpost I wrote about diffing streams on the terminal, which introduces sd.
Many answers already, but none of them perfect IMHO. Thanatos’ answer leaves some extra characters per line and Sorpigal’s answer requires the files to be sorted or pre-sorted, which may not be adequate in all circumstances.
I think the best way of getting the lines that are different and nothing else (no extra chars, no re-ordering) is a combination of diff , grep , and awk (or similar).
If the lines do not contain any »
This one-liner diffs both files, then filters out the ed-style output of diff, then removes the trailing »
comm doesn’t require sorting (in newer versions?): just use --nocheck-order (this only disables the order check, so results on unsorted input may still be incomplete). I use this a lot when manipulating CSVs from the CLI.
diff a1.txt a2.txt | grep '> ' | sed 's/> //' > a3.txt
I tried almost all the answers in this thread, but none was complete. After a few trials, the command above worked for me. diff gives you the differences, but with some unwanted special characters; the lines that actually differ start with '> '. So the next step is to grep the lines that start with '> ' and then remove that prefix with sed.
This is a bad idea. You would also need to modify lines starting with < . You will see this if you swap the order of the input files. Even if you did this, you would want to omit grep by using more sed: diff a1 a2 | sed '/>/s///' . This can still break lines containing > or < in the right situation, and it still leaves extra lines describing line numbers. If you wanted to try this approach, a better way would be: diff -C0 a1 a2 | sed -ne '/^[+-] /s/^..//p' .
Use the Diff utility and extract only the lines starting with < in the output
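One way to do this, as a sketch rather than the answer's own command (which is not shown), is to let sed both select the lines and strip the marker. The sample contents are made up:

```shell
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'apple\nbanana\ncherry\n' > file1
printf 'apple\ncherry\ndate\n' > file2

# In diff's default output, lines taken from the first file start with "< ".
# sed -n prints only the lines where the substitution succeeded.
diff file1 file2 | sed -n 's/^< //p' > file3

cat file3    # prints: banana
```

Anchoring the pattern at the start of the line avoids touching a < that occurs inside the text, but diff still compares by position, so lines that merely moved are reported too.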
If you have a CSV file with single or even multiple columns, you can do these line by line «diff» operations using the sqlite3 embedded db. It comes with python, so should be available on most linux/macs. You can script the sqlite3 commands on the bash shell without needing to write python.
- Create your a.csv and b.csv files
- Ensure sqlite3 is installed using the command «sqlite3 -help»
- Run the below commands directly on the Linux/Mac shell (or put it in a script)
echo "
.mode csv
.import a.csv atable
.import b.csv btable
create table result as select * from atable EXCEPT select * from btable;
.output result.csv
select * from result;
.quit
" | sqlite3 temp.db
Note : Ensure there is a newline for each of the sqlite3 commands.
- Import the 2 csvs into «atable» and «btable» respectively.
- Use the «except» sql operator to select the data available in «atable» but missing in «btable». Create a «result» table using the select query statement
- Output the result table to result.csv by running «select * from result;»
If you need to operate on specific columns, sqlite3 or any db is the way to go.
I have tried diff’ing on multiple GB files using the builtin diff and comm tools. Sqlite beats linux utilities by a mile.
How can I compare two files line by line?
The result should display the lines that are present in fileA but not in fileB.
I tried tkdiff but since some lines are jumbled it shows many differences.
I can’t speak to how portable this is, but I tried to cover all the bases. I did my best to replicate the two files in my testing based on your information. If you run into special character issues with sed, they can be escaped in the second line of the cleanLine function.
#!/bin/bash
# compare two files and return lines in
# first file that are missing in second file

ProgName=$(basename "$0")        # reconstructed: original expansion lost in extraction
Pid=$$
CHK_FILE="$1"
REF_FILE="$2"
D_BUG="$3"
TMP_FILE="/tmp/REF_${Pid}.tmp"   # reconstructed: presumably keyed on $Pid
declare -a MISSING=()
m=0

scriptUsage() {
cat <<ENDUSE

Usage: $ProgName <file_to_check> <reference_file> [-d|--debug]

Lines in 'file_to_check' not present in 'reference_file'
are printed to standard output.

file_to_check:  File being checked
reference_file: File to be checked against
-d|--debug:     Run script in debug mode (Optional)
-h|--help:      Print this help message

ENDUSE
}

# delete temp file on any exit
trap 'rm "$TMP_FILE" > /dev/null 2>&1' EXIT

#-- check args
[[ $CHK_FILE == "-h" || $CHK_FILE == "--help" ]] && { scriptUsage; exit 0; }
[[ -n $CHK_FILE && -n $REF_FILE ]] || { >&2 echo "Not enough arguments!"; scriptUsage; exit 1; }
[[ $D_BUG == "-d" || $D_BUG == "--debug" ]] && set -x
[[ -s $CHK_FILE ]] || { >&2 echo "File $CHK_FILE not found"; exit 1; }
[[ -s $REF_FILE ]] || { >&2 echo "File $REF_FILE not found"; exit 1; }
#--

#== edit temp file to match 3 comparison rules
# copy ref file to temp for editing
cp "$REF_FILE" "$TMP_FILE" || { >&2 echo "Unable to create temporary file"; exit 1; }
# rule 3 - ignore empty lines
sed -i '/^\s*$/d' "$TMP_FILE"
# rule 1 - ignore begin/end of line spaces
sed -i 's/^[[:space:]][[:space:]]*//;s/[[:space:]][[:space:]]*$//' "$TMP_FILE"
# rule 2 - multi space/tab as single space
sed -i 's/[[:space:]][[:space:]]*/ /g' "$TMP_FILE"
#==

# function to clean LINE to match 3 rules
# & escape '/' and '.' for later sed command
cleanLine() {
    var=$(echo "$1" | sed 's/^[[:space:]][[:space:]]*//;s/[[:space:]][[:space:]]*$//;s/[[:space:]][[:space:]]*/ /g')
    echo "$var" | sed 's/\//\\\//g;s/\./\\\./g'
}

### parse check file
while IFS='' read -r LINE || [[ -n $LINE ]]
do
    if [[ -z $LINE ]]
    then
        continue
    else
        CLN_LINE=$(cleanLine "$LINE")
        # the trailing 'p' is reconstructed; sed needs a command after the address
        FOUND=$(sed -n "/$CLN_LINE/p" "$TMP_FILE")
        [[ -z $FOUND ]] && MISSING[$m]="$LINE" && ((m++))
        FOUND=""
    fi
done < "$CHK_FILE"
###

#++ print missing line(s) (if any)
if (( m > 0 ))
then
    printf "\n Missing line(s) found:\n"
    #*SEE BELOW ON THIS
    for (( p=0; p<m; p++ ))
    do
        echo "${MISSING[$p]}"
    done
    echo
else
    printf "\n **No missing lines found**\n\n"
fi
#* using 'for p in ${MISSING[@]}' causes:
#* "SPACED LINES" to become:
#* "SPACED"
#* "LINES" when printed to stdout!
#++
Comparing two files in linux terminal
There are two files called «a.txt» and «b.txt»; both contain a list of words. I want to check which words are in «a.txt» but not in «b.txt». I need an efficient algorithm, as I need to compare two dictionaries.
If you have vim installed, try this:
You will find it fantastic.
comm compares (sorted) input files and by default outputs three columns: lines that are unique to a, lines that are unique to b, and lines that are present in both. By specifying -1 , -2 and/or -3 you can suppress the corresponding output. Therefore comm -23 a b lists only the entries that are unique to a. I use the <(...) syntax to sort the files on the fly; if they are already sorted, you don't need this.
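The column behaviour described above can be seen with two small word lists that are already sorted, so no sort step is needed. The contents and the output file names are made up for illustration:

```shell
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'apple\nbanana\ncherry\n' > a.txt
printf 'banana\ncherry\ndate\n' > b.txt

# Default: three tab-indented columns (only in a, only in b, in both).
comm a.txt b.txt

# Suppress columns 2 and 3: words only in a.txt.
comm -23 a.txt b.txt > only_in_a.txt

# Suppress columns 1 and 3: words only in b.txt.
comm -13 a.txt b.txt > only_in_b.txt

# Suppress columns 1 and 2: words present in both files.
comm -12 a.txt b.txt > in_both.txt

cat only_in_a.txt    # prints: apple
```

Because comm walks both files once in step, it never needs to hold either list in memory, which is what makes it a good fit for large dictionaries.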
@AliImran, comm is more efficient because it does the job in a single run, without storing the entire file in memory. Since you’re using dictionaries that are most likely already sorted you don’t even need to sort them. Using grep -f file1 file2 on the other hand will load the entire file1 into memory and compare each line in file2 with all of those entries, which is much less efficient. It’s mostly useful for small, unsorted -f file1 .
Thanks @AndersJohansson for sharing the «comm» command. It’s nifty indeed. I frequently have to do outer joins between files and this does the trick.
If you prefer the diff output style from git diff , you can use it with the --no-index flag to compare files not in a git repository:
git diff --no-index a.txt b.txt
Using a couple of files with around 200k file name strings in each, I benchmarked (with the built-in time command) this approach vs some of the other answers here:
git diff --no-index a.txt b.txt         # ~1.2s
comm -23 <(sort a.txt) <(sort b.txt)    # ~0.2s
diff a.txt b.txt                        # ~2.6s
sdiff a.txt b.txt                       # ~2.7s
vimdiff a.txt b.txt                     # ~3.2s
comm seems to be the fastest by far, while git diff --no-index appears to be the fastest approach for diff-style output.
Update 2018-03-25 You can actually omit the --no-index flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:
This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.