- What is the difference between "sort -u" and "sort | uniq"?
- Answers
- Difference Between sort | uniq and sort -u
- 1. Overview
- 2. Basic Usage
- 3. Counting Unique Entries
- 4. Summary
- How to use the uniq command to process lists in Linux
- Syntax
- Examples
- Without any option
- With -c, --count option
- With -d, --repeated option
- With -D, --all-repeated option
- With -u, --unique option
- With -i, --ignore-case option
- With -f, --skip-fields=N option
- With -s, --skip-chars=N option
- With -w, --check-chars=N option
- With --version option
- Wrap up
- Uniq and Sort command in Linux
- Sort command in Linux
What is the difference between "sort -u" and "sort | uniq"?
Everywhere I see someone needing to get a sorted, unique list, they always pipe to sort | uniq. I’ve never seen any examples where someone uses sort -u instead. Why not? What’s the difference, and why is it better to use uniq than the unique flag to sort?
Answers
sort | uniq existed before sort -u and is compatible with a wider range of systems, although almost all modern systems do support -u, since it’s specified by POSIX. It’s mostly a throwback to the days when sort -u didn’t exist (and people don’t tend to change their methods if the way that they know continues to work; just look at ifconfig vs. ip adoption).
The two were likely merged because removing duplicates within a file requires sorting (at least, in the standard case), and is an extremely common use case of sort. sort -u is also faster internally, because it can do both operations at the same time and doesn’t require IPC (inter-process communication) between sort and uniq. Especially if the file is big, sort -u will likely use fewer intermediate files to sort the data.
On my system I consistently get results like this:
$ dd if=/dev/urandom of=/dev/shm/file bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 8.95208 s, 11.7 MB/s
$ time sort -u /dev/shm/file >/dev/null
real 0m0.500s
user 0m0.767s
sys 0m0.167s
$ time sort /dev/shm/file | uniq >/dev/null
real 0m0.772s
user 0m1.137s
sys 0m0.273s
sort -u also doesn’t mask the return code of sort, which may be important (in modern shells there are ways to get this, for example, bash’s $PIPESTATUS array, but this wasn’t always true).
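The point about the return code can be seen with a small experiment. This is only a sketch, assuming a POSIX-style shell with no pipefail option enabled; the path /nonexistent/input.txt and the variable names are hypothetical and chosen for illustration.

```shell
# Hypothetical path that does not exist, so sort cannot open it.
missing=/nonexistent/input.txt

# sort -u on its own: the shell sees sort's own (non-zero) exit status.
if sort -u "$missing" >/dev/null 2>&1; then
    direct_status=0
else
    direct_status=$?
fi

# sort | uniq: without pipefail the pipeline reports the status of uniq,
# the last command, which succeeds on the empty input it receives.
if sort "$missing" 2>/dev/null | uniq >/dev/null; then
    pipeline_status=0
else
    pipeline_status=$?
fi

echo "sort -u exit status:     $direct_status"
echo "sort | uniq exit status: $pipeline_status"
```

In bash, the failure hidden by the pipeline could still be recovered from the PIPESTATUS array mentioned above, but a script that only checks $? would miss it.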
Difference Between sort | uniq and sort -u
1. Overview
In this tutorial, we’ll do a quick comparison of the Linux commands sort | uniq and sort -u. Both use sort to remove duplicate entries from a list, but they operate in slightly different manners.
Note that all commands below are platform-independent.
2. Basic Usage
Let’s start with a list of colors in a file named color:
% cat color
Black
green
red
red
yellow
Green
red
If we want to remove duplicates, uniq would work in some cases. Checking the man page for uniq:
Repeated lines in the input will not be detected if they are not adjacent, so it may be necessary to sort the files first.
For our list, the result would not be a list of unique entries because our list has duplicated, non-adjacent entries of “red”:
% uniq color
Black
green
red
yellow
Green
red
There are a couple of things we might try. First, the -u argument to uniq looks promising, but it does not deduplicate the list: it prints only the lines that are not immediately repeated. The adjacent pair of “red” entries is dropped entirely, while the later, non-adjacent “red” is kept:
% uniq -u color
Black
green
yellow
Green
red
Alternatively, taking the man page suggestion, sorting the list before calling uniq will remove all of the duplicates.
% sort color
Black
Green
green
red
red
red
yellow
Piping this to uniq yields:
% sort color | uniq
Black
Green
green
red
yellow
Now, checking the man page for sort, we can see that the -u flag will provide the same output:
% sort -u color
Black
Green
green
red
yellow
So, generally speaking, both sort | uniq and sort -u do the same thing. But there are some differences.
For example, sort has other options, like sorting on delimiters. But we can use these regardless of using -u or piping to uniq.
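The behaviour described in this section can be checked with a short script. It is a sketch rather than a definitive test: it assumes a POSIX-style shell, writes the color list to a throwaway temporary directory, and pins the collation order with LC_ALL=C so that the result does not depend on the locale. The variable names are illustrative.

```shell
# Recreate the color list in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' Black green red red yellow Green red > "$workdir/color"

# Two ways to get a sorted list without duplicates.
piped=$(LC_ALL=C sort "$workdir/color" | uniq)
flagged=$(LC_ALL=C sort -u "$workdir/color")

# uniq -u on the unsorted list keeps only lines with no adjacent repeat.
unrepeated=$(uniq -u "$workdir/color")

if [ "$piped" = "$flagged" ]; then
    echo "sort | uniq and sort -u agree:"
    echo "$piped"
fi
echo "uniq -u on the unsorted list:"
echo "$unrepeated"

rm -rf "$workdir"
```

Note that “red” still appears in the uniq -u output, because only the adjacent pair was discarded.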
3. Counting Unique Entries
After finding a unique list of items, many times we’ll also want to know how often each item occurs. The -c option for uniq prefixes each output line with the number of times it was repeated:
% uniq -c color
1 Black
1 green
2 red
1 yellow
1 Green
1 red
Kind of useful, but it again hits the issue of ignoring non-adjacent duplicates. To avoid that, we could sort the list first, then pipe the output to uniq:
% sort color | uniq -c
1 Black
1 Green
1 green
3 red
1 yellow
Now we have a list of unique entries regardless of adjacency.
Taking it a step further, let’s say we want a count of unique items in the list. We can pipe the deduplicated list to wc -l, using either sort color | uniq or sort -u color to produce it. Either way, we get a count of our unique list items, which is 5 for our color file.
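As a rough sketch of that last step, the script below counts the unique entries both ways. It assumes a POSIX-style shell and a temporary copy of the color file; the variable names are illustrative, and tr is used only to strip the padding that some wc implementations print around the number.

```shell
# Recreate the color list in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' Black green red red yellow Green red > "$workdir/color"

# Count unique lines via sort | uniq and via sort -u.
count_piped=$(LC_ALL=C sort "$workdir/color" | uniq | wc -l | tr -d '[:space:]')
count_flagged=$(LC_ALL=C sort -u "$workdir/color" | wc -l | tr -d '[:space:]')

echo "sort | uniq | wc -l: $count_piped"
echo "sort -u | wc -l:     $count_flagged"

rm -rf "$workdir"
```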
4. Summary
In this short article, we described the differences between using sort | uniq and sort -u.
How to use the uniq command to process lists in Linux
Processing data can be a challenge. Let’s see how the uniq command works and how it supplements the sort command.
We have seen the sort command in our previous article, but sorting any file will often result in many duplicate lines adjacent to each other. It becomes too difficult to properly view those lines.
In this scenario, the uniq command helps by printing each group of duplicate lines only once. It keeps the first line of every run of adjacent identical lines and discards the repeats, which makes the output much easier to read.
Some uniq implementations document that input lines must not exceed 2048 bytes in length (including the newline character) and must not contain null characters. GNU uniq does not impose this line-length limit.
Syntax
uniq [OPTION]... [INPUT [OUTPUT]]
Examples
Below are a series of examples, beginning with no options. We’ll walk through several use cases. Some involve only uniq , and others rely on additional commands.
Without any option
Below is a file named file2, which contains some data. Note that this file is not sorted, and the duplicate lines are not adjacent to each other, so we should sort it before using the uniq command. In the example, I first run uniq on the original file, and it simply prints the input unchanged, much like cat. I then take the output of the sort command and pipe it to uniq, which shows how uniq behaves once the duplicates are adjacent:
$ cat file2
ChhatrapatiShahuMaharaj
Dr.B.R.Ambedkar
Budhha
Dr.B.R.Ambedkar
Budhha
Dr.B.R.Ambedkar
Budhha
$ uniq file2
ChhatrapatiShahuMaharaj
Dr.B.R.Ambedkar
Budhha
Dr.B.R.Ambedkar
Budhha
Dr.B.R.Ambedkar
Budhha
$ sort file2
Budhha
Budhha
Budhha
ChhatrapatiShahuMaharaj
Dr.B.R.Ambedkar
Dr.B.R.Ambedkar
Dr.B.R.Ambedkar
$ sort file2 | uniq
Budhha
ChhatrapatiShahuMaharaj
Dr.B.R.Ambedkar
With -c, --count option
In the next example, we’re using the -c option to count how many times each line occurs. The uniq command prints that count as a prefix to the line. The example below tells us that Budhha occurs three times, ChhatrapatiShahuMaharaj once, and Dr.B.R.Ambedkar three times:
$ sort file2 | uniq -c
3 Budhha
1 ChhatrapatiShahuMaharaj
3 Dr.B.R.Ambedkar
With -d, --repeated option
The -d option prints only lines that are repeated, one line per group. It discards non-duplicate lines. Therefore, the line ChhatrapatiShahuMaharaj has been discarded in the example below:
$ sort file2 | uniq -d
Budhha
Dr.B.R.Ambedkar
In the example below, I’ve added the -c option to cross-check that the -d option prints only the repeated lines:
$ sort file2 | uniq -cd
3 Budhha
3 Dr.B.R.Ambedkar
With -D, --all-repeated option
The -D option also discards the non-duplicate lines, but unlike -d it prints every copy of each repeated line instead of one line per group:
$ sort file2 | uniq -D
Budhha
Budhha
Budhha
Dr.B.R.Ambedkar
Dr.B.R.Ambedkar
Dr.B.R.Ambedkar
With -u, --unique option
The opposite of the option above, the -u option prints only unique lines, i.e., lines that are not repeated. Therefore, in the example below, it prints ChhatrapatiShahuMaharaj as the output:
$ sort file2 | uniq -u
ChhatrapatiShahuMaharaj
With -i, --ignore-case option
Using the -i option, we can make the comparison ignore differences in case. Below is the output of the uniq command with and without the -i option for comparison:
$ cat file3
aaaa
aaaa
AAAA
AAAA
bbbb
BBBB
$ uniq file3
aaaa
AAAA
bbbb
BBBB
$ uniq -i file3
aaaa
bbbb
With -f, --skip-fields=N option
Sometimes we need to skip some fields to filter duplicate lines. This is possible using the -f option. In the following example, we’re skipping the first field (first column) so that lines are compared from the second field onward. I’ve given both examples, with and without the -f option, for a better understanding of the option’s behavior:
$ cat file5
Amit aaaa
Ajit aaaa
Advi bbbb
Kaju bbbb
$ uniq file5
Amit aaaa
Ajit aaaa
Advi bbbb
Kaju bbbb
$ uniq -f 1 file5
Amit aaaa
Advi bbbb
With -s, --skip-chars=N option
Just like fields, we can skip characters as well by using the -s option. Please keep in mind that the uniq command prints only the first line of each group of duplicates and discards the others. Therefore 33aa and 55bb have been discarded. Here is the example:
$ cat file4
22aa
33aa
44bb
55bb
$ uniq file4
22aa
33aa
44bb
55bb
$ uniq -s 2 file4
22aa
44bb
With -w, --check-chars=N option
Just as we can skip characters, we can also limit the comparison to the first N characters of each line using the -w option, as in this example:
$ cat file6
aa12
aa34
bb56
bb78
$ uniq file6
aa12
aa34
bb56
bb78
$ uniq -w 2 file6
aa12
bb56
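The -s and -w options can also be combined so that only a window in the middle of each line is compared. The following is a sketch under a few assumptions: it relies on GNU uniq (the -w option is a GNU extension), and the sample lines and variable names are made up for illustration and do not come from the examples above.

```shell
# Hypothetical sample data: two-digit prefix, two-letter key, one-digit suffix.
workdir=$(mktemp -d)
printf '%s\n' 22aa1 33aa2 44bb3 55bb4 > "$workdir/sample"

# Skipping two characters still compares the rest of the line ("aa1" vs "aa2"),
# so nothing is treated as a duplicate.
skip_only=$(uniq -s 2 "$workdir/sample")

# Skipping two characters and then checking only two compares "aa" vs "aa",
# so each key is printed once.
skip_and_limit=$(uniq -s 2 -w 2 "$workdir/sample")

echo "uniq -s 2:"
echo "$skip_only"
echo "uniq -s 2 -w 2:"
echo "$skip_and_limit"

rm -rf "$workdir"
```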
With --version option
Use the --version option to check the version of the uniq command.
$ uniq --version
uniq (GNU coreutils) 8.4
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Richard M. Stallman and David MacKenzie.
Wrap up
uniq does not detect repeated lines unless they are adjacent. The uniq command can count and print the number of repeated lines. Just like duplicate lines, we can filter unique lines (non-duplicate lines) as well and can also ignore case sensitivity. We can skip fields and characters before comparing duplicate lines and also consider characters for filtering lines.
Uniq and Sort command in Linux
The uniq command is used to filter out repeated adjacent lines in a file.
- Using -c option: It prefixes each line with the number of times it is repeated.
- Using -d option: It prints only the repeated lines, one per group.
- Using -D option: It also prints only duplicate lines, but prints every copy rather than one per group.
- Using -u option: It prints only the unique lines.
- Using -i option: It makes the comparison case-insensitive.
The uniq command is often used with the sort command because it compares only adjacent lines.
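The options listed above can be tried side by side on a small list. This is a sketch only: the fruit names are made-up sample data, the variable names are illustrative, and -D and -i assume an implementation such as GNU uniq that supports them. LC_ALL=C is used to pin the sort order.

```shell
# Hypothetical sample list, sorted first so that duplicates are adjacent.
sorted=$(printf '%s\n' apple cherry Apple banana apple cherry | LC_ALL=C sort)

# -c: occurrence counts (awk trims the padding uniq puts before the number).
counts=$(printf '%s\n' "$sorted" | uniq -c | awk '{ print $1, $2 }')
# -d: one line per repeated group.
repeated=$(printf '%s\n' "$sorted" | uniq -d)
# -D: every copy of the repeated lines.
all_repeated=$(printf '%s\n' "$sorted" | uniq -D)
# -u: only the lines that are never repeated.
singles=$(printf '%s\n' "$sorted" | uniq -u)
# -i: "Apple" and "apple" are treated as the same line.
ignore_case=$(printf '%s\n' "$sorted" | uniq -i)

printf '%s\n' "-c:" "$counts" "-d:" "$repeated" "-D:" "$all_repeated" \
    "-u:" "$singles" "-i:" "$ignore_case"
```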
Sort command in Linux
The sort command sorts the file content in ascending order by default, or in descending order on request.
Sorting In Reverse Order
Use the -r option to reverse the sort order.
To sort specific column
If a file has more than one column, a column number is passed to the -k option to sort on that specific column.
To sort a file with numeric data in reverse order
Use the options -nr. If you don’t want to sort in reverse order, remove the option r and execute the command.
To check if the file is already sorted or not
Use the -c option.
To sort and remove duplicates
Use the -u option.
To know the version of the sort command, run sort --version.
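The sort options in this section can be exercised together in one short script. It is a sketch with made-up sample data in a temporary directory; the file names, contents, and variable names are all hypothetical, and LC_ALL=C is set per command only to make the ordering predictable.

```shell
# Hypothetical sample files.
workdir=$(mktemp -d)
printf '%s\n' 10 9 100 9 > "$workdir/numbers"
printf '%s\n' 'x 3' 'y 1' 'z 2' > "$workdir/table"

# -r: reverse order.
reversed=$(LC_ALL=C sort -r "$workdir/table")
# -k 2: sort on the second column.
by_column=$(LC_ALL=C sort -k 2 "$workdir/table")
# -nr: numeric sort, largest first; dropping r gives ascending order.
numeric_desc=$(LC_ALL=C sort -nr "$workdir/numbers")
# -u (here with -n): sort and remove duplicates.
deduped=$(LC_ALL=C sort -n -u "$workdir/numbers")
# -c: exit status reports whether the file is already sorted.
if LC_ALL=C sort -c "$workdir/numbers" 2>/dev/null; then
    already_sorted=yes
else
    already_sorted=no
fi

printf '%s\n' "-r:" "$reversed" "-k 2:" "$by_column" "-nr:" "$numeric_desc" \
    "-n -u:" "$deduped" "already sorted: $already_sorted"

rm -rf "$workdir"
```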