Linux uniq and sort

Contents

What is the difference between "sort -u" and "sort | uniq"?
Difference Between sort | uniq and sort -u
1. Overview
2. Basic Usage
3. Counting Unique Entries
4. Summary
How to use the uniq command to process lists in Linux
Syntax
Examples
Without any option
With -c, --count option
With -d, --repeated option
With -D, --all-repeated option
With -u, --unique option
With -i, --ignore-case option
With -f, --skip-fields=N option
With -s, --skip-chars=N option
With -w, --check-chars=N option
With --version option
Wrap up
Uniq and Sort command in Linux
Sort command in Linux

What is the difference between "sort -u" and "sort | uniq"?

Everywhere I see someone needing to get a sorted, unique list, they always pipe to sort | uniq . I’ve never seen any examples where someone uses sort -u instead. Why not? What’s the difference, and why is it better to use uniq than the unique flag to sort?


sort | uniq existed before sort -u, and is compatible with a wider range of systems, although almost all modern systems do support -u (it’s in POSIX). It’s mostly a throwback to the days when sort -u didn’t exist, and people don’t tend to change their methods if the way that they know continues to work; just look at ifconfig vs. ip adoption.

The two were likely merged because removing duplicates within a file requires sorting (at least, in the standard case), and is an extremely common use case of sort. sort -u is also faster internally, because it can do both operations at the same time and doesn’t require IPC (inter-process communication) between sort and uniq. Especially if the file is big, sort -u will likely use fewer intermediate files to sort the data.

On my system I consistently get results like this:

$ dd if=/dev/urandom of=/dev/shm/file bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 8.95208 s, 11.7 MB/s
$ time sort -u /dev/shm/file >/dev/null
real 0m0.500s
user 0m0.767s
sys 0m0.167s
$ time sort /dev/shm/file | uniq >/dev/null
real 0m0.772s
user 0m1.137s
sys 0m0.273s

sort -u also doesn’t mask the return code of sort, which may be important (in modern shells there are ways to get it, for example bash’s $PIPESTATUS array, but this wasn’t always true).
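The point about the return code can be seen with a small bash sketch. It assumes bash (PIPESTATUS is not available in plain sh) and uses a hypothetical path, /nonexistent/input, chosen only so that sort fails:

```shell
#!/usr/bin/env bash
# Assumption: /nonexistent/input does not exist, so sort exits with an error.
missing=/nonexistent/input

# Pipeline: $? reports the status of uniq, the last command, so sort's failure is hidden.
sort "$missing" 2>/dev/null | uniq >/dev/null
pipeline_status=$?

# Same pipeline, but read sort's own status from bash's PIPESTATUS array.
sort "$missing" 2>/dev/null | uniq >/dev/null
sort_status_in_pipe=${PIPESTATUS[0]}

# Single command: sort's failure is reported directly.
sort_u_status=0
sort -u "$missing" >/dev/null 2>&1 || sort_u_status=$?

echo "sort | uniq    : $pipeline_status"
echo "PIPESTATUS[0]  : $sort_status_in_pipe"
echo "sort -u        : $sort_u_status"
```

Without pipefail set, the first status is that of uniq, while the other two reflect the failure of sort.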


Difference Between sort | uniq and sort -u


1. Overview

In this tutorial, we’ll do a quick comparison of the Linux commands sort | uniq and sort -u. Both use sort to remove duplicate entries from a list, but they operate in slightly different manners.

Note that all commands below are platform-independent.

2. Basic Usage

Let’s start with a list of colors in a file named color:

% cat color
Black
green
red
red
yellow
Green
red

If we want to remove duplicates, uniq would work in some cases. Checking the man page for uniq:

Repeated lines in the input will not be detected if they are not adjacent, so it may be necessary to sort the files first.

For our list, the result would not be a list of unique entries because our list has duplicated, non-adjacent entries of “red”:

% uniq color
Black
green
red
yellow
Green
red

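There are a couple of things we could try. First, the -u argument to uniq: it prints only the lines that are not immediately repeated, so the adjacent pair of “red” disappears completely while the later, non-adjacent “red” survives. That happens to leave no duplicates in this list, but it is not a general way to remove them: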

% uniq -u color
Black
green
yellow
Green
red
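As a rough illustration of why uniq -u should not be relied on for deduplication, here is a sketch with a hypothetical input in which "red" appears three times but never on adjacent lines. LC_ALL=C is set only to make the sort order predictable:

```shell
# Hypothetical sample: "red" occurs three times, never on neighbouring lines.
input=$(printf 'red\nblue\nred\ngreen\nred\n')

# uniq -u only drops lines that repeat on adjacent lines, so nothing is removed here.
uniq_u_out=$(printf '%s\n' "$input" | uniq -u)

# Sorting first makes the duplicates adjacent, so plain uniq collapses them.
sorted_out=$(printf '%s\n' "$input" | LC_ALL=C sort | uniq)

printf 'uniq -u:\n%s\n\nsort | uniq:\n%s\n' "$uniq_u_out" "$sorted_out"
```

With this input, uniq -u returns all five lines unchanged, while sort | uniq returns three.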

Alternatively, taking the man page suggestion, sorting the list before calling uniq will remove all of the duplicates.

% sort color
Black
Green
green
red
red
red
yellow

Piping this to uniq yields:

% sort color | uniq
Black
Green
green
red
yellow

Now, checking the man page for sort, we can see that the -u flag will provide the same output:

% sort -u color
Black
Green
green
red
yellow
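One way to check that the two forms agree is to write both results to files and compare them. This is a minimal sketch that recreates the color file in a temporary directory; the file names piped and flag are made up for the illustration, and LC_ALL=C pins the collation order:

```shell
# Recreate the sample list in a scratch directory (names are illustrative).
workdir=$(mktemp -d)
printf '%s\n' Black green red red yellow Green red > "$workdir/color"

# Produce the unique list both ways.
LC_ALL=C sort "$workdir/color" | uniq > "$workdir/piped"
LC_ALL=C sort -u "$workdir/color" > "$workdir/flag"

# cmp -s is silent and succeeds only if the files are byte-for-byte identical.
if cmp -s "$workdir/piped" "$workdir/flag"; then same=yes; else same=no; fi
unique_lines=$(cat "$workdir/flag")

echo "identical output: $same"
rm -rf "$workdir"
```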

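So, generally speaking, both sort | uniq and sort -u do the same thing. But there are some differences.

For example, sort has other options, such as sorting on a delimited field with -t and -k, and we can use these whether we add -u or pipe to uniq. The results can differ, though: with a sort key, sort -u treats lines whose keys compare equal as duplicates, whereas uniq compares whole lines.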
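The effect of sort keys on uniqueness can be sketched with a few hypothetical comma-separated records. Which of the equal-key lines sort -u keeps may vary between implementations, so the sketch only counts the surviving lines:

```shell
# Hypothetical data: comma-separated "name,color" records.
records=$(printf 'apple,red\napple,green\npear,green\n')

# With a key, sort -u keeps one line per distinct key (here: field 1).
by_key=$(printf '%s\n' "$records" | LC_ALL=C sort -t, -k1,1 -u)
by_key_count=$(printf '%s\n' "$by_key" | grep -c '^')

# uniq compares whole lines, so both "apple" records survive.
whole_line=$(printf '%s\n' "$records" | LC_ALL=C sort -t, -k1,1 | uniq)
whole_line_count=$(printf '%s\n' "$whole_line" | grep -c '^')

echo "sort -k1,1 -u     : $by_key_count lines"
echo "sort -k1,1 | uniq : $whole_line_count lines"
```

So the two forms are interchangeable only when the whole line is the sort key.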

3. Counting Unique Entries

After finding a unique list of items, many times we’ll also want to know the number of unique items. The -c option for uniq prefixes each output line with the number of adjacent occurrences of that line:

% uniq -c color
      1 Black
      1 green
      2 red
      1 yellow
      1 Green
      1 red

Kind of useful, but it again hits the issue of ignoring non-adjacent duplicates. To avoid that, we could sort the list first, then pipe the output to uniq:

% sort color | uniq -c
      1 Black
      1 Green
      1 green
      3 red
      1 yellow

Now we have a list of unique entries regardless of adjacency.

Taking it a step further, let’s say we want a count of unique items in the list. We can pipe to wc -l:

% sort color | uniq | wc -l
5

Or with sort -u instead of uniq:

% sort -u color | wc -l
5

And we get a count of our unique list items.
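The per-line counts from uniq -c can also be ranked by piping them through a second, numeric sort. The following is a sketch of that common idiom using the same color list; the variable names are only for illustration:

```shell
# The sample color list from above.
colors=$(printf '%s\n' Black green red red yellow Green red)

# Count each distinct line, then sort the counts numerically, largest first.
ranked=$(printf '%s\n' "$colors" | LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn)

# The first line of the ranking is the most frequent entry; awk strips the padding.
top=$(printf '%s\n' "$ranked" | head -n 1 | awk '{print $2, $1}')

echo "$ranked"
echo "most frequent: $top"
```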

4. Summary

In this short article, we described the differences between using sort | uniq and sort -u.


How to use the uniq command to process lists in Linux

Processing data can be a challenge. Let’s see how the uniq command works and how it supplements the sort command.

We have seen the sort command in our previous article, but sorting any file will often result in many duplicate lines adjacent to each other. It becomes too difficult to properly view those lines.


In this scenario, the uniq command helps by printing each run of adjacent duplicate lines only once. It keeps the first line of the run and discards the repeats, which makes the output much easier to read.


On some implementations, the lines in the input file for the uniq command can neither exceed 2048 bytes in length (including any newline characters) nor contain null characters. GNU uniq does not impose a fixed line-length limit.

Syntax

uniq [OPTION]... [INPUT [OUTPUT]]

Examples

Below are a series of examples, beginning with no options. We’ll walk through several use cases. Some involve only uniq , and others rely on additional commands.

Without any option

Below is a file named file2, which contains some data. The file is not sorted, and its duplicate lines are not adjacent to each other, so running uniq on it directly prints the file unchanged, much like cat. Sorting the file first and piping the result to uniq removes the duplicates, which helps us understand the behavior of the uniq command:

$ cat file2
ChhatrapatiShahuMaharaj
Dr.B.R.Ambedkar
Budhha
Dr.B.R.Ambedkar
Budhha
Dr.B.R.Ambedkar
Budhha
$ uniq file2
ChhatrapatiShahuMaharaj
Dr.B.R.Ambedkar
Budhha
Dr.B.R.Ambedkar
Budhha
Dr.B.R.Ambedkar
Budhha
$ sort file2
Budhha
Budhha
Budhha
ChhatrapatiShahuMaharaj
Dr.B.R.Ambedkar
Dr.B.R.Ambedkar
Dr.B.R.Ambedkar
$ sort file2 | uniq
Budhha
ChhatrapatiShahuMaharaj
Dr.B.R.Ambedkar

With -c, --count option

In the next example, we’re using the -c option to count the repeated lines. The uniq command prints that count as a prefix to each line. The example below shows that Budhha appears three times, ChhatrapatiShahuMaharaj once, and Dr.B.R.Ambedkar three times:

$ sort file2 | uniq -c
      3 Budhha
      1 ChhatrapatiShahuMaharaj
      3 Dr.B.R.Ambedkar

With -d, --repeated option

The -d option prints only lines that are repeated. It discards non-duplicate lines. Therefore, line ChhatrapatiShahuMaharaj has been discarded in the below example:

$ sort file2 | uniq -d
Budhha
Dr.B.R.Ambedkar

In the below example, I’ve used the -c option to cross-check whether the -d option is only printing the repeated lines or not:

$ sort file2 | uniq -cd
      3 Budhha
      3 Dr.B.R.Ambedkar

With -D, --all-repeated option

The -D option prints every line that belongs to a group of duplicates, rather than one line per group as -d does, and discards the non-duplicate lines:

$ sort file2 | uniq -D
Budhha
Budhha
Budhha
Dr.B.R.Ambedkar
Dr.B.R.Ambedkar
Dr.B.R.Ambedkar

With -u, --unique option

Opposite of the above option, the -u option prints unique lines i.e., non-duplicate lines. Therefore, in the below example, it prints ChhatrapatiShahuMaharaj as an output:

$ sort file2 | uniq -u
ChhatrapatiShahuMaharaj

With -i, --ignore-case option

Using the -i option, we can ignore case when comparing lines. Below I’ve given the output of the uniq command with and without the -i option to compare:

$ cat file3
aaaa
aaaa
AAAA
AAAA
bbbb
BBBB
$ uniq file3
aaaa
AAAA
bbbb
BBBB
$ uniq -i file3
aaaa
bbbb
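Because uniq -i still compares only adjacent lines, unsorted mixed-case input usually needs a case-folding sort first. The sketch below assumes a hypothetical unsorted list and uses sort -f for the folding; it counts lines instead of printing them, since which spelling survives depends on the sort order:

```shell
# Hypothetical unsorted input with mixed-case duplicates that are not adjacent.
words=$(printf 'aaaa\nbbbb\nAAAA\nBBBB\naaaa\n')

# uniq -i alone: no two neighbouring lines match, so every line is kept.
alone=$(printf '%s\n' "$words" | uniq -i | grep -c '^')

# sort -f folds case while sorting, which puts aaaa/AAAA and bbbb/BBBB next to each other.
folded=$(printf '%s\n' "$words" | LC_ALL=C sort -f | uniq -i | grep -c '^')

echo "uniq -i alone      : $alone lines"
echo "sort -f | uniq -i  : $folded lines"
```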

With -f, --skip-fields=N option

Sometimes we need to skip some fields to filter duplicate lines. This is possible using the -f option. In the following example, we’re skipping the first field (first column) to compare the duplicate lines from the second field. I’ve given both examples, with and without the -f option, for a better understanding of the option’s behavior:

$ cat file5
Amit aaaa
Ajit aaaa
Advi bbbb
Kaju bbbb
$ uniq file5
Amit aaaa
Ajit aaaa
Advi bbbb
Kaju bbbb
$ uniq -f 1 file5
Amit aaaa
Advi bbbb

With -s, --skip-chars=N option

Just as with fields, we can skip a number of leading characters by using the -s option. In the example below, the first two characters are ignored, so 22aa and 33aa compare as equal, as do 44bb and 55bb. Keep in mind that the uniq command prints only the first line of each group of duplicates and discards the rest. Therefore 33aa and 55bb have been discarded:

$ cat file4
22aa
33aa
44bb
55bb
$ uniq file4
22aa
33aa
44bb
55bb
$ uniq -s 2 file4
22aa
44bb

With -w, --check-chars=N option

The -w option does the opposite of skipping characters: it compares only the first N characters of each line. In the example below, only the first two characters are compared:

$ cat file6
aa12
aa34
bb56
bb78
$ uniq file6
aa12
aa34
bb56
bb78
$ uniq -w 2 file6
aa12
bb56

With --version option

Use the --version option to check the version of the uniq command.

$ uniq --version
uniq (GNU coreutils) 8.4
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Richard M. Stallman and David MacKenzie.

Wrap up

uniq does not detect repeated lines unless they are adjacent. The uniq command can count and print the number of repeated lines. Just like duplicate lines, we can filter unique lines (non-duplicate lines) as well and can also ignore case sensitivity. We can skip fields and characters before comparing duplicate lines and also consider characters for filtering lines.



Uniq and Sort command in Linux

The uniq command collapses adjacent repeated lines in a file into a single line, which is why its input is usually sorted first.

Using -c option: It prefixes each line with the number of times it is repeated.
Using -d option: It prints only the repeated lines, one per group.
Using -c and -d together: It prints each repeated line once, prefixed with its count.
Using -D option: It also prints only duplicate lines, but prints every copy rather than one per group.
Using -u option: It prints only the unique lines.
Using -i option: It is used to make the comparison case-insensitive.