Linux group by count
awk is very powerful when it comes to file formatting. In this article, we will discuss some wonderful grouping features of awk. awk can group data based on a single column or field, or on a set of columns. It uses its powerful associative arrays for grouping. If you are new to awk, this article will be easier to follow if you first go over the article on how to parse a simple CSV file using awk.
Let us take a sample CSV file with the below contents. The file is kind of an expense report containing items and their prices. As seen, some expense items have multiple entries.
$ cat file
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
1. To find the total of all the numbers in the second column, i.e., the sum of all the prices:

$ awk -F"," '{x+=$2}END{print x}' file
3000

The delimiter (-F) used is a comma, since it is a comma-separated file. x+=$2 stands for x=x+$2. When a line is parsed, the second column ($2), which is the price, is added to the variable x. At the end, the variable x contains the sum. This example is the same as the one discussed in the awk example of finding the sum of all numbers in a file.
If your input file is a plain text file, with the only difference being that it has no commas, all you need to make is one change: remove the -F"," part from the above command. This is because the default delimiter in awk is whitespace.
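For instance, assuming a hypothetical space-separated file named file.txt with the same contents (file.txt is not part of the examples above, it is only for illustration), the sum could be computed like this:

$ cat file.txt
Item1 200
Item2 500
Item3 900
Item2 800
Item1 600

$ awk '{x+=$2}END{print x}' file.txt
3000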
2. To find the total sum of a particular group entry alone, i.e., only for Item1:

$ awk -F, '$1=="Item1"{x+=$2}END{print x}' file
800

This gives us the total sum of all the entries pertaining to "Item1". In the earlier example, no condition was specified since we wanted awk to work on every line or record. In this case, we want awk to work only on the records whose first column ($1) is equal to Item1.
$ VAR="Item1" $ awk -F, -v inp=$VAR '$1==inpEND' file 800
-v is used to pass the shell variable to awk, and the rest is the same as the previous example.
4. To find the unique values of the first column:
$ awk -F, '{a[$1];}END{for (i in a)print i;}' file
Item1
Item2
Item3
Arrays in awk are associative, which is a very powerful feature. Associative arrays have an index and a corresponding value. Example: a["Jan"]=30 means that in the array a, "Jan" is an index with the value 30. In our case here, we use only the index, without values. So the command a[$1] works like this: when the first record is processed, an index value "Item1" is stored in the array a. During the second record, a new index "Item2" is added; during the third, "Item3". During the 4th record, since the "Item1" index is already there, no new index is added, and the same continues for the remaining records.
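To see associative indexing in isolation, here is a tiny self-contained check using the month example from above (the values are only illustrative):

$ awk 'BEGIN{a["Jan"]=30; a["Feb"]=28; print a["Jan"], a["Feb"]}'
30 28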
Now, once the file is processed completely, control goes to the END block, where we print all the index values. The for loop in awk comes in 2 variants: the C-style for loop, and the one used for associative arrays.
for (i in a): this means "for every index in the array a". The variable i holds the index value; in place of i, any variable name can be used. Since there are 3 elements in the array, the loop runs 3 times, each time holding the value of one index in i. And by printing i, we get the index values printed.
Note: The order of the output of the above command may vary from system to system. Associative arrays do not store the indexes in sequence, and hence the order of the output need not be the order in which the entries were inserted.
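If a stable, sorted order is needed, one simple option (a sketch, relying only on the standard sort utility) is to pipe the output through sort:

$ awk -F, '{a[$1];}END{for (i in a)print i;}' file | sort
Item1
Item2
Item3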
5. To find the sum of the entries in each individual group, i.e., to sum all records pertaining to Item1 alone, Item2 alone, and so on:
$ awk -F, '{a[$1]+=$2;}END{for (i in a)print i", "a[i];}' file
Item1, 800
Item2, 1300
Item3, 900
a[$1]+=$2: this can be written as a[$1]=a[$1]+$2. It works like this: when the first record is processed, a["Item1"] is assigned 200 (a["Item1"]=200). During the second "Item1" record, a["Item1"]=800 (200+600), and so on. In this way, every index item in the array is stored with the value associated with it, which is the sum of the group.
$ awk -F"," 'END' file Item1,200 Item2,500 Item3,900 Item2,800 Item1,600 Total,3000
This is the same as the first example, except that along with adding up the values, every record is also printed, and at the end, the "Total" record is printed as well.
7. To print the maximum or the biggest record of every group:
$ awk -F, '{if (a[$1] < $2)a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,600
Item2,800
Item3,900
Before storing the value ($2) in the array, the current second-column value is compared with the existing value, and it is stored only if the value in the current record is bigger. Finally, the array contains only the maximum value for every group. In the same way, just by changing the "less than" (<) symbol to "greater than" (>), we can find the smallest element in the group (with one caveat about the first record of each group; see the sketch after the if-else syntax below).
For reference, the general if-else syntax in awk is:

if (condition)
{
    commands
}
else
{
    commands
}
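The caveat for the minimum: an uninitialized array entry compares as 0, so with a plain "greater than" test the first record of a group would never be stored. A sketch that guards against this using the standard `in` membership test (this guard is an addition, not part of the original command):

$ awk -F, '{if (!($1 in a) || a[$1] > $2)a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,200
Item2,500
Item3,900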
8. To find the count of entries against every group:
$ awk -F, '{a[$1]++;}END{for (i in a)print i, a[i];}' file
Item1 2
Item2 2
Item3 1
a[$1]++: this can be put as a[$1]=a[$1]+1. When the first "Item1" record is parsed, a["Item1"]=1, and every time an "Item1" record is encountered, this count is incremented; the same follows for the other entries as well. This code simply increments the count by 1 for the respective index on encountering a record. And finally, on printing the array, we get the items and their respective counts.
9. To print only the first record of every group:

$ awk -F, '!a[$1]++' file
Item1,200
Item2,500
Item3,900
This one is a little tricky. In this awk command, there is only a condition and no action statement. As a result, if the condition is true, the current record gets printed by default.
!a[$1]++: when the first record of a group is encountered, a[$1] is still 0 since ++ is a post-fix operator, and not (!) of 0 is 1, which is true, and hence the first record gets printed. Now, when the second record of "Item1" is parsed, a[$1] is 1 (and becomes 2 after the expression, since it is a post-fix). Not (!) of 1 is 0, which is false, and the record does not get printed. In this way, the first record of every group gets printed.
Simply by removing the '!' operator, the above command will print every record other than the first record of each group, as shown below.
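A quick check against the same sample file (the output is just the remaining, non-first records of each group):

$ awk -F, 'a[$1]++' file
Item2,800
Item1,600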
10. To join or concatenate the values of all group items. Join the values of the second column with a colon separator:
$ awk -F, '{if (a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,200:600
Item2,500:800
Item3,900
This if condition is pretty simple: If there is some value in a[$1], then append or concatenate the current value using a colon delimiter, else just assign it to a[$1] since this is the first value.
To make the above if block clear, let me put it this way: "if (a[$1])" means "if a[$1] has some value".
The same result can also be achieved using the ternary operator:

$ awk -F, '{a[$1]=a[$1]?a[$1]":"$2:$2;}END{for (i in a)print i, a[i];}' OFS=, file
Item1,200:600
Item2,500:800
Item3,900
The ternary operator is a short form of the if-else condition. An example of the ternary operator: x = x>10 ? "Yes" : "No" means that if x is greater than 10, "Yes" is assigned to x, else "No" is assigned.
In the same way, a[$1]=a[$1]?a[$1]":"$2:$2 means: if a[$1] has some value, assign a[$1]":"$2 to a[$1], else simply assign $2 to a[$1].
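A minimal illustration of the ternary form, using the example values from above (the value 15 is only illustrative):

$ awk 'BEGIN{x=15; x = x>10 ? "Yes" : "No"; print x}'
Yes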
Concatenate variables in awk:
One more thing to notice is the way string concatenation is done in awk. To concatenate 2 variables in awk, use a space in-between.
Examples:
z=x y      # to concatenate x and y
z=x":"y    # to concatenate x and y with a colon separator
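A quick, self-contained check of both forms (the variable values are only for illustration):

$ awk 'BEGIN{x="Item1"; y="200"; print x y; print x":"y}'
Item1200
Item1:200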
Bash script to group and count by a specific field
Sorry if I am opening a new question, but it is not related to a previous one, since now I need a bash command to analyze the output. I have the output from a query stored in a file like this:
3277654321 333011123456789
3277654321 333015123456789
3277654321 333103123456789
3277654321 333201123456789
3291234567 333991123456789
3291234567 333991123456789
3291234567 333011123456789
What I need is a bash command that counts the field1 entries and, within each, the field2 entries having the same first 5 digits, and reports an output like this:
3277654321=4;33301=2;33310=1;33320=1
3291234567=3;33399=2;33301=1
This would not be too challenging to do with awk, and undoubtedly someone will eventually provide the code-writing service, even though SO is not intended to be a "please write my code for me" service. But it seems to me that it would be better to generate the report you want directly from your database, using a query which actually generates the final result rather than an intermediate list.
Generating a report like that direct from the database is all kinds of messy. As long as the data is sorted (which the database can do), then post-processing is not unreasonable. And the database can and should do a bit more work for you: it could reasonably generate the first field, the first 5 characters of the second field, and a count of the number of entries: SELECT field1, SUBSTR(field2, 1, 5) AS field2, COUNT(*) AS number FROM TheTable GROUP BY field1, field2 ORDER BY field1, field2 . You then have less data being transferred over the wire, which helps a lot if the database is remote.
@JonathanLeffler: The following seems to work fine in sqlite3; for mysql, you'd need to change the comma to the word SEPARATOR in group_concat: select field1||"="||SUM(count2)||";"||group_concat(field2||"="||count2,";") as fields FROM (select field1, SUBSTR(field2,1,5) AS field2, COUNT(*) as count2 from tmp GROUP BY field1, field2 ORDER BY field1, field2) GROUP BY field1 ORDER BY field1; . It's not that messy, imho.
@rici: Yup, as you show, with the non-standard GROUP_CONCAT aggregate (and with ORDER BY available in sub-queries, also non-standard), it is not too bad. (Non-standard, as in ‘not part of ISO standard SQL’, AFAIK.)
Using awk on the original data
What you're seeking is a control-break report. For once, the Wikipedia entry isn't much help on the subject. The sample data is shown sorted; this solution therefore assumes that the data is sorted (it is trivial to add a sort operation before the awk script if it isn't; OTOH, since the data comes from a database, the DBMS could perfectly well sort the data).
For testing purposes, I created a file awk.script containing:
{
    f1 = $1
    f2 = substr($2, 1, 5)
    if (oldf1 != f1)
    {
        if (oldf1 != 0)
        {
            summary = summary ";" oldf2 "=" f2_count
            printf("%s=%d%s\n", oldf1, f1_count, summary)
        }
        oldf1 = f1
        f1_count = 0
        oldf2 = f2
        f2_count = 0
        summary = ""
    }
    else if (oldf2 != f2)
    {
        summary = summary ";" oldf2 "=" f2_count
        oldf2 = f2
        f2_count = 0
    }
    f1_count++
    f2_count++
}
END {
    if (oldf1 != 0)
    {
        summary = summary ";" oldf2 "=" f2_count
        printf("%s=%d%s\n", oldf1, f1_count, summary)
    }
}
And put the seven lines of sample data into a file called data, and then ran:
$ awk -f awk.script data
3277654321=4;33301=2;33310=1;33320=1
3291234567=3;33399=2;33301=1
$
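If the data were not already sorted, a sort step could be added in front of the script, as mentioned above (a sketch; the default lexicographic sort is sufficient here because both fields are fixed-width digit strings):

$ sort data | awk -f awk.script
3277654321=4;33301=2;33310=1;33320=1
3291234567=3;33399=2;33301=1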
Make the DBMS do more work
At the moment, the data is similar to the output from a query such as:
SELECT Field1, Field2 FROM SomeTable ORDER BY Field1, Field2
The output could be made better for your report by having the DBMS generate the first field, the first 5 characters of the second field, and a count of the number of entries:
SELECT field1, SUBSTR(field2, 1, 5) AS field2, COUNT(*) AS number FROM SomeTable GROUP BY field1, field2 ORDER BY field1, field2
You then have less data being transferred over the wire, which helps a lot if the database is remote. You also have a simpler report. The data file becomes (data2):
3277654321 33301 2
3277654321 33310 1
3277654321 33320 1
3291234567 33399 2
3291234567 33301 1
The awk script becomes (awk.script2):
{
    if (oldf1 != $1)
    {
        if (oldf1 != 0)
            printf("%s=%d%s\n", oldf1, f1_count, summary)
        oldf1 = $1
        f1_count = 0
        summary = ""
    }
    summary = summary ";" $2 "=" $3
    f1_count += $3
}
END {
    if (oldf1 != 0)
        printf("%s=%d%s\n", oldf1, f1_count, summary)
}
$ awk -f awk.script2 data2
3277654321=4;33301=2;33310=1;33320=1
3291234567=3;33399=2;33301=1
$
Make the DBMS do even more work
Depending on your DBMS and whether it supports GROUP_CONCAT and an ORDER BY clause in sub-queries, you can push even more of the work into SQL; as rici noted, "It's not that messy, IMHO".
The following seems to work fine in SQLite3; for MySQL, you’d need to change the comma to the word SEPARATOR in GROUP_CONCAT:
SELECT field1 || "=" || SUM(count2) || ";" || group_concat(field2 || "=" || count2, ";") AS fields FROM (SELECT field1, SUBSTR(field2, 1, 5) AS field2, COUNT(*) AS count2 FROM tmp GROUP BY field1, field2 ORDER BY field1, field2 ) GROUP BY field1 ORDER BY field1
Note that both GROUP_CONCAT and ORDER BY clauses in sub-queries are not defined by ISO standard SQL, as far as I know, so not all DBMS will support these features. (The ORDER BY feature was omitted for a reason, but the reasoning didn't include considerations of 'orthogonality'.)
If the DBMS produces the data in the format you need, there's no need for an awk script to post-process it. What's best will ultimately depend on what else you're doing. Generally, use the DBMS to do calculations where it makes sense. IMO, don't use the DBMS for all formatting (I expect report generation with pagination etc. to be done outside the DBMS proper), but if it can be persuaded to generate the data you need, by all means make it do the work.