Parsing CSV files in Linux

How can I parse CSV files on the Linux command line? [closed]


I need a tool to extract fields from columns 2, 5 and 6 of all rows. It should be able to handle the CSV file format (https://www.rfc-editor.org/rfc/rfc4180), which means quoting fields and escaping inner quotes as appropriate. So, for an example row with 3 fields:

field1,"field, number ""2"", has inner quotes and a comma",field3

the second field should be extracted as:

field, number "2", has inner quotes and a comma

I appreciate that there are numerous solutions to this problem (Perl, Awk, etc.), but I would like a native bash command-line tool that does not require me to invoke some other scripting environment or write any additional code(!).

I don’t want to write any scripts and want to use something prepackaged for the job 🙂 (in exactly the same way that I don’t write a sort or grep tool every time I want to use one). I realise that the functionality I’m asking for is slightly less generic than the average shell tool, but it would be immensely useful nonetheless — hence the question.

I would expect this kind of operation to be extremely slow in Bash. AWK or cut are the right tools for this job.

If you want a tool with this functionality without writing a script, you’re going to have to write that tool yourself. That said, what you want is definitely possible using bash tools.

‘cut’ doesn’t quite cut it (ha ha) because it doesn’t handle quoted strings containing delimiters, which are common in CSV files (e.g. exports from spreadsheets).

12 Answers

csvtool is really good. Available in Debian / Ubuntu ( apt-get install csvtool ). Example:

csvtool namedcol Account,Cost input.csv > output.csv 
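For the positional columns the question asks about (2, 5 and 6), the col subcommand should do the job (a sketch, assuming csvtool's usual column-spec syntax; check csvtool --help to confirm):

csvtool col 2,5,6 input.csv > output.csv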

I also gave it a try. It looked quite promising in the beginning, but if your CSVs are larger than around 100MB it dies with a stack overflow.

Just tried to use csvtool, and 5 years later (it’s 2017 now) it still doesn’t have streaming support and results in a stack overflow for a 110MB csv file.

My FOSS CSV stream editor CSVfix does exactly what you want. There is a binary installer for Windows, and a compilable version (via a makefile) for UNIX/Linux.

BTW — thanks for not answering a) “why not write one yourself?” b) “use awk/perl”. If I had wanted to use either of those 2 options I wouldn’t have bothered asking the question in the first place.

@Joel: The problem is the way you worded your question. You asked for a “bash command” when you should have said “standalone program”. Your request has nothing at all to do with bash.

csvfix does exactly the right thing. It’s a powerful csv stream editor, runs on windows & linux, and does more than what I hoped for!


It took me a while to find the right command, but eventually I used the order command to accomplish this: csvfix order -f 2,5,6 filename

As suggested by @Jonathan in a comment, there is a module for python that provides the command line tool csvfilter. It works like cut, but properly handles CSV column quoting:

csvfilter -f 1,3,5 in.csv > out.csv 

If you have python (and you should), you can install it simply like this:

pip install csvfilter

I found csvkit to be useful. It is based on the python csv module and has quite a lot of options for parsing complex csv files.

Although it seems to be a bit slow. I am getting 4MB/s (with 100% cpu) when extracting one field from a 7GB csv with 5 columns.

To extract the 4th column from file.csv:

csvcut -c 4 file.csv

Try crush-tools, they are great at manipulating delimited data. It sounds like exactly what you’re looking for.

My gut reaction would be to write a script wrapper around Python’s csv module (if there isn’t already such a thing).
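As a minimal sketch of such a wrapper (assuming Python 3 is available; csv.reader handles RFC 4180 quoting, and the zero-based indexes below pick out columns 2, 5 and 6):

python3 -c '
import csv, sys
w = csv.writer(sys.stdout)
for row in csv.reader(sys.stdin):
    w.writerow([row[1], row[4], row[5]])  # zero-based: columns 2, 5, 6
' < in.csv > out.csv

Rows with fewer than six columns would raise an IndexError, so real use would want a guard.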

I wrote one of these tools too (UNIX only), called csvprintf. It can also convert to XML in an online fashion.

Perl script (requires Text::CSV_XS):

#!/usr/bin/perl
use strict;
use warnings;

use Getopt::Long;
my @opt_columns;
GetOptions("column=i@" => \@opt_columns) or die "Failed parsing options\n";
die "Must give at least one --column\n" if int(@opt_columns) == 0;
@opt_columns = map { $_ - 1 } @opt_columns;  # convert 1-based to 0-based

use Text::CSV_XS;
my $csv = Text::CSV_XS->new({ binary => 1 });

open(my $stdin, "<-") or die "Couldn't open stdin\n";
open(my $stdout, ">-") or die "Couldn't open stdout\n";

while (my $row = $csv->getline($stdin)) {
    my @nrow = @{$row}[@opt_columns];   # slice out the requested columns
    $csv->print($stdout, \@nrow);
    print "\n";
}

Put it in a file csvcut.pl .

Example of taking only columns 3 and 4:

cat foo.csv | ./csvcut.pl --c 3 --c 4 

This will only quote columns that need quoting, so if an input column has “Bar” (with quotes) it will come out Bar (without quotes).

This looks remarkably like a Perl script solution such as the OP said he didn’t want. (‘I appreciate that there are numerous solutions, perl, awk (etc.) to this problem but I would like a native bash command line tool that does not require me to invoke some other scripting environment or write any additional code (!).’)

For a super lightweight wrapper around Python’s csv module, you could look at pluckr.

pluckr seems to have a subset of the functionality of csvfilter (e.g. --out-delimiter isn’t implemented).

ffe is another great tool. It requires you to create a configuration file for most non-trivial tasks. The upside is that it’s very flexible and can handle all sorts of structure, logic, and formatting that other tools can’t.

I like to use csvtool for quick jobs and use ffe for complex jobs or jobs that require frequent repeating.

A quick Google search reveals an awk script that seems to handle CSV files.

This sounds like a job for awk.

You will most likely need to write your own script for your specific needs, but this site has some dialogue about how to go about doing this.
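As a starting point, if the awk on hand is GNU awk, its FPAT variable defines fields by content rather than by separator, which copes with quoted fields containing commas (though not embedded newlines). A hedged sketch, assuming gawk:

gawk -v FPAT='("[^"]*")+|[^,]*' -v OFS=, '{ print $2, $5, $6 }' file.csv

Note that the surrounding quotes stay attached to quoted fields in the output.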

You could also use the cut utility to strip the fields out:

cut -d, -f2,5,6 file.csv

where the -f argument is the field you want and -d is the delimiter you want. You could then sort these results, find the unique ones, or use any other bash utility. There is a cool video here about working with CSV files from the command line. It’s only about a minute long; I’d take a look.

However, I guess you could lump the cut utility in with awk and not want to use it either. I don’t really know what exactly you mean by a native bash command, though, so I’ll still suggest it.


How to parse a CSV file in Bash?

I’m working on a long Bash script. I want to read cells from a CSV file into Bash variables. I can parse lines and the first column, but not any other column. Here’s my code so far:

cat myfile.csv | while read line
do
    read -d, col1 col2 < <(echo $line)
    echo "I got:$col1|$col2"
done

It's only printing the first column. As an additional test, I tried the following:

read -d, x y < <(echo a,b,)

And $y is empty. So I tried:

read x y < <(echo a b)

And $y is b. Why?

6 Answers

You need to use IFS instead of -d (the -d option sets read's end-of-input delimiter, so read stops at the first comma):

while IFS=, read -r col1 col2
do
    echo "I got:$col1|$col2"
done < myfile.csv

To skip a given number of header lines:

skip_headers=3
while IFS=, read -r col1 col2
do
    if ((skip_headers))
    then
        ((skip_headers--))
    else
        echo "I got:$col1|$col2"
    fi
done < myfile.csv

Note that for general-purpose CSV parsing you should use a specialized tool which can handle quoted fields with internal commas, among other issues that Bash can't handle by itself. Examples of such tools are csvtool and csvkit.

The proposed solution is fine for very simple CSV files, that is, if the headers and values are free of commas and embedded quotation marks. It is actually quite tricky to write a generic CSV parser (especially since there are several CSV "standards"). One approach to making CSV files more amenable to *nix tools is to convert them to TSV (tab-separated values), e.g. using Excel.
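As a sketch of that conversion without Excel (assuming Python 3; fields containing literal tabs or newlines would still confuse downstream tools), after which plain cut works:

python3 -c '
import csv, sys
csv.writer(sys.stdout, delimiter="\t").writerows(csv.reader(sys.stdin))
' < in.csv | cut -f2,5,6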

@Zsolt: There's no reason that should be the case. You must have a typo or a stray non-printing character.

@DennisWilliamson You should enclose the separator, e.g. when using ; : while IFS=";" read col1 col2; do ...

@thomas.mc.work: That's true in the case of semicolons and other characters that are special to the shell. In the case of a comma, it's not necessary and I tend to prefer to omit characters that are unnecessary. For example, you could always specify variables for expansion using curly braces (e.g. ${var}), but I omit them when they're not necessary. To me, it looks cleaner.

@DennisWilliamson: For some time now, the bash source tree has offered a loadable builtin CSV parser! Have a look at my answer! Of course there are some limitations.


Coming late to this question, and since bash now offers new features: because this question is about bash, and because none of the already-posted answers show this powerful and standards-compliant way of doing precisely this.

Parsing CSV files under bash, using a loadable module

Conforming to RFC 4180, a string like this sample CSV row:

12,22.45,"Hello, ""man"".","A, b.",42

should parse into five fields:

     1  12
     2  22.45
     3  Hello, "man".
     4  A, b.
     5  42

Bash loadable C compiled modules

Under bash, you can create, edit, and use loadable C compiled modules. Once loaded, they work like any other builtin! (You may find more information in the source tree. 😉)

The current source tree (Oct 15 2021, bash V5.1-rc3) contains a bunch of samples:

accept      listen for and accept a remote network connection on a given port
asort       Sort arrays in-place
basename    Return non-directory portion of pathname.
cat         cat(1) replacement with no options - the way cat was intended.
csv         process one line of csv data and populate an indexed array.
dirname     Return directory portion of pathname.
fdflags     Change the flag associated with one of bash's open file descriptors.
finfo       Print file info.
head        Copy first part of files.
hello       Obligatory "Hello World" / sample loadable.
...
tee         Duplicate standard input.
template    Example template for loadable builtin.
truefalse   True and false builtins.
tty         Return terminal name.
uname       Print system information.
unlink      Remove a directory entry.
whoami      Print out username of current user.

There is a full working CSV parser ready to use in the examples/loadables directory: csv.c!

Under Debian GNU/Linux based systems, you may have to install the bash-builtins package:

apt install bash-builtins

Using loadable bash-builtins:

enable -f /usr/lib/bash/csv csv 

From there, you could use csv as a bash builtin.

With my sample: 12,22.45,"Hello, ""man"".","A, b.",42

csv -a myArray '12,22.45,"Hello, ""man"".","A, b.",42'
printf "%s\n" "${myArray[@]}" | cat -n
     1  12
     2  22.45
     3  Hello, "man".
     4  A, b.
     5  42

Then in a loop, processing a file.

while IFS= read -r line; do
    csv -a aVar "$line"
    printf "First two columns are: [ '%s' - '%s' ]\n" "${aVar[0]}" "${aVar[1]}"
done < file.csv

This way is clearly quicker and more robust than any other combination of bash builtins, let alone forking to an external binary.

Unfortunately, depending on your system implementation, if your version of bash was compiled without loadable module support, this may not work.

Complete sample with multiline CSV fields.

Conforming to RFC 4180, a string like this single CSV row:

12,22.45,"Hello ""man"", This is a good day, today!","A, b.",42 
should parse as:

     1  12
     2  22.45
     3  Hello "man", This is a good day, today!
     4  A, b.
     5  42

Full sample script for parsing CSV containing multiline fields

Here is a small sample file with 1 header line, 4 columns and 3 rows. Because two fields contain newlines, the file is 6 lines long.

Id,Name,Desc,Value
1234,Cpt1023,"Energy counter",34213
2343,Sns2123,"Temperatur sensor
to trigg for alarm",48.4
42,Eye1412,"Solar sensor ""Day /
Night""",12199.21

And a small script able to parse this file correctly:

#!/bin/bash

enable -f /usr/lib/bash/csv csv

file="sample.csv"
exec {FD}<"$file"

# Read the header line to learn the number of columns and build an output format
read -ru $FD line
csv -a headline "$line"
printf -v fieldfmt '%-8s: "%%q"\n' "${headline[@]}"
numcols=${#headline[@]}

while read -ru $FD line; do
    # Keep appending physical lines until the logical CSV row holds all columns
    while csv -a row "$line"; (( ${#row[@]} < numcols )); do
        read -ru $FD sline || break
        line+=$'\n'"$sline"
    done
    printf "$fieldfmt\\n" "${row[@]}"
done

This may render as follows (I've used printf "%q" to represent non-printable characters, like newlines, as $'\n'):

Id      : "1234"
Name    : "Cpt1023"
Desc    : "Energy\ counter"
Value   : "34213"

Id      : "2343"
Name    : "Sns2123"
Desc    : "$'Temperatur sensor\nto trigg for alarm'"
Value   : "48.4"

Id      : "42"
Name    : "Eye1412"
Desc    : "$'Solar sensor "Day /\nNight"'"
Value   : "12199.21"

You can find a full working sample here: csvsample.sh.txt or csvsample.sh.

Note:

In this sample, I use the header line to determine the row width (number of columns). If your header line could hold newlines (or if your CSV uses more than 1 header line), you will have to pass the number of columns as an argument to your script (along with the number of header lines).

Warning:

Of course, parsing CSV this way is not perfect! It works for many simple CSV files, but take care about encoding and security! For example, this module won't be able to handle binary fields!

Note about quoted multi-line fields

In particular, if a multi-line field is located in the last column, this method won't loop correctly up to the second quote.

For this, you have to check quote parity in $line before parsing it with the csv module.
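A minimal sketch of such a parity check (an assumption on my part: per RFC 4180, quotes inside quoted fields are always doubled, so a complete logical row contains an even number of quote characters):

q=${line//[^\"]/}            # keep only the double-quote characters
if (( ${#q} % 2 )); then     # odd count: a quoted field continues on the next line
    read -ru $FD sline && line+=$'\n'"$sline"
fi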

