Linux sort csv file

Contents

cat | sort csv file by name in bash
3 Answers
bash sort quoted csv files by numeric key
5 Answers
sort csv file using unix utility sort
4 Answers
Bash: how do I sort a CSV file in bash?

cat | sort csv file by name in bash

The command I wrote works well for names like a001, a002, a010, a100, but in my files the names are a bit messed up, so they look like a1, a2, a10, a100, and the command arranges them like this:

cn201 cn202 cn202 cn203 cn204 cn99 cn98 cn97 cn96 .. cn9

Do you want to sort the contents of the files, or concatenate the files in numerical order? That is, "contents of file cn5, contents of file cn20, contents of file cn100" or "sort the contents of all these files"?

Floris, I have 10 CSV files containing data that I want to concatenate, sorted ascending by name: cn1, cn2, cn10, cn100.

3 Answers

If I understand correctly, you want to use the -V (version-sort) flag instead of -n . This is only available on GNU sort, but that’s probably the one you are using.

However, it depends how you want the prefixes to be sorted.
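As a quick illustration of what version sort does, here is a minimal sketch. It assumes GNU sort and uses made-up names in the style of the question (cn1, cn10, ...); the real data layout in the asker's files is not shown in the thread.

```bash
# Hypothetical sample names; only the cnN pattern comes from the question.
names='cn100
cn9
cn10
cn1
cn2'

# -V (GNU sort) compares embedded numbers by value, so cn9 sorts before cn10.
sorted=$(printf '%s\n' "$names" | sort -V)
printf '%s\n' "$sorted"
```

With plain sort the same input would put cn10 and cn100 ahead of cn2, because the comparison is character by character.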

@Adrian: By the way, you don’t need the cat *.csv | . You’re giving sort an explicit list of files ( *.csv ) so it will never read from its standard input, so the cat does nothing of value.

If you don’t have the -V option, sort allows you to be more precise about what characters constitute a sort key.

sort -t' ' -k2.3n *.csv > output.csv

The .3 tells sort that the key to sort on starts with the 3rd character of the second field, effectively skipping the cn prefix. You can put the n directly in the field specifier, which saves you two whole characters, but more importantly for more complex sorts, allows you to treat just that key as a number, rather than applying -n globally (which is only an issue if you specify multiple keys with several uses of -k ).
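A small sketch of the character-offset key described above. The sample lines are invented and assume a space-delimited layout whose second field holds the cnN name; adjust -t to whatever delimiter the real files use.

```bash
# Hypothetical space-delimited records; field 2 carries the cnN name.
rows='a cn10
b cn9
c cn100
d cn1'

# -k2.3n: the key starts at character 3 of field 2 (skipping "cn")
# and only this key is compared numerically.
sorted=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ' ' -k2.3n)
printf '%s\n' "$sorted"
```

Setting -t explicitly matters here: without it, leading blanks count as part of the field and the character offset shifts.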


bash sort quoted csv files by numeric key

The problem is that since all values are quoted, they don't get sorted correctly by the -n (numeric) option. Is there a solution?

5 Answers

A little trick, which uses a double quote as the separator:

sort --field-separator='"' --key=4 -n
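To see why key 4 is the right one, here is a sketch that applies the trick to a small fully quoted sample (the same three rows that appear later in this thread, in shuffled order). Splitting on the double quote makes field 1 empty, field 2 the first value, field 3 the comma between values, and field 4 the second value.

```bash
# Sample rows; every value is quoted, and one value contains a comma.
csv='"ccc, Inc.","6100","yyy"
"aaa","1","xxx"
"bbb","609","zzz"'

# With " as the separator, the second CSV value lands in sort field 4.
sorted=$(printf '%s\n' "$csv" | LC_ALL=C sort --field-separator='"' --key=4 -n)
printf '%s\n' "$sorted"
```

This only holds while every value is quoted and no value contains an escaped double quote; otherwise the field count shifts.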

For a quoted csv use a language that has a proper csv parser. Here is an example using perl .

perl -MText::ParseWords -lne '
    chomp;
    push @line, [ parse_line(",", 0, $_) ];
    }{
    @line = sort { $a->[1] <=> $b->[1] } @line;
    for (@line) {
        local $" = qw(",");
        print qq("@$_");
    }
' file

Output:

"aaa","1","xxx"
"bbb","609","zzz"
"ccc, Inc.","6100","yyy"

Explanation:

Remove the newline from the input using the chomp function.
Using the Text::ParseWords module, parse the quoted line and store it in an array of arrays without the quotes.
In the END block (the code that runs after the last line has been read), sort the array of arrays on the second column and assign it back to the original array.
For every item in the array of arrays, set the list separator ($") to "," and print the row with a leading and trailing " to recreate the lines in the original format.

less behaves like cat when its output goes to a pipe, and sort is perfectly capable of reading the file itself (so sort -t '"' -k 4n sort2.txt would be better; it would sort numerically, too).

Also, a first field containing "Joe ""The Man"" Bloggs" (valid CSV) would throw your field count off horribly. You can level that complaint at my answer, too, but it would require something like "Joe ""The Man"",""The Guy"" Bloggs" to confuse it, which is even more esoteric (and one of my 'reasonable assumptions' is that such a string would not appear; there'd be a space after the embedded comma).

Updated based on your suggestions, Jonathan. Thank you, I forgot that sort can pull the file directly.

There isn’t going to be a really simple solution. If you make some reasonable assumptions, then you could consider:

sed 's/","/^A/g' input.csv | sort -t'^A' -k 2n | sed 's/^A/","/g'

This replaces each "," sequence with Control-A (shown as ^A in the code), then uses that as the field delimiter in sort (a numeric sort on column 2), and then replaces the Control-A characters with "," again.

If you use bash, you can use the ANSI C quoting mechanism $'\1' to embed the control character visibly in the script; you just have to finish the single-quoted string before the escape, and restart it afterwards:

sed 's/","/'$'\1''/g' input.csv | sort -t$'\1' -k 2n | sed 's/'$'\1''/","/g'

Or play with double quotes instead of single quotes, but that gets messy because of the double quotes that you are replacing. But you can simply type the characters verbatim and editors like vim will be happy to show them to you.
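For shells without $'...' quoting, one possible variant (a sketch, not part of the original answer) is to build the Control-A byte with printf and keep it in a variable. The variable name soh and the sample rows are illustrative assumptions.

```bash
# Build a literal Control-A (octal 001) without relying on bash's $'...'.
soh=$(printf '\001')

csv='"ccc, Inc.","6100","yyy"
"aaa","1","xxx"
"bbb","609","zzz"'

# Swap "," for Control-A, sort numerically on field 2, then swap back.
sorted=$(printf '%s\n' "$csv" \
  | sed "s/\",\"/$soh/g" \
  | LC_ALL=C sort -t "$soh" -k 2n \
  | sed "s/$soh/\",\"/g")
printf '%s\n' "$sorted"
```

The same assumption applies as in the answer: the data must not already contain Control-A, and the "," sequence must only occur between fields.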


sort csv file using unix utility sort

Is there a way to sort a very large CSV file using sort?
I simply want to sort by the first column; however, the data might contain line breaks within a column (standard CSV file rules apply). Would the line breaks break the sort utility?

Actually quite difficult. You might want to take a look at my FOSS project at code.google.com/p/csvfix which does sorting of CSV files, among many other things, and runs on Unix.

@user Not too good I’d guess, the sort is performed in memory. I haven’t tested it on enormous inputs.

@Neil Butterworth: if it only sorts in memory, with no merge sort from disk, then it won't work for large inputs, right?

4 Answers

The sort function will sort the lines in asciicographical order. To get a more sophisticated effect, you might use the UNIX utility awk.

I believe you should try something like this: cat old.csv | sort > new.csv

UPD: To prepare the data if needed, we can use an AWK script.

It's quite simple to prepare the data using an AWK script, which is exactly what AWK is for (preparing/formatting huge log files). I didn't say that this command would work; I said something like this.

You could do it with a mix of utilities. Hopefully I've understood it correctly, and if so, this might do the job. If not, point out where I've gone wrong in an assumption 🙂 This requires that the number of fields per CSV record is fixed. It's also a dirt-simple example that doesn't cover various CSV variations (e.g., hello,"world,how",are,you would break, as "world,how" would be split into two fields):

hello,world,how,are,you
one,two,three,four,five
once,I,caught,a fish,alive
hey,now,hey,now,now

hello,world,how,are,you
hey,now,hey,now,now
once,I,caught,a fish,alive
one,two,three,four,five

In essence, all we're doing with the awk script is merging the multi-line records into a single line that we can then feed to sort, then break apart again with tr. I'm using a pipe as the replacement for the newline char; just choose something that you can guarantee will not appear in a CSV record.

Now it might not be perfect for what you want, but hopefully it'll nudge you in the right direction. The main thing with the awk script I've knocked up is that it needs to know how many fields there are per CSV record. This needs to be fixed. If it's variable, then all bets are off, as there'd need to be more rules in there to define the semantic nature of the file that you want to sort.
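The answer's own awk script did not survive in this copy, so the following is only a sketch of the merge, sort, split idea it describes, not the author's code. The function name merge_sort_split and the sample input are assumptions; fields are counted by naive comma splitting, so quoted commas are not handled.

```bash
# merge_sort_split N: read CSV on stdin where every record has exactly N
# comma-separated fields but may be broken across physical lines.
merge_sort_split() {
  awk -v want="$1" '
    {
      # Append this physical line to the pending record; "|" marks the break.
      buf = (pending ? buf "|" $0 : $0)
      pending = 1
      # Naive field count: split on every comma (quoted commas NOT handled).
      if (split(buf, parts, ",") >= want) { print buf; buf = ""; pending = 0 }
    }
    END { if (pending) print buf }   # flush an incomplete trailing record
  ' | LC_ALL=C sort | tr "|" "\n"
}

# Hypothetical input: the second record is split across two lines.
out=$(printf '%s\n' 'one,two,three,four,five' 'once,I,caught,a' \
  'fish,alive' 'hello,world,how,are,you' | merge_sort_split 5)
printf '%s\n' "$out"
```

A line break inside the last field of a record is not detected by this sketch, because the record already has N fields when the break is reached; handling that would need a real CSV parser.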


Bash: how do I sort a CSV file in bash?

First you will have to replace the commas inside the quotes with something else, then sort, and then turn the placeholder back into commas. In short, sed with regular expressions will help.

Then simply sort by the second field:

sort -t, -k2 -n file.csv > sorted.csv

Saboteur, unfortunately that will not work on an example like this:

"FIELD_1, one",1,"one"
"FIELD_1, three",3,"three"
"FIELD_1, two, three",2,"two"

As I understand it, whatever is inside "" is not treated as a single cell, and the commas inside it are counted as separators too. Is there any way around this?

sort --field-separator="\"" -k3 -n file.csv > sorted.csv

sotvm

Win332, Saboteur,
the -n key is not needed;
try it with and without:
"FIELD_6, 2, 3", 2 ,"xxx, 2", second
"FIELD_5, 3, 3", third
"FIELD_1",1 ,"xxx, 2", first

P.S.
Can the delimiter be more than one character?
Or does sort not support that?

The -n key is needed as soon as there are values above 9, because alphabetically 10 < 2, while numerically 10 > 2.
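A minimal sketch of that difference, using three made-up values:

```bash
# Without -n the comparison is character by character, so "10" sorts before "2".
lexical=$(printf '%s\n' 10 2 9 | LC_ALL=C sort)

# With -n the values are compared as numbers.
numeric=$(printf '%s\n' 10 2 9 | LC_ALL=C sort -n)

printf 'lexical: %s\n' "$(printf '%s' "$lexical" | tr '\n' ' ')"
printf 'numeric: %s\n' "$(printf '%s' "$numeric" | tr '\n' ' ')"
```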

awk supports a delimiter longer than one character; sort does not.

sotvm

Saboteur,
thanks, got it, BUT
why doesn't it work with the -n key, what's the catch?
"FIELD_6, 2, 3", 2 ,"xxx, 2", second
"FIELD_5, 3, 3", third
"FIELD_1",1 ,"xxx, 2", first

sotvm

Saboteur,
ah, I get it: the second line has only two cells.
Can that be ignored?
Say I have several lines with a varying number of cells, 1 to 10,
but I still need to order them somehow by the last value or by the 3rd.
As it is, sort does not just skip the line where that cell is missing,
it does not process the other lines either.
Or is sort a simple command and this is not a job for it?

In a nutshell, if it's not too much trouble.
Thanks

P.S.
If we take " as the delimiter in the example,
then I think it would be more correct to use the -b key (ignore blanks)?
There might be no blanks there, or on the contrary 2-3 of them.
Or is that wrong?

sort is a simple command; its main job is sorting lines.
That it can also sort numbers, and by column at that, is great, but anything more complex has to be parsed with something else.
For example, use awk to rework each line so that the column you need comes first, and then sort on that first column, which is present in every line.

P.S. In the example we use the -n key (numeric sort). When a piece of the line is converted to a number, leading spaces are dropped anyway.
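The awk suggestion above (move the wanted column to the front, sort on it, then drop it again) could look roughly like this. It is a sketch under stated assumptions: the function name sort_by_last_field and the sample rows are invented, the key is the last comma-separated field, and commas inside quoted values are not handled.

```bash
# Sort rows with a varying number of fields by their last field.
sort_by_last_field() {
  tab=$(printf '\t')
  # Prefix each line with its last field and a tab, sort numerically on
  # that prefix, then cut the prefix off again.
  awk -F',' '{ print $NF "\t" $0 }' \
    | LC_ALL=C sort -t "$tab" -k1,1n \
    | cut -f2-
}

sorted=$(printf '%s\n' 'a,b,30' 'c,4' 'd,e,f,12' | sort_by_last_field)
printf '%s\n' "$sorted"
```

Because the key is copied to the front of every line, rows with fewer fields no longer break the ordering of the others.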

sotvm

It all seems to work. I tried it (I'm learning myself); I didn't even write the code, I copy-pasted yours.
The claim that whatever is inside the quotes "" is read as a single string/character: FALSE.
The only thing, possibly, is that the csv itself first has to be split into lines,
or does each FIELD value/parameter there start on a new line?
Then it should work.

Yes, it looks like I was wrong, it really does work.
Here is a csv file I put together that does not work:

"FIELD_1, one",1,"one"
"FIELD_1, three",3,"three"
"FIELD_1, two",2,"two"

with
sort --field-separator="," -k2 -n file.csv > sorted.csv

I have been struggling with this problem for a very long time and still have not found the reason why it does not sort.
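One way to see what is going on is to print the field that sort is actually handed. The sketch below uses cut with the same comma delimiter on the first sample row: the comma inside the quoted value counts as a separator, so field 2 is not the number at all.

```bash
row='"FIELD_1, one",1,"one"'

# Field 2, split on every comma: the tail of the quoted text, not the number.
key=$(printf '%s\n' "$row" | cut -d, -f2)
printf '[%s]\n' "$key"

# For this particular file the number is comma field 3 on every line,
# so keying on field 3 happens to work here.
sorted=$(printf '%s\n' '"FIELD_1, one",1,"one"' '"FIELD_1, three",3,"three"' \
  '"FIELD_1, two",2,"two"' | LC_ALL=C sort -t, -k3,3n)
printf '%s\n' "$sorted"
```

Keying on field 3 is only a workaround for rows with exactly one embedded comma; a row with a different number of commas inside the quotes shifts the field count again.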

sotvm

SOTVM, yes, that works. But what if the csv file holds all sorts of data that is not limited to that arrangement of commas?
An example like this will not work:

"FIELD_1, two, three",2,"two"
"FIELD_1, one",1,"one"
"FIELD_1, three",3,"three"

Is there a way to treat everything inside "" as one cell regardless of the commas? Maybe --field-separator="," has to be set up somehow?

sotvm

Try reading the Russian man page, it seems a bit easier to follow, and experiment ツ
Or maybe someone will give some useful advice;
I have not really figured out all the subtleties myself yet.
www.opennet.ru/man.shtml?category=1&russian=&topic=sort

sotvm

Win332,
it seems to me you first need to run the file through sed|awk
to replace/substitute/add your own unique delimiter marker,
then sort by it,
then remove the marker.
Then the sort command will not care about "" and the like.
I'm not good at putting thoughts into writing ツ
In spoken words, counting on my fingers, I would lay out the algorithm.
Sometimes you strain all day and seem to have cranked out some ugly code,
and then a smart guy comes along and produces a one-line command of 30 characters.
That's Linux/bash for you (•̮•)
