How can I quickly sum all numbers in a file?
I’m looking to write a script which will print the sum of all numbers in the file. I’ve got a solution, but it’s not very efficient. (It takes several minutes to run.) I’m looking for a more efficient solution. Any suggestions?
@brian d foy, I’m too embarrassed to post it. I know why it’s slow. It’s because I call «cat filename | head -n 1» to get the top number, add it to a running total, and call «cat filename | tail …» to remove the top line for the next iteration. I have a lot to learn about programming.
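For reference, the approach described would look roughly like this (a hypothetical reconstruction, since the original script was never posted; numbers.txt is a placeholder name):

# Repeatedly read the top number, add it to the total, then rewrite the
# file without that line. Each pass rereads and rewrites the whole file,
# which is why this takes minutes on large inputs.
sum=0
while [ -s numbers.txt ]; do
    n=$(head -n 1 numbers.txt)                           # top number
    sum=$((sum + n))                                     # running total
    tail -n +2 numbers.txt > tmp && mv tmp numbers.txt   # drop the top line
done
echo "$sum"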
That’s... very systematic. Very clear and straightforward, and I love it for all that it is a horrible abomination. Built, I assume, out of the tools that you knew when you started, right?
@MarkRoberts It must have taken you a long while to work that out. It’s a very clever problem-solving technique, and oh so wrong. It looks like a classic case of overthinking. Several of Glen Jackman’s shell scripting solutions (two of which are pure shell and don’t use things like awk or bc) all finished adding a million numbers in less than 10 seconds. Take a look at those and see how it can be done in pure shell.
32 Answers
Please mark this as the best answer. It also works if you want to sum the first value in each row, inside a TSV (tab-separated value) file.
@EthanFurman I actually have a tab-delimited file as you explained, but I'm not able to make -F '\t' do the magic. Where exactly is the option meant to be inserted? I have it like this: awk -F '\t' '{ sum += $0 } END { print sum }' file
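For what it's worth, -F goes before the quoted program, just as you have it; the likely issue is which field holds the number. Assuming the value you want is the first tab-separated field, something like this should work:

# -F '\t' makes tab the field separator; $1 is the first field of each line.
awk -F '\t' '{ sum += $1 } END { print sum }' file

If the number is the entire line, $0 works without -F at all.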
None of the solutions thus far uses paste. Here’s one:

paste -sd+ filename | bc
If the file has a trailing newline, a trailing + will incur a syntax error. Fix the error by removing the trailing + :
paste -sd+ filename | sed 's/+$//g' | bc
As an example, calculate Σn where 1 ≤ n ≤ 100000:
$ seq 100000 | paste -sd+ | bc -l
5000050000
(For the curious, seq n prints a sequence of numbers from 1 to n given a positive number n.)
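For instance, with a smaller range you can see the expression paste builds before bc evaluates it:

$ seq 5
1
2
3
4
5
$ seq 5 | paste -sd+ -
1+2+3+4+5
$ seq 5 | paste -sd+ - | bc
15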
seq 100000 | paste -sd+ - | bc -l on Mac OS X Bash shell. And this is by far the sweetest and the unixest solution!
@SimoA. I vote that we use the term unixiest in place of unixest, because the sexiest solution is always the unixiest 😉
For a Perl one-liner, it’s basically the same thing as the awk solution in Ayman Hourieh’s answer:

% perl -nle '$sum += $_ } END { print $sum'
If you’re curious what Perl one-liners do, you can deparse them:
% perl -MO=Deparse -nle '$sum += $_ } END { print $sum'
The result is a more verbose version of the program, in a form that no one would ever write on their own:
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
    chomp $_;
    $sum += $_;
}
sub END {
    print $sum;
}
-e syntax OK
Just for giggles, I tried this with a file containing 1,000,000 numbers (in the range 0 - 9,999). On my Mac Pro, it returns virtually instantaneously. That's too bad, because I was hoping using mmap would be really fast, but it's just the same time:
use 5.010;
use File::Map qw(map_file);

map_file my $map, $ARGV[0];

$sum += $1 while $map =~ m/(\d+)/g;

say $sum;
Wow, that shows a deep understanding of what code -nle actually wraps around the string you give it. My initial thought was that you shouldn't post while intoxicated, but then I noticed who you were and remembered some of your other Perl answers 🙂
-n and -p just put characters around the argument to -e, so you can use those characters for whatever you want. We have a lot of one-liners that do interesting things with that in Effective Perl Programming (which is about to hit the shelves).
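One well-known example of playing with that wrapping (a sketch; it relies on -n expanding to a while (<>) { ... } loop around the -e code, and numbers.txt is just a placeholder):

# The '}{' trick: close -n's implicit loop early, then open a block that
# runs after it; $. holds the number of the last line read, i.e. a line count.
$ perl -lne '}{ print $.' numbers.txt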
Just for fun, let's benchmark it:
$ for ((i=0; i<1000000; i++)); do echo $RANDOM; done > random_numbers

$ time perl -nle '$sum += $_ } END { print $sum' random_numbers
16379866392

real    0m0.226s
user    0m0.219s
sys     0m0.002s

$ time awk '{ sum += $1 } END { print sum }' random_numbers
16379866392

real    0m0.311s
user    0m0.304s
sys     0m0.005s

$ time { { tr "\n" + < random_numbers ; echo 0; } | bc; }
16379866392

real    0m0.445s
user    0m0.438s
sys     0m0.024s

$ time { s=0; while read l; do s=$((s+$l)); done < random_numbers; echo $s; }
16379866392

real    0m9.309s
user    0m8.404s
sys     0m0.887s

$ time { s=0; while read l; do ((s+=l)); done < random_numbers; echo $s; }
16379866392

real    0m7.191s
user    0m6.402s
sys     0m0.776s

$ time { sed ':a;N;s/\n/+/;ta' random_numbers | bc; }
^C

real    4m53.413s
user    4m52.584s
sys     0m0.052s
I aborted the sed run after 5 minutes
I've been diving into Lua, and it is speedy:
$ time lua -e 'sum=0; for line in io.lines() do sum=sum+line end; print(sum)' < random_numbers
16388542582.0

real    0m0.362s
user    0m0.313s
sys     0m0.063s
and while I'm updating this, ruby:
$ time ruby -e 'sum = 0; File.foreach(ARGV.shift) {|line| sum += line.to_i}; puts sum' random_numbers
16388542582

real    0m0.378s
user    0m0.297s
sys     0m0.078s
To heed Ed Morton's advice, using $1:
$ time awk '{ sum += $1 } END { print sum }' random_numbers
16388542582

real    0m0.421s
user    0m0.359s
sys     0m0.063s
and using $0:

$ time awk '{ sum += $0 } END { print sum }' random_numbers
16388542582

real    0m0.302s
user    0m0.234s
sys     0m0.063s
Your awk script should execute a bit faster if you use $0 instead of $1, since awk does field splitting (which obviously takes time) if any field is specifically mentioned in the script, but doesn't otherwise.
Another option is to use jq:

jq -s add file
-s (--slurp) reads the input lines into an array.
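For example, jq happily slurps plain newline-separated numbers, since each one is a valid JSON value:

$ seq 3 | jq -s add
6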
Wonderful solution. I had a tab-delimited file where I wanted to sum column 6. Did that with the following command: awk '{ print $6 }' myfile.log | jq -s add
sum=0
while read -r line
do
    (( sum += line ))
done < file
echo $sum
And it's probably one of the slowest solutions and therefore not so suitable for large amounts of numbers.
I prefer to use GNU datamash for such tasks because it's more succinct and legible than perl or awk. For example:

datamash sum 1 < myfile
where 1 denotes the first column of data.
This does not appear to be a standard component as I do not see it in my Ubuntu installation. Would like to see it benchmarked, though.
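It's not installed by default, but it is packaged for the major distributions (typically as datamash, e.g. apt install datamash on Ubuntu). Assuming it's installed, a quick sanity check looks like:

$ seq 5 | datamash sum 1
15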
I prefer to use R for this:
I'm a fan of R for other applications but it's not good for performance in this way. File I/O is a major issue. I've tested passing args to a script which can be sped up using the vroom package. I'll post more details when I've benchmarked some other scripts on the same server.
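For reference, a minimal way to invoke R from the shell for this (a sketch; numbers.txt is a placeholder, and it assumes Rscript is on your PATH):

# scan() reads whitespace-separated numbers from the file;
# quiet = TRUE suppresses the "Read N items" message.
$ Rscript -e 'cat(sum(scan("numbers.txt", quiet = TRUE)), "\n")'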
( echo 0 ; sed 's/$/ +/' foo ; echo p ) | dc
This assumes the numbers are integers. If you need decimals, try
( echo 0 2k ; sed 's/$/ +/' foo ; echo p ) | dc
Adjust 2 to the number of decimals needed.
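For example, summing a small file with two decimal places (foo here is just a placeholder name, as above):

$ printf '1.5\n2.25\n' > foo
$ ( echo 0 2k ; sed 's/$/ +/' foo ; echo p ) | dc
3.75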
Perl 6
~$ perl6 -e '.say for 0..1000000' > test.in
~$ perl6 -e 'say sum lines' < test.in
500000500000
$ perl -MList::Util=sum -le 'print sum <>' nums.txt
# Ruby
ruby -e 'puts open("random_numbers").map(&:to_i).reduce(:+)'

# Python
python -c 'print(sum(int(l) for l in open("random_numbers")))'
Converting to float seems to be about twice as fast on my system (320 vs 640 ms). time python -c "print(sum([float(s) for s in open('random_numbers','r')]))"
I couldn't just pass by. Here's my Haskell one-liner. It's actually quite readable:
sum <$> (read <$>) <$> lines <$> getContents
Unfortunately there's no ghci -e to just run it, so it needs the main function, print and compilation.
main = (sum <$> (read <$>) <$> lines <$> getContents) >>= print
To clarify, we read the entire input (getContents), split it by lines, read each as a number, and sum. <$> is the fmap operator; we use it instead of the usual function application because all of this happens in IO. read needs an additional fmap, because it is applied inside the list.
$ ghc sum.hs
[1 of 1] Compiling Main             ( sum.hs, sum.o )
Linking sum ...
$ ./sum
1
2
4
^D
7
Here's a strange upgrade to make it work with floats:
main = ((0.0 +) <$> sum <$> (read <$>) <$> lines <$> getContents) >>= print
$ ./sum
1.3
2.1
4.2
^D
7.6000000000000005
How can I quickly sum all numbers in a file?
Each line contains text and a number in one column. I need to calculate the sum of the numbers from all the rows. How can I do that? Thanks.

example.log contains:
time=31sec
time=192sec
time=18sec
time=543sec
There is almost the same question in Stack Overflow: How can I quickly sum all numbers in a file?. Maybe time to have cross-site duplicates?
10 Answers
If your grep supports the -o option, you can try:
$ grep -o '[[:digit:]]*' file | paste -sd+ - | bc
784
With a newer version (4.x) of GNU awk :
Let me explain that. There is just one case where s can be empty: if the input data contains no lines (i.e., if there is no input at all). In that case there are two possible behaviours: 1) no input => no output, or 2) always output something, even if only 0. Both are sensible options depending on the application context. The +0 addresses option 2). To address option 1) you'd rather have to write something like END { if (s != "") print s }. Therefore it makes no sense to assume either option (for this corner case of no data) until it is specified by the question.
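To see that corner case concretely (both variants behave identically on non-empty input):

# Option 2: always print something, even on empty input
$ printf '' | awk '{ s += $1 } END { print s+0 }'
0
# Option 1: print nothing when there was no input
$ printf '' | awk '{ s += $1 } END { if (s != "") print s }'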
@slm, that answer is not any more or less verbose than the other answers here and is self-explanatory. It also has the advantage of working with input like time=1.4e5sec
@StéphaneChazelas - agreed, but this is a new user and we do encourage users to provide more than single line answers. A bit of text explaining how it works would make it a much stronger answer than just code.
@slm, this is a new user with one of the best answers (from a technical stand point) and he gets two downvotes and a negative comment. Not a very warm welcome.
@TomFenech, the POSIX syntax for awk requires that those pattern/action items be separated by either ";" or "newline", so you may find awk implementations where it fails without this ";".
tr -cs 0-9 '[\n*]' | grep . | paste -sd + - | bc
@user1717828: you should rather use the (shorter, and more compatible!) -F'=' instead of --field-separator =
@user1717828: -F'=' or -F '=' are two ways of writing -F fs (where fs is "=" in your case). I added the single quotes to ensure the fs is properly seen and interpreted by awk, not the shell (useful if the fs is ';', for example).
Everyone has posted awesome awk answers, which I like very much.
A variation on @cuonglm's answer, replacing grep with sed:
sed 's/[^0-9]//g' example.log | paste -sd'+' - | bc
- The sed strips everything except for the numbers.
- The paste -sd+ - command joins all the lines together as a single line
- The bc evaluates the expression (see the step-by-step example below)
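Seen step by step with the example.log above:

$ sed 's/[^0-9]//g' example.log
31
192
18
543
$ sed 's/[^0-9]//g' example.log | paste -sd'+' -
31+192+18+543
$ sed 's/[^0-9]//g' example.log | paste -sd'+' - | bc
784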
You should use a calculator.
With your four lines that prints:
time=31
time=223
time=241
time=784
If speed is what you're after then dc is what you want. Traditionally it was bc's compiler, and it still is on many systems.
@glennjackman - your measurements don't include dc as near as I can tell. What are you talking about?
By the way, when comparing the old crew to the new crew - such as when you benchmark perl v the standard unix toolset - it really doesn't make much sense if you use GNU tools compiled on a GNU toolchain. All of the bloat that can negatively affect Perl's performance is also in all of those GNU-compiled GNU utils. Sad but true. You need a real, simply built, simple toolset to accurately judge the difference. Like an heirloom-toolchest set statically linked against musl libs for instance - in that way you can bench the one-tool/one-job paradigm vs the one-tool-to-rule-them-all one.
Sum all the numbers in a file given by positional parameter
I want to sum all the numbers in a file (columns and lines) given by the first parameter, but my program shows sum=sum+$i instead of the numeric sum:
sum=0;
file=$1

for i in $file
do
    sum=sum+$i;
done;

echo "The sum is: " $sum
$ cat file.txt
10 20 10
40 50
I have also tried this awk 'BEGIN
Show the sum of all integers in a file. The file can contain multiple lines, and each line can contain multiple integers.
4 Answers
$ cat file1.txt
10 20 10
40 50
$ awk '{ for (i = 1; i <= NF; i++) sum += $i } END { print sum }' file1.txt
130
cat file.txt | xargs | sed -e 's/\ /+/g' | bc
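Step by step with the sample file:

$ cat file.txt | xargs
10 20 10 40 50
$ cat file.txt | xargs | sed -e 's/\ /+/g'
10+20+10+40+50
$ cat file.txt | xargs | sed -e 's/\ /+/g' | bc
130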
You can also use a simple read and an array to sum the values, relying on word splitting via the default IFS (Internal Field Separator) to separate the values into an array, e.g.
#!/bin/bash

declare -i sum=0
fn="${1:-/dev/stdin}"       ## read from file as 1st argument (default stdin)

while read -r line; do      ## read each line
    a=( $line )             ## separate values into array
    for i in "${a[@]}"; do  ## for each value in array
        ((sum += i))        ## add to sum
    done
done < "$fn"

echo "sum: $sum"
Example Input File
$ cat dat/numfile.txt
10 20 10
40 50
Example Use/Output
$ bash sumnumfile.sh dat/numfile.txt
sum: 130