Select random lines from a file
In a Bash script, I want to pick out N random lines from an input file and write them to another file. How can this be done?
I disagree with sort -R as it does a lot of excess work, particularly for long files. You can use $RANDOM , % wc -l , jot , sed -n (à la stackoverflow.com/a/6022431/563329), and bash functionality (arrays, command redirects, etc) to define your own peek function which will actually run on 5,000,000-line files.
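For instance, such a peek function might look roughly like this (a sketch only; the two-draw trick to stretch $RANDOM's 15-bit range is illustrative, not taken from the linked answer):

# print one random line from a file without sorting or shuffling the whole thing
peek() {
    local file=$1
    local total k
    total=$(wc -l < "$file")                        # count the lines once
    # combine two 15-bit $RANDOM draws so multi-million-line files are covered
    k=$(( (RANDOM * 32768 + RANDOM) % total + 1 ))
    sed "${k}q;d" "$file"                           # print line k, then quit
}

peek /usr/share/dict/words    # call it N times for N lines (repeats possible)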
8 Answers
Use shuf with the -n option as shown below, to get N random lines:
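For example, to get 100 random lines (the file names here are just placeholders):

shuf -n 100 input.txt > output.txt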
If you just need a random set of lines, not in a random order, then shuf is very inefficient (for big file): better is to do reservoir sampling, as in this answer.
I ran this on a 500M row file to extract 1,000 rows and it took 13 min. The file had not been accessed in months, and is on an Amazon EC2 SSD Drive.
Sort the file randomly and pick first 100 lines:
lines=100
input_file=/usr/share/dict/words

# This is the basic selection method
sort -R "$input_file" | head -n "$lines"
sort actually sorts identical lines together, so if you may have duplicate lines and you have shuf (a gnu tool) installed, it's better to use it for this.
Also, this is definitely going to make you wait a lot if you have a considerably huge file (80 million lines, say), whereas shuf -n acts almost instantaneously.
@J.F.Sebastian The code: sort -R input | head -n <number of lines>
Well, according to a comment on the shuf answer, he shuf'ed 78 000 000 000 lines in under a minute.
EDIT: I beat my own record
powershuf did it in 0.047 seconds
$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null  0.02s user 0.01s system 80% cpu 0.047 total
The reason it is so fast: I don't read the whole file, I just move the file pointer 10 times and print the line after the pointer.
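In shell terms the idea is roughly this (a hypothetical sketch, not powershuf's actual code): jump to a random byte offset, skip the partial line found there, and print the next complete line. As the comments below point out, lines that follow long lines are more likely to be picked, so the result is only approximately uniform.

file=lines_78000000000.txt
size=$(wc -c < "$file")                                  # file size in bytes
for _ in $(seq 10); do
    # build a 45-bit random offset from three 15-bit $RANDOM draws
    offset=$(( ((RANDOM << 30) | (RANDOM << 15) | RANDOM) % size + 1 ))
    # seek to that byte, drop the (usually partial) line there, print the next one
    tail -c +"$offset" "$file" | head -n 2 | tail -n 1
done

Since tail -c + seeks rather than reads, each pick touches only a few kilobytes of the file.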
Old attempt
First I needed a file of 78.000.000.000 lines:
seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt
This gives me a file with 78 billion newlines 😉
$ time shuf -n 10 lines_78000000000.txt
shuf -n 10 lines_78000000000.txt  2171.20s user 22.17s system 99% cpu 36:35.80 total
The bottleneck was the CPU and the lack of multithreading: it pinned one core at 100% while the other 15 were not used.
Python is what I regularly use so that's what I'll use to make this faster:
#!/bin/python3
import random

f = open("lines_78000000000.txt", "rt")

# count the newlines in 64 KiB chunks
count = 0
while 1:
    buffer = f.read(65536)
    if not buffer:
        break
    count += buffer.count('\n')

for i in range(10):
    f.readline(random.randint(1, count))
This got me just under a minute:
$ time ./shuf.py
./shuf.py  42.57s user 16.19s system 98% cpu 59.752 total
I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe drive, which gives me plenty of read and write speed.
I know it can get faster but I'll leave some room to give others a try.
Well, according to your description of powershuf's inner functioning, it looks like it is just random-ish. Using a file with just two lines, one being 1 character long, the other being 20 characters long, I expect both lines to be chosen with equal probability. This doesn't seem to be the case with your program.
There was an issue with files shorter than 4KB and some other math mistakes that made it horrible with small files. I fixed them as far as I could find the issues, please give it another try.
Hi Stein. It doesn't seem to work. Did you test it the way I suggested in my above comment? Before making something quicker than shuf, I reckon you should focus on making something that works as accurately as shuf. I really doubt anyone can beat shuf with a python program. BTW, unless you use the -r option, shuf doesn't output the same line twice, and of course this takes additional processing time.
Why does powershuf discard the first line? Can it ever pick the very first line? It seems to also funnel the search in a weird way: if you have 10 lines too long, then 1 line of valid length, then 5 lines and another line of valid length, then the iteration will find the 10 lines more often than the 5, and funnel about two thirds of the time into the first valid line. The program doesn't promise this, but it would make sense to me if the lines were effectively filtered by length and then random lines were chosen from that set.
The question is how to get random lines from a text file in a bash script, not how to write a Python script.
My preferred option is very fast. I sampled a tab-delimited data file with 13 columns, 23.1M rows, and 2.0 GB uncompressed size.
# randomly sample 5% of lines in file
# including header row, exclude blank lines, new seed

time \
awk 'BEGIN  {srand()}
     !/^$/  {if (rand() <= .05 || FNR == 1) print > "data-sample.txt"}' data.txt

# awk  tsv004  3.76s user 1.46s system 91% cpu 5.716 total
This randomly samples approximately 5% of the lines in the file. The law of large numbers will make it close, but since each line is decided independently, there is no way to guarantee it will be exactly 5% of the lines.
seq 1 100 | python3 -c 'print(__import__("random").choice(__import__("sys").stdin.readlines()))'
Just for completeness's sake and because it's available from Arch's community repos: there's also a tool called shuffle , but it doesn't have any command line switches to limit the number of lines and warns in its man page: "Since shuffle reads the input into memory, it may fail on very large files."
# Function to sample N lines randomly from a file
# Parameter $1: Name of the original file
# Parameter $2: N lines to be sampled

rand_line_sampler() {
    N_t=$(awk '{print}' $1 | wc -l)   # Number of total lines

    N_t_m_d=$(( $N_t - $2 - 1 ))      # Number of total lines minus desired number of lines

    N_d_m_1=$(( $2 - 1 ))             # Number of desired lines minus 1

    # vector to have the 0 (fail) with size of N_t_m_d
    echo '0' > vector_0.temp
    for i in $(seq 1 1 $N_t_m_d); do
        echo "0" >> vector_0.temp
    done

    # vector to have the 1 (success) with size of desired number of lines
    echo '1' > vector_1.temp
    for i in $(seq 1 1 $N_d_m_1); do
        echo "1" >> vector_1.temp
    done

    cat vector_1.temp vector_0.temp | shuf > rand_vector.temp

    paste -d" " rand_vector.temp $1 |
        awk '$1 != 0 {$1=""; print}' | sed 's/^ *//' > sampled_file.txt   # file with the sampled lines

    rm vector_0.temp vector_1.temp rand_vector.temp
}

rand_line_sampler "parameter_1" "parameter_2"
In the code below, 'c' is the number of lines to select from the input. Modify as needed:
#!/bin/sh

gawk '
BEGIN          { srand(); c = 5 }
c/NR >= rand() { lines[x++ % c] = $0 }
END            { for (i in lines) print lines[i] }
' "$@"
This does not guarantee that exactly c lines are selected. At best you can say that the average number of lines being selected is c.
That is incorrect: c/NR will be >= 1 (larger than any possible value of rand() ) for the first c lines, thus filling lines[]. x++ % c forces lines[] to c entries, assuming there are at least c lines in the input
Right, c/NR will be guaranteed to be larger than any value produced from rand for the first c lines. After that, it may or may not be larger than rand . Therefore we can say that lines in the end contains at least c entries, and in general more than that, i.e. not exactly c entries. Furthermore, the first c lines of the file are always picked, so the whole selection is not what could be called a random pick.
uh, x++ % c constrains lines[] to indices 0 to c-1. Of course, the first c inputs initially fill lines[], which are replaced in round robin fashion when the random condition is met. A small change (left as an exercise for the reader) could be made to randomly replace entries in lines[], rather than in a round-robin.
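For example, that small change might look like this (a sketch, not part of the original answer): keep the first c lines, then with probability c/NR replace a uniformly random slot, which is the classic reservoir-sampling scheme.

#!/bin/sh

gawk '
BEGIN         { srand(); c = 5 }
NR <= c       { lines[NR-1] = $0; next }        # fill the reservoir with the first c lines
rand() < c/NR { lines[int(rand() * c)] = $0 }   # then replace a uniformly random slot
END           { for (i in lines) print lines[i] }
' "$@"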
How to find lines containing a string in linux [closed]
I have a file in Linux and I would like to display the lines which contain a specific string in that file. How can I do this?
5 Answers
The usual way to do this is with grep , which uses a regex pattern to match lines:
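In its simplest form, with 'pattern' and file standing in for your search string and file name:

grep 'pattern' file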
Each line which matches the pattern will be output. If you want to search for fixed strings only, use grep -F 'pattern' file . fgrep is shorthand for grep -F .
In addition, use grep -rn 'string' /path/ if you want to search for a string recursively in a folder; it also shows the file name and line number of each match.
Besides grep , you can also use other utilities such as awk or sed
Here are a few examples. Let's say you want to search for the string "is" in a file named GPL.
Your sample file
$ cat -n GPL
     1  The GNU General Public License is a free, copyleft license for
     2  The licenses for most software and other practical works are designed
     3  the GNU General Public License is intended to guarantee your freedom to
     4  GNU General Public License for most of our software;
$ grep is GPL
The GNU General Public License is a free, copyleft license for
the GNU General Public License is intended to guarantee your freedom to
$ awk /is/ GPL
The GNU General Public License is a free, copyleft license for
the GNU General Public License is intended to guarantee your freedom to
$ sed -n '/is/p' GPL
The GNU General Public License is a free, copyleft license for
the GNU General Public License is intended to guarantee your freedom to
How can I get a specific line from a file? [duplicate]
I want to extract an exact line from a very big file. For example, line 8000 would be extracted like this:
command -line 8000 > output_line_8000.txt
Many of the methods below are mentioned in this SO Q&A as well: stackoverflow.com/questions/6022384/…
6 Answers
There's already an answer with perl and awk . Here's a sed answer:
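It presumably looks something like this: print line 8000, then quit immediately.

sed -n '8000{p;q}' file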
The advantage of the q command is that sed will quit as soon as the 8000th line is read, unlike the other perl and awk methods (which were changed after common creativity, haha).
A pure Bash possibility (bash≥4):
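Going by the description that follows (and the benchmark at the end of this section), the command is along these lines:

mapfile -s 7999 -n 1 ary < file
printf '%s' "${ary[0]}"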
This will slurp the content of file into the array ary (one line per field), skipping the first 7999 lines ( -s 7999 ) and reading only one line ( -n 1 ).
It's Saturday and I had nothing better to do, so I tested some of these for speed. It turns out that the sed, gawk and perl approaches are basically equivalent. The head & tail one is the slowest but, surprisingly, the fastest by an order of magnitude is the pure bash one:
$ for i in {1..50000000}; do echo "This is line $i" >> file; done
The above creates a file with 50 million lines which occupies 100M.
$ for cmd in "sed -n '8000' file" \ "perl -ne 'print && exit if $. == 8000' file" \ "awk 'FNR==8000 ' file" "head -n 8000 file | tail -n 1" \ "mapfile -s 7999 -n 1 ary < file; printf '%s' \"$\"" \ "tail -n 8001 file | head -n 1"; do echo "$cmd"; for i in ; do (time eval "$cmd") 2>&1 | grep -oP 'real.*?m\K[\d\.]+'; done | awk 'END'; done sed -n '8000' file 0.04502 perl -ne 'print && exit if $. == 8000' file 0.04698 awk 'FNR==8000 ' file 0.04647 head -n 8000 file | tail -n 1 0.06842 mapfile -s 7999 -n 1 ary < file; printf '%s' "This is line 8000 " 0.00137 tail -n 8001 file | head -n 1 0.0033