Improving Shell Script Performance - regex

This shell script extracts a line of data from $2 if it contains the pattern $line.
$line is built from the lines in file $1 using the regular expression [A-Z0-9.-]+@[A-Z0-9.-]+ (a simple email match).
#! /bin/sh
clear
for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+"`
do
    echo `cat "$2" | grep -m 1 "\b$line\b"`
done
File $1 has short lines of data (< 100 chars) and contains approx. 50k lines (approx. 1-1.5 MB).
File $2 has slightly longer lines of text (80-200 chars) and has 2M+ lines (approx. 200 MB).
The desktops this runs on have plenty of RAM (6 GB) and Xeon processors with 2-4 cores.
Are there any quick fixes to increase performance? Currently it takes 1-2 hours to run completely (and output to another file).
NB: I'm open to all suggestions, but we're not in a position to completely rewrite the whole system. In addition, the data comes from a third party and is prone to random formatting.

Quick suggestions:
Avoid the useless use of cat and change cat X | grep Y to grep Y X.
You can process the grep output as it is produced by piping it rather than using backticks. Using backticks requires the first grep to complete before you can start the second grep.
Thus:
grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" | while read line; do
    grep -m 1 "\b$line\b" "$2"
done
Next step:
Don't process $2 repeatedly. It's huge. You can save up all your patterns and then execute a single grep over the file.
Replace the loop with a pattern file: use sed to wrap each extracted address in word-boundary anchors, then run a single grep -f over the big file.
No more repeated greps:
grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\b/g' > patterns
grep -f patterns "$2"
Finally, using some bash fanciness (see man bash → Process Substitution) we can ditch the temporary file and do this in one long line:
grep -f <(grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\b/g') "$2"
That's great unless you have so many patterns grep -f runs out of memory and barfs. If that happens you'll need to run it in batches. Annoying, but doable:
grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\b/g' > patterns
while [ -s patterns ]; do
    grep -f <(head -n 100 patterns) "$2"
    sed -e '1,100d' -i patterns
done
That'll process 100 patterns at a time. The more it can do at once the fewer passes it'll have to make over your 2nd file.
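A further tweak (my suggestion, not part of the original answer): the sed step exists only to wrap each address in word-boundary anchors, and grep's -w flag enforces whole-word matches on its own, so for typical addresses you can feed grep the raw extracted list, deduplicated with sort -u to shrink it:
# build a deduplicated pattern file; -w below supplies the word-boundary behaviour
grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" | sort -u > patterns
# one pass over the big file, whole-word matches only
grep -w -f patterns "$2"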

The problem is that you are piping too many shell commands, as well as making unnecessary use of cat.
One possible solution using just awk:
awk 'FNR==NR{
    # get all email addresses from file1
    for(i=1;i<=NF;i++){
        if ( $i ~ /[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+/){
            email[$i]    # referencing the element is enough to create the key
        }
    }
    next
}
{
    # for each line of file2, print it if it contains any collected address
    for(i in email) {
        if ($0 ~ i) {
            print
        }
    }
}' file1 file2

I would take the loop out, since grepping a 2-million-line file 50k times is probably pretty expensive ;)
To take the loop out:
First create a file of all your email addresses with your outer grep command.
Then use this as a pattern file for the secondary grep by using grep -f, as sketched below.
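A minimal sketch of that two-step approach (my illustration; emails.txt is a scratch file name I've made up):
# step 1: pull every address out of the small file
grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" > emails.txt
# step 2: one grep over the big file, treating each extracted line as a pattern
grep -f emails.txt "$2"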

If $1 is a file, don't use "cat | grep". Instead, pass the file directly to grep. Should look like
grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1"
Besides, you may want to adjust your regex. You should at least expect the underscore ("_") in an email address, so
grep -i -o -E "[A-Z0-9._-]+@[A-Z0-9.-]+" "$1"

As John Kugelman has already answered, process the grep output by piping it rather than using backticks. If you are using backticks the whole expression within the backticks will be run first, and then the outer expression will be run with the output from the backticks as arguments.
First of all, this is a lot slower than necessary, as piping would allow the two programs to run simultaneously (which is really good if they are both CPU intensive and you have multiple CPUs). However, there is another very important aspect to this: the line
for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+"`
may become too long for the shell to handle. Most shells (to my knowledge at least) limit the length of a command line, or at least the arguments to a command, and I think this could become a problem for the for loop too.
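For reference (my addition, not from the original answer), the kernel's limit on exec arguments, which is the limit usually cited here, can be inspected with getconf:
getconf ARG_MAX   # e.g. 2097152 bytes on many Linux systems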

Related

How to pass regular expression matching string from a file in awk?

I have a requirement where I have to split a large file into small files. Each line of the large file containing the matching string should be put into another file with the output file name same as the matching string. For one string I can get it done via awk as shown below.
awk '/apple/{print}' large_file.txt > apple.txt
I want a script which takes the regular expression matching string from another file and puts the results into a file with the same name as the matching string. How to get it done with awk command?
Let's say the string to be matched is put into a file called matching_string.txt the contents of which would look like this:
apple
orange
mango
If the large_file.txt is something like:
apple is a great fruit
we should eat apple
orange is juicy
mango is the king of fruits
litchi is a seasonal fruit
then the resulting file should be
apple.txt:
apple is a great fruit
we should eat apple
orange.txt:
orange is juicy
mango.txt:
mango is the king of fruits
I am new to the Linux environment and beginner level at scripting. Any other solution using regular expression, sed, python etc. should be also okay.
EDIT
Working Script:
I tweaked my script a little based on the answer by @Stephen Quan; it works for the tcsh shell.
#!/bin/tcsh -f
foreach word ("`cat pattern.txt`")
    if (-r ${word}.txt) then
        rm -rf ${word}.txt
    endif
    awk "/${word}/ { print }" large.txt > ${word}.txt
end
Why use awk? Grep does the job too. Usually, awk '/pattern/{print}' can be replaced by the shorter grep -e 'pattern'.
pattern=apple
grep -e "$pattern" large.txt > "$pattern.txt"
Write a script or a shell function. For instance, a simple shell function can be defined ad-hoc and then called.
filter() { grep -e "$1" large.txt > "$1.txt"; }
for pattern in apple orange mango; do filter "$pattern"; done
As a shell script (e.g. filter.sh):
#!/bin/sh
grep -e "$1" large.txt > "$1.txt"
Needless to say, the script file must have the executable bit set, otherwise it cannot be executed.
Assuming your pattern file (e.g. pattern.txt) contains one pattern per line:
#!/bin/sh
while IFS= read -r pattern <&3; do
filter "$pattern"
# or: ./filter.sh "$pattern"
done 3< pattern.txt
All of that can be done without script or function if you simply want a one-shot task to be done (but defining and using the function is not really more complicated than calling its body directly):
while IFS= read -r pattern <&3; do
grep -e "$pattern" large.txt > "$pattern.txt"
done 3< pattern.txt
Note that a for loop cannot be used here, since your program will break as soon as one of your patterns contains space or tab characters.
To do this in awk:
for word in $(cat matching_string.txt)
do
awk "/${word}/ { print }" large_file.txt > ${word}.txt
done
while IFS= read -r word
do
    if [ -f ${word}.txt ]; then rm ${word}.txt; fi
    awk "/${word}/ { print }" large_file.txt > ${word}.txt
done < matching_string.txt
An awk rule is a regex pattern followed by an action. Note that when you get into regex capture groups, you may find that the implementation of awk varies from one platform to another.
Unless it is a simplistic regex, I prefer perl, because in cross-platform environments (particularly OS X and git-bash on Windows) perl has a more consistent implementation for regex handling. In this case, the perl solution would be:
while IFS= read -r word
do
    if [ -f ${word}.txt ]; then rm ${word}.txt; fi
    perl -ne "if (/${word}/) { print }" < large_file.txt > ${word}.txt
done < matching_string.txt
I also wanted to demonstrate capture groups. In this case, it is a bit over-engineered to represent your line as 3 capture groups (prefix, word, postfix), but I do this because it serves as a template for creating more complex regex capture-group processing scenarios:
while IFS= read -r word
do
    if [ -f ${word}.txt ]; then rm ${word}.txt; fi
    perl -ne "if (/(.*)(${word})(.*)/) { print \"\$1\$2\$3\n\" }" < large_file.txt > ${word}.txt
done < matching_string.txt
Use grep -e pattern:
pattern=orange
grep -e "$pattern" large.txt > "$pattern.txt"
Then use the read command to read all patterns and generate all files:
filename='patternfile.txt'
while read pattern; do
grep -e "$pattern" large.txt > "$pattern.txt"
done < $filename

Slow bash script using grep and sed

I'm trying to speed up my script, which currently takes approx 30 seconds. I am a novice to bash, and I am sure I am using some bad scripting practice (found some hint in https://unix.stackexchange.com/a/169765 but still cannot fix my issue).
What I need to do is get data from an external file, and extract numbers into two arrays. My script works fine, except it is too slow.
readData=`cat $myfile`
# readData = [[1491476100000,60204],[1491476130000,59734],...,[1491476160000,60150]]
# I have approximately 5000 points (two numbers in each point)
pointTime=()
pointVal=()
for line in `echo $readData | grep -Po "[0-9]+,[0-9]+"`; do
    # Get first number but drop last three zeroes (e.g. 1491476100)
    pointTime+=(`echo $line | grep -Po "^[0-9]+" | sed "s/\(.*\)000$/\1/"`)
    # Get second number, e.g. 60204
    pointVal+=(`echo $line | grep -Po "[0-9]+$"`)
done
Maybe I could use some regex inside a parameter expansion, but I don't know how.
Fast Alternative
Here's how I would write the script:
mapfile -t points < <(grep -Po '\d+,\d+' "$myfile")
pointTime=("${points[#]%000,*}")
pointVal=("${points[#]#*,}")
or even
mapfile -t pointTime < <(grep -Po '\d+(?=000,)' "$myfile")
mapfile -t pointVal < <(grep -Po ',\K\d+' "$myfile")
when you are sure that the file is well-formed.
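To sanity-check the result (my illustration; assumes bash 4+ for mapfile):
echo "parsed ${#pointTime[@]} timestamps and ${#pointVal[@]} values"
printf 'first point: %s -> %s\n' "${pointTime[0]}" "${pointVal[0]}"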
Problems of the old Script
You already identified the main problem: the loop is slow, especially since a lot of programs are called inside the loop. Nevertheless, here are some hints on how you could have improved your script without throwing away the loop. Some parts were needlessly complicated, for instance
readData=`cat $myfile`
`echo $readData | grep -Po "[0-9]+,[0-9]+"`
can be written as
grep -Po "[0-9]+,[0-9]+" "$myfile"
and
echo $line | grep -Po "^[0-9]+" | sed "s/\(.*\)000$/\1/"
can be written as
grep -Po "^[0-9]+(?=000)" <<< "$line"
A big speed boost would be to use bash's matching operator =~ instead of grep, because just starting up grep is slow.
[[ "$line" =~ (.*)000,(.*) ]]
pointTime+=("${BASH_REMATCH[1]}")
pointTime+=("${BASH_REMATCH[2]}")
I'm suspicious of the requirement to store the results in an array. You probably actually want to loop over the values in pairs. In any event, storing intermediate values in memory is inelegant and wasteful.
grep -Eo '[0-9]+,[0-9]+' "$myfile" |
while IFS=, read -r first second; do
    process_value_pair "${first%000}" "$second"  # placeholder for whatever you do with each pair
done
If you insist on storing the values in an array, how to change the body of the loop should be obvious.
pointTime+=("${first%000}")
pointVal+=("$second")

Splitting a line in bash based on delimiter with Sed / Regex

Regex rookie and hoping to change that. I have the following seemingly very simple problem that I cannot figure out the correct regex to parse properly. Basically I have a file that has lines that look like this:
time:3:35PM
I am just trying to cut out all characters up to and including ONLY the FIRST ':' delimiter and keep the rest intact with sed, so that I can process many files with the same format. What I am trying to get is this:
3:35PM
The below is the closest I got, but it uses the last occurrence of the delimiter instead of the first:
sed 's/.*://'
I have also tried with python but have challenges with applying a python function to iterate through all lines in many files as opposed to just one file.
Any help would be greatly appreciated.
You can do this in just about every text processing tool (many without using regular expressions at all).
ed
If the in-place editing is really important, the canonical correct way is not sed (the stream editor) but ed (the file editor).
ed "$file" << EOF
,s/^[^:]*://g
w
EOF
sed
(Pretty much the same commands as ed, formatted a little differently)
sed 's/^[^:]*://' < "$file" > "$file".new
mv "$file".new "$file"
BASH
This one doesn't cause any new processes to be spawned. (For whatever that's worth.)
while IFS=: read _ time; do
printf '%s\n' "$time"
done < "$file" > "$file".new
mv "$file".new "$file"
awk
awk -F: 'BEGIN{ OFS=":" } { print $2,$3 }' < "$file" > "$file".new
mv "$file".new "$file"
cut
cut -d: -f2- < "$file" > "$file".new
mv "$file".new "$file"
Since you don't need a regular expression to match a single, known character, consider using cut instead of sed.
This simple expression sets : as the d-elimiter and emits f-ields 2 onwards (-):
cut -d: -f2-
Example:
% echo 'time:3:35PM' | cut -d: -f2-
3:35PM
kojiro's answer has plenty of great alternatives, but you asked how to do it with a regex. Here are some pure regex solutions:
grep -oP '[^:]*:\K.*' file.txt
\K makes the match forget everything before the \K, so only the part after it is printed.
But if you know the exact prefix length, you can use the lookbehind feature:
grep -oP '(?<=^time:).*' file.txt
Note that most regex implementations do not support these features. You can use them in grep with the -P flag, and in perl itself. I wonder if any other utility supports them.
To remove everything up to and including the first : on each line, you could do:
sed -i.bak 's/^[^:]*://' file.txt
on multiple .txt files
sed -i.bak 's/^[^:]*://' *.txt
The -i option specifies that files are to be edited in place: sed creates a temporary file and sends output to it rather than to standard output.
Please consider my answer here:
How to use regex with cut at the command line?
You could for example just write:
echo 'time:3:35PM' | cutr -d : -f 2- -r :
In your particular case, you could simply use cut though:
echo 'time:3:35PM' | cut -d : -f 2-
Any feedback welcome. cutr isn't perfect yet, but before I invest too much time into it, I wanted to get some feedback.

Perl regex where pattern is output from linux command

I have a linux command statistics -o -u i1,1,1 which returns
max count[0]:=31
max count:=31
I would like to pluck out the number 31 in my perl script. I can do it from the command line using awk piped to head
statistics -o -u i1,1,1 | awk -F':=' '{print $2}' | head -n1
or similarly using grep
statistics -o -u i1,1,1 | grep -Po '(?<=max count:=)\d+'
or sed...
How can I do similar within a perl script?
EDIT
Essentially, I would like to replace a backtick system call inside perl code with a pure perl solution.
You can emulate the awk:
perl -F":=" -lane 'print $F[1]'
Or you can emulate the grep:
perl -nle 'print /(?<=max count:=)(\d+)/'
They do not work in the same way, in that the first one will give output for any line that contains := followed by something.
The -n switch allows for reading of stdin or files, -l handles newlines and -F sets the delimiter for autosplit -a.
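Since the awk variant needed head -n1 to stop after the first match, the perl emulation can exit on its own (my tweak, not from the original answer):
statistics -o -u i1,1,1 | perl -nle 'if (/max count:=(\d+)/) { print $1; exit }'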
Update:
According to your comment, it seems what you want is to replace a system call with pure perl code:
my $variable = `statistics -o -u i1,1,1 | grep -Po '(?<=max count:=)\d+'`;
The statistics command is unknown to me, so I do not know of a pure perl way to replace it, though something might exist on cpan. You can save yourself one process by processing the output in perl though. Something like this should work:
my @lines = grep /max count:=/, qx(statistics -o -u i1,1,1);
my ($num) = $lines[0] =~ /max count:=(\d+)/;
The qx() operator works exactly the same way as backticks, I just use it as a personal preference.

Match two strings in one line with grep

I am trying to use grep to match lines that contain two different strings. I have tried the following but this matches lines that contain either string1 or string2 which not what I want.
grep 'string1\|string2' filename
So how do I match with grep only the lines that contain both strings?
You can use
grep 'string1' filename | grep 'string2'
Or
grep 'string1.*string2\|string2.*string1' filename
I think this is what you were looking for:
grep -E "string1|string2" filename
I think that answers like this:
grep 'string1.*string2\|string2.*string1' filename
only match the case where both are present, not one or the other or both.
To search for files containing all the words in any order anywhere:
grep -ril 'action' | xargs grep -il 'model' | xargs grep -il 'view_type'
The first grep kicks off a recursive search (r), ignoring case (i) and listing (printing out) the names of the matching files (l) for one term ('action', in single quotes) occurring anywhere in the file.
The subsequent greps search for the other terms, retaining case insensitivity and listing out the matching files.
The final list of files you get will be the ones that contain these terms, in any order, anywhere in the file.
If you have a grep with a -P option for a limited perl regex, you can use
grep -P '(?=.*string1)(?=.*string2)'
which has the advantage of working with overlapping strings. It's somewhat more straightforward using perl as grep, because you can specify the and logic more directly:
perl -ne 'print if /string1/ && /string2/'
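For instance (my illustration, reusing the overlapping-strings example that comes up later in this thread):
$ echo 'theatre' | grep -P '(?=.*the)(?=.*heat)'
theatre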
Your method was almost good, only missing the -w
grep -w 'string1\|string2' filename
You could try something like this:
(pattern1.*pattern2|pattern2.*pattern1)
The | operator in a regular expression means or. That is to say either string1 or string2 will match. You could do:
grep 'string1' filename | grep 'string2'
which will pipe the results from the first command into the second grep. That should give you only lines that match both.
And since people suggested perl and python and convoluted shell scripts, here is a simple awk approach:
awk '/string1/ && /string2/' filename
Having looked at the comments to the accepted answer: no, this doesn't do multi-line; but then that's also not what the author of the question asked for.
Don't try to use grep for this, use awk instead. To match 2 regexps R1 and R2 in grep you'd think it would be:
grep -E 'R1.*R2|R2.*R1'
while in awk it'd be:
awk '/R1/ && /R2/'
but what if R2 overlaps with or is a subset of R1? That grep command simply would not work, while the awk command would. Let's say you want to find lines that contain the and heat:
$ echo 'theatre' | grep -E 'the.*heat|heat.*the'
$ echo 'theatre' | awk '/the/ && /heat/'
theatre
You'd have to use 2 greps and a pipe for that:
$ echo 'theatre' | grep 'the' | grep 'heat'
theatre
and of course if you had actually required them to be separate you can always write in awk the same regexp as you used in grep and there are alternative awk solutions that don't involve repeating the regexps in every possible sequence.
Putting that aside, what if you wanted to extend your solution to match 3 regexps R1, R2, and R3. In grep that'd be one of these poor choices:
grep -E 'R1.*R2.*R3|R1.*R3.*R2|R2.*R1.*R3|R2.*R3.*R1|R3.*R1.*R2|R3.*R2.*R1' file
grep R1 file | grep R2 | grep R3
while in awk it'd be the concise, obvious, simple, efficient:
awk '/R1/ && /R2/ && /R3/'
Now, what if you actually wanted to match literal strings S1 and S2 instead of regexps R1 and R2? You simply can't do that in one call to grep; you have to either write code to escape all RE metachars before calling grep:
E1=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<< "$S1")
E2=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<< "$S2")
grep -E "$E1.*$E2|$E2.*$E1" file
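To see what the escaping buys you (my illustration): a literal string like a.b comes out as [a][.][b], so the dot can no longer match any character:
$ sed 's/[^^]/[&]/g; s/\^/\\^/g' <<< 'a.b'
[a][.][b]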
or again use 2 greps and a pipe:
grep -F 'S1' file | grep -F 'S2'
which again are poor choices whereas with awk you simply use a string operator instead of regexp operator:
awk 'index($0,S1) && index($0,S2)'
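In a real invocation you'd pass the strings in from the shell with -v (my sketch; the variable names are illustrative):
awk -v s1='string1' -v s2='string2' 'index($0,s1) && index($0,s2)' file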
Now, what if you wanted to match 2 regexps in a paragraph rather than a line? Can't be done in grep, trivial in awk:
awk -v RS='' '/R1/ && /R2/'
How about across a whole file? Again can't be done in grep and trivial in awk (this time I'm using GNU awk for multi-char RS for conciseness but it's not much more code in any awk or you can pick a control-char you know won't be in the input for the RS to do the same):
awk -v RS='^$' '/R1/ && /R2/'
So - if you want to find multiple regexps or strings in a line or paragraph or file then don't use grep, use awk.
git grep
Here is the syntax using git grep with multiple patterns:
git grep --all-match --no-index -l -e string1 -e string2 -e string3 file
You may also combine patterns with Boolean expressions such as --and, --or and --not.
Check man git-grep for help.
--all-match When giving multiple pattern expressions, this flag is specified to limit the match to files that have lines to match all of them.
--no-index Search files in the current directory that is not managed by Git.
-l/--files-with-matches/--name-only Show only the names of files.
-e The next parameter is the pattern. Default is to use basic regexp.
Other params to consider:
--threads Number of grep worker threads to use.
-q/--quiet/--silent Do not output matched lines; exit with status 0 when there is a match.
To change the pattern type, you may also use -G/--basic-regexp (default), -F/--fixed-strings, -E/--extended-regexp, -P/--perl-regexp, -f file, and other.
Related:
How to grep for two words existing on the same line?
Check if all of multiple strings or regexes exist in a file
How to run grep with multiple AND patterns? & Match all patterns from file at once
For OR operation, see:
How do I grep for multiple patterns with pattern having a pipe character?
Grep: how to add an “OR” condition?
To find lines that start with six spaces and end with either a .c, .cpp, .h, .log or .out file name, or with a number of 5 to 9 digits:
grep \
    -e '^      .*\(\.c$\|\.cpp$\|\.h$\|\.log$\|\.out$\)' \
    -e '^      .*[0-9]\{5,9\}$' \
    my_file.txt > nolog.txt
Let's say we need to find count of multiple words in a file testfile.
There are two ways to go about it
1) Use grep command with regex matching pattern
grep -c '\<\(DOG\|CAT\)\>' testfile
2) Use egrep command
egrep -c 'DOG|CAT' testfile
With egrep you need not worry about the expression; just separate the words with a pipe separator.
grep 'string1\|string2' FILENAME
GNU grep version 3.1
Place the strings you want to grep for into a file
echo who > find.txt
echo Roger >> find.txt
echo '[44][0-9]{9,}' >> find.txt
Then search using -f
grep -f find.txt BIG_FILE_TO_SEARCH.txt
grep -E 'string1.*string2|string2.*string1' filename
will get lines with string1 and string2 in either order.
For a multiline match:
echo -e "test1\ntest2\ntest3" | tr -d '\n' | grep "test1.*test3"
or
echo -e "test1\ntest5\ntest3" > tst.txt
cat tst.txt | tr -d '\n' | grep "test1.*test3\|test3.*test1"
We just need to remove the newline characters and it works!
You should have grep like this:
$ grep 'string1' file | grep 'string2'
I often run into the same problem as yours, and I just wrote a piece of script:
function m() { # m means 'multi pattern grep'
    function _usage() {
        echo "usage: COMMAND [-inH] -p<pattern1> -p<pattern2> <filename>"
        echo "-i : ignore case"
        echo "-n : show line number"
        echo "-H : show filename"
        echo "-h : show header"
        echo "-p : specify pattern"
    }
    declare -a patterns
    # it is important to declare OPTIND as local
    local ignorecase_flag filename linum header_flag colon result OPTIND
    while getopts "iHhnp:" opt; do
        case $opt in
            i)
                ignorecase_flag=true ;;
            H)
                filename="FILENAME," ;;
            n)
                linum="NR," ;;
            p)
                patterns+=( "$OPTARG" ) ;;
            h)
                header_flag=true ;;
            \?)
                _usage
                return ;;
        esac
    done
    if [[ -n $filename || -n $linum ]]; then
        colon="\":\","
    fi
    shift $(( $OPTIND - 1 ))
    if [[ $ignorecase_flag == true ]]; then
        for s in "${patterns[@]}"; do
            result+=" && s~/${s,,}/"
        done
        result=${result# && }
        result="{s=tolower(\$0)} $result"
    else
        for s in "${patterns[@]}"; do
            result="$result && /$s/"
        done
        result=${result# && }
    fi
    result+=" { print "$filename$linum$colon"\$0 }"
    if [[ ! -t 0 ]]; then # pipe case
        cat - | awk "${result}"
    else
        for f in "$@"; do
            [[ $header_flag == true ]] && echo "########## $f ##########"
            awk "${result}" "$f"
        done
    fi
}
Usage:
echo "a b c" | m -p A
echo "a b c" | m -i -p A # a b c
You can put it in .bashrc if you like.
grep -i -w 'string1\|string2' filename
This works for exact word matching and matches case-insensitive words; that is what -i is for.
When the two strings appear in a fixed order, put a wildcard pattern between them in the grep command:
$ grep -E "string1.*string2" file
Example if the following lines are contained in a file named Dockerfile:
FROM python:3.8 as build-python
FROM python:3.8-slim
To get the line that contains both strings FROM python and as build-python, use:
$ grep -E "FROM python:.* as build-python" Dockerfile
Then the output will show only the line that contain both strings:
FROM python:3.8 as build-python
If git is initialized and added to the branch then it is better to use git grep because it is super fast and it will search inside the whole directory.
git grep 'string1.*string2.*string3'
To search for two strings and highlight only string1 and string2:
grep -E 'string1.*string2|string2.*string1' filename | grep -E 'string1|string2'
or
grep 'string1.*string2\|string2.*string1' filename | grep 'string1\|string2'
ripgrep
Here is the example using rg:
rg -N '(?P<p1>.*string1.*)(?P<p2>.*string2.*)' file.txt
It's one of the quickest grepping tools, since it's built on top of Rust's regex engine which uses finite automata, SIMD and aggressive literal optimizations to make searching very fast.
Use it especially when you're working with large data.
See also related feature request at GH-875.
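Note that the named-group pattern above only matches when string1 precedes string2. For order-independent matching, rg's optional PCRE2 engine supports the same lookahead trick shown earlier (assuming your rg build was compiled with PCRE2 support):
rg -N --pcre2 '(?=.*string1)(?=.*string2)' file.txt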