Slow bash script using grep and sed - regex

I'm trying to speed up my script, which currently takes approx. 30 seconds. I am a novice at bash, and I am sure I am using some bad scripting practices (I found some hints in https://unix.stackexchange.com/a/169765 but still cannot fix my issue).
What I need to do is get data from an external file, and extract numbers into two arrays. My script works fine, except it is too slow.
readData=`cat $myfile`
# readData = [[1491476100000,60204],[1491476130000,59734],...,[1491476160000,60150]]
# I have approximately 5000 points (two numbers in each point)
pointTime=()
pointVal=()
for line in `echo $readData | grep -Po "[0-9]+,[0-9]+"`; do
# Get first number but drop last three zeroes (e.g. 1491476100)
pointTime+=(`echo $line | grep -Po "^[0-9]+" | sed "s/\(.*\)000$/\1/"`)
# Get second number, e.g. 60204
pointVal+=(`echo $line | grep -Po "[0-9]+$"`)
done
Maybe I could use some regex inside a parameter expansion, but I don't know how.

Fast Alternative
Here's how I would write the script:
mapfile -t points < <(grep -Po '\d+,\d+' "$myfile")
pointTime=("${points[#]%000,*}")
pointVal=("${points[#]#*,}")
or even
mapfile -t pointTime < <(grep -Po '\d+(?=000,)' "$myfile")
mapfile -t pointVal < <(grep -Po ',\K\d+' "$myfile")
when you are sure that the file is well-formed.
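As a quick sanity check of the parameter-expansion variant, here is a minimal, self-contained sketch with made-up sample data (both expansions apply to every array element at once):
points=('1491476100000,60204' '1491476130000,59734')
pointTime=("${points[@]%000,*}")   # strip ",value" plus the trailing three zeros
pointVal=("${points[@]#*,}")       # strip "timestamp,"
printf '%s\n' "${pointTime[@]}"    # 1491476100 and 1491476130
printf '%s\n' "${pointVal[@]}"     # 60204 and 59734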
Problems of the old Script
You already identified the main problem: the loop is slow, especially since several external programs are started on every iteration. Nevertheless, here are some hints on how you could have improved your script without throwing away the loop. Some parts were needlessly complicated; for instance,
readData=`cat $myfile`
`echo $readData | grep -Po "[0-9]+,[0-9]+"`
can be written as
grep -Po "[0-9]+,[0-9]+" "$myfile"
and
echo $line | grep -Po "^[0-9]+" | sed "s/\(.*\)000$/\1/"
can be written as
grep -Po "^[0-9]+(?=000)" <<< "$line"
A big speed boost would be to use bash's matching operator =~ instead of grep, because just starting up grep is slow.
[[ "$line" =~ (.*)000,(.*) ]]
pointTime+=("${BASH_REMATCH[1]}")
pointTime+=("${BASH_REMATCH[2]}")

I'm suspicious of the requirement to store the results in an array. You probably actually want to loop over the values in pairs. In any event, storing intermediate values in memory is inelegant and wasteful.
grep -Eo '[0-9]+,[0-9]+' "$myfile" |
while IFS=, read -r first second; do
    process_value_pair "${first%000}" "$second"   # placeholder for whatever processing you need
done
If you insist on storing the values in an array, how to change the body of the loop should be obvious.
pointTime+=("${first%000}")
pointVal+=("$second")

Related

Sed : print all lines after match

I got my search result using sed:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
But it only shows the part that I cut. How can I print all lines after a match ?
I'm using zcat so I cannot use awk.
Thanks.
Edited :
This is my log file :
[01/09/2015 00:00:47] INFO=54646486432154646 from=steve idfrom=55516654455457 to=jone idto=5552045646464 guid=100021623456461451463 num=6 text=hi my number is 0 811 22 1/12 status=new survstatus=new
My aim is to find all users that spam my site with their telephone numbers (using grep "pattern"), then print the whole lines to get all the information about each spammer. The problem is that there may be matches in INFO or id, so I use sed to extract the text first.
Printing all lines after a match in sed:
$ sed -ne '/pattern/,$ p'
# alternatively, if you don't want to print the match:
$ sed -e '1,/pattern/ d'
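Applied to the zcat pipeline from the question, the first form would look like:
zcat file* | sed -ne '/pattern/,$ p'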
Filtering lines when pattern matches between "text=" and "status=" can be done with a simple grep, no need for sed and cut:
$ grep 'text=.*pattern.* status='
You can use awk:
awk '/pattern/,EOF'
N.B. don't be fooled: EOF is just an uninitialized variable, which defaults to 0 (false). So the range's end condition is never satisfied, and awk keeps printing until the end of the file.
Perhaps this could be combined with all the previous answers using awk as well.
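For reference, the more common awk idiom for "print from the first match to the end of file" uses a flag variable instead of the never-ending range; a bare condition prints the line by default:
awk '/pattern/{found=1} found' file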
Maybe this is what you actually want? Find lines matching "pattern" and extract the field after text= up through just before status=?
zcat file* | sed -e '/pattern/s/.*text=\(.*\)status=[^/]*/\1/'
You are not revealing what pattern actually is -- if it's a variable, you cannot use single quotes around it.
Notice that \(.*\)status=[^/]* would match up through survstatus=new in your example. That is probably not what you want? There doesn't seem to be a status= followed by a slash anywhere -- you really should explain in more detail what you are actually trying to accomplish.
Your question title says "all line after a match" so perhaps you want everything after text=? Then that's simply
sed 's/.*text=//'
i.e. replace up through text= with nothing, and keep the rest. (I trust you can figure out how to change the surrounding script into zcat file* | sed '/pattern/s/.*text=//' ... oops, maybe my trust failed.)
The seldom-used branch command will do this for you. Until you match, use n for next, then branch to the beginning. After the match, use n to skip the matching line, then loop, copying the remaining lines.
sed -n ':start
/pattern/b match
n
b start
:match
n
:copy
p
n
b copy' file
Instead of
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
change the last two segments of your pipeline:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | awk '$1 ~ "pattern" {print $0}'

bash regular expression test: if vs grep

I need to scan each line of a file looking for any characters above hex \x7E. The file has several million rows, so improving efficiency would be great. So far, reading each line in a while loop, this works and finds lines with invalid characters:
echo "$line" | grep -P "[\x7F-\xFF]" > /dev/null 2>&1
if [ $? -eq 0 ]; then...
But this doesn't:
if [[ "$line" =~ [\x7F-\xFF] ]]; then...
I'm assuming it would be more efficient the second way, if I could get it to work. What am I missing?
If you're interested in efficiency, you shouldn't write your loop in bash. You should rethink your program in terms of pipes and use efficient tools.
That said, you can do this with
LC_CTYPE=C LC_COLLATE=C
if [[ "$line" =~ [$'\x7f'-$'\xff'] ]]
then
echo "It contains bytes \x7F or up"
fi
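Dropped into the question's read loop, the test might look like this sketch (file is a placeholder name), though it will still be far slower than a single grep over the whole file:
LC_CTYPE=C LC_COLLATE=C
while IFS= read -r line; do
    if [[ $line =~ [$'\x7f'-$'\xff'] ]]; then
        printf 'invalid: %s\n' "$line"
    fi
done < file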
I basically have to split the file. Valid records go to one file, invalid records go to another.
sed -n '/[^\x0-\x7e]/w badrecords
//! w goodrecords'
If you're already using Perl regular expressions, you might as well use perl for the task:
perl -ne '
if (/[\x7F-\xFF]/) {print STDERR $_} else {print}
' file > valid 2> invalid
I'd bet that's faster than a bash loop.
I suspect this would be more efficient, even though it processes the file twice:
grep -P "[\x7F-\xFF]" file > invalid
grep -vP "[\x7F-\xFF]" file > valid
You'd want to write your grep code as
if grep -qP "[\x7F-\xFF]" <<< "$line"; then...

Splitting a line in bash based on delimiter with Sed / Regex

Regex rookie and hoping to change that. I have the following seemingly very simple problem that I cannot figure the correct regex implementation to parse properly. Basically I have a file that has lines that looks like this:
time:3:35PM
I am just trying to cut out all characters up to and including ONLY FIRST ':' delimiter and keep the rest intact with sed so that I can process on many files with same format. What I am trying to get is this:
3:35PM
The below is the closest I got, but it uses the last occurrence of the delimiter instead of the first:
sed 's/.*://'
I have also tried with python but have challenges with applying a python function to iterate through all lines in many files as opposed to just one file.
Any help would be greatly appreciated.
You can do this in just about every text processing tool (many without using regular expressions at all).
ed
If the in-place editing is really important, the canonical correct way is not sed (the stream editor) but ed (the file editor).
ed "$file" << EOF
,s/^[^:]*://g
w
EOF
sed
(Pretty much the same commands as ed, formatted a little differently)
sed 's/^[^:]*://' < "$file" > "$file".new
mv "$file".new "$file"
BASH
This one doesn't cause any new processes to be spawned. (For whatever that's worth.)
while IFS=: read _ time; do
printf '%s\n' "$time"
done < "$file" > "$file".new
mv "$file".new "$file"
awk
awk -F: 'BEGIN{ OFS=":" } { print $2,$3 }' < "$file" > "$file".new
mv "$file".new "$file"
cut
cut -d: -f2- < "$file" > "$file".new
mv "$file".new "$file"
Since you don't need a regular expression to match a single, known character, consider using cut instead of sed.
This simple invocation sets : as the delimiter (-d) and emits fields 2 onwards (-f 2-):
cut -d: -f2-
Example:
% echo 'time:3:35PM' | cut -d: -f2-
3:35PM
kojiro's answer has plenty of great alternatives, but you asked how to do it with a regex. Here are some pure regex solutions:
grep -oP '[^:]*:\K.*' file.txt
\K makes grep forget everything matched before it, so only what follows the first : is printed.
But if the prefix has a fixed length, you can use a lookbehind:
grep -oP '(?<=^time:).*' file.txt
Note that most regex implementations do not support these features. You can use them in grep with the -P flag, and in perl itself. I wonder if any other utility supports these.
To remove everything up to and including the first : you could do:
sed -i.bak 's/^[^:]*://' file.txt
on multiple .txt files
sed -i.bak 's/^[^:]*://' *.txt
The -i option specifies that files are to be edited in place, by creating a temporary file and sending output to that file rather than to the standard output; the .bak suffix makes sed keep a backup of each original file.
Please consider my answer here:
How to use regex with cut at the command line?
You could for example just write:
echo 'time:3:35PM' | cutr -d : -f 2- -r :
In your particular case, you could simply use cut though:
echo 'time:3:35PM' | cut -d : -f 2-
Any feedback welcome. cutr isn't perfect yet, but before I invest too much time into it, I wanted to get some feedback.

Return a regex match in a Bash script, instead of replacing it

I just want to match some text in a Bash script. I've tried using sed but I can't seem to make it just output the match instead of replacing it with something.
echo -E "TestT100String" | sed 's/[0-9]+/dontReplace/g'
Which will output TestTdontReplaceString.
Which isn't what I want, I want it to output 100.
Ideally, it would put all the matches in an array.
edit:
Text input is coming in as a string:
newName()
{
#Get input from function
newNameTXT="$1"
if [[ $newNameTXT ]]; then
: # Use code that I'm working on now, using the $newNameTXT string.
fi
}
You could do this purely in bash using the double square bracket [[ ]] test operator, which stores results in an array called BASH_REMATCH:
[[ "TestT100String" =~ ([0-9]+) ]] && echo "${BASH_REMATCH[1]}"
echo "TestT100String" | sed 's/[^0-9]*\([0-9]\+\).*/\1/'
echo "TestT100String" | grep -o '[0-9]\+'
The method you use to put the results in an array depends somewhat on how the actual data is being retrieved. There's not enough information in your question to be able to guide you well. However, here is one method:
index=0
while read -r line
do
array[index++]=$(echo "$line" | grep -o '[0-9]\+')
done < filename
Here's another way:
array=($(grep -o '[0-9]\+' filename))
Pure Bash. Use parameter substitution (no external processes and pipes):
string="TestT100String"
echo ${string//[^[:digit:]]/}
Removes all non-digits.
I know this is an old topic, but I came here through the same searches and found another great possibility: applying a regex to a string/variable using grep:
# Simple
num=$(echo "TestT100String" | grep -Po "[0-9]{3}")
# More complex, using lookaround
num=$(echo "TestT100String" | grep -Po "(?i)TestT\K[0-9]{3}(?=String)")
Using lookaround capabilities, the search expression can be anchored to its context without including that context in the match: (?i) makes the match case-insensitive, TestT\K requires TestT before the digits but removes it from the reported match (\K resets the start of the match), and (?=String) is a lookahead requiring String after the digits without including it in the match.
https://www.regular-expressions.info/lookaround.html
The given example matches the same digits that the PCRE regex TestT([0-9]{3})String would capture in its group.
Use grep. Sed is an editor. If you only want to match a regexp, grep is more than sufficient.
using awk
linux$ echo -E "TestT100String" | awk '{gsub(/[^0-9]/,"")}1'
100
I don't know why nobody ever uses expr: it's portable and easy.
newName()
{
#Get input from function
newNameTXT="$1"
if num=`expr "$newNameTXT" : '[^0-9]*\([0-9][0-9]*\)'`; then
echo "contains $num"
fi
}
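For example, calling the function above:
newName "TestT100String"   # prints: contains 100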
Well, sed with s/pattern1/pattern2/g just replaces every occurrence of pattern1 with pattern2, globally.
Besides that, sed prints the entire line by default.
I suggest piping the output to a cut command and extracting the numbers you want.
If you are looking to use only sed, then use BRE capture groups:
sed -n 's/.*\([0-9]\)\([0-9]\)\([0-9]\).*/\1,\2,\3/p'
The p flag is needed to print the result, since -n suppresses the default output.
Hope this helped.
using just the bash shell
declare -a array
i=0
while read -r line
do
case "$line" in
*TestT*String* )
while true
do
line=${line#*TestT}
array[$i]=${line%%String*}
line=${line#*String*}
i=$((i+1))
case "$line" in
*TestT*String* ) continue;;
*) break;;
esac
done
esac
done <"file"
echo "${array[@]}"

Improving Shell Script Performance

This shell script is used to extract a line of data from $2 if it contains the pattern $line.
$line is constructed using the regular expression [A-Z0-9.-]+@[A-Z0-9.-]+ (a simple email match), from the lines in file $1.
#! /bin/sh
clear
for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+"`
do
echo `cat "$2" | grep -m 1 "\b$line\b"`
done
File $1 has short lines of data (< 100 chars) and contains approx. 50k lines (approx. 1-1.5 MB).
File $2 has slightly longer lines of text (> 80 and < 200 chars) and has 2M+ lines (approx. 200 MB).
The desktops this is running on have plenty of RAM (6 GB) and Xeon processors with 2-4 cores.
Are there any quick fixes to increase performance? Currently it takes 1-2 hours to run completely (and output to another file).
NB: I'm open to all suggestions, but we're not in a position to completely rewrite the whole system, etc. In addition, the data come from a third party and are prone to random formatting.
Quick suggestions:
Avoid the useless use of cat and change cat X | grep Y to grep Y X.
You can process the grep output as it is produced by piping it rather than using backticks. Using backticks requires the first grep to complete before you can start the second grep.
Thus:
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | while read line; do
grep -m 1 "\b$line\b" "$2"
done
Next step:
Don't process $2 repeatedly. It's huge. You can save up all your patterns and then execute a single grep over the file.
Replace loop with sed.
No more repeated grep:
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\1/g' > patterns
grep -f patterns "$2"
Finally, using some bash fanciness (see man bash → Process Substitution) we can ditch the temporary file and do this in one long line:
grep -f <(grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\b/g') "$2"
That's great unless you have so many patterns grep -f runs out of memory and barfs. If that happens you'll need to run it in batches. Annoying, but doable:
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\1/g' > patterns
while [ -s patterns ]; do
grep -f <(head -n 100 patterns) "$2"
sed -e '1,100d' -i patterns
done
That'll process 100 patterns at a time. The more it can do at once the fewer passes it'll have to make over your 2nd file.
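If the \b word boundaries are not essential, fixed-string matching with grep -F is typically faster still and treats the dots in the addresses literally rather than as regex metacharacters; a sketch under that assumption (sort -u also removes duplicate addresses, shrinking the pattern list):
grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" | sort -u > patterns
grep -F -f patterns "$2"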
The problem is that you are piping through too many shell commands, as well as using cat unnecessarily.
One possible solution using just awk:
awk 'FNR==NR{
# get all email address from file1
for(i=1;i<=NF;i++){
if ( $i ~ /[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+/){
email[$i]
}
}
next
}
{
for(i in email) {
if ($0 ~ i) {
print
}
}
}' file1 file2
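One caveat: $0 ~ i treats each collected address as a regular expression, so its dots match any character. If literal substring matching is intended, index() is safer, and a break stops the inner loop so a line is never printed twice; a sketch of the second block with those changes:
{
for(i in email) {
    if (index($0, i)) {
        print
        break
    }
}
}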
I would take the loop out, since grepping a 2-million-line file 50k times is probably pretty expensive ;)
To take the loop out:
First create a file of all your email addresses with your outer grep command.
Then use this as a pattern file for your secondary grep by using grep -f.
If $1 is a file, don't use "cat | grep". Instead, pass the file directly to grep. It should look like
grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1"
Besides, you may want to adjust your regex. You should at least expect underscores ("_") in email addresses, so
grep -i -o -E "[A-Z0-9._-]+@[A-Z0-9.-]+" "$1"
As John Kugelman has already answered, process the grep output by piping it rather than using backticks. If you are using backticks the whole expression within the backticks will be run first, and then the outer expression will be run with the output from the backticks as arguments.
First of all, this will be a lot slower than necessary, as piping would allow the two programs to run simultaneously (which is really good if they are both CPU-intensive and you have multiple CPUs). However, there is another very important aspect to this: the line
for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+"`
may become too long for the shell to handle. Most shells (to my knowledge at least) limit the length of a command line, or at least the arguments to a command, and I think this could become a problem for the for loop too.