Why is this grep filter slow?

I want to get the first two letters of every word in the BSD dict word list, excluding words that are only one letter long.
Without the one-letter exclusion it runs extremely fast:
time cat /usr/share/dict/web2 | cut -c 1-2 | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
real 0m0.227s
user 0m0.375s
sys 0m0.021s
grepping on '..', however, is painfully slow:
time cat /usr/share/dict/web2 | cut -c 1-2 | grep '..' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
real 1m16.319s
user 1m0.694s
sys 0m10.225s
What's going on here?

The problem is the UTF-8 locale; there is an easy workaround for a 100x speedup.
What's really slow on the Mac is the UTF-8 locale.
Replace grep .. with LC_ALL=C grep .. and your command will run over 100x faster.
This is probably true on Linux as well, although a given Linux distro may be more likely to default to the C locale.
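As a minimal sketch of the workaround above, the pipeline from the question can be wrapped in a function with only grep forced into the C locale. The function name first_two_counts is made up for illustration, and the small sample input stands in for /usr/share/dict/web2; the actual speedup will depend on the system and the grep version.

```shell
# Hypothetical helper: reads words on stdin and prints counts of their
# upper-cased two-letter prefixes, skipping one-letter words.
first_two_counts() {
    cut -c 1-2 |
        LC_ALL=C grep '..' |   # C locale: grep matches bytes, no multibyte decoding
        tr 'a-z' 'A-Z' |
        uniq -c
}

# Small stand-in for the dictionary file.
printf 'a\nab\nabc\nAbd\nb\n' | first_two_counts
```

In the C locale a dot matches a byte rather than a character, so a line holding a single multibyte character would pass the filter. For an ASCII word list this makes no difference.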

I don't know why it is so awful. But I know one quick way to speed it up is to invert your grep(1) expression with -v, and throw away all one-character lines:
$ time cat /usr/share/dict/words | cut -c 1-2 | grep -v '^.$' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
real 0m0.086s
user 0m0.090s
sys 0m0.000s

This might run a little better and would also get rid of your cut needing another pipe.
cat /usr/share/dict/web2 | egrep -o '^.{2,}' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null

It might even be faster if you cut down on the excessive pipes and the useless cat:
$ awk '{ a[toupper(substr($0,1,2))]++ } END{for(i in a) print i,a[i] }' file
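The awk one-liner above counts every line, including one-letter words, and prints in whatever order awk iterates the array. A possible variant that adds the one-letter exclusion from the question is sketched below; the function name prefix_counts and the sample input are assumptions for illustration only.

```shell
# Hypothetical helper: counts upper-cased two-letter prefixes of the words on
# stdin, ignoring lines shorter than two characters.
prefix_counts() {
    awk 'length($0) >= 2 { count[toupper(substr($0, 1, 2))]++ }
         END { for (p in count) print count[p], p }' |
        sort -k2    # for-in order is unspecified in awk, so sort by prefix
}

printf 'a\nab\nabc\nAbd\nb\nba\n' | prefix_counts
```

Unlike uniq -c, this totals all occurrences of a prefix rather than adjacent runs, so the two approaches only agree when equal prefixes are already grouped together in the input.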

Related

grep command to find out how many times any character is followed by '.'

I have to find out how often any character is followed by a period (.) using grep. After counting how many times each character is followed by a period, I have to sort the result in ascending order.
For example in this string: "Find my input. Output should be obtained. You need to find output."
The output should be something like this:
d 1
t 2
What I have done so far :
cat filename | grep -o "*." | sort -u
But it is not working as intended.
Any ideas how to solve this? I have to perform this operation on a huge library of books in .txt files.

An iterative approach with GNU grep:
grep -o '.\.' filename | sort | uniq -c
Output:
1 d.
2 t.
grep -Po '.(?=\.)' filename | sort | uniq -c
Output:
1 d
2 t
grep -Po '.(?=\.)' filename | sort | uniq -c | awk '{print $2,$1}'
Output:
d 1
t 2

With single GNU awk process:
awk -v FPAT='.[.]' 'BEGIN{ PROCINFO["sorted_in"]="#ind_str_asc" }
{ for(i=1;i<=NF;i++) a[substr($i,1,1)]++ }
END{ for(i in a) print i,a[i] }' filename
The output:
d 1
t 2

This one works too, as long as every count is a single digit (rev also reverses the digits of larger counts):
echo "Find my input. Output should be obtained. You need to find output."| grep -o ".\." | sort | uniq -c | rev | tr -d .
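The question also asks for the result in ascending order of count, which the answers above leave to the alphabetical order of sort. A minimal sketch that sorts by count, using only grep without PCRE, might look like this. The function name chars_before_period is hypothetical, and the sketch assumes the character before the period is not whitespace (awk would otherwise misread the columns).

```shell
# Hypothetical helper: reads text on stdin and prints "char count" pairs for
# every character that is directly followed by a period, lowest count first.
chars_before_period() {
    grep -o '.\.' |      # every "character + period" pair, one per line
        cut -c 1 |       # keep only the character before the period
        sort | uniq -c | # count each distinct character
        sort -n |        # ascending by count
        awk '{ print $2, $1 }'
}

printf '%s\n' 'Find my input. Output should be obtained. You need to find output.' |
    chars_before_period
```

For a directory of books, the same function could be fed with cat *.txt so that the counts are totalled across all files.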

uniq treats lines as equal when they are not

I would expect different output from this command:
$ echo -e "あいうえお\nオエウイア" | uniq -c
2 あいうえお
The two lines are not the same.
Compare to this example, working as expected:
$ echo -e "aiueo\noeuia" | uniq -c
1 aiueo
1 oeuia
Is this a Unicode or UTF-8 issue? I did not find any option to support "exotic" characters.
Edit: I am experiencing a similar problem when using sort with Japanese input. Input of the form a\nb\na\nb\n (or, omitting '\n', abab) stays that way; I would expect it to be aabb or at least bbaa.

There you go - echo -e "あいうえお\nオエウイア" | uni2ascii -q | uniq -c | ascii2uni
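A possible alternative that needs no extra tools, offered as a sketch rather than a confirmed diagnosis: if the cause is the locale's collation rules treating the two strings as equal (an assumption here, though it would also explain the sort behaviour), forcing the C locale makes uniq compare raw bytes. The function name uniq_bytes is made up for illustration.

```shell
# Hypothetical helper: uniq -c with byte-wise comparison instead of the
# current locale's collation rules.
uniq_bytes() {
    LC_ALL=C uniq -c
}

printf 'あいうえお\nオエウイア\n' | uniq_bytes
```

The same LC_ALL=C prefix can be put in front of sort, at the cost of getting byte order rather than a language-aware order.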

Grep regexp (linux) for extracting two words and storing them in variables [duplicate]

This question already has answers here:
Parsing JSON with Unix tools
(45 answers)
Closed 8 years ago.
I need your help/pointers on extracting a couple of words through regex. I have a line that is stored in a file (shown below). I need to extract the values of two fields (time and interface) and store them in variables for further calculations.
{"record-type":"int-stats","time":1389309548046925,"host-id":"a.b.c.d","interface":"ab-0/0/44","latency":111223}
So the values of time and interface need to be stored in two different variables.

Assuming that you are looking for "pure" shell scripts and not Perl, Python, or other programs that are generally not bundled with the OS, here is something you could do:
#!/bin/sh
JFILE=a.json
TIME=$(egrep -o '"time":[0-9]+' $JFILE | cut -d: -f2)
IFACE=$(egrep -o '"interface":"[a-z0-9/\-]+"' $JFILE | cut -d: -f2 | sed -e 's/"//g')
echo "time = $TIME"
echo "interface = $IFACE"

If you can use awk, then maybe this could help:
$ string='{"record-type":"int-stats","time":1389309548046925,"host-id":"a.b.c.d","interface":"ab-0/0/44","latency":111223}'
$ time=$(awk -F[:,] '{ print $4 }' <<< "$string")
$ interface=$(awk -F[:,] '{ gsub(/\"/,"");print $8 }' <<< "$string")
$ echo "$time"
1389309548046925
$ echo "$interface"
ab-0/0/44
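The awk answer above depends on the position of each field, so it breaks if the keys appear in a different order. A sketch that looks the values up by key name with sed is shown below. The helper name json_value is hypothetical, and it only copes with top-level numbers and strings that contain no escaped quotes; a real JSON parser would be more robust.

```shell
line='{"record-type":"int-stats","time":1389309548046925,"host-id":"a.b.c.d","interface":"ab-0/0/44","latency":111223}'

# Hypothetical helper: json_value KEY LINE
# Prints the value stored under "KEY" in a single-line JSON object.
json_value() {
    printf '%s\n' "$2" |
        # Skip to "KEY":, allow one opening quote, then capture everything
        # up to the next quote, comma, or closing brace.
        sed -n 's/.*"'"$1"'":"\{0,1\}\([^",}]*\).*/\1/p'
}

time_val=$(json_value time "$line")
iface=$(json_value interface "$line")

echo "time = $time_val"
echo "interface = $iface"
```

The variable is called time_val rather than time only to avoid confusion with the shell's time keyword.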

You can make use of arrays (this needs bash, since plain sh has no arrays), e.g.:
#!/bin/bash
JFILE=a.json
# Collect every matching value in the file into an array
TIME=(`egrep -o '"time":[0-9]+' $JFILE | cut -d: -f2 | tr '\n' ' '`)
IFACE=(`egrep -o '"interface":"[a-z0-9/\-]+"' $JFILE | cut -d: -f2 | sed -e 's/"//g' | tr '\n' ' '`)
i=0
# "${TIME[@]}" expands to all elements of the array
for each in "${TIME[@]}"
do
echo "TIME[$i] = $each"
let i++
done
See "Arrays in unix shell?" for more about arrays.

Regex: How can I extract strings from "*" to "*"

I used Sysinternals Strings to output all strings from a memory dump. I need to extract all strings from * to *.
Between the two * are domains or elements of domains (Target list of a trojan).
*/cmserver/*
*/pub/html/*
*arabi-online.net/efs/servlet/efs/*
*ibanking.*.com.au/InternetBanking/*
I tried this, but I have problems with the $ character:
cat strings.txt | grep -o '\*[^"]*' | egrep "[a-zA-Z0-9\-\.\/]{4}\*$" | sort -u

If your grep supports PCRE, this should be easy:
grep -Po "(?<=\*)(.*)(?=\*)" strings.txt
Input:
$ cat strings.txt
*/cmserver/*
*/pub/html/*
*arabi-online.net/efs/servlet/efs/*
*ibanking.*.com.au/InternetBanking/*
Output:
$ grep -Po "(?<=\*)(.*)(?=\*)" strings.txt
/cmserver/
/pub/html/
arabi-online.net/efs/servlet/efs/
ibanking.*.com.au/InternetBanking/

Using sed it is easier:
sed 's/^\*\|\*$//g' strings.txt
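The \| alternation in the sed command above is a GNU extension to basic regular expressions, so it may not work with other sed implementations. A portable sketch that does the same stripping with two separate expressions is shown below; the function name strip_stars is made up for illustration.

```shell
# Hypothetical helper: removes one leading and one trailing asterisk from
# each line on stdin, leaving any asterisks in the middle untouched.
strip_stars() {
    sed -e 's/^\*//' -e 's/\*$//'
}

printf '%s\n' '*/cmserver/*' '*ibanking.*.com.au/InternetBanking/*' | strip_stars
```

Like the original sed command, this passes lines without asterisks through unchanged, so it may need a preceding grep '^\*' when the input contains other strings.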

cat strings.txt | grep "^\*" | grep "[A-Za-z0-9\-\+\.\/]\{4\}\*.$" | sort -u works the best for me!

shell script: how to compare process running time against a threshold?

A Bash script should check whether a certain process has been running for more than a certain number of minutes, and kill it if it has.
I can get the running time by something like
ps -aux | grep ProgramName | grep -v grep | awk '{print $10}'
That gives 9:47.31 for instance. But where do I go further and check if that is greater than, say 10 minutes threshold?

Here is the awk one-liner you'll need for your use case:
ps -o etime -C ProgramName | awk -v MAX=600 '{split($0, a, ":"); if (length(a)==2) sec=a[1]*60+a[2]; else if (length(a)==3) sec=a[1]*3600+a[2]*60+a[3]; if (sec>MAX) print "Elapsed"; else print "Not Elapsed"}'
Also note that ps -o etime -C ProgramName gives you the elapsed time since ProgramName started, so you don't need your overly complicated command to get this time.
IMPORTANT: Also remember that for the processes that have been running for more than a day you will get output of ps command as something like 1-21:48:48. I don't have this case covered in my awk command but you can use the same awk's split command as I have shown above.
UPDATE: As per the comment below, use this version for FreeBSD or any other flavor of Unix (eg: Mac) where -C ProgramName option is not available.
ps -o etime=,command= | awk -v MAX=600 '/ProgramName/ && !/awk/ {split($1, a, ":"); if (length(a)==2) sec=a[1]*60+a[2]; else if (length(a)==3) sec=a[1]*3600+a[2]*60+a[3]; if (sec>MAX) print "Elapsed"; else print "Not Elapsed"}'
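The day-prefixed format mentioned above (for example 1-21:48:48) can be handled with one more split on the dash. The following is a sketch under that assumption about the etime format; the function name etime_to_seconds is hypothetical.

```shell
# Hypothetical helper: converts a ps etime value ([[DD-]HH:]MM:SS) given as
# the first argument into a number of seconds.
etime_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0
        # Split off an optional "DD-" prefix.
        n = split($1, dparts, "-")
        if (n == 2) { days = dparts[1]; clock = dparts[2] }
        else        { clock = dparts[1] }
        # The remainder is HH:MM:SS or MM:SS.
        m = split(clock, t, ":")
        if (m == 3)      sec = t[1] * 3600 + t[2] * 60 + t[3]
        else if (m == 2) sec = t[1] * 60 + t[2]
        else             sec = t[1]
        print days * 86400 + sec
    }'
}

etime_to_seconds "09:47"
etime_to_seconds "1-21:48:48"
```

The result can then be compared against a threshold with an ordinary integer test, for instance [ "$(etime_to_seconds "$elapsed")" -gt 600 ], where elapsed holds the etime value read from ps.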

Here is one possible way:
for time in `ps auxwww | awk '{print $10}'`;
do
SEC=`echo $time | cut -d":" -f2`;
MIN=`echo $time | cut -d":" -f1`;
TOTALTIMEINSEC=`echo $SEC+$MIN*60 | bc`;
echo "the time in sec is:" $TOTALTIMEINSEC; done
BTW, you don't need to grep -v grep, you can do:
grep [P]rogramName
That said, I'd love to see other solutions, because I feel I'm recycling these methods...

First, you can avoid the unnecessary grep -v grep and awk dance with the following instead:
$ ps -o time `pidof ProgramName`
On my Linux machine this seems to give the time in the format HH:MM:SS.
Taking into consideration that pidof ProgName might give more than one value you might handle that with tail -n +2|head -1 or something like that.
Now to get the duration you can convert the time into seconds:
$ seconds=$(printf "%d * 3600 + %d * 60 + %d\n" $(ps -o time $(pidof ProgramName)|tail -n +2|head -1|sed -e 's/:/ /g')|bc)
Note that the time given by ps -o time might be in this format too: D-HH:MM:SS where D is the number of days.

This will work for cases where your program has run less than a day
THRESH=360
ps auxwww | grep [P]rocessname | awk '{print $10}' | sed -e 's/:/ /; s/\.[0-9]*$//' | while read m s; do
let total=${m}*60+${s}
if [ $total -gt $THRESH ]; then
echo "${total} seconds total is over threshold of ${THRESH} seconds"
fi
done
If you want higher thresholds, you'll want to put some more logic around the extraction of process time, but at that point I'd put things into a perl/ruby script and get the information via `ps auxwww`
