Shell script: how to compare process running time against a threshold?

A Bash script should check whether a certain process has been running for more than a certain number of minutes, and kill it if it has.
I can get the running time by something like
ps -aux | grep ProgramName | grep -v grep | awk '{print $10}'
That gives 9:47.31, for instance. But how do I go further and check whether that is greater than, say, a 10-minute threshold?

Here is the awk one-liner you'll need for your use case (etime= with a trailing = suppresses the header line, so awk only sees actual times):
ps -o etime= -C ProgramName | awk -v MAX=600 '{split($0, a, ":"); if (length(a)==2) sec=a[1]*60+a[2]; else if (length(a)==3) sec=a[1]*3600+a[2]*60+a[3]; if (sec>MAX) print "Elapsed"; else print "Not Elapsed"}'
Also note that ps -o etime -C ProgramName gives you the elapsed time since ProgramName started, so you don't need your more complicated command to get this time.
IMPORTANT: Also remember that for the processes that have been running for more than a day you will get output of ps command as something like 1-21:48:48. I don't have this case covered in my awk command but you can use the same awk's split command as I have shown above.
UPDATE: As per the comment below, use this version for FreeBSD or any other flavor of Unix (eg: Mac) where -C ProgramName option is not available.
ps -o etime=,command= | awk -v MAX=600 '/ProgramName/ && !/awk/ {split($1, a, ":"); if (length(a)==2) sec=a[1]*60+a[2]; else if (length(a)==3) sec=a[1]*3600+a[2]*60+a[3]; if (sec>MAX) print "Elapsed"; else print "Not Elapsed"}'
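The day-prefixed case mentioned above can be handled by splitting on the dash before splitting on the colons. A minimal sketch, assuming a helper called etime_to_seconds (a hypothetical name, not something ps or awk provides) that accepts any value of the form [[DD-]HH:]MM:SS:

```shell
# Hypothetical helper: convert a ps "etime" value ([[DD-]HH:]MM:SS)
# into a number of seconds.
etime_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0
        t = $1
        # Split off the optional "DD-" prefix first.
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        n = split(t, a, ":")
        if (n == 3)      sec = a[1] * 3600 + a[2] * 60 + a[3]
        else if (n == 2) sec = a[1] * 60 + a[2]
        else             sec = a[1]
        print days * 86400 + sec
    }'
}

etime_to_seconds 09:47
etime_to_seconds 1-21:48:48
```

With that in place, a check for a single PID could look like [ "$(etime_to_seconds "$(ps -o etime= -p "$pid")")" -gt 600 ] && kill "$pid", where $pid is assumed to already hold the PID of interest.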

Here is one possible way:
for time in $(ps auxwww | awk 'NR>1 {print $10}');
do
    SEC=$(echo "$time" | cut -d":" -f2);
    MIN=$(echo "$time" | cut -d":" -f1);
    TOTALTIMEINSEC=$(echo "$SEC+$MIN*60" | bc);
    echo "the time in sec is:" $TOTALTIMEINSEC;
done
BTW, you don't need grep -v grep; you can do:
grep '[P]rogramName'
That said, I'd love to see other solutions, because I feel I'm recycling these methods...

First, you can avoid the unnecessary grep -v grep and awk dance with the following instead:
$ ps -o time `pidof ProgramName`
On my Linux machine this gives the time in the format HH:MM:SS. Note that time is the accumulated CPU time; use etime instead if you want the elapsed wall-clock time.
Since pidof ProgramName might return more than one PID, you might handle that with tail -n +2 | head -1 or something similar.
Now to get the duration you can convert the time into seconds:
$ seconds=$(printf "%s * 3600 + %s * 60 + %s\n" $(ps -o time $(pidof ProgramName)|tail -n +2|head -1|sed -e 's/:/ /g')|bc)
Note that the time given by ps -o time might be in this format too: D-HH:MM:SS where D is the number of days.

This will work for cases where your program has run less than a day
THRESH=360
ps auxwww | grep '[P]rocessname' | awk '{print $10}' | sed -e 's/:/ /; s/\.[0-9]*$//' | while read m s; do
    # 10# forces base 10, so values such as 08 are not parsed as octal
    total=$((10#$m * 60 + 10#$s))
    if [ "$total" -gt "$THRESH" ]; then
        echo "${total} seconds total is over threshold of ${THRESH} seconds"
    fi
done
If you want higher thresholds, you'll want to put some more logic around the extraction of process time, but at that point I'd put things into a perl/ruby script and get the information via `ps auxwww`

Related

How can I pipe a bash variable into a GREP regex in the following command?

I have a command that runs fine as follows:
grep ",162$" xyz.txt | grep -v "^162,162$"
It is basically looking for lines that have a pattern like "number,162" except those where the number is 162.
Now, I want to run many such commands where 162 is replaced with other numbers like
grep ",159$" xyz.txt | grep -v "^159,159$"
grep ",111$" xyz.txt | grep -v "^111,111$"
.. and so on where the same number appears in every part of the command everywhere. So, I need a single place to change the value and the rest of the command can take that value wherever needed.
I tried doing something like this to reduce the number of changes I need to make when I want to try another number:
x=162 | grep ",\$x$" xyz.txt | grep -v "^\$x,\$x$"
but it doesn't work. What should I change so that every time I need a different number, I only have to change the value of x at the start of the command?
Whenever you are piping grep to grep, consider using awk:
var=YOUR_NUM
awk -F, -v pat="$var" '$NF==pat && $1!=pat' file
This plays a little with awk's fields: by setting the field separator to the comma, you are then able to treat each part separately.
You want to match those cases in which the last field is the given value and the first one is not. This is possible using $NF for the last field and using $1 for the first one.
Also, you can add an extra layer of control by checking NF (number of fields) and making the condition be something like !($1==pat && NF==2) && $NF==pat so that a line like 162,162,162 would be printed. But this really depends on what you want to do.
For example:
$ cat file
hello
hello,162
162,162
$ val=162
$ awk -F, -v pat="$val" '$NF==pat && $1!=pat' file
hello,162
Then it is just a matter of changing $var with the value you want.
If I understand it correctly, you want something like this:
x=162
grep ",$x\$" xyz.txt | grep -v "^$x,$x\$"
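To keep the number in a single place when trying many values, the same pair of greps can be wrapped in a function. This is only a sketch; lines_ending_in is a hypothetical name, and the sample data is made up for illustration:

```shell
# Hypothetical helper: print the lines of FILE that end in ",NUM",
# except the line that is exactly "NUM,NUM".
lines_ending_in() {
    num=$1
    file=$2
    grep ",${num}\$" "$file" | grep -v "^${num},${num}\$"
}

# Made-up sample data in a temporary file.
tmp=$(mktemp)
printf '%s\n' '7,162' '162,162' '162,7' '9,111' > "$tmp"
lines_ending_in 162 "$tmp"
lines_ending_in 111 "$tmp"
rm -f "$tmp"
```

Each call then only needs the number once, e.g. lines_ending_in 159 xyz.txt.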
Quoting is your friend if you're intending a one-liner:
X=123 /bin/sh -c 'grep ","$X"$" xyz.txt | grep -v "^"$X","$X"$"'
A single grep command can be used with a negative lookbehind:
$ cat xyz.txt
12,12
112,12
1,12
42,12
$ grep -P '(?<!^12),12$' xyz.txt
112,12
1,12
42,12
Passing a variable is trickier:
$ x=12
$ grep -P "(?<!"^$x"),"$x"$" xyz.txt
112,12
1,12
42,12

How to close other shells except current?

I need a simple script to close all other shells/sessions except the one I'm currently logged in to. I'm stuck with this line:
ps -o pid,tty,comm | grep sh$
Which results in selecting the current shells.
For example:
1346 136,0 sh
1355 136,1 sh
I can use the tty command to know my current shell (pts). Then, I think I need a loop.
This is actually not as easy as it might seem at first. The main challenge is that the ps utility is rather incompatible between different platforms, which creates a very significant risk that assumptions you make about ps won't always be correct on systems where you might use the script. And since the task is a rather... dangerous one, you would want to be careful here. Just as an example, the ps on my current system (Cygwin) does not have a -o option, while yours appears to have one.
Anyway, here's my solution:
pidCol=$(ps| head -1| awk '{ for (i = 1; i <= NF; ++i) if ($i == "PID") { print(i); exit; }; };');
if [[ -n "$pidCol" ]]; then
ps| tail -n+2| grep sh$| cut -c2-| awk "{ print(\$$pidCol); };"| grep -v "^$$\$"| xargs kill -9;
fi;
It first gets the column number in ps's output that contains the PID of the process. I tried to make it as robust as possible by parsing the ps header line. So if the PID column position varies between systems, we should still get it correctly for the current system.
Then, I've applied a guard around the kill pipeline to ensure it only runs if we successfully got the $pidCol from the parse command.
Then, in the actual kill pipeline, I strip off the header, grep for all sh processes, cut off the first character (because ps on some systems prints a little character indicator at the beginning of some (but not all) lines that does not get a corresponding column name in the header line), and then use awk to just print the PID column value. Finally, I grep out the current process's PID and run the remaining PIDs through xargs kill -9.
You can make use of $$ here in this ps piped with awk:
ps -o pid,tty,comm | awk -v curr=$$ '$3 ~ /sh/ && $1 != curr'
The variable $$ holds the PID of the current shell.
You can get your current shell with tty and "clean" it to get the data after the 2nd slash like this:
current=$(tty | cut -d/ -f3-)
Then, it is a matter of printing all the results in ps -o pid,tty,comm whose second column does not match your current one... and leaving the header out:
ps -o pid,tty,comm | awk -v current="$current" 'NR>1 && $2!=current {print $1}'
Then, you can loop through this result and kill the given PIDs.
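The filter and the loop could be put together roughly as follows. This is a sketch under assumptions: both function names are hypothetical, the filter reads ps-style "PID TTY COMMAND" lines on stdin so it can be tried on sample text first, and kill -HUP is my choice of signal (interactive shells commonly ignore the default TERM).

```shell
# Hypothetical helper: read "PID TTY COMMAND" lines (with a header line)
# on stdin and print the PIDs whose TTY differs from the one given as $1.
pids_on_other_ttys() {
    awk -v current="$1" 'NR > 1 && $2 != current { print $1 }'
}

# Hypothetical wrapper: signal every listed process on another terminal.
# It is only defined here, not called.
close_other_sessions() {
    current=$(tty | cut -d/ -f3-)
    ps -o pid,tty,comm | pids_on_other_ttys "$current" |
    while read -r pid; do
        kill -HUP "$pid"
    done
}

# Dry run of the filter on sample text instead of real ps output.
printf '%s\n' 'PID TT COMMAND' '1346 pts/0 sh' '1355 pts/1 sh' |
    pids_on_other_ttys pts/0
```

Running the filter on its own first shows exactly which PIDs would be signalled before close_other_sessions is ever called.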
Run the following to get the PID (Process ID) of running sessions:
ps -ft
Use those PIDs to terminate the processes:
kill -TERM <PID1> <PID2> <PID3>
Use a loop in Bash to do this for all sessions except the current one.
Try this one.
pts=$(tty | sed 's|^/dev/||')
current=$(ps -C sh -o pid=,tty= | awk -v t="$pts" '$2 == t {print $1; exit}')
total=$(ps -C sh -o pid=)
for i in $total ; do
    if [[ $i -ne $current ]] ; then
        kill -9 $i
    fi
done
I use this one:
#!/bin/bash
current=$(tty | cut -d/ -f3-)
all=$(ps -A -o tty= | awk -v cur="$current" '$1 ~ /^pts\// && $1 != cur' | sort -u)
for i in $all ; do
    pkill -9 -t $i
done
pgrep -u "$USER" | grep -vxF "$(pgrep -s 0)" | xargs kill
This grabs the PIDs of all of the current user's processes and removes those belonging to the current session. The rest are passed via xargs to kill, which terminates them.

Why is this grep filter slow?

I want to get the first two letters of every word in the BSD dict word list, excluding words that consist of only one letter.
Without the one-letter exclusion it runs extremely fast:
time cat /usr/share/dict/web2 | cut -c 1-2 | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
real 0m0.227s
user 0m0.375s
sys 0m0.021s
grepping on '..', however, is painfully slow:
time cat /usr/share/dict/web2 | cut -c 1-2 | grep '..' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
real 1m16.319s
user 1m0.694s
sys 0m10.225s
What's going on here?
The problem is the UTF-8 Locale, easy workaround for 100x speedup
What's really slow on the Mac is the UTF-8 locale.
Replace grep .. with LC_ALL=C grep .. then your command will run over 100x faster.
This is probably true of Linux as well, except a given Linux distro is probably more likely to default to the C environment.
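For reference, here is the workaround applied to the pipeline from the question, as a small sketch. The function name prefix_counts is hypothetical; only grep is switched to the C locale, the other tools keep the current one.

```shell
# Hypothetical helper: count the upper-cased two-letter prefixes of the
# words in FILE, skipping one-letter words. Only grep runs in the C locale.
prefix_counts() {
    cut -c 1-2 "$1" | LC_ALL=C grep '..' | tr 'a-z' 'A-Z' | uniq -c
}

# Made-up sample word list.
words=$(mktemp)
printf '%s\n' a ab abc ba > "$words"
prefix_counts "$words"
rm -f "$words"
```

On the real list this would be prefix_counts /usr/share/dict/web2 > /dev/null. One thing to keep in mind: in the C locale grep matches bytes rather than characters, so '..' means "at least two bytes", which only matters for words that start with non-ASCII characters.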
I don't know why it is so awful. But I know one quick way to speed it up is to invert your grep(1) expression with -v, and throw away all one-character lines:
$ time cat /usr/share/dict/words | cut -c 1-2 | grep -v '^.$' | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
real 0m0.086s
user 0m0.090s
sys 0m0.000s
This might run a little better and also removes the need for cut and its extra pipe:
grep -E -o '^.{2}' /usr/share/dict/web2 | tr '[a-z]' '[A-Z]' | uniq -c > /dev/null
It might be even faster if you cut down on the excessive pipes and the useless cat:
$ awk '{ a[toupper(substr($0,1,2))]++ } END{for(i in a) print i,a[i] }' file

Find and kill a process in one line using bash and regex

I often need to kill a process during programming.
The way I do it now is:
[~]$ ps aux | grep 'python csp_build.py'
user 5124 1.0 0.3 214588 13852 pts/4 Sl+ 11:19 0:00 python csp_build.py
user 5373 0.0 0.0 8096 960 pts/6 S+ 11:20 0:00 grep python csp_build.py
[~]$ kill 5124
How can I extract the process id automatically and kill it in the same line?
Like this:
[~]$ ps aux | grep 'python csp_build.py' | kill <regex that returns the pid>
In bash, using only the basic tools listed in your question(1), you should be able to do:
kill $(ps aux | grep '[p]ython csp_build.py' | awk '{print $2}')
Details on its workings are as follows:
The ps gives you the list of all the processes.
The grep filters that based on your search string, [p] is a trick to stop you picking up the actual grep process itself.
The awk just gives you the second field of each line, which is the PID.
The $(x) construct means to execute x then take its output and put it on the command line. The output of that ps pipeline inside that construct above is the list of process IDs so you end up with a command like kill 1234 1122 7654.
Here's a transcript showing it in action:
pax> sleep 3600 &
[1] 2225
pax> sleep 3600 &
[2] 2226
pax> sleep 3600 &
[3] 2227
pax> sleep 3600 &
[4] 2228
pax> sleep 3600 &
[5] 2229
pax> kill $(ps aux | grep '[s]leep' | awk '{print $2}')
[5]+ Terminated sleep 3600
[1] Terminated sleep 3600
[2] Terminated sleep 3600
[3]- Terminated sleep 3600
[4]+ Terminated sleep 3600
and you can see it terminating all the sleepers.
Explaining the grep '[p]ython csp_build.py' bit in a bit more detail: when you do sleep 3600 & followed by ps -ef | grep sleep, you tend to get two processes with sleep in it, the sleep 3600 and the grep sleep (because they both have sleep in them, that's not rocket science).
However, ps -ef | grep '[s]leep' won't create a grep process with sleep in it; it instead creates one with the command grep '[s]leep', and here's the tricky bit: the grep doesn't find that one, because it's looking for the regular expression "any character from the character class [s] (which is basically just s) followed by leep".
In other words, it's looking for sleep but the grep process is grep '[s]leep' which doesn't have the text sleep in it.
When I was shown this (by someone here on SO), I immediately started using it because
it's one less process than adding | grep -v grep; and
it's elegant and sneaky, a rare combination :-)
(1) If you're not limited to using those basic tools, there's a nifty pgrep command which will find processes based on certain criteria (assuming you have it available on your system, of course).
For example, you can use pgrep sleep to output the process IDs for all sleep commands (by default, it matches the process name). If you want to match the entire command line as shown in ps, you can do something like pgrep -f 'sleep 9999'.
As an aside, it doesn't list itself if you do pgrep pgrep, so the tricky filter method shown above is not necessary in this case.
You can check that the processes are the ones you're interested in by using -a to show the full process names. You can also limit the scope to your own processes (or a specific set of users) with -u or -U. See the man page for pgrep/pkill for more options.
Once you're satisfied it will only show the processes you're interested in, you can then use pkill with the same parameters to send a signal to all those processes.
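The "look first, then kill with the same parameters" workflow could be sketched like this. The function name preview_and_kill is hypothetical, and the -a listing flag assumes the procps-ng pgrep found on most Linux systems (on other systems -l may be the closest equivalent).

```shell
# Hypothetical helper: show what a pattern matches, then signal it.
# Usage: preview_and_kill PATTERN [SIGNAL]
preview_and_kill() {
    pattern=$1
    signal=${2:-TERM}
    # -f matches against the full command line; -a (procps-ng) prints it.
    if pgrep -a -f -- "$pattern"; then
        printf 'Send SIG%s to the processes above? [y/N] ' "$signal"
        read -r answer
        # pkill gets exactly the same pattern and -f flag as pgrep did.
        [ "$answer" = y ] && pkill "-$signal" -f -- "$pattern"
    else
        echo "No process matches '$pattern'" >&2
        return 1
    fi
}
```

For example, preview_and_kill 'python csp_build.py' lists the matches and only sends TERM after you answer y.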
If you have pkill:
pkill -f csp_build.py
If you only want to grep against the process name (instead of the full argument list) then leave off -f.
One liner:
ps aux | grep -i csp_build | awk '{print $2}' | xargs sudo kill -9
Print out column 2: awk '{print $2}'
sudo is optional
Run kill -9 5124, kill -9 5373 etc (kill -15 is more graceful but slightly slower)
Bonus:
I also have 2 shortcut functions defined in my .bash_profile
(~/.bash_profile is for osx, you have to see what works for your *nix machine).
p keyword
lists out all Processes containing keyword
usage e.g: p csp_build , p python etc
bash_profile code:
# FIND PROCESS
function p(){
    ps aux | grep -i "$1" | grep -v grep
}
ka keyword
Kills All processes that have this keyword
usage e.g: ka csp_build , ka python etc
optional kill level e.g: ka csp_build 15, ka python 9
bash_profile code:
# KILL ALL
function ka(){
    cnt=$( p "$1" | wc -l) # total count of processes found
    klevel=${2:-15} # kill level, defaults to 15 if argument 2 is empty
    echo -e "\nSearching for '$1' -- Found" $cnt "Running Processes .. "
    p "$1"
    echo -e '\nTerminating' $cnt 'processes .. '
    ps aux | grep -i "$1" | grep -v grep | awk '{print $2}' | xargs sudo kill -"$klevel"
    echo -e "Done!\n"
    echo "Running search again:"
    p "$1"
    echo -e "\n"
}
killall -r regexp
-r, --regexp
Interpret process name pattern as an extended regular expression.
This will return the pid only
pgrep -f 'process_name'
So to kill any process in one line:
kill -9 $(pgrep -f 'process_name')
or, if you know the exact name of the process you can also try pidof:
kill -9 $(pidof 'process_name')
But, if you do not know the exact name of the process, pgrep would be better.
If there are multiple processes running with the same name and you want to kill only the first one:
kill -9 $(pgrep -f 'process_name' | head -1)
Also note that if you are worried about case sensitivity, you can add the -i option, just as with grep. For example:
kill -9 $(pgrep -fi chrome)
For more information about signals and pgrep, see man 7 signal (or man signal) and man pgrep.
Try using
ps aux | grep 'python csp_build.py' | head -1 | cut -d " " -f 2 | xargs kill
You can use just pkill '^python' to kill processes by regex.
If you want to see what you are going to kill before killing it, use pgrep -l '^python', where -l also prints the process name. If you don't want to use pkill, use:
pgrep '^python' | xargs kill
Use pgrep - available on many platforms:
kill -9 `pgrep -f csp_build`
pgrep -f returns the PIDs of all processes whose command line matches "csp_build".
The solution is to filter the processes with an exact pattern, parse out the PIDs, and construct an argument list for kill:
ps -ef | grep -e <serviceNameA> -e <serviceNameB> -e <serviceNameC> |
awk '{print $2}' | xargs sudo kill -9
Explanation from the documentation:
ps utility displays a header line, followed by lines containing
information about all of your processes that have controlling terminals.
-e Display information about other users' processes, including those
-f Display the uid, pid, parent pid, recent CPU usage, process start
The grep utility searches any given input files, selecting lines that
-e pattern, --regexp=pattern
Specify a pattern used during the search of the input: an input
line is selected if it matches any of the specified patterns.
This option is most useful when multiple -e options are used to
specify multiple patterns, or when a pattern begins with a dash
(`-').
xargs - construct argument list(s) and execute utility
kill - terminate or signal a process
number 9 signal - KILL (non-catchable, non-ignorable kill)
Example:
ps -ef | grep -e node -e loggerUploadService.sh -e applicationService.js |
awk '{print $2}' | xargs sudo kill -9
You can do it with awk and backticks:
ps auxf |grep 'python csp_build.py'|`awk '{ print "kill " $2 }'`
$2 in awk prints column 2, and the backticks run the statements that are printed.
But a much cleaner solution would be for the python process to store its process ID in /var/run, so that you can simply read that file and kill it.
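A rough sketch of the PID-file idea, with loud assumptions: a temporary directory stands in for /var/run, sleep 300 stands in for the real program, the PID is written by the launching shell rather than by the program itself, and kill_from_pidfile is a hypothetical helper name.

```shell
# Placeholders for illustration: a temp dir instead of /var/run and
# "sleep 300" instead of the real long-running program.
piddir=$(mktemp -d)
pidfile=$piddir/csp_build.pid

sleep 300 &
echo "$!" > "$pidfile"    # $! is the PID of the last background job

# Hypothetical helper: kill the process recorded in a PID file.
kill_from_pidfile() {
    pid=$(cat "$1") || return 1
    # kill -0 sends no signal; it only checks that the process exists
    # and that we are allowed to signal it.
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid"
    fi
    rm -f "$1"
}

kill_from_pidfile "$pidfile"
rmdir "$piddir"
```

The usual caveat with PID files is staleness: if the program died without removing its file, the recorded PID may since have been reused by an unrelated process, so a real script would want to verify the command name as well.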
My task was to kill everything matching a regexp that had been started from a specific directory (after Selenium tests, not everything got stopped). This worked for me:
for i in `ps aux | egrep "firefox|chrome|selenium|opera"|grep "/home/dir1/dir2"|awk '{print $2}'|uniq`; do kill $i; done
To kill process by keyword midori, for example:
kill -SIGTERM $(pgrep -i midori)
Lots of good answers here; I used the answer accepted by the OP. Just adding a small caveat about pkill and pgrep: as you might see from their manual pages, some OSes have a 15-character limit on the process name. The -f option gets around that on my OS, but I was in trouble until I found that option!
ps -o uid,pid,cmd|awk '{if($1=="username" && $3=="your command") print $2}'|xargs kill -15
A method using only awk (and ps):
ps aux | awk '$11" "$12 == "python csp_build.py" { system("kill " $2) }'
By using string equality testing I prevent matching this process itself.
Give -f to pkill
pkill -f /usr/local/bin/fritzcap.py
exact path of .py file is
# ps ax | grep fritzcap.py
3076 pts/1 Sl 0:00 python -u /usr/local/bin/fritzcap.py -c -d -m
I started using something like this:
kill $(pgrep -f 'python csp_build.py')
You don't need the user switch for ps.
kill `ps ax | grep 'python csp_build.py' | awk '{print $1}'`
In some cases, I'd like to kill several processes simultaneously, like this:
➜ ~ sleep 1000 &
[1] 25410
➜ ~ sleep 1000 &
[2] 25415
➜ ~ sleep 1000 &
[3] 25421
➜ ~ pidof sleep
25421 25415 25410
➜ ~ kill `pidof sleep`
[2] - 25415 terminated sleep 1000
[1] - 25410 terminated sleep 1000
[3] + 25421 terminated sleep 1000
But I think this is not quite appropriate in your case. (There may be python a, python b, python x... running in the background.)
Find and kill all the processes in one line in bash.
kill -9 $(ps -ef | grep '<exe_name>' | grep -v 'grep' | awk '{print $2}')
ps -ef | grep '<exe_name>' - Lists the details (user, PID, etc.) of the running processes that match the pattern. The list also includes the grep command doing the search, so before killing we need to filter it out.
ps -ef | grep '<exe_name>' | grep -v 'grep' - Adding another grep with -v 'grep' removes the grep process itself.
Then awk extracts the process ID alone.
Finally, wrapping the command in $(...) passes the PIDs to kill, which kills all matching processes.
If pkill -f csp_build.py doesn't kill the process, you can add -9 to send a KILL signal, which cannot be ignored, i.e. pkill -9 -f csp_build.py
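Rather than reaching for -9 straight away, one option is to send TERM first and escalate only if the process is still around after a grace period. A minimal sketch for a single PID; term_then_kill is a hypothetical name and the 5-second default is an arbitrary assumption:

```shell
# Hypothetical helper: ask a process to exit, and force it only if needed.
# Usage: term_then_kill PID [SECONDS_TO_WAIT]
term_then_kill() {
    pid=$1
    patience=${2:-5}
    # If TERM cannot be delivered, the process is gone (or not ours).
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$patience" -gt 0 ]; do
        # kill -0 sends nothing; it fails once the process no longer exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        patience=$((patience - 1))
    done
    # Still there after the grace period: KILL cannot be caught or ignored.
    kill -KILL "$pid" 2>/dev/null || true
}
```

It could be combined with pgrep, e.g. for p in $(pgrep -f csp_build.py); do term_then_kill "$p"; done, which gives each process a chance to clean up before it is forced.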
Using -C flag of ps command
-C cmdlist
Select by command name. This selects the processes whose
executable name is given in cmdlist.
1st case, simple command
So if you run your script via a standard shebang, calling it by its name:
/path/to/csp_build.py
You may find it with
ps -C csp_build.py
So
kill $(ps -C csp_build.py ho pid)
may be enough.
2nd case, search for cmd
A little more robust, but still a lot quicker than most other answers to this SO question...
If you don't know how it is run, or if you run it by
python csp_build.py
python3 csp_build.py
python /path/to/csp_build.py
You may find it by running:
ps -C python,python3,csp_build.py who pid,cmd | grep csp_build.py
Then using sed:
kill $(ps -C python,python3,csp_build.py who pid,cmd |
sed -ne '/csp_build.py/s/^ *\([0-9]\+\) .*$/\1/p')
Based on the answer at https://stackoverflow.com/a/3510879/15603477, with a minor optimization.
ps aux | grep 'python csp_build.py' | head -1 | tr -s ' ' | cut -d " " -f 2 | xargs kill
tr -s ' ' squeezes runs of multiple spaces (if any) into a single space.
In case you encounter Operation not permitted, see https://unix.stackexchange.com/questions/89316/how-to-kill-a-process-that-says-operation-not-permitted-when-attempted
Killing our own processes started from a common PPID is needed quite frequently; pkill with the -P flag is a winner for me. Using @ghostdog74's example:
# sleep 30 &
[1] 68849
# sleep 30 &
[2] 68879
# sleep 30 &
[3] 68897
# sleep 30 &
[4] 68900
# pkill -P $$
[1] Terminated sleep 30
[2] Terminated sleep 30
[3]- Terminated sleep 30
[4]+ Terminated sleep 30
I use this to kill Firefox when it's being script slammed and cpu bashing :)
Replace 'Firefox' with the app you want to die. I'm on the Bash shell - OS X 10.9.3 Darwin.
kill -Hup $(ps ux | grep Firefox | awk 'NR == 1 {next} {print $2}' | uniq | sort)
I use gkill processname, where gkill is the following script:
cnt=$(ps aux | grep "$1" | grep -v "grep" -c)
if [ "$cnt" -gt 0 ]
then
    echo "Found $cnt processes - killing them"
    ps aux | grep "$1" | grep -v "grep" | awk '{print $2}' | xargs kill
else
    echo "No processes found"
fi
NOTE: it will NOT kill processes that have "grep" in their command lines.
If you want to do it mostly within awk, try:
for i in $(jot 5); do
(python3 -c 'import sys; [ print(_) for _ in sys.stdin ]' ) & done;
sleep 1; ps aux | {m,g}awk '
/[p]ython/ {
_=(_)" "$2
} END {
system("echo \47 kill "(_)" \47")
system( "kill -9 " _) }'
[302] 14236
[303] 14237
[304] 14238
[305] 14239
[306] 14240
[303] + suspended (tty input) ( python3 -c 'import sys; [ print(_) for _ in sys.stdin ]'; )
[305] + suspended (tty input) ( python3 -c 'import sys; [ print(_) for _ in sys.stdin ]'; )
[304] + suspended (tty input) ( python3 -c 'import sys; [ print(_) for _ in sys.stdin ]'; )
[302] + suspended (tty input) ( python3 -c 'import sys; [ print(_) for _ in sys.stdin ]'; )
[306] + suspended (tty input) ( python3 -c 'import sys; [ print(_) for _ in sys.stdin ]'; )
kill 14239 14237 14236 14240 14238
[305] killed ( python3 -c 'import sys; [ print(_) for _ in sys.stdin ]'; )
[303] killed ( python3 -c 'import sys; [ print(_) for _ in sys.stdin ]'; )
[306] + killed ( python3 -c 'import sys; [ print(_) for _ in sys.stdin ]'; )
[304] - killed ( python3 -c 'import sys; [ print(_) for _ in sys.stdin ]'; )
[302] + killed ( python3 -c 'import sys; [ print(_) for _ in sys.stdin ]'; )
The following command will come in handy:
kill $(ps -elf | grep <process_regex> | awk '{print $4}')
e.g.:
ps -elf | grep top
0 T ubuntu 6558 6535 0 80 0 - 4001 signal 11:32 pts/1 00:00:00 top
0 S ubuntu 6562 6535 0 80 0 - 2939 pipe_w 11:33 pts/1 00:00:00 grep --color=auto top
kill $(ps -elf | grep top | awk '{print $4}')
-bash: kill: (6572) - No such process
[1]+ Killed top
If the process is still stuck, use the -9 option to hard-kill it, as follows:
kill -9 $(ps -elf | grep top | awk '{print $4}')
Hope that helps...!
I don't like killing things based purely on a blind result from grep - what if I mistakenly match more than desired?
I know this is going to get downvoted by command line purists, but I prefer an interactive filter for this case, such as pick (apt-get install pick). With this kind of tool the filtered result is displayed as you type, so you can see exactly what will get killed when you hit enter.
Thus the one-liner would become
function killpick { ps ax | pick -q "$1" | awk '{print $1}' | xargs kill -9; }
killpick by itself gives a chooser with incremental filtering, with the optional argument giving a starting string for the filter.
For basic bash versions
kill $(pidof <my_process>)

Improving Shell Script Performance

This shell script is used to extract a line of data from $2 if it contains the pattern $line.
$line is constructed using the regular expression [A-Z0-9.-]+#[A-Z0-9.-]+ (a simple email match), from the lines in file $1.
#! /bin/sh
clear
for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+"`
do
echo `cat "$2" | grep -m 1 "\b$line\b"`
done
File $1 has short lines of data (< 100 chars) and contains approx. 50k lines (approx. 1-1.5 MB).
File $2 has slightly longer lines of text (> 80 to < 200) and has 2M+ lines (approx. 200MB).
The desktops this runs on have plenty of RAM (6 GB) and Xeon processors with 2-4 cores.
Are there any quick fixes to increase performance? Currently it takes 1-2 hours to run completely (and output to another file).
NB: I'm open to all suggestions, but we're not in a position to completely rewrite the whole system etc. In addition, the data comes from a third party and is prone to random formatting.
Quick suggestions:
Avoid the useless use of cat and change cat X | grep Y to grep Y X.
You can process the grep output as it is produced by piping it rather than using backticks. Using backticks requires the first grep to complete before you can start the second grep.
Thus:
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | while read line; do
grep -m 1 "\b$line\b" "$2"
done
Next step:
Don't process $2 repeatedly. It's huge. You can save up all your patterns and then execute a single grep over the file.
Replace loop with sed.
No more repeated grep:
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\b/g' > patterns
grep -f patterns "$2"
Finally, using some bash fanciness (see man bash → Process Substitution) we can ditch the temporary file and do this in one long line:
grep -f <(grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\b/g') "$2"
That's great unless you have so many patterns grep -f runs out of memory and barfs. If that happens you'll need to run it in batches. Annoying, but doable:
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\b/g' > patterns
while [ -s patterns ]; do
grep -f <(head -n 100 patterns) "$2"
sed -e '1,100d' -i patterns
done
That'll process 100 patterns at a time. The more it can do at once the fewer passes it'll have to make over your 2nd file.
The problem is that you are piping too many shell commands, as well as making unnecessary use of cat.
One possible solution using just awk:
awk 'FNR==NR{
    # get all email address from file1
    for(i=1;i<=NF;i++){
        if ( $i ~ /[a-zA-Z0-9.-]+#[a-zA-Z0-9.-]+/){
            email[$i]
        }
    }
    next
}
{
    for(i in email) {
        if ($0 ~ i) {
            print
        }
    }
}' file1 file2
I would take the loop out, since grepping a 2 million line file 50k times is probably pretty expensive ;)
To allow for you to take the loop out
First create a file of all your Email Addresses with your outer grep command.
Then use this as a pattern file to do your secondary grep by using grep -f
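Those two steps might look roughly like this. It is a sketch under assumptions: lines_with_addresses is a hypothetical name, the regex is copied as written in the question, and -F/-w are my substitutions for the \b anchors so that the addresses are treated as fixed strings.

```shell
# Hypothetical helper: print every line of BIG that contains, as a whole
# word, an address extracted from SMALL.
lines_with_addresses() {
    small=$1
    big=$2
    patterns=$(mktemp)
    # Step 1: collect the addresses once, de-duplicated.
    grep -i -o -E '[A-Z0-9.-]+#[A-Z0-9.-]+' "$small" | sort -u > "$patterns"
    # Step 2: a single pass over the big file.
    # -F: fixed strings, so "." is not a wildcard; -w: whole-word match.
    grep -F -w -f "$patterns" "$big"
    rm -f "$patterns"
}
```

Called as lines_with_addresses "$1" "$2" from the original script. Note the behaviour is not identical to the loop: this prints every matching line of the big file once, in file order, whereas the original printed the first match for each address in turn.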
If $1 is a file, don't use "cat | grep". Instead, pass the file directly to grep. It should look like
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1"
Besides, you may want to adjust your regex. You should at least expect the underscore ("_") in an email address, so
grep -i -o -E "[A-Z0-9._-]+#[A-Z0-9.-]+" "$1"
As John Kugelman has already answered, process the grep output by piping it rather than using backticks. If you are using backticks the whole expression within the backticks will be run first, and then the outer expression will be run with the output from the backticks as arguments.
First of all, this will be a lot slower than necessary, as piping would allow the two programs to run simultaneously (which is really good if they are both CPU intensive and you have multiple CPUs). However, there is another very important aspect to this: the line
for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+"`
may become too long for the shell to handle. Most shells (to my knowledge at least) limit the length of a command line, or at least the arguments to a command, and I think this could become a problem for the for loop too.