Extracting specific values from a text file with long lines - regex

I'm trying to get all "CP" values from a log file like below:
2013-06-27 17:00:00,017 INFO - [AlertSchedulerThread18] [2013-06-27 16:59:59, 813] -- SN: 989333333333 ||DN: 989333333333 ||CategoryId: 4687 ||CGID: null||Processing started ||Billing started||Billing Process: 97 msec ||Response code: 2001 ||Package id: 4387 ||TransactionId: 66651372336199820989389553437483742||CDR:26 msec||CDR insertion: 135 msec||Successfully inserted in CDR Table||CP:53 msec||PROC - 9 msec||Successfully executed procedure call.||Billing Ended||197 msec ||Processing ended
2013-06-27 17:00:00,018 INFO - [AlertSchedulerThread62] [2013-06-27 16:59:59, 824] -- SN: 989333333333 ||DN: 989333333333 ||CategoryId: 3241 ||CGID: null||Processing started ||Billing started||Billing Process: 61 msec ||Response code: 2001 ||Package id: 2861 ||TransactionId: 666513723361998319893580191324005184||CDR:25 msec||CDR insertion: 103 msec||Successfully inserted in CDR Table||CP:59 msec||PROC - 24 msec||Successfully executed procedure call.||Billing Ended||187 msec ||Processing ended
2013-06-27 17:00:00,028 INFO - [AlertSchedulerThread29] [2013-06-27 16:59:59, 903] -- SN: 989333333333 ||DN: 989333333333 ||CategoryId: 4527 ||CGID: null||Processing started ||Billing started||Billing Process: 47 msec ||Response code: 2001 ||Package id: 4227 ||TransactionId: 666513723361999169893616006323701572||CDR:22 msec||CDR insertion: 83 msec||Successfully inserted in CDR Table||CP:21 msec||PROC - 7 msec||Successfully executed procedure call.||Billing Ended||112 msec ||Processing ended
...getting output like this:
CP:53 msec
CP:59 msec
CP:21 msec
How can I do this using awk?

cut is always good and fast for these things (this assumes the CP field is wrapped in asterisks in the real file, as in ||**CP:53 msec**||):
$ cut -d"*" -f3 file
CP:53 msec
CP:59 msec
CP:21 msec
Anyway, these awk ways can make it:
$ awk -F"|" '{print $27}' file | sed 's/*//g'
CP:53 msec
CP:59 msec
CP:21 msec
or
$ awk -F"\|\|" '{print $14}' file | sed 's/*//g'
CP:53 msec
CP:59 msec
CP:21 msec
Or also
$ awk -F"*" '{print $3}' file
CP:53 msec
CP:59 msec
CP:21 msec
In each case, we set the field separator to a specific character (| or *) so that the line is split on it, and then we print the field that holds the CP value. Where needed, sed removes the leftover * characters.
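If the position of the CP field is not guaranteed, a variation on the same idea is to split on || and scan every field for the one that starts with CP:. This is only a sketch: the function name extract_cp and the shortened sample lines are assumptions made for illustration and are not part of the original log.

```shell
#!/usr/bin/env bash
# extract_cp: hypothetical helper that prints the CP field of each line,
# wherever it appears, with any surrounding * characters removed.
extract_cp() {
    awk -F'[|][|]' '{
        for (i = 1; i <= NF; i++) {
            field = $i
            gsub(/\*/, "", field)            # drop * markers if present
            if (field ~ /^CP:/) print field  # keep only the CP field
        }
    }' "$@"
}

# Shortened sample lines (assumed; real lines carry many more fields).
cp_values=$(printf '%s\n' \
    'SN: 1 ||CDR:26 msec||**CP:53 msec**||PROC - 9 msec' \
    'SN: 2 ||CP:59 msec||CDR:25 msec||PROC - 24 msec' | extract_cp)
echo "$cp_values"
```

Called with a file name (extract_cp file), the same function reads the log instead of standard input.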

How about a hilarious sed command?
sed -n 's/.*\*\*\(.*\)\*\*.*/\1/p'

$ awk -F'[|][|]' '{print $14}' file
**CP:53 msec**
**CP:59 msec**
**CP:21 msec**
If you REALLY have '*'s in the input, just tweak to remove them:
$ awk -F'[|][|]' '{gsub(/\*/,""); print $14}' file
CP:53 msec
CP:59 msec
CP:21 msec

There's always grep:
grep -o 'CP:[[:digit:]]* msec' log.txt
If it's not necessarily going to be msec every time, you can just take everything up to the pipe:
grep -o 'CP:[^|]*' log.txt

With awk:
awk -F"[|*]+" '{ print $14 }' file

Code for GNU sed
$ sed -r 's/.*(CP:[0-9]+\smsec).*/\1/' file
CP:53 msec
CP:59 msec
CP:21 msec

Related

How to do parsing of Elapsed time in seconds in linux

I want to parse elapsed time into seconds. The time formats are given below:
1) 3 day 18h
2) 3 day
3) 3h 15min
4) 3h
5) 15min 10sec
6) 15min
7) 10sec
I'm getting the values from systemctl status cassandra | awk '/(Active: active)/{print $9, $10,$11}' and storing the result in a variable A, like this:
A=$(systemctl status cassandra | awk '/(Active: active)/{print $9, $10,$11}')
Now A holds a value such as 3 day 18h or 3 day, etc. More examples:
A=3 day 18h or 3 day or 3h 15min or 3h or 15min 10sec or 15min or 10sec
Now I want to take the different values of A and convert them to seconds.
What you want to achieve could be done directly in awk using the following line :
$ systemctl status cassandra | awk '/(Active: active)/{s=$6" "$7;gsub(/-|:/," ",s); print systime() - mktime(s)}'
This will give you the running time directly based on the start-time and not on the approximated running time printed by systemctl.
If this approach does not work, then I suggest using the date command to do all the parsing. If you can change h to hour in your examples, then you can do the following:
$ date -d "1970-01-01 + 3day 18hour 15min 16sec" +%s
324916
If you cannot, then I suggest the following. If the duration is stored in the variable duration, then you do
$ date -d "1970-01-01 + ${duration/h/hour}" +%s
Spaces between the numbers and the strings day, h, min or sec do not matter.
The idea is that you ask date to compute everything for you, since %s returns the Unix time in seconds since 1970-01-01. Note that 1970-01-01 without a zone is read in the local time zone, so the result equals the duration only when the time zone is UTC; writing 1970-01-01 UTC (or GMT) avoids this.
man date:
%s seconds since 1970-01-01 00:00:00 UTC
The given value of A is*:
A="3day 3day/3h 15min/3h/15min 10sec/15min/10sec"
To compute A in seconds you can use bash's parameter expansion:
A=${A//day/*86400}
A=${A//h/*3600}
A=${A//min/*60}
A=${A//sec/*1}
A=${A//\//+}
A=${A// /+}
echo "A = $A"
echo "$A" | bc
Output:
A = 3*86400+3*86400+3*3600+15*60+3*3600+15*60+10*1+15*60+10*1
542720
* Note here I changed the original value of A as provided by the OP. From
3 day/3 day/3h...
to
3day 3day/3h... # the rest is the same as OP's.
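The same substitution idea can be applied to a single duration without bc, using shell arithmetic. The following is a minimal sketch under a few assumptions: the helper name to_seconds is made up here, the units are spelled exactly day, h, min and sec as in the question, and the numbers carry no leading zeros (bash would read those as octal).

```shell
#!/usr/bin/env bash
# to_seconds: hypothetical helper; turns one duration such as "3 day 18h"
# into seconds using parameter expansion and shell arithmetic.
to_seconds() {
    local expr=$1
    expr=${expr// /}            # "3 day 18h"   -> "3day18h"
    expr=${expr//day/*86400+}   # "3day18h"     -> "3*86400+18h"
    expr=${expr//h/*3600+}      # "3*86400+18h" -> "3*86400+18*3600+"
    expr=${expr//min/*60+}
    expr=${expr//sec/*1+}
    echo $(( ${expr}0 ))        # the trailing 0 completes the final "+"
}

for d in "3 day 18h" "3h 15min" "15min 10sec" "10sec"; do
    printf '%s = %s\n' "$d" "$(to_seconds "$d")"
done
```

With the value from the question this would be called as to_seconds "$A".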
Using awk to s/h/hours/ and to launch date +"%s" -d "1970-01-01 GMT +" to parse the time strings and to count the seconds:
$ awk '{
    sub(/h/,"hours")   # date does not understand a bare h
    $1=""              # remove $1 (the list number)
    cmd="date +\"%s\" -d \"1970-01-01 GMT + " $0 "\""
    cmd | getline s    # run date and read the seconds
    close(cmd)
    print s
}' file
324000
259200
11700
10800
910
900
10
for the data:
$ cat file
1) 3 day 18h
2) 3 day
3) 3h 15min
4) 3h
5) 15min 10sec
6) 15min
7) 10sec

Extract single line from command output in terminal

I would like to extract the line containing 'seconds time elapsed' output from perf stat output for some logging script that I am working on.
I do not want to write the output to a file and then search the file. I would like to do it using 'grep' or something similar.
Here is what I have tried:
perf stat -r 10 echo "Sample_String" | grep -eE "seconds time elapsed"
For which I get
grep: seconds time elapsed: No such file or directory
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
Performance counter stats for 'echo Sample_String' (10 runs):
0.254533 task-clock (msec) # 0.556 CPUs utilized ( +- 0.98% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
56 page-faults # 0.220 M/sec ( +- 0.53% )
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.000457686 seconds time elapsed ( +- 1.08% )
And I tried this
perf stat -r 10 echo "Sample_String" > grep -eE "seconds time elapsed"
For which I got
Performance counter stats for 'echo Sample_String -eE seconds time elapsed' (10 runs):
0.262585 task-clock (msec) # 0.576 CPUs utilized ( +- 1.11% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
56 page-faults # 0.214 M/sec ( +- 0.36% )
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.000456035 seconds time elapsed ( +- 1.05% )
I am new to these tools like grep, awk and sed. I hope someone can help me out. I also do not want to write the output to a file and then search the file.
The problem here is that the output you want is sent to stderr instead of the standard output.
You can see this by redirecting stderr to /dev/null, and you'll see that the only result left is the one from the "echo" command.
~/ perf stat -r 10 echo "Sample_String" 2>/dev/null
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
In order to do what you want, you will have to redirect perf's stderr to the standard output, and hide the standard output. This way, perf's output will be sent to the grep command.
~/ perf stat -r 10 echo "Sample_String" 2>&1 >/dev/null | grep 'seconds time elapsed'
0,013137361 seconds time elapsed ( +- 96,08% )
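The order of the two redirections matters. The sketch below illustrates it without perf: fake_perf is a hypothetical stand-in that writes one line to stdout and one to stderr, and the timing value in it is just sample text.

```shell
#!/usr/bin/env bash
# fake_perf: hypothetical stand-in for `perf stat`; it writes the program's
# output to stdout and the statistics to stderr, as described above.
fake_perf() {
    echo "Sample_String"
    echo "0.000457686 seconds time elapsed" >&2
}

# 2>&1 first points stderr at the pipe; >/dev/null then discards stdout only.
elapsed_line=$(fake_perf 2>&1 >/dev/null | grep 'seconds time elapsed')
echo "$elapsed_line"

# Reversing the order sends both streams to /dev/null, so grep sees nothing.
nothing=$(fake_perf >/dev/null 2>&1 | grep 'seconds time elapsed' || true)
echo "reversed order gives: '${nothing}'"
```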
Looks like your desired output is going to stderr. Try:
perf stat -r 10 echo "Sample_String" 2>&1 | grep "seconds time elapsed"
This could work the way you intend:
grep -e "seconds time elapsed" <<< "$(perf stat -r 10 echo "Sample_String" 2>&1 >/dev/null)"
Output:
0.000544399 seconds time elapsed ( +- 2.05% )

awk match between two patterns in an "if/else" statement

I've got an awk issue that I can't seem to figure out. I'm trying to parse out data from SAR and found that some systems are using a different locale and I'm getting different output. The long term solution is to change the locale on all systems for the output data to the same thing, but I have to parse through old data for now and that is not currently an option. Here's the two types of data I get:
24-Hour Output:
21:10:01 all 8.43 0.00 1.81 2.00 0.00 87.76
21:20:01 all 7.99 0.00 1.74 0.82 0.00 89.44
21:30:01 all 8.35 0.00 1.76 0.94 0.00 88.95
12-Hour Output:
09:10:01 PM all 8.43 0.00 1.81 2.00 0.00 87.76
09:20:01 PM all 7.99 0.00 1.74 0.82 0.00 89.44
09:30:01 PM all 8.35 0.00 1.76 0.94 0.00 88.95
I need an awk statement that will get items from 7AM-7PM for all SAR data. I originally had something working, but once I found this issue, it broke for all the 24-hour output. I tried to get the following awk statement to work, but it doesn't, and I can't figure out how to fix it:
awk '{ if ($2 == "AM" || $2 == "PM" && /07:00/,/07:00/) print $1" "$2; else '/07:00/,/19:00 print $1}' SAR_OUTPUT_FILE.txt
Basically, what I'm trying to output is: if it is in 24-hour format, search for 07:00-19:00 and return just the first column of output (since there is no "AM/PM" column). If it finds "AM/PM", I would consider that 12-hour format and want to get everything from 07:00-07:00 and return both the 1st and 2nd column (time + "AM/PM").
Can anyone help me out here?
Without access to an awk with time functions ( strftime() or mktime() ), you can shift the 12h end times so that they can be tested with the 24h time test.
Here's an awk executable that does that by adjusting the hours in the 12h formatted times to fit 24h time formats. The result is put into variable t for every line and is tested to be in the 24h range.
#!/usr/bin/awk -f
function timeShift( a, h ) {
    if(NF==9 && split($1, a, ":")==3) {
        if(a[1]==12) h = $2=="PM"?"12":"00"
        else if($2=="PM") h = (a[1]+12)%24
        else h = a[1]
        return( h ":" a[2] ":" a[3] )
    }
    return( $1 )
}
{ t = timeShift() }
t >= "07:00:00" && t <= "19:00:00"
If you need to print fewer fields than the full line, an action block could be added after the final expression.
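As a sketch of what such an action block could look like, the wrapper below prints only the time column, plus AM/PM when the row has it. The name sar_window and the sample rows are assumptions for illustration; the time-shifting logic follows the script above.

```shell
#!/usr/bin/env bash
# sar_window: hypothetical wrapper around the timeShift() idea above; prints
# only the time column(s) of rows between 07:00:00 and 19:00:00.
sar_window() {
    awk '
    function timeShift(    a, h) {
        if (NF == 9 && split($1, a, ":") == 3) {       # 12-hour row with AM/PM
            if (a[1] == 12)      h = ($2 == "PM") ? "12" : "00"
            else if ($2 == "PM") h = a[1] + 12
            else                 h = a[1]
            return h ":" a[2] ":" a[3]
        }
        return $1                                      # already 24-hour
    }
    { t = timeShift() }
    t >= "07:00:00" && t <= "19:00:00" {
        print (NF == 9 ? $1 " " $2 : $1)               # keep AM/PM if present
    }' "$@"
}

# Assumed sample rows in both formats.
selected=$(sar_window <<'EOF'
06:50:01 AM all 8.43 0.00 1.81 2.00 0.00 87.76
07:10:01 AM all 8.43 0.00 1.81 2.00 0.00 87.76
12:30:01 PM all 7.99 0.00 1.74 0.82 0.00 89.44
09:10:01 PM all 8.35 0.00 1.76 0.94 0.00 88.95
08:20:01 all 7.99 0.00 1.74 0.82 0.00 89.44
21:10:01 all 8.43 0.00 1.81 2.00 0.00 87.76
EOF
)
echo "$selected"
```

On real data the call would be sar_window SAR_OUTPUT_FILE.txt.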

parse Log File, check for date, report results

I need to take the time stamp printed in After FTP connection and check whether it happened today.
I have a log file which contains the following:
---------------------------------------------------------------------
Opening connection for file1.dat
---------------------------------------------------------------------
---------------------------------------------------------------------
Before ftp connection -- time is -- Mon Oct 21 04:01:52 CEST 2013
---------------------------------------------------------------------
---------------------------------------------------------------------
After ftp connection -- time is Mon Oct 21 04:02:03 CEST 2013 .
---------------------------------------------------------------------
---------------------------------------------------------------------
Opening connection for file2.dat
---------------------------------------------------------------------
---------------------------------------------------------------------
Before ftp connection -- time is -- Wed Oct 23 04:02:03 CEST 2013
---------------------------------------------------------------------
---------------------------------------------------------------------
After ftp connection -- time is Wed Oct 23 04:02:04 CEST 2013 .
---------------------------------------------------------------------
Desired Output:
INPUT:file1.dat --> FAIL # since it is Oct 21st considering today is Oct 23.
INPUT:file2.dat --> PASS # since it is Oct 23rd.
INPUT:file3.dat --> FAIL # File information does not exist
What I tried so far:
grep "file1.dat\\|Before ftp connection\\|After ftp connection" logfilename
But this returns all the info that matches either file1.dat OR Before ftp connection OR After ftp connection. Considering the above sample, I get 5 lines out of which last 2 lines are from file2.dat:
Opening connection for file1.dat
Before ftp connection -- time is -- Mon Oct 21 04:01:52 CEST 2013
After ftp connection -- time is Mon Oct 21 04:02:03 CEST 2013 .
Before ftp connection -- time is -- Wed Oct 23 04:02:03 CEST 2013
After ftp connection -- time is Wed Oct 23 04:02:04 CEST 2013 .
I am stuck here. Ideally, I need to take Mon Oct 21 04:02:03 CEST 2013, compare it with today's date, and print the result FAIL.
Defining the records correctly makes things a lot easier:
$ awk '{print $5,($0~"After.*"d?"PASS":"FAIL")}' d="$(date +'%a %b %d')" RS= file
file1.dat FAIL
file2.dat PASS
Use awk:
# read dates in shell variables
read x m d x x y < <(date)
awk -v f='file2.dat' -v m=$m -v d=$d -v y=$y '$0 ~ f {s=1; next}
s && /After ftp connection/ {
res = ($8==m && $9==d && $12==y) ? "PASS" : "FAIL";
print f, res; exit
}' file.log
file2.dat PASS
FOLLOW UP by OP:
I achieved the intended results by this:
check_success ()
{
CHK_DIR=/Archive
if [[ ! -d ${CHK_DIR} ]]; then
exit 1
elif [[ ! -d ${LOG_FOLDER} ]]; then
exit 1
fi
count_of_files=$(ls -al --time-style=+%D $CHK_DIR/*.dat | grep $(date +%D) | cut -f1 | awk '{ print $7}' | wc -l)
if [[ $count_of_files -lt 1 ]]; then
exit 2
fi
list_of_files=$(ls -al --time-style=+%D $CHK_DIR/*.dat | grep $(date +%D) | cut -f1 | awk '{ print $7}')
for filename in $list_of_files
do
filename=$(basename "$filename")
lg_name=$(grep -El "Opening.*$filename" $LOG_FOLDER/* | head -1 )
m=$(date +%b)
d=$(date +%d)
y=$(date +%Y)
output=$(awk -v f=$filename -v m=$m -v d=$d -v y=$y '$0 ~ f {s=1; next} s && /After ftp connection/ { res = ($8==m && $9==d && $12==y) ? "0" : "1"; print res; exit }' $lg_name)
if [[ ${output} != 0 ]]; then
exit 2
fi
done
exit 0
}
I used Anubhava's snippet; nevertheless, thanks to all three champs.
It was tricky!
$ awk -vtoday=$(date "+%Y%m%d") '
/^Opening/ {file=$4}
/^After ftp connection/ {$1=$2=$3=$4=$5=$6=$NF="";
    r="date -d \"" $0 "\" \"+%Y%m%d\""; r | getline dat; close(r);
    if (today==dat) {print file, "PASS"}
    else {print file, "FAIL"}}
' file
file1.dat FAIL
file2.dat PASS
Explanation
-vtoday=$(date "+%Y%m%d") gives today's date with "20131023" format
/^Opening/ {file=$4} gets lines starting with Opening and store the filename, that happens to be in the 4th field.
/^After ftp connection/ on lines starting with "After ftp connection...", do:
{$1=$2=$3=$4=$5=$6=$NF=""; delete up to 6th field and last one so the rest is the date info.
r="date -d \"" $0 "\" \"+%Y%m%d\""; r | getline dat; calculate the date on YYYYMMDD format of that line.
if (today==dat) {print file, "PASS"} compares the two dates.
else {print file, "FAIL"} idem.
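A variant that avoids calling date once per log line is to pass today's month, day and year into awk and compare fields directly. This is a sketch only: check_ftp_log is a hypothetical name, the sample log is an assumption modelled on the question, and the month comparison relies on the log and date using the same locale.

```shell
#!/usr/bin/env bash
# check_ftp_log: hypothetical helper. Usage: check_ftp_log MONTH DAY YEAR LOGFILE
# Reports PASS when the "After ftp connection" stamp matches the given date.
check_ftp_log() {
    awk -v m="$1" -v d="$2" -v y="$3" '
        /^Opening connection for/ { file = $NF }       # remember current file
        /^After ftp connection/ {
            ok = ($8 == m && $9 + 0 == d + 0 && $12 == y)
            print "INPUT:" file " --> " (ok ? "PASS" : "FAIL")
        }' "$4"
}

# Assumed sample log in the format shown in the question.
log=$(mktemp)
cat > "$log" <<'EOF'
Opening connection for file1.dat
Before ftp connection -- time is -- Mon Oct 21 04:01:52 CEST 2013
After ftp connection -- time is Mon Oct 21 04:02:03 CEST 2013 .
Opening connection for file2.dat
Before ftp connection -- time is -- Wed Oct 23 04:02:03 CEST 2013
After ftp connection -- time is Wed Oct 23 04:02:04 CEST 2013 .
EOF

report=$(check_ftp_log Oct 23 2013 "$log")   # real use: $(date +'%b %d %Y')
echo "$report"
rm -f "$log"
```

The day is compared numerically so that 3 in the log matches 03 from date.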

parsing ns2 trace file [closed]

I'm using NS 2.35 and am trying to determine the end-to-end delay of my routing algorithm.
I think anyone with some good scripting experience should be able to answer this question, sadly that person is not me.
I have a trace file, that looks something like this:
- -t 0.548 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1052 -a 0 -x {2.0 17.0 6 ------- null}
h -t 0.548 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1052 -a 0 -x {2.0 17.0 -1 ------- null}
+ -t 0.55 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1056 -a 0 -x {2.0 17.0 10 ------- null}
+ -t 0.555 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1057 -a 0 -x {2.0 17.0 11 ------- null}
r -t 0.556 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1047 -a 0 -x {2.0 17.0 1 ------- null}
+ -t 0.556 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1047 -a 0 -x {2.0 17.0 1 ------- null}
- -t 0.556 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1047 -a 0 -x {2.0 17.0 1 ------- null}
But here is what I need to do.
A line that starts with + is when a new packet is added to the network.
A line starting with r is when a packet has been received by the destination. The floating-point number after -t is the time at which that event happened. And finally, the number after -i is the identity of the packet.
To calculate the average end-to-end delay, I need to find every line that has a certain ID after the -i. From there I need to calculate the timestamp of the r minus the timestamp of the +.
So I figure there could be a regular expression separated by spaces. I could put each of the segments into its own variable. Then I would check the 15th (the packet ID).
But I'm not sure where to go from there, or how to put it all together.
I know there are some AWK scripts on the web for doing this, but they are all outdated and don't fit the current format (and I'm not sure how to change them).
Any help would be greatly appreciated.
EDIT:
Here is an example of a full packet route that I'm looking to find.
I've taken out a lot of lines in between these ones, so that you can see a single packets events.
# a packet is enqueued from node 2 going to node 7. Its ID is 1636. This was at roughly 1.75sec
+ -t 1.74499999999998 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# at 2.1s, it left node 2.
- -t 2.134 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# at 2.134 it hopped from 2 to 7 (not important)
h -t 2.134 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 -1 ------- null}
# at 2.182 it was received by node 7
r -t 2.182 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# it was then enqueued by node 7 to be sent to node 12
+ -t 2.182 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# slightly later it left node 7 on its way to node 12
- -t 2.1832 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# it hopped from 7 to 12 (not important)
h -t 2.1832 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 -1 ------- null}
# received by 12
r -t 2.2312 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# added to queue, heading to node 17
+ -t 2.2312 -s 12 -d 17 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# left for node 17
- -t 2.232 -s 12 -d 17 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# hopped to 17 (not important)
h -t 2.232 -s 12 -d 17 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 -1 ------- null}
# received by 17 notice the time delay
r -t 2.28 -s 12 -d 17 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
The ideal output of the script would recognize 2.134 as the start time, and 2.28 as the end, and then give me the delay of 0.146sec. It would do this for all packet IDs and only report the average.
It was requested that I expand a bit on how the file works, and what I am expecting.
The file is listing descriptions of about 10,000 packets. Each packet can be in a different state. The important states are + which means a packet has been enqueued at a router, and r which means the packet has been received by its destination.
It is possible that a packet that is enqueued (so a + entry) is not actually received and is instead dropped. This means we cannot assume that for every + entry there will be a r entry.
What I'm trying to measure is the average end to end delay. What this means, is that if you look at a single packet, it will have a time it was enqueued, and a time it was received. I need to make this calculation to find its end-to-end delay. But I also need to do it for 9,999 other packets to get an average.
I've thought about it more, and here's generally how I think the algorithm needs to work:
1) Remove all lines that don't begin with a + or an r, because they are unimportant.
2) Go through all of the packet IDs (that is, the numbers after -i, such as 1052 in the example), and put them into some sort of groups (multiple arrays perhaps).
3) Each group should now contain all of the information about a particular packet.
4) Inside the group, check if there is a +; ideally we want the very first +. Record its time.
5) Look for any more + lines and look at their time. It's possible the log is slightly jumbled, so it's possible there is a + line later on that is actually earlier in the simulation.
6) If this new + line has an earlier time, then update the time variable with that.
7) Assuming there are no more + lines, look for an r line.
8) If there is no r line, the packet was dropped, so don't worry about it.
9) For every r line you find, all we need to do is find the one that has the latest timestamp.
10) The r line with the latest timestamp is where the packet was finally received.
11) Subtract the + time from the r time; this gives us the time it took for the packet to travel.
12) Add this value to an array so that later it can be averaged.
13) Repeat this process on every packet ID group, and then finally average the created array of delays.
That's a lot of typing, but I think it's as clear as I can be about what I want. I wish I was a regex master, but I just don't have time to learn it well enough to pull this off.
Thanks for all your help, and let me know if you have any questions.
There's not much to work with here, as Iain said in the comments to your question, but if I understand what you want to do correctly, something like this should work:
awk '/^[+r]/{$1~/r/?r[$15]=$3:r[$15]?d[$15]=r[$15]-$3:1} END {for(p in d){sum+=d[p];num++}print sum/num}' trace.file
It skips all lines not starting with '+' or 'r'. If the line starts with 'r' it adds time to the r array. Otherwise, it calculates the delay and adds it to the d array if the element is found in the r array. Finally it loops over the elements in the d array, adds up the total delay and number of elements and calculates the average from this. In your case the average is 0.
The :1 at the end of the main block is just in there so I can get away with a ternary expression instead of the significantly more verbose if statement.
EDIT: New expression to work with the added conditions:
awk '/^[+r]/{$1~/r/?$3>r[$15]?r[$15]=$3:1:!a[$15]||$3<a[$15]?a[$15]=$3:1} END {for(i in r){sum+=r[i]-a[i];num++}print "Average delay", sum/num}' trace.file
or as an awk-file
/^[+r]/ {
    if ($1 ~ /r/) {
        if ($3 > received[$15])
            received[$15] = $3;
    } else {
        if (!added[$15] || $3 < added[$15])
            added[$15] = $3;
    }
} END {
    for (packet in received) {
        sum += received[packet] - added[packet];
        num++
    }
    print "Average delay", sum/num
}
According to your algorithm it seems like 1.745 would be the start time, while you write that 2.134 is.
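For completeness, here is a sketch of the same computation that also skips packets lacking either a + or an r line, as the question's algorithm asks. The name avg_delay and the miniature trace are assumptions for illustration; the field positions ($3 for the time, $15 for the packet ID) follow the trace format shown in the question.

```shell
#!/usr/bin/env bash
# avg_delay: hypothetical helper; averages (latest r time - earliest + time)
# over packets that have both events, so dropped packets are skipped.
avg_delay() {
    awk '
    $1 == "+" { if (!($15 in first) || $3 + 0 < first[$15]) first[$15] = $3 + 0 }
    $1 == "r" { if (!($15 in last)  || $3 + 0 > last[$15])  last[$15]  = $3 + 0 }
    END {
        for (id in last)
            if (id in first) { sum += last[id] - first[id]; n++ }
        if (n) printf "%.3f\n", sum / n
    }' "$@"
}

# Assumed miniature trace: packet 1 makes two hops, packet 2 one hop,
# packet 3 is enqueued but never received.
average=$(avg_delay <<'EOF'
+ -t 1.0 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1 -a 0 -x {2.0 17.0 1 ------- null}
r -t 1.5 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1 -a 0 -x {2.0 17.0 1 ------- null}
+ -t 1.5 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1 -a 0 -x {2.0 17.0 1 ------- null}
r -t 2.0 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1 -a 0 -x {2.0 17.0 1 ------- null}
+ -t 3.0 -s 2 -d 7 -p cbr -e 500 -c 0 -i 2 -a 0 -x {2.0 17.0 2 ------- null}
r -t 3.5 -s 2 -d 7 -p cbr -e 500 -c 0 -i 2 -a 0 -x {2.0 17.0 2 ------- null}
+ -t 4.0 -s 2 -d 7 -p cbr -e 500 -c 0 -i 3 -a 0 -x {2.0 17.0 3 ------- null}
EOF
)
echo "Average delay: $average"
```

Here packet 1 contributes 2.0 - 1.0 and packet 2 contributes 3.5 - 3.0, while packet 3 is ignored because it has no r line.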