Processing Ping Data (Regular Expressions) - regex

I'm trying to create a script to process data from ping. So it will come from a file in the standard format with timestamps:
PING google.com (4.34.16.45) 56(84) bytes of data.
[1393790120.617504] 64 bytes from 4.34.16.45: icmp_req=1 ttl=63 time=25.7 ms
[1393790135.669873] 64 bytes from 4.34.16.45: icmp_req=2 ttl=63 time=30.2 ms
[1393790150.707266] 64 bytes from 4.34.16.45: icmp_req=3 ttl=63 time=20.6 ms
[1393790161.195257] 64 bytes from 4.34.16.45: icmp_req=4 ttl=63 time=35.2 ms
--- google.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 45145ms
rtt min/avg/max/mdev = 20.665/27.970/35.246/5.390 ms
I want to cut it to just the timestamp, time and request number like so (note this is from a different data set, given as an example):
0.026202538597014928 26.2 1
0.53210253859701473 24.5 2
1.0482067203067074 32.0 3
1.6627447926949444 139.6 4
2.2686229201578056 237.1 5
I realize I need to use sed to accomplish this. But I'm still really confused as to what the expressions would be to cut to the data properly. I imagine I would have something along these lines:
cat $inFile | grep -o "$begin$regex$end" | sed "s/$end//g" | sed "s/$begin//g" > $outFile
I'm just not sure what $begin and $end would be.
TL;DR Help me understand regular expressions?

You can try following sed command:
sed -ne '
2,/^$/ {
/^$/! {
s/^\[\([^]]*\).*icmp_req=\([0-9]*\).*time=\([0-9.]*\).*$/\1 \3 \2/
p
}
}
' infile
It uses the -n switch to suppress the automatic printing of input lines. It selects the range of lines from the second one to the first blank one and, for each line in that range, captures the pieces of text to extract in groups.
Assuming infile with the content of the question, it yields:
1393790120.617504 25.7 1
1393790135.669873 30.2 2
1393790150.707266 20.6 3
1393790161.195257 35.2 4
UPDATE with Scrutinizer's simpler solution (see comments):
sed -n 's/^\[\([^]]*\).*icmp_req=\([0-9]*\).*time=\([0-9.]*\).*$/\1 \3 \2/p' infile
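As a quick check, here is a minimal sketch that feeds a trimmed copy of the question's sample through the one-liner. The variable names sample and out are only for illustration and are not part of the original answer.

```shell
# Trimmed copy of the sample from the question (header, two replies, footer).
sample='PING google.com (4.34.16.45) 56(84) bytes of data.
[1393790120.617504] 64 bytes from 4.34.16.45: icmp_req=1 ttl=63 time=25.7 ms
[1393790135.669873] 64 bytes from 4.34.16.45: icmp_req=2 ttl=63 time=30.2 ms
--- google.com ping statistics ---'

# -n suppresses default output; the trailing p prints only the lines where the
# substitution succeeded, so the header and footer lines are dropped.
out=$(printf '%s\n' "$sample" |
  sed -n 's/^\[\([^]]*\).*icmp_req=\([0-9]*\).*time=\([0-9.]*\).*$/\1 \3 \2/p')

printf '%s\n' "$out"
```

Because non-matching lines are never printed, no line range is needed to skip the PING header or the statistics footer.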

For good measure, here's an awk solution:
awk -F "[][ =]" '/^\[/ { print $2, $13, $9 }' file
Takes advantage of awk's ability to parse lines into fields based on a regex as the separator - here, any of the following chars: [, ],  , or =.
Simply prints out the fields of interest by index, for lines that start with [.
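To see where the indexes 2, 13 and 9 come from, the following sketch dumps every field of one sample line together with its index. The variable names line, fields and picked are made up for the example.

```shell
line='[1393790120.617504] 64 bytes from 4.34.16.45: icmp_req=1 ttl=63 time=25.7 ms'

# Print every field with its index so the positions can be checked by eye.
# Fields 1 and 3 are empty: one sits before "[", the other between "]" and " ".
fields=$(printf '%s\n' "$line" |
  awk -F '[][ =]' '{ for (i = 1; i <= NF; i++) printf "%d:%s\n", i, $i }')
printf '%s\n' "$fields"

# Same selection as in the answer: timestamp, time, request number.
picked=$(printf '%s\n' "$line" | awk -F '[][ =]' '/^\[/ { print $2, $13, $9 }')
printf '%s\n' "$picked"
```

Note that the indexes depend on the exact layout of the ping line, so a ping variant that prints extra fields would need different numbers.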

For a pure regex solution, see this expression:
\[([\d\.]*)].*?=(\d+).*?=([\d\.]*) ms
You can view an online demo here:
Regex101.com

Related

List lines between 2 keywords using grep/sed/awk

I have a sas log file and I want to list only those lines that are between two words: data and run.
File can contain many such words in many lines, for example:
MPRINT: data xxxxx;
yyyyy
xxxxxx
MPRINT: run;
fffff
yyyyy
data fff;
fffff
run;
I would like to have lines 1-4 and 7-9.
I tried something like
egrep -iz file -e '\sdata\s+\S*\s+(.|\s)*\srun\s' but this expression lists all lines between first begin and last end ((.|\s) is for the purpose of new line character).
I may also want to add additional words to pattern between data and run like:
MPRINT: data xxx;
fffff
NOTE: ffdd
set fff;
xxxxxx
MPRINT: run;
data fff;
yyyyyy
run;
In some cases I would like to list only lines between data and run where there is set word in some line.
I know there are many similar threads, but I didn't find any when keywords can repeat multiple times.
I'm not familiar with awk or sed, but if it can help I can also use it.
[Edit]
Note that data and run are not necessarily on the beginning of the line (I updated the example). Also there can't be any other data between data and run.
[Edit2]
As Tom noted every line that I was looking for started with MPRINT(...):, so filtered those lines.
Anubhava's answer helped me the most with my final solution, so I am marking it as the answer.
Final expression looked like this :
grep -o path -e 'MPRINT.*' | cut -f '2-' -d ' '|
grep -iozP '(?ms) data [^\(;\s]+.*?(set|infile).*?run[^\n]*\n'
You may use this GNU grep command with the -P (PCRE) option:
grep -ozP '(?ms).*?data .*?run[^\n]*\n' file
If you only want to print block with line starting from set then use:
grep -ozP '(?ms).*?data .*?^set.*?run[^\n]*\n' file
MPRINT: data xxxxx;
yyyyy
set fff;
xxxxxx
MLOGIC: run;
You may use this awk to print between 2 keywords that must contain a line starting with set:
awk '/data / {
p=1
}
p && !y {
if (/^set/)
y=1
else
buf = buf $0 ORS
}
y {
if (buf != "")
printf "%s", buf
buf=""
print
}
/run/ {
p=y=0; buf=""   # discard lines buffered from a block that had no set line
}' file
MPRINT: data xxxxx;
yyyyy
set fff;
xxxxxx
MLOGIC: run;
If you just want to print data between 2 keywords in awk, it is so simple:
awk '/data /,/run/' file
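A minimal sketch of that range pattern run against the first sample from the question; the variable names log and blocks are only for illustration.

```shell
# First sample from the question, one log line per line.
log='MPRINT: data xxxxx;
yyyyy
xxxxxx
MPRINT: run;
fffff
yyyyy
data fff;
fffff
run;'

# A range pattern switches printing on at a line matching /data / and off
# again after the next line matching /run/; it can start again later.
blocks=$(printf '%s\n' "$log" | awk '/data /,/run/')

printf '%s\n' "$blocks"
```

The two stray lines between the blocks are dropped. Keep in mind that /run/ also matches words such as "running", so a stricter pattern like /run;/ may be safer on a real log.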
From what I understand, the following will do the trick:
sed -n '/data.*;/,/run;/p' $FILENAME
Note that the '.*' after data can be tightened to something like [a-zA-Z]{5} (with sed -E), which protects against matching the word data somewhere in the middle of a line.
From there matching from data to set would already require some external decision processes, so the command would be
sed -n '/data.*;/,/set.*;/p' $FILENAME
(Probably learned along the way from How to use sed/grep to extract text between two words?)
Just try (?s)data.+?run;
Explanation:
(?s) - single line mode, . matches newline character
data - match data literally
.+? - match one or more of any character (including newline), non-greedy due to ?
run; - match run; literally
Demo

get number value between two strings using regex

I have a string with multiple value outputs that looks like this:
SD performance read=1450kB/s write=872kB/s no error (0 0), ManufactorerID 27 Date 2014/2 CardType 2 Blocksize 512 Erase 0 MaxtransferRate 25000000 RWfactor 2 ReadSpeed 22222222Hz WriteSpeed 22222222Hz MaxReadCurrentVDDmin 3 MaxReadCurrentVDDmax 5 MaxWriteCurrentVDDmin 3 MaxWriteCurrentVDDmax 1
I would like to output only the read value (1450kB/s) using bash and sed.
I tried
sed 's/read=\(.*\)kB/\1/'
but that outputs read=1450kB but I only want the number.
Thanks for any help.
Sample input shortened for demo:
$ echo 'SD performance read=1450kB/s write=872kB/s no error' | sed 's/read=\(.*\)kB/\1/'
SD performance 1450kB/s write=872/s no error
$ echo 'SD performance read=1450kB/s write=872kB/s no error' | sed 's/.*read=\(.*\)kB.*/\1/'
1450kB/s write=872
$ echo 'SD performance read=1450kB/s write=872kB/s no error' | sed 's/.*read=\([0-9]*\)kB.*/\1/'
1450
Since entire line has to be replaced, add .* before and after search pattern
* is greedy, will try to match as much as possible, so in 2nd example it can be seen that it matched even the values of write
Since only the digits after read= are needed, use [0-9] instead of .
Running
sed 's/read=\(.*\)kB/\1/'
will only replace read=[digits]kB with [digits] and leave the rest of the line untouched. If you want to replace the whole line, use
sed 's/.*read=\([0-9]*\)kB.*/\1/'
instead.
As Sundeep noticed, sed doesn't support non-greedy pattern, updated for [0-9]* instead

Matching bash variables as number literals with grep

I have a (GNU) bash script which establishes two variables to be matched in a file.
hour=$(head -n 1 sensorstest.log | cut -f5 | cut -d"-" -f1)
dom=$(head -n 1 sensorstest.log | cut -f5 | cut -d"-" -f4)
...and matches them to other occurrences in the file
grep -E [^0-9]+"$hour"-[0-9]+-[0-9]+-"$dom"-[0-9]+-[0-9]{4} sensorstest.log
Here is an example of the script calculating the mean for all values in field 2 of the input file for the given hour of day.
hMean=$(grep -E [^0-9]+"$hour"-[0-9]+-[0-9]+-"$dom"-[0-9]+-[0-9]{4} sensorstest.log | cut -f2 | awk ' {sum+=$1}{count++}{mean=sum/count} END {printf("%.2f",mean) } ' );
Here is an example of the cleanup of the input file.
echo "removing: "$hour"th hour of the "$dom"th day of the "$month"th month"
sed -i -r '/'"$hour"'-[0-9]+-[0-9]+-'"$dom"'-'"$month"'-[0-9]{4}/d' sensorstest.log
And finally... Here is an example line in the file:
The format is:
status<tab>humidity<tab>temperature<tab>unix timestamp<tab>time/date
OK 94.4 16.9 1443058486 1-34-46-24-9-2015
I am attempting to match all instances of the hour on the day of the first entry in the file.
This works fine for numbers below 9, however;
Problem: Numbers over 9 are being matched as two single digit numbers, resulting in 12 matching 1, 2, 12, 21...etc.
Here is an example of where it trips up:
OK 100 17.2 1442570381 9-59-41-18-9-2015
OK 100 17.1 1442570397 9-59-57-18-9-2015
Moistening 100 17.6 1442574014 11-0-14-18-9-2015
Moistening 100 17.6 1442574030 11-0-30-18-9-2015
Here the output skips to 0-0-0-19-9-2015 (and yes I am missing an hour of entries from the log)
$ sudo statanhourtest.sh
100,1.4,1.40,-98.6 16.5,17.2,16.90,.7 1442566811 9-0-0-18-9-2015
removing: 9th hour of the 18th day of the 9th month
$ sudo statanhourtest.sh
100,1.4,1.40,-98.6 18.3,18.8,18.57,.5 1442620804 0-0-0-19-9-2015
removing: 0th hour of the 19th day of the 9th month
The problem is only happening with the hours. the day ($dom) is matching fine.
I have tried using the -w option with grep, but I think this only returns the exact match where I need the whole line.
There's not much online about matching numbers literally in grep. And I found nothing on using bash variables as a number literal.
Any help or relevant links would be greatly appreciated.
EDIT:
I have solved the problem after a night of dredging through the script.
The problem lay with my sed expression right at the end.
The problem being in single quoting parts of the sed expression and double quoting variables for expansion by the shell.
I took this from a suggestion on another thread.
Double quoting the whole expression solved the problem.
The awk suggestion has greatly increased the efficiency and accuracy of the script. Thanks again.
awk to the rescue!
I think you can combine everything to a simple awk script without needing any regex. For example,
awk 'NR==1{split($NF,h,"-")} {split($NF,t,"-")} t[1]==h[1] && t[4]==h[4]' sensorstest.log
parses the time stamp on the first row of the file and keeps only the records whose hour and day match it.
This will take the average of field 2
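awk 'NR==1 { split($NF,h,"-") }                     # hour and day of the first row
{ split($NF,t,"-") }                                # hour and day of the current row
t[1]==h[1] && t[4]==h[4] { sum+=$2; c++ }           # accumulate field 2 of matching rows
END { print "Average: " sum/c }' sensorstest.log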
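To illustrate why comparing the split fields avoids the 1 versus 11 confusion from the question, here is a small sketch on inline data. The first row is the example line from the question; the other two rows (values and timestamps) are made up for the demonstration, as are the variable names log and mean.

```shell
# Tab-separated sample. Row 1 is from the question; rows 2 and 3 are
# hypothetical rows for the same day, one in hour 1 and one in hour 11.
log=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  OK 94.4 16.9 1443058486 1-34-46-24-9-2015 \
  OK 95.6 17.0 1443058502 1-35-2-24-9-2015 \
  OK 99.0 17.1 1443092414 11-0-14-24-9-2015)

mean=$(printf '%s\n' "$log" | awk -F '\t' '
  NR == 1 { split($NF, h, "-") }            # hour and day of the first row
  { split($NF, t, "-") }                    # hour and day of the current row
  t[1] == h[1] && t[4] == h[4] { sum += $2; c++ }
  END { if (c) printf "%.2f\n", sum / c }')

printf '%s\n' "$mean"
```

The hour-11 row is excluded because the whole hour field is compared as one value rather than matched as a substring, so only the two hour-1 rows contribute to the mean.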

bash: Batch reformatting using sed + date?

I have a bunch of data that looks like this:
"2004-03-23 20:11:55" 3 3 1
"2004-03-23 20:12:20" 1 1 1
"2004-03-31 02:20:04" 15 15 1
"2004-04-07 14:33:48" 141 141 1
"2004-04-15 02:08:31" 2 2 1
"2004-04-15 07:56:01" 1 2 1
"2004-04-16 12:41:22" 4 4 1
and I need to feed this data to a program which only accepts time in UNIX (Epoch) format. Is there a way I can change all the dates in bash? My first instinct tells me to do something like this:
sed 's/"(.*)"/`date -jf "%Y-%m-%d %T" "\1" "+%s"`'
But I am not entirely sure that the \1 inside the date call will properly backreference the regex matched by sed. In fact, when I run this, I get the following response:
sed: 1: "s/(".*")/`date -jf "% ...": unterminated substitute in regular expression
Can anyone guide me in the right direction on this? Thank you.
Nothing is going to be expanded between single quotes. Also, no, the shell expansions are going to happen before the sed \1 expansion, so your code isn't going to work. How about something like this (untested):
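while read -r date time a b c
do
    # Strip the literal quotes, then print the epoch time (+%s) followed by the other columns.
    # --date is GNU date; on BSD/macOS use: date -jf "%Y-%m-%d %T" "..." +%s
    printf '%s %s %s %s\n' "$(date --date "${date#\"} ${time%\"}" +%s)" "$a" "$b" "$c"
done < file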

Units of measure regex manipulation

Objective
On Linux, I am trying to get an end-user friendly string representing available system memory.
Example:
Your computer has 4 GB of memory.
Success criteria
I consider these aspects end-user friendly (you may disagree):
1G is more readable than 1.0G (1 Vs 1.0)
1GB is more readable than 1G (GB Vs G)
1 GB is more readable than 1GB (space-separated unit of measure)
memory is more readable than RAM, DDR or DDR3 (no jargon)
Starting point
The free utility from procps-ng has an option intended for humans:
-h, --human
Show all output fields automatically scaled to shortest three digit unit
and display the units of print out. Following units are used.
B = bytes
K = kilos
M = megas
G = gigas
T = teras
If unit is missing, and you have petabyte of RAM or swap, the number is
in terabytes and columns might not be aligned with header.
so I decided to start there:
> free -h
             total       used       free     shared    buffers     cached
Mem:          3.8G       1.4G       2.4G         0B       159M       841M
-/+ buffers/cache:       472M       3.4G
Swap:         4.9G         0B       3.9G
3.8G sounds promising so all I have to do now is...
Required steps
Filter the output for the line containing the human-readable string (i.e. Mem:)
Pick out the memory total from the middle of the line (i.e. 3.8G)
Parse out the number and unit of measure (i.e. 3.8 and G)
Format and display a string more to my liking (e.g. G↝ GB, ...)
My attempt
free -h | \
awk '/^Mem:/{print $2}' | \
perl -ne '/(\d+(?:\.\d+)?)(B|K|M|G|T)/ && printf "%g %sB\n", $1, $2'
outputs:
3.8 GB
Desired solution
I'd prefer to just use gawk, but I don't know how
Use a better, even canonical if there is one, way to parse a "float" out of a string
I don't mind the fastidious matching of "just the recognised magnitude letters" (B|K|M|G|T), even if this would unnecessarily break the match with the introduction of new sizes
I use %g to output 4.0 as 4, which is something you may disagree with, depending on how you feel about these comments: https://unix.stackexchange.com/a/70553/10283.
My question, in summary
Could you do the above in awk only?
Could my perl be written more elegantly than that, keeping the strictness of it?
Remember:
I am a beginner robot. Here to learn. :]
What I learned from Andy Lester
Summarised here for my own benefit: to cement learning, if I can.
Use regex character classes, not regex alternation, to pick out one character from a set
perl has a -a option, which (together with -n or -p) splits $_ into @F:
for example, this gawk:
echo foo bar baz | awk '{print $2}'
can be written like this in perl:
echo foo bar baz | perl -ane 'print "$F[1]\n";'
Unless there is something equivalent to gawk 's --field-separator, I think I still like gawk better, although of course to do everything in perl is both cleaner and more efficient. (is there an equivalent?)
EDIT: actually, this proves there is, and it's -F just like in gawk:
echo ooxoooxoooo | perl -Fx -ane 'print join "\n", @F'
outputs:
oo
ooo
oooo
perl has a -l option, which is just awesome: think of it as Python's str.rstrip (see the link if you are not a Python head) applied to the trailing newline of $_ on input, except that it also re-appends the \n to whatever you print automatically for you
Thanks, Andy!
Yes, I'm sure you could do this awk-only, but I'm a Perl guy so here's how you'd do it Perl-only.
Instead of (B|K|M|G|T) use [BKMGT].
Use Perl's -l to automatically strip newlines from input and add them on output.
I don't see any reason to have Awk do some of the stripping and Perl doing the rest. You can do autosplitting of fields with Perl's -a.
I don't know what the output from free -h is exactly (My free doesn't have an -h option) so I'm guessing at this
free -h | \
perl -alne '/^Mem:/ && ($F[1]=~/(\d+(?:\.\d+)?)([BKMGT])/) && printf("%g %sB\n", $1, $2)'
An awk (actually gawk) solution
free -h | awk 'FNR == 2 {if (match($2,"[BKMGT]$",a)) r=sprintf("%.0f %sB",substr($2,1,RSTART-1), a[0]); else r=$2 " B";print "Your computer has " r " of memory."}'
or broken down for readability
free -h | awk 'FNR == 2 {if (match($2,"[BKMGT]$",a)) r=sprintf("%.0f %sB",
substr($2,1,RSTART-1), a[0]); else r=$2 " B";
print "Your computer has " r " of memory."}'
Where
FNR is the current line number (the {} commands run only when it is 2)
$2 is the 2nd field
if (condition) command; else command;
match(string, regex, matches array). Regex says "must end with one of BKMGT"
r=sprintf set variable r to sprintf with %.0f for no decimals float
RSTART tells where the match occurred, a[0] is the matched text
Output with the example above:
Your computer has 4 GB of memory.
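For comparison, here is a sketch that avoids the gawk-only three-argument match() and should run in any POSIX awk. It works on a captured copy of the free -h output from the question rather than a live call, and it keeps the decimal (3.8) through %g instead of rounding with %.0f. The variable names sample and msg are assumptions made for the example.

```shell
# Captured sample of `free -h` output from the question (not a live call).
sample='             total       used       free     shared    buffers     cached
Mem:          3.8G       1.4G       2.4G         0B       159M       841M
Swap:         4.9G         0B       3.9G'

msg=$(printf '%s\n' "$sample" | awk '/^Mem:/ {
  unit = substr($2, length($2))            # last character, e.g. "G"
  num  = substr($2, 1, length($2) - 1)     # everything before it, e.g. "3.8"
  if (unit !~ /^[BKMGT]$/) next            # unrecognised suffix: print nothing
  if (unit != "B") unit = unit "B"         # G -> GB, while B stays B
  printf "Your computer has %g %s of memory.\n", num, unit
}')

printf '%s\n' "$msg"
```

Replacing the printf on the sample with a pipe from free -h would give the live value, assuming the local free prints the total in the second column of the Mem: line.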
Another lengthy Perl answer:
free -b |
perl -lane 'if(/Mem/){ @u=("B","KB","MB","GB"); $F[1]/=1024, shift @u while ($F[1]>1024); printf("%.2f %s\n", $F[1],$u[0])}'