How to parse elapsed time into seconds in Linux - regex

I want to parse an elapsed time into seconds. The possible time formats are given below:
1) 3 day 18h
2) 3 day
3) 3h 15min
4) 3h
5) 15min 10sec
6) 15min
7) 10sec
I'm getting the value from systemctl status cassandra | awk '/(Active: active)/{print $9, $10, $11}' and storing it in a variable A, like:
A=$(systemctl status cassandra | awk '/(Active: active)/{print $9, $10, $11}')
Now A holds input such as 3 day 18h or 3 day, etc. More examples:
A=3 day 18h or 3 day or 3h 15min or 3h or 15min 10sec or 15min or 10sec
I now need to take the different values of A and convert them to seconds.

What you want to achieve could be done directly in awk using the following line:
$ systemctl status cassandra | awk '/(Active: active)/{s=$6" "$7;gsub(/-|:/," ",s); print systime() - mktime(s)}'
This will give you the running time directly, computed from the start time rather than from the approximated running time printed by systemctl.
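To see why this works, here is a hypothetical "Active:" line of the kind systemctl prints (the exact wording varies by version):
Active: active (running) since Wed 2018-10-03 10:11:12 UTC; 3 days ago
Fields $6 and $7 hold the start date and time; gsub() turns "2018-10-03 10:11:12" into "2018 10 03 10 11 12", which is exactly the format mktime() expects, so subtracting from systime() yields the elapsed seconds.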
If this approach does not work, then I suggest using the date command to do all the parsing. If you can change the h to hour in your examples, then you can do the following:
$ date -d "1970-01-01 + 3day 18hour 15min 16sec" +%s
324916
If you cannot, then I suggest the following. If the duration is stored in the variable $duration, then you can do:
$ date -d "1970-01-01 + ${duration/h/hour}" +%s
Having spaces between the numbers and the unit strings day, h, min or sec does not matter.
The idea is that you ask date to compute everything for you, as %s returns the Unix time in seconds since 1970-01-01.
From man date:
%s seconds since 1970-01-01 00:00:00 UTC
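For example, here is a minimal sketch that runs all of the OP's sample formats through this approach (assuming GNU date; anchoring at GMT keeps local-timezone offsets out of the result):
for duration in "3 day 18h" "3 day" "3h 15min" "3h" "15min 10sec" "15min" "10sec"; do
    # expand the first bare "h" to "hour" so date can parse it
    date -d "1970-01-01 GMT + ${duration/h/hour}" +%s
done
This prints 324000, 259200, 11700, 10800, 910, 900 and 10 respectively.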

The given value of A is*:
A="3day 3day/3h 15min/3h/15min 10sec/15min/10sec"
To compute A in seconds you can use bash's parameter expansion:
A=${A//day/*86400}
A=${A//h/*3600}
A=${A//min/*60}
A=${A//sec/*1}
A=${A//\//+}
A=${A// /+}
echo "A = $A"
echo "$A" | bc
Output:
A = 3*86400+3*86400+3*3600+15*60+3*3600+15*60+10*1+15*60+10*1
542720
* Note: here I changed the original value of A as provided by the OP, from
3 day/3 day/3h...
to
3day 3day/3h... # the rest is the same as OP's.
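If A instead holds a single raw value from systemctl, such as 3 day 18h, the same substitutions work once the detached day unit is glued to its number; a minimal sketch using bash arithmetic instead of bc:
A="3 day 18h"
A=${A// day/day}      # glue the detached "day" unit to its number
A=${A//day/*86400}
A=${A//h/*3600}
A=${A//min/*60}
A=${A//sec/*1}
A=${A// /+}           # remaining spaces separate terms
echo $(( A ))         # prints 324000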

Using awk to s/h/hours/ and to launch date +"%s" -d "1970-01-01 GMT + ..." to parse the time strings and count the seconds:
$ awk '{
    sub(/h/,"hours")                 # date does not understand bare "h"
    $1=""                            # remove the leading "1)" counter
    cmd = "date +\"%s\" -d \"1970-01-01 GMT + " $0 "\""
    cmd | getline s                  # run date and capture its output
    close(cmd)                       # close the pipe before the next line
    print s
}' file
324000
259200
11700
10800
910
900
10
for the data:
$ cat file
1) 3 day 18h
2) 3 day
3) 3h 15min
4) 3h
5) 15min 10sec
6) 15min
7) 10sec
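To feed the variable A from the original question instead of a file, the same pipe works without the $1 removal (a sketch along the same lines):
$ echo "$A" | awk '{
    sub(/h/,"hours")
    cmd = "date +\"%s\" -d \"1970-01-01 GMT + " $0 "\""
    cmd | getline s
    close(cmd)
    print s
}'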

Related

AWK comparing string date value from task list to today's date

I have a todo.txt task list that I'd like to filter so it shows only tasks scheduled for today or a future date, i.e. no tasks scheduled for past dates, and only tasks which have a date scheduled at all.
The file lines and their order change sometimes to include a 'threshold' date (think snooze/postpone the task until...) in the format t:YYYY-MM-DD (as printed by date +%Y-%m-%d), which says 'don't start this task until this date'.
Data file:
50 (A) Testing due date due:2018-09-22 t:2018-09-25
04 (B) Buy Socks, Underwear t:2018-09-22
05 (B) Buy Vaporizer t:2018-09-23 due:2018-09-22
16 (C) Watch Thor Ragnarock
12 (B) Pay Electric Bill due:2018-09-20 t:2018-09-25
x 2018-09-21 pri:B Buy Prebiotics +health #web due:2018-09-21
So far I've come up with this:
cat t | awk -F: -v date="$(date +%Y-%m-%d)" '/due:|t:/ $2 >= date || $3 >= date { print $0}'|
nl
The problem is that the date comparison works on the "due:" field, since it usually comes before the "t:" field. Also, entries older than today are output.
Output:
1 50 (A) Testing due date due:2018-09-22 t:2018-09-25
2 05 (B) Buy Vaporizer t:2018-09-23 due:2018-09-22
3 12 (B) Pay Electric Bill due:2018-09-20 t:2018-09-25
Questions:
How do I correctly make the date comparison against the "t:" value after the ":" separator when "t:" is present, and against the "due:" value when it is not?
Date greater-than (">") seems to work, but greater-or-equal (">=") does not.
$ cat tst.awk
{
    orig = $0
    sched = ""
    for (i=NF; i>0; i--) {
        if ( sub(/^t:/,"",$i) ) {
            sched = $i
            break
        }
        else if ( sub(/^due:/,"",$i) ) {
            sched = $i
        }
    }
    $0 = orig
}
sched >= date
$ awk -v date="$(date +%Y-%m-%d)" -f tst.awk file
50 (A) Testing due date due:2018-09-22 t:2018-09-25
05 (B) Buy Vaporizer t:2018-09-23 due:2018-09-22
12 (B) Pay Electric Bill due:2018-09-20 t:2018-09-25
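The sched >= date test is a plain string comparison, which is exactly why this works: ISO-style YYYY-MM-DD dates sort lexicographically in chronological order, and equality behaves as expected too. A quick sanity check:
$ awk 'BEGIN { print ("2018-09-22" >= "2018-09-22"), ("2018-09-20" >= "2018-09-22") }'
1 0
The loop scans fields from right to left and breaks as soon as it sees a t: value, so a threshold date always takes precedence over a due: date, which answers the first question.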

How to find the hdfs files time stamp to milli seconds level

Is there a way we can get the timestamps of files in HDFS at millisecond precision?
For example:
In Linux we can get the full timestamp like below:
$ ls --full-time
total 4
-rw-r--r--. 1 bigdatauser hadoop 0 2017-09-15 01:09:25.068425282 -0400 newfile1.txt
-rwxrwxrwx. 1 bigdatauser hadoop 106 2017-09-15 01:08:16.791844270 -0400 test.sh
If you use hdfs dfs -stat '%Y' you can see the time in milliseconds:
$ hdfs dfs -touchz /tmp/test_file
$ hdfs dfs -stat "%Y" /tmp/test_file
1506621031648
From http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html#stat:
Print statistics about the file/directory at <path> in the specified format. Format accepts filesize in blocks (%b), type (%F), group name of owner (%g), name (%n), block size (%o), replication (%r), user name of owner (%u), and modification date (%y, %Y). %y shows UTC date as "yyyy-MM-dd HH:mm:ss" and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default.
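If a human-readable timestamp with milliseconds is needed, the %Y value can be fed back into GNU date; a sketch (the @ form takes seconds, so the last three digits become the fractional part):
$ date -u -d @1506621031.648 +"%Y-%m-%d %H:%M:%S.%3N"
2017-09-28 17:50:31.648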

How to filter logs easily with awk?

Suppose I have a log file mylog like this:
[01/Oct/2015:16:12:56 +0200] error number 1
[01/Oct/2015:17:12:56 +0200] error number 2
[01/Oct/2015:18:07:56 +0200] error number 3
[01/Oct/2015:18:12:56 +0200] error number 4
[02/Oct/2015:16:12:56 +0200] error number 5
[10/Oct/2015:16:12:58 +0200] error number 6
[10/Oct/2015:16:13:00 +0200] error number 7
[01/Nov/2015:00:10:00 +0200] error number 8
[01/Nov/2015:01:02:00 +0200] error number 9
[01/Jan/2016:01:02:00 +0200] error number 10
And I want to find the lines logged between 1 Oct at 18:00 and 1 Nov at 01:00. That is, the expected output would be:
[01/Oct/2015:18:07:56 +0200] error number 3
[01/Oct/2015:18:12:56 +0200] error number 4
[02/Oct/2015:16:12:56 +0200] error number 5
[10/Oct/2015:16:12:58 +0200] error number 6
[10/Oct/2015:16:13:00 +0200] error number 7
[01/Nov/2015:00:10:00 +0200] error number 8
I have managed to convert the times to timestamps by using match() and then mktime(). The former finds the specified pattern and stores the capture groups in the array a[] so they can be accessed (see glenn jackman's answer on accessing a captured group from a line pattern for a good example). Since mktime() requires the format YYYY MM DD HH MM SS[ DST], I also have to convert the month from the form Xxx into a digit, for which I use an answer by Ed Morton to "convert month from Aaa to xx": awk '{printf "%02d\n",(match("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'.
All together, finally I have the timestamp in the variable mytimestamp:
awk 'match($0, /([0-9]+)\/([A-Z][a-z]{2})\/([0-9]{4}):([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}) ([+-][0-9]{4})/, a) {
    day=a[1]; month=a[2]; year=a[3];
    hour=a[4]; min=a[5]; sec=a[6]; utc=a[7];
    month=sprintf("%02d",(match("JanFebMarAprMayJunJulAugSepOctNovDec",month)+2)/3);
    mydate=sprintf("%s %s %s %s %s %s %s", year,month,day,hour,min,sec,utc);
    mytimestamp=mktime(mydate)
    print mytimestamp
}' mylog
Returns:
1443708776
1443712376
1443715676
etc.
So now I am ready to compare against the given dates. Since awk takes a lot of work to handle such formats, I prefer to provide them through external shell variables, using date -d"my date" +"%s" to print the timestamp:
start="$(date -d"1 Oct 2015 18:00 +0200" +"%s")"
end="$(date -d"1 Nov 2015 01:00 +0200" +"%s")"
All together, this works:
awk -v start="$(date -d"1 Oct 2015 18:00 +0200" +"%s")" -v end="$(date -d"1 Nov 2015 01:00 +0200" +"%s")" 'match($0, /([0-9]+)\/([A-Z][a-z]{2})\/([0-9]{4}):([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}) ([+-][0-9]{4})/, a) {day=a[1]; month=a[2]; year=a[3]; hour=a[4]; min=a[5]; sec=a[6]; utc=a[7]; month=sprintf("%02d",(match("JanFebMarAprMayJunJulAugSepOctNovDec",month)+2)/3); mydate=sprintf("%s %s %s %s %s %s %s", year,month,day,hour,min,sec,utc); mytimestamp=mktime(mydate); if (start<=mytimestamp && mytimestamp<=end) print}' mylog
[01/Oct/2015:18:07:56 +0200] error number 3
[01/Oct/2015:18:12:56 +0200] error number 4
[02/Oct/2015:16:12:56 +0200] error number 5
[10/Oct/2015:16:12:58 +0200] error number 6
[10/Oct/2015:16:13:00 +0200] error number 7
[01/Nov/2015:00:10:00 +0200] error number 8
However, this seems to be quite a bit of work for something that should be more straightforward. Nonetheless, the introduction of the "Time functions" section in man gawk reads:
Since one of the primary uses of AWK programs is processing log files
that contain time stamp information, gawk provides the following
functions for obtaining time stamps and formatting them.
So I wonder: is there any better way to do this? For example, what if the format were dd Mmm YYYY HH:MM:ss instead of dd/Mmm/YYYY:HH:MM:ss? Couldn't the match pattern be provided externally instead of having to be changed every time the format does? Do I really have to use match() and then process its output just to feed mktime()? Doesn't gawk provide a simpler way to do this?
Use ISO 8601 time format!
However, this seems to be quite a bit of work for something that should be more straight forward.
Yes, this should be straightforward, and the reason why it is not is that the logs do not use ISO 8601. Application logs should use the ISO format and UTC to display times; other settings should be considered broken and fixed.
Your request should be split into two parts. The first canonises the logs, converting dates to the ISO format; the second performs the search:
awk '
match($0, /([0-9]+)\/([A-Z][a-z]{2})\/([0-9]{4}):([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}) ([+-][0-9]{4})/, a) {
    day=a[1]
    month=a[2]
    year=a[3]
    hour=a[4]
    min=a[5]
    sec=a[6]
    utc=a[7]
    month=sprintf("%02d", (match("JanFebMarAprMayJunJulAugSepOctNovDec",month)+2)/3)
    myisodate=sprintf("%04d-%02d-%02dT%02d:%02d:%02d%s", year,month,day,hour,min,sec,utc)
    $1 = myisodate
    print
}' mylog
The nice thing about ISO 8601 dates, besides their being a standard, is that chronological order coincides with lexicographic order; therefore, you can use the /…/,/…/ operator to extract the dates you are interested in. For instance, to find what happened between 1 Oct 2015 18:00 +0200 and 1 Nov 2015 01:00 +0200, append the following filter to the previous, standardising filter:
awk '/2015-10-01T18:00:00+0200/,/2015-11-01T01:00:00+0200/'
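Note that the /start/,/end/ range form only switches on and off on lines that literally match those patterns, so it misses ranges whose exact boundary timestamps never occur in the log. A plain string comparison on the canonised first field is more robust (a sketch, assuming all timestamps carry the same UTC offset):
awk -v start="2015-10-01T18:00:00+0200" -v end="2015-11-01T01:00:00+0200" '$1 >= start && $1 <= end'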
Without getting into time formats (assuming all records are formatted the same), you can use a sort | awk combination to achieve the same with ease.
This assumes the logs are not ordered; it relies on sort's special M option to order month names and on awk to pick the range of interest. The sorting is based on year, month, and day, in that order.
$ sort -k1.9,1.12 -k1.5,1.7M -k1.2,1.3 log | awk '/01\/Oct\/2015/,/01\/Nov\/2015/'
You can easily extend this to include the time as well, and drop the sort if the file is already sorted.
The following adds the time constraint as well:
awk -F: '/01\/Oct\/2015/ && $2>=18{p=1}
/01\/Nov\/2015/ && $2>=1 {p=0} p'
I would use the date command inside awk to achieve this, though I have no idea how it would perform on large log files.
awk -F "[][]" -v start="$(date -d"1 Oct 2015 18:00 +0200" +"%s")"
-v end="$(date -d"1 Nov 2015 01:00 +0200" +"%s")" '{
gsub(/\//,"-",$2);sub(/:/," ",$2);
cmd="date -d\""$2"\" +%s" ;
cmd|getline mytimestamp;
close(cmd);
if (start<=mytimestamp && mytimestamp<=end) print
}' mylog

awk match between two patterns in an "if/else" statement

I've got an awk issue that I can't seem to figure out. I'm trying to parse out data from SAR and found that some systems are using a different locale, so I'm getting different output. The long-term solution is to change the locale on all systems so the output data is the same, but I have to parse through old data for now, and that is not currently an option. Here are the two types of data I get:
24-Hour Output:
21:10:01 all 8.43 0.00 1.81 2.00 0.00 87.76
21:20:01 all 7.99 0.00 1.74 0.82 0.00 89.44
21:30:01 all 8.35 0.00 1.76 0.94 0.00 88.95
12-Hour Output:
09:10:01 PM all 8.43 0.00 1.81 2.00 0.00 87.76
09:20:01 PM all 7.99 0.00 1.74 0.82 0.00 89.44
09:30:01 PM all 8.35 0.00 1.76 0.94 0.00 88.95
I need an awk statement that will get items from 7AM-7PM from all SAR data. I originally had something working, but once I found this issue, it broke for all the 24-hour output. I tried getting the awk statement to work, but the following doesn't and I can't figure out how to fix it:
awk '{ if ($2 == "AM" || $2 == "PM" && /07:00/,/07:00/) print $1" "$2; else '/07:00/,/19:00 print $1}' SAR_OUTPUT_FILE.txt
Basically, what I'm trying to output is: if it is in 24-hour format, search for 07:00-19:00 and return just the first column of output (since there is no "AM/PM" column). If it finds "AM/PM", I would consider that 12-hour format and want to get everything from 07:00 AM to 07:00 PM and return both the 1st and 2nd columns (time + "AM/PM").
Can anyone help me out here?
Without access to an awk with time functions (strftime() or mktime()), you can shift the 12-hour times so that they can be tested with the 24-hour time test.
Here's an awk executable that does that by adjusting the hours in the 12-hour formatted times to fit the 24-hour format. The result is put into variable t for every line and tested to be within the 24-hour range.
#!/usr/bin/awk -f
function timeShift( a, h ) {
    if(NF==9 && split($1, a, ":")==3) {   # 9 fields means an AM/PM record
        if(a[1]==12) h = $2=="PM"?"12":"00"
        else if($2=="PM") h = (a[1]+12)%24
        else h = a[1]
        return( h ":" a[2] ":" a[3] )
    }
    return( $1 )                          # already in 24-hour format
}
{ t = timeShift() }
t >= "07:00:00" && t <= "19:00:00"
If you need to print fewer fields than the full line, an action block could be added after the final expression.
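For instance, to mirror the output the question asks for (the time plus the AM/PM column when present), the final expression could carry an action block like this sketch:
t >= "07:00:00" && t <= "19:00:00" {
    print $1 (NF==9 ? " " $2 : "")   # append AM/PM only for 12-hour rows
}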

space delimited file handling

I have the insider transactions of a company in a space-delimited file. Sample data looks like the following:
1 Gilliland Michael S January 2,2013 20,000 19
2 Still George J Jr January 2,2013 20,000 19
3 Bishkin S. James February 1,2013 150,000 21
4 Mellin Mark P May 28,2013 238,000 25.26
Col1 is a serial number that I don't need to print.
Col2 is the name of the person who made the trades. This column is not consistent: it has a first name, a last name, a middle initial, and for some insiders a salutation as well (Mr, Dr, Jr, etc.).
Col3 is the date, in the format Month Day,Year.
Col4 is the number of shares traded.
Col5 is the price at which the shares were purchased or sold.
I need your help to print each column value separately. Thanks for your help.
Count the total number of fields read; the difference between that and the number of non-name fields gives you the width of the name.
#!/bin/bash
# uses bash features, so needs a /bin/bash shebang, not /bin/sh

# read all fields into an array
while read -r -a fields; do
    # calculate name width assuming 5 non-name fields
    name_width=$(( ${#fields[@]} - 5 ))
    cur_field=0
    # read initial serial number
    ser_id=${fields[cur_field]}; (( ++cur_field ))
    # read name
    name=''
    for ((i=0; i<name_width; i++)); do
        name+=" ${fields[cur_field]}"; (( ++cur_field ))
    done
    name=${name# } # trim leading space
    # date spans two fields due to containing a space
    date=${fields[cur_field]}; (( ++cur_field ))
    date+=" ${fields[cur_field]}"; (( ++cur_field ))
    # final fields are one span each
    num_shares=${fields[cur_field]}; (( ++cur_field ))
    price=${fields[cur_field]}; (( ++cur_field ))
    # print in newline-delimited form
    printf '%s\n' "$ser_id" "$name" "$date" "$num_shares" "$price" ""
done
Run as follows (if you saved the script as process):
./process <input.txt >output.txt
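For the first sample line, the script would emit each field on its own line, followed by a blank separator (a hand-traced sketch of the expected output):
1
Gilliland Michael S
January 2,2013
20,000
19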
It might be a little easier in perl.
perl -lane '
@date = splice @F, -4, 2;
@left = splice @F, -2, 2;
splice @F, 0, 1;
print join "|", "@F", "@date", @left
' file
Gilliland Michael S|January 2,2013|20,000|19
Still George J Jr|January 2,2013|20,000|19
Bishkin S. James|February 1,2013|150,000|21
Mellin Mark P|May 28,2013|238,000|25.26
You can change the delimiter in the join as per your requirement.
Here is the data separated using awk:
awk '{c1=$1;c5=$NF;c4=$(NF-1);c3=$(NF-3)FS$(NF-2);$1=$NF=$(NF-1)=$(NF-2)=$(NF-3)="";gsub(/^ | *$/,"");c2=$0;print c1"|"c2"|"c3"|"c4"|"c5}' file
1|Gilliland Michael S|January 2,2013|20,000|19
2|Still George J Jr|January 2,2013|20,000|19
3|Bishkin S. James|February 1,2013|150,000|21
4|Mellin Mark P|May 28,2013|238,000|25.26
You now have your data in variables c1 to c5.
Or better displayed here:
awk '{c1=$1;c5=$NF;c4=$(NF-1);c3=$(NF-3)FS$(NF-2);$1=$NF=$(NF-1)=$(NF-2)=$(NF-3)="";gsub(/^ | *$/,"");c2=$0;print c1"|"c2"|"c3"|"c4"|"c5}' file | column -t -s "|"
1 Gilliland Michael S January 2,2013 20,000 19
2 Still George J Jr January 2,2013 20,000 19
3 Bishkin S. James February 1,2013 150,000 21
4 Mellin Mark P May 28,2013 238,000 25.26
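If the five values are needed back in the shell rather than inside awk, the pipe-delimited output splits cleanly with read (a sketch, reusing the one-liner above):
awk '{c1=$1;c5=$NF;c4=$(NF-1);c3=$(NF-3)FS$(NF-2);$1=$NF=$(NF-1)=$(NF-2)=$(NF-3)="";gsub(/^ | *$/,"");c2=$0;print c1"|"c2"|"c3"|"c4"|"c5}' file |
while IFS='|' read -r c1 c2 c3 c4 c5; do
    printf '%s traded %s shares at %s on %s\n' "$c2" "$c4" "$c5" "$c3"
done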