awk match between two patterns in an "if/else" statement - if-statement

I've got an awk issue that I can't seem to figure out. I'm trying to parse out data from SAR and found that some systems are using a different locale, so I'm getting different output. The long-term solution is to change the locale on all systems so the output data is the same, but I have to parse through old data for now, and that is not currently an option. Here are the two types of data I get:
24-Hour Output:
21:10:01 all 8.43 0.00 1.81 2.00 0.00 87.76
21:20:01 all 7.99 0.00 1.74 0.82 0.00 89.44
21:30:01 all 8.35 0.00 1.76 0.94 0.00 88.95
12-Hour Output:
09:10:01 PM all 8.43 0.00 1.81 2.00 0.00 87.76
09:20:01 PM all 7.99 0.00 1.74 0.82 0.00 89.44
09:30:01 PM all 8.35 0.00 1.76 0.94 0.00 88.95
I need an awk statement that will get items from 7AM-7PM for all SAR data. I originally had something working, but once I found this issue, it broke for all the 24-hour output. I tried to get the awk statement to work, but the following doesn't work and I can't figure out how to fix it:
awk '{ if ($2 == "AM" || $2 == "PM" && /07:00/,/07:00/) print $1" "$2; else '/07:00/,/19:00 print $1}' SAR_OUTPUT_FILE.txt
Basically, what I'm trying to output is: if it is in 24-hour format, search for 07:00-19:00 and return just the first column of output (since there is no "AM/PM" column). If it finds "AM/PM", I would consider that 12-hour format and want to get everything from 07:00 AM to 07:00 PM and return both the 1st and 2nd columns (time + "AM/PM").
Can anyone help me out here?

Without access to an awk with time functions (strftime() or mktime()), you can shift the 12h times so that they can be tested with the same test as the 24h times.
Here's an awk executable that does that by adjusting the hours in the 12h-formatted times to fit the 24h format. The result is put into the variable t for every line, which is then tested to be in the 24h range.
#!/usr/bin/awk -f
function timeShift( a, h ) {
    if (NF==9 && split($1, a, ":")==3) {
        if (a[1]==12) h = ($2=="PM" ? "12" : "00")
        else if ($2=="PM") h = (a[1]+12)%24
        else h = a[1]
        return( h ":" a[2] ":" a[3] )
    }
    return( $1 )
}
{ t = timeShift() }
t >= "07:00:00" && t <= "19:00:00"
If you need to print fewer fields than the full line, an action block could be added after the final expression.
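For instance, here is a hedged sketch of such an action block, printing only the time column(s) for matching lines. The sample input lines are invented for illustration and are not the OP's full data:

```shell
# Demo of the timeShift approach plus an action block that prints only the
# time field(s). Sample data below is made up to exercise both formats.
cat > /tmp/timeshift.awk <<'AWK'
function timeShift(a, h) {
    if (NF == 9 && split($1, a, ":") == 3) {
        if (a[1] == 12) h = ($2 == "PM" ? "12" : "00")
        else if ($2 == "PM") h = (a[1] + 12) % 24
        else h = a[1]
        return h ":" a[2] ":" a[3]
    }
    return $1
}
{ t = timeShift() }
t >= "07:00:00" && t <= "19:00:00" { print $1 (NF == 9 ? (" " $2) : "") }
AWK
result=$(awk -f /tmp/timeshift.awk <<'DATA'
06:50:01 all 1.00 0.00 1.00 1.00 0.00 96.00
07:10:01 all 8.43 0.00 1.81 2.00 0.00 87.76
09:10:01 AM all 8.43 0.00 1.81 2.00 0.00 87.76
09:10:01 PM all 7.99 0.00 1.74 0.82 0.00 89.44
DATA
)
echo "$result"
# 07:10:01 and "09:10:01 AM" survive; 06:50 and the 9 PM row are filtered out.
```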

Related

How to do parsing of Elapsed time in seconds in linux

I want to parse elapsed time into seconds. The time formats are given below:
1) 3 day 18h
2) 3 day
3) 3h 15min
4) 3h
5) 15min 10sec
6) 15min
7) 10sec
I'm getting the values from systemctl status cassandra | awk '/(Active: active)/{print $9, $10,$11}' and storing the result in a variable A, like
A=$(systemctl status cassandra | awk '/(Active: active)/{print $9, $10,$11}')
Now A holds input like 3 day 18h or 3 day, etc. More examples:
A=3 day 18h or 3 day or 3h 15min or 3h or 15min 10sec or 15min or 10sec
Now, for the different values of A, parse them into seconds.
What you want to achieve could be done directly in awk using the following line:
$ systemctl status cassandra | awk '/(Active: active)/{s=$6" "$7;gsub(/-|:/," ",s); print systime() - mktime(s)}'
This will give you the running time directly based on the start-time and not on the approximated running time printed by systemctl.
If this approach does not work, then I suggest using the date command to do all the parsing. If you can change the h to hour in your examples, then you can do the following:
$ date -d "1970-01-01 + 3day 18hour 15min 16sec" +%s
324916
If you cannot, then I suggest the following. If duration is stored in the variable $duration, then you do
$ date -d "1970-01-01 + ${duration/h/hour}" +%s
Having spaces between the numbers and the strings day, h, min or sec does not matter.
The idea is that you ask date to compute everything for you, as %s returns the Unix time in seconds since 1970-01-01.
man date:
%s seconds since 1970-01-01 00:00:00 UTC
The given value of A is*:
A="3day 3day/3h 15min/3h/15min 10sec/15min/10sec"
To compute A in seconds you can use bash's parameter expansion:
A=${A//day/*86400}
A=${A//h/*3600}
A=${A//min/*60}
A=${A//sec/*1}
A=${A//\//+}
A=${A// /+}
echo "A = $A"
echo $A | bc
Output:
A = 3*86400+3*86400+3*3600+15*60+3*3600+15*60+10*1+15*60+10*1
542720
* Note here I changed the original value of A as provided by the OP. From
3 day/3 day/3h...
to
3day 3day/3h... # the rest is the same as OP's.
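As a variation on the same substitution idea, here is a hedged sketch (the to_seconds helper name is my own, not from the answers above) that uses sed plus shell arithmetic instead of bc, and tolerates the space in "3 day" so the original value does not have to be edited:

```shell
# Hypothetical helper: turn one duration string into seconds.
# Each " *" swallows an optional space before the unit, so "3 day" works.
to_seconds() {
    _e=$(printf '%s' "$1" | sed -e 's/ *day/*86400/g' -e 's/ *h/*3600/g' \
                                -e 's/ *min/*60/g'   -e 's/ *sec/*1/g' \
                                -e 's/ /+/g')
    # _e is now an arithmetic expression like 3*86400+18*3600
    echo "$(( $_e ))"
}
to_seconds "3 day 18h"    # 324000
to_seconds "15min 10sec"  # 910
to_seconds "10sec"        # 10
```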
Using awk to s/h/hours/ and to launch date +"%s" -d "1970-01-01 GMT +" to parse the time strings and count the seconds:
$ awk '{
    sub(/h/, "hours")   # date does not understand a bare "h"
    $1 = ""             # remove the leading "1)", "2)", ... label
    "date +\"%s\" -d \"1970-01-01 GMT + " $0 "\"" | getline s
    print s
}' file
324000
259200
11700
10800
910
900
10
for the data:
$ cat file
1) 3 day 18h
2) 3 day
3) 3h 15min
4) 3h
5) 15min 10sec
6) 15min
7) 10sec

Multiple matches with regex

Let's say I have a long log with something like this:
-----------1------------
path/to/file1
real 0.21
user 0.01
sys 0.02
11378688 maximum resident set size
-----------2------------
path/to/file2
real 0.21
user 0.01
sys 0.02
11378688 maximum resident set size
-----------3------------
path/to/file3
real 0.21
user 0.01
sys 0.02
11378688 maximum resident set size
-----------4------------
path/to/file4
ERROR: Lorem ipsum error
ERROR2: Lorem ipsum error 2
real 0.59
user 0.01
sys 0.02
11378688 maximum resident set size
I want to extract the path to the file, the error if any, the time after "real", and the memory used, then transform them into a format like this: "path time memory".
I've made this regex:
-*(?:[0-9]*)-*\n(.*)\n((?:.*\n)*)?real\s*(.*)\n.*\n.*\n\s*(.*)\s\s.*\n
But it only matches a single log entry (also capturing the errors if there are any), i.e. only:
-----------1------------
path/to/file1
real 0.21
user 0.01
sys 0.02
11378688 maximum resident set size
And nothing after that.
Can someone show me the direction? I am trying it on http://www.regex101.com
Languages: c/c++, bash, java, python, go
A way to do with a perl one-liner:
perl -0777 -ne '@l = /-+\d+-+\n([\s\S]*?)\nreal.*?([\d.]+)\n[\s\S]+?(\d+)\s+maximum.*(\n)/g; print "@l";' in1.txt
Output:
path/to/file1 0.21 11378688
path/to/file2 0.21 11378688
path/to/file3 0.21 11378688
path/to/file4
ERROR: Lorem ipsum error
ERROR2: Lorem ipsum error 2 0.59 11378688
You can use this:
-+(?:[0-9]*)-+\n(.*)\n((?:ERROR.*\n)*)real\s*(.*)\n.*\n.*\n\s*(.*)\s\s.*\n?
I replaced the * with + at the beginning because you are sure that there will be repetitions.
Later we can explicitly check if there are any errors and capture them.
Lastly, I made the last \n optional, since the missing newline at the end of the file broke the last group.
Here is a link for you to see if it works for you: https://regex101.com/r/jI5yV8/1
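Since Python is on the asker's list of languages, the same extraction can also be sketched with re.findall. This is a minimal, self-contained example using an abridged copy of the log from the question:

```python
import re

# Abridged copy of the log from the question, for illustration only.
log = """-----------1------------
path/to/file1
real 0.21
user 0.01
sys 0.02
11378688 maximum resident set size
-----------4------------
path/to/file4
ERROR: Lorem ipsum error
ERROR2: Lorem ipsum error 2
real 0.59
user 0.01
sys 0.02
11378688 maximum resident set size
"""

pattern = re.compile(
    r"-+\d+-+\n"            # entry separator line
    r"(.*)\n"               # path
    r"((?:ERROR.*\n)*)"     # zero or more ERROR lines
    r"real\s*([\d.]+)\n"    # time after "real"
    r"(?:.*\n){2}"          # skip the user and sys lines
    r"\s*(\d+)\s+maximum"   # memory figure
)

for path, errors, seconds, memory in pattern.findall(log):
    print(path, seconds, memory)
# -> path/to/file1 0.21 11378688
# -> path/to/file4 0.59 11378688
```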

Search for multiple strings in many text files, count hits on combinations

I'm struggling to automate a reporting exercise, and would appreciate some pointers or advice please.
I have several hundred thousand small (<5kb) text files. Each contains a few variables, and I need to count the number of files that match each combination of variables.
Each file contains a device number, such as /001/ /002/.../006/.
Each file also contains a date string, such as 01.10.14 (dd.mm.yy)
Some files contain a 'status' string which is always "Not Settled"
I need a way to trawl through each file on a Linux system (spread across several subdirectories), and produce a report file that counts 'per device' how many files include each date stamp (6 month range) and for each of those dates, how many contain the status string.
The report might look like this:
device, date, total count of files
device, date, total "not settled" count
e.g.
/001/, 01.12.14, 356
/001/, 01.12.14, 12
/001/, 02.12.14, 209
/001/, 02.12.14, 8
/002/, 01.12.14, 209
/002/, 01.12.14, 7
etc etc
In other words:
Foreach /device/
Foreach <date>
count total matching files - write number to file
count total matching 'not settled' files - write number to file
Each string to match could appear anywhere in the file.
I tried using grep piped into a second (and third) grep command, but I'd like to automate this and loop through the variables (6 devices, about 180 dates, 2 status strings). I suspect Perl or Bash is the answer, but I'm out of my depth.
Please can anyone recommend an approach to this?
Edit: Some sample data as mentioned in the comments. The information is basically receipt data from tills - as would be sent to a printer. Here's a sample (identifying bits stripped out).
c0! SUBTOTAL 11.37
c0! ! T O T A L 11.37!
c0! 19 ITEMS
c0! C a s h ? 11.37
vu p022c0!
c0! NET TOTAL VAT A 10.87
c0! VAT 00.0% 0.00
c0! NET TOTAL VAT B 0.42
c0! VAT 20.0% 0.08
c0! *4300 772/080/003/132 08.01.15 11:18 A-00
Contents = Not Settled
In the case above, I'd be looking for /003/ , 08.01.15, and "Not Settled"
Many thanks.
First, read everything into an SQLite database, then run queries against it to your heart's content. Putting the data in an SQL database is going to save you time if you need to tweak anything. Besides, even simple SQL can tackle this kind of thing if you have the right tables set up.
First of all I agree with @Sinan :-)
The following might work as hack to make a hash out of your file data.
# report.pl
use strict;
use warnings;
use Data::Dumper;

my %report;
my ($date, $device);

while (<>) {
    next unless m/^ .*
                  (?<device>\/00[1-3]\/) .*
                  (?<date>\d{2}\.\d{2}\.\d{2})
                  .*$/x;
    ($date, $device) = ($+{date}, $+{device});
    $_ = <> unless eof;
    if (/Contents/) {
        $report{$date}{$device}{"u_count"}++;
    }
    else {
        $report{$date}{$device}{"count"}++;
    }
}
print Dumper(\%report);
This seems to work with a collection of data files in the format shown below (since you don't say or show where the Contents = Not Settled appears, I assume it is either part of the last line along with the device ID or in a separate and final line for each file).
Explanation:
The script reads all the files passed as a glob through the while (<>) {} loop. First, next unless m/ ... skips lines of input until it matches the line with the device and date information.
Next, the match uses named capture groups ((?<device>) and (?<date>)) to hold the values of the patterns it finds, and places those values in the corresponding variables (($date, $device) = ($+{date}, $+{device});). These could simply be $1 and $2, but naming keeps me organized here.
Then, in case there is another line to read, $_ = <> unless eof; reads it and tries the final conditional match in order to increment the count or u_count entries.
Data file format:
file1.data
c0! SUBTOTAL 11.37
c0! ! T O T A L 11.37! c0! 19 ITEMS
c0! C a s h ? 11.37
vu p022c0!
c0! NET TOTAL VAT A 10.87
c0! VAT 00.0% 0.00
c0! NET TOTAL VAT B 0.42
c0! VAT 20.0% 0.08
c0! *4300 772/080/003/132 08.01.15 11:18 A-00
file2.data
c0! SUBTOTAL 11.37
c0! ! T O T A L 11.37! c0! 19 ITEMS
c0! C a s h ? 11.37
vu p022c0!
c0! NET TOTAL VAT A 10.87
c0! VAT 00.0% 0.00
c0! NET TOTAL VAT B 0.42
c0! VAT 20.0% 0.08
c0! *4300 772/080/002/132 08.01.15 11:18 A-00
Contents = Not Settled
(a set of files for testing are listed here: http://pastebin.com/raw.php?i=7ALU80fE).
perl report.pl file*.data
Data::Dumper Output:
$VAR1 = {
          '08.01.15' => {
                          '/002/' => {
                                       'u_count' => 4
                                     },
                          '/003/' => {
                                       'count' => 1
                                     }
                        },
          '08.12.15' => {
                          '/003/' => {
                                       'count' => 1
                                     }
                        }
        };
From that you can make a report by iterating through the hash with keys() (the date) and retrieving the inner hash and count values per machine. Really it would be a good idea to have some tests to make sure everything works as expected - that, or just do as @sinan_Ünür suggests: use SQLite!
NB: this code was not extensively tested :-)

Python 2.7 Pandas: How to replace a for-loop?

I have a large pandas dataframe with 2000 rows (one date per row) and 2000 columns (1 second intervals). Each cell represents a temperature reading.
Starting with the 5th row, I need to go back 5 rows and find all the observations where the 1st column in the row is higher than the 2nd column in the row.
For the 5th row I may find 2 such observations. I then want to do summary stats on the observations and append those summary stats to a list.
Then I go to the 6th row and go back 5 rows to find all the obvs where the 1st column is higher than the 2nd column. I get all the obvs, do summary stats on them, and append the results to the new dataframe.
So for each row in the dataframe, I want to go back 5 days, get the obvs, get the stats, and append the stats to a dataframe.
The problem is that if I perform this operation on rows 5 -2000, then I will have a for-loop that is 1995 cycles long, and this takes a while.
What is the better or best way to do this?
Here is the code:
print huge_dataframe
sec_1 sec_2 sec_3 sec_4 sec_5
2013_12_27 0.05 0.12 0.06 0.15 0.14
2013_12_28 0.06 0.32 0.56 0.14 0.17
2013_12_29 0.07 0.52 0.36 0.13 0.13
2013_12_30 0.02 0.12 0.16 0.55 0.12
2013_12_31 0.06 0.30 0.06 0.14 0.01
2014_01_01 0.05 0.12 0.06 0.15 0.14
2014_01_02 0.06 0.32 0.56 0.14 0.17
2014_01_03 0.07 0.52 0.36 0.13 0.13
2014_01_04 0.02 0.12 0.16 0.55 0.12
2014_01_05 0.06 0.30 0.06 0.14 0.01
for each row in huge_dataframe.ix[5:]:
    move = row[sec_1] - row[sec_2]
    if move < 0: move = 'DOWN'
    elif move > 0: move = 'UP'
    relevant_dataframe = huge_dataframe.ix[only the 5 rows preceding the current row]
    if move == 'UP':
        mask = relevant_dataframe[sec_1 < sec_2]  # creates a boolean dataframe
        observations_df = relevant_dataframe[mask]
    elif move == 'DOWN':
        mask = relevant_dataframe[sec_1 > sec_2]  # creates a boolean dataframe
        observations_df = relevant_dataframe[mask]
    # At this point I have observations_df which is only filled
    # with rows where sec_1 < sec_2 or the opposite, depending on
    # which row I am in.
    summary_stats = str(observations_df.describe())
    summary_list.append(summary_stats)  # This is the goal: I want to
                                        # ultimately turn the list into
                                        # a dataframe
Since there is no code to create the data, I will just sketch the code that I would try to make work. Generally, try to avoid row-wise operations whenever you can. I had no clue at first either, but I got interested and some research yielded TimeGrouper:
df = big_dataframe
df['move'] = df['sec_1'] > df['sec_2']

def foobarRules(group):
    # keep in mind that in here, you refer not to "relevant_dataframe" but to "group"
    if (group.tail(1).move == True):
        pass  # some logic
    else:
        pass  # some other logic
    return str(group.describe())

grouper = TimeGrouper('5D')
allMyStatistics = df.groupby(grouper).apply(foobarRules)
I have honestly no clue how the return works if you return a multi-dimensional dataframe. I know it works well if you return either a row or a column, but if you return a dataframe that contains both rows and columns for every group - I guess pandas is smart enough to compute a panel of all these. Well, you will find out.
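For what it's worth, here is a runnable sketch along the same lines with current pandas, where pd.Grouper replaces the old TimeGrouper; the frame and the rule inside the function are toy stand-ins, not the OP's real data or logic:

```python
import pandas as pd

# Toy stand-in for the big temperature frame (dates on the index).
df = pd.DataFrame(
    {"sec_1": [0.05, 0.06, 0.07, 0.02, 0.06, 0.05],
     "sec_2": [0.12, 0.32, 0.52, 0.12, 0.30, 0.12]},
    index=pd.to_datetime(["2013-12-27", "2013-12-28", "2013-12-29",
                          "2013-12-30", "2013-12-31", "2014-01-01"]),
)

def rules(group):
    # The last row of the window decides which mask to apply.
    up = group["sec_1"].iloc[-1] < group["sec_2"].iloc[-1]
    mask = group["sec_1"] < group["sec_2"] if up else group["sec_1"] > group["sec_2"]
    # Stand-in summary statistic: mean of sec_1 over the masked rows.
    return group.loc[mask, "sec_1"].mean()

# pd.Grouper(freq="5D") bins the index into consecutive 5-day windows
# (disjoint bins, like TimeGrouper - not a rolling window).
stats = df.groupby(pd.Grouper(freq="5D")).apply(rules)
print(stats)
```

Note that this groups the rows into *disjoint* 5-day bins; a true trailing 5-row window would need `df.rolling(5)` instead.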

get the value from columns in two files

My original observations look like that:
name Analyte
spring 0.1
winter 0.4
To calculate p-value I did bootstrapping simulation:
name Analyte
spring 0.001
winter 0
spring 0
winter 0.2
spring 0.03
winter 0
spring 0.01
winter 0.02
spring 0.1
winter 0.5
spring 0
winter 0.04
spring 0.2
winter 0
spring 0
winter 0.06
spring 0
winter 0
.....
Now I want to calculate an empirical p-value. In the original data, the winter Analyte = 0.4. If in the bootstrapped data the winter analyte was >= 0.4 (for example 1 time) and bootstrapping was done (for example 100 times), then the empirical p-value for the winter analyte is calculated:
1/100 = 0.01
(How many times data was the same or higher than in original data
divided by total number of observations)
For spring analyte p-value is:
2/100 = 0.02
I want to calculate those p-values with awk.
My solution for spring is:
awk -v VAR="spring" '($1==VAR && $2>=0.1) {n++} END {print VAR,"p-value=",n/100}'
spring p-value= 0.02
The help I need is to pass the original file (with the names spring and winter, their analytes, and the number of observations) into awk and assign those values.
Explanation and script content:
Run it like: awk -f script.awk original bootstrap
# Slurp the original file into an array a,
# ignoring the header
NR==FNR && NR>1 {
    # The index of this array is the type;
    # its value is the original value
    a[$1]=$2
    next
}
# If, in the bootstrap file, the value of the
# second column is greater than the original value ...
FNR>1 && $2>a[$1] {
    # ... increment an array indexed by the first column,
    # which is nothing but the type
    b[$1]++
}
# Increment another array regardless, to count
# the number of times bootstrapping was done
{
    c[$1]++
}
# For each type in array a ...
END {
    for (type in a) {
        # ... print the type and the empirical p-value,
        # i.e. the number of times a higher value of that
        # type was seen, divided by the total number of
        # times bootstrapping was done
        print type, b[type]/c[type]
    }
}
Test:
$ cat original
name Analyte
spring 0.1
winter 0.4
$ cat bootstrap
name Analyte
spring 0.001
winter 0
spring 0
winter 0.2
spring 0.03
winter 0
spring 0.01
winter 0.02
spring 0.1
winter 0.5
spring 0
winter 0.04
spring 0.2
winter 0
spring 0
winter 0.06
spring 0
winter 0
$ awk -f s.awk original bootstrap
spring 0.111111
winter 0.111111
Analysis:
Spring Original Value is 0.1
Winter Original Value is 0.4
Bootstrapping done is 9 times for this sample file
Count of values higher than Spring original value = 1
Count of values higher than Winter's original value = 1
So, 1/9 = 0.111111
This works for me (GNU awk 3.1.6):
FNR == NR {
    a[$1] = $2
    next
}
$2 > a[$1] {
    b[$1]++
}
{
    c[$1]++
}
END {
    for (i in a) print i, "p-value=", b[i]/c[i]
}
The output is:
winter p-value= 0.111111
spring p-value= 0.111111
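A self-contained rerun of the same two-file idea shows the mechanics; the tiny bootstrap sample here is made up for the demo (and, like the answers above, it uses the strictly-greater-than test):

```shell
# Hypothetical miniature of the two-file awk approach; data invented for demo.
cat > /tmp/orig <<'EOF'
name Analyte
spring 0.1
winter 0.4
EOF
cat > /tmp/boot <<'EOF'
name Analyte
spring 0.001
winter 0.5
spring 0.2
winter 0
EOF
result=$(awk '
    NR==FNR && NR>1 { a[$1]=$2; next }   # slurp original values
    FNR>1 && $2>a[$1] { b[$1]++ }        # count higher bootstrap values
    { c[$1]++ }                          # count bootstrap rounds per type
    END { for (t in a) print t, b[t]/c[t] }
' /tmp/orig /tmp/boot | sort)
echo "$result"
# Each type saw 1 higher value out of 2 rounds, so both p-values are 0.5.
```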