Multiple matches with regex

Let's say I have a long log with something like this:
-----------1------------
path/to/file1
real 0.21
user 0.01
sys 0.02
11378688 maximum resident set size
-----------2------------
path/to/file2
real 0.21
user 0.01
sys 0.02
11378688 maximum resident set size
-----------3------------
path/to/file3
real 0.21
user 0.01
sys 0.02
11378688 maximum resident set size
-----------4------------
path/to/file4
ERROR: Lorem ipsum error
ERROR2: Lorem ipsum error 2
real 0.59
user 0.01
sys 0.02
11378688 maximum resident set size
I want to extract the path to the file, the error (if any), the time after "real", and the memory used, then transform them into a format like this: "path time memory"
I've made this regex:
-*(?:[0-9]*)-*\n(.*)\n((?:.*\n)*)?real\s*(.*)\n.*\n.*\n\s*(.*)\s\s.*\n
But it only matches when there is a single log entry (it also captures the errors if there are any), i.e. only:
-----------1------------
path/to/file1
real 0.21
user 0.01
sys 0.02
11378688 maximum resident set size
And nothing after that.
Can someone point me in the right direction? I am testing it on http://www.regex101.com
Languages: c/c++, bash, java, python, go

A way to do it with a Perl one-liner (-0777 slurps the whole file, so the pattern can match every log entry in one pass):
perl -0777 -ne '@l = /-+\d+-+\n([\s\S]*?)\nreal.*?([\d.]+)\n[\s\S]+?(\d+)\s+maximum.*(\n)/g; print "@l";' in1.txt
Output:
path/to/file1 0.21 11378688
path/to/file2 0.21 11378688
path/to/file3 0.21 11378688
path/to/file4
ERROR: Lorem ipsum error
ERROR2: Lorem ipsum error 2 0.59 11378688

You can use this:
-+(?:[0-9]*)-+\n(.*)\n((?:ERROR.*\n)*)real\s*(.*)\n.*\n.*\n\s*(.*)\s\s.*\n?
I replaced the * with + at the beginning because you are sure there will be repetitions.
Then we explicitly check whether there are any ERROR lines and capture them.
Lastly, I made the last \n optional, since it broke the last group (there is no newline at the end of the file).
Here is a link for you to see if it works for you: https://regex101.com/r/jI5yV8/1
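Since Python is one of the listed languages, here is a minimal sketch of the same multi-match idea using re.findall; the file name log.txt is just a placeholder, and the pattern is a lightly adjusted variant of the ones above:
import re

with open('log.txt') as f:  # placeholder file name
    text = f.read()

pattern = re.compile(
    r'-+\d+-+\n(.*)\n((?:ERROR.*\n)*)real\s*(\S+)\n.*\n.*\n\s*(\d+)\s+maximum.*\n?')

for path, errors, real_time, memory in pattern.findall(text):
    # `errors` holds any ERROR lines for this entry; print "path time memory"
    print(path, real_time, memory)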

Related

Show Print Format in Jupyter Widgets

I have a result from classification_report in sklearn.metrics, and when I print the report it looks like:
precision recall f1-score support
1 1.00 0.84 0.91 43
2 0.12 1.00 0.22 1
avg / total 0.98 0.84 0.90 44
Now, the question is how can I show the result in a Jupyter widget (in the above format) and update its value?
Currently, I am using html widgets to show the result:
#pass test and result vectors
report = classification_report(pred_test , self.y_test_data)
predict_table = widgets.HTML(value = "")
predict_table.value = report
but it looks like the following:
precision recall f1-score support 1 1.00 0.81 0.90 43 2 0.00 0.00 0.00 0 avg / total 1.00 0.81 0.90 43
I found a simple solution using HTML! Since we are using an HTML widget in Jupyter, the problem can be solved by using the pre tag:
predict_table.value = "<pre>" + report + "</pre>"
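Putting it together, a minimal end-to-end sketch (assuming ipywidgets is available and report already holds the classification_report string) looks like:
import ipywidgets as widgets
from IPython.display import display

predict_table = widgets.HTML(value="")
display(predict_table)
# wrapping the report in <pre> preserves the monospaced column alignment
predict_table.value = "<pre>" + report + "</pre>"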

Counting gradient using 2 columns array from external .dat file

I have a .dat file with 2 columns and between 14000 and 36000 rows, saved like below:
0.00 0.00
2.00 1.00
2.03 1.01
2.05 1.07
.
.
.
79.03 23.01
The 1st column is extension, the 2nd is strain. When I want to compute the gradient to determine Hooke's law from the plot, I use the code below.
CCCCCC
      Program gradient
      REAL S(40000),E(40000),GRAD(40000,1)
      open(unit=300, file='Probka1A.dat', status='OLD')
      open(unit=321, file='result.out', status='unknown')
      write(321,400)
  400 format('alfa')
  260 DO 200 i=1, 40000
      read(300,30) S(i),E(i)
   30 format(2F7.2)
      GRAD(i,1)=(S(i)-S(i-1))/(E(i)-E(i-1))
      write(321,777) GRAD(i,1)
  777 Format(F7.2)
  200 Continue
      END
But after I executed it I got this error:
PGFIO-F-231/formatted read/unit=300/error on data conversion.
File name = Probka1A.dat formatted, sequential access record = 1
In source file gradient1.f, at line number 9
What can I do to compute the gradient this way, or another way, in Fortran 77?
You are reading from the file without checking for the end of the file. Your code should be like this:
  260 DO 200 i=1, 40000
      read(300,*,ERR=500,END=500) S(i),E(i)
      if (i>1) then
        GRAD(i-1,1)=(S(i)-S(i-1))/(E(i)-E(i-1))
        write(321,777) GRAD(i-1,1)
      end if
  777 Format(F7.2)
  200 Continue
  500 continue

awk match between two patterns in an "if/else" statement

I've got an awk issue that I can't seem to figure out. I'm trying to parse out data from SAR and found that some systems are using a different locale, so I'm getting different output. The long-term solution is to change the locale on all systems so the output data is the same, but I have to parse through old data for now, so that is not currently an option. Here are the two types of data I get:
24-Hour Output:
21:10:01 all 8.43 0.00 1.81 2.00 0.00 87.76
21:20:01 all 7.99 0.00 1.74 0.82 0.00 89.44
21:30:01 all 8.35 0.00 1.76 0.94 0.00 88.95
12-Hour Output:
09:10:01 PM all 8.43 0.00 1.81 2.00 0.00 87.76
09:20:01 PM all 7.99 0.00 1.74 0.82 0.00 89.44
09:30:01 PM all 8.35 0.00 1.76 0.94 0.00 88.95
I need an awk statement that will get items from 7AM-7PM for all SAR data. I originally had something working, but once I found this issue, it breaks for all the 24-hour output. I tried getting the awk statement to work, but the following doesn't work and I can't figure out how to fix it:
awk '{ if ($2 == "AM" || $2 == "PM" && /07:00/,/07:00/) print $1" "$2; else '/07:00/,/19:00 print $1}' SAR_OUTPUT_FILE.txt
Basically, what I'm trying to output is: if it is in 24-hour format, search for 07:00-19:00 and return just the first column of output (since there is no "AM/PM" column). If it finds "AM/PM", I would consider that 12-hour format and want to get everything from 07:00-07:00 and return both the 1st and 2nd columns (time + "AM/PM").
Can anyone help me out here?
Without access to an awk with time functions (strftime() or mktime()), you can shift the 12h-format times so that they can be tested with the same 24h range test.
Here's an awk executable that does that by adjusting the hours in the 12h formatted times to fit 24h time formats. The result is put into variable t for every line and is tested to be in the 24h range.
#!/usr/bin/awk -f

function timeShift( a, h ) {
    if (NF==9 && split($1, a, ":")==3) {
        if (a[1]==12) h = $2=="PM" ? "12" : "00"
        else if ($2=="PM") h = (a[1]+12)%24
        else h = a[1]
        return( h ":" a[2] ":" a[3] )
    }
    return( $1 )
}

{ t = timeShift() }

t >= "07:00:00" && t <= "19:00:00"
If you need to print fewer fields than the full line, an action block could be added after the final expression.
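For example, if the script is saved as timeshift.awk (the name is just illustrative) and made executable, it can be run as
./timeshift.awk SAR_OUTPUT_FILE.txt
or, without the executable bit, as
awk -f timeshift.awk SAR_OUTPUT_FILE.txt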

Python 2.7 Pandas: How to replace a for-loop?

I have a large pandas dataframe with 2000 rows (one date per row) and 2000 columns (1 second intervals). Each cell represents a temperature reading.
Starting with the 5th row, I need to go back 5 rows and find all the observations where the 1st column in the row is higher than the 2nd column in the row.
For the 5th row I may find 2 such observations. I then want to do summary stats on the observations and append those summary stats to a list.
Then I go to the 6th row, go back 5 rows, and find all the observations where the 1st column is higher than the 2nd column. I get all the observations, do summary stats on them, and append the results to the new dataframe.
So for each row in the dataframe, I want to go back 5 days, get the observations, get the stats, and append the stats to a dataframe.
The problem is that if I perform this operation on rows 5-2000, I will have a for-loop that is 1995 cycles long, and this takes a while.
What is the better or best way to do this?
Here is the code:
print huge_dataframe
sec_1 sec_2 sec_3 sec_4 sec_5
2013_12_27 0.05 0.12 0.06 0.15 0.14
2013_12_28 0.06 0.32 0.56 0.14 0.17
2013_12_29 0.07 0.52 0.36 0.13 0.13
2013_12_30 0.02 0.12 0.16 0.55 0.12
2013_12_31 0.06 0.30 0.06 0.14 0.01
2014_01_01 0.05 0.12 0.06 0.15 0.14
2014_01_02 0.06 0.32 0.56 0.14 0.17
2014_01_03 0.07 0.52 0.36 0.13 0.13
2014_01_04 0.02 0.12 0.16 0.55 0.12
2014_01_05 0.06 0.30 0.06 0.14 0.01
for each row in huge_dataframe.ix[5:]:
    move = row[sec_1] - row[sec_2]
    if move < 0: move = 'DOWN'
    elif move > 0: move = 'UP'
    relevant_dataframe = huge_dataframe.ix[only the 5 rows preceding the current row]
    if move == 'UP':
        mask = relevant_dataframe[sec_1 < sec_2]  # creates a boolean dataframe
        observations_df = relevant_dataframe[mask]
    elif move == 'DOWN':
        mask = relevant_dataframe[sec_1 > sec_2]  # creates a boolean dataframe
        observations_df = relevant_dataframe[mask]
    # At this point I have observations_df, which is only filled
    # with rows where sec_1 < sec_2 or the opposite, depending on which
    # row I am in.
    summary_stats = str(observations_df.describe())
    summary_list.append(summary_stats)  # This is the goal:
                                        # I want to ultimately
                                        # turn the list into a
                                        # dataframe
Since there is no code to create the data, I will just sketch the code that I would try to make work. Generally, try to avoid row-wise operations whenever you can. I first had no clue either, but then I got interested and some research yielded TimeGrouper:
import pandas as pd

df = big_dataframe
df['move'] = df['sec_1'] > df['sec_2']

def foobarRules(group):
    # keep in mind that in here, you refer not to "relevant_dataframe", but to "group"
    if group.tail(1)['move'].item():
        # some logic
        pass
    else:
        # some other logic
        pass
    return str(group.describe())

grouper = pd.TimeGrouper('5D')
allMyStatistics = df.groupby(grouper).apply(foobarRules)
I have honestly no clue how the return works if you return a multi-dimensional dataframe. I know it works well if you return either a row or a column, but if you return a dataframe that contains both rows and columns for every group - I guess pandas is smart enough to compute a panel of all these. Well, you will find out.
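As a side note, more recent pandas versions deprecate TimeGrouper in favour of pd.Grouper, so the equivalent grouping call (a sketch only, with the same names as above) would be:
grouper = pd.Grouper(freq='5D')
allMyStatistics = df.groupby(grouper).apply(foobarRules)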

get the value from columns in two files

My original observations look like this:
name Analyte
spring 0.1
winter 0.4
To calculate the p-value I did a bootstrapping simulation:
name Analyte
spring 0.001
winter 0
spring 0
winter 0.2
spring 0.03
winter 0
spring 0.01
winter 0.02
spring 0.1
winter 0.5
spring 0
winter 0.04
spring 0.2
winter 0
spring 0
winter 0.06
spring 0
winter 0
.....
Now I want to calculate the empirical p-value: in the original data the winter Analyte = 0.4. If in the bootstrapped data the winter analyte was >= 0.4 (for example 1 time) and bootstrapping was done (for example) 100 times, then the empirical p-value for the winter analyte is calculated as:
1/100 = 0.01
(the number of times the value was the same as or higher than in the original data,
divided by the total number of observations)
For the spring analyte the p-value is:
2/100 = 0.02
I want to calculate those p-values with awk.
My solution for spring is:
awk -v VAR="spring" '($1==VAR && $2>=0.1) {n++} END {print VAR,"p-value=",n/100}'
spring p-value= 0.02
The help I need is with passing the original file (with the names spring and winter, their analytes, and the number of observations) into awk and assigning those.
Explanation and script content:
Run it like: awk -f script.awk original bootstrap
# Slurp the original file into an array a
# Ignore the header
NR==FNR && NR>1 {
    # Index of this array will be the type
    # Value for that type will be the original value
    a[$1]=$2
    next
}

# If in the bootstrap file the value
# of the second column is greater than the original value
FNR>1 && $2>a[$1] {
    # Increment an array indexed at the first column,
    # which is nothing but the type
    b[$1]++
}

# Increment another array regardless, to identify
# the number of times bootstrapping was done
{
    c[$1]++
}

# For each type in array a
END {
    for (type in a) {
        # Print the type and calculate the empirical p-value,
        # which is done by dividing the number of times a higher value
        # of that type was seen by the total number of times
        # bootstrapping was done.
        print type, b[type]/c[type]
    }
}
Test:
$ cat original
name Analyte
spring 0.1
winter 0.4
$ cat bootstrap
name Analyte
spring 0.001
winter 0
spring 0
winter 0.2
spring 0.03
winter 0
spring 0.01
winter 0.02
spring 0.1
winter 0.5
spring 0
winter 0.04
spring 0.2
winter 0
spring 0
winter 0.06
spring 0
winter 0
$ awk -f s.awk original bootstrap
spring 0.111111
winter 0.111111
Analysis:
Spring Original Value is 0.1
Winter Original Value is 0.4
Bootstrapping done is 9 times for this sample file
Count of values higher than Spring original value = 1
Count of values higher than Winter's original value = 1
So, 1/9 = 0.111111
This works for me (GNU awk 3.1.6):
FNR == NR {
    a[$1] = $2
    next
}

$2 > a[$1] {
    b[$1]++
}

{
    c[$1]++
}

END {
    for (i in a) print i, "p-value=", b[i]/c[i]
}
..output is:
winter p-value= 0.111111
spring p-value= 0.111111