get the value from columns in two files - regex

My original observations look like that:
name Analyte
spring 0.1
winter 0.4
To calculate p-value I did bootstrapping simulation:
name Analyte
spring 0.001
winter 0
spring 0
winter 0.2
spring 0.03
winter 0
spring 0.01
winter 0.02
spring 0.1
winter 0.5
spring 0
winter 0.04
spring 0.2
winter 0
spring 0
winter 0.06
spring 0
winter 0
.....
Now I want to calculate empirical p-value: In original data winter Analyte = 0.4 - if in bootstrapped data winter analyte was >=0.4 (for example 1 time) and bootstrapping was done (for example 100 times), then empirical p-value for winter analyte is calculated:
1/100 = 0.01
(How many times data was the same or higher than in original data
divided by total number of observations)
For spring analyte p-value is:
2/100 = 0.02
I want to calculate those p-values with awk.
My solution for spring is:
awk -v VAR="spring" '($1==VAR && $2>=0.1) {n++} END {print VAR,"p-value=",n/100}'
spring p-value= 0.02
The help I need is to pass original file (with names spring and winter and their analytes, observations and number of observations) into awk and assign those.

Explanation and script content:
Run it like: awk -f script.awk original bootstrap
# Slurp the original file in an array a
# Ignore the header
NR==FNR && NR>1 {
# Index of this array will be type
# Value of that type will be original value
a[$1]=$2
next
}
# If in the bootstrap file value
# of second column is greater than original value
FNR>1 && $2>a[$1] {
# Increment an array indexed at first column
# which is nothing but type
b[$1]++
}
# Increment another array regardless to identify
# the number of times bootstrapping was done
{
c[$1]++
}
# for each type in array a
END {
for (type in a) {
# print the type and calculate empirical p-value
# which is done by dividing the number of times higher value
# of a type was seen and total number of times
# bootstrapping was done.
print type, b[type]/c[type]
}
}
Test:
$ cat original
name Analyte
spring 0.1
winter 0.4
$ cat bootstrap
name Analyte
spring 0.001
winter 0
spring 0
winter 0.2
spring 0.03
winter 0
spring 0.01
winter 0.02
spring 0.1
winter 0.5
spring 0
winter 0.04
spring 0.2
winter 0
spring 0
winter 0.06
spring 0
winter 0
$ awk -f s.awk original bootstrap
spring 0.111111
winter 0.111111
Analysis:
Spring Original Value is 0.1
Winter Original Value is 0.4
Bootstrapping done is 9 times for this sample file
Count of values higher than Spring original value = 1
Count of values higher than Winter's original value = 1
So, 1/9 = 0.111111

this works for me, (GNU awk 3.1.6):
FNR == NR {
a[$1] = $2
next
}
$2 > a[$1] {
b[$1]++
}
{
c[$1]++
}
END {
for (i in a) print i, "p-value=",b[i]/c[i]
}
..output is:
winter p-value= 0.111111
spring p-value= 0.111111

Related

Netlogo: runtime-error with list, item, and -1

I have a rather specific error in netlogo that I've been staring at for a while now. Hope you guys have some insight.
The error is in a code which looks back in a list called 'strategy'. If the list is longer than investment-time variables 'REfocus' and 'PRICE' are set to a certain value. If the list is not longer than investment-time, the variables are not set (and thus remain 0).
The code consists out of a function strategy_actions and a reporter investment_time. Investment-time is approximately 3 years, but as ticks are in months, investment-time is rescaled to months. In strategy_actions, investment-time is scaled back to years, as each entry in the strategy list is also annual. (The scaling and rescaling seems arbitrary, but as investment-time is used a lot by other parts of the code, it made more sense to do it like this). The goal is to take the strategy from x time back (equal to investment-time).
The code (error follows underneath):
to strategy_actions
set_ROI
start_supply?
if current_strategy != 0
[
let it (investment_time / 12)
ifelse it >= length strategy
[
set REfocus 0
]
[
if item (it - 1) strategy = 1
[
if supply? = true [set_PRICE (set_discrete_distribution 0.29 0.19 0.29 0.15 0.07 0 0) (set_discrete_distribution 0.14 0.12 0.25 0.25 0.25 0 0)]
ifelse any? ids [set REfocus mean [mot_RE] of ids][set REfocus set_discrete_distribution 0.07 0.03 0.07 0.17 0.66 0 0]
]
if item (it - 1) strategy = 2
[
if supply? = true [set_PRICE (set_discrete_distribution 0.27 0.21 0.32 0.11 0.09 0 0) (set_discrete_distribution 0.15 0.11 0.22 0.30 0.23 0 0)]
ifelse any? prods [set REfocus mean [mot_RE] of prods][set REfocus set_discrete_distribution 0.12 0.03 0.10 0.18 0.57 0 0]
]
if item (it - 1) strategy = 3
[
if supply? = true [set_PRICE (set_discrete_distribution 0.26 0.22 0.26 0.18 0.09 0 0) (set_discrete_distribution 0.07 0.08 0.19 0.30 0.35 0 0)]
ifelse any? cons[set REfocus mean [mot_RE] of cons][set REfocus set_discrete_distribution 0.08 0.06 0.15 0.27 0.45 0 0]
]
]
set RE_history fput REfocus RE_history
]
end
to-report investment_time
report ((random-normal 3 1) * 12) ;approximately 3 years investment time
end
somehow, i sometimes get this runtime error during my behaviorspace experiment:
-1 isn't greater than or equal to zero.
error while observer running ITEM
called by procedure STRATEGY_ACTIONS
called by procedure SET_MEETING_ACTIONS
called by procedure GO
Does anyone know what causes this error?
You would help me out a lot!
Cheers,
Maria
It appears that investment_time is occasionally coming in as zero, so you are asking for item (0 - 1) of the strategy list. I did a bit of playing around with item and learned (to my surprise) that item (0.0001 - 1) strategy works just fine, yielding the 0th item in the list in spite of the argument being negative. But item (0 - 1) strategy does give the error you cite. Apparently an item number greater than -1 is interpreted as zero. Indeed item seems to truncate any fractional argument rather than rounding it. E.g., item 0.9 is interpreted as item 0, as is item -0.9
That might be worth putting in the documentation.
HTH,
Charles

Create new column in dataframe by applying math operation to column values based on a match

I have the following dataframes:
df1
name phone duration(m)
Luisa 443442 1
Jack 442334 6
Matt 442212 2
Jenny 453224 1
df2
prefix charge rate
443 0.8 0.3
446 0.8 0.4
442 0.6 0.1
476 0.8 0.3
my desired output is to match each phone number with its prefix (there are more prefixes than phone numbers) and calculate how much to charge per called by multiplying the duration of call for each phone number by the corresponding prefix charge plus the corresponding rate.
output ex.
df1
name phone duration(m) bill
Luisa 443442 1 (example: 1x0.3+0.8)
Jack 442334 6 (example: 6x0.1+0.6)
Matt 442212 2
Jenny 453224 1
my idea was to convert df2 to a dictionary like so dict={'443':[0.3,0.8],'442':[0.1,0.6]...} so i could match each number with the dict key and then do the opertion with the corresponding value of that matching key. However is not working and would also like to know if there is a better alternative.
To merge with prefix of arbitrary length you can do
>> df1['phone'] = df1.phone.astype(str)
>> df2['prefix'] = df2.prefix.astype(str)
>> df1['prefix_len'] = df1.phone.apply(
lambda h: max([len(p) for p in df2.prefix if h.startswith(p)] or [0]))
>> df1['prefix'] = df1.apply(lambda s: s.phone[:s.prefix_len], axis=1)
>> df1 = df1.merge(df2, on='prefix')
>> df1['bill'] = df1['duration(m)'] * df1['rate'] + df1['charge']
>> df1
duration(m) name phone prefix_len prefix charge rate bill
0 1 Luisa 443442 3 443 0.8 0.3 1.1
1 6 Jack 442334 3 442 0.6 0.1 1.2
2 2 Matt 442212 3 442 0.6 0.1 0.8
Note that
in case of multiple prefixes I choose the one with maximum length;
in case when there are no prefixes for particular phone I fill its length with default zero value, (then s.phone[:s.prefix_len] will produce an empty prefix and pd.merge will eliminate those phones from the result).
df1 = pd.DataFrame({'name':["Louisa","Jack","Matt","Jenny"],'phone':[443442,442334,442212,453224],'duration':[1,6,2,1]})
df2 = pd.DataFrame({'prefix':[443,446,442,476],'charge':[0.8,0.8,0.6,0.8],'rate':[0.3,0.4,0.1,0.3]})
df3=pd.concat((df1,df2),axis=1)
df4=pd.DataFrame({"phone_pref":df3["phone"].astype(str).str[:3]})
df4=df4["phone_pref"].drop_duplicates()
df3["bill"]=None
for j in range(len(df4)):
for i in range(len(df3["prefix"])):
if df3.loc[i,"prefix"]==int(df4.iloc[j]):
df3.loc[i,"bill"]=df3.loc[i,"duration"]*df3.loc[i,"charge"]+df3.loc[i,"rate"]
print(df3)
duration name phone charge prefix rate bill
0 1 Louisa 443442 0.8 443 0.3 1.1
1 6 Jack 442334 0.8 446 0.4 None
2 2 Matt 442212 0.6 442 0.1 1.3
3 1 Jenny 453224 0.8 476 0.3 None
The None values in the bill column are because in your excample no phone number has the prefixes 446 or 476 and thus they are not in the df4...
Also the bill is calculated with the formula of yours given in the question

awk match between two patterns in an "if/else" statement

I've got an awk issue that I can't seem to figure out. I'm trying to parse out data from SAR and found that some systems are using a different locale and I'm getting different output. The long term solution is to change the locale on all systems for the output data to the same thing, but I have to parse through old data for now and that is not currently an option. Here's the two types of data I get:
24-Hour Output:
21:10:01 all 8.43 0.00 1.81 2.00 0.00 87.76
21:20:01 all 7.99 0.00 1.74 0.82 0.00 89.44
21:30:01 all 8.35 0.00 1.76 0.94 0.00 88.95
12-Hour Output:
09:10:01 PM all 8.43 0.00 1.81 2.00 0.00 87.76
09:20:01 PM all 7.99 0.00 1.74 0.82 0.00 89.44
09:30:01 PM all 8.35 0.00 1.76 0.94 0.00 88.95
I need an awk statement that will get items from 7AM-7PM for all SAR data. I originally had something working, but once I found this issue, it breaks for all the 24-hour output. I trying getting the awk statement to work, but the following doesn't work and I can't figure out how to make it work:
awk '{ if ($2 == "AM" || $2 == "PM" && /07:00/,/07:00/) print $1" "$2; else '/07:00/,/19:00 print $1}' SAR_OUTPUT_FILE.txt
Basically, what I'm trying to output is, if it is in 24-hour format, searchh for 07:00-19:00 and return just the first column of output (since there is no "AM/PM" column. If it founds "AM/PM", I would confider that 12-hour format and want to get everything from 07:00-07:00 and return both the 1st and 2nd column (time + "AM/PM").
Can anyone help me out here?
Without access to an awk with time functions ( strftime() or mktime() ), you can shift the 12h end times so that they can be tested with the 24h time test.
Here's an awk executable that does that by adjusting the hours in the 12h formatted times to fit 24h time formats. The result is put into variable t for every line and is tested to be in the 24h range.
#!/usr/bin/awk -f
function timeShift( a, h ) {
if(NF==9 && split($1, a, ":")==3) {
if(a[1]==12) h = $2=="PM"?"12":"00"
else if($2=="PM") h = (a[1]+12)%24
else h = a[1]
return( h ":" a[2] ":" a[3] )
}
return( $1 )
}
{ t = timeShift() }
t >= "07:00:00" && t <= "19:00:00"
If you need to print fewer fields than the full line, an action block could be added after the final expression.

Python 2.7 Pandas: How to replace a for-loop?

I have a large pandas dataframe with 2000 rows (one date per row) and 2000 columns (1 second intervals). Each cell represents a temperature reading.
Starting with the 5th row, I need to go back 5 rows and find all the observations where the the 1st column in the row is higher than the 2nd column in the row.
For the 5th row I may find 2 such observations. I then want to do summary stats on the observations and append those summary stats to a list.
Then I go to the 6st row and go back 5 rows and find all the obvs where the 1th column is higher than the 2nd column. I get all obvs, do summary stats on the obvs and append the results to the new dataframe.
So for each row in the dataframe, I want to go back 5 days, get the obvs, get the stats, and append the stats to a dataframe.
The problem is that if I perform this operation on rows 5 -2000, then I will have a for-loop that is 1995 cycles long, and this takes a while.
What is the better or best way to do this?
Here is the code:
print huge_dataframe
sec_1 sec_2 sec_3 sec_4 sec_5
2013_12_27 0.05 0.12 0.06 0.15 0.14
2013_12_28 0.06 0.32 0.56 0.14 0.17
2013_12_29 0.07 0.52 0.36 0.13 0.13
2013_12_30 0.02 0.12 0.16 0.55 0.12
2013_12_31 0.06 0.30 0.06 0.14 0.01
2014_01_01 0.05 0.12 0.06 0.15 0.14
2014_01_02 0.06 0.32 0.56 0.14 0.17
2014_01_03 0.07 0.52 0.36 0.13 0.13
2014_01_04 0.02 0.12 0.16 0.55 0.12
2014_01_05 0.06 0.30 0.06 0.14 0.01
for each row in huge_dataframe.ix[5:]:
move = row[sec_1] - row[sec_2]
if move < 0: move = 'DOWN'
elif move > 0: move = 'UP'
relevant_dataframe = huge_dataframe.ix[only the 5 rows preceding the current row]
if move == 'UP':
mask = relevant_dataframe[sec_1 < sec_2] # creates a boolean dataframe
observations_df = relevant_dataframe[mask]
elif move == 'DOWN':
mask = relevant_dataframe[sec_1 > sec_2] # creates a boolean dataframe
observations_df = relevant_dataframe[mask]
# At this point I have observations_df which is only filled
# with rows where sec_1 < sec_2 or the opposite, depending on which
# row I am in.
summary_stats = str(observations_df.describe())
summary_list.append(summary_stats) # This is the goal
# I want to ultimatly
# turn the list into a
# dataframe
Since there is no code to create the data, I will just sketch the code that I would try to make work. Generally, try to prevent from row-wise operations whenever you can. I first had no clue either, but then I got interested and some research yielded TimeGrouper:
df = big_dataframe
df['move'] = df['sec_1'] > df['sec2']
def foobarRules(group):
# keep in mind that in here, you refer not to "relevant_dataframe", but to "group"
if (group.tail(1).move == True):
# some logic
else:
# some other logic
return str(group.describe())
grouper = TimeGrouper('5D')
allMyStatistics = df.groupby(grouper).apply(foobarRules)
I have honestly no clue how the return works if you return a multi-dimensional dataframe. I know it works well if you return either a row or a column, but if you return a dataframe that contains both rows and columns for every group - I guess pandas is smart enough to compute a panel of all these. Well, you will find out.

How should I format my .dat file so that a 3D vector plot can be made?

I'm working this programming task for college where we have to write a c++ program that calculates the magnetic field vector for certain coils in 3D space.
I've managed to write this program and I think I've got it working pretty well.
I want to add in a special thinh though (it's my exam paper, so it has to be extra good!): I wan't to plot the vectors out.
I'm used to calling gnuplot from c++ (via piping) and this is what I usually do:
create an output stream that writes the data to a .dat file
open a gnuplot pipe
make gnuplot plot all the contents of the .dat
Since my data has always been 2D, xand y plots, I'm quite lost here. My question is:
How to format the .dat file (e.g. do I use braces to group vector components?)
what is the actual gnuplot command to plot a 3D vector field?
It'd be easy if I could format the .dat file like this:
# Px Py Pz Bx By Bz
1 0 2 0.7 0.5 0.25 #<= example data line
... more data ...
when the magnetic field vector in the point P=(1,0,2)equals a vector B=(0.7,0.5,0.25). This would be easy to program, the real question is: will this do ? and how to I plot it in gnuplot. (wow, I've asked the same question 3 times I guess).
Piping to gnuplot
Ok, since someone asked me to describe how I pipe (don't know if it's the right term thought) stuff to gnuplot. Here it is:
First open up a pipe and call it pipe:
FILE *pipe = popen("gnuplot -persist 2>/dev/null", "w");
Tell gnuplot what to do through the pipe:
fprintf(pipe, "set term x11 enhanced \n");
fprintf(pipe, "plot x^2 ti 'x^2' with lines\n");
notice the \nwhich is absolutely necessary. It is what executes the command.
close the pipe:
pclose(pipe);
The necessary library is called <fstream> I believe.
I made this simple example to show you how to draw a vector field. The output would be something like this pic:
The data example I used to plot this was:
# Px Py Pz Bx By Bz
0 0 0 0.8 0.8 0.45
0 0 1 0.5 0.7 0.35
0 0 2 0.7 0.5 0.25
0 1 0 0.65 0.65 0.50
0 1 1 0.6 0.6 0.3
0 1 2 0.45 0.45 0.20
1 0 0 0.5 0.7 0.35
1 0 1 0.75 0.75 0.4
1 0 2 0.85 0.85 0.25
1 1 0 0.90 0.85 0.23
1 1 1 0.95 0.86 0.20
1 1 2 0.98 0.88 0.13
2 0 0 0.73 0.83 0.43
2 0 1 0.53 0.73 0.33
2 0 2 0.73 0.53 0.23
2 1 0 0.68 0.68 0.52
2 1 1 0.63 0.57 0.23
2 1 2 0.48 0.42 0.22
The command to plot it is:
gnuplot> splot "./data3d.dat" with vectors
Now you should read the section 44, page 53 of the official manual (and here the pdf). You may find this site also very useful.
Edited:
This command doesn't fit into your description: mapping from (x,y,z) to (t,u,v). It actually does this mapping: from (X,Y,Z) to (X+dX,Y+dY,Z+dZ).
Cheers,
Beco