formula for integer less than 15 - regex

In one field I want to accept numbers, possibly with decimals, representing a weight, but the value should not be over 15. Previously I had the following regex:
[1-9]\d*(\.\d+)?$
This is to be entered in Google Forms. In other words, all these numbers are OK:
0.05
1.5
2
3.56
But these are not ok:
2 kg
0
15.1
16

This should work for values 0 to 15
^((1[0-5])|([1-9]))?(\.\d*)?$
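Note that the pattern above still accepts 15.1 (and an empty string) and rejects 0.05, because the integer part is optional and cannot be 0. Below is a stricter sketch, checked with Python's re module. I am assuming Google Forms accepts the same syntax (non-capturing groups, \d), which is worth verifying in the form itself; the names WEIGHT and is_valid_weight are only for illustration.

```python
import re

# Accept values greater than 0 and at most 15, with optional decimals.
WEIGHT = re.compile(
    r"^(?:15(?:\.0+)?"                # exactly 15, optionally 15.0, 15.00, ...
    r"|(?:1[0-4]|[1-9])(?:\.\d+)?"    # 1 to 14, with optional decimals
    r"|0\.\d*[1-9]\d*)$"              # 0.x with at least one non-zero digit
)

def is_valid_weight(text):
    """Return True if text is a plain number in the range (0, 15]."""
    return WEIGHT.fullmatch(text) is not None

for sample in ["0.05", "1.5", "2", "3.56", "2 kg", "0", "15.1", "16"]:
    print(sample, is_valid_weight(sample))
```

The three alternatives keep the upper bound exact: 15 may only be followed by zeros, 1 to 14 may take any decimals, and a leading 0 must be followed by a non-zero fraction.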

Create new column in dataframe by applying math operation to column values based on a match

I have the following dataframes:
df1
name phone duration(m)
Luisa 443442 1
Jack 442334 6
Matt 442212 2
Jenny 453224 1
df2
prefix charge rate
443 0.8 0.3
446 0.8 0.4
442 0.6 0.1
476 0.8 0.3
My desired output is to match each phone number with its prefix (there are more prefixes than phone numbers) and calculate how much to charge per call by multiplying the duration of the call for each phone number by the corresponding prefix charge plus the corresponding rate.
output ex.
df1
name phone duration(m) bill
Luisa 443442 1 (example: 1x0.3+0.8)
Jack 442334 6 (example: 6x0.1+0.6)
Matt 442212 2
Jenny 453224 1
My idea was to convert df2 to a dictionary like dict={'443':[0.3,0.8],'442':[0.1,0.6]...} so I could match each number with the dict key and then do the operation with the corresponding value of that matching key. However, it is not working, and I would also like to know if there is a better alternative.
To merge with prefix of arbitrary length you can do
>> df1['phone'] = df1.phone.astype(str)
>> df2['prefix'] = df2.prefix.astype(str)
>> df1['prefix_len'] = df1.phone.apply(
lambda h: max([len(p) for p in df2.prefix if h.startswith(p)] or [0]))
>> df1['prefix'] = df1.apply(lambda s: s.phone[:s.prefix_len], axis=1)
>> df1 = df1.merge(df2, on='prefix')
>> df1['bill'] = df1['duration(m)'] * df1['rate'] + df1['charge']
>> df1
duration(m) name phone prefix_len prefix charge rate bill
0 1 Luisa 443442 3 443 0.8 0.3 1.1
1 6 Jack 442334 3 442 0.6 0.1 1.2
2 2 Matt 442212 3 442 0.6 0.1 0.8
Note that:
in case of multiple matching prefixes, I choose the one with maximum length;
in case there is no prefix for a particular phone, I set its prefix length to the default value zero (then s.phone[:s.prefix_len] produces an empty prefix and pd.merge eliminates those phones from the result).
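For reference, here is a self-contained variant of the same longest-prefix idea that keeps unmatched phones in the result (with NaN as the bill) instead of dropping them. It is only a sketch on the sample data from the question; the helper longest_prefix and the frame name billed are made up for illustration, and the bill follows the question's examples (duration x rate + charge).

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["Luisa", "Jack", "Matt", "Jenny"],
    "phone": ["443442", "442334", "442212", "453224"],
    "duration(m)": [1, 6, 2, 1],
})
df2 = pd.DataFrame({
    "prefix": ["443", "446", "442", "476"],
    "charge": [0.8, 0.8, 0.6, 0.8],
    "rate": [0.3, 0.4, 0.1, 0.3],
})

# Try longer prefixes first so the longest match wins.
prefixes = sorted(df2["prefix"], key=len, reverse=True)

def longest_prefix(phone):
    """Return the longest known prefix of phone, or None if nothing matches."""
    for p in prefixes:
        if phone.startswith(p):
            return p
    return None

df1["prefix"] = df1["phone"].map(longest_prefix)

# A left merge keeps phones without a known prefix; their bill stays NaN.
billed = df1.merge(df2, on="prefix", how="left")
billed["bill"] = billed["duration(m)"] * billed["rate"] + billed["charge"]
print(billed[["name", "phone", "prefix", "bill"]])
```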
import pandas as pd
df1 = pd.DataFrame({'name':["Louisa","Jack","Matt","Jenny"],'phone':[443442,442334,442212,453224],'duration':[1,6,2,1]})
df2 = pd.DataFrame({'prefix':[443,446,442,476],'charge':[0.8,0.8,0.6,0.8],'rate':[0.3,0.4,0.1,0.3]})
df3=pd.concat((df1,df2),axis=1)
df4=pd.DataFrame({"phone_pref":df3["phone"].astype(str).str[:3]})
df4=df4["phone_pref"].drop_duplicates()
df3["bill"]=None
for j in range(len(df4)):
    for i in range(len(df3["prefix"])):
        if df3.loc[i,"prefix"]==int(df4.iloc[j]):
            df3.loc[i,"bill"]=df3.loc[i,"duration"]*df3.loc[i,"charge"]+df3.loc[i,"rate"]
print(df3)
duration name phone charge prefix rate bill
0 1 Louisa 443442 0.8 443 0.3 1.1
1 6 Jack 442334 0.8 446 0.4 None
2 2 Matt 442212 0.6 442 0.1 1.3
3 1 Jenny 453224 0.8 476 0.3 None
The None values in the bill column are because in your example no phone number has the prefixes 446 or 476, and thus they are not in df4.
Also, the bill is calculated with the formula as worded in your question (duration x charge + rate); note that your worked examples instead use duration x rate + charge.

Issue Calculating Mean of Grouped Data for entire range of dataset using Pandas

I have a data set of daily temperatures for which I want to calculate 20 year means. The data look like this:
1974 1 1 5.3 4.6 7.3 3.4
1974 1 2 3.3 7.2 4.5 6.5
...
2005 12 364 4.2 5.2 3.3 4.6
2005 12 365 3.1 5.5 2.6 6.8
There is no header in the file but the first column contains the year, the second column the month, and the third column the day of the year. The rest of the columns are temperature data.
I want to calculate the average temperature for each day over a period of 20 years. I thought the best way to do that would be to group the data by day and calculate the mean of each day for a specific range of years. Here is my code:
import pandas as pd
hist_fn = 'tmean_daily_1974_2005.txt'
twenty_year_fn = '20_yr_mean_1974_1993.txt'
start = 1974
end = 1993
hist_mean = pd.read_csv(hist_fn, sep='\s+', header=None)
# Limit dataframe to only the 20 years for which I want the mean calculated
interval_mean = hist_mean[(hist_mean[0]>=start) & (hist_mean[0]<=end)]
# Rename the first column to reflect what mean this file is displaying
interval_mean.iloc[:, 0] = ("%s-%s" % (start, end))
# Generate mean for each day spread across all the years in the dataframe
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
# Write multiyear mean to txt
interval_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)
The data set spans longer than 20 years and the method I used has worked for the first 20 year interval but gives me a (mostly) empty text file for any other set of years entered.
So when I use these inputs it works:
start = 1974
end = 1993
and it produces a file that looks like this:
1974-1993 1 1 4.33 5.25 6.84 3.67
1974-1993 1 2 7.23 6.22 5.65 6.23
...
1974-1993 12 364 5.12 4.34 5.21 2.16
1974-1993 12 365 4.81 5.95 3.56 6.78
but when I change the inputs to this:
start = 1975
end = 1994
it produces a .txt file with no temperatures:
1975-1994 1 1
1975-1994 1 2
...
1975-1994 12 364
1975-1994 12 365
I don't understand why this method works for the first 20 year interval but none of the subsequent intervals. Is it something to do with the way the data is organized or how it is being sliced?
Now that that's out of the way, we can talk about the problem you presented:
The strange behavior is due to the fact that pandas matches indices on assignment, and slicing preserves the original indices. Consider what happens in the assignment
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
Here interval_mean.groupby(2, as_index=False).mean() has one row per day of the year, so roughly indices 0, ..., 364 (as_index=False makes the groupby operation create a new index; otherwise the index would have been the day number). On the other hand, interval_mean keeps the original indices from hist_mean: for the first interval (1974-1993) they are 0, ..., ~20*365, so the first rows line up with the new index, but for any later interval they start at about 365 or higher and count up, so no rows match.
This is a bit confusing at first, but pandas offers great documentation about it, and people quickly discover why it is so useful.
I'll try to explain what happens with an example:
Assume we have the following DataFrame:
df = pd.DataFrame(np.reshape(np.random.randint(5, size=30), [-1,3]))
df
0 1 2
0 1 1 2
1 2 1 1
2 0 1 2
3 0 2 0
4 2 1 0
5 0 1 2
6 2 2 1
7 1 0 2
8 0 1 0
9 1 2 0
Note that the column names are 0,1,2 and the row names (the index) are 0, ..., 9.
When we perform groupby we obtain
df.groupby(0, as_index=False).mean()
0 1 2
0 0 1.250000 1.000000
1 1 1.000000 1.333333
2 2 1.333333 0.666667
(The index equals the values of the grouped column only because column 0 happens to contain the values 0 to 2.) Now, when we assign to df.loc, each cell is replaced by the cell with the same index and column label in the assigned value, if such a cell exists. Otherwise, it is set to NaN.
df.loc[:,:] = df.groupby(0, as_index=False).mean()
df
0 1 2
0 0.0 1.250000 1.000000
1 1.0 1.000000 1.333333
2 2.0 1.333333 0.666667
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
And when you write NA to csv, it leaves the cell blank.
The last piece of the puzzle is how interval_mean preserved the original indices, but this is because slicing preserves the original indices:
df[df[1] > 1]
0 1 2
3 0 2 0
6 2 2 1
9 1 2 0
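Given the above, one way around the alignment problem is to skip the assignment entirely and build the output directly from the groupby result. This is a minimal sketch with made-up data (four years, three days, one temperature column); the function name interval_mean is only for illustration, and grouping by both month and day is my choice so that the month column survives.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the file: column 0 = year, 1 = month, 2 = day, 3 = temperature.
years = np.repeat(np.arange(1974, 1978), 3)
days = np.tile([1, 2, 3], 4)
hist = pd.DataFrame({0: years, 1: 1, 2: days, 3: np.arange(12, dtype=float)})

def interval_mean(df, start, end):
    """Mean of each day over the years start..end, labelled with the interval."""
    subset = df[(df[0] >= start) & (df[0] <= end)]
    # The groupby result gets a fresh 0..n-1 index, so nothing has to be
    # aligned back onto the original row labels.
    means = subset.groupby([1, 2], as_index=False).mean()
    means[0] = f"{start}-{end}"
    return means[sorted(means.columns)]

result = interval_mean(hist, 1975, 1976)
print(result)
```

The result can then be written with to_csv(..., header=False, index=False) as in the question, and it works for any interval because the original row labels are never used.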

Discounting losses in SAS

I'm writing my master's thesis on the costs of occupational injuries. As a part of the thesis I have estimated the expected wage loss for each person for every year for four years after the injury. I would like to discount the estimated losses to a specific base year (2009) in SAS.
For the year 2009 the discounted loss is just equal to the estimated loss. For 2010 and on the discounted loss can be calculated with the netpv function:
IF year=2009 then discount_loss=wage;
IF year=2010 then discount_loss=netpv(0.1,1,0,wage);
IF year=2011 then discount_loss=netpv(0.1,1,0,0,wage);
And so forth. But starting from 2014 I would like to use the estimated wage loss for 2014 as the expected loss onward, so for instance if the estimated loss is 100$ that would represent the yearly loss until retirement. Since the persons do not all have the same age, there would be too many cases to hard-code, so I'm looking for a better way. There are approximately 200.000 persons in my data set with different estimated losses for each year.
The format of the (fictional) data looks like this:
id age year age_retirement wage_loss rate discount_loss
1 35 2009 65 -100 0.1 -100
1 36 2010 65 -100 0.1 -90,91
1 37 2011 65 -100 0.1 -82,64
1 38 2012 65 -100 0.1 -75,13
1 39 2013 65 -100 0.1 -68,30
1 40 2014 65 -100 0.1
The column discount_loss is the net present value of the loss in 2009, calculated as above.
I would like the loss in 2014 to represent the sum of losses for the rest of the period (until age_retirement) on the labor market. That would be -100$ discounted to 2009, starting from 2014 until 2014+(65-40).
Thanks!
Use the FINANCE function for PV, Present Value.
In your situation above, you're looking for the value of 100 for 25 years of payments (65-40)=25. I'll leave the calculation of the number of years up to you.
FINANCE('PV', rate, nper, payment, <fv>, <type>);
In your case, Future Value is 0 and the type=1 as you assume payment at the beginning of the year.
The formula below calculates the present value of a series of payments of 100 over 25 years with a 10% interest rate, paid at the beginning of each period.
value=FINANCE('PV', 0.1, 25, -100, 0, 1);
Value = 998.47440201
Reference is here:
https://support.sas.com/documentation/cdl/en/lefunctionsref/67960/HTML/default/viewer.htm#p1cnn1jwdmhce0n1obxmu4iq26ge.htm
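As a cross-check on the number above, the same annuity-due present value can be computed by hand. This is a plain Python sketch, not SAS; the function names are mine, and discounting the 2014 lump sum back five more years to the 2009 base year is my reading of the question rather than something FINANCE does for you.

```python
def pv_annuity_due(rate, nper, payment):
    """Present value of nper equal payments made at the start of each period."""
    ordinary = payment * (1 - (1 + rate) ** -nper) / rate
    return ordinary * (1 + rate)

def discount(value, rate, years):
    """Discount a single amount back by the given number of years."""
    return value / (1 + rate) ** years

# 25 yearly losses of 100 at 10%, valued at the start of 2014.
pv_2014 = pv_annuity_due(0.1, 25, 100)

# Bring that 2014 value back to the 2009 base year.
pv_2009 = discount(pv_2014, 0.1, 2014 - 2009)

print(round(pv_2014, 4))  # about 998.4744, in line with the FINANCE result
print(round(pv_2009, 2))
```

In a data step the same two factors depend only on rate, age and age_retirement, so they can be computed per row without hard-coding each case.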
If you are looking for speed, why not first calculate an array that contains the PV of $1 for i years, where i goes from 1 to n? Then just select the element you need and multiply. This could all be done in a data step.

Reshaping Pandas data frame (a complex case!)

I want to reshape the following data frame:
index id numbers
1111 5 58.99
2222 5 75.65
1000 4 66.54
11 4 60.33
143 4 62.31
145 51 30.2
1 7 61.28
The reshaped data frame should be like the following:
id 1 2 3
5 58.99 75.65 nan
4 66.54 60.33 62.31
51 30.2 nan nan
7 61.28 nan nan
I use the following code to do this.
import pandas as pd
dtFrame = pd.read_csv("data.csv")
ids = dtFrame['id'].unique()
temp = dtFrame.groupby(['id'])
temp2 = {}
for i in ids:
temp2[i]= temp.get_group(i).reset_index()['numbers']
dtFrame = pd.DataFrame.from_dict(temp2)
dtFrame = dtFrame.T
Although the above code solves my problem, is there a simpler way to achieve this? I tried a pivot table, but it does not solve the problem; perhaps it requires the same number of elements in each group. Or maybe there is another way which I am not aware of. Please share your thoughts about it.
In [69]: df.groupby(df['id'])['numbers'].apply(lambda x: pd.Series(x.values)).unstack()
Out[69]:
0 1 2
id
4 66.54 60.33 62.31
5 58.99 75.65 NaN
7 61.28 NaN NaN
51 30.20 NaN NaN
This is really quite similar to what you are doing except that the loop is replaced by apply. The pd.Series(x.values) has an index which by default ranges over integers starting at 0. The index values become the column names (above). It doesn't matter that the various groups may have different lengths. The apply method aligns the various indices for you (and fills missing values with NaN). What a convenience!
I learned this trick here.
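An alternative sketch that gets the pivot approach from the question to work: number the rows within each id with cumcount and pivot on that position. The column name pos and the frame name wide are made up here, and the data is typed in from the question rather than read from data.csv.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [5, 5, 4, 4, 4, 51, 7],
    "numbers": [58.99, 75.65, 66.54, 60.33, 62.31, 30.2, 61.28],
})

# Position of each row within its id group: 1, 2, 3, ...
df["pos"] = df.groupby("id").cumcount() + 1

# One row per id, one column per position; missing cells become NaN.
wide = df.pivot(index="id", columns="pos", values="numbers")

# pivot sorts the ids; reindex restores the order of first appearance.
wide = wide.reindex(df["id"].unique())
print(wide)
```

Groups of different lengths are fine here because each (id, pos) pair is unique, which is all pivot requires.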

Writing both characters and digits in an array

I have a Fortran code which reads a txt file with separate lines of characters and digits and then writes them into a 1D array with 20 elements.
This code is not compatible with the Fortran 77 compiler Force 2.0.9. My question is how to apply the aforementioned procedure using a Fortran 77 compiler, i.e. defining a 1D array and then writing the txt file line by line into the elements of the array?
Thank you in advance.
The txt file follows:
Case 1:
10 0 1 2 0
1.104 1.008 0.6 5.0
25 125.0 175.0 0.7 1000.0
0.60
1 5
Advanced Case
15 53 0 10 0 1 0 0 1 0 0 0 0
0 0 0 0
0 0 1500.0 0 0 .03
0 0.001 0
0.1 0 0.125 0.08 0.46
0.1 5.0 0.04
# Jason:
I am a beginner and still learning Fortran. I guess Force 2 uses g77.
The following is the corresponding part of the original code. Force 2 editor returns an empty txt file as a result.
      DIMENSION CARD(20)
      CHARACTER*64 FILENAME
      DATA XHEND / 4HEND /
      OPEN(UNIT=3,FILE='CON')
      OPEN(UNIT=4,FILE='CON')
      OPEN(UNIT=7,STATUS='SCRATCH')
      WRITE(3,9000) 'PLEASE ENTER THE INPUT FILE NAME : '
 9000 FORMAT (A)
      READ(4,9000) FILENAME
      OPEN(UNIT=5,FILE=FILENAME,STATUS='OLD')
      WRITE(3,9000) 'PLEASE ENTER THE OUTPUT FILE NAME : '
      READ(4,9000) FILENAME
      OPEN(UNIT=6,FILE=FILENAME,STATUS='NEW')
      FILENAME = '...'
      IR = 7
      IW = 6
      IP = 15
    5 REWIND IR
      I = 0
    2 READ (5,7204,END=10000) CARD
      IF (I .EQ. 0 ) WRITE (IW,7000)
 7000 FORMAT (1H1 / 10X,15HINPUT DECK ECHO / 10X,15(1H-))
      I= I + 1
      WRITE (IW,9204) I,CARD
      IF (CARD(1) .EQ. XHEND ) GO TO 7020
      WRITE (IR,7204) CARD
 7204 FORMAT (20A4)
 9204 FORMAT (1X,I4,2X,20A4)
      GO TO 2
 7020 REWIND IR
It looks like CARD is being used to hold 20 4-character strings. I don't see it declared as a character variable, only as an array, so perhaps in extremely old FORTRAN style a non-character variable is being used to hold characters? You are using a 20A4 format, so the values have to be positioned in the file precisely as 20 groups of 4 characters. You have to add blanks so that they are aligned into groups of 4 columns.
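To make the 20A4 layout concrete, here is a small Python sketch (not Fortran) of how one input line is cut into 20 four-character fields. The function name read_card is made up for illustration, and I am assuming short lines are padded with blanks on input, which is the usual behavior but can depend on the compiler.

```python
def read_card(line, fields=20, width=4):
    """Split one input line into fixed-width fields, like FORMAT (20A4)."""
    total = fields * width
    # Drop the newline, pad short lines with blanks, ignore anything past column 80.
    padded = line.rstrip("\n").ljust(total)[:total]
    return [padded[i * width:(i + 1) * width] for i in range(fields)]

card = read_card("Case 1:\n")
print(card[:3])

# The end-of-deck test in the original code compares the first field with 'END '.
print(read_card("END")[0] == "END ")
```

This shows why the test against XHEND only fires when END starts in column 1: a line such as "  END" would put "  EN" in the first field.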
If you want to read numbers it would be much easier to read them into a numeric type and use list-directed IO:
real values (20)
read (5, *) values
Then you wouldn't have to worry about precise positioning of the values in the file.
This is really archaic FORTRAN ... even pre-FORTRAN-77 in style. I can't remember the last time that I saw Hollerith (H) formats! Where are you learning this from?
Edit: While I like Fortran for many programming tasks, I wouldn't use FORTRAN 66! Computers are supposed to make things easier ... there is no reason to have to count characters. Instead of
7000 FORMAT (1H1 / 10X,15HINPUT DECK ECHO / 10X,15(1H-))
You can use
7000 FORMAT ( / 10X, "INPUT DECK ECHO" / 10X, 15("-") )
I can think of only two reasons to use a Hollerith code: not bothering to change legacy source code (it is remarkable that a current Fortran compiler can process a feature that was obsolete 30 years ago! Fortran source code never dies!), or studying the history of computing languages. The name honors a great computing pioneer, whose invention accomplished the 1890 US Census in one year, when the 1880 Census took eight years: http://en.wikipedia.org/wiki/Herman_Hollerith
I very much doubt that you will see the "1" in the first column performing "carriage control" today. I had to look up that "1" was the code for page eject. You are much more likely to see it printed in your output. See Are Fortran control characters (carriage control) still implemented in compilers?