Operation on Regex identified digits in python data frame - regex

I have one data frame with 2 columns 3rd column is output format given below:
DF:
reg value o/p**
2 for $20 11 20/2
4 for $24 12 24/4
2 for $30 13 30/2
Get $10 Cash 14 14
3 for $30 21 30/3
First, I have to match [\d]+ for [$][\d]+ in reg column and then have
to update the value column as 2nd integer of reg divide by the first
integer of reg if no match keeps same value.
My code is:
df["value"]=df["reg"].map(lambda x: (int(re.findall("[\d]+",x)[1]))/int(re.findall("[\d]+",x)[0]) if(re.search(r"[\d]+ for [$][\d]+" , x)) else x)
The code output is correct for match cases only.

Try:
df["value"]=df.apply(lambda x: (int(re.findall("[\d]+",x["reg"])[1]))/int(re.findall("[\d]+",x["reg"])[0]) if(re.search(r"[\d]+ for [$][\d]+" , x["reg"])) else x["value"], axis=1)
output:
reg value
0 2 for $20 10.0
1 4 for $24 6.0
2 2 for $30 15.0
3 Get $10 Cash 14.0
4 3 for $30 10.0
you are picking only reg column that's why you were not able to get value

Related

SAS Proc Print - No Output

I am so frustrated. I can't even get a proc print to work. I've tried so many things. I don't see the table in results viewer. My log says the file has been read and that I should see results. I've tried turning ods off and on and saving to work folder or saving to my own folder. I've tried switching to a list output. Right now, I just want this code to run which I got from: https://support.sas.com/resources/papers/proceedings11/270-2011.pdf .
data energy;
length state $2;
input region division state $ type expenditures ##;
datalines;
1 1 ME 1 708 1 1 ME 2 379 1 1 NH 1 597 1 1 NH 2 301
1 1 VT 1 353 1 1 VT 2 188 1 1 MA 1 3264 1 1 MA 2 2498
1 1 RI 1 531 1 1 RI 2 358 1 1 CT 1 2024 1 1 CT 2 1405
1 2 NY 1 8786 1 2 NY 2 7825 1 2 NJ 1 4115 1 2 NJ 2 3558
1 2 PA 1 6478 1 2 PA 2 3695 4 3 MT 1 322 4 3 MT 2 232
4 3 ID 1 392 4 3 ID 2 298 4 3 WY 1 194 4 3 WY 2 184
4 3 CO 1 1215 4 3 CO 2 1173 4 3 NM 1 545 4 3 NM 2 578
4 3 AZ 1 1694 4 3 AZ 2 1448 4 3 UT 1 621 4 3 UT 2 438
4 3 NV 1 493 4 3 NV 2 378 4 4 WA 1 1680 4 4 WA 2 1122
4 4 OR 1 1014 4 4 OR 2 756 4 4 CA 1 10643 4 4 CA 2 10114
4 4 AK 1 349 4 4 AK 2 329 4 4 HI 1 273 4 4 HI 2 298
;
proc sort data=energy out=energy_report;
by region division type;
run;
proc format;
value regfmt 1='Northeast'
2='South'
3='Midwest'
4='West';
value divfmt 1='New England'
2='Middle Atlantic'
3='Mountain'
4='Pacific';
value usetype 1='Residential Customers'
2='Business Customers';
run;
ods html file='my_report.html';
proc print data=energy_report;
run;
ods html close;
My log shows no errors:
NOTE: Writing HTML Body file: my_report.html
1582 proc print data=energy_report;
1583 run;
NOTE: There were 44 observations read from the data set WORK.ENERGY_REPORT.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.04 seconds
cpu time 0.00 seconds
When I go into my temporary files, I can open the "energy" and "energy_report" data set and I can view all the data. Why can't I see a print output? I'm not sure what I'm missing. I checked the output window, the results viewer window, and all the generated html files. They're all blank.
Thank you
It depends a lot on your set up, but I would enable HTML & Listing output and then check the output.
ods listing;
ods html;
proc print data=sashelp.class;
run;
If you're using EG the results should be in the process flow. If Studio, in the Results tab, if SAS Base, click on Results and open if necessary.
There is an option called 'Show Results as Generated' and it's possible it's been set to off in your installation for some reason. I often set mine up this way because I often generate a lot of files at once (HTML/XLSX) and don't want them to open up automatically.
Where you print to my_report.html, the file will probably be trying to go to C:\my_report.html - put in a full file path instead, and check that when you're done.
change
ods html file='my_report.html';
proc print data=energy_report;
run;
ods html close;
to
ods html file="&path./my_report4.html";
proc print data=energy_report;
run;
ods html close;
where &path contains the path where the file will be created.
And important : Use " instead of '. Double quote in the place of a quote.

Questions Regarding Dummy Variables in SAS

I was working on a problem that involved creating dummy variables, but I ran into an issue where I'm having missing values for the dummy variables in the corresponding reference category even though the dataset doesn't have missing values. Even if I'm selecting one of the categories to be the reference category or variable, shouldn't the dummy variable values be zero? I had the same issue even when I did not account for missing values. I've included my code, log, output, and the content of the text file for context and so that my question will be clearer.
The part of the homework assignment that I'm having issues with is the following:
Fibromyalgia is a syndrome of widespread body pain that is often treated by rheumatologists. One way of measuring the impact of fibromyalgia on patients is the Fibromyalgia Impact Questionnaire (FIQ). On the FIQ, high values show greater impact of disease (bad) and low values show lesser impact of disease (good). We have data on women with fibromyalgia who attended one of two types of disease self-management classes or who received standard care (the control group).
Data from this study are in the file fibr03_sum18.txt on the BS 805 web site in the Assignments section for Class 6. The variables in the data file are:
FIQ score (3.1 format) taken after the classes Group (1 = class 1, 2 = class 2, 3 = standard care) Disease Severity (On a scale of 1 to 6) before the classes Age (years) Since the data were entered into this file, information on a new patient and a correction to the data have been found. The new patient is in the control group, has FIQ = 8.2, Disease Severity =2, and Age = 25 years. The correction is that the second subject in class 1 was 17 rather than 18 years old.
A) Create a temporary SAS data set using these data. In the data set, create a set of indicator variables that code for group membership. Use PROC PRINT to list the data.
I read in the text file using column input, but I think it can be read in using list input as well? The text file contained the data below was the file was called: fibr03_sum18.txt.
3.1 1 6 21
1.8 1 6 18
3.3 1 5 22
2.9 1 4 15
4.3 1 3 24
4.8 1 3 22
4.9 1 2 17
6.4 1 2 18
5.7 2 5 17
6.1 2 5 25
8.5 2 3 31
7.1 2 2 17
7.7 2 1 25
9.8 2 1 22
5.1 3 4 23
7.2 3 1 15
8.3 3 1 22
6.7 3 2 20
My code for reading in the data and creating the temporary dataset with the dummy variables was:
*Part A: Reading in Data and Creating a Temporary Dataset;
libname HW6 'C:\Users\jackz\Desktop\SAS';
filename HW6new 'C:\Users\jackz\Desktop\SAS\fibr03_sum18.txt';
proc format;
value grpf 1='class 1' 2='class 2' 3='standard care';
run;
data one;
infile HW6new;
input #1 FIQ 3.1 #5 grp 1. #7 disev 1. #9 age 2.;
*Creating Dummy Variables;
if grp=1 then classc1=1; else if grp=2 then classc1=0;
if grp=2 then classc2=1; else if grp=1 then classc2=0;
if grp=. then classc1=.;
if grp=. then classc2=.;
label FIQ='FIQ Score'
grp='Group'
disev='Disease Severity'
age='Age';
format grp grpf.;
run;
*Printout of Dataset one;
proc print data=one label;
run;
My log for this code was:
NOTE: Copyright (c) 2016 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.4 (TS1M5)
Licensed to BOSTON UNIVERSITY - SFA T&R, Site 70009029.
NOTE: This session is executing on the W32_10HOME platform.
NOTE: Updated analytical products:
SAS/STAT 14.3
SAS/ETS 14.3
SAS/OR 14.3
SAS/IML 14.3
SAS/QC 14.3
NOTE: Additional host information:
W32_10HOME WIN 10.0.16299 Workstation
NOTE: SAS initialization used:
real time 0.96 seconds
cpu time 0.95 seconds
1 *Part A: Reading in Data and Creating a Temporary Dataset;
2 libname HW6 'C:\Users\jackz\Desktop\SAS';
NOTE: Libref HW6 was successfully assigned as follows:
Engine: V9
Physical Name: C:\Users\jackz\Desktop\SAS
3 filename HW6new 'C:\Users\jackz\Desktop\SAS\fibr03_sum18.txt';
4 proc format;
5 value grpf 1='class 1' 2='class 2' 3='standard care';
NOTE: Format GRPF has been output.
6 run;
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
7 data one;
8 infile HW6new;
9 input #1 FIQ 3.1 #5 grp 1. #7 disev 1. #9 age 2.;
10 *Creating Dummy Variables;
11 if grp=1 then classc1=1; else if grp=2 then classc1=0;
12 if grp=2 then classc2=1; else if grp=1 then classc2=0;
13 if grp=. then classc1=.;
14 if grp=. then classc2=.;
15 label FIQ='FIQ Score'
16 grp='Group'
17 disev='Disease Severity'
18 age='Age';
19 format grp grpf.;
20 run;
NOTE: The infile HW6NEW is:
Filename=C:\Users\jackz\Desktop\SAS\fibr03_sum18.txt,
RECFM=V,LRECL=32767,File Size (bytes)=214,
Last Modified=15Jun2018:12:56:26,
Create Time=15Jun2018:12:56:26
NOTE: 18 records were read from the infile HW6NEW.
The minimum record length was 10.
The maximum record length was 10.
NOTE: The data set WORK.ONE has 18 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
21 *Printout of Dataset one;
22 proc print data=one label;
NOTE: Writing HTML Body file: sashtml.htm
23 run;
NOTE: There were 18 observations read from the data set WORK.ONE.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.27 seconds
cpu time 0.06 seconds
Here is the output, although it is not lined up:
The SAS System
Obs FIQ Score Group Disease
Severity Age classc1 classc2
1 3.1 class 1 6 21 1 0
2 1.8 class 1 6 18 1 0
3 3.3 class 1 5 22 1 0
4 2.9 class 1 4 15 1 0
5 4.3 class 1 3 24 1 0
6 4.8 class 1 3 22 1 0
7 4.9 class 1 2 17 1 0
8 6.4 class 1 2 18 1 0
9 5.7 class 2 5 17 0 1
10 6.1 class 2 5 25 0 1
11 8.5 class 2 3 31 0 1
12 7.1 class 2 2 17 0 1
13 7.7 class 2 1 25 0 1
14 9.8 class 2 1 22 0 1
15 5.1 standard care 4 23 . .
16 7.2 standard care 1 15 . .
17 8.3 standard care 1 22 . .
18 6.7 standard care 2 20 . .
You can see that there are missing values for the dummy variables classc1 and classc2 even though there are no missing values in the original dataset. Should those values read 0, since group 3 does not fall in either grp=1 or grp=2?
Can anyone give me any hints as to what I have done wrong, if I have done anything wrong? Thanks for all of your help!
The output shows that the rows where the flag variables are missing values have group = 3 (standard care). The missing values are not missing due to the if statements, but due to the implicit resetting of data step variables to missing at the start of the implicit loop.
When group=3, there is no if statement that causes the flags variables to change from their initial 'reset to missing'
* when grp=3 neither classic1 nor classic2 variable is changed from its initial missing value;
put 'NOTE: ' _n_= (classic:) (=);
if grp=1 then classc1=1; else if grp=2 then classc1=0;
if grp=2 then classc2=1; else if grp=1 then classc2=0;
if grp=. then classc1=.;
if grp=. then classc2=.;
put 'NOTE: ' _n_= (classic:) (=);

Issue Calculating Mean of Grouped Data for entire range of dataset using Pandas

I have a data set of daily temperatures for which I want to calculate 20 year means. The data look like this:
1974 1 1 5.3 4.6 7.3 3.4
1974 1 2 3.3 7.2 4.5 6.5
...
2005 12 364 4.2 5.2 3.3 4.6
2005 12 365 3.1 5.5 2.6 6.8
There is no header in the file but the first column contains the year, the second column the month, and the third column the day of the year. The rest of the columns are temperature data.
I want to calculate the average temperature for each day over a period of 20 years. I thought the best way to do that would be to group the data by day and calculate the mean of each day for a specific range of years. Here is my code:
import pandas as pd
hist_fn = 'tmean_daily_1974_2005.txt'
twenty_year_fn = '20_yr_mean_1974_1993.txt'
start = 1974
end = 1993
hist_mean = pd.read_csv(hist_fn, sep='\s+', header=None)
# Limit dataframe to only the 20 years for which I want the mean calculated
interval_mean = hist_mean[(hist_mean[0]>=start) & (hist_mean[0]<=end)]
# Rename the first column to reflect what mean this file is displaying
interval_mean.iloc[:, 0] = ("%s-%s" % (start, end))
# Generate mean for each day spread across all the years in the dataframe
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
# Write multiyear mean to txt
interval_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)
The data set spans longer than 20 years and the method I used has worked for the first 20 year interval but gives me a (mostly) empty text file for any other set of years entered.
So when I use these inputs it works:
start = 1974
end = 1993
and it produces a file that looks like this:
1974-1993 1 1 4.33 5.25 6.84 3.67
1974-1993 1 2 7.23 6.22 5.65 6.23
...
1974-1993 12 364 5.12 4.34 5.21 2.16
1974-1993 12 365 4.81 5.95 3.56 6.78
but when I change the inputs to this:
start = 1975
end = 1994
it produces a .txt file with no temperatures:
1975-1994 1 1
1975-1994 1 2
...
1975-1994 12 364
1975-1994 12 365
I don't understand why this method works for the first 20 year interval but none of the subsequent intervals. Is it something to do with the way the data is organized or how it is being sliced?
Now when that's out of the way, we can talk about the problem you presented:
The strange behavior is due to the fact that pandas matches indices on assignment, and slicing preserves the original indices. That means that when setting
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
Note that interval_mean.groupby(2, as_index=False).mean() has indices 0, ... , 30 (since as_index=False makes the groupby operation create new indices. Otherwise, it would have been the day number).On the other had, interval_mean has the original indices from hist_mean, meaning the first time (first 20 years) it has the indices 0, ..., ~20*365 and the second time is has indices starting from arround 20*365 and counting up.
This is a bit confusing at first, but pandas offer great documentation about it, and people quickly discover why it is so useful.
I'll to explain what happens with an example:
Assume we have the following DataFrame:
df = pd.DataFrame(np.reshape(np.random.randint(5, size=30), [-1,3]))
df
0 1 2
0 1 1 2
1 2 1 1
2 0 1 2
3 0 2 0
4 2 1 0
5 0 1 2
6 2 2 1
7 1 0 2
8 0 1 0
9 1 2 0
Note that the column names are 0,1,2 and the row names (the index) are 0, ..., 9.
When we preform groupby we obtain
df.groupby(0, as_index=False).mean()
0 1 2
0 0 1.250000 1.000000
1 1 1.000000 1.333333
2 2 1.333333 0.666667
(The index equals to the columns grouped by just because draw numbers between 0 to 2). Now, when will do assignments to df.loc, it will replace every cell by the corresponding cell in the assignee, if such cell exists. Otherwise, it will leave NA.
df.loc[:,:] = df.groupby(0, as_index=False).mean()
df
0 1 2
0 0.0 1.250000 1.000000
1 1.0 1.000000 1.333333
2 2.0 1.333333 0.666667
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
And when you write NA to csv, it leaves the cell blank.
The last piece of the puzzle is how interval_mean preserved the original indices, but this is because slicing preserves the original indices:
df[df[1] > 1]
0 1 2
3 0 2 0
6 2 2 1
9 1 2 0

SAS_Specific Conditional Sum_

I have a question on SAS Programming. It is about conditional sum. But it is very specific for me. Therefore, I want to ask as an example. I have the following dataset:
Group A Quantity
1 10 7
1 8 4
1 7 3
1 10 5
2 11 6
2 13 8
2 9 7
2 13 9
I want to add two more columns to this dataset. The new dataset should be:
Group A Quantity B NewColumn
1 10 7 10 12 (7+5)
1 8 4 10 12
1 7 3 10 12
1 10 5 10 12
2 13 6 13 15 (6+9)
2 10 8 13 15
2 9 7 13 15
2 13 9 13 15
So, the column B should be equal tha maximum value of each group and it is the same for all observations of each group. In this example, Group number 1 has 4 values. They are 10, 8, 7, 10. The maximum among these values is 10. Therefore, the values of the observations of the B column for the first group are all equal to 10. Maximum number for group number 2 is 13. Therefore, the values of the observations of the B column for the second group are all equal to 13.
The column C is more complicated. Its value depends on the all columns. Similiar to B column, it will be the same within group. More detailed, it is the sum of the specific observations of QUANTITIES column. These specific observations should belong to the observations that have the maximum value in each group. In our example, it is 12 for the first group. The reason is, the maximum number of first group is 10. and the quantities belong to 10 are 7 and 5. So, the sum of these is 12. For the second group it is 15. because the maximum value of the second group is 13 and the quantities belong to 13 are 6 and 9. So the sum is 15.
I hope. I can explain it. Many thanks in advance.
You can do this with proc sql:
proc sql;
select t.*, max_a as b,
(select sum(t2.quantity)
from t t2
where t2.group = t.group and t.a = max_a
) as c
from t join
(select group, max(a) as max_a
from t
group by group
) g
on t.group = g.group;
run;
If the data is coming from an underlying database, most databases support window functions which make this easier.
This is untested (I'm away from sas) and will probably have mistakes, but a triple DoW loop should work. One pass to get the max per group, second pass to get the sum, third pass to output the records. Something like:
data want ;
do until(last.group) ;
by group ;
set have ;
B=max(A,B) ;
end ;
do until(last.group) ;
set have ;
by group ;
if A = B then NewColumn = sum(NewColumn, Quantity) ;
end;
do until(last.group);
set have ;
by group;
output ;
end ;
run;

Summarizing data in SAS across groups

My data set is in this format as mentioned below:
NEWID
Age
H_PERS
Income
OCCU
FAMTYPE
REGION
Metro(Yes/No)
Exp_alcohol
population sample-(This is the weighted population each new id represents) etc.
I would like to generate a summarized view like below:
average expenditure value (This should be sum of (exp_alcohol/population sample))
% of population sample across Region Metro and each demographic variable
Please help me with your ideas.
Since I can't see your data set and your description was not very clear, I'm going to guess that you have data that looks something like this and you would like add some new variables that summarizes your data...
data alcohol;
input NEWID Age H_PERS Income OCCU $ FAMTYPE $ REGION $ Metro $
Exp_alcohol population_sample;
datalines;
1234 32 4 65000 abc m CA Yes 2 4
5678 23 5 35000 xyz s WA Yes 3 6
9923 34 3 49000 def d OR No 3 9
8844 26 4 54000 gdp m CA No 1 5
;
run;
data summar;
set alcohol;
retain TotalAvg_expend metro_count total_pop;
Divide = exp_alcohol/population_sample;
TotalAvg_expend + Divide;
total_pop + population_sample;
if metro = 'Yes' then metro_count + population_sample;
percent_metro = (metro_count/total_pop)*100;
drop NEWID Age H_PERS Income OCCU FAMTYPE REGION Divide;
run;
Output:
Exp_ population_ TotalAvg_ metro_ total_ percent_
Metro alcohol sample expend count pop metro
Yes 2 4 0.50000 4 4 100.000
Yes 3 6 1.00000 10 10 100.000
No 3 9 1.33333 10 19 52.632
No 1 5 1.53333 10 24 41.667