How can I modify the data frame below?
df <- data.frame (ID = c(1, 2, 2, 3), Name = c("Luke", "Pete", "Marie", "Frank"), Age = c(25, 34, 66, 45))
ID Name Age
1 Luke 25
2 Pete 34
2 Marie 66
3 Frank 45
I want to remove the duplicated ID and replace it with the next available ID:
ID Name Age
1 Luke 25
2 Pete 34
4 Marie 66
3 Frank 45
Thanks for your help.
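One possible reading of the request, sketched in plain Python rather than R: keep the first occurrence of each ID and give every later duplicate the smallest positive integer that is not in use yet. That interpretation of "next ID available" is an assumption (it matches the example, where the second 2 becomes 4), and the function name reassign_duplicate_ids is made up for illustration.

```python
def reassign_duplicate_ids(ids):
    """Keep first occurrences; replace later duplicates with unused IDs.

    Assumption: "next available" means the smallest positive integer
    that does not appear anywhere in the ID column yet.
    """
    used = set(ids)      # every ID that is already taken
    seen = set()         # IDs encountered so far while scanning
    candidate = 1        # next integer to try as a replacement
    result = []
    for current in ids:
        if current in seen:
            # Duplicate: advance to the first integer nobody uses.
            while candidate in used:
                candidate += 1
            used.add(candidate)
            result.append(candidate)
        else:
            seen.add(current)
            result.append(current)
    return result


ids = [1, 2, 2, 3]
names = ["Luke", "Pete", "Marie", "Frank"]
new_ids = reassign_duplicate_ids(ids)
for new_id, name in zip(new_ids, names):
    print(new_id, name)
```

If "next available" should instead mean one more than the current maximum, only the starting value of candidate would need to change.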
I have the following data:
Patient Visit VisitNumber LAB LABVALUE
001 BASELINE 1 LAB1 10
001 DAY 100 2 LAB1 15
001 DAY 200 3 LAB1 12
002 BASELINE 1 LAB1 11
002 DAY 100 2 LAB1 14
002 DAY 200 3 LAB1 12
001 BASELINE 1 LAB2 40
001 DAY 100 2 LAB2 45
001 DAY 200 3 LAB2 42
002 BASELINE 1 LAB2 41
002 DAY 100 2 LAB2 44
002 DAY 200 3 LAB2 42
I would like to create the following table, which summarizes the variable 'LABVALUE' for all patients at each visit (Table 2):
Visit VisitNumber LAB MEAN BASELINEMEAN CHANGEBASEMEAN
BASELINE 1 LAB1 10.5 10.5 .
DAY 100 2 LAB1 14.5 10.5 4
DAY 200 3 LAB1 12 10.5 1.5
BASELINE 1 LAB2 40.5 40.5 .
DAY 100 2 LAB2 44.5 40.5 4
DAY 200 3 LAB2 42 40.5 1.5
I have the following code that generates the change in values from baseline for each visit by patient:
proc sort data=have;
by patient lab visitnumber;
run;
data for_report;
set have;
by patient lab;
retain base_visitnum base_labvalue;
if first.patient then do;
base_visitnum = .;
base_labvalue = .;
end;
if first.lab and visit='BASELINE' then do;
base_visitnumber = visitnumber;
base_labvalue = labvalue;
end;
if not first.lab then do;
delta_labvalue = labvalue - base_labvalue;
end;
run;
This generates the following table:
LAB Visit VisitNumber LABVALUE BASE_VISITNUM BASE_LABVALUE DELTA_LABVALUE
LAB1 BASELINE 1 10 1 10 .
LAB1 DAY 100 2 15 1 10 5
LAB1 DAY 200 3 12 1 10 2
LAB1 BASELINE 1 11 1 11 .
LAB1 DAY 100 2 14 1 11 3
LAB1 DAY 200 3 12 1 11 1
LAB2 BASELINE 1 40 1 10 .
LAB2 DAY 100 2 45 1 10 5
LAB2 DAY 200 3 42 1 10 2
LAB2 BASELINE 1 41 1 11 .
LAB2 DAY 100 2 44 1 11 3
LAB2 DAY 200 3 42 1 11 1
Any insight as to how I can generate Table 2 would be greatly appreciated.
This should get you most of the way there:
proc sql noprint;
create table table2 as
select visit,
visitnumber,
lab,
mean(labvalue) as mean,
mean(base_labvalue) as baselinemean
from for_report
group by visit, visitnumber, lab
;
quit;
I've left some details for you to complete :-)
Also, watch out for the mismatch between base_visitnum and base_visitnumber in your example code.
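For checking the numbers independently of SAS, here is a rough plain-Python sketch of the same summary. It assumes every patient has every visit, so that the change from baseline can be taken as the visit mean minus the baseline mean; the variable names (cells, table2 and so on) are mine, not from the question.

```python
from collections import defaultdict
from statistics import mean

# (patient, visit, visit_number, lab, lab_value) rows copied from the question.
rows = [
    ("001", "BASELINE", 1, "LAB1", 10), ("001", "DAY 100", 2, "LAB1", 15),
    ("001", "DAY 200", 3, "LAB1", 12), ("002", "BASELINE", 1, "LAB1", 11),
    ("002", "DAY 100", 2, "LAB1", 14), ("002", "DAY 200", 3, "LAB1", 12),
    ("001", "BASELINE", 1, "LAB2", 40), ("001", "DAY 100", 2, "LAB2", 45),
    ("001", "DAY 200", 3, "LAB2", 42), ("002", "BASELINE", 1, "LAB2", 41),
    ("002", "DAY 100", 2, "LAB2", 44), ("002", "DAY 200", 3, "LAB2", 42),
]

# Collect the values of each (lab, visit_number, visit) cell across patients.
cells = defaultdict(list)
for patient, visit, visit_number, lab, value in rows:
    cells[(lab, visit_number, visit)].append(value)

# Mean per cell, then the baseline mean for each lab.
cell_means = {key: mean(values) for key, values in cells.items()}
baseline_means = {
    lab: m for (lab, _, visit), m in cell_means.items() if visit == "BASELINE"
}

# One output row per cell; the change is left empty at baseline.
table2 = []
for (lab, visit_number, visit), m in sorted(cell_means.items()):
    base = baseline_means[lab]
    change = None if visit == "BASELINE" else m - base
    table2.append((visit, visit_number, lab, m, base, change))

for row in table2:
    print(row)
```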
I have the following dataset:
DATA survey;
INPUT id sex $ age inc r1 r2 r3 ;
DATALINES;
1 F 35 17 7 2 2
17 M 50 14 5 5 3
33 F 45 6 7 2 7
49 M 24 14 7 5 7
65 F 52 9 4 7 7
81 M 44 11 7 7 7
2 F 34 17 6 5 3
18 M 40 14 7 5 2
34 F 47 6 6 5 6
50 M 35 17 5 7 5
;
Now I would like to create two data sets based on whether the records are female (F) or not. To do that, I run this:
data female other;
set survey;
if sex = "F" then output USA;
else output other;
run;
PROC PRINT; RUN;
This, however, does not give me two data sets split by the F and M values. Any idea what I am doing wrong here?
When you look in the log window, do you see any error messages?
If your code is
if sex = "F" then output USA;
you should see an error, because the DATA statement does not include a dataset named USA. If you change USA to FEMALE it should work.
Learning to read log messages is an essential skill in SAS.
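To make the intended routing concrete, here is the same split sketched in plain Python (not SAS). The point it illustrates is that each record goes to exactly one of two named targets, and both names must be declared up front, just as both data sets must appear on the DATA statement. The list names female and other mirror the question; everything else is illustrative.

```python
# (id, sex, age, inc, r1, r2, r3) records copied from the question's DATALINES.
survey = [
    (1, "F", 35, 17, 7, 2, 2),
    (17, "M", 50, 14, 5, 5, 3),
    (33, "F", 45, 6, 7, 2, 7),
    (49, "M", 24, 14, 7, 5, 7),
    (65, "F", 52, 9, 4, 7, 7),
    (81, "M", 44, 11, 7, 7, 7),
    (2, "F", 34, 17, 6, 5, 3),
    (18, "M", 40, 14, 7, 5, 2),
    (34, "F", 47, 6, 6, 5, 6),
    (50, "M", 35, 17, 5, 7, 5),
]

# Both targets are declared before the loop, like "data female other;".
female = []
other = []

for record in survey:
    sex = record[1]
    if sex == "F":
        female.append(record)   # corresponds to "output female;"
    else:
        other.append(record)    # corresponds to "output other;"

print(len(female), "female records,", len(other), "other records")
```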
The questionnaire I have data from asked respondents to rank 20 items on a scale of importance to them. The lower end of the scale contained a "bin" in which respondents could throw away any of the 20 items that they found completely unimportant to them. The result is a dataset with 20 variables (one for every item). Every variable receives a number between 1 and 100 (and 0 if the item was thrown in the bin).
I would like to recode the entries into a ranking of the variables for every respondent. So all variables would receive a number between 1 and 20 relative to where that respondent ranked it.
Example:
Current:
item1 item2 item3 item4 item5 item6 item7 item8 etc.
respondent1 67 44 29 7 0 99 35 22
respondent2 0 42 69 50 12 0 67 100
etc.
What I want:
item1 item2 item3 item4 item5 item6 item7 item8 etc.
respondent1 7 6 4 2 1 8 5 3
respondent2 1 4 7 5 3 1 6 8
etc.
As you can see with respondent2, I would like items that received the same value to get the same rank, with the ranking then skipping a number.
I have found a lot of info on how to rank observations but I have not found out how to rank variables yet. Is there anyone that knows how to do this?
Here is one solution using reshape:
/* Create sample data */
clear *
set obs 2
gen respondant = "respondant1"
replace respondant = "respondant2" in 2
set seed 123456789
forvalues i = 1/10 {
gen item`i' = ceil(runiform()*100)
}
replace item2 = item1 if respondant == "respondant2"
list
+----------------------------------------------------------------------------------------------+
| respondant item1 item2 item3 item4 item5 item6 item7 item8 item9 item10 |
|----------------------------------------------------------------------------------------------|
1. | respondant1 14 56 69 62 56 26 43 53 22 27 |
2. | respondant2 65 65 11 7 88 5 90 85 57 95 |
+----------------------------------------------------------------------------------------------+
/* reshape long first */
reshape long item, i(respondant) j(itemNum)
/* Rank observations, accounting for ties */
by respondant (item), sort : gen rank = _n
/* run within respondant so a tie is never carried across respondents */
by respondant : replace rank = rank[_n-1] if item == item[_n-1] & _n > 1
/* reshape back to wide format */
drop item // optional, you can keep and just include in reshape wide
reshape wide rank, i(respondant) j(itemNum)
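The tie rule being applied here (equal values share the lowest rank and the following rank is skipped) can be sketched outside Stata in a few lines of Python. This is only an illustration of the ranking logic on the two example respondents from the question; the helper name rank_row is made up.

```python
def rank_row(values):
    """Rank values within one respondent, lowest value = rank 1.

    Ties share the smallest rank of their group, and the ranks that
    the tied items would otherwise occupy are skipped.
    """
    first_position = {}
    for position, value in enumerate(sorted(values), start=1):
        # Remember only the first (lowest) position of each distinct value.
        first_position.setdefault(value, position)
    return [first_position[value] for value in values]


respondent1 = [67, 44, 29, 7, 0, 99, 35, 22]
respondent2 = [0, 42, 69, 50, 12, 0, 67, 100]

print(rank_row(respondent1))
print(rank_row(respondent2))
```

Because the ranking is done per row of the wide data, no reshaping is needed in this form; the reshape in the Stata answer exists only because Stata ranks down observations rather than across variables.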
I have a csv file that shows parts on order. The columns include days late, qty and commodity.
I need to group the data by days late and commodity with a sum of the qty. However the days late needs to be grouped into ranges.
>56
>35 and <= 56
>14 and <= 35
>0 and <=14
I was hoping I could use a dict somehow, something like this:
{'Red':'>56','Amber':'>35 and <= 56','Yellow':'>14 and <= 35','White':'>0 and <=14'}
I am looking for a result like this
Red Amber Yellow White
STRSUB 56 60 74 40
BOTDWG 20 67 87 34
I am new to pandas, so I don't know whether this is possible at all. Could anyone offer some advice?
Thanks
Suppose you start with this data:
df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
# Days Late ID quantity
# 0 60 STRSUB 56
# 1 60 BOTDWG 20
# 2 50 STRSUB 60
# 3 50 BOTDWG 67
# 4 20 STRSUB 74
# 5 20 BOTDWG 87
# 6 10 STRSUB 40
# 7 10 BOTDWG 34
Then you can find the status category using pd.cut. Note that by default, pd.cut splits the Series df['Days Late'] into categories which are half-open intervals, (-1, 14], (14, 35], (35, 56], (56, 365]:
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
print(df)
# ID quantity status
# 0 STRSUB 56 Red
# 1 BOTDWG 20 Red
# 2 STRSUB 60 Amber
# 3 BOTDWG 67 Amber
# 4 STRSUB 74 Yellow
# 5 BOTDWG 87 Yellow
# 6 STRSUB 40 White
# 7 BOTDWG 34 White
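Now use pivot to get the DataFrame in the desired form (if an ID can have several rows with the same status, use pivot_table with aggfunc='sum' instead, because pivot raises an error on duplicate entries):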
df = df.pivot(index='ID', columns='status', values='quantity')
and use reindex to obtain the desired order for the rows and columns:
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
Thus,
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
df = df.pivot(index='ID', columns='status', values='quantity')
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
print(df)
yields
Red Amber Yellow White
ID
STRSUB 56 60 74 40
BOTDWG 20 67 87 34
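As a cross-check that does not depend on pandas, the same banding and summing can be sketched with the standard library. The assumptions here are mine: the band edges 14, 35 and 56 are right-closed as in the question, anything at or below 14 (including 0 or negative values) lands in White, and the last row is a made-up duplicate added only to show that quantities in the same cell are summed.

```python
from bisect import bisect_left
from collections import defaultdict

# Upper edges of the right-closed bands: (.., 14], (14, 35], (35, 56], (56, ..).
edges = [14, 35, 56]
labels = ["White", "Yellow", "Amber", "Red"]

# (ID, days late, quantity). The first eight rows mirror the sample data above;
# the ninth is a hypothetical extra row that falls in an already occupied cell.
rows = [
    ("STRSUB", 60, 56), ("BOTDWG", 60, 20),
    ("STRSUB", 50, 60), ("BOTDWG", 50, 67),
    ("STRSUB", 20, 74), ("BOTDWG", 20, 87),
    ("STRSUB", 10, 40), ("BOTDWG", 10, 34),
    ("STRSUB", 70, 4),
]

totals = defaultdict(int)
for part_id, days_late, quantity in rows:
    # bisect_left returns how many edges are strictly below days_late,
    # which is exactly the index of the right-closed band it falls in.
    status = labels[bisect_left(edges, days_late)]
    totals[(part_id, status)] += quantity

for part_id in ("STRSUB", "BOTDWG"):
    print(part_id, [totals[(part_id, status)] for status in reversed(labels)])
```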
You can create a column in your DataFrame based on your Days Late column by using the map or apply functions as follows. Let's first create some sample data.
import numpy
import pandas
df = pandas.DataFrame({ 'ID': 'foo,bar,foo,bar,foo,bar,foo,foo'.split(','),
'Days Late': numpy.random.randn(8)*20+30})
Days Late ID
0 30.746244 foo
1 16.234267 bar
2 14.771567 foo
3 33.211626 bar
4 3.497118 foo
5 52.482879 bar
6 11.695231 foo
7 47.350269 foo
Create a helper function to transform the data of the Days Late column and add a column called Code.
def days_late_xform(dl):
if dl > 56: return 'Red'
elif 35 < dl <= 56: return 'Amber'
elif 14 < dl <= 35: return 'Yellow'
elif 0 < dl <= 14: return 'White'
else: return 'None'
df["Code"] = df['Days Late'].map(days_late_xform)
Days Late ID Code
0 30.746244 foo Yellow
1 16.234267 bar Yellow
2 14.771567 foo Yellow
3 33.211626 bar Yellow
4 3.497118 foo White
5 52.482879 bar Amber
6 11.695231 foo White
7 47.350269 foo Amber
Lastly, you can use groupby to aggregate by the ID and Code columns, and get the counts of the groups as follows:
g = df.groupby(["ID","Code"]).size()
print(g)
ID Code
bar Amber 1
Yellow 2
foo Amber 1
White 2
Yellow 2
df2 = g.unstack()
print(df2)
Code Amber White Yellow
ID
bar 1 NaN 2
foo 1 2 2
I know this is coming a bit late, but I had the same problem as you and wanted to share the function np.digitize. It sounds like exactly what you want.
import numpy as np
a = np.random.randint(0, 100, 50)
grps = np.arange(0, 100, 10)
grps2 = [1, 20, 25, 40]
print(a)
[35 76 83 62 57 50 24 0 14 40 21 3 45 30 79 32 29 80 90 38 2 77 50 73 51
71 29 53 76 16 93 46 14 32 44 77 24 95 48 23 26 49 32 15 2 33 17 88 26 17]
print(np.digitize(a, grps))
[ 4 8 9 7 6 6 3 1 2 5 3 1 5 4 8 4 3 9 10 4 1 8 6 8 6
8 3 6 8 2 10 5 2 4 5 8 3 10 5 3 3 5 4 2 1 4 2 9 3 2]
print(np.digitize(a, grps2))
[3 4 4 4 4 4 2 0 1 4 2 1 4 3 4 3 3 4 4 3 1 4 4 4 4 4 3 4 4 1 4 4 1 3 4 4 2
4 4 2 3 4 3 1 1 3 1 4 3 1]
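One thing worth spelling out is which side of each bin is closed. By default np.digitize treats bins as closed on the left (bins[i-1] <= x < bins[i]), while passing right=True closes them on the right, which is what ranges like ">14 and <= 35" need. The sketch below reproduces both behaviours with the standard library's bisect module on a handful of values taken from the array above; it is an illustration of the indexing rule, not a replacement for digitize.

```python
from bisect import bisect_left, bisect_right

grps2 = [1, 20, 25, 40]
sample = [35, 76, 24, 0, 14, 40]  # a few values from the random array above

# Same indices as np.digitize(sample, grps2): bins[i-1] <= x < bins[i].
left_closed = [bisect_right(grps2, x) for x in sample]

# Same indices as np.digitize(sample, grps2, right=True): bins[i-1] < x <= bins[i].
right_closed = [bisect_left(grps2, x) for x in sample]

print(left_closed)
print(right_closed)
```

The two results differ only for values that sit exactly on an edge, such as 40 here.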