How can I delete objects without complete data using Stata?

I have a large panel dataset that looks as follows.
input id age high weight str6 daily_drink
1 10 110 35 water
1 10 110 35 coffee
1 11 120 38 water
1 11 120 38 coffee
1 12 130 50 water
1 12 130 50 coffee
2 11 118 31 water
2 11 118 31 coffee
2 11 118 31 milk
2 12 123 38 water
2 12 123 38 coffee
2 12 123 38 milk
3 10 98 55 water
3 11 116 36 water
3 12 129 39 water
4 12 125 40 water
end
However, I would like to use Stata to keep only the ids that have observations at all three ages (10, 11, and 12). The result should look like this.
id age high weight daily_drink
1 10 110 35 water
1 10 110 35 coffee
1 11 120 38 water
1 11 120 38 coffee
1 12 130 50 water
1 12 130 50 coffee
3 10 98 55 water
3 11 116 36 water
3 12 129 39 water
No row contains missing values, so I cannot simply drop the rows with missing data. Is there any way to do this? Any suggestion will help. Thanks in advance.

You can use bysort and egen for this. Something along the lines of
bysort id: egen has10 = total(age==10)
bysort id: egen has11 = total(age==11)
bysort id: egen has12 = total(age==12)
keep if (has10 != 0) & (has11 != 0) & (has12 != 0)
should work (untested). See help egen for more info. Install gtools if you have very large data (ssc install gtools) and then replace egen by gegen.

A solution that works if 10, 11, 12 are the only age values possible:
bysort id (age) : gen nvals = sum(age != age[_n-1])
by id : replace nvals = nvals[_N]
keep if nvals == 3
Consider also
bysort id (age) : gen OK1 = age[1] == 10 & age[_N] == 12
by id : egen OK2 = max(age == 11)
keep if OK1 & OK2
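For readers who want to cross-check the logic outside Stata, here is a minimal pandas sketch of the same idea: keep an id only if its set of ages covers 10, 11, and 12. The names df, required, and kept are illustrative assumptions, not part of the original question, and only the id and age columns are reproduced.

```python
import pandas as pd

# Illustrative copy of the id/age columns from the question (other columns omitted).
df = pd.DataFrame({
    "id":  [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4],
    "age": [10, 10, 11, 11, 12, 12, 11, 11, 11, 12, 12, 12, 10, 11, 12, 12],
})

required = {10, 11, 12}  # ages every retained id must have

# groupby.filter keeps all rows of a group when the function returns True.
kept = df.groupby("id").filter(lambda g: required.issubset(set(g["age"])))
print(kept)
```

Unlike the nvals approach, this check does not assume that 10, 11, and 12 are the only possible ages.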

Related

Replace ID repeated for next value available in the column ID

How can the data frame below be modified:
df <- data.frame (ID = c(1, 2, 2, 3), Name = c("Luke", "Pete", "Marie", "Frank"), Age = c(25, 34, 66, 45))
ID Name Age
1 Luke 25
2 Pete 34
2 Marie 66
3 Frank 45
so that each duplicated ID is replaced with the next available ID?
ID Name Age
1 Luke 25
2 Pete 34
4 Marie 66
3 Frank 45
Thanks for help

SAS macro for mean change in baseline values

I have the following data:
Patient Visit VisitNumber LAB LABVALUE
001 BASELINE 1 LAB1 10
001 DAY 100 2 LAB1 15
001 DAY 200 3 LAB1 12
002 BASELINE 1 LAB1 11
002 DAY 100 2 LAB1 14
002 DAY 200 3 LAB1 12
001 BASELINE 1 LAB2 40
001 DAY 100 2 LAB2 45
001 DAY 200 3 LAB2 42
002 BASELINE 1 LAB2 41
002 DAY 100 2 LAB2 44
002 DAY 200 3 LAB2 42
I would like to create the following table, which summarizes the variable 'LABVALUE' for all patients at each visit (Table 2):
Visit VisitNumber LAB MEAN BASELINEMEAN CHANGEBASEMEAN
BASELINE 1 LAB1 10.5 10.5 .
DAY 100 2 LAB1 14.5 10.5 4
DAY 200 3 LAB1 12 10.5 1.5
BASELINE 1 LAB2 40.5 40.5 .
DAY 100 2 LAB2 44.5 40.5 4
DAY 200 3 LAB2 42 40.5 1.5
I have the following code that generates the change in values from baseline for each visit by patient:
proc sort data=have;
by patient lab visitnumber;
run;
data for_report;
set have;
by patient lab;
retain base_visitnum base_labvalue;
if first.patient then do;
base_visitnum = .;
base_labvalue = .;
end;
if first.lab and visit='BASELINE' then do;
base_visitnumber = visitnumber;
base_labvalue = labvalue;
end;
if not first.lab then do;
delta_labvalue = labvalue - base_labvalue;
end;
run;
This generates the following table:
LAB Visit VisitNumber LABVALUE BASE_VISITNUM BASE_LABVALUE DELTA_LABVALUE
LAB1 BASELINE 1 10 1 10 .
LAB1 DAY 100 2 15 1 10 5
LAB1 DAY 200 3 12 1 10 2
LAB1 BASELINE 1 11 1 11 .
LAB1 DAY 100 2 14 1 11 3
LAB1 DAY 200 3 12 1 11 1
LAB2 BASELINE 1 40 1 10 .
LAB2 DAY 100 2 45 1 10 5
LAB2 DAY 200 3 42 1 10 2
LAB2 BASELINE 1 41 1 11 .
LAB2 DAY 100 2 44 1 11 3
LAB2 DAY 200 3 42 1 11 1
Any insight as to how I can generate Table 2 would be greatly appreciated.
This should get you most of the way there:
proc sql noprint;
create table table2 as
select visit,
visitnumber,
lab,
mean(labvalue) as mean,
mean(base_labvalue) as baselinemean
from for_report
group by visit, visitnumber, lab
;
quit;
I've left some details for you to complete :-)
Also, watch out for the mismatch between base_visitnum and base_visitnumber in your example code.
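As a way to sanity-check the numbers in Table 2 independently of SAS, the same summary can be sketched in pandas. This is only an illustration under the assumption that the data is loaded into a DataFrame called have with lowercase column names; table2, baseline, and the derived column names are hypothetical.

```python
import pandas as pd

# Illustrative copy of the data from the question.
have = pd.DataFrame({
    "patient": ["001"] * 3 + ["002"] * 3 + ["001"] * 3 + ["002"] * 3,
    "visit": ["BASELINE", "DAY 100", "DAY 200"] * 4,
    "visitnumber": [1, 2, 3] * 4,
    "lab": ["LAB1"] * 6 + ["LAB2"] * 6,
    "labvalue": [10, 15, 12, 11, 14, 12, 40, 45, 42, 41, 44, 42],
})

# Mean lab value across patients for each lab and visit.
table2 = (
    have.groupby(["lab", "visitnumber", "visit"], as_index=False)["labvalue"]
    .mean()
    .rename(columns={"labvalue": "mean"})
)

# Attach each lab's baseline mean to every visit row of that lab.
baseline = (
    table2.loc[table2["visit"] == "BASELINE", ["lab", "mean"]]
    .rename(columns={"mean": "baselinemean"})
)
table2 = table2.merge(baseline, on="lab", how="left")

# Change from baseline; left missing on the baseline row itself.
table2["changebasemean"] = (table2["mean"] - table2["baselinemean"]).where(
    table2["visit"] != "BASELINE"
)
print(table2)
```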

Create two datasets in SAS output

I have the following dataset:
DATA survey;
INPUT id sex $ age inc r1 r2 r3 ;
DATALINES;
1 F 35 17 7 2 2
17 M 50 14 5 5 3
33 F 45 6 7 2 7
49 M 24 14 7 5 7
65 F 52 9 4 7 7
81 M 44 11 7 7 7
2 F 34 17 6 5 3
18 M 40 14 7 5 2
34 F 47 6 6 5 6
50 M 35 17 5 7 5
;
Now I would like to create two files based on whether the records are female (F) or not. Therefore I do this:
data female other;
set survey;
if sex = "F" then output USA;
else output other;
run;
PROC PRINT; RUN;
This however does not give me two sets with data depending on the F and M value. Any idea on what I am doing wrong here?
When you look in the log window, do you see any error messages?
If your code is
if sex = "F" then output USA;
you should see an error, because the DATA statement does not include a dataset named USA. If you change USA to FEMALE it should work.
Learning to read log messages is an essential skill in SAS.

Ranking variables (not observations)

The questionnaire I have data from asked respondents to rank 20 items on a scale of importance to them. The lower end of the scale contained a "bin" in which respondents could throw away any of the 20 items that they found completely unimportant to them. The result is a dataset with 20 variables (1 for every item). Every variable receives a number between 1 and 100 (and 0 if the item was thrown in the bin).
I would like to recode the entries into a ranking of the variables for every respondent. So all variables would receive a number between 1 and 20 relative to where that respondent ranked it.
Example:
Current:
item1 item2 item3 item4 item5 item6 item7 item8 etc.
respondent1 67 44 29 7 0 99 35 22
respondent2 0 42 69 50 12 0 67 100
etc.
What I want:
item1 item2 item3 item4 item5 item6 item7 item8 etc.
respondent1 7 6 4 2 1 8 5 3
respondent2 1 4 7 5 3 1 6 8
etc.
As you can see with respondent2, I would like items that received the same value, to get the same rank and the ranking to then skip a number.
I have found a lot of info on how to rank observations but I have not found out how to rank variables yet. Is there anyone that knows how to do this?
Here is one solution using reshape:
/* Create sample data */
clear *
set obs 2
gen respondant = "respondant1"
replace respondant = "respondant2" in 2
set seed 123456789
forvalues i = 1/10 {
gen item`i' = ceil(runiform()*100)
}
replace item2 = item1 if respondant == "respondant2"
list
+----------------------------------------------------------------------------------------------+
| respondant item1 item2 item3 item4 item5 item6 item7 item8 item9 item10 |
|----------------------------------------------------------------------------------------------|
1. | respondant1 14 56 69 62 56 26 43 53 22 27 |
2. | respondant2 65 65 11 7 88 5 90 85 57 95 |
+----------------------------------------------------------------------------------------------+
/* reshape long first */
reshape long item, i(respondant) j(itemNum)
/* Rank observations, accounting for ties */
by respondant (item), sort : gen rank = _n
by respondant : replace rank = rank[_n-1] if item == item[_n-1] & _n > 1
/* reshape back to wide format */
drop item // optional, you can keep and just include in reshape wide
reshape wide rank, i(respondant) j(itemNum)
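The tie rule the question asks for (equal values share the lowest rank, and the next rank is skipped) can also be illustrated in pandas, where ranking across columns is a single call. This is a sketch for comparison only; the DataFrame scores simply re-enters the two example respondents from the question.

```python
import pandas as pd

# The two example respondents from the question, items 1 to 8.
scores = pd.DataFrame(
    {
        "item1": [67, 0], "item2": [44, 42], "item3": [29, 69], "item4": [7, 50],
        "item5": [0, 12], "item6": [99, 0], "item7": [35, 67], "item8": [22, 100],
    },
    index=["respondent1", "respondent2"],
)

# axis=1 ranks within each row; method="min" gives ties the lowest rank
# of the tied group, so the following rank is skipped.
ranks = scores.rank(axis=1, method="min").astype(int)
print(ranks)
```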

Grouping data by value ranges

I have a csv file that shows parts on order. The columns include days late, qty and commodity.
I need to group the data by days late and commodity with a sum of the qty. However the days late needs to be grouped into ranges.
>56
>35 and <= 56
>14 and <= 35
>0 and <=14
I was hoping I could use a dict somehow. Something like this:
{'Red':'>56','Amber':'>35 and <= 56','Yellow':'>14 and <= 35','White':'>0 and <=14'}
I am looking for a result like this
Red Amber Yellow White
STRSUB 56 60 74 40
BOTDWG 20 67 87 34
I am new to pandas, so I don't know if this is possible at all. Could anyone provide some advice?
Thanks
Suppose you start with this data:
df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
# Days Late ID quantity
# 0 60 STRSUB 56
# 1 60 BOTDWG 20
# 2 50 STRSUB 60
# 3 50 BOTDWG 67
# 4 20 STRSUB 74
# 5 20 BOTDWG 87
# 6 10 STRSUB 40
# 7 10 BOTDWG 34
Then you can find the status category using pd.cut. Note that by default, pd.cut splits the Series df['Days Late'] into categories which are half-open intervals, (-1, 14], (14, 35], (35, 56], (56, 365]:
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
print(df)
# ID quantity status
# 0 STRSUB 56 Red
# 1 BOTDWG 20 Red
# 2 STRSUB 60 Amber
# 3 BOTDWG 67 Amber
# 4 STRSUB 74 Yellow
# 5 BOTDWG 87 Yellow
# 6 STRSUB 40 White
# 7 BOTDWG 34 White
Now use pivot to get the DataFrame in the desired form:
df = df.pivot(index='ID', columns='status', values='quantity')
and use reindex to obtain the desired order for the rows and columns:
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
Thus,
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
df = df.pivot(index='ID', columns='status', values='quantity')
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
print(df)
yields
Red Amber Yellow White
ID
STRSUB 56 60 74 40
BOTDWG 20 67 87 34
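One caveat worth sketching: pivot raises an error when the same ID and status pair occurs more than once, while the question asks for a sum of the quantities. A possible variant, assuming made-up sample rows and the hypothetical names orders and summary, uses pivot_table with aggfunc="sum". The bin edges follow the ranges in the question, so values of 0 or below would fall outside every bin.

```python
import numpy as np
import pandas as pd

# Made-up sample with repeated (ID, status) pairs.
orders = pd.DataFrame({
    "ID": ["STRSUB", "STRSUB", "BOTDWG", "STRSUB", "BOTDWG", "BOTDWG"],
    "Days Late": [60, 70, 60, 10, 20, 35],
    "quantity": [50, 6, 20, 40, 80, 7],
})

labels = ["White", "Yellow", "Amber", "Red"]
# Right-closed bins: (0, 14], (14, 35], (35, 56], (56, inf).
orders["status"] = pd.cut(
    orders["Days Late"], bins=[0, 14, 35, 56, np.inf], labels=labels
).astype(str)

# Sum quantities per ID and status; empty combinations become 0.
summary = orders.pivot_table(
    index="ID", columns="status", values="quantity", aggfunc="sum", fill_value=0
)
# Put the columns in Red..White order, adding any status that never occurred.
summary = summary.reindex(columns=labels[::-1], fill_value=0)
print(summary)
```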
You can create a column in your DataFrame based on your Days Late column by using the map or apply functions as follows. Let's first create some sample data.
import numpy
import pandas
df = pandas.DataFrame({ 'ID': 'foo,bar,foo,bar,foo,bar,foo,foo'.split(','),
'Days Late': numpy.random.randn(8)*20+30})
Days Late ID
0 30.746244 foo
1 16.234267 bar
2 14.771567 foo
3 33.211626 bar
4 3.497118 foo
5 52.482879 bar
6 11.695231 foo
7 47.350269 foo
Create a helper function to transform the data of the Days Late column and add a column called Code.
def days_late_xform(dl):
if dl > 56: return 'Red'
elif 35 < dl <= 56: return 'Amber'
elif 14 < dl <= 35: return 'Yellow'
elif 0 < dl <= 14: return 'White'
else: return 'None'
df["Code"] = df['Days Late'].map(days_late_xform)
Days Late ID Code
0 30.746244 foo Yellow
1 16.234267 bar Yellow
2 14.771567 foo Yellow
3 33.211626 bar Yellow
4 3.497118 foo White
5 52.482879 bar Amber
6 11.695231 foo White
7 47.350269 foo Amber
Lastly, you can use groupby to aggregate by the ID and Code columns, and get the counts of the groups as follows:
g = df.groupby(["ID","Code"]).size()
print(g)
ID Code
bar Amber 1
Yellow 2
foo Amber 1
White 2
Yellow 2
df2 = g.unstack()
print(df2)
Code Amber White Yellow
ID
bar 1 NaN 2
foo 1 2 2
I know this is coming a bit late, but I had the same problem as you and wanted to share the function np.digitize. It sounds like exactly what you want.
import numpy as np
a = np.random.randint(0, 100, 50)
grps = np.arange(0, 100, 10)
grps2 = [1, 20, 25, 40]
print(a)
[35 76 83 62 57 50 24 0 14 40 21 3 45 30 79 32 29 80 90 38 2 77 50 73 51
71 29 53 76 16 93 46 14 32 44 77 24 95 48 23 26 49 32 15 2 33 17 88 26 17]
print(np.digitize(a, grps))
[ 4 8 9 7 6 6 3 1 2 5 3 1 5 4 8 4 3 9 10 4 1 8 6 8 6
8 3 6 8 2 10 5 2 4 5 8 3 10 5 3 3 5 4 2 1 4 2 9 3 2]
print(np.digitize(a, grps2))
[3 4 4 4 4 4 2 0 1 4 2 1 4 3 4 3 3 4 4 3 1 4 4 4 4 4 3 4 4 1 4 4 1 3 4 4 2
4 4 2 3 4 3 1 1 3 1 4 3 1]
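Applied to the ranges in the question, np.digitize might be used roughly as follows. This is a sketch with made-up day counts; edges, labels, and status are illustrative names. Passing right=True makes each bin closed on the right, which matches the "<= 14", "<= 35", "<= 56" boundaries.

```python
import numpy as np

# Made-up values chosen to sit on and around each boundary.
days_late = np.array([1, 14, 15, 35, 36, 56, 57, 200])

edges = [14, 35, 56]
labels = np.array(["White", "Yellow", "Amber", "Red"])

# With right=True, index i means edges[i-1] < x <= edges[i].
# Note that values of 0 or below would also land in bin 0 ("White").
codes = np.digitize(days_late, edges, right=True)
status = labels[codes]
print(codes)
print(status)
```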