This should be rather straightforward to do through SAS, I hope.
I want to know the order in which each person tried different drug therapies. Some people may stay on a therapy for more than one month, but we only want to know what they tried first, second, and third. Some people go back and forth between therapies, and this needs to be captured (see Person1), for example:
Unit Item Date Started
Person1 Yoga 1/1/2013
Person1 Vitamins 2/1/2013
Person1 Prescription 3/1/2013
Person1 Vitamins 4/1/2013
Person2 Vitamins 5/1/2012
Person2 Prescription 9/1/2013
Person2 Prescription 10/1/2013
Person3 Yoga 1/1/2013
Person3 Prescription 2/1/2013
How can I summarize this in SAS into:
Unit Therapy1 Therapy2 Therapy3 Therapy4
Person1 Yoga Vitamins Prescription Vitamins
Person2 Vitamins Prescription
Person3 Yoga Prescription
Create a flag indicating which therapy this is:
data want;
set have;
by unit item notsorted; *notsorted means it looks at the data how it is and does not error for unordered by groups;
if first.unit then therapy=0; *start with 0 for each person;
if first.item then therapy+1; *each time a different item comes in, increment therapy;
run;
Then sort with nodupkey by unit and therapy so that each therapy number appears once per person. After that, this is a textbook application of proc transpose.
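The sort and transpose steps described above could look roughly like the following. This is a minimal sketch, assuming the dataset `want` from the data step above with the variables `unit`, `item` and `therapy`; the dataset names `want_dedup` and `wide` are made up for illustration.

```sas
/* keep one row per person and therapy number */
proc sort data=want out=want_dedup nodupkey;
  by unit therapy;
run;

/* one row per person, one column per therapy number */
proc transpose data=want_dedup out=wide (drop=_name_) prefix=Therapy;
  by unit;
  id therapy;   /* therapy=1 becomes column Therapy1, and so on */
  var item;
run;
```

People with fewer therapies than the longest sequence simply get blank values in the trailing columns.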
To say that a dataset is at the (person, year) level means that each row has a distinct (person, year) combination, like this:
person year wage
Mike 2000 10
Mike 2010 30
Jack 1990 20
How can I make Stata display exactly those (person, year) variable sets that uniquely define each row?
I want to make a log file to record
person year
only, but not display any individual information (displaying individuals' information in a log file is against the rules set by the data provider).
How could I do this?
What I thought about is using bysort in some way
bysort person year: gen num=_n
and if every num is 1, then it means (person, year) defines each row.
But if a dataset is extremely large, then checking whether every num is 1 is too tedious. Is there any smarter way?
The command isid checks whether the variables you supply do jointly specify observations uniquely. Here is an example you can try:
. webuse grunfeld, clear
. isid company
variable company does not uniquely identify the observations
r(459);
. isid company year
Note the principle: no news is good news.
Another way to check for problems is through duplicates. For example, try duplicates list person year. In your case, you don't want that in the log. But what you can do first is anonymise your persons through
egen id = group(person)
and then check for duplicates on id year.
See also this FAQ.
I am working with a SAS dataset that includes up to 30 medications prescribed to an individual patient. The medications are coded med1, med2 ... med30. Each medication is represented by a 5-digit character variable. Using the identifier, I can then code the name of the drug, and whether that particular medication is a topical antibiotic or a systemic antibiotic.
For each patient, I want to use all 30 medication codes to create one variable indicating whether the patient got a topical antibiotic only, a systemic antibiotic only, or both a topical and an oral antibiotic. So if any of the 30 medications is a systemic antibiotic, I want the patient coded as oral_antibiotic=1.
I currently have this code:
data want;
set have;
array meds[30] med1-med30;
if meds[i] in ('06925' '06920') then do;
penicillin=1;
oral_antibiotic=1;
end;
else if meds[i] in ('03197') then do;
neosporin=1;
topical_antibiotic=1;
end;
.... (many more do loops with many more medications)
run;
The problem is that this code creates one indicator variable instead of 30, overwriting previous information.
I think that I really need 30 indicator variables, indicating whether each of the 30 drugs is an oral or topical antibiotic, before I write code that says if any of the drugs are oral antibiotics, the patient received an oral antibiotic.
I am new to macros and would really appreciate help.
data current;
input (med1-med3) (:$5.); /* codes are character, so read them with a character informat */
cards;
06925 06920 03197
;
run;
And I want this:
data want;
input med1 :$5. topical_antibiotic1 oral_antibiotic1 med2 :$5. topical_antibiotic2 oral_antibiotic2 med3 :$5. topical_antibiotic3 oral_antibiotic3;
cards;
06925 0 1 06920 0 1 03197 1 0
;
run;
"I think that I really need 30 indicator variables, indicating whether each of the 30 drugs is an oral or topical antibiotic, before I write code that says if any of the drugs are oral antibiotics, the patient received an oral antibiotic."
That's not true. Your current approach is fine as long as you're not resetting them. You don't show us the full code, so it's hard to say, but I'm going to assume that's what is happening here.
Your loop should look like:
array med(30) med1-med30;
*set to 0 before the loop starts;
topical_antibiotic=0; oral_antibiotic=0;
do i=1 to dim(med);
if med(i) in (.....) /*list of topical codes*/ then topical_antibiotic=1;
else if med(i) in (.....) /*list of oral codes*/ then oral_antibiotic=1;
end;
This assumes that an antibiotic cannot be in both Topical/Oral groups. If it can, you need to remove the ELSE from the second IF statement.
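As a concrete illustration of the loop above, here is a small sketch using the three codes quoted in the question ('06925' and '06920' as oral, '03197' as topical). The three-medication layout, the `patient` variable, the sample rows and the `both` flag are assumptions made for the example, not part of the original data.

```sas
/* hypothetical sample data: up to three medication codes per patient */
data have;
  infile cards missover;
  input patient (med1-med3) (:$5.);
cards;
1 06925 03197
2 03197
3 06920
;
run;

data want;
  set have;
  array med(3) med1-med3;
  /* reset once per patient, before the loop */
  topical_antibiotic=0;
  oral_antibiotic=0;
  do i=1 to dim(med);
    if med(i) in ('03197') then topical_antibiotic=1;
    else if med(i) in ('06925' '06920') then oral_antibiotic=1;
  end;
  /* 1 only when the patient has at least one drug from each group */
  both=(topical_antibiotic and oral_antibiotic);
  drop i;
run;
```

Because the flags are only ever set to 1 inside the loop, a later medication cannot overwrite what an earlier one established.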
I agree that you probably only need one indicator variable for each drug group (medication of interest). It seems you just want to know, for each subject, "Do they have it?" This example flips the arguments of the IN operator. With more example data I could have made the example more complete.
data current;
infile cards missover;
array med[3] $5;
input med[*];
oral_antibiotic = '069' in: med; /*Assume oral all start with '069'*/
topical_antibiotic = '03197' in med;
cards;
06925 06920 03197
06925
;;;;
run;
In a SAS dataset, some observations appear multiple times. I want to add a column with the frequency of each observation and keep each observation only once. I have to do this for a dataset with many rows and around 8 variables.
name id address age
jack 2 chicago 50
peter 4 new york 45
jack 2 chicago 50
This would have to become:
name id address age frequency
jack 2 chicago 50 2
peter 4 new york 45 1
Is there anybody who knows how to do this in SAS (preferably without using SQL)?
Thank you a lot!
@kl78 is right, proc summary is the best non-SQL solution here. It runs in memory, which can cause problems with very large datasets, but you should be OK with 8 columns.
class _all_ will group by all the variables and the frequency is output by default, so there's no need to specify any measures. I've dropped the other automatic variable, _type_, as it isn't relevant here and renamed _freq_.
data have;
/* the & modifier lets address contain single blanks; the value ends at two or more blanks */
input name $ id address &$ age;
datalines;
jack 2 chicago  50
peter 4 new york  45
jack 2 chicago  50
;
run;
proc summary data=have nway;
class _all_;
output out=want (drop=_type_ rename=(_freq_=frequency));
run;
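If memory ever did become a concern, one alternative (a sketch, not part of the original answer) is to sort on all variables and count within BY groups in a data step. It assumes the `have` dataset above; `sorted` and `want_ds` are made-up dataset names.

```sas
proc sort data=have out=sorted;
  by name id address age;
run;

data want_ds;
  set sorted;
  by name id address age;
  /* first.age is set whenever any of the BY variables changes */
  if first.age then frequency=0;
  frequency+1;
  /* write one row per distinct combination */
  if last.age then output;
run;
```

This trades the in-memory class processing for a sort, which SAS can spill to disk.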
I am exploring an effect that I think will vary by GDP levels, from a data set that has, vertically, country and year (1960 to 2015), so each country label is on 55 rows. I ran
sort year
by year: egen yrank = xtile(rgdp), nquantiles(4)
which tags every year row with what quartile of GDP they were in that year. I want to run this:
xtreg fiveyearg taxratio if yrank == 1 & year==1960
which would regress my variable (tax ratio) against some averaged gdp data from countries that were in the bottom quartile of GDPs in 1960 alone. So even if later on they grew enough to change ranks, the later data would still be in the regression pool. Sadly, I cannot get this code, or any variation, to run.
My current approach is to try to generate a new variable that gives every row for country X a value of 1 if that country was in the bottom quartile in 1960, but I can't get that to work either. I have run out of ideas, so I thought I would ask!
Based on your latest comment, which describes the (un)expected behavior:
clear
set more off
*----- example data -----
input ///
country year rank
1 1960 2
1 1961 1
1 1962 2
2 1960 1
2 1961 1
2 1962 1
3 1960 3
3 1961 3
3 1962 3
end
list, sepby(country)
*----- what you want -----
// tag countries whose first observation for -rank- is 1
// (sorting by -year- within country makes the first observation the
// earliest year, which I assume is always 1960)
bysort country (year) : gen toreg = rank[1] == 1
list, sepby(country)
// run regression conditional on -toreg-
xtreg ... if toreg
Check help subscripting if in doubt.
I have a data set with information on students' educations at an institution.
I want to count how many different study programmes each student has been on. I have information at both master and bachelor level, and I want to count the number of different study programmes at each education level (master, bachelor).
For example person1 can have:
Bachelor:
- study1
- study2
- study3
- study3
Master:
- studyA
- studyA
Then I want a count of 3 study programmes at bachelor level (study3 should not count twice), and a count of 1 at master level.
Each study programme has its own row - so in the dataset person1 has 6 rows.
I want one row per person telling the number of study programmes per education level:
person number_bachelor number_master
person1 3 1
....etc...
I have tried with this:
proc sql;
create table new as
select distinct personid, name,
count(study) as number_of_bach
from old
group by personid, edu_level, study;
quit;
But it doesn't give me what I want.
This gives me two rows with person1 with the values of 1 and 2 in the variable "number_of_bach".
How can I edit this code to get the result I want?
Code:
data education;
input person $ level $ program $;
datalines;
person1 bachelor study1
person1 bachelor study2
person1 bachelor study3
person1 bachelor study3
person1 master study1
person2 bachelor study1
person2 master study2
person2 master study1
;
run;
proc sort data = education nodupkey;
by person level program;
run;
proc sql;
select person,
sum(case when level eq 'bachelor' then 1 else 0 end) as num_bachelors,
sum(case when level eq 'master' then 1 else 0 end) as num_masters
from education
group by person;
quit;
How it works: the SORT procedure with nodupkey eliminates duplicate records, if any. The SQL procedure then only has to count the remaining rows per person, once for bachelor level and once for master level.
Output:
person num_bachelors num_masters
person1 3 1
person2 1 2
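The same counts can probably be obtained without the preliminary sort by counting distinct programmes inside the query. This is a sketch assuming the `education` dataset above; `counts` is a made-up table name. The CASE expressions deliberately have no ELSE, so rows from the other level yield missing values, which COUNT ignores (SAS writes a note about the missing ELSE to the log).

```sas
proc sql;
  create table counts as
  select person,
         count(distinct case when level eq 'bachelor' then program end) as num_bachelors,
         count(distinct case when level eq 'master' then program end) as num_masters
  from education
  group by person;
quit;
```

Since duplicates are handled by DISTINCT, this version works whether or not the nodupkey sort has been run first.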
Is this what you want?
DATA old;
INPUT personid edu_level $ study $;
DATALINES;
1 bachelor study1
1 bachelor study2
1 bachelor study3
1 bachelor study3
1 master studyA
1 master studyA
1 master studyB
;
PROC SQL;
CREATE TABLE new AS
SELECT personid, edu_level, COUNT (DISTINCT study) AS num_bach
FROM OLD
GROUP BY personid, edu_level;
QUIT;
The column study is an aggregated column in your query (because COUNT is an aggregate function) and as such should not be included in the GROUP BY clause. Otherwise your query will also group by study, and the count will always be 1.
If you want to have each person on one line, add a PROC TRANSPOSE:
PROC TRANSPOSE DATA = new OUT = new2;
BY personid;
ID edu_level;
RUN;
(You could also write a more complex query using subqueries and joins instead of the transpose. As long as you don't have millions of rows, the overhead of the TRANSPOSE doesn't matter.)
For the sake of completeness here is a SQL-only solution to your question:
PROC SQL;
CREATE TABLE new AS
SELECT p.personid, b.num_bachelors, m.num_masters
/* Select unique personids */
FROM (SELECT DISTINCT personid
FROM old) AS p
/* Count number of bachelor-level courses */
LEFT JOIN (SELECT personid,
COUNT(DISTINCT study) AS num_bachelors
FROM old WHERE edu_level = 'bachelor'
GROUP BY personid) AS b on p.personid = b.personid
/* Count number of master-level courses */
LEFT JOIN (SELECT personid,
COUNT(DISTINCT study) AS num_masters
FROM old WHERE edu_level = 'master'
GROUP BY personid) AS m on p.personid = m.personid;
QUIT;