sas count number of different combinations - sas

I have a data set with information on students' educations at an institution.
I want to count how many different study programmes each student has been enrolled in. I have information at both master and bachelor level, and I want to count the number of different study programmes at each education level (master, bachelor).
For example person1 can have:
Bachelor:
- study1
- study2
- study3
- study3
Master:
- studyA
- studyA
Then I want a count of 3 study programmes at bachelor level (study3 should not count twice), and a count of 1 at master level.
Each study programme has its own row - so in the dataset person1 has 6 rows.
I want one row per person telling the number of study programmes per education level:
person number_bachelor number_master
person1 3 1
....etc...
I have tried with this:
proc sql;
create table new as
select distinct personid, name,
count(study) as number_of_bach
from old
group by personid, edu_level, study;
quit;
But it doesn't give me what I want.
This gives me two rows with person1 with the values of 1 and 2 in the variable "number_of_bach".
How can I edit this code to get the result I want?

Code:
data education;
input person $ level $ program $;
datalines;
person1 bachelor study1
person1 bachelor study2
person1 bachelor study3
person1 bachelor study3
person1 master study1
person2 bachelor study1
person2 master study2
person2 master study1
;
run;
proc sort data = education nodupkey;
by person level program;
run;
proc sql;
select person,
sum(case when level eq 'bachelor' then 1 else 0 end) as num_bachelors,
sum(case when level eq 'master' then 1 else 0 end) as num_masters
from education
group by person;
quit;
Working: PROC SORT with NODUPKEY first removes duplicate records, so each person/level/program combination appears only once. PROC SQL then counts, per person, the programs at bachelor level and the programs at master level.
Output:
person num_bachelors num_masters
person1 3 1
person2 1 2

Is this what you want?
DATA old;
INPUT personid edu_level $ study $;
DATALINES;
1 bachelor study1
1 bachelor study2
1 bachelor study3
1 bachelor study3
1 master studyA
1 master studyA
1 master studyB
;
PROC SQL;
CREATE TABLE new AS
SELECT personid, edu_level, COUNT (DISTINCT study) AS num_bach
FROM OLD
GROUP BY personid, edu_level;
QUIT;
The column study is a so-called aggregate column in your query (because COUNT is an aggregate function) and as such should not be included in the GROUP BY clause (otherwise your query will also group by study, and the distinct count will always be 1).
If you want to have each person on one line, then add a PROC TRANSPOSE:
PROC TRANSPOSE DATA = new OUT = new2;
BY personid;
ID edu_level;
RUN;
(You could also write a more complex query using subqueries and joins instead of the transpose. Unless you have millions of rows, the overhead of the TRANSPOSE doesn't matter.)
For the sake of completeness here is a SQL-only solution to your question:
PROC SQL;
CREATE TABLE new AS
SELECT p.personid, b.num_bachelors, m.num_masters
/* Select unique personids */
FROM (SELECT DISTINCT personid
FROM old) AS p
/* Count number of bachelor-level courses */
LEFT JOIN (SELECT personid,
COUNT(DISTINCT study) AS num_bachelors
FROM old WHERE edu_level = 'bachelor'
GROUP BY personid) AS b on p.personid = b.personid
/* Count number of master-level courses */
LEFT JOIN (SELECT personid,
COUNT(DISTINCT study) AS num_masters
FROM old WHERE edu_level = 'master'
GROUP BY personid) AS m on p.personid = m.personid;
QUIT;
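As a possible shorter alternative, the per-level distinct counts can be sketched in a single query by putting a CASE expression inside COUNT(DISTINCT ...). This is only a sketch, assuming the old table from the example above; the output table name new_wide is made up for illustration. Rows belonging to the other level produce a blank (missing) value, which COUNT does not count.

```sas
PROC SQL;
  CREATE TABLE new_wide AS
  SELECT personid,
         /* rows from the other level become blank, and COUNT skips missing values */
         COUNT(DISTINCT CASE WHEN edu_level = 'bachelor' THEN study ELSE '' END) AS num_bachelors,
         COUNT(DISTINCT CASE WHEN edu_level = 'master' THEN study ELSE '' END) AS num_masters
  FROM old
  GROUP BY personid;
QUIT;
```

With the sample rows above, this should give one row for personid 1 with 3 bachelor-level and 2 master-level programmes.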

Related

Splitting a Column into two based on conditions in Proc Sql, SAS

I want to split the Airlines column into two groups and then add up each group's amount for all clients:
Group 1 = Air India & Jet Airways
Group 2 = Others
Loc Client_Name Airlines Amount
BBI A_1ABC2 Air India 41302
BBI A 1ABC2 Air India 41302
MAA Th 1ABC2 Spice Jet Airlines 288713
HYD Ma 1ABC2 Jet Airways 365667
BOM Vi 1ABC2 Air India 552506
Something like this: -
Rank Client_name Group1 Group2 Total
1 Ca 1ABC2 5266269 7040320 1230658
2 Ve 1ABC2 2815593 2675886 5491479
3 Ma 1ABC2 1286686 437843 1724529
4 Th 1ABC2 723268 701712 1424980
5 Ec 1ABC2 113517 627734 741251
6 A 1ABC2 152804 439381 592185
I grouped it first, but I am confused about how to split:
Data assign6.Airlines_grouping1;
Set assign6.Airlines_grouping;
if Scan(Airlines,1) IN ('Air','Jet') then Group = "Group1";
else
if Scan(Airlines,1) Not in('Air','Jet') then Group = "Group2";
Run;
You are categorizing a row based on the first word of the airline.
Proc TRANSPOSE with an ID statement is one common way to reshape data so that a categorical value becomes a column. A second way is to bypass the categorization and use a data step to produce the new shape of data directly.
Here is an example of the second way -- create new columns group1 and group2 and set value based on airline criteria.
data airlines_group_amounts;
set airlines;
if scan (airlines,1) in ('Air', 'Jet') then
group1 = amount;
else
group2 = amount;
run;
Then summarize over client:
proc sql;
create table want as
select
client_name
, sum(group1) as group1
, sum(group2) as group2
, sum(amount) as total
from airlines_group_amounts
group by client_name
;
quit;
You can avoid the two steps and do all of the processing in a single query, or you can do the summarization with Proc MEANS.
Here is a single-query way:
proc sql;
create table want as
select
client_name
, sum(case when scan (airlines,1) in ('Air', 'Jet') then amount else 0 end) as group1
, sum(case when scan (airlines,1) in ('Air', 'Jet') then 0 else amount end) as group2
, sum(amount) as total
from airlines
group by client_name
;
quit;
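The Proc MEANS route mentioned above could look roughly like this. It is a sketch that assumes the airlines_group_amounts dataset from the data step earlier in this answer; the output name want_means is made up for illustration.

```sas
proc means data=airlines_group_amounts noprint nway;
  class client_name;
  var group1 group2 amount;
  /* SUM= names the output variables in the same order as the VAR list */
  output out=want_means(drop=_type_ _freq_) sum=group1 group2 total;
run;
```

A client with no rows in one of the groups gets a missing sum for that group here, not 0, because every value being summed is missing.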

SAS: ID variable based on several conditions

I have following dataset:
ID Status
1 cake
1 cake
1 flower
2 flower
2 flower
3 cake
3 flower
4 cake
4 cake
4 cake
Basically, I am only interested in the observations that, grouped by the ID, include at least one flower. Also I want an indication of whether the observation grouped by ID only has flower or if it was cake too. E.g. I would ideally like something like:
ID Status Indicator
1 cake 1
1 cake 1
1 flower 1
2 flower 2
2 flower 2
3 cake 1
3 flower 1
4 cake 0
4 cake 0
4 cake 0
I have tried to subset the dataset in multiple ways and merge together, conditional on the ID, but it does not seem to be working.
This SAS data step based on your input (which I called test here) will return that indicator value by ID group.
proc sort data=test;
by ID descending status;
run;
data result(drop=status);
set test;
by ID;
retain indicator;
if first.ID then indicator=0;
if status='flower' and indicator=0 then indicator=2;
if status='cake' and indicator=2 then indicator=1;
if last.ID then output;
run;
You could join that result with the source data to get the result as you provided it in your post.
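A minimal sketch of that join, assuming the test and result datasets from the code above (the output name test_with_indicator is made up for illustration):

```sas
/* test is already sorted by ID (from the PROC SORT above),
   and result holds one row per ID with the indicator */
data test_with_indicator;
  merge test result;
  by ID;
run;
```

Because result has exactly one row per ID, the match merge copies its indicator onto every detail row of that ID.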
NOTE: I don't have enough reputation to comment on the answer provided by Gordon Linoff, but I want to point out that there the indicator will not take three values (0='no flower', 1='cake+flower', 2='only flower') but will instead be a count of the number of 'flower' entries per ID, which I don't think is quite what the poster is asking for.
Rewritten as follows, it gives three indicator values: 0='no flower', 1='only flower', 2='cake+flower' (note that 1 and 2 are swapped relative to the coding in the question):
proc sql;
select t.*,
(count(distinct status))*(sum(case when status = 'flower' then 1 else 0 end)>0) as indicator
from test t
group by id;
quit;
proc sql comes to mind:
proc sql;
select t.*, tt.indicator
from t join
(select id, sum(case when status = 'flower' then 1 else 0 end) as indicator
from t
group by id
) tt
on tt.id = t.id;
quit;
proc sql also has a "remerge" extension to SQL, which merges a group-level aggregate back onto every detail row. That allows you to do:
proc sql;
select t.*,
sum(case when status = 'flower' then 1 else 0 end) as indicator
from t
group by id;
quit;
If your data is already sorted by ID then you could use a double DOW loop. The first loop will check for the presence of the values. Then you can use another loop to write back all of the detail rows for that group.
data want ;
do until (last.id);
set have;
by id;
if status='flower' then _flower=1;
else if status='cake' then _cake=1;
end;
if _flower and _cake then indicator=1;
else if _flower then indicator=2;
else indicator=0;
do until (last.id);
set have;
by id;
output;
end;
run;
This should be fast assuming the data is already sorted.

SAS software: How to delete observations with more than five zeros for the dependent variable

I have consumer panel data with weekly recorded spending at a retail store. The unique identifier is household ID. I would like to delete observations for households with more than five zeros in spending, that is, households that did not make any purchase in more than five weeks. Once such a household is identified, I will delete all observations associated with its household ID. Does anyone know how I can implement this procedure in SAS? Thanks.
I think proc SQL would be good here.
This could be done in a single step with a more complex subquery but it is probably better to break it down into 2 steps.
Count how many zeroes each household ID has.
Filter to only include household IDs that have 5 or fewer zeroes.
proc sql;
create table zero_cnt as
select distinct household_id,
sum(case when spending = 0 then 1 else 0 end) as num_zeroes
from original_data
group by household_id;
create table wanted as
select *
from original_data
where household_id in (select distinct household_id from zero_cnt where num_zeroes <= 5);
quit;
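For reference, the single-step variant mentioned above might be sketched like this, with the counting moved into a HAVING clause inside the subquery. It assumes the same original_data table and column names as the two-step version.

```sas
proc sql;
  create table wanted as
  select *
  from original_data
  where household_id in
    /* keep only households with at most 5 zero-spending weeks */
    (select household_id
     from original_data
     group by household_id
     having sum(case when spending = 0 then 1 else 0 end) <= 5);
quit;
```

The two-step version is easier to debug, since zero_cnt can be inspected on its own.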
Edit:
If the zeroes have to be consecutive then the method of building the list of IDs to exclude is different.
* Sort by ID and date;
proc sort data = original_data out = sorted_data;
by household_id date;
run;
Use the LAG function to check the previous spending amounts.
More info on LAG here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212547.htm
data exclude;
set sorted_data;
by household_id;
array prev{*} _L1-_L4;
* LAG has to run on every observation, so it is called unconditionally;
_L1 = lag(spending);
_L2 = lag2(spending);
_L3 = lag3(spending);
_L4 = lag4(spending);
* Create running count for the number of observations for each ID;
if first.household_id then spend_cnt = 0;
spend_cnt + 1;
* Check if current ID has at least 5 observations to check. If so, add up current spending and previous 4 and output if the total is zero (SUM ignores missing values);
if spend_cnt >= 5 then do;
if sum(spending, of prev{*}) = 0 then output;
end;
keep household_id;
run;
Then just use a subquery or match merge to remove the IDs in the 'exclude' dataset.
proc sql;
create table wanted as
select *
from original_data
where household_id not in (select distinct household_id from exclude);
quit;
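The match-merge route could be sketched as follows. It assumes sorted_data from the PROC SORT above and that the ID list sits in the dataset written by the preceding data step (named exclude there); exclude_ids, in_data and in_excl are names made up for illustration.

```sas
/* one row per household to exclude, sorted for the merge */
proc sort data = exclude out = exclude_ids nodupkey;
  by household_id;
run;

data wanted;
  merge sorted_data(in = in_data) exclude_ids(in = in_excl);
  by household_id;
  /* keep rows whose household does not appear in the exclusion list */
  if in_data and not in_excl;
run;
```

Both inputs must be sorted by household_id for the BY statement to work.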

how to count distinct value over two dimension using SAS

I have a dataset that looks like the following. It contains four variables: country name (Country), company ID (Company), Year and Date.
Country Company Year Date
------- ------- ---- ----
A 1 2000 2000/01/02
A 1 2001 2001/01/03
A 1 2001 2001/07/02
A 1 2000 2001/08/03
B 2 2000 2001/08/03
C 3 2000 2001/08/03
I know how to count number of distinct company in each country. I did it using the following code.
proc sql;
create table lib.count as
select country, count(distinct company) as count
from lib.data
group by country;
quit;
My problem is how to count the number of distinct company-years in each country. Essentially, I want to count each distinct combination of company and year. If there are two observations for the same company in the same year, I want to count them as one distinct value. If the same company has two observations in different years, I want to count them as two distinct values. I want the output to look like the following (one number per country):
Country No. firm_year
A 2
B 1
C 1
Can anyone teach me how to do this, please?
A quick method is to concatenate all the variables you want to compare, creating a new variable. Something like:
data data_mod;
set data;
length company_year $ 20;
* A delimiter keeps, say, company 1 in year 12000 apart from company 11 in year 2000;
company_year = catx('_', company, year);
run;
Then you can run your proc sql with count(distinct company_year).
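A sketch of that query, assuming the data_mod dataset and company_year variable created above (the output name firm_year_counts is made up for illustration):

```sas
proc sql;
  create table firm_year_counts as
  select country, count(distinct company_year) as firm_year
  from data_mod
  group by country;
quit;
```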
You need nested queries, as @DaBigNikoladze hinted at:
An "internal" query which will generate a list of distinct combinations of Country + Company + Year;
An "external" query which will count how many rows per country are present in the internal query.
Generate dataset
data have;
informat Country $1.
Company 1.
Year 4.
Date YYMMDD10.;
format Date YYMMDDs10.;
input country company year date;
datalines;
A 1 2000 2000/01/02
A 1 2001 2001/01/03
A 1 2001 2001/07/02
A 1 2000 2001/08/03
B 2 2000 2001/08/03
C 3 2000 2001/08/03
;
Execute query
PROC SQL;
CREATE TABLE want AS
SELECT country, Count(company) AS Firm_year
FROM (SELECT DISTINCT country, company, year FROM have)
GROUP BY country;
QUIT;
Results
Country Firm_year
A 2
B 1
C 1
proc sort data=lib.data out=temp nodupkey;
by country company year;
run;
data firm_year(keep=country cnt_fyr);
set temp;
by country company year;
retain cnt_fyr;
if first.country then cnt_fyr=1;
else cnt_fyr+1;
if last.country;
run;
The answer for your first question is:
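data lib.count(keep=country companyCount);
set lib.data;
by country;
* Declare a length, otherwise RETAIN with '' creates a 1-character variable;
length companyList $ 200;
retain companyList '';
retain companyCount 0;
if first.country then do;
companyList = cats(company);
companyCount = 1;
end;
else do;
* FINDW matches whole list entries, so company 1 is not found inside 11;
if not findw(companyList, cats(company), ', ') then do;
companyList = catx(',', companyList, company);
companyCount + 1;
end;
end;
if last.country then output;
run;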
The result is:
Country companyCount
------- ------------
A 1
B 1
C 1
Similarly, you can count the number of distinct company-years in each country.
I guess I'm a bit confused as to what you are expecting the result to look like. Here is an SQL method that gets the same result as posted in the other answers so far.
data temp;
attrib Country length = $10;
attrib Company length = $10;
attrib Year length = $10;
attrib Date length = $10;
infile datalines delimiter = '#';
input Country $ Company $ Year $ Date $;
datalines;
A#1#x#x1#
A#1#x#x2#
B#2#x#x1#
C#3#x#x3#
;
run;
proc sql;
create table temp2 as
select country, count(distinct Date) as count
from temp
group by country, company;
quit;

SAS - grouping and ordering of events by date

This should be rather straightforward to do through SAS, I hope.
I want to know the order in which each person tried different drug therapies. Some people may try a therapy for more than 1 month, but really we just want to know what they tried first, what they tried second, and what they tried third. Some people will go back and forth between therapies, and this needs to be captured (see Person1), for example:
Unit Item Date Started
Person1 Yoga 1/1/2013
Person1 Vitamins 2/1/2013
Person1 Prescription 3/1/2013
Person1 Vitamins 4/1/2013
Person2 Vitamins 5/1/2012
Person2 Prescription 9/1/2013
Person2 Prescription 10/1/2013
Person3 Yoga 1/1/2013
Person3 Prescription 2/1/2013
How can I summarize this in SAS into:
Unit Therapy1 Therapy2 Therapy3 Therapy4
Person1 Yoga Vitamins Prescription Vitamins
Person2 Vitamins Prescription
Person3 Yoga Prescription
Create a flag indicating which therapy this is:
data want;
set have;
by unit item notsorted; *notsorted lets SAS process the BY groups in the order they appear, without an error for unsorted data;
if first.unit then therapy=0; *start with 0 for each person;
if first.item then therapy+1; *each time a different item comes in, increment therapy;
run;
Sort with nodupkey by unit therapy, then this is a textbook application of proc transpose.
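Those last two steps might be sketched like this, assuming the want dataset holds unit, item and a numeric therapy counter as described above. The names want_first and therapies are made up for illustration.

```sas
/* keep the first row of each therapy spell per person */
proc sort data=want out=want_first nodupkey;
  by unit therapy;
run;

/* one row per person, one column per therapy number */
proc transpose data=want_first out=therapies(drop=_name_) prefix=Therapy;
  by unit;
  id therapy;
  var item;
run;
```

With PREFIX=Therapy and the counter as the ID variable, the new columns should come out as Therapy1, Therapy2 and so on.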