Creating a new SAS variable group using an existing variable group

I have a SAS dataset with the following 13 fields:
base_price tax_month1-12
base_price is the price of a product before taxes, and each tax_month variable is the tax percentage to be collected for that month.
I want to create a new variable group tax_paid1-12, where each variable is the product of base_price and the corresponding tax_month variable.
Is there an efficient way to do this in SAS without multiplying the fields one at a time? The number of months could vary in the future, so I do not want to hard-code the number of fields in the variable group.

Unfortunately, you need an additional step to compute the number of months.
You can then use arrays to calculate tax_paid for each month.
data source;
infile datalines;
input base_price tax_month1 tax_month2 tax_month3;
datalines;
1 2 . 4
5 6 7 8
;
run;
data _null_;
set source;
array tax_month(*) tax_month:;
call symputx('n', dim(tax_month));
stop;
run;
data result;
set source;
array tax_month(&n) tax_month1-tax_month&n;
array tax_paid(&n) tax_paid1-tax_paid&n;
do i = 1 to dim(tax_month);
tax_paid(i) = base_price * tax_month(i);
end;
drop i;
run;
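As a possible alternative to the DATA _NULL_ step, the count could also be read from the dictionary tables. This is only a sketch: it assumes the dataset is WORK.SOURCE as in the example above and that no other variable names start with tax_month.

```sas
/* Count the tax_month variables of WORK.SOURCE and store the count in macro variable n. */
/* Note: in LIKE the underscore matches any single character, which is harmless here.    */
proc sql noprint;
  select count(*) into :n trimmed
  from dictionary.columns
  where libname = 'WORK'
    and memname = 'SOURCE'
    and upcase(name) like 'TAX_MONTH%';
quit;

%put Number of tax_month variables: &n;
```

The macro variable n can then be used in the array definitions exactly as in the final step above.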

How can I compare many datasets and update several columns based on the max value of a single column in SAS?

I have test scores from many students in 8 different years. I want to retain only the max total score of each student, but then also retain all the student-year related information to that test score (that is, all the columns from the same year in which the student got the highest total score).
An example of the datasets I have:
%macro score;
%do year = 2010 %to 2018;
data student_&year.;
do id=1 to 10;
english=25*rand('uniform');
math=25*rand('uniform');
sciences=25*rand('uniform');
history=25*rand('uniform');
total_score=sum(english, math, sciences, history);
output;
end;
%end;
run;
%mend;
%score;
In my expected output, I would like to retain the max of total_score for each student, and also have the other columns related to that total score. If possible, I would also like to have the information about the year in which the student got the max of total_score. An example of the expected output would be:
DATA want;
INPUT id total_score english math sciences history year;
CARDS;
1 75.4 15.4 20 20 20 2017
2 63.8 20 13.8 10 20 2016
3 48 10 10 18 10 2018
4 52 12 10 10 20 2016
5 69.5 20 19.5 20 10 2013
6 85 20.5 20.5 21 23 2011
7 41 5 12 14 10 2010
8 55.3 15 20.3 10 10 2012
9 51.5 10 20 10 11.5 2013
10 48.9 12.9 16 10 10 2015
;
RUN;
I have been trying to work with the SAS UPDATE statement, but it just keeps the most recent value for each student, whereas I want the max total score. Also, UPDATE handles only two tables at a time, and I would like to compare all tables at once. So this strategy does not work:
data want;
update student_2010 student_2011;
by id;
run;
Thanks to anyone who can provide insights.
It is easier to obtain what you want if you have only one longitudinal dataset with all the original information of your students. It also makes more sense, since you are comparing students across different years.
To build a longitudinal dataset, you first need to add a variable recording the year to each of your original datasets. For example:
%macro score;
%do year = 2010 %to 2018;
data student_&year.;
do id=1 to 10;
english=25*rand('uniform');
math=25*rand('uniform');
sciences=25*rand('uniform');
history=25*rand('uniform');
total_score=sum(english, math, sciences, history);
year=&year.;
output;
end;
run;
%end;
%mend;
%score;
After including the year, you can get a longitudinal dataset with:
data student_allyears;
set student_201:;
run;
Finally, you can get what you want with a proc sql, in which you select the max of "total_score" grouped by "id":
proc sql;
create table want as
select distinct *
from student_allyears
group by id
having total_score=max(total_score);
quit;
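A sort-based sketch of the same idea, assuming the stacked student_allyears dataset built above (the name student_sorted is only illustrative). Ties on total_score are resolved arbitrarily here, whereas the SQL version keeps every tied row.

```sas
/* Put each student's highest total_score first within the id group. */
proc sort data=student_allyears out=student_sorted;
  by id descending total_score;
run;

/* Keep only the first (highest-scoring) record per student. */
data want;
  set student_sorted;
  by id;
  if first.id;
run;
```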
Create a view that stacks the individual data sets and perform your processing on that.
Example (SQL select, group by, and having)
data scores / view=scores;
length year $4;
set work.student_2010-work.student_2018 indsname=dsname;
year = scan(dsname,-1,'_');
run;
proc sql;
create table want as
select * from scores
group by id
having total_score=max(total_score)
;
quit;
Example (DOW loop processing)
Stack the data so that the view can be processed BY ID. The first DOW loop finds which record in the group has the max total score, and the second outputs that record.
data scores_by_id / view=scores_by_id;
set work.student_2010-work.student_2018 indsname=dsname;
by id;
year = scan(dsname,-1,'_');
run;
data want;
* compute which record in group has max measure;
do _n_ = 1 by 1 until (last.id);
set scores_by_id;
by id;
if total_score > _max then do;
_max = total_score;
_max_at_n = _n_;
end;
end;
* output entire record having the max measure;
do _n_ = 1 to _n_;
set scores_by_id;
if _n_ = _max_at_n then OUTPUT;
end;
drop _max:;
run;

SAS - Survey Select - Selecting Different Sample Size per Stratum

I have a list of financial advisors and I need to pull 4 samples per advisor. The catch is that those 4 samples must include, say, 2 mortgages, 1 loan, and 1 credit card.
Is there a way in PROC SURVEYSELECT to set a specific number of samples to pull per stratum? I know you can stratify on one category and set an equal number for every stratum. I was hoping I could use a mapping of employee names plus the number of samples left to pull for each category, and have SURVEYSELECT use that to pull the samples dynamically.
I'm using the following as an example, but it only stratifies on employee and gives me 4 per employee. I would need to further stratify on product type and set a specific sample size per product.
proc surveyselect data=work.Emp_Table_Final
method=srs n=4 out=work.testsample SELECTALL;
strata Employee_No;
run;
Thanks. I know it might sound complicated, but if I know it's possible then I can google the rest.
Yes, you can supply a dataset as the value of the n= option. That dataset must:
Contain the strata variables as well as a variable SAMPSIZE or _NSIZE_ holding the number to select
Define the strata variables with the same type and length as in the input dataset
Be sorted by the strata variables
Have an entry for every stratum (every combination of strata variable values) in the input data
See the documentation for more details.
data sample_counts;
length sex $1;
input sex $ _NSIZE_;
datalines;
F 5
M 3
;;;;
run;
proc sort data=sashelp.class out=class;
by sex;
run;
proc surveyselect n=sample_counts method=srs out=samples data=class;
strata sex;
run;
For two strata variables it works the same way; you just need both variables in sample_counts. Of course that makes the dataset more complicated, so you may want to produce it in an automated fashion.
proc sort data=sashelp.class out=class;
by sex age;
run;
data sample_counts;
length sex $1;
input sex $ age _NSIZE_;
datalines;
F 11 1
F 12 1
F 13 1
F 14 1
F 15 1
M 11 1
M 12 1
M 13 1
M 14 1
M 15 1
M 16 0
;;;;
run;
/* or do it in an automated way*/
data sample_counts;
set class;
by sex age; *your strata;
if first.age then do; *do this once per stratum level;
if age le 15 then _NSIZE_ = 1; *whatever your logic is for defining _NSIZE_;
else _NSIZE_=0;
output;
end;
run;
proc surveyselect n=sample_counts method=srs out=samples data=class;
strata sex age;
run;
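Applied to the advisor question, the same pattern might look like the sketch below. The variable Product_Type and its values 'Mortgage', 'Loan', and 'Credit Card' are assumptions made for illustration; only Employee_No and Emp_Table_Final come from the question.

```sas
/* Keep only the product types to be sampled and sort by both strata variables. */
proc sort data=work.Emp_Table_Final out=work.emp_sorted;
  where Product_Type in ('Mortgage', 'Loan', 'Credit Card');
  by Employee_No Product_Type;
run;

/* One row per employee and product type, holding the requested sample size. */
data work.sample_counts;
  set work.emp_sorted(keep=Employee_No Product_Type);
  by Employee_No Product_Type;
  if first.Product_Type;
  if Product_Type = 'Mortgage' then _NSIZE_ = 2;
  else _NSIZE_ = 1;
run;

/* SELECTALL takes every unit when a stratum has fewer units than requested. */
proc surveyselect data=work.emp_sorted n=work.sample_counts
                  method=srs out=work.testsample selectall;
  strata Employee_No Product_Type;
run;
```

An advisor with no records for a given product type simply contributes no stratum for it, so the total per advisor can fall below 4.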

SAS software: How to delete observations with more than five zeros for the dependent variable

I have consumer panel data with weekly recorded spending at a retail store. The unique identifier is household ID. I would like to delete observations if more than five zeros occur in spending, that is, if the household did not make any purchase for five weeks. Once such a household is identified, I will delete all observations associated with its household ID. Does anyone know how I can implement this procedure in SAS? Thanks.
I think PROC SQL would be good here.
This could be done in a single step with a more complex subquery, but it is probably better to break it down into 2 steps:
1. Count how many zeroes each household ID has.
2. Filter to only include household IDs that have 5 or fewer zeroes.
proc sql;
create table zero_cnt as
select distinct household_id,
sum(case when spending = 0 then 1 else 0 end) as num_zeroes
from original_data
group by household_id;
create table wanted as
select *
from original_data
where household_id in (select distinct household_id from zero_cnt where num_zeroes <= 5);
quit;
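For comparison, the single-step variant mentioned above could be sketched as follows. It assumes the same original_data table and relies on PROC SQL evaluating the comparison spending = 0 to 1 or 0, so summing it counts the zeroes.

```sas
proc sql;
  create table wanted as
  select *
  from original_data
  where household_id in
    (select household_id
     from original_data
     group by household_id
     having sum(spending = 0) <= 5);
quit;
```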
Edit:
If the zeroes have to be consecutive then the method of building the list of IDs to exclude is different.
* Sort by ID and date;
proc sort data = original_data out = sorted_data;
by household_id date;
run;
Use the LAG function to check the previous spending amounts.
More info on LAG here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212547.htm
data excluded;
set sorted_data;
by household_id;
array prev{*} _L1-_L4;
* LAG must run on every observation so that its queue stays in step with the data;
_L1 = lag(spending);
_L2 = lag2(spending);
_L3 = lag3(spending);
_L4 = lag4(spending);
* Create running count for the number of observations for each ID;
if first.household_id then spend_cnt = 0;
spend_cnt + 1;
* Check if current ID has at least 5 observations to check. If so, add up current spending and previous 4 and output if they are all zero/missing;
if spend_cnt >= 5 then do;
if sum(spending, of prev{*}) = 0 then output;
end;
keep household_id;
run;
Then just use a subquery or match merge to remove the IDs in the 'excluded' dataset.
proc sql;
create table wanted as
select *
from original_data
where household_id not in (select distinct household_id from excluded);
quit;

How to count distinct values over two dimensions using SAS

I have a dataset that looks like the following. It contains four variables: country name (Country), company ID (Company), Year, and Date.
Country Company Year Date
------- ------- ---- ----
A 1 2000 2000/01/02
A 1 2001 2001/01/03
A 1 2001 2001/07/02
A 1 2000 2001/08/03
B 2 2000 2001/08/03
C 3 2000 2001/08/03
I know how to count the number of distinct companies in each country. I did it using the following code.
proc sql;
create table lib.count as
select country, count(distinct company) as count
from lib.data
group by country;
quit;
My problem is how to count the number of distinct company-years in each country. Essentially, I want to know how many different companies, or the same company in different years, there are. If there are two observations for the same company in the same year, I want to count them as one distinct value. If the same company has two observations in different years, I want to count them as two distinct values. I want the output to look like the following (one number per country):
Country No. firm_year
A 2
B 1
C 1
Can anyone teach me how to do it, please?
A quick method is to concatenate all the variables you want to compare, creating a new variable. Something like:
data data_mod;
set data;
length company_year $ 20;
* the delimiter keeps distinct company and year pairs from colliding after concatenation;
company_year = catx('_', company, year);
run;
Then you can run your proc sql with count(distinct company_year).
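The concatenation can also be done inside the query itself, which avoids the extra DATA step. A minimal sketch, assuming the lib.data table from the question; the output name lib.count_fy and the '|' delimiter are arbitrary choices:

```sas
proc sql;
  create table lib.count_fy as
  select country,
         count(distinct catx('|', company, year)) as firm_year
  from lib.data
  group by country;
quit;
```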
You need nested queries, as @DaBigNikoladze hinted at:
An "internal" query that generates the list of distinct combinations of Country + Company + Year.
An "external" query that counts how many rows per country are present in the internal query.
Generate dataset
data have;
informat Country $1.
Company 1.
Year 4.
Date YYMMDD10.;
format Date YYMMDDs10.;
input country company year date;
datalines;
A 1 2000 2000/01/02
A 1 2001 2001/01/03
A 1 2001 2001/07/02
A 1 2000 2001/08/03
B 2 2000 2001/08/03
C 3 2000 2001/08/03
;
Execute query
PROC SQL;
CREATE TABLE want AS
SELECT country, Count(company) AS Firm_year
FROM (SELECT DISTINCT country, company, year FROM have)
GROUP BY country;
QUIT;
Results
Country Firm_year
A 2
B 1
C 1
proc sort data=lib.data out=temp nodupkey;
by country company year;
run;
data firm_year(keep=country cnt_fyr);
set temp;
by country company year;
retain cnt_fyr;
if first.country then cnt_fyr=1;
else cnt_fyr+1;
if last.country;
run;
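The answer to your first question is:
data lib.count(keep=country companyCount);
set lib.data;
by country;
* give the list room to grow, since RETAIN with an empty string alone would fix its length at 1;
length companyList $200;
retain companyList '';
retain companyCount 0;
if first.country then do;
companyList = cats(company);
companyCount = 1;
end;
else do;
* FINDW matches whole comma-separated entries, so company 1 is not found inside 11;
if not findw(companyList, cats(company), ',', 't') then do;
companyList = catx(',', companyList, company);
companyCount + 1;
end;
end;
if last.country then output;
run;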
The result is:
Country companyCount
------- ------------
A 2
B 1
C 1
Similarly, you can count the number of distinct company-years in each country.
I guess I'm a bit confused as to what you are expecting the result to look like. Here is an SQL method that gets the same result as the other answer posted so far.
data temp;
attrib Country length = $10;
attrib Company length = $10;
attrib Year length = $10;
attrib Date length = $10;
infile datalines delimiter = '#';
input Country $ Company $ Year $ Date $;
datalines;
A#1#x#x1#
A#1#x#x2#
B#2#x#x1#
C#3#x#x3#
;
run;
proc sql;
create table temp2 as
select country, count(distinct Date) as count
from temp
group by country, company;
quit;

Computing moving average in SAS

I'm trying to use SAS to compute a moving average over x periods that uses forecasted values in the calculation. For example, suppose I have a data set with ten observations for a variable and want a 3-month moving average. The first forecast value should be the average of the last 3 observations, and the second forecast value should be the average of the last two observations and the first forecast value.
If you have, for example, data like this:
data input;
infile datalines;
length product $10 period value 8;
informat period yymmdd10.;
format period yymmdd10.;
input product $ period value;
datalines;
car 2016-01-01 10
car 2015-12-01 20
car 2015-11-01 30
car 2015-10-01 40
car 2015-09-01 30
car 2015-08-01 15
;
run;
You can left join the input table with itself on a condition:
input t1 left join input t2
on t1.product = t2.product
and t2.period between intnx('month',t1.period,-2,'b') and t1.period
group by t1.product, t1.period, t1.value
With this you have t1.value as the current value and avg(t2.value) as the 3-month average. To compute the 2-month average, use the ifn() function to set every value older than the previous period to missing:
avg(ifn( t2.period >= intnx('month',t1.period,-1,'b'),t2.value,. ))
The full code could look like this:
proc sql;
create table want as
select t1.product, t1.period, t1.value as currentValue,
ifn(count(t2.period)>1,avg(ifn( t2.period >= intnx('month',t1.period,-1,'b'),t2.value,. )),.) as twoMonthsAVG,
ifn(count(t2.period)>2,avg(t2.value),.) as threeMonthsAVG
from input t1 left join input t2
on t1.product = t2.product
and t2.period between intnx('month',t1.period,-2,'b') and t1.period
group by t1.product, t1.period, t1.value
;
quit;
I've also added a count(t2.period) condition to return missing values when there are not enough records to compute the measure. My result set looks like this: