SAS Dehoc Query: Breaking Range Data

I need help developing a de-hoc query for HOC (range) data. Below is an example of a Shares Outstanding HOC:
ID StartDT EndDT SharesOutstanding
ABC 01-Jan-2010 03-Feb-2014 100
ABC 04-Feb-2014 03-Sep-2014 160
XYZ 01-Jan-2011 03-Mar-2012 52
XYZ 04-Mar-2012 09-Aug-2013 108
XYZ 10-Aug-2013 03-Sep-2014 120
Now I want to de-hoc the range data above, that is, break it into one row per day. Below is the desired output:
ID Date Shares
ABC 01-Jan-2010 100
ABC 02-Jan-2010 100
ABC 03-Jan-2010 100
ABC 04-Jan-2010 100
ABC 05-Jan-2010 100
.......
ABC 03-Feb-2014 100
ABC 04-Feb-2014 160
....till 03-Sep-2014
I am using SAS code with PROC SQL, but it is very time-consuming.
I would appreciate help with this query as soon as possible.
Thanks
Hitesh

This should be fairly easy with a data step and some do-loops.
data want(drop = StartDT EndDT i);
  set have;
  format date date9.;
  /* write one row per day from StartDT through EndDT, inclusive */
  do i = 0 to (EndDT - StartDT);
    date = StartDT + i;
    output;
  end;
run;
Do you really want lots of repeated rows, though, or are you just interested in getting the difference of dates?
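The step above only works if StartDT and EndDT are numeric SAS dates. A minimal end-to-end sketch, assuming the sample is read with the DATE11. informat (the data set name have and the output column names Date and Shares are taken from the post or made up for illustration, and only two sample rows are included):

```sas
/* Read the range data; the colon modifier applies DATE11. to each date field */
data have;
  input ID $ StartDT :date11. EndDT :date11. SharesOutstanding;
  format StartDT EndDT date11.;
  datalines;
XYZ 01-Jan-2011 03-Mar-2012 52
XYZ 04-Mar-2012 09-Aug-2013 108
;
run;

/* Expand each range to one row per day; a SAS date can be used directly as the loop index */
data want(keep = ID Date Shares);
  set have;
  format Date date11.;
  Shares = SharesOutstanding;
  do Date = StartDT to EndDT;
    output;
  end;
run;
```

A single pass through the data like this is usually much cheaper than joining against a calendar table in PROC SQL, which may be why the original query was slow.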

Using do loops in sas

Assume you have a data file called VIRUS_PROLIF from an infectious disease research center. Each observation has three variables: COUNTRY, START_DATE, and DOUBLE_RATE, where START_DATE is the date on which the country registered its 100th case of COVID-19. For each country, DOUBLE_RATE is the number of days it takes for the number of cases to double in that country. Write the SAS code using DO UNTIL to calculate the date on which that country would be predicted to register 200,000 cases of COVID-19.
data VIRUS_PROLIF;
INPUT COUNTRY $ start_date mmddyy10. num_of_cases double_rate ;
*here doubling rate is 100% so if day 1 had 100 cases day 2 will have 200;
Datalines;
US 03/13/2020 100 100
;
run;
data VIRUS_PROLIF1 (drop=start_date);
set VIRUS_PROLIF;
do until (num_of_cases>200000);
double_rate+1;
num_of_cases+ (num_of_cases*1);
end;
run;
proc print data=VIRUS_PROLIF1;
run;
The key concept you're missing is how to apply the growth rate. The formula is the same as for compound interest on money: if you have one dollar today and earn 100% interest, it becomes
StartingAmount * (1 + interestRate)
where the interest rate here is 100/100 = 1, i.e. two dollars.
*fake data;
data VIRUS_PROLIF;
INPUT COUNTRY $ start_date mmddyy10. num_of_cases double_rate;
*here doubling rate is 100% so if day 1 had 100 cases day 2 will have 200;
Datalines;
US 03/13/2020 100 100
AB 03/17/2020 100 20
;
run;
data VIRUS_PROLIF1;
set VIRUS_PROLIF;
*assign date to starting date so both are in output;
date=start_date;
*save record to data set;
output;
do until (num_of_cases>200000);
*increment your day;
date=date+1;
*doubling rate is represented as a percent so add it to 1 to show the rate;
num_of_cases=num_of_cases*(1+double_rate/100);
*save record to data set;
output;
end;
*control date display;
format date start_date date9.;
run;
*check results;
proc print data=VIRUS_PROLIF1;
run;
The inequality $200000 < N_0 (1 + R/100)^k$, where $N_0$ is NUM_OF_CASES and $R$ is DOUBLE_RATE, can be solved for the integer $k$ without iterating:
day_of_200K = ceil (
LOG ( 200000 / NUM_OF_CASES )
/ LOG ( 1 + DOUBLE_RATE / 100 )
);
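The closed form comes from taking logarithms of both sides. A short derivation, writing $N_0$ for the starting case count and $R$ for the daily growth rate in percent, and assuming $R > 0$ so that the logarithm in the denominator is positive:

```latex
\begin{align*}
N_0 \left(1 + \frac{R}{100}\right)^{k} &> 200000 \\
k \,\ln\!\left(1 + \frac{R}{100}\right) &> \ln\frac{200000}{N_0} \\
k &> \frac{\ln\left(200000 / N_0\right)}{\ln\left(1 + R/100\right)}
\end{align*}
```

For the first sample row ($N_0 = 100$, $R = 100$) the right-hand side is $\log_2 2000 \approx 10.97$, so $k = 11$, which agrees with the DO UNTIL loop: $100 \cdot 2^{11} = 204800$. Note that CEIL returns the smallest $k$ with $N_0 (1 + R/100)^k \ge 200000$; if the ratio happens to be a whole number and the strict inequality matters, FLOOR of the ratio plus 1 is the safer choice.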

Using the sum of the columns to create a new variable

I have a data set with State, Corn, and Cotton. I want to create a new variable, Corn_Pct, in SAS (the state's corn output as a percentage of the country's corn output), and likewise Cotton_Pct.
sample of data: (numbers are not real)
State Corn Cotton
TX 135 500
AK 120 350
...
Can anyone help?
You can do this with a simple PROC SQL step. Let the data set be "Test":
Proc sql ;
create table test_percent as
select *,
Corn/sum(corn) as Corn_Pct format=percent7.1,
Cotton/sum(Cotton) as Cotton_Pct format=percent7.1
from test
;
quit;
If you have many columns, you can use arrays and DO loops in a DATA step to generate the percentages automatically.
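A rough sketch of that array idea, assuming the same Test data set with numeric Corn and Cotton columns; the names totals, corn_sum, cotton_sum and the array names are made up for illustration:

```sas
/* One-row data set holding the column totals */
proc means data=test noprint;
  var corn cotton;
  output out=totals(drop=_type_ _freq_) sum=corn_sum cotton_sum;
run;

data test_percent(drop=i corn_sum cotton_sum);
  /* read the totals once; variables read by SET are retained across iterations */
  if _n_ = 1 then set totals;
  set test;
  array vals{*} corn cotton;
  array sums{*} corn_sum cotton_sum;
  array pcts{*} corn_pct cotton_pct;
  /* divide each value by the total of its own column */
  do i = 1 to dim(vals);
    pcts{i} = vals{i} / sums{i};
  end;
  format corn_pct cotton_pct percent7.1;
run;
```

Adding another crop then only means extending the VAR statement, the SUM= list, and the three ARRAY statements, as long as the variables stay in the same order in each.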
Try this. I calculate the column totals in an inner query and then use those totals in the outer query through a cross join.
/*My Dataset */
Data Test;
input State $ Corn Cotton ;
cards;
TK 135 500
AK 120 350
CK 100 250
FG 200 300
;
run;
/*Code*/
Proc sql;
create table test_percent as
Select a.*, (corn * 100/sm_corn) as Corn_pct, (Cotton * 100/sm_cotton) as Cotton_pct
from test a
cross join
(
select sum(corn) as sm_corn ,
sum(Cotton) as sm_cotton
from test
) b ;
quit;
/*My Output*/
State Corn Cotton Corn_pct Cotton_pct
TK 135 500 24.32432432 35.71428571
AK 120 350 21.62162162 25
CK 100 250 18.01801802 17.85714286
FG 200 300 36.03603604 21.42857143
Here you have an alternative using proc means and data step:
proc means data=test sum noprint;
  var corn cotton;
  output out=test2(keep=corn cotton) sum=corn cotton;
run;
data test_percent (drop=corn_sum cotton_sum);
  set test2(rename=(corn=corn_sum cotton=cotton_sum) in=in1) test(in=in2);
  if (in1=1) then do;
    /* store the totals in macro variables so they are available on later iterations */
    call symputx('corn_sum',corn_sum);
    call symputx('cotton_sum',cotton_sum);
  end;
  else do;
    /* symget returns text, so convert it back to a number explicitly */
    Corn_pct = corn/input(symget('corn_sum'),best32.);
    Cotton_pct = cotton/input(symget('cotton_sum'),best32.);
    output;
  end;
run;

matching two datasets with one month lag

I am trying to match the maximum daily value within each month to monthly data.
data daily;
input permno $ date ret;
datalines;
1000 19860101 88
1000 19860102 90
1000 19860201 70
1000 19860202 55
1001 19860201 97
1001 19860202 74
1001 19860203 79
1002 19860301 55
1002 19860302 100
1002 19860301 10
;
run;
data monthly;
input permno $ date ret;
datalines;
1000 19860131 1
1000 19860228 2
1000 19860331 5
1001 19860331 3
1002 19860430 4
;
run;
The result I want is the following (each month's daily maximum matched to the monthly record of the following month, i.e. with a one-month lag):
1000 19860102 90 1000 19860228 2
1000 19860201 70 1000 19860331 5
1001 19860201 97 1001 19860331 3
1002 19860302 100 1002 19860430 4
Below is what I have tried so far.
I want the maximum ret value within each month, so I created yrmon to give all daily records in the same month the same yyyymm value:
data a1; set daily;
yrmon=year(date)*100 + month(date);
run;
To pick the maximum value (here, ret) within each yrmon group for the same permno, I used the code below:
proc means data=a1 noprint;
class permno yrmon ;
var ret;
output out= a2 max=maxret;
run;
However, this only gives me permno, yrmon, and the maximum ret, and drops the original date.
data a3;
set a2;
new=intnx('month',yrmon,1);
format date new yymmn6.;
run;
But it does not work, because yrmon is no longer a date value.
Thank you in advance.
Hello
I am trying to match two different data sets by permno (same company) but with a one-month lag (e.g. daily9 data set yrmon=198601 and monthly2 data set yrmon=198602).
This is difficult for me to handle, because if I just add 1 to yrmon, 198612 + 1 will not be 198701, and I am not sure how to deal with this.
Can anyone help?
1) informat date1/date2 yymmn6. is used to read the date in yyyymm format
2) format date1/date2 yymmn6. is used to view the date in yyyymm format
3) intnx("months",b.date2,-1) is used to join the dates with lag of 1 month
data data1;
informat date1 yymmn6.;
format date1 yymmn6.;
input date1 value1;
cards;
200101 200
200212 300
200211 400
;
run;
data data2;
informat date2 yymmn6.;
format date2 yymmn6.;
input date2 value2;
cards;
200101 3000000
200102 4000000
200301 2000000
200212 2000000
;
run;
proc sql;
create table result as
select a.*,b.date2,b.value2 from
data1 a
left join
data2 b
on a.date1 = intnx("months",b.date2,-1);
quit;
My Output:
date1 |value1 |date2 |value2
200101 |200 |200102 |4000000
200211 |400 |200212 |2000000
200212 |300 |200301 |2000000
Let me know in case of any queries.
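For the original daily/monthly question, where the dates are plain yyyymmdd numbers, a minimal sketch of the same INTNX idea might look like this. It assumes the daily and monthly data sets exactly as posted; the names daily2, monthly2, matched, sasdate, month_start and lag_month_start are made up for illustration.

```sas
/* Turn the yyyymmdd numbers into real SAS dates and tag each daily row with its month */
data daily2;
  set daily;
  sasdate = input(put(date, 8.), yymmdd8.);
  month_start = intnx('month', sasdate, 0);      /* first day of the same month */
  format sasdate month_start yymmddn8.;
run;

data monthly2;
  set monthly;
  sasdate = input(put(date, 8.), yymmdd8.);
  lag_month_start = intnx('month', sasdate, -1); /* first day of the previous month */
  format sasdate lag_month_start yymmddn8.;
run;

/* Keep, per permno and month, the daily row with the largest ret,
   joined to the monthly record of the following month */
proc sql;
  create table matched as
  select d.permno, d.date as daily_date, d.ret as max_ret,
         m.date as monthly_date, m.ret as monthly_ret
  from daily2 d
       inner join monthly2 m
       on d.permno = m.permno
      and d.month_start = m.lag_month_start
  group by d.permno, d.month_start
  having d.ret = max(d.ret);
quit;
```

Because INTNX works on real dates, December rolls over to January of the next year without any special handling. The HAVING clause relies on PROC SQL remerging the group maximum with the detail rows, so a tie for the monthly maximum would produce more than one row for that month.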

SAS assigning numbers and partitioning by account

I use SAS EG and have a data set that looks like:
CLIENT_ID Segment Yearmonth
XXXX A 201305
XXXX A 201306
XXXX A 201307
YYYY A 201305
YYYY A 201306
YYYY B 201307
I want an output with a new column holding a counter that resets whenever a new account starts:
CLIENT_ID Segment Yearmonth New_Variable
XXXX A 201305 1
XXXX A 201306 2
XXXX A 201307 3
YYYY A 201305 1
YYYY A 201306 2
YYYY B 201307 3
That was problem number one, which I solved with this code:
PROC SORT DATA= GENERAL.HISTORICAL_SEGMENTS;
by Client_ID;
RUN;
data HISTORICAL_SEGMENTS2;
SET GENERAL.HISTORICAL_SEGMENTS;
count + 1;
by Client_ID;
if first.Client_ID then count = 1;
run;
I also want to create a second data set, and I want to know whether there is a way to keep a row only when the segment changes. For example, from the data above:
CLIENT_ID Segment Yearmonth New_Variable
YYYY A 201305 1
YYYY B 201306 2
Any help would be appreciated. Thanks.
Nice job on answering your first question. I think that step reads more clearly if you rearrange it a bit, e.g.:
data HISTORICAL_SEGMENTS2 ;
set GENERAL.HISTORICAL_SEGMENTS ;
by Client_ID ;
if first.Client_ID then count = 0 ;
count + 1 ;
run;
I think it's customary to put the BY statement right after the SET statement it applies to, for clarity's sake. The counter is reset to 0 when Client_ID changes and then incremented for every record.
It looks like you want a second dataset, call it FIRSTS, with the first record from each by group. To do that, note that it's possible for one DATA step to write multiple output datasets. This can be done by using an explicit OUTPUT statement to write to each dataset, e.g. :
data HISTORICAL_SEGMENTS2 FIRSTS ;
set GENERAL.HISTORICAL_SEGMENTS ;
by Client_ID ;
if first.Client_ID then count = 0 ;
count + 1 ;
output HISTORICAL_SEGMENTS2 ; *output every record;
if first.Client_ID then output FIRSTS ; *output first of each group;
run;
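If the goal is instead one row each time the segment changes within a client, a hedged sketch along the same lines uses BY with NOTSORTED so that first.Segment marks the start of each run of identical segments. The names SORTED and SEGMENT_CHANGES are made up for illustration, and it assumes the rows should be ordered by Yearmonth within each client:

```sas
proc sort data=GENERAL.HISTORICAL_SEGMENTS out=SORTED;
  by Client_ID Yearmonth;
run;

data SEGMENT_CHANGES;
  set SORTED;
  /* NOTSORTED: Segment only needs to be grouped, not in sorted order */
  by Client_ID Segment notsorted;
  if first.Client_ID then count = 0;
  /* true on the first row of a client and whenever the segment differs from the previous row */
  if first.Segment then do;
    count + 1;
    output;
  end;
run;
```

With the sample data this keeps the first row of every client as well as each later change, so clients whose segment never changes still contribute one row; filter on count afterwards if only clients with an actual change are wanted.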

Transform numbers with 0 values at the beginning

I have the following dataset:
DATA survey;
INPUT zip_code number;
DATALINES;
1212 12
1213 23
1214 23
;
PROC PRINT; RUN;
I want to link this data to another table but the thing is that the numbers in the other table are stored in the following format: 0012, 0023, 0023.
So I am looking for a way to do the following:
Check how long the number is.
If length = 1, add three leading zeros.
If length = 2, add two leading zeros.
Any thoughts on how I can get this working?
Numbers are numbers so if the other table has the field as a number then you don't need to do anything. 13 = 0013 = 13.00 = ....
If the other table actually has a character variable then you need to convert one or the other.
char_number = put(number, Z4.);
number = input(char_number, 4.);
You can use z#. formats to accomplish this:
DATA survey;
INPUT zip_code number;
DATALINES;
1212 12
1213 23
1214 23
9999 999
8888 8
;
data survey2;
set survey;
number_long = put(number, z4.);
run;
If number is instead stored as a character variable and you need it zero-padded to four characters, convert it to numeric first and then apply the format:
want = put(input(number,best32.),z4.);