I have a data set with the variables State, Corn, and Cotton. I want to create a new variable in SAS, Corn_Pct (each state's corn output as a percentage of the country's total corn output), and likewise Cotton_Pct for cotton.
Sample of the data (the numbers are not real):
State Corn Cotton
TX 135 500
AK 120 350
...
Can anyone help?
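You can do this with a simple PROC SQL step. Assume the dataset is named "Test". SAS remerges the column total back onto every row, so each row can be divided by it directly: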
Proc sql ;
create table test_percent as
select *,
Corn/sum(corn) as Corn_Pct format=percent7.1,
Cotton/sum(Cotton) as Cotton_Pct format=percent7.1
from test
;
quit;
If you have many columns, you can use arrays and DO loops to generate the percentages automatically.
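As a rough sketch of that array idea (not part of the original answer): the totals are computed once with PROC MEANS, then a data step loops over parallel arrays. The names totals, corn_sum, cotton_sum, and i are illustrative assumptions; extend the three array lists for more columns.

```sas
/* Step 1: one-row data set holding the column totals (names are assumed) */
proc means data=test noprint;
  var corn cotton;
  output out=totals(drop=_type_ _freq_) sum=corn_sum cotton_sum;
run;

/* Step 2: divide every column by its total inside an array loop */
data test_percent(drop=i corn_sum cotton_sum);
  if _n_ = 1 then set totals;          /* read totals once; SET variables are retained */
  set test;
  array amt{*} corn cotton;            /* source columns */
  array tot{*} corn_sum cotton_sum;    /* matching totals, same order */
  array pct{*} corn_pct cotton_pct;    /* new percentage columns */
  do i = 1 to dim(amt);
    pct{i} = amt{i} / tot{i};
  end;
  format corn_pct cotton_pct percent7.1;
run;
```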
I calculated each column's total in an inner query, then used those totals in the outer query via a cross join.
Try this:
/*My Dataset */
Data Test;
input State $ Corn Cotton ;
cards;
TK 135 500
AK 120 350
CK 100 250
FG 200 300
run;
/*Code*/
Proc sql;
create table test_percent as
Select a.*, (corn * 100/sm_corn) as Corn_pct, (Cotton * 100/sm_cotton) as Cotton_pct
from test a
cross join
(
select sum(corn) as sm_corn ,
sum(Cotton) as sm_cotton
from test
) b ;
quit;
/*My Output*/
State Corn Cotton Corn_pct Cotton_pct
TK 135 500 24.32432432 35.71428571
AK 120 350 21.62162162 25
CK 100 250 18.01801802 17.85714286
FG 200 300 36.03603604 21.42857143
Here is an alternative using PROC MEANS and a data step:
proc means data=test sum noprint;
output out=test2(keep=corn cotton) sum=corn cotton;
run;
data test_percent (drop=corn_sum cotton_sum);
set test2(rename=(corn=corn_sum cotton=cotton_sum) in=in1) test(in=in2);
if (in1=1) then do;
call symput('corn_sum',corn_sum);
call symput('cotton_sum',cotton_sum);
end;
else do;
Corn_pct = corn/symget('corn_sum');
Cotton_pct = cotton/symget('cotton_sum');
output;
end;
run;
During data cleaning, I sometimes need to compare values across rows. For example, if rows share the same CountryID and SubjectID, keep only the row with the highest temperature:
CountryID SubjectID Temperature
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
In a case like this, I use the lag() function as follows.
proc sort data=table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
CountryID_lag = lag(CountryID);
SubjectID_lag = lag(SubjectID);
Temperature_lag = lag(Temperature);
if CountryID = CountryID_lag and SubjectID = SubjectID_lag then do;
if Temperature < Temperature_lag then delete;
end;
drop CountryID_lag SubjectID_lag Temperature_lag;
run;
The code above seems to work.
But I would still like to know whether there are better ways to solve this kind of problem.
I think you are overcomplicating the task. You can use PROC SQL with the max function:
proc sql;
create table table_laged as
select CountryID, SubjectID, max(Temperature) as Temperature
from table
group by CountryID, SubjectID;
quit;
I don't know if this is what you want, but your code keeps every row tied for the highest temperature.
So when one subject has 2 1 3, it keeps 3. But when the subject has 1 4 3 4 4, it keeps 4 4 4. It is simpler to keep just the first row for each subject, which is the highest because of the descending sort order.
proc sort data = table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
by CountryID SubjectID;
if first.SubjectID;
run;
You can use the double DOW loop technique to:
1. Compute a measure over a group.
2. Apply the measure to items in the group.
The benefit of DOW looping is a single pass over the data set when the incoming data is already grouped.
In this question, step 1 identifies the first row in the group with the highest temperature, and step 2 selects that row for output.
data want;
do _n_ = 1 by 1 until (last.SubjectId);
set have;
by CountryId SubjectId;
if temperature > _max_temp then do;
_max_temp = temperature;
_max_at_n = _n_;
end;
end;
do _n_ = 1 to _n_;
set have;
if _n_ = _max_at_n then OUTPUT;
end;
drop _:;
run;
The traditional procedural technique is PROC MEANS:
data have;
input CountryID SubjectID Temperature;
datalines;
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
run;
proc means noprint data=have;
by countryid subjectid;
output out=want(drop=_:) max(temperature)=temperature;
run;
If the data is not sorted by CountryID and SubjectID going into the data step, a hash object can be used, or SQL as in @Aurieli's answer.
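A minimal sketch of the hash-object route for unsorted input, assuming the same have data set; the names h, _temp, done, and want_hash are illustrative and not from the original answer. It keeps one maximum temperature per (CountryID, SubjectID) key and writes the hash contents out at the end.

```sas
data _null_;
  if 0 then set have;                        /* define host variables in the PDV */
  declare hash h(ordered: 'a');              /* output sorted by key, ascending */
  h.defineKey('CountryID', 'SubjectID');
  h.defineData('CountryID', 'SubjectID', 'Temperature');
  h.defineDone();

  do until (done);
    set have(rename=(Temperature=_temp)) end=done;
    if h.find() ne 0 then do;                /* key not seen yet: store this row */
      Temperature = _temp;
      h.add();
    end;
    else if _temp > Temperature then do;     /* find() loaded the stored maximum */
      Temperature = _temp;
      h.replace();
    end;
  end;

  h.output(dataset: 'want_hash');            /* one row per key with its maximum */
  stop;
run;
```

Unlike the BY-group approaches, this needs no prior sort, at the cost of holding one entry per group in memory.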
I have a dataset called stores. I want to compute, for each store, the total sales (sum of Retail_Price), the proportion of sales, and the cumulative proportion of sales in SAS.
Sample dataset: Stores
Date Store_Postcode Retail_Price month Distance
08/31/2013 CR7 8LE 470 8 7057.8
10/26/2013 CR7 8LE 640 10 7057.8
08/19/2013 CR7 8LE 500 8 7057.8
08/17/2013 E2 0RY 365 8 1702.2
09/22/2013 W4 3PH 395.5 12 2522
06/19/2013 W4 3PH 360.5 6 1280.9
11/15/2013 W10 6HQ 475 12 3213.5
06/20/2013 W10 6HQ 500 1 3213.5
09/18/2013 E7 8NW 315 9 2154.8
10/23/2013 E7 8NW 570 10 5777.9
11/18/2013 W10 6HQ 455 11 3213.5
08/21/2013 W10 6HQ 530 8 3213.5
Code I tried:
Proc sql;
Create table work.Top_sellers as
Select Store_postcode as Stores,SUM(Retail_price) as Total_Sales,Round((Retail_price/Sum(Retail_price)),0.01) as
Proportion_of_sales
From work.stores
Group by Store_postcode
Order by total_sales;
Quit;
I have no idea how to calculate a cumulative variable in PROC SQL.
Please help me improve my code!
Computing a cumulative result in SQL requires the data to have an explicit, unique, ordered key. The query uses a reflexive join (a self-join) with 'triangular' criteria for the cumulative aspect.
data have;
do id = 100 to 120;
sales = ceil (10 + 25 * ranuni(123));
output;
end;
run;
proc sql;
create table want as
select
have1.id
, have1.sales
, sum(have2.sales) as sales_cusum
from
have as have1
join
have as have2
on
have1.id >= have2.id /* 'triangle' criteria */
group by
have1.id, have1.sales
order by
have1.id
;
quit;
A second way is to recompute the cumulative sum row by row with a correlated subquery:
proc sql;
create table want as
select have.id, have.sales,
( select sum(have2.sales)
from have as have2
where have2.id <= have.id
)
as cusum
from
have;
quit;
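For comparison, a data step sketch of the same running total, assuming the have data set is already in id order (the name want_ds is illustrative). The sum statement retains its accumulator automatically and starts it at zero.

```sas
data want_ds;
  set have;              /* assumed to be in id order already */
  sales_cusum + sales;   /* sum statement: implicitly retained running total */
run;
```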
I changed my mind; a CDF is a different calculation.
Here's how to do this with a data step. First calculate the cumulative totals (I used a data step here, but you could use PROC EXPAND if you have SAS/ETS).
*sort demo data;
proc sort data=sashelp.shoes out=shoes;
by region sales;
run;
data cTotal last (keep = region cTotal);
set shoes;
by region;
*calculate running total;
if first.region then cTotal=0;
cTotal = cTotal + sales;
*output records, everything to cTotal but only the last record which is total to Last dataset;
if last.region then output last;
output cTotal;
retain cTotal;
run;
*merge in results and calculate percentages;
data calcs;
merge cTotal Last (rename=(cTotal=Total));
by region;
percent = cTotal/Total;
run;
If you need a more efficient solution, I'd try a DoW solution.
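A possible double DOW version of the same calculation, offered as a sketch rather than a tuned solution. It assumes the sorted shoes data set from the PROC SORT step above; calcs_dow is an illustrative name. The first loop totals a region, the second re-reads the same rows and accumulates the running share. Total and cTotal are not retained, so they reset to missing at the start of each region.

```sas
data calcs_dow;
  /* pass 1: total sales for the current region */
  do until (last.region);
    set shoes;
    by region;
    Total = sum(Total, sales);
  end;
  /* pass 2: re-read the same rows, build the running total and its share */
  do until (last.region);
    set shoes;
    by region;
    cTotal = sum(cTotal, sales);
    percent = cTotal / Total;
    output;
  end;
run;
```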
I am trying to match the maximum daily value within each month to monthly data.
data daily;
input permno $ date ret;
datalines;
1000 19860101 88
1000 19860102 90
1000 19860201 70
1000 19860202 55
1001 19860201 97
1001 19860202 74
1001 19860203 79
1002 19860301 55
1002 19860302 100
1002 19860301 10
;
run;
data monthly;
input permno $ date ret;
datalines;
1000 19860131 1
1000 19860228 2
1000 19860331 5
1001 19860331 3
1002 19860430 4
;
run;
The result I want is the following (each month's daily maximum matched to the monthly record one month later):
1000 19860102 90 1000 19860228 2
1000 19860201 70 1000 19860331 5
1001 19860201 97 1001 19860331 3
1002 19860302 100 1002 19860430 4
Below is what I have tried so far.
I want the maximum ret value within each month, so I created yrmon to give every daily record in the same month the same yyyymm value.
data a1; set daily;
yrmon=year(date)*100 + month(date);
run;
To choose the maximum value (here, ret) within each yrmon group for each permno, I used the code below.
proc means data=a1 noprint;
class permno yrmon ;
var ret;
output out= a2 max=maxret;
run;
However, that only gave me permno, yrmon, and the maximum ret, dropping the original date.
data a3;
set a2;
new=intnx('month',yrmon,1);
format date new yymmn6.;
run;
But it doesn't work, because yrmon is no longer a SAS date value.
Thank you in advance.
Hello,
I am trying to match two data sets by permno (the same company) but with a one-month lag (e.g. yrmon=198601 in the daily9 dataset with yrmon=198602 in the monthly2 dataset).
This is difficult for me because if I simply add 1 to yrmon, 198612 + 1 does not give 198701, and I am not sure how to handle such cases.
Can anyone help?
1) informat date1/date2 yymmn6. is used to read the date in yyyymm format
2) format date1/date2 yymmn6. is used to view the date in yyyymm format
3) intnx("months",b.date2,-1) is used to join the dates with lag of 1 month
data data1;
input date1 value1;
informat date1 yymmn6.;
format date1 yymmn6.;
cards;
200101 200
200212 300
200211 400
;
run;
data data2;
input date2 value2;
informat date2 yymmn6.;
format date2 yymmn6.;
cards;
200101 3000000
200102 4000000
200301 2000000
200212 2000000
;
run;
proc sql;
create table result as
select a.*,b.date2,b.value2 from
data1 a
left join
data2 b
on a.date1 = intnx("months",b.date2,-1);
quit;
My Output:
date1 |value1 |date2 |value2
200101 |200 |200102 |4000000
200211 |400 |200212 |2000000
200212 |300 |200301 |2000000
Let me know in case of any queries.
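On the year-boundary worry from the question (198612 + 1 is not 198701): once the value is a real SAS date, INTNX handles the rollover. A small sketch, where the variable names d and next are illustrative:

```sas
data _null_;
  d = input('198612', yymmn6.);   /* read yyyymm as a SAS date (first day of the month) */
  next = intnx('month', d, 1);    /* advance one month, crossing the year boundary */
  put next= yymmn6.;              /* writes next=198701 to the log */
run;
```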
I need a summation column; however, both the retain and lag approaches are inefficient.
There are a number of ways. You could use PROC SQL or PROC MEANS. I've written one way below:
data begin;
length person $3 sallary 5;
input person sallary;
datalines;
a 200
a 300
b 800
c 400
c 500
c 600
;
run;
proc means data=begin noprint;
by person; /*Handle each person as distinct subset*/
output out=Sal_by_person(drop= _type_ _freq_)
sum(sallary)=Total_sallary /*What we calculate and what we call them.*/
;
run;
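The PROC SQL route mentioned above could look like the following sketch, which uses the same begin data set; Sal_by_person_sql is an illustrative name. Unlike PROC MEANS with a BY statement, it does not need the input sorted by person.

```sas
proc sql;
  create table Sal_by_person_sql as
  select person,
         sum(sallary) as Total_sallary   /* one total per person */
  from begin
  group by person;
quit;
```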
I don't know where to start with this. I've tried listing the columns in every possible order but they are always listed horizontally. The dataset is:
data job2;
input year apply_count interviewed_count hired_count interviewed_mean hired_mean;
datalines;
2012 349 52 12 0.149 0.23077
2013 338 69 20 0.20414 0.28986
2014 354 70 18 0.19774 0.25714
;
run;
Here's an example of the proc report code for just one analysis variable:
proc report data = job2;
columns apply_count year;
define year / across " ";
define apply_count / analysis "Applied" format = comma8.;
run;
Ideally the final report would look like this:
2012 2013 2014
Applied 349 338 354
Interv. 52 69 70
Hired 12 20 18
Inter % 15% 20% 20%
Hired % 23% 29% 26%
I don't know if this is the best way to do this.
data job2;
input year apply_count interviewed_count hired_count interviewed_mean hired_mean;
datalines;
2012 349 52 12 0.149 0.23077
2013 338 69 20 0.20414 0.28986
2014 354 70 18 0.19774 0.25714
;;;;
run;
proc transpose data=job2 out=job3;
by year;
run;
data job3;
set job3;
length y atype $8;
y = propcase(scan(_name_,1,'_'));
atype = scan(_name_,-1,'_');
if atype eq 'mean' then substr(y,8,1)='%';
run;
proc print;
run;
proc report data=job3 list;
columns atype y year, col1 dummy;
define atype / group noprint;
define y / group order=data ' ';
define year / across ' ';
define dummy / noprint;
define col1 / format=12. ' ';
compute before atype;
xatype = atype;
endcomp;
compute after atype;
line ' ';
endcomp;
compute col1;
if xatype eq 'mean' then do;
call define('_C3_','format','percent12.');
call define('_C4_','format','percent12.');
call define('_C5_','format','percent12.');
end;
endcomp;
run;