SAS-calculating the percentage of variables - sas

Suppose a dataset like the following:
firm_name year gender
A 2011 M
A 2011 M
A 2011 F
A 2012 M
A 2012 M
A 2012 F
A 2012 M
I use the following code to do the calculation.
proc sql;
create table result as
select firm_name,
year,
sum(case when(gender="M") then 1 else 0 end)/count(*) as des
from Have
group by 1,2;
quit;

Longfish is correct. This can be implemented with PROC FREQ.
data begin;
input firm_name $ year gender $ ;
datalines;
A 2011 M
A 2011 M
A 2011 F
A 2012 M
A 2012 M
A 2012 F
A 2012 M
;
run;
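BY-group processing in SAS requires the input to be sorted (or indexed) by the BY variables, so sort the data first. If the observations are out of order, the procedure stops at the first out-of-sequence BY group and writes a "not sorted in ascending sequence" error to the log.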
proc sort data= begin; by firm_name year gender; run;
proc freq data= begin ; /* You can add the noprint option for large files */
by firm_name year;
table gender /out= wanted;
run;
For more on proc Freq see https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_freq_sect006.htm
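As a possible follow-up, the OUT= data set from PROC FREQ holds one row per gender within each firm and year, with the COUNT and PERCENT variables that PROC FREQ creates. The sketch below assumes the `wanted` data set from the step above and keeps only the male share as a proportion, mirroring the `des` column from the question. The output name `pct_male` is hypothetical.

```sas
/* Sketch: reduce the PROC FREQ output to the share of males per firm and year. */
/* Assumes WANTED was created by the PROC FREQ step above (BY firm_name year).  */
data pct_male;                 /* hypothetical output name */
  set wanted;
  where gender = 'M';          /* keep only the row for males */
  des = percent / 100;         /* PERCENT is on a 0-100 scale within each BY group */
  keep firm_name year des;
run;

proc print data=pct_male noobs;
run;
```

A firm and year with no males produces no row here, whereas the PROC SQL version in the question would return 0 for that group.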

Related

Attempting to Automatically Create Label of Years Used in an Analysis

Say I had 5 years of data that were being used to calculate some measure across those aggregated years. Sometimes those are 5 consecutive years and other times data was not available for a given year so it must be skipped. For example 2016-2020 vs 2015-2017 & 2019-2020. In this case data was not available for 2018. I have been given a set of rules for how these years should be presented.
Consecutive years should be ex: 2016-2020
Non-Consecutive Years Will Look slightly different depending on where the missing year(s) occur.
2015-2017 & 2019-2020
2010, 2012, 2014, 2016 & 2017
2015-2018 & 2020
While it would be trivial just to produce a comma separated list of all years used this is how they want the years presented. These labels are for a series of different measures so I am attempting to create these labels automatically within a macro. The number of years of data is also not always 5. It could be 3 years or even 10 years.
The obvious first idea was a DO UNTIL loop that starts at the minimum year and compares each year against the next year used in the analysis to see whether they are consecutive. Given that the number of years isn't consistently 5, this made the most sense so far, but I have not worked with DO UNTIL loops very much. As such, I couldn't quite figure out how to progressively build the label over the iterations of the loop while also adhering to these rules.
For this example, let's use the years 2015, 2016, 2017, 2019, 2020.
Any help would be greatly appreciated.
This could be a case where a picture is worth a thousand words.
Example:
/* simulate raw results of a survey of 10 questions over 16 years */
data surveyresults;
call streaminit(20230125);
do qid = 1 to 10;
do year = 2007 to 2022;
if year = 2021 then continue;
if rand('uniform') > 0.85 then continue;
do _n_ = 1 to rand('integer', 30);
pid + 1;
if rand('uniform') > 0.85 then continue;
answercode = rand('integer', 20);
output;
end;
end;
end;
run;
proc sql noprint;
create table stage1 as
select distinct qid, year, 1 as flag
from surveyresults
order by qid, year
;
select catx(' ', min(year), 'to', max(year))
into :year_range
from stage1 ;
ods html file='plot.html';
proc sgplot data=stage1;
scatter x=year y=qid / markerattrs=(symbol=squarefilled size=8.2%);
xaxis values=(&year_range);
yaxis type=discrete;
run;
ods html close;
This should get you started.
data test;
infile cards dsd;
input x @@;
d = dif(x); /*used to create RUN when dif > 1 increment run*/
if d eq . or d > 1 then run+1;
cards;
2015,2016,2017,2019,2020,2022,2024,2025,2026
;;;;
run;
proc print;
run;
proc summary data=test nway; /*count the number of years in each run*/
class run;
output out=runlen(drop=_type_);
run;
data test; /* merge TEST and RUNLEN*/
length list $128;
do until(last.run); /*loop until last.run*/
merge test runlen;
by run;
if first.run then list = cats(x); /*start of list*/
end;
select(_freq_); /*based on run-length create LIST */
when(1);
when(2) list = catx(' & ',list,x);
otherwise list = catx('-',list,x);
end;
run;
proc print;
run;
Probably an easier way than this, but this works for your scenarios.
data years;
input year;
cards;
2015
2016
2017
2019
2020
;
run;
/* data years; */
/* input year; */
/* cards; */
/* 2010 */
/* 2012 */
/* 2014 */
/* 2016 */
/* 2017 */
/* ; */
/* run; */
/* data years; */
/* input year; */
/* cards; */
/* 2015 */
/* 2016 */
/* 2017 */
/* 2018 */
/* 2020 */
/* ; */
/* run; */
data want;
merge years end=eof years(firstobs=2 rename=year=next_year);
length year_list $200. interval $20.;
retain year_list start_year;
_dif= next_year - year;
if _n_=1 then start_year=year;
if _dif > 1 or eof then do;
if start_year ne year then interval = catx('-', start_year, year);
else interval = put(start_year, 8. -l);
if eof then year_list=catx(" & ", year_list, interval);
else year_list = catx(", ", year_list, interval);
start_year = next_year;
end;
if eof then call symputx('year_list', year_list);
run;
%put &year_list;
This version creates the combined list. I think it has the features you describe.
data test;
infile cards dsd;
input x @@;
d = dif(x); /*used to create RUN when dif > 1 increment run*/
if d eq . or d > 1 then run+1;
cards;
2015,2016,2017,2019,2020,2022,2024,2025,2026
;;;;
run;
proc print;
run;
data list(keep=combinedlist);
length list $128 combinedList $256;
do while(not eof);
list=' ';
do runlength=1 by 1 until(last.run); /*loop until last.run*/
set test end=eof;
by run;
if first.run then list = cats(x); /*start of list*/
end;
select(runlength); /*based on run-length create LIST */
when(1);
when(2) list = catx(' & ',list,x);
otherwise list = catx('-',list,x);
end;
combinedList = catx(', ',combinedList,list);
end;
output;
stop;
run;
proc print;
run;
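Since the question asks for the label inside a macro, one way to hand the result over is to copy it into a macro variable. This is a minimal sketch assuming the `list` data set and its `combinedList` variable from the step above. The macro variable name `year_label` is hypothetical.

```sas
/* Sketch: store the combined label in a macro variable for later use. */
/* Assumes LIST contains a single row with the variable COMBINEDLIST.   */
data _null_;
  set list;
  call symputx('year_label', combinedList);  /* SYMPUTX trims leading and trailing blanks */
run;

%put &=year_label;   /* writes the name and value to the log */
```

Inside a macro, a third argument of `'L'` or `'G'` to `call symputx` controls whether the variable is created in the local or global symbol table.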

SAS transpose columns to row and values to columns

I have a summary table which I want to transpose, but I can't get my head around it. The columns should become the rows, and the values should become the columns.
Some explanation about the table. Each column represents a year. People can be in 3 groups: A, B or C. In 2016, everyone (100) is in group A. In 2017, 35 are in group A (5 + 20 + 10), 15 in B and 50 in C.
DATA have;
INPUT year2016 $ year2017 $ year2018 $ count;
DATALINES;
A A A 5
A A B 20
A A C 10
A B C 15
A C A 50
;
RUN;
I want to be able to make a nice graph of the evolution of the groups through the different periods. So I want to end up with a table where the columns become the rows (= period) and the values become the columns (= the 3 different groups). Please find an example of the table I want:
Image of table want
I have tried different approaches, but I can't get what I want.
There may be a more direct way, but this is probably how I would do it.
DATA have;
INPUT year2016 $ year2017 $ year2018 $ count;
id + 1;
DATALINES;
A A A 5
A A B 20
A A C 10
A B C 15
A C A 50
;
RUN;
proc print;
proc transpose data=have out=want1 name=period;
by id count notsorted;
var year:;
run;
proc print;
run;
proc summary data=want1 nway completetypes;
class period col1;
freq count;
output out=want2(drop=_type_);
run;
proc print;
run;
proc transpose data=want2 out=want(drop=_name_) prefix=Group_;
by period;
var _freq_;
id col1;
run;
proc print;
run;
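For the graph mentioned in the question, the long-format table is often easier to plot than the transposed one. The sketch below assumes the `want2` data set from the PROC SUMMARY step above, with `period`, `col1` (the group) and `_freq_` (the weighted count), and draws one cluster of bars per period. The axis label is an illustrative choice.

```sas
/* Sketch: clustered bar chart of group sizes per period. */
/* Assumes WANT2 from the PROC SUMMARY step above.        */
proc sgplot data=want2;
  vbar period / response=_freq_ group=col1 groupdisplay=cluster;
  yaxis label='Count';
run;
```

Because COMPLETETYPES was used, groups that are empty in a period (for example B and C in year2016) are still present in `want2` with a count of zero.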

How to merge 2 datasets with different lengths?

I would like to merge 2 datasets with 2 different dimensions.
TABLE1: people
gender name
M raa
F chico
M july
F sergio
TABLE2: serial_numbers
gender serial
M 4
F 5
I want the result to be
result
gender name serial
M raa 4
F chico 5
M july 4
F sergio 5
I'm creating here the datasets to illustrate how to merge both datasets:
data people;
infile cards;
length gender $1
name $10;
input gender name;
cards;
M raa
F chico
M july
F sergio
;
run;
data serial_numbers;
length gender $1
serial 8;
infile cards;
input gender serial;
cards;
M 4
F 5
;
run;
Solution 1: use a proc sql to perform the join.
proc sql;
create table result as
select a.gender, a.name, b.serial
from people a LEFT JOIN serial_numbers b
on a.gender=b.gender;
quit;
proc print data=result;
run;
Solution 2: use a data step to merge both datasets. This requires the datasets to be sorted:
proc sort data=people;
by gender;
run;
proc sort data=serial_numbers;
by gender;
run;
data result;
merge people serial_numbers;
by gender;
run;
proc print data=result;
run;
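The two solutions differ when a gender in `serial_numbers` has no match in `people`: the LEFT JOIN drops such rows, while the plain MERGE keeps them with a missing name. A sketch of the data step variant that mimics the left join uses the IN= data set option. It assumes both data sets are already sorted by gender as in Solution 2, and the flag name `in_people` is hypothetical.

```sas
/* Sketch: data step merge that keeps only rows present in PEOPLE. */
/* Assumes PEOPLE and SERIAL_NUMBERS are both sorted by GENDER.     */
data result;
  merge people(in=in_people) serial_numbers;
  by gender;
  if in_people;   /* subsetting IF: drop serial numbers with no matching person */
run;

proc print data=result;
run;
```

Note that the result comes out ordered by gender, not in the original order of `people`.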

Report using data _Null_

I'm looking for a report using a SAS data step:
I have a data set:
Name Company Date
X A 199802
X A 199705
X D 199901
y B 200405
y F 200309
Z C 200503
Z C 200408
Z C 200404
Z C 200309
Z C 200210
Z M 200109
W G 200010
Report I'm looking for:
Name Company From To
X A 1997/05 1998/02
D 1998/02 1999/01
Y B 2003/09 2004/05
F 2003/09 2003/09
Z C 2002/10 2005/03
M 2001/09 2001/09
W G 2000/10 2000/10
THANK you,
I tried using proc print, but the result is not accurate, so I'm looking for a data _null_ solution.
data _null_;
set salesdata;
by name company date;
array x(*) from;
From=lag(date);
if first.name then count=1;
do i=count to dim(x);
x(i)=.;
end;
count+1;
If first.company then do;
from_date1=date;
end;
if last.company then To_date=date;
if from_date1 ="" and to_date="" then delete;
run;
data _null_;
set yourEvents;
by Name Company notsorted;
file print;
If _N_ EQ 1 then put
@01 'Name'
@06 'Company'
@14 'From'
@22 'To'
;
if first.Name then put
@01 Name
@; ** The trailing @ holds the output line so the next put continues on it **;
retain From To;
if first.company then do;
From = 1E9;
To = 0;
end;
if Date LT From then From = Date;
if Date GT To then To = Date;
if last.Company then put
@06 Company
@14 From yymm7.
@22 To yymm7.
;
run;
I used a data step to calculate From_date and To_date,
and then proc report to print the report by group.
proc sort data=have ;
by Name Company Date;
run;
data want(drop=prev_date date);
set have;
by Name Company date;
attrib From_Date To_date format=yymms10.;
retain prev_date;
if first.Company then prev_date=date;
if last.Company then do;
To_date=Date;
From_Date=prev_date;
end;
if not(last.company) then delete;
run;
proc sort data=want;
by descending name ;
run;
proc report data=want;
define Name/order order=data;
run;
IMHO, the simplest way is to exploit proc report and its analysis column type, as in the code below. Note that the name and company columns are automatically sorted in alphabetical order (as in most summary functions and procedures).
/* your data */
data have;
infile datalines;
input Name $ Company $ Date $;
cards;
X A 199802
X A 199705
X D 199901
y B 200405
y F 200309
Z C 200503
Z C 200408
Z C 200404
Z C 200309
Z C 200210
Z M 200109
W G 200010
;
run;
/* convert YYYYMM to date */
data have2(keep=name company date);
set have(rename=(date=date_txt));
name = upcase(name);
y = input(substr(date_txt, 1, 4), 4.);
m = input(substr(date_txt, 5, 2), 2.);
date = mdy(m,1,y);
format date yymms7.;
run;
/****** 1. proc report ******/
proc report data=have2;
columns name company date=date_from date=date_to;
define name / 'Name' group;
define company / 'Company' group;
define date_from / 'From' analysis min;
define date_to / 'To' analysis max;
run;
The html output:
(tested on SAS 9.4 win7 x64)
============================ OFFTOPIC ==============================
One may also consider using proc means or proc tabulate. The basic code forms are shown below. However, you can also see that further adjustments in output formats are required.
/***** 2. proc tabulate *****/
proc tabulate data=have2;
class name company;
var date;
table name*company, date=' '*(min='From' max='To')*format=yymms7.;
run;
proc tabulate output:
/***** 3. proc means (not quite there) *****/
* proc means + ODS -> cannot recognize date formats;
proc means data=have2 nonobs min max;
class name company;
format date yymms7.; * in vain;
var date;
run;
proc means output (cannot output date format, dunno why):
You may leave comments on improving these alternative ways.
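On the PROC MEANS variant: the printed statistics are not displayed with the format of the analysis variable, which is why the dates show up as raw numbers. One possible workaround is to write the statistics to a data set and format them there. This is a sketch assuming the `have2` data set from above. The names `date_range`, `date_from` and `date_to` are hypothetical.

```sas
/* Sketch: compute min and max dates with PROC MEANS, then format on output. */
/* Assumes HAVE2 from above, with DATE stored as a numeric SAS date.         */
proc means data=have2 noprint nway;
  class name company;
  var date;
  output out=date_range(drop=_type_ _freq_) min=date_from max=date_to;
run;

proc print data=date_range noobs;
  format date_from date_to yymms7.;
run;
```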

Bar Graph by Month - SAS EG

I am trying to create a bar graph in SAS Enterprise Guide. The graph is Savings by Month.
The input Data is
Ref Date Savings
A 03JUN2013 1000
A 08JUN2013 2000
A 08JUL2013 1500
A 08AUG2013 300
A 08NOV2013 100
B 09DEC2012 500
B 09MAY2013 400
B 19MAY2013 5999
B 09OCT2013 511
C 15OCT2013 1200
C 01NOV2013 1500
The first step I do is to convert the date into a month. Then I use PROC MEANS to calculate total savings by month by Ref.
Then I create a bar graph. The issue I am getting is that the bar graph is not in sequential order as it should be. For example, it is AUG13 JUL13 JUN13 etc. instead of JUN JUL AUG.
PROC SQL;
CREATE TABLE SAVINGS_11 AS
SELECT
PUT(DATE,monname3.) AS MONTH,
(DATE) FORMAT=MONNAME3. AS MONTH1,
MONTH(DATE) AS MONTH2,
PUT(DATE,MONYY5.) AS MONTH3,
(DATE) FORMAT=MONYY5. AS MONTH4,
DATE,
REF,
SAVINGS
FROM INPUT;
QUIT;
/* -------------------------------------------------------------------
Sort data set
------------------------------------------------------------------- */
PROC SORT
DATA=SAVINGS_11(KEEP=SAVINGS MONTH MONTH1 MONTH2 MONTH3 MONTH4 REF)
OUT=SORT1;
BY REF;
RUN;
/* -------------------------------------------------------------------
Run the Means Procedure
------------------------------------------------------------------- */
TITLE;
TITLE1 "Summary";
TITLE2 "Results";
FOOTNOTE;
PROC MEANS DATA=SORT1
NOPRINT
CHARTYPE
NOLABELS
NWAY
SUM NONOBS ;
VAR SAVINGS;
CLASS MONTH / ORDER=DATA ASCENDING;
BY REF;
ID MONTH1 MONTH2 MONTH3 MONTH4;
OUTPUT OUT=MEANSUMMARY
SUM()=
/ AUTONAME AUTOLABEL WAYS INHERIT
;
RUN;
/* -------------------------------------------------------------------
End of task code.
------------------------------------------------------------------- */
RUN; QUIT;
TITLE; FOOTNOTE;
PROC SORT
DATA=MEANSUMMARY(KEEP=MONTH MONTH2 "SAVINGS_Sum"n REF)
OUT=SORT2
;
BY REF MONTH2;
RUN;
Axis1
STYLE=1
WIDTH=1
MINOR=NONE
;
Axis2
STYLE=1
WIDTH=1
;
TITLE;
TITLE1 "Bar Chart";
FOOTNOTE;
PROC GCHART DATA=SORT2
;
VBAR
MONTH
/
SUMVAR="SAVINGS_Sum"n
CLIPREF
FRAME LEVELS=ALL
TYPE=SUM
INSIDE=SUM
COUTLINE=BLACK
RAXIS=AXIS1
MAXIS=AXIS2
;
BY REF;
/* -------------------------------------------------------------------
End of task code.
------------------------------------------------------------------- */
RUN; QUIT;
TITLE; FOOTNOTE;
Whatever format I use, the end result is not in a sequential order. Please help.
Your problem is that you're converting the date value to a character variable. MONTH, at least, should be a formatted date variable, not a character variable; so this line:
PUT(DATE,monname3.) AS MONTH,
should be
DATE AS MONTH FORMAT=monname3.,
Most procedures (like PROC MEANS and PROC GPLOT) will respect formats and group by same-formatted values. I don't completely understand why you have 5 month variables all containing different versions of the same thing, so perhaps there are better ways to do what you're doing here.
In particular, if you have SAS 9.2 or later, SGPLOT will probably do this entire process for you without any of the summarization steps.
Apart from what Joe mentioned above, you also need to include the keyword DISCRETE in your VBAR statement if you want to see, on the x-axis, every month in which each reference made a saving (note: this will generate warning messages if some references have no savings in some months).
You could try the following code which I believe produces the output you are after:
PROC SQL;
CREATE TABLE DATA_TO_PLOT AS
SELECT
REF
,INPUT(PUT(date,YYMMN6.),YYMMN6.) FORMAT =DATE9. AS MONTH
,SUM(Savings) AS MONTHLY_SAVINGS
FROM INPUT
GROUP BY 1,2
ORDER BY 1,2 ;
QUIT;
Axis1 STYLE=1 WIDTH=1 MINOR=NONE;
Axis2 STYLE=1 WIDTH=1;
TITLE;
TITLE1 "Bar Chart";
PROC GCHART DATA=DATA_TO_PLOT;
VBAR MONTH
/ SUMVAR=MONTHLY_SAVINGS
CLIPREF
FRAME TYPE=SUM
COUTLINE=BLACK
RAXIS=AXIS1
MAXIS=AXIS2
INSIDE=SUM
DISCRETE
;
FORMAT MONTH MONYY7.;
BY Ref;
RUN; QUIT;
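The first answer mentions that SGPLOT can do the summarization itself. A minimal sketch of that idea follows, assuming the original `INPUT` data set with a numeric SAS date in `Date`. The sorted copy `input_sorted` is a hypothetical name, and the resulting bar order has not been verified here against the asker's data.

```sas
/* Sketch: let SGPLOT sum savings per month, one chart per Ref. */
/* Assumes INPUT holds Ref, Date (numeric SAS date) and Savings. */
proc sort data=input out=input_sorted;   /* BY processing needs sorted input */
  by ref date;
run;

proc sgplot data=input_sorted;
  by ref;
  format date monyy7.;                   /* bars are grouped by the formatted month */
  vbar date / response=savings stat=sum;
run;
```

Because `Date` stays numeric and only its format changes, the months should sort chronologically, which is the same principle as keeping MONTH a formatted date in the GCHART version.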