I have a large table of connections, and would like to expand that table to include recursive connections.
My data looks like this --
data city_list;
input from_city $ to_city $;
datalines;
PORTLAND SEATTLE
SEATTLE BOISE
BOISE PORTLAND
PORTLAND HELENA
NYC ORLANDO
ORLANDO MIAMI
;
run;
I'd like expand the data set to include stopovers, so it ends up looking like this. I'm not concerned about whether I have both a "PORTLAND/SEATTLE" and a "SEATTLE/PORTLAND" record -- I can handle those afterwards as necessary.
BOISE HELENA
BOISE PORTLAND
BOISE SEATTLE
NYC MIAMI
NYC ORLANDO
ORLANDO MIAMI
PORTLAND HELENA
PORTLAND SEATTLE
SEATTLE HELENA
I've tried using the following macro, but ran into performance problems when there were too many levels of recursion. I believe the best option would be hash tables, but am not sure how to code this precise scenario.
data city_list;
input from_city $ to_city $;
datalines;
PORTLAND SEATTLE
SEATTLE BOISE
BOISE PORTLAND
PORTLAND HELENA
NYC ORLANDO
ORLANDO MIAMI
;
run;
%macro RecurJoin(
baseTbl,
destTbl,
baseKey,
compKey
);
Proc SQL;
Create Table WORK.RECUR_JOIN_TBL as
SELECT distinct Base.&baseKey, Connect.&compkey
FROM &baseTbl AS Base
INNER JOIN &baseTbl AS Connect
ON (Base.&compkey = Connect.&baseKey)
LEFT JOIN &baseTbl AS Subbase
ON (Base.&baseKey = Subbase.&baseKey) AND
(Connect.&compkey = Subbase.&compkey)
WHERE Subbase.&baseKey IS NULL;
quit;
proc sql noprint;
select count(1) into :connectCnt from RECUR_JOIN_TBL;
quit;
Data &destTbl;
set &baseTbl
RECUR_JOIN_TBL;
run;
Proc DataSets nolist;
Delete RECUR_JOIN_TBL;
Quit;
%if &connectCnt > 0 %then %do;
%RecurJoin(
baseTbl=&destTbl,
destTbl=&destTbl,
baseKey=&baseKey,
compKey=&compKey
);
%end;
%mend;
%RecurJoin(
baseTbl=city_list,
destTbl=FNL_CITY_LIST,
baseKey=from_city,
compKey=to_city
);
Proc Sort data=WORK.FNL_CITY_LIST (where=(NOT(from_city=to_city)));
by from_city to_city;
run;
Memory allowing, you can use the hash-based approach I came up with in this answer to identify the groups of connected cities within your dataset. Then you just need to generate a row for every pair of cities within the same group, which can easily be done via a cartesian join in proc sql.
Related
Problem:
I have a dataset with hundreds of variables (columns) and I want to standardize all numeric variables. But instead of center and dividing by just one standard deviation, I need to center and divide all variables by two standard deviations.
This is an example of the dataset I have
data have;
INPUT year $1-4 program_id $6-8 program_name $10-31 enrollments 33-36 admissions 38-41 graduates 43-46;
datalines;
2010 002 Electrical Engineering 1563 0321 0156
2010 001 Civil Engineering 2356 0739 0236
2010 003 Mechanical Engineering 0982 0234 0069
2010 021 English 3945 1034 0269
2010 031 Physics 0459 0134 0069
2010 041 Arts 0234 0072 0045
2019 004 Engineering 4745 1202 0597
2019 022 English Teaching 2788 0887 0201
2019 023 English and Spanish 0751 0345 0092
2019 031 Physics 0589 0126 0039
2019 032 Astronomy 0093 0035 0021
2019 041 Arts 0359 0097 0062
2019 044 Cinema 0293 0100 0039
;
run;
I want two different datasets. In the first, standardization applies for all variables across the whole dataset.
proc sql;
create table want1 as
select *,
(enrollments - mean(enrollments))/(2*STD(enrollments)) as z_enrollments,
(admissions - mean(admissions))/(2*STD(admissions)) as z_admissions,
(graduates - mean(graduates))/(2*STD(graduates)) as z_graduates
from have;
quit;
In the second, standardization is grouped by year:
proc sql;
create table want2 as
select *,
(enrollments - mean(enrollments))/(2*STD(enrollments)) as z_enrollments,
(admissions - mean(admissions))/(2*STD(admissions)) as z_admissions,
(graduates - mean(graduates))/(2*STD(graduates)) as z_graduates
from have
group by year;
quit;
Question: How to do this for all the hundreds of numeric variables of my dataset, without needing to write down the name of each one of them?
What I tried:
As I want this code to be replicable to different datasets, I was trying to follow the reasoning of this other question. That is, first to identify all numeric variables, than to save all variables names into an array and them doing the computations. I thought that perhaps I also need to save the resulting parameters of each column (mean and std) in an array as well. But I still did not get how to make arrays, datasteps and loops to work together.
I started trying to set an array for calculating the number of numerical variables. This runs fine.
data _null_;
set have;
array x[*] _numeric_;
call symput("nVar",dim(x));
stop;
run;
%put Number Variables = &nVar;
Then I tried to adapt the following code - which is a combination of #DomPazz answer with #Tom suggestion in the comments - but it did not work:
data want;
set have nobs=nobs;
array x[&nVar] _numeric_;
array N[&nVar];
n(1)=x(1); do i=2 to dim(n); n(i)=(x(i) - mean(x(i))/(2*(STD(x(i)); end;
keep N:;
run;
I don't know if the above code would get the right result. But I get an error saying that I have the incorrect number of arguments for the STD function. I looked it up: apparently, STD() in datastep runs row-wise, not column-wise.
I also tried PROC STANDARD, I get some results, but they don't match with my calculations. Probably I did not set the parameters right:
proc standard data=have mean=0 std=2
out=want;
run;
You can use the METHED=STD on PROC STDIZE to standardize around the mean and one STD.
So just add the MULT= option to divide by 2.
proc stdize data=have method=STD mult=0.5 out=want;
run;
Answering last comment:
#Tom I was reading the proc stdize documentation, but I could not figure out if I can customize the LOCATION and SCALE measures. For example, if instead of dividing by 2sdt, I want to subtract the mean and divide by the range for all variables. Would it be possible?
Quick solution:
* Output Mean;
proc stdize data=have method=mean out=out1 outstat=mean1;
var _numeric_;
run;
* Output Range;
proc stdize data=have method=range out=out1 outstat=range1;
var _numeric_;
run;
* LOCATION and SCALE;
data scale_location;
set mean1 (where=(_type_='LOCATION')) range1 (where=(_type_='SCALE'));
run;
* Target;
proc stdize data=have method=in(scale_location) out=want;
var _numeric_;
run;
I want to count the number of unique items in a variable (call it "categories") then use that count to set the number of iterations in a SAS macro (i.e., I'd rather not hard code the number of iterations).
I can get a count like this:
proc sql;
select count(*)
from (select DISTINCT categories from myData);
quit;
I can run a macro like this:
%macro superFreq;
%do i=1 %to &iterationVariable;
Proc freq data=myData;
table var&i / out=var&i||freq;
run;
%mend superFreq;
%superFreq
I want to know how to get the count into the iteration variable so that the macro iterates as many times as there are unique values in the variable "categories".
Sorry if this is confusing. Happy to clarify if need be. Thanks in advance.
You can achieve this by using the into clause in proc sql:
proc sql noprint;
select max(age),
max(height),
max(weight)
into :max_age,
:max_height,
:max_weight
from sashelp.class;
quit;
%put &=max_age &=max_height &=max_weight;
Result:
MAX_AGE= 16 MAX_HEIGHT= 72 MAX_WEIGHT= 150
You can also select a list of results into a macro variable by combining the into clause with the separated by clause:
proc sql noprint;
select name into :list_of_names separated by ' ' from sashelp.class;
quit;
%put &=list_of_names;
Result:
LIST_OF_NAMES=Alfred Alice Barbara Carol Henry James Jane Janet Jeffrey John Joyce Judy Louise Mary Philip Robert Ronald Thomas
William
I have a SAS question. I have a large dataset containing unique ID's and a bunch of variables for each year in a time series. Some ID's are present throughout the entire timeseries, some new ID's are added and some old ID's are removed.
ID Year Var3 Var4
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
As can be seen from the table above, ID 1 is present in all three years (2015-2017), ID 2 is present from 2016 and onwards, ID 3 is removed in 2017 and ID 4 is present in 2015, removed in 2016 and then present again in 2017.
I would like to know which ID's are new and which are removed in any given year, whilst keeping all the data. Eg. a new table with indicators for which ID's are new and which are removed. Furthermore, it would be nice to get a frequency of how many ID' are added/removed in a given year and the sum og their "Var3" and "Var4". Do you have any suggestions how to do that?
************* UPDATE ******************
Okay, so I tried the following program:
**** Addition to suggested code ****;
options validvarname=any;
proc sql noprint;
create table years as
select distinct year
from have;
create table ids as
select distinct id
from have;
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
Now this will provide me with a table that only contains the ID's that are new in 2017:
data new_in_17;
set indicators;
where ('2016'n=0) and ('2017'n=1);
run;
I can now merge this table to add var3 and var4:
data new17;
merge new_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
Now I can find the frequence of new ID's in 2017 and the sum of var3 and var4:
proc means data=new17 noprint;
var var3 var4;
where year in (2017);
output out=sum_var_freq_new sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
This gives me the output I need. However, I would like the equivalent output for the ID's that are "gone" between 2016 and 2017 which can be made from:
data gone_in_17;
set indicators;
where ('2016'n=1) and ('2017'n=0);
run;
data gone17;
merge gone_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
proc means data=gone17 noprint;
var var3 var4;
where year in (2016);
output out=sum_var_freq_gone sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
The end result should be a combination of the two tables "sum_var_freq_new" and "sum_var_freq_gone" into one table. Furthermore, I need this table for every new year, so my current approach is very inefficient. Do you guys have any suggestions how to achieve this efficiently?
Aside from a different sample, you didn't provide much extra info from your previous question in order to understand what was lacking in the previous answer.
To build on the latter though, you could use a macro do loop to dynamically account for the distinct year values present in your dataset.
data have;
infile datalines;
input ID year var3 var4;
datalines;
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
;
run;
proc sql noprint;
select distinct year
into :year1-
from have
;
quit;
%macro doWant;
proc sql;
create table want as
select distinct ID
%let i=1;
%do %while(%symexist(year&i.));
,exists(select * from have b where year=&&year&i.. and a.id=b.id) as "&&year&i.."n
%let i=%eval(&i.+1);
%end;
from have a
;
quit;
%mend;
%doWant;
This will produce the following result:
ID 2015 2016 2017
-----------------
1 1 1 1
2 0 1 1
3 1 1 0
4 1 0 1
Here is a more efficient way of doing this and also giving you the summary values.
First a little SQL magic. Create the cross product of years and IDs, then join that to the table you have to create an indicator;
proc sql noprint;
/*All Years*/
create table years as
select distinct year
from have;
/*All IDS*/
create table ids as
select distinct id
from have;
/*All combinations of ID/year*/
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
/*Original data with rows added for missing years. Indicator=1 if it*/
/*existed prior, 0 if not.*/
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
Now transpose that.
proc transpose data=indicators out=indicators(drop=_name_);
by id;
id year;
var indicator;
run;
Create the sums. You could also add other summary stats if you wanted here:
proc summary data=have;
by id;
var var3 var4;
output out=summary sum=;
run;
Merge the indicators and the summary values:
data want;
merge indicators summary(keep=id var3 var4);
by id;
run;
I have the following data:
Year Country Score
---- ------- -----
2007 AU 76
2007 SG 78
2008 AU 56
2008 SG 90
2009 AU 82
2009 SG 48
Suppose I want to show the Score in each country in each year(group with year) by using gplot, such as:
I have tried:
plot Score*(country year);
and
plot country*year=score;
But neither of them can work. I am not familiar with gplot, so how to achieve this?
/* First grab the 2007 year data you want to plot */
PROC SQL;
create table data2007 as
select *
from data_original
where year=2007;
QUIT;
/* Then plot data using the symbol statement*/
symbol interpol=boxt;
proc gplot data=data2007;
plot score*country;
run;
quit;
/* You can also research PROC UNIVARIATE and PROC BOXPLOT to achieve similar results */
If you want to do this by year.... I believe the following will work:
symbol interpol=boxt;
proc gplot data=data2007;
plot score*country;
by year;
run;
quit;
If you wish to have all the year and all the country you could:
PROC SQL;
create table new_data as
select year
, country
, LEFT(TRIM(country) || " _ " || year) as country_year
from data_original
QUIT;
symbol interpol=boxt;
proc gplot data=data2007;
plot score*country_year;
run;
quit;
Be aware of the number of levels to be graphed.
SGPLOT is going to be the easiest way to get this; it's much more powerful than GPLOT in many areas, and nice boxplots is one of them.
This gets pretty close to what you want. You may need to do a few things to get the legend exactly what you want, but it does make separate box plots grouped the way you ask. I threw in some extra data to make the boxplots look realistic.
data have;
input Year Country $ Score ;
datalines;
2007 AU 76
2007 AU 74
2007 AU 78
2007 SG 78
2007 SG 80
2007 SG 76
2008 AU 56
2008 SG 90
2009 AU 82
2009 SG 48
2008 AU 54
2008 AU 58
2008 SG 88
2008 SG 92
2009 AU 78
2009 AU 86
2009 SG 44
2009 SG 52
;;;;
run;
title;
proc sgplot data=have;
vbox score/category=country group=year groupdisplay=stacked; *or reverse category and group depending on your preference;
run;
GPLOT is a bit trickier. The way you get groups in GPLOT is the equal sign, so:
symbol interpol=boxt;
proc gplot data=have;
plot score*country=year;
run;
quit;
But that doesn't look nearly as nice nor does it stack adjacently. I also don't like how hard it is to make them sit in the right spot on the plot.
I am interested in dividing my data into thirds, but I only have a summary table of counts by a state. Specifically, I have estimated enrollment counts by state, and I would like to calculate what states comprise the top third of all enrollments. So, the top third should include at least a total cumulative percentage of .33333...
I have tried various means of specifying cumulative percentages between .33333 and .40000 but with no success in specifying the general case. PROC RANKalso can't be used because the data is organized as a frequency table...
I have included some dummy (but representative) data below.
data state_counts;
input state $20. enrollment;
cards;
CALIFORNIA 440233
TEXAS 318921
NEW YORK 224867
FLORIDA 181517
ILLINOIS 162664
PENNSYLVANIA 155958
OHIO 141083
MICHIGAN 124051
NEW JERSEY 117131
GEORGIA 104351
NORTH CAROLINA 102466
VIRGINIA 93154
MASSACHUSETTS 80688
INDIANA 75784
WASHINGTON 73764
MISSOURI 73083
MARYLAND 73029
WISCONSIN 72443
TENNESSEE 71702
ARIZONA 69662
MINNESOTA 66470
COLORADO 58274
ALABAMA 54453
LOUISIANA 50344
KENTUCKY 49595
CONNECTICUT 47113
SOUTH CAROLINA 46155
OKLAHOMA 43428
OREGON 42039
IOWA 38229
UTAH 36476
KANSAS 36469
MISSISSIPPI 33085
ARKANSAS 32533
NEVADA 27545
NEBRASKA 24571
NEW MEXICO 22485
WEST VIRGINIA 21149
IDAHO 20596
NEW HAMPSHIRE 19121
MAINE 18213
HAWAII 16304
RHODE ISLAND 13802
DELAWARE 12025
MONTANA 11661
SOUTH DAKOTA 11111
VERMONT 10082
ALASKA 9770
NORTH DAKOTA 9614
WYOMING 7457
DIST OF COLUMBIA 6487
;
run;
***** calculating the cumulative frequencies by hand ;
proc sql;
create table dummy_3 as
select
state,
enrollment,
sum(enrollment) as total_enroll,
enrollment / calculated total_enroll as percent_total
from state_counts
order by percent_total desc ;
quit;
data dummy_4; set dummy_3;
if first.percent_total then cum_percent = 0;
cum_percent + percent_total;
run;
Based on the value for cum_percent, the states that make up the top third of enrollment are: California, Texas, New York, Florida, and Illinois.
Is there any way to do this programatically? I'd eventually like to specify a flag variable for selecting states.
Thanks...
You can easily count percentages using PROC FREQ with WEIGHT statement and then select those in the first third using LAG function:
proc freq data=state_counts noprint order=data;
tables state / out=state_counts2;
weight enrollment;
run;
data top3rd;
set state_counts2;
cum_percent+percent;
if lag(cum_percent)<100/3 then top_third=1;
run;
It seems like you're 90% of the way there. If you just need a way to put cum_percent into flagged buckets, setting up a format is pretty straightforward.
proc format;
value pctile
low-0.33333 = 'top third'
0.33333<-.4 = 'next bit'
0.4<-high = 'the rest'
;
run;
options fmtsearch=(work);
And add a statement at the end of your datastep:
pctile_flag = put(cum_percent,pctile.);
Rewrite your last data step like this:
data dummy_4(drop=found);
set dummy_3;
retain cum_percent 0 found 0;
cum_percent + percent_total;
if cum_percent < (1/3) then do;
top_third = 1;
end;
else if ^found then do;
top_third = 1;
found =1;
end;
else
top_third = 0;
run;
note: your first. syntax is incorrect. first. and last. only work on BY groups. You get the right values in CUM_PERCENT by way of the cum_percent + percent_total; statement.
I am not aware of a PROC that will do this for you.