Match-Merging Causing Concatenated Data

I am trying to match-merge two data sets by the variable "country". Both data sets contain the country variable (in one it is called "name", which I rename to country) along with other variables; one data set (data1) contains the continent information. However, SAS just concatenates the data sets, that is, it stacks them on top of one another.
I have tried the basics: sorting both data sets by the same BY variable and making sure to use the BY statement when merging.
proc sort data=data1;
by name;
run;
proc sort data=data2;
by country;
run;
data merged_data;
length continent $ 20 country $ 200;
merge data1(rename=(name=country)) data2;
by country;
run;
The result of this code is that the data sets are just stacked on top of one another. My goal is to attach the continent to the country, i.e. identify the continent of each country.
data1:
Continent   Name
Asia        China
Australia   New Zealand
Europe      France
data2:
Country       Var   City
China         1.2   Beijing, China
New Zealand   3.5   Auckland, New Zealand
France        2.8   Paris, France
data I want:
Country       Var   City                    Continent
China         1.2   Beijing, China          Asia
New Zealand   3.5   Auckland, New Zealand   Australia
France        2.8   Paris, France           Europe
data I get:
Country       Var   City                    Continent
China         1.2   Beijing, China
New Zealand   3.5   Auckland, New Zealand
France        2.8   Paris, France
China                                       Asia
New Zealand                                 Australia
France                                      Europe

With my example data, your logic works for me. Maybe your error has to do with your LENGTH statement.
Data Df1;
/* Country is read from columns 1-18; @19 moves the pointer to column 19 for Temp */
INPUT Country $ 1-18 @19 Temp;
datalines;
United States     87
Canada            68
Mexico            88
Russia            77
China             55
;
Run;
Data Df2;
INPUT name $ 1-18 @19 season $;
datalines;
United States     Summer
Canada            Summer
Mexico            Summer
Russia            Winter
China             Winter
;
Run;
Proc sort data=Df1;
by Country;
Run;
Proc sort data=Df2;
by Name;
Run;
Data Merged_data;
merge Df1 Df2(rename=(name=country));
by country;
Run;

Make sure the values of the variables are what you think they are. Print the values with the $QUOTE. format, look at the results in a fixed-width font, and so on.
Perhaps one data set has the actual values you see and the other has a code that a format decodes into the values you see.
If it is not an issue of formatted versus actual values, then perhaps the records in DATA2 have leading spaces.
This program produces the result you are showing. If you remove the leading spaces from COUNTRY in DATA2, the merge works as expected.
data data1 ;
input Continent $13. Name $15.;
cards;
Asia         China
Australia    New Zealand
Europe       France
;
data data2;
input Country $15. Var City $25.;
country=' '||country;
cards;
China          1.2 Beijing, China
New Zealand    3.5 Auckland, New Zealand
France         2.8 Paris, France
;
proc sort data=data1; by name; run;
proc sort data=data2; by country; run;
data want ;
merge data2 data1(rename=(name=country)) ;
by country;
run;
Results:
Obs   Country       Var   City                    Continent
1     China         1.2   Beijing, China
2     France        2.8   Paris, France
3     New Zealand   3.5   Auckland, New Zealand
4     China          .                            Asia
5     France         .                            Europe
6     New Zealand    .                            Australia
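A minimal sketch of the check and the fix described above, assuming the DATA1 and DATA2 data sets from this answer exist and that leading blanks are the only difference between the keys. The names data2_clean and want_fixed are hypothetical.

```sas
/* Write each key to the log in quotes so leading blanks are visible */
data _null_;
  set data2;
  put country= $quote.;
run;

/* Left-align the key to drop leading blanks */
data data2_clean;
  set data2;
  country = left(country);
run;

/* Re-sort on the cleaned key, then merge again */
proc sort data=data2_clean; by country; run;
proc sort data=data1; by name; run;

data want_fixed;
  merge data2_clean data1(rename=(name=country));
  by country;
run;
```

If the quoted values in the log look identical in both data sets, the cause is more likely a format that decodes stored codes, as mentioned above, and LEFT() will not help.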

Related

SAS Programming - removing duplicates

I have the following data based on distance between the cities.
Source Destination Distance
USA UK 1000
USA Spain 200
UK USA 1000
Germany Spain 500
Spain USA 200
I want to remove the duplicates where the same pair of places appears with source and destination swapped. For example, USA to UK is the same as UK to USA, so the duplicate row needs to be removed.
Following is the desired output.
Source Destination Distance
USA UK 1000
USA Spain 200
Germany Spain 500

First create a dummy variable that holds source and destination in sorted order (using call sortc), then de-duplicate on that dummy variable with proc sort nodupkey.
data have;
input Source $ Destination $ Distance;
cards;
USA UK 1000
USA Spain 200
UK USA 1000
Germany Spain 500
Spain USA 200
;
data temp;
set have;
length dummy $50.;
_var1=source; _var2=destination;
call sortc (of _:);
dummy=catx(' ',of _:);
drop _:;
run;
proc sort data=temp out=want(drop=dummy) nodupkey;
by dummy;
run;

You will have to create a dimension/lookup table for all the routes you want, then look up the values to standardise the output.
I created a lookup table called Routes, with a variable (Pairs) that contains both orderings of each pair to look up.
Full Code:
data have;
input Source $ Destination $ Distance ;
datalines;
USA UK 1000
USA Spain 200
UK USA 1000
Germany Spain 500
Spain USA 200
;
run;
data routes;
infile datalines dsd dlm=',';
length pairs $50.;
input Source $ Destination $ Distance Pairs $ ;
datalines;
USA,UK,1000,USA-UK/UK-USA
USA,Spain,200,USA-Spain/Spain-USA
Germany, Spain,500,Germany-Spain/Spain-Germany
;
run;
proc sql;
create table want as
Select distinct
t2.Source, t2.Destination, t2.Distance
from have t1 inner join routes t2 on
t2.Pairs contains catx('-',t1.Source,t1.Destination) or
t2.Pairs contains catx('-',t1.Destination,t1.Source)
;
quit;
Output:
Source=Germany Destination=Spain Distance=500
Source=USA Destination=Spain Distance=200
Source=USA Destination=UK Distance=1000

How to delete all the duplicate observations but add a column with the frequency in SAS?

In a dataset in SAS, I have some observations multiple times. What I am trying to do is: I am trying to add a column with the frequency of each observation and make sure I keep it only one time in my dataset. I have to do this for a dataset with many rows and around 8 variables.
name id address age
jack 2 chicago 50
peter 4 new york 45
jack 2 chicago 50
This would have to become:
name id address age frequency
jack 2 chicago 50 2
peter 4 new york 45 1
Is there anybody who knows how to do this in SAS (preferably without using SQL)?
Thank you a lot!

@kl78 is right, proc summary is the best non-SQL solution here. It runs in memory, which can cause problems with very large datasets, but you should be OK with 8 columns.
class _all_ will group by all the variables and the frequency is output by default, so there's no need to specify any measures. I've dropped the other automatic variable, _type_, as it isn't relevant here and renamed _freq_.
data have;
input name $ id address &$ age;
datalines;
jack 2 chicago  50
peter 4 new york  45
jack 2 chicago  50
;
run;
proc summary data=have nway;
class _all_;
output out=want (drop=_type_ rename=(_freq_=frequency));
run;
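One caveat worth sketching: PROC SUMMARY drops observations that have a missing value in any CLASS variable unless the MISSING option is given. The example below uses a hypothetical data set, have_missing, to show the option; the data and names are assumptions for illustration only.

```sas
/* Hypothetical data: both "peter" rows have a missing age */
data have_missing;
  input name $ id age;
  datalines;
jack 2 50
peter 4 .
peter 4 .
;
run;

/* MISSING keeps rows whose class variables are missing, */
/* so the two "peter" rows collapse into one row with frequency 2 */
proc summary data=have_missing nway missing;
  class _all_;
  output out=want_missing (drop=_type_ rename=(_freq_=frequency));
run;
```

Without MISSING, the "peter" rows would be left out of want_missing entirely.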

SAS Recursive Join

I have a large table of connections, and would like to expand that table to include recursive connections.
My data looks like this --
data city_list;
input from_city $ to_city $;
datalines;
PORTLAND SEATTLE
SEATTLE BOISE
BOISE PORTLAND
PORTLAND HELENA
NYC ORLANDO
ORLANDO MIAMI
;
run;
I'd like to expand the data set to include stopovers, so it ends up looking like this. I'm not concerned about whether I have both a "PORTLAND/SEATTLE" and a "SEATTLE/PORTLAND" record; I can handle those afterwards as necessary.
BOISE HELENA
BOISE PORTLAND
BOISE SEATTLE
NYC MIAMI
NYC ORLANDO
ORLANDO MIAMI
PORTLAND HELENA
PORTLAND SEATTLE
SEATTLE HELENA
I've tried using the following macro, but ran into performance problems when there were too many levels of recursion. I believe the best option would be hash tables, but am not sure how to code this precise scenario.
data city_list;
input from_city $ to_city $;
datalines;
PORTLAND SEATTLE
SEATTLE BOISE
BOISE PORTLAND
PORTLAND HELENA
NYC ORLANDO
ORLANDO MIAMI
;
run;
%macro RecurJoin(
baseTbl,
destTbl,
baseKey,
compKey
);
Proc SQL;
Create Table WORK.RECUR_JOIN_TBL as
SELECT distinct Base.&baseKey, Connect.&compkey
FROM &baseTbl AS Base
INNER JOIN &baseTbl AS Connect
ON (Base.&compkey = Connect.&baseKey)
LEFT JOIN &baseTbl AS Subbase
ON (Base.&baseKey = Subbase.&baseKey) AND
(Connect.&compkey = Subbase.&compkey)
WHERE Subbase.&baseKey IS NULL;
quit;
proc sql noprint;
select count(1) into :connectCnt from RECUR_JOIN_TBL;
quit;
Data &destTbl;
set &baseTbl
RECUR_JOIN_TBL;
run;
Proc DataSets nolist;
Delete RECUR_JOIN_TBL;
Quit;
%if &connectCnt > 0 %then %do;
%RecurJoin(
baseTbl=&destTbl,
destTbl=&destTbl,
baseKey=&baseKey,
compKey=&compKey
);
%end;
%mend;
%RecurJoin(
baseTbl=city_list,
destTbl=FNL_CITY_LIST,
baseKey=from_city,
compKey=to_city
);
Proc Sort data=WORK.FNL_CITY_LIST (where=(NOT(from_city=to_city)));
by from_city to_city;
run;

Memory allowing, you can use the hash-based approach I came up with in this answer to identify the groups of connected cities within your dataset. Then you just need to generate a row for every pair of cities within the same group, which can easily be done via a cartesian join in proc sql.
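The second step (one row per pair of cities in the same group) can be sketched as follows. It assumes the grouping step has already produced a data set with one row per city and a group identifier; the data set city_groups, the variable group_id, and the output name all_pairs are hypothetical, and the group assignments are typed in by hand here rather than computed.

```sas
/* Hypothetical input: one row per city with the id of its connected group */
data city_groups;
  input city $ group_id;
  datalines;
PORTLAND 1
SEATTLE 1
BOISE 1
HELENA 1
NYC 2
ORLANDO 2
MIAMI 2
;
run;

/* Cartesian join of the table with itself, restricted to the same group */
proc sql;
  create table all_pairs as
  select a.city as from_city, b.city as to_city
  from city_groups as a, city_groups as b
  where a.group_id = b.group_id
    and a.city < b.city      /* one row per unordered pair, no self-pairs */
  order by from_city, to_city;
quit;
```

For this sample the join yields nine rows (six pairs in the first group, three in the second). Each pair is stored in alphabetical order, so the direction may differ from the listing in the question, which the question says is acceptable.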

Adding percent % to the proc tabulate report

I have a proc tabulate code as below:
proc tabulate data=want;
class TERM CAMPUS GENDER ;
var count ;
table GENDER ALL, (CAMPUS all)*TERM*(count='#Enrl '*f=best8.*sum=' ' count=''*colpctsum='% Tot Enrl ' ) / rts=20;
run;
and my result is as below
campus
East Campus
Term
          Spring 2014          Spring 2015          Difference
          #Enrl   %Tot_Enrl    #Enrl   %Tot_Enrl    #Enrl   %Tot_Enrl
Gender
Female     8462       52.86     8429       52.36      -33      -37.08
Male       7478       46.71     7608       47.26      130      146.07
None         68        0.42       60        0.37       -8       -8.89
All       16008      100.00    16907      100.00       89         100
I need to add a % sign to the values in the '%Tot_Enrl' columns.
Also, can I remove the campus and term titles? I have both a 'campus' title and an 'East Campus' title, so I need to remove 'Campus' and 'Term'. Is that possible?

You need some more =' ' to get rid of Campus and Term. To get percent sign, you need a format; SAS shows you how here.
data want;
format term $15. campus $15.;
input term $ & campus $ & gender $ count;
datalines;
Spring 2014  East Campus  Female 8462
Spring 2014  East Campus  Male 7478
Spring 2014  East Campus  None 68
Spring 2015  East Campus  Female 8429
Spring 2015  East Campus  Male 7608
Spring 2015  East Campus  None 60
Difference  East Campus  Female -33
Difference  East Campus  Male 130
Difference  East Campus  None -8
;;;;
run;
proc format;
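/* Picture formats drop the minus sign unless a prefix is given for the negative range */
picture mypct(round)
low-<0='000,009.90%' (mult=100 prefix='-')
0-high='000,009.90%' (mult=100);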
run;
proc tabulate data=want;
class TERM CAMPUS GENDER ;
var count ;
table GENDER ALL, (CAMPUS=' ' all)*TERM=' '*(count='#Enrl '*f=best8.*sum=' ' count=''*colpctsum='% Tot Enrl '*f=mypct. ) / rts=20;
run;
The sorting ends up wrong here (I'm guessing you have a formatted numeric for some of those variables), but it gets the idea across.

How to calculate quantile data for table of frequencies in SAS?

I am interested in dividing my data into thirds, but I only have a summary table of counts by a state. Specifically, I have estimated enrollment counts by state, and I would like to calculate what states comprise the top third of all enrollments. So, the top third should include at least a total cumulative percentage of .33333...
I have tried various means of specifying cumulative percentages between .33333 and .40000, but with no success in specifying the general case. PROC RANK also can't be used because the data is organized as a frequency table...
I have included some dummy (but representative) data below.
data state_counts;
input state $20. enrollment;
cards;
CALIFORNIA           440233
TEXAS                318921
NEW YORK             224867
FLORIDA              181517
ILLINOIS             162664
PENNSYLVANIA         155958
OHIO                 141083
MICHIGAN             124051
NEW JERSEY           117131
GEORGIA              104351
NORTH CAROLINA       102466
VIRGINIA             93154
MASSACHUSETTS        80688
INDIANA              75784
WASHINGTON           73764
MISSOURI             73083
MARYLAND             73029
WISCONSIN            72443
TENNESSEE            71702
ARIZONA              69662
MINNESOTA            66470
COLORADO             58274
ALABAMA              54453
LOUISIANA            50344
KENTUCKY             49595
CONNECTICUT          47113
SOUTH CAROLINA       46155
OKLAHOMA             43428
OREGON               42039
IOWA                 38229
UTAH                 36476
KANSAS               36469
MISSISSIPPI          33085
ARKANSAS             32533
NEVADA               27545
NEBRASKA             24571
NEW MEXICO           22485
WEST VIRGINIA        21149
IDAHO                20596
NEW HAMPSHIRE        19121
MAINE                18213
HAWAII               16304
RHODE ISLAND         13802
DELAWARE             12025
MONTANA              11661
SOUTH DAKOTA         11111
VERMONT              10082
ALASKA               9770
NORTH DAKOTA         9614
WYOMING              7457
DIST OF COLUMBIA     6487
;
run;
***** calculating the cumulative frequencies by hand ;
proc sql;
create table dummy_3 as
select
state,
enrollment,
sum(enrollment) as total_enroll,
enrollment / calculated total_enroll as percent_total
from state_counts
order by percent_total desc ;
quit;
data dummy_4; set dummy_3;
if first.percent_total then cum_percent = 0;
cum_percent + percent_total;
run;
Based on the value for cum_percent, the states that make up the top third of enrollment are: California, Texas, New York, Florida, and Illinois.
Is there any way to do this programmatically? I'd eventually like to specify a flag variable for selecting states.
Thanks...

You can easily compute the percentages using PROC FREQ with a WEIGHT statement and then select the states in the first third using the LAG function:
proc freq data=state_counts noprint order=data;
tables state / out=state_counts2;
weight enrollment;
run;
data top3rd;
set state_counts2;
cum_percent+percent;
if lag(cum_percent)<100/3 then top_third=1;
run;
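The ORDER=DATA option keeps the states in input order, so the step above relies on state_counts already being sorted from largest to smallest enrollment, as the sample data is. A hedged variant that does not rely on that is sketched below; the names state_sorted, state_pct and top3rd_flag are hypothetical.

```sas
/* Sort a copy so the largest states come first */
proc sort data=state_counts out=state_sorted;
  by descending enrollment;
run;

/* ORDER=DATA preserves the sorted order in the output data set */
proc freq data=state_sorted noprint order=data;
  tables state / out=state_pct;
  weight enrollment;
run;

data top3rd_flag;
  set state_pct;
  cum_percent + percent;                    /* running total on a 0-100 scale */
  /* 1 while the previous running total is still below one third, else 0; */
  /* on the first row LAG returns missing, which compares as less than 100/3 */
  top_third = (lag(cum_percent) < 100/3);
run;
```

Because the test uses the previous row's running total, the state that crosses the one-third mark is still flagged, which matches the "at least one third" requirement in the question.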

It seems like you're 90% of the way there. If you just need a way to put cum_percent into flagged buckets, setting up a format is pretty straightforward.
proc format;
value pctile
low-0.33333 = 'top third'
0.33333<-.4 = 'next bit'
0.4<-high = 'the rest'
;
run;
options fmtsearch=(work);
And add a statement at the end of your datastep:
pctile_flag = put(cum_percent,pctile.);

Rewrite your last data step like this:
data dummy_4(drop=found);
set dummy_3;
retain cum_percent 0 found 0;
cum_percent + percent_total;
if cum_percent < (1/3) then do;
top_third = 1;
end;
else if ^found then do;
top_third = 1;
found =1;
end;
else
top_third = 0;
run;
Note: your first. syntax is incorrect. FIRST. and LAST. variables only exist for variables named in a BY statement. You still get the right values in CUM_PERCENT because of the sum statement cum_percent + percent_total;.
I am not aware of a PROC that will do this for you.
