I have a dataset that has a bunch of addresses, which I sort by address:
PROC SORT DATA=work68;
by ADDRESS ;
run;
However, the ADDRESS column comes out ordered as shown below; the sort seems to consider only the very first digit of the address:
2237 Strang Avenue
2932 Ely Avenue
3306 Wilson Ave
3313 Wilson Avenue
3313 Wilson Avenue
3313 Wilson Avenue
46 Nuvern Avenue
You can use the option SORTSEQ=LINGUISTIC(NUMERIC_COLLATION=ON) to ask SAS to try and sort numeric values as if they were numbers.
PROC SORT DATA=work68 sortseq=linguistic(numeric_collation=on);
by ADDRESS ;
run;
If I understand correctly what you're asking, you could try creating a new address column with all digits removed and sort on that:
data have;
input address $100.;
infile cards truncover;
cards;
1107 Huichton Rd.
1111 Ely Avenue
;
run;
data v_have /view = v_have;
set have;
address_nonumbers = strip(compress(address,,'d'));
run;
proc sort data = v_have out = want;
by address_nonumbers;
run;
PROC SQL can sort on computed values: ORDER BY <computation-1>, …, <computation-N>.
You may want to sort by street name first, and then by the numeric premise identifier (house number). For example:
Data:
data have; input; address=_infile_;datalines;
2237 Strang Avenue
2932 Ely Avenue
3306 Wilson Ave
3313 Wilson Avenue
46 Nuvern Avenue
3313 Ely Avenue
4494 Nuvern Avenue
run;
Sort on street name, then house number
proc sql;
create table want as
select *
from have
order by
compress (address,,'ds') /* ignore digits and spaces - presume to be street name */
, input (scan(address,1),? best12.) /* house number */
;
quit;
This example has simplified presumptions and will not properly sort address constructs such as #### ##th Street
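One possible way to cope with numbered street names is to treat only the first word as the house number and keep everything after it as the street key. This is a rough sketch, not a general address parser: the dataset `keyed` and the variables `first_word`, `house_no` and `street` are made-up names, and it assumes every address starts with the house number. It reuses the SORTSEQ=LINGUISTIC(NUMERIC_COLLATION=ON) option so that digits inside a street name are compared as numbers.

```sas
/* Sketch only: assumes the first blank-delimited word is the house number */
data keyed;
  set have;
  length first_word $50 street $200;
  first_word = scan(address, 1, ' ');
  /* ?? suppresses notes when the first word is not a valid number */
  house_no = input(first_word, ?? best12.);
  /* everything after the first word is presumed to be the street name */
  street = upcase(left(substr(left(address), length(first_word) + 1)));
  drop first_word;
run;

/* numeric collation lets digits embedded in street names sort by value */
proc sort data=keyed out=want2 sortseq=linguistic(numeric_collation=on);
  by street house_no;
run;
```

Spelling variants such as "Ave" and "Avenue" still sort as different streets under this approach.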
Hi, I have two tables whose columns are in different orders and whose column names are not capitalized the same way. How can I check whether the contents of the two tables are the same?
For example, I have two tables of students' grades
table A:
        Math   English   History
-------+------+---------+---------
Tim     98     95        90
Helen   100    92        85
table B:
        history   MATH   english
-------+---------+------+---------
Tim     90        98     95
Helen   85        100    92
You may use either of the two approaches to compare, regardless of the order or column name
/*1. Proc compare*/
proc sort data=A; by name; run;
proc sort data=B; by name; run;
proc compare base=A compare=B;
id name;
run;
/*2. Proc SQL*/
proc sql;
select Math, English, History from A
<union/ intersect/ Except>
select MATH, english, history from B;
quit;
Use EXCEPT CORR (CORRESPONDING), which matches columns by name. If everything matches, you will get zero records.
data have1;
input Math English History;
datalines;
1 2 3
;
run;
data have2;
input English math History;
datalines;
2 1 3
;
run;
proc sql ;
select * from have1
except corr
select * from have2;
quit;
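Note that EXCEPT only returns rows from the first query that are missing from the second, so a single query checks one direction only. A minimal sketch of a two-way check follows; the output table names `only_in_have1` and `only_in_have2` are made up for illustration.

```sas
proc sql;
  /* rows present in have1 but not in have2 */
  create table only_in_have1 as
    select * from have1
    except corr
    select * from have2;
  /* rows present in have2 but not in have1 */
  create table only_in_have2 as
    select * from have2
    except corr
    select * from have1;
quit;
```

If both tables come out empty, the two inputs hold the same set of distinct rows. Because EXCEPT without ALL removes duplicate rows, this does not compare how many times a row occurs.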
Edit 1:
If you want to see which particular column differs, you may have to transpose both tables and compare them, as in the example below.
data have1;
input name $ Math English pyschology History;
datalines;
Tim 98 95 76 90
Helen 100 92 55 85
;
run;
data have2;
input name $ English Math pyschology History;
datalines;
Tim 95 98 76 90
Helen 92 100 99 85
;
run;
proc sort data = have1 out =hav1;
by name;
run;
proc sort data = have2 out =hav2;
by name;
run;
proc transpose data =hav1 out=newhave1 (rename = (_name_= subject
col1=marks));
by name;
run;
proc transpose data =hav2 out=newhave2 (rename = (_name_= subject
col1=marks));
by name;
run;
proc sql;
create table want(drop=mark_dif) as
select
a.name as name
,a.subject as subject
,a.marks as have1_marks
,b.marks as have2_marks
,a.marks -b.marks as mark_dif
from newhave1 a inner join newhave2 b
on upcase(a.name) = upcase(b.name)
and upcase(a.subject) =upcase(b.subject)
where calculated mark_dif ne 0;
quit;
I am analyzing data and need to extract everything before the first space from strings like the ones below. I am using SAS and have tried PRXMATCH, but I am not familiar with it. Thanks!
0518Audible adbl.co/bill NJ 01
06257-ELEVEN CHICAGO IL Purchase $33.30 Cash Back $10.00
0625#03345 JEWEL CHICAGO IL Purchase $58.58 Cash Back $20.00 00
So in my output I need:
0518Audible
06257-ELEVEN
0625#03345
I then need to extract only the first numbers so I get:
0518
06257
0625
Any help is greatly appreciated. Thanks much
Did not work:
TXN_DESCRIPTION_2=prxmatch('/^\d+/', TXN_DESCRIPTION_1);
Use prxchange.
data have;
length string $500.;
string="0518Audible adbl.co/bill NJ 01";output;
string="06257-ELEVEN CHICAGO IL Purchase $33.30 Cash Back $10.00";output;
string="0625#03345 JEWEL CHICAGO IL Purchase $58.58 Cash Back $20.00 00";output;
run;
data want;
set have;
string1=prxchange('s/(^\S+).*/$1/',-1,string);
string2=prxchange('s/(^\d+).*/$1/',-1,string);
run;
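The PRXMATCH attempt in the question fails because PRXMATCH returns the numeric position of the match, not the matched text. If you prefer to keep the match-based approach, the captured text can be read with PRXPOSN. This is a sketch under the assumption that every string starts with digits; `want_prx`, `first_word`, `leading_digits` and `re` are illustrative names.

```sas
data want_prx;
  set have;
  length first_word leading_digits $50;
  retain re;
  /* compile the pattern once: group 1 = first word, group 2 = leading digits */
  if _n_ = 1 then re = prxparse('/^((\d+)\S*)/');
  if prxmatch(re, string) then do;
    first_word     = prxposn(re, 1, string);
    leading_digits = prxposn(re, 2, string);
  end;
  drop re;
run;
```

Unlike the PRXCHANGE version, which returns the input string unchanged when the pattern does not match, this leaves both new variables blank for rows without leading digits.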
SAS also has some simple string functions that can be used, if desired:
data have;
length str $500.;
str="0518Audible adbl.co/bill NJ 01";output;
str="06257-ELEVEN CHICAGO IL Purchase $33.30 Cash Back $10.00";output;
str="0625#03345 JEWEL CHICAGO IL Purchase $58.58 Cash Back $20.00";output;
run;
data want;
set have;
str1=scan(str,1," ");
str2=substr(str,1,notdigit(str)-1);
run;
This is some example data. The real data is more complex, with other fields, about 40,000 observations, and up to 180 values per id (I know that I will get 360 columns in the transposed table, but that's OK):
Data have;
input lastname $ firstname $ value;
datalines;
miller george 47
miller george 45
miller henry 44
miller peter 45
smith peter 42
smith frank 46
;
run;
I want to transpose it this way, so that I have lastname followed by alternating firstname and value columns for every row matching that lastname.
data want:
Lastname  Firstname1  Value1  Firstname2  Value2  Firstname3  Value3  Firstname4  Value4
miller    george      47      george      45      henry       44      peter       45
smith     peter       42      frank       46
I tried a bit with PROC TRANSPOSE, but I was not able to build the table exactly the way I want it, as described above. I need the want table in exactly that layout (the real data is more complex, with other fields), so please no answers that propose a want table with a different layout.
PROC SUMMARY has a very useful option for this, IDGROUP. You need to specify how many values you have per lastname, so I've included a step to calculate the maximum number.
Data have;
input lastname $ firstname $ value;
datalines;
miller george 47
miller george 45
miller henry 44
miller peter 45
smith peter 42
smith frank 46
;
run;
/* get frequency count of lastnames */
proc freq data=have noprint order=freq;
table lastname / out=name_freq;
run;
/* store maximum into a macro variable (first record will be the highest) */
data _null_;
set name_freq (obs=1);
call symputx('max_num',count); /* symputx converts the number and trims blanks */
run;
%put &max_num.;
/* transpose data using proc summary */
proc summary data=have nway;
class lastname;
output out=want (drop=_:)
idgroup(out[&max_num.] (firstname value)=) / autoname;
run;
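As an alternative to the PROC FREQ and DATA _NULL_ steps, the maximum group size could probably be derived in a single PROC SQL step. This is only a sketch; it assumes SAS 9.3 or later for the TRIMMED keyword and the `%put &=` syntax, and the column alias `n` is illustrative.

```sas
proc sql noprint;
  /* largest number of rows sharing one lastname */
  select max(n) into :max_num trimmed
  from (select count(*) as n
        from have
        group by lastname);
quit;
%put &=max_num;
```

The resulting `&max_num.` can then be used in the IDGROUP specification exactly as above.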
I am interested in dividing my data into thirds, but I only have a summary table of counts by state. Specifically, I have estimated enrollment counts by state, and I would like to calculate which states make up the top third of all enrollments. So, the top third should include at least a total cumulative percentage of .33333...
I have tried various means of specifying cumulative percentages between .33333 and .40000, but with no success in specifying the general case. PROC RANK also can't be used because the data is organized as a frequency table...
I have included some dummy (but representative) data below.
data state_counts;
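length state $20;
input; /* read the whole line; the count is the last blank-delimited word */
enrollment = input(scan(_infile_, -1, ' '), 12.);
state = substr(_infile_, 1, length(_infile_) - length(scan(_infile_, -1, ' ')) - 1);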
cards;
CALIFORNIA 440233
TEXAS 318921
NEW YORK 224867
FLORIDA 181517
ILLINOIS 162664
PENNSYLVANIA 155958
OHIO 141083
MICHIGAN 124051
NEW JERSEY 117131
GEORGIA 104351
NORTH CAROLINA 102466
VIRGINIA 93154
MASSACHUSETTS 80688
INDIANA 75784
WASHINGTON 73764
MISSOURI 73083
MARYLAND 73029
WISCONSIN 72443
TENNESSEE 71702
ARIZONA 69662
MINNESOTA 66470
COLORADO 58274
ALABAMA 54453
LOUISIANA 50344
KENTUCKY 49595
CONNECTICUT 47113
SOUTH CAROLINA 46155
OKLAHOMA 43428
OREGON 42039
IOWA 38229
UTAH 36476
KANSAS 36469
MISSISSIPPI 33085
ARKANSAS 32533
NEVADA 27545
NEBRASKA 24571
NEW MEXICO 22485
WEST VIRGINIA 21149
IDAHO 20596
NEW HAMPSHIRE 19121
MAINE 18213
HAWAII 16304
RHODE ISLAND 13802
DELAWARE 12025
MONTANA 11661
SOUTH DAKOTA 11111
VERMONT 10082
ALASKA 9770
NORTH DAKOTA 9614
WYOMING 7457
DIST OF COLUMBIA 6487
;
run;
***** calculating the cumulative frequencies by hand ;
proc sql;
create table dummy_3 as
select
state,
enrollment,
sum(enrollment) as total_enroll,
enrollment / calculated total_enroll as percent_total
from state_counts
order by percent_total desc ;
quit;
data dummy_4; set dummy_3;
if first.percent_total then cum_percent = 0;
cum_percent + percent_total;
run;
Based on the value for cum_percent, the states that make up the top third of enrollment are: California, Texas, New York, Florida, and Illinois.
Is there any way to do this programmatically? I'd eventually like to specify a flag variable for selecting states.
Thanks...
You can easily compute the percentages using PROC FREQ with a WEIGHT statement and then select the states in the first third using the LAG function:
proc freq data=state_counts noprint order=data;
tables state / out=state_counts2;
weight enrollment;
run;
data top3rd;
set state_counts2;
cum_percent+percent;
if lag(cum_percent)<100/3 then top_third=1;
run;
It seems like you're 90% of the way there. If you just need a way to put cum_percent into flagged buckets, setting up a format is pretty straightforward.
proc format;
value pctile
low-0.33333 = 'top third'
0.33333<-.4 = 'next bit'
0.4<-high = 'the rest'
;
run;
options fmtsearch=(work);
And add a statement at the end of your data step:
pctile_flag = put(cum_percent,pctile.);
Rewrite your last data step like this:
data dummy_4(drop=found);
set dummy_3;
retain cum_percent 0 found 0;
cum_percent + percent_total;
if cum_percent < (1/3) then do;
top_third = 1;
end;
else if ^found then do;
top_third = 1;
found =1;
end;
else
top_third = 0;
run;
Note: your first. syntax is incorrect. first. and last. only work with BY groups. You get the right values in CUM_PERCENT by way of the cum_percent + percent_total; statement.
I am not aware of a PROC that will do this for you.
I have a dataset as below:
country
United States, Seattle
United Kingdom, London
How can I split country into two variables in SAS, like this:
country          city
United States    Seattle
United Kingdom   London
Use the SCAN() function with a comma as the separator; STRIP() removes the blank left after the comma.
data test;
set test;
city=strip(scan(country,2,','));
country=scan(country,1,',');
run;
Another option, INFILE magic (google the term for papers on the topic); useful for parsing many variables from one string and/or dealing with quoted fields and such that would be more work with scan.
filename tempfile "c:\temp\test.txt";
data have;
input #1 country $50.;
datalines;
United States, Seattle
United Kingdom, London
;;;;
run;
data want;
set have;
infile tempfile dlm=',' dsd;
input @1 @@;
_infile_=country;
format newcountry city $50.;
input newcountry $ city $ @@;
run;
tempfile can be any file (or one you create on the fly with any character in it to avoid premature EOF).
Response to:
data test;
set test;
city=scan(country,2,',');
country=scan(country,1,',');
run;
What if I want to split at the last comma in the string only, keeping 7410 City?
Example: "Junior 18, Plays Piano, 7410 City
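One possible approach is to locate the last comma with FINDC and its 'b' (backward search) modifier, then cut the string on either side of it. This is a minimal sketch using the example string above; the dataset `want_last` and the variables `text`, `rest`, `last_part` and `last_comma` are illustrative names.

```sas
data want_last;
  length text rest last_part $100;
  text = 'Junior 18, Plays Piano, 7410 City';
  /* the 'b' modifier makes FINDC search from right to left */
  last_comma = findc(text, ',', 'b');
  if last_comma > 1 then do;
    rest      = substr(text, 1, last_comma - 1);    /* Junior 18, Plays Piano */
    last_part = strip(substr(text, last_comma + 1)); /* 7410 City */
  end;
  else rest = text; /* no comma to split on */
run;
```

SCAN(text, -1, ',') would also return the last piece, but it does not give you the part before the last comma in one call.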