I'm a new SAS user and I'm working on making some code modular. I want to create a macro variable to reference any county name that's formatted within the data, and a second macro variable that removes spaces from the first macro variable to create a resolved macro to be used in a dataset name. Here's where I am:
options symbolgen;
%let county_formatted=CONTRA COSTA CO.;
%let county=%sysfunc(substr(%sysfunc(compress(&county_formatted)), 1, %sysfunc(length(%sysfunc(&county_formatted)))-3))));
run;
options nosymbolgen;
Within the dataset, county names are formatted to look like "CONTRA COSTA CO.". I'd like to take any county name and compress it for the purpose of creating output datasets with the macro resolving to "contracosta" or "CONTRACOSTA". All values for county_formatted have "CO." as the last three characters, and some county names have more than one word, such as "CONTRA COSTA CO.", "SAN LUIS OBISPO CO.", or "SHASTA CO.". I want to land on "CONTRACOSTA", "SANLUISOBISPO", or "SHASTA" for the resolved value of &county.
From the above code, I get the following error:
ERROR: Expected close parenthesis after macro function invocation not found.
I tried building from the inside out and was able to compress CONTRA COSTA CO. as needed to CONTRACOSTACO. but can't seem to remove the last 3 characters.
I'd appreciate any help in correcting my code.
Thanks
This will work:
%let county_formatted=CONTRA COSTA CO.;
%let county_compressed=%sysfunc(compress(&county_formatted.));
%let county=%sysfunc(substr(&county_compressed., 1, %length(&county_compressed.)-3));
%put ***&county.***;
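Your error comes from %sysfunc(length(%sysfunc(&county_formatted)))-3): the inner %sysfunc(&county_formatted) does not call any function, and the parentheses in the statement do not balance. Also, length() there is applied to the raw string, which still contains the spaces, rather than to the compressed string.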
By the way, SAS has %length(), the macro version of %sysfunc(length()).
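As a rough sketch of the same trim done with macro functions only, the following uses %substr and %eval in place of %sysfunc(substr()). The variable names follow the answer above; the expected values in the comments assume the sample input CONTRA COSTA CO.

```sas
%let county_formatted=CONTRA COSTA CO.;
%let county_compressed=%sysfunc(compress(&county_formatted.));

/* Length before and after removing the spaces: 16 and 14 */
%put raw=%length(&county_formatted.) compressed=%length(&county_compressed.);

/* Keep everything except the trailing "CO." (the last 3 characters) */
%let county=%substr(&county_compressed., 1, %eval(%length(&county_compressed.) - 3));
%put county=&county.;   /* CONTRACOSTA */
```

The subtraction has to be based on the compressed length (14), not the raw length (16), otherwise two characters too few are removed.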
You probably don't want to use macro code for the transformation you envision. You didn't go into any detail about how the value is "to be used in a dataset name", so I'll presume you have a data-splitting macro that relies on a data value for the output name.
PRXCHANGE can perform regular expression substitutions.
Your use case of removing spaces and a trailing text would use a regex such as
prxchange('s/ |CO\. *$//', -1, company)
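A quick way to check the pattern is to run it over the three sample names in a small DATA step. This is only a sketch: the dataset name check and the variable names are made up for illustration, and the period is escaped here so that it matches only a literal ".".

```sas
data check;
  length county_formatted $20 county $20;
  do county_formatted = 'CONTRA COSTA CO.', 'SAN LUIS OBISPO CO.', 'SHASTA CO.';
    /* -1 means: apply the substitution as many times as it matches */
    county = prxchange('s/ |CO\. *$//', -1, county_formatted);
    put county_formatted= county=;
    output;
  end;
run;
```

The log should show CONTRACOSTA, SANLUISOBISPO and SHASTA. The "CO" inside COSTA is left alone because the pattern requires the period and then only blanks up to the end of the value.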
data have;
  length company $20 seq sales 8;
  * the & modifier lets company contain single blanks, so two blanks must follow it;
  input company & seq sales;
datalines;
CONTRA COSTA CO.  1 100
CONTRA COSTA CO.  2 90
CONTRA COSTA CO.  3 110
CONTRA COSTA CO.  4 110
CONTRA COSTA CO.  5 120
CONTRA COSTA CO.  6 80
SAN LUIS OBISPO CO.  1 200
SAN LUIS OBISPO CO.  2 210
SAN LUIS OBISPO CO.  3 220
SAN LUIS OBISPO CO.  4 230
SHASTA CO.  1 50
SHASTA CO.  2 150
SHASTA CO.  3 250
;
proc sql;
  create table splitbase1 as
  select
    prxchange('s/ |CO\. *$//', -1, company) as outname length=20
  , *
  from
    have
  ;
* split "have" into sub-tables whose names are specified by outname;
....
Related
Suppose that I was given the following data
ID Birthday Monthly Salary
P222 2 March 1976 9,600
P013 13 June 1955 31,450
S015 12 September 1966 27,500
The ID number starts with a character followed by three digits.
The first character is the abbreviation of the occupation ("P" for Professor, "S" for Staff, etc.).
Consider the following data, denoted by (*) and (**):
(*):
P222 2Mar1976 9,60000
P013 13Jun1955 31,45000
S015 12Sep1966 27,50000
(**):
P222 2Mar1976 $9,6,00
***************
P013 13Jun1955 $31,450
**************
S015 12Sep1966 $27,500
***********
Suppose I have to write SAS programs to read the aforementioned data (*) and (**) respectively to create a temporary SAS data file, called PERSONEL, which contains five variables, namely ID, OCCUPATION, BIRTHDAY, YEAR and SALARY.
By YEAR I mean the year of birth. The variables BIRTHDAY, YEAR and SALARY are numeric, while ID and OCCUPATION are character variables.
For example, the first record should have
ID="P222", OCCUPATION="P", BIRTHDAY=27821, YEAR=1976, SALARY=9600
Is it possible for me to do this WITHOUT using assignment statement?
If you have a fixed-column text file, like your first example:
RULE: ----+----1----+----2----+----3
1311 P222 2Mar1976 9,60000
1312 P013 13Jun1955 31,45000
1313 S015 12Sep1966 27,50000
Then you could read the variables directly from the proper columns.
data want;
  infile 'myfile' truncover;
  * @n moves the column pointer; occupation re-reads column 1, year re-reads columns 12-15;
  input id $ 1-4 occupation $ 1 @7 birthday date9. year 12-15 @16 salary comma12.2 ;
  format birthday date9. salary dollar12.2;
run;
Result:
Obs id occupation birthday year salary
1 P222 P 02MAR1976 1976 $9,600.00
2 P013 P 13JUN1955 1955 $31,450.00
3 S015 S 12SEP1966 1966 $27,500.00
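For a version that can be run without an external file, here is a sketch that reads the same three records from DATALINES. The exact column positions are an assumption reconstructed from the INPUT statement (id in columns 1-4, the date ending in column 15, salary from column 16), since the spacing of the original file is not shown.

```sas
data want;
  infile datalines truncover;
  /* occupation re-reads column 1 and year re-reads columns 12-15, */
  /* so no assignment statement is needed                          */
  input id $ 1-4 occupation $ 1 @7 birthday date9. year 12-15 @16 salary comma12.2;
  format birthday date9. salary dollar12.2;
datalines;
P222   2Mar1976 9,60000
P013  13Jun1955 31,45000
S015  12Sep1966 27,50000
;

proc print data=want;
run;
```

Because the salary text has no decimal point, the .2 in comma12.2 tells SAS to treat the last two digits as decimals, so 9,60000 is read as 9600.00.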
The second version has the values in slightly different positions and an extra line that would need to be skipped.
I have a datafile which uses blank space as delimiter. I want to write a data step to read this file into sas.
The fields are not separated by a single blank; in most cases they are separated by more than 10 blank spaces. I have checked using Notepad++ and the delimiters are not tabs.
137 3.35 Afghanistan 2009-07-08
154 2.43 Albania 2009-07-22
101 1.22 Antigua and Barbuda 2009-06-24
155 4.13 Federated States of Micronesia 2009-07-22
I have tried writing informat statements for these and have been unsuccessful.
Here's what I have done so far
input casedt1id :$3. contntid :4 country :&$32. casedt1 yymmdd10.
This reads only the first field properly and the rest get missing values.
The question is: how do I write an input statement with informats to read this data?
thanks for the help.
regards
jana
You can use the @ symbol to control which column the pointer reads from on the line. It looks like you have a fixed starting column for each variable.
data want;
  * each @n moves the pointer to the starting column of the next field;
  input @1 casedt1id :$3. @14 contntid :4. @28 country :&$32. @61 casedt1 :yymmdd10.;
  format casedt1 yymmdd10.;
datalines;
137          3.35          Afghanistan                      2009-07-08
154          2.43          Albania                          2009-07-22
101          1.22          Antigua and Barbuda              2009-06-24
155          4.13          Federated States of Micronesia   2009-07-22
;
That looks like fixed column data to me. The problem then is using INFORMATs with fixed column data. This should work
input casedt1id $ 1-3 contntid 4-27 country $28-60 casedt1 yymmdd10.;
format casedt1 yymmdd10.;
The trick is to make sure the pointer is in the right place when it tries to read the formatted text. In the statement above that is done by telling it to read through column 60 for COUNTRY, so the pointer is at column 61 when it is time to read the date. You could also use + or @ to move the pointer.
... @61 casedt1 yymmdd10. ...
If you are reading from a variable length file (most files now are variable length) then make sure to add the TRUNCOVER option to the INFILE statement just in case the date is missing or written using fewer than 10 characters.
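Put together, reading the file from disk might look like the sketch below. The path 'myfile.txt' and the dataset name want are placeholders, and the column ranges are the ones assumed in the answer above.

```sas
data want;
  /* TRUNCOVER keeps SAS from jumping to the next line when a record is short */
  infile 'myfile.txt' truncover;   /* hypothetical path */
  input casedt1id $ 1-3
        contntid    4-27
        country   $ 28-60
        @61 casedt1 yymmdd10.;
  format casedt1 yymmdd10.;
run;
```

Without TRUNCOVER, the default FLOWOVER behavior would continue onto the next record whenever the date field is shorter than 10 characters, which silently misaligns the data.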
I have a dataset that looks like this but with many, many more variable pairs:
Stuff2016 Stuff2008 Earth2016 Earth2008 Fire2016 Fire2008
123456 5646743 45 456 456 890101
541351 543534534 45 489 489 74456
352352 564889 98 489489 1231 189
464646 542235423 13 15615 1561 78
987654 4561889 44 1212 12121 111
For each pair of almost identically named variables,
I want SAS to subtract 2016 data - 2008 data without typing the variable names.
What's the easiest way to tell SAS to do this without having to specifically type the variable names? Is there a way to tell it to subtract every other variable minus the one that precedes it without mentioning the specific variable names?
Thanks a lot!!!!
I would probably recommend three arrays but you could do it with one. This highly depends on the order of the variables which isn't a good assumption in my book. Also, how would you name the results automatically?
data want;
  set have;
  array vars(*) stuff2016--fire2008;
  array diffs(*) diffs1-diffs20; *something big enough to hold the differences;
  * step through the pairs: elements 1-2, 3-4, 5-6, ...;
  do i=1 to dim(vars)-1 by 2;
    diffs((i+1)/2) = vars(i)-vars(i+1);
  end;
run;
Instead, I'd highly suggest you use the dictionary tables to query your variable names and dynamically generate your variable lists which are then passed onto three different arrays, one for 2016, one for 2008 and one for the difference. The libname and memname are stored in uppercase in the Dictionary table so keep that in mind.
data have;
input Stuff2016 Stuff2008 Earth2016 Earth2008 Fire2016 Fire2008;
cards;
123456 5646743 45 456 456 890101
541351 543534534 45 489 489 74456
352352 564889 98 489489 1231 189
464646 542235423 13 15615 1561 78
987654 4561889 44 1212 12121 111
;
run;
proc sql;
select name into :var2016 separated by " "
from sashelp.vcolumn
where libname='WORK'
and memname='HAVE'
and name like '%2016'
order by name;
select name into :var2008 separated by " "
from sashelp.vcolumn
where libname='WORK'
and memname='HAVE'
and name like '%2008'
order by name;
select catx("_", compress(name, ,'d'), "diff") into :vardiff separated by " "
from sashelp.vcolumn
where libname='WORK'
and memname='HAVE'
and name like '%2016'
order by name;
quit;
%put &var2016.;
%put &var2008.;
%put &vardiff.;
data want;
set have;
array v2016(*) &var2016;
array v2008(*) &var2008;
array diffs(*) &vardiff;
do i=1 to dim(v2016);
diffs(i)=v2016(i)-v2008(i);
end;
run;
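Because the arrays are matched purely by position, it may be worth confirming that both lists contain the same number of names before running the DATA step. A minimal sketch, with the two lists hard-coded here to the values the queries above should produce for the sample data:

```sas
/* Example values; in practice these come from the PROC SQL step */
%let var2016=Earth2016 Fire2016 Stuff2016;
%let var2008=Earth2008 Fire2008 Stuff2008;

/* Count the blank-delimited names in each list */
%let n2016=%sysfunc(countw(&var2016., %str( )));
%let n2008=%sysfunc(countw(&var2008., %str( )));

%put NOTE: n2016=&n2016. n2008=&n2008.;
```

If the two counts differ, some variable has no partner in the other year and the positional pairing would subtract mismatched columns.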
In a dataset in SAS, I have some observations multiple times. What I am trying to do is: I am trying to add a column with the frequency of each observation and make sure I keep it only one time in my dataset. I have to do this for a dataset with many rows and around 8 variables.
name id address age
jack 2 chicago 50
peter 4 new york 45
jack 2 chicago 50
This would have to become:
name id address age frequency
jack 2 chicago 50 2
peter 4 new york 45 1
Is there anybody who knows how to do this in SAS (preferably without using SQL)?
Thank you a lot!
@kl78 is right, proc summary is the best non-SQL solution here. This runs in memory, which can cause problems with very large datasets, but you should be OK with 8 columns.
class _all_ will group by all the variables and the frequency is output by default, so there's no need to specify any measures. I've dropped the other automatic variable, _type_, as it isn't relevant here and renamed _freq_.
data have;
  input name $ id address &$ age;
datalines;
jack 2 chicago  50
peter 4 new york  45
jack 2 chicago  50
;
run;
proc summary data=have nway;
class _all_;
output out=want (drop=_type_ rename=(_freq_=frequency));
run;
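If memory does become a concern, one alternative is to sort the data and count duplicates in a DATA step with BY-group processing. This is a sketch rather than a drop-in replacement; the dataset name sorted is made up, and the address informat $8. is an assumption that fits the sample values.

```sas
data have;
  /* & lets address contain single blanks, so two blanks must follow it */
  input name $ id address & $8. age;
datalines;
jack 2 chicago  50
peter 4 new york  45
jack 2 chicago  50
;

proc sort data=have out=sorted;
  by name id address age;
run;

data want;
  set sorted;
  by name id address age;
  if first.age then frequency = 0;   /* start of a group of identical rows */
  frequency + 1;                     /* sum statement keeps its value across rows */
  if last.age then output;           /* write one row per group */
run;
```

Listing every variable in the BY statement means first. and last. on the final BY variable mark the boundaries of each set of fully identical rows.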
I am interested in dividing my data into thirds, but I only have a summary table of counts by state. Specifically, I have estimated enrollment counts by state, and I would like to calculate which states comprise the top third of all enrollments. So, the top third should include at least a total cumulative percentage of .33333...
I have tried various means of specifying cumulative percentages between .33333 and .40000 but with no success in specifying the general case. PROC RANK also can't be used because the data is organized as a frequency table...
I have included some dummy (but representative) data below.
data state_counts;
  input state $20. enrollment;
cards;
CALIFORNIA           440233
TEXAS                318921
NEW YORK             224867
FLORIDA              181517
ILLINOIS             162664
PENNSYLVANIA         155958
OHIO                 141083
MICHIGAN             124051
NEW JERSEY           117131
GEORGIA              104351
NORTH CAROLINA       102466
VIRGINIA             93154
MASSACHUSETTS        80688
INDIANA              75784
WASHINGTON           73764
MISSOURI             73083
MARYLAND             73029
WISCONSIN            72443
TENNESSEE            71702
ARIZONA              69662
MINNESOTA            66470
COLORADO             58274
ALABAMA              54453
LOUISIANA            50344
KENTUCKY             49595
CONNECTICUT          47113
SOUTH CAROLINA       46155
OKLAHOMA             43428
OREGON               42039
IOWA                 38229
UTAH                 36476
KANSAS               36469
MISSISSIPPI          33085
ARKANSAS             32533
NEVADA               27545
NEBRASKA             24571
NEW MEXICO           22485
WEST VIRGINIA        21149
IDAHO                20596
NEW HAMPSHIRE        19121
MAINE                18213
HAWAII               16304
RHODE ISLAND         13802
DELAWARE             12025
MONTANA              11661
SOUTH DAKOTA         11111
VERMONT              10082
ALASKA               9770
NORTH DAKOTA         9614
WYOMING              7457
DIST OF COLUMBIA     6487
;
run;
***** calculating the cumulative frequencies by hand ;
proc sql;
create table dummy_3 as
select
state,
enrollment,
sum(enrollment) as total_enroll,
enrollment / calculated total_enroll as percent_total
from state_counts
order by percent_total desc ;
quit;
data dummy_4; set dummy_3;
if first.percent_total then cum_percent = 0;
cum_percent + percent_total;
run;
Based on the value for cum_percent, the states that make up the top third of enrollment are: California, Texas, New York, Florida, and Illinois.
Is there any way to do this programmatically? I'd eventually like to specify a flag variable for selecting states.
Thanks...
You can easily compute the percentages using PROC FREQ with a WEIGHT statement and then select the states in the first third using the LAG function:
proc freq data=state_counts noprint order=data;
tables state / out=state_counts2;
weight enrollment;
run;
data top3rd;
set state_counts2;
cum_percent+percent;
if lag(cum_percent)<100/3 then top_third=1;
run;
It seems like you're 90% of the way there. If you just need a way to put cum_percent into flagged buckets, setting up a format is pretty straightforward.
proc format;
value pctile
low-0.33333 = 'top third'
0.33333<-.4 = 'next bit'
0.4<-high = 'the rest'
;
run;
options fmtsearch=(work);
And add a statement at the end of your datastep:
pctile_flag = put(cum_percent,pctile.);
Rewrite your last data step like this:
data dummy_4(drop=found);
set dummy_3;
retain cum_percent 0 found 0;
cum_percent + percent_total;
if cum_percent < (1/3) then do;
top_third = 1;
end;
else if ^found then do;
top_third = 1;
found =1;
end;
else
top_third = 0;
run;
Note: your first. syntax is incorrect. first. and last. only work with BY groups. You get the right values in CUM_PERCENT by way of the cum_percent + percent_total; statement.
I am not aware of a PROC that will do this for you.