How many observations in the output dataset? - sas

A raw data file is listed below:
RANCH,1250,2,1,Sheppard Avenue, "$64,000"
SPLIT,1190,1,1,Rand Street, "$65,850"
CONDON, 1400,2,1,Market Street, "80,050"
TWOSTORY, 1810,4,3,Garris Street, "$107,250"
RANCH, 1500,3,3,Kemble Avenue, "$86,650"
SPLIT, 1615, 4,3, West Drive, "94,450"
SPLIT, 1305, 3,1.5,Graham Avenue, "$73,650"
The following is the code:
data work.condo_ranch;
infile "file_specification" dsd;
input style $ @;
if style = 'CONDO' or style = 'RANCH' then
input sqfeet bedrooms baths street $ price: dollar10.;
run;
I think the output dataset contains 3 observations, but the correct answer is that it contains 7. Can anyone tell me why? Many thanks for your time and attention.

Why would you expect the output dataset to have only 3 observations? There is an implied OUTPUT statement at the bottom of the DATA step, so every iteration writes an observation whether or not the second INPUT runs. If you want to output only those records where STYLE IN ("CONDO","RANCH") you could add a conditional OUTPUT, e.g.:
if style = 'CONDO' or style = 'RANCH' then do;
input sqfeet bedrooms baths street $ price: dollar10.;
output;
end;

If you only want to output the records where style is CONDO or RANCH you could just change your THEN to a semi-colon. That would make your IF statement a subsetting IF. So the data step would return at that point and never run the second INPUT or the implied OUTPUT at the end of the step.
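A minimal sketch of the subsetting-IF variant described above. It uses in-stream DATALINES instead of the external file, keeps only four of the sample rows, and reads street with a $20. informat; all three are assumptions made so the step runs on its own.

```sas
data work.condo_ranch;
  infile datalines dsd;
  input style $ @;                         /* read STYLE and hold the line */
  if style = 'CONDO' or style = 'RANCH';   /* subsetting IF: other rows stop here */
  input sqfeet bedrooms baths street : $20. price : dollar10.;
datalines;
RANCH,1250,2,1,Sheppard Avenue,"$64,000"
SPLIT,1190,1,1,Rand Street,"$65,850"
TWOSTORY,1810,4,3,Garris Street,"$107,250"
RANCH,1500,3,3,Kemble Avenue,"$86,650"
;
run;
```

With these four rows only the two RANCH records reach the implied OUTPUT, so the dataset has 2 observations rather than 4.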

Related

SAS replacing string with string values within a table

I have this CSV dataset named Cars:
Brand, Model, Description, Year, Price, Sale
Toyota, Wish, "1.5, used""", 2018, 120000, 0
Lexus, LX300, "2.0 4wd, leather seat 15"", 2021, 23000, 0
Toyota, Innova, "2.0, 7 seater 4wd, "full spec 12""", 2007, 5000, 0
Honda, CRV, "2.5, 4wd", 2021, 11500, 0
Nissan, Serena, "7 seater, hybrid, used", 2019, 14400, 0
Hyundai, Elantra, "5 seater, turbo used", 2020, 13210, 0
I tried to replace , and " under description so that SAS can read it correctly.
FILENAME cars '....Cars.csv';
data cars_out;
infile cars dlm=',' firstobs=2 truncover;
format Brand $7. Model $7. Description $334. Year 4. Price 5. Sale 1.;
input @;
Year= translate(Year,' ','",');
input Brand Model Description Year Price Sale;
run;
But this doesn't work. Any clue why?
Your csv file is invalid. It has unbalanced quotation marks and embedded commas, causing a problem. If you open the file up in Excel, you will see it is invalid.
After manually fixing the file within Excel, the correct .csv should look as such:
Data:
Brand, Model, Description, Year, Price, Sale
Toyota, Wish," ""1.5 used""""""",2018,120000,0
Lexus, LX300," ""2.0 4wd leather seat 15""""",2021,23000,0
Toyota, Innova," ""2.0, 7 seater 4wd ""full spec 12""""""",2007,5000,0
Honda, CRV," ""2.5, 4wd""",2021,11500,0
Nissan, Serena," ""7 seater, hybrid used""",2019,14400,0
Hyundai, Elantra," ""5 seater, turbo used""",2020,13210,0
After balancing the quotes, SAS can read this correctly using your provided code. The only change I'd make is to increase the formatted length of price to 8 so it is not shown in scientific notation.
Your approach cannot work. You cannot change the value of YEAR before you have read anything into YEAR. You need to fix the text in the file for SAS (or anyone) to be able to parse it.
Best would be to re-create the file from the original data using proper quoting to begin with. When generating a CSV (or any delimited file) you should add quotes around values that contain the delimiter or quotes. Any actual quotes in the value should be doubled to make clear they are not the closing quote. SAS will do this automatically when you use the DSD option on the FILE statement.
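As a rough sketch of that last point, the step below writes one row with the DSD option on the FILE statement. The dataset name cars_clean and the output file name cars_fixed.csv are hypothetical, and the single row is just one of the sample records with its description typed in as plain text.

```sas
/* Hypothetical one-row dataset; the names here are assumptions */
data cars_clean;
  length Brand Model $20 Description $400;
  Brand = 'Toyota';
  Model = 'Innova';
  Description = '2.0,7 seater 4wd,"full spec 12"';
  Year = 2007;
  Price = 5000;
  Sale = 0;
run;

data _null_;
  set cars_clean;
  file 'cars_fixed.csv' dsd;   /* DSD: comma delimiter, values quoted when needed */
  if _n_ = 1 then put 'Brand,Model,Description,Year,Price,Sale';
  put Brand Model Description Year Price Sale;
run;
```

The description contains both commas and quotes, so it should be written enclosed in quotes with the inner quotes doubled, which is the form SAS can read back with DSD on the INFILE statement.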
With your example you might be able to peel off the beginning and the end since they do not appear to have issues. Then whatever is left is the value of DESCRIPTION.
data want;
infile cars dlm=',' truncover firstobs=2;
length brand model $20 description $400 year price sale 8 dummy $400;
input brand model @;
nwords=countw(_infile_,',');
do i=1 to nwords-5;
input dummy @;
description=catx(',',description,dummy);
end;
input year price sale;
drop i dummy;
run;
You could probably clean things up more:
if '""'=char(description,1)||char(description,length(description)) then do;
description=substrn(description,2,length(description)-2);
description=tranwrd(description,'""','"');
if countc(description,'"')=1 then description=compress(description,'"');
end;
So with that cleaned up version your source file SHOULD have looked like this for it to be properly parsed as a comma delimited file.
Brand,Model,Description,Year,Price,Sale
Toyota,Wish,"1.5,used",2018,120000,0
Lexus,LX300,"2.0 4wd,leather seat 15",2021,23000,0
Toyota,Innova,"2.0,7 seater 4wd,""full spec 12""",2007,5000,0
Honda,CRV,"2.5,4wd",2021,11500,0
Nissan,Serena,"7 seater,hybrid,used",2019,14400,0
Hyundai,Elantra,"5 seater,turbo used",2020,13210,0

Automating creation of an indicator variable in SAS

I am working with a SAS dataset that includes up to 30 medications prescribed to an individual patient. The medications are coded med1, med2 ... med30. Each medication is represented by a 5-digit character variable. Using the identifier, I can then code the name of the drug, and whether that particular medication is a topical antibiotic or a systemic antibiotic.
For each patient, I want to use all 30 medication codes to create one variable indicating whether the patient got a topical antibiotic only, a systemic antibiotic only, or both a topical and an oral antibiotic. So if any of the 30 medications is a systemic antibiotic, I want the patient coded as oral_antibiotic=1.
I currently have this code:
data want;
set have;
array meds[30] med1-med30;
if meds[i] in ('06925' '06920') then do;
penicillin=1;
oral_antibiotic=1;
end;
else if meds[i] in ('03197') then do;
neosporin=1;
topical_antibiotic=1;
end;
.... (many more do loops with many more medications)
run;
The problem is that this code creates one indicator variable instead of 30, overwriting previous information.
I think that I really need 30 indicator variables, indicating whether each of the 30 drugs is an oral or topical antibiotic, before I write code that says if any of the drugs are oral antibiotics, the patient received an oral antibiotic.
I am new to macros and would really appreciate help.
data current;
input med1 med2 med3;
cards;
'06925' '06920' '03197' ;
run;
And I want this:
data want;
input med1 topical_antibiotic1 oral_antibiotic1 med2 topical_antibiotic2 oral_antibiotic2 med3 topical_antibiotic3 oral_antibiotic3;
cards;
'06925' 0 1 '06920' 0 1 '03197' 1 0
;
run;
I think that I really need 30 indicator variables, indicating whether
each of the 30 drugs is an oral or topical antibiotic, before I write
code that says if any of the drugs are oral antibiotics, the patient
received an oral antibiotic.
That's not true. Your current approach is fine as long as you're not resetting them. You don't show us the full code, so it's hard to say, but I'm going to assume that's what is happening here.
Your loop should look like:
array med(30) med1-med30;
*set to 0 at top of the loop;
topical_antibiotic=0; oral_antibiotic=0;
do i=1 to dim(med);
if med(i) in (.....) /*list of topical codes*/ then topical_antibiotic=1;
else if med(i) in (.....) /*list of oral codes*/ then oral_antibiotic=1;
end;
This assumes that an antibiotic cannot be in both Topical/Oral groups. If it can, you need to remove the ELSE from the second IF statement.
I agree that you probably only need one indicator variable for each drug group, (medication of interest). Seems like you just want to know for each subject, "Do they have it?" This example flips the arguments of the IN operator. If you had given more example data I could have done better with this example.
data current;
infile cards missover;
array med[3] $5;
input med[*];
oral_antibiotic = '069' in: med; /*Assume oral all start with '069'*/
topical_antibiotic = '03197' in med;
cards;
06925 06920 03197
06925
;;;;
run;

Base SAS 9.2-- SUBSTR function to classify Zip Codes into regions

I have a variable full of ZIP code observations and I want to sort those ZIP codes into four regions based on the first three digits of the code.
For example, all ZIP codes that start with 350, 351, or 352 should be grouped into a region called "central." Those that start with 362, 368, 360 or 361 should be in a region called "east." Etc.
How do I get base SAS to look at only the first three digits of the ZIP code variable?
What is the best way to associate those digits with a new variable called "region?"
Here's the code I have so far:
data work.temp;
set library.dataset;
a= substr (Zip_Code,1,3);
put a;
keep Zip_Code a;
run;
proc print data=work.temp;
run;
The column a is blank in my proc print results, however.
Thanks for your help
As @Joe explains, this is due to the ZIP code being defined as a numeric variable. I have seen this happen at a client location where the ZIP code was defined as numeric, and it led to various data issues. You should define the ZIP code as a character variable; then you can assign regions with IF statements, a reference table, or PROC FORMAT. Below are examples of the IF statement and reference table approaches. I find the reference table method very robust.
data have;
input zip_code $;
datalines;
35099
35167
35245
36278
36899
36167
;
By if statement
data work.temp;
set have;
if substr(Zip_Code,1,3) in ('350', '351', '352') then Region = 'EAST';
if substr(Zip_Code,1,3) in ('362', '368', '361') then Region = 'WEST';
run;
By use of reference table
data reference;
input code $ Region $;
datalines;
350 EAST
351 EAST
352 EAST
362 WEST
368 WEST
361 WEST
;
proc sql;
select a.*, b.region from have a
left join
reference b
on substr(a.Zip_Code,1,3) = b.code;
quit;
If a is blank, then your zip_code variable is almost certainly numeric. You probably have a note about numeric to character conversion.
SAS will happily allow you to ignore numeric and character in most instances, but it won't always give correct behavior. In this case, it's probably converting it with the BEST12. format, meaning 60601 becomes "       60601" (right-aligned in 12 characters). So substr(that,1,3) gives "   ", of course.
Zip code ideally would be stored in a character variable as it's an identifier, but if it's not for whatever reason, you can do this:
a = substr(put(zip_code,z5.),1,3);
The Zw.d format is correct since you want Massachusetts to be "02101" and not "2101 ".
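PROC FORMAT is mentioned above as another option but not shown. A minimal sketch, assuming Zip_Code is numeric and using only the two regions named in the question; the format name $region, the variable zip3, and the 'unknown' fallback label are assumptions.

```sas
proc format;
  value $region
    '350','351','352'       = 'central'
    '360','361','362','368' = 'east'
    other                   = 'unknown';   /* fallback label is an assumption */
run;

data work.temp;
  set library.dataset;
  length zip3 $3 region $8;
  zip3   = substr(put(Zip_Code, z5.), 1, 3);   /* z5. keeps leading zeros */
  region = put(zip3, $region.);                /* look up the region label */
run;
```

Keeping the prefix-to-region mapping in a format means new prefixes are added in one place, without touching the DATA step.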

Q: SAS Column Input Truncating

I am taking an intro to SAS course and am having serious issues reading some data in.
The code I have thus far (the assignment tells us to use column input) is as follows:
DATA shirtCol;
INPUT Name $ 1-6 Color $ 8-13 Price 15-19 ShippingCost 21-24;
DATALINES;
 Large  Red    18.97 0.25
 Medium Blue   24.68 1.10
 XLarge Black  29.99 1.75
 Small  Orange 15.89 0.50
;
RUN;
PROC print data=shirtCol;
RUN;
I am using SAS university edition to run this code and when I run the program the Price and Shipping column only have one number after the decimal point. Is there anything I am doing wrong? How can I make it so the program no longer truncates my output?
It seems that SAS Studio counts the blank spaces as columns. Try removing the leading blanks before your input:
DATA shirtCol;
INPUT Name $ 1-6 Color $ 8-13 Price 15-19 ShippingCost 21-24;
DATALINES;
Large  Red    18.97 0.25
Medium Blue   24.68 1.10
XLarge Black  29.99 1.75
Small  Orange 15.89 0.50
;
RUN;
Count the number of characters in your data lines to see why it is not reading what you want. Either correct the column numbers in your INPUT statement or remove the extra blanks you seem to have added to every line. Inserting a ruler line can help.
DATA shirtCol;
INPUT Name $ 1-6 Color $ 8-13 Price 15-19 ShippingCost 21-24;
*---+---10----+---20----+---30;
DATALINES;
Large  Red    18.97 0.25
Medium Blue   24.68 1.10
XLarge Black  29.99 1.75
Small  Orange 15.89 0.50
;
RUN;
Using a consistent indentation style can avoid this type of issue. I never indent the DATALINES; statement. Actually I never use DATALINES; statements, but I never indent CARDS; statements and I use them a lot. :)
If that doesn't work then make sure you haven't introduced tabs to replace spaces in your data. I think that SAS/Studio might have trouble with actually storing the tabs in the code lines that it sends to SAS to execute. You can add an INFILE statement with the EXPANDTABS option to have SAS expand the tabs into spaces for you. While you are at it you might want to add the TRUNCOVER option also.
infile datalines expandtabs truncover ;
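For context, a sketch of where that INFILE statement would sit in the full step. It assumes the data lines start in column 1 and are aligned to the column ranges in the INPUT statement.

```sas
data shirtCol;
  infile datalines expandtabs truncover;   /* expand tabs, tolerate short lines */
  input Name $ 1-6 Color $ 8-13 Price 15-19 ShippingCost 21-24;
datalines;
Large  Red    18.97 0.25
Medium Blue   24.68 1.10
XLarge Black  29.99 1.75
Small  Orange 15.89 0.50
;
run;

proc print data=shirtCol;
run;
```

The INFILE statement has to come before the INPUT statement so that its options apply when the in-stream lines are read.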

TRANWRD to fix Merge error?

I recently combined two datasets with a pretty straightforward Merge statement. I was using an ACS dataset and a Census population dataset. I needed a flag from the latter to be in the former. When I merged, the place variable (town/county, state) was not de-duplicated because one dataset used state abbreviations while the other used the full spelling:
Obs GeoID GeoName
1 . Abbeville County, SC
2 45001 Abbeville County, South Carolina
I need to change the GeoName for Obs1 so that it equals Obs2
Would an index function work? Or do I need the TRANWRD function? Thanks.
What I would do is separate the city from the state and use SAS's inbuilt function stnamel to convert the abbreviation to the full name.
data _null_;
length geoName $100;
GeoName='Abbeville Road, SC';
GeoName_C = scan(GeoName,1,',');
GeoName_S = scan(GeoName,-1,','); *-1 scans from the right in case you could have commas in the city - check for this and adjust GeoName_C to include them if it is possible;
GeoName_S_F = stnamel(strip(GeoName_S));
GeoName = catx(',',GeoName_C,GeoName_S_F);
put _all_;
run;