How can I read in data with uneven spacings? - sas

I have data which doesn't appear to have consistent spacings or positioning. It looks like:
1675 C Street , Suite 201
Anchorage AK 99501
61.205475 -149.886882
600 Azalea Road
Mobile AL 36609
30.656824 -88.148781
1601 Harbor Bay Parkway , Suite 150
Alameda CA 94502
37.726114 -122.240546
1900 Point West Way, Suite 270
Sacramento CA 95815
38.5994175 -121.4315844
3600 Wilshire Blvd., Suite 1500
Los Angeles CA 90010
34.06153 -118.303463
From this I'd like to extract the street address, city name, state, zip code, lat, and long. I thought the following code would work, but it produces very weird results.
data voa;
input Address $50.;
input City $ State $ Zip;
input Latitude Longitude;
datalines;
I think the issue comes from the fact that there isn't consistent spacing or positioning of the elements.

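Your data will work fine with LIST input. You just need to add the & modifier (the "look for double delimiter" option, which ends a field only at two or more consecutive blanks) to CITY, and CITY needs to be a bit longer, $16 or so. This relies on the city and state being separated by at least two blanks in the real file.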
input City &$16. State $ Zip;
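For reference, a minimal sketch of the whole step with that change. It assumes the real file has at least two blanks between the city and the state (the spacing in the data lines below is my assumption, since the spacing shown in the question may not reflect the real file), and TRUNCOVER is added only as a safety net for short lines.

```sas
data voa;
  infile datalines truncover;
  input Address $50.;
  /* & lets City contain single blanks; the field ends at two or more blanks */
  input City & $16. State $ Zip;
  input Latitude Longitude;
datalines;
1675 C Street , Suite 201
Anchorage  AK  99501
61.205475 -149.886882
3600 Wilshire Blvd., Suite 1500
Los Angeles  CA  90010
34.06153 -118.303463
;
run;
```

If the city and state are separated by only one blank, & will not stop at the end of the city name, so this approach would not apply.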

In the absence of consistent delimiters or fixed width fields, this is easier to do using scan:
data want;
infile cards truncover;
length STATE $2 CITY $32;
input Address $50.;
input;
ZIP = input(scan(_INFILE_, -1),5.);
STATE = scan(_INFILE_, -2);
CITY = trim(substr(_INFILE_,1,index(_INFILE_,STATE) - 1));
input Latitude Longitude;
cards;
1675 C Street , Suite 201
Anchorage AK 99501
61.205475 -149.886882
600 Azalea Road
Mobile AL 36609
30.656824 -88.148781
1601 Harbor Bay Parkway , Suite 150
Alameda CA 94502
37.726114 -122.240546
1900 Point West Way, Suite 270
Sacramento CA 95815
38.5994175 -121.4315844
3600 Wilshire Blvd., Suite 1500
Los Angeles CA 90010
34.06153 -118.303463
;
run;
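One caveat with the INDEX call: it finds the first place the two state letters occur anywhere in the line, so a city name that happens to contain them in the same case would be cut short. A possible variant, sketched here under the same assumptions about the layout, uses CALL SCAN to get the starting position of the second-to-last word instead. The variables pos and len are helper names I made up.

```sas
data want;
  infile cards truncover;
  length ADDRESS $50 CITY $32 STATE $2;
  input ADDRESS $50.;
  input;                                   /* load the city/state/zip line into _INFILE_ */
  ZIP   = input(scan(_INFILE_, -1, ' '), 5.);
  STATE = scan(_INFILE_, -2, ' ');
  /* pos = column where the second-to-last word (the state) starts */
  call scan(_INFILE_, -2, pos, len, ' ');
  CITY  = substr(_INFILE_, 1, pos - 1);    /* everything before the state */
  input Latitude Longitude;
  drop pos len;
cards;
600 Azalea Road
Mobile AL 36609
30.656824 -88.148781
3600 Wilshire Blvd., Suite 1500
Los Angeles CA 90010
34.06153 -118.303463
;
run;
```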

Related

Query on mixed input style SAS data. This code is not giving me proper output

DATA nationalparks;
INPUT @1 ParkName $ 1-21 @23 State $ Year @40 Acreage COMMA9.;
DATALINES;
Yellowstone ID/MT/WY 1872 4,065,493
Everglades FL 1934 1,398,800
Yosemite CA 1864 760,91
Great Smoky Mountains NC/TN 1926 520,269
;
RUN;
This SAS code is not showing proper result set.
Your program looks fine to me if the data really is in the columns your code is using.
Make sure you haven't indented the datalines. Also check that your editor hasn't replaced spaces with TAB characters. Make sure that STATE and YEAR always have a value and that the value does not include any spaces. You can use . to mark missing values, even for the character variable STATE. Alternatively, you can read STATE and YEAR using columns, and then blanks will be treated as missing. There is no need to use formatted input for the last variable: if you add the : modifier, SAS will use list mode and adjust the width used by the informat to match the width of the next value on the line. But again, if ACREAGE is missing, use . to mark that, or add an INFILE DATALINES TRUNCOVER; statement before the INPUT statement.
DATA nationalparks;
INPUT ParkName $ 1-21 State $ 23-30 Year 32-35 Acreage :COMMA.;
*---+----10---+----20---+----30---+----40---+----50;
DATALINES;
Yellowstone           ID/MT/WY 1872 4,065,493
Everglades            FL       1934 1,398,800
Yosemite              CA       1864 760,91
Great Smoky Mountains NC/TN    1926 520,269
;
Results:
Obs    ParkName                 State       Year    Acreage
1      Yellowstone              ID/MT/WY    1872    4065493
2      Everglades               FL          1934    1398800
3      Yosemite                 CA          1864      76091
4      Great Smoky Mountains    NC/TN       1926     520269
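To illustrate the point about missing values, here is a small sketch of the same step with INFILE DATALINES TRUNCOVER added. The data lines are altered for illustration (I blanked out one STATE and dropped one ACREAGE; those gaps are not in the original data), and the columns are assumed to line up with the positions used in the INPUT statement.

```sas
data nationalparks;
  /* TRUNCOVER: if a line ends early, set the remaining variables to missing
     instead of jumping to the next line to look for a value */
  infile datalines truncover;
  input ParkName $ 1-21 State $ 23-30 Year 32-35 Acreage :comma.;
datalines;
Yellowstone           ID/MT/WY 1872 4,065,493
Everglades            FL       1934
Yosemite                       1864 760,91
;
run;

proc print data=nationalparks;
run;
```

With column input, the blank STATE on the last line should simply come through as missing, and with TRUNCOVER the absent ACREAGE on the second line should be missing rather than being read from the following line.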

Need a suggestion on how to merge datasets with similar but not identical joint id

So what I have is a data set that has each city and state in one column. The other data set also has city and state in one column BUT some of the cities are combined. For example:
Data set one will have:
CITY STATE POPULATION
Cape Coral Fl 1000000
Fort Myers FL 2000000
Gainesville FL 100000
Data set two will have:
CITY STATE EMPLOYMENT
Cape Coral - Fort Myers FL 900
Gainesville FL 1000
I thought about doing a "fuzzy" match, but then for the hyphenated cities I won't get the full population. I could try to break up the hyphenated cities and then divide the employment in half, but I don't know how to do that.
I am hoping there is an easier solution out there that I haven't thought of. I went ahead and did a traditional merge by CITY STATE, but it only matched half of my data set.
Thanks in advance!
The second data set can be broken into more rows if you make some assumptions, such as that the component cities are separated by a dash (-) and that the state is always the last piece.
data two;
length city_state $100;
input CITY_STATE & EMPLOYMENT;
datalines;
Cape Coral - Fort Myers FL  900
Gainesville FL  1000
run;
data two_b;
length city_state_item $100;
set two;
state = scan (city_state, -1, ' ');
p = find (city_state, trim(state), -101);
city_state_base = substr(city_state,1,p-1);
do _n_ = 1 by 1 while (scan(city_state_base,_n_,'-') ne '');
city_state_item = catx (' ', scan(city_state_base,_n_,'-'), state);
OUTPUT;
employment = 0;
end;
drop p city_state_base state;
run;
After splitting you will have to match ONE.city_state to TWO_B.city_state_item and deal with how employment will be split or not split depending on how the matched data get re-aggregated or used to compute some employment to population ratio.
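As a rough sketch of that matching step, assuming data set ONE holds the city and state together in a variable named city_state (my assumption about its layout) and TWO_B is the output of the step above. The comparison goes through UPCASE because the sample data has both Fl and FL. The table name matched is made up.

```sas
data one;
  length city_state $100;
  /* & needs two or more blanks between the city/state text and the number */
  input city_state & population;
datalines;
Cape Coral Fl  1000000
Fort Myers FL  2000000
Gainesville FL  100000
;
run;

proc sql;
  create table matched as
  select o.city_state,
         o.population,
         t.city_state as combined_city_state,
         t.employment
  from one as o
  left join two_b as t
    on upcase(o.city_state) = upcase(t.city_state_item);
quit;
```

Because TWO_B keeps the full employment on the first split row and 0 on the others, summing employment and population by combined_city_state afterwards would give totals for the combined area.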
Making some assumptions this solution could work:
data a;
length city_state $100;
input CITY_STATE & POPULATION;
datalines;
Cape Coral Fl  1000000
Fort Myers FL  2000000
Gainesville FL  100000
run;
data b;
length city_state $100;
input CITY_STATE & EMPLOYMENT;
datalines;
Cape Coral - Fort Myers FL  900
Gainesville FL  1000
Run;
Proc sql;
select a.city_state, b.city_state, a.population,
case when b.city_state contains '-' then b.EMPLOYMENT /2 else b.EMPLOYMENT End as EMPLOYMENT
from a
inner join b
on b.city_state contains substr(a.city_state,1,length(a.city_state)-length(scan(a.city_state,-1,' ')));
quit;
result:
city_state | city_state |POPULATION |EMPLOYMENT
------------------------------------------------------------------------
Cape Coral Fl | Cape Coral - Fort Myers FL | 1000000 | 450
Fort Myers FL | Cape Coral - Fort Myers FL | 2000000 | 450
Gainesville FL | Gainesville FL | 100000 | 1000
Assuming every city_state with a - contains exactly two cities, you can halve the employment:
case when b.city_state contains '-' then b.EMPLOYMENT /2 else b.EMPLOYMENT End as EMPLOYMENT
Assuming every city_state ends with the state abbreviation, you can remove the state and use a contains condition:
b.city_state contains substr(a.city_state,1,length(a.city_state)-length(scan(a.city_state,-1,' ')));
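If some rows combine more than two cities, dividing by 2 no longer fits. A possible generalisation, offered only as a sketch, is to divide by the number of dash-separated pieces using COUNTW. The column names n_cities and employment_share are made up for illustration, and an even split is itself an assumption.

```sas
data b;
  length city_state $100;
  /* two blanks before the number so the & modifier ends the text field there */
  input city_state & employment;
datalines;
Cape Coral - Fort Myers FL  900
Gainesville FL  1000
;
run;

proc sql;
  select city_state,
         employment,
         countw(city_state, '-') as n_cities,                /* pieces separated by a dash */
         employment / calculated n_cities as employment_share
  from b;
quit;
```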

Testing a Condition Before Creating an Observation in SAS using '@' at the end of the input statement

I have read the online document and from it, I think that it only works with the column input method. How can this be used with list input method?
/* This Works */
data new;
input height 25-26 @;
if height = 6 ;
input name $ 1-8 colour $ 9-13 place $ 16-24 ;
datalines;
Deepak Red Delhi 6
Aditi Yellow Delhi 5
Anup Blue Delhi 5
Era Green Varanasi 5
Avinash Black Noida 5
Vivek Grey Agra 5
;
run;
/* But This Doesn't*/
data new;
input height @;
if height = 6;
input name $ colour $ place $ height;
datalines;
Deepak Red Delhi 6
Aditi Yellow Delhi 5
Anup Blue Delhi 5
Era Green Varanasi 5
Avinash Black Noida 5
Vivek Grey Agra 5
;
run;
LOG:
NOTE: Invalid data for height in line 79 1-6.
79 Deepak Red Delhi 6
height=. name= colour= place= _ERROR_=1 _N_=1
NOTE: Invalid data for height in line 80 1-5.
80 Aditi Yellow Delhi 5
height=. name= colour= place= _ERROR_=1 _N_=2
The fixed layout of the first set of data lines makes it possible to input a field from a specific location.
The second layout is variable, so it is harder to arbitrarily grab a specific field.
So, what is wrong? In the second DATA step, list input starts reading at the beginning of the line, so it tries to read the name (Deepak, in columns 1-6) as the numeric variable height. That is what the "Invalid data for height" notes in the log are reporting.
Don't worry about 'reducing processing' by reading only part of a line. Held input and conditional processing are more often used for data lines that have some sort of variant or conditional data items within the content.
For both of those formats I would read all of the variables and then add logic to filter based on values.
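For the list-input layout, that could look something like the following sketch (shortened data, same variable names as in the question):

```sas
data new;
  input name $ colour $ place $ height;
  /* subsetting IF: only observations that pass the test are written out */
  if height = 6;
datalines;
Deepak Red Delhi 6
Aditi Yellow Delhi 5
Anup Blue Delhi 5
;
run;
```

Every line is still read in full, but only the lines where height is 6 end up in the output data set.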
If you really need to check if the last "word" on the line matched some criteria before deciding HOW to read the line then you might want to try using the automatic _infile_ variable.
data new;
input @ ;
if scan(_infile_,-1,' ') = '6';
input name $ colour $ place $ height;
datalines;
Deepak Red Delhi 6
Aditi Yellow Delhi 5
Anup Blue Delhi 5
Era Green Varanasi 5
Avinash Black Noida 5
Vivek Grey Agra 5
;

How to read missing data when reading txt file in SAS?

I am new to SAS.
I want to read a txt file in SAS. The thing is the file has missing data in some lines.
For example the data is:
James Monroe Monroe Hall Virginia 58 4/28/1758 1816
John Quincy Adams Braintree Massachusetts 57 7/11/1767 1824 113,142 30.92%
Andrew Jackson Waxhaws Region South/North Carolina 61 3/15/1767 1828 642,806 55.93%
Martin Van Buren Kinderhook New York 54 12/5/1782 1836 763,291 50.79%
William Henry Harrison Charles City County Virginia 68 2/9/1773 1840 1,275,583 52.87%
The columns I want are 'FullName', 'City', 'State', 'Age', 'DOB', 'Year', 'Number' and 'Percentage'.
My code is:
infile 'C:\sasfiles\testfile.txt' missover dlm='09'x dsd
input FullName City State Age DOB Year Number PercVote;
Run;
But I get the error
9 CHAR Rutherford B. Hayes.Delaware ..Ohio..54.10/4/1822.1876.4,034,142.47.92% 73
ZONE 5776676676242246767046667676222004666003303323233330333303233323330332332
NUMR 2548526F2402E081953945C1712500099F89F9954910F4F18229187694C034C142947E925
FullName=. City=. State=. Age=. DOB=. Year=54 Number=. PercVote=1876 _ERROR_=1 _N_=19
NOTE: Invalid data for FullName in line 20 1-17.
NOTE: Invalid data for City in line 20 19-32.
NOTE: Invalid data for State in line 20 37-40.
NOTE: Invalid data for Number in line 20 46-55.
NOTE: Invalid data for PercVote in line 20 62-70.
You need to provide data in a format that can be properly parsed.
When using delimited data, use adjacent delimiters to indicate that a value is missing. It is best to define the variables before you use them. If you define them in order, then your input statement can be very simple.
data want ;
infile cards dsd dlm='|' firstobs=2 truncover ;
length name $17 city $11 state $15 age 8 dob 8 yod 8 ;
input name -- yod ;
informat dob mmddyy10.;
format dob yymmdd10.;
cards;
----+----0----+----0----+----0----+----0----+----0----+----0----+----0
James Monroe|Monroe Hall|Virginia||4/28/1758|1816
John Quincy Adams|Braintree|Massachusetts|57|7/11/1767|1824
;
Or force the data into columns and use column input.
data want ;
infile cards firstobs=2 truncover ;
input name $ 1-17 city $ 19-29 state $ 31-45 age 47-48 @50 dob mmddyy10. yod 61-64;
format dob yymmdd10.;
cards;
----+----0----+----0----+----0----+----0----+----0----+----0----+----0
James Monroe      Monroe Hall Virginia           4/28/1758  1816
John Quincy Adams Braintree   Massachusetts   57 7/11/1767  1824
;
Try this (the fields in the data lines must be separated by tab characters, to match dlm='09'x):
data want;
infile cards dlm='09'x missover;
input (FullName City State) (:$32.) Age :8. DOB :mmddyy9. Year :4. Number :comma8. PercVote :percent8.;
format DOB mmddyy10. number comma16. percvote percent6.2;
cards;
James Monroe Monroe Hall Virginia 58 4/28/1758 1816
John Quincy Adams Braintree Massachusetts 57 7/11/1767 1824 113,142 30.92%
Andrew Jackson Waxhaws Region South/North Carolina 61 3/15/1767 1828 642,806 55.93%
Martin Van Buren Kinderhook New York 54 12/5/1782 1836 763,291 50.79%
William Henry Harrison Charles City County Virginia 68 2/9/1773 1840 1,275,583 52.87%
;
run;

How to use proc transpose on variables that contain numbers separated with _?

Hi, I am new to SAS and I have a question regarding proc transpose.
I have this data
Input
School Name     State  School Code  26/07/2009  02/08/2009  09/08/2009  16/08/2009
Northwest High  IL     14556        06          06          06          06
Georgia High    GA     147          05          05          05          06
Macy Hgh        TX     45456        NA          NA          NA          NA
The desired output is
School Name     State  School Code  Date        Absent
Northwest High  IL     14556        26/07/2009  6
Northwest High  IL     14556        02/08/2009  6
Northwest High  IL     14556        09/08/2009  6
Northwest High  IL     14556        16/08/2009  6
Georgia High    GA     147          26/07/2009  5
Georgia High    GA     147          02/08/2009  5
Georgia High    GA     147          09/08/2009  5
Georgia High    GA     147          16/08/2009  6
Macy Hgh        TX     45456        26/07/2009  NA
Macy Hgh        TX     45456        02/08/2009  NA
Macy Hgh        TX     45456        09/08/2009  NA
Macy Hgh        TX     45456        16/08/2009  NA
This is the code I have written
proc sort data=work.input;
by School_Name State School_Code;
run;
proc transpose data=work.input out=work.inputModified;
by School_Name State School_Code;
run;
I get a note saying "No variables to transpose", so I don't get the desired output. I think the issue is that the variable names are really dates: when the data is imported into SAS, a column such as 26/07/2009 becomes the variable _26_07_2009, and SAS does not seem to recognize them.
Note that there are about 185 of these date variables.
Thanks
The following transpose does the job:
proc transpose data=work.input out=work.inputModified;
by School_Name State School_Code;
var _:;
run;
Notice the _: notation - it picks up all variables which start with an underscore and transposes them.
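Here is a small self-contained sketch of the same idea that can be run on its own. The pipe-delimited test data and the two-date subset are made up from the rows in the question, and the NAME= and RENAME= options are optional extras I added to give the output columns friendlier names (DateVar and Absent are hypothetical names).

```sas
data work.input;
  infile datalines dsd dlm='|' truncover;
  length School_Name $20 State $2 School_Code 8
         _26_07_2009 _02_08_2009 $2;
  input School_Name State School_Code _26_07_2009 _02_08_2009;
datalines;
Northwest High|IL|14556|06|06
Macy Hgh|TX|45456|NA|NA
;
run;

proc sort data=work.input;
  by School_Name State School_Code;
run;

proc transpose data=work.input
               out=work.inputModified(rename=(col1=Absent))
               name=DateVar;
  by School_Name State School_Code;
  var _:;   /* every variable whose name starts with an underscore */
run;
```

Without NAME= and RENAME=, the output columns are called _NAME_ and COL1, which is what any later step reading inputModified would need to reference.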
As I mentioned in the link in my comments earlier, if you do not explicitly specify the variables you want to transpose, then proc transpose by default transposes only the numeric variables that are not in the BY variable list. However, since your date variables are read in as strings (due to the presence of the NAs), it was saying NOTE: No variables to transpose.
You can use the following to convert the date and absent columns into numeric columns.
data inputModified2;
set inputModified;
format date date9.;
date = input(compress(tranwrd(_name_,'_','')), ddmmyy8.);
if col1 NE 'NA' then absent = input(col1, 8.);
else absent=.;
drop _name_ col1;
run;