How to use SAS to split a string into two variables - sas

I have a dataset as below:
country
United States, Seattle
United Kingdom, London
How can I split country into two variables in SAS, like this:
country city
United States Seattle
United Kingdom London

Use function SCAN() with comma as separator.
data test;
set test;
city=scan(country,2,',');
country=scan(country,1,',');
run;
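One detail to watch: because the comma is the only delimiter, the second piece keeps the blank that follows the comma (" Seattle"). A minimal sketch that trims it with STRIP(), assuming hypothetical dataset names have and want and a 50-character length chosen only for illustration:

```sas
/* Hypothetical sample data; names and lengths are assumptions. */
data have;
  infile datalines truncover;
  input country $50.;
datalines;
United States, Seattle
United Kingdom, London
;
run;

data want;
  set have;
  length city $50;
  /* STRIP removes the leading blank left after the comma. */
  city = strip(scan(country, 2, ','));
  country = strip(scan(country, 1, ','));
run;
```

Writing to a new dataset rather than overwriting the input also makes it easier to rerun the step while testing.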

Another option, INFILE magic (google the term for papers on the topic); useful for parsing many variables from one string and/or dealing with quoted fields and such that would be more work with scan.
filename tempfile "c:\temp\test.txt";
data have;
input @1 country $50.;
datalines;
United States, Seattle
United Kingdom, London
;;;;
run;
data want;
set have;
infile tempfile dlm=',' dsd;
/* load a record and hold it so the input buffer exists */
input @1 @@;
/* overwrite the buffer with the string to be parsed */
_infile_=country;
format newcountry city $50.;
input newcountry $ city $ @@;
run;
tempfile can be any file (or one you create on the fly with any character in it to avoid premature EOF).

Response to:
data test;
set test;
city=scan(country,2,',');
country=scan(country,1,',');
run;
What if I want to split at the last comma in the string only, keeping "7410 City"?
Example: "Junior 18, Plays Piano, 7410 City"
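One possible approach, sketched here under the assumption that everything before the last comma should stay together: SCAN() with a negative count returns words counted from the right, and FINDC() with the 'b' modifier searches backwards for the last comma. The dataset and variable names (split_last, full, first_part, last_part) are made up for the example.

```sas
data split_last;
  length full first_part last_part $60;
  full = "Junior 18, Plays Piano, 7410 City";
  /* position of the last comma, searching right to left */
  pos = findc(full, ',', 'b');
  if pos > 1 then do;
    /* everything before the last comma */
    first_part = substr(full, 1, pos - 1);
    /* last comma-separated piece, leading blank removed */
    last_part = strip(scan(full, -1, ','));
  end;
  else first_part = full;
  drop pos;
run;
```

When the string contains no comma, the whole value is kept in first_part and last_part stays blank.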

Related

SAS colon format modifier

What do the numbers in the grey box represent? And what's a simple way of understanding how the colon modifier affects the way sas reads in values?
The answer depends on information not provided. Answer B is the best choice in the sense that you should use the colon modifier whenever you put informats in the INPUT statement, so that SAS stays in list input mode instead of switching to formatted input mode. Otherwise formatted input could read too many or too few characters, and might also leave the cursor in the wrong place for reading the next field.
But if you try to read that data from in-line cards, it works fine for those two lines. That is because in-line data lines are padded to the next multiple of 80 bytes.
If you put those lines into a file without any trailing spaces, the second line fails because there are not 10 characters left to read for the last field. If you add the TRUNCOVER option (or PAD) to the INFILE statement, it works.
Try it yourself. TEST1 and TEST3 work. TEST2 gets a LOST CARD note.
data test1;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
cards;
Donny 5MAR2008 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
options parmcards=test;
filename test temp ;
parmcards;
Donny 5MAR2008 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
data test2;
infile test;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
run;
data test3;
infile test truncover;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
run;
With different data, the first formatted input can also cause trouble. For example, if a date value used only 2 digits for the year, it would throw things off: SAS then tries to read FL as the age, reads the first 8 characters of the salary as STATE, and reads only blanks as SALARY.
data test1;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
cards;
Donny 5MAR08 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
Results:
Obs  name      hired      age  state     salary
1    Donny     05MAR2008    .  $43,123.       .
2    Margaret  20FEB2008   43  NC         65150
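For comparison, here is a sketch of the same step with the colon modifier added to both informats, so each field is read as a delimited word rather than a fixed number of columns. The dataset name test4 is made up for the example, and how the two-digit year is interpreted depends on the session's YEARCUTOFF option.

```sas
data test4;
  /* colon modifier: read up to the next blank, then apply the informat */
  input name $ hired :date9. age state $ salary :comma10.;
  format hired date9.;
cards;
Donny 5MAR08 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
run;
```

With list input the shorter date no longer shifts the fields that follow it.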

I would like to represent the title and gender variables, in my SAS table, as numbers. How do I do this in SAS?

I would like to represent the title and gender variables as numbers. What code do I need to add to do this?
DATA test;
INPUT title$ gender$ name$ age;
CARDS;
Mr Male Micheal 20
Mrs Female Stephanie 25
Mr Female Linda 30
Dr Male James 40
Dr Female Jane 45
;
run;
Below is my attempt at the question. However, something is wrong because the title and gender variables do not change!
proc format library = Work;
value $title_ 'Mr' = 1 'Mrs' = 2 'Dr' = 3;
value $gender_ 'Male' = 1 'Female' = 2;
run;
OPTIONS FMTSEARCH = (Work);
data test;
format $title $title_;
set test;
run;
You're nearly there - you just have slightly wrong syntax for your format statement. This is your current format statement:
format $title $title_;
Here's a corrected one. I've extended it to apply your gender format as well:
format title $title_. gender $gender_.;
It is not necessary to overwrite a dataset to apply a format, i.e.
data mydata;
set mydata;
format ...;
run;
You can apply one directly by using proc datasets instead of writing a data step like the one above, e.g.
proc datasets lib = work;
modify test;
format title $title_. gender $gender_.;
run;
quit;
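Note that a format only changes how the values are displayed; the stored variables remain character. If real numeric variables are wanted, one option is a numeric informat created with INVALUE and applied with INPUT(). The following is a sketch, with the informat names (title_in, gender_in) and new variable names (title_num, gender_num) invented for the example:

```sas
proc format;
  /* numeric informats: map text values to numbers */
  invalue title_in 'Mr' = 1 'Mrs' = 2 'Dr' = 3 other = .;
  invalue gender_in 'Male' = 1 'Female' = 2 other = .;
run;

data test_num;
  set test;
  /* INPUT converts the character value using the informat */
  title_num = input(title, title_in.);
  gender_num = input(gender, gender_in.);
run;
```

Values not listed in the informat become missing because of the OTHER clause.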

SAS programming, read from a file and separate into different column

I have an Excel file where I want to split words into different columns in SAS.
In the file it looks like this, all in the same column. I want to split it and get rid of the quotation marks:
ID;"City";"Year"
1;"New york";NULL
2;"stockton";"18"
This is what I tried to do:
data work.project ;
infile "&path\users.csv" delimiter=';' missover dsd;
input ID: $30.
City: $200.
Year: $5. ;
run;
proc print data=work.project;
run;
My output:
Obs ID City Year
1 ,,,"ID ""City"" ""Year
2 ,,,"1 ""new york"" NULL"
3 ,,,"2 ""stockton"" ""18"
4 ,,,"3 ""moscow "" NULL"
Rather than the colon and formats in the INPUT statement use an INFORMAT statement.
data work.project;
infile datalines4 delimiter=';' truncover dsd;
informat id $30. city $200. year $4.;
input ID City Year;
datalines4;
1;"New York";NULL
2;"Stockton";"18"
;;;;
run;
proc print data=project;
run;
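If Year should end up numeric with the literal text NULL treated as missing, one way is to read it as text first and convert it afterwards. This is only a sketch: the temporary variable year_txt and the dataset name project_num are assumptions. When reading the real file instead of in-line data, adding FIRSTOBS=2 to the INFILE statement would skip the header row.

```sas
data work.project_num;
  infile datalines4 delimiter=';' truncover dsd;
  informat id $30. city $200. year_txt $4.;
  input ID City year_txt;
  /* leave Year missing when the file contains the text NULL */
  if upcase(year_txt) ne 'NULL' then Year = input(year_txt, ?? 4.);
  drop year_txt;
datalines4;
1;"New York";NULL
2;"Stockton";"18"
;;;;
run;
```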

Reconstitute .txt file of HTML table as Dataset in SAS

I am currently using SAS version 9 to try and read in a flat file in .txt format of a HTML table that I have taken from the following page (entitled Wayne Rooney's Match History):
http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney
I've got the data into a .txt file using a Python web scraper built with Scrapy. The format of my .txt file looks like this:
17-08-2013,1 : 4,Swansea,Manchester United,28',7.26,Assist Assist,26-08-2013,0 : 0,Manchester United,Chelsea,90',7.03,None,14-09-2013,2 : 0,Manchester United,Crystal Palace,90',8.44,Man of the Match Goal,17-09-2013,4 : 2,Manchester United,Bayer Leverkusen,84',9.18,Goal Goal Assist,22-09-2013,4 : 1,Manchester City,Manchester United,90',7.17,Goal Yellow Card,25-09-2013,1 : 0,Manchester United,Liverpool,90',None,Man of the Match Assist,28-09-2013,1 : 2,Manchester United,West Bromwich Albion,90'...
...and so on. What I want is a dataset that has the same format as the original table. I know my way round SAS fairly well, but tend not to use infile statements all that much. I have tried a few variations on a theme, but this syntax has got me the nearest to what I want so far:
filename myfile "C:\Python27\Football Data\test.txt";
data test;
length date $10.
score $6.
home_team $40.
away_team $40.
mins_played $3.
rating $4.
incidents $40.;
infile myfile DSD;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ ;
run;
This returns a dataset with only the first row of the table included. I have tried using fixed widths and pointers to set the dataset dimensions, but because the length of things like team names can change so much, this is causing the data to be reassembled from the flat file incorrectly.
I think I'm most of the way there, but can't quite crack the last bit. If anyone knows the exact syntax I need that would be great.
Thanks
I would read it straight from the web, with something like the code below. This works about 50% but took all of ten minutes to write, and I'm sure it could easily be improved.
The basic approach is to use @'string' to read in the text that follows a given string. You might be better off reading this in as a bytestream, doing a regular expression match on <tr> ... </tr>, and then parsing that, rather than taking the more brute-force method used here.
filename rooney url "http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney" lrecl=32767;
data rooney;
infile rooney scanover;
retain are_reading;
input @;
if find(_infile_,'<table id="player-fixture" class="grid fixture">')
then are_reading=1;
if find(_infile_,'</table>') then are_reading=0;
if are_reading then do;
input @'<td class="date">' date ddmmyy10.
@'class="team-link">' home_team $20.
@'class="result-1 rc">' score $10.
@'class="team-link">' away_team $20.
@'title="Minutes played in this match">' mins_played $10.
@'title="Rating in this match">' rating $6.
;
output;
end;
run;
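To see the @'string' pointer in isolation, here is a small sketch on made-up in-line data (the dataset name demo and the name=/age= markers are invented for the example). The pointer moves to the column just after the quoted text, and the next variable is read from there.

```sas
data demo;
  infile datalines truncover;
  /* jump past each marker, then read the value that follows it */
  input @'name=' name :$20. @'age=' age;
datalines;
id=1 name=Alice age=30
id=2 name=Bob age=41
;
run;
```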
As far as reading the scrapy output, you should change at least two things:
Add the delimiter. Not truly necessary, but I'd consider the code incorrect without it, unless delimiter is space.
Add a trailing "@@" to get SAS to hold the line pointer, since you don't have line feeds in your data.
data want;
infile myfile flowover dlm=',' dsd lrecl=32767;
length date $10.
score $6.
home_team $40.
away_team $40.
mins_played $3.
rating $4.
incidents $40.;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ @@;
run;
Flowover is default, but I like to include it to make it clear.
You also probably want to input the date as a date value (not char), so informat date ddmmyy10.;. The rating is also easily input as numeric if you want to, and both mins played and score could be input as numeric if you're doing analysis on those by adding ' and : to the delimiter list.
Finally, your . on length is incorrect; SAS is nice enough to ignore it, but . is only placed like so for formats.
Here's my final code:
data want;
infile "c:\temp\test2.txt" flowover dlm="',:" lrecl=32767;
informat date ddmmyy10.
score_1 score_2 2.
home_team $40.
away_team $40.
mins_played 3.
rating 4.2
incidents $40.;
input date
score_1
score_2
home_team $
away_team $
mins_played
rating ??
incidents $ @@;
run;
I removed the DSD option because it is incompatible with the ' delimiter; if DSD is actually needed, you can add it back, remove that delimiter, and read minutes in as char. I added ?? for rating because it is sometimes "None", and ?? suppresses the warnings about that.

How can I use the ampersand and a delimiter in SAS

I want to read the following dat file into SAS. Since the names and values are separated by 2 spaces I use the ampersand in the input statement. But it seems that the DLM='/' in the infile statement conflicts with it. Can someone tell me what the mistake in my code is?
File:
1118 ART CONTUCK  57.69/65.20/120.50//152.60
2287 MICHAEL WINSTONE  145.89
Code:
data mylib.D_report;
infile Dinning dlm='/' dsd missover;
input ID 1-4 Name & $17. M1-M6;
run;
You're mixing input styles, which while understandable given you have fairly mixed input data, isn't permitted the way you're doing it.
Your best option is to read M1-M6 into one variable, then split it up using SCAN.
data work.D_report;
infile datalines missover dlm=' ';
input ID :4.
Name & $17.
Ms :$40.;
array M[6];
do _t = 1 to countc(Ms,'/')+1;
if _t > dim(M) then leave;
/* 'm' keeps empty fields; INPUT converts the text to a number */
M[_t]=input(scan(Ms,_t,'/','m'), ?? 32.);
end;
drop _t;
datalines;
1118 ART CONTUCK  57.69/65.20/120.50//152.60
2287 MICHAEL WINSTONE  145.89
;;;;
run;
You just need to change the delimiter.
data D_report;
dlm = ' ';
infile cards dlm=dlm missover dsd;
input ID 1-4 Name & $17. @;
dlm = '/';
input M1-M6;
cards;
1118 ART CONTUCK  57.69/65.20/120.50//152.60
2287 MICHAEL WINSTONE  145.89
run;
proc print;
run;