Reconstitute .txt file of HTML table as Dataset in SAS - sas

I am currently using SAS version 9 to read in a flat .txt file containing an HTML table that I have taken from the following page (entitled Wayne Rooney's Match History):
http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney
I've got the data into a .txt file using a Python web scraper built with Scrapy. The format of my .txt file is as follows:
17-08-2013,1 : 4,Swansea,Manchester United,28',7.26,Assist Assist,26-08-2013,0 : 0,Manchester United,Chelsea,90',7.03,None,14-09-2013,2 : 0,Manchester United,Crystal Palace,90',8.44,Man of the Match Goal,17-09-2013,4 : 2,Manchester United,Bayer Leverkusen,84',9.18,Goal Goal Assist,22-09-2013,4 : 1,Manchester City,Manchester United,90',7.17,Goal Yellow Card,25-09-2013,1 : 0,Manchester United,Liverpool,90',None,Man of the Match Assist,28-09-2013,1 : 2,Manchester United,West Bromwich Albion,90'...
...and so on. What I want is a dataset that has the same format as the original table. I know my way round SAS fairly well, but tend not to use infile statements all that much. I have tried a few variations on a theme, but this syntax has got me the nearest to what I want so far:
filename myfile "C:\Python27\Football Data\test.txt";
data test;
length date $10.
score $6.
home_team $40.
away_team $40.
mins_played $3.
rating $4.
incidents $40.;
infile myfile DSD;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ ;
run;
This returns a dataset with only the first row of the table included. I have tried using fixed widths and pointers to set the dataset dimensions, but because the length of things like team names can change so much, this is causing the data to be reassembled from the flat file incorrectly.
I think I'm most of the way there, but can't quite crack the last bit. If anyone knows the exact syntax I need that would be great.
Thanks

I would read it straight from the web. Something like this; it works about 50% of the way but took only ten minutes to write, and I'm sure it could easily be improved.
The basic approach is to use @'string' to read the text that follows a given string. You might be better off reading this in as a byte stream, doing a regular expression match on <tr> ... </tr>, and then parsing that, rather than taking the more brute-force approach used here.
filename rooney url "http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney" lrecl=32767;
data rooney;
infile rooney scanover;
retain are_reading;
input @;
if find(_infile_,'<table id="player-fixture" class="grid fixture">')
then are_reading=1;
if find(_infile_,'</table>') then are_reading=0;
if are_reading then do;
input @'<td class="date">' date ddmmyy10.
@'class="team-link">' home_team $20.
@'class="result-1 rc">' score $10.
@'class="team-link">' away_team $20.
@'title="Minutes played in this match">' mins_played $10.
@'title="Rating in this match">' rating $6.
;
output;
end;
run;
As far as reading the scrapy output, you should change at least two things:
Add the delimiter explicitly. It is not strictly necessary here, because DSD changes the default delimiter to a comma, but I'd consider the code incomplete without it unless the delimiter is a space.
Add a trailing "@@" to get SAS to hold the line pointer across iterations, since you don't have line feeds in your data.
data want;
infile myfile flowover dlm=',' dsd lrecl=32767;
length date $10.
score $6.
home_team $40.
away_team $40.
mins_played $3.
rating $4.
incidents $40.;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ @@;
run;
Flowover is default, but I like to include it to make it clear.
You also probably want to read the date as a date value (not char), using informat date ddmmyy10.;. The rating is also easily read as numeric if you want, and both minutes played and score can be read as numeric, if you're doing analysis on them, by adding ' and : to the delimiter list.
Finally, your . on length is incorrect; SAS is nice enough to ignore it, but . is only placed like so for formats.
Here's my final code:
data want;
infile "c:\temp\test2.txt" flowover dlm="',:" lrecl=32767;
informat date ddmmyy10.
score_1 score_2 2.
home_team $40.
away_team $40.
mins_played 3.
rating 4.2
incidents $40.;
input date
score_1
score_2
home_team $
away_team $
mins_played
rating ??
incidents $ @@;
run;
I removed DSD because it's incompatible with the ' delimiter; if DSD is actually needed, you can add it back, remove that delimiter, and read minutes in as char. I added ?? for rating because it is sometimes "None", and ?? suppresses the invalid-data messages about that.
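The DSD alternative mentioned above can be sketched roughly as follows. This is only an illustration, not the code from the answer: it reads two records from in-line data rather than the file, and the names want_dsd, mins_char, home_goals and away_goals are made up for the example. Minutes are read as character and converted afterwards, and the score is split with SCAN.

```sas
data want_dsd;
  /* Two records on one physical line, as in the scraped file */
  infile datalines dsd dlm=',' flowover;
  informat date ddmmyy10.;
  length score $6 home_team away_team $40 mins_char $4 incidents $40;
  input date score $ home_team $ away_team $ mins_char $ rating ?? incidents $ @@;
  /* Keep only the digits of values such as 28' and convert to numeric */
  mins_played = input(compress(mins_char, , 'kd'), 8.);
  /* Split a score such as "1 : 4" on the colon */
  home_goals = input(scan(score, 1, ':'), 8.);
  away_goals = input(scan(score, 2, ':'), 8.);
  format date ddmmyy10.;
  drop mins_char;
datalines;
17-08-2013,1 : 4,Swansea,Manchester United,28',7.26,Assist Assist,26-08-2013,0 : 0,Manchester United,Chelsea,90',7.03,None
;
run;
```

Whether this or the extra-delimiter version is preferable probably depends on whether any field in the real file is quoted or contains embedded commas.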

Related

SAS replacing string with string values within a table

I have this CSV dataset named Cars:
Brand, Model, Description, Year, Price, Sale
Toyota, Wish, "1.5, used""", 2018, 120000, 0
Lexus, LX300, "2.0 4wd, leather seat 15"", 2021, 23000, 0
Toyota, Innova, "2.0, 7 seater 4wd, "full spec 12""", 2007, 5000, 0
Honda, CRV, "2.5, 4wd", 2021, 11500, 0
Nissan, Serena, "7 seater, hybrid, used", 2019, 14400, 0
Hyundai, Elantra, "5 seater, turbo used", 2020, 13210, 0
I tried to replace , and " under description so that SAS can read it correctly.
FILENAME cars '....Cars.csv';
data cars_out;
infile cars dlm=',' firstobs=2 truncover;
format Brand $7. Model $7. Description $334. Year 4. Price 5. Sale 1.;
input @;
Year= translate(Year,' ','",');
input Brand Model Description Year Price Sale;
run;
But this doesn't work. Any clue as to why?
Your csv file is invalid. It has unbalanced quotation marks and embedded commas, causing a problem. If you open the file up in Excel, you will see it is invalid.
After manually fixing the file within Excel, the correct .csv should look as such:
Data:
Brand, Model, Description, Year, Price, Sale
Toyota, Wish," ""1.5 used""""""",2018,120000,0
Lexus, LX300," ""2.0 4wd leather seat 15""""",2021,23000,0
Toyota, Innova," ""2.0, 7 seater 4wd ""full spec 12""""""",2007,5000,0
Honda, CRV," ""2.5, 4wd""",2021,11500,0
Nissan, Serena," ""7 seater, hybrid used""",2019,14400,0
Hyundai, Elantra," ""5 seater, turbo used""",2020,13210,0
After balancing the quotes, SAS can read this correctly using your provided code. The only change I'd make is to increase the formatted length of price to 8 so it is not shown in scientific notation.
Your approach cannot work. You cannot change the value of YEAR before you have read anything into YEAR. You need to fix the text in the file for SAS (or anyone) to be able to parse it.
Best would be to re-create the file from the original data using proper quoting to begin with. When generating a CSV (or any delimited file) you should add quotes around values that contain the delimiter or quotes. Any actual quotes in the value should be doubled to make clear they are not the closing quote. SAS will do this automatically when you use the DSD option on the FILE statement.
With your example you might be able to peel off the fields at the beginning and the end of each line, since they do not appear to have issues. Whatever is left is then the value of DESCRIPTION.
data want;
infile cars dlm=',' truncover firstobs=2;
length brand model $20 description $400 year price sale 8 dummy $400;
input brand model @;
nwords=countw(_infile_,',');
do i=1 to nwords-5;
input dummy @;
description=catx(',',description,dummy);
end;
input year price sale;
drop i dummy;
run;
You could probably clean things up more:
if '""'=char(description,1)||char(description,length(description)) then do;
description=substrn(description,2,length(description)-2);
description=tranwrd(description,'""','"');
if countc(description,'"')=1 then description=compress(description,'"');
end;
So with that cleaned up version your source file SHOULD have looked like this for it to be properly parsed as a comma delimited file.
Brand,Model,Description,Year,Price,Sale
Toyota,Wish,"1.5,used",2018,120000,0
Lexus,LX300,"2.0 4wd,leather seat 15",2021,23000,0
Toyota,Innova,"2.0,7 seater 4wd,""full spec 12""",2007,5000,0
Honda,CRV,"2.5,4wd",2021,11500,0
Nissan,Serena,"7 seater,hybrid,used",2019,14400,0
Hyundai,Elantra,"5 seater,turbo used",2020,13210,0
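As a small illustration of the point about the DSD option on the FILE statement, the sketch below writes one record to the SAS log. The values are taken from the Innova row above; writing to the log rather than to a file is just a convenience chosen for this example.

```sas
data _null_;
  length brand model $20 description $400;
  brand = 'Toyota';
  model = 'Innova';
  /* A value containing both commas and double quotes */
  description = '2.0,7 seater 4wd,"full spec 12"';
  year = 2007;
  price = 5000;
  sale = 0;
  /* DSD on FILE writes comma-delimited output, quoting values where needed */
  file log dsd;
  put brand model description year price sale;
run;
```

The description should come out wrapped in quotes with its embedded quotes doubled, in the same shape as the Innova line in the corrected file.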

SAS Reading CSV file using input statement

I am using this code to import csv file in sas
data retail;
infile "C:\users\Documents\training\Retails_csv" DSD MISSOVER FIRST OBS =2;
INPUT Supplier :$32. Item_Category :$32. Month :$3. Cost :DOLLAR10. Revenue :DOLLAR10. Unit_Price :DOLLAR10.2 Units_Availed :8. Units_Sold :8.;run;
I need Cost, Revenue, and Unit_Price to appear in dollar format in the output SAS dataset, the same way they appear in my source data.
please someone help
thanks
The INFILE statement does not take the REPLACE keyword. In fact since INFILE is just saying where the input data is coming from there isn't really any logical feature that the INFILE statement might have that would use that name.
You need to attach the DOLLAR format to the variables if you want SAS to print the values using dollar signs and thousands separators. You can either attach the format in the data step or in the steps that print the data.
format cost revenue unit_price dollar10. ;
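Both placements of the FORMAT statement can be sketched like this. The dataset name retail_demo and the two data rows are invented for illustration and are not the asker's file.

```sas
data retail_demo;
  infile datalines dsd truncover;
  /* The DOLLAR informat strips $ and commas while reading */
  input Supplier :$32. Cost :dollar10. Revenue :dollar10.;
  /* Option 1: attach the display format permanently in the data step */
  format Cost Revenue dollar10.;
datalines;
Acme,"$1,200","$3,400"
Bolt,$950,"$1,100"
;
run;

/* Option 2: override or attach a format only for this report */
proc print data=retail_demo;
  format Revenue dollar12.2;
run;
```

The informat on the INPUT statement controls how the text is read, while the FORMAT statement controls how the stored number is displayed; the two are independent.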

SAS colon format modifier

What do the numbers in the grey box represent? And what's a simple way of understanding how the colon modifier affects the way sas reads in values?
The answer depends on information not provided. The answer B is the best choice in the sense that you should use the colon modifier when using informats in the INPUT statement to prevent the use of the formatted input mode instead of list input mode. Otherwise the formatted input could read too many or too few characters and also might leave the cursor in the wrong place for reading the next field.
But if you try to read that data from in-line cards it works fine for those two lines. That is because in-line data lines are padded to next multiple of 80 bytes.
If you put those lines into a file without any trailing spaces on the lines then the second line fails because there are not 10 characters to read for the last field. But if you add the TRUNCOVER option (or PAD) to the INFILE statement then it will work.
Try it yourself. TEST1 and TEST3 work. TEST2 gets a LOST CARD note.
data test1;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
cards;
Donny 5MAR2008 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
options parmcards=test;
filename test temp ;
parmcards;
Donny 5MAR2008 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
data test2;
infile test;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
run;
data test3;
infile test truncover;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
run;
With different data the first formatted input can cause trouble also. For example if the date values used only 2 digits for the year it would throw things off. So it tries to read FL as the age and then reads the first 8 characters of the salary as the STATE and just blanks as the SALARY.
data test1;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
cards;
Donny 5MAR08 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
Results:
Obs  name      hired      age  state     salary
1    Donny     05MAR2008    .  $43,123.       .
2    Margaret  20FEB2008   43  NC         65150
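For comparison, here is a sketch of the same step with the colon modifier added to both informats (the dataset name test4 is just for this example). With the colon, each field is read in list input mode, so reading stops at the next blank instead of consuming a fixed number of columns.

```sas
data test4;
  /* :date9. and :comma10. read up to the next delimiter, not a fixed width */
  input name $ hired :date9. age state $ salary :comma10.;
  format hired date9.;
cards;
Donny 5MAR08 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
run;

proc print data=test4;
run;
```

Here the short date on the first line should no longer swallow the age field, so age, state, and salary should line up for both records. How the two-digit year is interpreted depends on the YEARCUTOFF system option.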

SAS Compress function Returns FALSE value

I'm having some trouble using the 'Compress' function in SAS. My aim is to remove any alphabetical characters from the 'Comments' field in this data step.
However, when I execute the code below, the 'NewPrice' field shows FALSE rather than the expected value.
data WORK.basefile2;
length TempFileName Filenameused $300.;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile CSV2 delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=3 Filename=TempFileName;
Filenameused= TRANWRD(substr(TempFileName, 48), ".xlsx.csv", "");
informat customer_id $8.;
informat Name $50.;
informat Reco_issue $50.;
informat Reco_action $50.;
informat ICD $50.;
informat Start_date $50.;
format customer_id $8.;
format Name $50.;
format Reco_issue $50.;
format Reco_action $50.;
format ICD $50.;
format Start_date $50.;
format Comments $255.;
format Description $255.;
NewPrice=COMPRESS(Comments, '', 'kd');
input
customer_id $
Name $
'Total Spend'n $
Template $
Product_id $
Description $
Start_date $
CD_Sales
CD_Lines
CD_Level $
CD_Price $
CD_uom $
CD_Discount $
Reco_issue $
Reco_action $
Reco_price $
Reco_discount $
Impact_£
Impact
Agree $
Comments $
ICD $
Structure_Code $
Deal_type $
NewPrice;
run;
Output when code is executed
Sample data (comma delimited):
ASDFGH,TEST,"£31,333.00",15AH,156907,TEST,18/10/2016,"£4,003.10",222,5,£5.19,M,,Below Hard Floor,Change Rate,£6.63,,£0.48,21.72%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,475266,TEST,11/11/2016,£49.61,29,5,£2.52,EA,,At Hard Floor,Change Rate,£6.36,,£1.28,60.38%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,404740,TEST,21/09/2017,£38.69,1,5,£116.07,EA,,Below Hard Floor,Change Rate,£163.80,,£15.91,29.14%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476557,TEST,11/11/2016,£32.13,25,5,£1.32,EA,,Below Hard Floor,Change Rate,£2.76,,£0.48,52.17%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476553,TEST,11/11/2016,£29.17,11,5,£1.29,EA,,Below Hard Floor,Change Rate,£3.39,,£0.70,61.95%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476557,TEST,11/11/2016,£17.61,5,5,£3.96,EA,,Below Hard Floor,Change Rate,£9.69,,£1.91,59.13%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,475261,TEST,11/11/2016,£16.70,4,5,£10.92,EA,,Below Hard Floor,Change Rate,£26.67,,£5.25,59.06%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476546,TEST,11/11/2016,£15.73,10,5,£0.96,EA,,Below Hard Floor,Change Rate,£2.67,,£0.57,64.04%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476549,TEST,11/11/2016,£5.84,3,5,£1.86,EA,,At Hard Floor,Change Rate,£6.00,,£1.38,69.00%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,477340,TEST,11/11/2016,£3.75,2,5,£4.11,EA,,Below Hard Floor,Change Rate,£11.40,,£2.43,63.95%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",,259738,TEST,13/01/2018,"£45,173.66",403,5,£10.35,EA,20,Below Hard Floor,Change Rate,£10.80,,£0.15,4.17%,N,New Prices Agreed £3.52
ASDFGH,TEST,"£31,333.00",,297622,TEST,13/01/2018,£736.60,5,5,£10.95,EA,20,Below Hard Floor,Change Rate,£11.46,,£0.17,4.45%,N,New Prices Agreed £3.75
ASDFGH,TEST,"£31,333.00",,105384,TEST,19/07/2017,£223.44,1,5,£11.25,BG,42.5,Below Hard Floor,Change Rate,£11.49,,£0.08,2.09%,N,New Prices Agreed £3.76
Any help would be greatly appreciated!
Thanks,
Henry
First off, I assume you mean that NEWPRICE=' ', not FALSE, since SAS doesn't have a "FALSE" per se. ' ' would be treated as FALSE in a boolean expression, though.
COMPRESS is returning ' ' here because the value in COMMENTS consists entirely of letters. Your COMPRESS arguments ask it to keep digits and spaces ('' is still a single space, even if it doesn't look like it; leave the argument out entirely if you don't want to keep spaces), so it keeps only spaces and digits. Most records have no digits in the COMMENTS field, so only spaces remain, which SAS treats as equivalent to missing.
For the other records, the problem is that you're keeping spaces, so the result won't convert to a number neatly. You'll want to use INPUT, and probably not keep spaces, to get the value you want. (You probably also want to keep ., no?) Or make NEWPRICE a character field.
Finally, your line:
NewPrice=COMPRESS(Comments, '', 'kd');
is before the input statement, which is a problem - the value for comments isn't defined yet when it runs.
This works, for example. Note I don't understand why you have NewPrice listed in input (and some other fields that won't have definitions either)...
data WORK.basefile2;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile datalines delimiter = ',' MISSOVER DSD ;
informat customer_id $8.;
informat Name $50.;
informat Reco_issue $50.;
informat Reco_action $50.;
informat ICD $50.;
informat Start_date $50.;
format customer_id $8.;
format Name $50.;
format Reco_issue $50.;
format Reco_action $50.;
format ICD $50.;
format Start_date $50.;
format Comments $255.;
format Description $255.;
input
customer_id $
Name $
'Total Spend'n $
Template $
Product_id $
Description $
Start_date $
CD_Sales
CD_Lines
CD_Level $
CD_Price $
CD_uom $
CD_Discount $
Reco_issue $
Reco_action $
Reco_price $
Reco_discount $
Impact_Pound
Impact
Agree $
Comments $
ICD $
Structure_Code $
Deal_type $
;
NewPrice=input(COMPRESS(Comments, '.', 'kd'),best32.);
datalines;
ASDFGH,TEST,"£31,333.00",15AH,156907,TEST,18/10/2016,"£4,003.10",222,5,£5.19,M,,Below Hard Floor,Change Rate,£6.63,,£0.48,21.72%,N,New Prices Agreed £3.76
ASDFGH,TEST,"£31,333.00",15AH,475266,TEST,11/11/2016,£49.61,29,5,£2.52,EA,,At Hard Floor,Change Rate,£6.36,,£1.28,60.38%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,404740,TEST,21/09/2017,£38.69,1,5,£116.07,EA,,Below Hard Floor,Change Rate,£163.80,,£15.91,29.14%,N,In negotiations with the customer for new prices
;;;;
run;
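To see the effect of the second COMPRESS argument in isolation, a minimal sketch along these lines may help. The variable names kept_blank and kept_dot are illustrative only, and the comment text is one of the sample values above.

```sas
data _null_;
  length comments $60 kept_blank kept_dot $60;
  comments = 'New Prices Agreed £3.76';
  /* 'kd' = keep digits; the second argument lists extra characters to keep */
  kept_blank = compress(comments, ' ', 'kd');  /* digits and blanks */
  kept_dot   = compress(comments, '.', 'kd');  /* digits and periods */
  /* ?? suppresses invalid-data notes when nothing numeric is left */
  newprice = input(kept_dot, ?? best32.);
  put kept_blank= kept_dot= newprice=;
run;
```

The version that keeps blanks leaves the digits separated from each other by whatever blanks were in the comment, which is why it does not convert cleanly to a single number.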

Read specific columns of a delimited file in SAS

This seems like it should be straightforward, but I can't find how to do this in the documentation. I want to read in a comma-delimited file, but it's very wide, and I just want to read a few columns.
I thought I could do this, but the @ pointer seems to point to columns of the text rather than the column numbers defined by the delimiter:
data tmp;
infile 'results.csv' delimiter=',' MISSOVER DSD lrecl=32767 firstobs=2;
input
@1 id
@5 name $
;
run;
In this example, I want to read just what is in the 1st and 5th columns based on the delimiter, but SAS is reading what is in position 1 and position 5 of text file. So if the first line of the input file starts like this
1234567, "x", "y", "asdf", "bubba", ... more variables ...
I want id=1234567 and name=bubba, but I'm getting name=567, ".
I realize that I could read in every column and drop the ones I don't want, but there must be a better way.
Indeed, @ points to a column of text, not to a delimited column. The only method using standard input I've ever found was to read the unwanted fields into a throwaway variable, i.e.
input
id
blank $
blank $
blank $
name $
;
and then drop blank.
However, there is a better solution if you don't mind writing your input differently.
data tmp;
infile datalines;
input @;
id = scan(_INFILE_,1,',');
name = scan(_INFILE_,5,',');
put _all_;
datalines;
12345,x,y,z,Joe
12346,x,y,z,Bob
;;;;
run;
It makes formatting slightly messier, as you need put or input statements for each variable you do not want in base character format, but it might be easier depending on your needs.
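As a sketch of what that extra conversion looks like, the variant below reads id as a number by wrapping SCAN in INPUT (the dataset name tmp2 is made up for the example). It also passes the 'm' modifier to SCAN, which, assuming SAS 9.2 or later, makes consecutive delimiters count as empty fields instead of being collapsed, so the column positions stay correct when a field is blank.

```sas
data tmp2;
  infile datalines;
  length name $20;
  input;  /* load the whole line into _INFILE_ without reading variables */
  /* Convert the first delimited field to numeric */
  id = input(scan(_infile_, 1, ',', 'm'), best32.);
  /* Fifth delimited field, kept as character */
  name = scan(_infile_, 5, ',', 'm');
datalines;
12345,x,y,z,Joe
12346,,y,z,Bob
;
run;
```

Note that SCAN does not strip surrounding quotation marks the way DSD does, so quoted fields would need an extra step such as DEQUOTE.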
You can skip fields fairly efficiently if you know a bit of INPUT statement syntax, note the use of (3*dummy)(:$1.). Reading just one byte should also improve performance slightly.
data tmp;
infile cards DSD firstobs=2;
input id $ (3*dummy)(:$1.) name $;
drop dummy;
cards;
id,x,y,z,name
1234567, "x", "y", "asdf", "bubba", ... more variables
1234567, "x", "y", "asdf", "bubba", ... more variables
run;
proc print;
run;
One more option that I thought of when answering a related question from another user.
filename tempfile temp;
data _null_;
set sashelp.cars;
file tempfile dlm=',' dsd lrecl=32767;
put (Make--Wheelbase) ($);
run;
data mydata;
infile tempfile dlm=',' dsd truncover lrecl=32767;
length _tempvars1-_tempvars100 $32;
array _tempvars[100] $;
input (_tempvars[*]) ($);
make=_tempvars[1];
type=_tempvars[3];
MSRP=input(_tempvars[6],dollar8.);
keep make type msrp;
run;
Here we use an array of effectively temporary (can't actually BE temporary, unfortunately) variables, and then grab just what we want specifying the columns. This is probably overkill for a small file - just read in all the variables and deal with it - but for 100 or 200 variables where you want just 15, 18, and 25, this might be easier, as long as you know which column you want exactly. (I could see using this in dealing with census data, for example, if you have it in CSV form. It's very common to just want a few columns most of which are way down 100 or 200 columns from the starting column.)
You have to take some care with your lengths for the temporary array (has to be as long as your longest column that you care about!), and you have to make sure not to mess up the columns since you won't get to know if you mess up unless it's obvious from the data.