I have a text file downloaded from the BLS website that has a lot of spaces in between columns.
Code:
data unemployment;
infile 'P:\Projects\la.data.2.AllStatesU.txt' dsd firstobs=2;
input @1 series_id : $20.
@32 year
@36 period : $3.
@51 value
@57 footnote_codes : $1.;
run;
But I get a mess of errors:
NOTE: Invalid data for year in line 2 32-53.
NOTE: Invalid data for value in line 2 51-53.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9--
3 CHAR LAUST010000000000003 .1976.M02. 7.7. 53
ZONE 44555333333333333333222222222203333043302222222223230
NUMR C15340100000000000030000000000919769D0290000000007E79
The period column has the first two characters right, but the year and everything else is wrong. How do I fix this?
The file from that website
filename bls url
"https://download.bls.gov/pub/time.series/la/la.data.2.AllStatesU"
;
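has tab characters in it. That is visible in the line 3 dump you posted from the SAS log: the characters shown as periods in the CHAR line have ZONE 0 and NUMR 9, that is hex 09, the tab character.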
You can either tell the INFILE statement to expand the tabs into spaces and read it as fixed column format.
data unemployment;
infile bls expandtabs firstobs=2 truncover;
input
series_id $ 1-20
year 33-36
period $ 41-43
value ?? 50-60
footnote_codes $ 65
;
run;
Or tell it that the tab character is the delimiter.
data unemployment;
infile bls dlm='09'x dsd firstobs=2 truncover;
input
series_id :$20.
year
period :$3.
value ??
footnote_codes :$1.
;
run;
Note: The ?? modifier for VALUE is because the file has a hyphen to represent missing values in that field. The ?? input modifier will tell the data step to not flag those as data errors.
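To confirm the tabs before choosing between the two approaches, a small diagnostic step can help. This is only a sketch: it assumes the bls fileref defined above is already assigned, and the variable name n_tabs is made up for illustration.

```sas
/* Diagnostic sketch: look at the first few records of the file */
data _null_;
  infile bls obs=3 lrecl=32767;      /* read only the first 3 records */
  input;                             /* load the record into _INFILE_ */
  n_tabs = countc(_infile_, '09'x);  /* count tab characters in the record */
  put _n_= n_tabs=;
  list;                              /* write the record to the log (hex shown for unprintable characters) */
run;
```

If n_tabs is greater than zero for the data records, either EXPANDTABS or DLM='09'x should be appropriate.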
I have this CSV dataset named Cars:
Brand, Model, Description, Year, Price, Sale
Toyota, Wish, "1.5, used""", 2018, 120000, 0
Lexus, LX300, "2.0 4wd, leather seat 15"", 2021, 23000, 0
Toyota, Innova, "2.0, 7 seater 4wd, "full spec 12""", 2007, 5000, 0
Honda, CRV, "2.5, 4wd", 2021, 11500, 0
Nissan, Serena, "7 seater, hybrid, used", 2019, 14400, 0
Hyundai, Elantra, "5 seater, turbo used", 2020, 13210, 0
I tried to replace , and " under description so that SAS can read it correctly.
FILENAME cars '....Cars.csv';
data cars_out;
infile cars dlm=',' firstobs=2 truncover;
format Brand $7. Model $7. Description $334. Year 4. Price 5. Sale 1.;
input @;
Year= translate(Year,' ','",');
input Brand Model Description Year Price Sale;
run;
But this doesn't work. Any clue as to why?
Your csv file is invalid. It has unbalanced quotation marks and embedded commas, causing a problem. If you open the file up in Excel, you will see it is invalid.
After manually fixing the file within Excel, the correct .csv should look as such:
Data:
Brand, Model, Description, Year, Price, Sale
Toyota, Wish," ""1.5 used""""""",2018,120000,0
Lexus, LX300," ""2.0 4wd leather seat 15""""",2021,23000,0
Toyota, Innova," ""2.0, 7 seater 4wd ""full spec 12""""""",2007,5000,0
Honda, CRV," ""2.5, 4wd""",2021,11500,0
Nissan, Serena," ""7 seater, hybrid used""",2019,14400,0
Hyundai, Elantra," ""5 seater, turbo used""",2020,13210,0
After balancing the quotes, SAS can read this correctly using your provided code. The only change I'd make is to increase the formatted length of price to 8 so it is not shown in scientific notation.
Your approach cannot work. You cannot change the value of YEAR before you have read anything into YEAR. You need to fix the text in the file for SAS (or anyone) to be able to parse it.
Best would be to re-create the file from the original data using proper quoting to begin with. When generating a CSV (or any delimited file) you should add quotes around values that contain the delimiter or quotes. Any actual quotes in the value should be doubled to make clear they are not the closing quote. SAS will do this automatically when you use the DSD option on the FILE statement.
With your example you might be able to peel off the fields at the beginning and the end of each line, since they do not appear to have issues. Whatever is left is then the value of DESCRIPTION.
data want;
infile cars dlm=',' truncover firstobs=2;
length brand model $20 description $400 year price sale 8 dummy $400;
input brand model @;
nwords=countw(_infile_,',');
do i=1 to nwords-5;
input dummy @;
description=catx(',',description,dummy);
end;
input year price sale;
drop i dummy;
run;
You could probably clean things up more:
if '""'=char(description,1)||char(description,length(description)) then do;
description=substrn(description,2,length(description)-2);
description=tranwrd(description,'""','"');
if countc(description,'"')=1 then description=compress(description,'"');
end;
So with that cleaned up version your source file SHOULD have looked like this for it to be properly parsed as a comma delimited file.
Brand,Model,Description,Year,Price,Sale
Toyota,Wish,"1.5,used",2018,120000,0
Lexus,LX300,"2.0 4wd,leather seat 15",2021,23000,0
Toyota,Innova,"2.0,7 seater 4wd,""full spec 12""",2007,5000,0
Honda,CRV,"2.5,4wd",2021,11500,0
Nissan,Serena,"7 seater,hybrid,used",2019,14400,0
Hyundai,Elantra,"5 seater,turbo used",2020,13210,0
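As a rough sketch of the DSD option on the FILE statement mentioned earlier, the step below writes one row whose description contains both commas and quotes. The data set name cars_clean and the temporary fileref out are assumptions made up for this illustration; they are not part of the original question.

```sas
/* Hypothetical one-row data set with an awkward DESCRIPTION value */
data cars_clean;
  length brand model $20 description $400;
  brand = 'Toyota';
  model = 'Innova';
  description = '2.0,7 seater 4wd,"full spec 12"';
  year = 2007; price = 5000; sale = 0;
run;

filename out temp;  /* temporary file; swap in a real path as needed */

data _null_;
  set cars_clean;
  file out dsd dlm=',';   /* DSD adds quotes where needed and doubles embedded quotes */
  if _n_ = 1 then put 'Brand,Model,Description,Year,Price,Sale';
  put brand model description year price sale;
run;
```

The description should come out wrapped in quotes with its embedded quotes doubled, in the same form as the Toyota Innova row in the corrected file above.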
I'm new to SAS. I'm trying to read a txt file where the same variables are listed in multiple columns.
The first variable is the date. The second one is time, and the last one is Blood Glucose. Thanks a lot for your kindness and help.
Sincerely
Wilson
The data can be read using a list input statement with the : (format modifier) and @@ (line hold) features specified.
glucose-readings.txt (data file)
01jan16 14:46 89 03jan16 11:27 103 04jan16 09:40 99
05jan16 09:46 105 11jan16 10:58 108 13jan16 10:32 109
14jan16 10:49 90 18jan16 09:32 110 25jan16 10:37 100
Sample program
data want;
infile "c:\temp\glucose-readings.txt";
input
datepart :date9.
timepart :time5.
glucose
@@;
datetime = dhms(datepart,0,0,timepart);
format
datepart date9.
timepart time5.
datetime datetime19.
glucose 3.
;
run;
proc print; run;
From the documentation INPUT Statement: List
:
... For a numeric variable, this format modifier reads the value from the next non-blank column until the pointer reaches the next blank column or the end of the data line, whichever comes first.
...
@@
holds an input record for the execution of the next INPUT statement across iterations of the DATA step. This line-hold specifier is called double trailing @.
...
Tip: The double trailing @ is useful when each input line contains values for several observations.
Be sure to read the documentation; that is where you will find detailed explanations and useful examples.
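For trying the double trailing @ without an external file, the same technique can be sketched with in-stream data. The data set name readings is an assumption for this illustration; the values are the first two lines of the sample file above.

```sas
/* Sketch: several observations per input line, read with a double trailing @ */
data readings;
  input datepart :date9.
        timepart :time5.
        glucose
        @@;                       /* hold the line for the next DATA step iteration */
  format datepart date9. timepart time5.;
  datalines;
01jan16 14:46 89 03jan16 11:27 103 04jan16 09:40 99
05jan16 09:46 105 11jan16 10:58 108 13jan16 10:32 109
;
run;

proc print data=readings;
run;
```

Each date/time/glucose triple should become its own observation. With a single trailing @ instead, the line would be released at the end of each iteration and only the first triple on each line would be read.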
I am taking an intro to SAS course and am having serious issues reading some data in.
The code I have so far is as follows (the assignment tells us to use column input):
DATA shirtCol;
INPUT Name $ 1-6 Color $ 8-13 Price 15-19 ShippingCost 21-24;
DATALINES;
 Large  Red    18.97 0.25
 Medium Blue   24.68 1.10
 XLarge Black  29.99 1.75
 Small  Orange 15.89 0.50
;
RUN;
PROC print data=shirtCol;
RUN;
I am using SAS university edition to run this code and when I run the program the Price and Shipping column only have one number after the decimal point. Is there anything I am doing wrong? How can I make it so the program no longer truncates my output?
It seems that SAS Studio counts the leading blank spaces as columns. Try removing the leading blanks from your data lines:
DATA shirtCol;
INPUT Name $ 1-6 Color $ 8-13 Price 15-19 ShippingCost 21-24;
DATALINES;
Large  Red    18.97 0.25
Medium Blue   24.68 1.10
XLarge Black  29.99 1.75
Small  Orange 15.89 0.50
;
RUN;
Count the number of characters in your data lines to see why it is not reading what you want. Either correct the column numbers in your INPUT statement or remove the extra blanks you seem to have added to every line. Inserting a ruler line can help.
DATA shirtCol;
INPUT Name $ 1-6 Color $ 8-13 Price 15-19 ShippingCost 21-24;
*---+---10----+---20----+---30;
DATALINES;
Large  Red    18.97 0.25
Medium Blue   24.68 1.10
XLarge Black  29.99 1.75
Small  Orange 15.89 0.50
;
RUN;
Using a consistent indentation style can avoid this type of issue. I never indent the DATALINES; statement. Actually I never use DATALINES; statements, but I never indent CARDS; statements and I use them a lot. :)
If that doesn't work then make sure you haven't introduced tabs to replace spaces in your data. I think that SAS/Studio might have trouble with actually storing the tabs in the code lines that it sends to SAS to execute. You can add an INFILE statement with the EXPANDTABS option to have SAS expand the tabs into spaces for you. While you are at it you might want to add the TRUNCOVER option also.
infile datalines expandtabs truncover ;
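As a sketch of where that INFILE statement would sit in the full step (assuming the data lines start in column 1 and are aligned to the column ranges in the INPUT statement):

```sas
DATA shirtCol;
  /* Read the in-stream data, expanding any tabs and tolerating short lines */
  INFILE DATALINES EXPANDTABS TRUNCOVER;
  INPUT Name $ 1-6 Color $ 8-13 Price 15-19 ShippingCost 21-24;
DATALINES;
Large  Red    18.97 0.25
Medium Blue   24.68 1.10
XLarge Black  29.99 1.75
Small  Orange 15.89 0.50
;
RUN;

PROC PRINT DATA=shirtCol;
RUN;
```

The INFILE statement has to come before the INPUT statement so that its options apply when the data lines are read.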
I am currently using SAS version 9 to try and read in a flat file in .txt format of a HTML table that I have taken from the following page (entitled Wayne Rooney's Match History):
http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney
I've got the data into a .txt file using a Python web scraper built with Scrapy. My .txt file looks like this:
17-08-2013,1 : 4,Swansea,Manchester United,28',7.26,Assist Assist,26-08-2013,0 : 0,Manchester United,Chelsea,90',7.03,None,14-09-2013,2 : 0,Manchester United,Crystal Palace,90',8.44,Man of the Match Goal,17-09-2013,4 : 2,Manchester United,Bayer Leverkusen,84',9.18,Goal Goal Assist,22-09-2013,4 : 1,Manchester City,Manchester United,90',7.17,Goal Yellow Card,25-09-2013,1 : 0,Manchester United,Liverpool,90',None,Man of the Match Assist,28-09-2013,1 : 2,Manchester United,West Bromwich Albion,90'...
...and so on. What I want is a dataset that has the same format as the original table. I know my way round SAS fairly well, but tend not to use infile statements all that much. I have tried a few variations on a theme, but this syntax has got me the nearest to what I want so far:
filename myfile "C:\Python27\Football Data\test.txt";
data test;
length date $10.
score $6.
home_team $40.
away_team $40.
mins_played $3.
rating $4.
incidents $40.;
infile myfile DSD;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ ;
run;
This returns a dataset with only the first row of the table included. I have tried using fixed widths and pointers to set the dataset dimensions, but because the length of things like team names can change so much, this is causing the data to be reassembled from the flat file incorrectly.
I think I'm most of the way there, but can't quite crack the last bit. If anyone knows the exact syntax I need that would be great.
Thanks
I would read it straight from the web, with something like the code below. It works about 50%, but it took all of ten minutes to write, and I'm sure it could easily be improved.
The basic approach is to use @'string' to read the text that follows a given string. You might be better off reading the page in as a byte stream, doing a regular expression match on <tr> ... </tr>, and then parsing each match, rather than using the more brute-force method shown here.
filename rooney url "http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney" lrecl=32767;
data rooney;
infile rooney scanover;
retain are_reading;
input @;
if find(_infile_,'<table id="player-fixture" class="grid fixture">')
then are_reading=1;
if find(_infile_,'</table>') then are_reading=0;
if are_reading then do;
input @'<td class="date">' date ddmmyy10.
@'class="team-link">' home_team $20.
@'class="result-1 rc">' score $10.
@'class="team-link">' away_team $20.
@'title="Minutes played in this match">' mins_played $10.
@'title="Rating in this match">' rating $6.
;
output;
end;
run;
As far as reading the scrapy output, you should change at least two things:
Add the delimiter. Not truly necessary, but I'd consider the code incorrect without it, unless delimiter is space.
Add a trailing "@@" to get SAS to hold the line pointer, since you don't have line feeds in your data.
data want;
infile myfile flowover dlm=',' dsd lrecl=32767;
length date $10.
score $6.
home_team $40.
away_team $40.
mins_played $3.
rating $4.
incidents $40.;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ @@;
run;
Flowover is default, but I like to include it to make it clear.
You also probably want to input the date as a date value (not char), so informat date ddmmyy10.;. The rating is also easily input as numeric if you want to, and both mins played and score could be input as numeric if you're doing analysis on those by adding ' and : to the delimiter list.
Finally, your . on length is incorrect; SAS is nice enough to ignore it, but . is only placed like so for formats.
Here's my final code:
data want;
infile "c:\temp\test2.txt" flowover dlm="',:" lrecl=32767;
informat date ddmmyy10.
score_1 score_2 2.
home_team $40.
away_team $40.
mins_played 3.
rating 4.2
incidents $40.;
input date
score_1
score_2
home_team $
away_team $
mins_played
rating ??
incidents $ @@;
run;
I removed the DSD option because it is incompatible with using ' as a delimiter; if DSD is actually needed, you can add it back, remove that delimiter, and read the minutes in as character. I added ?? for rating because it is sometimes "None", and ?? suppresses the invalid-data messages for those values.