SAS replacing string with string values within a table

I have this CSV dataset named Cars:
Brand, Model, Description, Year, Price, Sale
Toyota, Wish, "1.5, used""", 2018, 120000, 0
Lexus, LX300, "2.0 4wd, leather seat 15"", 2021, 23000, 0
Toyota, Innova, "2.0, 7 seater 4wd, "full spec 12""", 2007, 5000, 0
Honda, CRV, "2.5, 4wd", 2021, 11500, 0
Nissan, Serena, "7 seater, hybrid, used", 2019, 14400, 0
Hyundai, Elantra, "5 seater, turbo used", 2020, 13210, 0
I tried to replace , and " under description so that SAS can read it correctly.
FILENAME cars '....Cars.csv';
data cars_out;
infile cars dlm=',' firstobs=2 truncover;
format Brand $7. Model $7. Description $334. Year 4. Price 5. Sale 1.;
input @;
Year= translate(Year,' ','",');
input Brand Model Description Year Price Sale;
run;
But this doesn't work. Any clue why?

Your csv file is invalid. It has unbalanced quotation marks and embedded commas, causing a problem. If you open the file up in Excel, you will see it is invalid.
After manually fixing the file within Excel, the correct .csv should look as such:
Data:
Brand, Model, Description, Year, Price, Sale
Toyota, Wish," ""1.5 used""""""",2018,120000,0
Lexus, LX300," ""2.0 4wd leather seat 15""""",2021,23000,0
Toyota, Innova," ""2.0, 7 seater 4wd ""full spec 12""""""",2007,5000,0
Honda, CRV," ""2.5, 4wd""",2021,11500,0
Nissan, Serena," ""7 seater, hybrid used""",2019,14400,0
Hyundai, Elantra," ""5 seater, turbo used""",2020,13210,0
After balancing the quotes, SAS can read this correctly using your provided code. The only change I'd make is to increase the formatted length of price to 8 so it is not shown in scientific notation.

Your approach cannot work. You cannot change the value of YEAR before you have read anything into YEAR. You need to fix the text in the file for SAS (or anyone) to be able to parse it.
Best would be to re-create the file from the original data using proper quoting to begin with. When generating a CSV (or any delimited file) you should add quotes around values that contain the delimiter or quotes. Any actual quotes in the value should be doubled to make clear they are not the closing quote. SAS will do this automatically when you use the DSD option on the FILE statement.
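As a rough illustration of the DSD behaviour described above, here is a minimal sketch. The data set cars_src, its single row, and the fileref out_csv are made up for the example; point the fileref at a real path in practice.

```sas
/* Hypothetical sample row: the description contains commas and quotes */
data cars_src;
  length brand model $20 description $100;
  brand = 'Toyota';
  model = 'Innova';
  description = '2.0,7 seater 4wd,"full spec 12"';
  year = 2007;
  price = 5000;
  sale = 0;
run;

filename out_csv temp;  /* temporary file, an assumption for this sketch */

data _null_;
  set cars_src;
  file out_csv dsd;  /* DSD: comma delimiter plus automatic quoting */
  if _n_ = 1 then put 'Brand,Model,Description,Year,Price,Sale';
  put brand model description year price sale;
run;
```

With DSD in effect the description should come out wrapped in quotes with the embedded quotes doubled, which is the form a CSV reader expects.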
With your example you might be able to peel off the beginning and the end since they do not appear to have issues. Then whatever is left is the value of DESCRIPTION.
data want;
infile cars dlm=',' truncover firstobs=2;
length brand model $20 description $400 year price sale 8 dummy $400;
input brand model @;
nwords=countw(_infile_,',');
do i=1 to nwords-5;
input dummy @;
description=catx(',',description,dummy);
end;
input year price sale;
drop i dummy;
run;
You could probably clean things up more:
if '""'=char(description,1)||char(description,length(description)) then do;
description=substrn(description,2,length(description)-2);
description=tranwrd(description,'""','"');
if countc(description,'"')=1 then description=compress(description,'"');
end;
So with that cleaned up version your source file SHOULD have looked like this for it to be properly parsed as a comma delimited file.
Brand,Model,Description,Year,Price,Sale
Toyota,Wish,"1.5,used",2018,120000,0
Lexus,LX300,"2.0 4wd,leather seat 15",2021,23000,0
Toyota,Innova,"2.0,7 seater 4wd,""full spec 12""",2007,5000,0
Honda,CRV,"2.5,4wd",2021,11500,0
Nissan,Serena,"7 seater,hybrid,used",2019,14400,0
Hyundai,Elantra,"5 seater,turbo used",2020,13210,0

Related

How to create a new variable of age based upon an existing numeric date born variable in sas?

I want to create a numeric age variable from an existing numeric date-of-birth variable in SAS. This "BORN" variable is numeric with a length of 8 and the format MMDDYY10. I assume I should use age = today's date - BORN date. However, the BORN values look like -15226, -8803, ... and I don't understand why there is a minus sign in front of these numbers. So what is the code to convert them to an actual age?
I don't understand why there is a minus sign before the born date number. How do I subtract the patient's born date from today's date?
SAS stores dates and times as numbers. A date is the number of days between January 1, 1960 and the given date, so dates before 1960 are negative. To display a date in human-readable form, apply a format (for example MMDDYY10.).
Similarly, a time is the number of seconds since midnight of the current day, so SAS time values are between 0 and 86400.
Your code would look like this:
data have;
input born MMDDYY10.;
format born MMDDYY10.;
datalines;
03/17/2000
11/11/1988
08/11/1923
;
run;
data want;
set have;
age = floor((DATE()-born) / 365.25);
run;
SAS will correctly translate your input (if you correctly used your formats) into numbers, which are easy for a program to calculate with.
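As an alternative to dividing by 365.25, the YRDIF function can count the years directly. This is only a sketch: it assumes the have data set from above and a SAS release that supports the 'AGE' basis (9.3 or later, as far as I know); want_age is a made-up output name.

```sas
data want_age;
  set have;
  /* 'AGE' basis: years between the two dates, accounting for leap years */
  age = floor(yrdif(born, today(), 'AGE'));
run;

/* Quick check of how SAS numbers dates around the 1960 origin */
data _null_;
  d0 = '01JAN1960'd;  /* day zero */
  d1 = '31DEC1959'd;  /* one day earlier, so negative */
  put d0= d1=;
run;
```

The second step should write d0=0 d1=-1 to the log, which is why birth dates before 1960 show up as negative numbers.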

Changing the first row name conditionally on character interval in SAS

Consider the following data:
data GDP;
input Year $ Agriculture Industry;
datalines;
2016 195 1634
2017 220 1986
;
When exporting as a .dat file:
proc export
data = GDP
outfile = '....\GDP.dat'
dbms = TAB
replace;
run;
Then I get the following file:
However, I want the following file:
Where:
Mydata is a text I manually add.
The number after, for instance, Year (that is, Year: 1-4) is the character interval that the values fall within. For instance, the values in the Year column run from character 1 to 4. The values in the Agriculture column go from 9 to 11, and so on.
So SAS should count the interval for the values and add it to the first row name. How to do it in SAS?
You can fudge this with labels to your variables and then add the LABEL option to PROC EXPORT.
data GDP;
input Year $ Agriculture Industry;
label Year = "Mydata, Year:1-4" Agriculture = "Agriculture:9-11";
datalines;
2016 195 1634
2017 220 1986
;
run;
proc export
data = GDP
outfile = '....\GDP.dat'
dbms = TAB
LABEL
replace;
run;
FYI - it looks like you're trying to create a fixed width file and put the specifications in the header. I'd advise against this; either put the specifications in a separate file or include them at the top of the file instead.
Putting it in the header makes it harder for any other system to process correctly.
If you really need this for some reason, you may also want to consider using a data step to create your export instead of using PROC EXPORT.
AFAIK there is no easy way to define the specifications automatically though you could push the PROC CONTENTS output to a separate data set.
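For the PROC CONTENTS route, a minimal sketch might look like this. It assumes the GDP data set from above; gdp_vars is a hypothetical name for the output data set.

```sas
/* Write the variable metadata for GDP to a data set instead of a report */
proc contents data=GDP out=gdp_vars(keep=name type length varnum) noprint;
run;

/* Put the variables back in their original column order */
proc sort data=gdp_vars;
  by varnum;
run;

proc print data=gdp_vars;
run;
```

Note that LENGTH here is the storage length of each variable, not its column position in the exported file, so the character intervals would still have to be derived from it.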

SAS Reading CSV file using input statement

I am using this code to import csv file in sas
data retail;
infile "C:\users\Documents\training\Retails_csv" DSD MISSOVER FIRST OBS =2;
INPUT Supplier :$32. Item_Category :$32. Month :$3. Cost :DOLLAR10. Revenue :DOLLAR10. Unit_Price :DOLLAR10.2 Units_Availed :8. Units_Sold :8.;run;
I need to get Cost, Revenue and Unit_Price in $ format in the output SAS data set.
In my dataset I need the same Cost, Revenue and Unit_Price values shown in DOLLAR format.
Could someone please help?
Thanks
The INFILE statement does not take the REPLACE keyword. In fact since INFILE is just saying where the input data is coming from there isn't really any logical feature that the INFILE statement might have that would use that name.
You need to attach the DOLLAR format to the variables if you want SAS to print the values using dollar signs and thousands separators. You can either attach the format in the data step or in the steps that print the data.
format cost revenue unit_price dollar10. ;
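If you would rather not store the format with the data set, the second option (attaching it in the step that prints) could be sketched like this, assuming the retail data set from the question was read successfully:

```sas
/* The format applies only to this report; the stored data set is unchanged */
proc print data=retail;
  format cost revenue unit_price dollar10.;
run;
```

If you want cents shown for unit_price, a width with decimals such as dollar10.2 is one option for that variable.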

How do I change a date formatted like this "2017-03-24" into a proper date format in SAS?

I need to change what it is currently at into the date9 format.
input Company_Name$ 1-58 Form_Type$ 59-70 CIK Date_filed$ 86-96 File_Name$ 118-141;
length Date_filed $10.;
format Date_filed yymmdd10.;
ndate=put(Date_filed, yymmdd10.);
this is what I have, but it doesn't work. I am not sure how to do this.
SAS stores dates in numeric variables as the number of days since 1960. Once you have a valid date value you can attach any of the many formats that change how the value is displayed. It is best to read directly from the source file into a date value using an INFORMAT that is appropriate for the value. Note that you cannot combine an informat with the column range in the input statement. Instead you can use @ to position at the first column of the date string and then use formatted input. Also you did not tell INPUT what columns to read for the variable CIK.
input Company_Name $ 1-58 Form_Type $ 59-70
CIK 71-85 @86 Date_filed yymmdd10.
File_Name $ 118-141
File_Name $ 118-141
;
format Date_filed date9.;
Note you were reading 11 columns before but date strings in YYYY-MM-DD style only need 10 columns, so check if your date strings are in columns 86-95 or columns 87-96. If you don't read the right columns you might miss the second digit of the day of the month or the first digit of the year.
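To check the informat and format pairing without the source file, a small test with the INPUT function can help. This is only a sketch using the literal date string from the question title:

```sas
data _null_;
  /* YYMMDD10. informat turns the text into a numeric SAS date */
  date_filed = input('2017-03-24', yymmdd10.);
  /* DATE9. format controls only how that number is displayed */
  put date_filed= date9.;
run;
```

The log should show date_filed=24MAR2017.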

Reconstitute .txt file of HTML table as Dataset in SAS

I am currently using SAS version 9 to try and read in a flat file in .txt format of a HTML table that I have taken from the following page (entitled Wayne Rooney's Match History):
http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney
I've got the data into a .txt file using a Python web scraper built with Scrapy. The format of my .txt file is like this:
17-08-2013,1 : 4,Swansea,Manchester United,28',7.26,Assist Assist,26-08-2013,0 : 0,Manchester United,Chelsea,90',7.03,None,14-09-2013,2 : 0,Manchester United,Crystal Palace,90',8.44,Man of the Match Goal,17-09-2013,4 : 2,Manchester United,Bayer Leverkusen,84',9.18,Goal Goal Assist,22-09-2013,4 : 1,Manchester City,Manchester United,90',7.17,Goal Yellow Card,25-09-2013,1 : 0,Manchester United,Liverpool,90',None,Man of the Match Assist,28-09-2013,1 : 2,Manchester United,West Bromwich Albion,90'...
...and so on. What I want is a dataset that has the same format as the original table. I know my way round SAS fairly well, but tend not to use infile statements all that much. I have tried a few variations on a theme, but this syntax has got me the nearest to what I want so far:
filename myfile "C:\Python27\Football Data\test.txt";
data test;
length date $10.
score $6.
home_team $40.
away_team $40.
mins_played $3.
rating $4.
incidents $40.;
infile myfile DSD;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ ;
run;
This returns a dataset with only the first row of the table included. I have tried using fixed widths and pointers to set the dataset dimensions, but because the length of things like team names can change so much, this is causing the data to be reassembled from the flat file incorrectly.
I think I'm most of the way there, but can't quite crack the last bit. If anyone knows the exact syntax I need that would be great.
Thanks
I would read it straight from the web. Something like this; it works about 50% but took a whole ten minutes to write, so I'm sure it could be easily improved.
Basic approach is you use @'string' to read in text following a string. You might be better off reading this in as a bytestream and doing a regular expression match on <tr> ... </tr> and then parsing that rather than taking the sort of more brute force method here.
filename rooney url "http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney" lrecl=32767;
data rooney;
infile rooney scanover;
retain are_reading;
input @;
if find(_infile_,'<table id="player-fixture" class="grid fixture">')
then are_reading=1;
if find(_infile_,'</table>') then are_reading=0;
if are_reading then do;
input @'<td class="date">' date ddmmyy10.
@'class="team-link">' home_team $20.
@'class="result-1 rc">' score $10.
@'class="team-link">' away_team $20.
@'title="Minutes played in this match">' mins_played $10.
@'title="Rating in this match">' rating $6.
;
output;
end;
run;
As far as reading the scrapy output, you should change at least two things:
Add the delimiter. Not truly necessary, but I'd consider the code incorrect without it, unless delimiter is space.
Add a trailing "@@" to get SAS to hold the line pointer, since you don't have line feeds in your data.
data want;
infile myfile flowover dlm=',' dsd lrecl=32767;
length date $10.
score $6.
home_team $40.
away_team $40.
mins_played $3.
rating $4.
incidents $40.;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ @@;
run;
Flowover is default, but I like to include it to make it clear.
You also probably want to input the date as a date value (not char), so informat date ddmmyy10.;. The rating is also easily input as numeric if you want to, and both mins played and score could be input as numeric if you're doing analysis on those by adding ' and : to the delimiter list.
Finally, your . on length is incorrect; SAS is nice enough to ignore it, but . is only placed like so for formats.
Here's my final code:
data want;
infile "c:\temp\test2.txt" flowover dlm="',:" lrecl=32767;
informat date ddmmyy10.
score_1 score_2 2.
home_team $40.
away_team $40.
mins_played 3.
rating 4.2
incidents $40.;
input date
score_1
score_2
home_team $
away_team $
mins_played
rating ??
incidents $ @@;
run;
I removed the DSD option as that's incompatible with the ' delimiter; if DSD is actually needed then you can add it back, remove that delimiter, and read minutes in as char. I added ?? for rating as it is sometimes "None", and ?? suppresses the invalid-data messages about that.
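To see what the ?? modifier does in isolation, here is a small sketch using the INPUT function, which accepts the same modifier. The literal values are made up for the test:

```sas
data _null_;
  /* "None" is not a valid number: ?? keeps the log quiet and leaves _ERROR_ at 0 */
  rating_bad = input('None', ?? 4.2);
  /* A normal value is read as usual */
  rating_ok = input('7.26', ?? 4.2);
  put rating_bad= rating_ok=;
run;
```

rating_bad should come back as missing (.) with no invalid-data note in the log, which is what you want when "None" is an expected value rather than an error.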