I have this CSV dataset named Cars:
Brand, Model, Description, Year, Price, Sale
Toyota, Wish, "1.5, used""", 2018, 120000, 0
Lexus, LX300, "2.0 4wd, leather seat 15"", 2021, 23000, 0
Toyota, Innova, "2.0, 7 seater 4wd, "full spec 12""", 2007, 5000, 0
Honda, CRV, "2.5, 4wd", 2021, 11500, 0
Nissan, Serena, "7 seater, hybrid, used", 2019, 14400, 0
Hyundai, Elantra, "5 seater, turbo used", 2020, 13210, 0
I tried to replace the commas and double quotes inside Description so that SAS can read it correctly.
FILENAME cars '....Cars.csv';
data cars_out;
infile cars dlm=',' firstobs=2 truncover;
format Brand $7. Model $7. Description $334. Year 4. Price 5. Sale 1.;
input @;
Year= translate(Year,' ','",');
input Brand Model Description Year Price Sale;
run;
But this doesn't work. Any clue as to why?
Your csv file is invalid. It has unbalanced quotation marks and embedded commas, causing a problem. If you open the file up in Excel, you will see it is invalid.
After manually fixing the file in Excel, the corrected .csv looks like this:
Data:
Brand, Model, Description, Year, Price, Sale
Toyota, Wish," ""1.5 used""""""",2018,120000,0
Lexus, LX300," ""2.0 4wd leather seat 15""""",2021,23000,0
Toyota, Innova," ""2.0, 7 seater 4wd ""full spec 12""""""",2007,5000,0
Honda, CRV," ""2.5, 4wd""",2021,11500,0
Nissan, Serena," ""7 seater, hybrid used""",2019,14400,0
Hyundai, Elantra," ""5 seater, turbo used""",2020,13210,0
After balancing the quotes, SAS can read this correctly using your provided code. The only change I'd make is to increase the formatted length of price to 8 so it is not shown in scientific notation.
Your approach cannot work. You cannot change the value of YEAR before you have read anything into YEAR. You need to fix the text in the file for SAS (or anyone) to be able to parse it.
Best would be to re-create the file from the original data using proper quoting to begin with. When generating a CSV (or any delimited file) you should add quotes around values that contain the delimiter or quotes. Any actual quotes in the value should be doubled to make clear they are not the closing quote. SAS will do this automatically when you use the DSD option on the FILE statement.
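As a rough illustration of that last point, here is a minimal sketch of writing a value that contains both commas and quotes with the DSD option on the FILE statement. The dataset name `cars_src`, the fileref `fixed`, and the sample row are assumptions made up for this example, not part of the original data flow.

```sas
/* Hypothetical fileref and dataset names, used only for illustration */
filename fixed temp;

data cars_src;
  length Brand Model $20 Description $100;
  Brand = 'Toyota';
  Model = 'Innova';
  /* value with embedded commas and embedded double quotes */
  Description = '2.0,7 seater 4wd,"full spec 12"';
  Year = 2007; Price = 5000; Sale = 0;
run;

data _null_;
  set cars_src;
  file fixed dsd;   /* DSD: comma delimiter, quote values where needed */
  put Brand Model Description Year Price Sale;
run;

/* Echo the generated line to the log to inspect the quoting */
data _null_;
  infile fixed;
  input;
  put _infile_;
run;
```

The last step only prints the raw line to the log so you can check how the Description value was quoted before trying to read the file back.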
With your example you might be able to peel off the beginning and the end, since they do not appear to have issues. Then whatever is left is the value of DESCRIPTION.
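data want;
infile cars dlm=',' truncover firstobs=2;
length brand model $20 description $400 year price sale 8 dummy $400;
/* read the two leading fields and hold the line with a trailing @ */
input brand model @;
/* count the comma-separated pieces; all but 5 of them belong to DESCRIPTION */
nwords=countw(_infile_,',');
do i=1 to nwords-5;
input dummy @;
description=catx(',',description,dummy);
end;
/* the last three pieces are the trailing numeric fields */
input year price sale;
drop i dummy;
run;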
Results
You could probably clean things up more:
if '""'=char(description,1)||char(description,length(description)) then do;
description=substrn(description,2,length(description)-2);
description=tranwrd(description,'""','"');
if countc(description,'"')=1 then description=compress(description,'"');
end;
So with that cleaned up version your source file SHOULD have looked like this for it to be properly parsed as a comma delimited file.
Brand,Model,Description,Year,Price,Sale
Toyota,Wish,"1.5,used",2018,120000,0
Lexus,LX300,"2.0 4wd,leather seat 15",2021,23000,0
Toyota,Innova,"2.0,7 seater 4wd,""full spec 12""",2007,5000,0
Honda,CRV,"2.5,4wd",2021,11500,0
Nissan,Serena,"7 seater,hybrid,used",2019,14400,0
Hyundai,Elantra,"5 seater,turbo used",2020,13210,0
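Assuming the file has been corrected to the form shown above, and that the `cars` fileref from the question points at it, a minimal sketch of reading it could look like this. The variable lengths are guesses chosen to be comfortably large.

```sas
data cars_out;
  /* DSD: comma-delimited, quoted values may contain commas,
     and doubled quotes inside a quoted value become one quote */
  infile cars dsd truncover firstobs=2;
  length Brand Model $20 Description $400;
  input Brand Model Description Year Price Sale;
run;
```

With properly quoted input no pre-processing of Description should be needed, since DSD handles the embedded commas and quotes during the read.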
I'm having some trouble using the 'Compress' function in SAS. My aim is to remove any alphabetical characters from the 'Comments' field in this data step.
However, when I execute the code below, the 'NewPrice' field shows FALSE rather than the expected value.
data WORK.basefile2;
length TempFileName Filenameused $300.;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile CSV2 delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=3 Filename=TempFileName;
Filenameused= TRANWRD(substr(TempFileName, 48), ".xlsx.csv", "");
informat customer_id $8.;
informat Name $50.;
informat Reco_issue $50.;
informat Reco_action $50.;
informat ICD $50.;
informat Start_date $50.;
format customer_id $8.;
format Name $50.;
format Reco_issue $50.;
format Reco_action $50.;
format ICD $50.;
format Start_date $50.;
format Comments $255.;
format Description $255.;
NewPrice=COMPRESS(Comments, '', 'kd');
input
customer_id $
Name $
'Total Spend'n $
Template $
Product_id $
Description $
Start_date $
CD_Sales
CD_Lines
CD_Level $
CD_Price $
CD_uom $
CD_Discount $
Reco_issue $
Reco_action $
Reco_price $
Reco_discount $
Impact_£
Impact
Agree $
Comments $
ICD $
Structure_Code $
Deal_type $
NewPrice;
run;
Output when code is executed
Sample data (comma delimited):
ASDFGH,TEST,"£31,333.00",15AH,156907,TEST,18/10/2016,"£4,003.10",222,5,£5.19,M,,Below Hard Floor,Change Rate,£6.63,,£0.48,21.72%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,475266,TEST,11/11/2016,£49.61,29,5,£2.52,EA,,At Hard Floor,Change Rate,£6.36,,£1.28,60.38%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,404740,TEST,21/09/2017,£38.69,1,5,£116.07,EA,,Below Hard Floor,Change Rate,£163.80,,£15.91,29.14%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476557,TEST,11/11/2016,£32.13,25,5,£1.32,EA,,Below Hard Floor,Change Rate,£2.76,,£0.48,52.17%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476553,TEST,11/11/2016,£29.17,11,5,£1.29,EA,,Below Hard Floor,Change Rate,£3.39,,£0.70,61.95%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476557,TEST,11/11/2016,£17.61,5,5,£3.96,EA,,Below Hard Floor,Change Rate,£9.69,,£1.91,59.13%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,475261,TEST,11/11/2016,£16.70,4,5,£10.92,EA,,Below Hard Floor,Change Rate,£26.67,,£5.25,59.06%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476546,TEST,11/11/2016,£15.73,10,5,£0.96,EA,,Below Hard Floor,Change Rate,£2.67,,£0.57,64.04%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476549,TEST,11/11/2016,£5.84,3,5,£1.86,EA,,At Hard Floor,Change Rate,£6.00,,£1.38,69.00%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,477340,TEST,11/11/2016,£3.75,2,5,£4.11,EA,,Below Hard Floor,Change Rate,£11.40,,£2.43,63.95%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",,259738,TEST,13/01/2018,"£45,173.66",403,5,£10.35,EA,20,Below Hard Floor,Change Rate,£10.80,,£0.15,4.17%,N,New Prices Agreed £3.52
ASDFGH,TEST,"£31,333.00",,297622,TEST,13/01/2018,£736.60,5,5,£10.95,EA,20,Below Hard Floor,Change Rate,£11.46,,£0.17,4.45%,N,New Prices Agreed £3.75
ASDFGH,TEST,"£31,333.00",,105384,TEST,19/07/2017,£223.44,1,5,£11.25,BG,42.5,Below Hard Floor,Change Rate,£11.49,,£0.08,2.09%,N,New Prices Agreed £3.76
Any help would be greatly appreciated!
Thanks,
Henry
First off, I assume you mean that NEWPRICE=' ', not FALSE, since SAS doesn't have a "FALSE" per se. ' ' would be treated as FALSE in a boolean expression, though.
COMPRESS is returning ' ' here because the value in COMMENTS consists entirely of letters. Your COMPRESS arguments ask it to keep digits and spaces ('' is still a single space, even if it doesn't look like it; leave the argument out entirely if you don't want to keep spaces). Most records have no digits in the COMMENTS field, so only spaces are left, and SAS treats an all-blank character value as missing.
For the other records, the problem is that you're keeping spaces, so the result won't convert neatly into a number. You'll want to use INPUT, and probably not keep spaces, to get the value you want. (You probably also want to keep the decimal point, no?) Or make NEWPRICE a character field.
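To compare the variants described above side by side, here is a small sketch. The variable names `keep_space`, `keep_digits`, and `keep_dot` are made up for this illustration, and the sample comment is taken from the question's data.

```sas
data _null_;
  length Comments keep_space keep_digits keep_dot $60;
  Comments = 'New Prices Agreed £3.76';

  /* '' is treated as one blank, so with 'k' blanks are kept along with digits */
  keep_space  = compress(Comments, '', 'kd');

  /* second argument omitted: only digits are kept */
  keep_digits = compress(Comments, , 'kd');

  /* keep digits and the decimal point, then convert to a number */
  keep_dot    = compress(Comments, '.', 'kd');
  NewPrice    = input(keep_dot, best32.);

  put keep_space= keep_digits= keep_dot= NewPrice=;
run;
```

Checking the log output for each variant makes it easier to see which argument combination leaves a string that INPUT can convert.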
Finally, your line:
NewPrice=COMPRESS(Comments, '', 'kd');
is before the input statement, which is a problem - the value for comments isn't defined yet when it runs.
This works, for example. Note I don't understand why you have NewPrice listed in input (and some other fields that won't have definitions either)...
data WORK.basefile2;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile datalines delimiter = ',' MISSOVER DSD ;
informat customer_id $8.;
informat Name $50.;
informat Reco_issue $50.;
informat Reco_action $50.;
informat ICD $50.;
informat Start_date $50.;
format customer_id $8.;
format Name $50.;
format Reco_issue $50.;
format Reco_action $50.;
format ICD $50.;
format Start_date $50.;
format Comments $255.;
format Description $255.;
input
customer_id $
Name $
'Total Spend'n $
Template $
Product_id $
Description $
Start_date $
CD_Sales
CD_Lines
CD_Level $
CD_Price $
CD_uom $
CD_Discount $
Reco_issue $
Reco_action $
Reco_price $
Reco_discount $
Impact_Pound
Impact
Agree $
Comments $
ICD $
Structure_Code $
Deal_type $
;
NewPrice=input(COMPRESS(Comments, '.', 'kd'),best32.);
datalines;
ASDFGH,TEST,"£31,333.00",15AH,156907,TEST,18/10/2016,"£4,003.10",222,5,£5.19,M,,Below Hard Floor,Change Rate,£6.63,,£0.48,21.72%,N,New Prices Agreed £3.76
ASDFGH,TEST,"£31,333.00",15AH,475266,TEST,11/11/2016,£49.61,29,5,£2.52,EA,,At Hard Floor,Change Rate,£6.36,,£1.28,60.38%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,404740,TEST,21/09/2017,£38.69,1,5,£116.07,EA,,Below Hard Floor,Change Rate,£163.80,,£15.91,29.14%,N,In negotiations with the customer for new prices
;;;;
run;
This seems like it should be straightforward, but I can't find how to do this in the documentation. I want to read in a comma-delimited file, but it's very wide, and I just want to read a few columns.
I thought I could do this, but the @ pointer seems to point to columns of the text rather than the column numbers defined by the delimiter:
data tmp;
infile 'results.csv' delimiter=',' MISSOVER DSD lrecl=32767 firstobs=2;
input
@1 id
@5 name $
;
run;
In this example, I want to read just what is in the 1st and 5th columns based on the delimiter, but SAS is reading what is in position 1 and position 5 of text file. So if the first line of the input file starts like this
1234567, "x", "y", "asdf", "bubba", ... more variables ...
I want id=1234567 and name=bubba, but I'm getting name=567, ".
I realize that I could read in every column and drop the ones I don't want, but there must be a better way.
Indeed, @ points to a column position in the text, not to a delimited column. The only method using standard input I've ever found is to read the unwanted fields into a throwaway variable, i.e.
input
id
blank $
blank $
blank $
name $
;
and then drop blank.
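A complete version of that idea might look like the sketch below. The inline sample rows and the lengths are assumptions; `blank` is given a length of 1 so the skipped fields take up as little space as possible.

```sas
data tmp;
  infile datalines dsd truncover;
  length name $20 blank $1;
  /* fields 2-4 are read into the same throwaway variable */
  input id blank $ blank $ blank $ name $;
  drop blank;
datalines;
12345,x,y,z,Joe
12346,x,y,z,Bob
;
run;
```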
However, there is a better solution if you don't mind writing your input differently.
data tmp;
infile datalines;
input @;
id = scan(_INFILE_,1,',');
name = scan(_INFILE_,5,',');
put _all_;
datalines;
12345,x,y,z,Joe
12346,x,y,z,Bob
;;;;
run;
It makes formatting slightly messier, as you need PUT or INPUT function calls for each variable you do not want as plain character, but it might be easier depending on your needs.
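As a sketch of what that conversion can look like, the variant below turns `id` into a number with the INPUT function. It also passes the `m` modifier to SCAN, since by default SCAN treats consecutive delimiters as one, which would shift the column numbers whenever a field is empty. The sample rows, including the empty third field, are made up for illustration.

```sas
data tmp;
  infile datalines;
  length name $20;
  input;   /* load the record into _INFILE_ without reading variables */

  /* 'm' keeps empty fields, so column 5 stays column 5 */
  id   = input(scan(_infile_, 1, ',', 'm'), best32.);
  name = scan(_infile_, 5, ',', 'm');
datalines;
12345,x,,z,Joe
12346,x,y,z,Bob
;
run;
```

If the fields can contain quoted commas, this simple approach is not enough on its own, and reading the file with DSD is likely the safer route.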
You can skip fields fairly efficiently if you know a bit of INPUT statement syntax. Note the use of (3*dummy)(:$1.), which reads three fields into the same throwaway variable. Reading just one byte of each should also improve performance slightly.
data tmp;
infile cards DSD firstobs=2;
input id $ (3*dummy)(:$1.) name $;
drop dummy;
cards;
id,x,y,z,name
1234567, "x", "y", "asdf", "bubba", ... more variables
1234567, "x", "y", "asdf", "bubba", ... more variables
run;
proc print;
run;
One more option that I thought of when answering a related question from another user.
filename tempfile temp;
data _null_;
set sashelp.cars;
file tempfile dlm=',' dsd lrecl=32767;
put (Make--Wheelbase) ($);
run;
data mydata;
infile tempfile dlm=',' dsd truncover lrecl=32767;
length _tempvars1-_tempvars100 $32;
array _tempvars[100] $;
input (_tempvars[*]) ($);
make=_tempvars[1];
type=_tempvars[3];
MSRP=input(_tempvars[6],dollar8.);
keep make type msrp;
run;
Here we use an array of effectively temporary (can't actually BE temporary, unfortunately) variables, and then grab just what we want specifying the columns. This is probably overkill for a small file - just read in all the variables and deal with it - but for 100 or 200 variables where you want just 15, 18, and 25, this might be easier, as long as you know which column you want exactly. (I could see using this in dealing with census data, for example, if you have it in CSV form. It's very common to just want a few columns most of which are way down 100 or 200 columns from the starting column.)
You have to take some care with your lengths for the temporary array (has to be as long as your longest column that you care about!), and you have to make sure not to mess up the columns since you won't get to know if you mess up unless it's obvious from the data.