Related
I have this CSV dataset named Cars:
Brand, Model, Description, Year, Price, Sale
Toyota, Wish, "1.5, used""", 2018, 120000, 0
Lexus, LX300, "2.0 4wd, leather seat 15"", 2021, 23000, 0
Toyota, Innova, "2.0, 7 seater 4wd, "full spec 12""", 2007, 5000, 0
Honda, CRV, "2.5, 4wd", 2021, 11500, 0
Nissan, Serena, "7 seater, hybrid, used", 2019, 14400, 0
Hyundai, Elantra, "5 seater, turbo used", 2020, 13210, 0
I tried to replace , and " under description so that SAS can read it correctly.
FILENAME cars '....Cars.csv';
data cars_out;
infile cars dlm=',' firstobs=2 truncover;
format Brand $7. Model $7. Description $334. Year 4. Price 5. Sale 1.;
input#;
Year= translate(Year,' ','",');
input Brand Model Description Year Price Sale;
run;
But this doesnt work? any clue on why?
Your csv file is invalid. It has unbalanced quotation marks and embedded commas, causing a problem. If you open the file up in Excel, you will see it is invalid.
After manually fixing the file within Excel, the correct .csv should look as such:
Data:
Brand, Model, Description, Year, Price, Sale
Toyota, Wish," ""1.5 used""""""",2018,120000,0
Lexus, LX300," ""2.0 4wd leather seat 15""""",2021,23000,0
Toyota, Innova," ""2.0, 7 seater 4wd ""full spec 12""""""",2007,5000,0
Honda, CRV," ""2.5, 4wd""",2021,11500,0
Nissan, Serena," ""7 seater, hybrid used""",2019,14400,0
Hyundai, Elantra," ""5 seater, turbo used""",2020,13210,0
After balancing the quotes, SAS can read this correctly using your provided code. The only change I'd make is to increase the formatted length of price to 8 so it is not shown in scientific notation.
Your approach cannot work. You cannot change the value of YEAR before you have read anything into YEAR. You need to fix the text in the file for SAS (or anyone) to be able to parse it.
Best would be to re-create the file from the original data using proper quoting to begin with. When generating a CSV (or any delimited file) you should add quotes around values that contain the delimiter or quotes. Any actual quotes in the value should be doubled to make clear they are not the closing quote. SAS will do this automatically when you use the DSD option on the FILE statement.
With your example you might be able to peal off the beginning and the end since they do not appear to have issues. Then whatever is left is the value of DESCRIPTION.
data want;
infile cars dlm=',' truncover firstobs=2;
length brand model $20 description $400 year price sale 8 dummy $400;
input brand model #;
nwords=countw(_infile_,',');
do i=1 to nwords-5;
input dummy #;
description=catx(',',description,dummy);
end;
input year price sale;
drop i dummy;
run;
Results
You could probably clean things up more:
if '""'=char(description,1)||char(description,length(description)) then do;
description=substrn(description,2,length(description)-2);
description=tranwrd(description,'""','"');
if countc(description,'"')=1 then description=compress(description,'"');
end;
So with that cleaned up version your source file SHOULD have looked like this for it to be properly parsed as a comma delimited file.
Brand,Model,Description,Year,Price,Sale
Toyota,Wish,"1.5,used",2018,120000,0
Lexus,LX300,"2.0 4wd,leather seat 15",2021,23000,0
Toyota,Innova,"2.0,7 seater 4wd,""full spec 12""",2007,5000,0
Honda,CRV,"2.5,4wd",2021,11500,0
Nissan,Serena,"7 seater,hybrid,used",2019,14400,0
Hyundai,Elantra,"5 seater,turbo used",2020,13210,0
I know that this is a very basic question, but I really am having trouble.
I wrote the following code and all I want to do is to read the data correctly, but I cannot find a way to instruct SAS to read the BMI.
There are two things I want to do.
1), Have SAS store the entire number including all the decimals.
2), when printed, I would like to approximate the number to two decimal points.
data HW02EX01;
input Patient 1-2 Weight 5-7 Height 9-10 Age 13-14 BMI 17-26 Smoking $ 29-40 Asthma $ 45-48;
cards;
14 167 70 65 23.9593878 never no
run;
Note: I left only the first observation since the display becomes really ugly and wearisome to edit by hand.
friend.
Maybe it could be useful the following code:
data HW02EX01_;
input Patient Weight Height Age BMI Smoking : $20. Asthma $10.;
format BMI 32.2;
cards;
14 167 70 65 23.9593878 never no
;
By way of comment, I would like to indicate some details:
If your input data has a fixed length, use the column reading method as you propose in your code in the input statement. If not, use a list reading entry (assuming there is a delimiter between your input data).
When SAS reads in the list form, it converts and saves all the digits of a numerical value. So you do not have to worry about reading decimals too.
To display a numerical value the way you like it, you can use the format statement to assign a representation of the value. In this case with two decimals we use the format "w.d". Where w is the total length of the number that can occur and d indicates the number of decimals to show. It should be mentioned that the formats do not change the value of the variable, only its presentation.
When using the cards statement, it is not necessary to use a run statement.
I hope it will be useful.
See the comments in the code, specifically the usage of LENGTH, FORMAT and INFORMAT statements to control the input and output appearance of data.
data HW02EX01;
*specify the length of the variables;
length patient $8. weight height age bmi 8. smoking asthma $8.;
*specify the informats of the variables;
*an informat is does the variable look like when trying to read it in;
informat patient $8. weight height age bmi best32.;
*formats control how infomraiton is displayed in output/tables;
format bmi 32.2 weight height age 12.;
input Patient $ Weight Height Age BMI Smoking Asthma ;
cards;
14 167 70 65 23.9593878 never no
;
run;
I have a dataset which has three variables: Application number, decline code and sequence. Now, there may be multiple decline code for a single application(which will have different sequence number). So the data looks like following:
Application No Decline Code Sequence
1234 FG 1
1234 FK 3
1234 AF 2
1256 AF 2
1256 FK 1
.
.
.
.
And so on
So, I have to put this in wide format such that the first column contains unique application numbers and corresponding to each of them is their decline code(I don't need sequence number, just that decline codes should appear in order of their sequence number from left to right, separated by a comma). Something like below
Application Number Decline Code
1234 FG, AF, FK
1256 FK, AF
..........
.........
And so on
Now I tried ruining proc transpose by application number on SAS. But the problem is that it creates multiple columns with all the decline codes listed and then if a certain decline code doesn't apply for an application, it will show . in that. So their are many missing values and it isn't quite the format I am expecting. Is there any way to do this in SAS or sql?
PROC TRANSPOSE can certainly help here; then you can CATX the variables together if you really just want one variable:
data have;
input ApplicationNo DeclineCode $ Sequence ;
datalines;
1234 FG 1
1234 FK 3
1234 AF 2
1256 AF 2
1256 FK 1
;;;;
run;
proc sort data=have;
by ApplicationNo Sequence;
run;
proc transpose data=have out=want_pre;
by ApplicationNo;
var DeclineCode;
run;
data want;
set want_pre;
length decline_codes $1024;
decline_codes = catx(', ',of col:);
keep ApplicationNo decline_codes;
run;
You could also do this trivially in one datastep, using first and last checks.
data want_ds;
set have;
by ApplicationNo Sequence;
retain decline_codes;
length decline_codes $1024; *or whatever you need;
if first.ApplicationNo then call missing(decline_codes);
decline_codes = catx(',',decline_codes, DeclineCode);
if last.ApplicationNo then output;
run;
A raw data file is listed below:
RANCH,1250,2,1,Sheppard Avenue, "$64,000"
SPLIT,1190,1,1,Rand Street, "$65,850"
CONDON, 1400,2,1,Market Street, "80,050"
TWOSTORY, 1810,4,3,Garris Street, "$107,250"
RANCH, 1500,3,3,Kemble Avenue, "$86,650"
SPLIT, 1615, 4,3, West Drive, "94,450"
SPLIT, 1305, 3,1.5,Graham Avenue, "$73,650"
The following is the code:
data work.condo_ranch;
infield "file_specificaton" did;
input style $ #;
if style = 'CONDO' or style = 'RANCH' then
input sqfeet bedrooms baths street $ price: dollar10.;
run;
So, I think the output dataset contains 3 observations, while the correct answer is that the output contains 7 observations. Does anyone tell me why? Many thanks for your time and attention.
Why would you expect the output dataset to have only 3 observations. There is an implied OUTPUT statement at the bottom of the DATA step. If you want to output only those records where STYLE IN ("CONDO","RANCH") you could add a conditional OUTPUT, e.g.:
if style = 'CONDO' or style = 'RANCH' then do;
input sqfeet bedrooms baths street $ price: dollar10.;
output;
end;
If you only want to output the records where style is CONDO or RANCH you could just change your THEN to a semi-colon. That would make your IF statement a subsetting IF. So the data step would return at that point and never run the second INPUT or the implied OUTPUT at the end of the step.
I have some complicated string parsing which would be very difficult to accomplish using regular SAS functions because of the string value inconsistency; as a result
I think I will need to use Perl Regular Expressions. Below have 4 variables (price, date, size, bundle) which I have to create using parts of the text string. I'm have trouble getting the syntax correct - I am new to regular expressions.
Here is a sample data set.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;run;
/The first variable is price it is normally located near the end or middle of the string/
data want;
set have;
price =(input(prxchange('s/(\w+)_(\d+)_(\w+)/$2/',-1,text),8.))/100;
format price dollar8.2;
run;
Using the data set above I need to have this result:
price
0
79.99
89.99
89.99
79.99
64.99
/Date is always a series of consecutive digits. Either 6, 7 or 8. Using | which means 'or' I thought I would be able to pull that way/
data want;
set have;
date=prxparse('/\d\d\d\d\d\d|\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d/',text);
run;
Using the data set above I need to have this result:
Date
1192014
112014
2102014
272014
12252014
462014
1192014
12162013
/* For size there is always an ‘x’ in the middle of the sub-string which is with followed by two or three digits on either side*/
data want;
set have;
size=prxparse('/(\w+)_(\d+)'x'(\d+)_(\w+)/',text);
run;
Size
728x90
160x600
300x250
160x600
728x90
/*This is normally located towards the beginning of the string. It’s always a single digit number followed by an x It in never followed by additional digits but can also be just 0. */
data want;
set have;
Bundle=prxparse('/(\d+)'x'',text);
run;
Bundle
0
3x
3x
3X
3x
0
2x
3x
The final product I am looking for should look like this:
Text Date price Size Bundle
acq_newsale_0_CartChat_0_Flash_1192014.jpg 1192014 0 0
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf 112014 79.99 3x
acq_sale_3xconoffer_8999_nacpg_2102014.sfw 2102014 89.99 3x
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp 272014 89.99 728x90 3X
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov 12252014 160x600 3x
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg 462014 300x250 0
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf 1192014 79.99 160x600 2x
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf 12162013 64.99 728x90 3
x
If you're extracting, don't use PRXCHANGE. Use PRXPARSE, PRXMATCH, and PRXPOSN.
Sample usage, with date:
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;
run;
data want;
set have;
rx_date = prxparse('~(\d{6,8})~io');
rc_date = prxmatch(rx_date,text);
if rc_date then datevar = prxposn(rx_date,1,text);
run;
Just enclose in parens the section you want to extract (in this case, all of it).
Date was easy - as you say, 6-8 numbers. The others may be harder. The 3x etc. bit you can probably find, depending on how strict you need to be; the price I think you'll have a very hard time finding. You need to be able to better articulate the rules. "Towards the beginning" isn't a regex rule. "The second set of digits" is; "The second to last set", perhaps might work. I'll see if I can figure out a few.
In your example data, this works. I in particular don't like the price search; that one may well fail with a more complicated set of data. You can figure out adding the decimal for yourself.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
blahblah :23 blahblah
blahblahblah 23 blah blah
;
run;
data want;
set have;
rx_date = prxparse('~_(\d{6,8})[_\.]~io');
rx_price = prxparse('~_(\d+)_.*?(?=_\d+[_\.]).*?(?!_\d+[_\.])~io');
rx_bundle = prxparse('~(?!_\d+_)_(\dx)~io');
rx_size = prxparse('~_(\d+x\d+)[_\.]~io');
rx_adnum = prxparse('~\s:?(\d\d)\s~io');
rc_date = prxmatch(rx_date,text);
rc_price = prxmatch(rx_price,text);
rc_bundle = prxmatch(rx_bundle,text);
rc_size = prxmatch(rx_size,text);
rc_adnum = prxmatch(rx_adnum,text);
if rc_date then datevar = prxposn(rx_date,1,text);
if rc_price then price = prxposn(rx_price,1,text);
if rc_bundle then bundle = prxposn(rx_bundle,1,text);
if rc_size then size = prxposn(rx_size,1,text);
if rc_adnum then adnum = prxposn(rx_adnum,1,text);
run;