I'm trying to create variables Cap1 through Cap6. I'm not sure how to have read them as character data. My code is:
DATA Capture;
INFILE '/folders/myfolders/sasuser.v94/Capture.txt' DLM='09'x DSD MISSOVER FIRSTOBS=2;
INPUT Sex $ AgeGroup $ Weight Cap1 - Cap6 $;
RUN;
And my issue is Cap1 through Cap5 are interpreted as numerical data. How do I solve this?
Your issue is simple: you are using a variable list, but you aren't applying the $ to the whole variable list! You need ( ) around the list and the modifier to apply it to the whole list.
See:
DATA Capture;
INFILE datalines DLM=' ' DSD;
INPUT Sex $ AgeGroup $ Weight (Cap1 - Cap6) ($);
datalines;
M 18-34 135 A B C D E F
F 35-54 115 G H I J K L
;;;;
RUN;
Indeed,
I would also expect this input statement to work as you did, but it does not. Putting a $ after Cap1 does not resolve it either, as this log shows.
26 INPUT Sex $ AgeGroup $ Weight Cap1 $ - Cap6 $;
_
22
ERROR 22-322: Expecting a name.
You can solve it
by assigning a format to your variables before reading them, for instance format Cap1 - Cap6 $2.;
To test it,
I included the data in the source file, i.e. using datalines
DATA Capture;
INFILE datalines DLM='09'x DSD missover FIRSTOBS=1;
format Sex $1. AgeGroup $9. Weight 8.2 Cap1 - Cap6 $2.;
INPUT Sex AgeGroup Weight Cap1 - Cap6;
datalines;
M 1-5 24.5 11 12 13 14 15 16
M 6-10 34.2 21 22 23 24 25 26
;
proc print;
proc contents;
RUN;
How to understand this:
SAS was originally created as a programming language for non-developers (i.c. statisticians) who rather don't care about data formats, so SAS does a lot of guess work for you (just like VBA if you don't use option explicit).
So, the first time you mention a variable name in a data step, SAS ads a variable to the Program Data Vector (PDV) with an apropriate type (numeric or charater) and length, but this is guess work.
For instance: as the first student in the test dataset CLASS included in the standard instalation of SAS is male,
data WORK.CLASS;
set sasHelp.CLASS;
select (sex);
when ('M') gender = 'male';
when ('F') gender = 'female';
otherwise gender = 'unknown';
end;
run;
results in truncating 'female' to four positions:
You can correct that by instructing sas to add the variable to the PDV beforehand.
For a character variable,
format myName $20.; and
length myName $20.; are equivalent and
informat myName $20.; is also about the same.
(The storry becomes more complex with user defined formats, though.)
For numerics, there is a huge difference:
length mySize 8.; preserves 8 bytes in the PDV for mySize
format mySize 8.; tells SAS to print or display mySize with up to 8 digits and no decimals
informat mySize $20.; tells SAS a expect 8 digits without decimals when reading mySize.
Numericals can only have certain lengths, depending on the operatin system. On windowns
8. is the default and corresponds to a double on most databases
4. corresponds to a float
3. is the minimum, which I use for booleans
Formats can be very different
format mySize 8.3; tells SAS tot print mySize with 8 characters, including 3 decimals for the fraction (which leaves room for up to 4 decimals before the decimal dot if it has a positive value. Less decimals will be printed to display larger numbers)
format mySize 8.3; tells SAS tot read mySize assuming the last 3 decimals are the fraction, so 12345678 will be interpreted as 12345.678
Then there are special formats to read and write dates, times and so on and user defined value and picture formats, but that lead me too far.
Related
What do the numbers in the grey box represent? And what's a simple way of understanding how the colon modifier affects the way sas reads in values?
The answer depends on information not provided. The answer B is the best choice in the sense that you should use the colon modifier when using informats in the INPUT statement to prevent the use of the formatted input mode instead of list input mode. Otherwise the formatted input could read too many or too few characters and also might leave the cursor in the wrong place for reading the next field.
But if you try to read that data from in-line cards it works fine for those two lines. That is because in-line data lines are padded to next multiple of 80 bytes.
If you put those lines into a file without any trailing spaces on the lines then the second line fails because there are not 10 characters to read for the last field. But if you add the TRUNCOVER option (or PAD) to the INFILE statement then it will work.
Try it yourself. TEST1 and TEST3 work. TEST2 gets a LOST CARD note.
data test1;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
cards;
Donny 5MAR2008 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
options parmcards=test;
filename test temp ;
parmcards;
Donny 5MAR2008 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
data test2;
infile test;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
run;
data test3;
infile test truncover;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
run;
With different data the first formatted input can cause trouble also. For example if the date values used only 2 digits for the year it would throw things off. So it tries to read FL as the age and then reads the first 8 characters of the salary as the STATE and just blanks as the SALARY.
data test1;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
cards;
Donny 5MAR08 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
Results:
Obs name hired age state salary
1 Donny 05MAR2008 . $43,123. .
2 Margaret 20FEB2008 43 NC 65150
This is the story
This is the input file
mukesh,04/04/15,04/06/15,125.00,333.23
vishant,04/05/15,04/07/15,200.00,200
achal,04/06/15,04/08/15,275.00,55.43
this is the import statement that I am using
data datetimedata;
infile fileref dlm=',';
input lastname$ datechkin mmddyy10. datechkout mmddyy10. room_rate equip_cost;
run;
the below is the log which shows success
NOTE: The infile FILEREF is:
Filename=\\VBOXSVR\win_7\SAS\DATA\datetime\datetimedata.csv,
RECFM=V,LRECL=256,File Size (bytes)=688,
Last Modified=13Jun2015:12:08:36,
Create Time=13Jun2015:09:13:09
NOTE: 17 records were read from the infile FILEREF.
The minimum record length was 34.
The maximum record length was 40.
NOTE: The data set WORK.DATETIMEDATA has 17 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
I have published only 3 observation here.
Now when I print the sas dataset everything works fine except the room_rate variable.
THe output should be 3 digit numbers , but i am getting only the last digit .
Where Am i going wrong !!!
You're mixing input types. When you use list input, you can't specify informats. You either need to specify them using modified list input (add a colon to the informat) or use an informat statement earlier. The following works.
data datetimedata;
infile datalines dlm=',';
input lastname$ datechkin :mmddyy10. datechkout :mmddyy10. room_rate equip_cost;
datalines;
mukesh,04/04/15,04/06/15,125.00,333.23
vishant,04/05/15,04/07/15,200.00,200
achal,04/06/15,04/08/15,275.00,55.43
;;;;
run;
proc print data=datetimedata;
run;
When I export a dataset to Stata format using PROC EXPORT, SAS 9.4 automatically expands adds an extra (empty) byte to every observation of every string variable. For example, in this data set:
data test1;
input cust_id $ 1
month 3-8
category $ 10-12
status $ 14-14
;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 ASD X
B 199912 ASD C
;
quit;
proc export data = test1
file = "test1.dta"
dbms = stata replace;
quit;
the variables cust_id, category, and status should be str1, str3, and str1 in the final Stata file, and thus take up 1 byte, 3 bytes, and 1 byte, respectively, for every observation. However, SAS automatically adds an extra empty byte to each observation, which expands their data types to str2, str4, and str2 data type in the outputted Stata file.
This is extremely problematic because that's an extra byte added to every observation of every string variable. For large datasets (I have some with ~530 million observations and numerous string variables), this can add several gigabytes to the exported file.
Once the file is loaded into Stata, the compress command in Stata can automatically remove these empty bytes and shrink the file, but for large datasets, PROC EXPORT adds so many extra bytes to the file that I don't always have enough memory to load the dataset into Stata in the first place.
Is there a way to stop SAS from padding the string variables in the first place? When I export a file with a one character string variable (for example), I want that variable stored as a one character string variable in the output file.
This is how you can do it using existing functions.
filename FT41F001 temp;
data _null_;
file FT41F001;
set test1;
put 256*' ' #;
__s=1;
do while(1);
length __name $32.;
call vnext(__name);
if missing(__name) or __name eq: '__' then leave;
substr(_FILE_,__s) = vvaluex(__name);
putlog _all_;
__s = sum(__s,vformatwx(__name));
end;
_file_ = trim(_file_);
put;
format month f6.;
run;
To avoid the use of _FILE_;
data _null_;
file FT41F001;
set test1;
__s=1;
do while(1);
length __name $32. __value $128 __w 8;
call vnext(__name);
if missing(__name) or __name eq: '__' then leave;
__value = vvaluex(__name);
__w = vformatwx(__name);
put __value $varying128. __w #;
end;
put;
format month f6.;
run;
If you are willing to accept a flat file answer, I've come up with a fairly simple way of generating one that I think has the properties you require:
data test1;
input cust_id $ 1
month 3-8
category $ 10-12
status $ 14-14
;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 SD X
B 199912 D C
;
run;
data _null_;
file "/folders/myfolders/test.txt";
set test1;
put #;
_FILE_ = cat(of _all_);
put;
run;
/* Print contents of the file to the log (for debugging only)*/
data _null_;
infile "/folders/myfolders/test.txt";
input;
put _infile_;
run;
This should work as-is, provided that the total assigned length of all variables in your dataset is less than 32767 (the limit of the cat function in the data step environment- the lower 200 character limit doesn't apply, as that's only when you use cat to create a variable that hasn't been assigned a length). Beyond that you may start to run into truncation issues. A workaround when that happens is to only cat together a limited number of variables at a time - a manual process, but much less laborious than writing out put statements based on the lengths of all the variables, and depending on your data it may never actually come up.
Alternatively, you could go down a more complex macro route, getting variable lengths from either the vlength function or dictionary.columns and using those plus the variable names to construct the required put statement(s).
Is there a faster /shorter way to convert character date to another character date format without using an intermediate input statement as done here below...
data alpha;
length date $12;
input date $;
cards;
15FEB2014
;
run;
data beta;set alpha;
date_f=put(input(date,date9.),yymmdd10.);
run;
Character date: 15FEB2014 --> Character date: 2014-02-15
Thanks
sas_kappel
Of course it is possible, but it's not likely to be easier.
data alpha;
length date $12;
input date $;
cards;
15FEB2014
;
run;
data beta;set alpha;
date_f=put(input(date,date9.),yymmdd10.);
format date_f2 $10.;
array monnames[12] $ _temporary_ ("JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC") ;
date_f2 = catx('-',substr(date,6,4),put(whichc(substr(date,3,3),of monnames[*]),z2.),substr(date,1,2));
run;
I can't think of a simpler way than that without cheating any more than I already am in that. The put(input()) method is doing a lot of work behind that small amount of code.
You could also write a direct character format that converted '01JAN2014' to '2014-01-01', but you'd have to write every single possible conversion one by one unless I'm missing something obvious.
I have some complicated string parsing which would be very difficult to accomplish using regular SAS functions because of the string value inconsistency; as a result
I think I will need to use Perl Regular Expressions. Below have 4 variables (price, date, size, bundle) which I have to create using parts of the text string. I'm have trouble getting the syntax correct - I am new to regular expressions.
Here is a sample data set.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;run;
/The first variable is price it is normally located near the end or middle of the string/
data want;
set have;
price =(input(prxchange('s/(\w+)_(\d+)_(\w+)/$2/',-1,text),8.))/100;
format price dollar8.2;
run;
Using the data set above I need to have this result:
price
0
79.99
89.99
89.99
79.99
64.99
/Date is always a series of consecutive digits. Either 6, 7 or 8. Using | which means 'or' I thought I would be able to pull that way/
data want;
set have;
date=prxparse('/\d\d\d\d\d\d|\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d/',text);
run;
Using the data set above I need to have this result:
Date
1192014
112014
2102014
272014
12252014
462014
1192014
12162013
/* For size there is always an ‘x’ in the middle of the sub-string which is with followed by two or three digits on either side*/
data want;
set have;
size=prxparse('/(\w+)_(\d+)'x'(\d+)_(\w+)/',text);
run;
Size
728x90
160x600
300x250
160x600
728x90
/*This is normally located towards the beginning of the string. It’s always a single digit number followed by an x It in never followed by additional digits but can also be just 0. */
data want;
set have;
Bundle=prxparse('/(\d+)'x'',text);
run;
Bundle
0
3x
3x
3X
3x
0
2x
3x
The final product I am looking for should look like this:
Text Date price Size Bundle
acq_newsale_0_CartChat_0_Flash_1192014.jpg 1192014 0 0
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf 112014 79.99 3x
acq_sale_3xconoffer_8999_nacpg_2102014.sfw 2102014 89.99 3x
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp 272014 89.99 728x90 3X
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov 12252014 160x600 3x
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg 462014 300x250 0
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf 1192014 79.99 160x600 2x
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf 12162013 64.99 728x90 3
x
If you're extracting, don't use PRXCHANGE. Use PRXPARSE, PRXMATCH, and PRXPOSN.
Sample usage, with date:
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;
run;
data want;
set have;
rx_date = prxparse('~(\d{6,8})~io');
rc_date = prxmatch(rx_date,text);
if rc_date then datevar = prxposn(rx_date,1,text);
run;
Just enclose in parens the section you want to extract (in this case, all of it).
Date was easy - as you say, 6-8 numbers. The others may be harder. The 3x etc. bit you can probably find, depending on how strict you need to be; the price I think you'll have a very hard time finding. You need to be able to better articulate the rules. "Towards the beginning" isn't a regex rule. "The second set of digits" is; "The second to last set", perhaps might work. I'll see if I can figure out a few.
In your example data, this works. I in particular don't like the price search; that one may well fail with a more complicated set of data. You can figure out adding the decimal for yourself.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
blahblah :23 blahblah
blahblahblah 23 blah blah
;
run;
data want;
set have;
rx_date = prxparse('~_(\d{6,8})[_\.]~io');
rx_price = prxparse('~_(\d+)_.*?(?=_\d+[_\.]).*?(?!_\d+[_\.])~io');
rx_bundle = prxparse('~(?!_\d+_)_(\dx)~io');
rx_size = prxparse('~_(\d+x\d+)[_\.]~io');
rx_adnum = prxparse('~\s:?(\d\d)\s~io');
rc_date = prxmatch(rx_date,text);
rc_price = prxmatch(rx_price,text);
rc_bundle = prxmatch(rx_bundle,text);
rc_size = prxmatch(rx_size,text);
rc_adnum = prxmatch(rx_adnum,text);
if rc_date then datevar = prxposn(rx_date,1,text);
if rc_price then price = prxposn(rx_price,1,text);
if rc_bundle then bundle = prxposn(rx_bundle,1,text);
if rc_size then size = prxposn(rx_size,1,text);
if rc_adnum then adnum = prxposn(rx_adnum,1,text);
run;