I have two lines of observations to read in SAS.
It is a comma-delimited data set.
My code is as below:
DATA SASweek1.industry;
INFILE "&Dirdata.Assignment1_Q6_data.txt" DLM="," DSD termstr=crlf TRUNCOVER;
LENGTH Company $ 15;
INPUT Company $ State $ Expense COMMA9. ;
FORMAT Expense DOLLAR9.;
*INFORMAT Expense DOLLAR10.;
RUN; * not ready;
The raw data set looks like this:
I can print out the first line of observations well,
but the last "0" will go to the first position of the second
line, becoming "0Lee's..".
Any suggestions would be highly appreciated!!
It is just doing what you told it to do. You told it to read exactly 9 characters.
Normally you should not use formatted input mode with delimited data. You prevent that by either adding the : (colon) prefix in front of the informat specification in the INPUT statement or removing the informat specification completely and using an INFORMAT statement to let SAS know what informat to use.
But your data is NOT properly delimited because the last field contains the delimiter, but the value is not enclosed in quotes. So the commas make it look like two values instead of one. The real solution is to fix the process that created the file to create a valid delimited file. It needs to quote the values with commas in them, or remove the commas from the numbers, or use a delimiter character that does not appear in the data.
Fortunately since it is the last field on the line you CAN use formatted input to read just that field. Since you are using the TRUNCOVER option just set the width of the informat in the INPUT statement to the maximum.
DATA SASweek1.industry;
INFILE "&Dirdata.Assignment1_Q6_data.txt" DLM="," DSD termstr=crlf TRUNCOVER;
LENGTH Company $15 State $15 Expense 8;
INPUT Company State Expense COMMA32. ;
FORMAT Expense DOLLAR9.;
RUN;
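For contrast with the fix above, here is a minimal sketch of the colon-modifier approach mentioned earlier, assuming the file were properly delimited, that is, with the comma-containing value enclosed in quotes. The data set name work.colon_demo and the data lines are made up for illustration; they are not the original file.

```sas
/* Hypothetical data: values that contain commas are quoted, as DSD expects. */
data work.colon_demo;
  infile datalines dlm=',' dsd truncover;
  length Company $15 State $2;
  /* The colon keeps list input in charge: COMMA12. is applied to the field, */
  /* but reading still stops at the next delimiter instead of a fixed width. */
  input Company State Expense :comma12.;
  format Expense dollar12.;
datalines;
Acme Corp,NY,"1,234,560"
Beta LLC,CA,987
;
run;
```

With quoted values the DSD option strips the quotes and treats the embedded commas as data, so nothing spills over into the next line.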
I can not find the way to reverse text strings.
For example I want to reverse these:
MMMM121231M34 to become 43M132121MMMM
MM1M11M1 to become 1M11M1MM
1111213111 to become 1113121111
Judging from your examples, what you mean by 'rearrange' is actually 'reverse'.
In that case, you've got the very handy reverse() function in SAS.
Used in context:
data test;
length text $32;
infile datalines;
input text $;
result=reverse(strip(text));
datalines;
MMMM121231M34
MM1M11M1
1111213111
;
run;
EDIT at @Joe's request: in the example above, I create the test dataset with a length of 32 characters for the text variable. Values read from datalines are therefore padded with blanks up to 32 characters, so reversing such a value gives a result that starts with all of that padding, followed by the value you are actually looking for. Adding the strip function removes the excess blanks from text before reversing, keeping only the "real" value in the result.
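To make the padding effect visible, here is a small sketch that compares the two calls on one of the example values. The variable names padded, trimmed, first_padded and first_trimmed are hypothetical helpers, and verify() is used only to locate the first non-blank character.

```sas
data _null_;
  length text $32;
  text = 'MM1M11M1';                /* 8 characters plus 24 trailing blanks */
  padded  = reverse(text);          /* the trailing blanks end up in front */
  trimmed = reverse(strip(text));   /* blanks removed before reversing */
  /* Position of the first character that is not a blank. */
  first_padded  = verify(padded, ' ');
  first_trimmed = verify(trimmed, ' ');
  put first_padded= first_trimmed=;
run;
```

The log should report 25 for the unstripped version (24 blanks first) and 1 for the stripped one.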
I have a SAS string that always starts with a date. I want to remove the date from the string.
Example of data is below (data does not have bullets, included bullets to increase readability)
10/01/2016|test_num15
11/15/2016|recom_1_test1
03/04/2017|test_0_8_i0|vacc_previous0
I want the data to look like this (data does not have bullets, included bullets to increase readability)
test_num15
recom_1_test1
test_0_8_i0|vacc_previous0
Use index() to find the position of the first '|' in the string, then substr() to take everything after it; or use a regular expression.
data have;
input x $50.;
x1=substr(x,index(x,'|')+1);
x2=prxchange('s/([^_]+\|)(?=\w+)//',1,x);
cards;
10/01/2016|test_num15
11/15/2016|recom_1_test1
03/04/2017|test_0_8_i0|vacc_previous0
;
run;
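As an aside, a simpler pattern can do the same job. This is a sketch of an alternative, not the pattern from the answer above: it anchors at the start of the string and removes everything up to and including the first '|'. The variable name x3 is hypothetical.

```sas
data _null_;
  length x x3 $50;
  x = '03/04/2017|test_0_8_i0|vacc_previous0';
  /* ^[^|]* matches the leading run of non-pipe characters, \| the first pipe. */
  x3 = prxchange('s/^[^|]*\|//', 1, strip(x));
  put x3=;
run;
```

Because the pattern is anchored with ^ and the replacement count is 1, later pipes in the value are left alone.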
This is a great use case for call scan. If the date always has the same length (always 10), you don't actually need it: start would always be 12, so you can skip straight to the substr, as user667489 noted in comments. If the length varies, call scan is helpful.
data have;
length textstr $100;
input textstr $;
datalines;
10/01/2016|test_num15
11/15/2016|recom_1_test1
03/04/2017|test_0_8_i0|vacc_previous0
;;;;
run;
data want;
set have;
call scan(textstr,2,start,length,'|');
new_textstr = substr(textstr,start);
run;
It would also let you grab only the second word, if that's useful, by passing length as the third argument to substr.
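Both variations can be sketched as follows. The data set names have_demo and want_demo and the variables word_start, word_len, second_only and rest_fixed are hypothetical; the fixed-width line assumes the date really is always 10 characters.

```sas
data have_demo;
  length textstr $100;
  input textstr $;
datalines;
10/01/2016|test_num15
03/04/2017|test_0_8_i0|vacc_previous0
;
run;

data want_demo;
  set have_demo;
  length second_only rest_fixed $100;
  /* Position and length of the 2nd pipe-delimited word. */
  call scan(textstr, 2, word_start, word_len, '|');
  /* Only the second word; guard against rows that have no second word. */
  if word_start > 0 then second_only = substr(textstr, word_start, word_len);
  /* Constant-length date: everything from column 12 onward. */
  rest_fixed = substr(textstr, 12);
run;
```

For the second row, second_only keeps just test_0_8_i0 while rest_fixed keeps the remainder of the string including the later pipe.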
Can you please tell me if we can use single trailing (@) for delimited data
rather than fixed width?
Thanks,
Nikhila
From the comments it looks like the question is really how to skip columns in delimited data. A simple way is to read the value into a variable that you later drop. Or even read it into a variable that you want and then overwrite it with the value from the column you do want to keep.
data want ;
infile cards dsd truncover ;
length var1 var2 $20;
input 3*var1 var2 ;
cards;
nikhila,26,hyd,btech
akhila,24,blr,btech
nitesh,20,blr,bmm
;
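The first option described above, reading unwanted columns into a variable that is later dropped, might look like this. It is a sketch under assumptions: the names want_drop, name, skip and degree are invented for illustration, and the data lines are the same three records.

```sas
data want_drop;
  infile datalines dsd truncover;
  length name $20 skip $1 degree $20;
  /* Columns 2 and 3 are read into the throwaway variable SKIP. */
  input name skip skip degree;
  drop skip;
datalines;
nikhila,26,hyd,btech
akhila,24,blr,btech
nitesh,20,blr,bmm
;
run;
```

Giving the throwaway variable a length of 1 keeps it cheap; list input simply truncates the values it does not need.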
I'm familiar with the :, and ~ modifiers in SAS put and input statements. The behaviour of & in an input statement is also fairly well documented. But what does & do in a put statement?
It seems to have a similar effect to :, triggering modified list output rather than formatted output, but I can't find any documentation of this behaviour.
E.g.
data _null_;
set sashelp.class;
file 'c:\temp\output.csv' dlm=',';
put Name Sex Age & 4. Height Weight;
run;
Quoting from the SAS 9.4 online documentation, in the section "INPUT Statement, List":
&
indicates that a character value can have one or more single embedded blanks. This format modifier reads the value from the next non-blank column until the pointer reaches two consecutive blanks, the defined length of the variable, or the end of the input line, whichever comes first.
Restriction:
The & modifier must follow the variable name and $ sign that it affects.
Tip:
If you specify an informat after the & modifier, the terminating condition for the format modifier remains two blanks.
Here is an example from the example section:
Example Reading Character Data That Contains Embedded Blanks
The INPUT statement in this DATA step uses the & format modifier with list input to read character values that contain embedded blanks.
data list;
infile file-specification;
input name $ & score;
run;
It can read these input data records:
----+----1----+----2----+----3----+
Joseph  11  Joergensen  red
Mitchel  13  Mc Allister  blue
Su Ellen  14  Fischer-Simon  green
The & modifier follows the variable that it affects in the INPUT statement. Because this format modifier follows NAME, at least two blanks must separate the NAME field from the SCORE field in the input data records.
You can also specify an informat with a format modifier, as shown here:
input name $ & +3 lastname & $15. team $;
In addition, this INPUT statement reads the same data to demonstrate that you are not required to read all the values in an input record. The +3 column pointer control moves the pointer past the score value in order to read the value for LASTNAME and TEAM.
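A runnable variant of the first documentation example can be sketched with inline data. This is an illustration rather than part of the quoted documentation: the data set name list_demo and the two records are assumptions, and two blanks separate the name from the score as the & modifier requires.

```sas
data list_demo;
  infile datalines truncover;
  length name $12;
  /* & lets NAME contain single blanks; two blanks end the value. */
  input name $ & score;
datalines;
Joseph  11
Su Ellen  14
;
run;
```

The second record shows the point of the modifier: "Su Ellen" is read as one value because only a single blank appears inside it.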
I have a file in following format:
Name Salary Age
bob 10000 18
sally 5555 20
#not found 4fjfjhdfjfnvndf
#not found 4fjfjhdfjfnvndf
9/2-10/2
but then I have random points in the file where there are 4-6 lines of random characters. The file has 2 million rows. I was wondering whether the infile statement automatically skips these random bursts of lines, or whether I have to go into the file and delete them manually.
You probably have to deal with them in some fashion. If you have truncover or missover on the infile statement, they won't do any harm (you must have one of the two, though, or a short garbage line might cause your next lines to get shifted over). But you'll end up with garbage rows in your data set that you need to deal with.
The quick and dirty method would be something like this:
data have;
infile "blah.txt" dlm=' ' dsd lrecl=32767 truncover;
input name $ salary age;
if missing(salary) and missing(age) then delete;
run;
If the garbage is likely to produce missing values for the numeric variables, that will work. However, your log will probably fill up with invalid-data messages, and the filter will also miss any garbage lines that happen to parse as numeric values. (If the garbage is entirely numeric, you could test whether name is a number.)
The better method is to preprocess _infile_ - which is a bit more 'advanced' but certainly a good approach.
data have;
infile "blah.txt" dlm=' ' dsd lrecl=32767 truncover;
input @; *load the line into _infile_ and hold it for the second input;
if countw(_infile_) ne 3 then delete; *if there are not exactly 3 "words" then delete it;
if notdigit(scan(_infile_,2)) or notdigit(scan(_infile_,3)) then delete; *if the 2nd or 3rd word contain non-digit values then delete;
input name $ salary age;
run;
Both approaches require some consistency with data to work, and probably require some tweaking - for example if salary and age are acceptable to be missing, both of these would delete rows you don't want deleted.