Is there a way to read in specific parts of my data without using FIRSTOBS=? For example, I have 5 different files, all of which start with a few rows of unwanted characters. I want my data to be read in starting with the first row that is numeric, but each of these 5 files has that first numeric row in a different place. Rather than going into each file to find where FIRSTOBS should be, is there a way I can check this programmatically? Perhaps by using an IF statement with ANYDIGIT?
Have you tried something like this from the SAS docs? Example 5: Positioning the Pointer with a Numeric Variable
data office (drop=x);
   infile file-specification;
   input x @;               /* read x and hold the current line */
   if 1<=x<=10 then
      input #x City $9.;    /* use x to position the line pointer */
   else do;
      put 'Invalid input at line ' _n_;
      delete;
   end;
run;
This assumes that you don't know how many lines are to be skipped at the beginning of each file. My filerefs are for UNIX; to run the example on another OS, they will need to be changed.
*Create two example input data files;
filename FT15F001 '~/file1.txt';
parmcards;
char
char and 103
10 10 10
10 10 10.1
;;;;
run;
filename FT15F001 '~/file2.txt';
parmcards;
char
char and 103
char
char
char
10 10 10.5
10 10 10
;;;;
run;
*Read them starting from the first line that has all numbers;
filename FT77F001 '~/file*.txt';
data both;
   infile FT77F001 eov=eov;
   input @;                               /* read a line and hold it */
   /* Reset the flag at the start of each new file */
   if _n_ eq 1 or eov then do;
      eov=0;
      flag=1;
   end;
   /* While the flag is set, delete any line that still contains letters */
   if flag then do;
      if anyalpha(_infile_) then delete;
      else flag=0;
   end;
   input v1-v3;                           /* re-read the held line as numbers */
   retain flag;
   drop flag;
run;
proc print;
run;
I ended up doing:
INPUT City $ @;
StateAvg = input(substr(City,1,4),COMMA4.);
IF 5000 <= StateAvg <= 7000 THEN
   INPUT City 1-7 State ZIP;
ELSE DO;
   Delete;
END;
And this worked. Thanks for the suggestions, I went back and looked at example 5 and it helped.
I want to create a variable that resolves to the character before a specified character (*) in a string. However, if this specified character appears several times in a string (as in the example below), how can I retrieve one variable that concatenates all the characters appearing before each occurrence, separated by commas?
Example:
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
;
run;
Code:
data want;
set have;
cnt = count(string, "*");
_startpos = 0;
do i=0 to cnt until(_startpos=0);
before = catx(",",substr(string, find(string, "*", _startpos+1)-1,1));
end;
drop i _startpos;
run;
That code outputs before=C for both the first and second observations. However, I want before=C,E for the first observation and before=C,W,d for the second.
You can use a Perl regular expression replacement pattern to transform the original string.
Example:
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
;
data want;
set have;
csl = prxchange('s/([^*]*?)([^*])\*/$2,/',-1,string); /* comma separated letters */
csl = prxchange('s/, *$//',1,csl); /* remove trailing comma */
run;
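With the sample data this yields csl=C,E for the first observation and csl=C,W,d for the second, which matches the requested output.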
Make sure to increment _STARTPOS so your loop will finish. You can use CATX() to add the commas. Simplify selecting the character by using CHAR() instead of SUBSTR(). Also make sure to TELL the data step how to define the new variable (with a LENGTH statement) instead of forcing it to guess. I also include a test to handle the situation where * is in the first position.
data have;
   input string $20.;
datalines;
ABC*EDE*
EFCC*W*d*
*XXXX*
asdf
;

data want;
   set have;
   length before $20;
   _startpos = 0;
   do cnt=0 to length(string) until(_startpos=0);
      _startpos = find(string,'*',_startpos+1);
      if _startpos>1 then before = catx(',',before,char(string,_startpos-1));
   end;
   cnt = cnt-(string=:'*');    /* a leading * has no character before it */
   drop _startpos;
run;
Results:
Obs  string     before  cnt
  1  ABC*EDE*   C,E       2
  2  EFCC*W*d*  C,W,d     3
  3  *XXXX*     X         1
  4  asdf                 0
CALL SCAN is also a good choice for getting the position of each *.
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
****
asdf
;
data want;
   length before $20;
   set have;
   do i = 1 to count(string,'*');
      call scan(string,i,pos,len,'*');   /* position and length of the i-th word */
      before = catx(',',before,substrn(string,pos+len-1,1));
   end;
   put _n_= +7 before=;
run;
Result:
_N_=1 before=C,E
_N_=2 before=C,W,d
_N_=3 before=
_N_=4 before=
I have a variable in SAS with a lot of numbers, for example 11000, 30129, 11111, 30999. I want to group these by the first two digits, so that "11000 and 11111" and "30129 and 30999" each end up in their own table.
It's quite simple: create a second column that holds the first two digits, then sort the dataset by that column.
data test;
infile datalines dsd ;
input a : 15. ;
datalines;
11000,
30129,
11111,
309999,
;
run;
data test_a;
   length val_a $2;
   set test;
   /* convert A to character first; SUBSTRN on the numeric value would
      operate on the right-aligned BEST12. conversion and return blanks */
   val_a = substrn(left(put(a,best12.)),1,2);
run;
proc sort data=test_a out=test_b;
by val_a;
run;
Result will be:
val_a a
11 11000
11 11111
30 30129
30 309999
And then you can create 2 datasets with a selection on val_a, like this:
data data_11 data_30;
   set test_b;
   if val_a = '11' then output data_11;
   else if val_a = '30' then output data_30;
run;
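For what it's worth, an equivalent way to split the sorted table is a WHERE= data set option per output table; a minimal sketch using the same names:
data data_11;
   set test_b (where=(val_a='11'));
run;
data data_30;
   set test_b (where=(val_a='30'));
run;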
Regards,
I think I did the same as you, but my new column only shows ".". I think your answer can give me some help anyway, though. Thank you!
data books;
   infile "&path\Boken.csv" dlm=';' missover dsd firstobs=2;
   input ISBN : $12.
         Book : $quote150.;
run;
data test_a;
   format val_ISBN 15.;
   set books;
   val_ISBN = SUBSTRN(ISBN,1,2);
run;
proc sort data=test_a out=test_b;
   by val_ISBN;
run;
proc print data=test_b (obs=10) noobs;
run;
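The "." is most likely the numeric missing value: the FORMAT statement declares val_ISBN as numeric (15. is a numeric format), while SUBSTRN returns a character value, so the assignment forces a character-to-numeric conversion that fails whenever the first two characters of ISBN are not digits. A minimal sketch of the character version, reusing the names from the step above:
data test_a;
   length val_ISBN $2;   /* character, as in the answer above */
   set books;
   val_ISBN = substrn(ISBN,1,2);
run;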
I have a SAS dataset that I have to export to a .csv file. I have the following two conflicting requirements.
1. I have to use the semicolon as the delimiter in the .csv file.
2. Some of the character variables are manually inputted strings from formulas, hence they may contain semicolons.
My solution to the above is to either escape the semicolon or to replace it with a comma.
How can I, in a nice, clean and efficient way use e.g. tranwrd on an entire dataset?
My attempt:
For each variable, use the tranwrd(.., ";", ",") function on that variable and update the dataset, looping through all variables. This, however, is naturally a very inefficient way of doing it for even semi-large datasets, since I have to run a data step for each variable. The code is also quite ugly, since I have to get the variable names in a few preliminary steps, but the inefficiency definitely takes the cake.
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
* Get the variable names;
proc contents data=test out=vars(keep=name type varnum) order=varnum noprint;
run;
* Sort by variable number;
proc sort data=vars;
by varnum;
run;
* Put variable names into a space-separated string;
proc sql noprint;
select compress(name)
into :name_list separated by ' '
from vars;
quit;
%let len = %sysfunc(countw(&name_list));
*Initialize loop dataset;
data a;
set test;
run;
%macro loop;
%do i = 1 %to &len;
%let j = %scan(&name_list,&i);
data a(rename=(v_&j = &j) drop=&j);
set a;
v_&j.=compress(tranwrd(&j,";",","));
run;
%end;
%mend;
%loop;
I think I may have a more elegant solution to your problem:
data class;
   set sashelp.class;
   array vars [*] _character_;   /* all character variables in the data set */
   do i = 1 to dim(vars);
      vars[i] = compress(tranwrd(vars[i],"a","X"));
   end;
   drop i;
run;
You can use an array to reference all the character columns in your data set and then loop through them.
The most widely used standard for csv files whose fields can contain delimiters is to quote the fields that contain them and to double up any embedded quotes. In SAS you can do this automatically using the DLM= and DSD options on the FILE statement:
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb" 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
data _null_;
   set test;
   file "c:\temp\test.csv" dsd dlm=';';
   put (_all_) (+0);
run;
This results in the following semicolon-delimited csv (minus a header row, but that's a separate issue):
"Aaa;;";50;11;1;222;"a;s"
"Bbb""";35;12;2;250;qw
Comma,;75;13;3;foo;zx
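If you do want the header row, one minimal sketch is to write it on the first iteration; the hard-coded names assume this example's variables:
data _null_;
   set test;
   file "c:\temp\test.csv" dsd dlm=';';
   if _n_ = 1 then put 'w;c;b;d;e;f';   /* header line; adjust to your variables */
   put (_all_) (+0);
run;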
Sorry, didn't notice your comment about the workaround until after I posted this. I'll leave it here in case anyone finds it helpful.
Fields in a properly formatted delimited file are quoted. PROC EXPORT will do that. There is no need to change the data.
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
filename FT45F001 temp;
proc export data=test outfile=FT45F001 dbms=csv;
   delimiter=';';
run;
* Show the raw lines PROC EXPORT wrote;
data _null_;
   infile FT45F001;
   input;
   list;
run;
* Read the file back in and check the round trip;
proc import replace datafile=FT45F001 dbms=csv out=test2;
   delimiter=';';
run;
proc print;
run;
proc compare base=test compare=test2;
run;
Say we have the SAS code:
data t1 (keep=KEY COUNT C_AMT2 C_AMT);
SET t1;
BY key;
RETAIN COUNT C_AMT;
IF FIRST.KEY THEN
DO;
COUNT=0;
C_AMT2=0;
END;
COUNT+1;
C_AMT=SUM(C_AMT2, C_AMT);
IF LAST.KEY THEN
OUTPUT;
RUN;
What would change here if I were to remove "IF LAST.KEY THEN OUTPUT;"? The documentation says that OUTPUT causes SAS to write the current observation to the data set immediately, rather than at the end of the data step. Because here it sits right before the end of the data step, would removing it make no difference?
Removing it would cause a difference.
Then you would have a record for every value of key, assuming multiple values. Controlling the output means you'd have only the last record.
It looks like it's calculating a count and a total, so there are other ways to achieve this (see the PROC MEANS sketch below). I'm going to assume that there's some other code that you've suppressed.
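For instance, a summary procedure gives the per-key count and sum directly; a minimal sketch (using the Key/Amount names from the simulated data below, not the original post's variables):
proc means data=have noprint;
   by Key;
   var Amount;
   output out=t1 (drop=_type_ _freq_) n=Count sum=C_Amt;
run;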
The relevant section from the documentation, in the link you have above:
Implicit versus Explicit Output
By default, every DATA step contains an implicit OUTPUT statement at the end of each iteration that tells SAS to write observations to the data set or data sets that are being created. Placing an explicit OUTPUT statement in a DATA step overrides the automatic output, and SAS adds an observation to a data set only when an explicit OUTPUT statement is executed. Once you use an OUTPUT statement to write an observation to any one data set, however, there is no implicit OUTPUT statement at the end of the DATA step. In this situation, a DATA step writes an observation to a data set only when an explicit OUTPUT executes. You can use the OUTPUT statement alone or as part of an IF-THEN or SELECT statement or in DO-loop processing.
Here's some code that simulates your issue:
*Generate random data;
Data have;
do Key=1 to 2;
do i=1 to 3;
Amount=floor(rand('normal', 50, 5));
OUTPUT;
end;
end;
run;
data t1;
set have;
retain count C_Amt;
by Key;
if first.key then do;
count=0;
C_Amt=0;
end;
Count+1;
c_amt=sum(c_amt, amount);
if last.key then output;
run;
proc print data=t1;
run;
data t1;
set have;
retain count C_Amt;
by Key;
if first.key then do;
count=0;
C_Amt=0;
end;
Count+1;
c_amt=sum(c_amt, amount);
*if last.key then output;
run;
proc print data=t1;
run;
And the corresponding output:
With last.key then output
Obs Key i Amount count C_Amt
1 1 3 46 3 147
2 2 3 44 3 154
And without last.key
Obs Key i Amount count C_Amt
1 1 1 47 1 47
2 1 2 54 2 101
3 1 3 46 3 147
4 2 1 61 1 61
5 2 2 49 2 110
6 2 3 44 3 154
Commas are an error here:
(keep=KEY, COUNT, C_AMT2, C_AMT)
The KEEP= data set option takes a space-delimited variable list.
Anyway:
RUN;
usually means:
output;
return;
But if SAS encounters an explicit OUTPUT statement in your code, the implicit output at the end (the one implied by the RUN statement) is dropped.
Hence, since your OUTPUT statement executes only IF LAST.KEY, your dataset will contain only the observations marked as last.key, because your RUN; then means only return.
Something like:
data want; set have; output; run;
is exactly the same as leaving out the explicit output:
data want; set have; run;
You can use output as you want:
data want01 want02;
   set have;
   if a then output want01;
   if b then output want02;
run;
data want01;
   var=var1;
   output;    /* writes one observation */
   var=var2;
   output;    /* writes a second observation */
run;
I have my data in a .txt file which is comma separated. I am writing a regular INFILE statement to import that file into a SAS dataset. The data is some 2.5 million rows. However, in the 37314th row and in many more rows after it, I have junk values. SAS stops importing at the row just above the first junk row, so I am not getting a dataset with all 2.5 million rows but one with 37314 rows. I want to write the step so that it handles these junk rows and either skips or deletes them. All in all, I need all 2.5 million rows, which I am not able to get because of the junk rows in between.
Any help would be appreciated.
You can read the whole line into the input buffer using just an INPUT; statement. Then you can parse the fields individually using the _INFILE_ variable.
Example:
data _null_;
infile datalines firstobs=2;
input;
city = scan(_infile_, 1, ' ');
char_min = scan(_infile_, 3, ' ');
char_min = substr(char_min, 2, length(char_min)-2);
minutes = input(char_min, BEST12.);
put city= minutes=;
datalines;
City Number Minutes Charge
Jackson 415-555-2384 <25> <2.45>
Jefferson 813-555-2356 <15> <1.62>
Joliet 913-555-3223 <65> <10.32>
;
run;
See "Working with Data in the Input Buffer" in the SAS documentation.
You can also use the ? and ?? modifiers in the INPUT statement to 'ignore' any problem rows. The ? modifier suppresses the invalid-data notes in the log; ?? additionally keeps the automatic variable _ERROR_ at 0.
Here's the link to the doc. Look under the heading "Format Modifiers for Error Reporting".
An example:
data x;
format my_num best.;
input my_num ?? ;
**
** POSSIBLE ERROR HANDLING HERE:
*;
if my_num ne . then do;
output;
end;
datalines;
a
;
run;