I understand how to use pointer control to search for a phrase in the raw data and then read the value into a SAS variable. I need to know how to tell SAS to stop reading the raw data when it encounters a particular phrase.
For example, in the code below I want to read only the data between the phrases Start and Stop, so Jelly should not be part of the output.
data work.two;
input @"1" Name :$32.;
datalines;
Start 1 Frank 6
1 Joan 2
3 Sui Stop
1 Jelly 4
5 Jose 3
;
run;
You cannot really combine those into a single pass through the file. The problem is that the @'1' pointer control will skip past the line with Stop in it, so there is no way your data step will see it.
Pre-process the file.
filename copy temp;
data _null_;
file copy ;
retain start 0 ;
input ;
if index(_infile_,'Start') then start=1;
if start then put _infile_;
if index(_infile_,'Stop') then stop;
datalines;
Start 1 Frank 6
1 Joan 2
3 Sui Stop
1 Jelly 4
5 Jose 3
;
data work.two;
infile copy ;
input @"1" Name :$32. @@;
run;
You can make the logic to detect what parts of the source file to include as complex as you need.
Every name is the second word from the right of its row, so it can be extracted with the SCAN function. If the row contains 'Stop', the step ends; because STOP executes before the implicit output, the row containing 'Stop' is not written.
data work.two;
input @@;
Name=scan(_infile_,-2);
if indexw(_infile_,'Stop')>0 then stop;
input;
datalines;
Start 1 Frank 6
1 Joan 2
3 Sui Stop
1 Jelly 4
5 Jose 3
;
run;
I need to produce a few reports, and I always struggle to get the required result from the REPORT procedure without reshaping the initial dataset too much, which takes a lot of time.
The dataset is of the following form:
ID VAR1 VAR2 VAR3
1 1 2 3
1 4 5 6
2 7 8 9
2 10 11 12
and the output should be:
ID VAR1
VAR2
VAR3
1 1
2
3
4
5
6
2 7
8
9
10
11
12
Is there a good (i.e. efficient) way of producing the output using proc report? Because the only way I see is to manually create blank characters in the ID variable, manually create the column with the variables, and use the put function heavily to create spaces. The only thing proc report is used for is to create blank lines between subjects by using sorting variables. The amount of time it takes is insane. I tried to find some good resources on this, but with no success.
I will appreciate any suggestions. Thanks.
Sounds like you just want to use data step to produce your "report".
Here is an outline:
data _null_;
set have;
by id;
array cols var1-var3;
if first.id then put @1 id @;
do index=1 to dim(cols);
put @(5+index) cols[index] ;
end;
put;
run;
Results:
1 1
2
3
4
5
6
2 7
8
9
10
11
12
Add a FILE statement to direct the output somewhere else. For example use FILE PRINT; to send the report to the listing output instead of the LOG.
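A minimal sketch of the same idea with FILE PRINT added, assuming a dataset named have with the layout from the question (the sample data step, the loop variable i, the fixed column 5, and the blank line after each ID are illustrative choices, not part of the original answer):

```sas
/* Hypothetical sample data matching the layout in the question */
data have;
  input id var1 var2 var3;
datalines;
1 1 2 3
1 4 5 6
2 7 8 9
2 10 11 12
;

data _null_;
  file print;                     /* write to the output destination instead of the log */
  set have;
  by id;
  array cols var1-var3;
  if first.id then put @1 id @;   /* print the ID once per group and hold the line */
  do i = 1 to dim(cols);
    put @5 cols[i];               /* one value per line, starting in column 5 */
  end;
  if last.id then put;            /* blank line between subjects */
run;
```

Because the first PUT ends with a trailing @, the first value of each group lands on the same line as the ID; every later PUT starts a new line.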
Problem Statement: I have a text file that I want to read using the SAS INFILE statement, but SAS is not giving me the proper output.
Text File:
1 Big Bazar 15,000
2 Hypercity 20,000
3 Star Bazar 25,000
4 Big Basket 30,000
5 Grofers 35,000
6 DMart 40,000
The Code that I have tried:
DATA Profit;
INFILE '/folders/myfolders/Akki/Retain_Sum.txt';
INPUT Month $1 Name $3-12 Profit;
Informat Profit COMMA6.;
FORMAT Profit COMMA6.;
RETAIN Cummulative_Profit;
Cummulative_Profit = SUM(Cummulative_Profit, Profit);
Run;
PROC PRINT data=profit;
Run;
What am I looking for?
I want to read the above data into SAS, but there seems to be a problem in my code: whenever I run it, the Profit variable is missing for the Grofers and DMart observations. Can you fix it? I want SAS to read the complete file.
Thanks in advance.
Your problem comes from the fact that you are specifying column input for your second variable, saying it should be read from columns 3 to 12. While that works for the first 4 entries, the last two are too short, so the beginning of the profit value is read into the name variable.
Since your file is clearly not "fixed width", you should be using list input. Unfortunately because your name values contain spaces, this could prove tricky. The proper way to do it would be to have your name values quoted in your text file. You can then use the dsd option on your infile statement to read these values properly with list input:
DATA Profit;
INFILE datalines dlm=' ' dsd;
length month $1 name $12;
INPUT Month $ Name $ Profit;
Informat Profit COMMA6.;
FORMAT Profit COMMA6.;
RETAIN Cummulative_Profit;
Cummulative_Profit = SUM(Cummulative_Profit, Profit);
datalines;
1 "Big Bazar" 15,000
2 "Hypercity" 20,000
3 "Star Bazar" 25,000
4 "Big Basket" 30,000
5 Grofers 35,000
6 DMart 40,000
;
Run;
PROC PRINT data=profit;
Run;
Your file does not conform to the rules for LIST input with embedded blanks. You can still read it without changing the file, but you have to find the column where the name field ends.
filename FT15F001 temp;
data bad;
infile FT15F001 col=col;
input month @;
l = findc(_infile_,' ','b') - col +1;
input name $varying32. l profit :comma.;
format profit comma12.;
drop l;
parmcards;
1 Big Bazar 15,000
2 Hypercity 20,000
3 Star Bazar 25,000
4 Big Basket 30,000
5 Grofers 35,000
6 DMart 40,000
;;;;
run;
proc print;
run;
Obs month name profit
1 1 Big Bazar 15,000
2 2 Hypercity 20,000
3 3 Star Bazar 25,000
4 4 Big Basket 30,000
5 5 Grofers 35,000
6 6 DMart 40,000
When reading an input file where one line contains more than one observation, we can use either '@' or '@@'.
When should we use one over the other?
Use the double trailing @@ when you want the pointer to remain in the same place for the next iteration of the data step. If you just want the pointer to remain in place for the next INPUT statement in the current iteration of the data step, a single trailing @ is enough.
Example reading one line with multiple iterations of the data step.
data want;
id+1;
input score @@;
cards;
10 20 30 45
;
Example reading from one line multiple times in the same iteration of the data step.
data want;
infile cards truncover ;
input id score @;
do rep=1 by 1 until (score=.);
output;
input score @;
end;
cards;
1 10 20 30 45
2 15 32
3 5 6 8 12 13 56
;
data jul11.merge11;
input month sales ;
datalines ;
1 3123
1 1234
2 7482
2 8912
3 1284
;
run;
data jul11.merge22;
input month goal ;
datalines;
1 4444
1 5555
1 8989
2 9099
2 8888
3 8989
;
run;
data jul11.merge1;
merge jul11.merge11 jul11.merge22 ;
by month;
difference =goal - sales ;
run;
proc print data=jul11.merge1 noobs;
run;
output:
month sales goal difference
1 3123 4444 1321
1 1234 5555 4321
1 1234 8989 7755
2 7482 9099 1617
2 8912 8888 -24
3 1284 8989 7705
Why didn't it match every observation in table 1 with every observation in table 2 for the common months?
As I understand it, the PDV retains the values of an observation while SAS checks whether any more observations are left in that BY group before reinitializing it, so I expected a Cartesian product.
The merge gives a perfect Cartesian product for one-to-many merging but not for many-to-many.
This is because of how SAS processes the data step. A merge is never a true cartesian product (ie, all records are searched and matched up against all other records, like a SQL comma join might ); what SAS does (in the case of two datasets) is it follows down one dataset (the one on the left) and advances to the next particular by-group value; then it looks over on the right dataset, and advances until it gets to that by group value. If there are other records in between, it processes those singly. If there are not, but there is a match, then it matches up those records.
Then it looks on the left to see if there are any more in that by group, and if so, advances to the next. It does the same on the right. If only one of these has a match then it will only bring in those values; hence if it has 1 element on the left and 5 on the right, it will do 1x5 or 5 rows. However, if there are 2 on the left and 3 on the right, it won't do 2x3=6; it does 1:1, 2:2, and 2:3, because it's advancing record pointers sequentially.
The following example is a good way to see how this works. If you really want to see it in action, throw in the data step debugger and play around with it interactively.
data test1;
input x row1;
datalines;
1 1
1 2
1 3
1 4
2 1
2 2
2 3
3 1
;;;;
run;
data test2;
input x row2;
datalines;
1 1
1 2
1 3
2 1
3 1
3 2
3 3
;;;;
run;
data test_merge;
merge test1 test2;
by x;
put x= row1= row2=;
run;
If you do want to do a cartesian join in SAS datastep, you have to do nested SET statements.
data want;
set test1;
do _n_ = 1 to nobs_2;
set test2 point=_n_ nobs=nobs_2;
output;
end;
run;
That's the true cartesian, you can then test for by group equality; but that's messy, really. You could also use a hash table lookup, which works better with BY groups. There are a few different options discussed here.
SAS doesn't handle many-to-many merges very well within the data step. Use PROC SQL if you want every row in a BY group on one side paired with every row in that group on the other.
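A minimal sketch of the PROC SQL approach, using two small hypothetical work tables (sales and goals) rather than the jul11 library from the question, so that it runs on its own:

```sas
/* Hypothetical sample tables: several rows per month on each side */
data sales;
  input month sales;
datalines;
1 3123
1 1234
2 7482
;

data goals;
  input month goal;
datalines;
1 4444
1 5555
1 8989
2 9099
;

proc sql;
  create table all_pairs as
  select s.month,
         s.sales,
         g.goal,
         g.goal - s.sales as difference
  from sales as s
       inner join goals as g
       on s.month = g.month        /* pair every sales row with every goal row of that month */
  order by s.month, s.sales, g.goal;
quit;
```

With this data, month 1 has 2 sales rows and 3 goal rows, so the join returns 2 x 3 = 6 rows for it, where a data step MERGE would return 3.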
I have the data as follows
id^number^obs
123^2^a~b
124^3^c~d~e
125^4^f~g~h~i
The first number is a unique id, the second number is the number of observations for that id, and the rest of the line holds the observations.
For the first line, the unique id is 123 and it has 2 observations: a and b.
I want to read the data into SAS as
id number obs
123 2 a
123 2 b
124 3 c
124 3 d
124 3 e
125 4 f
125 4 g
125 4 h
125 4 i
My question is how I can do that in SAS?
Thanks a lot!
I'm assuming this is a question regarding reading in data from a flat-file and storing it in a SAS dataset. The following code will do that for you:
/* Insert filename */
filename myfile "";
/* This writes out a dataset called mydataset from the flat-file */
data mydataset;
infile myfile dlm='^' dsd firstobs=2;
input id number _obs $;
_i=1;
do until (scan(_obs,_i,'~') = '');
obs=scan(_obs,_i,'~');
_i+1;
drop _:; /* Remove this line to see all variables in final dataset */
output;
end;
run;
Explanation
The data-step reads in records from the flat-file, but before outputting to the dataset, it uses the scan function to separate the obs variable by '~', outputting a separate observation for each value.
As mentioned in the comment, you can remove the drop statement to further understand how the code is working.
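An alternative sketch, not part of the answer above: treat both ^ and ~ as delimiters and let the count field drive a loop that reads one value at a time while holding the line with a trailing @. It assumes the count on each line is accurate; the dataset name mydataset2 and the inline datalines (without the header row) are illustrative.

```sas
data mydataset2;
  infile datalines dlm='^~';   /* both characters act as delimiters */
  input id number @;           /* read the two leading fields and hold the line */
  do _i = 1 to number;         /* rely on the stated number of values */
    input obs $ @;             /* read the next value from the same line */
    output;                    /* one observation per value */
  end;
  drop _i;
datalines;
123^2^a~b
124^3^c~d~e
125^4^f~g~h~i
;
run;
```

If a count were larger than the number of values actually present, the INPUT statement would move on to the next line to find more data, so this version only suits files where the count can be trusted.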