I have data as follows:
id^number^obs
123^2^a~b
124^3^c~d~e
125^4^f~g~h~i
The first number is a unique id, the second number is the number of observations for that id, and the rest of the line is the observations.
For the first line, the unique id is 123 and it has 2 observations: a and b.
I want to read the data into SAS as:
id number obs
123 2 a
123 2 b
124 3 c
124 3 d
124 3 e
125 4 f
125 4 g
125 4 h
125 4 i
My question is: how can I do that in SAS?
Thanks a lot!
I'm assuming this is a question regarding reading in data from a flat-file and storing it in a SAS dataset. The following code will do that for you:
/* Insert filename */
filename myfile "";
/* This writes out a dataset called mydataset from the flat-file */
data mydataset;
    infile myfile dlm='^' dsd firstobs=2;
    /* Read the whole list into _obs; $200 is an assumed maximum (the default of 8 characters would truncate longer lists) */
    input id number _obs :$200.;
    _i=1;
    do until (scan(_obs,_i,'~') = '');
        obs=scan(_obs,_i,'~');
        _i+1;
        output;
    end;
    drop _:; /* Remove this line to see all variables in final dataset */
run;
Explanation
The data step reads each record from the flat file into id, number and _obs. Before writing to the dataset, it uses the scan function to split _obs on '~' and outputs a separate observation for each value.
As mentioned in the comment, you can remove the drop statement to further understand how the code is working.
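As a variation on the loop above, here is a minimal sketch that counts the values with countw first, so a record with an empty list produces no rows. It reads the sample data inline with datalines instead of from a file, and the $200 and $20 lengths are assumptions about maximum sizes, not something stated in the question.

```sas
data mydataset;
    infile datalines dlm='^' dsd firstobs=2;
    input id number _obs :$200.;     /* $200 is an assumed upper bound */
    length obs $20;                  /* assumed maximum length of one value */
    /* countw gives the number of '~'-separated values in the list */
    do _i = 1 to countw(_obs, '~');
        obs = scan(_obs, _i, '~');
        output;                      /* one row per value */
    end;
    drop _:;
datalines;
id^number^obs
123^2^a~b
124^3^c~d~e
125^4^f~g~h~i
;
run;
```

With the sample data this should give the nine rows shown in the question.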
I understand how to use pointer control to search for a phrase in the raw data and then read the value into a SAS variable. I need to know how to tell SAS to stop reading the raw data when it encounters a particular phrase.
For example, in the code below I want to read only the data between the phrases Start and Stop, so Jelly should not be part of the output.
data work.two;
input @"1" Name :$32.;
datalines;
Start 1 Frank 6
1 Joan 2
3 Sui Stop
1 Jelly 4
5 Jose 3
;
run;
You cannot really combine those into a single pass through the file. The problem is that the @'1' will skip past the line with Stop in it, so there is no way your data step will see it.
Pre-process the file.
filename copy temp;
data _null_;
file copy ;
retain start 0 ;
input ;
if index(_infile_,'Start') then start=1;
if start then put _infile_;
if index(_infile_,'Stop') then stop;
datalines;
Start 1 Frank 6
1 Joan 2
3 Sui Stop
1 Jelly 4
5 Jose 3
;
data work.two;
infile copy ;
input @"1" Name :$32. @@;
run;
You can make the logic to detect what parts of the source file to include as complex as you need.
Every name is the second word from the right in its row, so the name can be extracted with the scan function. If the row contains 'Stop', stop the loop.
data work.two;
input @@;
Name=scan(_infile_,-2);
if indexw(_infile_,'Stop')>0 then stop;
input;
datalines;
Start 1 Frank 6
1 Joan 2
3 Sui Stop
1 Jelly 4
5 Jose 3
;
run;
I'm trying to create variables Cap1 through Cap6. I'm not sure how to have them read as character data. My code is:
DATA Capture;
INFILE '/folders/myfolders/sasuser.v94/Capture.txt' DLM='09'x DSD MISSOVER FIRSTOBS=2;
INPUT Sex $ AgeGroup $ Weight Cap1 - Cap6 $;
RUN;
My issue is that Cap1 through Cap5 are interpreted as numeric data. How do I solve this?
Your issue is simple: you are using a variable list, but you aren't applying the $ to the whole list. You need parentheses around the variable list and around the modifier so that it applies to every variable in the list.
See:
DATA Capture;
INFILE datalines DLM=' ' DSD;
INPUT Sex $ AgeGroup $ Weight (Cap1 - Cap6) ($);
datalines;
M 18-34 135 A B C D E F
F 35-54 115 G H I J K L
;;;;
RUN;
Indeed, I would also expect this input statement to work the way you wrote it, but it does not. Putting a $ after Cap1 does not resolve it either, as this log shows.
26 INPUT Sex $ AgeGroup $ Weight Cap1 $ - Cap6 $;
_
22
ERROR 22-322: Expecting a name.
You can solve it by assigning a format to your variables before reading them, for instance format Cap1 - Cap6 $2.;
To test it, I included the data in the source file, i.e. using datalines:
DATA Capture;
INFILE datalines DLM='09'x DSD missover FIRSTOBS=1;
format Sex $1. AgeGroup $9. Weight 8.2 Cap1 - Cap6 $2.;
INPUT Sex AgeGroup Weight Cap1 - Cap6;
datalines;
M 1-5 24.5 11 12 13 14 15 16
M 6-10 34.2 21 22 23 24 25 26
;
proc print;
proc contents;
RUN;
How to understand this:
SAS was originally created as a programming language for non-developers (namely statisticians) who would rather not care about data formats, so SAS does a lot of guesswork for you (just like VBA if you don't use Option Explicit).
So, the first time you mention a variable name in a data step, SAS adds a variable to the Program Data Vector (PDV) with an appropriate type (numeric or character) and length, but this is guesswork.
For instance, in the test dataset CLASS included in the standard installation of SAS, the first value assigned to gender in the code is the four-character 'male', so
data WORK.CLASS;
set sasHelp.CLASS;
select (sex);
when ('M') gender = 'male';
when ('F') gender = 'female';
otherwise gender = 'unknown';
end;
run;
results in truncating 'female' to four positions.
You can correct that by instructing SAS to add the variable to the PDV beforehand.
For a character variable,
format myName $20.; and
length myName $20.; are equivalent and
informat myName $20.; is also about the same.
(The story becomes more complex with user-defined formats, though.)
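Applied to the CLASS example, a minimal sketch of the fix could look like this. The length of 7 is an assumption, chosen to fit the longest value assigned here ('unknown').

```sas
data WORK.CLASS;
    set sasHelp.CLASS;
    /* Declare gender before its first assignment so SAS does not infer length 4 from 'male' */
    length gender $7;   /* assumed length: long enough for 'unknown' */
    select (sex);
        when ('M') gender = 'male';
        when ('F') gender = 'female';
        otherwise gender = 'unknown';
    end;
run;
```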
For numerics, there is a huge difference:
length mySize 8.; reserves 8 bytes in the PDV for mySize
format mySize 8.; tells SAS to print or display mySize with up to 8 digits and no decimals
informat mySize 8.; tells SAS to expect 8 digits without decimals when reading mySize.
Numerics can only have certain lengths, depending on the operating system. On Windows
8. is the default and corresponds to a double in most databases
4. corresponds to a float
3. is the minimum, which I use for booleans
Formats can be very different:
format mySize 8.3; tells SAS to print mySize with 8 characters, including 3 decimals for the fraction (which leaves room for up to 4 digits before the decimal point if the value is positive; fewer decimals will be printed to display larger numbers)
informat mySize 8.3; tells SAS to read mySize assuming the last 3 digits are the fraction, so 12345678 will be interpreted as 12345.678
Then there are special formats to read and write dates, times and so on, as well as user-defined value and picture formats, but that would lead me too far.
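The difference between reading with an 8.3 informat and printing with an 8.3 format can be checked with a small sketch like the one below. The variable names x and y are hypothetical and only serve the illustration.

```sas
data _null_;
    /* Informat 8.3: the text has no decimal point, so the last 3 digits become the fraction */
    x = input('12345678', 8.3);
    put x= 10.3;    /* should show x=12345.678, as described above */

    /* Format 8.3: the value is written 8 characters wide with 3 decimals */
    y = 1234.5678;
    put y= 8.3;
run;
```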
I have a dataset which has multiple rows of data for a given person, but only the first row of the person's information contains their name. The rest of the rows of that person's data have the name field missing. I think I can use the retain statement to populate the name, but nothing I try works.
Here is an example of the dataset structure I am working with:
data test;
input id $ value ;
datalines;
Bob 100
. 200
. 300
Jim 475
. 250
. 300
;
run;
I think the problem is that the input statement overwrites id on every iteration of the data step, so retaining id itself has no effect: in the rows after the first, the lone . in the datalines is read as an empty (missing) id.
Try this:
data test;
input id $ value;
/*store not empty ID in different retained variable*/
retain current_id;
if not missing(id) then current_id=id;
else id=current_id;
datalines;
Bob 100
. 200
. 300
Jim 475
. 250
. 300
;
run;
When reading an input file where one line contains more than one observation, we can use either a trailing '@' or '@@'.
When should we use one over the other?
Use the double @ when you want the pointer to remain in the same place for the next iteration of the data step. If you just want the pointer to remain in place for the next INPUT statement in the current iteration of the data step, then you only need a single trailing @.
Example reading one line with multiple iterations of the data step.
data want;
id+1;
input score @@;
cards;
10 20 30 45
;
Example reading from one line multiple times in the same iteration of the data step.
data want;
infile cards truncover ;
input id score @;
do rep=1 by 1 until (score=.);
output;
input score @;
end;
cards;
1 10 20 30 45
2 15 32
3 5 6 8 12 13 56
;
I have a table as below:
id sprvsr phone name
2 123 5232 ali
2 128 5458 ali
3 145 7845 oya
3 125 4785 oya
I would like to put the same id and name on one row, with the sprvsr and phone values joined together, as below:
id sprvsr phone name
2 123-128 5232-5458 ali
3 145-125 7845-4785 oya
Edit:
I have one more question related to this one.
I followed the way you showed me and it works. Thank you! Another problem is, for example:
sprvsr name
5232-5458 ali
5232-5458 ali
5458-5232 ali
Is there any way I can put them in the same order?
If you need the values in the same order, you'll need to use a temporary array and sort it. This requires having some idea of how many rows you might have per id, and it also requires the input to be sorted by id. This is a bit more complicated than the previous solution (in a previous revision).
data have;
input id sprvsr $ phone $ name $;
datalines;
2 123 5232 ali
2 128 5458 ali
3 145 7845 oya
3 125 4785 oya
4 128 5458 ali
4 123 5232 ali
;
run;
data want;
array phones[99] $8 _temporary_; *initialize these two to some reasonably high number;
array sprvsrs[99] $3 _temporary_;
length phone_all sprvsr_all $200; *same;
set have;
by id;
if first.id then do; *for each id, start out clearing the arrays;
call missing(of phones[*] sprvsrs[*]);
_counter=0;
end;
_counter+1; *increment counter;
phones[_counter]=phone; *assign current phone/sprvsr to array elements;
sprvsrs[_counter]=sprvsr;
if last.id then do; *now, create concatenated list and output;
call sortc(of phones[*]); *sort the lists;
call sortc(of sprvsrs[*]);
phone_all = catx('-',of phones[*]); *concatenate them together;
sprvsr_all= catx('-',of sprvsrs[*]);
output;
end;
drop phone sprvsr;
rename
phone_all=phone
sprvsr_all=sprvsr;
run;
The construction array[*] means "all elements of that array". So catx('-',of phones[*]) passes every element of phones to catx (fortunately, catx ignores the missing ones).
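To see the sorting and the skipping of missing elements in isolation, here is a small sketch. The array name parts, its size of 4 and the variable joined are made up for the illustration.

```sas
data _null_;
    /* Hypothetical 4-element array with two filled and two blank slots */
    array parts[4] $8 _temporary_ ('5458' ' ' '5232' ' ');
    length joined $40;
    call sortc(of parts[*]);            /* blanks sort first, then '5232', '5458' */
    joined = catx('-', of parts[*]);    /* catx skips the blank elements */
    put joined=;                        /* should show joined=5232-5458 */
run;
```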
This is a way to do that:
data have;
input id sprvsr $ phone $ name $;
datalines;
2 123 5232 ali
2 128 5458 ali
3 145 7845 oya
3 125 4785 oya
;
run;
data want (drop=lag_sprvsr lag_phone);
format id;
length sprvsr $7 phone $9;
set have;
by id;
lag_sprvsr=lag(sprvsr);
lag_phone=lag(phone);
if lag(id)=id then do;
sprvsr=catx('-',lag_sprvsr,sprvsr);
phone=catx('-',lag_phone,phone);
end;
if last.id then output;
run;
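Just pay attention to the possible lengths of the input variables and of the concatenated string. The input dataset must be sorted by id. Note that, as written, this combines only the last two rows of each id, which is enough for the sample data.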
The catx() function removes the leading and trailing blanks and concatenates with a delimiter.
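If an id can have more than two rows, one possible sketch is to accumulate the values in retained variables instead of using lag. This is not the code from either answer: the names sprvsr_all and phone_all and the $200 length are assumptions, and the third row for id 2 is hypothetical data added only to show the effect.

```sas
/* Sample data from the question plus one hypothetical extra row for id 2 */
data have;
    input id sprvsr $ phone $ name $;
datalines;
2 123 5232 ali
2 128 5458 ali
2 130 5999 ali
3 145 7845 oya
3 125 4785 oya
;
run;

data want;
    length sprvsr_all phone_all $200;   /* assumed upper bound for the joined strings */
    retain sprvsr_all phone_all;        /* keep the running lists across rows */
    set have;
    by id;                              /* input must be sorted by id */
    if first.id then call missing(sprvsr_all, phone_all);
    sprvsr_all = catx('-', sprvsr_all, sprvsr);
    phone_all  = catx('-', phone_all, phone);
    if last.id then output;             /* one row per id */
    drop sprvsr phone;
    rename sprvsr_all=sprvsr phone_all=phone;
run;
```

The values are joined in input order, so sorting them would still need something like the temporary-array approach.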