I simply want to check the values read into SAS. This is the raw data file:
----+---10----+---20
H Let
P Grn Lea Qua Gro
P Ice Pls Frm
P Rom Qua Gro
H Sqs
P Ylw Tas Acr
P Zuc Pls Frm
I submitted this code:
data a;
infile 'FileA.txt';
retain vege;
input code $1. @;
if code='H' then input @3 vege $3.;
if code='P';
input @3 variety : $10. @15 Supplier : $11.;
run;
proc print noobs;
run;
I got the observations
Let P Grn Gro
Let P Ice Frm
Let P Rom Gro
Sqs P Ylw Acr
Sqs P Zuc Frm
I understand that the if code='P'; is the reason why the code value is always P, but I would like to know whether there were supposed to be more observations.
According to the textbook I am working from, the sixth observation has certain values, and it is indicated by _N_=6.
I am still learning and not quite sure what it means... may I have some help?
Thank you.
An if without a then is a special form of if not found in other languages. It is known as a subsetting if, and program flow only passes through the statement when the condition evaluates to true.
Data set rows are output implicitly when program flow reaches the bottom of the step (unless there is an explicit output elsewhere in the step).
Thus, all the data file lines were read, but only five of them met the subsetting if criterion asserted by if code='P';. Those five fell through to the end of the step and were implicitly output.
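To see this in the log, here is a minimal sketch. It uses a shortened copy of the data read from inline datalines instead of FileA.txt (that substitution is my assumption), and writes _N_ to the log once for every line read and once more for each line that passes the subsetting if:

```sas
data a;
   infile datalines truncover;
   retain vege;
   input code $1. @;                      /* trailing @ holds the current line */
   if code = 'H' then input @3 vege $3.;  /* header line: remember the vegetable */
   putlog 'read: ' _n_= code= vege=;      /* runs for every line, H or P */
   if code = 'P';                         /* subsetting if: only P lines continue */
   input @3 variety $3.;
   putlog 'kept: ' _n_= variety=;         /* runs only for lines that get output */
datalines;
H Let
P Grn
P Ice
H Sqs
P Ylw
;
run;
```

_N_ counts data step iterations, not output observations, so the three kept rows here should report _N_ values of 2, 3 and 5. In the same way, _N_=6 in the textbook most likely refers to the sixth line read from the file, not to a sixth row in the output.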
I have a data step where I have a few columns that need tied to one other column.
I have tried using multiple "from" statements and " to" statements and a couple other permutations of that, but nothing seems to do the trick. The code looks something like this:
data analyze;
set css_email_analysis;
from = bill_account_number;
to = customer_number;
output;
from = bill_account_number;
to = email_addr;
output;
from = bill_account_number;
to = e_customer_nm;
output;
run;
I would like to see two columns showing bill accounts in the "from" column, and the other values in the "to", but instead I get a bill account and its customer number, with some "..."'s for the other values.
Issue
This is most likely because SAS has two datatypes and the first time the to variable is set up, it has the value of customer_number. At your second to statement you attempt to set to to have the value of email_addr. Assuming email_addr is a character variable, two things can happen here:
Customer_number is a number - to has already been set up as a number, so SAS cannot turn to into a character variable. A note like this may appear in the log, and to is set to missing:
NOTE: Invalid numeric data, 'me@mywebsite.com' , at line 15 column 8. to=.
_ERROR_=1 _N_=1
Customer_number is a character - to has been set up as a character, but without explicitly defining its length, if it happens to be shorter than the value of email_addr then the email address will be truncated. SAS will not show an error if this happens:
Code:
data _NULL_;
to = 'hiya';
to = 'me@mydomain.com';
put to=;
run;
to=me@m
to is set with a length of 4, and SAS does not expand it to fit the new data.
Detail
The thing to bear in mind here is how SAS works behind the scenes.
The data statement sets up an output location
The set statement adds the variables from first observation of the dataset specified to a space in memory called the PDV, inheriting lengths and data types.
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm
============================================================
010101             |            758|me@my.com |John Smith
The to statement adds another variable inheriting the characteristics of customer_number
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm|to
===============================================================
010101             |            758|me@my.com |John Smith   |758
(to is either char length 3 or a numeric)
Subsequent to statements will not alter the characteristics of the variable and SAS will continue processing
PDV (if customer_number is character = TRUNCATION):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===============================================================
010101             |            758|me@my.com |John Smith   |me@
PDV (if customer_number is numeric = DATA ERROR, to set to missing):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===============================================================
010101             |            758|me@my.com |John Smith   |.
Resolution
To resolve this issue it's probably easiest to set the length and type of to before your first to statement:
data analyze;
set css_email_analysis;
from = bill_account_number;
length to $200;
to = customer_number;
output;
...
You may get messages like this, where SAS has converted data on your behalf:
NOTE: Numeric values have been converted to character
values at the places given by: (Line):(Column).
27:8
N.B. it's not necessary to explicitly define the length and type of from, because as far as I can see, you only ever get the values for this variable from one variable in the source dataset. You could also achieve this with a rename if you don't need to keep the bill_account_number variable:
rename bill_account_number = from;
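If the conversion notes are unwanted, one option is to convert explicitly. The sketch below is only an illustration: the one-row css_email_analysis data set, the variable lengths, and the $200 width are my assumptions, not the real table. It relies on CATS, which accepts both numeric and character arguments and strips leading and trailing blanks, so the same code works whichever type customer_number turns out to be.

```sas
/* Hypothetical one-row stand-in for the real css_email_analysis table */
data css_email_analysis;
   length bill_account_number $6 customer_number 8
          email_addr $20 e_customer_nm $20;
   bill_account_number = '010101';
   customer_number     = 758;
   email_addr          = 'me@my.com';
   e_customer_nm       = 'John Smith';
run;

data analyze;
   set css_email_analysis;
   length from to $200;               /* assumed width; size it to your data */
   from = cats(bill_account_number);
   to = cats(customer_number); output; /* numeric converted without a log note */
   to = cats(email_addr);      output;
   to = cats(e_customer_nm);   output;
   keep from to;
run;
```

Each input row should produce three output rows, all with the same from value and a different to value.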
I want to import a raw dataset into SAS whose first column starts with "C". But
there is something wrong with my code, as an ERROR keeps popping up in the log window.
Can anybody help me figure it out?
Sample dataset:
H 1095 NJ 06DEC84
C 01DEC11 $45.0
C 01AUG11 $37.5
H 1096 CA 01SEP83
My code:
Filename hca2 'C:\Users\Desktop\SAS\datasets\HCA_file.txt';
Data assign8.hca2;
Infile hca2;
Input @1 FC $1.;
If FC = 'C' then
Input @3 DOB @11 Transaction_Value ;
Run;
The error (which would have been good to add to the question) is likely from the input of DOB without an associated informat. The following may help.
attrib dob informat=date9. format=date9.;
informat is for processing input and format is for output.
A simple INPUT statement will read in some data and immediately skip to the next line.
When processing a single line of data with multiple input statements, the earlier input statements should end with a trailing @ symbol. This indicates 'held input' and stops the input processor from immediately proceeding to the next line. Instead, the 'active position' of the input processor remains on the same line, at the last position used for input.
Changing the code as follows will force the input processor to remain on the same line.
input @1 FC $1. @;
Note: The input processor will still skip to the next line when the next implicit data step iteration occurs. This means that when your if fails, the next iteration of the data step will read from the next line in the file.
Input can be held across implicit iterations by using two @ symbols (input .... @@;)
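Putting the pieces together, a sketch of the whole step might look like the following. The sample lines are read from inline datalines rather than HCA_file.txt, the informat widths (date7., comma8.) are my guesses from the sample rows, and I have assumed that only the C records are wanted, so treat all of these as assumptions.

```sas
data hca2;
   infile datalines truncover;
   input @1 FC $1. @;                 /* trailing @ keeps the line available */
   if FC = 'C';                       /* assumption: keep only the C records */
   input @3  DOB               date7.   /* e.g. 01DEC11 */
         @11 Transaction_Value comma8.; /* COMMA informat strips the $ sign */
   format DOB date9. Transaction_Value dollar8.2;
datalines;
H 1095 NJ 06DEC84
C 01DEC11 $45.0
C 01AUG11 $37.5
H 1096 CA 01SEP83
;
run;
```

The variable name DOB is kept from the question, although on the C lines the value looks more like a transaction date.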
In a basic data step I'm creating a new variable and I need to filter the dataset based on this new variable.
data want;
set have;
newVariable = 'aaa';
*lots of computations that change newVariable ;
*if xxx then newVariable = 'bbb';
*if yyy AND not zzz then newVariable = 'ccc';
*etc.;
where newVariable ne 'aaa';
run;
ERROR: Variable newVariable is not on file WORK.have.
I usually do this in 2 steps, but I'm wondering if there is a better way.
(Of course, you could always write a complex where statement based on variables present in WORK.have. But in this case the computation of newVariable is too complex, and it is more efficient to do the filter in a second data step.)
I couldn't find any info on this, I apologize for the dumb question if the answer is in the documentation and I didn't find it. I'll remove the question if needed.
Thanks!
Use a subsetting if statement:
if newVariable ne 'aaa';
In general, if <condition>; is equivalent to if not(<condition>) then delete;. The delete statement tells SAS to abandon this iteration of the data step and go back to the start for the next iteration. Unless you have used an explicit output statement before your subsetting if statement, this will prevent a row from being output.
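A small self-contained sketch of this, using a made-up three-row have data set and a made-up rule for changing newVariable (both are placeholders, not the real computation):

```sas
/* Hypothetical input data */
data have;
   input x;
datalines;
1
2
3
;
run;

data want;
   set have;
   length newVariable $3;
   newVariable = 'aaa';
   if x > 1 then newVariable = 'bbb';  /* stand-in for the real computations */
   if newVariable ne 'aaa';            /* subsetting if: drops the x=1 row */
run;

/* Alternative: a where= option on the OUTPUT data set can see new variables */
data want2(where=(newVariable ne 'aaa'));
   set have;
   length newVariable $3;
   newVariable = 'aaa';
   if x > 1 then newVariable = 'bbb';
run;
```

Both steps should leave two rows. The where statement in the question failed because it is applied to the input data set, where newVariable does not exist yet.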
I am unsure if this is possible (or whether it is a stupid question), as I only started looking at SAS last week. I've managed to import my .CSV file into a SAS data set using:
proc import
Specifying the guessingrows= to limit my out=.
My problem now is that the CSV files to import do not all have the same structure, which I noticed after writing some code using the obsnum= to specify the start and the number of lines to read.
So my question is whether SAS is capable of looking for a specific string or empty variable and using it as the end observation.
My Data looks like (but number of Var_x varies for each file):
First I tried looking at the slice=, but it is only useful if I know the exact places of interest, as the empty space between the groups can vary.
Is it possible to use the set function to start at line 1 and read until encountering a blank field? Or can you point me to some function (that I couldn't find myself)?
I would like to look at each "block" separately and process.
Thank you in advance
I think you can do this in a relatively straightforward way if you are comfortable doing some processing after all the data has been inputted.
So do proc import on the whole dataset with no restriction.
Then use a data step and a counter to process through the data and output as necessary. Something like:
data output1 output2 output3;
set imported_data;
retain counter 1; /* start at block 1 and keep the value across rows; without retain it resets to missing */
var1lag = lag(var1);
if var1 = '' and var1lag ne '' then counter=counter+1;
if counter = 1 then output output1;
else if counter = 2 then output output2;
else output output3;
run;
data output1;
set output1;
if var1 = '' and var2 = . and var3 = . then delete;
run;
data output2;
set output2;
if var1 = '' and var2 = . and var3 = . then delete;
run;
data output3;
set output3;
if var1 = '' and var2 = . and var3 = . then delete;
run;
The above code outputs to three datasets based on the value of counter. The lag function lets us look back one row to make sure it's the first blank row after some data, and the counter is incremented each time that happens.
Then we go back and remove any fully blank rows from our datasets.
If you have many outputs, you could make this more scalable by replacing the if/else output statements with something more general.
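One way to avoid a separate data set per block (my own variation, not the array approach mentioned above) is to keep everything in a single data set with a block number and process it with by-groups. The tiny imported_data set and the var1-var3 names below are hypothetical stand-ins for the real import.

```sas
/* Hypothetical stand-in for the imported CSV: two blocks split by a blank row */
data imported_data;
   infile datalines dsd truncover;
   input var1 $ var2 var3;
datalines;
a,1,2
b,3,4
,,
c,5,6
;
run;

data blocks;
   set imported_data;
   retain block 1;                      /* block number survives across rows */
   prev_var1 = lag(var1);               /* call lag on every row, unconditionally */
   if missing(var1) and not missing(prev_var1) then block = block + 1;
   if cmiss(var1, var2, var3) = 3 then delete;  /* drop fully blank rows */
   drop prev_var1;
run;

/* Each block can then be handled as a by-group */
proc print data=blocks;
   by block;
run;
```

With this sample, rows a and b should land in block 1 and row c in block 2, however many blocks the file contains.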
I have a messy file, where some of the columns are tab-delimited and some are comma-delimited.
My problem with the data set is reading fields with variable lengths:
12 Stephen Cole, 33, Columbia, MO
5 Dave Anderson, 25*, Concord, OH
The first column is an ID (tab), then the name (comma), age (comma), active (presence of an asterisk after age), home (tab)
The * after the age indicates if they are inactive.
All the names start at column #19, but everything after that is variable lengths and column starts.
I want to read it into a layout where I finally get:
ID Name Age Active Home
12 Stephen Cole 33 Active Columbia, MO
5 Dave Anderson 25 Inactive Concord, OH
Thus far I have:
data marathon;
infile 'c:/file.txt' dlm=',' pad firstobs=12;
input @3 ID 3. @19 Name $CHAR13.;
Then I get stuck on how to read the rest. I am mostly thrown with how to read the asterisk next to the age as its own column. If I had that understood, I think I can handle the rest.
You have a couple of issues. First, you need to use delimited input, specifically you need to combine comma and tab into one set of delimiters - one way is shown below. Second, you have two fields that are nontrivial; the one with the asterisk needs to be parsed afterwards (I use compress to keep specifically digits in the first line, and to keep specifically asterisks in the second line). You also need to read city/state in separate fields and combine them together (I use catx).
data want;
infile "c:\temp\test.dat" dlm='092C'x;
input
id
name :$50.
age_active $
home_city :$25.
home_st $
;
age=input(compress(age_active,,'kd'),best.);
active = ifc(compress(age_active,'*','k')='*','Inactive','Active'); /* an asterisk after the age means inactive */
home = catx(', ',home_city,home_st);
run;
Watch your lengths. I suggest reasonable ones given my past experience, but you could easily see longer names or cities.