SAS Raw data importing - sas

I want to import a raw data file in SAS, keeping only the lines whose first column is "C". Something is wrong with my code, because an ERROR keeps appearing in the log window.
Can anybody help me work out what is wrong?
Sample dataset:
H 1095 NJ 06DEC84
C 01DEC11 $45.0
C 01AUG11 $37.5
H 1096 CA 01SEP83
My code:
Filename hca2 'C:\Users\Desktop\SAS\datasets\HCA_file.txt';
Data assign8.hca2;
Infile hca2;
Input @1 FC $1.;
If FC = 'C' then
Input @3 DOB @11 Transaction_Value ;
Run;

The error (which would have been good to include in the question) is likely caused by reading DOB without an associated informat. The following may help.
attrib dob informat=date9. format=date9.;
An informat controls how input is read; a format controls how values are displayed on output.
A simple INPUT statement reads some data and then immediately moves to the next line.
When a single line of data is processed with multiple INPUT statements, the earlier INPUT statements should end with a trailing @ symbol. The trailing @ holds the line ('held input'), so the input processor does not immediately proceed to the next line. Instead, the 'active position' of the input processor stays on the same line, at the last position used for input.
Changing the code as follows will force the input processor to remain on the same line.
input @1 FC $1. @;
Note: The input processor still moves to the next line when the next implicit data step iteration begins. This means that when your if condition fails, the next iteration of the data step reads from the next line in the file.
Input can be held across implicit iterations by using two @ symbols (input .... @@;)
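For readers who find the pointer behaviour easier to follow outside SAS, here is a rough Python analogue of the intended step: look at the first character, and only if it is C keep reading the same line. It is a sketch only, not SAS. The function name read_c_records is made up, and the column positions (date in columns 3 to 9, value from column 11) are assumed from the sample data in the question.

```python
from datetime import datetime

def read_c_records(lines):
    """Return (date, value) pairs for lines whose first character is 'C'."""
    records = []
    for line in lines:
        fc = line[:1]                      # first INPUT: read one character, hold the line
        if fc != "C":
            continue                       # line is released; the next pass reads the next line
        # Second INPUT, still on the same line (assumed column positions).
        dob = datetime.strptime(line[2:9], "%d%b%y").date()
        value = float(line[10:].strip().lstrip("$"))
        records.append((dob, value))
    return records

sample = [
    "H 1095 NJ 06DEC84",
    "C 01DEC11 $45.0",
    "C 01AUG11 $37.5",
    "H 1096 CA 01SEP83",
]
for dob, value in read_c_records(sample):
    print(dob.isoformat(), value)
```

Month-name parsing with %b depends on the locale, so this assumes English month abbreviations.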


One too many new lines in a WRITE statement [duplicate]

This is my code:
Program Output_Format
Implicit none
Integer::k
Integer,parameter:: Br_nn_mre = 5
Character(56),parameter:: FMT_01 = '(1x,"NN_DM:",1x,*("NN_",i2.2,1x))'
Open( 15 , File = 'Output.txt' , Status = 'Unknown' , Action = 'Write' )
Write( 15 , FMT_01 ) ( k , k = 1 , Br_nn_mre )
Close( 15 , Status = 'Keep' )
End Program Output_Format
The content of Output.txt file is:
NN_DM: NN_01 NN_02 NN_03 NN_04 NN_05 NN_
I want to get this content in Output.txt:
NN_DM: NN_01 NN_02 NN_03 NN_04 NN_05
That is, without the trailing NN_
What is wrong with the * in the FMT_01 format? For example, if I put a 5 in place of the * I get what I want. How can I use the unlimited repeat count and still get the desired output? I won't always know how many times to repeat.
This is related to how formats are processed, and in particular, when a data transfer statement terminates.
For an output statement such as you have, transfer terminates when either:
a data edit descriptor is reached and there is no remaining element in the output list; or
the final closing parenthesis is reached and there is no remaining element in the output list.
In your formats
'(1x,"NN_DM:",1x,*("NN_",i2.2,1x))'
and
'(1x,"NN_DM:",1x,5("NN_",i2.2,1x))'
the single data edit descriptor is the i2.2. The 1x items are control edit descriptors, and "NN_DM:" and "NN_" are character string edit descriptors.
Let's look at how your format is processed with 5 as the repeat count. The first part of the format, 1x,"NN_DM:",1x, is processed without issue, giving the output NN_DM: and moving us on to 5("NN_",i2.2,1x)). Five data items correspond to this repeated fragment, so they are processed (giving the output NN_01 NN_02 NN_03 NN_04 NN_05).
The important part is what happens next. After completing this 5(..) part we reach the final closing parenthesis of the format specification and there is no remaining output item, so processing of the format comes to an end.
What's different with the *(..) case?
Well, when we reach the end of *(..) we go back round to the start of that repeated format; we don't move on to the final closing parenthesis.[1] That leaves us to process the edit descriptors until we reach a data edit descriptor. This means that "NN_" is processed (resulting in NN_ being output) before we notice that we are out of data items for output.
Coming to the fix: use the colon edit descriptor. The colon edit descriptor acts like a data edit descriptor in the sense that format processing terminates immediately if there is no remaining data item.
Character(56),parameter:: FMT_01 = '(1x,"NN_DM:",1x,*("NN_",i2.2,:,1x))'
Personally, I would write this as
Character(*),parameter:: FMT_01 = '(" NN_DM:",*(" NN_",i2.2,:))'
[1] This would be no different if we had 6 as the repeat count; * isn't special except that it is a "very large repeat count".
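The termination rules described in this answer can be mimicked with a small Python model. This is only a toy sketch, not Fortran: the function render and the DATA and COLON markers are invented for illustration, and format reversion (what happens when the closing parenthesis is reached with items still left) is deliberately not modelled.

```python
from itertools import count

DATA = object()    # stands in for the data edit descriptor i2.2
COLON = object()   # stands in for the colon edit descriptor

def render(prefix, group, items, repeat=None):
    """Process prefix, then the repeated group; repeat=None models the * count."""
    out = [prefix]
    remaining = list(items)
    passes = count() if repeat is None else range(repeat)
    for _ in passes:
        for desc in group:
            if desc is DATA or desc is COLON:
                if not remaining:
                    return "".join(out)     # terminate: no output item is left
                if desc is DATA:
                    out.append(f"{remaining.pop(0):02d}")
            else:
                out.append(desc)            # literals and spacing are always emitted
    # Final closing parenthesis reached after a finite repeat count.
    return "".join(out)

items = range(1, 6)
print(repr(render(" NN_DM: ", ["NN_", DATA, " "], items)))
print(repr(render(" NN_DM: ", ["NN_", DATA, COLON, " "], items)))
print(repr(render(" NN_DM: ", ["NN_", DATA, " "], items, repeat=5)))
```

In this model the first call ends with a stray NN_, because the literal is emitted before the missing data item is noticed, while the call with COLON stops right after the last number.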

Multiple To clauses in Data step

I have a data step where I have a few columns that need to be tied to one other column.
I have tried using multiple "from" statements and " to" statements and a couple other permutations of that, but nothing seems to do the trick. The code looks something like this:
data analyze;
set css_email_analysis;
from = bill_account_number;
to = customer_number;
output;
from = bill_account_number;
to = email_addr;
output;
from = bill_account_number;
to = e_customer_nm;
output;
run;
I would like to see two columns showing bill accounts in the "from" column, and the other values in the "to", but instead I get a bill account and its customer number, with some "..."'s for the other values.
Issue
This is most likely because SAS has only two data types (numeric and character), and the type of the to variable is fixed the first time it is set up, here from customer_number. At your second to statement you attempt to give to the value of email_addr. Assuming email_addr is a character variable, one of two things can happen:
Customer_number is numeric - to has already been set up as numeric, so SAS cannot turn it into a character variable, and a note like this may appear:
NOTE: Invalid numeric data, 'me@mywebsite.com' , at line 15 column 8. to=.
_ERROR_=1 _N_=1
Customer_number is character - to has been set up as character, but because its length was not explicitly defined it takes the length of customer_number. If that is shorter than the value of email_addr, the email address will be truncated. SAS will not show an error if this happens:
Code:
data _NULL_;
to = 'hiya';
to = 'me@mydomain.com';
put to=;
run;
to=me@m
to is set with a length of 4, and SAS does not expand it to fit the new data.
Detail
The thing to bear in mind here is how SAS works behind the scenes.
The data statement sets up an output location
The set statement adds the variables from first observation of the dataset specified to a space in memory called the PDV, inheriting lengths and data types.
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm
===================================================================
010101 | 758|me@my.com |John Smith
The to statement adds another variable inheriting the characteristics of customer_number
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me@my.com |John Smith |758
(to is either char length 3 or a numeric)
Subsequent to statements will not alter the characteristics of the variable and SAS will continue processing
PDV (if customer_number is character = TRUNCATION):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me@my.com |John Smith |me@
PDV (if customer_number is numeric = DATA ERROR, to set to missing):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me@my.com |John Smith |.
Resolution
To resolve this issue it's probably easiest to set the length and type of to before your first to statement:
data analyze;
set css_email_analysis;
from = bill_account_number;
length to $200;
to = customer_number;
output;
...
You may get messages like this, where SAS has converted data on your behalf:
NOTE: Numeric values have been converted to character
values at the places given by: (Line):(Column).
27:8
N.B. it's not necessary to explicitly define the length and type of from, because as far as I can see, you only ever get the values for this variable from one variable in the source dataset. You could also achieve this with a rename if you don't need to keep the bill_account_number variable:
rename bill_account_number = from;

Checking data vectors in SAS

I want to simply check the values read in SAS. In the raw data file
----+---10----+---20
H Let
P Grn Lea Qua Gro
P Ice Pls Frm
P Rom Qua Gro
H Sqs
P Ylw Tas Acr
P Zuc Pls Frm
I submitted a code
data a;
infile 'FileA.txt';
retain vege;
input code $1. @;
if code='H' then input @3 vege $3.;
if code='P';
input @3 variety : $10. @15 Supplier : $11.;
run;
proc print noobs;
run;
I got the observations
Let P Grn Gro
Let P Ice Frm
Let P Rom Gro
Sqs P Ylw Acr
Sqs P Zuc Frm
I understand that the if code='P'; statement is the reason why the code value is always P, but I would like to know whether there were supposed to be more observations.
According to the textbook I am working from, the sixth observation has certain values, and it is indicated by _N_=6.
I am still learning and not quite sure what that means. May I have some help?
Thank you.
An if without a then is a special form of if not found in other languages. It is known as a subsetting if, and program flow only continues past the statement when the condition evaluates to true.
Data set rows are output implicitly when program flow reaches the bottom of the step (unless there is an explicit output statement elsewhere in the step).
Thus, all seven lines of the data file were read, but only five of them met the subsetting if criterion asserted by if code='P'; and fell through to the end of the step, where they were implicitly output. _N_ counts iterations of the data step rather than observations written, so _N_=6 refers to the sixth iteration (here, the sixth line read), not to a sixth output row.
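To see why seven lines give five observations, the control flow of the step can be mimicked in Python. This is only a sketch, not SAS: the name run_step is made up, and the sample lines are re-typed with the variety at column 3 and the supplier at column 15, which is an assumption about how the real file is laid out.

```python
def run_step(lines):
    """Mimic the data step: one pass per line, output only when code is 'P'."""
    output = []
    vege = None                                   # retained across passes
    for n, line in enumerate(lines, start=1):     # n plays the role of _N_
        code = line[:1]
        if code == "H":
            vege = line[2:5].strip()
        if code != "P":
            continue                              # subsetting if fails: nothing is output
        variety = line[2:].split()[0][:10]        # first word starting at column 3
        supplier = line[14:].split()[0][:11]      # first word starting at column 15
        output.append((n, vege, code, variety, supplier))
    return output

lines = [
    "H Let",
    "P Grn Lea Qua Gro",
    "P Ice     Pls Frm",
    "P Rom     Qua Gro",
    "H Sqs",
    "P Ylw     Tas Acr",
    "P Zuc     Pls Frm",
]
for row in run_step(lines):
    print(row)
```

In this model the pass with n equal to 6 produces the fourth row that is output, which shows how an iteration counter and the row count in the output can differ.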

How to manipulate multiple csv files with raw data starting from different row for each file?

I would like to format multiple csv files, some of which have summaries before the raw data. The raw data can start at any row, but if "colname" is found in a row, then the raw data starts there. I am using the standard library csv module to read the files, check whether "colname" exists, and extract the data from there. With the code below, print(data) always gives me data from the first row of the file. But I want to pull the data starting from where "colname" is found. If "colname" is not found, I don't want to read the data.
import csv
import os

root_dir = r"folder1"
for fname in os.listdir(root_dir):
    file_path = os.path.join(root_dir, fname)
    if fname.endswith('.csv'):
        n = 0
        with open(file_path, 'rU') as fp:
            csv_reader = csv.reader(fp)
            while True:
                for line in csv_reader:
                    if line == " colname": continue
                    n = n + 1
                    data = line
                    print(data)
As written, your code skips only the lines that are exactly " colname" and prints every other line, which has two problems:
You want to skip lines until AFTER you have seen "colname"; you could use a boolean variable to distinguish between the two situations.
It is not clear that your test for colname is correct; for example, it would fail if there isn't exactly one leading space, or if the line has a trailing end-of-line character. Note also that csv.reader yields each row as a list of strings, so comparing the whole row with a string is never true.
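The boolean-flag idea could look something like the sketch below. The function name rows_from_header and the default marker value are illustrative assumptions, and the sketch treats a row as the header when any of its cells equals the marker after stripping whitespace.

```python
import csv

def rows_from_header(lines, marker="colname"):
    """Return the rows from the first row containing marker onward, or [] if absent."""
    found = False
    rows = []
    for row in csv.reader(lines):
        # Each row is a list of strings, so test the individual cells.
        if not found and any(cell.strip() == marker for cell in row):
            found = True
        if found:
            rows.append(row)
    return rows

sample = ["summary,1", "", " colname,value", "a,1", "b,2"]
print(rows_from_header(sample))
```

Because csv.reader accepts any iterable of lines, the same function can be given an open file object, for example one opened with open(file_path, newline="").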

SAS infile messy format of variable lengths

I have a messy file, where some of the columns are tab delimited and some are comma delimited.
My problem with the data set is reading fields of variable length.
12 Stephen Cole, 33, Columbia, MO
5 Dave Anderson, 25*, Concord, OH
The fields are: an ID (followed by a tab), then the name (comma), the age (comma), the active status (shown by an asterisk after the age), and the home (tab).
The * after the age indicates that the person is inactive.
All the names start at column 19, but everything after that has variable lengths and starting columns.
I want to read it into a dataset that finally looks like this:
ID Name Age Active Home
12 Stephen Cole 33 Active Columbia, MO
5 Dave Anderson 25 Inactive Concord, OH
Thus far I have:
data marathon;
infile 'c:/file.txt' dlm=',' pad firstobs=12;
input @3 ID 3. @19 Name $CHAR13.;
Then I get stuck on how to read the rest. I am mostly thrown by how to read the asterisk next to the age as its own column. If I understood that, I think I could handle the rest.
You have a couple of issues. First, you need to use delimited input, specifically you need to combine comma and tab into one set of delimiters - one way is shown below. Second, you have two fields that are nontrivial; the one with the asterisk needs to be parsed afterwards (I use compress to keep specifically digits in the first line, and to keep specifically asterisks in the second line). You also need to read city/state in separate fields and combine them together (I use catx).
data want;
infile "c:\temp\test.dat" dlm='092C'x;
input
id
name :$50.
age_active $
home_city :$25.
home_st $
;
age=input(compress(age_active,,'kd'),best.);
active = ifc(compress(age_active,'*','k')='*','Inactive','Active');
home = catx(', ',home_city,home_st);
run;
Watch your lengths: I chose reasonable ones based on past experience, but you could easily see longer names or cities.
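The same parsing idea (split on tab or comma, then pull the digits and the asterisk out of the combined age field) can be sketched in Python for comparison. This is not the SAS step itself. The function name parse_runner is hypothetical, the sample lines assume a single tab after the ID, and the active flag follows the question's rule that an asterisk marks an inactive person.

```python
import re

def parse_runner(line):
    """Split one line on tabs and commas and tidy up the fields."""
    fields = [f.strip() for f in re.split(r"[\t,]", line)]
    runner_id, name, age_active, city, state = fields[:5]
    return {
        "id": int(runner_id),
        "name": name,
        "age": int(re.sub(r"\D", "", age_active)),             # keep digits only
        "active": "Inactive" if "*" in age_active else "Active",
        "home": f"{city}, {state}",                             # rejoin city and state
    }

for raw in ["12\tStephen Cole, 33, Columbia, MO", "5\tDave Anderson, 25*, Concord, OH"]:
    print(parse_runner(raw))
```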