SAS infile messy format of variable lengths - sas

I have a messy file, where some of the columns are tab delimited and some are comma delimited.
My problem with the data set is reading the fields, which have variable lengths:
12 Stephen Cole, 33, Columbia, MO
5 Dave Anderson, 25*, Concord, OH
The first column is an ID (followed by a tab), then the name (comma), age (comma), active status (marked by an asterisk after the age), and home (tab).
An * after the age indicates that they are inactive.
All the names start at column #19, but everything after that has variable lengths and starting columns.
I want to read it into a format where I finally get:
ID Name Age Active Home
12 Stephen Cole 33 Active Columbia, MO
5 Dave Anderson 25 Inactive Concord, OH
Thus far I have:
data marathon;
infile 'c:/file.txt' dlm=',' pad firstobs=12;
input @3 ID 3. @19 Name $CHAR13.;
Then I get stuck on how to read the rest. I am mostly thrown with how to read the asterisk next to the age as its own column. If I had that understood, I think I can handle the rest.

You have a couple of issues. First, you need to use delimited input, specifically you need to combine comma and tab into one set of delimiters - one way is shown below. Second, you have two fields that are nontrivial; the one with the asterisk needs to be parsed afterwards (I use compress to keep specifically digits in the first line, and to keep specifically asterisks in the second line). You also need to read city/state in separate fields and combine them together (I use catx).
data want;
infile "c:\temp\test.dat" dlm='092C'x;
input
id
name :$50.
age_active $
home_city :$25.
home_st $
;
age=input(compress(age_active,,'kd'),best.);
active = ifc(compress(age_active,'*','k')='*','Inactive','Active'); /* an asterisk marks the person as inactive */
home = catx(', ',home_city,home_st);
run;
Watch your lengths: I suggested reasonable ones based on past experience, but you could easily see longer names or cities.

Related

Getting all words before a specific word in SAS

I have a SAS dataset which has text column as below:
" word1 word2 documented word .... word n"
I have two issues:
While performing text cleaning, I want to remove numbers from this text, but using the compress function squeezes everything into one word, which makes the sentence unreadable.
I want to extract all the words before the word "documented".
Any help, please?
Input dataset (originally posted as an image, not available here).
Output dataset:
ID | Comments | Results
1 | increase documented this credit package requires approval | increase
2 | new business modification documented ls&f cancelled | new business modification
3 | annual renewal documented this package requires approval | annual renewal
If you want everything before the word "documented", search for it (FIND) and take the substring up to that position (SUBSTRN).
You will have to decide how to handle the string when the word is not found.
data _null_;
input str $char80.;
length new $80;
new = substrn(str,1,find(str,'documented')-1);
put 'NOTE: ' str=;
put 'NOTE- ' new=;
cards;
a new business documented with id-123456
oldnew business with id-123456
;;;;
run;
NOTE: str=a new business documented with id-123456
new=a new business
NOTE: str=oldnew business with id-123456
new=
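One possible way to handle the not-found case is sketched below, assuming you want to keep the whole comment when "documented" is absent. That choice and the variable names pos and nodigits are illustrative assumptions, not part of the original answer. The sketch also uses compress with the 'd' modifier, which removes digits only and leaves the blanks between words in place. If "documented" can occur inside a longer word, FINDW may be a better fit than FIND.

```sas
data _null_;
  input str $char80.;
  length new nodigits $80;
  pos = find(str, 'documented');          /* 0 when the word is not found */
  if pos > 0 then new = substrn(str, 1, pos - 1);
  else new = str;                         /* assumption: keep the full text */
  nodigits = compress(new, , 'd');        /* 'd' removes digits only, blanks stay */
  put str= / new= / nodigits=;
cards;
a new business documented with id-123456
oldnew business with id-123456
;;;;
run;
```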

Dummy variable numeric difference in SAS

My code was running fine until I added the last line for age 5+. Does anyone know what's wrong with that line? Thank you.
data Work.File ;
set Work.File;
Female =(Sex ='F');
Male = (Sex ='M');
Age1=(age=1);
Age2=(age=2);
Age3=(age=3);
Age4=(age=4);
Age5+=(age='5+');
run;
SAS variable names have certain restrictions; you can't use a + sign in one. Also, Age should be a numeric variable. You can write the last line as:
Age5Plus=(age>=5);
"Age5+"n=(age>=5);
would also work after setting
options validvarname=any;
but then you have to write the name as a name literal ("Age5+"n) every time you use that variable.

Multiple To clauses in Data step

I have a data step where I have a few columns that need to be tied to one other column.
I have tried using multiple "from" statements and " to" statements and a couple other permutations of that, but nothing seems to do the trick. The code looks something like this:
data analyze;
set css_email_analysis;
from = bill_account_number;
to = customer_number;
output;
from = bill_account_number;
to = email_addr;
output;
from = bill_account_number;
to = e_customer_nm;
output;
run;
I would like to see two columns showing bill accounts in the "from" column, and the other values in the "to", but instead I get a bill account and its customer number, with some "..."'s for the other values.
Issue
This is most likely because SAS has two datatypes and the first time the to variable is set up, it has the value of customer_number. At your second to statement you attempt to set to to have the value of email_addr. Assuming email_addr is a character variable, two things can happen here:
Customer_number is a number - to has already been set up as a number, so SAS cannot force to to become a character, an error like this may appear:
NOTE: Invalid numeric data, 'me@mywebsite.com' , at line 15 column 8.
to=. _ERROR_=1 _N_=1
Customer_number is a character - to has been set up as a character, but without explicitly defining its length, if it happens to be shorter than the value of email_addr then the email address will be truncated. SAS will not show an error if this happens:
Code:
data _NULL_;
to = 'hiya';
to = 'me@mydomain.com';
put to=;
run;
to=me@m
to is set with a length of 4 by the first assignment, and SAS does not expand it to fit the new data.
Detail
The thing to bear in mind here is how SAS works behind the scenes.
The data statement sets up an output location
The set statement adds the variables from the specified dataset to an area of memory called the PDV (program data vector), inheriting their lengths and data types.
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm
============================================================
010101             |            758|me@my.com |John Smith
The first to statement adds another variable, which inherits the characteristics of customer_number:
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm|to
===============================================================
010101             |            758|me@my.com |John Smith   |758
(to is either char length 3 or a numeric)
Subsequent to statements will not alter the characteristics of the variable, and SAS will continue processing.
PDV (if customer_number is character = TRUNCATION):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===============================================================
010101             |            758|me@my.com |John Smith   |me@
PDV (if customer_number is numeric = DATA ERROR, to set to missing):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===============================================================
010101             |            758|me@my.com |John Smith   |.
Resolution
To resolve this issue it's probably easiest to set the length and type of to before your first to statement:
data analyze;
set css_email_analysis;
from = bill_account_number;
length to $200;
to = customer_number;
output;
...
You may get messages like this, where SAS has converted data on your behalf:
NOTE: Numeric values have been converted to character
values at the places given by: (Line):(Column).
27:8
N.B. it's not necessary to explicitly define the length and type of from, because as far as I can see, you only ever get the values for this variable from one variable in the source dataset. You could also achieve this with a rename if you don't need to keep the bill_account_number variable:
rename bill_account_number = from;
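Putting the resolution together, a sketch of the full step might look like the following. The sample row is hypothetical (its values are borrowed from the PDV illustration), customer_number is assumed to be numeric, and the lengths are guesses. cats() is used here as one way to convert the number to character explicitly, which should avoid the automatic conversion NOTE.

```sas
/* Hypothetical one-row input, only so the sketch runs on its own */
data css_email_analysis;
  length bill_account_number $6 customer_number 8
         email_addr $20 e_customer_nm $20;
  bill_account_number = '010101';
  customer_number     = 758;
  email_addr          = 'me@my.com';
  e_customer_nm       = 'John Smith';
run;

data analyze;
  set css_email_analysis;
  length from to $200;                 /* declare type and length before first use */
  from = bill_account_number;
  to = cats(customer_number); output;  /* explicit numeric-to-character conversion */
  to = email_addr;            output;
  to = e_customer_nm;         output;
  keep from to;
run;
```

With the length declared up front, each input row should produce three output rows that share the same from value.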

SAS Raw data importing

I want to import a raw dataset in SAS whose first column starts with "C". But
there is something wrong with my code, as an ERROR keeps popping up in the log window.
Can anybody help me figure it out?
Sample dataset:
H 1095 NJ 06DEC84
C 01DEC11 $45.0
C 01AUG11 $37.5
H 1096 CA 01SEP83
My code:
Filename hca2 'C:\Users\Desktop\SAS\datasets\HCA_file.txt';
Data assign8.hca2;
Infile hca2;
Input @1 FC $1.;
If FC = 'C' then
Input @3 DOB @11 Transaction_Value ;
Run;
The error (which would have been good to add to the question) is likely from the input of DOB without an associated informat. The following may help.
attrib dob informat=date9. format=date9.;
informat is for processing input and format is for output.
A simple INPUT statement will read in some data and immediately skip to the next line.
When processing a single line of data with multiple input statements, the earlier input statements should end with a trailing @ symbol to indicate 'held input', which causes the input processor not to proceed immediately to the next line. Instead, the 'active position' of the input processor remains on the same line, at the last position used for input.
Changing the code as follows will force the input processor to remain on the same line.
input @1 FC $1. @;
Note: The input processor will skip to the next line when the next implicit data step iteration occurs. This means that when your if fails, the next iteration of the data step will read from the next line in the file.
Input can be held across implicit iterations by using two @ symbols (input .... @@;)
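A minimal sketch that combines the informat and the trailing @, using the sample lines from the question as in-stream data so it runs on its own. The comma8. informat for the dollar amount, the truncover option, and the explicit output are assumptions added for illustration; the original answer does not cover them.

```sas
data hca2;
  infile datalines truncover;
  attrib dob informat=date9. format=date9.;
  input @1 FC $1. @;                 /* trailing @ holds the current line */
  if FC = 'C' then do;
    /* same line is still active: read the date and the amount */
    input @3 dob @11 transaction_value comma8.;
    output;                          /* keep only the C records */
  end;
datalines;
H 1095 NJ 06DEC84
C 01DEC11 $45.0
C 01AUG11 $37.5
H 1096 CA 01SEP83
;
run;
```

For the H records the if fails, the hold is released at the end of the iteration, and the next iteration starts on the following line.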

How to add elements to datalines dynamically

I am trying to create a dataset using datalines, based on data that the user inputs. Is there a way to add the values to the datalines dynamically in a stored process? If not, how do I go about doing this?
EDIT: I am getting input from the user as an array of numbers. I want to add a few more fields to form my dataset. So in short, the dataset I am trying to create is a combination of array elements from the user and some more data based on these input numbers.
User inputs: 1234, 2345, 3456
Dataset:
number | text | id
1234 | "Something 1" | 1
2345 | "Something 2" | 2
3456 | "Something 3" | 3
Datalines/cards shouldn't be used in any sort of production system. Among other things, they're illegal in an %include, which is often used in production systems.
I believe the default in a stored process is to return macro variables (based on the name given to the field in the input form). Either directly assign that to a dataset variable, or if it's a lot of text that you want to parse, write the macro variable out to a text file and read it in. More detail in the question would help in suggesting the proper way to handle this for your case.
Given the edits, and assuming you get the value 1234, 2345, 3456 in a macro variable &number, you can do something like this:
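data want;
_numvar = "&number.";
/* one value per comma, so the count is commas + 1 */
do _t = 1 to countc(_numvar,",")+1;
/* comma and blank both act as delimiters, so no leading space is kept */
number = scan(_numvar,_t,", ");
text = catx(" ","Something",substr(number,1,1));
/* input() needs an informat as its second argument */
id = input(substr(number,1,1),1.);
output;
end;
keep number text id;
run;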
I don't know how you're constructing text and id so I just made something up there.