SAS - Text Parsing for Case Sensitive Characters

I'm having some problems cleaning up free-form text strings from a series of notations. The last part of this task involves identifying any names and removing them from the string. Luckily, all names are always in uppercase, and the relevant information is always placed before the name.
My first thought was to use the FIND function to locate where the name starts, then output all characters before that position, but I could not work out how to use a wildcard-like option to get the starting position of ANY capital letter. Sample data and attempts are included below.
DATA SAMPLE;
INPUT TXT $;
CARDS;
firsT
Second
thIrd
foUrth
;
RUN;
Attempt1:
DATA TEST;
SET SAMPLE;
ID = FIND(TXT,'A'-'Z');
RUN;
Attempt2:
DATA TEST;
SET SAMPLE;
ID = FIND(TXT,'A-Z');
RUN;
Clearly the two attempts above are not far from one another, but I could not find (or think of) another approach. I'm hoping that some mysterious function will come to the rescue here...

Assuming I understand what you want to do, you're close - just not doing things the 'SAS' way.
FIND has two siblings, FINDC and FINDW. FINDC searches a string for any character from a list of characters, which sounds like what you want to do. It has a lot of options for adding lists of characters; you can't just give it A-Z, as that would add those three characters, but you can give it the U modifier to add uppercase letters.
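DATA TEST;
SET SAMPLE;
/* Position of the first uppercase letter; 0 if there is none */
_endpos = FINDC(TXT,,'u');
/* SUBSTRN tolerates a zero length, e.g. when the first character is uppercase */
IF _endpos > 0 THEN ID = SUBSTRN(TXT,1,_endpos-1);
ELSE ID = TXT;
RUN;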
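If you prefer a dedicated function, ANYUPPER performs the same search for the first uppercase letter. Below is a minimal sketch, assuming the SAMPLE data set from the question (recreated here so the step runs on its own). TEST_ANYUPPER is just an illustrative name, and keeping the whole string when no uppercase letter is found is an assumption about what you want.

```sas
/* Recreate the sample data from the question */
data sample;
  input txt $;
  cards;
firsT
Second
thIrd
foUrth
;
run;

data test_anyupper;
  set sample;
  /* ANYUPPER returns the position of the first uppercase letter, 0 if none */
  _endpos = anyupper(txt);
  /* Keep everything before that position; SUBSTRN allows a length of 0 */
  if _endpos > 0 then id = substrn(txt, 1, _endpos - 1);
  else id = txt;   /* assumption: no uppercase letter means keep the whole string */
  drop _endpos;
run;

proc print data=test_anyupper;
run;
```

With the sample rows this should give firs, a blank value (Second starts with an uppercase letter), th and fo.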

Related

Is there any function in SAS where we can read the exact value from the variable

Suppose I have a column called ABC, and that variable has data like:
123_112233_66778_1122 or
123_112233_1122_11232 or
1122_112233_66778_123
I want to generate a new variable in the next column with the value 1122. I have a long list of values that I need to check against the column ABC; if an exact match is found, the new variable should be populated. However, I don't want to match something like 112233, because that is not the value I am looking for.
For example, in all three lines given above, I want to pick up only the exact match "1122".
I really have no clue how to solve this. I have tried wildcards without much success. Any help would be much appreciated.

It is hard to tell from your description, but from the values you show it looks like you want the INDEXW() function. That will let you search a string for matching words, with an option to specify which characters are to be considered the separators between words. The result is the position where the word starts within the longer string. When the word is not found, the result is zero.
Let's create a simple example to demonstrate.
data have;
input abc $30. ;
cards;
123_112233_66778_1122
123_112233_1122_11232
1122_112233_66778_123
;
data want;
set have ;
location = indexw(trim(abc),'1122','_');
run;
Note that SAS will consider any value other than zero (or missing) as TRUE so you can just use the INDEXW() function call in a WHERE statement.
data want;
set have;
where indexw(trim(abc),'1122','_');
run;
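To create the new column the question asks for, rather than only locating or filtering, the same test can drive an assignment. This is a small sketch that assumes the HAVE data set above; the variable name MATCH and the data set name WANT_FLAG are made up for illustration.

```sas
data want_flag;
  set have;
  length match $4;
  /* INDEXW only matches whole words between '_' delimiters, so 112233 is ignored */
  if indexw(trim(abc), '1122', '_') > 0 then match = '1122';
run;

proc print data=want_flag;
run;
```

Rows without the exact word 1122 are left with a blank MATCH.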

Use of : when reading multiple records in SAS

I am studying SAS programming and there is one thing that is puzzling me. I tried to look up what colons (:) do in the text book I am using but I could not find anything.
The following program was one of the questions. With the colons the program reads the instream data correctly, but without them the values come out wrong.
I suspect that the Department value is shorter than 12 characters and that is why it is read incorrectly, but with the colon, for some reason, it is read fine.
I appreciate your help.
data a;
input #1 Lname $ Fname $ /
Department : $12. Salary : comma10.;
cards;
ABRAMS THOMAS
SALES $25,209.03
;
run;
proc print;
run;

Have a look at the documentation for the input statement. There is admittedly quite a lot of it, so here's a link to the specific page that deals with this:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000144370.htm
Relevant quote:
:
enables you to specify an informat that the INPUT statement uses to
read the variable value. For a character variable, this format
modifier reads the value from the next non-blank column until the
pointer reaches the next blank column, the defined length of the
variable, or the end of the data line, whichever comes first. For a
numeric variable, this format modifier reads the value from the next
non-blank column until the pointer reaches the next blank column or
the end of the data line, whichever comes first.
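
To see the difference in practice, here is a sketch that reads the same two records with and without the colon modifier. The data set names are made up for illustration, and the salary informat is written as comma10. Without the colon, $12. is formatted input and reads exactly 12 columns, so Department should swallow the start of the salary and Salary is read from whatever is left.

```sas
/* Modified list input: the colon stops reading at the next blank */
data with_colon;
  input Lname $ Fname $ /
        Department : $12. Salary : comma10.;
  cards;
ABRAMS THOMAS
SALES $25,209.03
;
run;

/* Formatted input: $12. always consumes 12 columns, blanks included */
data without_colon;
  input Lname $ Fname $ /
        Department $12. Salary comma10.;
  cards;
ABRAMS THOMAS
SALES $25,209.03
;
run;

proc print data=with_colon;
run;

proc print data=without_colon;
run;
```

Comparing the two printouts shows where the column boundaries fall in each case.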

How does the reverse function in SAS work?

I have a time data field, say, 10/1/2014.
I want to extract the month and the year information dynamically in SAS, given any date.
I wrote the following code in SAS to extract the month info:
month = substr(time_field, 1, index(time_field, '/')-1);
This worked fine.
I wrote the following snippet to extract the year info:
year = substr(reverse(time_field), 1, 4);
This doesn't work; it throws a blank. Have I missed something? Please help.

SAS will return the year for you. No need to write any custom function for this purpose. Look:
data _null_;
length year 4.;
year=year(today());
put "we are on the year of " year;
run;

Your variable has trailing spaces most likely. So when you reverse it, the trailing spaces become leading spaces and then you take the first four characters which are blanks.
You can verify this by running the reverse function alone on the variable and see the results.
Try adding the compress function, and reverse the result again so the digits come back in the right order.
year = reverse(substr(reverse(compress(time_field)), 1, 4));
Though this may solve your problem, you should really convert your date to a SAS date and then use the Month/Day/Year functions.
data have;
length time_field $20.;
time_field="10/1/2014";
year_bad = substr(reverse(time_field),1, 4);
year_good = reverse(substr(reverse(compress(time_field)),1, 4));
year_better = year(input(time_field, mmddyy10.));
put "year_bad:" year_bad;
put "year_good:" year_good;
put "year_better:" year_better;
run;

Your data is either a date stored as text in a character field, or it is a numeric value formatted as a date. While you can use text expressions on numerics, you shouldn't; you should explicitly convert them.
When you don't, then you end up with things like this - ie, improper lengths of fields, because the automatic conversion is very loose. It tends to allow a huge amount of extra space where it's not required to.
If your data is numeric, use MONTH() or YEAR() and be done with it; there's no reason to play in text here. Look at the field in the data explorer; it will tell you if it's numeric or not. (Numeric with a format can still look like text, so actually look at it!)
If your data is text, then you have some better options than REVERSE.
First is SCAN. SCAN splits by word, similar to many other languages; often strsplit (R) or similar.
month=scan(mdy_var,1,'/');
day =scan(mdy_var,2,'/');
year =scan(mdy_var,3,'/');
Second, you could still use SUBSTR, along with LENGTH.
year = substr(mdy_var,length(mdy_var)-3,4);
LENGTH tells you how long the string really is (minus trailing spaces), so '10/1/2014' is 9 long; 6th character (9-3) is the 2, and then 4 characters after that [which should be unnecessary]. This method wouldn't really work with Day, of course, only with year (and only with 4 digit year). Scan is better really, but this is a good example of how this works.
Going along the same lines, you can use FIND and look backwards, also, using a negative start position.
year = substr(mdy_var,find(mdy_var,'/',-99)+1,4);
That starts the search at the 99th character (which is realistically your maximum, right?) and goes left, returning the position of the first '/' it finds.
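The three text-based approaches can be compared side by side. This is a minimal sketch that assumes the date is held in a character variable named mdy_var, as in the snippets above; the result variable names are only illustrative.

```sas
data _null_;
  length mdy_var $20;
  mdy_var = '10/1/2014';
  /* SCAN: third word when splitting on '/' */
  year_scan = scan(mdy_var, 3, '/');
  /* SUBSTR + LENGTH: last four characters, ignoring trailing blanks */
  year_substr = substr(mdy_var, length(mdy_var) - 3, 4);
  /* FIND with a negative start position searches right to left */
  year_find = substr(mdy_var, find(mdy_var, '/', -99) + 1, 4);
  put year_scan= year_substr= year_find=;
run;
```

For this value all three should write 2014 to the log; only SCAN carries over unchanged to the month and day.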

SAS - selecting character observations from position 1 to position 2

I am stuck in this one particular point. I have a character variable with observations extracted from rtf document. I need to keep only the observations from obs A to obs B. The firstobs and obs is not helpful here because we do not know the observation number beforehand. All we know is the two unique strings. For example in the dataset, I need to create a dataset with observations from obs 11 to 16. This is only part of dataset, the original dataset has over 1500 observations, that is why we use unique text to capture instead of observation number.
Thank you all in advance.

You don't explain enough, but odds are you can do something sort of like this if I understand you right (you have a "start" and a "stop" string in the document).
data want;
set have;
retain keep 0;
if strvar = "keepme" then keep=1;
if keep=1;
if strvar = "lastone" then keep=0;
run;
That is, have some condition set the keep variable to 1, then test for it, then put the "off" condition after that (assuming you want to keep the row that triggers the off condition). Use string functions like INDEX, FIND or SCAN to search for your particular text if it is only part of the value rather than the entire string. You could also use regular expressions if necessary.
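When the start and stop text is only part of each value, the same pattern works with FIND. The sketch below uses made-up data and marker strings ('begins' and 'ends'), since the real text is not shown in the question; the 'i' modifier makes the search case-insensitive.

```sas
/* Hypothetical data standing in for the text extracted from the RTF file */
data have;
  input strvar $40.;
  cards;
Header text
Section A begins
row 1
row 2
Section A ends
Footer text
;
run;

data want_partial;
  set have;
  retain keep 0;
  /* Switch on when the start marker appears anywhere in the value */
  if find(strvar, 'begins', 'i') > 0 then keep = 1;
  if keep = 1 then output;
  /* Switch off after writing the row that holds the stop marker */
  if find(strvar, 'ends', 'i') > 0 then keep = 0;
  drop keep;
run;

proc print data=want_partial;
run;
```

This keeps the rows from the start marker through the stop marker, both included.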

SAS: Where statement not working with string value

I'm trying to use PROC FREQ on a subset of my data called dataname. I would like it to include all rows where varname doesn't equal "A.Never Used". I have the following code:
proc freq data=dataname(where=(varname NE 'A.Never Used'));
run;
I thought there might be a problem with trailing or leading blanks so I also tried:
proc freq data=dataname(where=(strip(varname) NE 'A.Never Used'));
run;
My guess is for some reason my string values are not "A.Never Used" but whenever I print the data this is the value I see.

This is a common issue in dealing with string data (and a good reason not to!). You should consider the source of your data - did it come from web forms? Then it probably contains nonbreaking spaces ('A0'x) instead of regular spaces ('20'x). Did it come from a unicode environment (say, Japanese characters are legal)? Then you may have transcoding issues.
A few options that work for a large majority of these problems:
Compress out everything but alphabetic characters: where=(compress(varname,,'ka') ne 'ANeverUsed'), for example. In 'ka', 'k' means 'keep only' and 'a' means 'alphabetic characters'.
Use UPCASE or LOWCASE to ensure you're not running into case issues.
Use put varname $HEX.; in a data step to look at the underlying characters. In a single-byte encoding, each pair of hex digits is one character. 20 is a space (which STRIP would remove at the ends). Sort by varname before doing this so that the rows you think should share this value sit next to each other, then ask what differs between them. It is probably some special character or multibyte character, and it should be apparent here.
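As an illustration of the hex check, the sketch below builds a value containing a nonbreaking space and shows why the comparison fails. The value is constructed by hand purely for demonstration, and it assumes a single-byte session encoding in which 'A0'x is the nonbreaking space.

```sas
data _null_;
  length varname cleaned $12;
  /* Hypothetical value: looks like 'A.Never Used' but hides a nonbreaking space */
  varname = 'A.Never' || 'A0'x || 'Used';
  /* The hex dump shows A0 where a normal space would show 20 */
  put varname= $hex24.;
  if varname ne 'A.Never Used' then put 'No match: hidden character present';
  /* TRANSLATE(source, to, from): replace the nonbreaking space with a space */
  cleaned = translate(varname, ' ', 'A0'x);
  if cleaned = 'A.Never Used' then put 'Match after replacing A0 with a space';
run;
```

If the dump of your own data shows some other unexpected byte, the same TRANSLATE (or COMPRESS) approach can target that byte instead.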
