SAS character comparison - sas

In SAS I have two variables.
NEW: 12345678900
OLD: 2345678900
I need a way to see if the last 10 characters in NEW are equal to the last 10 characters in OLD.
I've tried a variety of things, but it keeps flagging everything even when they aren't equal.

Try
same_10char_tail = 0;
if length(new) > 9 and length(old) > 9 then
if substr(new,length(new)-9) = substr(old,length(old)-9) then
same_10char_tail=1;
The nested if prevents a warning that would occur if the value was less than 10 characters long. Only when both variables are >=10 chars long will substr be happy.
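For anyone who wants to check the logic outside SAS, here is a minimal Python sketch of the same guarded comparison. The function name same_tail is made up for illustration, and rstrip() only approximates the way SAS length() ignores trailing blanks, so treat it as a sketch rather than a translation.

```python
def same_tail(new: str, old: str, n: int = 10) -> bool:
    """Return True when both values have at least n characters and their last n match."""
    # Drop trailing blanks, roughly mirroring how SAS length() ignores them.
    new, old = new.rstrip(), old.rstrip()
    # Guard first: a value shorter than n has no n-character tail to compare.
    if len(new) < n or len(old) < n:
        return False
    return new[-n:] == old[-n:]


print(same_tail("12345678900", "2345678900"))  # values from the question
print(same_tail("abc", "xyz"))
```

The guard plays the same role as the nested if: the slice is only taken once both values are known to be long enough.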

You can reverse the strings and use =:. Or use substrn() to take the last 10 non-blank characters.
data want;
set have;
test1=reverse(trim(old)) =: reverse(trim(new));
test2=substrn(old,length(old)-9) = substrn(new,length(new)-9);
run;
run;
Results:
Obs old new test1 test2
1 2345678900 12345678900 1 1
2 abc xyz 0 0
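One thing worth knowing about the reverse-and-=: trick: the colon modifier compares only as many characters as the shorter value has. The Python sketch below imitates that behaviour (the function name is hypothetical, and rstrip() stands in for trim()), assuming that reading of =: is what you want.

```python
def tails_match_prefix_style(old: str, new: str) -> bool:
    """Reverse both values, then compare only as many characters as the shorter one has."""
    a = old.rstrip()[::-1]
    b = new.rstrip()[::-1]
    shorter = min(len(a), len(b))
    # Truncate both sides to the shorter length before comparing, like =: does.
    return a[:shorter] == b[:shorter]


print(tails_match_prefix_style("2345678900", "12345678900"))
print(tails_match_prefix_style("00", "12345678900"))
```

The second call also reports a match, because "00" is a suffix of the longer value. If the tail must be exactly 10 characters, add a length check on both values.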

Removing Measurement Units from Cell Array

I am trying to remove the units out of a column of cell array data i.e.:
cArray =
time temp
2022-05-10 20:19:43 '167 °F'
2022-05-10 20:19:53 '173 °F'
2022-05-10 20:20:03 '177 °F'
...
2022-06-09 20:18:10 '161 °F'
I have tried str2double but get all NaN.
I have found some info on regexp but don't follow exactly as the example is not the same.
Can anyone help me get the temp column to only read the value i.e.:
cArray =
time temp
2022-05-10 20:19:43 167
2022-05-10 20:19:53 173
2022-05-10 20:20:03 177
...
2022-06-09 20:18:10 161
For some cell array of data
cArray = { ...
1, '123 °F'
2, '234 °F'
3, '345 °F'
};
The easiest option is if we can safely assume the temperature data always starts with numeric values, and you want all of the numeric values. Then we can use regex to match only numbers
temps = regexp( cArray(:,2), '\d+', 'match', 'once' );
The 'match' option makes regexp return the matching text rather than the index of the match, and 'once' means "stop at the first match", so everything after the first run of digits is ignored.
The pattern '\d+' means "one or more digits". To also match a decimal part, use '\d+(\.\d+)?' instead.
Then if you want to actually output numbers, you should use str2double. You could do this in a loop, or use cellfun which is a compact way of achieving the same thing.
temps = cellfun( @str2double, temps, 'uni', 0 ); % 'uni'=0 to retain cell array
Finally you can override the column in cArray
cArray(:,2) = temps;
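The same idea, take the first number and convert it, can be sketched in Python for comparison. The sample rows and the helper name first_number are invented for this example; the pattern is the decimal-aware variant mentioned above.

```python
import re

rows = [[1, "123 °F"], [2, "234 °F"], [3, "345.5 °F"]]  # made-up sample data

# \d+ matches one or more digits; the optional group adds a decimal part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def first_number(text: str) -> float:
    """Return the first number found in text, or NaN when there is none."""
    match = NUMBER.search(text)  # search() stops at the first match, like 'once'
    return float(match.group()) if match else float("nan")


for row in rows:
    row[1] = first_number(row[1])  # overwrite the temperature column in place

print(rows)
```

Returning NaN for text without digits keeps the column numeric, in the same spirit as str2double.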

SAS mainframe replace 20..99 to 200099

In my SAS mainframe code, how do I replace . with 0?
data newlic;
INPUT #1 LICNO $10.;
DATALINES;
203....412
...3300421
9955..032.
;
RUN;
PROC PRINT DATA = NEWLIC;
RUN;
DATA MYDATA;
SET NEWLIC;
ARRAY A(*) _NUMERIC_;
DO I=1 TO DIM(A);
IF A(I) = . THEN A(I) = 0;
END;
DROP I;
RUN;
PROC PRINT DATA = MYDATA;
RUN;
My required output:
2030000412
0003300421
9955000320
The requirement is to replace '.' with 0.
Use a regular expression to replace all non-alphanumeric characters with a 0:
s/[^0-9a-zA-Z]/0/
You can implement regex replacements in SAS with prxchange().
data mydata;
set newlic;
licno = prxchange('s/[^0-9a-zA-Z]/0/', -1, licno);
run;
You can use the TRANSLATE() function to replace unwanted characters with '0'. You can use the COMPRESS() function with d modifier to find any non-digit characters that exist in the value.
fixed=translate(licno,repeat('0',255),compress(licno,,'d'));
Results:
Obs    LICNO         fixed
  1    1234567890    1234567890
  2    ABC      9    0000000009
  3    203....412    2030000412
  4    ...3300421    0003300421
  5    9955..032.    9955000320
  6    123           1230000000
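To see why blanks, including the trailing padding of a short value, also turn into zeros, here is a rough Python analogue of the TRANSLATE/COMPRESS combination. The function name and the fixed width of 10 are assumptions chosen to match the $10. variable in the question.

```python
DIGITS = "0123456789"


def zero_fill_non_digits(value: str, width: int = 10) -> str:
    """Replace every non-digit character, including padding blanks, with '0'."""
    # Pad to a fixed width, like a fixed-length character variable.
    padded = value.ljust(width)[:width]
    # Characters to replace: whatever is left once the digits are removed.
    unwanted = {ch for ch in padded if ch not in DIGITS}
    table = str.maketrans({ch: "0" for ch in unwanted})
    return padded.translate(table)


for sample in ["203....412", "ABC      9", "123"]:
    print(repr(sample), "->", zero_fill_non_digits(sample))
```

The set of unwanted characters is built from the value itself, which is the role COMPRESS() with the d modifier plays in the SAS version.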
You can use the regular expression pattern metacharacter \D to locate non-digit characters and replace them with 0 in a use of PRXCHANGE().
From the complete list in the documentation
\d matches a digit character that is equivalent to [0-9].
\D matches any character that is not a digit.
Example:
data have;
input licno $char10.;
datalines;
1234567890
ABC      9
203....412
...3300421
9955..032.
123
;
data want;
set have;
fixed = prxchange('s/\D/0/', -1, licno);
run;
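For comparison, the \D substitution looks like this in Python. This is only a sketch: the function name is invented, and re.ASCII is passed because Python's \D is Unicode-aware by default, whereas the intent here is "anything except 0-9".

```python
import re


def replace_non_digits(licno: str) -> str:
    """Replace every character that is not 0-9 with '0'."""
    # count defaults to 0, meaning "replace all", like -1 in PRXCHANGE.
    return re.sub(r"\D", "0", licno, flags=re.ASCII)


for sample in ["203....412", "...3300421", "9955..032."]:
    print(sample, "->", replace_non_digits(sample))
```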

Extracting specific words from a single cell containing text string

Basically I have a very long text containing multiple spaces, special characters, etc. in one cell in an Excel file and I need to extract only specific words from it, each one to a separate cell in another column.
What I'm looking for:
symbols that are always 9 characters in length, and always contain at least one number (up to 9).
So for an example in A1 I have:
euhe: djj33 dkdakofja. kaowdk ---------- jffjbrjjjj j jrjj 08/01/2222 999ABC123
fjfjfj 321XXX888 .... ........ 123456789AA
And in the end I want to have:
999ABC123 in B1
and
321XXX888 in B2.
Right now I'm doing this by using Text to columns feature and then just looking for specific words manually but sometimes the volume is so big it takes too much time and would be cool to automate this.
Can anyone help with this? Thank you!
EDIT:
More examples:
INPUT: '10/01/2016 1,060X 8.999%!!! 1.33 0.666 928888XE0'
OUTPUT: '928888XE0'
INPUT: 'ABCDEBATX ..... ,,00,001% 20///^^ addcA7 7777a 123456789 djaoij8888888 0.000001 12#'
OUTPUT: '123456789'
INPUT: 'FAR687465 B22222222 __ djj^66 20/20/20/20 1:'
OUTPUT: 'FAR687465' in B1 'B22222222' in B2
INPUT: 'fil476 .00 20/.. BUT AAAAAAAAA k98776 000.0001'
OUTPUT: 'blank'
To clarify: the 9 character string can be anywhere, there is no rule what is before or after them, they can be next to each other, or just at the beginning and end of this wall of text, no rules here, the text is random, taken out of some system, can contain dates, etc anything... The symbols are always 9 characters long and they are not the only 9 character symbols in the text. I call them symbols but they should only consist of numbers and letters. Can be only numbers, but never only letters. A1 cell can contain multiple spaces/tabs between words/symbols.
Also if possible to do this not only for A1, but the whole column A until it finds the first blank cell.
Try this code
Sub Test()
    Dim r As Range
    Dim matches As Object
    Dim i As Long
    Dim m As Long
    With CreateObject("VBScript.RegExp")
        .Global = True
        ' Whole words made of exactly 9 letters or digits
        .Pattern = "\b[a-zA-Z\d]{9}\b"
        For Each r In Range("A1", Range("A" & Rows.Count).End(xlUp))
            ' Run the regex once per cell and reuse the result
            Set matches = .Execute(r.Value)
            For i = 0 To matches.Count - 1
                ' Keep only matches that contain at least one digit
                If matches(i).Value Like "*[0-9]*" Then
                    ' Next free row in column B (row 1 if B is still empty)
                    m = IIf(Cells(1, 2).Value = "", 1, Cells(Rows.Count, 2).End(xlUp).Row + 1)
                    Cells(m, 2).Value = matches(i).Value
                End If
            Next i
        Next r
    End With
End Sub
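If you want to try the pattern against the sample inputs before running a macro, here is a small Python sketch of the same two-step filter: find whole 9-character alphanumeric tokens, then keep those with at least one digit. The function name is made up; the test strings are the ones from the question.

```python
import re

# Whole "words" of exactly nine ASCII letters or digits.
CANDIDATE = re.compile(r"\b[A-Za-z0-9]{9}\b")


def extract_symbols(text: str) -> list:
    """Return 9-character alphanumeric tokens that contain at least one digit."""
    return [
        token
        for token in CANDIDATE.findall(text)
        if any(ch in "0123456789" for ch in token)
    ]


print(extract_symbols("FAR687465 B22222222 __ djj^66 20/20/20/20 1:"))
print(extract_symbols("fil476 .00 20/.. BUT AAAAAAAAA k98776 000.0001"))
```

The word boundaries are what keep longer runs such as 123456789AA from contributing a 9-character slice.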
Here is a version without regex. It splits A1 on spaces and checks each token. My first attempt crashed on a Str(Outputs(i)) line because Str expects a numeric argument, so the check below uses Like instead.
Sub Test()
    Dim Outputs, i As Long, LastRow As Long
    Outputs = Split(Range("A1").Value, " ")
    For i = 0 To UBound(Outputs)
        ' Keep 9-character tokens made only of letters and digits, with at least one digit
        If Len(Outputs(i)) = 9 And Outputs(i) Like "*[0-9]*" And Not Outputs(i) Like "*[!0-9A-Za-z]*" Then
            LastRow = Range("B10000").End(xlUp).Row + 1
            Cells(LastRow, 2) = Outputs(i)
        End If
    Next i
End Sub
Note that this only reads A1 and only splits on spaces, so tokens separated by tabs will not be found.

Check specific sequence of alphanumeric string in sas

I have data with an ID field that is structured like this:
XX0000X
7 characters total: the first 2 and the last are letters, with 4 digits in between.
How can I check that the ID is structured specifically and exactly like this?
I'm not sure of how to approach checking this - one possibility was the CATs function but not sure how to apply that.
You can use a combination of functions to check this, including:
CHAR()
ANYDIGIT()
ANYALPHA()
data have;
input x $10.;
cards;
AB0000X
AO000BF
1234556
ABCDEFG
AB0123Y
AB
ABCDEFGHI
;
run;
data check;
set have;
flag=0;
if lengthn(x) ne 7 then flag=1;
length letter $1;
if flag=0 then do i=1 to 7;
letter = char(x, i);
if ( i in (1,2, 7) and anyalpha(letter) ne 1 )
or i in (3:6) and anydigit(letter) ne 1 then do;
flag=1;
leave;
end;
end;
run;
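The position-by-position check can be sketched in Python as well, which may make the rule easier to read. The function name is hypothetical, and note that it returns True for a valid ID, whereas flag=1 in the data step above marks an invalid one.

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
DIGITS = "0123456789"


def is_valid_id(value: str) -> bool:
    """Letters in positions 1, 2 and 7; digits in positions 3 to 6."""
    value = value.rstrip()  # ignore trailing blanks from a fixed-length field
    if len(value) != 7:
        return False
    for pos, ch in enumerate(value, start=1):
        allowed = LETTERS if pos in (1, 2, 7) else DIGITS
        if ch not in allowed:
            return False  # stop at the first bad character, like LEAVE
    return True


for sample in ["AB0000X", "AO000BF", "1234556", "AB0123Y", "AB"]:
    print(sample, is_valid_id(sample))
```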
Regular expressions are obviously more succinct and likely a better approach.
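Here is an approach using a regular expression. ^ and $ anchor the pattern so the whole value has to match, [A-Z]{2} matches the first two letters, [0-9]{4} matches the four digits in the middle, [A-Z] matches the last letter, \s* allows the trailing blanks that pad a fixed-length character variable, and the i modifier ignores case. Here flag is 1 for a valid ID and 0 otherwise.
data want;
set have;
flag=prxmatch("/^[A-Z]{2}[0-9]{4}[A-Z]\s*$/i",x);
run;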
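When the ID has to match "exactly", the regex must cover the whole value and not just some part of it. The Python sketch below shows the difference between a partial search and a full match; the names are invented for the example, and the test value XAB0123YZ is mine, not from the question.

```python
import re

ID_PATTERN = re.compile(r"[A-Z]{2}[0-9]{4}[A-Z]", re.IGNORECASE | re.ASCII)


def contains_id(value: str) -> bool:
    """True when the pattern occurs anywhere inside value."""
    return ID_PATTERN.search(value) is not None


def is_exact_id(value: str) -> bool:
    """True only when the whole value (minus trailing blanks) fits the pattern."""
    return ID_PATTERN.fullmatch(value.rstrip()) is not None


print(contains_id("XAB0123YZ"), is_exact_id("XAB0123YZ"))
print(contains_id("AB0123Y"), is_exact_id("AB0123Y"))
```

A 9-character value that merely contains a valid ID passes the partial search but fails the full match, which is the stricter behaviour the question asks for.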

How to pad out character fields in SAS?

I am creating a SAS dataset from a database that includes a VARCHAR(5) key field.
This field includes some entries that use all 5 characters and some that use fewer.
When I import this data, I would prefer to pad all the shorter entries out to use all five characters. For this example, I want to pad on the left with 0, the character zero. So, 114 would become 00114, ABCD would become 0ABCD, and EA222 would stay as it is.
I've attempted this with a simple data statement, but of course the following does not work:
data test;
set databaseinput;
format key $5.;
run;
I've tried to do this with a user-defined informat, but I don't think it's possible to specify the ranges correctly on character fields, per this SAS KB answer. Plus, I'm fairly sure proc format won't let me define the result dynamically in terms of the incoming variable.
I'm sure there's an obvious solution here, but I'm just missing it.
Here is an alternative:
data padded_data_dsn; length key $5;
drop raw_data;
set raw_data_dsn(rename=(key=raw_data));
key = translate(right(raw_data),'0',' ');
run;
data raw_data_dsn;
length key key1 $5;
do key = '4', 'A114', 'A1140';
/* REPEAT(s, n) returns n+1 copies of s, so pad with 4-length(key) */
/* and leave keys that already use all 5 characters unchanged */
if length(key) < 5 then key1 = cats(repeat('0', 4 - length(key)), key);
else key1 = key;
output;
end;
run;
I'm sure someone will have a more elegant solution, but the following code works. Essentially it is padding the variable with five leading zeros, then reversing the order of this text string so that the zeros are to the right, then reversing this text string again and limiting the size to five characters, in the original order but left-padded with zeros.
data raw_data_dsn;
format key $varying5.;
key = '114'; output;
key = 'ABCD'; output;
key = 'EA222'; output;
run;
data padded_data_dsn;
format key $5.;
drop raw_data;
set raw_data_dsn(rename=(key=raw_data));
key = put(put('00000' || raw_data ,$revers10.),$revers5.);
run;
Here's what worked for me.
data b (keep = str2);
format str2 $5. ;
set a;
catlength = 4 - length(str);
cat = repeat('0', catlength);
str2 = catt(cat, str);
run;
It works by measuring the length of the existing string, building a pad string of zeros with repeat('0', 4 - length(str)), and then concatenating the pad and the original string.
Notice that it goes wrong if the original string is already length 5: a zero is still prepended and the last character is cut off (see the last output row).
Also, it won't work if the input string has a $5. format on it.
data a; /*input dataset*/
input str $;
datalines;
a
aa
aaa
aaaa
aaaaa
;
run;
data b (keep = str2);
format str2 $5. ;
set a;
catlength = 4 - length(str);
cat = repeat('0', catlength);
str2 = catt(cat, str);
run;
input:
a
aa
aaa
aaaa
aaaaa
output:
0000a
000aa
00aaa
0aaaa
0aaaa
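The off-by-one in the pad count comes from how REPEAT counts: it returns the original string plus n more copies, so n+1 in total. Here is a small Python sketch of that arithmetic, with a guard added for strings that are already full width. Both function names are hypothetical stand-ins, not SAS functions.

```python
def sas_style_repeat(text: str, n: int) -> str:
    """Stand-in for REPEAT: the original text plus n more copies."""
    return text * (n + 1)


def pad_with_repeat(value: str, width: int = 5) -> str:
    value = value.strip()
    if len(value) >= width:
        return value  # nothing to add, and no negative repeat count
    # width - 1 - len(value) extra copies give width - len(value) zeros in total.
    return sas_style_repeat("0", width - 1 - len(value)) + value


for s in ["a", "aa", "aaa", "aaaa", "aaaaa"]:
    print(s, "->", pad_with_repeat(s))
```

With the guard in place, the five-character input comes back unchanged instead of losing its last character.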
I use this, but it only works with numeric values. Try other informats in the INPUT call.
data work.prueba;
format xx $5.;
xx='1234';
vv=PUT(INPUT(xx,best5.),z5.);
run;