PRXMATCH not working with PRXPARSE function in SAS - regex

I have comments with multiple ids which I need to pull from comments. Each I’d in separate column is required.
Input data has 2 columns- comment_id & Comment(it has 1 or more IDs)
Desired output should have 2 columns: comment_id & ID
I am using following function.
For Parsing
data work.comments_parsed;
set work.comments;
if _N_ = 1 then do;
pasre_id=prxparse("/ab[c|d]?e?\d+/");
end;
retain pasre_id;
start = 1;
stops = length(Comment);
run;
For output generation
data work.desired_output;
set work.comments_parsed;
length ID $ 500;
call prxnext(pasre_id, start, stops, Comment, pos, len);
do while (pos >0);
ID = substr(Comment,pos,len);
output;
call prxnext(pasre_id, start, stops, Comment, pos, len);
end;
run;
ERROR: Argument 1 to the function PRXNEXT must be a positive integer returned by PRXPARSE for a valid pattern.
ERROR: Internal error detected in function PRXNEXT. The DATA step is terminating during the EXECUTION phase.
I believe error is because of incorrect parsing however when I use prxmatch function by using regular expression directly I am getting proper matching. Can you someone suggest me how I can make this code work.
This code works fine
data pattern_testing;
set work.comments_parsed;
pos = prxmatch("/ab[c|d]?e?\d+?/", Comment);
run;
But this code also gives same error:
data pattern_testing;
set work.comments_parsed;
pos = prxmatch(pasre_id,Comment);
run;

Code works when I have parsing and prxnext in same data step.
data work.comments_parsed;
set work.comments;
if _N_ = 1 then pasre_id = prxparse("/ab[c|d]?e?\d+/");
retain pasre_id;
length gen_string $ 500;
call prxnext(pasre_id, start, stops, COMMENT, pos, len);
do while (pos >0);
gen_string = substr(LAST_COMMENT,pos,len);
output;
call prxnext(pasre_id, start, stops, LAST_COMMENT, pos, len);
end;
run;

Related

How to correct this sas function in order to have the jaccard distance?

I created a SAS function using fcmp to calculate the jaccard distance between two strings. I do not want to use macros, as I'm going to use it through a large dataset for multiples variables. the substrings I have are missing others.
proc fcmp outlib=work.functions.func;
function distance_jaccard(string1 $, string2 $);
n = length(string1);
m = length(string2);
ngrams1 = "";
do i = 1 to (n-1);
ngrams1 = cats(ngrams1, substr(string1, i, 2) || '*');
end;
/*ngrams1= ngrams1||'*';*/
put ngrams1=;
ngrams2 = "";
do j = 1 to (m-1);
ngrams2 = cats(ngrams2, substr(string2, j, 2) || '*');
end;
endsub;
options cmplib=(work.functions);
data test;
string1 = "joubrel";
string2 = "farjoubrel";
jaccard_distance = distance_jaccard(string1, string2);
run;
I expected ngrams1 and ngrams2 to contain all the substrings of length 2 instead I got this
ngrams1=jo*ou*ub
ngrams2=fa*ar*rj
If you want real help with your algorithm you need to explain in words what you want to do.
I suspect your problem is that you never defined how long you new character variables NGRAM1 and NGRAM2 should be. From the output you show it appears that FCMP defaulted them to length $8.
To define a variable you need use a LENGTH statement (or an ATTRIB statement with the LENGTH= option) before you start referencing the variable.

Trying to make a new column conditional on whether other columns are empty

I have a dataset which is like this:
new_fish old_fish
1 2
4
3
And I want to make a column called status, where if new_fish is empty call it dead, and if old_fish is empty call it born, and if neither are empty call it alive.
I would want it to look like this:
new_fish old_fish status
1 2 alive
4 dead
3 born
I've tried the following code in sas,
data diff_withclass;
set diff;
if missing(new_fish) then status= 'dead';
if missing(old_fish) then status= 'born';
else status = 'alive';
run;
However, this doesn't work. It just sets status to alive.
ANy suggestions would be great.
You need to use else if. The second if statement is overwriting the first.
data diff_withclass;
set diff;
if missing(new_fish) then status= 'dead';
else if missing(old_fish) then status= 'born';
else status = 'alive';
run;
A useful construction is select, when you're choosing one of a list of options. Sometimes you select a varialbe:
select(var);
when (1) do something; *if var=1 then ... ;
when (2) do something; *else if var=2 then ... ;
otherwise do something; *else ... ;
end;
However, it can be used alone also, as in your case.
data diff_withclass;
set diff;
length status $5;
select;
when (missing(new_fish)) status= 'dead';
when (missing(old_fish)) status= 'born';
otherwise status = 'alive';
end;
run;
Another useful construction is the ifc function. This works like Excel's if, in that it takes an argument that is a boolean (so, the "if" part), if that is true it returns the second argument, and if it is false it returns the third argument. Here we can nest two of them to get your result.
data diff_ifc;
set diff;
length status $5;
status = ifc(missing(new_fish),'dead',ifc(missing(old_fish),'born','alive'));
run;
Finally, there is the good old "boolean arithmetic" solution. Here we create a numeric, and you can then turn that into a character if you prefer, or just use a user-defined format, which is really the optimal way to handle this anyway. This takes advantage of the fact that "true" evaluates to 1 and "false" to 0 in SAS, so we just add up the values, and arbitrarily call Dead the 2 and Born the 1 (could do the opposite, or even have Dead=-1, Born=1, Alive=0, by multiplying by -1 for Dead).
proc format;
value statusf
0 = 'Alive'
1 = 'Born'
2 = 'Dead'
;
quit;
data diff_bool;
set diff;
status = missing(new_fish)*2 + missing(old_fish);
format status statusf.;
run;

How to format dates for use in SAS?

I am trying to adapt Method 4 in this paper to calculate the duration of many observations, but discounting overlapping dates: https://support.sas.com/resources/papers/proceedings/proceedings/sugi31/048-31.pdf
For example, two rows of observations for subject 101 lasting from 2017-03-02 to 2017-03-16 and 2017-03-04 to 2017-03-17 respectively should return a value of only 16 days.
I am getting an error with the dates being 'Invalid numeric data', though, resulting in later errors. I have tried format startdate yyyymmdd10.; and format stopdate yyyymmdd10.; with no success.
Can anyone help me properly format my dates for use here, or identify any further errors?
Edit: Line 80 refers to do xdate = startdate to stopdate;.
I am still unable to convert or create the date variables as numeric/date values. I have used the following code:
data sasuser.Mdm;
set sasuser.Mdm;
do xdate = input(Startdate,yymmdd10.) to input(stopdate,yymmdd10.);
put xdate= yymmdd10.;
output;
end;
run;
To get this output:
1 data sasuser.Mdm;
2 set sasuser.Mdm;
3 do xdate = input(Startdate,yymmdd10.) to input(stopdate,yymmdd10.);
4 put xdate= yymmdd10.;
5 output;
6 end;
7 run;
xdate=2017-03-02
xdate=2017-03-03
xdate=2017-03-04
xdate=2017-03-05
xdate=2017-03-06
xdate=2017-03-07
xdate=2017-03-08
xdate=2017-03-09
xdate=2017-03-10
xdate=2017-03-11
xdate=2017-03-12
xdate=2017-03-13
xdate=2017-03-14
xdate=2017-03-15
xdate=2017-03-16
xdate=2017-03-04
xdate=2017-03-05
xdate=2017-03-06
xdate=2017-03-07
xdate=2017-03-08
xdate=2017-03-09
xdate=2017-03-10
xdate=2017-03-11
xdate=2017-03-12
xdate=2017-03-13
xdate=2017-03-14
xdate=2017-03-15
xdate=2017-03-16
xdate=2017-03-17
xdate=2017-03-07
xdate=2017-03-08
xdate=2017-03-09
xdate=2017-03-10
xdate=2017-03-11
xdate=2017-03-12
xdate=2017-03-13
xdate=2017-03-14
xdate=2017-03-15
xdate=2017-03-16
xdate=2017-03-17
xdate=2017-03-18
xdate=2017-03-19
xdate=2017-03-20
xdate=2017-03-21
xdate=2017-02-08
xdate=2017-02-09
xdate=2017-02-10
xdate=2017-02-11
xdate=2017-02-12
xdate=2017-02-13
xdate=2017-02-14
xdate=2017-02-15
xdate=2017-02-16
xdate=2017-02-17
xdate=2017-02-18
xdate=2017-02-19
xdate=2017-02-20
xdate=2017-02-21
xdate=2017-02-22
xdate=2017-02-23
xdate=2017-02-24
xdate=2017-02-23
xdate=2017-02-24
xdate=2017-02-25
xdate=2017-02-26
xdate=2017-02-27
xdate=2017-02-28
xdate=2017-03-01
xdate=2017-03-02
xdate=2017-03-03
xdate=2017-03-04
xdate=2017-03-05
xdate=2017-03-06
xdate=2017-03-07
xdate=2017-03-08
xdate=2017-02-26
xdate=2017-02-28
xdate=2017-03-13
xdate=2017-03-17
xdate=2017-03-25
xdate=2017-03-28
xdate=2017-03-23
xdate=2017-03-24
xdate=2017-03-25
xdate=2017-03-26
xdate=2017-03-27
xdate=2017-03-28
xdate=2017-03-29
xdate=2017-03-30
xdate=2017-03-29
xdate=2017-04-03
xdate=2017-04-04
xdate=2017-04-03
xdate=2017-04-04
xdate=2017-04-05
xdate=2017-04-05
xdate=2017-04-06
xdate=2017-04-06
xdate=2017-04-07
xdate=2017-03-25
xdate=2017-03-26
xdate=2017-03-30
xdate=2017-04-01
xdate=2017-04-02
xdate=2017-04-03
xdate=2017-04-04
xdate=2017-04-08
xdate=2017-04-09
xdate=2017-04-10
xdate=2017-04-11
xdate=2017-04-12
xdate=2017-04-12
xdate=2017-04-13
xdate=2017-04-13
xdate=2017-04-14
xdate=2017-04-15
xdate=2017-04-16
xdate=2017-04-17
xdate=2017-04-18
xdate=2017-04-19
xdate=2017-04-20
xdate=2017-04-21
xdate=2017-04-22
xdate=2017-04-19
xdate=2017-04-23
xdate=2017-04-24
xdate=2017-04-25
xdate=2017-04-26
xdate=2017-04-26
xdate=2017-04-27
xdate=2017-04-28
xdate=2017-05-05
xdate=2017-05-06
xdate=2017-05-16
xdate=2017-05-19
xdate=2017-05-20
xdate=2017-05-21
xdate=2017-05-22
xdate=2017-05-19
xdate=2017-05-20
xdate=2017-05-21
xdate=2017-05-22
xdate=2017-05-23
xdate=2017-05-24
xdate=2017-05-25
xdate=2017-05-26
xdate=2017-05-22
xdate=2017-05-23
xdate=2017-05-24
xdate=2017-05-25
xdate=2017-05-26
xdate=2017-05-27
xdate=2017-05-28
xdate=2017-05-29
xdate=2017-05-30
xdate=2017-05-31
xdate=2017-06-01
xdate=2017-06-02
xdate=2017-06-03
xdate=2017-06-04
xdate=2017-06-05
xdate=2017-06-06
xdate=2017-06-07
xdate=2017-06-08
xdate=2017-06-09
xdate=2017-06-10
xdate=2017-06-11
xdate=2017-06-12
xdate=2017-06-13
xdate=2017-06-14
xdate=2017-06-15
xdate=2017-06-16
xdate=2017-06-17
xdate=2017-06-18
xdate=2017-06-19
xdate=2017-06-20
xdate=2017-06-21
xdate=2017-06-22
xdate=2017-06-23
xdate=2017-06-24
xdate=2017-06-25
xdate=2017-06-26
xdate=2017-06-27
xdate=2017-06-28
xdate=2017-06-29
xdate=2017-06-30
xdate=2017-07-01
xdate=2017-07-02
xdate=2017-07-03
xdate=2017-07-04
xdate=2017-07-05
xdate=2017-07-06
xdate=2017-07-07
xdate=2017-07-08
xdate=2017-07-09
xdate=2017-07-10
xdate=2017-07-11
xdate=2017-07-12
xdate=2017-07-13
xdate=2017-07-14
xdate=2017-07-15
xdate=2017-07-16
xdate=2017-07-17
xdate=2017-07-18
xdate=2017-07-19
xdate=2017-07-20
xdate=2017-07-21
xdate=2017-07-22
xdate=2017-07-23
xdate=2017-07-24
xdate=2017-07-25
xdate=2017-07-26
xdate=2017-07-27
xdate=2017-07-28
xdate=2017-07-29
xdate=2017-07-30
xdate=2017-07-31
xdate=2017-08-01
xdate=2017-08-02
xdate=2017-08-03
xdate=2017-08-04
xdate=2017-08-05
xdate=2017-08-06
xdate=2017-08-07
xdate=2017-08-08
xdate=2017-08-09
xdate=2017-08-10
xdate=2017-08-11
xdate=2017-08-12
xdate=2017-08-13
xdate=2017-08-14
xdate=2017-08-15
xdate=2017-08-16
xdate=2017-08-17
xdate=2017-08-18
xdate=2017-08-19
xdate=2017-08-20
xdate=2017-08-21
xdate=2017-08-22
xdate=2017-08-23
xdate=2017-08-24
xdate=2017-08-25
xdate=2017-08-26
xdate=2017-08-27
xdate=2017-08-28
xdate=2017-08-29
xdate=2017-08-30
xdate=2017-08-31
xdate=2017-09-01
xdate=2017-05-27
xdate=2017-05-28
xdate=2017-05-29
xdate=2017-05-30
xdate=2017-05-31
xdate=2017-06-01
xdate=2017-06-02
xdate=2017-06-03
xdate=2017-06-04
xdate=2017-06-05
xdate=2017-06-06
xdate=2017-06-07
xdate=2017-06-08
xdate=2017-06-09
xdate=2017-06-10
xdate=2017-06-11
xdate=2017-06-12
xdate=2017-06-13
xdate=2017-06-14
xdate=2017-06-15
xdate=2017-06-16
xdate=2017-06-17
xdate=2017-06-18
xdate=2017-06-19
xdate=2017-06-20
xdate=2017-06-21
xdate=2017-06-22
xdate=2017-06-23
xdate=2017-06-24
xdate=2017-06-25
xdate=2017-06-26
xdate=2017-06-27
xdate=2017-06-28
xdate=2017-06-29
xdate=2017-06-30
xdate=2017-07-01
xdate=2017-07-02
xdate=2017-07-03
xdate=2017-07-04
xdate=2017-07-05
xdate=2017-07-06
xdate=2017-07-07
xdate=2017-07-08
xdate=2017-07-09
xdate=2017-07-10
xdate=2017-07-11
xdate=2017-07-12
xdate=2017-07-13
xdate=2017-07-14
xdate=2017-07-15
xdate=2017-07-16
xdate=2017-07-17
xdate=2017-07-18
xdate=2017-07-19
xdate=2017-07-20
xdate=2017-07-21
xdate=2017-07-22
xdate=2017-07-23
xdate=2017-07-24
xdate=2017-07-25
xdate=2017-07-26
xdate=2017-07-27
xdate=2017-07-28
xdate=2017-07-29
xdate=2017-07-30
xdate=2017-07-31
xdate=2017-08-01
xdate=2017-08-02
xdate=2017-08-03
xdate=2017-08-04
xdate=2017-08-05
xdate=2017-08-06
xdate=2017-08-07
xdate=2017-08-08
xdate=2017-08-09
xdate=2017-08-10
xdate=2017-08-11
xdate=2017-08-12
xdate=2017-08-13
xdate=2017-08-14
xdate=2017-08-15
xdate=2017-08-16
xdate=2017-08-17
xdate=2017-08-18
xdate=2017-08-19
xdate=2017-08-20
xdate=2017-08-21
xdate=2017-08-22
xdate=2017-08-23
xdate=2017-08-24
xdate=2017-08-25
xdate=2017-08-26
xdate=2017-08-27
xdate=2017-08-28
xdate=2017-08-29
xdate=2017-08-30
xdate=2017-08-31
xdate=2017-09-01
xdate=2017-06-14
xdate=2017-06-15
xdate=2017-06-16
xdate=2017-06-17
xdate=2017-06-18
xdate=2017-06-19
xdate=2017-06-20
xdate=2017-06-21
xdate=2017-06-22
xdate=2017-06-23
xdate=2017-06-24
xdate=2017-06-25
xdate=2017-06-26
xdate=2017-06-27
xdate=2017-06-28
xdate=2017-06-29
xdate=2017-06-14
xdate=2017-06-15
xdate=2017-06-16
xdate=2017-06-17
xdate=2017-06-18
xdate=2017-06-19
xdate=2017-06-20
xdate=2017-06-21
xdate=2017-06-22
xdate=2017-06-23
xdate=2017-06-24
xdate=2017-06-25
xdate=2017-06-26
xdate=2017-06-27
xdate=2017-06-28
xdate=2017-06-29
xdate=2017-03-27
xdate=2017-04-02
xdate=2017-04-07
xdate=2017-04-08
xdate=2017-04-09
xdate=2017-04-13
xdate=2017-04-14
xdate=2017-04-15
xdate=2017-04-16
xdate=2017-04-17
xdate=2017-04-19
xdate=2017-04-20
xdate=2017-04-21
xdate=2017-04-22
xdate=2017-04-23
xdate=2017-04-24
xdate=2017-04-20
xdate=2017-04-21
xdate=2017-04-22
xdate=2017-04-23
xdate=2017-04-24
xdate=2017-04-25
xdate=2017-04-26
xdate=2017-04-27
xdate=2017-04-28
xdate=2017-04-29
xdate=2017-04-30
xdate=2017-05-01
xdate=2017-05-02
xdate=2017-04-24
xdate=2017-04-25
xdate=2017-04-26
xdate=2017-04-27
xdate=2017-04-28
xdate=2017-04-29
xdate=2017-04-30
xdate=2017-05-01
xdate=2017-05-02
xdate=2017-05-03
xdate=2017-05-04
xdate=2017-05-05
xdate=2017-05-06
xdate=2017-05-07
xdate=2017-05-08
xdate=2017-05-09
xdate=2017-05-10
xdate=2017-05-11
xdate=2017-05-12
xdate=2017-05-13
xdate=2017-05-14
xdate=2017-05-15
xdate=2017-05-16
ERROR: Invalid DO loop control information, either the INITIAL or TO expression is missing or the
BY expression is missing, zero, or invalid.
SUBJID=106 KEY=106-9 OBS=9 TOTAL=12 STARTDATE=2017-04-25 STOPDATE= CLASS=Steroid / Diuretic
xdate=20934 _ERROR_=1 _N_=52
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 52 observations read from the data set SASUSER.MDM.
WARNING: The data set SASUSER.MDM may be incomplete. When this step was stopped there were 431
observations and 8 variables.
WARNING: Data set SASUSER.MDM was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.38 seconds
cpu time 0.29 seconds```
I don't understand why input doesn't appear to be working. Dates are still listed as character strings under column attributes. The do part also isn't working as intended. I'd be grateful for any further guidance.
Do not use the same name in the DATA and SET statement. Then you're always having to rebuild from the start.
Convert your start and stop date to SAS dates
Remove PUT
Add formats to see them displayed as desired
Drop old variables to avoid confusion.
Your two code steps, the data step and SQL do not appear related. Not sure why you would even need a list of dates for intervals or anything. There are much better ways to calculate an overlap. I think you're putting us through an xy problem where it would be significantly easier to show us what you're attempting to do and people would be able to provide a much better solution.
data sasuser.Mdm2; /*1*/
set sasuser.Mdm;
/*2*/
start_date = input(startdate, yymmdd10.);
end_date = input(stopdate, yymmdd10.);
do xdate = start_date to stop_date;
output; /*3*/
end;
/*4*/
format start_date end_date xDate yymmdd10.;
/*5*/
drop startdate stopdate;
run;
*check;
proc contents data=sasuser.mdm2;
run;
EDIT: Also, if you had some sort of grouping variable to indicate that these were part of the same episode you could then just take the min/max of the dates and subtract them to get the interval duration for starters. Grouping via a data step is trivial.
data want;
set have;
by id;
retain episode;
start_date = input(start_date, yymmdd10.);
end_date = input(stopdate, yymmdd10.);
prev_stop_date = lag(stopDate);
if first.id then do;
episode = 0;
call missing(prev_stop_date);
end;
if not (start_date <=prev_stop_date <= end_date) then episode+1;
*could add in logic to calculate dates and durations as well depending....;
run;
It sounds like your SAS log is complaining about this statement.
do xdate=startdate to stopdate;
Because STARTDATE and STOPDATE are character strings instead of dates.
Make sure to create your date values as dates instead of character strings.
Tom's correct, of course, the startdate and stopdate seem to be characters.
To properly use this, do something like this (only the do loop is relevant for you, the rest is to show it working):
data _null_;
startdate = '2017-03-02';
stopdate = '2017-03-16';
do xdate = input(Startdate,yymmdd10.) to input(stopdate,yymmdd10.);
put xdate= yymmdd10.; *just put to the log to see what you are getting;
end;
run;
input will convert the text to a numeric value. Do realize you have to format that xdate as a date format if you want to be able to view it - if you're just using it as an input, though, you can leave the formatting off.

Is there a SAS function to delete negative and missing values from a variable in a dataset?

Variable name is PRC. This is what I have so far. First block to delete negative values. Second block is to delete missing values.
data work.crspselected;
set work.crspraw;
where crspyear=2016;
if (PRC < 0)
then delete;
where ticker = 'SKYW';
run;
data work.crspselected;
set work.crspraw;
where ticker = 'SKYW';
where crspyear=2016;
where=(PRC ne .) ;
run;
Instead of using a function to remove negative and missing values, it can be done more simply when inputting or outputting the data. It can also be done with only one data step:
data work.crspselected;
set work.crspraw(where = (PRC >= 0 & PRC ^= .)); * delete values that are negative and missing;
where crspyear = 2016;
where ticker = 'SKYW';
run;
The section that does it is:
(where = (PRC >= 0 & PRC ^= .))
Which can be done for either the input dataset (work.crspraw) or the output dataset (work.crspselected).
If you must use a function, then the function missing() includes only missing values as per this answer. Hence ^missing() would do the opposite and include only non-missing values. There is not a function for non-negative values. But I think it's easier and quicker to do both together simultaneously without a function.
You don't need more than your first test to remove negative and missing values. SAS treats all 28 missing values (., ._, .A ... .Z) as less than any actual number.

Expecting an arithemetic operator in SAS?

I got the following error:
I don't understand my error please help.
you need space between do and _i_ (index variable) as shown below. as you have it as do_i_. your warning also gives a clue about this.
data RV2;
retain _seed_ 0;
n=20;
p=0.6;
do _i_ = 1 to 100;
binorm1= ranbin(_seed_,n, p);
output;
end;
drop _seed_ _i_;
run;