I'm trying to find out how to make a regular expression to match all text between the first "BEGIN" and the last "END" of a procedure block.
Here's the text which I want to filter:
PROCEDURE MyFirstFunction()#12345
VAR
TESTVAR#1 : Record 1;
TESTVAR#2 : Record 2;
BEGIN
// Here begins the code
IF 1 = 1 THEN BEGIN
IF 2 <> 1 THEN BEGIN
MESSAGE('2 is not equal to 1');
END;
MESSAGE('1 is equal to 1');
END;
END;
PROCEDURE MySecondFunction()#123456
VAR
TESTVAR#1 : Record 1;
TESTVAR#2 : Record 2;
BEGIN
// Here begins the code
IF 1 = 1 THEN BEGIN
IF 2 <> 1 THEN BEGIN
MESSAGE('2 is not equal to 1');
END;
MESSAGE('1 is equal to 1');
END;
END;
PROCEDURE MyThirdFunction()#123457
VAR
TESTVAR#1 : Record 1;
TESTVAR#2 : Record 2;
BEGIN
// Here begins the code
IF 1 = 1 THEN BEGIN
IF 2 <> 1 THEN BEGIN
MESSAGE('2 is not equal to 1');
END;
MESSAGE('1 is equal to 1');
END;
END;
I already tried it with a recursive regular expression, but this didn't work.
Here's the regular expression I worked on:
BEGIN(((?!BEGIN|END;).)|(?R))*END;
But the match only starts at the second BEGIN of the first function.
Here's the link to regex101.com to test the regular expression:
https://regex101.com/r/ZoBm6h/1
I think the logic you want is this: after BEGIN, greedily consume everything up to the last END; of the block, with a negative lookahead checking at every character that the text PROCEDURE does not start there. Seeing PROCEDURE would mean the match has gone too far and entered the next procedure block. Because each block spans several lines, the dot also has to match newlines (the s / single-line flag).
BEGIN((?!PROCEDURE).)*END;
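The same pattern can be tried outside an online tester. Below is a minimal sketch in Python (my choice of language, not the asker's). The sample text is a shortened, hypothetical two-procedure version of the input, and re.DOTALL plays the role of the s flag so that the dot crosses line breaks.

```python
import re

# Shortened, hypothetical sample modelled on the input in the question.
SAMPLE = """PROCEDURE MyFirstFunction()#12345
VAR
TESTVAR#1 : Record 1;
BEGIN
IF 1 = 1 THEN BEGIN
MESSAGE('1 is equal to 1');
END;
END;
PROCEDURE MySecondFunction()#123456
BEGIN
IF 2 <> 1 THEN BEGIN
MESSAGE('2 is not equal to 1');
END;
END;
"""

# Greedy match that is never allowed to step onto the word PROCEDURE,
# so it backtracks to the last "END;" inside the current procedure.
BLOCK = re.compile(r"BEGIN(?:(?!PROCEDURE).)*END;", re.DOTALL)

blocks = BLOCK.findall(SAMPLE)
for block in blocks:
    print(block)
    print("-----")
```

Each element of blocks is one procedure body, from its first BEGIN to its last END;. Note that a comment or string literal containing the word PROCEDURE would cut a block short, so this is only as reliable as that assumption about the source.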
If you want to match all blocks, you can also use this regex:
BEGIN((?!^(?!PROCEDURE)$).)*END
I am following a published method to identify matched cases, but I am getting the following error:
ERROR: No matching %MACRO statement for this %MEND statement.
WARNING: Apparent invocation of macro MATCH not resolved.
137 %MEND MATCH;
138
139 %MATCH (g.ps_match,Match4,scase4,scontrol4, abuser, 0.0001);
_
180
ERROR 180-322: Statement is not valid or it is used out of proper order.
How do I correctly call the macro?
I am using SAS University Edition.
The method is from
http://www2.sas.com/proceedings/sugi25/25/po/25p225.pdf
Part 2: Perform the Match
The next part of the macro program performs the match and
outputs the matched pairs. First, the cases data set is
selected. Curob is used to keep track of the current case.
Matchto is used to identify matched pairs of cases and
controls. Start and oldi are initialized to control processing of
the controls data set DO loop.
data &lib..&matched.
(drop=Cmatch randnum aprob cprob start
oldi curctrl matched);
set &lib..&SCase. ;
curob + 1;
matchto = curob;
if curob = 1 then do;
start = 1;
oldi = 1;
end;
Next, the controls data set is selected. Processing starts at
the first unmatched observation. The data set is searched
until a match is found, or it is determined no match can be
made. Error checking is performed to avoid an infinite loop.
Curctrl is used to keep track of current control.
DO i = start to n;
set &lib..&Scontrol. point = i nobs = n;
if i gt n then goto startovr;
if _Error_ = 1 then abort;
curctrl = i;
If the propensity score of the current case (aprob) matches the
propensity score of the current control (cprob), then a match
was found. Update Cmatch to 1=Yes. Output the control.
Update matched to keep track of last matched control. Exit
the DO loop. If the propensity score of the current control is
greater than the propensity score of the current case, then no
match will be found for the current case. Stop the DO loop
processing.
if aprob = cprob then
do;
Cmatch = 1;
output &lib..&matched.;
matched = curctrl;
goto found;
end;
else if cprob gt aprob then
goto nextcase;
startovr: if i gt n then
goto nextcase;
END;
/* end of DO LOOP */
nextcase:
if Cmatch=0 then start = oldi;
found:
if Cmatch = 1 then do;
oldi = matched + 1;
start = matched + 1;
set &lib..&SCase. point = curob;
output &lib..&matched.;
end;
retain oldi start;
if _Error_=1 then _Error_=0;
run;
%MEND MATCH;
MACRO MATCH CALL STATEMENT
The following are call statements to the macro
program MATCH. The first performs a 4-digit match;
the second performs a 3-digit match.
%MATCH(STUDY,Propen,Match4,SCase4,
SContrl4,Interven,.0001);
%MATCH(STUDY,Propen,Match3,SCase3,
SContrl3,Interven,.001);
Presumably you didn't include the beginning of the macro (i.e., the %MACRO MATCH(... portion, earlier in the paper). This is a macro, so it isn't intended to be run in pieces the way the paper presents it: you need to submit all of the code from %MACRO MATCH through %MEND, and only then the calls.
I'm trying to find a function that will return the position of the nth instance of a character or substring.
For example, if I have the string ABABABBABSSSDDEE and I want to find the 3rd instance of A, how do I do that? What if I want to find the 4th instance of AB?
ABABABBABSSSDDEE
data HAVE;
input STRING :$20.;
datalines;
ABABABBABSSSDDEE
;
RUN;
Here is a much simplified way to find the n-th instance of a group of characters in a SAS character string using the SAS find() function:
data a;
s='AB bhdf +BA s Ab fs ABC Nfm AB ';
x='AB';
n=3;
/* from left to right */
p = 0;
do i=1 to n until(p=0);
p = find(s, x, p+1);
end;
put p=;
/* from right to left */
p = length(s) + 1;
do i=1 to n until(p=0);
p = find(s, x, -p+1);
end;
put p=;
run;
As you can see, it allows for both left-to-right and right-to-left searches.
You can combine these two into a SAS user-defined function (a negative n indicates a right-to-left search, just as a negative start position does in the find function):
proc fcmp outlib=sasuser.functions.findnth;
function findnth(str $, sub $, n);
p = ifn(n>=0,0,length(str)+1);
do i=1 to abs(n) until(p=0);
p = find(str,sub,sign(n)*p+1);
end;
return (p);
endsub;
run;
Note that the above solutions with FIND() and FINDNTH() assume that the searched substring can overlap with its prior instance. For example, if we search for the substring ‘AAA’ within the string ‘ABAAAA’, the first instance of ‘AAA’ is found at position 3 and the second at position 4; that is, the two instances overlap. For that reason, when we find an instance we increment the position p by 1 (p+1) to start the next iteration of the search.
However, if such overlapping is not valid in your searches and you want to continue searching after the end of the previous instance, increment p not by 1 but by the length of the substring x. That also speeds up the search (the longer x is, the greater the gain), since more characters are skipped on each step through the string s. In this case, replace p+1 with p+w in the search code, where w=length(x).
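The difference between the two step sizes is easy to see in a small sketch. The following is Python rather than SAS, and find_nth and its overlap parameter are hypothetical names chosen for illustration; positions are reported 1-based with 0 meaning "not found", to mirror the SAS find() convention.

```python
def find_nth(s, sub, n, overlap=True):
    """Return the 1-based position of the n-th instance of sub in s, or 0."""
    # Step 1 allows overlapping instances; stepping by len(sub) forbids them.
    step = 1 if overlap else len(sub)
    start = 0   # 0-based index where the next search begins
    pos = -1    # 0-based index of the most recent hit
    for _ in range(n):
        pos = s.find(sub, start)
        if pos < 0:
            return 0
        start = pos + step
    return pos + 1


print(find_nth("ABAAAA", "AAA", 2))                  # overlapping search
print(find_nth("ABAAAA", "AAA", 2, overlap=False))   # non-overlapping search
print(find_nth("ABABABBABSSSDDEE", "AB", 4))
```

With overlapping allowed, the second ‘AAA’ is reported at position 4; with overlap=False there is no second instance, so the result is 0.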
A detailed discussion of this problem is in my recent SAS blog post, Finding n-th instance of a substring within a string. I also found that the find() function works considerably faster than the regular expression functions in SAS.
I realize I'm late to the party here, but in the interest of adding to the collection of answers, here's what I've come up with.
DATA test;
input = "ABABABBABSSSDDEE";
A_3 = find(prxchange("s/A/#/", 2, input), "A");
AB_4 = find(prxchange("s/AB/##/", 3, input), "AB");
RUN;
Breaking it down, prxchange() just does a pattern matching replacement, but the great thing about it is that you can tell it how many times to replace that pattern. So, prxchange("s/A/#/", 2, input) replaces the first two A's in input with #. Once you've replaced the first two A's, you can wrap it in a find() function to find the "first A", which is actually the third A of the original string.
One thing to note about this approach is that, ideally, the replacement string should be the same length as the string you're replacing. For instance, notice the difference between
find(prxchange("s/AB/##/", 3, input), "AB") /* gives 8 (correct) */
and
find(prxchange("s/AB/#/", 3, input), "AB") /* gives 5 (incorrect) */
That's because we've replaced a string of length 2 with a string of length 1 three times. In other words:
(length("#") - length("AB")) * 3 = -3
so 8 + (-3) = 5.
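The shift can be reproduced outside SAS. This is a rough Python sketch of the same masking trick, not the SAS code itself; nth_by_masking is a hypothetical helper name, and re.sub with count stands in for prxchange with a replacement limit.

```python
import re

s = "ABABABBABSSSDDEE"


def nth_by_masking(text, sub, n, mask):
    """Mask the first n-1 instances of sub, then find the next one (1-based)."""
    # count=0 would mean "replace all" in re.sub, so n == 1 is handled apart.
    masked = text if n == 1 else re.sub(re.escape(sub), mask, text, count=n - 1)
    return masked.find(sub) + 1  # 0 means not found


same_length = nth_by_masking(s, "AB", 4, "##")  # mask as long as the substring
shorter = nth_by_masking(s, "AB", 4, "#")       # mask one character shorter

# Each of the 3 replacements removed (len("AB") - len("#")) characters.
corrected = shorter + (len("AB") - len("#")) * 3

print(same_length, shorter, corrected)
```

When the mask is shorter than the substring, adding back the lost characters recovers the true position, which is the arithmetic described above.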
Hopefully that helps someone out there!
data _null_;
findThis = 'A'; *** substring to find;
findIn = 'ADABAACABAAE'; **** the string to search;
instanceOf=1; *** and the instance of the substring we want to find;
pos = 0;
len = 0;
startHere = 1;
endAt = length(findIn);
n = 0; *** count occurrences of the pattern;
pattern = '/' || findThis || '/';
rx = prxparse(pattern);
CALL PRXNEXT(rx, startHere, endAt, findIn, pos, len);
if pos le 0 then do;
put 'Could not find ' findThis ' in ' findIn;
end;
else do while (pos gt 0);
n+1;
if n eq instanceOf then leave;
CALL PRXNEXT(rx, startHere, endAt, findIn, pos, len);
end;
if n eq instanceOf then do;
put 'found ' instanceOf 'th instance of ' findThis ' at position ' pos ' in ' findIn;
end;
else do;
put 'No ' instanceOf 'th instance of ' findThis ' found';
end;
run;
Here is a solution using the find() function and a do loop within a DATA step. I then take that code and place it into proc fcmp to create my own function called find_n(). This should greatly simplify whatever task needs it and allows for code reuse.
Define the data:
data have;
length string $50;
input string $;
datalines;
ABABABBABSSSDDEE
;
run;
Do-loop solution:
data want;
set have;
search_term = 'AB';
nth_time = 4;
counter = 0;
last_find = 0;
start = 1;
pos = find(string,search_term,'',start);
do while (pos gt 0 and nth_time gt counter);
last_find = pos;
start = pos + 1;
counter = counter + 1;
pos = find(string,search_term,'',start); /* start is already pos + 1; adding 1 again would skip adjacent single-character matches */
end;
if nth_time eq counter then do;
put "The nth occurrence was found at position " last_find;
end;
else do;
put "Could not find the nth occurrence";
end;
run;
Define the proc fcmp function:
Note: if the nth occurrence cannot be found, the function returns 0.
options cmplib=work.temp.temp;
proc fcmp outlib=work.temp.temp;
function find_n(string $, search_term $, nth_time) ;
counter = 0;
last_find = 0;
start = 1;
pos = find(string,search_term,'',start);
do while (pos gt 0 and nth_time gt counter);
last_find = pos;
start = pos + 1;
counter = counter + 1;
pos = find(string,search_term,'',start); /* start is already pos + 1; adding 1 again would skip adjacent single-character matches */
end;
result = ifn(nth_time eq counter, last_find, 0);
return (result);
endsub;
run;
Example proc fcmp usage:
Note that this calls the function twice. The first example is showing the original request solution. The second example shows what happens when a match cannot be found.
data want;
set have;
nth_position = find_n(string, "AB", 4);
put nth_position =;
nth_position = find_n(string, "AB", 5);
put nth_position =;
run;
Here is what I am trying to do, to make it clearer. I just want to find a way to get the substring and put it into a variable:
DECLARE
v_file_type thufitab.file_type%TYPE;
v_filename thufitab.filename%TYPE;
v_status thufitab.status%TYPE;
V_seq_FILENAME NUMBER (4);
CURSOR List_FILENAME_cur
IS
SELECT FILENAME
FROM thufitab
WHERE status = 2 AND ROWNUM <= 100;
BEGIN
FOR List_FILENAME_rec IN List_FILENAME_cur
LOOP
-- Use the current cursor row; selecting from the whole table would return many rows
SELECT REGEXP_SUBSTR (List_FILENAME_rec.FILENAME, '([1-9][0-9]{0,3})')
INTO V_seq_FILENAME
FROM DUAL;
DBMS_OUTPUT.PUT_LINE (V_seq_FILENAME);
END LOOP;
END;
I'm not sure I understand correctly, but would this work for you?
'CDR-([1-9][0-9]{0,3})_[0-9]{2}_[0-9]{2}_[0-9]{2}_[0-9]{4}_UK1\.FCDR'
     ^_______________^
          group 1
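To show what "group 1" buys you, here is a minimal sketch in Python (not PL/SQL). The file name used is a made-up example that fits the pattern, and sequence_number is a hypothetical helper name.

```python
import re

# Same pattern as above; only group 1 (the sequence number) is captured.
PATTERN = re.compile(
    r"CDR-([1-9][0-9]{0,3})_[0-9]{2}_[0-9]{2}_[0-9]{2}_[0-9]{4}_UK1\.FCDR"
)


def sequence_number(filename):
    """Return the captured sequence number as an int, or None if no match."""
    m = PATTERN.search(filename)
    return int(m.group(1)) if m else None


# Hypothetical file name, assumed to follow the naming scheme in the pattern.
print(sequence_number("CDR-123_01_02_03_2015_UK1.FCDR"))
print(sequence_number("something_else.txt"))
```

Anchoring on the surrounding text means the digits are taken from the right place in the name rather than from the first digit run anywhere in it. In Oracle, as far as I know, the subexpression argument of REGEXP_SUBSTR serves the same purpose as group(1) here.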
I'm porting some classes from the Apache Commons library, and I found the following behaviour strange. I have a regular expression defined as
const
IPV4_REGEX = '^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$';
and I use it as follows:
ipv4Validator: TRegEx;
ipv4Validator := TRegEx.Create(IPV4_REGEX);
When I use it to match an IP address, the following code returns false - the debugger shows that Match.Groups.Count is 5, which I didn't expect.
var
Match: TMatch;
begin
Match := ipv4Validator.Match(inet4Address);
if Match.Groups.Count <> 4 then
Exit(false);
Is this the correct behaviour of TMatch.Groups.Count?
Just in case, here's the full code of my class. Notice that I have commented out the offending lines, because they made my tests fail.
unit InetAddressValidator;
interface
uses RegularExpressions;
const
IPV4_REGEX = '^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$';
type
TInetAddressValidator = class
private
ipv4Validator: TRegEx;
public
constructor Create; overload;
function isValid(const inetAddress: String): Boolean;
function isValidInet4Address(const inet4Address: String): Boolean;
end;
implementation
uses SysUtils;
constructor TInetAddressValidator.Create;
begin
inherited;
ipv4Validator := TRegEx.Create(IPV4_REGEX);
end;
function TInetAddressValidator.isValid(const inetAddress: String): Boolean;
begin
Result := isValidInet4Address(inetAddress);
end;
function TInetAddressValidator.isValidInet4Address(const inet4Address
: String): Boolean;
var
Match: TMatch;
IpSegment: Integer;
i: Integer;
begin
Match := ipv4Validator.Match(inet4Address);
// if Match.Groups.Count <> 4 then
// Exit(false);
IpSegment := 0;
for i := 1 to Match.Groups.Count - 1 do
begin
try
IpSegment := StrToInt(Match.Groups[i].Value);
except
Exit(false);
end;
if IpSegment > 255 then
Exit(false);
end;
Result := true;
end;
end.
Match.Groups[0] contains the whole match, so a count of 5 is correct.
The TGroupCollection constructor:
constructor TGroupCollection.Create(ARegEx: TPerlRegEx;
const AValue: UTF8String; AIndex, ALength: Integer; ASuccess: Boolean);
var
I: Integer;
begin
FRegEx := ARegEx;
/// populate collection;
if ASuccess then
begin
SetLength(FList, FRegEx.GroupCount + 1);
for I := 0 to Length(FList) - 1 do
FList[I] := TGroup.Create(AValue, FRegEx.GroupOffsets[I], FRegEx.GroupLengths[I], ASuccess);
end;
end;
As you can see, the internal FList (TArray<TGroup>) is sized to the number of groups + 1. FList[0] receives a group with offset 1 and the length of the whole match. This behaviour is not documented.
Delphi's TRegEx is designed to mimic .NET's Regex class, which also adds the overall regex match to Match.Groups.Count. .NET does this so that the GroupCollection class can implement the ICollection interface.
In Java Matcher.group(0) also returns the overall regex match. Matcher.groupCount() returns the number of groups excluding the overall match. Most regex libraries do it this way.
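Python's re module follows the second convention (group 0 exists but is not counted), which makes the distinction easy to see. This is a sketch in Python, not Delphi, and is_valid_inet4_address is only a rough counterpart of the method in the question.

```python
import re

IPV4_REGEX = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$")


def is_valid_inet4_address(address):
    m = IPV4_REGEX.match(address)
    if m is None:
        return False
    # m.groups() holds only the 4 captures; m.group(0) is the whole match.
    return all(int(part) <= 255 for part in m.groups())


m = IPV4_REGEX.match("192.168.0.1")
print(m.group(0))          # the overall match
print(len(m.groups()))     # captures only, overall match excluded
print(IPV4_REGEX.groups)   # number of capturing groups in the pattern
```

A library that counts the overall match, as TRegEx does, would report one more than len(m.groups()) for the same pattern, which is why the check against 4 in the question fails.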
I need regex help to create a Delphi function to replace the HyperString ParseWord function in RAD Studio XE2. HyperString was a very useful string library that never made the jump to Unicode. I've got it mostly working, but it doesn't honor quote delimiters at all. I need it to be an exact match for the function described below:
function ParseWord(const Source,Table:String;var Index:Integer):String;
Sequential, left to right token parsing using a table of single
character delimiters. Delimiters within quoted strings are ignored.
Quote delimiters are not allowed in Table.
Index is a pointer (initialize to '1' for first word) updated by the
function to point to next word. To retrieve the next word, simply
call the function again using the prior returned Index value.
Note: If Length(Resultant) = 0, no additional words are available.
Delimiters within quoted strings are ignored. (my emphasis)
This is what I have so far:
function ParseWord( const Source, Table: String; var Index: Integer):string;
var
RE : TRegEx;
match : TMatch;
Table2,
chars : string;
begin
if index = length(Source) then
begin
result:= '';
exit;
end;
// escape the special characters and wrap them in a character class
Table2 :='['+TRegEx.Escape(Table, false)+']';
RE := TRegEx.create(Table2);
match := RE.Match(Source,Index);
if match.success then
begin
result := copy( Source, Index, match.Index - Index);
Index := match.Index+match.Length;
end
else
begin
result := copy(Source, Index, length(Source)-Index+1);
Index := length(Source);
end;
while (Length(result) = 0) and (Index < length(Source)) do
begin
Inc(Index);
result := ParseWord(Source, Table, Index);
end;
end;
cheers and thanks.
I would try this regex for Table2:
Table2 := '''[^'']+''|"[^"]+"|[^' + TRegEx.Escape(Table, false) + ']+';
Demo:
This demo is more of a proof of concept, since I was unable to find an online Delphi regex tester.
The delimiters are the space (ASCII code 32) and pipe (ASCII code 124) characters.
The test sentence is:
toto titi "alloa toutou" 'dfg erre' 1245|coucou "nestor|delphi" "" ''
http://regexr.com?32i81
Discussion:
I assume that a quoted string is a string enclosed by either two single quotes (') or two double quotes ("). Correct me if I am wrong.
The regex will match either:
a single quoted string
a double quoted string
a run of characters containing none of the given delimiters
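As a rough check of those three alternatives, here is a sketch in Python rather than Delphi. The tokenize function is a hypothetical name, and re.escape stands in for TRegEx.Escape; the sentence and delimiters are the ones from the demo.

```python
import re


def tokenize(source, delimiters):
    # single-quoted string | double-quoted string | run of non-delimiters
    pattern = "'[^']+'|\"[^\"]+\"|[^" + re.escape(delimiters) + "]+"
    return re.findall(pattern, source)


SENTENCE = "toto titi \"alloa toutou\" 'dfg erre' 1245|coucou \"nestor|delphi\" \"\" ''"

tokens = tokenize(SENTENCE, " |")
for token in tokens:
    print(token)
```

The delimiters inside "nestor|delphi" and 'dfg erre' are left alone. The empty quoted strings "" and '' still come out as tokens, but through the third alternative, because the quote characters themselves are not delimiters.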
Known bug:
Since I don't know how ParseWord handles quote escaping inside a string, the regex doesn't support it.
For instance:
How should 'foo''bar' be interpreted? As two tokens, 'foo' and 'bar', or as one single token, 'foo''bar'?
What about "foo""bar"? As two tokens, "foo" and "bar", or as one single token, "foo""bar"?
In my original code I was looking for the delimiter and taking everything up to it as my next match, but that concept didn't carry over when looking for something within quotes. @Stephan's suggestion of negating the search eventually led me to something that works. An additional complication that I never mentioned earlier is that HyperStr can use anything as a quoting character. The default is the double quote, but you can change it with a function call.
In my solution I've hardcoded QuoteChar. For my own purposes the double quote is what I need, but it would be trivial to make QuoteChar a global and set it from another function. I've also successfully tested it with the single quote (ASCII 39), which is the tricky one in Delphi and is the value shown in the code below.
function ParseWord( const Source, Table: String; var Index: Integer):string;
var
RE : TRegEx;
match : TMatch;
Table2: string;
Source2 : string;
QuoteChar : string;
begin
if index = length(Source) then
begin
result:= '';
exit;
end;
// match either a run of non-delimiter, non-quote characters or a quoted string
QuoteChar := #39;
Table2 :='[^'+TRegEx.Escape(Table, false)+QuoteChar+']*|'+QuoteChar+'.*?'+QuoteChar ;
Source2 := copy(Source, Index, length(Source)-index+1);
match := TRegEx.Match(Source2,Table2);
if match.success then
begin
result := copy( Source2, match.index, match.length);
Index := Index + match.Index + match.Length-1;
end
else
begin
result := copy(Source, Index, length(Source)-Index+1);
Index := length(Source);
end;
while ( Length(result)= 0) and (Index<length(Source)) do
begin
Inc(Index);
result := ParseWord(Source,Table, Index);
end;
end;
This solution doesn't strip the quote chars from around quoted strings, but I can't tell from my own existing code if it should or not, and I can't test using Hyperstr. Maybe someone else knows?