Regular Expressions and TMatch.Groups.Count


I'm porting some classes from the Apache Commons library, and I found the following behaviour strange. I have a regular expression defined as
const
IPV4_REGEX = '^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$';
and I use it as follows:
ipv4Validator: TRegEx;
ipv4Validator := TRegEx.Create(IPV4_REGEX);
When I use it to match an IP address, the following code returns false - the debugger shows that Match.Groups.Count is 5, which I didn't expect.
var
Match: TMatch;
begin
Match := ipv4Validator.Match(inet4Address);
if Match.Groups.Count <> 4 then
Exit(false);
Is this the correct behaviour of TMatch.Groups.Count?
Just in case, here's the full code of my class. Notice that I have commented out the offending lines, because they made my tests fail.
unit InetAddressValidator;
interface
uses RegularExpressions;
const
IPV4_REGEX = '^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$';
type
TInetAddressValidator = class
private
ipv4Validator: TRegEx;
public
constructor Create; overload;
function isValid(const inetAddress: String): Boolean;
function isValidInet4Address(const inet4Address: String): Boolean;
end;
implementation
uses SysUtils;
constructor TInetAddressValidator.Create;
begin
inherited;
ipv4Validator := TRegEx.Create(IPV4_REGEX);
end;
function TInetAddressValidator.isValid(const inetAddress: String): Boolean;
begin
Result := isValidInet4Address(inetAddress);
end;
function TInetAddressValidator.isValidInet4Address(const inet4Address
: String): Boolean;
var
Match: TMatch;
IpSegment: Integer;
i: Integer;
begin
Match := ipv4Validator.Match(inet4Address);
// if Match.Groups.Count <> 4 then
// Exit(false);
IpSegment := 0;
for i := 1 to Match.Groups.Count - 1 do
begin
try
IpSegment := StrToInt(Match.Groups[i].Value);
except
Exit(false);
end;
if IpSegment > 255 then
Exit(false);
end;
Result := true;
end;
end.

Match.Groups[0] contains the whole match, so this is correct.
The TGroupCollection constructor:
constructor TGroupCollection.Create(ARegEx: TPerlRegEx;
const AValue: UTF8String; AIndex, ALength: Integer; ASuccess: Boolean);
var
I: Integer;
begin
FRegEx := ARegEx;
/// populate collection;
if ASuccess then
begin
SetLength(FList, FRegEx.GroupCount + 1);
for I := 0 to Length(FList) - 1 do
FList[I] := TGroup.Create(AValue, FRegEx.GroupOffsets[I], FRegEx.GroupLengths[I], ASuccess);
end;
end;
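As you can see, the internal FList (TArray<TGroup>) is sized to the number of capturing groups plus one. FList[0] holds the overall match (its offset and length; for this anchored pattern, offset 1 and the length of the whole input), and the capturing groups follow from index 1. This behaviour is not documented.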

Delphi's TRegEx is designed to mimic .NET's Regex class, which also includes the overall regex match in Match.Groups, so Groups.Count is the number of capturing groups plus one. .NET does this so that the GroupCollection class can implement the ICollection interface.
In Java, Matcher.group(0) also returns the overall regex match, but Matcher.groupCount() returns the number of groups excluding the overall match. Most regex libraries do it this way.
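The same convention can be checked outside Delphi. The sketch below uses Python's re module purely as an illustration (it is not the Delphi API, and the function name is made up for this example): group 0 is the overall match, so a count that includes it is one more than the number of capturing groups.

```python
import re

IPV4_REGEX = r'^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$'


def is_valid_inet4_address(address: str) -> bool:
    """Hypothetical port of the validator: four segments, each 0..255."""
    match = re.match(IPV4_REGEX, address)
    if match is None:
        # Reject input that does not match at all before looking at groups.
        return False
    # match.groups() excludes group 0, so it has exactly 4 entries here.
    return all(int(segment) <= 255 for segment in match.groups())


match = re.match(IPV4_REGEX, '192.168.0.1')
whole_match = match.group(0)                    # the overall match
capturing_groups = len(match.groups())          # capturing groups only
groups_including_whole = capturing_groups + 1   # the count Delphi and .NET report
```

In the Delphi code from the question, the equivalent check would compare Match.Groups.Count against 5. Testing Match.Success first also covers input that does not match at all, which the loop over the groups alone would let through.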

Related

Swapping characters using TRegEx in Delphi

I need to swap every dot with a comma and every comma with a dot in a single operation.
function TformMain.SwapString(input, fromSymbol, toSymbol: String): String;
begin
Result := AnsiReplaceStr(input, fromSymbol, '_'); //100,200_00
Result := AnsiReplaceStr(Result, toSymbol, fromSymbol); //100.200_00
Result := AnsiReplaceStr(Result, '_', toSymbol); //100.200,00
end;
How can I do this using TRegEx in Delphi Rio?

Although this is not an answer to your question (how to do this using regular expressions), I'd like to point out that this task can be performed with much greater runtime performance using a simple loop:
function SwapPeriodComma(const S: string): string;
var
i: Integer;
begin
Result := S;
for i := 1 to S.Length do
case S[i] of
'.':
Result[i] := ',';
',':
Result[i] := '.';
end;
end;
This is much faster than both the AnsiReplaceStr approach and the regular expression approach.
Generalised to any two characters:
function SwapChars(const S: string; C1, C2: Char): string;
var
i: Integer;
begin
Result := S;
for i := 1 to S.Length do
if S[i] = C1 then
Result[i] := C2
else if S[i] = C2 then
Result[i] := C1;
end;
(If you are OK with a procedure instead of a function, you can do this in-place and save memory and gain speed. But most likely you don't need such optimisations.)
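For completeness, the regex route the question asked about boils down to one pass with a replacement callback: match either character and return the other one. The sketch below shows the idea in Python rather than Delphi, with made-up function names; Delphi's TRegEx should offer a comparable Replace overload that takes a match evaluator, but treat that mapping as an assumption to check against your Delphi version.

```python
import re


def swap_chars_regex(text: str, c1: str, c2: str) -> str:
    """Swap every c1 with c2 and vice versa in one regex pass."""
    # Character class containing exactly the two characters, escaped for safety.
    pattern = '[' + re.escape(c1) + re.escape(c2) + ']'
    # The callback sees each matched character and returns its counterpart.
    return re.sub(pattern, lambda m: c2 if m.group(0) == c1 else c1, text)


def swap_chars_table(text: str, c1: str, c2: str) -> str:
    """Same result without a regex, using a translation table."""
    return text.translate(str.maketrans({c1: c2, c2: c1}))
```

Because each character is visited once, no placeholder such as '_' is needed, so input that already contains the placeholder cannot be corrupted.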

Regular Expression - Match all between BEGIN and END

I'm trying to find out how to make a regular expression to match all text between the first "BEGIN" and the last "END" of a procedure block.
Here's the text which I want to filter:
PROCEDURE MyFirstFunction()#12345
VAR
TESTVAR#1 : Record 1;
TESTVAR#2 : Record 2;
BEGIN
// Here begins the code
IF 1 = 1 THEN BEGIN
IF 2 <> 1 THEN BEGIN
MESSAGE('2 is not equal to 1');
END;
MESSAGE('1 is equal to 1');
END;
END;
PROCEDURE MySecondFunction()#123456
VAR
TESTVAR#1 : Record 1;
TESTVAR#2 : Record 2;
BEGIN
// Here begins the code
IF 1 = 1 THEN BEGIN
IF 2 <> 1 THEN BEGIN
MESSAGE('2 is not equal to 1');
END;
MESSAGE('1 is equal to 1');
END;
END;
PROCEDURE MyThirdFunction()#123457
VAR
TESTVAR#1 : Record 1;
TESTVAR#2 : Record 2;
BEGIN
// Here begins the code
IF 1 = 1 THEN BEGIN
IF 2 <> 1 THEN BEGIN
MESSAGE('2 is not equal to 1');
END;
MESSAGE('1 is equal to 1');
END;
END;
I already tried it with a recursive regular expression, but this didn't work.
Here's the regular expression I worked on:
BEGIN(((?!BEGIN|END;).)|(?R))*END;
But I only get the second beginning of the first function.
Here's the link to regex101.com to test the regular expression:
https://regex101.com/r/ZoBm6h/1

I think the logic you want is a negative lookahead inside a greedy repetition: consume everything after BEGIN up to the last END;, but never step onto the text PROCEDURE, which would mean the match has gone too far and entered the next procedure block. The dot has to match line breaks for this to work, so enable the single-line (dotall, /s) option.
BEGIN((?!PROCEDURE).)*END;
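A quick way to check the lookahead pattern is to run it over a shortened version of the sample text. The sketch below uses Python's re module with the DOTALL flag as a stand-in for the single-line option; the sample source is abbreviated from the question and the variable names are only for illustration.

```python
import re

SOURCE = """PROCEDURE MyFirstFunction()#12345
VAR
TESTVAR#1 : Record 1;
BEGIN
IF 1 = 1 THEN BEGIN
MESSAGE('1 is equal to 1');
END;
END;
PROCEDURE MySecondFunction()#123456
BEGIN
IF 2 <> 1 THEN BEGIN
MESSAGE('2 is not equal to 1');
END;
END;
"""

# DOTALL lets "." cross line breaks; the lookahead stops the scan
# as soon as the next PROCEDURE header would be consumed.
BLOCK = re.compile(r'BEGIN((?!PROCEDURE).)*END;', re.DOTALL)

# One entry per procedure body, from its first BEGIN to its last END;
blocks = [m.group(0) for m in BLOCK.finditer(SOURCE)]
```

The pattern relies on the word PROCEDURE never appearing inside a body, for example in a string literal or a comment; if it can, a real parser is the safer tool.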

If you want to match all blocks, you can also use this regex:
BEGIN((?!^(?!PROCEDURE)$).)*END

Largest "separation" of patterns for Delphi regex?

Update
As Graymatter has observed, regex fails to match when there are at least 2 extra line breaks before the second target. That is to say, changing the concatenation loop to "for I := 0 to 1" will make the regex-match fail.
As shown in the code below, without the concatenation, the program can get the two values using regex. However, with the concatenation, the program cannot get the two values.
Could you comment on the reason and suggest a workaround?
program Project1;
{$APPTYPE CONSOLE}
uses
// www.regular-expressions.info/delphi.html
// http://www.regular-expressions.info/download/TPerlRegEx.zip
PerlRegEx,
SysUtils;
procedure Test;
var
Content: UTF8String;
Regex: TPerlRegEx;
GroupIndex: Integer;
I: Integer;
begin
Regex := TPerlRegEx.Create;
Regex.Regex := 'Value1 =\s*(?P<Value1>\d+)\s*.*\s*Value2 =\s*(?P<Value2>\d*\.\d*)';
Content := '';
for I := 0 to 10000000 do
begin
// Uncomment here to see effect
// Content := Content + 'junkjunkjunkjunkjunk' + sLineBreak;
end;
Regex.Subject := 'junkjunkjunkjunkjunk' +
sLineBreak + ' Value1 = 1' +
sLineBreak + 'junkjunkjunkjunkjunk' + Content +
sLineBreak + ' Value2 = 1.23456789' +
sLineBreak + 'junkjunkjunkjunkjunk';
if Regex.Match then
begin
GroupIndex := Regex.NamedGroup('Value1');
Writeln(Regex.Groups[GroupIndex]);
GroupIndex := Regex.NamedGroup('Value2');
Writeln(Regex.Groups[GroupIndex]);
end
else
begin
Writeln('No match');
end;
Regex.Free;
end;
begin
Test;
Readln;
end.

Adding this line makes it work:
Regex.Options := [preSingleLine];
From the documentation:
preSingleLine
Normally, dot (.) matches anything but a newline (\n). With preSingleLine, dot (.) will match anything, including newlines. This allows a multiline string to be regarded as a single entity. Equivalent to Perl's /s modifier. Note that preMultiLine and preSingleLine can be used together.
When there is only one line of junk between the two targets, the regex matches even without preSingleLine. The reason is that \s matches line breaks, so the \s* on either side of .* absorbs the line breaks and .* only has to cover a single line. With two or more junk lines, .* would have to cross a line break, which it cannot do without preSingleLine.
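The effect is easy to reproduce with the same pattern in another engine. The sketch below uses Python's re module, where re.DOTALL plays the role of preSingleLine; the helper function and the use of '\n' instead of sLineBreak are assumptions made for the illustration.

```python
import re

PATTERN = r'Value1 =\s*(?P<Value1>\d+)\s*.*\s*Value2 =\s*(?P<Value2>\d*\.\d*)'


def build_subject(junk_lines: int) -> str:
    """Put the given number of junk lines between the two values."""
    middle = ['junkjunkjunkjunkjunk'] * junk_lines
    lines = ['junkjunkjunkjunkjunk', ' Value1 = 1', *middle,
             ' Value2 = 1.23456789', 'junkjunkjunkjunkjunk']
    return '\n'.join(lines)


# One junk line: ".*" covers it and "\s*" absorbs the surrounding line breaks.
one_line_default = re.search(PATTERN, build_subject(1))
# Two junk lines: ".*" would have to cross a line break, so the match fails.
two_lines_default = re.search(PATTERN, build_subject(2))
# With DOTALL, "." matches line breaks too and the match succeeds again.
two_lines_dotall = re.search(PATTERN, build_subject(2), re.DOTALL)
```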

How to skip quoted text in regex (or How to use HyperStr ParseWord with Unicode text ?)

I need regex help to create a Delphi function to replace the HyperString ParseWord function in RAD Studio XE2. HyperString was a very useful string library that never made the jump to Unicode. I've got it mostly working, but it doesn't honor quote delimiters at all. I need it to be an exact match for the function described below:
function ParseWord(const Source,Table:String;var Index:Integer):String;
Sequential, left to right token parsing using a table of single
character delimiters. Delimiters within quoted strings are ignored.
Quote delimiters are not allowed in Table.
Index is a pointer (initialize to '1' for first word) updated by the
function to point to next word. To retrieve the next word, simply
call the function again using the prior returned Index value.
Note: If Length(Resultant) = 0, no additional words are available.
Delimiters within quoted strings are ignored. (my emphasis)
This is what I have so far:
function ParseWord( const Source, Table: String; var Index: Integer):string;
var
RE : TRegEx;
match : TMatch;
Table2,
chars : string;
begin
if index = length(Source) then
begin
result:= '';
exit;
end;
// escape the special characters and wrap them in a character class
Table2 :='['+TRegEx.Escape(Table, false)+']';
RE := TRegEx.create(Table2);
match := RE.Match(Source,Index);
if match.success then
begin
result := copy( Source, Index, match.Index - Index);
Index := match.Index+match.Length;
end
else
begin
result := copy(Source, Index, length(Source)-Index+1);
Index := length(Source);
end;
// skip empty tokens produced by consecutive delimiters
while (Length(result) = 0) and (Index < length(Source)) do
begin
Inc(Index);
result := ParseWord(Source, Table, Index);
end;
end;
Cheers and thanks.

I would try this regex for Table2:
Table2 := '''[^'']+''|"[^"]+"|[^' + TRegEx.Escape(Table, false) + ']+';
Demo:
This demo is more of a proof of concept, since I was unable to find an online Delphi regex tester.
The delimiters are the space (ASCII code 32) and pipe (ASCII code 124) characters.
The test sentence is:
toto titi "alloa toutou" 'dfg erre' 1245|coucou "nestor|delphi" "" ''
http://regexr.com?32i81
Discussion:
I assume that a quoted string is a string enclosed in either a pair of single quotes (') or a pair of double quotes ("). Correct me if I am wrong.
The regex will match either:
a single-quoted string
a double-quoted string
a run of characters that contains none of the passed delimiters
Known bug:
Since I didn't know how ParseWord handles quote escaping inside a string, the regex doesn't support this feature.
For instance:
How should 'foo''bar' be interpreted? Two tokens, 'foo' and 'bar', or one single token, 'foo''bar'?
What about "foo""bar"? Two tokens, "foo" and "bar", or one single token, "foo""bar"?
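Since no online Delphi tester was at hand, here is a small sketch that runs the same alternation over the test sentence using Python's re module. The function name is made up, and re.escape stands in for TRegEx.Escape; the two engines agree on the constructs used here, but treat the Delphi behaviour as something to confirm.

```python
import re


def build_token_pattern(delimiters: str) -> str:
    """Quoted strings first, then runs of non-delimiter characters."""
    escaped = ''.join(re.escape(ch) for ch in delimiters)
    # Same three alternatives as the Delphi expression for Table2.
    return '\'[^\']+\'|"[^"]+"|[^' + escaped + ']+'


SENTENCE = 'toto titi "alloa toutou" \'dfg erre\' 1245|coucou "nestor|delphi" "" \'\''

# Delimiters are the space and the pipe character, as in the demo.
tokens = re.findall(build_token_pattern(' |'), SENTENCE)
```

Note that the empty quoted strings "" and '' are not matched by the quoted alternatives (those require at least one character between the quotes); they fall through to the third alternative and come back as ordinary two-character tokens.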

In my original code I was looking for the delimiter and taking everything up to that as my next match, but that concept didn't carry over when looking for something within quotes. @Stephan's suggestion of negating the search eventually led me to something that works. An additional complication that I never mentioned earlier is that HyperStr can use anything as a quoting character. The default is double quote, but you can change it with a function call.
In my solution I've hardcoded QuoteChar, which suits my own purposes, but it would be trivial to make QuoteChar a global and set it within another function. I normally use double quote; the listing below shows the variant I tested with single quote (#39, ASCII 39), which is the tricky one in Delphi.
function ParseWord( const Source, Table: String; var Index: Integer):string;
var
RE : TRegEx;
match : TMatch;
Table2: string;
Source2 : string;
QuoteChar : string;
begin
if index = length(Source) then
begin
result:= '';
exit;
end;
// escape the special characters and wrap in a Group
QuoteChar := #39;
Table2 :='[^'+TRegEx.Escape(Table, false)+QuoteChar+']*|'+QuoteChar+'.*?'+QuoteChar ;
Source2 := copy(Source, Index, length(Source)-index+1);
match := TRegEx.Match(Source2,Table2);
if match.success then
begin
result := copy( Source2, match.index, match.length);
Index := Index + match.Index + match.Length-1;
end
else
begin
result := copy(Source, Index, length(Source)-Index+1);
Index := length(Source);
end;
while ( Length(result)= 0) and (Index<length(Source)) do
begin
Inc(Index);
result := ParseWord(Source,Table, Index);
end;
end;
This solution doesn't strip the quote chars from around quoted strings, but I can't tell from my own existing code if it should or not, and I can't test using Hyperstr. Maybe someone else knows?

Delphi extract string between 2 tags

How would I go about extracting text between 2 html tags using delphi?
Here is an example string.
blah blah blah<tag>text I want to keep</tag>blah blah blah
and I want to extract this part of it.
<tag>text I want to keep</tag>
(Basically, removing all the blah blah blah garbage that comes before and after the <tag> and </tag> strings, which I also want to keep.)
Like I said, I am sure this is extremely easy for those who know, but I just cannot wrap my head around it at the moment. Thanks in advance for your replies.

If you have Delphi XE, you can use the new RegularExpressions unit:
ResultString := TRegEx.Match(SubjectString, '(?si)<tag>.*?</tag>').Value;
If you have an older version of Delphi, you can use a 3rd party regex component such as TPerlRegEx:
Regex := TPerlRegEx.Create(nil);
Regex.RegEx := '(?si)<tag>.*?</tag>';
Regex.Subject := SubjectString;
if Regex.Match then ResultString := Regex.MatchedExpression;
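To see what the (?si) prefix and the lazy .*? contribute, the same pattern can be tried in any Perl-style engine. The sketch below uses Python's re module as a stand-in; the subject string is a made-up variation of the example with mixed case and a line break inside the tag.

```python
import re

subject = 'blah blah blah<TAG>text I want\nto keep</tag>blah blah blah'

# (?s): "." also matches line breaks; (?i): case-insensitive matching;
# ".*?" is lazy, so the match stops at the first closing tag.
match = re.search(r'(?si)<tag>.*?</tag>', subject)
result = match.group(0) if match else ''
```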

This depends entirely on how your input looks.
Update First I wrote a few solutions for special cases, but after the OP explained a bit more about the details, I had to generalize them a bit. Here is the most general code:
function ExtractTextInsideGivenTagEx(const Tag, Text: string): string;
var
StartPos1, StartPos2, EndPos: integer;
i: Integer;
begin
result := '';
StartPos1 := Pos('<' + Tag, Text);
EndPos := Pos('</' + Tag + '>', Text);
StartPos2 := 0;
for i := StartPos1 + length(Tag) + 1 to EndPos do
if Text[i] = '>' then
begin
StartPos2 := i + 1;
break;
end;
if (StartPos2 > 0) and (EndPos > StartPos2) then
result := Copy(Text, StartPos2, EndPos - StartPos2);
end;
function ExtractTagAndTextInsideGivenTagEx(const Tag, Text: string): string;
var
StartPos, EndPos: integer;
begin
result := '';
StartPos := Pos('<' + Tag, Text);
EndPos := Pos('</' + Tag + '>', Text);
if (StartPos > 0) and (EndPos > StartPos) then
result := Copy(Text, StartPos, EndPos - StartPos + length(Tag) + 3);
end;
Sample usage
ExtractTextInsideGivenTagEx('tag',
'blah <i>blah</i> <b>blah<tag a="2" b="4">text I want to keep</tag>blah blah </b>blah')
returns
text I want to keep
whereas
ExtractTagAndTextInsideGivenTagEx('tag',
'blah <i>blah</i> <b>blah<tag a="2" b="4">text I want to keep</tag>blah blah </b>blah')
returns
<tag a="2" b="4">text I want to keep</tag>

You can build a function using the Pos and Copy functions.
See this sample:
Function ExtractBetweenTags(Const Value,TagI,TagF:string):string;
var
i,f : integer;
begin
Result:=''; // empty when the tags are not found
i:=Pos(TagI,Value);
f:=Pos(TagF,Value);
if (i>0) and (f>i) then
// the text starts right after the opening tag and ends just before the closing tag
Result:=Copy(Value,i+length(TagI),f-i-length(TagI));
end;
Function ExtractWithTags(Const Value,TagI,TagF:string):string;
var
i,f : integer;
begin
Result:=''; // empty when the tags are not found
i:=Pos(TagI,Value);
f:=Pos(TagF,Value);
if (i>0) and (f>i) then
Result:=Copy(Value,i,f-i+length(TagF));
end;
and call it like this:
StrValue:='blah blah blah<tag>text I want to keep</tag>blah blah blah';
NewValue:=ExtractBetweenTags(StrValue,'<tag>','</tag>');//returns 'text I want to keep'
NewValue:=ExtractWithTags(StrValue,'<tag>','</tag>');//returns '<tag>text I want to keep</tag>'

I find this version more versatile because it isn't limited to one occurrence of the tags. It searches for the next end tag after the start tag.
Function ExtractBetweenTags(Const Line, TagI, TagF: string): string;
var
i, f : integer;
begin
Result := ''; // empty when the tags are not found
i := Pos(TagI, Line);
// f is the position of the end tag relative to the text after the start tag
f := Pos(TagF, Copy(Line, i+length(TagI), MAXINT));
if (i > 0) and (f > 0) then
Result:= Copy(Line, i+length(TagI), f-1);
end;
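The index arithmetic is the part that is easy to get wrong, so here is a compact sketch of the same idea: find the start tag, then search for the end tag only after it. It is written in Python for illustration, with a made-up function name and an optional keep_tags switch that is not part of the answers above.

```python
def extract_between_tags(line: str, start_tag: str, end_tag: str,
                         keep_tags: bool = False) -> str:
    """Return the text between the first start_tag and the next end_tag."""
    start = line.find(start_tag)
    if start < 0:
        return ''
    content_start = start + len(start_tag)
    # Search for the end tag only after the start tag, so an earlier
    # stray end tag does not confuse the result.
    end = line.find(end_tag, content_start)
    if end < 0:
        return ''
    if keep_tags:
        return line[start:end + len(end_tag)]
    return line[content_start:end]
```

For example, extract_between_tags('blah<tag>text I want to keep</tag>blah', '<tag>', '</tag>') yields the inner text, and passing keep_tags=True yields the text with both tags attached.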
