Unexpected result with TRegex matching a unicode string. Is this a bug?

I am using Delphi 10.1.
I get an unexpected result when doing the following:
procedure TForm1.FormCreate(Sender: TObject);
var
Match: TMatch;
Regex: TRegex;
Perl: TPerlRegEx;
const
s1 = 'Réaitaei: test123';
s2 = 'Réàààaitaei: test123';
s3 = 'Réàààààààààaitaei: test123';
pattern = '(.*): ';
begin
Regex := TRegex.Create(pattern);
Match := Regex.Match(s1);
Assert(Match.Success);
Memo1.Lines.Add(Match.Groups[1].Value);
Match := Regex.Match(s2);
Assert(Match.Success);
Memo1.Lines.Add(Match.Groups[1].Value);
Match := Regex.Match(s3);
Assert(Match.Success);
Memo1.Lines.Add(Match.Groups[1].Value);
Perl := TPerlRegEx.Create;
Perl.RegEx := pattern;
Perl.Compile;
Perl.Subject := s1;
Assert(Perl.Match);
Memo1.Lines.Add(Perl.Groups[1]);
Perl.Subject := s2;
Assert(Perl.Match);
Memo1.Lines.Add(Perl.Groups[1]);
Perl.Subject := s3;
Assert(Perl.Match);
Memo1.Lines.Add(Perl.Groups[1]);
end;
I get (the first line is the TRegex output, the second is the TPerlRegEx output):
Réaitaei Réàààaitaei: t Réàààààààààaitaei: test
Réaitaei Réàààaitaei Réàààààààààaitaei
So TPerlRegex works but TRegex does not, even though the documentation says that TRegex is just a wrapper around TPerlRegex.
Am I missing something, or is it a bug?
I found a post from Marco Cantu saying that since Delphi XE7, Delphi uses PCRE 8.35, so I am not expecting any problem with Unicode.

It looks like RSP-17697 fixes this issue in Tokyo 10.2.1. If you compare the relevant units, you can see that TBytes was changed to a Unicode string. I've checked my reused RegEx and it's not affected, but anyone who is unsure and is using Berlin or lower can use the class method to make sure everything works: TRegex.Match('Réàààààààààaitaei: test123', '(.*): ').Groups[1].Value = 'Réàààààààààaitaei'
That would be one heck of a bug to hunt down
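As a cross-check of what the pattern is expected to capture, here is a minimal sketch in Python's re module. This is a different engine from the PCRE build shipped with Delphi, so it only illustrates the intended result, not the cause of the bug. The helper name first_group is made up for this example.

```python
import re

# Same pattern as in the question: capture everything up to the last ": ".
PATTERN = re.compile(r"(.*): ")

# The three subjects from the question.
SUBJECTS = [
    "Réaitaei: test123",
    "Réàààaitaei: test123",
    "Réàààààààààaitaei: test123",
]


def first_group(subject: str) -> str:
    """Return capture group 1, or raise if the pattern does not match."""
    match = PATTERN.search(subject)
    if match is None:
        raise ValueError("pattern did not match: " + subject)
    return match.group(1)


if __name__ == "__main__":
    for subject in SUBJECTS:
        print(first_group(subject))
```

In each case the capture stops right before ": ", which is the result TPerlRegex reports in the question.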

Related

Delphi multiline regex

I have some non-regression test code in Delphi that calls an external diff tool. My code then loads the diff results and should remove acceptable differences, such as dates in the compared results. I'm trying to do this with a multiline TRegEx.Replace, but no match is found...
https://regex101.com/r/QBZuws/2 shows the pattern I came up with and a sample test diff file. I need to delete the matching "paragraphs" of 3 lines.
Here is my code:
function FilterDiff(AText:string):string;
var
LStr:string;
Regex: TRegEx;
begin
// AText:=StringReplace(AText,#13+#10,'\n',[rfReplaceAll]); // doesn't help ...
LStr := '\d\d.\d\d.20\d\d \d\d:\d\d:\d\d'; // regex for date and time
LStr := '##.*##\n-'+LStr+'\n\+'+LStr; // regex for paragraphs to remove
Regex := TRegEx.Create(LStr, [roMultiLine]);
Result := Regex.Replace(AText,'');
end;
procedure TReportTest.NonRegression;
var
LDiff : TStringList;
// others removed for clarity
begin
// removed section code that call an external tool and produces diff.txt file
LDiff := TStringList.Create;
LDiff.LoadFromFile('diff.txt');
Status(FilterDiff(LDiff.Text)); // show the diffs in DUnit GUI for now
LDiff.Free;
end;
Besides, here is the call stack I get when tracing TRegEx.Replace down to pcre_exec:
System.RegularExpressionsAPI.pcre_exec($4D72A50,nil,'--- '#$D#$A'+++ '#$D#$A'## -86 +86 ##'#$D#$A'-16.11.2017 15:00:36'#$D#$A'+15.11.2017 10:47:58'#$D#$A'## -400 +400 ##'#$D#$A'-16.11.2017 15:00:36'#$D#$A'+15.11.2017 10:47:58'#$D#$A,132,0,1024,$7D56800,300)
System.RegularExpressionsCore.TPerlRegEx.Match
System.RegularExpressionsCore.TPerlRegEx.ReplaceAll
System.RegularExpressions.TRegEx.Replace(???,???)
TestReportAuto.FilterDiff('--- '#$D#$A'+++ '#$D#$A'## -86 +86 ##'#$D#$A'-16.11.2017 15:00:36'#$D#$A'+15.11.2017 10:47:58'#$D#$A'## -400 +400 ##'#$D#$A'-16.11.2017 15:00:36'#$D#$A'+15.11.2017 10:47:58'#$D#$A)
I was surprised to see quotes before and after each newline #$D#$A in the debugger, but they don't look "real"... or are they?

As you seem to have issues with different kinds of line breaks, I would recommend adjusting your regex to use \R instead of \n. \R matches Windows-style line breaks (CR + LF) as well as Unix-style line breaks (LF).

Well, I just noticed that \n in the regex matches only LF, not CR+LF, so I added
AText:=StringReplace(AText,#13+#10,#10,[rfReplaceAll]); // \n matches only LF !
at the beginning of my function and it's much better now...
Sometimes writing down a problem helps ...
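The effect can be reproduced outside Delphi. The sketch below uses Python's re module, which does not support \R, so \r?\n stands in for it; that substitution is an assumption of this example, not something from the answers above. The sample text is the one visible in the debugger trace, and the function name make_pattern is made up for illustration.

```python
import re

# Date/time sub-pattern exactly as written in the question (dots left unescaped).
DATE = r"\d\d.\d\d.20\d\d \d\d:\d\d:\d\d"


def make_pattern(newline: str) -> "re.Pattern[str]":
    """Build the 3-line paragraph pattern with the given line-break sub-pattern."""
    return re.compile("##.*##" + newline + "-" + DATE + newline + r"\+" + DATE)


# Diff text with CR+LF line endings, as shown in the debugger trace.
TEXT = (
    "--- \r\n+++ \r\n"
    "## -86 +86 ##\r\n-16.11.2017 15:00:36\r\n+15.11.2017 10:47:58\r\n"
    "## -400 +400 ##\r\n-16.11.2017 15:00:36\r\n+15.11.2017 10:47:58\r\n"
)

lf_only = make_pattern(r"\n")        # what the question used
crlf_aware = make_pattern(r"\r?\n")  # stand-in for \R in this sketch

if __name__ == "__main__":
    # \n alone never matches because every LF is preceded by a CR.
    print(repr(lf_only.sub("", TEXT)))
    # Accepting an optional CR lets both paragraphs match.
    print(repr(crlf_aware.sub("", TEXT)))
    # Normalizing CR+LF to LF first (the self-answer's approach) also works.
    print(repr(lf_only.sub("", TEXT.replace("\r\n", "\n"))))
```

Note that the pattern does not consume the line break after the second date, so each removed paragraph leaves an empty line behind.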

Pattern Substitution in Haxe

var str2 : String = "Expander Detected (%MSG_ID%)";
var r2 = ~/[\(%MSG_ID%\)]+/g;
trace(r2.replace(str2, ""));
Expected Result: Expander Detected
Actual Result: Expander etected
I need to replace (%MSG_ID%) in my strings. Characters before (%MSG_ID%) are dynamic, so we can not replace them manually.

You need to remove the surrounding []. This works as expected:
var r2 = ~/\(%MSG_ID%\)+/g;
[] is a character set, which matches any single character contained in the set. Since the set happens to contain D, the D in "Detected" is also removed when calling replace(). However, you only want a match when all the characters are present, in that order.
I'd recommend a tool like regex101.com for testing regexes. You can see the issue clearly there.
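The difference between a character set and a literal sequence is easy to check in any regex engine. Here is a minimal sketch in Python rather than Haxe; the function names are made up for this example, and the trailing + from the original pattern is dropped in the literal version because it would only repeat the closing parenthesis.

```python
import re

TEXT = "Expander Detected (%MSG_ID%)"


def strip_with_char_set(text: str) -> str:
    """Removes every run of characters that appear in the set, including 'D'."""
    return re.sub(r"[\(%MSG_ID%\)]+", "", text)


def strip_literal(text: str) -> str:
    """Removes only the exact sequence '(%MSG_ID%)'."""
    return re.sub(r"\(%MSG_ID%\)", "", text)


if __name__ == "__main__":
    print(repr(strip_with_char_set(TEXT)))
    print(repr(strip_literal(TEXT)))
```

Both versions leave the space that preceded the placeholder, so a trim of the result may still be wanted.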

TRegex.Match never matches empty strings

I am processing a number of strings from a TStringList and want to skip the lines that do not match a certain RegEx pattern. Therefore I created the pattern ^(?!\t\w+\t\w+) and tried the following:
program P;
uses
System.SysUtils, System.Classes, System.RegularExpressions;
var
S: TStringList;
I: Integer;
begin
S := TStringList.Create;
try
//Test and empty string should be passed
S.Add('Test'); S.Add(''); S.Add(#9'Hello'#9'world%%');
I := 0;
while ((I < S.Count - 1) and TRegex.IsMatch(S[I], '^(?!\t\w+\t\w+)', [])) do
Inc(I);
Writeln(IntToStr(I) + ': ' + S[I]);
Readln;
finally
S.Free;
end;
end.
Surprisingly it prints 1: so the empty string from my StringList is not matched, though it should match the pattern. I can catch this case by adding and S[I] <> '' but I'm wondering if I missed any Regex option (or similar) to correctly handle empty strings with a RegEx. I had to explicitly pass empty RegexOptions to the IsMatch function, as roNotEmpty is used by default - but this only allows my pattern to produce a zero-length match.
I have tested this in Delphi 10.1.

This is a known issue.
You can recompile the unit after modifying the code as mentioned in the comments of the issue. All you have to do is explicitly add the .pas file to your project, which causes the compiler to recompile it instead of using the shipped .dcu.
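For reference, the pattern itself does accept an empty subject in other engines. The following sketch uses Python's re module, which is not the PCRE build that Delphi ships, so it only shows the expected semantics of the negative lookahead; the helper name should_skip is made up for this example.

```python
import re

# Matches at the start of a line unless the line begins with TAB word TAB word.
PATTERN = re.compile(r"^(?!\t\w+\t\w+)")

# The three strings added to the TStringList in the question.
LINES = ["Test", "", "\tHello\tworld%%"]


def should_skip(line: str) -> bool:
    """True when the pattern matches, i.e. the line lacks the TAB-word-TAB-word prefix."""
    return PATTERN.search(line) is not None


if __name__ == "__main__":
    for index, line in enumerate(LINES):
        print(index, repr(line), should_skip(line))
```

Here the empty string produces a zero-length match, so a loop like the one in the question would advance past it and stop at index 2.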

Very slow RegEx in AHK yet fast in Notepad++

I'd like to find a certain string in a webpage. I decided to use RegEx. (I know my RegExes are quite terrible, however, they work). My two expressions are very fast when used in Notepad++ (probably < 1s) and on Regex101, but they are horribly slow when used in AutoHotKey – about 2-5 minutes. How do I fix this?
sWindowInfo2 = http://www.archiwum.wyborcza.pl/Archiwum/1,0,4583161,20060208LU-DLO,Dzis_bedzie_Piast,.html
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", sWindowInfo2, false ), whr.Send()
whr.ResponseText
sPage := ""
sPage := whr.ResponseText
; get city name (if exists) – the following is very slooooow
if RegExMatch(sPage, "[\s\S]+<dzial>Gazeta\s(.+)<\/dzial>[\s\S]+")
{
sCity := RegExReplace(sPage, "[\s\S]+<dzial>Gazeta\s(.+)<\/dzial>[\s\S]+", "$1")
;MsgBox, % sCity
city := 1
}
if RegExMatch(sPage, "[\s\S]+<metryczka>GW\s(.+)\snr[\s\S]+")
{
sCity := RegExReplace(sPage, "[\s\S]+<metryczka>GW\s(.+)\snr[\s\S]+", "$1")
city := 1
}
EDIT:
In the page I provided the match is Lublin. Have a look at: https://regex101.com/r/qJ2pF8/1

You do not need to use RegExReplace to get the captured value. As per reference, you can pass the 3rd var into RegExMatch:
OutputVar
OutputVar is the unquoted name of a variable in which to store a match object, which can be used to retrieve the position, length and value of the overall match and of each captured subpattern, if any are present.
So, use a much simpler pattern:
FoundPos := RegExMatch(sPage, "<metryczka>GW\s(.+)\snr", SubPat) ;
It will return the position of the match, and will store "Lublin" in SubPat[1].
With this pattern, you avoid the heavy backtracking caused by [\s\S]+<metryczka>GW\s(.+)\snr[\s\S]+: the first [\s\S]+ matches all the way to the end of the string and then has to backtrack, one character at a time, until the subsequent subpatterns can match. The longer the string, the slower the operation.
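The same idea, searching for the short pattern and reading the capture group directly instead of wrapping it in [\s\S]+ on both sides, can be sketched in Python. The sample page below is a made-up stand-in for the real response text, and extract_city is a hypothetical helper name.

```python
import re
from typing import Optional

# Short pattern from the answer: no leading or trailing [\s\S]+ is needed,
# because search() scans for the first position where the pattern fits.
CITY_PATTERN = re.compile(r"<metryczka>GW\s(.+)\snr")

# Hypothetical sample input; the real page is much longer.
SAMPLE_PAGE = "<html>\n<body>\n<metryczka>GW Lublin nr 33</metryczka>\n</body>\n</html>\n"


def extract_city(page: str) -> Optional[str]:
    """Return the captured city name, or None when the marker is absent."""
    match = CITY_PATTERN.search(page)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_city(SAMPLE_PAGE))
```

Because the capture is read from the match itself, no second replace pass over the whole page is required.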

RegEx Pattern Help to Split String

I am having a terrible time trying to get a regular expression defined to split a string that will look like the following . . .
SQL12345,54321SQL
XXXXX,XXXXX
Where X = [0-9A-Za-z] and can be repeated one or more times on each side of the delimiter (,).
The RegEx Pattern I've come up with is . . .
"([0-9A-Za-z]+)(,)([0-9A-Za-z]+)"
I only ever want one group on each side of the delimiter. I seem to be getting results that look like . . .
myStrArr[0] = ""
myStrArr[1] = "SQL12345"
myStrArr[2] = ","
myStrArr[3] = "54321SQL"
myStrArr[4] = ""
So, why am I getting empty strings for the line beginning and line end (array elements 0 and 4)? And how can I fix my regex pattern so that these are not returned?
THANK YOU!

user364939's code won't compile. Try this:
System.String originalString = "SQL12345,54321SQL";
System.String[] splitArray = originalString.Split(',');
System.Console.WriteLine(splitArray[0]);
System.Console.WriteLine(splitArray[1]);
Caveat: I tested this with C# in Snippet Compiler but made the .NET references verbose hoping that it will translate nicely to managed C++.
Here's a managed C++ version:
#include "stdafx.h"
using namespace System;
int main(array<System::String ^> ^args)
{
String^ p = "SQL12345,54321SQL";
array<String^>^ a = p->Split(',');
Console::WriteLine(a[0]);
Console::WriteLine(a[1]);
Console::ReadLine();
return 0;
}

Paul's answer could be a better solution if you don't want to use a regular expression. If you do, though, try this one:
([^,]+),([^,]+)
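The question does not say which API produced the five-element array, but the result is what a regex split returns when the pattern covers the whole input and contains capture groups. Python's re.split behaves this way, so it is used here as an illustration; treating it as equivalent to the asker's API is an assumption.

```python
import re

TEXT = "SQL12345,54321SQL"

# Splitting on a pattern that matches the entire string leaves an empty string
# before and after the match, and the captured groups are inserted in between.
split_on_whole_match = re.split(r"([0-9A-Za-z]+)(,)([0-9A-Za-z]+)", TEXT)

# Splitting on the delimiter alone gives just the two sides.
split_on_delimiter = re.split(r",", TEXT)

# Alternatively, match the whole string and read the two groups.
full = re.fullmatch(r"([^,]+),([^,]+)", TEXT)
groups_from_match = full.groups() if full else ()

if __name__ == "__main__":
    print(split_on_whole_match)
    print(split_on_delimiter)
    print(groups_from_match)
```

In short, a split pattern should describe the delimiter, while a match pattern should describe the pieces to keep.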

I agree with the answer above, unless you care about other delimiter characters such as |. In that case you'd want to split on anything that is not [0-9A-Za-z]. Otherwise, a plain split on the comma is enough:
String myString = "SQL12345,54321SQL";
String[] array = myString.split(",");
