TRegex.Match never matches empty strings - regex

I am processing a number of strings from a TStringList and want to skip some lines that do not match a certain RegEx Pattern. Therefore I created the pattern ^(?!\t\w+\t\w+) and tried the following:
program P;
uses
  System.SysUtils, System.Classes, System.RegularExpressions;
var
  S: TStringList;
  I: Integer;
begin
  S := TStringList.Create;
  try
    // 'Test' and the empty string should be skipped
    S.Add('Test'); S.Add(''); S.Add(#9'Hello'#9'world%%');
    I := 0;
    while ((I < S.Count - 1) and TRegex.IsMatch(S[I], '^(?!\t\w+\t\w+)', [])) do
      Inc(I);
    Writeln(IntToStr(I) + ': ' + S[I]);
    Readln;
  finally
    S.Free;
  end;
end.
Surprisingly it prints 1: and thus stops at the empty string from my StringList, although the empty string should match the pattern. I can catch this case by adding and S[I] <> '', but I'm wondering if I missed a Regex option (or similar) that makes a RegEx handle empty strings correctly. I had to pass empty RegexOptions to IsMatch explicitly, because roNotEmpty is used by default, but this only allows my pattern to produce a zero-length match.
I have tested this in Delphi 10.1.

This is a known issue.
You can recompile the unit after modifying the code as mentioned in the comments of the issue. All you have to do is to explicitly add the pas file to your project to cause the compiler to recompile it instead of using the shipped dcu.
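For comparison, here is a minimal Python sketch of the filtering the question is after. It is not Delphi and does not touch the TRegEx issue; it only illustrates, assuming Python's re module is an acceptable reference engine, that a pattern made of a start anchor and a negative lookahead is expected to match the empty string. The helper name first_non_matching is hypothetical.

```python
import re

# Same pattern as in the question: a zero-length match at the start of the
# string, unless the string begins with TAB word TAB word.
PATTERN = re.compile(r"^(?!\t\w+\t\w+)")


def first_non_matching(lines):
    """Return the index of the first line the pattern does NOT match,
    or len(lines) if every line matches (hypothetical helper)."""
    for index, line in enumerate(lines):
        if PATTERN.search(line) is None:
            return index
    return len(lines)


lines = ["Test", "", "\tHello\tworld%%"]
print(first_non_matching(lines))
```

In this engine both 'Test' and the empty string are skipped, which is the behaviour the question expected from TRegEx.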

Related

Regex to select text outside of underscores

I am looking for a regex to select the text which falls outside of underscore characters.
Sample text:
PartIWant_partINeedIgnored_morePartsINeedIgnored_PartIwant
Basically I need to be able to select the first keyword, which is always before the first underscore, and the last keyword, which is always after the last underscore. As an additional complexity, there can also be texts which have no underscore at all; these need to be selected completely as well.
The best I got yet was this expression:
^((?! *\_[^)]*\_ *).)*
which only yields the first part, not the second, and does not yet handle texts without underscores at all.
This regex is used in a tool which monitors our http traffic, which means I can only 'select' the part I need but can't invoke functions or replace logic.
Thanks!
Use the JavaScript string function split(). See the example below.
var t = "PartIWant_partINeedIgnored_morePartsINeedIgnored_PartIwant";
var arr = t.split('_');
console.log(arr);
//Access the required parts like this
console.log(arr[0] + ' ' + arr[arr.length - 1]);
Perhaps something like this:
/(^[^_]+)|([^_]+$)/g
That is, match either:
^[^_]+ the beginning of the string followed by non-underscores, or
[^_]+$ non-underscores followed by the end of the string.
var regex = /(^[^_]+)|([^_]+$)/g
console.log("A_b_c_D".match(regex)) // ["A", "D"]
console.log("A_b_D".match(regex)) // ["A", "D"]
console.log("A_D".match(regex)) // ["A", "D"]
console.log("AD".match(regex)) // ["AD"]
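The same alternation can be tried outside JavaScript. The following is a small Python sketch, assuming the monitoring tool's regex flavour treats ^, $ and negated character classes the way Python's re does; the function name edge_parts is made up for illustration.

```python
import re

# Either a run of non-underscores anchored at the start of the string,
# or a run of non-underscores anchored at the end of the string.
EDGE_PARTS = re.compile(r"^[^_]+|[^_]+$")


def edge_parts(text):
    """Return the first and last underscore-free parts as separate matches."""
    return EDGE_PARTS.findall(text)


sample = "PartIWant_partINeedIgnored_morePartsINeedIgnored_PartIwant"
print(edge_parts(sample))
print(edge_parts("AD"))
```

Note that the two parts still come back as two separate matches, not as one joined string.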
I'm not sure if you should use a regex here. I think splitting the string at underscore, and using the first and last element of the resulting array might be faster, and less complicated.
Trivial with .replace:
str.replace(/_.*_/, '')
// "PartIWantPartIwant"
With matching, you'd need to be selecting and concatenating groups:
parts = str.match(/^([^_]*).*?([^_]*)$/)
parts[1] + parts[2]
// "PartIWantPartIwant"
EDIT
This regex is used in a tool which monitors our http traffic, which means I can only 'select' the part I need but can't invoke functions or replace logic.
This is not possible: a regular expression cannot match a discontinuous span.

Very slow RegEx in AHK yet fast in Notepad++

I'd like to find a certain string in a webpage. I decided to use RegEx. (I know my RegExes are quite terrible, however, they work). My two expressions are very fast when used in Notepad++ (probably < 1s) and on Regex101, but they are horribly slow when used in AutoHotKey – about 2-5 minutes. How do I fix this?
sWindowInfo2 = http://www.archiwum.wyborcza.pl/Archiwum/1,0,4583161,20060208LU-DLO,Dzis_bedzie_Piast,.html
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", sWindowInfo2, false ), whr.Send()
whr.ResponseText
sPage := ""
sPage := whr.ResponseText
; get city name (if exists) – the following is very slooooow
if RegExMatch(sPage, "[\s\S]+<dzial>Gazeta\s(.+)<\/dzial>[\s\S]+")
{
    sCity := RegExReplace(sPage, "[\s\S]+<dzial>Gazeta\s(.+)<\/dzial>[\s\S]+", "$1")
    ;MsgBox, % sCity
    city := 1
}
if RegExMatch(sPage, "[\s\S]+<metryczka>GW\s(.+)\snr[\s\S]+")
{
    sCity := RegExReplace(sPage, "[\s\S]+<metryczka>GW\s(.+)\snr[\s\S]+", "$1")
    city := 1
}
EDIT:
In the page I provided the match is Lublin. Have a look at: https://regex101.com/r/qJ2pF8/1
You do not need to use RegExReplace to get the captured value. As per reference, you can pass the 3rd var into RegExMatch:
OutputVar
OutputVar is the unquoted name of a variable in which to store a match object, which can be used to retrieve the position, length and value of the overall match and of each captured subpattern, if any are present.
So, use a much simpler pattern:
FoundPos := RegExMatch(sPage, "<metryczka>GW\s(.+)\snr", SubPat)
It will return the position of the match and store "Lublin" in SubPat[1]. (In AutoHotkey v1.1 this object syntax requires the O) option at the start of the pattern; without it, the captured value is stored in the variable SubPat1.)
With this pattern, you avoid the heavy backtracking caused by [\s\S]+<metryczka>GW\s(.+)\snr[\s\S]+: there, the first [\s\S]+ consumes everything up to the end of the string and then backtracks to accommodate the subsequent subpatterns. The longer the string, the slower the operation.
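The idea of searching for the tag directly instead of wrapping it in [\s\S]+ can be sketched in Python as well. This is only an illustration, not AutoHotkey code: the page content below is an invented stand-in for the real response, and only the <metryczka> line resembles what the question describes.

```python
import re

# Invented stand-in for the downloaded page; only the <metryczka> line matters.
page = (
    "<html>\n"
    + "filler line\n" * 1000
    + "<metryczka>GW Lublin nr 33</metryczka>\n"
    + "</html>"
)

# No leading or trailing [\s\S]+ here: search() looks for the tag itself
# and the capture group holds the city name.
match = re.search(r"<metryczka>GW\s(.+)\snr", page)
city = match.group(1) if match else None
print(city)
```

Because the capture group is read straight from the match, no second replace pass over the whole page is needed.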

regex replace : if not followed by letter or number

Okay, so I want a regex to uncontract (if that's what it is called) IPv6 addresses.
Example IPv6 address: 1050:::600:5:1000::
What I want returned: 1050:0000:0000:600:5:1000:0000:0000
My try at this:
ip:gsub("%:([^0-9a-zA-Z])", ":0000")
The first problem with this: It replaces the first and second :
So :: gets replaced with :0000
Replacing it with :0000: wouldn't work, because then the result would end with a :. It also would not parse the newly added :, resulting in: 1050:0000::600:5:1000:0000:
So what would I need this regex to do?
Replace every : by :0000 if it isn't followed by a number or letter
Main problem: :: gets replaced instead of 1 :
gsub and other functions from Lua's string library use Lua Patterns which are much simpler than regex. Using the pattern more than once will handle the cases where the pattern overlaps the replacement text. The pattern only needs to be applied twice since the first time will catch even pairings and the second will catch the odd/new pairings of colons. The trailing and leading colons can be handled separately with their own patterns.
ip = "1050:::600:5:1000::"
ip = ip:gsub("^:", "0000:"):gsub(":$", ":0000")
ip = ip:gsub("::", ":0000:"):gsub("::", ":0000:")
print(ip) -- 1050:0000:0000:600:5:1000:0000:0000
There is no single statement pattern to do this but you can use a function to do this for any possible input:
function fill_ip(s)
    local ans = {}
    for s in (s..':'):gmatch('(%x*):') do
        if s == '' then s = '0000' end
        ans[ #ans+1 ] = s
    end
    return table.concat(ans,':')
end
--examples:
print(fill_ip('1050:::600:5:1000::'))
print(fill_ip(':1050:::600:5:1000:'))
print(fill_ip('1050::::600:5:1000:1'))
print(fill_ip(':::::::'))
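For readers who do not use Lua, the same fill-in idea can be sketched in Python. This is a rough port under one stated difference: it splits on ':' and does not check that each group is hexadecimal, whereas the Lua pattern %x* only matches hex digits.

```python
def fill_ip(address):
    """Replace every empty group between colons with '0000'."""
    # Splitting on ':' yields an empty string for every omitted group,
    # including a leading or trailing one.
    groups = address.split(":")
    return ":".join(group if group else "0000" for group in groups)


print(fill_ip("1050:::600:5:1000::"))
print(fill_ip(":1050:::600:5:1000:"))
print(fill_ip(":::::::"))
```

Like the Lua function, this handles leading, trailing and repeated colons in one pass instead of applying a pattern several times.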

Delphi TRegEx bug?

I try to validate the input '3a' for regex '[_a-zA-Z][_a-zA-Z0-9]*' with that source:
len := TRegEx.Create([_a-zA-Z][_a-zA-Z0-9]*).Match('3a').Length;
I expected 0 for len variable, but it was 2. Is that correct?
This is not your real code. For a start it does not compile. You have omitted the quote marks. If we fix that then we have:
len := TRegEx.Create('[_a-zA-Z][_a-zA-Z0-9]*').Match('3a').Length;
But that returns a value of 1 and not 2 as you stated. This return value is correct because the a matches [_a-zA-Z] and then the input string ends.
I expect that you have the wrong regex. Perhaps you should be using
^[_a-zA-Z][_a-zA-Z0-9]*$
The ^ matches the beginning of the input string, the $ matches the end. Presumably the input is taken from a source code tokenizer.
So the conclusion is that there is no bug evident in the Delphi regex code from this pattern and input.
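The difference between searching for the pattern anywhere in the input and requiring it to cover the whole input can be illustrated with a short Python sketch. It uses Python's re rather than TRegEx, on the assumption that both engines treat this simple pattern the same way.

```python
import re

IDENT = r"[_a-zA-Z][_a-zA-Z0-9]*"

# Unanchored search: finds the identifier-like substring inside "3a".
unanchored = re.search(IDENT, "3a")
print(unanchored.group(), len(unanchored.group()))

# Whole-string match (equivalent to wrapping the pattern in ^...$ here):
# "3a" is rejected because it starts with a digit.
anchored = re.fullmatch(IDENT, "3a")
print(anchored)
```

The unanchored search succeeds on the trailing a, which is why a length of 1 rather than 0 is reported.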

Regex - If contains '%', can only contain '%20'

I want to create a regular expression for the following scenario:
If a string contains the percentage character (%) then it can only contain the following: %20, and cannot be preceded by another '%'.
So if there was for instance, %25 it would be rejected. For instance, the following string would be valid:
http://www.test.com/?&Name=My%20Name%20Is%20Vader
But these would fail:
http://www.test.com/?&Name=My%20Name%20Is%20VadersAccountant%25
%%%25
Any help would be greatly appreciated,
Kyle
EDIT:
The scenario in a nutshell is that a link is written to an encoded state and then launched via JavaScript. No decoding works. I tried .net decoding and JS decoding, each having the same result - The results stay encoded when executed.
Doesn't require a %:
/^[^%]*(%20[^%]*)*$/
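As a quick check of this pattern against the examples from the question, here is a Python sketch. It assumes the strict reading of the question, in which %% is also rejected; the function name only_percent_20 is made up for illustration.

```python
import re

# Any number of non-% characters, then zero or more "%20" sequences,
# each followed by more non-% characters. fullmatch() supplies the ^ and $.
ONLY_PERCENT_20 = re.compile(r"[^%]*(?:%20[^%]*)*")


def only_percent_20(text):
    """True if every % in text is the start of a %20 sequence."""
    return ONLY_PERCENT_20.fullmatch(text) is not None


print(only_percent_20("http://www.test.com/?&Name=My%20Name%20Is%20Vader"))
print(only_percent_20("http://www.test.com/?&Name=My%20Name%20Is%20VadersAccountant%25"))
print(only_percent_20("%%%25"))
```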
Which language are you using?
Most languages have a Uri Encoder / Decoder function or class.
I would suggest you decode the string first and then check for valid (or invalid) characters.
i.e. something like /[\w ]/ (the blank in the class is a space)
With a regex in the first place you need to respect that www.example.com/index.html?user=admin&pass=%%250 means that the pass really is "%250".
Another solution if look-arounds are not available:
^([^%]|%([013-9a-fA-F][0-9a-fA-F]|2[1-9a-fA-F]))*$
Reject the string if it matches %[^2][^0]
I think that would find what you need
/^([^%]|%%|%20)+$/
Edit: Added case where %% is valid string inside URI
Edit2: And fixed it for case where it should fail :-)
Edit3:
In case you need to use it in an editor (which would explain why you can't use a more programmatic way), you have to correctly escape all special characters. For example, in Vim that regex should look like this:
/^\([^%]\|%%\|%20\)\+$/
Maybe a better approach is to deal with that validation after you decode that string:
string name = HttpUtility.UrlDecode(Request.QueryString["Name"]);
/^([^%]|%20)*$/
This requires a test against the "bad" patterns. If we're allowing %20 - we don't need to make sure it exists.
As others have said before, %% is valid too... and %%25 would be %25
The below regex matches anything that doesn't fit into the above rules
/(?<![^%]%)%(?!(20|%))/
The first brackets check whether there is a % before the character (meaning that it's %%) and also checks that it's not %%%. it then checks for a %, and checks whether the item after doesn't match 20
This means that if anything is identified by the regex, then you should probably reject it.
I agree with dominic's comment on the question. Don't use Regex.
If you want to avoid scanning the string twice, you can just iteratively search for % and then check that it is followed by 20 and nothing else. (Update: a % directly after a % is allowed, so that %% is read as a literal % and e.g. %%nnn as the literal text %nnn.)
// pseudo code
pos = mystring.find('%', 0)
while (pos != -1)
{
    if mystring.substring(pos + 1, 1) = "%" then
        pos = pos + 2 // ok, this is a literal %, skip past both characters
    else if mystring.substring(pos + 1, 2) = "20" then
        pos = pos + 3 // ok, skip past the %20
    else
        return false; // string is invalid
    end if
    pos = mystring.find('%', pos) // look for the next % from here
}
return true;
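The single-pass scan described above can be written as runnable Python along these lines. It is a sketch under the answer's assumptions (%20 is allowed, %% counts as a literal %, anything else after a % is rejected); the function name is_valid is hypothetical.

```python
def is_valid(text):
    """Scan once for '%'; accept only '%20' and the literal '%%'."""
    pos = text.find("%")
    while pos != -1:
        if text.startswith("%%", pos):
            pos += 2  # literal percent sign, skip both characters
        elif text.startswith("%20", pos):
            pos += 3  # encoded space, skip the whole sequence
        else:
            return False  # any other use of % makes the string invalid
        pos = text.find("%", pos)
    return True


print(is_valid("http://www.test.com/?&Name=My%20Name%20Is%20Vader"))
print(is_valid("http://www.test.com/?&Name=My%20Name%20Is%20VadersAccountant%25"))
print(is_valid("%%%25"))
```

Advancing pos past each accepted sequence is what keeps the loop from finding the same % again.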