Very slow RegEx in AHK yet fast in Notepad++ - regex

I'd like to find a certain string in a webpage. I decided to use RegEx. (I know my RegExes are quite terrible, however, they work). My two expressions are very fast when used in Notepad++ (probably < 1s) and on Regex101, but they are horribly slow when used in AutoHotKey – about 2-5 minutes. How do I fix this?
sWindowInfo2 = http://www.archiwum.wyborcza.pl/Archiwum/1,0,4583161,20060208LU-DLO,Dzis_bedzie_Piast,.html
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", sWindowInfo2, false ), whr.Send()
whr.ResponseText
sPage := ""
sPage := whr.ResponseText
; get city name (if exists) – the following is very slooooow
if RegExMatch(sPage, "[\s\S]+<dzial>Gazeta\s(.+)<\/dzial>[\s\S]+")
{
sCity := RegExReplace(sPage, "[\s\S]+<dzial>Gazeta\s(.+)<\/dzial>[\s\S]+", "$1")
;MsgBox, % sCity
city := 1
}
if RegExMatch(sPage, "[\s\S]+<metryczka>GW\s(.+)\snr[\s\S]+")
{
sCity := RegExReplace(sPage, "[\s\S]+<metryczka>GW\s(.+)\snr[\s\S]+", "$1")
city := 1
}
EDIT:
In the page I provided the match is Lublin. Have a look at: https://regex101.com/r/qJ2pF8/1

You do not need to use RegExReplace to get the captured value. As per reference, you can pass the 3rd var into RegExMatch:
OutputVar
OutputVar is the unquoted name of a variable in which to store a match object, which can be used to retrieve the position, length and value of the overall match and of each captured subpattern, if any are present.
So, use a much simpler pattern:
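FoundPos := RegExMatch(sPage, "<metryczka>GW\s(.+)\snr", SubPat)
It returns the position of the match. In AutoHotkey v1.1 (default mode) the captured "Lublin" is stored in the pseudo-array variable SubPat1; if you add the O) option to the pattern, or use v2 where a match object is always returned, read it as SubPat[1].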
With this pattern, you avoid the heavy backtracking caused by [\s\S]+<metryczka>GW\s(.+)\snr[\s\S]+: the first [\s\S]+ matches up to the end of the string and then backtracks one character at a time to accommodate the subsequent subpatterns. The longer the string, the slower the operation.
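As a language-neutral illustration of the same idea, here is a minimal Python sketch (Python stands in for AutoHotkey here). The `page` fragment and the `find_city` helper are made up for the example and are not taken from the real page. It shows that an unanchored search with a capture group returns the same value as the padded pattern, without the leading and trailing [\s\S]+.

```python
import re

# Hypothetical fragment standing in for the downloaded page (not the real HTML).
page = "<html>\n<metryczka>GW Lublin nr 33, wydanie z dnia 08/02/2006</metryczka>\n</html>"

def find_city(text):
    """Return the captured city name, or None when the tag is absent."""
    # An unanchored search: the engine scans forward for the literal tag
    # instead of swallowing the whole page first and backtracking.
    match = re.search(r"<metryczka>GW\s(.+)\snr", text)
    return match.group(1) if match else None

# The original padded pattern yields the same capture, only with more work.
padded = re.match(r"[\s\S]+<metryczka>GW\s(.+)\snr[\s\S]+", page)

print(find_city(page))
print(padded.group(1))
```

Both calls capture the same text; only the amount of backtracking differs, and that difference grows with the size of the page.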

Related

Regex to insert space with certain characters but avoid date and time

I made a regex which inserts a space wherever any of the characters
-:\*_/;, is present. For example, JET*AIRWAYS\INDIA/858701/IDBI 05/05/05;05:05:05 a/c should become JET* AIRWAYS\ INDIA/ 858701/ IDBI 05/05/05; 05:05:05 a/c
The regex I used is (?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)
I have added some word exceptions like a/c, w/d, etc. The \D conditions are there to keep date/time values from being separated, but this created an issue: numbers followed by the above-mentioned characters never get split.
My requirement is
1. Insert a space after the characters -:\*_/;,
2. Do not split dates and times, which may contain / or :
3. Allow exceptions for words like a/c and w/d
The following is the full code
Private Function formatColon(oldString As String) As String
Dim reg As New RegExp: reg.Global = True: reg.Pattern = "(?!a\/c|w\/d|m\/s|s\/w|m\/o)(\D-|\D:|\D\*|\D_|\D\\|\D\/|\D\;)" '"(\D:|\D/|\D-|^w/d)"
Dim newString As String: newString = reg.Replace(oldString, "$1 ")
formatColon = XtraspaceKill(newString)
End Function
I would use 3 replacements.
Replace all date and time special characters with a special macro that should never be found in your text, e.g. for 05/15/2018 4:06 PM, something based on your name:
05MANUMOHANSLASH15MANUMOHANSLASH2018 4MANUMOHANCOLON06 PM
You can encode exceptions too, like this:
aMANUMOHANSLASHc
Now run your original regex to replace all special characters.
Finally, unreplace the macros MANUMOHANSLASH and MANUMOHANCOLON.
Meanwhile, let me tell you why this is complicated in a single regex.
If trying to do this in a single regex, you have to ask, for each / or :, "Am I a part of a date or time?"
To answer that, you need lookahead and lookbehind assertions. Check that your engine supports both: as far as I know, the VBScript RegExp object used in the question handles lookahead but not lookbehind.
But given a /, you don't know if you're between the first and second, or second and third parts of the date. Similar for time.
The number of cases you need to consider will render your regex unmaintainably complex.
So please just use a few separate replacements :-)
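The three replacements described above can be sketched as follows. This is a Python illustration, not VBA, and it rests on assumptions: the NUL-delimited placeholders, the `insert_spaces` name, and the "digit, separator, digit" heuristic for spotting dates and times are all choices made for this example. The heuristic will also protect something like 12/34 that is not a date.

```python
import re

# Placeholders that should never occur in real text (an assumption).
SLASH = "\x00SLASH\x00"
COLON = "\x00COLON\x00"

# Word exceptions listed in the question.
EXCEPTIONS = re.compile(r"\b(?:a/c|w/d|m/s|s/w|m/o)\b")
# One of the special characters, not already followed by whitespace or the end.
SEPARATORS = re.compile(r"([-:*_\\/;,])(?!\s|$)")

def insert_spaces(text):
    # Step 1: hide separators that sit between two digits (dates, times)...
    text = re.sub(r"(?<=\d)/(?=\d)", SLASH, text)
    text = re.sub(r"(?<=\d):(?=\d)", COLON, text)
    # ...and the slash inside each excepted word.
    text = EXCEPTIONS.sub(lambda m: m.group(0).replace("/", SLASH), text)
    # Step 2: add a space after every remaining special character.
    text = SEPARATORS.sub(r"\1 ", text)
    # Step 3: restore the hidden separators.
    return text.replace(SLASH, "/").replace(COLON, ":")

print(insert_spaces("JET*AIRWAYS\\INDIA/858701/IDBI 05/05/05;05:05:05 a/c"))
```

Each pass stays simple enough to read on its own, which is the point of splitting the work instead of packing every case into one pattern.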

How to Find a Time value (anything like #:##) in a string

Looking for a way to find anything that looks like a time value, such as 1:00 or 2:30 anywhere in a given string. I'd rather not scan the whole string for
If String.Mid(myString,i,4) Like "#:##" Then ...
if there is a better way to accomplish the same thing.
An occasional false positive is okay, so if 0:99 gets identified as a time value, there is no harm in that, and finding the 2:00 part of the time value 12:00 is fine too: pointing at the character 2 instead of the character 1 causes no problems. For this application, finding separators other than the colon isn't needed.
Is a RegEx the best way to search for this sort of pattern, or is another approach more efficient?
Thanks!
A RegEx is probably the most straightforward solution for what you described.
Dim stringToMatch = "The time is 1:00 or maybe 13:01 or possibly 27:03 or 4:99 or part of 103:17, but not 22:7"
Dim matcher = New Regex("[0-9]{1,2}:[0-9]{2}")
Dim matches = matcher.Matches(stringToMatch)
For Each match As Match In matches
Console.WriteLine("Found match {0} at position {1}", match.Value, match.Index)
Next match
From there, it's simple to alter the RegEx pattern to better suit your needs, or to examine the Match objects to determine what was matched, at what index in the original string.
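For comparison, the same search can be sketched in Python (used here only as an illustration; `find_times` is a hypothetical helper name). It returns each match together with its index, mirroring match.Value and match.Index above.

```python
import re

TIME_LIKE = re.compile(r"[0-9]{1,2}:[0-9]{2}")

def find_times(text):
    """Return (value, index) pairs for everything that looks like #:## or ##:##."""
    return [(m.group(), m.start()) for m in TIME_LIKE.finditer(text)]

sample = ("The time is 1:00 or maybe 13:01 or possibly 27:03 or 4:99 "
          "or part of 103:17, but not 22:7")
for value, index in find_times(sample):
    print("Found match {0} at position {1}".format(value, index))
```

As the question allows, false positives such as 4:99 and the 03:17 inside 103:17 are reported, while 22:7 is skipped because only one digit follows the colon.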

Fine-tuning Go regular expressions

I am working with some regular expressions in Go, and it's not a straightforward process: it takes time to work through and understand the examples I've found while skimming the manual. Any input on refining the following would be appreciated to speed up the process.
// {aa,bb,cc,dd, etc.}, {a+,b+,c+}
regexp.MustCompile(`\B\{([\w-]+)(.*,)([\w-]+)(?:,[\w-]+=[\w-]+)*\}`)
// above captures {a+, b+, c}, but not {a+,b+,c+}
// {1-9}, {1-9,10,19,20-52}
regexp.MustCompile(`\B\{([\d]?)-([\d]?)(?:,[\d]?=[\d]?)*\}`)
// the first is fine, but no idea on how to do the latter, i.e. multiple ranges that might have free standing addons, tricky, maybe beyond a regexp
// {a-f}, {A-F}, {x-C}
regexp.MustCompile(`\B\{([a-zA-Z]?)-([a-zA-Z]?)(?:,[a-zA-Z]?=[a-zA-Z]?)*\}`)
I'm not sure I need the (?: part; it is something I found. I just need to recognize separate instances of the sequences above (comma-separated list, number range, character range) enclosed in {} in the text I'm parsing.
The problem is easier to tackle if you break the parsing down into steps. You could start with something like:
http://play.golang.org/p/ugqMmaeKEs
s := "{aa,bb,cc, dd}, {a+,\tb+,c+}, {1-9}, {1-9,10,19,20-52}, {a-f}, {A-F}, {x-C}, {}"
// finds the groups of characters captured in {}
matchGroups := regexp.MustCompile(`\{(.+?)\}`)
// splits each captured group of characters
splitParts := regexp.MustCompile(`\s*,\s*`)
parts := [][]string{}
for _, match := range matchGroups.FindAllStringSubmatch(s, -1) {
parts = append(parts, splitParts.Split(match[1], -1))
}
Now you have all the parts grouped into slices, and can parse the syntax of the individual pieces without having to match all combinations via a regex.
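To show what parsing the individual pieces might look like, here is a small Python sketch of the same two-step split followed by a per-piece check. The `classify` step and its range pattern are assumptions added for illustration. They are not part of the Go snippet above.

```python
import re

GROUP = re.compile(r"\{(.+?)\}")        # contents of each {...}
SEPARATOR = re.compile(r"\s*,\s*")      # comma with optional whitespace
RANGE = re.compile(r"(\d+|[A-Za-z])-(\d+|[A-Za-z])")  # 1-9, 20-52, a-f, x-C

def parse_groups(text):
    """Split the text into groups, and each group into its parts."""
    return [SEPARATOR.split(body) for body in GROUP.findall(text)]

def classify(part):
    """Label one part as a range or a plain item."""
    m = RANGE.fullmatch(part)
    if m:
        return ("range", m.group(1), m.group(2))
    return ("item", part)

s = "{aa,bb,cc, dd}, {a+,\tb+,c+}, {1-9}, {1-9,10,19,20-52}, {a-f}, {A-F}, {x-C}, {}"
for parts in parse_groups(s):
    print([classify(p) for p in parts])
```

Because each part is checked on its own, mixed groups such as {1-9,10,19,20-52} need no special pattern: every element is either a range or a plain item.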

PL/SQL negative lookbehind regex

I'm trying to implement a regular expression in PL/SQL which excludes any results that are preceded by a given string.
data:
exclude this: 3
include this: 3
3
cvxcvxcv3
34edfgdsfg3
Using this regexp:
(?<!exclude this: )3\d{0}(\s|$)
What I would expect to be returned is:
exclude this: 3 <-- nothing
include this: 3 <- 3
3 <- 3
cvxcvxcv3 <- 3
34edfgdsfg3 <- the second 3 only
34edfgdsfg33 <- the last 3 only
This works fine when tested in Notepad++; however, it isn't working when implemented in PL/SQL. Looking at similar questions, it appears that PL/SQL doesn't fully support negative lookbehind, but does anyone know of a similar construct or a way to work around this?
While I am not aware of any general technique to emulate negative lookbehind in PL/SQL regexes, a solution is possible in your particular case:
([^e].{13}|[^x].{12}|[^c].{11}|[^l].{10}|[^u].{9}|[^d].{8}|[^e].{7}|[^ ].{6}|[^t].{5}|[^h].{4}|[^i].{3}|[^s].{2}|[^:].|[^ ]|^)3[^0-9]?(\s|$)
The negative lookbehind applies to a literal. Therefore, all forbidden prefixes of the first character that must match are known beforehand, as are their lengths. This allows for a compact (well ...) specification as a regex that must match.
Not that I would recommend that for best practice ... or any practice at all ...
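The construction can be generated mechanically from the literal. The Python sketch below is an illustration, not PL/SQL. It builds one branch per prefix character and compares the result with a real negative lookbehind. The `without_prefix` name is hypothetical, and the last branch, `^.{0,n-1}`, is a variation of my own: it also accepts lines where fewer characters than the full prefix precede the digit, which the single `^` branch above only covers for zero characters.

```python
import re

def without_prefix(prefix):
    """Build a consuming sub-pattern that matches wherever `prefix` does NOT end."""
    n = len(prefix)
    branches = []
    for i, ch in enumerate(prefix):
        # Character i of the prefix is wrong; the n-i-1 characters after it may be anything.
        branches.append("[^%s].{%d}" % (re.escape(ch), n - i - 1))
    # Fewer than n characters precede the match position.
    branches.append("^.{0,%d}" % (n - 1))
    return "(?:" + "|".join(branches) + ")"

emulated = re.compile(without_prefix("exclude this: ") + r"3(?:\s|$)")
real = re.compile(r"(?<!exclude this: )3(?:\s|$)")

for line in ["exclude this: 3", "include this: 3", "3", "cvxcvxcv3", "34edfgdsfg3"]:
    print(repr(line), bool(emulated.search(line)), bool(real.search(line)))
```

Unlike a lookbehind, the emulated prefix is part of the match, so the position of the digit has to be derived from the match afterwards.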
Update (processing advice):
The regex as it stands identifies matches without providing any further information for postprocessing. However, you can identify the offset of the match and the length of the forbidden prefix with the following code:
DECLARE
s_data VARCHAR2(4000); -- This will contain the line you match against
s_matchpos BINARY_INTEGER; -- Offset of the 'interesting' part (digit '3' under the various constraints) in s_data
s_prefix VARCHAR2(100); -- The prefix part of the match
s_re VARCHAR2(4000); -- The regex
BEGIN
s_re := '([^e].{13}|[^x].{12}|[^c].{11}|[^l].{10}|[^u].{9}|[^d].{8}|[^e].{7}|[^ ].{6}|[^t].{5}|[^h].{4}|[^i].{3}|[^s].{2}|[^:].|[^ ]|^)3[^0-9]?(\s|$)';
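s_prefix := regexp_substr( s_data, s_re, 1, 1, NULL, 1 ); -- first match from offset 1; the last argument selects capture group 1, the prefix (needs Oracle 11g or later)
s_matchpos := regexp_instr( s_data, s_re ) + NVL(length(s_prefix), 0); -- an empty prefix is NULL in Oracle, so treat its length as 0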
END;
As mentioned above, not necessarily to recommend as best practice ...

Is there a RegEx that can parse out the longest list of digits from a string?

I have to parse various strings and determine a prefix, number, and suffix. The problem is the strings can come in a wide variety of formats. The best way for me to think about how to parse it is to find the longest number in the string, then take everything before that as a prefix and everything after that as a suffix.
Some examples:
0001 - No prefix, Number = 0001, No suffix
1-0001 - Prefix = 1-, Number = 0001, No suffix
AAA001 - Prefix = AAA, Number = 001, No suffix
AAA 001.01 - Prefix = AAA , Number = 001, Suffix = .01
1_00001-01 - Prefix = 1_, Number = 00001, Suffix = -01
123AAA 001_01 - Prefix = 123AAA , Number = 001, Suffix = _01
The strings can come with any mixture of prefixes and suffixes, but the key point is the Number portion is always the longest sequential list of digits.
I've tried a variety of RegEx's that work with most but not all of these examples. I might be missing something, or perhaps a RegEx isn't the right way to go in this case?
(The RegEx should be .NET compatible)
UPDATE: For those that are interested, here's the C# code I came up with:
var regex = new System.Text.RegularExpressions.Regex(@"(\d+)");
if (regex.IsMatch(m_Key)) {
    string value = "";
    int length = 0;
    var matches = regex.Matches(m_Key);
    // Name the element type: MatchCollection enumerates as object on older frameworks.
    foreach (System.Text.RegularExpressions.Match match in matches) {
        // >= keeps the last of several equally long digit runs
        if (match.Length >= length) {
            value = match.Value;
            length = match.Length;
        }
    }
    var split = m_Key.Split(new String[] {value}, System.StringSplitOptions.RemoveEmptyEntries);
    m_KeyCounter = value;
    if (split.Length >= 1) m_KeyPrefix = split[0];
    if (split.Length >= 2) m_KeySuffix = split[1];
}
You're right, this problem can't be solved purely by regular expressions. You can use regexes to "tokenize" (lexically analyze) the input but after that you'll need further processing (parsing).
So in this case I would tokenize the input with (for example) a simple regular expression search (\d+) and then process the tokens (parse). That would involve seeing if the current token is longer than the tokens seen before it.
To gain more understanding of the class of problems regular expressions "solve" and when parsing is needed, you might want to check out general compiler theory, specifically when regexes are used in the construction of a compiler (e.g. http://en.wikipedia.org/wiki/Book:Compiler_construction).
Your input isn't regular, so a regex alone won't do. I would iterate over all groups of digits via (\d+), find the longest, and then build a new regex of the form (.*)<number>(.*) to find your prefix and suffix.
Or, if you're comfortable with string operations, you can probably just find the start and end of the target group and use substr to get the prefix and suffix.
I don't think you can do this with one regex. I would find all digit sequences within the string (probably with a regex) and then I would select the longest with .NET code, and call Split().
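The "find every digit run, keep the longest, then cut around it" approach suggested in these answers can be sketched as follows. This is a Python illustration rather than .NET, and `split_key` is a hypothetical name. Ties go to the later run, which is what the example 123AAA 001_01 in the question expects. Slicing by the match position avoids problems when the number text also occurs elsewhere in the string.

```python
import re

def split_key(key):
    """Return (prefix, number, suffix) around the longest run of digits."""
    best = None
    for m in re.finditer(r"\d+", key):
        # >= keeps the later run when two runs have the same length.
        if best is None or len(m.group()) >= len(best.group()):
            best = m
    if best is None:
        return key, "", ""  # no digits at all: everything is prefix
    return key[:best.start()], best.group(), key[best.end():]

for key in ["0001", "1-0001", "AAA001", "AAA 001.01", "1_00001-01", "123AAA 001_01"]:
    print(key, "->", split_key(key))
```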
This depends entirely on your Regexp engine. Check your Regexp environment for capturing, there might be something in it like the automatic variables in Perl.
OK, let's talk about your question:
Keep in mind that quantifiers in almost every regex engine, NFA or DFA, are greedy by default. This means that (\d+) consumes the whole run of digits once it "stumbles" over one.
Now, from what I can tell from your examples, you always need the number in the middle of the string, so try this:
/^(.*\D)?(\d+)(\D.*)?$/ig
Then look at the variables $1, $2, and $3. Not all of them will be set: groups are numbered by position, so $2 always holds the number, while $1 (prefix) and $3 (suffix) are only set when that part is present. Be aware that because (.*\D)? is greedy, $2 ends up being the last run of digits in the string, so check that it is also the longest one.
The idea is to make the engine "stumble" over the first few characters and start matching a long number in the middle.
Since the modifier /g is present, you can loop through all the matches that the engine finds and then simply take the one you like most.
This example is in PCRE, but I'm sure .NET has a compatible mode.