Is boost::regex block size limited? - c++

I have a quite big text file to parse with boost::regex. To make the process easier, I decided to first split the big file into blocks and then parse each block.
I use the following regex for that:
FIRST1.*?FIRST2.*?FIRST3((.*?\r*\n*)*)LAST1.*?LAST2.*?LAST3
It allows me to receive everything I want between "FIRST1 FIRST2 FIRST3" and "LAST1 LAST2 LAST3".
Between them there are many lines with a lot of text (more than 20 000 bytes), and it doesn't work. If I split the text into 2 parts (part1 ~10 000 bytes and part2 ~10 000 bytes) and try this regular expression with:
FIRSTS part1 LASTS - everything parsing well
FIRSTS part2 LASTS - everything parsing well
FIRSTS part1part2 LASTS - breaks.
I thought it might be a boost::regex limitation and tried it here: online regex; the result is still the same.
It looks like part1part2 is too big for the regex to return as one block - is that true? Is there a size limit for regex, or am I just messing something up?
UPD:
I also found the maximum size. It finds the substring when it spans characters [106-12131], but if I add even one character anywhere in the substring, it can no longer find it. So the limit appears to be 12025 characters.

You probably should not be using regex here.
I'd show you the Spirit way to do this efficiently, but you don't show relevant code, so I'll wait.
That said, at least make the groups non-capturing (e.g. ((.*?\r*\n*)*) here) and consider using cmatch instead of smatch (docs).
Oh, this might be a case of catastrophic backtracking:
((.*?\r*\n*)*)
Try something like this:
(.+?[\r\n]+)*
Make it non-capturing too:
(?:.+?[\r\n]+)*
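A quick way to see why this helps, sketched in Python (which shares Boost's default Perl-style syntax; putting the markers on their own lines is an assumption made for the demo): the nested quantifier (.*?\r*\n*)* gives the engine exponentially many ways to carve up the body, while (?:.+?[\r\n]+)* ends every repetition at an unambiguous newline run, so even a multi-kilobyte block matches quickly.

```python
import re

# ~40 KB of made-up payload lines between the FIRST/LAST markers.
body = ("x" * 80 + "\n") * 500
text = "FIRST1 FIRST2 FIRST3\n" + body + "LAST1 LAST2 LAST3"

# Same structure as the question's pattern, with the fixed inner group.
fixed = re.compile(r"FIRST1.*?FIRST2.*?FIRST3\n((?:.+?[\r\n]+)*)LAST1.*?LAST2.*?LAST3")
m = fixed.search(text)
assert m is not None and len(m.group(1)) == len(body)  # whole body captured
```

The nested-quantifier version is the one that hits the ~12 KB wall described in the question, per the diagnosis above.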

Related

Can regex be used to find this pattern?

I need to parse a large amount of data in a log file. Ideally, I would do this by splitting the file into a list where each entry in the list is an individual log entry.
Every time a log entry is made it is prefixed with a string following this pattern:
"4404: 21:42:07.433 - After this point there could be anything (including new line characters and such). However, as soon as the prefix repeats that indicates a new log entry."
4404 can be any number, but it is always followed by a :.
21:42:07.433 is the timestamp: 21 hours, 42 minutes, 7 seconds, 433 milliseconds.
I don't know much about regex, but is it possible to identify this pattern using it?
I figured something like this would work...
"*: [0-24]:[0:60]:[0:60].[0-1000] - *"
However, it just throws an exception, and I fear I'm not on the right track at all.
List<string> split_content = Regex.Matches(file_content, @"*: [0-24]:[0:60]:[0:60].[0-1000] - *").Cast<Match>().Select(m => m.Value).ToList();
The following expression would split a string according to your pattern:
\d+: \d{2}:\d{2}:\d{2}\.\d{3}
Add a ^ at the beginning if your delimiting string always starts a line (and use the m flag for the regex). Capturing the log chunks with a regex would be more elaborate; I'd suggest just splitting (with Regex.Split) if you have your log content in memory all at once.
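To illustrate in Python (the answer is about C#'s Regex.Split; re.split is the closest equivalent, and the log lines below are invented): splitting on a lookahead keeps each prefix attached to its own entry.

```python
import re

# Invented log: two entries, the second spanning multiple lines.
log = ("4404: 21:42:07.433 - first entry\n"
       "512: 21:42:09.001 - second entry\n"
       "continuation of the second entry\n")

# Split *before* each "number: hh:mm:ss.mmm" prefix; re.M makes ^ anchor
# at every line start, and the lookahead keeps the prefix in its chunk.
pattern = r"(?=^\d+: \d{2}:\d{2}:\d{2}\.\d{3})"
entries = [e for e in re.split(pattern, log, flags=re.M) if e]
```

Splitting on a zero-width match like this requires Python 3.7+.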

regex search window size

I have a large document that I am running a regex on. Below is an example of a similar expression:
(?=( aExample| bExample)(?=.*(XX))(?=.*(P1)))
This works a lot of the time, but sometimes, due to other text within the document, the condition is met by looking across the entire document, e.g., there might be 10 characters between "aExample" and "XX", but 1,000 characters between "XX" and "P1". I would like to constrain the expression to N characters (let's say 50 for the sake of the example) so that the regex is a little more conservative. How can I go about reducing the size of the regex's search window to N characters instead of the entire string/document? Any help is appreciated. Thanks!
(?=( aExample| bExample)(?=.{1,50}(XX))(?=.{1,50}(P1)))
You want to limit the number of characters each . can consume, so you can just use braces (bounded quantifiers).
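A quick check of the bounded version in Python (same lookaround syntax; the test strings are made up):

```python
import re

# .{1,50} caps how far each lookahead may scan past the matched keyword.
pat = re.compile(r"(?=( aExample| bExample)(?=.{1,50}(XX))(?=.{1,50}(P1)))")

near = " aExample 123 XX 456 P1"                    # everything close by
far = " aExample 123 XX " + "x" * 1000 + " P1"      # P1 over 50 chars away

assert pat.search(near) is not None
assert pat.search(far) is None
```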

How to programmatically learn regexes?

My question is a continuation of this one. Basically, I have a table of words like so:
HAT18178_890909.098070313.1
HAT18178_890909.098070313.2
HAT18178_890909.143412462.1
HAT18178_890909.143412462.2
For my purposes, I do not need the terminal .1 or .2 for this set of names. I can manually write the following regex (using Python syntax):
r = re.compile(r'(.*\.\d+)\.\d+')
However, I cannot guarantee that my next set of names will have a similar structure where the final 2 characters are discardable - it could be 3 characters (e.g. .12), and the separator could change as well (e.g. . to _).
What is the appropriate way to either explicitly learn a regex or to determine which characters are unnecessary?
It's an interesting problem.
X                            y
HAT18178_890909.098070313.1  HAT18178_890909.098070313
HAT18178_890909.098070313.2  HAT18178_890909.098070313
HAT18178_890909.143412462.1  HAT18178_890909.143412462
HAT18178_890909.143412462.2  HAT18178_890909.143412462
The problem is that there is not a single solution but many.
Even for a human it is not clear what the regex should be that you want.
Based on this data, I would think the possibilities to learn are:
Just match a fixed width of 25: .{25}
Fixed first part: HAT18178_890909.
Then:
Each varying position takes only 2 distinct digits (since you show only 2 cases).
So e.g. [01] (either 0 or 1) for the first spot, [94] for the next, and so on, would be a valid solution.
The obvious one would be \d+
But it could also be \d{9}
You see, there are multiple correct answers.
These regexes would still work if the second dot were an underscore instead.
My conclusion:
The problem is that preparing the data for machine learning is much more work than creating a regex by hand. And if you want to be sure you cover everything, you need complete data anyway, so a handwritten regex is probably less effort.
You could split on non-alphanumeric characters:
[^a-zA-Z0-9']+
That would get you, in this case, a few strings like these:
HAT18178
890909
098070313
1
From there you can simply discard the last one if it's never needed, and continue processing the first sequences.
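In Python, that split looks like this (using the answer's character class verbatim on one of the example names):

```python
import re

# Split one of the example names on runs of non-alphanumeric characters.
parts = [p for p in re.split(r"[^a-zA-Z0-9']+", "HAT18178_890909.098070313.1") if p]
assert parts == ['HAT18178', '890909', '098070313', '1']

trimmed = parts[:-1]   # drop the trailing counter that isn't needed
```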

Fastest Way in vbscript to Check if a String Contains a Word/Phrase from a List of Many Words/Phrases

I am implementing a function that checks a blurb (e.g. a message, forum post, etc.) against a (potentially long) list of banned words/phrases, and simply returns true if one or more of the words is found in the blurb, and false if not.
This is to be done in vbScript.
The old developer's current code is a very large If statement using InStr(), e.g.:
If instr(ucase(contactname), "KORS") > 0 OR _
instr(ucase(contactname), "D&G") > 0 OR _
instr(ucase(contactname), "DOLCE") > 0 OR _
instr(ucase(contactname), "GABBANA") > 0 OR _
instr(ucase(contactname), "TIFFANY") > 0 OR _
'...
Then
I am trying to decide between two solutions to replace the above code:
Using regular expression to find matches, where the regex would be a simple (but potentially long) regex like this: "KORS|D&G|DOLCE|GABBANA|TIFFANY" and so on, and we would do a regular expression test to return true if any one or more of the words is found.
Using an array where each array item contains a banned word, and loop through each array item checking it against the blurb. Once a match is found the loop would terminate and a variable would be set to TRUE, etc.
It seems to me that the regular expression option is best, since it is one "check", i.e. the blurb tested against the pattern. But I am wondering whether the potentially very long regex pattern would add enough processing overhead to negate the simplicity and benefit of doing one "check" vs. the many "checks" of the array-looping scenario.
I am also open to additional options which I may have overlooked.
Thanks in advance.
EDIT - to clarify, this is for a SINGLE test of one "blurb" e.g. a comment, a forum post, etc. against the banned word list. It only runs one time during a web request. The benchmarking should test size of the word list and NOT the number of executions of the use case.
You could create a string that contains all of your words. Surround each word with a delimiter.
Const TEST_WORDS = "|KORS|D&G|DOLCE|GABBANA|TIFFANY|"
Then, test to see if your word (plus delimiter) is contained within this string:
If InStr(1, TEST_WORDS, "|" & contactname & "|", vbTextCompare) > 0 Then
' Found word
End If
No need for array loops or regular expressions.
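The same trick sketched in Python (note it tests whether the value is exactly one of the listed words, mirroring the whole-value InStr lookup above; upper() stands in for vbTextCompare's case-insensitivity):

```python
# Delimited word list: surrounding each lookup with "|" prevents partial
# matches such as "KOR" or "ANA" from slipping through.
TEST_WORDS = "|KORS|D&G|DOLCE|GABBANA|TIFFANY|"

def is_banned(contactname):
    return ("|" + contactname.upper() + "|") in TEST_WORDS

assert is_banned("Dolce")
assert not is_banned("CHANEL")
```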
It seems to me (without checking) that such a complex regexp would be slower, and evaluating such a complex Or statement would also be slow, because VBScript evaluates all the alternatives (it does not short-circuit). Must all alternatives be evaluated to know the expression's value? Of course not - but VBScript evaluates them anyway.
What I would do is populate an array with the banned words and then iterate through it, checking whether each word occurs in the text being searched - and stop iterating as soon as a word is found.
You could store the most 'popular' banned words at the top of the array (some kind of rank), so you would be likely to find them within the first few steps.
Another benefit of using an array is that its values are easier to manage than values hardcoded inside an If statement.
I just tested 1 000 000 checks with a regexp ("word|anotherword") vs. InStr for each word, and it seems I was wrong.
The regex check took 13 seconds, while InStr took 71 seconds.
Edit: checking each word separately with a regexp took 78 seconds.
Still, I think that if you have many banned words, checking them one by one and breaking as soon as one is found could be faster (after this last test, I would consider joining them in groups of 5 or 10 and checking a less complex regexp each time).
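For comparison, here is a rough Python analogue of the two strategies from the benchmark (the VBScript timings above won't carry over, but the two checks should always agree):

```python
import re

banned = ["KORS", "D&G", "DOLCE", "GABBANA", "TIFFANY"]

# Strategy 1: one alternation regex (escaping each word keeps arbitrary
# word lists safe even if they contain regex metacharacters).
alternation = re.compile("|".join(re.escape(w) for w in banned), re.IGNORECASE)

def check_regex(blurb):
    return alternation.search(blurb) is not None

# Strategy 2: loop over the words, stopping at the first hit.
def check_loop(blurb):
    upper = blurb.upper()
    return any(w in upper for w in banned)

msg = "Nothing objectionable here except TIFFANY lamps."
assert check_regex(msg) and check_loop(msg)
assert not check_regex("a perfectly clean message")
```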

Specific regex to detect error string

I am parsing a text log, where each line contains an id closed in parenthesis and one or more (possibly hundreds) chunks of data (alphanumeric, always 20 chars), such as this:
id=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)
id=(702832), data1=(Ba6FGoP5Dzxwmb6JhJ5a)
At this point in the program I am not interested in the data, just in quickly fetching all the ids. The problem is that, due to a noisy communication channel, an error may appear, denoted by the string Error, which can be anywhere on the line. The goal is to ignore these lines.
What worked for me so far was a simple negative lookahead:
^id=\((\d+)\),(?!.*Error)
But I forgot that there is some tiny probability that this Error string may actually appear as a valid sequence of characters somewhere in the data, which has just now backfired on me.
The only way to distinguish between valid and invalid appearance of the Error string in the data chunk is to check for the length. If it's 20 characters, then it was this rare valid occurrence and I want to keep it (unless the Error is elsewhere on the line), if it's longer, I want to discard the line.
Is it still possible to treat this situation with a regular expression or is it already too much for the regex monster?
Thanks a lot.
Edit: Adding examples of error lines - all these should be ignored.
iErrord=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)
id=(7028Error32), data1=(Ba6FGoP5Dzxwmb6JhJ5a)
id=(702833), daErrorta1=(hF6eDpLxbnFS5PfKaCds)
id=(702834), data1=(bx5EsH7BCsk6dMzpQDErrorKA)
However, this one should not be ignored - the Error is just incidentally contained in the data part - but it currently is ignored:
id=(702834), data1=(bx5EsH6dMzpQDErrorKA)
Alright, it's not exactly what you were thinking of, but here's a suggestion:
Can't you simply match the lines that follow the pattern, undisturbed by an Error somewhere?
Here's the regexp that'll do it:
^id=\((\d+)\), (data\d+=\([a-zA-Z\d]{20}\)(, )?)+$
If Error is anywhere on the line (except in the middle of a chunk of data), the regexp will not match, so you get the wanted result: the line is ignored.
If this doesn't please you, you have to add more lookahead and lookbehind groups. I'll try to do that and edit if I write a good regexp.
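Checked in Python (whose quantifier syntax agrees here) against the question's sample lines, this whole-line approach keeps the valid line and rejects the corrupted ones:

```python
import re

# A line is kept only if it is fully well-formed: an id, then one or more
# 20-character data chunks; any corruption makes the whole line fail.
pat = re.compile(r"^id=\((\d+)\), (data\d+=\([a-zA-Z\d]{20}\)(, )?)+$")

good = [
    "id=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)",
    "id=(702834), data1=(bx5EsH6dMzpQDErrorKA)",   # Error here is valid data
]
bad = [
    "iErrord=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)",
    "id=(7028Error32), data1=(Ba6FGoP5Dzxwmb6JhJ5a)",
    "id=(702833), daErrorta1=(hF6eDpLxbnFS5PfKaCds)",
    "id=(702834), data1=(bx5EsH7BCsk6dMzpQDErrorKA)",  # 25-character chunk
]

assert all(pat.match(line) for line in good)
assert not any(pat.match(line) for line in bad)
```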
Since your chunks of data are always 20 characters long, a chunk of 25 characters means there is an error in it. Therefore you could check whether there is a chunk of that length, and also whether Error appears outside of parentheses. If either is true, you shouldn't match the line; otherwise it is valid.
Something like
(?![^)]*Error)id=\((\d+)(?!.*(?:\(.{25}\)|\)[^(]*Error))
might do the trick.
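Sketch-testing that expression in Python against the sample lines from the question (Boost's default Perl syntax and Python's re agree on these lookaround constructs):

```python
import re

# Reject if Error appears before the closing ")" of the id, or if the rest
# of the line contains a 25-char chunk or an Error between chunks.
pat = re.compile(r"(?![^)]*Error)id=\((\d+)(?!.*(?:\(.{25}\)|\)[^(]*Error))")

good = "id=(702834), data1=(bx5EsH6dMzpQDErrorKA)"
bad = [
    "iErrord=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)",
    "id=(7028Error32), data1=(Ba6FGoP5Dzxwmb6JhJ5a)",
    "id=(702833), daErrorta1=(hF6eDpLxbnFS5PfKaCds)",
    "id=(702834), data1=(bx5EsH7BCsk6dMzpQDErrorKA)",
]

assert pat.search(good).group(1) == "702834"
assert not any(pat.search(line) for line in bad)
```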