Regex with nested repetition - regex

I'm trying to create a regex in Go that matches up to 50 words separated by white space, where each word is 1-32 "a"s.
I'm using the following regex:
regexp.Compile(`^(a{1,32}\s?){1,50}$`)
and I am getting the following error:
error parsing regexp: invalid repeat count: `{1,50}`
I've noticed that it does work with up to 31 repetitions, like so:
r, err := regexp.Compile(`^(a{1,32}\s?){1,31}$`)
see https://go.dev/play/p/RLnroX9-57_m

Go's regexp engine limits nested repetition: the outer repeat count multiplied by any inner repeat counts must not exceed 1000 copies of the innermost repeated part. This is documented in the RE2 syntax spec.
In your case, up to 31 works because inner 32 * outer 31 = 992. Both 32 * 32 = 1024 and 32 * 50 = 1600 exceed the limit, so they are rejected.
A workaround is to split the expression into multiple parts, each of which stays under the limit: ^(a{1,32}\s?){1,31}(a{1,32}\s?){0,19}$
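The arithmetic behind that split can be sketched as follows. This is only an illustration in Python: the helper name split_repeat and its parameters are made up for this example, and Python's re module (which does not impose the 1000-copy limit) is used just to check that the split pattern accepts the same inputs.

```python
import re

def split_repeat(unit, outer_max, inner_max, limit=1000):
    """Build ^(unit){1,k}(unit){0,k}...$ with inner_max * k <= limit per part.

    Hypothetical helper: 'limit' mirrors the 1000-copy restriction described above.
    """
    per_part = limit // inner_max      # largest outer count allowed in one part
    parts = []
    remaining = outer_max
    while remaining > 0:
        k = min(per_part, remaining)
        low = 1 if not parts else 0    # only the first part is mandatory
        parts.append("(%s){%d,%d}" % (unit, low, k))
        remaining -= k
    return "^" + "".join(parts) + "$"

pattern = split_repeat(r"a{1,32}\s?", outer_max=50, inner_max=32)
print(pattern)

fifty_words = " ".join(["a"] * 50)
fifty_one_words = " ".join(["a"] * 51)
print(bool(re.match(pattern, fifty_words)))      # 50 words are accepted
print(bool(re.match(pattern, fifty_one_words)))  # 51 words are rejected
```

With inner_max=32 the helper allows 1000 // 32 = 31 outer repetitions per part, so 50 is split into 31 + 19, which is the pattern given in the answer.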

Related

PCRE2 - Match every word whose suffix matches a backreference

Given the string below,
ay bee ceefooh deefoo38 ee 37 ef gee38 aitch 38 eye19 jay38 kay 99 el88 em38 en 29 ou38 38 pee 12 q38 arr 999 esss 555
the goal is to match every word such that the suffix is a number that matches the number that appears after foo (which happens to be 38 in this case).
There is only one substring that begins with foo and ends with a number. The expected matches all exist after said substring.
Expected matches:
gee38
jay38
em38
ou38
q38
I've tried foo(\d+).*?(\w+\1)\b and foo(\d+).*(\w+\1)\b, but they fail to match all, because they either match the first one (gee38) or the last one (q38).
Is it possible to match all with just a single regex and, importantly, in just a single run?
The PCRE2 engine that I use behaves in the same way as https://regex101.com/r/uFEDOE/1. So, if the regex can match multiple substrings on regex101, then the engine that I use can too.
(?:foo|\G(?!^))(\d+).*?(?=(\w+))\w+(?=\1\b)
Demo
There may be room for some size or performance optimization.
@Niko Gambt, say if any optimization is important for you.
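For engines that lack \G (Python's re module is one), the same result can be obtained in two steps rather than in a single pattern. This is a different technique from the single-run regex the question asks for, shown only as a fallback sketch; the function name words_with_foo_suffix is made up for this example.

```python
import re

def words_with_foo_suffix(text):
    """Two-step fallback: find the number after 'foo', then the words ending in it."""
    # Step 1: locate the substring that begins with foo and ends with a number.
    m = re.search(r"foo(\d+)\b", text)
    if m is None:
        return []
    number = m.group(1)
    # Step 2: only look after that substring, for words with at least one
    # character in front of the number.
    tail = text[m.end():]
    return re.findall(r"\b\w+" + re.escape(number) + r"\b", tail)

sample = ("ay bee ceefooh deefoo38 ee 37 ef gee38 aitch 38 eye19 jay38 kay 99 "
          "el88 em38 en 29 ou38 38 pee 12 q38 arr 999 esss 555")
print(words_with_foo_suffix(sample))
```

On the sample string this returns gee38, jay38, em38, ou38 and q38; the bare 38 entries are skipped because the pattern requires at least one word character before the number.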

Regular Expression

I'm trying to get the regular expression to work (using jQuery) for a specific pattern I need.
I need following pattern:
The first two characters of the string need to be digits (0-9), with a maximum value of 53. For numbers below 10, a leading 0 is required.
The character at position 3 needs to be a .
The next 4 characters need to be digits (0-9), forming a number between 2010 and 2050.
So strings like 01.2020, 21.2020, or 45.2020 have to match, but 54.2020 or 04.2051 must not.
I tried to write the regex without the min and max requirement first and I'm testing the string using regex101.com but I'm unable to get it to work.
According to the definition, /^[0-9]{2}\.\d[0-9]{4}$/ should allow me to insert strings in the format NN.NNNN.
Thankful for any input.
A two-digit number from 00 to 53 can be matched using this: (?:[0-4][0-9]|5[0-3]) (00 -> 49 or 50 -> 53)
Character on position 3 needs to be a . : you've already got the \.
A number between 2010 and 2050 -> 20(?:[1-4][0-9]|50) (20 followed by either 10 -> 49 or 50)
This gives:
(?:[0-4][0-9]|5[0-3])\.20(?:[1-4][0-9]|50)
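A quick way to check the combined pattern against the examples from the question is sketched below in Python (the helper name is_week_year is made up for this example). The pattern itself carries no ^ and $ anchors, so the sketch uses fullmatch to make sure nothing precedes or follows it. Note that the pattern as given also accepts 00 as the first part; replacing [0-4][0-9] with 0[1-9]|[1-4][0-9] would exclude it, if that is wanted.

```python
import re

# Pattern from the answer; fullmatch() below supplies the anchoring.
WEEK_YEAR = re.compile(r"(?:[0-4][0-9]|5[0-3])\.20(?:[1-4][0-9]|50)")

def is_week_year(s):
    """Return True if s has the form NN.NNNN with NN in 00-53 and NNNN in 2010-2050."""
    return WEEK_YEAR.fullmatch(s) is not None

# Examples taken from the question: the first three must match, the last two must not.
for s in ["01.2020", "21.2020", "45.2020", "54.2020", "04.2051"]:
    print(s, is_week_year(s))
```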

Regular expression to validate 2 character hex string

I have a source of data that was converted from an oracle database and loaded into a hadoop storage point. One of the columns was a BLOB and therefore had lots of control characters and unreadable/undetectable ascii characters outside of the available codeset. I am using Impala to write regex replace function to parse some of the unicode characters that the regex library cannot understand. I would like to remove the offending 2 character hex codes BEFORE I use the unhex query function so that I can do the rest of the regex parsing with a "clean" string.
Here's the code I've used so far, which doesn't quite work:
'[2-7]{1}([A-Fa-f]|[0-9]{1})'
I've determined that I only need to capture \u0020-\u007f, or represented as two-digit hex, 20-7f
If my string looks like this:
010A000000153020405C00000000143020405CBC000000F53320405C4C010000E12F204058540100002D01
I would like to be able to capture 2 characters at a time (e.g. 01, 0A, 00), evaluate whether or not each pair fits the acceptable range of two-digit hex I mentioned above, and return only what is acceptable.
The correct output should be:
30 20 40 5C 30 20 40 5C 33 20 40 5C 4C 2F 20 40 58 and 54
However, my expression finds the first acceptable number in my first range (5) and starts the capture from there which returns the position or indexing wrong for the rest of the string... and this is the return from my expression -
010A0000001**53**0**20****40****5C**000000001**43**0**20****40****5C**BC000000F**53****32**0**40****5C****4C**010000E1**2F****20****40****58****54**010000**2D**01
I just don't know how to evaluate only two characters at a time in a mixed-length string. And, if they don't fit the expression, iterate to the next two characters. But only in two character increments.
My example: https://regex101.com/r/BZL7t0/1
I have added a positive lookbehind to it, which starts at the beginning of the string and then matches 2 characters at a time. This ensures that the group you're matching is always preceded by a whole number of 2-character groups.
Positive lookbehind:
(?<=^(..)*)
Updated regex:
(?<=^(..)*)([2-7]{1}[A-Fa-f0-9]{1})
Preview:
Regex101
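The lookbehind above is variable-length, which not every engine accepts (Python's re module, for one, requires fixed-width lookbehinds). Where it is unavailable, the same pairing can be done by stepping through the string two characters at a time and testing each pair on its own. A minimal sketch in Python, with the helper name printable_pairs made up for this example:

```python
import re

# A pair is acceptable when it lies in 20-7F: first digit 2-7, second any hex digit.
IN_RANGE = re.compile(r"[2-7][0-9A-Fa-f]")

def printable_pairs(hex_string):
    """Split hex_string into 2-character pairs and keep those in the range 20-7F."""
    pairs = [hex_string[i:i + 2] for i in range(0, len(hex_string), 2)]
    return [p for p in pairs if IN_RANGE.fullmatch(p)]

data = ("010A000000153020405C00000000143020405CBC000000F53320405C4C"
        "010000E12F204058540100002D01")
print(" ".join(printable_pairs(data)))
```

Because the pairs are cut at fixed offsets, a digit such as the 5 in 15 can no longer start a match in the middle of a pair. On the sample string the result also includes the trailing 2D, which falls inside the 20-7F range.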

Find repeating gps using regular expression

I work with text files, and I need to be able to see when the gps (last 3 columns of csv) "hangs up" for more than a few lines.
So for example, usually, part of a text file looks like this:
5451,1667,180007,35.7397387,97.8161897,375.8
5448,1053z,180006,35.7397407,97.8161814,375.7
5444,1667,180005,35.7397445,97.8161674,375.6
5439,1668,180004,35.7397483,97.8161526,375.5
5435,1669,180003,35.7397518,97.8161379,375.5
5431,1669,180002,35.7397554,97.8161269,375.6
5426,1054z,180001,35.7397584,97.8161115,375.6
5420,1670,175959,35.7397649,97.8160931,375.9
But sometimes there is an error with the gps and it looks like this:
36859,1598,202603.00,35.8867316,99.2515545,555.700
36859,1598,202608.00,35.8867316,99.2515545,555.700
36859,1142z,202610.00,35.8867316,99.2515545,555.700
36859,1597,202612.00,35.8867316,99.2515545,555.700
36859,1597,202614.00,35.8867316,99.2515545,555.700
36859,1596,202616.00,35.8867316,99.2515545,555.700
36859,1595,202618.00,35.8867316,99.2515545,555.700
I need to be able to figure out a way to search for matching strings of 7 different numbers (the decimal portion of the gps), but so far I've only been able to figure out how to search for repeating or consecutive numbers.
Any ideas?
If you were to find such repetitions in an editor (such as Notepad++), you could use the following regex to find 4 or more repeating lines:
([^,]+(?:,[^,]+){2})\v+(?:(?:[^,]+,){3}\1(?:\v+|$)){3,}
To go a bit into detail:
([^,]+(?:,[^,]+){2})\v+ captures three comma-separated fields as group 1 (one or more non-commas, followed twice by a comma and one or more non-commas), followed by vertical whitespace (a line break) that is not part of the group (e.g. 1,1,1\n)
(?:[^,]+,){3} matches one or more non-commas followed by a comma, three times (the columns that don't need to be considered)
\1 is a backreference to group 1; it matches only if the text is exactly what group 1 captured
(?:\v+|$) matches either more vertical whitespace or the end of the text
{3,} requires 3 or more repetitions; increase it if you want more
Here you can see how it works
However, if you are using a programming language to check this, I wouldn't go down the regex path, as checking for those repetitions can be done a lot more easily. Here is one example in Python; I hope you can adapt it to your needs:
oldcoords = None
repetitions = 0
# Read the file and strip the trailing newline from every line
with open(r'C:\temp\gps.csv') as f:
    lines = [line.rstrip('\n') for line in f]
for line in lines:
    # The gps fields are the last three columns (indexes 3 to 5)
    gpscoords = line.split(',')[3:6]
    if gpscoords == oldcoords:
        repetitions += 1
    else:
        oldcoords = gpscoords
        repetitions = 0
    if repetitions == 4:  # or however you define more than a few
        print(', '.join(gpscoords) + ' is repeated')
If you can use Perl, and if I understood you correctly, this should do the trick:
perl -ne 'm/^[^,]*,[^,]*,[^,]*,([^,]*,[^,]*,[^,]*$)/g; $current_line=$1; ++$line_number; if ($prev_line eq $current_line){$equals++} else {if ($equals>=6){ print "Last three fields in lines ".($line_number-$equals-1)." to ".($line_number-1)." are equals to:\n$prev_line" } ; $equals=0}; $prev_line=$current_line' < onlyreplacethiswithyourfilepath
Sample output:
Last three fields in lines 1 to 7 are equals to:
35.8867316,99.2515545,555.700
Last three fields in lines 16 to 22 are equals to:
37.8782116,99.7825545,572.810
Last three fields in lines 31 to 44 are equals to:
36.6868916,77.2594245,581.358
Last three fields in lines 57 to 63 are equals to:
35.5128764,71.2874545,575.631

Reg-ex, Find x then N characters if N+1 == x

OK here is what I have:
(24(?:(?!24).)*)
It works in that it finds from 24 until the next 24, but it does not include the 2nd 24... (wow, some logic).
like this:
23252882240013152986400000006090000000787865670000004524232528822400513152986240013152986543530000452400
it finds from the 1st 24 till the next 24 but does not include it, so the strings it finds are:
23252882 - 2400131529864000000060900000007878656700000045 - 2423252882 - 2400513152986 - 24001315298654353000045 - 2400
that is half of what I want it to do, what I need it to find is this:
23252882 - 2400131529864000000060900000007878656700000045 - 2423252882240051315298624001315298654353000045 - 2400
lets say:
x = 24
n = 46
I need to:
find x then n characters if the n+1 character == x
So find the start, then take the next 46, and the 45th must be the start of the next string, including all 24's in that string.
hope this is clear.
Thanks in advance.
EDIT
answer = 24.{44}(?=24)
You're almost there.
First, find x (24):
24
Then, find n=46 characters, where the 46 includes the original 24 (hence 44 left):
.{44}
The following character must be x (24):
(?=24)
All together:
24.{44}(?=24)
You can play around with it here.
In terms of constructing such a regex from a given x, n, your regex consists of
x.{n-number_of_characters(x)}(?=x)
where you substitute in x as-is and calculate n-number_of_characters(x).
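The construction above can be sketched in Python as follows. The function name build_pattern is made up for this example, and re.escape is added on the assumption that x might contain regex metacharacters (it leaves plain digits such as 24 unchanged).

```python
import re

def build_pattern(x, n):
    """Build x.{n - len(x)}(?=x): x, then the rest of the n characters, then x ahead."""
    rest = n - len(x)              # n includes x itself, so only the remainder is free
    escaped = re.escape(x)
    return "%s.{%d}(?=%s)" % (escaped, rest, escaped)

text = ("2325288224001315298640000000609000000078786567000000452423252882"
        "2400513152986240013152986543530000452400")

pattern = build_pattern("24", 46)
print(pattern)
for match in re.findall(pattern, text):
    print(match)
```

On the string from the question this yields the two 46-character blocks that start with 24 and are immediately followed by another 24. The leading 23252882 and the trailing 2400 are not matched, since neither is a 24 followed by 44 characters and another 24.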
Try this:
(?(?=24)(.{46})|(.{25})(.{24}))
Explanation:
(?(?=24)(.{46})|(.{25})(.{24}))
Options: case insensitive; ^ and $ match at line breaks
Do a test and then proceed with one of two options depending on the result of the test «(?(?=24)(.{46})|(.{25})(.{24}))»
    Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=24)»
        Match the characters “24” literally «24»
    If the test succeeded, match the regular expression below «(.{46})»
        Match the regular expression below and capture its match into backreference number 1 «(.{46})»
            Match any single character that is not a line break character «.{46}»
                Exactly 46 times «{46}»
    If the test failed, match the regular expression below «(.{25})(.{24})»
        Match the regular expression below and capture its match into backreference number 2 «(.{25})»
            Match any single character that is not a line break character «.{25}»
                Exactly 25 times «{25}»
        Match the regular expression below and capture its match into backreference number 3 «(.{24})»
            Match any single character that is not a line break character «.{24}»
                Exactly 24 times «{24}»