Regex - Match n occurences of substring within any m-lettered window - regex

I am facing some issues forming a regex that matches at least n times a given pattern within m characters of the input string.
For example imagine that my input string is:
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
I want to detect all cases where an 1 appears at least 7 times (not necessarily consecutively) in the input string, but within a window of up to 20 characters.
So far I have built this expression:
(1[^1]*?){7,}
which detects all cases where an 1 appears at least 7 times in the input string, but this now matches both the:
11000000011101111
and the
1100000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011
parts whereas I want only the first one to be kept, as it is within a substring composed of less than 20 characters.
It tried to combine the aforementioned regex with:
(?=(^[01]{0,20}))
to also match only parts of the string containing either an '1' or a '0' of length up to 20 characters but when I do that it stops working.
Does anyone have an idea gow to accomplish this?
I have put this example in regex101 as a quick reference.
Thank you very much!

This is not something that can be done with regex without listing out every possible string. You would need to iterate over the string instead.
You could also iterate over the matches. Example in Python:
import re
matches = re.finditer(r'(?=((1[^1]*?){7}))', string)
matches = [match.group(1) for match in matches if len(match.group(1)) <= 20]

The next Python snippet is an attempt to get the desired sequences using only the regular expression.
import re
r = r'''
(?mx)
( # the 1st capturing group will contain the desired sequence
1 # this sequence should begin with 1
(?=(?:[01]{6,19}) # let's see that there are enough 0s and 1s in a line
(.*$)) # the 2nd capturing group will contain all characters to the end of a line
(?:0*1){6}) # there must be six more 1s in the sequence
(?=.{0,13} # complement the 1st capturing group to 20 characters
\2) # the rest of a line should be 2nd capturing group
'''
s = '''
0000000
101010101010111111100000000000001
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
1111111
111111
'''
print([m.group(1) for m in re.finditer(r, s)])
Output:
['1010101010101', '11111100000000000001', '110000000111011', '1111111']
You can find an exhaustive explanation of this regular expression on RegEx101.

Related

Python Regular Expression: No space in between

I have the following string:
"......(some chars) aaa bbb ###8/13/2018 ......(some chars)"
The ### in the string represent some random characters. ###'s length is unknown and it could be None (just "aaa bbb 8/13/2018").
My goal is to find the date from the string (8/13/2018) and the starting index of ###.
I currently used the following code:
m = re.search(r'\s.*?([0-9]{1,}/[0-9]{1,}/[0-9]{2,})', str)
m.groups()[0] ## The date
m.start() ## index of ###
But the regex is matching bbb ###8/13/2018 instead of ###8/13/2018
I also tried change the regex to:
r'\s(?!\s).*?[0-9]{1,}/[0-9]{1,}/[0-9]{2,}'
r'\s(?!\s)*?[0-9]{1,}/[0-9]{1,}/[0-9]{2,}'
But neither of them works.
I will be appreciated for any help or comments. Thank you.
I tend to believe you are looking for:
#*(?:\d{1,2}/){2}\d{2,4} or even \S*(?:\d{1,2}/){2}\d{2,4}
This is simply saying:
\S* start with 0 or more non-space charaters.
(?:\d{1,2}/){2} find two groups of \d{1,2}/ but do not capture them. ie not capturing: (?:..).this will match the month and date part 8/13/. \d{1,2} means atleast one digit and atmost two digits
\d{2,4} match the year .Atleast 2 digits and atmost 4 digits
Using a part of your regex, I think you mean something like this
r'\S*([0-9]+/[0-9]+/[0-9]{2,})'
https://regex101.com/r/dxF4sT/1
To find the starting index, it would be where the match was found.
Note that \S will find all consecutive non-whitespace.
You can change this to other things like [#a-zA-Z] etc..., just add it to the class.

Why is this regex performing partial matches?

I have the following raw data:
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...
I'm using this regex to remove duplicates:
([^.]+)(.[ ]*\1)+
which results in the following:
1.2.4.5.9.115.16.19 ...
The problem is how the regex handles 1.1 in the substring .11.15. What should be 9.11.15.16 becomes 9.115.16. How do I fix this?
The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.
The regex is being used within Oracle's REGEXP_REPLACE
The decimal is a delimiter. I've tried commas and pipes but that doesn't fix the problem.
Oracle's REGEX does not work the way you intended. You could split the string and find distinct rows using the general method Splitting string into multiple rows in Oracle. Another option is to use XMLTABLE , which works for numbers and also strings with proper quoting.
SELECT LISTAGG(n, '.') WITHIN
GROUP (
ORDER BY n
) AS n
FROM (
SELECT DISTINCT TO_NUMBER(column_value) AS n
FROM XMLTABLE(replace('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
);
Demo
Unfortunately Oracle doesn't provide a token to match a word boundary position. Neither familiar \b token nor ancient [[:<:]] or [[:>:]].
But on this specific set you can use:
(\d+\.)(\1)+
Note: You forgot to escape dot.
Your regex caught:
a 1 - the second digit in 11,
then a dot,
and finally 1 - the first digit in 15.
So your regex failed to catch the whole sequence of digits.
The most natural way to write a regex catching the whole sequence
of digits would be to use:
a loobehind for either the start of the string or a dot,
then catch a sequence of digits,
and finally a lookahead for a dot.
But as I am not sure whether Oracle supports lookarounds, I wrote
the regex another way:
(^|\.)(\d+)(\.(\2))+
Details:
(^|\.) - Either start of the string or a dot (group 1), instead of
the loobehind.
(\d+) - A sequence of digits (group 2).
( - Start of group 3, containing:
\.(\2) - A dot and the same sequence of digits which caught group 2.
)+ - End of group 3, it may occur multiple times.
Group the repeating pattern and remove it
As revo has indicated, a big source of your difficulties came with not escaping the period. In addition, the resulting string having a 115 included can be explained as follows (Valdi_Bo made a similar observation earlier):
([^.]+)(.[ ]*\1)+ will match 11.15 as follow:
SCOTT#DB>SELECT
2 '11.15' val,
3 regexp_replace('11.15','([^.]+)(\.[ ]*\1)+','\1') deduplicated
4 FROM
5 dual;
VAL DEDUPLICATED
11.15 115
Here is a similar approach to address those problems:
matching pattern composition
-Look for a non-period matching list of length 0 to N (subexpression is referenced by \1).
'19' which matches ([^.]*)
-Look for the repeats which form our second matching list associated with subexression 2, referenced by \2.
'19.19.19' which matches ([^.]*)([.]\1)+
-Look for either a period or end of string. This is matching list referenced by \3. This fixes the match of '11.15' by '115'.
([.]|$)
replacement string
I replace the match pattern with a replacement string composed of the first instance of the non-period matching list.
\1\3
Solution
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')
Here is an example using some permutations of your examples:
SCOTT#db>WITH tst AS (
2 SELECT
3 '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val
4 FROM
5 dual
6 UNION ALL
7 SELECT
8 '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val
9 FROM
10 dual
11 UNION ALL
12 SELECT
13 '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val
14 FROM
15 dual
16 ) SELECT
17 val,
18 regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3') deduplicate
19 FROM
20 tst;
VAL DEDUPLICATE
------------------------------------------------------------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 1.2.4.5.9.11.15.16.19
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19 1.2.4.5.9.11.15.16.19
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19 1.2.4.5.9.11.15.16.19
My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).

Find repeating gps using regular expression

I work with text files, and I need to be able to see when the gps (last 3 columns of csv) "hangs up" for more than a few lines.
So for example, usually, part of a text file looks like this:
5451,1667,180007,35.7397387,97.8161897,375.8
5448,1053z,180006,35.7397407,97.8161814,375.7
5444,1667,180005,35.7397445,97.8161674,375.6
5439,1668,180004,35.7397483,97.8161526,375.5
5435,1669,180003,35.7397518,97.8161379,375.5
5431,1669,180002,35.7397554,97.8161269,375.6
5426,1054z,180001,35.7397584,97.8161115,375.6
5420,1670,175959,35.7397649,97.8160931,375.9
But sometimes there is an error with the gps and it looks like this:
36859,1598,202603.00,35.8867316,99.2515545,555.700
36859,1598,202608.00,35.8867316,99.2515545,555.700
36859,1142z,202610.00,35.8867316,99.2515545,555.700
36859,1597,202612.00,35.8867316,99.2515545,555.700
36859,1597,202614.00,35.8867316,99.2515545,555.700
36859,1596,202616.00,35.8867316,99.2515545,555.700
36859,1595,202618.00,35.8867316,99.2515545,555.700
I need to be able to figure out a way to search for matching strings of 7 different numbers, (the decimal portion of the gps) but so far I've only been able to figure out how to search for repeating #s or consecutive numbers.
Any ideas?
If you were to find such repetitions in an editor (such as Notepad++), you could use the following regex to find 4 or more repeating lines:
([^,]+(?:,[^,]+){2})\v+(?:(?:[^,]+,){3}\1(?:\v+|$)){3,}
To go a bit into detail
([^,]+(?:,[^,]+){2})\v+ is a group consisting of one or more non-commas followed by comma and another one or more non-commas followed by a vertical space (linebreak), that is not part of the group (e.g. 1,1,1\n)
(?:[^,]+,){3} matches one or more non-commas followed by comma, three times (your columns that don't have to be considered)
\1 is a backreference to group 1, matching if it contains exactly the same as group 1
(?:\v+|$) matches either another vertical whitespaces or the end of the text
{3,} for 3 or more repetitions - increase it if you want more
Here you can see, how it works
However, if you are using any programming language to check this, I wouldn't walk on the path of regex, as checking for those repetitions can be done a lot easier. Here is one example in Python, I hope you can adopt it for your needs:
oldcoords = [0,0,0]
lines = [line.rstrip('\n') for line in open(r'C:\temp\gps.csv')]
for line in lines:
gpscoords = line.split(',')[3:6]
if gpscoords == oldcoords:
repetitions += 1
else:
oldcoords = gpscoords
repetitions = 0
if repetitions == 4: #or however you define more than a few
print(', '.join(gpscoords) + ' is repeated')
If you can use perl, and if I understood you:
perl -ne 'm/^[^,]*,[^,]*,[^,]*,([^,]*,[^,]*,[^,]*$)/g; $current_line=$1; ++$line_number; if ($prev_line==$current_line){$equals++} else {if ($equals>=6){ print "Last three fields in lines ".($line_number-$equals-1)." to ".($line_number-1)." are equals to:\n$prev_line" } ; $equals=0}; $prev_line=$current_line' < onlyreplacethiswithyourfilepath should do the trick.
Sample output:
Last three fields in lines 1 to 7 are equals to:
35.8867316,99.2515545,555.700
Last three fields in lines 16 to 22 are equals to:
37.8782116,99.7825545,572.810
Last three fields in lines 31 to 44 are equals to:
36.6868916,77.2594245,581.358
Last three fields in lines 57 to 63 are equals to:
35.5128764,71.2874545,575.631

Regular Expressions in R

I found somewhat similar questions
R - Select string text between two values, regex for n characters or at least m characters,
but I'm still having trouble
say I have a string in r
testing_String <- "AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7"
And I need to be able to pull anything between the first element in the string that contains 2 characters (AK) and PADK,ADK. PADK and ADK will change in character but will always be 4 and 3 characters in length respectively.
So I would need to pull
ADAK NAS
I came up with this but its picking up everything from AK to ADK
^[A-Za-z0_9_]{2}(.*?) +[A-Za-z0_9_]{4}|[A-Za-z0_9_]{3,}
If I understood your question correctly, this should do the trick:
\b[A-Z]{2}\s+(.+?)\s+[A-Z]{4}\s+[A-Z]{3}\b
Demo
You'll have to switch the perl = TRUE option (to use a decent regex engine).
\b means word boundary. So this pattern looks for a match starting with a 2-letter word and ending with a 4 letter word followed by a 3 letter word. Your value will be in the first group.
Alternatively, you can write the following to avoid using the capturing group:
\b[A-Z]{2}\s+\K.+?(?=\s+[A-Z]{4}\s+[A-Z]{3}\b)
But I'd prefer the first method because it's easier to read.
Lookbehind is supported for perl=TRUE, so this regex will do what you want:
(?<=\w{2}\s).*?(?=\s+[^\s]{4}\s[^\s]{2})

Regex failing to match number and dash with letter (or space and letter)

In the tester this works ... but not in PostgreSQL.
My data is like this -- usually a series of letters, followed by 2 numbers and a POSSIBLE '-' or 'space' with only ONE letter following. I am trying to isolate the 2 numbers and the Possible '-" or 'space' AND the ONE letter with my regex:
For ex:
AJ 50-R Busboys ## should return 50-R
APPLES 30 F ## should return 30 F
FOOBAR 30 Apple ## should return 30
Regex's (that have worked in the tester, but not in PostgreSQL) that I've tried:
substring(REF from '([0-9]+)-?([:space:])?([A-Za-z])?')
&
substring(REF from '([0-9]+)-?([A-Za-z])?')
So far everything tests out in the tester...but not the PostgreSQL. I just keep getting the numbers returns -- AND NOTHING AFTER IT.
What I am getting now(for ex):
AJ 50-R Busboys ## returns as "50" NOT as "50-R"
Your looking for: substring(REF from '([0-9]+(-| )([A-Za-z]\y)?)')
In SQLFiddle. Your primary problem is that substring returns the first or outermost matching group (ie., pattern surrounded with ()), which is why you get 50 for your '50-R'. If you were to surround the entire pattern with (), this would give you '50-R'. However, the pattern you have fails to return what you want on the other strings, even after accounting for this issue, so I had to modify the entire regex.
This matches your description and examples.
Your description is slightly ambiguous. Leading letters are followed by a space and then two digits in your examples, as opposed to your description.
SELECT t, substring(t, '^[[:alpha:] ]+(\d\d(:?[\s-]?[[:alpha:]]\M)?)')
FROM (
VALUES
('AJ 50-R Busboys') -- should return: 50-R
,('APPLES 30 F') -- should return: 30 F
,('FOOBAR 30 Apple') -- should return: 30
,('FOOBAR 30x Apple') -- should return: 30x
,('sadfgag30 D 66 X foo') -- should return: 30 D - not: 66 X
) r(t);
->SQLfiddle
Explanation
^ .. start of string (last row could fail without anchoring to start and global flag 'g'). Also: faster.
[[:alpha:] ]+ .. one or more letters or spaces (like in your examples).
( .. capturing parenthesis
\d\d .. two digits
(:? .. non-capturing parenthesis
[\s-]? .. '-' or 'white space' (character class), 0 or 1 times
[[:alpha:]] .. 1 letter
\M .. followed by end of word (can be end of string, too)
)? .. the pattern in non-capturing parentheses 0 or 1 times
Letters as defined by the character class alpha according to the current locale! The poor man's substitute [a-zA-Z] only works for basic ASCII letters and fails for anything more. Consider this simple demo:
SELECT substring('oö','[[:alpha:]]*')
,substring('oö','[a-zA-Z]*');
More about character classes in Postgres regular expressions in the manual.
It's because of the parentheses.
I've looked everywhere in the documentation and found an interesting sentence on this page:
[...] if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned.
I took your first expression:
([0-9]+)-?([:space:])?([A-Za-z])?
and wrapped it in parentheses:
(([0-9]+)-?([:space:])?([A-Za-z])?)
and it works fine (see SQLFiddle).
Update:
Also, because you're looking for - or space, you could rewrite your middle expression to [-|\s]? (thanks Matthew for pointing that out), which leads to the following possible REGEX:
(([0-9]+)[-|\s]?([A-Za-z])?)
(SQLFiddle)
Update 2:
While my answer provides the explanation as to why the result represented a partial match of your expression, the expression I presented above fails your third test case.
You should use the regex provided by Matthew in his answer.