I am having trouble writing a regex that matches a given pattern at least n times within m characters of the input string.
For example imagine that my input string is:
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
I want to detect all cases where a 1 appears at least 7 times (not necessarily consecutively) in the input string, but within a window of up to 20 characters.
So far I have built this expression:
(1[^1]*?){7,}
which detects all cases where a 1 appears at least 7 times in the input string, but this now matches both the:
11000000011101111
and the
1100000001110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011
parts, whereas I want only the first one to be kept, as it fits within a substring of at most 20 characters.
I tried to combine the aforementioned regex with:
(?=(^[01]{0,20}))
to also restrict the match to runs of '1' and '0' characters that are at most 20 characters long, but when I do that it stops working.
Does anyone have an idea how to accomplish this?
I have put this example in regex101 as a quick reference.
Thank you very much!
This is not something that can be done with regex without listing out every possible string. You would need to iterate over the string instead.
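A minimal sketch of that iteration in Python, assuming a "match" means a span that starts and ends on a 1 and holds exactly n ones. The function name `dense_windows` and its parameters are made up for illustration:

```python
def dense_windows(s, n=7, m=20, ch="1"):
    """Return (start, end) spans holding n occurrences of ch in at most m characters."""
    # Index of every occurrence of the character we are counting.
    positions = [i for i, c in enumerate(s) if c == ch]
    spans = []
    # Pair each occurrence with the one n-1 occurrences later.
    for k in range(len(positions) - n + 1):
        start = positions[k]
        end = positions[k + n - 1] + 1  # exclusive end index
        if end - start <= m:
            spans.append((start, end))
    return spans


sample = "00000001100000001110111100000000"
for start, end in dense_windows(sample):
    print(start, end, sample[start:end])
```

Spans that start on different ones overlap, so merge or de-duplicate them if you only need the dense regions.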
You could also iterate over the matches. Example in Python:
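import re
# `string` is the input text. The lookahead keeps matches zero-width, so
# overlapping candidates (one per starting 1) are all captured in group 1.
matches = re.finditer(r'(?=((1[^1]*?){7}))', string)
# Keep only the candidates that fit within a 20-character window.
matches = [match.group(1) for match in matches if len(match.group(1)) <= 20]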
The next Python snippet is an attempt to get the desired sequences using only the regular expression.
import re
# Inline flags must be the very first thing in the pattern (an error otherwise since Python 3.11).
r = r'''(?mx)
( # the 1st capturing group will contain the desired sequence
1 # this sequence should begin with 1
(?=(?:[01]{6,19}) # let's see that there are enough 0s and 1s in a line
(.*$)) # the 2nd capturing group will contain all characters to the end of a line
(?:0*1){6}) # there must be six more 1s in the sequence
(?=.{0,13} # complement the 1st capturing group to 20 characters
\2) # the rest of a line should be 2nd capturing group
'''
s = '''
0000000
101010101010111111100000000000001
00000001100000001110111100000000000000000000000000000000000000000000000000110000000111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001100
1111111
111111
'''
print([m.group(1) for m in re.finditer(r, s)])
Output:
['1010101010101', '11111100000000000001', '110000000111011', '1111111']
You can find an exhaustive explanation of this regular expression on RegEx101.
I work with text files, and I need to be able to see when the gps (last 3 columns of csv) "hangs up" for more than a few lines.
So for example, usually, part of a text file looks like this:
5451,1667,180007,35.7397387,97.8161897,375.8
5448,1053z,180006,35.7397407,97.8161814,375.7
5444,1667,180005,35.7397445,97.8161674,375.6
5439,1668,180004,35.7397483,97.8161526,375.5
5435,1669,180003,35.7397518,97.8161379,375.5
5431,1669,180002,35.7397554,97.8161269,375.6
5426,1054z,180001,35.7397584,97.8161115,375.6
5420,1670,175959,35.7397649,97.8160931,375.9
But sometimes there is an error with the gps and it looks like this:
36859,1598,202603.00,35.8867316,99.2515545,555.700
36859,1598,202608.00,35.8867316,99.2515545,555.700
36859,1142z,202610.00,35.8867316,99.2515545,555.700
36859,1597,202612.00,35.8867316,99.2515545,555.700
36859,1597,202614.00,35.8867316,99.2515545,555.700
36859,1596,202616.00,35.8867316,99.2515545,555.700
36859,1595,202618.00,35.8867316,99.2515545,555.700
I need a way to search for the same GPS values (the last three columns) repeating on 7 consecutive lines, but so far I've only been able to figure out how to search for repeated digits or consecutive numbers.
Any ideas?
If you were to find such repetitions in an editor (such as Notepad++), you could use the following regex to find 4 or more repeating lines:
([^,]+(?:,[^,]+){2})\v+(?:(?:[^,]+,){3}\1(?:\v+|$)){3,}
To go a bit into detail
([^,]+(?:,[^,]+){2})\v+ is a group consisting of one or more non-commas followed by comma and another one or more non-commas followed by a vertical space (linebreak), that is not part of the group (e.g. 1,1,1\n)
(?:[^,]+,){3} matches one or more non-commas followed by comma, three times (your columns that don't have to be considered)
\1 is a backreference to group 1, matching if it contains exactly the same as group 1
(?:\v+|$) matches either one or more vertical whitespace characters or the end of the text
{3,} for 3 or more repetitions - increase it if you want more
Here you can see how it works
However, if you are using a programming language to check this, I wouldn't go down the regex path, as checking for those repetitions can be done much more easily. Here is one example in Python; I hope you can adapt it to your needs:
oldcoords = None
repetitions = 0
# Read all lines; columns 3-5 (the last three) hold the GPS fix.
with open(r'C:\temp\gps.csv') as f:
    lines = [line.rstrip('\n') for line in f]
for line in lines:
    gpscoords = line.split(',')[3:6]
    if gpscoords == oldcoords:
        repetitions += 1
    else:
        oldcoords = gpscoords
        repetitions = 0
    if repetitions == 4:  # or however you define more than a few
        print(', '.join(gpscoords) + ' is repeated')
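If you also want to know which lines are affected, here is a sketch of a variant built on itertools.groupby. The function name `stuck_runs`, the `min_len` threshold, and the sample rows are assumptions for illustration, not part of the original data format:

```python
import csv
from itertools import groupby


def stuck_runs(lines, min_len=7):
    """Yield (first_line, last_line, coords) for runs of rows sharing the last three columns."""
    rows = csv.reader(lines)
    # Pair each non-empty row with its 1-based line number and its GPS columns.
    numbered = ((n, tuple(row[-3:])) for n, row in enumerate(rows, start=1) if row)
    # groupby collects consecutive rows whose GPS columns are identical.
    for coords, group in groupby(numbered, key=lambda item: item[1]):
        line_numbers = [n for n, _ in group]
        if len(line_numbers) >= min_len:
            yield line_numbers[0], line_numbers[-1], coords


sample = """5451,1667,180007,35.7397387,97.8161897,375.8
36859,1598,202603.00,35.8867316,99.2515545,555.700
36859,1598,202608.00,35.8867316,99.2515545,555.700
36859,1142z,202610.00,35.8867316,99.2515545,555.700
36859,1597,202612.00,35.8867316,99.2515545,555.700
36859,1597,202614.00,35.8867316,99.2515545,555.700
36859,1596,202616.00,35.8867316,99.2515545,555.700
36859,1595,202618.00,35.8867316,99.2515545,555.700"""

for first, last, coords in stuck_runs(sample.splitlines()):
    print(f"lines {first} to {last} repeat {', '.join(coords)}")
```

To run it on a file, pass the open file object instead of `sample.splitlines()`.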
If you can use perl, and if I understood you:
perl -ne 'm/^[^,]*,[^,]*,[^,]*,([^,]*,[^,]*,[^,]*$)/g; $current_line=$1; ++$line_number; if ($prev_line eq $current_line){$equals++} else {if ($equals>=6){ print "Last three fields in lines ".($line_number-$equals-1)." to ".($line_number-1)." are equals to:\n$prev_line" } ; $equals=0}; $prev_line=$current_line' < onlyreplacethiswithyourfilepath
should do the trick.
Sample output:
Last three fields in lines 1 to 7 are equals to:
35.8867316,99.2515545,555.700
Last three fields in lines 16 to 22 are equals to:
37.8782116,99.7825545,572.810
Last three fields in lines 31 to 44 are equals to:
36.6868916,77.2594245,581.358
Last three fields in lines 57 to 63 are equals to:
35.5128764,71.2874545,575.631
I have a string in the format A123ABC
First letter cannot contain <I,O,Q,U,Z>
Next 3 digits (0-9) from 21-998
Last 3 letters cannot include <I,Q,Z>
I used the following expression [A-HJ-NPR-TV-Y]{1}[0-9]{2,3}[A-HJ-PR-Y]{3}
But I am not able to restrict the number in the range 21-998.
Your letter part is fine, below is just the numbers portion:
regex = "(?:2[1-9]|[3-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-8])"
(?:...) group, but do not capture.
2[1-9] covers 21-29
[3-9][0-9] covers 30-99
[1-8][0-9][0-9] covers 100-899
9[0-8][0-9] covers 900-989
99[0-8] covers 990-998
| stands for "or"
Note: [0-9] may be replaced by \d. So, a more concise representation would be:
regex = "(?:2[1-9]|[3-9]\d|[1-8]\d{2}|9[0-8]\d|99[0-8])"
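An alternation like this is easy to get subtly wrong, so it is worth checking it against every candidate number. A small Python sketch of such a check (the name `NUMBER` is just for illustration):

```python
import re

# The number-only alternation from above, written with explicit [0-9] classes.
NUMBER = re.compile(r"(?:2[1-9]|[3-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-8])")

# Collect every integer whose decimal form is matched in full.
matched = [n for n in range(0, 10000) if NUMBER.fullmatch(str(n))]

# The pattern should accept exactly 21..998 and nothing else.
print(matched == list(range(21, 999)))
```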
One option would be matching (\d+) and checking if that falls in the range 21 - 998 outside a regex, in the language you're using, if possible.
If that is not feasible, you have to break it up (just showing the middle part):
(2[1-9]|[3-9]\d|[1-8]\d\d|9[0-8]\d|99[0-8])
Breakdown:
2[1-9] matches 21 - 29
[3-9]\d matches 30 - 99
[1-8]\d\d matches 100 - 899
9[0-8]\d matches 900 - 989
99[0-8] matches 990 - 998
Also, the {1} is superfluous and can be omitted, making the complete regex
[A-HJ-NPR-TV-Y](2[1-9]|[3-9]\d|[1-8]\d\d|9[0-8]\d|99[0-8])[A-HJ-PR-Y]{3}
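The first option (capture the digits and range-check them outside the regex) could look roughly like this in Python. The names `CODE` and `is_valid` are hypothetical, and rejecting leading zeros is an assumption, chosen to agree with the regex above:

```python
import re

# Letters as in the question; the digits are captured and checked separately.
CODE = re.compile(r"[A-HJ-NPR-TV-Y]([0-9]{2,3})[A-HJ-PR-Y]{3}")


def is_valid(code):
    """Return True if code has the A123ABC shape and its number is in 21..998."""
    m = CODE.fullmatch(code)
    if m is None:
        return False
    digits = m.group(1)
    if digits.startswith("0"):  # assumption: no zero-padded numbers
        return False
    return 21 <= int(digits) <= 998


for candidate in ["A123ABC", "A21ABC", "A20ABC", "A999ABC", "I123ABC"]:
    print(candidate, is_valid(candidate))
```

This keeps the pattern short and moves the numeric rule into code, where the bounds are easy to change later.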
Assuming the numbers between 21 and 99 are displayed with three digits (i.e. 021, 055, 099), here's a solution for the number part:
((02[1-9])|(0[3-9][0-9])|([1-8][0-9]{2})|(9[0-8][0-9])|(99[0-8]))
Entire regex:
[A-HJ-NPR-TV-Y]{1}((02[1-9])|(0[3-9][0-9])|([1-8][0-9]{2})|(9[0-8][0-9])|(99[0-8]))[A-HJ-PR-Y]{3}
There are probably easier ways to do this, but one way would be to use:
^((?=[^IOQUZ])([A-Z]))((02[1-9])|(0[3-9]\d)|([1-8]\d\d)|(9[0-8]\d)|(99[0-8]))((?=[^IQZ])([A-Z])){3}$
To explain:
^ denotes the beginning of the string.
((?=[^IOQUZ])([A-Z])) would give you any capital letter not in <I, O, Q, U, Z>.
((02[1-9])|(0[3-9]\d)|([1-8]\d\d)|(9[0-8]\d)|(99[0-8])) denotes any number, zero-padded to three digits, between ((21 to 29) or (30 to 99) or (100 to 899) or (900 to 989) or (990 to 998)).
((?=[^IQZ])([A-Z])){3} would match any three capital letters not in <I, Q, Z>.
$ would denote the end of the string.
In the tester this works ... but not in PostgreSQL.
My data is like this: usually a series of letters, followed by 2 digits and a POSSIBLE '-' or space, with only ONE letter following. I am trying to isolate the 2 digits, the possible '-' or space, AND the ONE letter with my regex:
For ex:
AJ 50-R Busboys ## should return 50-R
APPLES 30 F ## should return 30 F
FOOBAR 30 Apple ## should return 30
Regex's (that have worked in the tester, but not in PostgreSQL) that I've tried:
substring(REF from '([0-9]+)-?([:space:])?([A-Za-z])?')
&
substring(REF from '([0-9]+)-?([A-Za-z])?')
So far everything tests out in the tester, but not in PostgreSQL. I just keep getting the numbers returned, AND NOTHING AFTER THEM.
What I am getting now(for ex):
AJ 50-R Busboys ## returns as "50" NOT as "50-R"
You're looking for: substring(REF from '([0-9]+(-| )([A-Za-z]\y)?)')
In SQLFiddle. Your primary problem is that substring returns the first or outermost matching group (ie., pattern surrounded with ()), which is why you get 50 for your '50-R'. If you were to surround the entire pattern with (), this would give you '50-R'. However, the pattern you have fails to return what you want on the other strings, even after accounting for this issue, so I had to modify the entire regex.
This matches your description and examples.
Your description is slightly ambiguous. Leading letters are followed by a space and then two digits in your examples, as opposed to your description.
SELECT t, substring(t, '^[[:alpha:] ]+(\d\d(?:[\s-]?[[:alpha:]]\M)?)')
FROM (
VALUES
('AJ 50-R Busboys') -- should return: 50-R
,('APPLES 30 F') -- should return: 30 F
,('FOOBAR 30 Apple') -- should return: 30
,('FOOBAR 30x Apple') -- should return: 30x
,('sadfgag30 D 66 X foo') -- should return: 30 D - not: 66 X
) r(t);
->SQLfiddle
Explanation
^ .. start of string (last row could fail without anchoring to start and global flag 'g'). Also: faster.
[[:alpha:] ]+ .. one or more letters or spaces (like in your examples).
( .. capturing parenthesis
\d\d .. two digits
(?: .. non-capturing parenthesis
[\s-]? .. '-' or 'white space' (character class), 0 or 1 times
[[:alpha:]] .. 1 letter
\M .. followed by end of word (can be end of string, too)
)? .. the pattern in non-capturing parentheses 0 or 1 times
Letters are defined by the character class alpha according to the current locale! The poor man's substitute [a-zA-Z] only works for basic ASCII letters and fails for anything more. Consider this simple demo:
SELECT substring('oö','[[:alpha:]]*')
,substring('oö','[a-zA-Z]*');
More about character classes in Postgres regular expressions in the manual.
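The same pitfall shows up in other regex engines. As a rough illustration only: Python's re module is not the PostgreSQL engine and has no [[:alpha:]], so the sketch below assumes the common idiom [^\W\d_] as a stand-in for "any letter":

```python
import re

word = "o\u00f6"  # the same two letters as in the demo above: 'o' and 'ö'

# The ASCII-only range stops at the first non-ASCII letter.
ascii_only = re.match(r"[a-zA-Z]*", word).group()

# A word character that is neither a digit nor an underscore, i.e. a letter.
any_letter = re.match(r"[^\W\d_]*", word).group()

print(len(ascii_only), len(any_letter))
```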
It's because of the parentheses.
I've looked everywhere in the documentation and found an interesting sentence on this page:
[...] if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned.
I took your first expression:
([0-9]+)-?([:space:])?([A-Za-z])?
and wrapped it in parentheses:
(([0-9]+)-?([:space:])?([A-Za-z])?)
and it works fine (see SQLFiddle).
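To see the effect outside the database, here is a small Python sketch that imitates the documented rule (return the first parenthesized subexpression if there is one, otherwise the whole match). The helper `pg_like_substring` is hypothetical, and Python's regex dialect differs from PostgreSQL's, so it uses your second expression, which avoids [:space:]:

```python
import re


def pg_like_substring(text, pattern):
    """Return group 1 if the pattern has capturing groups, else the whole match."""
    m = re.search(pattern, text)
    if m is None:
        return None
    # Pattern.groups is the number of capturing groups in the pattern.
    return m.group(1) if m.re.groups else m.group(0)


text = "AJ 50-R Busboys"
unwrapped = pg_like_substring(text, r"([0-9]+)-?([A-Za-z])?")
wrapped = pg_like_substring(text, r"(([0-9]+)-?([A-Za-z])?)")
print(unwrapped, wrapped)
```

Without the outer parentheses the first group holds only the digits; with them, the first group spans the whole match.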
Update:
Also, because you're looking for - or space, you could rewrite your middle expression to [-\s]? (thanks Matthew for pointing that out), which leads to the following possible REGEX:
(([0-9]+)[-\s]?([A-Za-z])?)
(SQLFiddle)
Update 2:
While my answer provides the explanation as to why the result represented a partial match of your expression, the expression I presented above fails your third test case.
You should use the regex provided by Matthew in his answer.