Regex to match substrings containing n non-repeated characters - regex

I am facing a (naive) problem with a regular expression.
I need to find any substrings composed of a fixed number (n) of different characters.
So, for "aaabcddd", if n=3 the substrings that I expect to find are: "abc" and "bcd".
My idea is to use n-1 capture groups and '[^' to exclude characters already matched. Thus, I wrote the following Perl regex (in Julia):
r"(([[:alpha:]])[^\2])[^\1]"
But, it is not working.
Do you have any tips?

You cannot use a backreference to a capture group inside a negated character class: within [^\1], the \1 is not treated as a backreference.
What you can do is use a negative lookahead to assert that what is directly to the right of the current position is not what you have already captured in a previous group.
If that is the case, capture a single alpha in a new group.
The matches abc and bcd are in capture group 1:
(?=(([[:alpha:]])(?!\2)([[:alpha:]])(?!\3|\2)[[:alpha:]]))
(?= Positive lookahead
( Capture group 1
([[:alpha:]]) Capture the first char in group 2
(?!\2)([[:alpha:]]) If not looking at what is captured by group 2 to the right, capture the second char in group 3
(?!\3|\2) If not looking to the right at what is captured by group 2 or 3
[[:alpha:]] Match the 3rd char
) Close group 1
) Close the lookahead
Regex demo
Or a bit shorter using a case insensitive match:
(?=(([a-z])(?!\2)([a-z])(?!\3|\2)[a-z]))
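As a rough sketch, the same lookahead pattern can be generated for a small n instead of being written by hand. This is only an illustration in Python: the helper names distinct_run_pattern and distinct_runs are made up here, and [a-z] with re.IGNORECASE stands in for [[:alpha:]], which Python's re module does not support.

```python
import re

def distinct_run_pattern(n):
    """Build a lookahead matching n consecutive, pairwise different letters.

    Hypothetical helper, not from the original answer. Group 1 holds the
    whole substring; groups 2..n+1 hold the individual characters.
    Intended for small n only (two-digit backreferences at most).
    """
    parts = ["([a-z])"]  # first character, captured in group 2
    for i in range(1, n):
        # Forbid every character captured so far (groups 2 .. i+1).
        seen = "|".join("\\" + str(g) for g in range(2, i + 2))
        parts.append("(?!" + seen + ")([a-z])")
    return re.compile("(?=(" + "".join(parts) + "))", re.IGNORECASE)

def distinct_runs(text, n):
    # The pattern is zero-width, so overlapping hits are all reported.
    return [m.group(1) for m in distinct_run_pattern(n).finditer(text)]

print(distinct_runs("aaabcddd", 3))  # ['abc', 'bcd']
```

Note that with re.IGNORECASE the backreferences also compare case-insensitively, so "aA" counts as a repeat in this sketch.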

Here is a solution to an arbitrary value of n characters:
#!/usr/local/bin/perl
use strict; use warnings; use feature ':5.10';
my $s="aaabcded";
my $n=3;
while ($s =~ /(?=([[:alpha:]]{$n}))/g) {
    my $hit = $1;
    my @chars = split //, $hit;
    my %uniq;
    @uniq{@chars} = ();    # hash slice: one key per distinct character
    say "$hit" if (scalar keys %uniq) == $n;
}
Running with $n=3 prints:
abc
bcd
cde
Running with $n=4 prints:
abcd
bcde
And $n=5:
abcde
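For comparison, the same "take every window, keep it if all characters are unique" idea can be sketched without a regex at all. The function name distinct_windows is an assumption for this example, and str.isalpha is used as a loose stand-in for [[:alpha:]].

```python
def distinct_windows(text, n):
    """Return every length-n window of letters whose characters all differ."""
    hits = []
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        # A set drops duplicates, so its size is n only if nothing repeats.
        if window.isalpha() and len(set(window)) == n:
            hits.append(window)
    return hits

for n in (3, 4, 5):
    print(n, distinct_windows("aaabcded", n))
```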

Related

Match all elements with n occurrences

I want to select runs of the same element with exactly n occurrences.
Match letters that repeat exactly 3 times in this string: "aaaaabbbcccccccccdddee"
This should return "bbb" and "ddd".
If I spelled out what to match, like "b{3}" or "d{3}", this would be easier, but I want to match all elements.
The closest I have come is this regex: (.)\1{2}(?!\1)
It returns "aaa", "bbb", "ccc", "ddd".
And I can't add a negative lookbehind (?<!\1), because of the "non-fixed width" error.
One possibility is to use a regex that looks for a character which is not followed by itself (or for the beginning of the line), followed by three identical characters, followed by another character which is not the same as those three, i.e.
(?:(.)(?!\1)|^)((.)\3{2})(?!\3)
Demo on regex101
The match is captured in group 2. The issue with this though is that it absorbs a character prior to the match, so cannot find adjacent matches: as shown in the demo, it only matches aaa, ccc and eee in aaabbbcccdddeee.
This issue can be resolved by making the entire regex a lookahead, a technique which allows for capturing overlapping matches as described in this question. So:
(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))
Again, the match is captured in group 2.
Demo on regex101
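A minimal sketch of how the lookahead version could be used from Python, assuming the pattern carries over unchanged to Python's re module (the name runs_of_three is made up for this example). Since the whole pattern is zero-width, the result has to be read from group 2.

```python
import re

# Same pattern as above; group 2 carries the run of exactly three characters.
EXACT_THREE = re.compile(r"(?=(?:(.)(?!\1)|^)((.)\3{2})(?!\3))")

def runs_of_three(text):
    # finditer visits every position, so adjacent runs are all found.
    return [m.group(2) for m in EXACT_THREE.finditer(text)]

print(runs_of_three("aaabbbcccdddeee"))
print(runs_of_three("aaaaabbbcccccccccdddee"))
```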
You could match what you don't want to keep, which is 4 or more times the same character.
Then use an alternation to capture what you want to keep, which is 3 times the same character.
The desired matches are in capture group 2.
(.)\1{3,}|((.)\3\3)
(.) Capture group 1, match a single character
\1{3,} Repeat the same char in group 1, 3 or more times
| Or
( Capture group 2
(.)\3\3 Capture group 3, match a single character followed by 2 backreferences matching 2 times the same character as in group 3
) Close group 2
Regex demo
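To illustrate the "match what you don't want, capture what you do" approach, here is a small Python sketch. The function name exact_triples is hypothetical; the pattern is the one shown above, and matches where group 2 did not participate are simply skipped.

```python
import re

# Alternative 1 swallows runs of 4 or more; alternative 2 captures
# runs of exactly 3 in group 2.
PATTERN = re.compile(r"(.)\1{3,}|((.)\3\3)")

def exact_triples(text):
    return [m.group(2) for m in PATTERN.finditer(text)
            if m.group(2) is not None]

print(exact_triples("aaaaabbbcccccccccdddee"))  # ['bbb', 'ddd']
```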
This gets sticky because you cannot put a back reference inside a negative character set, so we'll use a lookbehind followed by a negative lookahead like this:
(?<=(.))((?!\1).)\2\2(?!\2)
This says find a character but don't include it in the match. Then look ahead to be certain the next character is different. Next consume it into capture group 2 and be certain that the next two characters match it, and the one after does not match.
Unfortunately, this does not work on 3 characters at the beginning of the string. I had to add a whole alternation clause to handle that case. So the final regex is:
(?:(?<=(.))((?!\1).)\2\2(?!\2))|^(.)\3\3(?!\3)
This handles all cases.
EDIT
I found a way to handle matches at the beginning of the string:
(?:(?<=(.))|^)((?!\1).)\2\2(?!\2)
Much nicer and more compact, and does not require looking in capture groups to get the answer.
If your environment permits the use of (*SKIP)(*FAIL), you can return a lean set of matches by consuming substrings of four or more consecutive duplicate characters and then discarding them. In the other branch of the alternation, match the desired 3 consecutive duplicated characters.
PHP Code: (Demo)
$string = 'aaaaabbbcccccccccdddee';
var_export(
preg_match_all(
'/(?:(.)\1{3,}(*SKIP)(*F)|(.)\2{2})/',
$string,
$m
)
? $m[0]
: 'no matches'
);
Output:
array (
0 => 'bbb',
1 => 'ddd',
)
This technique uses no lookarounds and does not generate false positive matches in the matches array (which would otherwise need to be filtered out).
This pattern is efficient because it never needs to look backward and by consuming the 4 or more consecutive duplicates, it can rule-out long substrings quickly.

Find Strings that contains a sequence of a specific sub string, with a limited amount of interruptions in between with regex

I'm looking for the following regex:
to find the part of the string (if it exists) that contains the longest sequence of repeating GGG, with an interruption of at most 10 chars between consecutive GGGs.
I tried the following pattern but it didn't work that well: ((GGG).{0,10}?)*
CAGTTAGGGTTTAGGGTTAGGTTTAGGGTTAGGGTTAGGGTGAGGTGAGGGTGAGGGTTAGGGTGAGGGGTGAGGGGTTGGGGTTAGGGTTAGGGTTAGGAGTTGCAGGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTACTTTAGGGTTAGGGTTGGGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTAGAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTACCTGCTTACTTGCTGCAGGGTTAGGGTTAGGGTTAGGGTTAAGTTAGGGTTTAGGGTTGGGGTTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGAGGTTAGGGTTAGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGCAGGGTTAGGGTTGGGGTTAGGGGTTAGGGGTTGGGGGGGTTAGGGTTGGGGGTTGGGGGTTAGGGAGGGTTAGGGGTTGGGGGTTGCAGGGGTTAGGGTTAGGGGTTGGGGTTAGGGTTAGGGTTAGGGTTACCTTGGGGGTTGGGGTTAGGGTTAGGGTTGCAGGGTTAGGGTTAGGAGTTAGGGTTAGAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTTAGGGTTAGGGTTGGGGTTAGGGTTAGAGGTTAGGGTTAGGGGTTGGGGTTAGGGGTTGGGGGTTGGGGTTAGGGTTGCAGAAGGGGTTGAGCAGGGTGGGAGTTAGGGATTAGGGATTAGGAGTTAGGGTGAGGGTTAGGGTTAGGGTGGGGTGGGGATTGGGGATTGGGAGTTAGGGTGGGTGGGGATTGGGGAGTTAGGAGTTAGGAGTTAGGAGTTAGGGAGTTAGGTTAGGGAGTTAGGGTTAGGAGTTAGAGGTTAGGGTTAGGGTGGGAGTTAGGGAGTTAGGAGGTGGGGTTGGGGTTAGGGTTAGGAGTTAGGGTTAGGGTTAGGGTTAGGGATTGGGAGTTAGGGTAGGAGTTAGGGTTAGAGGTTAGGAGTTAGGGTTAGGAGTTAGGGATTAGAGGTTAGGGTGGGATTAGGAGTTACTTACTTAGGGAGTTAGGAGTTAGGAGTTAGGGTGGGGTGGGAGTTAGAGGTTAGGAGTTAGGAGTTAGGGTTAGGGTTAGGAGTTAAGGGTTAGGGATTAGGAGTTAGGGTTAGGGTTAGGAGTTAGGGAGTTAGGGTGGGGTGGGAGTTGCAGGGATTGGGTTAGGGTTAGGAGTTGGGAGTTGGGGAGTTGGGAGTTAGGGTTACAGGGTGGGAGTTAGGAGTTAGGGAGTTAGGAGTTAGAGGTTAGGGATTAGGGGT
This pattern will work based on your rules: ((?:GGG.{0,10}?)+GGG)
regex101 demo
Explanation:
( start capture group
(?: start non-capturing group
GGG literally
.{0,10}? any character 0-10 times, non-greedy
) end the non-capturing group
+ match the previous group 1 or more times
GGG literally
) end the capture group
Then you can simply use re.findall to find all of those matches, and get the longest one of those with max(key=len).
Python demo:
import re
string = "CAGTTAGGGTTTAGGGTTAGGTTTAGGGTTAGGGTTAGGGTGAGGTGAGGGTGAGGGTTAGGGTGAGGGGTGAGGGGTTGGGGTTAGGGTTAGGGTTAGGAGTTGCAGGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTACTTTAGGGTTAGGGTTGGGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTTAGGGTTAGAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTACCTGCTTACTTGCTGCAGGGTTAGGGTTAGGGTTAGGGTTAAGTTAGGGTTTAGGGTTGGGGTTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGAGGTTAGGGTTAGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGCAGGGTTAGGGTTGGGGTTAGGGGTTAGGGGTTGGGGGGGTTAGGGTTGGGGGTTGGGGGTTAGGGAGGGTTAGGGGTTGGGGGTTGCAGGGGTTAGGGTTAGGGGTTGGGGTTAGGGTTAGGGTTAGGGTTACCTTGGGGGTTGGGGTTAGGGTTAGGGTTGCAGGGTTAGGGTTAGGAGTTAGGGTTAGAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTTAGGGTTAGGGTTGGGGTTAGGGTTAGAGGTTAGGGTTAGGGGTTGGGGTTAGGGGTTGGGGGTTGGGGTTAGGGTTGCAGAAGGGGTTGAGCAGGGTGGGAGTTAGGGATTAGGGATTAGGAGTTAGGGTGAGGGTTAGGGTTAGGGTGGGGTGGGGATTGGGGATTGGGAGTTAGGGTGGGTGGGGATTGGGGAGTTAGGAGTTAGGAGTTAGGAGTTAGGGAGTTAGGTTAGGGAGTTAGGGTTAGGAGTTAGAGGTTAGGGTTAGGGTGGGAGTTAGGGAGTTAGGAGGTGGGGTTGGGGTTAGGGTTAGGAGTTAGGGTTAGGGTTAGGGTTAGGGATTGGGAGTTAGGGTAGGAGTTAGGGTTAGAGGTTAGGAGTTAGGGTTAGGAGTTAGGGATTAGAGGTTAGGGTGGGATTAGGAGTTACTTACTTAGGGAGTTAGGAGTTAGGAGTTAGGGTGGGGTGGGAGTTAGAGGTTAGGAGTTAGGAGTTAGGGTTAGGGTTAGGAGTTAAGGGTTAGGGATTAGGAGTTAGGGTTAGGGTTAGGAGTTAGGGAGTTAGGGTGGGGTGGGAGTTGCAGGGATTGGGTTAGGGTTAGGAGTTGGGAGTTGGGGAGTTGGGAGTTAGGGTTACAGGGTGGGAGTTAGGAGTTAGGGAGTTAGGAGTTAGAGGTTAGGGATTAGGGGT"
pattern = re.compile(r"((?:GGG.{0,10}?)+GGG)")
longest = max(re.findall(pattern, string), key=len)
print(len(longest), longest)
Output:
583 GGGTTAGGGTTAGGGTTAGGGTTAAGTTAGGGTTTAGGGTTGGGGTTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGAGGTTAGGGTTAGGTTAGGGTTAGGGTTAGGGTTAGGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGCAGGGTTAGGGTTGGGGTTAGGGGTTAGGGGTTGGGGGGGTTAGGGTTGGGGGTTGGGGGTTAGGGAGGGTTAGGGGTTGGGGGTTGCAGGGGTTAGGGTTAGGGGTTGGGGTTAGGGTTAGGGTTAGGGTTACCTTGGGGGTTGGGGTTAGGGTTAGGGTTGCAGGGTTAGGGTTAGGAGTTAGGGTTAGAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTTAGGGTTAGGGTTGGGGTTAGGGTTAGAGGTTAGGGTTAGGGGTTGGGGTTAGGGGTTGGGGGTTGGGGTTAGGGTTGCAGAAGGGGTTGAGCAGGGTGGGAGTTAGGGATTAGGG
Edit:
If you want to have at least 51 GGGs in the string, you can use the pattern: ((?:GGG.{0,10}?){50,}GGG) to accomplish that.
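The two variants above differ only in the repeat count, so they could be folded into one parameterised helper. The following is a sketch under that assumption; the function name longest_ggg_run and its parameters min_count and max_gap are invented for this example.

```python
import re

def longest_ggg_run(seq, min_count=2, max_gap=10):
    """Longest stretch with at least min_count GGGs, each gap <= max_gap chars.

    Hypothetical helper: min_count=2 reproduces the '+' pattern above,
    min_count=51 reproduces the '{50,}' pattern.
    """
    gap = ".{0,%d}?" % max_gap
    pattern = re.compile("((?:GGG%s){%d,}GGG)" % (gap, min_count - 1))
    matches = pattern.findall(seq)
    return max(matches, key=len) if matches else None

# Two runs separated by a 17-character gap, which is too long to bridge.
sample = "GGGTTAGGGTTAGGG" + "A" * 17 + "GGGTTGGG"
print(longest_ggg_run(sample))
print(longest_ggg_run(sample, min_count=4))
```

The second call prints None, because neither run in the sample contains four GGGs.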

Perl Regular expression | how to exclude words from a file

I am looking for Perl regular expression syntax to cover some requirements I have in a project.
First, I want to exclude strings that contain any word from a text file (a dictionary).
For example, if my file has these strings:
path.../Document.txt |
tree
car
ship
then with the regular expression I expect:
a1testtre -- match
orangesh1 -- match
apleship3 -- not match [contains word from file ]
I also have one more requirement that I couldn't solve. I have to create a regex that rejects a string in which a repeated char (two identical chars in a row) occurs 3 or more times.
For example :
adminnisstrator21 -- match (2 repeated pairs)
kkeeykloakk -- no match (3 repeated pairs)
stack22ooverflow -- match (2 repeated pairs)
For this I have tried
\b(?:([a-z])(?!\1))+\b
but it only works up to the first repeated char.
Any idea how to solve these two?
To not match a word from a file you might check whether a string contains a substring or use a negative lookahead and an alternation:
^(?!.*(?:tree|car|ship)).*$
^ Assert start of string
(?! negative lookahead, assert what is on the right is not
.*(?:tree|car|ship) Match 0+ times any char except a newline and match either tree car or ship
) Close negative lookahead
.* Match 0+ times any char except a newline
$ Assert end of string
Regex demo
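As an illustration, the negative-lookahead pattern can be assembled from a word list rather than typed in by hand. This is a Python sketch, not the answer's own code; build_exclusion_pattern is a made-up name and the word list is assumed to be non-empty.

```python
import re

def build_exclusion_pattern(words):
    """Compile ^(?!.*(?:w1|w2|...)).*$ for a non-empty list of literal words."""
    # re.escape keeps characters such as '.' or '+' in a word literal.
    alternation = "|".join(re.escape(w) for w in words)
    return re.compile(r"^(?!.*(?:%s)).*$" % alternation)

pattern = build_exclusion_pattern(["tree", "car", "ship"])
for candidate in ("a1testtre", "orangesh1", "apleship3"):
    print(candidate, "match" if pattern.match(candidate) else "no match")
```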
To reject a string that contains 3 or more doubled characters you could use:
\b(?!(?:\w*(\w)\1){3})\w+\b
\b Word boundary
(?! Negative lookahead, assert what is on the right is not
(?: Non-capturing group
\w*(\w)\1 Match 0+ times a word character followed by capturing a word char in a group followed by a backreference using \1 to that group
){3} Close non capturing group and repeat 3 times
) close negative lookahead
\w+ Match 1+ word characters
\b word boundary
Regex demo
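A quick way to check that pattern against the samples from the question, sketched in Python (the constant and function names are assumptions made for this example):

```python
import re

# Rejects a word as soon as three doubled characters can be found in it.
FEWER_THAN_THREE_PAIRS = re.compile(r"\b(?!(?:\w*(\w)\1){3})\w+\b")

def is_allowed(word):
    return FEWER_THAN_THREE_PAIRS.fullmatch(word) is not None

for word in ("adminnisstrator21", "kkeeykloakk", "stack22ooverflow"):
    print(word, is_allowed(word))
```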
Update
According to this posted answer (which you might add to the question instead) you have 2 patterns that you want to combine but it does not work:
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
In those 2 patterns you use 2 capturing groups, so the second pattern has to point to the second capturing group \2.
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\2){4}))*$)
                                               ^
Pattern demo
One way to exclude strings that contain words from a given list is to form a pattern with an alternation of the words and use that in a regex, and exclude strings for which it matches.
use warnings;
use strict;
use feature qw(say);
use Path::Tiny;
my $file = shift // die "Usage: $0 file\n"; #/
my @words = split ' ', path($file)->slurp;
my $exclude = join '|', map { quotemeta } @words;
foreach my $string (qw(a1testtre orangesh1 apleship3))
{
if ($string !~ /$exclude/) {
say "OK: $string";
}
}
I use Path::Tiny to read the file into a string ("slurp"), which is then split by whitespace into words to use for exclusion. The quotemeta escapes non-"word" characters, should any occur in your words, which are then joined by | to form a string with a regex pattern. (With complex patterns use qr.)
This may be possible to tweak and improve, depending on your use cases, for one with regard to the order of patterns with common parts in the alternation.†
The check that runs of successive duplicate characters occur fewer than three times:
foreach my $string (qw(adminnisstrator21 kkeeykloakk stack22ooverflow))
{
my @chars_that_repeat = $string =~ /(.)\1+/g;
if (@chars_that_repeat < 3) {
say "OK: $string";
}
}
A long string of repeated chars (aaaa) counts as one instance, due to the + quantifier in the regex; if you'd rather count all pairs, remove the + and four a's will count as two pairs. The same char repeated at various places in the string counts every time, so aaXaa counts as two pairs.
This snippet can be just added to the above program, which is invoked with the name of the file with words to use for exclusion. They both print what is expected from provided samples.
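The counting approach translates almost one-to-one to other regex flavours. As a hedged sketch in Python (function names are invented here; the threshold of 3 mirrors the Perl snippet):

```python
import re

def repeated_runs(text):
    # One entry per run of 2+ identical characters; 'aaaa' is a single run.
    return re.findall(r"(.)\1+", text)

def is_ok(text, limit=3):
    return len(repeated_runs(text)) < limit

for s in ("adminnisstrator21", "kkeeykloakk", "stack22ooverflow"):
    print(s, repeated_runs(s), is_ok(s))
```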
†  Consider an example with exclusion-words: so, sole, and solely. If you only need to check whether any one of these matches then you'd want shorter ones first in the alternation
my $exclude = join '|', map { quotemeta } sort { length $a <=> length $b } @words;
#==> so|sole|solely
for a quicker match (so matches all three). This, by all means, appears to be the case here.
But, if you wanted to correctly identify which word matched then you must have longer words first,
solely|sole|so
so that a string solely is correctly matched by its word before it can be "stolen" by so. Then in this case you'd want it the other way round,
sort { length $b <=> length $a }
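The effect of the ordering is easy to see in a small experiment. The sketch below uses Python, whose alternation is also tried left to right; first_hit is a hypothetical helper written for this illustration.

```python
import re

def first_hit(words, text, longest_first):
    """Return what an alternation of words matches first in text."""
    ordered = sorted(words, key=len, reverse=longest_first)
    pattern = "|".join(re.escape(w) for w in ordered)
    m = re.search(pattern, text)
    return m.group(0) if m else None

words = ["sole", "solely", "so"]
print(first_hit(words, "solely", longest_first=False))  # so
print(first_hit(words, "solely", longest_first=True))   # solely
```

Both orderings agree on whether anything matches; they differ only in which word is reported.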
I hope someone else will come with a better solution, but this seems to do what you want:
\b Match word boundary
(?: Start non-capturing group
(?:([a-z0-9])(?!\1))* Match all characters until it encounters a double
(?:([a-z0-9])\2)+ Match all repeated characters until a different one is reached
){0,2} Repeat the group 0 to 2 times
(?:([a-z0-9])(?!\3))+ Match all characters until it encounters a double
\b Match end of word
I changed the [a-z] to also match digits, since the examples you gave seem to include digits as well. Perl regex also has the \w shorthand, which for ASCII text is equivalent to [A-Za-z0-9_]; that could be handy if you want to match any character in a word.
My problem is that I have 2 regexes that each work on their own:
Do not allow 3 or more pairs of chars:
(?=^(?!(?:\w*(.)\1){3}).+$)
Do not allow a char to occur more than 4 times:
(?=^(?:(.)(?!(?:.*?\1){4}))*$)
Now I want to combine them into one line like:
(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)
but only the first regex takes effect, not both of them.
As mentioned in a comment on @zdim's answer, take it a bit further by making sure that the order in which your words are assembled into the match pattern doesn't trip you up. If the words in the file are not very carefully ordered to start with, I use a subroutine like this when building the match string:
# Returns a list of alternative match patterns in tight matching order.
# E.g., TRUSTEES before TRUSTEE before TRUST
# TRUSTEES|TRUSTEE|TRUST
sub tight_match_order {
    return @_ unless @_ > 1;
    my (@alts, @ordered_alts, %alts_seen);
    # Drop duplicate words, keeping the first occurrence.
    @alts = map { $alts_seen{$_}++ ? () : $_ } @_;
    TEST: {
        my $alt = shift @alts;
        # If another pending word contains this one, defer it to the back.
        if (grep m#$alt#, @alts) {
            push @alts => $alt;
        } else {
            push @ordered_alts => $alt;
        }
        redo TEST if @alts;
    }
    @ordered_alts
}
So following @zdim's answer:
...
my @words = split ' ', path($file)->slurp;
@words = tight_match_order(@words); # add this line
my $exclude = join '|', map { quotemeta } @words;
...
HTH

WKT: regex to extract only the first two floats values

I have the input below:
LINESTRING(-111.928130305897 33.4490602213529,-111.928130305897 33.4490602213529)
and I need a regex that generates this:
-111.928130305897 33.4490602213529
It's essentially the first two floats.
You can use the following regex:
(?<=\()-?(?:[1-9]\d*|\d)(?:\.\d*)?\s+-?(?:[1-9]\d*|\d)(?:\.\d*)?(?=,)
DEMO: https://regex101.com/r/Q2HreC/3
Explanations and assumptions:
(?<=\() positive lookbehind: the floats must directly follow an opening parenthesis
-?(?:[1-9]\d*|\d)(?:\.\d*)? match the first float: an optional -, then either a number starting with a non-zero digit or a single digit, optionally followed by a . and some decimals
\s+ some whitespace in the middle
followed by a second float, matched the same way
(?=,) positive lookahead: the match must be followed by a ,
To match the first 2 floats for your example, you might use:
^LINESTRING\(([-+]?\d*\.?\d+) ([-+]?\d*\.?\d+)
That would match:
^LINESTRING from the beginning of the string
\( an opening parenthesis
followed by a float ([-+]?\d*\.?\d+) matched 2 times, each in its own capturing group, separated by a space
The float regex:
( # Capturing group
[-+]? # Optional + or -
\d* # Match a digit zero or more times
\.? # Optional dot
\d+ # Match a digit one or more times
) # Close capturing group
Or to match -111.928130305897 33.4490602213529 for your example
without capturing groups you could use:
(?<=^LINESTRING\()[-+]?\d*\.?\d+ [-+]?\d*\.?\d+
or
(?<=^LINESTRING\()[^,]+
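For readers who want to try the capturing-group variant outside a regex tester, here is a minimal Python sketch. The name first_point is an assumption for this example; the pattern is the one with two capturing groups shown above.

```python
import re

FIRST_POINT = re.compile(r"^LINESTRING\(([-+]?\d*\.?\d+) ([-+]?\d*\.?\d+)")

def first_point(wkt):
    """Return the first coordinate pair as two strings, or None."""
    m = FIRST_POINT.match(wkt)
    return m.groups() if m else None

wkt = ("LINESTRING(-111.928130305897 33.4490602213529,"
       "-111.928130305897 33.4490602213529)")
print(" ".join(first_point(wkt)))  # -111.928130305897 33.4490602213529
```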
What about using the right tool for the job? This is a Perl module that parses WKT properly:
Code :
#!/usr/bin/env perl
use strict; use warnings;
use Geo::WKT::Simple;
my $arr = [];
push @{ $arr }, Geo::WKT::Simple::wkt_parse_linestring("LINESTRING(-111.928130305897 33.4490602213529,-111.928130305897 33.4490602213529)");
print join "\n", @{ $arr->[0] };
Output :
-111.928130305897
33.4490602213529
Doc :
https://metacpan.org/pod/distribution/Geo-WKT/lib/Geo/WKT.pod

Regex: find occurrence of a digit in a string

Problem: I want to match strings that contain the same digit twice. The two occurrences can be anywhere in the string.
Example for better understanding my question:
3abc3
a22de
b7abc7a
These strings must match. If a string contains two digits but they are different, it shouldn't match.
Example:
3abcd2 not supposed to match
3abc3 -> supposed to match
I tried using {n}, but it does not help, because it requires the two digits to follow each other directly.
You can use this grep:
grep -E '([0-9]).*\1' file
3abc3
a22de
b7abc7a
About this Regex:
([0-9]) # match and capture any digit in group #1
.* # match 0 or more of any character in between
\1 # using back-reference \1, make sure we have same digit as in group #1
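The same backreference pattern also works outside grep. As a small sketch in Python (has_repeated_digit is a made-up name for this example):

```python
import re

# A digit, anything in between, then the same digit again.
SAME_DIGIT_TWICE = re.compile(r"([0-9]).*\1")

def has_repeated_digit(text):
    return SAME_DIGIT_TWICE.search(text) is not None

for s in ("3abc3", "a22de", "b7abc7a", "3abcd2"):
    print(s, has_repeated_digit(s))
```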