Regex quantifier more than one group - regex

I need a regex to get a sequence of number 1 followed by number 0 and the total numbers should be equal to a max length. Is there a way to do something like (([1]+)([0]+)){maxLength} ?
Ex.:
maxLength = 7
10 -> should not pass (total length < maxLength)
1111100 -> should match
1000000 -> should match
11110000000 -> should match 1111000.
111111111111 -> should match 1111111.
Plus: The sequence could be 0 followed by 1, and the greater the amount of 1 the better (I don't know if it's possible in only one regex).
000000001111 -> should get 0001111.
I'm focusing on 1 followed by 0.
I started with [1]+[0]+,
after I quantified the 0s ([1]+)([0]{1,7}),
but it still gives more 0s than I want.
Then I thought of ([1]{7,}|[1]{6}[0]{1}|[1]{5}[0]{2}|[1]{4}[0]{3}|[1]{3}[0]{4}|[1]{2}[0]{5}|[1]{1}[0]{6}),
and ok, it works. BUT if maxLength = 100 the above solution is not viable.
Is there some way to count the length of the first matched group and then the second group to be the difference from the first one?
Or something like (([1]+)([0]+)){7} ?

My attempt using branch reset group:
0*(?|(1[10]{6})|([10]{6}1))
See an online demo. You can use the result from 1st capture group.
0* - Zero or more literal zeros (greedy), up to:
(?| - Open branch reset group:
(1[10]{6}) - 1st capture group holding a literal 1 followed by 6 ones or zeros.
| - Or:
([10]{6}1) - 1st capture group holding 6 ones or zeros followed by a literal 1.
) - Close branch reset group.

It seems you just want:
^(?:(?=1+0*$)|(?=0+1*$))[01]{7}
Here the {7} can be replaced with whatever the max length is.

I think the regex can be as simple as:
/0*([01]{7})/
example:
const result = `
10
1111100
1000000
11110000000
111111111111
000000001111
`.split("\n").reduce((acc, str) => {
  const m = str.match(/0*([01]{7})/);
  m && acc.push(m[1]);
  return acc;
}, []);
console.log(result)
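Another option, sketched here in Python and assuming the length check is allowed to live outside the pattern: match a run of 1s followed by 0s with 1+0*, keep only runs that are at least maxLength long, and cut them to that length. The name ones_then_zeros is made up for illustration, and the sketch only covers the "1 followed by 0" case from the question.

```python
import re

def ones_then_zeros(text, max_length):
    """Return the first run of 1s followed by 0s, cut to max_length, or None."""
    for m in re.finditer(r"1+0*", text):
        run = m.group()
        # The run must be long enough; anything beyond max_length is dropped.
        if len(run) >= max_length:
            return run[:max_length]
    return None

for sample in ["10", "1111100", "1000000", "11110000000", "111111111111"]:
    print(sample, "->", ones_then_zeros(sample, 7))
```

Because the length is an ordinary argument, nothing in the pattern has to change when maxLength is 100 instead of 7.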

Related

Regex to enter a decimal number digit by digit

I have a requirement where user can input only between 0.01 to 100.00 in a textbox. I am using regex to limit the data entered. However, I cannot enter a decimal point, like 95.83 in the regex. Can someone help me fix the below regex?
(^100([.]0{1,2})?)$|(^\d{1,2}([.]\d{1,2})?)$
if I copy paste the value, it passes. But unable to type a decimal point.
Please advise.
Link to regex tester: https://regex101.com/r/b2BF6A/1
Link to demo: https://stackblitz.com/edit/react-9h2xsy
The regex
You can use the following regex:
See regex in use here
^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]|[1-9]0)|10{2}(?:\.0{0,2})?)$
How it works
^(?:...|...|...)$ this anchors the pattern to ensure it matches the entire string
^ assert position at the start of the line
(?:...|...|...) non-capture group - used to group multiple alternations
$ assert position at the end of the line
(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})? first option
(?:\d?[1-9]|[1-9]0) match either of the following
\d?[1-9] optionally match any digit, then match a digit in the range of 1 to 9
[1-9]0 match any digit between 1 and 9, followed by 0
(?:\.\d{0,2})? optionally match the following
\. this character . literally
\d{0,2} match any digit between 0 and 2 times
0{0,2}\.(?:\d?[1-9]|[1-9]0) second option
0{0,2} match 0 between 0 and 2 times
\. match this character . literally
(?:\d?[1-9]|[1-9]0) match either of the following options
\d?[1-9] optionally match any digit, then match a digit in the range of 1 to 9
[1-9]0 match any digit between 1 and 9, followed by 0
10{2}(?:\.0{0,2})? third option
10{2} match 100
(?:\.0{0,2})? optionally match ., followed by 0 between 0 and 2 times
How it works (in simpler terms)
With the above descriptions for each alternation, this is what they will match:
Any two-digit number other than 0 or 00, optionally followed by any two-digit decimal.
In terms of a range, it's 1.00-99.99 with:
Optional leading zero: 01.00-99.99
Optional decimal: 01-99, or 01.-99, or 01.0-01.99
Any two-digit decimal other than 0 or 00
In terms of a range, it's .01-.99 with:
Optional leading zeroes: 00.01-00.99 or 0.01-0.99
Literally 100, followed by optional decimals: 100, or 100., or 100.0, or 100.00
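To see the three alternatives at work, here is a small sketch that runs the pattern against a few inputs. It uses Python's re module instead of JavaScript purely for illustration (the pattern itself is unchanged), and the sample values and the helper name is_valid are my own.

```python
import re

# Pattern copied from the answer above, split over three lines
# so that each alternative is visible.
PATTERN = re.compile(
    r"^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?"   # 1-99 with optional decimals
    r"|0{0,2}\.(?:\d?[1-9]|[1-9]0)"            # .01-.99 with optional leading zeros
    r"|10{2}(?:\.0{0,2})?)$"                   # 100 with optional zero decimals
)

def is_valid(value):
    return PATTERN.match(value) is not None

accepted = ["95.83", "1", "01", "99.99", ".01", "0.01", "00.5", "100", "100.", "100.00"]
rejected = ["0", "0.00", "100.01", "101", "99.999", ""]

for value in accepted + rejected:
    print(repr(value), is_valid(value))
```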
The code
RegExp vs /pattern/
In your code, you can use either of the following options (replacing pattern with the pattern above):
new RegExp('pattern')
/pattern/
The first option above uses a string literal. This means that you must escape the backslash characters in the string in order for the pattern to be properly read:
^(?:(?:\\d?[1-9]|[1-9]0)(?:\\.\\d{0,2})?|0{0,2}\\.(?:\\d?[1-9]|[1-9]0)|10{2}(?:\\.0{0,2})?)$
The second option above allows you to avoid this and use the regex as is.
Here's a fork of your code using the second option.
Usability Issues
Please note that you'll run into a couple of usability issues with your current method of tackling this:
The user cannot erase all the digits they've entered. So if the user enters 100, they can only erase 00 and the 1 will remain. One option for resolving this is to make the entire non-capture group (with the alternations) optional by adding a ? after it. Whilst this does solve that issue, you now need to keep two regular expression patterns - one for user input and the other for validation. Alternatively, you could just test if the input is an empty string to allow it (but not validate the form until the field is filled).
The user cannot enter a number beginning with .. This is because we don't allow the input of . to go through your validation steps. The same rule applies here as the previous point made. You can allow it though if the value is . explicitly or add a new alternation of |\.
Similarly to my last point, you'll run into the issue for .0 when a user is trying to write something like .01. Again here, you can run the same test.
Similarly again, 0 is not valid input - same applies here.
A change to the regex that covers these states (0, ., .0, 0., 0.0, 00.0 - but not .00 alternatives) is:
^(?:(?:\d?[1-9]?|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]?|[1-9]0)|10{2}(?:\.0{0,2})?)$
Better would be to create logic for these cases to match them with a separate regex:
^0{0,2}\.?0?$
Usability Fixes
With the changes above in mind, your function would become:
See code fork here
handleChange(e) {
  console.log(e.target.value)
  const r1 = /^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?|0{0,2}\.(?:\d?[1-9]|[1-9]0)|10{2}(?:\.0{0,2})?)$/;
  const r2 = /^0{0,2}\.?0?$/
  if (r1.test(e.target.value)) {
    this.setState({
      [e.target.name]: e.target.value
    });
  } else if (r2.test(e.target.value)) {
    // Value is invalid, but permitted for usability purposes
    this.setState({
      [e.target.name]: e.target.value
    });
  }
}
This now allows the user to input those values, but also allows us to invalidate them if the user tries to submit it.
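The two-pattern logic can be tried outside React as well. The sketch below is a Python stand-in for illustration only: classify is a hypothetical helper, and the labels "valid", "partial" and "rejected" are my own names for the three branches (accept and valid, accept but not yet valid, ignore the keystroke).

```python
import re

# r1: complete, valid values (same pattern as in handleChange).
R1 = re.compile(
    r"^(?:(?:\d?[1-9]|[1-9]0)(?:\.\d{0,2})?"
    r"|0{0,2}\.(?:\d?[1-9]|[1-9]0)"
    r"|10{2}(?:\.0{0,2})?)$"
)
# r2: intermediate states that are allowed while typing.
R2 = re.compile(r"^0{0,2}\.?0?$")

def classify(value):
    if R1.match(value):
        return "valid"
    if R2.match(value):
        return "partial"
    return "rejected"

for value in ["95.83", "", "0", ".", "0.0", "00.0", ".00", "abc"]:
    print(repr(value), classify(value))
```

On submit, only the "valid" state would count; "partial" values are kept in the textbox but should fail form validation.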
Using the range 0.01 to 100.00 without padding is this (non-factored):
0\.(?:0[1-9]|[1-9]\d)|[1-9]\d?\.\d{2}|100\.00
Expanded
# 0.01 to 0.99
0 \.
(?:
0 [1-9]
| [1-9] \d
)
|
# 1.00 to 99.99
[1-9] \d? \.
\d{2}
|
# 100.00
100 \.
00
It can be made to have an optional cascade if incremental partial form
should be allowed.
That partial is shown here for the top regex range :
^(?:0(?:\.(?:(?:0[1-9]?)|[1-9]\d?)?)?|[1-9]\d?(?:\.\d{0,2})?|1(?:0(?:0(?:\.0{0,2})?)?)?)?$
The code line with stringed regex :
const newRegExp = new RegExp("^(?:0(?:\\.(?:(?:0[1-9]?)|[1-9]\\d?)?)?|[1-9]\\d?(?:\\.\\d{0,2})?|1(?:0(?:0(?:\\.0{0,2})?)?)?)?$");
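One way to sanity-check a "partial" pattern is to confirm that every prefix of a valid value matches it, since that is exactly what the user produces while typing digit by digit. A minimal sketch in Python follows (used here only as a test harness; the helper names prefixes and typable are made up):

```python
import re

# The partial pattern from above, split by alternative.
PARTIAL = re.compile(
    r"^(?:0(?:\.(?:(?:0[1-9]?)|[1-9]\d?)?)?"   # 0, 0., 0.0, 0.01 ... 0.99
    r"|[1-9]\d?(?:\.\d{0,2})?"                 # 1 ... 99.99 and their prefixes
    r"|1(?:0(?:0(?:\.0{0,2})?)?)?)?$"          # 1, 10, 100, 100., 100.0, 100.00
)

def prefixes(text):
    """All prefixes of text, from the empty string up to text itself."""
    return [text[:i] for i in range(len(text) + 1)]

def typable(text):
    """True if the value can be typed one character at a time."""
    return all(PARTIAL.match(p) for p in prefixes(text))

for value in ["95.83", "100.00", "0.01", "0.00", "101", "100.01"]:
    print(value, typable(value))
```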
_________________________
The regex 'partial' above requires the input to be blank or to start
with a digit. It also doesn't allow 1-9 with a preceding 0.
If that is all to be allowed, a simple mod is this :
^(?:0{0,2}(?:\.(?:(?:0[1-9]?)|[1-9]\d?)?)?|(?:[1-9]\d?|0[1-9])(?:\.\d{0,2})?|1(?:0(?:0(?:\.0{0,2})?)?)?)?$
which allows input like the following:
(It should be noted that doing this requires allowing the dot . as
a valid input but could be converted to 0. on the fly to be put
inside the input box.)
.1
00.01
09.90
01.
01.11
00.1
00
.
Stringed version :
"^(?:0{0,2}(?:\\.(?:(?:0[1-9]?)|[1-9]\\d?)?)?|(?:[1-9]\\d?|0[1-9])(?:\\.\\d{0,2})?|1(?:0(?:0(?:\\.0{0,2})?)?)?)?$"

Select only letters which are followed by a number

I am trying to select some codes from a PostgreSQL table.
I only want the codes that have numbers in them, e.g.
GD123
GD564
I don't want to pick any codes like GDTG or GDCNB.
Here's my query so far:
select regexp_matches(no_, '[a-zA-Z0-9]*$')
from myschema.mytable
which of course doesn't work.
Any help appreciated.
The pattern to match a string that has at least 1 letter followed by at least 1 number is '[A-Za-z]+[0-9]+'.
Now, if the valid patterns had to start with two letters, and then have 3 digits after as your examples show, then replace the + with {2} & {3} respectively, and enclose the pattern in ^ and $, like this: '^[A-Za-z]{2}[0-9]{3}$'
The regex match operator is ~ which you can use in the where clause:
SELECT no_
FROM myschema.mytable
WHERE no_ ~ '[A-Za-z]+[0-9]+'
You may use
CREATE TABLE tb1
(s character varying)
;
INSERT INTO tb1
(s)
VALUES
('GD123'),
('12345'),
('GDFGH')
;
SELECT * FROM tb1 WHERE s ~ '^(?![A-Za-z]+$)[a-zA-Z0-9]+$';
Details
^ - start of string
(?![A-Za-z]+$) - a negative lookahead that fails the match if there are only letters to the end of the string
[a-zA-Z0-9]+ - 1 or more alphanumeric chars
$ - end of string.
If you want to avoid matching 12345, use
'^(?![A-Za-z]+$)(?![0-9]+$)[a-zA-Z0-9]+$'
Here, (?![0-9]+$) will similarly fail the match if, from the string start, all chars up to the end of the string are digits.
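As a rough illustration of what the two lookahead patterns accept, here is a sketch that runs them over the three sample rows. Note the assumption: this uses Python's re engine as a stand-in, not PostgreSQL itself, so it only shows the pattern logic; the dictionary keys are made-up labels.

```python
import re

rows = ["GD123", "12345", "GDFGH"]

patterns = {
    # rejects strings that consist of letters only
    "no_letters_only": r"^(?![A-Za-z]+$)[a-zA-Z0-9]+$",
    # additionally rejects strings that consist of digits only
    "letters_and_digits": r"^(?![A-Za-z]+$)(?![0-9]+$)[a-zA-Z0-9]+$",
}

results = {
    name: [s for s in rows if re.search(pattern, s)]
    for name, pattern in patterns.items()
}

for name, kept in results.items():
    print(name, kept)
```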
Something like:
so=# with c(v) as (values('GD123'),('12345'),('GD ERT'))
select v ~ '[A-Z]{1,}[0-9]+', v from c;
?column? | v
----------+--------
t | GD123
f | 12345
f | GD ERT
(3 rows)
?..
If the format of the data you want to obtain is a set of characters followed by a set of digits (i.e., GD123) you can use the regex:
[a-zA-Z0-9]+[0-9]
This captures every digit and letter which is in front of the digits:
([A-Za-z]+\d+)

Find overlapping matches and submatches using regular expressions in Python

I have a string of characters (a DNA sequence) with a regular expression I designed to filter out possible matches, (?:ATA|ATT)[ATCGN]{144,16563}(?:AGA|AGG|TAA|TAG). Later I apply two filter conditions:
The sequence must be divisible by 3, len(match) % 3 == 0, and
There must be no stop codon (['AGA', 'AGG', 'TAA', 'TAG']) before the end of the string, not any(substring in list(sliced(match[:-3], 3)) for substring in [x.replace("U", "T") for x in stop_codons]).
However, when I apply these filters, I get no matches at all (before the filters I get around 200 matches). The way I'm searching for substrings in the full sequence is by running re.findall(regex, genome_fasta, overlapped=True), because matches could be submatches of other matches.
Is there something about regular expressions that I'm misunderstanding? To my knowledge, after the filters I should still have matches.
If there's anything else I need to add please let me know! (I'm using the regex package for Python 3.4, not the standard re package, because the standard one has no overlap support.)
EDIT 1:
Per comment: I'm looking for ORFs in the mitochondrial genome, but only considering those at least 150 nucleotides (characters) in length. Considering overlap is important because a match could include the first start codon in the string and the last stop codon in the string, but there could be another start codon in the middle. For example:
Given "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG", matches should be "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG" but also "ATAAAGCCATTTACCGTACATAGCACATTATAA", since both "TAG" and "TAA" are stop codons.
EDIT 2:
I totally misunderstood the comment; the full code for the method is:
typical_regex = r"%s[ATCGN]{%s,%s}%s" % (proc_start_codons, str(minimum_orf_length - 6), str(maximum_orf_length - 6), proc_stop_codons)
typical_fwd_matches = []
if re.search(typical_regex, genome_fasta, overlapped=True):
    for match in re.findall(typical_regex, genome_fasta, overlapped=True):
        if len(match) % 3 == 0:
            if not any(substring in list(sliced(match[:-3], 3)) for substring in [x.replace("U", "T") for x in stop_codons]):
                typical_fwd_matches.append(match)
print(typical_fwd_matches)
The typical_fwd_matches array is empty and the regex is rendered as (?:ATA|ATT)[ATCGN]{144,16563}(?:AGA|AGG|TAA|TAG) when printed to console/file.
I think you can do it this way.
The subsets will consist of ever decreasing size of the previous matches.
That's about all you have to do.
So, it's fairly straight forward to design the regex.
The regex will only match multiples of 3 chars.
The beginning and middle are captured in group 1.
This is used for the new text value which is just the last match
minus the last 3 chars.
Regex explained:
( # (1 start), Whole match minus last 3 chars
(?: ATA | ATT ) # Starts with one of these 3 char sequence
(?: # Cluster group
[ATCGN]{3} # Any 3 char sequence consisting of these chars
)+ # End cluster, do 1 to many times
) # (1 end)
(?: AGA | AGG | TAA | TAG ) # Last 3 char sequence, one of these
Python code sample:
Demo
import re
r = re.compile(r"((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)")
text = "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG"
m = r.search(text)
while m:
    print("Found: " + m.group(0))
    text = m.group(1)
    m = r.search(text)
Output:
Found: ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG
Found: ATAAAGCCATTTACCGTACATAGCACATTATAA
Found: ATTTACCGTACATAG
Using this method, the subsets being tested are these:
ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG
ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAAC
ATAAAGCCATTTACCGTACATAGCACATTA
ATTTACCGTACA
We can benchmark the time the regex takes to match these.
Regex1: ((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 1.63 s, 1627.59 ms, 1627594 µs
Matches per sec: 92,160
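If the two filters from the question (length divisible by 3, no in-frame stop codon before the end) are the actual goal, they can also be folded into the pattern. This is a different technique from the one above, sketched with the standard re module: a capturing lookahead lets matches from different start positions overlap, and a lazy codon-by-codon repeat stops at the first in-frame stop codon. The function name find_orfs and the min_length parameter are hypothetical.

```python
import re

START = "ATA|ATT"
STOP = "AGA|AGG|TAA|TAG"

# (?=(...)) is zero-width, so the scan advances one character at a time
# and every start codon gets its own attempt (overlaps are allowed).
# (?:[ATCGN]{3})*? walks whole codons lazily, so the match ends at the
# first in-frame stop codon and its length is always a multiple of 3.
ORF = re.compile(r"(?=((?:%s)(?:[ATCGN]{3})*?(?:%s)))" % (START, STOP))

def find_orfs(sequence, min_length=0):
    """Return every start-to-first-in-frame-stop stretch of at least min_length."""
    return [m.group(1) for m in ORF.finditer(sequence)
            if len(m.group(1)) >= min_length]

text = "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG"
for orf in find_orfs(text):
    print(orf)
```

With this approach the full 60-character string from the question is not reported, because it contains the in-frame stop codon TAA before its end; for the real genome, min_length=150 would apply the length limit from the question.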

Regex for: 6 digits or 0-6 signs (digits or stars) with at least one star

How to write regex to validate this pattern?
123456 - correct
*1 - correct
1* - correct
124** - correct
*1*2 - correct
* - correct
123456* - incorrect (size 7)
12345 - incorrect (size 5 without stars)
tried:
^[0-9]{6}$|^(([0-9]){1,6}([*]){1,5}){1,6}+$
But it allows more than 6 digits and doesn't allow a star to come before a digit.
There is no minimum/maximum count of "*" sign (but max count for all signs is 6).
Here you go:
^(?:\d{6}|(?=.*\*)[\d*]{1,6}|)$
Here is what it does:
^ <-- Start of the string (we don't want to capture more than that)
(?: <-- Start a non captured group (it will be used to do the "or" part)
\d{6} <-- 6 digits, nothing more
| <-- OR
(?=.*\*) <-- Look ahead for a '*' (you could replace the first * with {0,5})
[\d*] <-- digits or '*'
{1,6} <-- repeated one to six times (we know from the look ahead that there will be at least one '*')
| <-- OR (nothing)
) <-- End the non capturing group
$ <-- End of the string
I'm not quite sure if you want the empty case (but you said 0 to 6), if you actually want 1 to 6 just remove the last |
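A quick way to check the pattern against the examples from the question is a small sketch like this one (Python's re is used only as a test harness; the lists correct and incorrect are copied from the question, and is_valid is a made-up helper name):

```python
import re

PATTERN = re.compile(r"^(?:\d{6}|(?=.*\*)[\d*]{1,6}|)$")

def is_valid(value):
    return PATTERN.match(value) is not None

correct = ["123456", "*1", "1*", "124**", "*1*2", "*"]
incorrect = ["123456*", "12345"]

for value in correct + incorrect:
    print(repr(value), is_valid(value))

# The trailing "|" alternative is what lets the empty string through.
print(repr(""), is_valid(""))
```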
/ ([0-9] {6} ) | ( ( [0-9]{0-5} & [*]{1-5} ) {0-6})/
something like this?
[1-6]{6}|([1-6]|\*){1,6}[^123456]
this works for the inputs you gave...
If you want something else then update me...
You can't do this with just a regex. You also need a length check. However, here is a regex that will help.
([\d*]*\*[\d*]*)|(\d{6})
To validate the input, try something like this:
validate(input)
{
regex = "([\d*]*\*[\d*]*)|(\d{6})";
digitregex = ".*\d.*"; // this makes sure they aren't all stars
return (input.length < 7 and regex.matches(input) and digitregex.matches(input))
}
I am afraid that you will have to try for each position that the * might have, like this:
/([0-9]{6}|\*[0-9][0-9\*]{0,4}|[0-9]\*[0-9\*]{0,4}|[0-9]{2}\*[0-9\*]{0,3}|[0-9]{3}\*[0-9\*]{0,2}|[0-9]{4}\*[0-9\*]?|[0-9]{5}\*)/
Edit:
The above solution will however not allow **2
And I was wrong. You can do it with a look forward like Colin did. That is the way to go.
Try this : (updated)
([0-6]{6})|([0-6\*]{1,6})
It should work...
if any digits 0..9 are allowed try this regexp [0-9*]{2,6}
if only digits 1..6 as in your example [1-6*]{2,6}
it's a bit tricky cause also 12345 will be validated as correct
example here
You'll actually need a solution with look-around as already suggested by @Colin

Regular expression for bit strings with even number of 1s

Let L= { w in (0+1)* | w has even number of 1s}, i.e. L is the set of all bit strings with even number of 1s. Which one of the regular expressions below represents L?
A) (0*10*1)*
B) 0*(10*10*)*
C) 0*(10*1)* 0*
D) 0*1(10*1)* 10*
In my opinion, option D can never be correct because it does not match a bit string with zero 1s. But what about the other options? We only care about the number of 1s (even or not); the number of zeros doesn't matter.
Then which is the correct option and why?
A is false. It doesn't match 0110 (or any non-empty zeros-only string).
B is OK. I won't bother proving it here since the page margins are too small.
C doesn't match 010101010 (the zero in the middle is not matched).
D, as you said, doesn't match 00 or any other string with no ones.
So only B.
To solve such a problem you should
Supply counterexample patterns to all "incorrect" regexps. This will be either a string in L that is not matched, or a matched string out of L.
To prove the remaining "correct" pattern, you should answer two questions:
Does every string that matches the pattern belong to L? This can be done by devising properties each of matched strings should satisfy--for example, number of occurrences of some character...
Is every string in L matched by the regexp? This is done by dividing L into easily analyzable subclasses, and showing that each of them matches pattern in its own way.
(No concrete answers due to [homework]).
Examining the pattern B:
^0*(10*10*)*$
^ # match beginning of string
0* # match zero or more '0'
( # start group 1
10* # match '1' followed by zero or more '0'
10* # match '1' followed by zero or more '0'
)* # end group 1 - match zero or more times
$ # end of string
It's pretty obvious that this pattern will only match strings that have 0,2,4,... 1's.
Look for examples that should match but don't. 0, 11011, and 1100 should all match, but each one fails for one of those four
C is incorrect because it does not allow any 0s between the second 1 of one group and the first 1 of the next group.
This answer would be best for this language
(0*10*10*)
a quick python script actually eliminated all the possibilities:
import re
a = re.compile("(0*10*1)*")
b = re.compile("0*(10*10*)*")
c = re.compile("0*(10*1)* 0*")
d = re.compile("0*1(10*1)* 10*")
candidates = [('a',a),('b',b),('c',c),('d',d)]
tests = ['0110', '1100', '0011', '11011']
for test in tests:
    for candidate in candidates:
        if not candidate[1].match(test):
            candidates.remove(candidate)
            print("removed %s because it failed on %s" % (candidate[0], test))
ntests = ['1', '10', '01', '010', '10101']
for test in ntests:
    for candidate in candidates:
        if candidate[1].match(test):
            candidates.remove(candidate)
            print("removed %s because it matched on %s" % (candidate[0], test))
the output:
removed c because it failed on 0110
removed d because it failed on 0110
removed a because it matched on 1
removed b because it matched on 10
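Note that re.match only anchors at the start and is satisfied by a matching prefix (even an empty one), so the script above counts "matched" more loosely than the question intends. Here is a sketch of a stricter check, under two assumptions of mine: the spaces in options C and D are typographic and are dropped, and strings up to length 8 are enough to expose any mismatch. It compares re.fullmatch against the definition of L for every bit string up to that length. The name first_counterexample is made up.

```python
import re
from itertools import product

# Options from the question; spaces in C and D are assumed to be
# typographic and have been removed.
OPTIONS = {
    "A": r"(0*10*1)*",
    "B": r"0*(10*10*)*",
    "C": r"0*(10*1)*0*",
    "D": r"0*1(10*1)*10*",
}

def first_counterexample(pattern, max_len=8):
    """Return the first bit string on which the pattern disagrees with L, or None."""
    compiled = re.compile(pattern)
    for length in range(max_len + 1):
        for bits in product("01", repeat=length):
            s = "".join(bits)
            in_language = s.count("1") % 2 == 0          # definition of L
            matched = compiled.fullmatch(s) is not None   # whole string must match
            if in_language != matched:
                return s
    return None

results = {name: first_counterexample(p) for name, p in OPTIONS.items()}
for name, cex in results.items():
    if cex is None:
        print(name, "agrees with L on all strings up to length 8")
    else:
        print(name, "disagrees with L on %r" % cex)
```

A bounded search like this is evidence rather than a proof, but it is consistent with the answers above that pick B.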