2 digits only allowed once (Regex) - regex

I'm trying to check if a level is valid or not.
The level is of the form: (but they're 998 more of these)
bbbbbbb
b41111b
b81400b
b81010b
b01121b
b08001b
bbbbbbb
The level must follow a few rules. I have written a regex to conform all rules but one:
The level must contain exactly 1 times 2 and 1 times 4.
(Notice in the level above there's two 4's and one 2. The level above is not valid.)
This is a school project so please guide me through to the answer.
Thanks in advance.
EDIT:
My current regex is:
^b{' + str(length) + r'}\n(b{1}[0-8]{' + str(length - 2) + r'}b{1}\n)+b{' + str(length) + '}$
For the level above, length = 7
Note that it doesn't even try to filter this wrong level above.
The other rules are:
The level must be surrounded by a 'b'
The level can only contain the char 'b' and numbers smaller than 9.
There can only be one 2
There can only be one 4
My regex above does take rules 1 and 2 into account, but I still need to figure out rules 3 and 4.
I have tried lookarounds and such, couldn't figure it out.

This is the regex that will meet all of your conditions:
^b(?!(?:[^2]*2){2,})(?!(?:[^4]*4){2,})[b0-8]*b$
It starts and end with b Using ^b and $b
It comprises of only letter b and numbers 0-8 by[b0-8]*
It won't allow more than one digit 2 by using (?!(?:[^2]*2){2,})
It won't allow more than one digit 4 by using (?!(?:[^4]*4){2,})

Well, a regex for exactly 1 a would be [^a]*a[^a]* (that is, a possibly empty sequence of non-a's, followed by an a, followed by a possibly empty sequence of non-a's). I'll leave it an an exercise how to handle multiple lines & making sure this covers the whole level.
For exactly 1 a and 1 b: [^ab]*((a[^ab]*b)|(b[^ab]*a))[^ab]*, with the same caveats. Explanation: a sequence of non-a-or-b's, follow by EITHER 1) an a, a run of non-a-or-b's, and a b, or 2) a b, a run of non-a-or-b's, and an a, with THAT followed by a run of non-a-or-b's.

The answer is to use negative lookaheads anchored to start of input.
It's unclear what you're trying to match, so I will just use a placeholder <your-regex> for your current regex:
^(?!.*?2.*?2)(?!.*?4.*?4)<your-regex>
See a live demo of rhis correctly rejecting more than 1 "2".

Related

Sed: Substitute letters on certain positions

I have a file with the structure:
N1H3O1 C2H2
C1H4 H201
C1H1N1 N1H3
C2N1O1P1H3 P5
What I am trying to do is to count the sum of coefficients in each of the formulae. Thus, the desire output is:
1+3+1 5 2+2 4
1+4 5 2+1 3
1+1+1 3 3+1 4
2+1+1+1+3 8 5 5
What I did is a simple replacement of each letter with "+" and then deleting the first " +".
I however would like to know how to do it in a more proper way in sed, using branch and flow operators.
The problem with your input is the 0 which is used instead of O, which might make it difficult to design a regular expression for it, which you can see here:
([^A-Z]+)*([0-9]+)
Other than that, you might be able to capture the numbers by simply adding ([^A-Z]+).
However, you may not wish to do this task with regular expression, since your data except for that 0 is pretty structured, and you could maybe write a script to do so.

Regex for UK registration number

I've been playing with creating a regular expression for UK registration numbers but have hit a wall when it comes to restricting overall length of the string in question. I currently have the following:
^(([a-zA-Z]?){1,3}(\d){1,3}([a-zA-Z]?){1,3})
This allows for an optional string (lower or upper case) of between 1 and 3 characters, followed by a mandatory numeric of between 1 and 3 characters and finally, a mandatory string (lower or upper case) of between 1 and 3 characters.
This works fine but I then want to apply a max length of 7 characters to the entire string but this is where I'm failing. I tried adding a 1,7 restriction to the end of the regex but the three 1,3 checks are superseding it and therefore allowing a max length of 9 characters.
Examples of registration numbers that need to pass are as follows:
A1
AAA111
AA11AAA
A1AAA
A11AAA
A111AAA
In the examples above, the A's represents any letter, upper or lower case and the 1's represent any number. The max length is the only restriction that appears not to be working. I disable the entry of a space so they can be assumed as never present in the string.
If you know what lengths you are after, I'd recommend you use the .length property which some languages expose for string length. If this is not an option, you could try using something like so: ^(?=.{1,7})(([a-zA-Z]?){1,3}(\d){1,3}([a-zA-Z]?){1,3})$, example here.

Integer range and multiple of

I have a number of fields I want to validate on text entry with a regex for both matching a range (0..120) and must be a multiple of 5.
For example, 0, 5, 25, 120 are valid. 1, 16, 123, 130 are not valid.
I think I have the regex for multiple of 5:
^\d*\d?((5)|(0))\.?((0)|(00))?$
and the regex for the range:
120|1[01][0-9]|[2-9][0-9]
However, I dont know how to combine these, any help much appreciated!
You can't do that with a simple regex. At least not the range-part (especially if the range should be generic/changeable).
And even if you manage to write the regex, it will be very complex and unreadable.
Write the validation on your own, using a parseStringToInt() function of your language and simple < and > checks.
Update: added another regex (see below) to be used when the range of values is not 0..120 (it can even be dynamic).
The second regex in the question does not match numbers smaller than 20. You can change it to match smaller numbers that always end in 0 or 5 to be multiple by 5:
\b(120|(1[01]|[0-9])?[05])\b
How it works (starting from inside):
(1[01]|[0-9])? matches 10, 11 or any one-digit number (0 to 9); these are the hundreds and tens in the final number; the question mark (?) after the sub-expression makes it match 0 or 1 times; this way the regex can also match numbers having only one digit (0..9);
[05] that follows matches 0 or 5 on the last digit (the units); only the numbers that end in 0 or 5 are multiple of 5;
everything is enclosed in parenthesis because | has greater priority than \b;
the outer \b matches word boundaries; they prevent the regex match only 1..3 digits from a longer number or numbers that are embedded in strings; it prevents it matching 15 in 150 or 120 in abc120.
Using dynamic range of values
The regex above is not very complex and it can be used to match numbers between 0 and 120 that are multiple of 5. When the range of values is different it cannot be used any more. It can be modified to match, lets say, numbers between 20 and 120 (as the OP asked in a comment below) but it will become harder to read.
More, if the range of allowed values is dynamic then a regex cannot be used at all to match the values inside the range. The multiplicity with 5 however can be achieved using regex :-)
For dynamic range of values that are multiple of 5 you can use this expression:
\b([1-9][0-9]*)?[05]\b
Parse the matched string as integer (the language you use probably provides such a function or a library that contains it) then use the comparison operators (<, >) of the host language to check if the matched value is inside the desired range.
At the risk of being painfully obvious
120|1[01][05]|[2-9][05]
Also, why the 2?

More efficient regex than "(cg[agct])|(ag[ag])"

I need a regex to match any of:
cgt, cgc, cga, cgg, aga, agg
They're DNA codons. Is the regex I've given, (cg[agct])|(ag[ag]), as efficient as it could be? It somehow seems clunky, and I wonder if I could use the fact that there has to be a g as the second character.
To sum up the comments:
It appears that what you have is pretty good.
The one suggestion is to change the grouping into a non-capturing group (or remove them all together).
Something like this seems optimal:
cg[agct]|ag[ag]
If you had a set that was FAR more frequent than the others, you could possibly speed it up (slightly) by adding it literally to the alternation:
cgg|cg[act]|ag[ag]
Internally, most regex engines will turn small character classes like this into their own alternation. It may be fastest to expand out the alternation all the way, or in different groups, to see the performance impact.
I would suggest that you should profile all three of these approaches with your regex engine:
cg[agct]|ag[ag]
cga|cgc|cgg|cgt|aga|agg
[ac]g[agct](?<!agt|agc)
The last one is the closest to an answer to your question, since it leverages the fact that a "g" is required in the middle and used a "negative lookbehind" to eliminate the invalid sets.
One other thing to check would be if just finding all instances of [ac]g[agct] (including the undesired "agt" and "agc") and then filtering them in your language of choice would be fastest.
EDIT, FOR SCIENCE!
Here is a chart of the various types of matches and failures, along with their number of steps required to reach a conclusion (match or no match).
cg[agct]|ag[ag] [ac]g[agct](?<!agt|agc) cga|cgc|cgg|cgt|aga|agg
agg 4 6 10
agc 4 8 10
cga 3 6 3
axa 3 2 8
cxa 3 2 10
xxx 2 1 6
So, it appears that (as we guessed), the methods have entirely different properties.
My hunch about splitting everything into an alternation was wrong. Don't use that.
Your hunch about utilizing the "g" in the middle is warranted, except that for partial matches (agg, for example) and full matches (cga, for example) take longer. However, throwing away bad results is slightly faster with the negative lookbehind version.
So, to compensate for the worst case, (8 checks versus 3 = delta -5) we would have to see at least 5 failing character positions. (2 checks versus 3 = delta 1 or 1 check versus 2 = delta 1)
I guess, then, that you should use the negative lookbehind version if you anticipate that you will fail a match at 5 positions for every match that you find.
EDIT 2, A SLIGHTLY BETTER VERSION
Looking at how exactly the regex is going to evaluate each match, we can craft a better version that will let about half of the matches "fast track", and will also reduce the number of characters checked when the match fails.
[ca]g(?:[ag]|(?<!ag)[ct])
agg 4
agc 7
cga 4
axa 2
cxa 2
xxx 1
This reduces all of the positive matches times by one or two comparisons each.
Based on this, I would recommend using [ca]g(?:[ag]|(?<!ag)[ct]) if you expect to check 4 or more positions for each match.

What would be the best (runtime performance) application or pattern or code or library for matching string patterns

I have been trying to figure out a decent way of matching string patterns. I will try my best to provide as much information as I can regarding what I am trying to do.
The simplest thougt is that there are some specified patterns and we want to know which of these patterns match completely or partially to a given request. The specified patterns hardly change. The amount of requests are about 10K per day but the results have to pe provided ASAP and thus runtime performance is the highest priority.
I have been thinking of using Assembly Compiled Regular Expression in C# for this, but I am not sure if I am headed in the right direction.
Scenario:
Data File:
Let's assume that data is provided as an XML request in a known schema format. It has anywehere between 5-20 rows of data. Each row has 10-30 columns. Each of the columns also can only have data in a pre-defined pattern. For example:
A1- Will be "3 digits" followed by a
"." follwed by "2 digits" -
[0-9]{3}.[0-9]{2}
A2- Will be "1
character" follwoed by "digits" -
[A-Z][0-9]{4}
The sample would be something like:
<Data>
<R1>
<A1>123.45</A1>
<A2>A5567</A2>
<A4>456EV</A4>
<An>xxx</An>
</R1>
</Data>
Rule File:
Rule ID A1 A2
1001 [0-9]{3}.45 A55[0-8]{2}
2002 12[0-9].55 [X-Z][0-9]{4}
3055 [0-9]{3}.45 [X-Z][0-9]{4}
Rule Location - I am planning to store the Rule IDs in some sort of bit mask.
So the rule IDs are then listed as location on a string
Rule ID Location (from left to right)
1001 1
2002 2
3055 3
Pattern file: (This is not the final structure, but just a thought)
Column Pattern Rule Location
A1 [0-9]{3}.45 101
A1 12[0-9].55 010
A2 A55[0-8]{2} 100
A2 [X-Z][0-9]{4} 011
Now let's assume that SOMEHOW (not sure how I am going to limit the search to save time) I run the regex and make sure that A1 column is only matched aginst A1 patterns and A2 column against A2 patterns. I would end up with the follwoing reults for "Rule Location"
Column Pattern Rule Location
A1 [0-9]{3}.45 101
A2 A55[0-8]{2} 100
Doing AND on each of the loctions
gives me the location 1 - 1001 -
Complete match.
Doing XOR on each of the loctions
gives me the location 3 - 3055 -
Partial match. (I am purposely not
doing an OR, because that would have
returned 1001 and 3055 as the result
which would be wrong for partial
match)
The final reulsts I am looking for are:
1001 - Complete Match
3055 - Partial Match
Start Edit_1: Explanation on Matching results
Complete Match - This occurs when all
of the patterns in given Rule are
matched.
Partial Match - This ocurrs when NOT
all of the patterns in given Rule are
matched, but atleast one pattern
matches.
Example Complete Match (AND):
Rule ID 1001 matched for A1(101) and A2 (100). If you look at the first charcter in 101 and 100 it is "1". When you do an AND - 1 AND 1 the result is 1. Thus position 1 i.e. 1001 is a Complete Match.
Exmple Partial Match (XOR):
Rule ID 3055 matched for A1(101). If you look at the last character in 101 and 100 it is "1" and "0". When you do an XOR - 1 XOR 0 the result is 1. Thus position 3 i.e. 3055 is Partial Match.
End Edit_1
Input:
The data will be provided in some sort of XML request. It can be one big request with 100K Data nodes or 100K requests with one data node only.
Rules:
The matching values have to be intially saved as some sort of pattern to make it easier to write and edit. Let's assume that there are approximately 100K rules.
Output:
Need to know which rules matched completely and partially.
Preferences:
I would prefer doing as much of the coding as I can in C#. However if there is a major performance boost, I can use a different language.
The "Input" and "Output" are my requirements, how I manage to get the "Output" does not matter. It has to be fast, lets say each Data node has to be processed in approximately 1 second.
Questions:
Are there any existing pattern or
framewroks to do this?
Is using Regex the right path
specifically Assembly Compiled
Regex?
If I end up using Regex how can I
specify for A1 patterns to only
match against A1 column?
If I do specify rule locations in a
bit type pattern. How do I process
ANDs and XORs when it grows to be
100K charcter long?
I am looking for any suggestions or options that I should consider.
Thanks..
The regular expression API only tells you when they fully matched, not when they partially matched. What you therefore need is some variation on a regular expression API that lets you try to match multiple regular expressions at once, and at the end can tell you which matched fully, and which partially matched. Ideally one that lets you precompile a set of patterns so you can avoid compilation at runtime.
If you had that then you could match your A1 patterns against the AI column, A2 columns against the A2 pattern, and so on. Then do something with the list of partial and full regular expressions.
The bad news is that I don't know of any software out there that implements this.
The good news is that the strategy described in http://swtch.com/~rsc/regexp/regexp1.html should be able to implement this. In particular the State sets can be extended to have information about your current state in multiple patterns at the same time. This extended set of State sets will result in a more complex state diagram (because you're tracking more stuff), and a more complex return at the end (you're returning a set of State sets), but runtime won't be changed a bit, whether you're matching one pattern or 50.