More efficient regex than "(cg[agct])|(ag[ag])" - regex

I need a regex to match any of:
cgt, cgc, cga, cgg, aga, agg
They're DNA codons. Is the regex I've given, (cg[agct])|(ag[ag]), as efficient as it could be? It somehow seems clunky, and I wonder if I could use the fact that there has to be a g as the second character.

To sum up the comments:
It appears that what you have is pretty good.
The one suggestion is to change the grouping into a non-capturing group (or remove them all together).
Something like this seems optimal:
cg[agct]|ag[ag]
If you had a set that was FAR more frequent than the others, you could possibly speed it up (slightly) by adding it literally to the alternation:
cgg|cg[act]|ag[ag]
Internally, most regex engines will turn small character classes like this into their own alternation. It may be fastest to expand out the alternation all the way, or in different groups, to see the performance impact.
I would suggest that you should profile all three of these approaches with your regex engine:
cg[agct]|ag[ag]
cga|cgc|cgg|cgt|aga|agg
[ac]g[agct](?<!agt|agc)
The last one is the closest to an answer to your question, since it leverages the fact that a "g" is required in the middle and used a "negative lookbehind" to eliminate the invalid sets.
One other thing to check would be if just finding all instances of [ac]g[agct] (including the undesired "agt" and "agc") and then filtering them in your language of choice would be fastest.
EDIT, FOR SCIENCE!
Here is a chart of the various types of matches and failures, along with their number of steps required to reach a conclusion (match or no match).
cg[agct]|ag[ag] [ac]g[agct](?<!agt|agc) cga|cgc|cgg|cgt|aga|agg
agg 4 6 10
agc 4 8 10
cga 3 6 3
axa 3 2 8
cxa 3 2 10
xxx 2 1 6
So, it appears that (as we guessed), the methods have entirely different properties.
My hunch about splitting everything into an alternation was wrong. Don't use that.
Your hunch about utilizing the "g" in the middle is warranted, except that for partial matches (agg, for example) and full matches (cga, for example) take longer. However, throwing away bad results is slightly faster with the negative lookbehind version.
So, to compensate for the worst case, (8 checks versus 3 = delta -5) we would have to see at least 5 failing character positions. (2 checks versus 3 = delta 1 or 1 check versus 2 = delta 1)
I guess, then, that you should use the negative lookbehind version if you anticipate that you will fail a match at 5 positions for every match that you find.
EDIT 2, A SLIGHTLY BETTER VERSION
Looking at how exactly the regex is going to evaluate each match, we can craft a better version that will let about half of the matches "fast track", and will also reduce the number of characters checked when the match fails.
[ca]g(?:[ag]|(?<!ag)[ct])
agg 4
agc 7
cga 4
axa 2
cxa 2
xxx 1
This reduces all of the positive matches times by one or two comparisons each.
Based on this, I would recommend using [ca]g(?:[ag]|(?<!ag)[ct]) if you expect to check 4 or more positions for each match.

Related

Sed: Substitute letters on certain positions

I have a file with the structure:
N1H3O1 C2H2
C1H4 H201
C1H1N1 N1H3
C2N1O1P1H3 P5
What I am trying to do is to count the sum of coefficients in each of the formulae. Thus, the desire output is:
1+3+1 5 2+2 4
1+4 5 2+1 3
1+1+1 3 3+1 4
2+1+1+1+3 8 5 5
What I did is a simple replacement of each letter with "+" and then deleting the first " +".
I however would like to know how to do it in a more proper way in sed, using branch and flow operators.
The problem with your input is the 0 which is used instead of O, which might make it difficult to design a regular expression for it, which you can see here:
([^A-Z]+)*([0-9]+)
Other than that, you might be able to capture the numbers by simply adding ([^A-Z]+).
However, you may not wish to do this task with regular expression, since your data except for that 0 is pretty structured, and you could maybe write a script to do so.

Integer range and multiple of

I have a number of fields I want to validate on text entry with a regex for both matching a range (0..120) and must be a multiple of 5.
For example, 0, 5, 25, 120 are valid. 1, 16, 123, 130 are not valid.
I think I have the regex for multiple of 5:
^\d*\d?((5)|(0))\.?((0)|(00))?$
and the regex for the range:
120|1[01][0-9]|[2-9][0-9]
However, I dont know how to combine these, any help much appreciated!
You can't do that with a simple regex. At least not the range-part (especially if the range should be generic/changeable).
And even if you manage to write the regex, it will be very complex and unreadable.
Write the validation on your own, using a parseStringToInt() function of your language and simple < and > checks.
Update: added another regex (see below) to be used when the range of values is not 0..120 (it can even be dynamic).
The second regex in the question does not match numbers smaller than 20. You can change it to match smaller numbers that always end in 0 or 5 to be multiple by 5:
\b(120|(1[01]|[0-9])?[05])\b
How it works (starting from inside):
(1[01]|[0-9])? matches 10, 11 or any one-digit number (0 to 9); these are the hundreds and tens in the final number; the question mark (?) after the sub-expression makes it match 0 or 1 times; this way the regex can also match numbers having only one digit (0..9);
[05] that follows matches 0 or 5 on the last digit (the units); only the numbers that end in 0 or 5 are multiple of 5;
everything is enclosed in parenthesis because | has greater priority than \b;
the outer \b matches word boundaries; they prevent the regex match only 1..3 digits from a longer number or numbers that are embedded in strings; it prevents it matching 15 in 150 or 120 in abc120.
Using dynamic range of values
The regex above is not very complex and it can be used to match numbers between 0 and 120 that are multiple of 5. When the range of values is different it cannot be used any more. It can be modified to match, lets say, numbers between 20 and 120 (as the OP asked in a comment below) but it will become harder to read.
More, if the range of allowed values is dynamic then a regex cannot be used at all to match the values inside the range. The multiplicity with 5 however can be achieved using regex :-)
For dynamic range of values that are multiple of 5 you can use this expression:
\b([1-9][0-9]*)?[05]\b
Parse the matched string as integer (the language you use probably provides such a function or a library that contains it) then use the comparison operators (<, >) of the host language to check if the matched value is inside the desired range.
At the risk of being painfully obvious
120|1[01][05]|[2-9][05]
Also, why the 2?

How to programmatically learn regexes?

My question is a continuation of this one. Basically, I have a table of words like so:
HAT18178_890909.098070313.1
HAT18178_890909.098070313.2
HAT18178_890909.143412462.1
HAT18178_890909.143412462.2
For my purposes, I do not need the terminal .1 or .2 for this set of names. I can manually write the following regex (using Python syntax):
r = re.compile('(.*\.\d+)\.\d+')
However, I cannot guarantee that my next set of names will have a similar structure where the final 2 characters will be discardable - it could be 3 characters (i.e. .12) and the separator could change as well (i.e. . to _).
What is the appropriate way to either explicitly learn a regex or to determine which characters are unnecessary?
It's an interesting problem.
X y
HAT18178_890909.098070313.1 HAT18178_890909.098070313
HAT18178_890909.098070313.2 HAT18178_890909.098070313
HAT18178_890909.143412462.1 HAT18178_890909.143412462
HAT18178_890909.143412462.2 HAT18178_890909.143412462
The problem is that there is not a single solution but many.
Even for a human it is not clear what the regex should be that you want.
Based on this data, I would think the possibilities to learn are:
Just match a fixed width of 25: .{25}
Fixed first part: HAT18178_890909.
Then:
There's only 2 varying numbers on each single spot (as you show 2 cases).
So e.g. [01] (either 0 or 1), [94] the next spot and so on would be a good solution.
The obvious one would be \d+
But it could also be \d{9}
You see, there are multiple correct answers.
These regexes would still work if the second point would be an underscore instead.
My conclusion:
The problem is that it is much more work to prepare the data for machine learning than it is to create a regex. If you want to be sure you cover everything, you need to have complete data, so then a regex is probably less effort.
You could split on non-alphanumeric characters;
[^a-zA-Z0-9']+
That would get you, in this case, few strings like this:
HAT18178
890909
098070313
1
From there on you can simply discard the last one if that's never necessary, and continue on processing the first sequences

2 digits only allowed once (Regex)

I'm trying to check if a level is valid or not.
The level is of the form: (but they're 998 more of these)
bbbbbbb
b41111b
b81400b
b81010b
b01121b
b08001b
bbbbbbb
The level must follow a few rules. I have written a regex to conform all rules but one:
The level must contain exactly 1 times 2 and 1 times 4.
(Notice in the level above there's two 4's and one 2. The level above is not valid.)
This is a school project so please guide me through to the answer.
Thanks in advance.
EDIT:
My current regex is:
^b{' + str(length) + r'}\n(b{1}[0-8]{' + str(length - 2) + r'}b{1}\n)+b{' + str(length) + '}$
For the level above, length = 7
Note that it doesn't even try to filter this wrong level above.
The other rules are:
The level must be surrounded by a 'b'
The level can only contain the char 'b' and numbers smaller than 9.
There can only be one 2
There can only be one 4
My regex above does take rules 1 and 2 into account, but I still need to figure out rules 3 and 4.
I have tried lookarounds and such, couldn't figure it out.
This is the regex that will meet all of your conditions:
^b(?!(?:[^2]*2){2,})(?!(?:[^4]*4){2,})[b0-8]*b$
It starts and end with b Using ^b and $b
It comprises of only letter b and numbers 0-8 by[b0-8]*
It won't allow more than one digit 2 by using (?!(?:[^2]*2){2,})
It won't allow more than one digit 4 by using (?!(?:[^4]*4){2,})
Well, a regex for exactly 1 a would be [^a]*a[^a]* (that is, a possibly empty sequence of non-a's, followed by an a, followed by a possibly empty sequence of non-a's). I'll leave it an an exercise how to handle multiple lines & making sure this covers the whole level.
For exactly 1 a and 1 b: [^ab]*((a[^ab]*b)|(b[^ab]*a))[^ab]*, with the same caveats. Explanation: a sequence of non-a-or-b's, follow by EITHER 1) an a, a run of non-a-or-b's, and a b, or 2) a b, a run of non-a-or-b's, and an a, with THAT followed by a run of non-a-or-b's.
The answer is to use negative lookaheads anchored to start of input.
It's unclear what you're trying to match, so I will just use a placeholder <your-regex> for your current regex:
^(?!.*?2.*?2)(?!.*?4.*?4)<your-regex>
See a live demo of rhis correctly rejecting more than 1 "2".

find a string with at least n matching elements

I have a list of numbers that I want to find at least 3 of...
here is an example
I have a large list of numbers in a sql database in the format of (for example)
01-02-03-04-05-06
06-08-19-24-25-36
etc etc
basically 6 random numbers between 0 and 99.
Now I want to find the strings where at least 3 of a set of given numbers occurs.
For example:
given: 01-02-03-10-11-12
return the strings that have at least 3 of those numbers in them.
eg
01-05-06-09-10-12 would match
03-08-10-12-18-22 would match
03-09-12-18-22-38 would not
I am thinking that there might be some algorithm or even regular expression that could match this... but my lack of computer science textbook experience is tripping me up I think.
No - this is not a homework question! This is for an actual application!
I am developing in ruby, but any language answer would be appreciated
You can use a string replacement to replace - with | to turn 01-02-03-10-11-12 into 01|02|03|10|11|12. Then wrap it like this:
((01|02|03|10|11|12).*){3}
This will find any of the digit pairs, then ignore any number of characters... 3 times. If it matches, then success.