What would be the best (runtime performance) application or pattern or code or library for matching string patterns - regex

I have been trying to figure out a decent way of matching string patterns. I will try my best to provide as much information as I can regarding what I am trying to do.
The simplest thougt is that there are some specified patterns and we want to know which of these patterns match completely or partially to a given request. The specified patterns hardly change. The amount of requests are about 10K per day but the results have to pe provided ASAP and thus runtime performance is the highest priority.
I have been thinking of using Assembly Compiled Regular Expression in C# for this, but I am not sure if I am headed in the right direction.
Scenario:
Data File:
Let's assume that data is provided as an XML request in a known schema format. It has anywehere between 5-20 rows of data. Each row has 10-30 columns. Each of the columns also can only have data in a pre-defined pattern. For example:
A1- Will be "3 digits" followed by a
"." follwed by "2 digits" -
[0-9]{3}.[0-9]{2}
A2- Will be "1
character" follwoed by "digits" -
[A-Z][0-9]{4}
The sample would be something like:
<Data>
<R1>
<A1>123.45</A1>
<A2>A5567</A2>
<A4>456EV</A4>
<An>xxx</An>
</R1>
</Data>
Rule File:
Rule ID A1 A2
1001 [0-9]{3}.45 A55[0-8]{2}
2002 12[0-9].55 [X-Z][0-9]{4}
3055 [0-9]{3}.45 [X-Z][0-9]{4}
Rule Location - I am planning to store the Rule IDs in some sort of bit mask.
So the rule IDs are then listed as location on a string
Rule ID Location (from left to right)
1001 1
2002 2
3055 3
Pattern file: (This is not the final structure, but just a thought)
Column Pattern Rule Location
A1 [0-9]{3}.45 101
A1 12[0-9].55 010
A2 A55[0-8]{2} 100
A2 [X-Z][0-9]{4} 011
Now let's assume that SOMEHOW (not sure how I am going to limit the search to save time) I run the regex and make sure that A1 column is only matched aginst A1 patterns and A2 column against A2 patterns. I would end up with the follwoing reults for "Rule Location"
Column Pattern Rule Location
A1 [0-9]{3}.45 101
A2 A55[0-8]{2} 100
Doing AND on each of the loctions
gives me the location 1 - 1001 -
Complete match.
Doing XOR on each of the loctions
gives me the location 3 - 3055 -
Partial match. (I am purposely not
doing an OR, because that would have
returned 1001 and 3055 as the result
which would be wrong for partial
match)
The final reulsts I am looking for are:
1001 - Complete Match
3055 - Partial Match
Start Edit_1: Explanation on Matching results
Complete Match - This occurs when all
of the patterns in given Rule are
matched.
Partial Match - This ocurrs when NOT
all of the patterns in given Rule are
matched, but atleast one pattern
matches.
Example Complete Match (AND):
Rule ID 1001 matched for A1(101) and A2 (100). If you look at the first charcter in 101 and 100 it is "1". When you do an AND - 1 AND 1 the result is 1. Thus position 1 i.e. 1001 is a Complete Match.
Exmple Partial Match (XOR):
Rule ID 3055 matched for A1(101). If you look at the last character in 101 and 100 it is "1" and "0". When you do an XOR - 1 XOR 0 the result is 1. Thus position 3 i.e. 3055 is Partial Match.
End Edit_1
Input:
The data will be provided in some sort of XML request. It can be one big request with 100K Data nodes or 100K requests with one data node only.
Rules:
The matching values have to be intially saved as some sort of pattern to make it easier to write and edit. Let's assume that there are approximately 100K rules.
Output:
Need to know which rules matched completely and partially.
Preferences:
I would prefer doing as much of the coding as I can in C#. However if there is a major performance boost, I can use a different language.
The "Input" and "Output" are my requirements, how I manage to get the "Output" does not matter. It has to be fast, lets say each Data node has to be processed in approximately 1 second.
Questions:
Are there any existing pattern or
framewroks to do this?
Is using Regex the right path
specifically Assembly Compiled
Regex?
If I end up using Regex how can I
specify for A1 patterns to only
match against A1 column?
If I do specify rule locations in a
bit type pattern. How do I process
ANDs and XORs when it grows to be
100K charcter long?
I am looking for any suggestions or options that I should consider.
Thanks..

The regular expression API only tells you when they fully matched, not when they partially matched. What you therefore need is some variation on a regular expression API that lets you try to match multiple regular expressions at once, and at the end can tell you which matched fully, and which partially matched. Ideally one that lets you precompile a set of patterns so you can avoid compilation at runtime.
If you had that then you could match your A1 patterns against the AI column, A2 columns against the A2 pattern, and so on. Then do something with the list of partial and full regular expressions.
The bad news is that I don't know of any software out there that implements this.
The good news is that the strategy described in http://swtch.com/~rsc/regexp/regexp1.html should be able to implement this. In particular the State sets can be extended to have information about your current state in multiple patterns at the same time. This extended set of State sets will result in a more complex state diagram (because you're tracking more stuff), and a more complex return at the end (you're returning a set of State sets), but runtime won't be changed a bit, whether you're matching one pattern or 50.

Related

regex to match max of input numeric string

I have incoming input entries.
Like these
750
1500
1
100
25
55
And There is an lookup table like given below
25
7
5
75
So when I will receive my first entry, in this case its 750. So this will look up into lookup entry table will try to match with a string which having max match from left to right.
So for 750, max match case would be 75.
I was wondering, Is that possible if we could write a regex for this kind of scenario. Because if I choose using startsWith java function It can get me output of 7 as well.
As input entries will be coming from text file one by one and all lookup entries present in file different text file.
I'm using java language.
May I know how can I write a regex for this flavor..?
This doesn't seem like a regex problem at first, but you could actually solve it with a regex, and the result would be pretty efficient.
A regex for your example lookup table would be:
/^(75?|5|25)/
This will do what you want, and it will avoid the repeated searches of a naive "check every one" approach.
The regex would get complicated,though, as your lookup table grew. Adding a couple of terms to your lookup table:
25
7
5
75
750
72
We now have:
/^(7(50?|2)?|5|25)/
This is obviously going to get complicated quickly. The trick would be programmatically constructing the appropriate regex for arbitrary data--not a trivial problem, but not insurmountable either.
That said, this would be an..umm...unusual thing to implement in production code.
I would be hesitant to do so.
In most cases, I would simply do this:
Find all the strings that match.
Find the longest one.
(?: 25 | 5 | 75? )
There is free software that automatically makes full blown regex trie for you.
Just put the output regex into a text file and load it instead.
If your values don't change very much, this is a very fast way to do a lookup.
If it does change, generate another one.
Whats good about a full blown trie here, is that it never takes more than 8
steps to match.
The one I just did http://imgur.com/a/zwMhL
App screenshot
Even a 175,000 Word Dictionary takes no more than 8 steps.
Internally the app initially makes a ternary tree from the input
then converts it into a full blown regex trie.

2 digits only allowed once (Regex)

I'm trying to check if a level is valid or not.
The level is of the form: (but they're 998 more of these)
bbbbbbb
b41111b
b81400b
b81010b
b01121b
b08001b
bbbbbbb
The level must follow a few rules. I have written a regex to conform all rules but one:
The level must contain exactly 1 times 2 and 1 times 4.
(Notice in the level above there's two 4's and one 2. The level above is not valid.)
This is a school project so please guide me through to the answer.
Thanks in advance.
EDIT:
My current regex is:
^b{' + str(length) + r'}\n(b{1}[0-8]{' + str(length - 2) + r'}b{1}\n)+b{' + str(length) + '}$
For the level above, length = 7
Note that it doesn't even try to filter this wrong level above.
The other rules are:
The level must be surrounded by a 'b'
The level can only contain the char 'b' and numbers smaller than 9.
There can only be one 2
There can only be one 4
My regex above does take rules 1 and 2 into account, but I still need to figure out rules 3 and 4.
I have tried lookarounds and such, couldn't figure it out.
This is the regex that will meet all of your conditions:
^b(?!(?:[^2]*2){2,})(?!(?:[^4]*4){2,})[b0-8]*b$
It starts and end with b Using ^b and $b
It comprises of only letter b and numbers 0-8 by[b0-8]*
It won't allow more than one digit 2 by using (?!(?:[^2]*2){2,})
It won't allow more than one digit 4 by using (?!(?:[^4]*4){2,})
Well, a regex for exactly 1 a would be [^a]*a[^a]* (that is, a possibly empty sequence of non-a's, followed by an a, followed by a possibly empty sequence of non-a's). I'll leave it an an exercise how to handle multiple lines & making sure this covers the whole level.
For exactly 1 a and 1 b: [^ab]*((a[^ab]*b)|(b[^ab]*a))[^ab]*, with the same caveats. Explanation: a sequence of non-a-or-b's, follow by EITHER 1) an a, a run of non-a-or-b's, and a b, or 2) a b, a run of non-a-or-b's, and an a, with THAT followed by a run of non-a-or-b's.
The answer is to use negative lookaheads anchored to start of input.
It's unclear what you're trying to match, so I will just use a placeholder <your-regex> for your current regex:
^(?!.*?2.*?2)(?!.*?4.*?4)<your-regex>
See a live demo of rhis correctly rejecting more than 1 "2".

Regular Expression to Clean a numbered list

I've only just started playing with Regex and seem to be a little stuck! I have written a bulk find and replace using multiline in TextSoap. It is for cleaning up recipes that I have OCR'd and because there is Ingredients and Directions I cannot change a "1 " to become "1. " as this could rewrite "1 Tbsp" as "1. Tbsp".
I therefore did a check to see if the following two lines (possibly with extra rows) was the next sequential numbers using this code as the find:
^(1) (.*)\n?((\n))(^2 (.*)\n?(\n)^3 (.*)\n?(\n))
^(2) (.*)\n?((\n))(^3 (.*)\n?(\n)^4 (.*)\n?(\n))
^(3) (.*)\n?((\n))(^4 (.*)\n?(\n)^5 (.*)\n?(\n))
^(4) (.*)\n?((\n))(^5 (.*)\n?(\n)^6 (.*)\n?(\n))
^(5) (.*)\n?((\n))(^6 (.*)\n?(\n)^7 (.*)\n?(\n))
and the following as the replace for each of the above:
$1. $2 $3 $4$5
My Problem is that although it works as I wanted it to, it will never perform the task for the last three numbers...
An example of the text I want to clean up:
1 This is the first step in the list
2 Second lot if instructions to run through
3 Doing more of the recipe instruction
4 Half way through cooking up a storm
5 almost finished the recipe
6 Serve and eat
And what I want it to look like:
1. This is the first step in the list
2. Second lot if instructions to run through
3. Doing more of the recipe instruction
4. Half way through cooking up a storm
5. almost finished the recipe
6. Serve and eat
Is there a way to check the previous line or two above to run this backwards? I have looked at lookahead and lookbehind and I am somewhat confused at that point. Does anybody have a method to clean up my numbered list or help me with the regex I desire please?
dan1111 is right. You may run into trouble with similar looking data. But given the sample you provided, this should work:
^(\d+)\s+([^\r\n]+)(?:[\r\n]*) // search
$1. $2\r\n\r\n // replace
If you're not using Windows, remove the \rs from the replace string.
Explanation:
^ // beginning of the line
(\d+) // capture group 1. one or more digits
\s+ // any spaces after the digit. don't capture
([^\r\n]+) // capture group 2. all characters up to any EOL
(?:[\r\n]*) // consume additional EOL, but do not capture
Replace:
$1. // group 1 (the digit), then period and a space
$2 // group 2
\r\n\r\n // two EOLs, to create a blank line
// (remove both \r for Linux)
What about this?
1 Tbsp salt
2 Tsp sugar
3 Eggs
You have run into a major limitation of regexes: they don't work well when your data can't be strictly defined. You may intuitively know what are ingredients and what are steps, but it isn't easy to go from that to a reliable set of rules for an algorithm.
I suggest you instead think about an approach that is based on position within the file. A given cookbook usually formats all the recipes the same: such as, the ingredients come first, followed by the list of steps. This would probably be an easier way to tell the difference.

More efficient regex than "(cg[agct])|(ag[ag])"

I need a regex to match any of:
cgt, cgc, cga, cgg, aga, agg
They're DNA codons. Is the regex I've given, (cg[agct])|(ag[ag]), as efficient as it could be? It somehow seems clunky, and I wonder if I could use the fact that there has to be a g as the second character.
To sum up the comments:
It appears that what you have is pretty good.
The one suggestion is to change the grouping into a non-capturing group (or remove them all together).
Something like this seems optimal:
cg[agct]|ag[ag]
If you had a set that was FAR more frequent than the others, you could possibly speed it up (slightly) by adding it literally to the alternation:
cgg|cg[act]|ag[ag]
Internally, most regex engines will turn small character classes like this into their own alternation. It may be fastest to expand out the alternation all the way, or in different groups, to see the performance impact.
I would suggest that you should profile all three of these approaches with your regex engine:
cg[agct]|ag[ag]
cga|cgc|cgg|cgt|aga|agg
[ac]g[agct](?<!agt|agc)
The last one is the closest to an answer to your question, since it leverages the fact that a "g" is required in the middle and used a "negative lookbehind" to eliminate the invalid sets.
One other thing to check would be if just finding all instances of [ac]g[agct] (including the undesired "agt" and "agc") and then filtering them in your language of choice would be fastest.
EDIT, FOR SCIENCE!
Here is a chart of the various types of matches and failures, along with their number of steps required to reach a conclusion (match or no match).
cg[agct]|ag[ag] [ac]g[agct](?<!agt|agc) cga|cgc|cgg|cgt|aga|agg
agg 4 6 10
agc 4 8 10
cga 3 6 3
axa 3 2 8
cxa 3 2 10
xxx 2 1 6
So, it appears that (as we guessed), the methods have entirely different properties.
My hunch about splitting everything into an alternation was wrong. Don't use that.
Your hunch about utilizing the "g" in the middle is warranted, except that for partial matches (agg, for example) and full matches (cga, for example) take longer. However, throwing away bad results is slightly faster with the negative lookbehind version.
So, to compensate for the worst case, (8 checks versus 3 = delta -5) we would have to see at least 5 failing character positions. (2 checks versus 3 = delta 1 or 1 check versus 2 = delta 1)
I guess, then, that you should use the negative lookbehind version if you anticipate that you will fail a match at 5 positions for every match that you find.
EDIT 2, A SLIGHTLY BETTER VERSION
Looking at how exactly the regex is going to evaluate each match, we can craft a better version that will let about half of the matches "fast track", and will also reduce the number of characters checked when the match fails.
[ca]g(?:[ag]|(?<!ag)[ct])
agg 4
agc 7
cga 4
axa 2
cxa 2
xxx 1
This reduces all of the positive matches times by one or two comparisons each.
Based on this, I would recommend using [ca]g(?:[ag]|(?<!ag)[ct]) if you expect to check 4 or more positions for each match.

Weighted disjunction in Perl Regular Expressions?

I am fairly experienced with regular expressions, but I am having some difficulty with a current application involving disjunction.
My situation is this: I need to separate an address into its component parts based on a regular expression match on the "Identifier elements" of the address -- A comparable English example would be words like "state", "road", or "boulevard"--IF, for example, we wrote these out in our addresses. Imagine we have an address like the following, where (and this would never happen in English), we specified the identifier type after each name
United States COUNTRY California STATE San Francisco CITY Mission STREET 345 NUMBER
(Where the words in CAPS are what I have called "identifiers").
We want to parse it into:
United States COUNTRY
California STATE
San Francisco CITY
Mission STREET
245 NUMBER
OK, this is certainly contrived for English, but here's the catch: I am working with Chinese data, where in fact this style of identifier specification happens all the time. An example below:
云南-省 ; 丽江-市 ; 古城-区 ; 西安-街 ; 杨春-巷 ;
Yunnan-Province ; LiJiang-City ; GuCheng-District ; Xi'An-Street ; Yangchun-Alley
This is easy enough--a lazy match on a potential candidate identifier names, separated into a disjunctive list.
For China, the following are the "province-level" entities:
省 (Province) ,
自治区 (Autonomous Region) ,
市 (Municipality)
So my regex so far looks like this:
(.+?(?:(?:省)|(?:自治区)|(?:市)))
I have a series of these, in order to account for different portions of the address. The next level, corresponding to cities, for instance, is:
(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
So to match a province entity followed by a city entity:
(.+?(?:(?:省)|(?:自治区)|(?:市)))(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
With named capture groups:
(?<Province>.+?(?:(?:省)|(?:自治区)|(?:市)))(?<City>.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
For the above, this yields:
$+{Province} = 云南省
$+{City} = 丽江市
This is all good and well, and gets me pretty far. The problem, however, is when I try to account for identifiers that can be a substring of other identifiers. A common street-level entity, for instance, is "村委会", which means village organizing committee. In the set of addresses I wish to separate, not every address has this written out in full. In fact, I find "村委" and just plain "村" as well.
The problem? If I have a pure disjunction of these elements, we have the following:
(?<Street>.+?(?:(?:村委会)|(?:村委)|(?:村)))
What happens, though, is that if you have an entity 保定-村委会 (Baoding Village organizing committee), this lazy regex stops at 村 and calls it a day, orphaning our poor 委会 because 村 is one of the potential disjunctive elements.
Imagine an English equivalent like the following:
(?<Animal>.+?(?:(?:Cat)|(?:Elephant)|(?:CatElephant)|(?:City)))
We have two input strings:
1. "crap catelephant crap city", where we wanted "Crap catelephant" and "crap city"
2. "crap catelephant city" , where we wanted "crap cat" "elephant city"
Ah, the solution, you say, is to make the pre-identifier capture greedy. But! There are entities have the same identifier that are not at the same level.
Take 市 for example. It means simply "city". But in China, there are county-level, province-level, and municipality-level cities. If this character occurred twice in the string, especially in two adjacent entities, the greedy search would incorrectly tag the greedy match as the first entity. As in the following:
广东-省 ; 江门-市 ; 开平-市 ; 三埠-区 石海管-区
Guangdong-province ; Jiangmen-City ; Kaiping-City ; Sanbu-District ; Shihaiguan-District
(Note, as above, this has been hand-segmented. The raw data would simply have a string of concatenated characters)
The match for a greedy search would be
江门市开平市
This is wrong, as the two adjacent entities should be separated into their constituent parts. Once is at the level of provincial city, one is a county-level city.
Back to the original point, and I thank you for reading this far, is there a way to put a weighting on disjunctive entities? I would want the regex to find the highest "weighted" identifier first. 村委会 instead of simple 村 for example, "catelephant" instead of just "cat". In preliminary experiments, the regex parser apparently proceeds left to right in finding disjunctive matches. Is this a valid assumption to make? Should I put the most frequently-occurring identifiers first in the disjunctive list?
If I have lost anyone with Chinese-related details, I apologize, and can further clarify if needed. The example really doesn't have to be Chinese--I think more generally it is a question about the mechanics of the regex disjunctive match -- in what order does it preference the disjunctive entities, and how does it decide when to "call it a day" in the context of a lazy search?
In a way, is there some sort of middle ground between lazy and greedy searches? Find the smallest bit you can find before the longest / highest weighted disjunctive entity? Be lazy, but put in that little bit of extra effort if you can for the sake of thoroughness?
(Incidentally, my work philosophy in college?)
How alternations are handled depends on the particular regular expression engine. For almost all engines (including Perl's regular expression engine) the alternation matches eagerly - that is, it matches the left-most choice first and only tries another alternative if this fails. For example, if you have /(cat|catelephant)/ it will never match catelephant. The solution is to reorder the choices so that the most specific comes first.