How to efficiently implement regular expression like .*a.*b.*? - regex

I want to match filenames the way Colibri does; I tried to solve it with regular expressions.
Searching in Colibri works like this: you type characters that appear, in order, somewhere inside a filename, and it finds all files whose names contain those characters in that order. For example, for "ab" it finds "cabal", "ab", and "achab".
Simple insertion of .* between letters works (so the search string "ab" becomes the regular expression .*a.*b.*), but I want to run it over a large number of files.
So far I have O(N*???), where N is the number of filenames and ??? is at best linear (I assume my language uses an NFA). I don't care much about space complexity. What data structures or algorithms should I choose to make this more time-efficient?

If you just want to check whether the characters of a search string search are contained in another string str in the same order, you could use this simple algorithm:
pos := -1
for each character in search do
    pos := indexOf(str, character, pos+1)
    if pos is -1 then
        break
    endif
endfor
return pos
This algorithm returns the offset of the last character of search in str, or -1 if the characters do not occur in order. Its runtime is in O(n) (you could replace indexOf with a simple while loop that compares the characters in str from pos to Length(str)-1 and returns either the offset or -1).
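A direct Python translation of the pseudocode (a sketch; str.find plays the role of indexOf):
def subsequence_end(search, s):
    # Offset of the last character of `search` found in `s` in order,
    # or -1 if the characters do not occur in order.
    pos = -1
    for ch in search:
        pos = s.find(ch, pos + 1)
        if pos == -1:
            break
    return pos

print(subsequence_end("ab", "cabal"))  # 2
print(subsequence_end("ab", "xyz"))    # -1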

It will greatly improve your efficiency if you replace the . with a character negation, i.e.
[^a]*a[^b]*b.*
This way you have much less backtracking. See This Reference
Edit: @yi_H, you are right, this regex would probably serve just as well:
a[^b]*b
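For illustration, here is the negated-class transformation as a Python sketch (the helper name is mine; re.escape guards metacharacters, and case-insensitive matching is an assumption for filename search):
import re

def subsequence_regex(query):
    # 'ab' -> '^[^a]*a[^b]*b', the negated-class form shown above
    parts = ["[^{0}]*{0}".format(re.escape(c)) for c in query]
    return re.compile("^" + "".join(parts), re.IGNORECASE)

names = ["cabal", "ab", "achab", "xyz"]
pattern = subsequence_regex("ab")
print([n for n in names if pattern.match(n)])  # ['cabal', 'ab', 'achab']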

Your . is unnecessary. You'll get better performance if you simply transform "abc"
into ^[^a]*a[^b]*b[^c]*c.
string exp = "^";
foreach (char c in inputString)
{
    string s = Regex.Escape (c.ToString()); // escape `.` as `\.`
    exp += "[^" + s + "]*" + s; // replace `a` with `[^a]*a`
}
Regex regex = new Regex (exp, RegexOptions.IgnoreCase);
foreach (string fileName in fileNames)
{
    if (regex.IsMatch (fileName))
        yield return fileName;
}

For a limited character set it might make sense to create a lookup table which contains an array or linked list of matching filenames.
If your alphabet contains X characters, then a "length 1" lookup table will contain X entries, a "length 2" table will contain X^2 entries, and so on. The length-2 table will contain, for each entry (e.g. "ab", "qx"), all the files which have those letters in that order. When searching for a longer input string, look up the appropriate entry and run the search only on those entries.
Note: calculate the needed extra memory and measure the speed improvement (compared to a full table scan); the benefit depends on the data set.
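To make the idea concrete, here is a Python sketch of a length-2 table (the helper names, the lowercasing, and the set-based storage are my assumptions):
from collections import defaultdict
from itertools import combinations

def build_pair_index(filenames):
    # Map each ordered character pair to the set of names containing
    # those two characters in that order.
    index = defaultdict(set)
    for name in filenames:
        for a, b in combinations(name.lower(), 2):
            index[a, b].add(name)
    return index

def candidates(index, query):
    # Names containing every adjacent pair of the query in order;
    # run the full subsequence check only on this smaller set.
    # (Queries shorter than 2 characters need a full scan instead.)
    q = query.lower()
    sets = [index.get((a, b), set()) for a, b in zip(q, q[1:])]
    return set.intersection(*sets) if sets else set()

index = build_pair_index(["cabal", "ab", "achab", "xyz"])
print(candidates(index, "ab"))  # {'cabal', 'ab', 'achab'}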

Related

RegEx 5 Strings to Match - Replace Based on Match

Angular/JS Application
I have this: input.replace('/&lt;|&gt;|&quot;|&amp;|&apos;/gm', need this to be based on match value).
So I want to search for all those strings - but I want to replace the value based on which one matched. So if &quot; matches, replace with " and if &gt; matches, replace with >.
I basically want to avoid this:
input.replace('/&lt;/gm', '<')
input.replace('/&gt;/gm', '>')
input.replace('/&quot;/gm', '"')
I think it has something to do with capturing groups - not a regex person.
Maybe the answer can only be: inputString.replace('/&lt;/gm', '<').replace('/&gt;/gm', '>').replace('/&quot;/gm', '"').replace('/&amp;/gm', '&').replace('/&apos;/gm', '\'');
What's commonly done is to simply chain the replacements, executing one after another as in your example:
input.replace(/&lt;/g, "<").replace(/&gt;/g, ">").replace(/&quot;/g, '"').replace(/&amp;/g, "&").replace(/&apos;/g, "'")
The downside of this is that it really doesn't scale well: each replace operation runs in linear time. Thus for m replacements and a string of length n, the time complexity is O(n * m). If you were to implement support for all 2k+ named HTML entities, this would quickly blow up and your performance would degrade severely - not to mention the O(m) intermediate strings that are created in the process, making for O(n * m) garbage data.
The proper way is to create a lookup table (a hash table, called a dictionary in JS) with O(1) access with all the named entities and their replacements:
const namedEntities = {lt: "<", gt: ">", quot: '"', amp: "&", apos: "'"}
return input.replace(/&(lt|gt|quot|amp|apos);/g, (_, match) => namedEntities[match])
This passes a replacement function to String.replace; no garbage strings are created, and the time complexity - assuming an ideal RegEx implementation - is O(n).
If you want to religiously follow DRY, you might want to build the RegEx from the keys:
const regex = new RegExp("&(" + Object.keys(namedEntities).join("|") + ");", "g")
return input.replace(regex, (_, match) => namedEntities[match])
Alternatively, consider using a more general RegEx, leveraging the dictionary to check whether an entity is valid and defaulting to no replacement:
return input.replace(/&(.+?);/g, (entity, match) => namedEntities[match] || entity)

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the character /.
Typically, here is an example string I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
So I want to retrieve 45.72643,4.91203 and also hereanotherdata, as they are both between the characters : and /.
I tried this syntax on a simpler string where the pattern occurs only once:
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
but it works only if the pattern occurs once. If I use it on a string containing the pattern multiple times, I get the whole stretch between the first : and the last /.
How can I express that the pattern may occur multiple times, and how can I retrieve each occurrence?
Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That will break your original string up into an array of strings, using the ':' as the break point. You can then ignore the first array index (as it contains the text before a ':'), loop over the rest of the array, and use a regexp to extract everything that comes before a '/'.
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then ignore 1, and extract everything before the "/" in 2 and 3.
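To make that split-and-trim idea concrete, here is a tiny Python sketch of the same logic (illustrative only - in Matlab you would use strsplit as described):
s = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
parts = s.split(':')[1:]  # drop the text before the first ':'
found = [p.split('/', 1)[0] for p in parts if '/' in p]
print(found)  # ['45.72643,4.91203', 'hereanotherdata']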
As John Mawer and Adriaan mentioned, strsplit is a good place to start. You can use it for both ':' and '/', but then you will not be able to determine where each of them started. If you apply strsplit twice, you can know where the ':' starts:
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(@(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now B has cells that start with ':', and has two cells in each cell that contain '/' also. You can extract it with checking where B has more than one cell, and take the first of each of them:
C=cellfun(@(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
Starting in R2016b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

Which string-finding algorithm is appropriate for this?

I have a big string, say "aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd" (but maybe longer), and a collection of many little strings. I want to count how many times the little strings are found in the big string (overlap is OK). I care only about speed. KMP seemed good, but it looked like Rabin-Karp handles multiple patterns yet is slow.
The problem with most string searching algorithms is that they take at least time O(k) to return k matches, so if you have a string with, say, 1 million "a"s and 1 million queries of the little string "a", then it will take around a million million (10^12) iterations to count all the matches!
An alternative linear time approach would be to:
Construct a suffix tree of the big string: O(n) where n is len(big string)
Precompute the number of suffixes below each node in the suffix tree: O(n)
For each small string, find the node in the suffix tree reached by spelling out the small string from the root: O(m) where m is len(small string)
Add to the total count the number of suffixes below that node. (Each suffix corresponds to a different match of the small string in the big string)
This will take time O(n+p) where n is the length of the big string, and p is the total length of all the small strings.
Example Code
As requested, here is some small(ish) example code in Python that uses this approach:
class SuffixTree:
    def __init__(self):
        """Returns an empty suffix tree"""
        self.T = ''
        self.E = {}
        self.nodes = [-1]  # 0th node is empty string

    def add(self, s):
        """Adds the input string to the suffix tree.
        This inserts all substrings into the tree.
        End the string with a unique character if you want a leaf-node for every suffix.
        Produces an edge graph keyed by (node, character) that gives (first, last, end).
        This means that the edge has characters from T[first:last+1] and goes to node end."""
        origin, first, last = 0, len(self.T), len(self.T) - 1
        self.T += s
        nc = len(self.nodes)
        self.nodes += [-1] * (2 * len(s))
        T = self.T
        E = self.E
        nodes = self.nodes
        Lm1 = len(T) - 1
        for last_char_index in range(first, len(T)):
            c = T[last_char_index]
            last_parent_node = -1
            while 1:
                parent_node = origin
                if first > last:
                    if (origin, c) in E:
                        break
                else:
                    key = origin, T[first]
                    edge_first, edge_last, edge_end = E[key]
                    span = last - first
                    A = edge_first + span
                    m = T[A + 1]
                    if m == c:
                        break
                    E[key] = (edge_first, A, nc)
                    nodes[nc] = origin
                    E[nc, m] = (A + 1, edge_last, edge_end)
                    parent_node = nc
                    nc += 1
                E[parent_node, c] = (last_char_index, Lm1, nc)
                nc += 1
                if last_parent_node > 0:
                    nodes[last_parent_node] = parent_node
                last_parent_node = parent_node
                if origin == 0:
                    first += 1
                else:
                    origin = nodes[origin]
                if first <= last:
                    edge_first, edge_last, edge_end = E[origin, T[first]]
                    span = edge_last - edge_first
                    while span <= last - first:
                        first += span + 1
                        origin = edge_end
                        if first <= last:
                            edge_first, edge_last, edge_end = E[origin, T[first]]
                            span = edge_last - edge_first
            if last_parent_node > 0:
                nodes[last_parent_node] = parent_node
            last += 1
            if first <= last:
                edge_first, edge_last, edge_end = E[origin, T[first]]
                span = edge_last - edge_first
                while span <= last - first:
                    first += span + 1
                    origin = edge_end
                    if first <= last:
                        edge_first, edge_last, edge_end = E[origin, T[first]]
                        span = edge_last - edge_first
        return self

    def make_choices(self):
        """Construct a sorted list for each node of the possible continuing characters"""
        choices = [list() for n in range(len(self.nodes))]  # Contains set of choices for each node
        for (origin, c), edge in self.E.items():
            choices[origin].append(c)
        choices = [sorted(s) for s in choices]  # should not have any repeats by construction
        self.choices = choices
        return choices

    def count_suffixes(self, term):
        """Recurses through the tree finding how many suffixes are based at each node.
        Strings assumed to use term as the terminating character"""
        C = self.suffix_counts = [0] * len(self.nodes)
        choices = self.make_choices()
        def f(node=0):
            t = 0
            X = choices[node]
            if len(X) == 0:
                t += 1  # this node is a leaf node
            else:
                for c in X:
                    if c == term:
                        t += 1
                        continue
                    first, last, end = self.E[node, c]
                    t += f(end)
            C[node] = t
            return t
        return f()

    def count_matches(self, needle):
        """Return the count of matches for this needle in the suffix tree"""
        i = 0
        node = 0
        E = self.E
        T = self.T
        while i < len(needle):
            c = needle[i]
            key = node, c
            if key not in E:
                return 0
            first, last, node = E[key]
            while i < len(needle) and first <= last:
                if needle[i] != T[first]:
                    return 0
                i += 1
                first += 1
        return self.suffix_counts[node]

big = "aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"
small_strings = ["a", "ab", "abc"]
s = SuffixTree()
term = chr(0)
s.add(big + term)
s.count_suffixes(term)
for needle in small_strings:
    x = s.count_matches(needle)
    print(needle, 'has', x, 'matches')
It prints:
a has 11 matches
ab has 1 matches
abc has 0 matches
However, in practice I would recommend you simply use a pre-existing Aho-Corasick implementation as I would expect this to be much faster in your particular case.
Matching against a large collection of strings sounds like http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm to me. It does find matches one at a time, so Peter de Rivaz's idea might be better if there are a huge number of matches. On the other hand, Aho-Corasick doesn't need to keep the big string in memory - you can just stream it through - and it is very practical to implement and tune - the Wikipedia link notes that the original fgrep used it.
Thinking about it, you can work around the mega-match problem. Aho-Corasick can be viewed as creating a deterministic finite state machine just capable of recognizing each of the strings it is searching for. The state of the machine corresponds to the last N characters seen. If you wish to match two strings and one is a suffix of the other, you need to be careful that the state which says you have just matched the longer string also recognizes that you have just matched the shorter string. If you deliberately choose not to do this, then the counts you accumulate for the shorter string will be incorrect - but you know they are too low by exactly the number of times the longer string was seen. So if you modify Aho-Corasick so that only the longest match at each point is recognized and counted, the cost of matching remains linear in the number of characters in the string you are searching, and you can fix up the counts at the end by going through the long strings and incrementing the counts for the shorter strings which are suffixes of the long strings. This will take time at most linear in the total size of the strings being searched for.
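For reference, here is a compact Python sketch of plain Aho-Corasick counting (my own illustration, not tuned code). The out[t] += out[fail[t]] line is exactly where shorter suffix patterns get recognized; the longest-match-only variant described above would drop that aggregation and instead fix up the counts for suffix patterns in a post-pass:
from collections import deque

def aho_corasick_counts(text, patterns):
    # Build the trie: goto[state] maps char -> next state,
    # out[state] lists the pattern indices ending at that state.
    goto, fail, out = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto[state][ch] = len(goto)
                goto.append({}); fail.append(0); out.append([])
            state = goto[state][ch]
        out[state].append(idx)
    # Breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] += out[fail[t]]  # also report patterns that are proper suffixes
    # Scan the text once, accumulating counts.
    counts, state = [0] * len(patterns), 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            counts[idx] += 1
    return counts

print(aho_corasick_counts("aaab", ["a", "ab"]))  # [3, 1]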
Building on another answer (and hopefully to convince you this is the best type of answer), you can look up http://en.wikipedia.org/wiki/Suffix_tree and also go through the references listed there if you really want the fastest solution for your problem. Suffix trees make it possible to get the number of matches without iterating over all the matches, and the running times and memory requirements you get are the absolute best possible for any substring matching or match counting algorithm. Once you understand how the suffix tree works and how to build and use it, the only additional tweak you need is to store, at each internal node of the tree, the number of distinct string starting positions represented there - a minor modification that you can do efficiently (in linear time, as already claimed) by recursively getting the counts from child nodes and adding them up. These counts then allow you to count substring matches without iterating over all of them.
If I understood correctly, your input string consists of many one-character blocks.
In this case, you can compress your text using the Run-length encoding.
For example:
s = aaabbbbcc
is encoded as:
encoded_s = (a3)(b4)(c2)
Now you may search for patterns in the encoded text.
If you want a concrete algorithm, just search the web for pattern matching in run-length encoded strings. You can achieve time complexity O(N + M), where N and M are the lengths of the compressed text and the compressed pattern. Both M and N are in general much smaller than the original lengths, so this beats any standard pattern matching algorithm, e.g. KMP.
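The encoding step itself is a one-liner in Python (a sketch; the algorithms you will find for RLE matching then work on these (character, run-length) pairs):
from itertools import groupby

def rle(s):
    # 'aaabbbbcc' -> [('a', 3), ('b', 4), ('c', 2)]
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(s)]

print(rle("aaabbbbcc"))  # [('a', 3), ('b', 4), ('c', 2)]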
1) I'd go with finite automata. Can't think of a specialized library right now, but the general-purpose PCRE can be used to construct an automaton that efficiently searches for the given substrings. For the substrings "foo" and "bar" one can construct a pattern /(foo)|(bar)/, scan a string, and get the "id" number of the matched substring by iterating over the ovector and checking which group matched.
RE2::FindAndConsume is good if you only need the total count, not grouping by substring.
P.S. Example using Boost.Xpressive and loading the strings from a map: http://ericniebler.com/2010/09/27/boost-xpressive-ftw/
P.P.S. Recently I've had a good time creating a Ragel machine for a similar task. For a small set of searched strings a "normal" DFA would work, but if you have a larger ruleset then using Ragel scanners shows good results (here is a related answer).
P.P.P.S. PCRE has the MARK keyword, which is super useful for that kind of subpattern classification.
2) Quite some time ago I wrote a Trie-based thingie in Scala for that kind of load: https://gist.github.com/ArtemGr/6150594; Trie.search goes over the string trying to match the current position to a number encoded in the Trie. The trie is encoded in a single cache-friendly array, I expect it to be as good as non-JIT DFAs.
3) I've been using boost::spirit for substring matching, but never got to measuring how it fares against other approaches. Spirit uses some efficient structure for the symbols matching, perhaps the structure can be used on its own without the overhead of Spirit.
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace qi = boost::spirit::qi; // alias needed for the qi:: names below
using qi::lit; using qi::alnum; using qi::digit; using qi::_val; using qi::_1; using boost::phoenix::ref;

static struct exact_t: qi::symbols<char, int> {
    exact_t() {add
        ("foo", 1)
        ("bar", 2);}
} exact;

int32_t match = -1;
qi::rule<const char*, int()> rule =
    +alnum >> exact [ref (match) = _1];
const char* it = haystack; // Mutable iterator for Spirit.
qi::parse (it, haystack + haystackLen, rule);

Is there a RegEx that can parse out the longest list of digits from a string?

I have to parse various strings and determine a prefix, number, and suffix. The problem is the strings can come in a wide variety of formats. The best way for me to think about how to parse it is to find the longest number in the string, then take everything before that as a prefix and everything after that as a suffix.
Some examples:
0001 - No prefix, Number = 0001, No suffix
1-0001 - Prefix = 1-, Number = 0001, No suffix
AAA001 - Prefix = AAA, Number = 001, No suffix
AAA 001.01 - Prefix = AAA , Number = 001, Suffix = .01
1_00001-01 - Prefix = 1_, Number = 00001, Suffix = -01
123AAA 001_01 - Prefix = 123AAA , Number = 001, Suffix = _01
The strings can come with any mixture of prefixes and suffixes, but the key point is the Number portion is always the longest sequential list of digits.
I've tried a variety of RegEx's that work with most but not all of these examples. I might be missing something, or perhaps a RegEx isn't the right way to go in this case?
(The RegEx should be .NET compatible)
UPDATE: For those that are interested, here's the C# code I came up with:
var regex = new System.Text.RegularExpressions.Regex(@"(\d+)");
if (regex.IsMatch(m_Key)) {
    string value = "";
    int length = 0;
    var matches = regex.Matches(m_Key);
    foreach (System.Text.RegularExpressions.Match match in matches) {
        if (match.Length >= length) {
            value = match.Value;
            length = match.Length;
        }
    }
    var split = m_Key.Split(new String[] {value}, System.StringSplitOptions.RemoveEmptyEntries);
    m_KeyCounter = value;
    if (split.Length >= 1) m_KeyPrefix = split[0];
    if (split.Length >= 2) m_KeySuffix = split[1];
}
You're right, this problem can't be solved purely by regular expressions. You can use regexes to "tokenize" (lexically analyze) the input but after that you'll need further processing (parsing).
So in this case I would tokenize the input with (for example) a simple regular expression search (\d+) and then process the tokens (parse). That would involve seeing if the current token is longer than the tokens seen before it.
To gain more understanding of the class of problems regular expressions "solve" and when parsing is needed, you might want to check out general compiler theory, specifically when regexes are used in the construction of a compiler (e.g. http://en.wikipedia.org/wiki/Book:Compiler_construction).
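As a sketch of that tokenize-then-select approach (Python for brevity; the tie-breaking rule is an assumption chosen to match the "123AAA 001_01" example, where the later of two equally long runs wins, as the C# update above also does with its >= comparison):
import re

def split_key(key):
    # Longest run of digits wins; on ties, prefer the later run
    # (so '123AAA 001_01' parses as prefix '123AAA ', number '001').
    runs = list(re.finditer(r"\d+", key))
    if not runs:
        return key, "", ""
    number = max(runs, key=lambda m: (len(m.group()), m.start()))
    return key[:number.start()], number.group(), key[number.end():]

print(split_key("1_00001-01"))     # ('1_', '00001', '-01')
print(split_key("123AAA 001_01"))  # ('123AAA ', '001', '_01')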
Your input isn't regular, so a regex alone won't do. I would iterate over all the groups of digits via (\d+), find the longest, and then build a new regex in the form of (.*)<number>(.*) to find your prefix/suffix.
Or, if you're comfortable with string operations, you can probably just find the start and end of the target group and use substr to find the prefix/suffix.
I don't think you can do this with one regex. I would find all digit sequences within the string (probably with a regex) and then I would select the longest with .NET code, and call Split().
This depends entirely on your regexp engine. Check your regexp environment for capturing; there might be something in it like the automatic variables in Perl.
OK, let's talk about your question:
Keep in mind that the NFA and DFA matchers of almost every regexp engine are greedy; this means that a (\d+) will always find the longest match when it "stumbles" over it.
Now, what I gather from your examples is that you always need the middle, numeric portion; try this:
/^(.*\D)?(\d+)(\D.*)?$/ig
Now look at the variables $1, $2, $3. Not all of them will exist: if all three are set, $2 holds the number in question and the other variables hold the prefix and suffix. If the prefix or the suffix is missing, only $1 and $2 will be set, and you have to check for yourself which one is the integer. If both prefix and suffix are missing, $1 will hold the number.
The idea is to make the engine "stumble" over the first few characters and start matching a long number in the middle.
Since the /g modifier is present, you can loop through all available combinations that the machine finds, and then simply take the one you like most.
This example is in PCRE, but I'm sure .NET has a compatible mode.

How to replace a string between two substrings in a string in VC++/MFC?

Say I have a CString object strMain="AAAABBCCCCCCDDBBCCCCCCDDDAA";
I also have two smaller strings, say strSmall1="BB";
strSmall2="DD";
Now, I want to replace all occurrences of strings which occur between strSmall1 ("BB") and strSmall2 ("DD") in strMain with, say, "KKKKKKK".
Is there a way to do it without regex? I cannot use regex, as adding another file to the project is prohibited.
Is there a way in VC++/MFC to do it? Or any easy algorithm you can point me to?
int length = strMain.GetLength();
int begin = strMain.Find(strSmall1, 0) + strSmall1.GetLength();
int end = strMain.Find(strSmall2, 0);
CString left = strMain.Left(begin);
CString right = strMain.Right(length - end);
strMain = left + "KKKKKKK" + right;
The easiest way is probably to handle the replacement recursively. Search for the starting delimiter and the ending delimiter. If you find them, put together a new string consisting of the string up to the starting delimiter, followed by the replacement string, followed by the return from recursively doing the replacement in the remainder of the string following the ending delimiter.
That, of course, assumes you want to replace all the occurrences in the main string -- if you only want to replace the first one, John Weldon's solution (for one example) will work quite nicely.
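Here is a sketch of that recursive scheme (in Python for brevity, not MFC; like the code above, it keeps the delimiters and replaces only the text between them):
def replace_between(s, a, b, repl):
    # Replace everything between delimiters a and b, keep the delimiters,
    # then recurse on the remainder of the string.
    i = s.find(a)
    if i == -1:
        return s
    j = s.find(b, i + len(a))
    if j == -1:
        return s
    return s[:i + len(a)] + repl + b + replace_between(s[j + len(b):], a, b, repl)

print(replace_between("AAAABBCCCCCCDDBBCCCCCCDDDAA", "BB", "DD", "KKKKKKK"))
# AAAABBKKKKKKKDDBBKKKKKKKDDDAA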
Pseudocode:
loop over string
    if curlocation matches string strsmall1, save index, break
loop over remaining string
    replace till curlocation matches string strsmall2
Extra credit:
What will the next assignment be?
My answer:
Speed it up by jumping the length of strsmall1 and strsmall2 in loop iterations