String searching with Regex - regex

I am trying to use regex (which I am just starting with) to find sequences with up to 1 mismatched character. For example the pattern “nan” and the text “banana” I would want to find “ban” and “nan” the former being acceptable with the mismatch with ‘b’ and ‘n’. The problem I am having is making up a regex pattern without resorting to making individual wildcard inserts where I want them.
final String[] patterns = {"[a-z]an", "n[a-z]n", "na[a-z]"};
final String text = "banana";
for(String pattern : patterns)
{
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(text);
while(m.find())
{
System.out.println(m.start() + " " + m.group());
}
}
Is what I have as a test which is kind of a clunky way of getting what I want (albeit with some duplicates). For this kind of String Searching with a single mismatch is regex an effective means or should I try modifying traditional algorithms like Horspool or KMP?

Unfortunately regular expressions are designed to allow you to be very precise in the pattern you intend to match, conversely there is little support of approximate match. However the TRE library was designed exactly for this purpose ( http://laurikari.net/tre/ )

Related

The regex in string.format of LUA

I use string.format(str, regex) of LUA to fetch some key word.
local RICH_TAGS = {
"texture",
"img",
}
--\[((img)|(texture))=
local START_OF_PATTER = "\\[("
for index = 1, #RICH_TAGS - 1 do
START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[index]..")|"
end
START_OF_PATTER = START_OF_PATTER .. "("..RICH_TAGS[#RICH_TAGS].."))"
function RichTextDecoder.decodeRich(str)
local result = {}
print(str, START_OF_PATTER)
dump({string.find(str, START_OF_PATTER)})
end
output
hello[img=123] \[((texture)|(img))
dump from: [string "utils/RichTextDecoder.lua"]:21: in function 'decodeRich'
"<var>" = {
}
The output means:
str = hello[img=123]
START_OF_PATTER = \[((texture)|(img))
This regex works well with some online regex tools. But it find nothing in LUA.
Is there any wrong using in my code?
You cannot use regular expressions in Lua. Use Lua's string patterns to match strings.
See How to write this regular expression in Lua?
Try dump({str:find("\\%[%("))})
Also note that this loop:
for index = 1, #RICH_TAGS - 1 do
START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[index]..")|"
end
will leave out the last element of RICH_TAGS, I assume that was not your intention.
Edit:
But what I want is to fetch several specific word. For example, the
pattern can fetch "[img=" "[texture=" "[font=" any one of them. With
the regex string I wrote in my question, regex can do the work. But
with Lua, the way to do the job is write code like string.find(str,
"[img=") and string.find(str, "[texture=") and string.find(str,
"[font="). I wonder there should be a way to do the job with a single
pattern string. I tryed pattern string like "%[%a*=", but obviously it
will fetch a lot more string I need.
You cannot match several specific words with a single pattern unless they are in that string in a specific order. The only thing you could do is to put all the characters that make up those words into a class, but then you risk to find any word you can build from those letters.
Usually you would match each word with a separate pattern or you match any word and check if the match is one of your words using a look up table for example.
So basically you do what a regex library would do in a few lines of Lua.

RegEx that matches "variable" strings/sequences? + backtracking?

I want to use regex like language to match against variable-string (in my case sequence of character|words|numbers stored in a graph DB).
I found a way to implement RegEx engine :
https://deniskyashif.com/2019/02/17/implementing-a-regular-expression-engine/
the problem is that it matches against static string. My case is sort of what I call variable-string/sequence.
F.e. let say I have stored the following sequences :
who; why; when; where;
Keep in mind I dont have the sequences available (so that I can loop over them), they are deconstructed to a graph. (you can think of interface to the sequence like a function which given prefix predicts/returns the next character)
if I match against regex : w* it should match/return all of the strings one after another /like in backtracking/
if i use : whe* => when, where
etc..
Is there a way to modify NFA, DFA in such a way that it will accommodate variable-string ?
I just started exploring implementing NFA and think the change has to be here :
function search(nfa, word) { .... }
it has to be search that passes the next expected regex-symbol/state i.e. given the previous string-symbol does the next-predicted-string-symbol match the expected regex-symbol ?
The regex should 'drive' the match, rather than the string ! It should be doable because the regex is deconstructed to finite states with the transitions..
what do you think ?
they are stored as a tree in graph db...f.e.can be represented as :
lvl5: (where:.)
lvl4: (wher:e), (when:.), (whom:.),
lvl3: (whe:r), (whe:n), (who:m), (who:.), (why:.)
lvl2: (wh:y) , (wh:o), (wh:e)
lvl1: (w:h)
lvl0: w h y o .
I don’t understand your question, but this regex could be the answer:
<prefix>.*?\b
Where <prefix> is w or whe etc.
This will match all words in the input that start with the prefix.
In whatever language you’re using, there should be a way to loop over all matches found for a given input.

How to replace parts of a string in lua "in a single pass"?

I have the following string of anchors (where I want to change the contents of the href) and a lua table of replacements, which tells which word should be replaced for:
s1 = '<a href="word7">'
replacementTable = {}
replacementTable["word1"] = "potato1"
replacementTable["word2"] = "potato2"
replacementTable["word3"] = "potato3"
replacementTable["word4"] = "potato4"
replacementTable["word5"] = "potato5"
The expected result should be:
<a href="word7">
I know I could do this iterating for each element in the replacementTable and process the string each time, but my gut feeling tells me that if by any chance the string is very big and/or the replacement table becomes big, this apporach is going to perform poorly.
So I though it could be best if I could do the following: apply the regular expression for finding all the matches, get an iterator for each match and replace each match for its value in the replacementTable.
Something like this would be great (writing it in Javascript because I don't know yet how to write lambdas in Lua):
var newString = patternReplacement(s1, '<a[^>]* href="([^"]*)"', function(match) { return replacementTable[match] })
Where the first parameter is the string, the second one the regular expression and the third one a function that is executed for each match to get the replacement. This way I think s1 gets parsed once, being more efficient.
Is there any way to do this in Lua?
In your example, this simple code works:
print((s1:gsub("%w+",replacementTable)))
The point is that gsub already accepts a table of replacements.
In the end, the solution that worked for me was the following one:
local updatedBody = string.gsub(body, '(<a[^>]* href=")(/[^"%?]*)([^"]*")', function(leftSide, url, rightSide)
local replacedUrl = url
if (urlsToReplace[url]) then replacedUrl = urlsToReplace[url] end
return leftSide .. replacedUrl .. rightSide
end)
It kept out any querystring parameter giving me just the URI. I know it's a bad idea to parse HTML bodies with regular expressions but for my case, where I required a lot of performance, this was performing a lot faster and just did the job.

Java Regular Expression Two Question marks (??)

I know that /? means the / is optional. so "toys?" will match both toy and toys. My understanding is that if I make it lazy and use "toys??" I will match both toy and toys and always return toy. So, a quick test:
private final static Pattern TEST_PATTERN = Pattern.compile("toys??", Pattern.CASE_INSENSITIVE);
public static void main(String[] args) {
for(String arg : args) {
Matcher m = TEST_PATTERN.matcher(arg);
System.out.print("Arg: " + arg);
boolean b = false;
while (m.find()) {
System.out.print(" {");
for (int i=0; i<=m.groupCount(); ++i) {
System.out.print("[" + m.group(i) + "]");
}
System.out.print("}");
}
System.out.println();
}
}
Yep, it looks like it works as expected
java -cp .. regextest.RegExTest toy toys
Arg: toy {[toy]}
Arg: toys {[toy]}
Now, change the regular expression to "toys??2" and it still matches toys2 and toy2. In both cases, it returns the entire string without the s removed. Is there any functional difference between searching for "toys?2" and "toys??2".
The reason I am asking is because I found an example like the following:
private final static Pattern TEST_PATTERN = Pattern.compile("</??tag(\\s+?.*?)??>", Pattern.CASE_INSENSITIVE);
and although I see no apparent reason for using ?? rather than ?, I thought that perhaps the original author (who is not known to me) might know something that I don't, I expect the later.
?? is lazy while ? is greedy.
Given (pattern)??, it will first test for empty string, then if the rest of the pattern can't match, it will test for pattern.
In contrast, (pattern)? will test for pattern first, then it will test for empty string on backtrack.
Now, change the regular expression to "toys??2" and it still matches toys2 and toy2. In both cases, it returns the entire string without the s removed. Is there any functional difference between searching for "toys?2" and "toys??2".
The difference is in the order of searching:
"toys?2" searches for toys2, then toy2
"toys??2" searches for toy2, then toys2
But for the case of these 2 patterns, the result will be the same regardless of the input string, since the sequel 2 (after s? or s??) must be matched.
As for the pattern you found:
Pattern.compile("</??tag(\\s+?.*?)??>", Pattern.CASE_INSENSITIVE)
Both ?? can be changed to ? without affecting the result:
/ and t (in tag) are mutually exclusive. You either match one or the other.
> and \s are also mutually exclusive. The at least 1 in \s+? is important to this conclusion: the result might be different otherwise.
This is probably micro-optimization from the author. He probably thinks that the open tag must be there, while the closing tag might be forgotten, and that open/close tags without attributes/random spaces appears more often than those with some.
By the way, the engine might run into some expensive backtracking attempt due to \\s+?.*? when the input has <tag followed by lots of spaces without > anywhere near.

Is there a RegEx that can parse out the longest list of digits from a string?

I have to parse various strings and determine a prefix, number, and suffix. The problem is the strings can come in a wide variety of formats. The best way for me to think about how to parse it is to find the longest number in the string, then take everything before that as a prefix and everything after that as a suffix.
Some examples:
0001 - No prefix, Number = 0001, No suffix
1-0001 - Prefix = 1-, Number = 0001, No suffix
AAA001 - Prefix = AAA, Number = 001, No suffix
AAA 001.01 - Prefix = AAA , Number = 001, Suffix = .01
1_00001-01 - Prefix = 1_, Number = 00001, Suffix = -01
123AAA 001_01 - Prefix = 123AAA , Number = 001, Suffix = _01
The strings can come with any mixture of prefixes and suffixes, but the key point is the Number portion is always the longest sequential list of digits.
I've tried a variety of RegEx's that work with most but not all of these examples. I might be missing something, or perhaps a RegEx isn't the right way to go in this case?
(The RegEx should be .NET compatible)
UPDATE: For those that are interested, here's the C# code I came up with:
var regex = new System.Text.RegularExpressions.Regex(#"(\d+)");
if (regex.IsMatch(m_Key)) {
string value = "";
int length;
var matches = regex.Matches(m_Key);
foreach (var match in matches) {
if (match.Length >= length) {
value = match.Value;
length = match.Length;
}
}
var split = m_Key.Split(new String[] {value}, System.StringSplitOptions.RemoveEmptyEntries);
m_KeyCounter = value;
if (split.Length >= 1) m_KeyPrefix = split(0);
if (split.Length >= 2) m_KeySuffix = split(1);
}
You're right, this problem can't be solved purely by regular expressions. You can use regexes to "tokenize" (lexically analyze) the input but after that you'll need further processing (parsing).
So in this case I would tokenize the input with (for example) a simple regular expression search (\d+) and then process the tokens (parse). That would involve seeing if the current token is longer than the tokens seen before it.
To gain more understanding of the class of problems regular expressions "solve" and when parsing is needed, you might want to check out general compiler theory, specifically when regexes are used in the construction of a compiler (e.g. http://en.wikipedia.org/wiki/Book:Compiler_construction).
You're input isn't regular so, a regex won't do. I would iterate over the all groups of digits via (\d+) and find the longest and then build a new regex in the form of (.*)<number>(.*) to find your prefix/suffix.
Or if you're comfortable with string operations you can probably just find the start and end of the target group and use substr to find the pre/suf fix.
I don't think you can do this with one regex. I would find all digit sequences within the string (probably with a regex) and then I would select the longest with .NET code, and call Split().
This depends entirely on your Regexp engine. Check your Regexp environment for capturing, there might be something in it like the automatic variables in Perl.
OK, let's talk about your question:
Keep in mind, that both, NFA and DFA, of almost every Regexp engine are greedy, this means, that a (\d+) will always find the longest match, when it "stumbles" over it.
Now, what I can get from your example, is you always need middle portion of a number, try this:
/^(.*\D)?(\d+)(\D.*)?$/ig
The now look at variables $1, $2, $3. Not all of them will exist: if there are all three of them, $2 will hold your number in question, the other vars, parts of the prefix. when one of the prefixes is missing, only variable $1 and $2 will be set, you have to see for yourself, which one is the integer. If both prefix and suffix are missing, $1 will hold the number.
The idea is to make the engine "stumble" over the first few characters and start matching a long number in the middle.
Since the modifier /gis present, you can loop through all available combinations, that the machine finds, you can then simply take the one you like most or something.
This example is in PCRE, but I'm sure .NET has a compatible mode.