Measure the "matching"? - regex

Is there mechanism to measure or compare of how tight the pattern corresponds to the given string? By pattern I mean regex or something similar. For example we have string "foobar" and two regexes: "fooba." and ".*" Both patterns match the string. Is it possible to determine that "fooba." is more appropriate pattern for given string then ".*"?

There are metrics and heuristics for string 'distance'. Check this for example http://en.wikipedia.org/wiki/Edit_distance
Here is one random Java implementation that came with Google search.
http://www.merriampark.com/ldjava.htm
Some metrics are expensive to compute so look around and find one that fits your needs.
As for your specific example, IIRC, regex matching in Java prioritizes terms by matching length and then order so if you use something like
"(foobar)|(.*)", it will match the first one and you can determine this by examining the results returned for the two capture groups.

How about this for an idea: Use the length of your regular expression: length("fooba.") > length(".*"), so "fooba." is more specific...
However, it depends on where the regular expressions come from and how precise you need to be as "fo.*|.*ba" would be longer than "fooba.", so the solution will not always work.

What you're asking for isn't really a property of regular expressions.
Create an enum that measures "closeness", and create a class that will hold a given regex, and a closeness value. This requires you to determine which regex is considered "more close" than another.
Instantiate your various classes, and let them loose on your code, and compare the matched objects, letting the "most closeness" one rise to the top.
pseudo-code, without actually comparing anything, or resembling any sane language:
enum Closeness
Exact
PrettyClose
Decent
NotSoClose
WayOff
CouldBeAnything
mune
class RegexCloser
property Closeness Close()
property String Regex()
ssalc
var foo = new RegexCloser(Closeness := Exact, Regex := "foobar")
var bar = new RegexCloser(Closeness := CouldBeAnything, Regex := ".*")
var target = "foobar";
if Regex.Match(target, foo)
print String.Format("foo {0}", foo.Closeness)
fi
if Regex.Match(target, bar)
print String.Format("bar {0}", bar.Closeness)
fi

Related

Match repeating patterns that are consistently separated with comma or whitespace in golang

I am trying to parse multiple tags in one string literal.
such as name=testName, key=testKey, columns=(c1, c2, c3), and I might consider add more tags with different syntax in this string in the near future.
So it's natural to study regex to implement it.
as for the syntax:
valid:
`name=testName,key=testKey`
`name=testName, key=testKey`
`name=testName key=testKey`
`name=testName key=testKey`
`name=testName key=testKey columns=(c1 c2 c3)`
`name=testName key=testKey columns=(c1, c2, c3)`
`name=testName, key=testKey, columns=(c1 c2 c3)`
invalid:
`name=testName,, key=testKey` (multiple commas in between)
`name=testName, key=testKey,` (end with a comma)
`name=testName, key=testKey, columns=(c1,c2 c3)` u can only use comma or whitespace consistently inside columns, the rule applies to the whole tags as well. see below
`name=testName, key=testKey columns=(c1,c2,c3)`
I come up the whole pattern like this:
((name=\w+|key=\w+)+,\s*)*(name=\w+|key=\w+)+
I am wondering is it possible to set the subpattern as a regex and then combine them into a larger pattern.
such as
patternName := regexp.MustCompile(`name=\w+`)
patternKey := regexp.MustCompile(`key=\w+`)
pattern = ((patternName|patternKey)+,\s*)*(patternName|patternKey)+
considering I will add more tags, the whole pattern will definitely get larger and more ugly. Is there any elegant way like the combined way?
Yes, what you want is possible. the regexp.Regexp type has a String() method, which produces the string representation. So you can use this to combine regular expressions:
patternName := regexp.MustCompile(`name=\w+`)
patternKey := regexp.MustCompile(`key=\w+`)
pattern = regexp.MustCompile(`((`+patternName.String()+`|`+patternKey.String()+`)+,\s*)*(`+patternName.String()+`|`+patternKey.String()`+`)+`)
Can be shortened (though less efficient) with fmt.Sprintf:
pattern = regexp.MustCompile(fmt.Sprintf(`((%s|%s)+,\s*)*(%s|%s)+`, patternName, patternKey, patternName, patternKey)
But just because it's possible doesn't mean you should do it...
Your particular examples would be much more easily handled using standard text parsing methods such as strings.Split or strings.FieldsFunc, etc. Given your provided sample inputs, I would do it this way:
Split on whitespace/comma
Split each result on the equals sign.
Validate that the key names are expected (name and/or key)
This code will be far easier to read, and will execute probably hundreds or thousands of times faster, compared to a regular expression. This approach also lends itself easily to stream processing, which can be a big benefit if you're processing hundreds or more records, and don't want to consume a lot of memory. (Regexp can be made to do this as well, but it's still less readable).

Matching the IP using regular expression

set ip 10.10.
if {[regexp
{^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.?){4}$} $ip
match]} { puts $match }
the above pattern matching 10.10. can anyone tell me how this happening
First, using a regular expression to check ip addresses is extremely fragile and unnecessarily complex, and you still have to do the heavy lifting yourself. Instead, use the Tcllib_ip package.
package require ip
If you want to know if a given string is an IPv4 address, just check with
::ip::is 4 $str ;# 1 if valid ipv4, 0 otherwise
or
::ip::version $str ;# returns 4 or 6 for ipv4 or ipv6, -1 otherwise
The commands in the package also handle address strings that aren't dotted decimal.
The package isn't included in all distributions, but can be installed using teacup install or by downloading the files and sourcing them into the script.
To answer the question: the original asker has one error and one problem. The error is that the regular expression used to match the ip address also matches strings that aren't ip addresses. This is one of the most common problems when using regular expressions. The reason and the fix is addressed in other answers to the question. To recap: Captain noted that since the original regular expression makes the dot optional, the string 10.10. can be matched as 1 0. 1 0.. There are several possible solutions: {^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.|$)){4}$} as suggested by the same Captain seems valid but may turn out to have more problems if tested.
The main problem is that a non-trivial regular expression is used to match the address. For all but the most trivial regular expressions, rigorous testing must be performed to ensure that they don't produce false positives. This testing is usually impractical to make exhaustive, which means that you can't know for sure if it works until an angry customer tells you it doesn't. When a case of false positive match is found, the solution is either to drop the regular expression and try another method, or alternatively to make the regular expression more complex in order to make the match more strict. At this point, the test suite may also have to grow.
A better way is to step back and look for other solutions. If there is a standard library function for it, that should be used. If we imagine there is none in this case, simply reflecting on the most basic formulation of an ipv4 decimal-dot address ("four groups of integers from 0 to 255, joined by dots") suggests some simple and safe functions:
proc isOctet n {
expr {[string is integer -strict $n] && 0 <= $n && $n <= 255}
}
proc splitIpv4dd1 str {
split $str .
}
proc splitIpv4dd2 str {
scan $str %d.%d.%d.%d
}
proc splitIpv4dd3 str {
lrange [regexp -inline {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} $str] 1 end
}
# plug any of the preceding splitIpv4ddN functions into this command
proc putsIpv4dd str {
set count 0
foreach n [splitIpv4dd1 $str] {
if {[isOctet $n]} {
incr count
}
}
if {$count == 4} {puts $str}
}
It is much easier to verify that each of these functions does its job correctly without false negatives or positives, and if they do, the command to print ip addresses can be assumed to work correctly. The third splitting function uses a regular expression, but in this case it's a trivial one without alternatives and optional atoms.
One important goal when writing robust and maintainable code is to keep functions cohesive and clear-cut without loopholes or irregularities. Matching with non-trivial regular expressions runs counter to this.
I certainly understand and actually applaud the wish to understand what went wrong, but the correct conclusion to draw from this is that regular expression matching isn't a good method to use in this case.
You can try to use this regex:
^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$
Regex Demo
To answer "how this is happening" - ´.´ optional, it finds 1, 0., 1, 0.
And the answer to the unasked question
The below expression will make the dot optional only if it is the end of the string (modified to ensure no trailing dot):
^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.(?=[0-9])|$)){4}$
Please remember that the original question was asking "how is this happening" - i.e. understanding the regular expression behaviour... NOTHING about how to change the regex or how this should be done...

How to replace characters in string Erlang?

I have this piece of code that gets sessionid, make it a string, and then create a set with key as e.g. {{1401,873063,143916},<0.16443.0>} in redis. I'm trying replace { characters in this session with letter "a".
OldSessionID= io_lib:format("~p",[OldSession#session.sid]),
StringForOldSessionID = lists:flatten(OldSessionID),
ejabberd_redis:cmd([["SADD", StringForSessionID, StringForUserInfo]]);
I've tried this:
re:replace(N,"{","a",[global,{return,list}]).
Is this a good way of doing this? I read that regexp in Erlang is not a advised way of doing things.
Your solution works, and if you are comfortable with it, you should keep it.
On my side I prefer list comprehension : [case X of ${ -> $a; _ -> X end || X <- StringForOldSessionID ]. (just because I don't have to check the function documentation :o)
re:replace(N,"{","a",[global,{return,list}]).
Is this a good way of doing this? I read that regexp in Erlang is not
a advised way of doing things.
According to official documentation:
2.5 Myth: Strings are slow
Actually, string handling could be slow if done improperly. In Erlang, you'll have to think a little more about how the strings are used and choose an appropriate representation and use the re module instead of the obsolete regexp module if you are going to use regular expressions.
So, either you use re for strings, or:
leave { behind(using pattern matching)
if, say, N is {{1401,873063,143916},<0.16443.0>}, then
{{A,B,C},Pid} = N
And then format A,B,C,Pid into string.
Since Erlang OTP 20.0 you can use string:replace/3 function from string module.
string:replace/3 - replaces SearchPattern in String with Replacement. 3rd function parameter indicates whether the leading, the trailing or all encounters of SearchPattern are to be replaced.
string:replace(Input, "{", "a", all).

Hunspell/Aspell data conversion to human-readable inflection list

Is there an easy way to generate a human-readable inflection list from Hunspell/Aspell dictionary data files?
For example, I'd like to generate the following outputs (for different languages):
...
book, books
book, books, booked, booking
...
go, goes, went, gone, going
...
I looked at the Hunspell/Aspell docs, but couldn't find an API call that would do this.
There is a method that the command line one does, but it doesn't output quite in the format you're looking for. You could also do this manually if you wanted though just by some simple scripting with regex.
The format of for each set of affixes is
TYPE TAG REMOVE REPLACE MATCH
Such that where TAG matches what follows what's behind the /in a given word in the .dicfile, you can do the following (presuming you've already stripped the word of the /...):
if($word =~ /$match$/) $word =~ s/$remove$/$replace/;
Notice the $ there matching the end-of-line/word. Adjust with ^ if it's a prefix.
There are three caveats:
The $match directly from the .aff file is in almost all cases equivalent to standard regex. There are minor variations such that if the match is something like [abc-gh], you'd be better to change it to (a|b|c|-|g|h) or [abcgh-] (hunspell doesn't use hyphen as a metacharacter) otherwise it'll be interpreted as [abcdefgh] (standard regex). For a negated character class, your options are to manually move the - to the end of the expression (e.g. [^a-df] to [^adf-] or to use negative look behinds.
If $replace is 0, then you should change it to an empty string.
If your result ends with /..., you need to reprocess it again because it has a double affix.
Be careful. By my rough calculations, the dictionary I'm working on could have more than 50 million words being formed (and I wouldn't be surprised if it hits beyond 100 million).

Regular Expression - how to find text within particular if blocks?

I'm new to regular expressions and would like to use one to search through our source control to find text within a block of code that follows a particular enum value. I.e.:
/(\/{2}\#debug)(.|\s)*?(\/{2}\#end-debug).*/
var junk = dontWantThis if (junk) {dont want this} if ( **myEnumValue** ) **{ var yes = iWantToFindThis if (true) { var yes2 = iWantThisToo } }**
var junk2 = dontWantThis if (junk) {dont want this}
var stuff = dontWantThis if (junk) {dont want this} if ( enumValue ) { wantToFindThis }
var stuff = iDontWantThis if (junk) {iDontWantThisEither}
I know I can use (\{(/?[^\>]+)\}) to find if blocks, but I only want the first encompassing block of code that follows the enum value I'm looking for. I've also notice that using (\{(/?[^\>]+)\}) gives me the first { and last }, it doesn't group the subsequent {}.
Thank you!
Tim
Regexps simply can't handle this kind of stuff. For this you'll need a parser and scanner.
As others hint at, it's mathematically impossible to do with with regular expressions (at least in general; you might be able to get it to work if you have highly specialized cases). Try using a combination of lex and awk to get the desired results if you want to stick with standard Unix tools, or just go to Perl, Python, Ruby, etc. and build up the lexical parsing you need.
While nesting is a problem, you could use backtracking and lookahead to effectively count your matching braces or quotes. This is not strictly part of a regular expression but has been added to many regex libraries, such as the one in .NET, perl, and java; probably more. I wouldn't recommend that you go this route, as you should find it easier to lexically parse this. But if you do try this as a quick fix, absolutely collect a few test cases and run them through regexbuddy or expresso.