Matching the IP using regular expression - regex

set ip 10.10.
if {[regexp {^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.?){4}$} $ip match]} { puts $match }
The above pattern matches 10.10. Can anyone tell me how this is happening?

First, using a regular expression to check IP addresses is extremely fragile and unnecessarily complex, and you still have to do the heavy lifting yourself. Instead, use the Tcllib ip package.
package require ip
If you want to know if a given string is an IPv4 address, just check with
::ip::is 4 $str ;# 1 if valid ipv4, 0 otherwise
or
::ip::version $str ;# returns 4 or 6 for ipv4 or ipv6, -1 otherwise
The commands in the package also handle address strings that aren't dotted decimal.
The package isn't included in all distributions, but can be installed using teacup install or by downloading the files and sourcing them into the script.
To answer the question: the original asker has one error and one problem. The error is that the regular expression used to match the IP address also matches strings that aren't IP addresses. This is one of the most common problems when using regular expressions. The reason and the fix are addressed in other answers to the question. To recap: Captain noted that since the original regular expression makes the dot optional, the string 10.10. can be matched as 1 0. 1 0.. There are several possible solutions: {^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.|$)){4}$} as suggested by the same Captain seems valid but may turn out to have more problems if tested (for instance, it still accepts a trailing dot, as in 10.10.10.10.).
The main problem is that a non-trivial regular expression is used to match the address. For all but the most trivial regular expressions, rigorous testing must be performed to ensure that they don't produce false positives. This testing is usually impractical to make exhaustive, which means that you can't know for sure if it works until an angry customer tells you it doesn't. When a case of false positive match is found, the solution is either to drop the regular expression and try another method, or alternatively to make the regular expression more complex in order to make the match more strict. At this point, the test suite may also have to grow.
A better way is to step back and look for other solutions. If there is a standard library function for it, that should be used. If we imagine there is none in this case, simply reflecting on the most basic formulation of an ipv4 decimal-dot address ("four groups of integers from 0 to 255, joined by dots") suggests some simple and safe functions:
proc isOctet n {
    expr {[string is integer -strict $n] && 0 <= $n && $n <= 255}
}
proc splitIpv4dd1 str {
    split $str .
}
proc splitIpv4dd2 str {
    scan $str %d.%d.%d.%d
}
proc splitIpv4dd3 str {
    lrange [regexp -inline {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} $str] 1 end
}
# plug any of the preceding splitIpv4ddN functions into this command
proc putsIpv4dd str {
    set parts [splitIpv4dd1 $str]
    set count 0
    foreach n $parts {
        if {[isOctet $n]} {
            incr count
        }
    }
    # require exactly four fields, all of them valid octets
    # (otherwise 1.2.3.4.x would pass with four valid octets out of five)
    if {[llength $parts] == 4 && $count == 4} {puts $str}
}
It is much easier to verify that each of these functions does its job correctly without false negatives or positives, and if they do, the command to print ip addresses can be assumed to work correctly. The third splitting function uses a regular expression, but in this case it's a trivial one without alternatives and optional atoms.
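For readers who do not use Tcl, the same "split on dots, then check each field" idea can be sketched in Python. This is only an illustration of the approach described above, not a translation of the Tcllib ip package; the names is_octet and is_ipv4_dotted are made up for this example, and the sketch assumes that leading zeros such as 010 are acceptable.

```python
def is_octet(field: str) -> bool:
    """True if field is a plain decimal integer in the range 0..255."""
    # isascii() guards against non-ASCII digits that isdigit() would accept
    return field.isascii() and field.isdigit() and int(field) <= 255


def is_ipv4_dotted(text: str) -> bool:
    """True if text is exactly four octets joined by dots."""
    parts = text.split(".")
    # Both conditions matter: exactly four fields, and every field an octet.
    return len(parts) == 4 and all(is_octet(p) for p in parts)


for candidate in ["10.10.10.10", "10.10.", "256.1.1.1", "1.2.3.4.5"]:
    print(candidate, is_ipv4_dotted(candidate))
```

Each helper is small enough to check by inspection, which is the point the answer makes about avoiding one large regular expression.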
One important goal when writing robust and maintainable code is to keep functions cohesive and clear-cut without loopholes or irregularities. Matching with non-trivial regular expressions runs counter to this.
I certainly understand and actually applaud the wish to understand what went wrong, but the correct conclusion to draw from this is that regular expression matching isn't a good method to use in this case.

You can try to use this regex:
^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$

To answer "how this is happening": the . is optional, so the regex finds 1, 0., 1, 0.
And the answer to the unasked question
The below expression will make the dot optional only if it is the end of the string (modified to ensure no trailing dot):
^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.(?=[0-9])|$)){4}$
Please remember that the original question was asking "how is this happening" - i.e. understanding the regular expression behaviour... NOTHING about how to change the regex or how this should be done...
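The decomposition described in this answer can be checked with a short experiment. The sketch below uses Python's re module rather than Tcl's regexp, so the engine is different and the exact split it picks is not guaranteed to be the one Tcl picks; it only illustrates why a match is possible at all. The names OCTET, original, unrolled and strict are invented for the example.

```python
import re

OCTET = r"([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"

# The pattern from the question: the dot after each octet is optional.
original = re.compile(r"^(" + OCTET + r"\.?){4}$")

# The same pattern with the {4} repetition written out, so that each of
# the four octets ends up in its own capture group.
unrolled = re.compile("^" + r"\.?".join([OCTET] * 4) + r"\.?$")

# A stricter variant from the thread: three "octet dot" pairs, then one octet.
strict = re.compile(r"^(" + OCTET + r"\.){3}" + OCTET + r"$")

text = "10.10."
print(bool(original.match(text)))      # the optional dot lets this match
print(unrolled.match(text).groups())   # the four single-digit "octets"
print(bool(strict.match(text)))        # the mandatory dots reject it
```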

Related

Optimizing a regex filled with '?'

On the stenographic keyboard, there are the keys STKPWHRAO*EUFRPBLGTSDZ. The user presses several keys, then the keys are registered all at once when lifted. It's similar to playing chords on a piano. Example strokes are KAT, TPHOEUGT.
I have a regex which tests for valid steno chords. It can be any number of these keys but they must be in that order. My solution is qr/S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?/ but since this regex gets called hundreds of times, the variable length might be a speed bottleneck. Each step forward in the regex is a bigger and bigger set of possibilities due to all the ?
Is there a faster regex approach to this? I need the regex to fail if keys are out of order.
To check if a string is a valid chord, you'd actually need
/^(?=.)S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s
A simple optimization would be to make sure a match is possible.
/^(?=[STKPWHRAO*EUFBLGDZ])S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s
The next step is to eliminate backtracking. That's where time is being lost.
/
^
(?=[STKPWHRAO*EUFBLGDZ])
S?+ T?+ K?+ P?+ W?+ H?+ R?+ A?+ O?+ \*?+ E?+
U?+ F?+ R?+ P?+ B?+ L?+ G?+ T?+ S?+ D?+ Z?+
\z
/x
Fortunately, even though S, T, P and R appear twice, backtracking could be completely eliminated without trouble. This should reduce the matching time to virtually nothing.
If even that isn't fast enough, the next step is writing a specialized C function. Starting the regex matching engine is expensive, and completely avoidable with a simple function.
Note that the above optimizations only help when the pattern doesn't match. They should be neutral when the pattern matches. The C function, on the other hand, would help even when the pattern matches.
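The logic such a specialized function would need is a single left-to-right scan over the key order. The sketch below shows that logic in Python rather than C, purely to illustrate the algorithm; it says nothing about speed, and the names STENO_ORDER and is_valid_chord are assumptions made for this example. It relies on the fact that taking the earliest possible position for each key never rules out a valid chord, which is what makes the repeated S, T, P and R harmless.

```python
STENO_ORDER = "STKPWHRAO*EUFRPBLGTSDZ"


def is_valid_chord(chord: str) -> bool:
    """True if chord is a non-empty selection of keys in steno order."""
    position = 0
    for key in chord:
        # Look for this key at or after the current position in the layout.
        found = STENO_ORDER.find(key, position)
        if found == -1:
            return False          # key is unknown or out of order
        position = found + 1      # later keys must come after this one
    return bool(chord)            # an empty string is not a chord


for chord in ["KAT", "TPHOEUGT", "STAODZ", "STPRSTPR"]:
    print(chord, is_valid_chord(chord))
```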
Benchmarks:
use strict;
use warnings;
use feature qw( say );
use Benchmark qw( cmpthese );
my %tests = (
orig => q{ $s =~ /^(?=.)S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s},
new => q{ $s =~
/
^
(?=[STKPWHRAO*EUFBLGDZ])
S?+ T?+ K?+ P?+ W?+ H?+ R?+ A?+ O?+ \*?+ E?+
U?+ F?+ R?+ P?+ B?+ L?+ G?+ T?+ S?+ D?+ Z?+
\z
/x
},
);
$_ = 'use strict; use warnings; our $s; ' . $_
for values %tests;
{ say "Matching:"; local our $s = "STAODZ"; cmpthese(-3, \%tests); }
{ say "Not matching:"; local our $s = "STPRSTPR"; cmpthese(-3, \%tests); }
Output:
Matching:
         Rate  new orig
new  509020/s   -- -29%
orig 712274/s  40%   --
Not matching:
         Rate orig  new
orig 158758/s   -- -73%
new  579851/s 265%   --
Which means
matching slowed from 1.40μs to 1.96μs (in this case), and
non-matching sped up from 6.30μs to 1.72μs (in this case).
To check if a string is a sequence of valid chords, you'd simply need
/^[STKPWHRAO*EUFBLGDZ]+\z/
If you want to extract all the chords in a string, I'd start by extracting the sequences matched by the following, then finding the chords within the extracted sequences:
/([STKPWHRAO*EUFBLGDZ]+)/
"the variable length might be a speed bottleneck"
You shouldn't work like that.
First, write and debug your program;
then, if it isn't fast enough for its purpose, profile your program to find where the bottlenecks are;
then optimise the bottlenecks.
For goodness sake, don't spend ages trying to guess where the bottlenecks are and optimising them before your code is complete, as you will more than likely find that you have guessed wrongly and wasted a lot of time.
In any case, the regex engine is written in C and is pretty damn fast. I doubt very much whether the short pattern that you have written will take a significant amount of time to test
"Each step forward in the regex is a bigger and bigger set of possibilities due to all the ?"
That isn't true either. At each point in the regex there is only one character to test. The next character in the string either matches it or it doesn't. Either is fine, and the regex engine just goes on to the next step in the pattern. The matching process will be pretty much constant regardless of the string to be matched.

Regex expression to recognize XdY+Z OR XdY

I've been trying to develop a program that will be used for DMing in an MMORPG but I'm having trouble parsing for the actual regex expression I need.
To quote myself from another thread on a less active forum:
I've officially taken over the DiceRoller addon from years and years ago and I've reworked it a lot since I've taken it over and done a lot of testing in game. While I haven't uploaded anything yet, I've been struggling on a piece of regex expression that is currently crucial to the design of the addon.
Some background: the newest iteration of the DiceRoller addon makes it so you can type "!XdY" (where X is the number of dice, Y is the dice value) into raid chat and the DM who has the addon will go through some logic in the addon (random number lua protocol) and then spit out an output after adding up the dice.
It is as follows:
local count, size = string.match(message, "^!(%d+)[dD](%d+)$")
Now the functionality I need it to do is parse for both "!XdY" OR "XdY+Z", but it seems as if I can't get close to "XdY+Z" no matter which regex expression I use since I need it to do both expressions. I can provide more source code context if necessary.
This is the closest I've ever gotten:
http://i.imgur.com/eMhPHQB.png
and this is with the regex expression:
local count, size, modifier = string.match(message, "^!(%d+)[dD](%d+)+?(%d+)$")
As you can see, with the modifier it will work just fine. However, if I remove the modifier, the regex expression still thinks that it is "XdY+Z", so with "1d20" it thinks it is "1d2+0". It will think 1d200 is "1d20+0", etc. I've tried moving around the optional character "?" but it just causes the expression to not work at all. If I do !1d2 it doesn't work. It's almost as if the optional character NEEDS to be there?
Thanks for the help ahead of time, I've always struggled with regex.
local function dice(input)
  local count, size, modifier = input:match"^!(%d+)[dD](%d+)%+?(%d*)$"
  if count then
    return tonumber(count), tonumber(size), tonumber("0"..modifier)
  end
end
for _, input in ipairs{"!1d6", "!1d24", "!1d200", "!1d2+4", "!1d20+24"} do
  print(input, dice(input))
end
Output:
!1d6 1 6 0
!1d24 1 24 0
!1d200 1 200 0
!1d2+4 1 2 4
!1d20+24 1 20 24
Lua patterns are very limited. You would need something like ^!(%d+)[dD](%d+)(?:%+(%d+))?$ but this isn't supported, because (?:%+(%d+))? uses a non-capturing group and a quantifier on a group, neither of which Lua patterns support.
Consider using a regex library that lets you use PCRE, the regex engine used by PHP and one of the most complete engines. But that would be overkill if you only want to use it for this one regex. In that case you can do it in code, which isn't hard for a simple task like this.
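To show what the optional-group pattern looks like in an engine that does support it, here is a minimal sketch using Python's re module. It is not Lua and does not use the addon's API; the names DICE and parse_roll are invented for the example, and a missing modifier is assumed to mean +0.

```python
import re

# Optional "+Z" part expressed as a non-capturing group with a ? quantifier.
DICE = re.compile(r"^!([0-9]+)[dD]([0-9]+)(?:\+([0-9]+))?$")


def parse_roll(message):
    """Return (count, size, modifier) for "!XdY" or "!XdY+Z", else None."""
    m = DICE.fullmatch(message)
    if m is None:
        return None
    # groups(default="0") fills in the modifier when the "+Z" part is absent
    count, size, modifier = m.groups(default="0")
    return int(count), int(size), int(modifier)


for message in ["!1d6", "!1d200", "!1d2+4", "!1d20+24", "1d20"]:
    print(message, parse_roll(message))
```

Because the plus sign and the modifier digits are optional together, "!1d20" can no longer be read as "1d2+0".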
While Lua patterns are not powerful enough to parse this with one expression (as they don't support optional groups), there is an easy option to handle it with two expressions:
-- check the longer expression first (the literal plus sign is escaped as %+)
local count, size, modifier = string.match(message, "^!(%d+)[dD](%d+)%+(%d+)$")
if not count then
  count, size = string.match(message, "^!(%d+)[dD](%d+)$")
end

Tcl regexp cache with lists of RE

I read that Tcl caches the last 30 regexp compiled and also that assigning a variable to the RE in string version will make Tcl attach the compiled RE to the variable the first time it is used. But what I can't seem to find is if that compiled RE caching will still be done if the RE are contained in a list and iterated upon.
Basically, imagine I have this :
set REs {
    "RE 1"
    "RE 2"
    .
    .
    .
    "RE 39"
    "RE 40"
}
foreach re $REs {
    if { [regexp -nocase $re $line] } {
        AchieveWorldPeace $line
    }
}
Since those REs are used over and over and since I have more than 30 REs (and I don't want to recompile Tcl after changing the corresponding #define based solely on that script), the caching becomes important for the script to run at its fastest. My question is therefore : in this example, would the regular expression be recompiled at each loop? If yes, is there a way to ensure caching when using lists of regular expressions?
Basically, is there a way for the caching to be attached to the Tcl_Object pointed to by the list and not to the Tcl_Object pointed to by the iterator in the foreach ? (Note : that question might be wrong on multiple levels because I don't have any experience in terms of Tcl source code, but it's how I imagined the whole thing to be implemented.)
Please note that this question is more oriented on a better understanding of Tcl than on a specific code answer.
Also, I know I can do something like this :
set RE "(RE 1|RE 2| ... |RE 39|RE 40)"
if { [regexp -nocase $RE $line] } {
    AchieveWorldPeace $line
}
And, from my tests, I know that this speeds up my script by about a factor of two (which is not bad considering the script does a lot more). However, there is no way to tell easily which RE was matched when implemented this way, so it's not quite the same. (Not critical in my case, but just saying...)
Tcl uses two caches of RE compilations. One is the per-thread cache, and the other is in the Tcl_Obj internal representation of the RE. Since the values in a list retain their internal representations, the foreach of a list will keep them as well: your example code will be perfectly well cached with no need for further special action by you. Easy!
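On the side remark in the question that a single combined alternation cannot tell which RE matched: one common workaround is to wrap each alternative in its own capture group and see which group took part in the match. The sketch below shows the idea in Python, not Tcl, so it says nothing about Tcl's caching; the helper names combine and which_matched and the group names re0, re1, ... are invented here, and the sketch assumes the individual patterns contain no named groups of their own.

```python
import re


def combine(patterns):
    """Join patterns into one alternation, one named group per pattern."""
    parts = [f"(?P<re{i}>{p})" for i, p in enumerate(patterns)]
    return re.compile("|".join(parts), re.IGNORECASE)


def which_matched(combined, line):
    """Name of the group that matched first in line, or None if no match."""
    m = combined.search(line)
    return m.lastgroup if m else None


combined = combine([r"error [0-9]+", r"warning", r"fatal"])
for line in ["Error 42 in module", "a WARNING here", "all good"]:
    print(line, "->", which_matched(combined, line))
```

Note that this reports only the alternative that produced the match, not every pattern that could have matched the line.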

TCL: Backslash issue (regsub)

I have an issue while trying to read a member of a list like \\server\directory
The issue comes when I try to get this variable using the lindex command, that proceeds with TCL substitution, so the result is:
\serverdirectory
Then, I think I need to use a regsub command to avoid the backslash substitution, but I did not find the correct procedure.
An example of what I want should be:
set mistring "\\server\directory"
regsub [appropriate regular expression here]
puts "mistring: '$mistring'" ==> "mistring: '\\server\directory'"
I have checked some posts around this, and keeping the \\ is OK, but I still have problems when trying to always keep a single \ followed by any other character that could come here.
UPDATE: specific example. What I am actually trying to keep is the initial format of an element in a list. The list is received by an outer application. The original code is something like this:
set mytable $__outer_list_received
puts "Table: '$mytable'"
for { set i 0 } { $i < [llength $mytable] } { incr i } {
    set row [lindex $mytable $i]
    puts "Row: '$row'"
    set elements [lindex $row 0]
    puts "Elements: '$elements'"
}
The output of this, in this case is:
Table: '{{
address \\server\directory
filename foo.bar
}}'
Row: '{
address \\server\directory
filename foo.bar
}'
Elements: '
address \\server\directory
filename foo.bar
'
So I try to get the value of address (in this specific case, \\server\directory) in order to write it in a configuration file, keeping the original format and data.
I hope this clarifies the problem.
If you don't want substitutions, put the problematic string inside curly braces.
% puts "\\server\directory"
\serverdirectory
and it's not what you want. But
% puts {\\server\directory}
\\server\directory
as you need.
Since this is fundamentally a problem on Windows (and Tcl always treats backslashes in double-quotes as instructions to perform escaping substitutions) you should consider a different approach (otherwise you've got the problem that the backslashes are gone by the time you can apply code to “fix” them). Luckily, you've got two alternatives. The first is to put the string in {braces} to disable substitutions, just like a C# verbatim string literal (but that uses @"this" instead). The second is perhaps more suitable:
set mistring [file nativename "//server/directory"]
That ensures that the platform native directory separator is used on Windows (and nowadays does nothing on other platforms; back when old MacOS9 was supported it was much more magical). Normally, you only need this sort of thing if you are displaying full pathnames to users (usually a bad idea, GUI-wise) or if you are passing the name to some API that doesn't like forward slashes (notably when going as an argument to a program via exec but there are other places where the details leak through, such as if you're using the dde, tcom or twapi packages).
A third, although ugly, option is to double the backslashes: write \\\\ instead of \\, and \\ instead of \, while using double quotes. When the substitution occurs it should give you what you want. Of course, this will not help much if the substitution happens a second time.

Measure the "matching"?

Is there a mechanism to measure or compare how tightly a pattern corresponds to a given string? By pattern I mean a regex or something similar. For example, we have the string "foobar" and two regexes: "fooba." and ".*". Both patterns match the string. Is it possible to determine that "fooba." is a more appropriate pattern for the given string than ".*"?
There are metrics and heuristics for string 'distance'. Check this for example http://en.wikipedia.org/wiki/Edit_distance
Here is one random Java implementation that came with Google search.
http://www.merriampark.com/ldjava.htm
Some metrics are expensive to compute so look around and find one that fits your needs.
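As a concrete instance of such a metric, here is a minimal Python sketch of the Levenshtein edit distance using the usual row-by-row dynamic programming. Note that it compares two plain strings, not a pattern and a string, so treating the regex text "fooba." as a string is only a rough heuristic; the function name levenshtein is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep)
            ))
        previous = current
    return previous[-1]


print(levenshtein("foobar", "fooba."))
print(levenshtein("foobar", ".*"))
```

The cost is proportional to the product of the two lengths, which is the kind of expense the answer warns about for long inputs.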
As for your specific example, IIRC, regex matching in Java tries alternatives in order from left to right, so if you use something like "(foobar)|(.*)", it will match the first one and you can determine this by examining the results returned for the two capture groups.
How about this for an idea: Use the length of your regular expression: length("fooba.") > length(".*"), so "fooba." is more specific...
However, it depends on where the regular expressions come from and how precise you need to be as "fo.*|.*ba" would be longer than "fooba.", so the solution will not always work.
What you're asking for isn't really a property of regular expressions.
Create an enum that measures "closeness", and create a class that will hold a given regex, and a closeness value. This requires you to determine which regex is considered "more close" than another.
Instantiate your various classes, and let them loose on your code, and compare the matched objects, letting the "most closeness" one rise to the top.
pseudo-code, without actually comparing anything, or resembling any sane language:
enum Closeness
    Exact
    PrettyClose
    Decent
    NotSoClose
    WayOff
    CouldBeAnything
mune
class RegexCloser
    property Closeness Close()
    property String Regex()
ssalc
var foo = new RegexCloser(Closeness := Exact, Regex := "foobar")
var bar = new RegexCloser(Closeness := CouldBeAnything, Regex := ".*")
var target = "foobar";
if Regex.Match(target, foo)
    print String.Format("foo {0}", foo.Closeness)
fi
if Regex.Match(target, bar)
    print String.Format("bar {0}", bar.Closeness)
fi
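A runnable version of the same idea, sketched in Python. As in the pseudo-code, the closeness of each regex is something you assign by hand; nothing here computes it. The names RankedPattern and closest_match are invented for this sketch.

```python
import re
from dataclasses import dataclass
from enum import IntEnum


class Closeness(IntEnum):
    # Lower value means a tighter pattern.
    EXACT = 0
    PRETTY_CLOSE = 1
    DECENT = 2
    NOT_SO_CLOSE = 3
    WAY_OFF = 4
    COULD_BE_ANYTHING = 5


@dataclass(frozen=True)
class RankedPattern:
    closeness: Closeness   # assigned by hand, not derived from the regex
    regex: str


def closest_match(target, candidates):
    """Return the tightest candidate that matches all of target, or None."""
    matching = [c for c in candidates if re.fullmatch(c.regex, target)]
    return min(matching, key=lambda c: c.closeness, default=None)


candidates = [
    RankedPattern(Closeness.COULD_BE_ANYTHING, r".*"),
    RankedPattern(Closeness.PRETTY_CLOSE, r"fooba."),
]
best = closest_match("foobar", candidates)
print(best.regex, best.closeness.name)
```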