Tcl regexp cache with lists of RE - regex

I read that Tcl caches the last 30 regexp compiled and also that assigning a variable to the RE in string version will make Tcl attach the compiled RE to the variable the first time it is used. But what I can't seem to find is if that compiled RE caching will still be done if the RE are contained in a list and iterated upon.
Basically, imagine I have this :
set REs {
"RE 1"
"RE 2"
.
.
.
"RE 39"
"RE 40"
}
foreach re $REs {
if { [regexp -nocase $re $line] } {
AchieveWorldPeace $line
}
}
Since those REs are used over and over and since I have more than 30 REs (and I don't want to recompile Tcl after changing the corresponding #define based solely on that script), the caching becomes important for the script to run at its fastest. My question is therefore : in this example, would the regular expression be recompiled at each loop? If yes, is there a way to ensure caching when using lists of regular expressions?
Basically, is there a way for the caching to be attached to the Tcl_Object pointed to by the list and not to the Tcl_Object pointed to by the iterator in the foreach ? (Note : that question might be wrong on multiple levels because I don't have any experience in terms of Tcl source code, but it's how I imagined the whole thing to be implemented.)
Please note that this question is more oriented on a better understanding of Tcl than on a specific code answer.
Also, I know I can do something like this :
set RE "(RE 1|RE 2| ... |RE 39|RE 40)"
if { [regexp -nocase $RE $line] } {
AchieveWorldPeace $line
}
And, from my tests, I know that this speeds up my script by about a factor of two (which is not bad considering the script does a lot more). However, there is no way to tell easily which RE was matched when implemented this way, so it's not quite the same. (Not critical in my case, but just saying...)

Tcl uses two caches of RE compilations. One is the per-thread cache, and the other is in the Tcl_Obj internal representation of the RE. Since the values in a list retain their internal representations, the foreach of a list will keep them as well: your example code will be perfectly well cached with no need for further special action by you. Easy!

Related

Is there a way to match strings:numbers with variable positioning within the string?

We are using a simple curl to get metrics via an API. The problem is, that the output is fixed in the amount of arguments but not their position within the output.
We need to do this with a "simple" regex since the tool only accepts this.
/"name":"(.*)".*?"memory":(\d+).*?"consumer_utilisation":(\w+|\d+).*?"messages_unacknowledged":(\d+).*?"messages_ready":(\d+).*?"messages":(\d+)/s
It works fine for:
{"name":"queue1","memory":89048,"consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0}
However if the output order is changed, then it doesn't match any more:
{"name":"queue2","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0,"memory":21944}
{"name":"queue3","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"memory":21944,"messages":0}
I need a relative definition of the strings to match, since I never know at which position they will appear. Its in total 9 different queue-metric-groups.
The simple option is to use a regex for each key-value pair instead of one large regex.
/"name":"((?:[^\\"]|\\.)*)"/
/"memory":(\d+)/
This other option is not a regex, but might be sufficient. Instead of using regex, you could simply transform the resulting response before reading it. Since you say "We are using a simple curl" I'm guessing you're talking about the Curl command line tool. You could pipe the result into a simple Perl command.
perl -ne 'use JSON; use Text::CSV qw(csv); $hash = decode_json $_; csv (sep_char=> ";", out => *STDOUT, in => [[$hash->{name}, $hash->{memory}, $hash->{consumer_utilisation}, $hash->{messages_unacknowledged}, $hash->{messages_ready}, $hash->{messages}]]);'
This will keep the order the same, making it easier to use a regex to read out the data.
input
{"name":"queue1","memory":89048,"consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0}
{"name":"queue2","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"messages":0,"memory":21944}
{"name":"queue3","consumer_utilisation":null,"messages_unacknowledged":0,"messages_ready":0,"memory":21944,"messages":0}
output
queue1;89048;;0;0;0
queue2;21944;;0;0;0
queue3;21944;;0;0;0
For this to work you need Perl and the packages JSON and Text::CSV installed. On my system they are present in perl, libjson-perl and libtext-csv-perl.
note: I'm currently using ; as separator. If this is included into one of the output will be surrounded by double quotes. "name":"que;ue1" => "que;ue1";89048;;0;0;0 If the value includes both a ; and a " the " will be escaped by placing another one before it. "name":"q\"ue;ue1" => "q""ue;ue1";89048;;0;0;0

bsd_glob behaving differently on different machines

I am using bsd_glob to get a list of files matching a regular expression for file path. My perl utility is working on RHEL, but not on Suse 11/AIX/Solarix, for the exact same set of files and the same regular expression. I googled for any limitations of bsd_glob, but couldn't find much information. Can someone point what's wrong?
Below is the regular expression for the file path I am searching for:
/datafiles/data_one/level_one/*/DATA*
I need all files beginning with DATA, in any directory present under 'level_one'.
This works perfectly on my RHEL box, but not on any other Unix and Suse Linux.
Below is the code snipped where I am using bsd_glob
foreach my $file (bsd_glob ( "$fileName", GLOB_ERR )) {
if ($fileName =~ /[[:alnum:]]\*\/\*$/) {
next if -d $file;
$fileList{$file} = $permissions;
$total++;
}
elsif ($fileName =~ /[[:alnum:]]\*$/) {
$fileList{$file} = $permissions;
$total++;
}
else {
$fileList{$file} = $permissions;
$total++;
}
}
In this case where I am facing the issue, /datafiles/data_one/level_one/*/DATA* is being passed to bsd_glob. I am creating a map ($fileList) of files that are returned by bsd_glob based on the regular expression I am passing to it. $permissions is a predefined value.
Any help is appreciated.
The problem here looks to be that you're confusing glob patterns and regular expressions.
/[[:alnum:]]\*\/\*$/
/[[:alnum:]]\*$/
You're looking for a file called * with that, under a directory containing a literal *.
Whilst that is technically possible it's really very strange. And simply cannot ever match the patterns your glob should find.
Do you perhaps mean:
m,\w+.*/.*$,
(different delimiter for clarity)
Also - why are you using bsd_glob specifically? From File::Glob:
Since v5.6.0, Perl's CORE::glob() is implemented in terms of bsd_glob(). Note that they don't share the same prototype--CORE::glob() only accepts a single argument. Due to historical reasons, CORE::glob() will also split its argument on whitespace, treating it as multiple patterns, whereas bsd_glob() considers them as one pattern. But see :bsd_glob under EXPORTS, below.
Comment:
I used bsd_glob instead of glob as there was slight difference in the way it works on different UNIX platforms. Specifically, for the above mentioned pattern, on some UNIX platforms, it didn't return a file having exact name 'DATA', and only returned files with something appended to DATA.
I'm a little surprised at that, as they should be implementing the same mechanisms and the same POSIX standard on globbing. Is there any chance there's a permissions related problem instead?
But otherwise you could perhaps try not using glob to do the heavy lifting, and instead just compare the file name to a bunch of regular expressions. (Although note - REs have very different syntax)
foreach my $file ( glob('/datafiles/data_one/level_one/*/*') ) {
next unless $filename =~ m,DATA\w+$,;
}

Matching the IP using regular expression

set ip 10.10.
if {[regexp
{^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.?){4}$} $ip
match]} { puts $match }
the above pattern matching 10.10. can anyone tell me how this happening
First, using a regular expression to check ip addresses is extremely fragile and unnecessarily complex, and you still have to do the heavy lifting yourself. Instead, use the Tcllib_ip package.
package require ip
If you want to know if a given string is an IPv4 address, just check with
::ip::is 4 $str ;# 1 if valid ipv4, 0 otherwise
or
::ip::version $str ;# returns 4 or 6 for ipv4 or ipv6, -1 otherwise
The commands in the package also handle address strings that aren't dotted decimal.
The package isn't included in all distributions, but can be installed using teacup install or by downloading the files and sourcing them into the script.
To answer the question: the original asker has one error and one problem. The error is that the regular expression used to match the ip address also matches strings that aren't ip addresses. This is one of the most common problems when using regular expressions. The reason and the fix is addressed in other answers to the question. To recap: Captain noted that since the original regular expression makes the dot optional, the string 10.10. can be matched as 1 0. 1 0.. There are several possible solutions: {^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.|$)){4}$} as suggested by the same Captain seems valid but may turn out to have more problems if tested.
The main problem is that a non-trivial regular expression is used to match the address. For all but the most trivial regular expressions, rigorous testing must be performed to ensure that they don't produce false positives. This testing is usually impractical to make exhaustive, which means that you can't know for sure if it works until an angry customer tells you it doesn't. When a case of false positive match is found, the solution is either to drop the regular expression and try another method, or alternatively to make the regular expression more complex in order to make the match more strict. At this point, the test suite may also have to grow.
A better way is to step back and look for other solutions. If there is a standard library function for it, that should be used. If we imagine there is none in this case, simply reflecting on the most basic formulation of an ipv4 decimal-dot address ("four groups of integers from 0 to 255, joined by dots") suggests some simple and safe functions:
proc isOctet n {
expr {[string is integer -strict $n] && 0 <= $n && $n <= 255}
}
proc splitIpv4dd1 str {
split $str .
}
proc splitIpv4dd2 str {
scan $str %d.%d.%d.%d
}
proc splitIpv4dd3 str {
lrange [regexp -inline {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} $str] 1 end
}
# plug any of the preceding splitIpv4ddN functions into this command
proc putsIpv4dd str {
set count 0
foreach n [splitIpv4dd1 $str] {
if {[isOctet $n]} {
incr count
}
}
if {$count == 4} {puts $str}
}
It is much easier to verify that each of these functions does its job correctly without false negatives or positives, and if they do, the command to print ip addresses can be assumed to work correctly. The third splitting function uses a regular expression, but in this case it's a trivial one without alternatives and optional atoms.
One important goal when writing robust and maintainable code is to keep functions cohesive and clear-cut without loopholes or irregularities. Matching with non-trivial regular expressions runs counter to this.
I certainly understand and actually applaud the wish to understand what went wrong, but the correct conclusion to draw from this is that regular expression matching isn't a good method to use in this case.
You can try to use this regex:
^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$
Regex Demo
To answer "how this is happening" - ´.´ optional, it finds 1, 0., 1, 0.
And the answer to the unasked question
The below expression will make the dot optional only if it is the end of the string (modified to ensure no trailing dot):
^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.(?=[0-9])|$)){4}$
Please remember that the original question was asking "how is this happening" - i.e. understanding the regular expression behaviour... NOTHING about how to change the regex or how this should be done...

TCL: Backslash issue (regsub)

I have an issue while trying to read a member of a list like \\server\directory
The issue comes when I try to get this variable using the lindex command, that proceeds with TCL substitution, so the result is:
\serverdirectory
Then, I think I need to use a regsub command to avoid the backslash substitution, but I did not get the correct proceedure.
An example of what I want should be:
set mistring "\\server\directory"
regsub [appropriate regular expresion here]
puts "mistring: '$mistring'" ==> "mistring: '\\server\directory'"
I have checked some posts around this, and keep the \\ is ok, but I still have problems when trying to keep always a single \ followed by any other character that could come here.
UPDATE: specific example. What I am actually trying to keep is the initial format of an element in a list. The list is received by an outer application. The original code is something like this:
set mytable $__outer_list_received
puts "Table: '$mytable'"
for { set i 0 } { $i < [llength $mitabla] } { incr i } {
set row [lindex $mytable $i]
puts "Row: '$row'"
set elements [lindex $row 0]
puts "Elements: '$elements'"
}
The output of this, in this case is:
Table: '{{
address \\server\directory
filename foo.bar
}}'
Row: '{
address \\server\directory
filename foo.bar
}'
Elements: '
address \\server\directory
filename foo.bar
'
So I try to get the value of address (in this specific case, \\server\directory) in order to write it in a configuration file, keeping the original format and data.
I hope this clarify the problem.
If you don't want substitutions, put the problematic string inside curly braces.
% puts "\\server\directory"
\serverdirectory
and it's not what you want. But
% puts {\\server\directory}
\\server\directory
as you need.
Since this is fundamentally a problem on Windows (and Tcl always treats backslashes in double-quotes as instructions to perform escaping substitutions) you should consider a different approach (otherwise you've got the problem that the backslashes are gone by the time you can apply code to “fix” them). Luckily, you've got two alternatives. The first is to put the string in {braces} to disable substitutions, just like a C# verbatim string literal (but that uses #"this" instead). The second is perhaps more suitable:
set mistring [file nativename "//server/directory"]
That ensures that the platform native directory separator is used on Windows (and nowadays does nothing on other platforms; back when old MacOS9 was supported it was much more magical). Normally, you only need this sort of thing if you are displaying full pathnames to users (usually a bad idea, GUI-wise) or if you are passing the name to some API that doesn't like forward slashes (notably when going as an argument to a program via exec but there are other places where the details leak through, such as if you're using the dde, tcom or twapi packages).
A third, although ugly, option is to double the slashes. \\ instead of \, and \ instead of \, while using double quotes. When the substitution occurs it should give you what you want. Of course, this will not help much if you do the substitution a second time.

How do I check if a scalar has a compiled regex in it with Perl?

Let's say I have a subroutine/method that a user can call to test some data that (as an example) might look like this:
sub test_output {
my ($self, $test) = #_;
my $output = $self->long_process_to_get_data();
if ($output =~ /\Q$test/) {
$self->assert_something();
}
else {
$self->do_something_else();
}
}
Normally, $test is a string, which we're looking for anywhere in the output. This was an interface put together to make calling it very easy. However, we've found that sometimes, a straight string is problematic - for example, a large, possibly varying number of spaces...a pattern, if you will. Thus, I'd like to let them pass in a regex as an option. I could just do:
$output =~ $test
if I could assume that it's always a regex, but ah, but the backwards compatibility! If they pass in a string, it still needs to test it like a raw string.
So in that case, I'll need to test to see if $test is a regex. Is there any good facility for detecting whether or not a scalar has a compiled regex in it?
As hobbs points out, if you're sure that you'll be on 5.10 or later, you can use the built-in check:
use 5.010;
use re qw(is_regexp);
if (is_regexp($pattern)) {
say "It's a regex";
} else {
say "Not a regex";
}
However, I don't always have that option. In general, I do this by checking against a prototype value with ref:
if( ref $scalar eq ref qr// ) { ... }
One of the reasons I started doing it this way was that I could never remember the type name for a regex reference. I can't even remember it now. It's not uppercase like the rest of them, either, because it's really one of the packages implemented in the perl source code (in regcomp.c if you care to see it).
If you have to do that a lot, you can make that prototype value a constant using your favorite constant creator:
use constant REGEX_TYPE => ref qr//;
I talk about this at length in Effective Perl Programming as "Item 59: Compare values to prototypes".
If you want to try it both ways, you can use a version check on perl:
if( $] < 5.010 ) { warn "upgrade now!\n"; ... do it my way ... }
else { ... use is_regex ... }
As of perl 5.10.0 there's a direct, non-tricky way to do this:
use 5.010;
use re qw(is_regexp);
if (is_regexp($pattern)) {
say "It's a regex";
} else {
say "Not a regex";
}
is_regexp uses the same internal test that perl uses, which means that unlike ref, it won't be fooled if, for some strange reason, you decide to bless a regex object into a class other than Regexp (yes, that's possible).
In the future (or right now, if you can ship code with a 5.10.0 requirement) this should be considered the standard answer to the problem. Not only because it avoids a tricky edge-case, but also because it has the advantage of saying exactly what it means. Expressive code is a good thing.
See the ref built-in.