Glob pattern expression for a hexadecimal number in TCL?

I am trying to understand the difference between glob and regex patterns. I need to do some pattern matching in TCL.
The purpose is to find out if a hexadecimal value has been entered.
The value may or may not start with 0x.
The value shall contain between 1 and 12 hex characters, i.e. 0-9, a-f, A-F, and these shall follow the 0x if it exists.
The thing is that glob does not allow use of {a,b} to tell about how many characters to look for. Also, at start I tried to use (0x[Xx])? but I think this is not working.
It is not essential to use glob. I can see that there are subtle differences between glob and regex. I just want to know if this can be done only through regex and not glob.

Tcl's glob patterns are much simpler than regular expressions. All they support is:
* to mean any number of any character.
? to mean any single character.
[…] to mean any single character from the set (the chars inside the brackets, which may include ranges).
\x to mean a literal x (which can be any character). That's how you put a glob metacharacter in a glob pattern.
They're also always anchored at both ends. (Regular expressions are much more powerful. They're also slower. You pay for power.)
To match hex numbers like 0xF00d, you'd use a glob pattern like this:
0x[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]
(or, as an actual Tcl command; we put the pattern in {braces} to avoid needing lots of backslashes for all the brackets…)
string match {0x[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]} $value
Note that we have to match an exact number of characters. (You can shorten the pattern by using case-insensitive matching, i.e. string match -nocase, to 0x[0-9a-f][0-9a-f][0-9a-f][0-9a-f].)
Matching hex numbers is better done with regexp or scan (which also parses the hex number). Everyone likes to forget scan for parsing, yet it's quite good at it…
regexp {^0x([[:xdigit:]]+)$} $value -> theHexDigits
scan $value "0x%x" theParsedValue

The thing is that glob does not allow use of {a,b} to tell about how
many characters to look for. Also, at start I tried to use (0x[Xx])?
but I think this is not working.
A commonly used regular expression, not specific to Tcl at all, is ^(0[xX])?[A-Fa-f0-9]{1,12}$.
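As a rough illustration of that expression, here is a minimal sketch in Python rather than Tcl. The helper name is_hex is made up for this example. Python's fullmatch anchors at both ends, so the ^ and $ are left out:

```python
import re

# Optional 0x/0X prefix, then 1 to 12 hex digits.
# fullmatch() anchors at both ends, so ^ and $ are not needed.
HEX_RE = re.compile(r"(0[xX])?[A-Fa-f0-9]{1,12}")


def is_hex(value: str) -> bool:
    """Hypothetical helper: True if value is 1-12 hex digits, 0x prefix optional."""
    return HEX_RE.fullmatch(value) is not None


for candidate in ["0xF00d", "deadbeef", "0x", "0x1234567890abc", "0xdeadbeetle"]:
    print(candidate, is_hex(candidate))
```

The same pattern should behave identically with Tcl's regexp for these inputs, but that has not been checked here.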
Update
As Donal writes, there is a power-cost tradeoff when it comes to regexp. I was curious and, for the given requirements (optional 0x prefix, range check [1,12]), found that a carefully crafted script using string operations incl. string match (see isHex1 below) outperforms regexp in this setting (see isHex2), whatever the input case:
proc isHex1 {str min max} {
    set idx [string last "0x" $str]
    if {$idx > 0} {
        return 0
    } elseif {$idx == 0} {
        set str [string range $str 2 end]
    }
    set l [string length $str]
    expr {$l >= $min && $l <= $max && [string match -nocase [string repeat {[0-9a-f]} $l] $str]}
}
proc isHex2 {str min max} {
    set regex [format {^(0x)?[[:xdigit:]]{%d,%d}$} $min $max]
    regexp $regex $str
}
isHex1 extends the idea of computing the string match pattern based on the input length (w/ or w/o prefix) and string repeat. My own timings suggest that isHex1 runs at least 40% faster than isHex2 (all using time, 10000 iterations), in a worst case (within range, final character decides). Other cases (e.g., out-of-range) are substantially faster.
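The idea behind isHex1 (build a glob pattern whose length depends on the input) carries over to other languages with glob matching. The following is a loose Python port, not a benchmark: it assumes fnmatch.fnmatchcase as a stand-in for string match, and the name is_hex_glob is invented for illustration. No timing claim is made for this version.

```python
from fnmatch import fnmatchcase


def is_hex_glob(value: str, min_len: int = 1, max_len: int = 12) -> bool:
    """Hypothetical port of isHex1: glob pattern sized to the input length."""
    idx = value.rfind("0x")
    if idx > 0:
        # "0x" somewhere other than the very start is never valid.
        return False
    if idx == 0:
        value = value[2:]          # drop the prefix before counting digits
    n = len(value)
    if not (min_len <= n <= max_len):
        return False
    # One character class per input character, like [string repeat ... $l].
    return fnmatchcase(value, "[0-9a-fA-F]" * n)


print(is_hex_glob("0xF00d"))
print(is_hex_glob("0x0x12"))
```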

The glob syntax is described in the string match documentation. Compared to regular expressions, glob is a blunt instrument.
With regular expressions, you get the standard character classes, including [:xdigit:] to match a hexadecimal digit.
To contrast with mrcalvin's answer, a Tcl-specific regex (with the 0x prefix optional, as the question requires) would be: (?i)^(0x)?[[:xdigit:]]{1,12}$
The leading (?i) means the expression will be matched case-insensitively.
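If all you care about is determining whether the input is a valid number, you can use string is integer. Note that hexadecimal input is only recognized with the 0x prefix, and that decimal input is accepted as well: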
set s 0xdeadbeef
string is integer $s ;# => 1
set s deadbeef
string is integer $s ;# => 0
set s 0xdeadbeetle
string is integer $s ;# => 0

Related

Regular expression puzzler

I have been doing regular expressions for 25+ years but I don't understand why this regex is not a match (using Perl syntax):
"unify" =~ /[iny]{3}/
# as in
perl -e 'print "Match\n" if "unify" =~ /[iny]{3}/'
Can someone help solve that riddle?
The quantifier {3} in the pattern [iny]{3} means to match a character with that pattern (either i or n or y), and then another character with the same pattern, and then another. Three -- one after another. So your string unify doesn't have that, but can muster two at most, ni.
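A quick way to see this is to list the runs of consecutive characters from the class. The sketch below uses Python's re module instead of Perl; for this input the character class and the {3} quantifier behave the same way in both:

```python
import re

word = "unify"

# Every maximal run of consecutive characters from the class [iny].
runs = re.findall(r"[iny]+", word)
print(runs)                           # ['ni', 'y']

# No run is three characters long, so {3} cannot match.
print(re.search(r"[iny]{3}", word))   # None
```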
That's been explained in other answers already. What I'd like to add is an answer to a clarification in comments: how to check for these characters appearing 3 times in the string, scattered around at will. Apart from matching that whole substring, as shown already, we can use a lookahead:
(?=[iny].*[iny].*[iny])
This does not "consume" any characters but rather "looks" ahead for the pattern, not advancing the engine from its current position. As such it can be very useful as a subpattern, in combination with other patterns in a larger regex.
A Perl example, to copy-paste on the command line:
perl -wE'say "Match" if "unify" =~ /(?=[iny].*[iny].*[iny])/'
The drawback of this, as well as of consuming the whole such substring, is the literal spelling out of all three subpatterns. What if the number needs to be decided dynamically, or if it's twelve? The pattern can be built at runtime, of course. In Perl, one way is
my $pattern = '(?=' . join('.*', ('[iny]')x3) . ')';
and then use that in the regex.
 
For the sake of performance, for long strings and many repetitions, make that .* non-greedy
(?=[iny].*?[iny].*?[iny])
(when forming the pattern dynamically join with .*?)
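The same construction can be sketched in Python instead of Perl. The function name scattered_lookahead and its parameters are invented for this example:

```python
import re


def scattered_lookahead(char_class: str, count: int, lazy: bool = True) -> str:
    """Hypothetical helper: lookahead requiring `count` scattered occurrences."""
    gap = ".*?" if lazy else ".*"
    return "(?=" + gap.join([char_class] * count) + ")"


pattern = scattered_lookahead("[iny]", 3)
print(pattern)                                    # (?=[iny].*?[iny].*?[iny])
print(re.search(pattern, "unify") is not None)    # True

# Asking for four occurrences fails: "unify" only has n, i and y.
print(re.search(scattered_lookahead("[iny]", 4), "unify"))   # None
```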
A simple benchmark for illustration (in Perl)
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use List::Util qw(shuffle);
use Benchmark qw( cmpthese );
# For how many seconds to run each option (-r N, default 3),
# how many times to repeat for the test string (-n N, default 2)
my ($runfor, $n) = (3, 2);
GetOptions('r=i' => \$runfor, 'n=i' => \$n);
my $str = 'aa'
. join('', map { (shuffle 'b'..'t')x$n, 'a' } 1..$n)
. 'a'x($n+1)
. 'zzz';
my $pat_greedy = '(?=' . join('.*', ('a')x$n) . ')';
my $pat_non_greedy = '(?=' . join('.*?', ('a')x$n) . ')';
#my $pat_greedy = join('.*', ('a')x$n); # test straight match,
#my $pat_non_greedy = join('.*?', ('a')x$n); # not lookahead
sub match_repeated {
    my ($s, $pla) = @_;
    return ( $s =~ /$pla(.*z)/ ) ? "match" : "no match";
}
cmpthese(-$runfor, {
greedy => sub { match_repeated($str, $pat_greedy) },
non_greedy => sub { match_repeated($str, $pat_non_greedy) },
});
(Shuffling of that string is probably unneeded but I feared optimizations intruding.)
When a string is made with the factor of 20 (program.pl -n 20) the output is
               Rate     greedy non_greedy
greedy       56.3/s         --      -100%
non_greedy  90169/s    159926%         --
So non-greedy is some 1600 times faster. That test string is 7646 characters long and the pattern to match has 20 subpatterns (a) with .* between them (in the greedy case), so there's a lot going on there. With the default of 2, so for a short string and a simpler pattern, the difference is 10%.
Btw, to test for straight-up matches (not using lookahead) just move those comment signs around the pattern variables, and it's nearly twice as bad:
                Rate     greedy non_greedy
greedy        56.5/s         --      -100%
non_greedy  171949/s    304117%         --
The letters n, i, and y aren't all adjacent. There's an f in between them.
/[iny]{3}/ matches any string that contains a substring of three letters taken from the set {i, n, y}. The letters can be in any order; they can even be repeated.
Choosing three characters three times, with replacement, means there are 3^3 = 27 matching substrings:
iii, iin, iiy, ini, inn, iny, iyi, iyn, iyy
nii, nin, niy, nni, nnn, nny, nyi, nyn, nyy
yii, yin, yiy, yni, ynn, yny, yyi, yyn, yyy
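The list above can be reproduced mechanically. This is a small Python sketch (not Perl) that enumerates the candidates and checks which of them occur in the example string:

```python
from itertools import product

# All length-3 strings over the alphabet {i, n, y}, with repetition allowed.
candidates = ["".join(p) for p in product("iny", repeat=3)]
print(len(candidates))        # 27
print(candidates[:3])         # ['iii', 'iin', 'iiy']

# None of them is a substring of "unify", which is why the regex fails.
present = [c for c in candidates if c in "unify"]
print(present)                # []
```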
To match non-adjacent letters you can use one of these:
[iny].*[iny].*[iny]
[iny](.*[iny]){2}
([iny].*){3}
(The last option will work fine on its own since your search is unanchored, but might not be suitable as part of a larger regex. The final .* could match more than you intend.)
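To make that caveat concrete, here is a hedged Python sketch (Python's re rather than Perl) that runs the three patterns against a made-up input with trailing characters, "unify!!":

```python
import re

text = "unify!!"   # made-up input; the "!!" shows what a trailing .* swallows
patterns = [
    r"[iny].*[iny].*[iny]",
    r"[iny](.*[iny]){2}",
    r"([iny].*){3}",
]

# Collect the full match of each pattern for comparison.
results = [re.search(pattern, text).group(0) for pattern in patterns]
for pattern, matched in zip(patterns, results):
    print(pattern, "->", matched)
```

The first two stop at the final y, while the third also consumes the trailing characters.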
That pattern looks for three consecutive occurrences of the letters i, n, or y. You do not have three consecutive occurrences.
Perhaps you meant to use [inf] or [ify]?
Looks like you are looking for 3 consecutive letters, so yours should not match
[iny]{3} //no match
[unf]{3} //no match
[nif]{3} //matches nif
[nify]{3} //matches nif
[ify]{3} //matches ify
[uni]{3} //matches uni
Hope that helps somewhat :)
The {3} quantifier means "exactly three consecutive matches of the preceding element." While all of the letters in your character class are present in the string, they are not consecutive as they are separated by other characters in your string.
It isn't the order of items in the character class that's at issue. It's the fact that you can't match any combination of the three letters in your character class where exactly three of them are directly adjacent to one another in your example string.

Regexp not matching string with [] and / in Tcl

I am unable to match regex with a pin name having patterns with / and []. How to match string with this expression in tcl regexp?
ISSUE:
% set inst "channel/rptrw12\[5\]"
channel/rptrw12[5]
% set pin "channel/rptrw12\[5\]/rpinv\[11\]/vcc"
channel/rptrw12[5]/rpinv[11]/vcc
% regexp -nocase "^$inst" $pin
0
PASSING CASE:
% regexp -nocase vcc $pin
1
% set pat "ctrl/crdtfifo"
ctrl/crdtfifo
% set pin2 "ctrl/crdtfifo/iwdatabuf"
ctrl/crdtfifo/iwdatabuf
% regexp -nocase $pat $pin2
1
Your problem is that you are fighting with RE engine metacharacters, specifically […], which defines a character set. If you want to continue using your current approach, you'll need to add more backslashes.
But you don't have to do that!
If you are asking the question “does this string exist in that string?” you can also consider using one of these:
Use string first and check if the result (where the substring is) is not negative:
if {[string first $inst $pin] >= 0} {
puts "Found it"
}
Use regexp ***=, which means “interpret the rest of this as a literal string, no metacharacters”:
if {[regexp ***=$inst $pin]} {
puts "Found it"
}
If you only want to match for equality at the start of the string (you're asking “does this string start with that string?”) you probably should instead do one of these:
Use string first and check if the resulting index is zero:
if {[string first $inst $pin] == 0} {
puts "Found '$inst' at the start of '$pin'"
}
Use string equal with the right option (very much like strncmp() in C, if you know that):
if {[string equal -length [string length $inst] $inst $pin]} {
puts "'$pin' starts with '$inst'"
}
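For comparison, the same choices exist in most languages. Below is a minimal sketch in Python rather than Tcl: the in operator and str.startswith stand in for string first and string equal -length, and re.escape is Python's rough counterpart to the ***= prefix. The values are the ones from the question:

```python
import re

inst = "channel/rptrw12[5]"
pin = "channel/rptrw12[5]/rpinv[11]/vcc"

# Unescaped, [5] is a character set, so the pattern wants "...rptrw125".
print(re.match(inst, pin))                          # None

# Plain string operations, no regex involved.
print(inst in pin)                                  # True (substring test)
print(pin.startswith(inst))                         # True (prefix test)

# If a regex is required, escape the metacharacters first.
print(re.match(re.escape(inst), pin) is not None)   # True
```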
If you remember your regular expressions, the [] syntax has special meaning in regexp. It defines a character group. For example:
[abc]
means match a or b or c.
Therefore the pattern:
channel/rptrw12[5]
means match the string:
channel/rptrw125
If you want to match the literal character [ in regexp you need to escape it (same with all other characters that have meaning in regexp like . or ? or ( etc.). So your pattern should be:
channel/rptrw12\[5\]
But remember, the characters \ and [ have special meaning in tcl strings. So your code must do:
set inst "channel/rptrw12\\\[5\\\]"
The first \ escapes the \ character so that tcl will insert a single \ into the string. The third \ escapes the [ character so that tcl will not try to execute a command or function named 5.
Alternatively you can use {} instead of "":
set inst {channel/rptrw12\[5\]}

TCL_REGEXP:: How to grep a line from variable that looks similar in TCL

My TCL script:
set test {
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
}
set lines [split $test \n]
set data [join $lines :]
if { [regexp {number n1.*(numbers .*)} $data x y]} {
puts "numbers are : $y"
}
Current output if I run the above script:
C:\Documents and Settings\Owner\Desktop>tclsh stack.tcl
numbers are : numbers 56,4,5,5:
C:\Documents and Settings\Owner\Desktop>
Expected output:
In the regexp, if I specify "number n1", it should print "numbers are : numbers 2,3,4,5,6".
If I specify "number n2", it should print "numbers are : numbers 56,4,5,5:".
Currently it always prints the last line (numbers 56,4,5,5:) as output. How can I resolve this?
Thanks,
Kumar
Try using
regexp {number n1.*?(numbers .*)\n} $test x y
(note that I'm matching against test. There is no need to replace the newlines.)
There are two differences from your pattern.
The question mark after the first star makes the match non-greedy.
There is a newline character after the capturing parentheses.
Your pattern told regexp to match from the first occurrence of number n1 up to the last occurrence of numbers, and it did. This is because the .* match between them was greedy, i.e. it matched as many characters as it could, which meant it went past the first numbers.
Making the match non-greedy means that the pattern will match from the first occurrence of number n1 up to the following occurrence of numbers, which was what you wanted.
After numbers, there is another .* match which is a bit troublesome. If it were greedy, it would match everything up to the end of the variable content. If it were non-greedy, it wouldn't match any characters, since matching a zero-length string satisfies the match. Another problem is that the Tcl RE engine doesn't really allow for switching back from non-greedy mode.
You can fix this by forcing the pattern to match one character past the text that you want the .* to match, making the zero-length match invalid. Matching a newline (\n) or space (\s) character should work. (This of course means that there must be a newline / other space character after every data field: if a numbers field is the last character range in the variable that field can't be located.)
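The greedy versus non-greedy difference can be reproduced outside Tcl. The sketch below uses Python, with two caveats that are assumptions of this example: re.DOTALL is needed so that . crosses newlines (Tcl's . does that by default), and Python decides greediness per quantifier, so the second quantifier is made lazy explicitly:

```python
import re

test = """
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
"""

# Greedy: the first .* runs to the LAST "numbers" in the text.
greedy = re.search(r"number n1.*(numbers .*)", test, re.DOTALL)
# Lazy, with a trailing newline to force a non-empty capture.
lazy = re.search(r"number n1.*?(numbers .*?)\n", test, re.DOTALL)

print(repr(greedy.group(1)))   # 'numbers 56,4,5,5\n'
print(repr(lazy.group(1)))     # 'numbers 2,3,4,5,6'
```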
Documentation: regular expression syntax, regexp
To use a Tcl variable in a regular expression is easy. On one level anyway: you put the regular expression in double quotes so that you have standard Tcl variable substitution inside it prior to it being passed to the RE engine:
# ...
set target "n1"
if { [regexp "number $target.*(numbers .*)" $data x y]} {
# ...
The hard part is that you've got to remember that switching to "…" from {…} will affect the whole of that word, and that the substitutions are of regular expression fragments. We usually recommend using {…} because that's easier to get consistently and unconfusingly right in the majority of cases.
Let's illustrate how this can get annoying. In your specific case, you may want to actually use this:
if { [regexp "number $target\[^:\]*:(numbers \[^:\]*)" $data x y]} {
The character sets here exclude the : (which you've — unnecessarily — used as a newline replacement) but because […] is also standard Tcl metasyntax, you have to backslash-quote it. (Things get even more annoying when you want to always use the contents of the variable as a literal even though they might include RE metasyntax characters; you need a regsub call to tidy things up. And you start to potentially make Tcl's RE cache less efficient too.)
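The same concern, substituting a variable into a pattern without letting its contents act as metasyntax, looks like this in Python (used here instead of Tcl). In this sketch re.escape plays the role of the regsub tidy-up mentioned above, the function name numbers_for is hypothetical, and the data is matched line by line on the assumption that the numbers line directly follows the number line:

```python
import re

test = """
a for apple
b for ball
c for cat
number n1
numbers 2,3,4,5,6
d for doctor
e for egg
number n2
numbers 56,4,5,5
"""


def numbers_for(target: str, text: str):
    """Hypothetical helper: the 'numbers' line right after 'number <target>'."""
    # re.escape keeps characters such as . or [ in target literal.
    pattern = rf"^number {re.escape(target)}\n(numbers [^\n]*)$"
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1) if match else None


print(numbers_for("n1", test))   # numbers 2,3,4,5,6
print(numbers_for("n2", test))   # numbers 56,4,5,5
```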

Getting equal number of digits on both sides of a character in a string

I have a string
$test = 'xyz45sd2-32d34-sd23-456562.abc.com'
The objective is to obtain $1 = 23 and $2 = 45, i.e. an equal number of digits on both sides of the last -. Note that the number of digits is variable, and is not necessarily 2.
I have tried the following:
$test1 =~ s/.*(\d+)-(\d+).*//;
But
$1 contains 3
$2 contains 456562
You can try this regex
if($test1 =~ m/(\S+)-(\S+)-([a-z]*)(\d+)-(\d\d)(\d+).*/)
{
print $4,"|",$5;
}
I assume that you need only the first 2 digits from 456562
perl -e '"xyz45sd2-32d34-sd23-456562.abc.com" =~ /(\d{2})-(\d{2})\d*(?=\.)/; print "$1\n$2\n"'
This other entry confirms that regex does not count:
How to match word where count of characters same
Building upon GreatBigBore's idea, if there's an upper bound to the count, then you could try the or operator |. This only matches your requirement to find a match; depending on the matched count the match will be in different bins. Only one case correctly places them in $1 and $2.
(\d{3})-(\d{3})|(\d{2})-(\d{2})|(\d{1})-(\d{1})
However, if you concatenate the result captures as $1$3$5 and $2$4$6, you will effectively get the 2 strings you were looking for.
Another idea is to operate iteratively, you could repeat your search on the string by increasing the number until the match fails. (\d{1})-(\d{1}) , (\d{2})-(\d{2}) ...
A binary search comes to mind, making it O(log N), with N being the upper limit for the capture length.
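A minimal sketch of the iterative idea, written in Python rather than Perl and using a plain linear loop instead of a binary search. The function name equal_digits, the upper bound of 10, and the [^-]*$ suffix (which pins the match to the last -) are assumptions of this example, not part of the answer above:

```python
import re


def equal_digits(text: str, upper: int = 10):
    """Hypothetical helper: longest equal-length digit runs around the last '-'."""
    best = None
    for k in range(1, upper + 1):
        # (\d{k})-(\d{k}) with k filled in; [^-]*$ pins it to the last '-'.
        match = re.search(rf"(\d{{{k}}})-(\d{{{k}}})[^-]*$", text)
        if match is None:
            break              # a longer k cannot match if this one failed
        best = match.groups()
    return best


print(equal_digits("xyz45sd2-32d34-sd23-456562.abc.com"))   # ('23', '45')
```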
Theoretical answer
Short answer:
What you're looking for is not possible using regular expressions.
Long Answer:
Regular expressions (as their name suggests) are a compact representation of regular languages (Type-3 grammars in the Chomsky hierarchy).
What you're looking for is not possible using regular expressions, as you're trying to write an expression that maintains some kind of count (some contextual information other than beginning and end). This kind of behavior cannot be modelled as a DFA (or, in fact, any finite automaton). A language is regular exactly when there exists a DFA that accepts it. As this kind of contextual information cannot be modeled in a DFA, you cannot write a regular expression for your problem.
Practical Solution
my ($lhs,$rhs) = $test =~ /^[^-]+-[^-]+-([^-]+)-([^-.]+)\S+/;
# Alternatively, and faster
my (undef,undef,$lhs,$rhs) = split /-/, $test;
# Rest is common, no matter how $lhs and $rhs are extracted.
my @left = reverse split //, $lhs;
my @right = split //, $rhs;
my $i;
for($i=0; exists($left[$i]) and exists($right[$i]) and $left[$i] =~ /\d/ and $right[$i] =~ /\d/ ; ++$i){}
--$i;
$lhs= join "", reverse @left[0..$i];
$rhs= join "", @right[0..$i];
print $lhs, "\t", $rhs, "\n";
Edit: It's possible to improve my solution by using regular expressions to extract the required numeric portions of $lhs and $rhs instead of split, reverse and for.
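One way that improvement could look, sketched in Python instead of Perl: a regex grabs the digit runs on each side of the last -, and ordinary string slicing does the counting that the regex cannot. The name equal_digits_trim is made up for this example:

```python
import re


def equal_digits_trim(text: str):
    """Hypothetical helper: trim both digit runs to the shorter length."""
    # Digit run before the last '-', digit run right after it.
    match = re.search(r"(\d+)-(\d+)[^-]*$", text)
    if match is None:
        return None
    left, right = match.groups()
    n = min(len(left), len(right))
    # Keep the digits nearest the '-': tail of left, head of right.
    return left[-n:], right[:n]


print(equal_digits_trim("xyz45sd2-32d34-sd23-456562.abc.com"))   # ('23', '45')
```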
As @Samveen said, it's technically not possible to do in pure regex.
Like @Samveen's solution, here's another version:
#get left and right
my (undef,undef,$left,$right) = split /-/, $test;
#get left numbers
$left =~ s/.*?(\d+)$/$1/;
##get right numbers
$right =~ s/^(\d+).*/$1/;
##get length of both
my $right_length = length $right;
my $left_length = length $left;
if ($right_length > $left_length){
#make right length as same as left length
$right =~ s/(\d{$left_length}).*/$1/;
} else {
#make left length as same as right length
$left =~ s/.*(\d{$right_length})/$1/;
}
print $left, "\t", $right, "\n";

TCL regsub isn't working when the expression has [0]

I tried the following code:
set exp {elem[0]}
set temp {elem[0]}
regsub $temp $exp "1" exp
if {$exp} {
puts "######### 111111111111111 ################"
} else {
puts "########### 0000000000000000 ############"
}
Of course, this is the easiest regsub possible (the words match completely), and still it doesn't work: no substitution is done. If I write elem instead of elem[0], everything works fine.
I tried using {elem[0]}, elem[0], "elem[0]" etc, and none of them worked.
Any clue anyone?
This is the easiest regsub possible (the words match completely)
Actually, no, the words don't match. You see, in a regular expression, square brackets have meaning. Your expression {elem[0]} actually means:
match the sequence of letters 'e'
followed by 'l'
followed by 'e'
followed by 'm'
followed by '0' (the character for the number zero)
So it would match the string "elem0" not "elem[0]" since the character after 'm' is not '0'.
What you want is {elem\[0\]} <-- backslash escapes special meaning.
Read the manual for tcl's regular expression syntax, re_syntax, for more info on how regular expressions work in tcl.
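The effect is not specific to Tcl. As a hedged illustration, here is the same substitution in Python, where re.sub stands in for regsub and re.escape is one way to escape arbitrary input:

```python
import re

text = "elem[0]"

# Unescaped: [0] is a character set, so this pattern only matches "elem0".
print(re.sub("elem[0]", "1", text))          # elem[0]  (unchanged)
print(re.sub("elem[0]", "1", "elem0"))       # 1

# Escaped by hand, or with re.escape for arbitrary input.
print(re.sub(r"elem\[0\]", "1", text))       # 1
print(re.sub(re.escape(text), "1", text))    # 1
```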
In addition to @slebetman's answer, if you want any special characters in your regular expression to be treated like plain text, there is special syntax for that:
set word {abd[0]}
set regex $word
regexp $regex $word ;# => returns 0, did not match
regexp "(?q)$regex" $word ;# => returns 1, matched
That (?q) marker must be the first part of the RE.
Also, if you're really just comparing literal strings, consider the simpler if {$str1 eq $str2} ... or the glob-style matching of [string match]