Tcl regexp not returning all matches - regex

I am reading a file, the content is as below:
Aug2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jack 25 128736372
Peter 26 987840392
--------------------------------------
Sep2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jared 21 874892032
Eric 24 847938427
--------------------------------------
I want to extract the text between each pair of dashed lines and put the pieces into a list. Assuming $data contains the file content, I am using the Tcl regexp below to extract the info:
regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data
As far as I know, the matches are returned as a list containing the full match and the submatch.
I double-checked with the llength command, and there is only one full match and one submatch.
llength $data
2
Why is there only 1 subMatch? There are supposed to be 5 matches, like below:
Aug2017:
--------------------------------------
Name Age Phone --> 1st Match
--------------------------------------
Jack 25 128736372
Peter 26 987840392 --> 2nd Match
--------------------------------------
Sep2017: --> 3rd Match
--------------------------------------
Name Age Phone --> 4th Match
--------------------------------------
Jared 21 874892032
Eric 24 847938427 --> 5th Match
--------------------------------------
So in this case, I am choosing the second list element (subMatch) with lindex.
lindex [regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data] 1
However, the result I got looks like this; it seems to match from the beginning to the end of the content:
Name Age Phone
--------------------------------------
Jack 25 128736372
Peter 26 987840392
--------------------------------------
Sep2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jared 21 874892032
Eric 24 847938427
My impression was that regexp should start at the beginning and match sequentially to the end of the string, so I am not sure why the Tcl regex behaves like this. Am I missing something?
** The main thing I want to achieve here is to extract data between the dashed separator, the above data is just an example.
Expected result: a list containing all matches
{ {Name Age Phone} -->1st match
{Jack 25 128736372
Peter 26 987840392} -->2nd match
{Sep2017:} -->3rd match
{Name Age Phone} -->4th match
{Jared 21 874892032
Eric 24 847938427} -->5th match
}
UPDATE:
I have slightly changed my Tcl regex as below, to include the lookahead and the suggestion by @glenn:
regexp -all -inline -expanded -- {\s+?-{2,}\s+?(.*?)(?=\s+?-{2,}\s+?)} $data
The result I got (10 submatches):
{ {----------------------
Name Age Phone} -->1st match
{Name Age Phone} -->2nd match
{----------------------
Jack 25 128736372
Peter 26 987840392} -->3rd match
{Jack 25 128736372
Peter 26 987840392} -->4th match
{----------------------
Sep2017:} -->5th match
{Sep2017:} -->6th match
...
...
}
It is pretty close to the expected result, but I still want to figure out how to use regex to perfectly match the expected 5 submatches.

Regular expression matching is not a good tool for this kind of problem. You're much better off with some kind of line filter.
A regular expression-based filter, closely matched to your example lines:
set f [open data.txt]
while {[gets $f line] >= 0} {
if {[regexp {:} $line]} continue
if {![regexp {\d} $line]} continue
puts $line
}
close $f
Rationale: only month name lines have colons, header lines and separators have no digits in them.
A filter that doesn't rely as much on regular expressions:
set f [open data.txt]
set skip 4
while {[gets $f line] >= 0} {
if {$skip < 1} {
if {[regexp {\-{2,}} $line]} {
set skip 4
} else {
puts $line
}
} else {
incr skip -1
}
}
close $f
This code reads every line, skips four lines at the beginning of each month, and resets the skip to 4 when a line of dashes interrupts the data.
(Note: the expression \-{2,} makes it look like the dash is special in a regular expression and needs to be escaped for that reason. Actually, it's because if the dash is the first character in the expression, the regexp command tries to interpret it as a switch. regexp -- {-{2,}} ... would work too but looks even stranger, I think.)
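The switch-parsing behavior described in the note can be checked directly. This is a minimal sketch; the variable names are illustrative, and the exact wording of the error message may differ between Tcl versions.

```tcl
set line "----------"

# Escaping the leading dash keeps regexp from reading the pattern as a switch.
set escaped [regexp {\-{2,}} $line]

# Alternatively, -- marks the end of the switches explicitly.
set explicit [regexp -- {-{2,}} $line]

# With neither, the pattern itself is parsed as an (unknown) switch,
# so the command raises an error instead of matching.
set failed [catch {regexp {-{2,}} $line} message]

puts "escaped=$escaped explicit=$explicit failed=$failed"
```

Both working forms return 1 for a line of dashes, while the unprotected form is caught as an error.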
ETA (see comment): to get data between separators (i.e. just filter out the separators), try this:
set f [open data.txt]
while {[gets $f line] >= 0} {
if {![regexp {\-{2,}} $line]} {
puts $line
}
}
close $f
Or:
package require fileutil
::fileutil::foreachLine line data.txt {
if {![regexp {\-{2,}} $line]} {
puts $line
}
}
This should also work:
regsub -all -line {^\s+-{2,}.*(\n|\Z)} $data {}
Enabling newline-sensitive matching, this matches and removes all lines consisting only of whitespace, dashes, optional non-newlines and either a newline character or the end-of-outer-string.
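As a quick check of the regsub approach, here is a small sketch on a shortened copy of the sample. It assumes the separator lines are not indented, so it uses \s* where the expression above uses \s+; the variable names are illustrative.

```tcl
# Shortened sample: three separator lines, the last one without a trailing newline.
set data "Aug2017:\n------\nName Age Phone\n------\nJack 25 128736372\n------"

# -line makes ^ match at the start of every line and keeps . from crossing newlines.
# (\n|\Z) lets the last separator match even though no newline follows it.
set kept [regsub -all -line {^\s*-{2,}.*(\n|\Z)} $data {}]

puts $kept
```

Only the three non-separator lines remain in kept, each followed by a newline.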
To collect a list of matches rather than just printing filtered lines:
set matches {}
set matchtext {}
::fileutil::foreachLine line data.txt {
if {![regexp {\-{2,}} $line]} {
append matchtext $line\n
} else {
lappend matches $matchtext
set matchtext {}
}
}
After running this, the variable matches contains a list whose items are contiguous lines between separators.
Another way to do the same thing:
package require textutil
::textutil::splitx $data {(?n)^\s+-{2,}.*(?:\n|\Z)}
(It also adds an empty element at the end of the list, which is easy enough to remove if it is a problem.)
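If Tcllib is not available, the same collection step can be sketched in plain Tcl. The proc name blocksBetweenSeparators is made up for this illustration, and the separator test assumes a separator line contains nothing but dashes and optional whitespace.

```tcl
set data "Aug2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jack 25 128736372
Peter 26 987840392
--------------------------------------
Sep2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jared 21 874892032
Eric 24 847938427
--------------------------------------"

# Hypothetical helper: returns the runs of lines found between separator lines.
proc blocksBetweenSeparators {text} {
    set blocks {}
    set current {}
    foreach line [split $text \n] {
        if {[regexp {^\s*-{2,}\s*$} $line]} {
            # A separator closes the block collected so far (if any).
            if {[llength $current] > 0} {
                lappend blocks [join $current \n]
            }
            set current {}
        } else {
            lappend current $line
        }
    }
    # Keep trailing lines that are not followed by a separator.
    if {[llength $current] > 0} {
        lappend blocks [join $current \n]
    }
    return $blocks
}

set blocks [blocksBetweenSeparators $data]
puts [llength $blocks]
puts [lindex $blocks 2]
```

For the sample this yields six blocks, because the leading Aug2017: line is kept as well; lrange $blocks 1 end gives the five blocks listed in the question.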
Documentation:
< (operator),
>= (operator),
append,
close,
continue,
fileutil (package),
gets,
if,
incr,
lappend,
open,
package,
puts,
regexp,
set,
textutil (package),
while,
Syntax of Tcl regular expressions

Related

extract 1st line with specific pattern using regexp

I have a string
set text {show log
===============================================================================
Event Log
===============================================================================
Description : Default System Log
Log contents [size=500 next event=7 (not wrapped)]
6 2020/05/22 12:36:05.81 UTC CRITICAL: IOM #2001 Base IOM
"IOM:1>some text here routes "
5 2020/05/22 12:36:05.52 UTC CRITICAL: IOM #2001 Base IOM
"IOM:2>some other text routes "
4 2020/05/22 12:36:05.10 UTC MINOR: abc #2001 some text here also 222 def "
3 2020/05/22 12:36:05.09 UTC WARNING: abc #2011 some text here 111 ghj"
1 2020/05/22 12:35:47.60 UTC INDETERMINATE: ghe #2010 a,b, c="7" "
}
I want to extract the 1st line that starts with "IOM:" using regexp in tcl ie
IOM:1>some text here routes
But my implementation doesn't work. Can someone help here?
regexp -nocase -lineanchor -- {^\s*(IOM:)\s*\s*(.*?)routes$} $line match tag value
You may use
regexp -nocase -- {(?n)^"IOM:.*} $text match
regexp -nocase -line -- {^"IOM:.*} $text match
Details
(?n) - (same as -line option) newline sensitive mode ON so that . could not match line breaks ( see Tcl regex docs: If newline-sensitive matching is specified, . and bracket expressions using ^ will never match the newline character (so that matches will never cross newlines unless the RE explicitly arranges it) and ^ and $ will match the empty string after and before a newline respectively, in addition to matching at beginning and end of string respectively)
^ - start of a line
"IOM: - "IOM: string
.* - the rest of the line to its end.
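The effect of newline-sensitive matching can be seen on a shortened version of the input. This is only a sketch; the text value and the variable names are made up for the illustration.

```tcl
set text "Event Log\n\"IOM:1>some text here routes \"\n\"IOM:2>some other text routes \""

# Default mode: ^ only matches at the very start of the string, so nothing is found.
set plainFound [regexp {^"IOM:.*} $text plainMatch]

# -line: ^ matches at every line start and . stops at the newline.
set lineFound [regexp -line {^"IOM:.*} $text lineMatch]

# (?n) embedded in the expression has the same effect as -line.
set embeddedFound [regexp {(?n)^"IOM:.*} $text embeddedMatch]

puts "plain=$plainFound line=$lineFound embedded=$embeddedFound"
puts $lineMatch
```

Without -all, only the first matching line is returned, which is what the question asks for.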
In addition to @Wiktor's great answer, you might want to iterate over the matches:
set re {^\s*"(IOM):(.*)routes.*$}
foreach {match tag value} [regexp -all -inline -nocase -line -- $re $text] {
puts [list $tag $value]
}
IOM {1>some text here }
IOM {2>some other text }
I see that you have a non-greedy part in your regex. The Tcl regex engine is a bit weird compared to other languages: the first quantifier in the regex sets the greediness for the whole regex.
set re {^\s*(IOM:)\s*\s*(.*?)routes$} ; # whole regex is greedy
set re {^\s*?(IOM:)\s*\s*(.*?)routes$} ; # whole regex is non-greedy
# .........^^
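The greediness rule can be observed on a tiny input. This is a sketch with a made-up string; the digit part can only match the single leading 1, so the only thing that changes between the two commands is the preference set by the first quantifier.

```tcl
set s "1xaya"

# First quantifier (\d+) is greedy, so the overall match is the longest one
# and the "lazy" .*? is stretched to the last a.
regexp {\d+(.*?)a} $s -> greedyCapture

# First quantifier (\d+?) is non-greedy, so the overall match is the shortest one
# and .*? stops at the first a.
regexp {\d+?(.*?)a} $s -> lazyCapture

puts "greedy: $greedyCapture"
puts "lazy:   $lazyCapture"
```

The first command captures xay and the second captures x, even though the capturing group is written as .*? in both.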

tcl regexp with paren-delimited groups

I have a group of strings that look like:
foo<xyz><123>
bar
pizza<oregano><tomato><mozzarella>
So it boils down to a prefix (foo, bar, pizza, ...) followed by any number of attribute names enclosed in angle brackets.
Both the prefix and the attributes may consist of any character except angle brackets (which are only used for delimiting the attribute names).
Neither the prefix nor the attribute names may be empty.
Now I would like to have a regex in my Tcl application that gives me both the prefix and all the attributes (it's OK if they keep their delimiting brackets, though in the end I have to split them up into a list).
The trivial approach ^(.+)(<.+>)*$ doesn't work because the leading .+ is too greedy and eats away all the matches for the attribute names.
So I tried excluding the forbidden angle brackets ^(\[^<>\]+)(<.+>)*$ which works OK at first glance, but then I discovered that this would match fnork<<>><x<>>, violating the rule that attribute names must not contain any angle brackets (apart from the delimiting ones).
Third, I extended the forbidden characters to the attribute names ^(\[^<>\]+)(<\[^<>\]+>)*$, but now things are getting a bit shady: while the regex only matches valid strings (so neither prefix nor attribute names contain any brackets), I no longer get all the attribute names as match parts:
% regexp -all -inline "^(\[^<>\]+)(<\[^<>\]+>)*" "A<xyz><123>"
A<xyz><123> A <123>
For whatever reason the <xyz> is not returned!
Any idea how to fix this?
side-note
(The actual string I'm trying to parse uses square brackets and parentheses as delimiters, something like pizza[large](tomato)(olives)(cheese), where the [term] can appear 0 or 1 times, whereas the (term)s can appear 0 or more times.
But due to the nature of square brackets and parentheses this requires a fair amount of quoting, which is probably too much of a distraction to be useful here.)
In this case, the trick is to use a fairly simple RE and post-process the results:
% regexp -all -inline {^([^<>]+)((?:<[^<>]+>)*)$} foo<xyz><123>
foo<xyz><123> foo <xyz><123>
% regexp -all -inline {[^<>]+} <xyz><123>
xyz 123
You were almost there, but were struggling with using (<[^<>]+>)*, which won't work as that only captures the group one of the times it matches. (I wasn't aware that it captured the last match, but since I rarely want either first or last but rather all, I use a different approach.)
Putting that all together and assuming you've got one big multi-line string that has all the pieces you want to look at in it (e.g., because you've read it from a file) you get:
set str "foo<xyz><123>
bar
pizza<oregano><tomato><mozzarella>"
# Find the matching lines and do the first-level extract on them
foreach {- prefix attribs} [regexp -all -line -inline {^([^<>]+)((?:<[^<>]+>)*)$} $str] {
# Split the attribute names
set attributes [regexp -all -inline {[^<>]+} $attribs]
# Show that we've matched them for real
puts "prefix='$prefix', attributes=[join $attributes ,]"
}
Which produces this output:
prefix='foo', attributes=xyz,123
prefix='bar', attributes=
prefix='pizza', attributes=oregano,tomato,mozzarella
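The same two-step idea can be carried over to the square-bracket and parenthesis delimiters from the question's side-note. This is only a sketch under the rules stated there (at most one [term], any number of (term)s, no delimiters inside names); the proc name parseItem is made up for the illustration.

```tcl
# Hypothetical helper: returns {prefix optional attributes}, or an empty list
# if the item does not follow the expected shape.
proc parseItem {item} {
    # [^][()] is a bracket expression excluding ] [ ( and ).
    set re {^([^][()]+)(?:\[([^][()]+)\])?((?:\([^][()]+\))*)$}
    if {![regexp $re $item -> prefix optional attribs]} {
        return {}
    }
    # Second step: split the run of (name)(name)... into the bare names.
    set attributes [regexp -all -inline {[^()]+} $attribs]
    return [list $prefix $optional $attributes]
}

puts [parseItem {pizza[large](tomato)(olives)(cheese)}]
puts [parseItem bar]
```

Writing the expressions in braces avoids most of the quoting trouble, because Tcl performs no substitution inside braces.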
Let's tokenize this.
package require string::token
set lex {[[] LB []] RB [(] LP [)] RP [^][()]+ t}
set str {pizza[large](tomato)(olives)(cheese)}
% set tokens [::string::token text $lex $str]
{t 0 4} {LB 5 5} {t 6 10} {RB 11 11} {LP 12 12} {t 13 18} {RP 19 19} {LP 20 20} {t 21 26} {RP 27 27} {LP 28 28} {t 29 34} {RP 35 35}
Having tokenized, we can parse, or evaluate the tokens as statements in a little language:
% set terms [lassign $tokens prefix]
proc t {str beg end} {
string range $str $beg $end
}
proc LB {str beg end} {
return "Optional term is: "
}
proc RB args {
return \n
}
proc LP {str beg end} {
rename LP {}
proc LP args {
return ", "
}
return "Arguments are: "
}
proc RP args {}
% puts "Prefix is: [eval [linsert $prefix 1 $str]]"
Prefix is: pizza
% join [lmap term $terms {eval [linsert $term 1 $str]}] {}
Optional term is: large
Arguments are: tomato, olives, cheese
Documentation:
eval,
join,
lassign,
linsert,
lmap (for Tcl 8.5),
lmap,
package,
proc,
puts,
rename,
return,
set,
string::token (package)
I might have misread the requirements, but given that you have already "encoded" all structural details in your ad hoc notation, why not have the Tcl list machinery do the work?
set str {foo(xyz)(123)
bar
pizza[large](oregano)(tomato)(mozzarella)}
foreach line [split $str \n] {
set line [string map {"[" " " "]" " " ")(" " " "(" " {" ")" "} "} $line]
set suffix [lassign $line prefix]
lassign $suffix a b
if {[llength $suffix] == 2} {
set optional $a
set attributes $b
} else {
set optional ""
set attributes $a
}
puts "prefix='$prefix', optional='$optional', attributes='[join $attributes ,]'"
}
I apologise: strictly speaking, my answer does not address the regex question. And there is less wizardry than in the other replies ;)

Extract value using regex when space is missing

I am trying to extract the text and the third column from the below output. My problem is that there is one line where the space is missing. Is it possible to extract that value in this case?
4086 process-working 841901 841901 1234 22
4297 procesor_stats_controller_fmm543182 543182 0 22
4028 ipv6_ma 3063025 3063025 -55 78
4280 tty-verifyd 694043 694043 0 22
My regex so far looks like this:
\d+\s+(\w+-?\w+)\s*\d+\s+\d+\s+(-?\d+)\s+\d+
Thank you
EDIT: it's actually a bug in the device, at least one space should be there, so I'll just let them fix it and then retry. Thank you for taking the time to answer this :)
In this case, I'd first split the line into fields:
foreach line $lines {
set fields [regexp -inline -all {\S+} $line]
if {[llength $fields] == 6} {
puts [lindex $fields 2]
} else {
# extract the digits at the end of this field
regexp {\d+$} [lindex $fields 1] value
puts $value
}
}
841901
543182
3063025
694043
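The problem is with the \w. \w matches letters, digits, and the underscore (roughly [a-zA-Z0-9_]).
So when the space is missing, it also swallows the digits of the number that follows.
Instead of \w, use a class without digits, such as [a-zA-Z_-], and allow an optional minus sign in the fifth column. This regular expression captures the name and the third column:
\d+\s+([a-zA-Z_-]+)\s*(\d+)\s+\d+\s+-?\d+\s+\d+
Note that a name that itself contains digits, such as ipv6_ma, can no longer be matched this way.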

Matching a regexp in TCL PERL

I have the following pattern:
Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
I want to separate each Pattern block. I am using Tcl. The regexp that I am using does not serve the purpose:
set updateList [regexp -all -inline {Pattern\[\d+\].*?Value.*?\n} $list]
Which regexp should I use to separate each Pattern block?
I need the output as
Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
Your pattern Pattern\[\d+\].*?Value.*?\n mixes greedy and lazy quantifiers. Tcl does not handle mixed quantifier types the way you would expect from, say, PCRE (PHP, Perl) or .NET: it effectively applies the preference of the first quantifier to the whole expression. So, because the + after \d is greedy, the later .*? parts behave greedily too, even though you declared them lazy. Also, the . matches a newline in Tcl regex, so your pattern ends up consuming the whole input in a single match.
So, based on your regex, you can make the \d+ lazy with \d+? and replace \n at the end with (?:\n|$) to match both the newline and the end of string:
set RE {Pattern\[\d+?\].*?Value.*?(?:\n|$)}
set updateList [regexp -all -inline $RE $str]
Alternative 1
Also, you can use a more verbose regex if your input string always has the same structure with all elements - Pattern, Key, Value - present:
set updateList [regexp -all -inline {Pattern\[\d+\]:\s*Key[^\n]*\s*Value[^\n]*} $str]
Since a . can match a newline, we need to use a [^\n] negated character class matching any character but a line feed.
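The difference between . and [^\n] can be checked on a two-line string. This is a minimal sketch with a made-up input and illustrative variable names.

```tcl
set s "Key : 1\nValue : 2"

# By default . also matches the newline, so the match can span both lines.
set spansLines [regexp {Key.*Value} $s]

# With -line, . no longer matches the newline, so the same expression fails.
set withLineOption [regexp -line {Key.*Value} $s]

# [^\n]* stops at the end of the line regardless of the matching mode.
set firstLine [lindex [regexp -inline {Key[^\n]*} $s] 0]

puts "spansLines=$spansLines withLineOption=$withLineOption firstLine=$firstLine"
```

The negated class gives per-line behavior for one part of the expression without switching the whole regexp to newline-sensitive matching.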
Alternative 2
You can use an unrolled lazy subpattern matching Pattern[n]: and then any character that is not a starting point for a Pattern[n]: sequence:
set RE {Pattern\[\d+\]:[^P]*(?:P(?!attern\[\d+\]).)*}
set updateList [regexp -all -inline $RE $str]
Try this
Pattern\[\d+\](.|\n)*?Value.*?\n
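In many regex flavors the dot . matches any character except a line break, which is why \n is added explicitly here; in Tcl, the dot already matches a newline unless newline-sensitive matching is enabled. Be aware that your lines may end with a carriage return, so you might need to allow \r as well.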
% set list { Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
}
% regexp -all -inline {Pattern\[\d+\].*?Value.*?\n} $list
{Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
}
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?\n} $list ;# only changing `\d+` to `\d+?`
{Pattern[1]:
Key : "key1"
Value : 100
} {Pattern[2]:
Key : "key2"
Value : 20
} {Pattern[3]:
Key : "key3"
Value : 30
} {Pattern[4]:
Key : "key4"
Value : 220
}
If $list does not end with a newline, you won't get the "Pattern[4]" element returned. In that case, change
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?\n} $list
to
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?(?:\n|$)} $list
You want to capture blocks of lines and output them with blank lines in between. Your example data displays patterns on different levels that can be used to recognize which lines belong to which block.
The simplest pattern is this: every three lines in the input make up a block. This pattern suggests processing like this:
set lines [split [string trim $list \n] \n]
foreach {a b c} $lines {puts $a\n$b\n$c\n\n}
There is nothing in your example data that suggests that this wouldn't work. Still, there may be some complications that aren't reflected in your example data.
If there are stray blank lines in the input, you might need to get rid of them first:
set lines [lmap line $lines {if {[string is space $line]} continue else {set line}}]
If some blocks contain fewer or more lines than in your example, another simple pattern is that every block starts with a line that has optional(?) whitespace and the word Pattern. Those lines (except the first) should be preceded by a block-delimiter in the output:
set lines [split [string trim $list \n] \n]
puts [lindex $lines 0]
foreach line [lrange $lines 1 end] {
if {[regexp {\s*Pattern} $line]} {
puts \n$line
} else {
puts $line
}
}
puts \n
If the lines don't actually begin with whitespace, you could use string match Pattern* $line instead of the regular expression.
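To collect the blocks into a list instead of printing them, the same "a Pattern line starts a new block" rule can be sketched as below. The proc name splitPatternBlocks and the shortened sample are made up for the illustration.

```tcl
set text {Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
}

# Hypothetical helper: starts a new block at every line that begins with "Pattern[".
proc splitPatternBlocks {text} {
    set blocks {}
    set current {}
    foreach line [split [string trim $text \n] \n] {
        # string match uses glob patterns, so the literal [ must be escaped.
        if {[string match {Pattern\[*} [string trimleft $line]] && [llength $current] > 0} {
            lappend blocks [join $current \n]
            set current {}
        }
        lappend current $line
    }
    if {[llength $current] > 0} {
        lappend blocks [join $current \n]
    }
    return $blocks
}

set blocks [splitPatternBlocks $text]
puts [llength $blocks]
puts [join $blocks \n\n]
```

Joining the blocks with two newlines reproduces the blank-line-separated output without relying on every block having exactly three lines.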
Documentation: continue, foreach, if, lindex, lmap, lmap replacement, lrange, puts, regexp, set, split, string

Regular expression in TCL

I have to parse this format using regexp in TCL.
Here is the format
wl -i eth1 country
Q1 (Q1/27) Q1
I'm trying to use the word country as a keyword to parse the format 'Q1 (Q1/27) Q1'.
I can do it if it is in a same line as country using the following regexp command.
regexp {([^country]*)country(.*)} $line match test country_value
But how can i tackle the above case?
Firstly, the regular expression you are using isn't doing quite the right thing in the first place, because [^country] matches a set of characters that consists of everything except the letters in country (so it matches from the h in eth1 onwards only, given the need to have country afterwards).
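The character-set behavior is easy to see by running the original expression on the first line alone. This is a small sketch; the variable names anyPrefix and rest are illustrative.

```tcl
set line "wl -i eth1 country"

# [^country] is a set: any character other than c, o, u, n, t, r or y.
# The t in eth1 therefore blocks the match from starting any earlier than the h.
regexp {([^country]*)country(.*)} $line match test country_value
puts "test='$test'"

# A plain .* before the keyword captures the whole leading part instead.
regexp {^(.*)country(.*)$} $line -> anyPrefix rest
puts "anyPrefix='$anyPrefix'"
```

The first command leaves only h1 and a trailing space in test, while the second captures everything up to the keyword.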
By default, Tcl uses the whole string to match against and newlines are just ordinary characters. (There is an option to make them special by also specifying -line, but it's not on by default.) This means that if I use your whole string and feed it through regexp with your regular expression, it works (well, you probably want to string trim $country_value at some point). This means that your real problem is in presenting the right string to match against.
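A short sketch of that whole-string case, assuming the two lines have been read into one variable (here called output, a name chosen for the illustration):

```tcl
set output "wl -i eth1 country\nQ1 (Q1/27) Q1\n"

# Without -line, . matches newlines too, so (.*) runs on into the next line.
if {[regexp {([^country]*)country(.*)} $output match test country_value]} {
    # The capture still carries the surrounding newlines, so trim it.
    set country_value [string trim $country_value]
    puts "country value: $country_value"
}
```

After trimming, country_value holds Q1 (Q1/27) Q1.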
If you're presenting lines one at a time (read from a file, perhaps) and you want to use a match against one line to trigger processing in the next, you need some processing outside the regular expression match:
set found_country 0
while {[gets $channel line] >= 0} {
if {$found_country} {
# Process the data...
puts "post-country data is $line"
# Reset the flag
set found_country 0
} elseif {[regexp {(.*) country$} $line -> leading_bits]} {
# Process some leading data...
puts "pre-country data is $leading_bits"
# Set the flag to handle the next line specially
set found_country 1
}
}
If you want to skip blank lines completely, put an if {$line eq ""} continue before the if {$found_country} ....