Extract value using regex when space is missing - regex

I am trying to extract the text and the third column from the below output. My problem is that there is one line where the space is missing. Is it possible to extract that value in this case?
4086 process-working 841901 841901 1234 22
4297 procesor_stats_controller_fmm543182 543182 0 22
4028 ipv6_ma 3063025 3063025 -55 78
4280 tty-verifyd 694043 694043 0 22
My regex so far looks like this:
\d+\s+(\w+-?\w+)\s*\d+\s+\d+\s+(-?\d+)\s+\d+
Thank you
EDIT: it's actually a bug in the device, at least one space should be there, so I'll just let them fix it and then retry. Thank you for taking the time to answer this :)

In this case, I'd first split the line into fields:
foreach line $lines {
    set fields [regexp -inline -all {\S+} $line]
    if {[llength $fields] == 6} {
        puts [lindex $fields 2]
    } else {
        # extract the digits at the end of this field
        regexp {\d+$} [lindex $fields 1] value
        puts $value
    }
}
841901
543182
3063025
694043

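The problem is with the \w, which is equivalent to [a-zA-Z0-9_].
Because it also matches digits, it swallows the number when the space is missing.
Instead of \w, use [a-zA-Z_], and allow an optional minus sign in the fifth column. This regular expression should then work for names without digits:
\d+\s+([a-zA-Z_]+-?[a-zA-Z_]+)\s*(\d+)\s+\d+\s+-?\d+\s+\d+
Note that a name containing a digit, such as ipv6_ma, will not match this pattern.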
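As a rough illustration of the character-class idea, the sketch below wraps an anchored variant of the pattern in a helper. The name parseLine is hypothetical, and the pattern assumes that process names contain only letters, underscores, and at most one hyphen, so a name with a digit in it (ipv6_ma) is reported as unmatched rather than guessed at.

```tcl
# Hypothetical helper: returns {name thirdColumn}, or an empty list if the line does not match.
proc parseLine {line} {
    # Assumption: names are letters/underscores with an optional hyphenated part, no digits.
    set re {^\d+\s+([a-zA-Z_]+(?:-[a-zA-Z_]+)?)\s*(\d+)\s}
    if {[regexp $re $line -> name third]} {
        return [list $name $third]
    }
    return {}
}

foreach line {
    {4086 process-working 841901 841901 1234 22}
    {4297 procesor_stats_controller_fmm543182 543182 0 22}
    {4280 tty-verifyd 694043 694043 0 22}
} {
    puts [parseLine $line]
}
```

Because \s* allows zero spaces after the name, the line with the missing space is handled by the same pattern as the well-formed lines.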

Related

In Tcl how can I remove all zeroes to the left but the zeroes to the right should remain?

Folks! I ran into a problem that I can't solve by myself.
The numbers "08" and "09" cannot be read like the others (01, 02, 03, 04, etc.) and must be treated separately in Tcl.
I can't find a way to remove all the leading zeros (I say ALL because there is more than one on the same line) while leaving the zeros on the right intact.
It may sound simple to those who are already thoroughly familiar with Tcl/Tk, but I am just starting out. I have read a lot of material on the internet, including this: https://stackoverflow.com/questions/2110864/handling-numbers-with-leading-zeros-in-tcl#2111822
But nothing shows me how to eliminate all leading zeros in one sweep.
I need a result like this: 2:9:10
I need this to later manipulate the result with the expr [arithmetic expression] command.
In this example it just removes a single leading zero:
set time {02:09:10}
puts [regsub {^0*(.+)} $time {\1}]
# Return: 2:09:10
Can anyone help me with this? Thanks in advance.
The group (^|:) matches either the beginning of the string or a colon.
0+ matches one or more zeros. Replace with the group match \1,
otherwise the colons get lost. And of course, use -all to do all of
the matches in the target string.
% set z 02:09:10
02:09:10
% regsub -all {(^|:)0+} $z {\1} x
2
% puts $x
2:9:10
%
Edit: As Barmar points out, this will change :00 to an empty string.
A better regex might be:
regsub -all {(^|:)0} $z {\1} x
This will only remove a single leading 0.
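To make the difference between the two patterns concrete, here is a small sketch that runs both on a value with an all-zero field. The sample value 02:00:10 is an assumption chosen for illustration; it is not from the question.

```tcl
set z 02:00:10

# 0+ removes every zero after the start or a colon, so the "00" field disappears.
set greedy [regsub -all {(^|:)0+} $z {\1}]
puts $greedy   ;# 2::10

# A single 0 removes at most one zero per field, so "00" becomes "0".
set single [regsub -all {(^|:)0} $z {\1}]
puts $single   ;# 2:0:10
```

The second form only works for fields of at most two digits, which is the case for hh:mm:ss times.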
You're only matching the 0 at the beginning of the string, you need to match after each : as well.
puts [regsub -all {(^|:)0*([^:])} $time {\1\2}]
In general it is best to use scan $str %d to convert a decimal number with possible leading zeroes to its actual value.
But in your case this will also work (and seems simpler to me than the answers given earlier and doesn't rely on the separator being a colon):
regsub -all {0*(\d+)} $time {\1}
This will remove any number of leading zeroes, but doesn't trim 00 down to an empty string. Also trailing zeroes will not be affected.
regsub -all {0*(\d+)} {0003:000:1000} {\1} => 3:0:1000
The scan command is useful here to extract three decimal numbers out of that string:
% set time {02:09:10}
02:09:10
% scan $time {%d:%d:%d} h m s
3
% puts [list $h $m $s]
2 9 10
There are a few tricky edge cases here. Specifically, the string 02:09:10:1001:00 covers the key ones (including middle zeroes, only zeroes). We can use a single substitution command to do the work:
regsub -all {\m0+(?=\d)} $str {}
(This uses a word start anchor and lookahead constraint.)
However, I would be more inclined to use other tools for this sort of thing. For times, for example, parsing them is better done with scan:
set time "02:09:10"
scan $time "%d:%d:%d" h m s
Or, depending on what is going on, clock scan (which handles dates as well, making it more useful in some cases and less in others).
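Since the goal in the question is to feed the result to expr, a minimal sketch of the scan approach could look like the following. The helper name timeToSeconds is hypothetical, and the hh:mm:ss layout is assumed from the example.

```tcl
# Hypothetical helper: converts "hh:mm:ss" to a number of seconds.
proc timeToSeconds {time} {
    # %d always parses decimal, so fields like "08" and "09" are not treated as octal.
    if {[scan $time {%d:%d:%d} h m s] != 3} {
        error "expected hh:mm:ss, got \"$time\""
    }
    return [expr {$h * 3600 + $m * 60 + $s}]
}

puts [timeToSeconds 02:09:10]   ;# 7750
puts [timeToSeconds 08:09:00]   ;# 29340
```

With scan there is no need to rewrite the string at all; the variables h, m and s already hold plain integers.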

tcl regexp with paren-delimited groups

I have a group of strings that look like:
foo<xyz><123>
bar
pizza<oregano><tomato><mozzarella>
So it boils down to a prefix (foo, bar, pizza, ...) followed by any number of attribute names enclosed in angle brackets.
Both the prefix and the attributes may consist of any character except angle brackets (which are only used for delimiting the attribute names).
Neither the prefix nor the attribute names may be empty.
Now I would like to have a regex in my Tcl application that gives me both the prefix and all the attributes (it's OK if they keep their delimiting brackets, though in the end I have to split them up into a list).
The trivial approach ^(.+)(<.+>)*$ doesn't work because the first .+ is too greedy and swallows all the attribute names.
So I tried excluding the forbidden angle brackets ^(\[^<>\]+)(<.+>)*$ which works OK at first glance, but then I discovered that this would match fnork<<>><x<>>, violating the rule that attribute names must not contain any angle brackets (apart from the delimiting ones).
Third, I extended the forbidden characters to the attribute names ^(\[^<>\]+)(<\[^<>\]+>)*$, but now things are getting a bit shady: while the regex only matches valid strings (so neither the prefix nor the attribute names can contain brackets), I no longer get all the attribute names as match parts:
% regexp -all -inline "^(\[^<>\]+)(<\[^<>\]+>)*" "A<xyz><123>"
A<xyz><123> A <123>
For whatever reason the <xyz> is not returned!
Any idea how to fix this?
Side note:
The actual string I'm trying to parse uses square brackets and parentheses as delimiters, something like pizza[large](tomato)(olives)(cheese), where the [term] can appear 0 or 1 times, whereas the (term)s can appear 0 or more times.
But due to the nature of square brackets and parentheses, this requires a fair amount of quoting, which is probably too much of a distraction to be useful here.
In this case, the trick is to use a fairly simple RE and post-process the results:
% regexp -all -inline {^([^<>]+)((?:<[^<>]+>)*)$} foo<xyz><123>
foo<xyz><123> foo <xyz><123>
% regexp -all -inline {[^<>]+} <xyz><123>
xyz 123
You were almost there, but were struggling with using (<[^<>]+>)*, which won't work as that only captures the group one of the times it matches. (I wasn't aware that it captured the last match, but since I rarely want either first or last but rather all, I use a different approach.)
Putting that all together and assuming you've got one big multi-line string that has all the pieces you want to look at in it (e.g., because you've read it from a file) you get:
set str "foo<xyz><123>
bar
pizza<oregano><tomato><mozzarella>"
# Find the matching lines and do the first-level extract on them
foreach {- prefix attribs} [regexp -all -line -inline {^([^<>]+)((?:<[^<>]+>)*)$} $str] {
    # Split the attribute names
    set attributes [regexp -all -inline {[^<>]+} $attribs]
    # Show that we've matched them for real
    puts "prefix='$prefix', attributes=[join $attributes ,]"
}
Which produces this output:
prefix='foo', attributes=xyz,123
prefix='bar', attributes=
prefix='pizza', attributes=oregano,tomato,mozzarella
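For the square-bracket and parenthesis notation mentioned in the question's side note, the same two-pass idea might be sketched as follows. The helper name parseEntry is hypothetical. Keeping the patterns in braces avoids most of the quoting trouble, since Tcl performs no substitution inside braces.

```tcl
# Hypothetical helper: returns a list {prefix optional attributes}.
proc parseEntry {str} {
    # First pass: prefix, optional square-bracket term, then all parenthesised terms as one chunk.
    # The bracket expression excludes the four delimiter characters.
    set re {^([^][()]+)(?:\[([^][()]+)\])?((?:\([^][()]+\))*)$}
    if {![regexp $re $str -> prefix optional attribs]} {
        error "malformed entry: $str"
    }
    # Second pass: pull the individual attribute names out of the chunk.
    return [list $prefix $optional [regexp -all -inline {[^()]+} $attribs]]
}

puts [parseEntry {pizza[large](tomato)(olives)(cheese)}]
puts [parseEntry bar]
```

When the optional term is absent, regexp sets the corresponding variable to an empty string, so callers can test it with a simple string comparison.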
Let's tokenize this.
package require string::token
set lex {[[] LB []] RB [(] LP [)] RP [^][()]+ t}
set str {pizza[large](tomato)(olives)(cheese)}
% set tokens [::string::token text $lex $str]
{t 0 4} {LB 5 5} {t 6 10} {RB 11 11} {LP 12 12} {t 13 18} {RP 19 19} {LP 20 20} {t 21 26} {RP 27 27} {LP 28 28} {t 29 34} {RP 35 35}
Having tokenized, we can parse, or evaluate the tokens as statements in a little language:
% set terms [lassign $tokens prefix]
proc t {str beg end} {
    string range $str $beg $end
}
proc LB {str beg end} {
    return "Optional term is: "
}
proc RB args {
    return \n
}
proc LP {str beg end} {
    rename LP {}
    proc LP args {
        return ", "
    }
    return "Arguments are: "
}
proc RP args {}
% puts "Prefix is: [eval [linsert $prefix 1 $str]]"
Prefix is: pizza
% join [lmap term $terms {eval [linsert $term 1 $str]}] {}
Optional term is: large
Arguments are: tomato, olives, cheese
Documentation: eval, join, lassign, linsert, lmap (for Tcl 8.5), lmap, package, proc, puts, rename, return, set, string::token (package)
I might have misread the requirements, but given that you have already "encoded" all structural details in your ad hoc notation, why not have the Tcl list machinery do the work?
set str {foo(xyz)(123)
bar
pizza[large](oregano)(tomato)(mozzarella)}
foreach line [split $str \n] {
    set line [string map {"[" " " "]" " " ")(" " " "(" " {" ")" "} "} $line]
    set suffix [lassign $line prefix]
    lassign $suffix a b
    if {[llength $suffix] == 2} {
        set optional $a
        set attributes $b
    } else {
        set optional ""
        set attributes $a
    }
    puts "prefix='$prefix', optional='$optional', attributes='[join $attributes ,]'"
}
I apologise: strictly speaking, my answer does not address the regex question. And there is less wizardry than in the other replies ;)

Tcl regexp not returning all matches

I am reading a file, the content is as below:
Aug2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jack 25 128736372
Peter 26 987840392
--------------------------------------
Sep2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jared 21 874892032
Eric 24 847938427
--------------------------------------
So I want to extract the information between every pair of dashed lines and put it into a list. Assuming $data contains the file content, I am using the Tcl regexp below to extract the info:
regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data
As far as I know, the matched result is returned as a list containing the fullMatch and subMatch of each match.
I double-checked with the llength command; there is only one fullMatch and one subMatch.
llength $data
2
Why is there only 1 subMatch? There are supposed to be 5 matches, as below:
Aug2017:
--------------------------------------
Name Age Phone --> 1st Match
--------------------------------------
Jack 25 128736372
Peter 26 987840392 --> 2nd Match
--------------------------------------
Sep2017: --> 3rd Match
--------------------------------------
Name Age Phone --> 4th Match
--------------------------------------
Jared 21 874892032
Eric 24 847938427 --> 5th Match
--------------------------------------
So in this case, I am choosing the second list element (subMatch) with lindex.
lindex [regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data] 1
However the result I got is like this, seems like it is matching from the beginning and end of the content:
Name Age Phone
--------------------------------------
Jack 25 128736372
Peter 26 987840392
--------------------------------------
Sep2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jared 21 874892032
Eric 24 847938427
My impression was that regexp matches from the beginning and proceeds sequentially to the end of the string, so I am not sure why the Tcl regex behaves like this. Am I missing something?
** The main thing I want to achieve here is to extract the data between the dashed separators; the above data is just an example.
Expected result: a list containing all matches
{ {Name Age Phone} -->1st match
{Jack 25 128736372
Peter 26 987840392} -->2nd match
{Sep2017:} -->3rd match
{Name Age Phone} -->4th match
{Jared 21 874892032
Eric 24 847938427} -->5th match
}
UPDATE:
I have slightly changed my Tcl regex as below, to include the lookahead and the suggestion by @glenn:
regexp -all -inline -expanded -- {\s+?-{2,}\s+?(.*?)(?=\s+?-{2,}\s+?)} $data
The result I got (10 submatches):
{ {----------------------
Name Age Phone} -->1st match
{Name Age Phone} -->2nd match
{----------------------
Jack 25 128736372
Peter 26 987840392} -->3rd match
{Jack 25 128736372
Peter 26 987840392} -->4th match
{----------------------
Sep2017:} -->5th match
{Sep2017:} -->6th match
...
...
}
It is pretty close to the expected result, but I still want to figure out how to use regex to perfectly match the expected 5 submatches.
Regular expression matching is not a good tool for this kind of problem. You're much better off with some kind of line filter.
A regular expression-based filter, closely matched to your example lines:
set f [open data.txt]
while {[gets $f line] >= 0} {
    if {[regexp {:} $line]} continue
    if {![regexp {\d} $line]} continue
    puts $line
}
close $f
Rationale: only month name lines have colons, header lines and separators have no digits in them.
A filter that doesn't rely as much on regular expressions:
set f [open data.txt]
set skip 4
while {[gets $f line] >= 0} {
    if {$skip < 1} {
        if {[regexp {\-{2,}} $line]} {
            set skip 4
        } else {
            puts $line
        }
    } else {
        incr skip -1
    }
}
close $f
This code reads every line, skips four lines at the beginning of each month, and resets the skip to 4 when a line of dashes interrupts the data.
(Note: the expression \-{2,} makes it look like the dash is special in a regular expression and needs to be escaped for that reason. Actually, it's because if the dash is the first character in the expression, the regexp command tries to interpret it as a switch. regexp -- {-{2,}} ... would work too but looks even stranger, I think.)
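The note above about a leading dash can be checked with a short sketch. The sample line is made up for illustration; the three working forms are equivalent for this purpose.

```tcl
set line "----------"

# A pattern that starts with a dash is taken for a switch, so this call raises an error.
set failed [catch {regexp {-{2,}} $line} msg]
puts $failed

# Three ways around it: end the switches with --, escape the dash, or use a bracket expression.
set viaDashes  [regexp -- {-{2,}} $line]
set viaEscape  [regexp {\-{2,}} $line]
set viaBracket [regexp {[-]{2,}} $line]
puts [list $viaDashes $viaEscape $viaBracket]
```

Anchoring the pattern, as in {^-{2,}}, also avoids the problem, because the first character is then no longer a dash.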
ETA (see comment): to get data between separators (i.e. just filter out the separators), try this:
set f [open data.txt]
while {[gets $f line] >= 0} {
    if {![regexp {\-{2,}} $line]} {
        puts $line
    }
}
close $f
Or:
package require fileutil
::fileutil::foreachLine line data.txt {
    if {![regexp {\-{2,}} $line]} {
        puts $line
    }
}
This should also work:
regsub -all -line {^\s+-{2,}.*(\n|\Z)} $data {}
Enabling newline-sensitive matching, this matches and removes all lines consisting only of whitespace, dashes, optional non-newlines and either a newline character or the end-of-outer-string.
To collect a list of matches rather than just printing filtered lines:
set matches {}
set matchtext {}
::fileutil::foreachLine line data.txt {
    if {![regexp {\-{2,}} $line]} {
        append matchtext $line\n
    } else {
        lappend matches $matchtext
        set matchtext {}
    }
}
After running this, the variable matches contains a list whose items are contiguous lines between separators.
Another way to do the same thing:
::textutil::splitx $data {(?n)^\s+-{2,}.*(?:\n|\Z)}
(It also adds an empty element at the end of the list, which is easy enough to remove if it is a problem.)
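If the data is already in a variable and tcllib is not available, the same grouping can be sketched with plain split and foreach. The proc name blocksBetweenSeparators is hypothetical. It assumes that a separator is a line consisting only of two or more dashes plus optional whitespace, and it also returns the text before the first separator (Aug2017: in the example), which the question's expected result leaves out.

```tcl
# Hypothetical helper: returns a list of the text blocks found between dashed separator lines.
proc blocksBetweenSeparators {data} {
    set blocks {}
    set current {}
    foreach line [split $data \n] {
        if {[regexp {^\s*-{2,}\s*$} $line]} {
            # Separator: close the current block, if it has any lines.
            if {[llength $current] > 0} {
                lappend blocks [join $current \n]
            }
            set current {}
        } elseif {[string trim $line] ne ""} {
            lappend current [string trim $line]
        }
    }
    # Keep a trailing block that is not followed by a separator.
    if {[llength $current] > 0} {
        lappend blocks [join $current \n]
    }
    return $blocks
}

set data {Aug2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jack 25 128736372
Peter 26 987840392
--------------------------------------}
foreach block [blocksBetweenSeparators $data] {
    puts "{$block}"
}
```

Dropping the first element with lrange gives the list the question asks for when the month heading before the first separator is not wanted.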
Documentation: < (operator), >= (operator), append, close, continue, fileutil (package), gets, if, incr, lappend, open, package, puts, regexp, set, textutil (package), while, Syntax of Tcl regular expressions

Matching a regexp in TCL PERL

I have the following pattern:
Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
I want to separate each Pattern block. I am using Tcl, but the regexp I am using does not serve the purpose:
set updateList [regexp -all -inline {Pattern\[\d+\].*?Value.*?\n} $list]
Which regexp should I use to separate each pattern?
I need output like this:
Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
Your pattern Pattern\[\d+\].*?Value.*?\n contains mixed quantifiers, both greedy and lazy. Tcl does not handle mixed quantifier types the way PCRE (PHP, Perl), .NET, etc. do: subsequent quantifiers inherit the type of the first one found. So, since the + after \d is greedy, all the others (in .*?) are also treated as greedy, even though you declared them lazy. Also, . matches a newline in Tcl regex, so your pattern matches the whole input as a single block.
So, based on your regex, you can make the \d+ lazy with \d+? and replace \n at the end with (?:\n|$) to match both the newline and the end of string:
set RE {Pattern\[\d+?\].*?Value.*?(?:\n|$)}
set updateList [regexp -all -inline $RE $str]
Alternative 1
Also, you can use a more verbose regex if your input string always has the same structure with all elements - Pattern, Key, Value - present:
set updateList [regexp -all -inline {Pattern\[\d+\]:\s*Key[^\n]*\s*Value[^\n]*} $str]
Since a . can match a newline, we need to use a [^\n] negated character class matching any character but a line feed.
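A quick way to check the line-bounded pattern is to count what it returns. This is only a sketch; the two-block sample string is abbreviated from the question's data, and the indentation in it is an assumption.

```tcl
set str {Pattern[1]:
    Key : "key1"
    Value : 100
Pattern[2]:
    Key : "key2"
    Value : 20}

# All quantifiers here are greedy; the negated class keeps each piece on its own line.
set blocks [regexp -all -inline {Pattern\[\d+\]:\s*Key[^\n]*\s*Value[^\n]*} $str]
puts [llength $blocks]
foreach block $blocks {
    puts $block\n
}
```

Because every quantifier in this pattern is greedy, the question of mixed greedy and lazy quantifiers does not arise.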
Alternative 2
You can use an unrolled lazy subpattern matching Pattern[n]: and then any character that is not a starting point for a Pattern[n]: sequence:
set RE {Pattern\[\d+\]:[^P]*(?:P(?!attern\[\d+\]).)*}
set updateList [regexp -all -inline $RE $str]
Try this
Pattern\[\d+\](.|\n)*?Value.*?\n
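In many regex flavors the dot . does not match a line break, so (.|\n) makes that explicit (in Tcl, . already matches a newline unless newline-sensitive matching is enabled). Be aware that your lines may end with a carriage return character, so you might need to add \r as well.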
% set list { Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
}
% regexp -all -inline {Pattern\[\d+\].*?Value.*?\n} $list
{Pattern[1]:
Key : "key1"
Value : 100
Pattern[2]:
Key : "key2"
Value : 20
Pattern[3]:
Key : "key3"
Value : 30
Pattern[4]:
Key : "key4"
Value : 220
}
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?\n} $list ;# only changing `\d+` to `\d+?`
{Pattern[1]:
Key : "key1"
Value : 100
} {Pattern[2]:
Key : "key2"
Value : 20
} {Pattern[3]:
Key : "key3"
Value : 30
} {Pattern[4]:
Key : "key4"
Value : 220
}
If $list does not end with a newline, you won't get the "Pattern[4]" element returned. In that case, change
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?\n} $list
to
% regexp -all -inline {Pattern\[\d+?\].*?Value.*?(?:\n|$)} $list
You want to capture blocks of lines and output them with blank lines in between. Your example data displays patterns on different levels that can be used to recognize which lines belong to which block.
The simplest pattern is this: every three lines in the input make up a block. This pattern suggests processing like this:
set lines [split [string trim $list \n] \n]
foreach {a b c} $lines {puts $a\n$b\n$c\n\n}
There is nothing in your example data that suggests that this wouldn't work. Still, there may be some complications that aren't reflected in your example data.
If there are stray blank lines in the input, you might need to get rid of them first:
set lines [lmap line $lines {if {[string is space $line]} continue else {set line}}]
If some blocks contain fewer or more lines than in your example, another simple pattern is that every block starts with a line that has optional(?) whitespace and the word Pattern. Those lines (except the first) should be preceded by a block delimiter in the output:
set lines [split [string trim $list \n] \n]
puts [lindex $lines 0]
foreach line [lrange $lines 1 end] {
    if {[regexp {\s*Pattern} $line]} {
        puts \n$line
    } else {
        puts $line
    }
}
puts \n
If the lines don't actually begin with whitespace, you could use string match Pattern* $line instead of the regular expression.
Documentation: continue, foreach, if, lindex, lmap, lmap replacement, lrange, puts, regexp, set, split, string

Passing a variable to regexp when the variable may have brackets (TCL)

In my job, I deal a lot with entities whose names may contain square brackets. We mostly use tcl, so square brackets can sometimes cause havoc. I'm trying to do the following:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
echo [regexp "${pat}_affin.*" $aff]
However, this returns a 0 when I would expect a 1. I'm certain that when ${pat} is passed to the regexp engine, the brackets are interpreted as the bracket expression "[9]" instead of the literal "\[9\]".
How do I phrase the regexp so a pattern contains a variable when the variable itself may have special regexp characters?
EDIT: An easy way would be to just escape the brackets when setting $pat. However, the value for $pat is passed to me by a function so I cannot easily do that.
Just ruthlessly escape all non-word chars:
set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}
puts [regexp "${pat}_affin.*" $aff] ;# ==> 0
set escaped_pat [regsub -all {\W} $pat {\\&}]
puts $escaped_pat ;# ==> pair_shap_val\[9\]
puts [regexp "${escaped_pat}_affin.*" $aff] ;# ==> 1
A second thought: this doesn't really seem to require regular expression matching. It appears you just need to check that the pat string is contained in the aff string:
% expr {[string first $pat $aff] != -1}
1
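Both suggestions can be folded into small helpers. This is a sketch only: the proc names regexpQuote and startsWith are hypothetical, and the second one assumes that the real intent is a prefix test.

```tcl
# Hypothetical helper: backslash-escape every non-word character so it matches literally.
proc regexpQuote {str} {
    return [regsub -all {\W} $str {\\&}]
}

# Hypothetical helper: plain prefix test without any regular expression.
proc startsWith {prefix str} {
    return [expr {[string first $prefix $str] == 0}]
}

set pat {pair_shap_val[9]}
set aff {pair_shap_val[9]_affin_input}

puts [regexpQuote $pat]
puts [regexp "^[regexpQuote $pat]_affin" $aff]
puts [startsWith "${pat}_affin" $aff]
```

The string-based test avoids the quoting problem entirely, so it is worth preferring whenever the rest of the pattern contains no real regular expression syntax.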