Extract 1st line with specific pattern using regexp

I have a string
set text {show log
===============================================================================
Event Log
===============================================================================
Description : Default System Log
Log contents [size=500 next event=7 (not wrapped)]
6 2020/05/22 12:36:05.81 UTC CRITICAL: IOM #2001 Base IOM
"IOM:1>some text here routes "
5 2020/05/22 12:36:05.52 UTC CRITICAL: IOM #2001 Base IOM
"IOM:2>some other text routes "
4 2020/05/22 12:36:05.10 UTC MINOR: abc #2001 some text here also 222 def "
3 2020/05/22 12:36:05.09 UTC WARNING: abc #2011 some text here 111 ghj"
1 2020/05/22 12:35:47.60 UTC INDETERMINATE: ghe #2010 a,b, c="7" "
}
I want to extract the first line that starts with "IOM:" using regexp in Tcl, i.e.
IOM:1>some text here routes
But my implementation doesn't work. Can someone help?
regexp -nocase -lineanchor -- {^\s*(IOM:)\s*\s*(.*?)routes$} $line match tag value

You may use
regexp -nocase -- {(?n)^"IOM:.*} $text match
regexp -nocase -line -- {^"IOM:.*} $text match
See the Tcl demo
Details
(?n) - (same as the -line option) turns newline-sensitive mode ON so that . cannot match line breaks (see the Tcl regex docs: If newline-sensitive matching is specified, . and bracket expressions using ^ will never match the newline character (so that matches will never cross newlines unless the RE explicitly arranges it) and ^ and $ will match the empty string after and before a newline respectively, in addition to matching at beginning and end of string respectively)
^ - start of a line
"IOM: - "IOM: string
.* - the rest of the line to its end.

In addition to @Wiktor's great answer, you might want to iterate over the matches:
set re {^\s*"(IOM):(.*)routes.*$}
foreach {match tag value} [regexp -all -inline -nocase -line -- $re $text] {
puts [list $tag $value]
}
IOM {1>some text here }
IOM {2>some other text }
I see that you have a non-greedy part in your regex. The Tcl regex engine is a bit weird compared to other languages: the first quantifier in the regex sets the greediness for the whole regex.
set re {^\s*(IOM:)\s*\s*(.*?)routes$} ; # whole regex is greedy
set re {^\s*?(IOM:)\s*\s*(.*?)routes$} ; # whole regex is non-greedy
# .........^^

Related

Powershell: Can't Get RegEx to work on multiple lines

I am getting notes from a ticket that come in the form of:
[Employee ID]:
[First Name]: Test
[Last Name]: User
[Middle Initial]:
[Email]:
[Phone]:
[* Last 4 of SSN]: 1234
I've tried the following code to get the first name (in this example it would be 'Test'):
if ($incNotes -match '(^\[First Name\]:)(. * ?$)')
{
Write-Host $_.matches.groups[0].value
Write-Host $_.matches.groups[1].value
}
But I get nothing. Is there a way I can use just one long regex pattern to get the information I need? The information stays in the same format on every ticket that comes through.
How would I get the information after the [First Name]: and so on....
You can use
if ($incNotes -match '(?m)^\[First Name]: *(\S+)') {
Write-Host $matches[1]
}
See the regex demo. If you can have any kind of horizontal whitespace chars between : and the name, replace the space with [\p{Zs}\t], or some kind of [\s-[\r\n]].
Details:
(?m) - a RegexOptions.Multiline option that makes ^ match start of any line position, and $ match end of lines
^ - start of a line
\[First Name]: - a [First Name]: string
* - zero or more spaces
(\S+) - Capturing group 1: one or more non-whitespace chars (replace with \S.* or \S[^\n\r]* to match any text up to the end of the line).
Note that -match is a case insensitive regex matching operator, use -cmatch if you need a case sensitive behavior. Also, it only finds the first match and $matches[1] returns the Group 1 value.

Capture (remove) all double-quotes after colon

I'm trying to clean up a string. An example string:
{
"NodeID": "${NodeID}",
"EventID": "${EventID}"
}
I want to capture all double quotes which occur after the colon, so that the end string will be:
{
"NodeID": ${NodeID},
"EventID": ${EventID}
}
I know that it's JSON, and that technically it is a string in those positions, but they're macros that will be interpreted by a system which generates the actual JSON string and replaces the macros with data, so in my use case this text isn't JSON yet. I can deal with the text line-by-line to make it easier.
I'll be using the regex pattern in both PowerShell and Python.
The closest I've gotten so far have been: (?<=[^*:])("), and (?<=:)(.*)(?<!,)
This is working, but seems incredibly kludgy and inelegant:
$String = '{
"NodeID": "${NodeID}",
"EventID": "${EventID}"
}'
# The Regex to match the text after the colon
[regex]$Regex = '(?<=:)(.*)'
# Splitting each line of the string into an ArrayList element
[System.Collections.ArrayList]$StringArray = $String.Split([string[]][Environment]::NewLine, [StringSplitOptions]::None)
# Declaring an output string
$OutPutString = ''
# Loop through the ArrayList
$i = 1
foreach ($Row in $StringArray) {
# Split each element string at the RegEx match
$RowArray = $Row -split $Regex
[String]$RowString1 = $RowArray[0]
[String]$RowString2 = $RowArray[1]
# Reassemble the element string after replacing the double quotes in the 2nd half
$FullRowString = $RowString1 + $RowString2.Replace('"','')
# If this is the first line in the string, don't add a newline character in front
if ($i -gt 1) {
$NewLine = "`n"
}
# Reassemble the string
$OutPutString += $NewLine + $FullRowString
$i++
}
$OutPutString
Any better ideas?
👉ī¸ For the regex to be functional as expected, the regex-engine indicated by scripting/programming language is important to know.
Please always add this information as tags besides regex.
Here: powershell, python
Regex to match a JSON text-field and capture the raw-value
Tested on Python, see regex101 demo:
(?<=:\s\s)\"([^\"]*)\"
💡ī¸ Components
To explain the composition of the regex and its working in steps:
(?<=:\s\s): positive look behind ?<=: for 2 white-spaces \s\s
to neglect the field-name also enclosed in double-quotes
\" and \": matching double-quotes before and after the capture group
the unwanted enclosing of the field-value
([^\"]*): capture-group denoted by parentheses surround any non-double-quote character [^\"]*
the wanted raw field-value (string) without enclosing double-quotes
ℹī¸ Note:
The character-group [^\"] matches any non (^) double-quote \".
It will start matching at the leading double-quote and stop matching as soon as a double-quote is detected. So the final \" in the regex is optional: It is not required for matching/capturing, but will ensure that each matched field-value is correctly enclosed by double-quotes.
Result
Matching following input lines:
{
"NodeID": "${NodeID}",
"EventID": "${EventID}"
}
Will give the desired raw field-values in group 1 for each match:
e.g.
${NodeID} for the first match
${EventID} for the second match
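For the Python side of the question, here is a minimal sketch that removes the quotes instead of only capturing the value. It assumes each value sits on the same line as its colon and contains no embedded double quotes; the names pattern and cleaned are illustrative only. It uses a capture group for the colon and whitespace rather than the lookbehind above, because Python's re module only accepts fixed-width lookbehinds and the amount of whitespace after the colon may vary.

```python
import re

text = '''{
"NodeID": "${NodeID}",
"EventID": "${EventID}"
}'''

# Group 1 keeps the colon plus any whitespace after it;
# group 2 is the value without its enclosing double quotes.
pattern = re.compile(r'(:\s*)"([^"]*)"')

# Put back the colon/whitespace and the bare value, dropping the quotes.
cleaned = pattern.sub(r'\1\2', text)
print(cleaned)
```

Because the pattern requires a colon before the opening quote, the quoted field names at the start of each line are left untouched.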
📚ī¸ Working with JSON in PowerShell
For your context assumed as parsing JSON following related links may be useful:
Microsoft Scripting Blog: Working with JSON data in PowerShell
Related Question: PowerShell parsing JSON
PowerShell Explained: Powershell: The many ways to use regex

Tcl regexp not returning all matches

I am reading a file, the content is as below:
Aug2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jack 25 128736372
Peter 26 987840392
--------------------------------------
Sep2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jared 21 874892032
Eric 24 847938427
--------------------------------------
So I want to extract the information between each pair of dashed lines and put it into a list. Assuming $data contains the file content, I am using the Tcl regexp below to extract the info:
regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data
As far as I know, the returned result is stored as a list containing the fullMatch and subMatch.
I double-checked with the llength command, and there is only one fullMatch and one subMatch.
llength $data
2
Why is there only 1 subMatch? There are supposed to be 5 matches, like below:
Aug2017:
--------------------------------------
Name Age Phone --> 1st Match
--------------------------------------
Jack 25 128736372
Peter 26 987840392 --> 2nd Match
--------------------------------------
Sep2017: --> 3rd Match
--------------------------------------
Name Age Phone --> 4th Match
--------------------------------------
Jared 21 874892032
Eric 24 847938427 --> 5th Match
--------------------------------------
So in this case, I am choosing the second list element (subMatch) with lindex.
lindex [regexp -all -inline -- {\s+\-{2,}\s+(.*?)\s+\-{2,}\s+} $data] 1
However the result I got is like this, seems like it is matching from the beginning and end of the content:
Name Age Phone
--------------------------------------
Jack 25 128736372
Peter 26 987840392
--------------------------------------
Sep2017:
--------------------------------------
Name Age Phone
--------------------------------------
Jared 21 874892032
Eric 24 847938427
My impression was regexp should match from the beginning and match sequentially to the end of the string, not sure why tcl regex is behaving like this? Am I missing something?
** The main thing I want to achieve here is to extract data between the dashed separator, the above data is just an example.
Expected result: a list containing all matches
{ {Name Age Phone} -->1st match
{Jack 25 128736372
Peter 26 987840392} -->2nd match
{Sep2017:} -->3rd match
{Name Age Phone} -->4th match
{Jared 21 874892032
Eric 24 847938427} -->5th match
}
UPDATE:
I have slightly changed my Tcl regex as below, to include the lookahead and the suggestion by @glenn:
regexp -all -inline -expanded -- {\s+?-{2,}\s+?(.*?)(?=\s+?-{2,}\s+?)} $data
The result I got (10 submatches):
{ {----------------------
Name Age Phone} -->1st match
{Name Age Phone} -->2nd match
{----------------------
Jack 25 128736372
Peter 26 987840392} -->3rd match
{Jack 25 128736372
Peter 26 987840392} -->4th match
{----------------------
Sep2017:} -->5th match
{Sep2017:} -->6th match
...
...
}
It is pretty close to the expected result, but I still want to figure out how to use regex to perfectly match the expected 5 submatches.
Regular expression matching is not a good tool for this kind of problem. You're much better off with some kind of line filter.
A regular expression-based filter, closely matched to your example lines:
set f [open data.txt]
while {[gets $f line] >= 0} {
if {[regexp {:} $line]} continue
if {![regexp {\d} $line]} continue
puts $line
}
close $f
Rationale: only month name lines have colons, header lines and separators have no digits in them.
A filter that doesn't rely as much on regular expressions:
set f [open data.txt]
set skip 4
while {[gets $f line] >= 0} {
if {$skip < 1} {
if {[regexp {\-{2,}} $line]} {
set skip 4
} else {
puts $line
}
} else {
incr skip -1
}
}
close $f
This code reads every line, skips four lines at the beginning of each month, and resets the skip to 4 when a line of dashes interrupts the data.
(Note: the expression \-{2,} makes it look like the dash is special in a regular expression and needs to be escaped for that reason. Actually, it's because if the dash is the first character in the expression, the regexp command tries to interpret it as a switch. regexp -- {-{2,}} ... would work too but looks even stranger, I think.)
ETA (see comment): to get data between separators (i.e. just filter out the separators), try this:
set f [open data.txt]
while {[gets $f line] >= 0} {
if {![regexp {\-{2,}} $line]} {
puts $line
}
}
close $f
Or:
package require fileutil
::fileutil::foreachLine line data.txt {
if {![regexp {\-{2,}} $line]} {
puts $line
}
}
This should also work:
regsub -all -line {^\s+-{2,}.*(\n|\Z)} $data {}
Enabling newline-sensitive matching, this matches and removes all lines consisting only of whitespace, dashes, optional non-newlines and either a newline character or the end-of-outer-string.
To collect a list of matches rather than just printing filtered lines:
set matches {}
set matchtext {}
::fileutil::foreachLine line data.txt {
if {![regexp {\-{2,}} $line]} {
append matchtext $line\n
} else {
lappend matches $matchtext
set matchtext {}
}
}
After running this, the variable matches contains a list whose items are contiguous lines between separators.
Another way to do the same thing:
::textutil::splitx $data {(?n)^\s+-{2,}.*(?:\n|\Z)}
(It also adds an empty element at the end of the list, which is easy enough to remove if it is a problem.)
Documentation:
< (operator),
>= (operator),
append,
close,
continue,
fileutil (package),
gets,
if,
incr,
lappend,
open,
package,
puts,
regexp,
set,
textutil (package),
while,
Syntax of Tcl regular expressions

Regex Get a substring from a string nearest to the end

I'm trying to get a substring from a string using a powershell script and regex.
For example I'm trying to get a year that's part of a filename.
Example Filename "Expo.2000.Brazilian.Pavillon.after.Something.2016.SomeTextIDontNeed.jpg"
The problem is that the regex gives me "2000" and no other matches. I need to get "2016" matched. Sadly, $matches only has one matched instance. Have I missed something? I feel like I'm going nuts ;)
If $matches would contain all instances found I could handle getting the nearest to end instance with:
$Year = $matches[$matches.Count-1]
Powershell Code:
# Function to get the images year and clean up image information after it.
Function Remove-String-Behind-Year
{
param
(
[string]$OriginalFileName # Provide the BaseName of the image file.
)
[Regex]$RegExYear = [Regex]"(?<=\.)\d{4}(?=\.|$)" # Regex to match a four-digit string, preceded by a dot and followed by a dot or the end of the string.
$OriginalFileName -match $RegExYear # Matches the original filename against the regex
Write-Host "Count: " $matches.Count # Why do I only get 1 result?
Write-Host "BLA: " $matches[0] # First and only match is "2000"
}
Wanted Result Table:
"x.2000.y.2016.z" => "2016" (Does not work)
"x.y.2016" => "2016" (Works)
"x.y.2016.z" => "2016" (Works)
"x.y.20164.z" => "" (Works)
"x.y.201.z" => "" (Works)
PowerShell's -match operator only ever finds (at most) one match (although multiple substrings of that one match may be found with capture groups).
However, using the fact that quantifier * is greedy (by default), we can still use that one match to find the last match in the input:
-match '^.*\.(\d{4})\b' finds the longest prefix of the input that ends in a 4-digit sequence preceded by a literal . and followed by a word boundary, so that $matches[1] then contains the last occurrence of such a 4-digit sequence.
Function Extract-Year
{
param
(
[string] $OriginalFileName # Provide the BaseName of the image file.
)
if ($OriginalFileName -match '^.*\.(\d{4})\b') {
$matches[1] # output last 4-digit sequence found
} else {
'' # output empty string to indicate that no 4-digit sequence was found.
}
}
'x.2000.y.2016.z', 'x.y.2016', 'x.y.2016.z', 'x.y.20164.z', 'x.y.201.z' |
% { Extract-Year $_ }
yields
2016
2016
2016
# empty line
# empty line
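The greedy-prefix trick is not specific to PowerShell. As a rough cross-check, here is the same pattern sketched in Python, which is used only for illustration; extract_year is a hypothetical helper name. The leading .* first consumes as much as possible and then backtracks, so the capture group ends up on the last dot-prefixed four-digit run that is followed by a word boundary.

```python
import re

def extract_year(name: str) -> str:
    """Return the last 4-digit run preceded by '.' and followed by a word boundary, else ''."""
    # Greedy .* pushes the match as far right as possible before backtracking.
    m = re.match(r'^.*\.(\d{4})\b', name)
    return m.group(1) if m else ''

names = ['x.2000.y.2016.z', 'x.y.2016', 'x.y.2016.z', 'x.y.20164.z', 'x.y.201.z']
years = [extract_year(n) for n in names]
print(years)
```

The last two inputs produce empty strings, matching the wanted result table: 20164 fails the word-boundary check after four digits, and 201 is too short.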

How do I extract all matches with a Tcl regex?

Hi everybody, I need help with this regular expression. My problem is to extract all the hex numbers of the form H'xxxx. I used this regexp, but I only get one number instead of all the hex values. How can I get all the hex numbers from this string?
set hex "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"
set res [regexp -all {H'([0-9A-Z]+)&} $hex match hexValues]
puts "$res H$hexValues"
The output I get is 5 H4D52
On -all -inline
From the documentation:
-all : Causes the regular expression to be matched as many times as possible in the string, returning the total number of matches found. If this is specified with match variables, they will contain information for the last match only.
-inline : Causes the command to return, as a list, the data that would otherwise be placed in match variables. When using -inline, match variables may not be specified. If used with -all, the list will be concatenated at each iteration, such that a flat list is always returned. For each match iteration, the command will append the overall match data, plus one element for each subexpression in the regular expression.
Thus to return all matches --including captures by groups-- as a flat list in Tcl, you can write:
set matchTuples [regexp -all -inline $pattern $text]
If the pattern has groups 0…N-1, then each match is an N-tuple in the list. Thus the number of actual matches is the length of this list divided by N. You can then use foreach with N variables to iterate over each tuple of the list.
If N = 2 for example, you have:
set numMatches [expr {[llength $matchTuples] / 2}]
foreach {group0 group1} $matchTuples {
...
}
References
regular-expressions.info/Tcl
Sample code
Here's a solution for this specific problem, annotated with output as comments (see also on ideone.com):
set text "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"
set pattern {H'([0-9A-F]{4})}
set matchTuples [regexp -all -inline $pattern $text]
puts $matchTuples
# H'22EF 22EF H'2354 2354 H'4BD4 4BD4 H'4C4B 4C4B H'4D52 4D52 H'4DC9 4DC9
# \_________/ \_________/ \_________/ \_________/ \_________/ \_________/
# 1st match 2nd match 3rd match 4th match 5th match 6th match
puts [llength $matchTuples]
# 12
set numMatches [expr {[llength $matchTuples] / 2}]
puts $numMatches
# 6
foreach {whole hex} $matchTuples {
puts $hex
}
# 22EF
# 2354
# 4BD4
# 4C4B
# 4D52
# 4DC9
On the pattern
Note that I've changed the pattern slightly:
Instead of [0-9A-Z]+, e.g. [0-9A-F]{4} is more specific for matching exactly 4 hexadecimal digits
If you insist on matching the &, then the last hex string (H'4DC9 in your input) cannot be matched
This explains why you get 4D52 in the original script, because that's the last match with &
Maybe get rid of the &, or use (&|$) instead, i.e. a & or the end of the string $.
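The effect of the trailing & can be checked in any regex engine. Below is a small sketch in Python, used only for illustration since the Tcl commands above are the actual answer; the variable names with_amp and without_amp are made up for this example. It runs both patterns against the input string from the question.

```python
import re

text = "V5CCH,IA=H'22EF&H'2354&H'4BD4&H'4C4B&H'4D52&H'4DC9"

# Requiring '&' after the digits misses the final value, which has no '&' after it.
with_amp = re.findall(r"H'([0-9A-F]+)&", text)

# Dropping the '&' and asking for exactly four hex digits finds every value.
without_amp = re.findall(r"H'([0-9A-F]{4})", text)

print(len(with_amp), with_amp[-1])
print(len(without_amp), without_amp[-1])
```

The first pattern stops at 4D52, which is the value the original script printed; the second also picks up 4DC9.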
References
regular-expressions.info/Finite Repetition, Anchors
I'm not Tclish, but I think you need to use both the -inline and -all options:
regexp -all -inline {H'([0-9A-Z]+)&} $string
EDIT: Here it is again, this time with a corrected regex (see the comments):
regexp -all -inline {H'[0-9A-F]+&} $string