I have just encountered this weird behavior in PowerShell and I'm wondering if there is any logical explanation for it:
After running a regex match on a string:
(Yes, I know that this might not be the best way to do it, but the issue occurred while building a pipeline, and here I only present a stripped-down minimal example that still exhibits the behavior.)
$r = "asdf" | Select-String "(?<test>\w+)"
The following two expressions print the same results for me:
$r.Matches.Groups
$r.Matches[0].Groups
But out of these two only the second one works:
$r.Matches.Groups['test']
$r.Matches[0].Groups['test']
The weirdest thing is that if I use numeric indexes, it works in both cases.
$r.Matches.Groups[0]
$r.Matches[0].Groups[0]
Edit: I know that in this example the capture group is not necessary at all, but I only wanted to show a simple example that illustrates a problem. Originally I'm working with multiple patterns with multiple capture groups, that I would like to access by name. I know that I could solve it by just using Matches[0], but I'm interested in an explanation.
This is because of a PowerShell feature called member-access enumeration (often just called member enumeration).
Since PowerShell 3.0, whenever you reference a member that doesn't exist on a collection type, PowerShell will enumerate the collection and access that member on each item.
That means that this expression:
$g = $r.Matches.Groups
... is basically the same as:
$g = foreach($match in $r.Matches){
    foreach($group in $match.Groups){
        $group
    }
}
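So, at this point, $g is no longer a GroupCollection; it's just an array of the values that were in any group from any match in $r.Matches. Lookup by name is something the GroupCollection indexer provides, and a plain array can only be indexed by position, which is why ['test'] stops working.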
This also explains why the [0] index expression works - regular arrays can be indexed into just fine.
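The effect can be sketched outside PowerShell as well. The following is only a rough Python analogy, not PowerShell behavior: it flattens the groups of every match into one plain list, which keeps positional access but drops lookup by group name. The variable names are made up for the illustration.

```python
import re

# Rough analogy: one regex match with a named group, as in the question.
matches = list(re.finditer(r"(?P<test>\w+)", "asdf"))

# Analogue of $r.Matches[0].Groups['test']: the match object knows its names.
by_name = matches[0].group("test")

# Analogue of $r.Matches.Groups: one flat list built from every match.
flat = [g for m in matches for g in (m.group(0), *m.groups())]

# Positional indexing still works on the flat list.
by_position = flat[0]

# A plain list has no string indexer, so the name lookup fails.
try:
    flat["test"]
    name_lookup_failed = False
except TypeError:
    name_lookup_failed = True
```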
Related
To reduce the size of my simulation output files, I want to give variable name exceptions instead of a list of many specific variables to the simulationsOptions/outputFilter (cf. OpenModelica Users Guide / Output) of my model. I found the regexp operator "^", which seemed to fulfill my needs, but it didn't work as expected. So I think that something is wrong with the interpretation of connected character strings when negated.
Example:
When I have any derivatives der(...) in my model and use variableFilter=der.* the output file will contain all the filtered derivatives. Since there are no other variables beginning with the character d, the same happens with variableFilter=d.*. For testing I also tried variableFilter=rde.* to confirm that every variable is filtered.
When I now try to exclude them with variableFilter=^der.*, =^rde.* or =^d.*, I get exactly the same result as without using ^. So the operator seems to be ignored in this notation.
When I instead use variableFilter=[^der].*, =[^rde].* or even =[^d].*, all the derivative variables are filtered from the output, but there is no difference between those three expressions. To me it seems that every character is interpreted on its own and not as part of a connected string.
Did I understand and use the regexp usage right or could this be a code bug?
Side/follow-up question: Where can I officially report this for software revision?
OpenModelica v.1.19.2 (64-bit)
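The behavior described in the question is what standard regex semantics would produce: outside brackets, ^ anchors the match at the start of the string, while [^...] is a negated set that matches one single character. A minimal sketch in Python follows; the variable names in the list are hypothetical, and it is an assumption here that the filter has to match the whole variable name and that OpenModelica's engine follows the same rules.

```python
import re

# Hypothetical variable names; "energy" is included to expose the difference.
names = ["der(x)", "der(y)", "x", "y", "energy"]

def keep(pattern):
    # Assumption: a variable is kept only if the whole name matches.
    return [n for n in names if re.fullmatch(pattern, n)]

anchored = keep(r"^der.*")    # ^ outside brackets only anchors at the start
set_der = keep(r"[^der].*")   # first character is none of d, e, r
set_rde = keep(r"[^rde].*")   # same set, the order inside [] is irrelevant
set_d = keep(r"[^d].*")       # first character is not d
```

With these names, [^der].* and [^rde].* also drop "energy", while [^d].* keeps it. In a model where no other variable starts with d, e or r, the three expressions are indistinguishable, as observed in the question.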
I was handed some very badly written VB.NET code today and asked to migrate it to use ODP.NET. To shortcut this a little, I used Find/Replace to set BindByName = True on all of the command variables. Based on the first few code files, though, I thought all of these were named "cmd". Unfortunately, they aren't; the original author of the code actually named all of their commands after their purpose, even though they only used one OracleCommand per function. They also decided that Using was apparently not worth doing, either.
Dim cmGetStatus As New OracleCommand
cmGetStatus.CommandType = CommandType.StoredProcedure
cmd.BindByName = True ' <-- this was added by my previous regex replace
What regex could I use to grab all instances of "Dim ____ as New OracleCommand" and replace the variable name with "cmd"? What about the same sort of replacement on all instances of "_____.CommandType"? This would save me at least 8 hours of manual edits.
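Search: (Dim )\w+( As New OracleCommand)
Replace: $1cmd$2
Search: \w+(\.CommandType = CommandType\.StoredProcedure)
Replace: cmd$1
Group replacements are written as $1, $2, etc. The second pattern keeps .CommandType inside the captured group; capturing only " = CommandType.StoredProcedure" after a leading .* would replace the property name as well and leave "cmd = CommandType.StoredProcedure".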
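To check the two replacements before running them over a whole code base, they can be tried on the snippet from the question. This is a sketch in Python rather than in the editor's Find/Replace dialog, so group references are written \g<1> instead of $1; it uses \w+ for the identifier and keeps .CommandType inside the captured group.

```python
import re

# The three lines from the question, after the earlier BindByName replace.
vb_source = (
    "Dim cmGetStatus As New OracleCommand\n"
    "cmGetStatus.CommandType = CommandType.StoredProcedure\n"
    "cmd.BindByName = True\n"
)

# Rename the declared variable; \w+ matches only the identifier.
step1 = re.sub(
    r"(Dim )\w+( As New OracleCommand)",
    r"\g<1>cmd\g<2>",
    vb_source,
)

# Rename the object whose CommandType is set, keeping the property name.
step2 = re.sub(
    r"\w+(\.CommandType = CommandType\.StoredProcedure)",
    r"cmd\g<1>",
    step1,
)
```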
set ip 10.10.
if {[regexp {^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.?){4}$} $ip match]} { puts $match }
The above pattern matches 10.10. Can anyone tell me how this is happening?
First, using a regular expression to check IP addresses is extremely fragile and unnecessarily complex, and you still have to do the heavy lifting yourself. Instead, use the Tcllib ip package.
package require ip
If you want to know if a given string is an IPv4 address, just check with
::ip::is 4 $str ;# 1 if valid ipv4, 0 otherwise
or
::ip::version $str ;# returns 4 or 6 for ipv4 or ipv6, -1 otherwise
The commands in the package also handle address strings that aren't dotted decimal.
The package isn't included in all distributions, but can be installed using teacup install or by downloading the files and sourcing them into the script.
To answer the question: the original asker has one error and one problem. The error is that the regular expression used to match the IP address also matches strings that aren't IP addresses. This is one of the most common problems when using regular expressions. The reason and the fix are addressed in other answers to the question. To recap: Captain noted that since the original regular expression makes the dot optional, the string 10.10. can be matched as 1 0. 1 0.. There are several possible solutions: {^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.|$)){4}$} as suggested by the same Captain rejects 10.10. but still has problems: it accepts a trailing dot, as in 1.2.3.4., because the fourth group may end in a dot right before the end of the string.
The main problem is that a non-trivial regular expression is used to match the address. For all but the most trivial regular expressions, rigorous testing must be performed to ensure that they don't produce false positives. This testing is usually impractical to make exhaustive, which means that you can't know for sure if it works until an angry customer tells you it doesn't. When a case of false positive match is found, the solution is either to drop the regular expression and try another method, or alternatively to make the regular expression more complex in order to make the match more strict. At this point, the test suite may also have to grow.
A better way is to step back and look for other solutions. If there is a standard library function for it, that should be used. If we imagine there is none in this case, simply reflecting on the most basic formulation of an ipv4 decimal-dot address ("four groups of integers from 0 to 255, joined by dots") suggests some simple and safe functions:
proc isOctet n {
    # true only for an integer string in the range 0..255
    expr {[string is integer -strict $n] && 0 <= $n && $n <= 255}
}
proc splitIpv4dd1 str {
    split $str .
}
proc splitIpv4dd2 str {
    scan $str %d.%d.%d.%d
}
proc splitIpv4dd3 str {
    lrange [regexp -inline {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} $str] 1 end
}
# plug any of the preceding splitIpv4ddN functions into this command
proc putsIpv4dd str {
    set parts [splitIpv4dd1 $str]
    # exactly four parts are required; counting valid octets alone would
    # also accept strings such as 1.2.3.4. or 1.2.3.4.x
    if {[llength $parts] != 4} return
    foreach n $parts {
        if {![isOctet $n]} return
    }
    puts $str
}
It is much easier to verify that each of these functions does its job correctly without false negatives or positives, and if they do, the command to print ip addresses can be assumed to work correctly. The third splitting function uses a regular expression, but in this case it's a trivial one without alternatives and optional atoms.
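The same split-then-check idea carries over to other languages. Below is a minimal sketch in Python; the function names are made up for the illustration, and it deliberately accepts leading zeros such as 01, which a stricter validator might reject.

```python
def is_octet(text):
    # ASCII digits only, then a numeric range check.
    return text.isascii() and text.isdigit() and 0 <= int(text) <= 255

def is_ipv4_dotted_decimal(address):
    # Exactly four dot-separated parts, each of which must be an octet.
    parts = address.split(".")
    return len(parts) == 4 and all(is_octet(p) for p in parts)
```

Each function is small enough to test on its own, for example is_ipv4_dotted_decimal("10.10.") is False because the split yields only three parts, one of them empty.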
One important goal when writing robust and maintainable code is to keep functions cohesive and clear-cut without loopholes or irregularities. Matching with non-trivial regular expressions runs counter to this.
I certainly understand and actually applaud the wish to understand what went wrong, but the correct conclusion to draw from this is that regular expression matching isn't a good method to use in this case.
You can try to use this regex:
^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$
To answer "how this is happening": the dot is optional, so the four repetitions match 1, 0., 1, 0.
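To see which text each of the four repetitions consumes, the {4} can be unrolled into four separate capture groups. This is a sketch in Python's re module rather than Tcl; the octet alternation is the one from the question, and the names octet, unrolled and strict are only for this illustration.

```python
import re

# The octet alternation from the question, as a non-capturing group.
octet = r"(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"

# The original pattern with {4} written out, so every repetition is visible.
unrolled = rf"^({octet}\.?)({octet}\.?)({octet}\.?)({octet}\.?)$"
pieces = re.match(unrolled, "10.10.").groups()

# The stricter pattern: three octets each followed by a dot, then a last octet.
strict = rf"^(?:{octet}\.){{3}}{octet}$"
strict_rejects_short = re.match(strict, "10.10.") is None
strict_accepts_full = re.match(strict, "10.10.10.10") is not None
```

Here pieces is ('1', '0.', '1', '0.'): the single-digit alternative is tried first, and since the dot is optional nothing forces the engine to treat 10 as one octet.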
And the answer to the unasked question
The below expression will make the dot optional only if it is the end of the string (modified to ensure no trailing dot):
^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.(?=[0-9])|$)){4}$
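The difference the lookahead makes can be checked on a string with a trailing dot. This is a sketch in Python's re module rather than Tcl, comparing a variant that ends each group with (\.|$) against the lookahead version above; the variable names are only for this illustration.

```python
import re

octet = r"(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"

# Each group ends with a dot or with the end of the string.
with_alternation = rf"^(?:{octet}(?:\.|$)){{4}}$"

# Same, but a dot only counts when a digit follows it.
with_lookahead = rf"^(?:{octet}(?:\.(?=[0-9])|$)){{4}}$"

def accepts(pattern, text):
    return re.match(pattern, text) is not None

trailing_dot_alternation = accepts(with_alternation, "1.2.3.4.")
trailing_dot_lookahead = accepts(with_lookahead, "1.2.3.4.")
```

Both variants accept 1.2.3.4 and reject 10.10., but only the lookahead version rejects 1.2.3.4. with its trailing dot.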
Please remember that the original question was asking "how is this happening" - i.e. understanding the regular expression behaviour... NOTHING about how to change the regex or how this should be done...
So I'm completely new to the overwhelming world of Regex. Basically, I'm using the Gedit API to create a new custom language specification (derived from C#) for syntax-highlighting (for DM from Byond). In escaped characters in DM, you have to use [variable] as an escaping syntax, which is simple enough. However, it could also be nested, such as [array/list[index]] for instance. (It could be nested infinitely.) I've looked through the other questions, and when they ask about nested brackets they only mean exclusively nested, whereas in this case it could be either/or.
Several attempts I've tried:
\[.*\] produces the result "Test [Test[Test] Test]Test[Test] Test"
\[.*?\] produces the result "Test [Test[Test] Test]Test [Test] Test"
\[(?:.*)\] produces the result "Test [Test[Test] Test]Test[Test] Test"
\[(?:(?!\[|\]).)*\] produces the result "Test [Test[Test] Test]Test[Test] Test". This is derived from https://stackoverflow.com/a/9580978/2303154 but like mentioned above, that only matches if there are no brackets inside.
Obviously I've no real idea what I'm doing here in more complex matching, but at least I understand more of the basic operations from other sources.
From #Chaos7Theory:
Upon reading GtkSourceView's Specification Reference, I've figured out that it uses PCRE specifically. I then used that as a lead.
Digging into it and through trial-and-error, I got it to work with:
\[(([^\[\]]*|(?R))*)\]
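What the recursive (?R) call achieves is, in effect, keeping track of bracket depth. Python's built-in re module has no recursion, so the sketch below swaps in a different technique, a plain depth counter, to show which spans a balanced-bracket match is expected to cover. The function name is hypothetical, and unbalanced input is not handled the same way as by the regex (an unclosed [ simply yields no span here).

```python
def balanced_bracket_spans(text):
    """Return every outermost [...] span, including nested brackets."""
    spans = []
    depth = 0
    start = None
    for index, char in enumerate(text):
        if char == "[":
            if depth == 0:
                start = index      # remember where the outermost span begins
            depth += 1
        elif char == "]" and depth > 0:
            depth -= 1
            if depth == 0:         # the outermost bracket just closed
                spans.append(text[start:index + 1])
    return spans

spans = balanced_bracket_spans("Test [Test[Test] Test]Test[Test] Test")
```

For the test string from the question this yields two spans, [Test[Test] Test] and [Test].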
I hope this helps someone else in the future.
In a perl script, I need to replace several strings. At the moment, I use:
$fasta =~ s/\>[^_]+_([^\/]+)[^\n]+/\>$1/g;
The aim is to reformat every sequence name in a FASTA file. It works well in my case, so I don't need to touch this part. However, a sequence name sometimes appears several times in the file, and the output must not contain the same sequence name twice or more. I thus need to have, for instance:
seqName1
seqName2
etc.
(instead of seqName, seqName, etc.)
Is it possible to somehow process every occurrence differently, automatically? I don't know how many sequences there are, whether there are similar names, etc. One idea would be to append a random string to every occurrence, hence my question.
Many thanks.
John perfectly solved it and chepner helped with the smart idea to avoid conflicts, here is the final result:
my $i = 1;  # start at 1 so the first name becomes seqName1
$fasta =~ s/\>[^_]+_([^\/]+)[^\n]+/
    sub {
        return '>'.$1.$i++;
    }->();
/eg;
Many many thanks.
I was actually trying to do something like this the other day, here's what I came up with
$fasta =~ s/\>[^_]+_([^\/]+)[^\n]+/
    sub {
        # return random string
    }->();
/eg;
The /e modifier evaluates the replacement as code, not text. I use an anonymous code ref so that I can return at any point.
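The same idea, computing each replacement with code instead of a fixed string, exists in other regex libraries as a replacement callback. Below is a minimal sketch in Python using a counter to number the names; the FASTA content is made-up sample data, and the pattern is the one from the question without Perl's escaping.

```python
import re
from itertools import count

# Made-up sample data: two headers that reduce to the same sequence name.
fasta = ">abc_seqName/1 extra\nACGT\n>def_seqName/2 extra\nTTGA\n"

counter = count(1)  # yields 1, 2, 3, ...

def rename(match):
    # Called once per matched header; appends the next number to the name.
    return f">{match.group(1)}{next(counter)}"

renamed = re.sub(r">[^_]+_([^/]+)[^\n]+", rename, fasta)
```

Because the callback runs once per match, every header gets its own number even when the captured name is identical.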