Complex named match group RegEx review - regex

From this example string
$logLine = '{header[3]}_Pragmatic Praxis Initialization Log'
I am trying to extract three pieces of data
header as type
3 as an (optional) tab value
everything after that _ as a string
What I have now is
$logLine = '{header[3]}_Pragmatic Praxis Initialization Log'
if ($logLine -match '^\{(?<type>[a-z]+)(?:\[?(?<tab>\d?)\]?)\}_(?<string>.+)$') {
Write-Host "$($matches['type'])"
Write-Host "$($matches['tab'])"
Write-Host "$($matches['string'])"
}
And it's working well. But I am so unskilled in RegEx, and this is by far the most complex RegEx I have ever cobbled together from scratch, that I am wondering if anyone sees a gotcha in this approach that I am not seeing?
Or do I need to open some wine and celebrate reaching some sort of RegEx comprehension milestone?
EDIT:
So my success made me overconfident. I decided to make Tab required, but add an optional Target which can be either 'console' or 'file'. So I did this
$logLine = '{header[3]}_Pragmatic Praxis Initialization Log'
if ($logLine -match '^\{(?<type>[a-z]+)(?:-(?<target>(console|file)))\[(?<tab>\d*)\]\}_(?<string>.+)$') {
Write-Host "$($matches['type'])"
Write-Host "$($matches['target'])"
Write-Host "$($matches['tab'])"
Write-Host "$($matches['string'])"
}
Which works a treat when target is present, but fails when it is not. So, looks like I get to learn something, rather than celebrate. ;)
EDIT #2:
Per @Ansgar Wiechers, I was indeed misunderstanding (?:...), specifically confusing it with (...)?. Based on that, this is my revised pattern, which seems to be doing what I want. I may still make both target and tab required, since I think that makes the code more readable while also simplifying the RegEx pattern, but it's still good to have it working as I initially intended.
if ($logLine -match '^\{(?<type>[a-z]+)(-(?<target>(console|file)))?(\[(?<tab>\d+)\])?\}_(?<string>.+)') {
Write-Host "$($matches['type'])"
Write-Host "$($matches['target'])"
Write-Host "$($matches['tab'])"
Write-Host "$($matches['string'])"
}

Looks to me like you're misunderstanding what (?:...) does. That construct does not define an optional match, but a non-capturing group. A (sub)expression (?:-(?<target>console|file)) will require the string to contain either -console or -file and return console or file (without the leading hyphen) as a named match "target". To make the group optional you need to add another ? after the group.
^\{(?<type>[a-z]+)(?:-(?<target>console|file))?\[(?<tab>\d*)\]\}_(?<string>.+)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
Note that a trailing expression .+ or .* makes anchoring the expression at the end of the string ($) pointless, so just remove the $ from the end of your expression.
You also don't need the nested (unnamed) capturing group around console|file. The named capturing group is sufficient.
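To see the difference between a non-capturing group and an optional one, it can help to run inputs with and without a target through the corrected pattern. The following is a minimal sketch in Python rather than PowerShell, so named groups are spelled (?P<name>...) instead of (?<name>...); the helper name parse_log_line is hypothetical and only there for illustration.

```python
import re

# Same structure as the corrected pattern above; Python spells named groups (?P<name>...).
LOG_LINE = re.compile(
    r"^\{(?P<type>[a-z]+)(?:-(?P<target>console|file))?\[(?P<tab>\d*)\]\}_(?P<string>.+)"
)


def parse_log_line(line):
    """Return the named groups as a dict, or None when the line does not match."""
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None


# No target: the optional group is skipped and "target" comes back as None.
print(parse_log_line("{header[3]}_Pragmatic Praxis Initialization Log"))
# With a target: the hyphen is consumed by the group but not captured.
print(parse_log_line("{header-file[3]}_Pragmatic Praxis Initialization Log"))
# An unknown target makes the whole match fail.
print(parse_log_line("{header-disk[3]}_Pragmatic Praxis Initialization Log"))
```

Without the trailing ? on the non-capturing group, the first call would return None as well, which is the failure described in the first edit of the question.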

Related

Regex to capture an URL

I've extracted an URL from a website in this string form:
@{href=http://download.company.net/file.exe}[0]
I can't figure out a pattern to get this part out of it: http://download.company.net/file.exe so I can use it as the URL to download the file.
From my point of view the logic would be that I need to first match "http" as the beginning of a string, a wildcard in between, and then match "}", but not include it in the final output. So IDK ...[http]*\} (I know that this "syntax" of mine is totally wrong, but you get the idea)
The reason I don't want to include "exe" in the pattern is that the file extension could be "msi" and I want it to be more universal. Also, some good and comprehensive PS regex article would help me greatly (with inexperience in mind) - I really didn't find any "newbie friendly" or comprehensive enough to understand this topic.
You can use either [regex]::match or -replace.
In the following example, I capture everything after href= that is not a closing curly bracket }:
'@{href=http://download.company.net/file.exe}[0]' -replace '@{href=([^}]+).*', '$1'
Output:
http://download.company.net/file.exe
I'd use -cmatch or -imatch as
if ($content -imatch '(?<=href=).*(?=})') {
$result = $matches[0]
} else {
$result = ''
}
In case of test data, it will return
http://download.company.net/file.exe
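For comparison, here are both ideas from the answers above (a capture group with a negated character class, and lookarounds) as a minimal Python sketch. It only illustrates the regex logic and is not PowerShell code; the variable names are made up for the example.

```python
import re

raw = "@{href=http://download.company.net/file.exe}[0]"

# Approach 1: capture everything after "href=" up to (not including) the first "}".
m = re.search(r"href=([^}]+)", raw)
url_from_group = m.group(1) if m else ""

# Approach 2: lookarounds, so the whole match is already just the URL.
# The lazy .*? stops at the first "}" rather than running on to the last one.
m = re.search(r"(?<=href=).*?(?=\})", raw)
url_from_lookaround = m.group(0) if m else ""

print(url_from_group)
print(url_from_lookaround)
```

Neither pattern mentions the file extension, so an .msi link would be extracted the same way.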

URL regex group catching

Hello I'm trying to find a regex that would catch the terms in a url.
For example, given:
https://stackoverflow.com, it would catch "stackoverflow"
and given https://stackoverflow.com/questions/ask, it would catch "stackoverflow", "questions", "ask" and any potential terms in between the slash character after the domain name.
Up until now I managed to find the following regex but it cannot repeat catching groups
https?:\/\/(?:www\.)?([\da-z-]*)(?:[\.a-z]*)(?:\/([\da-z]*)\/?)+
Do you guys have any ways to resolve that issue?? that would be great.
I tested Michal M's answer and it appears not to handle "www.", so I updated it
/(?:\/(?:w{3}\.)?)\K([\w]+)/i
Edit: Since it's not important to match the "www.", I placed it inside a non-capturing group so it won't be captured. Btw, I also added the case-insensitive modifier so "WWW." would be okay too.
Try this one:
(?:(\/))\K(\w+)
tested in notepad++
You may try using two separate regexes -- one for the hostname part and another for the terms in the path part. Then combine them with alternation construction and do global search:
https?:\/\/(?:\w+\.)*(\w+)\.\w+ # this would capture hostname "term"
|
\/(\w+) # this would capture path "terms"
(Note: requires /x modifier.)
Demo: https://regex101.com/r/nA8jT9/2
Thanks I managed to rearrange it for it to work with the "www"
(?:\/(?:www\.)?)\K([\w\d]+)
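The two-regex alternation suggested above also works in engines without \K (Python's built-in re module does not support it). A minimal sketch, assuming Python and a hypothetical helper named url_terms:

```python
import re

# Two alternatives: group 1 captures the hostname term, group 2 a path term.
TERMS = re.compile(
    r"""
    https?://(?:\w+\.)*(\w+)\.\w+   # hostname: keep the label just before the TLD
    |
    /(\w+)                          # path: keep each segment after a slash
    """,
    re.VERBOSE | re.IGNORECASE,
)


def url_terms(url):
    """Collect whichever group matched at each position, in order."""
    return [host or path for host, path in TERMS.findall(url)]


print(url_terms("https://stackoverflow.com"))
print(url_terms("https://stackoverflow.com/questions/ask"))
print(url_terms("https://www.stackoverflow.com/questions/ask"))
```

Because (?:\w+\.)* absorbs any leading subdomains, "www." needs no special handling here.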

Regex -replace in Powershell

I am trying to read a .sln file and extract the strings that contain the path to the .csproj within my solution.
The lines that contain the information that I am looking for look like this:
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Project", "Project\Project.csproj", "{0DB516E6-4358-499D-BFBF-408F50A44E14}"
So, this is what I am trying:
$projectsInFile = Select-String "$slnFile" -pattern '^Project'
$csprojectsNames = $projectsInFile -replace ".+= `"(\S*) `""
Now, $csprojectsNames contains the information that I am looking for, but also the rest of the string.
Just like this:
Project\Project.csproj", "{0DB516E6-4358-499D-BFBF-408F50A44E14}"
What is the best way to retrieve the name of the .csproj file without needing to manually cut the rest of the string?
Thank you
What you can do is capture the entire string and use a capture group in your replacement string thereby dropping the unneeded parts.
$csprojectsNames = $projectsInFile -replace '.+= "(\S*) "(.*?)",.*"','$2'
The second capture group is the data in between the quotes that follow = "Project", ".....". Since it is the second capture group, we replace the entire match with that group, '$2'. Using single quotes ensures that PowerShell does not try to expand a variable.
Better approach
You might just be able to use [^"]*?\.csproj in Select-String directly without having to do a secondary parse. That will match everything before .csproj that is not a quote, so it won't gobble up too much.
You can use a group to capture the file path and then use the value of the group as the replacement value. The trailing .* consumes the rest of the line so that only the captured path remains. For instance:
$csprojectsNames = $projectsInFile -replace 'Project\(.*?\) = "Project", "(.*?)".*', '$1'
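The "match instead of replace" idea from the answer above can be checked outside PowerShell. This is a minimal Python sketch using the sample line from the question; it shows why excluding quotes keeps the match inside a single quoted field.

```python
import re

line = (
    'Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Project", '
    '"Project\\Project.csproj", "{0DB516E6-4358-499D-BFBF-408F50A44E14}"'
)

# [^"]* cannot cross a quote, so the match stays inside one quoted field.
CSPROJ = re.compile(r'[^"]*\.csproj')

m = CSPROJ.search(line)
csproj_path = m.group(0) if m else None
print(csproj_path)
```

Since the match itself is the path, there is nothing left over to trim afterwards.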

How can I change my regular expression to read UTF-8?

I got very far in a script I am working on only to find out it has a problem reading UTF-8 characters.
I have a contact in Sweden that made a VM on his machine with some UTF-8 in it and when my script hit that VM it lost its mind, but it was able to read all of the other VMs that are in the "normal" charset.
Anyhow, maybe my code will make more sense.
#!/usr/bin/perl
use strict;
use warnings;
#use utf8;
use Net::OpenSSH;
# Create a hash for storing the options needed by Net::OpenSSH
my %ssh_options = (
port => '22',
user => 'root',
password => 'password'
);
# Create a new Net::OpenSSH object
my $ssh = Net::OpenSSH->new('192.168.2.101', %ssh_options);
# Create an array and capture the ESX\ESXi output from the current server
my @getallvms = $ssh->capture('vim-cmd vmsvc/getallvms');
shift @getallvms;
# Process data gathered from server
foreach my $vm (@getallvms) {
# Match ID, NAME
$vm =~ m/^(?<id> \d+)\s+(?<name> .+?)\s+/xm;
my $id = "$+{id}";
my $name = "$+{name}";
print "$id\n";
print "$name\n";
print "\n";
}
I have narrowed the problem down to my regular expression, because here is the raw output from the server before the regular expression is applied.
416
TEST Box åäö!"''*#
And this is what I get after I apply my regular expression
416
TEST
For some reason the regular expression is not matching, I just don't know why. And the current regular expression in the example is the third attempt at getting it to work.
The FULL line that I am matching looks like this. The way my regular expression was done was because I only need the first two blocks of information, the expression you have wants to copy the entire line.
The code:
432 TEST Box åäö!"''*# [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/TEST Box +w6XDpMO2IQ _''_+Iw.vmx slesGuest vmx-04
The subpattern
(?<name> .+?)\s+
in your regular expression means “match and remember one or more non-newline characters, but stop as soon as you find whitespace,” so $name contains TEST because the pattern stopped matching when it saw the space just before Box.
The VI Toolkit wiki gives an example of the getallvms subcommand's output:
# vmware-vim-cmd -H 10.10.10.10 -U root -P password /vmsvc/getallvms
Vmid Name File Guest OS Version Annotation
64 bartPE [store] BartPE/BartPE.vmx winXPProGuest vmx-04
96 trustix [store] Trustix/Trustix.vmx otherLinuxGuest vmx-04
The case is slightly different from the example in your question, but it appears that we can look for [store] as a bumper for the match:
/^(?<id> \d+) \s+ (?<name> .+?) \s+ \[store]/mix
The non-greedy quantifier +? means match one or more of something, but hand control to the rest of the pattern as quickly as possible. Remember that [ has a special meaning in regular expressions, but the pattern \[ matches a literal [ rather than introducing a character class.
I think of this technique as bookending or tacking-and-stretching. If you want to extract a chunk of text that's difficult to characterize, look for surrounding features that are easy to match—often as simple as ^ or $. Then use a stretchy pattern to grab everything in between, usually (.+) or (.+?). Read the “Quantifiers” section of the perlre documentation for an explanation of your many options.
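The bookending idea can be tried in isolation. The sketch below is in Python rather than Perl (named groups are spelled (?P<name>...)), and the sample line is a simplified stand-in for the getallvms output, not the real thing.

```python
import re

# Simplified stand-in for one line of getallvms output.
line = "432  TEST Box åäö  [Store] TEST Box/TEST Box.vmx  slesGuest  vmx-04"

# Original shape: the lazy name stops at the first run of whitespace.
first_space = re.search(r"^(?P<id>\d+)\s+(?P<name>.+?)\s+", line)

# Bookended shape: the lazy name must stretch until a literal [store] follows.
BOOKENDED = re.compile(
    r"""
    ^(?P<id>\d+)      # left bookend: start of line, then the numeric id
    \s+
    (?P<name>.+?)     # stretchy middle: as little as possible...
    \s+\[store\]      # ...up to the right bookend, a literal [store]
    """,
    re.VERBOSE | re.IGNORECASE,
)
bookended = BOOKENDED.search(line)

if first_space:
    print(repr(first_space.group("name")))
if bookended:
    print(bookended.group("id"), repr(bookended.group("name")))
```

The first pattern yields only TEST, while the bookended one yields the full name including the non-ASCII characters.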
This fixes the immediate problem, and you can also add polish in a few areas.
Do not use $1, $2, and friends unconditionally! Always test that the pattern matches before using capture variables. For example
if (/(foo|bar|baz)/) {
print "got $1\n";
}
else {
print "no match\n";
}
An unprotected print $1 can produce surprising results that are tough to debug.
Judicious use of Perl's defaults can help emphasize the computation and lets the mechanism fade into the background. Dropping $vm in favor of $_ as the implicit loop variable and implicit match target makes for a nicer result.
Your comments merely translate from Perl to English. The most helpful comments explain the why, not the what. Also keep in mind Rob Pike's advice on commenting:
If your code needs a comment to be understood, it would be better to rewrite it so it's easier to understand.
In the assignments from %+, the quotes don't do anything useful. The values are already strings, so remove the quotes.
my $id = $+{id};
my $name = $+{name};
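The advice above about testing the match before touching capture variables carries over to other languages. As a point of comparison only, here is a minimal Python sketch (the function name describe is hypothetical); in Python a failed search returns None, so forgetting the check raises an error instead of silently reusing stale captures.

```python
import re


def describe(text):
    """Only touch the capture after confirming the search succeeded."""
    m = re.search(r"(foo|bar|baz)", text)
    if m is None:
        return "no match"
    return f"got {m.group(1)}"


print(describe("a bar of soap"))
print(describe("nothing here"))
```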
Below is a modified version of your code that captures everything after the number but before [store] into $name. The utf8 pragma declares that your source code—not, as with a common mistake, your input—contains UTF-8. The test below simulates with a canned echo the output from vim-cmd on the Swedish VM.
As Tom suggested, I use the Encode module to decode the output that arrives through the SSH connection and encode it for benefit of the local host before printing it out.
The perlunifaq documentation advises decoding external data into Perl's internal format and then encoding any output just before it's written. I assume that the value returned from $ssh->capture(...) uses UTF-8 encoding, that is, that the remote host is sending UTF-8. We see the expected result because I'm running a modern distribution of Linux and ssh-ing back to it, but in the wild, you may be dealing with some other encoding.
You're able to get away with skipping the calls to decode and encode because Perl's internal format happens to match those of the hosts you're using. In general, however, cutting corners can get you into trouble:
What if I don't decode?
What if I don't encode?
Finally, the code!
#! /usr/bin/env perl
use strict;
use utf8;
use warnings;
use Encode;
use Net::OpenSSH;
my %ssh_options = ();
my $ssh = Net::OpenSSH->new('localhost', %ssh_options);
# Create an array and capture the ESX\ESXi output from the current server
#my @getallvms = $ssh->capture('vim-cmd vmsvc/getallvms');
my @getallvms = $ssh->capture(<<EOEcho);
echo -e 'JUNK\n416 TEST Box åäö!"'\\'\\''*# [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/TEST Box +w6XDpMO2IQ _''_+Iw.vmx slesGuest vmx-04'
EOEcho
shift @getallvms;
for (@getallvms) {
$_ = decode "utf8", $_, Encode::FB_CROAK;
if (/^(?<id> \d+) \s+ (?<name> .+?) \s+ \[store]/mix) {
my $id = $+{id};
my $name = $+{name};
print encode("utf8", $id), "\n",
encode("utf8", $name), "\n",
"\n";
}
else {
print "no match\n";
}
}
Output:
416
TEST Box åäö!"''*#
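The decode-on-input, encode-on-output discipline described above is not specific to Perl. The following minimal Python sketch assumes, like the answer, that the remote side sends UTF-8; the byte string is a made-up stand-in for what an SSH capture might return.

```python
import re

# Stand-in for the undecoded bytes a remote command might return (assumed UTF-8).
raw = "416  TEST Box åäö  [Store] TEST Box/TEST Box.vmx".encode("utf-8")

# 1. Decode once, at the boundary; errors="strict" raises on malformed input.
text = raw.decode("utf-8", errors="strict")

# 2. Work on characters, not bytes.
m = re.match(r"(?P<id>\d+)\s+(?P<name>.+?)\s+\[store\]", text, re.IGNORECASE)
name = m.group("name") if m else ""

# 3. Encode again only when the data leaves the program.
encoded_name = name.encode("utf-8")

print(len(name), len(encoded_name))
```

The name is 12 characters long but occupies 15 bytes once encoded, which is the kind of mismatch that causes trouble when undecoded bytes are treated as text.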
If you know the string you work on is UTF-8 and Net::OpenSSH doesn't (and hence doesn't mark it as such), you can convert it to an internal representation Perl can work on with one of:
use Encode;
utf8::decode( $in_place ); # decodes the scalar in place; returns false on malformed input
$decoded = decode_utf8( $raw ); # returns a decoded copy
So you have to make sure that Perl understands those names as UTF-8 encoded strings. So far I don't think it does. A comprehensive overview about UTF-8 in Perl.
You can test whether your strings are flagged as Unicode with Encode::is_utf8 and decode them with Encode::decode('UTF-8', $your_string).
UTF-8 is pretty messy still in Perl, IMHO. You must be pretty patient with it.
To print UTF-8 strings properly, you should use something like this in your script:
BEGIN {
binmode(STDOUT, ':encoding(UTF-8)');
binmode(STDERR, ':encoding(UTF-8)'); # Error messages
}
Once you get Perl to understand your UTF-8 names, you can match them with regular expressions properly too.
Recent Net::OpenSSH releases have native support for charset encoding/decoding in capture methods:
my @getallvms = $ssh->capture({stream_encoding => 'utf8'},
'vim-cmd vmsvc/getallvms');

Except URL regex

Sigh, regex trouble again.
I have following in $text:
[img]http://www.site.com/logo.jpg[/img]
and
[url]http://www.site.com[/url]
I have regex expression:
$text = preg_replace("/(?<!(\[img\]|\[url\]))([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9](?!(\[\/img\]|\[\/url\]))/","there was link",$text);
The point is to replace url only if it's not preceded by [img] or [url] and not followed by [/img] or [/url]. On the output of previous example I get:
there was link
and
there was link
Both the URL regexp and the lookbehind and lookahead regexps work fine separately.
$text = "[img]bash.org/logo.jpg[/img]";
$text = preg_replace("/(?<!(\[img\]|\[url\]))bash.org(?!(\[\/img\]|\[\/url\]))/","there was link",$text);
echo $text leaves everything as is and gives me [img]bash.org/logo.jpg[/img]
I suppose the problem is in combination of lookarounds and URL regex. Where's my mistake?
I WANT TO
replace http://www.google.com with "there was link", but leave as is "[url]http://www.google.com[/url]"
I'M GETTING
http://www.google.com replaced with "there was link" and [url]http://www.google.com[/url] replaced with "there was link"
HERE'S PHP CODE TO TEST
<?php
$text = "[url]http://www.google.com[/url] <br><br> http://www.google.com";
// should NOT be changed //should be changed
$text = preg_replace("/(?<!\[url\])([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9](?!\[\/url\])/","there was link",$text);
echo $text;
echo '<hr width="100%">';
$text = ":) :-) 0:) 0:-) :)) :-))";
$text = preg_replace("/(?<!0):-?\)(?!\))/","smiley",$text);
echo $text; // lookarounds work
echo '<hr width="100%">';
$text = "http://stackoverflow.com/questions/2482921/regexp-exclusion";
$text = preg_replace("/([http|ftp]+:\/\/)?\S+[^\s.,>)\];'\"!?]\.+[com|ru|net|ua|biz|org]+\/?[^<>\n\r ]+[A-Za-z0-9]/","it's a link to stackoverflow",$text);
echo $text; // URL pattern works fine
?>
Assuming I'm understanding you, you wish to replace all URLs in your $text with the words 'there was link', unless the URL is within either the url or img bbcode tags. The lookaround assertions aren't working because the tags themselves are being matched by your very greedy URL pattern (which I'm fairly sure does lots of things you don't mean it to). Writing a pattern that will match any valid URL (including query string) within other text and that will also not match the tags attached to it is not necessarily the simplest of matters, especially since your current pattern has the http:// or ftp:// as optional.
The only way you are likely to gain any success is to decide on a strict set of rules that constitute a url.
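To make the "greedy pattern swallows the tags" point concrete, here is a minimal Python sketch. The pattern is a deliberately simplified stand-in for the one in the question (it keeps only the lookarounds and the \S+ part), so treat it as an illustration of the mechanism rather than a reproduction of the original.

```python
import re

text = "[url]http://www.google.com[/url]"

# Simplified stand-in for the question's pattern: \S+ may also match "[", "]" and "/".
greedy = re.compile(r"(?<!\[url\])\S+\.com\S*(?!\[/url\])")

m = greedy.search(text)
print(m.group(0))
```

The match starts at the very beginning of the string, where nothing precedes it, and ends after [/url], where nothing follows it, so both lookarounds succeed and the entire tagged string is matched.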
It is tough to fully understand your question, but it looks like you're doing reverse BBcode. So, leave it alone if it's surrounded by tags? If that is the case, then I think you will have an interesting problem on your hands because URL regexes are notoriously complex.
I think you may be making this more complex than it needs to be. Instead, I would change anything that is between the BBcode. Here's what I think needs to happen:
find the string segment "[url]"
capture anything that follows it
end the capture when the string segment "[/url]" is seen
That is an easy regex:
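$text = "[url]http://www.google.com[/url] <br><br> http://www.google.com";
$regex = "#\[url\](.*?)\[/url\]#"; // assumed pattern for the three steps above
$replace = "there was link";
$text = preg_replace($regex, $replace, $text); // PHP has no preg_replace_all; preg_replace already replaces every match
echo $text;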
I know this isn't exactly what you asked for (in fact, probably the exact opposite), but it would achieve the same result and be much easier.
You can probably try using negative lookaheads with this regex, but I am not sure it would give you proper results:
$regex = "#(?!\[url\])(.*)(?!\[/url\])#";
One important note: This does not sanitize user input. Make sure you do this, but I would separate the logic so it is very easy to see what you are doing and where you are doing it. I would also use a library to do this because it's easier and probably safer.
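Handling the tagged segments explicitly can also be turned around to get what the question actually asks for: match either a whole tagged segment or a bare URL, keep the former untouched and replace only the latter. This is a different technique from the one in the answer above, sketched in Python for illustration; the bare-URL pattern is deliberately loose and the function name replace_bare_urls is hypothetical.

```python
import re

BBCODE_AWARE = re.compile(
    r"""
    (?P<tagged>\[(?P<tag>img|url)\].*?\[/(?P=tag)\])   # a whole [img]...[/img] or [url]...[/url]
    |
    (?:https?://|ftp://|www\.)[^\s<>\[\]]+             # a bare URL (deliberately loose)
    """,
    re.VERBOSE | re.IGNORECASE | re.DOTALL,
)


def replace_bare_urls(text, replacement="there was link"):
    """Replace URLs that are not wrapped in [img] or [url] tags."""
    def keep_or_replace(m):
        # Tagged segments are returned unchanged; anything else is a bare URL.
        return m.group("tagged") or replacement
    return BBCODE_AWARE.sub(keep_or_replace, text)


print(replace_bare_urls("[url]http://www.google.com[/url] <br><br> http://www.google.com"))
```

Because the tagged alternative is tried first and consumes the whole segment, the URL inside it is never offered to the second alternative, so no lookarounds are needed.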
Final working regexp looks like:
(?<!\[img\]|\[url\])((^|\s)([\w-]+://|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[\/img\]|\[/url\])
Example:
<?php
$text = "
[img]http://google.com/logo.jpg[/img]
[img]www.google.com/logo.jpg[/img]
[img]http://www.google.com/logo.jpg[/img]
[url]http://google.com/logo.jpg[/url]
[url]www.google.com/logo.jpg[/url]
[url]http://www.google.com/logo.jpg[/url]
www.google.com/logo.jpg
http://google.com/logo.jpg
http://www.google.com/logo.jpg
";
$text = nl2br($text);
$text = preg_replace("'(?<!\[img\]|\[url\])((^|\s)([\w-]+://|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))(?!\[\/img\]|\[/url\])'i","<font color=\"#ff0000\">link</font>",$text);
echo $text;
?>
outputs:
[img]http://google.com/logo.jpg[/img]
[img]www.google.com/logo.jpg[/img]
[img]http://www.google.com/logo.jpg[/img]
[url]http://google.com/logo.jpg[/url]
[url]www.google.com/logo.jpg[/url]
[url]http://www.google.com/logo.jpg[/url]
link
link
link
The trick is to replace only links that are preceded by ^ or \s. No other way to solve this issue was found.
Where's my mistake?
Well, the worst mistake is the lookbehind. It isn't needed, and it's making the job much harder than it needs to be. Assuming the existing tags are well formed, you needn't bother looking for the opening tag; its presence is implied by the presence of the closing tag.
EDIT: Your regex has several other problems besides the lookbehind, but it didn't seem worthwhile to try and fix it. Instead, I grabbed a regex from RegexBuddy's built-in library of useful regexes, and added the lookahead to it.
Try this regex (or see it in action on ideone):
'_\b(?>
(?>www\.|ftp\.|(?:https?|ftp|file)://) # scheme or subdomain
[-+&@#/%=~|$?!:,.\w]*[+&@#/%=~|$\w] # everything else
)(?!\[/(?:img|url)\])
_x'
Just because a problem can be described in terms of looking forward or backward, preceding or following, etc., doesn't mean you should design the regex that way. Lookbehind in particular should never be the first tool you reach for.