Matching both greedy, nongreedy and all others in between [duplicate] - regex

Given a string like "/foo/bar/baz/quux" (think of it like a path to a file on a unixy system), how could I (if at all possible) formulate a regular expression that gives me all possible paths that can be said to contain file quux?
In other words, upon running a regexp against the given string ("/foo/bar/baz/quux"), I would like to get as results:
"/foo/"
"/foo/bar/"
"/foo/bar/baz/"
I've tried the following:
'/\/.+\//g' - this is greedy by default, matches "/foo/bar/baz/"
'/\/.+?\//g' - lazy version, matches "/foo/" and also "/baz/"
P.S.: For what it's worth, I'm using Perl-compatible regular expressions in PHP, in the function preg_match().

Felipe is not looking for /foo/bar/baz, /bar/baz, /baz but for /foo, /foo/bar, /foo/bar/baz.
Here is one solution that builds on the regex idea in the comments but gives the right strings:
Reverse the string to be matched: xuuq/zab/rab/oof/. For instance, in PHP use strrev($string).
Match with (?=((?<=/)(?:\w+/)+)) and read capture group 1 of each match.
This gives you
zab/rab/oof/
rab/oof/
oof/
Then reverse each match with strrev().
This gives you
/foo/bar/baz
/foo/bar
/foo
If you had .NET rather than PCRE, you could match right to left and probably come up with the same result.
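The reverse, match, reverse steps above can be sketched in Python, whose re module also lacks variable-length lookbehind. This is only an illustration of the idea, not the answerer's code; the function name parent_dirs is made up for the example.

```python
import re

def parent_dirs(path):
    """Return every ancestor directory of `path`, shortest first."""
    reversed_path = path[::-1]  # Python's counterpart of PHP's strrev()
    # Zero-width lookahead: at each position right after a "/", capture the
    # longest run of "name/" segments that follows (capture group 1).
    hits = re.findall(r'(?=((?<=/)(?:\w+/)+))', reversed_path)
    # Un-reverse each hit; the longest hit is found first, so sort by length.
    return sorted((hit[::-1] for hit in hits), key=len)

print(parent_dirs('/foo/bar/baz/quux'))
# ['/foo', '/foo/bar', '/foo/bar/baz']
```

Appending "/" to each result gives the exact strings asked for in the question.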

This solution will not give the exact output you are expecting, but it still gives a useful result that you can post-process to get what you need:
$s = '/foo/bar/baz/quux';
// The whole pattern is a zero-width lookahead, so the text is in capture group 1, not in $m[0].
if ( preg_match_all('~(?=((?:/[^/]+)+(?=/[^/]+$)))~', $s, $m) )
    print_r($m[1]);
OUTPUT:
Array
(
    [0] => /foo/bar/baz
    [1] => /bar/baz
    [2] => /baz
)

A completely different answer, without reversing the string:
(?<=((?:\w+(?:/|$))+(?=\w)))
This matches
foo/
foo/bar/
foo/bar/baz/
but you have to use C#, which supports variable-length lookbehind; PCRE does not.

Related

Ruby Puppet Regex string matching

I'm somewhat new to Ruby and have done a ton of Google searching, but I just can't seem to figure out how to match this particular pattern. I have used rubular.com and can't seem to find a simple way to match. Here is what I'm trying to do:
I have several types of hosts, they take this form:
Sample hostgroups
host-brd0000.localdomain
host-cat0000.localdomain
host-dog0000.localdomain
host-bug0000.localdomain
Next I have a case statement in which I want to keep out the bugs (who doesn't, right?). I want to do something like this to match the series of characters. However, it starts matching at host-b, host-c, host-d, and matches only a single character, as if I had written [brdcatdog].
case $hostgroups { # variable takes the host string up to where the numbers begin
  # animals to keep
  /host-[["brd"],["cat"],["dog"]]/: {
    file {"/usr/bin/petstore-friends.sh":
      owner  => petstore,
      group  => petstore,
      mode   => 755,
      source => "puppet:///modules/petstore-friends.sh.$hostgroups",
    }
  }
}
I could do something like [bcd][rao][dtg], but it's not very clean looking and will match nonsense like "bad", "cot", "dat" and "crt", which I don't want.
Is there a slick way to use \A and [] that I'm missing?
Thanks for your help.
-wootini
How about using negative lookahead?
host-(?!bug).*
Here is the Rubular permalink matching everything except those pesky bugs!
Is this what you're looking for?
host-(brd|cat|dog)
(Following gtgaxiola's example, here's the Rubular permalink)
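To compare the original character class with the two suggested patterns, here is a small Python sketch run against the sample hostgroups from the question. Python's re syntax is close to Ruby's for these constructs, but the variable names are only illustrative.

```python
import re

hosts = [
    'host-brd0000.localdomain',
    'host-cat0000.localdomain',
    'host-dog0000.localdomain',
    'host-bug0000.localdomain',
]

# A character class matches ONE character from the set, so "b" alone is enough.
char_class = re.compile(r'host-[brdcatdog]')
# Alternation matches one of the whole words.
whitelist = re.compile(r'host-(brd|cat|dog)')
# Negative lookahead accepts anything after "host-" except "bug".
blacklist = re.compile(r'host-(?!bug).*')

kept_by_class = [h for h in hosts if char_class.match(h)]
kept_by_whitelist = [h for h in hosts if whitelist.match(h)]
kept_by_blacklist = [h for h in hosts if blacklist.match(h)]

print(len(kept_by_class))      # 4: the bug host slips through
print(len(kept_by_whitelist))  # 3
print(len(kept_by_blacklist))  # 3
```

The two fixes are not equivalent: the alternation keeps only the three named animals, while the lookahead also keeps any future hostgroup that is not "bug".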

Why is it selecting this file?

I have the following statement:
Directory.GetFiles(filePath, "A*.pdf")
    .Where(file => Regex.IsMatch(Path.GetFileName(file), "[Aa][i-lI-L].*"))
    .Skip((pageNum - 1) * pageSize)
    .Take(pageSize)
    .Select(path => new FileInfo(path))
    .ToArray()
My problem is that the above statement also finds the file "Adali.pdf", which it should not, but I cannot figure out why.
The above statement should only select files starting with a, and where the second letter is in the range i-l.
Because it matches Adali at its 3rd and 4th characters (al):
Adali
  --
Try using ^ in your regex, which anchors the match at the start of the string:
Regex.IsMatch(..., "^[Aa][i-lI-L].*")
Also, I doubt you need the trailing .* at all.
PS: As a side note, this question doesn't seem to be written that well. You should try debugging this code yourself, and in particular you should try checking your regex against your cases without LINQ. I'm sure this has nothing to do with LINQ (the tag you have on your question); the issue is about regular expressions (which you didn't mention in the tags at all).
You are not anchoring the string. This makes the regex match the al in Adali.pdf.
Change the regex to ^[Aa][i-lI-L].* or, if you don't need anything besides a yes/no match, just ^[Aa][i-lI-L].
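The effect of the missing anchor can be reproduced outside LINQ. The sketch below uses Python's re.search, which, like Regex.IsMatch, succeeds if the pattern matches anywhere in the input. Apart from Adali.pdf, the file names are invented for the example.

```python
import re

names = ['Adali.pdf', 'Aida.pdf', 'alpha.pdf', 'Abba.pdf']

unanchored = re.compile(r'[Aa][i-lI-L].*')  # may start matching anywhere
anchored = re.compile(r'^[Aa][i-lI-L]')     # must start at the first character

loose = [n for n in names if unanchored.search(n)]
strict = [n for n in names if anchored.search(n)]

print(unanchored.search('Adali.pdf').group())  # ali.pdf (starts at the 3rd character)
print(loose)   # ['Adali.pdf', 'Aida.pdf', 'alpha.pdf']
print(strict)  # ['Aida.pdf', 'alpha.pdf']
```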
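You could also do this, anchoring the pattern at both ends so that it has to cover the whole file name:
var f = Directory.GetFiles(tb_Path.Text, "A*.pdf").Where(file => Regex.IsMatch(Path.GetFileName(file), @"^[Aa][i-lI-L].*\.pdf$")).ToArray();
Without the ^ anchor, the regex may start matching in the middle of the name, which is why Adali is accepted.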

Regex to match URL not followed by " or <

I'm trying to modify the url-matching regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls to not match anything that's already part of a valid URL tag or used as the link text.
For example, in the following string, I want to match http://www.foo.com, but NOT http://www.bar.com or http://www.baz.com
www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>
I was trying to add a negative lookahead to exclude matches followed by " or <, but for some reason, it's only applying to the "m" in .com. So, this regex still returns http://www.bar.co and http://www.baz.co as matches.
I can't see what I'm doing wrong... any ideas?
\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))(?!["<])
Here is a simpler example too:
((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])
I looked into this issue last year and developed a solution that you may want to look at. See: URL Linkification (HTTP/FTP). This link is a test page for the JavaScript solution, with many examples of difficult-to-linkify URLs.
My regex solution, written for both PHP and JavaScript, is not simple (but neither is the problem, as it turns out). For more information I would also recommend reading:
The Problem With URLs by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber
The comments following Jeff's blog post are a must read if you want to do this right...
Note also that John Gruber's regex has a component that can go into the realm of catastrophic backtracking (the part which matches one level of matching parentheses).
Yeah, it's actually trivial to make it work if you just want to exclude trailing characters: make your expression 'independent' (an atomic group), and no backtracking will occur in that segment.
(?>\b ...)(?!["<])
A Perl test:
use strict;
use warnings;
my $str = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>http://www.some.com';
while ($str =~ m~
    (?>
        \b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
    )
    (?!["<])
~xg)
{
    print "$1\n";
}
Output:
www.foo.com
http://www.some.com
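Why the lookahead only "applies to the m" can be seen with the simpler pattern from the question. The Python sketch below first reproduces the backtracking, then shows a different fix from the atomic group above: the lookahead also rejects any character that could have continued the URL, so giving back the "m" no longer helps. The sample text, with its anchor tag, is an assumption about what the question's string looked like.

```python
import re

text = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>'

HEAD = r'(?:(?:ht|f)tps?://|www\.)'
CHARS = r'a-zA-Z0-9_\-.:#/~}?'  # body of the character class from the question

# Naive: when the lookahead fails, the engine gives back one character ("m")
# and retries; "m" is neither " nor <, so the shortened match is accepted.
naive = re.compile('(' + HEAD + '[' + CHARS + ']+)(?!["<])')

# Stricter lookahead: the match may not stop in front of a URL character
# either, so backtracking into the URL cannot satisfy it.
strict = re.compile('(' + HEAD + '[' + CHARS + ']+)(?![' + CHARS + '"<])')

naive_hits = [m.group(1) for m in naive.finditer(text)]
strict_hits = [m.group(1) for m in strict.finditer(text)]

print(naive_hits)   # ['www.foo.com', 'http://www.bar.co', 'http://www.baz.co']
print(strict_hits)  # ['www.foo.com']
```

Python 3.11 and later also accept the (?>...) atomic group syntax; the lookahead variant is used here only because it runs on older versions as well.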

RegEx: Matching Pattern within Pattern - I think I need to use Positive Lookbehinds?

I'm trying to use RegEx to find a pattern within a pattern. Specifically what I want to do is capture a URL into a reference and search within that for everything that comes after the last = sign and capture that as well.
So given this string
<a href="http://my.domain.com/?s_cid=EM&s_ev9=CMC21892&s_ev10=EM_CMC21892_LC_stuff">stuff</a>
I would initially find
href="http://my.domain.com/?s_cid=EM&s_ev9=CMC21892&s_ev10=EM_CMC21892_LC_stuff"
Using this RegEx: href="(https?[^"]*)"
From there I could parse the actual string (when looking at the captured group) I'm looking for EM_CMC21892_LC_stuff with this: =[^"=]*$
I am having no success though when I try to combine the two to accomplish it in one RegEx.
Any thoughts?
He's right, using regexes to parse HTML is just asking for trouble.
That said, try href="http[^"]+=([^"]+?)".
I agree with Mark Byers' comment about using existing HTML/URL parsing functions instead of regex (though you didn't specify which language you are using, so we can't really help with that...)
However, if you insist on doing it the regex way, here is a pattern:
/href="([^"]*=([^"]*))"/
Edit to add: here is what the result looks like. I wasn't sure if you still wanted to capture the full URL or just that last parameter value, but this pattern captures both:
Array
(
    [0] => Array
        (
            [0] => href="http://my.domain.com/?s_cid=EM&s_ev9=CMC21892&s_ev10=EM_CMC21892_LC_stuff"
        )
    [1] => Array
        (
            [0] => http://my.domain.com/?s_cid=EM&s_ev9=CMC21892&s_ev10=EM_CMC21892_LC_stuff
        )
    [2] => Array
        (
            [0] => EM_CMC21892_LC_stuff
        )
)
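The same single-pattern capture can be tried in Python. The sketch also shows the parser-based route the answers recommend, using the standard library's urllib.parse. The anchor tag in html is an assumption, built around the href value given in the question.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical input: an anchor tag wrapped around the href from the question.
html = ('<a href="http://my.domain.com/?s_cid=EM&s_ev9=CMC21892'
        '&s_ev10=EM_CMC21892_LC_stuff">stuff</a>')

# Greedy [^"]*= runs to the LAST "=" inside the quotes, so group 2 is
# whatever follows it; group 1 is the whole URL.
m = re.search(r'href="([^"]*=([^"]*))"', html)
url, last_value = m.group(1), m.group(2)
print(last_value)  # EM_CMC21892_LC_stuff

# Alternative without a second regex: parse the query string by name.
params = parse_qs(urlsplit(url).query)
print(params['s_ev10'][0])  # EM_CMC21892_LC_stuff
```

Looking the parameter up by name is more robust than "everything after the last =" if the order of the query parameters ever changes.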

Pcrepp - Perl Regular Expression syntax to match host name [duplicate]

I'm trying to use pcrepp (PCRE) to extract the hostname from a URL.
PCRE syntax is the same as Perl 5 regular expression syntax.
For example:
url = "http://www.pandora.com/#/volume/73";
// the match will be "http://www.pandora.com/".
I can't find the correct regex syntax for this example.
It needs to work for any URL: amazon.com/sds/ should return amazon.com,
and abebooks.co.uk/isbn="62345627457245"/blabla/ should return abebooks.co.uk.
I don't need to check whether the URL is valid, just to get the hostname.
Something like this:
^(?:[a-z]+://)?[^/]+/?
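A quick way to check this pattern against the three URLs from the question is the Python sketch below. The capture group around [^/]+ is an addition, so that the bare hostname can be read separately from the full match; the helper name hostname is hypothetical.

```python
import re

# Optional scheme, then everything up to the first "/" captured as the host.
HOST = re.compile(r'^(?:[a-z]+://)?([^/]+)/?')

def hostname(url):
    """Return the host part of `url`, or None if nothing matches."""
    m = HOST.match(url)
    return m.group(1) if m else None

print(HOST.match('http://www.pandora.com/#/volume/73').group(0))
# http://www.pandora.com/
print(hostname('http://www.pandora.com/#/volume/73'))  # www.pandora.com
print(hostname('amazon.com/sds/'))                     # amazon.com
print(hostname('abebooks.co.uk/isbn="62345627457245"/blabla/'))  # abebooks.co.uk
```

As the question allows, nothing is validated: a port or user:password@ prefix, if present, stays in the captured host.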
See Regexp::Common::URI::http which uses sub-patterns defined in Regexp::Common::URI::RFC2396. Examining the source code of those modules should give you a good idea how to put together a decent pattern.
Here is one possibility:
^[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$
And another:
^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$
These and other URL related regular expressions can be found here: Regular Expression Library
string regex1, regex2, finalRegex;
regex1 = "^((\\w+):\\/\\/\\/?)?((\\w+):?(\\w+)?#)?([^\\/\\?:]+):?(\\d+)?(\\/?[^\\?#;\\|]+)?([;\\|])?([^\\?#]+)?\\??";
regex2 = "([^#]+)?#?(\\w*)";
// concatenation
finalRegex = regex1 + regex2;
The hostname will be in the sixth capture group.
This was answered in another question I asked: Details.