Parsing valid parent directories with regex

Given the string a/b/c/d which represents a fully-qualified sub-directory I would like to generate a series of strings for each step up the parent tree, i.e. a/b/c, a/b and a.
With regex I can do a non-greedy /(.*?)\// which will give me matches of a, b and c or a greedy /(.*)\// which will give me a single match of a/b/c. Is there a way I can get the desired results specified above in a single regex or will it inherently be unable to create two matches which eat the same characters (if that makes sense)?
Please let me know if this question is answered elsewhere... I've looked, but found nothing.
Note this question is about whether it's possible with regex. I know there are many ways outside of regex.

One solution, building on an idea from this other question:
Reverse the string to be matched: d/c/b/a. For instance, in PHP use strrev($string).
Match with (?=(/(?:\w+(?:/|$))+))
This gives you
/c/b/a
/b/a
/a
Then reverse the matches with strrev($string).
This gives you
a/b/c/
a/b/
a/
If you had .NET rather than PCRE, you could match right to left and probably come up with the same result.
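The reverse, match, reverse steps above can be sketched in Python. This is only an illustration of the same idea: slicing with [::-1] stands in for PHP's strrev, and the helper name parent_dirs is made up for this example.

```python
import re

# Pattern from the answer: a lookahead that captures each "/segment/..." tail
# without consuming it, so overlapping tails are all reported.
TAIL = re.compile(r'(?=(/(?:\w+(?:/|$))+))')

def parent_dirs(path):
    """Return each parent prefix of `path`, longest first, with a trailing slash."""
    reversed_path = path[::-1]             # stands in for PHP's strrev
    tails = TAIL.findall(reversed_path)    # one captured tail per '/' position
    return [tail[::-1] for tail in tails]  # reverse each match back

print(parent_dirs('a/b/c/d'))
```

As in the answer, each result keeps its trailing slash; strip it with rstrip('/') if that is not wanted.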

Completely different answer without reversing string.
(?<=((?:\w+(?:/|$))+(?=\w)))
This matches
a/
a/b/
a/b/c/
but you have to use C#, which supports variable-length lookbehind.

Yes, it's possible:
/([^\/]*)\//
So basically it replaces your .*? with [^/]*, and it does not have to be non-greedy. Since / is a special character in your case, you will have to escape it, like so: [^\/]*.

Related

RegEx for Google Analytics that picks text within urls

I am trying to build a RegEx that picks urls that end with "/topic". These urls have a different number of folders so whereas one might be www.example.com/pijamas/topic another could be www.example.com/pijamas/strippedpijamas/topic
What regular expression can I use to do that? My attempt is ^www.example.com/[a-zA-Z][1,]/topic$ but this hasn't worked. Even if it worked I'd like to have a shorter RegEx to do this really.
Any help on this would be much appreciated.
Thank you, A.
Try this:
^www\.example\.com\/[\w\/]*topic$
You need to make a few changes to your regex. Firstly, the dot (.) is a special character and needs to be escaped by prefacing it with a backslash.
Secondly, you probably meant {1,} instead of [1,] – the latter defines a character class. You can substitute {1,} with +.
Then there's the fact that your second URL has one more subdirectory, so you need to somehow incorporate a / into your regex.
Putting all this together:
^www\.example\.com/[a-zA-Z]+(/[a-zA-Z]+)*/topic$
To shorten it, you can use the i option to match regardless of case, cutting down the two [a-zA-Z] to [a-z].
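A quick way to check the final pattern is to run it against the two sample URLs. The sketch below uses Python's re module purely for illustration (Google Analytics has its own regex engine, so behaviour there is assumed to be similar, not guaranteed); the helper name is_topic_url and the third test URL are made up.

```python
import re

# Case-insensitive form of the pattern above, with [a-zA-Z] shortened to [a-z].
TOPIC_URL = re.compile(r'^www\.example\.com/[a-z]+(/[a-z]+)*/topic$', re.IGNORECASE)

def is_topic_url(url):
    """True if the URL has one or more folders and ends with /topic."""
    return TOPIC_URL.match(url) is not None

for url in ('www.example.com/pijamas/topic',
            'www.example.com/pijamas/strippedpijamas/topic',
            'www.example.com/pijamas/other'):
    print(url, is_topic_url(url))
```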

Matching all strings without 3 occurrences of / or a final single character in RegEx

Trying to figure out the regex for the title,
i.e.,
foo
foo/bar/foo
foo/bar/foo/bar
foo/bar/d
I don't want it to match the 3rd or the 4th one but match the first two. In the 2nd option, the final foo can be anything but a single d.
You could use a regex but it will be more complicated than just counting the number of slashes and also checking the last character isn't a d. If you want to use a regex to check for the last part not being "/d" you could do something like check that it doesn't match ^.*/d$ but it may be clearer to just use code. (If counting slashes and checking string doesn't end in "/d" isn't exactly what you mean then it will help to have more examples)
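The non-regex check described here might look like the following in Python. It assumes the rule really is "fewer than three slashes and not ending in /d", which is only one reading of the question; the function name acceptable is made up.

```python
def acceptable(path):
    """True if path has fewer than three slashes and does not end in '/d'."""
    return path.count('/') < 3 and not path.endswith('/d')

for path in ('foo', 'foo/bar/foo', 'foo/bar/foo/bar', 'foo/bar/d'):
    print(path, acceptable(path))
```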
Figured it out. See below if anyone is interested.
(^foo/?$)|(^foo/[^/]+/(([^d][^/]*)|(d[^/]+))/?$)
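For anyone who wants to try the pattern, here is a small Python check against the four sample strings from the question. Python is just a convenient test bed here; the question does not say which engine was used.

```python
import re

# Pattern exactly as posted above.
PATTERN = re.compile(r'(^foo/?$)|(^foo/[^/]+/(([^d][^/]*)|(d[^/]+))/?$)')

samples = ['foo', 'foo/bar/foo', 'foo/bar/foo/bar', 'foo/bar/d']
results = {s: bool(PATTERN.search(s)) for s in samples}

for sample, matched in results.items():
    print(sample, matched)
```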

How to ensure a regex matches only after a certain character

Is there a way to specify regex that would match a string but from a certain position in this string? What I mean is that I have line:
"Somebodys_value is % value"
and I'd like to check if my regex matches this sentence but only after % sign.
Using only RegEx you just include the % in your pattern. If your pattern is value, you can change it to %.*value.
Another way, which depends more on your engine, is to provide an offset. You can use a strpos-like function to find the %, and tell the engine to start matching after that position.
Yet another method is to copy everything after the % into a new buffer/string, and then try to match that.
Any more specifics depend on the engine you're using.
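As one concrete case, here is how the offset and copy approaches could look in Python, where a compiled pattern's search() accepts a start position. The sample line is the one from the question; the variable names are made up.

```python
import re

line = "Somebodys_value is % value"
pattern = re.compile(r'value')

percent_at = line.find('%')   # -1 if there is no '%'

# Option 1: start the search just past the '%'.
m_offset = pattern.search(line, percent_at + 1) if percent_at != -1 else None

# Option 2: copy everything after the '%' and match against that.
tail = line.partition('%')[2]
m_tail = pattern.search(tail)

print(m_offset, m_tail)
```

Note that match positions from option 1 refer to the original string, while those from option 2 refer to the copied tail.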
edit:
It sounds like you don't want the % in your matches. A few implementation specific ways to do this are...
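(?:%).*value where the % is in a non-capturing group (note that the % is still part of the overall match; a lookbehind such as (?<=%).*value leaves it out)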
%\K.*value where \K discards everything before it (limited support)
%(.*value) where you will just use the first subpattern (often called $1).
or you can just do any operations starting at sub 1, and ignore the % at sub 0.
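To see which of these actually keeps the % out of the result, here is a short Python comparison of a capturing group and a lookbehind assertion. Python's standard re module does not support \K, so that variant is left out; the variable names are made up.

```python
import re

line = "Somebodys_value is % value"

# Capturing group: the overall match includes '%', group 1 does not.
m = re.search(r'%(.*value)', line)
whole, captured = m.group(0), m.group(1)

# Lookbehind: '%' must precede the match but is not part of it.
looked_behind = re.search(r'(?<=%).*value', line).group(0)

print(repr(whole), repr(captured), repr(looked_behind))
```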
Superficially, it seems like you could use this regex (where the slashes are simple delimiters):
/%.*value/
It looks for your value after it's seen the percentage. Coding that up in C++ is only marginally fiddlier, but since you've given no indication of which regex package you're using, it is hard to know what code to write. There are a lot of possible regex packages you could be planning to use.
This all depends on the particular regex API you're using, of course, but I see two different options:
Advance your char* to point to the % before calling your match() function
Include the '%' in the regex itself, as @JonathanLeffler says.
Version 1 might be more efficient, but only if you already know where the '%' is!

Negative integer Regex doesn't match

I have Googled it, and found the following results:
http://icfun.blogspot.com/2008/03/regular-expression-to-handle-negative.html
http://regexlib.com/DisplayPatterns.aspx?cattabindex=2&categoryId=3
With some (very basic) Regex knowledge, I figured this would work:
r\.(^-?\d+)\.(^-?\d+)\.mcr
For parsing such strings:
r.0.0.mcr
r.-1.5.mcr
r.20.-1.mcr
r.-1.-1.mcr
But I don't get a match on these.
Since I'm learning (or trying to learn) Regex, could you please explain why my pattern doesn't match (instead of just writing a new working one for me)? From what I understood, it goes like so:
Match r
Match a period
Match a prefix negative sign or not, and store the group
Match a period
Match a prefix negative sign or not, and store the group
Match a period
Match mcr
But I'm wrong, apparently :).
You are very close. ^ matches the start of a string, so it should only be located at the start of a pattern (if you want to use it at all - that depends on whether you will also accept e.g. abcr.0.0.mcr or not). Similarly, one can use $ (but only at the end of the pattern) to indicate that you will only accept strings that do not contain anything after what the pattern matches (so that e.g. r.0.0.mcrabc won't be accepted). Otherwise, I think it looks good.
The ^ characters are telling it to match only at the beginning of a line; since it's obviously not at the beginning of a line in either case, it fails to match. In this case, you just need to remove both ^s. (I think what you're trying to say is "don't let anything else be in between these", but that's the default except at the start of the regex; you would need something like .* to make it allow additional characters between them.)
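The effect is easy to reproduce. The sketch below uses Python's re module (the question does not name its language, so this is only an assumption for illustration) to compare the original pattern with a version that has the inner ^ removed and anchors only at the two ends.

```python
import re

broken = re.compile(r'r\.(^-?\d+)\.(^-?\d+)\.mcr')   # pattern from the question
fixed = re.compile(r'^r\.(-?\d+)\.(-?\d+)\.mcr$')    # anchors only at the ends

names = ['r.0.0.mcr', 'r.-1.5.mcr', 'r.20.-1.mcr', 'r.-1.-1.mcr']

for name in names:
    # broken.search() returns None: '^' cannot match in the middle of the string.
    print(name, broken.search(name), fixed.match(name).groups())
```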
Note that ^ means 'not' only when it is the first character inside a character class, as in [^-]. Outside a class, as in your pattern, it is still the start-of-string anchor, so it cannot match in the middle of the string.

How can I "inverse match" with regex?

I'm processing a file, line-by-line, and I'd like to do an inverse match. For instance, I want to match lines where there is a string of six letters, but only if these six letters are not 'Andrea'. How should I do that?
I'm using RegexBuddy, but still having trouble.
(?!Andrea).{6}
Assuming your regexp engine supports negative lookaheads...
...or maybe you'd prefer to use [A-Za-z]{6} in place of .{6}
Note that lookaheads and lookbehinds are generally not the right way to "inverse" a regular expression match. Regexps aren't really set up for doing negative matching; they leave that to whatever language you are using them with.
For Python/Java,
^(.(?!(some text)))*$
http://www.lisnichenko.com/articles/javapython-inverse-regex.html
In PCRE and similar variants, you can actually create a regex that matches any line not containing a value:
^(?:(?!Andrea).)*$
This is called a tempered greedy token. The downside is that it doesn't perform well.
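The same pattern also works in Python's re module, so a line filter can be sketched like this. The sample lines and variable names are made up for the example.

```python
import re

# Matches a whole line only if "Andrea" does not start at any position in it.
NO_ANDREA = re.compile(r'^(?:(?!Andrea).)*$')

lines = ['Hello Andrea', 'Hello world', 'Andrea', '']
kept = [text for text in lines if NO_ANDREA.match(text)]

print(kept)
```

The empty line is kept as well, since zero repetitions of the group followed by $ still match.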
The capabilities and syntax of the regex implementation matter.
You could use look-ahead. Using Python as an example,
import re
# Raw string so that \w reaches the regex engine unchanged.
not_andrea = re.compile(r'(?!Andrea)\w{6}', re.IGNORECASE)
To break that down:
(?!Andrea) means 'match if the next 6 characters are not "Andrea"'; if so then
\w means a "word character": letters, digits and the underscore. For ASCII input this is equivalent to the class [a-zA-Z0-9_]
\w{6} means exactly six word characters.
re.IGNORECASE means that you will exclude "Andrea", "andrea", "ANDREA" ...
Another way is to use your program logic - use all lines not matching Andrea and put them through a second regex to check for six characters. Or first check for at least six word characters, and then check that it does not match Andrea.
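The two-step approach could be sketched as below. It assumes "six letters" means a run of six ASCII letters anywhere in the line and that the exclusion is case-sensitive; the function name wanted is made up.

```python
import re

SIX_LETTERS = re.compile(r'[A-Za-z]{6}')

def wanted(line):
    """Keep lines that do not mention Andrea but do contain six letters in a row."""
    if 'Andrea' in line:       # step 1: drop lines that contain the name
        return False
    return SIX_LETTERS.search(line) is not None   # step 2: check the rest

for line in ('Robert was here', 'Andrea was here', 'abc 123'):
    print(line, wanted(line))
```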
Negative lookahead assertion
(?!Andrea)
This is not exactly an inverted match, but it's the best you can directly do with regex. Not all platforms support them though.
If you want to do this in RegexBuddy, there are two ways to get a list of all lines not matching a regex.
On the toolbar on the Test panel, set the test scope to "Line by line". When you do that, an item List All Lines without Matches will appear under the List All button on the same toolbar. (If you don't see the List All button, click the Match button in the main toolbar.)
On the GREP panel, you can turn on the "line-based" and the "invert results" checkboxes to get a list of non-matching lines in the files you're grepping through.
I just came up with this method which may be hardware intensive but it is working:
You can replace all characters which match the regex by an empty string.
This is a oneliner:
notMatched = re.sub(regex, "", string)
I used this because I was forced to use a very complex regex and couldn't figure out how to invert every part of it within a reasonable amount of time.
This will only return you the string result, not any match objects!
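A self-contained version of that one-liner, with a deliberately simple stand-in regex (the \d+ pattern, the sample text and the function name are only placeholders for the complex regex mentioned above):

```python
import re

def not_matched(regex, text):
    """Return `text` with every match of `regex` removed."""
    return re.sub(regex, "", text)

leftover = not_matched(r'\d+', 'abc123def456')
print(leftover)
```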
(?! is useful in practice, although strictly speaking lookahead is not part of regular expressions as defined mathematically.
You can write an inverted regular expression manually.
Here is a program to calculate the result automatically.
Its result is machine generated, which is usually much more complex than hand writing one. But the result works.
If you have the possibility to do two regex matches for the inverse and join them together you can use two capturing groups to first capture everything before your regex
^((?!yourRegex).)*
and then capture everything behind your regex
(?<=yourRegex).*
This works for most regexes. One problem I discovered was when I had a quantifier like {2,4} at the end. Then you gotta get creative.
In Perl you can do:
process($line) if ($line !~ /Andrea/);