How to ensure a regex matches only after a certain character - c++

Is there a way to specify regex that would match a string but from a certain position in this string? What I mean is that I have line:
"Somebodys_value is % value"
and I'd like to check if my regex matches this sentence but only after % sign.

Using only RegEx you just include the % in your pattern. If your pattern is value, you can change it to %.*value.
Another way, which depends more on your engine, is to provide an offset. You can use a strpos-like function to find the %, and tell the engine to start matching after that position.
Yet another method is to copy everything after the % into a new buffer/string, and then try to match that.
Any more specifics depend on the engine you're using.
edit:
It sounds like you don't want the % in your matches. A few implementation specific ways to do this are...
(?:%).*value where the % is in a non-capturing group (the % is still part of the overall match; it just doesn't get its own group)
%\K.*value where \K discards everything before it (limited support)
%(.*value) where you will just use the first subpattern (often called $1).
or you can just do any operations starting at sub 1, and ignore the % at sub 0.
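As a rough illustration of the offset and subpattern approaches above, here is a minimal sketch in Python (the question is about C++, so treat this as a stand-in for whatever regex package you use). The helper name search_after is hypothetical; it relies on the fact that Python's compiled patterns accept a start position in search().

```python
import re

def search_after(pattern, text, marker="%"):
    """Search for `pattern` only in the part of `text` after `marker`.

    `search_after` is a hypothetical helper name used for illustration.
    Returns a match object, or None if the marker or the pattern is missing.
    """
    idx = text.find(marker)          # strpos-like lookup of the marker
    if idx == -1:
        return None
    # Pattern.search(string, pos) starts scanning at index `pos`
    return re.compile(pattern).search(text, idx + len(marker))

line = "Somebodys_value is % value"

plain = re.search(r"value", line)            # finds the "value" before the %
after = search_after(r"value", line)         # finds only the one after the %
grouped = re.search(r"%(.*value)", line)     # % is matched but excluded from group 1

print(plain.start(), after.start(), repr(grouped.group(1)))
```

Positions reported by the match object are still relative to the whole string, which is usually what you want when reporting where the match was found.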

Superficially, it seems like you could use this regex (where the slashes are simple delimiters):
/%.*value/
It looks for your value after it's seen the percentage. Coding that up in C++ is only marginally fiddlier, but since you've given no indication of which regex package you're using, it is hard to know what code to write. There are a lot of possible regex packages you could be planning to use.

This all depends on the particular regex API you're using, of course, but I see two different options:
Advance your char* to point to the % before calling your match() function
Include the '%' in the regex itself, as @JonathanLeffler says.
Version 1 might be more efficient, but only if you already know where the '%' is!

Related

Regex to get match up through second to last occurrence of a string

If I have the following string:
hello/everyone/good/bye/world
I want to match everything up through the second to last forward slash:
hello/everyone/good/
But the number of slashes may vary.
What would the regex be to do this? Been googling around to no avail.
Instead of a regular expression, use split and take advantage of "slicing".
t = "hello/everyone/good/bye/world"
ts = t.split("/")
print "/".join( ts[0:-2] ) + "/"
This prints
hello/everyone/good/
This answer is for Python 2 (you did not specify which language); there are equivalent commands in other languages. For example, Java has a String.split() method (to complicate things, it takes a regular expression). Consult the documentation for the language you are using.
Maybe try this https://regex101.com/r/bH2vI0/1:
((\w+\/)+(?!\w+\/\w+))
Not sure if Sublime allows lookaheads though, but basically just match every word + / except for the last two. [A-Za-z] might be better if you don't want digits and _
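For comparison, here is a hedged Python sketch of a lookahead-based pattern for this task. Note that a negative lookahead placed after a greedy run of segments is already satisfied right before the final segment, so this sketch uses a positive lookahead that requires exactly one more slash to follow. The pattern and the function name are illustrative assumptions, not taken from the answers above.

```python
import re

# Zero or more "segment/" pieces, as long as exactly one more slash follows.
UP_TO_SECOND_LAST = re.compile(r"^(?:[^/]*/)*(?=[^/]*/[^/]*$)")

def up_through_second_to_last_slash(path):
    """Return the prefix of `path` ending at the second-to-last slash.

    Returns "" when there is only one slash, and None when there is none.
    """
    m = UP_TO_SECOND_LAST.match(path)
    return m.group(0) if m else None

print(up_through_second_to_last_slash("hello/everyone/good/bye/world"))
```

Using [^/]* instead of \w+ means segments may contain dots, dashes, or spaces.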

Parsing valid parent directories with regex

Given the string a/b/c/d which represents a fully-qualified sub-directory I would like to generate a series of strings for each step up the parent tree, i.e. a/b/c, a/b and a.
With regex I can do a non-greedy /(.*?)\// which will give me matches of a, b and c or a greedy /(.*)\// which will give me a single match of a/b/c. Is there a way I can get the desired results specified above in a single regex or will it inherently be unable to create two matches which eat the same characters (if that makes sense)?
Please let me know if this question is answered elsewhere... I've looked, but found nothing.
Note this question is about whether it's possible with regex. I know there are many ways outside of regex.
One solution, building on an idea in this other question:
Reverse the string to be matched: d/c/b/a. For instance, in PHP use strrev($string).
Match with (?=(/(?:\w+(?:/|$))+))
This gives you
/c/b/a
/b/a
/a
Then reverse each match with strrev($string).
This gives you
a/b/c/
a/b/
a/
If you had .NET rather than PCRE, you could match right to left and probably come up with the same result.
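The reverse, match, reverse steps can be sketched in Python as follows. This is only an illustration of the idea using the same pattern; the function name parent_dirs is hypothetical, and the trailing slash is stripped to get the a/b/c form asked for in the question.

```python
import re

def parent_dirs(path):
    """List the parent directories of `path`, longest first.

    Works by reversing the string, collecting overlapping matches through a
    capturing group inside a lookahead, and reversing each capture back.
    """
    reversed_path = path[::-1]
    # The lookahead consumes nothing, so matches starting at later slashes
    # can reuse the same characters.
    captures = re.findall(r"(?=(/(?:\w+(?:/|$))+))", reversed_path)
    return [capture[::-1].rstrip("/") for capture in captures]

print(parent_dirs("a/b/c/d"))
```

Because the pattern uses \w+, directory names containing characters such as dots or dashes would need a wider character class.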
Completely different answer without reversing string.
(?<=((?:\w+(?:/|$))+(?=\w)))
This matches
a/
a/b/
a/b/c/
but you have to use C#, which supports variable-length lookbehind
Yes, it's possible:
/([^\/]*)\//
So basically it replaces your .*? with [^/]*, and it does not have to be non-greedy. Since / is a special character in your case, you will have to escape it, like so: [^\/]*.

Simple regex for matching up to an optional character?

I'm sure this is a simple question for someone at ease with regular expressions:
I need to match everything up until the character #
I don't want the string following the # character, just the stuff before it, and the character itself should not be matched. This is the most important part, and what I'm mainly asking. As a second question, I would also like to know how to match the rest, after the # character. But not in the same expression, because I will need that in another context.
Here's an example string:
topics/install.xml#id_install
I want only topics/install.xml. And for the second question (separate expression) I want id_install
First expression:
^([^#]*)
Second expression:
#(.*)$
[a-zA-Z0-9]*[\#]
If your string contains any other special characters you need to add them into the first square bracket escaped.
I don't use C#, but I will assume that it uses PCRE... if so,
"([^#]*)#.*"
with a call to 'match'. A call to 'search' does not need the trailing ".*"
The parens define the 'keep group'; the [^#] means any character that is not a '#'
You probably tried something like
"(.*)#.*"
and found that it fails when multiple '#' signs are present (keeping the leading '#'s)?
That is because ".*" is greedy, and will match as much as it can.
Your matcher should have a method that looks something like 'group(...)'. Most matchers
return the entire matched sequence as group(0), the first paren-matched group as group(1),
and so forth.
PCRE is so important that I strongly encourage you to search for it on Google, learn it, and always have it in your programming toolkit.
Use look ahead and look behind:
To get all characters up to, but not including the pound (#): .*?(?=\#)
To get all characters following, but not including the pound (#): (?<=\#).*
If you don't mind using groups, you can do it all in one shot:
(.*?)\#(.*) Your answers will be in group(1) and group(2). Notice the non-greedy construct, *?, which will attempt to match as little as possible instead of as much as possible.
If you want to allow for a missing # section, use ([^\#]*)(?:\#(.*))?. It uses a non-capturing group to test the second half, and if it finds it, returns everything after the pound.
Honestly though, for your situation, it is probably easier to use the Split method provided in String.
More on lookahead and lookbehind
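To make the group-based variant with an optional # section concrete, here is a small Python sketch (the question is about C#, so this is an analogy rather than the asker's API). The function name split_fragment is hypothetical.

```python
import re

def split_fragment(ref):
    """Split `ref` at the first '#'.

    Returns (before, after); `after` is None when there is no '#'.
    """
    # Every part of the pattern can match the empty string, so match()
    # always succeeds here.
    m = re.match(r"([^#]*)(?:#(.*))?", ref)
    return m.group(1), m.group(2)

print(split_fragment("topics/install.xml#id_install"))
print(split_fragment("topics/install.xml"))
```

Since [^#]* cannot run past a '#', the split always happens at the first '#', even when several are present.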
first:
/[^\#]*(?=\#)/ edit: is faster than /.*?(?=\#)/
second:
/(?<=\#).*/
For something like this in C# I would usually skip the regular expressions stuff altogether and do something like:
string[] split = exampleString.Split('#');
string firstString = split[0];
string secondString = split[1];

Regular expression greedy match not working as expected

I have a very basic regular expression that I just can't figure out why it's not working so the question is two parts. Why does my current version not work and what is the correct expression.
Rules are pretty simple:
Must have minimum 3 characters.
If a % character is the first character must be a minimum of 4 characters.
So the following cases should work out as follows:
AB - fail
ABC - pass
ABCDEFG - pass
% - fail
%AB - fail
%ABC - pass
%ABCDEFG - pass
%%AB - pass
The expression I am using is:
^%?\S{3}
Which to me means:
^ - Start of string
%? - Greedy check for 0 or 1 % character
\S{3} - 3 other characters that are not white space
The problem is, the %? for some reason is not doing a greedy check. It's not eating the % character if it exists so the '%AB' case is passing which I think should be failing. Why is the %? not eating the % character?
Someone please show me the light :)
Edit: The answer I used was Dav below: ^(%\S{3}|[^%\s]\S{2})
It was really a two-part answer, though, and Alan's is what made me understand why. I didn't use his version, ^(?>%?)\S{3}, because although it works, it does not work in the JavaScript implementation. Both are great answers and a lot of help.
The word for the behavior you described isn't greedy, it's possessive. Normal, greedy quantifiers match as much as they can originally, but back off if necessary to allow the whole regex to match (I like to think of them as greedy but accommodating). That's what's happening to you: the %? originally matches the leading percent sign, but if there aren't enough characters left for an overall match, it gives up the percent sign and lets \S{3} match it instead.
Some regex flavors (including Java and PHP) support possessive quantifiers, which never back off, even if that causes the overall match to fail. .NET doesn't have those, but it has the next best thing: atomic groups. Whatever you put inside an atomic group acts like a separate regex--it either matches at the position where it's applied or it doesn't, but it never goes back and tries to match more or less than it originally did just because the rest of the regex is failing (that is, the regex engine never backtracks into the atomic group). Here's how you would use it for your problem:
^(?>%?)\S{3}
If the string starts with a percent sign, the (?>%?) matches it, and if there aren't enough characters left for \S{3} to match, the regex fails.
Note that atomic groups (or possessive quantifiers) are not necessary to solve this problem, as @Dav demonstrated. But they're very powerful tools which can easily make the difference between impossible and possible, or too damn slow and slick as can be.
Regex will always try to match the whole pattern if it can - "greedy" doesn't mean "will always grab the character if it exists", but instead means "will always grab the character if it exists and a match can be made with it grabbed".
Instead, what you probably want is something like this:
^(%\S{3}|[^%\s]\S{2})
Which will match either a % followed by 3 characters, or a non-%, non-whitespace followed by 2 more.
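The backtracking behaviour and the alternation fix can be checked with a short Python sketch that runs the test cases from the question. The names NAIVE, FIXED, EXPECTED and passes are illustrative only.

```python
import re

NAIVE = re.compile(r"^%?\S{3}")                # %? may give the % back
FIXED = re.compile(r"^(%\S{3}|[^%\s]\S{2})")   # explicit alternation

# Expected results, taken from the list in the question.
EXPECTED = {
    "AB": False, "ABC": True, "ABCDEFG": True,
    "%": False, "%AB": False, "%ABC": True,
    "%ABCDEFG": True, "%%AB": True,
}

def passes(pattern, text):
    """True if `pattern` matches at the start of `text`."""
    return pattern.search(text) is not None

for text, expected in EXPECTED.items():
    print(text, passes(NAIVE, text), passes(FIXED, text), expected)
```

The only case where the two patterns disagree is %AB: the optional % is released during backtracking so that \S{3} can consume all three characters.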
I always love to look at RE questions to see how much time people spend on them to "Save time"
str.len() >= (str[0]=='%' ? 4 : 3)
Although in real life I'd be more explicit, I just wrote it that way because for some reason some people consider code brevity an advantage (I'd call it an anti-advantage, but that's not a popular opinion right now)
Try the regex modified a little based on Dav's original one:
^(%\S{3,}|[^%\s]\S{2,})
with the regex option "^ and $ match at line breaks" on.

regex for finding method that have the expression

I'm trying to find methods that contain the expression createNamedQuery.
I tried
public[\d\D]*?createNamedQuery
but the match starts at the first method in the file; I want only the method that contains the expression createNamedQuery.
I'm guessing that your problem is that you cannot find all of them, but rather just the first one. The way to retrieve all of them differs from language to language. For instance, in Python, one would do something similar to the following:
import re
my_data = "some long piece of data that may or may not contain what you're looking for."
# finditer yields match objects; findall would return plain strings
for m in re.finditer(r"public[\d\D]*?createNamedQuery", my_data):
    print("A match found at position %s" % m.start())
Basically, just try to FindAll. Doing a single Match will only give you the first one.
If you're trying to find the shortest sequence that starts with "public" and ends with "createNamedQuery", using a non-greedy quantifier isn't good enough. That will just match from the first occurrence of "public" to the first (subsequent) occurrence of "createNamedQuery". To find the shortest sequence, you have to make sure the part between the two sentinels ("public" and "createNamedQuery") can't match the first sentinel again. Here's one way:
public(?:(?!public).)*?createNamedQuery
This isn't the fastest or most robust regex, just the simplest way (I think) to demonstrate the principle. Depending on which regex flavor you're using, I would make use of word boundaries (most flavors) and either atomic groups (Perl, .NET) or possessive quantifiers (Java, PHP, JGSoft).
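To see the difference between the two patterns, here is a small Python sketch on a made-up two-method snippet (the sample source text is an assumption for illustration). re.DOTALL is used so that the dot can cross line breaks, which the original [\d\D] did implicitly.

```python
import re

# Hypothetical sample input: only the second method calls createNamedQuery.
source = (
    "public void first() { doSomething(); }\n"
    "public void second() { em.createNamedQuery(\"q\"); }\n"
)

# Non-greedy only: starts at the first "public" and stretches until it
# reaches createNamedQuery.
naive = re.search(r"public[\d\D]*?createNamedQuery", source)

# The guarded dot refuses to step over another "public", so the match has to
# start at the closest preceding one.
tempered = re.search(
    r"public(?:(?!public).)*?createNamedQuery", source, re.DOTALL
)

print(naive.start(), tempered.start())
print(tempered.group(0))
```

The negative lookahead is checked before every single character, which is why this form is simple but not the fastest.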