Greedy and non-greedy regex

I currently have this regex: this\.(.*)?\s[=,]\s. However, I have run into a problem I cannot fix.
I tried the following regex, which works, but it also captures the space, which I don't want: this\.(.*)?(?<=\s)=|(?<!\s),. What I'm trying to do is match identifier names. An example of the input and the result I want is this:
this.""W = blah; which would match ""W. The second regex above does this almost perfectly, but it also captures the space before the = in the first group. Can someone point me in the right direction to fix this?
EDIT: The reason for not simply using [^\s] in the wildcard group is that sometimes I can get lines like this: this. "$ = blah;
EDIT2: Now I have another issue. It's not matching lines like param1.readBytes(this.=!3,0,param1.readInt()); properly. Instead of matching =!3 it's matching =!3,0. Is there a way to fix this? Again, I cannot simply use [^,] because there could be a name like param1.readBytes(this.,3$,0,param1.readInt()); which should match ,3$.

(.*) will match any character including whitespace.
To force it not to end in whitespace change it to (.*[^\s])
Eg:
this\.(.*[^\s])?\s?[=,]\s
For your second edit, it seems like you are doing a language parser. Even though regular expressions are powerful, they do have limits. You need a grammar parser for that.
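As a rough illustration of the difference, here is a minimal Python sketch (Python's re module is an assumption here; the question does not name a regex engine). The helper first_group is a hypothetical name used only for this comparison.

```python
import re

# The asker's second attempt: "(.*)?" is greedy, so group 1 runs right up
# to the "=" and swallows the space in front of it.
WITH_SPACE = re.compile(r'this\.(.*)?(?<=\s)=|(?<!\s),')

# The suggested fix: group 1 must end in a non-whitespace character,
# and the optional space is matched outside the group.
TRIMMED = re.compile(r'this\.(.*[^\s])?\s?[=,]\s')


def first_group(pattern, line):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = pattern.search(line)
    return m.group(1) if m else None


line = 'this.""W = blah;'
print(repr(first_group(WITH_SPACE, line)))  # '""W ' (trailing space)
print(repr(first_group(TRIMMED, line)))     # '""W'
```

This only covers the first example line; the cases from the second edit are not handled by either pattern.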

Maybe you can tell in your first block to capture non space characters, instead of any.
this\.(\S*)?(?<=\s)=|(?<!\s),

Related

Regex for value.contains() in Google Refine

I have a column of strings, and I want to use a regex to find commas or pipes in every cell, and then take an action. I tried this, but it doesn't work (no syntax error, it just matches neither commas nor pipes).
if(value.contains(/(,|\|)/), ...
The funny thing is that the same regex works with the same data in SublimeText. (Yes, I can work it there and then reimport, but I would like to understand what's the difference or what is my mistake).
I'm using Google Refine 2.5.
Since value.match should return captured texts, you need to define a regex with a capture group and check if the result is not null.
Also, pay attention to the regex itself: the string should be matched in its entirety:
Attempts to match the string s in its entirety against the regex pattern p and returns an array of capture groups.
So, add .* before and after the pattern you are looking for inside a larger string:
if(value.match(/.*([,|]).*/) != null)
You can use a combination of if and isNonBlank like:
if(isNonBlank(value.match(/your regex/)), ...
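The "match the string in its entirety" behaviour can be sketched outside Refine as well. The following is a Python analogy only, not GREL: re.fullmatch is used as a stand-in for the whole-string semantics described above, and both function names are hypothetical.

```python
import re


def bare_pattern_matches(cell):
    # Whole-string match with only the separator pattern: succeeds only
    # when the cell consists of exactly one "," or "|".
    return re.fullmatch(r'([,|])', cell) is not None


def padded_pattern_matches(cell):
    # Same pattern wrapped in ".*" on both sides, so the separator can
    # appear anywhere inside a longer cell value.
    return re.fullmatch(r'.*([,|]).*', cell) is not None


print(bare_pattern_matches('a,b'))    # False
print(padded_pattern_matches('a,b'))  # True
print(padded_pattern_matches('a|b'))  # True
print(padded_pattern_matches('ab'))   # False
```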

RegEx to match acronyms

I am trying to write a regular expression that will match values such as U.S., D.C., U.S.A., etc.
Here is what I have so far -
\b([a-zA-Z]\.){2,}+
Note how this expression matches but does not include the last letter in the acronym.
Can anyone help explain what I am missing here?
SOLUTION
I'm posting the solution here in case this helps anyone.
\b(?:[a-zA-Z]\.){2,}
It seems as if a non-capturing group is required here.
Try (?:[a-zA-Z]\.){2,}
?: (non-capturing group) is there because you want to omit capturing the last iteration of the repeated group.
For example, without ?:, 'U.S.A.' will yield a group match 'A.', which you are not interested in.
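One way to see the effect of the group is a small Python sketch (Python is an assumption; the question does not say which engine is in use). In Python, re.findall returns the capture group when the pattern has one, which is where the "last letter only" symptom tends to come from. The sample sentence is made up for illustration.

```python
import re

text = 'Visit the U.S.A. and D.C. today'

# With a capturing group, findall reports only the group, and a repeated
# group keeps just its last iteration.
capturing = re.findall(r'\b([a-zA-Z]\.){2,}', text)

# With a non-capturing group there is no group to report, so findall
# returns each whole match.
non_capturing = re.findall(r'\b(?:[a-zA-Z]\.){2,}', text)

print(capturing)      # ['A.', 'C.']
print(non_capturing)  # ['U.S.A.', 'D.C.']
```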
None of these proposed solutions do what yours does - make sure that there are at least 2 letters in the acronym. Also, yours works on http://rubular.com/ . This is probably some issue with the regex implementation - to be fair, all of the matches that you got were valid acronyms. To fix this, you could either:
Make sure there's a space or EOF succeeding your expression ((?=\s|$) in ruby at least)
Surround your regex with ^ and $ to make sure it catches the whole string. You'd have to split the whole string on spaces to get matches with this though.
I prefer the former solution - to do this you'd have:
\b([a-zA-Z]\.){2,}(?=\s|$)
Edit: I've realized this doesn't actually work with other punctuation in the string, and a couple of other edge cases. This is super ugly, but I think it should be good enough:
(?<=\s|^)((?:[a-zA-Z]\.){2,})(?=[[:punct:]]?(?:\s|$))
This assumes that you've got this [[:punct:]] character class, and allows for 0-1 punctuation marks after an acronym that won't be captured. I've also fixed it up so that there's a single capture group that gets the whole acronym. Check out validation at http://rubular.com/r/lmr0qERLDh
Bonus: you now get to make this super confusing to anyone reading it.
This should work:
/([a-zA-Z]\.)+/g
I have slightly modified the solution above:
\b(?:[a-zA-Z]+\.){2,}
to enable capturing acronyms containing more than one letter between the dots, like in 'GHQ.AFP.X.Y'

How to capture the word between a certain prefix and a terminating character in regex?

I would like to find something like this:
-(IBOutlet)UIView *aView;
I would like to find aView. What I can confirm is that -(IBOutlet) must be the prefix, but it is not guaranteed to be followed by a space or by another string. After that, I need the string that begins with '*' and runs until the ;.
So, my regex look like that:
(IBOutlet)*\*?;
For sure, it can't capture what I want. Any advice?
You just have to build it up incrementally. The best reference that I have found (by far) is http://www.regular-expressions.info. After learning the basics, you can then use one of many online pattern matching tools, here is one:
https://regex101.com
With that, your goal is easily determined (with some allowances for free space):
^\s*-\s*\(IBOutlet\)(\w*)\s*(\*\w*)
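To check what the groups of this pattern contain, here is a minimal Python sketch (Python is an assumption; the question is about Objective-C source text, not about a particular regex engine). OUTLET_NAME_ONLY is a hypothetical variant, not part of the answer, that moves the asterisk outside the group and requires the closing semicolon.

```python
import re

# The pattern from the answer, unchanged.
OUTLET = re.compile(r'^\s*-\s*\(IBOutlet\)(\w*)\s*(\*\w*)')

# Hypothetical variant: the "*" sits outside group 2, so the group holds
# only the variable name, and the trailing ";" must be present.
OUTLET_NAME_ONLY = re.compile(r'^\s*-\s*\(IBOutlet\)\s*(\w*)\s*\*\s*(\w+)\s*;')

line = '-(IBOutlet)UIView *aView;'

print(OUTLET.match(line).groups())            # ('UIView', '*aView')
print(OUTLET_NAME_ONLY.match(line).groups())  # ('UIView', 'aView')
```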
First problem: you don't have a capturing group so how do you get aView back after the match?
Second, the \*? means "match the * character literally, 0 or 1 times", which I guess isn't what you want either.
Try this pattern:
(IBOutlet)*\*(.+);
RegEx 101 can explain what each component means.

Simple regex for matching up to an optional character?

I'm sure this is a simple question for someone at ease with regular expressions:
I need to match everything up until the character #
I don't want the string following the # character, just the stuff before it, and the character itself should not be matched. This is the most important part, and what I'm mainly asking. As a second question, I would also like to know how to match the rest, after the # character. But not in the same expression, because I will need that in another context.
Here's an example string:
topics/install.xml#id_install
I want only topics/install.xml. And for the second question (separate expression) I want id_install
First expression:
^([^#]*)
Second expression:
#(.*)$
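A minimal Python sketch of the two expressions applied to the example string (the question is about C#; Python is used here only for illustration, and the function names are hypothetical):

```python
import re


def before_hash(s):
    # "^([^#]*)" always matches, possibly with an empty group.
    return re.search(r'^([^#]*)', s).group(1)


def after_hash(s):
    # "#(.*)$" only matches when a "#" is present.
    m = re.search(r'#(.*)$', s)
    return m.group(1) if m else None


s = 'topics/install.xml#id_install'
print(before_hash(s))                    # topics/install.xml
print(after_hash(s))                     # id_install
print(after_hash('topics/install.xml'))  # None
```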
[a-zA-Z0-9]*[\#]
If your string contains any other special characters you need to add them into the first square bracket escaped.
I don't use C#, but I will assume that it uses PCRE... if so,
"([^#]*)#.*"
with a call to 'match'. A call to 'search' does not need the trailing ".*"
The parens define the 'keep group'; the [^#] means any character that is not a '#'
You probably tried something like
"(.*)#.*"
and found that it fails when multiple '#' signs are present (keeping the leading '#'s)?
That is because ".*" is greedy, and will match as much as it can.
Your matcher should have a method that looks something like 'group(...)'. Most matchers
return the entire matched sequence as group(0), the first paren-matched group as group(1),
and so forth.
PCRE is so important that I strongly encourage you to search for it on Google, learn it, and always have it in your programming toolkit.
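The greedy behaviour described in this answer can be reproduced with a short Python sketch (Python's re.match is assumed as the 'match' call; the input 'a#b#c' is a made-up example with more than one '#'):

```python
import re

s = 'a#b#c'

# Greedy ".*" grabs as much as it can, then backs off only as far as the
# LAST "#", so the group still contains an earlier "#".
greedy = re.match(r'(.*)#.*', s).group(1)

# "[^#]*" cannot cross a "#" at all, so the group stops at the FIRST one.
negated = re.match(r'([^#]*)#.*', s).group(1)

print(greedy)   # a#b
print(negated)  # a
```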
Use look ahead and look behind:
To get all characters up to, but not including the pound (#): .*?(?=\#)
To get all characters following, but not including the pound (#): (?<=\#).*
If you don't mind using groups, you can do it all in one shot:
(.*?)\#(.*) Your answers will be in group(1) and group(2). Notice the non-greedy construct, *?, which will attempt to match as little as possible instead of as much as possible.
If you want to allow for missing # section, use ([^\#]*)(?:\#(.*))?. It uses a non-collecting group to test the second half, and if it finds it, returns everything after the pound.
Honestly though, for your situation, it is probably easier to use the Split method provided in String.
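For reference, the lookaround expressions and the optional-fragment version from this answer behave as follows in a minimal Python sketch (Python is an assumption, not the asker's language; split_fragment is a hypothetical helper name):

```python
import re

s = 'topics/install.xml#id_install'

# Lookahead: everything up to, but not including, the "#".
head = re.search(r'.*?(?=\#)', s).group(0)

# Lookbehind: everything after, but not including, the "#".
tail = re.search(r'(?<=\#).*', s).group(0)


def split_fragment(text):
    # Optional second half: group 2 is None when there is no "#".
    return re.match(r'([^\#]*)(?:\#(.*))?', text).groups()


print(head)                                  # topics/install.xml
print(tail)                                  # id_install
print(split_fragment(s))                     # ('topics/install.xml', 'id_install')
print(split_fragment('topics/install.xml'))  # ('topics/install.xml', None)
```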
More on lookahead and lookbehind
first:
/[^\#]*(?=\#)/ edit: is faster than /.*?(?=\#)/
second:
/(?<=\#).*/
For something like this in C# I would usually skip the regular expressions stuff altogether and do something like:
string[] split = exampleString.Split('#');
string firstString = split[0];
string secondString = split[1];

How to continue a match in Regex

price:(?:(?:\d+)?(?:\.)?\d+|min)-?(?:(?:\d+)?(?:\.)?\d+|max)?
This Regex matches the following examples correctly.
price:1.00-342
price:.1-23
price:4
price:min-900.00
price:.10-.50
price:45-100
price:453.23-231231
price:min-max
Now I want to improve it to match these cases.
price:4.45-8.00;10.45-14.50
price:1.00-max;3-12;23.34-12.19
price:1.00-2.50;min-12;23.34-max
Currently the match stops at the semi colon. How can I get the regex to repeat across the semi-colon dividers?
Final Solution:
price:(((\d*\.)?\d+|min)-?((\d*\.)?\d+|max)?;?)+
Add an optional ; at the end, and make the whole pattern match one or more times:
price:((?:(?:\d+)?(?:\.)?\d+|min)-?(?:(?:\d+)?(?:\.)?\d+|max)?;?)+
(?:\d+)? is the same thing as \d*, and (?:\.)? can just be \.?. Simplified, your original regex is:
price:(?:\d*\.?\d+|min)(?:-(?:\d*\.?\d+|max))?
You have two choices. You can either do price([:;]range)* where range is the regex you have for matching number ranges, or be more precise about the punctuation but have to write out range twice and do price:range(;range)*.
price([:;]range)* -- shorter but allows first ':' to be ';'
price:range(;range)* -- longer but gets colon vs semi-colon correct
Pick one of these two regexes:
price(?:[:;](?:\d*\.?\d+|min)(?:-(?:\d*\.?\d+|max))?)*
price:(?:\d*\.?\d+|min)(?:-(?:\d*\.?\d+|max))?(?:;(?:\d*\.?\d+|min)(?:-(?:\d*\.?\d+|max))?)*
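The two shapes, price([:;]range)* and price:range(;range)*, can be assembled from a single range fragment instead of being written out by hand. The following is a minimal Python sketch under that assumption; RANGE, LOOSE and STRICT are hypothetical names, and re.fullmatch is used so that the whole input has to match.

```python
import re

# The simplified range fragment from this answer.
RANGE = r'(?:\d*\.?\d+|min)(?:-(?:\d*\.?\d+|max))?'

# price([:;]range)*: shorter, but the first ":" may also be ";".
LOOSE = re.compile(rf'price(?:[:;]{RANGE})*')

# price:range(;range)*: longer, but keeps ":" and ";" in their places.
STRICT = re.compile(rf'price:{RANGE}(?:;{RANGE})*')

sample = 'price:1.00-2.50;min-12;23.34-max'
print(bool(STRICT.fullmatch(sample)))     # True
print(bool(LOOSE.fullmatch(sample)))      # True
print(bool(STRICT.fullmatch('price;4')))  # False
print(bool(LOOSE.fullmatch('price;4')))   # True
```

Building the pattern from a fragment keeps the range definition in one place, at the cost of a slightly less readable final regex.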
First there are some issues with your regular expression: to match xx.yyy instead of the expression (?:\d+)?(?:\.)?\d+ you can use this (?:\d*\.)?\d+. This can only match in one way so it avoids unnecessary backtracking.
Also currently your regular expression matches things like price:minmax and price:1.2.3 which I assume you do not want to match.
The simple way to repeat your match is to add a semi-colon and then repeat your regular expression verbatim.
You can do it like this, though, to avoid writing out the entire regular expression twice:
price:(?:(?:(?:\d*\.)?\d+|min)(?:-(?:(?:\d*\.)?\d+|max))?(?:;|$))*
See it in action on Rubular.
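As a quick check of the pattern above against the cases from the question, here is a minimal Python sketch (the answer links to Rubular, so Ruby was presumably used there; Python and the helper name is_price_list are assumptions). re.fullmatch is used so that a partial match does not count.

```python
import re

PRICE = re.compile(
    r'price:(?:(?:(?:\d*\.)?\d+|min)(?:-(?:(?:\d*\.)?\d+|max))?(?:;|$))*'
)


def is_price_list(text):
    """True when the whole string is a price filter of ';'-separated ranges."""
    return PRICE.fullmatch(text) is not None


print(is_price_list('price:4.45-8.00;10.45-14.50'))       # True
print(is_price_list('price:1.00-max;3-12;23.34-12.19'))   # True
print(is_price_list('price:1.00-2.50;min-12;23.34-max'))  # True
print(is_price_list('price:minmax'))                      # False
print(is_price_list('price:1.2.3'))                       # False
```

Because the outer group is repeated with *, a bare 'price:' with no ranges is also accepted; replacing * with + would be one way to require at least one range.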
price:((?:(?:\d+)?(?:\.)?\d+|min)-?(?:(?:\d+)?(?:\.)?\d+|max)?;?)+
I'm not sure what's up with all of the ?'s (I know the syntax, I just don't know why you're using it so much), but that should do it for you.