VB.NET Regular Expressions [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I have this HTML code:
<td class="Class 1">Example</td><td class="Class2">Other Example</td>
and I am trying to use Regular Expressions in VB.NET to extract "Example" and "Other Example"
Dim parsedtext As MatchCollection = Regex.Matches(htmlcode, ">(.+)<")
(the htmlcode variable contains the html code mentioned above as a string.)
However, parsedtext(0).Groups(0) returns ">Example</td><td class="Class2">Other Example<". I do not understand why this is happening, and I have tried many other pattern strings without success. How would one extract all text between two specific characters, such as > and < in the example above?

I agree with @ColeJohnson (no one on SO is allowed to believe otherwise, at this point), but it's a good example for teaching the concept of greedy versus non-greedy matching.
By default, regular expression quantifiers (+, *, ?) "eat up" as much as possible and give characters back only when a later part of the pattern fails to match. That's called greedy matching. To make a quantifier non-greedy, append a ? to it: +?, *?, ??.
That is,
">(.+?)<"
In other words, your .+ first matched as many characters as possible (everything up to the end of the string) and then backtracked one character at a time until the following < could match. The first place where that works is the last < in the input, so your output was to be expected.
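A minimal sketch of the difference, written in Python rather than VB.NET (greedy and lazy quantifiers behave the same way in both engines). The variable names and the third pattern, [^<]+, are illustrative additions and not part of the answer above. It is included because, on this particular input, the lazy pattern still picks up a tag where two tags sit directly next to each other.

```python
import re

# The HTML snippet from the question.
html = '<td class="Class 1">Example</td><td class="Class2">Other Example</td>'

# Greedy: .+ runs to the end of the string, then backs up to the last "<".
greedy = re.findall(r">(.+)<", html)

# Lazy: .+? stops at the first "<" it can reach, but it must still consume
# at least one character, so the "><" between adjacent tags is not skipped.
lazy = re.findall(r">(.+?)<", html)

# Negated class (assumption, not from the answer): [^<]+ can never cross a "<".
negated = re.findall(r">([^<]+)<", html)

print(greedy)   # ['Example</td><td class="Class2">Other Example']
print(lazy)     # ['Example', '<td class="Class2">Other Example']
print(negated)  # ['Example', 'Other Example']
```

For anything beyond a one-off snippet like this, an HTML parser is the more robust choice, as the linked duplicate argues.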


Find and replace a Regex pattern occurring more than once [duplicate]

This question already has answers here:
How can I match overlapping strings with regex?
(6 answers)
Matching when an arbitrary pattern appears multiple times
(1 answer)
Closed 2 years ago.
I'm trying to find-and-replace instances where consecutive commas appear throughout a string; replacing them w/ something like ",N/A,". I was using a very simple /,,/g pattern, and that works on things like ",,abc" and ",,,,abc" (with even numbers of commas). However, it doesn't catch things like ",,,abc". That's because the first two commas are considered a match, and then the third comma is just considered part of a new ",abc" string. Is there a way to handle this w/ a RegEx pattern or options? Otherwise, I'm going to need to perform multiple searches.
FWIW - I'm working in JavaScript, but I'm guessing this is just a general RegEx question/answer.
The reason /,,/g matches only once on three commas is that a global match resumes right after the last consumed character. You need a way to match the pattern ,, without consuming both commas.
If your language supports it, use a positive lookahead. A positive lookahead lets a regex check that some additional characters follow, without consuming them as part of the match.
/,(?=,)/g
In English, this means:
,    # match a comma, then
(?=  # start a lookahead group: its contents must follow, but are not consumed
,    # a comma
)    # end of the lookahead group
See more about this here: https://www.regular-expressions.info/lookaround.html
JavaScript supports positive lookahead. :)
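As a rough illustration, here is the same idea in Python, whose re module also supports positive lookahead. The sample strings come from the question; the variable names are made up for this sketch. Because the lookahead pattern consumes only one comma per match, the replacement is ",N/A" rather than ",N/A,".

```python
import re

samples = [",,abc", ",,,abc", ",,,,abc"]

# Naive: every match consumes two commas, so neighbouring pairs are missed.
naive = [re.sub(r",,", ",N/A,", s) for s in samples]

# Lookahead: only the first comma is consumed; the second is merely checked,
# so it is still available as the start of the next match.
fixed = [re.sub(r",(?=,)", ",N/A", s) for s in samples]

print(naive)  # [',N/A,abc', ',N/A,,abc', ',N/A,,N/A,abc']
print(fixed)  # [',N/A,abc', ',N/A,N/A,abc', ',N/A,N/A,N/A,abc']
```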

regex ${something} [duplicate]

This question already has answers here:
Regex to get string between curly braces
(16 answers)
Closed 2 years ago.
How do I use regex to get what is inside of a ${} enclosed value such as:
Dont care about this. ${This is the thing I want regex to return} Not this either.
Should return:
This is the thing I want regex to return
I've tried \${ }$, which is the best I got from messing around on regex101.com.
I'll be honest, I have no idea what I'm doing as far as regex goes.
I'm using this in C++, but would also (if possible) like to use it in the Geany text editor.
I suggest \${[^}]*}. Note that $ has a special meaning in regular expressions and needs to be escaped with a \ to be read literally.
I use [^}]* instead of .* between the braces to avoid one long match that spans several values, as in:
${Another} match, more then one ${on the same line}
[^}] means anything but }.
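A small sketch of why the negated class matters, in Python for brevity (the question is about C++, so treat this only as an illustration of the pattern itself). The capture group and the extra escaping of the braces are assumptions added here so that only the text inside ${...} is returned.

```python
import re

line = "${Another} match, more then one ${on the same line}"

# [^}]* cannot run past a closing brace, so each ${...} is matched separately.
inside = re.findall(r"\$\{([^}]*)\}", line)

# .* is greedy and runs to the last "}" on the line, merging both values.
merged = re.findall(r"\$\{(.*)\}", line)

print(inside)  # ['Another', 'on the same line']
print(merged)  # ['Another} match, more then one ${on the same line']
```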
What you want is to match the starting ${ and the ending } with any number of characters in between: \$\{.*\}. The special part here is the .*: . means any character and * means the thing in front of it can be matched 0 or more times.
Since you want the matched results, you might also want to wrap it in (): (\$\{.*\}). The parentheses make the regex remember what they enclose for later use.
See this Stack Overflow question on how to get the results back:
How to match multiple results using std::regex

Regex with equals sign not working in Notepad++ [duplicate]

This question already has answers here:
Regex plus vs star difference? [duplicate]
(9 answers)
Closed 4 years ago.
Issue
I am having trouble matching lines with nothing but repeating equals signs (=) in Notepad++. I'm searching a plain text document for any line that begins with "=" and ends with "=". For instance, all these should match my regex:
========================
==================
==
=================================================
, etc.
My code
This is my regex:
^(=*)$
Not only does the regex not find equals signs, it falsely finds blank lines instead.
Rationale
^ = line begins with an equals sign.
=* = find any sequence of one or more equals signs
$ = line ends with an equals sign
But my regex doesn't work. There must be some strange exception in Notepad++ because I verified that equals signs don't have to be escaped in JavaScript and my regex works fine at this online regex tester:
https://regex101.com/
Links I've found that haven't been fruitful
regex matching expression notepad++
Difficulties with adding spaces around equal signs using regular expressions with Notepad++
http://docs.notepad-plus-plus.org/index.php/Regular_Expressions
https://www.icewarp.com/support/online_help/203030104.htm
My questions
Why is my regex only returning blank lines?
If my regex is wrong, please explain why and what the correct regex is to do my find. Eventually I will do a replace as well, but I didn't want to cloud the issue.
Feedback on proposed duplicate
It was suggested that this post might be a duplicate of this question. This post is not a duplicate, and here is why:
Even if the content of the two posts were similar:
For a user to find the suggested post with "plus vs. star" in the title would suggest that he/she already had some idea what the problem was (i.e., use "plus" instead of "star").
Anyone else having this problem and being in my same predicament wouldn't necessarily know that plus vs star was the issue.
If the suggested post had come up as a possible answer when I searched "equals sign regex not working notepad++", I wouldn't have had to take my time to write this post.
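* is zero or more.
+ is one or more.
Because =* can match zero characters, ^(=*)$ also matches blank lines. Replace * with + in your regex.
So your regex would be ^(=+)$. This matches only lines that consist entirely of = characters and skips everything else, including blank lines.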
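To see the difference outside Notepad++, here is a hedged Python sketch. It checks each line with re.fullmatch, which stands in for the ^...$ anchors; the sample lines and names are invented for the illustration.

```python
import re

lines = ["========", "", "==", "not = a rule", ""]

# =* accepts zero equals signs, so empty lines also pass.
with_star = [line for line in lines if re.fullmatch(r"=*", line)]

# =+ requires at least one equals sign, so empty lines are rejected.
with_plus = [line for line in lines if re.fullmatch(r"=+", line)]

print(with_star)  # ['========', '', '==', '']
print(with_plus)  # ['========', '==']
```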

Understanding Regex expression [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 6 years ago.
I have a file that an application checks against the following regex:
[\x00-\x1F\x7F&&[^\x0A]&&[^\x0D]]
Can anyone please tell me exactly what this regex means? I do know that it ignores line feed and carriage return. I even validated my file on http://regexr.com/ with the regex above and it shows no match, so I don't understand why the regex matches in the application.
FYI: I do not want the regex to match my file, because a match stops my processing.
It could be that the application uses a flavor such as Java or Ruby, where && inside a character class means character class intersection, while http://regexr.com/ doesn't support that syntax and tries to match literal & symbols instead. The regex you posted means: match any character from \x00 to \x1F, or \x7F, as long as it is not \x0A or \x0D.
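Python's re module has no && intersection either, so the sketch below spells out the equivalent class by hand: the control range with \x0A and \x0D simply left out. The function name is hypothetical, and this is only an approximation of what the Java-style pattern would accept, useful for checking a file's contents outside the application.

```python
import re

# \x00-\x1F plus \x7F, with \x0A (line feed) and \x0D (carriage return) omitted.
control_except_newlines = re.compile(r"[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]")

def has_forbidden_control(text: str) -> bool:
    """Return True if text contains a control character other than LF or CR."""
    return control_except_newlines.search(text) is not None

print(has_forbidden_control("line1\r\nline2"))  # False
print(has_forbidden_control("col1\tcol2"))      # True (tab is \x09)
```

Note that a tab character counts as a match, which is one plausible reason a file that looks clean still trips the check.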

Regex to first occurrence only? [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 1 year ago.
Let's say I have the following string:
this is a test for the sake of
testing. this is only a test. The end.
and I want to select this is a test and this is only a test. What in the world do I need to do?
The following Regex I tried yields a goofy result:
this(.*)test (I also wanted to capture what was between it)
returns this is a test for the sake of testing. this is only a test
It seems like this is probably something easy I'm forgetting.
The regex is greedy, meaning it will capture as many characters as it can that fit the .* match. To make it non-greedy, try:
this(.*?)test
The ? modifier makes it capture as few characters as possible in the match.
Andy E and Ipsquiggle have the right idea, but I want to point out that you might want to add word boundary assertions, so that you don't match words that merely contain "this" or "test", only the words by themselves. In Perl and similar flavors that's done with the \b marker.
As it is, this(.*?)test would match "thistles are the greatest", which you probably don't want.
The pattern you want is something like this: \bthis\b(.*?)\btest\b
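The three variants can be compared side by side. This is a sketch in Python, which uses the same \b and *? syntax; the question's two-line string is joined with a space here, as in the output the asker quoted.

```python
import re

text = "this is a test for the sake of testing. this is only a test. The end."

greedy = re.findall(r"this(.*)test", text)
lazy = re.findall(r"this(.*?)test", text)
bounded = re.findall(r"\bthis\b(.*?)\btest\b", text)

print(greedy)   # [' is a test for the sake of testing. this is only a ']
print(lazy)     # [' is a ', ' is only a ']
print(bounded)  # [' is a ', ' is only a ']

# The word boundaries only make a difference on input like this:
tricky = "thistles are the greatest"
print(re.search(r"this(.*?)test", tricky) is not None)      # True
print(re.search(r"\bthis\b(.*?)\btest\b", tricky) is None)  # True
```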
* is a greedy quantifier. That means it matches as much as possible, which is what you are seeing. Depending on the regex support in your language, you will need a non-greedy quantifier. Usually this is written with a trailing question mark: *?. It stops consuming characters as soon as the rest of the regex can be satisfied.
There is a good explanation of greediness here.
For me, simply removing the /g flag worked.
See https://regex101.com/r/EaIykZ/1