Regular Expression does not remove html comment? - regex

I have the following string:
<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>
I want to end up with:
<TD>6949</TD>
but instead I end up with just the tags and no information:
<TD></TD>
This is the regular expression I am using:
RegEx.Replace("<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>","<!--.*-->","")
Can someone explain how to keep the numbers and remove just the comments? Also, if possible, can someone explain why this is happening?

.* is a greedy quantifier, which matches as much as possible.
It's matching everything up to the last -->.
Change it to .*?, which is a lazy quantifier.

.* is greedy, so it will match as many characters as possible: in this case, from the opening of the first comment to the end of the second. Changing it to .*? fixes this, because the ? makes the quantifier lazy, which is to say it matches as few characters as possible. [^>]* also works here, for a different reason: it cannot cross a >, so it cannot run past the end of the first comment.
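For illustration, here is a minimal sketch of the greedy versus lazy difference. It is written in Python rather than the .NET call from the question, on the assumption that the two engines treat this particular pattern the same way.

```python
import re

html = "<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>"

# Greedy: .* runs on to the LAST "-->", so both comments and the number vanish.
greedy = re.sub(r"<!--.*-->", "", html)

# Lazy: .*? stops at the FIRST "-->" that follows each "<!--".
lazy = re.sub(r"<!--.*?-->", "", html)

print(greedy)  # <TD></TD>
print(lazy)    # <TD>6949</TD>
```

If a comment can span several lines, the dot also has to be allowed to match newlines (re.DOTALL in Python).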

Parsing HTML with Regex is always going to be tricky. Instead, use something like HTML Agility Pack, which will allow you to query and parse HTML in a structured manner.
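HTML Agility Pack is a .NET library. As a rough sketch of the same parser-based idea, the example below swaps in Python's standard html.parser module instead. The class name TextCollector is made up for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes, ignores tags and comments."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; comment bodies never arrive here.
        self.chunks.append(data)

    # handle_comment is deliberately not overridden: the default does nothing.


parser = TextCollector()
parser.feed("<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>")
parser.close()
text = "".join(parser.chunks)
print(text)  # 6949
```

The parser decides what counts as a comment, so no quantifier tuning is needed.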

Related

Regex to remove a whole phrase from the match

I am trying to remove a whole phrase from my regex (PCRE) matches.
Given the following strings:
test:test2:test3:test4:test5:1.0.department
test:test2:test3:test4:test5:1.0.foo.0.bar
user.0.display
"test:test2:test3:test4:test5:1.0".division
I want to write regex that will return:
.department
.foo.0.bar
user.0.display
.division
I thought a good way to do this would be to match everything and then remove test:test2:test3:test4:test5:1.0 and "test:test2:test3:test4:test5:1.0", but I am struggling to do this.
I tried the following:
\b(?!(test:test2:test3:test4:test5:1\.0)|("test:test2:test3:test4:test5:1\.0"))\b.*
but this seems to remove just the first test from each string and that's all. Could anyone help with where I am going wrong, or suggest a better approach?
I suggest searching for the following pattern:
"?test:test2:test3:test4:test5:1\.0"?
and replacing it with an empty string.
The quotation marks on both ends are made optional with a ? (0 or 1 times) quantifier.
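A minimal sketch of that replacement, using Python's re module rather than PCRE itself (the pattern uses no PCRE-only syntax). The names PREFIX and results are just for illustration.

```python
import re

# Optional quote, the fixed phrase, optional quote.
PREFIX = re.compile(r'"?test:test2:test3:test4:test5:1\.0"?')

lines = [
    "test:test2:test3:test4:test5:1.0.department",
    "test:test2:test3:test4:test5:1.0.foo.0.bar",
    "user.0.display",
    '"test:test2:test3:test4:test5:1.0".division',
]

# Replace the phrase with an empty string; lines without it pass through unchanged.
results = [PREFIX.sub("", line) for line in lines]

for r in results:
    print(r)
```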

pattern for get all tags

I have this sample code:
<ul><li>aaa</li><li>bbb</li><li>ccc</li></ul>
I need to get aaa, bbb, ccc tags, and I wrote this pattern:
/<a .* class=\"tag\">(.*?)<\/a>/
But this returns the wrong results.
What is happening, and how can I resolve it?
You made your second .* non-greedy, but not your first. Because of this greedy matching, it was matching everything from the opening <a right through to the end of the third opening <a. The simple fix is to make the first non-greedy too:
<a .*? class=\"tag\">(.*?)<\/a>
That said, depending on what you have available in your language of choice, and whether or not you're ever expecting an (even very slightly) different HTML string, an HTML parser might be a better choice.
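To see the effect of the first quantifier, here is a small sketch in Python. The input string is an assumption: the sample in the question shows no <a> tags, so a single line of three hypothetical links with class="tag" is used instead.

```python
import re

# Hypothetical input: three tag links on one line.
html = (
    '<a href="/t/aaa" class="tag">aaa</a>'
    '<a href="/t/bbb" class="tag">bbb</a>'
    '<a href="/t/ccc" class="tag">ccc</a>'
)

# First .* greedy: it swallows everything up to the LAST ' class="tag">'.
greedy = re.findall(r'<a .* class="tag">(.*?)</a>', html)

# Both quantifiers lazy: each link is matched on its own.
lazy = re.findall(r'<a .*? class="tag">(.*?)</a>', html)

print(greedy)  # ['ccc']
print(lazy)    # ['aaa', 'bbb', 'ccc']
```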

Regex match between two regex expressions

This has been driving me crazy, I can't find a solution that works!
I'm trying to do a regex between a couple of tags, bad idea I've heard but necessary this time :P
What I have at the start is a <body class="foo"> where foo can vary between files - <body.*?> search works fine to locate the only copy in each file.
At the end I have a <div id="bar">, bar doesn't change between files.
eg.
<body class="foo">
sometext
some more text
<maybe even some tags>
<div id="bar">
What I need to do is select everything between the two tags but not including them - everything between the closing > on body and the opening < on div - sometext to maybe even some tags.
I've tried a bunch of things, mostly variations on (?<=<body.*>)(.*?)(?=<div id="bar">) but I'm actually getting invalid expressions at worst on notepad++, http://regexpal.com/ and no matches at best.
Any help appreciated!
You are attempting to use a variable-length lookbehind, which most regular expression flavors, including the one in notepad++, do not support. I assume you are using notepad++, so you can use the \K escape sequence instead.
<body[^>]*>\K.*?(?=<div id="bar">)
The \K escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included. Make sure you have the . matches newline checkbox checked as well.
Alternatively, you can use a capturing group and avoid using lookaround assertions.
<body[^>]*>(.*?)<div id="bar">
Note: Using a capturing group, you can refer to group index "1" to get your match result.
Use the following pattern:
/<body[^>]*>(.*?)<div id="bar">/
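Outside notepad++, the capturing-group version is easy to try. A minimal sketch in Python, where re.DOTALL plays the role of the ". matches newline" checkbox (Python's built-in re module does not support \K, so only the group approach is shown):

```python
import re

doc = (
    '<body class="foo">\n'
    "sometext\n"
    "some more text\n"
    "<maybe even some tags>\n"
    '<div id="bar">'
)

# DOTALL lets . match newlines, so the lazy group can span several lines.
match = re.search(r'<body[^>]*>(.*?)<div id="bar">', doc, re.DOTALL)

# Group 1 holds everything between the two tags, excluding the tags themselves.
inner = match.group(1)
print(inner.strip())
```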

Clear Regex for "URL Contains"

I'm always stymied by regular expressions. My tool has a filtering option for "Current URL Matches Regex (case insensitive)" but I'm not sure how to write the regular expression for my needs. I'd love to figure out how to write a regex that would ONLY trigger for URLs that contain ANY of these 5 strings anywhere in URL:
Product=Neo-Supreme
Product=Cordura
Product=Hawaiian
Product=Animal%20Deluxe
Product=Camo
Basically the regex you need is something along the lines of
'Product\=[^&]+'
unless you know that the product can be something other than one of those 5 options.
If so, you'll need to use
'Product\=(Neo-Supreme|Cordura|Hawaiian|Animal%20Deluxe|Camo)'
EDIT for comments:
To match anything you can always use .*, which matches on any number of any character (except a newline, unless otherwise specified).
'.*seat-option.*Product\=(Neo-Supreme|Cordura|Hawaiian|Animal%20Deluxe|Camo).*'
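A quick way to check the alternation pattern is sketched below in Python. The URLs and the helper name url_matches are made up for the test, and re.IGNORECASE stands in for the tool's "case insensitive" option. A search-style match is assumed, so the leading and trailing .* are not needed.

```python
import re

PRODUCT_RE = re.compile(
    r"Product=(Neo-Supreme|Cordura|Hawaiian|Animal%20Deluxe|Camo)",
    re.IGNORECASE,
)


def url_matches(url: str) -> bool:
    """Hypothetical helper: True if the URL contains one of the five strings."""
    return PRODUCT_RE.search(url) is not None


# Example URLs are invented for this sketch.
hit = url_matches("https://example.com/seat-option?Product=Cordura&color=black")
encoded = url_matches("https://example.com/seat-option?product=animal%20deluxe")
miss = url_matches("https://example.com/seat-option?Product=Leather")

print(hit, encoded, miss)  # True True False
```

Because nothing anchors the end of the value, a longer value that merely starts with one of the names (Product=Camouflage, say) would match as well.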

How to get regex inner html in ubot studio

Hi, I have some text inside ubot studio that I am trying to scrape.
The function that I use is "find regular expression", and the list item is this:
<td class="amt base">
$136
</td>
How can i get the $136 with regex?
I tried to use the following:
<td class="amt base">(.*)</td>
<td class="amt base">(*)</td>
<td class="amt base">*</td>
But none of them seem to work.
Thanks a lot for sharing your regex knowledge.
How about using the /s flag?
/(?:<td class="amt base">)(.*)(?:<\/td>)/s
/s will treat the string as a single line. With this change, .* will match any character whatsoever, even a newline, which normally it would not match.
?: the non-capturing groups are optional here; they were added so that the pattern has only one capturing group.
\/ the forward slash is escaped because / is the pattern delimiter here.
Important: I did not see a language specified in your post, and /s may not be supported everywhere. For example, JavaScript only gained it in ES2018, and Ruby uses /m for this behaviour instead.
Update
Even though it is uncertain whether it will work with all of your input values, you could try this:
/(?:<td class="amt base">\s+)(.*)(?:\s+<\/td>)/
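The post names no language, so as a hedged illustration, here is how the two variants behave in Python, where re.DOTALL is the equivalent of /s. Whether ubot studio exposes such a flag is not something this sketch assumes.

```python
import re

html = '<td class="amt base">\n$136\n</td>'

# Without the flag, . cannot cross the newline after the opening tag.
without_flag = re.search(r'<td class="amt base">(.*)</td>', html)

# With DOTALL (/s), . matches newlines too, so the group includes them.
with_flag = re.search(r'<td class="amt base">(.*)</td>', html, re.DOTALL)
raw = with_flag.group(1)

# The \s+ variant needs no flag: the newlines are consumed outside the group.
trimmed = re.search(r'<td class="amt base">\s+(.*)\s+</td>', html).group(1)

print(without_flag)  # None
print(repr(raw))     # '\n$136\n'
print(trimmed)       # $136
```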
\d* will match any list of digits.
\$\d* will match a dollar sign and then a string of digits.
The options that you tried do not work because .* stops at the end of each line, and you are attempting to match a multi-line string.
Use this regular expression, which will scrape a dollar sign and the digits after it:
\$\d+
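A short sketch of the difference between \$\d* and \$\d+, again in Python as an assumed stand-in for whatever engine ubot studio uses. The second sample string is invented to show a lone dollar sign.

```python
import re

html = '<td class="amt base">\n$136\n</td>'

# \$\d+ needs at least one digit after the dollar sign.
price = re.search(r"\$\d+", html).group(0)

# Invented sample: \d* also accepts a bare "$", while \d+ skips it.
sample = "cost: $ or $136"
star = re.findall(r"\$\d*", sample)
plus = re.findall(r"\$\d+", sample)

print(price)  # $136
print(star)   # ['$', '$136']
print(plus)   # ['$136']
```

This sidesteps the multi-line problem entirely, at the cost of matching any dollar amount on the page rather than only the one inside that td.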