Regex for the value of an HTML Property - regex

I have a load of links that look like this:
Taboola - Content you may like
I want to delete the entire ICON and ADD_DATE attributes and their values.
I'm using Sublime Text with a regex find/replace, but I'm not sure how to write the regex to grab everything between ICON=" and the closing ".
Any help would be appreciated!

This should work (escaping quotes as necessary):
ICON="[^"]*"
The reason ICON="(.*)" won't work is that .* is greedy: if it can match more of the string and still satisfy the pattern, it will. It therefore runs on to the last quote on the line instead of stopping at the attribute's closing quote.
You can either specify a non-greedy search, such as ICON=".*?", or explicitly match only characters that are not quotes, as in the answer above.
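The difference between the three patterns can be checked with a short Python sketch. The bookmark line below is made up for illustration (the original link markup is not shown in the question), and Python's re module stands in for Sublime Text's find/replace, so treat it as an approximation rather than the editor's exact behaviour.

```python
import re

# Hypothetical bookmark line; attribute values are invented for this example.
line = ('<A ICON="data:image/png;base64,AAAA" ADD_DATE="1400000000" '
        'HREF="http://example.com/">Example</A>')

# Greedy: .* runs to the LAST quote on the line, so HREF is swallowed too.
greedy = re.sub(r' ICON=".*"', '', line)

# Negated class and lazy quantifier both stop at the first closing quote.
negated = re.sub(r' ICON="[^"]*"', '', line)
lazy = re.sub(r' ICON=".*?"', '', line)

# Both unwanted attributes removed in one pass with an alternation.
cleaned = re.sub(r' (?:ICON|ADD_DATE)="[^"]*"', '', line)

print(greedy)
print(negated)
print(cleaned)
```

In an editor, the last pattern would go in the find field with an empty replacement.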

Related

Notepad++ replace text with RegEx search result

I would like replace a standard string in a file, with another that is a result of a regular expression. The standard text looks like:
<xsl:variable name="ServiceCode" select="###"/>
I would like to replace ### with a servicecode, that I can find later in the same file, from this URL:
<a href="/Services/xyz" target="_self">
The regular expression (?<=\/Services\/)(.*)(?=\" )
returns the required service code "xyz".
So I opened Notepad++, put "###" in the "Find what" field and this RegEx in the "Replace with" field, and expected that the ### text would be replaced by xyz.
But I got this result:
<xsl:variable name="ServiceCode" select="?<=/Services/.*?=" "/>
I am new to RegEx, do I need to use different syntax in the replace section than I use to find a string? Can someone give me a hint how to achieve the required result? The goal is to standardize tons of files with similar structure as now all servicecodes are hardcoded in several places in the file. Thanks.
You could use a lookahead for capturing the part ahead.
Search for: (?s)###(?=.*/Services/([^"]+)") and replace with: $1
(?s) makes the dot also match newlines (there is also a checkbox available in np++)
[^"] matches a character that is not "
The replacement $1 refers to the text captured by the first parenthesized subpattern.
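The same lookahead idea can be tried outside Notepad++. The sketch below uses Python's re module, which writes the back-reference as \1 rather than $1; the middle line of the sample text is an invented filler, not part of the asker's file.

```python
import re

# Sample text modelled on the question; the <p> line is a made-up filler.
text = (
    '<xsl:variable name="ServiceCode" select="###"/>\n'
    '<p>some other markup</p>\n'
    '<a href="/Services/xyz" target="_self">\n'
)

# (?s) lets . cross newlines. The group sits inside the lookahead, so the
# service code is captured without being consumed: only ### is replaced.
pattern = re.compile(r'(?s)###(?=.*/Services/([^"]+)")')

# Python spells the replacement \1 where Notepad++ accepts $1.
result = pattern.sub(r'\1', text)
print(result)
```

Because .* is greedy, the lookahead settles on the last /Services/ link that follows the placeholder, which only matters if the file contains more than one.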
I am no expert at RegEx but I think I may be able to help. It looks like you might be going at this the wrong way. The regex search that you are using would normally work like this:
Parentheses () in RegEx let you capture part of your match and reuse it in the replace section.
You place (?<=\/Services\/)(.*)(?=\" ) into the "Find what" section in Notepad++.
Then in the "Replace with" section you could use \1 to insert whatever (.*) matched. Only capturing groups are numbered: the lookbehind (?<=\/Services\/) and the lookahead (?=\" ) check the surrounding text without capturing it, so they do not get \2 or \3.
Depending on the structure of your files, you would need to use a RegEx search that selects both lines of code (and the specific parts you need), then use a combination of \1\2\3 etc. to replace everything exactly how it was, except for the ### which you could replace with the \number associated with xyz.
See http://docs.notepad-plus-plus.org/index.php/Regular_Expressions for more info.
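A minimal sketch of that "match both lines and reassemble" approach, written in Python rather than Notepad++ and using a pattern of my own choosing (it is not taken from the answer): one group for the placeholder, one for the text in between, and one for the service code.

```python
import re

# Sample text modelled on the question's two snippets.
text = (
    '<xsl:variable name="ServiceCode" select="###"/>\n'
    '<a href="/Services/xyz" target="_self">\n'
)

# Group 1: the placeholder. Group 2: everything up to and including
# /Services/. Group 3: the service code itself.
pattern = re.compile(r'(###)(.*?/Services/)([^"]+)', re.DOTALL)

# Put everything back as it was, with group 3 in place of group 1.
result = pattern.sub(r'\3\2\3', text, count=1)
print(result)
```

Since this version consumes the text between the placeholder and the link, a single pass handles only one ### per service link.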

Changing some XML tags names but leaving unchanged values between them

In one of my XML files I need to find and replace some opening tag names using regex and Notepad++, while leaving the text between the tags unchanged.
Example:
<uri>http://domain-name.com/41874_01_home_big.jpg</image_big>
I need to change into:
<image_big>http://domain-name.com/41874_01_home_big.jpg</image_big>
I can't simply rename every uri tag, because other elements in the document that are also opened with uri have different closing tags, such as /image_small.
I tried to change it like:
<uri>.*?</image_big>
But I don't know with what I should replace it.
I tried with:
<image_big>\1</image_big>
but result is:
<image_big></image_big>
without any text inside.
I need your help. I'm not good with regex.
Just put .*? inside a capturing group.
<uri>(.*?)<\/image_big>
Then replace the match with <image_big>\1</image_big> or <image_big>$1</image_big>
Your regex <uri>.*?</image_big> matches correctly, but to retrieve the characters matched by .*? you need to put that pattern inside a capturing group so that it can be back-referenced in the replacement.
Find:<uri>(.*?)</image_big>
Replace:<image_big>\1</image_big> or <image_big>$1</image_big>
See demo.
https://www.regex101.com/r/rK5lU1/19
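For anyone who wants to check the capturing-group version outside the editor, here is a small Python sketch. The second input line (closing with /image_small) is an assumption based on the question's description, and Python's re module stands in for Notepad++.

```python
import re

# First line is from the question; the second is a hypothetical sibling
# that must stay untouched.
xml = (
    '<uri>http://domain-name.com/41874_01_home_big.jpg</image_big>\n'
    '<uri>http://domain-name.com/41874_01_home_small.jpg</image_small>'
)

# The parentheses capture the URL so that \1 has something to refer to.
result = re.sub(r'<uri>(.*?)</image_big>', r'<image_big>\1</image_big>', xml)
print(result)
```

Without the parentheses there is no group 1. Notepad++ then inserts nothing, which explains the empty <image_big></image_big>, while Python's re.sub raises an error for the invalid group reference.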

Delphi XE2 Regex: Quantifier does not work inside positive lookbehind?

I have a complete HTML document string from a web page containing this BASE tag:
<BASE href="http://whatreallyhappened.com/">
In Delphi XE2, I use this regular expression with the whole HTML document as subject to get the URL from the BASE tag between the double quotes:
BaseURL := TRegEx.Match(HTMLDocStr, '(?<=<base(\s)href=").*(?=")', [roIgnoreCase]).Value;
This works, but only if there is only ONE space character in the subject between BASE and href.
I tried to add a quantifier to the space part in the regex (\s), but it did not work.
So how can I make this regex match the URL even if there are several spaces between BASE and href?
You're making this far too complicated by using lookaround. If you want to extract only part of the regex match, simply add a capturing group. Then you can use the text matched by the capturing group instead of the overall match. In most cases you'll also get much better performance this way.
To find the base tag in a file and extract its URL you can use the regex <base[^>]+href=["']([^"']*)["']. Call TRegex.Match() to get a TMatch. This has a Groups property that you can use to retrieve group 1 if a match was found.
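The capturing-group approach can be illustrated with a short sketch. It is written in Python rather than Delphi, so the calls differ from TRegEx; the surrounding HTML and the extra spaces are made up to exercise the case from the question.

```python
import re

# Hypothetical document with several spaces between BASE and href.
html = '<html><head><BASE   href="http://whatreallyhappened.com/"></head></html>'

# [^>]+ absorbs any whitespace or other attributes before href;
# group 1 holds the URL between the quotes.
pattern = re.compile(r'''<base[^>]+href=["']([^"']*)["']''', re.IGNORECASE)

match = pattern.search(html)
base_url = match.group(1) if match else ''
print(base_url)
```

In Delphi the equivalent value would come from the match's Groups property, as described above.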
With lookaround
There are a few ways to add a quantifier, such as the following (note that the match then starts at the whitespace and includes href="):
(?<=<BASE)\s+href=".*(?=")
(?<=<BASE)\s{0,30}href=".*(?=")
Without lookaround
By the way, if you just want the content of href, there is no need for lookaround; you can simply use a capturing group:
<BASE\s+href="(.*?)"
EDIT: after reading your comments I figured out a workaround (ugly but could work). You can try using something like this:
((?<=<BASE\shref=")|(?<=<BASE\s\shref=")|(?<=<BASE\s\s\shref=")).*(?=")
Note the \s, \s\s and \s\s\s in the three lookbehind alternatives.
I know that this is horrible, but if none of the above works you can try that.

Clear Regex for "URL Contains"

I'm always stymied by regular expressions. My tool has a filtering option for "Current URL Matches Regex (case insensitive)" but I'm not sure how to write the regular expression for my needs. I'd love to figure out how to write a regex that would ONLY trigger for URLs that contain ANY of these 5 strings anywhere in URL:
Product=Neo-Supreme
Product=Cordura
Product=Hawaiian
Product=Animal%20Deluxe
Product=Camo
If Product can only ever be one of those 5 options, the regex you need is something along the lines of
'Product\=[^&]+'
which matches any Product value up to the next &.
If the product can be something other than one of those 5 options, you'll need to list them explicitly:
'Product\=(Neo-Supreme|Cordura|Hawaiian|Animal%20Deluxe|Camo)'
EDIT for comments:
To match anything you can always use .*, which matches on any number of any character (except a newline, unless otherwise specified).
'.*seat-option.*Product\=(Neo-Supreme|Cordura|Hawaiian|Animal%20Deluxe|Camo).*'
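A quick way to sanity-check the alternation is a small Python sketch. The URLs and the url_matches helper are hypothetical, and re.IGNORECASE is assumed to mirror the tool's "case insensitive" option.

```python
import re

# The five allowed products, matched anywhere in the URL.
PRODUCT_RE = re.compile(
    r'Product=(Neo-Supreme|Cordura|Hawaiian|Animal%20Deluxe|Camo)',
    re.IGNORECASE,
)

def url_matches(url: str) -> bool:
    """Return True if the URL contains one of the five Product values."""
    return PRODUCT_RE.search(url) is not None

# Hypothetical URLs for illustration.
print(url_matches('http://example.com/seat-option?Product=Cordura&color=red'))
print(url_matches('http://example.com/seat-option?Product=Leather'))
```

Because a search finds the pattern anywhere in the string, no leading or trailing .* is needed here. The value is not anchored at its end, so a URL with Product=Camouflage would also match; appending (?:&|$) to the pattern is one way to tighten that if it matters.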

replacing all open tags with a string

Before somebody points me to that question, I know that one can't parse html with regex :) And this is not what I am trying to do.
What I need is:
Input: a string containing html.
Output: the same string, with every opening tag <tag> replaced by
***<tag>
So if I get
<a><b><c></a></b></c>, I want
***<a>***<b>***<c></a></b></c>
as output.
I've tried something like:
(<[~/].+>)
and replace it with
***$1
But it doesn't really seem to work the way I want it to. Any pointers?
Clarification: it's guaranteed that there are no self closing tags nor comments in the input.
You just have two problems: ^ is the character to exclude items from a character class, not ~; and the .+ is greedy, so will match as many characters as possible before the final >. Make the quantifier lazy, and use * rather than + so that one-letter tags such as <a> still match:
(<[^/].*?>)
You can also probably drop the parentheses and replace with $0 or $&, depending on the language.
Try using: (<[^/].*?>) and replace it with ***$1
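The behaviour of these patterns on the question's own input can be compared with a short Python sketch. Python writes the whole-match reference as \g<0>, which plays the role of $0 or $& here. The sketch also shows why * rather than + matters for one-letter tag names.

```python
import re

html = '<a><b><c></a></b></c>'

# Lazy with *: each opening tag is matched on its own.
lazy_star = re.sub(r'<[^/].*?>', r'***\g<0>', html)

# Greedy: .+ runs to the final >, so the whole string is one match.
greedy = re.sub(r'<[^/].+>', r'***\g<0>', html)

# Lazy with +: at least one character must follow the tag's first letter,
# so a one-letter tag such as <a> drags the next tag into the match.
lazy_plus = re.sub(r'<[^/].+?>', r'***\g<0>', html)

print(lazy_star)
print(greedy)
print(lazy_plus)
```

Only the first variant produces the output the question asks for on this input.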