Regex Notepad++: How to remove everything except the URL? - regex

I have sitemap like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://mywebsite.com/article1</loc>
<lastmod>2014-08-10</lastmod>
<changefreq>monthly</changefreq>
</url>
<url>
<loc>http://mywebsite.com/article2</loc>
<lastmod>2014-08-10</lastmod>
<changefreq>monthly</changefreq>
</url>
<url>
<loc>http://mywebsite.com/article3</loc>
<lastmod>2014-08-10</lastmod>
<changefreq>monthly</changefreq>
</url>
</urlset>
I only want to keep the URLs that are inside <loc>. Do you know a way to match everything else and replace it with nothing? Thank you very much!

If your desired result is like this:
http://mywebsite.com/article1
http://mywebsite.com/article2
http://mywebsite.com/article3
search for:
\h*<url\b.*?(http[^<]+).*?</url>|<.*?>\s*
and replace with the captured URL (captured in the first parenthesized group):
\1
\h matches any horizontal whitespace; [^<]+ matches one or more characters that are not <.
Be sure to check the ". matches newline" checkbox so that . also matches \r and \n.
Also see example and explanation on regex101.com
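For readers working outside Notepad++, the same replacement can be sketched in Python. This is only an illustration under a few assumptions: Python's re module has no \h, so [ \t] stands in for it; the re.DOTALL flag plays the role of the checkbox mentioned above; and the sample string is a shortened copy of the sitemap from the question.

```python
import re

# Shortened copy of the sitemap from the question (two entries instead of three).
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://mywebsite.com/article1</loc>
<lastmod>2014-08-10</lastmod>
<changefreq>monthly</changefreq>
</url>
<url>
<loc>http://mywebsite.com/article2</loc>
<lastmod>2014-08-10</lastmod>
<changefreq>monthly</changefreq>
</url>
</urlset>"""

# Same pattern as above, with [ \t] substituted for \h (an assumption for Python).
# First alternative: a whole <url>...</url> block, capturing the URL in group 1.
# Second alternative: any other tag plus trailing whitespace, which is dropped.
pattern = re.compile(r"[ \t]*<url\b.*?(http[^<]+).*?</url>|<.*?>\s*", re.DOTALL)

# When the second alternative matches, group 1 is unset; Python 3.5+ then
# substitutes an empty string for \1.
urls_only = pattern.sub(r"\1", sitemap)
print(urls_only)
```

The \b after <url is what keeps the first alternative from matching <urlset, so the wrapper element falls through to the second alternative and is removed.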

It seems like you intend to match what's inside the <loc> elements. A multi-line regex matching the content could do the job: (http.*)

You can use this regex to match everything except the URLs and replace it with nothing:
.*<url>.*\n?.*<loc>|<\/loc>(.*\n?){4}<\/url>

Related

regex to match link inside xml with last mod

<?xml version='1.0' encoding='UTF-8'?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://google.com/2020/08/this1.html</loc><lastmod>2020-08-06T11:30:55Z</lastmod></url>
<url><loc>https://google.com/2020/08/this2.html</loc><lastmod>2020-08-05T11:30:06Z</lastmod></url>
<url><loc>https://google.com/2020/08/this3.html</loc><lastmod>2020-08-06T11:29:25Z</lastmod></url>
</lastmod></url></urlset>
I'm trying to extract from the XML above the links that have a lastmod of 2020-08-06.
My regex is https:.+2020-08-05.+<\/url
but it ends up matching everything from the first link to the last one.
I want to match only
<url><loc>https://google.com/2020/08/this1.html</loc><lastmod>2020-08-06T11:30:55Z</lastmod></url>
<url><loc>https://google.com/2020/08/this3.html</loc><lastmod>2020-08-06T11:29:25Z</lastmod></url>
/<loc>(.+)<\/loc>.*2020-08-06/g
This captures the group between the loc tags.
Demo and explanation here:
https://regex101.com/r/HBvG3K/8
A very easy and stupid regex - see regexr:
.*<lastmod>2020-08-06.*
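Both suggestions can be tried quickly in Python. The sketch below assumes, as in the question, that each <url> entry sits on its own line (without a flag, . does not cross line breaks, which is what keeps each match inside one entry). The sample is the XML from the question with the closing line tidied to </urlset>.

```python
import re

sitemap = """<?xml version='1.0' encoding='UTF-8'?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://google.com/2020/08/this1.html</loc><lastmod>2020-08-06T11:30:55Z</lastmod></url>
<url><loc>https://google.com/2020/08/this2.html</loc><lastmod>2020-08-05T11:30:06Z</lastmod></url>
<url><loc>https://google.com/2020/08/this3.html</loc><lastmod>2020-08-06T11:29:25Z</lastmod></url>
</urlset>"""

# First suggestion: capture only the link between the loc tags.
links = re.findall(r"<loc>(.+)</loc>.*2020-08-06", sitemap)

# Second suggestion: match the whole line of every entry with that date.
entries = re.findall(r".*<lastmod>2020-08-06.*", sitemap)

print(links)
print(entries)
```

If several entries shared one line, the greedy .+ and .* would run across them, so a lazy quantifier or a real XML parser would be the safer choice there.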

regexp for hashtag/mention in href

My goal is to build HTML hashtags; for this I need to wrap text prefixed with # in
<a class="tag"><span class="hash">#</span>text</a>
I want a regexp that gives me the words marked with # and #, but I have some trouble with URLs like this:
http://gitlab.com/#xxx or https://medium.com/#erikdkennedy
My example string:
<p>Some text <span class="highlighted">#test</span><br />
gitlab.com/#xxx<br />
<code>some feature</code></p>
My regexp is:
(?!.*(<mail-link|link))#([a-zA-Z0-9]+)
I get 2 matches, #test and the last #xxx (https://regex101.com/r/pXxIkf/1).
How can I match only test, and not match inside the href definition?
Thank you!
Try this:
(?<=\>)(?:[\s]*(?:#|#))([a-zA-Z0-9]+)
(?<=>) Positive lookbehind to make sure that there is a > before the hashtag.
(?: Start of a non-capturing group.
[\s]* Optional whitespace.
(?:#|#) Non-capturing group that makes sure there is either # or #.
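As a rough check, the lookbehind idea can be run against the example string from the question in Python. Two things in this sketch are assumptions rather than part of the answer: the two identical alternatives in (?:#|#) are collapsed into a single #, and the optional whitespace is captured in its own group so that the substitution can put it back.

```python
import re

html = """<p>Some text <span class="highlighted">#test</span><br />
gitlab.com/#xxx<br />
<code>some feature</code></p>"""

# A hashtag only counts when it directly follows a closing > (plus optional
# whitespace); the #xxx in the URL is preceded by / and is therefore skipped.
hashtag = re.compile(r"(?<=>)(\s*)#([a-zA-Z0-9]+)")

tags = [m.group(2) for m in hashtag.finditer(html)]

# Wrap each hashtag in the markup requested in the question.
linked = hashtag.sub(r'\1<a class="tag"><span class="hash">#</span>\2</a>', html)

print(tags)
print(linked)
```

Note that this only finds hashtags that come immediately after a tag; a hashtag in the middle of plain text, such as "Some #text", would not be matched by this pattern.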

Regex Pattern to Match A Href and Remove

I am trying to create a regex to match all a href links that contain my domain and I will end up removing the links. It is working fine until I run into an a href link that has another HTML tag within the tag.
Regex Statement:
(<a[^<]*coreyjansen\.com[^<]*>)([^"]*?)(<\/a>)
It matches the a href links in this statement with no problem
Need a lawyer? Contact <span style="color: #000000">Random text is great Corey is awesome</span>
It is unable to match both of the a href links in this statement:
<strong><a href="http://coreyjansen.com/"><img class="alignright size-full
wp-image-12" src="http://50h0.com/wp-content/uploads/2014/06/lawyers.jpg"
alt="lawyers" width="250" height="250" /></a>
I have been trying to play with the negated character set with no luck. If I remove the negated character set, it ends up matching two links that are right after each other, such as in example 2, as one match.
The issue here is that [^<]*> matches everything up to the last > it can reach. That's the greedy behaviour of the * quantifier. You can make it non-greedy by appending ? after the asterisk (which you already do in another part of your pattern). It will then match only up to the first occurrence of >. You then have to change the middle part of your regex too, so that it catches everything up to the first </a> tag, like this:
(<a[^<]*coreyjansen\.com[^<]*?>)(.*?)(<\/a>)
Use the regex below, which matches only the a tag:
(<a[^>]*coreyjansen\.com[^>]*>)
Example data
<strong><a href="http://coreyjansen.com/"><img class="alignright size-full
wp-image-12" src="http://50h0.com/wp-content/uploads/2014/06/lawyers.jpg"
alt="lawyers" width="250" height="250" /><a href="http://coreyjansen.com/"/>
The regex above will match every a tag that contains your required domain.
I'm playing with the following regex and it seems to be working:
<a.*coreyjansen\.com.*</a>
It captures anything between anchor tags that contain your site name. I am using JavaScript pattern matching from www.regexpal.com; depending on the language it could be slightly different.
You need to match the start of the tag, <a, and then match the address before the > character. You are matching the wrong character. Once that matches, everything between <a> and </a> is the displayed link. I don't know why you require that part not to contain quotes: tag attributes usually have their values inside quotes, so what you need is to match everything except the link's closing tag </a>. That is done with ((?!string to not match).)* followed by </a>. The resulting regex is:
(<a[^>]*coreyjansen\.com[^>]*>)((?!<\/a>).)*(<\/a>)
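To see how this pattern behaves on the image link from the question, here is a small Python sketch. It makes a few assumptions: re.DOTALL is added because the img tag spans several lines, the repeated group is rewritten as ((?:(?!</a>).)*) so that group 2 holds the whole link body, and the closing </strong> and the example.org link are hypothetical additions used only to show that other links survive.

```python
import re

html = """<strong><a href="http://coreyjansen.com/"><img class="alignright size-full
wp-image-12" src="http://50h0.com/wp-content/uploads/2014/06/lawyers.jpg"
alt="lawyers" width="250" height="250" /></a></strong>
<a href="http://example.org/">another site</a>"""

# Group 1: opening tag mentioning the domain. [^>]* cannot run past the end
# of the tag, so the domain must appear inside the opening tag itself.
# Group 2: link body, consumed one character at a time while </a> is not ahead.
# Group 3: the closing tag.
link = re.compile(
    r"(<a[^>]*coreyjansen\.com[^>]*>)((?:(?!</a>).)*)(</a>)",
    re.DOTALL,
)

without_links = link.sub("", html)   # drop the link and its contents
unwrapped = link.sub(r"\2", html)    # keep the contents, drop only the <a> tags

print(without_links)
print(unwrapped)
```

Whether the whole element or only the surrounding tags should go depends on what "removing the links" means for the page; both variants are shown for that reason.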

Regexp capture unlimited groups

I need a little help here.
So I have string:
{block name="something" param1="param" param2="param"}
it can be:
{block name="something"} or
{block name="something" param1="value" sm="value" ng="value" um="param" .. and so on}.
What I need is to capture all possible params.
What I could figure out so far is {(?<type>[\w]+) ((?<param>[\w]+)="(?<value>[\w]+)"), but it captures only first param - "name" :/
Any help will be appreciated.
Here you need to use \G, which anchors each match at the position where the previous match ended, so the parameters are matched one after another. \h matches any horizontal whitespace character.
(?:^\{(?<type>\w+)|\G)\h*((?<param>\w+)="(?<value>\w+)")
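Not every regex flavour has \G; Python's built-in re module, for one, does not support it. A different technique that gets the same parameters there is to work in two passes: first validate the block and isolate its attribute list, then collect every name/value pair from that list. The sketch below is only one way to do this; parse_block is a hypothetical helper name and \s replaces \h.

```python
import re

# Pass 1: the whole block, with the attribute list captured as one string.
BLOCK = re.compile(r'\{(?P<type>\w+)(?P<attrs>(?:\s+\w+="\w+")*)\s*\}')
# Pass 2: each individual name="value" pair inside that string.
PARAM = re.compile(r'(\w+)="(\w+)"')


def parse_block(text):
    """Return (type, [(param, value), ...]) or None if text is not a block."""
    match = BLOCK.fullmatch(text)
    if match is None:
        return None
    return match.group("type"), PARAM.findall(match.group("attrs"))


print(parse_block('{block name="something" param1="param" param2="param"}'))
print(parse_block('{block name="something"}'))
```

Like the original pattern, this only accepts \w characters in values, so a value containing spaces or punctuation would need a wider character class such as [^"]*.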

Limiting a character after a wildcard in regex to its first occurrence

How can I tell a character that comes after a wildcard to use the first occurrence of it?
I did the following to find any tag with the word "title" in it:
<(.*?)(title)(.*?)>
but clearly what happens is I end up with the entire tag to the end of
</title>
So that in
<Bla bla ="nametitle">Yada yada</title>
I want
<Bla bla ="nametitle">
but end up with the whole tag.
Please if anyone is offended by the use of parsing html with regex simply move on and accept my apologies for the transgression. I am simply trying to find out how to use the wildcard which I have not used before correctly and apply as I see fit. Thank you.
You can use this regex:
<title.+?>
The above matches <title and goes till it encounters a >
Stop parsing at the first >. Using your example, you could do this with: <(.*?)(title)([^>]*?)>
<(?![\/]).*?title.*?>
This will find title inside any set of < > tags except for closing tags beginning with </
Example:
https://regex101.com/r/QFs4ny/1
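The difference between a lazy wildcard and a wildcard that cannot cross > is easiest to see side by side. The Python sketch below is a slight variation on the patterns above, not a copy of any of them: it replaces both wildcards with [^<>]*, and the second sample string is a made-up case in which the opening tag does not contain "title".

```python
import re

sample = '<Bla bla ="nametitle">Yada yada</title>'
no_title_in_tag = '<p>Yada yada</title>'  # hypothetical extra sample

lazy = r'<.*?title.*?>'                   # lazy, but . may still cross >
bounded = r'<[^<>]*title[^<>]*>'          # never leaves the current tag
no_closing = r'<(?!/)[^<>]*title[^<>]*>'  # same, and skips closing tags

# Lazy matching stops early only when "title" is found early.
print(re.findall(lazy, sample))
# Otherwise it runs through the first > until it reaches </title>.
print(re.findall(lazy, no_title_in_tag))
# The bounded versions stay inside a single pair of < >.
print(re.findall(bounded, no_title_in_tag))
print(re.findall(no_closing, sample))
```

A lazy quantifier only means "as few characters as possible while still allowing a match"; it does not prevent the wildcard from passing a delimiter, which is why the negated character class is the more reliable way to stop at the first >.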