Regex to match a link inside XML by its lastmod

<?xml version='1.0' encoding='UTF-8'?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://google.com/2020/08/this1.html</loc><lastmod>2020-08-06T11:30:55Z</lastmod></url>
<url><loc>https://google.com/2020/08/this2.html</loc><lastmod>2020-08-05T11:30:06Z</lastmod></url>
<url><loc>https://google.com/2020/08/this3.html</loc><lastmod>2020-08-06T11:29:25Z</lastmod></url>
</urlset>
I'm trying to extract the links from the XML above whose lastmod is 2020-08-06.
My regex is https:.+2020-08-05.+<\/url
but it matches everything from the first link through the last one.
I want to match only:
<url><loc>https://google.com/2020/08/this1.html</loc><lastmod>2020-08-06T11:30:55Z</lastmod></url>
<url><loc>https://google.com/2020/08/this3.html</loc><lastmod>2020-08-06T11:29:25Z</lastmod></url>

/<loc>(.+)<\/loc>.*2020-08-06/g
This captures the URL between the loc tags in group 1.
Demo and explanation here:
https://regex101.com/r/HBvG3K/8

A very simple, brute-force regex (see regexr) that matches every line containing that lastmod date:
.*<lastmod>2020-08-06.*
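
The over-matching in the question comes from .+ being greedy and free to run across several <url> entries. A minimal Python sketch of one way around that, assuming the sitemap is available as a string and every entry has the <loc> then <lastmod> layout shown above: [^<]+ cannot cross a tag boundary, so each match stays inside a single entry. The function name links_modified_on is made up for this illustration.

```python
import re

# Sample sitemap from the question, joined onto one line to mimic minified XML.
SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://google.com/2020/08/this1.html</loc>"
    "<lastmod>2020-08-06T11:30:55Z</lastmod></url>"
    "<url><loc>https://google.com/2020/08/this2.html</loc>"
    "<lastmod>2020-08-05T11:30:06Z</lastmod></url>"
    "<url><loc>https://google.com/2020/08/this3.html</loc>"
    "<lastmod>2020-08-06T11:29:25Z</lastmod></url>"
    "</urlset>"
)


def links_modified_on(xml, day):
    """Return the <loc> values whose <lastmod> starts with the given YYYY-MM-DD day."""
    pattern = re.compile(
        # [^<]+ and [^<]* stop at the next "<", so a match cannot span two entries.
        r"<url><loc>([^<]+)</loc><lastmod>" + re.escape(day) + r"[^<]*</lastmod></url>"
    )
    return pattern.findall(xml)


for link in links_modified_on(SITEMAP, "2020-08-06"):
    print(link)
```

For anything beyond a quick filter, an XML parser is the more robust choice; this sketch only shows how the negated character class keeps the match local.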

Related

Grab only the first or the last match

I need some help with a regex that does not quite work:
/(?<=([H|h][i|I])+\w+\>)(.*)(?=(\<))/
I have a few XML documents and need to extract the errorMessage and the errorCode from them. They do not all use the same syntax: the tag name is sometimes errorMessage, sometimes ERRORTEXT, and sometimes Error_Messages.
An example:
<?xml version="1.0" encoding="UTF-8"?>
<n0:szemelyKutyaFuleResponse xmlns:prx="urn:sap.comproxy:SWP:/1SAI/TREASE1243804269AE457508F4:753" xmlns:n0="http://csajgeneratorws.tny.interfesz.kok.lo/">
<return>
<tanzakciosAzonosito>46981682-4637-49d2-bd4d-dcfff543742ed</tanzakciosAzonosito>
<eredmeny>HIBAS</eredmeny>
<errorCode>TSH08</errorCode>
<errorMessage>Azonosítószám már hozzá lett rendelve üzleti partnerhez</errorMessage>
</return>
</n0:szemelyKutyaFuleResponse>
I think I need two regexes:
one to find the text TSH08 in errorCode,
and another to find Azonosítószám már hozzá lett rendelve üzleti partnerhez in errorMessage.
Please help, thanks.
If you just want the content of each tag, which is what I understood from your question, then perhaps something like these:
For the first regex:
<errorCode>([^<>]+)</errorCode>
(?<=<errorCode>)[^<>]+(?=</errorCode>)
For the second regex:
<errorMessage>([^<>]+)</errorMessage>
(?<=<errorMessage>)[^<>]+(?=</errorMessage>)
You can also merge them with a | between the two if you don't care which tag matched.
A | can also be used if the tag's name might differ, like this:
<(?:errorMessage|ERRORTEXT|Error_Messages)>([^<>]+)</(?:errorMessage|ERRORTEXT|Error_Messages)>
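
As a rough illustration of the alternation idea, here is a small Python sketch. It is not the answerer's exact pattern: it captures the tag name and reuses it with the backreference \1, so the closing tag has to match the opening one. The helper name extract_error and the trimmed sample string are assumptions made for this example.

```python
import re

# Trimmed version of the response from the question.
RESPONSE = """<return>
<errorCode>TSH08</errorCode>
<errorMessage>Azonosítószám már hozzá lett rendelve üzleti partnerhez</errorMessage>
</return>"""

CODE_RE = re.compile(r"<errorCode>([^<>]+)</errorCode>")
# Group 1 is the tag name, group 2 the text; \1 forces the same name in the closing tag.
MESSAGE_RE = re.compile(r"<(errorMessage|ERRORTEXT|Error_Messages)>([^<>]+)</\1>")


def extract_error(xml):
    """Return (code, message); either item is None when its tag is absent."""
    code = CODE_RE.search(xml)
    message = MESSAGE_RE.search(xml)
    return (
        code.group(1) if code else None,
        message.group(2) if message else None,
    )


print(extract_error(RESPONSE))
```

The same backreference trick would apply to errorCode if that tag also turns up under several names.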

Regular Expressions: Lookback to only the first occurrence (non-greedy lookback?)

Here's the problem:
XML:
<userPermissions>
<enabled>true</enabled>
<name>ViewPublicReports</name>
</userPermissions>
<userPermissions>
<enabled>true</enabled>
<name>ViewRoles</name>
</userPermissions>
<userPermissions>
<enabled>true</enabled>
<name>ViewSetup</name>
</userPermissions>
What I'm trying to match is:
<userPermissions>
<enabled>true</enabled>
<name>ViewRoles</name>
</userPermissions>
Every pattern I've managed to put together matches from the first <userPermissions> onward:
(?<=<userPermissions>)[\s\S]+?ViewRoles[\s\S]*?<\/userPermissions>
I'm not sure how to make the backwards match from "ViewRoles" non-greedy.
Thanks in advance for your help.
*Edit: I'm using a tool that deploys metadata between Salesforce instances, which are captured as XML. The tool provides a "find/replace" functionality that uses regex for the "find." I don't have the option of using an XML parser.
This pattern matches that tag:
<userPermissions>(?:(?!</userPermissions>)[\S\s])*?ViewRoles[\S\s]*?</userPermissions>
Formatted:
<userPermissions>
(?:
(?! </userPermissions> )
[\S\s]
)*?
ViewRoles
[\S\s]*?
</userPermissions>
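
To see why the negative lookahead helps, here is a short Python sketch that runs the same pattern against the sample from the question. The lookahead is checked before every character consumed between the opening tag and ViewRoles, so the match can never step over a </userPermissions> on its way there. The variable names are illustrative only.

```python
import re

PROFILE = """<userPermissions>
<enabled>true</enabled>
<name>ViewPublicReports</name>
</userPermissions>
<userPermissions>
<enabled>true</enabled>
<name>ViewRoles</name>
</userPermissions>
<userPermissions>
<enabled>true</enabled>
<name>ViewSetup</name>
</userPermissions>"""

BLOCK_RE = re.compile(
    r"<userPermissions>"
    r"(?:(?!</userPermissions>)[\s\S])*?"  # any character, unless a closing tag starts here
    r"ViewRoles"
    r"[\s\S]*?"                            # lazily run to the nearest closing tag
    r"</userPermissions>"
)

match = BLOCK_RE.search(PROFILE)
print(match.group(0))
```

Starting from the first <userPermissions>, the engine reaches that block's closing tag before it finds ViewRoles, the lookahead fails, and the search moves on to the second block.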
As has already been said, the correct way to extract this is with an XML parser. However, you can also use the following regex:
(.+\n){2}.+ViewRoles.+\n.+
It matches the following structure, so it only works when every block has exactly this four-line layout:
2 rows without restrictions
a row that includes "ViewRoles"
another row without restrictions

Trying to get v. simple Regex to work on XML file

This is my snippet of XML (the actual full file is 6964 lines):
<?xml version="1.0" encoding="UTF-8"?>
<listings xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.gstatic.com/localfeed/local_feed.xsd">
<language>en</language>
<id>43927</id>
<cell1>Andover House</cell1>
<cell2>28-30 Camperdown</cell2>
<cell3>Great Yarmouth</cell3>
<cell4>NR30 3JB</cell4>
<cell5>GB</cell5>
<cell6>52.6003767</cell6>
<cell7>1.7339649</cell7>
<cell8>+44 1493843490</cell8>
<category>British</cell9>
<cell10>http://contentadmin.livebookings.com/dynamaster/image_archive/original/f24c60a52e7ac0874be57e51bce30726.jpg</cell10>
<cell11>http://www.bookatable.co.uk/andover-house-great-yarmouth-norfolk</cell11>
For each category tag in the above snippet, I would simply like to prepend this text: Restaurants - (with one space after the hyphen).
So the final result will be:
<category>Restaurants - British</category>
I am very new to regex and find it very difficult. This is what I've tried so far: https://regex101.com/r/yY5jB6/2
It looks like it works on regex101, but when I bring it into a text editor such as Sublime Text 2 (on Mac) or Notepad++ (on Windows) using find/replace (with regex enabled in the settings), it says it can't find anything. Please help! Thanks!
Notepad++ uses \1 instead of $1. If you change your substitution from
$1Restaurants - to \1Restaurants - then it should work. (sourced from this question)
If you search for
<category>([^<]*)<\/.*>
and replace it with
<category>Restaurants - $1</category>
it will even work with your strange input that contains a </cell9> closing tag.
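
The same search and replace can be sketched in Python, where the group reference in the replacement is written \1 or, less ambiguously, \g<1>. One deliberate change from the pattern above: the closing tag is matched with [^>]* instead of .* so that it cannot run past the first > when several tags share a line. The function name prefix_categories is hypothetical.

```python
import re

# Opening <category>, the text, then whatever closing tag follows (even a mismatched one).
CATEGORY_RE = re.compile(r"<category>([^<]*)</[^>]*>")
REPLACEMENT = r"<category>Restaurants - \g<1></category>"


def prefix_categories(xml):
    """Prepend 'Restaurants - ' to each category and normalise its closing tag."""
    return CATEGORY_RE.sub(REPLACEMENT, xml)


print(prefix_categories("<category>British</cell9>"))
```

Running it twice would add the prefix twice, so in practice it should be applied to the original file only.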

Regex Notepad ++: How to remove everything except url?

I have sitemap like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://mywebsite.com/article1</loc>
<lastmod>2014-08-10</lastmod>
<changefreq>monthly</changefreq>
</url>
<url>
<loc>http://mywebsite.com/article2</loc>
<lastmod>2014-08-10</lastmod>
<changefreq>monthly</changefreq>
</url>
<url>
<loc>http://mywebsite.com/article3</loc>
<lastmod>2014-08-10</lastmod>
<changefreq>monthly</changefreq>
</url>
</urlset>
I only want to keep the URL inside <loc>. Do you know a way to match everything else and replace it with nothing? Thank you very much!
If your desired result is like this:
http://mywebsite.com/article1
http://mywebsite.com/article2
http://mywebsite.com/article3
search for:
\h*<url\b.*?(http[^<]+).*?</url>|<.*?>\s*
and replace with the captured URL (the first parenthesized group):
\1
\h matches any horizontal whitespace; [^<]+ matches one or more characters that are not <.
Be sure to tick the checkbox that makes . match \r and \n.
Also see the example and explanation on regex101.com.
It seems like you intend to match what's inside <loc> elements. A multi-line regex matching their content could do the job: (http.*)
You can use this regex to match everything except the URLs and replace it with nothing:
.*<url>.*\n?.*<loc>|<\/loc>(.*\n?){4}<\/url>
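
Outside an editor, it is often simpler to extract the URLs than to delete everything around them. A minimal Python sketch of that alternative, assuming the sitemap has been read into a string; this swaps the "replace the rest with nothing" approach for a plain findall and is not what the answers above do.

```python
import re

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://mywebsite.com/article1</loc>
<lastmod>2014-08-10</lastmod>
<changefreq>monthly</changefreq>
</url>
<url>
<loc>http://mywebsite.com/article2</loc>
<lastmod>2014-08-10</lastmod>
<changefreq>monthly</changefreq>
</url>
<url>
<loc>http://mywebsite.com/article3</loc>
<lastmod>2014-08-10</lastmod>
<changefreq>monthly</changefreq>
</url>
</urlset>"""

# Capture the text of each <loc>, tolerating whitespace around the URL.
LOC_RE = re.compile(r"<loc>\s*([^<\s]+)\s*</loc>")

urls = LOC_RE.findall(SITEMAP)
print("\n".join(urls))
```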

Limiting a character after a wildcard in regex to its first occurrence

How can I tell a character that comes after a wildcard to use the first occurrence of it?
I did the following to find any tag with the word "title" in it:
<(.*?)(title)(.*?)>
but clearly what happens is I end up with the entire tag to the end of
</title>
So that in
<Bla bla ="nametitle">Yada yada</title>
I want
<Bla bla ="nametitle">
but end up with the whole tag.
Please if anyone is offended by the use of parsing html with regex simply move on and accept my apologies for the transgression. I am simply trying to find out how to use the wildcard which I have not used before correctly and apply as I see fit. Thank you.
You can use this regex:
<title.+?>
The above matches <title and continues until it encounters the first >.
Stop parsing at the first >. Using your example, you could do this with: <(.*?)(title)([^>]*?)>
<(?![\/]).*?title.*?>
This will find title inside any set of < > tags except for closing tags beginning with </
Example:
https://regex101.com/r/QFs4ny/1
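
A small Python comparison may make the difference concrete. A lazy .*? is still allowed to cross a > when that is the only way to reach the word title, whereas a negated class such as [^<>]* can never leave the tag it started in. The second sample string and the pattern names are made up for this illustration.

```python
import re

# Lazy wildcard: prefers short matches but may still cross ">" to find "title".
LOOSE = re.compile(r"<.*?title.*?>")
# Negated class plus a lookahead: stays inside one tag and skips closing tags.
BOUNDED = re.compile(r"<(?!/)[^<>]*title[^<>]*>")

question_sample = '<Bla bla ="nametitle">Yada yada</title>'
mixed_sample = '<p>Intro</p><a name="subtitle">x</a></title>'

print(LOOSE.search(mixed_sample).group(0))   # starts at <p> and runs into the next tag
print(BOUNDED.findall(mixed_sample))         # only the tag that contains "title"
print(BOUNDED.findall(question_sample))
```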