Trying to get v. simple Regex to work on XML file

This is my snippet of XML (the actual full file is 6964 lines):
<?xml version="1.0" encoding="UTF-8"?>
<listings xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.gstatic.com/localfeed/local_feed.xsd">
<language>en</language>
<id>43927</id>
<cell1>Andover House</cell1>
<cell2>28-30 Camperdown</cell2>
<cell3>Great Yarmouth</cell3>
<cell4>NR30 3JB</cell4>
<cell5>GB</cell5>
<cell6>52.6003767</cell6>
<cell7>1.7339649</cell7>
<cell8>+44 1493843490</cell8>
<category>British</cell9>
<cell10>http://contentadmin.livebookings.com/dynamaster/image_archive/original/f24c60a52e7ac0874be57e51bce30726.jpg</cell10>
<cell11>http://www.bookatable.co.uk/andover-house-great-yarmouth-norfolk</cell11>
For each category tag in the above snippet, I would simply like to prepend this text: Restaurants - (with one space after the hyphen)
So the final result will be:
<category>Restaurants - British</category>
I am very new to Regex and find it very difficult, so this is what I've tried so far: https://regex101.com/r/yY5jB6/2
It looks like it is working in Regex 101 but when I bring it into a text editor like Sublime 2 (on Mac) and Notepad ++ (on Windows) using find/replace (specifying regex in settings), it says it can't find anything. Please help! Thanks!

Notepad++ uses \1 instead of $1. If you change your substitution from
$1Restaurants - to \1Restaurants - then it should work. (sourced from this question)

If you search for
<category>([^<]*)<\/.*>
and replace it with
<category>Restaurants - $1</category>
it would even work with your strange input that contains a </cell9> closing tag.
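The same search and replace can be tried outside a text editor. Below is a minimal Python sketch, assuming the standard re module and the mismatched line from the question. It narrows the trailing .* to [^>]* so the match stops at the first closing bracket, which is a small variation on the pattern above rather than the exact expression from the answer.

```python
import re

# Sample line from the question, including its mismatched closing tag.
line = "<category>British</cell9>"

# Capture the text after <category> up to the next "<", then consume
# whatever closing tag follows (assumption: [^>]* instead of .*).
pattern = re.compile(r"<category>([^<]*)</[^>]*>")

# Python's re.sub writes backreferences as \1 (or \g<1>), not $1.
fixed = pattern.sub(r"<category>Restaurants - \1</category>", line)
print(fixed)
```

Running the substitution twice on the same file would prepend the text twice, so it is worth working on a copy.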

Related

regex to match link inside xml with last mod

<?xml version='1.0' encoding='UTF-8'?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://google.com/2020/08/this1.html</loc><lastmod>2020-08-06T11:30:55Z</lastmod></url>
<url><loc>https://google.com/2020/08/this2.html</loc><lastmod>2020-08-05T11:30:06Z</lastmod></url>
<url><loc>https://google.com/2020/08/this3.html</loc><lastmod>2020-08-06T11:29:25Z</lastmod></url>
</lastmod></url></urlset>
I'm trying to extract the links from the XML above whose lastmod is 2020-08-06.
My regex is https:.+2020-08-05.+<\/url
but it ended up matching everything from the first link to the last.
I want to match only
<url><loc>https://google.com/2020/08/this1.html</loc><lastmod>2020-08-06T11:30:55Z</lastmod></url>
<url><loc>https://google.com/2020/08/this3.html</loc><lastmod>2020-08-06T11:29:25Z</lastmod></url>
/<loc>(.+)<\/loc>.*2020-08-06/g
capturing the group between loc tags
Demo and explanation here:
https://regex101.com/r/HBvG3K/8
A very easy and stupid regex - see regexr:
.*<lastmod>2020-08-06.*
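A likely reason the original attempt matched too much is that .+ is greedy and will happily run across several url entries when they sit on one line. Here is a small Python sketch of a stricter variant, assuming the three entries from the question: it uses [^<]+ so the captured link cannot extend past its own loc element. This is an illustration, not the pattern from either answer.

```python
import re

# The three entries from the question's sitemap.
sitemap = (
    "<url><loc>https://google.com/2020/08/this1.html</loc>"
    "<lastmod>2020-08-06T11:30:55Z</lastmod></url>\n"
    "<url><loc>https://google.com/2020/08/this2.html</loc>"
    "<lastmod>2020-08-05T11:30:06Z</lastmod></url>\n"
    "<url><loc>https://google.com/2020/08/this3.html</loc>"
    "<lastmod>2020-08-06T11:29:25Z</lastmod></url>"
)

# [^<]+ cannot cross a tag boundary, so each match stays inside one <url>.
pattern = re.compile(r"<loc>([^<]+)</loc><lastmod>2020-08-06")

links = pattern.findall(sitemap)
print(links)
```

Because the capture cannot cross a tag, the result is the same whether the entries are on separate lines or all on one line.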

Regex to get comma within xml tags

I am new to regex. I have an xml like
<Root xmlns="rooter"><add>This is an example, test</add></Root>, 123, test, 8765
I want to find only the comma that is inside the XML tags
I have tried
<Root.*\,.*</Root>
and
<Root.*>(\,)
These match the whole XML tag, but I want to match only the comma so that I can replace it with another character.
I want to do the replacement in the Atom text editor. After replacing, the text should look like
<Root xmlns="rooter"><add>This is an example# test</add></Root>, 123, test, 8765
The following regex will work if the text is the same format as you have defined above.
,(?=[^\/<]*<\/)
I have used look ahead here. You can check the link for more details.
https://www.regular-expressions.info/lookaround.html
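To see the lookahead at work outside the editor, here is a short Python sketch using the sample line from the question. The replacement character # is taken from the desired output above. The lookahead only succeeds when the comma is followed by text that reaches a closing tag without passing another < or /.

```python
import re

text = (
    '<Root xmlns="rooter"><add>This is an example, test</add></Root>'
    ", 123, test, 8765"
)

# Match a comma only if a closing tag ("</") follows before any other
# "<" or "/". Commas after </Root> never reach a "</", so they are skipped.
pattern = re.compile(r",(?=[^/<]*</)")

result = pattern.sub("#", text)
print(result)
```

As the answer notes, this relies on the input having the same shape as the sample. Element text that itself contains a / would defeat the lookahead.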

I need to remove everything before and after a variable string in Notepad++ using regex

Hopefully, someone can help me with this. I have a text file that has a list of RSS URLs in XML format on multiple lines. The text file would look like this:
<outline type="rss" text="Tech Viral" title="Tech Viral" xmlUrl="http://feeds.feedburner.com/TechViral" htmlUrl="https://techviral.net"/>
<outline type="rss" text="The Verge" title="The Verge" xmlUrl="http://www.theverge.com/rss/full.xml" htmlUrl="https://www.theverge.com/"/>
<outline type="rss" text="Joystiq" title="Joystiq" xmlUrl="http://www.joystiq.com/rss.xml" htmlUrl="https://www.engadget.com/rss.xml"/>
<outline type="rss" text="BGR" title="BGR" xmlUrl="http://www.boygeniusreport.com/feed/" htmlUrl="http://bgr.com"/>
I want to get rid of everything before :
xmlUrl="
and everything after:
"
So the final output would look like this:
http://feeds.feedburner.com/TechViral
http://www.theverge.com/rss/full.xml
http://www.joystiq.com/rss.xml
http://www.boygeniusreport.com/feed/
Basically, I want only the feed URLs left in the file, one per line. Can anyone help with that? I am using Notepad++ on Windows, but if there is other software that will do this more easily than regular expressions, I'll take any suggestion that gets the job done.
Thanks Guys!
No need for anything fancy:
Find (?m)^.*xmlUrl="([^"]*)".*
Replace $1
Use a lookbehind (?<=):
(?<=xmlUrl=")[^"]+
will match everything preceded by xmlUrl=" up to the next quote ".
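Both suggestions can be compared side by side in a script. The following is a minimal Python sketch, assuming two of the outline lines from the question. Python's re module accepts the lookbehind here because xmlUrl=" has a fixed width, and it writes the replacement backreference as \1 rather than $1.

```python
import re

# Two of the outline lines from the question.
text = "\n".join([
    '<outline type="rss" text="Tech Viral" title="Tech Viral" '
    'xmlUrl="http://feeds.feedburner.com/TechViral" htmlUrl="https://techviral.net"/>',
    '<outline type="rss" text="BGR" title="BGR" '
    'xmlUrl="http://www.boygeniusreport.com/feed/" htmlUrl="http://bgr.com"/>',
])

# Approach 1: lookbehind, collect only the URL itself.
by_lookbehind = re.findall(r'(?<=xmlUrl=")[^"]+', text)

# Approach 2: match each whole line and keep only the captured group.
by_replace = re.sub(r'(?m)^.*xmlUrl="([^"]*)".*', r"\1", text)

print(by_lookbehind)
print(by_replace)
```

The first approach yields a list of URLs, while the second rewrites the text in place, which is closer to what find/replace does in an editor.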

RegEx for mining XML tag content

Fellow Forum Members,
I am using the latest Notepad++. I have 430 separate XML files and my goal is to make a "dmcode" list of all 430 XML files. The dmcode identifies each XML file and looks like the example code shown below. I need help developing a regular expression that will grab the dmCode tag content located between the <dmCode opening tag and the closing /> terminator. I also need this extraction to apply only to dmCode tags that follow the <dmIdent> tag. In other words, any dmCode tag that is not preceded by a <dmIdent> tag should not end up on my Notepad++ search result list. Is a regular expression that can pull targeted data from a lot of XML files possible?
<dmIdent>
<dmCode assyCode="00" disassyCode="00" disassyCodeVariant="00" infoCode="042" infoCodeVariant="A" itemLocationCode="O" modelIdentCode="SASA" subSubSystemCode="6" subSystemCode="0" systemCode="A03" systemDiffCode="XY"/>
As an alternative I have been researching using an XPath expression to accomplish the same task. However, I can't seem to find a NotePad++ XPath plugin that will enable me to specify the data I want to extract from 430 XML files by using an XPath expression instead of a Regular Expression. I will also appreciate it if anyone can provide an example of an XPath expression that will perform the same task I'm trying to accomplish by using a Regular Expression.
Any help will be greatly appreciated.
I know there are plugins for XPath, but I don't know of one that allows you to search several files. The following XPath would match a <dmCode> element that has at least one attribute and is a child of the root element <dmIdent> (use /dmIdent/dmCode/@* to select the attributes themselves):
/dmIdent/dmCode[@*]
I need help in developing a Regular Expression that will grab the dmcode tag content located between the <dmCode opening tag and the closing /> terminator. Also I only need this extraction to only apply to dmCode tags that follow the <dmIdent> tag.
This will work for the most simple cases, where:
<dmCode> is the first child of <dmIdent>
There are no comments, CDATA tags, or similar constructs that could make it fail.
(?i)<dmIdent>\s*<dmCode \K[^"/>]*(?>(?:"[^\\"]*(?:\\.[^\\"]*)*"|/(?!>))[^"/>]*)*(?=/>)
regex101 demo
Matches:
(?i)<dmIdent>\s*<dmCode both tags separated by whitespace (case-insensitively)
\K resets the matched text
[^"/>]* Any characters except ", / or >
And loops:
"[^\\"]*(?:\\.[^\\"]*)*" text in quotes, or
/(?!>) a / not followed by >
both followed by the previous [^"/>]*
(?=/>) All followed by />
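The pattern above depends on \K, which Python's re module does not support, so a script would need a different route. As an alternative, here is a hedged Python sketch that uses the standard xml.etree.ElementTree parser instead of a regex. The wrapper elements dmodule and dmRef are hypothetical, added only so the sample has a dmCode that should be ignored, and the attribute list is shortened from the question's example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <dmodule> and <dmRef> are illustrative wrappers,
# not taken from the question. Attributes are abbreviated.
doc = """<dmodule>
  <dmIdent>
    <dmCode assyCode="00" infoCode="042" modelIdentCode="SASA" systemCode="A03"/>
  </dmIdent>
  <dmRef>
    <dmCode assyCode="99" infoCode="000" modelIdentCode="OTHER" systemCode="Z99"/>
  </dmRef>
</dmodule>"""

root = ET.fromstring(doc)

# Keep only dmCode elements that are direct children of a dmIdent element.
codes = [dict(el.attrib) for el in root.findall(".//dmIdent/dmCode")]

for attrs in codes:
    print(" ".join(f'{name}="{value}"' for name, value in attrs.items()))
```

For many files, the same lookup could be run in a loop over the file paths, parsing each one with ET.parse. A parser also copes with comments and CDATA, which the regex answer lists as cases that could make it fail.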

Delete text outside of tags

Using vim, I am attempting to remove all text outside of <text> blocks. This needs to span across newlines and other (unrelated) tags.
I have attempted to use regex to substitute text for newlines, but failed for a couple of reasons, one of which was my attempts did not span multiple lines, and I need to have my matches be non-greedy. (Is that accomplished using {-} somehow?)
The regex that should match the content I would like to delete would look like: <\/text>.*<text.*> but if I make this match non-greedy, I may have other issues. (I also realize I'll have one partial tag section to clean up at the beginning doing this.)
Is there another approach that I should be taking, or can someone guide me to remove all content not between such tags using vim?
EDIT: Including sample text
<contributor>
<username>MalafayaBot</username>
<id>628</id>
</contributor>
<minor />
<comment>Robô: A modificar Categoria:Vocábulo de étimo latino (Português) para Categoria:Entrada de étimo latino (Português)</comment>
<text xml:space="preserve">={{-pt-}}=
==Substantivo==
{{flex.pt|ms=excerto|mp=excertos}}
{{paroxítona|ex|cer|to}} {{m}}
# [[extrato]] de um [[texto]], [[fragmento]]
#: ''A seguir, um '''excerto''' do texto original.''
===Tradução===
{{tradini}}
* {{trad|es|extracto}}
* {{trad|fr|extrait}}
{{tradmeio}}
* {{trad|en|excerpt}}
{{tradfim}}
=={{etimologia|pt}}==
:Do latim ''[[excerptu]]'' (colhido de).
=={{pronúncia|pt}}==
===Brasil===
* [[SAMPA]]: /e."sEx.tu/
* [[AFI]]: /esˈertu/
[[zh:excerto]]</text>
<sha1>8i1zywj37s74ah4wnai11ohorfjn8j5</sha1>
<model>wikitext</model>
Your struggles with regular expressions indicate that you're using the wrong tool for the job.
For text extraction from XML, you can use XSLT, which will handle all special cases far better than a regular expression. Or use special-purpose tools like xidel, a kind of grep for XML. With it, the extraction is as easy as:
xidel --extract "//text" input.xml
If you don't need to use vim, you can try this sed command; just replace "test" with the name of your file. I would test this on a copy of your file first, since the -i option tells sed to modify the actual file you pass in.
sed -i 's/<\/text>[^<]*/<\/text>/g' test
EDIT: after seeing the sample, I'm going to take a different approach. Instead of getting rid of all the text not within the tags, I'm going to select all the <text> blocks and output them to a new file. Hopefully your version of grep supports the -P option. Try this:
grep -Pzo "(?s)<text.*?<\/text>" sample.txt > out.txt
I assume that there is only one <text> block in your file. In vim this line works for your sample text:
%s#\_.*\(<text.\{-}>\_.*</text>\)\_.*#\1#
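If neither grep -P nor vim is convenient, the same extraction idea can be sketched in Python. This is a minimal illustration, assuming a shortened version of the sample text with a placeholder comment element. It mirrors the grep approach: re.DOTALL plays the role of (?s), and .*? keeps each match non-greedy so that several text blocks would be returned separately.

```python
import re

# Abbreviated sample: the <comment> content is a placeholder, and the
# <text> block is shortened from the question's example.
sample = """<comment>bot edit</comment>
<text xml:space="preserve">={{-pt-}}=
==Substantivo==
[[zh:excerto]]</text>
<sha1>8i1zywj37s74ah4wnai11ohorfjn8j5</sha1>"""

# DOTALL lets "." match newlines; ".*?" stops at the first </text>.
pattern = re.compile(r"<text\b[^>]*>.*?</text>", re.DOTALL)

blocks = pattern.findall(sample)
print("\n".join(blocks))
```

As with the other regex answers, this assumes text elements are never nested and that </text> does not appear literally inside a block.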