Unable to match XML element using Python regular expression

I have an XML document with the following structure:
> <?xml version="1.0" encoding="UTF-8"?> <!-- generated by CLiX/Wiki2XML
> [MPI-Inf, MMCI#UdS] $LastChangedRevision: 93 $ on 17.04.2009
> 12:50:48[mciao0826] --> <!DOCTYPE article SYSTEM "../article.dtd">
> <article xmlns:xlink="http://www.w3.org/1999/xlink"> <header>
> <title>Postmodern art</title> <id>192127</id> <revision>
> <id>244517133</id> <timestamp>2008-10-11T05:26:50Z</timestamp>
> <contributor> <username>FairuseBot</username> <id>1022055</id>
> </contributor> </revision> <categories> <category>Contemporary
> art</category> <category>Modernism</category> <category>Art
> movements</category> <category>Postmodern art</category> </categories>
> </header> <bdy> Postmodernism preceded by Modernism '' Postmodernity
> Postchristianity Postmodern philosophy Postmodern architecture
> Postmodern art Postmodernist film Postmodern literature Postmodern
> music Postmodern theater Critical theory Globalization Consumerism
> </bdy>
I am interested in capturing the text contained within <bdy>...</bdy>, and for that I wrote the following Python 3 regex code:
import re
with open("sample_xml.xml", "r") as file:
    xml_doc = file.read()
body_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc)
But body_text is always an empty list. However, when I try to capture the text of the <category>...</category> tags using this code:
category_text = re.findall(r'<category>(.+)</category>', xml_doc)
it does the job.
Any idea why the <bdy>...</bdy> code is not working?
Thanks!

By default, the special character . does not match a newline, so that regex cannot match text that spans several lines.
You can change this behavior with the DOTALL flag. To set it inside the pattern itself, put (?s) at the start of your regular expression.
More information on Python's regular expression syntax can be found here: https://docs.python.org/3/library/re.html#regular-expression-syntax
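As a minimal sketch of the difference (the short xml_doc string below is a made-up stand-in for the real file, not the asker's document), the same pattern can be run with and without DOTALL:

```python
import re

# Hypothetical stand-in for the file contents: the <bdy> text spans several lines.
xml_doc = "<header><title>Postmodern art</title></header>\n<bdy>\nline one\nline two\n</bdy>"

# Without DOTALL, "." stops at the first newline, so nothing matches.
without_flag = re.findall(r"<bdy>(.+)</bdy>", xml_doc)

# With the inline (?s) flag, "." also matches newlines.
with_inline_flag = re.findall(r"(?s)<bdy>(.+)</bdy>", xml_doc)

# Passing re.DOTALL is equivalent; the lazy +? stops at the first closing tag.
with_dotall = re.findall(r"<bdy>(.+?)</bdy>", xml_doc, re.DOTALL)

print(without_flag)
print(with_inline_flag)
print(with_dotall)
```

The lazy quantifier only matters if the document contains more than one <bdy> element; with a single element both forms capture the same text.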

You can use re.DOTALL
category_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc, re.DOTALL)
Output:
[" Postmodernism preceded by Modernism '' Postmodernity\n> Postchristianity Postmodern philosophy Postmodern architecture\n> Postmodern art Postmodernist film Postmodern literature Postmodern\n> music Postmodern theater Critical theory Globalization Consumerism\n> "]

Related

Change the color of x SVG files

I'd like to change the color of at least 1,000 SVG files. The main problem is that the current SVGs don't contain the "fill" attribute, so I have to add fill="X" at the end of the SVG tag.
Here is an example of one SVG file:
<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 19.0.0, SVG Export Plug-In . SVG Version: 6.00 Build 0) -->
<svg version="1.1" id="Layer_1" " x="0px" y="0px"
viewBox="-236 286 30 30" style="enable-background:new -236 286 30 30;" xml:space="preserve">
Thanks for your help.

There are many ways to do that. The safest would be to parse the XML structure and then manipulate it. But for that specific example you could also use the following regex with e.g. sed or Python:
With sed (this prints every line and only rewrites the one that closes the svg tag):
sed -E 's/xml:space="preserve">/xml:space="preserve" fill="red">/g'
With Python:
import re
regex = r"xml:space=\"preserve\">"
test_str = ("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
"<!-- Generator: Adobe Illustrator 19.0.0, SVG Export Plug-In . SVG Version: 6.00 Build 0) -->\n"
"<svg version=\"1.1\" id=\"Layer_1\" \" x=\"0px\" y=\"0px\"\n"
" viewBox=\"-236 286 30 30\" style=\"enable-background:new -236 286 30 30;\" xml:space=\"preserve\">")
subst = "xml:space=\"preserve\" fill=\"red\" >"
# count=0 replaces every occurrence; set it to a positive number to limit the replacements
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
You can also have a look at regex101.com/r/cIfbEd
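For the "read the XML structure" route mentioned above, here is a rough sketch using Python's standard xml.etree.ElementTree. The add_fill helper and the sample string are made up for illustration, and it assumes the files are well-formed SVG in the usual http://www.w3.org/2000/svg namespace:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Keep the SVG namespace as the default one so the output has no "ns0:" prefixes.
ET.register_namespace("", SVG_NS)


def add_fill(svg_text, color):
    """Return the SVG source with fill=<color> set on the root <svg> element."""
    root = ET.fromstring(svg_text)
    root.set("fill", color)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample, not one of the asker's files.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" version="1.1" viewBox="-236 286 30 30">'
    '<path d="M0 0h10v10z"/></svg>'
)
recolored = add_fill(sample, "red")
print(recolored)
```

To process many files you would read each one, pass its text through add_fill, and write the result back. Unlike the regex, this does not depend on xml:space="preserve" being the last attribute of the tag.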

BASH script to rename XML file to an attribute value

I have a lot of .xml files structured the same way:
<parent id="idvalue" attr1="val1" attr2="val2" ...>
<child attr3="val3" attr4="val4" ... />
<child attr3="val5" attr4="val6" ... />
...
</parent>
Each file has exactly one <parent> element with exactly one id attribute.
All of those files (almost 1,700,000 of them) are named as part.xxxxx where xxxxx is a random number.
I want to name each of those files as idvalue.xml, according to the sole id attribute from the file's content.
I believe doing it with a bash script would be the fastest and most automated way. But if there are other suggestions, I would love to hear them.
My main problem is that I am not able (don't know how) to get the idvalue in a specific file, so that I could use it with the mv file.xxxxx idvalue.xml command.

First I would iterate over the files using find:
find . -maxdepth 1 -name 'part.*' -exec ./rename_xml.sh {} \;
The line above executes rename_xml.sh for every matching file, passing the file name as an argument to the script.
rename_xml.sh should look like this:
#!/bin/bash
# Get the id using XPath. You may need to install xmllint
# if it is not already present.
# The XPath query returns a string like this (try it!):
#
#  id="idvalue"
#
# sed then extracts the value between the quotes.
id=$(xmllint --xpath '//parent/@id' "$1" | sed -r 's/[^"]+"([^"]+).*/\1/')
mv -v "$1" "$id.xml"
Don't forget to
chmod +x rename_xml.sh

Use a proper XML handling tool to extract the id from the files. For example,
xsh:
for file in part.* ; do
    mv "$file" "$(xsh -aC 'open { shift }; echo /parent/@id' "$file").xml"
done
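If a scripting language is an option, the same idea can be sketched with Python's standard library. The rename_by_id function is a hypothetical helper, and it assumes every id value is usable as a file name (no slashes, no duplicates):

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def rename_by_id(directory):
    """Rename every part.* file in directory to <id>.xml; return the new names."""
    renamed = []
    # Snapshot the listing first so the renames do not disturb the iteration.
    for name in sorted(os.listdir(directory)):
        if not name.startswith("part."):
            continue
        source = os.path.join(directory, name)
        root = ET.parse(source).getroot()
        id_value = root.get("id")
        if root.tag != "parent" or id_value is None:
            continue  # skip files that do not have the expected shape
        os.rename(source, os.path.join(directory, id_value + ".xml"))
        renamed.append(id_value + ".xml")
    return renamed


# Small self-contained demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "part.12345"), "w") as handle:
        handle.write('<parent id="idvalue" attr1="val1"><child attr3="val3"/></parent>')
    demo_result = rename_by_id(demo_dir)
    demo_listing = sorted(os.listdir(demo_dir))

print(demo_result, demo_listing)
```

Running everything in one process avoids starting a new program per file, which may matter with about 1,700,000 files, though I have not measured it.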

As I mentioned in my comment, I am not sure how XSLT performs compared to bash scripts, but I created this XSLT for you to try out.
In the stylesheet below, Dire is the directory that contains the XML files. The expression tokenize(document-uri(.), '/')[last()] retrieves the file name, and the next line concatenates the directory name with the file name to get the path of the file. The xsl:copy-of line copies the entire XML document.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:msxml="urn:schemas-microsoft-com:xslt" xmlns:random="http://www.microsoft.com/msxsl">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="collection('Dire/?select=*.xml')" >
<xsl:variable name="filename" select="tokenize(document-uri(.), '/')[last()]"/>
<xsl:variable name="filepath" select="concat('Dire/',$filename)"/>
<xsl:variable name="doc" select="document($filepath)"/>
<xsl:variable name="outname" select="$doc/parent/@id"/>
<xsl:result-document href="{$outname}.xml" method="xml">
<xsl:copy-of select="$doc/node()"/>
</xsl:result-document>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
I ran the XSLT using saxon8. Unfortunately I could not find a way to rename the XML files directly, but the above code should be worth a try.

RegexFilter with RollingFileAppender not working properly

I am trying to use RegexFilter in a RollingFileAppender. With the first pattern it picked up the matching log lines, but after I tried a different pattern nothing is logged to the file. Here is what I am using:
Main Class:
public class MainApp {
    public static void main(String[] args) {
        final Logger logger = LogManager.getLogger(MainApp.class.getName());
        ApplicationContext context = new ClassPathXmlApplicationContext("Beans.xml");
        HelloWorld obj = (HelloWorld) context.getBean("helloWorld");
        logger.trace("NPF:Trace:Entering Log4j2 Example.");
        logger.debug("NTL:debug Entering Log4j2 Example.");
        obj.getMessage();
        Company comp = new Company();
        comp.setCompName("ANC");
        comp.setEstablish(1889);
        CompanyBusiness compBus = (CompanyBusiness) context.getBean("compBus");
        compBus.finaceBusiness(comp.getCompName(), comp.getEstablish());
        logger.trace("NTL: Trace: Exiting Log4j2 Example.");
    }
}
log4j2.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<Configuration>
<Appenders>
<Console name="STDOUT" target="SYSTEM_OUT">
<PatternLayout pattern="%d{yyyy-MM-dd [%t] HH:mm:ss} %-5p %c{1}:%L - %m%X%n" />
</Console>
<RollingFile name="RollingFile" fileName="C:\logTest\runtime\tla\els3.log" append="true" filePattern="C:\logTest\runtime\tla\els3-%d{yyyy-MM-dd}-%i.log" >
<PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %m%X%n" />
<RegexFilter regex=".*business*." onMatch="ACCEPT" onMismatch="DENY"/>
<Policies>
<SizeBasedTriggeringPolicy size="20 MB" />
</Policies>
</RollingFile>
</Appenders>
<Loggers>
<Logger name="com.anc" level="trace"/>
<Root level="trace">
<AppenderRef ref="STDOUT" />
<AppenderRef ref="RollingFile"/>
</Root>
</Loggers>
</Configuration>
When I ran it for the first time, my log file contained only the lines related to "business". Later I changed the pattern from .*business*. to business, and nothing was logged to the file or the console; my application also terminated without any logging.
Then I reverted the pattern to .*business*., but after that nothing was logged to the file, although the full trace was printed on the console. When I commented out the RegexFilter, after trying for a long time, my logs were printed to the log file.
I am not sure whether this is a bug or whether RegexFilter works only once. Also, if the pattern contains no wildcard characters, the application stops without logging anything to the console or the file.

If you want to log all events containing the word "business", you should use the regex .*business.* instead of .*business*.. Here is an example:
<RegexFilter regex=".*business.*" onMatch="ACCEPT" onMismatch="DENY"/>
For information, .*business*. means: anything, followed by busines, followed by the character s zero or more times, followed by exactly one character of any kind.
In more detail:
. means any single character
* means the preceding item, 0 or more times
so .* means any character, 0 or more times.
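The three patterns can be compared in a small sketch with Python's re module, whose syntax agrees with Java's for these expressions. The messages are invented, and the sketch assumes the filter has to match the whole message, which is why it uses fullmatch rather than search:

```python
import re

# The three patterns discussed above.
broken = re.compile(r".*business*.")   # original pattern from the question
fixed = re.compile(r".*business.*")    # suggested replacement
bare = re.compile(r"business")         # pattern without any wildcards

# Hypothetical log messages, not taken from the asker's application.
messages = ["finance business started", "my business", "unrelated message"]

results = {
    msg: (
        bool(broken.fullmatch(msg)),
        bool(fixed.fullmatch(msg)),
        bool(bare.fullmatch(msg)),
    )
    for msg in messages
}

for msg, flags in results.items():
    print(msg, flags)
```

Under that whole-message assumption, the original pattern accepts only messages that end right after "busines", optional "s" characters, and one more character, while the bare word accepts only a message that is exactly "business".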

Removing specific tags in a KML file

I have a KML file which is a list of places around the world with coordinates and some other attributes. It looks like this for one place:
<Placemark>
<name>Albania - Durrës</name>
<open>0</open>
<visibility>1</visibility>
<description>(Spot ID: 275801) show <![CDATA[forecast]]></description>
<styleUrl>#wgStyle001</styleUrl><Point>
<coordinates>19.489747,41.277806,0</coordinates>
</Point>
<LookAt><range>200000</range><longitude>19.489747</longitude><latitude>41.277806</latitude></LookAt>
</Placemark>
I would like to remove everything except the name of the place. So in this case that would mean I would like to remove everything except
<name>Albania - Durrës</name>
The problem is, this KML file includes more than 1000 of these places. Doing this manually obviously isn't an option, so then how can I remove all tags except for the name tags for all of the items in the list? Can I use some kind of program for that?

Use a specialized command line tool that understands XML documents.
One such tool is xmlstarlet, which is available for Linux, Windows and Solaris.
To address your particular problem, I used the xmlstarlet executable xml.exe like this (on Windows):
xml.exe sel -N ns=http://www.opengis.net/kml/2.2 -t -v /ns:kml/ns:Document/ns:Placemark/ns:name places.kml
This produces this output:
Albania - Durrës
Second Name
Third Name
...
Final Name
If you can guarantee that <name> occurs only as a child of <Placemark>, then this abbreviated version will produce the same result:
xml.exe sel -N ns=http://www.opengis.net/kml/2.2 -t -v //ns:name places.kml
(This is because this shorter version finds all <name> elements no matter where they occur in the document.)
If you really want an XML document, you'll need to do a little post-processing. Here's an example of a complete XML document:
<?xml version='1.0' encoding='utf-8'?>
<items>
<item>Albania - Durrës</item>
<item>Second Name</item>
<item>Third Name</item>
<!-- ... -->
<item>Final Name</item>
</items>
The first line is the XML declaration. It declares the Unicode encoding utf-8. You'll need to include this line so that XML processors recognize that your document includes Unicode characters (as in Durrës).
More: Here's an enhanced 'xmlstarlet' command that will produce the XML document above:
xml.exe sel -N ns=http://www.opengis.net/kml/2.2 -T -t -o "<?xml version='1.0' encoding='utf-8'?>" -n -t -v "'<items>'" -n -t -m //ns:Placemark -v "concat('<item>',ns:name,'</item>')" -n -t -o "</items>" -n places.kml

If you are on Linux or similar:
grep "<name>" your_file.kml > file_with_only_name_tags
On Windows, see "What are good grep tools for Windows?"
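The XML-aware approach can also be sketched in Python with the standard xml.etree.ElementTree module. The function names and the two-placemark sample are made up for illustration; the namespace is the KML 2.2 one used in the xmlstarlet commands above:

```python
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def placemark_names(kml_text):
    """Return the text of every <name> that is a direct child of a <Placemark>."""
    root = ET.fromstring(kml_text)
    return [el.text for el in root.findall(".//kml:Placemark/kml:name", KML_NS)]


def names_to_xml(names):
    """Wrap the names in a small <items> document."""
    items = ET.Element("items")
    for name in names:
        ET.SubElement(items, "item").text = name
    return ET.tostring(items, encoding="unicode")


# Hypothetical two-placemark sample in the KML 2.2 namespace.
sample = (
    '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
    "<Placemark><name>Albania - Durrës</name><open>0</open></Placemark>"
    "<Placemark><name>Second Name</name><open>0</open></Placemark>"
    "</Document></kml>"
)
names = placemark_names(sample)
items_xml = names_to_xml(names)
print(names)
print(items_xml)
```

For a real file you would read places.kml and pass its contents to placemark_names; ET.parse works as well if you prefer to hand it the file name.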

Access attributes from XML in shell

I'm trying to parse out values from a Widget config.xml using shell. I do not want to use sed for this task. If there is something that sucks less than xsltproc, I'd love to know.
In this example I am after the id attribute value from the config.xml below:
<?xml version="1.0" encoding="UTF-8"?>
<widget xmlns="http://www.w3.org/ns/widgets" id="http://example.org/exampleWidget" version="2.0 Beta" height="200" width="200">
<name short="123">Foo Widget</name>
</widget>
I wish it was as simple as jQuery's attr: var id = $("widget").attr("id");
Currently this shell code utilising xsltproc fails:
snag () {
TMP=$(tempfile)
cat << EOF > $TMP
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" encoding="utf-8" indent="no"/>
<xsl:template>
<xsl:value-of select="$1"/>
</xsl:template>
</xsl:stylesheet>
EOF
echo $(xsltproc $TMP config.xml)
rm -f $TMP
}
ID=$(snag "widget/@id")
if test "$ID" = "http://example.org/exampleWidget"
then
echo Mission accomplished.
else
echo "<$ID> is wrong."
fi

XMLStarlet (http://xmlstar.sourceforge.net/) is a nice command line tool that supports such queries:
xmlstarlet sel -N w=namespace -T -t -m "/w:widget/@id" -v . -n config.xml

template match="widget"
value-of select="@id"

<xsl:template xmlns:wgt="http://www.w3.org/ns/widgets" match="/wgt:widget">
<xsl:value-of select="@id" />
</xsl:template>

You don't need XSLT if you're not doing a transform.
If you only need to grab a value use XPath.
There's an xpath program that comes with Perl's XML::XPath module.
From the shell:
ID=$(xpath config.xml 'string(/widget/@id)' )
(The string() function returns only the value of the id;
/widget/@id by itself returns "id=value".)
If you only need to produce some other output depending on the value, you could
do it all in xslt. There are also other XPath implementations available from
other scripting languages: I've used Java's XPath from both rhino and Jython.
There's also XQuery from the command line with Saxon.
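As one example of reading the attribute from a scripting language, here is a short sketch with Python's standard xml.etree.ElementTree. The config string is the sample from the question, inlined so the snippet runs on its own; with a real file, ET.parse("config.xml").getroot() would take the place of ET.fromstring:

```python
import xml.etree.ElementTree as ET

WIDGET_NS = "http://www.w3.org/ns/widgets"

# Sample config.xml from the question, inlined for a self-contained example.
config = (
    '<widget xmlns="http://www.w3.org/ns/widgets" '
    'id="http://example.org/exampleWidget" version="2.0 Beta" height="200" width="200">'
    '<name short="123">Foo Widget</name>'
    "</widget>"
)

root = ET.fromstring(config)

# Attributes without a prefix are not namespaced, so a plain get() is enough.
widget_id = root.get("id")

# Child elements do live in the default namespace and need the qualified name.
short_name = root.find("{%s}name" % WIDGET_NS).get("short")

print(widget_id)
print(short_name)
```

This is about as close to the jQuery attr call as the standard library gets: one call to load the document and one to read the attribute.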
