I am processing an XML (XHTML) document using QDomDocument, one as simple as this example:
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html>
<html>
<body>
<p><i>foo</i><!-- comment --><b>bar</b></p>
</body>
</html>
As an (X)HTML document, the text should be rendered without a whitespace in between "foo" and "bar", since a comment in (X)HTML will not produce a whitespace (unless surrounded by some):
foobar
However, when parsed by QDomDocument and then stringified using .toString(), I get the following source back:
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html>
<html>
<body>
<p>
<i>foo</i>
<!-- comment -->
<b>bar</b>
</p>
</body>
</html>
This is then rendered with a whitespace:
foo
bar
As I understand it, the output of QDomDocument seems to be a different XML document than the original one. I checked by inspection of the document tree that the parsed document does not contain the whitespace: The <i> element is immediately followed by a comment node which is followed by the <b> element; i.e. no text node in between.
The fact that runs of whitespace don't match (i.e. the indentation width) is not a problem (IIRC, the XML spec doesn't distinguish between multiple whitespace characters and a single one, and neither does HTML). But for two tags with no whitespace in between originally, I require the output to also have no whitespace.
Is this a bug in QDomDocument::toString()? How can I work around this problem? Can I prevent QDomDocument::toString() from producing whitespace here? I already tried passing -1 as the optional parameter indent, which is documented as: "no whitespace at all is added". It will, however, still add a line break after the comment (but not before). I could imagine implementing a custom serializer, but I don't know if it is as easy as I imagine.
I am trying to get the first instance of a string in the following source string.
Input string
><text color="#FFFF00" creationdate="D:20180307100631+04'00'" flags="print,nozoom,norotate" date="D:20180307100652+04'00'" name="a60915a3-1c23-4f6d-b8d4-fbe0dd4890e9" icon="Comment" page="7" rect="351.308000,135.732000,371.308000,153.732000" subject="Sticky Note" title="saddia"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:9.0.0" xfa:spec="2.0.2"
><p dir="ltr"
><span dir="ltr" style="font-size:10.0pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"
>As agreed with WPO that any unspecific area use GEN</span
><span dir="ltr" style="font-size:11.0pt;text-align:left;color:#1D477B;font-weight:normal;font-style:normal"
>
</span
><span dir="ltr" style="font-size:11.0pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"
>
</span
I am trying to retrieve the output below
page="7" rect="351.308000,135.732000,371.308000,153.732000" subject="Sticky Note" title="saddia"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:9.0.0" xfa:spec="2.0.2"
><p dir="ltr"
><span dir="ltr" style="font-size:10.0pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"
>As agreed with WPO that any unspecific area use GEN</span
which is everything up to the first instance of </span.
My RegExp is below; it picks the last occurrence of the desired end character group instead:
page="[0-9]+".+subject="(Text Box|Sticky Note)".+((\s+.+)+);<\/span
I have limited knowledge of RegEx so please bear with me.
The snippet is output XFDF (pdf comment export) but it was getting formatted weirdly so I have used html tagging to format.
In the following regex, the main change I made was to make the dot lazy, meaning that it stops at the first place where the rest of the pattern can match. A greedy dot would instead consume as much text as possible and stop only at the last </span.
page="[0-9]+".+?subject="(?:Text Box|Sticky Note)".+?<\/span
Note carefully that for the above pattern to work, the regex must be run in DOT ALL mode, meaning that the dot also matches newlines.
In VBA, which doesn't have a formal DOT ALL mode, we can simulate it using [\s\S]:
page="[0-9]+"[\s\S]+?subject="(?:Text Box|Sticky Note)"[\s\S]+?<\/span
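The difference between the greedy and the lazy form can be checked with a minimal sketch. Python is used here only as a convenient test bed (the question concerns VBA), the sample string is an abbreviated stand-in for the XFDF snippet, and the names xfdf, greedy and lazy are illustrative.

```python
import re

# Abbreviated sample modelled on the question's XFDF snippet (an assumption,
# not the full input).
xfdf = (
    '><text page="7" rect="351.3,135.7,371.3,153.7" subject="Sticky Note" title="saddia"\n'
    '><contents-richtext\n'
    '><span dir="ltr"\n'
    '>As agreed with WPO that any unspecific area use GEN</span\n'
    '><span dir="ltr"\n'
    '>\n'
    '</span\n'
)

# Greedy: [\s\S]+ grabs as much as possible, so the match runs to the LAST </span.
greedy = re.search(
    r'page="[0-9]+"[\s\S]+subject="(?:Text Box|Sticky Note)"[\s\S]+<\/span', xfdf
)

# Lazy: [\s\S]+? grabs as little as possible, so the match stops at the FIRST </span.
lazy = re.search(
    r'page="[0-9]+"[\s\S]+?subject="(?:Text Box|Sticky Note)"[\s\S]+?<\/span', xfdf
)

print(lazy.group(0))
```

Because [\s\S] matches any character including newlines, no DOT ALL flag is needed in either variant.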
Let's say I have some thousands of HTML files with some text inside 'em (articles, actually). Besides, let's say there are all sorts of scripts, styles, counters, other crap inside these HTMLs, somewhere above the actual text.
And my task is to replace everything that goes from the very beginning until a certain tag – i.e., we start with <head> and end with <div class="StoryGoesBelow"> with a clear
<html>
<head>
</head>
<body>
block.
Is there any regex way I can do this? Vim? Any other editor? Scripting language?
Thanks.
The simplest regex for this would be (?s)\A.*?(?=<div class="StoryGoesBelow">) (assuming you want to keep the <div> tag). Replace that with the text from your question.
Explanation:
(?s) # Allow the dot to match newlines
\A # Anchor the search at the start of the string
.*? # Match any number of characters, as few as possible
(?=<div class="StoryGoesBelow">) # and stop right before this <div>
This will fail, of course, if the text <div class="StoryGoesBelow"> could also occur in a comment or a literal string somewhere above the actual tag.
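As a rough illustration, the same regex can be applied from a scripting language. The sketch below uses Python; the function name clean_page and the constant names are hypothetical, and the replacement block is the one given in the question.

```python
import re

# The regex from the answer: lazily match everything from the start of the
# string up to (but not including) the marker <div>.
HEAD_JUNK = re.compile(r'(?s)\A.*?(?=<div class="StoryGoesBelow">)')

# The clean block from the question.
CLEAN_HEAD = "<html>\n<head>\n</head>\n<body>\n"

def clean_page(html):
    """Replace everything before the marker <div> with the clean block.

    If the marker does not occur, the text is returned unchanged.
    """
    # A function is used as the replacement so that the block is inserted
    # literally, without backslash processing.
    return HEAD_JUNK.sub(lambda _m: CLEAN_HEAD, html, count=1)

page = (
    '<html><head><script>x</script></head><body><div id="ad"></div>\n'
    '<div class="StoryGoesBelow"><p>Text</p></div></body></html>'
)
print(clean_page(page))
```

For thousands of files, one would read each file, pass its contents through clean_page and write the result back; keeping a backup of the originals is advisable given the caveat above.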
A little bit of background:
I work at a multilingual communication company, where we’re working with a CMS system. Since its last update, all the files I export out of the system are ‘polluted’ with metadata, which I don't want to see, use or replace. To filter and change a heap of xml files, I use Powergrep, which operates with regexes.
I want my regex to find, e.g. "there is no spoon", "oracle", "I know kung-fu" and "bending method" (all straight quotation marks) and replace it with “there is no spoon”, “oracle”, “I know kung-fu” and “bending method” (all with curly quotation marks).
I don’t want it to find the metadata "concept.dtd" and "map.dtd"
The following lines are the first lines of my xml file. It's this "concept.dtd" that I would like to ignore.
<?xml version="1.0" encoding="UTF-16" standalone="no"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"[
]>
<?ish ishref="GUID-6B84EF92-DA99-4C54-BA91-FD0A113D4A96" version="1" lang="sv" srclng="en"?>
This is somewhere in the middle of the xml file
<row>
<entry colname="col1" valign="middle" align="left">"Bending method" </entry>
<entry colname="col2" valign="middle" align="left">another word</entry>
</row>
So.. this is the original regex:
(?<!=)"\b(.+?)\b"(?! \[)
Replacement:
“\1”
Problem:
As the metadata “concept.dtd” and “map.dtd” are part of the file, I don’t want to replace their quotation marks in order not to change anything crucial. So I tried rewriting the regex:
(?<!=)"\b(.+?[\.d])\b"(?! \[)
It almost works: “concept.dtd” and “map.dtd” are skipped, most of the terms between quotation marks are found, but not all: “Bending method” is not found, for example.
What am I missing? Any help or opinions would be greatly appreciated!
Based on your last answers, here is a regexp that can help you:
(?<=<entry)[^>]+>[^<>]*?(".+?")[^<>]*?(?=<\x2Fentry>)
Demo
http://regex101.com/r/lX2cU3
Discussion
I assume that you have one series of words between straight quotation marks and that there are no carriage returns or line feeds inside an <entry> node.
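To see how capture group 1 could drive the replacement, here is a minimal sketch in Python rather than PowerGREP. The helper name curl_quotes is hypothetical, and the sample text is assembled from the excerpts in the question.

```python
import re

# The regex from the answer; group 1 is the straight-quoted term inside <entry>.
ENTRY_QUOTED = re.compile(
    r'(?<=<entry)[^>]+>[^<>]*?(".+?")[^<>]*?(?=<\x2Fentry>)'
)

def curl_quotes(xml_text):
    """Swap straight quotes for curly ones, but only around group 1."""
    def fix(m):
        whole = m.group(0)
        # Offsets of group 1 relative to the start of the whole match.
        start = m.start(1) - m.start(0)
        end = m.end(1) - m.start(0)
        term = m.group(1)[1:-1]  # the term without its straight quotes
        return whole[:start] + "\u201c" + term + "\u201d" + whole[end:]
    return ENTRY_QUOTED.sub(fix, xml_text)

sample = (
    '<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"[\n'
    ']>\n'
    '<row>\n'
    '<entry colname="col1" valign="middle" align="left">"Bending method" </entry>\n'
    '<entry colname="col2" valign="middle" align="left">another word</entry>\n'
    '</row>\n'
)
print(curl_quotes(sample))
```

The DOCTYPE line and the attribute values keep their straight quotes because the pattern only looks at text after the closing > of an <entry> start tag.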
I am trying to parse a "wrong html" to fix it using perl regex.
The wrong html is the following: <p>foo<p>bar</p>foo</p>
I would like the Perl regex to return: <p>foo<p>
I tried something like: '|(<p\b[^>]*>(?!</p>)*?<p[^>]*>)|'
with no success because I cannot repeat (?!</p>)*?
Is there a way in a Perl regex to say "all characters except the following sequence" (in my case </p>)?
Try something like:
<p>(?:(?!</?p>).)*</p>(?!(?:(?!</?p>).)*(<p>|$))
A quick break down:
<p>(?:(?!</?p>).)*</p>
matches <p> ... </p> that contains neither <p> nor </p>. And the part:
(?!(?:(?!</?p>).)*(<p>|$))
is "true" when looking ahead ((?! ... )) there is no <p> or the end of the input ((<p>|$)), without any <p> and </p> in between ((?:(?!</?p>).)*).
A demo:
my $txt="<p>aaa aa a</p> <p>foo <p>bar</p> foo</p> <p> bb <p>x</p> bb</p>";
while($txt =~ m/(<p>(?:(?!<\/?p>).)*<\/p>)(?!(?:(?!<\/?p>).)*(<p>|$))/g) {
print "Found: $1\n";
}
prints:
Found: <p>bar</p>
Found: <p>x</p>
Note that this regex trickery only works for <p>baz</p> in the string:
<p>foo <p>bar</p> <p>baz</p> foo</p>
<p>bar</p> is not matched! After replacing <p>baz</p>, you could do a 2nd run on the input, in which case <p>bar</p> will be matched.
I concur with Andy. Parsing nontrivial HTML with regexps is a world of pain.
Have a good look at HTML::TreeBuilder::XPath and HTML::DOM for making structural changes to HTML documents.
This regexp:
<p>(?:(?!</p>).)*?<p>
when matched with
<p>foo<p>bar</p>foo</p>
results in
<p>foo<p>
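The same construct behaves identically in other regex engines that support lookahead. A minimal check in Python (used here only as a test bed, with illustrative variable names):

```python
import re

# (?:(?!</p>).)*? consumes one character at a time, as few as possible,
# and only while "</p>" does not start at the current position.
pattern = re.compile(r"<p>(?:(?!</p>).)*?<p>")

broken = "<p>foo<p>bar</p>foo</p>"
match = pattern.search(broken)
print(match.group(0))  # <p>foo<p>

# Properly closed paragraphs are not reported, because the scan is not
# allowed to step over a </p> on its way to the next <p>.
well_formed = "<p>a</p><p>b</p>"
print(pattern.search(well_formed))  # None
```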
If you're trying to validate HTML then consider a module like HTML::Tidy or HTML::Lint.
Perhaps Marpa::HTML would help you. The author's blog describes some of its interesting abilities. The short of it is that the parser works with the interpreter (I probably am getting some of the semantics incorrect) to figure out what should be present based on what CAN be present at a certain logical place in the code.
The examples shown therein fix similar problems as you seem to be dealing with in a much more consistent way than employing regexes which will inevitably suffer from edge cases.
Marpa::HTML comes with a command-line utility, built using the module, called html_fmt. This implements a parsing engine to fix and pretty-print html. Here is an example. If 'bad.html' contains <p>foo<p>bar</p>foo</p> then html_fmt bad.html gives:
<!-- Following start tag is replacement for a missing one -->
<html>
<!-- Following start tag is replacement for a missing one -->
<head>
</head>
<!-- Preceding end tag is replacement for a missing one -->
<!-- Following start tag is replacement for a missing one -->
<body>
<p>
foo
</p>
<!-- Preceding end tag is replacement for a missing one -->
<p>
bar
</p>
foo
<!-- Next line is cruft -->
</p>
</body>
<!-- Preceding end tag is replacement for a missing one -->
</html>
<!-- Preceding end tag is replacement for a missing one -->
I am using Umbraco and I need to display an image in an RSS feed. The feed is generated by XSLT.
Everything works as long as I output plain text. Embedding an image is technically feasible too, but the feed I analyzed that does so had been generated by WordPress.
The challenge is that I have no idea how to embed an <img> within my <content:encoded> tag.
I have a variable, say "url", that returns the full URL of the underlying image. How can I insert an <img> within <content:encoded>? Remember I am using XSLT to achieve the task.
<content:encoded>
<img src="{$url}" />
</content:encoded>
I guess that CDATA must be used, but I am not able to escape the illegal characters correctly :(
Thanks for your help.
Roland
roland, you're trying to escape things twice. It's unnecessary (not to mention hideous!) This page shows:
<content:encoded><![CDATA[This is <i>italics</i>.]]></content:encoded>
I.e. they're just escaping the markup inside the <content:encoded> once, and they use CDATA to do that. In your case, CDATA is awkward because you need to substitute $url in the middle. So you could use two CDATA sections wrapped around an <xsl:value-of select="$url" />: (indented for clarity)
<content:encoded>
<![CDATA[<img src="]]>
<xsl:value-of select='$url' />
<![CDATA[">]]>
</content:encoded>
But that would be needlessly verbose. The second CDATA section is unneeded. And we can do better while using the same principle: escape (once) the markup characters that would otherwise cause the string to be parsed into a tree. In your case, only the initial < needs to be escaped. You can use &lt; instead of CDATA to escape the <. Put this in your XSLT:
<content:encoded>&lt;img src="<xsl:value-of select='$url' />"></content:encoded>
The <xsl:value-of> is not really inside quotes, from XSLT's perspective... those quotes are just the content of text nodes. The <xsl:value-of> works as a normal XSLT instruction.
Change select='$url' to select="concat($siteUrl, photo)" if that's what you need. (I.e. photo is a child element of the context node, and its text value is the name of the image file.)
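The "escape once" principle can be checked outside XSLT. The following sketch uses Python's standard library instead of an XSLT processor; the helper name encoded_element and the example URL are hypothetical, and the URL is assumed to contain no characters that would need escaping inside an HTML attribute.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespace of the RSS "content" module.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

def encoded_element(url):
    """Build a <content:encoded> element whose text is an <img> tag."""
    # Assumption: url needs no HTML escaping of its own.
    markup = '<img src="%s">' % url
    # Escape the markup exactly once so it travels as text, not as child elements.
    return '<content:encoded xmlns:content="%s">%s</content:encoded>' % (
        CONTENT_NS,
        escape(markup),
    )

xml_text = encoded_element("http://example.com/photo.png")
print(xml_text)

# A feed reader parsing the XML gets the original markup back as a string.
print(ET.fromstring(xml_text).text)
```

After one round of XML parsing the element's text is the literal <img ...> markup, which is what a feed reader then renders as HTML; escaping twice would leave &lt;img visible to the reader instead.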