How to use a regex to match a character after a digit - XSLT

I want to select the @reId which has a character after a digit (fig-FigF.3A).
Input:
<p type="TOC_Level Two Entry">
<doclink refType="anchor" refId="fig-FigF.3A">Figure F.3A—Text<c
type="TOC_Leader Dots"><t/></tps:c></tps:doclink>
<ref format="TOC Page Number" refType="anchor" refId="fig-FigF.3A"/>
<p>
Output should be:
<p type="TOC_Level Two Entry"><doclink refType="anchor"
refId="fig-FigF.3A">F.3A<tps:t/>Text<c
type="TOC_Leader Dots"><t/></c></tps:doclink><ref
format="TOC Page Number" refType="anchor" refId="fig-FigF.3A"/></tps:p>
Tried code:
I tried to solve this with the regex ^(Figure )(\d+|[A-Z].\d+)(—)(.*), but it does not work.
How can I solve this? I am using XSLT 2.0.

Your input is not well-formed; please check it.
If you only want to change the text, use this code with the replace() function:
Input:
<?xml version="1.0" encoding="UTF-8"?>
<p type="TOC_Level Two Entry">
<tps:doclink refType="anchor" refId="fig-FigF.3A" xmlns:tps="htttp:\\tps">Figure F.3A—Text<tps:c type="TOC_Leader Dots"><t/></tps:c></tps:doclink>
<ref format="TOC Page Number" refType="anchor" refId="fig-FigF.3A"/>
</p>
code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="2.0">
<xsl:output method="xml" omit-xml-declaration="no"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="text()">
<xsl:value-of select="replace(., '(Figure )([A-Z])([.])([0-9A-Z]+)(.+?)([A-Za-z]+)', '$2$3$4')"/>
</xsl:template>
</xsl:stylesheet>
output:
<?xml version="1.0" encoding="UTF-8"?>
<p type="TOC_Level Two Entry">
<tps:doclink xmlns:tps="htttp:\\tps" refType="anchor" refId="fig-FigF.3A">F.3A<tps:c type="TOC_Leader Dots"><t/></tps:c></tps:doclink>
<ref format="TOC Page Number" refType="anchor" refId="fig-FigF.3A"/>
</p>
DEMO: https://xsltfiddle.liberty-development.net/ncntCS9/1
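Note that the replace() call above keeps only "F.3A" and drops the rest of the heading, whereas the requested output is F.3A<tps:t/>Text. Here is a minimal sketch of one way to produce that with xsl:analyze-string. It assumes the doclink element is in the tps namespace as in the cleaned-up input above, that the label always has the shape letter, dot, digits, optional letter, and that the separator is always the dash character shown in the sample:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tps="htttp:\\tps">

  <!-- identity transform: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- text directly inside tps:doclink (namespace URI is an assumption
       taken from the sample input) -->
  <xsl:template match="tps:doclink/text()">
    <xsl:analyze-string select="."
        regex="^Figure ([A-Z]\.[0-9]+[A-Z]?)—(.*)$">
      <xsl:matching-substring>
        <!-- group 1: the label, e.g. F.3A; group 2: the heading text -->
        <xsl:value-of select="regex-group(1)"/>
        <tps:t/>
        <xsl:value-of select="regex-group(2)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- anything that does not look like a figure heading is kept as-is -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Because the pattern is anchored at both ends, a text node is either rewritten as a whole or passed through unchanged.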

So, trying to extract a clear requirements statement from this, it seems you want the input "fig-FigF.3A" to result in the output "F.3A". Alternatively, perhaps you want to treat "Figure F.3A—Text" as the input? On the one hand you say you are selecting the @reId attribute -- which doesn't exist in your input; on the other hand your attempt at a solution is looking for the text "Figure" which appears in a text node, rather than an attribute.
So I think we need a much clearer requirements statement.
The other problem with this as a requirements statement is that you only really give one example, not a general rule. There's a hint of a general rule in your question "which has a character after a digit". But what does this mean? Your example seems to be looking for the pattern letter-dot-digit, which doesn't match your description of the problem at all.
Sorry, SO moderators, this isn't an answer, it's a comment on the question. It started as an answer, until I realised the question wasn't clear, but by then it was too long for a comment.

Related

XSLT target specific string before closing tags

I am currently trying to add a new attribute to an element but the value needs to come from the data itself and I have no clue how to target it as the text value can happen in 2 different places.
My input XML is as following:
Case 1
<div>
<title/>
<p>This is an example where the string is being used in the text (0123-45-6789) and how a sentence can look like. (0123-45-6789)</p>
</div>
Case 2
<div>
<title>This is an example title. (0123-45)</title>
<p>This is an example sentence.</p>
</div>
Target
<div id="0123-45">
<title>This is an example title. (0123-45)</title>
<p>This is an example sentence.</p>
</div>
The string I need is the one between the brackets and it can consist of 2 digits, 4 digits, 6 digits or 10 digits. As the string can also appear elsewhere in the text, I can only target the occurrences directly before the closing tags </title> and </p>.
I already tried to use analyze-string with regex but ended up targeting all of the strings instead of the ones I need.
Is there any way this can be done in XSLT? Thanks in advance to point me in the right direction!
Kind regards
How about:
XSLT 2.0
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- identity transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="div">
<div id="{replace(., '.*\((.*)\).*', '$1')}">
<xsl:apply-templates/>
</div>
</xsl:template>
</xsl:stylesheet>
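The pattern above takes whatever is inside the last pair of parentheses anywhere in the div, and returns the whole string when there are no parentheses at all. Here is a stricter sketch, assuming the code consists only of digits and hyphens and must sit at the very end of the title or of a p. The variable name src is just illustrative:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="div">
    <!-- first title or p whose text ends with "(digits-and-hyphens)" -->
    <xsl:variable name="src"
        select="(title, p)[matches(., '\([0-9\-]+\)\s*$')][1]"/>
    <div>
      <!-- only add the id when such an element was found -->
      <xsl:if test="$src">
        <xsl:attribute name="id"
            select="replace($src, '^.*\(([0-9\-]+)\)\s*$', '$1', 's')"/>
      </xsl:if>
      <xsl:apply-templates/>
    </div>
  </xsl:template>
</xsl:stylesheet>
```

The greedy .* at the start makes the capture group land on the last parenthesised code in the element, so an earlier occurrence in running text is skipped.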

Regex of XML with multiple tags

I'm trying to find all text that is not within the XML markup:
<transcript>
<text start="9.75" dur="5.94">welcome to about my property here you
can learn more about how your property</text>
<text start="15.69" dur="4.71">was assessed see the information impact
has on file and compare your property to</text>
<text start="20.4" dur="1.3">others in your neighborhood</text>
<text start="21.7" dur="5.32">interested in learning about market
trends in your municipality no problem</text>
<text start="105.79" dur="6.23">I have all of this and more about life property
. see your property assessment know more</text>
<text start="112.02" dur="0.11">about</text>
</transcript>
I am using the following regex pattern, but obviously it is not correct because it grabs all of the text between the opening and closing <transcript> tags:
<transcript>[\s\S]*?<\/transcript>
How can I modify this regex pattern to select only the text that is not within any of the markup tags?
Use XSLT. XSLT is a language specifically designed to convert XML into another output format (back to valid XML again, or something else such as (X)HTML, plain text, or any other format – but preferably, based on plain text).
In this case the smallest XSLT necessary is just this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0" >
<xsl:output method="text" indent="no" />
<xsl:template match="text">
<!-- do NOTHING here! -->
</xsl:template>
</xsl:stylesheet>
This works because the default behaviour for an element is to recursively apply templates to its children, and plain text is always copied. The only tag inside your <transcript> is <text>, and you process it by doing 'nothing', i.e., by not copying its contents to the output. The line inside that template is just a comment.
All other "nodes", in XML terminology, are those without a surrounding tag and so are copied to the output.
Alternatively, if you have more types of tags than just <text> elements and you want to skip all of them, apply templates to / and transcript to process each and apply another to * (which will select all remaining tags not specified elsewhere) to not process them:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0" >
<xsl:output method="text" indent="no" />
<xsl:template match="/">
<xsl:apply-templates />
</xsl:template>
<xsl:template match="transcript">
<xsl:apply-templates />
</xsl:template>
<xsl:template match="*">
<!-- do NOTHING here! -->
</xsl:template>
</xsl:stylesheet>
Again, the plain untagged text will fall through and not get processed, so their contents will be copied to output.
Both XSLT stylesheets will output only I ha, the only part in your sample text that is not surrounded by tags.
Do you want to find
welcome to about my property here you can learn more about how your property
from
<text start="9.75" dur="5.94">welcome to about my property here you can learn more about how your property</text>
?
Then this will work:
(?<=>).+?(?=<)
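If the goal is the spoken text itself, i.e. the content of each <text> element, a small XSLT sketch along these lines may be more robust than a lookaround regex. In most regex flavours . does not match a line break by default, and several of the sample entries span two lines. The sketch assumes the input is the <transcript> document from the question:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- drop the whitespace-only text nodes between the <text> elements -->
  <xsl:strip-space elements="*"/>

  <!-- the built-in rules walk down from the root to each <text> element -->
  <xsl:template match="text">
    <!-- collapse the embedded line break into a single space -->
    <xsl:value-of select="normalize-space(.)"/>
    <!-- one caption per output line -->
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```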

Problems Trying to Pretty Print XSLT Output

This is my first post, so please let me know if I can make it more constructive in any way. I have read the forum guidelines, so if I inadvertently break them in any way it will be nothing more than an innocent mistake.
The Question
Is a simple one:
How do I pretty print the output of an XSL file?
But with some criteria:
Using only native XSL functionality.
Without having to use a second XSL file to do a 'second pass'.
It must also work for elements with mixed content.
I have googled this reasonably thoroughly but have not found a clear answer to this question. I have only used XSL for about a week so go easy if I have somehow missed the answer elsewhere.
An Example
This XML...
<email>
<attachedItem>priceless photograph.jpg</attachedItem>
<attachedItem>important document.doc</attachedItem>
<attachedItem>access codes.pdf</attachedItem>
</email>
...Transformed by this XSL...
<!-- Pretty Print Output -->
<xsl:strip-space elements="*"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/">
<email>
"Please find attached the stuff."
<xsl:apply-templates/>
</email>
</xsl:template>
<xsl:template match="attachedItem">
<xsl:copy/>
</xsl:template>
...Produces this result...
<?xml version="1.0" encoding="utf-8"?>
<email>
"Please find attached the stuff."
<attachedItem>priceless photograph.jpg</attachedItem>
<attachedItem>important document.doc</attachedItem>
<attachedItem>access codes.pdf</attachedItem>
</email>
Using the Saxon6.5.5 engine
Desired Output
<?xml version="1.0" encoding="utf-8"?>
<email>
"Please find attached the stuff."
<attachedItem>priceless photograph.jpg</attachedItem>
<attachedItem>important document.doc</attachedItem>
<attachedItem>access codes.pdf</attachedItem>
</email>
My Own Progress on the Problem
From the XSL above you will see I have discovered the use of <xsl:strip-space> and <xsl:output>. This meets the first 2 criteria but not the 3rd. In other words, it produces nice pretty printed XML without the mixed content, but with it I receive the undesired output you can see above.
I know that the reason I get this output is because of the way whitespace is preserved in the source XML. White space is always preserved if it is part of a text node that contains other non-whitespace characters, regardless of the <xsl:strip-space> instructions. However despite my understanding I still cannot think of a solution.
Although I have addressed the first 2 criteria myself I would still like to know if this is the best way to achieve a pretty printed result.
Thanks in advance!
The following stylesheet produces exactly the output you request. The transformation was performed with Saxon 6.5.5. The correct indentation can only be achieved by meticulously typing all the line feed (&#10;) and space (&#32;) characters.
Note that pretty printing XML has no meaning when text content is concerned. The indentation of element tags can be easily controlled, but text nodes of elements with mixed content are always a problem. An application that takes XML as input should never rely on the exact indentation or whitespace handling of text content in XML.
In general, it is considered a bad idea to directly output literal text in an XSLT stylesheet. Always put text content inside xsl:text. xsl:strip-space has an effect only on whitespace-only text nodes of elements that belong to the input XML document (as suggested by @TobiasKlevenz already).
Stylesheet
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<!-- Pretty Print Output -->
<xsl:strip-space elements="*"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/">
<email>
<xsl:text>
"Please find attached the stuff."
</xsl:text>
<xsl:apply-templates/>
</email>
</xsl:template>
<xsl:template match="attachedItem|text()">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
</xsl:transform>
Output
<?xml version="1.0" encoding="utf-8"?>
<email>
"Please find attached the stuff."
<attachedItem>priceless photograph.jpg</attachedItem>
<attachedItem>important document.doc</attachedItem>
<attachedItem>access codes.pdf</attachedItem>
</email>
You can wrap "Please find attached the stuff." in an
<xsl:text>
element, which would produce what I assume is your desired result. If not, please post a 'desired output' example.

Odd XSL output in Symphony CMS

I'm in Symphony CMS trying to return an article image like so.
<img src="{$workspace}/uploads/{/data/news-articles/entry/image-thumbnail}"/>
The output looks like this
<img src="/workspace/uploads/%0A%09%09%09%09penuts_thumb.png%0A%09%09%09%09%0A%09%09%09">
If I just try to return the node value
<xsl:value-of select="image-thumbnail" />
Output looks correct
penuts_thumb.png
Any thoughts on why I'm getting all the excess characters?
Output looks correct
No, it only "looks correct", because the browser ignores white-space characters.
What happens is that the string "penuts_thumb.png" is surrounded by whitespace. When this whitespace is serialized as part of the src attribute value, it is encoded (normalized) -- this is why you see %0A (code for newline) and %09 (code for tab).
This transformation helps to see exactly what is generated in each case:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" indent="yes"/>
<xsl:variable name="workspace" select="'/workspace'"/>
<xsl:template match="/">
<img src="{$workspace}/uploads/{/data/news-articles/entry/image-thumbnail}"/>
===========
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="entry">
"<xsl:value-of select="image-thumbnail"/>"
</xsl:template>
</xsl:stylesheet>
when applied on this XML document:
<data>
<news-articles>
<entry>
<image-thumbnail>
penuts_thumb.png
</image-thumbnail>
</entry>
</news-articles>
</data>
produces this output:
<img src="/workspace/uploads/%0A penuts_thumb.png%0A ">
===========
"
penuts_thumb.png
"
As we can see (thanks to the quotes) in the second case the string "penuts_thumb.png" is also surrounded by a lot of whitespace characters.
Solution:
Use the normalize-space() function in this way:
<img src=
"{$workspace}/uploads/{normalize-space(/data/news-articles/entry/image-thumbnail)}"/>

How to get rid of xmlns: - Attributes in XSL transformation

I have an xsl transformation to generate ASP.NET User controls (ascx).
My XSL is defined this way:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:asp="System.Web.UI.WebControls"
exclude-result-prefixes="asp msxsl"
>
<xsl:output method="xml" indent="no" omit-xml-declaration="yes" />
Given that exclude-result-prefixes, I would assume that elements with the asp prefix would not carry the namespace declaration, but for example this template:
<xsl:template match="Label">
<asp:Label runat="server" AssociatedControlID="{../@id}">
<xsl:copy-of select="./text()"/>
</asp:Label>
</xsl:template>
fed with this xml:
<Label>Label Text</Label>
results in this output:
<asp:Label runat="server" AssociatedControlID="SomeName" xmlns:asp="System.Web.UI.WebControls">Label Text</asp:Label>
So what do I need to do to prevent the xmlns:asp=".." to show up in every single tag in my result?
This is not possible, at least in MSXML, because without the namespace declaration the output XML would not be well-formed with respect to namespaces. You can only emit the tag as text, e.g. using CDATA.
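Here is a sketch of the "output it as text" approach. It assumes the processor honours disable-output-escaping (support is optional in XSLT 1.0) and that the id value contains no characters that would need escaping inside an attribute:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>

  <xsl:template match="Label">
    <!-- the tag is written as raw text, so no asp namespace is declared
         in the stylesheet and none can appear in the result -->
    <xsl:text disable-output-escaping="yes">&lt;asp:Label runat="server" AssociatedControlID="</xsl:text>
    <xsl:value-of select="../@id"/>
    <xsl:text disable-output-escaping="yes">"&gt;</xsl:text>
    <!-- label text is still escaped normally -->
    <xsl:value-of select="."/>
    <xsl:text disable-output-escaping="yes">&lt;/asp:Label&gt;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The trade-off is that the processor no longer checks these tags for well-formedness, so a typo in the literal text goes straight into the generated .ascx file.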