XSLT tokenize - capturing the separators

XSLT tokenize - capturing the separators - xslt

here is a piece of code in XSL which tokenizes a text into fragments separated by interpunction and similar characters. I'd like to ask if there is a possibility to somehow capture the strings by which the text was tokenized, for example the comma or dot etc.
<xsl:stylesheet version="2.0" exclude-result-prefixes="xs xdt err fn" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:err="http://www.w3.org/2005/xqt-errors" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="GENERUJ">
<TEXT>
<xsl:variable name="text">
<xsl:value-of select="normalize-space(unparsed-text(#filename, 'UTF-8'))" disable-output-escaping="yes"/>
</xsl:variable>
<xsl:for-each select="tokenize($text, '(\s+("|\(|\[|\{))|(("|,|;|:|\s\-|\)|\]|\})\s+)|((\.|\?|!|;)"?\s*)' )">
<xsl:choose>
<xsl:when test="string-length(.)>0">
<FRAGMENT>
<CONTENT>
<xsl:value-of select="."/>
</CONTENT>
<LENGTH>
<xsl:value-of select="string-length(.)"/>
</LENGTH>
</FRAGMENT>
</xsl:when>
<xsl:otherwise>
<FRAGMENT_COUNT>
<xsl:value-of select="last()-1"/>
</FRAGMENT_COUNT>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
</TEXT>
</xsl:template>
As you see the constructed tags CONTENTS, LENGTH, I'd like to add one called SEPARATOR if you know what I mean. I couldnt find any answer to this on the internet and I'm just a beginner with xsl transformations so I'm looking for a quick solution. Thank you in advance.

The tokenize() function doesn't allow you to discover what the separators were. If you need to know, you will need to use xsl:analyze-string instead. If you use the same regex as for tokenize(), this passes the "tokens" to the xsl:non-matching-substring instruction and the "separators" to the xsl:matching-substring instruction.

Related

Extract parts of inputs strings

I am trying to extract equipment names from strings and would like if someone could help me find a good way to do this.
My input string can either contain 1 or 2 equipment names, consisting of EQ followed 1 to 3 digits, for example :
LocationEQ3Suffix
LocationEQ5EQ8Suffix
So in the first instance I would need 'EQ3' and in the second instance I would need 'EQ5' and 'EQ8'.
I need the output to be in a text format, for example :
SomeText.EQ3
SomeText.EQ5
SomeText.EQ8
I was thinking there might be a way to do this with xsl:analyze-string and a regex like EQ[0-9]{1,3}.
Any help is appreciated.
I started something like this, but I don't think it's the right approach.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:variable name="input" select="'LocationEQ3EQ4Funct'"/>
<xsl:choose>
<!-- Case with 2 EQ -->
<xsl:when test="matches($input, 'EQ[0-9]{1,3}EQ[0-9]{1,3}')">
<xsl:value-of select="$input"/>
</xsl:when>
<!-- Case with 1 EQ -->
<xsl:otherwise>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>

You say you want to use xsl:analyze-string but you're not.
An implementation using it would look something like:
<xsl:analyze-string select="input-string" regex="EQ\d{{1,3}}">
<xsl:matching-substring>
<xsl:text>SomeText.</xsl:text>
<xsl:value-of select="." />
<xsl:text>
</xsl:text>
</xsl:matching-substring>
</xsl:analyze-string>
Demo: https://xsltfiddle.liberty-development.net/a9Hk1a

check for successively numbered attributes

I have a situation where I need to check for attribute values that may be successively numbered and input a dash between the start and end values.
<root>
<ref id="value00008 value00009 value00010 value00011 value00020"/>
</root>
The ideal output would be...
8-11, 20
I can tokenize the attribute into separate values, but I'm unsure how to check if the number at the end of "valueXXXXX" is successive to the previous value.
I'm using XSLT 2.0

You can use xsl:for-each-group with #group-adjacent testing for the number() value subtracting the position().
This trick was apparently invented by David Carlisle, according to Michael Kay.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output indent="yes"/>
<xsl:template match="/">
<xsl:variable name="vals"
select="tokenize(root/ref/#id, '\s?value0*')[normalize-space()]"/>
<xsl:variable name="condensed-values" as="item()*">
<xsl:for-each-group select="$vals"
group-adjacent="number(.) - position()">
<xsl:choose>
<xsl:when test="count(current-group()) > 1">
<!--a sequence of successive numbers,
grab the first and last one and join with '-' -->
<xsl:sequence select="
string-join(current-group()[position()=1
or position()=last()]
,'-')"/>
</xsl:when>
<xsl:otherwise>
<!--single value group-->
<xsl:sequence select="current-group()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:variable>
<xsl:value-of select="string-join($condensed-values, ',')"/>
</xsl:template>
</xsl:stylesheet>

XSLT: Replace string with Abbreviations

I would like to know how to replace the string with the abbreviations.
My XML looks like below
<concept reltype="CONTAINS" name="Left Ventricular Major Axis Diastolic Dimension, 4-chamber view" type="NUM">
<code meaning="Left Ventricular Major Axis Diastolic Dimension, 4-chamber view" value="18074-5" schema="LN" />
<measurement value="5.7585187646">
<code value="cm" schema="UCUM" />
</measurement>
<content>
<concept reltype="HAS ACQ CONTEXT" name="Image Mode" type="CODE">
<code meaning="Image Mode" value="G-0373" schema="SRT" />
<code meaning="2D mode" value="G-03A2" schema="SRT" />
</concept>
</content>
</concept>
and I am selecting some value from the xml like,
<xsl:value-of select="concept/measurement/code/#value"/>
Now what I want is, I have to replace cm with centimeter. I have so many words like this. I would like to have a xml for abbreviations and replace from them.
I saw one similar example here.
Using a Map in XSL for expanding abbreviations
But it replaces node text, but I have text as attribute. Also, it would be better for me If I can find and replace when I select text using xsl:valueof select instead of having a separate xsl:template. Please help. I am new to xslt.

I have created XSLT v "1.1". For abbreviations I have created XML file as you have mentioned:
Abbreviation.xml:
<Abbreviations>
<Abbreviation>
<Short>cm</Short>
<Full>centimeter</Full>
</Abbreviation>
<Abbreviation>
<Short>m</Short>
<Full>meter</Full>
</Abbreviation>
</Abbreviations>
XSLT:
<xsl:stylesheet version="1.1" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes" method="xml" />
<xsl:param name="AbbreviationDoc" select="document('Abbreviation.xml')"/>
<xsl:template match="/">
<xsl:call-template name="Convert">
<xsl:with-param name="present" select="concept/measurement/code/#value"/>
</xsl:call-template>
</xsl:template>
<xsl:template name="Convert">
<xsl:param name="present"/>
<xsl:choose>
<xsl:when test="$AbbreviationDoc/Abbreviations/Abbreviation[Short = $present]">
<xsl:value-of select="$AbbreviationDoc/Abbreviations/Abbreviation[Short = $present]/Full"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$present"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
INPUT:
as you have given <xsl:value-of select="concept/measurement/code/#value"/>
OUTPUT:
centimeter
You just need to enhance this Abbreviation.xml to keep short and full value of abbreviation and call 'Convert' template with passing current value to get desired output.

Here a little shorter version:
- with abbreviations in xslt file
- make use of apply-templates with mode to make usage shorter.
But with xslt 1.0 node-set extension is required.
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:exsl="http://exslt.org/common"
extension-element-prefixes="exsl">
<xsl:output method="xml" indent="yes"/>
<xsl:variable name="abbreviations_txt">
<abbreviation abbrev="cm" >centimeter</abbreviation>
<abbreviation abbrev="m" >meter</abbreviation>
</xsl:variable>
<xsl:variable name="abbreviations" select="exsl:node-set($abbreviations_txt)" />
<xsl:template match="/">
<xsl:apply-templates select="concept/measurement/code/#value" mode="abbrev_to_text"/>
</xsl:template>
<xsl:template match="* | #*" mode="abbrev_to_text">
<xsl:variable name="abbrev" select="." />
<xsl:variable name="long_text" select="$abbreviations//abbreviation[#abbrev = $abbrev]/text()" />
<xsl:value-of select="$long_text"/>
<xsl:if test="not ($long_text)">
<xsl:value-of select="$abbrev"/>
</xsl:if>
</xsl:template>
</xsl:stylesheet>

Check for CR or LF in XSLT

I have a input XML like this :
<in_xml>
<company>
<project>
ProjNo1
ProjNo2
ProjNo3
</project>
</company>
</in_xml>
A simple XSLT is applied to this source, which writes another XML with the value of Project Tag.
The Project tag in input xml has three lines , it could be one or more line(s). I am looking at way for the XSLT to read only the first line, in case there are more than one and write the first line in the output xml.
The current XSLT is very simple as it just reads the Project tag and spits out the value, hence the code is not attached.
Regards.
I have added the answer to the question, see below #Maestro's answer.

If you are in the happy circumstances of being able to apply XSLT 2.0, the following may help:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="//project/text()">
<xsl:value-of select="tokenize(normalize-space(.),' ')[1]" />
</xsl:template>
</xsl:stylesheet>
Explanation: first normalize-space() to replace all whitespace strings by a single blank (and cut off leading and trailing whitespace), then split into words, then take the first one.
In XSLT 1.0 you could use
<xsl:value-of select="substring-before(normalize-space(.), ' ')"/>
instead. Less flexible if the second word has to be selected, but for the first word it works OK.
EDIT
you asked how to retrieve a first line in XSLT 1.0 - problem here is the leading whitespace which may contain a LF so you cannot just substring-before the first LF.
The below can probably be improved upon, but it works fine:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="//project/text()">
<xsl:variable name="afterLeadingWS"
select="substring-after(., substring-before(.,substring-before(normalize-space(.), ' ')))"/>
<xsl:choose>
<xsl:when test="contains($afterLeadingWS, '
')">
<xsl:value-of select="substring-before($afterLeadingWS, '
')"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$afterLeadingWS"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
Explanation: first get first word as before, then determine the whitespace before that first word, then get everything after that leading whitespace, then get the first line, which is the string before a LF character. It may just happen that there is no LF except maybe in the leading whitespace, hence the choose function.

first up thanks to #Maestro for taking time out and helping me, appreciate your help.
Here is the code that I used to get the first line from a paragraph of text that has a CR:
<xsl:variable name="projNumber" select="ProjectNumber" />
<xsl:variable name="crlf" select="'
'" />
<xsl:choose>
<xsl:when test="contains($projNumber,$crlf)">
<xsl:value-of select="substring-before($projNumber,$crlf)"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$projNumber"/>
</xsl:otherwise>
</xsl:choose>
This can be written as a function but I don't know how to do it, maybe someone can guide but there you go. A better approach that a colleague of mine suggested is to escape the CR and directly use it in substring function this would avoid all those variables in the first place.
Thanks again.

Here's the code again, now with a function getFirstLine.
Note the addtional namespace that is needed.
Also note that this does require XSLT 2.0 (xsl:function is not available in 1.0).
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:f="http://temp.com/functions">
<xsl:output method="text"/>
<xsl:template match="//project/text()">
<xsl:value-of select="f:getFirstLine(.)"/>
</xsl:template>
<xsl:function name="f:getFirstLine">
<xsl:param name="input"/>
<xsl:variable name="afterLeadingWS" select="substring-after($input, substring-before($input,substring-before(normalize-space($input), ' ')))"/>
<xsl:choose>
<xsl:when test="contains($afterLeadingWS, '
')">
<xsl:value-of select="substring-before($afterLeadingWS, '
')"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$afterLeadingWS"/>
</xsl:otherwise>
</xsl:choose>
</xsl:function>
</xsl:stylesheet>

Complex XSLT split?

Is it possible to split a tag at lower to upper case boundaries i.e.
for example, tag 'UserLicenseCode' should be converted to 'User License Code'
so that the column headers look a little nicer.
I've done something like this in the past using Perl's regular expressions,
but XSLT is a whole new ball game for me.
Any pointers in creating such a template would be greatly appreciated!
Thanks
Krishna

Using recursion, it is possible to walk through a string in XSLT to evaluate every character. To do this, create a new template which accepts only one string parameter. Check the first character and if it's an uppercase character, write a space. Then write the character. Then call the template again with the remaining characters inside a single string. This would result in what you want to do.
That would be your pointer. I will need some time to work out the template. :-)
It took some testing, especially to get the space inside the whole thing. (I misused a character for this!) But this code should give you an idea...
I used this XML:
<?xml version="1.0" encoding="UTF-8"?>
<blah>UserLicenseCode</blah>
and then this stylesheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">
<xsl:output method="text"/>
<xsl:variable name="Space">*</xsl:variable>
<xsl:template match="blah">
<xsl:variable name="Split">
<xsl:call-template name="Split">
<xsl:with-param name="Value" select="."/>
<xsl:with-param name="First" select="true()"/>
</xsl:call-template></xsl:variable>
<xsl:value-of select="translate($Split, '*', ' ')" />
</xsl:template>
<xsl:template name="Split">
<xsl:param name="Value"/>
<xsl:param name="First" select="false()"/>
<xsl:if test="$Value!=''">
<xsl:variable name="FirstChar" select="substring($Value, 1, 1)"/>
<xsl:variable name="Rest" select="substring-after($Value, $FirstChar)"/>
<xsl:if test="not($First)">
<xsl:if test="translate($FirstChar, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', '..........................')= '.'">
<xsl:value-of select="$Space"/>
</xsl:if>
</xsl:if>
<xsl:value-of select="$FirstChar"/>
<xsl:call-template name="Split">
<xsl:with-param name="Value" select="$Rest"/>
</xsl:call-template>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
and I got this as result:
User License Code
Do keep in mind that spaces and other white-space characters do tend to be stripped away from XML, which is why I used an '*' instead, which I translated to a space.
Of course, this code could be improved. It's what I could come up with in 10 minutes of work. In other languages, it would take less lines of code but in XSLT it's still quite fast, considering the amount of code lines it contains.

An XSLT + FXSL solution (in XSLT 2.0, but almost the same code will work with XSLT 1.0 and FXSL 1.x:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:f="http://fxsl.sf.net/"
xmlns:testmap="testmap"
exclude-result-prefixes="f testmap"
>
<xsl:import href="../f/func-str-dvc-map.xsl"/>
<testmap:testmap/>
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:variable name="vTestMap" select="document('')/*/testmap:*[1]"/>
'<xsl:value-of select="f:str-map($vTestMap, 'UserLicenseCode')"
/>'
</xsl:template>
<xsl:template name="mySplit" match="*[namespace-uri() = 'testmap']"
mode="f:FXSL">
<xsl:param name="arg1"/>
<xsl:value-of select=
"if(lower-case($arg1) ne $arg1)
then concat(' ', $arg1)
else $arg1
"/>
</xsl:template>
</xsl:stylesheet>
When the above transformation is applied on any source XML document (not used), the expected correct result is produced:
' User License Code'
Do note:
We are using the DVC version of the FXSL function/template str-map(). This is a Higher-order function (HOF) which takes two arguments: another function and a string. str-map() applies the function on every character of the string and returns the concatenation of the results.
Because the lower-case() function is used (in the XSLT 2.0 version), we are not constrained to only the Latin alphabet.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

XSLT tokenize - capturing the separators - xslt

Related

Extract parts of inputs strings

check for successively numbered attributes

XSLT: Replace string with Abbreviations

Check for CR or LF in XSLT

Complex XSLT split?

Categories

Resources