XSLT missing linebreaks when selecting nodes by path expression - xslt

I am copying some nodes as described in "XSLT: Copy child elements of a complex type only once", using a path expression within a copy-of tag:
<xsl:copy-of select="/xs:schema/xs:complexType[@name=current()/xs:element/@type]"/>
In the output, all line breaks are missing between the elements processed by this statement (elsewhere they are present). It looks like this:
...</xs:complexType><xs:complexType....
I can only add line breaks before and after, but not between them. How can I achieve this? Thanks for your help!

You provided too little data to attempt any testing. For example, it is not clear which output method your script uses.
Quite often an XSLT script contains an xsl:strip-space instruction, which removes whitespace-only text nodes from the source tree.
The line breaks between elements in the input are exactly such nodes,
so they never reach the output.
Maybe this is the cause.
Also take a look at the xsl:output instruction in your script.
Does it contain the indent="yes" attribute?
If it doesn't, the serializer adds no line breaks between output elements.
Maybe your script outputs explicit line breaks in some places
(e.g. <xsl:text>&#xA;</xsl:text>), so those line breaks are rendered.
But if you have no indent="yes" attribute, then no line breaks are inserted
"automatically" between consecutive elements.

Your XPath expression only selects the xs:complexType elements, not the whitespace that separates them.
When you're working with a vocabulary such as XSD that doesn't use mixed content (except perhaps in annotations) it's probably best to remove all whitespace text nodes from the input using xsl:strip-space and then to generate new whitespace in the output using xsl:output indent='yes'.

Related

Duplicate line and replace string

I have an XML file that contains more than 10,000 items. Each item contains a line like this.
<g:id><![CDATA[FBM00101816_BLACK-L]]></g:id>
For each item I need to add another line below like this:
<sku><![CDATA[FBM00101816]]></sku>
So I need to duplicate each g:id line, replace the g:id with sku and trim the value to delete all characters after the underscore (including it). The final result would be like this:
<g:id><![CDATA[FBM00101816_BLACK-L]]></g:id>
<sku><![CDATA[FBM00101816]]></sku>
Any ideas how to accomplish this?
Thanks in advance.
In XSLT, it's
<xsl:template match="g:id">
<xsl:copy-of select="."/>
<sku><xsl:value-of select="substring-before(., '_')"/></sku>
</xsl:template>
Or using Saxon's Gizmo (https://www.saxonica.com/documentation11/index.html#!gizmo) it's
follow //g:id with <sku>{substring-before(., '_')}</sku>
Don't try to do this sort of thing in a text editor (or any other tool that doesn't involve a real XML parser) unless it's a one-off. Your code will be too sensitive to trivial variations in the way the source XML is written and will almost inevitably have bugs - which might not matter for a one-off, but do matter if it's going to be used repeatedly over a period of time.
Note also, the CDATA tags in your input (and output) are a waste of space. CDATA tags have no significance unless the element content includes special characters like < and &, which isn't the case in your examples.
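For readers who want the same parser-based approach outside XSLT, here is a minimal Python sketch using only the standard library. The sample feed, the namespace URI bound to g: and the items/item wrapper elements are assumptions made for illustration; only g:id and sku come from the question. Unlike substring-before(), this sketch keeps the whole id when there is no underscore.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed: the namespace URI and the items/item
# wrapper elements are assumptions, not taken from the question.
G_NS = "http://base.google.com/ns/1.0"
SAMPLE = """<items xmlns:g="http://base.google.com/ns/1.0">
  <item><g:id><![CDATA[FBM00101816_BLACK-L]]></g:id></item>
  <item><g:id><![CDATA[FBM00101817_RED-M]]></g:id></item>
</items>"""


def add_sku(root):
    """Insert a <sku> sibling right after every g:id element."""
    g_id_tag = f"{{{G_NS}}}id"
    # Snapshot the elements first so that inserting does not disturb iteration.
    for parent in list(root.iter()):
        # Go backwards so earlier indexes stay valid after each insert.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag == g_id_tag:
                sku = ET.Element("sku")
                # Keep the text before the first underscore
                # (the whole id if there is no underscore).
                sku.text = (child.text or "").split("_", 1)[0]
                sku.tail = child.tail
                parent.insert(index + 1, sku)
    return root


root = add_sku(ET.fromstring(SAMPLE))
skus = [element.text for element in root.iter("sku")]
print(skus)
```

Because the parser hands over the element text directly, it makes no difference whether the source wrote the value as CDATA or as plain text.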
Okay, so after commenting, I couldn't help myself. This seemed to do what you asked for.
find: <g:id><!\[CDATA\[([^_\]]+).*?\]\]></g:id>
replace: $0\n<sku><![CDATA[$1]]></sku>
I don't have BBEdit, but this worked in TextMate.

XSL "not" not working as expected attempting to compare two variables

Using XSL v3.0 I'm trying to compare two variables. One was created from .txt format directory listings imported as unparsed text. The other was created by querying xml files. Both contain references to jpgs.
I want to create a third variable using select="not" to find out which jpg references are present in one variable but not the other. I know the syntax for this, $var1[not(.=$var2)], and I was able to use it successfully in another place in this same XSL file.
I can output the values from each of the two variables and they look just like a .txt file would, and the values are what I would expect to see.
But for the life of me I cannot get the "not" to work. As far as I can tell it just returns the entire value of one of the variables.
Is there a way to just brute force these two variables into the same format so I can do this? I just want each variable to be a flat file that I can compare to the other and output another boring old flat file. I've tried all the combinations of tokenize and string-join etc. that I've stumbled across and nothing seems to work.
If I was using a bash script I would just pipe the dirs to two .txt files and use diff to do this, but achieving the same thing in XSL is killing me.
Clearly I am a novice at XSL. Any assistance appreciated.
per Michael Kay's suggestion
complete xsl available at this dropbox link
https://www.dropbox.com/s/fsltr34f5l3ci5a/jpg_report_stack.xsl?dl=0
variable with all jpg names - $jpg_all_distinct_joined
<xsl:variable name="jpg_all_distinct_joined" as="xs:string" select="string-join((distinct-values(($token_full, $token_800, $token_thumb))),'
')"/>
variable with all jpg references from xml - $jpg_all_links
<xsl:variable name="jpg_all_links" select="($jpg_link_pb, $jpg_link_bibl, $jpg_link_ref)"/>
not statement
<xsl:variable name="jpgs_in_xml_not_directories" select="($jpg_all_links)[not(.=$jpg_all_distinct_joined)]"/>
outputs the value of $jpg_all_links - this is not what I want - I want the output to be all jpg references from $jpg_all_links that are not in $jpg_all_distinct_joined
The variable $jpg_all_links is a sequence concatenation of the three variables ($jpg_link_pb, $jpg_link_bibl, $jpg_link_ref). All three of these variables are constructed using string-join() with newline as the separator, so it seems likely that $jpg_all_links is a sequence of three strings, each comprising multiple strings separated by newlines. The variable $jpg_all_distinct_joined is also formed using string-join() with a newline separator. So my suspicion is that (writing # to represent a newline character), you are doing something like
("A#B#C", "D#E#F")[not(. = "C#D")]
and nothing is being eliminated because none of the strings is equal to "C#D". You want to compare the sets of individual strings, not the composite strings formed using string-join().
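The difference between the two comparisons can be sketched in a few lines of Python (used here only because it is easy to run; the variable names and values are illustrative and follow the "A#B#C" example above, with a real newline in place of #).

```python
# '\n' plays the role of the '#' placeholder used in the example above.
links = ["A\nB\nC", "D\nE\nF"]   # like $jpg_all_links: a few newline-joined strings
on_disk = "C\nD"                 # like $jpg_all_distinct_joined: one joined string

# What the predicate effectively does: compare whole composite strings.
# No composite string equals "C\nD", so nothing is filtered out.
composite_diff = [s for s in links if s != on_disk]

# What is wanted: split back into individual names first, then compare.
link_names = [name for s in links for name in s.split("\n") if name]
disk_names = set(on_disk.split("\n"))
missing = [name for name in link_names if name not in disk_names]

print(composite_diff)
print(missing)
```

In the stylesheet, the equivalent move would be to keep (or tokenize back into) sequences of individual file names before applying the not() predicate, and to use string-join() only when writing the report.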

UltraEdit/Notepad - XML Remove nodes with empty properties

I'm currently facing an issue with software I'm working with. It receives several XML files from an external system that we need to process. These files contain a lot of nodes that are totally useless and make the files really heavy, so our program is very slow to process each one. This should be changed in the future, and I'd like to prove that removing those nodes would improve our processing time a lot. As a first step I'd like to do this manually, using a sample XML file and applying a regex to remove all the nodes whose value property is empty. This is the pattern I'm using now; with the replace function in Notepad++ I'm able to remove those rows and then remove the empty lines:
<.*(\s\w+?[^=]*?="[^"]*?")*?\s+?value="[""]*?".*?>
Example
<TEST_NODE value="1"/>
<TEST_NODE value=""/>
<TEST_NODE value="0"/>
In my case nodes can be named differently and can have different properties, but the ones I care about are those that contain something in the value property, so in this case I should remove the second row.
This looks to be working fine; however, with very large files (10 MB) the Notepad++ replace function seems to have issues and stops working properly, breaking a lot of tags...
I've tried another editor, UltraEdit, but I guess the syntax is different there: I can use regular expressions but need to select one of these options: Perl, Unix, UltraEdit. Only with "Perl" am I able to do this replacement, but there too it does not work for big files and I get the following error:
The complexity of matching the expression has exceeded available resources..
Can anyone help me out with this? Unfortunately I'm not that good with regex and I'm not sure if the above pattern is good or bad.
Try this:
<(?=[^><]*?value\s*=\s*"")[^><]*>
Replace with nothing.
This might be a case of catastrophic backtracking, caused by too many quantifiers applied to overly broad patterns such as . (any character).
The quantifiers in this answer are only applied to the "not < or >" class, which should stop the expression from backtracking across XML tags.
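As a quick check, the pattern can be tried on the three sample rows from the question. This is a small sketch using Python's re module, which accepts the same lookahead syntax; it is not a statement about how Notepad++ or UltraEdit behave on large files.

```python
import re

# The pattern from this answer: every quantifier is confined to [^><],
# so a match attempt can never run past the end of the current tag.
PATTERN = re.compile(r'<(?=[^><]*?value\s*=\s*"")[^><]*>')

sample = (
    '<TEST_NODE value="1"/>\n'
    '<TEST_NODE value=""/>\n'
    '<TEST_NODE value="0"/>'
)

# Replace every matching tag with nothing, as suggested above.
cleaned = PATTERN.sub("", sample)
print(cleaned)
```

Only the tag with the empty value is removed; the line it sat on is left empty, so the empty lines still have to be cleaned up in a second step, as in the question.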
You're using the wrong tool for the job. If you're going to be manipulating XML then you need to add XSLT and/or XQuery to your tool kit. Using regular expressions for the job is slow and error-prone.
For example, here are just a few of the bugs in the answer that you accepted:
Elements that use single quotes (value='') won't be matched
Element with whitespace around the equals sign won't be matched
Elements with an attribute whose name ends in value (e.g. xvalue="") will be matched
value="" will be matched inside comment and CDATA nodes
value="" can be matched inside text nodes: <x>value=""</x>
Elements split across multiple lines won't be matched (I suspect)
In XSLT 3.0 this is simply
<xsl:transform version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="*[@value='']"/>
</xsl:transform>
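If an XSLT processor is not at hand, the same parser-based idea can be sketched with Python's standard library. The sample document below is made up to exercise some of the cases listed above (single quotes, whitespace around the equals sign, an attribute whose name merely ends in value, a comment); drop_empty_value is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up sample covering several of the cases that trip up the regex.
SAMPLE = """<root>
  <TEST_NODE value="1"/>
  <TEST_NODE value=''/>
  <!-- value="" inside a comment is left alone -->
  <group><OTHER_NODE xvalue="" value="0"/><TEST_NODE value = ""/></group>
</root>"""


def drop_empty_value(root):
    """Remove every element whose value attribute is the empty string."""
    # Snapshot parents and children so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("value") == "":
                parent.remove(child)
    return root


root = drop_empty_value(ET.fromstring(SAMPLE))
remaining = [
    (element.tag, element.get("value"))
    for element in root.iter()
    if element.get("value") is not None
]
print(remaining)
```

Like the empty template in the stylesheet, this drops the matched element together with anything inside it.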
Try this regular expression in Notepad++
<[^<]+value=""[^>]*>

XSLT output format

I am using XSLT to generate an .sql file from an .xml input file.
I have some problems with the indentation.
The way the stylesheet is formatted (how many line feeds, carriage returns and tabs it contains) directly affects the output file, i.e. if I include a few line feeds and CRs in my stylesheet to make it more readable, they are displayed in the output file as well (this would not be that bad if the tabs didn't affect the formatting of the output file as well):
It looks like this:
SQLStatement1<CR><LF>
<CR><LF>
<CR><LF>
SQLStatement2<CR><LF>
.... (tabs are also outputted)
I use an ant task to create the .sql file. The target looks like this:
<xslt in="input.xml"
out="queries.sql"
style="createQueries.xls">
</xslt>
I am using XSLT 1.0 and cannot use XSLT 2.0.
I thought about modifying some output parameters. However it does not have any effect if I change the method attribute to e.g. 'html' (I guess that the method is set to 'text' since the type of the output file(sql) is not known)
Any ideas on how to fix this issue?
Cheers
You would make it much easier on us if you showed a small but complete XML input sample, an XSLT sample, the output you get and the output you want.
If you use xsl:output method="text" and want to control the white space then make sure you use xsl:text to output literal text and xsl:value-of to output computed text. That way you should be able to control the white space exactly.

How to check if an apply-template filled variable is a string or (possibly) empty node in XSLT?

I need a test for a variable which would evaluate to true in two cases:
There is a string inside which contains any non white space characters
There is any node (which can be possibly empty)
and the variable is filled with the result of an apply-templates call.
I tried
test="normalize-space($var)"
but this doesn't cover the empty tag possibility. I also tried simply this:
test="$var"
but this evaluates to true even for whitespace-only strings.
By the way, "$var/*" produces an error "Expression ...something I don't remember... node-set", which I think is because the variable is instantiated from apply-templates.
Is there any (which means even multi level decision) solution for this?
EDIT: I forgot to say that it's for XSLT 1.0 and preferably without any exslt extensions or similar.
If you want to check whether a result tree fragment contains any child nodes then you need exsl:node-set e.g.
<xsl:if test="exsl:node-set($variable)/node()"
        xmlns:exsl="http://exslt.org/common">
  <!-- reached only when $variable contains at least one child node -->
</xsl:if>
Without that extension function (or a similar one your particular XSLT 1.0 processor offers) you can't perform that check.
Hope this helps:
"string($variable)" will test true if the string has any characters in it, false if it equals ''.
http://www.dpawson.co.uk/xsl/sect2/N3328.html#d4474e64
Contains no child nodes: not(node())
Contains no text content: not(string(.))
Contains no text other than whitespace: not(normalize-space(.))
Contains nothing except comments: not(node()[not(self::comment())])