XSL "not" not working as expected attempting to compare two variables - xslt

Using XSL v3.0, I'm trying to compare two variables. One was created from .txt-format directory listings imported as unparsed text; the other was created by querying XML files. Both contain references to jpgs.
I want to create a third variable, using not() in the select expression, to find out which jpg references are present in one variable but not the other. I know the syntax for this: $var1[not(. = $var2)]. I was able to use it successfully in another place in this same XSL file.
I can output the values from each of the two variables and they look just like a .txt file would, and the values are what I would expect to see.
But for the life of me I cannot get the "not" to work. As far as I can tell it just returns the entire value of one of the variables.
Is there a way to just brute force these two variables into the same format so I can do this? I just want each variable to be a flat file that I can compare to the other and output another boring old flat file. I've tried all the combinations of tokenize and string-join etc. that I've stumbled across and nothing seems to work.
If I were using a bash script I would just pipe the dirs to two .txt files and use diff to do this, but achieving the same thing in XSL is killing me.
Clearly I am a novice at XSL. Any assistance appreciated.
Per Michael Kay's suggestion, the complete XSL is available at this Dropbox link:
https://www.dropbox.com/s/fsltr34f5l3ci5a/jpg_report_stack.xsl?dl=0
variable with all jpg names - $jpg_all_distinct_joined
<xsl:variable name="jpg_all_distinct_joined" as="xs:string" select="string-join((distinct-values(($token_full, $token_800, $token_thumb))),'
')"/>
variable with all jpg references from xml - $jpg_all_links
<xsl:variable name="jpg_all_links" select="($jpg_link_pb, $jpg_link_bibl, $jpg_link_ref)"/>
not statement
<xsl:variable name="jpgs_in_xml_not_directories" select="($jpg_all_links)[not(.=$jpg_all_distinct_joined)]"/>
This outputs the value of $jpg_all_links, which is not what I want - I want the output to be all jpg references from $jpg_all_links that are not in $jpg_all_distinct_joined.

The variable $jpg_all_links is a sequence concatenation of the three variables ($jpg_link_pb, $jpg_link_bibl, $jpg_link_ref). All three of these variables are constructed using string-join() with newline as the separator, so it seems likely that $jpg_all_links is a sequence of three strings, each comprising multiple strings separated by newlines. The variable $jpg_all_distinct_joined is also formed using string-join() with a newline separator. So my suspicion is that (writing # to represent a newline character), you are doing something like
("A#B#C", "D#E#F")[not(. = "C#D")]
and nothing is being eliminated because none of the strings is equal to "C#D". You want to compare the sets of individual strings, not the composite strings formed using string-join().
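One way to repair this (a sketch only, assuming the newline inserted by string-join() really is the only separator and the individual names carry no stray whitespace; the intermediate variable names $xml_names and $dir_names are just for illustration) is to split both sides back into sequences of individual strings before applying the predicate:
<!-- Split the composite strings back into individual jpg names.
     The '&#xA;' character reference is the same newline used by string-join(). -->
<xsl:variable name="xml_names" as="xs:string*"
    select="($jpg_all_links ! tokenize(., '&#xA;'))[normalize-space()]"/>
<xsl:variable name="dir_names" as="xs:string*"
    select="tokenize($jpg_all_distinct_joined, '&#xA;')[normalize-space()]"/>
<!-- Now the comparison is name against name, not blob against blob -->
<xsl:variable name="jpgs_in_xml_not_directories" as="xs:string*"
    select="$xml_names[not(. = $dir_names)]"/>
Alternatively, drop the string-join() calls altogether and keep both variables as sequences of strings; the original $var1[not(. = $var2)] pattern then works as-is, and you can string-join() only at the point where you serialize the report.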

Related

Duplicate line and replace string

I have an XML file that contains more than 10,000 items. Each item contains a line like this.
<g:id><![CDATA[FBM00101816_BLACK-L]]></g:id>
For each item I need to add another line below like this:
<sku><![CDATA[FBM00101816]]></sku>
So I need to duplicate each g:id line, replace the g:id with sku and trim the value to delete all characters after the underscore (including it). The final result would be like this:
<g:id><![CDATA[FBM00101816_BLACK-L]]></g:id>
<sku><![CDATA[FBM00101816]]></sku>
Any ideas how to accomplish this?
Thanks in advance.
In XSLT, it's
<xsl:template match="g:id">
<xsl:copy-of select="."/>
<sku><xsl:value-of select="substring-before(., '_')"/></sku>
</xsl:template>
Or using Saxon's Gizmo (https://www.saxonica.com/documentation11/index.html#!gizmo) it's
follow //g:id with <sku>{substring-before(., '_')}</sku>
Don't try to do this sort of thing in a text editor (or any other tool that doesn't involve a real XML parser) unless it's a one-off. Your code will be too sensitive to trivial variations in the way the source XML is written and will almost inevitably have bugs - which might not matter for a one-off, but do matter if it's going to be used repeatedly over a period of time.
Note also, the CDATA tags in your input (and output) are a waste of space. CDATA tags have no significance unless the element content includes special characters like < and &, which isn't the case in your examples.
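For completeness, the template above needs to sit inside a stylesheet that also copies everything else through unchanged and binds the g prefix; a minimal sketch (the namespace URI below is a placeholder - bind g to whatever URI your feed actually declares):
<xsl:transform version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:g="http://base.google.com/ns/1.0">
  <!-- Copy all other nodes through unchanged -->
  <xsl:mode on-no-match="shallow-copy"/>
  <!-- Reproduce each g:id and follow it with a sku holding the part before the underscore -->
  <xsl:template match="g:id">
    <xsl:copy-of select="."/>
    <sku><xsl:value-of select="substring-before(., '_')"/></sku>
  </xsl:template>
</xsl:transform>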
Okay, so after commenting, I couldn't help myself. This seemed to do what you asked for.
find: <g:id><!\[CDATA\[([^\_]+)?(.+)?\]></g:id>
replace: $0\n<sku><![CDATA[$1]]></sku>
I don't have BBEdit, but this is what it looked like in Textmate:

UltraEdit/Notepad - XML Remove nodes with empty properties

I'm currently facing an issue with a piece of software I'm working with. It receives several XML files from an external system that we need to process. The problem is that those XML files contain a lot of nodes that are totally useless and also make the files really heavy, so our program runs very slowly while processing each one. This should change in the future, and I'd like to prove that removing those nodes would improve our processing time a lot. As a first step I'd like to do this manually, taking a sample XML file and applying a regex to remove every node whose value property is empty. This is the syntax I'm using now; with the replace function in Notepad++ I'm able to remove those rows and then delete the empty lines:
<.*(\s\w+?[^=]*?="[^"]*?")*?\s+?value="[""]*?".*?>
Example
<TEST_NODE value="1"/>
<TEST_NODE value=""/>
<TEST_NODE value="0"/>
In my case the nodes can be named differently and can have different properties, but the ones I care about are the ones that contain something in the value property, so in this case I should remove the second row.
This looks to be working fine; however, with very large files (10 MB) the Notepad++ replace function seems to have issues and stops working properly, breaking a lot of tags.
I've tried another piece of software called UltraEdit, but there the syntax seems to be different: I can use regular expressions but need to select one of these options: Perl, Unix, UltraEdit. Only with "Perl" am I able to do this replacement, but again, for big files it doesn't work and I get the following error:
The complexity of matching the expression has exceeded available resources.
Can anyone help me out with this? Unfortunately I'm not that good with regex and I'm not sure whether the above expression is good or bad.
Try this:
<(?=[^><]*?value\s*=\s*"")[^><]*>
Replace with nothing.
This might be a case of catastrophic backtracking, caused by too many quantifiers applied to wide character classes such as . (which matches almost anything).
The quantifiers in this answer are only applied to the [^><] class (anything that is not < or >), which should stop the expression backtracking through XML tags.
You're using the wrong tool for the job. If you're going to be manipulating XML then you need to add XSLT and/or XQuery to your tool kit. Using regular expressions for the job is slow and error-prone.
For example, here are just a few of the bugs in the answer that you accepted:
Elements that use single quotes (value='') won't be matched
Elements with whitespace around the equals sign won't be matched
Elements with an attribute whose name ends in value (e.g. xvalue="") will be matched
value="" will be matched inside comment and CDATA nodes
value="" can be matched inside text nodes: <x>value=""</x>
Elements split across multiple lines won't be matched (I suspect)
In XSLT 3.0 this is simply
<xsl:transform version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="*[@value='']"/>
</xsl:transform>
Try this regular expression in Notepad++
<[^<]+value=""[^>]*>

XSLT missing linebreaks when selecting nodes by path expression

I am copying some nodes according to XSLT: Copy child elements of a complex type only once by using a path expression within a copy-of tag:
<xsl:copy-of select="/xs:schema/xs:complexType[@name=current()/xs:element/@type]"/>
In the output, all line breaks are missing at the elements processed by this statement (elsewhere they are present). It looks like this:
...</xs:complexType><xs:complexType....
I can only add line breaks before and after the whole group, but not between the copied elements. How can I achieve this? Thanks for your help!
You have provided too little data to attempt any testing; for example, it is not clear which output method your script uses.
Quite often an XSLT script contains an xsl:strip-space instruction, which removes whitespace-only text nodes - including the line breaks between elements - from the source tree. Maybe this is the cause.
Also take a look at the xsl:output instruction in your script. Does it contain an indent="yes" attribute? If it doesn't, the output contains no line breaks between output elements.
Maybe your script outputs explicit line breaks in some places (e.g. <xsl:text>&#xA;</xsl:text>), and those line breaks are rendered. But if you have no indent="yes" attribute, then no line breaks are inserted "automatically" between consecutive elements.
Your XPath expression only selects the xs:complexType elements, not the whitespace that separates them.
When you're working with a vocabulary such as XSD that doesn't use mixed content (except perhaps in annotations) it's probably best to remove all whitespace text nodes from the input using xsl:strip-space and then to generate new whitespace in the output using xsl:output indent='yes'.
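In stylesheet terms that is just two extra top-level declarations; a minimal sketch around an identity transformation (assuming XSLT 3.0 - with earlier versions, replace xsl:mode with the usual identity template):
<xsl:transform version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Remove whitespace-only text nodes from the source tree -->
  <xsl:strip-space elements="*"/>
  <!-- Let the serializer generate fresh line breaks and indentation -->
  <xsl:output method="xml" indent="yes"/>
  <!-- Copy everything through unchanged; your xs:complexType copy-of templates go alongside this -->
  <xsl:mode on-no-match="shallow-copy"/>
</xsl:transform>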

Possible combination (variations) of words in a string variable in stata

I have a string variable containing school names and I need to find all the possible combinations of each word in this string variable in Stata.
For example, variations of the word "Academy" would be:
Academy,
Academy,
acdamey,
aacdemy,
dmcaamy,
aacedmy,
and so on.
I need this to standardize the raw data of school names, which has many typos of each word due to data entry issues, like the ones given above for "academy".
Depending on whether your data is already in Excel sheets or in a file, you can either use a regex to try to match all possible combinations (and probably fix them when found) or parse the strings first before bringing them into Excel. In either case you could make a file (or Excel list/table/area/etc.) that includes all the common typos and use each typo as a regex match when comparing against your actual input.
Making a regexp that would actually find all possible cases is next to impossible, especially if very similar (but correct) school names exist. In any case, direct regexps would be very messy and complex, so I would advise you to parse the data by first finding the correct form, excluding it, and then using a (greedy) search/regex to find the typoed versions. You can then save the typos to use as a filter/match/pattern.
For some starting ideas, check these links:
Regex: Search for verb roots
Read text file and extract string into Excel sheet using regex
P.S. You should keep a count of all strings/school names and finally get a list of all names that did not match the correct form or any of your regex filters, so you can manually insert/correct them.

Parsing with fscanf() ignoring spaces or missing values?

I'm trying to scan a text file containing XML. The XML has a number of items with this structure:
<enemy>
<type> 0 </type>
<x> 273 </x>
<y> 275 </y>
<event> </event>
</enemy>
The problem is that the XML may have spaces between the tags or inside them. I created a loop and in each iteration I'm trying to do a single scan to get type, x, y and event each into its own variable. However, I don't know how to ignore whitespace, nor how to handle missing values, since some tags may or may not have a value (like event).
How can I scan this "enemy" regardless of spacing and missing values?
That's an easy one - you do not parse XML using fscanf(). Use a real XML parser; otherwise you will end up with very complicated code that will not work 80% of the time, either returning wrong data or crashing.
The XML format (despite its seeming simplicity) is complicated even in the most innocuous cases, and existing XML parsers exist for a reason. See libxml or one of the many others.
Still, if you are hell-bent on parsing the XML yourself, the right way to do it is to first tokenize the input and then check that your token sequences form correct constructs. That's far more complicated than a simple fscanf().