Duplicate line and replace string - regex

I have an XML file that contains more than 10,000 items. Each item contains a line like this.
<g:id><![CDATA[FBM00101816_BLACK-L]]></g:id>
For each item I need to add another line below like this:
<sku><![CDATA[FBM00101816]]></sku>
So I need to duplicate each g:id line, replace the g:id with sku and trim the value to delete all characters after the underscore (including it). The final result would be like this:
<g:id><![CDATA[FBM00101816_BLACK-L]]></g:id>
<sku><![CDATA[FBM00101816]]></sku>
Any ideas how to accomplish this?
Thanks in advance.

In XSLT, it's
<xsl:template match="g:id">
<xsl:copy-of select="."/>
<sku><xsl:value-of select="substring-before(., '_')"/></sku>
</xsl:template>
Or using Saxon's Gizmo (https://www.saxonica.com/documentation11/index.html#!gizmo) it's
follow //g:id with <sku>{substring-before(., '_')}</sku>
Don't try to do this sort of thing in a text editor (or any other tool that doesn't involve a real XML parser) unless it's a one-off. Your code will be too sensitive to trivial variations in the way the source XML is written and will almost inevitably have bugs - which might not matter for a one-off, but do matter if it's going to be used repeatedly over a period of time.
Note also, the CDATA tags in your input (and output) are a waste of space. CDATA tags have no significance unless the element content includes special characters like < and &, which isn't the case in your examples.

Okay, so after commenting, I couldn't help myself. This seemed to do what you asked for.
find: <g:id><!\[CDATA\[([^\_]+)?(.+)?\]></g:id>
replace: $0\n<sku><![CDATA[$1]></sku>
I don't have BBEdit, but this is what it looked like in Textmate:

Related

UltraEdit/Notepad - XML Remove nodes with empty properties

I'm currently facing an issue with a software i'm working with , this software receives from an external sofware several Xmls that we do need to process , now our issue is that those Xml files contain a lot of nodes which are totally useless and also make the files (xmls) really heavy because of that , in result out program runs very slow to process each one of the xmls , this should be changed in the future and i'd like to prove that by removing those nodes we would improve our processing time a lot , now i'd like as first step to do this manually , using a sample xml and applying a regex syntax to remove all the nodes with value property empty , this is the syntax that i'm using now and through the replace function in notepad i'm able to remove those rows and then remove the empty lines :
<.*(\s\w+?[^=]*?="[^"]*?")*?\s+?value="[""]*?".*?>
Example
<TEST_NODE value="1"/>
<TEST_NODE value=""/>
<TEST_NODE value="0"/>
In my case nodes can be named differently and can have different properties , but the one that i should care for are the ones that contain something in the value property , therefore in this case i should remove the second row
This looks to be working fine , however with very large files (10 mb) the replace notepad++ function seems to have issues and it stop working properly breaking a lot of tags...
I've tried using another software called "Ultraedit" , but there the syntax i guess it's different as i can use regular Expressions but need to select one of those options : Perl , Unix , Ultraedit ; only using "Perl" i'm able to do this replacement but also there , for big files this is not working and i get the following error:
The complexity of matching the expression has exceeded available resources..
Can anyone help me out with this? unfortunately i'm not even that good with Regex and i'm not sure if the above code is good or bad..
Try this:
<(?=[^><]*?value\s*=\s*"")[^><]*>
Replace with nothing.
This might be a case of catastrophic backtracking when the regex runs caused by too many quantifiers applied to too many wide character classes like .
The quantifiers in this answer are only applied to not < or > class which should stop the expression backtracking through XML tags.
You're using the wrong tool for the job. If you're going to be manipulating XML then you need to add XSLT and/or XQuery to your tool kit. Using regular expressions for the job is slow and error-prone.
For example, here are just a few of the bugs in the answer that you accepted:
Elements that use single quotes (value='') won't be matched
Element with whitespace around the equals sign won't be matched
Elements with an attribute whose name ends in value (e.g. xvalue="") will be matched
value="" will be matched inside comment and CDATA nodes
value="" can be matched inside text nodes: <x>value=""</x>
Elements split across multiple lines won't be matched (I suspect)
In XSLT 3.0 this is simply
<xsl:transform version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="*[#value='']"/>
</xsl:transform>
Try this regular expression in Notepad++
<[^<]+value=""[^>]*>

XSLT missing linebreaks when selecting nodes by path expression

I am copying some nodes according to XSLT: Copy child elements of a complex type only once by using a path expression within a copy-of tag:
<xsl:copy-of select="/xs:schema/xs:complexType[#name=current()/xs:element/#type]"/>
In the output all linebreaks are missing at the elements processed by this statement. (Elesewhere they are shown) It looks like this:
...</xs:complexType><xs:complexType....
I can only add linebreaks before and after, but not between them. How can i achive this? Thanx for your help!
You provided too little data to attempt any testing. E.g. it is not clear, what output method uses your script.
Quite often XSLT script contains xsl:strip-space instruction, which causes normalization of text nodes.
This normalization a.o. changes "internal" sequences of "white" chars, including line breaks,
into a single space.
Maybe this is the cause.
Take alse a look at xsl:output instruction in your script.
Does it contain indent="yes" attribute?
If it doesn't, the output contains no line breaks between output elements.
Maybe your script contains in some places output of explicite line breaks
(e.g. <xsl:text>&#aA;</xsl:text>), so these line breaks are rendered.
But if you have no indent="yes" attribute, then no line breaks are inserted
"automatically" between consecutive elements.
Your XPath expression only selects the xs:complexType elements, not the whitespace that separates them.
When you're working with a vocabulary such as XSD that doesn't use mixed content (except perhaps in annotations) it's probably best to remove all whitespace text nodes from the input using xsl:strip-space and then to generate new whitespace in the output using xsl:output indent='yes'.

TinyXML2: Replace Node function?

I am having a hard time using TinyXML2 (https://github.com/leethomason/tinyxml2) to write a C/C++ method that replaces a given node like:
<doc>
<replace>Foo</replace>
</doc>
...with another node:
<replacement>Bar</replacement>
...so that the outcome is:
<doc>
<replacement>Bar</replacement>
</doc>
However, the node to be replaced may appear multiple times an I would like to keep the order in case I replace the second node with something else.
This should actually be straight-forward, but I am failing with endless recursions.
Is there probably an example around of how to do that? Any help would be greatly appreciated.
Do you have sample code?
You could try calling tinyxml2::XMLNode::InsertAfterChild to insert <replacement> followed by a deletion of <replace>.
This answer also seems related: Updating Data in tiny Xml element
I'd recommend copying the source xml to a new document using the visitor pattern making substitutions as you go. Substituting in-place is very likely to lead to broken chains and the endless loops that you're experiencing.
You can find an example of using the vistor pattern to make substutions (in element attributes and text but it's the same principle) here. See xcopy function and associated code near the bottom.

Prevent Narrow Non-Breaking Space (n-nbsp) in XSLT output

I have an XSLT transform that puts   into my output. That is a narrow-non breaking space. Here is one section that results in nnbsp:
<span>
<xsl:text>§ </xsl:text>
<xsl:value-of select="$firstsection"/>
<xsl:text> to </xsl:text>
<xsl:value-of select="$lastsection"/>
</span>
The nnbsp in this case, comes in after the § and after the text to.
<span>§ 1 to 8</span>
(interestingly, the space before the to turns out to be a regular full size space)
This occurs in my UTF-8 encoded output, as well as iso-8859-1 (latin1).
How can I avoid the nnbsp? While the narrow space is visually more appropriate, it doesn't work for all the devices that will read this document. I need a plain vanilla blank space.
Is there a transform setting? I use Saxon 9 at the command line.
Should I do a another transform.. using a replace template to replace the nnbsp?
Should I re-do my templates like the one above? Example, if I did a concat() would that be a better coding practice?
UPDATE: For those who may find this question someday... as suggested by Michael Kay, I researched the issue further. Indeed, it turns out narrow-NBSP were in the source XML files (and bled into my templates via cut/paste). I did not know this, and it was hard to discover (hat tip to gVim hex view). The narrows don't exactly jump out at you in a GUI editor. I have no control over production of the source XML, so I had to find a way to 'deal with it.' Eric's answer below turned out to be my preferred way to scrub the narrow-nbsp. SED editing was (and is) an another option to consider, but I like keeping my production in XSLT when possible. So Eric's suggestion has worked well for me.
You could use the translate() function to replace your nnbsp by something else, but since you are using Saxon 9 you can rely on XSLT 2.0 features and use a character map which will do that kind of things automatically for you, for instance (assuming that you want to replace them by a non breaking space:
<xsl:output use-character-maps="nnbsp"/>
<xsl:character-map name="nnbsp">
<xsl:output-character character=" " string=" "/>
</xsl:character-map>
Eric
The narrow non-breaking space is coming from somewhere: either the source document or the stylesheet. It's not being magically injected by the XSLT processor. If it's in the stylesheet, then get rid of it. If it's in the source document, then transform it away, for example by use of the translate() function.
In fact, pasting your code fragment into a text editor and looking at it in hex, I see that the 202F characters are right there in your code. I don't know how you got them into your stylesheet, but you should (a) remove them, and (b) work out how it happened so it doesn't happen again.

A XML to CSV transformation, with complications

I swear I have looked at the existing threads! But I still need help.
I need to take some very messy XML and convert it to a very neat CSS file for upload to a website database.
I don't really need a finished solution, but I need help with understanding the process I should follow to solve my problem in XSLT. I won't ask you all to code for me, just tell me the elements and template structure I need. I would also love if the community could explain the logic behind the process, so that I can modify it as needed.
I have xml that has records in all orders and numbers:
<record-list>
<record>
<title>Title One</title
<author>Author One</author>
<subject>
Subject One A
Subject One B
Subject One C
</subject>
<subject>Subject Two</subject>
<subject>Subject Three</subject>
<subject>Subject Four</subject>
</record>
<record>
<subject>Subject Five</subject>
<title>Title Two</title>
<useless-element>Extra Stuff One</useless-element>
</record>
<record>
<title>Title Three</title>
<subject>Subject Six</subject>
<author/>
</record>
</record-list>
So I have multiple numbers of repeated elements, some missing elements, some empty elements, elements out of order, and some elements with extra line breaks.
I need a CSV file which reads as below, or with a different number of subject repeats (see requirements below)
"Title","Subject","Subject","Subject","Author"
"Title One","Subject One A ; Subject One B ; Subject One C","Subject Two","Subject Three","Author One"
"Title Two", "Subject Five","","",""
"Title Three","Subject Six","","",""
Requirements for the final output
-The number of columns of any repeated elements either needs to match the record with the most repeats of that element, or the program needs to chop off any repeats past a certain number.
-Each new record needs a line break and no other line breaks can exist in the files (only as record delimiters).
-The elements each need to be in the same order for each record.
-Each element text needs quotes around it (to handle intrinsic commas).
-Missing or empty elements need blank, comma surrounded quotes.
-Extra elements can't be sent through to the output
What I have done:
I have figured out how to get rid of the extra line breaks within the elements using the translate function, although I would love a solution that lets me replace the line breaks with more than one character (right now I will have to run find-and-replace to change a placeholder character to a space-semicolon-space in my output). I can get the quotes, commas, and line breaks in the output with text elements and strip-whitespace.
However, I don't know how to straighten out the order of the elements, handle the element repeats, or put through only some elements while still using the element as the cue for the line-break.
Right now, I just need a solution that works, even if all sorts of manual manipulation or multiple style-sheets are required. I can even do a find and replace in a text editor, as long as the output is good. Please help with an XSLT solution, I don't even begin to know any other suitable programing languages (college matlab many years ago is not helping).
I think I need to run two transforms. I looked at the XSLT bible, Mangano's XSLT Cookbook, where he used two transforms for a similar problem. However, his solution is so generalized, I can't understand it. If I can't figure out how it works, I can't modify it for my needs. Sorry, but without a programming background, the explanations on this site and in the text are challenging at best. However, I think I am presenting a problem with some novel features, compared to others asked on this forum.
Any help, be it non-generalized code, or even just a suggested procedure for multiple runs through my processor would be wonderful. I have been struggling with this for over a week and have made very little progress.
Thanks
CAMc
I'd suggest having a look at A CSV to XML converter in XSLT 2.0. There's a lot of useful info on that page, including how to run it.