RegEx to remove specific XML elements

I'm using Kate to process text to create an XML file but I've hit a roadblock. The text now contains additional data that I need to remove based on its content.
To be specific, I have an XML element called <officers> that contains 0 or more <officer> elements, which contain further elements such as <title>, <name>, etc. While I could probably exclude these at run time using XSL, the file also drives another process that I don't want to touch: it's a general-purpose data importer for Scribus, so I don't want to change its code.
What I want to do is remove an <officer> element if the <title> content isn't what I want. For example, I don't want the First VP, so I'd like to remove:
<officer>
<title>First VP</title>
<incumbent>Joe Somebody</incumbent>
<address>....</address>
<address>....</address>
......
</officer>
I don't know how many lines will be in any <officer> element, nor what positions they will be in within the <officers> element.
The easy part is getting to the start of the content I want removed. The hard part is getting to the </officer> end tag. All the solutions I've found so far just result in Kate deciding that the RegEx is invalid.
Any suggestions are appreciated.

Regex is the wrong tool for this job; never process XML without a proper parser, except possibly for a one-off job on a single document where you will throw the code away after running it and checking the results by hand. You might find a regex that works on one sample document, but you'll never get it to work properly on a well-designed set of 100 test documents.
And it's easily done using XSLT. It's a stylesheet with two template rules: a default "identity template" rule to copy elements unchanged, and a second rule to delete the elements you don't want. In fact in XSLT 3.0 it gets even simpler:
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="officer[title='First VP']"/>
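For readers without an XSLT processor to hand, the same parser-based idea can be sketched in Python with the standard-library ElementTree module. This is only an illustration, not the XSLT solution above: the sample document, the function name remove_officers and the "President" entry are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the question's structure.
SAMPLE = """<officers>
  <officer>
    <title>President</title>
    <incumbent>Jane Example</incumbent>
  </officer>
  <officer>
    <title>First VP</title>
    <incumbent>Joe Somebody</incumbent>
    <address>....</address>
  </officer>
</officers>"""


def remove_officers(root, unwanted_titles):
    """Remove every <officer> whose <title> text is in unwanted_titles."""
    # ElementTree elements have no parent pointer, so visit each potential
    # parent and remove matching children from it. list() takes a snapshot
    # so the tree can be modified safely while looping.
    for parent in list(root.iter()):
        for officer in list(parent.findall("officer")):
            if officer.findtext("title") in unwanted_titles:
                parent.remove(officer)
    return root


root = remove_officers(ET.fromstring(SAMPLE), {"First VP"})
remaining_titles = [t.text for t in root.iter("title")]
print(ET.tostring(root, encoding="unicode"))
```

Because the parser works on whole elements, it does not matter how many lines an <officer> spans or where it sits inside <officers>.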

Related

How to skip a self closing tag in a ST function on an SAP system?

So I have this problem handling an XML file in my SAP ABAP-based software, with a Simple Transformation.
The file I receive normally has no empty tags like <test></test>, but sometimes I receive a self-closing tag like <test/>.
This is an example of what I thought to use now. The first condition handles if the ref('test') is blank by skipping it. The second one takes the values if we have one.
<tt:cond check="initial(ref('test'))">
<tt:skip count="*" name="test"/>
</tt:cond>
<tt:cond check="not-initial(ref('test'))">
<test tt:value-ref="test"/>
</tt:cond>
The idea is: if we have the tag <test/> we need to skip it, otherwise we need to assign the data. This works in the first case, because no data is taken, but not in the second, where the data is still not read.
Can someone help?
Thanks in advance.
The XDM tree representations of <test></test> and <test/> are 100% identical, so there is no way an XSLT stylesheet can distinguish them or treat them differently. The idea of attaching different meanings to the two constructs is completely misguided: you can never be sure which representation an XML library will choose to use.
It is of course possible to distinguish an element that contains a value (such as <test>value</test>) from one that is empty - but both the above examples represent empty elements and must be treated as equivalent.
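The equivalence is easy to observe in any conforming parser. The sketch below uses Python's standard-library ElementTree rather than XDM or SAP Simple Transformations, so it only illustrates the general point; the helper name describe is an assumption for the example.

```python
import xml.etree.ElementTree as ET


def describe(xml_text):
    """Return (tag, text content, number of children) for the root element."""
    elem = ET.fromstring(xml_text)
    # elem.text is None when the element has no character content.
    return (elem.tag, elem.text or "", len(elem))


empty_pair = describe("<test></test>")
self_closing = describe("<test/>")
with_value = describe("<test>value</test>")

print(empty_pair == self_closing)  # the two empty forms parse to the same thing
print(with_value != empty_pair)    # an element with a value is distinguishable
```

Any check therefore has to be "is the element empty?", never "was it written as a self-closing tag?".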

Check if <fo:page-number> is even or not using XSLT 2.0

How can I check whether the <fo:page-number> is even or odd using XSLT 2.0? Is there any way to use <fo:page-number> inside <xsl:if test="fo:page-number mod 2 = 0">?
The XSLT stage generates the XSL-FO that the formatter then makes into pages. So, no, you can't get the current page number when you are generating the XSL-FO.
What do you want to change if it is an even-numbered page?
With XSL-FO, you can set up different page masters for odd and even pages (and more besides). The different page masters can have different margins, and you can set things up so that the formatter will direct different content to headers and footers on even pages than is used on odd pages.
See the 'Page Region and Structure' PDF and FO files in the 'XSL-FO Samples Collection' at https://www.antennahouse.com/xsl-fo-samples#structure
What you ask for cannot be done with a true batch formatter in a single pass. It requires "human" intervention to mark only those places where the break needs to occur and not others.
Also, there is no guarantee that one XSL-FO formatter will yield the same results as another. Formatters differ in how they handle "line tightness" (very slight squeezing of spaces and characters to fit text within a line), some support kerning and others do not, and many other factors come into play, so it is not possible to predict in advance whether a given paragraph will start on a particular page.
Formatting text in true typography is not merely word-space-word-space ... there are many other factors involved that could change the number of lines in a paragraph between one formatter and another which can easily ripple to a known paragraph existing on an even page in one formatter, yet an odd page in a different formatter.
Then you also need other rules, such as what to do if the paragraph at which you wish to break is already the first one on its page in your formatter of choice. Do you want a blank page? Maybe, who knows?
The only way to accomplish your task is through a multipass approach, which could be implemented so that it is generic to any formatter. You would need to format the whole document or, if you are chunking the document with page masters, at least a chunk that starts and ends on page boundaries. Format it, then test your condition on the first paragraph. If it passes (meaning a break is needed), go back to the original content (or modify the XSL-FO) and set an attribute that results in break-before="page" on that structure. Then repeat the process until you reach the end of the document. Some formatters can provide you with the area tree, and you can put markers in that tree, so that you can do this programmatically rather than by eye.
If your document is long and in one page-sequence (say like 3000 pages when formatted) and your break condition is frequent, you may have to repeat the process 700+ times.
As stated, some formatters through their API may allow you to control this programmatically. You can examine the area tree, look for your marker and keep count of pages. You may even be able to start formatting again at the break condition and not start over, but you need to program such things.

Regex or Xpath for extracting nodes?

I have an XML file with the following structure:
<JobList>
<Job><subnodes/></Job>
<Job><subnodes/></Job>
</JobList>
This XML can sometimes be broken, with the closing </JobList> missing and a closing </Job> missing.
I would like to be able to extract the <Job> nodes with full content on those that are closed with </Job>. What is the best way to do this?
To make a long story short, I am using .NET and the built-in serializers for deserializing XML content. But since new properties are added, you cannot just go back and forth between different versions, as it is too strict. Mostly it works, but I would like to have a backup recovery method for this, hence the question.
The current situation is that the deserializer "crashes" the whole deserializing when a new property has been added instead of ignoring it. I am looking to manually parse it on error.
As mentioned in the comments, the ideal fix is to make the XML valid. If for whatever reason that is not possible, the workaround is to parse the file as text with a regex.
A general regex for this case could be something like:
<Job>((?!<Job>).)*</Job>$
This will match anything between a complete <Job> ... </Job> pair.
Please note that this will also return nodes with 'broken' inner nodes, but according to your question you are only concerned about missing </JobList> and </Job> tags.
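As a rough check of the idea, here is a minimal Python sketch of a closely related pattern. It is a variant rather than the exact regex above: it drops the trailing $, makes the repetition lazy, and uses re.DOTALL so that a <Job> may span several lines. The sample text and the <id> child elements are invented for the example, and the pattern assumes <Job> carries no attributes.

```python
import re

# "<Job>", then any characters that do not start another "<Job>", then "</Job>".
# The negative lookahead stops a match from running across an unclosed <Job>.
COMPLETE_JOB = re.compile(r"<Job>(?:(?!<Job>).)*?</Job>", re.DOTALL)

# Hypothetical broken input: job 2 has no </Job> and </JobList> is missing.
broken = (
    "<JobList>\n"
    "<Job><id>1</id></Job>\n"
    "<Job><id>2</id>\n"
    "<Job><id>3</id></Job>\n"
)

complete_jobs = COMPLETE_JOB.findall(broken)
for job in complete_jobs:
    print(job)
```

Each extracted fragment is a candidate only; it should still be passed to an XML parser or deserializer, since the regex does not check the inner nodes.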

Truncate text formatted via HTML with XSLT 1.0

I am trying to truncate some text that has been formatted via HTML, but I need to keep the HTML intact. I am doing so in SharePoint 2007, so I am using XSLT 1.0.
I found this bit of XSLT here: http://symphony-cms.com/download/xslt-utilities/view/20816/
I was able to implement it, but it is telling me that the variable or parameter "Limit" has been defined twice.
However, the author has named many variables and parameters "Limit" and I am not sure which one I need to change.
I am fairly new to XSLT, and any help is greatly appreciated.
This is because, at the top of the XSLT, the author has defined limit as a parameter:
<xsl:param name="limit"/>
But a few lines down, he then defines it as a variable:
<xsl:variable name="limit">
Perhaps he had a 'buggy' XSLT processor which allowed variables to be redefined, but it should not actually be valid.
I did try renaming the variable to newlimit, but when he subsequently refers to limit it is hard to know whether he means the parameter or the variable (I couldn't actually get it to output useful HTML).
You are probably better off looking for something else to meet your needs. There may even be similar questions here on StackOverflow if you search about. For example, perhaps this meets your needs:
XSLT - Using substring with copy-of to preserve inner HTML tags
I am sure there may be others if you look. If not, feel free to ask a new question, giving your input HTML, and your expected output, so that it is clear what your requirements are.

XSLT Set difference but matching on a subsection of the node

I've implemented this in a recursive fashion, but as most XML editors seem to run out of stack space I thought there should be a more efficient solution out there.
I've looked at Jeni Tennison's set difference template:
http://www.exslt.org/set/functions/difference/set.difference.template.xsl
but need something slightly different. I need node equality to be defined as concat(name(.), @name).
There is a predefined set of nodes:
<a name="Adam"><!-- don't care about contents for equality purposes --></a>
<b name="Berty"><!-- don't care about contents for equality purposes --></b>
<a name="Charly"><!-- don't care about contents for equality purposes --></a>
I want to find out the subset of the below nodes that are not in the above list:
<b name="Berty"><!-- different contents --></b>
<b name="Boris"><!-- different contents --></b>
The result I'm after would be a node set of:
<b name="Boris"><!-- different contents --></b>
To complicate things, I can't use a key, as the nodes are in different documents (overriding imported definitions is the reason I'm trying to process this).
Also, this needs to be XSLT 1.0, as I need to render in IE / Firefox.
Any thoughts / suggestions / guidance welcome!
Have you taken a look at the technique in the XSLT Cookbook?
http://books.google.com/books?id=POJkiuHIAfoC&lpg=PP1&pg=PA324#v=onepage&q=&f=false
Mr. Mangano has a recipe for set difference, and a fairly well written explanation as well. Mind you, when you are comparing two elements that seem to be the same but have two different source documents, XSLT will usually report them as different, so you must test by the value of the element, attributes, etc.
You might want to poke at the example code from the book, provided here:
http://oreilly.com/catalog/9780596009748
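To make the "compare by value, not by node identity" point concrete, here is a small Python sketch of the comparison logic only. It is not XSLT 1.0 and not the Cookbook recipe; the wrapper <root> elements and the function names identity and difference are assumptions for the example, and the equality key is the element name plus its name attribute, as in the question.

```python
import xml.etree.ElementTree as ET


def identity(elem):
    """Equality key from the question: element name plus its name attribute."""
    return (elem.tag, elem.get("name"))


def difference(candidates, predefined):
    """Return the candidates whose key does not occur among the predefined nodes."""
    known = {identity(e) for e in predefined}
    return [e for e in candidates if identity(e) not in known]


# Two separate documents, each wrapped in a hypothetical <root> element.
predefined = ET.fromstring(
    '<root><a name="Adam"/><b name="Berty"/><a name="Charly"/></root>'
)
candidates = ET.fromstring(
    '<root><b name="Berty">x</b><b name="Boris">y</b></root>'
)

result = difference(list(candidates), list(predefined))
result_keys = [identity(e) for e in result]
print(result_keys)
```

Building the set of known keys once avoids the recursion that was exhausting the stack; in XSLT 1.0 the analogous test is a predicate comparing name() and @name against the other document's nodes.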