Regex or XPath for extracting nodes?

I have an XML file with the following structure:
<JobList>
<Job><subnodes/></Job>
<Job><subnodes/></Job>
</JobList>
This XML is sometimes broken: the closing </JobList> tag can be missing, and so can a closing </Job> tag.
I would like to be able to extract the <Job> nodes with full content on those that are closed with </Job>. What is the best way to do this?
To make a long story short, I am using .NET and its built-in serializers to deserialize XML content. But since new properties get added over time, you cannot just go back and forth between different versions, as the deserializer is too strict. Mostly it works, but I would like to have a backup recovery method for this, hence the question.
The current situation is that the deserializer "crashes" the whole deserializing when a new property has been added instead of ignoring it. I am looking to manually parse it on error.

As mentioned in the comments, the ideal fix is to make the XML valid. If that is not possible for whatever reason, the workaround is to parse the file as text with a regex.
A general regex for this case could be something like:
<Job>((?!<Job>).)*</Job>$
This matches everything between a complete <Job>...</Job> pair.
Please note that it will also return nodes with 'broken' inner nodes, but according to your question you are only concerned about missing </JobList> and </Job> tags.
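For illustration, here is a minimal Python sketch of the same idea (the question is about .NET, so treat this as a sketch of the technique rather than drop-in code). It makes two assumptions that differ from the pattern above: the trailing `$` is dropped so that a match does not have to end at the end of a line, and `re.DOTALL` is used so that `.` also matches newlines inside a job. The name `extract_closed_jobs` and the sample text are made up for the example.

```python
import re

# From each <Job>, consume characters only while no new <Job> starts,
# and stop at the first </Job>. A <Job> that is never closed cannot match.
JOB_RE = re.compile(r"<Job>(?:(?!<Job>).)*?</Job>", re.DOTALL)

def extract_closed_jobs(text):
    """Return the full text of every <Job>...</Job> pair that is closed."""
    return JOB_RE.findall(text)

# Hypothetical broken input: the last <Job> and the <JobList> are not closed.
broken = """<JobList>
<Job><subnodes/></Job>
<Job><subnodes>
</subnodes></Job>
<Job><subnodes/>
"""

jobs = extract_closed_jobs(broken)
for job in jobs:
    print(job)
```

Each extracted string can then be handed to a real XML parser on its own, which also catches jobs whose inner nodes are broken.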

Related

How to skip a self closing tag in a ST function on an SAP system?

So I have this problem handling an XML file in my SAP ABAP-based software, with a Simple Transformation.
The files I receive normally contain no empty tags like <test></test>, but sometimes I receive a self-closing tag like <test/>.
This is an example of what I thought to use now. The first condition handles if the ref('test') is blank by skipping it. The second one takes the values if we have one.
<tt:cond check="initial(ref('test'))">
<tt:skip count="*" name="test"/>
</tt:cond>
<tt:cond check="not-initial(ref('test'))">
<test tt:value-ref="test"/>
</tt:cond>
The idea is: if we have the tag <test/> we need to skip it, otherwise we need to assign the data. This works in the first case, because no data is taken, but not in the second, because the data is still not taken.
Can someone help?
Thanks in advance.
The XDM tree representations of <test></test> and <test/> are 100% identical, so there is no way an XSLT stylesheet can distinguish them or treat them differently. The idea of attaching different meanings to the two constructs is completely misguided: you can never be sure which representation an XML library will choose to use.
It is of course possible to distinguish an element that contains a value (such as <test>value</test>) from one that is empty - but both the above examples represent empty elements and must be treated as equivalent.
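The same point can be seen with any XML parser. The sketch below uses Python's `xml.etree.ElementTree` rather than ST or XSLT, purely as an illustration: both spellings of the empty element parse to the same thing, while an element with a value can be told apart. The helper `is_empty` is a name invented for the example.

```python
import xml.etree.ElementTree as ET

def is_empty(element):
    """True if the element has no child elements and no non-blank text."""
    return len(element) == 0 and not (element.text or "").strip()

expanded = ET.fromstring("<test></test>")
self_closing = ET.fromstring("<test/>")
with_value = ET.fromstring("<test>value</test>")

# After parsing, the two empty forms cannot be told apart.
print(ET.tostring(expanded) == ET.tostring(self_closing))
print(is_empty(expanded), is_empty(self_closing), is_empty(with_value))
```

So the condition to test is "empty or not", never "self-closing or not".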

RegEx to remove specific XML elements

I'm using Kate to process text to create an XML file but I've hit a roadblock. The text now contains additional data that I need to remove based on its content.
To be specific, I have an XML element called <officers> that contains 0 or more <officer> elements, which contain further elements such as <title>, <name>, etc. While I could probably exclude these at run time using XSL, the file also drives another process that I don't want to touch: it's a general-purpose data importer for Scribus, so I don't want to touch the coding.
What I want to do is remove an <officer> element if the <title> content isn't what I want. For example, I don't want the First VP, so I'd like to remove:
<officer>
<title>First VP</title>
<incumbent>Joe Somebody</incumbent>
<address>....</address>
<address>....</address>
......
</officer>
I don't know how many lines will be in any <officer> element, nor what position it will be in within the <officers> element.
The easy part is getting to the start of the content I want removed. The hard part is getting to the </officer> end tag. All the solutions I've found so far just result in Kate deciding that the RegEx is invalid.
Any suggestions are appreciated.
Regex is the wrong tool for this job; never process XML without a proper parser, except possibly for a one-off job on a single document where you will throw the code away after running it and checking the results by hand. You might find a regex that works on one sample document, but you'll never get it to work properly on a well-designed set of 100 test documents.
And it's easily done using XSLT. It's a stylesheet with two template rules: a default "identity template" rule to copy elements unchanged, and a second rule to delete the elements you don't want. In fact in XSLT 3.0 it gets even simpler:
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="officer[title='First VP']"/>
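If XSLT is not at hand, the same "copy everything, drop the unwanted elements" approach works with any proper parser. Below is a minimal sketch using Python's `xml.etree.ElementTree`; it is an alternative to the stylesheet above, not a translation of it. The function name `remove_officers` and the sample data (including the President entry) are made up for the example.

```python
import xml.etree.ElementTree as ET

def remove_officers(xml_text, unwanted_title):
    """Return the XML with every <officer> whose <title> equals unwanted_title removed."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so visit each possible parent and
    # remove matching children from it. The list() snapshots avoid mutating
    # a collection while iterating over it.
    for parent in list(root.iter()):
        for officer in list(parent.findall("officer")):
            if officer.findtext("title") == unwanted_title:
                parent.remove(officer)
    return ET.tostring(root, encoding="unicode")

# Hypothetical sample input.
SAMPLE = (
    "<officers>"
    "<officer><title>President</title><incumbent>Jane Someone</incumbent></officer>"
    "<officer><title>First VP</title><incumbent>Joe Somebody</incumbent></officer>"
    "</officers>"
)

cleaned = remove_officers(SAMPLE, "First VP")
print(cleaned)
```

It does not matter how many lines an <officer> spans or where it sits inside <officers>, because the parser works on elements rather than on text.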

Script to generate html Beyond Compare folder differences

I've found several ways to automate folder comparison using scripts in Beyond Compare, but none that produce the pretty html report created from Session>Folder Compare Report>View in browser.
Here is an example of what that looks like.
I would love to be able to find the script that gives me that html difference report.
Thanks!
This is what I am currently getting
load "C:\Users\UIDQ5763\Desktop\Enviornment.cpp" &
"C:\Users\UIDQ5763\Desktop\GreetingsConsoleApp"
folder-report layout:side-by-side options:display-all &
output-to:C:\Users\UIDQ5763\Report.html output-options:html-color
The documentation for Beyond Compare's scripting language is here. You were probably missing either layout:side-by-side, which gives the general display, or output-options:html-color, which is required to get the correctly styled HTML output. You may want to change options:display-all to options:display-mismatches if you only want to see the differences, and you might want to add an expand all command immediately before the folder-report line if you want to see the subfolders recursively.
The & characters shown in the sample are line continuation characters. Remove them if you don't need to wrap your lines.

Architecture for a c++ XML-parser with a HTML-reportgenerator

I want a program that parses an XML file, builds a structure with the tags I need, and finally prints an HTML report using HTML templates with keywords that get replaced by the data from the XML files.
Since I'm not (yet) really into OO programming, I hoped to get some tips and advice on how to structure a program like this.
I thought that two classes should be enough: a parser class and a data class.
The first one goes through the XML file and reports every tag I want to store to a data object, which stores all the tags in hierarchical order. After that I want to call a print function which prints everything as an HTML report.
I'm not sure how to report the tags to the data object.
Could I store the tags in one object which stores a tree of structs or would it be better to store each tag in a separate object?
Any help would be greatly appreciated!
You don't mention Qt in your question, but as you added it as a tag: there is QtXML, which will give a way to parse and generate XML documents, and will also work for HTML output. XML is typically handled either via DOM or SAX. With DOM, the documents are parsed into a tree structure, and you will work on the tree as your central data element. With SAX, you use callback functions that are called for the different XML elements while parsing the XML input.
There is a lot about DOM and SAX on the internet; Wikipedia is a good starting point. There is also a lot of documentation on QtXML online.
Using DOM and/or SAX will give a nice architecture for solving the problem.
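To make the difference concrete, here is a small sketch of both styles. It uses Python's standard library instead of QtXML, only because it is short; `ElementTree` stands in for the tree (DOM-style) approach and `xml.sax` for the callback approach. The sample document and the names `items_via_tree`, `items_via_sax` and `ItemCollector` are invented for the example.

```python
import xml.sax
import xml.etree.ElementTree as ET

SAMPLE = "<report><title>Status</title><item>one</item><item>two</item></report>"

def items_via_tree(xml_text):
    """Tree style: parse the whole document first, then query the tree."""
    root = ET.fromstring(xml_text)
    return [item.text for item in root.iter("item")]

class ItemCollector(xml.sax.ContentHandler):
    """SAX style: callbacks fire while the parser reads the input."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # None means "not inside an <item>"

    def startElement(self, name, attrs):
        if name == "item":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect and join later.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "item" and self._buffer is not None:
            self.items.append("".join(self._buffer))
            self._buffer = None

def items_via_sax(xml_text):
    handler = ItemCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.items

print(items_via_tree(SAMPLE))
print(items_via_sax(SAMPLE))
```

The tree version is usually easier when the document fits in memory and you need to look around in it; the callback version keeps only what you choose to store.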
I solved my problem and want to share my architecture.
I made a Parser class to parse the elements and report the tags to an HTMLHandler class, which has subclasses like Header, Content and Sub-content. These store the data and all have write() methods to print themselves out.
It works fine for me and is quite simple :)
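A rough sketch of what such an architecture could look like is below. It is written in Python for brevity and is only a guess at the shape described, not the actual code: the class names `Section`, `HtmlHandler` and `Parser`, the `report()` method, the `$keyword` template syntax and the sample document are all assumptions made for the example.

```python
import html
import xml.etree.ElementTree as ET
from string import Template

class Section:
    """One part of the report; stores its data and knows how to write itself."""

    def __init__(self, template):
        self.template = Template(template)  # keywords look like $name
        self.values = {}

    def write(self):
        return self.template.safe_substitute(self.values)

class HtmlHandler:
    """Receives tags from the parser and files them into report sections."""

    def __init__(self):
        self.header = Section("<h1>$title</h1>")
        self.content = []

    def report(self, tag, text):
        if tag == "title":
            self.header.values["title"] = html.escape(text)
        elif tag == "item":
            section = Section("<p>$item</p>")
            section.values["item"] = html.escape(text)
            self.content.append(section)

    def write(self):
        parts = [self.header.write()] + [s.write() for s in self.content]
        return "\n".join(parts)

class Parser:
    """Walks the XML and reports only the wanted tags to the handler."""

    def __init__(self, handler, wanted):
        self.handler = handler
        self.wanted = set(wanted)

    def parse(self, xml_text):
        for element in ET.fromstring(xml_text).iter():
            if element.tag in self.wanted:
                self.handler.report(element.tag, (element.text or "").strip())

handler = HtmlHandler()
Parser(handler, ["title", "item"]).parse(
    "<report><title>Status</title><item>one</item><item>two</item></report>"
)
output = handler.write()
print(output)
```

The parser knows nothing about HTML and the handler knows nothing about XML, so either side can be swapped out without touching the other.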

Truncate text formatted via HTML with XSLT 1.0

I am trying to truncate some text that has been formatted via HTML, but I need to keep the HTML intact. I am doing so in SharePoint 2007, so I am using XSLT 1.0.
I found this bit of XSLT here: http://symphony-cms.com/download/xslt-utilities/view/20816/
I was able to implement it, but it is telling me that the variable or parameter "Limit" has been defined twice.
However, the author has named many variables and parameters "Limit" and I am not sure which one I need to change.
I am fairly new to XSLT, and any help is greatly appreciated.
This is because at the top of the XSLT the author has defined limit as a parameter
<xsl:param name="limit"/>
But a few lines down, he then defines it as a variable
<xsl:variable name="limit">
Perhaps he had a 'buggy' xslt processor which allowed variables to be re-defined, but it should not actually be valid.
I did try renaming the variable to newlimit, but when he subsequently refers to limit it is hard to know whether it is the parameter or the variable he is referring to (I couldn't actually get it to output useful HTML).
You are probably better off looking for something else to meet your needs. There may even be similar questions here on StackOverflow if you search about. For example, perhaps this meets your needs
XSLT - Using substring with copy-of to preserve inner HTML tags
I am sure there may be others if you look. If not, feel free to ask a new question, giving your input HTML, and your expected output, so that it is clear what your requirements are.