Using Apache NiFi to extract HL7 values and apply regex

I need to extract patient info from an HL7 XML document using Apache NiFi, and to apply a regex to extract diagnostic results from the sections that contain embedded HTML (yes, sorry, not my design choice :-( ).
The first path to the data of interest in the HL7 document is:
"ClinicalDocument" \ "recordTarget" \ "patientRole" \ "patient" \ "name",
and the second, more complicated one is:
"ClinicalDocument" \ "structuredBody" \ "component" \ "section" \ "text[@mediaType="text/x-hl7-text+xml"]", where the value of the section's title element equals "Diagnostic Results".
That is, I need to find the section within component whose title child element has the text value "Diagnostic Results" (<title>Diagnostic Results</title>), and then extract the text value of its sibling text element.
My HL7 XML snippets look like:
<ClinicalDocument>
...
<recordTarget>
<patientRole>
....
<patient>
<name><given>John</given><family>Doe</family></name>
...
<structuredBody>
...
<component>
<section classCode="DOCSECT" moodCode="EVN">
<templateId root="0.0.0.0.0.0.1" />
<code code="000-01" codeSystem="0.0.0.1.0.0" />
<title>Diagnostic Results</title>
<text mediaType="text/x-hl7-text+xml">
Some data of interest expressed in n microns.<content ID="NKN_results"/>
</text>
Any suggestions on how to do this in Apache NiFi?

You should be able to use XPath and the NiFi EvaluateXPath processor to match and extract the <text> element. I started with the structuredBody tag as root for the following expression:
/structuredBody/component/section[title = 'Diagnostic Results' and text[@mediaType='text/x-hl7-text+xml']]/text
You should be able to adapt it to the full XML path. Once the <text> element is parsed out, starting in NiFi 0.5.0 you can use the GetHTMLElement processor to extract from the embedded HTML. Before NiFi 0.5.0, if the HTML is well-formed (XHTML, for example), you can use another EvaluateXPath processor instead.
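To try the same selection outside NiFi, here is a minimal Python sketch using only the standard library. It is not NiFi configuration. The sample is a trimmed, namespace-free reconstruction of the question's snippet (real CDA documents usually declare the urn:hl7-org:v3 namespace, which would have to be handled), and the pattern `in (\S+) microns` is a hypothetical regex chosen only to show the follow-up step. ElementTree supports a limited XPath subset without `and`, so the two conditions are written as predicates on separate steps.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed, namespace-free reconstruction of the question's snippet (assumption).
SAMPLE = """<ClinicalDocument>
  <recordTarget>
    <patientRole>
      <patient>
        <name><given>John</given><family>Doe</family></name>
      </patient>
    </patientRole>
  </recordTarget>
  <structuredBody>
    <component>
      <section classCode="DOCSECT" moodCode="EVN">
        <title>Diagnostic Results</title>
        <text mediaType="text/x-hl7-text+xml">
Some data of interest expressed in n microns.<content ID="NKN_results"/>
        </text>
      </section>
    </component>
  </structuredBody>
</ClinicalDocument>"""


def extract(xml_text):
    """Return (patient name, diagnostic text, regex capture or None)."""
    root = ET.fromstring(xml_text)

    # First path: recordTarget / patientRole / patient / name
    name = root.find("./recordTarget/patientRole/patient/name")
    patient = "{} {}".format(name.findtext("given"), name.findtext("family"))

    # Second path: the <text> sibling of <title>Diagnostic Results</title>
    text_node = root.find(
        ".//section[title='Diagnostic Results']"
        "/text[@mediaType='text/x-hl7-text+xml']"
    )
    body = "".join(text_node.itertext()).strip()

    # Hypothetical follow-up regex applied to the extracted text.
    match = re.search(r"in (\S+) microns", body)
    value = match.group(1) if match else None
    return patient, body, value


print(extract(SAMPLE))
```

The regex runs only on the already-isolated text node, which is the same split of work as EvaluateXPath followed by a second extraction step.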

Related

XSLT copy from external document

In XSLT 2.0 I am transforming a TEI XML document into HTML. In that transformation I need content from another document: I want to copy/transform a small set of nodes from the second document into the HTML output.
While processing the principal tei document I get the id and assign it to a variable:
<xsl:variable name="licenseid" select="./replace(@corresp,'#','')"/>
Then I go out to the other document and fetch the node using the variable, with the returned node assigned to a variable:
<xsl:variable name="licenseloc" select="doc(concat($somepath,'includes_sourcedesc.xml'))//tei:list[@type='copyright_type']/tei:item[@xml:id=$licenseid]"/>
This node I obtain looks like this:
<list type="copyright_type">
<item xml:id="copyright-cc-by-nc-sa-4.0">
<desc xml:lang="en">This work is made by available the author under the
<ref target="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike
4.0 International License</ref>.</desc>
</item>
</list>
And I want to transform it (from desc) to this:
<span>This work is made by available by the author under the
<a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike
4.0 International License</a>.</span>
If this were in my 'current' tei document I would handle it through templates, but I'm unsure how to copy and transform the nested layers from within a different 'current' document.

Apache JMeter Regular Expressions Extractor Error

I made an HTTP request to a web page and it responded successfully with VAST code (XML). Afterwards I tried to use the Apache JMeter Regular Expression Extractor to extract a URL from the MediaFile tag in the XML response, but it doesn't work.
Here is the responded data (VAST XML):
<?xml version="1.0" encoding="UTF-8"?>
<VAST version="2.0">
<Ad id="brightroll_ad">
<InLine>
<AdSystem>BrightRoll</AdSystem>
<AdTitle></AdTitle>
<Impression><![CDATA[http://brxserv-22.btrll.com/v1/epix/6835714/3858435/84416/140363/AbQ93_XgMgCcRUTi_JAAFJwAACJEsAOuADAAAAAAAiyel-GCNFFg/event.imp/r_64.aHR0cDovL2Iuc2NvcmVjYXJkcmVzZWFyY2guY29tL3A_JmMxPTgmYzI9NjAwMDAwNiZjMz04NDQxNiZjND0zODU4NDM1JmM1PTIwNDYzJmM2PTY4MzU3MTQmYzEwPTE0MDM2MyZjdj0xLjcmY2o9MSZybj0xNDE0NDEwMTg1JnI9aHR0cCUzQSUyRiUyRnBpeGVsLnF1YW50c2VydmUuY29tJTJGcGl4ZWwlMkZwLWNiNkMwekZGN2RXakkuZ2lmJTNGbGFiZWxzJTNEcC42ODM1NzE0LjM4NTg0MzUuMCUyQ2EuMjA0NjMuODQ0MTYuMTQwMzYzJTJDdS45NjguNjQweDM2MCUzQm1lZGlhJTNEYWQlM0JyJTNEMTQxNDQxMDE4NQ]]></Impression>
<Impression><![CDATA[http://rc.rlcdn.com/361686.gif]]></Impression>
<Creatives>
<Creative id="140363" sequence="1">
<Linear>
<Duration>00:00:30</Duration>
<TrackingEvents>
<Tracking event="midpoint"><![CDATA[http://brxserv-22.btrll.com/v1/epix/6835714/3858435/84416/140363/AbQ93_XgMgCcRUTi_JAAFJwAACJEsAOuADAAAAAAAiyel-GCNFFg/event.mid]]></Tracking>
<Tracking event="complete"><![CDATA[http://brxserv-22.btrll.com/v1/epix/6835714/3858435/84416/140363/AbQ93_XgMgCcRUTi_JAAFJwAACJEsAOuADAAAAAAAiyel-GCNFFg/event.end]]></Tracking>
</TrackingEvents>
<AdParameters></AdParameters>
<VideoClicks>
<ClickTracking><![CDATA[http://brxserv-22.btrll.com/v1/epix/6835714/3858435/84416/140363/AbQ93_XgMgCcRUTi_JAAFJwAACJEsAOuADAAAAAAAiyel-GCNFFg/event.click]]></ClickTracking>
</VideoClicks>
<MediaFiles>
<MediaFile type="application/x-shockwave-flash" apiFramework="VPAID" height="360" width="640" delivery="progressive">
<![CDATA[http://shim.btrll.com/shim/20141023.75835_master/Scout.swf?type=VPAID&hidefb=true&asset_64=aHR0cDovL3J0ci5pbm5vdmlkLmNvbS9yMS41NDQ1OTU0ZDA5ZTY4OS40MjIxNTcxODtjYj0xNDE0NDEwMTg1O3NpdGVpZD0zODU4NDM1bGluZWl0ZW04NDQxNg&vid_click_url=&config_url_64=&h_64=YnJ4c2Vydi0yMi5idHJsbC5jb20&dn=-&e=p&p=6835714&s=3858435&l=84416&ic=140363&ii=20463&iq=t&cx=&x=AbQ93_XgMgCcRUTi_JAAFJwAACJEsAOuADAAAAAAAiyel-GCNFFg&adc=false&t=33&si=&vh_64=Z2VvLXJ0YnNlcnYtdjIuYnRybGwuY29t&apep=0.05&hbp=0.01&view=vast2]]>
</MediaFile>
</MediaFiles>
</Linear>
</Creative>
</Creatives>
</InLine>
And here are the settings I used:
Reference Name: mediaFileUrl_VASTAdTagURI
Regular Expression: <MediaFile type="application//x-shockwave-flash" apiFramework="VPAID" height="360" width="640" delivery="progressive"><([^"]+)http:\/\/([^"]+)]]>>
Template: $1$$2$
Match No.: -1
Default Value: No mediaFileUrl_VASTAdTagURI
The result is always the default value (No mediaFileUrl_VASTAdTagURI). Any clue what is wrong with the regular expression?
JMeter provides the XPath Extractor to deal with XML and XHTML data. It can also work for HTML, but you'll have to check the Use Tidy box so that JMeter can use JTidy to process the HTML.
XPath expression to extract contents of CDATA should look something like:
//MediaFile/text()[2]
See the XPath Tutorial for more details. A few tools that can help in building and debugging XPath expressions:
XPath Checker Firefox add-on
FirePath Firefox add-on
View Results Tree JMeter's listener provides XPath Tester as well
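For checking either approach offline, here is a minimal Python sketch (standard library only, not JMeter configuration). The VAST fragment is trimmed from the response above, and its MediaFile URL is a shortened, hypothetical stand-in for the real one. It shows two things: an XML parser hands back the CDATA content as ordinary text, and a regex can also work if it tolerates the whitespace between the opening tag and `<![CDATA[` and does not hard-code every attribute.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed VAST fragment; the URL is a shortened, hypothetical placeholder.
VAST = """<VAST version="2.0">
  <Ad id="brightroll_ad">
    <InLine>
      <Creatives>
        <Creative id="140363" sequence="1">
          <Linear>
            <MediaFiles>
              <MediaFile type="application/x-shockwave-flash" apiFramework="VPAID" height="360" width="640" delivery="progressive">
                <![CDATA[http://example.com/Scout.swf?type=VPAID&view=vast2]]>
              </MediaFile>
            </MediaFiles>
          </Linear>
        </Creative>
      </Creatives>
    </InLine>
  </Ad>
</VAST>"""


def media_url_via_parser(xml_text):
    """Parse the XML; CDATA arrives as the element's text content."""
    root = ET.fromstring(xml_text)
    return root.findtext(".//MediaFile").strip()


# \b keeps the pattern from matching <MediaFiles>; \s* allows the line break
# and indentation between the opening tag and the CDATA section.
MEDIA_FILE_RE = re.compile(
    r"<MediaFile\b[^>]*>\s*<!\[CDATA\[(.*?)\]\]>", re.DOTALL
)


def media_url_via_regex(xml_text):
    """Regex alternative that captures only the CDATA payload."""
    match = MEDIA_FILE_RE.search(xml_text)
    return match.group(1) if match else None


print(media_url_via_parser(VAST))
print(media_url_via_regex(VAST))
```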

How to read namespace declarations using XSLT?

I have to transform the raw response of any OData feed (ATOM) into a tree with expandable/collapsible nodes. For this purpose I am converting the raw response into HTML using an XSLT transformation.
The problem is that the responses from some services have a feed element with namespace declarations as attributes (e.g. feed xmlns:d= ..., xmlns:m= ...). In my final output these namespace declarations are not displayed.
The XSLT processor ignores them while processing the attributes. (I am using the XPath expression "@*".) Is there a way to extract them using XSLT and display the namespace declaration content as-is in the transformed output?
Note that I get to know about these namespace declaration attributes at runtime in the OData response. I have no information before the query executes.
UPDATE:
Input : (RAW XML Entry)
<?xml version="1.0" encoding="utf-8"?><entry xml:base="http://services.odata.org/Northwind/Northwind.svc/" xmlns="http://www.w3.org/2005/Atom" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"><id>http://services.odata.org/Northwind/Northwind.svc/Regions(1)</id><category term="NorthwindModel.Region" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" /><link rel="edit" title="Region" href="Regions(1)" /><link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/Territories" type="application/atom+xml;type=feed" title="Territories" href="Regions(1)/Territories" /><title /><updated>2014-03-17T10:24:14Z</updated><author><name /></author><content type="application/xml"><m:properties><d:RegionID m:type="Edm.Int32">1</d:RegionID><d:RegionDescription xml:space="preserve">Eastern </d:RegionDescription></m:properties></content></entry>
Desired Output: (The same ATOM entry, as an XML tree, pretty-printed with expandable/collapsible nodes)
-<entry xml:base="http://services.odata.org/Northwind/Northwind.svc/" xmlns="http://www.w3.org/2005/Atom" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
-<id>
http://services.odata.org/Northwind/Northwind.svc/Regions(1)
</id>
<category term="NorthwindModel.Region" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme"/>
<link rel="edit" title="Region" href= "Regions(1)" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/Territories" type="application/atom+xml;type=feed" title="Territories" href= "Regions(1)/Territories" />
<title/>
<updated>2014-03-17T10:06:25Z</updated>
-<author>
<name/>
</author>
-<content type="application/xml">
-<m:properties>
<d:RegionID m:type="Edm.Int32">1</d:RegionID>
<d:RegionDescription xml:space="preserve">Eastern </d:RegionDescription>
</m:properties>
</content>
</entry>
Output which I am getting:
-<entry xml:base="http://services.odata.org/Northwind/Northwind.svc/">
-<id>
http://services.odata.org/Northwind/Northwind.svc/Regions(1)
</id>
<category term="NorthwindModel.Region" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme"/>
<link rel="edit" title="Region" href= "Regions(1)" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/Territories" type="application/atom+xml;type=feed" title="Territories" href= "Regions(1)/Territories" />
<title/>
<updated>2014-03-17T10:06:25Z</updated>
-<author>
<name/>
</author>
-<content type="application/xml">
-<m:properties>
<d:RegionID m:type="Edm.Int32">1</d:RegionID>
<d:RegionDescription xml:space="preserve">Eastern </d:RegionDescription>
</m:properties>
</content>
</entry>
Please note the missing namespace declarations in the output's "entry" root element.
The output is HTML which displays pretty-printed XML with expandable/collapsible nodes, and since it should display the data as-is, the namespace declarations need to be displayed in the output HTML.
Clicking on the "-" symbols collapses the nodes.
If a declaration is already in scope (remember, namespace bindings are inherited to children), it will not need to be redeclared and most XML serializers will not generate it. You haven't shown us an example yet, but I'd bet your output is fine as it is.
Namespace declarations in your source XML appear in the XPath data model as namespace nodes (not attribute nodes).
You need to make clear what you want to do with the namespaces. You're saying that you are generating HTML, but you also say you want to copy the namespaces to the output. That seems inconsistent. Show us your input and desired output.
I haven't managed to spot the difference between your actual output and your desired output (never was very good at "spot the ball" competitions), but from your revised description of the problem it sounds as if you should be using the namespace axis to find the in-scope namespaces for an element (in the same way as you are currently using the attribute axis).
The only tricky part is that the namespace axis will give you all the namespaces for an element and you probably only want to show those that are different from the namespaces of the parent element.
The details depend on whether you are using XSLT 1.0 or 2.0 - you need to tell us!
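The point that namespace declarations are not attributes also shows up outside XSLT. As a rough illustration only (Python standard library, not the XSLT namespace axis the answer refers to), the sketch below uses the parser's `start-ns` events to list just the declarations written on each element, which is the "only those that differ from the parent" view described above. The sample is a shortened copy of the entry from the question, and the function name is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the ATOM entry from the question.
ENTRY = (
    '<entry xml:base="http://services.odata.org/Northwind/Northwind.svc/" '
    'xmlns="http://www.w3.org/2005/Atom" '
    'xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" '
    'xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">'
    '<content type="application/xml"><m:properties>'
    '<d:RegionID m:type="Edm.Int32">1</d:RegionID>'
    '</m:properties></content></entry>'
)


def declared_namespaces(xml_text):
    """Return [(element tag, {prefix: uri})] with only the namespace
    declarations that appear on that element itself."""
    result = []
    pending = {}  # declarations seen since the previous element start
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload  # prefix is "" for the default namespace
            pending[prefix] = uri
        else:
            result.append((payload.tag, pending))
            pending = {}
    return result


for tag, declarations in declared_namespaces(ENTRY):
    print(tag, declarations)
```

Only the entry element reports declarations here; its descendants inherit the bindings and report an empty mapping.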

XPath - Querying two XML documents

I have two XML docs:
XML1:
<Books>
<Book id="11">
.......
<AuthorName/>
</Book>
......
</Books>
XML2:
<Authors>
<Author>
<BookId>11</BookId>
<AuthorName>Smith</AuthorName>
</Author>
</Authors>
I'm trying to do the following:
Get the value of XML2/Author/AuthorName where XML1/Book/@id equals XML2/Author/BookId.
XML2/Author/AuthorName[../BookId = XML1/Book/@id]
An XPath 1.0 expression cannot refer to more than one XML document, unless the references to the additional documents have been set up in the context of the XPath engine by the hosting language. For example, if XSLT is the hosting language, then it makes its document() function available to the XPath engine it is hosting.
document($xml2Uri)/Authors/Author[BookId = $mainDoc/Books/Book/@id]
Do note, that even the main XML document needs to be referenced via another <xsl:variable>, named here $mainDoc.
The document() function is available only if Xpath is hosted by XSLT! This is not mentioned in the answer of Doc Brown and is misleading the readers.
An XPath 2.x expression may refer to any additional XML document using the XPath 2.0 doc() function.
for $doc in /,
$doc2 in doc(someUri)
return
$doc2/Authors/Author[BookId = $doc/Books/Book/@id]
The document function is your friend, here is a short tutorial how to combine multiple input files.
EDIT: Of course, that works only if you are using XPath in an XSLT script.
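When the host is neither XSLT nor an XPath 2.0 engine, the join has to be done by the host language. Here is a minimal Python sketch of that idea, standard library only. The second Book (id 12) is a hypothetical addition so that the filter has something to exclude, and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

BOOKS = """<Books>
  <Book id="11"><AuthorName/></Book>
  <Book id="12"><AuthorName/></Book>
</Books>"""  # Book 12 is hypothetical, added to exercise the filter

AUTHORS = """<Authors>
  <Author>
    <BookId>11</BookId>
    <AuthorName>Smith</AuthorName>
  </Author>
</Authors>"""


def author_names(books_xml, authors_xml):
    """Return AuthorName values whose BookId matches some Book/@id."""
    books = ET.fromstring(books_xml)
    authors = ET.fromstring(authors_xml)
    # Collect the ids from the first document...
    book_ids = {book.get("id") for book in books.iter("Book")}
    # ...and use them to filter the second one.
    return [
        author.findtext("AuthorName")
        for author in authors.iter("Author")
        if author.findtext("BookId") in book_ids
    ]


print(author_names(BOOKS, AUTHORS))
```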

Evernote export format (ENEX) to HTML, including pictures?

#Solved
The two subquestions I have created have been solved (yay for splitting this one up!), so this one is solved. I'll award the check mark to samjudson, since his answer was the closest. For actual working solutions though, see the below subquestions; both my implemented solutions and the checked answers.
#Deprecated
I am splitting this question into two separate questions, since this is a fairly complicated problem. Answers are still welcome though.
The subquestions are:
XSLT: Convert base64 data into image files
XSLT: Obtaining or matching hashes for base64 encoded data
Hi, just wondering if anyone here has had any success in converting Evernote's export format, which is XML, to HTML including the pictures. I do know that Evernote has an export to HTML function which does this, but I eventually want to do more fancy stuff with it.
I have managed to accomplish getting the text only using the following XSLT:
Sample code removed
See child questions for implemented solutions.
However, at the moment this simply ignores any pictures, and this is where I need help.
Stumbling block #1: Evernote stores its pictures as GIFs or PNGs, and when exported, it embeds these GIFs and PNGs directly in the XML using what appears to be base64 (I could be wrong). I need to be able to reconstitute the pictures. If you open the file in a text editor, look for the huge blocks of data in **//note/resource/data**. For example (indents added manually):
<resource>
<data encoding="base64">
R0lGODlhEAAQAPMAMcDAwP/crv/erbigfVdLOyslHQAAAAECAwECAwECAwECAwECAwECAwECAwEC
AwECAyH/C01TT0ZGSUNFOS4wGAAAAAxtc09QTVNPRkZJQ0U5LjAHgfNAGQAh/wtNU09GRklDRTku
MBUAAAAJcEhZcwAACxMAAAsTAQCanBgAIf8LTVNPRkZJQ0U5LjATAAAAB3RJTUUH1AkWBTYSQXe8
fQAh+QQBAAAAACwAAAAAEAAQAAADSQhgpv7OlDGYstCIMqsZAXYJJEdRQRWRrHk2I9t28CLfX63d
ZEXovJ7htwr6dIQB7/hgJGXMzFApOBYgl6n1il0Mv5xuhBEGJAAAOw==
</data>
<mime>image/gif</mime>
<resource-attributes>
<file-name>clip_image001.gif</file-name>
</resource-attributes>
</resource>
Stumbling block #2: Evernote stores the file names of each picture under the resource node
**//note/resource/resource-attributes/file-name**
however, in the actual note in which it refers to the picture, it references the picture not by the filename, but by its hash, for example:
<en-media hash="4aaafc3e14314027bb1d89cf7d59a06c" type="image/gif" border="0" width="16" height="16" alt="Alt Text"/>
Can anyone shed some light on how to deal with (base64) encoded binary data inside XML?
Edit
I understand from the comments and answers that plain ol' XSLT won't get the job done handling images. The XSLT processor I am using is Xalan; however, if it is not good enough for image processing or base64, then please suggest one that is!
Also, as requested, here is a sample Evernote export file. The code clips above are merely selected parts of this. I have stripped it down such that it contains just one note and edited most of the text out of it, and added indents for clarity.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export.dtd">
<en-export export-date="20091029T063411Z" application="Evernote/Windows" version="3.0">
<note>
<title>A title here</title>
<content><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml.dtd">
<en-note bgcolor="#FFFFFF">
<p>Some text here (followed by the picture)
<p><en-media hash="4aaafc3e14314027bb1d89cf7d59a06c" type="image/gif" border="0" width="16" height="16" alt="A picture"/></p>
<p>Some more text here (preceded by the picture)
</en-note>
]]></content>
<created>20090925T063154Z</created>
<note-attributes>
<author/>
</note-attributes>
<resource>
<data encoding="base64">
R0lGODlhEAAQAPMAMcDAwP/crv/erbigfVdLOyslHQAAAAECAwECAwECAwECAwECAwECAwECAwEC
AwECAyH/C01TT0ZGSUNFOS4wGAAAAAxtc09QTVNPRkZJQ0U5LjAHgfNAGQAh/wtNU09GRklDRTku
MBUAAAAJcEhZcwAACxMAAAsTAQCanBgAIf8LTVNPRkZJQ0U5LjATAAAAB3RJTUUH1AkWBTYSQXe8
fQAh+QQBAAAAACwAAAAAEAAQAAADSQhgpv7OlDGYstCIMqsZAXYJJEdRQRWRrHk2I9t28CLfX63d
ZEXovJ7htwr6dIQB7/hgJGXMzFApOBYgl6n1il0Mv5xuhBEGJAAAOw==
</data>
<mime>image/gif</mime>
<resource-attributes>
<file-name>clip_image001.gif</file-name>
</resource-attributes>
</resource>
</note>
</en-export>
And this needs to be transformed into this:
<html>
<body>
<p>Some text here (followed by the picture)
<p><img src="clip_image001.gif" border="0" width="16" height="16" alt="A picture"/></p>
<p>Some more text here (preceded by the picture)
</body>
</html>
With the file clip_image001.gif being generated and saved.
There is a new Data URI specification (http://en.wikipedia.org/wiki/Data_URI_scheme) which may be of some help, provided you only intend to support modern browsers and your images are small (for example, IE8 only supports <32k images).
Other than that the only other thing you can do is use some external scripts to export the image data to file and use them. This would depend greatly on what XSLT processor you are using.
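Here is a minimal sketch of what such an external script could look like, in Python with only the standard library. The resource element is copied from the question. Treating the en-media hash attribute as the MD5 of the decoded bytes is an assumption not confirmed anywhere in this thread, so the digest is computed but not compared against the sample's hash value. The function name is hypothetical.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET

# Resource element copied from the question's sample export.
RESOURCE = """<resource>
<data encoding="base64">
R0lGODlhEAAQAPMAMcDAwP/crv/erbigfVdLOyslHQAAAAECAwECAwECAwECAwECAwECAwECAwEC
AwECAyH/C01TT0ZGSUNFOS4wGAAAAAxtc09QTVNPRkZJQ0U5LjAHgfNAGQAh/wtNU09GRklDRTku
MBUAAAAJcEhZcwAACxMAAAsTAQCanBgAIf8LTVNPRkZJQ0U5LjATAAAAB3RJTUUH1AkWBTYSQXe8
fQAh+QQBAAAAACwAAAAAEAAQAAADSQhgpv7OlDGYstCIMqsZAXYJJEdRQRWRrHk2I9t28CLfX63d
ZEXovJ7htwr6dIQB7/hgJGXMzFApOBYgl6n1il0Mv5xuhBEGJAAAOw==
</data>
<mime>image/gif</mime>
<resource-attributes>
<file-name>clip_image001.gif</file-name>
</resource-attributes>
</resource>"""


def decode_resource(resource_xml):
    """Return (file name, decoded bytes, MD5 hex digest) for one resource."""
    resource = ET.fromstring(resource_xml)
    # b64decode ignores the line breaks inside the <data> element by default.
    raw = base64.b64decode(resource.findtext("data"))
    file_name = resource.findtext("resource-attributes/file-name")
    # Assumption: en-media/@hash is the MD5 of these bytes.
    digest = hashlib.md5(raw).hexdigest()
    return file_name, raw, digest


file_name, raw, digest = decode_resource(RESOURCE)
print(file_name, len(raw), digest)
```

Writing `raw` to disk with `open(file_name, "wb")` would produce the image file, and a mapping from digest to file name could then be used to rewrite each en-media element as an img tag, if the MD5 assumption holds for the export in question.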
A pure XSLT answer to this issue exists; see this page.