Parse HL7 with regex

I have the following hl7 message:
MSH|^~\&|EPIC|SMHRMC|JCAPS|QHN|20170626165726|EDILABIH|ORU^R01^LAB|00004841|P|2.3|||||||||
PID|1||W00xxxxx^^^SMHRMC||mouse^Mickey^E||19860905|F||1|2601 somestreet AVE NO 8^^City^ST^zip^USA^^^county|MESA|(970)xxx-xxxx^P^PH|||Single||175375903|xxxxxxx||last^first^^|NON-HISPANIC||||||||||
PV1|1|I|MNEU^908^A^^R^^^^^^||||9999999^pcp^pcp^LYNNE^^^^^NPI^^^^NPI~999999999^last^first^LEE^^^^^NPI^^^^NPI||||||||||00000000^last^first^LYNNE^^^^^NPI^^^^NPI||000000603|CAID||||||||||||||||||||||||20170626000000
HL7 is hard to extract with regex; however, I have a field that is always in the same location, and I feel that might be easier. I need to pull the encounter number, which is the 'W00xxxxx' in the message above. It is always in the 3rd pipe-delimited field of the PID segment and stops at the ^.
Currently I have: select substring(column from 'PID\|[1]\|\|(.)\^') but this is not working. However when I use select substring(column from 'PV1\|[1]\|(.)\|') it will pull the 'I'. I can't see the big differences in my regex to know why this isn't working. Thanks.

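How about this:
PID\|[1]\|\|(.+?)\^
Your (.) captures exactly one character. That is why the PV1 pattern works (the value there is the single character 'I') while the PID pattern does not: the encounter number is longer than one character. (.+?) lazily captures one or more characters up to the first ^.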

You can't reliably parse HL7 V2.x messages using regex because the encoding characters may change in MSH-1 and MSH-2. Whatever language you're using there's probably already an HL7 parsing library you can use instead.
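If no HL7 library is available, a delimiter-aware split is still safer than a hard-coded regex. The following is a minimal Python sketch, not a full HL7 parser: it reads the field and component separators from the MSH segment instead of assuming | and ^. The function name first_component and the shortened SAMPLE message are made up for illustration, and HL7 escape sequences and repetitions are not handled.

```python
import re


def first_component(message, segment_id, field_number):
    """Return the first component of a field in the first matching segment.

    Field numbering follows HL7 for non-MSH segments (PID-3 is field 3).
    Returns None if the segment or the field is missing.
    """
    if not message.startswith("MSH") or len(message) < 5:
        raise ValueError("message does not start with an MSH segment")
    field_sep = message[3]      # MSH-1: field separator, usually |
    component_sep = message[4]  # first character of MSH-2, usually ^
    # Segments normally end in a carriage return; tolerate \n and \r\n too.
    for segment in re.split(r"\r\n|\r|\n", message):
        fields = segment.split(field_sep)
        if fields[0] == segment_id:
            if field_number >= len(fields):
                return None
            return fields[field_number].split(component_sep)[0]
    return None


# Shortened version of the message from the question (illustrative only).
SAMPLE = "\r".join([
    "MSH|^~\\&|EPIC|SMHRMC|JCAPS|QHN|20170626165726|EDILABIH|ORU^R01^LAB|00004841|P|2.3",
    "PID|1||W00xxxxx^^^SMHRMC||mouse^Mickey^E||19860905|F",
])

print(first_component(SAMPLE, "PID", 3))  # prints W00xxxxx
```

Because the separators come from the message itself, the same call keeps working if a sender uses different encoding characters.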

Related

Parsing JSON style text using sscanf()

["STRING", FLOAT, FLOAT, FLOAT],
I need to parse four values from this string - a STRING and three FLOATS.
sscanf() returns zero, probably I got the format specifiers wrong.
sscanf(current_line.c_str(), "[\"%s[^\"]\",%f,%f,%f],",
&temp_node.id,
&temp_node.pos.x,
&temp_node.pos.y,
&temp_node.pos.z))
Do you know what's wrong?
Please read the manual page on sscanf(3). The %s format does not match using a regular expression, it just scans non-whitespace characters. Even if it worked as you assumed, your regular expression would not be able to handle all JSON strings correctly (which might not be a problem if your input data format is sufficiently restricted, but would be unclean nonetheless).
Use a proper JSON parser. It's not really complicated. I used cJSON for a moderately complex case, you should be able to integrate it within a few hours.
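To fix your immediate problem, use this format specifier:
"[\"%[^\"]\",%f,%f,%f],"
The right syntax for parsing a set is %[...] rather than %s[...]. The scanset is a conversion of its own, so no s follows it; a trailing s would be matched as a literal character and the scan would stop there. Note also that %[ writes into a char array, which must be large enough for the match.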
That being said, sscanf() is not the right tool for parsing JSON. Even the "fixed" code would fail to parse strings that contain escaped quotes, for instance.
Use a proper JSON parser library.
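To illustrate what a real JSON parser buys you, here is a small Python sketch (the question is C++, so this shows the approach rather than a drop-in fix). It assumes each input line holds exactly one array of a string and three numbers followed by a trailing comma, as in the question; the function name parse_node is hypothetical.

```python
import json


def parse_node(line):
    """Parse one '["id", x, y, z],' line into (id, x, y, z)."""
    # Drop surrounding whitespace and the trailing comma, then let the
    # JSON parser deal with quoting and escapes.
    values = json.loads(line.strip().rstrip(","))
    node_id, x, y, z = values  # raises ValueError if the shape is wrong
    return node_id, float(x), float(y), float(z)


print(parse_node('["node \\"a\\"", 1.5, -2, 3e2],'))
```

The escaped quotes inside the string are handled by the parser, which is exactly the case the sscanf() format cannot cope with.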

Are my regex just wrong or is there a buggy behaviour in td-agent's format behaviour?

I am using fluentd, elasticsearch and kibana to organize logs. Unfortunately, these logs are not written using any standard like apache, so I had to come up with the regex for the format myself. I used this site here to verify that they are working: http://fluentular.herokuapp.com/ .
The logs have roughly this format here:
DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten.
the format regex I am using is as follows:
format /(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>(\d{2}\.\d{2}\.\d{4})).(?<time>(\d{2}:\d{2}:\d{2})).\[(?<subject>(.*))\].(?<msg>(.*))/
Now, judging by that website that is supposed to test specifically fluentd's behaviour with regexes, the output SHOULD be this one:
Record
Key Value
pri DEBUG
date 24.04.2014
subject SingleActivityStrategy
msg Start Activitiy 'barbecue' zu verabeiten.
Instead though, I have this ?bug? that pri is always shortened to DEBU. Same for ERROR which becomes ERRO, only INFO stays INFO. I am not very experienced with regular expressions and I find it hard to believe that this is a bug, still it confuses me and any help is greatly appreciated.
I'm not sure I can link the complete config file, because I don't personally own these log files and I am trying to keep it on a level where my boss won't get mad at me for posting sensitive information. Should it definitely be needed, I will post it later on, after having asked him how much I can reveal.
In general, the logs always look roughly like this:
First the priority, which is either DEBUG, ERROR or INFO, next the date, next what we call the subject, which is always written in [ ], and finally just a message.
Here is a link to fluentular with the format I am using and a teststring that produces the right result in fluentular, but not in my config file:
Fluentular
Sorry I couldn't make it work like a regular link to just click on.
Another link to test out regex with my format and test string is this one:
http://rubular.com/r/dfXOkQYNXP
tl;dr version:
my td-agent format regex cuts off the last letter, although fluentular says it shouldn't. My fault or a bug?
How the regex would look if you're trying to match the data specifically:
(INFO|DEBUG|ERROR)\:\s+(\d{2}\.\d{2}\.\d{4})\s(\d{2}:\d{2}:\d{2})\s\[(.*)\](.*)
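In your format string, you were using . and ... where your spaces and colon should be, and [INFO], [DEBUG] and [ERROR] are character classes that each match a single letter, not the whole word. In your sample line only two characters (the colon and one space) sit between DEBUG and the date, but ... demands three, so the engine backtracks and hands the final G to the dots, leaving pri as DEBU. I'm not too sure why this works in Fluentular, but you should match the \: explicitly, along with each space between the values.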
So you'd be looking at the following regular expression with the Fluentd fields (which are grouping names):
(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))
Meaning your td-agent.conf should look like:
<source>
type tail
path /var/log/foo/bar.log
pos_file /var/log/td-agent/foo-bar.log.pos
tag foo.bar
format /(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))/
</source>
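The truncation can be reproduced outside td-agent. This is a small Python sketch, assuming the sample log line from the question; Python writes named groups as (?P<name>...) where Ruby and Fluentd write (?<name>...), and the redundant inner groups are dropped, but the patterns are otherwise the same. The names LINE, BROKEN and FIXED are made up for the example.

```python
import re

LINE = ("DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] "
        "Start Activitiy 'barbecue' zu verabeiten.")

# Original pattern: character classes plus three "any character" dots.
BROKEN = re.compile(
    r"(?P<pri>([INFO]|[DEBUG]|[ERROR])+)..."
    r"(?P<date>\d{2}\.\d{2}\.\d{4}).(?P<time>\d{2}:\d{2}:\d{2})."
    r"\[(?P<subject>.*)\].(?P<msg>.*)"
)

# Corrected pattern: literal words, explicit colon and whitespace.
FIXED = re.compile(
    r"(?P<pri>INFO|ERROR|DEBUG):\s+"
    r"(?P<date>\d{2}\.\d{2}\.\d{4})\s(?P<time>\d{2}:\d{2}:\d{2})\s"
    r"\[(?P<subject>.*)\]\s(?P<msg>.*)"
)

broken = BROKEN.search(LINE).groupdict()
fixed = FIXED.search(LINE).groupdict()

print(broken["pri"])  # prints DEBU: the G was consumed by the three dots
print(fixed["pri"])   # prints DEBUG
```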
I would also take a look into comparing Logstash vs. Fluentd. I like Logstash far more because you create Grok filters to match the type of data you want, and it makes formatting your fields much easier because you are providing an abstraction layer, but you essentially will get the same data.
And I would watch out when you're using sites like Rubular, as they are fairly particular about multi-line matching and the like. I'd suggest something like Regexr which gives immediate feedback and you can set global and multiline matching as well.

IDML RightIndentTab causing line break

I am working on IDML files, which are used by InDesign, and I am facing a problem inserting a special character. I need to embed a RightIndentTab in the IDML file. Its Unicode code point is U+0008. When I try to add that, an error is thrown because this character is not supported by the XML spec.
I looked into it further, and IDML has a special processing instruction that can be inserted instead; it looks like <?ACE 8?>. Now the problem is that when I add this, a line break is introduced before the RightIndentTab symbol. On debugging I found that the content element looks like
<Content>
<?ACE 8?>9731396</Content>
It is an XElement and I see \r\n when I call ToString() on it. I also tried using XmlWriter.
What I would like is an XElement object which looks like
<Content><?ACE 8?>9731396</Content>
Thanks in advance!
I've encountered exactly the same problem adding processing instructions to IDML, using .NET. Even with significant whitespace turned off I got a line break that InDesign treats as part of the text.
The only solution I have found is to save the file as XML, then open it as a text document and use a regular expression to replace >\r\n<? with just ><?. It's ugly and kludgy, but it does work - I don't have the regex to hand but you should be able to figure it out fairly quickly.
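For what it's worth, the replacement could be sketched roughly like this in Python (the thread is about .NET, so treat this as an illustration of the regex only). The function name join_pi_to_tag is hypothetical, and also swallowing indentation after the line break is an assumption on my part.

```python
import re


def join_pi_to_tag(xml_text):
    """Remove a line break (and any indentation) between '>' and '<?'."""
    # >, an optional \r, a \n, optional spaces or tabs, then <?
    return re.sub(r">\r?\n[ \t]*<\?", "><?", xml_text)


before = "<Content>\r\n<?ACE 8?>9731396</Content>"
print(join_pi_to_tag(before))  # prints <Content><?ACE 8?>9731396</Content>
```

Run it on the saved file's text before handing the file to InDesign.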
I've never had any problems adding unicode chars to XML, though. I would just use  and also set the XmlWriter encoding to use unicode. See here for an example: http://bytes.com/topic/net/answers/176665-how-write-unicode-using-xmlwriter which recommends:
XmlTextWriter myWriter = new XmlTextWriter( fileStream,
new System.Text.UnicodeEncoding( false, false) );

Best way to remove XML declaration from BSTR

I'm wondering if someone can help me trying to remove the XML declaration from a string containing an XML doc. Any help would be appreciated. We're using MSXML 4.0, but I was having difficulties using that and ended up just doing a substring. I'm not very familiar with the ATL and other Microsoft SDKs. It works, but a little part of me died inside and I would prefer to have this done in a less fragile manner.
Edit: Currently I am doing a sub-string on the first occurrence of a newline character. I was trying to tokenize or sub-string on the "?>" of the XML declaration, but I'm having issues on getting the character matching (using wcstok and substring). I tried "\?>", "\?>" and "?>". The ideal solution would be to load the document into XMLDocument object and just get the text of the message body.
Look up the XML specification, particularly the grammar for the prolog:
[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
[23] XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
So, your handspun code should be able to parse VersionInfo, EncodingDecl and SDDecl along with the XML declaration tag start and end tokens. For more info on these individual items see the specification.
However, my suggestion would be to use the right tool for the job: use an XML toolkit/parser. (The difference between a parser and a toolkit is mainly that the toolkit will support advanced operations such as DTD validation, namespace handling, XPath etc.)
MSXML4 is pretty old. MSXML6 is the latest. However, MSXML6 is pretty useless for anything but small XML files. So, choose a parser depending on your input file size (if performance is important). There are freely available libraries like Xerces, RapidXML, pugixml etc. which have much better performance.
Also, can you specify what difficulties you have faced with MSXML4?
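To show the parser-based approach in the abstract, here is a minimal Python sketch using the standard library's ElementTree rather than MSXML, so it illustrates the idea and is not the code you would write against a BSTR. The function name strip_declaration and the sample document are invented for the example.

```python
import xml.etree.ElementTree as ET


def strip_declaration(xml_bytes):
    """Parse the document and re-serialize it without the XML declaration."""
    root = ET.fromstring(xml_bytes)
    # Serializing to a str (encoding="unicode") emits no <?xml ...?> line.
    return ET.tostring(root, encoding="unicode")


doc = b'<?xml version="1.0" encoding="UTF-8"?>\n<message><body>hello</body></message>'
print(strip_declaration(doc))  # prints <message><body>hello</body></message>
```

The parser takes care of VersionInfo, EncodingDecl and SDDecl, so nothing depends on where the first newline or the ?> happens to be.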

Can someone explain which characters need to be escaped in tinyxml?

I am using tinyxml to save input from a text ctrl. The user can copy whatever they like into the text box and it gets written to an xml file. I'm finding that the new lines don't get saved and neither do & characters. The weird part is that tinyxml just discards them completely without any warning. If I put a & into the textbox and save, the tag will look like:
<textboxtext></textboxtext>
Newlines completely disappear as well. No characters whatsoever are stored. What's going on? Even if I need to escape them with &amp; or something, why does it just discard everything? Also, I can't find anything on Google regarding this topic. Any help?
EDIT:
I found this topic, which suggests that the discarding of these characters may be a bug:
TinyXML and preserving HTML Entities
It is, apparently, a bug in TinyXml.
The simple workaround is to escape anything that it might not like:
&, ", ', < and > get their regular XML entity encodings (&amp;, &quot;, &apos;, &lt; and &gt;)
strange characters (read: anything that is not alphanumeric or regular punctuation) are best translated to numeric character references for their Unicode code points: &#....;
Remember that TinyXml is above all a lightweight XML library, not a full-fledged beast.
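The escaping step described above could look roughly like this. It is a Python sketch of the idea (TinyXml itself is C++), the function name escape_for_xml is hypothetical, and turning newlines into &#10; is my own assumption about how to keep them from being dropped.

```python
from xml.sax.saxutils import escape


def escape_for_xml(text):
    """Escape markup characters, newlines and non-ASCII characters."""
    # escape() handles &, < and > itself; the dict adds quotes and newlines.
    escaped = escape(text, {'"': "&quot;", "'": "&apos;", "\n": "&#10;"})
    # Replace anything outside ASCII with a numeric character reference.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(escape_for_xml("a & b <c>\n\u00e9"))
```

The result contains only plain ASCII, so the library never sees a raw & or a raw newline in the text it has to store.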