XML+SOAP: are newlines permitted? - web-services

I'm working with SOAP and XML to interact with some web-services.
I noticed that gsoap-generated routines emit XML without newlines, and they work correctly. I subsequently tried to write my own routines using libxml2, which indents the XML with newlines.
While all the web-services that I tested were able to deal with the code generated by gsoap, not all of them were able to deal with my hand-written code, and the error was triggered by the presence of the newlines.
So my question is: are newlines forbidden in XML+SOAP? Do I have to write all the code on a single line? Or did I just face some broken services?

Newlines aren't forbidden. The only reason gSoap writes each message on one line is that there is no need to format it for human readers; indenting would be an additional cost with no effect.
There must be some other reason. Compare your XML messages with those generated by gSoap: is the content really the same, with the newlines as the only difference?
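As a rough way to run that comparison, here is a minimal Python sketch (standard library only) that parses a one-line and an indented version of the same envelope and compares them while ignoring whitespace-only text. The message content and the urn:example namespace are made up for illustration; they are not taken from gSoap or from any real service.

```python
import xml.etree.ElementTree as ET

# Hypothetical SOAP request, written on one line the way a generator might emit it.
COMPACT = (
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    '<soap:Body><m:Ping xmlns:m="urn:example"><m:text>hi</m:text></m:Ping>'
    '</soap:Body></soap:Envelope>'
)

# The same request, indented with newlines.
PRETTY = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <m:Ping xmlns:m="urn:example">
      <m:text>hi</m:text>
    </m:Ping>
  </soap:Body>
</soap:Envelope>"""


def _drop_blank(text):
    """Treat missing or whitespace-only text as empty."""
    return text if text and text.strip() else ""


def shape(elem):
    """Nested tuple describing an element, ignoring indentation-only text."""
    return (
        elem.tag,
        sorted(elem.attrib.items()),
        _drop_blank(elem.text),
        _drop_blank(elem.tail),
        [shape(child) for child in elem],
    )


def same_message(xml_a, xml_b):
    """True if both documents have the same elements, attributes and text."""
    return shape(ET.fromstring(xml_a)) == shape(ET.fromstring(xml_b))


print(same_message(COMPACT, PRETTY))
```

If the two messages compare equal this way and a service still rejects the indented one, that points at the service rather than at the XML.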

Related

parse hl7 with regex

I have the following hl7 message:
MSH|^~\&|EPIC|SMHRMC|JCAPS|QHN|20170626165726|EDILABIH|ORU^R01^LAB|00004841|P|2.3|||||||||
PID|1||W00xxxxx^^^SMHRMC||mouse^Mickey^E||19860905|F||1|2601 somestreet AVE NO 8^^City^ST^zip^USA^^^county|MESA|(970)xxx-xxxx^P^PH|||Single||175375903|xxxxxxx||last^first^^|NON-HISPANIC||||||||||
PV1|1|I|MNEU^908^A^^R^^^^^^||||9999999^pcp^pcp^LYNNE^^^^^NPI^^^^NPI~999999999^last^first^LEE^^^^^NPI^^^^NPI||||||||||00000000^last^first^LYNNE^^^^^NPI^^^^NPI||000000603|CAID||||||||||||||||||||||||20170626000000
HL7 is hard to extract with regex; however, I have a field that is always in the same location, so I feel this case might be easier. I need to pull the encounter number, which is the 'W00xxxxx' in the stream above. It is always in the 3rd pipe-delimited section of the PID segment and stops at the ^.
Currently I have: select substring(column from 'PID\|[1]\|\|(.)\^') but this is not working. However when I use select substring(column from 'PV1\|[1]\|(.)\|') it will pull the 'I'. I can't see the big differences in my regex to know why this isn't working. Thanks.
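How about this? Your (.) captures exactly one character, which is enough for the 'I' in PV1 but not for 'W00xxxxx'; (.+?) captures one or more characters, as few as possible: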
PID\|[1]\|\|(.+?)\^
You can't reliably parse HL7 V2.x messages using regex because the encoding characters may change in MSH-1 and MSH-2. Whatever language you're using there's probably already an HL7 parsing library you can use instead.
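To illustrate both answers, here is a small Python sketch. It first checks the suggested regex against a shortened copy of the message from the question, then reads the field and component separators from the MSH segment instead of hard-coding | and ^. The function name pid3_first_component and the shortened SAMPLE are assumptions made for this example; a real HL7 library would handle far more (escapes, repetitions, subcomponents).

```python
import re

# Shortened copy of the message from the question; HL7 v2 separates segments with CR.
SAMPLE = "\r".join([
    r"MSH|^~\&|EPIC|SMHRMC|JCAPS|QHN",
    r"PID|1||W00xxxxx^^^SMHRMC||mouse^Mickey^E",
    r"PV1|1|I|MNEU^908^A",
])

# The regex from the answer: lazy match up to the first ^ after "PID|1||".
regex_result = re.search(r"PID\|[1]\|\|(.+?)\^", SAMPLE).group(1)


def pid3_first_component(message):
    """Return the first component of PID-3, using the delimiters declared in MSH."""
    segments = [s for s in message.splitlines() if s]
    if not segments or not segments[0].startswith("MSH") or len(segments[0]) < 5:
        raise ValueError("message does not start with an MSH segment")
    field_sep = segments[0][3]      # MSH-1: the character right after "MSH"
    component_sep = segments[0][4]  # first character of MSH-2
    for segment in segments:
        fields = segment.split(field_sep)
        if fields[0] == "PID" and len(fields) > 3:
            return fields[3].split(component_sep)[0]
    return None


print(regex_result)
print(pid3_first_component(SAMPLE))
```

The second approach keeps working if a sender declares different delimiters in MSH, which is the case the fixed regex cannot cover.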

Regular expression incremental parsing

Are there languages or tools that support the parsing of regexes on a character-by-character basis?
I think this may be equivalent to "regexes on streams" which is something that seems to be one of the features of the upcoming Perl version 6.
Basically I want to do this because I'm building a tool that does translation of a terminal stream over a pseudo-terminal, and it occurred to me that the ultimate sort of flexibility that should be attainable is by allowing the specification of regex-replace expressions.
The use case is that I want to allow my mouse scroll events to be passed to a naive program such as the less pager, which means my tool (which spawns less over a PTY) will be doing something like issuing the code \x1b[?1000h which switches on mouse reporting, and then subsequently translating every mouse wheel escape code received thereafter such as \x1b[M!! (the last several chars encode the mouse position within the terminal and should be ignored but also stripped) into the \x1b[A Up-arrow code.
As you can see being able to specify a regex that works on the stdin terminal-reading stream to generate the translated stream to send to the slave pty would be ideal.
Do I need to wait for Perl 6 to be able to achieve this? There must be particular reasons for why regex engines generally require having the whole string available?
It's pretty obvious I don't need the full blown power of regex here. I can speculate for instance that it might be the case that supporting backtracking makes stream-parsing regex impossible.
So since I don't need backtracking maybe there is some sort of light-weight regex engine out there that provides a stream API. It just seems like taking advantage of some form of parsing system (if one exists that is suitable) would be smarter than building something arbitrary.
Looks like s2p is an example of something that I can use.
In particular, the potential of being able to set $| to not do line-buffering.
Actually I don't think this will work. It seems to be built around lines and uses the s operator to run regex.
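For the specific translation described in the question, a hand-written state machine that is fed one chunk at a time may be enough. The sketch below is only an illustration under stated assumptions: it treats every \x1b[M followed by three payload bytes as a mouse report and replaces it with the \x1b[A Up-arrow code, without looking at which button was reported. The class name MouseToArrow is made up for this example.

```python
class MouseToArrow:
    """Rewrites ESC [ M + 3 bytes into ESC [ A, one input chunk at a time."""

    PREFIX = b"\x1b[M"
    PAYLOAD_LEN = 3          # button, column, row (all ignored and stripped)
    UP_ARROW = b"\x1b[A"

    def __init__(self):
        # Bytes seen so far that might still turn out to be a mouse report.
        self._held = bytearray()

    def feed(self, chunk):
        """Consume a chunk of input bytes and return the translated output."""
        out = bytearray()
        for byte in chunk:
            self._held.append(byte)
            n = len(self._held)
            if n <= len(self.PREFIX):
                if byte != self.PREFIX[n - 1]:
                    # Not a mouse report after all: release what was held back.
                    self._held.pop()
                    out += self._held
                    self._held.clear()
                    if byte == self.PREFIX[0]:
                        self._held.append(byte)  # this ESC may start a new report
                    else:
                        out.append(byte)
            elif n == len(self.PREFIX) + self.PAYLOAD_LEN:
                # Full report received: drop it and emit the arrow key instead.
                out += self.UP_ARROW
                self._held.clear()
        return bytes(out)


translator = MouseToArrow()
first = translator.feed(b"ab\x1b[")   # the report is split across two reads
second = translator.feed(b"M`!!cd")
print(first, second)
```

Because the state lives in the object, it does not matter where the reads from the terminal happen to split the escape sequence. One limitation of this sketch: a lone ESC at the end of the input is held back until the next byte arrives, so a real tool would probably want a timeout.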

Choosing line ending with libxml2

I am trying to generate some XML files (TMX) on our servers.
The servers are Solaris SPARC servers, but the destination of the files is some legacy Windows CAT Tools.
The CAT-Tool requires CR+LF line endings as is the default on Windows. Writing the files with libxml2, using xmlWriter, is easy and works quite well. But I haven't figured out a way to force the lib to emit CR+LF instead of the Unix standard LF. The lib only seems to support the line ending of the platform it runs on.
Has somebody found a way to generate files with a line ending other than the default of the platform it runs on? Currently my workaround is to open the written file and write a new file with the changed line endings using a simple C loop. That works, but it is annoying to have such an unnecessary step in our chain.
I haven't tried this myself, but from xmlsave, I can see two possibilities:
xmlSaveToBuffer: save to a buffer, convert to CR/LF and write it out yourself.
xmlSaveToIO: register an iowrite callback and convert to CR/LF while writing in your callback function.
Maybe, there are other options, but I haven't found them.
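The conversion step itself is small. Here is a sketch of the logic such a write callback would need, written in Python rather than C to keep it short; it does not use libxml2 at all, and the class name CrlfWriter is made up for this example. The only subtle part is remembering whether the previous chunk ended in CR, so that an existing CR+LF pair split across two writes is not turned into CR+CR+LF.

```python
import io


class CrlfWriter:
    """Wraps a binary sink and rewrites every bare LF as CR+LF."""

    def __init__(self, sink):
        self._sink = sink
        self._last_was_cr = False  # carried across write() calls

    def write(self, data):
        out = bytearray()
        for byte in data:
            if byte == 0x0A and not self._last_was_cr:
                out.append(0x0D)   # insert the missing CR
            out.append(byte)
            self._last_was_cr = (byte == 0x0D)
        self._sink.write(bytes(out))
        return len(data)           # report the number of input bytes consumed


sink = io.BytesIO()
writer = CrlfWriter(sink)
writer.write(b"<a>\n<b/>\r")       # chunk ends in the middle of a CR+LF pair
writer.write(b"\n</a>\n")
print(sink.getvalue())
```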
"The CAT-Tool requires CR+LF line endings as is the default on Windows."
FWIW, that means the CAT-Tool has a broken XML parser. It shouldn't care about this, as the XML spec says:
To simplify the tasks of applications, the XML processor must behave as if it normalized all line breaks ... by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
I know often these things are out of our control, but if you can lean on the CAT-Tool vendor to fix their software, it could become a more future-proof solution.
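The normalization rule quoted above is easy to observe with a conforming parser. A minimal check using Python's standard library parser (the seg element and its text are made up for this example):

```python
import xml.etree.ElementTree as ET

# The same document with Unix and with Windows line endings inside the text.
unix_doc = b"<seg>line one\nline two</seg>"
windows_doc = b"<seg>line one\r\nline two</seg>"


def text_of(doc):
    """Parse the document and return the text of its root element."""
    return ET.fromstring(doc).text


# Both variants reach the application with a single LF.
print(text_of(unix_doc) == text_of(windows_doc))
```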
According to the source code (as of April 2013), libxml2 just puts "\n" into the output stream, at least when writing the DTD part of a document. Therefore, re-encoding the stream on the fly is the only option to get "\r\n" as a result.
If you are lucky (as I was) and your tool runs on Windows, you can open the file in text mode, and the OS will do the conversion for you.

Best way to remove XML declaration from BSTR

I'm wondering if someone can help me remove the XML declaration from a string containing an XML doc. Any help would be appreciated. We're using MSXML 4.0, but I was having difficulties using that and ended up just doing a substring. I'm not very familiar with the ATL and other Microsoft SDKs. It works, but a little part of me died inside and I would prefer to have this done in a less fragile manner.
Edit: Currently I am doing a sub-string on the first occurrence of a newline character. I was trying to tokenize or sub-string on the "?>" of the XML declaration, but I'm having issues on getting the character matching (using wcstok and substring). I tried "\?>", "\?>" and "?>". The ideal solution would be to load the document into XMLDocument object and just get the text of the message body.
Look up the XML specification, particularly the grammar for the prolog:
[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
[23] XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
So, your handspun code should be able to parse VersionInfo, EncodingDecl and SDDecl along with the XML declaration tag start and end tokens. For more info on these individual items see the specification.
However, my suggestion would be to use the right tool for the job: use an XML toolkit/parser. (The difference between a parser and a toolkit is mainly that the toolkit will support advanced operations such as DTD validation, Namespace handling, XPath etc.).
MSXML4 is pretty old. MSXML6 is the latest. However, MSXML6 is pretty useless for anything but small XML files. So, choose a parser depending on your input file size (if performance is important). There are freely available libraries like Xerces, RapidXML, pugixml etc. which have much better performance.
Also, can you specify what difficulties you have faced with MSXML4?
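If a string-based approach has to stay for now, the grammar above at least shows what to look for: the declaration can only appear at the very start, it begins with <?xml followed by whitespace, and it ends at the first ?>. Below is a sketch of that check in Python rather than with MSXML or wcstok; the function name strip_xml_declaration is made up for this example, and it also strips the whitespace that follows the declaration.

```python
def strip_xml_declaration(text):
    """Remove a leading <?xml ... ?> declaration, if there is one."""
    marker = "<?xml"
    # Require whitespace after "<?xml" so that processing instructions such as
    # <?xml-stylesheet ...?> are not mistaken for the declaration.
    if (
        text.startswith(marker)
        and len(text) > len(marker)
        and text[len(marker)] in " \t\r\n"
    ):
        end = text.find("?>")
        if end != -1:
            return text[end + 2:].lstrip()
    return text


doc = '<?xml version="1.0" encoding="UTF-8"?>\n<root><a>1</a></root>'
print(strip_xml_declaration(doc))
```

This is less fragile than cutting at the first newline, because it still works when the declaration and the root element are on the same line.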

Coding a Gmail style "hide quoted text" for web based mailing list archive

I'm working on a web application that parses and displays email messages in a threaded format (among other things). Emails may come from any number of different mail clients, and in either text or HTML format.
Given that most people have a tendency to top post, I'd like to be able to hide the duplicated message in an email reply in a manner similar to how Gmail does it (e.g. "show quoted text").
Determining which part of the message is the reply is somewhat challenging. Personally, I use "> " delimiters at the beginning of the quoted text when replying. I created a regexp that looks for these lines and wraps a div around them to allow some JS to hide or show this block of text.
I then noticed that Outlook doesn't use the "> " characters by default, it simply adds a header block above the reply with the summary of the headers (From, Subject, Date, etc.). The reply is untouched. I can match on this and hide the rest of the email, working with the assumption that it's a top quote.
I then looked at Thunderbird, and it uses "> " for text, and <blockquote> for HTML mails. I still haven't looked at what Apple Mail does, what Notes does, or what any of the other millions of mail clients out there do.
Will I be writing a special-case regexp for every single client out there? Or is there something I'm missing?
Any suggestions, sample code or pointers to third party libraries much appreciated!
It'll be pretty hard to duplicate the way Gmail does it, since it doesn't care whether a piece was quoted or not. Like Zac says, it just seems to care about the diff.
It's actually pretty hard to get this right 100% of the time. Plain text email is "lossy"; it's entirely possible for you to send
> Here is my long line that is over 74 chars (email line length limit)
Which can get encoded as something like
> Here is my long line that is over 74 chars (email=
line length limit)
And then is decoded as
> Here is my long line that is over 74 chars (email
line length limit)
Making it indistinguishable from an inline-reply.
This is email, so variations abound. Email usually line-wraps at something like 74 characters, and encoding schemes can differ. It's a real PITA. If you can access the HTML version, you will probably have better luck looking for quote tags and the like. Another idea would be to parse both the plain text and HTML versions to try to determine the boundaries.
Additionally, it's best to just plan for specific client hacks. They all construct MIME messages differently, both in structure and header content.
Edit: I say this with the experience of writing an email processing system as well as seeing several people try to do the -exact- thing you're doing. It always only got "ok" results.
From what I can tell, Gmail does not bother about prefixed lines or section headings, except to ignore them. If the text lines appeared earlier in the thread, and then reappear, it is considered to be quoted. Thus, e.g., if you send multiple messages and don't change your signature, the signature is considered to be quoted. If you've already dealt with the '>' prefix, a simple diff should do most of the rest. No need to get fancy.
The first thing I think I'd do is strip out all the white space (or reduce it to a single space between words) and the special characters from both blocks, then look for the old one in the new one.
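A rough sketch of how the last two suggestions could fit together, in Python: normalize each line (drop any leading ">" markers, collapse whitespace, ignore case), then flag a line of the reply as quoted if it carries a ">" prefix or if its normalized form already appeared earlier in the thread. The names normalize and mark_quoted and the sample messages are made up for this example, and this is not a description of how Gmail actually does it.

```python
import re


def normalize(line):
    """Strip quote markers, collapse whitespace and lowercase a line."""
    line = re.sub(r"^\s*(>\s*)+", "", line)
    return re.sub(r"\s+", " ", line).strip().lower()


def mark_quoted(earlier_messages, reply):
    """Return (is_quoted, line) pairs for every line of the reply."""
    seen = set()
    for message in earlier_messages:
        for line in message.splitlines():
            key = normalize(line)
            if key:                      # blank lines carry no information
                seen.add(key)
    marked = []
    for line in reply.splitlines():
        quoted = line.lstrip().startswith(">") or normalize(line) in seen
        marked.append((quoted, line))
    return marked


earlier = ["Shall we meet on Friday?\n-- \nAlice"]
reply = "Friday works.\n\n> Shall we meet   on Friday?\n-- \nAlice"
for quoted, line in mark_quoted(earlier, reply):
    print("hide" if quoted else "show", repr(line))
```

As described above, an unchanged signature ends up flagged as quoted too. Consecutive flagged lines can then be wrapped in a single collapsible block.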
Here's a mozdev project that may be helpful for others who stumble across this page looking for a Thunderbird solution:
http://quotecollapse.mozdev.org/