How to integrate Saxon with Kafka - xslt

I would like to use Kafka as the kernel of my message-oriented middleware.
It means that, in addition to the transport of messages provided by Kafka, I would need payload transformation/enrichment between the payload sent by the producer and the payload received by the consumer. Indeed, in a lot of use cases, the format used by the producers will not be the most appropriate format for the consumers.
An obvious example would be a mainframe application producing COBOL copybooks as the producer, and a Node.js application optimized for a JSON-based format as the consumer! Or, more commonly, two applications using XML syntax but with different schemas.
In (traditional) message-oriented middleware, such payload transformations are usually performed with XSLT, a dedicated and performant language for transforming XML documents into other XML documents. And Saxon is one of the leading XSLT processors on the market.
Now, I am looking for advice/examples on how to integrate Saxon with Kafka... any hints/remarks/directions, even questions, would be appreciated!
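To make the question concrete, here is the rough shape of what I have in mind: a plain consume-transform-produce loop using Saxon's s9api API and the standard Kafka Java clients (topic names, the stylesheet path, and the broker address are placeholders):

```java
import net.sf.saxon.s9api.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import javax.xml.transform.stream.StreamSource;
import java.io.File;
import java.io.StringReader;
import java.io.StringWriter;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class XsltBridge {
    public static void main(String[] args) throws SaxonApiException {
        // Compile the stylesheet once; the XsltExecutable is reusable.
        Processor processor = new Processor(false);
        XsltExecutable xslt = processor.newXsltCompiler()
                .compile(new StreamSource(new File("producer-to-consumer.xslt")));

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "xslt-bridge");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            consumer.subscribe(Collections.singletonList("producer-format"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // Transform the XML payload and forward it to the consumer-facing topic.
                    StringWriter transformed = new StringWriter();
                    Xslt30Transformer transformer = xslt.load30();
                    transformer.transform(new StreamSource(new StringReader(record.value())),
                            processor.newSerializer(transformed));
                    producer.send(new ProducerRecord<>("consumer-format",
                            record.key(), transformed.toString()));
                }
            }
        }
    }
}
```

I imagine that in a real deployment this loop might live in a Kafka Streams topology or a Kafka Connect transformation instead, but the Saxon part would stay the same.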
Many thanks in advance!

Related

Difference between Data Mapper Mediator and Payload Factory Mediator

Besides the syntax, what is the core difference between a data mapper and a payload factory? They can both convert/transform data from one format to another.
I have used the data mapper only a few times (you stick with what you know). In my opinion both mediators provide mostly the same functionality (as does the xslt mediator), but the underlying technology and, above all, the development method are radically different.
datamapper provides a graphical way of transforming messages. It uses existing output and input messages to seed the transformation, so it is strong when you have the output of service A and the input of service B and just need to map the data from A to B.
payloadFactory is able to quickly build messages. I use it mostly to create requests where only a few fields need to be mapped from the original request to the new request.
xslt is a versatile and powerful way of transforming messages, but it requires some experience. A lot of third-party tooling is available to assist with the transformation.

Avro message for Google Cloud Pub-Sub?

What is the best data format for publishing to and consuming from Pub/Sub? I am looking at the Avro message format due to its binary encoding.
The use case would be real-time microservice applications publishing Avro messages to Pub/Sub. Given that Avro is best suited to batching up messages (along with a schema attached to the binary message) and then publishing them, would it be a suitable format for this use case involving microservices?
The Google Cloud documentation contains some JSON examples, but when looking for efficiency the main suggestion is to use the available client libraries, except if your needs aren't met by what the client libraries can offer or if you are running on the Google App Engine standard environment, in which case the use of the two APIs is suggested.
In fact, the most important factor for efficiency is using the gRPC API instead of the REST API (which the client libraries do by default). As mentioned here:
There are two major factors at work here: more efficient data encoding and HTTP/2. gRPC keeps data in binary both in client memory and on the wire by building on HTTP/2 and Protocol Buffers. This eliminates processing and space required for string encoding schemes such as Base64 or JSON. In addition, HTTP/2 itself makes things go faster with multiplexed requests over a single connection and header compression.
I did not find any explicit mention of data formats anywhere. I suggest you use your preferred language for the messages, for example Python. The client library description is here and sample code is here.
Based on this StackOverflow post, you can optimize your Pub/Sub system by:
Making sure you are using gRPC
Batching where possible, to reduce the number of calls and amortize per-call latency.
Only compressing when needed and after benchmarking (implies extra logic in your application)
Finally, if you intend to deploy a robust Pub/Sub system, have a look at this post by Anusha Ramesh. She is a Project Manager at Google and elaborates on three tips:
Don't underestimate the importance of capacity planning.
Make sure your pub/sub system is fault-tolerant.
NSM: Never Stop Monitoring.
There isn't going to be one correct answer for the best format to use for the messages for all use cases. Avro is certainly a popular choice. Protocol buffers would be another possibility, as would Thrift. For Pub/Sub, the data is all just bytes and it is up to the publisher and the subscriber to determine the interpretation of this data. People have run comparisons on the different data formats, so you may want to make the decision based on your needs in terms of performance and message sizes.
Pub/Sub itself uses Protocol Buffers for defining its data types. With regard to batching, the Cloud Pub/Sub client libraries do batching themselves for publish, so you don't necessarily have to worry about that on your own. You can control the batch settings to optimize throughput and latency based on your use case by calling, for example, setBatchingSettings in the Publisher.Builder for Java (other languages have an equivalent as well). You may decide to do your own batching if you want to associate some metadata with a set of messages instead of with each individual message, or if you have very specific needs in terms of how messages are batched together. Otherwise, relying on the client library to do the batching is probably the correct decision.
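As a rough illustration of those batching controls in the Java client (project, topic, and thresholds are placeholder values; older library versions take an org.threeten.bp.Duration, as used here):

```java
import com.google.api.gax.batching.BatchingSettings;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import org.threeten.bp.Duration;

public class BatchedPublisher {
    public static void main(String[] args) throws Exception {
        // Placeholder project and topic names.
        TopicName topic = TopicName.of("my-project", "my-topic");

        // Illustrative thresholds: a batch is sent as soon as any one is hit.
        BatchingSettings batching = BatchingSettings.newBuilder()
                .setElementCountThreshold(100L)           // 100 messages, or
                .setRequestByteThreshold(64 * 1024L)      // 64 KiB of data, or
                .setDelayThreshold(Duration.ofMillis(50)) // 50 ms of waiting
                .build();

        Publisher publisher = Publisher.newBuilder(topic)
                .setBatchingSettings(batching)
                .build();
        try {
            publisher.publish(PubsubMessage.newBuilder()
                    .setData(ByteString.copyFromUtf8("hello"))
                    .build());
        } finally {
            publisher.shutdown(); // flushes any partially filled batch
        }
    }
}
```

The right thresholds depend entirely on your traffic, so benchmark rather than copy these numbers.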

Best practices for server-side architecture for an XSLT-based client application

I'm considering using Saxon CE for a web application to edit ebooks (metadata and content). It seems like a good match given that important ebook components (such as content.opf) are natively XML. I understand how to grab XML data from the server, transform it, insert the results into the HTML DOM, and handle events to change what is displayed and how.
Where I am getting stuck is how best to sync changes back up to the server. Is it best practice to use an XML database on the server? Is it reasonable to maintain XML on the server as text files and overwrite them with a POST, and could/should this be done through a result-document with a remote URI?
I realize this question may seem a bit open-ended, but I've failed to find any examples of client-side XSLT applications which actually allow the modification of data on the server.
Actually, I don't think this question is specific to using Saxon-CE on the client. The issues would be exactly the same if you were using XForms, or indeed if the client-side code were written in JavaScript. And I think the answer depends on volumetrics, availability and concurrency requirements, and so on.
If you're doing a serious level of concurrent update of a shared collection of XML data, then using an XML database is probably a good idea. On the other hand, there might be scenarios where this isn't needed, for example where the XML data is part of the user-specific application context, or where the XML document received from the client simply needs to be saved somewhere "as is", or perhaps where it just needs to be appended to some kind of XML log file.
I think that in nearly all cases, you'll need a server-side component to the application that responds to HTTP put/post requests and decides what to do with them.
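As a minimal sketch of such a component, assuming the save-the-document-as-is scenario with plain files (the port, context path, and storage directory are arbitrary choices), using the JDK's built-in HttpServer:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class XmlStore {
    public static void main(String[] args) throws Exception {
        Path storage = Files.createDirectories(Paths.get("ebook-data"));
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // Save whatever XML document the client-side XSLT app puts/posts, as-is.
        server.createContext("/ebooks/", exchange -> {
            String method = exchange.getRequestMethod();
            if ("PUT".equals(method) || "POST".equals(method)) {
                // File name taken from the request path, e.g. /ebooks/content.opf
                String name = Paths.get(exchange.getRequestURI().getPath())
                        .getFileName().toString();
                Files.write(storage.resolve(name),
                        exchange.getRequestBody().readAllBytes());
                exchange.sendResponseHeaders(204, -1); // success, no body
            } else {
                exchange.sendResponseHeaders(405, -1); // method not allowed
            }
            exchange.close();
        });
        server.start();
    }
}
```

A production version would also need to validate the file name (path traversal), handle concurrent writes, and authenticate the caller, which is precisely the territory where an XML database starts to pay off.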

Cost of serialization in web service

My next project involves the creation of a data API within an enterprise framework. The data will be consumed by several applications running on different software platforms. While my colleagues generally favour SOAP, I would like to use a RESTful architecture.
Most of the applications will only need a few objects at every call. Other applications will however sometimes need to make several sequential calls, each involving thousands of records. I'm concerned about performance. Serialization/deserialization and network usage are where I fear I'll find a bottleneck. If each request involves a large delay, all of the enterprise's applications will be sluggish.
Are my fears realistic? Will serialization to a voluminous format like XML or JSON be a problem? Are there alternatives?
In the past, we've had to do these large data transfers using a "flatter"/leaner file format such as CSV for performance. How can I hope to achieve the performance I need using a web service?
While I'd prefer replies specific to REST, I'm interested in hearing how SOAP users might deal with this as well.
One advantage of REST is that you are free to use whatever media type you like. Why not continue to use text/csv? You could also enable HTTP compression to further reduce bandwidth consumption.
REST services are great for taking advantage of all different kinds of data formats. Whatever format fits your scenario best.
We offer both XML and JSON. The rendering time you mention really can be an issue. On the server side we have JAXB, whose standard Sun implementation is somewhat slow when it comes to marshalling XML. XML has the disadvantage of verbosity, but is also good for interoperability and has schemas plus explicit versioning.
We compensated for the verbosity in several ways (especially by limiting the result set):
In case you have a container with items in it, offer paging in your XML response (both page size and page number, e.g. /items?page=0&size=3). The client can itself reduce the size by reducing the page size.
Offer collapsing of elements; for instance, several clients are only interested in one data field of the whole item. Do this with a parameter (e.g. /items?select=name); then only the nested element 'name' is included inline in your item element. This dramatically decreases size.
Generally, give the clients the power to use result-set limiting. They will definitely use it, because it speeds up response time on their side as well :)
Also use compression; it reduces verbose XML dramatically (in our case the payload got 10 times smaller). From the client side you can request it with the header 'Accept-Encoding: gzip'. If you use Apache, the server configuration is also straightforward. A client-side sketch combining these options follows below.
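As a rough client-side sketch combining paging, field selection, and gzip (the host name is a placeholder; the parameter names follow the examples above), using the JDK 11 HttpClient:

```java
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.zip.GZIPInputStream;

public class LeanXmlClient {
    public static void main(String[] args) throws Exception {
        // Ask for page 0 with 3 items, only the 'name' field, gzip-compressed.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://api.example.com/items?page=0&size=3&select=name"))
                .header("Accept-Encoding", "gzip")
                .build();

        HttpResponse<InputStream> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofInputStream());

        // Unwrap the body only if the server actually compressed it.
        boolean gzipped = response.headers()
                .firstValue("Content-Encoding").orElse("").contains("gzip");
        try (InputStream body = gzipped
                ? new GZIPInputStream(response.body()) : response.body()) {
            System.out.println(new String(body.readAllBytes()));
        }
    }
}
```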
I'd like to offer three guidelines:
one is the observation that there are many SOAP Web services out there (especially built with .NET 2.0 "ASMX" technology) that send down their data transfer objects serialized in XML. There are of course many RESTful services that send down XML or JSON. XML serialization/deserialization is rarely the constraining factor.
one common cause of bottlenecks in Web services is an interface that encourages client applications to get data by making those thousands of sequential calls (there is a term for it: a chatty interface; see the sketch after these guidelines). This is what you should avoid when you design your Web service's interface, regardless of what four-letter acronym you decide to go ahead with.
one thing to remember about REST is that it (partially) stands for a transfer of state, which may be ill-suited to some operations where you don't want to transfer the state of a business object from the server to a client application. In those cases, a SOAP Web service (as suggested by your colleagues) is more appropriate; or perhaps a combination of SOAP and REST services, where the REST services would take care of operations where the state transfer is appropriate, and the SOAP services would implement the rest (pun unintended :-)) of the operations.
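To illustrate the chatty-interface point, here is a minimal sketch of the contrast (the service and type names are invented for the example):

```java
import java.util.List;

// Chatty: a client needing 1000 orders pays for 1000 network round trips.
interface ChattyOrderService {
    Order getOrder(long orderId);
}

// Coarse-grained: the same data in a single request/response cycle.
interface BulkOrderService {
    List<Order> getOrders(List<Long> orderIds);
}

record Order(long id, String status) { }
```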

BlazeDS vs SOAP and web services

Advantages of one over the other?
My Census RIA Benchmark was created to compare AMF (BlazeDS) and SOAP or plain old XML (RESTful). Unfortunately, SOAP is currently broken due to a JBoss 5.1 upgrade problem. However, you can try the XML example instead of SOAP. The SOAP one is (was) slower due to all the extra parsing, transforming, etc. Usually AMF is the best option. And if you need a third-party endpoint, you can always expose both SOAP and AMF over the same back-end services.
BTW: Due to a bug in Firefox, click the Output panel on the right to start the test.
BlazeDS (technically AMF) - pro: binary format, so smaller, faster to transmit; con: pretty much Flash/Flex/AS only.
SOAP / Web Services - pro: works across many languages; con: very verbose XML transmission with multiple layers; there are libraries in many languages to abstract this away, but regardless, a bigger "payload" gets sent every time.
REST - pro: lighter-weight web service, can use XML messages or just text/JSON, piggybacks on top of existing HTTP, so anything that can talk HTTP can use REST; con: still text transmission, and verbosity/complexity depends on the individual design, not the spec. Messages are custom, so you need to document the expected request/response formats and rely on developers to match them.