XSLT-based transformation "service" on top of Apache Kafka

At the moment I am writing this question, there are not (yet) any questions tagged with both [apache-kafka] and [xslt].
I am a "classic" Message oriented middleware (BizTalk, TIBCO, ...) guy who is just discovering Kafka and its IMPRESSIVE performance figures!
And now I am wondering what the recommendation from the "Kafka community" is about how to transform the message payload between its publishing and its consumption...
Indeed, in my world of integration, the data structure (i.e. format) exposed by the producer is usually radically different from the data structure (format) expected by the consumer. As an example, I may have, as a producer, a mainframe application formatting data in a COBOL copybook structure while my front-end application wants to consume a modern JSON format.
[Update following the 1st answer from #morganw09dev]
I like the proposal from #morganw09dev, but I am a bit "annoyed" by the creation of consumer-specific topics. I see "Topic B" (see #morganw09dev's first answer) as a topic specific to my front-end application for consuming information from "Topic A". In other words, this specificity makes "Topic B" a queue ;-) That is fine, but I am wondering if such a design would not "hurt" a Kafka native ;-)
From my preliminary readings on Kafka, it is clear that I should also learn more about Storm... but, then, I have discovered Flink that, according to the graph at https://flink.apache.org/features.html, looks MUCH more performant than Storm, and now #morganw09dev has mentioned Samza! That means that I don't know where to start ;-)
Ultimately, I would like to code my transformations in XSLT, and, in the Java world, I think that Saxon is one of the leading XSLT processors. Do you know of any "integration" of Saxon with Storm, Flink or Samza? Or maybe my question does not make sense and I have to find another "way" to use Saxon with Kafka.
At the moment I am writing this comment, there are not (yet) any questions tagged with both [saxon] and any of the [apache-kafka], [apache-storm], [apache-flink] and/or [apache-samza].

Kafka itself cannot be used to transform data. It's only used for storing data to be consumed later.
One thought is to have a three-part architecture:
Kafka Topic A => Transformer => Kafka Topic B
Per your example, your producer pushes COBOL-related data to Kafka Topic A. Your Transformer reads from Topic A, does the necessary transformations, and then outputs JSON to Topic B. Once it is in Topic B, the front-end application can read it in its preferred format. If you go that route, the Transformer could be custom built using Kafka's default consumer and producer, or use a streaming framework such as Apache Samza or Apache Storm to help handle the messaging. Both Samza and Kafka were initially developed at LinkedIn and I believe work fairly naturally together. (Though I have never tried Samza.)
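For concreteness, here is a minimal sketch of such a Transformer written against Kafka's plain Java consumer/producer together with Saxon's s9api (the XSLT angle from the question). The topic names, stylesheet file and configuration are illustrative assumptions, not a definitive implementation:

import java.io.File;
import java.io.StringReader;
import java.io.StringWriter;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.XsltExecutable;
import net.sf.saxon.s9api.XsltTransformer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class XsltTransformerService {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "xslt-transformer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Compile the stylesheet once; load() hands out a fresh transformer per message.
        Processor saxon = new Processor(false); // false = open-source Saxon-HE
        XsltExecutable xslt = saxon.newXsltCompiler()
                .compile(new StreamSource(new File("copybook-to-json.xslt"))); // illustrative name

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            consumer.subscribe(Collections.singletonList("topic-a"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    StringWriter out = new StringWriter();
                    XsltTransformer t = xslt.load();
                    t.setSource(new StreamSource(new StringReader(record.value())));
                    t.setDestination(saxon.newSerializer(out));
                    t.transform();
                    producer.send(new ProducerRecord<>("topic-b", record.key(), out.toString()));
                }
            }
        }
    }
}

Scaling the Transformer out is then just a matter of running more instances under the same group.id, since Kafka distributes the topic's partitions among the members of a consumer group.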

Related

Does Akka obsolesce Camel?

My understanding of Akka is that it provides a model whereby multiple, isolated threads can communicate with each other in a highly concurrent fashion. It uses the "actor model", where each thread is an "actor" with a specific job to do. You can orchestrate which messages get passed to which actors under what conditions.
I've used Camel before, and to me, I feel like it's sort of lost its luster/utility now that Akka is so mature and well documented. As I understand it, Camel is about enterprise integration, that is, integrating multiple disparate systems together, usually in some kind of service bus fashion.
But think about it: if I am currently using Camel to:
Poll an FTP server for a file, and once found...
Transform the contents of that file into a POJO, then...
Send out an email if the POJO has a certain state, or
Persist the POJO to a database in all other cases
I can do the exact same thing with Akka: I can have one actor for each of those steps (poll FTP, transform file -> POJO, email or persist), wire them together, and let Akka handle all the asynchrony/concurrency.
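For illustration, here is a rough sketch of that wiring in Akka's classic Java API; the actors, messages and the routing rule are purely hypothetical stand-ins for the FTP/transform/email/persist steps:

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class Pipeline {
    // Transformer actor: turns raw file contents into a "POJO" and routes it by state.
    static class Transformer extends AbstractActor {
        private final ActorRef emailer, persister;
        Transformer(ActorRef emailer, ActorRef persister) {
            this.emailer = emailer;
            this.persister = persister;
        }
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                .match(String.class, raw -> {
                    String pojo = raw.toUpperCase(); // stands in for the file -> POJO step
                    if (pojo.contains("ALERT")) {
                        emailer.tell(pojo, getSelf());   // "send out an email" branch
                    } else {
                        persister.tell(pojo, getSelf()); // "persist to database" branch
                    }
                })
                .build();
        }
    }

    // Stand-in for the email and persistence actors.
    static class Printer extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                .match(String.class, s ->
                    System.out.println(getSelf().path().name() + " received: " + s))
                .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem sys = ActorSystem.create("pipeline");
        ActorRef emailer = sys.actorOf(Props.create(Printer.class), "emailer");
        ActorRef persister = sys.actorOf(Props.create(Printer.class), "persister");
        ActorRef transformer = sys.actorOf(Props.create(Transformer.class, emailer, persister), "transformer");
        transformer.tell("alert: low stock", ActorRef.noSender()); // the FTP poller would do this
    }
}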
So even though Akka is a concurrency framework (using actors), and even though Camel is about integration, I have to ask: can't Akka solve everything that Camel does? In other words: what use cases still exist to use Camel over Akka?
Akka and Camel are two different beasts (besides one is a mountain and one is an animal).
You mention it yourself:
Akka is a tool to implement the actor model, i.e. a message-based concurrency engine for potentially distributed systems.
Camel is a DSL/framework to implement Enterprise Integration Patterns.
There are a lot of things that would be pretty tough in Akka that are easy in Camel. Transactions, for sure. Then there is all the various transport logic and configuration, as Akka does not have an abstraction for an integration message. Then there are well-developed EIPs that are great in Camel: multicast, splitting, aggregation, XML/JSON handling, text-file parsing, HL7, to mention only a few. Of course, you can do it all in pure Java/Scala, but that's not the point. The point is to be able to describe the integration using a DSL, not to implement the underlying logic once again.
Nonetheless, Akka TOGETHER with Camel is pretty interesting, especially using Scala. Then you have EIPs on top of the actor semantics, which is pretty powerful in the right arena.
Example from akka.io
import akka.actor.Actor
import akka.camel.{ Producer, Oneway }
import akka.actor.{ ActorSystem, Props }
class Orders extends Actor with Producer with Oneway {
def endpointUri = "jms:queue:Orders"
}
val sys = ActorSystem("some-system")
val orders = sys.actorOf(Props[Orders])
orders ! <order amount="100" currency="PLN" itemId="12345"/>
A full sample/tutorial can be found at Typesafe.

WSDL-like code-generation for EbXML CPA+Schemas (not EbXML message itself)

Background:
A certain government-backed wholesaler of broadband services in Australia took feedback from discussion groups about how best to deliver B2B services to retail ISPs. They settled on EbXML.
Problem:
We're a very small shop (comparatively) that doesn't want to spend a lot of time going forward on integration. We're already familiar with integration of paired (inbound and outbound) SOAP services. In the past we've made use of WSDL-based code generation tooling (mostly with RPC/Literal services) where the WSDL has been descriptive and simple enough for the code generation tools to digest.
If at all possible we'd like to avoid having to hand-integrate the services with our business 'stack'. We know that the 'Interface Schemas' have been updated several times; we'd like to (as much as possible) do code and schema generation such that we can model our relationship with the supplier and the outbound/inbound messages as simple "queues" (tables) in an SQL database -- this will be our point of integration.
Starting with the outbound ("sender") SOAP web-service... it publishes a Document/Literal WSDL description of the service that seems to work correctly with various tools (e.g: wsdl2java, SoapUI) to generate the EBXML 'wrapper' messages. This says nothing about the 'payload' messages themselves which (at least for the MSH we've looked at) need to be multipart/related attachments with type of text/xml.
The 'payload' messages are defined in the provided CPA (something like bindings) and Schema (standard-looking XSD) files. The MSH itself doesn't seem to provide any external validation for the payload messages.
Question:
Is the same kind of code generation (as seen with WSDL-described SOAP web services) tooling available for EbXML CPAs/Schemas? (i.e: tools that can consume the CPA and 'payload' interface schemas and spit out java/c++/whatever, and/or something WSDL-like specific to the 'payload' interface messages and/or example messages).
If so, where do I look?
If not, are there any EbXML-specific problems that would prevent it? (I'd rather not get several weeks into a project to develop tools that are impossible to implement 'correctly' given the information at hand).
The MSH is payload agnostic. The payloads are not defined in the CPA, only the service and action names that are used to send the ebXML payloads are. The service and action are transmitted in the ebXML header, which is the first part of the multipart message. The payloads themselves can be xml, binary or a combination. Each payload is another part.
An MSH is responsible for tasks like:
sending (usually asynchronous) acknowledgements for received messages
resending messages if an acknowledgement has not been received within a certain amount of time
ignoring duplicate messages
assuring the order in which messages are delivered is correct
The actual behaviour is all configurable using the CPA, but a compliant MSH would support all of this.
This implies that an MSH has to keep an administration of the messages it has sent and received, which is usually done in a database.
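As a toy illustration of that bookkeeping (not a real MSH, and a real one would persist all of this in a database rather than in memory):

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class MshBookkeeping {
    private final Set<String> seenMessageIds = ConcurrentHashMap.newKeySet();
    private final Map<String, Long> awaitingAck = new ConcurrentHashMap<>();

    // Returns false for a duplicate, so the caller can ignore the message.
    public boolean recordIncoming(String messageId) {
        return seenMessageIds.add(messageId);
    }

    // Remember an outgoing message until its acknowledgement arrives.
    public void recordOutgoing(String messageId) {
        awaitingAck.put(messageId, System.currentTimeMillis());
    }

    public void acknowledge(String messageId) {
        awaitingAck.remove(messageId);
    }

    // Unacknowledged messages older than the retry interval should be resent.
    public boolean needsResend(String messageId, long retryIntervalMillis) {
        Long sentAt = awaitingAck.get(messageId);
        return sentAt != null
            && System.currentTimeMillis() - sentAt > retryIntervalMillis;
    }
}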
I would be surprised if you could find tooling to generate an MSH from a specific CPA. What you can find is software/components that implement a generic MSH and that can be configured with CPAs.
Assuming you don't want to build your own, look for an existing ebMS adapter. Configure it with your CPA(s). Then generate the payloads however you like and pass them to the ebMS adapter.
Google for "ebMS adapter" or "ebMS support".
Alas, it seems there's no specific tooling around the 'payload' messages for EbXML, specifically because EbXML doesn't regulate those messages.
However, the CPA (through its canSend and canRecv elements) acts somewhat like a SOAP WSDL, and the XSDs serve the same purpose as with SOAP, so it's not too far off.
There does exist software for turning types defined in XSDs into messages (merging in user-supplied data) at runtime, but per my question there's no obvious tooling for code generation around CPAs and related XSDs.
Furthermore, actually writing software to do this yourself is made more problematic by the difficulty of searching for the meta-grammar of XML Schema (i.e. the grammar which remains of XML Schema once XML tokenization is factored out). Basically, this was difficult because in the XML world the word "grammar" has a different meaning, which pollutes search results.
I was able to write a parser for the XML syntax snippets present at the top of each of the MSDN articles on XML Schema (elements listed down the left), which in turn allowed me to generate an LL(1) grammar for XML Schema which works on the pre-parsed AST of a given XSD.
From there I built a top-down parser from this meta-grammar which:
Follows <xsd:import>s and <xsd:include>s to resolve namespaces into further XSDs.
Recursively resolves message types in order to produce a 'flattened' type for each CPA message.
Generates packer/unpacker data structures for the message types, which allow generation of code in various languages, as well as serialisation to and parsing from validated 'payload' XML.
There are still various XML Schema restrictions, keys, and other constraints that my code generators don't know about, but support for these can be added in time.
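(As an aside: plain payload validation, as opposed to code generation, needs no special tooling at all; the JDK's built-in javax.xml.validation API is enough. The file names here are illustrative:)

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class PayloadValidation {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // The interface XSD supplied alongside the CPA; imports/includes are resolved automatically.
        Schema schema = factory.newSchema(new File("payload-interface.xsd"));
        Validator validator = schema.newValidator();
        // Throws SAXException with a location and message if the payload is invalid.
        validator.validate(new StreamSource(new File("payload.xml")));
        System.out.println("payload.xml is valid");
    }
}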
I'll update this answer with links to grammars (and possibly code -- depends on legals) as time permits. I'll leave the question as non-accepted for a while so that if someone miraculously finds a tool which makes much less work of the code generation, I'll accept an answer based on that.

How do you model a business workflow in ColdFusion?

Since there's no complete BPM framework/solution in ColdFusion as of yet, how would you model a workflow into a ColdFusion app that can be easily extensible and maintainable?
A business workflow is more than a flowchart that maps nicely into a programming language. For example:
How do you model a task X that is followed by multiple tasks Y0, Y1, Y2 happening in parallel, where Y0 is a human process (it needs to wait for input), Y1 is a web service call that might go wrong and might need automatic retry, and Y2 is an automated process, all followed by a task Z that should only be carried out once all the Y's are completed?
My thoughts...
Seems like I need to do a whole lot of storing / managing / keeping track of states, and frequent checking with cfschedule.
cfthread isn't going to help much since some tasks can take days (e.g. waiting for a user's confirmation).
I can already imagine the flow being spread around multiple UDFs, the DB, and CFCs.
Is there any open-source workflow engine in another language that we could maybe port over to CF?
Thank you for your brain power. :)
Study the Java Process Definition Language specification; JBoss has an execution engine for it (jBPM). Using this Java-based engine may be your easiest solution, and it solves many of the problems you've outlined.
If you intend to write your own, you will probably end up modelling states and transitions: vertices and edges in a directed graph. And these, as Ciaran Archer wrote, are the components of a state machine. The best persistence approach IMO is capturing versions of whatever data is being sent through the workflow via serialization, along with the current state and a history of transitions between states and changes to that data. The mechanism probably needs a way to keep track of who or what has responsibility for taking the next action against that workflow.
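To make that concrete, here is a bare-bones sketch of such a model in Java; the states, the allowed transitions and the history format are all hypothetical:

import java.util.ArrayList;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Workflow {
    enum State { SUBMITTED, AWAITING_APPROVAL, CALLING_SERVICE, DONE, FAILED }

    // The directed graph: states are vertices, allowed transitions are edges.
    private static final Map<State, Set<State>> EDGES = new EnumMap<>(State.class);
    static {
        EDGES.put(State.SUBMITTED, EnumSet.of(State.AWAITING_APPROVAL));
        EDGES.put(State.AWAITING_APPROVAL, EnumSet.of(State.CALLING_SERVICE, State.FAILED));
        EDGES.put(State.CALLING_SERVICE, EnumSet.of(State.CALLING_SERVICE, State.DONE, State.FAILED)); // self-edge = retry
        EDGES.put(State.DONE, EnumSet.noneOf(State.class));
        EDGES.put(State.FAILED, EnumSet.noneOf(State.class));
    }

    private State current = State.SUBMITTED;
    // In a real system, persist this history (plus snapshots of the data) to the database.
    private final List<String> history = new ArrayList<>();

    public void transition(State next, String actor) {
        if (!EDGES.get(current).contains(next)) {
            throw new IllegalStateException(current + " -> " + next + " is not an allowed transition");
        }
        history.add(current + " -> " + next + " by " + actor);
        current = next;
    }
}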
Based on your question, one thing to consider is whether or not you really need to represent parallel tasks in your solution; instead it might be possible to enqueue a set of messages and then specify a wait state for all of them to complete. Representing actual parallelism implies you are moving data simultaneously through several different processes, in which case, when they join again, you need an algorithm to resolve deltas, which is very much a non-trivial task.
In the context of ColdFusion and what you're trying to accomplish, a scheduled task may be necessary if the system you're writing needs to poll other systems. Consider WDDX as a serialization format. JSON, while seductively simple, I recall has some edge cases around numbers and dates that can cause you grief.
Finally see my answer to this question for some additional thoughts.
Off the top of my head I'm thinking about the State design pattern with state persisted to a database. Check out the Gumball Machine example in Head First Design Patterns.
Generally this will work if you have something (like a client / order / etc.) going through a number of changes of state.
Different things will happen to your object depending on what state you are in, and that might mean sitting in a database table waiting for a flag to be updated by a user manually.
In terms of other languages I know Grails has a workflow module available. I don't know if you would be better off porting to CF or jumping ship to Grails (right tool for the job and all that).
It's just a thought, hope it helps.

How do you document a tree-to-tree transformation in a human-readable format?

I need to document an application that serves as a facade for a set of webservices. The application accepts SOAP requests and transforms these requests into a format understandable by the underlying web service. There are several such services, each with its own interface. Some accept SOAP, some HTTP POST, some... other formats not mentioned in polite society.
I need to document how we map the fields from our SOAP calls to the fields for these other formats. Before everyone cries "XSLT" I must mention that the notation must be human-friendly. Ideally it would be something Excel-able.
Has anyone encountered this sort of problem before? How did you solve it? Is there a human-friendly notation for tree-to-tree transformations that can fit on a spreadsheet?
I've had to do just this. The way I did it was to just start writing, following the hierarchical structure.
I eventually would find that I was repeating myself. An example was that certain elements had a common set of attributes. I would pull the documentation of that common set up before the sections on the specific elements. Same thing with documentation of handling of specific simpleTypes.
Eventually, there was even some high level discussion on the overall flow and "philosophy" of the transformation. But I let it all happen bit by bit, fixing it as I became bored with repetition.
That said, I'm a developer, not a tech writer.
I haven't really found anything so far, but I've found pointers to many libraries that help transform objects of one type to another in Java. For reference, I'm listing the most promising ones here, all doing some kind of JavaBean to JavaBean conversion:
Transmorph
EZMorph
Dozer
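Of the three, Dozer's core API is probably the simplest to show; a minimal sketch (the two bean classes are hypothetical):

import org.dozer.DozerBeanMapper;

public class MappingExample {
    public static class SourceOrder {
        private String name;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public static class DestOrder {
        private String name;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        // Dozer copies same-named properties by default; field-level mappings go in XML config.
        DozerBeanMapper mapper = new DozerBeanMapper();
        SourceOrder src = new SourceOrder();
        src.setName("widgets");
        DestOrder dest = mapper.map(src, DestOrder.class);
        System.out.println(dest.getName()); // widgets
    }
}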

Atom Publishing Protocol in real life

I know that some big players have embraced it and are actually exposing some of their services in APP compliant way, already. However, I haven't found many other (smaller) players in this field. Do you know any web application/service that uses APP as its public API protocol? What is your own take on AtomPub? Do you have any practical experiences using it? What are its limitations and drawbacks? Do you prefer AtomPub as your REST style or do you have some other favourite one? And why?
I know these are many questions, not just one. The thing I'm interested in here is simple, though: how did the APP standard hit the market, and in particular, how is its adoption going among web developers?
The company that I work for is developing a lot of RESTful services. However, none of them exposes a public API (in the sense that all services are consumed internally by our own clients). The reason why we went for the REST architectural style was that we wanted our services to be easily consumable and, more importantly, to scale well.
From my own practical experience I have come to the conclusion that HTTP + the Atom syndication format is a good idea, provided you want to keep things flexible (in terms of different content models, attaching and extending metadata associated with payloads, uniform parsing, etc.). Atom ensures that everybody interprets the payload in a uniform manner without any scope for ambiguity.
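For reference, a minimal Atom entry wrapping an XML payload looks like this (all values are illustrative):

<entry xmlns="http://www.w3.org/2005/Atom">
  <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
  <title>Order 12345</title>
  <updated>2009-03-01T12:00:00Z</updated>
  <author><name>Example Corp</name></author>
  <content type="application/xml">
    <order xmlns="http://example.org/orders" amount="100" currency="PLN"/>
  </content>
</entry>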
However, if one does not have any such complex requirements, or does not foresee them, then the Atom format could be a bit of an overhead. (For instance, elements like author and title make more sense in the blogging/RSS world and may not make sense in your particular problem domain.)
Also if the goal is to just serialize data structures at one end and reconstruct it at the other end, then most web frameworks(like WCF) have custom formats which are more appealing.
So in my opinion AtomPub is good if you need flexibility in terms of data representation and if the playing field is huge, with different kinds of clients.
However if you have a good knowledge of potential clients and server/client usage patterns then custom formats might be a good idea.
If the client is browser based then formats like JSON are very appealing.
Hope this answers your question.
My own research so far:
WordPress has supported AtomPub as its API protocol since version 2.3
GData is probably the biggest shot in the AtomPub field so far
Habari - new promising blogging system promotes APP as one of its main features
BlogSvc.net - an AtomPub server and blog engine for the .NET platform, written in C#
Jangle - an open source project designed to facilitate API access to Library Systems
There's also mod_atom - an Apache module that stores entries in the filesystem.
Last time I checked (2007 or so), AtomPub was fairly complex to implement. While you can whip together something that emits valid Atom feeds over a lunch break, implementing AtomPub was a fairly big undertaking.
That might have changed thanks to better libraries and tools, but it might still be too complex to be implemented by smaller sites just because it's cool.
And the lack of killer AtomPub client applications puts little or no pressure on server operators to offer an AtomPub interface.