What is the best data format for publishing to and consuming from Pub/Sub? I am looking at the Avro message format because it is binary.
The use case is real-time microservice applications publishing Avro messages to Pub/Sub. Given that Avro is best suited to batching up messages (with a schema attached to the binary data) and then publishing them, is it a suitable format for a microservice use case like this?
The Google Cloud documentation contains some JSON examples, but for efficiency its main suggestion is to use the available client libraries, unless your needs aren't met by what the client libraries offer or you are running on the Google App Engine standard environment, in which case the use of two APIs is suggested.
In fact, the most important factor for efficiency is using the gRPC API instead of the REST API (which the client libraries do by default). As mentioned here:
There are two major factors at work here: more efficient data encoding
and HTTP/2. gRPC keeps data in binary both in client memory and on the
wire by building on HTTP/2 and Protocol Buffers. This eliminates
processing and space required for string encoding schemes such as
Base64 or JSON. In addition, HTTP/2 itself makes things go faster with
multiplexed requests over a single connection and header compression.
I did not find any explicit mention of a data format anywhere. I suggest you build the message in your preferred language, for example Python. The client library description is here and sample code is here.
Based on this StackOverflow post, you can make your PubSub system more efficient by:
Making sure you are using gRPC
Batching where possible, to reduce the number of calls and eliminate latency.
Only compressing when needed and after benchmarking (implies extra logic in your application)
Finally, if you intend to deploy a robust PubSub system, have a look at this post by Anusha Ramesh. She is a Project Manager at Google now, and she elaborates on three tips:
Don't underestimate the importance of capacity planning.
Make sure your pub/sub system is fault-tolerant.
NSM: Never Stop Monitoring.
There isn't going to be one correct answer for the best format to use for the messages for all use cases. Avro is certainly a popular choice. Protocol buffers would be another possibility, as would Thrift. For Pub/Sub, the data is all just bytes and it is up to the publisher and the subscriber to determine the interpretation of this data. People have run comparisons on the different data formats, so you may want to make the decision based on your needs in terms of performance and message sizes.
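As a rough illustration of "the data is all just bytes", the sketch below encodes the same record two ways using only the Python standard library: as JSON text and as a fixed binary layout via struct. The record fields and the layout string are made up for this example; a real system would more likely use Avro, Protocol Buffers, or Thrift with a shared schema.

```python
import json
import struct

# Hypothetical record; the field names are illustrative only.
record = {"sensor_id": 42, "timestamp": 1700000000, "temperature": 21.5}

# Text encoding: self-describing, but field names travel with every message.
json_bytes = json.dumps(record).encode("utf-8")

# Binary encoding: a fixed layout that publisher and subscriber agree on
# in advance ("<" little-endian, "I" uint32, "Q" uint64, "d" float64).
LAYOUT = "<IQd"
binary_bytes = struct.pack(
    LAYOUT, record["sensor_id"], record["timestamp"], record["temperature"]
)


def decode_binary(payload: bytes) -> dict:
    """Subscriber side: interpret the bytes using the agreed layout."""
    sensor_id, timestamp, temperature = struct.unpack(LAYOUT, payload)
    return {
        "sensor_id": sensor_id,
        "timestamp": timestamp,
        "temperature": temperature,
    }
```

Either byte string could be handed to a publisher as the message body; the trade-off is between a self-describing payload and a smaller one that depends on a shared schema.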
Pub/Sub itself uses Protocol buffers for defining its data types. With regard to batching, the Cloud Pub/Sub client libraries do batching themselves for publish, so you don't necessarily have to worry about that on your own. You can control the batch settings to optimize throughput and latency based on your use case by calling, for example, setBatchSettings in the Publisher.Builder for Java (other languages have an equivalent as well). You may decide to do your own batching if you want to associate some metadata with a set of messages instead of with each individual message or you have very specific needs in terms of how messages are batched together. Otherwise, depending on the client library to do the batching is probably the correct decision.
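To make the batching idea concrete, here is a minimal standard-library sketch of a batcher that flushes when a message-count or byte threshold is reached. It is not the client library's implementation: the class name, the send callback, and the default thresholds are assumptions for illustration, and real libraries typically also flush after a time delay, which is omitted here.

```python
from typing import Callable, List


class SimpleBatcher:
    """Collects messages and flushes on a count or byte threshold."""

    def __init__(
        self,
        send: Callable[[List[bytes]], None],
        max_messages: int = 100,
        max_bytes: int = 1_000_000,
    ) -> None:
        self._send = send  # hypothetical callback that performs one publish call
        self._max_messages = max_messages
        self._max_bytes = max_bytes
        self._pending: List[bytes] = []
        self._pending_bytes = 0

    def publish(self, data: bytes) -> None:
        # Flush first if adding this message would exceed the byte limit.
        if self._pending and self._pending_bytes + len(data) > self._max_bytes:
            self.flush()
        self._pending.append(data)
        self._pending_bytes += len(data)
        if len(self._pending) >= self._max_messages:
            self.flush()

    def flush(self) -> None:
        # Send whatever is pending as a single call, then start a new batch.
        if self._pending:
            self._send(self._pending)
            self._pending = []
            self._pending_bytes = 0
```

With max_messages=3, publishing seven messages and then calling flush() results in three send calls instead of seven.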
Related
I have been tasked to develop an interface layer that will be used internally by other developers as a means to access Google Cloud Storage (GCS). For me the process began by reading the online documentation. I am now at the point of making a decision as to which API we'll use. There are a couple of outstanding questions and that documentation mentions posting questions to SO, so here I am.
A quick tidbit of background seems in order. We are predominately a C house, though we do have means to build and invoke C++ methods from C. The internal users will write C code that invokes my C code, which in turn must provide access to GCS. So that's the high-level call stack. The question then becomes one of how do I provide that access in the best manner, with performance being the top criterion?
To answer that, I began by reading the online documentation regarding the different GCS options. There is an XML and a JSON REST API to be had. We have an internal HTTP mechanism that would enable the C code to invoke those JSON/XML API methods directly. The documentation states the following:
"...JSON API is RESTful...and is specifically intended to be used with the Google Cloud Client Libraries."
What are Google Cloud libraries, I wondered. From the documentation I gleaned that they are libraries written in a particular language that can be used to access GCS. The documentation seems to steer you toward using one of these client libraries rather than the JSON/XML APIs directly.
So the first question I had was "what does the 'specifically intended' business mean exactly?" After more reading, I arrived at the notion that these client libraries are an abstraction of the RESTful interfaces. These libraries invoke the JSON API methods, just as you could do yourself through the JSON API directly. Is that correct? If so, they seem like a convenient means of interacting with GCS. The documentation even states that the cloud libraries "...provide better performance and usability" versus the JSON HTTP interface.
In the end, I see two paths forward:
Invoke the JSON API directly from C using our HTTP mechanism
Use the bridge to invoke C++ methods from C. Those C++ methods then invoke methods in the client libraries, which ultimately (if the above is correct) invoke the JSON API.
Note that I've already written some C code in-house that uses the aforementioned internal HTTP mechanism to interact with the Apache WebHDFS API, which is also RESTful and for which there are no client libraries. Thus I could leverage a fair portion of that code to re-use in this new development. For me it boils down to a question of performance. The second option above seems rather circuitous in comparison to the first. Thus the first would seem to yield some performance improvement over the second. However Google mentions that the client libraries provide better performance than the RESTful APIs. How is that? The documentation states that the client libraries handle all of the low level communication with the server, including authentication. Is this part of the reason?
And so I am posing this question to those who are more experienced with GCS (and perhaps GCP for that matter): which route would provide better performance in your opinion? Invoking the JSON API directly or using the client libraries (C++ in our case)?
Thanks!
Disclaimer: I work on the C++ client libraries for Google Cloud, specifically the C++ GCS library.
However Google mentions that the client libraries provide better performance than the RESTful APIs. How is that?
The libraries are tuned to use the RPCs "well". They have good defaults for retry and backoff strategies, they pick the right RPC if two alternatives are available. They can recover efficiently from an interrupted download or upload. They may implement higher-level abstractions to improve performance.
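As an illustration of what "retry and backoff strategies" can look like, here is a small sketch of exponential backoff with jitter. It is not the library's actual policy: the function name, the default values, and the choice to retry only on ConnectionError are assumptions for this example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    initial_delay: float = 0.1,
    multiplier: float = 2.0,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` on ConnectionError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Jitter: sleep a random time up to the current delay cap.
            sleep(random.uniform(0, delay))
            delay = min(delay * multiplier, max_delay)
    raise RuntimeError("unreachable")
```

The sleep parameter is injectable so the behaviour can be exercised in tests without actually waiting.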
If you are asking "how can running their code to call an RPC be more efficient than me calling the same RPC", then the answer is "probably it cannot". I do not think that is the interesting problem. For example, if you want to upload a large file (say multiple GiB) the most efficient way in GCS is to split the file in chunks, use several parallel uploads to different GCS objects, then compose the objects, and delete the intermediate objects. That would be several times faster than uploading the file sequentially. Could you do the same? Sure! Do you want to? Maybe not.
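The chunk, upload in parallel, compose, and clean up sequence described above can be sketched as follows. This is a simulation only: fake_bucket, upload_object, and compose are in-memory stand-ins invented for this example, not GCS client calls. A real version would also have to respect the limit GCS places on how many source objects one compose request accepts, which may mean composing in stages.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# In-memory stand-in for a bucket; a real implementation would call GCS.
fake_bucket: Dict[str, bytes] = {}


def upload_object(name: str, data: bytes) -> str:
    """Hypothetical upload: store the bytes under the given object name."""
    fake_bucket[name] = data
    return name


def compose(target: str, sources: List[str]) -> None:
    """Hypothetical compose: concatenate the source objects, in order."""
    fake_bucket[target] = b"".join(fake_bucket[s] for s in sources)


def parallel_upload(name: str, data: bytes, chunk_size: int, workers: int = 4) -> None:
    # 1. Split the payload into fixed-size chunks.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    part_names = [f"{name}.part-{i:05d}" for i in range(len(chunks))]
    # 2. Upload the chunks concurrently as separate intermediate objects.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(upload_object, part_names, chunks))
    # 3. Compose the parts into the final object.
    compose(name, part_names)
    # 4. Delete the intermediate objects.
    for part in part_names:
        del fake_bucket[part]
```

After parallel_upload("big.bin", data, chunk_size=300) the fake bucket holds only "big.bin", with contents equal to data.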
We also worry about resuming an interrupted download correctly. And verifying the integrity of your data as you upload and download it (you can turn off those features, of course).
Finally, we can change the protocol under the hood if there is a more efficient way to do things. For example, we already switch between XML and JSON when both are equivalent and one seems to perform better than the other. We are also working on supporting gRPC instead of REST, which is showing good improvements over the REST baseline (unfortunately it is not GA yet, and I do not have an estimated date).
HTH
The docs for PubSub state that the max payload after decoding is 10MB. My question is whether or not it is advantageous to compress the payload at the publisher before publishing to increase data throughput?
This especially can be helpful if the payload has a high compression ratio like a json formatted payload.
If you are looking for efficiency on PubSub I would first concentrate on using the best API, and that's the gRPC one. If you are using the client libraries then the chance is high that you are using gRPC anyway. Why gRPC?
gRPC is binary, so your payload doesn't need to jump through hoops to be encoded
REST needs to base64-encode the payload, which makes it bigger and adds an extra encoding step
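The size cost of the base64 step is easy to check with the standard library. The sketch below encodes an arbitrary binary payload (random bytes, chosen only for illustration) and compares sizes; base64 emits 4 output bytes for every 3 input bytes, so the encoded form is about a third larger.

```python
import base64
import os

payload = os.urandom(3000)            # arbitrary binary message body
encoded = base64.b64encode(payload)   # what a JSON/REST request would carry

overhead = len(encoded) / len(payload)
print(f"raw={len(payload)} bytes, base64={len(encoded)} bytes, x{overhead:.2f}")
```

The subscriber side pays the matching decode step, which the binary gRPC path avoids.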
Second I would try to batch the message if possible, making the number of calls lower, eliminating some latency.
And last I would look at compression, but that means you need to specifically de-compress it at the subscriber. This means your application code will get more complex. If all your workloads are on the Google Cloud Platform I wouldn't bother with compression. If your workload is outside of GCP you might consider it, but testing would make sense.
An alternative to compression, if your schema is stable, is to use ProtoBuf.
To conclude, I would:
Make sure you're using gRPC
Batch where possible
Only compress when needed and after benchmarking (implies extra logic in your application)
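For the "after benchmarking" part, a quick measurement like the one below can show whether compression is worth the extra logic for your payloads. It is only a sketch: the JSON records are made up and deliberately repetitive, and zlib at level 6 is just one possible codec and setting.

```python
import json
import time
import zlib


def measure_compression(payload: bytes, level: int = 6) -> dict:
    """Compress once and report sizes, ratio, and elapsed time."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    return {
        "original_bytes": len(payload),
        "compressed_bytes": len(compressed),
        "ratio": len(payload) / len(compressed),
        "seconds": elapsed,
        "compressed": compressed,
    }


# Hypothetical, highly repetitive JSON payload.
records = [{"id": i, "status": "ok", "region": "europe-west1"} for i in range(1000)]
payload = json.dumps(records).encode("utf-8")
result = measure_compression(payload)

# The subscriber must reverse the step before parsing.
restored = json.loads(zlib.decompress(result["compressed"]).decode("utf-8"))
```

Running the same measurement on real production messages, rather than synthetic ones, is what should drive the decision.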
I am redesigning a small monolith ETL software written in Python. I find a microservice architecture suitable as it will give us the flexibility to use different technologies if needed (Python is not the nicest language for enterprise software in my opinion). So if we had three microservices (call them Extract, Transform, Load), we could use Java for Transform microservice in the future.
The problem is, it is not feasible here to pass the result of a service call in an API response (say HTTP). The output from Extract is going to be gigabytes of data.
One idea is to call Extract and have it store the results in a database (which is really what that module is doing in the monolith, so easy to implement). In this case, the service will return only a yes/no response (was the process successful or not).
I was wondering if there were a better way to approach this. What would be a better architecture? Is what I'm proposing reasonable?
If your ETL process works on individual records (some parallelizable units of computation), then there are a lot of options you could go with. Here are a few:
Messaging System-based
You could base your processing around a messaging system, like Apache Kafka. It requires a careful setup and configuration (depending on durability, availability and scalability requirements of your specific use-cases), but may give you a better fit than a relational db.
In this case, the ETL steps would work completely independently, and just consume some topics, produce into some other topics. Those other topics are then picked up by the next step, etc. There would be no direct communication (calls) between the E/T/L steps.
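The topic-based layout can be sketched in-process with queues standing in for topics. This is not Kafka: queue.Queue, the STOP sentinel, and the record shape are assumptions used only to show that each step knows about its input and output topics and never calls another step directly.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the stream


def extract(out_topic: queue.Queue) -> None:
    for i in range(5):  # hypothetical source records
        out_topic.put({"id": i, "value": i * 10})
    out_topic.put(STOP)


def transform(in_topic: queue.Queue, out_topic: queue.Queue) -> None:
    while (record := in_topic.get()) is not STOP:
        out_topic.put({**record, "value": record["value"] + 1})
    out_topic.put(STOP)


def load(in_topic: queue.Queue, sink: list) -> None:
    while (record := in_topic.get()) is not STOP:
        sink.append(record)


raw, clean, sink = queue.Queue(), queue.Queue(), []
steps = [
    threading.Thread(target=extract, args=(raw,)),
    threading.Thread(target=transform, args=(raw, clean)),
    threading.Thread(target=load, args=(clean, sink)),
]
for t in steps:
    t.start()
for t in steps:
    t.join()
```

With a real broker each step would be a separate process or service, and durability and replay would come from the broker rather than from memory.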
It's a clean and easy to understand solution, with independent components.
Off-the-shelf processing solutions
There are a couple of OTS solutions for data processing/computation and transformation: Apache Flink, Apache Storm, Apache Spark.
Although these solutions would obviously confine you to one particular technology, they may be better than building a similar system from scratch.
Non-persistent
If the actual data is streaming/record-based, and it is not required to persist the results between steps, you could just get away with long-polling the HTTP output of the previous step.
You say it is just too much data, but that data doesn't have to go to the database (if it's not required), and could just go to the next step instead. If the data is produced continuously (not everything in one batch), on the same local network, I don't think this would be a problem.
This would be technically very easy to do, very simple to validate and monitor.
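The non-persistent variant can be pictured with generators, where each step pulls records from the previous one and nothing is materialised in between. The function names and record fields are hypothetical, and a real setup would stream over HTTP between services rather than within one process.

```python
from typing import Iterable, Iterator


def extract_records(n: int) -> Iterator[dict]:
    """Produce records one at a time instead of building a full result set."""
    for i in range(n):
        yield {"id": i, "name": f"item-{i}"}


def transform_records(records: Iterable[dict]) -> Iterator[dict]:
    """Transform each record as it arrives from the previous step."""
    for r in records:
        yield {**r, "name": r["name"].upper()}


def load_records(records: Iterable[dict]) -> int:
    """Consume the stream; here we only count what would be loaded."""
    count = 0
    for _ in records:
        count += 1
    return count


loaded = load_records(transform_records(extract_records(1000)))
```

Only one record is in flight at a time, so memory use stays flat regardless of how much data Extract produces.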
I would suggest you have a look at Apache Flink. It is very similar to the mappings in large enterprise tools like Informatica, Talend and DataStage, but it processes data at a smaller scale, repetitively. It helps you compute and transform the data on the fly, as it arrives, and then store/load it into a file or database.
Our current Flink infrastructure processes close to 28.5 GB every 4 hours and it just works. In the initial days, we had to run both our daily batch and the Flink stream to ensure they produced consistent results; eventually most of the streams were left active and the daily batches were gradually retired.
Hope it helps someone.
There's nothing preventing you from having an SFTP server containing CSV files, or a database storing the results. You can do whatever makes sense. Using messaging to pass gigabytes of data, or streaming through HTTP, may or may not make sense for your case.
This is an interesting problem. The best solution for this could be Reactive Spring Boot. You can build your Extract service as a Reactive Spring Boot app and, instead of sending GBs of data in one response, stream the data to the service that needs it.
Now you might be wondering whether streaming would tie up the worker thread. The answer is no. It works at the OS level and doesn't hold up any request thread to stream the results. That's the beauty of Reactive Spring Boot.
Go through this and explore
https://spring.io/blog/2016/07/28/reactive-programming-with-spring-5-0-m1
I am not sure if this is the appropriate place for this, but I have come up with a "conceptual" modular design architecture that separates the logic out into individual services to allow an almost plug and play type scenario whereby there are no dependencies between the services. Think a list of features and only enabling the ones that you want.
To facilitate this I realise that I will need some type of middleware that will connect these all together and control the flow of data. However I am not sure of the specifics around what would be appropriate to achieve this.
I plan on implementing the services using .NET soap based services, so is this a case of using something like Tibco?
Any suggestions around what would be most appropriate or even where to start looking would be great.
If the above description didn't make sense hopefully this image is a bit clearer in describing the relationship between the services.
Thanks.
Depending on your needs you could use NServiceBus (http://particular.net/nservicebus). NServiceBus is communication middleware that can be used with different types of queuing systems like MSMQ, RabbitMQ and others. It is essentially a service bus that is very developer friendly and focused. It not only facilitates asynchronous, message-based distributed communication but also offers:
Publish / Subscribe that is transport agnostic using automatic registration
Transports: Can be used with MSMQ, RabbitMQ, Azure Storage Queues, etc.
Security: Supports encryption of messages
BLOBs: Has support for storing large message payloads transparently with the data bus, to allow communicating messages larger than the transport allows.
Scalability: Out and upscaling to increase throughput
Reliability: Deduplication, idempotent processing without having distributed transactions.
Orchestration: Sagas can help in controlling message flow and routing.
Exception handling: Exceptions get automatically retried in two different stages.
Monitoring: Tools like Service Pulse, Service Insight and Windows Performance monitors to monitor performance and errors. See what errors occurred and
Serialization: Can use different serializers that support formats like xml, json, binary
Open Source: All source code is available
Auditing: Can move all processed messages to an audit queue for archiving or audit requirements
Community: Has a large community of developers that are active on the forums but also supply additional transports, serializers and other features.
I must mention that I work for Particular, but also that there are other options to consider. NServiceBus does not use SOAP for message exchange but a lightweight message in a format of your choice, as mentioned in the serialization bullet. It can integrate with services that require SOAP. It can expose a service (endpoint) as a WCF service for easy integration, and it can use SOAP from within code to call external SOAP services using the features that the .NET Framework and Visual Studio provide.
Good luck in choosing the right technology for your project.
My next project involves the creation of a data API within an enterprise framework. The data will be consumed by several applications running on different software platforms. While my colleagues generally favour SOAP, I would like to use a RESTful architecture.
Most of the applications will only need a few objects at every call. Other applications will however sometimes need to make several sequential calls each involving thousands of records. I'm concerned about performance. Serialization/deserialization & network usage are where I fear to find a bottleneck. If each request involves a large delay, all of the enterprise's applications will be sluggish.
Are my fears realistic? Will serialization to a voluminous format like XML or JSON be a problem? Are there alternatives?
In the past, we've had to do these large data transfers using a "flatter"/leaner file format such as CSV for performance. How can I hope to achieve the performance I need using a web service?
While I'd prefer replies specific to REST, I'm interested in hearing how SOAP users might deal with this as well.
One advantage of REST is that you are free to use whatever media type you like. Why not continue to use text/csv? You could also enable HTTP compression to further reduce bandwidth consumption.
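To get a feel for the numbers, the sketch below serialises the same rows as JSON and as CSV and measures each with and without gzip, which is what HTTP compression would typically apply. The rows are invented for illustration, so the sizes say nothing about your data; the point is only how to measure.

```python
import csv
import gzip
import io
import json

# Hypothetical result set; field names and values are illustrative only.
rows = [{"id": i, "name": f"customer-{i}", "balance": i * 1.5} for i in range(1000)]


def to_json_bytes(data: list) -> bytes:
    return json.dumps(data).encode("utf-8")


def to_csv_bytes(data: list) -> bytes:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "name", "balance"])
    writer.writeheader()  # field names appear once, not once per record
    writer.writerows(data)
    return buf.getvalue().encode("utf-8")


json_body = to_json_bytes(rows)
csv_body = to_csv_bytes(rows)
sizes = {
    "json": len(json_body),
    "csv": len(csv_body),
    "json_gzip": len(gzip.compress(json_body)),
    "csv_gzip": len(gzip.compress(csv_body)),
}
print(sizes)
```

CSV avoids repeating field names per record, and gzip shrinks both formats, so the gap between them after compression is usually smaller than before.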
REST services are great for taking advantage of all different kinds of data formats. Whatever format fits your scenario best.
We offer both XML and JSON. The rendering time you mention really can be an issue. On the server side we have JAXB, whose standard Sun implementation is somewhat slow when it comes to marshalling XML. XML has the disadvantage of verbosity, but it is also good for interoperability and has schemas and explicit versioning.
We compensated for the verbosity in several ways (especially by limiting the result set):
In case you have a container with items in it, offer paging in your xml response (both page-size and page-number, e.g. /items?page=0&size=3) . The client can itself reduce the size by reducing the page-size.
Offer collapsing elements, for instance several clients are only interested in one data field of your whole item. Do this with a parameter (e.g. /items?select=name), then only the nested element 'name' is included inline of your item element. This dramatically decreases size.
Generally give the clients the power to limit the result set. They will definitely use it, because it also speeds up response time on their side :)
Also use compression; it reduces verbose XML enormously (in our case the payload got 10 times smaller). On the client side you can request it with the header 'Accept-Encoding: gzip'. If you use Apache, the server configuration is also straightforward.
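The paging and field-selection ideas from the list above can be sketched as a plain function that mirrors query parameters such as /items?page=0&size=3 and /items?select=name. The ITEMS data and the get_items name are hypothetical; a real service would do the slicing in the database query rather than in memory.

```python
from typing import Dict, List, Optional

# Hypothetical collection backing the /items resource.
ITEMS = [{"id": i, "name": f"item-{i}", "description": "x" * 50} for i in range(10)]


def get_items(page: int = 0, size: int = 3, select: Optional[str] = None) -> List[Dict]:
    """Return one page of items, optionally collapsed to a single field."""
    start = page * size
    window = ITEMS[start:start + size]
    if select is not None:
        # Keep only the requested field of each item.
        return [{select: item[select]} for item in window]
    return window


first_page_names = get_items(page=0, size=3, select="name")
```

Here first_page_names holds three single-field entries, which is far less data than three full items with their descriptions.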
I'd like to offer three guidelines:
one is the observation that there are many SOAP Web services out there (especially built with .NET 2.0 "ASMX" technology) that send down their data transfer objects serialized in XML. There are of course many RESTful services that send down XML or JSON. XML serialization/deserialization is rarely the constraining factor.
one common cause of bottlenecks in Web services is an interface that encourages client applications to get data by making those thousands of sequential calls (there is a term for it: a chatty interface). This is what you should avoid when you design your Web service's interface, regardless of what four-letter acronym you decide to go ahead with.
one thing to remember about REST is that it (partially) stands for a transfer of state, which may be ill-suited to some operations where you don't want to transfer the state of a business object from the server to a client application. In those cases, a SOAP Web service (as suggested by your colleagues) is more appropriate; or perhaps a combination of SOAP and REST services, where the REST services would take care of operations where the state transfer is appropriate, and the SOAP services would implement the rest (pun unintended :-)) of the operations.