Efficient Google PubSub Publishing - google-cloud-platform

The docs for PubSub state that the max payload after decoding is 10MB. My question is whether or not it is advantageous to compress the payload at the publisher before publishing to increase data throughput?
This can be especially helpful if the payload has a high compression ratio, as a JSON-formatted payload often does.

If you are looking for efficiency on PubSub I would first concentrate on using the best API, and that's the gRPC one. If you are using the client libraries, chances are high that they use gRPC anyway. Why gRPC?
gRPC is binary, so your payload doesn't need to go through hoops to be encoded.
REST needs to base64-encode the payload, which makes it bigger and adds an extra encoding step.
Second, I would try to batch the messages if possible, lowering the number of calls and eliminating some latency.
And last I would look at compression, but that means you need to explicitly decompress the payload at the subscriber, so your application code gets more complex. If all your workloads are on the Google Cloud Platform I wouldn't bother with compression. If your workload is outside of GCP you might consider it, but test it first.
An alternative to compression, if your schema is stable, is to use Protocol Buffers (ProtoBuf).
To conclude, I would:
Make sure you're using gRPC
Batch where possible
Only compress when needed and after benchmarking (implies extra logic in your application)
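To get a feel for the size effects discussed above before benchmarking for real, a minimal sketch using only the Python standard library could look like this. The helper name payload_sizes and the sample records are made up for illustration, and actual ratios depend entirely on your data.

```python
import base64
import json
import zlib


def payload_sizes(records):
    """Return byte sizes of a JSON payload in raw, base64 and zlib form."""
    raw = json.dumps(records).encode("utf-8")
    return {
        "raw": len(raw),
        # base64 (what a JSON/REST body needs for binary data) is about 4/3 of raw.
        "base64": len(base64.b64encode(raw)),
        # Compression tends to help most on repetitive text such as JSON.
        "zlib": len(zlib.compress(raw)),
    }


# Hypothetical, highly repetitive sample data.
sample = [{"device_id": i, "status": "ok", "temperature": 21.5} for i in range(1000)]
print(payload_sizes(sample))
```

If you do compress, the subscriber has to call zlib.decompress on the message data before parsing it, which is the extra application logic mentioned above.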

Related

should I do many smaller requests, or fewer but larger requests using s3 to pass data

I'm working on a project that requires data entries to be inserted into an RDS instance. We're using a serverless stack (cognito, api gateway, lambda, rds) to accomplish this. Our application requires a large amount of data to be read off of an embedded device, prior to insertion. That data must then be inserted immediately.
Based on our current setup, a single batch of data could be in excess of 60KB, but that's a worst case scenario.
Is there an accepted best practice or ideal way of sending/accessing data this large in my lambda function? As of right now, I'm planning on shipping it off with my API request. I've seen s3 mentioned as an intermediary for large quantities of data, but I'm not sure if it's really necessary for something like this.
In my experience it depends on a number of factors. What communication are you using? What is the drop rate? Do you experience corrupt packets? What is your embedded device?
If you can send the data in one go with a 97% success rate then I don't see a reason to split the data. If packets take a long time and connections can drop, then it's good to send multiple packets and resend the failed ones.
For the network, 60KB is a small amount of data. If you have a slow 2G embedded device then that's your bottleneck and you need to experiment to find the most efficient way to get the data out of it. A single stream of data would probably be the most efficient.
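If you do end up splitting the data and resending only the failed pieces, the idea can be sketched roughly as follows. This is an illustration only: split_into_chunks and send_with_retries are hypothetical helpers, and send stands for whatever transport call your device actually uses.

```python
def split_into_chunks(data: bytes, chunk_size: int):
    """Split a payload into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def send_with_retries(chunks, send, max_attempts=3):
    """Send every chunk, retrying only the ones that failed.

    `send(index, chunk)` is a caller-supplied (hypothetical) function that
    returns True on success and False on failure.
    Returns the indexes of chunks that still failed after all attempts.
    """
    pending = list(range(len(chunks)))
    for _ in range(max_attempts):
        # Keep only the chunks whose send attempt failed this round.
        pending = [i for i in pending if not send(i, chunks[i])]
        if not pending:
            break
    return pending
```

The receiver needs the chunk index to reassemble the payload in order, which is why send receives it alongside the bytes.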

Google Speech streaming recognition slow response time

What is the fastest expected response time of the Google Speech API with streaming audio data? I am sending an audio stream to the API and am receiving the interim results with a 2000ms delay, of which I was hoping I could drop to below 1000ms. I have tested different sampling rates and different voice models.
I'm afraid response time can't be measured or guaranteed because of the nature of the service. We don't know what is done under the hood; in fact, there is no SLA for response time, even though there is an SLA for availability.
Something that can help you is working on building a good request:
Reducing the frame size to around 100 milliseconds, for example, can give a good tradeoff between latency and efficiency.
Following Best Practices will help you to make a clean request so that the latency can be reduced.
You may want to check the following links on specific use cases to see how they addressed latency issues:
Realtime audio streaming to Google Speech engine
How to speed up google cloud speech
25s Latency in Google Speech to Text
If you really care about response time, you'd be better off running a Kaldi-based service on your own infrastructure. Something like https://github.com/alumae/kaldi-gstreamer-server together with https://github.com/Kaljurand/dictate.js
Google Cloud Speech itself works pretty fast, you can check how quick your microphone gets transcribed https://cloud.google.com/speech-to-text/.
You are probably experiencing a buffering issue on your side: the tool you are using may buffer data before sending it (buffer flush) to the underlying device (stream).
Find out how to decrease that tool's output buffer to a lower value, e.g. 2Kb, so data reaches the Node app and the Google service faster. Google recommends sending data in chunks that correspond to a 100ms buffer size.
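As a rough guide to what a 100ms buffer means in bytes, the arithmetic can be sketched like this. It assumes uncompressed 16-bit PCM audio; the function names and default values are illustrative, not part of any Google API.

```python
def chunk_size_bytes(sample_rate_hz, frame_ms=100, bytes_per_sample=2, channels=1):
    """Bytes of raw PCM audio that cover frame_ms milliseconds."""
    return sample_rate_hz * bytes_per_sample * channels * frame_ms // 1000


def iter_frames(audio: bytes, size: int):
    """Yield successive slices of at most `size` bytes from a raw audio buffer."""
    for start in range(0, len(audio), size):
        yield audio[start:start + size]


# 16 kHz, 16-bit, mono: 16000 * 2 * 1 * 100 / 1000 = 3200 bytes per 100 ms.
print(chunk_size_bytes(16000))
```

So for 16 kHz mono audio, a 2Kb output buffer is already below the 100ms mark, while a tool that buffers tens of kilobytes would add noticeable delay before anything is sent.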

Avro message for Google Cloud Pub-Sub?

What is the best data format for publishing to and consuming from Pub-Sub? I am looking at the Avro message format because of its binary encoding.
The use case is real-time microservice applications publishing Avro messages to Pub-Sub. Given that Avro is best suited to batching up messages (with a schema attached to the binary message) and then publishing them, would it be a suitable format for this microservice use case?
The Google Cloud documentation contains some JSON examples, but when looking for efficiency the main suggestion is to use the available client libraries, unless they don't meet your needs or you are running on the Google App Engine standard environment, in which case the use of the two APIs is suggested.
In fact, the most important factor for efficiency is using the gRPC API instead of the REST API (which the client libraries do by default). As mentioned here:
There are two major factors at work here: more efficient data encoding
and HTTP/2. gRPC keeps data in binary both in client memory and on the
wire by building on HTTP/2 and Protocol Buffers. This eliminates
processing and space required for string encoding schemes such as
Base64 or JSON. In addition, HTTP/2 itself makes things go faster with
multiplexed requests over a single connection and header compression.
I did not find any explicit mention of a data format anywhere. I suggest building the message in your preferred language, for example Python. Client library description here and sample code here.
Based on this StackOverflow post, you can optimize your PubSub system efficiently by:
Making sure you are using gRPC
Batching where possible, to reduce the number of calls and eliminate latency.
Only compressing when needed and after benchmarking (implies extra logic in your application)
Finally, if you intend to deploy a robust PubSub system, have a look at this post by Anusha Ramesh. She is Project Manager at Google now, and she suggests and elaborates on three tips:
Don't underestimate the importance of capacity planning.
Make sure your pub/sub system is fault-tolerant.
NSM: Never Stop Monitoring.
There isn't going to be one correct answer for the best format to use for the messages for all use cases. Avro is certainly a popular choice. Protocol buffers would be another possibility, as would Thrift. For Pub/Sub, the data is all just bytes and it is up to the publisher and the subscriber to determine the interpretation of this data. People have run comparisons on the different data formats, so you may want to make the decision based on your needs in terms of performance and message sizes.
Pub/Sub itself uses Protocol buffers for defining its data types. With regard to batching, the Cloud Pub/Sub client libraries do batching themselves for publish, so you don't necessarily have to worry about that on your own. You can control the batch settings to optimize throughput and latency based on your use case by calling, for example, setBatchSettings in the Publisher.Builder for Java (other languages have an equivalent as well). You may decide to do your own batching if you want to associate some metadata with a set of messages instead of with each individual message or you have very specific needs in terms of how messages are batched together. Otherwise, depending on the client library to do the batching is probably the correct decision.
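If you do decide to batch on your own, the core of the idea is just a buffer with count and size thresholds. The sketch below is an illustration under assumptions: SimpleBatcher and its parameters are hypothetical names, it is not how the Pub/Sub client library is implemented, and it leaves out the latency timer a real batcher would also need.

```python
class SimpleBatcher:
    """Collect messages and hand them over as one batch when a threshold is hit."""

    def __init__(self, flush, max_messages=100, max_bytes=1_000_000):
        self._flush = flush              # callable that receives a list of bytes
        self._max_messages = max_messages
        self._max_bytes = max_bytes
        self._pending = []
        self._pending_bytes = 0

    def add(self, message: bytes):
        # Flush first if this message would push the batch over the byte limit.
        if self._pending and self._pending_bytes + len(message) > self._max_bytes:
            self.flush()
        self._pending.append(message)
        self._pending_bytes += len(message)
        # Flush as soon as the message-count limit is reached.
        if len(self._pending) >= self._max_messages:
            self.flush()

    def flush(self):
        """Send whatever is pending; call this once more before shutting down."""
        if self._pending:
            self._flush(self._pending)
            self._pending = []
            self._pending_bytes = 0
```

Metadata that applies to the whole set of messages can then be attached once in the flush callback instead of on each individual message.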

Microservice architecture for ETL

I am redesigning a small monolith ETL software written in Python. I find a microservice architecture suitable as it will give us the flexibility to use different technologies if needed (Python is not the nicest language for enterprise software in my opinion). So if we had three microservices (call them Extract, Transform, Load), we could use Java for Transform microservice in the future.
The problem is, it is not feasible here to pass the result of a service call in an API response (say HTTP). The output from Extract is going to be gigabytes of data.
One idea is to call Extract and have it store the results in a database (which is really what that module is doing in the monolith, so easy to implement). In this case, the service will return only a yes/no response (was the process successful or not).
I was wondering if there were a better way to approach this. What would be a better architecture? Is what I'm proposing reasonable?
If your ETL process works on individual records (some parallelize-able units of computation), then there are a lot of options you could go with, here are a few:
Messaging System-based
You could base your processing around a messaging system, like Apache Kafka. It requires a careful setup and configuration (depending on durability, availability and scalability requirements of your specific use-cases), but may give you a better fit than a relational db.
In this case, the ETL steps would work completely independently, and just consume some topics, produce into some other topics. Those other topics are then picked up by the next step, etc. There would be no direct communication (calls) between the E/T/L steps.
It's a clean and easy to understand solution, with independent components.
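The shape of such a pipeline can be sketched with in-process queues standing in for topics. This is only an analogy: a real setup would use a Kafka client and separate processes, and the names stage and run_pipeline are made up for the example.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a stream


def stage(inbox, outbox, func):
    """Consume items from inbox, apply func, and publish results to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate end-of-stream to the next step
            return
        outbox.put(func(item))


def run_pipeline(records, transform=str.upper):
    extracted, transformed = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=stage, args=(extracted, transformed, transform))
    worker.start()
    # "Extract": publish records to the first topic.
    for record in records:
        extracted.put(record)
    extracted.put(_DONE)
    # "Load": consume the second topic until the stream ends.
    loaded = []
    while (item := transformed.get()) is not _DONE:
        loaded.append(item)
    worker.join()
    return loaded
```

Each step only knows the queue it reads from and the queue it writes to, which is the independence between the E/T/L steps described above.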
Off-the-shelf processing solutions
There are a couple of OTS solutions for data processing/computation and transformation: Apache Flink, Apache Storm, Apache Spark.
Although these solutions would obviously confine you to one particular technology, they may be better than building a similar system from scratch.
Non-persistent
If the actual data is streaming/record-based, and it is not required to persist the results between steps, you could just get away with long-polling the HTTP output of the previous step.
You say it is just too much data, but that data doesn't have to go to the database (if it's not required), and could just go to the next step instead. If the data is produced continuously (not everything in one batch), on the same local network, I don't think this would be a problem.
This would be technically very easy to do, very simple to validate and monitor.
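The record-at-a-time idea can be illustrated with plain generators. This is a sketch under assumptions: in practice each step would read a streamed HTTP response from the previous one, and the functions and record layout here are hypothetical.

```python
def extract(n):
    """Yield records one at a time instead of building the full result in memory."""
    for i in range(n):
        yield {"id": i, "value": i * 2}


def transform(records):
    """Transform each record as it arrives from the previous step."""
    for record in records:
        yield {**record, "value": record["value"] + 1}


def load(records):
    """Consume the stream; here it only counts and sums to keep the sketch small."""
    count = total = 0
    for record in records:
        count += 1
        total += record["value"]
    return count, total


print(load(transform(extract(3))))
```

Only one record is in flight at any time, so memory use stays flat no matter how much data Extract produces.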
I would suggest having a look at Apache Flink. It is very similar to the mappings in large enterprise tools like Informatica, Talend and DataStage, but it processes data at a smaller scale, repeatedly. It helps you compute and transform records on the fly as they arrive and then store/load them into a file or database.
Our current Flink infrastructure processes close to 28.5GB every 4 hours and it just works. In the initial days, we had to run both our daily batch and the Flink stream to make sure they produced consistent results; eventually most of the streams were left active and the daily batches were gradually retired.
Hope it helps someone.
There's nothing preventing you from having an SFTP server containing CSV files, or a database storing the results. You can do whatever makes sense. Using messaging to pass gigabytes of data, or streaming through HTTP, may or may not make sense for your case.
This is an interesting problem. The best solution for this could be Reactive Spring Boot. You can build your Extract service as a Reactive Spring Boot app and, instead of sending GBs of data in one response, stream the data to the required service.
Now you might be wondering whether streaming holds on to the worker thread. The answer is no. It works at the OS level and doesn't hold up any request thread to stream the results. That's the beauty of Reactive Spring Boot.
Go through this and explore:
https://spring.io/blog/2016/07/28/reactive-programming-with-spring-5-0-m1

Store streaming data - fast, cheap, reliable and good for batch consumption

I have a (spring-boot) web service that generates a json response for each request. This response, while returned to the querying user, also needs to be archived somewhere (so that we know what we responded with to the user).
The service needs to support 4,000 requests/second. As such, we need the archival method to be fast. The archived data would later be consumed by a map-reduce (batch) job.
I want to know which solution to use - Kafka, S3, or any other solution. The service has been deployed to AWS. So solutions within AWS are ideal.
The requirements are as follows:
Writes should be fast (4K req/s at least).
Writes should be non-blocking (so that the service response time is not affected).
Reads need not be fast but should be suitable for consumption by map-reduce jobs.
Data should be resilient to server crashes etc.
Should not be too expensive to write/store and read.
There is no data retirement plan, i.e. the data needs to persist until the end of time.
Which solutions do you recommend?
Some of your requirements like "should not be too expensive" are a bit vague. In the end, you are going to need to evaluate a service against all of your exact requirements yourself.
Given that qualification, I would look into streaming the data to Kinesis with the goal of archiving the data to S3. I recommend reading this blog post from AWS to get an idea of how to achieve this.
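To keep the write off the request path, the usual pattern is to enqueue the response and let a background worker ship it in batches. The sketch below only illustrates that pattern: BackgroundArchiver is a hypothetical class, sink stands for whatever call actually sends a batch to the stream, and an in-memory queue like this loses its contents if the server crashes, so it does not meet the resilience requirement by itself.

```python
import queue
import threading


class BackgroundArchiver:
    """Hand responses to a background thread that forwards them in batches."""

    def __init__(self, sink, batch_size=500):
        self._sink = sink                # callable that receives a list of records
        self._batch_size = batch_size
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def archive(self, response_json: str):
        """Called on the request path; it only enqueues, so it returns immediately."""
        self._queue.put(response_json)

    def close(self):
        """Signal the worker to stop and wait until remaining records are flushed."""
        self._queue.put(None)
        self._thread.join()

    def _run(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is None:             # shutdown signal
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._sink(batch)
                batch = []
        if batch:
            self._sink(batch)            # flush the final partial batch
```

The request handler calls archive right before returning the response, and close is called once on shutdown so the last partial batch is not dropped.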