Can you do batch pull messages with Google Pub Sub? - google-cloud-platform

We are trying to optimize our application by doing batch pulling. Pub/Sub seems to allow asynchronously pulling one message at a time with different client nodes, but is there no way for a single node to do a batch pull from Pub/Sub?
Streaming Pull and the Pull RPC both seem to only allow the subscriber to consume one message at a time. Right now, it looks like we would have to pull one message at a time and do application-level batching.
Any insight would be helpful. I'm pretty new to GCP in general.

The underlying pull and streaming pull operations can receive batches of messages in the same response. The Cloud Pub/Sub client library, which uses streaming pull, breaks these batches apart and hands them to the provided user callback one at a time. Therefore, you need not worry about optimizing the underlying receiving of messages.
If your concern is optimizing the subscriber code at the application level, e.g., you want to batch writes into a database, then you have a couple of options:
Use Pull directly, which allows one to process all of the messages in a batch at a time. Note that using pull effectively requires many simultaneously outstanding pull requests and replacing requests that return with new requests immediately.
In your user callback, re-batch messages and once the batch reaches a desired size (or you've waited a sufficient amount of time to fill the batch), process all of the messages together and then ack them.
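
A minimal sketch of both options, not a definitive implementation. It assumes `subscriber` is a `google.cloud.pubsub_v1.SubscriberClient` (for option 1) and that callback messages expose `.data` and `.ack()` as the client library's messages do (for option 2). `handle_batch` is a hypothetical function standing in for your batched work, such as a bulk database write.

```python
import threading
import time


def pull_and_process(subscriber, subscription_path, handle_batch, max_messages=100):
    """Option 1: one synchronous pull, processed and acked as a single batch."""
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": max_messages}
    )
    received = list(response.received_messages)
    if not received:
        return 0
    handle_batch([r.message.data for r in received])
    # Ack only after the whole batch has been handled successfully.
    subscriber.acknowledge(
        request={
            "subscription": subscription_path,
            "ack_ids": [r.ack_id for r in received],
        }
    )
    return len(received)


class CallbackBatcher:
    """Option 2: re-batch messages handed to a streaming-pull callback."""

    def __init__(self, handle_batch, max_size=100, max_wait_seconds=5.0):
        self._handle_batch = handle_batch
        self._max_size = max_size
        self._max_wait = max_wait_seconds
        self._lock = threading.Lock()
        self._pending = []
        self._oldest = None  # monotonic time of the first pending message

    def callback(self, message):
        """Pass this method as the subscriber callback."""
        with self._lock:
            if not self._pending:
                self._oldest = time.monotonic()
            self._pending.append(message)
            if len(self._pending) < self._max_size:
                return
            batch = self._take()
        self._process(batch)

    def flush_if_stale(self):
        """Call periodically (e.g. from a timer thread) to flush partial batches."""
        with self._lock:
            if not self._pending:
                return
            if time.monotonic() - self._oldest < self._max_wait:
                return
            batch = self._take()
        self._process(batch)

    def _take(self):
        # Caller must hold the lock.
        batch, self._pending = self._pending, []
        return batch

    def _process(self, batch):
        self._handle_batch([m.data for m in batch])
        for m in batch:
            m.ack()
```

With option 2, keep `max_wait_seconds` short relative to the subscription's ack deadline, since messages stay unacknowledged while they wait in the batch.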

You can probably implement that by using Dataflow (Apache Beam). You can have a running streaming job where you group, window, and transform messages according to your requirements. The results of processing can be saved in batches or streamed further. It probably makes sense when the number of messages is really big.

Related

Is it possible to selectively read from AWS SQS?

I have a use-case. I want to read from SQS always, except when another event happens.
For instance, I have football news arriving in SQS as messages. I want to retrieve them always, except for times when live matches are happening.
Is there any way to keep reading unless another event is in progress?
I scrolled the docs and Stack Overflow, but I don't see a solution.
COMMENT: I have a small and weak service, and because of technical limitations I cannot increase its resources (memory, CPU, etc.), but I still want 2 "conflicting" flows to be in the service. They are both supposed to communicate with the same API, and I don't want them to send conflicting requests.
Is there a way to do it, or will I have to write a custom communicator with SQS?
You can't select which messages you want to read from SQS and which you'd rather not - there is no filtering in SQS.
If you have messages that need to be processed at all times and others that need to be processed only sometimes or in batches, you should put them in separate queues and read from them separately.
You don't say anything about the infrastructure that reads from the queue, but if it's a process on EC2, you could just stop it while live matches are happening and restart it later. SQS is built for asynchronous messaging and will store the messages for up to 14 days (depending on your configuration) until a consumer is available to read them.
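
If stopping the whole process is not practical, the same effect can be approximated inside the consumer. A rough sketch, assuming hypothetical helpers: `receive()` wraps the SQS receive call and returns a list of messages, `handle(msg)` processes and deletes one message, and `is_paused()` returns True while a live match is in progress.

```python
import time


def consume_unless_paused(receive, handle, is_paused, idle_seconds=30.0, max_polls=None):
    """Poll a queue, but skip polling while is_paused() returns True.

    receive()   -> list of messages (hypothetical wrapper around the SQS receive call)
    handle(msg) -> process one message and delete it from the queue
    is_paused() -> True while the conflicting event (a live match) is in progress
    """
    polls = 0
    handled = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if is_paused():
            # Leave messages in the queue; SQS retains them until we resume.
            time.sleep(idle_seconds)
            continue
        for msg in receive():
            handle(msg)
            handled += 1
    return handled
```

`max_polls` exists only so the loop can be bounded in tests; a long-running consumer would leave it as None.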

Is there a way I can consume Google PubSub message using synchronous pull in Apache Beam job

I have already gone through the client library provided by Google in the doc below. The client library only polls messages from Pub/Sub; it will not poll continuously unless we create an unbounded source connector.
https://cloud.google.com/pubsub/docs/pull#synchronous_pull
Since the source connector I'm trying to build is an unbounded source, I would need to take care of the checkpoint marker, implement a Pub/Sub reader and a Pub/Sub split source, and implement the ACK and NACK logic. I believe it would take a good amount of time to create my own unbounded source connector. Right now PubSubIO (the Beam API) only supports asynchronous pull. So is there any way I can just implement ACK and NACK logic on top of the PubSubIO API provided by Apache Beam? Is there an already developed API that is more suitable for this kind of use case?
With synchronous pull, you can acknowledge the intended message and NACK the consumed message in case of any parsing failure.
The feature that you expect doesn't exist, and should not exist.
With the current behavior (asynchronous pull), the message is ACKed as soon as it has been persisted, either in the worker or in the pipeline's sink, whichever happens first.
In your case, you want to ACK each message manually depending on how the pipeline processing goes. However, Pub/Sub gives you at most 10 minutes to ACK a message. Imagine building a pipeline with windows of 15 minutes or more: you would have to wait until each window finished processing before ACKing its messages, which is impossible.
The correct design, in your case, is to manage your errors in your pipeline.
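
One way to read "manage your errors in your pipeline" is a dead-letter pattern: a failing message is routed to a side output instead of being NACKed. The sketch below shows the idea in plain Python. In Beam the same logic would typically live in a DoFn with tagged outputs. The payload format (JSON with an `id` field) and both function names are assumptions for illustration.

```python
import json


def parse_or_dead_letter(raw):
    """Return ("ok", record) for valid input, or ("dead_letter", info) otherwise.

    Failures are routed to a side output instead of raising, so one bad
    message does not fail the messages processed alongside it.
    """
    try:
        record = json.loads(raw)
        if not isinstance(record, dict) or "id" not in record:
            raise ValueError("missing 'id'")
        return ("ok", record)
    except (ValueError, TypeError) as err:
        return ("dead_letter", {"raw": raw, "error": str(err)})


def split_results(raw_messages):
    """Separate parsed records from dead-lettered ones."""
    good, bad = [], []
    for raw in raw_messages:
        tag, value = parse_or_dead_letter(raw)
        (good if tag == "ok" else bad).append(value)
    return good, bad
```

The dead-letter output can then be written to its own sink (a table or another topic) for later inspection and replay.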

Multiple curl calls php issue

My problem: every 20 minutes I want to execute around 25,000 or more curl requests and save each response in a database. PHP does not handle this properly. Which AWS services are best suited for this, other than Lambda?
A common technique for processing large number of similar calls is:
Create an Amazon Simple Queue Service (SQS) queue and push each request into the queue as a separate message. In your case, the message would contain the URL that you wish to retrieve.
Create an AWS Lambda function that performs the download and stores the data in the database.
Configure the Lambda function to trigger off the SQS queue
This way, the SQS queue can trigger hundreds of Lambda functions running in parallel. The default concurrency limit is 1000 Lambda functions, but you can request an increase.
You would then need a separate process that, every 20 minutes, queries the database for the URLs and pushes the messages into the SQS queue.
The complete process is:
Schedule -> Lambda pusher -> messages into SQS -> Lambda workers -> database
The beauty of this design is that it can scale to handle large workloads and operates in parallel, rather than each curl request having to wait. If a message cannot be processed, Lambda will automatically try again. Repeated failures will send the message to a Dead Letter Queue for later analysis and reprocessing.
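
A minimal sketch of the worker side, assuming each SQS message body is JSON of the form `{"url": "..."}`. `fetch(url)` and `store(url, body)` are hypothetical helpers for the download and the database write. The handler returns a partial batch response, which only takes effect if `ReportBatchItemFailures` is enabled on the event source mapping.

```python
import json


def make_handler(fetch, store):
    """Build a Lambda handler; fetch(url) and store(url, body) are hypothetical helpers."""

    def handler(event, context):
        failures = []
        for record in event.get("Records", []):
            try:
                url = json.loads(record["body"])["url"]
                store(url, fetch(url))
            except Exception:
                # Report only this message as failed so SQS redelivers it alone.
                failures.append({"itemIdentifier": record["messageId"]})
        return {"batchItemFailures": failures}

    return handler
```

Reporting failures per message avoids reprocessing the successful downloads in the same batch when one URL fails.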
If you wish to perform 25,000 queries every 20 minutes (1,200 seconds), a query would need to complete roughly every 0.05 seconds (1,200 / 25,000 = 0.048). That's why it is important to work in parallel.
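
The arithmetic can be spelled out as follows. The 2-second average latency per request is an assumed figure for illustration only; substitute a measured value.

```python
import math

requests_per_cycle = 25_000
cycle_seconds = 20 * 60  # 1200 seconds

# Time budget per request if they ran strictly one after another.
budget_per_request = cycle_seconds / requests_per_cycle  # 0.048 s

# Assumed average latency of one curl call; replace with a measured value.
assumed_latency_seconds = 2.0

# Minimum number of parallel workers needed to finish within the cycle.
workers_needed = math.ceil(requests_per_cycle * assumed_latency_seconds / cycle_seconds)

print(f"{budget_per_request:.3f} s per request, {workers_needed} workers")
```

Under that assumption, about 42 concurrent workers keep up with the load, well within the default Lambda concurrency limit.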
By the way, if you are attempting to scrape this information from a single website, I suggest you investigate whether they provide an API otherwise you might be violating the Terms & Conditions of the website, which I strongly advise against.

What is the recommended way to fanout in SQS lambda environment?

I would like to send a push notification to users in my database in a lambda environment via SQS / messaging queue architecture, in order to do that
I would first need to query all users in my database with push notifications enabled.
loop over all of them
send a SQS event/message for each user.
let my sqs triggered lambda handle/send the push notification
Is there a better way to implement this to avoid querying a big number of users and/or looping over all the results to send a SQS message for each?
I would take a slightly different approach here, but similar.
Query the database for the users
Loop over the users
Send one message to SQS for each batch of users, and use the SendMessageBatch operation of SQS to send those messages. So batches of batches: each message in a batch carries several users to send to, not just one. This should increase performance because it requires fewer Lambda invocations.
Lambda handles SQS messages (probably more than one), and each SQS message results in sending many push notifications. In the case of Firebase I believe there is a way to send batches, which is even better. Even without that you can send several messages at once using a Promise.all type logic.
With this structure you can send a very large number of messages really quickly, and probably a lot cheaper. Imagine you need to send to 1M users. If each message carries 100 users and you send messages to SQS in batches of 10 (the maximum SendMessageBatch accepts per call), each call to SQS covers 1,000 users. That would mean 1,000 calls to SQS, far better than the 100K you'd have to make if you sent single-user messages in batches of 10.
On the receiving side, even if you throttled the SQS integration to 1 message per invocation you'd have 10,000 lambda invocations. If you assume even 1s per invocation, and 1000 concurrent invocations, it would take 10 seconds (likely less). If you send one message per user you'd have to make 1M lambda invocations. If you assume each invocation takes 100ms then you can send 10/second, so with 1000 concurrent executions it would take 100 seconds. In reality the numbers are probably even better than that for the batch version, especially if you don't limit it to 1 message at a time.
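
The "batches of batches" idea can be sketched like this. `send_batch(entries)` is a hypothetical wrapper, for example around boto3's `sqs.send_message_batch(QueueUrl=..., Entries=entries)`, and the message body format is an assumption. Note that SendMessageBatch accepts at most 10 entries per call, and that this sketch does not inspect the response for individually failed entries.

```python
import json


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fan_out(user_ids, send_batch, users_per_message=100, messages_per_call=10):
    """Send user IDs to SQS as batches of batches; returns the number of calls.

    send_batch(entries) is a hypothetical wrapper around the SQS batch send.
    """
    messages = [
        json.dumps({"user_ids": group})
        for group in chunks(user_ids, users_per_message)
    ]
    calls = 0
    for group in chunks(messages, messages_per_call):
        # Ids only need to be unique within a single batch request.
        entries = [{"Id": str(i), "MessageBody": body} for i, body in enumerate(group)]
        send_batch(entries)
        calls += 1
    return calls
```

For 1M users with these defaults, this produces 10,000 messages and 1,000 calls to SQS.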
Edit
Based on the comments the question seemed to be a bit more about the first part of the process. With that in mind I'd suggest the following options.
If you find yourself needing to address the same large groups repeatedly, most messaging services (Firebase and SNS for sure) support some sort of topic subscription model. Given that these are push notifications, you can subscribe a device to the topic in code. What this ultimately leads to is one message sent from your code to the messaging service. The service handles the rest. This is probably the preferred solution for anything that has mass recipients, especially if you can know the recipients up front. This even works for dynamic topics. For example, consider a situation where a person comments on a post. Any new comment on that post should send a message to everyone who has commented on that post. You can create a topic on the fly when the post is created, and add recipients to the topic as they comment. If a user wishes to stop receiving messages you can remove the user from the topic.
If you don't know the recipients up front, the batching approach from my original answer is still a solid solution. However, if you are concerned about Lambda timeouts on the first two steps (querying and looping), I'd modify it slightly. I would take advantage of AWS Step Functions and page the data in the Lambda. Lambda will tell you, via the context object supplied in the invocation, how much time is remaining. You can check that periodically to determine if you should exit the Lambda and pass the current paging information to the step function. The step function can pass that paging information back into the Lambda, which should be coded to accept the paging information as part of the request and continue from that point if supplied.
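
A rough sketch of the time-aware paging loop. `context.get_remaining_time_in_millis()` is the method the Python Lambda runtime provides on the context object. `fetch_page(cursor)`, `process_page(items)`, the `cursor`/`done` field names, and the 10-second safety margin are all assumptions for illustration.

```python
def make_paging_handler(fetch_page, process_page, safety_margin_ms=10_000):
    """Build a Lambda handler that pages through data until time runs low.

    fetch_page(cursor) -> (items, next_cursor or None); hypothetical helper.
    process_page(items) -> e.g. send the SQS batches for this page.
    """

    def handler(event, context):
        cursor = event.get("cursor")  # None on the first invocation
        while True:
            items, cursor = fetch_page(cursor)
            process_page(items)
            if cursor is None:
                return {"done": True, "cursor": None}
            if context.get_remaining_time_in_millis() < safety_margin_ms:
                # Hand the position back so the state machine can re-invoke us.
                return {"done": False, "cursor": cursor}

    return handler
```

In the state machine, a Choice state on `done` can loop back to the same Lambda task with the returned output as its input until paging completes.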
I would suggest an additional piece in your application architecture. I personally prefer to avoid using the primary database for heavy querying, assuming you have a large user base.
I suggest maintaining your user list in a search engine like Elasticsearch or CloudSearch, or in a simple table holding just the user list in Amazon DynamoDB, or creating a read replica of your DB.
To avoid confusion: use a search engine (first choice) or DynamoDB.
Querying a specialized read datastore avoids putting pressure on your primary database and won't affect other modules in operation. It is also much faster to query this way.
Step 2: loop over all of them
Step 3: batch send messages to SQS using its SendMessageBatch method like Jason is suggesting
Step 4: Based on your SQS setting, you may process multiple messages on your Lambda function

What are the possible use cases for Amazon SQS or any Queue Service?

So I have been trying to get hands-on with Amazon's AWS since my company's whole infrastructure is based on it.
One component I have never been able to understand properly is the Queue Service. I have searched Google quite a bit but I haven't been able to get a satisfactory answer. I think a cron job and a queue service are somewhat similar; correct me if I am wrong.
So what exactly does SQS do? As far as I understand, it stores simple messages to be used by other components in AWS to do tasks, and you can send messages to do that.
In this question, Can someone explain to me what Amazon Web Services components are used in a normal web service?; the answer mentioned they used SQS to queue tasks they want performed asynchronously. Why not just give a message back to the user & do the processing later on? Why wait for SQS to do its stuff?
Also, let's just say I have a web app which allows users to schedule some daily tasks. How would SQS fit in that?
No, cron and SQS are not similar. One (cron) schedules jobs while the other (SQS) stores messages. Queues are used to decouple message producers from message consumers. This is one way to architect for scale and reliability.
Let's say you've built a mobile voting app for a popular TV show and 5 to 25 million viewers are all voting at the same time (at the end of each performance). How are you going to handle that many votes in such a short space of time (say, 15 seconds)? You could build a significant web server tier and database back-end that could handle millions of messages per second but that would be expensive, you'd have to pre-provision for maximum expected workload, and it would not be resilient (for example to database failure or throttling). If few people voted then you're overpaying for infrastructure; if voting went crazy then votes could be lost.
A better solution would use some queuing mechanism that decoupled the voting apps from your service where the vote queue was highly scalable so it could happily absorb 10 messages/sec or 10 million messages/sec. Then you would have an application tier pulling messages from that queue as fast as possible to tally the votes.
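
The decoupling can be illustrated in miniature with Python's standard library. Here an in-process `queue.Queue` stands in for SQS and threads stand in for the application tier; this is only a sketch of the pattern, not of the SQS API.

```python
import queue
import threading
from collections import Counter


def run_vote_tally(votes, num_workers=4):
    """Producers enqueue votes quickly; workers drain the queue at their own pace."""
    q = queue.Queue()  # in-process stand-in for SQS
    tally = Counter()
    lock = threading.Lock()

    def worker():
        while True:
            vote = q.get()
            if vote is None:  # sentinel: no more work
                q.task_done()
                return
            with lock:
                tally[vote] += 1
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for vote in votes:  # the "voting app" side: enqueue and return immediately
        q.put(vote)
    for _ in threads:
        q.put(None)  # one sentinel per worker
    q.join()
    for t in threads:
        t.join()
    return tally
```

The producer never waits for a vote to be counted, so a burst of votes only lengthens the queue rather than overloading the workers.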
One thing I would add to @jarmod's excellent and succinct answer is that the size of the messages does matter. For example in AWS, the maximum size is just 256 KB unless you use the Extended Client Library, which increases the max to 2 GB. But note that it uses S3 as temporary storage.
In RabbitMQ the practical limit is around 100 KB. There is no hard-coded limit in RabbitMQ, but the system simply stalls more or less often. From personal experience, RabbitMQ can handle a steady stream of around 1 MB messages for about 1 - 2 hours non-stop, but then it will start to behave erratically, often becoming a zombie and you'll need to restart the process.
SQS is a great way to decouple services, especially when there is a lot of heavy-duty, batch-oriented processing required.
For example, let's say you have a service where people upload photos from their mobile devices. Once the photos are uploaded your service needs to do a bunch of processing of the photos, e.g. scaling them to different sizes, applying different filters, extracting metadata, etc.
One way to accomplish this would be to post a message to an SQS queue (or perhaps multiple messages to multiple queues, depending on how you architect it). The message(s) describe work that needs to be performed on the newly uploaded image file. Once the message has been written to SQS, your application can return a success to the user because you know that you have the image file and you have scheduled the processing.
In the background, you can have servers reading messages from SQS and performing the work specified in the messages. If one of those servers dies another one will pick up the message and perform the work. SQS guarantees that a message will be delivered eventually so you can be confident that the work will eventually get done.