Google Pub/Sub Partition Id - google-cloud-platform

The Google Cloud Pub/Sub documentation about load balancing in pull delivery says:
Multiple subscribers can make pull calls to the same "shared"
subscription. Each subscriber will receive a subset of the
messages.
My concern is about the last phrase. Can I decide the way the topic is partitioned? In other words, can I decide how the subsets are grouped?
For instance, in the AWS Kinesis service I can choose the partition key of the stream, in my case the user id. As a consequence, a consumer receives all the messages of a subset of users; or, from another point of view, all the messages of one user are consumed by the same consumer. The message stream of one user is not distributed between different consumers.
I want to do this kind of partition with the Google Pub/Sub service. Is that possible?

There is currently no way for the subscriber to specify a partition or set of keys for which they should receive messages in Google Cloud Pub/Sub, no. The only way to set up this partition would be to use separate topics.
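For illustration, a rough sketch of that per-topic partitioning with the Java client library (google-cloud-pubsub); the project id, topic naming scheme, and hash-based routing are all assumptions on my part, not anything Pub/Sub provides out of the box:

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PartitionedPublisher {

  private static final String PROJECT_ID = "my-project"; // hypothetical project id
  private static final int NUM_PARTITIONS = 4;           // one topic per "partition"

  // Route each message to a topic chosen by hashing the user id, so that all
  // messages for one user land on the same topic (and thus the same subscriber pool).
  public static void publish(String userId, String payload) throws Exception {
    int partition = Math.floorMod(userId.hashCode(), NUM_PARTITIONS);
    TopicName topic = TopicName.of(PROJECT_ID, "user-events-" + partition);

    Publisher publisher = Publisher.newBuilder(topic).build();
    try {
      PubsubMessage message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8(payload))
          .putAttributes("userId", userId)
          .build();
      publisher.publish(message).get(); // block until the publish succeeds
    } finally {
      publisher.shutdown();
    }
  }
}
```

Each downstream worker then subscribes only to the topics for the partitions it owns, which gives the "all messages of one user go to the same consumer" behavior. (In real code you would cache one Publisher per topic rather than building one per call.)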

Related

Routing granular messages from Amazon SNS to SQS with filtering

I am trying to build a system architecture on top of AWS infrastructure where a message is published from a data processor and sent to multiple subscribers (clients). This message would contain information that some - but not all - clients would want to receive.
Very similar question > Routing messages from Amazon SNS to SQS with filtering
To do this message filtering I have turned to the message FilterPolicy functionality provided by SNS, using one topic. The system is now reaching a point where clients have more granular and specific filtering rules, so I am hitting the filtering limits of AWS SNS.
See more about the SNS filter policy here https://docs.aws.amazon.com/sns/latest/dg/sns-subscription-filter-policies.html (section “Filter policy constraints”)
One example of my limitation is the number of filter values in a policy; the link above states a limit of 150 values. Right now my subscribers are interested in receiving messages with a specific attribute value, but this one attribute could have several hundred or even thousands of different values.
I also cannot group these attributes, since they represent non-sequential identities.
I am seeking guidance on an architectural solution that would allow me to keep using AWS SNS. I am limited to AWS infrastructure services, so no RabbitMQ for me.

Does Google Cloud (GCP) Pub/Sub support a feature similar to Consumer Groups as in Kafka?

Trying to decide between Google Cloud (GCP) Pub/Sub and a Managed Kafka Service.
In a recent update, Pub/Sub added support for replaying messages that were processed before, which is a welcome change.
One feature I am not able to find in their documentation is whether we can have something similar to Kafka's Consumer Groups, i.e. groups of subscribers each processing data from the same topic, with the ability to re-process the data from the beginning for one subscriber group (consumer group) while the others are not affected by it.
eg:
Let's say you have a topic called StockTicks
And you have two consumer groups
CG1: with two consumers
CG2: With another two consumers
In Kafka I can read messages independently between these groups, but can I do the same thing with Pub/Sub?
Kafka also allows you to replay messages from the beginning. Can I do the same with Pub/Sub? I am OK if I can't replay messages that were published before the CG was created, but can I replay messages that were published after a CG and its subscribers were created?
Cloud Pub/Sub's equivalent of a Kafka consumer group is a subscription, and a subscriber is the equivalent of a consumer. This answer spells out the relationship between subscriptions and subscribers in a little more detail.
Your example in Cloud Pub/Sub terms would have a single topic, StockTicks, with two subscriptions (call them CG1 and CG2). You would bring up four subscribers, two that fetch messages for the subscription CG1 and two that fetch messages for the CG2 subscription. Acking and replay would be independent on CG1 and CG2, so if you were to seek back on CG1, it would not affect the delivery of messages to subscribers for CG2 at all.
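A minimal sketch of that setup with the Java admin client (google-cloud-pubsub); the project id and ack deadline are placeholders, and the exact createSubscription overloads vary a bit between client versions:

```java
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PushConfig;
import com.google.pubsub.v1.TopicName;

public class CreateConsumerGroups {
  public static void main(String[] args) throws Exception {
    String projectId = "my-project"; // hypothetical project id
    TopicName topic = TopicName.of(projectId, "StockTicks");

    try (SubscriptionAdminClient admin = SubscriptionAdminClient.create()) {
      // Each subscription plays the role of a Kafka consumer group: messages are
      // load-balanced across its subscribers, and acking/seeking on CG1 has no
      // effect on CG2.
      for (String group : new String[] {"CG1", "CG2"}) {
        admin.createSubscription(
            ProjectSubscriptionName.of(projectId, group),
            topic,
            PushConfig.getDefaultInstance(), // empty push config => pull subscription
            60);                             // ack deadline in seconds
      }
    }
  }
}
```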
Keep in mind with Cloud Pub/Sub that only messages published after a subscription is successfully created will be delivered to subscribers on that subscription. Therefore, if you create a new subscription, you won't get all of the messages published since the beginning of time; you will only get messages published from that point on.
If you seek back on a subscription, you can only get up to 7 days of messages to replay (assuming the subscription was created at least 7 days ago) since that is the max retention time for messages in Cloud Pub/Sub.

Load balancing subscribers - How is it done?

The GCP Pub/Sub docs mention load balancing for pull mode, but it's not clear how to use it.
Neither the Subscription nor the Subscriber builder API seems to have a method to turn this on.
Question: How do I configure load balancing across multiple Pub/Sub subscribers?
Background:
We use multiple subscribers for the same topic, to achieve resilience.
(Multiple endpoints can be queried for data from the same data store).
The subscriptions persist the messages, but without distribution all subscriptions get all messages, leading to data duplication in our data store. Perhaps this background will give ideas for another way to achieve resilience.
Things we have thought of ourselves:
Use multiple data stores...
Mark the messages, and do some sort of optimistic locking/versioning of rows in the data store.
Technologies:
GCP pubsub
Spring Boot / Data
JPA
Postgres DB.
If all subscribers are receiving all messages, then it is likely that you are using different subscriptions for each subscriber. Load balancing happens when you have different subscribers all pulling from the same subscription. From the subscriber guide description of load balancing: "Multiple subscribers can make pull calls to the same "shared" subscription. Each subscriber will receive a subset of the messages" (Emphasis mine). When you use different subscriptions, you get fanout, where all subscribers receive all messages.
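In other words, create a single shared subscription per data store and have all of that store's subscriber instances pull from it. A minimal sketch with the Java client (google-cloud-pubsub); the project and subscription names are placeholders:

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class SharedSubscriptionWorker {
  public static void main(String[] args) {
    // Run this same process on several instances; because they all attach to the
    // ONE shared subscription, Pub/Sub load-balances messages across them instead
    // of delivering every message to every instance.
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "shared-subscription"); // hypothetical names

    MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
      System.out.println("Got message: " + message.getData().toStringUtf8());
      consumer.ack(); // ack so no other subscriber on this subscription re-receives it
    };

    Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
    subscriber.startAsync().awaitRunning();
    subscriber.awaitTerminated(); // keep the worker alive while messages stream in
  }
}
```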

How to handle AWS IoT streaming data in a relational database

General information: I am designing a solution for an IoT problem in which data is continuously streaming from a PLC (programmable logic controller). The PLC has different tags; these tags represent telemetry data, and data streams continuously from them. Each device has alarm tags whose value is 0 or 1, where 1 means there is an equipment failure.
Problem statement: I have to read the alarm tags and raise a ticket if any alarm tag value is 1. I also have to stream these alerts to a dashboard and maintain the ticket history, so that the operator can update the ticket status.
My solution: I am using AWS IoT and landing the data in DynamoDB. I then use a DynamoDB stream to check whether a new item was added to the alarm table; if so, it triggers a Lambda function (which I have implemented in Java), and the Lambda function opens a new ticket in a relational database using Hibernate.
Problem with my approach: the AWS IoT data streams into the alarm table at a very fast rate, and this opens a lot of database connections faster than they can be closed, which is taking my relational database down.
Please let me know if there is a better design approach I can adopt.
Use Amazon Kinesis Analytics to process streaming data. DynamoDB isn't suitable for this.
Just a proposal:
From the first Lambda, do not contact RDS. Instead, push all alarms into AWS SQS.
Then have another Lambda, scheduled every minute using AWS CloudWatch Rules, pick up all items from AWS SQS and insert them into RDS at once.
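A rough sketch of that scheduled consumer, assuming the AWS SDK for Java v2 and plain JDBC (the original poster uses Hibernate; JDBC is used here only to keep the example short). The queue URL, table name, and JDBC_URL environment variable are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

// Scheduled Lambda: drain a batch of alarms from SQS and write them to RDS over a
// single connection, instead of opening one connection per alarm.
public class AlarmBatchWriter implements RequestHandler<Object, Void> {

  private static final String QUEUE_URL =
      "https://sqs.us-east-1.amazonaws.com/123456789012/alarm-queue"; // hypothetical
  private final SqsClient sqs = SqsClient.create();

  @Override
  public Void handleRequest(Object event, Context context) {
    List<Message> messages = sqs.receiveMessage(ReceiveMessageRequest.builder()
        .queueUrl(QUEUE_URL)
        .maxNumberOfMessages(10) // SQS returns at most 10 messages per call
        .waitTimeSeconds(5)
        .build()).messages();
    if (messages.isEmpty()) {
      return null;
    }

    try (Connection conn = DriverManager.getConnection(System.getenv("JDBC_URL"));
         PreparedStatement insert =
             conn.prepareStatement("INSERT INTO ticket (payload) VALUES (?)")) {
      for (Message m : messages) {
        insert.setString(1, m.body());
        insert.addBatch();
      }
      insert.executeBatch(); // one round trip for the whole batch

      // Only delete the messages once the batch insert has succeeded.
      for (Message m : messages) {
        sqs.deleteMessage(DeleteMessageRequest.builder()
            .queueUrl(QUEUE_URL)
            .receiptHandle(m.receiptHandle())
            .build());
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    return null;
  }
}
```

Because all the inserts in a batch share one connection, the connection churn that was taking the database down goes away.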
I agree with raevilman's design of not letting Lambda contact RDS directly.
Creating a new ticket is not the only task your Lambda function is doing; you are also streaming these alerts to a dashboard. Depending on the streaming rate and the RDS limitations, you may want to split these tasks across multiple queues.
Generic solution: push the alarm to a fanout exchange, and this exchange will in turn push the alarm to one or more queues as required. You can then batch the alarms and perform multiple writes together without going through the connect/disconnect cycle multiple times.
AWS-specific solution: I haven't used SQS, so I can't really comment on its architecture. Alternatively, you can create an SNS topic and publish these alarms to it. You can then have SQS queues as subscribers to this topic, which in turn can be used for ticketing and for the dashboard independently of each other.
Here again, from the ticketing queue you can poll messages in batches using Lambda or your own scheduler and process the tickets (with a frequency depending on how time-critical the alarms are).
You may want to read this tutorial to get some pointers.
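A minimal sketch of that SNS-to-SQS fanout, assuming the AWS SDK for Java v2; all ARNs are placeholders. Note that each queue also needs an access policy that allows the SNS topic to send messages to it:

```java
import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.SubscribeRequest;

public class AlarmFanoutSetup {
  public static void main(String[] args) {
    SnsClient sns = SnsClient.create();
    String topicArn = "arn:aws:sns:us-east-1:123456789012:alarm-topic";      // hypothetical
    String ticketQueueArn = "arn:aws:sqs:us-east-1:123456789012:ticket-q";   // hypothetical
    String dashboardQueueArn = "arn:aws:sqs:us-east-1:123456789012:dash-q";  // hypothetical

    // Fan the same alarm out to independent queues; the ticketing consumer and the
    // dashboard consumer can then each process at their own pace.
    for (String queueArn : new String[] {ticketQueueArn, dashboardQueueArn}) {
      sns.subscribe(SubscribeRequest.builder()
          .topicArn(topicArn)
          .protocol("sqs")
          .endpoint(queueArn)
          .build());
    }
  }
}
```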
You can control the Lambda function concurrency. This will reduce the number of Lambdas that get spun up by the DynamoDB events, thereby reducing the connections to RDS.
https://aws.amazon.com/blogs/compute/managing-aws-lambda-function-concurrency/
Of course, this will throttle the DynamoDB events.
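If you go this route, reserved concurrency can be set from the console, the CLI, or the SDK. A rough sketch with the AWS SDK for Java v2; the function name and the limit of 5 are purely illustrative:

```java
import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.PutFunctionConcurrencyRequest;

public class LimitLambdaConcurrency {
  public static void main(String[] args) {
    // Cap the number of simultaneous executions of the ticket-creating function,
    // which in turn caps the number of simultaneous RDS connections it can open.
    try (LambdaClient lambda = LambdaClient.create()) {
      lambda.putFunctionConcurrency(PutFunctionConcurrencyRequest.builder()
          .functionName("alarm-ticket-fn")  // hypothetical function name
          .reservedConcurrentExecutions(5)  // illustrative limit
          .build());
    }
  }
}
```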

Using AWS SNS and Lambda - what's the right use case for an activity feed?

I want to use an AWS Lambda function to fan out and insert activity stream info to a Firebase endpoint for every user.
Should I be using Kinesis, SQS or SNS to trigger the lambda function for this use case? The updates to the activity stream can be triggered from the server and clients should receive the update near real time (within 60 seconds or so).
I think I have a pretty good idea of what SQS is, and I have used Kinesis in the past, but I'm not quite sure about SNS.
If we created an SNS topic for each user and each follower then subscribed to these topics with an AWS Lambda function, would that work?
Does it make sense to programmatically create topics and subscriptions for every user and follow relationship respectively?
As usual, the answer to such a question is mostly 'it depends on your use case'.
Kinesis vs SQS:
If your clients care about relative ordering between events (timestamp-based, for example), you'll almost certainly have to go with Kinesis. SQS is a best-effort FIFO queue, meaning events can arrive out of order and it would be up to your client to manage relative ordering.
As far as latencies are concerned, I have seen data ingested into Kinesis become visible to its consumer in as little as 300 ms.
When can SNS be interesting to you?
(Even with SNS, you'd have to use SQS.) If you use SNS, it will be easy to add a new application that can process your events. For example, if in the future you decide to ingest all events into, say, Elasticsearch to provide real-time analytics, all you'd have to do is add another SQS queue to your existing topic(s) and write a consumer.