kafka for consuming subscription data sent by a webservice

kafka for consuming subscription data sent by a webservice - web-services

Scenario:
I have a north bound web service(say A) and a south bound application(say C).
And I am creating a micro service (say B) which transforms data received by A in a format that is readable by C.
A can send data any random time intervals, and as B receives it, it has to transform.
What I think :
Once B subscribes to A, and A starts sending data via a callback url. Save the data in mongodb and after some interval process the data and push to C.
Question:
1. Since A is streaming a type of data to B, Can I use KAFKA for consuming the data ?
2. If no, what are the other alternatives?
3. I want to know, is there any other efficient way of doing this ?

What you are describing sounds to me like a typical stream transformation application. You can definitely use Kafka for this, but you can probably also use any other pub/sub service out there.
Setting up Kafka just for this seems like overkill and I would rather look into simpler pub/sub services offered online.

Related

How to stream events with GCP platform?

I am looking into building a simple solution where producer services push events to a message queue and then have a streaming service make those available through gRPC streaming API.
Cloud Pub/Sub seems well suited for the job however scaling the streaming service means that each copy of that service would need to create its own subscription and delete it before scaling down and that seems unnecessarily complicated and not what the platform was intended for.
On the other hand Kafka seems to work well for something like this but I'd like to avoid having to manage the underlying platform itself and instead leverage the cloud infrastructure.
I should also mention that the reason for having a streaming API is to allow for streaming towards a frontend (who may not have access to the underlying infrastructure)
Is there a better way to go about doing something like this with the GCP platform without going the route of deploying and managing my own infrastructure?

If you essentially want ephemeral subscriptions, then there are a few things you can set on the Subscription object when you create a subscription:
Set the expiration_policy to a smaller duration. When a subscriber is not receiving messages for that time period, the subscription will be deleted. The tradeoff is that if your subscriber is down due to a transient issue that lasts longer than this period, then the subscription will be deleted. By default, the expiration is 31 days. You can set this as low as 1 day. For pull subscribers, the subscribers simply need to stop issuing requests to Cloud Pub/Sub for the timer on their expiration to start. For push subscriptions, the timer starts based on when no messages are successfully delivered to the endpoint. Therefore, if no messages are published or if the endpoint is returning an error for all pushed messages, the timer is in effect.
Reduce the value of message_retention_duration. This is the time period for which messages are kept in the event a subscriber is not receiving messages and acking them. By default, this is 7 days. You can set it as low as 10 minutes. The tradeoff is that if your subscriber disconnects or gets behind in processing messages by more than this duration, messages older than that will be deleted and the subscriber will not see them.
Subscribers that cleanly shut down could probably just call DeleteSubscription themselves so that the subscription goes away immediately, but for ones that shut down unexpectedly, setting these two properties will minimize the time for which the subscription continues to exist and the number of messages (that will never get delivered) that will be retained.
Keep in mind that Cloud Pub/Sub quotas limit one to 10,000 subscriptions per topic and per project. Therefore, if a lot of subscriptions are created and either active or not cleaned up (manually, or automatically after expiration_policy's ttl has passed), then new subscriptions may not be able to be created.

I think your original idea was better than ephemeral subscriptions tbh. I mean it works, but it feels totally unnatural. Depending on what your requirements are. For example, do clients only need to receive messages while they're connected or do they all need to get all messages?
Only While Connected
Your original idea was better imo. What I probably would have done is to create a gRPC stream service that clients could connect to. The implementation is essentially an observer pattern. The consumer will receive a message and then iterate through the subscribers to do a "Send" to all of them. From there, any time a client connects to the service, it just registers itself with that observer collection and unregisters when it disconnects. Horizontal scaling is passive since clients are sticky to whatever instance they've connected to.
Everyone always get the message, if eventually
The concept is similar to the above but the client doesn't implicitly un-register from the observer on disconnect. Instead, it would register and un-register explicitly (through a method/command designed to do so). Modify the 'on disconnected' logic to tell the observer list that the client has gone offline. Then the consumer's broadcast logic is slightly different. Now it iterates through the list and says "if online, then send, else queue", and send the message to a ephemeral queue (that belongs to the client). Then your 'on connect' logic will send all messages that are in queue to the client before informing the consumer that it's back online. Basically an inbox. Setting up ephemeral, self-deleting queues is really easy in most products like RabbitMQ. I think you'll have to do a bit of managing whether or not it's ok to delete a queue though. For example, never delete the queue unless the client explicitly unsubscribes or has been inactive for so long. Fail to do that, and the whole inbox idea falls apart.
The selected answer above is most similar to what I'm subscribing here in that the subscription is the queue. If I did this, then I'd probably implement it as an internal bus instead of an observer (since it would be unnecessary) - You create a consumer on demand for a connecting client that literally just forwards the message. The message consumer subscribes and unsubscribes based on whether or not the client is connected. As Kamal noted, you'll run into problems if your scale exceeds the maximum number of subscriptions allowed by pubsub. If you find yourself in that position, then you can unshackle that constraint by implementing the pattern above. It's basically the same pattern but you shift the responsibility over to your infra where the only constraint is your own resources.
gRPC makes this mechanism pretty easy. Alternatively, for web, if you're on a Microsoft stack, then SignalR makes this pretty easy too. Clients connect to the hub, and you can publish to all connected clients. The consumer pattern here remains mostly the same, but you don't have to implement the observer pattern by hand.
(note: arrows in diagram are in the direction of dependency, not data flow)

How is Google Cloud Pub/Sub avoiding clock skew

I am looking into ways to order list of messages from google cloud pub/sub. The documentation says:
Have a way to determine from all messages it has currently received whether or not there are messages it has not yet received that it needs to process first.
...is possible by using Cloud Monitoring to keep track of the pubsub.googleapis.com/subscription/oldest_unacked_message_age metric. A subscriber would temporarily put all messages in some persistent storage and ack the messages. It would periodically check the oldest unacked message age and check against the publish timestamps of the messages in storage. All messages published before the oldest unacked message are guaranteed to have been received, so those messages can be removed from persistent storage and processed in order.
I tested it locally and this approach seems to be working fine.
I have one gripe with it however, and this is not something easily testable by myself.
This solution relies on server-side assigned (by google) publish_time attribute. How does Google avoid the issues of skewed clocks?
If my producer publishes messages A and then immediately B, how can I be sure that A.publish_time < B.publish_time is true? Especially considering that the same documentation page mentions internal load-balancers in the architecture of the solution. Is Google Pub/Sub using atomic clocks to synchronize time on the very first machines which see messages and enrich those messages with the current time?
There is an implicit assumption in the recommended solution that the clocks on all the servers are synchronized. But the documentation never explains if that is true or how it is achieved so I feel a bit uneasy about the solution. Does it work under very high load?
Notice I am only interested in relative order of confirmed messages published after each other. If two messages are published simultaneously, I don't care about the order of them between each other. It can be A, B or B, A. I only want to make sure that if B is published after A is published, then I can sort them in that order on retrieval.
Is the aforementioned solution only "best-effort" or are there actual guarantees about this behavior?

There are two sides to ordered message delivery: establishing an order of messages on the publish side and having an established order of processing messages on the subscribe side. The document to which you refer is mostly concerned with the latter, particularly when it comes to using oldest_unacked_message_age. When using this method, one can know that if message A has a publish timestamp that is less than the publish timestamp for message B, then a subscriber will always process message A before processing message B. Essentially, once the order is established (via publish timestamps), it will be consistent. This works if it is okay for the Cloud Pub/Sub service itself to establish the ordering of messages.
Publish timestamps are not synchronized across servers and so if it is necessary for the order to be established by the publishers, it will be necessary for the publishers to provide a timestamp (or sequence number) as an attribute that is used for ordering in the subscriber (and synchronized across publishers). The subscriber would sort message by this user-provided timestamp instead of by the publish timestamp. The oldest_unacked_message_age will no longer be exact because it is tied to the publish timestamp. One could be more conservative and only consider messages ordered that are older than oldest_unacked_message_age minus some delta to account for this discrepancy.

Google Cloud Pub-sub does not guarantee order of events receive to consumers as they were produced. Reason behind that is Google Cloud Pub-sub also running on a cluster of nodes. The possibility is there an event B can reach the consumer before event A. To Ensure ordering you have to make changes on both producer and consumer to identify the order of events. Here is section from docs.

Communicate internally between Google Cloud Functions?

We've created a Google Cloud Function that is essentially an internal API. Is there any way that other internal Google Cloud Functions can talk to the API function without exposing a HTTP endpoint for that function?
We've looked at PubSub but as far as we can see, you can send a request (per say!) but you can't receive a response.
Ideally, we don't want to expose a HTTP endpoint due to the extra security ramifications and we are trying to follow a microservice approach so every function is its own entity.

I sympathize with your microservices approach and trying to keep your services independent. You can accomplish this without opening all your functions to HTTP. Chris Richardson describes a similar case on his excellent website microservices.io:
You have applied the Database per Service pattern. Each service has
its own database. Some business transactions, however, span multiple
services so you need a mechanism to ensure data consistency across
services. For example, lets imagine that you are building an e-commerce store
where customers have a credit limit. The application must ensure that
a new order will not exceed the customer’s credit limit. Since Orders
and Customers are in different databases the application cannot simply
use a local ACID transaction.
He then goes on:
An e-commerce application that uses this approach would create an
order using a choreography-based saga that consists of the following
steps:
The Order Service creates an Order in a pending state and publishes an OrderCreated event.
The Customer Service receives the event attempts to reserve credit for that Order. It publishes either a Credit Reserved event or a
CreditLimitExceeded event.
The Order Service receives the event and changes the state of the order to either approved or cancelled.
Basically, instead of a direct function call that returns a value synchronously, the first microservice sends an asynchronous "request event" to the second microservice which issues a "response event" that the first service picks up. You would use Cloud PubSub to send and receive the messages.
You can read more about this under the Saga pattern on his website.

The most straightforward thing to do is wrap your API up into a regular function or object, and deploy that extra code along with each function that needs to use it. You may even wish to fully modularize the code, as you would expect from an npm module.

I need help clarifying a high level use-case of Amazon SQS

So I need a second pair of eyes to correct or confirm my understand standing of Amazon SQS. From my understanding, you can add an unlimited amount of messages to one queue. A message can be 256 KB in size, and if it needs to be larger than that, you can use amazon s3 to store 2 GB. Reading around online, it appears there are many use cases for this queuing service. For example one use case of SQS can act as a database buffer.
But here's what I'm looking to do.. I'm looking to make a real time messaging system. My current functionality acts like more of a message board, so the implementation just inserts into the database then reads the data and packages it into JSON to be inserted on SQLITE mobile phone. That works great, but I'm getting a lot of requests from people to make it real-time.
So what I'm wondering is can I utilize amazon SQS to write and read messages for a chat application? So in my theoretical use case of SQS would have a message queue to write to, and pull from the that queue every second to check for messages on mobile. But here's where I'm confused. Since you cannot "Query" a particular message from the queue, would it make sense to have a queue per user then a generic queue for the app server to read from? Or am I just talking crazy and should spend cognitive resources thinking about implementing an open connection on an Ec2 instance?
Any help would be great,
Thanks!

Have you thought about using Amazon SNS to push the chat messages to your mobile devices? Each user publishes to a topic and the readers subscribe to that topic. You just have to be ok with missing messages if the app isn't running.

If you only have a few (or maybe, less than 100) users, you could have thought of having one SQS queue per user. If that is not so, the solution won't be operationally feasible.
If you were to have one generic queue, SQS won't help because it doesn't allow querying for a given field in all available messages.
I can think of following options for your use case:
Setup one Redis cluster, possibly on Amazon ElastiCache. Have one message List per user.
One Messages table in MySQL, possibly on AWS RDS. This will provide an easy way to query messages for a given user.
You can also use DynamoDB in #2.

How to use Kinesis to broadcast records?

I know that Kinesis typical use case is event streaming, however we'd like to use it to broadcast some information to have it in near real time in some apps besides making it available for further stream processing. KCL seems to be the only viable option to use Kinesis as stream API is too low level.
As far I understand to use KCL we'd have to generate random applicationId so all apps could receive all the data, but this means creating a new DynamoDB table each time an application starts. Of course we can perform clean up when application stops but when application doesn't stop gracefully there would be DynamoDB table hanging around.
Is there a way/pattern to use Kinesis streams in a broadcast fashion?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js