Role of Kafka consumer: separate service or Django component? - django

I'm designing a web log analytics system.
I found an architecture built on Django (back end and front end) + Kafka + Spark.
I also found a similar system, with a similar architecture, described at this link: http://thevivekpandey.github.io/posts/2017-09-19-high-velocity-data-ingestion.html
But I'm confused about the role of the Kafka consumer. It is a service independent of Django, right?
If so, how do I connect it to Django when I want to plot real-time data on a front-end chart?
It would be ridiculous to place both the Kafka consumer and the producer in Django: a request from the SDK comes to Django, is passed to a Kafka topic (producer), and then returns to Django (consumer) for processing. Why not process it directly? That looks simpler and better.
Please help me understand the role of the Kafka consumer: where should it live, and how do I connect it to my front end?
Thanks & best Regards,
Jame

The article describes what happened without Kafka:
We saw that in times of peak load, data ingestion was not working properly: it was taking too long to connect to MongoDB and requests were timing out. This was leading to data loss.
So the main point of introducing Kafka and a Kafka consumer is to avoid putting too much load on the DB layer and to absorb that load gracefully with a messaging layer in between. To be honest, any message queue could be used in this case, not only Kafka.
The Kafka consumer can be part of the web layer, but that wouldn't be optimal, because you want separation of concerns (which makes the system more reliable when something fails) and the ability to scale components independently.
It's better to implement the Kafka consumer as a separate service if the concerns mentioned above (scalability and reliability) really matter and it's easy for you to operate (you now need to deploy, monitor, etc. a new service). In the end it's the classic monolith vs. microservices dilemma.
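As a rough illustration of the "separate service" option, here is a minimal sketch of a consumer that runs as its own process, outside Django. It assumes the kafka-python package; the topic name, broker address, group id and the in-memory `events` list are placeholders, not anything taken from the article. The decoding and storing logic is kept in a plain function so it can be exercised without a broker.

```python
import json


def consume(payloads, store):
    """Decode each raw JSON payload and append it to the storage layer.

    Malformed payloads are skipped so one bad message cannot stall the stream.
    Returns the number of events stored.
    """
    stored = 0
    for raw in payloads:
        try:
            event = json.loads(raw)
        except ValueError:
            continue
        store.append(event)
        stored += 1
    return stored


def run_consumer():
    """Entry point for the standalone consumer process (not called by Django)."""
    # Assumption: kafka-python is installed. The topic, broker address and
    # group id are placeholders chosen for this sketch.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "weblogs",
        bootstrap_servers="localhost:9092",
        group_id="weblog-writers",
    )
    events = []  # stand-in for the real database writer
    consume((record.value for record in consumer), events)
```

In this arrangement Django only produces to the topic and reads the stored results; one plausible way to feed a front-end chart is for a Django view to query the same store the consumer writes to.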

Related

Should I use a task queue (Celery), asyncio or neither for an API that polls other APIs?

I have written an API with Django whose purpose is to act as a bridge between a website back end and the external services we use, so that the website doesn't have to handle many requests to external APIs (CRM, calendar events, email providers etc.).
The API mainly polls other services, parses the results and forwards them to the website backend.
I initially went for a Celery-based task queue, as it seemed to me like the right tool to offload that processing to another instance, but I'm starting to think it doesn't really fit the purpose.
As the website expects synchronous responses, my code contains a lot of:
results = my_task.delay().get()
or
results = chain(fetch_results.s(), parse_results.s()).delay().get()
Which doesn't feel like the proper way to use Celery tasks.
It is efficient when pulling dozens of requests and processing the results in parallel - a periodic refresh task for example - but adds a lot of overhead for simple requests (fetch - parse - forward), which represent most of the traffic.
Should I go fully synchronous for those "simple requests" and keep Celery tasks for specific scenarios? Is there an alternative design (maybe involving asyncio) that would better suit the purpose of my API?
Using Django, Celery (w/ Amazon SQS) on an EBS EC2 instance.
You could consider using Gevent with your Django webserver to allow it to operate efficiently for the "simple requests" you've mentioned without being blocked. If you proceed with this approach, be sure to pool database connections with PgBouncer or Pgpool-II or a Python library since each greenlet will make its own connection.
Once you've implemented that, it's possible to also use Gevent instead of Celery to handle asynchronous processing by joining on multiple Greenlets that each make an external API request, rather than incur the overhead of passing messages to an external celery worker.
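The fan-out-and-join pattern described above can be sketched as follows. To keep the example dependency-free it uses the standard library's asyncio (which the question also mentions) instead of Gevent; with Gevent the same shape would use gevent.spawn and gevent.joinall. The `fetch` and `parse` functions and the service names are hypothetical stand-ins, and the sleep replaces real network I/O.

```python
import asyncio


async def fetch(service, delay):
    """Stand-in for one external API call; sleeps instead of doing network I/O."""
    await asyncio.sleep(delay)
    return {"service": service, "status": "ok"}


def parse(raw):
    """Stand-in for the parsing step."""
    return f"{raw['service']}:{raw['status']}"


async def fetch_all(services):
    """Start every request concurrently and wait for all of them to finish."""
    raw_results = await asyncio.gather(*(fetch(name, 0.01) for name in services))
    # gather() returns results in the order the awaitables were passed in.
    return [parse(raw) for raw in raw_results]


def handle_request(services):
    """Synchronous entry point, e.g. called from a regular view."""
    return asyncio.run(fetch_all(services))
```

The requests overlap, so the total wait is roughly that of the slowest call rather than the sum, and no message has to travel to a separate worker and back.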
Your implementation is similar to what we've done at Kloudless, which provides a single API to access multiple other APIs, including CRM, calendar, storage, etc.

(Unit) Testing Akka Streams Kafka

I'm currently evaluating Apache Kafka for use as middleware in a microservice environment, not only as a message queue but also for aggregating other data sources. Kafka seems to be the perfect fit. Most of the services are based on the Play Framework, so Akka Stream Kafka seems to be the natural choice for interacting with Kafka.
I prototyped a small app with a consumer and a publisher communicating via JSON, and that was pretty straightforward. But when it comes to unit testing I feel a little helpless. Is it possible to run the tests in a lightweight fashion, without a running Kafka cluster or an embedded server (check here)? I also found this project, which looked promising, but I was not able to test my consumer with it. Isn't that the right tool? I'm a little confused.
Not sure if your question is still relevant, but have you had a look at the Alpakka Kafka testkit?

Best JMS implementation for AWS

I have a Java/Spring application running in the Amazon AWS cloud.
My server instances sit behind a load balancer and run the same image of a Linux OS, with a Tomcat application server.
They are also connected to S3 as a shared file system (s3fs), and an RDS database.
My concern is to be sure the state of the different applications is synchronized. Today, the point of synchronization is the database, but when memory caching is needed, out of sync problems appear.
The solution I would like to use is to put a messaging system in place between the applications. For specific reasons I cannot use the Amazon SQS service, so JMS seems to fit my needs. After some reading, HornetQ also seems to be a very good implementation of it. Once an application's state changes, it communicates the change to all other applications. Each application is both producer and consumer of the same queue.
As we are in a dynamic system where servers and IPs are automatically created and deleted, the automatic discovery of instances seems to be the best solution to use.
But in AWS, broadcast is not possible!
For HornetQ, I saw a kind of workaround that additionally uses JGroups. But for me, this is a second framework to investigate and learn: twice the work, and no longer an out-of-the-box solution.
What is your opinion? Has anyone already built a solution for similar needs?
Maybe other out-of-the-box solutions exist?
Thanks in advance for your answer!
In my experience you could try TCPGOSSIP, a JGroups discovery protocol that HornetQ can be configured to use.
See https://docs.jboss.org/jbossclustering/cluster_guide/5.1/html/jgroups.chapt.html

Ideas for scaling chat in AWS?

I'm trying to come up with the best solution for scaling a chat service in AWS. I've come up with a couple potential solutions:
Redis Pub/Sub - When a user establishes a connection to a server that server subscribes to that user's ID. When someone sends a message to that user, a server will perform a publish to the channel with the user's id. The server the user is connected to will receive the message and push it down to the appropriate client.
SQS - I've thought of creating a queue for each user. The server the user is connected to will poll (or use SQS long-polling) that queue. When a new message is discovered, it will be pushed to the user from the server.
SNS - I really liked this solution until I discovered the 100 topic limit. I would need to create a topic for each user, which would only support 100 users.
Are there any other ways chat could be scaled using AWS? Is the SQS approach viable? How long does it take AWS to add a message to a queue?
Building a chat service isn't as easy as you would think.
I've built full XMPP servers, clients, and SDKs and can attest to some of the subtle and difficult problems that arise. A prototype where users see each other and chat is easy. A fully featured system with account creation, security, discovery, presence, offline delivery, and friend lists is much more of a challenge. To then scale that across an arbitrary number of servers is especially difficult.
PubSub is a feature offered by chat services (see XEP-0060) rather than a traditional means of building a chat service. I can see the allure, but PubSub can have drawbacks.
Some questions for you:
Are you doing this over the Web? Are users going to be connecting and long-polling, or do you have a WebSockets solution?
How many users? How many connections per user? Ratio of writes to reads?
Your idea for using SQS that way is interesting, but probably won't scale. It's not unusual to have 50k or more users on a chat server. If you're polling each SQS Queue for each user you're not going to get anywhere near that. You would be better off having a queue for each server, and the server polls only that queue. Then it's on you to figure out what server a user is on and put the message into the right queue.
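The queue-per-server idea can be sketched as below. This is an in-memory illustration only: `queue.Queue` stands in for one SQS queue per server, and the `ChatRouter` class, its method names and the presence table are hypothetical names introduced for the example.

```python
import queue


class ChatRouter:
    """Routes a message to the queue of the server holding the recipient's connection."""

    def __init__(self, server_ids):
        # One queue per server (in-memory stand-ins for per-server SQS queues).
        self.queues = {sid: queue.Queue() for sid in server_ids}
        # Presence table: user id -> id of the server the user is connected to.
        self.user_server = {}

    def connect(self, user_id, server_id):
        self.user_server[user_id] = server_id

    def disconnect(self, user_id):
        self.user_server.pop(user_id, None)

    def send(self, to_user, text):
        """Enqueue for the recipient's server; return False if the user is offline."""
        server_id = self.user_server.get(to_user)
        if server_id is None:
            return False
        self.queues[server_id].put((to_user, text))
        return True

    def poll(self, server_id):
        """What each server does: drain only its own queue."""
        q = self.queues[server_id]
        messages = []
        while not q.empty():
            messages.append(q.get_nowait())
        return messages
```

The number of queues to poll then grows with the number of servers rather than the number of users; the cost is that the presence table has to be kept accurate as users connect and disconnect.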
I suspect you'll want to go something like:
A big RDS database on the backend.
A bunch of front-end servers handling the client connections.
Some middle tier Java / C# code tracking everything and routing messages to the right place.
To get an idea of the complexity of building a chat server, read the XMPP RFCs:
RFC 3920
RFC 3921
SQS/SNS might not fit your chat requirement. We have observed some latency in SQS that might not be suitable for a chat application. Also, standard SQS queues do not guarantee FIFO ordering. I have worked with Redis on AWS; it is quite easy and stable if it is configured with all the best practices in mind.
I've thought about building a chat server using SNS, but instead of doing one topic per user, as you describe, doing one topic for the entire chat system and having each server subscribe to the topic - where each server is running some sort of long polling or web sockets chat system. Then, when an event occurs, the data is sent in the payload of the SNS notification. The server can then use this payload to determine what clients in its queue should receive the response, leaving any unrelated clients untouched. I actually built a small prototype for this, but haven't done a ton of testing to see if it's robust enough for a large number of users.
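The single-topic approach described above could look roughly like this. It is an in-memory sketch, not SNS code: the `Topic` class stands in for the one system-wide topic, and the `ChatServer` class, the payload keys `to` and `text`, and the per-user inbox lists are assumptions made for illustration.

```python
class ChatServer:
    """One chat server holding the connections of some users."""

    def __init__(self, name):
        self.name = name
        self.clients = {}  # user id -> messages pushed to that user's connection

    def connect(self, user_id):
        self.clients[user_id] = []

    def on_notification(self, payload):
        """Called for every message on the shared topic.

        Delivers only if the recipient is connected here; other clients are untouched.
        """
        inbox = self.clients.get(payload["to"])
        if inbox is None:
            return False
        inbox.append(payload["text"])
        return True


class Topic:
    """In-memory stand-in for the single system-wide topic."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, server):
        self.subscribers.append(server)

    def publish(self, payload):
        """Fan out to every server; return the names of servers that delivered."""
        return [s.name for s in self.subscribers if s.on_notification(payload)]
```

Every server sees every message and discards most of them, so this trades routing bookkeeping for extra traffic per server.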
Hi, realtime chat doesn't work well with SNS. It's designed for email/SMS or services where a latency of one or a few seconds is acceptable. In realtime chat, a delay of one or a few seconds is not acceptable.
Check this link:
Latency (i.e. “Realtime”) for PubNub vs SNS
Amazon SNS provides no latency guarantees, and the vast majority of latencies are measured over 1 second, and often many seconds slower. Again, this is somewhat irrelevant; Amazon SNS is designed for server-to-server (or email/SMS) notifications, where a latency of many seconds is often acceptable and expected.
Because PubNub delivers data via an existing, established open network socket, latencies are under 0.25 seconds from publish to subscribe in the 95% percentile of the subscribed devices. Most humans perceive something as “realtime” if the event is perceived within 0.6 – 0.7 seconds.
The way I would implement such a thing (if not using some framework) is the following:
Have a web server (on EC2) that accepts messages from users.
Put this web server in an Auto Scaling group. The web server can update any database on Amazon RDS, which scales easily.
If you are running your own database, consider decoupling it from the web server with SQS (by sending all requests to the same queue) and having a consumer that drains the queue. That consumer can also sit behind an Auto Scaling group, so that it scales out when the queue holds more than X messages (you can set this up with alarms).
SQS normally updates quickly: a message usually appears on the queue less than one second after you send it, and rarely takes longer.
Since the AWS IoT service started to support WebSockets, keepalive and Pub/Sub a couple of months ago, you can easily build an elastic chat on it. AWS IoT is a managed service with SDKs for many languages, including JavaScript, that was built to handle monster loads (billions of messages) with zero administration.
You can read more about the update here:
https://aws.amazon.com/ru/about-aws/whats-new/2016/01/aws-iot-now-supports-websockets-custom-keepalive-intervals-and-enhanced-console/
Edit:
Last SQS update (2016/11): you can now use Amazon Simple Queue Service (SQS) for applications that require messages to be processed in a strict sequence and exactly once using First-in, First-out (FIFO) queues. FIFO queues are designed to ensure that the order in which messages are sent and received is strictly preserved and that each message is processed exactly once.
Source:
https://aws.amazon.com/about-aws/whats-new/2016/11/amazon-sqs-introduces-fifo-queues-with-exactly-once-processing-and-lower-prices-for-standard-queues/
From now on, implementing this with SQS + SNS looks like a good idea too.

Expose Amazon SQS directly to clients or via a web service as a proxy

I would like to use Amazon SQS in my application to queue requests from other external systems that don't belong to me.
Which is the better way of doing this: directly exposing the SQS queue and the required message format, or publishing a web service (WCF) that queues the requests?
Also, I read that SQS is relatively slow for a single access, but am I right that it can easily handle many concurrent accesses from different clients?
Best
Thomas
This is largely a matter of preference and depends a bit on your situation. But my recommendation would be to wrap it with your own web-service.
Building your web-service allows you to do things like validation, throttling, schema versioning etc. E.g. you can reject invalid messages with immediate synchronous feedback to the sender. If the external systems are publishing directly to your queue, then invalid messages become your problem not theirs, and if you revise your schema and want to reject old-schema messages then you either have to drop them or set up a separate back-channel to feed back information to the publisher. That adds unnecessary complexity to your system. Having a web-service would even let you switch to other queuing technologies later if you need to.
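To make the validation point concrete, here is a minimal sketch of what the wrapper's submit path might do before anything reaches the queue. The field names, the supported schema version and the `submit` function are hypothetical, and `queue.Queue` stands in for whichever queuing backend sits behind the service.

```python
import json
import queue

# Hypothetical schema rules, made up for this sketch.
SUPPORTED_VERSIONS = (2,)
REQUIRED_FIELDS = ("schema_version", "request_id", "body")


def submit(raw_body, backend):
    """Validate a request and enqueue it; returns (http_status, message).

    Invalid input is rejected synchronously, so it never reaches the queue.
    """
    try:
        message = json.loads(raw_body)
    except ValueError:
        return 400, "body is not valid JSON"
    if not isinstance(message, dict):
        return 400, "body must be a JSON object"
    missing = [field for field in REQUIRED_FIELDS if field not in message]
    if missing:
        return 400, "missing fields: " + ", ".join(missing)
    if message["schema_version"] not in SUPPORTED_VERSIONS:
        return 400, "unsupported schema_version"
    backend.put(message)  # only this line would change for a different backend
    return 202, "queued"
```

Because callers only ever see the status code and message, the queue behind `backend` can be swapped without changing the contract with external systems.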
But building your own web-service has downsides too: will your own service be able to handle the same load as the SQS API with the same low latency? It won't scale infinitely like SQS, so how responsive will you need to be to changes in load? Have you got the resources to manage a separate service? And it's more work than just giving a client's AWS account permission to publish to your queue.
If you're happy with the extra work involved, and you want a more future-proof system, IMHO it's worth building the web-service wrapper.