Kafka - Real Time Streaming with Web Service

Kafka - Real Time Streaming with Web Service - web-services

I am trying to stream a data through web service and planning to consume it into kafka. Streaming data would be of size 4 MB, at max it can goes upto 10 MB. Data source SDK is written onto .Net and Apache Kafka does not provide DLL for its consumer and producer. Its very typical to write Kafka producer and consumer in .Net and we can't use github Kafka producer.
My Questions are -
Is web service is good option for real time streaming?
Is web service able to stream upto 10MB of data without impacting the performance of web server and data ingestion?
Is there any better approach to solve this issue?
answer with authentic source will really helps me.
Thanks...

I think, the used of webservices is a good option to push stream data if you would like to decouple your front-ends either from kafka or an API written into a specific language. But that depends of your use-case.
You should have a look at the open-source Kafka REST Proxy provided by the Confluent Distribution (http://docs.confluent.io/2.0.0/kafka-rest/docs/index.html).
It will allow you to produce/consumer messsage though webservice.
If you expect to push messages with a max size of 10MB don't forget to increase the following kafka properties (as by default kafka is tuned for 1MB messages).
max.message.bytes
replica.fetch.max.bytes
fetch.message.max.bytes (consumer config)

Related

Role of kafka consumer, seperate service or Django component?

I'm designing a web log analytic.
And I found an architect with Django(Back-end & front-end)+ kafka + spark.
I also found some same system from this link:http://thevivekpandey.github.io/posts/2017-09-19-high-velocity-data-ingestion.html with below architect
But I confuse about the role of kafka-consumer. It will is a service, independent to Django, right?
So If I want to plot real-time data to front-end chart, how to I attached to Django.
It will too ridiculous if I place both kafka-consumer & producer in Django. Request from sdk come to Django by pass to kafa topic (producer) and return Django (consumer) for process. Why we don't go directly. It looks simple and better.
Please help me to understand the role of kafka consumer, where it should belong? and how to connect to my front-end.
Thanks & best Regards,
Jame

The article mentions about the use case without Kafka:
We saw that in times of peak load, data ingestion was not working properly: it was taking too long to connect to MongoDB and requests were timing out. This was leading to data loss.
So the main point of introducing Kafka and Kafka Consumer is to avoid too much load on DB layer and handle it gracefully with a messaging layer in between. To be honest, any message queue can be used in this case, not only Kafka.
Kafka Consumer can be a part of the web layer. It wouldn't be optimal, because you want the separation of concerns (which makes the system more reliable in case of failures) and ability to scale things independently.
It's better to implement the Kafka Consumer as a separate service if the concerns mentioned above really matter (scalability and reliability) and it's easy for you to do operationally (because you need to deploy, monitor, etc. a new service now). In the end it's a classic monolith vs. microservices dilemma.

(Unit) Testing Akka Streams Kafka

I'am currently evaluating Apache Kafka for the use as a middleware in a microservice environment. Not only as a Message Queue but also for aggregating other data sources. Kafka seems to be the perfect fit. The services in majority are based on the Play Framework, so Akka Stream Kafka seems to be the natural choice to interact with Kafka.
I prototyped a small App with a Consumer and a Publisher, communicating via JSON and that was pretty straight forward. But when it comes to unit testing I become a little helpless. Is it possible to run the tests in a lightweight fashion and not with a running Kafka Cluster or an embedded server (check here)? I also found this project which looked promising, but I was not able to test my Consumer with it. Isn't that the right tool? I'am a little confused.

Not sure if your question is still relevant, but have you had a look at the Alpakka Kafka testkit?

Can KAFKA producer read log files?

Log files of my application keep accumulating on a server.I want to dump them into HDFS through KAFKA.I want the Kafka producer to read the log files,send them to Kafka broker and then move those files to another folder.Can the Kafka producer read log files ? Also, is it possible to have the copying logic in Kafka producer ?

Kafka maintains feeds of messages in categories called topics.
We'll call processes that publish messages to a Kafka topic producers.
We'll call processes that subscribe to topics and process the feed of published messages consumers..
Kafka is run as a cluster comprised of one or more servers each of which is called a broker.
So, at a high level, producers send messages over the network to the Kafka cluster which in turn serves them up to consumers like this:
So this is not a suitable for your application where you want to injest log files. Instead you can try flume.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

As you know, Apache Kafka is publish-subscribe messaging system. you can send message from your application. To send message from your application you can use kafka clients or kafka rest api.
In short, you can read your log with your application and can send these logs to kafka topics.
To handle these logs, you can use apache storm. You can find many integrated solution for these purposes. And by using storm you can
add any logic your stream processing.
You can read many useful detailed information about storm kafka integration.
Also to put your processed logs to hdfs, you can easily integrate your storm with hadoop. You can check this repo for it.

Kafka was developed to support high volume event streams such as real-time log aggregation. From the kafka documentation
Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption
Also I got this little piece of information from this nice article which almost similar to your use-case
Today, Kafka has been used in production at LinkedIn for a number of projects. There are both offline and online usage. In the offline case, we use Kafka to feed all activity events to our data warehouse and Hadoop, from which we then run various batch analysis

Ideas for scaling chat in AWS?

I'm trying to come up with the best solution for scaling a chat service in AWS. I've come up with a couple potential solutions:
Redis Pub/Sub - When a user establishes a connection to a server that server subscribes to that user's ID. When someone sends a message to that user, a server will perform a publish to the channel with the user's id. The server the user is connected to will receive the message and push it down to the appropriate client.
SQS - I've thought of creating a queue for each user. The server the user is connected to will poll (or use SQS long-polling) that queue. When a new message is discovered, it will be pushed to the user from the server.
SNS - I really liked this solution until I discovered the 100 topic limit. I would need to create a topic for each user, which would only support 100 users.
Are their any other ways chat could be scaled using AWS? Is the SQS approach viable? How long does it take AWS to add a message to a queue?

Building a chat service isn't as easy as you would think.
I've built full XMPP servers, clients, and SDK's and can attest to some of the subtle and difficult problems that arise. A prototype where users see each other and chat is easy. A full features system with account creation, security, discovery, presence, offline delivery, and friend lists is much more of a challenge. To then scale that across an arbitrary number of servers is especially difficult.
PubSub is a feature offered by Chat Services (see XEP-60) rather than a traditional means of building a chat service. I can see the allure, but PubSub can have drawbacks.
Some questions for you:
Are you doing this over the Web? Are users going to be connecting and long-poling or do you have a Web Sockets solution?
How many users? How many connections per user? Ratio of writes to reads?
Your idea for using SQS that way is interesting, but probably won't scale. It's not unusual to have 50k or more users on a chat server. If you're polling each SQS Queue for each user you're not going to get anywhere near that. You would be better off having a queue for each server, and the server polls only that queue. Then it's on you to figure out what server a user is on and put the message into the right queue.
I suspect you'll want to go something like:
A big RDS database on the backend.
A bunch of front-end servers handling the client connections.
Some middle tier Java / C# code tracking everything and routing messages to the right place.
To get an idea of the complexity of building a chat server read the XMPP RFC's:
RFC 3920
RFC 3921

SQS/ SNS might not fit your chatty requirement. we have observed some latency in SQS which might not be suitable for a chat application. Also SQS does not guarantee FIFO. i have worked with Redis on AWS. It is quite easy and stable if it is configured taking all the best practices in mind.

I've thought about building a chat server using SNS, but instead of doing one topic per user, as you describe, doing one topic for the entire chat system and having each server subscribe to the topic - where each server is running some sort of long polling or web sockets chat system. Then, when an event occurs, the data is sent in the payload of the SNS notification. The server can then use this payload to determine what clients in its queue should receive the response, leaving any unrelated clients untouched. I actually built a small prototype for this, but haven't done a ton of testing to see if it's robust enough for a large number of users.

HI realtime chat doesn't work well with SNS. It's designed for email/SMS or service 1 or a few seconds latency is acceptable. In realtime chat, 1 or a few seconds are not acceptable.
check this link
Latency (i.e. “Realtime”) for PubNub vs SNS
Amazon SNS provides no latency guarantees, and the vast majority of latencies are measured over 1 second, and often many seconds slower. Again, this is somewhat irrelevant; Amazon SNS is designed for server-to-server (or email/SMS) notifications, where a latency of many seconds is often acceptable and expected.
Because PubNub delivers data via an existing, established open network socket, latencies are under 0.25 seconds from publish to subscribe in the 95% percentile of the subscribed devices. Most humans perceive something as “realtime” if the event is perceived within 0.6 – 0.7 seconds.

the way i would implement such a thing (if not using some framework) is the following:
have a webserver (on ec2) which accepts the msgs from the user.
use Autoscalling group on this webserver. the webserver can update any DB on amazon RDS which can scale easily.
if you are using your own db, you might consider to decouple the db from the webserver using the sqs (by sending all requests the same queue), and then u can have a consumer which consume the queue. this consumer can also be placed behind an autoscalling group, so that if the queue is larger than X msgs, it will scale (u can set it up with alarms)
sqs normally updates pretty fast i.e less than one second. (from the moment u sent it, to the moment it appears on the on the queue), and rarely more than that.

Since a new AWS IoT service started to support WebSockets, Keepalive and Pub/Sub couple months ago, you may easily build elastic chat on it. AWS IoT is a managed service with lots of SDKs for different languages including JavaScript that was build to handle monster loads (billions of messages) with zero administration.
You can read more about update here:
https://aws.amazon.com/ru/about-aws/whats-new/2016/01/aws-iot-now-supports-websockets-custom-keepalive-intervals-and-enhanced-console/
Edit:
Last SQS update (2016/11): you can now use Amazon Simple Queue Service (SQS) for applications that require messages to be processed in a strict sequence and exactly once using First-in, First-out (FIFO) queues. FIFO queues are designed to ensure that the order in which messages are sent and received is strictly preserved and that each message is processed exactly once.
Source:
https://aws.amazon.com/about-aws/whats-new/2016/11/amazon-sqs-introduces-fifo-queues-with-exactly-once-processing-and-lower-prices-for-standard-queues/
Now on, implementing SQS + SNS looks like a good idea too.

Ideal way/architecture to deliver large data over Web Services

We are trying to design 6 web services, which will serve another client component. The client component requires data from the web service we are implementing.
Now, the problem is, there is not 1 Web Service we are implementing, there is one Web Service which the client component hits, this initiates a series (5 more) of Web Services which gather data from their respective data stores and finally provide the data back to the original Web Service, which then delivers the data back to the client component.
So, if the requested data becomes huge, then, this will be a serious problem for our internal communication channel.
So, what do you guys suggest? What can be done to avoid overloading of the communication channel between the internal Web Service and at the same time, also delivering the data to the client component.
Update 1
Using 5 WS, where, 1WS does not know about the others, except the next one is a business requirement. Actually, 5 companies "small services" are being integrated.
We use Java and Axis2

We've had a similar problem. Apart from trying to avoid it (eg for internal communication go direct to db instead of web service) you can mitigate it by at least not performing the 5 or so tasks in series. Make new threads to collect them all in parallel and process them at the end to reduce latency (except where they might contend for the same resource and bottle neck).
But before I'd do anything load test it and see if it is even an issue and get some baseline stats so you can see what improvement each change makes. Also sometimes you might be better off tweaking network settings or the actual network rather than trying to optimise the code - but again test and see.

Put all the data on a temporary compressed file and give back the ftp url of the file.
The client fetches the big data chunk uncompress it and reads it. (maybe some authentication mechanism for the ftp server)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js