Flume use case: reading from HTTP and pushing to HDFS via Kafka

I'm new to Flume and am thinking of using it in the scenario below.
Our system receives events as HTTP POST, and we need to store a copy of them in Kafka (for further processing) and another copy in HDFS (as a permanent store).
Can we configure the Flume source as HTTP, the channel as Kafka, and the sink as HDFS to meet our requirement? Will this solution work?

If I've understood well, you want Kafka as the final backend in which to store the data, not as the internal channel a Flume agent uses to connect its source and sink. I mean, a Flume agent is basically composed of a source that receives data and builds Flume events, which are put into a channel so that a sink can read those events and do something with them (typically, persist the data in a final backend). Thus, according to your design, if you use Kafka as the internal channel, it will be just that: an internal way of connecting the HTTP source and the HDFS sink, and it will never be accessible from outside the agent.
In order to meet your needs, you will need an agent such as:
http_source -----> memory_channel_1 -----> HDFS_sink  -----> HDFS
     |
     |-----------> memory_channel_2 -----> Kafka_sink -----> Kafka
{.........................Flume agent.......................} {backend}
Please note that the channels are internal to the agent; they can be backed by memory, by files, or even by Kafka, but such a Kafka channel is different from the final Kafka cluster where you persist the data and which is accessible to your application.
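For reference, a minimal agent configuration along those lines could look like the sketch below. The agent name a1, the port, the HDFS path, and the Kafka broker/topic are illustrative assumptions, and the Kafka sink property names follow the Flume 1.7+ style:

    a1.sources  = http-in
    a1.channels = mem-hdfs mem-kafka
    a1.sinks    = hdfs-out kafka-out

    # HTTP source; the default replicating selector copies each event to both channels
    a1.sources.http-in.type = http
    a1.sources.http-in.port = 8080
    a1.sources.http-in.channels = mem-hdfs mem-kafka
    a1.sources.http-in.selector.type = replicating

    a1.channels.mem-hdfs.type = memory
    a1.channels.mem-kafka.type = memory

    # Permanent copy in HDFS
    a1.sinks.hdfs-out.type = hdfs
    a1.sinks.hdfs-out.channel = mem-hdfs
    a1.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/flume/events

    # Copy published to the external Kafka cluster for further processing
    a1.sinks.kafka-out.type = org.apache.flume.sink.kafka.KafkaSink
    a1.sinks.kafka-out.channel = mem-kafka
    a1.sinks.kafka-out.kafka.bootstrap.servers = broker1:9092
    a1.sinks.kafka-out.kafka.topic = events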

Related

Dynamically handle channels to publish to with Redis in C++

I have 2 applications (a GUI in javascript and another in C++) which need to communicate to each other.
The C++ application (server) produces multiple streams of real-time sensor data which it has to stream to the GUI (client). The data is buffered and sent as a big chunk. The GUI simply renders the data and doesn't buffer it locally (the current library renders relatively slowly).
We want to use Redis where each channel is a sensor. On the client side the user can select which sensor has to be streamed. This requires letting the server somehow know which channels to publish to.
Now the question is more about performance and extensibility. Which scenario is best?
Publish all sensor data: roughly 30 sensors, each sample at most 64 bits, with up to 10,000 samples streamed at up to 50 Hz. (This is maxing out absolutely everything, but it does give a ballpark.)
Store the channel names in Redis as a JSON object or as namespaced keys. Listen for a set event server-side, get the channels, cache them, and dynamically publish to them.
Same as above, but get the channels during every cycle from Redis without listening for any set event.
Use a configuration channel where the client publishes the configuration (as a JSON string) when it's changed. Server-side, we subscribe to the configuration channel and handle the new channels appropriately.
Something else. Please elaborate.
Try the Redis Streams feature from the recently released Redis 5.0. If you are looking for a performant C++ library that supports Redis Streams, try bredis, for example.
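As a rough illustration of the streams idea, appending a sensor sample to a per-sensor stream boils down to one XADD per sample. The sketch below uses the plain hiredis C client rather than bredis, and the host, port, and the sensor:<id> stream naming are assumptions:

    // Sketch only: publishes one sample to a per-sensor Redis stream using hiredis.
    #include <hiredis/hiredis.h>
    #include <cstdio>
    #include <string>

    int main() {
        redisContext *ctx = redisConnect("127.0.0.1", 6379);
        if (ctx == nullptr || ctx->err) {
            std::fprintf(stderr, "could not connect to Redis\n");
            return 1;
        }

        int sensor_id = 7;
        double sample = 42.5;  // one value, at most 64 bits per the question
        std::string stream = "sensor:" + std::to_string(sensor_id);

        char value[32];
        std::snprintf(value, sizeof(value), "%.6f", sample);

        // XADD with '*' lets Redis assign the entry ID; consumers read with XREAD/XRANGE.
        redisReply *reply = static_cast<redisReply *>(redisCommand(
            ctx, "XADD %s * value %s", stream.c_str(), value));
        if (reply != nullptr) freeReplyObject(reply);

        redisFree(ctx);
        return 0;
    }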

Can Apache Kafka be used with C++?

I am new to Apache Kafka. The Kafka documentation (https://kafka.apache.org/documentation.html#introduction) mentions that it can be used with C++, but I am not sure how to do that. My application continuously generates image files that need to be transferred to another machine. I felt I could use the Streams API of Kafka, but I am not sure how to stream image files.
Yes, you can use Apache Kafka with C/C++.
By far the most popular C/C++ client is https://github.com/edenhill/librdkafka. You can use the client to read data from Kafka and write data back to Kafka.
Further documentation on librdkafka is available at http://docs.confluent.io/current/clients/index.html (the author of librdkafka, Magnus Edenhill, works at Confluent).
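A minimal producer sketch with librdkafka's C++ API might look as follows; the broker address localhost:9092 and the topic name "images" are assumptions, and the payload stands in for the bytes of an image file:

    #include <librdkafka/rdkafkacpp.h>
    #include <iostream>
    #include <string>

    int main() {
        std::string errstr;

        RdKafka::Conf *conf = RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL);
        conf->set("bootstrap.servers", "localhost:9092", errstr);

        RdKafka::Producer *producer = RdKafka::Producer::create(conf, errstr);
        if (!producer) {
            std::cerr << "Failed to create producer: " << errstr << std::endl;
            return 1;
        }

        RdKafka::Conf *tconf = RdKafka::Conf::create(RdKafka::Conf::CONF_TOPIC);
        RdKafka::Topic *topic =
            RdKafka::Topic::create(producer, "images", tconf, errstr);

        std::string payload = "bytes of an image file would go here";

        RdKafka::ErrorCode err = producer->produce(
            topic, RdKafka::Topic::PARTITION_UA, RdKafka::Producer::RK_MSG_COPY,
            const_cast<char *>(payload.data()), payload.size(),
            nullptr /* key */, nullptr /* opaque */);
        if (err != RdKafka::ERR_NO_ERROR)
            std::cerr << "Produce failed: " << RdKafka::err2str(err) << std::endl;

        producer->poll(0);          // serve delivery callbacks
        producer->flush(10 * 1000); // wait for outstanding messages

        delete topic;
        delete tconf;
        delete producer;
        delete conf;
        return 0;
    }

Note that brokers default to roughly a 1 MB message limit, so for large image files a common pattern is to store the file elsewhere (for example HDFS or object storage) and send only a reference through Kafka.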

Kafka - Real Time Streaming with Web Service

I am trying to stream data through a web service and planning to consume it into Kafka. The streaming data would be about 4 MB in size; at most it can go up to 10 MB. The data source SDK is written in .NET, and Apache Kafka does not provide a DLL for its consumer and producer. It is quite tricky to write a Kafka producer and consumer in .NET, and we can't use the GitHub Kafka producer.
My questions are:
Is a web service a good option for real-time streaming?
Is a web service able to stream up to 10 MB of data without impacting the performance of the web server and data ingestion?
Is there any better approach to solving this issue?
An answer with an authoritative source would really help me.
Thanks...
I think the use of web services is a good option for pushing streaming data if you would like to decouple your front ends either from Kafka or from an API written in a specific language. But that depends on your use case.
You should have a look at the open-source Kafka REST Proxy provided by the Confluent Distribution (http://docs.confluent.io/2.0.0/kafka-rest/docs/index.html).
It will allow you to produce and consume messages through a web service.
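Producing through the REST Proxy is just an HTTP POST. The sketch below shows the call from C++ with libcurl; the proxy URL, the topic name "events", and the payload are illustrative assumptions, and the v1 binary embedded format expects base64-encoded values:

    #include <curl/curl.h>

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl) return 1;

        // "S2Fma2E=" is base64 for the bytes of "Kafka".
        const char *body = "{\"records\":[{\"value\":\"S2Fma2E=\"}]}";

        struct curl_slist *headers = nullptr;
        headers = curl_slist_append(
            headers, "Content-Type: application/vnd.kafka.binary.v1+json");

        curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8082/topics/events");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

        CURLcode res = curl_easy_perform(curl);  // issue the POST

        curl_slist_free_all(headers);
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return res == CURLE_OK ? 0 : 1;
    }

The same request can of course be made from .NET with any HTTP client, which is the point of putting the REST Proxy in front of Kafka.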
If you expect to push messages with a maximum size of 10 MB, don't forget to increase the following Kafka properties (by default Kafka is tuned for messages of about 1 MB):
max.message.bytes
replica.fetch.max.bytes
fetch.message.max.bytes (consumer config)
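For example, with 10 MB expressed in bytes (10,485,760; adjust to your actual maximum and keep the three values consistent; treat this as a sketch, as the exact file each property lives in depends on your Kafka version):

    # broker / topic-level setting
    max.message.bytes=10485760
    # broker setting, so that followers can still replicate the larger messages
    replica.fetch.max.bytes=10485760
    # consumer setting (old consumer config, per the linked docs)
    fetch.message.max.bytes=10485760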

Can a Kafka producer read log files?

Log files of my application keep accumulating on a server. I want to dump them into HDFS through Kafka. I want the Kafka producer to read the log files, send them to a Kafka broker, and then move those files to another folder. Can the Kafka producer read log files? Also, is it possible to have the copying logic in the Kafka producer?
Kafka maintains feeds of messages in categories called topics.
We'll call processes that publish messages to a Kafka topic producers.
We'll call processes that subscribe to topics and process the feed of published messages consumers.
Kafka is run as a cluster comprised of one or more servers each of which is called a broker.
So, at a high level, producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.
So this is not a suitable fit for your application, where you want to ingest log files. Instead, you can try Flume.
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
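If you go the Flume route, a sketch of an agent that matches the question fairly closely is a spooling-directory source (which marks each fully ingested file with a .COMPLETED suffix rather than moving it) feeding a Kafka channel and an HDFS sink, so the data passes through Kafka on its way to HDFS. The directory, broker, topic, and path below are illustrative assumptions, using Flume 1.7+-style property names:

    agent.sources  = logdir
    agent.channels = kafka-ch
    agent.sinks    = hdfs-out

    # Watch the folder where the application drops its log files; Flume renames
    # each file to <name>.COMPLETED once it has been fully ingested.
    agent.sources.logdir.type = spooldir
    agent.sources.logdir.spoolDir = /var/log/myapp/outgoing
    agent.sources.logdir.channels = kafka-ch

    # The channel itself is backed by a Kafka topic.
    agent.channels.kafka-ch.type = org.apache.flume.channel.kafka.KafkaChannel
    agent.channels.kafka-ch.kafka.bootstrap.servers = broker1:9092
    agent.channels.kafka-ch.kafka.topic = app-logs

    # Final, permanent copy in HDFS
    agent.sinks.hdfs-out.type = hdfs
    agent.sinks.hdfs-out.channel = kafka-ch
    agent.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/logs/myapp
    agent.sinks.hdfs-out.hdfs.fileType = DataStream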
As you know, Apache Kafka is a publish-subscribe messaging system. You can send messages from your application; to do so you can use the Kafka clients or the Kafka REST API.
In short, your application can read the log files and send their contents to Kafka topics.
To process these logs, you can use Apache Storm. You can find many integrated solutions for this purpose, and by using Storm you can add any logic you need to your stream processing.
You can find plenty of useful, detailed information about Storm-Kafka integration.
Also, to put your processed logs into HDFS, you can easily integrate Storm with Hadoop. You can check this repo for it.
Kafka was developed to support high-volume event streams such as real-time log aggregation. From the Kafka documentation:
Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption
I also got this little piece of information from this nice article, which is very similar to your use case:
Today, Kafka has been used in production at LinkedIn for a number of projects. There are both offline and online usage. In the offline case, we use Kafka to feed all activity events to our data warehouse and Hadoop, from which we then run various batch analysis

Best Practice: AWS ftp with file processing

I'm looking for some direction on an AWS architectural decision. My goal is to allow users to FTP a file to an EC2 instance and then run some analysis on the file. My focus is to build this in as service-oriented a way as possible, and in the future to scale it out for multiple clients, where each would have their own FTP server and processing queue with no co-mingling of data.
Currently I have a dev EC2 instance with vsftpd installed and a node.js process running Chokidar that is continuously watching for new files to be dropped. When that file drops I'd like for another server or group of servers to be notified to get the file and process it.
Should the FTP server move the file to S3 and then use SQS to let the pool of processing servers know that it's ready for processing? Should I use SQS and then have the pool of servers SSH into the FTP instance (or some other approach) to get the file rather than use S3 as an intermediary? Are there better approaches?
Any guidance is very much appreciated. Feel free to school me on any alternate ideas that might save money at high file volume.
I'd segregate it right down into small components.
Load balancer
FTP Servers in scaling group
Daemon on FTP servers to move the file to S3 and then queue a job (see the sketch below)
Processing servers in scaling group
This way you can scale the ftp servers if necessary, or scale the processing servers (on SQS queue length or processor utilisation). You may end up with one ftp server and 5 processing servers, or vice versa - but at least this way you only scale at the bottleneck.
The other thing you may want to look at is DataPipeline - which (whilst not knowing the details of your job) sounds like it's tailor made for your use case.
S3 and queues are cheap, and it gives you more granular control around the different components to scale as appropriate. There are potentially some smarts around wildcard policies and IAM you could use to tighten the data segregation too.
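A rough sketch of that daemon step (the third bullet above), using the AWS SDK for C++. The bucket name, key, queue URL, and file path are placeholders, and credentials and region are assumed to come from the environment:

    #include <aws/core/Aws.h>
    #include <aws/s3/S3Client.h>
    #include <aws/s3/model/PutObjectRequest.h>
    #include <aws/sqs/SQSClient.h>
    #include <aws/sqs/model/SendMessageRequest.h>
    #include <fstream>

    int main() {
        Aws::SDKOptions options;
        Aws::InitAPI(options);
        {
            Aws::S3::S3Client s3;
            Aws::SQS::SQSClient sqs;

            const Aws::String bucket = "my-ftp-dropbox";        // assumption
            const Aws::String key = "client-a/upload-123.csv";  // assumption
            const Aws::String queueUrl =
                "https://sqs.us-east-1.amazonaws.com/123456789012/file-jobs"; // assumption

            // 1) Upload the dropped file to S3.
            Aws::S3::Model::PutObjectRequest put;
            put.SetBucket(bucket);
            put.SetKey(key);
            auto body = Aws::MakeShared<Aws::FStream>(
                "upload", "/srv/ftp/upload-123.csv",
                std::ios_base::in | std::ios_base::binary);
            put.SetBody(body);
            auto putOutcome = s3.PutObject(put);

            // 2) Tell the processing pool where the file lives.
            if (putOutcome.IsSuccess()) {
                Aws::SQS::Model::SendMessageRequest msg;
                msg.SetQueueUrl(queueUrl);
                msg.SetMessageBody("s3://" + bucket + "/" + key);
                sqs.SendMessage(msg);
            }
        }
        Aws::ShutdownAPI(options);
        return 0;
    }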
Ideally I would try to process the file on the server where it is currently placed.
This will save a lot of network traffic and CPU load.
However, if you want one of the servers to act like a reverse proxy and load balance across a farm of servers, then I would notify the chosen server with an HTTP call that the file has arrived. I would make the file available via FTP (since you already have a working vsftpd, that will not be a problem) and include the file's FTP URL in the HTTP call, so the server that will do the processing can fetch the file and start working on it immediately.
This way you will save money by not using any extra S3 or SQS or any other additional services.
If the farm is made up of servers of equal capability, then the algorithm for distributing the load should be round-robin; if the servers have different capacities, then the load should be distributed according to each server's performance.
For example, if server ONE has 3 times the performance of server THREE, and server TWO has 2 times the performance of server THREE, then you can do:
1: Server ONE - forward 3 request
2: Server TWO - forward 2 request
3: Server THREE - forward 1 request
4: GOTO 1
Ideally there should be feedback from the servers reporting their current load, so the load balancer knows who is the best candidate for the next request instead of using hard-coded algorithms, since the requests probably do not need exactly equal amounts of resources to be processed. But this starts to look like the MapReduce paradigm and is out of scope... at least for the beginning. :)
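For what it's worth, a toy sketch of the hard-coded weighted scheme above: the 3/2/1 weights come from the example, the server names are placeholders, and forward() stands in for the actual HTTP notification:

    #include <iostream>
    #include <string>
    #include <vector>

    struct Backend {
        std::string name;
        int weight;  // relative performance, e.g. 3:2:1
    };

    int main() {
        std::vector<Backend> farm = {
            {"server-one", 3}, {"server-two", 2}, {"server-three", 1}};

        // Expand the weights into a repeating schedule: ONE,ONE,ONE,TWO,TWO,THREE,...
        std::vector<const Backend *> schedule;
        for (const auto &b : farm)
            for (int i = 0; i < b.weight; ++i) schedule.push_back(&b);

        std::size_t next = 0;
        auto forward = [&](const std::string &file_url) {
            const Backend *target = schedule[next];
            next = (next + 1) % schedule.size();
            std::cout << "forward " << file_url << " -> " << target->name << "\n";
        };

        for (int i = 0; i < 12; ++i)
            forward("ftp://ftp-host/file-" + std::to_string(i));
        return 0;
    }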
If you want to stick with S3, you could use RioFS to mount an S3 bucket as a local filesystem on your FTP and processing servers. Then you could do the usual file operations (e.g. get a notification when a file is created or modified).
Like RioFS, s3fs-fuse uses FUSE to provide a filesystem that is (virtually) locally mountable; s3fs-fuse is currently well maintained.
In contrast, swineherd-fs, a filesystem abstraction for S3, HDFS and the normal filesystem, allows a different (locally virtual) approach:
All filesystem abstractions implement the following core methods, based on standard UNIX functions, and Ruby's File class [...].
As the 'local abstraction layer' has only been thoroughly tested on Ubuntu Linux, I'd personally go for a more mainstream, solid, less experimental stack, i.e.:
a (sandboxed) vsftpd for FTP transfers,
(optionally) listen for filesystem changes, and finally
trigger middleman-s3_sync to do the cloud lift (or let it synchronize everything by itself).
Alternatively, and more experimentally, there are some GitHub projects that might fit:
s3-ftp: an FTP server front-end that forwards all uploads to an S3 bucket (Clojure)
ftp-to-s3: an FTP server that uploads every file it receives to S3 (Python)
ftp-s3: an FTP front-end to S3 in Python
Last but not least, I do recommend the donationware Cyberduck if you're on OS X: a comfortable (and very FTP-like) client that interfaces with S3 directly. For Windows there is freeware (with an optional PRO version) named S3 Browser.