I need to build a Kinesis Producer app that simply puts data into a Kinesis Stream. The app will need to connect to a remote host and maintain a TCP socket to which data will be pushed from the remote host. There is very little data transformation, so the producer application will be very simple... I know I could set up an EC2 instance for this, but if there's a better way I'd like to explore that.
Examples:
You can build a producer on AWS Lambda, but since I have to maintain a long-running TCP connection, that wouldn't work.
You can maintain a connection to a WebSocket with AWS IoT and invoke a Lambda function on each message, but my connection is just a standard TCP connection.
Question: What other products in the AWS suite could I use to build a producer?
There are no suitable managed options here. If your task is to...
originate and maintain a persistent TCP connection to a third-party remote device that you don't control,
consume whatever payload comes down the pipe,
process/transform it, and
feed it to code that serves as a Kinesis producer
...then you need a server, because no single service does all of these things. EC2 is the product you are looking for.
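To make that concrete, here is a minimal sketch of such a producer in Python with boto3. The remote host address, stream name, and newline-delimited framing are all assumptions for illustration:

```python
import socket
import boto3

# Hypothetical endpoint and stream name -- replace with your own.
REMOTE_HOST = "remote.example.com"
REMOTE_PORT = 9000
STREAM_NAME = "my-stream"

kinesis = boto3.client("kinesis")

# Open the long-running TCP connection to the remote host.
with socket.create_connection((REMOTE_HOST, REMOTE_PORT)) as sock:
    buffer = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break  # remote host closed the connection
        buffer += chunk
        # Treat each newline-delimited payload as one Kinesis record.
        while b"\n" in buffer:
            record, buffer = buffer.split(b"\n", 1)
            kinesis.put_record(
                StreamName=STREAM_NAME,
                Data=record,
                PartitionKey="tcp-feed",  # single key; a shard-aware key scheme scales better
            )
```

With the KPL you would get batching and retries for free; with raw PutRecord calls like this, you own those concerns yourself.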
The Producer code typically runs on the thing that is the source of the information you wish to capture.
For example:
When capturing network events, the Producer should be the networking equipment that is monitoring traffic.
When capturing retail purchases, the Producer is the system processing the transactions.
When capturing earth tremors, the Producer is the equipment that is monitoring vibrations.
In your case, the remote host should be the Producer, which sends the data to Kinesis. Rather than having the remote host push data to a Lambda function, simply have the remote host push directly to Kinesis.
Update
You mention Kinesis Agent:
Kinesis Agent is a stand-alone Java software application that offers an easy way to collect and send data to Kinesis Firehose.
If you are using Amazon Kinesis Firehose, then the Kinesis Agent can be your Producer. It sends the data to the Firehose. Or, you can write your own Producer for Firehose.
From Writing to a Kinesis Firehose Delivery Stream Using the AWS SDK:
You can use the Kinesis Firehose API to send data to a Kinesis Firehose delivery stream using the AWS SDK for Java, .NET, Node.js, Python, or Ruby.
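For instance, with the Python SDK (boto3) a single put to a delivery stream looks roughly like this; the delivery stream name is a hypothetical placeholder:

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream name -- replace with your own.
response = firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": b"example payload\n"},
)
print(response["RecordId"])  # Firehose returns an ID for each accepted record
```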
If you are using Amazon Kinesis Streams, you will need to write your own Producer. From Producers for Amazon Kinesis Streams:
A producer puts data records into Kinesis streams. For example, a web server sending log data to a Kinesis stream is a producer.
So, a Producer is just the term applied to whatever sends the data into Kinesis, and it is retrieved by a Consumer.
A couple of options:
You may be able to use AWS IoT with a Kinesis action for your remote host to push into a Kinesis stream. In this case your remote app would be a device that talks directly to the AWS IoT infrastructure. You'd then set up a rule to forward all of the messages to a Kinesis stream for processing. See https://aws.amazon.com/iot-platform/how-it-works/.
A benefit of this is that you no longer have to host a producer app anywhere. But you would need to be able to modify the app running on the remote host.
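A sketch of creating such a rule with boto3 follows; the rule name, topic filter, stream name, and role ARN are hypothetical placeholders you would replace with your own:

```python
import boto3

iot = boto3.client("iot")

# Hypothetical names -- substitute your own topic, stream, and IAM role.
iot.create_topic_rule(
    ruleName="forward_to_kinesis",
    topicRulePayload={
        "sql": "SELECT * FROM 'devices/+/telemetry'",
        "actions": [
            {
                "kinesis": {
                    "streamName": "my-stream",
                    "partitionKey": "${topic()}",  # shard by source topic
                    "roleArn": "arn:aws:iam::123456789012:role/iot-kinesis-role",
                }
            }
        ],
    },
)
```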
You don't have to use the Kinesis Producer Library (KPL); your data source could simply make repeated calls to PutRecord or PutRecords. Again, this would require modifications to the remote app.
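For example, a batched PutRecords call via boto3 might look like the following sketch (stream name and payloads are made up; PutRecords accepts up to 500 records per call):

```python
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical batch of 20 records.
records = [
    {"Data": f"event-{i}".encode(), "PartitionKey": str(i)}
    for i in range(20)
]
response = kinesis.put_records(StreamName="my-stream", Records=records)
print("Failed records:", response["FailedRecordCount"])  # retry these if > 0
```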
Or, as you know, you could run your KPL app on an EC2 instance and talk to it over the network. This may give you more control over how the thing runs and would require fewer modifications to the remote app, but you now have a greater DevOps burden.
Related
My company is doing a POC on some streaming data and one of the tasks is sending data from AWS Kinesis to Azure Event Hubs.
Has anyone tried to do something like this before?
I was thinking of a Lambda function listening to Kinesis Firehose and sending the data to Event Hubs, but I have no experience with Azure at all and I don't even know if this is possible.
Yes, this is very much possible.
An inter-cloud environment where data is streamed between the two services can be achieved using AWS Kinesis and Azure Event Hubs.
You can stream data from Amazon Kinesis to Azure Event Hubs in real time, using a serverless model to process and transfer events without needing to manage any application on an on-premises server.
You will need the connection string, SharedAccessKeyName, and SharedAccessKey from the Azure Event Hub in order to send data to it. Also, make sure the Event Hub can receive data from the IP address you are running the program from.
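As a minimal sketch of the Lambda approach, assuming the function is triggered by a Kinesis Data Stream and the azure-eventhub Python package is bundled with it (the environment variable names are hypothetical):

```python
import base64
import os

from azure.eventhub import EventData, EventHubProducerClient

def lambda_handler(event, context):
    # Connection details come from hypothetical environment variables.
    producer = EventHubProducerClient.from_connection_string(
        conn_str=os.environ["EVENTHUB_CONNECTION_STRING"],
        eventhub_name=os.environ["EVENTHUB_NAME"],
    )
    batch = producer.create_batch()
    for record in event["Records"]:
        # Kinesis delivers record payloads base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        batch.add(EventData(payload))
    producer.send_batch(batch)
    producer.close()
```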
Refer to this third-party tutorial to accomplish the same.
We have a third-party Apache Kafka producer which is on premises. We need to read messages from the third-party Kafka component into AWS using AWS services and trigger a Lambda function. What approach should be taken to consume messages from Kafka in AWS?
You have a few options. You could just schedule a Lambda call every few seconds and read from the Kafka topic using your favourite language. That is quite simple, and depending on the data volume you are getting, it might be good enough.
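A sketch of that scheduled approach with the kafka-python package; the broker address, topic, and group ID are placeholders, and the Lambda would need network access (e.g., VPN or Direct Connect) to the on-premises brokers:

```python
from kafka import KafkaConsumer

def lambda_handler(event, context):
    # Hypothetical broker and topic -- substitute your on-premises details.
    consumer = KafkaConsumer(
        "my-topic",
        bootstrap_servers=["broker.example.com:9092"],
        group_id="lambda-poller",      # committed offsets persist across invocations
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,      # stop iterating once the topic is drained
    )
    for message in consumer:
        process(message.value)
    consumer.close()

def process(payload: bytes) -> None:
    print(payload)  # your processing logic goes here
```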
Alternatively, you can install a community-contributed Kafka connector for Lambda and just invoke Lambda directly.
Or you can use the awslabs Kafka connector for Kinesis, which relays messages from Kafka into Kinesis Data Streams or Kinesis Firehose, where you can use Lambda natively.
I have a server which can only process 20 requests at a time. When lots of requests are coming in, I want to store the request data in some queue, then read a set of requests (i.e., 20) and process them as a batch. What would be the ideal way to do that? Using SQS, or Kinesis? I'm totally confused.
SQS (Simple Queue Service) is for queuing messages 1:1 (once a message is consumed, it is removed from the queue).
Kinesis is for low-latency, high-volume data streaming, typically 1:N (many consumers of the same messages).
Because Kinesis also stores data for a period of time, the two are often confused, but their architectural patterns are totally different.
Queue => SQS.
Data Streams => Kinesis.
Taken from https://aws.amazon.com/kinesis/data-streams/faqs/ :
Q: How does Amazon Kinesis Data Streams differ from Amazon SQS?
Amazon Kinesis Data Streams enables real-time processing of streaming big data. It provides ordering of records, as well as the ability to read and/or replay records in the same order to multiple Amazon Kinesis Applications. The Amazon Kinesis Client Library (KCL) delivers all records for a given partition key to the same record processor, making it easier to build multiple applications reading from the same Amazon Kinesis data stream (for example, to perform counting, aggregation, and filtering).
Amazon Simple Queue Service (Amazon SQS) offers a reliable, highly scalable hosted queue for storing messages as they travel between computers. Amazon SQS lets you easily move data between distributed application components and helps you build applications in which messages are processed independently (with message-level ack/fail semantics), such as automated workflows.
Q: When should I use Amazon Kinesis Data Streams, and when should I use Amazon SQS?
We recommend Amazon Kinesis Data Streams for use cases with requirements that are similar to the following:
- Routing related records to the same record processor (as in streaming MapReduce). For example, counting and aggregation are simpler when all records for a given key are routed to the same record processor.
- Ordering of records. For example, you want to transfer log data from the application host to the processing/archival host while maintaining the order of log statements.
- Ability for multiple applications to consume the same stream concurrently. For example, you have one application that updates a real-time dashboard and another that archives data to Amazon Redshift. You want both applications to consume data from the same stream concurrently and independently.
- Ability to consume records in the same order a few hours later. For example, you have a billing application and an audit application that runs a few hours behind the billing application. Because Amazon Kinesis Data Streams stores data for up to 7 days, you can run the audit application up to 7 days behind the billing application.
We recommend Amazon SQS for use cases with requirements that are similar to the following:
- Messaging semantics (such as message-level ack/fail) and visibility timeout. For example, you have a queue of work items and want to track the successful completion of each item independently. Amazon SQS tracks the ack/fail, so the application does not have to maintain a persistent checkpoint/cursor. Amazon SQS will delete acked messages and redeliver failed messages after a configured visibility timeout.
- Individual message delay. For example, you have a job queue and need to schedule individual jobs with a delay. With Amazon SQS, you can configure individual messages to have a delay of up to 15 minutes.
- Dynamically increasing concurrency/throughput at read time. For example, you have a work queue and want to add more readers until the backlog is cleared. With Amazon Kinesis Data Streams, you can scale up to a sufficient number of shards (note, however, that you'll need to provision enough shards ahead of time).
- Leveraging Amazon SQS's ability to scale transparently. For example, you buffer requests and the load changes as a result of occasional load spikes or the natural growth of your business. Because each buffered request can be processed independently, Amazon SQS can scale transparently to handle the load without any provisioning instructions from you.
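For the batch-of-20 scenario in the question, an SQS-based consumer could look like the following sketch (the queue URL is a placeholder; note that receive_message returns at most 10 messages per call, so gathering 20 takes at least two polls):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

def process(body: str) -> None:
    print(body)  # your processing logic goes here

def fetch_batch(size: int = 20) -> list:
    batch = []
    while len(batch) < size:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=min(10, size - len(batch)),  # 10 is the API maximum
            WaitTimeSeconds=5,  # long polling
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue is (momentarily) empty
        batch.extend(messages)
    return batch

for msg in fetch_batch():
    process(msg["Body"])
    # Delete only after successful processing (ack semantics);
    # unprocessed messages reappear after the visibility timeout.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```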
My company currently uses Azure for our data warehousing infrastructure. In the past we have used Azure Event Hubs for streaming data. When working on previous projects this hasn't been an issue: we just provide the connection details and they start sending us data.
However, we have recently started working on a new project where most of the client's infrastructure is hosted on AWS, and we have been asked to set up an Amazon Kinesis endpoint instead, as they do not support Azure Event Hubs.
I don't know much about sending the data, but is it asking a lot to send to an Event Hub instead of Kinesis?
My suggestion for this is to introduce a middle layer which understands both Kinesis and Event Hubs. One such middle layer is Spring Cloud Stream. It provides a binder abstraction that supports various messaging middleware, such as Kafka, Kinesis, and Event Hubs.
I'd like to use AWS IoT to manage a grid of devices. Data from each device must be sent to a queue service (RabbitMQ) hosted on an EC2 instance, which is the starting point for a real-time control application. I read how to make a rule to write data to another service: here.
However there isn't an example for EC2. Using the AWS IoT service, how can I connect to a service on EC2?
Edit:
I have a real-time application developed with Storm that consumes data from RabbitMQ and puts the result of the computation in another RabbitMQ queue. RabbitMQ and Storm are on EC2. I have devices producing data and connected to IoT. Data produced by the devices must be redirected to the queue on EC2 that is the starting point of my application.
I'm sorry if I was not clear.
AWS IoT supports pushing the data directly to other AWS services. As you have probably figured out by now, publishing to third-party APIs isn't directly supported.
Of the choices AWS offers, Lambda, SQS, SNS, and Kinesis would probably work best for you.
With Lambda you could directly forward the incoming message using one of RabbitMQ's client libraries.
With SQS you would put the message into an AWS queue first and then poll this queue, transferring it to RabbitMQ.
Kinesis would allow more sophisticated processing, but is probably too complex.
I suggest you write a Lambda in the programming language of your choice, using one of the numerous RabbitMQ client libraries.
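A minimal sketch of such a Lambda in Python using the pika client; the broker host and queue name are hypothetical:

```python
import json
import pika

def lambda_handler(event, context):
    # Hypothetical broker address -- the EC2 instance running RabbitMQ.
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="rabbitmq.internal.example.com")
    )
    channel = connection.channel()
    channel.queue_declare(queue="iot-ingress", durable=True)
    # The IoT rule delivers the device message as the Lambda event payload.
    channel.basic_publish(
        exchange="",
        routing_key="iot-ingress",
        body=json.dumps(event),
    )
    connection.close()
```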