My goal is to connect to various websockets (30 more or less) and save the raw data into S3 in the cheapest and most efficient way possible.
I know that there is Kinesis to capture and process streaming data.
What I'm missing is what comes before Kinesis, since I need a program running that listens to the WS connections and forwards the data into Kinesis.
I have tried using an EC2 instance and containers on Fargate to run the websocket connections and both options work.
I was wondering if there is another way (recommended way) of doing this.
Related
My company is doing a POC on some streaming data and one of the tasks is sending data from AWS Kinesis to Azure Event Hubs.
Has anyone tried to do something like this before?
I was thinking of a Lambda function listening to Kinesis Firehose and sending the data to Event Hubs, but I have no experience with Azure at all and I don't even know if this is possible.
Yes, this is very much possible.
An inter-cloud setup where data is streamed between the two services can be achieved with AWS Kinesis and Azure Event Hubs.
You can stream data from Amazon Kinesis directly to Azure Event Hubs in real time, using a serverless model to process and transfer events without having to manage any application running on an on-premise server.
You will need the connection string, SharedAccessKeyName, and SharedAccessKey from the Azure Event Hub in order to send data to it. Also, make sure the Event Hub can receive data from the IP address you are running the program from.
Refer to this third-party tutorial to accomplish the same.
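To make the shape of this concrete, here is a minimal sketch (not the tutorial's code) of a Python Lambda handler triggered by a Kinesis stream that forwards each record to an Event Hub. It assumes the azure-eventhub v5 SDK is packaged with the function, and the environment variable names are placeholders you would choose yourself:

```python
import base64
import os

from azure.eventhub import EventHubProducerClient, EventData  # packaged with the Lambda deployment

# Placeholder environment variables holding the Event Hub connection details
CONN_STR = os.environ["EVENTHUB_CONNECTION_STRING"]   # includes SharedAccessKeyName/SharedAccessKey
EVENTHUB_NAME = os.environ["EVENTHUB_NAME"]

def handler(event, context):
    """Triggered by a Kinesis stream; forwards each record to Azure Event Hubs."""
    producer = EventHubProducerClient.from_connection_string(CONN_STR, eventhub_name=EVENTHUB_NAME)
    with producer:
        batch = producer.create_batch()
        for record in event["Records"]:
            # Kinesis records arrive base64-encoded in the Lambda event
            payload = base64.b64decode(record["kinesis"]["data"])
            batch.add(EventData(payload))
        producer.send_batch(batch)
```

Note this attaches the Lambda to a Kinesis Data Stream trigger; if your data lands in Firehose instead, you would point Firehose and the stream at the same source or adapt the trigger accordingly.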
I'm looking to connect to a public websocket and receive the data into AWS Lambda or SNS.
Looking at similar posts, it seems the only way to do this is via EC2, ECS, etc. Those posts were from a few years ago, so I'd first like to see if anyone has found a way to do this in a serverless manner.
Everything you are thinking of doing in EC2/ECS you should be able to do in AWS Lambda plus a CloudWatch rule that runs this websocket data ingestion logic on whatever schedule you need. That said, Lambda functions are optimized for burst workloads and non-steady traffic; if you need to maintain a consistent real-time connection to this websocket and don't want to manage servers, your best option will be Fargate, which offers serverless compute for containers that you can use to build an application like yours.
Also, I highly recommend looking into AWS Copilot for a simple way to manage/deploy your application: https://aws.github.io/copilot-cli/
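As a rough illustration of the Fargate approach, here is a small Python worker sketch (using the third-party websockets package and boto3) that holds one long-lived connection open and forwards each message into a Kinesis stream; the endpoint URL and stream name are assumptions:

```python
import asyncio
import os

import boto3
import websockets  # third-party package: pip install websockets

WS_URL = os.environ.get("WS_URL", "wss://example.com/feed")   # placeholder endpoint
STREAM = os.environ.get("STREAM_NAME", "raw-websocket-data")  # placeholder Kinesis stream

async def ingest():
    kinesis = boto3.client("kinesis")
    # Reconnect forever; a Fargate task keeps this process running continuously.
    while True:
        try:
            async with websockets.connect(WS_URL) as ws:
                async for message in ws:
                    data = message if isinstance(message, bytes) else message.encode()
                    kinesis.put_record(StreamName=STREAM, Data=data, PartitionKey=WS_URL)
        except Exception as exc:
            print(f"connection dropped: {exc}; retrying in 5s")
            await asyncio.sleep(5)

if __name__ == "__main__":
    asyncio.run(ingest())
```

One such task per websocket (or a small pool handling several connections) keeps the ingestion serverless from an operations standpoint while still holding persistent connections.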
Our .net core web app currently accepts websocket connections and pushes out data to clients on certain events (edit, delete, create of some of our entities).
We would like to load balance this application now but foresee a problem in how we handle the socket connections. Basically, if I understand correctly, only the node that handles a specific event will push data out to its clients and none of the clients connected to the other nodes will get the update.
What is a generally accepted way of handling this problem? The best way I can think of is to also send that same event to all nodes in a cluster so that they can also update their clients. Is this possible? How would I know about the other nodes in the cluster?
This will be hosted in AWS.
You need to distribute the event to all nodes in the cluster, so that they can each push the update out to their websocket clients. A common way to do this on AWS is to use SNS to distribute the event to all nodes. You could also use ElastiCache Redis Pub/Sub for this.
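Your app is .NET Core, but to illustrate the pattern, here is a rough Python sketch of the Redis Pub/Sub variant: whichever node handles the change publishes the event, and every node subscribes and relays it to its own websocket clients. The channel name, endpoint, and payload shape are made up for the example:

```python
import json
import redis  # pip install redis; point it at the ElastiCache endpoint

r = redis.Redis(host="my-cluster.cache.amazonaws.com", port=6379)  # placeholder endpoint

def publish_entity_event(action, entity_id):
    """Called by the node that handled the edit/delete/create."""
    r.publish("entity-events", json.dumps({"action": action, "id": entity_id}))

def listen_and_relay(push_to_local_clients):
    """Run on every node; relays each event to that node's websocket clients."""
    pubsub = r.pubsub()
    pubsub.subscribe("entity-events")
    for message in pubsub.listen():
        if message["type"] == "message":
            push_to_local_clients(json.loads(message["data"]))
```

The SNS variant is the same idea, except each node exposes an HTTP(S) endpoint subscribed to the topic instead of listening on a Redis channel.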
As an alternative to SNS or Redis, you could use a Kinesis Stream. But before going that route, read about Apache Kafka, because the AWS docs don't do a good job of explaining why you'd use Kinesis for anything other than log ingestion.
To summarize: Kinesis is a "persistent transaction log": everything that you write to it is stored for some amount of time (by default a day, but you can pay for up to 7 days) and is readable by any number of consumers.
In your use case, each worker process would start reading at the then-current end of the stream and continue reading (and distributing events) until shut down.
The main issue that I have with Kinesis is that there's no "long poll" mechanism like there is with SQS. A given read request may or may not return data. What it does tell you is whether you're currently at the end of the stream; if not, you have to keep reading until you are. And, of course, Amazon will throttle you if you read too fast. As a result, your code tends to have sleeps.
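To make that polling behavior concrete, here is a hedged boto3 sketch of what the consumer loop tends to look like (the stream name and the single-shard assumption are placeholders): you keep calling GetRecords, check how far behind the tip you are, and sleep when caught up to avoid throttling.

```python
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM = "entity-events"  # placeholder stream name

def handle(data: bytes):
    print(data)  # placeholder: push the event to this node's websocket clients

# Start at the current end of the stream (LATEST), as described above.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        handle(record["Data"])
    iterator = resp["NextShardIterator"]
    if resp["MillisBehindLatest"] == 0:
        time.sleep(1)  # no long poll: back off when we've reached the end of the stream
```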
I need to build a Kinesis Producer app that simply puts data into a Kinesis Stream. The app will need to connect to a remote host and maintain a TCP socket to which data will be pushed from the remote host. There is very little data transformation, so the producer application will be very simple... I know I could set up an EC2 instance for this, but if there's a better way I'd like to explore that.
Examples:
You can build a producer on AWS Lambda, but since I have to maintain a long-running TCP connection, that wouldn't work.
You can maintain a connection to a WebSocket with AWS IoT and invoke a Lambda function on each message, but my connection is just a standard TCP connection.
Question: What other products in the AWS suite of products that I could use to build a producer?
There are no suitable managed options here. If your task is to...
originate and maintain a persistent TCP connection to a third-party remote device that you don't control,
consume whatever payload comes down the pipe,
process/transform it, and
feed it to code that serves as a Kinesis producer
...then you need a server, because there is not a service that does all of these things. EC2 is the product you are looking for.
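For a sense of scale, a producer like that can be quite small. Below is a rough Python sketch (the host, port, stream name, and newline framing are all assumptions) of the kind of code that would run on the EC2 instance: it originates the TCP connection, reads delimited payloads, and puts them into the stream.

```python
import socket
import boto3

REMOTE_HOST = "203.0.113.10"   # placeholder: the third-party host that pushes data
REMOTE_PORT = 9000             # placeholder port
STREAM_NAME = "raw-feed"       # placeholder Kinesis stream

def run():
    kinesis = boto3.client("kinesis")
    with socket.create_connection((REMOTE_HOST, REMOTE_PORT)) as sock:
        buffer = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break  # remote host closed the connection; a supervisor would restart us
            buffer += chunk
            # Assume newline-delimited messages; adjust the framing to the real protocol.
            while b"\n" in buffer:
                line, buffer = buffer.split(b"\n", 1)
                kinesis.put_record(StreamName=STREAM_NAME, Data=line, PartitionKey="remote-host")

if __name__ == "__main__":
    run()
```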
The Producer code typically runs on the thing that is the source of the information you wish to capture.
For example:
When capturing network events, the Producer should be the networking equipment that is monitoring traffic.
When capturing retail purchases, the Producer is the system processing the transactions.
When capturing earth tremors, the Producer is the equipment that is monitoring vibrations.
In your case, the remote host should be the Producer, which sends the data to Kinesis. Rather than having the remote host push data to a Lambda function, simply have the remote host push directly to Kinesis.
Update
You mention Kinesis Agent:
Kinesis Agent is a stand-alone Java software application that offers an easy way to collect and send data to Kinesis Firehose.
If you are using Amazon Kinesis Firehose, then the Kinesis Agent can be your Producer. It sends the data to the Firehose. Or, you can write your own Producer for Firehose.
From Writing to a Kinesis Firehose Delivery Stream Using the AWS SDK:
You can use the Kinesis Firehose API to send data to a Kinesis Firehose delivery stream using the AWS SDK for Java, .NET, Node.js, Python, or Ruby.
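For example, a minimal Firehose producer in Python with boto3 looks roughly like this; the delivery stream name and payload are assumptions:

```python
import boto3

firehose = boto3.client("firehose")

# Placeholder delivery stream; Firehose buffers the records and delivers them to S3/Redshift/etc.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": b'{"sensor": "a1", "value": 42}\n'},
)
```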
If you are using Amazon Kinesis Streams, you will need to write your own Producer. From Producers for Amazon Kinesis Streams:
A producer puts data records into Kinesis streams. For example, a web server sending log data to a Kinesis stream is a producer.
So, a Producer is just the term applied to whatever sends the data into Kinesis, and it is retrieved by a Consumer.
A couple of options:
You may be able to use AWS IoT with a Kinesis action so your remote host pushes into a Kinesis stream. In this case your remote app would be a device that talks directly to the AWS IoT infrastructure. You'd then set up a rule to forward all of the messages to a Kinesis stream for processing (a rough sketch of creating such a rule follows after these options). See https://aws.amazon.com/iot-platform/how-it-works/.
A benefit of this is that you no longer have to host a producer app anywhere. But you would need to be able to modify the app running on the remote host.
You don't have to use the Kinesis Producer Library (KPL); your data source could simply make repeated calls to PutRecord or PutRecords. Again, this would require modifications to the remote app.
Or, as you know, you could run your KPL app on an EC2 instance and talk to it over the network. This may give you more control over how the thing runs and would require fewer modifications to the remote app. But you now have a greater DevOps burden.
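As a sketch of the first option above, an IoT topic rule with a Kinesis action can be created with boto3 roughly like this; the rule name, topic filter, stream name, and role ARN are all placeholders:

```python
import boto3

iot = boto3.client("iot")

# Placeholder names: the rule forwards every message published on devices/+/data to a Kinesis stream.
iot.create_topic_rule(
    ruleName="forward_to_kinesis",
    topicRulePayload={
        "sql": "SELECT * FROM 'devices/+/data'",
        "actions": [
            {
                "kinesis": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-to-kinesis",  # placeholder role
                    "streamName": "device-data",                                  # placeholder stream
                    "partitionKey": "${topic()}",
                }
            }
        ],
    },
)
```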
I'd like to use AWS IoT to manage a grid of devices. Data from each device must be sent to a queue service (RabbitMQ) hosted on an EC2 instance, which is the starting point for a real-time control application. I read how to make a rule to write data to other services: Here
However, there isn't an example for EC2. Using the AWS IoT service, how can I connect to a service on EC2?
Edit:
I have a real-time application developed with Storm that consumes data from RabbitMQ and puts the result of the computation in another RabbitMQ queue. RabbitMQ and Storm run on EC2. I have devices producing data and connected to IoT. Data produced by the devices must be redirected to the queue on EC2 that is the starting point of my application.
I'm sorry if I was not clear.
AWS IoT supports pushing data directly to other AWS services. As you have probably figured out by now, publishing to third-party APIs isn't directly supported.
Of the choices AWS offers, Lambda, SQS, SNS, and Kinesis would probably work best for you.
With Lambda you could directly forward the incoming message using one of RabbitMQ's client libraries.
With SQS you would put it into an AWS queue first and then poll that queue, transferring the messages to RabbitMQ.
Kinesis would allow more sophisticated processing, but is probably too complex.
I suggest you write a Lambda in the programming language of your choice using one of the numerous RabbitMQ client libraries.
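As a rough sketch of that approach, a Python Lambda using the pika client (packaged with the function, and given VPC access to the EC2 instance) could publish each incoming IoT message to your RabbitMQ queue; the host, credentials, and queue name below are placeholders:

```python
import json
import os

import pika  # RabbitMQ client library, packaged with the Lambda deployment

RABBIT_HOST = os.environ["RABBITMQ_HOST"]               # placeholder: private IP/DNS of the EC2 instance
QUEUE_NAME = os.environ.get("QUEUE_NAME", "iot-ingest")  # placeholder queue name

def handler(event, context):
    """Invoked by an AWS IoT rule action; forwards the device payload to RabbitMQ."""
    credentials = pika.PlainCredentials(os.environ["RABBITMQ_USER"], os.environ["RABBITMQ_PASS"])
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host=RABBIT_HOST, credentials=credentials)
    )
    try:
        channel = connection.channel()
        channel.queue_declare(queue=QUEUE_NAME, durable=True)
        channel.basic_publish(exchange="", routing_key=QUEUE_NAME, body=json.dumps(event))
    finally:
        connection.close()
```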