Do I need SQS queues to store remote data in the Amazon Web Services (AWS) cloud?

My first question is, do I need SQS queues to receive my remote data, or can it go directly into an Amazon cloud storage solution like S3 or EC2?
Currently, my company uses a third-party vendor to gather and report on our remote data. By remote data, I mean data coming from our machines out in the wilderness. These data are uploaded a few times each day to Amazon Web Services SQS queues (set up by the third-party vendor), and then the vendor polls the data from the queues, removing it and saving it in their own on-premises databases for one year only. This company only provides reporting services to us, so they don't need to store the data long-term.
Going forward, we want to own the data and store it permanently in Amazon Web Services (AWS). Then we want to use machine learning to monitor the data and report any potential problems with the machines.
To repeat my first question, do we need SQS queues to receive this data, or can it go directly into an Amazon cloud storage solution like S3 or EC2?
My second question is, can an SQS queue send data to two different places? That is, can the queue send the data to the third party vendor, and also to an Amazon Web Services database?
I am an analyst/data scientist, so I know how to use the data once it's in a database. I just don't know the best way of getting it into a database.

You don't really need the queue. If you do keep one, a Lambda function can be triggered whenever an item is pushed to the queue, and you can run your custom logic there, whether that is storing the data in S3 (or on an EC2-hosted service) or sending it to any other HTTP service.
The same Lambda function can also send the data on to any other third-party service easily.
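For illustration, here is a minimal sketch (Python/boto3) of that kind of SQS-triggered Lambda handler. The bucket name, vendor URL, and environment variable names are placeholders I've made up; swap in your own.

```python
import json
import os
import urllib.request

import boto3

# Placeholder names -- replace with your own bucket and vendor endpoint.
BUCKET = os.environ.get("DATA_BUCKET", "my-machine-data-bucket")
VENDOR_URL = os.environ.get("VENDOR_URL", "https://vendor.example.com/ingest")

s3 = boto3.client("s3")


def handler(event, context):
    """Triggered by SQS: archive each message to S3 and forward it to the vendor."""
    for record in event["Records"]:
        body = record["body"]

        # Archive the raw message in S3, keyed by the SQS message ID.
        s3.put_object(
            Bucket=BUCKET,
            Key=f"raw/{record['messageId']}.json",
            Body=body.encode("utf-8"),
        )

        # Forward the same payload to the third-party service over HTTP.
        req = urllib.request.Request(
            VENDOR_URL,
            data=body.encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=10)
```

This also speaks to the second question: a single consumer of the queue can write to your own AWS storage and forward the same data to the vendor in one pass.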

Related

Transfer/Replicate Data periodically from AWS Documentdb to Google Cloud Big Query

We are building a customer-facing app. For this app, data is captured by IoT devices owned by a 3rd party and is transferred to us from their server via API calls. We store this data in our AWS DocumentDB cluster. The user app is connected to this cluster with real-time data feed requirements. Note: the data is time-series data.
The thing is, for long-term data storage and for creating analytics dashboards to be shared with stakeholders, our data governance folks are asking us to replicate/copy the data daily from the AWS DocumentDB cluster to their Google Cloud Platform -> BigQuery. Then we can run queries directly on BigQuery to perform analysis and send data to maybe explorer or Tableau to create dashboards.
I couldn't find any straightforward solutions for this. Any ideas, comments or suggestions are welcome. How do I achieve or plan the above replication? And how do I make sure the data is copied efficiently, in terms of memory and pricing? Also, I don't want to disturb the performance of AWS DocumentDB, since it supports our user-facing app.
This solution would need some custom implementation. You can utilize change streams and process the data changes at intervals to send to BigQuery, so there is a data replication mechanism in place for you to run analytics. One of the documented use cases of change streams is analytics with Redshift, so BigQuery should serve a similar purpose.
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
This document also contains a sample Python code for consuming change streams events.
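As a rough illustration of that pattern, here is a minimal Python sketch that reads change stream events and batches them into BigQuery. It assumes change streams are already enabled on the collection (see the AWS doc above); the connection string, database/collection names, BigQuery table ID, and row fields are all placeholders.

```python
"""Sketch: consume DocumentDB change stream events and load them into BigQuery."""
from google.cloud import bigquery
from pymongo import MongoClient

DOCDB_URI = "mongodb://user:pass@my-docdb-cluster:27017/?tls=true"  # placeholder
BQ_TABLE = "my-project.analytics.iot_events"                        # placeholder

mongo = MongoClient(DOCDB_URI)
collection = mongo["mydb"]["events"]
bq = bigquery.Client()

# Process changes in small batches so streaming inserts stay cheap and the
# DocumentDB cluster is not hit with one round trip per document.
batch = []
with collection.watch(full_document="updateLookup") as stream:
    for change in stream:
        doc = change.get("fullDocument") or {}
        batch.append({"event_id": str(doc.get("_id")), "payload": str(doc)})
        if len(batch) >= 100:
            errors = bq.insert_rows_json(BQ_TABLE, batch)
            if errors:
                raise RuntimeError(f"BigQuery insert errors: {errors}")
            batch.clear()
```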

AWS SQS and other services

My company has a messaging system which sends real-time messages in JSON format, and it's not built on AWS.
Our team is trying to use AWS SQS to receive these messages, which will then be stored in DynamoDB.
I'm thinking of using EC2 to read these messages and then save them.
Is there a better solution, or how should I do this? I don't have much experience.
First of all, EC2 is infrastructure in the cloud; it is similar to a physical machine with an OS in a local setup. If you want to create an application that fetches the data from Amazon SQS (messages in JSON format) and pushes it into DynamoDB (a NoSQL database), your design is correct, as both SQS and DynamoDB have thorough JSON support. Once your application is ready, you deploy it on an EC2 machine.
To achieve this, your application needs an async buffered SQS consumer that consumes the messages. Note that the SQS message size limit is 256 KB, so whichever application is publishing the messages needs to keep them under 256 KB.
Please refer to the link below for an SQS consumer:
Is putting sqs-consumer to detect receiveMessage event in SQS scalable?
Once you have consumed a message from the SQS queue, you need to save it in DynamoDB, which you can easily do using a CRUD repository. With a repository you can save the JSON directly to a DynamoDB table, but be sure to configure the provisioned write capacity based on the request rate, because the higher the provisioned capacity, the higher the cost. Please refer to the link below for configuring the write capacity of a table:
DynamoDB read and write capacity units
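The answer above assumes a Java-style repository; as a rough sketch in Python/boto3, adjusting a table's provisioned write capacity looks like the following. The table name and capacity numbers are placeholders, not recommendations.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Placeholder table name and capacity units -- size these to your real request
# rate, since provisioned capacity is billed whether or not it is consumed.
dynamodb.update_table(
    TableName="Messages",
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 10},
)
```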
In general, you'll have a setup something like this:
The EC2 instances (one or more) will read your queue every few seconds to see if there is anything there. If so, they will write this data to DynamoDB.
Based on what you're saying you'll have less than 1,000,000 reads from SQS in a month so you can start out on the free tier for that. You can have a single EC2 instance initially and that can be a very small instance - a T2.micro should be more than sufficient. And you don't need more than a few writes per second on DynamoDB.
The advantage of SQS is that if for some reason your EC2 instance is temporarily unavailable the messages continue to queue up and you won't lose any of them.
From a coding perspective, you don't mention your development environment but there are AWS libraries available for a pretty wide variety of environments. I develop in Java and the code to do this would be maybe 100 lines. I would guess that other languages would be similar. Make sure you look at long polling in the language you're using - it can help to speed up the processing and save you money.
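The answer above mentions Java; as an equally small sketch in Python/boto3, a long-polling consumer that writes each message into DynamoDB could look like this. The queue URL and table name are placeholders, and it assumes each message body is a JSON object that already carries the table's primary key.

```python
import json
from decimal import Decimal

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/incoming-messages"  # placeholder
sqs = boto3.client("sqs")
table = boto3.resource("dynamodb").Table("Messages")  # placeholder table name

while True:
    # WaitTimeSeconds=20 enables long polling, which cuts empty responses and cost.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        # DynamoDB rejects Python floats, so parse numbers as Decimal.
        item = json.loads(msg["Body"], parse_float=Decimal)
        table.put_item(Item=item)
        # Delete only after a successful write so messages are never lost.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```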

How to handle AWS IoT Thing events

I've recently signed up to AWS to test out their IoT platform and after setting up a few Things and going through the documentation I still seem to be missing a crucial bit of information - how to wrangle all the information from my Things?
For example if I were to build a web-based application to display the health/status of all the Things and possibly also interact with a specific Thing, what would be the way to go about it?
Do I register a "dummy" thing that also uses the device SDK to pub/sub to the topics?
Do I take whatever data the Things publish and route it to a shared DB for further processing?
Do I create Lambdas that the Things invoke?
Do I create a stand-alone application that uses the general AWS SDK to connect itself to the IoT platform?
To me the last idea sounds the most viable and "preferred" as I would need two-way interaction, not just passive listening to changes in Things, is that correct?
Generally speaking your setup might be:
IoT device publishes to AWS SQS
Some Service (application or lambda) reads from SQS and processes data (e.g. saves it to DynamoDB)
And then to display data
A stand-alone application reads from DynamoDB and makes the data available to users
There are lots of permutations of this. For example your IoT device can write directly to DynamoDB, then you can process the data from there. I would suggest a better pattern is to write to SQS, as you will have a clean separation between data publishing, processing and storage.
In the first instance I would probably write one application that reads from the SQS, processes the data, stores it in DynamoDB and then provides access to that data for users. A better solution longer term is to have separate systems to process/store the data, and to present that data to users.
Lambda is popular for processing the device data, as it's cost-effective (runs only when needed) and scales well. Your data presentation application is probably a traditional web app running on something like Elastic Beanstalk.
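For the presentation side, here is a minimal Python/boto3 sketch of reading the latest status records for one Thing out of DynamoDB. The table name and key schema (partition key thing_name plus a timestamp sort key) are assumptions for illustration.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("DeviceStatus")  # placeholder name


def latest_status(thing_name: str, limit: int = 20):
    """Return the most recent status records for one Thing, newest first."""
    resp = table.query(
        KeyConditionExpression=Key("thing_name").eq(thing_name),
        ScanIndexForward=False,  # assumed timestamp sort key, descending
        Limit=limit,
    )
    return resp["Items"]


if __name__ == "__main__":
    for item in latest_status("pump-station-12"):  # placeholder Thing name
        print(item)
```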

How to push persistent events from a central server to others?

I have a system that consists of one central server, many mobile clients and many worker server. Each worker server has its own database and may be on the customer infrastructure (when he purchases the on premise installation).
In my current design, mobile clients send updates to the central server, which updates its database. The worker servers periodically pull the central server to get updated information. This "pull model" creates a lot of requests and is still not sufficient, because workers often use outdated information.
I want a "push model", where the central server can "post" updates to "somewhere", which persist the last version of the data. Then workers can "subscribe" to this "somewhere" and be always up-to-date.
The main problems are:
A worker server may be offline when an update happens. When it comes back online, it should receive the updates it missed.
A new worker server may be created and needs to get the updated data, even the data that was posted before it existed.
A bonus point:
Not needing to manage this "somewhere" myself. My application is deployed at AWS, so if there's any combination of services I can use to achieve that would be great. Everything I found has limited time data retention.
The problems with a push model are:
If clients are offline, the central system would need a retry mechanism, which would generate many more requests than a pull model
The clients might be behind firewalls, so cannot receive the message
It is not scalable
A pull model is much more efficient:
Clients should retrieve the latest data when they start, and also at regular intervals
New clients simply connect to the central server -- no need to update the central server with a list of clients (depending upon your security needs)
It is much more scalable
There are several options for serving traffic to pull requests:
Via an API call, powered by AWS API Gateway. You would then need an AWS Lambda function or a web server to handle the request.
Directly from DynamoDB (but the clients would require access credentials)
From an Amazon S3 bucket
Using an S3 bucket has many advantages: Highly scalable, a good range of security options (public; via credentials; via pre-signed URLs), no servers required.
Simply put the data in an S3 bucket and have the clients "pull" the data. You could have one set of files for "every" client, and a specific file for each individual client, thereby enabling individual configuration. Just think of S3 as a very large key-value datastore.
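As a small sketch of the pull side (Python/boto3), each worker could fetch a shared file plus its own file at startup and again on a timer. The bucket name, key layout, and client ID are assumptions for illustration.

```python
import json

import boto3

BUCKET = "central-config-bucket"  # placeholder
CLIENT_ID = "worker-042"          # placeholder

s3 = boto3.client("s3")


def fetch(key: str) -> dict:
    """Download one JSON object from the bucket and parse it."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return json.loads(body)


# Data every worker needs, plus this worker's individual configuration.
shared = fetch("shared/latest.json")
mine = fetch(f"clients/{CLIENT_ID}.json")
```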

Amazon sqs vs custom implementation

We need to sync data between different web servers. The idea is very basic: when one entity is created on one server, it should be sent to all the other servers. What's the right way to do it? We are currently evaluating 2 approaches: amazon's sqs and sns services and custom implementation with some key-value database (like memcached and memqueue). What are the common pitfalls of custom implementations? Any feedback will be highly appreciated.
SQS would work OK if you create a new queue for each server and write the data to each queue. The biggest downside is that you will need each server to poll for new messages.
SNS would work more efficiently because it allows you to broadcast a message to multiple locations. However, it's a one-shot try; if a machine can't receive its notification when SNS sends it, SNS will not try again.
You don't specify how many messages you are sending or what your performance requirements are, but any SQS/SNS system will likely be much, much slower (mostly due to latencies between sending the message and the servers receiving it) than a local memcache/key-value server solution.
A mixed solution would be to use a persistent store (like SimpleDB) and use SNS to alert the servers that new data is available.
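To illustrate the publish side of that mixed approach, here is a minimal Python/boto3 sketch: persist the entity first, then broadcast a small SNS notification so each subscribed server knows to fetch it. The topic ARN and message fields are placeholders.

```python
import json

import boto3

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:entity-updates"  # placeholder

sns = boto3.client("sns")


def announce_new_entity(entity_id: str) -> None:
    # 1. Persist the entity to your durable store (SimpleDB, DynamoDB, ...).
    #    Persistence call omitted here -- the storage choice is up to you.
    # 2. Tell every subscribed server (HTTP endpoint or SQS queue) that new
    #    data is available so it can pull the full entity from the store.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"event": "entity_created", "entity_id": entity_id}),
    )
```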