Move data from OpenSearch to DynamoDB - amazon-web-services

As a total AWS noob, I was wondering if there is a way to “migrate” all the existing data from OpenSearch to DynamoDB?
Since we need storage rather than fast querying, we have been thinking of migrating to DynamoDB. Are there any best practices we could follow to prevent future problems?

This depends on how much data you have in OpenSearch.
If you have a small amount of data, you can use Lambda or EC2.
If you have a large amount of data, use AWS Glue.
AWS Glue has connectors for both DynamoDB and OpenSearch, which makes reading from one and writing to the other really simple.
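For the small-data path, a Lambda- or EC2-hosted script can simply scroll through the OpenSearch index and batch-write the documents into DynamoDB. A minimal sketch, assuming an `id` partition key; the endpoint, table, and index names are placeholders:

```python
def to_dynamo_item(hit):
    """Map one OpenSearch hit to a DynamoDB item, reusing _id as the key (assumed schema)."""
    item = dict(hit["_source"])
    item["id"] = hit["_id"]  # using "id" as the partition-key name is an assumption
    return item

def migrate(os_client, table, index_name, page_size=500):
    """Scroll through the whole index and batch-write every document to DynamoDB."""
    resp = os_client.search(index=index_name, scroll="2m",
                            body={"query": {"match_all": {}}, "size": page_size})
    with table.batch_writer() as batch:  # handles batching and retries for you
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                batch.put_item(Item=to_dynamo_item(hit))
            resp = os_client.scroll(scroll_id=resp["_scroll_id"], scroll="2m")

if __name__ == "__main__":
    import boto3                          # SDKs only needed when actually running
    from opensearchpy import OpenSearch
    os_client = OpenSearch(hosts=["https://my-domain.example.com"])  # placeholder endpoint
    table = boto3.resource("dynamodb").Table("my-table")             # placeholder table
    migrate(os_client, table, "my-index")
```

One thing to watch for: boto3's DynamoDB resource rejects Python floats, so numeric fields may need converting to `Decimal` before writing.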

Related

Getting Amazon DynamoDB data in Athena

I have data in Amazon DynamoDB with frequently updated/added rows (it is updated by receiving events from a Kinesis stream and processing those events with a Lambda).
I want to provide a way for other teams to query that data through Athena.
It has to be as close to real time as possible (minimizing the period between receiving an event and that new/updated information being queryable in Athena).
What is the best/most cost-optimized way to do that?
I know about some of the options:
Scan the table regularly and put the information in Athena. This is going to be quite expensive and not real time.
Start putting the raw events in S3 as well, not just DynamoDB, and make a Glue crawler that scans only the new records. That's going to be closer to real time, but I don't know how to deal with duplicate events (the information is updated quite frequently in DynamoDB; old records get overwritten). I'm also not sure if it is the best way.
Maybe update the Data Catalog directly from the Lambda? I'm not sure that is even possible; I'm still new to the AWS stack.
Any better ways to do that?
You can use Athena Federated Query for this use case.
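Concretely, once the DynamoDB connector Lambda is deployed and registered as a data catalog, the table becomes queryable like any other. A sketch with a hypothetical catalog name and results bucket:

```python
def federated_query(catalog, database, table, limit=10):
    """Build a SQL statement that reads a table through a federated catalog."""
    return f'SELECT * FROM "{catalog}"."{database}"."{table}" LIMIT {limit}'

if __name__ == "__main__":
    import boto3  # only needed when actually submitting the query
    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=federated_query("ddbcatalog", "default", "my_table"),
        ResultConfiguration={"OutputLocation": "s3://my-results-bucket/"},  # placeholder bucket
    )
```

Queries run this way invoke the connector Lambda on demand, so freshness is bounded only by DynamoDB itself, at the cost of a scan per query.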

AWS Redshift or RDS for a Data warehouse?

Right now we have an ETL that extracts info from an API, transforms it, and stores it in one big table in our OLTP database. We want to migrate this table to some OLAP solution. This table is only read to do some calculations that we then store in our OLTP database.
Which service fits the most here?
We are currently evaluating Redshift but have never used the service before. We also thought of a snowflake schema (some kind of fact table with dimensions) in RDS, because the table is only expected to hold 10 GB to 100 GB, but we don't know how far that approach can scale.
IMHO you should do a PoC to see which service is more feasible for you. It really depends on how much data you have and what queries and load you plan to run.
AWS Redshift is intended for OLAP at petabyte or exabyte scale, handling heavy parallel workloads. Redshift can also aggregate data from other data sources (JDBC, S3, ...). However, Redshift is not OLTP; it carries more static server overhead and requires extra skills to manage the deployment.
So without more numbers and use cases, one cannot advise anything. The nice thing about the cloud is that you can try things and see what fits.
AWS Redshift is really great when you mostly read data from the database. Under the hood, Redshift is a column-oriented database, which makes it well suited for analytics. You can transfer all your existing data to Redshift using AWS DMS, a service that reads the binlogs of your existing database and transfers the data automatically; you don't have to do anything. From my personal experience, Redshift is really great.
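To illustrate the DMS route: a replication task pairs a source and target endpoint with a table-mapping document that selects which tables to copy. All identifiers and ARNs below are placeholders:

```python
import json

def table_mappings(schema, tables):
    """Build a DMS table-mapping document that includes the given tables."""
    return json.dumps({"rules": [
        {"rule-type": "selection", "rule-id": str(i + 1), "rule-name": name,
         "object-locator": {"schema-name": schema, "table-name": name},
         "rule-action": "include"}
        for i, name in enumerate(tables)]})

if __name__ == "__main__":
    import boto3
    dms = boto3.client("dms")
    dms.create_replication_task(
        ReplicationTaskIdentifier="oltp-to-redshift",  # placeholder
        SourceEndpointArn="arn:aws:dms:eu-west-1:111111111111:endpoint:SRC",  # placeholder ARNs
        TargetEndpointArn="arn:aws:dms:eu-west-1:111111111111:endpoint:TGT",
        ReplicationInstanceArn="arn:aws:dms:eu-west-1:111111111111:rep:INST",
        MigrationType="full-load-and-cdc",  # initial copy, then ongoing binlog changes
        TableMappings=table_mappings("public", ["big_table"]),
    )
```

`full-load-and-cdc` is what gives the "don't have to do anything" behavior: the initial bulk copy followed by continuous replication of changes.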

AWS Timestream DB - AWS IOT

I am building a simple sensor which sends out 5 telemetry values to AWS IoT Core. I am torn between AWS Timestream and Elasticsearch for storing this telemetry.
For now I am experimenting with Timestream and wanted to know if this is the right choice. Any expert suggestions?
Secondly, I want to store the DB records forever, as they will feed into my machine learning predictions in the future. Does Timestream delete records after a while, or is it possible to never delete them?
I will be creating a custom web page to show this telemetry per tenant; any help with how I can do this? Should I query Timestream directly over the API, or should I back it up in another DB like DynamoDB, etc.?
Your help will be greatly appreciated. Thank you.
For now I am experimenting with Timestream and wanted to know if this is the right choice. Any expert suggestions?
I would not call myself an expert, but Timestream looks like a sound solution for telemetry data. I think Elasticsearch would be overkill if each of your telemetry values is a simple numeric value. If your telemetry data is more complex (e.g. JSON objects with many keys) or you would benefit from full-text search, Elasticsearch would be the better choice. Timestream is probably also easier and cheaper to manage.
Secondly, I want to store the DB records forever, as they will feed into my machine learning predictions in the future. Does Timestream delete records after a while, or is it possible to never delete them?
It looks like retention is limited: 4 weeks by default, up to 200 years at most for the magnetic store. You can probably increase that by contacting AWS support, but I doubt they will allow infinite retention.
We use Amazon Kinesis Data Firehose with AWS Glue to store our sensor data on AWS S3. When we need to access the data for analysis, we use AWS Athena to query the data on S3.
I will be creating a custom web page to show this telemetry per tenant; any help with how I can do this? Should I query Timestream directly over the API, or should I back it up in another DB like DynamoDB, etc.?
It depends on how dynamic and complex the queries you want to display are. I would start by querying Timestream directly and introduce DynamoDB where it makes sense to optimize cost.
Based on your description, "a simple sensor which sends out 5 telemetry values to AWS IoT Core", Timestream is the way to go: a fairly simple and cheap solution for simple telemetry data.
The magnetic storage retention (200 years) is more than you will ever need.
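For reference, retention is set per table and can be changed later. A sketch using placeholder database/table names, with the magnetic store pushed to its 73,000-day (roughly 200-year) ceiling:

```python
def retention_properties(memory_hours, magnetic_days):
    """Timestream retention settings; the magnetic store caps out at 73000 days (~200 years)."""
    return {"MemoryStoreRetentionPeriodInHours": memory_hours,
            "MagneticStoreRetentionPeriodInDays": magnetic_days}

if __name__ == "__main__":
    import boto3
    ts = boto3.client("timestream-write")
    ts.create_table(
        DatabaseName="telemetry",     # placeholder names
        TableName="sensor_readings",
        RetentionProperties=retention_properties(24, 73000),
    )
```

Records older than the memory-store window move to the cheaper magnetic store automatically, so a short memory retention plus a long magnetic retention is the usual cost setup.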

Alternative to writing data to S3 using Kinesis Firehose

I am trying to write some IoT data to an S3 bucket, and I know of 2 options so far.
1) Use AWS CLI and put data directly to the S3.
The downside of this approach is that I would have to parse the data and figure out how to write it to S3, so some dev work would be required. The upside is that there is no additional cost.
2) Use Kinesis firehose
The downside of this approach is that it costs more money. It might be wasteful because the data doesn't have to be transferred in real time, and it's not a huge amount of data. The upside is that I don't have to write any code for the data to be written to the S3 bucket.
Is there another alternative that I can explore?
If you're looking to keep costs low, could you use some sort of cron functionality on your IoT device to POST data to a Lambda function that writes to S3?
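A minimal sketch of such a Lambda, assuming an API Gateway proxy integration with a `/devices/{device_id}` route; the bucket name is a placeholder:

```python
import json
import time

def object_key(device_id, ts=None):
    """Build the S3 key; prefixing by device keeps each sensor's data together."""
    ts = int(ts if ts is not None else time.time())
    return f"iot/{device_id}/{ts}.json"

def handler(event, context):
    import boto3  # deferred so the module imports without the SDK installed
    device_id = event["pathParameters"]["device_id"]  # assumed proxy-integration route
    key = object_key(device_id)
    boto3.client("s3").put_object(Bucket="my-iot-bucket",  # placeholder bucket
                                  Key=key, Body=event["body"])
    return {"statusCode": 200, "body": json.dumps({"key": key})}
```

At this point the only ongoing cost is Lambda invocations and S3 PUTs, which for low-frequency telemetry is close to free.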
Option 2 with Kinesis Data Firehose has the least administrative overhead.
You may also want to look into the native IoT services. It may be possible to use IoT Core and put the data directly in S3.
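To sketch that last suggestion: an IoT Core topic rule can forward each matching MQTT message straight to S3, with no code running at all. Rule name, bucket, and role ARN below are placeholders:

```python
def s3_rule_payload(sql, bucket, role_arn):
    """Topic-rule payload that writes every matching message to one S3 object."""
    return {
        "sql": sql,
        "actions": [{"s3": {
            "roleArn": role_arn,
            "bucketName": bucket,
            "key": "${topic()}/${timestamp()}.json",  # substituted per message by IoT Core
        }}],
    }

if __name__ == "__main__":
    import boto3
    boto3.client("iot").create_topic_rule(
        ruleName="sensor_to_s3",
        topicRulePayload=s3_rule_payload(
            "SELECT * FROM 'sensors/+/data'",
            "my-iot-bucket",                          # placeholder bucket
            "arn:aws:iam::111111111111:role/iot-s3",  # placeholder role
        ),
    )
```

The trade-off versus Firehose is one S3 object per message rather than batched files, which can matter if you later query the data with Athena.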

Kinesis to S3 custom partitioning

Question
I've read this, this, and this article. But they provide contradictory answers to the question: how do you customize partitioning when ingesting data into S3 from a Kinesis stream?
More details
Currently, I'm using Firehose to deliver data from Kinesis Streams to Athena. Afterward, the data is processed with EMR Spark.
From time to time I have to handle historical bulk ingests into Kinesis Streams. The issue is that my Spark logic depends heavily on data partitioning and the order of event handling. But Firehose supports partitioning only by ingestion_time (into the Kinesis stream), not by any other custom field (I need event_time).
For example, under Firehose's partition 2018/12/05/12/some-file.gz I can get data from the last few years.
Workarounds
Could you please help me to choose between the following options?
Copy/partition data from the Kinesis stream with the help of a custom Lambda. But this looks more complex and error-prone to me, maybe because I'm not very familiar with AWS Lambda. Moreover, I'm not sure how well it would perform on a bulk load. This article said that the Lambda option is much more expensive than Firehose delivery.
Load data with Firehose, then launch a Spark EMR job to copy the data to another bucket with the right partitioning. At least it sounds simpler to me (biased, as I'm just starting with AWS Lambda). But it has the drawbacks of a double copy and an additional Spark job.
In one hour I could have up to 1M rows taking up to 40 MB of memory (compressed). From Using AWS Lambda with Amazon Kinesis I know that Kinesis-to-Lambda event sourcing has a limit of 10,000 records per batch. Would it be efficient to process such a volume of data with Lambda?
While Kinesis does not allow you to define custom partitions, Athena does!
The Kinesis stream will stream into a table, say data_by_ingestion_time, and you can define another table, data_by_event_time, with the same schema but partitioned by event_time.
Now you can make use of Athena's INSERT INTO capability to repartition the data without needing to write a Hadoop or Spark job, and you get Athena's serverless scale-up for your data volume. You can use SNS, cron, or a workflow engine like Airflow to run this at whatever interval you need.
We dealt with this at my company, and the post below goes into more depth on the trade-offs of using EMR or a streaming solution, but this way you don't need to introduce any more systems like Lambda or EMR.
https://radar.io/blog/custom-partitions-with-kinesis-and-athena
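A sketch of what that repartitioning statement might look like, submitted on a schedule through the Athena API. The table names follow the example above; the `event_date` partition column and `event_time`/`ingestion_time` fields are assumptions:

```python
def repartition_sql(src, dst, start, end):
    """INSERT INTO the event_time-partitioned table from one slice of the raw table.
    Athena requires partition columns to come last in the SELECT list."""
    return (
        f"INSERT INTO {dst} "
        f"SELECT *, date(event_time) AS event_date "
        f"FROM {src} "
        f"WHERE ingestion_time BETWEEN timestamp '{start}' AND timestamp '{end}'"
    )

if __name__ == "__main__":
    import boto3
    boto3.client("athena").start_query_execution(
        QueryString=repartition_sql("data_by_ingestion_time", "data_by_event_time",
                                    "2018-12-05 12:00:00", "2018-12-05 13:00:00"),
        ResultConfiguration={"OutputLocation": "s3://my-results-bucket/"},  # placeholder
    )
```

Filtering on ingestion_time keeps each scheduled run incremental, so only newly delivered files are scanned and rewritten.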
You may use the Kinesis stream and create the partitions as you want:
you create a producer, and in your consumer, create the partitions.
https://aws.amazon.com/pt/kinesis/data-streams/getting-started/
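The producer side of that idea might look like the following sketch. Deriving the partition key from the event's own timestamp (field name assumed) lets the consumer group records by event time rather than ingestion time; the stream name is a placeholder:

```python
import json

def event_partition_key(event):
    """Use the event's own timestamp, truncated to the hour, as the partition key."""
    return event["event_time"][:13]  # '2018-12-05T12:30:00Z' -> '2018-12-05T12'

if __name__ == "__main__":
    import boto3
    event = {"event_time": "2018-12-05T12:30:00Z", "value": 42}
    boto3.client("kinesis").put_record(
        StreamName="my-stream",  # placeholder stream
        Data=json.dumps(event).encode(),
        PartitionKey=event_partition_key(event),
    )
```

Note that the Kinesis partition key only controls shard routing; the consumer still has to map it onto S3 prefixes when it writes the objects out.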