I would like to create a data pipeline that exports data from DynamoDB and imports it into S3. Everything seems fine, but there is a problem: my data in DynamoDB is binary, and in the pipeline settings there is no accepted data type for binary.
What can I do about it?
Cheers,
You can create your own Hive script that exports data from DynamoDB to S3 (samples).
You would need to disable staging to run your own Hive script (an example of how to do this is here).
Related
I have an issue:
I need to migrate data from DynamoDB to Redshift. The problem is that I receive this exception:
ERROR: Unsupported Data Type: Current Version only supports Strings and Numbers
Detail:
-----------------------------------------------
error:    Unsupported Data Type: Current Version only supports Strings and Numbers
code:     9005
context:  Table Name = user_session
query:    446027
location: copy_dynamodb_scanner.cpp:199
process:  query0_124_446027 [pid=25424]
-----------------------------------------------
In my DynamoDB item I have a Boolean field. How can I convert the field from Boolean to INT (for example)?
I tried treating it as a VARCHAR(5), but that didn't help (there is an open ticket on GitHub about it without a response).
I would appreciate any suggestions.
As a solution, I migrated the data from DynamoDB to S3 first and then from S3 to Redshift.
I used the built-in Export to S3 feature of DynamoDB. It saves all the data as *.json files into S3 really fast (but not sorted).
After that I ran an ETL step: a Glue Job with a custom PySpark script to process the data and save it into Redshift.
The schema can also be defined with a Glue crawler, but you still need to validate its result, as it was sometimes incorrect.
Using crawlers to parse DynamoDB directly puts a heavy load on your tables if you are not using on-demand read/write capacity, so the better way is to crawl the data from S3.
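For illustration, here is a minimal sketch of that Glue/PySpark step. The export prefix, the Boolean attribute (is_active), the key attribute, and the Redshift connection names are all placeholders, not from the original setup:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import col

glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session

# The native export writes DynamoDB JSON, one object per line, with every
# attribute wrapped in a type descriptor, e.g. {"is_active": {"BOOL": true}}.
df = spark.read.json("s3://my-bucket/exports/AWSDynamoDB/data/")  # placeholder path

# Unwrap the type descriptors and cast the Boolean to INT so Redshift accepts it.
df = df.select(
    col("Item.user_id.S").alias("user_id"),                    # placeholder key
    col("Item.is_active.BOOL").cast("int").alias("is_active"),
)

# Write to Redshift through a pre-configured Glue connection (placeholder names).
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(df, glue_context, "out"),
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "user_session", "database": "dev"},
    redshift_tmp_dir="s3://my-bucket/redshift-temp/",
)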
The scenario is the following: I have a lambda function that does an http request to get the data of today and the last 365 days and stores them in DynamoDB. The function is triggered every day at 8am, so the most recent data is always saved in the DynamoDB table.
Now my goal is to export the DynamoDB table to a S3 file automatically on an everyday basis as well, so I'm able to use services like QuickSight, Athena, Forecast on the data.
If possible and easily implementable, I'd like to have just one S3 file that the most recent data of the day gets appended to, because an extra file every day seems kinda pricey. If that's not possible, an extra file every day would also be fine.
What's the best way to go about doing so without using the CLI (because I'm not allowed to install programs on my laptop) and without using Lambda (because I wouldn't know how to write a function for that without any tutorials)?
Take a look at Data Pipeline. This is exactly the kind of use case it was built for, and most of the configuration is simple.
It will also not require any knowledge of Lambda and can be automated.
More info: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html
DynamoDB recently released a new, native feature to export your table's data to an S3 bucket. It supports exporting to DynamoDB JSON and Amazon Ion formats - see the documentation on how to use it at:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html
This will enable you to run whatever analytics tools you'd like (Athena, etc.) on the data exported in S3.
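The export can be started from the DynamoDB console, or, if you want to trigger it programmatically on a schedule, invoked with boto3. A minimal sketch; the table ARN and bucket name are placeholders:

import boto3

dynamodb = boto3.client("dynamodb")

# Point-in-time recovery (PITR) must be enabled on the table for this call to succeed.
response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/MyTable",  # placeholder
    S3Bucket="my-export-bucket",                                       # placeholder
    ExportFormat="DYNAMODB_JSON",                                      # or "ION"
)
print(response["ExportDescription"]["ExportStatus"])  # e.g. "IN_PROGRESS"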
I need to do a grouping job on a Source DynamoDB table, then write each resulting Item to another Target DynamoDB table (or a secondary index of the Source one).
Here I see that DynamoDB can be used as a Source (as also reported in Connection Types).
However, it's not clear to me if a DynamoDB table can be used as Target as well.
Note: each resulting grouping item must be written into a separate DynamoDB Item (i.e., if there are X objects resulting from grouping, X Items must be written to Target DynamoDB table).
Glue can now read from and write to DynamoDB. The option to write is not available via the console, but can be done by editing the script.
Example:
# glueContext is the job's GlueContext and ApplyMapping_Frame1 is the
# DynamicFrame produced by the preceding ApplyMapping step in the
# auto-generated script.
Datasink1 = glueContext.write_dynamic_frame_from_options(
    frame=ApplyMapping_Frame1,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "myDDBTable",
        "dynamodb.throughput.write.percent": "1.0",
    },
)
As per:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#etl-connect-dynamodb-as-sink
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-dynamo-db-cross-account.html
Glue Job scripts can be customized to write to any data source. If you are using the auto-generated scripts, you can add the boto3 library to write to DynamoDB tables.
If you want to test your scripts easily, you can create a dev endpoint through the AWS console and launch a Jupyter notebook to write and test your Glue job scripts.
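As a rough sketch of that boto3 approach (the table name and items are hypothetical stand-ins for the grouped results computed earlier in the job):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("myTargetTable")  # hypothetical target table

# Grouped results produced earlier in the job; shown inline for illustration.
grouped_items = [
    {"group_key": "store-1", "total": 42},
    {"group_key": "store-2", "total": 17},
]

# batch_writer buffers the puts and automatically retries unprocessed items,
# so each grouped result becomes its own Item in the Target table.
with table.batch_writer() as batch:
    for item in grouped_items:
        batch.put_item(Item=item)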
We're looking at moving away from Splunk as our datastore and looking at AWS Data Lake backed by S3.
What would be the process of migrating data from Splunk to S3? I've read lots of documents talking about archiving data from Splunk to S3, but I'm not sure whether this archives the data in a usable format, or whether it's in some archive format that needs to be restored to Splunk itself.
Check out Splunk's SmartStore feature. It moves your non-hot buckets to S3 so you save storage costs. Running SmartStore on AWS only makes sense, however, if you run Splunk on AWS. Otherwise, the data export charges will bankrupt you. Data export applies when Splunk needs to search a bucket that's stored in S3 and so copies that bucket to an indexer. See https://docs.splunk.com/Documentation/Splunk/8.0.0/Indexer/AboutSmartStore for more information.
From what I've read there are a couple of ways to do it:
Export using the Web UI
Export using REST API Endpoint
Export using CLI
Copy certain files in the filesystem
So far I've tried using the CLI to export and I've managed to export around 500,000 events at a time using
splunk search "index=main earliest=11/11/2019:00:00:01 latest=11/15/2019:23:59:59" -output rawdata -maxout 500000 > output2.dmp
However, I'm not sure how I can accurately repeat this step to make sure I include all 100 million+ events, i.e., search from DATE A to DATE B for 500,000 records, then search from DATE B to DATE C for the next 500,000, without missing any events in between.
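One way to keep the windows airtight is to drive that same CLI command from a script with back-to-back time windows: in Splunk, earliest is inclusive and latest is exclusive, so making each window's earliest equal to the previous window's latest covers every event exactly once. A rough Python sketch, assuming the splunk binary is on your PATH and no single window exceeds the -maxout cap:

import subprocess
from datetime import datetime, timedelta

start = datetime(2019, 11, 11)
stop = datetime(2019, 11, 16)
window = timedelta(days=1)

current = start
while current < stop:
    nxt = min(current + window, stop)
    # earliest is inclusive, latest is exclusive: back-to-back windows
    # leave no gaps and count no event twice.
    search = (
        f"index=main earliest={current:%m/%d/%Y:%H:%M:%S} "
        f"latest={nxt:%m/%d/%Y:%H:%M:%S}"
    )
    with open(f"output_{current:%Y%m%d}.dmp", "wb") as out:
        subprocess.run(
            ["splunk", "search", search, "-output", "rawdata", "-maxout", "500000"],
            stdout=out,
            check=True,
        )
    current = nxt

If any single window could hold more than 500,000 events, shrink the window (e.g. to an hour) rather than raising -maxout.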
I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process data using Glue and there is no need to have a table registered in the Glue Catalog, then there is no need to run a Glue Crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing to the root folder (in your case that's ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result into another S3 bucket, a relational DB, or the Glue Catalog.
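For reference, in a Python job the equivalent of the Scala getSourceWithFormat() call is create_dynamic_frame_from_options; a minimal sketch using the paths from the question (the output location is a placeholder):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext())

# "recurse": True walks the store-*/date subfolders under the root path.
frame = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/"], "recurse": True},
    format="json",
)

# ...apply any transformations here, then write the result out.
glue_context.write_dynamic_frame_from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/sales/"},  # placeholder
    format="parquet",
)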
Yes, Glue is a great tool for this!
Use a crawler to create a table in the Glue Data Catalog (remember to set "Create a single schema for each S3 path" under "Grouping behavior for S3 data" when creating the crawler)
Read more about it here
Then you can use relationalize to flatten out your JSON structure, read more about that here
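A minimal sketch of that relationalize step; the database, table, and staging path are placeholders:

from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext())

# Read the table the crawler created (placeholder database/table names).
frame = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="my_bucket"
)

# Relationalize flattens nested JSON into a collection of flat tables.
flattened = Relationalize.apply(
    frame=frame,
    staging_path="s3://my-bucket/glue-temp/",  # placeholder scratch location
    name="root",
)
root = flattened.select("root")  # the flattened top-level table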
JSON and AWS Glue may not be the best match. Since AWS Glue is based on Hadoop, it inherits Hadoop's "one row per newline" restriction, so even if your data is JSON, it has to be formatted with one JSON object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use CSV instead of JSON.
Edit 2022-11-29: There does appear to be some tooling now for jsonl, which is the actual format that AWS expects, making this less of an automatic win for CSV. I would say that if your data is already in JSON format, it's probably smarter to convert it to jsonl than to CSV.
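If your files contain a JSON array rather than one object per line, a small pre-processing step can rewrite them as jsonl; a minimal sketch with hypothetical file names:

import json

# Rewrite a file holding a JSON array as one JSON object per line (jsonl).
with open("sales.json") as src, open("sales.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")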