I'm currently facing an issue with CloudSearch when trying to index a large DynamoDB table via the AWS Console:
Retrieving a subset of the table...
The request took too long to complete. Please try again or use the command line tools.
After looking through the documentation [1, 2, 3], there are examples of uploading several file formats through the CLI, but no mention of uploading data from a DynamoDB table using the CLI.
How can this be done without having to download the entire table to a file and upload it?
I got in contact with AWS Support, and there is no way to index DynamoDB data using the CLI.
The only available way is downloading the data and uploading it through the CLI using a classic file format/structure (.csv, .json, .xml or .txt).
A shame, unfortunately.
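For anyone hitting the same wall, the file round trip can at least be scripted instead of done by hand. A rough sketch with boto3 (table name, document endpoint and key name are placeholders, and real items usually need a more careful mapping into CloudSearch's add/id/fields batch format):

import json
import boto3

dynamodb = boto3.client("dynamodb")
cloudsearch = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-mydomain-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com",  # placeholder doc endpoint
)

# Scan the table page by page and build a CloudSearch document batch.
batch = []
paginator = dynamodb.get_paginator("scan")
for page in paginator.paginate(TableName="my-table"):  # placeholder table name
    for item in page["Items"]:
        batch.append({
            "type": "add",
            "id": item["pk"]["S"],  # assumes a string partition key named "pk"
            "fields": {k: v.get("S", v.get("N")) for k, v in item.items()},
        })

# Upload the batch (CloudSearch caps a single batch at 5 MB, so large tables need chunking).
cloudsearch.upload_documents(
    documents=json.dumps(batch),
    contentType="application/json",
)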
Related
We're trying to use AWS Glue for ETL operations in our Node.js project. The workflow will be as below:
user uploads csv file
data transformation from XYZ format to ABC format (mapping and changing field names)
download transformed csv file to local system
Note that this flow should happen programmatically (creating crawlers and job triggers should be done programmatically, not via the console). I don't know why the documentation and other articles always show how to create crawlers and jobs from the Glue console.
I believe that we have to create Lambda functions and triggers, but I'm not quite sure how to achieve this end-to-end flow. Can anyone please help me? Thanks.
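For what it's worth, everything the console does here is also exposed through the Glue API, so crawlers and jobs can be created and started from code. A minimal sketch with boto3 (the equivalent calls exist in the AWS SDK for JavaScript; the role ARN, bucket paths and names are placeholders):

import boto3

glue = boto3.client("glue")

# Crawler over the location where the uploaded CSV files land (placeholder role and path).
glue.create_crawler(
    Name="csv-ingest-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="etl_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/uploads/"}]},
)
glue.start_crawler(Name="csv-ingest-crawler")

# ETL job that maps the XYZ fields to the ABC fields (script location is a placeholder).
glue.create_job(
    Name="xyz-to-abc",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/xyz_to_abc.py",
        "PythonVersion": "3",
    },
    GlueVersion="2.0",
)
run = glue.start_job_run(JobName="xyz-to-abc")
print(run["JobRunId"])

This kind of script could run inside a Lambda triggered by the S3 upload, and the transformed CSV the job writes back to S3 is then what the user downloads.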
The scenario is the following: I have a lambda function that does an http request to get the data of today and the last 365 days and stores them in DynamoDB. The function is triggered every day at 8am, so the most recent data is always saved in the DynamoDB table.
Now my goal is to export the DynamoDB table to an S3 file automatically on an everyday basis as well, so I'm able to use services like QuickSight, Athena and Forecast on the data.
If possible and easily implementable, I'd like to have only one S3 file that the most recent data of the day gets appended to, because an extra file every day seems kind of pricey. If that's not possible, an extra file every day would also be fine.
What's the best way to go about doing so without using the CLI (because I'm not allowed to install programs on my laptop) and without using Lambda (because I wouldn't know how to write a function for that without any tutorials)?
Take a look at AWS Data Pipeline. This is a documented use case for it, and most of the configuration is simple.
It will also not require any knowledge of Lambda and can be automated.
More info: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html
DynamoDB recently released a new, native feature to export your table's data to an S3 bucket. It supports exporting into DynamoDB JSON and Amazon Ion - see the documentation on how to use it at:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html
This will enable you to run whatever analytics tools you'd like (Athena, etc.) on the data exported in S3.
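If you go that route, the export can also be started programmatically. A minimal sketch with boto3 (table ARN and bucket are placeholders; the table needs point-in-time recovery enabled for this feature):

import boto3

dynamodb = boto3.client("dynamodb")

# Kick off an export of the table to S3 in DynamoDB JSON format (requires PITR on the table).
response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:eu-west-1:123456789012:table/my-table",  # placeholder
    S3Bucket="my-export-bucket",                                        # placeholder
    S3Prefix="exports/",
    ExportFormat="DYNAMODB_JSON",
)
print(response["ExportDescription"]["ExportStatus"])

The same export can also be started from the DynamoDB console, which fits the no-CLI constraint in the question.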
We're looking at moving away from Splunk as our datastore and towards an AWS data lake backed by S3.
What would be the process of migrating data from Splunk to S3? I've read lots of documents talking about archiving data from Splunk to S3, but I'm not sure whether this archives the data in a usable format or whether it ends up in some archive format that needs to be restored to Splunk itself.
Check out Splunk's SmartStore feature. It moves your non-hot buckets to S3 so you save storage costs. Running SmartStore on AWS only makes sense, however, if you run Splunk on AWS. Otherwise, the data export charges will bankrupt you. Data export applies when Splunk needs to search a bucket that's stored in S3 and so copies that bucket to an indexer. See https://docs.splunk.com/Documentation/Splunk/8.0.0/Indexer/AboutSmartStore for more information.
From what I've read, there are a few ways to do it:
Export using the Web UI
Export using REST API Endpoint
Export using CLI
Copy certain files in the filesystem
So far I've tried using the CLI to export, and I've managed to export around 500,000 events at a time using:
splunk search "index=main earliest=11/11/2019:00:00:01 latest=11/15/2019:23:59:59" -output rawdata -maxout 500000 > output2.dmp
However, I'm not sure how I can accurately repeat this step to make sure I include all 100 million+ events, i.e. search from DATE A to DATE B for 500,000 records, then search from DATE B to DATE C for the next 500,000, without missing any events in between.
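One way to keep the batches contiguous is to drive the same CLI search over fixed, aligned time windows and rely on earliest being inclusive and latest being exclusive, so consecutive windows neither overlap nor leave gaps. A rough sketch (assumes day-sized windows stay under the 500,000-event cap; narrow the window if they don't):

import subprocess
from datetime import date, timedelta

start = date(2019, 11, 11)   # placeholders
end = date(2019, 11, 16)     # exclusive upper bound

day = start
while day < end:
    nxt = day + timedelta(days=1)
    query = (
        f"index=main earliest={day:%m/%d/%Y}:00:00:00 "
        f"latest={nxt:%m/%d/%Y}:00:00:00"
    )
    # One window per file; each window's latest is the next window's earliest.
    with open(f"export_{day:%Y%m%d}.dmp", "wb") as out:
        subprocess.run(
            ["splunk", "search", query, "-output", "rawdata", "-maxout", "500000"],
            stdout=out,
            check=True,
        )
    day = nxt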
I am submitting a Python script (pyspark actually) to a Glue Job to process parquet files and extract some analytics from this data source.
These parquet files live in an S3 folder and continuously grow with new data. I was happy with the bookmarking logic provided by AWS Glue because it helps a lot: it basically allows us to process only new data without reprocessing data that has already been processed.
Unfortunately, in this scenario I notice that duplicates are produced on each run, and it looks like AWS Glue bookmarking is not working at all. What's the reason for this unexpected behaviour?
Can you please check now? Glue supports bookmarks for Parquet and ORC, but only with Glue version 1.0 and later; version 0.9 did not support them.
From https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html:
The Apache Parquet and ORC formats are currently not supported.
UPDATE
Since Jul 26, 2019, AWS Glue supports the Parquet and ORC formats for bookmarking as well:
https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
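Apart from the format/version question, it is worth double-checking that the job actually drives the bookmark: bookmarks only advance when the job is initialised and committed, every S3 read carries a transformation_ctx, and the job runs with --job-bookmark-option job-bookmark-enable. A minimal pyspark sketch (the bucket path is a placeholder):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)   # bookmark state is keyed on the job name

# transformation_ctx is what the bookmark state is stored against.
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/parquet-input/"]},  # placeholder
    format="parquet",
    transformation_ctx="source",
)

# ... transforms / analytics on `source` ...

job.commit()   # without this the bookmark is never advanced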
After saving Avro files with Snappy compression (the same error occurs with gzip/bzip2 compression) to S3 using AWS Glue, when I try to read the data in Athena (after running an AWS Glue crawler), I get the following error: HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split - using org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat: Not a data file. Any idea why I get this error and how to resolve it?
Thank you.
Circumvented this issue by attaching the native Spark Avro jar file to the Glue job during execution and using native Spark read/write methods to write the files in Avro format, setting spark.conf.set("spark.sql.avro.compression.codec", "snappy") for the compression as soon as the Spark session is created.
Works perfectly for me, and the data can be read via Athena as well.
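To make that workaround concrete, a sketch of what it looks like in the job script (paths are placeholders; a spark-avro package matching the Spark version behind your Glue version still has to be attached to the job as described above, and on older Spark versions the format name from that jar may differ from "avro"):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Set the Avro codec right after the session is created, before any writes.
spark.conf.set("spark.sql.avro.compression.codec", "snappy")

df = spark.read.parquet("s3://my-bucket/input/")             # placeholder source

# The native Spark writer produces block-compressed Avro that Athena can read.
df.write.format("avro").save("s3://my-bucket/avro-output/")  # placeholder target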
AWS Glue doesn't support writing compressed Avro files, even though this isn't stated clearly in the docs. The job succeeds, but it applies compression in the wrong way: instead of compressing the file blocks, it compresses the entire file, which is wrong and is the reason Athena can't query it.
There are plans to fix the issue, but I don't know the ETA.
It would be nice if you could contact AWS Support to let them know that you are having this issue too (the more customers affected, the sooner it gets fixed).