Incrementally load data from DynamoDB to S3 using AWS Data Pipeline

My scenario is based on a 'DAT' column (which contains a date) in DynamoDB; I need to incrementally load the data to S3 using the AWS Data Pipeline console.
To do this I used a HiveCopyActivity and added a filterSql of DAT > unix_timestamp(\"2015-01-01 01:00:00.301\", \"yyyy-MM-dd'T'HH:mm:ss\"). When I use the filterSql I get the following error:
Failed to complete HiveActivity: Hive did not produce an error file. Cause: EMR job '#TableBackupActivity_2015-03-14T07:17:02_Attempt=3' with jobFlowId 'i-3NTVWJANCCOH7E' is failed with status 'FAILED' and reason 'Waiting after step failed'. Step '#TableBackupActivity_2015-03-14T07:17:02_Attempt=3' is in status 'FAILED' with reason 'null'
If I run the pipeline without the filterSql statement, the data moves from DynamoDB to S3 without any error. Could someone please help me with this error?
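For context, the HiveCopyActivity portion of the pipeline definition looks roughly like the sketch below, expressed as a Python dict (the input, output and runsOn object names are hypothetical placeholders; only the filterSql field differs from the pipeline that works):

# Sketch of the HiveCopyActivity object from the pipeline definition
# (object names other than TableBackupActivity are hypothetical).
hive_copy_activity = {
    "id": "TableBackupActivity",
    "type": "HiveCopyActivity",
    "input": {"ref": "DDBSourceTable"},        # DynamoDB data node (assumed name)
    "output": {"ref": "S3BackupLocation"},     # S3 data node (assumed name)
    "runsOn": {"ref": "EmrClusterForBackup"},  # EMR cluster resource (assumed name)
    "filterSql": "DAT > unix_timestamp(\"2015-01-01 01:00:00.301\", \"yyyy-MM-dd'T'HH:mm:ss\")",
}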

Related

SQL compilation error: Table '"s3://xxxxxxxxxxxx/output/"' does not exist

We are able to load data from s3://xxxxxxxxxxxx/input/ to Snowflake by using Snowpipe.
Code for loading data from the S3 bucket to Snowflake:
COPY INTO "s3://xxxxxxxxxx/output/"
FROM #"V_PIPELINE_DB"."V_PIPELINE_SCHEMA"."V_PIPELINE_STAGE"
FILE_FORMAT = ( FORMAT_NAME = "V_PIPELINE_DB"."V_PIPELINE_SCHEMA"."V_PIPELINE_CSV_FILEFORMAT" )
ON_ERROR = 'continue'
FORCE =TRUE;
But we are facing a compilation error while loading from Snowflake to an AWS S3 bucket, and below is the error.
SQL compilation error: Table '"s3://xxxxxxxxxxxx/output/"' does not exist
Firstly, you mentioned that the command you attached is loading data from S3 to Snowflake, but that command looks more like the other way around (unloading data from Snowflake to S3). Therefore I'm not sure which query is causing that error.
To get the correct query structure, please refer to the following manuals:
Loading from S3 to Snowflake
Unloading from Snowflake to S3
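For reference, a minimal sketch of the two directions using the snowflake-connector-python package (connection parameters and the my_target_table / my_source_table names are hypothetical placeholders; the stage and file format names are taken from the question):

# Sketch: load vs. unload query structure in Snowflake.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",   # hypothetical credentials
    database="V_PIPELINE_DB", schema="V_PIPELINE_SCHEMA",
)
cur = conn.cursor()

# Loading S3 -> Snowflake: COPY INTO targets a table and reads from a stage.
cur.execute("""
    COPY INTO my_target_table
    FROM @"V_PIPELINE_DB"."V_PIPELINE_SCHEMA"."V_PIPELINE_STAGE"
    FILE_FORMAT = (FORMAT_NAME = "V_PIPELINE_DB"."V_PIPELINE_SCHEMA"."V_PIPELINE_CSV_FILEFORMAT")
    ON_ERROR = 'continue'
""")

# Unloading Snowflake -> S3: COPY INTO targets a stage (or external location)
# and reads from a table or query; an S3 URL is never a table name.
cur.execute("""
    COPY INTO @"V_PIPELINE_DB"."V_PIPELINE_SCHEMA"."V_PIPELINE_STAGE"/output/
    FROM my_source_table
    FILE_FORMAT = (FORMAT_NAME = "V_PIPELINE_DB"."V_PIPELINE_SCHEMA"."V_PIPELINE_CSV_FILEFORMAT")
""")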

Error BigQuery/Dataflow "Could not resolve table in Data Catalog"

I'm having trouble with a job I've set up on Dataflow.
Here is the context: I created a dataset on BigQuery using the following path
bi-training-gcp:sales.sales_data
In the properties I can see that the data location is "US".
Now I want to run a job on Dataflow, and I enter the following command into the Google Cloud Shell:
gcloud dataflow sql query ' SELECT country, DATE_TRUNC(ORDERDATE , MONTH),
sum(sales) FROM bi-training-gcp.sales.sales_data group by 1,2 ' --job-name=dataflow-sql-sales-monthly --region=us-east1 --bigquery-dataset=sales --bigquery-table=monthly_sales
The query is accepted by the console and returns a sort of acknowledgement message.
After that I go to the Dataflow dashboard. I can see the new job queued, but after 5 minutes or so the job fails and I get the following error messages:
Error
2021-09-29T18:06:00.795ZInvalid/unsupported arguments for SQL job launch: Invalid table specification in Data Catalog: Could not resolve table in Data Catalog: bi-training-gcp.sales.sales_data
Error 2021-09-29T18:10:31.592036462ZError occurred in the launcher
container: Template launch failed. See console logs.
My guess is that it cannot find my table, maybe because I specified the wrong location/region. Since my table's data location is "US", I thought it would be on a US server (which is why I specified us-east1 as the region), but I tried all US regions with no success.
Does anybody know how I can solve this?
Thank you
This error occurs if the Dataflow service account doesn't have access to the Data Catalog API. To resolve this issue, enable the Data Catalog API in the Google Cloud project that you're using to write and run queries. Alternatively, assign the roles/datacatalog.viewer role.
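For example, a minimal sketch of those two steps driven from Python via the gcloud CLI (the project ID is taken from the question; the service-account address is a hypothetical placeholder for whichever account runs the Dataflow SQL query):

# Sketch: enable the Data Catalog API and grant the viewer role.
import subprocess

project = "bi-training-gcp"
runner_sa = "123456789-compute@developer.gserviceaccount.com"  # hypothetical service account

subprocess.run(["gcloud", "services", "enable", "datacatalog.googleapis.com",
                "--project", project], check=True)
subprocess.run(["gcloud", "projects", "add-iam-policy-binding", project,
                "--member", f"serviceAccount:{runner_sa}",
                "--role", "roles/datacatalog.viewer"], check=True)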

AWS Data Pipeline Dynamo to Redshift

I have an issue:
I need to migrate data from DynamoDB to Redshift. The problem is that I receive the following exception:
ERROR: Unsupported Data Type: Current Version only supports Strings and Numbers Detail: ----------------------------------------------- error: Unsupported Data Type: Current Version only supports Strings and Numbers code: 9005 context: Table Name = user_session query: 446027 location: copy_dynamodb_scanner.cpp:199 process: query0_124_446027 [pid=25424] -----------------------------------------------
In my DynamoDB item I have a Boolean field. How can I convert the field from Boolean to INT (for example)?
I tried mapping it as a VARCHAR(5), but that didn't help (there is one ticket about it on GitHub without a response).
I would appreciate any suggestions.
As a solution, I migrated the data from DynamoDB to S3 first and then from S3 to Redshift.
I used the built-in Export to S3 feature of DynamoDB. It saves all the data as *.json files into S3 really fast (but not sorted).
After that I ran an ETL step: a Glue job with a custom PySpark script to process the data and save it into Redshift (see the sketch below).
This can also be done with a Glue crawler to define the schema, but you still need to validate its result, as sometimes it is not correct.
Using crawlers to parse DynamoDB directly puts a heavy read load on your tables if you are not using on-demand read/write capacity, so the better way is to work with the data exported to S3.
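A minimal PySpark sketch of that Glue job, assuming the export landed as DynamoDB-JSON lines under a hypothetical s3://my-bucket/dynamodb-export/ prefix, a hypothetical Boolean attribute is_active, and a Glue connection named redshift-conn:

# Sketch of a Glue job: DynamoDB S3 export -> flatten -> cast Boolean -> Redshift.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import col

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# The DynamoDB "Export to S3" feature writes gzipped DynamoDB-JSON lines under .../data/.
raw = spark.read.json("s3://my-bucket/dynamodb-export/AWSDynamoDB/data/")

# Flatten the DynamoDB JSON and cast the Boolean attribute to an int,
# since the direct DynamoDB-to-Redshift COPY rejects Boolean attributes.
flat = raw.select(
    col("Item.user_id.S").alias("user_id"),
    col("Item.is_active.BOOL").cast("int").alias("is_active"),
)

# Write to Redshift through a Glue connection (connection and path names are hypothetical).
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(flat, glue_context, "flat"),
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "user_session", "database": "dev"},
    redshift_tmp_dir="s3://my-bucket/glue-temp/",
)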

An internal error occurred when attempting to deliver data in AWS Firehose data stream

I am implementing an AWS Kinesis Data Firehose delivery stream and facing an issue with data delivery from S3 to Redshift. Can you please help me and let me know what is missing?
An internal error occurred when attempting to deliver data. Delivery
will be retried; if the error persists, it will be reported to AWS for
resolution. InternalError 2
This happened to me, and the problem was an inconsistency between the input record format and the DB table.
Check the AWS documentation for the COPY command to make sure the COPY command parameters are defined properly.
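One way to sanity-check this is to run the COPY that Firehose would issue manually against the cluster and confirm that the options match the record format; a minimal sketch using the redshift_connector package (table, bucket, role and connection values are hypothetical placeholders, and JSON records are assumed):

# Sketch: manually test the COPY options Firehose uses against Redshift.
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical cluster
    database="dev", user="admin", password="...",
)
cur = conn.cursor()
# For JSON records the COPY options must include JSON 'auto' (or a jsonpaths file);
# for delimited records the delimiter must match the data Firehose delivers.
cur.execute("""
    COPY public.events
    FROM 's3://my-firehose-bucket/2023/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    JSON 'auto'
""")
conn.commit()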

Importing data from Excel sheet to DynamoDB table

I am having a problem importing data from an Excel sheet into an Amazon DynamoDB table. I have the Excel sheet in an Amazon S3 bucket, and I want to import data from this sheet to a table in DynamoDB.
Currently I am following Import and Export DynamoDB Data Using AWS Data Pipeline, but my pipeline is not working normally.
It gives me a WAITING_FOR_RUNNER status, and after some time the status changes to CANCELED. Please suggest what I am doing wrong, or is there any other way to import data from an Excel sheet to a DynamoDB table?
The potential reasons are as follows:
Reason 1:
If your pipeline is in the SCHEDULED state and one or more tasks appear stuck in the WAITING_FOR_RUNNER state, ensure that you set a valid value for either the runsOn or workerGroup fields for those tasks. If both values are empty or missing, the task cannot start because there is no association between the task and a worker to perform the tasks. In this situation, you've defined work but haven't defined what computer will do that work. If applicable, verify that the workerGroup value assigned to the pipeline component is exactly the same name and case as the workerGroup value that you configured for Task Runner (see the definition sketch at the end of this answer).
Reason 2:
Another potential cause of this problem is that the endpoint and access key provided to Task Runner are not the same as those in the AWS Data Pipeline console or on the computer where the AWS Data Pipeline CLI tools are installed. You might have created new pipelines with no visible errors, but Task Runner polls the wrong location due to the difference in credentials, or polls the correct location with insufficient permissions to identify and run the work specified by the pipeline definition.
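For Reason 1, a minimal sketch of the relevant part of a pipeline definition (the console/CLI JSON format, expressed here as a Python dict; the object names are hypothetical placeholders):

# Sketch: every activity needs either a runsOn reference or a workerGroup.
pipeline_definition = {
    "objects": [
        {
            "id": "EmrClusterForLoad",
            "name": "EmrClusterForLoad",
            "type": "EmrCluster",
            "region": "us-east-1",
        },
        {
            "id": "TableLoadActivity",
            "name": "TableLoadActivity",
            "type": "EmrActivity",
            # Either point the activity at a managed resource via runsOn ...
            "runsOn": {"ref": "EmrClusterForLoad"},
            # ... or, if you run your own Task Runner, set a workerGroup instead;
            # its value must match the worker group configured for Task Runner:
            # "workerGroup": "myWorkerGroup",
        },
    ]
}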