I know others have had this same problem[1], and now I have as well. I have tried all the suggested troubleshooting techniques on that question. To summarize:
This is a new Firehose to Redshift
S3 objects are predictably appearing with 100% success in CloudWatch
Redshift delivery is showing as 0% success, so it must be trying
I'm seeing that Firehose is making connections to Redshift so the firewall rules must be correct
I am using JSON formatted entries with an external column mapping file.
The Firehose and Redshift cluster are in the us-west-2 region, but the bucket is in US Standard (us-east-1) so I'm using the WITH REGION option.
Following in the path of others, I have tried deleting and recreating the firehose to no avail.
I also tried doing the COPY from the redshift cluster manually and found that it worked perfectly.
There doesn't appear to be anything in the Redshift error tables, nor in the errors section of the bucket.
I'm about to give up on this. Anyone have suggestions for where to find the error logs before I admit failure?
[1] AWS Kinesis Firehose not inserting data in Redshift
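For anyone reproducing the manual COPY test described above, here is a minimal sketch of how such a statement could be assembled. The table name, bucket paths, and IAM role ARN are placeholders rather than values from my setup, and the statement Firehose itself generates may differ (it typically copies from a manifest file).

```python
def build_copy_sql(table, data_uri, jsonpaths_uri, iam_role, bucket_region):
    """Return a Redshift COPY statement for JSON data whose bucket is in another region."""
    return (
        f"COPY {table} "
        f"FROM '{data_uri}' "
        f"IAM_ROLE '{iam_role}' "
        f"JSON '{jsonpaths_uri}' "     # external column mapping (JSONPaths) file
        f"REGION '{bucket_region}';"   # needed when bucket and cluster regions differ
    )


# All values below are hypothetical placeholders.
sql = build_copy_sql(
    table="events",
    data_uri="s3://example-bucket/firehose/",
    jsonpaths_uri="s3://example-bucket/config/jsonpaths.json",
    iam_role="arn:aws:iam::123456789012:role/example-redshift-copy",
    bucket_region="us-east-1",
)
print(sql)
```

Running the resulting statement by hand from a SQL client is a quick way to separate permission or format problems from problems on the Firehose side.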
Related
I am using AWS RDS (MySQL) and I would like to sync this data to AWS Elasticsearch in real time.
I think AWS Glue may be the best solution for this, but I am not sure whether it can do what I want.
Here is the information about my RDS database:
■ RDS
・I would like to sync several tables (MySQL) to OpenSearch (1 table to 1 index).
・The schema of the tables will change dynamically.
・New columns may have been added, or existing columns removed, since the previous sync.
(so I also have to sync these schema changes)
Could you tell me roughly whether I can do these things with AWS Glue?
I wonder if AWS Glue can deal with dynamic schema changes and syncing in (near) real time.
Thank you in advance.
Glue now has an OpenSearch connector, but Glue is an ETL tool that handles batch operations very well. Event-based or very frequent loads into Elasticsearch might not be the best fit, and the cost can also be high.
https://docs.aws.amazon.com/glue/latest/ug/tutorial-elastisearch-connector.html
DMS can help, but not completely, because as you mentioned the schema keeps changing.
Logstash Solution
Since Elasticsearch 1.5, Logstash has offered a jdbc input plugin that can sync MySQL data into Elasticsearch.
AWS Native solution
You can have a Lambda function invoked on a MySQL event (see "Invoking a Lambda function from an Amazon Aurora MySQL DB cluster").
The Lambda writes JSON to Kinesis Firehose, and Firehose can load it into OpenSearch.
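A rough sketch of the Lambda side, assuming the function receives the changed row as a dict and is handed a boto3 Firehose client (`boto3.client("firehose")`). The stream name and row shape are made up for illustration.

```python
import json


def encode_record(row):
    """Serialize one changed row as a JSON document (bytes) for Firehose."""
    # default=str is a simple fallback for dates/decimals that json cannot encode
    return json.dumps(row, default=str, sort_keys=True).encode("utf-8")


def forward_row(firehose_client, stream_name, row):
    """Send one row to a Firehose delivery stream.

    firehose_client is expected to behave like boto3.client("firehose").
    """
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_record(row)},
    )
```

Because every row is sent as a self-describing JSON document, added or removed columns simply show up as added or missing fields. How the index mapping reacts to that depends on your OpenSearch settings.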
Can we connect Amazon S3 buckets located in two different regions and migrate their CSV file data into an Amazon RDS instance in one particular region? I am trying to use AWS Glue.
There are certainly different ways to solve this use case. You can use AWS Glue. You can also write a workflow using AWS Step Functions that can solve this as well. For example, you can write a series of Lambda functions that can read CSV in an Amazon S3 bucket, get the values and then write the values to an Amazon RDS database. Both ways are valid.
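As a rough illustration of the Lambda step (read the CSV, get the values, prepare the write), here is a minimal sketch using only the standard library. It assumes the CSV has a header row and that the object body has already been fetched from S3 as text. The `%s` placeholder style matches common MySQL and PostgreSQL drivers, but check the one you use.

```python
import csv
import io


def csv_rows(csv_text):
    """Parse CSV text with a header row into (columns, rows)."""
    reader = csv.reader(io.StringIO(csv_text))
    columns = next(reader)                      # first line holds the column names
    rows = [tuple(r) for r in reader if r]      # skip blank lines
    return columns, rows


def insert_statement(table, columns):
    """Build a parameterized INSERT; values are passed separately to the driver."""
    cols = ", ".join(columns)
    marks = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({cols}) VALUES ({marks})"
```

The statement and rows can then be passed to a driver's `executemany`. Since S3 reads work across regions, the same function can process buckets in both regions as long as its role has access to them.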
See these docs as ref:
https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
Keep in mind, however, that a workflow is not ideal when your data set is so large that processing it exceeds the 15-minute timeout that Lambda imposes. In this case, you should use AWS Glue.
I have a use case where my Redshift cluster is private and supports only VPN connections to the VPC. I need to send data from Kinesis Firehose, which is in another VPC. I found out that we need to make Redshift public or attach an internet gateway to make this happen, but I can't use an internet gateway. I need to connect to Redshift from Kinesis Firehose over VPN only. I am not able to figure out any way to do this.
As you are already aware, you cannot use a private Redshift cluster in a VPC as a target for Firehose without Internet access. There is no direct solution for this as detailed here and here.
That said, I can think of at least two workarounds that might suffice.
You can have Firehose target S3. Then set up PrivateLink access to S3 from the private VPC and set up an event to copy the data into the Redshift cluster on an acceptable cadence. I think this is probably the best option.
You MIGHT be able to set up Firehose with a Lambda processor that feeds the records into Redshift. The reason I say "might" is that the Lambda will also need to be within the VPC and will need to keep up with the Firehose flow. This could be fraught with performance issues, and potentially expensive. Also, as a data warehouse, Redshift isn't really meant for a high rate of write transactions. This is the worst option.
Firehose aggregates data in S3 and then triggers a COPY command in Redshift. As you don't have a network path from Firehose to Redshift this fails. However, Firehose can just stop at placing the data in S3.
Now you just need a way to trigger Redshift to COPY the data. There are a number of ways to do this but the easiest way is to use Lambda (in your Redshift VPC) to issue the COPY commands. You will need to decide on when the Lambda should run - Firehose uses two parameters to determine when a COPY should be issued; time since last COPY and data size not yet copied. You can emulate this behavior if you like but the simplest way is to just issue COPYs on some regular time interval, like every 5 min.
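If you do want to emulate the Firehose behavior, the decision boils down to a check like the sketch below. The threshold values are placeholders I picked for illustration, not Firehose defaults.

```python
def should_copy(seconds_since_last_copy, pending_bytes,
                max_interval_seconds=300, max_pending_bytes=64 * 1024 * 1024):
    """Return True when either the time or the size threshold has been reached."""
    if pending_bytes <= 0:
        return False  # nothing new to ingest, so skip the COPY
    return (seconds_since_last_copy >= max_interval_seconds
            or pending_bytes >= max_pending_bytes)
```

If a fixed schedule is enough, this check can be skipped entirely, which is the simpler route described next.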
To do this you set up CloudWatch to trigger your Lambda every 5 min. The Lambda:
1. looks in the Firehose location in S3 and lists all the files
2. renames (moves) all these files to put them in a new uniquely named S3 "subfolder"
3. issues the COPY command to Redshift to ingest from this "subfolder"
4. Upon successful ingestion these files can be moved again, left in the above "subfolder" or deleted
The reason to rename/move the files in S3 is to ensure that each run of the Lambda is operating on a unique set of files and that files aren't ingested more than once.
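The Lambda steps above can be sketched roughly as follows. This assumes `s3` behaves like `boto3.client("s3")` and `execute_sql` is whatever callable you use to run a statement on Redshift. The bucket, prefixes, table, role, and the `JSON 'auto'` format are all assumptions made for illustration.

```python
import uuid


def list_keys(s3, bucket, prefix):
    """Return every object key under prefix, following pagination."""
    keys, token = [], None
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix}
        if token:
            kwargs["ContinuationToken"] = token
        page = s3.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        token = page["NextContinuationToken"]


def stage_batch(s3, bucket, incoming_prefix, staging_root, batch_id=None):
    """Move incoming objects under a unique staging prefix; return (prefix, new keys).

    staging_root must not live under incoming_prefix, or staged files
    would be picked up again on the next run.
    """
    batch_prefix = f"{staging_root}{batch_id or uuid.uuid4().hex}/"
    moved = []
    for key in list_keys(s3, bucket, incoming_prefix):
        new_key = batch_prefix + key[len(incoming_prefix):]
        # S3 has no rename, so a move is a copy followed by a delete
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)
        moved.append(new_key)
    return batch_prefix, moved


def run_once(s3, execute_sql, bucket, incoming_prefix, staging_root,
             table, iam_role, batch_id=None):
    """One scheduled run: stage new files, then COPY only that batch."""
    batch_prefix, moved = stage_batch(s3, bucket, incoming_prefix,
                                      staging_root, batch_id)
    if not moved:
        return None  # nothing arrived since the last run
    execute_sql(
        f"COPY {table} FROM 's3://{bucket}/{batch_prefix}' "
        f"IAM_ROLE '{iam_role}' JSON 'auto';"
    )
    return batch_prefix
```

Files that Firehose writes while a run is in progress are simply picked up by the next run, since only keys listed at the start of the run are moved.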
We have a Data pipeline that does a nightly copy of our DynamoDB to S3 buckets so we can run reports on the data with Athena. Occasionally the pipeline will fail with a 503 SlowDown error. The retries will usually "succeed" but create tons of duplicate records in S3. The DynamoDB has On-Demand read capacity and the pipeline has 0.5 myDDBReadThroughputRatio. A couple of questions here:
I assume reducing the myDDBReadThroughputRatio would probably lessen the problem. If so, does anyone have a good ratio that will still be performant but not cause these errors?
Is there a way to prevent the duplicate records in S3? I can't figure out why these are being generated? (possibly the records from the failed run are not removed?)
Of course any other thoughts/solutions for the problem would be greatly appreciated.
Thanks!
Using AWS Data Pipeline for continuous backups is not recommended.
AWS recently launched new functionality that lets you export DynamoDB table data to S3, where it can be further analysed with Athena. Check it out here
You can also use AWS Glue to do the same (link).
If you still want to continue using Data Pipeline, the issue seems to be caused by S3 request limits being reached. You might need to check whether other requests are also writing to S3 at the same time, or whether you can limit the request rate from the pipeline using some configuration.
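Data Pipeline itself does not expose this, but if any of the writes to S3 come from code you control, a generic retry with exponential backoff is the usual way to react to 503 SlowDown responses. This is only a sketch: `operation` and `is_throttled` are hypothetical callables you would supply, and the delays are arbitrary.

```python
import time


def call_with_backoff(operation, is_throttled, max_attempts=5,
                      base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying with doubling delays while it is throttled."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            # give up on non-throttling errors or once attempts are exhausted
            if not is_throttled(exc) or attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Note that retrying only avoids duplicates if each attempt overwrites the same S3 key rather than writing a new object.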
I have set up audit log storage from Redshift in S3. Now I am planning to set up external tables on these audit logs. When I use an AWS Glue crawler to read those files, I get tons of tables, one for each file. I was expecting two tables overall (as we log two of the activities). If anyone has had success reading Amazon Redshift audit logs using external tables, I would like to have your input.
Thanks
Why is the AWS Glue crawler creating multiple tables from my source data, and how can I prevent that from happening? - https://aws.amazon.com/premiumsupport/knowledge-center/glue-crawler-multiple-tables/
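One crawler setting commonly used for this is the grouping policy that combines compatible schemas into a single table per S3 path. Below is a minimal sketch of the configuration string; whether it is enough for Redshift audit logs depends on how similar the files' schemas are, so treat it as a starting point.

```python
import json


def single_schema_crawler_configuration():
    """Return the crawler Configuration JSON that groups compatible schemas."""
    return json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    })
```

The returned string is what you would pass as the `Configuration` argument when creating or updating the crawler, for example through boto3's `update_crawler`. It is best to point the crawler at one include path per log type.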