Copy data from Redshift to ElasticSearch - amazon-web-services

We have a large amount of data stored on an ES cluster. I need to add one more field to the ES cluster and populate it from a column of a Redshift table. I've never worked with this kind of data transfer, I'm new to AWS, and I'm not sure how to approach the task or what I should read first. What is the best approach?

Are you using Logstash to ingest the data? If yes, you can easily add the column in the Logstash configuration, then restart Logstash from the beginning so that the additional column data is ingested into the index. Let me know what your current setup is.

As I understand it, you want to dump data from an Elasticsearch cluster and load it into Redshift.
Here is the approach I would take:
1. Dump the data from Elasticsearch using https://github.com/taskrabbit/elasticsearch-dump
2. Copy the JSON file to S3 using the AWS CLI
3. Copy the JSON file from S3 to Redshift using COPY: https://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-json.html
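For the direction the question actually describes (a Redshift column going into an existing ES index), one possible route is the Elasticsearch bulk API with partial updates. The sketch below only builds the request body. It assumes that the document IDs in ES match a key column in the Redshift table and that the cluster uses typeless indices (7.x or later). The index name `customers` and the field name `tier` are made up for illustration.

```python
import json


def build_bulk_update_body(index, rows, field):
    """Build an NDJSON body for the Elasticsearch _bulk endpoint.

    rows: iterable of (doc_id, value) pairs, e.g. the result of a
    Redshift query such as SELECT id, new_column FROM some_table.
    """
    lines = []
    for doc_id, value in rows:
        # Action line: partial update of an existing document.
        lines.append(json.dumps({"update": {"_index": index, "_id": str(doc_id)}}))
        # Payload line: only the new field is sent, other fields stay untouched.
        lines.append(json.dumps({"doc": {field: value}}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical rows standing in for the Redshift query result.
rows = [(1, "gold"), (2, "silver")]
body = build_bulk_update_body("customers", rows, "tier")
print(body)
```

The body would then be sent with a POST to the cluster's `_bulk` endpoint using the `application/x-ndjson` content type, in batches of a few thousand rows rather than all at once.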

Related

Batch file processing in AWS using Data Pipeline

I have a requirement to read a CSV batch file that was uploaded to an S3 bucket, encrypt the data in some columns, and persist this data in a DynamoDB table. While persisting each row in the DynamoDB table, depending on the data in each row, I need to generate an ID and store that in the DynamoDB table too. It seems AWS Data Pipeline allows creating a job to import S3 bucket files into DynamoDB, but I can't find a way to add custom logic there to encrypt some of the column values in the file and to generate the ID mentioned above.
Is there any way that I can achieve this requirement using AWS Data Pipeline? If not what would the best approach that I can follow using AWS services?
We also have a situation where we need to fetch data from S3 and load it into DynamoDB after performing some transformations (business logic).
We also use AWS Data Pipeline for this process.
We first trigger an EMR cluster from Data Pipeline, where we fetch the data from S3, transform it, and populate DynamoDB (DDB). You can include all the logic you require in the EMR cluster.
We have a timer set in the pipeline which triggers the EMR cluster once a day to perform the task.
Note that this can incur additional costs.
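The per-row logic described in the question (encrypt some columns, derive an ID from the row) could be sketched as below, independent of where it runs (an EMR step, a Lambda function, or something else). Everything named here is an assumption: `placeholder_encrypt` is only a stand-in and is not real encryption, `ID_NAMESPACE` is an arbitrary value, and the column names are invented. A real job would call an actual cipher or key service and then write the items to DynamoDB.

```python
import base64
import csv
import io
import uuid

# Hypothetical namespace used to derive deterministic IDs from row data.
ID_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def placeholder_encrypt(value):
    # NOT real encryption: base64 stands in for a call to a real cipher or KMS.
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def transform_rows(csv_text, secret_columns, id_columns, encrypt=placeholder_encrypt):
    """Turn CSV text into a list of dicts ready to be written as items."""
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = dict(row)
        # Derive the ID from the plaintext values, before encrypting anything.
        key = "|".join(row[col] for col in id_columns)
        item["id"] = str(uuid.uuid5(ID_NAMESPACE, key))
        # Replace the sensitive columns with their encrypted form.
        for col in secret_columns:
            item[col] = encrypt(row[col])
        items.append(item)
    return items


sample = "name,email,ssn\nAda,ada@example.com,123-45-6789\n"
items = transform_rows(sample, secret_columns=["ssn"], id_columns=["name", "email"])
print(items[0])
```

Deriving the ID with `uuid5` keeps it deterministic, so re-running the same file produces the same keys instead of duplicate items.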

How to perform Backfilling in redshift to bigquery migration?

I am using BigQuery Data Transfer Service to migrate all data from Redshift to BigQuery.
After that, I want to backfill a specific time range if any data is missing, but I don't see a backfill option in the transfer job.
How can I achieve that in BigQuery?
Reading your question in light of your comments, I would proceed differently from what you describe. You reach the same goal, however.
Using your ETL pipeline, the first step would be to accumulate raw data in a data lake.
Let's take a storage service like S3 to do so. For this ETL pipeline, S3 is your data sink.
Note that your pipeline does nothing more than take raw data from A and put it into S3. Also, the location in S3 should be under a timestamped folder, one per day for instance (e.g. yyyymmdd), so that you can sort and consume your data along the time dimension.
Obviously, this data is more recent than what you already have in Redshift.
It may also have a different structure from what you already put in Redshift, because of transformations in your initial pipeline.
If you loaded raw data directly into Redshift, just export it into the same S3 bucket under the name legacy/*. (If it was transformed, you have to add a second S3 data sink to your pipeline for this intermediate transformation and keep the same S3 naming strategy.)
Let's take a break to understand what we have. We filled an S3 bucket with raw data that we can now replay at will for a specific day, using cron or an orchestration tool such as Apache Airflow. Moreover, you can freely modify the content of each timestamped folder if you missed data, and then replay the following pipelines: that is the backfill you want.
Speaking of which, S3 acts as the data source for the following pipelines, which apply the wanted transformations to the raw data from S3 and use BigQuery, and potentially Redshift, as data sinks. Please take the price of these operations into consideration. The streaming API in BQ is expensive: as high as $0.50 per GB. Use it only if you need real-time results. If you can afford a latency of more than 5 minutes, a better strategy would be to set GCS as the data sink of your ETL and transfer the data from there into BQ (put the data under the same yyyymmdd naming pattern to enable potential backfill). This transfer is free if the GCS bucket and the BQ dataset are in the same region. You could trigger the transfer with GCS events, for instance by triggering a Cloud Function on blob creation that loads the data into BQ.
Last but not least, backfilling should be done wisely, especially in BQ, where updating or creating data at row level is not performant and opens the door to duplication. What you should consider is BigQuery partitioning, which you can set on a column that contains a timestamp, or on a hidden one if your data contains none. Which timestamp? Well, the one set in the GCS folder name!
Once again, you can modify data in your GCS bucket per day and replay the transfer into BQ.
But each transfer for a given day must overwrite the partition that the data belongs to (e.g. the data under 20200914 would overwrite the associated partition in BQ). Doing so, we abide by the concept of a pure task, which is a guarantee of idempotency and non-duplication.
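The day-by-day replay can be sketched as a small planning function: for each day in the backfill window it pairs the dated GCS folder with the matching BigQuery partition, addressed with the `table$YYYYMMDD` partition decorator. The bucket, dataset, and table names are hypothetical, and the sketch assumes a day-partitioned table. Each resulting pair would be run as a load job with the `WRITE_TRUNCATE` write disposition, so that only that day's partition is replaced.

```python
from datetime import date, timedelta


def backfill_plan(bucket, dataset, table, start, end):
    """Return (source_uri, destination) pairs, one per day from start to end inclusive."""
    plan = []
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")
        # Folder written by the ETL for that day.
        source_uri = f"gs://{bucket}/{stamp}/*"
        # Partition decorator: targets a single daily partition of the table.
        destination = f"{dataset}.{table}${stamp}"
        plan.append((source_uri, destination))
        day += timedelta(days=1)
    return plan


# Hypothetical names; replay two days of data.
plan = backfill_plan("raw-events", "analytics", "events", date(2020, 9, 14), date(2020, 9, 15))
for source_uri, destination in plan:
    print(source_uri, "->", destination)
```

Because every run overwrites exactly one partition from exactly one folder, running the same day twice gives the same result, which is the idempotency property described above.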
Please read this article to have more insights.
Note: If you intend to get rid of Redshift, you can do it directly and forget about S3 as the data sink of your first ETL. Choose GCS directly (ingress is free) and migrate your existing Redshift data into GCS using S3 as an intermediary service and the Google transfer service from S3 to GCS.

Copy data from PostgreSQL to S3 using AWS Data Pipeline

I am trying to copy all the tables from a schema (PostgreSQL, 50+ tables) to Amazon S3.
What is the best way to do this? I am able to create 50 different copy activities, but is there a simple way to copy all tables in a schema or write one pipeline and loop?
I think the old method is:
1. Unload your data from PostgreSQL to a CSV file first, using something like psql
2. Then just copy the CSV to S3
But AWS gives you a script to do so, RDSToS3CopyActivity. See this link from AWS.
Since you have a large number of tables, I would recommend using AWS Glue rather than AWS Data Pipeline. Glue is easy to configure and has crawlers and other features that give you the flexibility to choose columns, define transformations, and so on. Moreover, the underlying jobs in AWS Glue are PySpark jobs, which scale well and give you good performance.
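If you stay with the psql route, the "one pipeline and loop" idea from the question can be sketched as follows: list the tables of the schema once, then generate one export command and one S3 destination per table. The query uses `information_schema.tables`; the `%s` placeholder assumes a psycopg2-style driver, and the bucket, prefix, and table names are hypothetical.

```python
# Query to list the tables of a schema (the %s placeholder assumes a psycopg2-style driver).
LIST_TABLES_SQL = (
    "SELECT table_name FROM information_schema.tables "
    "WHERE table_schema = %s AND table_type = 'BASE TABLE' "
    "ORDER BY table_name"
)


def export_plan(schema, table_names, bucket, prefix):
    """Build one psql \\copy command and one S3 destination per table."""
    plan = []
    for name in table_names:
        local_file = f"{name}.csv"
        target = f'"{schema}"."{name}"'
        # psql meta-command that writes the table to a local CSV file with a header row.
        psql_cmd = f"\\copy {target} TO '{local_file}' WITH (FORMAT csv, HEADER)"
        s3_uri = f"s3://{bucket}/{prefix}/{schema}/{name}.csv"
        plan.append({"table": name, "psql": psql_cmd, "local_file": local_file, "s3_uri": s3_uri})
    return plan


# Hypothetical table names, as they would come back from LIST_TABLES_SQL.
plan = export_plan("public", ["customers", "orders"], "my-bucket", "exports")
for step in plan:
    print(step["psql"])
    print("upload", step["local_file"], "to", step["s3_uri"])
```

Each local file would then be uploaded to its `s3_uri`, for example with the AWS CLI.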

How to upload data via SQL to Amazon Redshift?

I created a cluster and connected to the database via SQL Workbench, but how can I upload data via SQL to Amazon Redshift?
I guess I have to use Amazon S3 but I could not find a sample video or text that describes it well.
There are two ways to insert information into Amazon Redshift:
Via the COPY command
Via INSERT statements
It is not recommended to use INSERT statements because they are not efficient for large data volumes. They are okay for doing ETL-type processes such as copying data between tables, but as a general rule data should be loaded via COPY.
As per Using a COPY Command to Load Data, the COPY command can load data from:
Amazon S3 (recommended, highly parallel)
Amazon EMR (Hadoop)
Amazon DynamoDB
Via SSH from remote hosts
The load from Amazon S3 is performed in parallel across all nodes and is the most efficient way to load data.
The Amazon Redshift COPY command can read several file formats:
Delimited (eg CSV)
Fixed-Width
AVRO
JSON
And these formats can also be compressed (eg gzip)
Bottom line: Get your data into Amazon S3 in a compatible format, then use COPY to load it.
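As a rough illustration, the helper below assembles a COPY statement for files already sitting in S3. The table name, bucket, and IAM role ARN are placeholders, and only two of the supported formats are covered. The resulting string would be run from any SQL client connected to the cluster, such as SQL Workbench.

```python
def build_copy_statement(table, s3_uri, iam_role, file_format="CSV", gzip=False, ignore_header=0):
    """Assemble a Redshift COPY statement for data stored in S3."""
    parts = [f"COPY {table}", f"FROM '{s3_uri}'", f"IAM_ROLE '{iam_role}'"]
    if file_format == "CSV":
        parts.append("FORMAT AS CSV")
    elif file_format == "JSON":
        # 'auto' maps JSON keys to column names of the target table.
        parts.append("FORMAT AS JSON 'auto'")
    else:
        raise ValueError(f"unsupported format in this sketch: {file_format}")
    if ignore_header:
        # Skip header rows at the top of each file.
        parts.append(f"IGNOREHEADER {ignore_header}")
    if gzip:
        # Input files are gzip-compressed.
        parts.append("GZIP")
    return " ".join(parts) + ";"


# Placeholder bucket, table, and role ARN.
statement = build_copy_statement(
    "sales",
    "s3://my-bucket/sales/",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
    file_format="CSV",
    gzip=True,
    ignore_header=1,
)
print(statement)
```

Pointing `s3_uri` at a prefix rather than a single file lets COPY pick up every file under it, which is what allows the parallel load across nodes.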
Also, try to understand DISTKEY and SORTKEY to get full performance benefits out of Redshift. Definitely read the manual -- it will save you more time than it takes to read!

Confusions related to Redshift about dataset (Structured, Unstructured, Semi-structured) and format to be used

Can anyone clearly explain what kind of data Redshift can handle (structured, unstructured, or any particular formats)?
How can I copy CloudFront logs into Amazon Redshift, even though the logs are unstructured data, without going through Amazon EMR?
How can I find the size of a database created in Amazon Redshift?
Please could someone explain all three questions above. An example, sample code, or a pointer to a source would be very helpful for my project.
Amazon Redshift provides a standard SQL interface (based on PostgreSQL). Therefore, it is best suited for structured data that is stored in Tables, Rows and Columns.
It is also possible to store JSON records within a field and access them via JSON functions.
To load data into Amazon Redshift, it needs to be in a delimited file format, such as comma delimited, tab delimited, fixed-length fields or JSON format. Any data that is not in a suitable format will need to be pre-processed and converted to a suitable format. This could be done with tools such as Amazon Athena (Presto) or Amazon EMR (Hadoop).
Amazon CloudFront logs are in Tab-Delimited format and can be loaded directly into Amazon Redshift. For an example, see: Analyzing S3 and CloudFront Access Logs with AWS Redshift
Information about disk space consumed by tables can be obtained via the SVV_DISKUSAGE system view.