Upload CSV data directly to Amazon Redshift with Talend

Is it possible to upload data directly to Amazon Redshift without passing through Amazon S3 (Using Talend)?

It is possible to do this using the Talend connectors for PostgreSQL, but the result would be very slow indeed (it could take seconds per row of data).
You really need to do the following (a rough sketch of the first few steps follows this list):
1. Split the large CSV files up, e.g. into roughly 10 MB pieces (there is no set size for this)
2. Gzip each CSV file
3. Upload the files to S3
4. Run a Redshift COPY command
5. Run any SQL on Redshift that is required to process the new data (an upsert, for example)
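For reference, here is a rough Python sketch of steps 1-3 using boto3; the bucket name, key prefix, source file and chunk size are all placeholders, not anything Talend generates for you:

    import gzip
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-load-bucket"      # hypothetical bucket name
    rows_per_part = 500_000        # tune so each gzipped part comes out around ~10 MB

    with open("big_export.csv") as src:    # assumed source file with no header row
        part, buf = 0, []
        for line in src:
            buf.append(line)
            if len(buf) == rows_per_part:
                s3.put_object(Bucket=bucket,
                              Key=f"loads/part_{part:04d}.csv.gz",
                              Body=gzip.compress("".join(buf).encode()))
                part, buf = part + 1, []
        if buf:  # flush the last, partially filled chunk
            s3.put_object(Bucket=bucket,
                          Key=f"loads/part_{part:04d}.csv.gz",
                          Body=gzip.compress("".join(buf).encode()))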

It is possible using INSERT statements, but it is not at all efficient and is very slow, and thus not recommended. Redshift is built for handling and managing bulk loads.
The best and fastest approach to load data into Redshift is to split the large files into smaller parts, upload them to S3 (multipart upload helps here), and then load the data from S3 into Redshift with the COPY command, which runs in parallel across the cluster (see this).
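As a rough illustration rather than a definitive implementation, the COPY step can be driven from Python with psycopg2; the cluster endpoint, IAM role, table name and S3 prefix below are placeholders:

    import psycopg2  # endpoint, credentials, IAM role and table are placeholders

    copy_sql = """
        COPY my_table
        FROM 's3://my-load-bucket/loads/part_'   -- key prefix: loads every matching part
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
        GZIP
        CSV;
    """

    conn = psycopg2.connect(host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="mydb", user="loader", password="...")
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)

Because the FROM clause is a key prefix, Redshift loads all of the split, gzipped parts in parallel across the cluster.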

Related

Data Ingestion in Amazon Redshift

I have multiple data sources from which I need to build and implement a DWH in AWS. I have one challenge with respect to one of my unstructured data sources (data coming from different APIs). How can I ingest data from this source into Amazon Redshift? Can we first pull it into an Amazon S3 bucket and then integrate S3 with Amazon Redshift? What is the better approach?
Yes, S3 first. Your APIs can write to S3, and/or if you like you can use a service like Kinesis (with or without Firehose) to populate S3. From there it is just work in Redshift.
Without knowing more about the sources, yes S3 is likely the right approach - whether you require latency in seconds, minutes or hours will be an important consideration.
If latency is not a driving concern, simply:
1. Set up an S3 bucket to use as a destination for your initial source(s) (sketched below).
2. Create tables in your Redshift database (loading data from S3 into Redshift requires a pre-existing destination table).
3. Use the COPY command to load from S3 into Redshift.
As noted, there may be value in Kinesis, especially if you're working with real-time data streams (the service recently introduced support for skipping S3 and streaming directly to Redshift).
S3 is probably the easier approach, if you're not trying to analyze real-time streams.
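To make the "pull it into S3 first" step concrete, here is a minimal Python sketch; the API URL, bucket and key layout are made up, and it assumes the API returns a JSON array of records:

    import datetime
    import json
    import boto3
    import requests  # stands in for however you already call your APIs

    s3 = boto3.client("s3")
    bucket = "my-raw-landing-bucket"   # hypothetical landing bucket

    # Hypothetical API call; assumes the response body is a JSON array.
    records = requests.get("https://api.example.com/orders").json()

    # Write newline-delimited JSON, which a later Redshift COPY ... JSON 'auto' can read.
    key = "api/orders/dt={}/batch.json".format(datetime.date.today().isoformat())
    s3.put_object(Bucket=bucket, Key=key,
                  Body="\n".join(json.dumps(r) for r in records).encode())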

Read S3 Bucket from EC2 for ML Training

I am trying to train a machine learning model on AWS EC2. I have over 50GB of data currently stored in an AWS S3 bucket. When training my model on EC2, I want to be able to access this data.
Essentially, I want to be able to call this command:
python3 train_model.py --train_files /data/train.csv --dev_files /data/dev.csv --test_files /data/test.csv
where /data/train.csv is my S3 bucket s3://data/. How can I do this? I currently only see ways to cp my S3 data into my EC2.
You can extend your code to read the data directly from S3 using boto.
But if you want to access your S3 bucket as if it were another local filesystem, I would consider s3fs-fuse, explained further here.
Another option would be to use the AWS CLI to sync your data to a local folder.
How can I do this? I currently only see ways to cp my S3 data into my EC2.
S3 is an object storage system. It does not allow direct access to, or reading of, files like a regular file system.
Thus, to read your files, you need to download them first (downloading in parts is also possible), or have some third-party software such as s3fs-fuse do it for you. You can download them to your instance, or store them on an external file system (e.g. EFS).
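A minimal boto3 sketch of the "download first" route, reusing the bucket and local paths from the question (adjust to your actual layout):

    import boto3

    s3 = boto3.client("s3")

    # Pull the objects down to instance storage (or an EFS mount) before training.
    for key in ("train.csv", "dev.csv", "test.csv"):
        s3.download_file("data", key, f"/data/{key}")   # (Bucket, Key, local path)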
It's not clear from your question whether you have one 50 GB CSV file or multiple smaller ones. If you have one large 50 GB CSV file and you don't need all of it at once, you can reduce the amount of data read by using S3 Select:
With S3 Select, you can use a simple SQL expression to return only the data from the store you’re interested in, instead of retrieving the entire object. This means you’re dealing with an order of magnitude less data which improves the performance of your underlying applications.
Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format.
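A small boto3 sketch of S3 Select against the question's bucket; the column names and filter are hypothetical, and only the matching rows and columns come back over the network:

    import boto3

    s3 = boto3.client("s3")

    resp = s3.select_object_content(
        Bucket="data",
        Key="train.csv",
        ExpressionType="SQL",
        Expression="SELECT s.feature_1, s.label FROM s3object s WHERE s.label = '1'",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )

    with open("/data/train_subset.csv", "w") as out:
        for event in resp["Payload"]:
            if "Records" in event:
                out.write(event["Records"]["Payload"].decode("utf-8"))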

Copy data from PostgreSQL to S3 using AWS Data Pipeline

I am trying to copy all the tables from a schema (PostgreSQL, 50+ tables) to Amazon S3.
What is the best way to do this? I am able to create 50 different copy activities, but is there a simple way to copy all tables in a schema or write one pipeline and loop?
I think the old method is:
1. Unload your data from PostgreSQL to a CSV file first, using something like psql
2. Then just copy the CSV to S3
However, AWS gives you RDSToS3CopyActivity to do this. See this link from AWS.
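If you do script it yourself, a loop over information_schema keeps it to one small job instead of 50+ hand-built copy activities. This sketch uses psycopg2 and boto3; the host, credentials, schema and bucket are placeholders, and it assumes each table is small enough to buffer in memory:

    import io
    import boto3
    import psycopg2

    conn = psycopg2.connect(host="my-postgres-host", dbname="mydb",
                            user="exporter", password="...")
    s3 = boto3.client("s3")
    schema, bucket = "public", "my-export-bucket"

    with conn.cursor() as cur:
        cur.execute("""SELECT table_name FROM information_schema.tables
                       WHERE table_schema = %s AND table_type = 'BASE TABLE'""", (schema,))
        tables = [row[0] for row in cur.fetchall()]

    for table in tables:
        buf = io.StringIO()   # for very large tables, stream to a temp file instead
        with conn.cursor() as cur:
            cur.copy_expert(f"COPY {schema}.{table} TO STDOUT WITH CSV HEADER", buf)
        s3.put_object(Bucket=bucket, Key=f"exports/{schema}/{table}.csv",
                      Body=buf.getvalue().encode())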
Since you have a large number of tables, I would recommend using AWS Glue rather than AWS Data Pipeline. Glue is easily configurable, with crawlers and so on that give you the flexibility to choose columns, define transformations, etc. Moreover, the underlying jobs in AWS Glue are PySpark jobs that scale really well, giving you very good performance.

Copy data from Redshift to ElasticSearch

We have a large amount of data stored on an ES cluster. I need to add one more field to the ES cluster and upload data for this field from a Redshift table's column. I've never worked with such a data transfer, and I'm new to AWS and not sure how to approach this task or what I should read to perform such a data transfer. Do you know what the best approach is?
Are you using Logstash to load the data? If yes, then you can easily add the column in Logstash, and restart Logstash from the beginning so that the additional column data is ingested into the index. Let me know what your current setup is.
As I understand it, you want to dump data from the Elasticsearch cluster and load it into Redshift.
Here is an approach I would take:
1. Dump the data from Elasticsearch using https://github.com/taskrabbit/elasticsearch-dump
2. Copy the JSON file to S3 using the AWS CLI
3. Copy the JSON file from S3 to Redshift using COPY (sketched below): https://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-json.html
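Step 3 might look roughly like this from Python; the table, S3 path, IAM role and cluster endpoint are placeholders, and depending on the shape of the dumped documents you may need a jsonpaths file instead of 'auto':

    import psycopg2  # placeholder endpoint, IAM role, bucket path and table name

    copy_sql = """
        COPY es_documents
        FROM 's3://my-export-bucket/es_dump/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
        FORMAT AS JSON 'auto';
    """

    conn = psycopg2.connect(host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="mydb", user="loader", password="...")
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)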

How to upload data via SQL to Amazon Redshift?

I created a cluster and connected to the database via SQL Workbench, but how can I upload data via SQL to Amazon Redshift?
I guess I have to use Amazon S3 but I could not find a sample video or text that describes it well.
There are two ways to insert information into Amazon Redshift:
Via the COPY command
Via INSERT statements
It is not recommended to use INSERT statements because they are not efficient for large data volumes. They are okay for doing ETL-type processes such as copying data between tables, but as a general rule data should be loaded via COPY.
As per Using a COPY Command to Load Data, the COPY command can load data from:
Amazon S3 (recommended, highly parallel)
Amazon EMR (Hadoop)
Amazon DynamoDB
Via SSH from remote hosts
The load from Amazon S3 is performed in parallel across all nodes and is the most efficient way to load data.
The Amazon Redshift COPY command can read several file formats:
Delimited (eg CSV)
Fixed-Width
AVRO
JSON
And these formats can also be compressed (eg gzip)
Bottom line: Get your data into Amazon S3 in a compatible format, then use COPY to load it.
Also, try to understand DISTKEY and SORTKEY to get full performance benefits out of Redshift. Definitely read the manual -- it will save you more time than it takes to read!
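Putting the last two points together, here is a minimal sketch with placeholder table, keys, S3 path and cluster details; the DISTKEY/SORTKEY choices are only illustrative, not a recommendation for your schema:

    import psycopg2  # endpoint, credentials, IAM role and column layout are all placeholders

    ddl = """
        CREATE TABLE IF NOT EXISTS sales (
            sale_id     BIGINT,
            customer_id BIGINT,
            sale_date   DATE,
            amount      DECIMAL(12,2)
        )
        DISTKEY (customer_id)   -- co-locates rows that are joined/grouped on customer_id
        SORTKEY (sale_date);    -- speeds up range filters on sale_date
    """

    copy_sql = """
        COPY sales
        FROM 's3://my-load-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
        CSV
        GZIP;
    """

    conn = psycopg2.connect(host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="mydb", user="loader", password="...")
    with conn, conn.cursor() as cur:
        cur.execute(ddl)
        cur.execute(copy_sql)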