Python script to load data from AWS S3 to Redshift - python-2.7

Has anybody worked on creating a Python script to load data from S3 into Redshift tables for multiple files? How can we achieve it with the AWS CLI? Your learnings and inputs on this are appreciated.

The COPY command is the best way to load data from Amazon S3 into Amazon Redshift. It can load multiple files in parallel into a single table.
Use any Python library that speaks the PostgreSQL protocol (e.g. Psycopg) to connect to Amazon Redshift, then issue the COPY command.
The AWS Command-Line Interface (CLI) does not have the ability to run the COPY command on Redshift because it needs to be issued to the database, while the AWS CLI issues commands to AWS. (The AWS CLI can be used to launch/terminate a Redshift cluster, but not to connect to the cluster itself.)
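As a rough illustration of the approach above, here is a minimal sketch that builds a COPY statement for an S3 prefix (so every file under the prefix is loaded) and runs it through psycopg2. The table name, bucket, IAM role ARN, connection parameters and the CSV/IGNOREHEADER options are all assumptions made for the example; adjust them to your data. The values are interpolated directly into the SQL, so they must come from a trusted source.

```python
def build_copy_sql(table, s3_prefix, iam_role_arn):
    """Build a COPY statement that loads every object under s3_prefix into table.

    Assumes CSV files with one header row; change the format options as needed.
    """
    return (
        "COPY {table} FROM '{prefix}' "
        "IAM_ROLE '{role}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    ).format(table=table, prefix=s3_prefix, role=iam_role_arn)


def load_from_s3(conn_params, table, s3_prefix, iam_role_arn):
    """Connect to Redshift and issue the COPY command."""
    import psycopg2  # third-party; imported here so the builder works without it

    conn = psycopg2.connect(**conn_params)
    try:
        with conn.cursor() as cur:
            cur.execute(build_copy_sql(table, s3_prefix, iam_role_arn))
        conn.commit()
    finally:
        conn.close()
```

Calling load_from_s3 with a dict of host, port, dbname, user and password, plus something like "public.sales" and "s3://my-bucket/sales/" (hypothetical names), would load all files under that prefix in one COPY.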

Related

AWS EMR: How to migrate data from one EMR to another EMR

I currently have an AWS EMR cluster running with HBase. And I am saving the data to S3. I want to migrate the data to a new EMR cluster on the same account. What is the proper way to migrate data from one EMR to another?
Thank you
There are different ways to copy a table from one cluster to another:
Use the CopyTable utility. The disadvantage is that it can degrade region server performance, or the tables may need to be disabled before the copy.
HBase snapshots (recommended). They have little impact on region server performance.
You can follow the AWS documentation to perform the snapshot/restore operations.
Basically you will do the following:
Create Snapshot
Export to S3
Import from S3
Restore to HBase
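The four steps above can be sketched as a small Python helper that only assembles the shell commands, so you can review them before running anything. The table name, snapshot name, S3 path and HDFS target are hypothetical. The command forms are written from my understanding of the HBase ExportSnapshot tool and the HBase shell, so verify them against the AWS documentation for your EMR release before relying on them.

```python
def snapshot_migration_commands(table, snapshot, s3_path, hdfs_target):
    """Return the snapshot/export/import/restore steps as shell command strings."""
    export_tool = "hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot"
    return [
        # 1. Create the snapshot (run on the source cluster).
        "echo \"snapshot '{t}', '{s}'\" | hbase shell".format(t=table, s=snapshot),
        # 2. Export the snapshot to S3 (run on the source cluster).
        "{tool} -snapshot {s} -copy-to {s3}".format(
            tool=export_tool, s=snapshot, s3=s3_path),
        # 3. Import the snapshot from S3 (run on the new cluster).
        "{tool} -D hbase.rootdir={s3} -snapshot {s} -copy-to {hdfs}".format(
            tool=export_tool, s=snapshot, s3=s3_path, hdfs=hdfs_target),
        # 4. Restore the snapshot into HBase (run on the new cluster).
        "echo \"restore_snapshot '{s}'\" | hbase shell".format(s=snapshot),
    ]
```

Printing the returned list gives you a checklist of commands to run on the old cluster (steps 1 and 2) and on the new one (steps 3 and 4).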

How to load files located On-Prem to AWS using AWS Glue

Can I directly load files located in an on-prem location into RDS using AWS Glue?
Also, if I have to park the files in S3 before loading, what options do I have apart from using the CLI?
Did you check out AWS Database Migration Service (DMS)? With it, you do not need AWS Glue.
If you prefer not to use AWS DMS and go through S3 instead, you can use an S3 client such as CloudBerry to move the files from on-prem to S3, and then run the "Load data from S3...." command from an AWS Glue script to insert the data into the RDS tables (MySQL only).
For the Load data from S3 command, please refer here: S3 to RDS MySQL
You can load S3 data into RDS using an AWS Glue crawler and job.
To get on-premises data into S3: if the data is not that large, you can upload it with a script or program using the S3 file upload functionality; if it is too large for that, go for AWS Database Migration Service.
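For the "upload via script" option, a minimal sketch using boto3 could look like the following. The directory, bucket name and key prefix are placeholders, and it assumes AWS credentials are already configured on the on-prem machine.

```python
import os


def s3_key_for(local_path, local_root, key_prefix):
    """Map a local file path to an S3 key, keeping the relative directory layout."""
    rel = os.path.relpath(local_path, local_root)
    return key_prefix.rstrip("/") + "/" + rel.replace(os.sep, "/")


def upload_directory(local_root, bucket, key_prefix):
    """Upload every file under local_root to s3://bucket/key_prefix/..."""
    import boto3  # third-party; imported here so s3_key_for works without it

    s3 = boto3.client("s3")
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            s3.upload_file(path, bucket, s3_key_for(path, local_root, key_prefix))
```

Something like upload_directory("/data/exports", "my-bucket", "incoming") (hypothetical names) would then park the files in S3, ready for a Glue job to pick up.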

How to run glue script from Glue Dev Endpoint

I have a Glue script (test.py) written in an editor. I connected to a Glue dev endpoint and copied the script to the endpoint (I could also store it in an S3 bucket). Basically, a Glue endpoint is an EMR cluster, so how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it?
I know we can run it from the Glue console, but I am more interested in whether I can run it from the Glue endpoint terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).
e.g.
radix@localhost:~$ DEV_ENDPOINT=glue@ec2-w-x-y-z.compute-1.amazonaws.com
radix@localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue@ip-w-x-y-z ~]$ gluepython myscript.py
You can also run the script directly without getting an interactive shell with ssh (of course, after uploading the script with scp or whatever):
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py
If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters.
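If you want to drive this from a local Python script rather than typing the ssh command, a minimal sketch could assemble the same command with the two job parameters appended. The key path, endpoint, job name and temp directory are placeholders, and whether your script needs further arguments depends on what it reads.

```python
import subprocess


def remote_glue_command(private_key, dev_endpoint, script, job_name, temp_dir):
    """Build the ssh command that runs a Glue script with gluepython on the endpoint."""
    return [
        "ssh", "-i", private_key, dev_endpoint,
        "gluepython", script,
        "--JOB_NAME", job_name,
        "--TempDir", temp_dir,
    ]


def run_remote_glue_script(*args):
    """Run the command and raise CalledProcessError if the remote script fails."""
    subprocess.check_call(remote_glue_command(*args))
```

The script still has to be uploaded to the endpoint first (for example with scp, as shown above).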
For development and testing purposes, you can set up a Zeppelin notebook locally and establish an SSH connection using the AWS Glue endpoint URL, so that you have access to the data catalog, crawlers, etc., and also to the S3 bucket where your data resides.
After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a Job pointing to the ETL script in the S3 bucket, so that the job can be run and scheduled as well.
For help setting up a local environment, please refer here and to setting up Zeppelin on Windows. You can use the dev instance provided by Glue, but you may incur additional costs for it (EC2 instance charges).
Once you have set up the Zeppelin notebook, you can copy the script (test.py) into the notebook and run it from Zeppelin.
According to AWS Glue FAQ:
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a
scale-out execution environment for your data transformation jobs. AWS
Glue infers, evolves, and monitors your ETL jobs to greatly simplify
the process of creating and maintaining jobs. Amazon EMR provides you
with direct access to your Hadoop environment, affording you
lower-level access and greater flexibility in using tools beyond
Spark.
Do you have any specific requirement to run the Glue script on an EMR instance? In my opinion, EMR gives more flexibility: you can use any third-party Python libraries and run them directly on an EMR Spark cluster.
Regards

Setting hive properties in Amazon EMR?

I'm trying to run a Hive query using Amazon EMR, and am trying to get Apache Tez to work with it too, which from what I understand requires setting the hive.execution.engine property to tez according to the hive site?
I get that hive properties can be set with set hive.{...} usually, or in the hive-site.xml, but I don't know how either of those interact with / are possible to do in Amazon EMR.
So: is there a way to set Hive Configuration Properties in Amazon EMR, and if so, how?
Thanks!
You can do this in two ways:
1) DIRECTLY WITHIN SINGLE HIVE SCRIPT (.hql file)
Just put your properties at the beginning of your Hive hql script, like:
set hive.execution.engine=tez;
CREATE TABLE...
2) VIA APPLICATION CONFIGURATIONS
When you create an EMR cluster, you can specify Hive configurations that apply for the cluster's entire lifetime. This can be done either via the AWS Management Console or via the AWS CLI.
a) AWS Management Console
Open AWS EMR service and click on Create cluster button
Click on Go to advanced options at the top
Be sure to select Hive among the applications, then enter a JSON configuration like the one below, where you can set any of the properties you would usually put in the hive-site.xml configuration; I highlighted the Tez property as an example. You can optionally load the JSON from an S3 path.
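The JSON from the console example is not reproduced in this text. As a hedged reconstruction, a configuration that sets only the Tez property through the hive-site classification would look like what this short Python sketch prints (and can optionally write to a file). The file name configurations.json is just the one used in the CLI example, and any further properties would go in the same "Properties" map.

```python
import json

# One configuration object per classification; "hive-site" maps to hive-site.xml.
hive_configurations = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.execution.engine": "tez",
        },
    }
]


def write_configurations(path="configurations.json"):
    """Write the configuration list to a JSON file for use at cluster creation."""
    with open(path, "w") as f:
        json.dump(hive_configurations, f, indent=2)


print(json.dumps(hive_configurations, indent=2))
```

The printed JSON is what you would paste into the console's configuration box or save to a file.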
b) AWS CLI
As stated in detail here, you can specify the Hive configuration on cluster creation, using the flag --configurations, like below:
aws emr create-cluster --configurations file://configurations.json --release-label emr-5.9.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large --auto-terminate
The JSON file has the same content shown above in the Management Console example.
Again, you can optionally specify an S3 path instead:
--configurations https://s3.amazonaws.com/myBucket/configurations.json
Amazon Elastic MapReduce (EMR) is an automated means of deploying a normal Hadoop distribution. Commands you can normally run against Hadoop and Hive will also work under EMR.
You can execute hive commands either interactively (by logging into the Master node) or via scripts (submitted as job 'steps').
You would be responsible for installing TEZ on Amazon EMR. I found this forum post: TEZ on EMR
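For the "scripts submitted as steps" route mentioned in this answer, here is a minimal sketch using boto3. It assumes a release where command-runner.jar is available, and the cluster ID and script location are placeholders. The Hive script itself can start with its own set hive.{...} lines to configure properties for that run.

```python
def hive_step(name, script_s3_path):
    """Build an EMR step definition that runs a Hive script stored in S3."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script", "--args",
                "-f", script_s3_path,
            ],
        },
    }


def submit_hive_step(cluster_id, step):
    """Submit the step to a running cluster and return the new step IDs."""
    import boto3  # third-party; imported here so hive_step works without it

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]
```

For example, submit_hive_step("j-XXXXXXXXXXXXX", hive_step("my query", "s3://my-bucket/scripts/query.hql")) would queue the script on an existing cluster (both values are hypothetical).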

Configuring external data source for Elastic MapReduce

We want to use Amazon Elastic MapReduce on top of our current DB (we are using Cassandra on EC2). Looking at the Amazon EMR FAQ, it should be possible:
Amazon EMR FAQ: Q: Can I load my data from the internet or somewhere other than Amazon S3?
However, when creating a new job flow, we can only configure a S3 bucket as input data origin.
Any ideas/samples on how to do this?
Thanks!
P.S.: I've seen this question How to use external data with Elastic MapReduce but the answers do not really explain how to do it/configure it, simply that it is possible.
How are you processing the data? EMR is just managed Hadoop. You still need to write a process of some sort.
If you are writing a Hadoop MapReduce job, then you are writing Java and you can use the Cassandra APIs to access it.
If you want to use something like Hive, you will need to write a Hive storage handler to use data backed by Cassandra.
Try using scp to copy files to your EMR instance:
my-desktop-box$ scp mylocaldatafile my-emr-node:/path/to/local/file
(or use ftp, or wget, or curl, or anything else you want)
then log into your EMR instance with ssh and load it into hadoop:
my-desktop-box$ ssh my-emr-node
my-emr-node$ hadoop fs -put /path/to/local/file /path/in/hdfs/file