Amazon Redshift: Insert data into table from S3 using Java API

I currently have a file in S3. I would like to issue commands using the Java AWS SDK to take this data and place it into a Redshift table. If the table does not exist, I would also like to create it. I have been unable to find any clear examples of how to do this, so I am wondering if I am going about it the wrong way. Should I be using a standard Postgres Java connector instead of the AWS SDK?

Connect to the cluster (http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-in-code.html#connecting-in-code-java) and submit your CREATE TABLE and COPY commands.
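To make that concrete, here is a minimal sketch of what "CREATE TABLE and COPY" can look like over JDBC; the endpoint, table definition, S3 path, and credentials are placeholders, not values from the question:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

// Minimal sketch: create the table if it does not exist, then COPY from S3.
public class CreateAndCopy {
    public static void main(String[] args) throws Exception {
        Class.forName("com.amazon.redshift.jdbc42.Driver");
        Properties props = new Properties();
        props.setProperty("user", "your-user");          // placeholder credentials
        props.setProperty("password", "your-password");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:redshift://your-cluster.redshift.amazonaws.com:5439/example-database", props);
             Statement stmt = conn.createStatement()) {
            conn.setAutoCommit(false);
            stmt.executeUpdate("CREATE TABLE IF NOT EXISTS my_table (id INT, name VARCHAR(100))");
            stmt.executeUpdate("COPY my_table FROM 's3://my-bucket/example.csv' "
                + "CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...' CSV IGNOREHEADER 1");
            conn.commit();
        }
    }
}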

The answer above serves most of the purpose.
I would like to post working Java JDBC code that does exactly this: COPY from S3 into a Redshift table. I hope it will help others.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class RedShiftJDBC {
    public static void main(String[] args) {
        Connection conn = null;
        Statement statement = null;
        try {
            // The PostgreSQL driver works too; just make sure to use the postgresql URL instead of the redshift one.
            // Class.forName("org.postgresql.Driver");
            // Make sure to choose the appropriate Redshift JDBC driver and put its jar on the classpath.
            Class.forName("com.amazon.redshift.jdbc42.Driver");
            Properties props = new Properties();
            props.setProperty("user", "username***");
            props.setProperty("password", "password****");
            System.out.println("\n\nconnecting to database...\n\n");
            // In case you are using the PostgreSQL JDBC driver:
            // conn = DriverManager.getConnection("jdbc:postgresql://********8-your-to-redshift.redshift.amazonaws.com:5439/example-database", props);
            conn = DriverManager.getConnection("jdbc:redshift://********url-to-redshift.redshift.amazonaws.com:5439/example-database", props);
            System.out.println("\n\nConnection made!\n\n");
            statement = conn.createStatement();
            String command = "COPY my_table FROM 's3://path/to/csv/example.csv' CREDENTIALS 'aws_access_key_id=******;aws_secret_access_key=********' CSV DELIMITER ',' IGNOREHEADER 1";
            System.out.println("\n\nExecuting...\n\n");
            statement.executeUpdate(command);
            // You must commit if you really want the data saved; otherwise it will not appear when you query from another session.
            conn.commit();
            System.out.println("\n\nThat's all: COPY using plain JDBC.\n\n");
            statement.close();
            conn.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

Related

How to Load Data into remote Neo4j AWS instance?

I want to import data into a Neo4j instance brought up in AWS (community edition from the AWS Marketplace). One option is to convert the data to CSV, run the LOAD CSV command in the Neo4j UI, and point it to a public HTTP address that reads from S3. This, however, means we would need to expose the file publicly, which is a problem for sensitive data. How else can I import this data?
Thanks!
I would suggest you use one of the Neo4j drivers, such as Python or Java. Here is a Python example that I used in my posts:
def store_to_neo4j(distances):
    data = [{'source': el[0], 'target': el[1], 'weight': distances[el]} for el in distances]
    with driver.session() as session:
        session.run("""
            UNWIND $data as row
            MERGE (c:Character{name:row.source})
            MERGE (t:Character{name:row.target})
            MERGE (c)-[i:INTERACTS]-(t)
            SET i.weight = coalesce(i.weight,0) + row.weight
            """, {'data': data})
You don't want to execute the import row by row; instead, batch, say, 1000 rows into a parameter and then use the UNWIND operator to import the data.
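Since the answer mentions the Java driver as an option, here is a minimal sketch of the same batched UNWIND import using the official Neo4j Java driver (4.x); the bolt URI, credentials, and the loadRows() data source are hypothetical:
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class BatchedNeo4jImport {
    // Same Cypher as the Python example: one UNWIND call per batch of rows.
    private static final String QUERY =
        "UNWIND $data AS row "
      + "MERGE (c:Character {name: row.source}) "
      + "MERGE (t:Character {name: row.target}) "
      + "MERGE (c)-[i:INTERACTS]-(t) "
      + "SET i.weight = coalesce(i.weight, 0) + row.weight";

    public static void main(String[] args) {
        // Bolt URI and credentials are placeholders.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            List<Map<String, Object>> batch = new ArrayList<>();
            for (Map<String, Object> row : loadRows()) {   // loadRows() is a hypothetical data source
                batch.add(row);
                if (batch.size() == 1000) {                // send 1000 rows per UNWIND call
                    session.run(QUERY, Values.parameters("data", batch)).consume();
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                session.run(QUERY, Values.parameters("data", batch)).consume();
            }
        }
    }

    private static List<Map<String, Object>> loadRows() {
        // Hypothetical data source: each row has "source", "target" and "weight" keys.
        Map<String, Object> row = new HashMap<>();
        row.put("source", "Jon");
        row.put("target", "Arya");
        row.put("weight", 1);
        return Collections.singletonList(row);
    }
}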

AWS DynamoDB read newly inserted record

Since I am new to AWS and its services, I prepared a DynamoDB use case for hands-on practice. Whenever a record is inserted into DynamoDB, that record should move to S3 for further processing. I have written the below code snippet in Java using the KCL:
public static void main(String... args) {
    KinesisClientLibConfiguration workerConfig = createKCLConfiguration();
    StreamsRecordProcessorFactory recordProcessorFactory = new StreamsRecordProcessorFactory();
    System.out.println("Creating worker");
    Worker worker = createKCLCWorker(workerConfig, recordProcessorFactory);
    System.out.println("Starting worker");
    worker.run();
}

public class StreamsRecordProcessorFactory implements IRecordProcessorFactory {
    public IRecordProcessor createProcessor() {
        return new StreamRecordsProcessor();
    }
}
Method in the StreamRecordsProcessor class:
private void processRecord(Record record) {
    if (record instanceof RecordAdapter) {
        com.amazonaws.services.dynamodbv2.model.Record streamRecord = ((RecordAdapter) record)
            .getInternalObject();
        if ("INSERT".equals(streamRecord.getEventName())) {
            Map<String, AttributeValue> attributes
                = streamRecord.getDynamodb().getNewImage();
            System.out.println(attributes);
            System.out.println(
                "New item name: " + attributes.get("name").getS());
        }
    }
}
From my local environment, I am able to see the record whenever we add records in DynamoDB, but I have a few questions:
How can I deploy this project to AWS?
What procedure or configuration is required on the AWS side?
Please share your thoughts.
You should be able to use AWS Lambda as the integration point: a Lambda function reads records from the DynamoDB stream and pushes them into a Kinesis Firehose delivery stream, which ultimately deposits the data in S3. There is an AWS blog article that can serve as a high-level guide for doing this; it describes the AWS components you can use to build this, and additional research on each component will help you put the pieces together.
Give that a try; if you get stuck anywhere, please add a comment and I'll respond in due time.
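A minimal sketch of such a Lambda function, assuming the AWS SDK for Java v1, the aws-lambda-java-events library, and a hypothetical Firehose delivery stream name (this replaces the KCL worker above with the Lambda-based flow described in this answer):
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
import com.amazonaws.services.kinesisfirehose.model.Record;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;

// Triggered by the DynamoDB stream; forwards INSERT images to a Firehose delivery
// stream ("my-delivery-stream" is a placeholder) that delivers the data to S3.
public class DynamoStreamToFirehoseHandler implements RequestHandler<DynamodbEvent, Void> {

    private final AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        for (DynamodbStreamRecord record : event.getRecords()) {
            if ("INSERT".equals(record.getEventName())) {
                String newImage = record.getDynamodb().getNewImage().toString();
                firehose.putRecord(new PutRecordRequest()
                    .withDeliveryStreamName("my-delivery-stream")
                    .withRecord(new Record().withData(
                        ByteBuffer.wrap((newImage + "\n").getBytes(StandardCharsets.UTF_8)))));
            }
        }
        return null;
    }
}
Deployed as a Lambda function with the table's stream configured as its event source, this runs entirely inside AWS, which also addresses the deployment question.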

AWS Athena ODI JDBC connection

Has anyone tried connecting to AWS Athena from Oracle Data Integrator?
I have been trying for a long time but am not able to find the appropriate JDBC connection string.
Steps I have followed from
https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html#jdbc-url-format
Downloaded AthenaJDBC42_2.0.7.jar driver from AWS
Copied the same into the userlib directory of ODI
Created new technology in ODI
Trying to add a Data Server, but not able to form the JDBC URL.
Sample JDBC string format (which isn't working):
jdbc:awsathena://AwsRegion=[Region];User=[AccessKey];Password=[SecretKey];S3OutputLocation=[Output];
Please can anyone help? Thanks.
This is a shorter version of the JDBC code I implemented for Athena. It was just a POC, and we want to go with the AWS SDK rather than JDBC, though that is less important here.
package com.poc.aws.athena;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class AthenaJDBC {
    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        Connection connection = null;
        Class.forName("com.simba.athena.jdbc.Driver");
        connection = DriverManager.getConnection("jdbc:awsathena://AwsRegion=us-east-1;User=EXAMPLEKEY;"
            + "Password=EXAMPLESECRETKEY;S3OutputLocation=s3://example-bucket-name-us-east-1;");
        Statement statement = connection.createStatement();
        ResultSet queryResults = statement.executeQuery(ExampleConstants.ATHENA_SAMPLE_QUERY);
        System.out.println(queryResults.next());
    }
}
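Note that ExampleConstants.ATHENA_SAMPLE_QUERY is a constant carried over from the sample code this was based on; it is just a plain SQL string, so a hypothetical stand-in would be:
    // Hypothetical stand-in for the sample constant; any valid Athena SQL works here.
    private static final String ATHENA_SAMPLE_QUERY = "SELECT * FROM my_database.my_table LIMIT 10";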
The only important point here is the URL:
jdbc:awsathena://AwsRegion=us-east-1;User=EXAMPLEKEY;Password=EXAMPLESECRETKEY;S3OutputLocation=s3://example-bucket-name-us-east-1;
us-east-1 must be replaced with your actual region, e.g. us-west-1.
EXAMPLEKEY must be replaced with your AWS access key that has Athena access.
EXAMPLESECRETKEY must be replaced with your AWS secret key that has Athena access.
example-bucket-name-us-east-1 must be replaced with an S3 bucket that the above keys have write access to.
There are other keys the Simba driver supports, but they are less important here.
I hope this helps.
Sorry, I missed posting the answer to this.
It all worked fine after configuring an Athena JDBC connection in ODI as below and providing the four key values while connecting.
JDBC URL: jdbc:awsathena://athena.eu-west-2.amazonaws.com:443;AWSCredentialsProviderArguments=ACCESSKEYID,SECRETACCESSKEY,SESSIONTOKEN

Use SQL Workbench to import a CSV file to an AWS Redshift database

I'm looking for manual and automatic ways to use SQL Workbench to import/load a LOCAL CSV file into an AWS Redshift database.
The manual way could be clicking through a navigation bar and selecting an option.
The automatic way could be some query code that loads the data when run.
Here's my attempt:
There's an error saying my target table in AWS is not found, but I'm sure the table exists. Does anyone know why?
WbImport -type=text
-file ='C:\myfile.csv'
-delimiter = ,
-table = public.data_table_in_AWS
-quoteChar=^
-continueOnError=true
-multiLine=true
You can use WbImport in SQL Workbench/J to import data.
For more info: http://www.sql-workbench.net/manual/command-import.html
As mentioned in the comments, the COPY command provided by Redshift is the optimal solution. You can COPY from S3, EC2, etc.
S3 Example:
copy <your_table>
from 's3://<bucket>/<file>'
access_key_id 'XXXX'
secret_access_key 'XXXX'
region '<your_region>'
delimiter '\t';
For more examples:
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
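Since COPY reads from S3 rather than from a local path, one way to automate the local-file case end to end is to upload the CSV with the AWS SDK for Java and then issue the COPY over JDBC; a minimal sketch with placeholder bucket, table, and connection details:
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class LocalCsvToRedshift {
    public static void main(String[] args) throws Exception {
        // 1. Upload the local file to S3 (bucket and key are placeholders).
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        s3.putObject("my-bucket", "uploads/myfile.csv", new File("C:\\myfile.csv"));

        // 2. COPY from S3 into the target table over JDBC (connection details are placeholders).
        Properties props = new Properties();
        props.setProperty("user", "your-user");
        props.setProperty("password", "your-password");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:redshift://your-cluster.redshift.amazonaws.com:5439/example-database", props);
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate("COPY public.data_table_in_AWS FROM 's3://my-bucket/uploads/myfile.csv' "
                + "CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...' CSV IGNOREHEADER 1");
        }
    }
}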

Run Redshift Queries Periodically

I have started researching Redshift. It is defined as a "Database" service in AWS. From what I have learnt so far, we can create tables and ingest data from S3 or from external sources like Hive into a Redshift database (cluster). Also, we can use a JDBC connection to query these tables.
My questions are -
Is there a place within the Redshift cluster where we can store our queries and run them periodically (e.g. daily)?
Can we store our query in an S3 location and use it to write output to another S3 location?
Can we load a DB2 table unload file with a mixture of binary and string fields into Redshift directly, or do we need an intermediate process to turn the data into something like a CSV?
I have done some Googling about this. If you have link to resources, that will be very helpful. Thank you.
I used the cursor method from the psycopg2 package in Python. Sample code is given below. You have to set all the Redshift credentials in the env_vars file.
You can set your queries using cursor.execute; here I mention one UPDATE query, but you can put your own query in its place (you can also run multiple queries). After that, you have to schedule this Python file with crontab or any other scheduler to run your queries periodically.
import psycopg2
import sys
import env_vars

conn_string = "dbname=%s port=%s user=%s password=%s host=%s " % (
    env_vars.RedshiftVariables.REDSHIFT_DW,
    env_vars.RedshiftVariables.REDSHIFT_PORT,
    env_vars.RedshiftVariables.REDSHIFT_USERNAME,
    env_vars.RedshiftVariables.REDSHIFT_PASSWORD,
    env_vars.RedshiftVariables.REDSHIFT_HOST,
)

conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
cursor.execute("""UPDATE database.demo_table SET Device_id = '123' WHERE Device = 'IPHONE' OR Device = 'Apple';""")
conn.commit()
conn.close()
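If you would rather stay on the JVM than use cron plus Python, the same periodic-query idea can be sketched with the Redshift JDBC driver and a ScheduledExecutorService; the connection details and the query below are placeholders:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Requires the Redshift JDBC driver jar on the classpath.
public class DailyRedshiftJob {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Run once a day; adjust the initial delay and period as needed.
        scheduler.scheduleAtFixedRate(DailyRedshiftJob::runQuery, 0, 1, TimeUnit.DAYS);
    }

    private static void runQuery() {
        Properties props = new Properties();
        props.setProperty("user", "your-user");          // placeholder credentials
        props.setProperty("password", "your-password");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:redshift://your-cluster.redshift.amazonaws.com:5439/example-database", props);
             Statement stmt = conn.createStatement()) {
            // Autocommit is on by default, so the update is committed when it completes.
            stmt.executeUpdate(
                "UPDATE database.demo_table SET Device_id = '123' WHERE Device = 'IPHONE' OR Device = 'Apple'");
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}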