Export Data from AWS S3 bucket to MySql RDS instance using JDBC - amazon-web-services

I want to import CSVs in my s3 bucket into my MySql rds instance using jdbc. It is a one time process and not an ongoing one. Interested in knowing the end to end process.

As you mentioned its one time activity, hence I would like you to suggest direct CSV import to MySQL rather then using JDBC unless there is specific reason you might have that you have not mentioned into the question.
Here is approach you could utilize,
for loop of your files in S3, the use following command to import data to MySQL RDS.
>mysqlimport [options] db_name textfile1 [textfile2 ...]
Please refer following for more details.
https://dev.mysql.com/doc/en/mysqlimport.html
I hope this provides you way to move forward. If I'm missing something, re-frame your question, I could reattempt the answer.

Related

How to directly query SAP HANA via AWS Glue Notebook

I'm utilizing the NGDBC driver (SAP HANA JDBC driver) with an AWS Glue Notebook. I'm using the following line once I include the JAR file to access data from SAP HANA in our environment.
df = glueContext.read.format("jdbc").option("driver", jdbc_driver_name).option("url", db_url).option("dbtable", "KNA1").option("user", db_username).option("password", db_password).load()
In this example, it simply download the KNA1 table, but I have yet to see any documentation that tells me how to actually query the SAP HANA instance through these options. I attempted to use a "query" option, but that didn't seem like it was available via the JAR.
Am I to understand that I have to simply get entire tables, then query against the DataFrame? That seems expensive and not what I want to do. Maybe someone can provide some insight.
Try like this:
df = glueContext.read.format("jdbc").option("driver", jdbc_driver_name).option("url", db_url).option("dbtable", "(select name1 from kna1 where kunnr='1111') as name").option("user", db_username).option("password", db_password).load()
i.e. wrap the query into asterisks and provide an alias as help suggests.

Errors importing large CSV file to DynamoDB using Lambda

I want to import a large csv file (around 1gb with 2.5m rows and 50 columns) into a DynamoDb, so have been following this blog from AWS.
However it seems I'm up against a timeout issue. I've got to ~600,000 rows ingested, and it falls over.
I think from reading the CloudWatch log that the timeout is occurring due to the boto3 read on the CSV file (it opens the entire file first, iterates through and batches up for writing)... I tried to reduce the file size (3 columns, 10,000 rows as a test), and I got a timeout after 2500 rows.
Any thoughts here?!
TIA :)
I really appreciate the suggestions (Chris & Jarmod). After trying and failing to break things programmatically into smaller chunks, I decided to look at the approach in general.
Through research I understood there were 4 options:
Lambda Function - as per the above this fails with a timeout.
AWS Pipeline - Doesn't have a template for importing CSV to DynamoDB
Manual Entry - of 2.5m items? no thanks! :)
Use an EC2 instance to load the data to RDS and use DMS to migrate to DynamoDB
The last option actually worked well. Here's what I did:
Create an RDS database (I used the db.t2.micro tier as it was free) and created a blank table.
Create an EC2 instance (free Linux tier) and:
On the EC2 instance: use SCP to upload the CSV file to the ec2 instance
On the EC2 instance: Firstly Sudo yum install MySQL to get the tools needed, then use mysqlimport with the --local option to import the CSV file to the rds MySQL database, which took literally seconds to complete.
At this point I also did some data cleansing to remove some white spaces and some character returns that had crept into the file, just using standard SQL queries.
Using DMS I created a replication instance, endpoints for the source (rds) and target (dynamodb) databases, and finally created a task to import.
The import took around 4hr 30m
After the import, I removed the EC2, RDS, and DMS objects (and associated IAM roles) to avoid any potential costs.
Fortunately, I had a flat structure to do this against, and it was only one table. I needed the cheap speed of the dynamodb, otherwise, I'd have stuck to the RDS (I almost did halfway through the process!!!)
Thanks for reading, and best of luck if you have the same issue in the future.

Export DynamoDB table to S3 automatically

The scenario is the following: I have a lambda function that does an http request to get the data of today and the last 365 days and stores them in DynamoDB. The function is triggered every day at 8am, so the most recent data is always saved in the DynamoDB table.
Now my goal is to export the DynamoDB table to a S3 file automatically on an everyday basis as well, so I'm able to use services like QuickSight, Athena, Forecast on the data.
If possible and easily implementable, I'd like to only have one S3 file that gets added with the most recent data of the day, because an extra file everyday seems kinda pricey. If that's not possible, an extra file everyday would also be fine.
What's the best way to go about doing so without using CLI (because I'm not allowed to install programs to my laptop) and without using Lambda (because I wouldn't know how to write a function for that without any tutorials)?
Take a look at DataPipeline. This is a use case and most of the configuration is simple.
It will also not require any knowledge of Lambda and can be automated.
More info: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html
DynamoDB recently released a new, native feature to export your table's data to an S3 bucket. It supports exporting into DynamoDB JSON and Amazon Ion - see the documentation on how to use it at:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html
This will enable you to run whatever analytics tools you'd like (Athena, etc.) on the data exported in S3.

Apache Zeppelin with Athena handling session token using jdbc Interpreter

I am trying to connect Athena with Apache Zeppelin.I need to handle secret_key, Access_key, and Session_token. I am feeling hard to establish my connection with the Zeppelin JDBC interpreter.
I am following the steps as mentioned in this block,
If any one can help me out in establishing the connection with AWS Session token approach that would be helpful.
Thank You
The main docs for this are here:
https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
I found there are 2 driver versions, -1.1.0 and -1.0.1 . I could only get Zeppelin working with 1.1.0, and the links on that page don't go to that file, the only way to get it was using the aws s3 cp command
e.g.
aws s3 cp s3://athena-downloads/drivers/AthenaJDBC41-1.1.0.jar .
although I've given feedback on that page so it should be fixed soon.
Regarding the parameters, you use default.user and enter the Access_Key, default.password and enter the Secret_key. The default.driver should be com.amazonaws.athena.jdbc.AthenaDriver
The default.s3_staging_dir is actually the bucket where csv results are written so needs to match your athena settings.
There is no mention of where you might put a session token, however, you could always try putting it on the jdbc connection string ( which goes in default.url parameter value)
e.g.
jdbc:awsathena://athena.{REGION}.amazonaws.com:443?SessionToken=blahblahsomethingrealsessiontokengoeshere
but of course, replace {REGION} with the actual aws region and use your real session token.

AWS S3 error with PFFiles after importing the exported Parse data

Looks like Parse.com stores the PFFile objects on AWS S3 and only stores a reference to the actual files on S3 in Parse for the PFFile object types.
So my problem here is I only get a link to AWS S3 link for my PFFile if I export the data using the out of the box Parse.com export functionality. After I import the same data to my Parse application, for some reason the security setting on those PFFiles on S3 is changed in a way that all PFFiles won't be accessible to me after an import due to security error.
My question is, does anyone know how the security is being set on the PFFiles? Here's a link to PFFile https://parse.com/docs/osx/api/Classes/PFFile.html but I guess this is rather an advanced topic and wasn't revealed on this page.
Also looking a solution for this, all I found is this from their forum:
In this case, the PFFiles are stored in a different app. You might
need to download these files and upload them again to the new app and
update the pointers. I know this is not a great answer but we're
working on making this process more straightforward.
https://www.parse.com/questions/import-pffile-object-not-working-in-iphone-application