Distcp from S3 to HDFS

I'm trying to copy data from S3 to HDFS using the distcp tool. The problem is that the S3 bucket is accessed through a VPC endpoint, and I don't know how to configure distcp for that. I have tried several configurations but none has worked. Currently I'm using the following command:
hadoop distcp \
  -Dfs.s3a.access.key=[KEY] \
  -Dfs.s3a.secret.key=[SECRET] \
  -Dfs.s3a.region=eu-west-1 \
  -Dfs.s3a.bucket.[BUCKET NAME].endpoint=https://bucket.vpce-[vpce id].s3.eu-west-1.vpce.amazonaws.com \
  s3a://[BUCKET NAME]/[FILE] \
  hdfs://[DESTINATION]/[FILE]
But I'm getting this error:
22/03/16 09:14:39 ERROR tools.DistCp: Exception encountered org.apache.hadoop.fs.s3a.AWSBadRequestException: doesBucketExistV2 on [BUCKET NAME]: com.amazonaws.services.s3.model.AmazonS3Exception: The authorization header is malformed; the region 'vpce' is wrong; expecting 'eu-west-1'
Any ideas on how distcp should be configured with VPC endpoints?
Thanks in advance.

You need Hadoop 3.3.1 for this, then it should work; ideally use 3.3.2, which is now out.
Grab the cloudstore JAR and use its storediag command to debug the connection settings before going near distcp.
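On 3.3.1+, a command along these lines may work (a sketch only, not verified against your endpoint; fs.s3a.endpoint.region is the option that sets the signing region when it cannot be parsed from the endpoint hostname, and fs.s3a.region in your command is not an option the 3.3.x connector reads):
hadoop distcp \
  -Dfs.s3a.access.key=[KEY] \
  -Dfs.s3a.secret.key=[SECRET] \
  -Dfs.s3a.bucket.[BUCKET NAME].endpoint=https://bucket.vpce-[vpce id].s3.eu-west-1.vpce.amazonaws.com \
  -Dfs.s3a.bucket.[BUCKET NAME].endpoint.region=eu-west-1 \
  s3a://[BUCKET NAME]/[FILE] \
  hdfs://[DESTINATION]/[FILE]
The per-bucket form keeps the override scoped to that one bucket; the plain fs.s3a.endpoint.region property would also work if everything goes through the same endpoint.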

Related

AWS Kinesis Data Analytics application (Flink): change a property originally located in flink-conf.yaml

As the runtime for my Flink application I use managed Flink by AWS (Kinesis Data Analytics application).
I added a sink that writes processed events from the Kinesis queue to S3 in Parquet format.
Locally everything works for me, but when I try to run the application in the cloud I get the following exception:
"throwableInformation": [
"com.esotericsoftware.kryo.KryoException: Error constructing instance of class: org.apache.avro.Schema$LockableArrayList",
"Serialization trace:",
"types (org.apache.avro.Schema$UnionSchema)",
"schema (org.apache.avro.Schema$Field)",
"fieldMap (org.apache.avro.Schema$RecordSchema)",
While looking for a solution to the problem, I found that I need to change the following property (verified on a local cluster):
classloader.resolve-order: child-first -> classloader.resolve-order: parent-first
Is it possible to change this configuration in any way when using AWS managed Flink (Kinesis Data Analytics applications, not EMR)?
AWS Support's answer: No. This property cannot be changed.

Spark org.postgresql.Driver not found even though it's configured on EMR

I am trying to write a PySpark DataFrame to a Postgres database with the following code:
mode = "overwrite"
url = "jdbc:postgresql://host/database"
properties = {"user": "user","password": "password","driver": "org.postgresql.Driver"}
dfTestWrite.write.jdbc(url=url, table="test_result", mode=mode, properties=properties)
However I am getting the following error:
An error occurred while calling o236.jdbc.
: java.lang.ClassNotFoundException: org.postgresql.Driver
I've found a few SO questions that address a similar issue but haven't found anything that helps. I followed the AWS docs here to add the configuration, and from the EMR console it looks as though it was successful.
What am I doing wrong?
The document you followed describes how to add a database connector for Presto; it is not a way to add a JDBC driver to Spark. A connector is not the same thing as a driver.
You should download the PostgreSQL JDBC driver and put it in Spark's lib directory, or somewhere it can be referenced through configuration.
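For example (a sketch only; the driver version, JAR path, and script name are placeholders you would need to adapt), you can hand the JAR to Spark when submitting the job:
spark-submit --jars /home/hadoop/postgresql-42.2.5.jar your_script.py
or let Spark resolve it from Maven Central:
spark-submit --packages org.postgresql:postgresql:42.2.5 your_script.py
Either way the point is the same: org.postgresql.Driver has to be on the Spark classpath, which the Presto connector configuration does not do.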

Apache Zeppelin with Athena: handling the session token using the JDBC interpreter

I am trying to connect Athena with Apache Zeppelin. I need to handle the secret key, access key, and session token, and I am finding it hard to establish the connection with the Zeppelin JDBC interpreter.
I am following the steps mentioned in this post.
If anyone can help me establish the connection using the AWS session token approach, that would be helpful.
Thank you.
The main docs for this are here:
https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
I found there are two driver versions, 1.1.0 and 1.0.1. I could only get Zeppelin working with 1.1.0, and the links on that page don't point to that file; the only way to get it was with the aws s3 cp command, e.g.
aws s3 cp s3://athena-downloads/drivers/AthenaJDBC41-1.1.0.jar .
although I've given feedback on that page so it should be fixed soon.
Regarding the parameters: set default.user to the access key and default.password to the secret key. default.driver should be com.amazonaws.athena.jdbc.AthenaDriver.
default.s3_staging_dir is the bucket where CSV results are written, so it needs to match your Athena settings.
There is no mention of where you might put a session token; however, you could always try putting it on the JDBC connection string (which goes in the default.url parameter value),
e.g.
jdbc:awsathena://athena.{REGION}.amazonaws.com:443?SessionToken=blahblahsomethingrealsessiontokengoeshere
but of course replace {REGION} with the actual AWS region and use your real session token.
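Putting that together, the JDBC interpreter settings would look roughly like this (a sketch only; the region, results bucket, and JAR path are placeholders, and passing the session token on the URL is the unverified part):
default.driver          com.amazonaws.athena.jdbc.AthenaDriver
default.url             jdbc:awsathena://athena.eu-west-1.amazonaws.com:443?SessionToken=[SESSION TOKEN]
default.user            [ACCESS KEY]
default.password        [SECRET KEY]
default.s3_staging_dir  s3://[ATHENA RESULTS BUCKET]/
dependency (artifact)   /path/to/AthenaJDBC41-1.1.0.jar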

Unable to upload to S3 with Grails S3 Demo Application

I am trying to run a demo project for uploading to S3 with Grails 3.
The project in question is this one; specifically, the S3 upload applies only to the 'Hotel' example at the end.
When I run the project and go to upload the image, I get an 'updated' message but nothing actually happens: there is no inserted URL in the dbconsole table.
I think the issue lies with how I am running the project. I am using the command:
grails -Daws.accessKeyId=XXXXX -Daws.secretKey=XXXXX run-app
(where the X's stand in for my actual keys, obviously).
This way of running the project appears to be slightly different from the method shown in the example; I run my project from the command line and do not use GGTS, just Sublime.
I have also tried putting my AWS keys into application.yml, but then I receive an internal server error.
Can anyone help me out here?
Check your bucket policy in S3. You need to grant the API user permission to upload to the bucket.
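As a rough sketch (the account ID, user name, and bucket name below are placeholders, and s3:PutObjectAcl is only needed if the app sets an ACL on upload), a bucket policy granting that user upload rights could look like:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAppUploads",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::[ACCOUNT ID]:user/[API USER]" },
      "Action": ["s3:PutObject", "s3:PutObjectAcl"],
      "Resource": "arn:aws:s3:::[BUCKET NAME]/*"
    }
  ]
}
Alternatively, attach an equivalent IAM policy to the user whose access key the application is using.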

AWS Pipeline: Staging local files to S3 failed. The request signature we calculated does not match the signature you provided

Here's my setup:
I am trying to copy files from an external web server to an S3 bucket using Data Pipeline.
To do this I'm using a ShellCommandActivity that runs a script to download the files to the output bucket specified in the pipeline. In the script I use the environment variable ${OUTPUT1_STAGING_DIR} to address the bucket, and of course I set 'staging' to true in my pipeline.
When the script finishes, the state of the activity becomes "FAILED" with the following error:
Staging local files to S3 failed. The request signature we calculated does not match the signature you provided. Check your key and signing method
When I look in the stdout file, I can see that my script finished successfully; only the staging to the bucket did not work.
I reckon this could be a permission problem with the bucket, but I have no idea what I would have to change.
I came across some discussions where people got this error because the path to the bucket was configured incorrectly, so this is how I set the Directory Path on the pipeline's S3 DataNode:
s3://testBucket
Is this correct?
I would appreciate any help here!
The problem was the DataNode Directory Path: it cannot be just a bucket, it HAS to be a directory inside the bucket.
Like this:
s3://testBucket/test
Great work with the error messages, Amazon!