Specify the AWS credentials in hadoop - amazon-web-services

I want to specify the AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID at run-time.
I already tried using
hadoop -Dfs.s3a.access.key=${AWS_ACESS_KEY_ID} -Dfs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY} fs -ls s3a://my_bucket/
and
export HADOOP_CLIENT_OPTS="-Dfs.s3a.access.key=${AWS_ACCESS_KEY_ID} -Dfs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY}"
and
export HADOOP_OPTS="-Dfs.s3a.access.key=${AWS_ACCESS_KEY_ID} -Dfs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY}"
In the last two examples, I tried to run with:
hadoop fs -ls s3a://my-bucket/
In all the cases I got:
-ls: Fatal internal error
com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:235)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:218)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:201)
at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
What am doing wrong?

This is a correct way to pass the credentials at runtime,
hadoop fs -Dfs.s3a.access.key=${AWS_ACCESS_KEY_ID} -Dfs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY} -ls s3a://my_bucket/
Your syntax needs a small fix. Make sure that empty strings are not passed as the values to these properties. It would make these runtime properties invalid and would go on searching for the credentials as per the authentication chain.
The S3A client follows the following authentication chain:
If login details were provided in the filesystem URI, a warning is
printed and then the username and password extracted for the AWS key
and secret respectively.
The fs.s3a.access.key and fs.s3a.secret.key are looked for in the
Hadoop XML configuration.
The AWS environment variables are then looked for.
An attempt is made to query the Amazon EC2 Instance Metadata Service
to retrieve credentials published to EC2 VMs.
The other possible methods to pass the credentials at runtime (please note that it is neither safe nor recommended to supply them during runtime),
1) Embed them in the S3 URI
hdfs dfs -ls s3a://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY#my-bucket/
If the secret key contains any + or / symbols, escape them with %2B and %2F respectively.
Never share the URL, logs generated using it, or use such an inline authentication mechanism in production.
2) export environment variables for the session
export AWS_ACCESS_KEY_ID=<YOUR_AWS_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_AWS_SECRET_ACCESS_KEY>
hdfs dfs -ls s3a://my-bucket/

I think part of the problem is that, confusingly, unlike the JVM -D opts, the Hadoop -D command expects a space between the -D and the key, e.g:
hadoop fs -ls -D fs.s3a.access.key=AAIIED s3a://landsat-pds/
I would still avoid doing that on the command line though, as anyone who can do a ps command can see your secrets.
Generally we stick them into core-site.xml when running outside EC2; in EC2 it's handled magically

Related

Druid can not see/read GOOGLE_APPLICATION_CREDENTIALS defined on env path

I installed apache-druid-0.22.1 as a cluster (master, data and query nodes) and enabled “druid-google-extensions” by adding it to the array druid.extensions.loadList in common.runtime.properties.
Finally I defined GOOGLE_APPLICATION_CREDENTIALS ( which has the value of service account json as defined in https://cloud.google.com/docs/authentication/production )as an environment variable of user that run the druid services.
However, I got the following error when I try to ingest data from GCR buckets:
Error: Cannot construct instance of
org.apache.druid.data.input.google.GoogleCloudStorageInputSource,
problem: Unable to provision, see the following errors: 1) Error in
custom provider, java.io.IOException: The Application Default
Credentials are not available. They are available if running on Google
App Engine, Google Compute Engine, or Google Cloud Shell. Otherwise,
the environment variable GOOGLE_APPLICATION_CREDENTIALS must be
defined pointing to a file defining the credentials. See
https://developers.google.com/accounts/docs/application-default-credentials
for more information. at
org.apache.druid.common.gcp.GcpModule.getHttpRequestInitializer(GcpModule.java:60)
(via modules: com.google.inject.util.Modules$OverrideModule ->
org.apache.druid.common.gcp.GcpModule) at
org.apache.druid.common.gcp.GcpModule.getHttpRequestInitializer(GcpModule.java:60)
(via modules: com.google.inject.util.Modules$OverrideModule ->
org.apache.druid.common.gcp.GcpModule) while locating
com.google.api.client.http.HttpRequestInitializer for the 3rd
parameter of
org.apache.druid.storage.google.GoogleStorageDruidModule.getGoogleStorage(GoogleStorageDruidModule.java:114)
at
org.apache.druid.storage.google.GoogleStorageDruidModule.getGoogleStorage(GoogleStorageDruidModule.java:114)
(via modules: com.google.inject.util.Modules$OverrideModule ->
org.apache.druid.storage.google.GoogleStorageDruidModule) while
locating org.apache.druid.storage.google.GoogleStorage 1 error at
[Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1,
column: 180] (through reference chain:
org.apache.druid.indexing.overlord.sampler.IndexTaskSamplerSpec["spec"]->org.apache.druid.indexing.common.task.IndexTask$IndexIngestionSpec["ioConfig"]->org.apache.druid.indexing.common.task.IndexTask$IndexIOConfig["inputSource"])
A case reported on this matter caught my attention. But I can not see
any verified solution to that case. Please help me.
We want to take data from GCP to on prem Druid. We don’t want to take cluster in GCP. So that we want solve this problem.
For future visitors:
If you run Druid by systemctl you then need to add required environments in service file of systemctl, to ensure it is always delivered to druid regardless of user or environment changes.
You must define the GOOGLE_APPLICATION_CREDENTIALS that points to a file path, and not contain the file content.
In a cluster (like Kubernetes), it's usual to mount a volume with the file in it, and to se the env var to point to that volume.

How to specify the GCP Credential Location in application.properties file (for using the Pub/Sub in GCP)?

This seems straightforward to do that passing the Service Account key file (generated from the GCP console) by specifying the file location in the application.properties file. However, I tried all the following options:
1. spring.cloud.gcp.credentials.location=file:/home/my_user_id/mp6key.json
2. spring.cloud.gcp.credentials.location=file:src/main/resources/mp6key.json
3. spring.cloud.gcp.credentials.location=file:./main/resources/mp6key.json
4. spring.cloud.gcp.credentials.location=file:/src/main/resources/mp6key.json
It all ended up with the same error:
java.io.FileNotFoundException: /home/my_user_id/mp6key.json (No such file or directory)
Could anyone advise where I should put the key file and then how should I specify the path to the file properly?
The same programs run successfully in Ecplise with messages published and subscribed using the Pub/Sub processing from GCP (using the Project Id/Service Account key generated in GCP), but now stuck with the above issue after deployed to run on GCP.
As mentioned in the official documentation, the credentials file can be obtained from a number of different locations such as the file system, classpath, URL, etc.
for example, if the service account key file is stored in the classpath as src/main/resources/key.json, pass the following property
spring.cloud.gcp.credentials.location=classpath:key.json
if the key file is stored somewhere else in your local file system, use the file prefix in the property value
spring.cloud.gcp.credentials.location=file:<path to key file>
My line looks like this:
spring.cloud.gcp.credentials.location=file:src/main/resources/[my_json_file]
And this works.
The following also works if I put it in the root of the project directory:
spring.cloud.gcp.credentials.location=file:./[my_json_file]
Have you tried to follow this quickstart? Please, try to follow it thoughtfully and explain if you get any error finishing the quickstart.
Anyway, before running your Java script, try running on the console the following (please modify with the exact path where you store your key):
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/mp6key.json"
How are you authenticating your credentials in your Java script?
My answer is easy: if you run you code on GCP, you don't have to use service account key file. Problem eliminated, problem solved!
More seriously, have a look on service identity. I don't know what is your current service (Compute? Function? Cloud Run?). Anyway, you can attach any service account on GCP components. Then, when you code, simply use the default credential. Automatically the component identity is loaded. No key to manage, no key to store securely, no key to rotate!
If you provide more detail on your target platform, I could provide your some guidance to achieve this.
Keep in mind that the service account key file are designed to be used by automatic apps (w/o user account involved) hosted outside GCP (on prem, other Cloud Provider, a CI/CD, Apigee,...)
UPDATE
When you use your personal account, you can also use the default credential.
Install gcloud SDK on your computer
Use the command gcloud auth application-default login
Follow the instructions
Enjoy!
If it doesn't work, get the <path> displayed after the login command and set this value in the environment variable named GOOGLE_APPLICATION_CREDENTIALS.
If you definitively want to use service account key file (which are a security issue for the previous reason, but...), you can use it locally
Either set the json key file path into the GOOGLE_APPLICATION_CREDENTIALS environment variable
Or run this command gcloud auth activate-service-account --key-file=<path to your json key file>
Provided your file is in the resources folder try
file://mp6key.json
using file:// instead of file:/ works for me at least

Apache Zeppelin with Athena handling session token using jdbc Interpreter

I am trying to connect Athena with Apache Zeppelin.I need to handle secret_key, Access_key, and Session_token. I am feeling hard to establish my connection with the Zeppelin JDBC interpreter.
I am following the steps as mentioned in this block,
If any one can help me out in establishing the connection with AWS Session token approach that would be helpful.
Thank You
The main docs for this are here:
https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
I found there are 2 driver versions, -1.1.0 and -1.0.1 . I could only get Zeppelin working with 1.1.0, and the links on that page don't go to that file, the only way to get it was using the aws s3 cp command
e.g.
aws s3 cp s3://athena-downloads/drivers/AthenaJDBC41-1.1.0.jar .
although I've given feedback on that page so it should be fixed soon.
Regarding the parameters, you use default.user and enter the Access_Key, default.password and enter the Secret_key. The default.driver should be com.amazonaws.athena.jdbc.AthenaDriver
The default.s3_staging_dir is actually the bucket where csv results are written so needs to match your athena settings.
There is no mention of where you might put a session token, however, you could always try putting it on the jdbc connection string ( which goes in default.url parameter value)
e.g.
jdbc:awsathena://athena.{REGION}.amazonaws.com:443?SessionToken=blahblahsomethingrealsessiontokengoeshere
but of course, replace {REGION} with the actual aws region and use your real session token.

InvalidSignatureException when using boto3 for dynamoDB on aws

Im facing some sort of credentials issue when trying to connect to my dynamoDB on aws. Locally it all works fine and I can connect using env variables for AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION and then
dynamoConnection = boto3.resource('dynamodb', endpoint_url='http://localhost:8000')
When changing to live creds in the env variables and setting the endpoint_url to the dynamoDB on aws this fails with:
"botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the Query operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details."
The creds are valid as they are used in a different app which talks to the same dynamoDB. Ive also tried not using env variables but rather directly in the method but the error persisted. Furthermore, to avoid any issues with trailing spaces Ive even used the credentials directly in the code. Im using Python v3.4.4.
Is there maybe a header that also should be set that Im not aware of? Any hints would be apprecihated.
EDIT
Ive now also created new credentials (to make sure there are only alphanumerical signs) but still no dice.
You shouldn't use the endpoint_url when you are connecting to the real DynamoDB service. That's really only for connecting to local services or non-standard endpoints. Instead, just specify the region you want:
dynamoConnection = boto3.resource('dynamodb', region_name='us-west-2')
It sign that your time zone is different. Maybe you can check your:
1. Time zone
2. Time settings.
If there are some automatic settings, you should fix your time settings.
"sudo hwclock --hctosys" should do the trick.
Just wanted to point out that accessing DynamoDB from a C# environment (using AWS.NET SDK) I ran into this error and the way I solved it was to create a new pair of AWS access/secret keys.
Worked immediately after I changed those keys in the code.

kinesis stream account incorrect

I have setup my pc with python and connections to AWS. This has been successfully tested using the s3_sample.py file, I had to create an IAM user account with the credentials in a file which worked fine for S3 buckets.
My next task was to create an mqtt bridge and put some data in a stream in kinesis using the awslab - awslabs/mqtt-kinesis-bridge.
This seems to be all ok except I get an error when I run the bridge.py. The error is:
Could not find ACTIVE stream:my_first_stream error:Stream my_first_stream under account 673480824415 not found.
Strangely this is not the account I use in the .boto file that is suggested to be set up for this bridge, which are the same credentials I used for the S3 bucket
[Credentials]
aws_access_key_id = AA1122BB
aws_secret_access_key = LlcKb61LTglis
It would seem to me that the bridge.py has a hardcoded account but I can not see it and i can't see where it is pointing to the .boto file for credentials.
Thanks in Advance
So the issue of not finding the Active stream for the account is resolved by:
ensure you are hooked into the US-EAST-1 data centre as this is the default data centre for bridge.py
create your stream, you will only need 1 shard
The next problem stems from the specific version of MQTT and the python library paho-mqtt I installed. The bridge application was written with the API of MQTT 1.2.1 using paho-mqtt 0.4.91 in mind.
The new version which is available for download on their website has a different way of interacting with the paho-mqtt library which passes an additional "flags" object to the on_connect callback. This generates the error I was experiencing, since its not expecting the 5th argument.
You should be able to fix it by making the following change to bridge.py
Line 104 currently looks like this:
def on_connect(self, mqttc, userdata, msg):
Simply add flags, after userdata, so that the callback function looks like this:
def on_connect(self, mqttc, userdata,flags, msg):
This should resolve the issue of the final error of the incorrect number of arguments being passed.
Hope this helps others, thank for the efforts.
When you call python SDK for aws service, there is a line to import the boto modules for aws services in bridge.py.
import boto
The setting is pointing to the .boto for credentials and defined defaultly in boto.
Here is the explanation Boto Config :
Details
A boto config file is a text file formatted like an .ini configuration file that specifies values for options that control the behavior of the boto library. In Unix/Linux systems, on startup, the boto library looks for configuration files in the following locations and in the following order:
/etc/boto.cfg - for site-wide settings that all users on this machine will use
~/.boto - for user-specific settings
~/.aws/credentials - for credentials shared between SDKs
Of course, you can set the environment directly,
export AWS_ACCESS_KEY_ID="Your AWS Access Key ID"
export AWS_SECRET_ACCESS_KEY="Your AWS Secret Access Key"