How can I pass AWS credentials (aws_access_key and aws_secret_key) to PIG PigStorage function?
Thanks
Given this question is tagged with EMR, I am going to assume you are using AWS EMR for the Hadoop cluster. If that is the case, no further setup is required to access S3. The EMR service automatically configures the Hadoop filesystem (which PigStorage leverages) with either the AWS credentials of the user who started the cluster or the instance role that was requested. Just provide the S3 location and Pig will interface with S3 according to the policies and permissions of that user/role.
A = LOAD 's3://<yourbucket>/<path>/' using PigStorage('\t') as (id:int, field2:chararray, field3:chararray);
I wasn't very explicit and didn't give an example of my use case, sorry. I needed this because I had to use two different AWS access keys, and using something like s3n://access:secret@bucket did not solve it. I solved it by changing the PigStorage function: the results are stored in HDFS, and in the cleanUpWithSuccess method I invoke a method that uploads the HDFS files to S3 with the credentials. This way I can pass the credentials to the PigStorage function when it is used to store; of course, I also changed the PigStorage constructor to accept these arguments.
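For the upload half of that workaround, here is a minimal sketch of pushing the job's output to S3 with an explicit, non-default key pair. This is boto3/Python run after the Pig job rather than the modified Java StoreFunc itself, and the bucket name, local path, and key values are placeholders:
import os
import boto3

# Client built with the second account's keys instead of the cluster's defaults (placeholders).
s3 = boto3.client(
    "s3",
    aws_access_key_id="SECOND_ACCOUNT_ACCESS_KEY",
    aws_secret_access_key="SECOND_ACCOUNT_SECRET_KEY",
)

# Directory the results were fetched into, e.g. via `hdfs dfs -get /pig/output /tmp/pig-output`.
local_dir = "/tmp/pig-output"
for name in os.listdir(local_dir):
    s3.upload_file(os.path.join(local_dir, name), "my-target-bucket", "pig-output/" + name)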
Related
I have a Lambda written in .NET Core. It invokes a COPY command on Redshift. My Lambda executes under a role which has access to Redshift and to S3.
My COPY command looks like this:
COPY my_table_name FROM 's3://my_bucket/my_file.csv' CREDENTIALS 'aws_access_key_id=x;aws_secret_access_key=y' DELIMITER ',' CSV;
This works fine. My problem is that the CREDENTIALS I am using for the COPY are completely independent of the role the Lambda is running under.
Is there a way to execute the COPY command using the role which the lambda is executing under?
I definitely recommend using Role-Based authentication instead of Key-Based, but I believe you will need to assign the role directly to your Redshift cluster instead of trying to pass the role that is assigned to your Lambda function. The COPY command runs from within the Redshift cluster, your Lambda function is only telling Redshift to run the command, so it can't really use the Lambda function's role.
Please see the detailed documentation on Redshift Role-Based authentication.
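As a sketch of what that looks like once a role with S3 read access is attached to the cluster (the role ARN, cluster, database, and user below are made-up placeholders, and the Redshift Data API is just one way to issue the statement from a Lambda; any existing connection method works the same way):
import boto3

copy_sql = (
    "COPY my_table_name FROM 's3://my_bucket/my_file.csv' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftS3ReadRole' "
    "DELIMITER ',' CSV;"
)

# Issue the COPY from the Lambda; Redshift itself assumes the attached role to read from S3.
client = boto3.client("redshift-data")
client.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="my_db",
    DbUser="my_user",
    Sql=copy_sql,
)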
I have a key that is shared among different services; it is currently stored in a text file in an S3 bucket.
My goal is to read that value and pass it to my Lambda service through CloudFormation.
For an EC2 instance this was easy, because I could download and read the file by putting the scripts inside my CloudFormation JSON template. But I have no idea how to do this for my Lambdas.
I tried putting my credentials in the GitLab pipeline, but because of access permissions GitLab cannot pass them on, so my best and least expensive option right now is to do it in CloudFormation.
The easiest method would be to have the Lambda function read the information from Amazon S3.
The only way to get CloudFormation to "read" some information from Amazon S3 would be to create a Custom Resource, which involves writing an AWS Lambda function. However, since you already have a Lambda function, it would be easier to simply have that function read the object.
It's worth mentioning that, rather than storing such information in Amazon S3, you could use the AWS Systems Manager Parameter Store, which is a great place to store configuration information. Your various applications can then use Parameter Store to store and retrieve the configuration. CloudFormation can also access the Parameter Store.
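For illustration, a minimal sketch of a Lambda handler that reads such a value at runtime, either from the existing S3 text file or from Parameter Store. Bucket, key, and parameter names are placeholders, and the function's role needs s3:GetObject and/or ssm:GetParameter permissions:
import boto3

s3 = boto3.client("s3")
ssm = boto3.client("ssm")

def handler(event, context):
    # Option 1: read the shared key straight from the existing text file in S3.
    obj = s3.get_object(Bucket="my-config-bucket", Key="shared_key.txt")
    key_from_s3 = obj["Body"].read().decode("utf-8").strip()

    # Option 2: read it from Systems Manager Parameter Store instead.
    param = ssm.get_parameter(Name="/myapp/shared_key", WithDecryption=True)
    key_from_ssm = param["Parameter"]["Value"]

    # ... use whichever value the application expects ...
    return {"ok": True}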
I am creating a transient EMR Spark cluster programmatically, reading a vanilla S3 object, converting it to a Dataframe and writing a parquet file.
Running on a local cluster (with S3 credentials provided) everything works.
Spinning up a transient cluster and submitting the job fails on the write to S3 with the error:
AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records.
But my job is able to read the vanilla object from S3, and it logs to S3 correctly. Additionally, I see that EMR_EC2_DefaultRole is set as the EC2 instance profile, that EMR_EC2_DefaultRole has the proper S3 permissions, and that my bucket has a policy set for EMR_EC2_DefaultRole.
I get that the 'filesystem' I am trying to write the parquet file to is somehow special, but I cannot figure out what needs to be set for this to work.
Arrrrgggghh! Basically as soon as I had posted my question, the light bulb went off.
In my Spark job I had
val cred: AWSCredentials = new DefaultAWSCredentialsProviderChain().getCredentials
session.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", cred.getAWSAccessKeyId)
session.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", cred.getAWSSecretKey)
which were necessary when running locally in a test cluster, but clobbered the good values when running on EMR. I changed the block to
overrideCredentials.foreach(cred => {
  session.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", cred.getAWSAccessKeyId)
  session.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", cred.getAWSSecretKey)
})
and pushed the credential retrieval into my test harness (which is, of course, where it should have been all the time.)
If you are running in EC2 with the Apache Hadoop code (not EMR's connector), use the S3A connector, as it uses the EC2 IAM credential provider as the last entry in its default credential provider chain.
The IAM instance credentials are short-lived and include a session token: if you are copying them by hand, you'll need to refresh them at least every hour and set all three items: access key, secret key, and session token.
Like I said: s3a handles this, with the IAM credential provider issuing a new GET against the instance metadata HTTP server whenever the previous credentials expire.
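To make that concrete in pyspark terms (a sketch, not EMR-specific guidance; all values are placeholders): if you copy the instance's temporary credentials by hand, all three items have to be set and kept fresh, whereas pointing S3A at the instance profile avoids the copying entirely.
conf_h = sc._jsc.hadoopConfiguration()

# Manual copy of the short-lived instance credentials: all three items are required
# and must be refreshed before they expire.
conf_h.set("fs.s3a.access.key", "ASIA_EXAMPLE_ACCESS_KEY")
conf_h.set("fs.s3a.secret.key", "EXAMPLE_SECRET_KEY")
conf_h.set("fs.s3a.session.token", "EXAMPLE_SESSION_TOKEN")
conf_h.set("fs.s3a.aws.credentials.provider",
           "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")

# Or let S3A pull (and refresh) credentials from the EC2 instance profile itself:
# conf_h.set("fs.s3a.aws.credentials.provider",
#            "com.amazonaws.auth.InstanceProfileCredentialsProvider")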
I have created two S3 buckets named as 'ABC' and 'XYZ'.
If I upload a file (object) to the 'ABC' bucket, it should automatically be copied to 'XYZ'.
For the above scenario I have to write a Lambda function using Node.js.
I am newly learning Lambda, so detailed steps would be very helpful.
It would be good if we can do it via the web console; otherwise no problem.
This post should be useful for copying between buckets in the same region: https://aws.amazon.com/blogs/compute/content-replication-using-aws-lambda-and-amazon-s3/
If the use case you are trying to achieve is for DR purposes in another region, you may use this: https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/. S3 natively does the replication for you, but it's unclear from your question whether you need the same region or a different one.
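A minimal sketch of such a handler, shown in Python/boto3 for illustration even though the question mentions Node.js (the same calls exist in the Node SDK). It assumes the Lambda is wired to the 'ABC' bucket's ObjectCreated event notification and its role can read 'ABC' and write 'XYZ':
import urllib.parse
import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "XYZ"  # destination bucket

def handler(event, context):
    # One invocation can carry several records; copy each newly created object across.
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]   # 'ABC'
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )
    return {"copied": len(event["Records"])}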
I'm writing a pyspark job that needs to read out of two different s3 buckets. Each bucket has different credentials, which are stored on my machine as separate profiles in ~/.aws/credentials.
Is there a way to tell pyspark which profile to use when connecting to s3?
When using a single bucket, I had set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables in conf/spark-env.sh. Naturally, this only works for accessing 1 of the 2 buckets.
I am aware that I could set these values manually in pyspark when required, using:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "ABCD")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "EFGH")
But would prefer a solution where these values were not hard-coded in. Is that possible?
Different S3 buckets can be accessed with different S3A client configurations. This allows for different endpoints, data read and write strategies, as well as login details.
All fs.s3a options other than a small set of unmodifiable values (currently fs.s3a.impl) can be set on a per bucket basis.
The bucket specific option is set by replacing the fs.s3a. prefix on an option with fs.s3a.bucket.BUCKETNAME., where BUCKETNAME is the name of the bucket.
When connecting to a bucket, all options explicitly set will override the base fs.s3a. values.
Source: http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets
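For example, in pyspark the two key pairs could be loaded from the existing profiles in ~/.aws/credentials and applied per bucket, so nothing is hard-coded. This is a sketch assuming Hadoop 2.8+ S3A is on the classpath; the profile and bucket names are placeholders:
import configparser
import os

# Read both profiles from the local credentials file instead of hard-coding keys.
creds = configparser.ConfigParser()
creds.read(os.path.expanduser("~/.aws/credentials"))

conf_h = sc._jsc.hadoopConfiguration()
for profile, bucket in [("profile_a", "first-bucket"), ("profile_b", "second-bucket")]:
    conf_h.set("fs.s3a.bucket.%s.access.key" % bucket, creds[profile]["aws_access_key_id"])
    conf_h.set("fs.s3a.bucket.%s.secret.key" % bucket, creds[profile]["aws_secret_access_key"])

rdd1 = sc.textFile("s3a://first-bucket/some/path/")
rdd2 = sc.textFile("s3a://second-bucket/other/path/")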
s3n doesn't support the AWS credentials stored in ~/.aws/credentials. You should try Hadoop 2.7 and the newer Hadoop S3 implementation, s3a, which uses the AWS SDK.
I'm not sure whether the current Spark release (1.6.1) works well with Hadoop 2.7, but Spark 2.0 definitely has no problem with Hadoop 2.7 and s3a.
For Spark 1.6.x we made a somewhat dirty hack with the S3 driver from EMR; you can take a look at this doc: https://github.com/zalando/spark-appliance#emrfs-support
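If you do end up on s3a, one possible way to keep the keys in ~/.aws/credentials is to point S3A at the AWS SDK's profile-file provider. This is an assumption on my part: it needs Hadoop 2.8+, where the fs.s3a.aws.credentials.provider option exists, and the desired profile has to be selected via AWS_PROFILE before the JVM starts (e.g. AWS_PROFILE=my_profile pyspark):
# Sketch only: let the AWS SDK read ~/.aws/credentials; the profile is chosen by AWS_PROFILE.
conf_h = sc._jsc.hadoopConfiguration()
conf_h.set("fs.s3a.aws.credentials.provider",
           "com.amazonaws.auth.profile.ProfileCredentialsProvider")

rdd = sc.textFile("s3a://my-bucket/some/path/")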