I'm writing a PySpark job that needs to read from two different S3 buckets. Each bucket has different credentials, which are stored on my machine as separate profiles in ~/.aws/credentials.
Is there a way to tell PySpark which profile to use when connecting to S3?
When using a single bucket, I had set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables in conf/spark-env.sh. Naturally, this only works for accessing 1 of the 2 buckets.
I am aware that I could set these values manually in pyspark when required, using:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "ABCD")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "EFGH")
But I would prefer a solution where these values are not hard-coded. Is that possible?
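One way to avoid hard-coding the keys is to read them from the profile file at runtime. The following is only a minimal sketch: it assumes the standard INI layout of ~/.aws/credentials (one section per profile, with aws_access_key_id and aws_secret_access_key entries), and the helper names load_profile_credentials and apply_s3n_credentials are made up for illustration.

```python
import configparser
import os


def load_profile_credentials(profile, path=None):
    """Return (access_key, secret_key) for one profile in an AWS credentials file."""
    path = path or os.path.expanduser("~/.aws/credentials")
    parser = configparser.ConfigParser()
    if not parser.read(path):
        raise FileNotFoundError(path)
    section = parser[profile]  # raises KeyError if the profile is missing
    return section["aws_access_key_id"], section["aws_secret_access_key"]


def apply_s3n_credentials(hadoop_conf, access_key, secret_key):
    """Set the same two s3n properties shown above on a Hadoop configuration."""
    hadoop_conf.set("fs.s3n.awsAccessKeyId", access_key)
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", secret_key)


# Hypothetical usage inside a PySpark job ("bucket-a" is a made-up profile name):
# key, secret = load_profile_credentials("bucket-a")
# apply_s3n_credentials(sc._jsc.hadoopConfiguration(), key, secret)
```

Because these s3n properties are global rather than per bucket, they would still have to be re-applied before switching from one bucket to the other.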
Different S3 buckets can be accessed with different S3A client configurations. This allows for different endpoints, data read and write strategies, as well as login details.
All fs.s3a options other than a small set of unmodifiable values (currently fs.s3a.impl) can be set on a per bucket basis.
The bucket specific option is set by replacing the fs.s3a. prefix on an option with fs.s3a.bucket.BUCKETNAME., where BUCKETNAME is the name of the bucket.
When connecting to a bucket, all options explicitly set will override the base fs.s3a. values.
Source: http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets
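The prefix substitution described above can be sketched as a small helper. This is an illustration only: per_bucket_options is a hypothetical function, the bucket name and key values are placeholders, and fs.s3a.access.key / fs.s3a.secret.key are the standard S3A credential options.

```python
def per_bucket_options(bucket, options):
    """Rewrite fs.s3a.* option names into their per-bucket form."""
    base = "fs.s3a."
    result = {}
    for key, value in options.items():
        if not key.startswith(base):
            raise ValueError("not an fs.s3a option: " + key)
        # fs.s3a.access.key -> fs.s3a.bucket.BUCKETNAME.access.key
        result["fs.s3a.bucket." + bucket + "." + key[len(base):]] = value
    return result


# Placeholder bucket name and credentials, for illustration only.
opts = per_bucket_options(
    "first-bucket",
    {"fs.s3a.access.key": "ABCD", "fs.s3a.secret.key": "EFGH"},
)
for name, value in opts.items():
    print(name, "=", value)

# In PySpark each pair could then be applied with, for example:
# sc._jsc.hadoopConfiguration().set(name, value)
```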
s3n doesn't support the AWS credentials stored in ~/.aws/credentials. You should try Hadoop 2.7 and the newer Hadoop S3 implementation, s3a, which uses the AWS SDK.
I'm not sure whether the current Spark release, 1.6.1, works well with Hadoop 2.7, but Spark 2.0 definitely has no problem with Hadoop 2.7 and s3a.
For Spark 1.6.x, we made a dirty hack using the S3 driver from EMR. You can take a look at this doc: https://github.com/zalando/spark-appliance#emrfs-support
I have enabled AWS S3 replication in an account and I want to replicate the same S3 data to another account and it all works fine. But I don't want to use S3 versioning because of its additional cost.
So is there any other way to accommodate this scenario?
The automated Same-Region Replication (SRR) and Cross-Region Replication (CRR) require versioning to be enabled because of the way data is replicated between S3 buckets. For example, a new version of an object might be uploaded while a bucket is still being replicated, which can lead to problems without having separate versions.
If you do not wish to retain other versions, you can configure Amazon S3 Lifecycle Rules to expire (delete) older versions.
An alternative method would be to run the AWS CLI aws s3 sync command at regular intervals to copy the data between buckets. This command would need to be run on an Amazon EC2 instance or even your own computer. It could be triggered by a cron schedule (Linux) or a Scheduled Task (Windows).
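For the lifecycle option mentioned above, a rule that expires noncurrent versions might look roughly like the configuration built below. This is a sketch, not a tested policy: the rule ID and the one-day retention are arbitrary example values, and the empty prefix filter is assumed to mean "apply to the whole bucket".

```python
import json


def noncurrent_expiration_rule(days, rule_id="expire-old-versions"):
    """Build a lifecycle configuration that expires noncurrent object versions."""
    return {
        "Rules": [
            {
                "ID": rule_id,  # arbitrary example name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: every object in the bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


config = noncurrent_expiration_rule(1)
print(json.dumps(config, indent=2))
```

The printed JSON could be saved to a file and passed to aws s3api put-bucket-lifecycle-configuration, or the dictionary could be handed to boto3's put_bucket_lifecycle_configuration as its LifecycleConfiguration argument.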
I am practicing AWS commands. My client has given me an AWS IAM access key and secret, but not an account that I can use to log in to the admin panel. Those keys are used by the project itself. I am trying to list all the files in an S3 bucket recursively.
This is what I have done so far.
I have configured the AWS profile for CLI using the following command
aws configure
Then I could list all the available buckets by running the following command
aws s3 ls
Then I am trying to list all the files within a bucket. I tried running the following command.
aws s3 ls s3://my-bucket-name
But it seems like it is not giving me the correct content. Also, I need a way to navigate around the bucket too. How can I do that?
You want to list all of the objects recursively but aren't using the --recursive flag. Without it, the command only shows prefixes and any objects at the root level.
Relevant docs https://docs.aws.amazon.com/cli/latest/reference/s3/ls.html
A few options.
Roll your own
If you run aws s3 ls and a line item has the word "PRE" instead of a modification date and size, that means it's a "directory" that you can traverse. You can write a quick bash script that runs aws s3 ls recursively on everything that returns "PRE", which indicates that more files sit underneath.
s3fs
Using the S3FS-Fuse project on github, you can mount an S3 bucket on your file system and explore it that way. I haven't tested this and thus can't personally recommend it, but it seems viable, and might be a simple way to use tools you already have and understand (like tree).
One concern I have is that similar software I've used made a lot of API calls; if left mounted long-term, it can run up costs through the number of API calls alone.
Sync everything to your local machine (not recommended)
Adding this for completeness, but you can run
aws s3 sync s3://mybucket/ ./
This will try to copy everything to your computer and you'll be able to use your own filesystem. However, s3 buckets can hold petabytes of data, so you may not be able to sync it all to your system. Also, s3 provides a lot of strong security precautions to protect the data, which your personal computer probably doesn't.
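For the "roll your own" option above, the core step is telling "PRE" lines apart from object lines. Here is a rough sketch in Python rather than bash. It assumes the usual aws s3 ls output layout (either "PRE name/" or "date time size key"), and split_listing is a hypothetical helper name.

```python
def split_listing(output):
    """Split `aws s3 ls` output into (prefixes, object_keys)."""
    prefixes, objects = [], []
    for line in output.splitlines():
        parts = line.split(None, 3)
        if not parts:
            continue  # skip blank lines
        if parts[0] == "PRE" and len(parts) >= 2:
            # Keep everything after "PRE" so names containing spaces survive.
            prefixes.append(line.split(None, 1)[1])
        elif len(parts) == 4:
            # Columns are: date, time, size, key.
            objects.append(parts[3])
    return prefixes, objects


# Made-up sample output, for illustration only.
sample = (
    "                           PRE photos/\n"
    "                           PRE my docs/\n"
    "2013-07-25 17:06:27         88 test.txt\n"
)
prefixes, objects = split_listing(sample)
print(prefixes)
print(objects)
```

Each returned prefix could then be appended to the bucket URL and listed again, for example by calling aws s3 ls through subprocess.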
I am creating a transient EMR Spark cluster programmatically, reading a vanilla S3 object, converting it to a Dataframe and writing a parquet file.
Running on a local cluster (with S3 credentials provided) everything works.
Spinning up a transient cluster and submitting the job fails on the write to S3 with the error:
AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records.
But my job is able to read the vanilla object from S3, and it logs to S3 correctly. Additionally I see that EMR_EC2_DefaultRole is set as EC2 instance profile, and that EMR_EC2_DefaultRole has the proper S3 permissions, and that my bucket has a policy set for EMR_EC2_DefaultRole.
I get that the 'filesystem' I am trying to write the parquet file to is special, but I cannot figure out what needs to be set for this to work.
Arrrrgggghh! Basically as soon as I had posted my question, the light bulb went off.
In my Spark job I had
val cred: AWSCredentials = new DefaultAWSCredentialsProviderChain().getCredentials
session.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", cred.getAWSAccessKeyId)
session.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", cred.getAWSSecretKey)
which were necessary when running locally in a test cluster, but clobbered the good values when running on EMR. I changed the block to
overrideCredentials.foreach(cred=>{
session.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", cred.getAWSAccessKeyId)
session.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", cred.getAWSSecretKey)
})
and pushed the credential retrieval into my test harness (which is, of course, where it should have been all along).
If you are running in EC2 on the AWS code (not EMR), use the S3A connector, as it will use the EC2 IAM credential provider as the last one of the credential providers it uses by default.
The IAM credentials are short lived and include a session key: if you are copying them then you'll need to refresh at least every hour and set all three items: access key, session key and secret.
Like I said: s3a handles this, with the IAM credential provider triggering a new GET of the instance-info HTTP server whenever the previous key expires.
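If the short-lived credentials do have to be copied by hand, the three items map onto S3A options roughly as in the sketch below. The helper name temporary_s3a_options is made up; the option names and the TemporaryAWSCredentialsProvider class are the ones the Hadoop S3A documentation describes for session credentials, but check them against the Hadoop version in use.

```python
def temporary_s3a_options(access_key, secret_key, session_token):
    """Build the S3A options needed for short-lived (session) credentials."""
    if not (access_key and secret_key and session_token):
        raise ValueError("access key, secret key and session token are all required")
    return {
        # Tells S3A to expect a session token alongside the key pair.
        "fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.session.token": session_token,
    }


# Placeholder values, for illustration only.
options = temporary_s3a_options("ACCESS", "SECRET", "TOKEN")
for name in sorted(options):
    print(name)
```

These values stop working once the session expires, so the caller would have to fetch and set fresh ones; letting s3a pick up the IAM credentials itself avoids that.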
I have configured an encryption-enabled EMR cluster (properties in emrfs-site.xml).
I am using a DataFrame with SaveMode.Append to write to S3n://my-bucket/path/ in S3.
But I am not able to see the object getting AWS KMS encrypted.
However, when I do a simple insert from hive from EMR, I am able to see the objects getting aws kms encrypted.
How can I encrypt files from dataframe in S3 using sse kms?
The problem was that we were using s3a to save the files from the Spark program on EMR. AWS does not officially support the use of s3a on EMR. Although we were able to save data to S3, it was not being encrypted. I tried s3:// and s3n:// instead, and encryption works with both.
How can I pass AWS credentials (aws_access_key and aws_secret_key) to PIG PigStorage function?
Thanks
Given that this question is tagged with EMR, I am going to assume you are using AWS EMR for the Hadoop cluster. If this is the case, then no further setup is required to access S3. The EMR service automatically configures Hadoop FS (which PigStorage will leverage) with either the AWS credentials of the user starting the cluster or the requested instance role. Just provide the S3 location and Pig will interface with S3 according to the policy and permissions of the user/role.
A = LOAD 's3://<yourbucket>/<path>/' using PigStorage('\t') as (id:int, field2:chararray, field3:chararray);
I wasn't very explicit and did not give an example of my use case, sorry. I needed this because I have to use two different AWS access keys, and using something like s3n://access:secret#bucket did not solve it. I solved it by changing the PigStorage function to store the results in HDFS and, in the cleanUpWithSucess method, invoke a method that uploads the HDFS files to S3 with the credentials. This way I can pass the credentials to the PigStorage function when it is used to store; of course, I also changed the PigStorage constructor to accept these arguments.