I have been running my Spark job on a local cluster, where HDFS holds the input and the output is written back to HDFS. Now I have set up AWS EMR and an S3 bucket where I have my input, and I want the output to be written to S3 as well.
The error:
User class threw exception: java.lang.IllegalArgumentException: Wrong
FS: s3://something/input, expected:
hdfs://ip-some-numbers.eu-west-1.compute.internal:8020
I tried searching for the same problem and there are several questions regarding this issue. Some suggested that it's only an issue for the output, but even when I disable the output I get the same error.
Another suggestion is that there is something wrong with FileSystem in my code. Here are all of the occurrences of input/output in my program:
The first occurrence is in my custom FileInputFormat, in getSplits(JobContext job), which I have not actually modified myself but could:
FileSystem fs = path.getFileSystem(job.getConfiguration());
A similar case is in my custom RecordReader, also not modified by me:
final FileSystem fs = file.getFileSystem(job);
In nextKeyValue() of my custom RecordReader, which I have written myself, I use:
FileSystem fs = FileSystem.get(jc);
And finally when I want to detect the number of files in a folder I use:
val fs = FileSystem.get(sc.hadoopConfiguration)
val status = fs.listStatus(new Path(path))
I assume the issue is with my code, but how can I modify the FileSystem calls to support input/output from S3?
This is what I did to solve this when launching a Spark job on EMR:
val hdfs = FileSystem.get(new java.net.URI(s"s3a://${s3_bucket}"), sparkSession.sparkContext.hadoopConfiguration)
Make sure to replace s3_bucket with the name of your bucket.
I hope this is helpful for someone.
The Hadoop FileSystem APIs do not provide support for S3 out of the box. There are two implementations of the Hadoop FileSystem APIs for S3: S3A and S3N. S3A seems to be the preferred implementation. To use it you have to do a few things:
Add the aws-java-sdk-bundle.jar to your classpath.
When you create the FileSystem, include values for the following properties in the FileSystem's configuration:
fs.s3a.access.key
fs.s3a.secret.key
When specifying paths on S3, don't use s3://; use s3a:// instead.
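For example, a minimal sketch of those steps in Scala (the bucket name and key values are placeholders, and this assumes the S3A connector jars are already on the classpath):
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  // placeholder
conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  // placeholder

// note the s3a:// scheme, not s3://
val fs = FileSystem.get(new URI("s3a://your-bucket"), conf)
fs.listStatus(new Path("s3a://your-bucket/input")).foreach(s => println(s.getPath))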
Note: create a simple user and try things out with basic authentication first. It is possible to get it to work with AWS's more advanced temporary credential mechanisms, but that's a bit involved, and I had to make some changes to the FileSystem code to get it working when I tried.
Source of info is here
EMR is configured to avoid the use of keys in the code or in your job configuration.
The problem there is how the FileSystem is created in your example.
The default FileSystem that Hadoop creates is the one for the hdfs scheme.
So the following code will not work if that path's scheme is s3://.
val fs = FileSystem.get(sc.hadoopConfiguration)
val status = fs.listStatus(new Path(path))
To create the right FileSystem, you need to use a path with the scheme that you will use. For example, something like this:
val conf = sc.hadoopConfiguration
val pObj = new Path(path)
val status = pObj.getFileSystem(conf).listStatus(pObj)
From the Hadoop code:
Implementation in the FileSystem.get
public static FileSystem get(Configuration conf) throws IOException {
return get(getDefaultUri(conf), conf);
}
Implementation using Path.getFileSystem:
public FileSystem getFileSystem(Configuration conf) throws IOException {
return FileSystem.get(this.toUri(), conf);
}
Try setting the default URI for the FileSystem:
FileSystem.setDefaultUri(spark.sparkContext.hadoopConfiguration, new URI(s"s3a://$s3bucket"))
After specifying the key and secret using
fs.s3a.access.key
fs.s3a.secret.key
and getting the file system as noted:
val hdfs = FileSystem.get(new java.net.URI(s"s3a://${s3_bucket}"), sparkSession.sparkContext.hadoopConfiguration)
I would still get the error
java.lang.IllegalArgumentException: Wrong FS: s3a:// ... , expected: file:///
To check the default filesystem, you can look at the hdfs FileSystem created above: hdfs.getUri, which for me still returned file:///.
In order to get this to work correctly, prior to running FileSystem.get, set the default URI of the filesystem.
val s3URI = s"s3a://$s3bucket"
FileSystem.setDefaultUri(spark.sparkContext.hadoopConfiguration, new URI(s3URI))
val hdfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
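As a quick check (just a sketch continuing the snippet above), the default filesystem should now report the s3a URI instead of file:///:
// after FileSystem.setDefaultUri(...) and FileSystem.get(...) above
println(hdfs.getUri)  // should now print something like s3a://<your-bucket> rather than file:///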
Related
I am trying to automate file movement from one folder to another folder within the same S3 bucket, triggered by the file-creation event in the S3 bucket.
I was hoping to use a Lambda function trigger to do this, but I feel that Lambda triggers only at the root directory level and cannot be used at the folder level.
Example:
Bucket Name: my-only-s3-bucket
Source Folder: s3://my-only-s3-bucket/Landing
Target Folder: s3://my-only-s3-bucket/Staging
Requirement:
When a file gets created or uploaded into the source folder s3://my-only-s3-bucket/Landing, it should get moved to s3://my-only-s3-bucket/Staging automatically, without any manual intervention.
How to achieve this?
I was hoping to use a Lambda function trigger to do this, but I feel that Lambda triggers only at the root directory level and cannot be used at the folder level.
This is not true. S3 has no concept of folders. You can trigger at any "level" using a filter prefix (i.e. prefix -> "Landing/") and/or a suffix (for example ".jpg").
The S3 trigger will call the Lambda and pass it the event with the new object as input. Then just use any language you are familiar with and call the built-in S3 copy function from any of the available AWS SDKs (.NET, Java, Python, etc.) to copy the object to the destination, as in the example and the sketch below.
example:
def object_copied?(
s3_client,
source_bucket_name,
source_key,
target_bucket_name,
target_key)
return true if s3_client.copy_object(
bucket: target_bucket_name,
copy_source: source_bucket_name + '/' + source_key,
key: target_key
)
rescue StandardError => e
puts "Error while copying object: #{e.message}"
end
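For completeness, here is a rough Scala sketch of such a handler using the AWS Java SDK (the class name is mine, and the Landing/Staging prefixes are taken from the question); it copies the newly created object into Staging and then deletes the original to complete the move:
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.S3Event
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.jdk.CollectionConverters._

// Lambda handler triggered by s3:ObjectCreated:* events filtered on the "Landing/" prefix
class MoveOnCreateHandler extends RequestHandler[S3Event, String] {
  private val s3 = AmazonS3ClientBuilder.defaultClient()

  override def handleRequest(event: S3Event, context: Context): String = {
    event.getRecords.asScala.foreach { record =>
      val bucket = record.getS3.getBucket.getName
      val key    = record.getS3.getObject.getUrlDecodedKey  // e.g. "Landing/data.csv"
      if (key.startsWith("Landing/")) {
        val targetKey = "Staging/" + key.stripPrefix("Landing/")
        s3.copyObject(bucket, key, bucket, targetKey)  // copy into Staging
        s3.deleteObject(bucket, key)                   // remove the original to complete the "move"
      }
    }
    "done"
  }
}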
I think the concept of a relative path can solve your problem. Here's a code snippet that does it using a library called s3pathlib, an object-oriented S3 file system interface.
# import the library
from s3pathlib import S3Path

# define source and target folder
source_dir = S3Path("my-only-s3-bucket/Landing/")
target_dir = S3Path("my-only-s3-bucket/Staging/")

# let's say you have a new file in the Landing folder; the s3 uri is
s3_uri = "s3://my-only-s3-bucket/Landing/my-subfolder/data.csv"

# I guess you want to cut the file to the new location and delete the original one
def move_file(p_file, p_source_dir, p_target_dir):
    # validate that p_file is inside of p_source_dir
    if not p_file.uri.startswith(p_source_dir.uri):
        raise ValueError
    # find the new s3 path based on the relative path
    p_file_new = S3Path(
        p_target_dir, p_file.relative_to(p_source_dir)
    )
    # move
    p_file.move_to(p_file_new)
    # if you want a copy instead, you can do p_file.copy_to(p_file_new)

# then let's do your work
if __name__ == "__main__":
    move_file(
        p_file=S3Path.from_s3_uri(s3_uri),
        p_source_dir=source_dir,
        p_target_dir=target_dir,
    )
If you want more advanced path manipulation, you can reference this document. S3Path.change(new_abspath, new_dirpath, new_dirname, new_basename, new_fname, new_ext) would be the most important method for you to know.
I'm doing what I think is a very simple thing to check that alpakka is working:
val awsCreds = AwsBasicCredentials.create("xxx", "xxx")
val credentialsProvider = StaticCredentialsProvider.create(awsCreds)
implicit val staticCreds = S3Attributes.settings(S3Ext(context.system).settings.withCredentialsProvider(credentialsProvider)
.withS3RegionProvider(new AwsRegionProvider {val getRegion: Region = Region.US_EAST_2}))
val d = S3.checkIfBucketExists(action.bucket)
d foreach { msg => log.info("mgs: " + msg.toString)}
When I run this I get
msgs: NotExists
But the bucket referred to by action.bucket does exist, and I can access it using these credentials. What's more, when I modify the credentials (by changing the secret key), I get the same message. What I should get, according to the documentation, is AccessDenied.
I got to this point because I didn't think the environment was picking up on the right credentials - hence all the hard-coded values. But now I don't really know what could be causing this behavior.
Thanks
Update: The action object is just a case class consisting of a bucket and a path. I've checked in debug that action.bucket and action.path point to the things they should be - in this case an S3 bucket. I've also tried the above code with just the string bucket name in place of action.bucket.
Just my carelessness...
An errant copy added an extra implicit actor system to the mix. Some changes were made to implicit materializers in Akka 2.6, and I think those, along with the extra implicit actor system, made for a weird mix.
fileTransferUtility = new TransferUtility(s3Client);
try
{
if (file.ContentLength > 0)
{
var filePath = Path.Combine(Server.MapPath("~/Files"), Path.GetFileName(file.FileName));
var fileTransferUtilityRequest = new TransferUtilityUploadRequest
{
BucketName = bucketName,
FilePath = filePath,
StorageClass = S3StorageClass.StandardInfrequentAccess,
PartSize = 6291456, // 6 MB.
Key = keyName,
CannedACL = S3CannedACL.PublicRead
};
fileTransferUtilityRequest.Metadata.Add("param1", "Value1");
fileTransferUtilityRequest.Metadata.Add("param2", "Value2");
fileTransferUtility.Upload(fileTransferUtilityRequest);
fileTransferUtility.Dispose();
    }
}
catch (Exception e)
{
    // handle or log the upload failure
}
I'm getting this error:
The file indicated by the FilePath property does not exist!
I tried changing the path to the actual path of the file, C:\Users\jojo\Downloads, but I'm still getting the same error.
(Based on a comment above indicating that file is an instance of HttpPostedFileBase in a web application...)
I don't know where you got Server.MapPath("~/Files") from, but if file is an HttpPostedFileBase that has been uploaded to this web application, then it's likely in memory and not on your file system. Or, at best, it's on the file system in a temporary system folder somewhere.
Since your source (the file variable contents) is a stream, before you try to interact with the file system you should see if the AWS API you're using can accept a stream. And it looks like it can.
if (file.ContentLength > 0)
{
var transferUtility = new TransferUtility(/* constructor params here */);
transferUtility.Upload(file.InputStream, bucketName, keyName);
}
Note that this is entirely free-hand; I'm not really familiar with AWS interactions. And you'll definitely want to take a look at the constructors on TransferUtility to see which one meets your design. But the point is that you're currently looking to upload a stream from the file you've already uploaded to your web application, not looking to upload an actual file from the file system.
As a fallback, if you can't get the stream upload to work (and you really should, that's the ideal approach here), then your next option is likely to save the file first and then upload it using the method you have now. So if you're expecting it to be in Server.MapPath("~/Files") then you'd need to save it to that folder first, for example:
file.SaveAs(Path.Combine(Server.MapPath("~/Files"), Path.GetFileName(file.FileName)));
Of course, over time this folder can become quite full and you'd likely want to clean it out.
I have a large list of objects in the source S3 bucket and I want to selectively copy a subset of those objects into the destination bucket.
As per the doc here, it seems it's possible with TransferManager.copy(from_bucket, from_key, to_bucket, to_key); however, I need to do it one object at a time.
Is anyone aware of other ways, preferably to copy in a batched fashion instead of calling copy() for each object?
If you wish to copy a whole directory, you could use the AWS Command-Line Interface (CLI):
aws s3 cp --recursive s3://source-bucket/folder/ s3://destination-bucket/folder/
However, since you wish to selectively copy files, there's no easy way to indicate which files to copy (unless they all have the same prefix).
Frankly, when I need to copy selective files, I actually create an Excel file with a list of filenames. Then, I create a formula like this:
="aws s3 cp s3://source-bucket/"&A1&" s3://destination-bucket/"
Then just use Fill Down to replicate the formula. Finally, copy the commands and paste them into a Terminal window.
If you are asking whether there is a way to programmatically copy multiple objects between buckets using one API call, then the answer is no, this is not possible. Each API call will only copy one object. You can, however, issue multiple copy commands in parallel to make things go faster.
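For example, a rough Scala sketch of the parallel approach using the Java SDK's TransferManager (bucket names and keys below are placeholders); each copy() call returns immediately, and the copies run concurrently on the TransferManager's thread pool:
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.transfer.TransferManagerBuilder

val s3 = AmazonS3ClientBuilder.defaultClient()
val tm = TransferManagerBuilder.standard().withS3Client(s3).build()

// hypothetical subset of keys to copy
val keysToCopy = Seq("data/part-0001.csv", "data/part-0002.csv")

// kick off all copies, then wait for each to finish
val copies = keysToCopy.map(key => tm.copy("source-bucket", key, "destination-bucket", key))
copies.foreach(_.waitForCompletion())

tm.shutdownNow()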
I think it's possible via the S3 console, but using the SDK there's no such option. Although this isn't the solution to your problem, this script selectively copies objects one at a time, and if you're reading from an external file, it's just a matter of entering your file names there.
ArrayList<String> filesToBeCopied = new ArrayList<String>();
filesToBeCopied.add("sample.svg");
filesToBeCopied.add("sample.png");

String from_bucket_name = "bucket1";
String to_bucket = "bucket2";

BasicAWSCredentials creds = new BasicAWSCredentials("<key>", "<secret>");
final AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion(Regions.AP_SOUTH_1)
        .withCredentials(new AWSStaticCredentialsProvider(creds)).build();

// listObjectsV2 returns at most 1000 keys per call; for larger buckets,
// follow the continuation token to page through the full listing.
ListObjectsV2Result result = s3.listObjectsV2(from_bucket_name);
List<S3ObjectSummary> objects = result.getObjectSummaries();
try {
    for (S3ObjectSummary os : objects) {
        String bucketKey = os.getKey();
        if (filesToBeCopied.contains(bucketKey)) {
            s3.copyObject(from_bucket_name, bucketKey, to_bucket, bucketKey);
        }
    }
} catch (AmazonServiceException e) {
    System.err.println(e.getErrorMessage());
    System.exit(1);
}
I'm trying to use the erlcloud library for S3 uploads in my app. As a test, I'm trying to get it to list buckets via an iex console:
iex(4)> s3 = :erlcloud_s3.new("KEY_ID", "SECRET_KEY")
...
iex(5)> :erlcloud_s3.list_buckets(s3)
** (ErlangError) erlang error: {:aws_error, {:socket_error, :timeout}}
(erlcloud) src/erlcloud_s3.erl:909: :erlcloud_s3.s3_request/8
(erlcloud) src/erlcloud_s3.erl:893: :erlcloud_s3.s3_xml_request/8
(erlcloud) src/erlcloud_s3.erl:238: :erlcloud_s3.list_buckets/1
I've checked that inets, ssl, and erlcloud are all started, and I know the credentials work fine, because I've tested them in a similar fashion with a Ruby library in irb.
I've tried configuring it with a longer timeout, but no matter how high I set it I still get this error.
Any ideas? Or approaches I could take to debug this?
I could simulate the same error, and I could resolve it by replacing the double quotes with single quotes.
> iex(4)> s3 = :erlcloud_s3.new('KEY_ID', 'SECRET_KEY')
> iex(5)> :erlcloud_s3.list_buckets(s3)
Assuming double quotes were used, the error may be caused by a type mismatch between an Elixir string (a binary) and an Erlang char list, which is what single-quoted literals produce in Elixir.