download, process, upload large number of s3 files with spark - amazon-web-services

I have a large amount of files (~500k hdf5) inside a s3 bucket which I need to process and reupload to another s3 bucket.
I am pretty new to such tasks, so I am not quite sure if my approach is correct here. I do the following:
I use boto to get the list of keys inside the bucket and parallelize it with spark:
s3keys = bucket.list()
data = sc.parallelize(s3keys)
data = data.map(lambda x: download_process_upload(x))
result = data.collect()
where download_process_upload is a function which downloads the file specified by the key, does some processing on it and re-uploads it to another bucket (returning 1 if everything was successful, and 0 if there was an error)
So in the end I could do
success_rate = sum(result) / float(len(s3keys))
I have read that spark map statements should be stateless, while my custom map function definitely is not stateless. It downloads the file to disk and then loads it into memory etc.
So is this the proper way to do such a task?

I've successfully used your methodology to download and process data from S3. I have not tried to upload the data from within a map statement. But, I see no reason why you wouldn't be able to read the file from s3, process it, and then upload it to a new location.
Also, you can save a few keystrokes and take the explicit lambda out of the map statement like this data = data.map(download_process_upload)

Related

AWS Lambda avoid recursive trigger

I'm downloading data from an API and writing it to a csv file that I store in an S3 bucket. I'm then copying my file from this input bucket into an output bucket with a Lambda function. From the output bucket I'm ingesting it into a MySQL RDS instance with another Lambda function.
The copy-to-another-bucket and upload-to-RDS lambda functions both get triggered when I create a new object in a bucket. Since I'm appending to my csv file, the upload-to-RDS function gets triggered way more than it should and I end up with ~30 rows in my database instead of 6.
I thought by copying the files between S3 buckets I could avoid this, but it doesn't help. Is there any way to only upload the csv file to the database once it has been written and not while it's being updated? Can I delay the trigger maybe?
The only other solution I can think of is to skip the copy-to-another-bucket function altogether and to schedule the upload-to-RDS function.
You need to realize that S3 doesn't support updating an existing file. If you are appending a row to an existing CSV file in S3, then that operation requires uploading the entire contents of the CSV file to S3 again, which S3 sees as a new object.
If you need to store a temporary version of the CSV file in S3 while you are updating it, then you should store it in a separate path, like s3://your_bucket/tmp and then when you have completed your updates, move it to the final path like s3://your_bucket/complete and only configure the Lambda trigger on the /complete path.

Is there a way to list or iterate over the CONTENT of a file in S3?

I have a S3 object that has a key
I am trying to iterate over the values of an key inside S3, which is basically a simple .txt file. I have found similar questions for iterating over objects and listing files in an object, but nothing so far on iterating over the actual contents of the file itself.
The code below will return the object and bucket containing the data but it doesn't list it's content nor give me an optiopn to iterate over it's contents. This appears to just filter the keys in the object itself, but I am trying to open or/and iterate over the values of the key.
s3 = boto3.resource('s3')
bucket = s3.Bucket('account-id-metadata')
for i in bucket.objects.filter(Prefix='data.txt'):
print(i)
Would like to know if this is possible with S3 using boto3?
NOTE: This was originally in a local file and & I was planning to iterate over the file locally instead; however, because of the large amount of data it was crashing & taking up a lot of memory, so I moved this to S3 hoping to perform the same functionality.
Thanks you in advance.
The only Amazon S3 operation that works on the "contents" of an object is S3 Select and Glacier Select – Retrieving Subsets of Objects | AWS News Blog.
This allows you to use SQL-like commands to extract rows and columns from a single object for certain file formats. This is useful when wanting to extract a small amount of information from large objects.

Faster way to Copy S3 files

I am trying to copy around 50 million files and 15TB in total size from one s3 bucket to another bucket.
There are AWS CLI option to copy fast. But in my case, I want to put a filter and date range. So I thought to write code by using boto3.
The source bucket input structure:
Folder1
File1 - Date1
File2 - Date1
Folder2
File1 - Date2
File2 - Date2
Folder3
File1_Number1 - Date3
File2_Number1 - Date3
Folder4
File1_Number1 - Date2
File2_Number1 - Date2
Folder5
File1_Number2 - Date4
File2_Number2 - Date4
So the purpose is to copy all files which start with 'File1' from each folder by using a date range(Date2 to Date4). date(Date1, Date2, Date3, Date4) is file modified date.
The output would have date key partition and I am using UUID to keep every file name unique so it would never replace the existing file. So the files which have an identical date(modified date of the file) will be in the same folder.
Target Bucket would have output:
Date2
File1_UUID1
File1_Number1_UUID2
Date3
File1_Number1_UUID3
Date4
File1_Number2_UUID4
I have written code by using boto3 API and AWS glue to run the code. But boto3 API copies 500 thousand files every day.
The code:
s3 = boto3.resource('s3', region_name='us-east-2', config=boto_config)
# source and target bucket names
src_bucket_name = 'staging1'
trg_bucket_name = 'staging2'
# source and target bucket pointers
s3_src_bucket = s3.Bucket(src_bucket_name)
print('Source Bucket Name : {0}'.format(s3_src_bucket.name))
s3_trg_bucket = s3.Bucket(trg_bucket_name)
print('Target Bucket Name : {0}'.format(s3_trg_bucket.name))
# source and target directories
trg_dir = 'api/requests'
# source objects
s3_src_bucket_objs = s3_src_bucket.objects.all()
# Request file name prefix
file_prefix = 'File1'
# filter - start and end date
start_date = datetime.datetime.strptime("2019-01-01", "%Y-%m-%d").replace(tzinfo=None)
end_date = datetime.datetime.strptime("2020-06-15", "%Y-%m-%d").replace(tzinfo=None)
# iterates each source directory
for iterator_obj in s3_src_bucket_objs:
file_path_key = iterator_obj.key
date_key = iterator_obj.last_modified.replace(tzinfo=None)
if start_date <= date_key <= end_date and file_prefix in file_path_key:
# file name. It start with value of file_prefix.
uni_uuid = uuid.uuid4()
src_file_name = '{}_{}'.format(file_path_key.split('/')[-1], uni_uuid)
# construct target directory path
trg_dir_path = '{0}/datekey={1}'.format(trg_dir, date_key.date())
# source file
src_file_ref = {
'Bucket': src_bucket_name,
'Key': file_path_key
}
# target file path
trg_file_path = '{0}/{1}'.format(trg_dir_path, src_file_name)
# copy source file to target
trg_new_obj = s3_trg_bucket.Object(trg_file_path)
trg_new_obj.copy(src_file_ref, ExtraArgs=extra_args, Config=transfer_config)
# happy ending
Do we have any other way to make it fast or any alternative way to copy files in such target structure? Do you have any suggestions to improve the code? I am looking for some faster way to copy files. Your input would be valuable. Thank you!
The most likely reason that you can only copy 500k objects per day (thus taking about 3-4 months to copy 50M objects, which is absolutely unreasonable) is because you're doing the operations sequentially.
The vast majority of the time your code is running is spent waiting for the S3 Copy Object request to be sent to S3, processed by S3 (i.e., copying the object), and then sending the response back to you. On average, this is taking around 160ms per object (500k/day == approx. 1 per 160ms), which is reasonable.
To dramatically improve the performance of your copy operation, you should simply parallelize it: make many threads run the copies concurrently.
Once the Copy commands are not the bottleneck anymore (i.e., after you make them run concurrently), you'll encounter another bottleneck: the List Objects requests. This request runs sequentially, and returns only up to 1k keys per page, so you'll end up having to send around 50k List Object requests sequentially with the straightforward, naive code (here, "naive" == list without any prefix or delimiter, wait for the response, and list again with the provided next continuation token to get the next page).
Two possible solutions for the ListObjects bottleneck:
If you know the structure of your bucket pretty well (i.e., the "names of the folders", statistics on the distribution of "files" within those "folders", etc), you could try to parallelize the ListObjects requests by making each thread list a given prefix. Note that this is not a general solution, and requires intimate knowledge of the structure of the bucket, and also usually only works well if the bucket's structure had been planned out originally to support this kind of operation.
Alternatively, you can ask S3 to generate an inventory of your bucket. You'll have to wait at most 1 day, but you'll end up with CSV files (or ORC, or Parquet) containing information about all the objects in your bucket.
Either way, once you have the list of objects, you can have your code read the inventory (e.g., from local storage such as your local disk if you can download and store the files, or even by just sending a series of ListObjects and GetObject requests to S3 to retrieve the inventory), and then spin up a bunch of worker threads and run the S3 Copy Object operation on the objects, after deciding which ones to copy and the new object keys (i.e., your logic).
In short:
grab a list of all the objects first;
then launch many workers to run the copies.
One thing to watch out for here is if you launch an absurdly high number of workers and they all end up hitting the exact same partition of S3 for the copies. In such a scenario, you could end up getting some errors from S3. To reduce the likelihood of this happening, here are some things you can do:
instead of going sequentially over your list of objects, you could randomize it. E.g., load the inventory, put the items into a queue in a random order, and then have your workers consume from that queue. This will decrease the likelihood of overheating a single S3 partition
keep your workers to not more than a few hundred (a single S3 partition should be able to easily keep up with many hundreds of requests per second).
Final note: there's another thing to consider which is whether or not the bucket may be modified during your copy operation. If it could be modified, then you'll need a strategy to deal with objects that might not be copied because they weren't listed, or with objects that were copied by your code but got deleted from the source.
You may be able to complete it using S3 Batch Operations.
You can use S3 Batch Operations to perform large-scale batch operations on Amazon S3 objects. S3 Batch Operations can execute a single operation on lists of Amazon S3 objects that you specify. A single job can perform the specified operation on billions of objects containing exabytes of data. Amazon S3 tracks progress, sends notifications, and stores a detailed completion report of all actions, providing a fully managed, auditable, serverless experience. You can use S3 Batch Operations through the AWS Management Console, AWS CLI, AWS SDKs, or REST API.
Use S3 Batch Operations to copy objects and set object tags or access control lists (ACLs). You can also initiate object restores from Amazon S3 Glacier or invoke an AWS Lambda function to perform custom actions using your objects. You can perform these operations on a custom list of objects, or you can use an Amazon S3 inventory report to make generating even the largest lists of objects easy. Amazon S3 Batch Operations use the same Amazon S3 APIs that you already use with Amazon S3, so you'll find the interface familiar.
It would be interesting if you could report back whether this ends up working with the amount of data that you have, and any issues you may have encountered along the way.
You can use Skyplane which is much faster and cheaper than aws s3 cp (up to 110x).
You can transfer data between buckets with the following command, after running skyplane init:
skyplane cp -r s3://<bucket-A>/ s3://<bucket-B>/

AWS Lambda generates large size files to S3

Currently we are having a aws lambda (java based runtime) which takes a SNS as input and then perform business logic and generate 1 XML file , store it to S3.
The implementation now is create the XML at .tmp location which we know there is space limitation of aws lambda (500mb).
Do we have any way to still use lambda but can stream XML file to S3 without using .tmp folder?
I do research but still do not find solution for it.
Thank you.
You can directly load an object to s3 from memory without having to store it locally. You can use the put object API for this. However, keep in mind that you still have time and total memory limits with lambda as well. You may run out of those too if your object size is too big.
If you can split the file into chunks and don't require to update the beginning of the file while working with its end you can use multipart upload providing a ready to go chunk and then free the memory for the next chunk.
Otherwise you still need a temporary storage for form all the parts of the XML. You can use DynamoDB or Redis and when you collect there all the parts of the XML you can start uploading it part by part, then cleanup the db (or set TTL to automate the cleanup).

Upload files to S3 Bucket directly from a url

We need to move our video file storage to AWS S3. The old location is a cdn, so I only have url for each file (1000+ files, > 1TB total file size). Running an upload tool directly on the storage server is not an option.
I already created a tool that downloads the file, uploads file to S3 bucket and updates the DB records with new HTTP url and works perfectly except it takes forever.
Downloading the file takes some time (considering each file close to a gigabyte) and uploading it takes longer.
Is it possible to upload the video file directly from cdn to S3, so I could reduce processing time into half? Something like reading chunk of file and then putting it to S3 while reading next chunk.
Currently I use System.Net.WebClient to download the file and AWSSDK to upload.
PS: I have no problem with internet speed, I run the app on a server with 1GBit network connection.
No, there isn't a way to direct S3 to fetch a resource, on your behalf, from a non-S3 URL and save it in a bucket.
The only "fetch"-like operation S3 supports is the PUT/COPY operation, where S3 supports fetching an object from one bucket and storing it in another bucket (or the same bucket), even across regions, even across accounts, as long as you have a user with sufficient permission for the necessary operations on both ends of the transaction. In that one case, S3 handles all the data transfer, internally.
Otherwise, the only way to take a remote object and store it in S3 is to download the resource and then upload it to S3 -- however, there's nothing preventing you from doing both things at the same time.
To do that, you'll need to write some code, using presumably either asynchronous I/O or threads, so that you can simultaneously be receiving a stream of downloaded data and uploading it, probably in symmetric chunks, using S3's Multipart Upload capability, which allows you to write individual chunks (minimum 5MB each) which, with a final request, S3 will validate and consolidate into a single object of up to 5TB. Multipart upload supports parallel upload of chunks, and allows your code to retry any failed chunks without restarting the whole job, since the individual chunks don't have to be uploaded or received by S3 in linear order.
If the origin supports HTTP range requests, you wouldn't necessarily even need to receive a "stream," you could discover the size of the object and then GET chunks by range and multipart-upload them. Do this operation with threads or asynch I/O handling multiple ranges in parallel, and you will likely be able to copy an entire object faster than you can download it in a single monolithic download, depending on the factors limiting your download speed.
I've achieved aggregate speeds in the range of 45 to 75 Mbits/sec while uploading multi-gigabyte files into S3 from outside of AWS using this technique.
This has been answered by me in this question, here's the gist:
object = Aws::S3::Object.new(bucket_name: 'target-bucket', key: 'target-key')
object.upload_stream do |write_stream|
IO.copy_stream(URI.open('http://example.com/file.ext'), write_stream)
end
This is no 'direct' pull-from-S3, though. At least this doesn't download each file and then uploads in serial, but streams 'through' the client. If you run the above on an EC2 instance in the same region as your bucket, I believe this is as 'direct' as it gets, and as fast as a direct pull would ever be.
if a proxy ( node express ) is suitable for you then the portions of code at these 2 routes could be combined to do a GET POST fetch chain, retreiving then re-posting the response body to your dest. S3 bucket.
step one creates response.body
step two
set the stream in 2nd link to response from the GET op in link 1 and you will upload to dest.bucket the stream ( arrayBuffer ) from the first fetch