Should batch_writer in DynamoDB be much faster? - amazon-web-services

I have some weird results when uploading 17,956,000,000 records into DynamoDB. Those records are spread across 134,000 files (and each file holds 134,000 records); they form a full city distance matrix (the city has 134,000 nodes). To do this, I uploaded all the files into an S3 bucket and wrote a Python Lambda function (I plan to drive it with an SQS queue later). When I measured the upload time for a single file (from S3 to DynamoDB using the Lambda Python function), I could not upload a full file within 15 minutes (the maximum Lambda execution time). I ran some tests comparing the batch write over a 3-minute execution:
with table.batch_writer() as bw:
    for k, v in obj.items():
        bw.put_item(Item={"_from": src_node, "_to": dst_node, "travel_time": v})
which managed to upload a total of 2,475 records, and a plain put_item loop:
for k, v in obj.items():
    dynamodb.put_item(TableName='xxx', Item={"_from": {"N": src_node}, "_to": {"N": dst_node}, "travel_time": {"N": str(v)}})
which managed to upload a total of 2,404 records.
I want to validate whether these results are reasonable or whether something is wrong with my approach.

Related

Merge S3 files into multiple <1GB S3 files

I have multiple S3 files in a bucket.
Input S3 bucket:
File 1 - 2GB data
File 2 - 500MB data
File 3 - 1GB data
File 4 - 2GB data
and so on. Assume there are 50 such files. The data within the files has the same schema, let's say attribute1, attribute2.
I want to merge these files and write the output into a new bucket, such that each output file is less than 1GB and keeps the same schema as before.
File 1 - < 1GB
File 2 - < 1GB
File 3 - < 1GB
I am looking for AWS-based solutions which I can deliver using the AWS CDK. I was considering the following two solutions:
AWS Athena - reads from and writes to S3, but I'm not sure if I can set a 1GB limit while writing.
AWS Lambda - read the files sequentially, store them in memory, and when the size is near 1GB, write to a new file in the S3 bucket. Repeat until all files are completed. I'm worried about the 15-minute timeout; I'm not sure Lambda will be able to process it all.
Expected scale -> overall input size in total: 1 TB
What would be a good way to go about implementing this? I hope I have phrased the question right; I'd be happy to clarify in the comments if there are any doubts.
Thanks!
Edit:
Based on a comment ->
Apologies for calling it a merge. It is more of a repartitioning. All files have the same schema and are CSV files. In pseudocode:
List<File> listOfFiles = ReadFromS3(key)
Create a new file named temp.csv
for (File file : listOfFiles)
    append file to temp.csv
List<File> finalList = break temp.csv down into chunks of at most 1GB each
for (File file : finalList)
    writeToS3(file)
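For illustration, a minimal Python sketch of that pseudocode with boto3 might look like the following; the bucket names, the header handling and the 1GB threshold are assumptions, and the approach keeps up to roughly 1GB of CSV rows in memory at a time:

import boto3

s3 = boto3.client("s3")
SRC_BUCKET = "input-bucket"    # assumed name
DST_BUCKET = "output-bucket"   # assumed name
MAX_SIZE = 1024 ** 3           # target of roughly 1GB per output object

def flush(part, header, rows):
    # write one output chunk, re-attaching the shared CSV header
    s3.put_object(Bucket=DST_BUCKET,
                  Key="merged/part-{:04d}.csv".format(part),
                  Body=b"".join([header] + rows))

def resplit(keys):
    part, rows, size, header = 1, [], 0, None
    for key in keys:
        body = s3.get_object(Bucket=SRC_BUCKET, Key=key)["Body"]
        lines = body.read().splitlines(keepends=True)
        if header is None and lines:
            header = lines[0]              # assume every file starts with a header row
        for line in lines[1:]:
            if size + len(line) > MAX_SIZE:
                flush(part, header, rows)
                part, rows, size = part + 1, [], 0
            rows.append(line)
            size += len(line)
    if rows:
        flush(part, header, rows)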
Amazon Athena can run a query across multiple objects in a given Amazon S3 path, as long as they all have the same format (e.g. the same columns in a CSV file).
It can store the result in a new External Table, with a location pointing to an S3 bucket, by using a CREATE TABLE AS command and a LOCATION parameter.
The size of the output files can be controlled by setting the number of output buckets (which is not the same as an S3 bucket).
See:
Bucketing vs partitioning - Amazon Athena
Set the number or size of files for a CTAS query in Amazon Athena
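As a rough illustration, a bucketed CTAS query could be issued from Python with boto3 along these lines; the database, table, column and S3 locations are placeholders, and bucket_count would need to be sized at roughly the total data size divided by 1GB:

import boto3

athena = boto3.client("athena")

# CTAS with bucketing: Athena writes the result as roughly bucket_count output files
ctas = """
CREATE TABLE merged_table
WITH (
    external_location = 's3://output-bucket/merged/',
    format = 'TEXTFILE',
    field_delimiter = ',',
    bucketed_by = ARRAY['attribute1'],
    bucket_count = 1024
)
AS SELECT * FROM source_table
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://athena-query-results-bucket/"},
)

Note that the output files will only be roughly equal in size, since rows are assigned to buckets by hashing the bucketed_by column.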
If your process includes an ETL (Extract, Transform, Load) post-processing step, you could use AWS Glue.
Please find here an example for Glue using S3 as a source.
If you'd like to use it with the Java SDK, the best starting points are:
the Glue GitHub repo
the AWS Java code sample catalog for Glue
Out of all of them, the tutorial to create a crawler (which you can find in the GitHub repo linked above) should match your case, as it crawls an S3 bucket and puts the result into the Glue Data Catalog for transformation.
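If you would rather set the crawler up from code than from the console, a minimal boto3 sketch could look like this (the crawler name, IAM role, database name and S3 path are assumptions):

import boto3

glue = boto3.client("glue")

# Create a crawler that catalogs the CSV files under the given S3 prefix
glue.create_crawler(
    Name="csv-files-crawler",                        # assumed name
    Role="arn:aws:iam::123456789012:role/GlueRole",  # assumed IAM role
    DatabaseName="csv_catalog",
    Targets={"S3Targets": [{"Path": "s3://input-bucket/csv/"}]},
)

# Run it; the resulting table can then be used by Glue ETL jobs or Athena
glue.start_crawler(Name="csv-files-crawler")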

Faster way to Copy S3 files

I am trying to copy around 50 million files, 15TB in total, from one S3 bucket to another bucket.
There are AWS CLI options to copy quickly, but in my case I want to apply a filter and a date range, so I decided to write code using boto3.
The source bucket input structure:
Folder1
    File1 - Date1
    File2 - Date1
Folder2
    File1 - Date2
    File2 - Date2
Folder3
    File1_Number1 - Date3
    File2_Number1 - Date3
Folder4
    File1_Number1 - Date2
    File2_Number1 - Date2
Folder5
    File1_Number2 - Date4
    File2_Number2 - Date4
So the goal is to copy, from each folder, all files whose names start with 'File1', filtered by a date range (Date2 to Date4), where the dates (Date1, Date2, Date3, Date4) are the files' last-modified dates.
The output is partitioned by a date key, and I am appending a UUID to keep every file name unique so it never overwrites an existing file. Files with the same modified date therefore end up in the same folder.
The target bucket would have this output:
Date2
    File1_UUID1
    File1_Number1_UUID2
Date3
    File1_Number1_UUID3
Date4
    File1_Number2_UUID4
I have written the code using the boto3 API and I run it on AWS Glue, but it only copies about 500 thousand files per day.
The code:
import datetime
import uuid

import boto3

# boto_config, extra_args and transfer_config are defined elsewhere (omitted here)
s3 = boto3.resource('s3', region_name='us-east-2', config=boto_config)
# source and target bucket names
src_bucket_name = 'staging1'
trg_bucket_name = 'staging2'
# source and target bucket pointers
s3_src_bucket = s3.Bucket(src_bucket_name)
print('Source Bucket Name : {0}'.format(s3_src_bucket.name))
s3_trg_bucket = s3.Bucket(trg_bucket_name)
print('Target Bucket Name : {0}'.format(s3_trg_bucket.name))
# target directory
trg_dir = 'api/requests'
# source objects
s3_src_bucket_objs = s3_src_bucket.objects.all()
# request file name prefix
file_prefix = 'File1'
# filter - start and end date
start_date = datetime.datetime.strptime("2019-01-01", "%Y-%m-%d").replace(tzinfo=None)
end_date = datetime.datetime.strptime("2020-06-15", "%Y-%m-%d").replace(tzinfo=None)
# iterate over each source object
for iterator_obj in s3_src_bucket_objs:
    file_path_key = iterator_obj.key
    date_key = iterator_obj.last_modified.replace(tzinfo=None)
    if start_date <= date_key <= end_date and file_prefix in file_path_key:
        # file name; it starts with the value of file_prefix
        uni_uuid = uuid.uuid4()
        src_file_name = '{}_{}'.format(file_path_key.split('/')[-1], uni_uuid)
        # construct target directory path
        trg_dir_path = '{0}/datekey={1}'.format(trg_dir, date_key.date())
        # source file
        src_file_ref = {
            'Bucket': src_bucket_name,
            'Key': file_path_key
        }
        # target file path
        trg_file_path = '{0}/{1}'.format(trg_dir_path, src_file_name)
        # copy source file to target
        trg_new_obj = s3_trg_bucket.Object(trg_file_path)
        trg_new_obj.copy(src_file_ref, ExtraArgs=extra_args, Config=transfer_config)
# happy ending
Is there any other way to make this faster, or an alternative way to copy files into such a target structure? Do you have any suggestions to improve the code? I am looking for a faster way to copy the files. Your input would be valuable. Thank you!
The most likely reason that you can only copy 500k objects per day (thus taking about 3-4 months to copy 50M objects, which is absolutely unreasonable) is that you're doing the operations sequentially.
The vast majority of the time your code is running is spent waiting for the S3 Copy Object request to be sent to S3, processed by S3 (i.e., copying the object), and then sending the response back to you. On average, this is taking around 160ms per object (500k/day == approx. 1 per 160ms), which is reasonable.
To dramatically improve the performance of your copy operation, you should simply parallelize it: make many threads run the copies concurrently.
Once the Copy commands are not the bottleneck anymore (i.e., after you make them run concurrently), you'll encounter another bottleneck: the List Objects requests. This request runs sequentially, and returns only up to 1k keys per page, so you'll end up having to send around 50k List Object requests sequentially with the straightforward, naive code (here, "naive" == list without any prefix or delimiter, wait for the response, and list again with the provided next continuation token to get the next page).
Two possible solutions for the ListObjects bottleneck:
If you know the structure of your bucket pretty well (i.e., the "names of the folders", statistics on the distribution of "files" within those "folders", etc), you could try to parallelize the ListObjects requests by making each thread list a given prefix. Note that this is not a general solution, and requires intimate knowledge of the structure of the bucket, and also usually only works well if the bucket's structure had been planned out originally to support this kind of operation.
Alternatively, you can ask S3 to generate an inventory of your bucket. You'll have to wait at most 1 day, but you'll end up with CSV files (or ORC, or Parquet) containing information about all the objects in your bucket.
Either way, once you have the list of objects, you can have your code read the inventory (e.g., from local storage such as your local disk if you can download and store the files, or even by just sending a series of ListObjects and GetObject requests to S3 to retrieve the inventory), and then spin up a bunch of worker threads and run the S3 Copy Object operation on the objects, after deciding which ones to copy and the new object keys (i.e., your logic).
In short:
grab a list of all the objects first;
then launch many workers to run the copies.
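A minimal sketch of that worker-pool approach with boto3 and a thread pool is below; the bucket names, worker count and key-renaming logic are placeholders, and the key list is assumed to have been collected beforehand (via paginated List Objects calls or an S3 Inventory report). The list is shuffled up front to spread the load across S3 partitions, for the reasons discussed next.

import random
import uuid
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")   # low-level clients are thread safe
SRC_BUCKET = "staging1"   # assumed source bucket
TRG_BUCKET = "staging2"   # assumed target bucket

def copy_one(key):
    # decide the new key here (your logic); a UUID suffix is just an example
    new_key = "api/requests/{}_{}".format(key.split("/")[-1], uuid.uuid4())
    # copy_object handles source objects up to 5GB; use the managed copy() for larger ones
    s3.copy_object(Bucket=TRG_BUCKET, Key=new_key,
                   CopySource={"Bucket": SRC_BUCKET, "Key": key})
    return key

def copy_all(keys, workers=200):
    random.shuffle(keys)  # spread requests across S3 partitions
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in pool.map(copy_one, keys):
            pass  # hook for logging progress or collecting failures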
One thing to watch out for here is if you launch an absurdly high number of workers and they all end up hitting the exact same partition of S3 for the copies. In such a scenario, you could end up getting some errors from S3. To reduce the likelihood of this happening, here are some things you can do:
instead of going sequentially over your list of objects, you could randomize it. E.g., load the inventory, put the items into a queue in a random order, and then have your workers consume from that queue. This will decrease the likelihood of overheating a single S3 partition.
keep your workers to not more than a few hundred (a single S3 partition should be able to easily keep up with many hundreds of requests per second).
Final note: there's another thing to consider which is whether or not the bucket may be modified during your copy operation. If it could be modified, then you'll need a strategy to deal with objects that might not be copied because they weren't listed, or with objects that were copied by your code but got deleted from the source.
You may be able to complete it using S3 Batch Operations.
You can use S3 Batch Operations to perform large-scale batch operations on Amazon S3 objects. S3 Batch Operations can execute a single operation on lists of Amazon S3 objects that you specify. A single job can perform the specified operation on billions of objects containing exabytes of data. Amazon S3 tracks progress, sends notifications, and stores a detailed completion report of all actions, providing a fully managed, auditable, serverless experience. You can use S3 Batch Operations through the AWS Management Console, AWS CLI, AWS SDKs, or REST API.
Use S3 Batch Operations to copy objects and set object tags or access control lists (ACLs). You can also initiate object restores from Amazon S3 Glacier or invoke an AWS Lambda function to perform custom actions using your objects. You can perform these operations on a custom list of objects, or you can use an Amazon S3 inventory report to make generating even the largest lists of objects easy. Amazon S3 Batch Operations use the same Amazon S3 APIs that you already use with Amazon S3, so you'll find the interface familiar.
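For reference, a copy job could be created from Python roughly as follows; the account ID, ARNs and manifest location are placeholders, and the manifest is assumed to be a CSV of bucket,key pairs (an S3 Inventory report works as well):

import boto3

s3control = boto3.client("s3control")

s3control.create_job(
    AccountId="111122223333",                           # placeholder account ID
    Operation={"S3PutObjectCopy": {
        "TargetResource": "arn:aws:s3:::staging2",      # destination bucket ARN
    }},
    Manifest={
        "Spec": {"Format": "S3BatchOperations_CSV_20180820",
                 "Fields": ["Bucket", "Key"]},
        "Location": {"ObjectArn": "arn:aws:s3:::manifest-bucket/manifest.csv",
                     "ETag": "manifest-object-etag"},
    },
    Report={"Bucket": "arn:aws:s3:::report-bucket",
            "Format": "Report_CSV_20180820",
            "Enabled": True,
            "ReportScope": "AllTasks"},
    Priority=10,
    RoleArn="arn:aws:iam::111122223333:role/batch-operations-role",
    ConfirmationRequired=False,
)

Note that a plain copy job cannot compute per-object names such as the UUID suffix above; for that you would invoke a Lambda function per object instead of using S3PutObjectCopy.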
It would be interesting if you could report back whether this ends up working with the amount of data that you have, and any issues you may have encountered along the way.
You can use Skyplane which is much faster and cheaper than aws s3 cp (up to 110x).
You can transfer data between buckets with the following command, after running skyplane init:
skyplane cp -r s3://<bucket-A>/ s3://<bucket-B>/

Doubts using Amazon S3 monthly calculator

I'm using Amazon S3 to store videos and some audio files (average size of 25 MB each), and users of my web and Android apps (so far) can access them with no problem. But I want to know how much I'll pay once I exceed the S3 free tier, so I checked the S3 monthly calculator.
I saw that there are 5 fields:
Storage: I put 3 GB because right now there are 130 files (videos and audio)
PUT/COPY/POST/LIST Requests: I put 15 because I'll manually upload around 10-15 files each month
GET/SELECT and Other Requests: I put 10,000 because a projection tells me the users will watch/listen to those files around 10,000 times monthly
Data Returned by S3 Select: I put 250 GB (10,000 x 25 MB)
Data Scanned by S3 Select: I don't know what to put because I don't need Amazon to scan or analyze those files.
Am I using that calculator in a proper way?
What do I need to put in "Data Scanned by S3 Select"?
Can I put only zero?
For audio and video, you can definitely specify 0 for S3 Select -- both data scanned and data returned.
S3 Select is an optional feature that only works with certain types of text files -- like CSV and JSON -- where you make specific requests for S3 to scan through the files and return matching values, rather than you downloading the entire file and filtering it yourself.
This would not be used with audio or video files.
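To illustrate what S3 Select actually does (and why it doesn't apply to media files), here is a hedged boto3 example against a hypothetical CSV object; the bucket, key and column names are made up:

import boto3

s3 = boto3.client("s3")

# Ask S3 to scan a CSV object server-side and return only the matching rows
resp = s3.select_object_content(
    Bucket="my-data-bucket",             # hypothetical bucket
    Key="reports/usage.csv",             # hypothetical CSV object
    ExpressionType="SQL",
    Expression="SELECT s.user_id FROM s3object s WHERE CAST(s.duration AS INT) > 60",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))

You are billed for the data S3 scans and the data it returns; with only audio and video in the bucket, both stay at zero.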
Also, don't overlook "Data transfer out." In addition to the "get" requests, you're billed for bandwidth when files are downloaded, so this needs to show the total size of all the downloads. This line item is data downloaded from S3 via the Internet.

s3 vs dynamoDB for gps data

I have the following situation that I try to find the best solution for.
A device writes its GPS coordinates to a CSV file every second and uploads the file to S3 every x minutes before starting a new CSV.
Later I want to be able to get the GPS data for a specific time period, e.g. 2016-11-11 8am until 2016-11-11 2pm.
Here are two solutions that I am currently considering:
Use a lambda function to automatically save the csv data to a dynamoDB record
Only save the metadata (csv gps timestamp-start, timestamp-end, s3Filename) in dynamoDB and then request the files directly from s3.
However both solutions seem to have a major drawback:
The GPS data uses about 40 bytes per record (one record per second). So if I use 10-minute chunks, this results in a 24 kB file. DynamoDB charges write capacity by item size (1 write capacity unit = 1 kB), so a single write would require 24 units. Reads (4 kB/unit) are even worse, since a user may request timeframes greater than 10 minutes: a request covering e.g. 6 hours (= 864 kB) would require a read capacity of 216. This is just too expensive considering multiple users.
When I read directly from S3 I face the browser limiting the number of concurrent requests. The 6 hour timespan for instance would cover 36 files. This might still be acceptable, considering a connection limit of 6. But a request for 24 hours (=144 files) would just take too long.
Any idea how to solve the problem?
best regards, Chris
You can avoid using DynamoDB altogether if the S3 keys contain the date in a reasonable format (e.g. ISO: deviceid_2016-11-27T160732). This allows you to find the correct files by listing the object keys: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html.
(If you can not control the naming, you could use a Lambda function to rename the files.)
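A small sketch of that idea, assuming keys shaped like deviceid_2016-11-11T08xxxx and listing one hour-prefix at a time with list_objects_v2 (the bucket name and key layout are assumptions):

import boto3

s3 = boto3.client("s3")
BUCKET = "gps-data"     # assumed bucket name
DEVICE = "device42"     # assumed device id

def keys_for_period(hour_prefixes):
    # hour_prefixes e.g. ["2016-11-11T08", "2016-11-11T09", ..., "2016-11-11T13"]
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for hour in hour_prefixes:
        for page in paginator.paginate(Bucket=BUCKET, Prefix="{}_{}".format(DEVICE, hour)):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys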
Number of requests is an issue, but you could try to put a CloudFront distribution in front of it and utilize HTTP/2, which allows the browser to request multiple files over the same connection.
Have you considered using AWS Firehose? Your data will be periodically shovelled into Redshift, which is like Postgres. You just pump a JSON-formatted or |-delimited record into an AWS Firehose endpoint and the rest is magic by the little AWS elves.
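For example, pushing one GPS record could look roughly like this; the delivery stream name and record layout are assumptions, and Firehose takes care of buffering and delivering the records to the configured destination (e.g. Redshift via S3):

import json
import boto3

firehose = boto3.client("firehose")

record = {"device": "device42", "ts": "2016-11-11T08:00:00Z", "lat": 52.52, "lon": 13.405}

firehose.put_record(
    DeliveryStreamName="gps-stream",   # assumed delivery stream name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)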

download, process, upload large number of s3 files with spark

I have a large number of files (~500k HDF5 files) inside an S3 bucket which I need to process and re-upload to another S3 bucket.
I am pretty new to such tasks, so I am not quite sure if my approach is correct here. I do the following:
I use boto to get the list of keys inside the bucket and parallelize it with Spark:
s3keys = bucket.list()
data = sc.parallelize(s3keys)
data = data.map(lambda x: download_process_upload(x))
result = data.collect()
where download_process_upload is a function which downloads the file specified by the key, does some processing on it and re-uploads it to another bucket (returning 1 if everything was successful, and 0 if there was an error).
So in the end I could do
success_rate = sum(result) / float(len(s3keys))
I have read that Spark map statements should be stateless, while my custom map function definitely is not stateless. It downloads the file to disk and then loads it into memory, etc.
So is this the proper way to do such a task?
I've successfully used your methodology to download and process data from S3. I have not tried to upload the data from within a map statement, but I see no reason why you wouldn't be able to read the file from S3, process it, and then upload it to a new location.
Also, you can save a few keystrokes and take the explicit lambda out of the map statement, like this: data = data.map(download_process_upload)
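For completeness, a hedged sketch of what such a map function might look like; the bucket names, the process() step and the temp-file handling are assumptions rather than the poster's actual code:

import os
import tempfile

import boto3

SRC_BUCKET = "source-bucket"   # assumed
DST_BUCKET = "target-bucket"   # assumed

def download_process_upload(key):
    # runs on the Spark executors, so create the boto3 client inside the function
    s3 = boto3.client("s3")
    local_path = os.path.join(tempfile.gettempdir(), os.path.basename(key))
    try:
        s3.download_file(SRC_BUCKET, key, local_path)
        processed_path = process(local_path)            # your processing step (assumed)
        s3.upload_file(processed_path, DST_BUCKET, key)
        return 1
    except Exception:
        return 0
    finally:
        if os.path.exists(local_path):
            os.remove(local_path)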