Recommended way to read an entire table (Lambda, DynamoDB/S3) - amazon-web-services

I'm new to AWS and am working on a serverless application where one function needs to read a large array of data. A single item will never be read on its own, but all the items will routinely be updated by a scheduled function.
What is your recommendation for the most efficient way of handling this scenario? My current implementation uses the Scan operation on a DynamoDB table, but with my limited experience I'm unsure whether this will be performant in production. Would it be better to store the data as a JSON file on S3? And if so, would it be as easy to update the values with a scheduled function?
Thanks for your time.
PS: to give an idea of the size of the database, there will be ~1500 items, each containing an array of up to ~100 strings

It depends on the size of each item. Here is how:
First of all, with either DynamoDB or S3 you pay for two things (in your case*):
1- Requests per month
2- Storage per month
If you have small items, the first of these can be up to 577 times cheaper if you read items from DynamoDB instead of S3.
How: S3 charges $0.01 per 1,000 requests, whereas DynamoDB's price covers 5.2 million reads (up to 4 KB each) per month. In addition, S3 charges $0.01 per GB for data retrieval, which must be added to that price. On the other hand, your writes into S3 are free, while you pay for each write into DynamoDB (writes are almost 4 times more expensive than reads).
However, if your items need many RCUs per read, S3 may turn out cheaper.
Regarding storage cost, S3 is cheaper, but again it depends on how big your data will be: you pay at most $0.023 per GB-month for S3, while DynamoDB costs $0.25 per GB-month, which is almost 10 times more expensive.
Conclusion:
If you have many requests and your items are small, it is easier and more straightforward to use DynamoDB, since you give up none of the query functionality that DynamoDB offers and that you clearly would not have with S3. Otherwise, consider storing the objects in S3 and keeping pointers to their locations in DynamoDB.
(*) The costs of tags in S3 or indexes in DynamoDB are additional factors to consider if you need them.
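For the Scan route mentioned in the question, note that a single Scan call returns at most 1 MB of data, so reading the whole table means following LastEvaluatedKey. A minimal Python sketch, assuming a boto3-style Table object (the table name in the usage comment is a placeholder):

```python
def scan_all(table):
    """Yield every item in a DynamoDB table, following Scan pagination.

    `table` is expected to expose a .scan(**kwargs) method that returns
    a dict with "Items" and, while more pages remain, "LastEvaluatedKey",
    as the boto3 Table resource does.
    """
    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        yield from page.get("Items", [])
        if "LastEvaluatedKey" not in page:
            return
        # Resume the scan where the previous 1 MB page stopped.
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

# Usage with boto3 (not run here; "schedules" is a placeholder name):
#   table = boto3.resource("dynamodb").Table("schedules")
#   items = list(scan_all(table))
```

At ~1500 items of ~100 short strings each, this is likely only a handful of pages, but the pagination loop keeps the code correct if the table grows.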

Here is how I would do it:
Schedule Updates:
Lambda (to handle schedule changes) --> DynamoDB --> DynamoDB Stream --> Lambda (read the existing object if there is one, apply the changes to all objects, and save the result as a single object in S3)
Read Schedule:
With Lambda, read the single object from S3 and serve all the schedules, or a single schedule, depending on the request. You can check whether the object has been modified before reading it the next time, so you don't need to hit S3 on every request and can serve from memory instead.
Scalability:
If you want to scale, you need to split the data into objects of a bounded size, so that you never load more than the Lambda function's memory limit (about 3 GB of process memory).
Hope this helps.
EDIT1:
When your serving Lambda cold-starts, load the object from S3 first; after that, you can check S3 for an updated object (after a certain time interval or a certain number of requests) using the last-modified date attribute.
You can also keep that data in Lambda memory and serve from there until the object is updated.
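The read-and-cache pattern described in EDIT1 could be sketched as follows. This is an illustration, not the author's code; the client is assumed to behave like a boto3 S3 client, and the bucket/key names in the usage comment are placeholders:

```python
class CachedObject:
    """Keep an S3 object in Lambda memory; re-download only when the
    LastModified timestamp reported by head_object changes."""

    def __init__(self, s3_client, bucket, key):
        self.s3 = s3_client
        self.bucket = bucket
        self.key = key
        self._body = None
        self._last_modified = None

    def get(self):
        # Cheap metadata check; only re-fetch the body when it changed.
        head = self.s3.head_object(Bucket=self.bucket, Key=self.key)
        if self._body is None or head["LastModified"] != self._last_modified:
            obj = self.s3.get_object(Bucket=self.bucket, Key=self.key)
            self._body = obj["Body"].read()
            self._last_modified = head["LastModified"]
        return self._body

# In a Lambda, a module-level instance survives across warm invocations:
#   cache = CachedObject(boto3.client("s3"), "my-bucket", "schedules.json")
```

In practice you would also throttle the head_object check (say, at most once every N seconds or requests), since a HEAD still costs a round trip to S3.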

Related

Amazon S3 Store Millions of Files

I am trying to find the most cost effective way of doing this, will appreciate any help:
I have 100s of millions of files. Each file is under 1 MB (usually 100 KB or so)
In total this is over 5 TB of data - as of now, and this will grow weekly
I cannot merge/concatenate the files together. The files must be stored as is
Query and download requirements are basic. Around 1 Million files to be selected and downloaded per month
I am not worried about S3 storage or Data Retrieval or Data Scan cost.
My question is: when I upload 100s of millions of files, does this count as one PUT request per file (i.e. one per object)? If so, just the cost of uploading the data will be massive. If I upload a directory containing a million files, is that one PUT request?
What if I zip the 100 million files on-premises, upload the zip, and use Lambda to unzip it? Would that count as one PUT request?
Any advice?
You say that you have "100s of millions of files", so I shall assume you have 400 million objects, making 40TB of storage. Please adjust accordingly. I have shown my calculations so that people can help identify my errors.
Initial upload
PUT requests in Amazon S3 are charged at $0.005 per 1,000 requests. Therefore, 400 million PUTs would cost $2000. (.005*400m/1000)
This cost cannot be avoided if you wish to create them all as individual objects.
Future uploads would be the same cost at $5 per million.
Storage
Standard storage costs $0.023 per GB, so storing 400 million 100KB objects would cost $920/month. (.023*400m*100/1m)
Storage costs can be reduced by using lower-cost Storage Classes.
Access
GET requests are $0.0004 per 1,000 requests, so downloading 1 million objects each month would cost 40c/month. (.0004*1m/1000)
If the data is being transferred to the Internet, Data Transfer costs of $0.09 per GB would apply. The Data Transfer cost of downloading 1 million 100KB objects would be $9/month. (.09*1m*100/1m)
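The arithmetic above can be re-checked in a few lines (prices as quoted in this answer; consult the current S3 pricing pages before relying on them):

```python
# Figures as quoted: 400 million objects of 100 KB each.
objects = 400_000_000
size_gb = objects * 100 / 1_000_000      # 100 KB per object -> 40,000 GB

put_cost = objects / 1000 * 0.005        # $0.005 per 1,000 PUT requests
storage_per_month = size_gb * 0.023      # $0.023 per GB-month, Standard
get_cost = 1_000_000 / 1000 * 0.0004     # 1M GET requests per month
transfer_cost = 1_000_000 * 100 / 1_000_000 * 0.09  # 100 GB out at $0.09/GB

print(f"PUT (one-time): ${put_cost}")
print(f"Storage: ${storage_per_month}/month")
print(f"GET: ${get_cost}/month, transfer: ${transfer_cost}/month")
```

These reproduce the $2000 upload, $920/month storage, 40c/month GET, and $9/month transfer figures above.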
Analysis
You seem to be most fearful of the initial cost of uploading 100s of millions of objects at a cost of $5 per million objects.
However, storage will also be significant, at a cost of $2.30/month per million objects ($920/month for 400m objects). That ongoing cost is likely to dwarf the cost of the initial uploads.
Some alternatives would be:
Store the data on-premises (disk storage is $100/4TB, so 400m files would require $1000 of disks, but you would want extra drives for redundancy), or
Store the data in a database: there are no 'PUT' costs for databases, but you would need to pay for running the database. This might work out at a lower cost, or
Combine the data in the files (which you say you do not wish to do), but in a way that can be easily split apart, for example by marking records with an identifier for easy extraction, or
Use a different storage service, such as Digital Ocean, who do not appear to have a 'PUT' cost.

Database suggestion for large unstructured datasets to integrate with elasticsearch

A scenario where we have millions of records saved in a database. Currently I am using DynamoDB for saving metadata (and also for write, update and delete operations on objects), S3 for storing files (e.g. images, whose associated metadata is stored in DynamoDB), and Elasticsearch for indexing and searching. But due to DynamoDB's 400 KB limit per item (a single object), it is not sufficient for the data to be saved. I thought about splitting an object across multiple items in DynamoDB itself, but that would be too complicated.
So I am considering replacing DynamoDB with better-suited storage:
AWS DocumentDb
S3 for saving metadata also, along with object files
Which of the two is the better option in your opinion, and why? Which is more cost-effective? (Ease of syncing with Elasticsearch also matters, but it is less of an issue, since it is possible with both.)
If you have any better suggestions than these two, please share those as well.
I would suggest looking at DocumentDB over Amazon S3 based on your use case for the following reasons:
Storing the data in S3 would cost $0.023 per GB-month for Standard and $0.0125 for Infrequent Access (whereas DocumentDB is $0.10 per GB-month); depending on your data size, this difference can add up. If you use IA, be aware that your retrieval costs can also add up.
With S3 you would not fetch filtered data directly; you would use either Athena or S3 Select to do the filtering. Depending on the size of the data being queried, that takes anywhere from a few seconds to possibly minutes (not the milliseconds you asked for).
Unstructured data storage in S3, and the querying technologies around it, are targeted more at a data lake used for analysis, whereas DocumentDB is geared toward performance within live applications (it is a MongoDB-compatible data store, after all).
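The storage-price gap quoted above works out to roughly a 4x-8x premium for DocumentDB. A quick check (figures as quoted in this answer, so verify them against current pricing):

```python
# Per-GB-month storage prices as quoted in the answer above.
s3_standard = 0.023
s3_ia = 0.0125
documentdb = 0.10

# DocumentDB costs roughly 4.3x S3 Standard and 8x S3 IA per GB-month,
# before counting IA retrieval fees or query-layer costs.
print(round(documentdb / s3_standard, 1))
print(round(documentdb / s3_ia, 1))
```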

Creating prefixes in S3 to paralellise reads and increase performance

I'm doing some research and I was reading this page
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
It says
Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/POST/DELETE and 5,500 GET requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket. It is simple to increase your read or write performance exponentially. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelise reads, you could scale your read performance to 55,000 read requests per second.
I'm not sure what the last bit means. My understanding is that for the filename 'Australia/NSW/Sydney', the prefix is 'Australia/NSW'. Correct?
How does creating 10 of these improve your read performance? Do you create for example Australia/NSW1/, Australia/NSW2/, Australia/NSW3/, and then map them to a load balancer somehow?
S3 is designed like a Hashtable/HashMap in Java. The prefix forms the hash for a hash-bucket, and the actual files are stored in groups within these buckets.
To find a particular file you have to compare all the files in its hash-bucket, whereas getting to the hash-bucket itself is instant (constant time).
Thus the more descriptive the keys, the more hash-buckets, and hence the fewer items in each bucket, which makes lookups faster.
Eg. a bucket with tourist attraction details for all countries in the world
Bucket1: placeName.jpg (all files in the bucket no prefix)
Bucket2: countryName/state/placeName.jpg
Now if you are looking for Sydney.info in Australia/NSW, the lookup will be faster in the second bucket.
No, S3 never connects to a load balancer. This article covers the topic, but here are the important highlights:
(...) keys in S3 are partitioned by prefix
(...)
Partitions are split either due to sustained high request rates, or because they contain a large number of keys (which would slow down lookups within the partition). There is overhead in moving keys into newly created partitions, but with request rates low and no special tricks, we can keep performance reasonably high even during partition split operations. This split operation happens dozens of times a day all over S3 and simply goes unnoticed from a user performance perspective. However, when request rates significantly increase on a single partition, partition splits become detrimental to request performance. How, then, do these heavier workloads work over time? Smart naming of the keys themselves!
So Australia/NSW/ could be read from one partition, while Australia/NSW1/ and Australia/NSW2/ might be read from two others. It doesn't have to work out that way, but prefixes still give you some control over how the data is partitioned, because you have a better understanding of what kind of reads you will be doing on it. You should aim to have reads spread evenly over the prefixes.
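One common way to get such an even spread is to derive a synthetic prefix from a hash of the key. A sketch (the shard-N naming scheme is made up for illustration):

```python
import hashlib

def sharded_key(key, n_shards=10):
    """Prepend one of n_shards deterministic prefixes so reads and
    writes spread evenly across S3 prefixes (and thus partitions)."""
    shard = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % n_shards
    return f"shard-{shard}/{key}"

# The same logical key always maps to the same prefix, so lookups stay
# deterministic while load spreads over up to n_shards prefixes.
```

The trade-off is that listing objects by their natural hierarchy (e.g. everything under Australia/NSW/) now requires listing every shard.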

Aws Dynamo db performance is slow

For my application I am using a free-tier AWS account. I have given the DynamoDB table 5 read capacity units and 5 write capacity units (I can't increase the capacity because I would be charged for it), and I am using the Scan operation. The API takes between 10 and 20 seconds to respond.
I have tried a parallel scan too, but the API takes just as long. Is there an alternative service in AWS?
It is not a good idea to use a Scan on a NoSQL database.
DynamoDB is optimized for Query requests. The data will come back very quickly, guaranteed (within the allocated capacity).
However, when using a Scan, the database must read every item in the table, and each item read consumes Read Capacity. So, if you have a table with 1000 items, a Query for one item would consume one unit, whereas a Scan would consume 1000 units.
So, either increase the Capacity Units (and cost) or, best of all, use a Query rather than a Scan. Indexes can also help.
You might need to re-think how you store your data if you always need to do a Scan.
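To put numbers on that: DynamoDB meters reads by data volume, one unit per 4 KB read (half a unit when eventually consistent). A back-of-the-envelope estimator, as a sketch rather than an exact billing formula:

```python
import math

def scan_read_units(item_count, avg_item_kb, eventually_consistent=True):
    """Rough estimate of the read capacity units a full Scan consumes:
    reads are metered in 4 KB units, and eventually consistent reads
    cost half as much."""
    units = math.ceil(item_count * avg_item_kb / 4)
    return units / 2 if eventually_consistent else units

# A Query that touches one 4 KB item consumes a single unit, while a
# Scan of the whole table consumes units for every item it reads.
```

With only 5 provisioned RCUs, as in the question, a Scan that needs hundreds of units will be throttled and spread over many seconds.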

Improving DynamoDB Write Operation

I am trying to call dynamodb write operation to write around 60k records.
I have tried setting 1000 write capacity units for provisioned write capacity, but my write operation is still taking a lot of time. Also, when I check the metrics I can see the consumed write capacity units are only around 10 per second.
My record size is definitely less than 1KB.
Is there a way we can speed up the write operation for dynamodb?
So here is what I figured out.
I changed my call to use batchWrite, and my consumed write capacity units increased significantly, up to 286.
The complete write operation also finished within a couple of minutes.
As mentioned in the answers above, using putItem to load a large amount of data has latency issues and limits your consumed capacity. It is always better to use batchWrite.
DynamoDB performance, like most databases is highly dependent on how it is used.
From your question, it is likely that you are using only a single DynamoDB partition. Each partition can support up to 1000 write capacity units and up to 10GB of data.
However, you also mention that your metrics show only 10 write units consumed per second. This is very low. Check all the metrics visible for the table in the AWS console (there is a tab per table in the DynamoDB console pages). Check for throttling and any errors, and check that the consumed capacity stays below the provisioned capacity on the charts.
It is possible that there is some other bottleneck in your process.
It looks like you can send more requests per second than you currently are. But if you send requests one at a time in a loop like this:
for item in items:
    table.put_item(Item=item)
you pay the round-trip latency on every single request.
You can use two tricks:
First, upload data from multiple threads/machines.
Second, you can use the BatchWriteItem method, which allows you to write up to 25 items in one request:
The BatchWriteItem operation puts or deletes multiple items in one or
more tables. A single call to BatchWriteItem can write up to 16 MB of
data, which can comprise as many as 25 put or delete requests.
Individual items to be written can be as large as 400 KB.
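The batching described above can be sketched as follows (the 25-item limit comes from the BatchWriteItem quote; the table name in the usage comment is a placeholder):

```python
def batches(items, size=25):
    """Split items into BatchWriteItem-sized groups (the API accepts
    at most 25 put/delete requests per call)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# boto3's batch_writer handles the 25-item grouping and the retrying
# of unprocessed items automatically (not run here):
#   table = boto3.resource("dynamodb").Table("my-table")
#   with table.batch_writer() as writer:
#       for item in items:
#           writer.put_item(Item=item)
```

Combining batching with several writer threads, as the answer suggests, is what lets the consumed capacity approach the provisioned 1000 units.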