I am trying to cache the data (for every user I have in a list) from an external URL request in an AWS Lambda function for up to 24 hours.
For instance:
cache.set(`response-${user}`, data, 24 * 60 * 60 * 1000)
I am looking for the best way to save the response; I read that Lambda has only limited storage for this.
Also, I read that I can use ElastiCache, but currently I am using memory-cache with Node.
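For context, a minimal sketch of what this looks like inside the handler, assuming the npm memory-cache package (whose API is put/get); fetchUserData is a placeholder for the external URL request:

// Sketch only: this in-memory cache lives only while the Lambda execution
// environment stays warm; a cold start begins with an empty cache.
const cache = require('memory-cache');

const ONE_DAY_MS = 24 * 60 * 60 * 1000; // TTL in milliseconds

// fetchUserData(user) is a placeholder for the external URL request.
async function getUserData(user, fetchUserData) {
  const key = `response-${user}`;
  let data = cache.get(key);            // null on a miss or after expiry
  if (data === null) {
    data = await fetchUserData(user);
    cache.put(key, data, ONE_DAY_MS);   // memory-cache's put(key, value, ttlMs)
  }
  return data;
}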
Thank you,
What is the fastest way of getting an exact count of rows for a 100GB CSV file stored on Amazon S3 without using Athena nor any Fargate or EC2 VM? I can't use Athena, because the CSV file isn't clean enough for it. I can't use Fargate tasks or EC2 VMs, because I need a purely serverless solution. I can't use third-party services like Snowflake (native AWS services only).
Also, 100GB is too large to fit within a Lambda Function's /tmp (limited to 10GB). I could try to run something like DuckDB (or any other streaming database engine) on a Lambda and scan the entire file with a SELECT COUNT(*) FROM "s3://myBucket/myFile.csv" query, but the Lambda is quite likely to time out, because its read bandwidth from S3 is 100MB/s at best, and it cannot run for more than 15 minutes (900s).
I know the approximate size of the file.
Note: I have an inaccurate estimate of the number of rows provided by AWS Glue Data Catalog's crawler, with an error margin of -50%/+100%. This could be used for some kind of iterative or dichotomous process, but I could not figure any out. For example, I tried adding an OFFSET with a value lower than but close to the number of rows to the aforementioned query, but the Lambda running DuckDB timed out. That was disappointing and somewhat surprising, because a query like SELECT * FROM "s3://myBucket/myFile.csv" LIMIT 10 OFFSET 10000000 worked well.
The fastest solution is probably to use SelectObjectContent with ScanRange to parallelize the request on chunks of 50MB or so.
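A minimal sketch of that approach with the AWS SDK for JavaScript v3, where the bucket, key, and 50 MB chunk size are placeholders; each ScanRange only counts records that start inside it, so summing contiguous, non-overlapping ranges should give the total. In practice you would cap the concurrency rather than firing every range at once.

const { S3Client, SelectObjectContentCommand, HeadObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({});
const CHUNK = 50 * 1024 * 1024; // ~50 MB per ScanRange (illustrative)

// Count the rows that start inside one byte range of the CSV.
async function countRange(bucket, key, start, end) {
  const { Payload } = await s3.send(new SelectObjectContentCommand({
    Bucket: bucket,
    Key: key,
    Expression: 'SELECT COUNT(*) FROM S3Object',
    ExpressionType: 'SQL',
    InputSerialization: { CSV: { FileHeaderInfo: 'NONE' } },
    OutputSerialization: { CSV: {} },
    ScanRange: { Start: start, End: end },
  }));
  let body = '';
  for await (const event of Payload) {
    if (event.Records) body += Buffer.from(event.Records.Payload).toString();
  }
  return parseInt(body, 10) || 0;
}

// Fan out over contiguous, non-overlapping ranges and sum the partial counts.
async function countRows(bucket, key) {
  const { ContentLength } = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
  const jobs = [];
  for (let start = 0; start < ContentLength; start += CHUNK) {
    jobs.push(countRange(bucket, key, start, Math.min(start + CHUNK, ContentLength) - 1));
  }
  return (await Promise.all(jobs)).reduce((a, b) => a + b, 0);
}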
Have you tried AWS S3 Select (https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html)? It lets you run SQL queries on S3 files. I use the service to get basic insight into any file on S3 (provided it can be queried).
I have a requirement to PUT about 20 records per second to an S3 bucket.
That comes to roughly 20 * 60 * 60 * 24 * 30 = 51,840,000 PUTs per month.
I do not need any transformations, but I would certainly want the PUTs to be GZIPped and partitioned by year/month/day/hour.
Option 1 - Just do PutObject calls on S3
Price comes to roughly $260 a month
I would have to do the GZIP compression, partitioning, etc. on the client side
Option 2 - Introduce a Firehose and wire it to S3
And let's say I buffer only once every 10 minutes; that is about 6 * 24 * 30 = 4,320 PUTs, and the price of S3 comes down to $21. With each record about 20 KB, Firehose pricing is about 1,000 GB * $0.029, which comes to about $30. So the total price is $51. Costs for data transfer / storage etc. are the same in both approaches, I believe.
Firehose provides GZIP, partitioning, and buffering for me OOTB (rough producer-side sketch below)
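For reference, the producer side of Option 2 could look roughly like this with the AWS SDK for JavaScript v3; the delivery stream name is hypothetical, and Firehose handles the GZIP, buffering, and S3 prefixes on delivery:

const { FirehoseClient, PutRecordBatchCommand } = require('@aws-sdk/client-firehose');

const firehose = new FirehoseClient({});

// Send up to 500 records (max 4 MB total) per batch call.
async function sendBatch(records) {
  const response = await firehose.send(new PutRecordBatchCommand({
    DeliveryStreamName: 'my-delivery-stream', // hypothetical name
    Records: records.map((r) => ({
      Data: Buffer.from(JSON.stringify(r) + '\n'),
    })),
  }));
  // PutRecordBatch can partially fail; check FailedPutCount and retry in real code.
  return response.FailedPutCount;
}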
It appears like Option 2 is the best for my use case. Am I missing something here?
Thanks for looking!
When I run this query on AWS Athena against a 63 GB Trades.csv file
SELECT * FROM Trades WHERE TraderID = 1234567
It takes 6.81 seconds, scanning 63.82 GB in the process (almost exactly the size of the Trades.csv file, so it is doing a full table scan).
What I'm shocked at is the unbelievable speed of data drawn from s3. It seems like AWS Athena's strategy is to use an unbelievably massive box with a ton of RAM and incredible s3 loading ability to get around the lack of indexing (although on a standard SQL DB you would have an index on TraderID and load millions of times less data).
But in my experiments I only managed to get these data reads from S3 (which are still impressive):
InstanceType     MB/s   Network card (Gigabits)
t2.2xlarge       113    low
t3.2xlarge       140    up to 5
c5n.2xlarge      160    up to 25
c6gn.16xlarge    230    100
(that's megabytes per second rather than megabits)
I'm using an internal VPC endpoint for S3 in eu-west-1. Anyone got any tricks/tips for getting S3 to load fast? Has anyone got over 1 GB/s read speeds from S3? Is this even possible?
"It seems like AWS Athena's strategy is to use an unbelievably massive box with a ton of RAM"
No, it's more like many small boxes, not a single massive box. Athena runs your query in parallel, on multiple servers at once. The exact details are not published anywhere as far as I am aware, but the documentation makes very clear that your queries run in parallel.
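Outside Athena, the same principle is the main lever: aggregate throughput comes from many parallel connections rather than one fast stream. A rough sketch with the AWS SDK for JavaScript v3 (bucket, key, and part size are placeholders), here just measuring how many bytes come back:

const { S3Client, GetObjectCommand, HeadObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({});
const PART = 64 * 1024 * 1024; // 64 MB per ranged GET (illustrative)

// Fetch one byte range on its own connection and return its length.
async function readPart(bucket, key, start, end) {
  const { Body } = await s3.send(new GetObjectCommand({
    Bucket: bucket,
    Key: key,
    Range: `bytes=${start}-${end}`,
  }));
  const bytes = await Body.transformToByteArray();
  return bytes.length;
}

// Issue the ranged GETs in parallel and sum the bytes read.
// Note: you may need to raise the SDK's maxSockets limit to get real parallelism.
async function readAll(bucket, key) {
  const { ContentLength } = await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key }));
  const jobs = [];
  for (let start = 0; start < ContentLength; start += PART) {
    jobs.push(readPart(bucket, key, start, Math.min(start + PART, ContentLength) - 1));
  }
  const sizes = await Promise.all(jobs);
  return sizes.reduce((a, b) => a + b, 0);
}

Whether you get past 1 GB/s then depends mostly on the instance's network bandwidth and how many connections you allow.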
I have the following situation that I try to find the best solution for.
A device writes its GPS coordinates every second to a csv file and uploads the file every x minutes to s3 before starting a new csv.
Later I want to be able to get the GPS data for a specific time period, e.g. 2016-11-11 8am until 2016-11-11 2pm.
Here are two solutions that I am currently considering:
1. Use a Lambda function to automatically save the CSV data to a DynamoDB record.
2. Only save the metadata (CSV GPS timestamp-start, timestamp-end, s3Filename) in DynamoDB and then request the files directly from S3.
However, both solutions seem to have a major drawback:
The GPS data uses about 40 bytes per record (i.e. per second). So if I use 10-minute chunks this will result in a 24 kB file. DynamoDB charges write capacity by item size (1 write capacity unit = 1 kB), so this would require 24 units for a single write. Reads (4 kB/unit) are even worse, since a user may request timeframes greater than 10 minutes. A request covering e.g. 6 hours (= 864 kB) would require a read capacity of 216. This will just be too expensive considering multiple users.
When I read directly from S3 I face the browser limiting the number of concurrent requests. The 6 hour timespan for instance would cover 36 files. This might still be acceptable, considering a connection limit of 6. But a request for 24 hours (=144 files) would just take too long.
Any idea how to solve the problem?
best regards, Chris
You can avoid using DynamoDB altogether if the S3 keys contain the date in a reasonable format (e.g. ISO: deviceid_2016-11-27T160732). This allows you to find the correct files by listing the object keys: https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html.
(If you can not control the naming, you could use a Lambda function to rename the files.)
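A rough sketch of that lookup with the AWS SDK for JavaScript v3, assuming keys like deviceid_2016-11-11T080000 so that lexicographic order matches chronological order (bucket and window values are placeholders):

const { S3Client, ListObjectsV2Command } = require('@aws-sdk/client-s3');

const s3 = new S3Client({});

// Find the files for a time window by key name alone (no DynamoDB).
// Depending on how each file spans time, you may also want the last file
// that starts just before the window.
async function keysForWindow(bucket, deviceId, fromIso, toIso) {
  const keys = [];
  let ContinuationToken;
  do {
    const page = await s3.send(new ListObjectsV2Command({
      Bucket: bucket,
      Prefix: `${deviceId}_`,               // all files for this device
      StartAfter: `${deviceId}_${fromIso}`, // lexicographic = chronological
      ContinuationToken,
    }));
    for (const obj of page.Contents ?? []) {
      if (obj.Key > `${deviceId}_${toIso}`) return keys; // past the window
      keys.push(obj.Key);
    }
    ContinuationToken = page.NextContinuationToken;
  } while (ContinuationToken);
  return keys;
}

// e.g. keysForWindow('my-bucket', 'deviceid', '2016-11-11T080000', '2016-11-11T140000')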
Number of requests is an issue, but you could try to put a CloudFront distribution in front of it and utilize HTTP/2, which allows the browser to request multiple files over the same connection.
Have you considered using AWS Firehose? Your data will be periodically shovelled into Redshift, which is like Postgres. You just pump a JSON-formatted or a pipe-delimited record into an AWS Firehose endpoint and the rest is magic by the little AWS elves.
I am learning about Amazon Web Services. I just want to know, roughly, the maximum number of connections that Amazon S3 can hold simultaneously without crashing...
Theoretically this is infinite. To achieve this, they use a partitioning scheme they explain here: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
Basically they partition your buckets on different servers based on the first few characters of the filename. If those are random, you scale indefinitely (they just take more characters to partition on). If you prepend all files with file_ or something (so S3 cannot partition the files correctly because all files have the same starting characters), the limit is about 300 GET / sec or 100 PUT/DELETE/POST per second.
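As a small illustration of what that key naming looks like in practice (the names and helper functions are made up):

const crypto = require('crypto');

// Randomized prefix: the first characters vary per object, so S3 can
// spread the keys across partitions.
function randomizedKey(filename) {
  const prefix = crypto.createHash('md5').update(filename).digest('hex').slice(0, 4);
  return `${prefix}/${filename}`; // e.g. "9f2c/file_000123.csv"
}

// Common prefix: every key starts with "file_", so they all land on the
// same partition and hit the per-partition request limits sooner.
function commonPrefixKey(filename) {
  return `file_${filename}`;
}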
See that page for an in-depth explanation.
According to the AWS documentation, you will receive HTTP 503 Slow Down above 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix.
The limit was increased in July 2018.
More information:
https://aws.amazon.com/en/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html