I have around 6,000 videos that need to be either deleted or moved to a specific project. However, I only get around 1,000 API calls before I am rate limited. Is there any way to send all videos to be deleted, or all videos to be moved, in a single API call?
Batch requests are only possible on a select number of endpoints; deleting videos is not one of those endpoints. You'll need to distribute those 6000 video deletion requests over a period of time to avoid rate limit bans.
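For what it's worth, a minimal sketch of that kind of client-side pacing might look like the following (Python; `delete_video` is a placeholder for whatever your video API's client exposes, and the 1,000-calls-per-hour budget is an assumption, not a documented limit):

```python
import time

# Hypothetical pacing loop: spread deletions out so they stay under the quota.
# Adjust CALLS_PER_WINDOW / WINDOW_SECONDS to the limits documented for your account.
CALLS_PER_WINDOW = 1000
WINDOW_SECONDS = 3600
DELAY = WINDOW_SECONDS / CALLS_PER_WINDOW  # ~3.6 s between calls

def delete_video(video_id):
    """Placeholder for the real API client call."""
    raise NotImplementedError

def delete_all(video_ids):
    for video_id in video_ids:
        delete_video(video_id)
        time.sleep(DELAY)  # stay comfortably under the rate limit
```

Run overnight, 6,000 deletions at that pace finish in a little over six hours without tripping the limit.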
I have a DynamoDB table for connections. The idea is that when a user logs into a website, a connection is made via WebSockets and the connection information is stored in this table.
Now we have a feature we want to release which shows the total number of users online. My thought was that I could add a new API endpoint which scans DynamoDB and returns the count of connections, but this would involve a DynamoDB scan every time the UI refreshes, which I'm guessing would be very expensive.
Another option I thought of was creating an API and a scheduled Lambda that calls this API once every 10 minutes and uploads the count to a file in S3; the API for the UI could then be pointed at the S3 file, which would be cheaper, but it would not be real time as the count could be up to 10 minutes out of date.
Alternatively, I tried to use the @connections endpoint to see if it returned the total connections via the WebSocket API, but I get a CORS error when doing so, and there's no way in AWS to set CORS on the provided HTTP @connections route.
I would be interested in some ideas on how to achieve this in the most efficient way :) My table of connections could have anywhere between 5k and 10k items.
Best thing here would be to use an item in the table to hold the live item count.
Add connection:
Add connection to DDB -> Stream -> Lambda -> Increment count item
Remove connection:
Remove connection from DDB -> Stream -> Lambda -> Decrement count item
This will allow you to efficiently obtain the number of live users on the system with a simple GetItem (a sketch of the stream-handler Lambda follows below).
You just need to be mindful that a single item can only sustain around 1,000 WCU per second, so if you are trying to update the item more than 1,000 times per second you will have to either:
aggregate the events in the Lambda, using a sliding window, or
artificially shard the count item n ways: count-item-0, count-item-1, etc.
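A minimal sketch of that stream-handler Lambda, assuming the table is keyed on connectionId and the counter lives in a sentinel item of the same table (both assumptions, adjust to your schema):

```python
import boto3

# Sketch of the DynamoDB Streams handler described above.
# Table name, counter key, and attribute name are assumptions.
dynamodb = boto3.client("dynamodb")
TABLE_NAME = "Connections"
COUNTER_KEY = {"connectionId": {"S": "LIVE_COUNT"}}

def handler(event, context):
    delta = 0
    for record in event["Records"]:
        if record["dynamodb"]["Keys"] == COUNTER_KEY:
            continue  # ignore events caused by updates to the counter item itself
        if record["eventName"] == "INSERT":
            delta += 1
        elif record["eventName"] == "REMOVE":
            delta -= 1

    if delta != 0:
        # ADD is atomic, so concurrent invocations cannot lose updates.
        dynamodb.update_item(
            TableName=TABLE_NAME,
            Key=COUNTER_KEY,
            UpdateExpression="ADD liveCount :d",
            ExpressionAttributeValues={":d": {"N": str(delta)}},
        )
```

The UI-facing endpoint then only needs a GetItem on the LIVE_COUNT item.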
The plan was to get data from AWS Data Exchange, move it to an S3 bucket, and then query it with Amazon Athena for a data API. Everything works, it just feels a bit slow.
No matter the dataset or the query, I can't get below 2 seconds of Athena response time, which is a lot for an API. I checked the best practices, but it seems those are also above 2 seconds.
So my question:
Is 2 seconds the minimum response time for Athena?
If so, then I have to switch to Postgres.
Athena is indeed not a low latency data store. You will very rarely see response times below one second, and often they will be considerably longer. In the general case Athena is not suitable as a backend for an API, but of course that depends on what kind of an API it is. If it's some kind of analytics service, perhaps users don't expect sub second response times? I have built APIs that use Athena that work really well, but those were services where response times in seconds were expected (and even considered fast), and I got help from the Athena team to tune our account to our workload.
To understand why Athena is "slow", we can dissect what happens when you submit a query to Athena:
1. Your code starts a query by using the StartQueryExecution API call
2. The Athena service receives the query, and puts it on a queue. If you're unlucky your query will sit in the queue for a while
3. When there is available capacity the Athena service takes your query from the queue and makes a query plan
4. The query plan requires loading table metadata from the Glue catalog, including the list of partitions, for all tables included in the query
5. Athena also lists all the locations on S3 it got from the tables and partitions to produce a full list of files that will be processed
6. The plan is then executed in parallel, and depending on its complexity, in multiple steps
7. The results of the parallel executions are combined and a result is serialized as CSV and written to S3
8. Meanwhile your code checks if the query has completed using the GetQueryExecution API call, until it gets a response that says that the execution has succeeded, failed, or been cancelled
9. If the execution succeeded your code uses the GetQueryResults API call to retrieve the first page of results
10. To respond to that API call, Athena reads the result CSV from S3, deserializes it, and serializes it as JSON for the API response
11. If there are more than 1000 rows the last steps will be repeated
A Presto expert could probably give more detail about steps 4-6, even though they are probably a bit modified in Athena's version of Presto. The details aren't very important for this discussion though.
If you run a query over a lot of data, tens of gigabytes or more, the total execution time will be dominated by step 6. If the result is also big, step 7 will be a factor.
If your data set is small, and/or involves thousands of files on S3, then steps 4-5 will instead dominate.
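To make that lifecycle concrete, here is a rough client-side sketch using boto3 (the database name, output location, and 100 ms polling interval are assumptions):

```python
import time
import boto3

# Rough sketch of the query lifecycle described above, from the client's side.
athena = boto3.client("athena")

def run_query(sql):
    start = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    query_id = start["QueryExecutionId"]

    # Poll until the execution reaches a terminal state (steps 2-8 above).
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)
        status = state["QueryExecution"]["Status"]["State"]
        if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(0.1)  # each missed poll adds latency on top of the RTT

    if status != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {status}")

    # First page of results, at most 1000 rows per page (steps 9-11 above).
    return athena.get_query_results(QueryExecutionId=query_id, MaxResults=1000)
```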
Here are some reasons why Athena queries can never be fast, even if they don't touch S3 (for example SELECT NOW()):
There will be at least three API calls before you get the response: a StartQueryExecution, a GetQueryExecution, and a GetQueryResults. Their round-trip time (RTT) alone would add up to more than 100 ms.
You will most likely have to call GetQueryExecution multiple times, and the delay between calls puts a bound on how quickly you can discover that the query has succeeded. For example, if you call it every 100 ms you will on average add half of 100 ms + RTT to the total time, because on average you'll miss the actual completion time by that much.
Athena writes the results to S3 before it marks the execution as succeeded, and since it produces a single CSV file this is not done in parallel. A big response takes time to write.
The GetQueryResults call must read the CSV from S3, parse it, and serialize it as JSON. Subsequent pages must skip ahead in the CSV, and may be even slower.
Athena is a multi-tenant service, all customers are competing for resources, and your queries will get queued when there aren't enough resources available.
If you want to know what affects the performance of your queries you can use the ListQueryExecutions API call to list recent query execution IDs (I think you can go back 90 days at the most), and then use GetQueryExecution to get query statistics (see the documentation for QueryExecution.Statistics for what each property means). With this information you can figure out if your slow queries are because of queueing, execution, or the overhead of making the API calls (if it's not the first two, it's likely the last).
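A rough sketch of that kind of analysis with boto3, using fields from the documented QueryExecution.Statistics structure:

```python
import boto3

# Sketch: pull recent query executions and break down where the time went.
athena = boto3.client("athena")

executions = athena.list_query_executions(MaxResults=50)
for query_id in executions["QueryExecutionIds"]:
    execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
    stats = execution.get("Statistics", {})
    total = stats.get("TotalExecutionTimeInMillis", 0)
    queued = stats.get("QueryQueueTimeInMillis", 0)
    engine = stats.get("EngineExecutionTimeInMillis", 0)
    print(f"{query_id}: total={total}ms queued={queued}ms engine={engine}ms")
```

If total is consistently much larger than queued + engine, the overhead is coming from the API round trips rather than Athena itself.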
There are some things you can do to cut some of the delays, but these tips are unlikely to get you down to sub second latencies:
If you query a lot of data use file formats that are optimized for that kind of thing, Parquet is almost always the answer – and also make sure your file sizes are optimal, around 100 MB.
Avoid lots of files, and avoid deep hierarchies. Ideally have just one or a few files per partition, and don't organize files in "subdirectories" (S3 prefixes with slashes) except for those corresponding to partitions.
Avoid running queries at the top of the hour, this is when everyone else's scheduled jobs run, there's significant contention for resources the first minutes of every hour.
Skip GetQueryResults and download the result CSV from S3 directly (a minimal sketch follows after this list). The GetQueryResults call is convenient if you want to know the data types of the columns, but if you already know them, or don't care, reading the data directly can save you some precious tens of milliseconds. If you need the column data types you can get the ….csv.metadata file that is written alongside the result CSV; it's undocumented Protobuf data, see here and here for more information.
Ask the Athena service team to tune your account. This might not be something you can get without higher tiers of support, I don't really know the politics of this and you need to start by talking to your account manager.
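For the tip about skipping GetQueryResults, a minimal sketch might look like this; it assumes you still poll GetQueryExecution to know when the query has finished, and reuses its OutputLocation field:

```python
import csv
import io
import boto3

athena = boto3.client("athena")
s3 = boto3.client("s3")

def fetch_result_csv(query_id):
    # OutputLocation looks like s3://bucket/prefix/<query_id>.csv
    execution = athena.get_query_execution(QueryExecutionId=query_id)
    location = execution["QueryExecution"]["ResultConfiguration"]["OutputLocation"]
    bucket, key = location[len("s3://"):].split("/", 1)

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = list(csv.reader(io.StringIO(body)))
    header, data = rows[0], rows[1:]
    return header, data
```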
I'm new to AWS and am working on a serverless application where one function needs to read a large array of data. A single item will never be read from the table on its own, but all the items will routinely be updated by a scheduled function.
What is your recommendation for the most efficient way of handling this scenario? My current implementation uses the scan operation on a DynamoDB table, but with my limited experience I'm unsure whether this will be performant in production. Would it be better to store the data as a JSON file on S3? And if so, would it be as easy to update the values with a scheduled function?
Thanks for your time.
PS: to give an idea of the size of the database, there will be ~1500 items, each containing an array of up to ~100 strings
It depends on the size of each item. Here is how:
First of all, to use DynamoDB or S3 you pay for two things (in your case*):
1- Requests per month
2- Storage per month
If you have small items, the first case will be up to 577 times cheaper if you read items from DynamoDB instead of S3.
How: $0.01 per 1,000 requests for S3, compared to 5.2 million reads (up to 4 KB each) per month for DynamoDB. Plus you should pay $0.01 per GB for data retrieval in S3, which should be added to that price. However, your writes into S3 will be free, while you have to pay for each write into DynamoDB (which is almost 4 times more expensive than reading).
However, if your items require many RCUs per read, S3 might be cheaper in that case.
And regarding the storage cost, S3 is cheaper, but again you should see how big your data will be in size, as you pay at most $0.023 per GB per month for S3 while you pay $0.25 per GB per month for DynamoDB, which is almost 10 times more expensive.
Conclusion:
If you have many requests and your items are small, it's easier and more straightforward to use DynamoDB, as you're not giving up any of the query functionality that DynamoDB offers, which you clearly won't have if you use S3. Otherwise, you can consider storing the objects in S3 and keeping a pointer to their locations in DynamoDB.
(*) The costs you pay for tags in S3 or indexes in DynamoDB are other factors to be considered in case you need to use them.
Here is how I would do it:
Schedule Updates:
Lambda (to handle schedule changes) --> DynamoDB --> DynamoDB Stream --> Lambda (read the object if it exists, apply the changes to all objects, and save them to a single object in S3)
Read Schedule:
With a Lambda, read the single object from S3 and serve all the schedules or a single schedule depending on the request. You can check whether the object has been modified before the next read, so you don't need to hit S3 every time and can serve from memory.
Scalability:
If you want to scale, you need to split the object into pieces of a bounded size so that you don't load an object that exceeds the Lambda memory limit (around 3 GB).
Hope this helps.
EDIT1:
When your serving Lambda cold starts, load the object from S3 first; after that, you can check S3 for an updated object (after a certain time interval or a certain number of requests) using the last-modified date attribute.
You can also keep that data in Lambda memory and serve from memory until the object is updated.
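A small sketch of that read path, caching the object across warm invocations and only checking S3 for freshness every so often (the bucket, key, and check interval are assumptions):

```python
import time
import boto3

# Sketch of the caching read path described above.
# The module-level cache survives warm Lambda invocations.
s3 = boto3.client("s3")
BUCKET = "my-schedule-bucket"
KEY = "schedules.json"
CHECK_INTERVAL = 60  # seconds between freshness checks

_cache = {"body": None, "last_modified": None, "checked_at": 0.0}

def get_schedules():
    now = time.time()
    if _cache["body"] is None or now - _cache["checked_at"] > CHECK_INTERVAL:
        head = s3.head_object(Bucket=BUCKET, Key=KEY)
        if head["LastModified"] != _cache["last_modified"]:
            obj = s3.get_object(Bucket=BUCKET, Key=KEY)
            _cache["body"] = obj["Body"].read()
            _cache["last_modified"] = obj["LastModified"]
        _cache["checked_at"] = now
    return _cache["body"]
```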
I would like to use the list method in the Reports API to periodically fetch Activities of all users of some applications (e.g. 'admin' and 'login') and keep a local copy of all that data (using watch and push notifications is not an option in my particular scenario).
The idea is defining small time windows (e.g. 60 seconds) and, at the end of each time window plus some small delay, using the 'list' method and setting the startTime and endTime accordingly, fetching all events logged during the already finished time window.
This way I would be able to have an almost-real-time list of events stored locally. However, I'm not sure what minimum delay should be used to ensure that the list method will be able to fetch all events. I'm assuming some delay is required here. Am I right? If so, is there a minimum delay that guarantees all events will be fetched?
In theory you wouldn't need a delay, but 10 seconds would probably be fine if you want to be sure. Another important thing to consider is the API quota; in this case the project would be limited to 5 queries per second. Here is the documentation on that: https://developers.google.com/admin-sdk/reports/v1/limits
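As a sketch of the windowed fetch described in the question, using google-api-python-client (the service-account file, delegated admin address, 60-second window, and 10-second safety delay are all assumptions):

```python
from datetime import datetime, timedelta, timezone

from google.oauth2 import service_account
from googleapiclient.discovery import build

# Sketch: fetch all activities for one application in a closed time window.
SCOPES = ["https://www.googleapis.com/auth/admin.reports.audit.readonly"]
credentials = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES
).with_subject("admin@example.com")  # domain-wide delegation to an admin user
service = build("admin", "reports_v1", credentials=credentials)

def fetch_window(application, start, end):
    """Fetch all activities for one application between start and end."""
    events, page_token = [], None
    while True:
        response = service.activities().list(
            userKey="all",
            applicationName=application,  # e.g. "admin" or "login"
            startTime=start.isoformat(),
            endTime=end.isoformat(),
            pageToken=page_token,
        ).execute()
        events.extend(response.get("items", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            return events

# Example: fetch the 60 s window that ended 10 seconds ago.
window_end = datetime.now(timezone.utc) - timedelta(seconds=10)
window_start = window_end - timedelta(seconds=60)
login_events = fetch_window("login", window_start, window_end)
```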
I know that DynamoDB is bound by a reads-and-writes-per-second limit, which I set. This means that when I delete items they are bound by the same limits. I want to be able to delete many records at some point in time without that having a negative effect on the other operations my app is doing.
So, for example, if I run a script to delete 10,000 items and it takes 1 minute, I don't want my database to stop serving the other users of my app. Is there a way to separate the two: one set of limits for background processes (admin) and another for the main process (the app)?
Note: the items will be deleted by date range, and I have no way of knowing how many items there are ahead of time.
App in ASP.NET C#
Thanks
The limits are set on the DynamoDB tables themselves, not on the client requests, so the answer is no.
One workaround is to write a script that:
increases the write capacity limit,
runs the delete queries in a throttled manner so that it consumes only the headroom between the old limit and the newly set one, and
decreases the limit back after the operations are completed.
You could then optimise the amount by which you scale up the writes/second to balance the time it takes for the script to complete.
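A sketch of that workaround in Python (the asker's app is C#, but the approach is the same): temporarily raise the provisioned write capacity, delete at a rate that only uses the extra headroom, then restore the original limit. The table name, key schema, and capacity numbers are assumptions.

```python
import time
import boto3

dynamodb = boto3.resource("dynamodb")
client = boto3.client("dynamodb")
TABLE_NAME = "MyTable"
ORIGINAL_RCU, ORIGINAL_WCU, BOOSTED_WCU = 100, 100, 400
EXTRA_WCU = BOOSTED_WCU - ORIGINAL_WCU  # headroom reserved for the cleanup

def set_capacity(wcu):
    client.update_table(
        TableName=TABLE_NAME,
        ProvisionedThroughput={"ReadCapacityUnits": ORIGINAL_RCU, "WriteCapacityUnits": wcu},
    )
    client.get_waiter("table_exists").wait(TableName=TABLE_NAME)  # wait until ACTIVE again

def delete_items(keys):
    table = dynamodb.Table(TABLE_NAME)
    set_capacity(BOOSTED_WCU)
    try:
        with table.batch_writer() as batch:
            for i, key in enumerate(keys, start=1):
                batch.delete_item(Key=key)  # e.g. {"pk": "...", "sk": "..."}
                if i % EXTRA_WCU == 0:
                    time.sleep(1)  # stay within the extra headroom per second
    finally:
        set_capacity(ORIGINAL_WCU)
```

Keep in mind that DynamoDB caps how often you can decrease a table's provisioned throughput per day, so this works best as an occasional maintenance job rather than something that runs continuously.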