I have an API built in Django that queries data from PostgreSQL and serializes it to JSON.
Right now the response content-length is 3 MB (I use gzip; the uncompressed size is 20 MB) and the response time is about 10-20 seconds.
Is this performance reasonable, or is there room for optimization?
If it takes 10-20 seconds, it sounds more like poor API design, however I don't know your use-case so I can't be sure.
Check whether you can apply any of Ken's recommendations. Here are some more ideas:
Pagination is a great way to split data - if your data can be split into logical parts, DRF has a number of ways to paginate querysets (see the sketch after this list).
Using the right gzip compression level could be a factor, given the size of your data. Read more about it here
See if you can use the ETag mechanism, where the server sends a 304 Not Modified if the API response has not changed between two consecutive calls. DRF does not support ETags out of the box AFAIK, so you will have to find a workaround.
Since you mentioned "real-time" data, I am assuming there's some concept of streaming temporal data. There are interesting ways to combine client-side caching with cursor-based pagination so that you only send the 'new' data, which you can explore. This only works if two preconditions are met: the changes to your API are incremental and temporal in nature.
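As a rough sketch of the pagination idea (not your actual code), here is what DRF cursor pagination could look like; the model, serializer, page size, and ordering field are assumptions you would replace with your own:

```python
# pagination.py -- a minimal sketch; "created" is an assumed, indexed timestamp field
from rest_framework.pagination import CursorPagination

class MeasurementCursorPagination(CursorPagination):
    page_size = 500        # keep each response well under the current 3 MB payload
    ordering = "-created"  # cursor pagination needs a stable, mostly-immutable ordering

# views.py -- Measurement and MeasurementSerializer are hypothetical names
from rest_framework import viewsets
from .models import Measurement
from .serializers import MeasurementSerializer

class MeasurementViewSet(viewsets.ReadOnlyModelViewSet):
    queryset = Measurement.objects.all()
    serializer_class = MeasurementSerializer
    pagination_class = MeasurementCursorPagination
```

Combined with gzip, each page then stays small enough to serialize and transfer in well under a second.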
I am a new Cloud Data Fusion user and have run into a problem I can't find a solution for.
I have a table in BQ with ~150 rows of latitude and longitude points. For each row, I want to pass the lat and lng into an HTTP POST request to get a result from the TravelTime API. Ultimately I want a table with all my original rows plus a column holding the response for each one.
Where I am stuck is that so far I have only been able to hard-code the body of the POST request into the HTTP Source plugin and successfully write the response to a file in GCS. However, I expect the rows to change over time, so I would like to dynamically generate the POST request body from my BQ data.
Is this possible with data fusion? Is this an advisable approach? Or is there a better way?
As @Albert Shau and @user3750486 agreed in the comments:
There is no out-of-the-box way to pass data from BQ rows dynamically in a POST HTTP request.
A possible workaround is to have an HTTP transform plugin that sits in the middle of the pipeline and can be configured to make calls based on the input data. Then you would have a BQ source followed by that plugin followed by the GCS sink. I think your best bet would be to write a custom transform.
This can be done by following the link that @Albert Shau provided, or by writing custom code using a GCP Cloud Function, as the OP did.
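For reference, a minimal sketch of the Cloud Function route could look like the following; the table, bucket, endpoint URL, and request body shape are all placeholders and assumptions to be replaced with the real TravelTime API contract:

```python
# main.py -- sketch of a Cloud Function that reads rows from BigQuery, calls an
# external API per row, and writes the collected responses to GCS.
# Table, bucket, endpoint, and payload shape are placeholders.
import json

import requests
from google.cloud import bigquery, storage

BQ_TABLE = "my-project.my_dataset.locations"      # assumed table with id/lat/lng columns
GCS_BUCKET = "my-results-bucket"                  # assumed destination bucket
API_URL = "https://api.traveltimeapp.com/v4/..."  # placeholder endpoint

def enrich_locations(request):
    bq = bigquery.Client()
    gcs = storage.Client()
    rows = bq.query(f"SELECT id, lat, lng FROM `{BQ_TABLE}`").result()

    results = []
    for row in rows:
        body = {"lat": row.lat, "lng": row.lng}  # build per the TravelTime API docs
        resp = requests.post(API_URL, json=body, timeout=30)
        results.append({"id": row.id, "response": resp.json()})

    blob = gcs.bucket(GCS_BUCKET).blob("traveltime_results.json")
    blob.upload_from_string(json.dumps(results), content_type="application/json")
    return f"processed {len(results)} rows"
```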
Posting the answer as community wiki for the benefit of the community that might encounter this use case in the future.
Feel free to edit this answer for additional information.
The plan was to get data from AWS Data Exchange, move it to an S3 bucket, and then query it with AWS Athena for a data API. Everything works, it just feels a bit slow.
No matter the dataset or the query, I can't get below 2 seconds in Athena response time, which is a lot for an API. I checked the best practices, but it seems those are also above 2 seconds.
So my question:
Is 2 seconds the minimal response time for Athena?
If so, then I have to switch to Postgres.
Athena is indeed not a low latency data store. You will very rarely see response times below one second, and often they will be considerably longer. In the general case Athena is not suitable as a backend for an API, but of course that depends on what kind of an API it is. If it's some kind of analytics service, perhaps users don't expect sub second response times? I have built APIs that use Athena that work really well, but those were services where response times in seconds were expected (and even considered fast), and I got help from the Athena team to tune our account to our workload.
To understand why Athena is "slow", we can dissect what happens when you submit a query to Athena:
1. Your code starts a query by using the StartQueryExecution API call
2. The Athena service receives the query, and puts it on a queue. If you're unlucky your query will sit in the queue for a while
3. When there is available capacity the Athena service takes your query from the queue and makes a query plan
4. The query plan requires loading table metadata from the Glue catalog, including the list of partitions, for all tables included in the query
5. Athena also lists all the locations on S3 it got from the tables and partitions to produce a full list of files that will be processed
6. The plan is then executed in parallel, and depending on its complexity, in multiple steps
7. The results of the parallel executions are combined and a result is serialized as CSV and written to S3
8. Meanwhile your code checks if the query has completed using the GetQueryExecution API call, until it gets a response that says that the execution has succeeded, failed, or been cancelled
9. If the execution succeeded your code uses the GetQueryResults API call to retrieve the first page of results
10. To respond to that API call, Athena reads the result CSV from S3, deserializes it, and serializes it as JSON for the API response
11. If there are more than 1000 rows the last steps will be repeated
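As a rough illustration of that client-side call pattern (steps 1, 8, and 9) with boto3; the query, database, and output location are placeholders:

```python
# Sketch of the StartQueryExecution / GetQueryExecution / GetQueryResults round trip.
import time

import boto3

athena = boto3.client("athena")

start = athena.start_query_execution(
    QueryString="SELECT NOW()",  # placeholder query
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = start["QueryExecutionId"]

# Poll until the query leaves the queue and finishes executing.
while True:
    execution = athena.get_query_execution(QueryExecutionId=query_id)
    state = execution["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(0.1)  # every extra poll adds latency, see the discussion below

if state == "SUCCEEDED":
    # Athena reads the result CSV from S3 and re-serializes it as JSON for this call.
    page = athena.get_query_results(QueryExecutionId=query_id, MaxResults=1000)
    rows = page["ResultSet"]["Rows"]
```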
A Presto expert could probably give more detail about steps 4-6, even though they are probably a bit modified in Athena's version of Presto. The details aren't very important for this discussion though.
If you run a query over a lot of data, tens of gigabytes or more, the total execution time will be dominated by step 6. If the result is also big, 7 will be a factor.
If your data set is small, and/or involves thousands of files on S3, then 4-5 will instead dominate.
Here are some reasons why Athena queries can never be fast, even if they wouldn't touch S3 (for example SELECT NOW()):
There will be at least three API calls before you get the response: StartQueryExecution, GetQueryExecution, and GetQueryResults. Their round-trip time (RTT) alone would add up to more than 100ms.
You will most likely have to call GetQueryExecution multiple times, and the delay between calls puts a lower bound on how quickly you can discover that the query has succeeded. E.g. if you poll every 100ms, you will on average add half of 100ms + RTT to the total time, because on average you'll miss the actual completion time by that much.
Athena writes the results to S3 before it marks the execution as succeeded, and since it produces a single CSV file this is not done in parallel. A big response takes time to write.
The GetQueryResults call must read the CSV from S3, parse it, and serialize it as JSON. Subsequent pages must skip ahead in the CSV, and may be even slower.
Athena is a multi-tenant service, all customers are competing for resources, and your queries will get queued when there aren't enough resources available.
If you want to know what affects the performance of your queries you can use the ListQueryExecutions API call to list recent query execution IDs (I think you can go back 90 days at the most), and then use GetQueryExecution to get query statistics (see the documentation for QueryExecution.Statistics for what each property means). With this information you can figure out if your slow queries are because of queueing, execution, or the overhead of making the API calls (if it's not the first two, it's likely the last).
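A rough sketch of that kind of inspection with boto3; the Statistics property names should be verified against the current QueryExecution documentation:

```python
# List recent query executions and break down where the time went.
import boto3

athena = boto3.client("athena")

execution_ids = athena.list_query_executions(MaxResults=50)["QueryExecutionIds"]
for query_id in execution_ids:
    execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
    stats = execution.get("Statistics", {})
    print(
        query_id,
        "queued_ms:", stats.get("QueryQueueTimeInMillis"),
        "engine_ms:", stats.get("EngineExecutionTimeInMillis"),
        "total_ms:", stats.get("TotalExecutionTimeInMillis"),
    )
```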
There are some things you can do to cut some of the delays, but these tips are unlikely to get you down to sub second latencies:
If you query a lot of data, use file formats that are optimized for that kind of thing; Parquet is almost always the answer. Also make sure your file sizes are optimal, around 100 MB.
Avoid lots of files, and avoid deep hierarchies. Ideally have just one or a few files per partition, and don't organize files in "subdirectories" (S3 prefixes with slashes) except for those corresponding to partitions.
Avoid running queries at the top of the hour, this is when everyone else's scheduled jobs run, there's significant contention for resources the first minutes of every hour.
Skip GetQueryResults, download the CSV from S3 directly (see the sketch after this list). The GetQueryResults call is convenient if you want to know the data types of the columns, but if you already know, or don't care, reading the data directly can save you some precious tens of milliseconds. If you need the column data types you can get the ….csv.metadata file that is written alongside the result CSV; it's undocumented Protobuf data, see here and here for more information.
Ask the Athena service team to tune your account. This might not be something you can get without higher tiers of support, I don't really know the politics of this and you need to start by talking to your account manager.
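Here is a minimal sketch of the direct-download tip mentioned above, assuming the query execution has already succeeded; it takes the OutputLocation from GetQueryExecution and reads the CSV straight from S3 instead of paging through GetQueryResults:

```python
# Read the result CSV directly from S3 instead of calling GetQueryResults.
import csv
import io

import boto3

athena = boto3.client("athena")
s3 = boto3.client("s3")

query_id = "your-query-execution-id"  # placeholder for a succeeded execution

execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
output_location = execution["ResultConfiguration"]["OutputLocation"]  # s3://bucket/key.csv

bucket, key = output_location.removeprefix("s3://").split("/", 1)
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
rows = list(csv.reader(io.StringIO(body)))  # first row is the header
```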
I am essentially trying to build a website where members can post blog entries, and I want to record unique and overall page views for the different posts, both in absolute terms and over different time-frames, e.g. the last 24 hours, the last week, etc.
My initial approach was to use the date as the primary key and the blogPostId as the secondary key; I could then add all the posts visited during a given day. If I then include the userIds as an attribute, I should be able to (a) get unique page views and (b) get overall page views (which might include duplicate visits by a specific user) for a given day. Finally, I would pull the primary key for, let's say, the last 7 days and extract the most popular post.
As far as I can tell this should work fine as long as there aren't too many entries; however, I'm sceptical whether this will scale. More specifically, if the number of blog posts increases a lot for a given interval, or if I want to find the all-time most viewed post, I'd essentially have to read the whole table.
Does anyone have an idea how I could implement this more efficiently?
DynamoDB will almost certainly work for you, and if you need an excuse to use it, by all means give it a try. If you get a ton of traffic it might end up being expensive.
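If you do go the DynamoDB route, a minimal sketch of the write path for the key design described in the question (date as partition key, blogPostId as sort key) could look like this with boto3; the table and attribute names are assumptions:

```python
# Record one visit under a "date / blogPostId" key, counting total views and
# collecting the viewers in a string set so unique views can be derived per day.
from datetime import datetime, timezone

import boto3

table = boto3.resource("dynamodb").Table("PageViews")  # assumed table name

def record_visit(blog_post_id: str, user_id: str) -> None:
    today = datetime.now(timezone.utc).date().isoformat()
    table.update_item(
        Key={"date": today, "blogPostId": blog_post_id},
        UpdateExpression="ADD totalViews :one, viewers :user",
        ExpressionAttributeValues={
            ":one": 1,
            ":user": {user_id},  # string set, one entry per distinct viewer
        },
    )
```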
Personally, I would consider using redis for what you are asking to do, and here is a pretty good/detailed question/answer on how you might implement it:
Scalable way of logging page request data from a PHP application?
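As a rough illustration of the redis approach from that answer, here is a sketch that counts per-day views per post with a sorted set and approximates unique viewers with a HyperLogLog; the key names are assumptions:

```python
# Per-day page view counters in redis: a sorted set keeps the view count per post
# (so "most viewed today" is a ZREVRANGE), a HyperLogLog per post approximates uniques.
from datetime import datetime, timezone

import redis

r = redis.Redis()

def record_view(blog_post_id: str, user_id: str) -> None:
    today = datetime.now(timezone.utc).date().isoformat()
    r.zincrby(f"views:{today}", 1, blog_post_id)         # overall views per post
    r.pfadd(f"uniques:{today}:{blog_post_id}", user_id)  # approximate unique viewers

def top_posts(day: str, limit: int = 10):
    return r.zrevrange(f"views:{day}", 0, limit - 1, withscores=True)
```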
DynamoDB can be used to iterate quickly and build this feature.
Nonetheless, this is really a job for Amazon Kinesis Data Streams, which lets you ingest the data and then manipulate it to your needs.
Be aware that Kinesis can become expensive if you are trying to be as frugal as possible.
But if you start receiving a lot of traffic, Kinesis will work as a queue and let you manipulate the data before ingesting it into DynamoDB (or another data store), which will be cheaper than sending all those write requests directly.
Another limitation you should check is that DynamoDB will only return up to 1 MB per Query operation.
Amazon recommends using Redshift to handle these kinds of operations, as it is more suited to performing aggregation and calculation across data warehouses.
I have a database where I need to remove documents with a regular interval. It is going to be in the ballpark of 100k documents per batch.
As of today that is achieved by first doing a request to a view which returns a list of the _id and _rev of each document to be removed.
I then do an HTTP DELETE request to hostname/database/{_id}?rev={_rev} for each of those documents.
To me that seems ridiculously inefficient, since I must do an HTTP request for each of those 100k documents.
Is there any more efficient way of deleting large amounts of documents in CouchDB? I have been looking for a command similar to the POST for creating new documents, where you send the data in the body of the HTTP request, or a way of doing this in a map/reduce, but so far no luck.
You can bundle all of the delete operations into one _bulk_docs update.
For 100k documents, you will notice that the operation takes a little time, but it is much faster than individual DELETE updates.
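A minimal sketch of such a bulk delete via the _bulk_docs endpoint, assuming you already have the _id/_rev pairs from the view; the host, database name, and document list are placeholders:

```python
# Delete many documents in one request by POSTing them to /{db}/_bulk_docs
# with "_deleted": true, instead of issuing one DELETE per document.
import requests

COUCH_URL = "http://localhost:5984/mydatabase"  # placeholder host and database

# (_id, _rev) pairs previously fetched from the view
docs_to_delete = [("doc-1", "1-abc"), ("doc-2", "4-def")]

payload = {
    "docs": [
        {"_id": doc_id, "_rev": rev, "_deleted": True}
        for doc_id, rev in docs_to_delete
    ]
}

resp = requests.post(f"{COUCH_URL}/_bulk_docs", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # per-document status, including any conflicts
```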
Looking for a solution to an issue caused by large data sets forcing Ember to lock up the browser while it tries to process the data.
For pagination, I'm using tchak's handy pagination mixin to paginate approximately 13,000+ objects being loaded from a backend API.
The Ember Data objects contain an ID, one text attribute and several number attributes.
The problem is it takes close to a minute before the browser finishes processing the data, rendering the browser unusable in the meantime. Firefox even goes as far as to issue a warning that a script is using up all browser resources and suggests that script be terminated.
I've written my own pagination mixin that requests objects by range, i.e. items 10-25, and it works generally well except for one serious limitation: sorting. To sort the data, I need to make additional requests to the backend and reload the objects even if some of them have already been loaded.
I would love to be able to load all of the content upfront to simplify the process of sorting without doing additional requests to the backend API. I'm looking for guidance on how to tackle this issue but I'm open to an entirely alternative approach.
If nothing else, is it possible to reduce the resource footprint Ember places on the browser as it tries to load all 13k objects into the ArrayController?
I'm using Ember 1.0.0-pre2 with the latest Ember Data (currently at Revision 10).
On the backend is Rails 3.2.8.
Update: I sidestepped the issue by loading data into an ArrayController property other than content. This brought the load times down from over a minute to only a few seconds. I then slice the requested number of items and load those into content. This works well for any number of items, at the cost of not being able to easily sort the data.
I suggest you take a look at Ember Table. The demo shows a table with 500 000 records and works very fast. Digging around the source code might help.
Can't you query a view from your db that handles the sorting? Pass in the sort conditions in the query string ?sortBy=name&sortAsc=true