Django database; how to download huge data in csv format

Django database; how to download huge data in csv format - django

I have setup my database in Django in which I have huge amount of data. The task is to download all the data at a time in csv format. The problem which I am facing here is when the data size (in number of table rows) is upto 2000, I am able to download it but when number of rows reaches to more than 5k, it throws an error, "Gateway timeout". How to handle such issue. There is no table indexing as of now.
Also, when there is 2K data available, it takes around 18sec to download. So how this can be optimized.

First, make sure the code that is generating the CSV is as optimized as possible.
Next, the gateway timeout is coming from your front end proxy; so simply increase the timeout there.
However, this is a temporary reprieve - as your data set grows, this timeout will be exhausted and you'll keep getting these errors.
The permanent solution is to trigger a separate process to generate the CSV in the background, and then download it once its finished. You can do this by using celery or rq which are both ways to queue tasks for execution (and then collect the results at a later time).

If you are currently using HttpResponse from django.http then you could try using StreamingHttpResponse instead.
Failing that, you could try querying the database directly. For example, if you use the MySql database backend, these answers might help you:
dump-a-mysql-database-to-a-plaintext-csv-backup-from-the-command-line
As for the speed of the transaction, you could experiment with other database backends. However, if you need to do this often enough for the speed to be a major issue then there may be something else in the larger process which should be optimized instead.

Related

AWS Athena too slow for an api?

The plan was to get data from aws data exchange, move it to an s3 bucket then query it by aws athena for a data api. Everything works, just feels a bit slow.
No matter the dataset nor the query I can't get below 2 second in athena response time. Which is a lot for an API. I checked the best practices but seems that those are also above 2 sec.
So my question:
Is 2 sec the minimal response time for athena?
If so then I have to switch to postgres.

Athena is indeed not a low latency data store. You will very rarely see response times below one second, and often they will be considerably longer. In the general case Athena is not suitable as a backend for an API, but of course that depends on what kind of an API it is. If it's some kind of analytics service, perhaps users don't expect sub second response times? I have built APIs that use Athena that work really well, but those were services where response times in seconds were expected (and even considered fast), and I got help from the Athena team to tune our account to our workload.
To understand why Athena is "slow", we can dissect what happens when you submit a query to Athena:
Your code starts a query by using the StartQueryExecution API call
The Athena service receives the query, and puts it on a queue. If you're unlucky your query will sit in the queue for a while
When there is available capacity the Athena service takes your query from the queue and makes a query plan
The query plan requires loading table metadata from the Glue catalog, including the list of partitions, for all tables included in the query
Athena also lists all the locations on S3 it got from the tables and partitions to produce a full list of files that will be processed
The plan is then executed in parallel, and depending on its complexity, in multiple steps
The results of the parallel executions are combined and a result is serialized as CSV and written to S3
Meanwhile your code checks if the query has completed using the GetQueryExecution API call, until it gets a response that says that the execution has succeeded, failed, or been cancelled
If the execution succeeded your code uses the GetQueryResults API call to retrieve the first page of results
To respond to that API call, Athena reads the result CSV from S3, deserializes it, and serializes it as JSON for the API response
If there are more than 1000 rows the last steps will be repeated
A Presto expert could probably give more detail about steps 4-6, even though they are probably a bit modified in Athena's version of Presto. The details aren't very important for this discussion though.
If you run a query over a lot of data, tens of gigabytes or more, the total execution time will be dominated by step 6. If the result is also big, 7 will be a factor.
If your data set is small, and/or involves thousands of files on S3, then 4-5 will instead dominate.
Here are some reasons why Athena queries can never be fast, even if they wouldn't touch S3 (for example SELECT NOW()):
There will at least be three API calls before you get the response, a StartQueryExecution, a GetQueryExecution, and a GetQueryResults, just their round trip time (RTT) would add up to more than 100ms.
You will most likely have to call GetQueryExecution multiple times, and the delay between calls will puts a bound on how quickly you can discover that the query has succeeded, e.g. if you call it every 100ms you will on average add half of 100ms + RTT to the total time because on average you'll miss the actual completion time by this much.
Athena will writes the results to S3 before it marks the execution as succeeded, and since it produces a single CSV file this is not done in parallel. A big response takes time to write.
The GetQueryResults must read the CSV from S3, parse it and serialize it as JSON. Subsequent pages must skip ahead in the CSV, and may be even slower.
Athena is a multi tenant service, all customers are competing for resources, and your queries will get queued when there aren't enough resources available.
If you want to know what affects the performance of your queries you can use the ListQueryExecutions API call to list recent query execution IDs (I think you can go back 90 days at the most), and then use GetQueryExecution to get query statistics (see the documentation for QueryExecution.Statistics for what each property means). With this information you can figure out if your slow queries are because of queueing, execution, or the overhead of making the API calls (if it's not the first two, it's likely the last).
There are some things you can do to cut some of the delays, but these tips are unlikely to get you down to sub second latencies:
If you query a lot of data use file formats that are optimized for that kind of thing, Parquet is almost always the answer – and also make sure your file sizes are optimal, around 100 MB.
Avoid lots of files, and avoid deep hierarchies. Ideally have just one or a few files per partition, and don't organize files in "subdirectories" (S3 prefixes with slashes) except for those corresponding to partitions.
Avoid running queries at the top of the hour, this is when everyone else's scheduled jobs run, there's significant contention for resources the first minutes of every hour.
Skip GetQueryExecution, download the CSV from S3 directly. The GetQueryExecution call is convenient if you want to know the data types of the columns, but if you already know, or don't care, reading the data directly can save you some precious tens of milliseconds. If you need the column data types you can get the ….csv.metadata file that is written alongside the result CSV, it's undocumented Protobuf data, see here and here for more information.
Ask the Athena service team to tune your account. This might not be something you can get without higher tiers of support, I don't really know the politics of this and you need to start by talking to your account manager.

BigQuery python client dropping some rows using Streaming API

I have around a million data items being inserted into BigQuery using streaming API (BigQuery Python Client's insert_row function), but there's some data loss, ~10,000 data items are lost while inserting. Is there a chance BigQuery might be dropping some of the data? Since there aren't any insertion errors (or any errors whatsoever for that matter).

I would recommend you to file a private Issue Tracker in order for the BigQuery Engineers to look into this. Make sure to provide affected project, the source of the data, the code that you're using to stream into BigQuery along with the client library version.

Access to pandas dataframe object between requests via session key

I have a pandas dataframe with a loose wrapper class around it that provides metadata for my django/DRF application. The application is basically a user friendly (non programmer) way to do some data analysis and validation. Between requests I want to be able to save the state of the dataframe so I can have a series of interactions with the data but it does not need to be saved in a database ( It only needs to survive as long as the browser session ). From this it was logical to check out django's session framework, but from what I've heard session data should be lightweight and the dataframe object does not json serialize.
Because I dont have a ton of users, and I want the app to feel like a desktop site, I was thinking of using the django cache as a way to keep the dataframe object in memory. So putting the data in the cache would go something like this
>>> from django.core.cache import caches
>>> cache1 = caches['default']
>>> cache1.set(request.session._get_session_key, dataframe_object)
and then the same except using get in the following requests to access.
Is this a good way to do handle this workflow or is there another system I should use to keep rather large data(5mb to 100mb) in memory?

If you are running your application on a modern server then 100mb is not a huge amount of memory. However if you have more than a couple dozen simultaneous users, each requiring 100mb of cache then this could add up to more memory than your server can handle. Your cache and server should be configured appropriately and you may want to limit the total number of cached dataframes in your python code.
Since it does appear that Django needs to serialize session data your choice is to either use sessions with PickleSerializer or to use the cache. According to documentation, PickleSerializer is not recommended for security reasons so your choice to use the cache is a good one.
The default cache backend in Django does not share entries across processes so you would get better memory and time efficiency by installing memcached and enabling the memcached.MemcachedCache backend.

Amazon DynamoDB and Provisioned Throughput

I am new to DynamoDB and I'm having trouble getting my head around the Provisioned Throughput.
From what I've read it seems you can use this to set the limit of reads and writes at one time. Have I got that wrong?
Basically what I want to do is store emails that are sent through my software. I currently store them in a MySQL database but the amount of data is very large which is why I am looking at DynamoDB. This data I do not need to access very often but when it's needed, I need to be able to access it.
Last month 142,925 emails were sent and each "row" (or email) in the MySQL table I store them in is around 2.5KB.
Sometimes 1 email is sent, other times there might be 3,000 at one time. There's no way of knowing when or how many will be sent at any given time.
Do you have any suggestions on what my Throughputs should be?
And if I did go over, am I correct in understanding that Amazon throttles it and adds them over time? Or does it just throw and error and that's the end of it?
Thanks so much for your help.

I'm using DynamoDB with the Java SDK. When you have an access burst, amazon first tries to keep up, even allowing a bit above the provisioned throughput, after that it start throttling and also throws exceptions. In our code we use this error to break the requests into smaller batches and sometimes force a sleep to cool it down a bit.
When dealing with your situation it really depends on the type of crunching you need to do "from time to time". How much time do you need to get all the data from the table? do you really need to get all of it? And ~100k a month doesn't sound too much for MySQL in my mind.. it all depends on the querying power you need.
Also note that in DynamoDB writes are more expensive than reads so maybe that alone signals that it is not the best fit for your write-intensive problem.

DynamoDb is very expensive, I would suggest not to store emails in dynamo db as each read and write cost good amount, Basically 1 read unit means 4KB data read per sec and 1 write unit means 1KB data write per sec, As you mentioned your each email is 2.5KB, hence while searching data(if you dont have proper key for searching the email) table will be completely scanned that will cost a very good amount as you will need several write units for reading.

Ember choking upon encountering large data sets

Looking for a solution to an issue caused by large data sets forcing Ember to lock up the browser while it tries to process the data.
For pagination, I'm using tchak's handy pagination mixin to paginate approximately 13,000+ objects being loaded from a backend API.
The Ember Data objects contain an ID, one text attribute and several number attributes.
The problem is it takes close to a minute before the browser finishes processing the data, rendering the browser unusable in the meantime. Firefox even goes as far as to issue a warning that a script is using up all browser resources and suggests that script be terminated.
I've written my own pagination mixin that requests objects by range, i.e. items 10-25, and it works generally well except for one serious limitation: sorting. To sort the data, I need to make additional requests to the backend and reload the objects even if some of them have already been loaded.
I would love to be able to load all of the content upfront to simplify the process of sorting without doing additional requests to the backend API. I'm looking for guidance on how to tackle this issue but I'm open to an entirely alternative approach.
If nothing else, is it possible to reduce the resource footprint Ember places on the browser as it tries to load all 13k objects into the ArrayController?
I'm using Ember 1.0.0-pre2 with the latest Ember Data (currently at Revision 10).
On the backend is Rails 3.2.8.
Update I sidestepped the issue by loading data into an ArrayController property other than content. This brought the load times down from over a minute to only a few seconds. I then slice the requested number of items and load those into content. This works well for any number of items, at the cost of not being able to easily sort the data.

I suggest you take a look at Ember Table. The demo shows a table with 500 000 records and works very fast. Digging around the source code might help.

Can't you query a view from your db that handles the sorting? Pass in the sort conditions in the query string ?sortBy=name&sortAsc=true

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js