I am trying to figure out how to pull a set of data and then manipulate it afterwards, instead of querying the database as and when I require data. I am building a dashboard and this is required for performance reasons.
To dumb down the example as much as possible, let's say you want to pull all users:
users = User.all
Then you want to filter that data, but you want to do it after all the data has been pulled
my_users = users.where(is_admin: true)
I know the previous line isn't the correct syntax for what I am looking for, but I want something to that effect that will query the preloaded data stored in the users variable.
Do I convert the ActiveRecord relation to an array and cycle through it, looking at each item? For example, should I do this:
users.to_a.select { |u| u.is_admin }
or something to that effect?
Is there an easier way to do this?
my_user = User.all.to_a.keep_if { |u| u.is_admin }
I think this is the best way of doing it.
To give some initial context: I am utilising PostgreSQL JSONFields.
I have the following attribute (field) in my User model:
from django.contrib.auth.models import AbstractUser
from django.contrib.postgres.fields import JSONField

class User(AbstractUser):
    ...
    # default should be the dict callable, not dict(), so instances don't share one object
    benefits = JSONField(default=dict)
    ...
Currently, I essentially serialize the benefits for each User on the front end with DRF:
benefits = UserBenefit.objects.filter(user=self)
serializer = UserBenefitSerializer(benefits, many=True)
Since the underlying benefits change little and slowly, I thought about "caching" the serialized JSON in the database every time there is a change, to improve on the performance of the UserBenefit.objects.filter(user=user) queryset. Reads would instead become user.benefits, hopefully lightening the DB load across 100K+ users.
1st Q:
Should I do this?
2nd Q:
Is there an efficient way to write the corresponding serializer.data <class 'rest_framework.utils.serializer_helpers.ReturnList'> to the JSON field?
I am currently using:
data = serializers.serialize("json", UserBenefit.objects.filter(user=self))
For your first question:
It's not a bad idea if you don't want to use caching alternatives.
If you have to query the database because of changes and you can't cache the whole request, then saving a pre-built JSON object can be a pretty good idea. This way you only retrieve the data, skip most of the serialization work, and eliminate the need to query a pivot table for the m2m data. Note, though, that you are adding a whole chunk of extra data to your rows, and unless you need it most of the time you will be fetching data you don't really need (you can mitigate this with values() on your querysets, but that means more code). Basically, you trade processing power for more bandwidth on the first query and more storage for the duplicated data. Pagination over the benefits will also be really hard to achieve if you ever need it.
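If you do go this route, here is a minimal sketch of keeping the field in sync with signals; the receiver name and module layout are my assumptions, not code from the question:

from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver

from .models import UserBenefit
from .serializers import UserBenefitSerializer

@receiver([post_save, post_delete], sender=UserBenefit)
def refresh_cached_benefits(sender, instance, **kwargs):
    user = instance.user
    serializer = UserBenefitSerializer(
        UserBenefit.objects.filter(user=user), many=True
    )
    # Store the already-serialized payload on the row so later reads skip
    # both the related-table query and the serialization step.
    user.benefits = serializer.data
    user.save(update_fields=["benefits"])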
Getting m2m relation data is usually pretty fast, depending on how much data you have in your database, but the ultimate way to get better performance is to cache the responses and reduce database hits as much as possible.
And as you probably hear a lot: you should test and benchmark to see which option really works best for your requirements and limitations. It's really hard to suggest an optimization method without knowing the whole scope and the current solution.
And for your second question:
I don't really get this part. If you are storing the JSON object in a field on the User model, then why do you need data = serializers.serialize("json", UserBenefit.objects.filter(user=self))?
You don't need it since the serializer can just return the JSON field data.
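For illustration, a minimal sketch of a User serializer that just exposes the stored JSON as-is; the serializer name and field list are assumptions:

from rest_framework import serializers

from .models import User

class UserSerializer(serializers.ModelSerializer):
    # benefits is already stored as JSON, so no per-benefit query or
    # nested serializer is needed here.
    class Meta:
        model = User
        fields = ["id", "username", "benefits"]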
What I have is a BigQuery table (>5 million rows).
I need to fetch this data in batches and process it inside AppEngine, python.
The only way I know to fetch from a table is to run a SELECT query on it and then iterate over the result using the tokens that fetch_data returns.
It looks like this:
import uuid  # used to give the query job a unique name

query = u"""\
SELECT url FROM %s
""" % (query_table)

# client, query_table, per_page, page_token and wait_for_job are defined elsewhere
query_job = client.run_async_query(str(uuid.uuid4()), query)
query_job.begin()
wait_for_job(query_job, 1)

query_results = query_job.results()
rows, total_rows, next_token = query_results.fetch_data(
    max_results=per_page, page_token=page_token)
This works on smaller tables, but on larger ones like mine it asks me to allow large requests and to specify a destination table. That makes no sense to me: just to fetch data from a table, do I have to copy it to another table first?
What you are running into is described in this documentation. In summary, apart from the limit on how much data can be fetched at a time, there is a point where your results become "large results": when they are more than 128MB compressed, as described here. When your results are classified as large, you can only store the result of the query in a BigQuery table.
Unfortunately I'm not sure there's a nice way to do what you want without reducing how many rows you are retrieving at once. What you'll likely need to do is explore the exporting data documentation for BigQuery.
You should use the tabledata.list API for fetching data directly from a table.
Using the startIndex or pageToken parameters together with maxResults, you can control the size of each page you fetch.
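For illustration, a rough sketch of paging through a table with the newer google-cloud-bigquery client, whose list_rows() wraps tabledata.list; the table name and page size are placeholders, and the method names assume a recent client version rather than the legacy one used in the question:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.my_table")  # placeholder name

# list_rows() calls tabledata.list under the hood, so there is no query job,
# no "large results" limit and no destination table involved.
rows = client.list_rows(table, page_size=10000)

for page in rows.pages:
    for row in page:
        url = row["url"]  # columns are accessible by field name
        # ... hand `url` off to your batch processing here ...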
I think this link is exactly what you need. As far as I understand it, you can't get a large query result directly, but you can read an entire table's data into your app no matter how big it is. That's why you need to put the large result into a table first, then read that table's data into your app and do whatever you want with it.
good luck :)
I'm listing queryset results and would like to add an option for choosing the order results are displayed.
I would like to pass the actual data from the database to another page for sorting.
I was able to achieve this by storing all the object ids in the Django session and recreating a new queryset based on the order criteria.
Is there any other way to achieve this goal?
Thanks.
Assuming you are currently displaying the data as a table, you could give a client-side JavaScript table sorter such as tablesorter a chance. There are lots of JavaScript table sorters out there.
I'm away from my development machine right now, but I think you could just pass the list of ids to a new queryset with pk__in=list_of_object_ids and then use the native order_by function.
For example:
objs = Object.objects.filter(pk__in=list_of_object_ids).order_by('value_to_order_by')
Anyway, that's what I would try first, though I'm sure there are better optimizations.
For example, instead of a list of object ids, you could pass a list of dictionaries, each holding the id together with the value you want to order by.
For example:
[{'obj_id':1,'obj_value':'foo'},{'obj_id':2,'obj_value':'foo'}]
Then use some lambda function to sort it, like here.
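To sketch that last step with the key names from the example above (a plain-Python sort, nothing Django-specific):

data = [
    {'obj_id': 1, 'obj_value': 'foo'},
    {'obj_id': 2, 'obj_value': 'bar'},
]

# Sort in Python by the value carried alongside each id.
ordered = sorted(data, key=lambda item: item['obj_value'])

# If you need model instances back in that order, reuse the pk__in queryset above:
# ordered_ids = [item['obj_id'] for item in ordered]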
I am trying to lower the amount of queries that my django app is using, but I am a little confused on how to do it.
I would like to get a query set with one hit to the database and then filter items from that set. I have tried a couple of things, but I always get queries for each set.
Let's say I want to get all names from my DB, but also separate out the people named Ted. Both the full name list and the Ted set will be used in the template.
This gives me two sets, one with all the names and one with just the Teds, but it also hits the database twice:
namelist = People.objects.all()
tedList = namelist.filter(name='ted')
Is there a way to filter the first set without hitting the data base again?
tedList = [person for person in namelist if person.name == 'ted']
This filters the initial queryset on the Python side, without another database hit.
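A small sketch of the idea, using the People model from the question; wrapping the queryset in list() just makes the single evaluation explicit:

# list() forces one query; everything below works on the rows already in memory.
namelist = list(People.objects.all())

tedList = [person for person in namelist if person.name == 'ted']
otherList = [person for person in namelist if person.name != 'ted']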
My Django application retrieves an RSS feed every day. I would like to persist the time the feed was last updated somewhere in the app. I'm only retrieving one feed, it will never grow to be multiple feeds. How can I persist the last updated time?
My ideas so far
Create a model and add a datetime field to it. This seems like overkill, as it adds another table to the database in which there will only ever be one row. Other than that, it's the most obvious and straightforward solution.
Create a settings object which just stores key/value mappings. The last updated date would just be a row in this table. This is essentially a generic version of the previous solution.
Use dbsettings/django-values, which allows you to store settings in the database. The last updated date would just be a 'setting'.
Any other ideas that I'm missing?
In spite of the fact that databases regularly store many rows in any given table, having a table with only one row is not especially costly, so long as you don't have (m)any indexes, which would waste space. In fact, most databases create single-row tables internally to implement some features, like the monotonic sequences used for generating primary keys. I encourage you to create a regular model for this.
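If you go this way, a minimal sketch of such a model might look like this (the model and field names are mine, not from the question):

from django.db import models
from django.utils import timezone


class FeedStatus(models.Model):
    # Single-row table holding the last time the RSS feed was fetched.
    last_updated = models.DateTimeField(default=timezone.now)


# After each fetch, upsert the one and only row:
# status, _ = FeedStatus.objects.get_or_create(pk=1)
# status.last_updated = timezone.now()
# status.save(update_fields=['last_updated'])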
RAM is volatile, thus not persistent: memcached is not what you asked for.
XML is not the right technology for storing a single value.
An RDBMS is not the right technology for storing a single value.
The Django cache framework will answer your question if CACHE_BACKEND is set to anything other than file://...
The filesystem is the right technology to "persist a single value".
In settings.py:
import os

RSS_FETCH_DATETIME_PATH = os.path.join(
    os.path.abspath(os.path.dirname(__file__)),
    'rss_fetch_datetime'
)
In your rss fetch script:
import time

from django.conf import settings

handler = open(settings.RSS_FETCH_DATETIME_PATH, 'w+')
handler.write(str(int(time.time())))  # write() expects a string
handler.close()
Wherever you need to read it:
from django.conf import settings

handler = open(settings.RSS_FETCH_DATETIME_PATH, 'r')
timestamp = int(handler.read())
handler.close()
But cron is the right tool if you want to "run a command every day", for example at 5AM:
0 5 * * * /path/to/manage.py runscript /path/to/retrieve/script
Of course, you can still write the last-update timestamp to a file at the end of the retrieval script and use it somewhere else, if that makes sense to you.
Concluding by quoting Ken Thompson:
One of my most productive days was throwing away 1000 lines of code.
One solution I've used in the past is Django's cache framework. You set a value to True with an expiration time of one day (in your case). If the value is not set, you fetch the feed; otherwise you don't do anything.
You can see my solution here: Importing your Flickr photos with Django
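A rough sketch of that check using Django's cache API; the cache key, timeout and function names are my choices:

from django.core.cache import cache

CACHE_KEY = 'rss_feed_fetched'
ONE_DAY = 60 * 60 * 24


def fetch_if_stale(fetch_feed):
    # Run fetch_feed() at most once per day, using the cache entry as the marker.
    if cache.get(CACHE_KEY):
        return
    fetch_feed()
    cache.set(CACHE_KEY, True, ONE_DAY)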
If you need it only for caching purposes, why not store it in the memcached?
On the other hand, if you use this data for other purposes (e.g. displaying it on a page, or using it in some calculation), then I would store it in a new model: in Django, all persistence is built on top of the database via models, and I would not try to use other "clever" solutions.
One thing I used to do when I was developing with PHP was to store the XML somewhere, but with a new tag inserted to hold the timestamp of the latest retrieval. It wasn't great, but it was quick and simple.
Keeping it simple would lead to the idea of just storing it in the file system ... why can't you do that? You could, for example, have a siteconfig module in one of your apps which held these sorts of data. This could load up data from a specific file, which could be text, JSON, ConfigParser, pickle or any suitable format. Just import siteconfig somewhere, and it can load the data and make it available to the other modules in your site. You could easily extend this to hold a dict-like object with a number of settings (e.g., if you ever have multiple feeds, but don't want to have a model just for 2-3 rows, you could easily hold the last-retrieved time for each feed in a dict keyed by feed URL).
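A bare-bones sketch of such a siteconfig module, using JSON as the on-disk format; the path and key names are arbitrary:

# siteconfig.py -- file-backed key/value store for small bits of site state.
import json
import os

_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'siteconfig.json')


def load():
    if not os.path.exists(_PATH):
        return {}
    with open(_PATH) as fh:
        return json.load(fh)


def save(data):
    with open(_PATH, 'w') as fh:
        json.dump(data, fh)


# Usage: record when a feed was last retrieved, keyed by its URL.
# import time, siteconfig
# config = siteconfig.load()
# config['https://example.com/feed'] = time.time()
# siteconfig.save(config)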
Create a session key that persists forever, and update the feed timestamp every time you access it.