In Django, why does queryset's iterator() method reduce memory usage?

In Django, I can't understand why queryset's iterator() method reduces memory usage.
The Django documentation says:
A QuerySet typically caches its results internally so that repeated evaluations do not result in additional queries. In contrast, iterator() will read results directly, without doing any caching at the QuerySet level (internally, the default iterator calls iterator() and caches the return value). For a QuerySet which returns a large number of objects that you only need to access once, this can result in better performance and a significant reduction in memory.
To my knowledge, however, whether or not iterator() is used, once the queryset is evaluated the queried rows are fetched from the database and loaded into memory. Isn't memory proportional to the number of rows used either way, whether or not the queryset does caching? Then what is the benefit of using iterator(), assuming the queryset is evaluated only once?
Is it because the raw data fetched from the database and the data cached after instantiating model objects are stored in separate memory spaces? If so, I think I can understand how skipping the cache via iterator() saves memory.

When you use iterator(), Django reads rows from a database cursor in chunks as you consume them. If you use something like all() and iterate over that in Python, all of the records are cached in memory even though you only need them one at a time.
So by using iterator() you stream results through the cursor, fetching rows as you go, and the remaining records are not all held in memory at once.
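The difference can be illustrated outside Django with plain Python: a list materializes every row at once, while a generator yields rows one at a time, which is analogous to what iterator() does at the QuerySet level. The `fetch_rows` function here is a hypothetical stand-in for a database cursor, not Django API.

```python
def fetch_rows(n):
    """Hypothetical stand-in for a DB cursor: yields rows one at a time."""
    for i in range(n):
        yield {"id": i}

# Like qs.iterator(): rows are produced lazily, one per loop step,
# and nothing retains a reference to earlier rows.
lazy = fetch_rows(1_000_000)
first = next(lazy)

# Like iterating a cached QuerySet: list() materializes every row
# in memory before the loop even starts.
cached = list(fetch_rows(1_000))
```

The generator version holds one row at a time regardless of the total row count, which is where the memory saving comes from.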

Related

QAbstractItemModel - Should QModelIndex objects be cached when created?

When subclassing a QAbstractItemModel and re-implementing the index() method, I had been simply returning a new index each time with createIndex(). But I noticed that the index() method gets called thousands of times when the model is used in conjunction with a view, for all sorts of paint events and whatnot.
Should I instead be caching the QModelIndex object after I generate it the first time in index(), and then be returning the cached index when index() is subsequently called on the same row/col? It's not mentioned in the documentation, and it seems that indexes themselves can become invalidated under certain circumstances, so I am unsure of what to do here.
In my particular case I'm working with PySide6, but I imagine this could apply to any implementation of the Qt framework.
If your model supports inserting or removing rows, your indexes are not persistent. You can still use a cache, but you must invalidate it every time the model's shape changes.
If the index-creation logic is complicated, there may be a benefit to caching.
A QModelIndex is about the size of four ints (row, column, a pointer/id, and a pointer to the model), so it's lightweight; creating one and moving it around is cheap.
Either way, there's only one way to be sure: try caching and measure the performance gain.

Does iterator provide improvement when used together with values_list?

Recently I saw code that used iterator() and values_list() together. Does it make sense to use them both? Will it improve speed or memory usage?
Sample code:
Customer.objects.values_list("pk", flat=True).iterator()
values_list() returns a QuerySet that yields tuples (or, with flat=True, plain values) instead of model instances [docs].
QuerySets are lazy – the act of creating a QuerySet doesn’t involve any database activity. You can stack filters together all day long, and Django won’t actually run the query until the QuerySet is evaluated [docs].
See the documentation for when a QuerySet is evaluated.
In fact, creating a queryset doesn't make Django hit the database; the query only runs when the queryset is evaluated (by something like iterator()).
Since iterator() reads the results without caching, it gives better performance in situations where you need to access the objects only once, and this holds regardless of the kind of queryset, so it combines fine with values_list().
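The two savings compound, which can be sketched in plain Python (the row shapes and names here are invented for illustration, not Django API): values_list() shrinks each row from a full object to a single value, and iterator() avoids retaining the rows.

```python
# Hypothetical full "model instances": dicts with several fields each.
def all_rows(n):
    for i in range(n):
        yield {"pk": i, "name": f"user{i}", "email": f"u{i}@example.com"}

# Like values_list("pk", flat=True): each row shrinks to one value...
pks = (row["pk"] for row in all_rows(1000))

# ...and like .iterator(): the values are consumed without being
# cached, so at no point is the whole result set held in memory.
total = sum(pks)
```

So the sample `Customer.objects.values_list("pk", flat=True).iterator()` saves memory twice over: smaller rows, and no QuerySet-level cache of them.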

In Django does .get() have better performance than .first()?

The Django implementation of .first() seems to get all items into a list and then return the first one.
Is .get() more performant? Surely the database can just return one item; the implementation of .first() seems suboptimal.
I see no reason to think so, although I have not actually profiled it.
Slicing on Django querysets is implemented by modifying the query to use LIMIT and OFFSET terms to retrieve only the necessary number of elements. This means the first() implementation only fetches a single element from the database.
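The effect of that LIMIT can be demonstrated directly with the standard-library sqlite3 module (the table and data here are invented for illustration): a LIMIT 1 query returns a single row to the client no matter how many rows the table holds, which is why the slicing-based first() is not fetching everything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO item (name) VALUES (?)",
                 [(f"item{i}",) for i in range(1000)])

# Django's qs.first() is roughly qs[:1], which compiles to a query
# with LIMIT 1: the database returns one row, not all 1000.
row = conn.execute(
    "SELECT id, name FROM item ORDER BY id LIMIT 1"
).fetchone()
```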

Selecting an appropriate STL container for logging data

I need a logging and filtering mechanism in my client-server application, where a client may request log data based on certain parameters.
The log will have MAC id, date and time, command type, and direction as fields.
The server can filter log data on these parameters as well.
The size of the log is 10 MB; after that, the log wraps around and overwrites messages from the beginning.
My approach is to log data to a file and also keep it in an STL container in memory, so that when a client requests data the server can filter the log data on any criterion.
So the process is: the server first sorts the vector<> on the relevant criterion and then filters it using binary search.
I am planning to use vector as the STL container for the in-memory log data.
I am a bit confused about whether a vector is appropriate in this situation, since the data in the vector can grow to 10 MB.
My question is whether a vector is good enough for this case or not.
I'd go with a deque (double-ended queue). It's like a vector, but you can add/remove elements efficiently at both ends.
First, I would use a logging library, since there are many and I assure you they will do a better job (log4cxx, for example). If you insist on doing this yourself, a vector is an appropriate container, but you will have to sort the data manually based on user requests. Another idea is to use SQLite and let it manage storing, sorting, and filtering your data.
The actual answer will depend a lot on the usage pattern and interface. If you are using a graphical UI, chances are there is already a widget that implements this feature to some extent (the ability to sort by different columns, and even filter). If you really want to implement this outside the UI, it will depend on the usage pattern: will the user want one particular view more than the others? Is only filtering needed, or also sorting?
If there is one view of the data that will be used in most cases, and you only need to show a different order occasionally, I would keep a std::vector or std::deque of the elements and filter with remove_copy_if when needed. If a different sort order is required, I would copy and sort the copy, to avoid having to re-sort back to time order before appending new elements to the log. Beware that if the application keeps pushing data, you will need to insert the new elements into the copy in place (or provide a fixed snapshot and rerun the operation periodically).
If no particular view occurs much more often than the rest, or if you don't want to go through the pain of implementing the above, take a look at Boost multi-index containers. They keep synchronized views of the same data under different criteria. That will probably be the most efficient in this last case, and even if it is less efficient in the case of one dominating view, it may make things simpler, so it could still be worth it.

Caching data from MySQL DB - technique and appropriate STL container?

I am designing a data caching system that could have a very large amount of records held at a time, and I need to know what stl container to use and how to use it. The application is that I have an extremely large DB of records for users - when they log in to my system I want to pull their record and cache some data such as username and several important properties. As they interact with the system, I update and access their properties. Several properties are very volatile and I'm doing this to avoid "banging" on the DB with many transactions. Also, I rarely need to be using the database for sorting or anything - I'm using this just like a glorified binary save file (which is why I am happy to cache records to memory..); a more important goal for me is to be able to scale to huge numbers of users.
When the user logs out, server shuts down, or periodically in round-robin fashion (just in case..), I want to write their data back to the DB.
The server keeps its own:
vector <UserData *> loggedInUsers;
With UserData keeping things like username (string) and other properties from the DB, as well as other temporary data like network handles.
My first question is: if I need to find a specific user in this vector, what's the fastest way to do it, and is there a different STL container that can do this faster? What I do now is create an iterator at loggedInUsers.begin() and iterate to .end(), checking (*iter)->username == "foo" and returning when it's found. If the username is at the end of the vector, or the vector holds 5000 users, this is a significant delay.
My second question is: how can I schedule this data to be written back to the DB round-robin? I can call a function every time I'm ready to write a few records to the DB, but I can't hold an iterator into the vector, because it may become invalid. What I'd like is a rotating queue where I can take the head, persist it to the DB, and then rotate it to the back. That seems like a lot of overhead; what type could I use to do this better?
My third question is: I'm using MySQL Server and the libmysqlclient Connector/C. Is there any kind of built-in caching that could solve this problem "for free", or a different technique altogether? I'm open to suggestions.
A1. You're better off with a map, which is a tree that does the lookup for you. Test with a map and (assuming you have the right compiler) a hash_map, which does the same thing but with a different lookup mechanism. They have different performance characteristics for different kinds of data-storage workloads.
A2. A list would probably be better for you: push to the front, pull off the end. (A deque could also be used, but you cannot keep an iterator across an erase from a deque, whereas with a list you can.) push_back and pop_front (or vice versa) let you keep a rolling queue of cached data.
A3. You could try SQLite, a mini-database designed for simple application-level storage needs. It can work entirely in memory too.
You don't say what your system does or how it's accessed, but this kind of technique probably won't scale well (eventually you'll run out of memory, and whatever you use to find information won't be as efficient as a database), and it won't necessarily handle concurrent users properly unless you make sure the data can be shared safely between them.
That said, you might be better off using a map (http://www.cplusplus.com/reference/stl/map/) with the username as the key.
In terms of writing back to the database, why not keep a separate structure (a queue) that you clear every time you flush it to the database? As long as you're storing pointers it won't use much more memory. Which brings me to: rather than raw pointers, take a look at smart pointers (for example Boost's shared_ptr), which let you pass records around without worrying about ownership.