If I had, say, 70,000 objects and wanted to do statistics on them, but the statistics didn't need to be 100% accurate, what is the best way to pull out, say, 1,000 objects, do statistics on those objects and then just scale the result to approximate the statistics for all 70,000? I can't quite seem to find an efficient way to get 1,000 random objects from a queryset.
You can get random objects with:
objs = list(MyModel.objects.order_by("?")[:1000])
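But the underlying random ORDER BY that gets generated in the SQL isn't particularly efficient, since the database generally has to order the whole table before it can return the first 1,000 rows.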
I know this isn't the answer you're looking for, but sometimes when doing heavy reporting you need more than Django's ORM can offer. I worked with a guy who used Django for his main application, but for some reporting tools (and a JSON service) he used Flask and SQLAlchemy and was able to accomplish a whole lot more and without having to write SQL.
There is a great post on the issue of getting random rows from the database (there are a few good points in the comments too).
The only thing I would check is fetching the objects with the "in_bulk" method, because that may be even faster.
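One way to avoid the random ORDER BY is to pick the primary keys in Python and fetch only those rows. The following is a minimal, framework-free sketch of that idea, not a definitive implementation: sample_ids and estimate_total are hypothetical helper names, and the commented lines at the end assume a Django model called MyModel with a numeric field called amount.

```python
import random
import statistics


def sample_ids(all_ids, k, seed=None):
    """Return up to k distinct ids chosen uniformly at random."""
    ids = list(all_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def estimate_total(sample_values, population_size):
    """Scale the sample mean up to an estimate of the population total."""
    return statistics.mean(sample_values) * population_size


# With Django this could look roughly like (assumed model and field names):
#   ids = MyModel.objects.values_list("pk", flat=True)
#   objs = MyModel.objects.in_bulk(sample_ids(ids, 1000))
#   total = estimate_total([o.amount for o in objs.values()], len(ids))
```

Averages and proportions computed on the sample can be used as they are; only totals and counts need to be multiplied by the population size.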
I was wondering about best practices in Django for validating table contents.
I am creating Sales Orders, and each SO should check the availability of the items I have in stock; if they are not in stock, it will trigger manufacturing orders and purchase orders.
I don't want to write a very complex view, so I am looking for a way to decouple the logic from it. I also expect performance issues.
What are the best practices or ready-made solutions I can use in the Django framework to address view complexity?
I see different possibilities, but I am wondering which will be the best fit in my case:
managers
celery - just to run a job occasionally; I want the app to be real time, so I don't like this option
using signals (pre_save/post_save)
model validation
creating an extra layer like a services.py file
Since I am new to Django I am a bit puzzled about which route to take.
Not sure if this is the answer you are looking for.
Signals are for doing things automatically when events happen. They are most commonly used to do things before and after model operations. So if you need to do something every time you save a record, or every time you create or delete one, that is where you use signals.
Managers are used to manage record retrieval and manipulations. If you want to do some clever way of retrieving data you can define a custom manager and add some custom methods to it. If you want to override some default behaviors of querysets you would also do it with a custom manager.
Celery is for running things asynchronously. If you are worried that some processing you are doing might take a long time, that is where you might consider offloading things to Celery. A friendly warning though: doing things asynchronously raises the complexity of your code quite a bit, since you need to add some mechanism to pass the data back from Celery tasks into your Django app and to your users.
The services.py link that you posted seems to do what you want; it just provides a place where you can put logic that is not specific to a particular view.
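To make the services.py idea concrete, here is a minimal sketch of what such a module could contain for the sales order case. Everything in it is hypothetical (the OrderLine class, the plan_order function and the manufactured flag are invented for illustration); it uses plain Python objects instead of Django models so that the logic can be tested without a view or a database.

```python
from dataclasses import dataclass


@dataclass
class OrderLine:
    sku: str
    quantity: int
    manufactured: bool  # True: made in-house, False: bought from a supplier


def plan_order(lines, stock):
    """Work out which shortfalls need manufacturing or purchase orders.

    `stock` maps sku -> units on hand. Returns two dicts mapping
    sku -> missing units: (to_manufacture, to_purchase).
    """
    remaining = dict(stock)  # work on a copy so the caller's stock is untouched
    to_manufacture, to_purchase = {}, {}
    for line in lines:
        available = remaining.get(line.sku, 0)
        shortfall = max(line.quantity - available, 0)
        remaining[line.sku] = max(available - line.quantity, 0)
        if shortfall == 0:
            continue  # enough on hand, nothing to trigger
        target = to_manufacture if line.manufactured else to_purchase
        target[line.sku] = target.get(line.sku, 0) + shortfall
    return to_manufacture, to_purchase
```

A view would then only call plan_order and create the corresponding orders from the two returned dicts, which keeps the view itself short.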
Here on Stack Overflow, I got advice from some experienced developers that premature optimization is the root of all evil.
What I suggest is to keep it simple. Making the view a little more complex is actually better than adding one more layer of complexity. I would suggest that you try to put most of your logic in models and whatever remains after that in views.
Also, using multiple packages unnecessarily would not solve much of your problem, so use them only when necessary. Otherwise try to write the minimal logic yourself so that you do not have to use many apps.
Signals and similar mechanisms, as everybody says, are not as great as they may seem, however promising they look. Just try to make things simpler.
One more point from my side, as you are just starting out: go through class-based views and try to use them once you get familiar with them. That will simplify your views the most. Plus, if you are new to Django, read a little code. https://github.com/vitorfs/bootcamp might help you get started.
I've got a web page loading pretty slowly, so I installed the Django Debug Toolbar. I'm pretty new at this, so I'm trying to figure out what I can do with it.
I can see the database did 264 queries in 205 ms. Looks kind of high. I'm pretty sure I can cut down on that by adding some indexes and just writing better queries. But my question is: what is a "good" number that I should be trying to hit here? What is generally accepted as "fast enough", beyond which further optimization isn't really worth it? 50ms? 20ms?
Also on this same page it's showing 2500ms in user CPU. That sounds terrible to me, and I'm surprised it's so much higher than the database, which I assumed was the bottleneck. Is this maybe an indication that I am trying to do too much in Python code instead of at the database layer? Would reducing the number of SQL queries help with CPU (waiting between queries)? Again, is there some well-known target response time I should be aiming for?
I'm looking for a snappy response from my clients. Right now when I click around I can feel a "pregnant pause" before the pages load.
By default, accessing related model fields results in one extra query per model per row. Look into select_related() and prefetch_related(); this usually cuts down the number of queries and speeds things up by a lot. I think the debug toolbar shows you the actual queries; if not, you need to enable SQL logs before doing any query optimizations. Once you have cut the number of queries down to a minimum (no extra queries per row), look for the slowest query and use the SQL EXPLAIN syntax to see if indexes are being used. This is another area where things can get slow, especially on big data.
Usually the database is the bottleneck, unless you are doing some major looping in your code. If you believe the Python code is slow, then you need to profile it; otherwise it's just guessing.
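The "one extra query per row" effect can be reproduced without Django. The sketch below uses the standard sqlite3 module and an invented author/book schema to count queries for both access patterns; the CountingConnection wrapper is a hypothetical helper, not part of any library. The single JOIN in titles_joined corresponds roughly to what select_related() asks the database to do.

```python
import sqlite3


def build_db():
    """Create an in-memory database with 5 authors and 20 books."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE book (
            id INTEGER PRIMARY KEY,
            title TEXT,
            author_id INTEGER REFERENCES author(id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO author VALUES (?, ?)",
        [(i, f"author{i}") for i in range(1, 6)],
    )
    conn.executemany(
        "INSERT INTO book VALUES (?, ?, ?)",
        [(i, f"book{i}", (i % 5) + 1) for i in range(1, 21)],
    )
    return conn


class CountingConnection:
    """Thin wrapper that counts how many queries were sent."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def titles_n_plus_one(db):
    """One query for the books, then one more query per book for its author."""
    result = []
    for title, author_id in db.query("SELECT title, author_id FROM book ORDER BY id"):
        name = db.query("SELECT name FROM author WHERE id = ?", (author_id,))[0][0]
        result.append((title, name))
    return result


def titles_joined(db):
    """The same data fetched with a single JOIN."""
    return db.query(
        "SELECT b.title, a.name FROM book b "
        "JOIN author a ON a.id = b.author_id ORDER BY b.id"
    )
```

With 20 books the first function issues 21 queries and the second issues 1, while both return the same rows.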
So I implemented Haystack with ElasticSearch a week ago within our BETA application. One thing I have noticed is that getting a large amount of data back to our users (for example, listing all the users within the application) is much faster when going through Haystack than through Django's ORM. Now, I will be releasing a REST service (with TastyPie) within the next few weeks to serve tablets, as I want to be able to access the information from iPads, Nexus tablets and so on.
One thing I was wondering, is when should I be querying the ORM vs Haystack/ElasticSearch? For example, if the user on the tablet is requesting a specific set of users, should we let TastyPie query the ORM, or go to ElasticSearch?
If we look at this answer, "Django: Haystack or ORM", we can all agree that a DB is made to retrieve and write data. However, could we say that retrieving data can be faster with Haystack/ElasticSearch once the search engine has been updated?
I am a bit confused: when should we not be querying Haystack, if it is much faster?
To make things clear, I guess you're talking about querying Elasticsearch via Haystack without later instantiating any objects for your search results with data from your database.
Some points to consider besides the points mentioned in the other post:
A search engine like Elasticsearch is highly optimized for full-text searches. (When doing the same with SQL, it depends heavily on the database/engine you are using.)
Queries that involve a lot of relations/joins will most likely be easier to handle with the ORM. On the other hand, with ES you can, for example, save data from foreign-key relations in a denormalized fashion, which could give you a performance boost. Of course you can denormalize your database tables as well, but this is quite often considered bad practice unless you know what you are doing, e.g. when solving a performance bottleneck.
ES is fairly easy to scale, while scaling your SQL DB might be more complicated.
Most likely this is a decision that depends very much on your use case, the amount of data to process and the queries you intend to run. So the best thing of course is - as always - to do some benchmarking yourself and compare these two solutions. But don't do any premature optimisation, as one big advantage of the ORM is that it keeps things simple - you don't have to care much about the integrity of your data or maintain an additional system.
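For the benchmarking step, a small timing helper is usually enough to get a first impression. This is only a sketch: benchmark is a hypothetical helper name, and the commented usage assumes a User model and Haystack's SearchQuerySet being available in your project.

```python
import statistics
import time


def benchmark(candidates, repeat=5):
    """Return the median wall-clock seconds per call for each named callable."""
    results = {}
    for name, func in candidates.items():
        timings = []
        for _ in range(repeat):
            start = time.perf_counter()
            func()
            timings.append(time.perf_counter() - start)
        results[name] = statistics.median(timings)
    return results


# Assumed usage inside a Django shell (names depend on your project):
#   from haystack.query import SearchQuerySet
#   benchmark({
#       "orm": lambda: list(User.objects.all()),
#       "search": lambda: list(SearchQuerySet().models(User)),
#   })
```

The median is used rather than the mean so that a single slow run (a cold cache, for instance) does not dominate the comparison.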
I want to store some one-off data for my website. Specifically, I want to set the articles to display on the start page, the popular tags and other such things.
Django suggests making a model for this, which presumes that there will be lots of such records.
What is the right way to do this? Maybe my approach is completely wrong?
Thank you in advance!
You might consider looking at a CMS, Django-CMS is getting quite mature.
Aside from that, it sounds like you need to store some one-off or singleton objects. You can most certainly use models for this, as it will not only help you think properly about your data structures and learn about this powerful Django DB abstraction, but I suspect you'll find rather quickly that you do indeed want to create multiple objects over time (it is rare that you don't).
If you have something that really is and always should be a singleton, consider placing it in your settings.py file instead.
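As an illustration of the settings.py route, a fragment could look like the sketch below. The setting names FRONT_PAGE_ARTICLE_SLUGS and POPULAR_TAGS are invented for this example; Django does not define them.

```python
# settings.py (fragment) - hypothetical, project-specific settings
FRONT_PAGE_ARTICLE_SLUGS = ["welcome", "release-notes"]
POPULAR_TAGS = ["django", "python", "orm"]
```

Elsewhere in the project these values are read through django.conf.settings, for example settings.POPULAR_TAGS. Keep in mind that changing them means editing the file and redeploying, which is why models are the better fit for anything an editor should be able to change.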
I'm writing a project in C++/Qt and it is able to connect to any type of SQL database supported by the QtSQL (http://doc.qt.nokia.com/latest/qtsql.html). This includes local servers and external ones.
However, when the database in question is external, the speed of the queries starts to become a problem (slow UI, ...). The reason: Every object that is stored in the database is lazy-loaded and as such will issue a query every time an attribute is needed. On average about 20 of these objects are to be displayed on screen, each of them showing about 5 attributes. This means that for every screen that I show about 100 queries get executed. The queries execute quite fast on the database server itself, but the overhead of the actual query running over the network is considerable (measured in seconds for an entire screen).
I've been thinking about a few ways to solve the issue, the most important approaches seem to be (according to me):
1. Make fewer queries
2. Make queries faster
Tackling (1)
a) I could find some sort of way to delay the actual fetching of the attribute (start a transaction), and then when the programmer writes endTransaction() the database tries to fetch everything in one go (with SQL UNION or a loop...). This would probably require quite a bit of modification to the way the lazy objects work, but if people comment that it is a decent solution I think it could be worked out elegantly. If this solution speeds everything up enough, then an elaborate caching scheme might not even be necessary, saving a lot of headaches.
b) I could try pre-loading attribute data by fetching it all in one query for all the objects that are requested, effectively making them non-lazy. Of course in that case I will have to worry about stale data. How would I detect stale data without sending at least one query to the external db? (Note: sending a query to check for stale data on every attribute access would at best give no speedup at all, and at worst make things 2x slower when the data is actually found to be stale.)
Tackling (2)
Queries could, for example, be made faster by keeping a local synchronized copy of the database running. However, I don't really have many possibilities on the client machines to run, for example, exactly the same database type as the one on the server. So the local copy would, for example, be an SQLite database. This would also mean that I couldn't use a db-vendor specific solution. What are my options here? What has worked well for people in these kinds of situations?
Worries
My primary worries are:
Stale data: there are plenty of queries imaginable that change the db in such a way that it prohibits an action that would seem possible to a user with stale data.
Maintainability: How loosely can I couple in this new layer? It would obviously be preferable if it didn't have to know everything about my internal lazy object system and about every object and possible query
Final question
What would be a good way to minimize the cost of making a query? Good meaning some sort of combination of: maintainable, easy to implement, not too application specific. If it comes down to picking any 2, then so be it. I'd like to hear people talk about their experiences and what they did to solve it.
As you can see, I've thought of some problems and ways of handling them, but I'm at a loss for what would constitute a sensible approach. Since it will probably involve quite a lot of work and intensive changes to many layers in the program (hopefully as few as possible), I thought about asking all the experts here before making a final decision on the matter. It is also possible I'm just overlooking a very simple solution, in which case a pointer to it would be much appreciated!
Assuming all relevant server-side tuning has been done (for example: MySQL cache, best possible indexes, ...)
Note: I've checked questions from users with similar problems that didn't entirely answer my question, for example "Suggestion on a replication scheme for my use-case?" and "Best practice for a local database cache?".
If any additional information is necessary to provide an answer, please let me know and I will duly update my question. Apologies for any spelling/grammar errors, English is not my native language.
Note about "lazy"
A small example of what my code looks like (simplified of course):
QList<MyObject> myObjects = database->getObjects(20, 40); // fetch and construct objects 20 to 40 from the db
// ...some time later
// screen filling time!
foreach (const MyObject& o, myObjects) {
    o.getInt("status", 0); // == db request
    o.getString("comment", "no comment!"); // == db request
    // about 3 more of these
}
At first glance it looks like you have two conflicting goals: Query speed, but always using up-to-date data. Thus you should probably fall back to your needs to help decide here.
1) Your database is nearly static compared to use of the application. In this case use your option 1b and preload all the data. If there's a slim chance that the data may change underneath, just give the user an option to refresh the cache (fully or for a particular subset of data). This way the slow access is in the hands of the user.
2) The database is changing fairly frequently. In this case an SQL database perhaps isn't right for your needs. You may need a higher-performance dynamic database that pushes updates rather than requiring a pull. That way your application would get notified when the underlying data changed and you would be able to respond quickly. If that doesn't work, however, you want to construct your queries so as to minimize the number of DB library and I/O calls. For example, if you execute a sequence of select statements in one go, your results should contain all the appropriate data in the order you requested it. You just have to keep track of what the corresponding select statements were. Alternatively, if you can use looser query criteria, so that a simple query returns more than one row, that ought to help performance as well.
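The preload-and-refresh idea from point 1 can be sketched independently of Qt. The example below is written in Python with the standard sqlite3 module rather than C++/QtSQL, purely to keep it short; AttributeCache, the objects table and its columns are all hypothetical names. It loads every displayed attribute for a range of objects in one query, answers later reads from memory, and lets the user trigger a refresh.

```python
import sqlite3


class AttributeCache:
    """Loads the attributes of many objects in one query and serves reads locally."""

    def __init__(self, conn, columns):
        self.conn = conn
        self.columns = tuple(columns)  # trusted, hard-coded column names only
        self.rows = {}
        self.queries = 0  # number of round trips made to the database

    def preload(self, first_id, last_id):
        """Fetch all cached columns for ids first_id..last_id in a single query."""
        sql = "SELECT id, {} FROM objects WHERE id BETWEEN ? AND ?".format(
            ", ".join(self.columns)
        )
        self.queries += 1
        for row in self.conn.execute(sql, (first_id, last_id)):
            self.rows[row[0]] = dict(zip(self.columns, row[1:]))

    def get(self, obj_id, column, default=None):
        """Return a cached attribute, or the default if it is missing or NULL."""
        value = self.rows.get(obj_id, {}).get(column)
        return default if value is None else value

    def refresh(self, first_id, last_id):
        """Drop the cached range and load it again (the user-triggered refresh)."""
        for obj_id in range(first_id, last_id + 1):
            self.rows.pop(obj_id, None)
        self.preload(first_id, last_id)
```

For the screen in the question this turns roughly 100 round trips into one per screen, at the price of possibly showing stale values until refresh is called.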