Django ORM: which technique is better?

I have a project.
project = Project.objects.get(id=1)
and now I want to select the data from the related tables of that project. It can be done in two ways; let me know which one is better, and why.
attachments = project.attachments_set.all()
samples = project.projectsamples_set.all()
OR
attachments = Attachments.objects.filter(project=ctx['project'])
samples = ProjectSamples.objects.filter(project=ctx['project'])
I would like to know the technical perspective.

These queries are exactly equivalent, as you can see if you examine the generated SQL. I would say that the first is preferable as it is more compact and readable, but that is very much subjective so it is up to you which you use.
(Note that if you don't actually have the project object to start with, and don't need it, then it's more efficient to query Attachments and Samples via project_id than to get the project and use the related accessors. However, that doesn't appear to be the case in your example.)
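For example, a minimal sketch of that case, using the models from the question (the id value 1 is just illustrative):

# If you already have the project, the related accessor is fine:
project = Project.objects.get(id=1)
attachments = project.attachments_set.all()

# If you only have the id, filter on the foreign key column directly
# and skip the extra query for the Project row:
attachments = Attachments.objects.filter(project_id=1)
samples = ProjectSamples.objects.filter(project_id=1)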

Related

jOOQ CommonTableExpression with SelectQuery

We have a query with a common structure that we use under different circumstances (the where clause differs).
So we have something like this:
val baseQuery: SelectQuery<Record> = dsl
    .select(someFields)
    .from(someTable)
    .join(otherTable).on(joinClause)
    .query
In other places we then extend this query however needed.
baseQuery.apply {
    addConditions(conditionsA)
}.fetch()

baseQuery.apply {
    addConditions(conditionsB)
}.fetch()
So far so good. But now it would be cool if we could somehow use this base in combination with a CTE. Not sure how to do that though.
val someCTE: CommonTableExpression<*> = DSL.....

// dsl
//     .with(someCTE)
//     .selectQuery(baseQuery.apply {}) ¯\_(ツ)_/¯

baseQuery.apply {
    // addWith(someCTE) ¯\_(ツ)_/¯
    addSelect(someCTE.field(cteField))
    addJoin(someCTE, joinClause)
    addConditions(conditionsC)
}
Is there a way to do it? Perhaps other suggestions how to reuse the base query when using a CTE?
Edit: Solution
With the help of Lukas' answer I settled on this approach
fun dynamicQuery(
    context: SelectSelectStep<*> = dsl.select(),
    selects: List<Field<*>> = listOf(),
    joins: List<Pair<Table<*>, Condition>> = listOf(),
    conditions: List<Condition> = listOf()
): SelectQuery<Record>
So for normal customizations I can do:
dynamicQuery(
    conditions = conditionsA
).fetch()

dynamicQuery(
    conditions = conditionsB
).fetch()
It can be combined with a CTE
val someCTE: CommonTableExpression<*> = DSL.....

dynamicQuery(
    context = dsl.with(someCTE).select(),
    selects = listOf(someCTE.field(cteField)),
    joins = listOf(someCTE to joinClause),
    conditions = conditionsC
).fetch()
TL;DR: You can't use CTEs with jOOQ 3.15's model API
Some background on the jOOQ model API vs DSL API distinction
The very old jOOQ 1.0 only had what is now called the "model API", a mutable, procedural API with setters (no getters), where you can manipulate dynamic SQL.
jOOQ 2.0 introduced the DSL API, which is what most people are using today. The fact that the DSL API mimics the SQL language helps users discover the jOOQ API much more easily. Everything is named exactly as expected. With the exception of a few quirks in the areas of CTE and derived tables, you can write jOOQ-SQL almost just like actual SQL.
The model API was not deprecated, but wrapped by the DSL API, and kept around:
- for backwards compatibility reasons
- because some people seemed to like the procedural approach
You can't do anything with the model API that you couldn't do with the DSL API as well, though a more functional programming style may be helpful when doing this with the DSL API. See: https://blog.jooq.org/2017/01/16/a-functional-programming-approach-to-dynamic-sql-with-jooq
The future of jOOQ
While the model API is still getting support for some new clauses of the SELECT, INSERT, UPDATE, and DELETE statements, there are some statements that are not available in model API form. These include MERGE, TRUNCATE, all DDL statements, and all procedural statements. And the WITH clause.
The strategy is to eventually deprecate the model API, because the redundancy creates a lot of extra work that is better invested elsewhere. There are also subtle bugs when people call model API methods in unexpected order, i.e. an order that is not possible through the DSL API.
In a first step, pretty soon, jOOQ will invert the relationship between the APIs: https://github.com/jOOQ/jOOQ/issues/11241. The model API will be the auxiliary wrapper of the DSL API for those who rely on it for backwards compatibility. It isn't unlikely that the model API will even be extracted into a separate compatibility module, to discourage its use in new code.
In a later step, with the dependencies inverted, the DSL API can finally become consistently immutable, which is what many users expect and, to their surprise, find lacking: https://github.com/jOOQ/jOOQ/issues/9047
Eventually, the model API will be deprecated, and then dropped
You can still use it today, and the deprecation and removal will be done over a long period of time, so there's no hurry in getting off this API (as always with jOOQ). But in the context of your question, it's good to see that jOOQ will not invest in adding too many features to it, anymore. CTE support won't be added to the model API.
Workarounds
You can, of course, work around this limitation, because internally, the model API is CTE capable:
You could use reflection to add new CTEs to the SelectQuery internal representation. I won't document how this works, here, because it's never a good idea to document these things :)
You could start creating a query using the DSL API, and then extract the internal SelectQuery representation using SelectFinalStep.getQuery(), and continue working from there.

Django: how do I create a model dynamically

How do I create a model dynamically upon uploading a CSV file? I have done the part where it can read the CSV file.
This doc explains very well how to dynamically create models at runtime in Django. It also links to an example of doing so.
However, as you will see after looking at the document, it is quite complex and cumbersome to do this. I would not recommend doing this and believe it is quite likely you can determine a model ahead of time that is flexible enough to handle the CSV. This would be much better practice since dynamically changing the schema of your database as your application is running is a recipe for a ton of bugs in your code.
I understand that you want to create new schemas on the fly based on the fields in a CSV. While that's a valid use case and could be the absolute right call, I doubt it; it lends itself to a data model for a single-tenant SaaS application that could have goofy performance and migration issues.
I'd try using Mongo or some other NoSQL solution, as others have mentioned. But a simpler approach may be a modified Star Schema implemented in SQL. In this case you create a dimensions table that stores each header, then create an instance of each data element that has a foreign key to the dimension and records the value of that dimension.
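A minimal sketch of the two models that the pseudocode below assumes (the field types and the unique constraint are assumptions):

from django.db import models

class Dimension(models.Model):
    # One row per CSV header seen so far.
    name = models.CharField(max_length=255, unique=True)

class DimensionRecord(models.Model):
    # One row per cell value, pointing back at the header it belongs to.
    dimension = models.ForeignKey(Dimension, on_delete=models.CASCADE)
    value = models.TextField()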
If you read the CSV, the pseudocode would look something like this:
from csv import DictReader

for row in DictReader(file):
    for k in row.keys():
        try:
            dim = Dimension.objects.get(name=k)
        except Dimension.DoesNotExist:
            dim = Dimension(name=k)
            dim.save()
        DimensionRecord.objects.create(dimension=dim, value=row[k])
Obviously you could better handle reading the headers and the error-trapping for dimensions that already exist, but this would be an example of how you could dynamically load variable-headered CSVs into a SQL db.

Datastore NDB best practices when querying and extracting thousands of rows

I'm using the High Replication Datastore, along with ndb. I have a kind with over 27,000 entities, which isn't that much. Supposedly the datastore is efficient in querying and extracting large amounts of data, but whenever I query over that kind, queries take a long time to finish (I've even got DeadlineExceededErrors).
I have a model where I store keywords and URLs I want to index in Google:
class Keywords(ndb.Model):
    keyword = ndb.StringProperty(indexed=True)
    url = ndb.StringProperty(indexed=True)
    number_articles = ndb.IntegerProperty(indexed=True)
    # Some other attributes... All attributes are indexed
My current use cases are to build my sitemap, and to fetch my top 20 keywords to link from my home page.
When I fetch many entities, I usually do:
Keywords.query().fetch() # For the sitemap, as I want all of the urls
Keywords.query(Keywords.number_articles > 5).fetch() # For the homepage, I want to link to keywords with more than 5 articles
Is there a better way to extract data?
I've tried to index data into the Search API, and I've seen huge speed gains. Even though this works, I don't think it's ideal to replicate data from the Datastore into Search API with basically the same fields.
Thanks in advance!
I would split this functionality.
For the home page you can use your second query, but add, as advised by Bruyere, the limit=20 parameter. Such a request should run very fast if you have the right index.
The sitemap is a bigger issue. Usually, to process a large number of entities, you use MapReduce.
That's probably a good idea, but only if you don't have too many requests to the sitemap. It can also be the only solution if you update Keywords entities often and want the sitemap to be as up to date as possible.
Another option is to generate the sitemap in a task, save it as a blob, and serve that blob in the request. That is really quick. If your updates to the Keywords entities are not very frequent, you can run this task after any update. If you have many updates, you can schedule the task to run periodically in cron. Since you have had success using the Search API, this is probably the best option for you.
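A rough sketch of that approach, using ndb itself as the cache (the SitemapCache model, the 'sitemap' entity id, and the build_sitemap_xml helper are all made up for illustration):

class SitemapCache(ndb.Model):
    xml = ndb.TextProperty()
    updated = ndb.DateTimeProperty(auto_now=True)

def rebuild_sitemap():
    # Run this from a task or cron job, never from the user-facing request.
    urls = [k.url for k in Keywords.query().fetch()]
    xml = build_sitemap_xml(urls)  # hypothetical helper that renders the XML
    SitemapCache(id='sitemap', xml=xml).put()

def serve_sitemap():
    # The request handler only reads the pre-built document, so it returns quickly.
    cached = SitemapCache.get_by_id('sitemap')
    return cached.xml if cached else ''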
Generally speaking I don't think it's a good idea to use the datastore to retrieve large amounts of data. I recommend at least looking at the Datastore comparison with traditional databases. It's designed to handle large databases, but not necessarily large result sets. I would say that the datastore is designed to handle large numbers of small requests.
DB speed is related to the number of results returned, not the number of records in the DB. You say:
to build my Sitemap, and to fetch my top 20 keywords
If that's the case, add limit=20 in both fetches. If you do it that way, then use run instead, as per the docs:
https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_fetch
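For the home page, that would look roughly like this (ordering by number_articles is an assumption about what "top 20" means):

top_keywords = (Keywords.query(Keywords.number_articles > 5)
                .order(-Keywords.number_articles)
                .fetch(20))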

What is a sane way to perform a radical Django Model migration in a production environment?

I have an existing Django web app that is in use. I have to radically migrate one key model in my design to a completely new design, but I want to cache all of the existing data for that model and migrate it to the new records in production when ready to deploy.
I can afford to bring my website down for a few hours one night and do whatever I need to do to migrate. What are some sane ways I can do this migration?
It seems any migration would need to:
1) Dump all of the existing data into some format, such as SQL, JSON, XML
2) Migrate the model to the new format
3) Reload the data into the new model using a conversion script
I also thought of trying to store all of the existing data in some other model called "OldModel" (if Model is the name of the existing model) and then migrating the data live.
There is a project to help with migrations that I've heard of: South.
Having said that, I admit we've not used it. We still plan our migrations using a file of SQL statements. Madness, I know, but it has the advantage of testability. You can run it as many times as necessary during development and staging testing before the "big deploy". It can be source controlled, diffed, etc. It can also, therefore, be called from a larger deployment script. Of course, we back up production before running it :-)
If your database does journaling, using the old-fashioned method has the added advantage that there is a transaction history that can be rolled back.
Experiments we've run with JSON, XML and "OldModel" -> "NewModel" style dumps have scaled pretty poorly. Mind you, YMMV... we have quite a large database. By using a script, you can run the migration on your production database without having to offload or reload vast amounts of data. This way even a complicated migration can take seconds, rather than hours.
There are around 5 or 6 tools to help automate some portion of migrations. Several of them are listed in this question and I'll add the others just for completeness.
Next, see S. Lott's answer to this question about migration workflows for a great idea on using version numbers in the model name to make migrations easier, including structuring a standalone script to properly convert the tables. To my mind this is vastly superior to serializing the data for export and then trying to build your new tables by importing.
Finally, I haven't been able to think of a way to do a hot migration properly and haven't seen any hints from anywhere else either, so maintenance downtime is inevitable.
Make all migrations in steps!
If you need to add a field, go ahead and add it, with a default value or being optional. This is safe.
If you need to make an existing optional field required, give it a default first.
If you need to make an existing field with a default not have a default, drop the default after fixing all the code that creates instances.
If you need to change the type of a field, first add a new field that inherits the value from the current one. Then, run a script to update the existing instances to populate the new field. Third, change all the code that uses the old field to use the new one. Finally, once no code is left using the original, you can drop it (see the sketch after this list).
For every situation there is a small step you can make. For every bigger change, you can break it down into little ones. This is one place iterative development pays off. Keep good backups in place and don't be afraid to push often! Make the small changes quickly to see if they work.
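As a rough illustration of the type-change recipe, here is what those steps might look like for a made-up Invoice model (the model and field names are not from the question):

# Step 1: add the new field next to the old one, optional for now.
class Invoice(models.Model):
    amount = models.CharField(max_length=20)  # old, string-typed field
    amount_decimal = models.DecimalField(max_digits=10, decimal_places=2,
                                         null=True, blank=True)  # new field

# Step 2: a one-off script that populates the new field from the old one.
from decimal import Decimal

for invoice in Invoice.objects.filter(amount_decimal__isnull=True):
    invoice.amount_decimal = Decimal(invoice.amount)
    invoice.save()

# Steps 3 and 4: switch all code over to amount_decimal, then drop amount.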
If you are more comfortable with the Django ORM than with raw SQL, you might consider using Model -> BackupModel -> TestModel -> Model, where all but the last step can be performed without dropping data.
def backup(InModel, OutModel):
    in_objs = InModel.objects.all()
    for obj in in_objs:
        out_obj = OutModel.convert_from(InModel, obj)
        out_obj.save()
Here, you would just make sure that all your models have convert_from methods implemented. These should all be trivial conversions except for BackupModel -> TestModel. In the other cases, nothing but the class would change, all data being identically preserved.
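For the trivial cases, convert_from could be sketched like this (the field-copying loop is an assumption about how you might implement it, matching the calling convention of the backup() helper above):

class BackupModel(models.Model):
    # The same fields as the source model would be declared here.

    @classmethod
    def convert_from(cls, InModel, obj):
        # Copy every concrete field except the primary key, so the backup
        # row gets its own id when saved.
        values = {
            f.name: getattr(obj, f.name)
            for f in InModel._meta.fields
            if not f.primary_key
        }
        return cls(**values)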
The advantage to this is that before you go rewriting all your interfaces, you can play around with TestModel and make sure that your conversions were what you thought they'd be. If everything goes wrong, you convert from BackupModel->Model, and everything is okay. In a worst-case scenario, you give up on Django's ORM, run back to SQL, and simply rename all your tables that begin with backupmodel__* to model__* in your database.
Disclaimer: I've never done this.

Multijoin queries in Django

What's the best and/or fastest method of doing multijoin queries in Django using the ORM and QuerySet API?
If you are trying to join across tables linked by ForeignKeys or ManyToManyField relationships then you can use the double underscore syntax. For example if you have the following models:
class Foo(models.Model):
    name = models.CharField(max_length=255)

class FizzBuzz(models.Model):
    bleh = models.CharField(max_length=255)

class Bar(models.Model):
    foo = models.ForeignKey(Foo)
    fizzbuzz = models.ForeignKey(FizzBuzz)
You can do something like:
FizzBuzz.objects.filter(bar__foo__name="Adrian")
Don't use the API ;-) Seriously, if your JOINs are complex, you should see significant performance increases by dropping down into SQL rather than by using the API. And this doesn't mean you need to get dirty dirty SQL all over your beautiful Python code; just make a custom manager to handle the JOINs and then have the rest of your code use it rather than direct SQL.
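A rough sketch of that idea, reusing the Bar model from the earlier answer (the table names in the raw SQL are guesses at what Django would generate for an app called myapp, and the method name is made up):

class BarManager(models.Manager):
    def with_foo_name(self, name):
        # The hand-written JOIN lives here, so callers never touch SQL directly.
        return self.raw(
            "SELECT bar.* FROM myapp_bar AS bar "
            "JOIN myapp_foo AS foo ON foo.id = bar.foo_id "
            "WHERE foo.name = %s",
            [name],
        )

class Bar(models.Model):
    foo = models.ForeignKey(Foo)
    fizzbuzz = models.ForeignKey(FizzBuzz)
    objects = BarManager()

Callers would then write Bar.objects.with_foo_name("Adrian") and iterate over the result like a normal queryset.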
Also, I was just at DjangoCon where they had a seminar on high-performance Django, and one of the key things I took away from it was that if performance is a real concern (and you plan to have significant traffic someday), you really shouldn't be doing JOINs in the first place, because they make scaling your app while maintaining decent performance virtually impossible.
Here's a video Google made of the talk:
http://www.youtube.com/watch?v=D-4UN4MkSyI&feature=PlayList&p=D415FAF806EC47A1&index=20
Of course, if you know that your application is never going to have to deal with that kind of scaling concern, JOIN away :-) And if you're also not worried about the performance hit of using the API, then you really don't need to worry about the (AFAIK) minuscule, if any, performance difference between using one API method over another.
Just use:
http://docs.djangoproject.com/en/dev/topics/db/queries/#lookups-that-span-relationships
Hope that helps (and if it doesn't, hopefully some true Django hacker can jump in and explain why method X actually does have some noticeable performance difference).
Use the queryset.query.join method, but only if the other method described here (using double underscores) isn't adequate.
Caktus blog has an answer to this: http://www.caktusgroup.com/blog/2009/09/28/custom-joins-with-djangos-queryjoin/
Basically there is a hidden QuerySet.query.join method that allows adding custom joins.