Doctrine 2 query cache

My cache folder with Doctrine 2 fills up with thousands of folders which seem to contain the translated SQL queries for all Doctrine actions that are performed. In my code there are no explicit caching commands on my DQL queries. Almost all of my DQL contains some variable part based on user actions, mostly filtering in the WHERE clause. I like the concept of caching the translation process, but queries seem to be handled as a whole, which renders caching pointless, as many queries are unique.
As I'm fairly new to Doctrine, there must be something I'm missing. Can somebody give me some directions on how to get the query caching working properly?
Also, clear-cache on the Doctrine command line does not remove these folders; I delete them manually. What could that mean?
Thanks, Mike

Parameter values do not affect Doctrine's query cache. Only one cache entry will be created for:
$query = $em->createQuery('SELECT u FROM MyBundle\Entity\User u WHERE u.id = :id');
Regardless of how many times you call it with different values:
$query->setParameter('id', 1)->getResult();
$query->setParameter('id', 2)->getResult();
// ...
So either this is not the query cache, or you are not using parameter binding.
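To illustrate, here is a minimal sketch contrasting the two styles (the entity name comes from the example above; the concatenating query is hypothetical):
// Anti-pattern: every distinct $id produces a distinct DQL string, so each
// execution gets its own cache entry and the cache folder keeps growing.
$query = $em->createQuery('SELECT u FROM MyBundle\Entity\User u WHERE u.id = ' . $id);
$users = $query->getResult();

// With parameter binding the DQL string is constant: the DQL-to-SQL
// translation is cached once and reused for every value of :id.
$query = $em->createQuery('SELECT u FROM MyBundle\Entity\User u WHERE u.id = :id');
$users = $query->setParameter('id', $id)->getResult();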

Related

How to cache a query result for several minutes in Django?

I have a query like this:
profiles = UserProfile.objects.all().order_by('-created')
There are several thousand user profiles, hence the query is costly, so I am wondering how to cache the result for 5 minutes.
I have read the caching documentation but could not find my answer.
You should have a look at the documentation on when QuerySets are evaluated and at Django's caching framework instead.
In essence, Django's querysets are lazy and they maintain a cache, meaning they only hit the database when evaluated for the first time. This means that your profiles variable will only query the database once during its lifetime (unless you do profiles.filter() or something similar that alters the query, in which case it will have to be re-evaluated).
Now the important part with regard to your question is the lifetime of the variable. It goes out of scope as soon as the method it is defined in returns; in other words, it will be created anew, querying the database, for each new request. To avoid that, you can either cache the results using memcached or a similar system, or make your profiles variable global so that it persists between requests.
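A minimal sketch of the first option, using Django's low-level cache API (this assumes a cache backend such as memcached is configured in settings; the key name 'all_profiles' is arbitrary):
from django.core.cache import cache

def get_profiles():
    profiles = cache.get('all_profiles')
    if profiles is None:
        # Evaluate the lazy queryset with list() so the cached value is the
        # actual data, not a queryset that would hit the db again when used.
        profiles = list(UserProfile.objects.all().order_by('-created'))
        cache.set('all_profiles', profiles, 60 * 5)  # keep for 5 minutes
    return profiles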

Cassandra NOT EQUAL Operator

Question to all Cassandra experts out there.
I have a column family with about a million records.
I would like to query these records in such a way that I should be able to perform a Not-Equal-To kind of operation.
I googled this and it seems I have to use some sort of MapReduce.
Can somebody tell me what options are available in this regard?
I can suggest a few approaches.
1) If you have a limited number of values that you would like to test for not-equality, consider modeling those as boolean columns (e.g. a column isEqualToUnitedStates holding true or false).
2) Otherwise, consider emulating the unsupported query != X by combining the results of two separate queries, < X and > X, on the client side (see the sketch after this list).
3) If your schema cannot support either type of query above, you may have to resort to writing custom routines that will do client-side filtering and construct the not-equal set dynamically. This will work if you can first narrow down your search space to manageable proportions, such that it's relatively cheap to run the query without the not-equal.
So let's say you're interested in all purchases of a particular customer of every product type except Widget. An ideal query would look something like SELECT * FROM purchases WHERE customer = 'Bob' AND item != 'Widget'. Now of course you cannot run this, but in this case you should be able to run SELECT * FROM purchases WHERE customer = 'Bob' without wasting too many resources, and apply the item != 'Widget' filter in the client application.
4) Finally, if there is no way to restrict the data in a meaningful way before doing the scan (querying without the equality check would return too many rows to handle comfortably), you may have to resort to MapReduce. This means running a distributed job that would scan all rows in the table across the cluster. Such jobs will obviously run a lot slower than native queries, and are quite complex to set up. If you want to go this way, please look into Cassandra Hadoop integration.
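To make options 2 and 3 concrete, here is a sketch using the DataStax Python driver (the contact point and keyspace are made up, and it assumes a purchases table with customer as partition key and item as a clustering column):
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])          # assumed contact point
session = cluster.connect('my_keyspace')  # assumed keyspace

# Option 2: emulate item != 'Widget' as two range queries merged client-side.
low = session.execute(
    "SELECT * FROM purchases WHERE customer = %s AND item < %s",
    ('Bob', 'Widget'))
high = session.execute(
    "SELECT * FROM purchases WHERE customer = %s AND item > %s",
    ('Bob', 'Widget'))
rows = list(low) + list(high)

# Option 3: fetch the narrowed partition and filter in the application.
rows = [r for r in session.execute(
            "SELECT * FROM purchases WHERE customer = %s", ('Bob',))
        if r.item != 'Widget']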
If you want to use a not-equals operator on a specific partition key and get all other data from the table, you can use a combination of range queries and the TOKEN function from CQL to achieve this.
For example, if you want to fetch all rows except the ones whose partition key is 'abc', you execute the two queries below:
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) < TOKEN('abc');
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) > TOKEN('abc');
But beware that the result is going to be huge (depending on the size of the table and the fields you need), so you might want to use this in conjunction with a utility like dsbulk. Also note that there is no guarantee of ordering in your result. This is really a data dump, most probably useful for one-time scenarios such as data migrations.
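If you run these token-range queries from application code rather than dsbulk, let the driver page through the result instead of materializing it all at once; a sketch with the DataStax Python driver, reusing the placeholder names from the queries above:
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])        # assumed contact point
session = cluster.connect('keyspace1')  # placeholder keyspace

# fetch_size makes the driver pull the huge result in pages of 1000 rows.
stmt = SimpleStatement(
    "SELECT column1, column2 FROM table1 "
    "WHERE TOKEN(partition_key_column_name) > TOKEN('abc')",
    fetch_size=1000)
for row in session.execute(stmt):  # iteration fetches page after page
    handle(row)                    # hypothetical per-row handler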

Overcoming querying limitations in Couchbase

We recently made a shift from relational (MySQL) to NoSQL (Couchbase). Basically it's a back-end for a social mobile game. We were facing a lot of problems scaling our back-end to handle an increasing number of users. When using MySQL, loading a user took a lot of time as there were a lot of joins between multiple tables. We saw a huge improvement after moving to Couchbase, especially when loading data, as most of it is kept in a single document.
On the downside, Couchbase also seems to have a lot of limitations as far as querying is concerned. Couchbase's alternative to SQL queries is views. While we managed to handle most of our queries using map-reduce, we are really having a hard time figuring out how to handle time-based queries, e.g. we need to filter users based on a timestamp attribute. We only need a user in the view if its time is less than the current time:
if(user.time < new Date().getTime() / 1000)
What happens is that once a user's time is set to some future time, it gets excluded from this view, which is the desired behavior, but it never gets added back to the view unless we update it: a document only gets re-indexed in a view when it is updated.
Our solution right now is to load the first x user documents and then check the time in our application. Sorting is done on the user.time attribute, so we get those users whose time is less than or near the current time. But I am not sure this is actually going to work in a live environment. Ideally we would like to avoid this type of check at the application level.
Also, there are times, e.g. matchmaking, when we need to check multiple time-based attributes. Our current strategy doesn't work in such cases, and we frequently get documents from the view which do not pass these checks when they are done in the application. I would really appreciate it if someone who has already tackled similar problems could share their experiences. Thanks in advance.
Update:
We tried using range queries, which work for only one key. Like I said, in most cases we have multiple time-based keys, meaning multiple ranges, which does not work.
If you use new Date().getTime() inside a view function, you'll always get the time at which that view was indexed, which is exactly the behavior you described: "it never gets added back to the view unless we update it".
There are two ways:
Bad way (don't do this in production): query views with the stale=false param. That will cause the view to update before it returns results. But view indexing is a slow process, especially if you have more than a million records.
Good way: use range requests. You just need to emit your date in the map function, as the key or as part of a compound key, and query the view with a range request. For example, say you have docs like:
doc = {
  "id": 1,
  "type": "doctype",
  "timestamp": 123456, // document update or creation time
  "data": "lalala"
}
For those docs the map function will look like:
map = function (doc, meta) {
  if (doc.type === "doctype") {
    emit(doc.timestamp, null);
  }
}
And now to get recently "updated" docs you need to query this view with params:
startKey="dateTimeNowFromApp"
endKey="{}"
descending=true
Note that startKey and endKey are swapped because I used descending order. The Couchbase documentation describes the key types that views support.
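For instance, queried over the view REST API from Python (host, bucket, design document, and view names are made up; I also assume endkey=0 rather than {} so the descending scan stops at the oldest numeric timestamp):
import time
import requests

now = int(time.time())
resp = requests.get(
    'http://localhost:8092/mybucket/_design/users/_view/by_timestamp',
    params={
        'startkey': now,       # scan starts at the newest key of interest...
        'endkey': 0,           # ...and walks down to 0 (swapped for descending)
        'descending': 'true',
    })
for row in resp.json()['rows']:
    print(row['id'], row['key'])  # docs whose timestamp is <= now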

Django 1.2 PostgreSQL cascading delete for keys with ON DELETE NO ACTION

I have a PostgreSQL database with about 150 tables (it's a Django 1.2 project). Django adds ON DELETE NO ACTION and ON UPDATE NO ACTION to foreign keys at the time of table creation.
Now I need to bulk delete data (about 800,000 records) from a bunch of tables based on certain condition.
Using Model.objects.filter().delete() is not an option because the data is huge and it takes a lot of time.
The only sane option seems to be a cascading delete, but since Django has added ON DELETE NO ACTION, that does not seem to be available.
So my question: is there any way to change all foreign keys to ON DELETE CASCADE in an easy way (there are many of them), or something similar?
(I am aware that I can manually write the SQL queries for each table, but that would be a monumental and difficult to maintain task.)
https://docs.djangoproject.com/en/dev/ref/models/fields/#django.db.models.ForeignKey.on_delete
As pointed out in the link that makes up Andrew's answer, if you set this to CASCADE in Django, then Django will go and do the deletes "retail". If it is set to NO ACTION, you can create a database-level foreign key definition to handle things instead. That sounds like a reasonable plan to me.
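For example, redefining a single foreign key at the database level could look like this (table, column, and constraint names here are hypothetical; check the real constraint names with \d referencing_table in psql):
-- Drop the Django-generated constraint and recreate it with CASCADE.
ALTER TABLE referencing_table
    DROP CONSTRAINT referencing_table_mytable_id_fkey;
ALTER TABLE referencing_table
    ADD CONSTRAINT referencing_table_mytable_id_fkey
    FOREIGN KEY (mytable_id) REFERENCES mytable (id)
    ON DELETE CASCADE;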
Be sure you have an index defined on the referencing set of columns for every foreign key; otherwise you're going to see very slow performance. Some database products will automatically create such an index when you define a foreign key, but there are situations where that is not advantageous, so PostgreSQL puts the matter in your hands to optimize as you see fit. (Just as one example, it might not be worth the cost of maintaining the index during normal operations, but be worth building it before a purge and dropping it after.)
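With the same hypothetical names, the build-before-purge, drop-after pattern would be:
-- Without this, every cascaded delete scans referencing_table sequentially.
CREATE INDEX referencing_table_mytable_id_idx
    ON referencing_table (mytable_id);

-- ... run the bulk delete ...

-- Drop it again if it is not worth maintaining during normal operations.
DROP INDEX referencing_table_mytable_id_idx;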
One note: ON DELETE CASCADE performs miserably on bulk operations. The reason is that this is done as a trigger. Consequently the way it looks from an algorithmic perspective is:
for row in delete_set:
    for dependent_row in (scan for referencing rows):
        delete dependent_row
If you are deleting 800,000 rows in a parent table, this translates into 800,000 separate delete scans on the dependent tables. Even in the best case, with usable indexes, 800,000 separate index scans will be much slower than one sequential scan.
A better way to do this is to use a writeable common table expression, available in PostgreSQL 9.1 or higher, or to just do separate delete statements in the same transaction. Something like:
WITH rows_to_delete (id) AS (
    SELECT id FROM mytable WHERE where_condition
),
deleted_rows (id) AS (
    DELETE FROM referencing_table
    WHERE mytable_id IN (SELECT id FROM rows_to_delete)
    RETURNING mytable_id
)
DELETE FROM mytable WHERE id IN (SELECT id FROM deleted_rows);
Algorithmically, this reduces to something like:
scan for rows to delete as delete_set
for dependent in (scan for rows referencing delete_set):
    delete dependent
for to_delete in (scan for rows referenced by deleted dependents):
    delete to_delete
Getting rid of the forced nested loop scan will greatly speed things up.
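On PostgreSQL below 9.1, the separate-statements variant can be driven from Django 1.2 with raw SQL in one transaction; a sketch that keeps the placeholder names from the example above:
from django.db import connection, transaction

def bulk_delete():
    cursor = connection.cursor()
    # Delete children first, then parents, so the NO ACTION constraints
    # are never violated.
    cursor.execute(
        "DELETE FROM referencing_table WHERE mytable_id IN "
        "(SELECT id FROM mytable WHERE where_condition)")
    cursor.execute("DELETE FROM mytable WHERE where_condition")
    transaction.commit_unless_managed()  # Django 1.2-era explicit commit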

How to limit columns returned by Django query?

That seems simple enough, but all Django queries seem to be SELECT *.
How do I build a query returning only a subset of fields?
From Django 1.1 onwards, you can use defer('col1', 'col2') to exclude columns from the query, or only('col1', 'col2') to fetch only a specific set of columns. See the documentation.
values does something slightly different: it only fetches the columns you specify, but it returns a list of dictionaries rather than a set of model instances.
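Side by side (UserProfile and its fields are just stand-ins; substitute your own model):
# only(): model instances, but the SELECT covers just these columns
# (plus the primary key); other fields load lazily on first access.
profiles = UserProfile.objects.only('name', 'created')

# defer(): the inverse -- SELECT everything except the named columns.
profiles = UserProfile.objects.defer('biography', 'avatar')

# values(): plain dictionaries instead of model instances.
rows = UserProfile.objects.values('id', 'name')
# -> [{'id': 1, 'name': 'Bob'}, ...]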
Append a .values("column1", "column2", ...) to your query
The accepted answer advises defer and only, which the docs discourage in most cases:
only use defer() when you cannot, at queryset load time, determine if you will need the extra fields or not. If you are frequently loading and using a particular subset of your data, the best choice you can make is to normalize your models and put the non-loaded data into a separate model (and database table). If the columns must stay in the one table for some reason, create a model with Meta.managed = False (see the managed attribute documentation) containing just the fields you normally need to load and use that where you might otherwise call defer(). This makes your code more explicit to the reader, is slightly faster and consumes a little less memory in the Python process.
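A sketch of that suggestion (model, field, and table names are hypothetical):
from django.db import models

class UserProfileSlim(models.Model):
    # Just the fields we routinely need, mapped onto the existing table.
    name = models.CharField(max_length=100)
    created = models.DateTimeField()

    class Meta:
        managed = False                  # Django will not create or alter this table
        db_table = 'myapp_userprofile'   # hypothetical existing table name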