I am having a problem with long execution times (10s) in PHP when using Doctrine2, and I am wondering what solution can be applied.
I am loading an actor entity which has a 1:n relation to features. I am loading 50 actors, each having 20 features, which results in 1000 rows. The problem seems to be Doctrine's hydration: it doesn't make a big difference whether I use object or array hydration. In the profiler I can see that the main impact comes from the methods gatherRowData/hydrateRowData, which are both called 13,141 times.
Is the only solution to use a plain old MySQL result set and loop over that array? If so, I wonder what the point of using an ORM is. Hopefully someone can shed some light on this. Ideally I want objects that can be accessed through methods like getIdentity(), or that can hold some business logic, e.g. for obtaining associated entities.
Best Regards
Christian
The design pattern is explained here:
http://www.tutorialspoint.com/design_pattern/filter_pattern.htm
I'm working on software very similar to Adobe Lightroom or ACDSee, but with different purposes. The user (a photographer) is able to import thousands of images from their hard drive (it wouldn't be unusual to have over 100k/200k images).
We have a side panel where users can create custom "filters" which are expressions like:
Does contain the keyword: "car"
AND
Does not contain the keyword "woods"
AND
(
Camera model is "Nikon D300s"
OR
Camera model is "Canon 7D Mark II"
)
AND
NOT
Directory is "C:\today_pictures"
You can get the idea from the above example.
We have a SQLite database where all image information is stored. The question is: should we load ALL Photo objects into memory from the database the first time the program is loaded and implement the Criteria/Filter design pattern as explained on the website cited above, so that our Criteria classes filter objects in memory? Or is it better to have the criteria classes generate an SQL query that is then executed, in order to retrieve only what's needed from the database?
We are developing the program with C++ (Qt).
TL;DR: It's already properly implemented in SQLite3, and look at how long that took. You'll face the same burden.
It'd be a horrible case of data duplication to read the data from the database and store it again in another data structure. Use database queries to implement the query that the user gave you. Let the database execute the query. That's what databases are for.
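To make that concrete, here is a minimal sketch of what "criteria classes generate SQL" could look like. It is not from anyone's codebase: it's written in Python over sqlite3 for compactness (the structure ports directly to C++/Qt), and the table layout (photos, keywords) and class names are invented for illustration. Each criteria node compiles to a SQL fragment plus bound parameters, so the database, not the application, does the filtering.
import sqlite3

class HasKeyword:
    def __init__(self, word, negate=False):
        self.word, self.negate = word, negate
    def to_sql(self):
        op = "NOT IN" if self.negate else "IN"
        return ("photos.id " + op + " (SELECT photo_id FROM keywords WHERE word = ?)",
                [self.word])

class CameraIs:
    def __init__(self, model):
        self.model = model
    def to_sql(self):
        return ("photos.camera_model = ?", [self.model])

class DirectoryIs:
    def __init__(self, path, negate=False):
        self.path, self.negate = path, negate
    def to_sql(self):
        return ("photos.directory " + ("<>" if self.negate else "=") + " ?", [self.path])

class Junction:
    glue = None
    def __init__(self, *parts):
        self.parts = parts
    def to_sql(self):
        frags, params = [], []
        for part in self.parts:
            frag, ps = part.to_sql()
            frags.append(frag)
            params.extend(ps)
        return "(" + self.glue.join(frags) + ")", params

class And(Junction):
    glue = " AND "

class Or(Junction):
    glue = " OR "

# The filter from the question compiles to a single indexed query:
f = And(HasKeyword("car"),
        HasKeyword("woods", negate=True),
        Or(CameraIs("Nikon D300s"), CameraIs("Canon 7D Mark II")),
        DirectoryIs(r"C:\today_pictures", negate=True))
where, params = f.to_sql()
conn = sqlite3.connect("photos.db")
rows = conn.execute("SELECT id, path FROM photos WHERE " + where, params).fetchall()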
By reimplementing a search/query system for ~500k records, you'll be rewriting large chunks of a bog-standard database yourself. It'd be a mostly pointless exercise. SQLite3 is very well tested and is essentially foolproof. It'll cost you thousands of hours of work to reimplement even a small fraction of its capabilities and reliability/resiliency. If that doesn't scream "reinventing the wheel", I don't know what does.
The database also allows you to very easily implement lookahead/dropdowns to aid the user in writing the query. For example, as the user types out "camera model is", they can be offered autocompletion or a dropdown to select one or more models.
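For instance, reusing the hypothetical schema from the sketch above, the completion list for camera models is one cheap query:
models = conn.execute(
    "SELECT DISTINCT camera_model FROM photos ORDER BY camera_model").fetchall()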
You paid the "price" of a database; it'd be a shame for it all to go to waste. So, use it. It'll give you lots of leverage, and allow you to implement features two orders of magnitude faster than otherwise.
The pattern you've linked to is just a pattern. It doesn't mean it's an exact blueprint for how to design your application to perform on real data. Eventually you'll be fighting things such as concurrency (a file-scanning thread running to update the metadata), indexing, resiliency in the face of crashes, etc. In the end you'd have big chunks of SQLite reimplemented for your particular application. 500k metadata records are nothing much; if you design your query translator well and support it with proper indexes, it'll work perfectly well.
I am very confused about when and when not to use native queries in JPA 2.0. I was under the impression that using native queries could cause me to get out of sync with the JPA cache. If I can accomplish the same thing with a JPQL or CriteriaBuilder query, is there any good reason to use a native query? Similarly, is there any danger in using a native query if I can accomplish the same thing with JPQL or CriteriaBuilder? And finally, if there is a danger in using a native query as far as getting out of sync with the JPA cache, would the same danger exist in executing an equivalent query with JPQL or CriteriaBuilder?
My philosophy has been to avoid native queries, but surely there are times when they are necessary. It seems to me that if I can do it with JPQL or CriteriaBuilder, then I should.
Thanks.
I agree with your philosophy.
The main problem with native queries, IMHO, is maintainability. First of all, they're generally more complex and longer than JPQL queries. But they also hardcode table and column names, rather than using class and property names.
JPQL queries are already problematic when refactoring, because they hard-code class and property names in Strings. But native queries are even worse, because they hard-code table and column names everywhere.
I don't think native select queries are a problem as far as the cache is concerned. Native update, insert and delete queries are a problem, though, because they modify data behind the back of the first- and second-level caches, which might therefore become stale.
Another problem is that your native queries could use a syntax that is recognized by one database but not by another, making the application harder to migrate from one database to another.
I'm writing a project in C++/Qt, and it is able to connect to any type of SQL database supported by QtSQL (http://doc.qt.nokia.com/latest/qtsql.html). This includes local servers and external ones.
However, when the database in question is external, the speed of the queries starts to become a problem (slow UI, ...). The reason: Every object that is stored in the database is lazy-loaded and as such will issue a query every time an attribute is needed. On average about 20 of these objects are to be displayed on screen, each of them showing about 5 attributes. This means that for every screen that I show about 100 queries get executed. The queries execute quite fast on the database server itself, but the overhead of the actual query running over the network is considerable (measured in seconds for an entire screen).
I've been thinking about a few ways to solve the issue; the most important approaches seem to me to be:
Make fewer queries
Make queries faster
Tackling (1)
I could find some way to delay the actual fetching of the attribute (start a transaction), so that when the programmer writes endTransaction() the database tries to fetch everything in one go (with a SQL UNION or a loop...). This would probably require quite a bit of modification to the way the lazy objects work, but if people confirm that it is a decent solution I think it could be worked out elegantly. If this solution speeds everything up enough, then an elaborate caching scheme might not even be necessary, saving a lot of headaches.
I could try pre-loading attribute data by fetching it all in one query for all the objects that are requested, effectively making them non-lazy. Of course, in that case I will have to worry about stale data. How would I detect stale data without at least sending one query to the external db? (Note: sending a query to check for stale data on every attribute check would give a best-case 0x performance increase and a worst-case 2x performance decrease when the data is actually found to be stale.)
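For what it's worth, here is a minimal sketch of the pre-loading idea, in Python over sqlite3 for brevity rather than the project's C++/Qt, and assuming an invented key/value-style table attributes(object_id, name, value) to match the getInt("status", ...) accessors shown further down. One round trip replaces one query per attribute access:
import sqlite3

def preload_attributes(conn, object_ids):
    # Fetch every attribute of every on-screen object in a single query,
    # instead of issuing one network round trip per attribute access.
    placeholders = ",".join("?" * len(object_ids))
    cache = {oid: {} for oid in object_ids}
    query = ("SELECT object_id, name, value FROM attributes "
             "WHERE object_id IN (" + placeholders + ")")
    for oid, name, value in conn.execute(query, object_ids):
        cache[oid][name] = value
    return cache

conn = sqlite3.connect("app.db")
attrs = preload_attributes(conn, list(range(20, 40)))
status = attrs[25].get("status", 0)  # served from memory, no db request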
Tackling (2)
Queries could, for example, be made faster by keeping a local synchronized copy of the database. However, I can't really run, for example, exactly the same database type on the client machines as on the server, so the local copy would likely be an SQLite database. This also means that I couldn't use a db-vendor-specific solution. What are my options here? What has worked well for people in these kinds of situations?
Worries
My primary worries are:
Stale data: there are plenty of queries imaginable that change the db in such a way that it prohibits an action that would seem possible to a user with stale data.
Maintainability: how loosely can I couple this new layer? It would obviously be preferable if it didn't have to know everything about my internal lazy-object system, or about every object and possible query.
Final question
What would be a good way to minimize the cost of making a query? "Good" meaning some combination of: maintainable, easy to implement, not too application-specific. If it comes down to picking any two, then so be it. I'd like to hear people talk about their experiences and what they did to solve it.
As you can see, I've thought of some problems and ways of handling them, but I'm at a loss for what would constitute a sensible approach. Since it will probably involve quite a lot of work and intensive changes to many layers in the program (hopefully as few as possible), I thought about asking all the experts here before making a final decision on the matter. It is also possible I'm just overlooking a very simple solution, in which case a pointer to it would be much appreciated!
Assuming all relevant server-side tuning has been done (for example: MySQL cache, best possible indexes, ...)
(Note: I've checked questions from users with similar problems that didn't entirely answer mine: "Suggestion on a replication scheme for my use-case?" and "Best practice for a local database cache?", for example.)
If any additional information is necessary to provide an answer, please let me know and I will duly update my question. Apologies for any spelling/grammar errors; English is not my native language.
Note about "lazy"
A small example of what my code looks like (simplified of course):
QList<MyObject> myObjects = database->getObjects(20, 40); // fetch and construct objects 20 to 40 from the db
// ...some time later
// screen filling time!
foreach (const MyObject& o, myObjects) {
    o.getInt("status", 0);                 // == db request
    o.getString("comment", "no comment!"); // == db request
    // about 3 more of these
}
At first glance it looks like you have two conflicting goals: query speed, but always up-to-date data. Your actual needs should therefore drive the decision here:
1) Your database is nearly static compared to how the application uses it. In this case, use your option 1b and preload all the data. If there's a slim chance that the data may change underneath, just give the user an option to refresh the cache (fully, or for a particular subset of the data). That way the slow access is in the hands of the user.
2) The database changes fairly frequently. In this case "perhaps" an SQL database isn't right for your needs; you may need a higher-performance dynamic database that pushes updates rather than requiring a pull. That way your application would get notified when the underlying data changes and could respond quickly. If that doesn't work, however, you want to construct your queries to minimize the number of DB library and I/O calls. For example, if you execute a sequence of select statements, your results should have all the appropriate data in the order you requested it; you just have to keep track of what the corresponding select statements were. Alternatively, if you can use looser query criteria so that a single query returns more than one row, that ought to help performance as well.
If I had, say, 70,000 objects and wanted to do statistics on them, but the statistics didn't need to be 100% accurate, what would be the best way to pull out, say, 1,000 objects, do statistics on those, and then just scale the result to approximate the statistics for the 70,000? I can't quite seem to find an efficient way to get 1,000 random objects from a queryset.
You can get random objects with:
objs = list(MyModel.objects.order_by("?")[:1000])
But the underlying ORDER BY RAND() (or RANDOM(), depending on the backend) that gets generated in the SQL isn't particularly efficient.
I know this isn't the answer you're looking for, but sometimes when doing heavy reporting you need more than Django's ORM can offer. I worked with a guy who used Django for his main application, but for some reporting tools (and a JSON service) he used Flask and SQLAlchemy, and was able to accomplish a whole lot more without having to write SQL.
There is a great post on the issue of getting random rows from the database (there are a few good points in the comments too).
The only other thing I would check is fetching the objects with the "in_bulk" method, because you may be even faster that way.
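To sketch that combination (hedged: MyModel is the model from the snippet above, and this assumes the full list of primary keys fits comfortably in memory): sample the keys in Python, then fetch the matching rows with in_bulk in a single query.
import random

ids = list(MyModel.objects.values_list("pk", flat=True))  # keys only, a cheap query
sample = random.sample(ids, 1000)
objs = MyModel.objects.in_bulk(sample)  # {pk: instance}, one SELECT ... WHERE pk IN (...)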
I am working on a Google App Engine application and I am relatively new at this.
I have already built an app in Django, and have a model using a ManyToMany field type.
I am aware that django-nonrel does not support many-to-many field types of Django. So I am considering using ListField instead.
Questions:
- What is the implication of using ListField instead of ManyToMany?
- I am aware that this means that Django's JOIN API cannot be used. But what does this mean for my app?
- Am I going to have problems when it comes to doing a search for something in a many-to-many field?
Apologies if these are programming 101 questions. I'm a designer trying to get my head around development.
Thanks
Well, as you probably know, you will be spanning the relationship more manually.
Django cannot help quite as much as when using ManyToMany, but it should not be that big a problem.
Depending on the complexity of the relationship, you might want to consider building a model just for this purpose.
I have never used that approach on GAE, since IMO it's only valid when an object has a lot of relations (more than 50, I would say) or when the lookups you plan to do will benefit from it: maybe because they start at either end of the relationship with equal frequency, or because it would be nice to be able to loop over the relationships to display them, or something along those lines.
Last time I made something on GAE I used the ListField (or ListProperty, as it was known then), since most of the objects only had about 20 related objects and the lookups would rarely go the other way.
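Roughly what that looks like (a from-memory sketch, so treat the details as approximate: ListField comes from djangotoolbox, and the model and field names are invented):
from django.db import models
from djangotoolbox.fields import ListField

class Photo(models.Model):
    title = models.CharField(max_length=100)
    # Replaces ManyToManyField: the related values are stored on the entity itself.
    tags = ListField(models.CharField(max_length=50))

# On nonrel backends, an equality filter on a ListField means "list contains":
cars = Photo.objects.filter(tags="car")
# The reverse direction needs no JOIN either; it's just the stored list:
tag_names = Photo.objects.get(pk=1).tags  # pk=1 is a placeholder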
So all in all, it's not a big deal, and I don't remember it as any kind of pain to work with or around.
Hope this was helpful, despite it being rather "IMO".