When to use Haystack/ElasticSearch vs Django's ORM

When to use Haystack/ElasticSearch vs Django's ORM - django

So I implemented Haystack with ElasticSearch a week ago within our BETA application. One thing I can notice is that getting some data (large amount) back to our users (for example listing all the users within the application) is much faster by going through Haystack then Django's ORM. Now, I will be releasing a REST service (with TastyPie) to serve the possible tablets within the next weeks, as I want to be able to access the information from iPads, Nexus tablets and so on.
One thing I was wondering, is when should I be querying the ORM vs Haystack/ElasticSearch? For example, if the user on the tablet is requesting a specific set of users, should we let TastyPie query the ORM, or go to ElasticSearch?
If we look at this answer Django: Haystack or ORM, we can all agree that a DB is made to retrieve and write data. However, could we say that retrieving faster can be faster with Haystack/ElasticSearch once the search engine was updated?
I am a bit confused as to when, should we not be querying Haystack if it is much faster?!

To make things clear I guess you're talking about querying Elasticsearch via Haystack without later instantiating any objects for your search results with data from you database.
Some points to consider besides the points mentioned in the other post:
A search engine like Elasticsearch is highly optimized when dealing with full-text searches (When doing something with SQL it highly depends on the database/engine you are using)
Queries that are involving a lot of relations/joins will most like be easier to handle with the ORM, but on the other hand you can eg save data from foreign-key relations in a denormalized fashion when using ES which could give you a performance boost. Of course you can denormalize your database tables as well but this is quite often considered as a bad practice as long as you know what you are doing, eg when solving a performance bottleneck.
ES is somehow quite easy to scale while scaling your SQL DB might be more complicated.
Most likely this is a decision that depends very much on your use case, the amount of data to process and the queries you are intending to run. So the best thing of course is - as always - to do some benchmarking yourself and compare this two solutions. But don't do any premature optimisations as one big advantage of the ORM is to keep things simple - you don't have to care much about the integrity of your data and maintain an additional system.

Related

When creating models in Django, should database normalization be taken into consideration?

I'm new to Django and web frameworks in general, but have worked with DBMSs quite often. Knowing that each class within django models maps to a table in the database, should the models be based on an ERD where tables are normalized? Would normal form matter in this case? Thanks!

Assuming you're using a SQL backend (i.e. not something like mongodb), then the same guidelines for normalisation would apply. Remember, django is just a pretty way to access the database, but in the backgroud you're still executing a series of sql queries which will benefit in the same way from normalisation.
That said, a lot of the business logic that you would normally build into the database can now be handled by django, so it is possible to get away with a slightly de-normalised structure if it makes working with it easier. The approach I usually take is to normalise where it makes sense to avoid duplication, and de-normalise where a normalised structure would result in really complicated queries (django doesn't like complicated queries). I ensure consistency in the data though the use of receivers or overloading the save method.

Django/Sqlite Improve Database performance

We are developing an online school diary application using django. The prototype is ready and the project will go live next year with about 500 students.
Initially we used sqlite and hoped that for the initial implementation this would perform well enough.
The data tables are such that to obtain details of a school day (periods, classes, teachers, classrooms, many tables are used and the database access takes 67ms on a reasonably fast PC.
Most of the data is static once the year starts with perhaps minor changes to classrooms. I thought of extracting the timetable for each student for each term day so no table joins would be needed. I put this data into a text file for one student, the file is 100K in size. The time taken to read this data and process it for a days timetable is about 8ms. If I pre-load the data on login and store it in sessions it takes 7ms at login and 2ms for each query.
With 500 students what would be the impact on the web server using this approach and what other options are there (putting the student text files into a sort of memory cache rather than session for example?)
There will not be a great deal of data entry, students adding notes, teachers likewise, so it will mostly be checking the timetable status and looking to see what events exist for that day or week.

What is your expected response time, and what is your expected number of requests per minute? One twentieth of a second for the database access (which is likely to be slow part) for a request doesn't sound like a problem to me. SQLite should perform fine in a read-mostly situation like this. So I'm not convinced you even have a performance problem.
If you want faster response you could consider:
First, ensuring that you have the best response time by checking your indexes and profiling individual retrievals to look for performance bottlenecks.
Pre-computing the static parts of the system and storing the HTML. You can put the HTML right back into the database or store it as disk files.
Using the database as a backing store only (to preserve state of the system when the server is down) and reading the entire thing into in-memory structures at system start-up. This eliminates disk access for the data, although it limits you to one physical server.

This sounds like premature optimization. 67ms is scarcely longer than the ~50ms where we humans can observe that there was a delay.
SQLite's representation of your data is going to be more efficient than a text format, and unlike a text file that you have to parse, the operating system can efficiently cache just the portions of your database that you're actually using in RAM.
You can lock down ~50MB of RAM to cache a parsed representation of the data for all the students, but you'll probably get better performance using that RAM for something else, like the OS disk cache.

I agree with some of other answers which suggest to use MySQL or PostgreSQL instead of SQLite. It is not designed to be used as production db. It is great for storing data for one-user applications such as mobile apps or even a desktop application, but it falls short very quickly in server applications. With Django it is trivial to switch to any other full-pledges database backend.
If you switch to one of those, you should not really have any performance issues, especially if you will do all the necessary joins using select_related and prefetch_related.
If you will still need more performance, considering that "most of the data is static", you actually might want to convert Django site a static site (a collection of html files) and then serve those using nginx or something similar to that. The simplest way I can think of doing that is to just write a cron-job which will loop over all needed url-configs, request the page from Django and then save that as an html file. If you want to go into that direction, you also might want to take a look at Python's static site generators: Hyde and Pelican.
This approach will certainly work much faster then any caching system however you will loose any dynamic components of the site. If you need them, then caching seems like the best and fastest solution.

You should use MySQL or PostgreSQL for your production database. sqlite3 isn't a good idea.
You should also avoid pre-loading data on login. Since your records can be inserted in advance, write django management commands and run the import to your chosen database before hand and design your models such that when a user logs in, the user would already be able to access and view/edit his or her related data (which are pre-inserted before the application even goes live). Hardcoding data operations when log in does not smell right at all from an application design point-of-view.
https://docs.djangoproject.com/en/dev/howto/custom-management-commands/
The benefit of designing your django models and using custom management commands to insert the records right way before your application goes live implies that you can use django orm to make the appropriate relationships between users and their records.
I suspect - based on your description of what you need above - that you need to re-look at the approach you are creating this application.
With 500 students, we shouldn't even be talking about caching. If you want response speed, you should deal with the following issues in priority:-
Use a production quality database
Design your application use case correctly and design your application model right
Pre-load any data you need to the production database
front end optimization comes first (css/js compression etc)
use django debug toolbar to figure out if any of your sql is slow and optimize specifically those
implement caching (memcached etc) as needed
As a general guideline.

At what point is it worth using a database?

I have a question relating to databases and at what point is worth diving into one. I am primarily an embedded engineer, but I am writing an application using Qt to interface with our controller.
We are at an odd point where we have enough data that it would be feasible to implement a database (around 700+ items and growing) to manage everything, but I am not sure it is worth the time right now to deal with. I have no problems implementing the GUI with files generated from excel and parsed in, but it gets tedious and hard to track even with VBA scripts. I have been playing around with converting our data into something more manageable for the application side with Microsoft Access and that seems to be working well. If that works out I am only a step (or several) away from using an SQL database and using the Qt library to access and modify it.
I don't have much experience managing data at this level and am curious what may be the best way to approach this. So what are some of the real benefits of using a database if any in this case? I realize much of this can be very application specific, but some general ideas and suggestions on how to straddle the embedded/application programming line would be helpful.
This is not about putting a database in an embedded project. It is also not a business type application where larger databases are commonly used. I am designing a GUI for a single user on a desktop to interface with a micro-controller for monitoring and configuration purposes.
I decided to go with SQLite. You can do some very interesting things with data that I didn't really consider an option when first starting this project.

A database is worthwhile when:
Your application evolves to some
form of data driven execution.
You're spending time designing and
developing external data storage
structures.
Sharing data between applications or
organizations (including individual
people)
The data is no longer short and
simple.
Data Duplication
Evolution to Data Driven Execution
When the data is changing but the execution is not, this is a sign of a data driven program or parts of the program are data driven. A set of configuration options is a sign of a data driven function, but the whole application may not be data driven. In any case, a database can help manage the data. (The database library or application does not have to be huge like Oracle, but can be lean and mean like SQLite).
Design & Development of External Data Structures
Posting questions to Stack Overflow about serialization or converting trees and lists to use files is a good indication your program has graduated to using a database. Also, if you are spending any amount of time designing algorithms to store data in a file or designing the data in a file is a good time to research the usage of a database.
Sharing Data
Whether your application is sharing data with another application, another organization or another person, a database can assist. By using a database, data consistency is easier to achieve. One of the big issues in problem investigation is that teams are not using the same data. The customer may use one set of data; the validation team another and development using a different set of data. A database makes versioning the data easier and allows entities to use the same data.
Complex Data
Programs start out using small tables of hard coded data. This evolves into using dynamic data with maps, trees and lists. Sometimes the data expands from simple two columns to 8 or more. Database theory and databases can ease the complexity of organizing data. Let the database worry about managing the data and free up your application and your development time. After all, how the data is managed is not as important as to the quality of the data and it's accessibility.
Data Duplication
Often times, when data grows, there is an ever growing attraction for duplicate data. Databases and database theory can minimize the duplication of data. Databases can be configured to warn against duplications.
Moving to using a database has many factors to be considered. Some include but are not limited to: data complexity, data duplication (including parts of the data), project deadlines, development costs and licensing issues. If your program can run more efficiently with a database, then do so. A database may also save development time (and money). There are other tasks that you and your application can be performing than managing data. Leave data management up to the experts.

What you are describing doesn't sound like a typical business application, and many of the answers already posted here assume that this is the kind of application you are talking about, so let me offer a different perspective.
Whether or not you use a database for 700 items is going to depend greatly on the nature of the data.
I would say that, about 90% of the time at this scale, you will benefit from a light-weight database like SQLite, provided that:
The data may potentially grow substantially larger than what you are describing,
The data may be shared by more than one user,
You may need to run queries against the data (which I don't think you're doing right now), and
The data can easily be described in table form.
The other 10% of the time, your data will be highly structured, hierarchical, object-based, and doesn't neatly fit into the table model of a database or Excel table. If this is the case, consider using XML files.
I know developers instinctively like to throw databases at problems like this, but if you are currently using Excel data to design user interfaces (or display configuration settings), rather than display a customer record, XML may be a better fit. XML is more expressive than either Excel or database tables, and can be easily manipulated with a simple text editor.
XML parsers and data binders for C++ are easy to find.

I recommend you to introduce a Database in your app, your application will gain flexibility and will be easier to maintain and to improve with new features in the future.
I would start with a lightweight file based db like Sqlite.
With a well designed db you'll have:
Reduced data redundancy
Greater data integrity
Improved data security
Last but not least, using a database will save you from the Excel import/update/export Hell!

Reasons for using a database:
Concurrent writes. It's easy to achieve concurrency in databases
Easy querying. SQL queries tend to be much concise than procedural code to search data. UPDATEs, INSERT INTOs can also do lots of stuff with very little code
Integrity. Constraints are very easy to define and are enforced without writing code. If you have a non-null constraint, you can rest assured that the value won't be null, no need to write checks anywhere. If you have a foreign key constraint in place, you won't have "dangling references".
Performance over large datasets. Indexing is very simple to add to an SQL database
Reasons for not using a database:
It tends to be an extra dependency (although there exist very lightweight databases- I like H2 for Java, for instance)
Data not well suited to a relational schema. Things that are basically key/value maps. XML (although databases often support XPath, etc.).
Sometimes files are more convenient. They can be diff'ed, merged, edited with a plain text editor, etc. Sometimes spreadsheets can be more practical (you don't have to build an editor- you can use a spreadsheet program)
Your data is already somewhere else

When you have a lot of data that you're not sure how they will be exploited in the future.
For example you might want to add an SQLite database in an embedded application that need to register statistics that you're not sure how will be used. Later you send the full database for injection in a bigger one running on a central server and those data can easily be exploited, using requests.
In fact, if your application's purpose is to "gather data" then having a database is a must have.

I see quite a few requirements that well met by databases:
1). Ad hoc queries. Find me all the {X} that meet criteria Y
2). Data with structure that can benefit from normalisation - factoring out common values into separate "tables". You can save space and reduce the possibility of inconsistency this way. Once you've done this then those ad-hoc queries start to be really useful.
3). Large data volumes. Professional database are very good at making good use of resoruces, clever query optmisations and paging strategies. Trying to write this stuff yourself is a real challenge.
You're clearly not needing that last one, but the other two, maybe do apply to you.

Don't forget that the appropriate database can be quite different depending on your requirements (and don't forget that a text file could be used as a database if you're requirements are simple enough - for example, config files are just a specific kind of database). Such parameters might be:
number of records
size of data items
does the database need to be shared with other devices? Concurrently?
how complex are the relationships between the various pieces of data
is the database read only (created at build time and not changed, for example)?
does the database need to be updated by multiple entities concurrently?
do you need to support complex queries?
For a database with 700 entries, an in-memory sorted array loaded from a text file might well be appropriate. But I could also see the need for an embedded SQL database or maybe having the controller request data from the database over a network connection depending on what the various requirements (and resource limitations) are.

There isn't a specific point at which a database is worthwhile. Instead I usually ask the following questions:
Is the amount of data the application uses/creates growing?
Is the upper limit of this data growth unknown (or unclear)?
Will the application need to aggregate or filter this data?
Could there be future uses of the data that may not be obvious right now?
Is performance of data retrieval and/or storage important?
Are there (or could there be) multiple users of the application who share data?
If I answer 'Yes' to most of these questions I almost always choose a database (as opposed to other options such as XML/ini/CSV/Excel/text files or the filesystem).
Also, if the application will have many users who could be accessing the data concurrently, I'll lean towards a full database server (MySQL, SQl Server, Oracle etc).
But often in a single user (or small concurrency) situation, a local database such as SQLite cannot be beaten for portability and ease of deployment.

To add a negative: not suitable for real-time processing, due to non-deterministic latency. However, It would be quite ample for looking up and setting operating parameters, for instance during startup. I would not put database accesses on critical time paths.

You don't need a database if you have a few thousand rows in one or two tables to handle in a single user app (for the embedded point).
If it is for multiple users (concurrent access, locking) or the need of transactions you definitly should consider a database.
Handling complex datastructures in normalized tables and maintain integrety, or a huge amount of data would be another indication you should use a database.

It sounds like your application is running on a desktop computer and simply communicating to the embedded device.
As such using a database is much more feasible. Using one on an embedded platform is a much more complex issue.
On the desktop front I use a database when there is the need to store new information continuously and the need to extract that information in a relational way. What i don't use databases for is storing static information, information i read once at load and thats it. The exception is when the application has many users and there is the need to storage this information on a per user basis.
It sounds be to me like your collecting information from your embedded device, storing it somehow, then using it later to display via a GUI.
This is a good case for using a database, especially if you can architect the system such that there is a data collection daemon that manages the continuous communication with the embedded device. This app can then just write the data into the database. When the GUI is launched it can extract the data for display.
Using the database will also ease your GUI develop if you need to display different views, such as "show me all the entries between 2 dates". With a database you just ask it for the correct values to display with a proper SQL query and the GUI displays whatever comes back allowing you to decouple much of the "business logic" code from the GUI.

We are also facing a similar situation. We have set of data coming from different test setups and it is currently being dumped into excel sheets, processed using Perl or VBA.
We found out this method had lot of problems:
i. Managing data using excel sheets is quite cumbersome. After some time you have a whole lot of excel sheets and there is no easy way to retrieve required data from it.
ii. People start sending the excel sheets to and fro for comments and review through e-mails. E-Mail becomes the primary mode of managing the comments related to the data. These comments are lost at a later point of time and there is no way of retrieving it back.
iii. Multiple copies of the files get created and changes in one copy are not reflected in the other - there is no versioning.
This is for the same reasons we have decided to move to a database based solution and are currently working on it. Let me summaries what we are trying to do:
i. The database is in a central server accessible by PC in all the test setups.
ii. All the data goes into a temporary location (local hard disk in files) as soon as it is generated. From the files, it is pushed into database by a process running in the background (so even if there is a network problem, data will be present in the local files system).
iii. We have a web based application which allows users to log in and access data in the format they want. The portal will allow them to add comment, generate different kind of reports, share it with other users after review etc. It will also have the ability to export data into excel sheet, just in case you need to take it with you.
Let know if this can be better implemented.

"At what point is it worth using a database?"
If and when you've got data to manage ?

Have you tried to use SQLite as the query engine for your raw database?

I have made a custom report generator for our database (Oracle Berkeley DB engine). Now it's time for me to add more flexibility and I am in dilemma. Do a partial or a fundamental redesign?
Lets assume that I have plenty of time.
I can only read the database, I don't have the right to modify it.
Inspired from Query Anything with SQLite article, I would like to let SQLite engine to do the dirty work (grouping, filtering, etc).
Have you tried it? Any examples? What about performance issues?

It works fine for what I am using it :-) However I don't use it together with another database, just standalone. There's a list of Well-known Users of SQlite on their website.
You need to tell us more about your usecase to make any speculations about performance, but I'd rather make a POC and measure performance Long-held, incorrect programming assumptions
There is a nice quickstart article on the sqlite site.
Here's the C/C++ API Reference.
I assume you should be able to create a temporary SQLite table by initially querying your other DB and inserting the data into the temporary SQLite table. Then you can use different querys on that temporary table to do your grouping, filtering, etc.

Query building in a database agnostic way

In a C++ application that can use just about any relational database, what would be the best way of generating queries that can be easily extended to allow for a database engine's eccentricities?
In other words, the code may need to retrieve data in a way that is not consistent among the various database engines. What's the best way to design the code on the client side to generate queries in a way that will make supporting a new database engine a relatively painless affair.
For example, if I have (MFC)code that looks like this:
CString query = "SELECT id FROM table"
results = dbConnection->Query(query);
and we decide to support some database that uses, um, "AVEC" instead of "FROM". Now whenever the user uses that database engine, this query will fail.
Options so far:
Worst option: have the code making the query check the database type.
Better option: Create query request method on the db connection object that takes a unique query "code" and returns the appropriate query based on the database engine in use.
Betterer option: Create a query builder class that allows the caller to construct queries without using any SQL directly. Once the query is completed, caller can invoke a "Generate" method which returns a query string approrpriate for the active database engine
Best option: ??
Note: The database engine itself is abstracted away through some thin layers of our own creation. It's the queries themselves are the only remaining problem.
Solution:
I've decided to go with the "better" option (query "selector") for two reasons.
Debugging: As mentioned below, debugging is going to be slightly easier with the selector approach since the queries are pre-built and listed out in a readable form in code.
Flexibility: It occurred to me that there are some databases which might have vastly better and completely different ways of solving a particular query. For example, with Access I perform a complicated query on multiple tables each time because I have to, but on Sql Server I'd like to setup a view. Selecting from the view and from several tables are completely different queries (i think) and this query selector would handle it easily.

You need your own query-writing object, which can be inherited from by database-specific implementations.
So you would do something like:
DbAgnosticQueryObject query = new PostgresSQLQuery();
query.setFrom('foo');
query.setSelect('id');
// and so on
CString queryString = query.toString();
It can get pretty complicated in there once you go past simple selects from a single table. There are already ORM packages out there that deal with a lot of these nuances; it may be worth at looking at them instead of writing your own.

Best option: Pick a database, and code to it.
How often are you going to up and swap out the database on the back end of a production system? And even if you did, you'd have a lot more to worry about than just minor syntax issues. (Major stuff like join syntax, even datatypes can differ widely between databases.)
Now, if you are designing a commercial application where you want the customer to be able to use one of several back-end options when they implement it, then you may have to specify "we support Oracle, MS SQl, or MYSQL" and code to those specific options.

All of your options can be reduced to
Worst option: have the code making the query check the database type.
It's just a matter of where you're putting the logic to check the database type.
The option that I've seen work best in practice is
Better option: Create query request method on the db connection object that takes a unique query "code" and returns the appropriate query based on the database engine in use.
In my experience it is much easier to test queries independently from the rest of your code. It gets a lot harder if you have objects that are piecing together queries from bits of syntax, because then you have to test the query-creation code and the query itself.
If you pull all of your SQL out into separate files that are written and maintained by hand, you can have someone who is an expert in SQL write them (you can still automate the testing of these queries). If you try to write query-generating functions you'll essentially have a C++ expert writing SQL.

Choose an ORM, and start mapping.
If you are to support more than one DB, your problem is only going to get worse.
And just think of DB that are comming - cloud dbs with no (or close to no) SQL, and Object databases.

Take your queries outside the code - put them in the DB or in a resource file and allow overrides for different database engines.
If you use SPs it's potentially even easier, since the SPs abstract away your database differences.

I would think that what you would want to do, if you needed the ability to support multiple databases, would be to create a data provider interface (or abstract class) and associated concrete implementations. The data provider would need to support your standard query operators and other common, supported functionality required support your query operations (have a look at IEnumerable extension methods in .NET 3.5). Each concrete provider would then translate these into specific queries based on the target database engine.
Essentially, what you do is create a database abstraction layer and have your code interact with it. If you can find one of these for C++, it would probably be worth buying instead of writing. You may also want to look for Inversion of Control (IoC) containers for C++ that would basically do this and more. I know of several for Java and C#, but I'm not familiar with any for C++.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js