Neo4j.rb slow queries - ruby-on-rails-4

I have Neo4j v2.1.6 (default configuration) and Neo4j.rb v4.1.0. All queries are slow, around 50 ms, and I have only 5 nodes in the DB.
For example:
User.find_by(person_id: 826268332)
CYPHER 47ms MATCH (n:`User`) WHERE (n.person_id = {n_person_id}) RETURN n LIMIT {limit_1} | {:n_person_id=>826268332, "limit_1"=>1}
Where could the problem be?

I'm one of the core maintainers of Neo4j.rb, along with Brian Underwood, who replied above. This is not exactly a full answer since we need to know more about your system to answer that, but I'm posting this here because it's too much for one comment.
My money is on something wrong with your DB or your system. We had a similar issue reported -- slow queries when working locally, with no identifiable cause -- for a user running Windows; see "Neo4j.rb version 3.0 slow performance RoR, over 1024ms for all queries". We weren't able to pin it down. Locally, running that exact same query, I see 13 ms the first time I run it and ~3 ms every time after that. Indexing won't make a difference in a DB that small.
Ways to limit the chance of a problem and generally improve performance:
Use Ruby MRI 2.2.0
Use Neo4j 2.1.6 or 2.2.0
Use Mac or Linux, not Windows
Require the oj and oj_mimic_json gems in your app
You will see longer responses for a query like that if your db and app server are in two different networks.
Regarding the comment that this simple query is much faster in MongoDB and PostgreSQL: yes, it's going to be. Both of those return simple queries faster than Neo4j.rb for at least two reasons:
The Ruby gems for connecting to those DBs do not use a REST interface, they use custom binary protocols.
Both of those are optimized for returning single records quickly; Neo4j is optimized for returning large groups of records quickly.
Before releasing Neo4j.rb 4.0, I did a ton of benchmarks against Postgres and MongoDB and found the same results: they crush us when returning single objects. (PostgreSQL is amazing technology in general.) As soon as you start looking for related objects, though, things balance out, and as you add complexity, the difference becomes even more significant. I don't have any numbers to share, unfortunately, but I'll write a blog post about it when I have some time.

That is strange. In the neo4j gem I often see simple queries run in around 1-5 ms.
For debugging, what if you did this?
User.where(person_id: 826268332).first
Also, what does this give you?
puts User.where(person_id: 826268332).to_cypher


How to link TypeSense with my MongoDB database?

I have a MongoDB database that contains a collection of about 3 billion documents, and it will keep growing in the future.
I've set up a full-text search index within that database. It is fast, but I'm not really satisfied with the results of my search queries, and I would like a faster search engine for regex searches.
I heard that linking Solr / Elasticsearch to Mongo would be a great way to make queries faster. What about Typesense?
Thank you!
I know this is a late response, but I do have some advice.
First of all, I'm making the assumption that your Mongo database is user generated content.
Yes, Typesense would absolutely work for 3 billion documents, as it is nearly infinitely scalable, but it depends on the amount of RAM you have. Also, don't forget that you cannot search for an exact keyword with Typesense; if that's something you need, you can eliminate Typesense.
Elasticsearch is the best solution for you, though. It's not easy to set up, but there are many guides on how to do so. 3 billion records is massive, but Elasticsearch is built precisely for datasets like that.
If you're not willing to set up Elasticsearch, then keep your current setup, as a couple hundred milliseconds shouldn't be an issue. Elasticsearch can be very costly for large datasets as well, but it has the most complete feature set.
Typesense has made a comparison of the 4 most common search engines and their features here: https://typesense.org/typesense-vs-algolia-vs-elasticsearch-vs-meilisearch/

Django Debug Toolbar Target?

I've got a web page loading pretty slowly, so I installed the Django Debug Toolbar. I'm pretty new at this, so I'm trying to figure out what I can do with it.
I can see the database did 264 queries in 205 ms. That looks kind of high. I'm pretty sure I can cut down on that by adding some indexes and just writing better queries. But my question is: what is a "good" number that I should be trying to hit here? What is generally accepted as "fast enough", beyond which further optimization isn't really worth it? 50 ms? 20 ms?
Also, on this same page it's showing 2500 ms of user CPU. That sounds terrible to me, and I'm surprised it's so much higher than the database, which I assumed was the bottleneck. Is this maybe an indication that I am trying to do too much in Python code instead of at the database layer? Would reducing the number of SQL queries help with CPU (waiting between queries)? Again, is there some well-known target response time I should be aiming for?
I'm looking for a snappy response from my clients. Right now when I click around I can feel a "pregnant pause" before the pages load.
By default, accessing related model fields results in one extra query per model per row. Look into select_related() and prefetch_related(); this usually cuts down the number of queries and speeds things up by a lot. I think the debug toolbar shows you the actual queries; if not, enable SQL logging before doing any query optimization. Once you've cut the number of queries to a minimum (no extra queries per row), look for the slowest query and use SQL's EXPLAIN to see whether indexes are being used. This is another area where things can get slow, especially on large datasets.
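For illustration, here is a hedged sketch (the models are made up) of the difference this makes:

    # models.py -- hypothetical models used only for illustration
    from django.db import models

    class Author(models.Model):
        name = models.CharField(max_length=100)

    class Book(models.Model):
        author = models.ForeignKey(Author, on_delete=models.CASCADE)
        title = models.CharField(max_length=200)

    # N+1 pattern: one query for the books, plus one extra query per row for book.author
    for book in Book.objects.all():
        print(book.author.name)

    # select_related() follows the foreign key with a JOIN, so this is a single query
    for book in Book.objects.select_related("author"):
        print(book.author.name)

    # prefetch_related() is the counterpart for reverse and many-to-many relations (two queries total)
    for author in Author.objects.prefetch_related("book_set"):
        print([b.title for b in author.book_set.all()])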
Usually the database is the bottleneck, unless you are doing some major looping in your code. If you believe the Python code is slow, profile it; otherwise it's just guessing.

How to choose server for production release of my Django application?

My company is at the very end of the development process of an (awesome :)) web application. This app will be available as an online service for (hopefully) a significant number of users. This is our biggest Django release so far, and as we prepare to release, some questions about deployment have to be answered.
Q1: How do we determine the required server parameters for a predicted X number of users / Y hits per minute (or some other factor)?
Q2: What hosting solution (shared/VPS/dedicated) is worth considering?
Q3: What optimizations can be done in the first place?
I know this is very subjective and depends on the size of the site, code quality and other factors, but I'm very interested in your experiences with Django (and not only Django) deployment. Any hints, links, or advice are welcome. Thanks in advance.
Which hosting solution you want also depends on how much of your server you want to take care of yourself (from upgrades to backups), and you should decide whether you want that responsibility or would rather leave it to someone else.
I think you can only really determine the necessary requirements and bottlenecks in your application through testing with the estimated load. Try to simulate as many requests as you expect, and think about caching (memcached is the best option you have). When working on caching, one great tool is the Django Debug Toolbar (http://github.com/robhudson/django-debug-toolbar), which can also show you a lot about how many database hits you have (don't take the timings it shows for granted, but analyse them and keep an eye on the number of hits) and, for example, how many templates are rendered.
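As a rough, hedged sketch of what that caching setup might look like (the backend name is the modern Django 3.2+ one, and the view is made up):

    # settings.py -- point Django's cache framework at a local memcached instance
    CACHES = {
        "default": {
            "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",  # older Django versions used MemcachedCache
            "LOCATION": "127.0.0.1:11211",
        }
    }

    # views.py -- cache the rendered response of an expensive view for five minutes
    from django.http import HttpResponse
    from django.views.decorators.cache import cache_page

    @cache_page(60 * 5)
    def landing_page(request):
        # hypothetical expensive work (queries, template rendering) would go here
        return HttpResponse("hello")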
If your system grows, the first thing to consider is serving your static media files from another location. As for the web server, I have had great experiences using lighttpd instead of the heavier Apache, but you'll have to evaluate that for yourself.
Also take into consideration which database backend to use: in shared environments there is, in most cases, a bigger load on the MySQL servers than on the PostgreSQL servers, but again, evaluate what works best for you.
You can get some guesses here, but to get a halfway reasonable performance estimate you have to measure the performance of your application yourself. You should then be able to roughly extrapolate that performance to different hardware.
Most of the time the bottleneck is the database; if possible, get enough RAM to keep it in memory.
"Web application" can encompass so many different things that we can really do no more than guess here.
As for optimization: if it fits your needs, implement some caching (e.g. with memcached); that can give you huge speed improvements.
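For example, here is a hedged sketch of caching one expensive computed value with Django's low-level cache API (the key name and the function are made up):

    from django.core.cache import cache  # uses whatever backend is configured in CACHES

    def expensive_dashboard_numbers():
        # placeholder for a slow aggregation over the database
        return {"users": 42}

    # recompute at most once every ten minutes; other requests get the cached value
    numbers = cache.get_or_set("dashboard-numbers", expensive_dashboard_numbers, 60 * 10)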

Search engine solution for Django that actually works?

The story so far:
Decided to go with Xapian as search backend because it has all search-engine features I was looking for, knows about Unicode, stemming, has few dependencies and requires no bloated app-server installation on top of it.
Tried Django and Haystack (plus xapian-haystack, the backend glue code to tie Haystack to Xapian) because it was advertised on quite some blogs as "working". Did not work. Neither django-haystack nor the xapian-haystack project provide a version combination that actually works together. MASTER from both projects yields an error from Xapian, so it's not stable at all. Haystack 1.0.1 and xapian-haystack 1.0.x/1.1.0 are not API-compatible. Plus, in a minimally working installation of Haystack 1.0.1 and xapian-haystack MASTER, any complex query yields zero results due to errors in either django-haystack or xapian-haystack (I double-verified this), maybe because the unit-tests actually test very simple cases, and no edge-cases at all.
Tried Djapian. The source-code is riddled with spelling errors (mind you, in variable names, not comments), documentation is also riddled with ambiguities and outdated information that will never lead to a working installation. Not surprisingly, users rarely ask for features but how to get it working in the first place.
Next on the plate: exploring Solr (installing a Java environment plus Tomcat gives me headaches, the machine is RAM- and CPU-constrained), or Lucene (slightly less headaches, but still).
Before I proceed spending more time with a solution that might or might not work as advertised, I'd like to know: Did anyone ever get an actual, real-world search solution working in Django? I'm serious. I find it really frustrating reading about "large problems mostly solved", and then realizing that you will never get a working installation from the source-code because, actually, all bloggers dealing with those "mostly solved problems" never went past basic installation and copy-pasting the official tutorials.
So here are the requirements:
must be able to search for 10-100 terms in one query
must handle + (term must be present) and - (term must not be present), AND/OR
must handle arbitrary grouping (i.e. parentheses around AND/OR)
must allow for Django-ORM filtering before or after fulltext-search (i.e. pre-/post-processing of results with the full set of filters that Django knows about)
alternatively, there must be a facility to bulk-fetch the result set and transform it into a QuerySet (a sketch of this follows the list)
should be light on the machine, so preferably no humongous JVM and Java-based app-server installation
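To make that QuerySet requirement concrete, here is a hedged sketch (the Article model and the search_results variable are hypothetical; any backend that can hand back primary keys would do):

    # Hypothetical glue between a search engine and the Django ORM: the engine
    # returns matching primary keys, and the ORM rebuilds a QuerySet from them.
    from myapp.models import Article  # hypothetical model with `published` and `author` fields

    matching_pks = [hit.pk for hit in search_results]  # search_results comes from the engine

    articles = Article.objects.filter(pk__in=matching_pks)
    articles = articles.filter(published=True).select_related("author")  # normal ORM filtering afterwards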
Is there anything out there that does this? I'm not interested in anecdotal evidence, or references to some blog posts that claim it should be working. I'd like to hear from someone who actually has a fully-functional setup working in the real world, under real conditions, with real queries.
EDIT:
Let me repeat again that I'm not so much interested in anecdotal evidence that someone, somewhere has a somewhat running installation working with unspecified properties. I already went there, I read all the blog posts, mailing lists, I contacted the authors, but when it came to actual implementation of real-world scenarios, nothing ever worked as advertised.
Also, and a user below brought that point up as well, considering the TCO of any project, I'm definitely not interested in hearing that someone, somewhere was able to pull it off once a vendor parachuted in an unknown number of specialists to monkey-patch the whole installation with specific domain-knowledge that's documented nowhere.
So, please, if you claim you have a working installation that actually satisfies minimum requirements for a full-fledged search (see requirements above), please provide the following so that we can all benefit from a search solution for Django that actually solves the problem:
exact Linux distribution, release version,
exact release version of Haystack (or equivalent) and release version of search backend,
exact release version of the search engine
publicly (!) available documentation how to set up all components exactly in the way that your installation was set up such that the minimal requirements above are met.
Thank you.
I have developed some Django applications with Xapian support too. The biggest of them has a Xapian database with an 8 GB index storing 2.4M documents (including forum posts, wiki entries, planet entries and blog entries), and it is still growing.
Overall I am quite happy with Xapian. It performs extremely well and is easy to use. The only thing I don't like is that Xapian won't work with mod_wsgi (except in global mode) because of a deadlock, so you are forced to use FastCGI (or connect to xapian-tcpsrv, or write your own service).
I recommend using the xapian-bindings directly. Xapian nowadays offers quite a lot of useful helpers (TermGenerator, QueryParser, etc.), which make both indexing and querying simple. In fact, there is nothing I can imagine that would justify an additional library; in my opinion they are all more complicated and don't let you index efficiently.
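For what it's worth, here is a minimal hedged sketch of direct xapian-bindings usage in Python (the database path, ID term and query text are made up); it shows the TermGenerator/QueryParser helpers mentioned above:

    import xapian

    # --- indexing ---
    db = xapian.WritableDatabase("/tmp/demo.db", xapian.DB_CREATE_OR_OPEN)
    tg = xapian.TermGenerator()
    tg.set_stemmer(xapian.Stem("en"))

    doc = xapian.Document()
    tg.set_document(doc)
    tg.index_text("How to profile a slow Django view")  # free text to index
    doc.set_data("post-42")                              # payload returned with each match
    doc.add_boolean_term("Qpost-42")                     # unique ID term
    db.replace_document("Qpost-42", doc)                 # idempotent re-indexing
    db.commit()

    # --- querying ---
    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem("en"))
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    query = qp.parse_query("profile AND (django OR view) -celery",
                           xapian.QueryParser.FLAG_BOOLEAN | xapian.QueryParser.FLAG_LOVEHATE)

    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    for match in enquire.get_mset(0, 10):
        print(match.docid, match.document.get_data())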
The only thing you need is some understanding of how Xapian works (What are terms? What are values? What is stemming and where should I use it? And so on). You can find all of those topics on the Xapian website, and as soon as you understand those concepts, dealing with Xapian becomes easy.
Also, the Xapian API is extremely stable. I started using it long before the 1.0 release and never had any problems with API changes or version conflicts. The only thing that has changed is that all those helpers (query parser, tokenizer, etc.) I had once written for my Django project are now useless, because similar classes have made their way into the Xapian core.
So, to summarize, just give the direct usage of xapian-bindings a try.
I can vouch for Django-Haystack with the Xapian backend (in the interest of full disclosure, I am the author of the xapian-haystack backend) in a real-life production environment. We currently use Haystack/Xapian on several sites, the largest of which has more than 20,000 registered users and a Xapian database with 20,000+ documents containing more than 143,000 unique terms, for a total size of ~141 MB.
As for not being able to get any combination of Haystack and the Xapian backend running, I'll admit that I was not as diligent as I should have been with my tagging, so there is some confusion with the versions. You should, however, be able to use the current master of both codebases without any issue. If this is not the case, I'd be more than happy to assist with problems. You'll need to be a little bit more specific about the issue, though; simply saying "it did not work" is not enough information.
Daniel and I both do our best to respond to any issues opened on Github within a timely manner. Also, we're both usually available on the #haystack IRC channel during the day and the django-haystack Google Group.
Versions used:
Haystack 1.0BETA with Xapian-Haystack 1.1.0BETA
Haystack 1.0.1FINAL with Xapian-Haystack 1.1.3BETA
Most of the sites we've deployed with Haystack have been running Ubuntu 8.04 LTS with Xapian 1.0.5
Short answer: No.
We bailed and went with a Google Custom Search. Although the site has over 10,000 possible page views, we keep the sitemap feed down to the main 4,000 pages or so and it costs $250/year, which is about 2 hours of my time. The customer is happy and he feels comfortable with the results.
I'd love to see someone come up with a good FOSS solution, but in a commercial situation the TCO has got to make economic sense.
The details you requested.
exact Linux distribution, release version - Ubuntu 9.04 & 9.10
exact release version of Haystack (or equivalent) - Haystack 1.0 as well as master
release version of search backend - The Solr & Whoosh backends included with Haystack
exact release version of the search engine - Solr 1.3, Solr 1.4 & Whoosh 0.3.15
publicly (!) available documentation how to set up all components exactly in the way that your installation was set up such that the minimal requirements above are met.
http://docs.haystacksearch.org/dev/installing_search_engines.html#solr (or #whoosh)
Beyond this, it's the standard configuration bits from the tutorial, plus any additional overrides from (which I can't link to, thanks Stack Overflow) as needed.
As the maintainer of Haystack, I'm actively running all of the above previous setups. The smallest Haystack installation (Haystack 1.0 + Whoosh) is ~600 documents. A slightly larger one (Haystack master + Solr 1.4) is ~4000 documents. The largest deployment I'm aware of (Haystack master + Solr 1.4) is ~3 million documents.
I generally try to avoid Stack Overflow, so don't be surprised if you see nothing further from me. The mailing list is the best place for support, but given your responses thus far, I'm sure you'd rather just trash me here.
I (and my colleagues) have successfully used Haystack to achieve a fairly good search functionality.
It is easy to start with Haystack and the Whoosh backend, and to change to the Apache Solr backend when Whoosh's performance is no longer acceptable.
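As an illustration of how small that swap is, here is a hedged sketch of Haystack connection settings (this is the later 2.x-style configuration, not the 1.x settings of that era, and the paths/URLs are made up):

    # settings.py -- start with Whoosh, then switch ENGINE/URL when moving to Solr
    HAYSTACK_CONNECTIONS = {
        "default": {
            "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
            "PATH": "/var/search/whoosh_index",  # illustrative path
        }
        # Solr equivalent:
        # "default": {
        #     "ENGINE": "haystack.backends.solr_backend.SolrEngine",
        #     "URL": "http://127.0.0.1:8983/solr",
        # }
    }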
We really have to get around to writing a detailed post about it, with links to the projects where it works.
For now I can suggest you to have a look at this search: http://www.webdevjobshq.com/search/?q=rails implemented using Haystack with Apache-Solr backend. Or this: http://www.govbuddy.com/search/?q=Roy
Have you considered Sphinx? What are you using as your data store? It has a MySQL engine that works great. I think it meets most of your requirements, except I'm not exactly certain how nicely it can be tied into the Django ORM.
I'm heavily considering using Sphinx in one of my own Django Apps to improve performance on an auto-suggest field that does a prefix and infix search on a corpus of 3.5 million records. But I haven't got around to implementing it yet, so I can't speak to Django+Sphinx integration. My only Sphinx experience is with the MySQL Engine and directly querying MySQL.
I use Djapian. It was quite simple to install and it works great. There is an actual tutorial that covers basic use cases and shows the entire integration process.
Yes, it has some ambiguities, but the issue tracker is open and the authors rapidly fix bugs and add features.

How to perform profiling for a website?

I currently have a Django site, and it's kind of slow, so I want to understand what's going on. How can I profile it so as to differentiate between:
effect of the network
effect of the hosting I'm using
effect of the javascript
effect of the server side execution (python code) and sql access.
any other effect I am not considering due to the massive headache I happen to have tonight.
Of course, for some of them I can use firebug, but some effects are correlated (e.g. javascript could appear slow because it's doing slow network access)
Thanks
Client side:
Check with Firebug which page components take long to load, and how long the browser needs to render the page after loading completes. If everything loads fast but rendering takes its time, then your HTML/CSS/JS is probably the problem; otherwise it's server side.
Server side (I assume you are on some Unix-like server):
Check the web server with some small static content (a small GIF or a little HTML page) using ApacheBench (ab, part of the Apache web server package) or httperf; the server should be able to answer at least 100 requests per second (of course this depends heavily on the size of your test content, web server type, hardware and other factors, so don't take that 100 too seriously). If that looks good,
test Django with ab or httperf on a "static view" (one that doesn't use a database object). If that's slow, it's a hint that you need more CPU power; check CPU utilization on the server with top. If that's OK, the problem might be in the way the web server executes the Python code.
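As a tiny hedged illustration of such a "static view" (the names are made up), something like this exercises URL routing and the Python/WSGI stack without ever touching the database:

    # views.py -- a deliberately trivial view for benchmarking with ab/httperf:
    # it goes through the middleware and WSGI stack but never hits the ORM.
    from django.http import HttpResponse

    def ping(request):
        return HttpResponse("pong")

    # urls.py: add  path("ping/", ping)  to urlpatterns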
If serving semi-static content is OK, your problem might be the database or I/O. Database problems are a wide field; here is some general advice:
Check I/O throughput with iostat. If you see lots of writes, then you need a better disk subsystem (faster RAID, SSDs), or you need to optimize your application to write less.
If it's lots of reads, the host might not have enough RAM dedicated to the file system cache, or your database queries might not be optimized.
If I/O looks OK, then the database might not be suited to your workload or might not be correctly configured. Logging slow queries and monitoring database activity, locks, etc. might give you some idea.
If you let us know what hardware/software you use, I might be able to give more detailed advice.
Edit/PS: I forgot one thing: of course your app might have a bad design and do lots of unnecessary/inefficient things...
Take a look at the Django Debug Toolbar - that'll help you with the server-side code (e.g. which database queries ran and how long they took), and it is generally a great resource for Django development.
The other, non-Django-specific bits you could profile with YSlow.
There are various tools, but problems like this are not hard to find because they are big.
You have a problem, and when you remove it, you will experience a speedup. Suppose that speedup is some factor, like 2x. That means the program is spending 50% of its time waiting for the slow part. What I do is just stop it a few times and see what it's waiting for. In this case, I would see the problem 50% of the times I stop it.
First I would do this on the client side. If I see that the 50% is spent waiting for the server, then I would try stopping it on the server side. Then if I see it is waiting for SQL queries, I could look at those.
What I'm almost certain to find out is that more work is being requested than is actually needed. It is not usually something esoteric like a "hotspot" or an "algorithm". It is usually something dumb, like doing multiple queries when one would have been sufficient, so as to avoid having to write the code to save the result from the first query.
Here's an example.
First things first: make sure you know which pages are slow. You might be surprised. I recommend django_dumpslow.