Limits of django-haystack with Solr 4.0 - django

We are planning to use django-haystack with Solr4.0 (with near real time search) for our web app, and I was wondering if anyone could advice on the limits of using haystack (when compared to using solr directly). i.e is there a performance hit/overhead of using django-haystack? We have around 3 million+ documents which would need indexing + an additional (estimated) 100k added everyday.
Ideally, I'd think we need a simple API over Solr4 - but I am finding it hard to find anything specific to python which is still actively maintained (except django-haystack ofcourse). I'd appreciate any guidance on this.

It seems like your question could be rephrased "How has Haystack burned you?" Haystack is nice for some things, but has also caused me some headaches on the job. Here are some things I've had to code around.
You mentioned indexing. Haystack has management commands for rebuilding the index. We'll use these for nuke and pave rebuilding during testing, but for reindexing our production content we kind of hit the wall. The command would take forever, you wouldn't know where it was in terms of progress, and if it failed you were screwed and had to start all over again. We reached a point where we had too much content and it would fail reliably enough. We switched to making batches of content and reindexing these in celery tasks. We made a management command to do the batching and kick off all those tasks. This has been a little more robust in the face of partial failures and, even better, it actually finishes.The underlying task will use haystack calls to convert a database object to a solr document -- this ORMiness is nice and hasn't burned me yet. We edit our solr schema by hand, though.
The query API is okay for simple stuff. If you are doing more convoluted solr queries, you can find yourself just feeding in raw queries. This can lead to spaghetti code. We eventually pushed that raw spaghetti into request handlers in the solrconfig file instead. We still use haystack to turn on highlighting or to choose a specific request handler, but again, I'm happier when we keep it simple and we did hack on a method to add arbitrary parameters as needed.
There are other assumptions about how you want to use solr that seem to get baked in to the code. Haystack is the only open source project where I actually have some familiarity with the code. I'm not sure if that's a good thing, since it hasn't always been by choice. We have plenty of layer cake code that extends a haystack class and overrides it to do the right thing. This isn't terrible, but when you have to also copy and paste haystack code into those overridden methods then it starts being more terrible.
So... it's not perfect, but parts are handy. If you are on the fence about writing your own API anyway, using haystack might save you some trouble, especially when you want to take all your solr results and pass them back into django templates or whatever. It sounds like with the steady influx of documents, you will want to write some batch indexing jobs anyway. Start with that, prepare to be burnt a little and look at the source when that happens, it's actually pretty readable.

Related

Ember Way to Add Rss Feed without third party widget, Front-end only

I am using Ember 3.0 at the moment. Wrote my first lines of code in ANY language about 1 year ago (I switched careers from something totally unrelated to development), but I quickly took to ember. So, not a ton of experience, but not none. I am writing a multi-tenant site which will include about 20 different sites, all with one Ember frontend and a RubyOnRails backend. I am about 75% done with the front end, now just loading content into it. I haven’t started on the backend yet, one, because I don’t have MUCH experience with backend stuff, and two, because I haven’t needed it yet. My sites will be informational to begin with and I’ll build it up from there.
So. I am trying to implement a news feed on my site. I need it to pull in multiple rss feeds, perhaps dozens, filtered by keyword, and display them on my site. I’ve been scouring the web for days just trying to figure out where to get started. I was thinking of writing a service that parses the incoming xml, I tried using a third party widget (which I DON’T really want to do. Everything on my site so far has been built from scratch and I’d like to keep it that way), but in using these third party systems I get some random cross domain errors and node-child errors which only SOMETIMES pop up. Anyway, I’d like to write this myself, if possible, since I’m trying to learn (and my brain is wired to do the code myself - the only way it sticks with me).
Ultimately, every google result I read says RSS feeds are easy to implement. I don’t know where I’m going wrong, but I’m simply looking for:
1: An “Ember-way” starting point. 2: Is this possible without a backend? 3: Do I have to use a third party widget/aggregator? 4: Whatever else you think might help on the subject.
Any help would be appreciated. Here in New Hampshire, there are basically no resources, no meetings, nothing. Thanks for any help.
Based on the results I get back when searching on this topic, it looks like you’ll get a few snags if you try to do this in the browser:
CORS header issues (sounds like you’ve already hit this)
The joy of working with XML in JavaScript (that just might be sarcasm 😉, it’s actually unlikely to be fun)
If your goal is to do this as a learning exercise, then doing it Javascript/Ember will definitely help you learn lots of new things. You might start with this article as a jumping off point: https://www.raymondcamden.com/2015/12/08/parsing-rss-feeds-in-javascript-options/
However, if you want to have this be maintainable for the long run and want things to go quickly and smoothly, I would highly recommend moving the RSS parsing system into your backend and feeding simple data out to Ember. There are enough gotchas and complexities to RSS feeds over time that using a battle-tested library is going to be your best way to stay sane. And loading that type of library up in Ember (while quite doable) will end up increasing your application size. You will avoid all those snags (and more I’m probably not thinking of) if you move your parsing back to the server ...

What step would u take to refactor a ball of mud CF app into something modern and maintainable

I am going to pick up a task that no one has ever attempted to try at my workplace. It is a CF app first written using CF 2.0 (Yes, 2.0!) 10 yrs ago with > 10 cfscheduler tasks.. We explored the idea of rewriting the app, but 10 yrs of work simply can't be rewrote in 2-3 months.
What steps shall one take to modernize the app into a maintainable, extendable state? The one that I keep hearing is "write tests", but how can I write tests when it wasn't even in MVC?
Any advice would be appreciated, thanks!
p.s. I should thank Allaire, Macromedia and Adobe for keeping CF so freaking backward compatible all the way back to 2.0!
btw, what's the most modern, maintainable state for a CF app without MVC framework? or should my end goal be ultimately refactoring it into a MVC app?? I can't image how many links I will break if I do... seems impossible... thought?
update: found 2 related Q's...
https://softwareengineering.stackexchange.com/questions/6395/how-do-you-dive-into-large-code-bases
https://softwareengineering.stackexchange.com/questions/29788/how-do-you-dive-into-a-big-ball-of-mud
I am not sure if you need to move the whole site to a MVC application. Recently I did helped with an site that was not MVC, that still had a library with the Models, Services and Assemblers in a clean and organized manor. It worked great, and we didn't need to do anything more than what was necessary.
That being said, my first step would be to organize the spaghetti code into their different purposes. It may be hard to properly create the models, but at the very least you could break out the services like functions from the pages. With that done, it should be a lot cleaner already.
Then, I would try to take the repeated code and put them into custom tags. It will make the code more reusable, and easier to read.
Good Luck!
Consider, whether a full fledged framework is really necessary. In its most basic form a framework is merely highly organized code. So if procedural, that is well organized, works leave it.
Keep in mind something like FW/1 as migration path can be better than say Coldbox if you don't need all the other stuff.
Lastly, consider this I was able to migrate a 4.5 almost 70% of the way to Coldbox (very simple and really more about directory and file organization versus IOC, plugins, modules, etc...) just using a few extra lines per file plus onMissingMethod functions.
Good Luck.
I had to deal with a similar situation for about two years at my last job, however, it wasn't quite as old as yours. I think I was dealing with code from 4.0 on. There's no silver bullet here, and you'll need to be careful that you don't get too caught up in re-factoring the code and costing your company tons of money in the process. If the app works as it is rewriting it would be a pretty big wast of money.
What I did was update small chunks at a time, I wouldn't even refactor whole templates at a time, just small portions of one at a time. If I saw a particular ugly loop, or nested if statements I'd try to clean it up the best I could. If the app can be broken down into smaller modules or areas of functionality and you have the extra time you can try to clean up the code a module at a time.
A good practice I heard from the Hearding Code podcast is create a testing harness template that would use a particular cfm page that has a known output that you can re-run to make sure that it still has the same output once you've done refactoring. Its not nearly as granular as a unit test, but its something and something is almost always better than nothing, right?
I suspect that the reason this app hasn't been touched for years is because for the most part it works. So the old adage "if it ain't broken don't fix it" probably applies; However, code can always be improved :)
The first thing I'd do is switch to Application.cfc and add some good error logging. That way you may find out about things that need to be fixed, and also if you do make changes you're know if they break anything else.
The next thing I'd do is before you change any code is use selenium to create some tests - it can be used as a FireFox plugin and will record what you do. It's really good for testing legacy apps without much work on your part.
Chances are that you won't have much if any protection from SQL injection attacks so you will want to add cfqueryparam to everything!!
After that I'd be looking for duplicated code - eliminating duplicate code is going to make maintenance easier.
Good luck!
Funnily enough, I'm currently involved in converting an old CF app into an MVC3 application.
Now this isn't CF2, it was updated as recently as a year ago so all of this may not apply at all to your scenario, apologies if this is the case.
The main thing I had to do consolidate the mixed up CFQuerys and their calls into logical units of code that I could then start porting in functionality either to C# or JavaScript.
Thankfully this was a very simple application, the majority of the logic was called on a database using the DWR Ajax library; that which wasn't was mostly consolidated in a functions.cfm file.
Obviously a lot of that behavior doesn't need to be replicated as packaging up the separate components of logic (such as they were) in the CF app did map quite neatly to the various Partial Views and Editor Templates that I envisaged in the MVC application.
After that, it was simply a case of, page by page, finding out which logic was called when, what it relied upon that then finally creating a series of UML class and sequence diagrams.
Honestly though, I think I gained the most ground when I simply hit File-New Project and started trying to replicate the behavior of the app from the top of index.cfm.
I would break logical parts of the app into CFC's
Pick a single view, look at the logic within. Move that out to a CFC and invoke it.
Keep doing that you will have something much easier to work with that can be plugged into an MVC later. Its almost no work to do this, just copy and paste sections of code and call them.
You can consider using object factory to layer your application. We have similar situation at work and we started refactoring by putting Lightwire DI framework.
First we migrated all the sql statement into gateways, then we started using services and take a lot of code out of the templates to the services.
The work is not finished yet but the application is looking better already.
For large, really complex applications I'd prefer ColdBox for a re-factor project. However, I just saw a presentation at the D2W Conference on F/W 1 (Framework One), a VERY simple ColdFusion MVC framework. Check out code from the presentation here.
It's 1 (one) CFC file and a set of conventions for organizing your code. I highly recommend evaluating it for your project.

Optimisation tips when migrating data into Sitecore CMS

I am currently faced with the task of importing around 200K items from a custom CMS implementation into Sitecore. I have created a simple import page which connects to an external SQL database using Entity Framework and I have created all the required data templates.
During a test import of about 5K items I realized that I needed to find a way to make the import run a lot faster so I set about to find some information about optimizing Sitecore for this purpose. I have concluded that there is not much specific information out there so I'd like to share what I've found and open the floor for others to contribute further optimizations. My aim is to create some kind of maintenance mode for Sitecore that can be used when importing large columes of data.
The most useful information I found was on Mark Cassidy's blogpost http://intothecore.cassidy.dk/2009/04/migrating-data-into-sitecore.html. At the bottom of this post he provides a few tips for when you are running an import.
If migrating large quantities of data, try and disable as many Sitecore event handlers and whatever else you can get away with.
Use BulkUpdateContext()
Don't forget your target language
If you can, make the fields shared and unversioned. This should help migration execution speed.
The first thing I noticed out of this list was the BulkUpdateContext class as I had never heard of it. I quickly understood why as a search on the SND forum and in the PDF documentation returned no hits. So imagine my surprise when i actually tested it out and found that it improves item creation/deletes by at least ten fold!
The next thing I looked at was the first point where he basically suggests creating a version of web config that only has the bare essentials needed to perform the import. So far I have removed all events related to creating, saving and deleting items and versions. I have also removed the history engine and system index declarations from the master database element in web config as well as any custom events, schedules and search configurations. I expect that there are a lot of other things I could look to remove/disable in order to increase performance. Pipelines? Schedules?
What optimization tips do you have?
Incidentally, BulkUpdateContext() is a very misleading name - as it really improves item creation speed, not item updating speed. But as you also point out, it improves your import speed massively :-)
Since I wrote that post, I've added a few new things to my normal routines when doing imports.
Regularly shrink your databases. They tend to grow large and bulky. To do this; first go to Sitecore Control Panel -> Database and select "Clean Up Database". After this, do a regular ShrinkDB on your SQL server
Disable indexes, especially if importing into the "master" database. For reference, see http://intothecore.cassidy.dk/2010/09/disabling-lucene-indexes.html
Try not to import into "master" however.. you will usually find that imports into "web" is a lot faster, mostly because this database isn't (by default) connected to the HistoryManager or other gadgets
And if you're really adventureous, there's a thing you could try that I'd been considering trying out myself, but never got around to. They might work, but I can't guarantee that they will :-)
Try removing all your field types from App_Config/FieldTypes.config. The theory here is, that this should essentially disable all of Sitecore's special handling of the content of these fields (like updating the LinkDatabase and so on). You would need to manually trigger a rebuild of the LinkDatabase when done with the import, but that's a relatively small price to pay
Hope this helps a bit :-)
I'm guessing you've already hit this, but putting the code inside a SecurityDisabler() block may speed things up also.
I'd be a lot more worried about how Sitecore performs with this much data... assuming you only do the import once, who cares how long that process takes. Is this going to be a regular occurrence?

What are the gotchas with ColdFusion?

Background:
I have a new site in the design phase and am considering using ColdFusion. The Server is currently set-up with ColdFusion and Python (done for me).
It is my choice on what to use and ColdFusion seems intriguing with the tag concept. Having developed sites in PHP and Python the idea of using a new tool seems fun but I want to make sure it is as easy to use as my other two choices with things like URL beautification and scalability.
Are there any common problems with using ColdFusion in regards to scalability and speed of development?
My other choice is to use Python with WebPy or Django.
ColdFusion 9 with a good framework like Sean Cornfeld's FW/1 has plenty of performance and all the functionality of any modern web server development language. It has some great integration features like exchange server support and excel / pdf support out of the box.
Like all tools it may or may not be the right one for you but the gotchas in terms of scalability will usually be with your code, rarely the platform.
Liberally use memcached or the built in ehache in CF9, be smart about your data access strategy, intelligently chunk returned data and you will be fine performance wise.
My approach with CF lately involves using jQuery extensively for client side logic and using CF for the initial page setup and ajax calls to fill tables. That dramatically cuts down on CF specific code and forces nice logic separation. Plus it cuts the dependency on any one platform (aside from the excellent jQuery library).
To specifically answer your question, if you read the [coldfusion] tags here you will see questions are rarely on speed or scalability, it scales fine. A lot of the questions seem to be on places where CF is a fairly thin layer on another tool like Apache Axis (web services) and ExtJs (cfajax) - neither of which you need to use. You will probably need mod-rewrite or IIS rewrite to hide .cfm
Since you have both ColdFusion and Python available to you already, I would carefully consider exactly what it is you're trying to accomplish.
Do you need a gradual learning curve, newbie-friendly language (easy for someone who knows HTML to learn), great documentation, and lots of features that make normally difficult tasks easy? That sounds like a job for ColdFusion.
That said, once you get the basics of ColdFusion down, it's easy to transition into an Object Oriented approach (as others have noted, there are a plethora of MVC frameworks available: FW/1, ColdBox, Fusebox, Model-Glue, Mach-ii, Lightfront, and the list goes on...), and there are also dependency management (DI/IoC) frameworks (my favorite of which is ColdSpring, modeled after Java's Spring framework), and the ability to do Aspect-Oriented Programming, as well. Lastly, there are also several ORM frameworks (Transfer, Reactor, and DataFaucet, if you're using CF8 or earlier, or add Hibernate to the list in CF9+).
ColdFusion also plays nicely with just about everything else out there. It can load and use .Net assemblies, provides native access to Java classes, and makes creating and/or consuming web services (particularly SOAP, but REST is possible) a piece of cake. (I think it even does com/corba, if you feel like using tech from 1991...)
Unfortunately, I've got no experience with Python, so I can't speak to its strengths. Perhaps a Python developer can shed some light there.
As for url rewrting, (again, as others have noted) that's not really done in the language (though you can fudge it); to get a really nice looking URL you really need either mod_rewrite (which can be done without .htaccess, instead the rules would go into your Apache VHosts config file), or with one of the IIS URL Rewriting products.
The "fudging" I alluded to would be a url like: http://example.com/index.cfm/section/action/?search=foo -- the ".cfm" is in the URL so that the request gets handed from the web server (Apache/IIS) to the Application Server (ColdFusion). To get rid of the ".cfm" in the URL, you really do have to use a URL rewriting tool; there's no way around it.
From two years working with CF, for me the biggest gotchas are:
If you're mainly coding using tags (rather than CFScript) and formatting for readability, be prepared for your output to be filled with whitespace. Unlike other scripting languages, the whitespace between statements are actually sent to the client - so if you're looping over something 100 times and outputting the result, all the linebreaks and tabs in the loop source code will appear 100 times. There are ways around this but it's been a while - I'm sure someone on SO has asked the question before, so a quick search will give you your solution.
Related to the whitespace problem, if you're writing a script to be used with AJAX or Flash and you're trying to send xml; even a single space before the DTD can break some of the more fussy parsing engines (jQuery used to fall over like this - I don't know if it still does and flash was a nightmare). When I first did this I spent hours trying to figure out why what looked like well formed XML was causing my script to die.
The later versions aren't so bad, but I was also working on legacy systems where even quite basic functionality was lacking. Quite often you'll find you need to go hunting for a COM or Java library to do the job for you. Again, though, this is in the earlier versions.
CFAJAX was a heavy, cumbersome beast last time I checked - so don't bother, roll your own.
Other than that, I found CF to be a fun language to work with - it has its idiosyncracies like everything else, but by and large it was mostly headache free and fast to work with.
Hope this helps :)
Cheers
Iain
EDIT: Oh, and for reasons best known to Adobe, if you're running the trial version you'll get a lovely fat HTML comment before all of your output - regardless of whether or not you're actually outputting HTML. And yes, because the comment appears before your DTD, be prepared for some browsers (not looking at any one in particular!) to render it like crap. Again - perhaps they've rethought this in the new version...
EDIT#2: You also mentioned URL Rewriting - where I used to work we did this all the time - no problems. If you're running on Apache, use mod_rewrite, if you're running on IIS buy ISAPI Rewrite 3.
do yourself the favor and check out the CFWheels project. it has the url rewriting support and routes that you're looking for. also as a full stack mvc framework, it comes with it's own orm.
It's been a few years, so my information may be a little out of date, but in my experience:
Pros:
Coldfusion is easy to learn, and quick to get something up and running end-to-end.
Cons:
As with many server-side scripting languages, there is no real separation between persistence logic, business logic, and presentation. All of these are typically interwoven throughout a typical Coldfusion source file. This can mean a lot more work if you want to make changes to the database schema of a mature application, for example.
There are some disciplines that can be followed to make things a little more maintainable; "Fusebox" was one. There may be others.

Which web framework incurs the least overhead?

I'm playing around with Django on my website hosting service.
I found out that a simple Django page, which has only some static text, and is rendered from a very simple template I created takes a significant time to render. When compared to a static HTML page, I am getting ~2 seconds difference in the load times. Keep in mind this is a simple test of mine with nothing complicated. Also note that my web hosting is on a shared server (not dedicated), so I might be hitting some CPU limitations.
Seems to me that either:
I have some basic CGI/Apache/Django configuration wrong
Django takes significant overhead, at least in this specific scenario.
I find #1 not probable since I followed my web hosting service wiki on how to set up Django. So we are left with the overhead problem.
My question is which web framework do you find the best to use in scenarios where the website is hosted on a shared server, and CPU/memory overhead must be kept to minimum?
Edit: seems that my configuration is something I might want to look at, and perhaps later on I'll be opening a question on how to best configure Django.
For now, I would appreciate answers focusing on your experience, in general, with web frameworks, and which of those you found to be the best in terms of performance in the aforementioned scenario.
"I have some basic CGI/Apache/Django configuration wrong"
Correct.
First. The very first time Django returns a page, it takes forever. A lot of initialization happens for the first request.
Second. What specific configuration are you using. We just switched from mod_python to mod_wsgi in daemon mode and are very happy with the performance changes.
Third. What database are you using?
Fourth. What test configuration are you using?
Fifth. What caching parameters and reverse proxy are you using?
Odds are good that you have a lot of degrees of freedom in your configuration.
Edit
The question "which of those you found to be the best in terms of performance" is largely impossible to answer.
See http://wiki.python.org/moin/WebFrameworks
There are dozens of frameworks. Few people can examine more than a few to do head-to-head comparison.
The best possible performance is achieved through static content. A Python app that makes static pages (for instance a collection of Jinja templates) is fastest.
After that, it's largely impossible to say. Even http://werkzeug.pocoo.org/ involves some processing overheads that may be unacceptable in the above scenario. Python can be slow.
Django, with a modicum of effort, is often fast enough. Serving static content separately from dynamic content, for example, can be a huge speedup.
Since Django does so much automatically, there's a huge victory in not having to write every little administrative page.
I'd say there has to be something funky with your setup there to get such a large performance difference. Try mod_wsgi (if you're not already) and follow the excellent suggestions by the posters above. If Django genuinely was this slow in all cases, there's just no way companies would be able to use it for production applications. It's more than likely not to be Django that is holding the request up. Once you have the .pyc files all sorted (automatically generated bytecode), then the execution should be fairly zippy.
However, if you don't actually need all Django has to offer, then why use it? I'm using it in quite a large production application, and we're not using all of its features… if you're doing something fairly simple, you may want to consider using something like web.py or Werkzeug (or something non-Python-based if you'd rather).
Frameworks like Django or Ruby on Rails grew out of real world needs. As different as these needs were, as different they turned out.
Here is my Experience:
As a former PHP programmer, I prefered CakePHP for simple stuff and Symfony for more advanced applications. I had a look into Ruby, but the documentation sucked back then. Now I'm using Django. Django works very well for me. In contrast to Symfony I feel like Django brings less flexibility out of the Box, but its easier to extend.
Another approach would be to use 'no framework' CherryPy
I think the host may be an issue. I do Django development on my localhost (Mac) and it's way better. I like WebFaction for cheap hosting and Amazon ec2 for premium hosting.
The framework is strong and it can handle heavy sites - don't obsess about that stuff. The important thing is to create a clean product, Django can handle it. There are about a thousand steps to take when you see how the application handles in the wild, but for now, just trust us that you don't need to worry about the inherent speed of the framework before exhausting a whole slew of parameters including a dedicated VPS/instance when you need it.
Also, following on your edit - I personally don't think performance is a major issue in programming. Here are the issues in terms of concern:
UI/UX efficiency
UI/UX speed (application caching)
Well designed models/views
Optimization of the system (n-tier architecture, etc...)
Optimization of the process (good QA to reduce failures/bottlenecks from deployment)
Optimization of the subsystems (database, etc...)
Hardware
Framework internal optimization
Don't waste time with comparing framework speeds. Their advantage is in extensible code, smart architectures, etc...
On a side note, DO NOT NOT USE A FRAMEWORK FOR A NEW WEB APPLICATION. I'm sorry I can't say it loud enough, but it's an absolute requirement nowadays. It's not even a debate about not using one, just which one to use.
I personally chose Django, which is great. But I can't definitively knock the others out there.
It's possibly both. Django does have stuff for caching built-in, which would be worth trying. Regardless, any non-cached page will nearly always take longer than a static file. A file has to be read in both cases, and in the case of a dynamic page, it also must be executed. And then, in both cases, sent over to the client.
Definetely shared hosting is not the best choice to run heavy frameworks such as Django or CakePHP. If you can afford it, buy VPS.
As for performance, probably your host uses Python with mod-python, which is not recommended now. WSGI is preferred standard for Python powered webapps.