Optimisation tips when migrating data into Sitecore CMS

I am currently faced with the task of importing around 200K items from a custom CMS implementation into Sitecore. I have created a simple import page which connects to an external SQL database using Entity Framework and I have created all the required data templates.
During a test import of about 5K items I realized that I needed to find a way to make the import run a lot faster, so I set about finding information on optimizing Sitecore for this purpose. I have concluded that there is not much specific information out there, so I'd like to share what I've found and open the floor for others to contribute further optimizations. My aim is to create some kind of maintenance mode for Sitecore that can be used when importing large volumes of data.
The most useful information I found was in Mark Cassidy's blog post http://intothecore.cassidy.dk/2009/04/migrating-data-into-sitecore.html. At the bottom of that post he provides a few tips for when you are running an import:
If migrating large quantities of data, try and disable as many Sitecore event handlers and whatever else you can get away with.
Use BulkUpdateContext()
Don't forget your target language
If you can, make the fields shared and unversioned. This should help migration execution speed.
The first thing I noticed on this list was the BulkUpdateContext class, as I had never heard of it. I quickly understood why: a search on the SDN forum and in the PDF documentation returned no hits. So imagine my surprise when I actually tested it out and found that it improves item creation/deletion speed by at least tenfold!
The next thing I looked at was the first point, where he basically suggests creating a version of the web.config that only has the bare essentials needed to perform the import. So far I have removed all events related to creating, saving and deleting items and versions. I have also removed the history engine and system index declarations from the master database element in web.config, as well as any custom events, schedules and search configurations. I expect that there are a lot of other things I could look to remove/disable in order to increase performance. Pipelines? Schedules?
What optimization tips do you have?

Incidentally, BulkUpdateContext() is a very misleading name - as it really improves item creation speed, not item updating speed. But as you also point out, it improves your import speed massively :-)
Since I wrote that post, I've added a few new things to my normal routines when doing imports.
Regularly shrink your databases. They tend to grow large and bulky. To do this, first go to Sitecore Control Panel -> Database and select "Clean Up Database". After this, do a regular ShrinkDB on your SQL Server.
Disable indexes, especially if importing into the "master" database. For reference, see http://intothecore.cassidy.dk/2010/09/disabling-lucene-indexes.html
Try not to import into "master", however; you will usually find that imports into "web" are a lot faster, mostly because this database isn't (by default) connected to the HistoryManager or other gadgets
And if you're really adventurous, there's one more thing you could try that I'd been considering trying out myself but never got around to. It might work, but I can't guarantee that it will :-)
Try removing all your field types from App_Config/FieldTypes.config. The theory here is that this should essentially disable all of Sitecore's special handling of the content of these fields (like updating the LinkDatabase and so on). You would need to manually trigger a rebuild of the LinkDatabase when done with the import, but that's a relatively small price to pay.
Hope this helps a bit :-)

I'm guessing you've already hit this, but putting the code inside a SecurityDisabler() block may speed things up also.
I'd be a lot more worried about how Sitecore performs with this much data... assuming you only do the import once, who cares how long that process takes. Is this going to be a regular occurrence?

Related

Django best practices to validate data in other tables - taking complexity out of the view file?

I was wondering about best practices in Django for validating table contents.
I am creating Sales Orders, and each SO should check the availability of the items I have in stock; if they are not in stock, it will trigger manufacturing orders and purchase orders.
I don't want to make a very complex view, and I am looking for a way to decouple the logic from it; I also predict performance issues.
What are the best practices or ready-made solutions I can use in the Django framework to address view complexity?
I see different possibilities, but I am wondering which will be the best fit in my case:
managers
celery - just to run a job occasionally; I want the app to be real time, so I don't like this option.
using signals (pre_save/post_save)
model validation
creating an extra layer, like a services.py file
Since I am new to Django, I am a bit puzzled about which route to take.
Not sure if this is the answer you are looking for.
Signals are for doing things automatically when events happen. They are most commonly used to do things before and after model operations. So if you need to do something every time you save a record, create a new record, or delete one, that is where you use signals.
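For example, a minimal post_save receiver sketch (the SalesOrder model and its check_availability() method are hypothetical stand-ins for whatever you have in your project):

    # A minimal post_save receiver sketch; SalesOrder and check_availability()
    # are hypothetical stand-ins for your own model and stock-checking logic.
    from django.db.models.signals import post_save
    from django.dispatch import receiver

    from myapp.models import SalesOrder  # hypothetical app/model


    @receiver(post_save, sender=SalesOrder)
    def trigger_stock_check(sender, instance, created, **kwargs):
        # Runs automatically every time a SalesOrder is saved.
        if created:
            instance.check_availability()  # hypothetical method on the model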
Managers are used to manage record retrieval and manipulation. If you want to retrieve data in some clever way, you can define a custom manager and add custom methods to it. If you want to override some default behaviors of querysets, you would also do that with a custom manager.
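A rough sketch of what a custom manager could look like for your case (all model and field names here are made up):

    # A custom manager sketch; the model and field names are made up.
    from django.db import models
    from django.utils import timezone


    class OpenOrderManager(models.Manager):
        def get_queryset(self):
            # Override the default queryset: only orders that are still open.
            return super().get_queryset().filter(status='open')

        def overdue(self):
            # A custom retrieval method hanging off the manager.
            return self.get_queryset().filter(due_date__lt=timezone.now())


    class SalesOrder(models.Model):
        status = models.CharField(max_length=20)
        due_date = models.DateTimeField()

        objects = models.Manager()        # keep the default manager too
        open_orders = OpenOrderManager()  # SalesOrder.open_orders.overdue()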
Celery is for running things asynchronously. If you are worried that some processing you are doing might take a long time, that is where you might consider offloading things to celery. A friendly warning though: doing things asynchronously raises the complexity of your code quite a bit, since you need to add some mechanism to pass the data back from celery tasks into your django app and to your users.
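For completeness, a tiny task sketch, assuming a working Celery setup and the same hypothetical SalesOrder model as above:

    # A minimal celery task sketch; assumes Celery is already configured for
    # the project and that SalesOrder/check_availability() exist as above.
    from celery import shared_task


    @shared_task
    def check_availability_async(order_id):
        from myapp.models import SalesOrder  # imported lazily inside the task
        SalesOrder.objects.get(pk=order_id).check_availability()


    # Queued from a view or signal handler with:
    #   check_availability_async.delay(order.pk)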
services.py link that you posted seems to do what you want, it just provides a place where you can put logic that is not specific to a particular view.
Here on Stack Overflow, I got advice from some experienced developers that premature optimization is the root of all evil.
What I suggest is to keep it simple. Making the view a little more complex is actually better than effectively adding one more layer of complexity. I would suggest that you try to put most of your logic in models and whatever remains after that in views.
Also, unnecessarily using multiple packages will not solve much of your problem, so use them only when necessary. Otherwise, try to write the minimal logic yourself so that you do not have to pull in many apps.
Signals and the other options, as everybody says, are not as great as they may seem, however promising. Just try to make things simpler.
One more point from my side: as you are just starting out, go through class-based views and try to use them when you get familiar with them. That will simplify your views the most. Plus, if you are new to django, read a little code. https://github.com/vitorfs/bootcamp might help you get started.
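To make the class-based views suggestion concrete, here is a minimal ListView sketch (the SalesOrder model, owner field and template name are made up):

    # A minimal class-based view sketch; model and template names are made up.
    from django.views.generic import ListView

    from myapp.models import SalesOrder  # hypothetical


    class SalesOrderListView(ListView):
        model = SalesOrder
        paginate_by = 25
        template_name = 'orders/salesorder_list.html'

        def get_queryset(self):
            # Keep small per-view tweaks here instead of a fat function view.
            return super().get_queryset().filter(owner=self.request.user)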

Limits of django-haystack with Solr 4.0

We are planning to use django-haystack with Solr 4.0 (with near real time search) for our web app, and I was wondering if anyone could advise on the limits of using haystack (when compared to using Solr directly), i.e. is there a performance hit/overhead of using django-haystack? We have around 3 million+ documents which would need indexing, plus an additional (estimated) 100k added every day.
Ideally, I'd think we need a simple API over Solr 4, but I am finding it hard to find anything specific to Python which is still actively maintained (except django-haystack, of course). I'd appreciate any guidance on this.
It seems like your question could be rephrased "How has Haystack burned you?" Haystack is nice for some things, but has also caused me some headaches on the job. Here are some things I've had to code around.
You mentioned indexing. Haystack has management commands for rebuilding the index. We'll use these for nuke-and-pave rebuilding during testing, but for reindexing our production content we kind of hit the wall. The command would take forever, you wouldn't know where it was in terms of progress, and if it failed you were screwed and had to start all over again. We reached a point where we had too much content and it would fail reliably enough. We switched to making batches of content and reindexing these in celery tasks. We made a management command to do the batching and kick off all those tasks. This has been a little more robust in the face of partial failures and, even better, it actually finishes. The underlying task will use haystack calls to convert a database object to a Solr document -- this ORMiness is nice and hasn't burned me yet. We edit our Solr schema by hand, though.
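To illustrate the batch-and-dispatch idea, here is a hedged sketch of the pattern (not our actual code; the Article model, batch size and task name are all made up):

    # A sketch of batching reindex work into celery tasks from a management
    # command; Article, the batch size and the task are hypothetical.
    from celery import shared_task
    from django.core.management.base import BaseCommand

    from myapp.models import Article  # hypothetical indexed model


    @shared_task
    def reindex_batch(pk_list):
        from haystack import connections
        index = connections['default'].get_unified_index().get_index(Article)
        for obj in Article.objects.filter(pk__in=pk_list):
            index.update_object(obj)  # converts the object into a Solr document


    class Command(BaseCommand):
        help = "Queue reindexing of all Articles in small batches"

        def handle(self, *args, **options):
            pks = list(Article.objects.values_list('pk', flat=True))
            batch_size = 1000
            for start in range(0, len(pks), batch_size):
                reindex_batch.delay(pks[start:start + batch_size])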
The query API is okay for simple stuff. If you are doing more convoluted solr queries, you can find yourself just feeding in raw queries. This can lead to spaghetti code. We eventually pushed that raw spaghetti into request handlers in the solrconfig file instead. We still use haystack to turn on highlighting or to choose a specific request handler, but again, I'm happier when we keep it simple and we did hack on a method to add arbitrary parameters as needed.
There are other assumptions about how you want to use solr that seem to get baked in to the code. Haystack is the only open source project where I actually have some familiarity with the code. I'm not sure if that's a good thing, since it hasn't always been by choice. We have plenty of layer cake code that extends a haystack class and overrides it to do the right thing. This isn't terrible, but when you have to also copy and paste haystack code into those overridden methods then it starts being more terrible.
So... it's not perfect, but parts are handy. If you are on the fence about writing your own API anyway, using haystack might save you some trouble, especially when you want to take all your solr results and pass them back into django templates or whatever. It sounds like with the steady influx of documents, you will want to write some batch indexing jobs anyway. Start with that, prepare to be burnt a little and look at the source when that happens, it's actually pretty readable.

Django/Sqlite Improve Database performance

We are developing an online school diary application using django. The prototype is ready and the project will go live next year with about 500 students.
Initially we used sqlite and hoped that for the initial implementation this would perform well enough.
The data tables are such that to obtain the details of a school day (periods, classes, teachers, classrooms), many tables are used, and the database access takes 67ms on a reasonably fast PC.
Most of the data is static once the year starts, with perhaps minor changes to classrooms. I thought of extracting the timetable for each student for each term day so no table joins would be needed. I put this data into a text file for one student; the file is 100K in size. The time taken to read this data and process it for a day's timetable is about 8ms. If I pre-load the data on login and store it in the session, it takes 7ms at login and 2ms for each query.
With 500 students, what would be the impact on the web server using this approach, and what other options are there (putting the student text files into some sort of memory cache rather than the session, for example)?
There will not be a great deal of data entry, students adding notes, teachers likewise, so it will mostly be checking the timetable status and looking to see what events exist for that day or week.
What is your expected response time, and what is your expected number of requests per minute? One twentieth of a second for the database access (which is likely to be the slow part) for a request doesn't sound like a problem to me. SQLite should perform fine in a read-mostly situation like this, so I'm not convinced you even have a performance problem.
If you want faster response you could consider:
First, ensuring that you have the best response time by checking your indexes and profiling individual retrievals to look for performance bottlenecks.
Pre-computing the static parts of the system and storing the HTML. You can put the HTML right back into the database or store it as disk files (a rough sketch using Django's cache framework as the store follows this list).
Using the database as a backing store only (to preserve state of the system when the server is down) and reading the entire thing into in-memory structures at system start-up. This eliminates disk access for the data, although it limits you to one physical server.
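As an illustration of the second option, here is a sketch that stores pre-rendered HTML in Django's cache framework rather than on disk or in the database (the view, template and cache key layout are made-up examples):

    # A sketch of serving pre-rendered HTML via Django's low-level cache API;
    # the view, template and cache key layout are made-up examples.
    from django.core.cache import cache
    from django.http import HttpResponse
    from django.template.loader import render_to_string


    def day_timetable(request, student_id, day):
        key = 'timetable:%s:%s' % (student_id, day)
        html = cache.get(key)
        if html is None:
            # Fall back to the expensive multi-join rendering, then store it.
            html = render_to_string('timetable/day.html',
                                    {'student_id': student_id, 'day': day})
            cache.set(key, html, 60 * 60)  # keep for an hour
        return HttpResponse(html)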
This sounds like premature optimization. 67ms is scarcely longer than the ~50ms where we humans can observe that there was a delay.
SQLite's representation of your data is going to be more efficient than a text format, and unlike a text file that you have to parse, the operating system can efficiently cache just the portions of your database that you're actually using in RAM.
You can lock down ~50MB of RAM to cache a parsed representation of the data for all the students, but you'll probably get better performance using that RAM for something else, like the OS disk cache.
I agree with some of the other answers which suggest using MySQL or PostgreSQL instead of SQLite. SQLite is not designed to be used as a production db. It is great for storing data for one-user applications such as mobile apps or even a desktop application, but it falls short very quickly in server applications. With Django it is trivial to switch to any other full-fledged database backend.
If you switch to one of those, you should not really have any performance issues, especially if you will do all the necessary joins using select_related and prefetch_related.
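For example, a sketch with a hypothetical Lesson model (teacher and classroom as foreign keys, students as a many-to-many):

    # A sketch of avoiding per-row queries; the Lesson model and its
    # relations are hypothetical.
    from myapp.models import Lesson

    lessons = (
        Lesson.objects
        .filter(day='monday')
        .select_related('teacher', 'classroom')  # FK joins in the same query
        .prefetch_related('students')            # M2M fetched in one extra query
    )

    for lesson in lessons:
        # No additional queries are issued inside the loop.
        print(lesson.teacher.name,
              lesson.classroom.number,
              len(lesson.students.all()))  # uses the prefetched cache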
If you still need more performance, considering that "most of the data is static", you might actually want to convert the Django site into a static site (a collection of HTML files) and then serve those using nginx or something similar. The simplest way I can think of doing that is to write a cron job which loops over all the needed URL configs, requests each page from Django and then saves it as an HTML file. If you want to go in that direction, you might also want to take a look at Python's static site generators: Hyde and Pelican.
This approach will certainly be much faster than any caching system; however, you will lose any dynamic components of the site. If you need them, then caching seems like the best and fastest solution.
You should use MySQL or PostgreSQL for your production database. sqlite3 isn't a good idea.
You should also avoid pre-loading data at login. Since your records can be inserted in advance, write Django management commands and run the import into your chosen database beforehand, and design your models such that when a user logs in, they can already access and view/edit their related data (which is pre-inserted before the application even goes live). Hardcoding data operations at login does not smell right at all from an application design point of view.
https://docs.djangoproject.com/en/dev/howto/custom-management-commands/
The benefit of designing your Django models and using custom management commands to insert the records the right way before your application goes live is that you can use the Django ORM to make the appropriate relationships between users and their records.
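A minimal sketch of such a command (the Timetable model, field names and CSV layout are invented for illustration):

    # A management command sketch that pre-loads records before go-live;
    # the Timetable model and CSV columns are made-up examples.
    # Saved as myapp/management/commands/import_timetable.py and run with:
    #   python manage.py import_timetable timetable.csv
    import csv

    from django.core.management.base import BaseCommand

    from myapp.models import Timetable  # hypothetical


    class Command(BaseCommand):
        help = "Pre-load timetable records before the site goes live"

        def add_arguments(self, parser):
            parser.add_argument('csv_path')

        def handle(self, *args, **options):
            with open(options['csv_path'], newline='') as fh:
                rows = [
                    Timetable(student_id=r['student_id'],
                              period=r['period'],
                              room=r['room'])
                    for r in csv.DictReader(fh)
                ]
            Timetable.objects.bulk_create(rows, batch_size=500)
            self.stdout.write("Imported %d rows" % len(rows))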
I suspect, based on your description of what you need above, that you need to re-examine the approach you are taking to build this application.
With 500 students, we shouldn't even be talking about caching. If you want response speed, you should deal with the following issues in priority order:
Use a production quality database
Design your application use case correctly and design your application model right
Pre-load any data you need to the production database
Front-end optimization comes first (CSS/JS compression etc.)
Use the Django Debug Toolbar to figure out if any of your SQL is slow and optimize specifically those queries
Implement caching (memcached etc.) as needed
As a general guideline.

Tools and tips for switching CMS

I work for a university, and in the past year we finally broke away from our static HTML site of several thousand pages and moved to a Drupal site. This obviously entails massive amounts of data entry.
What if you're already using a CMS and are switching to another one that better suits your needs? How do you minimize the mountain of data entry during such a huge change? Are there tools built for this, or some best practices one should follow?
The Migrate module for Drupal would provide a big help. The Economist.com data migration to Drupal will give you an overview of the process.
The video from the Migration: not just for the birds presentation at Drupalcon DC 2009 is probably somewhat out-of-date, but also gives a good introduction.
Expect to have to both pre-process and post-process your data manually, whatever happens. Accept early on that your data is likely to be in a worse state than you think it is: fields will be misused; record-to-record references (foreign keys) might not be implemented properly, or at all; content is likely to need weeding and occasionally to be just bad or incorrect.
Check your database encoding. Older databases won't be in Unicode encodings, and get grumpy if you have to export data dumps and import them elsewhere. Even then, assume that there'll be some wacky nonprintable characters in your data: programs like Word seem to somehow inject them everywhere, and I've seen... codepoints... you people wouldn't believe. Consider sweeping your data before you even start (or even sweeping a database dump) for these characters. Decide whether or not to junk them or try to convert them in the case of e.g. Word "smart" punctuation characters.
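A hedged sketch of what such a sweep might look like in Python (the replacement table is just a starting point, not an exhaustive list):

    # A sketch of sweeping a text dump: convert common Word "smart" punctuation
    # and drop remaining control/format codepoints. The mapping is illustrative.
    import unicodedata

    SMART_PUNCTUATION = {
        '\u2018': "'", '\u2019': "'",   # curly single quotes
        '\u201c': '"', '\u201d': '"',   # curly double quotes
        '\u2013': '-', '\u2014': '--',  # en and em dashes
    }


    def sweep(text):
        for bad, good in SMART_PUNCTUATION.items():
            text = text.replace(bad, good)
        # Keep ordinary whitespace, drop other control (Cc) and format (Cf) chars.
        return ''.join(
            ch for ch in text
            if ch in '\n\r\t' or unicodedata.category(ch) not in ('Cc', 'Cf')
        )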
It's very difficult to create explicit data structures from implicit ones. If your incoming data has a separate date field, you can map that to a date field; if it has a date as part of a big lump of HTML, even if that date is in a tag with an id attribute, simple scripting won't work. You could use offline scripting with BeautifulSoup or (if your HTML's a bit nicer) the faster lxml to pre-process your data set, extract those implicit fields, and save them in an explicit format. Consider creating an intermediate database where these revisions are going to go.
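For instance, a pre-processing sketch along those lines (the element id and date format are assumptions about your markup, not anything standard):

    # A sketch of extracting an implicit date field from stored HTML with
    # BeautifulSoup; the element id and date format are assumptions.
    from datetime import datetime

    from bs4 import BeautifulSoup


    def extract_pub_date(html):
        soup = BeautifulSoup(html, 'html.parser')
        tag = soup.find(id='pub-date')   # e.g. <span id="pub-date">12 March 2009</span>
        if tag is None:
            return None
        try:
            return datetime.strptime(tag.get_text(strip=True), '%d %B %Y').date()
        except ValueError:
            return None                  # leave blank and flag for manual review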
The Migrate module is excellent, but to get really good data fidelity and play more clever tricks you might need to learn about its hook system (Drupal's terminology for functions following a particular naming scheme) and the basics of writing a module to put these hooks in (a module is broadly just a PHP file where all the functions begin with the same text, the name of the module file.)
All imported content should be flagged for at least a cursory check. You can do this by importing it with status=0 i.e. unpublished, and then create a view with the Views module to go through the content and open it in other tabs for checking. Views Bulk Operations lets you have a set of checkboxes alongside your view items, so you could approve many nodes at once.
Expect to run and re-run and re-run the import, fixing new things every time. Check ten, or twenty items, as early as possible. If there are any problems, check ten or twenty more. Fix and repeat the import.
Gauge how long a single import run is likely to take. Be pessimistic: we had an import we expected to take ten hours encounter exponential slowdown when we introduced the full data set; until we finally fixed some slow queries, it was projected to take two weeks.
If in doubt, or if you think the technical aspects of the above are just going to take more time than the work itself, then just hire temps to do the data. But you still need decent quality controls, as early as possible during their work. Drupal developers are also for hire: try your country's relevant IRC channel, or post a note in a relevant groups.drupal.org group. They're more expensive than temps but they usually write better PHP...! Consider hiring an agency too: that's a shameless plug, as I work for one, but sometimes it's best to get experts in for these specific jobs.
Really good imports are always hard, harder than you expect. Don't let it get you down!
Migrate + table wizard (and schema + views) is the way to go. With table wizard you can expose any table to drupal and map fields accordingly using migrate.
Look here for a detailed walkthrough:
http://www.lullabot.com/articles/drupal-data-imports-migrate-and-table-wizard
You'll want to have access to the existing data from Django. This helped me a lot with migrating: http://docs.djangoproject.com/en/1.2/howto/legacy-databases/ . With correct model definitions you'll have the full power of Django, including the admin. In fact, I'm using Django just as an admin backend for several legacy PHP projects - Django's admin can easily outdo a lot of custom hand-written admin scripts.
Authentication should remain the same: users should be able to log in with their existing credentials, but it is hard to write a migration script for the auth data because the password hashing schemes may be different and there is no way to convert between them without knowing the plain passwords. Django provides a way to support different sources of auth, so you can write a Drupal auth backend: http://docs.djangoproject.com/en/1.2/topics/auth/#writing-an-authentication-backend
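A hedged sketch of what such a backend could look like, assuming the legacy site is Drupal 6 (which stored unsalted MD5 password hashes) and the old users table has been exposed as an unmanaged DrupalUser model; all names here are hypothetical:

    # A sketch of a custom auth backend checking credentials against a legacy
    # Drupal 6 users table; DrupalUser and its fields are hypothetical.
    import hashlib

    from django.contrib.auth.models import User


    class DrupalBackend:
        def authenticate(self, request, username=None, password=None):
            from legacy.models import DrupalUser  # hypothetical unmanaged model
            try:
                legacy = DrupalUser.objects.get(name=username)
            except DrupalUser.DoesNotExist:
                return None
            # Drupal 6 stored md5(password); compare against the legacy hash.
            if hashlib.md5(password.encode()).hexdigest() != legacy.password_hash:
                return None
            # Create (or fetch) a matching Django user on first successful login.
            user, _ = User.objects.get_or_create(username=username)
            return user

        def get_user(self, user_id):
            try:
                return User.objects.get(pk=user_id)
            except User.DoesNotExist:
                return None

The backend would then be listed in AUTHENTICATION_BACKENDS in settings.py so Django tries it at login.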
There is no need to do a full rewrite. If some parts are working fine, they can still be powered by Drupal. New code can be written using Django with the same UI. Routing between the old and new parts can be performed by web server URL rewriting. Both the Django and Drupal parts can be powered by the same DB.

Is there a database implementation that has notifications and revisions?

I am looking for a database library that can be used within an editor to replace a custom document format. In my case the document would contain a functional program.
I want application data to be persistent even while editing, so that when the program crashes, no data is lost. I know that all databases offer that.
On top of that, I want to access and edit the document from multiple threads, processes, possibly even multiple computers.
Format: a simple key/value database would totally suffice. SQL usually needs to be wrapped, and if I can avoid pulling in a heavy ORM dependency, that would be splendid.
Revisions: I want to be able to roll back changes up to the first change to the document that has ever been made, not only in one session, but also between sessions/program runs.
I need notifications: each process must be able to be notified of changes to the document so it can update its view accordingly.
I see these requirements as rather basic, a foundation to solve the usual tough problems of an editing application: undo/redo, multiple views on the same data. Thus, the database system should be lightweight and undemanding.
Thank you for your insights in advance :)
Berkeley DB is an undemanding, light-weight key-value database that supports locking and transactions. There are bindings for it in a lot of programming languages, including C++ and python. You'll have to implement revisions and notifications yourself, but that's actually not all that difficult.
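As a rough idea of how small the key/value surface is, here is a minimal sketch using the bsddb3 Python bindings (the key naming scheme is just an example; revisions and notifications would be your own layer on top):

    # A minimal Berkeley DB sketch via the bsddb3 bindings; keys and values
    # are byte strings, and the key naming scheme here is made up.
    import bsddb3

    db = bsddb3.btopen('document.db', 'c')   # 'c' creates the file if missing

    db[b'node/42/title'] = b'square'
    db[b'node/42/body'] = b'(lambda (x) (* x x))'
    db.sync()                                # flush so a crash loses nothing

    print(db[b'node/42/title'])              # b'square'
    db.close()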
It might be a bit more power than what you ask for, but you should definitely look at CouchDB.
It is a document database with "document" being defined as a JSON record.
It stores all the changes to the documents as revisions, so you instantly get revisions.
It has a powerful JavaScript-based view engine to aggregate all the data you need from the database.
All the commits to the database are written to the end of the repository file and the writes are atomic, meaning that unsuccessful writes do not corrupt the database.
Another nice bonus you'll get is easy and flexible replication of your database.
See the full feature list on their homepage
On the minus side (depending on your point of view) is the fact that it is written in Erlang and (as far as I know) runs as an external process...
I don't know anything about notifications though - it seems that if you are working with replicated databases, the changes are instantly replicated/synchronized between them. Other than that, I suppose you should be able to roll your own notification scheme...
Check out ZODB. It doesn't have notifications built in, so you would need a messaging system there (since you may use separate computers). But it has transactions; you can roll back forever (unless you pack the database, which removes earlier revisions); you can access it directly as an integrated part of the application, or it can run as client/server (with multiple clients, of course); you get automatic persistence; there is no ORM; etc.
It's pretty much Python-only though (it's based on Pickles).
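A minimal sketch of what using it looks like (the file name and the stored structure are just examples):

    # A minimal ZODB sketch: transactional persistence of plain Python objects.
    # The file name and the stored structure are made-up examples.
    import transaction
    from ZODB import DB, FileStorage

    storage = FileStorage.FileStorage('document.fs')  # append-only storage file
    db = DB(storage)
    conn = db.open()
    root = conn.root()                                # a persistent mapping

    # Store a document fragment and commit; the change survives a crash and
    # remains in the storage as a revision until you pack the database.
    root['program'] = ['lambda', ['x'], ['*', 'x', 'x']]
    transaction.commit()

    conn.close()
    db.close()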
http://en.wikipedia.org/wiki/Zope_Object_Database
http://pypi.python.org/pypi/ZODB3
http://wiki.zope.org/ZODB/guide/index.html
http://wiki.zope.org/ZODB/Documentation