Tools and tips for switching CMS - django

I work for a university, and in the past year we finally broke away from our static HTML site of several thousand pages and moved to a Drupal site. This obviously entails massive amounts of data entry.
What if you're already using a CMS and are switching to another one that better suits your needs? How do you minimize the mountain of data entry during such a huge change? Are there tools built for this, or some best practices one should follow?

The Migrate module for Drupal would be a big help. The Economist.com data migration to Drupal will give you an overview of the process.
The video from the Migration: not just for the birds presentation at Drupalcon DC 2009 is probably somewhat out-of-date, but also gives a good introduction.

Expect to have to both pre-process and post-process your data manually, whatever happens. Accept early on that your data is likely to be in a worse state than you think it is: fields will be misused; record-to-record references (foreign keys) might not be implemented properly, or at all; content is likely to need weeding and occasionally to be just bad or incorrect.
Check your database encoding. Older databases won't be in Unicode encodings, and get grumpy if you have to export data dumps and import them elsewhere. Even then, assume that there'll be some wacky nonprintable characters in your data: programs like Word seem to inject them everywhere, and I've seen... codepoints... you people wouldn't believe. Consider sweeping your data before you even start (or even sweeping a database dump) for these characters. Decide whether to junk them or to convert them, as in the case of e.g. Word "smart" punctuation characters.
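As a rough example of that kind of sweep (the file names, character ranges and replacement choices below are purely illustrative, and it assumes a UTF-8 text dump):

# Sketch: sweep a database dump for control characters and Word "smart" punctuation.
# File names, character ranges and replacements are illustrative, not prescriptive.
import re

SMART_PUNCTUATION = {
    '\u2018': "'", '\u2019': "'",   # curly single quotes
    '\u201c': '"', '\u201d': '"',   # curly double quotes
    '\u2013': '-', '\u2014': '--',  # en and em dashes
}

# Control characters other than tab, newline and carriage return.
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

with open('dump.sql', encoding='utf-8', errors='replace') as src:
    text = src.read()

for bad, good in SMART_PUNCTUATION.items():
    text = text.replace(bad, good)
text = CONTROL_CHARS.sub('', text)

with open('dump.cleaned.sql', 'w', encoding='utf-8') as dst:
    dst.write(text)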
It's very difficult to create explicit data structures from implied ones. If your incoming data has a separate date field, you can map that to a date field; if it has a date as part of a big lump of HTML, even if that date is in a tag with an id attribute, simple scripting won't work. You could use offline scripting with BeautifulSoup or (if your HTML's a bit nicer) the faster lxml to pre-process your data set, extract those implicit fields, and save them in an explicit format. Consider creating an intermediate database where these revisions are going to go.
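A rough BeautifulSoup sketch of that kind of pre-processing, assuming the date sits in an element with a known id (the id attribute, directory layout and CSV output are assumptions for illustration):

# Sketch: pull an implicit date field out of stored HTML pages with BeautifulSoup.
# The 'pubdate' id, the pages/ directory and the CSV output are assumed for illustration.
import csv
import glob
from bs4 import BeautifulSoup

with open('extracted_dates.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['file', 'date'])
    for path in glob.glob('pages/*.html'):
        with open(path, encoding='utf-8') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        tag = soup.find(id='pubdate')
        writer.writerow([path, tag.get_text(strip=True) if tag else ''])

The CSV (or an intermediate database table) then becomes the explicit source the import maps from.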
The Migrate module is excellent, but to get really good data fidelity and play more clever tricks you might need to learn about its hook system (Drupal's terminology for functions following a particular naming scheme) and the basics of writing a module to put these hooks in (a module is broadly just a PHP file where all the functions share a common prefix: the name of the module file).
All imported content should be flagged for at least a cursory check. You can do this by importing it with status=0, i.e. unpublished, and then creating a view with the Views module to go through the content and open it in other tabs for checking. Views Bulk Operations lets you have a set of checkboxes alongside your view items, so you can approve many nodes at once.
Expect to run and re-run and re-run the import, fixing new things every time. Check ten or twenty items as early as possible. If there are any problems, check ten or twenty more. Fix and repeat the import.
Gauge how long a single import run is likely to take. Be pessimistic: we had an import we expected to take ten hours encounter exponential slowdown when we introduced the full data set; until we finally fixed some slow queries, it was projected to take two weeks.
If in doubt, or if you think the technical aspects of the above are just going to take more time than the work itself, then just hire temps to do the data entry. But you still need decent quality controls, as early as possible during their work. Drupal developers are also for hire: try your country's relevant IRC channel, or post a note in a relevant groups.drupal.org group. They're more expensive than temps but they usually write better PHP...! Consider hiring an agency too: that's a shameless plug, as I work for one, but sometimes it's best to get experts in for these specific jobs.
Really good imports are always hard, harder than you expect. Don't let it get you down!

Migrate + Table Wizard (and Schema + Views) is the way to go. With Table Wizard you can expose any table to Drupal and map fields accordingly using Migrate.
Look here for a detailed walkthrough:
http://www.lullabot.com/articles/drupal-data-imports-migrate-and-table-wizard

You'll want access to the existing data from Django. This helps me a lot with migrating: http://docs.djangoproject.com/en/1.2/howto/legacy-databases/ . With correct model definitions you'll have the full power of Django, including the admin. In fact, I'm using Django just as an admin backend for several legacy PHP projects - Django's admin can easily outdo a lot of custom hand-written admin scripts.
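For illustration, a model generated by python manage.py inspectdb and then tidied up by hand might look like this (the table and column names are made up):

# Sketch: expose an existing legacy table to the Django ORM without letting
# Django manage its schema. Table and column names are made up for illustration.
from django.db import models

class LegacyArticle(models.Model):
    nid = models.IntegerField(primary_key=True)
    title = models.CharField(max_length=255)
    created = models.IntegerField()  # Unix timestamp in the old schema

    class Meta:
        managed = False    # Django won't create or alter this table
        db_table = 'node'  # point at the existing table name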
Authentication should remain the same. Users should be able to log in with their credentials, but it is hard to write a migration script for auth data because the password hashing schemes may differ and there is no way to convert between them without knowing the plain-text passwords. Django provides a way to support different sources of auth, so you can write a Drupal auth backend: http://docs.djangoproject.com/en/1.2/topics/auth/#writing-an-authentication-backend
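A skeleton of such a backend might look like the following. Note that find_legacy_user() and verify_drupal_password() are hypothetical stubs you would have to implement yourself, since Drupal's hashing scheme differs between versions, and that older Django versions omit the request argument from authenticate():

# Skeleton of a custom authentication backend that checks credentials against a
# legacy user table and mirrors the account into Django on first login.
from django.contrib.auth.models import User

def find_legacy_user(username):
    # Hypothetical lookup against the old Drupal users table; implement for your schema.
    raise NotImplementedError

def verify_drupal_password(password, stored_hash):
    # Hypothetical check; Drupal's password hashing differs between versions.
    raise NotImplementedError

class DrupalBackend:
    def authenticate(self, request, username=None, password=None):
        legacy = find_legacy_user(username)
        if legacy is None or not verify_drupal_password(password, legacy.password):
            return None
        # Mirror the legacy account as a Django user on first successful login.
        user, _ = User.objects.get_or_create(username=username)
        return user

    def get_user(self, user_id):
        try:
            return User.objects.get(pk=user_id)
        except User.DoesNotExist:
            return None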
There is no need to do a full rewrite. If some parts are working fine, they can still be powered by Drupal. New code can be written in Django with the same UI. Routing between the old and new parts can be handled by web server URL rewriting. Both the Django and Drupal parts can be backed by the same DB.


Can I do direct database updates in Django easily?

I am trying to write a web application that displays the content of a database as a website, e.g. in the form of a table, and lets the user update the table entries, which should automatically be reflected in the database, so that on page reload, the table looks exactly as the user left it.
Since my web development skills are fairly outdated, I wanted to take this as an excuse to try some new stuff. I know my way around SQL and Python quite well, so I thought Django would be a good choice. I don't have a lot of experience in Javascript however. I already worked through the tutorial, which covers classic HTML forms where you enter a bunch of data and then hit "submit" to push to the database.
What I would really prefer, though, is to have my whole table freely editable and immediately save any change to the database (e.g. whenever I click a checkbox or "focus out" of a text box). As a second option I thought about having a single "save" button for the whole page (which may easily be several screens in size).
Now, for the first option, I assume I will likely have to use Javascript and Ajax techniques, which I am not comfortable with yet, so writing larger pieces of Javascript code is something I am not very keen on at the moment.
For the second option, I would probably have my whole table be a huge, single form with a single submit button. I am a bit wary about this as it does not seem very robust to me.
So here is what my question boils down to: Are there ways to accomplish what I want in a robust and easy way without having to reinvent the wheel? From my understanding, Django does not cover the final rendering in HTML; it only provides the data, so I would assume I need some third-party technology to handle that part?
Yes, for your second idea, submitting the whole table at once, Django has a thing called a ModelFormSet, where you define a web form which is repeated for each row in the table (or for the set of records you select). There are a fair number of basic things you'll need to understand to do it, e.g. how to create a Django view, how to set up a URL, how to write templates... but you say you want to learn Django, so it's a good exercise. The Django documentation has a good tutorial that leads you through development of a basic working app, and from there it's not much further to do what you're seeking.
Here's the part of the Django documentation that discusses ModelFormSets:
https://docs.djangoproject.com/en/3.0/topics/forms/modelforms/#model-formsets
BTW, Django detects which rows have changed so it won't write every row every time, even though you've submitted them all.
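For a concrete picture, here's a minimal sketch of such a view; the Entry model, its fields, the template and the URL name are assumptions for illustration, not anything from the question:

# Minimal sketch of an editable-table view using a model formset.
# The Entry model, its fields, the template and the URL name are assumed.
from django.forms import modelformset_factory
from django.shortcuts import redirect, render

from myapp.models import Entry

def edit_entries(request):
    EntryFormSet = modelformset_factory(Entry, fields=['name', 'done'], extra=0)
    if request.method == 'POST':
        formset = EntryFormSet(request.POST, queryset=Entry.objects.all())
        if formset.is_valid():
            formset.save()  # only forms whose data actually changed are written
            return redirect('edit_entries')
    else:
        formset = EntryFormSet(queryset=Entry.objects.all())
    return render(request, 'myapp/edit_entries.html', {'formset': formset})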

Should I use an internal API in a django project to communicate between apps?

I'm building/managing a django project, with multiple apps inside of it. One stores survey data, and another stores classifiers that are used to add features to the survey data. For example: Is this survey answer sad? 0/1. This feature will get stored along with the survey data.
We're trying to decide how and where in the app to actually perform this featurization, and I'm being recommended a number of approaches that don't make ANY sense to me, but I'm also not very familiar with django, or more-than-hobby-scale web development, so I wanted to get another opinion.
The data app obviously needs access to the classifiers app, to be able to run the classifiers on the data, and then reinsert the featurized data, but how to get access to the classifiers has become contentious. The obvious approach, to me, is to just import them directly, a la
# from inside the Survey App
from ClassifierModels.models import Classifier
cls = Classifier.objects.filter(name='Sad').first() # or whatever, I'm used to flask
data = Survey.objects.filter(question='How do you feel?').first()
labels = cls(data.responses)
# etc.
However, one of my engineers is saying that this is bad practice, because apps should not import one another's models. And that instead, these two should only communicate via internal APIs, i.e. posting all the data to
http://our_website.com/classifiers/sad
and getting it back that way.
So, what feels to me like the most pressing question: Why in god's name would anybody do it this way? It seems to me like strictly more code (building and handling requests), strictly less intuitive code, that's more to build, harder to work with, and bafflingly indirect, like mailing a letter to your own house rather than talking to the person who lives there, with you.
But perhaps in easier to answer chunks,
1) Is there REALLY anything the matter with the first, direct, import-other-apps-models approach? (The only answers I've found say 'No!,' but again, this is being pushed by my dev, who does have more industrial experience, so I want to be certain.)
2) What is the actual benefit of doing it via internal API's? (I've asked of course, but only get what feel like theoretical answers, that don't address the concrete concerns, of more and more complicated code for no obvious benefit.)
3) How much do the size of our app, and team, factor into which decision is best? We have about 1.75 developers, and only, even if we're VERY ambitious, FOUR users. (This app is being used internally, to support a consulting business.) So to me, any questions of Best Practices etc. have to factor in that we have tiny teams on both sides, and need something stable, functional, and lean, not something that handles big loads, or is externally secure, or fast, or easily worked on by big teams, etc.
4) What IS the best approach, if NEITHER of these is right?
It's simply not true that apps should not import other apps' models. For a trivial refutation, think about the apps in django.contrib which contain models such as User and ContentType, which are meant to be imported and used by other apps.
That's not to say there aren't good use cases for an internal API. I'm in the planning process of building one myself. But they're really only appropriate if you intend to split the apps up some day into separate services. An internal API on its own doesn't make much sense if you're not in a service-based architecture.
I can't see any reason why you should not import one app's models from another. Django itself uses several applications and their models internally (like auth and admin). Reading the applications section of the documentation, we can see that the framework has all the tools to manage multiple applications and their models inside a project.
However, it seems quite obvious to me that sending requests to your own application's API would make your code really messy and hurt performance.
Without context it's hard to understand why your engineer considers this bad practice. He was maybe referring to database isolation (if so, see "Multiple databases" in the documentation) or proper code isolation for testing.
It is right to think about decoupling your apps. But I do not think an internal REST API is a good way to do it.
Directly importing another app's models and running queries and updates against them is not a great approach either. Every time you use a model from another app, you should be careful. I suggest you separate the communication between apps into a simple service layer. Then your Survey app does not have to know the model structure of the Classifier app:
# from inside the Survey App
from ClassifierModels.services import get_classifier_cls
cls = get_classifier_cls('Sad')
data = Survey.objects.filter(question='How do you feel?').first()
labels = cls(data.responses)
# etc.
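A rough sketch of what that service module could look like on the Classifier side; the module path mirrors the example above, and the Classifier fields and its predict() method are purely hypothetical:

# ClassifierModels/services.py -- rough sketch of a thin service layer.
# The Classifier model's fields and predict() method are hypothetical.
from ClassifierModels.models import Classifier

def get_classifier_cls(name):
    """Return a callable that labels an iterable of responses."""
    classifier = Classifier.objects.get(name=name)

    def classify(responses):
        # Whatever prediction logic the Classifier row encapsulates.
        return [classifier.predict(response) for response in responses]

    return classify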
For more information, you should read this thread: Separation of business logic and data access in django
More generally, you should create smaller, testable components. Nowadays I am interested in the "functional core and imperative shell" paradigm. Try Gary Bernhardt's lectures: https://gist.github.com/kbilsted/abdc017858cad68c3e7926b03646554e

Django/Sqlite Improve Database performance

We are developing an online school diary application using django. The prototype is ready and the project will go live next year with about 500 students.
Initially we used sqlite and hoped that for the initial implementation this would perform well enough.
The data tables are such that, to obtain the details of a school day (periods, classes, teachers, classrooms), many tables are used and the database access takes 67ms on a reasonably fast PC.
Most of the data is static once the year starts, with perhaps minor changes to classrooms. I thought of extracting the timetable for each student for each term day so no table joins would be needed. I put this data into a text file for one student; the file is 100K in size. The time taken to read this data and process it for a day's timetable is about 8ms. If I pre-load the data on login and store it in the session, it takes 7ms at login and 2ms for each query.
With 500 students, what would be the impact on the web server using this approach, and what other options are there (putting the student text files into a sort of memory cache rather than the session, for example)?
There will not be a great deal of data entry (students adding notes, teachers likewise), so it will mostly be checking the timetable status and looking to see what events exist for that day or week.
What is your expected response time, and what is your expected number of requests per minute? 67ms for the database access (which is likely to be the slow part) of a request doesn't sound like a problem to me. SQLite should perform fine in a read-mostly situation like this. So I'm not convinced you even have a performance problem.
If you want faster response you could consider:
First, ensuring that you have the best response time by checking your indexes and profiling individual retrievals to look for performance bottlenecks.
Pre-computing the static parts of the system and storing the HTML. You can put the HTML right back into the database or store it as disk files (see the sketch after this list).
Using the database as a backing store only (to preserve state of the system when the server is down) and reading the entire thing into in-memory structures at system start-up. This eliminates disk access for the data, although it limits you to one physical server.
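As one variant of the pre-computed HTML idea above, here is a sketch using Django's low-level cache API rather than the database or disk; the cache key, timeout and template name are assumptions:

# Sketch: render the (mostly static) timetable once and keep the HTML in the cache.
# The cache key, timeout and template name are assumptions for illustration.
from django.core.cache import cache
from django.template.loader import render_to_string

def timetable_html(student):
    key = f'timetable:{student.pk}'
    html = cache.get(key)
    if html is None:
        html = render_to_string('diary/timetable.html', {'student': student})
        cache.set(key, html, timeout=60 * 60 * 24)  # re-render at most once a day
    return html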
This sounds like premature optimization. 67ms is scarcely longer than the roughly 50ms threshold at which we humans can notice a delay.
SQLite's representation of your data is going to be more efficient than a text format, and unlike a text file that you have to parse, the operating system can efficiently cache just the portions of your database that you're actually using in RAM.
You can lock down ~50MB of RAM to cache a parsed representation of the data for all the students, but you'll probably get better performance using that RAM for something else, like the OS disk cache.
I agree with some of the other answers which suggest using MySQL or PostgreSQL instead of SQLite. SQLite is not designed to be used as a production DB here. It is great for storing data for single-user applications such as mobile apps or even a desktop application, but it falls short very quickly in server applications. With Django it is trivial to switch to any other full-fledged database backend.
If you switch to one of those, you should not really have any performance issues, especially if you do all the necessary joins using select_related and prefetch_related.
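For example, something like the following can collapse the whole day's lookup into a couple of queries; the diary app and the Period model with its relations are assumptions about your schema:

# Sketch: collapse the many-table timetable lookup into a couple of queries.
# The diary app, Period model and its relations are assumed for illustration.
from diary.models import Period

def day_timetable(student, day):
    return (
        Period.objects
        .filter(day=day, klass__students=student)
        .select_related('teacher', 'classroom', 'subject')  # FKs fetched via JOINs
        .prefetch_related('notes')  # reverse/M2M rows fetched in one extra query
    )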
If you still need more performance, considering that "most of the data is static", you might actually want to convert the Django site into a static site (a collection of HTML files) and then serve it using nginx or something similar. The simplest way I can think of doing that is to write a cron job which loops over all the needed URLs, requests each page from Django and then saves it as an HTML file. If you want to go in that direction, you might also want to take a look at Python static site generators such as Hyde and Pelican.
This approach will certainly be much faster than any caching system; however, you will lose any dynamic components of the site. If you need them, then caching seems like the best and fastest solution.
You should use MySQL or PostgreSQL for your production database. sqlite3 isn't a good idea.
You should also avoid pre-loading data on login. Since your records can be inserted in advance, write Django management commands and run the import into your chosen database beforehand, and design your models such that when a user logs in, the user can already access and view/edit his or her related data (which was inserted before the application even went live). Hard-coding data operations at login does not smell right at all from an application design point of view.
https://docs.djangoproject.com/en/dev/howto/custom-management-commands/
The benefit of designing your Django models and using custom management commands to insert the records before your application goes live is that you can use the Django ORM to set up the appropriate relationships between users and their records.
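A minimal sketch of such a command, assuming a hypothetical diary app, Timetable model and CSV layout:

# diary/management/commands/import_timetable.py -- minimal sketch of a custom
# management command that loads records before go-live. The Timetable model
# and the CSV column names are assumptions for illustration.
import csv

from django.core.management.base import BaseCommand

from diary.models import Timetable

class Command(BaseCommand):
    help = 'Import timetable rows from a CSV file'

    def add_arguments(self, parser):
        parser.add_argument('csv_path')

    def handle(self, *args, **options):
        with open(options['csv_path'], newline='') as f:
            for row in csv.DictReader(f):
                Timetable.objects.update_or_create(
                    student_id=row['student_id'],
                    period=row['period'],
                    defaults={'subject': row['subject'], 'room': row['room']},
                )
        self.stdout.write(self.style.SUCCESS('Import complete'))

You would then run it once before go-live with something like python manage.py import_timetable timetable.csv.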
I suspect, based on your description of what you need above, that you need to take another look at the approach you are using to build this application.
With 500 students, we shouldn't even be talking about caching. If you want response speed, you should deal with the following issues in order of priority:
Use a production quality database
Design your application use case correctly and design your application model right
Pre-load any data you need to the production database
Front-end optimization comes first (CSS/JS compression etc.)
Use the Django Debug Toolbar to figure out which of your SQL queries are slow and optimize specifically those
Implement caching (memcached etc.) as needed
As a general guideline.

Need Application Advice

I have a django application that contains customer and product information. It is simple but gets the job done. My customer wants the capability to create and send form letters and newsletters from within my app. From a text-editing perspective, I understand how simple form letters work, by replacing certain blocks of text with queries from the database. But my customer wants fliers with complex graphics and column-based layouts. There are some very mature products out there that I can't come close to competing with (nor want to). He talked about building and editing these pages from within my app. This seems a bridge too far, but I don't know if it's even possible to build that into a django app.
I have no idea how to approach this and it seems very complicated. But before saying no, I want to explore the available technology and the level of effort entailed. Are there open source packages that can help? How could one integrate this capability into a django-based web application, and how hard is it? For a part-time intermediate developer, how long would it take?
I'd farm the layout off to something like TeX Live. From there you can generate a PDF and attach it to an email message. Don't let "... a popular means by which to typeset complex mathematical formulae" throw you; TeX is useful for more than that.
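As a rough Python/Django sketch of that pipeline (the template name, context and recipient are assumptions, and note that Django's template syntax and LaTeX's braces can collide, so the .tex template needs careful escaping):

# Sketch: render a LaTeX template, compile it with pdflatex (from TeX Live),
# and attach the resulting PDF to an email. Template name, context fields and
# recipient are assumptions for illustration.
import subprocess
import tempfile
from pathlib import Path

from django.core.mail import EmailMessage
from django.template.loader import render_to_string

def send_flier(customer):
    tex_source = render_to_string('letters/flier.tex', {'customer': customer})
    with tempfile.TemporaryDirectory() as workdir:
        tex_path = Path(workdir) / 'flier.tex'
        tex_path.write_text(tex_source)
        subprocess.run(
            ['pdflatex', '-interaction=nonstopmode', tex_path.name],
            cwd=workdir, check=True,
        )
        pdf_bytes = (Path(workdir) / 'flier.pdf').read_bytes()
    message = EmailMessage('Your flier', 'See attached.', to=[customer.email])
    message.attach('flier.pdf', pdf_bytes, 'application/pdf')
    message.send()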

Optimisation tips when migrating data into Sitecore CMS

I am currently faced with the task of importing around 200K items from a custom CMS implementation into Sitecore. I have created a simple import page which connects to an external SQL database using Entity Framework and I have created all the required data templates.
During a test import of about 5K items I realized that I needed to find a way to make the import run a lot faster, so I set about finding some information about optimizing Sitecore for this purpose. I have concluded that there is not much specific information out there, so I'd like to share what I've found and open the floor for others to contribute further optimizations. My aim is to create some kind of maintenance mode for Sitecore that can be used when importing large volumes of data.
The most useful information I found was on Mark Cassidy's blogpost http://intothecore.cassidy.dk/2009/04/migrating-data-into-sitecore.html. At the bottom of this post he provides a few tips for when you are running an import.
If migrating large quantities of data, try and disable as many Sitecore event handlers and whatever else you can get away with.
Use BulkUpdateContext()
Don't forget your target language
If you can, make the fields shared and unversioned. This should help migration execution speed.
The first thing I noticed on this list was the BulkUpdateContext class, as I had never heard of it. I quickly understood why, as a search on the SDN forum and in the PDF documentation returned no hits. So imagine my surprise when I actually tested it out and found that it improves item creation/deletion by at least tenfold!
The next thing I looked at was the first point, where he basically suggests creating a version of web.config that only has the bare essentials needed to perform the import. So far I have removed all events related to creating, saving and deleting items and versions. I have also removed the history engine and system index declarations from the master database element in web.config, as well as any custom events, schedules and search configurations. I expect that there are a lot of other things I could look to remove/disable in order to increase performance. Pipelines? Schedules?
What optimization tips do you have?
Incidentally, BulkUpdateContext() is a very misleading name - as it really improves item creation speed, not item updating speed. But as you also point out, it improves your import speed massively :-)
Since I wrote that post, I've added a few new things to my normal routines when doing imports.
Regularly shrink your databases. They tend to grow large and bulky. To do this, first go to Sitecore Control Panel -> Database and select "Clean Up Database". After this, do a regular ShrinkDB on your SQL server
Disable indexes, especially if importing into the "master" database. For reference, see http://intothecore.cassidy.dk/2010/09/disabling-lucene-indexes.html
Try not to import into "master", however... you will usually find that imports into "web" are a lot faster, mostly because this database isn't (by default) connected to the HistoryManager or other gadgets
And if you're really adventurous, there's a thing you could try that I'd been considering trying out myself but never got around to. It might work, but I can't guarantee that it will :-)
Try removing all your field types from App_Config/FieldTypes.config. The theory here is that this should essentially disable all of Sitecore's special handling of the content of these fields (like updating the LinkDatabase and so on). You would need to manually trigger a rebuild of the LinkDatabase when done with the import, but that's a relatively small price to pay
Hope this helps a bit :-)
I'm guessing you've already hit this, but putting the code inside a SecurityDisabler() block may speed things up also.
I'd be a lot more worried about how Sitecore performs with this much data... assuming you only do the import once, who cares how long that process takes. Is this going to be a regular occurrence?