Tastypie-nonrel, django, mongodb: too many nestings - django

I am developing a web application with django, backbone.js, tastypie and mongodb. In order to adapt tastypie and django to mongodb I am using django-mongodb-engine and tastypie-nonrel. This application has a model Project, which has a list of Tasks. So it looks like this:
class Project(models.Model):
    user = models.ForeignKey(User)
    tasks = ListField(EmbeddedModelField('Task'), null=True, blank=True)

class Task(models.Model):
    title = models.CharField(max_length=200)
Thanks to tastypie-nonrel, getting the list of tasks of a project is as simple as a GET request to /api/v1/project/:id:/tasks/
Now I want to extend this Task model with a list of comments:
class Task(models.Model):
    title = models.CharField(max_length=200)
    comments = ListField(EmbeddedModelField('Comment'), null=True, blank=True)

class Comment(models.Model):
    text = models.CharField(max_length=1000)
    owner = models.ForeignKey(User)
The problem with this implementation is that tastypie-nonrel does not support another level of nesting, so it is not possible to simply POST a comment to /api/v1/project/:id:/task/:id:/comments/
The alternative is to just PUT the whole Task to /api/v1/project/:id:/task/, but this creates problems if two users add a comment to the same Task at the same time, as the last PUT would overwrite the previous one.
The last option (aside from changing tastypie-nonrel) is to not embed Comment inside Task and instead just hold a ForeignKey, so the request would go to /api/v1/Comment/. My question is whether this defeats the purpose of using MongoDB, since cross-collection queries would then be needed. Is there a better way of doing it?
I have little experience with any of the technologies in the stack, so it may be that I am not framing the problem well. Any suggestions are welcome.

It seems like you are nesting too much. That said, you can create custom URL mappings/methods for tastypie and run your own logic there instead of relying on the "auto-magic" tastypie behavior (see the sketch below). If you are worried about comments overwriting each other, you need some form of transaction anyway, and your code should then be robust enough to handle a failed transaction, for example by retrying. However, constantly locking a large object with many writers will greatly throttle your writes, which points to a design issue as well.
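Here is a rough sketch of what such a custom nested endpoint could look like using Tastypie's prepend_urls (called override_urls in older versions). The resource name, app/collection names, and the use of pymongo for an atomic $push are all assumptions about your setup, not something tastypie-nonrel gives you out of the box:

# Hypothetical nested endpoint: POST /api/v1/project/<pk>/task/<index>/comments/
import json

from django.conf.urls import url
from tastypie.http import HttpCreated
from tastypie.resources import ModelResource
from tastypie.utils import trailing_slash

from myapp.models import Project  # assumption: your app is called "myapp"


class ProjectResource(ModelResource):
    class Meta:
        queryset = Project.objects.all()
        resource_name = 'project'

    def prepend_urls(self):
        # Map the extra nested URL onto a custom view.
        return [
            url(r"^(?P<resource_name>%s)/(?P<pk>\w+)/task/(?P<task_index>\d+)/comments%s$"
                % (self._meta.resource_name, trailing_slash()),
                self.wrap_view('post_task_comment'),
                name="api_post_task_comment"),
        ]

    def post_task_comment(self, request, **kwargs):
        self.method_check(request, allowed=['post'])
        comment = json.loads(request.body)

        # Append atomically with $push via pymongo so two users commenting at
        # the same time do not overwrite each other (use update_one on newer
        # pymongo). Database and collection names here are assumptions.
        from bson import ObjectId
        from pymongo import MongoClient
        collection = MongoClient()['mydb']['myapp_project']
        collection.update(
            {'_id': ObjectId(kwargs['pk'])},
            {'$push': {'tasks.%s.comments' % kwargs['task_index']: comment}},
        )
        return self.create_response(request, comment, response_class=HttpCreated)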
One way you can mitigate this a bit is to write to an intermediate store such as a task queue or redis, then dump in the comments as needed. It just depends on how reliable/durable your solution needs to be. A task queue would at least handle retries for failed transactions; with redis you could do something with pub/sub.
You should consider a few things about your design IMO regarding MongoDB.
Avoid creating overly large monolithic objects. Although this is a benefit of Mongo, it depends on your usage. If you are for instance always returning your project as a top-level object, then as the tasks and comments grow, the network traffic alone will kill performance.
Imagine a very contrived example in which the project-specific data is 10k, each task alone is 5k, and each comment alone is 2k. If you have a project with 5 tasks and 10 comments per task, you are talking about 10k + 5*(5k + 10*2k) = 135k per fetch. For a very active project with lots of comments, this will be heavy to send across the network. You can use slice/projection queries to work around this, but with some limitations and implications.
A corollary to the above: structure your objects around your use cases. If you don't need to bring things back together, they can live in different collections. Just because you "think" you need them together doesn't mean, implementation-wise, that they need to be retrieved in the same fetch (although that is normally ideal).
Even if you do need everything in one use case/screen, another solution that may be possible in some designs is to load things in parallel, or even deferred via JavaScript after the page loads, using AJAX. For example, you could load the task info at the top, and then make an async call to load the comments separately, similar to how Disqus or Livefyre work as integrations in other sites. This could help resolve your nesting issue somewhat as you'd get rid of the task/project levels and simply store some IDs on each comment/record to be able to query between collections.
Keep in mind you may not want to retrieve all comments at once, and if you have a lot of comments, you may run up against the size limit of a single document. The limit is larger in recent versions of Mongo, but it usually doesn't make sense anyway to have a single record holding a huge amount of data, going back to the first item above.
My recommendations are:
Use transactions if you're concerned about losing comments.
Add a task queue/redis/something durable if you're worried about competing writes and losing things as a result of #1. If not, ignore it. Is it the end of the world if you lose a comment?
Consider restructuring, particularly moving comments into a separate collection, to ease your tastypie issues (a sketch follows below). Load things deferred or in parallel if needed.
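For that last recommendation, a minimal sketch of what the restructured models could look like with django-mongodb-engine/djangotoolbox. The task_id field and its type are assumptions: you would need to assign each embedded Task some identifier, or promote Task to its own collection as well.

# Hypothetical restructuring: comments live in their own collection and
# reference the task they belong to, instead of being embedded in Project.
from django.contrib.auth.models import User
from django.db import models
from djangotoolbox.fields import EmbeddedModelField, ListField


class Project(models.Model):
    user = models.ForeignKey(User)
    tasks = ListField(EmbeddedModelField('Task'), null=True, blank=True)


class Task(models.Model):
    title = models.CharField(max_length=200)


class Comment(models.Model):
    # Its own collection, exposed at /api/v1/comment/ by a plain resource.
    task_id = models.CharField(max_length=24, db_index=True)  # assumption: the key you assign to each Task
    text = models.CharField(max_length=1000)
    owner = models.ForeignKey(User)
    created = models.DateTimeField(auto_now_add=True)

Fetching a task's comments then becomes a single indexed query on Comment, and concurrent comments never touch the Project document at all.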

Related

Should I use an internal API in a django project to communicate between apps?

I'm building/managing a django project, with multiple apps inside of it. One stores survey data, and another stores classifiers, that are used to add features to the survey data. For example, Is this survey answer sad? 0/1. This feature will get stored along with the survey data.
We're trying to decide how and where in the app to actually perform this featurization, and I'm being recommended a number of approaches that don't make ANY sense to me, but I'm also not very familiar with django, or more-than-hobby-scale web development, so I wanted to get another opinion.
The data app obviously needs access to the classifiers app, to be able to run the classifiers on the data, and then reinsert the featurized data, but how to get access to the classifiers has become contentious. The obvious approach, to me, is to just import them directly, a la
# from inside the Survey App
from ClassifierModels import Classifier
cls = Classifier.where(name='Sad').first() # or whatever, I'm used to flask
data = Survey.where(question='How do you feel?').first()
labels = cls(data.responses)
# etc.
However, one of my engineers is saying that this is bad practice, because apps should not import one another's models. And that instead, these two should only communicate via internal APIs, i.e. posting all the data to
http://our_website.com/classifiers/sad
and getting it back that way.
So, what feels to me like the most pressing question: Why in god's name would anybody do it this way? It seems to me like strictly more code (building and handling requests), strictly less intuitive code, that's more to build, harder to work with, and bafflingly indirect, like mailing a letter to your own house rather than talking to the person who lives there, with you.
But perhaps in easier to answer chunks,
1) Is there REALLY anything the matter with the first, direct, import-other-apps-models approach? (The only answers I've found say 'No!,' but again, this is being pushed by my dev, who does have more industrial experience, so I want to be certain.)
2) What is the actual benefit of doing it via internal API's? (I've asked of course, but only get what feel like theoretical answers, that don't address the concrete concerns, of more and more complicated code for no obvious benefit.)
3) How much do the size of our app, and team, factor into which decision is best? We have about 1.75 developers, and only, even if we're VERY ambitious, FOUR users. (This app is being used internally, to support a consulting business.) So to me, any questions of Best Practices etc. have to factor in that we have tiny teams on both sides, and need something stable, functional, and lean, not something that handles big loads, or is externally secure, or fast, or easily worked on by big teams, etc.
4) What IS the best approach, if NEITHER of these is right?
It's simply not true that apps should not import other apps' models. For a trivial refutation, think about the apps in django.contrib which contain models such as User and ContentType, which are meant to be imported and used by other apps.
That's not to say there aren't good use cases for an internal API. I'm in the planning process of building one myself. But they're really only appropriate if you intend to split the apps up some day into separate services. An internal API on its own doesn't make much sense if you're not in a service-based architecture.
I can't see any reason why you should not import one app's models from another. Django itself uses several applications and their models internally (like auth and admin). Reading the applications section of the documentation, we can see that the framework has all the tools to manage multiple applications and their models inside a project.
However, it seems quite obvious to me that sending HTTP requests to your own application's API would make your code messier and slower.
Without more context it's hard to understand why your engineer considers this bad practice. He was maybe referring to database isolation (see "Multiple databases" in the documentation) or proper code isolation for testing.
It is right to think about decoupling your apps, but I do not think an internal REST API is a good way to do it.
Directly importing models and running queries and updates against another app is not a great approach either; every time you use a model from another app, you should be careful. I suggest you separate the communication between apps into a simple service layer. Then your Survey app does not have to know the model structure of the Classifier app:
# from inside the Survey App
from ClassifierModels.services import get_classifier_cls
cls = get_classifier_cls('Sad')
data = Survey.where(question='How do you feel?').first()
labels = cls(data.responses)
# etc.
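For completeness, the service module referenced above might look something like the following sketch. The manager lookup and the predict attribute are assumptions about how Classifier is implemented; the point is only that the lookup details stay inside the Classifier app.

# ClassifierModels/services.py -- a thin service layer so other apps never
# touch the Classifier models directly.
from .models import Classifier


def get_classifier_cls(name):
    """Return a callable that labels an iterable of responses."""
    classifier = Classifier.objects.get(name=name)
    # Assumption: the model exposes some callable for classification.
    return classifier.predict

If classifier storage changes later (another model, an external service), only this module changes; the Survey app keeps calling get_classifier_cls('Sad').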
For more information, you should read this thread: Separation of business logic and data access in django
More generally, you should create smaller, testable components. Nowadays I am interested in the "functional core, imperative shell" paradigm; try Gary Bernhardt's lectures: https://gist.github.com/kbilsted/abdc017858cad68c3e7926b03646554e

Django best practices to validate data in other tables - taking complexity out of the view file?

I was wondering about best practices in Django for validating table content.
I am creating Sales Orders, and a Sales Order should check the availability of the items I have in stock; if they are not in stock, it will trigger manufacturing orders and purchase orders.
I don't want to make a very complex view, and I am looking for a way to decouple the logic from it; I also predict performance issues.
What are the best practices or ready-made solutions I can use in the Django framework to address view complexity?
I see different possibilities, but I am wondering which will be the best fit in my case:
managers
celery - just to run a job occasionally; I want the app to be real time, so I don't like this option
signals (pre_save/post_save)
model validation
creating an extra layer like a services.py file
Since I am new to Django, I am a bit puzzled about which route to take.
Not sure if this is the answer you are looking for.
Signals are for doing things automatically when events happen, most commonly before and after model operations. So if you need to do something every time you save, create or delete a record, that is where you use signals.
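For example, a minimal post_save receiver sketch; the SalesOrder model and the replenishment helper are invented names for illustration:

# signals.py -- react automatically whenever a SalesOrder is created.
from django.db.models.signals import post_save
from django.dispatch import receiver

from .models import SalesOrder  # assumption: your order model


@receiver(post_save, sender=SalesOrder)
def check_stock_on_create(sender, instance, created, **kwargs):
    if created:
        # Hypothetical helper that checks stock and triggers
        # manufacturing/purchase orders as needed.
        instance.trigger_replenishment_if_needed()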
Managers are used to manage record retrieval and manipulation. If you want some clever way of retrieving data, you can define a custom manager and add custom methods to it. If you want to override some default behavior of querysets, you would also do that with a custom manager.
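A small sketch of such a custom manager, with invented names:

# managers keep query logic out of the views.
from django.db import models


class SalesOrderManager(models.Manager):
    def pending(self):
        # Only orders still awaiting an availability check (hypothetical status field).
        return self.filter(status='pending')


class SalesOrder(models.Model):
    status = models.CharField(max_length=20, default='pending')

    objects = SalesOrderManager()

Views can then call SalesOrder.objects.pending() instead of repeating the filter everywhere.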
Celery is for running things asynchronously. If you are worried that some processing might take a long time, that is where you might consider offloading it to celery. A friendly warning though: doing things asynchronously raises the complexity of your code quite a bit, since you need some mechanism to pass data back from celery tasks to your django app and your users.
The services.py option that you mentioned seems to do what you want; it just provides a place where you can put logic that is not specific to a particular view.
Here on Stack Overflow, I got advice from some experienced developers that premature optimization is the root of all evil.
What I suggest is: keep it simple. Making the view a little more complex is actually better than adding one more layer of complexity. Try to put most of your logic in models, and whatever remains after that in views.
Also, unnecessarily pulling in multiple packages will not solve much of your problem, so use them only when necessary. Otherwise, try to write the minimal logic yourself so that you do not have to depend on many apps.
Signals and the other options, as everybody says, are not as great as they may seem, however promising. Just try to keep things simple.
One more point from my side: as you are just starting out, go through class-based views and try to use them once you are familiar with them. That will simplify your views the most. Plus, if you are new to django, read a little code; https://github.com/vitorfs/bootcamp might help you get started.

Workflow frameworks for Django [closed]

I've been looking for a framework to simplify the development of reasonably complex workflows in Django applications. I'd like to be able to use the framework to automate the state transitions, permissioning, and perhaps some extras like audit logging and notifications.
I've seen some older information on the same topic, but not too much in the last 2-3 years. The major choices I've heard of are GoFlow (not updated since 2/2009) and django-workflow (seems more active).
Has anyone used these packages? Are they mature and/or compatible with modern (1.3) Django? Are there other options out there worth considering that might be better or better supported?
Let me give a few notes here, as I'm the author of django-fsm and django-viewflow, two projects that could be called "workflow libraries".
The word "workflow" itself is a bit overloaded. Different kinds of libraries and software call themselves "workflow" but have varying functionality.
The commonality is that a workflow connects the steps of some process into a whole.
General classification
As I see it, workflow implementation approaches can be classified as follows:
Single/multiple users - whether the workflow library automates single-user tasks only, or also has permission checking and task assignment options.
Sequential/parallel - a sequential workflow is just a state-machine-pattern implementation and allows a single active state at a time. Parallel workflows allow several active tasks at once, and usually provide some sort of parallel sync/join functionality.
Explicit/implicit - whether the workflow is represented as a separate external entity, or is woven into another class whose main responsibility is different.
Static/dynamic - static workflows are implemented in python code once and then executed; dynamic workflows can typically be configured by changing the contents of workflow database tables. Static workflows are usually better integrated with the rest of the django infrastructure (views, forms and templates) and support better customization through the usual python constructions like class inheritance. Dynamic workflows assume a generic interface that can adapt to workflow changes at runtime.
Of these, the first two could be considered gradual differences, but the other two are fundamental.
Specific packages
Here is a brief description of what we have nowadays in django itself, on djangopackages, and in the awesome-django project list under the workflow section:
django.contrib.WizardView - implicit, single-user, sequential, static: the simplest workflow implementation we could have. It stores intermediate state in hidden form POST data.
django-flows - explicit, single-user, sequential, static workflow that keeps flow state in external storage, so the user can close the page or open it in another tab and continue working.
django-fsm - implicit, multi-user, sequential, static workflow; the most compact and lightweight state machine library. State-change events are represented simply as python method calls on the model class (a minimal usage sketch follows this list). Has rudimentary support for flow inheritance and overrides. Provides hooks for associating permissions with state transitions. Allows optimistic locking to prevent concurrent state updates.
django-states - explicit, multi-user, sequential, static workflow with a separate class for the state machine and state transitions. Transitions are made by passing the string name of a transition to the make_transition method. Provides a way to associate permissions with state transitions. Has a simple generic REST endpoint for changing model states via AJAX calls. State machine inheritance support is not mentioned in the documentation, but the class-based state definition makes it possible with few or no core library modifications.
django_xworkflows - explicit, sequential, static workflow with no support for user permission checking and a separate class for the state machine. Uses tuples for state and transition definitions, which makes workflow inheritance hard.
django-workflows - explicit, multi-user, sequential, dynamic workflow storing the state in library-provided django models. Has a way to attach permissions to workflow transitions, and basically that's all.
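To give a feel for the django-fsm style mentioned above, a minimal sketch (the model and state names are invented for illustration):

# django-fsm: the workflow lives in decorated methods on the model itself.
from django.db import models
from django_fsm import FSMField, transition


class Ticket(models.Model):
    state = FSMField(default='new')

    @transition(field=state, source='new', target='in_review')
    def submit(self):
        # Side effects of the transition (notifications, audit) go here.
        pass

    @transition(field=state, source='in_review', target='done')
    def approve(self):
        pass

Calling ticket.submit() and then ticket.save() advances the state; calling a transition from a wrong source state raises an error.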
None of these django state machine libraries have support for parallel workflows, which limits their scope of application a lot. But there are two that do:
django-viewflow - explicit, multi-user, parallel, static workflow, with support for parallel task execution and complex split and join semantics. Provides helpers to integrate with django function-based and class-based views, different background task execution queues, and various pessimistic and optimistic locking strategies to prevent concurrent updates.
GoFlow, mentioned in the question, aims to be an explicit, multi-user, parallel, dynamic workflow, but it has been abandoned by its author for years.
I see a way to implement dynamic workflow construction on top of django-viewflow. As soon as it is completed, it will close the last and most sophisticated case for workflow implementation in the django world.
I hope that anyone who has read this far now understands the term "workflow" better and can make a conscious choice of workflow library for their project.
Are there other options out there worth considering that might be better or better supported?
Yes.
Python.
You don't need a workflow product to automate the state transitions, permissioning, and perhaps some extras like audit logging and notifications.
There's a reason why there aren't many projects doing this.
The State design pattern is pretty easy to implement.
The authorization rules ("permissioning") are already a first-class part of Django.
Logging is already a first-class part of Python (and has been added to Django). Using this for audit logging is either an audit table or another logger (or both).
The message framework ("notifications") is already part of Django.
What more do you need? You already have it all.
Using class definitions for the State design pattern, and decorators for authorization and logging works out so well that you don't need anything above and beyond what you already have.
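As a minimal sketch of that approach (the states and the audit decorator below are invented for illustration, not any particular library's API):

# Plain-Python workflow: states are classes, transitions are methods,
# cross-cutting concerns such as audit logging are decorators.
import functools
import logging

audit_log = logging.getLogger('audit')


def audited(func):
    """Log every transition for audit purposes."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        audit_log.info('%s.%s', self.__class__.__name__, func.__name__)
        return func(self, *args, **kwargs)
    return wrapper


class Draft(object):
    @audited
    def submit(self, document):
        document.state = Submitted()


class Submitted(object):
    @audited
    def approve(self, document):
        document.state = Published()

    @audited
    def reject(self, document):
        document.state = Draft()


class Published(object):
    pass  # terminal state: no transitions defined

Permission checks can be layered on the same way, for example with Django's permission_required on the views that trigger transitions, or with another custom decorator.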
Read this related question: Implementing a "rules engine" in Python
It's funny because I would have agreed with S.Lott about just using Python as is for a rule engine. I have a COMPLETELY different perspective now having done it.
If you want a full rule engine, it needs quite a few moving parts. We built a full Python/Django rules engine, and you would be surprised what needs to be built in to get a great rule engine up and running. I will explain further, but first, the website is http://nebrios.com.
A rule engine should at least have:
Access Control Lists - do you want everyone seeing everything?
Key/Value pair API - KVP's store the state, and all the rules react to changed states.
Debug mode - Being able to see every changed state, what changed it and why. Paramount.
Interaction through web forms and email - Being able to quickly script a web form is a huge plus, along with parsing incoming emails consistently.
Process ID's - These track a "thread" of business value. Otherwise processes would be continually overlapping.
Sooo much more!
So try out Nebri, or the others I list below to see if they meet your needs.
(Screenshots in the original answer show the debug mode and an auto-generated form.)
A sample workflow rule:
class task_sender(NebriOS):
    # send a task to the person it got assigned to
    listens_to = ['created_date']

    def check(self):
        return (self.created_date is not None) and (self.creator_status != "complete") and (self.assigned is not None)

    def action(self):
        send_email(self.assigned, """
The ""{{task_title}}"" task was just sent your way!
Once you finish, send this email back to log the following in the system:
i_am_finished := true
It will get assigned back to the task creator to look over.
Thank you!! - The Nebbs
""", subject="""{{task_title}}""")
So, no, it's not simple to build a rules-based, event-based workflow engine in Python alone. We have been at it for over a year! I would recommend trying tools like:
http://nebrios.com
http://pyke.sourceforge.net (It's Python also!)
http://decisions.com
http://clipsrules.sourceforge.net
A package written by an associate of mine, django-fsm, seems to work--it's both fairly lightweight and sufficiently featureful to be useful.
I can add one more library, which supports on-the-fly changes to workflow components, unlike its equivalents.
Look at django-river
It now has a pretty admin interface called River Admin
ActivFlow: a generic, lightweight and extensible workflow engine for agile development and automation of complex business process operations.
You can have an entire workflow modeled in no time!
Step 1: Workflow App Registration
WORKFLOW_APPS = ['leave_request']
Step 2: Activity Configuration
from django.db.models import CharField, DateField, TextField  # assumption: standard Django model fields

from activflow.core.models import AbstractActivity, AbstractInitialActivity
from activflow.leave_request.validators import validate_initial_cap


class RequestInitiation(AbstractInitialActivity):
    """Leave request details"""
    employee_name = CharField(
        "Employee", max_length=200, validators=[validate_initial_cap])
    from_date = DateField("From Date")  # renamed from "from", which is a reserved word in Python
    to = DateField("To Date")
    reason = TextField("Purpose of Leave", blank=True)

    def clean(self):
        """Custom validation logic should go here"""
        pass


class ManagementApproval(AbstractActivity):
    """Management approval"""
    approval_status = CharField(verbose_name="Status", max_length=3, choices=(
        ('APP', 'Approved'), ('REJ', 'Rejected')))
    remarks = TextField("Remarks")

    def clean(self):
        """Custom validation logic should go here"""
        pass
Step 3: Flow Definition
FLOW = {
    'initiate_request': {
        'name': 'Leave Request Initiation',
        'model': RequestInitiation,
        'role': 'Submitter',
        'transitions': {
            'management_approval': validate_request,
        }
    },
    'management_approval': {
        'name': 'Management Approval',
        'model': ManagementApproval,
        'role': 'Approver',
        'transitions': None
    }
}
Step 4: Business Rules
def validate_request(self):
    return self.reason == 'Emergency'
I migrated django-goflow from Django 1.x / Python 2.x to Django 2.x / Python 3.x; the project is at django2-goflow.

Optimisation tips when migrating data into Sitecore CMS

I am currently faced with the task of importing around 200K items from a custom CMS implementation into Sitecore. I have created a simple import page which connects to an external SQL database using Entity Framework and I have created all the required data templates.
During a test import of about 5K items I realized that I needed to find a way to make the import run a lot faster, so I set about finding some information about optimizing Sitecore for this purpose. I have concluded that there is not much specific information out there, so I'd like to share what I've found and open the floor for others to contribute further optimizations. My aim is to create some kind of maintenance mode for Sitecore that can be used when importing large volumes of data.
The most useful information I found was on Mark Cassidy's blogpost http://intothecore.cassidy.dk/2009/04/migrating-data-into-sitecore.html. At the bottom of this post he provides a few tips for when you are running an import.
If migrating large quantities of data, try and disable as many Sitecore event handlers and whatever else you can get away with.
Use BulkUpdateContext()
Don't forget your target language
If you can, make the fields shared and unversioned. This should help migration execution speed.
The first thing I noticed on this list was the BulkUpdateContext class, as I had never heard of it. I quickly understood why, as a search on the SDN forum and in the PDF documentation returned no hits. So imagine my surprise when I actually tested it out and found that it improves item creation/deletion speed at least tenfold!
The next thing I looked at was the first point, where he basically suggests creating a version of web.config that only has the bare essentials needed to perform the import. So far I have removed all events related to creating, saving and deleting items and versions. I have also removed the history engine and system index declarations from the master database element in web.config, as well as any custom events, schedules and search configurations. I expect there are a lot of other things I could look to remove/disable in order to increase performance. Pipelines? Schedules?
What optimization tips do you have?
Incidentally, BulkUpdateContext() is a very misleading name - as it really improves item creation speed, not item updating speed. But as you also point out, it improves your import speed massively :-)
Since I wrote that post, I've added a few new things to my normal routines when doing imports.
Regularly shrink your databases. They tend to grow large and bulky. To do this, first go to Sitecore Control Panel -> Database and select "Clean Up Database". After this, do a regular ShrinkDB on your SQL server.
Disable indexes, especially if importing into the "master" database. For reference, see http://intothecore.cassidy.dk/2010/09/disabling-lucene-indexes.html
Try not to import into "master", however; you will usually find that imports into "web" are a lot faster, mostly because this database isn't (by default) connected to the HistoryManager or other gadgets.
And if you're really adventurous, there's one thing you could try that I'd been considering trying out myself but never got around to. It might work, but I can't guarantee that it will :-)
Try removing all your field types from App_Config/FieldTypes.config. The theory here is that this should essentially disable all of Sitecore's special handling of the content of these fields (like updating the LinkDatabase and so on). You would need to manually trigger a rebuild of the LinkDatabase when done with the import, but that's a relatively small price to pay.
Hope this helps a bit :-)
I'm guessing you've already hit this, but putting the code inside a SecurityDisabler() block may speed things up also.
I'd be a lot more worried about how Sitecore performs with this much data... assuming you only do the import once, who cares how long that process takes. Is this going to be a regular occurrence?

my Django development (needs advice)

I am writing a website using Django. I need to push the web site out as soon as possible. I don't need a lot of amazing things right now.
I am concerned about future development.
If I enable registration, that means I allow more content to be written by users; if I don't, then only the admins can publish content. The website isn't exactly a CMS.
This is a big problem, as I will continue to add new features and rewrite code (either by adapting third-party apps or rewriting the app itself). So how would either path affect my database contents?
So the bottom line is: how do I ensure the safety of my data as development continues?
I hope someone can offer a little insights on this matter.
Thank you very much. It's hard to describe my concern, really.
Whatever functionality you add later (new fields, etc.), you can still migrate your data to the "new" database.
It becomes more complicated with relationships, because you might have integrity problems. Say you have a Comment model, and say you don't enable registration, so all users can comment on certain posts. If after, you decide to enable registration, and you decide that ALL the comments have to be associated with a user, then you will have problems migrating your data, because you'll have lots of comments for which you'll have to make up a user, or that you'll just have to drop. Of course, in that case there would be work-arounds, but it is just to illustrate some of the problems you might encounter later.
Personally, I try to have a good data model with only the minimum necessary fields (more fields will come later, with new functionality). I especially try to avoid having to add new foreign keys to already existing models. For example, it is fine to add a new model later with a foreign key to an existing model (see the sketch below), but the opposite is more complicated.
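A minimal sketch of that point, with invented model names: adding a new Comment model that points at an existing Post later is painless, because no existing rows need to change; adding a required foreign key to an already populated table is what gets messy.

# Existing model, already in production with data.
from django.contrib.auth.models import User
from django.db import models


class Post(models.Model):
    title = models.CharField(max_length=200)


# Added later: a brand-new table, so nothing existing has to be altered.
class Comment(models.Model):
    post = models.ForeignKey(Post)
    # Nullable, so comments written before registration was enabled stay valid.
    user = models.ForeignKey(User, null=True, blank=True)
    text = models.TextField()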
Finally, I am not sure why you hesitate to enable registration. It is actually very simple to do (you can, for example, use django-registration; you would just have to write some urlconf and some templates, and that's all).
Hope this helps!
If you are afraid of data migration, just use South...