I'm working on a web app with a lot of data, and would like some general tech stack advice. I'm a Django developer, but I haven't worked with this much data before.
Apologies for the general question, but I'd really appreciate some advice. If it's really not right for SO, rather than just voting to close it, I'd appreciate a suggestion for a forum where I could ask instead.
My database will have three tables, one of which will have approximately 500m rows (100GB of data). The data is read-only from the app's point of view and is refreshed only once a month.
The large table (500m rows) holds spending items per month for the past five years; the other tables are the institutions doing the spending (~10k rows) and the items bought (~4,000 rows). The models basically look like this:
class Organisation(models.Model):
    name = models.CharField(max_length=200)

class SpendItem(models.Model):
    name = models.CharField(max_length=200)

class Spend(models.Model):
    spend_item = models.ForeignKey(SpendItem)
    organisation = models.ForeignKey(Organisation)
    spend_value = models.FloatField()
    processing_date = models.DateField()
I'll need to offer pages in the web app that query and aggregate this spending data in various ways. For example, I might want to show a page per institution, with the total spend for each month and the total spend per type of item. Or a page per item, with total spent, and spending by institution.
My initial plan was to have a Postgres back-end, since I know the shape of the data, and simply make queries through the Django ORM, or raw SQL if necessary for speed.
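For example, I assume a per-institution monthly breakdown via the ORM would look something like this (a sketch using TruncMonth; I haven't tried it at this scale):

from django.db.models import Sum
from django.db.models.functions import TruncMonth

# Monthly spend totals for one organisation; 'org' is an
# Organisation instance fetched elsewhere.
monthly_totals = (Spend.objects
                  .filter(organisation=org)
                  .annotate(month=TruncMonth('processing_date'))
                  .values('month')
                  .annotate(total=Sum('spend_value'))
                  .order_by('month'))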
But I'm starting to get worried: will aggregate queries be much too slow over 500m rows? Will I need to pre-calculate all the aggregate queries? Should I also be looking into other technologies that I haven't used before, like Elasticsearch, or even BigQuery?
Another concern: is a Postgres database this size (presumably ~200GB with indexes) likely to run at an acceptable speed from an SSD, or do I need to pay for enough RAM to hold it all in memory? (Eeeek.)
I know the answer really is "try it and see", but I'd really appreciate any upfront advice from more experienced Django/Postgres/data developers. If you were working on an application this shape, how would you approach it?
I may not have understood the problem completely, but here's how I would approach it.
I wouldn't take on the overhead of Elasticsearch/Solr just for calculating aggregates (in my opinion, they're useful when you want FTS, ranking and the like).
I would rather have an index on processing_date, add two more fields such as last_indexing_date and last_aggregate (on each Organisation, say), and update them periodically with a background async task.
For real-time figures I would take last_indexing_date, run an aggregate over spends after that date, add the result to last_aggregate, and then update both fields as well.
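A rough sketch of what I mean, reusing the question's models (the two extra fields and the helper names are just placeholders):

from django.db.models import Sum
from django.utils import timezone

# Assumed extra fields on Organisation (not in the question's models):
#   last_aggregate = models.FloatField(default=0)
#   last_indexing_date = models.DateField()  # assumed initialised

def current_total(org):
    # Aggregate only the rows newer than the cached snapshot.
    recent = (Spend.objects
              .filter(organisation=org,
                      processing_date__gt=org.last_indexing_date)
              .aggregate(total=Sum('spend_value'))['total']) or 0
    return org.last_aggregate + recent

def refresh(org):
    # Run from a periodic background task (e.g. Celery beat).
    org.last_aggregate = current_total(org)
    org.last_indexing_date = timezone.now().date()
    org.save(update_fields=['last_aggregate', 'last_indexing_date'])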
Not sure what you meant by:
Create a usable web app that offers this information to those users
Hope this helps :)
After a basic introduction to Python thanks to an edX course, and a chat with a friend who told me about Django, I thought I could implement a solution for my laboratory. My goal is to keep track of every reagent order made by each of us researchers to the suppliers.
After one month I have a pretty decent version of it which I'm very proud of (I also have to thank a lot of Stack Overflow questions that helped me). Nonetheless, there's one requirement of the ordering flow that I haven't been able to translate to the Django app. Let me explain:
Users have a form to enter the reagent they need (one per form), which is then passed to the corresponding manufacturer so they can send us an invoice. It's convenient that each invoice covers several products, but they all have to: a) be sold by the same manufacturer, b) be sent to the same location and c) be charged to the same bank account (it's actually more complicated, but this will suffice for the explanation).
According to that, administrators of the app could process different orders by different users and merge them together as long as they meet the three requirements.
How would you implement this into the app regarding tables and relationships?
What I have now is an Order model and an Order form which has different CharFields with information about the product (name, reference, etc.), and then the shipping address and the bank account (which are ForeignKeys).
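Simplified, it looks something like this (field names are approximations of my actual code; Address and BankAccount stand in for the real related models):

from django.db import models

class Order(models.Model):
    # Product information entered by the researcher.
    product_name = models.CharField(max_length=200)
    reference = models.CharField(max_length=100)
    # The fields any merged invoice would have to share.
    manufacturer = models.CharField(max_length=200)
    shipping_address = models.ForeignKey('Address', on_delete=models.PROTECT)
    bank_account = models.ForeignKey('BankAccount', on_delete=models.PROTECT)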
When the administrators (the administrative staff) process the orders, they assign several of them that meet the requirements to an invoice, and then the problem comes: all the data has to be filled in repeatedly for each of the orders.
The immediate solution would be to create a Products model so that each Order instance could have several products, as long as they meet the three requirements, but this presents two problems:
1) The products table is going to be very difficult to populate properly. Users are not going to be precise about references and other important data.
2) We would still have different Orders that could be merged into the same invoice.
I thought maybe I could let the users add fields dynamically to the Order model (adding product 1, product 2, product 3). I read about formsets, but as far as I understood they repeat whole forms, and I would only need to repeat fields. In any case, that would solve the first point, but not the second.
Any suggestions?
Thanks, and sorry for the long block!
I'm working on a dating app for a hackathon project. We have a series of questions that users fill out, and then every few days we are going to send suggested matches. If anyone has a good tutorial for these kinds of matching algorithms, it would be very appreciated. One idea is to assign a point value to each question and then write a comparison(person_a, person_b) function that iterates through the questions and adds a point wherever there's a common answer. So the higher the score, the better the match. I understand this so far, but I'm struggling to see how to save this data in the database.
In Python, I could take each user, iterate through all the other users with this comparison function, and build a dictionary for each person that lists every other user and a score for them. Then, to suggest matches, I iterate through the dictionary and, if the two people haven't already been matched up, make a match.
person1_dictionary_of_matches = {'person2': 3, 'person3': 5, 'person4': 10, 'person5': 12, 'person6': 2, ..., 'person200': 10}
person_1_list_of_prior_matches = ['person3', 'person4']
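Roughly, the scoring function I have in mind looks like this (answers keyed by question id; all names are made up):

def comparison(person_a_answers, person_b_answers):
    # Both arguments are dicts mapping question id -> chosen answer.
    score = 0
    for question_id, answer in person_a_answers.items():
        if person_b_answers.get(question_id) == answer:
            score += 1
    return score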
I'm struggling with how to represent this in Django. I could have a bunch of users and make a Match model like:
class Match(Model):
    person1 = models.ForeignKey(User)
    person2 = models.ForeignKey(User)
    score = models.PositiveIntegerField()
Where I do the iteration and save all the pairwise scores.
and then do
person_matches = (Match.objects
                  .filter(person1=sarah)
                  .exclude(person2=sarah)
                  .exclude(person2__in=list_of_past_matches)
                  .order_by('-score'))
But I'm worried that with 1,000 users I will have 1,000,000 rows in my table if I do this. Will it be brutal to save all these pairwise scores for each user in the database? Or does this not matter if I run it at, say, 1am on a Sunday, or just cache these responses once and use the comparisons for a period of months? Is there a better way to do this than matching everyone up pairwise? Should I use some other data structure to capture the people and their compatibility scores? Thanks so much for any guidance!
Interesting question. In machine learning you typically work with sparse matrices, which means you would not have to perform every single pairwise match evaluation. The sparsity may come from two alternatives:
Create a batch offline analysis of your data to perform some clustering (fancy solution).
Filter the individuals by some key attributes: a) gender/sexual preference, b) geographical location, c) dating status etc. (simple solution)
After the filtering you could run a function that estimates appropriate matches for the new user, and based on the user's selected choices write those matches to the database for future queries. However, if you get serious about this problem, I suggest you give Spark a try; at that point this is not a problem for an SQL database but for a big data engine.
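A rough sketch of the simple solution, where only pre-filtered candidates are ever scored (the profile fields are assumptions, and comparison() is the scoring function from the question):

from django.contrib.auth.models import User

def candidate_matches(user, limit=50):
    # Cheap attribute filters first, so the vast majority of user
    # pairs are never scored. Assumes a custom User model with
    # 'gender', 'seeking' and 'city' fields.
    candidates = (User.objects
                  .filter(gender=user.seeking, city=user.city)
                  .exclude(pk=user.pk))
    scored = [(other, comparison(user, other)) for other in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]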
Say I have a general website that lets someone download their feed in a small amount of time. A user can be subscribed to many different pages, and the user's feed must be returned from the server with only the N most recent posts across all of the pages subscribed to. Originally, when a user queried the server for a feed, the algorithm was as follows:
look at all of the pages the user subscribes to
get the N most recent posts from each page
sort all of the posts
return the N most recent posts to the user as their feed
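In rough Django terms, that looked something like this (model and field names are placeholders):

from itertools import chain

def build_feed(user, n=20):
    # One query per followed page, then an in-memory merge and
    # sort: this is the slow part.
    per_page = [list(page.posts.order_by('-time_published')[:n])
                for page in user.followed_pages.all()]
    merged = sorted(chain.from_iterable(per_page),
                    key=lambda post: post.time_published,
                    reverse=True)
    return merged[:n]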
As it turns out, doing this EVERY TIME a user tried to refresh a feed was really slow. So I changed the database to have a table of feedposts, which simply has a foreign key to a user and a foreign key to the post. Every time a page makes a new post, it creates a feed post for each of its subscribing followers. That way, when a user wants their feed, it already exists and does not have to be built upon retrieval.
The way I am doing this is creating far too many rows and simply does not seem scalable. For instance, if a single page makes 1 post & has 1,000,000 followers, we just created 1,000,000 new rows in our feedpost table.
Please help!
How do companies such as Facebook handle this problem? Do they generate the feed upon request? Are my database relationships terrible?
It's not that the original schema itself is inherently wrong, at least not based on the high-level description you have provided. The slowness stems from the fact that you're not accessing the database the way relational databases should be accessed.
In general, when querying a relational database, you should use JOINs and in-database ordering where possible, instead of fetching a bunch of data, and then trying to connect related objects and sort them in your code. If you let the database do all this for you, it will be much faster, because it can take advantage of indices, and only access those objects that are actually needed.
As a rule of thumb, if you need to sort the results of a QuerySet in your Python code, or loop through multiple querysets and combine them somehow, you're most likely doing something wrong and you should figure out how to let the database do it for you. Of course, it's not true every single time, but certainly often enough.
Let me try to illustrate with a simple piece of code. Assume you have the following models:
class Page(models.Model):
    name = models.CharField(max_length=47)
    followers = models.ManyToManyField('auth.User', related_name='followed_pages')

class Post(models.Model):
    title = models.CharField(max_length=147)
    page = models.ForeignKey(Page, related_name='posts')
    content = models.TextField()
    time_published = models.DateTimeField(auto_now_add=True)
You could, for example, get the list of the last 20 posts posted to pages followed by the currently logged in user with the following single line of code:
latest_posts = Post.objects.filter(page__followers=request.user).order_by('-time_published')[:20]
This runs a single SQL query against your database, which only returns the (up to) 20 results that match, and nothing else. And since you're joining on primary keys of all tables involved, it will conveniently use indices for all joins, making it really fast. In fact, this is exactly the kind of operation relational databases were designed to perform efficiently.
Caching will be the solution here.
You will have to reduce database reads, which are much slower than cache reads.
You can use something like Redis to cache the posts.
Here is an amazing answer for better understanding: Is Redis just a cache?
Each page can be assigned a key, and you can pull all of the posts for that page under that key.
You need not cache everything; just cache the most recent M posts, where M >> N, which is safe enough to reduce the database calls. If a user requests posts beyond the latest M, those can be fetched directly from the DB.
Now, when you have to generate the feed, you can make a DB call to get all of the subscribed pages (or put those in the cache as well) and then just get the required number of posts from the cache.
The problem here would be keeping the cache up to date.
For that you can use something like Django signals: whenever a new post is added, add it to the cache as well using the signal.
So for each DB write you will also have to write to the cache.
But then you will not have to read from the DB, and as Redis is an in-memory datastore it is very fast compared with standard relational databases.
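A minimal sketch of that signal-driven cache, assuming redis-py and a per-page cap of M posts (the Post model, key format, and M are illustrative):

import redis
from django.db.models.signals import post_save
from django.dispatch import receiver

r = redis.Redis()
M = 500  # keep well more than the N posts a feed page shows

@receiver(post_save, sender=Post)
def cache_post(sender, instance, created, **kwargs):
    if not created:
        return
    key = 'page:%d:posts' % instance.page_id
    r.lpush(key, instance.pk)   # newest first
    r.ltrim(key, 0, M - 1)      # cap the list at M entries

def recent_post_ids(page_id, n=20):
    # Serve feed reads from the cache; anything older than the
    # cached M falls back to the DB.
    return [int(pk) for pk in r.lrange('page:%d:posts' % page_id, 0, n - 1)]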
Edit:
A few more articles which can help for better understanding:
Does Stack Exchange use caching and if so, how
How Twitter Uses Redis to Scale - 105TB RAM, 39MM QPS, 10,000+ Instances
I'm working to optimize a Django application that's (mainly) backed by MongoDB, and it's dying under load testing. On the current problematic page, New Relic shows over 700 calls to pymongo.collection:Collection.find. Much of the code was written by junior coders; normally I would look for places to add indices, make smarter joins and remove loops to reduce query calls, but joins aren't an option here. What I have done (after adding indices based on EXPLAINs) is try to reduce the cost of the loops by making a general query and then filtering that smaller set inside the loops*. While I've gotten the number down from 900 queries, 700 still seems insane even given the intense amount of work being done on the page. I thought perhaps find was called even when filtering an existing queryset, but the code suggests it's always a database query.
I've added some logging to mongoengine to see where the queries come from and to look at EXPLAIN statements, but I'm not having much luck sifting through the wall of info. mongoengine itself seems to be part of the performance problem: I switched to mongomallard as a test and got a 50% performance improvement on the page. Unfortunately, I got errors on a bunch of other pages (as best I can tell, Mallard doesn't do well when filtering an existing queryset; the error complains about a call to deepcopy inside a generator, which you can't do, so I hit a brick wall there). While Mallard doesn't seem like a workable replacement for us, it does suggest a lot of the processing time is spent converting objects to and from Python in mongoengine.
What can I do to further reduce the calls? Or am I focusing on the wrong thing and should be attacking the problem somewhere else?
EDIT: providing some code/models
The page in question displays the syllabus for a course, showing all the modules in the course, their lessons and the concepts under the lessons. For each concept, the user's progress in the concept is also shown. So there's a lot of looping to get the hierarchy teased out (and it's not stored according to any of the patterns the Mongo docs suggest).
class CourseVersion(Document):
    ...
    course_instances = ListField(ReferenceField('CourseInstance'))
    courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))

class CoursewareContainer(EmbeddedDocument):
    id = UUIDField(required=True, binary=False, default=uuid.uuid4)
    ....
    courseware_containers = ListField(EmbeddedDocumentField('self'))
    teaching_element_instances = ListField(StringField())
The course's modules, lessons and concepts are stored in courseware_containers; we need to get all of the concepts so we can get the list of ids in teaching_element_instances to find the most recent one the user has worked on (if any) for that concept and then look up their progress.
* Just to be clear, I am using a profiler, looking at timings, and doing things The Right Way as best I know, not simply changing things and hoping for the best.
The code sample isn't bad per se, but there are a number of areas that should be considered and may help improve performance.
class CourseVersion(Document):
    ...
    course_instances = ListField(ReferenceField('CourseInstance'))
    courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))

class CoursewareContainer(EmbeddedDocument):
    id = UUIDField(required=True, binary=False, default=uuid.uuid4)
    ....
    courseware_containers = ListField(EmbeddedDocumentField('self'))
    teaching_element_instances = ListField(StringField())
Review
Unbounded lists.
course_instances, courseware_containers, teaching_element_instances
If these fields are unbounded and grow continuously, the document will move on disk as it grows, causing disk contention on heavily loaded systems. There are two patterns to help minimise this:
a) Turn on power of two sizes (the usePowerOf2Sizes collection flag; see the snippet below). This will cost disk space but should lower the amount of I/O churn as the document grows.
b) Initial padding: custom-pad the document on insert so it gets put into a larger extent, then remove the padding. Really an anti-pattern, but it may give you some mileage.
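For reference, a minimal way to enable (a) from pymongo (database and collection names here are placeholders):

from pymongo import MongoClient

# usePowerOf2Sizes is a real collection flag (relevant to the
# MMAPv1 storage engine); 'courseware'/'course_version' are made up.
db = MongoClient().courseware
db.command('collMod', 'course_version', usePowerOf2Sizes=True)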
The final barrier is the maximum document size: 16MB. You can't grow your data bigger than that.
Lists of ReferenceFields - course_instances
MongoDB doesn't have joins, so it costs an extra query to look up a ReferenceField; essentially they are an in-app join. That isn't bad per se, but it's important to understand the tradeoff. By default mongoengine won't automatically dereference the field; only when you access course_version.course_instances will it run another query and populate the whole list of references. So it can cost you an extra query; if you don't need the data, exclude() it from the query to stop any leaking queries.
EmbeddedFields
These fields are part of the document, so there is no cost for them other than the wire cost of transmitting and loading the data. As they are part of the document, you don't need select_related to get this data.
teaching_element_instances
Are these a list of ids? The code sample above says StringField. Either way, if you don't need to dereference the whole list, then storing the _ids as StringFields and manually dereferencing them may be more efficient if coded correctly, especially if you just need the latest (last?) id.
Model complexity
The CoursewareContainer is complex. For any given CourseVersion you have n CoursewareContainers, which themselves have a list of n containers, and those each have n containers, and so on...
Finding the most recent instances
"We need to get all of the concepts so we can get the list of ids in teaching_element_instances to find the most recent one the user has worked on (if any) for that concept and then look up their progress."
I'm unsure if there is a single instance you are after, or one per container, or one per course. Either way, the logic for querying the data should be examined. If it's a single instance you're after, it could be stored against the user to simplify the lookup. If it's per course or container, then to improve performance make sure you minimise the number of queries: if possible, collect all the ids and then issue a single $in query at the end, rather than doing a query per container.
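A hedged illustration of that batching with mongoengine (the Progress document and its fields are assumptions, not your actual models):

def collect_te_ids(containers):
    # Recursively gather every teaching_element_instances id
    # from the container tree.
    ids = []
    for container in containers:
        ids.extend(container.teaching_element_instances)
        ids.extend(collect_te_ids(container.courseware_containers))
    return ids

all_te_ids = collect_te_ids(course_version.courseware_containers)

# One $in query at the end instead of one query per container.
progress_by_te = {
    p.teaching_element_id: p
    for p in Progress.objects.filter(user=user,
                                     teaching_element_id__in=all_te_ids)
}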
Mongoengine costs
Currently, there is a performance cost to loading the data into mongoengine classes. If you don't need the classes and are happy to work with plain dictionaries, then either issue a raw pymongo query or use as_pymongo.
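For example (the query itself is hypothetical; as_pymongo is the real method):

# Returns plain dicts straight from pymongo, skipping Document hydration.
raw_versions = CourseVersion.objects.only('courseware_containers').as_pymongo()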
Schema design
The schema looks logical enough, but is it suitable for the use case? In essence, is it playing to MongoDB's strengths, or is it putting a relational peg in a document-database-shaped hole? I can't answer that for you, but I do know the way to the happy path with MongoDB is to design the schema around its use case. With relational databases, schema design is simple from the outset: you normalise. With document databases, how the data is used is a primary factor.
MongoDB best practices
There are many other best practices, and MongoDB has a guide which might be of interest: MongoDB Operations Best Practices.
Feel free to contact me via the Mongoengine mailing list to discuss further and if needs be discuss in private.
Ross
On my website I'm going to award points for some activities, similar to Stack Overflow. I would like to calculate the value based on many factors, so each computation for each user might take, for instance, 10 SQL queries.
I was thinking about caching it:
in memcache,
in the user's row in the database (so that wherever I fetch the user from the database, I can easily show the points).
Storing it in the database seems easy, but on the other hand it's redundant information, so I decided to ask, since maybe there is an easier and prettier solution which I missed.
I'd highly recommend this app for storing the calculated values in the model: https://github.com/initcrash/django-denorm
Memcache is faster than the db... but if you already have to retrieve the record from the db anyway, having the calculated values cached in the rows you're retrieving (as a 'denormalised' field) is even faster, plus it's persistent.
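If you'd rather hand-roll the denormalised field than add a dependency, a minimal sketch (the model name and the compute_points helper are hypothetical):

from django.db import models

class Profile(models.Model):
    user = models.OneToOneField('auth.User', on_delete=models.CASCADE)
    points = models.IntegerField(default=0)  # denormalised, stored for fast reads

    def recalculate_points(self):
        # Run the expensive multi-query computation, then persist
        # the result; compute_points is a stand-in for your logic.
        self.points = compute_points(self.user)
        self.save(update_fields=['points'])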