Datastore NDB best practices when querying and extracting thousands of rows - python-2.7

I'm using the High Replication Datastore, along with ndb. I have a kind with over 27,000 entities, which isn't that much. Supposedly the datastore is efficient in querying and extracting large amounts of data, but whenever I query over that kind, queries take a long time to finish (I've even got DeadlineExceededErrors).
I have a model where I store keywords and URLs I want to index in Google:
class Keywords(ndb.Model):
    keyword = ndb.StringProperty(indexed=True)
    url = ndb.StringProperty(indexed=True)
    number_articles = ndb.IntegerProperty(indexed=True)
    # Some other attributes... All attributes are indexed
My current use cases are to build my Sitemap and to fetch my top 20 keywords to link from my home page.
When I fetch many entities, I usually do:
Keywords.query().fetch() # For the sitemap, as I want all of the urls
Keywords.query(Keywords.number_articles > 5).fetch() # For the homepage, I want to link to keywords with more than 5 articles
Is there a better way to extract data?
I've tried indexing the data into the Search API, and I've seen huge speed gains. Even though this works, I don't think it's ideal to replicate data from the Datastore into the Search API with basically the same fields.
Thanks in advance!

I would split this functionality.
For the home page you can use your second query, but add, as advised by Bruyere, the limit=20 parameter. Such a request should run very fast if you have the right index.
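A minimal sketch of that query; the ordering is my assumption, not something from the question (ndb requires the first sort order to match the inequality property anyway):

top_keywords = (Keywords.query(Keywords.number_articles > 5)
                        .order(-Keywords.number_articles)  # assumed ordering
                        .fetch(20))  # only pull the 20 rows the page needs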
The sitemap is a bigger issue. Usually, to process a large number of entities, you use MapReduce.
It's probably a good idea, but only if you don't get too many requests for the sitemap. It can also be the only solution if you update Keywords entities often and want the sitemap to be as up to date as possible.
Another option is to generate the sitemap in a task, save it as a blob, and serve that blob in the request. That is really quick. If your updates to the Keywords entities are not very frequent, you can run this task after any update. If you have many updates, you can schedule the task to run periodically via cron. Since you've had success with the Search API, this is probably the best option for you.
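A rough sketch of that task-plus-blob approach; the Sitemap entity and the build_sitemap_xml helper are my inventions, not something from the question:

from google.appengine.ext import ndb

class Sitemap(ndb.Model):
    xml = ndb.TextProperty()  # the rendered sitemap, stored as one blob

def rebuild_sitemap():
    # Run from a task queue task or a cron job, never in the user request.
    # The projection query pulls only the url column, not whole entities.
    urls = [k.url for k in Keywords.query(projection=[Keywords.url])]
    Sitemap(id='latest', xml=build_sitemap_xml(urls)).put()  # assumed helper

def get_sitemap_xml():
    # The request handler just reads the pre-built blob: one get by key.
    cached = ndb.Key(Sitemap, 'latest').get()
    return cached.xml if cached else ''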
Generally speaking, I don't think it's a good idea to use the datastore to retrieve large amounts of data. I recommend at least looking at the Datastore comparison with traditional databases. It's designed to handle large databases, but not necessarily large result sets. I would say that the datastore is designed to handle large numbers of small requests.

DB speed is related to the number of results returned, not the number of records in the DB. You say:
to build my Sitemap, and to fetch my top 20 keywords
If that's the case, add limit=20 to both fetches. If you do it that way, then use run() instead, as per the docs:
https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_fetch

Ember Data: When do I use findAll() over query()?

This is the pattern I find myself running into:
I start making an app, and I use findAll() to get a list of [something random].
Once the app is being tested with serious data, the number of random resource instances will grow. I need to limit the number of resource instances on screen. I need to start paginating them. For this I need query string support. E.g. page[offset].
So findAll(criteria) is replaced by query(criteria, querystring).
This happens so often that findAll() is starting to look like a development placeholder for query() to be used later.
I'm probably misunderstanding the use for findAll(). Is it true findAll() cannot use pagination at all (without customizing adapter code)? Can someone explain in what cases findAll() should be used?
I personally use the findAll method for fetching data that appears in various drop-downs and short lists that cannot be filtered by the user. I use query and queryRecord for pretty much everything else.
Here are a couple of particularities of findAll that can be misleading:
findAll returns all of the records already present in the store, along with the data that is fetched using the record's adapter.
The return of findAll is two-fold: first you receive the contents of the store, and then they are refreshed with the data fetched via the adapter. This behavior can be overridden using the reload flag.
To expand on Jean's answer, findAll does just that: finds all! If you had entities such as "post types" where you have [advertisement, blog, poem], findAll makes sense, because you are pulling these three things all the time (for example, in a "post creator").
Query is more precise. Say you had an API returning every car you have ever seen.
Say you had a "car" model with properties "color" and "bodyStyle".
You could use:
// find all red cars -> /cars?color=red
store.query('car', {color: 'red'});
// find all cars that are coupes -> /cars?bodyStyle=coupe
store.query('car', {bodyStyle: 'coupe'});
To your question on pagination, this is typically implemented on the API. A popular pattern is to accept/return "page" and "count" properties. These are typically found in an API payload's "meta" property.
So if you wanted to look through all cars you know of/have in your database:
// find first 10 cars -> /cars?count=10&page=1
store.query('car', {count: 10, page: 1});
// on the next page, find the next 10 cars -> /cars?count=10&page=2
store.query('car', {count: 10, page: 2});
It is worth noting that, to further your own research, you should look into query parameter binding on controllers to ease the labor needed to implement a solution like this.
https://guides.emberjs.com/release/routing/query-params/
In the examples in that link you can see how you can transition to routes and use the query parameters in your store requests to fetch relevant data.
In short, findAll() is great for finding a finite, easy-to-represent set of information, typically types of entities.
query() is great for any filtered set of results based on a criteria, as you mentioned.
Happy Coding :)
If you want "all" records of a type, I would recommend using query + peekAll; this is more or less what findAll does under the hood, but without the various timing and freshness issues that findAll is subject to.
query is generally a much better API because it lets you paginate, and most apps with data of any consequence eventually hit a point where they are forced to paginate, either for rendering concerns or data-size concerns.

Django: how do i create a model dynamically

How do I create a model dynamically upon uploading a csv file? I have done the part where it can read the csv file.
This doc explains very well how to dynamically create models at runtime in Django. It also links to an example of doing so.
However, as you will see after looking at the document, it is quite complex and cumbersome to do this. I would not recommend doing this and believe it is quite likely you can determine a model ahead of time that is flexible enough to handle the CSV. This would be much better practice since dynamically changing the schema of your database as your application is running is a recipe for a ton of bugs in your code.
I understand that you want to create new schemas on the fly based on the fields in a CSV. While that's a valid use case and could be absolutely the right call, I doubt it: it lends itself to a data model for a single-tenant SaaS application that could have goofy performance and migration issues.
I'd try using Mongo or some other NoSQL solution, as others have mentioned. But a simpler approach may be a modified star schema implemented in SQL. In this case you create a dimension table that stores each header, then create an instance of each data element that has a foreign key to its dimension and records the value for that dimension.
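A minimal sketch of those two models, assuming a reasonably recent Django; the names simply mirror the pseudo-code below:

from django.db import models

class Dimension(models.Model):
    # One row per CSV header.
    name = models.CharField(max_length=255, unique=True)

class DimensionRecord(models.Model):
    # One row per cell value, pointing back at its header.
    dimension = models.ForeignKey(Dimension, on_delete=models.CASCADE)
    value = models.TextField()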
If you read the CSV, the pseudo-code would look something like this:
import csv

for row in csv.DictReader(file):
    for k in row.keys():
        try:
            dim = Dimension.objects.get(name=k)
        except Dimension.DoesNotExist:  # a bare except would hide real errors
            dim = Dimension(name=k)
            dim.save()
        # Dimension.objects.get_or_create(name=k) is the idiomatic shortcut
        DimensionRecord(dimension=dim, value=row[k]).save()
Obviously you could better handle reading the headers and trapping errors when dimensions already exist, but this is an example of how you could dynamically load variably-headered CSVs into a SQL db.

Modifying field with regex in Mongo and adding it to a new field

I'm a mongo noob and have what I hope is a pretty easy question. I received a 100gb .bson file yesterday and need to quickly retrieve some documents associated with urls. Unfortunately, the people that managed the database decided to change the schema for storing urls halfway through its life. This means that the url field must be queried via regex and cannot be indexed.
What I am hoping to do is this: regex out some common string between the two versions of the urls and store it in a new field called url_id. This field could then be indexed to make for quicker queries. Looking through some past SO posts, I cobbled together some pseudo-code that might do the trick:
// pseudo code, I don't know JavaScript that well.
db.eval(function() {
    db.foo.find({}, {url: 1}).forEach(function(e) {
        // strip the protocol, "www.", and any query string
        var match = e.url.match(/^https?:\/\/(?:www\.)?([^?]*)/);
        if (match) {
            e.url_id = match[1];
            db.foo.save(e);
        }
    });
});
Then I could run:
db.foo.ensureIndex({url_id:1})
Which would create a new index that would be quicker to query by so long as I properly modified the urls before querying for them.
However, I'm scared at the prospect of running a for loop across 100gb of records. Is there a better way to do this that I'm not thinking of?
Figured out a workaround...
By simply scripting the modification of the input URL to create various versions of itself, I was able to run multiple queries against the indexed database and concatenate the results. Hacky, but it worked!
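A sketch of that workaround in Python/pymongo terms; the rules in url_variants are my guess at what "various versions" means, so adjust them to the two actual schemas:

from pymongo import MongoClient

def url_variants(url):
    # Hypothetical normalization: the forms the two schema versions
    # might have stored for the same page.
    bare = url.split('?')[0].replace('https://', '').replace('http://', '')
    if bare.startswith('www.'):
        bare = bare[len('www.'):]
    return [bare, 'www.' + bare, 'http://' + bare, 'http://www.' + bare]

def find_by_url(db, url):
    # Each lookup is an exact match, so an index on url can serve it.
    results = []
    for v in url_variants(url):
        results.extend(db.foo.find({'url': v}))
    return results

db = MongoClient().get_database('mydb')  # assumed database name
docs = find_by_url(db, 'http://www.domain.com/some/page?utm=x')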

Slow page generation in Django with 50+ sql queries per page

In my Django app I noticed that pages with a big number of SQL queries load considerably slower than other pages. This isn't my first day in web dev, and I've mainly dealt with such a resource hog as Drupal; yet even Drupal, with its 150-200 SQL queries per page, generates a page in 0.5-0.7 sec.
Django, on the other hand, performs really badly with a more or less average number of queries per page. For example, one of my pages generates 60 queries like this:
SELECT `gamenode_gamenode`.`id`, `gamenode_gamenode`.`title`, `gamenode_gamenode`.`short_desc`, `gamenode_gamenode`.`full_desc`, `gamenode_gamenode`.`slug`, `gamenode_gamenode`.`type`, `gamenode_gamenode`.`source_gameid`, `gamenode_gamenode`.`created`, `gamenode_gamenode`.`updated`, `gamenode_gamenode`.`status`, `gamenode_gamenode`.`promote`, `gamenode_gamenode`.`sticky`, `gamenode_gamenode`.`hit_count`, `gamenode_gamenode`.`game_rank`, `gamenode_gamenode`.`share_count`, `gamenode_gamenode`.`like_count`, `gamenode_gamenode`.`comment_count` FROM `gamenode_gamenode` WHERE `gamenode_gamenode`.`id` = 1058
and outputs the data as a simple string, and it takes 1200 ms to generate the page! I did this just as a test, to generate many fairly simple queries. If I lower the number of queries to 10-15, page generation time comes back to a more or less acceptable number.
So I have a question: why is Django so slow when there are many SQL queries on the page? I did similar comparisons using Rails, Symfony, and Drupal, and all of these "resource hogs" performed way better than Django. Am I doing something wrong, is there some "secret" setting to make things faster in Django, or do Djangonauts simply consider such times normal and strive to write code which produces as few queries as possible? Please help me figure this out.
Yes, Django's ORM is pretty slow. You have three choices for dealing with this:
Complain about it.
Switch to another web application framework.
Make some effort to understand why your application is generating so many database queries, and learn how to use Django's ORM effectively so as to reduce the number of queries.
(1) might be psychologically satisfying but won't solve your problem; (2) is off-topic here at Stack Overflow (but you might look at Wikipedia's Comparison of web application frameworks).
We can help you with (3), but only if you show us some more of your code. The query you quoted looks like a typical query that Django would generate for a call to get():
GameNode.objects.get(id = 1058)
You shouldn't be running more than a couple of queries like this on a page: if you want to get many GameNodes you need to get them in a single query:
GameNode.objects.filter(<criteria>)
Or if the GameNode objects are related to some other object by a foreign key on another model that you are querying, then you could fetch all the related GameNode objects by using Django's select_related() method.
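To illustrate both suggestions, a sketch in which wanted_ids and the Review model (with a gamenode foreign key) are hypothetical names, not from the question:

# One query instead of sixty: fetch all the needed nodes in one filter().
nodes = GameNode.objects.filter(id__in=wanted_ids)

# When iterating over related objects, join them up front; without
# select_related, every r.gamenode access would cost an extra query.
for r in Review.objects.select_related('gamenode'):
    print(r.gamenode.title)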
There's almost always a way to speed things up (see this testimonial) but we need to know the details before we can say how to do it.

Can you build a truly RESTful service that takes many parameters?

After reading an article on REST ("Restful Grails"), I have gotten the impression that it is not possible to truly conform to a REST style in a service that demands a lot of parameters. Is this so? All the examples I have seen so far seem to imply that true REST style services are "parameterless". Using parameters would be RPC-ish and not truly RESTful.
To be more specific, say we have a service that returns graph data for stock prices, and this service needs to know the start date, the end date, the currency, the stock name, and whatever else might be applicable. In any case, at least 4-5 parameters are needed to retrieve the information.
I would imagine the URL to be something like this : /stocks/YAHOO?startDate="2008-09-01"&endDate=...
("YAHOO" is here a made-up stock name).
Would this really be REST or is this more RPC-like, what the author of the aforementioned article calls "GETful" (i.e. just low ceremony rpc)?
You can see the query string as a filter on the resource you are GETting. Here, your resource is the stock prices of Yahoo. Doing a GET on that resource gives you all the available data, or the most recent. The query string filters the prices you want. Content negotiation allows you to change the representation, e.g. a PNG graph, a CSV file, and so on. To add a price, simply POST a representation (e.g. CSV) to the same resource.
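A quick sketch of such a filtered GET from a Python client; the host and parameter names just mirror the question, and the Accept header carries the content negotiation:

import requests

resp = requests.get(
    'https://example.com/stocks/YAHOO',
    params={
        'startDate': '2008-09-01',
        'endDate': '2008-09-30',
        'currency': 'USD',
    },
    headers={'Accept': 'text/csv'},  # ask for the CSV representation
)
print(resp.text)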
The "restfulness" is not realy in the URL itself, since URIs are obscures to client, but in the way you interact with resources themselves identified by their URI
Feel free to use as many parameters as you need to identify the resource you wish to access. REST doesn't care.
Why would you think it is not possible?
Google uses REST for their Charts API, and they take a lot of params:
http://chart.apis.google.com/chart?cht=bvg&chs=350x300&chd=t:20,35,10&chxr=1,0,40&chds=0,40&chco=FF0000|FFA000|00FF00&chbh=65,0,35&chxt=x,y,x&chxl=0:|High|Medium|Low|2:||Task+Priority||&chxs=2,000000,12&chtt=Tasks+on+my+To+Do+list&chts=000000,20&chg=0,25,5,5