GAE ndb design, performance and use of repeated properties - python-2.7

Say I have a picture gallery and a picture could potentially have 100k+ fans. Which ndb design is more efficient?
class Picture(ndb.Model):
    fanIds = ndb.StringProperty(repeated=True)
    # ... [other picture properties]
or
class Picture(ndb.Model):
    # ... [other picture properties]

class Fan(ndb.Model):
    pictureId = ndb.StringProperty()
    fanId = ndb.StringProperty()
Is there any limit on the number of items you can add to an ndb repeated property, and is there any performance hit when storing a large number of items in a repeated property? If repeated properties are less efficient for this, what is their intended use?

Do not use repeated properties if you have more than 100-1000 values. (1000 is probably already pushing it.) They weren't designed for such use.

Generally v1 would be much cheaper.
In terms of read/write costs, you pay per entity fetched or written, so you want to reduce the number of entities. Version 1 will be cheaper, and significantly cheaper if you fetch every fan every time you fetch a picture.
However, each entity is limited to 1 MB. With 100k+ fans you could hit that limit depending on the size of your fanIds, and that's before counting your other picture data, so you could blow the 1 MB limit and need some more complex code to handle overflow cases.
Large entities take longer to fetch than small entities. If you're going to fetch all the fans at once all the time, v1 will be better. If you're only going to fetch say 5 fans at any one point, v2 might be faster (only might). If on the other hand you try to pull 100k fan entities... that's gonna take forever.
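For scale, here is a minimal sketch of what version 2 could look like with ndb, fetching only a handful of fans for a picture instead of loading one huge entity (the model and property names follow the question; fetch(5) is just an illustrative page size):

from google.appengine.ext import ndb

class Fan(ndb.Model):
    pictureId = ndb.StringProperty()
    fanId = ndb.StringProperty()

def some_fans(picture_id, n=5):
    # Each fan is its own small entity, so this reads only n entities
    # instead of one Picture entity that may be hundreds of kB in size.
    return Fan.query(Fan.pictureId == picture_id).fetch(n)

def fan_count(picture_id):
    # count() runs a keys-only scan under the hood, so it avoids
    # loading the fan entities themselves.
    return Fan.query(Fan.pictureId == picture_id).count()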

Related

Neptune and Cypher - Poor Performance

I want to use Neptune for an application with Cypher as my query language. I have a pretty small dataset of around ~8500 nodes and ~8500 edges. I am trying to do what seem to be fairly straightforward queries, but the latency is very high (~6-8 seconds for around 1000 rows). I have tried various instance types, enabling and disabling caches, and enabling and disabling the OSGP index, to no avail. I'm really at a loss as to why the query performance is so poor.
Does anyone have any experience with poor query performance on Neptune? I feel I must be doing something wrong to have such high query latency.
Here is some more detailed information on my graph structure and my query.
I have a graph with 2 node types, A and B, and a single edge type, MAPS_TO, which is always directed from an A node to a B node. The MAPS_TO relation is many-to-many, but with the current dataset it is primarily one-to-one, i.e. the graph is mainly disconnected subgraphs of the form:
(A)-[MAPS_TO]->(B)
What I would like to do is for all A nodes to collect the distinct B nodes which they map to satisfying some conditions. I've experimented with my queries a bit and the fastest one I've been able to arrive at is:
MATCH (a:A)
WHERE a.Owner = $owner AND a.IsPublic = true
WITH a
MATCH (a)-[r:MAPS_TO]->(b:B)
WHERE (b)<-[:MAPS_TO {CreationReason: "origin"}]-(:A {Owner: $owner})
OR (b)<-[:MAPS_TO {CreationReason: "origin"}]-(:A {IsPublic: true})
WITH a, r, b ORDER BY a.AId SKIP 0 LIMIT 1000
RETURN a {
.AId
} AS A, collect(distinct b {
B: {BId: b.BId, Name: b.Name, other properties on B nodes...},
R: {CreationReason: r.CreationReason, other relation properties}
})
The above query takes ~6 seconds on the t4g.medium instance type. I tried moving up to an r5d.2xlarge instance type and this cut the query time in half to 3-4 seconds. However, using such a large instance type seems quite excessive for such a small amount of data.
Really I am just trying to figure out why my query performs so poorly. It seems to me that, with the amount of data I have, it should hardly be possible for any Neptune configuration to perform this badly.
Unfortunately, there are many reasons that performance could be suffering, be it instance size, data not in the buffer cache, concurrent processes, query optimization, etc., so it is hard to provide specific suggestions with the information available.
To better understand the issue, I'd suggest taking a look at how the query is being processed. These details can be found using the openCypher explain feature which will provide low-level details on what the query is doing and where the time is being spent. If possible, I suggest opening a support case with AWS support.
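For reference, fetching an explain plan from the Neptune openCypher HTTPS endpoint looks roughly like the sketch below (assumptions: the cluster endpoint is a placeholder, IAM authentication is off or handled by your HTTP client, and the explain parameter value should be checked against the Neptune openCypher explain documentation):

import requests

# Placeholder cluster endpoint; 8182 is Neptune's default port.
ENDPOINT = "https://your-neptune-cluster:8182/openCypher"

QUERY = """
MATCH (a:A)-[r:MAPS_TO]->(b:B)
WHERE a.IsPublic = true
RETURN count(b)
"""

# Requesting an explain plan returns details of how the query is
# evaluated and where the time goes, instead of the query results.
response = requests.post(ENDPOINT, data={"query": QUERY, "explain": "details"})
print(response.text)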

RESTful API optimization to get huge data

I have a page for listing categories. There are parameters under categories and sub-parameters under parameters, and the data is huge.
I recently developed and tested this page. It takes a lot of time and performance is severely hit, because there are about 1600 API calls (one call to fetch the data for each of the categories, parameters & sub-parameters) for that single page. I have two questions.
1) Which way is more effective, a or b?
a) I have an API to get data for a parameter, so I can make this call 1600 times to get data for all categories/parameters/sub-parameters.
b) Have one call to get all categories/parameters/sub-parameters data.
2) Does AWS charge based on the number of calls? For example, is having one call to get the data in one shot cheaper than 1600 calls to get data for each of the categories and parameters?
If I recall correctly, AWS charges you for active CPU time, so basically whenever somebody calls the API or any computation is done on whatever you are hosting there.
For your other question, I believe a) would be the better choice, as it reduces the load slightly (what I mean is that there will be less computation per call, done more frequently, which overall speeds up the whole process, since you are splitting the big data into smaller chunks) and it will likely avoid traffic congestion if many people are requesting at the same time.
Hope this helps!
I think this depends on several factors. Overall, a) is probably the better option, as the data transferred stays the same in both models, so the load and processing power required are very similar. With a) you have the advantage of spreading the risk (if one packet gets lost, only a little information is lost) and probably better processing speed, since only very small chunks of data need to be handled at a time.
To answer your second question: I guess you're using API Gateway? Here is the pricing sheet. You pay a fixed amount per 1M calls (in the US, $3.50), and you pay separately for the cache and the data transfer. So I guess you need to calculate yourself what would be cheaper for you.
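To make that concrete, a rough back-of-the-envelope at the quoted $3.50 per million API Gateway requests (request charges only; data transfer, caching and the compute behind the API are billed separately, and the rate is taken from the pricing sheet mentioned above):

# Request charges only, at $3.50 per 1M API Gateway calls.
price_per_call = 3.50 / 1000000.0
print(1600 * price_per_call)  # ~$0.0056 in request charges per page view with 1600 calls
print(1 * price_per_call)     # ~$0.0000035 in request charges per page view with one batched call

Whether that difference matters depends entirely on your traffic, so it only tips the decision if the page is viewed very frequently.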

Time required to open very small to very large leveldb databases

I have to give some background first. I want to implement an optimized storage engine for OSM planet data (50GB+). The purpose of this engine is to enable map area extractions as fast as possible, while also retaining the ability to apply minutely updates. The design I've chosen for several reasons (not mentioning all of them here) is to use one database per grid cell. E.g. think of all cells on a map being distinct files or databases: http://3.bp.blogspot.com/_CntRFtGsdQo/TTU5UMlLkTI/AAAAAAAAARk/_hW8n33t4Ok/s1600/utmworld.gif
(Just to get the idea though, this is not the actual cell grid I'll be using.)
I have never used leveldb before, but settled on it for its bulk insert and update performance. However, I'd like to know about the "performance characteristics" when opening many very small and very large leveldb databases ("very small" meaning just a few kB, "very large" meaning a few hundred MB).
I expect that I have to open / close somewhere between 10-100 dbs per minute. I'd rule out leveldb if it needs significant initialization time.
An answer to this question could be either concrete figures, or insight into what leveldb does during initialization and how it relates to data / index size.
PS. I'll also do my own measurements of course. But as with all tests, I may draw wrong conclusions from my sample data.
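Since nobody has posted figures yet, here is a minimal measurement sketch, assuming the plyvel Python binding for LevelDB and placeholder database paths:

import time
import plyvel  # assumption: the plyvel LevelDB binding is installed

def time_open(path):
    # Times only the DB construction, which is when LevelDB reads CURRENT
    # and the MANIFEST and recovers the write-ahead log; it does not scan
    # the full data set.
    start = time.time()
    db = plyvel.DB(path, create_if_missing=False)
    elapsed = time.time() - start
    db.close()
    return elapsed

for path in ['/data/cells/small_cell_db', '/data/cells/large_cell_db']:  # placeholder paths
    print(path, time_open(path))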

What is the most efficient way to store time series in Riak with heavy reads

My current approach:
I have one domain class - Application
Each application in my system is stored in "applications" bucket under APPLICATION_KEY key
Apart from the application metadata stored in this bucket, each application has its own bucket called "time_metrics/APPLICATION_KEY" where I store the time series in the form:
KEY - timestamp / VALUE - some attributes
My concern is the efficiency of queries made over a specific time window for a given application. Currently, to get the time series from a specific time window and eventually make some reductions, I have to run MapReduce over the whole "time_metrics/APPLICATION_KEY" bucket, which, from what I have found, is not the recommended use case for Riak MapReduce.
My question: what would be the best DB structure for this kind of system, and how do I query it efficiently?
Adding onto #macintux's answer.
Basho has had a few customers that have used Riak for time series metrics.
Boundary has a nice tech talk about how they use Riak with their network monitoring software. They roll up data into different chunks of time (1m, 5m, 15m) for analysis.
They also have a series of blog posts about lessons learned while implementing this system.
Kivra also has a good slide deck about how they use time series data with Riak.
You could roll up your data into some sort of arbitrary time length, then read the range you need by issuing regular K/V gets, and then reconstruct the larger picture / reduce in your application.
If you have spare computing power and you know in advance what keys you need, you certainly can use Riak's MapReduce, but often retrieving the keys and running your processing on the client will be as fast (and won't strain your cluster).
Some general ideas:
- Roll up your data into larger blocks
- If you're concerned about losing data if your client crashes while buffering it, you can always store the data as it arrives
- Similar idea: store the data as it arrives, then retrieve it and roll it up at certain intervals
- You can automatically expire data once you're confident it is being reliably stored in larger blocks, using either the Bitcask or Memory backends
- Memory backend is quite useful (RAM permitting) for any data that only needs to be stored for a limited period of time
- Related: don't be afraid to store multiple copies of your data to make reading/reporting easier later
  - Multiple chunks of time (5- and 15-minute blocks, for example)
  - Multiple report formats
Having said all that, if you're doing straight key/value requests (it's ideal to always be able to compute the keys you need, rather than doing indexing or searching), Riak can support very heavy traffic loads, so I wouldn't recommend spending too much time creating alternative storage mechanisms unless you know you're going to face latency problems.
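As a concrete illustration of the roll-up approach, here is a minimal sketch using the official riak Python client, assuming 5-minute blocks keyed as APPLICATION_KEY:block_start (all names and the block size are illustrative, not part of the question's system):

import riak  # assumption: the official riak Python client is installed

BLOCK_SECONDS = 5 * 60                     # 5-minute roll-up blocks
client = riak.RiakClient()                 # defaults to localhost; point it at your cluster
bucket = client.bucket('time_metrics')     # one bucket; keys encode application + block start

def block_key(application_key, ts):
    # Keys are computable from the application and timestamp alone,
    # so reads never need MapReduce, secondary indexes or search.
    block_start = int(ts) - (int(ts) % BLOCK_SECONDS)
    return '%s:%d' % (application_key, block_start)

def read_window(application_key, start_ts, end_ts):
    # Issue plain K/V gets for every block covering the window and
    # reassemble/reduce the series on the client.
    points = []
    ts = int(start_ts) - (int(start_ts) % BLOCK_SECONDS)
    while ts <= end_ts:
        obj = bucket.get(block_key(application_key, ts))
        if obj.data:
            points.extend(obj.data)        # each block stores a list of samples
        ts += BLOCK_SECONDS
    return points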

Inserting in a WStandardItemModel is too slow

I am working on an application built upon Wt.
We have a performance problem, as it must display a lot of data in a WTableView associated with a WStandardItemModel.
For each new item to be added in the table it does:
model->setData( row, column, data )
(which happens a few thousand times).
Is there some way to make it faster? Some other way to add data to the table?
It can take 2 seconds to generate the data and several minutes to display it...
WStandardItemModel is a general-purpose model that is easy to use, but it's not optimal for very large models. Try specializing WAbstractTableModel instead; you only need to reimplement three methods (rowCount(), columnCount() and data()), and you can read your data from wherever it resides, or compute it on the fly.
It's not normal that a view takes minutes to display. I've used views on tables with many thousands of entries without performance problems. Was your system swapping because of the memory wasted in an (extremely large) WStandardItemModel?