Let's imagine we have a Couchbase bucket containing N docs, each S bytes in size, with V views defined. We need to retrieve those docs, including all the information they contain.
One way:
Create a view with a map function like this:
function (doc, meta) {
  if (meta.type == "json" && doc.type == "mytype") {
    emit([doc.field1, doc.field2], doc);
  }
}
This map function returns all the data we need in one step, but on the other hand it produces quite a large amount of view data.
Another way:
Create a view that returns only document IDs (or even the document key, using meta.id):
function (doc, meta) {
  if (meta.type == "json" && doc.type == "mytype") {
    emit([doc.field1, doc.field2], meta.id);
  }
}
Then, after getting this on the client side, we need to fetch each document by the returned IDs:
couchbase.getMultiple([key1, key2, ..., keyX]) *
* where key1...keyX are the document IDs (meta.id) from the view.
In this case we produce a smaller amount of view data, but the operation takes X+1 requests.
So the first way loads the Couchbase servers and consumes a large amount of disk space for views. The second way consumes less view space, but loads the client and produces more requests to the Couchbase server.
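For illustration, here is a minimal sketch of the second way on the client side, assuming a hypothetical client API (queryView and getMultiple are stand-ins for whatever the SDK actually calls these operations, not real method names):

async function loadDocsViaIds(client) {
  // Request 1: query the view; each row carries { id, key, value }.
  const rows = await client.queryView("mydesign", "byfields");
  const ids = rows.map((row) => row.id);
  // The remaining requests: one bulk get (or X single gets) for the docs.
  return client.getMultiple(ids);
}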
So there are some questions about this:
Which of these ways is more acceptable / more frequently used, or does it depend heavily on the values of N, S and V?
If the correct way depends on N, S and V, for which values of those variables (high/medium/low) is one way more acceptable than the other?
Couchbase can be scaled horizontally quite easily. If the client side is harder to scale, is the first way preferred?
Maybe there are some test results comparing these two ways.
Which SDK are you using to access the view?
Good practice is usually to avoid emitting the document ID, since it is automatically included in the view index.
The basic rule:
Do not emit the whole doc (or too many values)
Do not emit the key
Take a look at:
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-bestpractice.html
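Following those rules, the second map function from the question could emit null as its value; the document ID is stored in the view index automatically, so nothing is lost:

function (doc, meta) {
  if (meta.type == "json" && doc.type == "mytype") {
    // No value emitted: the row still carries meta.id in the index.
    emit([doc.field1, doc.field2], null);
  }
}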
Then, if your application needs to access the full document, you just need to do (Java as an example):
query.setIncludeDocs(true);
In this case the client will automatically do the "get" for you, calling the server and loading the document into the cache (in fact the SDK is doing a multi-get).
Related
I have the following problem. In my Amplify Studio I see 10k datapoints.
But if I take a closer look at the corresponding database I see this:
I have over 200k records, but it shows only 10k inside Amplify Studio. Why is that?
When I try this code in my frontend:
let p = await DataStore.query(Datapoint, Predicates.ALL, {
  limit: 1000000000,
});
console.log(p.length);
I get 10000 back, the same number as in Amplify Studio.
Other question: what's the best way to store dozens of datapoints? I need them for chart visualization.
A DynamoDB Query or Scan request does not return all items in one huge list. Instead, it returns just a single "page" of results, whose size defaults to 1MB. A client library like amplify could call these Query or Scan requests repeatedly to collect all pages into one huge array in memory, but that doesn't make too much sense once it grows very big. So applications usually want to iterate over all the results, not to collect them into one huge array in memory.
So most DynamoDB APIs provide a pagination interface for the query and scan operations - which provides you with one page of results, and a way to get the next page. Alternatively, some APIs can give you an iterator over results - and internally do this pagination. I'm not familiar with Amplify so I don't know which API to recommend. But certainly an API that returns all results as one big array must have its limits, and apparently you've found them.
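As an illustration of that pagination loop against DynamoDB directly, here is a sketch using the AWS SDK for JavaScript v2 (Amplify DataStore sits on a sync layer above this, so treat it as the underlying mechanism, not Amplify's own API):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

// Iterate over all items one page at a time instead of collecting
// everything into one huge array in memory.
async function* scanAll(tableName) {
  let ExclusiveStartKey;
  do {
    const page = await docClient
      .scan({ TableName: tableName, ExclusiveStartKey })
      .promise();
    yield* page.Items;
    // LastEvaluatedKey is undefined once the last page has been read.
    ExclusiveStartKey = page.LastEvaluatedKey;
  } while (ExclusiveStartKey);
}

For DataStore itself, the query call also accepts pagination options of the form { page: 0, limit: 100 }, which is usually a better fit than one enormous limit.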
I have a web application that uses DynamoDB to store my large JSON objects and perform simple CRUD operations on them via a web API. I would like to add a new table that acts as a categorization of these values. The user should be able to select from a selection box which category the object belongs to. If a desirable category does not exist, the user should be able to create a new category, specifying a name which will be available to other objects in the future.
It is critical to the application that every one of these categories be given an integer ID, incrementing from 1 for the first. These auto-generated numbers will become reproducible serial numbers for back-end reports that will not use the user-visible text name.
So I would like to have a simple API available from the web frontend that allows me to:
A) GET /category : produces { int : string, ... } of all categories mapped to their IDs
B) PUSH /category : accepts a string and stores it under the next integer
Here are some ideas for how to handle this kind of project.
Store it in DynamoDB with integer indexes. This has some benefits, but it leaves a lot to be desired. Firstly, there's no auto-incrementing ID in DynamoDB, but I could certainly read the current state of the table, create a new ID, and store the result. This might have issues with consistency and race conditions, but there's probably a way to achieve it safely. It might, however, be a big anti-pattern to use DynamoDB this way.
Store it in DynamoDB as one object in a table with some arbitrary index: just store the whole mapping as a JSON object. This really abandons the notion of tables in DynamoDB and uses it as a simple file. It might also run into issues with race conditions.
Use AWS ElastiCache to get a Redis key-value store. This might be "the right" decision, but the downside is that ElastiCache is an always-on DB offering where you pay per hour. For a low-traffic web site like mine I'd be paying a minimum of about $12/mo, I think, and I would really like this to be pay-per-access/update due to the low volume. I'm not sure Redis has a built-in auto-increment feature in the way I'd need it, but it's pretty trivial to make a transaction that gets the length of the table, adds one, and stores a new value. Race conditions are easily avoided with this solution.
Use a SQL database like AWS Aurora or MySQL. This has the same upsides as Redis, but it's even more overkill than Redis, it costs a lot more, and it's still always on.
Run my own in-memory web service, or MongoDB, etc.; still, you're paying for constantly running containers. Writing my own thing is obviously silly, but I'm sure there are services that match this issue perfectly; they would all require a constantly running container, though.
Is there a good way to store a simple list or integer mapping like this without a constant monthly cost? Is there a better way to do this with DynamoDB?
Store the maxCounterValue as an item in DynamoDB.
For the PUSH /category, perform the following:
Get the current maxCounterValue.
TransactWrite:
Put the category name and id into a new item with id = maxCounterValue + 1.
Update maxCounterValue to maxCounterValue + 1, with a ConditionExpression checking that maxCounterValue = :valueFromGetOperation.
If the TransactWrite fails, start again at step 1 and retry up to X more times.
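A sketch of those steps with the AWS SDK for JavaScript v2 (the table name "Categories", the counter item id "maxCounterValue", and the attribute names are illustrative, not prescribed):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();
const TABLE = "Categories"; // hypothetical table name

async function pushCategory(name, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    // Step 1: read the current counter value.
    const { Item } = await docClient
      .get({ TableName: TABLE, Key: { id: "maxCounterValue" } })
      .promise();
    const current = Item ? Item.value : 0;
    try {
      // Step 2: one transaction that writes the new category and bumps the
      // counter, conditioned on the counter still holding the value we read.
      await docClient.transactWrite({
        TransactItems: [
          { Put: { TableName: TABLE,
                   Item: { id: "category#" + (current + 1), name: name } } },
          { Update: { TableName: TABLE,
                      Key: { id: "maxCounterValue" },
                      UpdateExpression: "SET #v = :next",
                      ConditionExpression:
                        "attribute_not_exists(#v) OR #v = :current",
                      ExpressionAttributeNames: { "#v": "value" },
                      ExpressionAttributeValues: { ":current": current,
                                                   ":next": current + 1 } } },
        ],
      }).promise();
      return current + 1; // the new category's serial number
    } catch (err) {
      // Step 3: another writer won the race; re-read and retry.
      if (err.code !== "TransactionCanceledException") throw err;
    }
  }
  throw new Error("pushCategory: too much contention, giving up");
}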
Firstly, let me know if I should place this in a different community. It is programming-related, but less so than I would prefer.
I am creating a mobile app which I intend to base on AWS AppSync, unless I can determine it is a poor fit.
I want to store a fairly large set of data, say half a million records.
From these records, I need to be able to grab all entries based on a tag and page them from the larger set.
An example of this data would be:
{
  "name": "Product123",
  "tags": [
    {
      "name": "1880",
      "type": "year",
      "value": 7092
    },
    {
      "name": "f",
      "type": "gender",
      "value": 4120692
    }
  ]
}
Various objects may or may not have a specific tag, but may have up to 500 tags or more (the initial seed data has 130 tags). My filter should ignore objects that do not match and return those that do.
In reading about Query vs Scan on DynamoDB, I feel like my current data structure would require mostly scanning and be inefficient. Efficiency is only a real restriction due to cost.
With cost in mind, I will focus on the cost per user of accessing this data in filtered sets. Say 100,000 users for now, each filtering and paging data many times a day.
Your concept of tags doesn't sound too different from the concept of Cognito User Pools' groups with AppSync (docs) - authentication based on groups will only return items allowed for groups that the user making the request is in. Cognito's default group limit is 25 per user pool, so while convenient out of the box, it wouldn't itself help you much. Instead, it's interesting just because it's similar conceptually, and can give you insight by looking at how it works internally.
If you go into the AppSync console and set up a request mapping template for groups auth, you'll see that it uses a scan and the contains operation. Doing something similar would probably be your best bet here, if you really want to use Dynamo. If you find that prohibitively costly, you could use a Lambda data source, which allows you to use any data store, if you have one in mind that's a little more flexible for this type of action.
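A sketch of that scan + contains approach against DynamoDB directly (AWS SDK for JavaScript v2; this assumes the tags were flattened into a string set or list of plain strings such as "year#1880" - an assumption, since contains needs an exact element match rather than your current nested objects):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

// Return one page of items whose "tags" attribute contains the given tag.
// Pass LastEvaluatedKey from the previous page as startKey to continue.
async function findByTag(tableName, tag, startKey) {
  return docClient.scan({
    TableName: tableName,
    FilterExpression: "contains(#tags, :tag)",
    ExpressionAttributeNames: { "#tags": "tags" },
    ExpressionAttributeValues: { ":tag": tag },
    ExclusiveStartKey: startKey,
  }).promise();
}

Keep in mind that a scan reads (and bills for) every item it examines; the filter only trims what comes back over the wire, which is exactly the cost concern above.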
I have a sample database in CouchDB with information on a number of aircraft, and a view which shows the manufacturer as the key and the model as the value.
The map function is
function(doc) {
  emit(doc["Manufacturer"], doc._id);
}
and the reduce function is
function(keys, values, rereduce) {
  return values.length;
}
This is pretty simple, and I indeed get the correct result when I show the view in Futon, where I have 26 Boeing aircraft:
"BOEING" 26
But if I use a REST client to query the view using
http://localhost:6060/aircrafts/_design/basic/_view/VendorProducts?key="BOEING"
I get
{"rows":[
{"key":null,"value":2}
]}
I have tested different clients (web browsers, REST client extensions, and curl), and all of them give me the value 2! Queries with other keys, however, work correctly.
Is there something wrong with the MapReduce function or my query?
The issue could be due to grouping.
Using group=true (which is Futon's default), you get a separate reduce value for each unique key in the map - that is, all values which share the same key are grouped together and reduced to a single value.
Were you passing group=true as a query parameter when querying with curl etc.? Since it is passed by default in Futon, you saw results like
BOEING : 26
Whereas without group=true, only the reduced value was returned.
So try this query
http://localhost:6060/aircrafts/_design/basic/_view/VendorProducts?key="BOEING"&group=true
You seem to be falling into the re-reduce trap. CouchDB, strictly speaking, uses a map-reduce-rereduce process.
Map: reformats your data into the output format.
Reduce: aggregates the data of several (but not necessarily all) entries with the same key - this works correctly in your case.
Re-reduce: does the same as reduce, but on previously reduced data.
Because you change the format of the value in the reduce stage, the re-reduce call counts the already-reduced values instead of summing them.
Solutions:
You can just set the value in the map to 1 and reduce by summing the values.
You can check for rereduce == true and in that case return the sum of the values - these will be the integer counts returned by the initial reduce.
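Sketches of both fixes (sum() is a helper provided by CouchDB's JavaScript view server; option 1 could equally use the built-in _sum reduce):

// Option 1: emit 1 per row in the map...
function (doc) {
  emit(doc["Manufacturer"], 1);
}
// ...and sum in the reduce; summing survives rereduce unchanged.
function (keys, values, rereduce) {
  return sum(values);
}

// Option 2: keep the original map and branch on rereduce.
function (keys, values, rereduce) {
  if (rereduce) {
    return sum(values); // values are counts from earlier reduce passes
  }
  return values.length; // values are the raw mapped values
}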
We recently made a shift from relational (MySQL) to NoSQL (Couchbase). Basically, it's a back-end for a social mobile game. We were facing a lot of problems scaling our back-end to handle an increasing number of users. With MySQL, loading a user took a lot of time because of the many joins across multiple tables. We saw a huge improvement after moving to Couchbase, especially when loading data, as most of it is kept in a single document.
On the downside, Couchbase also seems to have a lot of limitations as far as querying is concerned. Couchbase's alternative to SQL queries is views. While we managed to handle most of our queries using map-reduce, we are really having a hard time figuring out how to handle time-based queries, e.g. filtering users on a timestamp attribute. We only want a user in the view if its time is less than the current time:
if(user.time < new Date().getTime() / 1000)
What happens is that once a user's time is set to some future time, it gets excluded from this view, which is the desired behavior, but it never gets added back to the view unless we update it - a document only gets re-indexed in the view when it is updated.
Our solution right now is to load the first x user documents and then check the time in our application. Sorting is done on the user.time attribute, so we get those users whose time is less than or near the current time. But I am not sure this will actually work in a live environment. Ideally, we would like to avoid these kinds of checks at the application level.
There are also cases, e.g. matchmaking, where we need to check multiple time-based attributes. Our current strategy doesn't work there, and we frequently get documents from the view that fail these checks when done in the application. I would really appreciate it if someone who has already tackled similar problems could share their experience. Thanks in advance.
Update:
We tried using range queries, but they work for only one key. As I said, in most cases we have multiple time-based keys, meaning multiple ranges, which does not work.
If you use new Date().getTime() inside a view function, you'll always get the time at which the view was indexed - just as you said, "it never gets added back to view unless we update it".
There are two ways:
Bad way (don't do this in production): query views with the stale=false param. That forces the view to update before it returns results, but view indexing is a slow process, especially if you have more than a million records.
Good way: use range requests. You just need to emit your date in the map function as the key (or as part of a complex key) and query with a range request. You can see one example here or here (also, if you want to use DateTime in Couchbase, this example will be more useful). Or just look at my example below:
I.e. you will have docs like:
doc = {
  "id": 1,
  "type": "doctype",
  "timestamp": 123456, // document update or creation time
  "data": "lalala"
}
For those docs, the map function will look like:
map = function(doc, meta) {
  if (doc.type === "doctype") {
    emit(doc.timestamp, null);
  }
}
And now, to get recently "updated" docs, you need to query this view with the params:
startKey="dateTimeNowFromApp"
endKey="{}"
descending=true
Note that startKey and endKey are swapped because I used descending order. Here is also a link to the documentation about the key types that Couchbase supports.
I've also found a question that may help.
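For illustration, issuing that range query over the view REST API might look like the sketch below (bucket, design doc, and view names are made up; 8092 is the Couchbase view port; an endkey of 0 is used here as the lower bound for numeric Unix timestamps):

// Newest-first query for all docs with timestamp <= now.
const now = Math.floor(Date.now() / 1000);
const params = new URLSearchParams({
  startkey: String(now), // the upper bound comes first because descending=true
  endkey: "0",           // the lower bound (oldest timestamp)
  descending: "true",
});
const res = await fetch(
  "http://localhost:8092/mybucket/_design/docs/_view/by_timestamp?" + params
);
const { rows } = await res.json();
console.log(rows.map((r) => r.id)); // document IDs, newest first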