CouchDB reduce output shrink-rate check rejects the view - mapreduce

I know this has been discussed a fair number of times, but I think my scenario calls for disabling the reduce_limit check:
Docs are in the form:
{ prefix: "004945", country: "Germany", type: "Mobile", carrier: "OrangeTel", price: "34"}
{ prefix: "004946", country: "Germany", type: "Mobile", carrier: "SomeOther", price: "46"}
.
.
.
{ prefix: "00807", country: "Unknown", type: "Satelite", carrier: "Inmarsat", price: "123"}
Now I want to get an array of those prefixes grouped under some [country, type, carrier] key or a [country, type] key.
So I map like this:
emit( [country, type, carrier],[prefix] )
and I reduce like this:
reduce: function(keys, values, rereduce) {
  return values.reduce(function(a, b) { return a.concat(b); });
}
The problem is that the shrink rate is not good enough, because I obviously return the same amount of data in a different shape: I convert a list of many elements, each with little data, into a list of few elements, each with much data.
I know I can work around it with list functions and such, but I think the scenario is a valid case for disabling the check. If one exists, I would also like any ideas for a map/reduce solution that does not change the structure of those docs. Thanks.
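For reference, CouchDB does let you switch the check off server-wide; a minimal local.ini sketch, assuming a 1.x-style configuration (verify against your version's docs before relying on it):

; local.ini
[query_server_config]
reduce_limit = false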

The main task of a reduce function is to reduce the result produced by the map function.
Since your map function emits [prefix] as the value, [prefix] is different for each key, and since you'd like to reduce/group keys, you are probably also interested in not seeing [prefix] duplicates within each group, right?
The following reduce function builds a set from the values array, so it produces a unique and short list of [prefix]-es for your keys and doesn't suffer from the shrink-rate problem. If you need to count how many different prefixes occur for a reduced key, that would be another function, but the target is the same: reduce (and rereduce) a long list of values.
function(keys, values, rereduce){
  var prefixes = [];
  // merge the items of src into dst, skipping duplicates
  var update_set = function(src, dst){
    for (var idx = 0; idx < src.length; idx++){
      var item = src[idx];
      if (dst.indexOf(item) == -1){
        dst.push(item);
      }
    }
    return dst;
  };
  // Both on reduce and on rereduce, each entry of `values` is an array of
  // prefixes: the [prefix] emitted by map, or a merged set returned by a
  // previous reduce call. So merge the entries one by one.
  for (var idx = 0; idx < values.length; idx++){
    update_set(values[idx], prefixes);
  }
  return prefixes;
}
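With that reduce in place, querying the view with grouping returns one merged prefix list per group; a hedged example (the database, design doc, and view names are made up):

# one row per [country, type] group
GET /calls/_design/rates/_view/prefixes?group_level=2
# one row per full [country, type, carrier] key
GET /calls/_design/rates/_view/prefixes?group=true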

C++ How can I convert a string to enum to use it in a switch?

I have a list of commands, and when the user inputs one of them it should call a separate function. Talking to a friend, he said I should use a switch, which is faster and easier to read than "if, else if, else" statements.
When checking how to implement this I realised that I couldn't do it directly, because the input is a string, and for a switch to work I would need to convert it to an enum; the most straightforward option seems to be mapping each of the options like below.
header
enum class input
{
    min,
    max,
    avg,
    count,
    stdev,
    sum,
    var,
    pow
};

std::map<std::string, input> inMap
{
    { "min",   input::min },
    { "max",   input::max },
    { "avg",   input::avg },
    { "count", input::count },
    { "stdev", input::stdev },
    { "sum",   input::sum },
    { "var",   input::var },
    { "pow",   input::pow }
};
Is there a more accessible option to use enum where I don't have to map each value? This is very time-consuming and I'm not seeing any benefits at the moment.
cpp file
void test::processInput(std::vector<std::string> args) { // nb: a parameter named `input` would shadow the enum class
    // caution: std::map::operator[] inserts a default value (input::min)
    // for unknown commands, so the default branch below is never reached
    switch (inMap[args[0]]) {
    case input::min:
        MIN(args);
        break;
    case input::max:
        MAX(args);
        break;
    case input::avg:
        AVG(args);
        break;
    case input::count:
        COUNT(args);
        break;
    case input::stdev:
        STDEV(args);
        break;
    case input::sum:
        SUM(args);
        break;
    case input::var:
        VAR(args);
        break;
    case input::pow:
        POW(args);
        break;
    default:
        std::cout << "Error, input not found" << std::endl;
    }
}
I read some posts about hashing the value instead; is this a good option? At this point, wouldn't it just be better to continue using if, else if, else?
Thanks!
Why not have a map of string to function? Then you don't need to convert to an enum.
#include <functional>
#include <map>
#include <string>
#include <vector>

using Action = void(std::vector<std::string>);
using ActionFunc = std::function<Action>;
using ActionMap = std::map<std::string, ActionFunc>;

void MIN(std::vector<std::string>) {}
// ... etc.

ActionMap inMap
{
    { "min",   MIN },
    { "max",   MAX },
    { "avg",   AVG },
    { "count", COUNT },
    { "stdev", STDEV },
    { "sum",   SUM },
    { "var",   VAR },
    { "pow",   POW }
};
void test::processInput(std::vector<std::string> input) // May want a ref here.
{
    auto find = inMap.find(input[0]);
    if (find == inMap.end()) {
        std::cout << "Error, input not found" << std::endl;
        return;
    }
    find->second(input);
}
Either way you are doing a number of string compares, and string compares are slow.
When you do the if/else you are on average doing n/2 compares if you have n target strings. So 4 compares for the 8 keywords.
When you do a map, this reduces to log2(n) compares - so 3 compares for 8 entries. Oh and one extra backwards compare to check if the value equals the index. If it is both not less and not greater, then it must be equal.
Given the map code is more complex, you don't win, as you have seen. With (many) more keywords to check for, you will get a benefit.
If you use unordered_map, then the string is converted into an integer hash value, and that is effectively used to directly look up the answer. Calculating the hash can be a slow process, though, and a final compare will be done to check the found value exactly matches. There is also a higher set-up cost generating the hashes for all the keywords. In this particular case, you could write your own hash function which just takes the first 2 characters, and perhaps trim the range down, which would be somewhat faster than the default string hash function.
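A minimal sketch of that idea, reusing the ActionFunc alias from above (the struct name and the two-character scheme are illustrative, not a drop-in replacement for std::hash):

#include <string>
#include <unordered_map>

// Hash only the first two characters; enough here because all eight
// keywords already differ within their first two letters.
struct FirstTwoCharsHash {
    std::size_t operator()(const std::string& s) const {
        std::size_t h = s.empty() ? 0 : static_cast<unsigned char>(s[0]);
        if (s.size() > 1) h = (h << 8) | static_cast<unsigned char>(s[1]);
        return h;
    }
};

using FastActionMap = std::unordered_map<std::string, ActionFunc, FirstTwoCharsHash>;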
The final alternative is to do a character-by-character look-up. If the first character is 'm' then you immediately reduce the check to 2 options. 's' gives 2 other options, and so-on. For a small data set like this, just doing that first character filter will make a difference, doing a second character gives you unique checks. These are simple character compares, not whole string compares, so much faster!
There is no trivial stdlib data structure that can help you with this. You could write it out longhand as two levels of nested switch statements:
switch (input[0][0])
{
case 'm':
    switch (input[0][1])
    {
    case 'i':
        if (input[0].compare("min") == 0) return MIN;
        return NO_MATCH;
    case 'a':
        if (input[0].compare("max") == 0) return MAX;
        return NO_MATCH;
    }
    return NO_MATCH;
case 's':
    switch (input[0][1])
    {
    case 't': // stdev
    case 'u': // sum
    }
    // ... etc.
(as an exercise for the reader, there is a version of string::compare() that will skip the characters you have already compared)
You can build your own tree structure to represent this word look-up. If you add the ability to skip through multiple letters in the tree you can produce a fairly efficient string dictionary look-up which costs little more than a single string compare.
Another option is that with a sorted vector of string/target pairs you can do a custom binary chop function that takes advantage of knowing that all the strings still in the search span have the same initial letters, and do not need to be compared any more. In this case, it won't make much difference, but with hundreds of keywords, this would be much more maintainable than the manual case statements.
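As a baseline for that last option, here is a plain sorted-vector lookup without the shared-prefix optimisation (names are illustrative; ActionFunc as defined earlier):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// kept sorted by keyword so std::lower_bound can binary-search it
std::vector<std::pair<std::string, ActionFunc>> sortedActions;

ActionFunc* lookup(const std::string& keyword) {
    auto it = std::lower_bound(
        sortedActions.begin(), sortedActions.end(), keyword,
        [](const std::pair<std::string, ActionFunc>& entry, const std::string& k) {
            return entry.first < k;
        });
    if (it != sortedActions.end() && it->first == keyword) return &it->second;
    return nullptr; // no match
}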
But in the end, most keyword detecting code is fast enough using the simple hash table provided by unordered_map.
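For the dispatch-map answer above, that is a one-line change (at the cost of losing ordered iteration):

#include <unordered_map>
using ActionMap = std::unordered_map<std::string, ActionFunc>; // instead of std::map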

custom reduce functions in crossfilter on 2 fields

My data looks like this
field1,field2,value1,value2
a,b,1,1
b,a,2,2
c,a,3,5
b,c,6,7
d,a,6,7
The ultimate goal is to get value1+value2 for each distinct value of field1 and field2: {a: 15 (=1+2+5+7), b: 9 (=1+2+6), c: 10 (=3+7), d: 6 (=6)}
I don't have a good way of rearranging that data, so let's assume the data has to stay like this.
Based on this previous question (thanks @Gordon), I built the dimension using:
cf.dimension(function(d) { return [d.field1,d.field2]; }, true);
But I am left a bit puzzled as to how to write the custom reduce functions for my use case. The main question is: from within the reduceAdd and reduceRemove functions, how do I know which key is currently being "worked on"? I.e. in my case, how do I know whether I'm supposed to take value1 or value2 into account in my sum?
(have tagged dc.js and reductio because it could be useful for users of those libraries)
OK so I ended up doing the following for defining the group :
reduceAdd: (p, v) => {
  if (!p.hasOwnProperty(v.field1)) {
    p[v.field1] = 0;
  }
  if (!p.hasOwnProperty(v.field2)) {
    p[v.field2] = 0;
  }
  p[v.field1] += +v.value1;
  p[v.field2] += +v.value2;
  return p;
}

reduceRemove: (p, v) => {
  p[v.field1] -= +v.value1;
  p[v.field2] -= +v.value2;
  return p;
}

reduceInitial: () => {
  return {};
}
And when you use the group in a chart, you just change the valueAccessor to be (d) => d.value[d.key] instead of the usual (d) => d.value
There is a small inefficiency as you store more data than you need to in the value fields, but if you don't have millions of distinct values it's basically negligible.
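For example, the dc.js wiring might look like this (chart and variable names are made up; the valueAccessor line is the only point):

// `dim` is the [field1, field2] tag dimension, `group` uses the reduce above
chart
  .dimension(dim)
  .group(group)
  .valueAccessor(d => d.value[d.key]); // instead of the usual d => d.value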
You always have a good way to re-arrange the data after you have fetched it and before you feed it to crossfilter ;)
In fact, it's pretty much mandatory as soon as you handle non-string fields (numeric or date).
You can do a reduceSum over multiple fields on the group:
dimension.group().reduceSum(function(d) { return +d.value1 + +d.value2; });

Is it possible to re-order view values by using map reduce?

I need to reorder the Couchbase view results based on the values. I am writing a map/reduce like the one below, and did the sorting using group, reduce, and descending keys.
function (doc, meta) {
  if (doc.transactionType == 'Transaction') {
    for (var i = 0; i < doc.transactionList.length; i++) {
      if (doc.transactionList[i].bucket == 1) {
        emit(doc.transactionList[i].transactionId, null);
      }
    }
  }
}
Reduce= _count
I got the result the way I was expecting. I know Couchbase views filter on the basis of keys only.
{"rows":[
{"key":"Transaction1","value":2},
{"key":"Transaction3","value":3},
{"key":"Transaction4","value":1}
]
}
Is it possible to get the results below? If so, any advice on what I am missing here?
{"rows":[
{"key":"Transaction3","value":3},
{"key":"Transaction1","value":2},
{"key":"Transaction4","value":1}
]
}
No, you cannot sort by value in Couchbase views, only by key.
You should sort before inserting, or after requesting the data from the view.
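A minimal sketch of the second option, sorting client-side after fetching (assuming a result object shaped like the JSON above):

// sort view rows by descending count
result.rows.sort(function (a, b) { return b.value - a.value; });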

Can I put a lua_table in another table to build a multi-dimensional array via C++?

For my game I'm trying to build the following array structure from C++, because the data comes from an external source and should be available in a Lua script.
The array structure should look like the one below. The data is held in a map; the map contains the name of the variable and a list of pairs, where each pair is a key/value pair that becomes one element of one sub-array.
The data prepared in the map is complete and the structure is definitely okay.
So basically I have
typedef std::map<std::string, std::list<std::pair<std::string, std::string> > > ArrayMap;
// map key:   the array name (e.g. "sword")
// map value: a list of pairs; each pair contains two strings (a key/value
//            pair that becomes one element of the corresponding sub-array)
items = {
["sword"] = {item_id = 1294, price = 500},
["axe"] = {item_id = 1678, price = 200},
["red gem"] = {item_id = 1679, price = 2000},
}
What I got so far now is:
for (ArrayMap::iterator it = this->npc->arrayMap.begin(); it != this->npc->arrayMap.end(); it++) {
    std::string arrayName = (*it).first;
    if ((*it).second.size() > 0) {
        lua_newtable(luaState);
        for (ArrayEntryList::iterator itt = (*it).second.begin(); itt != (*it).second.end(); itt++) {
            LuaScript::setField(luaState, (*itt).first.c_str(), (*itt).second.c_str());
        }
        lua_setglobal(luaState, arrayName.c_str());
    }
}
But this will only generate the following structure:
(table)
[item_id] = (string) 2000
[name] = (string) sword
[price] = (string) 500
The problem is that the table can of course only contain each index once.
That's why I need something like "a table in a table"; is that possible?
Is there a way to achieve this? I'm glad for any hints.
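For reference, tables nest freely in Lua and the C API supports this directly; a hedged sketch along the lines of the loop above (same ArrayMap/setField names as in the question):

lua_newtable(luaState); // the outer "items" table
for (ArrayMap::iterator it = arrayMap.begin(); it != arrayMap.end(); it++) {
    lua_newtable(luaState); // one inner table per item
    for (ArrayEntryList::iterator itt = it->second.begin(); itt != it->second.end(); itt++) {
        LuaScript::setField(luaState, itt->first.c_str(), itt->second.c_str());
    }
    // items[name] = inner table; pops the inner table off the stack
    lua_setfield(luaState, -2, it->first.c_str());
}
lua_setglobal(luaState, "items"); // pops the outer table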
So from what I understand, if you have two "sword" entries, then you cannot store the second one? If that's the case, you are doing it wrong. The key of a map should be unique, and if you decide to use std::map to store your items, then your external source should provide unique keys. I used std::string as the key in my previous game. Example:
"WeakSword" -> { some more data }
"VeryWeakSword" -> { some more data }
or, with your data (assuming item_ids are unique), you can get something like the following from the external source:
1294 -> { some more data }
1678 -> { some more data }
I'm not sure how efficient this is, but I wasn't programming a hardware-hungry, bleeding-edge 3D game, so it did a fine job.
The data structure you use also depends on how you are going to use it. For example, if you are always iterating through this structure, why not store it as follows:
class Item { public: /* ... */ private: std::string name; int id; int value; };
std::vector<Item> items; // be careful though, std::vector copies the item when you push it
Extract (or parse) the actual values you want from each entity in the external source and store them in the std::vector. Searching for a specific element in the middle of a std::vector is expensive; however, if your intention is not instant access but iterating over the data, why use a map at all? If your intention is actually reaching a specific key/value pair, you should alter your external data and use unique keys.
Finally, there is also another associative container that stores non-unique key/value pairs, called std::multimap, but I really doubt you need it here.

Data structure for selecting groups of machines

I have this old batch system. The scheduler stores all computational nodes in one big array. Now that's OK for the most part, because most queries can be solved by filtering for nodes that satisfy the query.
The problem I have now is that apart from some basic properties (number of cpus, memory, OS), there are also these weird grouping properties (city, infiniband, network scratch).
Now the issue with these is that when a user requests nodes with infiniband I can't just give him any nodes, but I have to give him nodes connected to one infiniband switch, so the nodes can actually communicate using infiniband.
This is still OK when a user only requests one such property (I can just partition the array by each of the properties and then try to select the nodes in each partition separately).
The problem comes with combining multiple such properties, because then I would have to generate all combinations of the subsets (partitions of the main array).
The good thing is that most of the properties form a subset or equivalence relation (it sort of makes sense for machines on one infiniband switch to be in one city), but unfortunately this isn't strictly true.
Is there some good data structure for storing this kind of semi-hierarchical mostly-tree-like thing?
EDIT: example
node1 : city=city1, infiniband=switch03, networkfs=server01
node2 : city=city1, infiniband=switch03, networkfs=server01
node3 : city=city1, infiniband=switch03
node4 : city=city1, infiniband=switch03
node5 : city=city2, infiniband=switch03, networkfs=server02
node6 : city=city2, infiniband=switch03, networkfs=server02
node7 : city=city2, infiniband=switch04, networkfs=server02
node8 : city=city2, infiniband=switch04, networkfs=server02
Users request:
2x node with infiniband and networkfs
The desired output would be: (node1, node2) or (node5, node6) or (node7, node8).
In an ideal situation this example wouldn't happen, but we actually have these weird cross-site connections in some cases. If the nodes in city2 were all on infiniband switch04, it would be easy. Unfortunately, now I have to generate groups of nodes that have the same infiniband switch and the same network filesystem.
In reality the problem is much more complicated, since users don't request entire nodes, and the properties are many.
Edit: added the desired output for the query.
Assuming you have p grouping properties and n machines, a bucket-based solution is the easiest to set up and provides O(2^p · log(n)) access and updates.
You create a bucket-heap for every group of properties (so you would have a bucket-heap for "infiniband", a bucket-heap for "networkfs" and a bucket-heap for "infiniband × networkfs"), which means 2^p bucket-heaps in total.
Each bucket-heap contains a bucket for every combination of values (so the "infiniband" bucket-heap would contain a bucket for key "switch04" and one for key "switch03"), which means a total of at most n·2^p buckets split across all bucket-heaps.
Each bucket is a list of servers (possibly partitioned into available and unavailable). The bucket-heap is a standard heap (see std::make_heap) where the value of each bucket is the number of available servers in that bucket.
Each server stores references to all buckets that contain it.
When you look for servers that match a certain group of properties, you just look in the corresponding bucket-heap for that property group, and climb down the heap looking for a bucket that's large enough to accommodate the number of servers requested. This takes O(log(p)·log(n)).
When servers are marked as available or unavailable, you have to update all buckets containing those servers, and then update the bucket-heaps containing those buckets. This is an O(2^p · log(n)) operation.
If you find yourself having too many properties (so that 2^p grows out of control), the algorithm allows some bucket-heaps to be built on demand from other bucket-heaps: if the user requests "infiniband × networkfs" but you only have a bucket-heap available for "infiniband" or "networkfs", you can turn each bucket in the "infiniband" bucket-heap into a bucket-heap of its own (use a lazy algorithm so you don't have to process all buckets if the first one works) and use a lazy heap-merging algorithm to find an appropriate bucket. You can then use an LRU cache to decide which property groups are stored and which are built on demand.
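A stripped-down sketch of the bucket layer for one property group ("infiniband × networkfs"), without the heap ordering or the availability tracking; all names are illustrative:

#include <map>
#include <string>
#include <utility>
#include <vector>

// key: (infiniband switch, networkfs server) -> node ids in that bucket
using BucketKey = std::pair<std::string, std::string>;
std::map<BucketKey, std::vector<std::string>> ibNfsBuckets;

void addNode(const std::string& node, const std::string& ib, const std::string& nfs) {
    ibNfsBuckets[{ib, nfs}].push_back(node);
}

// find any bucket with at least k nodes (this linear scan is what the
// bucket-heap replaces with an O(log n) descent)
const std::vector<std::string>* findGroup(std::size_t k) {
    for (const auto& kv : ibNfsBuckets)
        if (kv.second.size() >= k) return &kv.second;
    return nullptr;
}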
My guess is that there won't be an "easy, efficient" algorithm and data structure to solve this problem, because what you're doing is akin to solving a set of simultaneous equations. Suppose there are 10 categories (like city, infiniband and network) in total, and the user specifies required values for 3 of them. The user asks for 5 nodes, let's say. Your task is then to infer values for the remaining 7 categories, such that at least 5 records exist that have all 10 category fields equal to these values (the 3 specified and the 7 inferred). There may be multiple solutions.
Still, provided there aren't too many different categories, and not too many distinct possibilities within each category, you can do a simple brute force recursive search to find possible solutions, where at each level of recursion you consider a particular category, and "try" each possibility for it. Suppose the user asks for k records, and may choose to stipulate any number of requirements via required_city, required_infiniband, etc.:
either(x, y) := if defined(x) then [x] else y

For each city c in either(required_city, [city1, city2]):
    For each infiniband i in either(required_infiniband, [switch03, switch04]):
        For each networkfs nfs in either(required_nfs, [undefined, server01, server02]):
            Do at least k records of type [c, i, nfs] exist? If so, return them.
The either() function is just a way of limiting the search to the subspace containing points that the user gave constraints for.
Based on this, you will need a way to quickly look up the number of points (rows) for any given [c, i, nfs] combination -- nested hashtables will work just fine for this.
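A hedged sketch of such a nested-hashtable count index, using the field names from the example above:

#include <string>
#include <unordered_map>

// counts[city][infiniband][networkfs] = number of matching nodes
using CountIndex =
    std::unordered_map<std::string,
        std::unordered_map<std::string,
            std::unordered_map<std::string, int>>>;

CountIndex counts;

void indexNode(const std::string& city, const std::string& ib, const std::string& nfs) {
    ++counts[city][ib][nfs];
}

// the innermost test of the recursion: do at least k [c, i, nfs] records exist?
bool atLeast(const CountIndex& idx, const std::string& c,
             const std::string& i, const std::string& nfs, int k) {
    auto c1 = idx.find(c);
    if (c1 == idx.end()) return false;
    auto c2 = c1->second.find(i);
    if (c2 == c1->second.end()) return false;
    auto c3 = c2->second.find(nfs);
    return c3 != c2->second.end() && c3->second >= k;
}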
Step 1: Create an index for each property. E.g. for each property+value pair, create a sorted list of nodes with that property+value. Put each such list into an associative array of some kind, that is, something like an STL map, one for each property, indexed by value. When you are done, you have a near-constant-time function that can return the list of nodes matching a single property+value pair. Each list is simply sorted by node number.
Step 2: Given a query, for each property+value pair required, retrieve the list of nodes.
Step 3: Starting with the shortest list (call it list 0), compare it to each of the other lists in turn, removing elements from list 0 that are not in the other lists.
You should now have just the nodes that have all the properties requested.
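Since every list is sorted by node number, step 3 is a standard sorted-list intersection; a minimal sketch:

#include <algorithm>
#include <iterator>
#include <vector>

// intersect two node lists that are both sorted by node id
std::vector<int> intersect(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}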
Your other option would be to use a database, it is already set up to support queries like this. It can be done all in memory with something like BerkeleyDB with the SQL extensions.
If sorting the list by every criterion mentioned in the query is viable (or having the list pre-sorted by each relative criterion), this works very well.
By "relative criteria", I mean criteria not of the form "x must be 5", which are trivial to filter against, but criteria of the form "x must be the same for each item in the result set". If there are also criteria of the "x must be 5" form, then filter against those first, then do the following.
It relies on using a stable sort on multiple columns to find the matching groups quickly (without trying out combinations).
The complexity is (number of nodes × number of criteria in the query) for the algorithm itself, plus (number of nodes × log(number of nodes) × number of criteria) for the sort, if not pre-sorting. So O(Nodes · log(Nodes) · Criteria) overall.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace bleh
{
    // Node type implied by the original snippet: a property-name -> value bag
    class Node
    {
        public Dictionary<string, string> Properties = new Dictionary<string, string>();
    }

    class Program
    {
        static void Main(string[] args)
        {
            List<Node> list = new List<Node>();

            // create a random input list
            Random r = new Random();
            for (int i = 1; i <= 10000; i++)
            {
                Node node = new Node();
                for (char c = 'a'; c <= 'z'; c++)
                    node.Properties[c.ToString()] = (r.Next() % 10 + 1).ToString();
                list.Add(node);
            }

            // If you have any absolute criteria, filter the list first according
            // to them, which is very easy; I am sure you know how to do that.
            // Only look at relative criteria after removing nodes eliminated by
            // absolute criteria.

            // example
            List<string> criteria = new List<string> { "c", "h", "r", "x" };
            criteria = criteria.OrderBy(x => x).ToList();

            // order the list by each relative criterion, using a ***STABLE*** sort
            foreach (string s in criteria)
                list = list.OrderBy(x => x.Properties[s]).ToList();

            // size of sought group
            int n = 4;

            // this is the algorithm: scan for a run of n rows that agree on all criteria
            int sectionstart = 0;
            int sectionend = 0;
            for (int i = 1; i < list.Count; i++)
            {
                bool same = true;
                foreach (string s in criteria)
                    if (list[i].Properties[s] != list[sectionstart].Properties[s]) same = false;
                if (same == true) sectionend = i;
                else sectionstart = i;
                if (sectionend - sectionstart == n - 1) break;
            }

            // print the results
            Console.WriteLine("\r\nResult:");
            for (int i = sectionstart; i <= sectionend; i++)
            {
                Console.Write("[" + i.ToString() + "]" + "\t");
                foreach (string s in criteria) Console.Write(list[i].Properties[s] + "\t");
                Console.WriteLine();
            }
            Console.ReadLine();
        }
    }
}
I would do something like this (obviously, instead of strings you should map them to ints, and use the ints as codes):
struct structNode
{
    std::set<std::string> sMachines;
    std::map<std::string, int> mCodeToIndex;
    std::vector<structNode> vChilds;
};
// vCodes is read-only, so take it by const reference (this also lets callers pass temporaries)
void Fill(std::string strIdMachine, int iIndex, structNode* pNode, const std::vector<std::string> &vCodes)
{
    if (iIndex < (int)vCodes.size())
    {
        // Add "empty" if needed
        if (pNode->vChilds.size() == 0)
        {
            pNode->mCodeToIndex.insert(pNode->mCodeToIndex.begin(), std::make_pair("empty", 0));
            pNode->vChilds.push_back(structNode());
        }
        // Add for "empty"
        pNode->vChilds[0].sMachines.insert(strIdMachine);
        Fill(strIdMachine, (iIndex + 1), &pNode->vChilds[0], vCodes);
        if (vCodes[iIndex] == "empty")
            return;

        // Add for "any"
        std::map<std::string, int>::iterator mIte = pNode->mCodeToIndex.find("any");
        if (mIte == pNode->mCodeToIndex.end())
        {
            mIte = pNode->mCodeToIndex.insert(pNode->mCodeToIndex.begin(), std::make_pair("any", (int)pNode->vChilds.size()));
            pNode->vChilds.push_back(structNode());
        }
        pNode->vChilds[mIte->second].sMachines.insert(strIdMachine);
        Fill(strIdMachine, (iIndex + 1), &pNode->vChilds[mIte->second], vCodes);

        // Add for the concrete segment value
        mIte = pNode->mCodeToIndex.find(vCodes[iIndex]);
        if (mIte == pNode->mCodeToIndex.end())
        {
            mIte = pNode->mCodeToIndex.insert(pNode->mCodeToIndex.begin(), std::make_pair(vCodes[iIndex], (int)pNode->vChilds.size()));
            pNode->vChilds.push_back(structNode());
        }
        pNode->vChilds[mIte->second].sMachines.insert(strIdMachine);
        Fill(strIdMachine, (iIndex + 1), &pNode->vChilds[mIte->second], vCodes);
    }
}
//////////////////////////////////////////////////////////////////////
// Get
//
// NULL on empty group
//////////////////////////////////////////////////////////////////////
std::set<std::string>* Get(structNode* pNode, int iIndex, const std::vector<std::string> &vCodes, int iMinValue)
{
    if (iIndex < (int)vCodes.size())
    {
        std::map<std::string, int>::iterator mIte = pNode->mCodeToIndex.find(vCodes[iIndex]);
        if (mIte != pNode->mCodeToIndex.end())
        {
            if ((int)pNode->vChilds[mIte->second].sMachines.size() < iMinValue)
                return NULL;
            else
                return Get(&pNode->vChilds[mIte->second], (iIndex + 1), vCodes, iMinValue);
        }
        else
            return NULL;
    }
    return &pNode->sMachines;
}
To fill the tree with your sample:
structNode stRoot;
const char* dummy[]  = { "city1", "switch03", "server01" };
const char* dummy2[] = { "city1", "switch03", "empty" };
const char* dummy3[] = { "city2", "switch03", "server02" };
const char* dummy4[] = { "city2", "switch04", "server02" };

// Fill the tree with the sample
Fill("node1", 0, &stRoot, std::vector<std::string>(dummy, dummy + 3));
Fill("node2", 0, &stRoot, std::vector<std::string>(dummy, dummy + 3));
Fill("node3", 0, &stRoot, std::vector<std::string>(dummy2, dummy2 + 3));
Fill("node4", 0, &stRoot, std::vector<std::string>(dummy2, dummy2 + 3));
Fill("node5", 0, &stRoot, std::vector<std::string>(dummy3, dummy3 + 3));
Fill("node6", 0, &stRoot, std::vector<std::string>(dummy3, dummy3 + 3));
Fill("node7", 0, &stRoot, std::vector<std::string>(dummy4, dummy4 + 3));
Fill("node8", 0, &stRoot, std::vector<std::string>(dummy4, dummy4 + 3));
Now you can easily obtain all the combinations you want. For example, your query would be something like this:
std::vector<std::string> vCodes;
vCodes.push_back("empty");   // Discard the first property (cities)
vCodes.push_back("any");     // Any value for infiniband
vCodes.push_back("any");     // Any value for networkfs (except empty)
std::set<std::string>* pMachines = Get(&stRoot, 0, vCodes, 2);
And, for example, only city2 on switch03 with networkfs not empty:
std::vector<std::string> vCodes;
vCodes.push_back("city2");    // Only city2
vCodes.push_back("switch03"); // Only switch03
vCodes.push_back("any");      // Any value for networkfs (except empty)
std::set<std::string>* pMachines = Get(&stRoot, 0, vCodes, 2);