I have a sample database in CouchDB containing information about a number of aircraft, and a view which emits the manufacturer as the key and the document ID as the value.
The map function is
function(doc) {
  emit(doc["Manufacturer"], doc._id);
}
and the reduce function is
function(keys, values, rereduce) {
  return values.length;
}
This is pretty simple. And indeed I get the correct result when I view it in Futon, where I have 26 Boeing aircraft:
"BOEING" 26
But if I use a REST client to query the view using
http://localhost:6060/aircrafts/_design/basic/_view/VendorProducts?key="BOEING"
I get
{"rows":[
{"key":null,"value":2}
]}
I have tested different clients (including a web browser, REST client extensions, and curl), and all give me the value 2, while queries with other keys work correctly.
Is there something wrong with the MapReduce function or my query?
The issue could be due to grouping.
Using group=true (which is Futon's default), you get a separate reduce value for each unique key in the map; that is, all values which share the same key are grouped together and reduced to a single value.
Were you passing group=true as a query parameter when querying with curl and the other clients? Since Futon passes it by default, you saw results like
BOEING : 26
Whereas without group=true, only the overall reduced value is returned.
So try this query:
http://localhost:6060/aircrafts/_design/basic/_view/VendorProducts?key="BOEING"&group=true
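With group=true, the response should then contain one row per distinct key, along the lines of (given the Futon result above):
{"rows":[
{"key":"BOEING","value":26}
]}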
You seem to be falling into the rereduce trap. Strictly speaking, CouchDB uses a map-reduce-rereduce process.
Map: reformats your data into the output format.
Reduce: aggregates the data of several (but not necessarily all) entries with the same key; this works correctly in your case.
Rereduce: does the same as reduce, but on previously reduced data.
Because you change the format of the value in the reduce stage (from a list of document IDs to a count), the rereduce call counts the already-reduced values instead of summing them. That is presumably why you see 2: two intermediate reduce results for "BOEING" were rereduced, and values.length over those two counts is 2.
Solutions:
You can simply emit 1 as the value in the map and return the sum of the values in the reduce.
You can check for rereduce == true and in that case return the sum of the values, which are the integer counts returned by the initial reduce (both fixes are sketched below).
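A sketch of both fixes; sum() is a built-in helper in CouchDB's JavaScript view server (and for simple counting, the built-in _count reduce does the same job):
// Solution 1: emit 1 per document and sum in the reduce.
function(doc) {
  emit(doc["Manufacturer"], 1);
}
function(keys, values, rereduce) {
  return sum(values);
}
// Solution 2: keep the original map, but handle the rereduce pass explicitly.
function(keys, values, rereduce) {
  if (rereduce) {
    return sum(values); // values are counts produced by earlier reduce passes
  }
  return values.length;
}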
Related
I've got a JSON object in my logs that shows up as the following:
"result":{
"totalRecords":8,
"bot":3,
"member":5,
"message":0,
"reaction":0,
"success":0,
"error":0,
"unknown":8
}
I'm trying to write a logs insights query to graph the values of each of those keys. Essentially I want a line chart with a different line for the value of each of the keys. Currently I have my query as the following:
fields result.bot, result.error, result.member, result.message, result.reaction,
result.success, result.totalRecords, result.unknown
| stats count(result.bot), count(result.error),
count(result.member),count(result.message),
count(result.reaction),count(result.success),
count(result.totalRecords), count(result.unknown) by bin(30s)
This returns the count of how many times the keys show up in the logs, but not the values.
What I need to know is how to get the value of a given key. I tried appending a .0, for example count(result.totalRecords.0), as was suggested in the AWS docs, but it doesn't return any value. What is the query for the value of a key?
Based on the documentation:
Counts the log events. count() (or count(*)) counts all events returned by the query, while count(fieldName) counts all records that include the specified field name.
You can instead write:
stats sum(result.bot), sum(result.error) by bin(30s)
and so on for each key; the full query is spelled out below. This will give you the sum of those values over 30-second periods. You can shorten the period if you want finer granularity.
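Spelling out the full query for all of the keys in the example object:
stats sum(result.bot), sum(result.error),
sum(result.member), sum(result.message),
sum(result.reaction), sum(result.success),
sum(result.totalRecords), sum(result.unknown) by bin(30s)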
In DynamoDB, is there a way to guarantee that exactly n results will be returned if I specify a limit and a filter?
The problem I see is that the docs state:
In a response, DynamoDB returns all the matching results within the scope of the Limit value. For example, if you issue a Query or a Scan request with a Limit value of 6 and without a filter expression, DynamoDB returns the first six items in the table that match the specified key conditions in the request (or just the first six items in the case of a Scan with no filter). If you also supply a FilterExpression value, DynamoDB will return the items in the first six that also match the filter requirements (the number of results returned will be less than or equal to 6).
So this means six items will be retrieved and then the filter applied. How can I keep searching until I get exactly 6 items? (Ideally there would be some setting in the query to keep going until the limit has been reached, or the table has been exhausted.)
For example, suppose I make a query to get 50 people whose name is "john". DynamoDB would read 50 people and then apply the "john" filter, so now only 3 people are returned.
Is there a way I can ensure it will keep searching until the limit of 50 is satisfied?
I don't want to use a Scan since a Scan always searches every item in the table (regardless of limit -- correct me if I'm wrong on this).
How can I make the query keep applying the filter lazily until the Limit is satisfied?
If you can filter in the query itself (via the key condition), that is best, since you avoid the filter expression entirely. But if you can't, then given the way DynamoDB works, the filter is essentially just a scan over the results that have already been read; it is basically a way to save on bandwidth, not much more. You can still use pagination to get more results: keep reissuing the query with the returned LastEvaluatedKey until you have enough items or there are no more pages (see the sketch below). And if you're using DynamoDB you probably care about your request rate, so having that control over how many queries you actually issue (and their size) is kind of a good thing.
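A minimal sketch of that pagination loop with the AWS SDK for JavaScript (v2); the table name, key names, and attribute values here are made-up assumptions for illustration:
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function queryUntilLimit(limit) {
  const items = [];
  let startKey; // undefined on the first request
  do {
    const page = await docClient.query({
      TableName: 'People',                          // hypothetical table
      KeyConditionExpression: 'city = :c',          // hypothetical partition key
      FilterExpression: '#n = :n',                  // the "john" filter
      ExpressionAttributeNames: { '#n': 'name' },   // "name" is a reserved word
      ExpressionAttributeValues: { ':c': 'London', ':n': 'john' },
      ExclusiveStartKey: startKey,
    }).promise();
    items.push(...page.Items);
    startKey = page.LastEvaluatedKey; // undefined once the key range is exhausted
  } while (items.length < limit && startKey);
  return items.slice(0, limit);
}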
If I have two search fields for a search, id and name, and I did not enter values in either field, what would be the default values for them? How should I write an SQL query to get all records in such a case?
I am using Jersey 2.0.
It could be valid to return all values. However, it is usually not a good idea to return a potentially large number of results.
Therefore, all your responses should have a maximum number of results, with the option of repeating the query with a different offset to get the remaining results (that is, pagination); a sketch follows.
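A sketch of such a query (the table and column names are assumptions; :id, :name, :pageSize, and :offset are bind parameters supplied by the application, with the first two set to NULL when the search fields are empty):
SELECT id, name
FROM records
WHERE (:id IS NULL OR id = :id)
  AND (:name IS NULL OR name = :name)
ORDER BY id
LIMIT :pageSize OFFSET :offset;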
Actually, this question was raised by someone else here: https://stackoverflow.com/questions/13338799/does-couchdbs-group-true-prevent-rereduce.
But there is no convincing answer.
group=true is the conceptual equivalent of group_level=exact, so CouchDB runs a reduce per unique key in the map row set.
That is how it is explained in the documentation.
It sounds like CouchDB would collect all the values for the same key and reduce only once per distinct key.
But another article says:
If the query is on the reduce value of each key (group_by_key = true), then CouchDB tries to locate the boundary of each key. Since this range probably does not align exactly with the B+Tree nodes, CouchDB needs to figure out the edges at both ends, locate the partially matched leaf B+Tree nodes, and resend their map results (with that key) to the view server. This reduce result is then merged with the existing rereduce result to compute the final reduce result for that key.
It sounds like rereduce may still happen even when group=true.
In my project there are many documents, but after grouping there are at most 2 values with the same key for each distinct key.
Will rereduce happen in this case?
Yes. Rereduce is always a possibility.
If this is a problem, there is a rereduce parameter in the reduce function, which allows you to detect if this is happening.
http://docs.couchdb.org/en/latest/couchapp/ddocs.html#reduce-and-rereduce-functions
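A sketch of using that parameter to detect (and correctly handle) a rereduce pass, reusing CouchDB's built-in sum() and log() helpers:
function(keys, values, rereduce) {
  if (rereduce) {
    log("rereduce happened"); // written to the CouchDB log
    return sum(values);      // values are counts from earlier reduce passes
  }
  return values.length;
}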
I need to pick a document from a collection at random (alternatively, a small number of successive documents from a randomly positioned "window").
I've found two solutions: 1 and 2. The first is unacceptable since I anticipate a large collection size and wish to minimize the document size. The second seems inefficient (I'm not sure about the complexity of the skip operation). And here one can find a mention of querying a document with a specified index, but I don't know how to do it (I'm using the C++ driver).
Are there other solutions to the problem? Which is the most efficient?
I had a similar issue once. In my case, I had a date property on my documents. I knew the earliest date possible in the dataset, so in my application code I would generate a random date between EARLIEST_DATE_IN_SET and NOW, then query MongoDB with a $gte query on the date property and simply limit it to one result.
There was a small chance that the random date would be greater than the highest date in the data set, so I accounted for that in the application code.
With an index on the date property, this was a super fast query.
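The same idea sketched with the Node.js driver (the collection name "docs" and field "createdAt" are assumptions; the original poster would translate this to the C++ driver):
async function randomDoc(db, earliestDate) {
  // Pick a random instant between the earliest date in the set and now.
  const span = Date.now() - earliestDate.getTime();
  const randomDate = new Date(earliestDate.getTime() + Math.random() * span);

  const coll = db.collection('docs');
  let doc = await coll.find({ createdAt: { $gte: randomDate } })
    .sort({ createdAt: 1 }) // fast with an index on createdAt
    .limit(1)
    .next();

  // Account for the small chance that randomDate fell past the newest document.
  if (!doc) {
    doc = await coll.find({ createdAt: { $lte: randomDate } })
      .sort({ createdAt: -1 })
      .limit(1)
      .next();
  }
  return doc;
}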
It seems like you could adapt solution 1 there (assuming your _id key is an auto-increment value): just do a count on your records, use that count as the upper limit for a random int in C++, then grab that document.
Likewise, if you don't have an auto-increment _id key, just create one with your results; having an additional int field shouldn't add that much to your document size.
If you don't have an auto-increment field, the MongoDB docs explain how to quickly add one here: Auto Inc Field.
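A sketch of that approach, again in Node.js driver syntax (the collection name "docs" and counter field "seq" are assumptions, with "seq" assumed to run densely from 0 to n-1):
async function randomBySeq(db) {
  const coll = db.collection('docs');
  const n = await coll.countDocuments();   // upper limit for the random int
  const i = Math.floor(Math.random() * n); // random index in [0, n)
  return coll.findOne({ seq: i });         // fast with an index on seq
}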