Reduce multiple field values in CouchDB - mapreduce

For a notification system I am storing user messages in CouchDB. My document structure looks like this:
{
  "_id": "bc325c2f6a99194ecab6e7bbae81b609",
  "_rev": "1-9745ddb21cefe3dbc8cc1f7f7bd8d11d",
  "uid": "bc325c2f6a99194ecab6e7bbae80eb29",
  "msg": "Hi there, here is a message for you!",
  "level": "warning",
  "status": "new",
  "created": 1467180077,
  "read": null
}
The status field can have one of two values, new or read. I am trying to write a map/reduce view that returns, for a given user, how many of their messages have status new and how many have status read.
The map function is pretty straightforward:
function(doc) {
  emit(doc.uid, doc.status);
}
and it returns something like this:
| key | value
+------------------------------------+------------------------------
| bc325c2f6a99194ecab6e7bbae80eb29 | "read"
+------------------------------------+------------------------------
| bc325c2f6a99194ecab6e7bbae80eb29 | "read"
+------------------------------------+------------------------------
| bc325c2f6a99194ecab6e7bbae80eb29 | "read"
+------------------------------------+------------------------------
| bc325c2f6a99194ecab6e7bbae80eb29 | "new"
+------------------------------------+------------------------------
| bc325c2f6a99194ecab6e7bbae80eb29 | "new"
I am trying now to figure out how to write a reduce function that would produce this:
| key | value
+------------------------------------+------------------------------
| bc325c2f6a99194ecab6e7bbae80eb29 | {"read": 3, "new": 2}
+------------------------------------+------------------------------
If I use this reduce function
function (key, values, rereduce) {
  return values;
}
I get:
| key | value
+------------------------------------+------------------------------
| bc325c2f6a99194ecab6e7bbae80eb29 | ["read","read","read","new","new"]
+------------------------------------+------------------------------
I tried to use return count(values) but it returns null.
I can't seem to wrap my head around how to approach this. Can anyone put me on the right track here?

My recommendation would be to emit the uid and status as your key; then you will get the counts as your reduced value.
First, we'll start with the map function:
function (doc) {
  emit([ doc.uid, doc.status ]);
}
Notice we're emitting a key only, and we're not bothering with a value. (it turns out we don't need it in this case)
Then, use the following built-in reduce function
_count
Your view output will have keys like: (remember, we don't care about value)
[ "bc325c2f6a99194ecab6e7bbae80eb29", "read" ]
[ "bc325c2f6a99194ecab6e7bbae80eb29", "read" ]
[ "bc325c2f6a99194ecab6e7bbae80eb29", "read" ]
[ "bc325c2f6a99194ecab6e7bbae80eb29", "new" ]
[ "bc325c2f6a99194ecab6e7bbae80eb29", "new" ]
When you query the view, make sure to include group=true (the reduce=true is implied)
You'll notice that your view has the following key/value pairs now:
[ "bc325c2f6a99194ecab6e7bbae80eb29", "read" ]: 3
[ "bc325c2f6a99194ecab6e7bbae80eb29", "new" ]: 2
If your database has more documents in it, you'll see all the other user ids as well. To filter down to the user you care about, simply use:
startkey=["bc325c2f6a99194ecab6e7bbae80eb29"]
endkey=["bc325c2f6a99194ecab6e7bbae80eb29",{}]
These key arrays may look a bit odd, but they just ensure that any value for the status will be matched. (Check out the documentation on Views Collation for more information.)
This approach will scale up rather nicely, allowing any other status values to work without any code changes.
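Putting it all together, the view could live in a design document along these lines (the design document and view names here are just placeholders):
{
  "_id": "_design/messages",
  "views": {
    "status_by_user": {
      "map": "function (doc) { emit([ doc.uid, doc.status ]); }",
      "reduce": "_count"
    }
  }
}
A grouped, filtered query for one user would then look roughly like this (the startkey/endkey values need to be URL-encoded in a real request):
GET /mydb/_design/messages/_view/status_by_user?group=true&startkey=["bc325c2f6a99194ecab6e7bbae80eb29"]&endkey=["bc325c2f6a99194ecab6e7bbae80eb29",{}]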

I found a way of doing this, but I am not sure if it is the right approach:
The reduce function:
function (key, values, rereduce) {
  var result = { read: 0, "new": 0 };
  for (var i = 0; i < values.length; ++i) {
    if (values[i] == 'new') {
      result.new++;
    }
    if (values[i] == 'read') {
      result.read++;
    }
  }
  return result;
}
will produce this result:
| key | value
+------------------------------------+------------------------------
| bc325c2f6a99194ecab6e7bbae80eb29 | {read: 2, new: 1}
+------------------------------------+------------------------------
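One thing to watch out for with a custom reduce like this: CouchDB may also call it with rereduce set to true, in which case values contains the objects returned by earlier reduce calls rather than the raw status strings, so on larger views the counts can come out wrong. A rough sketch that handles both cases (assuming the same map function as above) could look like this:
function (keys, values, rereduce) {
  var result = { read: 0, "new": 0 };
  for (var i = 0; i < values.length; i++) {
    if (rereduce) {
      // values are partial { read, new } objects from earlier reduce calls
      result.read += values[i].read;
      result["new"] += values[i]["new"];
    } else {
      // values are the raw status strings emitted by the map function
      if (values[i] == 'new') {
        result["new"]++;
      }
      if (values[i] == 'read') {
        result.read++;
      }
    }
  }
  return result;
}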

Related

How can I replace every matching JSON field's value with incremental value?

So I have a large JSON file (approximately 20k hosts) and for each host, I need to find FieldA and replace its value with a unique value which I can then swap back later.
For instance:
root> cat file.json | jq .
[
  {
    "id": 1,
    "uptime": 0,
    "computer_name": "Computer01"
  },
  {
    "id": 2,
    "uptime": 0,
    "computer_name": "Computer02"
  }
]
I need to iterate through this list of 20k hosts and replace every computer_name with a dummy value:
[
  {
    "id": 1,
    "uptime": 0,
    "computer_name": "Dummy01"
  },
  {
    "id": 2,
    "uptime": 0,
    "computer_name": "Dummy02"
  }
]
And if possible, export the dummy value and original value to a table side by side linking them up.
I want to generate the dummy values automatically, e.g. for each computer_name, replace the value with Dummy????? where ????? is a number from 00000 to 99999, incremented for each host.
I attempted to use cat file.json | jq .computer_name or jq .computer_name file.json to filter this down and then work on replacing the values, but when I use .computer_name as the filter, I get this error:
jq: error : Cannot index array with string "computer_name".
Thanks in advance.
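As an aside, the error above comes from applying .computer_name directly to the top-level array; you have to descend into its elements first, for example:
jq '.[].computer_name' file.json
or
jq 'map(.computer_name)' file.json
to get the names one per line or as a JSON array, respectively.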
I would first generate a master table containing both the clear and the obfuscated names, then extract from it as needed: either the protected version (by removing the clear names) or a table with the mappings. You can even perform direct lookups on it:
jq 'with_entries(.key |= "Dummy\("\(.)" | "0"*(5-length) + .)" | .value += {
clear_name: .value.computer_name,
computer_name: .key
})' file.json > master.json
cat master.json
{
  "Dummy00000": {
    "id": 1,
    "uptime": 0,
    "computer_name": "Dummy00000",
    "clear_name": "Computer01"
  },
  "Dummy00001": {
    "id": 2,
    "uptime": 0,
    "computer_name": "Dummy00001",
    "clear_name": "Computer02"
  }
}
jq 'map(del(.clear_name))' master.json
[
  {
    "id": 1,
    "uptime": 0,
    "computer_name": "Dummy00000"
  },
  {
    "id": 2,
    "uptime": 0,
    "computer_name": "Dummy00001"
  }
]
jq -r '.[] | [.clear_name, .computer_name] | @tsv' master.json
Computer01 Dummy00000
Computer02 Dummy00001
jq --arg lookup "Dummy00001" '.[$lookup]' master.json
{
  "id": 2,
  "uptime": 0,
  "computer_name": "Dummy00001",
  "clear_name": "Computer02"
}
jq -r --arg lookup "Dummy00001" '.[$lookup].clear_name' master.json
Computer02
It's not clear what you mean exactly by "exporting the dummy value and original value", but the following should be sufficient for you to figure out that detail, since the result contains the information necessary to create the table:
def lpad($len; $fill): tostring | ($len - length) as $l | ($fill * $l)[:$l] + .;
. as $in
| ([length|tostring|length, 5]|max) as $width
| [range(0; length) as $i
| $in[$i]
| . + {computer_name: ("Dummy" + ($i | lpad($width;"0"))),
original_name: .computer_name}]
You could use this to create your table, and then remove original_name
e.g., by running:
map(del(.original_name))
Of course there are other ways to achieve your stated goal, but if leaving the original names in the file is an option, then that might be worth considering, since it might obviate the need to maintain a separate table.
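For example, with the filter above saved to a file (say rename.jq, a placeholder name), a run could look like:
jq -f rename.jq file.json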

Build CGAL Mesh_criteria one by one

I used to build my CGAL Mesh_criteria as follows, and that works well.
auto criteria = Mesh_criteria(
    CGAL::parameters::edge_size = edge_size,
    CGAL::parameters::facet_angle = facet_angle,
    CGAL::parameters::facet_size = facet_size,
    CGAL::parameters::facet_distance = facet_distance,
    CGAL::parameters::cell_radius_edge_ratio = cell_radius_edge_ratio,
    CGAL::parameters::cell_size = size
);
Now I have a function in which only some of the criteria are constrained; the other values are invalid (e.g., negative). I would like to build the Mesh_criteria as follows (pseudocode), but I don't know how to do it:
auto criteria = Mesh_criteria();
if edge_size > 0.0:
criteria.add(CGAL::parameters::edge_size=edge_size);
if facet_angle > 0.0:
criteria.add(CGAL::parameters::facet_angle=facet_angle);
// [...]
Any hints?
I don't see any solution other than knowing the default values and using the ternary operator ?:.
Here is a copy-paste from the CGAL code that gives you the default values:
template <class ArgumentPack>
Mesh_criteria_3_impl(const ArgumentPack& args)
  : edge_criteria_(args[parameters::edge_size
                     | args[parameters::edge_sizing_field
                       | args[parameters::sizing_field | FT(DBL_MAX)] ] ])
  , facet_criteria_(args[parameters::facet_angle | FT(0)],
                    args[parameters::facet_size
                      | args[parameters::facet_sizing_field
                        | args[parameters::sizing_field | FT(0)] ] ],
                    args[parameters::facet_distance | FT(0)],
                    args[parameters::facet_topology | CGAL::FACET_VERTICES_ON_SURFACE])
  , cell_criteria_(args[parameters::cell_radius_edge_ratio
                     | args[parameters::cell_radius_edge | FT(0)] ],
                   args[parameters::cell_size
                     | args[parameters::cell_sizing_field
                       | args[parameters::sizing_field | FT(0)] ] ])
{ }
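So, as a sketch (assuming the criteria variables are doubles, and using the defaults visible above: DBL_MAX for the edge size and 0 for the facet and cell criteria), you could pass the default whenever a constraint is "disabled":
#include <cfloat>  // DBL_MAX

auto criteria = Mesh_criteria(
    CGAL::parameters::edge_size              = (edge_size              > 0.0 ? edge_size              : DBL_MAX),
    CGAL::parameters::facet_angle            = (facet_angle            > 0.0 ? facet_angle            : 0.0),
    CGAL::parameters::facet_size             = (facet_size             > 0.0 ? facet_size             : 0.0),
    CGAL::parameters::facet_distance         = (facet_distance         > 0.0 ? facet_distance         : 0.0),
    CGAL::parameters::cell_radius_edge_ratio = (cell_radius_edge_ratio > 0.0 ? cell_radius_edge_ratio : 0.0),
    CGAL::parameters::cell_size              = (cell_size              > 0.0 ? cell_size              : 0.0)
);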

Cascading filter out bad records in a file

I am using custom functions for DQ (data quality) checks in Cascading, where I set an indicator that I will later use to filter the records into the required pipes.
I have written two functions for this. In the code below, field 'A' is a String for which a null check needs to be done, and field 'B' is a code for which a decimal check needs to be done. The indicator 'Ind' is set based on the quality-check result; it is passed into and set inside the functions IndicatorNull/IndicatorDecimal.
But I am facing an error with this code: I am not able to pass fields 'A'/'Ind' and fields 'B'/'Ind' to the first and second filter of the same pipe.
Am I missing something here? Please let me know how this can be handled. Thanks!
Below is the relevant portion of the code:
Scheme inscheme = new TextDelimited(new Fields("A", "B", "Ind"), ",");
Tap sourceTap = new Hfs(inscheme, infile);
Tap sinkTap = new Hfs(inscheme, outfile);

Pipe BooleanPipe = new Pipe("BooleanPipe");

Fields findreturnNull = new Fields("A", "Ind");
Fields findreturnDecimal = new Fields("B", "Ind");

BooleanPipe = new Each(BooleanPipe, findreturnNull, new IndicatorNull(findreturnNull), Fields.RESULTS);
BooleanPipe = new Each(BooleanPipe, findreturnDecimal, new IndicatorDecimal(findreturnDecimal), Fields.RESULTS);
Below is the error that I am getting -
Exception in thread "main" cascading.flow.planner.PlannerException: could not build flow from assembly: [[BooleanPipe][first.Boolean.main(Boolean.java:48)] unable to resolve argument selector: [{2}:'B', 'Ind'], with incoming: [{2}:'A', 'Ind']]
at cascading.flow.planner.FlowPlanner.handleExceptionDuringPlanning(FlowPlanner.java:577)
at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:286)
at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
at cascading.flow.FlowConnector.connect(FlowConnector.java:450)
at cascading.flow.FlowConnector.connect(FlowConnector.java:426)
at cascading.flow.FlowConnector.connect(FlowConnector.java:275)
at cascading.flow.FlowConnector.connect(FlowConnector.java:220)
at cascading.flow.FlowConnector.connect(FlowConnector.java:202)
at first.Boolean.main(Boolean.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
The problem is with the Fields.RESULTS parameter.
If you look at the flow:
+----------+-----------------+-----------------+------------------------------------------+
| Command  | Incoming Fields | Outgoing Fields | Reason                                   |
+----------+-----------------+-----------------+------------------------------------------+
| Input    | "A", "B", "Ind" | "A", "B", "Ind" | Input, TextDelimited                     |
+----------+-----------------+-----------------+------------------------------------------+
| 1st Each | "A", "B", "Ind" | "A", "Ind"      | Fields.RESULTS keeps only the function's |
|          |                 |                 | result fields; the rest are discarded.   |
+----------+-----------------+-----------------+------------------------------------------+
| 2nd Each | "A", "Ind"      | ERROR           | IndicatorDecimal() looks for field "B",  |
|          |                 |                 | which no longer exists in the pipe.      |
+----------+-----------------+-----------------+------------------------------------------+
Since your input and output fields are the same, the solution is to use Fields.REPLACE instead.
Reference: Fields Sets
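Concretely, only the two Each lines need to change, e.g.:
BooleanPipe = new Each(BooleanPipe, findreturnNull, new IndicatorNull(findreturnNull), Fields.REPLACE);
BooleanPipe = new Each(BooleanPipe, findreturnDecimal, new IndicatorDecimal(findreturnDecimal), Fields.REPLACE);
With Fields.REPLACE the argument fields are replaced in place by the function's result fields of the same names, so "B" is still present in the pipe when the second Each runs.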

Select RDF collection/list and iterate result with Jena

For some RDF like this:
<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:blah="http://www.something.org/stuff#">
  <rdf:Description rdf:about="http://www.something.org/stuff/some_entity1">
    <blah:stringid>string1</blah:stringid>
    <blah:uid>1</blah:uid>
    <blah:myitems rdf:parseType="Collection">
      <blah:myitem>
        <blah:myitemvalue1>7</blah:myitemvalue1>
        <blah:myitemvalue2>8</blah:myitemvalue2>
      </blah:myitem>
      ...
      <blah:myitem>
        <blah:myitemvalue1>7</blah:myitemvalue1>
        <blah:myitemvalue2>8</blah:myitemvalue2>
      </blah:myitem>
    </blah:myitems>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.something.org/stuff/some__other_entity2">
    <blah:stringid>string2</blah:stringid>
    <blah:uid>2</blah:uid>
    <blah:myitems rdf:parseType="Collection">
      <blah:myitem>
        <blah:myitemvalue1>7</blah:myitemvalue1>
        <blah:myitemvalue2>8</blah:myitemvalue2>
      </blah:myitem>
      ....
      <blah:myitem>
        <blah:myitemvalue1>7</blah:myitemvalue1>
        <blah:myitemvalue2>8</blah:myitemvalue2>
      </blah:myitem>
    </blah:myitems>
  </rdf:Description>
</rdf:RDF>
I'm using Jena/SPARQL and I'd like to be able to use a SELECT query to retrieve the myitems node for an entity with a particular stringid, and then extract it from the result set and iterate through the myitem nodes to get their values. Order isn't important.
So I have two questions:
Do I need to specify in my query that blah:myitems is a list?
How can I parse a list in a ResultSet?
Selecting Lists (and Elements) in SPARQL
Let's address the SPARQL issue first. I've modified your data just a little bit so that the elements have different values, so it will be easier to see them in the output. Here's the data in N3 format, which is a bit more concise, especially when representing lists:
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix blah: <http://www.something.org/stuff#> .

<http://www.something.org/stuff/some_entity1>
    blah:myitems  ([ a blah:myitem ;
                     blah:myitemvalue1 "1" ;
                     blah:myitemvalue2 "2"
                   ] [ a blah:myitem ;
                     blah:myitemvalue1 "3" ;
                     blah:myitemvalue2 "4"
                   ]) ;
    blah:stringid "string1" ;
    blah:uid      "1" .

<http://www.something.org/stuff/some__other_entity2>
    blah:myitems  ([ a blah:myitem ;
                     blah:myitemvalue1 "5" ;
                     blah:myitemvalue2 "6"
                   ] [ a blah:myitem ;
                     blah:myitemvalue1 "7" ;
                     blah:myitemvalue2 "8"
                   ]) ;
    blah:stringid "string2" ;
    blah:uid      "2" .
You mentioned in the question selecting the myitems node, but myitems is actually the property that relates the entity to the list. You can select properties in SPARQL, but I'm guessing that you actually want to select the head of the list, i.e., the value of the myitems property. That's straightforward. You don't need to specify that it's an rdf:List, but if the value of myitems could also be a non-list, then you should specify that you're only looking for rdf:Lists. (For developing the SPARQL queries, I'll just run them using Jena's ARQ command line tools, because we can move them to the Java code easily enough afterward.)
prefix blah: <http://www.something.org/stuff#>
select ?list where {
[] blah:myitems ?list .
}
$ arq --data data.n3 --query items.sparql
--------
| list |
========
| _:b0 |
| _:b1 |
--------
The heads of the lists are blank nodes, so this is the sort of result that we're expecting. From these results, you could get the resource from a result set and then start walking down the list, but since you don't care about the order of the nodes in the list, you might as well just select them in the SPARQL query, and then iterate through the result set, getting each item. It also seems likely that you might be interested in the entity whose items you're retrieving, so that's in this query too.
prefix blah: <http://www.something.org/stuff#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select ?entity ?list ?item ?value1 ?value2 where {
?entity blah:myitems ?list .
?list rdf:rest* [ rdf:first ?item ] .
?item a blah:myitem ;
blah:myitemvalue1 ?value1 ;
blah:myitemvalue2 ?value2 .
}
order by ?entity ?list
$ arq --data data.n3 --query items.sparql
----------------------------------------------------------------------------------------
| entity | list | item | value1 | value2 |
========================================================================================
| <http://www.something.org/stuff/some__other_entity2> | _:b0 | _:b1 | "7" | "8" |
| <http://www.something.org/stuff/some__other_entity2> | _:b0 | _:b2 | "5" | "6" |
| <http://www.something.org/stuff/some_entity1> | _:b3 | _:b4 | "3" | "4" |
| <http://www.something.org/stuff/some_entity1> | _:b3 | _:b5 | "1" | "2" |
----------------------------------------------------------------------------------------
By ordering the results by entity and by list (in case some entity has multiple values for the myitems property), you can iterate through the result set and be assured of getting, in order, all the elements in a list for an entity. Since your question was about lists in result sets, and not about how to work with result sets, I'll assume that iterating through the results isn't a problem.
Working with Lists in Jena
The following example shows how you can work with lists in Java. The first part of the code is just the boilerplate to load the model and run the SPARQL query. Once you're getting the results of the query back, you can either treat the resource as the head of a linked list and use the rdf:first and rdf:rest properties to iterate manually, or you can cast the resource to Jena's RDFList and get an iterator out of it.
import java.io.IOException;
import java.io.InputStream;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.RDFList;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;
import com.hp.hpl.jena.vocabulary.RDF;
public class SPARQLListExample {
    public static void main(String[] args) throws IOException {
        // Create a model and load the data
        Model model = ModelFactory.createDefaultModel();
        try ( InputStream in = SPARQLListExample.class.getClassLoader().getResourceAsStream( "SPARQLListExampleData.rdf" ) ) {
            model.read( in, null );
        }
        String blah = "http://www.something.org/stuff#";
        Property myitemvalue1 = model.createProperty( blah + "myitemvalue1" );
        Property myitemvalue2 = model.createProperty( blah + "myitemvalue2" );
        // Run the SPARQL query and get some results
        String getItemsLists = "" +
                "prefix blah: <http://www.something.org/stuff#>\n" +
                "\n" +
                "select ?list where {\n" +
                "  [] blah:myitems ?list .\n" +
                "}";
        ResultSet results = QueryExecutionFactory.create( getItemsLists, model ).execSelect();
        // For each solution in the result set
        while ( results.hasNext() ) {
            QuerySolution qs = results.next();
            Resource list = qs.getResource( "list" ).asResource();
            // Once you've got the head of the list, you can either process it manually
            // as a linked list, using RDF.first to get elements and RDF.rest to get
            // the rest of the list...
            for ( Resource curr = list;
                  !RDF.nil.equals( curr );
                  curr = curr.getRequiredProperty( RDF.rest ).getObject().asResource() ) {
                Resource item = curr.getRequiredProperty( RDF.first ).getObject().asResource();
                RDFNode value1 = item.getRequiredProperty( myitemvalue1 ).getObject();
                RDFNode value2 = item.getRequiredProperty( myitemvalue2 ).getObject();
                System.out.println( item+" has:\n\tvalue1: "+value1+"\n\tvalue2: "+value2 );
            }
            // ...or you can make it into a Jena RDFList that can give you an iterator
            RDFList rdfList = list.as( RDFList.class );
            ExtendedIterator<RDFNode> items = rdfList.iterator();
            while ( items.hasNext() ) {
                Resource item = items.next().asResource();
                RDFNode value1 = item.getRequiredProperty( myitemvalue1 ).getObject();
                RDFNode value2 = item.getRequiredProperty( myitemvalue2 ).getObject();
                System.out.println( item+" has:\n\tvalue1: "+value1+"\n\tvalue2: "+value2 );
            }
        }
    }
}

CouchDB: Group data by key

I have following data in my database:
| value1 | value2 |
|----------+----------|
| 1 | a |
| 1 | b |
| 2 | a |
| 3 | c |
| 3 | d |
|----------+----------|
What I want as output is {"key":1,"value":["a","b"]},{"key":2,"value":["a"]},{"key":3,"value":["c","d"]}
I wrote this map function (but am not quite sure if it is correct):
function(doc) {
  emit(doc.value1, doc.value2);
}
...but I am missing the reduce-function. Thanks for your help!
Not sure if this can/should be done with a reduce function.
However, you can reformat the output with lists. Try the following list function:
function (head, req) {
  var row,
      returnObj = {};
  while (row = getRow()) {
    if (returnObj[row.key]) {
      returnObj[row.key].push(row.value);
    } else {
      returnObj[row.key] = [row.value];
    }
  }
  send(JSON.stringify(returnObj));
}
The output should look like this:
{
  "1": [
    "a",
    "b"
  ],
  "2": [
    "a"
  ],
  "3": [
    "c",
    "d"
  ]
}
Hope that helps.
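For completeness, a list function like this lives in a design document next to the view and is invoked through the _list handler; the database, design document, view, and function names below are placeholders:
{
  "_id": "_design/app",
  "views": {
    "by_value1": {
      "map": "function(doc) { emit(doc.value1, doc.value2); }"
    }
  },
  "lists": {
    "group_values": "function (head, req) { /* list function from above */ }"
  }
}
GET /mydb/_design/app/_list/group_values/by_value1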