How would you use map/reduce on this document structure?

If I wanted to count foobar.relationships.friend.count, how would I use map/reduce against this document structure so that the count equals 22 (10 + 12)?
[
  [0] {
    "rank" => nil,
    "profile_id" => 3,
    "20130913" => {
      "foobar" => {
        "relationships" => {
          "acquaintance" => {
            "count" => 0
          },
          "friend" => {
            "males_count" => 0,
            "ids" => [],
            "females_count" => 0,
            "count" => 10
          }
        }
      }
    },
    "20130912" => {
      "foobar" => {
        "relationships" => {
          "acquaintance" => {
            "count" => 0
          },
          "friend" => {
            "males_count" => 0,
            "ids" => [
              [0] 77,
              [1] 78,
              [2] 79
            ],
            "females_count" => 0,
            "count" => 12
          }
        }
      }
    }
  }
]

In JavaScript, this query gets you the result you expect:
r.db('test').table('test').get(3).do(function(doc) {
  return doc.keys().map(function(key) {
    return r.branch(
      doc(key).typeOf().eq('OBJECT'),
      doc(key)("foobar")("relationships")("friend")("count").default(0),
      0
    )
  }).reduce(function(left, right) {
    return left.add(right)
  })
})
In Ruby, it should be:
r.db('test').table('test').get(3).do{ |doc|
  doc.keys().map{ |key|
    r.branch(
      doc.get_field(key).typeOf().eq('OBJECT'),
      doc.get_field(key)["foobar"]["relationships"]["friend"]["count"].default(0),
      0
    )
  }.reduce{ |left, right|
    left + right
  }
}
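Against the sample document above, both queries should return 22 (10 + 12): rank (null) and profile_id (a number) fail the OBJECT test and contribute 0, while the two dated sub-documents contribute their friend counts.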
I would also tend to think that the schema you are using is not really well suited; it would be better to use something like:
{
  rank: null,
  profile_id: 3,
  people: [
    {
      id: 20130913,
      foobar: { ... }
    },
    {
      id: 20130912,
      foobar: { ... }
    }
  ]
}
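With that layout you can sum the counts without inspecting key types at all. A minimal Ruby sketch, assuming the hypothetical people array above:
r.db('test').table('test').get(3)['people'].map{ |person|
  person['foobar']['relationships']['friend']['count'].default(0)
}.reduce{ |left, right|
  left + right
}.run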
Edit: A simpler way to do it without using r.branch is just to remove the fields that are not objects with the without command.
Ex:
r.db('test').table('test').get(3).without('rank', 'profile_id').do{ |doc|
  doc.keys().map{ |key|
    doc.get_field(key)["foobar"]["relationships"]["friend"]["count"].default(0)
  }.reduce{ |left, right|
    left + right
  }
}.run

I think you will need your own input reader (a custom RecordReader). This site gives you a tutorial on how it can be done: http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/
Then you run MapReduce with a mapper:
Mapper<LongWritable, ClassRepresentingMyRecords, Text, IntWritable>
In your map function you extract the value for count and emit it as the value. A constant key works if all you want is one grand total, since every value emitted under the same key ends up in the same reduce call.
In the reducer you add together all the elements with the same key ('count' in your case).
This should get you on your way, I think.
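A minimal sketch of that mapper/reducer pair, with two assumptions flagged up front: it substitutes Text for ClassRepresentingMyRecords, and it presumes the custom record reader hands each map() call one whole document as its value. CountMapper and CountReducer are hypothetical names, and the regex is a stand-in for real JSON parsing:
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Naive stand-in for a JSON parser: grabs every "count" => N value.
    // In the sample data the acquaintance counts are all 0, so the sum is
    // still 22, but a real implementation should walk the parsed document
    // down to foobar.relationships.friend.count instead.
    private static final Pattern COUNT = Pattern.compile("\"count\"\\s*=>\\s*(\\d+)");

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        Matcher m = COUNT.matcher(record.toString());
        while (m.find()) {
            // Constant key: every value lands in the same reduce() call.
            context.write(new Text("count"), new IntWritable(Integer.parseInt(m.group(1))));
        }
    }
}

class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // add together all elements sharing the key
        }
        context.write(key, new IntWritable(sum)); // ("count", 22) for the sample
    }
}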


Conditional Inside A Foreach Loop To Filter Value And Have A Fall Back

I have a strange problem and I can't get my head around it. I have tried loads of different ways to get the desired result, but with no success. What I am looking for is a set of rules with fallbacks when conditions are met in a loop. This is what I have done so far, having tried a multitude of different ways. I am always getting Fallback as the match even though country, postcode and shipping method are defined. It is almost as if it's ignoring these values.
If someone can point me in the right direction I would be grateful, thanks.
$country = 'GB';
$postCode = "LE5";
$shippingMethod = "subscription_shipping";

$shippingMatrix = array(
    array(
        "country" => "GB",
        "isPostCodeExcluded" => "LE5",
        "shippingMethod" => "subscription_shipping",
        "carrier" => "Royal Mail",
        "carrierService" => "Royal Mail",
    ),
    array(
        "country" => "GB",
        "isPostCodeExcluded" => false,
        "shippingMethod" => "subscription_shipping",
        "carrier" => "DHL",
        "carrierService" => "DHL",
    ),
    array(
        "country" => false,
        "isPostCodeExcluded" => false,
        "shippingMethod" => "subscription_shipping",
        "carrier" => "Fallback",
        "carrierService" => "Fallback",
    ),
    array(
        "country" => "GB",
        "isPostCodeExcluded" => false,
        "shippingMethod" => "standard_delivery",
        "carrier" => "DPD",
        "carrierService" => "DPD",
    ),
);

$carriers = [];
foreach ($shippingMatrix as $matrix) {
    // If only Shipping Method is matched then the result will be the fallback
    if ($shippingMethod === $matrix['shippingMethod']) {
        $carriers = [
            $matrix['carrier'],
            $matrix['carrierService'],
        ];
        // If Shipping Method & Country are matched then the result will be DHL
        if ($country === $matrix['country']) {
            $carriers = [
                $matrix['carrier'],
                $matrix['carrierService'],
            ];
            // If Shipping Method & Country & PostCode are matched then the result will be Royal Mail
            if ($postCode === $matrix['isPostCodeExcluded']) {
                $carriers = [
                    $matrix['carrier'],
                    $matrix['carrierService'],
                ];
            }
        }
    }
}
var_dump($carriers);
The $carriers variable is not used with array-append notation, so every row whose shippingMethod matches overrides the previous value; since the Fallback row is the last subscription_shipping row in the matrix, Fallback always wins. All you need to do is change $carriers to $carriers[]. Here is an example:
$carriers[] = [
    $matrix['carrier'],
    $matrix['carrierService'],
];
Thanks for the reply, but this wasn't necessary. After a good old think, I realised I had made the novice mistake of mixing my types up.
$carriers = [];
foreach ($shippingMatrix as $matrix) {
    if ($shippingMethod === $matrix['shippingMethod'] && $country === $matrix['country']) {
        if ($postCode === $matrix['isPostCodeExcluded']) {
            $carriers = [
                $matrix['carrier'],
                $matrix['carrierService'],
            ];
        }
    }
    // Catch-all row: its country is false, so test the matrix value rather
    // than the input ($country is always 'GB' here, so a `!$country` check
    // could never fire), and only fall back when nothing more specific has
    // matched yet.
    if ($shippingMethod === $matrix['shippingMethod'] && !$matrix['country'] && empty($carriers)) {
        $carriers = [
            $matrix['carrier'],
            $matrix['carrierService'],
        ];
    }
}
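With the inputs at the top ($country = 'GB', $postCode = 'LE5', $shippingMethod = 'subscription_shipping'), the first matrix row passes all three checks and the catch-all stays untouched, so the loop should end with:
$carriers = ['Royal Mail', 'Royal Mail'];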

elasticsearch/logstash regex - get the value after a specific key [duplicate]

I have a logfile which looks like this (simplified):
Logline sample
MyLine data={"firstname":"bob","lastname":"the builder"}
I'd like to extract the JSON contained in data and create two fields, one for firstname and one for lastname. However, the output I get is this:
{"message":"Line data={\"firstname\":\"bob\",\"lastname\":\"the builder\"}\r","#version":"1","#timestamp":"2015-11-26T11:38:56.700Z","host":"xxx","path":"C:/logstashold/bin/input.txt","MyWord":"Line","parsedJson":{"firstname":"bob","lastname":"the builder"}}
As you can see
..."parsedJson":{"firstname":"bob","lastname":"the builder"}}
That's not what I need; I need fields for firstname and lastname in Kibana, but Logstash isn't extracting them with the json filter.
LogStash Config
input {
  file {
    path => "C:/logstashold/bin/input.txt"
  }
}
filter {
  grok {
    match => { "message" => "%{WORD:MyWord} data=%{GREEDYDATA:request}" }
  }
  json {
    source => "request"
    target => "parsedJson"
    remove_field => ["request"]
  }
}
output {
  file {
    path => "C:/logstashold/bin/output.txt"
  }
}
Any help greatly appreciated; I'm sure I'm missing something simple.
Thanks
After your json filter, add a mutate filter in order to add the two fields, taking their values from the parsedJson field.
filter {
  ...
  json {
    ...
  }
  mutate {
    add_field => {
      "firstname" => "%{[parsedJson][firstname]}"
      "lastname" => "%{[parsedJson][lastname]}"
    }
  }
}
For your sample log line above that would give:
{
    "message" => "MyLine data={\"firstname\":\"bob\",\"lastname\":\"the builder\"}",
    "@version" => "1",
    "@timestamp" => "2015-11-26T11:54:52.556Z",
    "host" => "iMac.local",
    "MyWord" => "MyLine",
    "parsedJson" => {
        "firstname" => "bob",
        "lastname" => "the builder"
    },
    "firstname" => "bob",
    "lastname" => "the builder"
}

Perl merging results of maps

How do I concatenate the output of two maps to form a single flat array?
I was attempting to use this:
my $test = { 'foo' => [
    map {
        { 'i' => "$_" }
    } 0..1,
    map {
        { 'j' => "$_" }
    } 0..1
] };
To achieve the result of this:
my $test = {'foo' => [
    { 'i' => '0' },
    { 'i' => '1' },
    { 'j' => '0' },
    { 'j' => '1' },
]}
However, this is what I got in $test instead:
{
    'foo' => [
        { 'i' => '0' },
        { 'i' => '1' },
        { 'i' => 'HASH(0x7f90ad19cd30)' },
        { 'i' => 'HASH(0x7f90ae200908)' }
    ]
};
Looks like the result of the second map gets iterated over by the first map.
The list returned by the second map becomes part of the input list for the first one, following the 0..1, in your code. Parentheses can fix that:
use warnings;
use strict;
use Data::Dump;

my $test = {
    'foo' => [
        ( map { { i => $_ } } 0..1 ),
        ( map { { j => $_ } } 0..1 )
    ],
};

dd($test);
since they delimit the expression: now the first map takes only 0..1 as its input list and returns a list, which is then merged with the second map's returned list. (Strictly speaking, you need parentheses only around the first map.)
This prints
{ foo => [{ i => 0 }, { i => 1 }, { j => 0 }, { j => 1 }] }
I've removed unneeded quotes, please restore them if/as needed in your application.
Without parentheses, the map after the comma is taken to be part of the expression generating the input list for the first map, so it produces the next element(s) of that input list after 0..1; effectively:
map { i => $_ } (0..1, LIST);
Consider
my @arr = (
    map { { 'i', $_ } }
        0..1,
        qw(a b),
        map { ( { 'j', $_ } ) } 0..1
);

dd(\@arr);
It prints
[
    { i => 0 },
    { i => 1 },
    { i => "a" },
    { i => "b" },
    { i => { j => 0 } },
    { i => { j => 1 } },
]
This is also seen in your output, where all keys are i (there are no j keys).

logstash-filter not honoring the regex

I am reading files as input and then passing them on to be filtered; accordingly, based on [type], the if/else for output (stdout) follows.
Here is the conf part:
filter {
  if [path] =~ "error" {
    mutate {
      replace => { "type" => "ERROR_LOGS" }
    }
    grok {
      match => { "error_examiner" => "%{GREEDYDATA:err}" }
    }
    if [err] =~ "9999" {
      if [err] =~ "invalid offset" {
        mutate {
          replace => { "type" => "DISCARDED_ERROR_LOGS" }
        }
        grok {
          match => { "message" => "\[%{DATA:date}\] \[%{WORD:logtype} \] \[%{IPORHOST:ip}\]->\[http://search:9999/%{WORD:searchORsuggest}/%{DATA:askme_tag}\?%{DATA:paramstr}\] \[%{DATA:reason}\]" }
        }
        date {
          match => [ "date", "YYYY-MM-DD aaa HH:mm:ss" ]
          locale => en
        }
        geoip {
          source => "ip"
          target => "geo_ip"
        }
        kv {
          source => "paramstr"
          trimkey => "&\?\[\],"
          value_split => "="
          target => "params"
        }
      }
      else {
        mutate {
          replace => { "type" => "ACCEPTED_ERROR_LOGS" }
        }
        grok {
          match => {
            "message" => "\[%{DATA:date}\] \[%{WORD:logtype} \] \[%{WORD:uptime}\/%{NUMBER:downtime}\] \[%{IPORHOST:ip}\]->\[http://search:9999/%{WORD:searchORsuggest}\/%{DATA:askme_tag}\?%{DATA:paramstr}\]"
          }
        }
        date {
          match => [ "date" , "YYYY-MM-DD aaa HH:mm:ss" ]
          locale => en
        }
        geoip {
          source => "ip"
          target => "geo_ip"
        }
        kv {
          source => "paramstr"
          trimkey => "&\?\[\],"
          value_split => "="
          target => "params"
        }
      }
    }
    else if [err] =~ "Exception" {
      mutate {
        replace => { "type" => "EXCEPTIONS_IN_ERROR_LOGS" }
      }
      grok {
        match => { "message" => "%{GREEDYDATA}" }
      }
    }
  }
  else if [path] =~ "info" {
    mutate {
      replace => { "type" => "INFO_LOGS" }
    }
    grok {
      match => {
        "info_examiner" => "%{GREEDYDATA:info}"
      }
    }
    if [info] =~ "9999" {
      mutate {
        replace => { "type" => "ACCEPTED_INFO_LOGS" }
      }
      grok {
        match => {
          "message" => "\[%{DATA:date}\] \[%{WORD:logtype} \] \[%{WORD:uptime}\/%{NUMBER:downtime}\]( \[%{WORD:qtype}\])?( \[%{NUMBER:outsearch}/%{NUMBER:insearch}\])? \[%{IPORHOST:ip}\]->\[http://search:9999/%{WORD:searchORsuggest}/%{DATA:askme_tag}\?%{DATA:paramstr}\]"
        }
      }
      date {
        match => [ "date" , "YYYY-MM-DD aaa HH:mm:ss" ]
        locale => en
      }
      geoip {
        source => "ip"
        target => "geo_ip"
      }
      kv {
        source => "paramstr"
        trimkey => "&\?\[\],"
        value_split => "="
        target => "params"
      }
    }
    else {
      mutate { replace => { "type" => "DISCARDED_INFO_LOGS" } }
      grok {
        match => { "message" => "%{GREEDYDATA}" }
      }
    }
  }
}
I have tested the grok regexes at http://grokdebug.herokuapp.com/ and they work; however, what's not working is this part:
grok {
  match => { "error_examiner" => "%{GREEDYDATA:err}" }
}
if [err] =~ "9999" {
I was wondering what's wrong in there?
Actually, I have fixed it. Here is what I'd like to share with other fellows: things I learnt during my initial experiments with Logstash, since the documentation and other resources aren't very forthcoming...
"error_examiner" or "info_examiner" won't work; parse the instance/event row in "message" instead (see the sketch below).
geoip doesn't work for internal IPs.
kv: you must specify field_split and value_split if the pairs aren't like a=1 b=2; if they are, say, a:1&b:2, then field_split is & and value_split is :.
stdout prints badly if the codec chosen is json; please choose rubydebug.
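A minimal sketch of the first and third points (the kv settings shown are for the hypothetical a:1&b:2 case, not this config's paramstr format):
grok {
  # the raw event text lives in "message"; "error_examiner" never exists as a field
  match => { "message" => "%{GREEDYDATA:err}" }
}
kv {
  source => "paramstr"
  field_split => "&"   # pairs separated by &
  value_split => ":"   # key separated from its value by :
  target => "params"
}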
Thanks,

logstash target of repeat operator is not specified:

I want to use Logstash to ship my logs, so I downloaded Logstash 1.5.0 RC2 and ran it on Ubuntu with the command:
bin/logstash -f test.conf
Then the console shows an error. The error reported is:
target of repeat operator is not specified: /;(?<Args:method>*);(?<INT:traceid:int>(?:[+-]?(?:[0-9]+)));(?<INT:sTime:int>(?:[+-]?(?:[0-9]+)));(?<INT:eTime:int>(?:[+-]?(?:[0-9]+)));(?<HOSTNAME:hostname>\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b));(?<INT:eoi:int>(?:[+-]?(?:[0-9]+)));(?<INT:ess:int>(?:[+-]?(?:[0-9]+)));(?<Args:args>*)/m
I don't know how to solve this error; maybe you can help me.
My test.conf is as follows:
input { stdin { } }

filter {
  grok {
    match => [ "message" , "%{INT:type}" ]
  }
  if [type] == "10" {
    grok {
      patterns_dir => "./patterns"
      match => [ "message" , ";%{Args:method};%{INT:traceid:int};%{INT:sTime:int};%{INT:eTime:int};%{HOSTNAME:hostname};%{INT:eoi:int};%{INT:ess:int};%{Args:args}" ]
    }
    date {
      match => [ "sTime" , "UNIX_MS" ]
    }
    ruby {
      code => "event['duration'] = event['eTime'] - event['sTime']"
    }
  }
  if [type] =~ /3[1-6]/ {
    grok {
      patterns_dir => "./patterns"
      match => [ "message" , ";%{Args:method};%{Args:sessionid};%{INT:traceID:int};%{INT:sTime:int};%{INT:eTime:int};%{HOSTNAME:hostname};%{INT:eoi:int};%{INT:ess:int};URL{%{HttpField:url}};RequestHeader{%{HttpField:ReqHeader}};RequestPara{%{HttpField:ReqPara}};RequestAttr{%{HttpField:ReqAttr}};SessionAttr{%{HttpField:SessionAttr}};ResponseHeader{%{HttpField:ResHeader}}" ]
    }
    kv {
      source => "ReqHeader"
      field_split => ";"
      value_split => ":"
      target => "ReqHeader"
    }
    kv {
      source => "ResHeader"
      field_split => ";"
      value_split => ":"
      target => "ResHeader"
    }
    date {
      match => [ "sTime" , "UNIX_MS" ]
    }
    ruby {
      code => "event['duration'] = event['eTime'] - event['sTime']"
    }
  }
  if [type] == "30" {
    grok {
      patterns_dir => "./patterns"
      match => [ "message" , ";%{Args:method};%{Args:sessionid};%{INT:traceID:int};%{INT:sTime:int};%{INT:eTime:int};%{HOSTNAME:hostname};%{INT:eoi:int};%{INT:ess:int};URL{%{HttpField:url}};RequestHeader{%{HttpField:ReqHeader}};RequestPara{%{HttpField:ReqPara}};RequestAttr{%{HttpField:ReqAttr}};SessionAttr{%{HttpField:SessionAttr}}" ]
    }
    kv {
      source => "ReqHeader"
      field_split => ";"
      value_split => ":"
      target => "ReqHeader"
    }
    kv {
      source => "ResHeader"
      field_split => ";"
      value_split => ":"
      target => "ResHeader"
    }
    date {
      match => [ "sTime" , "UNIX_MS" ]
    }
    ruby {
      code => "event['duration'] = event['eTime'] - event['sTime']"
    }
  }
  if [type] == "20" {
    grok {
      patterns_dir => "./patterns"
      match => [ "message" , ";%{Args:method};%{INT:traceID:int};%{INT:sTime:int};%{INT:eTime:int};%{HOSTNAME:hostname};%{INT:eoi:int};%{INT:ess:int};%{INT:mtype};%{Args:DBUrl}" ]
    }
    date {
      match => [ "sTime" , "UNIX_MS" ]
    }
    ruby {
      code => "event['duration'] = event['eTime'] - event['sTime']"
    }
  }
  if [type] == "21" {
    grok {
      patterns_dir => "./patterns"
      match => [ "message" , ";%{Args:method};%{INT:traceID:int};%{INT:sTime:int};%{INT:eTime:int};%{HOSTNAME:hostname};%{INT:eoi:int};%{INT:ess:int};%{INT:mtype};%{Args:sql};%{Args:bindVariables}" ]
    }
    date {
      match => [ "sTime" , "UNIX_MS" ]
    }
    ruby {
      code => "event['duration'] = event['eTime'] - event['sTime']"
    }
  }
  if [type] == "12" {
    grok {
      patterns_dir => "./patterns"
      match => [ "message" , ";%{Args:method};%{INT:traceID:int};%{INT:sTime:int};%{INT:eTime:int};%{HOSTNAME:hostname};%{INT:eoi:int};%{INT:ess:int};%{Args:logStack}" ]
    }
    date {
      match => [ "sTime" , "UNIX_MS" ]
    }
    ruby {
      code => "event['duration'] = event['eTime'] - event['sTime']"
    }
  }
  if [type] == "11" {
    grok {
      patterns_dir => "./patterns"
      match => [ "message" , ";%{Args:method};%{INT:traceID};%{INT:sTime:int};%{INT:eTime:int};%{HOSTNAME:hostname};%{INT:eoi:int};%{INT:ess:int};%{Args:errorStack}" ]
    }
    date {
      match => [ "sTime" , "UNIX_MS" ]
    }
    ruby {
      code => "event['duration'] = event['eTime'] - event['sTime']"
    }
  }
  if [type] == "50" {
    grok {
      patterns_dir => "./patterns"
      match => [ "message" , ";%{INT:sTime:int};%{HOSTNAME:host};%{Args:JVMName};%{Args:GCName};%{INT:count:int};%{INT:time:int}" ]
    }
    date {
      match => [ "sTime" , "UNIX_MS" ]
    }
  }
  if [type] == "51" {
    grok {
      patterns_dir => "./patterns"
      match => [ "message" , ";%{INT:sTime:int};%{HOSTNAME:host};%{Args:JVMName};%{INT:maxheap};%{INT:currentheap};%{INT:commitheap};%{INT:iniheap};%{INT:maxnonheap};%{INT:currentnonheap};%{INT:commitnonheap};%{INT:ininonheap}" ]
    }
    date {
      match => [ "sTime" , "UNIX_MS" ]
    }
  }
  if [type] == "52" {
    grok {
      patterns_dir => "./patterns"
      match => [ "message" , ";%{INT:sTime:int};%{HOSTNAME:host};%{Args:JVMName};%{Args:iniloadedclasses};%{Args:currentloadedclasses};%{Args:iniunloadedclasses}" ]
    }
    date {
      match => [ "sTime" , "UNIX_MS" ]
    }
  }
}

output {
  elasticsearch {
    host => "127.2.96.1"
    protocol => "http"
    port => "8080"
  }
  stdout {
    codec => rubydebug
  }
}
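For what it's worth, the error message itself points at the likely cause: every %{Args:...} in the expanded regex became (?<Args:method>*), meaning the custom Args pattern in ./patterns expands to a bare *, which is a repeat operator with no target. Giving the quantifier something to repeat should clear the error; a hypothetical ./patterns entry (adjust the character classes to your data):
# Hypothetical pattern definitions: both give the * quantifier a target.
Args [^;]*
HttpField [^;{}]*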