MR - Iterating over values with Iterator, can't parse value for output - mapreduce

The raw file for mapreduce is like this (delimiter: Tab)
Apple 11 12 13
Orange 15 26 10
When I implement the method to add a new feature and separate the numbers with ",", my expected output is:
Apple 3.0:11,12,13
Orange 3.0:15,26,10
But the final output looks like this:
Apple 3.0:11 12 13
Orange 3.0:15 26 10
I tried to print the result for tracing, but it seems next() skips the parsing and jumps straight out of the loop. Can anyone help with this?
public static class Mapper1 extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text key, Text value, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        output.collect(key, value);
    }
}
public static class Reducer1 extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String feature = "3.0:";
        boolean first = true;
        while (values.hasNext()) {
            if (!first) {
                feature += ",";
            }
            feature += values.next().toString();
            System.out.println("count" + feature.length() + "," + feature);
            first = false;
        }
        output.collect(key, new Text(feature));
    }
}

I think it is because your mapper emits only one key-value pair for each record, which is not what you expect. You can check your mapper's output by setting the reducer number to 0 in your driver code:
job.setNumReduceTasks(0);
Mapper input:
Apple 11 12 13
Orange 15 26 10
Actual mapper output: (key, value)
(Apple, 11 12 13)
(Orange, 15 26 10)
Expected mapper output: (key, value)
(Apple, 11)
(Apple, 12)
...
(Orange, 10)
You can either modify your mapper to emit multiple key-value pairs for each record, or use the split() method of the String class to get the substrings from the original one.
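
For example, a minimal sketch of the split-based mapper, keeping the question's old mapred API (whether the numbers arrive tab- or space-separated depends on the job's input format, so the split pattern here is an assumption):

public static class Mapper1 extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text key, Text value, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        // Emit one (key, number) pair per number so the reducer's
        // iterator sees "11", "12" and "13" as separate values.
        for (String number : value.toString().split("\\s+")) {
            output.collect(key, new Text(number));
        }
    }
}

With this mapper, the comma-joining loop in the reducer produces Apple 3.0:11,12,13 as expected.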

LINQ to extract duplicate data occurring more than 3 times

class Year
{
    public int YearNumber;
    public List<Month> Months = new List<Month>();
}
class Month
{
    public int MonthNumber;
    public List<Day> Days = new List<Day>();
}
class Day
{
    public int DayNumber;
    public string Event;
}
So I have a list of years (List<Year> years). How do I get another list containing the dates that have duplicate events on the same day? Events can happen on multiple dates; that does not matter. What matters is finding out whether any date has the same event across different years. Lastly, filter to only those that occur more than 3 times. For example, 5 July 2014, 5 July 2017, and 5 July 2019 are all 'Abc Festival', which occurs more than 3 times. So you get the date, the event, and the number of occurrences.
Using just the classes you show, we can only group dates, where a "date" is a day in a month:
var query = from y in years
            from m in y.Months
            from d in m.Days
            select new { m.MonthNumber, d.DayNumber }
            into date
            group date by date
            into dateGroup
            where dateGroup.Count() > 2
            select dateGroup;
As you see, the core solution is to build new { m.MonthNumber, d.DayNumber } objects and group them.
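
The question also asks for the event and the number of occurrences, not just the date. A hedged sketch of the same pattern extended with the event, assuming events match by exact string equality (the variable names are illustrative):

var eventQuery = from y in years
                 from m in y.Months
                 from d in m.Days
                 select new { m.MonthNumber, d.DayNumber, d.Event }
                 into occurrence
                 group occurrence by occurrence
                 into g
                 where g.Count() > 2
                 select new { g.Key.MonthNumber, g.Key.DayNumber, g.Key.Event, Count = g.Count() };

Because anonymous types compare by value, grouping by the whole occurrence object groups identical (month, day, event) triples across years.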

spark scala pattern matching on a dataframe column

I am coming from an R background. I was able to implement the pattern search on a DataFrame column in R, but I am now struggling to do it in Spark Scala. Any help would be appreciated.
The problem statement is broken down in detail just to describe it appropriately.
DF :
Case Freq
135322 265
183201,135322 36
135322,135322 18
135322,121200 11
121200,135322 8
112107,112107 7
183201,135322,135322 4
112107,135322,183201,121200,80000 2
I am looking for a pattern-search UDF which gives me back all the matches of the pattern and the corresponding Freq values from the second column.
Example: for pattern 135322, I would like to find all the matches in the first column Case. It should return the corresponding Freq numbers from the Freq column, like 265,36,18,11,8,4,2.
For pattern 112107,112107 it should return just 7, because there is only one matching pattern.
This is how the end result should look
Case Freq results
135322 265 265+36+18+11+8+4+2
183201,135322 36 36+4+2
135322,135322 18 18+4
135322,121200 11 11+2
121200,135322 8 8+2
112107,112107 7 7
183201,135322,135322 4 4
112107,135322,183201,121200,80000 2 2
What I tried so far:
val text = DF.select("case").collect().map(_.getString(0)).mkString("|")

// search function for pattern search
val valsum = udf((txt: String, pattern: String) => {
  txt.split("\\|").count(_.contains(pattern))
})

// apply the UDF on the first col
val dfValSum = DF.withColumn("results", valsum(lit(text), DF("case")))
This one works:
import common.Spark.sparkSession
import java.util.regex.Pattern
import util.control.Breaks._

object playground extends App {

  import org.apache.spark.sql.functions._

  val pattern = "135322,121200" // Pattern you want to search for

  // udf declaration
  val coder: ((String, String) => Boolean) = (caseCol: String, pattern: String) => {
    var result = true
    val splitPattern = pattern.split(",")
    val splitCaseCol = caseCol.split(",")
    var foundAtIndex = -1
    for (i <- 0 to splitPattern.length - 1) {
      breakable {
        for (j <- 0 to splitCaseCol.length - 1) {
          if (j > foundAtIndex) {
            println(splitCaseCol(j))
            if (splitCaseCol(j) == splitPattern(i)) {
              result = true
              foundAtIndex = j
              break
            } else result = false
          } else result = false
        }
      }
    }
    println(caseCol, result)
    result
  }

  // registering the udf
  val udfFilter = udf(coder)

  // reading the input file
  val df = sparkSession.read.option("delimiter", "\t").option("header", "true").csv("output.txt")

  // calling the function and aggregating
  df.filter(udfFilter(col("Case"), lit(pattern))).agg(lit(pattern), sum("Freq")).toDF("pattern", "sum").show
}
if input is
135322,121200
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,121200|13.0|
+-------------+----+
if input is
135322,135322
Output is
+-------------+----+
| pattern| sum|
+-------------+----+
|135322,135322|22.0|
+-------------+----+

Grouping the output of a CouchDB View

I have a map reduce view:
.....
emit( diffYears, doc.xyz );
reduced with _sum.
xyz is then a number which is summed per integer (diffYears).
The output looks roughly like this:
4 1204.9
5 796.19
6 1124.8
7 1112.6
8 1993.62
9 159.26
10 395.41
11 456.05
12 457.97
13 39.80
14 483.68
15 269.469
etc..
What I would like to do is group the results as follows:
Grouping Total per group
0-4 1959.2 i.e. add up the xyz's for years 0,1,2,3,4
5-9 3998.5 same for 5,6,7,8,9, etc.
10-14 3566.3
I saw a suggestion where a list was used on a view output here: Using a CouchDB view, can I count groups and filter by key range at the same time?
but have been unable to adapt it to get any kind of result.
The code given is:
{
  _id: "_design/authors",
  views: {
    authors_by_date: {
      map: function(doc) {
        emit(doc.date, doc.author);
      }
    }
  },
  lists: {
    count_occurrences: function(head, req) {
      start({ headers: { "Content-Type": "application/json" }});
      var result = {};
      var row;
      while (row = getRow()) {
        var val = row.value;
        if (result[val]) result[val]++;
        else result[val] = 1;
      }
      return result;
    }
  }
}
I substituted var val = row.key in this section:
while (row = getRow()) {
  var val = row.value;
  if (result[val]) result[val]++;
  else result[val] = 1;
}
(although in this case the result is a count.)
This seems to be the way to do it.
(It is like having a startkey and endkey for each grouping, which I can do manually, naturally, but not inside a process. Or is there a way of passing multiple start and end keys in one GET request?)
This must be a fairly normal thing to do, especially for researchers doing statistical analysis, so I assume it does get done, but I cannot locate any examples as far as CouchDB is concerned.
I would appreciate some help with this, or a pointer in the right direction. Many thanks.
EDIT:
Perhaps the answer lies in grouping the output within 'reduce'?
You can accomplish what you want using a complex key. The limitation is that the group size is static and needs to be defined in the view.
You'll need a simple step function to create your groups within map like:
var size = 5;
var group = (doc.diffYears - (doc.diffYears % size)) / size;
emit([group, doc.diffYears], doc.xyz);
The reduce function can remain _sum.
Now when you query the view use group_level to control the grouping. At group_level=0, everything will be summed and one value will be returned. At group_level=1 you'll receive your desired sums of 0-4, 5-9 etc. At group_level=2 you'll get your original output.
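
For example, assuming the view is saved as xyz_by_group in a design document named stats (both names illustrative) in a database called mydb, the grouped sums can be requested with:

GET /mydb/_design/stats/_view/xyz_by_group?group_level=1
GET /mydb/_design/stats/_view/xyz_by_group?group_level=2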

Use gazetteer as dictionary within JAPE rule in GATE

I have this scenario:
I have a list of key-value pairs in the form of (for instance)
000.000.0001.000 VALUE1
000.000.0002.000 VALUE2
...
000.010.0001.000 VALUE254
The documents presents the information using a table as follows:
SK1 | SK2 | SK3 | SK4
000 | 000 | 0001 | 000
The problem is that when processing this table, it turns to
000
000
0001
000
So a gazetteer won't match it. I figured I could construct a JAPE rule to match this, and it works, properly matching the 4 key parts.
Now I would need to load the gazetteer from within my JAPE rule into a structure (for instance, a hashmap) so I can look up the concatenation of these 4 key parts and get (for example) "VALUE1". Is it possible to load a gazetteer from within a JAPE file and use it as a dictionary?
Is there any other (better) way to do what I need to?
Thanks a lot.
I found the solution to my problem using the GazetteerList class with the following snippet:
// Gazetteer object
GazetteerList gazList = new GazetteerList();

// Object to map gazetteer entries to their positions in the list,
// i.e.: 000.000.0001.000 -> 1,3
// This is because, in my case, the same key
// can appear more than once in the gazetteer
HashMap<String, ArrayList<Integer>> keyMap =
        new HashMap<String, ArrayList<Integer>>();

try {
    gazList.setMode(GazetteerList.LIST_MODE);
    gazList.setSeparator("\t");
    gazList.setURL(
            new URL("file:/path/to/gazetteer/gazetteer_list_file.lst"));
    gazList.load();

    // Here is the mapping between the keys and their positions
    int pos = 0;
    for (GazetteerNode gazNode : gazList) {
        if (keyMap.get(gazNode.getEntry()) == null)
            keyMap.put(gazNode.getEntry(), new ArrayList<Integer>());
        keyMap.get(gazNode.getEntry()).add(pos);
        pos++;
    }
} catch (MalformedURLException ex) {
    System.out.println(ex);
} catch (ResourceInstantiationException ex) {
    System.out.println(ex);
}
Then you can look up the matched key in the map and get its features:

for (Integer index : keyMap.get(key)) {
    FeatureMap fmap = toFeatureMap(gazList.get(index).getFeatureMap());
    fmap.put("additionalFeature", "feature");
    outputAS.add(startOffset, endOffset, "Lookup", fmap);
}

finding avg/min/max for a dataset using mapreduce

I am trying to write a MapReduce practice program where my data set is something like this.
It is about salaries of people in every year in a country/city/state:
place year salary($)
america 2014 60,000
france 2010 40,000
india 2012 20,000
australia 2001 50,000
america 2014 65,000
I want the output to be something like this:
place year avg min max
america 2014 62500 60000 65000
france 2010 40000 40000 40000
Please guide me on how I can write such a MapReduce program, or point me to any sample program which already handles such a case.
Thanks in advance :)
I have tried the mapper part:
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String year = null;
        String country = null;
        String amount = null;
        // this will work even if we receive more than 1 line
        Scanner scanner = new Scanner(value.toString());
        String line;
        String[] tokens;
        while (scanner.hasNext()) {
            line = scanner.nextLine();
            tokens = line.split("\\s+");
            country = tokens[0];
            year = tokens[1];
            amount = tokens[2];
            context.write(new Text(country), new Text(year));
            context.write(new Text(year), new Text(amount));
        }
    }
}
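
A minimal sketch of one way to finish this (not from the original post): have the mapper emit a single composite place-and-year key with the salary, commas stripped, as a numeric value, then compute all three statistics in one pass in the reducer. Class and variable names below are illustrative.

// Pairs with a mapper that instead does:
//   context.write(new Text(country + "\t" + year),
//       new DoubleWritable(Double.parseDouble(amount.replace(",", ""))));
// Requires org.apache.hadoop.io.DoubleWritable and org.apache.hadoop.mapreduce.Reducer.
public static class StatsReducer extends Reducer<Text, DoubleWritable, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        long count = 0;
        // one pass over all salaries for this (place, year) key
        for (DoubleWritable v : values) {
            double salary = v.get();
            sum += salary;
            min = Math.min(min, salary);
            max = Math.max(max, salary);
            count++;
        }
        // output: "place year" -> "avg min max"
        context.write(key, new Text((sum / count) + "\t" + min + "\t" + max));
    }
}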