Can the combiner and the reducer be different? - mapreduce

In many MapReduce programs, I see a reducer being used as a combiner as well. I know this is because of the specific nature of those programs. But I am wondering if they can be different.

Yes, a combiner can be different to the Reducer, although your Combiner will still be implementing the Reducer interface. Combiners can only be used in specific cases which are going to be job dependent. The Combiner will operate like a Reducer, but only on the subset of the Key/Values output from each Mapper.
One constraint that your Combiner will have, unlike a Reducer, is that the input/output key and value types must match the output types of your Mapper.

Yes, they can be different, but you usually don't want to use a different class, because you will most likely get unexpected results.
Combiners can only be used for functions that are commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c). The framework may run the combiner on only a subset of your keys and values, or not run it at all, and the output of the program must stay the same either way.
A different class with different logic may not give you a sensible result.
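To make the commutative/associative point concrete, here is a minimal plain-Java sketch (no Hadoop involved; the class and method names are made up for illustration). It assumes one key whose values were split across two mappers, and compares "average the averages" with merging (sum, count) pairs and dividing once at the end.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration (no Hadoop): why an averaging reducer cannot double as a combiner. */
class CombinerAverageDemo {

    /** Partial aggregate that is safe to merge in any order or grouping. */
    static final class SumCount {
        final long sum;
        final long count;

        SumCount(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        /** Merging is associative and commutative, so it may run zero, one, or many times. */
        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        double average() {
            return (double) sum / count;
        }
    }

    /** What a combiner would do with one mapper's values for a key. */
    static SumCount combine(List<Integer> values) {
        SumCount acc = new SumCount(0, 0);
        for (int v : values) {
            acc = acc.merge(new SumCount(v, 1));
        }
        return acc;
    }

    /** Naive approach: average each split, then average the averages. */
    static double averageOfAverages(List<List<Integer>> splits) {
        double total = 0;
        for (List<Integer> split : splits) {
            total += combine(split).average();
        }
        return total / splits.size();
    }

    /** Safe approach: merge (sum, count) pairs and divide once at the end. */
    static double mergedAverage(List<List<Integer>> splits) {
        SumCount acc = new SumCount(0, 0);
        for (List<Integer> split : splits) {
            acc = acc.merge(combine(split));
        }
        return acc.average();
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(1, 2, 3),  // values seen by mapper A
                Arrays.asList(10));      // values seen by mapper B
        System.out.println(averageOfAverages(splits)); // 6.0 (wrong)
        System.out.println(mergedAverage(splits));     // 4.0 (true mean of 1, 2, 3, 10)
    }
}
```

The combiner-friendly version works because the intermediate value is a pair that can be merged repeatedly; the division, which is not associative, happens only in the final step.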

Here is an implementation. You can run it with or without the combiner; both give exactly the same answer. Here the Reducer and the Combiner have different purposes and different implementations.
package combiner;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, Text, Average> {
    private final Text name = new Text();

    @Override
    protected void map(LongWritable offSet, Text line, Context context) throws IOException, InterruptedException {
        // Each input line is expected to look like "<name> <number>".
        String[] row = line.toString().split(" ");
        System.out.println("Key " + row[0] + " Value " + row[1]);
        name.set(row[0]);
        // Emit the number together with a count of 1.
        context.write(name, new Average(Integer.parseInt(row[1]), 1));
    }
}
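Reduce Class
package combiner;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<Text, Average, Text, LongWritable> {
    private final LongWritable avg = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<Average> val, Context context) throws IOException, InterruptedException {
        long total = 0;
        long count = 0;
        // Each Average carries a partial sum and the number of values behind it,
        // whether it came straight from a mapper or from the combiner.
        for (Average value : val) {
            total += value.number;
            count += value.count;
        }
        // Divide once, at the very end (integer division, so the result is truncated).
        avg.set(total / count);
        context.write(key, avg);
    }
}
MapObject Class
package combiner;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

public class Average implements Writable {
    long number; // partial sum of the values seen so far
    int count;   // how many values that sum covers

    public Average() {super();}
    public Average(long number, int count) {
        this.number = number;
        this.count = count;
    }
    public long getNumber() {return number;}
    public void setNumber(long number) {this.number = number;}
    public int getCount() {return count;}
    public void setCount(int count) {this.count = count;}

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        number = WritableUtils.readVLong(dataInput);
        count = WritableUtils.readVInt(dataInput);
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        WritableUtils.writeVLong(dataOutput, number);
        WritableUtils.writeVInt(dataOutput, count);
    }
}
Combiner Class
package combiner;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Combine extends Reducer<Text, Average, Text, Average> {
    @Override
    protected void reduce(Text name, Iterable<Average> val, Context context) throws IOException, InterruptedException {
        long total = 0;
        int count = 0;
        // Only add up here and never divide: the combiner may run zero, one,
        // or several times on partial data, so its output must stay mergeable.
        for (Average value : val) {
            total += value.number;
            count += value.count;
        }
        context.write(name, new Average(total, count));
    }
}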
Driver Class
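package combiner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver1 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Usage: Driver1 <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "CustomCombiner");
        job.setJarByClass(Driver1.class);
        job.setMapperClass(Map.class);
        job.setCombinerClass(Combine.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Average.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        // Must match the Reducer's output value type (LongWritable).
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}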

The primary goal of a combiner is to minimize the number of key/value pairs that are
shuffled across the network between mappers and reducers, and thus to save as much
bandwidth as possible.
The rule of thumb for a combiner is that its input and output types must be the same. The reason
is that the combiner is not guaranteed to run: it may or may not be used, depending on the volume
and number of spills.
The reducer can be used as a combiner when it satisfies this rule, i.e. it has the same input and output
types.
The other important rule is that a combiner can only be used when the function you want
to apply is both commutative and associative, like adding numbers, but not in a case like averaging (if you are using the same code as the reducer).
Now, to answer your question: yes, of course they can be different. When your reducer has different input and output types, you have no choice but to make a copy of your reducer code and modify it.
If you are concerned about the logic of the reducer, you can implement that differently as well. For example, a combiner can keep a collection as a local buffer of all the values it receives; this is less risky than doing the same in a reducer, because a reducer is more likely to run out of memory than a combiner. Other differences in logic can certainly exist, and do.

Mocking test: how to refactor a legacy singleton used in static code

I'm looking for the best way to refactor a (huge) legacy code base and introduce some tests into it. There was no test framework (I mean none at all).
It was a JEE 5 application. The goal is to revamp it in JEE 7.
Here is a quick overview.
The end users (those who are authorized) are free to configure many aspects of the application's behavior by setting a bunch of preferences in the UI.
These are stored mostly in an SQL table (the rest in some XML and properties files).
To fulfill this requirement, there is a @Startup object dedicated to building a sort of huge map with all the key-values.
Then, all across the code base, when a use case needs to adapt its processing, it checks the current value of the parameter(s) it needs for its task.
A real case is that the app has to do a few operations on images.
For instance, class ImgProcessing has to create a thumbnail of a picture via this method:
Optional<Path> generateThumb_fromPath(Path origImg);
For this, the method generateThumb_fromPath calls Thumbnailer,
which uses a generic ImageProcessingHelper,
which holds a set of generic image-related tools and methods,
and in particular a static method returning the desired dimensions of the thumbnail to be generated, based on the original image constraints and some thumbnail preferences (keys = "THUMBNAIL_WIDTH" and "THUMBNAIL_HEIGHT").
These preferences are the user's wishes for the size a thumbnail should have.
So far so good, nothing special.
Now the dark side of this:
The original JEE 5 config loader is a bad, old-fashioned, infamous singleton pattern such as:
public class OldBadConfig {
    private static OldBadConfig instance;

    public static OldBadConfig getInstance() {
        if (instance == null) {
            // create, initialize and populate our big preferences' map
            instance = new OldBadConfig();
        }
        return instance;
    }
}
Then, all across the code base, these preferences are used. In my refactoring effort I have already used @Inject to inject the singleton object.
But in static utilities (no injection point available) you have lots of these nasty calls:
OldBadConfig.getInstance().getPreference(key, defaultValue)
(Briefly: I use TestNG + Mockito. I don't think the tools are relevant here; it seems to be more about a terrible original design,
but if I HAVE to change my toolbox (JUnit or whatever) I will. Again, I don't think the tooling is the root problem here.)
Trying to refactor the image part and make it test-friendly, I want to write this test, with cut = an instance of my class under test:
@Test
public void firstTest() {
    Optional<Path> op = cut.generateThumb_fromPath(targetPath);
    // ..assertThatTheThumbnailWasCreated........
}
So, in a few words,
the execution flow will be:
class under test --> some business implementation --> someutilities --> static_global_app_preference --> other_class-othermethod.finalizingProcessing,
then return to the caller.
My testing effort halts here. How do I mock the static_global_app_preference?
How can I refactor the static_global_app_preference part from
OldBadConfig.getInstance().getPreference(key, defaultValue)
to something mockable, where I could mock like this:
Mockito.when(gloablConf.getPreference(eq("THUMBNAIL_WIDTH"), anyString())).thenReturn("32");
I've spent quite some time reading books, blog posts, etc., all saying
'(This kind of) singleton is EVIL. You should NOT do that!'
I think we all agree, thanks.
But what about a real-world, effective solution to such trivial, common tasks?
I cannot add the singleton instance (or the preferences map) as a parameter, because it is already spread all across the code base and would pollute each and every class and method. For instance, in the use case above, it would pollute 5 methods in 4 classes just for one poor, miserable access to a parameter.
It's really not feasible.
So far I have tried to split the OldBadConfig class into two parts: one with all the initialization/write logic,
and the other with only the read parts. That way I can at least make this a real JEE @Singleton and benefit from concurrent access once the startup is over and the configuration is fully loaded.
Then I tried to make this SharedGlobalConf accessible via a factory, called like this:
SharedGlobalConf gloablConf = (new SharedGlobalConfFactory()).getShared();
after which gloablConf.getPreference(key, defaultValue); is accessible.
It seems a little better than the original bottleneck, but it didn't help at all with the testing part.
I thought the factory would ease everything, but nothing like that came of it.
So here is my question:
I can split OldBadConfig into a startup artifact doing the init and refresh, and a SharedGlobalConf which is a pure JEE 7 singleton:
@Singleton
@ConcurrencyManagement(ConcurrencyManagementType.BEAN)
@Lock(LockType.READ)
Then, for the legacy use case described here, how can I make this reasonably mockable? Real-world solutions are all welcome.
Thanks for sharing your wisdom and skills!
I would like to share my own answer.
Let's say we got these classes after the initial large OldBadConfig class was split:
a @Startup AppConfigPopulator in charge of loading all the information and populating the internal cache of sorts,
which is now a distinct SharedGlobalConf object. The populator is the only one in charge of feeding the SharedGlobalConf, via:
@Override
public SharedGlobalConf sharedGlobalConf() {
    if (sharedGlobalConf.isDirty()) {
        this.refreshSharedGlobalConf();
    }
    return sharedGlobalConf;
}

private void refreshSharedGlobalConf() {
    sharedGlobalConf.setParams(params);
    sharedGlobalConf.setvAppTempPath_temp(getAppTempPath_temp());
}
In all components (by that I mean all classes holding valid injection points) you just do your classic
@Inject private SharedGlobalConf globalConf;
For static utilities that cannot use @Inject, we have a SharedGlobalConfFactory which hands the shared data to anything in a one-liner:
SharedGlobalConf gloablConf = (new SharedGlobalConfFactory()).getShared();
That way our old code base can be upgraded smoothly: @Inject in all valid components, and the (too many) old utilities that we cannot reasonably rewrite in this refactoring step can have their
OldBadConfig.getInstance().getPreference(key, defaultValue)
simply replaced by
(new SharedGlobalConfFactory()).getShared().getPreference(key, defaultValue);
And we are test-compliant and mockable!
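The factory itself is not shown in this answer, so the following is only a sketch of what it might look like, under stated assumptions: SharedGlobalConf is modelled as an interface with a single getPreference method, the factory keeps the shared instance in a static field with a static setGloablConf setter (the name used in the test further down), and ThumbnailSizeHelper and FactorySeamDemo are hypothetical stand-ins. A hand-written stub is used instead of Mockito to keep the example dependency-free.

```java
import java.util.HashMap;
import java.util.Map;

/** Read-only view of the preferences; modelled as an interface here (assumption). */
interface SharedGlobalConf {
    String getPreference(String key, String defaultValue);
}

/** Hypothetical factory: a static holder whose setter acts as the test seam. */
class SharedGlobalConfFactory {
    private static volatile SharedGlobalConf shared;

    /** Called by the startup populator in production, and by tests to install a stub or mock. */
    static void setGloablConf(SharedGlobalConf conf) {
        shared = conf;
    }

    SharedGlobalConf getShared() {
        SharedGlobalConf conf = shared;
        if (conf == null) {
            throw new IllegalStateException("SharedGlobalConf has not been initialised");
        }
        return conf;
    }
}

/** Stand-in for a legacy static utility that has no injection point. */
class ThumbnailSizeHelper {
    static int thumbnailWidth() {
        SharedGlobalConf conf = new SharedGlobalConfFactory().getShared();
        return Integer.parseInt(conf.getPreference("THUMBNAIL_WIDTH", "64"));
    }
}

class FactorySeamDemo {
    /** Map-backed stub; a Mockito mock would be installed through the same setter. */
    static SharedGlobalConf stubOf(Map<String, String> values) {
        return (key, defaultValue) -> values.getOrDefault(key, defaultValue);
    }

    public static void main(String[] args) {
        Map<String, String> prefs = new HashMap<>();
        prefs.put("THUMBNAIL_WIDTH", "32");
        SharedGlobalConfFactory.setGloablConf(stubOf(prefs));
        System.out.println(ThumbnailSizeHelper.thumbnailWidth()); // 32
    }
}
```

The trade-off is that the static field is still global state: tests that install a stub should reset it afterwards, and tests touching it should not run in parallel.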
Proof of concept:
A really critical business demand is modeled in this class:
@Named
public class Usage {
    static final Logger logger = LoggerFactory.getLogger(Usage.class);

    @Inject
    private SharedGlobalConf globalConf;

    @Inject
    private BusinessCase bc;

    public String doSomething(String argument) {
        logger.debug(" >>doSomething on {}", argument);
        // do something using bc
        Object importantBusinessDecision = bc.checks(argument);
        logger.debug(" >>importantBusinessDecision :: {}", importantBusinessDecision);
        if (globalConf.isParamFlagActive("StackOverflow_Required", "1")) {
            logger.debug(" >>StackOverflow_Required :: TRUE");
            // Do it !
            return "Done_SO";
        } else {
            logger.debug(" >>StackOverflow_Required :: FALSE -> another");
            // Do it another way
            String resultStatus = importantBusinessDecision + "-" + StaticHelper.anotherWay(importantBusinessDecision);
            logger.debug(" >> resultStatus " + resultStatus);
            return "Done_another_way " + resultStatus;
        }
    }

    public void err() {
        xx();
    }

    private void xx() {
        throw new UnsupportedOperationException(" WTF !!!");
    }
}
To get its job done, it needs a hand from our old companion StaticHelper:
class StaticHelper {
    public static String anotherWay(Object importantBusinessDecision) {
        // System.out.println("zz #anotherWay on "+importantBusinessDecision);
        SharedGlobalConf gloablConf = (new SharedGlobalConfFactory()).getShared();
        String avar = gloablConf.getParamValue("deeperParam", "deeperValue");
        // compute the importantBusinessDecision based on avar
        return avar;
    }
}
As you can see, the old shared key/value holder is still used everywhere, but this time we can test:
public class TestingAgainstOldBadStaticSingleton {
    private final Boolean boolFlagParam;
    private final String deepParam;
    private final String decisionParam;
    private final String argument;
    private final String expected;

    @Factory(dataProvider = "tdpOne")
    public TestingAgainstOldBadStaticSingleton(String argument, Boolean boolFlagParam, String deepParam, String decisionParam, String expected) {
        this.argument = argument;
        this.boolFlagParam = boolFlagParam;
        this.deepParam = deepParam;
        this.decisionParam = decisionParam;
        this.expected = expected;
    }

    // The mocks are created by MockitoAnnotations.initMocks, so no initializer is needed.
    @Mock
    SharedGlobalConf gloablConf;
    @Mock
    BusinessCase bc;
    @InjectMocks
    Usage cut = new Usage();

    @Test
    public void testDoSomething() {
        String result = cut.doSomething(argument);
        assertEquals(result, this.expected);
    }

    @BeforeMethod
    public void setUpMethod() throws Exception {
        MockitoAnnotations.initMocks(this);
        Mockito.when(gloablConf.isParamFlagActive("StackOverflow_Required", "1")).thenReturn(this.boolFlagParam);
        Mockito.when(gloablConf.getParamValue("deeperParam", "deeperValue")).thenReturn(this.deepParam);
        // Hand the mock to the factory so that static helpers see it too.
        SharedGlobalConfFactory.setGloablConf(gloablConf);
        Mockito.when(bc.checks(ArgumentMatchers.anyString())).thenReturn(this.decisionParam);
    }

    @DataProvider(name = "tdpOne")
    public static Object[][] testDatasProvider() {
        return new Object[][]{
            {"**AF-argument1**", false, "AF", "DEC1", "Done_another_way DEC1-AF"},
            {"**AT-argument2**", true, "AT", "DEC2", "Done_SO"},
            {"**BF-Argument3**", false, "BF", "DEC3", "Done_another_way DEC3-BF"},
            {"**BT-Argument4**", true, "BT", "DEC4", "Done_SO"}};
    }
}
The test uses TestNG and Mockito. It shows that we don't need to do the complex stuff (reading the SQL table, the XML files, etc.) but can simply mock different sets of values targeting just our single business case. (If a kind soul would translate this to other frameworks for those interested...)
Since the initial request was about a design that allows a huge existing code base to be reasonably refactored away from the 'static singleton anti-pattern' while introducing tests and mocks, I assume this is a quite valid answer.
I would like to hear your opinions and BETTER alternatives.

Need to sort a list using Wicket

I am working on a very simple program, looking like this:
public class WicketApplication extends WebApplication implements Comparable<Object> {
    private List<Person> persons = Arrays.asList(
        new Person("Mikkel", "20-02-91", 60169803),
        new Person("Jonas", "02-04-90", 86946512),
        new Person("Steffen", "15-07-90", 12684358),
        new Person("Rasmus", "08-12-93", 13842652),
        new Person("Michael", "10-10-65", 97642851));

    /**
     * @see org.apache.wicket.Application#getHomePage()
     */
    @Override
    public Class<? extends WebPage> getHomePage() {
        return SimpleView.class;
    }

    public static WicketApplication get() {
        return (WicketApplication) Application.get();
    }

    /**
     * @return @see org.apache.wicket.Application#init()
     */
    public List<Person> getPersons() {
        return persons;
    }

    public List<Person> getSortedList() {
        return Collections.sort(persons);
        //This won't work before implementing comparator i know, but how??
    }

    @Override
    public void init() {
        super.init();
        // add your configuration here
    }

    @Override
    public int compareTo(Object o) {
        throw new UnsupportedOperationException("Not supported yet."); //To change body of generated methods, choose Tools | Templates.
    }
}
That was the class where I just put my people into a list.
public class SimpleView extends SimpleViewPage {
    public SimpleView() {
        ListView persons = new ListView("persons", getPersons()) {
            @Override
            protected void populateItem(ListItem item) {
                Person person = (Person) item.getModelObject();
                item.add(new Label("name", person.getName()));
                item.add(new Label("birthdate", person.getBirthdate()));
                item.add(new Label("phone", person.getPhone()));
            }
        };
        add(persons);
        add(new Label("size", "Number of people " + getPersons().size()));
    }
}
And here is what I do with the people.
Basically, I want the program to show a table with all the data (this already works).
Now I want to be able to sort them, but I can't for the life of me figure it out. I'm still rather new to programming, and I want a button below my table that can sort by name, birthday or phone number. I was thinking about trying Comparable, but I can't remember it that well and I'm not sure how it works with Wicket.
Thanks for the help in advance :)
What you need is the DataView component, which provides all the support you need for sorting (and paging, should you require it later on).
Here's a working example, if you click on the "Source Code" link in the top right corner, you can see that most of the things you want from a sortable table work out of the box. All you need is to create a suitable data provider.
If you use DataView with a SortableDataProvider, you don't need to worry about writing your own dynamic Comparator. (Which is not a terribly hard task itself, but it's easy to get it wrong.)
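For reference, the hand-written "dynamic Comparator" mentioned above could look roughly like the sketch below. It is plain Java with no Wicket involved; the Person class is a minimal stand-in whose fields are assumed from the constructor calls in the question, and comparatorFor and sortedBy are made-up helper names. Sorting by birthdate is left out because the dates are stored as text and would need parsing first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.StringJoiner;

class PersonSortDemo {

    /** Minimal stand-in for the question's Person class (fields assumed). */
    static final class Person {
        private final String name;
        private final String birthdate;
        private final int phone;

        Person(String name, String birthdate, int phone) {
            this.name = name;
            this.birthdate = birthdate;
            this.phone = phone;
        }

        String getName() { return name; }
        String getBirthdate() { return birthdate; }
        int getPhone() { return phone; }
    }

    /** Picks a comparator by property name, e.g. the name of the sort button that was clicked. */
    static Comparator<Person> comparatorFor(String property) {
        switch (property) {
            case "name":
                return Comparator.comparing(Person::getName);
            case "phone":
                return Comparator.comparingInt(Person::getPhone);
            default:
                throw new IllegalArgumentException("Unknown sort property: " + property);
        }
    }

    /** Sorts a copy, so the original list (here a fixed-size Arrays.asList) keeps its order. */
    static List<Person> sortedBy(List<Person> persons, String property) {
        List<Person> copy = new ArrayList<>(persons);
        copy.sort(comparatorFor(property));
        return copy;
    }

    static String names(List<Person> persons) {
        StringJoiner joiner = new StringJoiner(", ");
        for (Person p : persons) {
            joiner.add(p.getName());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        List<Person> persons = Arrays.asList(
                new Person("Mikkel", "20-02-91", 60169803),
                new Person("Jonas", "02-04-90", 86946512),
                new Person("Steffen", "15-07-90", 12684358),
                new Person("Rasmus", "08-12-93", 13842652),
                new Person("Michael", "10-10-65", 97642851));
        System.out.println(names(sortedBy(persons, "name")));  // Jonas, Michael, Mikkel, Rasmus, Steffen
        System.out.println(names(sortedBy(persons, "phone"))); // Steffen, Rasmus, Mikkel, Jonas, Michael
    }
}
```

Note that Collections.sort and List.sort return void and sort in place, which is why the question's `return Collections.sort(persons);` does not compile.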

How can I get the HBase table name from a Result object as the mapreduce parameter?

HBASE-3996
Support multiple tables and scanners as input to the mapper in map/reduce job.
The map function always looks as follows:
public void map(ImmutableBytesWritable row, Result value, Context context)
In the map function, how can I distinguish which table the (Result)value comes from?
You can extract the TableSplit from the context; this should work for you (not tested):
public void map(ImmutableBytesWritable row, Result value, Context context) {
    TableSplit currentSplit = (TableSplit) context.getInputSplit();
    byte[] tableName = currentSplit.getTableName();
    ....
}

What is meant by parameterization?

While reading one of the articles for Data Driven Testing, I came across a term 'parametrization of a test'. Could someone explain to me what is meant by parameterization here?
Let's look at an example with TestNG. Suppose you have a function SomeClass.calculate(int value). You want to check the results the function returns for different input values.
Without parameterized tests you do something like this:
@Test
public void testCalculate1()
{
    assertEquals(SomeClass.calculate(VALUE1), RESULT1);
}
@Test
public void testCalculate2()
{
    assertEquals(SomeClass.calculate(VALUE2), RESULT2);
}
With a parameterized test:
//This test method declares that its data should be supplied by the Data Provider
//named "calculateDataProvider"
@Test(dataProvider = "calculateDataProvider")
public void testCalculate(int value, int result)
{
    assertEquals(SomeClass.calculate(value), result);
}
//This method will provide data to any test method that declares that its Data Provider
//is named "calculateDataProvider"
@DataProvider(name = "calculateDataProvider")
public Object[][] createData()
{
    return new Object[][] {
        { VALUE1, RESULT1 },
        { VALUE2, RESULT2 },
    };
}
This way, the TestNG engine generates two tests from the testCalculate method, taking the parameters from the array returned by the createData function.
For more details see documentation.
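Stripped of any framework, the same idea can be sketched in plain Java: one test body that runs once per row of a data table. Everything here is hypothetical, including the calculate function (a simple square, standing in for SomeClass.calculate) and the data rows.

```java
class ParameterizationDemo {

    /** Hypothetical function under test; stands in for SomeClass.calculate. */
    static int calculate(int value) {
        return value * value;
    }

    /** The "data provider": each row is {input, expected result}. */
    static int[][] createData() {
        return new int[][] {
            { 2, 4 },
            { 3, 9 },
            { -4, 16 },
        };
    }

    /** One test body, executed once per data row; returns the number of failing rows. */
    static int runAll() {
        int failures = 0;
        for (int[] row : createData()) {
            int actual = calculate(row[0]);
            if (actual != row[1]) {
                failures++;
                System.out.println("calculate(" + row[0] + ") gave " + actual + ", expected " + row[1]);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("failing rows: " + runAll()); // failing rows: 0
    }
}
```

A framework such as TestNG adds what this loop lacks: each row is reported as a separate test, and one failing row does not hide the others.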

NUnit parameterized tests with datetime

Is it not possible with NUnit to do the following?
[TestCase(new DateTime(2010,7,8), true)]
public void MyTest(DateTime startdate, bool expectedResult)
{
    ...
}
I really want to put a datetime in there, but it doesn't seem to like it. The error is:
An attribute argument must be a constant expression, typeof expression
or array creation expression of an attribute parameter type
Some documentation I read seems to suggest you should be able to, but I can't find any examples.
You can specify the date as a constant string in the TestCase attribute and then specify the type as DateTime in the method signature.
NUnit will automatically do a DateTime.Parse() on the string passed in.
Example:
[TestCase("01/20/2012")]
[TestCase("2012-1-20")] // Same case as above in ISO 8601 format
public void TestDate(DateTime dt)
{
Assert.That(dt, Is.EqualTo(new DateTime(2012, 01, 20)));
}
I'd probably use something like the ValueSource attribute to do this:
public class TestData
{
public DateTime StartDate{ get; set; }
public bool ExpectedResult{ get; set; }
}
private static TestData[] _testData = new[]{
new TestData(){StartDate= new DateTime(2010, 7, 8), ExpectedResult= true}};
[Test]
public void TestMethod([ValueSource("_testData")]TestData testData)
{
}
This will run the TestMethod for each entry in the _testData collection.
Another alternative is to use a more verbose approach, especially if I don't necessarily know up front what kind of DateTime (if any) a given string input yields.
[TestCase(2015, 2, 23)]
[TestCase(2015, 12, 3)]
public void ShouldCheckSomething(int year, int month, int day)
{
var theDate = new DateTime(year,month,day);
....
}
...note that long TestCase argument lists get hard to read, so if you need more parameters, consider something like:
private static readonly object[] testCaseInput =
{
    new object[] { 2000, 1, 1, true, "first", true },
    new object[] { 2000, 1, 1, false, "second", false }
};
[Test, TestCaseSource("testCaseInput")]
public void Should_check_stuff(int y, int m, int d, bool condition, string theString, bool result)
{
....
}
You should use the TestCaseData Class as documented: http://www.nunit.org/index.php?p=testCaseSource&r=2.5.9
In addition to specifying an expected result, like:
new TestCaseData(12, 4).Returns(3);
You can also specify expected exceptions, etc.:
new TestCaseData(0, 0)
.Throws(typeof(DivideByZeroException))
.SetName("DivideByZero")
.SetDescription("An exception is expected");
It seems that NUnit doesn't allow the initialization of non-primitive objects in the TestCase(s). It is best to use TestCaseData.
Your test data class would look like this:
public class DateTimeTestData
{
public static IEnumerable GetDateTimeTestData()
{
// If you want past days.
yield return new TestCaseData(DateTime.Now.AddDays(-1)).Returns(false);
// If you want current time.
yield return new TestCaseData(DateTime.Now).Returns(true);
// If you want future days.
yield return new TestCaseData(DateTime.Now.AddDays(1)).Returns(true);
}
}
In your testing class you'd have the test include a TestCaseSource which directs to your test data.
How to use: TestCaseSource(typeof(class name goes here), nameof(class name.source method name goes here))
[Test, TestCaseSource(typeof(DateTimeTestData), nameof(DateTimeTestData.GetDateTimeTestData))]
public bool GetDateTime_GivenDateTime_ReturnsBoolean(DateTime dateTime)
{
    // Arrange - Done in your TestCaseSource
    // Act
    // Method name goes here.
    // Assert
    // You just return the result of the method as this test uses ExpectedResult.
}
NUnit has improved and now implicitly tries to convert the attribute arguments.
See doc: NUnit3 Doc - see note
This works:
[TestCase("2021.2.1", ExpectedResult = false)]
[TestCase("2021.2.26", ExpectedResult = true)]
public bool IsDate(DateTime date) => date.Date.Equals(new DateTime(2021, 2, 26));
Take care to use an English culture format for DateTime string arguments.