Hadoop: Use only a part of the reduce Iterable - mapreduce

I have a situation in which I only want to use the first n values of the Iterable given to my reducer and then stop. I have been reading about the Iterable interface and it seems like this may not be trivial.
I can't use an indexed for loop or call next() directly on the Iterable, and a for-each loop iterates over the whole object. Is there a straightforward solution, or am I approaching the problem wrong?
Thanks.

You can just extract the iterator from the iterable and use a good old for loop, or a while loop.
For example, the code below sums at most the first TOPN values.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    private static final int TOPN = 10;

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        Iterator<IntWritable> iter = values.iterator();
        for (int i = 0; iter.hasNext() && i < TOPN; i++) {
            sum += iter.next().get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
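For what it's worth, a plain enhanced for loop would also work, as long as you break out of it once TOPN values have been consumed; a minimal sketch of the same loop body:

int sum = 0;
int seen = 0;
for (IntWritable value : values) {
    if (seen++ >= TOPN) {
        break;               // stop consuming the Iterable after TOPN values
    }
    sum += value.get();
}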

Related

How to solve the "too many connections" problem in ZooKeeper when I query many times in the reduce stage?

Sorry for my stupid question and thank you in advance.
I need to replace the output value in the reduce stage (or map stage). However, it causes too many connections in ZooKeeper, and I don't know how to deal with it.
This is my reduce method:
public static class HbaseToHDFSReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        HashSet<String> address = new HashSet<>();
        for (Text item : values) {
            String city = getDataByRowKey("A1", item.toString());
            address.add(city);
        }
        context.write(key, new Text(String.valueOf(address).replace("\"", "")));
    }
}
This is the query method:
public static String getDataByRowKey(String tableName, String rowKey) throws IOException {
    Table table = ConnectionFactory.createConnection(conf).getTable(TableName.valueOf(tableName));
    Get get = new Get(rowKey.getBytes());
    String data = new String();
    if (!get.isCheckExistenceOnly()) {
        Result result = table.get(get);
        for (Cell cell : result.rawCells()) {
            String colName = Bytes.toString(cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength());
            String value = Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength());
            if (colName.equals(rowKey)) {
                data = value;
            }
        }
    }
    table.close();
    return data;
}
What should I do to solve it?
Thank you again
You create one connection per query, and connection creation is a heavy-weight operation. Instead, obtain a single Connection for the reduce task and change
getDataByRowKey(String tableName, String rowKey) to
getDataByRowKey(Connection connection, String tableName, String rowKey), as sketched below.
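A minimal sketch of that change, reusing the classes already shown in the question; the connection is created once per reduce task in setup() and closed in cleanup():

public static class HbaseToHDFSReducer extends Reducer<Text, Text, Text, Text> {
    private Connection connection;   // one heavy-weight connection per task

    @Override
    protected void setup(Context context) throws IOException {
        connection = ConnectionFactory.createConnection(context.getConfiguration());
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        HashSet<String> address = new HashSet<>();
        for (Text item : values) {
            address.add(getDataByRowKey(connection, "A1", item.toString()));
        }
        context.write(key, new Text(String.valueOf(address).replace("\"", "")));
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        connection.close();          // release the connection when the task ends
    }
}

public static String getDataByRowKey(Connection connection, String tableName, String rowKey)
        throws IOException {
    String data = "";
    try (Table table = connection.getTable(TableName.valueOf(tableName))) {
        Result result = table.get(new Get(rowKey.getBytes()));
        for (Cell cell : result.rawCells()) {
            String colName = Bytes.toString(cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength());
            String value = Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength());
            if (colName.equals(rowKey)) {
                data = value;
            }
        }
    }
    return data;
}

This keeps it down to one ZooKeeper/HBase connection per reduce task instead of one per looked-up row.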

Java 8 lambdas: compare two Lists and transform to a Map

Suppose I have two classes:
class Key {
    private Integer id;
    private String key;
}
class Value {
    private Integer id;
    private Integer key_id;
    private String value;
}
Now I fill the first list as follows:
List<Key> keys = new ArrayList<>();
keys.add(new Key(1, "Name"));
keys.add(new Key(2, "Surname"));
keys.add(new Key(3, "Address"));
And the second one:
List<Value> values = new ArrayList<>();
values.add(new Value(1, 1, "Mark"));
values.add(new Value(2, 3, "Fifth Avenue"));
values.add(new Value(3, 2, "Fischer"));
Can you please tell me how I can rewrite the following code:
for (Key k : keys) {
    for (Value v : values) {
        if (k.getId().equals(v.getKey_Id())) {
            map.put(k.getKey(), v.getValue());
            break;
        }
    }
}
Using Lambdas?
Thank you!
------- UPDATE -------
Yes, sure, it works; I forgot "using lambdas" in the first post (now added). I would like to rewrite the two nested for loops with lambdas.
Here is how you would do it using streams:
stream the key list,
stream an index for indexing the value list,
filter matching ids,
package the Key instance's key and the Value instance's value into a SimpleEntry,
then collect those entries into a map.
Map<String, String> results = keys.stream()
        .flatMap(k -> IntStream.range(0, values.size())
                .filter(i -> k.getId().equals(values.get(i).getKey_id()))
                .mapToObj(i -> new AbstractMap.SimpleEntry<>(
                        k.getKey(), values.get(i).getValue())))
        .collect(Collectors.toMap(Entry::getKey, Entry::getValue));
results.entrySet().forEach(System.out::println);
prints
Address=Fifth Avenue
Surname=Fischer
Name=Mark
IMO, your way is much clearer and easier to understand. Streams with lambdas or method references are not always the best approach.
A hybrid approach might also be considered:
allocate a map,
iterate over the keys,
stream the values, trying to find the first match on key_id,
and if a value was found (isPresent), add it to the map.
Map<String, String> map = new HashMap<>();
for (Key k : keys) {
    Optional<Value> opt = values.stream()
            .filter(v -> k.getId().equals(v.getKey_id()))
            .findFirst();
    if (opt.isPresent()) {
        map.put(k.getKey(), opt.get().getValue());
    }
}
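If the value list can get large, another option (not from the original answers, just a sketch reusing the same getters) is to index the values by key_id once and then look each key up directly:

// build the lookup table once: key_id -> value (keeping the first match, like the loop above)
Map<Integer, String> byKeyId = values.stream()
        .collect(Collectors.toMap(Value::getKey_id, Value::getValue, (first, second) -> first));

Map<String, String> map = new HashMap<>();
for (Key k : keys) {
    String value = byKeyId.get(k.getId());
    if (value != null) {
        map.put(k.getKey(), value);
    }
}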

Increment counter inside RecordReader in Hadoop

I have created a custom RecordReader for a mapreduce job
class PatternRecordReader extends RecordReader<LongWritable, Text> {
    @Override
    public boolean nextKeyValue() {
        try {
            final String value = someParsingLogic();
            final boolean hasValue = value != null;
            if (hasValue) {
                someLogic();
            } else {
                // I would like to increment a counter here, something like
                // context.getCounter(Counters.INVALID_INPUT).increment(1);
            }
            return hasValue;
        }
I would like to increment a counter if no value is returned and be able to set it in the task context, so that it would be accessible by the job.
Is there any way to achieve this?
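One possible approach (a sketch, not an answer from the original thread): the new-API RecordReader receives a TaskAttemptContext in initialize(), and in Hadoop 2.x that context exposes getCounter(), so you can keep a reference to it and increment the counter from nextKeyValue(). The someParsingLogic(), someLogic() and Counters.INVALID_INPUT placeholders below are the ones from the question.

class PatternRecordReader extends RecordReader<LongWritable, Text> {
    private TaskAttemptContext context;   // kept from initialize()

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        this.context = context;
        // ... existing initialization ...
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        final String value = someParsingLogic();
        final boolean hasValue = value != null;
        if (hasValue) {
            someLogic();
        } else {
            // shows up in the job's counters once the task reports back
            context.getCounter(Counters.INVALID_INPUT).increment(1);
        }
        return hasValue;
    }

    // getCurrentKey(), getCurrentValue(), getProgress() and close() omitted for brevity
}

The counter should then be readable from the driver after the job finishes, e.g. via job.getCounters().findCounter(Counters.INVALID_INPUT).getValue().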

How to run multiple test cases in JUnit or TestNG with different sets of test data from a CSV file

This scenario has me a bit confused. I want to run a few test cases using JUnit or TestNG with different sets of data from a CSV file. The code snippet I have tried is given below, but it didn't work:
private static CSVReader csvReader = null;

@BeforeClass
public static void setUp() {
    csvReader = new CSVReader(new FileReader(fileName));
}

@Test
public void test1() {
    .......
    System.out.println(csvReader[0]);
}

@Test
public void test2() {
    .......
    System.out.println(csvReader[1]);
}

@Test
public void test3() {
    .......
    System.out.println(csvReader[2]);
}

@Test
public void test4() {
    .......
    System.out.println(csvReader[3]);
}
My problem is that I need to use data from each column in different test cases, and I need to iterate all the test cases again for every row in the CSV file. I have tried Theories and Datapoints, but they work so that the first test case runs with all rows of the CSV file before moving on to the next test case, which then runs with all rows again.
I want test1() to run with the first column of the first row, test2() with the second column of the first row, test3() with the third column of the first row and test4() with the fourth column of the first row, and then the same to be repeated with the second row and so on. Is it possible to iterate the test cases like this? As far as I have searched, a particular test case can be iterated in many ways. My question is: can all the tests in a class be run with one set of data and then rerun with another set of data from the CSV?
Can we accomplish this using JUnit or TestNG? If so, please provide some sample code. Thanks in advance!
Well, there are parameterized tests... You could use them.
@RunWith(Parameterized.class)
public class YourTest {

    @Parameters
    public static Collection<Object[]> data() throws IOException {
        try (FileReader reader = new FileReader(fileName)) {
            CSVReader csvReader = new CSVReader(reader);
            List<CSVRecord> records = ...; // read the rows here
            Object[][] parameters = new Object[records.size()][1];
            for (int i = 0; i < records.size(); i++) {
                parameters[i][0] = records.get(i);
            }
            return Arrays.asList(parameters);
        }
    }

    private CSVRecord record; // [0] from the parameter array goes here

    public YourTest(CSVRecord record) {
        this.record = record;
    }

    @Test
    public void test() {
        // ...do something with the record
    }
}
And the TestNG solution is:
public class YourTest {

    @DataProvider
    public static Object[][] data() throws IOException {
        try (FileReader reader = new FileReader(fileName)) {
            CSVReader csvReader = new CSVReader(reader);
            List<CSVRecord> records = ...; // read the rows here
            Object[][] parameters = new Object[records.size()][1];
            for (int i = 0; i < records.size(); i++) {
                parameters[i][0] = records.get(i);
            }
            return parameters;
        }
    }

    @Test(dataProvider = "data")
    public void test(CSVRecord record) {
        // ...do something with the record
    }
}
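Each parameterized run then receives one row, and the individual checks can pick their column from that row. For example, assuming Apache Commons CSV's CSVRecord (the row type used in the snippets above), columns are available by index:

@Test(dataProvider = "data")
public void test(CSVRecord record) {
    String first  = record.get(0);   // the value test1() would have used
    String second = record.get(1);   // the value test2() would have used
    String third  = record.get(2);
    String fourth = record.get(3);
    // ...assert on each column as needed
}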

HBase MapReduce - How to access individual columns of the table?

I have a table called User with two columns, one called visitorId and the other called friend, which is a list of strings. I want to check whether the VisitorId is in the friendlist. Can anyone direct me as to how to access the table columns in a map function?
I'm not able to picture how data is output from a map function in HBase.
My code is as follows:
public class MapReduce {

    static class Mapper1 extends TableMapper<ImmutableBytesWritable, Text> {
        private int numRecords = 0;
        private static final IntWritable one = new IntWritable(1);
        private final IntWritable ONE = new IntWritable(1);
        private Text text = new Text();

        @Override
        public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException {
            // What should I do here??
            ImmutableBytesWritable userKey = new ImmutableBytesWritable(row.get(), 0, Bytes.SIZEOF_INT);
            try {
                context.write(userKey, one);
                // context.write(text, ONE);
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "CheckVisitor");
        job.setJarByClass(MapReduce.class);
        Scan scan = new Scan();
        Filter f = new RowFilter(CompareOp.EQUAL, new SubstringComparator("mId2"));
        scan.setFilter(f);
        scan.addFamily(Bytes.toBytes("visitor"));
        scan.addFamily(Bytes.toBytes("friend"));
        TableMapReduceUtil.initTableMapperJob("User", scan, Mapper1.class, ImmutableBytesWritable.class, Text.class, job);
    }
}
So the Result values instance contains the full row returned by the scanner.
To get the appropriate columns out of the Result I would do something like:
KeyValue visitorIdVal = values.getColumnLatest(Bytes.toBytes(columnFamily1), Bytes.toBytes("VisitorId"));
KeyValue friendlistVal = values.getColumnLatest(Bytes.toBytes(columnFamily2), Bytes.toBytes("friendlist"));
Here visitorIdVal and friendlistVal are of type KeyValue (http://archive.cloudera.com/cdh/3/hbase/apidocs/org/apache/hadoop/hbase/KeyValue.html); to get their values out you can do Bytes.toString(visitorIdVal.getValue()).
Once you have extracted the values from the columns, you can check whether "VisitorId" appears in "friendlist".
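Putting that together, a sketch of the map() body (not the original answer's code; it assumes the "visitor" and "friend" column families from the question's Scan and the qualifiers named above):

@Override
public void map(ImmutableBytesWritable row, Result values, Context context)
        throws IOException, InterruptedException {
    KeyValue visitorIdVal = values.getColumnLatest(Bytes.toBytes("visitor"), Bytes.toBytes("VisitorId"));
    KeyValue friendlistVal = values.getColumnLatest(Bytes.toBytes("friend"), Bytes.toBytes("friendlist"));
    if (visitorIdVal == null || friendlistVal == null) {
        return;   // this row is missing one of the columns
    }
    String visitorId = Bytes.toString(visitorIdVal.getValue());
    String friendlist = Bytes.toString(friendlistVal.getValue());
    if (friendlist.contains(visitorId)) {
        // emit the row key and the matching visitor id
        context.write(new ImmutableBytesWritable(row.get()), new Text(visitorId));
    }
}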