How to find the id of each map task? - mapreduce

I want to get the id of each mapper and reducer task so that I can tag their output according to the mapper or reducer id. How can I retrieve these ids?
Thanks

You can print the task ID in the map(), setup(), etc. methods by using the following code:
context.getTaskAttemptID().getTaskID();

You can do it using org.apache.hadoop.mapreduce.MapContext::getTaskAttemptID().
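For example, a minimal mapper sketch that tags each output record with its task id (the class name and output layout here are just illustrative, not from the question) could look like this:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper that prefixes every output record with its task id.
public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String taskId;

    @Override
    protected void setup(Context context) {
        // e.g. "task_201301010000_0001_m_000003"
        taskId = context.getTaskAttemptID().getTaskID().toString();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the task id as the key so output can be grouped per task.
        context.write(new Text(taskId), value);
    }
}

The same call works in a reducer's setup() or reduce(), since both contexts extend TaskAttemptContext.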

Related

How can an AWS DynamoDB query result be a list of items while the partition key is a unique value?

I'm new to AWS DynamoDB and wanted to clarify something.
As I learned, when we use query we should use the partition key, which I thought was unique among the items. So why is the result of a query a list? Shouldn't it be a single row? And why would we even need to add more conditions?
I think I am missing something here; can someone please help me understand?
I need this because I want to query a list of applications with a specific status value, or within a specific time range, but if I am supposed to provide the appId, what is the point of the extra condition?
Thank you in advance.
Often your table will have a sort key, which together with your partition key forms a composite primary key. In this scenario a query can return multiple items: every item that shares the same partition key. To return only one item, not a list, you can use get_item instead, provided you know the full value of the composite primary key.
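To make the difference concrete, here is a hedged sketch using the AWS SDK for Java v2; the table name applications and the key attributes appId and createdAt are hypothetical stand-ins:

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;

DynamoDbClient dynamo = DynamoDbClient.create();

// Query: all items sharing one partition key, narrowed by the sort key.
QueryRequest query = QueryRequest.builder()
        .tableName("applications")   // hypothetical table
        .keyConditionExpression("appId = :id AND createdAt BETWEEN :from AND :to")
        .expressionAttributeValues(Map.of(
                ":id",   AttributeValue.builder().s("app-123").build(),
                ":from", AttributeValue.builder().s("2020-01-01").build(),
                ":to",   AttributeValue.builder().s("2020-12-31").build()))
        .build();
dynamo.query(query).items();   // a list: every item under that appId

// GetItem: exactly one item, addressed by the full composite key.
GetItemRequest get = GetItemRequest.builder()
        .tableName("applications")
        .key(Map.of(
                "appId",     AttributeValue.builder().s("app-123").build(),
                "createdAt", AttributeValue.builder().s("2020-01-01").build()))
        .build();
dynamo.getItem(get).item();    // a single item (or empty if absent)

To filter on a non-key attribute such as a status field, you would add a filter expression to the query, or query a secondary index keyed on that attribute.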

Group by an existing attribute present in a JSON string line in Apache Beam Java

I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps, and I have to pick the latest one for each customer. I am planning to achieve this as below:
Read files
Group by customer id
Apply DoFn to compare timestamp of records in each group and have only latest one from them
Flatten it, convert to table rows, and insert into BQ.
But I am unable to proceed with the grouping step: I see GroupByKey.create() but am unable to make it use the customer id as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you GroupByKey, you need to have your dataset in key-value pairs. It would help if you had shown some of your code, but without knowing more, you'd do the following:
PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...);
// Once we have the data in JsonObjects, we key by customer ID.
// Note: with a lambda, MapElements needs .into(...) to supply the output type.
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(),
                                      TypeDescriptor.of(JsonObject.class)))
            .via(elm -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check timestamps and discard all but the most recent, as you were planning.
Note that you will need to set coders, etc. If you get stuck with that, we can iterate.
As a hint/tip, you can consider this example of a Json Coder.
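For the "latest record" step, a minimal sketch under the same assumptions (and assuming each JsonObject carries a numeric field, hypothetically named timestamp) could be:

// Keep only the newest record per customer, comparing a "timestamp" field.
PCollection<KV<String, JsonObject>> latestPerCustomer =
    groupedData.apply(MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(),
                                  TypeDescriptor.of(JsonObject.class)))
        .via(kv -> {
            JsonObject newest = null;
            for (JsonObject record : kv.getValue()) {
                if (newest == null
                        || record.getJsonNumber("timestamp").longValue()
                           > newest.getJsonNumber("timestamp").longValue()) {
                    newest = record;
                }
            }
            return KV.of(kv.getKey(), newest);
        }));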

In Camunda, which of the following two options is the more efficient way of coding?

Retrieving user tasks first and then, from each task, retrieving the processInstanceId.
First retrieving process instances and then iterating over them in a loop to find the user tasks in each process instance.
Please give a reason too.
If you want to build a task list you should use the task queries, as you can apply the right queries and filters there easily. The process instance ids are part of the result.
But it would indeed be interesting to understand the use case: what exactly do you need the process instance id for?
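As a sketch of option 1 (the active() filter is just an example; any task query filter works the same way):

import java.util.List;
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.task.Task;

// One task query returns the tasks together with their process instance ids,
// so no per-instance loop is needed.
public static void printTasksWithInstanceIds(ProcessEngine processEngine) {
    List<Task> tasks = processEngine.getTaskService()
            .createTaskQuery()
            .active()   // example filter; add assignee, candidate group, etc. as needed
            .list();
    for (Task task : tasks) {
        System.out.println(task.getId() + " -> " + task.getProcessInstanceId());
    }
}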

Write to multiple tables in HBase

I have a situation where I need to write to two HBase tables, say table1 and table2. Whenever a write happens on table1, I need to do some operation on table2, say increment a counter in table2 (like a trigger). For this purpose I need to access (write to) two tables in the same task of a map-reduce program. I heard that it can be done using MultiTableOutputFormat, but I could not find any good example explaining it in detail. Could someone please tell me whether it is possible, and if so, how I can/should do it? Thanks in advance.
Please give me an answer that does not involve coprocessors.
To write to more than one table in a map-reduce job, you have to specify that in the job configuration. You are right that this can be done using MultiTableOutputFormat.
Normally, for a single table, you would use:
TableMapReduceUtil.initTableReducerJob("tableName", MyReducer.class, job);
Instead of this, write:
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setNumReduceTasks(2);
TableMapReduceUtil.addDependencyJars(job);
TableMapReduceUtil.addDependencyJars(job.getConfiguration());
Now, at the time of writing data to a table, write:
context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName1")),put1);
context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName2")),put2);
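Tying these snippets together, here is a hedged reducer sketch; the table names, the column family cf, and the qualifiers are hypothetical, and on HBase versions before 1.0 you would use Put.add(...) instead of addColumn(...). Note that MultiTableOutputFormat accepts Puts and Deletes, so the "counter" below is a plain write; a true atomic increment would need a direct table connection instead:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, ImmutableBytesWritable, Mutation> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // The record itself goes to table1.
            Put put1 = new Put(Bytes.toBytes(key.toString()));
            put1.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("data"),
                    Bytes.toBytes(value.toString()));
            context.write(new ImmutableBytesWritable(Bytes.toBytes("table1")), put1);

            // A companion write goes to table2 for the same row key.
            Put put2 = new Put(Bytes.toBytes(key.toString()));
            put2.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("seen"),
                    Bytes.toBytes(1L));
            context.write(new ImmutableBytesWritable(Bytes.toBytes("table2")), put2);
        }
    }
}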
For this you can use an HBase Observer: you create an observer and deploy it on your server (applicable only for HBase version 0.92 and later), and it will automatically trigger the write to the other table.
HBase Observers are conceptually similar to aspects in aspect-oriented programming.
For more details -
https://blogs.apache.org/hbase/entry/coprocessor_introduction

Partitioned Data Map/Reduce

I have written a custom partitioner for partitioning datasets. I want to partition two datasets using the same partitioner and then, in the next MapReduce job, I want each mapper to handle the same partition from the two sources and perform some function on it, such as a join. How can I ensure that one mapper gets the splits that correspond to the same partition from both sources?
Any help would be highly appreciated.
What you are describing is one variation of a map-side join. Chapter 8 of Pro Hadoop covers this, as does the org.apache.hadoop.mapred.join package.
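As a hedged configuration sketch using the older mapred join package (the paths are placeholders; both inputs are assumed to have been written by your custom partitioner with the same number of partitions and to be sorted by key within each partition, which is what CompositeInputFormat requires):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

JobConf conf = new JobConf();
// Both inputs must be partitioned identically (same partitioner, same
// number of partitions) and sorted by key for the splits to line up.
conf.setInputFormat(CompositeInputFormat.class);
conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner",                      // join type: inner, outer, override, ...
        SequenceFileInputFormat.class,
        new Path("/data/sourceA"),    // hypothetical input paths
        new Path("/data/sourceB")));
// Each map task then receives a TupleWritable combining the matching
// records from the same partition of both sources.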