Partitioned Data Map/Reduce - mapreduce

I have written my custom partitioner for partitioning datasets. I want to partition two datasets using the same partitioner and then, in the next MapReduce job, I want each mapper to handle the same partition from the two sources and perform some function such as a join. How can I ensure that one mapper gets the splits that correspond to the same partition from both sources?
Any help would be highly appreciated.

What you are describing is one variation of a map-side join. See Chapter 8 of Pro Hadoop or the org.apache.hadoop.mapred.join package.
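For reference, here is a minimal sketch using the older org.apache.hadoop.mapred.join classes. It assumes both datasets were written by jobs that used your partitioner with the same number of reducers (so partition i of each source lines up), are sorted by key, and are readable with KeyValueTextInputFormat; the input and output paths are placeholders:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class MapSideJoinSketch {

  // Each map task receives one key plus a TupleWritable holding the matching
  // values from sourceA and sourceB.
  public static class JoinMapper extends MapReduceBase
      implements Mapper<Text, TupleWritable, Text, Text> {
    public void map(Text key, TupleWritable values,
                    OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      // values.get(0) comes from sourceA, values.get(1) from sourceB.
      out.collect(key, new Text(values.get(0) + "\t" + values.get(1)));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapSideJoinSketch.class);

    // Compose an inner join over the two identically partitioned, sorted inputs.
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", KeyValueTextInputFormat.class,
        new Path("/data/sourceA"), new Path("/data/sourceB")));   // placeholder paths
    conf.setInputFormat(CompositeInputFormat.class);

    conf.setMapperClass(JoinMapper.class);
    conf.setNumReduceTasks(0);                // map-only: the join happens in the mappers
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(conf, new Path("/data/joined"));  // placeholder output

    JobClient.runJob(conf);
  }
}

CompositeInputFormat pairs up split i of each source, so every map task sees exactly the matching partitions from both inputs.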

BigQuery tabledata:list output into a BigQuery table

I know there is a way to place the results of a query into a table; there is a way to copy a whole table into another table; and there is a way to list a table piecemeal (tabledata:list using startIndex, maxResults and pageToken).
However, what I want to do is go over an existing table with tabledata:list and output the results piecemeal into other tables. I want to use this as an efficient way to shard a table.
I cannot find a reference to such functionality, or any workaround for that matter.
It is important to realize that the Tabledata.list API is not part of BigQuery SQL, but rather part of the BigQuery REST API, which you can use from the client of your choice.
That said, the logic you outlined in your question can be implemented in many ways; here are the high-level steps of one approach:
Call Tabledata.list in a loop, using the returned pageToken for the next iteration or to decide when to exit the loop.
In each iteration, process the response from Tabledata.list, extract the row data, and stream it into the destination table with the Tabledata.insertAll API. You can also use an inner loop over the rows extracted in that iteration to decide which row goes to which table/shard.
This is very generic logic, and the particular implementation depends on the client you use.
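A rough sketch of those steps with the Java API client (google-api-services-bigquery), assuming an already-authorized Bigquery instance; the project, dataset and table names, the page size, and the column names (col1, col2) are placeholders for your own schema:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.TableCell;
import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
import com.google.api.services.bigquery.model.TableDataList;
import com.google.api.services.bigquery.model.TableRow;

public class TableShardSketch {

  static void shard(Bigquery bigquery, String project, String dataset) throws Exception {
    String pageToken = null;
    do {
      // Step 1: read one page of the source table.
      TableDataList page = bigquery.tabledata()
          .list(project, dataset, "source_table")        // placeholder source table
          .setMaxResults(500L)
          .setPageToken(pageToken)
          .execute();

      if (page.getRows() != null) {
        // Step 2: rebuild named columns from the positional cells (row.getF()).
        List<TableDataInsertAllRequest.Rows> out = new ArrayList<TableDataInsertAllRequest.Rows>();
        for (TableRow row : page.getRows()) {
          List<TableCell> cells = row.getF();
          Map<String, Object> json = new HashMap<String, Object>();
          json.put("col1", cells.get(0).getV());
          json.put("col2", cells.get(1).getV());
          out.add(new TableDataInsertAllRequest.Rows().setJson(json));
        }
        // Step 3: stream the page into the destination table. An inner check here
        // could route individual rows to different shards instead.
        bigquery.tabledata()
            .insertAll(project, dataset, "shard_table",   // placeholder destination
                new TableDataInsertAllRequest().setRows(out))
            .execute();
      }
      pageToken = page.getPageToken();
    } while (pageToken != null);
  }
}

Note that tabledata.list returns cells positionally, so you need to know the table schema to rebuild named columns for insertAll.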
Hope this helps
For what you describe, I'd suggest you use the batch version of Cloud Dataflow:
https://cloud.google.com/dataflow/
Dataflow already supports BigQuery tables as sources and sinks, and will keep all data within Google's network. This approach also scales to arbitrarily large tables.
TableData.list-ing your entire table might work fine for small tables, but network overhead aside, it is definitely not recommended for anything of moderate size.
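A minimal sketch of that approach with the (pre-Beam) Dataflow Java SDK 1.x; the table specs are placeholders and the destination table is assumed to already exist. BigQueryIO does the reading and writing, so there is no manual paging:

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class CopyTableSketch {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory
        .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);

    Pipeline p = Pipeline.create(options);

    // Read every row of the source table and append it to the destination table.
    // A ParDo between the read and the write is where per-row shard routing would go.
    p.apply(BigQueryIO.Read.from("my-project:my_dataset.source_table"))        // placeholder
     .apply(BigQueryIO.Write.to("my-project:my_dataset.destination_table")     // placeholder
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}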

HBase Update/Insert using Get/Put

Could anyone advise on the best way to approach my requirement?
I have the following:
An HBase table
An input file in HDFS
My requirement is as follows:
Read the input file and fetch the key. Using the key, get the data from HBase.
Do a comparison to check.
If the comparison fails, insert.
If the comparison succeeds, update.
I know I can use Get to fetch the data and Put to write it back. Is this the best way to go forward? I plan to use MapReduce so that the process runs in parallel.
HBase has checkAndPut() and checkAndDelete() operations, which allow you to perform a Put or a Delete only if the current value matches the value you expect (compare=NO_OP if you don't care about the value, just about the key).
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
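A minimal sketch of checkAndPut() with the classic HTable client; the table, column family, qualifier and values are placeholders, and error handling is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");           // placeholder table name

    byte[] row = Bytes.toBytes("key-from-input-file");    // placeholder row key
    byte[] cf  = Bytes.toBytes("cf");
    byte[] col = Bytes.toBytes("col");

    Put put = new Put(row);
    put.add(cf, col, Bytes.toBytes("new-value"));

    // Apply the Put only if the current value of cf:col equals "old-value".
    // Passing null as the expected value applies the Put only if the cell does
    // not exist yet (an insert-if-absent).
    boolean applied = table.checkAndPut(row, cf, col, Bytes.toBytes("old-value"), put);
    if (!applied) {
      // The check failed on the server: read the current value and decide what to do.
      Result current = table.get(new Get(row));
      System.out.println("Current value: " + Bytes.toString(current.getValue(cf, col)));
    }
    table.close();
  }
}

checkAndPut() does the compare-and-write on the region server in one atomic step, so you avoid a separate Get, client-side compare, and Put.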
Depending on the size of your problem, I would actually recommend a slightly different approach here. While it probably is feasible to implement HBase puts inside a MapReduce job, it sounds like a rather complex task.
I'd recommend loading the data from HBase into MapReduce, joining the two datasets, and then exporting the result back into HBase.
Using Pig this would be rather easy to achieve.
Take a look at Pig HBaseStorage.
Going this route, you'd load both inputs, join them, and then write the result back to HBase. If all there is to it is comparing keys, this can be achieved in 5 lines of Pig Latin.
HTH

Approach to map the clustering output with existing data indexes

I am a newbie to Mahout, working on a data-mining clustering use case using K-Means. I need help understanding how to map the original data to the clustered output to gain more insight. Let's say:
After data preparation we have a summarized data set with the following attributes:
Key1,Key2,Dimension1,Dimension2,Measure1,Measure2,Measure3
Now I have executed the clustering algorithm on the following attributes:
Measure1,Measure2,Measure3
The output of the clustering is a cluster ID with its data (Measure1, Measure2, Measure3).
Question:
How can I perform clustering on specific attributes of the dataset while having the clustered output retain all attributes?
Please help me with the right approach.

Write to multiple tables in HBase

I have a situation where I need to write to two HBase tables, say table1 and table2. Whenever a write happens on table1, I need to do some operation on table2, say increment a counter in table2 (like a trigger). For this purpose I need to write to two tables in the same task of a MapReduce program. I heard this can be done using MultiTableOutputFormat, but I could not find any good example explaining it in detail. Could someone please tell me whether it is possible, and if so, how I can/should do it? Thanks in advance.
Please provide an answer that does not involve coprocessors.
To write to more than one table in a MapReduce job, you have to specify that in the job configuration. You are right that this can be done using MultiTableOutputFormat.
Normally, for a single table, you would use something like:
TableMapReduceUtil.initTableReducerJob("tableName", MyReducer.class, job);
Instead of that, write:
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setNumReduceTasks(2);
TableMapReduceUtil.addDependencyJars(job);
TableMapReduceUtil.addDependencyJars(job.getConfiguration());
Then, when writing data out (for example in your reducer), specify the destination table like this:
context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName1")),put1);
context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName2")),put2);
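A minimal sketch of a reducer built around those two writes, assuming the job was configured as above; the table names, column family, qualifiers, and the Text key/value types are placeholders:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, ImmutableBytesWritable, Put> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // Write the actual record to the first table.
      Put put1 = new Put(Bytes.toBytes(key.toString()));
      put1.add(Bytes.toBytes("cf"), Bytes.toBytes("data"), Bytes.toBytes(value.toString()));
      context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName1")), put1);

      // Write the related bookkeeping row to the second table.
      Put put2 = new Put(Bytes.toBytes(key.toString()));
      put2.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(1L));
      context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName2")), put2);
    }
  }
}

MultiTableOutputFormat uses the ImmutableBytesWritable key of each write to pick the destination table for that Put.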
For this you can also use an HBase Observer coprocessor: you create an observer and deploy it on your server (available from HBase 0.92 onward), and it will automatically trigger the write to the other table.
I think an HBase Observer is conceptually similar to an aspect in aspect-oriented programming.
For more details -
https://blogs.apache.org/hbase/entry/coprocessor_introduction
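A minimal sketch of such an observer, assuming the 0.92/0.94-era coprocessor API (the postPut signature changed in later releases); the table, family and qualifier names are placeholders:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterObserver extends BaseRegionObserver {
  // Called after every successful Put on the table this observer is attached to (table1).
  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> e,
                      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    HTableInterface table2 = e.getEnvironment().getTable(Bytes.toBytes("table2"));
    try {
      // Bump a counter in table2, keyed by the row that was just written to table1.
      table2.incrementColumnValue(put.getRow(), Bytes.toBytes("cf"),
          Bytes.toBytes("counter"), 1L);
    } finally {
      table2.close();
    }
  }
}

The observer class is registered on table1 (through the table descriptor or hbase.coprocessor.region.classes), so every Put on table1 triggers the increment on table2.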

Efficiently processing all data in a Cassandra Column Family with a MapReduce job

I want to process all of the data in a column family in a MapReduce job. Ordering is not important.
One approach is to iterate over all the row keys of the column family and use them as the input. This could potentially be a bottleneck and could be replaced with a parallel method.
I'm open to other suggestions, or for someone to tell me I'm wasting my time with this idea. I'm currently investigating the following:
A potentially more efficient way is to assign ranges to the input instead of iterating over all row keys (before the mapper starts). Since I am using RandomPartitioner, is there a way to specify a range to query based on the MD5?
For example, I want to split the task into 16 jobs. Since the RandomPartitioner is MD5-based (from what I have read), I'd like to query everything starting with a for the first range. In other words, how would I do a get_range on the MD5 that starts with a and ends before b, e.g. a0000000000000000000000000000000 - afffffffffffffffffffffffffffffff?
I'm using the pycassa API (Python) but I'm happy to see Java examples.
I'd cheat a little:
Create new rows job_(n), with each column name representing a row key in the range you want.
Pull all columns from that specific row to determine which rows you should pull from the CF.
I do this with users: users from a particular country get a column in the country-specific row, and users of a particular age are also added to a specific row.
This allows me to quickly pull the rows I need based on the criteria I want, and it is a little more efficient than pulling everything.
This is how the Mahout CassandraDataModel example functions:
https://github.com/apache/mahout/blob/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/cassandra/CassandraDataModel.java
Once you have the data and can pull the rows you are interested in, you can hand it off to your MR job(s).
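A rough sketch of reading one of those index rows from Java with the raw Thrift client (pycassa's ColumnFamily.get() issues essentially the same get_slice under the hood); the keyspace, column family and row names are placeholders:

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class IndexRowReader {
  public static void main(String[] args) throws Exception {
    TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
    Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
    transport.open();
    client.set_keyspace("MyKeyspace");                          // placeholder keyspace

    // Every column name in the row "job_1" is a row key this job should process.
    SlicePredicate predicate = new SlicePredicate();
    predicate.setSlice_range(new SliceRange(
        ByteBuffer.wrap(new byte[0]), ByteBuffer.wrap(new byte[0]), false, 10000));

    List<ColumnOrSuperColumn> columns = client.get_slice(
        ByteBuffer.wrap("job_1".getBytes("UTF-8")),             // placeholder index row
        new ColumnParent("JobIndex"),                           // placeholder column family
        predicate, ConsistencyLevel.ONE);

    for (ColumnOrSuperColumn cosc : columns) {
      byte[] targetRowKey = cosc.getColumn().getName();
      // Hand targetRowKey to the map task responsible for this slice of the work.
      System.out.println(new String(targetRowKey, "UTF-8"));
    }
    transport.close();
  }
}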
Alternatively, if speed isn't an issue, look into using Pig: How to use Cassandra's Map Reduce with or w/o Pig?