HBase MapReduce on multiple Scan objects - mapreduce

I am just trying to evaluate HBase for some of the data analysis we are doing.
HBase would contain our event data. The key would be eventId + time. We want to run analysis on a few event types (4-5) within a date range. The total number of event types is around 1000.
The problem with running a MapReduce job on the HBase table is that initTableMapperJob (see below) takes only one Scan object. For performance reasons we want to scan the data for only those 4-5 event types in a given date range, not all 1000 event types. If we use the method below then I guess we don't have that choice, because it takes only one Scan object.
public static void initTableMapperJob(String table,
        Scan scan,
        Class mapper,
        Class outputKeyClass,
        Class outputValueClass,
        org.apache.hadoop.mapreduce.Job job)
    throws IOException
Is it possible to run MapReduce on a list of Scan objects? Is there any workaround?
Thanks

TableMapReduceUtil.initTableMapperJob configures your job to use TableInputFormat which, as you note, takes a single Scan.
It sounds like you want to scan multiple segments of a table. To do so, you'll have to create your own InputFormat, something like a MultiSegmentTableInputFormat. Extend TableInputFormatBase and override the getSplits method so that it calls super.getSplits once for each start/stop row segment of the table (the easiest way is to set the start and stop rows on the underlying Scan before each call). Aggregate the returned InputSplit instances into a single list.
Then configure the job yourself to use your custom MultiSegmentTableInputFormat.
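To make the per-segment idea concrete, here is a plain-Java sketch (the class name, key layout, and date format are all illustrative, assuming row keys of the form eventId + time) of computing the one (startRow, stopRow) pair per event type that each super.getSplits call would cover:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one (startRow, stopRow) pair per event type,
// bounding each segment to the requested time range.
public class SegmentRanges {
    public static class Range {
        public final byte[] start, stop;
        public Range(byte[] start, byte[] stop) { this.start = start; this.stop = stop; }
    }

    public static List<Range> build(List<String> eventIds, String fromTime, String toTime) {
        List<Range> ranges = new ArrayList<>();
        for (String eventId : eventIds) {
            // In an HBase Scan the start row is inclusive and the stop row exclusive.
            byte[] start = (eventId + fromTime).getBytes(StandardCharsets.UTF_8);
            byte[] stop  = (eventId + toTime).getBytes(StandardCharsets.UTF_8);
            ranges.add(new Range(start, stop));
        }
        return ranges;
    }
}
```

A MultiSegmentTableInputFormat would loop over these ranges inside getSplits, set the start/stop row on its Scan before delegating to super.getSplits, and concatenate the resulting splits.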

You are looking for the class org.apache.hadoop.hbase.filter.FilterList.
Each scan can take a filter. A filter can be quite complex. The FilterList allows you to specify multiple single filters and then do an AND or an OR between all of the component filters. You can use this to build up an arbitrary boolean query over the rows.
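As a sketch of what such a composition could look like for the multiple-event-type question above (the event ids are made up, and this assumes eventId is a row-key prefix):

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanWithFilterList {
    // OR together one PrefixFilter per event type, so a single Scan
    // only returns rows for the handful of event ids of interest.
    public static Scan buildScan() {
        FilterList events = new FilterList(FilterList.Operator.MUST_PASS_ONE);
        events.addFilter(new PrefixFilter(Bytes.toBytes("event42")));
        events.addFilter(new PrefixFilter(Bytes.toBytes("event99")));

        Scan scan = new Scan();
        scan.setFilter(events);
        return scan;
    }
}
```

Keep in mind that filters are evaluated server side but the region servers still read every row in the scanned range, so pairing a filter with sensible start/stop rows is still worthwhile.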

I've tried Dave L's approach and it works beautifully.
To configure the map job, you can use the function
TableMapReduceUtil.initTableMapperJob(byte[] table, Scan scan,
Class<? extends TableMapper> mapper,
Class<? extends WritableComparable> outputKeyClass,
Class<? extends Writable> outputValueClass, Job job,
boolean addDependencyJars, Class<? extends InputFormat> inputFormatClass)
where inputFormatClass refers to the MultiSegmentTableInputFormat mentioned in Dave L's answer.

Related

BigQuery tabledata:list output into a bigquery table

I know there is a way to place the results of a query into a table; there is a way to copy a whole table into another table; and there is a way to list a table piecemeal (tabledata:list using startIndex, maxResults and pageToken).
However, what I want to do is go over an existing table with tabledata:list and output the results piecemeal into other tables. I want to use this as an efficient way to shard a table.
I cannot find a reference to such a functionality, or any workaround to it for that matter.
Important to realize: the Tabledata.list API is not part of BQL (BigQuery SQL) but rather the BigQuery API, which you can use from a client of your choice.
That said, the logic you outlined in your question can be implemented in many ways, below is an example (high level steps):
Call Tabledata.list in a loop, using pageToken for the next iteration or to decide when to exit the loop.
In each iteration, process the response from Tabledata.list, extract the actual data and insert it into the destination table by streaming it with the Tabledata.insertAll API. You can also have an inner loop that goes through the rows extracted in a given iteration and decides which row goes to which table/shard.
This is very generic logic and the particular implementation depends on the client you use.
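Whatever client you pick, the two steps above reduce to a standard pageToken loop. Here is an illustrative sketch in which fetchPage stands in for a Tabledata.list call (the names are made up), and each page's rows would be handed to insertAll:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Generic pageToken-driven pagination: keep calling until no token comes back.
public class PageLoop {
    public static class Page {
        public final List<String> rows;
        public final String nextPageToken; // null on the last page
        public Page(List<String> rows, String nextPageToken) {
            this.rows = rows;
            this.nextPageToken = nextPageToken;
        }
    }

    // fetchPage stands in for Tabledata.list(token); a real implementation
    // would stream each page's rows to the destination shard via insertAll
    // instead of accumulating them.
    public static List<String> drain(Function<String, Page> fetchPage) {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
            Page page = fetchPage.apply(token);
            all.addAll(page.rows);
            token = page.nextPageToken;
        } while (token != null);
        return all;
    }
}
```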
Hope this helps
For what you describe, I'd suggest you use the batch version of Cloud Dataflow:
https://cloud.google.com/dataflow/
Dataflow already supports BigQuery tables as sources and sinks, and will keep all data within Google's network. This approach also scales to arbitrarily large tables.
TableData.list-ing your entire table might work fine for small tables, but network overhead aside, it is definitely not recommended for anything of moderate size.

Can a map side join have reducers?

I want to write a map-side join and want to include a reducer code as well. I have a smaller data set which I will send as distributed cache.
Can I write the map-side join with reducer code?
Yes!! Why not? Look, a reducer is meant for aggregation of the key-value pairs emitted from the map. So you can always have a reducer in your code whenever you want to aggregate your result (say you want to count, find an average, or do any numerical summarization) based on certain criteria you've set in your code or in accordance with the problem statement. The map is just for filtering the data and emitting some useful key-value pairs out of a LOT of data. A map-side join is only needed when one of the datasets is small enough to fit in the memory of a commodity machine. By the way, a reduce-side join serves your purpose too!!
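Stripped of Hadoop, the combination looks like this (all names are made up for illustration): the "map side" joins each record against the small in-memory dataset from the distributed cache, and the "reduce side" then aggregates the joined pairs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinThenReduce {
    // "Map side": enrich each (userId, amount) record with the user's country
    // from the small cached dataset, emitting (country, amount) pairs.
    public static List<Map.Entry<String, Integer>> mapSideJoin(
            Map<String, String> userToCountry, List<Map.Entry<String, Integer>> records) {
        List<Map.Entry<String, Integer>> joined = new ArrayList<>();
        for (Map.Entry<String, Integer> rec : records) {
            String country = userToCountry.get(rec.getKey());
            if (country != null) { // inner join: drop records with no match
                joined.add(Map.entry(country, rec.getValue()));
            }
        }
        return joined;
    }

    // "Reduce side": sum the values per key, as a reducer would.
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }
}
```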

Amazon DynamoDB Mapper - limits to batch operations

I am trying to write a huge number of records into DynamoDB and I would like to know the correct way of doing that. Currently, I am using the DynamoDBMapper to do the job in one batchWrite operation, but after reading the documentation I am not sure if this is the correct way (especially regarding the limits on the size and number of written items).
Let's say, that I have an ArrayList with 10000 records and I am saving it like this:
mapper.batchWrite(recordsToSave, new ArrayList<BillingRecord>());
The first argument is the list with records to be written and the second one contains items to be deleted (no such items in this case).
Does the mapper split this write into multiple writes and handle the limits or should it be handled explicitly?
I have only found examples with batchWrite done with the AmazonDynamoDB client directly (like THIS one). Is using the client directly for the batch operations the correct way? If so, what is the point of having a mapper?
Does the mapper split your list of objects into multiple batches and then write each batch separately? Yes, it does the batching for you: if you look at the DynamoDBMapper source, you can see that it splits the items to be written into batches of up to 25 items. It then tries writing each batch, and some of the items in each batch can fail. An example of a failure is given in the mapper documentation:
This method fails to save the batch if the size of an individual object in the batch exceeds 400 KB. For more information on batch restrictions see, http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
The example is talking about the size of one record (one BillingRecord instance in your case) exceeding 400 KB, which, at the time of writing this answer, is the maximum size of a record in DynamoDB.
If a particular batch fails, the mapper moves on to the next batch (sleeping the thread for a bit in case the failure was due to throttling). In the end, all of the failed batches are returned in a List of FailedBatch instances. Each FailedBatch instance contains the unprocessed items that weren't written to DynamoDB.
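The splitting behaviour is easy to picture; ignoring retries and failure handling, the mapper effectively does the following (a sketch, using the 25-item BatchWriteItem limit):

```java
import java.util.ArrayList;
import java.util.List;

public class Batching {
    static final int MAX_BATCH = 25; // BatchWriteItem per-request item limit

    // Split a large list of records into DynamoDB-sized batches,
    // as DynamoDBMapper.batchWrite does internally before calling the API.
    public static <T> List<List<T>> chunks(List<T> items) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += MAX_BATCH) {
            batches.add(new ArrayList<>(items.subList(i, Math.min(i + MAX_BATCH, items.size()))));
        }
        return batches;
    }
}
```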
Is the snippet that you provided the correct way of doing batch writes? I can think of two suggestions. The batchSave method is more appropriate if you have no items to delete. You might also want to think about what to do with the failed batches.
Is using the client directly the correct way? If so, what is the point of the mapper? The mapper is simply a wrapper around the low-level client. It provides an ORM layer that converts your BillingRecord instances into the nested hash maps that the low-level client works with. There is nothing wrong with using the client directly, and this does tend to happen in special cases where functionality not exposed by the mapper has to be coded against the client.

Write to multiple tables in HBASE

I have a situation where I need to write to two HBase tables, say table1 and table2. Whenever a write happens on table1, I need to do some operation on table2, say increment a counter in table2 (like a trigger). For this purpose I need to access (write to) two tables in the same task of a map-reduce program. I heard that this can be done using MultiTableOutputFormat, but I could not find any good example explaining it in detail. Could someone please tell me whether it is possible, and if so, how I can/should do it? Thanks in advance.
Please provide me an answer that should not include co-processors.
To write into more than one table in map-reduce job, you have to specify that in job configuration. You are right this can be done using MultiTableOutputFormat.
Normally, for a single table you would use something like:
TableMapReduceUtil.initTableReducerJob("tableName", MyReducer.class, job);
Instead of this write:
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setNumReduceTasks(2);
TableMapReduceUtil.addDependencyJars(job);
TableMapReduceUtil.addDependencyJars(job.getConfiguration());
Then, when writing data in the reducer, write to each table like this:
context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName1")),put1);
context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName2")),put2);
Alternatively, you can use an HBase Observer coprocessor. You have to create an observer and deploy it on your server (applicable only for HBase versions > 0.92); it will then automatically trigger the write to the other table.
HBase Observers are conceptually similar to aspects in aspect-oriented programming.
For more details -
https://blogs.apache.org/hbase/entry/coprocessor_introduction

Efficiently processing all data in a Cassandra Column Family with a MapReduce job

I want to process all of the data in a column family in a MapReduce job. Ordering is not important.
One approach is to iterate over all the row keys of the column family to use as the input. This could potentially be a bottleneck and could be replaced with a parallel method.
I'm open to other suggestions, or for someone to tell me I'm wasting my time with this idea. I'm currently investigating the following:
A potentially more efficient way is to assign ranges to the input instead of iterating over all row keys (before the mapper starts). Since I am using RandomPartitioner, is there a way to specify a range to query based on the MD5?
For example, I want to split the task into 16 jobs. Since the RandomPartitioner is MD5-based (from what I have read), I'd like to query everything starting with a for the first range. In other words, how would I do a get_range on the MD5 that starts at a and ends before b, e.g. a0000000000000000000000000000000 - afffffffffffffffffffffffffffffff?
I'm using the pycassa API (Python) but I'm happy to see Java examples.
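For what it's worth, generating the 16 hex-prefix ranges described above is mechanical. A sketch in Java (the class name is made up, and whether these values can be fed to get_range depends on using the token-based range variant with RandomPartitioner):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenRanges {
    // Split the 128-bit MD5 token space into 16 ranges by first hex digit:
    // [000...0, 0ff...f], [100...0, 1ff...f], ..., [f00...0, fff...f]
    public static List<String[]> sixteenRanges() {
        List<String[]> ranges = new ArrayList<>();
        for (int i = 0; i < 16; i++) {
            char digit = Character.forDigit(i, 16);
            String start = digit + "0".repeat(31); // 32 hex chars total
            String end   = digit + "f".repeat(31);
            ranges.add(new String[] { start, end });
        }
        return ranges;
    }
}
```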
I'd cheat a little:
Create new rows job_(n), with each column representing a row key in the range you want.
Pull all columns from that specific row to find out which rows you should pull from the CF.
I do this with users. Users from a particular country get a column in the country-specific row. Users of a particular age are also added to a specific row.
This allows me to quickly pull the rows I need based on the criteria I want, and is a little more efficient compared to pulling everything.
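Modelled with plain maps (names illustrative), the index-row trick looks like this: one "index row" per criterion, whose column names are the keys of the real rows to fetch:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class IndexRows {
    // Maps an index row key (e.g. "country:US") to the set of user row keys
    // that would be stored as that row's column names in Cassandra.
    private final Map<String, Set<String>> indexRows = new HashMap<>();

    public void addUser(String userKey, String criterion) {
        indexRows.computeIfAbsent(criterion, k -> new TreeSet<>()).add(userKey);
    }

    // One cheap read of the index row yields exactly the rows to pull
    // from the real column family, instead of scanning everything.
    public Set<String> rowsFor(String criterion) {
        return indexRows.getOrDefault(criterion, Collections.emptySet());
    }
}
```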
This is how the Mahout CassandraDataModel example functions:
https://github.com/apache/mahout/blob/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/cassandra/CassandraDataModel.java
Once you have the data and can pull the rows you are interested in, you can hand it off to your MR job(s).
Alternately, if speed isn't an issue, look into using PIG: How to use Cassandra's Map Reduce with or w/o Pig?