Write to multiple tables in HBase - MapReduce

I have a situation where I need to write to two HBase tables, say table1 and table2. Whenever a write happens on table1, I need to do some operation on table2, say increment a counter in table2 (like a trigger). For this purpose I need to access (write to) two tables in the same task of a MapReduce program. I heard that it can be done using MultiTableOutputFormat, but I could not find any good example explaining it in detail. Could someone please tell me whether it is possible, and if so, how I can/should do it? Thanks in advance.
Please provide an answer that does not involve coprocessors.

To write to more than one table in a MapReduce job, you have to specify that in the job configuration. You are right that this can be done using MultiTableOutputFormat.
Normally, for a single table you would use something like:
TableMapReduceUtil.initTableReducerJob("tableName", MyReducer.class, job);
Instead of this, write:
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setNumReduceTasks(2);
TableMapReduceUtil.addDependencyJars(job);
TableMapReduceUtil.addDependencyJars(job.getConfiguration());
Now, at the time of writing data to a table, write it as:
context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName1")),put1);
context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName2")),put2);
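For illustration, here is a minimal sketch of what such a reducer might look like; the table names table1/table2, the column family cf, the qualifiers and the input key/value types are all hypothetical, and the "counter" cell is written as a plain Put because MultiTableOutputFormat accepts Put and Delete mutations, not Increment (an atomic increment would have to go through a regular HBase client connection instead).

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, ImmutableBytesWritable, Put> {

  // hypothetical table names - use whatever your two tables are called
  private static final ImmutableBytesWritable TABLE1 =
      new ImmutableBytesWritable(Bytes.toBytes("table1"));
  private static final ImmutableBytesWritable TABLE2 =
      new ImmutableBytesWritable(Bytes.toBytes("table2"));

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    byte[] rowKey = Bytes.toBytes(key.toString());

    long count = 0;
    for (Text value : values) {
      // write the actual record to table1 (family/qualifier are hypothetical)
      Put put1 = new Put(rowKey);
      put1.add(Bytes.toBytes("cf"), Bytes.toBytes("data"), Bytes.toBytes(value.toString()));
      context.write(TABLE1, put1);
      count++;
    }

    // write a companion cell to table2, e.g. how many values were seen for this key
    Put put2 = new Put(rowKey);
    put2.add(Bytes.toBytes("cf"), Bytes.toBytes("counter"), Bytes.toBytes(count));
    context.write(TABLE2, put2);
  }
}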

For this you can use an HBase Observer (a coprocessor). You have to create an observer and deploy it on your server (applicable only for HBase versions > 0.92); it will then automatically trigger the write to the other table.
HBase Observers are conceptually similar to aspects in aspect-oriented programming.
For more details -
https://blogs.apache.org/hbase/entry/coprocessor_introduction

Related

Best practice for using FORCE_INDEX on Spanner

I have a client application which queries data in Spanner.
Let's say I have a table with 10 columns, and my client application can search on a combination of those columns. Let's say I've added 5 indexes to optimise searching.
According to https://cloud.google.com/spanner/docs/sql-best-practices#secondary-indexes
it says:
In this scenario, Spanner automatically uses the secondary index SingersByLastName when executing the query (as long as three days have passed since database creation; see A note about new databases). However, it's best to explicitly tell Spanner to use that index by specifying an index directive in the FROM clause:
And also https://cloud.google.com/spanner/docs/secondary-indexes#index-directive suggests
When you use SQL to query a Spanner table, Spanner automatically uses any indexes that are likely to make the query more efficient. As a result, you don't need to specify an index for SQL queries. However, for queries that are critical for your workload, Google advises you to use FORCE_INDEX directives in your SQL statements for more consistent performance.
Both links suggest YOU (the developer) should be supplying FORCE_INDEX on your queries. This means I now need business logic in my client to say something like:
If (object.SearchTermOne)
queryBuilder.IndexToUse = "Idx_SearchTermOne"
This feels like I'm essentially trying to do the job of the optimiser by setting the index to use. It also means that if I add an extra index, I need a code change to make use of it.
So what are the best practices when it comes to using FORCE_INDEX in Spanner queries?
The best practice at this time is to use FORCE_INDEX as described in the documentation.
This feels like I'm essentially trying to do the job of the optimiser by setting the index to use..
I feel the same.
https://cloud.google.com/spanner/docs/secondary-indexes#index-directive
Note: The query optimizer requires up to three days to collect the databases statistics required to select a secondary index for a SQL query. During this time, Cloud Spanner will not automatically use any indexes.
As that note says, even if enough data has been added for the index to function effectively, it may take up to three days for the optimizer to figure that out.
Queries during that time will probably be full scans.
If you want to prevent this without relying on FORCE_INDEX, you will need to run the ANALYZE DDL statement manually.
https://cloud.google.com/blog/products/databases/a-technical-overview-of-cloud-spanners-query-optimizer
But none of this changes the fact that we are essentially trying to do the optimizer's job...
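As a rough illustration, here is a minimal sketch of supplying an index directive through the Java client library, assuming the Singers table and SingersByLastName index from the linked documentation; the project/instance/database IDs are placeholders. Note that the FORCE_INDEX directive lives in the SQL text itself, not in the client API.

import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.spanner.Statement;

Spanner spanner = SpannerOptions.newBuilder().build().getService();
DatabaseClient dbClient = spanner.getDatabaseClient(
    DatabaseId.of("my-project", "my-instance", "my-database"));   // placeholders

Statement statement = Statement.newBuilder(
        "SELECT SingerId, FirstName, LastName "
            + "FROM Singers@{FORCE_INDEX=SingersByLastName} "
            + "WHERE LastName = @lastName")
    .bind("lastName").to("Smith")
    .build();

try (ResultSet rs = dbClient.singleUse().executeQuery(statement)) {
  while (rs.next()) {
    System.out.printf("%d %s %s%n", rs.getLong(0), rs.getString(1), rs.getString(2));
  }
}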

When to use CTEs and temp tables?

I know this is a common question, but most frequently people ask about the performance difference between the two.
What I'm asking for is the use cases of CTEs and temp tables, to better understand when to use each.
With a temp table you can use CONSTRAINTs and INDEXes. You can also create a CURSOR on a temp table, whereas a CTE terminates at the end of the query (emphasizing a single query).
I will answer through specific use cases from an application I've had experience with, in order to illustrate the point.
Common use cases in an example enterprise application I've worked on are as follows:
Temp Tables
Normally, we use temp tables to transform data before an INSERT or UPDATE into the appropriate tables when the transformation requires more than one query, or to gather similar data from multiple tables in order to manipulate and process it.
For example, there are different types of orders (order_type1, order_type2, order_type3), all of which live in different TABLEs but have similar COLUMNs. We have a STORED PROCEDURE that UNIONs all these tables into one #orders temp table and UPDATEs a person's suggested orders depending on their existing orders.
CTEs
CTEs are awesome for readability when dealing with single queries. When creating reports that require analysis using PIVOTs, aggregates, etc. across tons of lines of code, CTEs provide readability by letting you separate a huge query into logical sections.
Sometimes there is a combination of both: when more than one query is required, it's still useful to break down some of those queries with CTEs.
I hope this is of some usefulness, cheers!

DynamoDB Scan with no FilterExpression vs Query

I have created a DynamoDB table and a Global Secondary Index on that table. I need to fetch all data from the GSI of that table.
There are two options:
Scan operation with no FilterExpression.
Query operation with no condition.
I need to find out which one has better performance, so that I start my implementation.
I have read a lot about the DynamoDB Scan and Query operations but could not resolve my query.
Please help me in resolving my query.
Thanks in advance.
Abhishek
They will both impose the same performance overhead, so choosing either should be okay.
You should think about adding optimizations on top of whichever approach you use - for instance, performing parallel scans as mentioned in the best practices:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScanGuidelines.html
or caching data in your application
Do note that parallel scans will eat up your provisioned throughput.
Another thing to watch out for while making your decision is how likely the query pattern is to change. Do you plan on adding filters in the future? If so, Query would be better, since a Scan loads all the data (consuming provisioned read capacity) and only then filters the results.
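To make the parallel-scan suggestion concrete, here is a minimal sketch using the AWS SDK for Java; the table name (MyTable), index name (MyIndex) and segment count are hypothetical, and in a real job each segment would run in its own thread or worker rather than sequentially as shown here.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;

AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
int totalSegments = 4;                           // hypothetical degree of parallelism

for (int segment = 0; segment < totalSegments; segment++) {
    ScanRequest scan = new ScanRequest()
        .withTableName("MyTable")                // hypothetical table name
        .withIndexName("MyIndex")                // hypothetical GSI name
        .withSegment(segment)
        .withTotalSegments(totalSegments);

    ScanResult result;
    do {
        result = client.scan(scan);
        // process result.getItems() here
        scan.setExclusiveStartKey(result.getLastEvaluatedKey());
    } while (result.getLastEvaluatedKey() != null);
}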

BigQuery tabledata.list output into a BigQuery table

I know there is a way to place the results of a query into a table; there is a way to copy a whole table into another table; and there is a way to list a table piecemeal (tabledata.list using startIndex, maxResults and pageToken).
However, what I want to do is go over an existing table with tabledata.list and output the results piecemeal into other tables. I want to use this as an efficient way to shard a table.
I cannot find a reference to such functionality, or any workaround for it, for that matter.
Important to realize: the tabledata.list API is not part of BQL (BigQuery SQL) but rather part of the BigQuery REST API, which you can use from the client of your choice.
That said, the logic you outlined in your question can be implemented in many ways; below is an example (high-level steps):
Call tabledata.list within a loop, using pageToken for the next iteration or for exiting the loop.
In each iteration, process the response from tabledata.list, extract the actual data and insert it into the destination table using streaming inserts with the tabledata.insertAll API. You can also have an inner loop that goes through the rows extracted in a given iteration and decides which row should go to which table/shard (see the sketch below).
This is very generic logic, and the particular implementation depends on the client you use.
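As one possible illustration, here is a rough sketch of that loop with the BigQuery API Java client; it assumes bigquery is an already-authorized com.google.api.services.bigquery.Bigquery instance, the project/dataset/table names are placeholders, and toJsonMap() is a hypothetical helper that maps the positional cells of each TableRow back to column names (using the source table's schema) and picks the target shard.

import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
import com.google.api.services.bigquery.model.TableDataList;
import com.google.api.services.bigquery.model.TableRow;
import java.util.ArrayList;
import java.util.List;

String pageToken = null;
do {
  TableDataList page = bigquery.tabledata()
      .list("my-project", "my_dataset", "source_table")        // placeholder names
      .setMaxResults(500L)
      .setPageToken(pageToken)
      .execute();

  if (page.getRows() != null) {
    List<TableDataInsertAllRequest.Rows> outRows = new ArrayList<>();
    for (TableRow row : page.getRows()) {
      // toJsonMap() is a hypothetical helper: turn row.getF() cells into a
      // column-name -> value map and decide which shard this row belongs to
      outRows.add(new TableDataInsertAllRequest.Rows().setJson(toJsonMap(row)));
    }
    bigquery.tabledata()
        .insertAll("my-project", "my_dataset", "shard_table",   // placeholder names
            new TableDataInsertAllRequest().setRows(outRows))
        .execute();
  }
  pageToken = page.getPageToken();
} while (pageToken != null);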
Hope this helps
For what you describe, I'd suggest you use the batch version of Cloud Dataflow:
https://cloud.google.com/dataflow/
Dataflow already supports BigQuery tables as sources and sinks, and will keep all data within Google's network. This approach also scales to arbitrarily large tables.
tabledata.list-ing your entire table might work fine for small tables, but, network overhead aside, it is definitely not recommended for anything of moderate size.
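As a very rough sketch, here is what that could look like with the original Cloud Dataflow Java SDK; the project/dataset/table names, the shard count and the hash-based sharding rule are all placeholders, and the shard tables are assumed to exist already.

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Partition;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;

public class ShardTable {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    final int numShards = 4;   // hypothetical shard count

    PCollection<TableRow> rows =
        p.apply(BigQueryIO.Read.from("my-project:my_dataset.source_table"));  // placeholder

    PCollectionList<TableRow> shards = rows.apply(
        Partition.of(numShards, new Partition.PartitionFn<TableRow>() {
          @Override
          public int partitionFor(TableRow row, int numPartitions) {
            // replace with your own sharding rule, e.g. a hash of a key column
            return Math.abs(row.hashCode()) % numPartitions;
          }
        }));

    for (int i = 0; i < numShards; i++) {
      shards.get(i).apply(
          BigQueryIO.Write.named("WriteShard" + i)
              .to("my-project:my_dataset.shard_" + i)            // placeholder
              .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
              .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
    }
    p.run();
  }
}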

HBase Update/Insert using Get/Put

Could anyone advise what would be the best way to go for my requirement?
I have the following:
An HBase table
An input file in HDFS
My requirement is as below:
Read the input file and fetch the key. Using the key, get the data from HBase.
Do a comparison to check.
If the comparison fails, insert.
If the comparison is successful, update.
I know I can use Get to fetch the data and Put to write it back. Is this the best way to go forward? I hope to use MapReduce so that I can get the process to run in parallel.
HBase has a checkAndPut() and a checkAndDelete() operation, which allow you to perform a put or a delete only if the cell holds the value you expect (use compare=NO_OP if you don't care about the value, just about the key).
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
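As a rough sketch (the table name, column family, qualifier and values below are hypothetical), the check-then-write step could look like this with the classic HTable client:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "mytable");            // hypothetical table name

byte[] row = Bytes.toBytes("key-from-input-file");     // hypothetical row key
Put put = new Put(row);
put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("newValue"));

// the put is applied only if the cell currently holds "expectedValue";
// passing null as the expected value means "apply only if the cell does not exist"
boolean applied = table.checkAndPut(row, Bytes.toBytes("cf"),
    Bytes.toBytes("col"), Bytes.toBytes("expectedValue"), put);

table.close();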
Depending on the size of your problem I actually recommend a slightly different approach here. While it probably is feasible to implement HBase puts inside a MapReduce Job, it sounds like a rather complex task.
I'd recommend loading the data from HBase into MapReduce, joining the two data sets, and then exporting the result back into HBase.
Using Pig this would be rather easy to achieve.
Take a look at Pig HBaseStorage.
Going this route, you'd load both inputs, join them, and then write back to HBase. If all there is to it is comparing keys, this can be achieved in about five lines of Pig Latin.
HTH