I have to join tables in HBase.
I integrated Hive and HBase and that is working well; I can query using Hive.
But can somebody help me with how to join tables in HBase without using Hive? I think we can achieve this using MapReduce; if so, can anybody share a working example that I can refer to?
Please share your opinions.
I have an approach in mind. That is,
If I need to JOIN tables A x B x C,
I may use TableMapReduceUtil to iterate over A, then get data from B and C inside the TableMapper, and then use the TableReducer to write back to another table Y.
Will this approach be a good one?
That is certainly an approach, but if you are doing 2 random reads per scanned row then your speed will plummet. If you are filtering out rows significantly or have a small dataset in A, that may not be an issue.
Sort-merge Join
However, the best approach, which will be available in HBase 0.96, is to use MultiTableInputFormat. This means that it will scan table A and write its output with a unique key that will allow table B to match up.
E.g. Table A emits (b_id, a_info) and Table B emits (b_id, b_info), and the two are merged together in the reducer.
This is an example of a sort-merge join.
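Below is a minimal sketch of that reduce-side join using the multi-table scan support (MultiTableInputFormat, available from 0.94.5/0.96 onward). The table names "tableA"/"tableB" and the column info:b_id are placeholders, and imports are omitted:
public class SortMergeJoinJob {

    // Emits (b_id, serialized row) for every row, whichever table the split came from.
    static class JoinMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable key, Result row, Context ctx)
                throws IOException, InterruptedException {
            byte[] bId = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("b_id"));
            if (bId != null) {
                ctx.write(new Text(Bytes.toString(bId)), new Text(row.toString()));
            }
        }
    }

    // All rows sharing a b_id meet in the same reduce call, where they are merged.
    static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text bId, Iterable<Text> rows, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder merged = new StringBuilder();
            for (Text row : rows) {
                merged.append(row).append('|');
            }
            ctx.write(bId, new Text(merged.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(HBaseConfiguration.create(), "sort-merge-join");
        job.setJarByClass(SortMergeJoinJob.class);

        List<Scan> scans = new ArrayList<>();
        for (String table : new String[] {"tableA", "tableB"}) {
            Scan scan = new Scan();
            // Tells the multi-table input format which table each Scan belongs to.
            scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes(table));
            scans.add(scan);
        }
        TableMapReduceUtil.initTableMapperJob(scans, JoinMapper.class,
                Text.class, Text.class, job);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The shuffle sort on b_id is what gives this its sort-merge behaviour; no random reads against either table are needed.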
Nested-Loop Join
If you are joining on the row key, or the joining attribute is sorted in line with table B, you can have an instance of a scanner in each task which sequentially reads from table B until it finds what it's looking for.
E.g. Table A row key = "companyId" and Table B row key = "companyId_employeeId". Then for each company in Table A you can get all the employees using the nested-loop algorithm.
Pseudocode:
for(company in TableA):
    for(employee in TableB):
        if employee.company_id == company.id:
            emit(company.id, employee)
This is an example of a nested-loop join.
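A rough sketch of how that inner lookup could look inside a TableMapper, assuming the old-style (pre-1.0) HTable client API and the companyId / companyId_employeeId key layout above; imports and error handling are omitted:
static class CompanyMapper extends TableMapper<Text, Text> {
    private HTable employees;

    @Override
    protected void setup(Context ctx) throws IOException {
        // Open Table B once per task, not once per row.
        employees = new HTable(ctx.getConfiguration(), "TableB");
    }

    @Override
    protected void map(ImmutableBytesWritable companyKey, Result companyRow, Context ctx)
            throws IOException, InterruptedException {
        // All of this company's employees share the "companyId_" row-key prefix,
        // so a bounded prefix scan replaces the full inner loop of the pseudocode.
        byte[] prefix = Bytes.add(companyKey.copyBytes(), Bytes.toBytes("_"));
        Scan scan = new Scan(prefix);
        scan.setFilter(new PrefixFilter(prefix));
        try (ResultScanner employeesOfCompany = employees.getScanner(scan)) {
            for (Result employee : employeesOfCompany) {
                ctx.write(new Text(Bytes.toString(companyKey.copyBytes())),
                          new Text(Bytes.toString(employee.getRow())));
            }
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
        employees.close();
    }
}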
More detailed join algorithms are here:
http://en.wikipedia.org/wiki/Nested_loop_join
http://en.wikipedia.org/wiki/Hash_join
http://en.wikipedia.org/wiki/Sort-merge_join
I would like to store an object (payload) along with some metadata in HBase.
Then I would like to run queries on the table and pull out the payload part based on metadata info.
For example, let's say I have the following column qualifiers
P: Payload (larger than M1 + M2).
M1: Meta-Data1
M2: Meta-Data2
Then I would run a query such as:
Fetch all Payload where M1='search-key1' && M2='search-key2'
Does it make sense to:
Keep M1 and M2 in one column family and P in another column family? Would the scan be quicker?
Keep all 3 columns in the same column family?
Normally, I would do a spike (I may still need to) - I thought I'd ask first.
I'd try to follow the advice given in the HBase Reference Guide and go with option #2 (keep all 3 columns in the same column family):
Try to make do with one column family if you can in your schemas. Only
introduce a second and third column family in the case where data
access is usually column scoped; i.e. you query one column family or
the other but usually not both at the one time.
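For completeness, here is a sketch of the M1/M2 lookup from the question under option #2, assuming a single column family "d", string-encoded values, and an already-opened table handle (all of these names are placeholders):
Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);

SingleColumnValueFilter m1 = new SingleColumnValueFilter(
        Bytes.toBytes("d"), Bytes.toBytes("M1"),
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes("search-key1"));
m1.setFilterIfMissing(true);  // drop rows that have no M1 at all
SingleColumnValueFilter m2 = new SingleColumnValueFilter(
        Bytes.toBytes("d"), Bytes.toBytes("M2"),
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes("search-key2"));
m2.setFilterIfMissing(true);

filters.addFilter(m1);
filters.addFilter(m2);
scan.setFilter(filters);

try (ResultScanner results = table.getScanner(scan)) {
    for (Result row : results) {
        byte[] payload = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("P"));
        // ... use payload ...
    }
}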
Currently I'm loading data from Google Cloud Storage to stage_table_orders using WRITE_APPEND. Since this loads both new and existing orders, there could be a case where the same order has more than one version; the field etl_timestamp tells which row is the most up-to-date one.
Then I WRITE_TRUNCATE my production_table_orders with a query like:
SELECT ...
FROM (
  SELECT *, ROW_NUMBER() OVER
    (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
WHERE rn = 1
Then production_table_orders always contains the most up-to-date version of each order.
This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice.
I have around 20M rows. It seems not smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Suggestion?
We are doing the same. To help improve performance though, try to partition the table by date_purchased and cluster by orderid.
Use a CTAS statement (to the table itself) as you cannot add partitioning after the fact.
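A sketch of that one-time CTAS, assuming date_purchased is a DATE column (use DATE(date_purchased) if it is a TIMESTAMP). It writes to a new, placeholder table name, since replacing an existing table with a different partitioning spec may be rejected:
CREATE TABLE `warehouse.production_table_orders_partitioned`
PARTITION BY date_purchased
CLUSTER BY orderid
AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER
    (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
WHERE rn = 1;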
EDIT: use 2 tables and MERGE
Depending on your particular use case, i.e. the number of fields that could be updated between old and new, you could use 2 tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
F.date_purchased = S.date_purchased
WHEN MATCHED THEN
UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
INSERT (field1, field2, ...) VALUES(S.field1, S.field2, ...)
Pro: efficient if only a few rows are "upserted", not millions (although not tested), and partition pruning should work.
Con: you have to explicitly list the fields in the UPDATE and INSERT clauses. A one-time effort if the schema is pretty much fixed.
There are many ways to de-duplicate and there is no one-size-fits-all solution. Search on SO for similar requests using ARRAY_AGG, or EXISTS with DELETE, or UNION ALL, ... Try them out and see which performs better for YOUR dataset.
This documentation describes key distribution in Redshift as follows:
The rows are distributed according to the values in one column. The
leader node will attempt to place matching values on the same node
slice. If you distribute a pair of tables on the joining keys, the
leader node collocates the rows on the slices according to the values
in the joining columns so that matching values from the common columns
are physically stored together.
I was wondering if key-distribution additionally helps in optimizing equality filters. My intuition says it should but it isn't mentioned anywhere.
Also, I saw documentation regarding sort keys which says that to select a sort key:
Look for columns that are used in range filters and equality filters.
This got me confused since sort-keys are explicitly mentioned as a way to optimize equality filters.
I am asking this because I already have a candidate sort-key on which I will be doing range queries. But I also want to have quick equality filters on another column which is a good distribution key in my case.
It is a very bad idea to be filtering on a distribution key, especially if your table / cluster is large.
The reason is that the filter may be running on just one slice, in effect running without the benefit of MPP.
For example, if you have a dist key of "added_date", you may find that all of the rows for the previous week's added_date values sit together on one slice.
You will then have the majority of queries filtering for recent ranges of added_date, and these queries will be concentrated and will saturate that one slice.
The simple rule is:
Use DISTKEY for the column most commonly joined
Use SORTKEY for fields most commonly used in a WHERE statement
There actually are benefits to using the same field for SORTKEY and DISTKEY. From Choose the Best Sort Key:
If you frequently join a table, specify the join column as both the sort key and the distribution key.
This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
Feel free to do some performance tests -- create a few different versions of the table, and use INSERT or SELECT INTO to populate them. Then, try common queries to see how they perform.
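For example, a sketch of two candidate definitions to benchmark against each other (table and column names are placeholders):
-- Variant 1: distribute on the join column, sort on the range-filter column.
CREATE TABLE orders_v1 (
    company_id  BIGINT,
    added_date  DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (company_id)
SORTKEY (added_date);

-- Variant 2: same column as both DISTKEY and SORTKEY, to enable sort merge joins.
CREATE TABLE orders_v2 (
    company_id  BIGINT,
    added_date  DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (company_id)
SORTKEY (company_id);

-- Populate both from the same source, then compare EXPLAIN plans and timings.
INSERT INTO orders_v1 SELECT company_id, added_date, amount FROM orders_source;
INSERT INTO orders_v2 SELECT company_id, added_date, amount FROM orders_source;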
My main concern:
I have an existing table with a huge amount of data. It has a clustered index.
My C++ process has a list of many keys, with which it checks whether each key exists in the table;
if yes, it then checks whether the row in the table and the new row are similar. If there is a change, the row is updated in the table.
In general there will be few changes, but there is a huge amount of data in the table.
So it means there will be a lot of select queries but not many update queries.
What I would like to achieve:
I just read about partitioning a table in Sybase here.
I just wanted to know whether this will be helpful for me, as the article mentions insert queries only. But how can I improve my select query performance?
Could anyone please suggest what I should look for in this case?
Yes it will improve your query (read) performance so long as your query is based on the partition keys defined. Indexes can also be partitioned and it stands to reason that a smaller index will mean faster read performance.
For example if you had a query like select * from contacts where lastName = 'Smith' and you have partitioned your table index based on first letter of lastName, then the server only has to search one partition "S" to retrieve its results.
Be warned that partitioning your data can be difficult if you have a lot of different query profiles. Queries that do not include the index partition key (e.g. lastName), such as select * from staff where created > [some_date], will then have to hit every index partition in order to retrieve its result set.
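As a rough illustration of partitioning on lastName (sketched against Sybase ASE 15 range-partition syntax; the table, columns, and boundary values are made up):
-- Rows are routed to a partition by lastName range; indexes created on the
-- table can likewise be partitioned (local indexes), keeping each one small.
create table contacts (
    contact_id  int          not null,
    lastName    varchar(60)  not null,
    firstName   varchar(60)  null
)
partition by range (lastName)
   (p_a_to_f  values <= ('FZZZZ'),
    p_g_to_m  values <= ('MZZZZ'),
    p_n_to_z  values <= (MAX))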
No one can tell you what you should/shouldn't do as it is very application specific, and you will have to perform your own analysis. Before meddling with partitions, my advice is to ensure you have the correct indexes in place, that they are being hit by your queries (i.e. no table scans), that your server is appropriately resourced (i.e. it has enough fast disk and RAM), and that you have tuned your server caches to suit your queries.
I need to write a MapReduce job that gets all rows in a given date range (say, the last one month). It would have been a cakewalk had my row key started with the date, but my frequent HBase queries are on the starting values of the key.
My row key is exactly A|B|C|20120121|D, where the combination of A/B/C along with the date (in YearMonthDay format) makes a unique row ID.
My HBase tables could have up to a few million rows. Should my mapper read the whole table and filter each row on whether it falls in the given date range, or can a Scan/Filter help handle this situation?
Could someone suggest (or a snippet of code) a way to handle this situation in an effective manner?
Thanks
-Panks
A RowFilter with a regex comparator would work, but would not be the optimal solution. Alternatively you can try to use secondary indexes.
One more solution is to try the FuzzyRowFilter. A FuzzyRowFilter uses a kind of fast-forwarding, hence skipping many rows in the overall scan process, and will thus be faster than a RowFilter scan (a sketch follows below). You can read more about it here.
Alternatively, Bloom filters might also help, depending on your schema. If your data is huge you should do a comparative analysis of secondary indexes and Bloom filters.
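A sketch of the FuzzyRowFilter idea. It only works cleanly if the key components have fixed widths; the layout assumed here (a 6-byte "a|b|c|" prefix, an 8-byte yyyyMMdd date, a 2-byte "|d" suffix) and the January 2012 range are purely illustrative:
// Mask semantics: 0 = byte must match the template, 1 = byte may be anything.
byte[] template  = Bytes.toBytes("??????201201????");
byte[] fuzzyMask = new byte[] {
        1, 1, 1, 1, 1, 1,   // prefix "a|b|c|" -> anything
        0, 0, 0, 0, 0, 0,   // "201201"        -> fixed (year + month)
        1, 1,               // day             -> anything
        1, 1                // suffix "|d"     -> anything
};

Scan scan = new Scan();
scan.setFilter(new FuzzyRowFilter(
        Arrays.asList(new Pair<byte[], byte[]>(template, fuzzyMask))));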
You can use a RowFilter with a RegexStringComparator. You'd need to come up with a RegEx that filters your dates appropriately. This page has an example that includes setting a Filter for a MapReduce scanner.
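For instance, a sketch assuming the A|B|C|yyyyMMdd|D layout from the question, here matching any day in January 2012 (adjust the regex to your real range):
Scan scan = new Scan();
scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
        new RegexStringComparator("^[^|]+\\|[^|]+\\|[^|]+\\|201201\\d{2}\\|.*$")));
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyTableMapper.class,
        OutputKey.class, OutputValue.class, job);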
I am just getting started with HBase, but bloom filters might help.
You can modify the Scan that you send into the Mapper to include a filter. If your date is also the record timestamp, it's easy:
Scan scan = new Scan();
scan.setTimeRange(minTime, maxTime);
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyTableMapper.class,
OutputKey.class, OutputValue.class, job);
If the date in your row key is different, you'll have to add a filter to your scan. This filter can operate on a column or a row key. I think it's going to be messy with just the row key. If you put the date in a column, you can make a FilterList where all conditions must be true and use a CompareOp.GREATER and a CompareOp.LESS. Then use scan.setFilter(filterList) to add your filters to the scan.
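A sketch of that column-based approach, assuming (purely for illustration) that the date is also stored in a column meta:date as a yyyyMMdd string, so lexicographic comparison matches chronological order:
FilterList dateRange = new FilterList(FilterList.Operator.MUST_PASS_ALL);
dateRange.addFilter(new SingleColumnValueFilter(
        Bytes.toBytes("meta"), Bytes.toBytes("date"),
        CompareFilter.CompareOp.GREATER, Bytes.toBytes("20111231")));
dateRange.addFilter(new SingleColumnValueFilter(
        Bytes.toBytes("meta"), Bytes.toBytes("date"),
        CompareFilter.CompareOp.LESS, Bytes.toBytes("20120201")));

Scan scan = new Scan();
scan.setFilter(dateRange);
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyTableMapper.class,
        OutputKey.class, OutputValue.class, job);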