Would an HBase Scan perform better with multiple Column Families or a single Column Family?

I would like to store an object (payload) along with some metadata in HBase.
Then I would like to run queries on the table and pull out the payload part based on metadata info.
For example, let's say I have the following column qualifiers
P: Payload (larger than M1 + M2).
M1: Meta-Data1
M2: Meta-Data2
Then I would run a query such as:
Fetch all Payload where M1='search-key1' && M2='search-key2'
Does it make sense to:
1. Keep M1 and M2 in one column family and P in another column family? Would the scan be quicker?
2. Keep all 3 columns in the same column family?
Normally, I would do a spike (I may still need to) - I thought I'd ask first.

I'd try to follow the advice given in the HBase Reference Guide and go with option #2 (keep all 3 columns in the same column family):
Try to make do with one column family if you can in your schemas. Only
introduce a second and third column family in the case where data
access is usually column scoped; i.e. you query one column family or
the other but usually not both at the one time.
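If it helps to see the single-column-family layout concretely, here is a minimal sketch assuming an HBase Thrift gateway and the happybase Python client (neither of which the question specifies); the table name, host, and family name are invented for illustration:
import happybase

# connect to the HBase Thrift gateway (hypothetical host name)
connection = happybase.Connection('hbase-thrift-host')

# one column family 'd' holding P, M1 and M2 as qualifiers (option #2)
connection.create_table('payloads', {'d': dict()})
table = connection.table('payloads')
table.put(b'row-1', {b'd:P': b'<payload bytes>',
                     b'd:M1': b'search-key1',
                     b'd:M2': b'search-key2'})

# server-side filter on both metadata qualifiers; the payload of each
# matching row is then read from the returned data
flt = ("SingleColumnValueFilter('d', 'M1', =, 'binary:search-key1') AND "
       "SingleColumnValueFilter('d', 'M2', =, 'binary:search-key2')")
for key, data in table.scan(filter=flt):
    payload = data[b'd:P']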

Related

What is the best practice for loading data into BigQuery table?

Currently I'm loading data from Google Storage into stage_table_orders using WRITE_APPEND. Since this loads both new and existing orders, the same order can have more than one version; the field etl_timestamp tells which row is the most recent one.
Then I WRITE_TRUNCATE my production_table_orders with a query like:
SELECT ...
FROM (
  SELECT *, ROW_NUMBER() OVER
    (PARTITION BY date_purchased, orderid ORDER BY etl_timestamp DESC) AS rn
  FROM `warehouse.stage_table_orders`
)
WHERE rn = 1
Then the production_table_orders always contains the most updated version of each order.
This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice.
I have around 20M rows. It seems not smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Suggestion?
We are doing the same. To help improve performance, though, try to partition the table by date_purchased and cluster by orderid.
Use a CTAS statement (to the table itself), as you cannot add partitioning after the fact.
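A minimal sketch of that one-time CTAS, run through the google-cloud-bigquery client (untested; it assumes date_purchased is a DATE column and uses the table name from the question, with a placeholder project id):
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project id

# one-time rebuild: partitioning cannot be added to an existing table, so
# recreate it partitioned and clustered in a single CTAS statement
# (wrap date_purchased in DATE() if it is a TIMESTAMP rather than a DATE)
ctas = """
CREATE OR REPLACE TABLE `warehouse.stage_table_orders`
PARTITION BY date_purchased
CLUSTER BY orderid
AS SELECT * FROM `warehouse.stage_table_orders`
"""
client.query(ctas).result()  # wait for the job to complete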
EDIT: use 2 tables and MERGE
Depending on your particular use case, i.e. the number of fields that could change between old and new records, you could use two tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
F.date_purchased = S.date_purchased
WHEN MATCHED THEN
UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
INSERT (field1, field2, ...) VALUES(S.field1, S.field2, ...)
Pro: efficient if only a few rows are "upserted" rather than millions (although not tested), and partition pruning should work.
Con: you have to explicitly list the fields in the UPDATE and INSERT clauses - a one-time effort if the schema is pretty much fixed.
There are many ways to de-duplicate and there is no one-size-fits-all solution. Search on SO for similar questions using ARRAY_AGG, EXISTS with DELETE, UNION ALL, etc. Try them out and see which performs better for YOUR dataset.
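For reference, here is a sketch of the ARRAY_AGG variant mentioned above, run through the Python client (untested against the asker's schema; the table and column names are taken from the question):
from google.cloud import bigquery

client = bigquery.Client()

# keep only the newest version of each (date_purchased, orderid) pair
dedup_sql = """
SELECT AS VALUE ARRAY_AGG(t ORDER BY etl_timestamp DESC LIMIT 1)[OFFSET(0)]
FROM `warehouse.stage_table_orders` AS t
GROUP BY date_purchased, orderid
"""
rows = client.query(dedup_sql).result()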

Redshift derive column value based on other column

Let's say, I have 3 date columns (d1,d2,d3) in a redshift table.
d1 = max(d2,d3)
Instead of my application computing the value and setting it, during insert, if I specify only d2 and d3, can redshift auto-populate d1 = max(d2,d3)?
There are two ways to load data into Amazon Redshift.
The first is via the COPY command, when data is loaded from files stored in Amazon S3. Each column in a file will be mapped to one column in a table, so you cannot 'compute' a column during this process.
The second is via an INSERT command. This is not very efficient with Amazon Redshift and is preferably used to insert rows in bulk rather than one row at a time.
A common practice is to load the data into a staging table, manipulate it as desired, then re-insert it into the target table.
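As an illustration of that staging-table pattern, here is a hedged sketch using psycopg2; the connection details, table names, S3 bucket and IAM role are all invented, and only the column names d1, d2, d3 come from the question:
import psycopg2

# placeholder connection details
conn = psycopg2.connect(host="my-cluster.xyz.redshift.amazonaws.com",
                        port=5439, dbname="mydb", user="myuser", password="...")
cur = conn.cursor()

# 1. bulk-load the raw columns into a staging table
cur.execute("""
    COPY staging_table (d2, d3)
    FROM 's3://my-bucket/input/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV
""")

# 2. compute d1 while moving the rows into the target table
cur.execute("""
    INSERT INTO target_table (d1, d2, d3)
    SELECT GREATEST(d2, d3), d2, d3
    FROM staging_table
""")
conn.commit()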
You might even be able to do some fancy stuff with Redshift Spectrum, where you can SELECT directly from files in S3 and insert into a table. This would also allow you to include computed expressions, e.g.:
INSERT INTO normal_table
SELECT GREATEST(d2, d3), d2, d3 FROM spectrum_table
An alternative is to load the data, then use an UPDATE command to set the value of the extra column based upon existing columns.
Update:
It appears that using an UPDATE statement in Amazon Redshift (and, in fact, in any columnar database) is not a good idea. This is because each column is stored separately but in the same order. Updating one value requires the whole row to be re-written at the end of the storage space, rather than updated in-place. Thus, you'd need to VACUUM the database after such updates.
In PostgreSQL (on which Redshift is based), you can do what you want like this:
create table test (a int, b int, c int);
insert into test (a, b, c)
values (1, 2, greatest(1, 2)),
       (4, 1, greatest(4, 1));
It should also work in Redshift, although I can't verify that at the moment. But this won't work when bulk loading data via the COPY command.
If the above doesn't work, the other option would be to insert data, and then set column c using an update query.
insert into test (a, b) values (1, 2);
update test set c = greatest(a, b) where c is null;
For bulk loading, it is necessary to load the data into columns a and b first using the COPY command, then use the update query to set the value of column c.

Google Big Query splitting an ingestion time partitioned table

I have an ingestion time partitioned table that's getting a little large. I wanted to group by the values in one of the columns and use that to split it into multiple tables. Is there an easy way to do that while retaining the original _PARTITIONTIME values in the set of new ingestion time partitioned tables?
Also I'm hoping for something that's relatively simple/cheap. I could do something like copy my table a bunch of times and then delete the data for all but one value on each copy, but I'd get charged a huge amount for all those DELETE operations.
Also I have enough unique values in the column I want to split on that saving a "WHERE column = value" query result to a table for every value would be cost prohibitive. I'm not finding any documentation that mentions whether this approach would even preserve the partitions, so even if it weren't cost prohibitive it may not work.
The case you describe requires two-level partitioning, which is not supported yet.
You can create a column-partitioned table: https://cloud.google.com/bigquery/docs/creating-column-partitions
and then populate the partitioning column as needed before insert - but in this case you lose the _PARTITIONTIME value.
Based on the additional clarification - I had a similar problem, and my solution was to write a Python application that reads the source table (read is important here - not query - so it is free), splits the data based on your criteria, and then either streams the data into the target tables (simple, but not free) or generates json/csv files and loads them (also free, but with some limits on the number of load operations); the second route requires more coding and exception handling.
You can also do it via Dataflow - it will definitely be more expensive than a custom solution, but potentially more robust.
Example for the google-cloud-bigquery Python library:
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_NAME")

# reading the source table via list_rows is free, unlike running a query
t1 = client.get_table(source_table_ref)
target_schema = t1.schema[1:]  # remove the first column, which is the key to split on

ds_target = client.dataset(target_dataset, project=target_project)

rows_to_process_iter = client.list_rows(t1, start_index=start_index, max_results=max_results)
# convert to list
rows_to_process = list(rows_to_process_iter)

# ... split rows_to_process by the key column into records_to_stream ...

# stream records to the destination table (insert_rows was named create_rows
# in older versions of the client library); target_table is a placeholder
errors = client.insert_rows(target_table, records_to_stream)
BigQuery now supports clustered partitioned tables, which allow you to specify additional columns that the data should be split by.
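One way to use that without maintaining many tables is a single pass that carries _PARTITIONTIME into an ordinary column and clusters by the would-be split column, so per-value queries prune most of the data. A sketch, with all project, dataset, table and column names invented for illustration:
from google.cloud import bigquery

client = bigquery.Client()

# keep the original _PARTITIONTIME as an explicit column and cluster by the
# column you wanted to split on
split_sql = """
CREATE TABLE `my_project.my_dataset.reorganised_table`
PARTITION BY DATE(original_partition_time)
CLUSTER BY split_column
AS
SELECT _PARTITIONTIME AS original_partition_time, *
FROM `my_project.my_dataset.source_table`
"""
client.query(split_sql).result()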

How to subtract 2 columns with dtype = object within data frame to form a new column of the difference pandas

I have a merged data frame (mdf) whose two source data frames were retrieved from SQL. I wish to create a new column within mdf which will be the difference of two existing columns.
I'm not sure what you mean by a "merge data frame," but here's a sketch of what you might be after. Please elaborate your question a little so it will be more useful to others.
import pandas as pd
df = pd.read_sql('select ....', some_sql_connection)
df['difference'] = df['some column name'] - df['another column name']
Also, referring to the title of your question where you mention dtype=object, data extracted from a SQL database sometimes defaults to the generic object datatype, even if it is actually numeric. (This is not ideal, and better handling of datatypes to and from SQL databases is being actively improved for a future release of pandas.)
For now, before manipulating your data, you might want to run df.convert_objects(convert_numeric=True) if you have all numerical data. See documentation.
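Note that convert_objects has since been deprecated and removed from pandas; in current versions pd.to_numeric is the usual replacement (the column names below are just the placeholders from the sketch above):
import pandas as pd

# convert the two columns to a numeric dtype, then subtract
df['some column name'] = pd.to_numeric(df['some column name'], errors='coerce')
df['another column name'] = pd.to_numeric(df['another column name'], errors='coerce')
df['difference'] = df['some column name'] - df['another column name']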

Efficiently processing all data in a Cassandra Column Family with a MapReduce job

I want to process all of the data in a column family in a MapReduce job. Ordering is not important.
One approach is to iterate over all the row keys of the column family to use as the input. This could potentially be a bottleneck and could be replaced with a parallel method.
I'm open to other suggestions, or for someone to tell me I'm wasting my time with this idea. I'm currently investigating the following:
A potentially more efficient way is to assign ranges to the input instead of iterating over all row keys (before the mapper starts). Since I am using RandomPartitioner, is there a way to specify a range to query based on the MD5?
For example, I want to split the task into 16 jobs. Since the RandomPartitioner is MD5 based (from what I have read), I'd like to query everything starting with a for the first range. In other words, how would I do a get_range on the MD5 that starts at a and ends before b, e.g. a0000000000000000000000000000000 - afffffffffffffffffffffffffffffff?
I'm using the pycassa API (Python) but I'm happy to see Java examples.
I'd cheat a little:
1. Create new rows job_(n) with each column representing each row key in the range you want.
2. Pull all columns from that specific row to indicate which rows you should pull from the CF.
I do this with users. Users from a particular country get a column in the country-specific row. Users with a particular age are also added to a specific row.
This allows me to quickly pull the rows I need based on the criteria I want, and it is a little more efficient than pulling everything.
This is how the Mahout CassandraDataModel example functions:
https://github.com/apache/mahout/blob/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/cassandra/CassandraDataModel.java
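As a rough illustration of this index-row trick with pycassa (the client mentioned in the question), with keyspace, column family, and row names invented for the example:
import pycassa

# made-up keyspace and column family names
pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
users_cf = pycassa.ColumnFamily(pool, 'Users')
index_cf = pycassa.ColumnFamily(pool, 'RowIndex')

# when writing a user, also record its row key as a column of an index row
# (the same idea as the job_(n) rows above, here keyed by country)
user_key = 'user-1234'
users_cf.insert(user_key, {'name': 'alice', 'country': 'AU'})
index_cf.insert('users_in_AU', {user_key: ''})  # column name = data row key

# a job that only needs Australian users streams the column names of that
# index row and fetches just those rows
keys_to_process = [col for col, _ in index_cf.xget('users_in_AU')]
rows = users_cf.multiget(keys_to_process)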
Once you have the data and can pull the rows you are interested in, you can hand it off to your MR job(s).
Alternatively, if speed isn't an issue, look into using Pig: How to use Cassandra's Map Reduce with or w/o Pig?