HBase Update/Insert using Get/Put - MapReduce

Could anyone advise on the best way to meet my requirement?
I have the following:
An HBase table
An input file in HDFS
My requirement is as follows:
Read the input file and fetch the key. Using the key, get the data from HBase.
Do a comparison to check whether the values match.
If the comparison fails, insert.
If the comparison succeeds, update.
I know I can use Get to fetch the data and Put to write it back. Is this the best way to go forward? I plan to use MapReduce so that the process runs in parallel.

HBase has checkAndPut() and checkAndDelete() operations, which allow you to perform a put or a delete only if the cell holds the value you expect (use compare=NO_OP if you don't care about the value, just about the key).
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
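As a rough sketch of what that can look like with the HBase 1.x client API (the table, column family, qualifier and values below are placeholders, not from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("myTable"))) {

            byte[] row = Bytes.toBytes("row-1");   // key read from the HDFS input file
            byte[] cf  = Bytes.toBytes("cf");
            byte[] col = Bytes.toBytes("col");

            Put put = new Put(row);
            put.addColumn(cf, col, Bytes.toBytes("newValue"));

            // Update: the put is applied only if the current cell value equals "expectedValue".
            boolean updated = table.checkAndPut(row, cf, col, Bytes.toBytes("expectedValue"), put);

            // Insert: passing null as the expected value means "put only if the cell does not exist yet".
            if (!updated) {
                table.checkAndPut(row, cf, col, null, put);
            }
        }
    }
}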

Depending on the size of your problem, I'd actually recommend a slightly different approach here. While it is probably feasible to implement HBase puts inside a MapReduce job, it sounds like a rather complex task.
I'd recommend loading the data from HBase into MapReduce, joining the two tables, and then exporting the result back into HBase.
Using Pig this would be rather easy to achieve.
Take a look at Pig HBaseStorage.
Going this route, you'd load both inputs, join them, and then write the result back to HBase. If all there is to it is comparing keys, this can be done in five lines of Pig Latin.
HTH

Related

Informatica Cloud (IICS): Can Target transformation pre and post sql include incoming data?

I have a Target insert transformation, and I'd like to delete the row before insertion (a weird niche case that may pop up).
I know the update override allows :TU.xyz to point at incoming data, but Pre/Post SQL doesn't have the same configuration menu.
How would I accomplish this correctly?
From what I recall, Pre and Post SQL use a separate connection, so there is no way to refer to incoming data.
One thing you could do is flag or store the key somewhere and use that flag/instance in the Post SQL query, for example.
Maciejg is correct: there is no dynamic use of Pre and Post SQL.
I would normally recommend an Upsert approach.
But I found that, with an MS SQL target, IICS has a bug when doing Insert and Update off a Router. The workaround of using a data-driven operation removes batch loading on your insert, so... I now recommend a full data load approach.
From a target with the operation set to Insert, I do batch deletes with Pre SQL.
I found this faster and more cost-effective than doing delete/insert/update operations individually.

Options for paging and sorting a DynamoDB result set?

I'm starting on a new project and am going to be using DynamoDB as the main data source. A lot of what it does works perfectly for the needs, with a couple of exceptions.
Those exceptions are the sorting and pagination needs of the UI. Users can sort the data by anywhere from 8 to 10 different columns, and a result set of 20-30k+ rows needs to be paginated.
From what I can tell of DynamoDB, the only way to sort by all those columns would be to expose that many sort keys through a variety of additional indexes, and that seems like a misuse of those concepts. If I'm not going to sort the data with DynamoDB queries, I can't paginate there either.
So my question is: what's the quickest way, once I have the data, to paginate and sort? Should I move the result set into Aurora and then sort and page with SQL? I've thought about exporting to S3 and then using something like Athena to page and sort, but that tool really seems geared towards much larger datasets than this. What are the other options?
One option is to duplicate the data and store it once for each sorting option, with each version of the record holding different data in its sort key. If you are okay with eventual consistency that may be a little more delayed, you can accomplish this with a Lambda that reads from a DynamoDB stream and inserts/updates/deletes the sorted copies as the main records are inserted/updated/deleted.
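A rough sketch of the write side of that idea, using the AWS SDK for Java v2 (the table name, sort-key attribute and column names are assumptions, and the DynamoDB-stream/Lambda wiring that would call this is omitted):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class SortCopyWriter {

    // The columns users can sort by (placeholders).
    private static final List<String> SORT_COLUMNS =
            List.of("createdAt", "customerName", "totalAmount");

    private final DynamoDbClient dynamo = DynamoDbClient.create();

    // Writes one copy of the item per sortable column. Each copy keeps the same
    // partition key but stores "<column>#<value>" in the "sk" sort-key attribute,
    // so a Query (with ScanIndexForward) returns items already ordered by that column.
    public void writeSortedCopies(Map<String, AttributeValue> item) {
        for (String column : SORT_COLUMNS) {
            AttributeValue raw = item.get(column);
            String sortValue = raw.s() != null ? raw.s() : raw.n();

            Map<String, AttributeValue> copy = new HashMap<>(item);
            copy.put("sk", AttributeValue.builder().s(column + "#" + sortValue).build());

            dynamo.putItem(PutItemRequest.builder()
                    .tableName("orders_by_sort")
                    .item(copy)
                    .build());
        }
    }
}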
Sorting, pagination and returning 20-30K records are not Dynamo's strong suit...
Why not just store the data in Aurora in the first place?
Depending on the data, Elasticsearch may be a better choice. Might even look at Redshift.
EDIT
If you've not seen this before...

How to ignore Amazon Athena struct order

I'm getting a HIVE_PARTITION_SCHEMA_MISMATCH error that I'm not quite sure what to do about. When I look at the two different schemas, the only thing that's different is the order of the keys in one of my structs (created by a Glue crawler). I really don't care about the order of the data, and since I'm receiving the data as a JSON blob, I cannot guarantee the order of the keys.
struct<device_id:string,user_id:string,payload:array<struct<channel:string,sensor_id:string,type:string,unit:string,value:double,name:string>>,topic:string,channel:string,client_id:string,hardware_id:string,timestamp:bigint,application_id:string>
struct<device_id:string,user_id:string,payload:array<struct<channel:string,name:string,sensor_id:string,type:string,unit:string,value:double>>,topic:string,channel:string,client_id:string,hardware_id:string,timestamp:bigint,application_id:string>
I suggest you stop using Glue crawlers. It's probably not the response you had hoped for, but crawlers are really bad at their job. They can be useful sometimes as a way to get a schema from a random heap of data that someone else produced and that you don't want to spend time looking at to figure out what its schema is – but once you have a schema, and you know that new data will follow that schema, Glue crawlers are just in the way, and produce unnecessary problems like the one you have encountered.
What to do instead depends on how new data is added to S3.
If you are in control of the code that produces the data, you can add code that adds partitions after the data has been uploaded. The benefit of this solution is that partitions are added immediately after new data has been produced so tables are always up to date. However, it might tightly couple the data producing code with Glue (or Athena if you prefer to add partitions through SQL) in a way that is not desirable.
If it doesn't make sense to add the partitions from the code that produces the data, you can create a Lambda function that does it. You can either set it to run at a fixed time every day (if you know the location of the new data, you don't have to wait until it exists; partitions can point to empty locations), or you can trigger it from S3 notifications (if there are multiple files, you can either find a way to debounce the notifications through SQS, or just create the partition over and over again and swallow the error if it already exists).
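As an illustration, such a Lambda (or the producing code itself) could add a partition by running DDL through Athena with the AWS SDK for Java v2; the database, table, partition column and S3 locations below are placeholders:

import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.QueryExecutionContext;
import software.amazon.awssdk.services.athena.model.ResultConfiguration;
import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;

public class AddPartition {
    public static void main(String[] args) {
        AthenaClient athena = AthenaClient.create();

        // ADD IF NOT EXISTS makes it safe to run repeatedly, e.g. once per S3 notification.
        String ddl = "ALTER TABLE sensor_events ADD IF NOT EXISTS "
                + "PARTITION (dt = '2020-01-01') "
                + "LOCATION 's3://my-bucket/sensor_events/dt=2020-01-01/'";

        athena.startQueryExecution(StartQueryExecutionRequest.builder()
                .queryString(ddl)
                .queryExecutionContext(QueryExecutionContext.builder().database("my_db").build())
                .resultConfiguration(ResultConfiguration.builder()
                        .outputLocation("s3://my-bucket/athena-results/")
                        .build())
                .build());
    }
}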
You may also have heard of MSCK REPAIR TABLE …. It's better than Glue crawlers in some ways, but just as bad in other ways. It will only add new partitions, never change the schema, which is usually what you want, but it's extremely inefficient, and runs slower and slower the more files there are. Kind of like Glue crawlers.

BigQuery tabledata:list output into a BigQuery table

I know there is a way to place the results of a query into a table; there is a way to copy a whole table into another table; and there is a way to list a table piecemeal (tabledata:list using startIndex, maxResults and pageToken).
However, what I want to do is go over an existing table with tabledata:list and output the results piecemeal into other tables. I want to use this as an efficient way to shard a table.
I cannot find a reference to such functionality, or any workaround for it, for that matter.
Important to realize: the Tabledata.List API is not part of BQL (BigQuery SQL) but rather part of the BigQuery API, which you can use in the client of your choice.
That said, the logic you outlined in your question can be implemented in many ways; below is an example (high-level steps):
Call Tabledata.List within a loop, using pageToken for the next iteration or for exiting the loop.
In each iteration, process the response from Tabledata.List, extract the actual data, and insert it into the destination table as streaming data with the Tabledata.InsertAll API. You can also have an inner loop that goes through the rows extracted in a given iteration and decides which row goes to which table/shard.
This is very generic logic, and the particular implementation depends on the client you use.
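For example, with the Java client library (google-cloud-bigquery) the loop could look roughly like this; the dataset, table and field names and the shard count are placeholders, and in practice you would batch rows per shard instead of streaming them one at a time:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQuery.TableDataListOption;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableResult;
import java.util.HashMap;
import java.util.Map;

public class ShardTable {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        TableId source = TableId.of("my_dataset", "source_table");
        Schema schema = bigquery.getTable(source).getDefinition().getSchema();

        // tabledata.list with a page size; iterateAll() follows the pageToken internally.
        TableResult result =
                bigquery.listTableData(source, schema, TableDataListOption.pageSize(1000));

        int rowCount = 0;
        for (FieldValueList row : result.iterateAll()) {
            // Decide which shard this row belongs to (here: simple round-robin over 4 shards).
            TableId shard = TableId.of("my_dataset", "shard_" + (rowCount++ % 4));

            Map<String, Object> content = new HashMap<>();
            content.put("id", row.get("id").getStringValue());
            content.put("value", row.get("value").getDoubleValue());

            // Streaming insert into the destination shard (tabledata.insertAll).
            bigquery.insertAll(InsertAllRequest.newBuilder(shard).addRow(content).build());
        }
    }
}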
Hope this helps
For what you describe, I'd suggest you use the batch version of Cloud Dataflow:
https://cloud.google.com/dataflow/
Dataflow already supports BigQuery tables as sources and sinks, and will keep all data within Google's network. This approach also scales to arbitrarily large tables.
TableData.list-ing your entire table might work fine for small tables, but network overhead aside, it is definitely not recommended for anything of moderate size.

Write to multiple tables in HBase

I have a situation here where I need to write to two HBase tables, say table1 and table2. Whenever a write happens on table1, I need to do some operation on table2, say increment a counter in table2 (like a trigger). For this purpose I need to access (write to) two tables in the same task of a MapReduce program. I heard that it can be done using MultiTableOutputFormat, but I could not find any good example explaining it in detail. Could someone please answer whether it is possible to do so, and if so, how I can/should do it? Thanks in advance.
Please provide an answer that does not involve coprocessors.
To write to more than one table in a MapReduce job, you have to specify that in the job configuration. You are right that this can be done using MultiTableOutputFormat.
Normally, for a single table, you would use something like:
TableMapReduceUtil.initTableReducerJob("tableName", MyReducer.class, job);
Instead of this, write:
// Route the job output to multiple tables instead of a single one
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setNumReduceTasks(2);
// Ship the HBase client jars with the job
TableMapReduceUtil.addDependencyJars(job);
TableMapReduceUtil.addDependencyJars(job.getConfiguration());
Now, at the time of writing data to a table, write:
context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName1")),put1);
context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName2")),put2);
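For completeness, a reducer along these lines might look as follows (a sketch only, assuming the mapper emits Text keys and values; the column family, qualifiers and table names are placeholders, HBase 1.x client API):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, ImmutableBytesWritable, Mutation> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        byte[] row = Bytes.toBytes(key.toString());

        // The record itself goes to table1; the output key names the destination table.
        Put put1 = new Put(row);
        put1.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("data"),
                Bytes.toBytes(values.iterator().next().toString()));
        context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName1")), put1);

        // The related counter/trigger update goes to table2.
        Put put2 = new Put(row);
        put2.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(1L));
        context.write(new ImmutableBytesWritable(Bytes.toBytes("tableName2")), put2);
    }
}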
For this you can use an HBase observer. You have to create an observer and deploy it on your server (applicable only for HBase versions > 0.92); it will automatically trigger the write to the other table.
And I think HBase observers are conceptually similar to aspects.
For more details -
https://blogs.apache.org/hbase/entry/coprocessor_introduction