HBase: How to delete rows using Map-Reduce

I want to know how to delete specific rows in HBase using Map-Reduce?

The easiest way is to create a map task that does a delete, no reduce necessary.
In your job configuration, set up a Scan object with the conditions you want, restricted to the HBase column families/qualifiers that you will use to decide whether or not to delete a row.
Alternatively, you could put the conditions in the map, but that would be much less efficient. The nice part about putting the conditions in the scan is that the comparisons are done on the server, not the client. The hard part is that you either have to make do with the built-in comparators, or ensure that your custom comparator is on the classpath of all of the region servers.

Related

Best practice for using Force_Index on spanner

I have a client application which queries data in Spanner.
Let's say I have a table with 10 columns and my client application can search on a combination of columns. Let's say I've added 5 indexes to optimise searching.
According to https://cloud.google.com/spanner/docs/sql-best-practices#secondary-indexes
it says:
In this scenario, Spanner automatically uses the secondary index SingersByLastName when executing the query (as long as three days have passed since database creation; see A note about new databases). However, it's best to explicitly tell Spanner to use that index by specifying an index directive in the FROM clause:
And also https://cloud.google.com/spanner/docs/secondary-indexes#index-directive suggests
When you use SQL to query a Spanner table, Spanner automatically uses any indexes that are likely to make the query more efficient. As a result, you don't need to specify an index for SQL queries. However, for queries that are critical for your workload, Google advises you to use FORCE_INDEX directives in your SQL statements for more consistent performance.
Both links suggest that YOU (the developer) should be supplying FORCE_INDEX on your queries. This means I now need business logic in my client to say something like:
if (object.SearchTermOne)
    queryBuilder.IndexToUse = "Idx_SearchTermOne"
This feels like I'm essentially trying to do the job of the optimiser by setting the index to use. It also means that if I add an extra index, I need a code change to make use of it.
So what are the best practices when it comes to using Force_Index in spanner queries?
At this time, the best practice is to use FORCE_INDEX as described in the documentation.
This feels like I'm essentially trying to do the job of the optimiser by setting the index to use..
I feel the same.
https://cloud.google.com/spanner/docs/secondary-indexes#index-directive
Note: The query optimizer requires up to three days to collect the databases statistics required to select a secondary index for a SQL query. During this time, Cloud Spanner will not automatically use any indexes.
As the note says, even once enough data has been added for the index to be effective, the optimizer may take up to three days to start choosing it.
Queries during that time will probably be full scans.
If you want to avoid this without using FORCE_INDEX, you will need to run the ANALYZE DDL statement manually.
https://cloud.google.com/blog/products/databases/a-technical-overview-of-cloud-spanners-query-optimizer
But none of this changes the fact that we are essentially trying to do the optimizer's job...
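If the directive has to live in client code anyway, one way to contain it is a single lookup table from searchable column to index name, so that adding an index means adding one entry rather than another branch. The following is only a minimal sketch: the table name Items, the second index name, and the helper build_query are hypothetical, and it only builds the SQL text and parameters rather than calling any Spanner client.

```python
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical mapping: searchable column -> secondary index that covers it.
# Idx_SearchTermOne comes from the question; the other entry is made up.
INDEX_FOR_COLUMN: Dict[str, str] = {
    "SearchTermOne": "Idx_SearchTermOne",
    "SearchTermTwo": "Idx_SearchTermTwo",
}


def build_query(table: str, filters: Mapping[str, object]) -> Tuple[str, Dict[str, object]]:
    """Return (sql, params) for an equality search on the given columns.

    Column and table names are interpolated into the SQL text, so they must
    come from trusted code, never from user input. Values go in as parameters.
    """
    # Pick the first filtered column that has a known index, if any.
    index: Optional[str] = next(
        (INDEX_FOR_COLUMN[col] for col in filters if col in INDEX_FOR_COLUMN),
        None,
    )
    # Table hint syntax: FROM Table@{FORCE_INDEX=IndexName}
    source = f"{table}@{{FORCE_INDEX={index}}}" if index else table
    where = " AND ".join(f"{col} = @{col}" for col in filters)
    sql = f"SELECT * FROM {source}"
    if where:
        sql += f" WHERE {where}"
    return sql, dict(filters)


sql, params = build_query("Items", {"SearchTermOne": "abc"})
print(sql)
```

When no filtered column has an entry in the mapping, the query is emitted without a hint and the choice is left to the optimizer.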

Informatica Cloud (IICS): Can Target transformation pre and post sql include incoming data?

I have a Target transformation set to insert, and I'd like it to delete the row before insertion (a weird niche case that may pop up).
I know the update override allows for :TU.xyz to point at incoming data, but Pre/Post SQL doesn't have the same configure menu.
How would I accomplish this correctly?
From what I recall, Pre- and Post-SQL use a separate connection, so there is no way of referring to incoming data.
One thing you could do is flag or store the key somewhere and use that flag/instance in the Post SQL query, for example.
Maciejg is correct: there is no dynamic use of Pre and Post SQL.
I would normally recommend an Upsert approach.
However, I found that with an MS SQL target, IICS has a bug when doing Insert and Update off a Router. The workaround of using a data-driven operation removes batch loading on your insert, so I now recommend a full data load approach.
From a target with the operation set to Insert, I do batch deletes with Pre SQL.
I found this faster and more cost-effective than doing delete/insert/update operations individually.

DynamoDB insert timestamps with trigger vs in a put/post request

I have two small DynamoDB tables with about 10 attributes each, and I want to add "CreatedDate" and "ModifiedDate" attributes to them. I am trying to decide which approach gives the lowest cost and the best performance and reusability.
First, I was thinking of creating a trigger that adds these attributes whenever there is an update or create operation on the table. I like this way because it is centralized. However, I am not sure it is the cheapest way to do it, because after a new item is written to the table, the trigger will do another write operation to insert the dates.
Second, I could just send these values in the "PUT" request as new attributes. That way, I only have to do one write operation. The downside is that I will need to update every function that writes an item to these tables.
Which way should I go in this case? Are there any better ways to do it, or is there anything I am missing?
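The second option can still be kept in one place if every writer goes through a shared helper. Below is a minimal sketch under assumptions: build_timestamped_update is a hypothetical helper, the key name pk is made up, and the code only builds the arguments for an UpdateItem call rather than talking to DynamoDB. It relies on the if_not_exists function in update expressions so that CreatedDate is set on the first write only, while ModifiedDate changes on every write.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_timestamped_update(
    key: Dict[str, Any],
    attributes: Dict[str, Any],
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a single UpdateItem request.

    `attributes` must not contain key attributes, since those cannot be
    changed by an update. `now` is injectable to make the helper testable.
    """
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    names = {"#created": "CreatedDate", "#modified": "ModifiedDate"}
    values: Dict[str, Any] = {":now": stamp}
    assignments = [
        # Keep the existing CreatedDate if the item already has one.
        "#created = if_not_exists(#created, :now)",
        "#modified = :now",
    ]
    # Placeholders avoid clashes with DynamoDB reserved words.
    for i, (attr, value) in enumerate(attributes.items()):
        names[f"#a{i}"] = attr
        values[f":v{i}"] = value
        assignments.append(f"#a{i} = :v{i}")
    return {
        "Key": key,
        "UpdateExpression": "SET " + ", ".join(assignments),
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }


kwargs = build_timestamped_update({"pk": "user#1"}, {"name": "Ann"})
print(kwargs["UpdateExpression"])
```

With boto3, the result would be passed along the lines of table.update_item(**kwargs). Because UpdateItem creates the item when it does not exist, this stays at one write per change.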

BigQuery tabledata:list output into a bigquery table

I know there is a way to place the results of a query into a table; there is a way to copy a whole table into another table; and there is a way to list a table piecemeal (tabledata:list using startIndex, maxResults and pageToken).
However, what I want to do is go over an existing table with tabledata:list and output the results piecemeal into other tables. I want to use this as an efficient way to shard a table.
I cannot find a reference to such functionality, or any workaround for that matter.
Important to realize: the Tabledata.List API is not part of BQL (BigQuery SQL) but rather part of the BigQuery API, which you can use from the client of your choice.
That said, the logic you outlined in your question can be implemented in many ways. Below is an example (high-level steps):
1. Call Tabledata.List within a loop, using pageToken for the next iteration or for exiting the loop.
2. In each iteration, process the response from Tabledata.List, extract the actual data, and insert it into the destination table as streaming data with the Tabledata.InsertAll API. You can also have an inner loop that goes through the rows extracted in the given iteration and decides which one goes to which table/shard.
This is very generic logic, and the particular implementation depends on the client you use.
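The loop described above can be sketched roughly as follows. This is not BigQuery client code: fetch_page and insert_rows are hypothetical callables standing in for whatever wraps Tabledata.List and Tabledata.InsertAll in your client, and routing rows by a CRC of a key column is just one assumed way of picking a shard.

```python
import zlib
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Row = Dict[str, object]
# Hypothetical page fetcher: takes a page token (None for the first page)
# and returns (rows, next_page_token); the token is None on the last page.
PageFetcher = Callable[[Optional[str]], Tuple[List[Row], Optional[str]]]
# Hypothetical writer: receives a shard number and a batch of rows.
RowInserter = Callable[[int, List[Row]], None]


def iter_pages(fetch_page: PageFetcher) -> Iterator[List[Row]]:
    """Yield pages of rows until the source stops returning a page token."""
    token: Optional[str] = None
    while True:
        rows, token = fetch_page(token)
        yield rows
        if token is None:
            break


def shard_for(row: Row, key: str, num_shards: int) -> int:
    """Pick a shard deterministically from the value of one column."""
    return zlib.crc32(str(row[key]).encode("utf-8")) % num_shards


def shard_table(
    fetch_page: PageFetcher,
    insert_rows: RowInserter,
    key: str,
    num_shards: int,
) -> List[int]:
    """Copy every row into one of num_shards destinations; return row counts."""
    counts = [0] * num_shards
    for rows in iter_pages(fetch_page):
        # Inner loop: group this page's rows by destination shard.
        batches: Dict[int, List[Row]] = {}
        for row in rows:
            batches.setdefault(shard_for(row, key, num_shards), []).append(row)
        # One insert call per shard per page.
        for shard, batch in batches.items():
            insert_rows(shard, batch)
            counts[shard] += len(batch)
    return counts
```

Batching per page keeps the number of insert calls at most num_shards per page instead of one per row.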
Hope this helps
For what you describe, I'd suggest you use the batch version of Cloud Dataflow:
https://cloud.google.com/dataflow/
Dataflow already supports BigQuery tables as sources and sinks, and will keep all data within Google's network. This approach also scales to arbitrarily large tables.
TableData.list-ing your entire table might work fine for small tables, but network overhead aside, it is definitely not recommended for anything of moderate size.

Hbase Update/Insert using Get/Put

Could anyone advise on the best way to meet my requirement?
I have the following:
An HBase table
An input file in HDFS
My requirement is as follows:
1. Read the input file and fetch the key. Using the key, get the data from HBase.
2. Do a comparison to check.
3. If the comparison fails, insert.
4. If the comparison is successful, update.
I know I can use get to fetch the data and put to write it back. Is this the best way forward? I hope to use MapReduce so that the process runs in parallel.
HBase has checkAndPut() and checkAndDelete() operations, which allow you to perform a put or a delete only if the cell holds the value you expect (compare=NO_OP if you don't care about the value but just about the key).
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
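To make the conditional-write idea concrete, here is a toy in-memory model of it. It is not HBase client code: InMemoryTable and insert_or_update are hypothetical names, the table is just a dictionary, and the only behaviour modelled is "apply the write only if the current cell value equals the expected one, where an expected value of None means the cell must be absent".

```python
import threading
from typing import Dict, Optional


class InMemoryTable:
    """Toy stand-in for a table: row key -> {column: value}."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}
        self._lock = threading.Lock()

    def get(self, row: str) -> Dict[str, str]:
        with self._lock:
            return dict(self._rows.get(row, {}))

    def put(self, row: str, cells: Dict[str, str]) -> None:
        with self._lock:
            self._rows.setdefault(row, {}).update(cells)

    def check_and_put(
        self, row: str, column: str, expected: Optional[str], cells: Dict[str, str]
    ) -> bool:
        """Write `cells` only if row[column] == expected; return whether it was written.

        The check and the write happen under one lock, so no other writer
        can slip in between them.
        """
        with self._lock:
            current = self._rows.get(row, {}).get(column)
            if current != expected:
                return False
            self._rows.setdefault(row, {}).update(cells)
            return True


def insert_or_update(table: InMemoryTable, row: str, column: str, value: str) -> str:
    """Insert the cell if it is absent, otherwise overwrite it."""
    if table.check_and_put(row, column, None, {column: value}):
        return "inserted"
    # The cell already exists: fall back to a plain, unconditional put.
    table.put(row, {column: value})
    return "updated"


table = InMemoryTable()
print(insert_or_update(table, "row1", "cf:q", "a"))
print(insert_or_update(table, "row1", "cf:q", "b"))
```

The point of the conditional form is that the read and the comparison do not have to be a separate get followed by a put in client code.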
Depending on the size of your problem, I actually recommend a slightly different approach here. While it is probably feasible to implement HBase puts inside a MapReduce job, it sounds like a rather complex task.
I'd recommend loading the data from HBase into MapReduce, joining the two tables, and then exporting the result back into HBase.
Using Pig this would be rather easy to achieve.
Take a look at Pig HBaseStorage.
Going this route, you'd load both files, join them, and then write back to HBase. If all there is to it is comparing keys, this can be achieved in 5 lines of PigLatin.
HTH