Is there a clever HBase Schema to Aid with Discovering Missing Value?

Let's assume I have billions of rows in my HBase table. The rows in this table change slowly, meaning there will be new rowkeys and some rowkeys get deleted.
I receive lots of events per row. However, there will be very few rows that will not have any events associated with them.
At the end of the day I would like to report on the rows that have not received any events.
My naive solution would be to introduce a column cf:c that holds a flag and set the flag to 1 every time I see an event for that row. Then I would do a full scan of the table looking for rowkeys that are missing the column value. That seems like a waste, because I would be looking through 10 billion rows to discover a handful of rowkeys (we are talking about 100s or low 1000s).
Is there a clever way to design the hbase schema such that the rowkeys that are missing events could be found quickly (without going through every row)?

If I understood correctly, you have a rowkey xxxxyyyyzzzz1 ... xxxxyyyyzzzzn.
You have events for some rows and no events for other rows.
c is your flag that records whether a row has events, and the table is huge.
Rule of thumb in HBase: row key filters are generally faster and more efficient than column value filters, because searching for that flag by column value requires a full table scan.
Your approach to scan the entire table for missing events (column value filter) will lead to a full table scan and is not efficient.
Conclusion: You have to use a row key filter to scan such a big table.
So I'd suggest you write the flag into the row key. For example:
0 -- no events
1 -- there are events
xxxxyyyyzzzz1_0 // row with no events
xxxxyyyyzzzz1_1 // row with events
Now you can use a fuzzy row filter to capture missing event rows and take a report.
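To make the idea concrete, here is a minimal plain-Java sketch of the matching a fuzzy row filter performs. It does not call HBase at all; the class and method names are hypothetical, and it assumes fixed-width row keys that end in `_0` or `_1` as in the example above. The mask convention assumed here is 0 for "this byte must match the pattern" and 1 for "any byte is accepted".

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of fuzzy row-key matching; it does not use the HBase client.
class FuzzyRowKeySketch {

    // Assumed mask convention: 0 = byte must equal the pattern byte, 1 = any byte matches.
    static boolean matches(byte[] rowKey, byte[] pattern, byte[] mask) {
        if (rowKey.length != pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && rowKey[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the keys of the given fixed length whose last two characters are "_0".
    static List<String> rowsWithoutEvents(List<String> rowKeys, int keyLength) {
        byte[] pattern = new byte[keyLength];
        byte[] mask = new byte[keyLength];
        Arrays.fill(mask, (byte) 1);           // wildcard everywhere ...
        pattern[keyLength - 2] = (byte) '_';   // ... except the "_0" suffix
        pattern[keyLength - 1] = (byte) '0';
        mask[keyLength - 2] = 0;
        mask[keyLength - 1] = 0;

        List<String> result = new ArrayList<>();
        for (String key : rowKeys) {
            if (matches(key.getBytes(StandardCharsets.US_ASCII), pattern, mask)) {
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("xxxxyyyyzzzz1_0", "xxxxyyyyzzzz2_1", "xxxxyyyyzzzz3_0");
        System.out.println(rowsWithoutEvents(keys, 15));
    }
}
```

In a real table the filter runs on the region servers, so the benefit comes from skipping rows there rather than from a client-side loop like this one.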
See also option 2 in my answer to your other question, "Is there a clever HBase Schema to Aid with Discovering Missing Value?"
From my experience with HBase, there is no such thing.

Related

AWS DynamoDB - To use a GSI or Scan if I just wish to query the table by Date

I feel like I'm thinking myself in circles here. Maybe you all can help :)
Say I have this simple table design in DynamoDB:
Id | Employee | Created | SomeExtraMetadataColumns... | LastUpdated
Say my only use case is to find all the rows in this table where LastUpdated < (now - 2 hours).
Assume that 99% of the data in the table will not meet this criterion. Assume there is some job running every 15 mins that updates the LastUpdated column.
Assume there are, say, 100,000 rows and the table grows by maybe 1000 rows a day (no need for large write capacity).
Assume a single entity will be performing this 'read' use case (no need for large read capacity).
Options I can think of:
1) Do a scan.
Pro: can leverage parallel scans to scale in the future.
Con: wastes a lot of money reading rows that do not match the filter criteria.
2) Add a new column called 'Constant' that would always have the value 'Foo', and make a GSI with a Partition Key of 'Constant' and a Sort Key of LastUpdated. Then execute a query on this index for Constant = 'Foo' and LastUpdated < (now - 2 hours).
Pro: Only queries the rows matching the filter. No wasted money.
Con: In theory this would be plagued by the 'hot partition' problem if writes scale up. But I am unsure how much of a problem it will be, as AWS has described this problem as a thing of the past.
Honestly, I'm leaning toward the latter option. But I'm curious what the community's thoughts are on this. Perhaps I am missing something.
Based on the assumption that the last_updated field is the only field you need to query against, I would do something like this:
PK: EMPLOYEE::{emp_id}
SK: LastUpdated
Attributes: Employee, ..., Created
PK: EMPLOYEE::UPDATE
SK: LastUpdated::{emp_id}
Attributes: Employee, ..., Created
By denormalising your data like this, you get a separate 'update' row that can be queried with PK = EMPLOYEE::UPDATE and SK between 'datetime' and 'datetime'. This assumes you store the datetime as something like 2020-10-01T00:00:00Z, so that string order matches time order.
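The range query on the 'update' rows can be pictured with a small in-memory sketch. This is not DynamoDB client code: a sorted map stands in for the items under PK = EMPLOYEE::UPDATE, and the class and method names are hypothetical. It relies on the assumption stated above that timestamps are ISO-8601 UTC strings, which sort lexicographically in time order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// In-memory stand-in for the items stored under PK = EMPLOYEE::UPDATE.
class UpdatePartitionSketch {

    // Sort key "LastUpdated::{emp_id}" -> employee id, kept in sort-key order.
    private final TreeMap<String, String> updateItems = new TreeMap<>();

    // Mirrors writing the denormalised 'update' row alongside the main item.
    void recordUpdate(String empId, String lastUpdatedIso) {
        updateItems.put(lastUpdatedIso + "::" + empId, empId);
    }

    // Roughly: PK = 'EMPLOYEE::UPDATE' AND SK BETWEEN :from AND :to.
    // The upper bound is padded so keys that start with toIso are included.
    List<String> updatedBetween(String fromIso, String toIso) {
        return new ArrayList<>(
            updateItems.subMap(fromIso, true, toIso + "::\uffff", true).values());
    }

    public static void main(String[] args) {
        UpdatePartitionSketch table = new UpdatePartitionSketch();
        table.recordUpdate("e1", "2020-10-01T00:30:00Z");
        table.recordUpdate("e2", "2020-10-01T01:45:00Z");
        table.recordUpdate("e3", "2020-10-01T03:10:00Z");
        System.out.println(table.updatedBetween("2020-10-01T00:00:00Z", "2020-10-01T02:00:00Z"));
    }
}
```

Only the keys inside the requested range are touched, which is the property that makes the query cheaper than a scan.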
You can either insert this additional row here or you could consider utilising DynamoDB streams to stream update events to Lambda and then add the row from there. You can set a TTL on the 'update' row which will expire somewhere between 0 and 48 hours from the TTL you set keeping the table clean. It doesn't need to be instantly removed because you're querying based on the PK and SK anyway.
A scan is an absolute no-no on a table that size so I would definitely recommend against that. If it increases by 1,000 per day like you say then before long your scan would be unmanageable and would not scale. Even at 100,000 rows a scan is very bad.
You could also utilise DynamoDB Streams to stream your data out to data stores that are suitable for analytics, which is what I assume you're trying to achieve here. For example, you could stream the data to Redshift, RDS, etc. Those require a few extra steps and could benefit from Kinesis depending on the scale of updates, but it's something else to consider.
Ultimately there are quite a lot of options here. I'd start by investigating the denormalisation and then investigate other options. If you're trying to do analytics in DynamoDB I would advise against it.
PS: I nearly always call my PK and SK attributes PK and SK and have them as strings, so I can easily add different types of data or denormalisations to a table.
Definitely stay away from scan...
I'd look at a GSI with
PK: YYYY-MM-DD-HH
SK: MM-SS.mmmmmm
Now to get the records updated in the last two hours, you need only make three queries.
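As a rough sketch of why three queries are enough, the following computes the hourly partition keys that cover the last two hours. The class and method names are hypothetical, and it assumes the bucket keys are formatted in UTC.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Computes the YYYY-MM-DD-HH partition keys that overlap a trailing time window.
class HourBucketSketch {

    // Assumption: bucket keys are written in UTC.
    private static final DateTimeFormatter BUCKET =
        DateTimeFormatter.ofPattern("yyyy-MM-dd-HH").withZone(ZoneOffset.UTC);

    static List<String> bucketsForLastHours(Instant now, int hours) {
        List<String> keys = new ArrayList<>();
        // First bucket: the hour containing (now - hours); last bucket: the current hour.
        Instant cursor = now.minus(hours, ChronoUnit.HOURS).truncatedTo(ChronoUnit.HOURS);
        Instant last = now.truncatedTo(ChronoUnit.HOURS);
        while (!cursor.isAfter(last)) {
            keys.add(BUCKET.format(cursor));
            cursor = cursor.plus(1, ChronoUnit.HOURS);
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(bucketsForLastHours(Instant.now(), 2));
    }
}
```

Each key would be used as the PK of one query against the GSI. The oldest bucket also contains items older than the cutoff, so that query would additionally need a condition on the sort key.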

DynamoDB - Partition grouping or sharding?

So, looking through the DynamoDB docs, they'll often recommend that you "group" together items that are related in the same partition, so as to better distribute your partition usage.
Take the following example, where we have a user that has contacts and invoices inside its partition:
So, if I need all of user_001's invoices I will simply query (pseudo):
QUERY WHERE PartitionKey = "user_001" AND SortKey.begins_with("invoice_")
But I recently noticed there's quite an issue when you use the method above.
You see, DynamoDB will search inside the whole user_001 partition for the invoices, and will consume read capacity based on all items searched, whether they were invoices or not.
This can end up being very inefficient if you have a partition that is too big. Let's say I had 10,000 contacts and 2 invoices: it could end up being very costly to get those 2 invoices.
I'm assuming this based on the quote by the docs :
DynamoDB calculates the number of read capacity units consumed based on
item size, not on the amount of data that is returned to an
application
The solution :
Wouldn't this be a better approach?
1) It shards the data better so I don't need to use begins_with
2) It allows me to use a time-based uuid as the sort key and enable more complex ordering/pagination
3) I will consume much less capacity on queries since it won't have to go through items I don't need
What's the question?
Well, what I said above is just theory and assumption. The documentation doesn't make it clear how it really works behind the scenes, and it even recommends using picture 1.
But I really think picture 2 is the best here, especially when you consider that DynamoDB now smartly distributes capacity throughout your partitions (and not evenly like it used to).
So, are my reasons for thinking picture 2 is much better than picture 1 valid?
You have assumed incorrectly—the documentation you have quoted applies to filter expressions.
If you have a condition that applies to your sort key, that should be part of the query expression, not a filter expression.

Best way to update a column of a table of tens of millions of rows

Question
What is the best way to update a column of a table of tens of millions of rows?
1) I saw creating a new table and renaming the old one when finished
2) I saw updating in batches using a temp table
3) I saw a single transaction (don't like this one though)
4) I have never heard of a cursor solution for a problem like this, and I think it's not worth trying
5) I read about loading data from a file (using BCP), but have not read whether the performance is better or not. It was not clear if it is just for copying or if it would allow joining a big table with something and then bulk copying.
I would really like some advice here.
Priority is performance.
At the moment I'm testing solution 2) and exploring solution 5).
Additional Information (UPDATE)
Thank you for the critical thinking here.
The operation can be done in downtime.
The UPDATE will not cause row forwarding.
All the tables have indexes, 5 on average, although a few tables have as many as 13 indexes.
The probability that the target column is present in one of the table's indexes is something like 50%.
Some tables can be rebuilt and replaced; others can't, because they are part of a software solution and we might lose support for them. Some of those tables have triggers.
I'll need to do this for more than 600 tables, of which ~150 range from 0.8 million to 35 million rows.
The update is always in the same column in the various fields
References
BCP for data transfer
Actually it depends:
on the number of indexes the table contains
the size of the row before and after the UPDATE operation
type of UPDATE - would it be in place? does it need to modify the row length
does the operation cause row forwarding?
how big is the table?
how big would the transaction log of the UPDATE command be?
does the table contain triggers?
can the operation be done in downtime?
will the table be modified during the operation?
are minimal logging operations allowed?
would the whole UPDATE transaction fit in the transaction log?
can the table be rebuilt & replaced with a new one?
what was the timing of the operation on the test environment?
what about free space in the database - is there enough space for a copy of the table?
what kind of UPDATE operation is to be performed? do additional SELECT commands have to be run to calculate the new value of every row? or is it a static change?
Depending on the answers and the results of the operation in the test environment we could consider the fastest operations to be:
minimal logging copy of the table
an in place UPDATE operation preferably in batches

Scanning DynamoDB table while inserting

When we scan a DynamoDB table, we can/should use LastEvaluatedKey to track the progress so that we can resume in case of failures. The documentation says that
LastEvaluatedKey is the primary key of the item where the operation stopped, inclusive of the previous result set. Use this value to start a new operation, excluding this value in the new request.
My question is if I start a scan, pause, insert a few rows and resume the scan from the previous LastEvaluatedKey, will I get those new rows after resuming the scan?
My guess is I might miss some or all of the new rows, because the new keys will be hashed and the hashed values could be smaller than LastEvaluatedKey.
Is my guess right? Any explanation or documentation links are appreciated.
It is going sequentially through your data, and it does not know about all items that were added in the process:
Scan operations proceed sequentially; however, for faster performance
on a large table or secondary index, applications can request a
parallel Scan operation by providing the Segment and TotalSegments
parameters.
Not only can it miss some of the items that were added after you started scanning, it can also miss some of the items that were added before the scan started if you are using eventually consistent reads:
Scan uses eventually consistent reads when accessing the data in a
table; therefore, the result set might not include the changes to data
in the table immediately before the operation began.
If you need to keep track of items that were added after you've started a scan you can use DynamoDB streams for that.

How to Scan HBase Rows efficiently

I need to write a MapReduce job that gets all rows in a given date range (say, the last month). It would have been a cakewalk had my row key started with the date. But my frequent HBase queries are on the starting values of the key.
My row key is exactly A|B|C|20120121|D, where the combination of A/B/C along with the date (in YearMonthDay format) makes a unique row ID.
My HBase tables could have up to a few million rows. Should my Mapper read the whole table and check for each row whether it falls in the given date range, or can a Scan / Filter help in this situation?
Could someone suggest a way (or a snippet of code) to handle this situation effectively?
Thanks
-Panks
A RowFilter with a RegEx Filter would work, but would not be the most optimal solution. Alternatively you can try to use secondary indexes.
One more solution is to try the FuzzyRowFilter. A FuzzyRowFilter uses a kind of fast-forwarding, skipping many rows in the overall scan process, and will thus be faster than a RowFilter scan. You can read more about it here.
Alternatively BloomFilters might also help depending on your schema. If your data is huge you should do a comparative analysis on secondary index and Bloom Filters.
You can use a RowFilter with a RegexStringComparator. You'd need to come up with a RegEx that filters your dates appropriately. This page has an example that includes setting a Filter for a MapReduce scanner.
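As a hedged illustration of what such a regular expression could look like for the A|B|C|yyyyMMdd|D key layout from the question, here is a plain `java.util.regex` sketch for one calendar month. The class and method names are hypothetical, and it assumes the A, B and C parts never contain a `|` character. The resulting pattern string is what one would hand to the comparator.

```java
import java.util.regex.Pattern;

// Builds a regex that matches row keys of the form A|B|C|yyyyMMdd|D for one month.
class RowKeyDateRegexSketch {

    // yearMonth is expected as yyyyMM, for example "201201".
    static Pattern forMonth(String yearMonth) {
        // Three pipe-free fields, then the month prefix plus a two-digit day, then the rest.
        return Pattern.compile(
            "^[^|]+\\|[^|]+\\|[^|]+\\|" + Pattern.quote(yearMonth) + "\\d{2}\\|.*$");
    }

    public static void main(String[] args) {
        Pattern january2012 = forMonth("201201");
        System.out.println(january2012.matcher("A|B|C|20120121|D").matches());
        System.out.println(january2012.matcher("A|B|C|20120221|D").matches());
    }
}
```

A date range that is not aligned to whole months needs a more involved expression, which is one reason the regex route can get messy.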
I am just getting started with HBase, bloom filters might help.
You can modify the Scan that you send into the Mapper to include a filter. If your date is also the record timestamp, it's easy:
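// Restrict the scan to cells whose timestamps fall in [minTime, maxTime)
Scan scan = new Scan();
scan.setTimeRange(minTime, maxTime);
// Hand the configured scan to the MapReduce job
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyTableMapper.class,
    OutputKey.class, OutputValue.class, job);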
If the date in your row key is different, you'll have to add a filter to your scan. This filter can operate on a column or a row key. I think it's going to be messy with just the row key. If you put the date in a column, you can make a FilterList where all conditions must be true and use a CompareOp.GREATER and a CompareOp.LESS. Then use scan.setFilter(filterList) to add your filters to the scan.
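If the date stays only in the row key, the simplest fallback is the one the question mentions: check each key in the mapper. A minimal sketch of that check, with hypothetical class and method names and assuming the A|B|C|yyyyMMdd|D layout, could look like this. Because yyyyMMdd strings sort in date order, plain string comparison is enough.

```java
// Extracts the date field from an A|B|C|yyyyMMdd|D row key and tests it against a range.
class RowKeyDateRangeSketch {

    static String dateOf(String rowKey) {
        // The -1 limit keeps trailing empty fields, so an empty D part is still counted.
        String[] parts = rowKey.split("\\|", -1);
        if (parts.length < 5) {
            throw new IllegalArgumentException("Unexpected row key layout: " + rowKey);
        }
        return parts[3];
    }

    // Both bounds are inclusive and given as yyyyMMdd strings.
    static boolean inRange(String rowKey, String fromInclusive, String toInclusive) {
        String date = dateOf(rowKey);
        return date.compareTo(fromInclusive) >= 0 && date.compareTo(toInclusive) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(inRange("A|B|C|20120121|D", "20120101", "20120131"));
    }
}
```

This still reads every row, so it only makes sense when the table is small enough (a few million rows, as in the question) that a full pass is acceptable.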