How to Scan HBase Rows efficiently - mapreduce

I need to write a MapReduce job that gets all rows in a given date range (say, the last month). It would have been a cakewalk had my row key started with the date, but my frequent HBase queries are on the leading values of the key.
My row key is exactly A|B|C|20120121|D, where the combination of A, B, and C along with the date (in YearMonthDay format) makes a unique row ID.
My HBase tables could have up to a few million rows. Should my mapper read the whole table and filter each row by whether it falls in the given date range, or can a Scan/Filter handle this situation?
Could someone suggest a way (or a snippet of code) to handle this situation effectively?
Thanks
-Panks

A RowFilter with a regex comparator would work, but it would not be the optimal solution. Alternatively, you can try secondary indexes.
One more option is the FuzzyRowFilter. A FuzzyRowFilter uses a kind of fast-forwarding, skipping many rows in the overall scan, and will therefore be faster than a plain RowFilter scan. You can read more about it here.
Bloom filters might also help, depending on your schema. If your data is huge, you should do a comparative analysis of secondary indexes and Bloom filters.
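For illustration, here is a minimal FuzzyRowFilter sketch. It assumes the A|B|C prefix is fixed-width so the date always starts at the same byte offset (FuzzyRowFilter needs that); the key layout, widths, and names below are assumptions, not taken from the question.

import java.util.Arrays;
import java.util.Collections;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

// Hypothetical fixed-width key: AAAA|BBBB|CCCC|yyyyMMdd|D (25 bytes).
// Mask semantics: 0 = this byte must match the template, 1 = this byte can be anything.
byte[] template = Bytes.toBytes("AAAA|BBBB|CCCC|201201??|D"); // only "201201" matters below
byte[] mask = new byte[template.length];
Arrays.fill(mask, (byte) 1);                 // default: every byte is "don't care"
for (int i = 15; i < 21; i++) {
    mask[i] = 0;                             // bytes 15-20 ("201201") must match: January 2012
}
Scan scan = new Scan();
scan.setFilter(new FuzzyRowFilter(
        Collections.singletonList(new Pair<byte[], byte[]>(template, mask))));

For an arbitrary date range you would add one template/mask pair per day (or per month) in the range to the list.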

You can use a RowFilter with a RegexStringComparator. You'd need to come up with a RegEx that filters your dates appropriately. This page has an example that includes setting a Filter for a MapReduce scanner.
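A rough sketch of that approach, assuming the key is the pipe-delimited text shown in the question and that January 2012 is the wanted range; the regex is still evaluated against every row server-side, but only matching rows reach the mapper. The mapper/output class names are placeholders.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

// Match keys whose fourth pipe-delimited field is a date in January 2012.
Scan scan = new Scan();
scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
        new RegexStringComparator("^[^|]+\\|[^|]+\\|[^|]+\\|201201\\d{2}\\|.*")));
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyTableMapper.class,
        OutputKey.class, OutputValue.class, job);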

I am just getting started with HBase, but Bloom filters might help.

You can modify the Scan that you send into the Mapper to include a filter. If your date is also the record timestamp, it's easy:
Scan scan = new Scan();
scan.setTimeRange(minTime, maxTime);
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyTableMapper.class,
        OutputKey.class, OutputValue.class, job);
If the date in your row key is different, you'll have to add a filter to your scan. This filter can operate on a column or a row key. I think it's going to be messy with just the row key. If you put the date in a column, you can make a FilterList where all conditions must be true and use a CompareOp.GREATER and a CompareOp.LESS. Then use scan.setFilter(filterList) to add your filters to the scan.
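A minimal sketch of that FilterList approach, assuming the date is duplicated into a column (here a hypothetical d:date qualifier holding yyyyMMdd strings) and reusing the placeholder mapper/output classes from the snippet above:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;

// Keep only rows whose d:date value lies strictly between the two bounds.
// yyyyMMdd strings sort lexicographically in date order, so byte comparison works here.
Scan scan = new Scan();
FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filterList.addFilter(new SingleColumnValueFilter(Bytes.toBytes("d"), Bytes.toBytes("date"),
        CompareFilter.CompareOp.GREATER, Bytes.toBytes("20111231")));
filterList.addFilter(new SingleColumnValueFilter(Bytes.toBytes("d"), Bytes.toBytes("date"),
        CompareFilter.CompareOp.LESS, Bytes.toBytes("20120201")));
scan.setFilter(filterList);
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyTableMapper.class,
        OutputKey.class, OutputValue.class, job);

Note that rows missing the d:date column pass a SingleColumnValueFilter by default; call setFilterIfMissing(true) on each filter if those rows should be dropped.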

Related

How to get row count for large dataset in Informatica?

I am trying to get the row count for a dataset with 280 fields without affecting performance. Looking for the best possible ways to do this.
The better option to avoid a performance issue is to use a Sorter transformation to sort the columns and pass the pipeline to an Aggregator transformation. In the Aggregator transformation, check the Sorted Input option.
If your source is a database, index the required conditional columns in the table and also partition the table if required.
For your solution, I have 2 options in mind:
Using an Aggregator (remember to use a predefined ORDER BY to improve performance with the next transformation): SQ > Aggregator > Target. Inside the Aggregator, add new ports with the sum() and/or count() functions. Remember to select the columns to group by.
Check out this example:
https://www.guru99.com/aggregator-transformation-informatica.html
Using a Source Qualifier query override. Use a traditional SELECT count/sum with GROUP BY against the database: SQ > Target.
By the way, Informatica handles performance very well; rather than the number of columns, you need to review how many records you are processing. A best practice is always to put the load on the data source/database rather than on the Informatica app.
Regards,
Juan
If all you need is to count the rows, use the Aggregator. That's what it's for. However, this will create a cache; to limit its size, use a single port.
To avoid caching, you can use a variable in an Expression transformation and just increment it. This, however, will give you an extra column with all rows numbered rather than a single value, so you'll still need to aggregate it. Here it would be possible to use an Aggregator with no function to return just the last value.

Is there a clever HBase Schema to Aid with Discovering Missing Value?

Let's assume I have billions of rows in my HBase table. The rows in this table change slowly, meaning there will be new rowkeys and some rowkeys get deleted.
I receive lots of events per row. However, there will be very few rows that will not have any events associated with them.
At the end of the day I would like to report on the rows that have not received any events.
My naive solution would be to introduce a column cf:c that holds a flag and set the flag to 1 every time I see an event for the row. Then do a full scan of the table looking for row keys that are missing the column value. That seems like a waste, because I would be looking through 10 billion rows to discover a handful of row keys (we are talking about hundreds or low thousands).
Is there a clever way to design the hbase schema such that the rowkeys that are missing events could be found quickly (without going through every row)?
If I understood correctly, you have a rowkey xxxxyyyyzzzz1 ... xxxxyyyyzzzzn.
You have events for some rows and no events for other rows.
cf:c is your flag for knowing whether events are there or not, and you have huge data.
Rule of thumb in HBase: row key filters are generally faster and more efficient than column value filters (searching for that flag requires a full table scan).
Your approach of scanning the entire table for missing events (a column value filter) leads to a full table scan and is not efficient.
Conclusion: you have to use a row key filter to scan such a big table.
So I'd suggest you write the flag into the row key. For example:
0 -- no events
1 -- there are events
xxxxyyyyzzzz1_0 // row with no events
xxxxyyyyzzzz1_1 // row with events
Now you can use a fuzzy row filter to capture missing event rows and take a report.
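A short sketch of that fuzzy scan, assuming fixed-length row keys that end in _0 / _1 as above (the 12-character key body is a made-up width):

import java.util.Arrays;
import java.util.Collections;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

// Mask: 1 = any byte, 0 = must match. Only the trailing "_0" (no events) is fixed.
byte[] template = Bytes.toBytes("????????????_0");
byte[] mask = new byte[template.length];
Arrays.fill(mask, (byte) 1);
mask[template.length - 2] = 0;   // '_'
mask[template.length - 1] = 0;   // '0'
Scan scan = new Scan();
scan.setFilter(new FuzzyRowFilter(
        Collections.singletonList(new Pair<byte[], byte[]>(template, mask))));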
See option 2 of your other question, which I answered:
Is there a clever HBase Schema to Aid with Discovering Missing Value?
From my experience with HBase, there is no such thing.

Partitioning a table in sybase-select query

My main concern:
I have an existing table with huge data. It has a clustered index.
My C++ process has a list of many keys. For each key it checks whether the key exists in the table,
and if it does, it then checks whether the row in the table and the new row are similar. If there is a change, the new row is updated in the table.
In general there will be few changes, but there is huge data in the table.
So it means there will be a lot of select queries but not many update queries.
What I would I like to achieve:
I just read about partitioning a table in Sybase here.
I just wanted to know whether this will be helpful for me, as the article mentions only insert queries. How can I improve my select query performance?
Could anyone please suggest what should I look for in this case?
Yes it will improve your query (read) performance so long as your query is based on the partition keys defined. Indexes can also be partitioned and it stands to reason that a smaller index will mean faster read performance.
For example if you had a query like select * from contacts where lastName = 'Smith' and you have partitioned your table index based on first letter of lastName, then the server only has to search one partition "S" to retrieve its results.
Be warned that partitioning your data can be difficult if you have a lot of different query profiles. Queries that do not include the index partition key (e.g. lastName), such as select * from staff where created > [some_date], will then have to hit every index partition in order to retrieve their result set.
No one can tell you what you should or shouldn't do, as it is very application specific and you will have to perform your own analysis. Before meddling with partitions, my advice is to ensure you have the correct indexes in place, that they are being hit by your queries (i.e. no table scans), that your server is appropriately resourced (i.e. it has enough fast disk and RAM), and that you have tuned your server caches to suit your queries.

nosql/dynamodb hash and range use case

It's my first time using a NoSQL database so I'm really confused. I'd really appreciate any help I can get.
I want to store data comprising announcements in my table. Essentially, each announcement has an ID, a date, and a text.
So for example, an announcement might have ID of 1, date of 2014/02/26, and text of "This is a sample announcement". Newer announcements always have a greater ID value than older announcements, since they are added to the table later.
There are two types of queries I want to run on this table:
I want to retrieve the text of the announcements sorted in order of date.
I want to retrieve the text and dates of the x most recent announcements (say, the 3 most recent announcements).
So I've set up the table with the following attributes:
ID (number) as primary key, and
date (string) as range
Is this appropriate for my use cases? And if so, what kind of query/read/request/scan (I'm really confused about the terminology here too) should I be running to accomplish the two types of queries I want to make?
Any help will be very much appreciated. Thanks!
You are on the right track.
As far as sorting, DynamoDB will sort by the range key, so date will work but I'd recommend storing it as a number, perhaps milliseconds since the Unix epoch, rather than a String. This will make it trivial to get the announcements in ascending or descending order based on their created date.
See this answer for an overview of local vs global secondary indexes and what capabilities they provide: Optional secondary indexes in DynamoDB
As far as retrieving all items, you would need to perform a scan. Scans are not as efficient as queries, but since all of Dynamo is on SSDs, they're still relatively quick. You don't get the single-digit-millisecond performance with a scan that you get with a query, so if there's a way to associate announcements with a user ID, you might get better performance than with a scan.
Note that you cannot modify the table schema (hash key, range key, and indexes) after you create the table. There are ways to manually migrate a table or import/export it, but the point is that you should think hard about current and future query requirements up front and design the table to support them. It's very easy to add or stop storing non-key or non-item attributes though, which provides nice flexibility.
Finally, try to avoid thinking of Dynamo as relational. With Dynamo, in a lot of cases you may well be better off denormalizing or duplicating some of the data in exchange for fast query performance.
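As an illustration only, here is a sketch with the AWS SDK for Java. It assumes the design is changed so that all announcements share one constant hash key (e.g. a category attribute set to "ALL") and the numeric date is the range key; the table and attribute names are made up:

import java.util.HashMap;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;

AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().build();

Map<String, AttributeValue> values = new HashMap<String, AttributeValue>();
values.put(":pk", new AttributeValue().withS("ALL"));

// Query the partition in descending range-key (date) order and stop after 3 items,
// i.e. the 3 most recent announcements.
QueryRequest request = new QueryRequest()
        .withTableName("announcements")
        .withKeyConditionExpression("category = :pk")
        .withExpressionAttributeValues(values)
        .withScanIndexForward(false)
        .withLimit(3);

for (Map<String, AttributeValue> item : client.query(request).getItems()) {
    System.out.println(item.get("date").getN() + " : " + item.get("text").getS());
}

Dropping withLimit and using withScanIndexForward(true) would return the announcements in ascending date order instead (paginating via the returned LastEvaluatedKey for large result sets).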

Efficiently processing all data in a Cassandra Column Family with a MapReduce job

I want to process all of the data in a column family in a MapReduce job. Ordering is not important.
One approach is to iterate over all the row keys of the column family to use as the input. This could potentially be a bottleneck and could be replaced with a parallel method.
I'm open to other suggestions, or for someone to tell me I'm wasting my time with this idea. I'm currently investigating the following:
A potentially more efficient way is to assign ranges to the input instead of iterating over all row keys (before the mapper starts). Since I am using RandomPartitioner, is there a way to specify a range to query based on the MD5?
For example, I want to split the task into 16 jobs. Since the RandomPartitioner is MD5 based (from what I have read), I'd like to query everything starting with a for the first range. In other words, how would I do a get_range on the MD5 that starts at a and ends before b, e.g. a0000000000000000000000000000000 - afffffffffffffffffffffffffffffff?
I'm using the pycassa API (Python) but I'm happy to see Java examples.
I'd cheat a little:
Create new rows job_(n) with each column representing each row key in the range you want
Pull all columns from that specific row to indicate which rows you should pull from the CF
I do this with users. Users from a particular country get a column in the country specific row. Users with a particular age are also added to a specific row.
This allows me to quickly pull the rows I need based on the criteria I want, and it is a little more efficient compared to pulling everything.
This is how the Mahout CassandraDataModel example functions:
https://github.com/apache/mahout/blob/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/cassandra/CassandraDataModel.java
Once you have the data and can pull the rows you are interested in, you can hand it off to your MR job(s).
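The question mentions pycassa, but to stay with Java, here is a rough sketch of reading such an index row with the Hector client; the cluster, keyspace, column family, and row names are placeholders:

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.SliceQuery;

Cluster cluster = HFactory.getOrCreateCluster("MyCluster", "localhost:9160");
Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster);

// Read every column of the hypothetical index row "job_1"; each column name is a
// row key of the main CF that this particular MR job should process.
SliceQuery<String, String, String> query = HFactory.createSliceQuery(
        keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
query.setColumnFamily("job_index").setKey("job_1").setRange("", "", false, 100000);

for (HColumn<String, String> column : query.execute().get().getColumns()) {
    String targetRowKey = column.getName();
    // hand targetRowKey off to the MapReduce input
}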
Alternatively, if speed isn't an issue, look into using PIG: How to use Cassandra's Map Reduce with or w/o Pig?