HBase scan through shell command and MapReduce gives two different results

I have an HBase table with more than a billion records. When I scan it from the shell with a ValueFilter I get 41820 records, but it takes more than 35 minutes to return the result. When I scan the same table with a MapReduce program, I get the count within 2 minutes, but it reports 41035 records. I don't understand why the counts differ.
Here is the shell command I use:
scan 'permhistory', { COLUMNS => 'h:e_source', FILTER => "ValueFilter( =, 'binaryprefix:AC_B2B' )" }
Result: 41820
Here is the Scan object in the MapReduce job:
Scan scan = new Scan();
scan.setCaching(2000);
scan.setCacheBlocks(false);
scan.addFamily(Bytes.toBytes("h"));
scan.addColumn(Bytes.toBytes("h"), Bytes.toBytes("e_source"));
SingleColumnValueFilter filter = new SingleColumnValueFilter(Bytes.toBytes("h"),
        Bytes.toBytes("e_source"), CompareOp.EQUAL, Bytes.toBytes("AC_B2B"));
filter.setLatestVersionOnly(false);
scan.setFilter(filter);
Any idea why the counts differ? This is my first post here; I am stuck automating our system, so any help from the experts out there would be appreciated.

In the MapReduce job you are using this constructor:
public SingleColumnValueFilter(byte[] family,
                               byte[] qualifier,
                               CompareFilter.CompareOp compareOp,
                               byte[] value)
This means you instantiate the filter with the default (binary) comparator, but in the HBase shell you are using
"ValueFilter( =, 'binaryprefix:AC_B2B' )"
which uses a binary-prefix comparator. So you should try this instead:
SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("h"),
        Bytes.toBytes("e_source"),
        CompareOp.EQUAL,
        new BinaryPrefixComparator(Bytes.toBytes("AC_B2B"))); // prefix match, like 'binaryprefix:' in the shell
Moreover, in the HBase shell you are using ValueFilter, while in the MapReduce job you are using SingleColumnValueFilter. For your reference:
SingleColumnValueFilter
This filter is used to filter cells based on value. It takes a
CompareFilter.CompareOp operator (equal, greater, not equal, etc), and
either a byte [] value or a ByteArrayComparable. If we have a byte []
value then we just do a lexicographic compare. For example, if passed
value is 'b' and cell has 'a' and the compare operator is LESS, then
we will filter out this cell (return true). If this is not sufficient
(eg you want to deserialize a long and then compare it to a fixed long
value), then you can pass in your own comparator instead.
You must also specify a family and qualifier. Only the value of this
column will be tested. When using this filter on a Scan with specified
inputs, the column to be tested should also be added as input
(otherwise the filter will regard the column as missing).
To prevent the entire row from being emitted if the column is not
found on a row, use setFilterIfMissing(boolean). Otherwise, if the
column is found, the entire row will be emitted only if the value
passes. If the value fails, the row will be filtered out.
ValueFilter
This filter is used to filter based on column value. It takes an
operator (equal, greater, not equal, etc) and a byte [] comparator for
the cell value.
In this case, since you have explicitly set the column to be scanned, the two behave much the same way: ValueFilter tests every column, while SingleColumnValueFilter tests only the specified column and omits the row altogether if the value does not pass the filter.
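Putting the two points together, here is a minimal sketch of how the MapReduce Scan could be set up so that it mirrors the shell filter. The ScanFactory/buildScan names are just illustrative, and setFilterIfMissing(true) is optional: it only matters if some rows have no h:e_source cell at all.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryPrefixComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public final class ScanFactory {

    public static Scan buildScan() {
        Scan scan = new Scan();
        scan.setCaching(2000);
        scan.setCacheBlocks(false);
        scan.addColumn(Bytes.toBytes("h"), Bytes.toBytes("e_source"));

        // Prefix comparison, equivalent to 'binaryprefix:AC_B2B' in the shell filter
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("h"),
                Bytes.toBytes("e_source"),
                CompareOp.EQUAL,
                new BinaryPrefixComparator(Bytes.toBytes("AC_B2B")));
        filter.setLatestVersionOnly(false);
        // Drop rows that have no h:e_source cell at all, instead of emitting them unfiltered
        filter.setFilterIfMissing(true);
        scan.setFilter(filter);
        return scan;
    }
}

The returned Scan can then be handed to TableMapReduceUtil.initTableMapperJob as usual.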

Related

How can I partition an Arrow Table by value in one pass?

I would like to be able to partition an Arrow table by the values of one of its columns (assuming the set of n values occurring in that column is known). The straightforward way is a for-loop: for each of these values, scan the whole table and build a new table of matching rows. Are there ways to do this in one pass instead of n passes?
I initially thought that Arrow's support for group-by scans would be the solution -- but Arrow (in contrast to Pandas) does not support extracting groups after a group-by scan.
Am I just thinking about this wrong and there is another way to partition a table in one pass?
For the group by support, there is a "hash_list" function that returns all values in the group. Is that what you're looking for? You could then slice the resulting values after-the-fact to extract the individual groups.
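For what it's worth, outside Arrow's compute layer the single-pass idea is just bucketing rows by key while streaming over them once. A plain-Java sketch of that idea, not Arrow-specific and with an illustrative Row type:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class PartitionByKey {

    // Illustrative row type: the partition column value plus the rest of the row.
    record Row(String key, Object[] rest) {}

    // One pass over the input: append each row to the bucket for its key value.
    static Map<String, List<Row>> partition(Iterable<Row> rows) {
        Map<String, List<Row>> buckets = new HashMap<>();
        for (Row row : rows) {
            buckets.computeIfAbsent(row.key(), k -> new ArrayList<>()).add(row);
        }
        return buckets;
    }
}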

MarkLogic Optic JavaScript Geospatial Difference

I want to reduce the selected items by their distance from a point using MarkLogic Optic.
I have a table with data and a lat/long.
const geoData = op.fromView("namespace","coordinate");
geoData.where(op.le(distance(-28.13,153.4,geoData.col(lat),geoData(long)),100))
I have already written the distance function; it uses geo.distance(cts.point(lat,long), cts.point(lat,long)). But geoData.col("lat") passes an object that describes the full namespace of the column, not its value:
op.schemaCol('namespace', 'coordinate', 'long')
I suspect I need a map/reduce function, but the MarkLogic documentation only gives simplistic examples that are of little help here.
I would appreciate some help.
FURTHER INFORMATION
I have mostly solved this problem, except that some columns have null values. The data is sparse and not all rows have a lat/long.
So when cts.point runs in the where statement and two null values are passed, it raises an exception.
How do I coalesce the values, or prevent execution of cts.point when the data columns are null? I don't want to reduce the data set, as the records with null values still need to be returned; they will just have a null distance.
Where possible, it's best to do filtering by passing a constraining cts.query() to where().
A constraining query matches the indexed documents and narrows the row set to the rows that TDE projected from those documents, before the filtered rows are retrieved from the indexes.
If the lat and long columns are each distinct JSON properties or XML elements in the indexed documents, it may be possible to express the distance constraint using techniques similar to those summarized here:
http://docs.marklogic.com/guide/search-dev/geospatial#id_42017
In general, it's best to use map/reduce SJS functions only for postprocessing on the already-filtered result set, because the rows have to be retrieved to the e-node to be processed in SJS.
Hoping that helps,

Querying DynamoDB with a partition key and list of specific sort keys

I have a DynamoDB table that, for the sake of this question, looks like this:
id (String partition key)
origin (String sort key)
I want to query the table for a subset of origins under a specific id.
From my understanding, the only operators DynamoDB allows on sort keys in a Query are 'between', 'begins_with', '=', '<=' and '>='.
The problem is that my query needs a form of 'CONTAINS' because the 'origins' list is not necessarily ordered (for a between operator).
If this was SQL it would be something like:
SELECT * from Table where id={id} AND origin IN {origin_list}
My exact question is: what do I need to do to achieve this functionality in the most efficient way? Should I change my table structure? Maybe add a GSI? I'm open to suggestions.
I am aware that this can be achieved with a Scan operation but I want to have an efficient query. Same goes for BatchGetItem, I would rather avoid that functionality unless absolutely necessary.
Thanks
This is a case for using a filter expression with Query; filter expressions support the IN operator:
Comparison Operator
a IN (b, c, d) — true if a is equal to any value in the list — for
example, any of b, c or d. The list can contain up to 100 values,
separated by commas.
However, you cannot use filter expressions on key attributes:
Filter Expressions for Query
A filter expression cannot contain partition key or sort key
attributes. You need to specify those attributes in the key condition
expression, not the filter expression.
So what you could do is stop using origin as the sort key (or duplicate it in another, non-key attribute) and filter on it after the key condition. Of course, the filter first reads all the items with that id and only then filters them, which consumes read capacity and is less efficient, but there is no other way to get this in a single Query. Depending on your item sizes, query frequency, and the estimated number of returned items, BatchGetItem could be a better choice.
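As a rough sketch using the AWS SDK for Java v2, assuming origin has been duplicated into a non-key attribute (called origin_attr here) and with the table name and values purely illustrative:

import java.util.HashMap;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class QueryOriginsExample {
    public static void main(String[] args) {
        DynamoDbClient dynamo = DynamoDbClient.create();

        // One placeholder per origin in the subset; build this map from origin_list at runtime.
        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":id", AttributeValue.builder().s("some-id").build());
        values.put(":o1", AttributeValue.builder().s("origin-a").build());
        values.put(":o2", AttributeValue.builder().s("origin-b").build());

        QueryRequest request = QueryRequest.builder()
                .tableName("MyTable")
                .keyConditionExpression("#id = :id")              // key condition on the partition key
                .filterExpression("origin_attr IN (:o1, :o2)")    // applied after the items are read
                .expressionAttributeNames(Map.of("#id", "id"))
                .expressionAttributeValues(values)
                .build();

        QueryResponse response = dynamo.query(request);
        response.items().forEach(System.out::println);
    }
}

The filter only reduces what is returned, not what is read, which is the trade-off described above.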

mysql++ (mysqlpp): how to get number of rows in result prior to iteration using fetch_row through UseQueryResult

Is there an API call provided by mysql++ to get the number of rows returned by the result?
I have code structured as follows:
// ...
mysqlpp::Query query = conn.query(queryString);
if (mysqlpp::UseQueryResult res = query.use()) {
    // some code
    while (mysqlpp::Row row = res.fetch_row()) {
        // process the row
    }
}
My previous question here would be solved easily by a function that returns the number of rows in the result. I could use it to allocate memory of that size and fill it in as I iterate row by row.
In case anyone runs into this:
I quote the user manual:
The most direct way to retrieve a result set is to use Query::store(). This returns a StoreQueryResult object,
which derives from std::vector, making it a random-access container of Rows. In turn,
each Row object is like a std::vector of String objects, one for each field in the result set. Therefore, you can
treat StoreQueryResult as a two-dimensional array: you can get the 5th field on the 2nd row by simply saying
result[1][4]. You can also access row elements by field name, like this: result[2]["price"].
AND
A less direct way of working with query results is to use Query::use(), which returns a UseQueryResult object.
This class acts like an STL input iterator rather than a std::vector: you walk through your result set processing
one row at a time, always going forward. You can’t seek around in the result set, and you can’t know how many
results are in the set until you find the end. In payment for that inconvenience, you get better memory efficiency,
because the entire result set doesn’t need to be stored in RAM. This is very useful when you need large result sets.
A suggestion found here: http://lists.mysql.com/plusplus/9047
is to run a COUNT(*) query first, fetch that result, and then use Query::use() as before. To avoid an inconsistent count, one can wrap the two queries in a single transaction as follows:
START TRANSACTION;
SELECT COUNT(*) FROM myTable;
SELECT * FROM myTable;
COMMIT;

Multiple strings as argument in Table Input

I'm trying to use SQL like SELECT column FROM table WHERE column IN (?)
where ? should be a concatenation of strings. I wrote a script that concatenates rows into something like 'string','secondstring' and so on.
I know I should just use more parameters, but until the moment of execution I don't know how many arguments there will be, and there are hundreds of them each time.
I'd like to do it in one SQL statement, so putting every argument in a single row and checking "Execute for each row" isn't ideal either.
Any clue how to do this?
You can use loops and variables in Kettle.
For example:
Create a job that contains:
1) a transformation that concatenates all the input rows and stores the result in a variable, using setVariable("varname", value, "r"); the "r" scope makes the variable accessible to the parent job.
2) a transformation that runs the desired query with variable substitution enabled: SELECT column FROM table WHERE column IN (${varname}).
If you need them, I can send the example files.