Obtain value from HBase table by key - mapreduce

There is an HBase table with tens of billions of records, where each key is a string of 40 bytes. There is also a list of hundreds of thousands of keys. I need to get all records with these keys and return the value of a certain table field, so my goal is to transform a set of keys into a set of values. What is the most convenient and/or efficient way to perform this task (with any programming language and technology)?

You can use the HBase Java API, roughly like this:
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "ZOOKEEPER_USED_BY_HBASE");
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("tablename"));

List<Get> gets = new ArrayList<>();
for (byte[] key : keys) {  // keys: your list of 40-byte row keys
    gets.add(new Get(key));
}
Result[] results = table.get(gets);  // one batched multi-get
A few more suggestions:
Have a look at the Get javadocs; you can configure a Get to return only the columns you are interested in.
If the keys share a common prefix, using a Scan with start/stop rows might work as well. Call scan.setCaching(5000) to make it slightly faster if you use it.
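For illustration, a minimal sketch of both suggestions, reusing table and key from the snippet above (the column family, qualifier, and range bounds are placeholders, not taken from the original question):
import org.apache.hadoop.hbase.util.Bytes;  // client classes imported above

// Fetch only the one column you need instead of the whole row.
Get get = new Get(key);
get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field"));  // placeholder family/qualifier

// If the keys share a common prefix, a range scan can replace many point gets.
byte[] startKey = Bytes.toBytes("prefix");   // placeholder: first key of the range
byte[] stopKey  = Bytes.toBytes("prefiy");   // placeholder: first key after the range
Scan scan = new Scan()
    .withStartRow(startKey)
    .withStopRow(stopKey)
    .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field"));
scan.setCaching(5000);                       // rows fetched per RPC round trip
try (ResultScanner scanner = table.getScanner(scan)) {
    for (Result r : scanner) {
        byte[] value = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field"));
    }
}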

I was testing MapReduce on MongoDB to see how efficient it is at grabbing key/value pairs from a collection. It was only a collection of 100k records, but a small JavaScript function was able to retrieve all the countries and the number of times they appeared in the collection.
var Map1 = function() {
    emit(this.country, 1);  // note: emit, not Emit
};

var Reduce1 = function(key, vals) {
    var sum = 0;
    for (var i = 0; i < vals.length; i++) {
        sum += vals[i];
    }
    return sum;
};
Then again, I don't know how effective M/R would be with billions of records.
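For completeness, a rough sketch of running that map/reduce from the MongoDB Java driver (the connection string and database/collection names are hypothetical, and newer MongoDB versions deprecate map-reduce in favor of the aggregation pipeline):
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

MongoClient client = MongoClients.create("mongodb://localhost:27017");
MongoCollection<Document> coll = client.getDatabase("test").getCollection("records");

String map = "function() { emit(this.country, 1); }";
String reduce = "function(key, vals) {"
              + "  var sum = 0;"
              + "  for (var i = 0; i < vals.length; i++) { sum += vals[i]; }"
              + "  return sum;"
              + "}";

// Results are returned inline: one document per country with its count.
for (Document d : coll.mapReduce(map, reduce)) {
    System.out.println(d.toJson());
}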

Related

Compare values in dynamodb during query without knowing values in ExpressionAttributeValues

Is it possible to apply a filter based on values inside a DynamoDB database?
Let's say the database contains an object info within a table:
info: {
toDo: x,
done: y,
}
Using ExpressionAttributeValues, is it possible to check whether info.toDo = info.done and apply a filter on it without knowing the current values of info.toDo and info.done?
At the moment I have tried using ExpressionAttributeNames so that it contains:
'#toDo': 'info.toDo', '#done': 'info.done'
and the FilterExpression is
#toDo = #done
but I'm retrieving no items when running a query with this filter.
Thanks a lot!
DynamoDB is not designed to perform arbitrary queries as you might be used to in a relational database. It is designed for fast lookups based on keys.
Therefore, if you can add an index that gives you access to the records you are looking for, you can use it for this new access pattern. For example, you could add an index that uses info.toDo as the partition key and info.done as the sort key. You can then query the index with the key condition PK = x AND SK = x for each value x, assuming that the list of possible values is limited and known.
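A minimal sketch of that access pattern with the AWS SDK for Java v2 (the table name, index name, and value list are hypothetical; note that DynamoDB index keys must be top-level scalar attributes, so this assumes toDo and done are also stored at the top level of the item):
import java.util.List;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

DynamoDbClient ddb = DynamoDbClient.create();
List<String> knownValues = List.of("1", "2", "3");  // hypothetical candidate values

// For each known candidate value x, ask the index for items where toDo = x AND done = x.
for (String x : knownValues) {
    QueryRequest request = QueryRequest.builder()
        .tableName("MyTable")                  // hypothetical table name
        .indexName("toDo-done-index")          // hypothetical GSI: toDo = PK, done = SK
        .keyConditionExpression("#toDo = :x AND #done = :x")
        .expressionAttributeNames(Map.of("#toDo", "toDo", "#done", "done"))
        .expressionAttributeValues(Map.of(
            ":x", AttributeValue.builder().n(x).build()))  // .n() assumes numbers; use .s() for strings
        .build();
    QueryResponse response = ddb.query(request);
    response.items().forEach(System.out::println);
}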

Fastest way to select a lot of rows based on their ID in PostgreSQL?

I am using postgres with libpqxx, and I have a table that we will simplify down to
data_table
{
bytea id PRIMARY KEY,
BigInt size
}
If I have a set of IDs in C++, e.g. std::unordered_set<ObjectId> Ids, what is the best way to get the id and size values out of data_table?
I have so far used a prepared statement:
constexpr const char* preparedStatement = "SELECT size FROM data_table WHERE id = $1";
Then, in a transaction, I have called that prepared statement for every entry in the set and retrieved its result:
pqxx::work transaction(SomeExistingPqxxConnection);
std::unordered_map<ObjectId, uint32_t> result;
for (const auto& id : Ids)
{
auto transactionResult = transaction.exec_prepared(preparedStatement, ToPqxxBinaryString(id));
result.emplace(id, transactionResult[0][0].as<uint32_t>());
}
return result;
Because the set can contain tens of thousands of objects, and the table can contain millions, this can take quite some time to process, and I don't think it is a particularly efficient use of Postgres.
I am pretty much brand new to SQL, so I don't really know if what I am doing is the right way to go about this, or if there is a much more efficient way.
Edit: For what it's worth, the ObjectId class is basically a type wrapper over std::array<uint8_t, 32>, i.e. a 256-bit cryptographic hash.
The task as I understand it:
Get id (PK) and size (bigint) for "tens of thousands of objects" from a table with millions of rows and presumably several more columns ("simplified down").
The fastest way of retrieval is index-only scans. The cheapest way to get that in your particular case would be a "covering index" for your query by "including" the size column in the PK index like this (requires Postgres 11 or later):
CREATE TEMP TABLE data_table (
id bytea
, size bigint
, PRIMARY KEY (id) INCLUDE (size) -- !
)
About covering indexes:
Do covering indexes in PostgreSQL help JOIN columns?
Then retrieve all rows in a single query (or a few queries) for many IDs at once, like:
SELECT id, size
FROM   data_table
JOIN  (
   VALUES ('id1'::bytea), ('id2')  -- many more
   ) t(id) USING (id);
Or one of the other methods laid out here:
Query table by indexes from integer array
Or create a temporary table and join to it.
But do not "insert all those IDs one by one into it". Use the much faster COPY (or the meta-command \copy in psql) to fill the temp table. See:
How to update selected rows with values from a CSV file in Postgres?
And you do not need an index on the temporary table, as it will be read in a sequential scan anyway. You only need the covering PK index outlined above.
You may want to ANALYZE the temporary table after filling it, to give Postgres some column statistics to work with. But as long as you get the index-only scans I am aiming for, you can skip that, too. The query plan won't get any better than that.
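A minimal sketch of that temp-table route in SQL (the temp table and file names are made up for illustration):
-- Plain list of IDs; no index needed, it will be read with a sequential scan anyway.
CREATE TEMP TABLE tmp_ids (id bytea);

-- In psql, bulk-load the IDs with the \copy meta-command
-- (one ID per line, in bytea text format):
-- \copy tmp_ids FROM 'ids.txt'

ANALYZE tmp_ids;  -- optional: give the planner column statistics

SELECT d.id, d.size
FROM   data_table d
JOIN   tmp_ids   t USING (id);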
The id is a primary key and so is indexed, so my first concern would be query setup time; a stored procedure is precompiled, for instance. A second tack is to put your set of IDs in a temp table, possibly also keyed on id, so the two tables/indexes can be joined in one SELECT. The indexes for this should be ordered (B-tree rather than hash) so they can be merged.

DynamoDB Query and Scan Behavior Question

I thought of this scenario when querying/scanning a DynamoDB table.
What if I want to get a single item from a table that has 20k items, and the item I'm looking for is around the 19,000th row? I'm using Scan with a limit of 1000, for example. Does it consume throughput even though the first 19 scan pages return no matching item? For instance,
I have a User table:
type UserTable {
  userId: ID!
  username: String
  password: String
}
then my query
var params = {
TableName: "UserTable",
FilterExpression: "username = :username",
ExpressionAttributeValues: {
":username": username
},
Limit: 1000
};
How to effectively handle this?
According to the doc
A Scan operation always scans the entire table or secondary index. It
then filters out values to provide the result you want, essentially
adding the extra step of removing data from the result set.
Performance
If possible, you should avoid using a Scan operation on a large table
or index with a filter that removes many results. Also, as a table or
index grows, the Scan operation slows
Read units
The Scan operation examines every item for the requested values and can
use up the provisioned throughput for a large table or index in a
single operation. For faster response times, design your tables and
indexes so that your applications can use Query instead of Scan
For better performance and less read-unit consumption, I advise you to create a GSI and use it with Query.
A Scan operation will look at the entire table and visit all records to find out which of them match your filter criteria. So it will consume enough throughput to read all of the visited records. A Scan operation is also very slow, especially if the table is large.
To your second question: you can create a secondary index on the table with username as the hash key. Then you can convert the Scan operation into a Query. That way it will only consume enough throughput to fetch one record.
Read about Secondary Indices Here
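As a sketch of what such a Query could look like with the AWS SDK for Java v2 (the index name username-index is hypothetical; use whatever name you give the GSI):
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;

DynamoDbClient ddb = DynamoDbClient.create();
String username = "alice";  // hypothetical value to look up

QueryRequest request = QueryRequest.builder()
    .tableName("UserTable")
    .indexName("username-index")  // hypothetical GSI with username as the hash key
    .keyConditionExpression("username = :username")
    .expressionAttributeValues(Map.of(
        ":username", AttributeValue.builder().s(username).build()))
    .build();

// Only the matching item is read, so only its throughput is consumed.
ddb.query(request).items().forEach(System.out::println);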

couchdb mapreduce query intersection of multiple keys

I'm manipulating documents containing a dictionary of arbitrary metadata, and I would like to search documents based on that metadata.
So far my approach is to build an index from the metadata. For each document, I emit each (key, value) pair of the metadata dictionary.
var metaIndexDoc = {
  _id: '_design/meta_index',
  views: {
    by_meta: {
      map: function(doc) {
        if (doc.meta) {
          for (var k in doc.meta) {
            var v = doc.meta[k];
            emit(k, v);
          }
        }
      }.toString()
    }
  }
};
That way, I can query for all the docs that have a metadata date, and the result will be sorted based on the value associated with date. So far, so good.
Now, I would like to make queries based on multiple criteria: docs that have date AND important in their metadata. From what I've gathered so far, it looks like the way I built my index won't allow that. I could create a new index with ['date', 'important'] as keys, but the number of indexes would grow exponentially as I add new metadata keys.
I also read about using a new kind of document to store the fact that document X has metadata (key,value), which is definitely how you would do it in a relational database, but I would rather have self-contained documents if it is possible.
Right now, I'm thinking about keeping my metaIndex, making one query for date and one for important, and then using underscore.intersection to compute the intersection of both lists.
Am I missing something?
EDIT: after discussion with @alexis, I'm reconsidering the option of creating custom indexes when I need them and letting PouchDB manage them. It is still true that with a growing number of metadata fields, the number of possible combinations will grow exponentially, but as long as the indexes are created only when they are needed, I guess I'm good to go...

CouchDB - reprocessing view results

I decided to try out CouchDB today, using the ~9GB of Amazon Review data here:
http://snap.stanford.edu/data/web-Movies.html
What I am trying to do is find the least helpful users of all time. The people who have written the largest number of reviews which other people find unhelpful (are they Amazon's greatest trolls? Or just disagreeable? I want to see).
I've written a map function to emit the userId of every review where the difference in helpfulness rating is over 5, then a reduce function to sum them, to find out how often each userId appears.
// map function:
function(doc) {
  var unhelpfulness = doc.helpfulness[1] - doc.helpfulness[0];
  if (unhelpfulness > 5) {
    emit(doc.userId, 1);
  }
}
// reduce function:
function(keys, values) {
  return sum(values);
}
This gives me a view of userId : number of unhelpful reviews.
I want to take this output and then reprocess it with more map reduce, to find out who's written the most unhelpful reviews. How do I do this? Can I export a view as another table or something? Or am I just thinking about this problem in the wrong way?
You are on the right track. CouchDB does not allow view results to be sorted by value, but it has list functions that can be used to post-process the results of a view. From the CouchDB book:
Just as show functions convert documents to arbitrary output formats, CouchDB list functions allow you to render the output of view queries in any format. The powerful iterator API allows for flexibility to filter and aggregate rows on the fly, as well as output raw transformations for an easy way to make Atom feeds, HTML lists, CSV files, config files, or even just modified JSON.
So we will use a list function to filter and aggregate. In your design document, create a list function like so:
function(head, req) {
  var row;
  var rows = [];
  while (row = getRow()) {
    rows.push(row);
  }
  rows.sort(function(a, b) { return b.value - a.value; });
  send(JSON.stringify(rows[0]));
}
Now if you query
/your-database/_design/your-design-doc-name/_list/your-list-name/your-view-name?group=true
You should get the userId of the person with the most unhelpful reviews. CouchDB makes it easy to find a troll :)