Power BI VertiPaq encoding

If I replace a long string key with a shorter integer key will this increase performance in my model when ultimately the key is hashed either way?

Related

How to limit dynamodb scan to a given partition key and NOT read the entire table

Theoretical table with billions of entries.
Partition key is a unique uuid representing a given deviceId. There will be around 10k unique uuids.
Sort Key is a dateString for when the data was collected.
Each item has some data fields. There are dozens of fields such that making a GSI for each wouldn't be reasonable. For our example, let's say we are looking for the "dataOfInterest" field.
I'd like to search the DB for "all items where the dataOfInterest = 'foobar'" - and ideally do it within a date range. As far as I know, a scan operation is the only option. With billions of entries... that's not going to be a fast process (though I understand I could split it out to run multiple operations at a time - it's still going to eat RCUs like crazy).
Of note, I only care about a given uuid for each search. In other words, what I REALLY care about is "all items within a given partition where the dataOfInterest = 'foobar'". And further, it'd be great to use the sort key to get "all items within a given partition where the dataOfInterest = 'foobar' that are between Jan 1 and Feb 28".
The scan operation allows you to limit the results with a filter expression such that I could get the results of just a single partition ... but it still reads the entire table and the filtering is done before returning the data to you. https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html
Is there an AWS API that does a scan-like operation that reads only a given partition? Are there other ways to achieve this (perhaps re-architecting the DB?)
As @jarmod says, you can use a Query and specify the PK of the UUID. You can then either put the timestamp into the SK and filter on the dataOfInterest value (unindexed), or, for more efficiency and to keep everything indexed, construct a composite SK of dataOfInterest#timestamp and do a range query on the SK from foobar#time1 to foobar#time2. That makes this query perfectly index-optimized.
Of course, this makes purely timestamp-based queries less simple. So you either do multiple queries for those or, if you want both queries to be efficient, set up this composite SK in a GSI and use that to resolve this query.
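For illustration, a minimal sketch of that composite-SK range query with boto3; the table name, attribute names, and literal values are hypothetical, not from the original post:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("device-data")  # hypothetical table name

# Assumes the PK attribute is "deviceId" (the UUID) and the composite SK
# attribute "dataSk" is written as "<dataOfInterest>#<ISO-8601 timestamp>".
resp = table.query(
    KeyConditionExpression=(
        Key("deviceId").eq("0f8fad5b-d9cb-469f-a165-70867728950e")
        & Key("dataSk").between("foobar#2024-01-01", "foobar#2024-02-28")
    )
)
items = resp["Items"]  # foobar items for that device between Jan 1 and Feb 28
```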

How to perform a range query over AWS dynamoDB

I have an AWS DynamoDB table storing book information; the hash key is the book id. There is an attribute for the book price.
Now I want to perform a query that returns all the books whose price is lower than a certain value. How do I do this efficiently, without scanning the whole table?
A query on a secondary index seems to only return the set of entries where the indexed attribute equals a certain value, so I am confused about how to perform a range query efficiently. Thank you very much!
There are two things that maybe you are confusing: the range key and a range condition on an attribute.
To clarify, in this case you would need a secondary index, and when querying the index you would specify a key condition (assuming Java and a secondary index on the price attribute - this is pretty much the same in any SDK-supported language).
See http://docs.amazonaws.cn/en_us/AWSJavaSDK/latest/javadoc/index.html?com/amazonaws/services/dynamodbv2/model/QueryRequest.html with a BETWEEN condition.
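The linked class is the Java SDK's QueryRequest; a rough sketch of the same idea in Python with boto3 follows, where the index name, the grouping partition key, and the values are assumptions rather than anything from the question:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("books")  # hypothetical table name

# Assumes a secondary index "category-price-index" whose partition key is a
# grouping attribute ("category") and whose sort key is the numeric "price",
# so a BETWEEN key condition covers the price range without a table scan.
resp = table.query(
    IndexName="category-price-index",
    KeyConditionExpression=Key("category").eq("textbooks") & Key("price").between(0, 20),
)
cheap_books = resp["Items"]
```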
You can't do a query of that kind. DynamoDB is sharded across many nodes by hash key, so doing a query without a hash key (i.e. across all hash keys) is essentially a full scan.
A hack for your case would be to have a hash key with only one value for the whole table, but this is fundamentally wrong because you lose all the pros of using DynamoDB. See the hot hash key issue for more info: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

Dynamo Db Spatial Search with GeoHashing

We have a project for a mobile app where a user searches for places based on the user's position and the category they prefer ("fast food restaurants" for example). The client wants to use DynamoDB and we are trying hard to understand how to best model the data.
All queries will be based on two fields:
A string containing the geohash value for a bounding box --> that's our hash primary key
An int containing the category type of the item --> range key
After reading the documentation we found that this doesn't follow Amazon's recommendations, because the hash key will be repeated a lot and won't make good use of parallel scanning, and the range key doesn't represent a range at all.
So we are kind of lost on how to proceed. Any help will be appreciated.
Amazon released a library for geo hashing (Java only):
http://aws.typepad.com/aws/2013/09/new-geo-library-for-dynamodb.html
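If you stay with the model described in the question (geohash cell as hash key, category as range key), a query for one cell could look roughly like this with boto3; the table name, attribute names, and values are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("places")  # hypothetical table name

# One query per geohash cell covering the bounding box; 3 stands in for the
# "fast food restaurants" category id.
resp = table.query(
    KeyConditionExpression=Key("geohash").eq("9q8yy") & Key("category").eq(3)
)
places = resp["Items"]
```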

A simple credentials table for mySQL

Here is my simple table definition for a mysql credentials table.
case "credentials":
self::create('credentials', 'identifier INT NOT NULL AUTO_INCREMENT, flname VARCHAR(60), email VARCHAR(32), pass VARCHAR(40), PRIMARY KEY(identifier)');
break;
Please ignore all but the inner arguments... the syntax is good. I just want to verify the form. Basically, I have an auto-incrementing int for the PRIMARY KEY and 3 fields: the user's name, email, and password.
I want this to be as simple as possible. Searches will be based upon the id.
Question: Will this work for a basic credentials table?
Please please please do not store passwords in plaintext.
Use a well known iterated hashing function, such as bcrypt or PBKDF2. Don't store a raw MD5 hash, or even a raw SHA or SHA-2 hash. You should always salt and iterate your hashes to be secure.
You'll need one extra column to store the salt, and if you want to be flexible you could also have per-user iteration counts and maybe even per-user hash functions. That gives you the flexibility to change to a different hash function in the future without requiring all users to immediately change their passwords.
Apart from that the table looks fine.
I would suggest that you increase the size of the email field (the maximum length of an email address can be up to 256 chars). Also, you should store your passwords as a hash (e.g. bcrypt), not as a plain string.
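As a minimal sketch of the salted, iterated hashing both answers recommend, using Python's bcrypt package (bcrypt embeds the salt and cost factor in the hash string, so a single wider column is enough; the column sizing is an assumption on my part):

```python
import bcrypt

# On signup: hash the password; the result is a ~60-character string that
# already contains the salt and the iteration cost.
pw_hash = bcrypt.hashpw("correct horse battery staple".encode("utf-8"), bcrypt.gensalt())
# Store pw_hash in the pass column (widen it to at least VARCHAR(60)).

# On login: compare the submitted password against the stored hash.
if bcrypt.checkpw("correct horse battery staple".encode("utf-8"), pw_hash):
    print("password ok")
```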

MongoDB: what is the most efficient way to query a single random document?

I need to pick a document from a collection at random (alternatively - a small number of successive documents from a randomly-positioned "window").
I've found two solutions: 1 and 2. The first is unacceptable since I anticipate a large collection size and wish to minimize the document size. The second seems inefficient (I'm not sure about the complexity of the skip operation). And here one can find a mention of querying a document with a specified index, but I don't know how to do it (I'm using the C++ driver).
Are there other solutions to the problem? Which is the most efficient?
I had a similar issue once. In my case, I had a date property on my documents. I knew the earliest date possible in the dataset, so in my application code I would generate a random date within the range of EARLIEST_DATE_IN_SET and NOW, then query MongoDB using a $gte query on the date property and simply limit it to 1 result.
There was a small chance that the random date would be greater than the highest date in the data set, so I accounted for that in the application code.
With an index on the date property, this was a super fast query.
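A rough sketch of that approach, written with pymongo for brevity rather than the C++ driver the asker uses; the database, collection, field name, and earliest date are hypothetical:

```python
import random
from datetime import datetime, timedelta, timezone

from pymongo import ASCENDING, MongoClient

coll = MongoClient()["mydb"]["mycoll"]  # hypothetical database/collection names

EARLIEST = datetime(2014, 1, 1, tzinfo=timezone.utc)  # earliest date known to be in the set
now = datetime.now(timezone.utc)

# Pick a random instant between EARLIEST and now...
pivot = EARLIEST + timedelta(seconds=random.random() * (now - EARLIEST).total_seconds())

# ...and take the first document at or after it; an index on "date" keeps this fast.
doc = coll.find_one({"date": {"$gte": pivot}}, sort=[("date", ASCENDING)])

# Small chance the pivot lands after the newest document; fall back to the last one.
if doc is None:
    doc = coll.find_one(sort=[("date", -1)])
```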
It seems like you could mold solution 1 there (assuming your _id key is an auto-incremented value): just do a count on your records, use that as the upper limit for a random int in C++, then grab that row.
Likewise, if you don't have an auto-incrementing _id key, just create one with your results. Having an additional INT field shouldn't add that much to your document size.
If you don't have an auto-inc field, Mongo talks about how to quickly add one here:
Auto Inc Field.
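A sketch of the count-then-lookup idea, again with pymongo, under the assumption that each document carries a contiguous auto-increment field (here called "seq", a made-up name):

```python
import random

from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]  # hypothetical database/collection names

# Assumes documents are numbered 0 .. n-1 in an indexed "seq" field.
n = coll.count_documents({})
random_doc = coll.find_one({"seq": random.randint(0, n - 1)})
```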