What kind of technique does Google Plus use to generate users' unique IDs?
Example
https://plus.google.com/102766325060234825733/posts
You can only assume that they are randomly generated IDs, large enough to be generated non-sequentially and with sufficient entropy.
Interestingly, the IDs are too big to be stored in a bigint field; again, this is probably due to the entropy and non-sequential requirements (so that nothing can be inferred by comparing user IDs).
The IDs can be generated by encrypting a serially generated number with a secret key. This can be either a one-way hash or a reversible (decryptable) encryption.
The reason for not using serial numbers directly is obvious: other users' IDs could easily be guessed, which would let bots scrape the network's content.
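A minimal sketch of the one-way-hash variant in Python (the secret key, truncation width, and function name are illustrative, not anything Google actually uses):

```python
import hmac, hashlib

SECRET_KEY = b"server-side secret"   # illustrative key, kept out of the database

def public_id(serial: int) -> int:
    """Derive a non-sequential public ID from an internal serial number.

    The HMAC output is truncated to 64 bits here for brevity; real Google+
    IDs are wider than that (they don't fit in a bigint), so a production
    scheme would keep more of the digest.
    """
    digest = hmac.new(SECRET_KEY, str(serial).encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")

# Consecutive serials produce unrelated-looking public IDs.
print(public_id(1), public_id(2))
```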
Related
What I would like to achieve:
Assuming I have 500 JSON files already uploaded to IPFS, each one representing the metadata of an NFT.
Every time a user triggers the mintBatch(number) function, I want that number of NFTs to be minted randomly for the user, and NFTs that have already been minted should not be minted again.
Hence, I think I will need a place to store the IDs of the NFTs that have been minted, so that the future randomize function can avoid those IDs when minting.
I am wondering whether I should do this on-chain or off-chain. Off-chain would be a lot easier, as I would just generate the random numbers off-chain and record them in my own database, but some articles say this would be costly and inefficient (and I don't quite understand why). On the other hand, some articles describe the on-chain approach as less secure, since users can inspect the randomization mechanic on the blockchain and try to exploit it, while other articles say it is more efficient and generates less gas.
Which way would people normally choose, and how do they normally do it?
UPDATE: To prevent users from minting the same JSON that others have already minted, I am thinking of putting the JSON CIDs in an array and removing an item from it once the mint function is called. I wonder whether this changes the most suitable way to mint randomly.
Check this question:
How to generate a random number in solidity?
True random number generation is only possible by using an RNG service like Chainlink.
After getting the random number from Chainlink, you can pick one of your JSON files and give it to the user.
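For example, here is a rough sketch of the swap-and-pop selection you would mirror on-chain, written in Python for readability (the CIDs and the random source are placeholders; on-chain the random word would come from Chainlink VRF):

```python
import hashlib
import secrets

# Placeholder CIDs standing in for the 500 metadata JSON files on IPFS.
unminted_cids = [f"ipfs://Qm...{i}" for i in range(500)]

def pick_one(random_word: int) -> str:
    """Pick one unminted CID using a random word and remove it so it can
    never be picked again (swap-and-pop keeps the removal cheap)."""
    index = random_word % len(unminted_cids)
    chosen = unminted_cids[index]
    unminted_cids[index] = unminted_cids[-1]   # move the last item into the gap
    unminted_cids.pop()                        # shrink the pool by one
    return chosen

def mint_batch(number: int, random_word: int) -> list[str]:
    """Derive one fresh number per pick from a single random word."""
    return [
        pick_one(int.from_bytes(
            hashlib.sha256(f"{random_word}:{i}".encode()).digest(), "big"))
        for i in range(number)
    ]

# Off-chain simulation: secrets.randbits() stands in for the Chainlink value.
print(mint_batch(3, secrets.randbits(256)))
```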
I am modelling the data of my application to use DynamoDB.
My data model is rather simple:
I have users and projects
Each user can have multiple projects
Users can number in the millions, and projects per user in the thousands.
My access pattern is also rather simple:
Get a user by id
Get a list of paginated users sorted by name or creation date
Get a project by id
Get projects by user, sorted by date
My single table for this data model is the following:
I can easily implement all my access patterns using table PK/SK and GSIs, but I have issues with number 2.
According to the documentation and best practices, to get a sorted list of paginated users:
I can't use a scan, as sorting is not supported
I should not use a GSI with a PK that would put all my users in the same partition (e.g. GSI PK = "sorted_user", SK = "name"), as that would make my single partition hot and would not scale
I can't create a new entity of type "organisation", put all users in there, and query by PK = "org", as that would have the same hot partition issue as above
I could bucket users and use write sharding, but I don't really know how I could practically query paginated, sorted users: the bucket PKs would have to be more or less random, so I would have to query all buckets and merge the results to sort all users together. I also thought that bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard.
My application model is rather simple. However, after having read all the docs and best practices and watched many online videos, I find myself stuck with the most basic use case, which DynamoDB does not seem to support well. I suppose it must be quite common to have to get lists of users in some sort of admin panel for practically any modern application.
What would others do in this case? I would really like to use DynamoDB for all the benefits it offers, especially in terms of cost.
Edit
Since I have been asked, in my app the main use case for 2) is something like this: https://stackoverflow.com/users?tab=Reputation&filter=all.
As to the sizing, it needs to scale well, at least to the tens of thousands.
I also thought that bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard.
I think this sounds like a reasonable approach.
The US Social Security Administration publishes data about names on its website. You can download the list of name data from as far back as 1879! I stumbled upon a website from data scientist and linguist Joshua Falk that charted the baby name data from the SSA, which can give us a hint of how names are distributed by their first letter.
Your users may not all be from the US, but this can give us an understanding of how names might be distributed if partitioned by the first letter.
While not exactly evenly distributed, perhaps it's close enough for your use case? If not, you could further distribute the data by using the first two (or three, or four...) letters of the name as your partition key.
1 million names likely amount to no more than a few MBs of data, which isn't very much. Partitioning based on name prefixes seems like a reasonable way to proceed.
You might also consider using a tool like ElasticSearch, which could support your second access pattern and more.
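If it helps, here is a rough boto3 sketch of what querying one name-prefix partition of such a GSI might look like (the table name, index name, and attribute names are all assumptions):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")   # assumed table name

def users_page(prefix: str, limit: int = 25, start_key: dict | None = None):
    """Read one page of users from an assumed GSI whose partition key is the
    first letter(s) of the name and whose sort key is the full name."""
    kwargs = {
        "IndexName": "UsersByNamePrefix",              # assumed GSI name
        "KeyConditionExpression": Key("name_prefix").eq(prefix),
        "Limit": limit,
    }
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key        # pagination token
    resp = table.query(**kwargs)
    return resp["Items"], resp.get("LastEvaluatedKey")

# Page through the "A" partition; a globally sorted list across all prefixes
# would be merged client-side, one partition at a time.
items, next_key = users_page("A")
```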
I've included links to other answers along with our approaches; these seem to be the most optimal solutions on the web right now.
Our records need to be categorized (eg. "horror", "thriller", "tv"), and randomly accessible both in specific categories and across all/some categories. We generally need to access about 20 - 100 items at a time. We also have a smallish number of categories (less than 100).
We write to the database for uploading/removing content, although this is done in batches and does not need to be real time.
We have tried two different approaches, with two different data structures.
Approach 1
AWS DynamoDB - Pick a record/item randomly?
Help selecting nth record in query.
In short: use the category as the hash key and a UUID as the sort key. Generate a random UUID, query Dynamo using greater than or less than, and limit the result to 1. This is even suggested by an AWS employee in the second link. (We've also tried increasing the limit to the number of items we need, but this increases the probability of the query failing the first time around.)
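For reference, a rough boto3 sketch of that query (the table, attribute, and key names are placeholders):

```python
import uuid
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ContentTable")   # placeholder name

def random_in_category(category: str) -> dict | None:
    """Pick a random pivot UUID and take the first item above it.
    Returns None when the pivot happens to be above every stored UUID,
    which is the retry case described in the issues below."""
    pivot = str(uuid.uuid4())
    resp = table.query(
        KeyConditionExpression=Key("category").eq(category) & Key("id").gt(pivot),
        Limit=1,
    )
    items = resp.get("Items", [])
    return items[0] if items else None   # caller retries with lt() or a new pivot
```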
Issues with this approach:
The first query can fail if the random UUID is greater than (or less than) all of the stored UUIDs in that partition
Querying any specific category will cause throttling at scale (small number of partitions)
We've also considered adding a suffix to each category to artificially increase the number of partitions we have, as pointed out in the following link.
AWS Database Blog
Choosing the Right DynamoDB Partition Key
Approach 2
Amazon Web Services: How do we get random item from the dynamoDb's table?
We do something similar to this: concatenate the category with a sequential number and use that as the hash key, e.g. horror-000001.
By knowing the number of records in each category, we're able to perform random queries across our entire data set, while also avoiding hot partitions/keys.
Issues with this approach
We need a secondary data structure to manage the sequential counts across each category
Writing (especially deleting) is significantly more complex, although this doesn't need to happen in real time.
Conclusion
Both approaches solve our main use case of random queries on a category or categories, but their drawbacks are really deterring us from using them. We're leaning more towards approach #1 with suffixes to solve the hot-partitioning issue, although we would need the additional retry logic for failed queries.
Is there a better way of approaching this problem? We're specifically looking for solutions capable of scaling well (no scan) without requiring extra resources to be implemented. Approach #1 fits the bill, but needing to manage suffixes and failed attempts really deters us from using it, especially when it is called inside a Lambda (billed for time used).
Thanks!
Follow Up
After more research and testing, my team has decided to move towards MySQL hosted on RDS for these tables. We learned that this is one of the few use cases where DynamoDB does not fit, and it would require rewriting your use case to fit the DB (bad).
We felt that the extra complexity required to integrate random sampling on DynamoDB wasn't worth it, and we were unable to come up with any comparable solutions. We are, however, sticking with DynamoDB for our tables that do not need random accessibility due to the price and response times.
For anyone wondering why we chose MySQL, it was largely due to the available Node.js library, great online resources (which DynamoDB definitely lacks), easy integration via RDS with our Lambdas, and the option to migrate to Amazon's Aurora database.
We also looked at PostgreSQL, but we weren't as happy with the client library or admin tools, and we believe that MySQL will suit our needs for these tables.
If anybody has anything else they'd like to add or a specific question please leave a comment or send me a message!
This was too long for a comment, and I guess it's pretty much a full-fledged answer now.
Approach 2
I've found that my typical time to get a single item from dynamodb to a host in the same region is <10ms. As long as you're okay with at most 1-2 extra calls, you can quite easily implement approach 2.
If you use a keys-only GSI where the category is your hash key and the primary key of the table is your range key, you can quickly find the single highest-numbered item within a category.
When you add a new item, find the largest number for that category from the GSI and then write the new item to the table with sequence number n+1.
When you delete, find the item with the largest sequence number for that category from the GSI, overwrite the item you are deleting, and then delete the now duplicated item from its position at the highest sequence number.
To randomly get an item, query the GSI to find the highest numbered item in the category, and then randomly pick a number since you now know the valid range.
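A rough boto3 sketch of that flow (all names are placeholders; the keys-only GSI is assumed to have the category as its hash key and the table's "category-NNNNNN" primary key as its range key):

```python
import random
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ContentTable")   # placeholder name

def highest_sequence(category: str) -> int:
    """Find the largest sequence number in a category from the keys-only GSI
    by reading a single item in descending key order."""
    resp = table.query(
        IndexName="ByCategory",                  # assumed GSI name
        KeyConditionExpression=Key("category").eq(category),
        ScanIndexForward=False,                  # descending: first item is the max
        Limit=1,
    )
    pk = resp["Items"][0]["pk"]                  # e.g. "horror-000123"
    return int(pk.rsplit("-", 1)[1])

def random_item(category: str) -> dict:
    """Pick a number in the known valid range and fetch that item directly."""
    n = random.randint(1, highest_sequence(category))
    return table.get_item(Key={"pk": f"{category}-{n:06d}"})["Item"]
```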
Approach 1
I'm not sure exactly what you mean when you say "without requiring extra resources to be implemented". If you're okay with using a managed resource (no dev work to implement), you can also make Approach 1 work by putting a DAX cluster in front of your dynamodb table. Then you can query to your heart's content without really worrying about hot partitions. (Though the caching layer means that new/deleted items won't be reflected right away.)
We have a project for a mobile app where a user searches for places based on their position and the category they prefer ("fast food restaurants", for example). The client wants to use DynamoDB, and we are trying hard to understand how to best model the data.
All queries will be based in two fields:
A string containing the geohash value for a bounding box --> that is our hash (partition) key
An int containing the category type of the item --> that is our range (sort) key
After reading the documentation, we found that this solution doesn't follow Amazon's recommendations, because the hash key will be repeated a lot and will not make good use of parallel scanning, and the range key doesn't represent a range at all.
So we are kind of lost on how to proceed. Any help will be appreciated.
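For reference, the kind of query we have in mind looks roughly like this (a Python/boto3 sketch; the table and attribute names are just placeholders):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Places")     # placeholder table name

def places_in_cell(geohash_cell: str, category: int) -> list[dict]:
    """Fetch all places in one geohash cell for a single category, assuming
    PK = geohash of the bounding box cell and SK = numeric category."""
    resp = table.query(
        KeyConditionExpression=Key("geohash").eq(geohash_cell)
        & Key("category").eq(category)
    )
    return resp["Items"]
```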
Amazon released a library for geohashing (Java only):
http://aws.typepad.com/aws/2013/09/new-geo-library-for-dynamodb.html
We have a situation in our product where, for a long time, some data has been stored in the application's database as a SQL string (a choice of MS SQL Server or Sybase SQL Anywhere) that was encrypted via the Windows API function CryptEncrypt (direct and decryptable).
The problem is that CryptEncrypt can produce NULs in its output, meaning that when the ciphertext is stored in the database, string manipulation will at some point truncate it.
Ideally we'd like to use an algorithm that produces ciphertext that doesn't contain NULs, as that would require the least change to the existing databases (no changing a column from string to binary, no new code to deal with binary instead of strings); we would just decrypt the existing data and re-encrypt it with the new algorithm at database upgrade time.
The algorithm doesn't need to be the most secure, as the database is already in a reasonably secure environment (not an open network / the inter-webs) but does need to be better than ROT13 (which I can almost decrypt in my head now!)
Any semi-decent algorithm has a strong chance of generating a NUL byte somewhere in the resulting ciphertext.
Why not do something like base-64 encode your resulting binary blob before persisting to the DB? (sample implementation in C++).
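As a rough illustration of the round trip (the linked sample is in C++; this sketch uses Python, with a made-up blob standing in for the real ciphertext):

```python
import base64

def to_db_string(ciphertext: bytes) -> str:
    """Base64-encode the encrypted blob so the stored value is plain ASCII
    and can never contain an embedded NUL."""
    return base64.b64encode(ciphertext).decode("ascii")

def from_db_string(stored: str) -> bytes:
    """Decode the stored string back to the ciphertext before decrypting."""
    return base64.b64decode(stored)

blob = bytes([0, 1, 2, 0, 255])          # stand-in for CryptEncrypt output
assert from_db_string(to_db_string(blob)) == blob
```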
Storing a hash is a good idea. However, please definitely read Jeff's You're Probably Storing Passwords Incorrectly.
That's an interesting route OJ.
We're looking at the feasibility of a non-reversible method (still making sure we don't explicitly retrieve the data to decrypt), e.g. just storing a hash to compare against on submission.
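Something along these lines (a Python sketch; the salt size and iteration count are just illustrative values):

```python
import hashlib, hmac, os

def hash_for_storage(secret: str) -> bytes:
    """Store a salted, slow hash instead of the reversible ciphertext
    (per the 'You're Probably Storing Passwords Incorrectly' advice)."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", secret.encode(), salt, 100_000)
    return salt + digest                     # keep the salt alongside the hash

def matches(submission: str, stored: bytes) -> bool:
    """Re-hash the submitted value with the stored salt and compare."""
    salt, digest = stored[:16], stored[16:]
    candidate = hashlib.pbkdf2_hmac("sha256", submission.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)

record = hash_for_storage("s3cret")
assert matches("s3cret", record)
assert not matches("wrong", record)
```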
It seems that the developer handling this is going to wrap the existing encryption with yEnc to preserve the table integrity, as the data needs to be retrievable, and this saves all that messy mucking about with infinite-improbab.... uhhh, changing column types on entrenched installations.
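Roughly, the yEnc wrapping boils down to this (a minimal sketch of the escaping step only, ignoring yEnc's line wrapping and =ybegin/=yend headers):

```python
def yenc_escape(data: bytes) -> bytes:
    """Offset every byte by 42 and escape the critical characters
    (NUL, LF, CR, '=') so the result never contains a NUL byte."""
    out = bytearray()
    for b in data:
        c = (b + 42) % 256
        if c in (0x00, 0x0A, 0x0D, 0x3D):       # critical characters
            out.append(0x3D)                    # '=' escape marker
            c = (c + 64) % 256
        out.append(c)
    return bytes(out)

def yenc_unescape(data: bytes) -> bytes:
    """Reverse of yenc_escape."""
    out = bytearray()
    it = iter(data)
    for c in it:
        if c == 0x3D:                           # escaped byte follows
            c = (next(it) - 64) % 256
        out.append((c - 42) % 256)
    return bytes(out)

ciphertext = bytes([0, 214, 61, 10, 13, 200])   # bytes that would break a string column
assert yenc_unescape(yenc_escape(ciphertext)) == ciphertext
assert 0x00 not in yenc_escape(ciphertext)
```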
Cheers Guys