Storing and Searching Large Data Set

I'm relatively new to programming in C++ and I'm trying to create a data set in which each entry has just two values: an ID number and a string. There will be about 100,000 of these pairs. I'm just not sure what data structure would best suit my needs.
The data set has the following requirements:
- the ID number corresponding to a string is 6 digits (so 000000 to 999999)
- not all ID values between 000000 and 999999 will be used
- the user will not have permission to modify the data set
- I wish to search by ID or by words in the string, and return the ID and string to the user
- speed of searching is important
So basically I'm wondering what I should use (vector, list, array, SQL database, etc.) to construct this data set and search it quickly.

- the ID number corresponding to a string is 6 digits (so 000000 to 999999)
Good, use an int, or more precisely an int32_t, for the ID.
- not all ID values between 000000 and 999999 will be used
Not a problem...
- the user will not have permission to modify the data set
Encapsulate your data within a class and you are good to go.
- I wish to search by ID or by words in the string, and return the ID and string to the user
Good, use Boost.Bimap.
- speed of searching is important
I know, that's why you are using C++... :-)
You may also want to look at SQLite, which can also function as an in-memory database.
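If the ID is kept as an integer, as suggested above, the leading zeros become purely a display concern. A minimal sketch of zero-padded output follows; format_id is a hypothetical helper name, not something from the question.

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: renders an integer ID as a six-character,
// zero-padded string (42 becomes "000042").
inline std::string format_id(std::int32_t id) {
    std::ostringstream out;
    out << std::setw(6) << std::setfill('0') << id;
    return out.str();
}
```

This assumes IDs are never negative and never exceed 999999; larger values would simply print with more than six digits.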

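Use std::map:
#include <map>
#include <string>

int main()
{
    // The ID is kept as a string key here, so leading zeros are preserved.
    std::map<std::string /*id*/, std::string> m;
    m["000000"] = "any string you want";
}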

Unsorted vectors and lists are the worst choice here, because every search has to loop through all the elements.
I suggest you use a map. Building the entire map takes longer, O(n log n), but I still recommend it because each lookup is O(log n), which is pretty fast, and you said "speed of searching is important".

I'd suggest something like a class which contains a vector of your id/string pairs, an unordered_map which maps id to an iterator or reference into that vector, and an unordered_map which maps a string to an iterator or reference into that vector. Then, two search functions in the class which look up the id/string pair based on the id or a string.
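A minimal sketch of that layout, under a few assumptions: the maps store vector indices rather than iterators (indices stay meaningful if the object is copied), the string lookup is an exact match on the whole string, and Entry, Catalog, find_by_id and find_by_text are illustrative names. Searching by individual words would need an additional word-to-entry index that is not shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// One ID/string pair.
struct Entry {
    std::int32_t id;
    std::string text;
};

// Read-only container: entries are fixed at construction time.
class Catalog {
public:
    explicit Catalog(std::vector<Entry> entries) : entries_(std::move(entries)) {
        // Build both indices once; if a key repeats, the last entry wins.
        for (std::size_t i = 0; i < entries_.size(); ++i) {
            by_id_[entries_[i].id] = i;
            by_text_[entries_[i].text] = i;
        }
    }

    // Returns nullptr when the ID is not present.
    const Entry* find_by_id(std::int32_t id) const {
        auto it = by_id_.find(id);
        return it == by_id_.end() ? nullptr : &entries_[it->second];
    }

    // Exact-match lookup on the full string; nullptr when not present.
    const Entry* find_by_text(const std::string& text) const {
        auto it = by_text_.find(text);
        return it == by_text_.end() ? nullptr : &entries_[it->second];
    }

private:
    std::vector<Entry> entries_;
    std::unordered_map<std::int32_t, std::size_t> by_id_;
    std::unordered_map<std::string, std::size_t> by_text_;
};
```

Because the class exposes only const lookup functions, callers cannot modify the data set, which also covers the read-only requirement.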

You have a couple of options.
Use a database (MySQL, SQLite, etc.). Performance depends on the database you use.
Or, if you want to do it in C++ code, you can use vectors: one vector for the keys and another for the strings. You also need to map the related indices between the two vectors.
Sort both vectors after adding a new item, and remember to update the map of related indices.
Then use binary search to find either a key or a value. It should be fast enough.
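A minimal sketch of the binary-search idea, simplified to a single vector of (ID, string) pairs sorted by ID, so that no separate index map is needed for ID lookups. Since the data set is read-only, it assumes the vector is sorted once after loading. Record, sort_by_id and find_text are illustrative names, and searching by string would need a second, string-sorted arrangement that is not shown.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Record = std::pair<std::int32_t, std::string>;

// Sort the records by ID; call this once after loading the data.
inline void sort_by_id(std::vector<Record>& records) {
    std::sort(records.begin(), records.end(),
              [](const Record& a, const Record& b) { return a.first < b.first; });
}

// Binary search by ID on a vector already sorted with sort_by_id.
// Returns a pointer to the stored string, or nullptr if the ID is absent.
inline const std::string* find_text(const std::vector<Record>& records,
                                    std::int32_t id) {
    auto it = std::lower_bound(
        records.begin(), records.end(), id,
        [](const Record& r, std::int32_t key) { return r.first < key; });
    if (it == records.end() || it->first != id) {
        return nullptr;
    }
    return &it->second;
}
```

Each lookup is O(log n), the same order as std::map, but the contiguous storage of a vector tends to be friendlier to the cache.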

Related

Is using json as sort key/partition key value good practice in DynamoDB?

I'm trying to define a schema for a DynamoDB table in which more than two values identify a row.
A potential solution is to have the sort key contain more than one value, as specified here.
Inspired by this approach, I'm thinking that instead of using a simple delimiter to concatenate the values, I could use JSON or any other string representation of an object (e.g. a string produced by Jackson) as the value of the sort key. This should achieve a similar goal and be easy to convert.
However, my concern is that increasing the length of the sort key this way will decrease the performance of DynamoDB. Is it fine to use a complicated string as the sort key?

TL;DR: For your Sort Key, you can use any string (within the byte limits) that distinguishes your records within a partition. But if you are clever about it, you can make better use of it for sorting and filtering.
There are limits to the key lengths:
Partition Key: 1 to 2048 bytes
Sort Key: 1 to 1024 bytes
I am not aware of any significant performance differences based on the length of your partition and sort keys. I'm sure that ensuring performance is part of the reason for AWS to set these particular limits.
Technically, you should be fine to use any string as your key, including JSON. However, depending on how you intend to query your table, you may want to consider a more clever arrangement for your Sort Key.
For example, if your sort key contains First and Last names, you might end up with JSON like these:
{"LastName":"Doe","FirstName":"John"}
{"FirstName":"Jane","LastName":"Doe"}
JSON alone doesn't care about the ordering of the name fields, so if you don't put additional constraints on your JSON, you might make it difficult to query all records with LastName "Doe".
The documentation you linked hints at an example of a pattern you might follow for your sort key:
LASTNAME#Doe#FIRSTNAME#John
LASTNAME#Doe#FIRSTNAME#Jane
Now you can easily query for all records with last name Doe using a begins_with key condition on the prefix "LASTNAME#Doe#FIRSTNAME#". Your records will also naturally be sorted by last name, then first name.
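To illustrate why the delimited layout makes prefix queries work, here is a small local sketch of building such a key and checking a prefix against it. It does not call DynamoDB at all: make_sort_key, last_name_prefix and has_prefix are hypothetical helpers, and the sketch assumes names never contain the '#' delimiter.

```cpp
#include <string>

// Hypothetical helper: builds a sort key in the LASTNAME#...#FIRSTNAME#... layout.
inline std::string make_sort_key(const std::string& last, const std::string& first) {
    return "LASTNAME#" + last + "#FIRSTNAME#" + first;
}

// Prefix shared by every sort key with the given last name.
inline std::string last_name_prefix(const std::string& last) {
    return "LASTNAME#" + last + "#FIRSTNAME#";
}

// Local stand-in for a prefix condition: true when key starts with prefix.
inline bool has_prefix(const std::string& key, const std::string& prefix) {
    return key.compare(0, prefix.size(), prefix) == 0;
}
```

Including the delimiter after the last name in the prefix matters: the prefix for "Doe" matches "LASTNAME#Doe#FIRSTNAME#Jane" but not a key for the last name "Doering".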
Rather than having to parse out that string when you want to find a record's first and last names, you could just duplicate the content in the record by adding separate fields for "FirstName" and "LastName" for convenience.
So your full record might look something like this:
{
"PK":"some-primary-key",
"SK":"LASTNAME#Doe#FIRSTNAME#John",
"FirstName":"John",
"LastName":"Doe"
}

DynamoDB create index on map or list type

I'm trying to add an index to an attribute inside of a map object in DynamoDB and can't seem to find a way to do so. Is this something that is supported, or are indexes really only allowed on scalar values? The documentation around this seems to be quite sparse. I'm hoping that the indexing functionality is similar to MongoDB, but so far my attempts to reference the attribute to index using dot syntax have not been successful. Any help or additional info that can be provided is appreciated.

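Indexes can be built only on top-level JSON attributes. In addition, key attributes must be scalar values in DynamoDB. The FAQ excerpt below lists String, Number, Binary, and Boolean, but the DynamoDB developer guide allows only String, Number, and Binary for key attributes.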
From http://aws.amazon.com/dynamodb/faqs/:
Q: Is querying JSON data in DynamoDB any different?
No. You can create a Global Secondary Index or Local Secondary Index
on any top-level JSON element. For example, suppose you stored a JSON
document that contained the following information about a person:
First Name, Last Name, Zip Code, and a list of all of their friends.
First Name, Last Name and Zip code would be top-level JSON elements.
You could create an index to let you query based on First Name, Last
Name, or Zip Code. The list of friends is not a top-level element,
therefore you cannot index the list of friends. For more information
on Global Secondary Indexing and its query capabilities, see the
Secondary Indexes section in this FAQ.
Q: What data types can be indexed?
All scalar data types (Number, String, Binary, and Boolean) can be
used for the range key element of the local secondary index key. Set,
list, and map types cannot be indexed.

I have tried doing hash(str(object)) while storing the object separately. This hash gives me an integer (Number), and I am able to use a secondary index on it. Below is a sample in Python. It is important to use a hash function that generates the same hash every time for a given value (Python's built-in hash() is salted per process for strings by default), so I am using sha1.
# Generate a small integer hash:
import hashlib

def hash_8_digits(source):
    # sha1 is deterministic across runs; keep only the last 8 decimal digits.
    return int(hashlib.sha1(source.encode()).hexdigest(), 16) % (10 ** 8)
The idea is to keep the indexed value small while still keeping the entity intact. That is, rather than serializing and storing the object as a string and changing the whole way the object is used, I am storing a smaller hash value along with the actual list or map.

Find a User by name quickly

I have a C++ server that manages Users for a game. These Users have unique AccountIDs and almost every look-up for Users on the server involves finding a User from a global map of
std::map<unsigned int, User*>
where unsigned int is the AccountID. This works great except for a new case where I am implementing a friends list. Adding a friend to someone's friends list needs to be done by Username. I am also running into this problem when inviting people by Username to a chatroom or other "party" type events.
My two current options are:
1) Iterate through the entire Users map, doing a string comparison by Username.
2) Do a database look-up on an indexed Username column and return the AccountID, then do a map find for the User*.
Both of these solutions are very inefficient. I am looking for a more optimized solution of finding a User by Username.
The first idea that comes to mind is a Hashtable that hashes on the Username, but then I have two different data structures (the Hashtable and the Map) that are doing the same thing except one is by AccountID and one is by name.
A second option could be to use the Username as the key for the map, although I can't imagine having a string for a key being too efficient.
Any suggestions on what I should do here? As for some more information on the server, there will be around 1000+ Users and they will be leaving and joining constantly.

C++11 has std::unordered_map which will automagically handle hashing for you, e.g. std::unordered_map<std::string, User*>.
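A minimal sketch of keeping a by-name index alongside the by-ID map, so that both stay in sync as users join and leave. The UserRegistry class, its member names, and the use of std::unique_ptr for ownership are assumptions for illustration; the question itself stores raw User* pointers in a std::map.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct User {
    unsigned int account_id;
    std::string username;
};

class UserRegistry {
public:
    // Registers a user; returns false if the ID or the name is already taken.
    bool add(unsigned int id, std::string name) {
        if (by_id_.count(id) != 0 || by_name_.count(name) != 0) {
            return false;
        }
        auto user = std::make_unique<User>(User{id, std::move(name)});
        by_name_[user->username] = user.get();
        by_id_[id] = std::move(user);
        return true;
    }

    // Removes a user from both indices; returns false if the ID is unknown.
    bool remove(unsigned int id) {
        auto it = by_id_.find(id);
        if (it == by_id_.end()) {
            return false;
        }
        by_name_.erase(it->second->username);
        by_id_.erase(it);
        return true;
    }

    // Both lookups return nullptr when there is no match.
    User* find_by_id(unsigned int id) const {
        auto it = by_id_.find(id);
        return it == by_id_.end() ? nullptr : it->second.get();
    }

    User* find_by_name(const std::string& name) const {
        auto it = by_name_.find(name);
        return it == by_name_.end() ? nullptr : it->second;
    }

private:
    std::unordered_map<unsigned int, std::unique_ptr<User>> by_id_;  // owns the users
    std::unordered_map<std::string, User*> by_name_;                 // non-owning index
};
```

Routing every join and leave through one class is what keeps the two structures from drifting apart, which addresses the concern about maintaining two containers for the same data.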

I would suggest just using another map, std::map<std::string, User*>. I believe that for an application with ~1000 users, hash maps or more complicated solutions are over-engineering. The string-based lookup in a map will not be that expensive, and is practically free compared to a lookup in a database.
You may also be able to make use of the by-product, an alphabetically sorted list of users, somewhere else.

Add to list within document MongoDB

I have a database where I store player names that belong to people who have certain items.
The items have IDs and subIDs.
The way I am currently storing everything is:
Each ID has its own collection
Within the collection there is a document for each subID.
The document for each subID is laid out like so:
{
"itemID":itemID,
"playerNames":[playerName1, playerName2, playerName3]
}
I need to be able to add to the list of playerNames quickly and efficiently.
Thank you!

If I understood your question correctly, you need to add items to the "playerNames" array. You can use one of the following operators:
If the player names array will have unique elements in it, use $addToSet
Otherwise, use $push

Scan a dynamodb based on a list

I have a String Set attribute (i.e. SS) in a DynamoDB table. I need to scan the table to check whether a value is present in that set for any of the items.
Which comparison operator should I use for this scan?
For example, the table has items like this:
name
[email1, email2]
phone
I need to search for items containing a particular email, say email1 alone, without supplying the entire set.

It seems like you are looking for the CONTAINS operator of the Scan operation. It is basically the equivalent of the "in" operator in Python.
This said, if you need to perform this often, you probably should de-normalize your data to make it faster.
For example, you could build a second table like this:
hash_key: name
range_key: email
Of course, you would have to maintain this index yourself and query it manually.
