Application for filtering a database in a short period of time - C++

I need to create an application that would allow me to get phone numbers of users matching specific conditions as fast as possible. For example, we've got 4 columns in an SQL table (region, income, age, and a 4th with the phone number itself). I want to get phone numbers from the table with a specific region and income. Just running an SQL query won't help because it takes a significant amount of time. The database updates once per day, and I have some time to prepare the data as I wish.
The question is: how would you make the process of getting phone numbers with specific conditions as fast as possible? O(1) in the best case. Consider storing values from the SQL table in RAM for the fastest access.
I came up with the following idea:
For each phone number, create something like a bitset: 0 if a particular condition is false and 1 if it is true. But I'm not sure I can implement this for columns with non-boolean values.
Create a vector with phone numbers.
Create a vector with phone numbers' bitsets.
To get the phone numbers, iterate over the second vector and compare each bitset with the required one.
It's not O(1) at all, and I still don't know what to do about non-boolean columns. I thought maybe it's possible to do something with std::unordered_map (all phone numbers are unique) or to improve my idea with vectors and masks; a rough sketch of that direction follows below.
P.S. The SQL table consumes 4 GB of memory and I can store up to 8 GB in RAM. There are 500 columns.
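To make that direction concrete, here is a minimal, hedged sketch of the precomputation idea, assuming region and income can be mapped to small integer buckets once a day when the data is reloaded (the bucketing scheme and all names are placeholders, not a finished design):

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical precomputation sketch: once a day, after the table is refreshed,
// bucket each row's (region, income) into a single integer key and group the
// phone numbers by that key. A lookup is then an average O(1) hash probe.
using Key = std::uint64_t;

// Placeholder bucketing: region is assumed to already be a small id, income is
// split into fixed-width bands. Real bucketing depends on the actual columns.
Key make_key(std::uint32_t region_id, std::uint32_t income_band) {
    return (static_cast<Key>(region_id) << 32) | income_band;
}

struct Index {
    std::unordered_map<Key, std::vector<std::string>> phones_by_key;

    // Called for every row while loading the daily dump into RAM.
    void add(std::uint32_t region_id, std::uint32_t income_band, std::string phone) {
        phones_by_key[make_key(region_id, income_band)].push_back(std::move(phone));
    }

    // O(1) average-case lookup for one exact (region, income-band) combination.
    const std::vector<std::string>* lookup(std::uint32_t region_id, std::uint32_t income_band) const {
        auto it = phones_by_key.find(make_key(region_id, income_band));
        return it == phones_by_key.end() ? nullptr : &it->second;
    }
};

int main() {
    Index idx;
    idx.add(7, 3, "+15550100");  // made-up row: region 7, income band 3
    if (const auto* phones = idx.lookup(7, 3))
        std::cout << phones->front() << '\n';
}

Of course this only gives O(1) for combinations that were keyed ahead of time; arbitrary filters over 500 columns would still need either a scan over per-row bitsets or real indexes.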

I want to get phone numbers from the table with specific region and income.
You would create indexes in the database on (region, income). Let the database do the work.

If you really want it to be fast, I think you should consider Elasticsearch. Think of every phone number in the DB as a document with properties (your columns).
You will need to reindex the table once a day (or in real time), but when it's time to search you just use an Elasticsearch filter to find the results.
Another option is to have an index on every column; in this case the engine will do an index merge to increase performance. I would also consider using MEMORY tables. If you write to this table, consider having a read replica used just for reads.
To optimize your table, log your queries somewhere and add multi-column indexes just for the top X most popular searches, depending on your memory limitations.
You can use NVMe as your DB disk (if you can't load everything into memory).


AWS Redshift sampling

For a data quality check I need to collect data in a specific interval.
Some tables are huge in size.
Is there any hack to do this without affecting performance?
Like selecting 100 rows randomly.
How random do you need it to be? The classic way to do this is with "WHERE RANDOM() < .001". If you need it to give you a repeatable "random" set then you can add a seed. The issue is that your tables are huge, so this means reading (scanning) every row from disk just to throw most of them away, and since a table scan can take significant time this isn't what you want to do.
So you may want to take advantage of Redshift's "limited table scan" capabilities as part of your "random" sampling. (The fastest data to read from disk is the data you don't read from disk.) The issue here is that this solution will depend on your table's sort keys and ordering, which pushes the solution into even "more pseudo" random territory (less of a true random sampling). In many cases this isn't a big deal, but if the statistics really matter then this may not work for you.
This is done by sampling "blocks", not rows, based on the sort key(s). This sampling of blocks can be done randomly, and each block of data will represent about 250K rows (based on sort key data type, compression, etc., and could range anywhere from under 100K rows to 2M rows). Doing this takes a little inspection of STV_BLOCKLIST. The storage quantum for Redshift is the 1MB block, and every block's metadata in the system can be referenced in STV_BLOCKLIST. This system table contains min and max values for each block. First find all the blocks for the sort key of the table in question. Next, pick a random sample of these blocks (and if you are still dealing with a lot of data, make sure this sampling picks an even number from across all the slices to avoid execution skew).
Now the trick is to translate these min and max metadata values into a WHERE clause that performs the desired sampling. These min and max values are BIGINTs and are hashed from the data in the sort key column. This hash is data-type dependent. If the data type is BIGINT then the hash is quite simple; if the data type is timestamp then it is a bit more complex. But the ordering will be preserved across the hashing function for the data type involved. Reverse engineering this hash isn't hard - just perform a few experiments - but I can help if you tell me the type involved, as I've done this for just about every data type at this point.
You can even do a random sampling of rows on top of this random sampling of blocks. Or, if you want, you can just pick some narrow ranges of the sort key value and then randomly sample rows, and avoid all this reverse-engineering business. The idea is to use Redshift's "reduced scan" capability to greatly reduce the amount of data read from disk. To do this you need to be metadata-aware in your choice of sampling windows, which often means a sort key WHERE clause. This is all about understanding how the database engine works and using its capabilities to your advantage.
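To make the block-sampling step a little more concrete, here is a rough, non-Redshift-specific sketch. It assumes the per-block min/max values have already been fetched from STV_BLOCKLIST and translated back into sort-key values (the reverse-engineering step above); it just picks a random subset of blocks and builds the corresponding WHERE clause. The struct and function names are illustrative only:

#include <algorithm>
#include <cstddef>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Illustrative only: assume the per-block min/max values have already been read
// from STV_BLOCKLIST and mapped back to sort-key values. This picks a random
// subset of blocks and builds a WHERE clause restricted to their ranges.
struct BlockRange { long long min_val; long long max_val; };

std::string sampling_where_clause(std::vector<BlockRange> blocks,
                                  std::size_t blocks_to_sample,
                                  const std::string& sort_key_column) {
    if (blocks.empty()) return "";  // nothing to sample

    std::mt19937 rng{std::random_device{}()};
    std::shuffle(blocks.begin(), blocks.end(), rng);
    blocks.resize(std::min(blocks_to_sample, blocks.size()));

    std::ostringstream where;
    where << "WHERE ";
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        if (i) where << " OR ";
        where << "(" << sort_key_column << " BETWEEN "
              << blocks[i].min_val << " AND " << blocks[i].max_val << ")";
    }
    // A row-level RANDOM() filter can still be layered on top of this clause
    // for the second, row-level sampling stage mentioned above.
    return where.str();
}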
I understand that this answer is based on some unstated information so please reach out in a comment if something isn't clear.

PostgreSQL ArrayField vs ForeignKey: which one is more performant?

I want to design a phone book application in which each contact can have multiple numbers. There are two DB designs:
use contact foreign key in each number.
store numbers in an ArrayField inside each contact.
Which solution is more performant in production, and why?
Thanks in advance.
If you build a GIN index on the array column, and Django writes queries in such a way that they can use that index, then the performance for reading will be quite similar between the two.
It is very unlikely that the performance difference should be the driving factor behind this choice. For example, do you need more info about each phone number than just the number, such as when it was added, when it was last used, whether it is a mobile phone or something else, etc.?
The array column should be faster because it only has to consult one index and table, rather than two of each. Also, it will be more compact, and so more cacheable.
On the other hand, the statistical estimates for your array column will have a problem when estimating rare values, which you are likely to have here, as no phone number is likely to be shared by a large number of people. This misestimate could have devastating effects on your query performance. For example, in a little test, overestimating the number of rows by many thousand-fold caused it to launch parallel workers for a single-row query, making it about 20-fold slower than with parallelization turned off, and 10 times slower than using the foreign-key representation, which doesn't suffer from the estimation problem.
For example:
create table contact as select md5(floor(random()*50000000)::text) as name, array_agg(floor(random()*100000000)::int) phones from generate_series(1,100000000) f(x) group by name;
vacuum analyze contact;
create index on contact using gin (phones );
explain analyze select * from contact where phones #> ARRAY[123456];
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=3023.30..605045.19 rows=216167 width=63) (actual time=0.668..8.071 rows=2 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on contact  (cost=2023.30..582428.49 rows=90070 width=63) (actual time=0.106..0.110 rows=1 loops=3)
         Recheck Cond: (phones #> '{123456}'::integer[])
         Heap Blocks: exact=2
         ->  Bitmap Index Scan on contact_phones_idx  (cost=0.00..1969.25 rows=216167 width=0) (actual time=0.252..0.252 rows=2 loops=1)
               Index Cond: (phones #> '{123456}'::integer[])
 Planning Time: 0.820 ms
 Execution Time: 8.137 ms
You can see that it estimates there will be 216167 rows, but in fact there are only 2. (For convenience, I used ints rather than the text field you would probably use for phone numbers, but this doesn't change anything fundamental.)
If this is really vital to you, then you should do the test and see, using your own data and your own architecture. It will depend on what does and does not fit in memory, what kinds of queries you are doing (do you ever look up numbers in bulk? Join them to other tables besides the immediately discussed foreign key?), and maybe how your driver/library handles columns/parameters with array types.

Show top 10 scores for a game using DynamoDB

I created a table "scores" in DynamoDB to store the scores of a game.
The table has the following attributes:
uuid
username
score
datetime
Now I want to query the "top 10" scores for the game (leaderboard).
In DynamoDB, it is not possible to create an index without a partition key.
So, how do I perform this query in a scalable manner?
No. You will always need the partition key. DynamoDB is able to provide the required speed at scale because it stores records with the same partition key physically close to each other instead of on 100 different disks (or SSDs, or even servers).
If you have a mature querying use case (which is what DynamoDB was designed for), e.g. "Top 10 monthly scores", then you can derive a month attribute from datetime, like 12/01/2017, 01/01/2018, and so on, and use this as your partition key, so all the scores generated in the same month get "bucketized" into the same partition for the best lookup performance. You can then keep score as the sort key.
You can of course have other tables (or LSIs) for weekly and daily scores. Or you can even choose the most granular bucket and sum up the results yourself in your application code. You'll probably need to pull a lot more than 10 records to get close enough to 100% accuracy at the aggregate level... I don't know. Test it. I wouldn't rely on it if I were dealing with money/commissions, but for game scores, who cares :)
Note: If you decide to go this route, then instead of using "12/01/2017" etc. as the partition values, I'd use integer offsets, e.g. the UNIX epoch (rounded off to represent the first midnight of the month), to make it easier to compute/code against. I used the friendly dates above to better illustrate the approach.
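As a small illustration of that integer-offset idea, here is a sketch (C++20 calendar support assumed, UTC, and the function name is made up) that computes the UNIX epoch of the first midnight of the month, which could serve as the derived partition value:

#include <chrono>
#include <iostream>

// Made-up helper: compute the UNIX epoch (seconds, UTC) of the first midnight
// of the month that a timestamp falls in, to use as the derived partition value.
// Needs C++20 calendar support in <chrono>.
long long month_bucket(std::chrono::system_clock::time_point t) {
    using namespace std::chrono;
    const auto midnight = floor<days>(t);                 // midnight of that day (UTC)
    const year_month_day ymd{midnight};                   // split into year/month/day
    const sys_days first_of_month = ymd.year() / ymd.month() / 1;  // first day of the month
    return duration_cast<seconds>(first_of_month.time_since_epoch()).count();
}

int main() {
    const auto now = std::chrono::system_clock::now();
    std::cout << "partition value for this month: " << month_bucket(now) << '\n';
}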

DynamoDB table design for a SET-like scenario

I am moving from an RDBMS and I am not sure how best to design for the scenario below.
I have a table with around 200,000 questions, with question id as the partition key.
Users view questions, and I do not wish to show a viewed question to the same user again. So which is the better option?
1. Have a table with question id as the partition key and a set of user IDs as an attribute.
2. Have a table with user id as the partition key and a set of question IDs they have viewed as an attribute.
3. Have a table with question id as the partition key and user id as the sort key. Once a user has viewed a question, add a row to this table.
Options 1 and 2 might have a problem with the 400 KB size limit per item. The third seems like the better option, though I would end up with 100 million items, as there will be one row per user per question viewed. But I assume this is not a problem for DynamoDB?
Another problem is how to get 10 random questions not viewed by the user. Do I generate 10 random numbers between 1 and 200,000 (the number of questions) and then check that they are not in the table mentioned in option 3 above?
I definitely would not go with option 1 or 2, for the reason you mentioned: you would already be limiting your scalability by the 400 KB item limit. With a 128-bit (16-byte) UUID per user, you would be limited to roughly 25,000 users per question.
Option 3 is the way to go with DynamoDB, but what you need to consider is which attribute is the partition key and which is the range key. You could have user_id as the partition key and question_id as the range key. The answer to that decision depends on how your data is going to be accessed. DynamoDB divides the total table throughput across your partition keys: each one of your n partition keys gets 1/nth of the table throughput. So if you have a subset of partition keys that are accessed more than the others, you won't be efficiently utilizing your table throughput, because the partition keys that actually use less than 1/nth of the throughput are still provisioned for 1/nth of it. The general idea is that you want each of your partition keys to be utilized equally. I think you have it correct: I'm assuming that each question is served randomly and is no more popular than another, while some users might be more active than others.
The other part of your question is a little more difficult to answer. You could do it your way, where you have tables that contain question-user pairs for the questions those users have read, or you could have tables that contain the pairs for the questions those users haven't read. The tradeoff here is between the initial write cost and the subsequent read cost, and the answer depends on the number of questions you have compared to the consumption rate.
When you have a large number of questions compared to the rate at which users will progress through them, the chances of randomly selecting an already-chosen one are small, so you're going to want to store have-read question-user pairs. With this setup you don't pay a lot to initialize a user (you don't have to write a question-user pair for each question) and you won't have a lot of miss-read costs (i.e. where you select a question-user pair and it turns out they already read it; this still consumes read-write units).
If you have a small number of questions compared to the rate at which users consume them, then you're going to want to store the haven't-read question-user pairs. You pay something to initialize each user (writing in one question-user pair for each question), but then you don't have any accidental miss-reads. If you stored them as have-read pairs when there is a small number of questions, then you would encounter a lot of miss-reads as the percentage of read questions approaches 100% (to the point where you would have been better off just setting them up as haven't-read pairs).
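To illustrate the "generate random question ids and skip the have-read pairs" flow from the question, here is a rough sketch; the std::unordered_set stands in for whatever per-user have-read lookup you would actually run against DynamoDB, and all names are hypothetical:

#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

// Rough sketch of the "generate random question ids and filter out the ones
// already viewed" idea from the question. The unordered_set stands in for the
// per-user have-read lookup that would really be a DynamoDB query; it assumes
// unread questions vastly outnumber read ones (the many-questions case above).
std::vector<std::uint32_t> pick_unread(const std::unordered_set<std::uint32_t>& have_read,
                                       std::uint32_t total_questions,
                                       std::size_t wanted) {
    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<std::uint32_t> dist(1, total_questions);

    std::unordered_set<std::uint32_t> chosen;   // avoid duplicates within one batch
    std::vector<std::uint32_t> result;
    while (result.size() < wanted) {
        const std::uint32_t candidate = dist(rng);
        // Each skipped candidate is a "miss-read" in the cost discussion above.
        if (have_read.count(candidate) || !chosen.insert(candidate).second)
            continue;
        result.push_back(candidate);
    }
    return result;
}

How cheap this loop is depends on the have-read vs haven't-read tradeoff above; in the many-questions case it rarely has to retry.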
I hope this helps with your design considerations. Drop a comment if you need clarification!

Incremental update of millions of records, indexed vs. join

I'm currently developing a strategy for an incremental update of our user data. We assume 100,000,000 records in our database, of which approximately 1,000,000 records are updated per workflow.
The idea is to update the records in a MapReduce job. Is it useful to use indexed storage (e.g. Cassandra) to be able to access current records randomly? Or is it preferable to retrieve data from HDFS and join the new information to the existing records?
The record size is on the order of 200 bytes. The user data has a fixed length but should be extendable. The log events have a similar but not identical structure. The number of user records is likely to grow. Near-real-time updates are desirable, i.e. a 3-hour time gap is not acceptable; a few minutes is OK.
Do you have any experience with either of these strategies and data of this size?
Is the Pig JOIN fast enough? Is it always a bottleneck to read all records? Is Cassandra able to hold this amount of data efficiently? Which solution is scalable? What about the complexity of the system?
You need to define your requirements first. Your record volumes are not a problem, but you don't give a record length. Are the records fixed-length, with a fixed number of fields, or likely to change format over time? Are we talking about 100-byte records or 100,000-byte records? You need an index on a field/column if you wish to query by that field/column, unless you do all your work using map/reduce. Will the number of user records stay at 100 million (one server will probably suffice), or will it grow 100% per year (probably multiple servers, adding new ones over time)?
How you access records for updating depends on whether you need to update them in real time or whether you can run a batch job. Will updates be every minute, hour, or month?
I would strongly suggest you do some experimenting. Have you done any testing already? This will give you a context for your questions and this will lead to more objective questions and answers. It is unlikely that you can 'whiteboard' a solution based on your question.