memcached vs. database for random select - django

I have a question about DB/memcached usage. I have a set of questions in a database, divided by level (about 1000 questions per level). For each user, at each step, I need to select one random question for a specified level. I use Django's standard ORM with a random ordering:
question = Question.objects.all().filter(level=1).order_by('?')[0]
After analyzing the log files I saw that about 50% of all DB query time is spent on selecting questions, so I tried to use memcached for it. Because the choice is random, I can't simply use the cache as key-value storage of question_id-question pairs. So I decided to split the questions by level and store each group in memcached; to serve a request I fetch the group for the given level from memcached and pick a random question in Python, like this:
for level in ...:
    questions_by_level = [q for q in questions if q.level == level]
    cache.set('questions' + str(level), questions_by_level)
and when I need a question:
questions = cache.get('questions' + str(level))
question = choice(questions)
I have memcached on the same machine, and getting 1000 questions this way is about 2.5 times slower than from the database. This is probably because 1000 objects are fetched from memcached and deserialized into Python objects before a random one is selected.
Is it possible to choose another strategy for using the cache in this situation? The questions are updated rarely, so it is a good place for a cache from my point of view.
Thanks.
UPD: one solution that I discovered myself. For each question, build a string key of the form l_n, where l is the level and n is the number of the question within the group of questions of level l. Now, to find a random question, I build a random key:
key = str(level) + '_' + str(int(random.random() * num_of_questions_by_level))
pros: getting 1000 random questions this way is about 10 times faster than from the DB
cons: initial cache population is very slow
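A minimal sketch of this key-per-question approach, assuming Django's cache framework and the Question model used above (helper names such as populate_level_cache and the count_<level> keys are my own, not from the original post):

import random
from django.core.cache import cache
from myapp.models import Question  # assumed location of the model used above

def populate_level_cache(level):
    """Store each question under its own 'level_index' key and remember the count."""
    questions = list(Question.objects.filter(level=level))
    for index, question in enumerate(questions):
        cache.set('%d_%d' % (level, index), question, None)  # None = never expire
    cache.set('count_%d' % level, len(questions), None)
    return len(questions)

def random_question(level):
    """Fetch one random question for a level with a single cache lookup."""
    count = cache.get('count_%d' % level)
    if count is None:
        count = populate_level_cache(level)  # cold cache: rebuild this level
    return cache.get('%d_%d' % (level, random.randrange(count)))

Repopulating only when the questions change keeps the slow initialization off the request path.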

Store them in the database with sequential ID numbers, then simply pick a random number between 0 and the number of keys and check memcached for that key. If it returns the question, use it; if not, pull it from the database and put it in memcached for the next use.
This will run you into problems if you delete questions, so that there are missing IDs in the sequence, but this can be overcome: for example, instead of using IDs you can select the Xth item in the database, and when the questions do change you clear memcached so the data gets refreshed.
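A rough sketch of that read-through pattern with Django's cache, assuming sequential positions per level and a known question count (the key names and the num_questions parameter are illustrative, not from the answer):

import random
from django.core.cache import cache
from myapp.models import Question  # assumed model location

def get_random_question(level, num_questions):
    """Pick a random position, try memcached first, fall back to the database."""
    n = random.randint(1, num_questions)
    key = 'question_%d_%d' % (level, n)
    question = cache.get(key)
    if question is None:
        # Cache miss: load the n-th question of this level and cache it for next time.
        question = Question.objects.filter(level=level).order_by('id')[n - 1]
        cache.set(key, question)
    return question

When the question set changes, clearing the cache (or bumping a version prefix in the keys) forces the data to be refreshed, as the answer suggests.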

Related

Android/Java: Is there a fast way to filter large data saved in a list? And how to get high-quality pictures with small storage space on the server?

I have two questions.
The first one is:
I have a large amount of data that comes from the server and is saved in a list. The customer can filter this data by 7 filters plus two text watchers, which makes the filtering operation slow: it takes about 4 seconds each time.
I tried to put the filter keywords (like length or width ...) in one if statement with && between them,
but it didn't give me a result. I also tried to replace the TextWatcher with a Spinner, but that wasn't
useful either.
I'm using a single for loop.
So the question is: how can I apply multiple filters to a list of up to 2000 rows with minimal or no slowdown?
The second one is:
I save 2 to 8 pictures per item on the server in string form.
The question is: when I get these pictures from the server, how can I show them in high quality?
When I show them now I can see the pixels, and this is not good for the customer.
I don't want these pictures to take up much space on the server, and at the same time I want good quality when I load them for display.
I'm using Android/Java.
Thank you.
The answer to my first question is: if you want to use filters (like in an online clothes shop where you filter by lowest price), you should use a hash map rather than an ordinary list; it will be faster.
The answer to my second question is: if you want to store images in a database, you should save them as links, not as strings or any other datatype.
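A minimal sketch of the hash-map idea (in Python for brevity; the attribute names and the 2000-row sample are illustrative): pre-grouping the items by the attribute you filter on turns a full scan into a single dictionary lookup.

from collections import defaultdict

def build_index(items, attribute):
    # Group items by one attribute so a filter becomes a dict lookup.
    index = defaultdict(list)
    for item in items:
        index[item[attribute]].append(item)
    return index

# Example: ~2000 items filtered by 'width' without scanning the whole list.
items = [{'width': i % 5, 'length': i % 7, 'name': 'item%d' % i} for i in range(2000)]
width_index = build_index(items, 'width')
matches = width_index[3]  # every item with width == 3, no loop over 2000 rows

The same structure works on Android with a HashMap<Key, List<Item>> built once when the data arrives, so each filter change only looks up a bucket instead of re-scanning the whole list.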

Application for filtering database for the short period of time

I need to create an application that would allow me to get the phone numbers of users matching specific conditions as fast as possible. For example, we've got 4 columns in a SQL table (region, income, age, and a 4th with the phone number itself). I want to get phone numbers from the table for a specific region and income. Just running a SQL query won't help because it takes a significant amount of time. The database updates once per day and I have some time to prepare the data as I wish.
The question is: how would you make the process of getting phone numbers with specific conditions as fast as possible, O(1) in the best scenario? Consider storing values from the SQL table in RAM for the fastest access.
I came up with the following idea:
For each phone number create something like a bitset: 0 if a particular condition is false and 1 if the condition is true. But I'm not sure I can implement it for columns with non-boolean values.
Create a vector with phone numbers.
Create a vector with phone numbers' bitsets.
To get phone numbers, iterate over the 2nd vector and compare each bitset with the required one.
It's not O(1) at all, and I still don't know what to do about non-boolean columns. I thought maybe it's possible to do something good with std::unordered_map (all phone numbers are unique) or to improve my idea with the vectors and masks.
P.S. The SQL table consumes 4 GB of memory and I can store up to 8 GB in RAM. There are 500 columns.
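A minimal sketch of the bitmask idea (in Python rather than C++, just to show the shape), including one way to handle non-boolean columns by bucketing them into one-hot bits; the column names and income bands are made up:

# Each row is encoded as an integer bitmask; non-boolean columns are bucketed
# into one-hot bits (e.g. income bands), so a query is a mask comparison.
REGIONS = ['north', 'south', 'east', 'west']                        # bits 0-3
INCOME_BANDS = [(0, 30000), (30000, 80000), (80000, float('inf'))]  # bits 4-6

def encode(row):
    mask = 1 << REGIONS.index(row['region'])
    for i, (lo, hi) in enumerate(INCOME_BANDS):
        if lo <= row['income'] < hi:
            mask |= 1 << (len(REGIONS) + i)
    return mask

def query(rows, masks, wanted):
    # Keep rows whose mask contains every wanted bit -- still O(n), one AND per row.
    return [r['phone'] for r, m in zip(rows, masks) if m & wanted == wanted]

rows = [{'region': 'north', 'income': 50000, 'phone': '+100'},
        {'region': 'south', 'income': 20000, 'phone': '+200'}]
masks = [encode(r) for r in rows]
print(query(rows, masks, encode({'region': 'north', 'income': 50000})))  # ['+100']

To actually reach O(1) you could instead group the phone numbers in an unordered_map (a dict here) keyed by the mask or by a (region, income_band) tuple, at the cost of one bucket per combination you want to query.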
I want to get phone numbers from the table with specific region and income.
You would create indexes in the database on (region, income). Let the database do the work.
If you really want it to be fast, I think you should consider Elasticsearch. Think of every phone number in the DB as a document with properties (your columns).
You will need to reindex the table once a day (or in real time), but when it's time to search you just use an Elasticsearch filter to find the results.
Another option is to have an index on every column. In this case the engine will do an index merge to increase performance. I would also consider using MEMORY tables. If you write to this table, consider having a read replica just for reads.
To optimize your table, save your queries somewhere and add indexes (on multiple columns) only for the top X most popular searches, depending on your memory limitations.
You can use NVMe as your DB disk (if you can't load the table into memory).

Is there any real sense in uniform distributed partition keys for small applications using DynamoDB?

The Amazon DynamoDB documentation stresses that uniform distribution of the partition key is the most important point in designing a correct DB architecture.
On the other hand, when it comes to real numbers, you may find that your app will never outgrow one partition. That is, according to the doc:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
partition calculation formula is
( readCapacityUnits / 3,000 ) + ( writeCapacityUnits / 1,000 ) = initialPartitions (rounded up)
So you need a demand of more than 1,000 writes per second (for 1 KB items) to outgrow one partition. But according to my calculations, most small applications don't even need the default 5 writes per second; 1 is enough. (To be precise, you can also outgrow one partition if your data exceeds 10 GB, but that is also a big number.)
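Plugging the default 5 read and 5 write capacity units into that formula (a quick illustration; the numbers are just the defaults mentioned above):

import math

def initial_partitions(read_capacity_units, write_capacity_units):
    # AWS guideline formula, rounded up.
    return math.ceil(read_capacity_units / 3000 + write_capacity_units / 1000)

print(initial_partitions(5, 5))      # 1 -- a small app stays in a single partition
print(initial_partitions(0, 1001))   # 2 -- only past ~1000 WCU do you need a second one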
The question becomes more important when you realize that creating any additional index requires an additional allocation of writes per second.
Just imagine, I have some data related to a particular user, for example, "posts".
I create a "posts" data table and then, according to the Amazon guidelines, I choose the following key format:
partition: id, // post id like uuid
sort: // don't need it
Since no two posts have the same id, we don't need a sort key here. But then you realize that the most common operation is requesting the list of posts for a particular user. So you need to create a secondary index like:
partition: userId,
sort: id // post id
But every secondary index requires additional read/write units, so the cost of such a decision is doubled!
On the other hand, keeping in mind that you have only one partition, you could instead use this primary key from the start:
partition: userId
sort: id // post id
That works fine for your purposes and doesn't double your cost.
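A hedged sketch of that single-table key schema with boto3 (the table and attribute names are the ones used above; the throughput numbers are just the defaults discussed earlier):

import boto3

dynamodb = boto3.client('dynamodb')

# Posts keyed by userId (partition) + id (sort): listing a user's posts is a
# plain Query on the base table, with no secondary index to pay for.
dynamodb.create_table(
    TableName='posts',
    AttributeDefinitions=[
        {'AttributeName': 'userId', 'AttributeType': 'S'},
        {'AttributeName': 'id', 'AttributeType': 'S'},
    ],
    KeySchema=[
        {'AttributeName': 'userId', 'KeyType': 'HASH'},   # partition key
        {'AttributeName': 'id', 'KeyType': 'RANGE'},      # sort key: post uuid
    ],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5},
)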
So the question is: have I missed something? Maybe a partition key is much more effective than a sort key even inside one partition?
Addition: you may say, "OK, having userId as the partition key for posts is fine now, but when you have 100,000 users in your app you'll run into trouble with scaling." But in reality the trouble can only arise in some "transition" case, when you have only a few partitions, with one group of active users whose posts all land in one partition and the inactive ones in the other. If you have thousands of users, it's natural that many of them have active posts, the impact of any single user is negligible, and statistically their posts end up evenly distributed across many partitions thanks to the law of large numbers.
I think it's absolutely fine as long as you make sure you won't exceed the partition limits through increased RCU/WCU or growth of your data. Moreover, the best practices say:
If the table will fit entirely into a single partition (taking into consideration growth of your data over time), and if your application's read and write throughput requirements do not exceed the read and write capabilities of a single partition, then your application should not encounter any unexpected throttling as a result of partitioning.

dynamodb table design for SET like scenario

I'm moving from an RDBMS and I am not sure how best to design for the scenario below.
I have a table with around 200,000 questions, with question id as the partition key.
Users view questions, and I do not wish to show a viewed question to the same user again. So which one is the better option?
1. Have a table with question id as partition key and a set of user IDs as an attribute
2. Have a table with user id as partition key and a set of question IDs they have viewed as an attribute
3. Have a table with question id as partition key and user id as sort key; once a user has viewed a question, add a row to this table
Options 1 and 2 might have a problem with the 400 KB size limit per item. The third seems the better option, though I would end up with 100 million items, as there will be one row per user per question viewed. But I assume this is not a problem for DynamoDB?
Another problem is how to get 10 random questions not yet viewed by the user. Do I generate 10 random numbers between 1 and 200,000 (the number of questions) and then check that each is not in the table mentioned in option 3 above?
I definitely would not go with option 1 or 2 for the reason you mentioned: you would already be limiting your scalability by the 400kb limit. With a UUID of 128 bits, you would be limited to about 250 users per question.
Option 3 is the way to go with DynamoDB, but you need to consider which attribute is the partition key and which is the range key. You could have user_id as the partition key and question_id as the range key. The decision depends on how your data is going to be accessed. DynamoDB divides the total table throughput across your partition keys: each of your n partition keys gets 1/nth of the table throughput. If a subset of partition keys is accessed more than the others, you won't be utilizing your table throughput efficiently, because the partition keys that actually use less than 1/nth of the throughput are still provisioned for 1/nth of it. The general idea is that you want each of your partition keys to be utilized equally. I think you have it right: I'm assuming each question is served randomly and is no more popular than another, while some users might be more active than others.
The other part of your question is a little more difficult to answer. You could do it your way, with a table containing a question-user pair for each question a user has read, or you could have a table containing the pairs for the questions a user hasn't read. The tradeoff is between the initial write cost and the subsequent read cost, and the answer depends on the number of questions compared to the rate at which they are consumed.
When you have a large number of questions compared to the rate at which users progress through them, the chance of randomly selecting an already-seen one is small, so you will want to store have-read question-user pairs. With this setup you don't pay a lot to initialize a user (you don't have to write a question-user pair for every question) and you won't have many miss-read costs (i.e. where you select a question-user pair and it turns out the user has already read it; this still consumes read/write units).
If you have a small number of questions compared to the rate at which users consume them, then you will want to store the haven't-read question-user pairs. You pay something to initialize each user (writing one question-user pair per question), but then you don't have any accidental miss-reads. If you stored them as have-read pairs when there is a small number of questions, you would encounter a lot of miss-reads as the percentage of read questions approaches 100% (to the point where you would have been better off setting them up as haven't-read pairs).
I hope this helps with your design considerations. Drop a comment if you need clarification!
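A hedged sketch of option 3 with boto3, using the have-read layout from the "large number of questions" case above (the table name viewed_questions and the user_id/question_id key names are assumptions): mark a question as viewed, then draw random candidates and keep the ones with no have-read pair.

import random
import boto3

dynamodb = boto3.resource('dynamodb')
viewed = dynamodb.Table('viewed_questions')   # partition key: user_id, sort key: question_id

def mark_viewed(user_id, question_id):
    viewed.put_item(Item={'user_id': user_id, 'question_id': question_id})

def random_unviewed(user_id, total_questions=200000, wanted=10):
    # Draw random question ids; the occasional collision with a viewed one
    # costs one extra read, which is cheap while most questions are unread.
    picked = set()
    while len(picked) < wanted:
        qid = random.randint(1, total_questions)
        if qid in picked:
            continue
        resp = viewed.get_item(Key={'user_id': user_id, 'question_id': qid})
        if 'Item' not in resp:
            picked.add(qid)
    return list(picked)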

Incremental update of millions of records, indexed vs. join

I'm currently developing a strategy for an incremental update of our user data. We assume 100,000,000 records in our database, of which approximately 1,000,000 records are updated per workflow.
The idea is to update records in a MapReduce job. Is it useful to use an indexed storage (e.g. Cassandra) to be able to access current records randomly? Or is it preferable to retrieve data from HDFS and join the new information to the existing records?
The record size is O(200 bytes). The user data has a fixed length but should be extendable. The log events have a similar but not identical structure. The number of user records is likely to grow. Near real-time updates are desirable, i.e. a 3-hour time gap is not acceptable, but a few minutes is OK.
Do you have any experience with either of these strategies and data of this size?
Is the Pig JOIN fast enough? Is it always a bottleneck to read all records? Can Cassandra hold this amount of data efficiently? Which solution is scalable? What about the complexity of the system?
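To make the "join new information to existing records" option concrete, here is a minimal sketch of the merge step in plain Python (the CSV layout, a shared user_id key, and the newest-wins rule are my assumptions; in practice this would be a Pig or MapReduce reduce-side join over HDFS):

import csv

def merge_updates(base_path, updates_path, out_path, key='user_id'):
    # Batch merge: updated fields win; untouched records pass through unchanged.
    updates = {}
    with open(updates_path) as f:
        for row in csv.DictReader(f):
            updates[row[key]] = row            # ~1M updates fit comfortably in RAM
    with open(base_path) as f, open(out_path, 'w', newline='') as out:
        reader = csv.DictReader(f)
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row.update(updates.pop(row[key], {}))   # apply changed fields, if any
            writer.writerow(row)
        for row in updates.values():                # records seen only in the update feed
            writer.writerow(row)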
You need to define your requirements first. Your record volumes are not a problem, but you don't give a record length. Are the records fixed length, with a fixed number of fields, or likely to change format over time? Are we talking 100-byte records or 100,000-byte records? You need an index on a field/column if you wish to query by that field/column, unless you do all your work with map/reduce. Will the number of user records stay at 100 million (one server will probably suffice) or will it grow 100% per year (probably multiple servers, adding new ones over time)?
How you access records for updating depends on whether you need to update them in real time or whether you can run a batch job. Will updates happen every minute, hour, or month?
I would strongly suggest you do some experimenting. Have you done any testing already? That will give you context for your questions and lead to more objective questions and answers. It is unlikely that you can 'whiteboard' a solution based on your question alone.