Let users run arbitrary queries on our MySQL database safely - c++

We provide various services to our clients, for example, sending emails. Their users are saved in our databases (MySQL). We would like to give our clients the ability to run arbitrary searches on our database without compromising it. Let me elaborate:
Below is an existing table definition
* User Table
- email varchar
- category integer
Currently, the only way our clients can select a group of users for whom an action should be taken (say, sending an email) is by sending us a category. They send us a category and we take the action for all the users in that category.
However, this is quite restrictive, and there are times when our clients would like to run custom searches in an unrestricted way. For example, below is our new table
* User Table
- email varchar
- category integer
- gender integer
- dob integer
- country varchar
and we would like our clients to run arbitrary searches on their users using all the fields mentioned, supporting at least logical AND, OR, % (LIKE) and () operations to begin with, for example,
(gender = 1 AND dob < 1999) OR category = 2.
The idea is that they pass us a subquery which we append to the WHERE clause of a 'SELECT' statement. However, this is risky, and we want to make sure we handle it safely, without exposing our database to any malicious attempt to exploit this feature. Hence I need your help/input.
What would be the safest way to provide this kind of ability to search users? We use C++ for our backend. Clients supply the logic through a REST API, which is received by our C++ backend.

The best way to approach this is probably to use a prepared statement. Difficulties arise because, using your second User table as an example, a customer could potentially request a query whose WHERE clause involves any combination of the 5 columns in that table.
One trick which might work is something like this:
SELECT *
FROM Users
WHERE
email = COALESCE(?, email) AND
category = COALESCE(?, category) AND
gender = COALESCE(?, gender) AND
dob = COALESCE(?, dob) AND
country = COALESCE(?, country);
The basic idea is that you have a single prepared statement, to which you always bind parameters for all columns in the Users table. If a customer makes a request which does not involve a certain column, then you bind NULL, and that condition in the WHERE clause effectively gets ignored. One caveat: a row whose column value is itself NULL will not match, because comparing NULL with NULL using = does not evaluate to true in SQL.
The exact solution you use would depend on the programming language you are using to expose MySQL.
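To make the COALESCE idea concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module with an in-memory database purely as a stand-in for MySQL, and the names find_users and demo_connection are made up for this example. It only covers equality conditions joined by AND, exactly like the statement above, so it does not yet handle OR, LIKE or range comparisons.

```python
import sqlite3

# Column order must match the order of the placeholders in QUERY.
COLUMNS = ("email", "category", "gender", "dob", "country")

QUERY = """
SELECT email FROM Users
WHERE email = COALESCE(?, email)
  AND category = COALESCE(?, category)
  AND gender = COALESCE(?, gender)
  AND dob = COALESCE(?, dob)
  AND country = COALESCE(?, country)
ORDER BY email
"""


def find_users(conn, **filters):
    """Run the single prepared statement; unused columns are bound as NULL."""
    unknown = set(filters) - set(COLUMNS)
    if unknown:
        # Reject anything that is not a known column name.
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    # filters.get() returns None for missing columns, which binds as NULL.
    params = [filters.get(col) for col in COLUMNS]
    return [row[0] for row in conn.execute(QUERY, params)]


def demo_connection():
    """Build a small in-memory table (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Users (email TEXT, category INTEGER, "
        "gender INTEGER, dob INTEGER, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO Users VALUES (?, ?, ?, ?, ?)",
        [
            ("a@example.com", 1, 1, 1990, "US"),
            ("b@example.com", 2, 2, 2001, "DE"),
            ("c@example.com", 2, 1, 1985, "US"),
        ],
    )
    return conn
```

Client-supplied values only ever travel as bound parameters, and column names are checked against a fixed list, so no client text is concatenated into the SQL string.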

Related

DynamoDB Many-to-Many relations

I have a problem modeling my data in DynamoDB. My app creates notes, with the possibility to share a note with other users and allow them to update the note (as done by https://keep.google.com/).
As I need to share notes between users, I decided that my primary table key will be the identifier of a note.
I then came up with the following data model for my DynamoDB tables:
Primary Table :(PK = NoteId, SK = Type)
Secondary Table: (GSK = userId, SK = noteId )
The "Type" indicates whether the item is the BODY of the note (where information about the note is saved) or an identifier indicating that the note has been shared with another user.
But I do have a problem: I use the global secondary index to retrieve all the notes for a user.
Once I have the list of noteIds, I query my primary table to get all shared notes for the user (as the notes for the user are already present in the GSI).
However, to do this I need to use the "BatchGetItem" function.
The problem is that it can only return up to 100 items and 16 MB of data per call.
With more than 100 shared notes I have to call this function several times. Moreover, if the data exceeds 16 MB I need to implement a mechanism to read the rest of the requested data.
This operation could get really slow depending on the data size and the number of shareIds.
As you can imagine, this is easily solved using an RDB and a "join".
But the idea here is to use DynamoDB.
Data Access patterns:
Get all Notes by userId (own and shared)
Add a shared by userId and sharedId.
Get rights by noteId and userId.
Update a note by Id
Delete a note by Id
Any ideas of how I can change my data-model to improve the access pattern to read all notes?
Modelling your schema to use item collections lets you use the Query API, which has no limit on the number of items returned. Each response is still capped at 1 MB, so larger result sets need to be paged through.
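For what the paging looks like, here is a minimal Python sketch. It assumes a callable that behaves like a DynamoDB Query call: it accepts an optional ExclusiveStartKey and returns a dict with "Items" and, when more data remains, "LastEvaluatedKey". The helper name query_all is made up for this example.

```python
def query_all(query_page, **kwargs):
    """Collect all items from a paginated, DynamoDB-style Query call.

    query_page is any callable that accepts an optional ExclusiveStartKey
    keyword and returns a dict with "Items" and, when more pages remain,
    "LastEvaluatedKey".
    """
    items = []
    start_key = None
    while True:
        if start_key is not None:
            # Resume where the previous page stopped.
            kwargs["ExclusiveStartKey"] = start_key
        page = query_page(**kwargs)
        items.extend(page.get("Items", []))
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            # No LastEvaluatedKey means this was the final page.
            return items
```

With boto3 this could be called as query_all(table.query, KeyConditionExpression=...), where table and the key condition depend on your own schema and are not shown here.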

AWS DynamoDB Table Design: Store two UserIDs and Details in Table

I'm building an app where two users can connect with each other and I need to store that connection (e.g. a friendship) in a DynamoDB table. Basically, the connection table has two fields:
userIdA (hash key)
userIdB (sort key)
I was thinking of adding an index on userIdB to query on both fields. Should I store a connection with one record (ALICE, BOB) or two records (ALICE, BOB; BOB, ALICE)? The first option needs one write operation and less space, but I have to query twice to get all connections of a user. The second option needs two write operations and more space, but I only have to query once for the userId.
The user table has details like name and email:
userId (hash key)
name (sort key)
email
In my app, I want to show all connections of a certain user with user details in a listview. That means I have two options:
Store the user details of the connected users also in the connection table, e.g. add two name fields to that table. This is fast, but if the user name changes (name and email are retrieved from Facebook), the details are invalid and I need to update all entries.
Query the user details of each userId with a Batch Get request to read multiple items. This may be slower, but I always have up to date user details and don't need to store them in the connection table.
So what is the better solution, or are there any other advantages/disadvantages that I may have overlooked?
EDIT
After some google research regarding friendship tables with NoSQL databases, I found the following two links:
How does Facebook maintain a list of friends for each user? Does it maintain a separate table for each user?
NoSQL Design Patterns for Relational Data
The first link suggests to store the connection (or friendship) in a two way direction with two records, because it makes it easier and faster to query:
Connections:
1 userIdA userIdB
2 userIdB userIdA
The second link suggests saving a subset of duplicated data (“summary”) in the tables so it can be read faster with just one query. That would mean saving the user details in the connection table as well, and saving the userIds in an attribute of the user table:
Connections:
# userIdA userIdB userDetails status
1 123 456 { userId: 456, name: "Bob" } connected
2 456 123 { userId: 123, name: "Alice" } connected
Users:
# userId name connections
1 123 Alice { 456 }
2 456 Bob { 123 }
This database model makes it pretty easy to query connections, but seems to be difficult to update if some user details may change. Also, I'm not sure if I need the userIds within the user table again because I can easily query on a userId.
What do you think about that database model?
In general, NoSQL databases are often combined with a couple of assumptions:
Eventual consistency is acceptable. That is, it's often acceptable in application design if some of the intermediate answers aren't right during an update. For example, for a few seconds while Alice is becoming Bob's friend, it might be fine if "Is Alice Bob's friend" returns true while "Is Bob Alice's friend" returns false.
Performance is important. If you're using NoSQL it's generally because performance matters to you. It's also almost certainly because you care about the performance of the operations that happen most commonly. (It's possible that you have a problem where the performance of some uncommon operation is so bad that you can't do it; NoSQL is not generally the answer in that situation.)
You're willing to make uncommon operations slower to improve the performance of common operations.
So, how does that apply to your question? First, it suggests that ultimately the answer depends on performance. That is, no matter what people say here, the right answer depends on what you observe in practice. You can try multiple options and see what results you get.
With regard to the specific options you enumerated.
Assuming that performance is enough of a concern that NoSQL is a reasonable solution for your application, it's almost certainly query rather than update performance you care about. You will probably be happy to make updates slower and more expensive so that queries can be faster. That's kind of the whole point.
You can likely handle updates out of band; that is, eventual consistency likely works for you. You could submit update operations to an SQS queue rather than handling them during your page load. So if someone clicks a confirm friend button, you could queue a request to actually update your database. It is OK even if that involves rebuilding their user row, rebuilding the friend rows, and even updating some counts about how many friends they have.
It probably does make sense to store a friend row in each direction so you only need one query.
It probably does make sense to store the user information like Name and picture that you typically display in a friend list duplicated in the friendship rows. Note that whenever the name or picture changes you'll need to go update all those rows.
It's less clear that storing the friends in the user table makes sense. That could get big. Also, it could be tricky to guarantee eventual consistency. Consider what happens if you are processing updates to two users' friendships at the same time. It's very important that you not end up with inconsistency once all the dust has settled.
Whenever you have non-normalized data such as duplicating rows in each direction, or copying user info into friendship tables, you want some way to revalidate and fix your data. You want to write code that in the background can go scan your system for inconsistencies caused by bugs or crashed activities and fix them.
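As a rough illustration of the background revalidation idea, here is a minimal Python sketch. It assumes friendship rows are stored in both directions and can be read as (userIdA, userIdB) pairs. The function name missing_reverse_rows is made up for this example, and a real job would scan the table in pages and write the missing rows back.

```python
def missing_reverse_rows(rows):
    """Return the (userIdA, userIdB) rows needed to make every
    friendship symmetric, given the rows that currently exist."""
    present = set(rows)
    # For each stored (a, b), the mirrored (b, a) row should also exist.
    return sorted((b, a) for (a, b) in present if (b, a) not in present)
```

For example, if only ("alice", "carol") was written before a crash, the function reports that ("carol", "alice") still needs to be created.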
I suggest you have the following fields in the table:
userId (hash key)
name (sort key)
email
connections (Comma separated or an array of userId assuming you have multiple connections for a user)
This structure can ensure consistency across your data.

DynamoDb database design

I'm new to DynamoDB and NoSQL in general.
I have a users table and a notes table. A user can create notes and I want to be able to retrieve all notes associated with a user.
One solution I've thought of is that every time a note is saved, the note ID is stored in a 'notes' attribute in the user table. This will allow me to query the users table for all note IDs and then query notes using those IDs:
UserTable:
UserId: 123456789
notes: ['note-id-1', 'note-id-2']
NotesTable
id: note-id-1
text: "Some note"
Is this the correct approach? The only other way I can think of is to have the notes table carry a userId attribute so I can then query the notes table based on that userId. Obviously this sort of approach is more relational.
I would take the approach at the end of your question: each note should have a userId attribute. Then create a global secondary index with userId as the partition (hash) key and noteId as the sort key. This way you can also query on userId, by doing a query on that index.
If you do it the way you suggested, you always need two steps to get the notes of a user (first get the note IDs from the user table and then query the notes table). Also, when someone has N notes you would need to do N queries, which is going to be expensive if N is large.
If you do it the way in this answer, you need one query to get all notes of a user (I'm assuming no pagination) and one to get the user information. That will never be more than two.
General rule of thumb:
SQL: storage = expensive, computation = cheap
NoSQL: storage = cheap, computation = expensive
So always try to need as few queries as possible.

Trying to minimize the number of trips to a database voting table

I use django 1.10.1, postgres 9.5 and redis.
I have a table that stores users' votes and looks like:
==========================
object | user | created_on
==========================
where object and user are foreign keys to the id column of their own tables respectively.
The problem is that in many situations, I have to list many objects on one page. If the user is logged in or authenticated, I have to check for every object whether it was voted on or not (and act depending on the result, something like showing a vote or unvote button). So in my template I have to call a function like this for every object on the page.
def is_obj_voted(obj_id, usr_id):
    return ObjVotes.objects.filter(object_id=obj_id, user_id=usr_id).exists()
Since I may have tens of objects in one page, I found, using django-debug-toolbar, that the database access alone could take more than one second because I access just one row for each query and that happens in a serial way for all objects in the page. To make it worse, I use similar queries from that tables in other pages (i.e. filter using user only or object only).
What I am trying to achieve, and what I think is the right thing to do, is to find a way to access the database just once to fetch all objects voted on by a given user (maybe when the user logs in, or at the first page hit requiring such database access), and then filter it further to whatever I want depending on the page's needs. Since I use redis and the django-cacheops app, can it help me do that job?
In your case I'd go with getting an array of object IDs and querying all votes by the user's ID and this array, something like:
object_ids = [o.id for o in Object.objects.filter(YOUR CONDITIONS)]
# One query for all of this user's votes on the listed objects.
votes = {v.object_id for v in ObjVotes.objects.filter(object_id__in=object_ids, user_id=usr_id)}
def is_obj_voted(obj_id, votes):
    return obj_id in votes
This will make only one additional database query for getting votes by user per page.

What's cheaper on DynamoDB (GSI vs multiple tables)

I have an issue with making a username AND an email unique. It is quite easy with a relational database: just do 2 queries and get the count back on each.
select count(*) from users where email = ?;
select count(*) from users where username = ?;
But in DynamoDB (NoSQL) is it better (i.e. cheaper) to have 2 tables like so:
username table (where username is the hash) and check that table with a PUT and attribute_not_exists
AND
email table (where email is the hash) and check that table after the first one with a PUT and attribute_not_exists
OR do I
email table (hash) and username (GSI in that table). Then query the GSI first and if it doesn't exist then do a PUT with email and username
Which is better (cheaper)?
Two questions so I'll address them separately.
Which is cheaper?
You can run a single table with one GSI or two tables for the exact same cost if you want to, because throughput for GSIs is provisioned the same way the primary table's throughput is.
Cost should not be a deciding factor.
Which is better?
The difficulty of keeping a secondary attribute unique in DynamoDB is a common problem. Because of the asynchronous nature of GSIs, the HASH or HASH/RANGE combination for a GSI is not guaranteed to be unique. This can be taken advantage of in some circumstances.
If you use two tables you are taking the responsibility for keeping both tables in sync (something that is not easy to do in many situations). This comes with some important responsibilities (what happens if your app dies after writing to the first table but before it writes to the second), but this additional responsibility could allow you to maintain the uniqueness you want.
To explain how you would actually accomplish the dual uniqueness while maintaining accuracy, you would want to take advantage of conditional writes. The following outline describes a series of steps that would ensure that you maintain uniqueness.
Write record to username table with condition that username is not in the table, but include a conditional flag set to false (if write fails, we bail)
Write record to email table with condition that email is not in the table (if write fails, we delete the previous username record)
Update the username record to set the conditional flag to true
The reason you would want to use a conditional flag with the username to essentially indicate that the record is not in a valid state is to ensure you actually maintain the uniqueness.
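The three steps above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: FakeTable is an in-memory stand-in for a DynamoDB table, put_if_absent mimics a conditional put that fails when the key already exists, and the names register and confirmed are made up for this example.

```python
class ConditionalCheckFailed(Exception):
    """Raised when a conditional write finds the key already present."""


class FakeTable:
    """In-memory stand-in for a DynamoDB table with a single hash key."""

    def __init__(self):
        self.items = {}

    def put_if_absent(self, key, item):
        # Mimics a put with a "key must not exist yet" condition.
        if key in self.items:
            raise ConditionalCheckFailed(key)
        self.items[key] = dict(item)


def register(usernames, emails, username, email):
    """Claim both username and email, or neither. Returns True on success."""
    # Step 1: reserve the username, flagged as not yet valid.
    try:
        usernames.put_if_absent(username, {"email": email, "confirmed": False})
    except ConditionalCheckFailed:
        return False
    # Step 2: reserve the email; on failure, delete the username record.
    try:
        emails.put_if_absent(email, {"username": username})
    except ConditionalCheckFailed:
        del usernames.items[username]
        return False
    # Step 3: flip the flag so the username record counts as valid.
    usernames.items[username]["confirmed"] = True
    return True
```

If the process dies between steps 1 and 2, the username record is left with the flag still false, so a cleanup job can tell it apart from a completed registration.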