DynamoDB Schema Advice - amazon-web-services

I am practicing with DynamoDB and the other serverless tools offered by AWS. I am used to working with relational databases like MySQL, so working with Dynamo has been a bit of a challenge for me.
Basically, I was looking at doing something similar to what Facebook, Instagram, YouTube, and other popular sites already do: a platform that allows users to sign up, follow others, and post media (videos and pictures) that can be liked and commented on. For items that can keep growing, like followers or likes, I originally stored them as lists in their respective tables; however, I realize that may not be the best approach, since DynamoDB does have an item size limit. For example, if someone like Kobe Bryant joined the app and immediately got millions of followers, the list approach would not hold up.
Like this:
Media:
- MediaID
- UserID
- MediaType
- Size
- S3_URL
- Likes: {
...
...
}
- Comments: {
...
...
}
Would it be better to store things like this in separate tables? Or am I now thinking back to relational databases?
For example,
Media:
- MediaID
- UserID
- MediaType
- Size
- S3_URL
Media_Likes:
- LikeID
- MediaID
- UserID
- DateLiked
Media_Comments:
- CommentID
- MediaID
- UserID
- Text
- DateCommented
Or, otherwise, what would be the best way to design something like this?
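For what it's worth, here is a minimal boto3 sketch of the item-per-like layout from the second example, assuming a Media_Likes table keyed by MediaID (partition key) and UserID (sort key); the names are taken from the question, the key design is an assumption:

# Each like is its own small item, so the per-item size limit no longer matters.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
media_likes = dynamodb.Table("Media_Likes")  # assumed: PK = MediaID, SK = UserID

# Recording a like is a single put of a small item.
media_likes.put_item(Item={
    "MediaID": "media-123",     # partition key
    "UserID": "user-456",       # sort key
    "DateLiked": "2024-01-01T00:00:00Z",
})

# Fetching (a page of) likes for one piece of media.
resp = media_likes.query(KeyConditionExpression=Key("MediaID").eq("media-123"))
print(resp["Count"], resp["Items"])

A comments table could follow the same pattern, with a timestamp or comment ID as the sort key so comments for a media item sort chronologically.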

Related

CubeJS Multitenant: How to use COMPILE_CONTEXT as users access the server with different Tokens?

We've been getting started with CubeJS. We are using BigQuery, with the following hierarchy:
- Project (all clients)
- Dataset (corresponding to a single client)
- Tables (different data types for a single client)
We'd like to use COMPILE_CONTEXT to allow different clients to access different Datasets based on the JWT that we issue them after authentication. The JWT includes the user info that'd cause our schema to select a different dataset:
const {
  securityContext: { dataset_id },
} = COMPILE_CONTEXT;

cube(`Sessions`, {
  sql: `SELECT * FROM ${dataset_id}.sessions_export`,

  measures: {
    // Count of all session objects
    count: {
      sql: `Status`,
      type: `count`,
    },
  },
});
In testing, we've found that the COMPILE_CONTEXT global variable is set when the server is launched, meaning that even if a different client submits a request to Cube with a different dataset_id, the old one is used by the server, sending info from the old dataset. The Cube docs on Multi-tenancy state that COMPILE_CONTEXT should be used in our scenario (at least, this is my understanding):
Multitenant COMPILE_CONTEXT should be used when users in fact access different databases. For example, if you provide SaaS ecommerce hosting and each of your customers have a separate database, then each ecommerce store should be modelled as a separate tenant.
SECURITY_CONTEXT, on the other hand, is set at Query time, so we tried to also access the appropriate data from SECURITY_CONTEXT like so:
cube(`Sessions`, {
  sql: `SELECT * FROM ${SECURITY_CONTEXT.dataset_id}.sessions_export`,
But the query being sent to the database (found in the error log in the Cube dev server) is SELECT * FROM [object Object].sessions_export) AS sessions.
I'd love to inspect the SECURITY_CONTEXT variable but I'm having trouble finding how to do this, as it's only accessible within our cube Sql to my knowledge.
Any help would be appreciated! We are open to other routes besides those described above. In a nutshell, how can we deliver a specific dataset to a client using a unique JWT?
Given that all your datasets are in the same BigQuery database, I think your use-case reflects the Multiple DB Instances with Same Schema part of the documentation (that title could definitely be improved):
// cube.js
const PostgresDriver = require('@cubejs-backend/postgres-driver');

module.exports = {
  contextToAppId: ({ securityContext }) =>
    `CUBEJS_APP_${securityContext.dataset_id}`,
  driverFactory: ({ securityContext }) =>
    new PostgresDriver({
      database: `${securityContext.dataset_id}`,
    }),
};

// schema/Sessions.js
cube(`Sessions`, {
  sql: `SELECT * FROM sessions_export`,
});

AWS Data Lake Dynamo vs ElasticSearch

I am really struggling to understand how DynamoDB / Elasticsearch should be used to support AWS data lake efforts (metadata / catalogs). It seems as though you would log the individual S3 locations of your zip archives for your sources in DynamoDB, and any additional metadata / attributes you would like to search by in ES. If that is correct, how would you use the two together to support that? I tried to find more detailed information about how to properly pair the two, but have been unsuccessful. Any information / documentation that others have would be great. There's a good chance I am overlooking some obvious examples / documentation.
What I am imagining is something like the following:
- The user could search for metadata / attributes in ES, which would point to the high-level S3 buckets / partitions that match.
- The search in DynamoDB would then be against part of the key (partition / bucket) returned by the ES result.
- The search would most likely result in many individual objects / keys that could then be processed, extracted, etc.
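Roughly, a sketch of that flow in Python, with assumed index, table, and field names throughout (the ES call is elasticsearch-py 8.x style):

# 1) Search metadata/attributes in ES; each hit carries the S3 partition it describes.
# 2) Query DynamoDB by that partition to get the individual objects/keys to process.
import boto3
from boto3.dynamodb.conditions import Key
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")                    # assumed endpoint
catalog = boto3.resource("dynamodb").Table("DataLakeCatalog")  # assumed table

hits = es.search(index="datalake-metadata",
                 query={"match": {"source_system": "billing"}})["hits"]["hits"]

for hit in hits:
    partition = hit["_source"]["s3_partition"]                 # e.g. "raw/billing/2020/"
    resp = catalog.query(KeyConditionExpression=Key("Partition").eq(partition))
    for item in resp["Items"]:
        print(item["S3Key"])                                   # individual objects to extract/process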
I spoke to one of our AWS reps, who referred me to this article: AWS Data Lake. It was a great starting point and seemed to answer some of my questions about the use of the components and the overall approach, which was previously unclear to me.
Highlights:
- Blueprint for implementing a data lake. Combining S3 / DynamoDB / ES is common.
- There are many variations to the implementation: substituting an RDS for ES / DynamoDB, using just ES, etc.
- We will most likely start with an RDS to work out the process, then move to DynamoDB / ES.

Apartment gem: How to rename tenant?

Is it possible to rename a tenant with the apartment gem, or do we have to drop the tenant and create a new one to achieve this?
Please provide some suggestions.
I don't think this is available through the apartment gem, but it is fairly simple to do with a SQL query. It depends on your setup though.
If you are using PostgreSQL with a schema for each tenant:
ALTER SCHEMA old RENAME TO new;
If you are using MySQL, you should rename the table name prefixes for the tenant. This should work if the databases are on the same file system:
RENAME TABLE current_tenant.table TO new_tenant.table;
Disclaimer: not tested.
You can change the name (in my case the subdomain) by doing something like this:
1) Your schema should be tied to some unique column on the Tenant model (schema_id is fine). I am generating its value from the subdomain and the Tenant ID.
2) In apartment.rb, require "apartment/elevators/generic". Then
config.tenant_names = -> { Tenant.pluck :schema_id }
so the schema_id is used as the tenant name.
Then at the bottom of the file add:
Rails.application.config.middleware.use "Apartment::Elevators::Generic", lambda { |request|
  Tenant.find_by(subdomain: request.host.split(".").first).schema_id
}
Now, once subdomain-based requests are set up properly, you or your tenant users can edit the name/subdomain and the data in the schemas will be safe.
PS: Also see here - https://github.com/influitive/apartment/issues/242

Is this an appropriate use-case for Amazon DynamoDB / NoSQL?

I'm working on a web application that uses a bunch of Amazon Web Services. I'd like to use DynamoDB for a particular part of the application but I'm not sure if it's an appropriate use-case.
When a registered user on the site performs a "job", an entry is recorded and stored for that job. The job has a bunch of details associated with it, but the most relevant thing is that each job has a unique identifier and an associated username. Usernames are unique too, but there can of course be multiple job entries for the same user, each with different job identifiers.
The only query that I need to perform on this data is: give me all the job entries (and their associated details) for username X.
I started to create a DynamoDB table but I'm not sure if it's right. My understanding is that the chosen hash key should be the key that's used for querying/indexing into the table, but it should be unique per item/row. Username is what I want to query by, but username will not be unique per item/row.
If I make the job identifier the primary hash key and the username a secondary index, will that work? Can I have duplicate values for a secondary index? But that means I will never use the primary hash key for querying/indexing into the table, which is the whole point of it, isn't it?
Is there something I'm missing, or is this just not a good fit for NoSQL?
Edit:
The accepted answer helped me find what I was looking for, as did this question.
I'm not totally clear on what you're asking, but I'll give it a shot...
With DynamoDB, the combination of your hash key and range key must uniquely identify an item. Range key is optional; without it, hash key alone must uniquely identify an item.
You can also store a list of values (rather than just a single value) as an item's attributes. If, for example, each item represented a user, an attribute on that item could be a list of that user's job entries.
If you're concerned about hitting the size limitation of DynamoDB records, you can use S3 as backing storage for that list - essentially use the DDB item to store a reference to the S3 resource containing the complete list for a given user. This gives you flexibility to query for or store other attributes rather easily. Alternatively (as you suggested in your answer), you could put the entire user's record in S3, but you'd lose some of the flexibility and throughput of doing your querying/updating through DDB.
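A minimal sketch of that S3-backed-list idea, with hypothetical bucket, table, and attribute names:

# The DynamoDB item stays small: it only stores a reference to the full jobs list in S3.
import json
import boto3

s3 = boto3.client("s3")
users = boto3.resource("dynamodb").Table("Users")   # assumed table, hash key = Username

users.put_item(Item={
    "Username": "toby",
    "JobsRef": {"Bucket": "my-jobs-bucket", "Key": "jobs/toby.json"},  # hypothetical reference
})

# Reading the complete list costs one extra round trip to S3.
ref = users.get_item(Key={"Username": "toby"})["Item"]["JobsRef"]
jobs = json.loads(s3.get_object(Bucket=ref["Bucket"], Key=ref["Key"])["Body"].read())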
Perhaps a "Jobs" table would work better for you than a "User" table. Here's what I mean.
If you're worried about all of those jobs inside a user document adding up to more than the 400KB limit, why not store the jobs individually in a table like:
my_jobs_table:
{
  {
    Username: toby,
    JobId: 1234,
    Status: Active,
    CreationDate: 2014-10-05,
    FileRef: some-reference1
  },
  {
    Username: toby,
    JobId: 5678,
    Status: Closed,
    CreationDate: 2014-10-01,
    FileRef: some-reference2
  },
  {
    Username: bob,
    JobId: 1111,
    Status: Closed,
    CreationDate: 2014-09-01,
    FileRef: some-reference3
  }
}
Username is the hash and JobId is the range. You can query on the Username to get all the user's jobs.
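For example, a hedged boto3 sketch of that query, assuming the table is literally named my_jobs_table with Username as the hash key and JobId as the range key:

import boto3
from boto3.dynamodb.conditions import Key

jobs = boto3.resource("dynamodb").Table("my_jobs_table")

# All jobs for one user; add Limit / ScanIndexForward as needed for paging and ordering.
resp = jobs.query(KeyConditionExpression=Key("Username").eq("toby"))
for job in resp["Items"]:
    print(job["JobId"], job["Status"], job["FileRef"])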
Now that the size of each document is more limited, you could think about putting all the data for each job in the DynamoDB record instead of using the FileRef and looking it up in S3. This would probably save a significant amount of latency.
Each record might then look like:
{
  Username: bob,
  JobId: 1111,
  Status: Closed,
  CreationDate: 2014-09-01,
  JobCategory: housework,
  JobDescription: Doing the dishes,
  EstimatedDifficulty: Extreme,
  EstimatedDuration: 9001
}
I reckon I didn't really play with the DynamoDB console for long enough to get a good understanding before posting this question. I only just understood now that a DynamoDB table (and presumably any other NoSQL table) is really just a giant dictionary/hash data structure. So to answer my question, yes I can use DynamoDB, and each item/row would look something like this:
{
  "Username": "SomeUser",
  "Jobs": {
    "gdjk345nj34j3nj378jh4": {
      "Status": "Active",
      "CreationDate": "2014-10-05",
      "FileRef": "some-reference"
    },
    "ghj3j76k8bg3vb44h6l22": {
      "Status": "Closed",
      "CreationDate": "2014-09-14",
      "FileRef": "another-reference"
    }
  }
}
But I'm not sure it's even worth using DynamoDB after all that. It might be simpler to just store a JSON file with the content structure above in an S3 bucket, where the filename is the username plus .json.
Edit:
For what it's worth, I just realized that DynamoDB has a 400KB size limit on items. That's a huge amount of data, relatively speaking for my use-case, but I can't take the chance so I'll have to go with S3.
It seems that username as the hash key and a unique job_id as the range, as others have already suggested, would serve you well in DynamoDB. Using a query, you can quickly find all records for a username.
Another option is to take advantage of local secondary indexes and sparse indexes. It seems that there is a status column, but based on what I've read you could add another attribute, perhaps 'not_processed': 'x', and build your local secondary index on username + not_processed. Only records that have this attribute are indexed, and once a job is complete you delete the attribute. This means you can effectively do an index-backed scan for username where not_processed = x, and your index will stay small.
All my relational DB experience seems to be getting in the way of my understanding of DynamoDB. Good luck!
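To make the sparse-index idea concrete, here is a boto3 sketch with assumed table, attribute, and index names (jobs, username, job_id, not_processed, pending-jobs); note that local secondary indexes can only be declared when the table is created:

import boto3
from boto3.dynamodb.conditions import Key

client = boto3.client("dynamodb")
client.create_table(
    TableName="jobs",
    AttributeDefinitions=[
        {"AttributeName": "username", "AttributeType": "S"},
        {"AttributeName": "job_id", "AttributeType": "S"},
        {"AttributeName": "not_processed", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "username", "KeyType": "HASH"},
        {"AttributeName": "job_id", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[{
        "IndexName": "pending-jobs",
        "KeySchema": [
            {"AttributeName": "username", "KeyType": "HASH"},
            {"AttributeName": "not_processed", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)

table = boto3.resource("dynamodb").Table("jobs")

# Only items that still carry not_processed appear in the sparse index.
pending = table.query(IndexName="pending-jobs",
                      KeyConditionExpression=Key("username").eq("toby"))

# When a job completes, removing the attribute drops the item out of the index.
table.update_item(Key={"username": "toby", "job_id": "1234"},
                  UpdateExpression="REMOVE not_processed")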

Mapping restful architecture to CRUD functionality

I am trying to develop my first RESTful service in Java and am having some trouble mapping the methods to CRUD functionality.
My URI structure is as follows and maps to a basic database structure:
/databases/{schema}/{table}/
/databases is static
{schema} and {table} are dynamic and are resolved from the path parameters
This is what I have:
Method - URI - DATA - Comment
---------------------------------------------------------------------
GET - /databases - none - returns a list of databases
POST - /databases - database1 - creates a database named database1
DELETE - /databases - database1 - deletes the database1 database
PUT - /databases - database1 - updates database1
Currently, in the example above, I am passing the database name through as a JSON object. However, I am unsure if this is correct. Should I instead be doing this (using the DELETE method as an example):
Method - URI - DATA - Comment
---------------------------------------------------------------------
DELETE - /databases/database1 - none - deletes the database with the same name
If this is the correct method and I needed to pass extra data, would the below then be correct:
Method - URI - DATA - Comment
---------------------------------------------------------------------
DELETE - /databases/database1 - some data - deletes the database with the same name
Any comments would be appreciated
REST is an interface into your domain. Thus, if you want to expose a database, then CRUD will probably work. But there is much more to REST (see below).
REST-afarians will object to your service being called RESTful, since it does not fit one of the key constraints: the hypermedia constraint. But that can be addressed if you add links to the documents (hypermedia) that your service generates / serves. See the hypermedia constraint. After this, your users will follow links and forms to change things in the application (databases, tables and rows in your example):
- GET /database -> List of databases
- GET /database/{name} -> List of tables
- GET /database/{name}/{table}?page=1 -> First set of rows in table XXXXX
- POST /database/{name}/{table} -> Create a record
- PUT /database/{name}/{table}/{PK} -> Update a record
- DELETE /database/{name}/{table}/{PK} -> Send the record to the big PC in the sky..
Don't forget to add links to your documents!
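As a hedged illustration of that mapping (in Python/Flask rather than the Java stack in the question, with hypothetical resource names and a toy in-memory store), the identifier lives in the URI and each response carries links to follow:

from flask import Flask, jsonify, request

app = Flask(__name__)
databases = {"database1": {"tables": {}}}   # toy stand-in for real storage

@app.get("/databases")
def list_databases():
    # Hypermedia: each entry links to the resource it names.
    return jsonify([{"name": n, "href": f"/databases/{n}"} for n in databases])

@app.post("/databases")
def create_database():
    name = request.get_json()["name"]
    databases[name] = {"tables": {}}
    return jsonify({"name": name, "href": f"/databases/{name}"}), 201

@app.delete("/databases/<name>")
def delete_database(name):
    # The target is identified by the URI alone; no request body is needed.
    databases.pop(name, None)
    return "", 204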
Using REST only for CRUD is kind of like putting it in a straitjacket :). Your URIs can represent any concept, so how about exposing some more creative / rich URIs based on the underlying resources (functionality) that you want your service or web app to offer?
Take a look at this great article: How to GET a Cup of Coffee.