Web services with a large number of extensions

I need to design a traditional sort of "get customer by ID" SOAP service operation. The thing is that customer data is retrieved from 15+ systems, and depending on the customer's geographical location, different data needs to come back in addition to the traditional base set of facts about a customer (name, address, phone number, etc.).
There are therefore two types of customer information at hand:
Common customer data that can fit into a "canonical data model" type bucket, e.g. name, address, etc.
Region-specific customer data that is specific to a customer's region or country. Think social security number if you're in the U.S., National Insurance number if you're in the UK, and so on.
Customers can come from a wide range of regions, so if we followed the traditional approach of a customer base type plus a number of extensions (e.g. a USACustomerType that extends the CustomerType base type), we would quickly run into two issues:
The WSDL will be huge
The WSDL will contain large chunks of information that are irrelevant to most consumers of the service
I'm trying to avoid both of these things.
The options I have thought of, all in my opinion quite mediocre, are:
Have only a base customer type and an unbounded list of name-value pairs, so that different systems can add their own pairs and consumers looking for that data can pick them out (nice and extensible, as the sketch after this list shows, but the data becomes stringly-typed and the schema can no longer validate it)
Bite the bullet and have a giant WSDL that has a CustomerType base type and a number of extensions, adding more nations as they become available
A variation on #2, which is to have only a base CustomerType and a large number of optional fields
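To make the trade-off in option #1 concrete, here is roughly its data shape, sketched as a TypeScript interface rather than the XSD the real contract would use (all field names are illustrative):

// Option #1 as a data shape: a fixed canonical core plus an open-ended
// list of name-value pairs that any backing system can append to.
interface NameValuePair {
  name: string;  // e.g. "ssn" in the US, "nationalInsuranceNumber" in the UK
  value: string;
}

interface Customer {
  // canonical data model, common to every region
  id: string;
  name: string;
  address: string;
  phoneNumber: string;
  // region-specific data: extensible without touching the contract, but
  // stringly-typed, so the schema cannot validate it and every consumer
  // must know the magic pair names
  extensions: NameValuePair[];
}

The contract stays small no matter how many regions you add, but everything region-specific degrades into untyped strings, which is exactly why the option feels mediocre.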
I suppose the question is: Have you run into a situation like this and if so, how did you deal with it?


DynamoDB one-to-one

Hello stackoverflow community,
This question is about modeling one-to-one relationships with multiple entities involved.
Say we have an application about students. Each Student has:
Profile (name, birth date...)
Grades (math score, geography...)
Address (city, street...).
Requirements:
The Profile, Grades and Address each belong to exactly one Student (i.e. one-to-one).
A Student has to have all Profile, Grades and Address data present (there is no student without grades for example).
Updates can happen to all fields, but the profile data mostly remains untouched.
We access the data based on a Student and not by querying for the address or something else (a query could be "give me the grades of student John", or "give me profile and address of student John", etc).
All fields put together are below the 400 KB item size limit of DynamoDB.
The question is how would you design it? Put all data as a single row/item or split it to Profile, Grades and Address items?
My solution is to go with keeping all data in one row defined by the studentId as the PK and the rest of the data follow in a big set of columns. So one item looks like [studentId, name, birthDate, mathsGrade, geographyGrade, ..., city, street].
This way I get transactional inserts/updates (with the downside that I always have to work with the full item, of course), and when querying I can ask for just the subset of data needed each time.
On top of the above, this solution fits with two of the most important AWS guidelines about DynamoDB:
keep everything in a single table and
pre-join data whenever possible.
The reason for my question is that I could only find one topic on Stack Overflow about one-to-one modeling in DynamoDB, and the suggested solution (also heavily up-voted) was in favor of keeping the data in separate tables, which reminds me of a relational-DB kind of design (see the solution here).
I understand that in that context the author tried to keep a more generic use case and probably support more complex queries, but it feels like the option of putting everything together was fully devalued.
For that reason I'd like to open that discussion here and listen to other opinions.
A Basic Implementation
Considering the data and access patterns you've described, I would set up a single student-data table with a partition key that allows me to query by the student, and a sort key that allows me to narrow down my results even further based on the entity I want to access. One way of doing that would be to use some kind of identifier for a student, say studentId, and then something more generalized for the sort key, like entityId, or simply SK.
At the application layer, I would classify each Item under one possible entity (profile, grades, address) and store data relevant to that entity in any number of attributes that I would need on that Item.
An example of how that data might look for a student named john smith:
{ studentId: "john", entityId: "profile", firstName: "john", lastName: "smith" }
{ studentId: "john", entityId: "grades", math2045: 96.52, eng1021:89.93 }
{ studentId: "john", entityId: "address", state: "CA", city: "fresno" }
With this schema, all your access patterns are available:
"give me the math grades of student john"
PartitionKey = "john", SortKey = "grades"
and if you store the address within the student's profile entity, you can accomplish "give me profile and address of student John" in one shot (multiple queries should be avoided when possible)
PartitionKey = "john", SortKey = "profile"
Consider
Keep in mind, you need to take into account how frequently you are reading/writing data when designing your table. This is a very rudimentary design, and may need tweaking to ensure that you're not setting yourself up for major cost or performance issues down the road.
The basic idea that this implementation demonstrates is that denormalizing your data (in this case, across the different entities you've established) can be a very powerful way to leverage DynamoDB's speed, and also leave yourself with plenty of ways to access your data efficiently.
Problems & Limitations
Specific to your application, there is one potential problem that stands out: it seems very feasible that the grades Items will start to balloon to the point where they become impossible to manage and expensive to read/write/update. As you store more and more students, and each student takes more and more courses, your grades entities will expand with them. Say the average student takes 35-40 classes and gets a grade for each of them; you don't want to have to manage 35-40 attributes on an Item if you don't have to. You also may not want every single grade back every time you ask for a student's grades. And maybe you start storing more data on each grade entity, like:
{ math1024Grade: 100, math1024Instructor: "Dr. Jane Doe", math1024Credits: 4 }
Now for each class, you're storing at least 2 extra attributes. That Item with 35-40 attributes just jumped up to 105-120 attributes.
On top of performance and cost issues, your access patterns could start to evolve and become more demanding. You may only want grades from the student's major, or from a certain type of class (humanities, sciences, etc.), which the current design can't express: you will only ever be able to get every single grade for each student. You can apply a FilterExpression to your request and remove some of the unwanted Items, but you're still paying for all the data you've read.
With the current solution, we are leaving a lot on the table in terms of optimizations in performance, flexibility, maintainability, and cost.
Optimizations
One way to address the lack of flexibility in your queries, and possible bloating of grades entities, is the concept of a composite sort key.
Using a composite sort key can help you break down your entities even further, making them more manageable to update and providing you more flexibility when you're querying. Additionally, you would wind up with much smaller and more manageable items, and although the number of items you store would increase, you'll save on cost and performance. With more optimized queries, you'll get only the data you need back so you're not paying those extra read units for data you're throwing away. The amount of data a single Query request can return is limited as well, so you may cut down on the amount of roundtrips you are making.
That composite sort key could look something like this, for grades:
{ studentId: "john", entityId: "grades#MATH", math2045: 96.52, math3082:91.34 }
{ studentId: "john", entityId: "grades#ENG", eng1021:89.93, eng2203:93.03 }
Now, you get the ability to say "give me all of John's MATH course grades" while still being able to get all the grades (by using the begins_with operation on the sort key when querying).
If you think you'll want to start storing more course information under grades entities, you can suffix your composite sort key with the course name, number, identifier, etc. Now you can get all of a student's grades, all of a student's grades within a subject, and all the data about a student's grade within a subject, like its instructor, credits, year taken, semester, start date, etc.
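Sketched against the same assumed student-data table, those composite sort key queries could look like this:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// "give me all of john's MATH course grades"
const mathGrades = await ddb.send(new QueryCommand({
  TableName: "student-data",
  KeyConditionExpression: "studentId = :pk AND begins_with(entityId, :prefix)",
  ExpressionAttributeValues: { ":pk": "john", ":prefix": "grades#MATH" },
}));

// "give me all of john's grades" still works, because every grades
// entity shares the "grades#" prefix on the sort key
const allGrades = await ddb.send(new QueryCommand({
  TableName: "student-data",
  KeyConditionExpression: "studentId = :pk AND begins_with(entityId, :prefix)",
  ExpressionAttributeValues: { ":pk": "john", ":prefix": "grades#" },
}));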
These optimizations are all possible solutions, but may not fit your application, so again keep that in mind.
Resources
Here are some resources that should help you come up with your own solution, or ways to tweak the ones I've provided above to better suit you.
AWS re:Invent 2019: Data modeling with Amazon DynamoDB (CMY304)
AWS re:Invent 2018: Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB (DAT401)
Best Practices for Using Sort Keys to Organize Data
NoSQL Design For DynamoDB
And keep this one in mind especially when you are considering cost/performance implications for high-traffic applications:
Best Practices for Designing and Using Partition Keys Effectively

DynamoDB query all users sorted by name

I am modelling the data of my application to use DynamoDB.
My data model is rather simple:
I have users and projects
Each user can have multiple projects
Users can number in the millions, and projects per user in the thousands.
My access pattern is also rather simple:
Get a user by id
Get a list of paginated users sorted by name or creation date
Get a project by id
Get projects by user sorted by date
My data model fits in a single table. I can easily implement all my access patterns using the table PK/SK and GSIs, but I have issues with number 2.
According to the documentation and best practices, to get a sorted list of paginated users:
I can't use a scan, as sorting is not supported
I should not use a GSI with a PK that would put all my users in the same partition (e.g. GSI PK = "sorted_user", SK = "name"), as that would make my single partition hot and would not scale
I can't create a new entity of type "organisation", put all users in there, and query by PK = "org", as that would have the same hot partition issue as above
I could bucket users and use write sharding, but I don't really know how I could practically query paginated, sorted users: bucket PKs would need to be more or less random, so I would have to query all buckets and merge the results to sort all users together. I also thought that bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard.
My application model is rather simple. However, after having read all the docs and best practices and watched many online videos, I find myself stuck on the most basic of use cases, which DynamoDB does not seem to support well. I suppose it must be quite common to have to get lists of users in some sort of admin panel for practically any modern application.
What would others do in this case? I would really like to use DynamoDB for all the benefits that it gives, especially in terms of cost.
Edit
Since I have been asked, in my app the main use case for 2) is something like this: https://stackoverflow.com/users?tab=Reputation&filter=all.
As to the sizing, it needs to scale well, at least to the tens of thousands.
"I also thought that bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard."
I think this sounds like a reasonable approach.
The US Social Security Administration publishes data about names on its website. You can download the list of name data from as far back as 1879! I stumbled upon a website from data scientist and linguist Joshua Falk that charted the baby name data from the SSA, which can give us a hint of how names are distributed by their first letter.
Your users may not all be from the US, but this can give us an understanding of how names might be distributed if partitioned by the first letter.
While not exactly evenly distributed, perhaps it's close enough for your use case? If not, you could further distribute the data by using the first two (or three, or four...) letters of the name as your partition key.
1 million names likely amount to no more than a few MB of data, which isn't very much. Partitioning based on name prefixes seems like a reasonable way to proceed.
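As a rough sketch of how that prefix partitioning could be queried (assuming a GSI named name-index whose partition key firstLetter is the lowercased first letter of the name and whose sort key is the name itself; both names are made up):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Page through all users sorted by name: walk the letter partitions in
// alphabetical order; within each partition the GSI already returns
// items sorted by its sort key (the name), so no merge step is needed.
async function* usersSortedByName(pageSize: number) {
  for (const letter of "abcdefghijklmnopqrstuvwxyz") {
    let startKey: Record<string, any> | undefined;
    do {
      const page = await ddb.send(new QueryCommand({
        TableName: "users",       // assumed table name
        IndexName: "name-index",  // assumed GSI
        KeyConditionExpression: "firstLetter = :l",
        ExpressionAttributeValues: { ":l": letter },
        Limit: pageSize,
        ExclusiveStartKey: startKey,
      }));
      yield page.Items ?? [];
      startKey = page.LastEvaluatedKey;
    } while (startKey);
  }
}

Because the buckets are alphabetical, the cross-partition order is simply alphabetical order, so the merge problem from the question goes away; the trade-off, as noted, is that popular letters still run hotter than rare ones.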
You might also consider using a tool like ElasticSearch, which could support your second access pattern and more.

AWS Personalize items attributes

I'm trying to implement personalization and I'm having problems with the Items schema.
Imagine I'm Amazon: I have products, their brands, and their categories. How should I include this information in the Items schema?
Should I include the brand name as a categorical string field? Should I rather include the brand ID, as a string or as a number? Or should I include both?
What about categories? I have the same questions.
Metadata Fields
Metadata includes string or non-string fields that aren't required or don't use a reserved keyword. Metadata schemas have the following restrictions:
Users and Items schemas require at least one metadata field.
Users and Interactions datasets can contain up to five metadata fields. An Items dataset can contain up to 50 metadata fields.
If you add your own metadata field of type string, it must include the categorical attribute. Otherwise, Amazon Personalize won't use the field when training a model.
https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html
There are simply two ways to include your metadata in the Items/Users datasets:
If it can be represented as a number, then provide the actual numeric value if it makes sense.
If it can be represented as a string, then provide the string value and make sure that categorical is set to true.
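To make that concrete, here is a sketch of registering an Items schema with a categorical string field using the AWS SDK for JavaScript v3 (BRAND and PRICE are assumed field names):

import { PersonalizeClient, CreateSchemaCommand } from "@aws-sdk/client-personalize";

// Avro schema for an Items dataset: ITEM_ID is required; BRAND is our
// own string metadata field, so it must be marked categorical, while a
// numeric field like PRICE needs no flag.
const itemsSchema = {
  type: "record",
  name: "Items",
  namespace: "com.amazonaws.personalize.schema",
  fields: [
    { name: "ITEM_ID", type: "string" },
    { name: "BRAND", type: "string", categorical: true },  // assumed field
    { name: "PRICE", type: "float" },                      // assumed field
  ],
  version: "1.0",
};

const personalize = new PersonalizeClient({});
await personalize.send(new CreateSchemaCommand({
  name: "items-schema",  // assumed schema name
  schema: JSON.stringify(itemsSchema),
}));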
But let's take a look at the question "Why do they need me to categorize my string metadata?". The answer is pretty simple.
Let's start with an example.
If your Items were Amazon.com products and you wanted to provide a rating metadata field, then:
You could take all of the ratings, including the full review text sent by clients, and simply put them in as the metadata field.
You could take just the star ratings, calculate the average, and put that in as the metadata field.
The second one probably makes more sense in general. Having random, long product reviews as metadata changes pretty much nothing. Personalize doesn't understand whether the review itself is good or bad, or whether the author also recommends another product, so it doesn't really add anything to the recommendations.
However, if you simply "cut" your dataset and calculate the average rating, as in the second option, then it makes a lot more sense. Maybe some of your customers like crappy products? Maybe they want to buy them because they are famous YouTubers who make videos about them? Based on their previous interactions and much more, Personalize will be able to perform slightly better, because now it knows that this product has a rating of 5/5 or 3/5.
I wanted to show you that in some cases providing Items metadata as a free-form string makes no sense. That's why your string metadata must be categorical: it should be a finite set of values, so that it adds some knowledge for Personalize about a given Item and why some people might want to interact with it.
Going back to your question:
Should I include brand name as string as categorical field? Should I rather include brand ID as string or numeric? or should I include both?
I would simply go with the brand ID as a string. You could also go with the brand name, but a brand can be renamed while remaining the same brand, so the ID is more stable. Also, two different brands could have the same name because they are present on different markets; picking the ID solves that as well.
The "categorical": true switch in your schema just tells Personalize:
Hey, do you see that string field? It's a categorised, finite set of values. If you train a model for me, please include this one during the training, it's important!
And as the documentation says, if you provide a string metadata field that is not marked as categorical, then Personalize will "think":
Hmm... this field is a string, it has pretty random values, and it's not marked as categorical. It's probably just a leftover from the Items export job. Let's ignore it.

Ember index data -vs- show data

How do people deal with index data (the data usually shown on index pages, like a customer list) -vs- the model detail data?
When somebody goes to the customer/index route, they only need access to a small subset of the full customer resource. Since I am dealing with legacy data, my customer model has more than 10 relationships. It seems wasteful to have the API return a complete and full customer representation for every customer just to render a list/select/index view.
I know those relationships are somewhat lazy-loaded, but it still takes effort on the backend to pull all those relationships in. For some relationships (such as customer->invoices) this could be a large list of ids.
I feel answers to this can be very opinionated. But my two cents:
The API you are drawing on for your data should have an end-point to fetch the subset of data you're interested in, e.g. /api/mini-customer vs /api/customer.
You can then either define two separate models (one to represent the model in the list and one to represent the detailed view), or simply populate the original model with the subset of data and merge the extra data in at a later point.
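A rough sketch of the two-model variant in Ember Data (the model name, fields, and the /api/mini-customer endpoint are all assumptions carried over from above):

// app/models/customer-summary.ts
// Slim model for index/list routes; the full customer model with its
// 10+ relationships stays reserved for the detail route.
import Model, { attr } from "@ember-data/model";

export default class CustomerSummaryModel extends Model {
  @attr("string") declare name: string;
  @attr("string") declare email: string;  // assumed list-view fields
}

// app/adapters/customer-summary.ts
// Point the slim model at the lightweight endpoint.
import JSONAPIAdapter from "@ember-data/adapter/json-api";

export default class CustomerSummaryAdapter extends JSONAPIAdapter {
  namespace = "api";
  pathForType() {
    return "mini-customer";  // assumed endpoint from above
  }
}

The adapter override is the only wiring the slim endpoint needs; whether two models are worth it depends on how far the list and detail representations diverge.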
That said, I've also seen plenty of cases such as the one you describe, where you load all data initially and just display the subset to begin with. If it's reasonable that the data will eventually be used and your page-load constraints can handle it, then this can be an acceptable approach.

How do you implement multi-tenancy on CouchBase? Can it be performant?

I'm considering an app which will store customer data. Given the way buckets work in CouchBase, all customer data will be in one bucket. It appears that I have two choices:
Implement multi-tenancy in views, by assigning a field to each record that indicates the customer it belongs to.
Implement it by putting a customer ID component on every key.
It seems, though, that since I will be using views, I'll really want to do both. In the case of option #2, I need to have the data in the record so that it can be indexed (or maybe I can pull part of the key out in the map phase and index on customer), and in the case of option #1, I'd want the customer ID to be part of the key as a check when retrieving data, to make sure I don't send the wrong customer's data down the line.
The problem is, this is a service where multiple customers will interact, and sometimes one customer will create some data and another will view it, at the first customer's request. But putting an ACL on each record that lists everyone who's authorized to view it would be problematic, to say the least.
I bet there is a common methodology or design pattern to answer this question, and would appreciate some pointers to best practices.
I'm also concerned about performance if the indexes index both the particular piece of relevant data and the customer ID... a large number of different customers would presumably make the indexes much less efficient (but maybe not).
Here are my thoughts on your questions:
[Concerning items #1 and 2] - It seems, though, that since I will be using views, I'll really want to do both.
This doesn't seem to make sense to me. In Couchbase, the map phase can include content from both the key and the value. It makes little sense to store the data in both the key and the value, as you are guaranteed to have 1:1 duplication there. Store it wherever it makes the most sense to store it; in this case, probably the value.
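For example, the map function of a per-customer view, as it would appear in the view definition, might look like this (a sketch, assuming each document's value carries customerId and type fields):

// Index each document by the customerId stored in its value, compounded
// with the document type, so one view can serve "all objects of type T
// for customer X" style queries.
function (doc, meta) {
  if (doc.customerId && doc.type) {
    emit([doc.customerId, doc.type], null);
  }
}

Querying the view with a startkey/endkey range on ["<customerId>"] then restricts the results to a single customer, which also doubles as the retrieval-time safety check mentioned in the question.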
The problem is, this is a service where multiple customers will interact, and sometimes one customer will create some data and the other will view it, at the first customers request. But putting an ACL on each record that lists everyone who's authorized to view it would be problematic, to say the least.
My site also has multi-tenant data stored in a single database. In my case, I use object unique identifiers as my keys. By default, customers can access all objects that belong to them (I have a user object, and the user is associated with a customer account). Users may also have additional permissions assigned to them, whereby a single object from another customer can be added to their user account, granting them access to view that object.
The alternative is "security through obscurity": use GUIDs as random identifiers, giving customers access to view any object that they have the GUID for.
I would not, however, try to store the permissions on the objects themselves. That would quickly become unwieldy. You need to think about your specific use case, and decide what simple approach would work for the majority of the cases, and just not support the other 1-2% of the cases.