I am trying to create a simple logging system using DynamoDB. Using this simple table structure:
{
    "userName": "Billy Wotsit",
    "messageLog": [
        {
            "date": "2022-06-08 13:17:03",
            "messageId": "j2659afl32btc0feqtbqrf802th296srbka8tto0",
            "status": 200
        },
        {
            "date": "2022-06-08 16:28:37.464",
            "id": "eb4oqktac8i19got1t70eec4i8rdcman6tve81o0",
            "status": 200
        },
        {
            "date": "2022-06-09 11:54:37.457",
            "id": "m5is9ah4th4kl13d1aetjhjre7go0nun2lecdsg0",
            "status": 200
        }
    ]
}
It is easily possible that the number of items in the message log will run into the thousands.
According to the documentation an "item" can have a maximum size of 400 kB, which severely limits the maximum number of log elements that can be stored.
What would be the correct way to store this amount of data without resorting to a more traditional SQL approach (which is not really needed)?
Some information on use cases for your data would help. My primary questions would be:
How often do you need to update/append logs to an entry?
How do you need to retrieve the logs? Do you need all of them? Do you need them per user? Will you filter by time?
Without knowing any more info on your data read/write patterns, a better layout would be to have each log entry be an item in the DB:
The username can be the partition key
The log date can be the sort key.
(optional) If you need to ever retrieve by message id, you can create a secondary index. (Or vice versa with the log date)
With the above structure, you can trivially append log entries & retrieve them efficiently assuming you've set your sort key / indexes appropriately to your use case. Focus on understanding how the sort key and secondary indexes optimize finding your entries in the DB. You want to avoid scanning through the entire database looking for your items. See Core Components of Amazon DynamoDB for more info.
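As a rough sketch of that layout in boto3 (assuming a hypothetical table named MessageLog with userName as the partition key and the timestamp string as the sort key), appending and querying entries might look like this:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("MessageLog")  # hypothetical table: PK = userName, SK = date

# Append one log entry as its own item (no 400 KB list to grow).
table.put_item(
    Item={
        "userName": "Billy Wotsit",            # partition key
        "date": "2022-06-09 11:54:37.457",     # sort key
        "messageId": "m5is9ah4th4kl13d1aetjhjre7go0nun2lecdsg0",
        "status": 200,
    }
)

# Fetch all of a user's entries for one day, already ordered by the sort key.
response = table.query(
    KeyConditionExpression=Key("userName").eq("Billy Wotsit")
    & Key("date").begins_with("2022-06-09")
)
log_entries = response["Items"]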
I'm adding a product recommendation feature with Amazon Personalize to an e-commerce website. We currently have a huge product catalog with millions of items. We want to be able to use Amazon Personalize on our item details page to recommend other relevant items to the current item.
Now, as you may be aware, Amazon Personalize relies heavily on user interactions to provide recommendations. However, since we only just started our new line of business, we're not getting enough interaction data. The majority of items in our catalog have no interactions at all. A few items (thousands), though, get interacted with a lot, which then has a huge influence on the recommendation results. Hence you will see those few items always getting recommended even if they are not relevant to the current item at all, creating very odd recommendations.
I think this is what we usually refer to as a "cold-start" situation, except that the usual cold-start problems are about item cold-start or user cold-start; the problem I am faced with now is a new-business cold-start: we don't have the basic amount of interaction data to support fully personalized recommendations. In the absence of interaction data for each item, we want the Amazon Personalize service to rely on the item metadata to provide recommendations. Ideally, we want the service to recommend based on item metadata and, once it's getting more interactions, recommend based on item metadata + interactions.
So far I've done quite a lot of research, only to find one solution: increase explorationWeight when creating the campaign. As this article indicates, "Higher values for explorationWeight signify higher exploration; new items with low impressions are more likely to be recommended." But it does NOT seem to do the trick for me. It improves the situation a little bit, but I still often see odd results being recommended because of their higher interaction rate.
I'm not sure if there are any other solutions out there to remedy my situation. How can I improve the recommendation results when I have a huge catalog with not enough interaction data?
I'd appreciate it if anyone has any advice. Thank you and have a good day!
The SIMS recipe is typically what is used on product detail pages to recommend similar items. However, given that SIMS only considers the user-item interactions dataset and you have very little interaction data, SIMS will not perform well in this case. At least at this time. Once you have accumulated more interaction data, you may want to revisit SIMS for your detail page.
The user-personalization recipe is a better match here since it uses item metadata to recommend cold items that the user may be interested in. You can improve the relevance of recommendations based on item metadata by adding textual data to your items dataset. This is a new Personalize feature (see blog post for details). Just add your product descriptions to your items dataset as a textual field as shown below and create a solution with the user-personalization recipe.
{
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "BRAND",
            "type": [
                "null",
                "string"
            ],
            "categorical": true
        },
        {
            "name": "PRICE",
            "type": "float"
        },
        {
            "name": "DESCRIPTION",
            "type": [
                "null",
                "string"
            ],
            "textual": true
        }
    ],
    "version": "1.0"
}
If you're still using this recipe on your product detail page, you can also consider using a filter when calling GetRecommendations to limit recommendations to the current product's category.
INCLUDE ItemID WHERE Items.CATEGORY IN ($CATEGORY)
Where $CATEGORY is the current product's category. This may require some experimentation to see if it fits with your UX and catalog.
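A hedged sketch of that call with boto3 (the campaign ARN, filter ARN, user ID and the CATEGORY parameter name are placeholders, not values from your setup) could look like this:

import boto3

personalize_runtime = boto3.client("personalize-runtime")

response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:...:campaign/my-campaign",  # placeholder ARN
    userId="current-user-id",       # user-personalization keys off the user
    filterArn="arn:aws:personalize:...:filter/same-category",    # placeholder ARN
    filterValues={"CATEGORY": '"electronics"'},  # values must be JSON-encoded strings
    numResults=10,
)
recommended_item_ids = [item["itemId"] for item in response["itemList"]]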
I've been working with regular SQL databases and now wanted to start a new project using AWS services. I want the back end data storage to be DynamoDB and what I want to store is a tiered document, like an instruction booklet of all the programming tips I learned that can be pulled and called up via a React frontend.
So the data will be in a format like Python -> Classes -> General -> "Information on Classes Text Wall"
There will be more than one subdirectory at times.
Future plans would be to be able to add new subfolders, move data to different folders, "thumbs up", and eventual multi account with read access to each other's data.
I know how to do this in a SQL DB, but have never used a NoSQL before and figured this would be a great starting spot.
I am also thinking about how to lay out the partition and sort keys. I doubt this side project would ever grow beyond a single cluster, but I know that with NoSQL you have to plan your layout ahead of time.
If NoSQL is just a horrible fit for this style of data, let me know as well. This is mostly for practice and to get hands-on with AWS services.
DynamoDB is a key-value database with the option to add secondary indices. It's good for storing documents that don't require full-scan or aggregation queries. If you design your tiered-document application to show only one document at a time, then DynamoDB would be a good choice. You can store the documents with a structure like this:
DocumentTable:
{
    "title": "Python",
    "parent_document": "root",
    "child_documents": ["Classes", "Built In", ...],
    "content": "text"
}
Where:
parent_document - the "title" of the parent document; it may be empty (or "root") for "Python" in your example, and "Python" for a document titled "Classes"
content - text or an unstructured document with notes, thumbs up, etc. Don't plan to execute conditional queries over it, otherwise you'll need a global secondary index. But as you won't have many documents, a full scan of the table won't take long.
You can also have another table holding a table of contents for a user's tiered document, which you can use to navigate the documents more easily; however, in this case you need to take care of keeping that table consistent.
Example:
ContentsTable:
{
    "user": "...",  // primary key for this table in case you have many users
    "root": {
        "Python": {
            "Classes": {
                "General": ["Information on Classes Text Wall"]
            }
        }
    }
}
Where Python, Classes, General and Information on Classes Text Wall are keys into DocumentTable.title. You can also use something other than titles to keep the keys unique. The DynamoDB maximum item size is 400 KB, so this should be enough for a pretty large table of contents.
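For illustration, a minimal boto3 sketch of writing and walking one node of DocumentTable (the table name and the choice of title as the partition key are assumptions) might be:

import boto3

dynamodb = boto3.resource("dynamodb")
documents = dynamodb.Table("DocumentTable")  # hypothetical table: PK = title

# Store one node of the tiered document.
documents.put_item(
    Item={
        "title": "Classes",             # partition key
        "parent_document": "Python",
        "child_documents": ["General"],
        "content": "Notes about classes in general",
    }
)

# Fetch a node by title, then walk down to its children.
node = documents.get_item(Key={"title": "Classes"})["Item"]
children = [
    documents.get_item(Key={"title": child})["Item"]
    for child in node["child_documents"]
]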
I have some sample data (which is simulating real data that I will begin getting soon), that represents user behavior on a website.
The data is broken down into 2 json files for every day of usage. (When I'm getting the real data, I will want to fetch it every day at midnight). At the bottom of this question are example snippets of what this data looks like, if that helps.
I'm no data scientist, but I'd like to be able to do some basic analysis on this data. I want to be able to see things like how many of the user-generated objects existed on any given day, and the distribution of different attributes that they have/had. I'd also like to be able to visualize which objects are getting edited more, by whom, when and how frequently. That sort of thing.
I think I'd like to be able to make dashboards in Google Data Studio (or similar), which basically means getting this data in a usable format into a normal relational database. I'm thinking Postgres on AWS RDS (there isn't so much data that I'd need something like Aurora, I think, though I'm not terribly opposed).
I want to automate the ingestion of the data (for now the sample data sets I have stored on S3, but eventually from an API that can be called daily). And I want to automate any reformatting/processing this data needs to get the types of insights I want.
AWS has so many data science/big data tools that it feels to me like there should be a way to automate this type of data pipeline, but the terminology and concepts are too foreign to me, and I can't figure out what direction to move in.
Thanks in advance for any advice that y'all can give.
Data example/description:
One file is a catalog of all user-generated objects that exist at the time the data was pulled, along with their attributes. It looks something like this:
{
    "obj_001": {
        "id": "obj_001",
        "attr_a": "a1",
        "more_attrs": {
            "foo": "fred",
            "bar": null
        }
    },
    "obj_002": {
        "id": "obj_002",
        "attr_a": "b2",
        "more_attrs": {
            "foo": null,
            "bar": "baz"
        }
    }
}
The other file is an array that lists all the user edits to those objects that occurred in the past day, which resulted in the state from the first file. It looks something like this:
[
    {
        "edit_seq": 1,
        "obj_id": "obj_002",
        "user_id": "u56",
        "edit_date": "2020-01-27",
        "times": {
            "foo": null,
            "bar": "baz"
        }
    },
    {
        "edit_seq": 2,
        "obj_id": "obj_001",
        "user_id": "u25",
        "edit_date": "2020-01-27",
        "times": {
            "foo": "fred",
            "bar": null
        }
    }
]
It depends on the architecture that you want to deploy. If you want an event-based trigger, I would use SQS; I have used it heavily. As soon as someone drops a file in S3, it can send a message to SQS, which can in turn trigger a Lambda function.
Here is a link which can give you some idea: http://blog.zenof.ai/processing-high-volume-big-data-concurrently-with-no-duplicates-using-aws-sqs/
You could build data pipelines using AWS Data Pipeline, for example if you want to read data from S3, apply some transformations, and then load the result into Redshift.
You can also have a look at AWS Glue, which has a Spark backend and can also crawl the schema and perform ETL.
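As one possible sketch, not a definitive pipeline: a daily Lambda could read the edits file from S3 and insert rows into Postgres on RDS. The bucket, key, table names and credential handling below are all assumptions for illustration.

import json

import boto3
import psycopg2  # shipped with the function, e.g. as a Lambda layer

s3 = boto3.client("s3")

def handler(event, context):
    # Assumed object layout: s3://my-bucket/exports/<date>/edits.json
    body = s3.get_object(Bucket="my-bucket", Key="exports/2020-01-27/edits.json")["Body"].read()
    edits = json.loads(body)

    # In practice, pull credentials from Secrets Manager or environment variables.
    conn = psycopg2.connect(host="my-rds-host", dbname="analytics",
                            user="ingest", password="...")
    with conn, conn.cursor() as cur:
        for edit in edits:
            cur.execute(
                "INSERT INTO edits (edit_seq, obj_id, user_id, edit_date) "
                "VALUES (%s, %s, %s, %s)",
                (edit["edit_seq"], edit["obj_id"], edit["user_id"], edit["edit_date"]),
            )
    conn.close()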
I'm learning AWS API Gateway + Lambda + DynamoDB by building a very simple API project.
I have a daily value starting from 2013-01-01 that keeps updating every day, so basically it is something like:
[
    {
        "value": 1776.09,
        "date": "2013-01-01"
    },
    {
        "value": 1779.25,
        "date": "2013-01-02"
    },
    // ...
    {
        "value": 2697.32,
        "date": "2018-11-22"
    }
]
In the API I want to get the data for a specific day and for a range (dateFrom - dateTo). I've been reading about DynamoDB and planning to have the date as the partition key in the format YYYY-MM-DD with no sort key, but I'm not sure if this is the correct approach for this type of data and for the range query I'm going to be doing, as I assume I'll have to do a full table scan for the range query, although it is a small data set.
Can someone tell me whether this approach is right, or whether I need to reconsider my table structure?
What you propose will work.
However, if you want to improve the efficiency of the design, you could use a partition key of YYYY and then your sort key could be MM-DD. That way, you can use a query operation to limit the results (or you could still use a scan).
You could even use a single, constant value for the partition key and date as the sort key, but having the same partition key for every item is generally not recommended.
Either way, your data is small enough that you should probably just pick the implementation that is simplest to develop and maintain.
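For illustration, here is a minimal boto3 sketch of the YYYY partition key / MM-DD sort key layout (the table and attribute names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("DailyValues")  # hypothetical: PK = year, SK = monthDay

# Single day: partition key "2018", sort key "11-22".
single_day = table.get_item(Key={"year": "2018", "monthDay": "11-22"}).get("Item")

# Date range within one year: a Query with BETWEEN on the sort key, no Scan needed.
response = table.query(
    KeyConditionExpression=Key("year").eq("2018")
    & Key("monthDay").between("01-15", "03-01")
)
values_in_range = response["Items"]
# A range that spans years would simply need one query per year.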
Copying my answer from this post
A few concepts of NoSQL DBs:
Writes should be spread evenly across primary keys.
Reads should be spread evenly across primary keys.
The obvious thing that comes to mind looking at the given problem and the DynamoDB schema is to
have logs as the partition key and the timestamp as the sort key, and to do an aggregation use
select * where pk=logs and sk is_between x and y
but this will violate both concepts: we are always writing to a single partition key and always reading from the same one.
Now to this particular problem,
Our PK should be random enough (so that there are no hot keys) and deterministic enough (so that we can query it).
We will have to make some assumptions about the application while designing keys. Let's say we decide that we will update every hour; hence we can have 7-jan-2018-17 as a key, where 17 means the 17th hour. This key is deterministic, but it is not random enough: every update or read on 7th Jan will mostly be going to the same partition. To make the key random we can calculate a hash of it using a hashing algorithm like MD5. Let's say that after taking the hash, our key becomes 1sdc23sjdnsd. This will not make any sense if you are looking at the table data, but if you want to know the event count for 7-jan-2018-17 you just hash the time and do a get from DynamoDB with the hash key.
If you want to know all the events on 7-jan-2018, you can do 24 such gets and aggregate the counts.
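A small sketch of that idea in Python, assuming a hypothetical table EventCounts whose partition key pk is the MD5 hash of the hour string and whose items carry a count attribute:

import hashlib

import boto3

table = boto3.resource("dynamodb").Table("EventCounts")  # hypothetical table

def hour_key(hour_label):
    # e.g. "7-jan-2018-17" -> an opaque hex string, spreading keys across partitions
    return hashlib.md5(hour_label.encode("utf-8")).hexdigest()

def count_for_hour(hour_label):
    item = table.get_item(Key={"pk": hour_key(hour_label)}).get("Item")
    return int(item["count"]) if item else 0

# Aggregate one day with 24 gets.
day_total = sum(count_for_hour("7-jan-2018-%d" % hour) for hour in range(24))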
Now this kind of schema will have issues where:
You decide to change from an hourly to a per-minute basis.
Most of your queries are ad hoc, like "get me all the data for the last 2, 4, or 6 days". That will mean too many round trips to the DB, and it will be both time and cost inefficient.
The rule of thumb is: when query patterns are well defined, use NoSQL and store the results for performance reasons. If you are trying to run join- or aggregation-style queries on NoSQL, you are force-fitting your use case to your technology choice.
You can also look at the AWS recommendations for storing time series data.
I would like to filter a list and sort it based on an aggregate; something that is fairly simple to express in SQL, but I'm puzzled about the best way to do it with iterative Map Reduce. I'm specifically using Cloudant's "dbcopy" addition to CouchDB, but I think the approach might be similar with other map/reduce architectures.
Pseudocode SQL might look like so:
SELECT grouping_field, aggregate(*)
FROM data
WHERE #{filter}
GROUP BY grouping_field
ORDER BY aggregate(*), grouping_field
LIMIT page_size
The filter might be looking for a match or it might searching within a range; e.g. field in ('foo', 'bar') or field between 37 and 42.
As a concrete example, consider a dataset of emails; the grouping field might be "List-id", "Sender", or "Subject"; the aggregate function might be count(*), or max(date) or min(date); and the filter clause might consider flags, a date range, or a mailbox ID. The documents might look like so:
{
    "id": "foobar", "mailbox": "INBOX", "date": "2013-03-29",
    "sender": "foo#example.com", "subject": "Foo Bar"
}
Getting a count of emails with the same sender is trivial:
"map": "function (doc) { emit(doc.sender, null) }",
"reduce": "_count"
And Cloudant has a good example of sorting by count on the second pass of a map reduce.
But when I also want to filter (e.g. by mailbox), things get messy fast.
If I add the filter to the view keys (e.g. the final result looks like {"key": ["INBOX", 1234, "foo#example.com"], "value": null}), then it's trivial to sort by count within a single filter value. But sorting that data by count across multiple filter values would require traversing the entire data set (per key), which is far too slow on large data sets.
Or I could create an index for each potential filter selection; e.g. the final result looks like {"key": [["mbox1", "mbox2"], 1234, "foo#example.com"], "value": null} (for when both "mbox1" and "mbox2" are selected) or {"key": [["mbox1"], 1234, "foo#example.com"], "value": {...}} (for when only "mbox1" is selected). That's easy to query, and fast. But it seems like the disk size of the index will grow exponentially (with the number of distinct filtered fields), and it seems completely untenable for filtering on open-ended data, such as date ranges.
Lastly, I could dynamically generate views which handle the desired filters on the fly, only on an as-needed basis, and tear them down after they are no longer being used (to save on disk space). The downsides here are a giant jump in code complexity, and a big up-front cost every time a new filter is selected.
Is there a better way?
I've been thinking about this for nearly a day and I think that there is no better way to do this than what you have proposed. The challenges that you face are the following:
1) The aggregation work (count, sum, etc) can only be done in the CouchDB/Cloudant API via the materialized view engine (mapreduce).
2) While the group_level API provides some flexibility to specify variable granularity at query time, it isn't sufficiently flexible for arbitrary boolean queries.
3) Arbitrary boolean queries are possible in the Cloudant API via the Lucene-based _search API. However, the _search API doesn't support aggregation after the query. Limited support for what you want is only available in Lucene using faceting, which isn't yet supported in Cloudant. Even then, I believe it may only support count and not sum or more complex aggregations.
I think the best option you face is to use the _search API and make use of sort, group_by, or group_sort and then do aggregation on the client. A few sample URLs to test would look like:
GET /db/_design/ddoc/_search/indexname?q=name:mike AND age:[1.2 TO 4.5]&sort=["age","name"]
GET /db/_design/ddoc/_search/indexname?q=name:mike&group_by="mailbox"&group_sort=["age","name"]
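For completeness, a rough sketch of the "aggregate on the client" approach in Python (the account, credentials, database, design document and index names are placeholders):

from collections import Counter

import requests

url = "https://ACCOUNT.cloudant.com/db/_design/ddoc/_search/indexname"
params = {
    "q": 'mailbox:"INBOX" AND date:[2013-01-01 TO 2013-12-31]',
    "include_docs": "true",
    "limit": 200,
}
rows = requests.get(url, params=params, auth=("USER", "PASS")).json()["rows"]

# Client-side equivalent of GROUP BY sender ORDER BY count(*) DESC LIMIT 10.
counts = Counter(row["doc"]["sender"] for row in rows)
for sender, n in counts.most_common(10):
    print(sender, n)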