When I do a table.put_item I get the error message "Aggregated size of all range keys has exceeded the size limit of 1024". What options do I have so I can save my data?
Change a setting in DynamoDB to allow a larger limit?
Split or compress the item and save to DynamoDB?
Store the item in s3?
Use another kind of database?
Other options?
Here is the specific code snippet:
import boto3

def put_record(item):
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('table_name')
    table.put_item(Item=item)
Here is an example of an item stored in DynamoDB. The two string variables p and r combined could be up to 4000 tokens.
{
"uuid": {
"S": "5bf19498-344c"
},
"p": {
"S": "What is the next word after Mary had a"
},
"pp": {
"S": "0"
},
"response_length": {
"S": "632"
},
"timestamp": {
"S": "04/03/2022 06:30:55 AM CST"
},
"s": {
"S": "1"
},
"c": {
"S": "test"
},
"f": {
"S": "0"
},
"t": {
"S": "0.7"
},
"to": {
"S": "1"
},
"b": {
"S": "1"
},
"r": {
"S": “lamb”
}
}
I read this
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ServiceQuotas.html
and couldn't figure out how the 1024 limit is calculated, but I'm assuming the two string variables are causing the error.
The put_item doesn't cause an error when Item is a smaller size; only when the size is larger than the 1024 limit.
It is hard to estimate how many of the saves will be large, but I need to be able to save the large items, so from an architecture perspective I'm willing to consider any and all options.
Appreciate the assistance!
The error message "Aggregated size of all range keys has exceeded the size limit of 1024" is baffling because there can only be one sort key, so what does "aggregated" refer to? The author of the following post was also surprised by this message: https://www.stuffofminsun.com/2019/05/07/dynamodb-keys/.
I am guessing you actually do try to write an item whose "p" (which you said is your sort key) is itself over 1024 bytes. I don't see how the size of p+r combined would matter. You can take a look at (and/or include in the question) the one specific request that fails, and check the length of p itself. Please also double-check that you really set "p", and not something else, as the sort key.
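If it helps to confirm this, here is a minimal sketch (assuming the put_record function from the question and that the item values are plain Python strings, as the boto3 resource API expects) that logs the UTF-8 byte length of p and r before the write:

import boto3

def put_record(item):
    # Log the byte lengths first, to see whether p alone is over the 1024-byte sort-key limit.
    print("p:", len(item.get('p', '').encode('utf-8')), "bytes")
    print("r:", len(item.get('r', '').encode('utf-8')), "bytes")

    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('table_name')
    table.put_item(Item=item)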
Finally, if you really need sort keys over 1024 bytes in length, you can consider Scylla Alternator, an open-source DynamoDB-compatible database which doesn't have this specific limitation.
Related
Here is my DynamoDB structure.
{"books": [
{
"name": "Hello World 1",
"id": "1234"
},
{
"name": "Hello World 2",
"id": "5678"
}
]}
I want to set a ConditionExpression to check whether the id already exists before adding new items to the books array. Here is my ConditionExpression. I am using API Gateway to access DynamoDB.
"ConditionExpression": "NOT contains(#lu.books.id,:id)",
"ExpressionAttributeValues": {":id": {
"S": "$input.path('$.id')"
}
}
Result when I test the API: whether or not the id already exists, the item is added to the array.
Any suggestion on how to do it? Thanks!
Unfortunately, you can't. However, there is a workaround.
Store the books in separate rows. For example:
PK: BOOK_LU#<ID>
SK: BOOK_NAME#<book name>#BOOK_ID#<BOOK_ID>
Now you can do a conditional put using the 'attribute_not_exists' condition function (note that 'if_not_exists' only works in update expressions, not in condition expressions):
"ConditionExpression": "attribute_not_exists(PK)"
The ExpressionAttributeValues block is no longer needed; the put simply fails with a ConditionalCheckFailedException when an item with that key already exists.
The con is that if you were previously fetching the list as part of another object, you will have to change that.
The pro is that now you can easily work with the books, and you won't hit the maximum item size limit if the books become too numerous.
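If it helps, here is a minimal boto3 sketch of that conditional put under the layout above (the table name and helper function are hypothetical; the condition is evaluated against the exact PK + SK pair being written):

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('books_table')  # hypothetical table name

def add_book(book_id, book_name):
    try:
        table.put_item(
            Item={
                'PK': f'BOOK_LU#{book_id}',
                'SK': f'BOOK_NAME#{book_name}#BOOK_ID#{book_id}',
                'name': book_name,
                'id': book_id,
            },
            # Fails if an item with this exact key already exists.
            ConditionExpression='attribute_not_exists(PK)',
        )
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            print(f'Book {book_id} already exists')
        else:
            raise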
Basically, I have the following input:
{
"name": "abc",
"choice": "choice1"
}
My dynamoDB table has the following structure:
Partition key - "name"
Complex JSON with choices:
{
"choices":
{
"choice1": ......,
"choice2": ......
}
}
I want to read directly from DynamoDB and get a subitem under the relevant choice:
{
"StartAt": "Read Next Message from DynamoDB",
"States": {
"Read Next Message from DynamoDB": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:getItem",
"Parameters": {
"TableName": "my_table",
"Key": {
"customerName": {"S.$": "$.name"}
}
},
"OutputPath": "$.Item.choices.M.choice1.M.myvalue.S",
"Next": "World"
},
"World": {
"Type": "Pass",
"End": true
}
}
}
Basically, I want to do something like "$.Item.choices.M.{$.choice}.M.myvalue.S" and take one of the output's keys from the input. Is this possible?
I think what you're looking for is JsonPath interpolation, but that is not supported as per this thread on AWS forums.
As far as I know Step Functions allow only path reference through $, . and [] operators (Reference Path).
I don't know how much control you have over the DynamoDB table's data, but I think your problem can be solved easily if your choice types are modeled in the following way:
{
"choices": [{
"choiceType": "choice1",
........
},
{
"choiceType": "choice2",
........
}]
}
Now you can use the Map state to iterate over the choices array. Don't forget to pass the expected choiceType to each iteration.
The first state of the Map iterator can be a Choice state which compares choiceType and moves to the appropriate next state. So, basically, the rest of your workflow is modeled as the iterator of the Map state from step 1.
Now, if you don't have control over the DynamoDB table, you can process the query result in an AWS Lambda function instead.
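If you go the Lambda route, a minimal handler sketch could look like the following (the event shape is an assumption: the state machine would need to pass both the GetItem result and the original choice into the Lambda, for example via Parameters):

def lambda_handler(event, context):
    # Expected input, passed in by the state machine (assumed shape):
    # {"Item": {...DynamoDB JSON from getItem...}, "choice": "choice1"}
    item = event['Item']
    choice = event['choice']

    # Walk the DynamoDB JSON: 'choices' is a map (M) keyed by the choice name.
    selected = item['choices']['M'][choice]['M']
    return selected['myvalue']['S']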
I am trying to import a CSV file into Amazon Personalize.
My schema looks like this:
{
"type": "record",
"name": "Items",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "ITEM_ID",
"type": "string"
},
{
"name": "AUTHOR",
"type": "string",
"categorical": true
},
{
"name": "COUNTRY",
"type": "string",
"categorical": true
},
{
"name": "CITY",
"type": "string",
"categorical": true
},
{
"name": "STYLES",
"type": "string",
"categorical": true
},
{
"name": "CATEGORIES",
"type": "string",
"categorical": true
}
],
"version": "1.0"
}
The first few rows of data look like this:
ITEM_ID,AUTHOR,COUNTRY,CITY,STYLES,CATEGORIES
5b4253a7e12434f55875381e,5acd193f48ed4b9b3add5be6,US,city_us_austin,5ad45bc575eb016f3cdb562b|571aa21888a4fd9934f0fd7b|571aa21888a4fd9934f0fd79|5ad45e8c75eb016f3cdb563f|5b4ea35abaa12285687a1f47,593a866a082c26444eab2d3c|5a8e4820fc112d414fbc1be3
5b4253a7e12434f55875381f,5acd193f48ed4b9b3add5be6,US,city_us_jackson,571aa21888a4fd9934f0fd82|57600e419e4959cd069658eb|5ad45c3a75eb016f3cdb5631|571aa21888a4fd9934f0fd7b|57aaa7094a393f531ace43f0|575e6d8e34ca56f742bea1c8|571aa21888a4fd9934f0fd8f,593a866a082c26444eab2d3c|5a8e4820fc112d414fbc1be3
I get the error
Failed to create a data import job for item dataset.
Input csv has rows that do not conform to the dataset schema. Please ensure all required data fields are present and that they are of the type specified in the schema.
How can I figure out what is wrong with the CSV? It's thousands of lines long, so I have no idea whether it's a general mistake or something wrong on a specific line.
In my experience, so long as the dataset is not >250 thousand records, you can still use Excel to check the data utilizing data filters and corresponding search functions. If it's more than that, look into using Notepad++ and RegEx. Your problem may be one of the following things:
(1) There's a missing comma. This would misalign your data and keep it from being processed.
(2) There's a missing ITEM_ID value. For Items, Personalize requires ITEM_ID and at least one metadata field. It might give this error if there is an instance where you are missing ITEM_ID or have ITEM_ID but no other metadata field values.
(3) STYLES and/or CATEGORIES exceeds 256 characters. There is probably a limit on string length, but I can't get a clear answer on this from the developer's guide. I would guess it's 256 characters. If I were betting money, this would be my guess on your problem. A short script that flags which rows trip any of these checks is sketched below.
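Here is a minimal validation sketch along those lines (the file name is a placeholder, and the 256-character cap is just the guess from point (3); adjust both as needed):

import csv

MAX_LEN = 256          # guessed limit from point (3); adjust if the real limit differs
EXPECTED_COLUMNS = 6   # ITEM_ID, AUTHOR, COUNTRY, CITY, STYLES, CATEGORIES

with open('items.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    for line_no, row in enumerate(reader, start=2):
        # (1) a missing comma shows up as the wrong number of columns
        if len(row) != EXPECTED_COLUMNS:
            print(f'line {line_no}: expected {EXPECTED_COLUMNS} columns, got {len(row)}')
            continue
        # (2) missing ITEM_ID
        if not row[0].strip():
            print(f'line {line_no}: missing ITEM_ID')
        # (3) over-long field values
        for col_name, value in zip(header, row):
            if len(value) > MAX_LEN:
                print(f'line {line_no}: {col_name} is {len(value)} characters long')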
Here is a different approach to solving the problem; maybe it will be useful for other cases. I had the same issue, but when dealing with int columns that have null values. Pandas by default converts such columns to the float data type, which the AWS Personalize dataset import job will not accept if you have defined these columns as int or long. Long story short, converting these columns to a nullable int type solves the problem:
df.column_name = df.column_name.astype(pd.Int32Dtype())
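For context, a minimal sketch of that fix applied before re-exporting the CSV (the file and column names are placeholders):

import pandas as pd

df = pd.read_csv('items.csv')
# Nullable integer dtype keeps the null values without silently converting the column to float.
df['column_name'] = df['column_name'].astype(pd.Int32Dtype())
df.to_csv('items_fixed.csv', index=False)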
I am trying to query an elasticsearch index in AWS to get all entries with a mass attribute greater than 1000, the datatype for the attribute is Long.
I found the range query and have tried it (see example below), but it returns nothing. When I use other queries, they do return entries with mass greater than 1000, so they're definitely in the index.
This is the Range query I'm trying:
{
"method": "POST",
"index": "users",
"type": "user",
"path": "_search?filter_path=filter",
"body": {
"size": 20,
"from": 0,
"query": {
"bool": {
"must":[{
"range": {
"mass": {
"gte": 1000
}
}
}]
}
}
}
}
I'm not getting any error messages, just zero hits.
The problem that's causing you to get zero hits is the filter_path parameter you specify in
"path": "_search?filter_path=filter"
As stated in the official documentation, the filter_path parameter is part of the common options for the REST APIs. That means you can always add that parameter.
With Response Filtering you can reduce the response returned by Elasticsearch. Since you defined
_search?filter_path=filter
you probably get zero hits because there is no "filter" element in the response that could be returned.
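For reference, here is a minimal sketch of the same query with the filter_path parameter dropped, using the Python requests library against a placeholder endpoint (alternatively, keep response filtering but point it at the hits, e.g. filter_path=hits.hits._source):

import json
import requests

ES_URL = 'https://your-es-domain.example.com'  # placeholder endpoint

query = {
    "size": 20,
    "from": 0,
    "query": {
        "bool": {
            "must": [
                {"range": {"mass": {"gte": 1000}}}
            ]
        }
    }
}

# No filter_path parameter, so the full hits section is returned.
resp = requests.post(f'{ES_URL}/users/_search', json=query)
print(json.dumps(resp.json(), indent=2))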
I have documents in CouchDB (v. 2.1.1) as follows:
{
"xyz": "a",
"abc": "def"
},
{
"xyz": "a",
"ghi": "jkl"
},
{
"xyz": "a",
"mno": "pqr"
},
{
"xyz": "a",
"stu": "vwx"
},
{
"xyz": "a",
"bcd": 1000
}
If I run a simple map function, for example:
function (doc) {
  if (doc.xyz) {
    emit(doc.xyz, doc.abc);
  }
}
I get:
{
"id": "4c3406a1d92942b4fb10d1314e0061a9",
"key": "a",
"value": "def"
},
{
"id": "4c3406a1d92942b4fb10d1314e006ccf",
"key": "a",
"value": null
},
{
"id": "4c3406a1d92942b4fb10d1314e00787f",
"key": "a",
"value": null
},
{
"id": "4c3406a1d92942b4fb10d1314e00871e",
"key": "a",
"value": null
},
{
"id": "4c3406a1d92942b4fb10d1314e00906a",
"key": "a",
"value": null
}
I want to try and eliminate the 'null' outputs.
I am looking at having a CouchDB database with many small documents containing small snippets of information rather than having larger documents containing much more information per document.
My question is, is my document design a good one and if so how do I get just what I am looking for rather than rows of 'nulls'. If my storage design is not ideal, what kind of design should I be looking at to simplify the output given my plan to have many small 'docs'.
EDIT:
Having looked at possible answers, I have decided that having numerous small documents as I described in my question is not giving me the kind of benefit I imagined they would.
I was unable to get a satisfactory map function that produces readable output.
However, I investigated the 'Mango' query system available in recent releases of CouchDB, and using these queries I was able to get acceptable output from a database like the one supplied above.
This is what I did:
curl -X POST http://admin:123@127.0.0.1:5984/ptn/_find -d '{"selector": {"$or": [{"abc": {"$gt": null}},{"ghi": {"$gt": null}}]},"fields": ["abc","ghi"]}' -H "Content-Type:application/json"
Un-minified:
{
"selector": {
"$or": [
{
"abc": {
"$gt": null
}
},
{
"ghi": {
"$gt": null
}
}
]
},
"fields": [
"abc",
"ghi"
]
}
The output:
{"docs":[
{"abc":"def"},
{"ghi":"jkl"}
]
.....
A concise answer.
Sorting can be done but sorted fields must be indexed. Indexing is in any case advised for larger data sets.
Reference:
http://docs.couchdb.org/en/2.1.1/api/database/find.html
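As a small sketch of that point (the database, credentials, and field names are just the ones from the example above; a local CouchDB is assumed), creating a Mango index and then running a sorted _find with the Python requests library:

import requests

BASE = 'http://admin:123@127.0.0.1:5984/ptn'

# Create a Mango (JSON) index on the field we want to sort by.
requests.post(f'{BASE}/_index', json={
    'index': {'fields': ['abc']},
    'name': 'abc-index',
    'type': 'json',
})

# The sort clause is only accepted because the field is indexed.
resp = requests.post(f'{BASE}/_find', json={
    'selector': {'abc': {'$gt': None}},
    'fields': ['abc', 'ghi'],
    'sort': [{'abc': 'asc'}],
})
print(resp.json())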
As my question asked for a map function, this perhaps cannot be regarded as a valid answer, but for me it is an answer. I have tried the 'Mango' query system a little on other databases, and it seems to be more useful and powerful than I thought it was, although it offers no means of totalling, etc.