TextSizeLimitExceededException when calling the DetectPiiEntities operation

TextSizeLimitExceededException when calling the DetectPiiEntities operation - amazon-web-services

I am using aws comprehend for PII redaction, Idea is to detect entities and then redact PII from it.
Now the problem is this API has a Input text size limit. How can I increase the limit ?? Maybe to 1 MB ?? Or is there any other way to detect entities for large text.
ERROR: botocore.errorfactory.TextSizeLimitExceededException: An error occurred (TextSizeLimitExceededException) when calling the DetectPiiEntities operation: Input text size exceeds limit. Max length of request text allowed is 5000 bytes while in this request the text size is 7776 bytes

There's no way to increase this limit.
For input text greater than 5000 bytes, you can split the text into multiple chunks of 5000 bytes each and then aggregate the results back.
Please do mind that you keep some overlap between different chunks, to carry over some context from previous chunk.
For reference you can use similar solution exposed by Comprehend team itself . https://github.com/aws-samples/amazon-comprehend-s3-object-lambda-functions/blob/main/src/processors.py#L172

Related

S3 Select Result/Response size

AWS Documentation mentions: The maximum length of a record in the input or result is 1 MB. https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html
However, I'm even able to fetch 2.4GB result on running an S3 Select query through a python lambda, and have seen people working with even more huge result size.
Can someone please highlight the significance of 1 MB mentioned in AWS documentation and what does it mean?

Background:
I recently faced the same question regarding the 1 MB limit. I'm dealing with a large gzip compressed csv file and had to figure out, if S3 Select would be an alternative to processing the file myself. My research makes me feel the author of the previous answer misunderstood the question.
The 1 MB limit referenced by the current AWS S3 Select documentation is referring to the record size:
... The maximum length of a record in the input or result is 1 MB.
The SQL Query is not the input (it has a lower limit though):
... The maximum length of a SQL expression is 256 KB.
Question Response:
I interpret this 1 MB limit the following way:
One row in the queried CSV file (uncompressed input) can't use more than 1 MB of memory
One result record (result row returned by S3 select) also can't use more than 1 MB of memory
To put this in a practical perspective, the following questions discussed the string size in bytes for Python. I'm using an UTF-8 encoding.
This means len(row.encode('utf-8')) (string size in bytes) <= 1024 * 1024 bytes for each csv row represented as UTF-8 encoded string of the input file.
It again means len(response_json.encode('utf-8')) <= 1024 * 1024 bytes for each returned response record (in my case the JSON result).
Note:
In my case, the 1 MB limit works fine. However, this depends a lot on the amount of data in your input (and potentially extra, static columns you might add via SQL).
If the limit 1MB is exceeded and you want to query files without a data base solution involved, using the more expensive AWS Athena might be a solution.

Could you point us to part of documentation which talking about this 1mb?
I have never seen 1 MB limit. Downloading of object is just downloading, and you can download almost unlimited file.
AWS Uplaods files with multipart upload and it has limits up to Terabytes for object and up to Gigabytes for objects part
Docs is here: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
Response to the question
As per comment of author below my post:
Limit described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/querying-glacier-archives.html
This docs refers to query for archived objects. So you can do some query on data, without collecting it from the Glacier.
And input query cannot exceed 1MB. Output of that query cannot exceed 1MB.
Input is SQL query
Output is files list.
Find more info here: https://docs.aws.amazon.com/amazonglacier/latest/dev/s3-glacier-select-sql-reference-select.html
So this limit is not for files but for SQL-like queries.

How can DynamoDB's BatchWriteItem API call write more than 10MB of data, given that it can write a maximum of 25 items with 400KB/item?

According to the official documentation: "A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB." (https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html)
But 25 put requests * 400KB per put request = 10MB. How then is the limit 16MB? Under what circumstances could the total ever exceed 10MB? Purely asking out of curiosity.

Actually I have also had the same doubt. Searched for this so much but found a decent explanation which I posted here (Don't know whether it is correct or not but I hope it gives you some intuition).
The 16MB limit applies to the request size - ie, the raw data going over the network. Can be quite different from what is actually stored and metered as throughput. I was able to hit this 16MB request size cap with a BatchWriteItem containing 25 PutItems of around 224kB
Also once head over to this link. This might help.

Step function exceeding the maximum number of characters service limit

My state in a step function flow returns an error of state/task returned a result with a size exceeding the maximum number of characters service limit.. In the step function documentation, the limit for characters for input/output is 32,768 characters. Upon checking the total characters of my result data if falls below the limit. Are there any other scenarios that it will throw that error? Thanks!

2020-09-29 Edit: Step Functions now supports 256KB payloads!
256KB is the maximum size of the payload that can be passed between states. You could also exceed this limit from a Map or Parallel state, whose final output is an array with the output of each iteration or branch.
https://aws.amazon.com/about-aws/whats-new/2020/09/aws-step-functions-increases-payload-size-to-256kb
The recommended solution from the Step Functions documentation is to store the data somewhere else (e.g. S3) and pass around the ARN instead of raw JSON.
https://docs.aws.amazon.com/step-functions/latest/dg/avoid-exec-failures.html
You can also use OutputPath to reduce the output to the fields you want to pass to the next state.
https://docs.aws.amazon.com/step-functions/latest/dg/input-output-outputpath.html

Check that textInput does not exceed 256 (dialogflow)

I send a query to dialogflow from python api and I got the error:
Text length must not exceed 256 bytes.
I calculate lenght of my text like this :
l=len(Text)
But I still get the error that my texts exceed 256.
So I want to khow how to check that my text doesn't exceed 256?

This (len(Text.encode('utf-8'))) should work better than just len(Text).
Good luck!

Efficiently read data from a structured file in C/C++

I have a file as follows:
The file consists of 2 parts: header and data.
The data part is separated into equally sized pages. Each page holds data for a specific metric. Multiple pages (needs not to be consecutive) might be needed to hold data for a single metric. Each page consists of a page header and a page body. A page header has a field called "Next page" that is the index of the next page that holds data for the same metric. A page body holds real data. All pages have the same & fixed size (20 bytes for header and 800 bytes for body (if data amount is less than 800 bytes, 0 will be filled)).
The header part consists of 20,000 elements, each element has information about a specific metric (point 1 -> point 20000). An element has a field called "first page" that is actually index of the first page holding data for the metric.
The file can be up to 10 GB.
Requirement: Re-order data of the file in the shortest time, that is, pages holding data for a single metric must be consecutive, and from metric 1 to metric 20000 according to alphabet order (header part must be updated accordingly).
An apparent approach: For each metric, read all data for the metric (page by page), write data to new file. But this takes much time, especially when reading data from the file.
Is there any efficient ways?

One possible solution is to create an index from the file, containing the page number and the page metric that you need to sort on. Create this index as an array, so that the first entry (index 0) corresponds to the first page, the second entry (index 1) the second page, etc.
Then you sort the index using the metric specified.
When sorted, you end up with a new array which contains a new first, second etc. entries, and you read the input file writing to the output file in the order of the sorted index.

An apparent approach: For each metric, read all data for the metric (page by page), write data to new file. But this takes much time, especially when reading data from the file.
Is there any efficient ways?
Yes. After you get a working solution, measure it's efficiency, then decide which parts you wish to optimize. What and how you optimize will depend greatly on what results you get here (what are your bottlenecks).
A few generic things to consider:
if you have one set of steps that read data for a single metric and move it to the output, you should be able to parallelize that (have 20 sets of steps instead of one).
a 10Gb file will take a bit to process regardless of what hardware you run your code on (concievably, you could run it on a supercomputer but I am ignoring that case). You / your client may accept a slower solution if it displays it's progress / shows a progress bar.
do not use string comparisons for sorting;
Edit (addressing comment)
Consider performing the read as follows:
create a list of block offset for the blocks you want to read
create a list of worker threads, of fixed size (for example, 10 workers)
each idle worker will receive the file name and a block offset, then create a std::ifstream instance on the file, read the block, and return it to a receiving object (and then, request another block number, if any are left).
read pages should be passed to a central structure that manages/stores pages.
Also consider managing the memory for the blocks separately (for example, allocate chunks of multiple blocks preemptively, when you know the number of blocks to be read).

I first read header part, then sort metrics in alphabetic order. For each metric in the sorted list I read all data from the input file and write to the output file. To remove bottlenecks at reading data step, I used memory mapping. The results showed that when using memory mapping the execution time for an input file of 5 GB was reduced 5 ~ 6 times compared with when not using memory mapping. This way temporarily solve my problems. However, I will also consider suggestions of #utnapistim.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js