Is AWS CloudSearch Scalable?

I have 500 MB worth of data to push to CloudSearch.
Here are the options I have tried:
Upload directly from the console:
Tried to upload the file; there is a 5 MB limitation.
Upload to S3 and give the S3 URL in the console:
Uploaded the file to S3 and selected the S3 option; this fails and asks to try the command line.
Tried with the command line:
aws cloudsearchdomain upload-documents --endpoint-url http://endpoint
--content-type application/json --documents s3://bucket/cs.json
Error parsing parameter '--documents': Blob values must be a path to a file.
OK, so I copied the file from S3 to local and tried to upload it.
Tried with a local file and the CLI:
aws cloudsearchdomain upload-documents --endpoint-url http://endpoint
--content-type application/json --documents ./cs.json
Connection was closed before we received a valid response from endpoint URL: "http://endpoint/2013-01-01/documents/batch?format=sdk".
Any way to get CloudSearch to work?

As I understand it, this question is not really about the scalability of CloudSearch (as the title suggests), but about upload limitations and how to get a large file into Amazon CloudSearch.
The best solution is to chunk the data: break your documents into batches and upload them batch by batch (but keep the associated limits in mind).
The advantage is that, if you have multiple documents to submit, you submit them together in a single call rather than always sending batches of size 1. AWS recommends grouping documents (up to 5 MB) and sending them in one call. Every 1,000 batch calls costs you $0.10, I think, so grouping also saves you some money.
This worked for me. Below are a few guidelines to help tackle the problem better.
Guidelines to follow when uploading data into Amazon Cloudsearch.
Group documents into batches before you upload them. Continuously uploading batches that consist of only one document has a huge, negative impact on the speed at which Amazon CloudSearch can process your updates. Instead, create batches that are as close to the limit as possible and upload them less frequently. (The limits are explained below)
To upload data to your domain, it must be formatted as a valid JSON or XML batch.
Now, let me explain the limitations Amazon CloudSearch imposes on file uploads.
1) Batch Size:
The maximum batch size is 5 MB
2) Document size
The maximum document size is 1 MB
3) Document fields
Documents can have no more than 200 fields
4) Data loading volume
You can load one document batch every 10 seconds (approximately 10,000 batches every 24 hours), with each batch up to 5 MB.
If you wish to increase the limits, you can contact Amazon CloudSearch, but at the moment Amazon does not allow the upload size limits to be increased.
You can submit a request if you need to increase the maximum number of partitions for a search domain. For information about increasing other limits such as the maximum number of search domains, contact Amazon CloudSearch.
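To make the batching concrete, here is a minimal Python (boto3) sketch, assuming the 500 MB file is a JSON array of CloudSearch "add" operations and that you know your domain's document endpoint; the endpoint URL and file name below are placeholders, not values from the question.

import json
import time
import boto3

DOC_ENDPOINT = "https://doc-yourdomain-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com"  # placeholder
MAX_BATCH_BYTES = 5 * 1024 * 1024  # stay under the 5 MB batch limit

client = boto3.client("cloudsearchdomain", endpoint_url=DOC_ENDPOINT)

def send(batch):
    # Each call here is one billable batch upload request.
    client.upload_documents(
        documents=json.dumps(batch).encode("utf-8"),
        contentType="application/json",
    )
    time.sleep(10)  # respect the one-batch-every-10-seconds loading limit

with open("cs.json") as f:
    operations = json.load(f)  # list of {"type": "add", "id": ..., "fields": {...}}

batch, batch_bytes = [], 2  # 2 bytes for the enclosing "[]"
for op in operations:
    op_bytes = len(json.dumps(op).encode("utf-8")) + 1  # +1 for the separating comma
    if batch and batch_bytes + op_bytes > MAX_BATCH_BYTES:
        send(batch)
        batch, batch_bytes = [], 2
    batch.append(op)
    batch_bytes += op_bytes

if batch:
    send(batch)  # flush the final partial batch

With batches close to 5 MB, 500 MB of data comes to roughly 100 upload calls, well within the roughly 10,000 batches allowed per day.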

Related

Pricing criteria of AWS S3

I am confused about the pricing criteria in AWS S3.
Let's assume
SELECT * FROM TABLE => result 1000 rows
Is this command 1000 requests or 1 request?
If you are using S3 Select to query, it is a single request. You are charged per 1,000 SELECT requests, at the same rate as object GET requests.
If you are using Athena to query S3, the charge is based on the amount of data scanned, which depends on how the file is stored (for example, compressed or in Parquet format).
AWS S3 Select charges for the bytes scanned ($0.00200 per GB) and the bytes returned ($0.0007 per GB).
The charges are broken out in your Billing console under Billing > Bills > Details > Simple Storage Service > Region, as "Amazon Simple Storage Service USE2-Select-Returned-Bytes" and "Amazon Simple Storage Service USE2-Select-Scanned-Bytes".
Using LIMIT or WHERE clauses, or having a compressed file, reduces the bytes scanned. The service supports gzipped Parquet files, which is fantastic. I was able to reduce file sizes by 90% and take advantage of the columnar data format rather than CSV.
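To illustrate, here is a minimal boto3 sketch of a single S3 Select call; the bucket, key, and query are placeholders. However many rows come back, this counts as one SELECT request, billed for the bytes scanned and the bytes returned.

import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="my-bucket",                      # placeholder bucket
    Key="data/table.parquet",                # placeholder object key
    ExpressionType="SQL",
    Expression="SELECT * FROM S3Object s LIMIT 1000",
    InputSerialization={"Parquet": {}},      # Parquet keeps the scanned bytes low
    OutputSerialization={"JSON": {}},
)

# The result arrives as an event stream; the 1,000 rows still belong to the
# single request above.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")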

AWS size limit on Lambda output to be written to S3

I have to create a Lambda that processes some payload and creates an output greater than the 6 MB limit for the response payload.
A way to solve this problem, mentioned in various SO answers, is to write the file directly to S3.
But what these answers fail to mention is the upper limit on the output that the Lambda can save to S3. Is that because there isn't any limit?
I just want to confirm this before moving forward.
There are always limits. So yes, there is also a limit on object size in an S3 bucket, but before you hit that limit you are going to hit other limits.
Here is the limit for uploading files using the API:
Using the multipart upload API, you can upload a single large object, up to 5 TB in size.
(Source)
But you are probably not going to achieve this with a Lambda, since Lambdas have a maximum running time of 900 seconds. So even if you could upload at 1 GB/s, you would only be able to upload 900 GB before the Lambda stops.
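To confirm the pattern, here is a minimal Python Lambda sketch that writes its oversized result to S3 and returns only a small pointer; the bucket name and the produce_large_output helper are hypothetical.

import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    big_result = produce_large_output(event)   # hypothetical helper producing > 6 MB of data

    bucket = "my-results-bucket"               # placeholder
    key = f"results/{context.aws_request_id}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(big_result).encode("utf-8"))
    # For results larger than 5 GB you would stream them with s3.upload_fileobj,
    # which switches to multipart upload automatically.

    # Return only a small pointer to the object, well under the 6 MB response limit.
    return {"bucket": bucket, "key": key}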

Amazon S3 buckets, limit the number of requests per day, limit of size

Currently developing a "social network"-like website, I plan to use an Amazon S3 bucket to store users' profile pictures. Does anyone know whether there is an easy way to set:
the maximum number of requests per day a bucket can accept
the maximum size that the bucket can reach
the maximum size a given picture must not exceed to be accepted
My fear is that a given user would upload such a big picture (or such a big number of pictures, or create such a high number of profiles) that my Amazon S3 bill would skyrocket...
(I know that some 'a posteriori' alerts can be set, but I'm looking for a solution that would prevent this situation.)
Any help greatly appreciated!
There is no such configuration in S3 to limit the upload requests per day or the maximum size a bucket can reach.
If you are concerned about large photo uploads, why don't you simply check on the front-end side? Using the web File API, you can get the selected file's size and alert the user when it exceeds the limit.
the maximum number of requests per day a bucket can accept
S3 allows 3,500 PUT/COPY/POST/DELETE requests per second per prefix. So, say your S3 bucket is "pics" and its keys are prefixed by date (yyyy/mm/dd).
Then, for each of the prefixes pics/2020/01/01, pics/2020/01/02, pics/2020/01/03, and so on, you will have the above-mentioned throughput available.
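For example, a date-based key layout like the one described above could look like the following sketch; the bucket name "pics" follows the example, and the helper itself is hypothetical.

from datetime import date
import boto3

s3 = boto3.client("s3")

def upload_profile_pic(local_path, filename):
    # Keys such as 2020/01/01/avatar.jpg spread writes across date prefixes,
    # and each prefix supports roughly 3,500 write requests per second.
    key = f"{date.today():%Y/%m/%d}/{filename}"
    s3.upload_file(local_path, "pics", key)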
the maximum size that the bucket can reach
Buckets have no upper size limit.
the maximum size a given picture must not exceed to be accepted
S3 has a per-object limit of 5 TB, so a user cannot upload a picture larger than 5 TB. Obviously, that's not the limit you were expecting.
To enforce your custom limits, custom code has to be written.
May I suggest triggering an AWS Lambda whenever an upload occurs. That Lambda can validate the size of the picture, and can even compress it (to save on S3 storage cost) before writing it back to your S3 bucket.
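Here is a minimal sketch of such a Lambda, assuming it is subscribed to the bucket's ObjectCreated events; the 2 MB cap and the delete-on-violation behaviour are assumptions for illustration.

import boto3

s3 = boto3.client("s3")
MAX_BYTES = 2 * 1024 * 1024        # example cap: 2 MB per profile picture (assumption)

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"]["size"]

        # Reject oversized uploads by deleting them; you could instead move them
        # to a quarantine prefix or notify the user.
        if size > MAX_BYTES:
            s3.delete_object(Bucket=bucket, Key=key)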

How to post a single document to Amazon CloudSearch with an endpoint other than documents/batch?

I referred to the Amazon CloudSearch documentation and found one (HTTP) endpoint which ends with "documents/batch". This endpoint is working fine.
Now my doubt is: is this the endpoint to use for both batch and single uploads, or is there another endpoint for uploading a single document? I have not found any way to upload documents to Amazon CloudSearch one by one, other than in a batch.
Calling a batch is more costly.
Please share your views.
Let me give you a precise answer by breaking your question into sections.
QUESTION 1: Is this the endpoint to use for both batch and single uploads, or is there any other endpoint for single document upload?
There is only one endpoint provided by AWS CloudSearch for document uploads. Any CloudSearch domain has three main values:
Search Endpoint: to query
Document Endpoint: to upload
Domain ARN: resource name of the component
QUESTION 2: I have not found any other way to upload documents to CloudSearch (one by one) other than in a batch.
There is no other way to upload a single document. Don't be confused by this endpoint: it is an endpoint for uploading data in batches, but the batches can be of any size, including size 1 (just please be mindful of the limitations).
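Here is a minimal boto3 sketch of a batch that contains exactly one document, sent through the same documents/batch endpoint; the endpoint URL and the field names are placeholders.

import json
import boto3

DOC_ENDPOINT = "https://doc-yourdomain-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com"  # placeholder
client = boto3.client("cloudsearchdomain", endpoint_url=DOC_ENDPOINT)

single_document_batch = [
    {
        "type": "add",
        "id": "doc-1",
        "fields": {"title": "Hello", "body": "A single document wrapped in a batch"},
    }
]

client.upload_documents(
    documents=json.dumps(single_document_batch).encode("utf-8"),
    contentType="application/json",
)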
QUESTION 3: Is calling a batch more costly?
AWS mentions this in their documentation: you are billed for the total number of document batches uploaded to your search domain. Uploaded documents are automatically indexed.
As you can see, the batch size is irrelevant to the pricing model. It's always better to upload more documents in one batch (one call), but keep the limitations in mind.
$0.10 per 1,000 Batch Upload Requests (the maximum size for each batch is 5 MB)
Read more about CloudSearch pricing.

AWS S3 TransferUtility.UploadDirectoryRequest and HTTP PUT Limit

Here is a little background. I have designed a Web API which provides methods for cloud operations (upload, download, etc.). This API internally calls AWS API methods to fulfill these cloud operation requests.
WebDev solution --> WEB API --> AWS API
I am trying to upload a directory to AWS S3. This directory has large individual files of more than 5 GB each. Amazon S3 has a limit of 5 GB for a single PUT operation, but it provides a multipart upload mechanism with which files of up to 5 TB can be uploaded.
The AWS documentation says the TransferUtility.UploadDirectory method will use the multipart upload mechanism to upload large files. So in my Web API, a [PUT] UploadDirectoryMethod method calls TransferUtility.UploadDirectory to upload the directory.
I am receiving the error "Amazon.S3.AmazonS3Exception: Your proposed upload exceeds the maximum allowed size".
Shouldn't TransferUtility.UploadDirectory take care of breaking the larger objects (larger than 5 GB) into parts and then uploading them with multiple PUT(?) operations?
How does multipart upload work internally for objects more than 5 GB in size? Does it create multiple PUT requests internally?
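For comparison only, here is how the same mechanism looks in Python with boto3 rather than the .NET TransferUtility; it is a sketch of how a transfer utility switches to multipart uploads above a size threshold, not the .NET SDK's API, and the bucket and file names are placeholders.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,   # switch to multipart above 100 MB
    multipart_chunksize=100 * 1024 * 1024,   # each part is sent as a separate UploadPart request
)

# A 6 GB file is split into roughly 60 parts, so no single request exceeds the
# 5 GB PUT limit, and the finished object can be up to 5 TB.
s3.upload_file("big-file.bin", "my-bucket", "uploads/big-file.bin", Config=config)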