I have a custom multi-threaded HTTP server (in C++) that:
1) accepts POSTs of media TS segments from different video streams,
2) archives a batch of them into a zip file (on disk), and
3) uploads the zip file to a preconfigured origin/content server (say, after 100 segments have been archived into a zip).
The problem is that each zip file is around ~100 MB, and with a high rate of client POSTs (150 per second), uploading these zip files overwhelms vsize/RSS and crashes the process, since an upload currently requires reading the whole zip file into memory.
Is there a memory-aware/memory-efficient way to let the upload threads achieve maximum throughput without exhausting memory?
Some kind of dynamic rate limiting, perhaps, so that too many concurrent uploads don't drive vsize up?
Platform is Linux (Ubuntu 14.04).
Answering my own question: these strategies worked for me:
1. Limit the number of concurrent uploads.
2. Read each zip file in chunks and stream the chunks over the open HTTP connection (see the sketch below).
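To make that concrete: the server itself is C++, but the pattern is easier to show compactly in Java; the same idea maps onto a capped upload thread pool plus libcurl's read-callback streaming upload in C++. The endpoint URL, chunk size, and concurrency cap below are illustrative assumptions, not values from the original setup. A Semaphore bounds how many uploads run at once, and chunked streaming keeps only one small buffer per upload resident instead of the whole ~100 MB zip.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.Semaphore;

public class BoundedChunkedUploader {
    // Assumption: allow at most 4 zip uploads in flight so RSS stays bounded.
    private static final Semaphore UPLOAD_SLOTS = new Semaphore(4);
    private static final int CHUNK_SIZE = 256 * 1024; // 256 KB read buffer

    public static void upload(String zipPath, String originUrl) throws Exception {
        UPLOAD_SLOTS.acquire();               // block until an upload slot is free
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(originUrl).openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            // Chunked transfer encoding: the file is streamed, never fully buffered in memory.
            conn.setChunkedStreamingMode(CHUNK_SIZE);

            byte[] buf = new byte[CHUNK_SIZE];
            try (InputStream in = new FileInputStream(zipPath);
                 OutputStream out = conn.getOutputStream()) {
                int n;
                while ((n = in.read(buf)) > 0) {
                    out.write(buf, 0, n);     // only CHUNK_SIZE bytes resident per upload
                }
            }
            if (conn.getResponseCode() / 100 != 2) {
                throw new RuntimeException("Upload failed: HTTP " + conn.getResponseCode());
            }
        } finally {
            UPLOAD_SLOTS.release();           // free the slot for the next zip
        }
    }
}
```

With a cap of 4 uploads and a 256 KB buffer, upload memory stays around 1 MB regardless of how many zips are queued on disk.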
Related
Suppose I have a 1 GB file and an upload speed of 1 MB/s. Whether I upload the whole file at once or upload it in parts, the total upload time would not improve, since my upload speed is the upper bound. Am I right, or is there some other scenario?
You are right. In your scenario there is little to gain from parallel multipart uploads.
The recommendation to use parallel multipart uploads applies when you have more bandwidth available on your end than a single S3 connection will use.
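For context, this is roughly what a parallel multipart upload looks like with the AWS SDK for Java's TransferManager (the bucket, key, and file path are made-up placeholders). It splits a large file into parts and uploads them on multiple threads, which only pays off when your uplink has bandwidth to spare beyond what one S3 connection uses.

```java
import java.io.File;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

public class ParallelMultipartUpload {
    public static void main(String[] args) throws InterruptedException {
        // TransferManager splits large files into parts and uploads them concurrently.
        TransferManager tm = TransferManagerBuilder.standard().build();
        try {
            Upload upload = tm.upload("my-example-bucket", "uploads/file.bin",
                                      new File("/tmp/file.bin"));
            upload.waitForCompletion();  // blocks until every part has been uploaded
        } finally {
            tm.shutdownNow();            // releases the underlying thread pool
        }
    }
}
```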
I have a virtual machine disk and need to host it for distribution within my company. What's the best way to do so?
Can I put it on AWS S3?
Yes, you can host your file on S3. However, there are several factors to consider.
Cost:
Data transfer out (download) pricing on S3 is approximately $0.09 per GB, which translates to about $4.50 for each download of a roughly 50 GB image.
Transfer Reliability:
Downloading a 50 GB file may be problematic for some customers. I would use a zip tool that supports split archives and break the file into 2-5 GB parts to make downloading easier. You don't want to pay for a 47 GB download that fails and forces the user to start over.
Security:
If you make the download file public, anyone can download it and you pay for the transfer. Use presigned URLs or signed cookies to control who can download the file. Since the file is for internal company use, you can also restrict access to your company's IP (CIDR) block ranges.
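For example, a presigned URL with a short expiry can be generated with the AWS SDK for Java; the bucket and key below are placeholders. Only people you hand the URL to, and only within the validity window, can download the object.

```java
import java.util.Date;
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

public class PresignedDownloadLink {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Link is valid for 24 hours; after that the URL stops working.
        Date expiration = new Date(System.currentTimeMillis() + 24L * 60 * 60 * 1000);

        GeneratePresignedUrlRequest request =
                new GeneratePresignedUrlRequest("my-internal-bucket", "images/vm-disk.zip")
                        .withMethod(HttpMethod.GET)
                        .withExpiration(expiration);

        System.out.println(s3.generatePresignedUrl(request));
    }
}
```

The IP/CIDR restriction itself is applied separately, through a bucket policy.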
Performance:
Where are your users located? This will help you determine where to host the bucket(s) for downloads. You might consider CloudFront, but a file this large won't be cached unless you split it into smaller pieces.
I am trying to upload a .bak file (24 GB) to Amazon S3 using the multipart upload low-level API approach in Java. I was able to upload the file successfully, but it took around 7-8 hours. I want to know the average/ideal time to upload a file of this size: is the time it took expected, or can it be improved? If there is scope for improvement, what could the approach be?
If you are using the default settings of TransferManager, the DEFAULT_MINIMUM_UPLOAD_PART_SIZE for multipart uploads is 5 MB, which is too low for a 24 GB file. This essentially means you'll end up with thousands of small parts uploaded to S3. Since each part is uploaded by a different worker thread, your application will spend too much time on per-part network communication, which will not give you optimal upload speed.
Increase the minimum upload part size to somewhere between 100 MB and 500 MB, using this setting: setMinimumUploadPartSize.
Official documentation for setMinimumUploadPartSize:
Decreasing the minimum part size will cause multipart uploads to be split into a larger number of smaller parts. Setting this value too low can have a negative effect on transfer speeds since it will cause extra latency and network communication for each part.
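A minimal sketch of that tuning with TransferManagerBuilder; the bucket name and file paths are placeholders, and the 100 MB part size follows the suggestion above.

```java
import java.io.File;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

public class TunedMultipartUpload {
    public static void main(String[] args) throws InterruptedException {
        // Raise the minimum part size from the 5 MB default to 100 MB,
        // so a 24 GB file becomes ~240 parts instead of thousands.
        TransferManager tm = TransferManagerBuilder.standard()
                .withMinimumUploadPartSize(100L * 1024 * 1024)
                .build();
        try {
            Upload upload = tm.upload("my-backup-bucket", "db/backup.bak",
                                      new File("/data/backup.bak"));
            upload.waitForCompletion();
        } finally {
            tm.shutdownNow();
        }
    }
}
```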
If you are currently on the default settings, I am certain you'll see an improvement in upload throughput by tuning this. Let me know if it helps.
Happy Uploading !!!
I am trying to determine the ideal size for a file stored in S3 that will be used in Hadoop jobs on EMR.
Currently I have large text files of around 5-10 GB. I am worried about the delay in copying these large files to HDFS to run MapReduce jobs. I have the option of making these files smaller.
I know S3 files are copied to HDFS in parallel when S3 is used as an input directory in MapReduce jobs. But will a single large file be copied to HDFS using a single thread, or will it be copied as multiple parts in parallel? Also, does gzip compression affect whether a single file can be copied in multiple parts?
There are two factors to consider:
GZIP-compressed files cannot be split between tasks. For example, if you have a single, large, compressed input file, only one mapper can read it.
Using more, smaller files makes parallel processing easier, but there is extra overhead when starting the map/reduce tasks for each file. So, fewer, larger files are faster.
Thus, there is a trade-off between the size and quantity of files. The recommended size is listed in a few places:
The Amazon EMR FAQ recommends:
If you are using GZIP, keep your file size to 1–2 GB because GZIP files cannot be split.
The Best Practices for Amazon EMR whitepaper recommends:
That means that a single mapper (a single thread) is responsible for fetching the data from Amazon S3. Since a single thread is limited to how much data it can pull from Amazon S3 at any given time (throughput), the process of reading the entire file from Amazon S3 into the mapper becomes the bottleneck in your data processing workflow. On the other hand, if your data files can be split, more than a single mapper can process your file. The suitable size for such data files is between 2 GB and 4 GB.
The main goal is to keep all of your nodes busy by processing as many files in parallel as possible, without introducing too much overhead.
Oh, and keep using compression. The savings in disk space and data transfer time make it more advantageous than enabling splitting.
I have lots (10 million) of files (some 20K folders, each with about 500 files) on a 1 TB EC2 EBS volume.
I'd like to download them to my PC. How would I do that most efficiently?
Currently I'm using rsync, but it's taking ages (about 3 MB/s, when my ISP connection is 10 MB/s).
Maybe I should use some tool to send it to S3 and then download it from there?
How would I do that, while preserving the directory structure?
The most efficient way would be to get a disk/drive shipped there and back. Even today, for large sizes (>= 1 TB), snail mail is the fastest and most efficient way to send data back and forth:
http://aws.amazon.com/importexport/
S3 and parallel HTTP downloads can help, but you can also use other download-acceleration tools directly from your EC2 instance, such as Tsunami UDP or Aspera.
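If you do stage the tree through S3, one way to preserve the directory structure is the Java SDK's TransferManager.uploadDirectory, which mirrors the local folder hierarchy into object key prefixes (the bucket name and paths below are placeholders); on your PC you can then pull everything down in parallel with any S3 client.

```java
import java.io.File;
import com.amazonaws.services.s3.transfer.MultipleFileUpload;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

public class DirectoryToS3 {
    public static void main(String[] args) throws InterruptedException {
        TransferManager tm = TransferManagerBuilder.standard().build();
        try {
            // Uploads every file under /data/export, keeping relative paths as key prefixes,
            // e.g. /data/export/folder01/file001 -> s3://my-archive-bucket/export/folder01/file001
            MultipleFileUpload upload = tm.uploadDirectory(
                    "my-archive-bucket",     // bucket
                    "export",                // key prefix in the bucket
                    new File("/data/export"),
                    true);                   // include subdirectories
            upload.waitForCompletion();
        } finally {
            tm.shutdownNow();
        }
    }
}
```

Be aware that with 10 million small files, per-object request overhead dominates, so batching folders into archives before uploading may still be worthwhile.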