When copying files through the EG copy task, what is the average download speed? Is it determined by the internet connection speed? I'm asking because I understand it uses FTP, so is there a limitation on the packets sent and received?
Thanks in advance.
I have about 200 GB of files on a Windows server file share that I want to upload to S3. From looking at the documentation, it looks like I have to write my own code against the API to do it.
I'm a networking and server guy - I have no coding experience. Can anyone point me in the right direction for getting started? Has anyone done this before who could give me a high-level overview of the steps involved, so that I can go and research each step? Any info will be greatly appreciated. Thanks.
I have a lot of files (about 10 million, spread across some 20K folders with roughly 500 files each) on a 1 TB EC2 EBS volume.
I'd like to download them to my PC; how would I do that most efficiently?
Currently I'm using rsync, but it's taking ages (about 3 MB/s, when my ISP connection is 10 MB/s).
Maybe I should use some tool to send it to S3 and then download it from there?
How would I do that, while preserving the directory structure?
The most efficient way would be to get a disk/drive shipped there and back. Even today, for large sizes (>= 1 TB), snail mail is the fastest and most efficient way to send data back and forth:
http://aws.amazon.com/importexport/
S3 and parallel HTTP downloads can help, but you can also use other download-acceleration tools directly from your EC2 instance, such as Tsunami UDP or Aspera.
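As a rough illustration of the "parallel HTTP downloads" suggestion, here is a minimal sketch using libcurl's multi interface. The bucket URLs and local filenames are placeholders; real S3 access would typically need public or presigned object URLs, and production code would add error handling.

```c
/* Minimal sketch: download several objects in parallel with libcurl's
 * multi interface. URLs and output names are placeholders.
 * Build with: cc parallel_dl.c -lcurl */
#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
    /* Hypothetical object URLs -- replace with your own. */
    const char *urls[] = {
        "https://my-bucket.s3.amazonaws.com/folder0001/file001",
        "https://my-bucket.s3.amazonaws.com/folder0001/file002",
    };
    const int n = sizeof(urls) / sizeof(urls[0]);
    FILE *files[2];
    CURL *handles[2];

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURLM *multi = curl_multi_init();

    for (int i = 0; i < n; i++) {
        char name[64];
        snprintf(name, sizeof(name), "download_%d.bin", i);
        files[i] = fopen(name, "wb");

        handles[i] = curl_easy_init();
        curl_easy_setopt(handles[i], CURLOPT_URL, urls[i]);
        /* The default write callback writes to the FILE* given here. */
        curl_easy_setopt(handles[i], CURLOPT_WRITEDATA, files[i]);
        curl_multi_add_handle(multi, handles[i]);
    }

    int still_running = 0;
    do {
        curl_multi_perform(multi, &still_running);
        /* Wait up to 1s for activity on any of the transfers. */
        curl_multi_wait(multi, NULL, 0, 1000, NULL);
    } while (still_running);

    for (int i = 0; i < n; i++) {
        curl_multi_remove_handle(multi, handles[i]);
        curl_easy_cleanup(handles[i]);
        fclose(files[i]);
    }
    curl_multi_cleanup(multi);
    curl_global_cleanup();
    return 0;
}
```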
I have an application that does file upload and download. I am also able to limit the upload/download speed to a desired (configurable) level, so that my application does not consume all of the available bandwidth. I achieve this using the libcurl (HTTP) library.
But my question is: if I have to limit my upload speed to, say, 75% of the available upload bandwidth, how do I find out my available upload bandwidth programmatically, preferably in C/C++? If it is pre-configured, I have no issues, but if it has to be learned and adapted each time (like I said, 75% of the available upload limit), I do not know how to figure it out. The same applies to download. Any pointers would be of great help.
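For context, this is roughly what a configurable cap looks like with libcurl's CURLOPT_MAX_SEND_SPEED_LARGE / CURLOPT_MAX_RECV_SPEED_LARGE options. The URL and the 1 MB/s baseline are purely illustrative; the open question above is how to obtain that baseline.

```c
/* Sketch: capping transfer rates with libcurl. The URL and limits are
 * placeholders; real code would derive the caps from a measured or
 * configured bandwidth figure. Build with: cc cap.c -lcurl */
#include <curl/curl.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();

    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/upload");
    /* A real upload would also set CURLOPT_UPLOAD / CURLOPT_READDATA. */

    /* Limits are in bytes per second, passed as curl_off_t values. */
    curl_off_t send_cap = (curl_off_t)(1000000 * 0.75);  /* 75% of 1 MB/s */
    curl_off_t recv_cap = (curl_off_t)(1000000 * 0.75);
    curl_easy_setopt(curl, CURLOPT_MAX_SEND_SPEED_LARGE, send_cap);
    curl_easy_setopt(curl, CURLOPT_MAX_RECV_SPEED_LARGE, recv_cap);

    CURLcode rc = curl_easy_perform(curl);

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```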
There's no way to determine the absolute network capacity between two points on a regular network.
The reason is that traffic can be rerouted along the way, other data streams can appear or disappear, and links can be severed.
What you can do is figure out what bandwidth is available right now. One way to do it is to upload/download a chunk of data (say 1 MB) as fast as possible (no artificial caps) and measure how long it takes. From there you can work out what bandwidth is available at the moment and go from there.
You could periodically measure the bandwidth again to make sure you're not too far off.
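A minimal sketch of that idea with libcurl, assuming a reachable test object of roughly known size at a placeholder URL: fetch it with no rate cap, read back libcurl's measured average speed, and derive the 75% cap from it.

```c
/* Sketch of the "time a test transfer" idea: fetch a ~1 MB test object
 * with no rate cap and read libcurl's measured average download speed.
 * The test URL is a placeholder. Build with: cc probe.c -lcurl */
#include <stdio.h>
#include <curl/curl.h>

/* Discard the downloaded bytes; we only care about the timing. */
static size_t discard_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    (void)ptr; (void)userdata;
    return size * nmemb;
}

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();

    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/1mb-test-file");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, discard_cb);

    if (curl_easy_perform(curl) == CURLE_OK) {
        double bps = 0.0;
        curl_easy_getinfo(curl, CURLINFO_SPEED_DOWNLOAD, &bps); /* bytes/sec */
        printf("Measured download speed: %.0f bytes/sec\n", bps);
        printf("75%% cap would be:       %.0f bytes/sec\n", bps * 0.75);
    }

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```

The same approach works for upload: push a test chunk with CURLOPT_UPLOAD and read CURLINFO_SPEED_UPLOAD instead.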
I have to transfer some files to a third party. We can invent the file format, but we want to keep it simple, like CSV. These won't be big files - a few tens of MB at most - and there won't be many: three files per night.
Our preference for the protocol is sftp. We've done this lots in the past and we understand it well.
Their preference is to do it via a web service/SOAP/https call.
The reason they give is reliability, mainly around knowing that they've fully received the file.
I don't buy this as a killer argument. You can easily build something into your file transfer process over sftp to make sure the transfer has completed, e.g. use headers/footers in the files (see the sketch after this question), move files between directories, etc.
The only other argument I can think of is that over http(s), ports 80/443 will be open, so there might be less firewall work for our infrastructure guys.
Can you think of any other arguments either way on this? Is there a consensus on what would be best practice here?
Thanks in advance.
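To make the headers/footers idea concrete, here is a minimal sketch of a receiver-side trailer check, assuming the sender appends a final "TRAILER,<record count>" line to each CSV. The file path and trailer format are invented for illustration.

```c
/* Sketch of a footer/trailer completeness check: the sender writes a
 * final "TRAILER,<record count>" line, and the receiver only accepts
 * the file if the trailer is present and the count matches. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the CSV ends with a trailer whose count matches the
 * number of non-trailer lines, 0 otherwise. */
static int file_is_complete(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (!fp) return 0;

    char line[4096];
    long data_lines = 0;
    long trailer_count = -1;

    while (fgets(line, sizeof(line), fp)) {
        if (strncmp(line, "TRAILER,", 8) == 0)
            trailer_count = strtol(line + 8, NULL, 10);
        else
            data_lines++;
    }
    fclose(fp);
    return trailer_count >= 0 && trailer_count == data_lines;
}

int main(void)
{
    const char *path = "incoming/export_001.csv";  /* illustrative path */
    printf("%s is %s\n", path,
           file_is_complete(path) ? "complete" : "incomplete");
    return 0;
}
```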
File completeness is a common issue in "managed file transfer". If you went for a compromise "best practice", you'd end up running either AS2 (a web-service-ish way to transfer files that incorporates non-repudiation via signed integrity checks) or AS3 (the same thing over FTP or FTPS).
One of the problems with file integrity and SFTP is that you can't arbitrarily extend the protocol the way you can with FTP and FTPS. In other words, you can't add an XSHA1 command to your SFTP transfer just because you want to.
Yes, there are other workarounds (like transactional files that contain hashes of the files received), but at the end of the day someone's going to have to do some work...and it really shouldn't be this hard.
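As one way to implement the "transactional file containing hashes" workaround, here is a minimal sketch that computes a SHA-256 digest for a transferred file and writes it into a manifest the receiver can verify. The filenames are illustrative, and it assumes OpenSSL is available.

```c
/* Sketch: compute a SHA-256 digest for a transferred file and write it
 * into a manifest/"transaction" file for the receiver to verify.
 * Uses OpenSSL's EVP API. Build with: cc sha256_manifest.c -lcrypto */
#include <stdio.h>
#include <openssl/evp.h>

/* Writes the hex digest of `path` into `hex` (must hold 65 chars).
 * Returns 0 on success. */
static int sha256_file(const char *path, char *hex)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);

    unsigned char buf[8192];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
        EVP_DigestUpdate(ctx, buf, n);
    fclose(fp);

    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int len = 0;
    EVP_DigestFinal_ex(ctx, digest, &len);
    EVP_MD_CTX_free(ctx);

    for (unsigned int i = 0; i < len; i++)
        sprintf(hex + 2 * i, "%02x", digest[i]);
    return 0;
}

int main(void)
{
    /* Illustrative filenames: one data file, one manifest beside it. */
    char hex[65];
    if (sha256_file("export_20240101.csv", hex) == 0) {
        FILE *m = fopen("export_20240101.csv.sha256", "w");
        if (m) {
            fprintf(m, "%s  export_20240101.csv\n", hex);
            fclose(m);
        }
    }
    return 0;
}
```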
If the third party you're talking to really doesn't have a non-web-service way to accept large files, you might be their guinea pig as they try to navigate a brand new world. (Or they may have just fired all their transmissions folks and are only now realizing that the world doesn't run on SOAP...yet - seen that happen too.)
Either way, unless they GIVE you the magic code/utility/whatever to do the file-to-SOAP transaction for them (and that happens too), I'd stick to your SFTP guns until they find the right guy on their end to talk bulk data transmissions.
SFTP is a protocol for secure file transfer; SOAP is an API protocol, which can be used to send files as attachments (i.e. MIME attachments) or as Base64-encoded data.
SFTP adds additional potential complexity around separate processes for encrypting/decrypting files (at rest, if they contain sensitive data), file archiving, data latency, coordinating job scheduling, and setting up FTP service accounts.
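For the Base64 route, here is a minimal sketch of encoding a small file so it can be dropped into a SOAP body as element text. The filename and element name are invented, and it assumes OpenSSL's EVP_EncodeBlock for the encoding.

```c
/* Sketch: Base64-encoding a small file for embedding in a SOAP body.
 * Filename and element name are illustrative.
 * Build with: cc b64.c -lcrypto */
#include <stdio.h>
#include <stdlib.h>
#include <openssl/evp.h>

int main(void)
{
    FILE *fp = fopen("report.csv", "rb");   /* illustrative filename */
    if (!fp) return 1;

    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    fseek(fp, 0, SEEK_SET);

    unsigned char *data = malloc(size);
    fread(data, 1, size, fp);
    fclose(fp);

    /* Base64 output is 4 bytes for every 3 input bytes, plus a NUL. */
    unsigned char *b64 = malloc(4 * ((size + 2) / 3) + 1);
    int b64len = EVP_EncodeBlock(b64, data, (int)size);

    printf("<fileContent>%.*s</fileContent>\n", b64len, b64);

    free(b64);
    free(data);
    return 0;
}
```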
My team is currently developing a resume-parser for a website. Our parser will translate and format the resume into the industry-standard HR-XML. The website will then take the HR-XML-formatted information and pre-populate editable fields so the user can finalize his/her profile on the website.
What would be the best way to port the HR-XML information to the website? Should we store the XML tags in program memory and have the website call a retriever method in our software? Or should we create a temporary file for each resume that is uploaded to the site? If so, where should this file be stored, and how should we go about maintaining our directories so they are not crowded with temp files?
Any insight would be greatly appreciated! Thank you in advance for your time and your help.
There are a lot of things implied by your questions. If I understand correctly, it sounds as if you're going to import/snarf resume content from external sources, then allow users to fill in the blanks to update the information. Additionally, it sounds like this is existing code you have in place that runs in a non-web environment. Please clarify if that's correct, and I'll update my answer further.
A lot depends on the size of the temp files and the volume of resumes you expect to process. I'd recommend writing them to a temp directory that gets cleared at server start and shutdown; this is often useful when diagnosing server misbehavior. Writing them to disk also lets you run a job periodically that clears out "old" entries (a sketch of such a cleanup pass is included below).
Keeping a collection of entries in memory on the server probably makes more sense if volume is moderate and response time is a big factor, but IME container memory is usually more expensive than disk space.
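A minimal sketch of the periodic cleanup job mentioned above, assuming a POSIX system; the directory path and the 24-hour cutoff are purely illustrative.

```c
/* Sketch of a periodic cleanup pass: delete files in a temp directory
 * that are older than a cutoff age. The directory path and the 24-hour
 * cutoff are illustrative. POSIX (dirent/stat). */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <dirent.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_AGE_SECONDS (24 * 60 * 60)  /* remove entries older than 1 day */

static void clean_temp_dir(const char *dir)
{
    DIR *d = opendir(dir);
    if (!d) return;

    time_t now = time(NULL);
    struct dirent *entry;
    while ((entry = readdir(d)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;

        char path[4096];
        snprintf(path, sizeof(path), "%s/%s", dir, entry->d_name);

        struct stat st;
        if (stat(path, &st) == 0 && S_ISREG(st.st_mode) &&
            now - st.st_mtime > MAX_AGE_SECONDS) {
            unlink(path);  /* old temp file: remove it */
        }
    }
    closedir(d);
}

int main(void)
{
    clean_temp_dir("/var/tmp/resume-uploads");  /* illustrative path */
    return 0;
}
```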