How do I pass the Hadoop Streaming -file flag to Amazon Elastic MapReduce? - elastic-map-reduce

The -file flag lets you ship your executable files as part of the job submission, so you can run a MapReduce job without first manually copying the executable to S3. Is there a way to use the -file flag with Amazon's elastic-mapreduce command? If not, what is the easiest way to upload the binary you want to run?

The answer is no. I ended up uploading the executable separately from invoking elastic-mapreduce.
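
Since the CLI will not ship the binary for you, the practical workaround is to copy it to S3 first and have the streaming step reference the s3:// path. A minimal sketch, assuming boto3 and made-up bucket/key names (none of these appear in the original answer):

```python
# Sketch (not from the original answer): upload the streaming executable to S3
# with boto3, then point the streaming step at the s3:// path instead of -file.
# The bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")

# Upload the local binary once, before creating the job flow.
s3.upload_file(
    Filename="./my_mapper",      # local executable (hypothetical)
    Bucket="my-emr-assets",      # hypothetical bucket
    Key="binaries/my_mapper",
)

# The streaming step can then reference s3://my-emr-assets/binaries/my_mapper
# as its mapper, so nothing needs to be packed at submission time.
```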

Related

Is there a way to handle files within Google Cloud Storage?

There are many compressed files in Google Cloud Storage.
I want to unzip each zipped file, rename it, and save it back to another bucket.
I've seen a lot of posts, but I couldn't find any approach other than downloading the files with gsutil and handling them locally.
Is there another way?
To modify a file, such as unzipping it, you must read, modify, and then write it back. This means downloading the archive, unzipping it, and uploading the extracted files.
You can do this with gsutil or another tool, one of the client SDKs, or the REST API. To unzip the file itself, use a zip tool or one of the libraries that support zip operations.
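
As a concrete illustration of that read-modify-write flow, here is a minimal Python sketch using the google-cloud-storage client; the bucket names and output prefix are assumptions for the example:

```python
# Sketch of the download -> unzip -> upload flow with the
# google-cloud-storage client. Bucket names are placeholders.
import io
import zipfile
from google.cloud import storage

client = storage.Client()
dst_bucket = client.bucket("destination-bucket")   # hypothetical

for blob in client.list_blobs("source-bucket"):    # hypothetical source bucket
    if not blob.name.endswith(".zip"):
        continue
    # Download the archive into memory (use a temp file for large archives).
    data = blob.download_as_bytes()
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for member in archive.namelist():
            # Rename as needed, then write the extracted file to the other bucket.
            new_name = f"unzipped/{blob.name[:-4]}/{member}"
            dst_bucket.blob(new_name).upload_from_string(archive.read(member))
```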
You can also use Cloud Functions for this. A Cloud Function can be triggered when an object is created (finalized) and do the job automatically.
The same function can iterate over all the zip files, do the tasks, and move the results to another bucket. One thing to note is that a single invocation may not be able to process all files in one go, so for the files that already exist in the bucket you can run the same program from a local machine.
Here is the list of client libraries for connecting to Cloud Storage.
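
For the Cloud Functions route mentioned above, a minimal sketch of a 1st-gen background function triggered on google.storage.object.finalize might look like this; the destination bucket name is an assumption:

```python
# Sketch of a 1st-gen Cloud Function triggered when an object is finalized.
# It unzips the new object into another bucket; names are placeholders.
import io
import zipfile
from google.cloud import storage

client = storage.Client()
DEST_BUCKET = "destination-bucket"  # hypothetical

def unzip_on_finalize(event, context):
    """Background function: `event` carries the source bucket and object name."""
    if not event["name"].endswith(".zip"):
        return
    src_blob = client.bucket(event["bucket"]).blob(event["name"])
    data = src_blob.download_as_bytes()
    dst_bucket = client.bucket(DEST_BUCKET)
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for member in archive.namelist():
            dst_bucket.blob(member).upload_from_string(archive.read(member))
```

Deployed with a finalize trigger on the source bucket, this runs once per newly created object.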

Sync a specific set of files from Amazon S3 to Dropbox or Amazon Drive

I have an Amazon S3 bucket with tons of images. A subset of these images needs to be synced to a local machine for image analysis (AI) purposes. This has to be done regularly, ideally with a list of file names as input. Not all images need to be synced.
There are ways to synchronise S3 with either Dropbox/Amazon Drive or other storage services, but none of them appear to have the option to provide a list of files that need to be synced.
How can this be implemented?
The first thing that springs to mind for syncing with S3 is the aws s3 sync CLI command. It lets you sync specific origin and destination folders, and its --include and --exclude filters let you name specific files. The filters also accept wildcards (*) if you have naming conventions you can use to identify the files.
You can also repeat the --exclude and --include flags for multiple files, so depending on your OS you could either list all files on the command line or write a find script that identifies the files and singles them out.
Additionally, --delete will remove any files in the destination path that are not present in the origin.
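
One way to drive aws s3 sync from a list of file names is to exclude everything and then add one --include per wanted file. A rough Python sketch (the bucket, destination path, and list file name are made up):

```python
# Sketch: build an `aws s3 sync` command from a list of file names by
# excluding everything and re-including only the listed keys.
# Bucket, destination, and list file are placeholders.
import subprocess

with open("files_to_sync.txt") as f:          # one key per line (hypothetical)
    wanted = [line.strip() for line in f if line.strip()]

cmd = ["aws", "s3", "sync", "s3://my-image-bucket", "/data/images",
       "--exclude", "*"]
for key in wanted:
    cmd += ["--include", key]

subprocess.run(cmd, check=True)
```

Because filters are applied in order, the trailing --include entries take precedence over the leading --exclude "*".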
If I understand this correctly, I would use the AWS CLI with its --include and --exclude filters:
https://docs.aws.amazon.com/cli/latest/reference/s3/index.html#use-of-exclude-and-include-filters

Automate uploading files stored locally to Cloud Storage using gsutil

I'm new to GCP and I'm trying to build an ETL pipeline that will load data from files into BigQuery. It seems to me that the best solution would be to use gsutil. The steps I see today are:
1. (done) Downloading the .zip file from the SFTP server to the virtual machine
2. (done) Unpacking the file
3. Uploading the files from the VM to Cloud Storage
4. (done) Automatically loading the files from Cloud Storage into BigQuery
Steps 1 and 2 would be performed on a schedule, but I would like step 3 to be event-driven: when files are copied to a specific folder, gsutil should send them to the specified bucket in Cloud Storage. Any ideas how this can be done?
Assuming you're running on a Linux VM, you might want to check out inotifywait, as mentioned in this question -- you can run this as a background process to try it out, e.g. bash /path/to/my/inotify/script.sh &, and then set it up as a daemon once you've tested it out and got something working to your liking.
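
Tying that to step 3, one rough sketch is to read inotifywait's output from Python and call gsutil cp for each file once it has been fully written; the watch directory and bucket path are assumptions:

```python
# Sketch: watch a local directory with inotifywait and copy each newly
# written file to Cloud Storage with gsutil. Paths and bucket are placeholders.
import subprocess

WATCH_DIR = "/data/unzipped"            # hypothetical local folder
BUCKET = "gs://my-etl-bucket/incoming"  # hypothetical bucket path

watcher = subprocess.Popen(
    ["inotifywait", "-m", "-e", "close_write", "--format", "%w%f", WATCH_DIR],
    stdout=subprocess.PIPE,
    text=True,
)

for line in watcher.stdout:
    path = line.strip()
    # Upload the file as soon as it has been fully written.
    subprocess.run(["gsutil", "cp", path, BUCKET], check=True)
```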

Renaming a folder in Amazon S3 is slow. Can it be resolved using the following?

A rename in Amazon S3 is not just renaming a folder; internally it performs a copy and a delete, which involves PUT requests, incurs cost, and can be slow when the folder contains many large files.
I came across the following page, which talks about renaming via a script: http://gerardvivancos.com/2016/04/12/Single-and-Bulk-Renaming-of-Objects-in-Amazon-S3/
Can we execute a similar script via the AWS SDK for Java?
Does it still do a copy and delete internally, or does it just change the paths?
Thanks.
That script is just issuing aws s3 mv commands in a loop. You can issue S3 move commands via the AWS Java SDK.
It is still copying and deleting internally.
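
To make that copy-plus-delete explicit, here is a sketch of what a single-object "move" amounts to, written with boto3; the AWS Java SDK exposes the equivalent copyObject and deleteObject calls. Bucket and key names are placeholders:

```python
# Sketch: an S3 "rename"/"move" is a copy followed by a delete.
# Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")

def move_object(bucket, old_key, new_key):
    # Copy the object to its new key...
    s3.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": old_key},
    )
    # ...then delete the original. There is no in-place rename in S3.
    s3.delete_object(Bucket=bucket, Key=old_key)

move_object("my-bucket", "old-folder/image.jpg", "new-folder/image.jpg")
```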

How to put an input file into HDFS automatically?

In Hadoop we always put the input file into HDFS manually with the -put command. Is there any way we can automate this process?
There is no automated process for putting a file into the Hadoop filesystem. However, it is possible to -put or -get multiple files with one command.
Here is the website for the Hadoop shell commands:
http://hadoop.apache.org/common/docs/r0.18.3/hdfs_shell.html
I am not sure how many files you are dropping into HDFS, but one solution for watching for files and then dropping them in is Apache Flume. These slides provide a decent intro.
You can think of automating this process with the Fabric library and Python. Write the hdfs put command in a function and you can call it for multiple files and perform the same operations on multiple hosts in the network. Fabric should be really helpful for automating your scenario.
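
A minimal sketch of that Fabric idea (Fabric 2.x); the host name, paths, and HDFS destination are assumptions:

```python
# Sketch: run `hdfs dfs -put` on a remote host with Fabric 2.x.
# Host name and paths are placeholders.
from fabric import Connection

def put_to_hdfs(host, local_paths, hdfs_dir):
    conn = Connection(host)
    for path in local_paths:
        # Copy the file to the remote machine, then push it into HDFS.
        conn.put(path, remote="/tmp/")
        remote_file = "/tmp/" + path.split("/")[-1]
        conn.run(f"hdfs dfs -put {remote_file} {hdfs_dir}")

put_to_hdfs("hadoop-edge-node", ["./data/input1.csv"], "/user/me/input")
```

The same function can be called in a loop over many files or many hosts, which is the automation the answer has in mind.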