EMR HDFS transparently backed by S3

With Hadoop I can use S3 as a storage URL. But currently I have a lot of applications using hdfs://... and I would like to migrate the whole cluster and the apps to EMR and S3. Do I have to change the URL in every single app from hdfs://... to s3://..., or is it possible to somehow tell EMR to store the HDFS content on S3, so that each application can still use hdfs://... but it will in fact point to S3? If so, how?

That's a very good question. Is there such a thing as protocol spoofing? Could you actually effect this behavior by writing something that overrides how protocols are handled? Honestly, that kind of solution gives me the heebie-jeebies, because if someone doesn't know it's happening, gets unexpected pathing, and can't really diagnose or fix it, that's worse than the original problem.
If I were you, I'd do a find-and-replace over all my apps to just update the protocol.
let's say you had all of your apps in a directory:
-- myApps
|-- app1.txt
|-- app2.txt
and you wanted to find and replace hdfs:// with s3:// in all of those apps, I'd just do something like this:
sed -i .original 's|hdfs://|s3://|g' *
which produces:
-- myApps
|-- app1.txt
|-- app1.txt.original
|-- app2.txt
|-- app2.txt.original
and now app1.txt has s3:// everywhere rather than hdfs://
Isn't that enough?

The applications should be refactored so that the input and output paths are not hard-coded. Instead, they should be injected into the applications, either read from configuration files or parsed from command-line arguments.
Take the following Pig script for example:
loaded_records =
LOAD '$input'
USING PigStorage();
--
-- ... magic processing ...
--
STORE processed_records
INTO '$output'
USING PigStorage();
We can then have a wrapper script like this:
#!/usr/bin/env bash
config_file=${1:?"Missing config_file"}
[[ -f "$config_file" ]] && source "$config_file" || { echo "Failed to source config file $config_file"; exit 1; }
pig -p input="${input_root:?'Missing parameter input_root in config_file'}/my_input_path" -p output="${output_root:?'Missing parameter output_root in config_file'}/my_output_path" the_pig_script.pig
In the config file:
input_root="s3://mybucket/input"
output_root="s3://mybucket/output"
If you have this kind of setup, you only have to change the configuration to switch between hdfs and s3.
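For example, switching the same job back to HDFS would only mean pointing those variables at hdfs:// locations in the config file (the paths below are just illustrative):
input_root="hdfs:///data/input"
output_root="hdfs:///data/output"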

Related

rclone - How do I list which directory has the latest files in AWS S3 bucket?

I am currently using rclone to access AWS S3 data, and since I don't use either one much, I am not an expert.
I am accessing the public bucket unidata-nexrad-level2-chunks, and there are 1000 folders I am looking at. To see these, I am using the Windows command prompt and entering:
rclone lsf chunks:unidata-nexrad-level2-chunks/KEWX
Only one folder has real-time data being written to it at any time, and that is the one I need to find. How do I determine which one it is? I could run a check to see which folder has the newest data. But how can I do that?
The output from my command looks like this:
1/
10/
11/
12/
13/
14/
15/
16/
17/
18/
19/
2/
20/
21/
22/
23/
... ... ... (to 1000)
What can I do to find where the latest data is being written to? Since it is only one folder at a time, I hope it would be simple.
Edit: I realized I need a way to list the latest file (along with its folder #) without listing every single file and timestamp in all 999 directories. I am starting a bounty, and the answer that allows me to do this without slogging through all of them will be awarded the bounty. If it takes 20 minutes to list all contents of all 999 folders, it's useless, as the next folder will be active by that time.
If you want to know the specific folder with the very latest file, you could write your own script that retrieves a list of ALL objects, then figures out which one is the latest and which folder it is in. Here's a Python script that does it:
import boto3

s3_resource = boto3.resource('s3')

# List every object under the KEWX/ prefix of the public bucket.
objects = s3_resource.Bucket('unidata-nexrad-level2-chunks').objects.filter(Prefix='KEWX/')

# Pair each object's last-modified timestamp with its key.
date_key_list = [(obj.last_modified, obj.key) for obj in objects]

print(len(date_key_list))  # How many objects?

# Sort newest first; the first entry is the most recently written object.
date_key_list.sort(reverse=True)

print(date_key_list[0][1])
Output:
43727
KEWX/125/20200912-071306-065-I
It takes a while to go through those 43,727 objects!
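If you also want just the folder number (the question asks for the folder # along with the file), you can split it out of the newest key. A tiny follow-on sketch, continuing from the script above:
# date_key_list[0] holds the newest (last_modified, key) pair after the sort.
latest_key = date_key_list[0][1]          # e.g. 'KEWX/125/20200912-071306-065-I'
latest_folder = latest_key.split('/')[1]  # the folder number, e.g. '125'
print(latest_folder)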

Move only certain files to GCP and keep subfolder

I want to move all the files with the extension "gz", together with their folder/subfolder structure, from the directory "C:\GCPUpload\Additional" to a folder in the bucket "gs://BucketName/Additional/".
I need to keep the folder structure, like this:
C:\GCPUpload\Additional\Example1.gz --> gs://BucketName/Additional/Example1.gz
C:\GCPUpload\Additional\Example2.gz --> gs://BucketName/Additional/Example2.gz
C:\GCPUpload\Additional\ExampleNot.txt --> (Ignore this file)
C:\GCPUpload\Additional\Subfolder2\Example3.gz --> gs://BucketName/Additional/Subfolder2/Example3.gz
C:\GCPUpload\Additional\Subfolder2\Example4.gz --> gs://BucketName/Additional/Subfolder2/Example4.gz
This is the command that I am using so far:
call gsutil mv -r -c "C:\GCPUpload\Additional\**\*.gz" "gs://BucketName/Additional/"
The trouble I'm having is that all the files are being moved directly under the destination root (i.e. gs://BucketName/Additional/), ignoring their original folder/subfolder structure.
How should I write this? I've tried and googled, but can't find a way to make this work.
Thanks!!
The behavior you're seeing was implemented by gsutil to match the corresponding (older) behavior you get when you use a recursive wildcard (**) in the shell.
To do what you want, you'll need to list all of the objects you want moved and create a shell script that runs an individual gsutil mv command for each one, moving it to the directory you want. You could use local editing tools (like awk or sed) to make that somewhat easier.
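If you would rather generate those commands than write them by hand, here is a rough sketch in Python (untested, using the local root and bucket names from the question) that walks the local tree and issues one gsutil mv per .gz file, rebuilding the relative path on the destination side:
import os
import subprocess

SRC_ROOT = r"C:\GCPUpload\Additional"     # local root from the question
DEST_ROOT = "gs://BucketName/Additional"  # destination prefix from the question

for dirpath, dirnames, filenames in os.walk(SRC_ROOT):
    for name in filenames:
        if not name.endswith(".gz"):
            continue  # ignore non-.gz files such as ExampleNot.txt
        local_path = os.path.join(dirpath, name)
        # Rebuild the relative path (e.g. Subfolder2/Example3.gz) under the destination prefix.
        rel_path = os.path.relpath(local_path, SRC_ROOT).replace(os.sep, "/")
        subprocess.run(["gsutil", "mv", local_path, DEST_ROOT + "/" + rel_path], check=True)
On Windows you may need to invoke gsutil via its .cmd wrapper or with shell=True, depending on how it is installed.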

Redshift COPY command raises error if S3 prefix does not exist

When I run this COPY command:
COPY to_my_table (field1, field2, etc)
FROM 's3://my-service-f55b83j5vvkp/2018/09/03'
CREDENTIALS 'aws_iam_role=...'
JSON 'auto' TIMEFORMAT 'auto';
I get this error:
The specified S3 prefix '2018/09/03' does not exist
Which makes sense, because my S3 bucket does not have any file under that specific prefix. However, this is part of a daily job to load data, where sometimes there is something to load but other times there is nothing.
I checked the COPY documentation and there doesn't seem to be any way to avoid the error and simply do nothing if there are no objects under that prefix. Maybe I am missing something?
I would like to suggest how we solved this problem in our case. It is a simple solution, but it may be helpful to others. Jon Scot suggested a good option in a comment that I liked, but unfortunately in our case we couldn't use it, because the system adding files to S3 was not under our control, so I am not sure whether that is your case too.
I think you could solve your problem in multiple ways, but here are two options that I suggest.
1) As you are probably running a cron job to load data into Redshift, put a file-existence check before executing the COPY command, like below.
path=s3://my-service-f55b83j5vvkp/2018/09/03
count=$(s3cmd ls "$path" | wc -l)
if [[ $count -gt 0 ]]; then
    : # Your Redshift COPY code goes here.
else
    echo "Nothing to load"
fi
The advantage of this option is that you save a little cost, though it may be completely negligible.
2) Put a dummy file without records under the prefix; it will load no data into Redshift.
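As a sketch of option 2, assuming boto3 and the bucket/prefix from the question, you could upload a zero-byte placeholder whenever the day's prefix is empty, so the COPY always finds something to read (an empty file should load zero rows):
import boto3

s3 = boto3.client('s3')
bucket = 'my-service-f55b83j5vvkp'  # bucket from the question
prefix = '2018/09/03/'              # the day's prefix

# If nothing exists under the prefix yet, drop in an empty placeholder object.
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
if response.get('KeyCount', 0) == 0:
    s3.put_object(Bucket=bucket, Key=prefix + 'empty.json', Body=b'')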

How to implement my product resource into a Pods structure?

Reading http://www.ember-cli.com/#pod-structure
Let's say I have a product resource, which currently has the following directory structure:
app/controllers/products/base.js
app/controllers/products/edit.js
app/controllers/products/new.js
app/controllers/products/index.js
With pods, is all the logic in these files put into a single file, app/products/controller.js?
At the same time, my routes and templates for these resources currently look like:
app/routes/products/base.js
app/routes/products/edit.js
app/routes/products/new.js
app/routes/products/index.js
app/templates/products/-form.hbs
app/templates/products/edit.hbs
app/templates/products/index.hbs
app/templates/products/new.hbs
app/templates/products/show.hbs
How should this be converted to Pods?
You can use ember generate --pod --dry-run to help with that:
$ ember g -p -d route products/base
version: 0.1.6
The option '--dryRun' is not supported by the generate command. Run `ember generate --help` for a list of supported options.
installing
You specified the dry-run flag, so no changes will be written.
create app/products/base/route.js
create app/products/base/template.hbs
installing
You specified the dry-run flag, so no changes will be written.
create tests/unit/products/base/route-test.js
$
(I don't know why it complains yet still honours the option; it might be a bug.)
So you'd end up with a structure like:
app/products/base/route.js
app/products/edit/route.js
etc.

How can I load a Django fixture too large to fit into memory?

I wish to load initial data using fixtures as described here:
https://docs.djangoproject.com/en/dev/howto/initial-data/
This would be easy enough with a small data set. However, I wish to load a large CSV which will not fit into memory. How would I go about serializing this to a large JSON format? Do I have to hack it by manually writing the opening '[' and closing ']', or is there a cleaner way of doing this?
Seeing that you are starting with a CSV file, you could create a custom management command. You can read the CSV file, create the objects, and save them to the database within the command. As long as you process one line of the CSV at a time within the loop, you will not run into memory issues.
Relevant documentation can be found here:
http://docs.python.org/2/library/csv.html
https://docs.djangoproject.com/en/dev/howto/custom-management-commands/
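A minimal sketch of such a command (the app name, model, and CSV columns below are made up for illustration; adapt them to your schema):
# myapp/management/commands/load_csv.py  (hypothetical path)
import csv

from django.core.management.base import BaseCommand

from myapp.models import MyModel  # hypothetical model


class Command(BaseCommand):
    help = "Load rows from a large CSV file one line at a time"

    def add_arguments(self, parser):
        parser.add_argument("csv_path")

    def handle(self, *args, **options):
        with open(options["csv_path"], newline="") as f:
            reader = csv.DictReader(f)
            # One row at a time, so memory use stays flat regardless of file size.
            for row in reader:
                MyModel.objects.create(name=row["name"], value=row["value"])
If the row-by-row inserts are too slow, batching them with bulk_create every few thousand rows is a common refinement.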
I realize this is quite old, but I just had the same issue.
Using this post as reference:
Using jq how can I split a very large JSON file into multiple files, each a specific quantity of objects?
I split my original large array of json objects into individual objects, one per file, like so:
jq -c '.[]' fixtures/counts_20210517.json | \
awk '{print > ("fixtures/counts_split/doc00" NR ".json")}'
and then looped over the files, added a square bracket to the beginning and end, and called manage.py loaddata on each file:
for file in fixtures/counts_split/*.json; do
    echo "loading ${file}"
    sed -i '1s/^/[/' "$file"
    sed -i '1s/$/]/' "$file"
    manage.py loaddata "$file"
done