How to move file from local to HDFS using oozie? - hdfs

I am trying to move data from a local file system to the Hadoop Distributed File System (HDFS), but I am not able to do it through Oozie.
Can we move or copy data from a local filesystem to HDFS using Oozie?

I found a workaround for this problem. The ssh action will always execute from the Oozie server. So if your files are located on the local file system of the Oozie server, you will be able to copy them to HDFS.
The ssh action will always be executed by the 'oozie' user, so your ssh action target should look like this: myUser@oozie-server-ip, where myUser is a user with read rights on the files on the Oozie server.
Next, you need to set up passwordless ssh between the oozie user and myUser, on the Oozie server. Generate a public key for the 'oozie' user and copy the generated key into the authorized_keys file of 'myUser'. This is the command for generating the RSA key:
ssh-keygen -t rsa
When generating the key, you need to be logged in as the oozie user. Usually on a Hadoop cluster this user will have its home in /var/lib/oozie, and the public key will be generated as id_rsa.pub in /var/lib/oozie/.ssh.
Next, copy this key into the authorized_keys file of 'myUser'. You will find it in that user's home directory, in the .ssh folder.
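For example, a minimal sketch of the key setup, assuming myUser's home directory is /home/myUser (adjust the paths to your environment):
# As the 'oozie' user on the Oozie server: generate the key pair
ssh-keygen -t rsa                # accept the defaults, leave the passphrase empty
# As root (or as myUser): append oozie's public key to myUser's authorized_keys
cat /var/lib/oozie/.ssh/id_rsa.pub >> /home/myUser/.ssh/authorized_keys
chmod 600 /home/myUser/.ssh/authorized_keys
# Back as 'oozie': verify that passwordless login now works
ssh myUser@localhost 'echo ok'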
Now that you have set up passwordless ssh, it's time to set up the Oozie ssh action. This action will execute the command 'hadoop' and will have as arguments 'fs', '-copyFromLocal', '${local_file_path}' and '${hdfs_file_path}'.
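For reference, a rough sketch of such an ssh action in the workflow XML (the action name, host and transitions are placeholders to adapt to your workflow):
<action name="copy-to-hdfs">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
        <host>myUser@oozie-server-ip</host>
        <command>hadoop</command>
        <args>fs</args>
        <args>-copyFromLocal</args>
        <args>${local_file_path}</args>
        <args>${hdfs_file_path}</args>
    </ssh>
    <ok to="end"/>
    <error to="kill"/>
</action>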

No, Oozie isn't aware of a local filesystem, because it runs on the Map-Reduce cluster nodes. You should use Apache Flume to move data from a local filesystem to HDFS.

Oozie does not support a copy action from local to HDFS or vice versa, but you can call a Java program to do the same. A shell action will also work, but if you have more than one node in the cluster, then all the nodes should have the said local mount point available, or mounted with read/write access.

You can do this using an Oozie shell action by putting the copy command in a shell script.
https://oozie.apache.org/docs/3.3.0/DG_ShellActionExtension.html#Shell_Action
Example:
<workflow-app name="reputation" xmlns="uri:oozie:workflow:0.4">
    <start to="shell"/>
    <action name="shell">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>run.sh</exec>
            <file>run.sh#run.sh</file>
            <capture-output/>
        </shell>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
In your run.sh you can use the hadoop fs -copyFromLocal command.
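For instance, a minimal run.sh sketch (the source and destination paths are placeholders; with a multi-node cluster, the local path must exist on whichever node actually runs the shell action):
#!/bin/bash
# Copy a file from the local filesystem of the executing node into HDFS
hadoop fs -copyFromLocal /local/path/to/data.txt /user/hadoop/data/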

Related

Why can't my GCP script/notebook find my file?

I have a working script that finds the data file when it is in the same directory as the script. This works both on my local machine and Google Colab.
When I try it on GCP, though, it cannot find the file. I tried 3 approaches:
PySpark Notebook:
Upload the .ipynb file, which includes a wget command. This downloads the file without error, but I am unsure where it saves it to, and the script cannot find the file either (I assume because I am telling it that the file is in the same directory, and presumably wget on GCP saves it somewhere else by default).
PySpark with bucket:
I did the same as the PySpark notebook above, but first I uploaded the dataset to the bucket and then used the two links provided in the file details when you click the file name inside the bucket on the console (neither worked). I would like to avoid this though, as wget is much faster than downloading over my slow wifi and then reuploading to the bucket through the console.
GCP SSH:
Create cluster
Access VM through SSH.
Upload .py file using the cog icon
wget the dataset and move both into the same folder
Run script using python gcp.py
Just gives me an error saying file not found.
Thanks.
As per your first and third approaches, if you are running PySpark code on Dataproc, irrespective of whether you use an .ipynb file or a .py file, please note the points below:
If you use the wget command to download the file, then it will be downloaded into the current working directory where your code is executed.
When you try to access the file through the PySpark code, it will look in HDFS by default. If you want to access the downloaded file from the current working directory, use the file:/// URI with the absolute file path.
If you want to access the file from HDFS, then you have to move the downloaded file to HDFS and then access it from there using an absolute HDFS file path. Please refer to the example below:
hadoop fs -put <local file_name> </HDFS/path/to/directory>
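Putting the pieces together, a rough sketch (the dataset URL, user name and paths are placeholders):
# Download; the file lands in the shell's current working directory (here assumed to be the home directory)
wget https://example.com/dataset.csv
# Option 1: read it locally from PySpark using the file:/// URI with the absolute path,
#   e.g. spark.read.csv("file:///home/your_user/dataset.csv")
# Option 2: put it into HDFS and read it with an HDFS path
hadoop fs -put dataset.csv /user/your_user/dataset.csv
#   e.g. spark.read.csv("hdfs:///user/your_user/dataset.csv")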

How to give back hdfs permission to super group?

In order to access HDFS, I unknowingly ran the following commands as the root user (I had been trying to resolve the error shown below):
sudo su - hdfs
hdfs dfs -mkdir /user/root
hdfs dfs -chown root:hdfs /user/root
exit
Now when I try to access HDFS, it says:
Call From headnode.name.com/192.168.21.110 to headnode.name.com:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
What can I do to resolve this issue? It would be great if you could also explain what the command 'hdfs dfs -chown root:hdfs /user/root' does.
I am using HDP 3.0.1.0 (Ambari)
It seems like your HDFS is down. Check whether your NameNode is up.
The command hdfs dfs -chown root:hdfs /user/root changes the ownership of the HDFS directory /user/root (if it exists) to user root and group hdfs. User hdfs should be able to perform this command (or any command in HDFS, for that matter). The "root" user of HDFS is hdfs.
If you want to make user root an HDFS superuser, you can change the group of the root user to hdfs using (as user root) usermod -g hdfs root, and then run (as user hdfs) hdfs dfsadmin -refreshUserToGroupsMappings. This will sync the user-to-group mappings on the server with HDFS, making user root a superuser.
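Sketched out as commands, using the host and port from the error above (on an Ambari-managed HDP cluster you would normally check and restart the NameNode from the Ambari UI instead):
# Confirm the NameNode RPC port is actually reachable
nc -z headnode.name.com 8020 && echo "NameNode reachable" || echo "connection refused"
# Change root's group to hdfs (run as root)
usermod -g hdfs root
# Refresh the user-to-group mappings (run as the hdfs user)
sudo -u hdfs hdfs dfsadmin -refreshUserToGroupsMappings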

Changing EC2 pem file key pair when you have access to the EC2 instance

Thank you for your time.
I have an EC2 instance, but for security reasons I need to change the key pair associated with it in .ssh/authorized_keys. I do understand that the public key goes into authorized_keys.
I do not want to mount the volume of the EC2 instance on a new one; I am only considering that as a last option, since I do have access to the EC2 instance.
How can this be done?
I have tried:
The answer by Pat Mcb in this post, Change key pair for ec2 instance, but no luck. It says:
Run this command after you download your AWS pem:
ssh-keygen -f YOURKEY.pem -y
Then dump the output into authorized_keys.
Or copy the pem file to your AWS instance and execute the following commands:
chmod 600 YOURKEY.pem
ssh-keygen -f YOURKEY.pem -y >> ~/.ssh/authorized_keys
But that didn't work for me. If I follow it exactly (download the AWS key pair and follow the instructions by copying the key while ssh'd into the instance), when I do ssh-keygen -f YOURKEY.pem -y >> ~/.ssh/authorized_keys it asks for a passphrase (I never had to set one).
What I am doing is the following.
I create a new key with
ssh-keygen newpem.pem
and I copy the .pub file into .ssh/authorized_keys.
Can someone explain what I am doing incorrectly?
Note the authorized_keys file has the correct permissions.
Seems like you want to deprecate the old key and use a new key instead. These steps may help you (a command-level sketch follows the list):
Create a new key pair using the aws console and download it onto your system.
Retrieve the public key from the private key (.pem) file using the command "ssh-keygen -y".
SSH into the instance using the old key.
Once you have access to the instance, add the public key you got in step 2 into the "~/.ssh/authorized_keys" file and then save the file.
Log out of the instance and then try accessing the instance with the new key.
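A minimal sketch of steps 2 through 5, assuming the new private key downloaded from the console is named new-key.pem and the old one old-key.pem (names and the instance address are placeholders):
# Step 2 (on your local machine): extract the public key from the new private key
ssh-keygen -y -f new-key.pem > new-key.pub
# Steps 3-4: log in with the OLD key and append the new public key
ssh -i old-key.pem ec2-user@your-instance-public-dns "echo '$(cat new-key.pub)' >> ~/.ssh/authorized_keys"
# Step 5: verify the new key works before discarding the old one
ssh -i new-key.pem ec2-user@your-instance-public-dns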
Hope it helps. Thank you!
You don't even need to do all of this; just mind a few things. With AWS EC2 you get a private key for the default users, like ec2-user, ubuntu, etc.
You are taking the right step:
ssh-keygen -t rsa -C "your_email@example.com"
If it asks you to enter a passphrase, leave it blank.
Just press Enter to accept the default location and file name. If the .ssh directory doesn't exist, the system creates one for you.
Press Enter, and re-enter, if prompted for a passphrase.
You have that key now.
Copy that key (the public part).
Log in to your EC2 server:
sudo su
vim ~/.ssh/authorized_keys
You'll see a key there; copy it and save it as a backup somewhere.
Now paste your newly generated key into that file
and save the file:
:wq!
The final thing to take care of is permissions, so run the following command:
sudo chmod 700 .ssh && chmod 600 .ssh/authorized_keys
Now you're good to go.
Following are the steps to change your keypair on AWS EC2.
Log in to the AWS Console. Go to Network and Security >> Key Pairs.
Give the name of your keypair (mykeypair), the key type (RSA) and the private key format (.pem), and click on Create key pair. It will ask you to download the .pem file to your local machine. Save it and remember the location.
Log in to your EC2 instance and go to the .ssh location. Create a new file called mykeypair.pem and paste in the content of the file we downloaded in step 2.
Run the command: sudo chmod 600 mykeypair.pem
Run the command ssh-keygen -f mykeypair.pem -y and it will generate some content. Copy that content. Open the file called authorized_keys and remove all the content from it.
Paste the copied content that we generated in the previous step. Also add your key name (mykeypair) at the end, after a space.
Reboot your instance. Go to PuTTYgen and generate the .ppk file using the .pem file you downloaded for the keypair. You will be able to log in to your EC2 with the newly generated .ppk from PuTTY.
Okay, I figured out my problem. First of all, I had been hacked, apparently because I didn't know that permitpasswordlogin: yes DISABLES pubkey authentication... I thought it was additional security. So I used a very loose password that could be easily guessed. Anyway, I believe this because I went to the root folder and found that there was actually a new key in the root named "el patrono 1337", which actually means "the master/boss" in Spanish... LOL.
Anyway, so I changed that back to my secure key (made a new one actually), and then I went to log in as ec2-user and couldn't, but could as root. It was driving me crazy for 30 minutes or so until I realized I had accidentally changed the owner of my ec2-user folder to root, and therefore ssh was not reading the ec2-user .ssh/authorized_keys when I tried to log in. Wow, very glad that's over, lol.
And just FYI, I don't think the hacker installed anything malicious, but I did get tipped off that he tried to ssh into other people's servers (who claim they get attacked over ssh a lot, according to the AWS abuse report) from my machine. I'm running a very simple website with zero sensitive data etc. He didn't even lock me out of the machine by disabling password authentication (I guess he didn't want me to know?). I will build a new instance from scratch next time I want to add anything (which will be pretty soon), just to be on the safe side.

Can't make COPY from remote host to Redshift work

I have a gzipped file on a local machine and want to load it to Redshift.
My command looks like this:
\COPY tablename FROM 's3://redshift.manifests/copy_from_yb01_urlinfo.txt' REGION 'us-east-1' CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...' SSH GZIP;
But I get a message "s3:/redshift.manifests/copy_from_yb01_urlinfo.txt: No such file or directory".
But this file is even public: https://s3.amazonaws.com/redshift.manifests/copy_from_yb01_urlinfo.txt.
Moreover, the user whose credentials I use has full access to S3 and Redshift: http://c2n.me/iEnI5l.png
And even more weird is the fact that I could perfectly access that file with same credentials from AWS CLI:
> aws s3 ls redshift.manifests
2014-08-01 19:32:13 137 copy_from_yb01_urlinfo.txt
How to diagnose that further?
Just in case, I connect to my Redshift cluster via psql (PostgreSQL cli):
PAGER=more LANG=C psql -h ....us-east-1.redshift.amazonaws.com -p 5439 -U ... -d ...
Edit:
I uploaded the file to S3 and got the same error on COPY...
And again I uploaded it and ran COPY with the same credentials.
\COPY url_info FROM 's3://redshift-datafiles/url_info_1.copy.gz' CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...' GZIP;
I am going to despair...
Since you are trying to copy to Redshift using a manifest file, you need to add the MANIFEST keyword at the end, like:
\COPY tablename FROM 's3://redshift.manifests/copy_from_yb01_urlinfo.txt' REGION 'us-east-1' CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...' SSH GZIP MANIFEST;
Oh.
The fix was to remove the backslash at the beginning of the command.
Can't remember why I started writing it... Actually, I had already begun writing it when I exported data from my local PostgreSQL installation.
This is so stupid) One small rubber duck could have saved me a day or two.

How to use SCP to copy file from my local system to remote AWS server

I have created an AWS server instance.
I have a pem file which I use to get access to the remote AWS server through my local system terminal.
I have saved the pem file at /home/downloads/xxx.pem.
I want to copy an image from /home/images/image.jpg to the server at /images.
I did some research on Google and found out it is done through SCP.
I could not achieve the goal.
How do I use scp to copy an image from the source (my local computer) to the AWS server?
Thanks
scp -i secret_key.pem file_name.ext ec2-user@publicDNS:~/.
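With the paths from the question, the command would look something like the sketch below (the public DNS is a placeholder; copying straight into /images assumes ec2-user can write there, otherwise copy to the home directory and move it with sudo):
chmod 400 /home/downloads/xxx.pem
scp -i /home/downloads/xxx.pem /home/images/image.jpg ec2-user@your-ec2-public-dns:~/image.jpg
# then, on the server, move it into /images if that directory needs root:
ssh -i /home/downloads/xxx.pem ec2-user@your-ec2-public-dns 'sudo mv ~/image.jpg /images/'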
You should keep that folder on your local D drive, then open that folder from the AWS instance and drag & drop that folder.