How to create a permanent Hadoop filesystem in Hadoop 2.6.0 - hdfs

I am working with Apache Hadoop 2.6.0. Every time I start my system, I have to create a new filesystem. How can I create a permanent Hadoop filesystem?

This may be a solution: you can use the Hadoop snapshot mechanism; see the section "8.2.7. Upgrades and Filesystem Snapshots" in the documentation.
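For reference, a minimal sketch of the snapshot commands (the /user/data path is just a placeholder for the directory you want to protect):

# Allow snapshots on a directory (requires HDFS superuser privileges)
hdfs dfsadmin -allowSnapshot /user/data
# Create a named snapshot of that directory
hdfs dfs -createSnapshot /user/data before-restart
# List existing snapshots
hdfs dfs -ls /user/data/.snapshot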

Related

Problems Integrating Hadoop 3.x on Flink cluster

I am facing some issues while trying to integrate Hadoop 3.x with a Flink cluster. My goal is to use HDFS as persistent storage and to store checkpoints there. I am currently using Flink 1.13.1 and HDFS 3.3.1. The error that I am getting when trying to submit a job is that HDFS is not supported as a file system. In the standalone version, this error was solved by specifying HADOOP_CLASSPATH on my local machine. As a next step, I applied the solution above on all the machines used in my cluster, and in the standalone setup I managed to submit my jobs on all of them without any issues. However, when I started modifying the configurations to set up my cluster (by specifying the IPs of my machines), the problem came up once again. What am I missing?
For Hadoop 2.x there are pre-bundled jar files on the official Flink download page that solved similar issues in the past, but that is not the case with the Hadoop 3.x versions.
It should be enough to set HADOOP_CLASSPATH on every machine in the cluster.
For anyone still struggling with a similar issue, the answer proposed by David worked for me in the end. The detail that I was missing was in the definition of the environment variables.
In my initial attempts, I was using the .bashrc script to permanently define my environment variables. That works in a standalone cluster, but not in a distributed cluster, because of the scope of the script. What actually worked for me was defining my variables (including $HADOOP_CLASSPATH) in /etc/profile.
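For reference, a minimal sketch of what was added to /etc/profile on every node (the Hadoop install path is an assumption; adjust it to your setup):

# /etc/profile is read by login shells, including the non-interactive SSH sessions used by the cluster scripts
export HADOOP_HOME=/opt/hadoop-3.3.1    # assumed install location
export PATH=$PATH:$HADOOP_HOME/bin
# Expose the Hadoop/HDFS classes to Flink
export HADOOP_CLASSPATH=$("$HADOOP_HOME"/bin/hadoop classpath)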
I also managed to find another solution while I was struggling with HADOOP_CLASSPATH. As I mentioned in my initial post, for Hadoop 2.x there are pre-bundled jar files on the official Flink download page to support HDFS integration, which is not the case for Hadoop 3.x. I found the following Maven repository page and, after testing all of the existing jars, I managed to find one that worked in my case. To be more precise, for Hadoop 3.3.1 the 3.1.1.7.2.8.0-224-9.0 jar worked (placed in $FLINK_HOME/lib; see the sketch after the link below). While it is not an "official" solution, it seems to solve the issue.
https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-3-uber?repo=cloudera-repos
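If you go that route, a rough sketch of the deployment on each node (the jar file name follows the usual Maven artifact naming and is an assumption):

# Copy the downloaded uber jar into Flink's lib folder
cp flink-shaded-hadoop-3-uber-3.1.1.7.2.8.0-224-9.0.jar "$FLINK_HOME"/lib/
# Restart the cluster so the new jar is picked up
"$FLINK_HOME"/bin/stop-cluster.sh
"$FLINK_HOME"/bin/start-cluster.sh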

Is PygreSQL available on AWS Glue Spark Jobs?

I tried using PygreSQL modules
import pg
import pgdb
but it says the modules were not found when running on AWS Glue Spark.
Their Developer Guide, https://docs.aws.amazon.com/glue/latest/dg/glue-dg.pdf, says it's available for Python Shell though.
Can anyone else confirm this?
Is there a page I can refer to for what libraries that come by default for the Python environment?
Is there an alternative to a PostgreSQL library for running on Glue Spark jobs? I know it is possible to use an external library by uploading it to S3 and adding the path in the configurations, but I would like to avoid as many manual steps as possible.
The document that you have shared talks about libraries intended only for Python shell jobs. If you want this library in a Glue Spark job, then you need to package it, upload it to S3, and import it in your Glue job.
There are alternatives like pg8000 which can also be used as an external Python library. This and this talk more about how you can package it; the same approach can also be used with the PygreSQL library. A rough sketch of the packaging steps is shown below.
Also, this has more information on how you can connect to on-premises PostgreSQL databases.
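As a sketch of that packaging route (the bucket name and paths are placeholders, and pg8000 stands in for whichever pure-Python driver you choose):

# Install the pure-Python driver into a local folder and zip it up
pip install pg8000 -t ./python_libs
cd python_libs && zip -r ../pg8000_libs.zip . && cd ..
# Upload the archive to S3
aws s3 cp pg8000_libs.zip s3://my-glue-deps-bucket/python-libs/pg8000_libs.zip
# Then point the Glue Spark job at it through the --extra-py-files special job parameter:
#   --extra-py-files s3://my-glue-deps-bucket/python-libs/pg8000_libs.zip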

Using spark with latest aws-java-sdk?

We are currently using Spark 2.1 with Hadoop 2.7.3, and I can't believe that Spark still requires aws-java-sdk version 1.7.4. We are using a Maven project, and I was wondering if there is any way to set up the libraries or my environment so that I can use Spark 2.1 along with other applications that use the latest aws-java-sdk. I guess it's the same as asking whether it's possible to set up a workflow that uses different versions of the aws-java-sdk, so that when I want to run the jar on a cluster I could just point to the latest aws-java-sdk. I know I could maintain two separate projects, one for Spark and one for pure SDK work, but I'd like to have them in the same project.
use spark 2.1 along with other applications that use the latest aws-java-sdk
You can try to use the Maven Shade Plugin when you create your JAR, then ensure the user classpath is placed before the Hadoop classpath (spark.executor.userClassPathFirst). This will ensure you're loading all the dependencies included by Maven, not what's provided by Spark; see the sketch below.
I've done this with Avro before, but I know that the AWS SDK has more to it.
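A minimal sketch of the submit side, assuming you have already built a shaded JAR (the class and jar names are placeholders):

# Prefer the user classpath (your shaded dependencies) over Spark's bundled jars
spark-submit \
  --class com.example.MyApp \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.driver.userClassPathFirst=true \
  my-app-shaded.jar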

How to Start MapReduce programs?

I have installed a single node cluster on my system (VM -> Ubuntu).
I have studied the basics of MapReduce and the Hadoop framework. How do I get started with MapReduce coding?
To understand MapReduce, please start with why Big Data is needed and how the traditional problem of JOINs, which was solved with SQL, needs a different perspective as data grows huge.
Please follow these amazing links, which helped me:
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/
http://bytepadding.com/map-reduce/
Ensure you have installed Hadoop:
1. Create a new user in Ubuntu: $ sudo adduser hduser sudo
2. Switch to 'hduser' (or whichever user id you used during your Hadoop configuration).
3. Install Java, install ssh, and extract Hadoop into /home/hduser.
4. Set the JAVA_HOME, HADOOP_HOME and PATH environment variables in the .bashrc file.
5. Create a new directory, write the program, compile the Java files, and make sure you give all the required permissions to the files.
6. The data is in the local filesystem only, so the input has to be copied into HDFS.
7. The output directory will be in HDFS; check the output there.
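A minimal command-line walk-through of the last few steps, assuming the WordCount example from the official MapReduce tutorial (file and directory names are placeholders):

# Compile against the Hadoop classpath and package the classes into a jar
javac -classpath "$(hadoop classpath)" WordCount.java
jar cf wc.jar WordCount*.class
# Copy the input from the local filesystem into HDFS
hdfs dfs -mkdir -p /user/hduser/input
hdfs dfs -put input.txt /user/hduser/input
# Run the job; the output directory is created in HDFS
hadoop jar wc.jar WordCount /user/hduser/input /user/hduser/output
# Check the output
hdfs dfs -cat /user/hduser/output/part-r-00000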

Zeppelin: how to download or save the Zeppelin notebook?

I am using the Zeppelin sandbox with AWS EMR.
Is there a way to download or save the Zeppelin notebook so that it can be imported into another Zeppelin server?
As noted in the comments above, this feature is available starting in version 0.5.6. You can find more details in the release notes. Downloading and installing this version would solve that issue.
Given that you are using EMR, it looks like you will have to work with the version available. As Samuel mentioned above, you can back up the contents of the incubator-zeppelin/notebook folder and make the transfer, for example as sketched below.
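A minimal sketch of that transfer (host names and paths are placeholders, and the notebook storage folder may differ depending on how Zeppelin was installed):

# On the EMR master node: archive the notebook storage folder
tar czf zeppelin-notebooks.tar.gz incubator-zeppelin/notebook
# Copy the archive to the target Zeppelin server
scp zeppelin-notebooks.tar.gz user@target-zeppelin-host:/tmp/
# On the target server: unpack into that server's notebook folder and restart Zeppelin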