Apache Pig - Determining and loading the latest dataset in a directory - hdfs

I have an HDFS location with several timestamped directories, and I need my Pig script to pick up the latest one. For example:
/projects/ABC/dailydata/20170110/
/projects/ABC/dailydata/20170115/
/projects/ABC/dailydata/20170203/ #<---- pig should pick this one
What I've tried and got working is below, but I'm wondering whether there's a cleaner way to get the latest timestamp:
sh hdfs dfs -ls /projects/ABC/dailydata/ | tail -1
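A common cleaner pattern is to resolve the latest directory outside the script and hand it to Pig as a parameter. Below is a minimal sketch of a wrapper, assuming the directory names sort chronologically as they do here (YYYYMMDD), and using daily_job.pig and LATEST_DIR as placeholder names for your own script and parameter:
# Last path wins because YYYYMMDD names sort chronologically.
LATEST_DIR=$(hdfs dfs -ls /projects/ABC/dailydata/ | awk '{print $NF}' | grep '^/projects/ABC/dailydata/' | sort | tail -1)
# The Pig script then loads from '$LATEST_DIR' instead of a hard-coded path.
pig -param LATEST_DIR="$LATEST_DIR" daily_job.pig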

Related

How to read a csv file from s3 bucket using pyspark

I'm using Apache Spark 3.1.0 with Python 3.9.6. I'm trying to read a CSV file from an AWS S3 bucket, something like this:
spark = SparkSession.builder.getOrCreate()
file = "s3://bucket/file.csv"
c = spark.read \
    .csv(file) \
    .count()
print(c)
But I'm getting the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o26.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
I understand that I need to add extra libraries, but I couldn't find any clear information on exactly which ones and which versions. I've tried adding something like this to my code, but I'm still getting the same error:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
How can I fix this?
You need to use hadoop-aws version 3.2.0 for Spark 3. Specifying the hadoop-aws library in --packages is enough to read files from S3.
--packages org.apache.hadoop:hadoop-aws:3.2.0
You also need to set the configurations below:
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<access_key>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")
After that you can read the CSV file:
spark.read.csv("s3a://bucket/file.csv")
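Equivalently, the same settings can be supplied at submit time instead of in code. A minimal sketch, assuming your application file is app.py (spark.hadoop.* properties are forwarded to the Hadoop configuration):
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.2.0 \
  --conf spark.hadoop.fs.s3a.access.key=<access_key> \
  --conf spark.hadoop.fs.s3a.secret.key=<secret_key> \
  app.py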
Thanks Mohana for the pointer! After breaking my head over this for more than a day, I was finally able to figure it out. Summarizing my learnings:
Make sure you know which version of Hadoop your Spark comes with:
print(f'pyspark hadoop version: {spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}')
or look at the bundled Hadoop jars:
ls jars/hadoop*.jar
The issue I was having was an older Spark installation from a while back that was built against Hadoop 2.7, and it was messing everything up.
This should give a brief idea of what binaries you need to download.
For me it was Spark 3.2.1 and Hadoop 3.3.1.
Hence I downloaded:
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.3.1
https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.901 # added this just in case;
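For reference, the same jars can be pulled directly with wget; this is just a sketch using the standard Maven Central layout for the versions above, so adjust the versions to match your own Hadoop build:
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.901/aws-java-sdk-bundle-1.11.901.jar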
Placed these jar files in the spark installation dir:
spark/jars/
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 runner.py
With that in place, your code snippet that reads from AWS S3 should work.

How to use a newer version of parquet jar on AWS EMR

The latest EMR 4.1.0 bundles Hive 1.0.0 and Spark 1.5.0; Hive 1.0.0 uses parquet-hadoop-bundle-1.5.0.jar while Spark uses parquet-hadoop-1.7.0.jar.
Unfortunately Parquet 1.5.0 cannot read files generated by Parquet 1.7.0.
I tried add jar parquet-hive-bundle-1.7.0.jar in the Hive shell, but no luck; Hive still used its bundled old Parquet jar.
Then I tried to replace the old jar with the newer jar, but I couldn't find any Parquet-related jars, even using the command sudo find / "*parquet*.jar".
However, when I copied parquet-hive-bundle-1.7.0.jar to /usr/lib/hive/lib it didn't work; Hive still used the old Parquet jar and couldn't read my Parquet files. Normally this approach works in the Cloudera distribution.
So my question is, where is the parquet jar and how can I replace it with a newer version?
I resolved the issue by adding the new Parquet jar's location to the HADOOP_CLASSPATH environment variable (in the environment, .profile, or .bashrc file):
export HADOOP_CLASSPATH=/var/lib/hive/parquet-hadoop-bundle-1.9.0.jar:$HADOOP_CLASSPATH
Run the command below to find the location of the hive command:
which hive
Open the hive file under /usr/bin/ (or wherever your hive lives):
vi /usr/bin/hive
You should see something like the script below. Take a backup of the hive file, then add an echo of HADOOP_CLASSPATH at the end, just before the exec line:
#!/bin/bash
if [ -d "/usr/hdp/2.5.0.0-1245/atlas/hook/hive" ]; then
  if [ -z "${HADOOP_CLASSPATH}" ]; then
    export HADOOP_CLASSPATH=/usr/hdp/2.5.0.0-1245/atlas/hook/hive/*
  else
    export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:/usr/hdp/2.5.0.0-1245/atlas/hook/hive/*
  fi
fi
...
if [ -z "${HADOOP_CLASSPATH}" ]; then
  export HADOOP_CLASSPATH=${HCATALOG_JAR_PATH}
else
  export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:${HCATALOG_JAR_PATH}
fi
echo "Classpath=$HADOOP_CLASSPATH"
exec "${HIVE_HOME}/bin/hive.distro" "$@"
Run the hive command to display the classpath.
Once we see the actual classpath, we know where the new jar has to appear in order to take priority over the old Parquet jar.
In this case, when HADOOP_CLASSPATH is set, Hive's classpath is prefixed with its value.
So putting the new Parquet jar at the front of HADOOP_CLASSPATH fixed the issue.
I don't know whether this is the right fix, but it works.
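For completeness, a small sketch of how the fix can be persisted and then verified; the jar path is the one used above, so adjust it to wherever your new Parquet jar actually lives:
echo 'export HADOOP_CLASSPATH=/var/lib/hive/parquet-hadoop-bundle-1.9.0.jar:$HADOOP_CLASSPATH' >> ~/.bashrc
source ~/.bashrc
hive --version
Because the patched /usr/bin/hive echoes "Classpath=..." before exec, any hive invocation (here just --version) shows whether the new jar now comes ahead of the old one.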

No wildcard support in hdfs dfs put command in Hadoop 2.3.0-cdh5.1.3?

I'm trying to move my daily Apache access log files to a Hive external table by copying the daily log files to the relevant HDFS folder for each month.
I tried to use a wildcard, but it seems that hdfs dfs doesn't support it? (The documentation seems to say that it should.)
Copying individual files works:
$ sudo HADOOP_USER_NAME=myuser hdfs dfs -put \
    "/mnt/prod-old/apache/log/access_log-20150102.bz2" \
    /user/myuser/prod/apache_log/2015/01/
But all of the following ones throw "No such file or directory":
$ sudo HADOOP_USER_NAME=myuser hdfs dfs -put \
    "/mnt/prod-old/apache/log/access_log-201501*.bz2" \
    /user/myuser/prod/apache_log/2015/01/
put: `/mnt/prod-old/apache/log/access_log-201501*.bz2': No such file or directory
$ sudo HADOOP_USER_NAME=myuser hdfs dfs -put \
    /mnt/prod-old/apache/log/access_log-201501* \
    /user/myuser/prod/apache_log/2015/01/
put: `/mnt/prod-old/apache/log/access_log-201501*': No such file or directory
The environment is on Hadoop 2.3.0-cdh5.1.3
I'm going to answer my own question.
So hdfs dfs -put does work with wildcards; the problem is that the input directory is not a local directory but a mounted SSHFS (FUSE) drive.
It seems that SSHFS is the one that cannot handle the wildcard characters.
Below is proof that hdfs dfs -put works just fine with wildcards when the source is the local filesystem rather than the mounted drive:
$ sudo HADOOP_USER_NAME=myuser hdfs dfs -put \
    /tmp/access_log-201501* \
    /user/myuser/prod/apache_log/2015/01/
put: '/user/myuser/prod/apache_log/2015/01/access_log-20150101.bz2': File exists
put: '/user/myuser/prod/apache_log/2015/01/access_log-20150102.bz2': File exists
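If you want to confirm where the expansion breaks down on your own setup, a quick check (just a sketch, using the paths from the question) is to see whether the shell can expand the pattern on the mount at all:
echo /mnt/prod-old/apache/log/access_log-201501*.bz2
If this prints the literal pattern instead of a list of files, the glob is not being expanded against the SSHFS mount, and copying the files to the local filesystem first (as in the /tmp example above) works around it.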

Spark - Writing into HDFS does not complete successfully

My question is similar to (Spark writing to hdfs not working with the saveAsNewAPIHadoopFile method). I am using Spark 1.1.0 on CDH 5.2.1.
I am trying to save a file to HDFS through Spark's saveAsTextFile method. The job completes successfully, but when I look at the output path I see a _temporary folder with the data files still inside it, under various task and attempt folders. This tells me Spark is marking the job as succeeded even before the files have been completely moved into the right output folder in HDFS. The same issue occurs with the saveAsParquetFile method. Please let me know if you have any idea about this.
Thanks

How to check the disk usage of /user/hadoop partition in multi-node cluster in Hadoop HDFS

I am looking for help from someone who can clarify my doubt. I have set up a 5-node cluster environment and installed Hadoop on Linux RHEL machines.
Now I need to check the disk space of the HDFS partition /user/hadoop on every machine. How do I check it?
In which partition or logical volume is this HDFS /user/hadoop partition physically allocated?
Is it possible to do cd /user/hadoop on the cluster machines?
Use du to get the disk usage:
Usage: hdfs dfs -du [-s] [-h] URI [URI …]
Displays the sizes of files and directories contained in the given directory, or the length of a file in case it's just a file.
Options:
The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files.
The -h option will format file sizes in a "human-readable" fashion (e.g. 64.0m instead of 67108864).
Example:
hdfs dfs -du -h /user/hadoop
output for me:
24.3 M /user/hadoop/test
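If you only want a single total for the directory rather than the per-child breakdown, add the -s flag described above, for example:
hdfs dfs -du -s -h /user/hadoop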