How do I overwrite the existing output in HDFS with a MapReduce program?
In Pig there are statements for this:
rmf /user/cloudera/outputfiles/citycount
STORE rel into '/user/cloudera/outputfiles/citycount';
Similarly, is there any way to achieve the same in a MapReduce program?
You can do it like this in your driver code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
String pathin = args[0];
String pathout = args[1];
// delete the output folder recursively if it already exists
fs.delete(new Path(pathout), true);
I'm trying to provide a list of files for spark to read as and when it needs them (which is why I'd rather not use boto or whatever else to pre-download all the files onto the instance and only then read them into spark "locally").
os.environ['PYSPARK_SUBMIT_ARGS'] = "--master local[3] pyspark-shell"
spark = SparkSession.builder.getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3.access.key', credentials['AccessKeyId'])
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3.secret.key', credentials['SecretAccessKey'])
spark.read.json(['s3://url/3521.gz', 's3://url/2734.gz'])
No idea what local[3] is about but without this --master flag, I was getting another exception:
Exception: Java gateway process exited before sending the driver its port number.
Now, I'm getting this:
Py4JJavaError: An error occurred while calling o37.json.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
...
Not sure what o37.json refers to here but it probably doesn't matter.
I saw a bunch of answers to similar questions suggesting an addition of flags like:
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell"
I tried prepending it and appending it to the other flag but it doesn't work.
Nor do the many variations I see in other answers and elsewhere on the internet (with different packages and versions), for example:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] --jars spark-snowflake_2.12-2.8.4-spark_3.0.jar,postgresql-42.2.19.jar,mysql-connector-java-8.0.23.jar,hadoop-aws-3.2.2,aws-java-sdk-bundle-1.11.563.jar'
A typical example of reading Parquet files from S3 is below.
Additionally, you can go through this answer to ensure the minimal structure and necessary modules are in place:
java.io.IOException: No FileSystem for scheme: s3
Read Parquet - S3
import os
import configparser

from pyspark import SparkContext
from pyspark.sql import SQLContext

os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=com.amazonaws:aws-java-sdk-bundle:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0 pyspark-shell"
sc = SparkContext.getOrCreate()
sql = SQLContext(sc)
hadoop_conf = sc._jsc.hadoopConfiguration()
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_key = config.get("****", "aws_access_key_id")
secret_key = config.get("****", "aws_secret_access_key")
session_key = config.get("****", "aws_session_token")
hadoop_conf.set("fs.s3.aws.credentials.provider", "org.apache.hadoop.fs.s3.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
hadoop_conf.set("fs.s3a.session.token", session_key)
s3_path = "s3a://xxxx/yyyy/zzzz/"
sparkDF = sql.read.parquet(s3_path)
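As a side note on the original "No FileSystem for scheme: s3" error: the hadoop-aws package registers the s3a:// scheme, which is why the path above uses s3a://. If you need plain s3:// URLs to keep working, one option (a sketch, not verified against every Hadoop version) is to map that scheme to the S3A implementation as well:
# optional: make plain "s3://" URLs resolve through the S3A filesystem too
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")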
If one uses create_dynamic_frame_from_catalog(), you supply the database name and table name, e.g. created from a Glue crawler, which effectively names a specific input file. I want to be able to do the same (name a specific input file) without the crawler and database.
I've tried using create_dynamic_frame_from_options(), but the "path" connection option doesn't allow me to name the file, apparently. Is there any way to do this?
IIUC, you want to read multiple files from a specific S3 path and have the file name in your DataFrame. You can achieve this by using the Glue job's Spark session and reading the data as a PySpark DataFrame:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import input_file_name
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
path = 's3://bucket/folder'
df = spark.read.csv(path)
df = df.withColumn('FileName', input_file_name())
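If the goal is really one specific input file, you can either pass that file's full S3 path straight to spark.read, or keep reading the folder and filter on the captured file name. A minimal sketch, assuming a hypothetical object called part-0001.csv:
# read one specific object directly (the path is hypothetical)
single_df = spark.read.csv('s3://bucket/folder/part-0001.csv')
# or filter the folder-level DataFrame on the recorded file name
single_df = df.filter(df.FileName.endswith('part-0001.csv'))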
While copying data from a local path to an HDFS sink, I am getting some garbage data in the file at the HDFS location.
My config file for Flume:
# spool.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = s1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /home/cloudera/spool_source
a1.sources.s1.channels = c1
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = flumefolder/events
a1.sinks.k1.hdfs.filetype = Datastream
#Format to be written
a1.sinks.k1.hdfs.writeFormat = Text
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
I am copying the file from the local path "/home/cloudera/spool_source" to the HDFS path "flumefolder/events".
Flume command:
flume-ng agent --conf-file spool.conf --name a1 -Dflume.root.logger=INFO,console
File "salary.txt" at local path "/home/cloudera/spool_source" is:
GR1,Emp1,Jan,31,2500
GR3,Emp3,Jan,18,2630
GR4,Emp4,Jan,31,3000
GR4,Emp4,Feb,28,3000
GR1,Emp1,Feb,15,2500
GR2,Emp2,Feb,28,2800
GR2,Emp2,Mar,31,2800
GR3,Emp3,Mar,31,3000
GR1,Emp1,Mar,15,2500
GR2,Emp2,Apr,31,2630
GR3,Emp3,Apr,17,3000
GR4,Emp4,Apr,31,3200
GR7,Emp7,Apr,21,2500
GR11,Emp11,Apr,17,2000
At the target path "flumefolder/events", the data is copied with garbage values as:
1 W��ȩGR1,Emp1,Jan,31,2500W��ȲGR3,Emp3,Jan,18,2630W��ȷGR4,Emp4,Jan,31,3000W��ȻGR4,Emp4,Feb,28,3000W��ȽGR1,Emp1,Feb,15,2500W����GR2,Emp2,Feb,28,2800W����GR2,Emp2,Mar,31,2800W����GR3,Emp3,Mar,31,3000W����GR1,Emp1,Mar,15,2500W����GR2,Emp2,
I am unable to figure out what is wrong in my configuration file spool.conf.
Flume configuration is case sensitive, so change the filetype line to fileType; the Datastream value is also case sensitive and should be DataStream:
a1.sinks.k1.hdfs.fileType = DataStream
With your current setup the default, a SequenceFile, is being used, hence the odd characters in the output.
I have read about Spark's support for gzip-kind input files here, and I wonder if the same support exists for different kind of compressed files, such as .zip files. So far I have tried computing a file compressed under a zip file, but Spark seems unable to read its contents successfully.
I have taken a look to Hadoop's newAPIHadoopFile and newAPIHadoopRDD, but so far I have not been able to get anything working.
In addition, Spark supports creating a partition for every file under a specified folder, like in the example below:
SparkConf SpkCnf = new SparkConf().setAppName("SparkApp")
.setMaster("local[4]");
JavaSparkContext Ctx = new JavaSparkContext(SpkCnf);
JavaRDD<String> FirstRDD = Ctx.textFile("C:\\input\\").cache();
Where C:\input\ points to a directory with multiple files.
In the case computing zipped files would be possible, would it also be possible to pack every file under a single compressed file and follow the same pattern of one partition per file?
Spark supports compressed files by default
According to the Spark Programming Guide:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
This could be expanded by providing information about what compression formats are supported by Hadoop, which basically can be checked by finding all classes extending CompressionCodec (docs)
name | ext | codec class
-------------------------------------------------------------
bzip2 | .bz2 | org.apache.hadoop.io.compress.BZip2Codec
default | .deflate | org.apache.hadoop.io.compress.DefaultCodec
deflate | .deflate | org.apache.hadoop.io.compress.DeflateCodec
gzip | .gz | org.apache.hadoop.io.compress.GzipCodec
lz4 | .lz4 | org.apache.hadoop.io.compress.Lz4Codec
snappy | .snappy | org.apache.hadoop.io.compress.SnappyCodec
Source : List the available hadoop codecs
So the above formats, and many more possibilities, can be handled simply by calling:
sc.textFile(path)
Reading zip files in Spark
Unfortunately, zip is not on the supported list by default.
I have found a great article: Hadoop: Processing ZIP files in Map/Reduce and some answers (example) explaining how to use imported ZipFileInputFormat together with sc.newAPIHadoopFile API. But this did not work for me.
My solution
Without any external dependencies, you can load your file with sc.binaryFiles and later on decompress the PortableDataStream reading the content. This is the approach I have chosen.
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD
implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {
def readFile(path: String,
minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {
if (path.endsWith(".zip")) {
sc.binaryFiles(path, minPartitions)
.flatMap { case (name: String, content: PortableDataStream) =>
val zis = new ZipInputStream(content.open)
// this solution works only for single file in the zip
val entry = zis.getNextEntry
val br = new BufferedReader(new InputStreamReader(zis))
Stream.continually(br.readLine()).takeWhile(_ != null)
}
} else {
sc.textFile(path, minPartitions)
}
}
}
Using this implicit class, you need to import it and call the readFile method on SparkContext:
import com.github.atais.spark.Implicits.ZipSparkContext
sc.readFile(path)
And the implicit class will load your zip file properly and return RDD[String] like it used to.
Note: This only works for single file in the zip archive!
For multiple files in your zip support, check this answer: https://stackoverflow.com/a/45958458/1549135
Since Apache Spark uses Hadoop input formats, we can look at the Hadoop documentation on how to process zip files and see if there is something that works.
This site gives us an idea of how to use this (namely we can use the ZipFileInputFormat). That being said, since zip files are not splittable (see this), your request to have a single compressed file isn't really well supported. Instead, if possible, it would be better to have a directory containing many separate zip files.
This question is similar to this other question, however it adds the additional question of whether it would be possible to have a single zip file (which, since it isn't a splittable format, isn't a good idea).
You can use sc.binaryFiles to open the zip file in binary format, then unzip it into text format. Unfortunately, the zip file is not splittable, so you need to wait for the decompression and then perhaps repartition to balance the data across partitions.
Here is an example in Python. More info is at http://gregwiki.duckdns.org/index.php/2016/04/11/read-zip-file-in-spark/
import io
import zipfile

file_RDD = sc.binaryFiles( HDFS_path + data_path )
def Zip_open( binary_stream_string ) : # New version, treat a stream as zipped file
try :
pseudo_file = io.BytesIO( binary_stream_string )
zf = zipfile.ZipFile( pseudo_file )
return zf
except :
return None
def read_zip_lines(zipfile_object) :
file_iter = zipfile_object.open('diff.txt')
data = file_iter.readlines()
return data
My_RDD = file_RDD.map(lambda kv: (kv[0], Zip_open(kv[1])))
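To actually get at the text, a small follow-up sketch (assuming, as in the example above, that each archive contains an entry named diff.txt) feeds the opened archives through read_zip_lines and then rebalances the result:
# drop archives that failed to open, expand each into its lines, then repartition to spread the data
lines_RDD = My_RDD.filter(lambda kv: kv[1] is not None) \
                  .flatMap(lambda kv: read_zip_lines(kv[1])) \
                  .repartition(sc.defaultParallelism)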
You can use sc.binaryFiles to read the zip as a binary file:
val rdd = sc.binaryFiles(path).map {
  case (name: String, content: PortableDataStream) => new ZipInputStream(content.open)
} //=> RDD[ZipInputStream]
And then you can map each ZipInputStream to a list of lines:
val zis = rdd.first
val entry = zis.getNextEntry
val br = new BufferedReader(new InputStreamReader(zis, "UTF-8"))
val res = Stream.continually(br.readLine()).takeWhile(_ != null).toList
But the problem remains that the zip file is not splittable.
Below is an example which searches a directory for .zip files and creates an RDD using a custom FileInputFormat called ZipFileInputFormat and the newAPIHadoopFile API on the SparkContext. It then writes those files to an output directory.
allzip.foreach { x =>
  val zipFileRDD = sc.newAPIHadoopFile(
    x.getPath.toString,
    classOf[ZipFileInputFormat],
    classOf[Text],
    classOf[BytesWritable], hadoopConf)

  zipFileRDD.foreach { y =>
    ProcessFile(y._1.toString, y._2)
  }
}
https://github.com/alvinhenrick/apache-spark-examples/blob/master/src/main/scala/com/zip/example/Unzip.scala
The ZipFileInputFormat used in the example can be found here: https://github.com/cotdp/com-cotdp-hadoop/tree/master/src/main/java/com/cotdp/hadoop
I have a computer on a LAN connection. I need to transfer data from that system to another system's HDFS location using Flume.
I have tried using the IP address of the sink system, but it didn't work. Please help.
This can be achieved by using the Avro mechanism.
Flume has to be installed on both machines. A config file with the following contents has to be created and run on the source system, where the logs are generated:
a1.sources = tail-file
a1.channels = c1
a1.sinks=avro-sink
a1.sources.tail-file.channels = c1
a1.sinks.avro-sink.channel = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.sources.tail-file.type = spooldir
a1.sources.tail-file.spoolDir =<location of spool directory>
a1.sources.tail-file.channels = c1
a1.sinks.avro-sink.type = avro
a1.sinks.avro-sink.hostname = <IP Address of destination system where the data has to be written>
a1.sinks.avro-sink.port = 11111
A config file with the following contents has to be created and run on the destination system, where the data has to be written to HDFS:
a2.sources = avro-collection-source
a2.sinks = hdfs-sink
a2.channels = mem-channel
a2.sources.avro-collection-source.channels = mem-channel
a2.sinks.hdfs-sink.channel = mem-channel
a2.channels.mem-channel.type = memory
a2.channels.mem-channel.capacity = 1000
a2.sources.avro-collection-source.type = avro
# bind to all interfaces so the remote Avro sink can connect, and use the same port as that sink
a2.sources.avro-collection-source.bind = 0.0.0.0
a2.sources.avro-collection-source.port = 11111
a2.sinks.hdfs-sink.type = hdfs
a2.sinks.hdfs-sink.hdfs.writeFormat = Text
a2.sinks.hdfs-sink.hdfs.filePrefix = testing
a2.sinks.hdfs-sink.hdfs.path = hdfs://localhost:54310/user/hduser/
Now, the data from the log files in the source system will be written to HDFS on the destination system.
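To run this, start an agent on each machine with flume-ng, pointing it at the corresponding config file (the file names below are placeholders); the destination agent with the Avro source is typically started first so the sink has something to connect to:
flume-ng agent --conf-file avro-collector.conf --name a2 -Dflume.root.logger=INFO,console    # on the destination machine
flume-ng agent --conf-file avro-source.conf --name a1 -Dflume.root.logger=INFO,console       # on the source machine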