Zip support in Apache Spark - compression

I have read about Spark's support for gzip input files here, and I wonder if the same support exists for other kinds of compressed files, such as .zip files. So far I have tried reading a file compressed inside a zip archive, but Spark seems unable to read its contents successfully.
I have taken a look at Hadoop's newAPIHadoopFile and newAPIHadoopRDD, but so far I have not been able to get anything working.
In addition, Spark supports creating a partition for every file under a specified folder, like in the example below:
SparkConf SpkCnf = new SparkConf().setAppName("SparkApp")
                                  .setMaster("local[4]");
JavaSparkContext Ctx = new JavaSparkContext(SpkCnf);
JavaRDD<String> FirstRDD = Ctx.textFile("C:\\input\\").cache();
Where C:\input\ points to a directory with multiple files.
If processing zipped files is possible, would it also be possible to pack every file into a single compressed archive and keep the same pattern of one partition per file?

Spark default support for compressed files
According to the Spark Programming Guide:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
This can be expanded with the compression formats supported by Hadoop, which can basically be checked by finding all classes that extend CompressionCodec (docs):
name | ext | codec class
-------------------------------------------------------------
bzip2 | .bz2 | org.apache.hadoop.io.compress.BZip2Codec
default | .deflate | org.apache.hadoop.io.compress.DefaultCodec
deflate | .deflate | org.apache.hadoop.io.compress.DeflateCodec
gzip | .gz | org.apache.hadoop.io.compress.GzipCodec
lz4 | .lz4 | org.apache.hadoop.io.compress.Lz4Codec
snappy | .snappy | org.apache.hadoop.io.compress.SnappyCodec
Source : List the available hadoop codecs
So the formats above, and many more, can be read simply by calling:
sc.textFile(path)
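For example, a minimal PySpark sketch of this (the paths are just placeholders):
rdd_txt = sc.textFile("/my/directory/*.txt")  # plain text files
rdd_gz = sc.textFile("/my/directory/*.gz")    # gzip files, decompressed transparently
print(rdd_gz.count())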
Reading zip files in Spark
Unfortunately, zip is not on the supported list by default.
I have found a great article, Hadoop: Processing ZIP files in Map/Reduce, and some answers (example) explaining how to use an imported ZipFileInputFormat together with the sc.newAPIHadoopFile API. But this did not work for me.
My solution
Without any external dependencies, you can load your file with sc.binaryFiles and then decompress the resulting PortableDataStream to read its content. This is the approach I have chosen.
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

    if (path.endsWith(".zip")) {
      sc.binaryFiles(path, minPartitions)
        .flatMap { case (name: String, content: PortableDataStream) =>
          val zis = new ZipInputStream(content.open)
          // this solution works only for single file in the zip
          val entry = zis.getNextEntry
          val br = new BufferedReader(new InputStreamReader(zis))
          Stream.continually(br.readLine()).takeWhile(_ != null)
        }
    } else {
      sc.textFile(path, minPartitions)
    }
  }
}
Using this implicit class, you need to import it and call the readFile method on SparkContext:
import com.github.atais.spark.Implicits.ZipSparkContext
sc.readFile(path)
And the implicit class will load your zip file properly and return an RDD[String], just like sc.textFile does.
Note: This only works for a single file in the zip archive!
If you need support for multiple files in your zip, check this answer: https://stackoverflow.com/a/45958458/1549135
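For reference, a rough PySpark sketch of the same binaryFiles idea that iterates over every entry in each archive (untested; names and paths are illustrative):
import io
import zipfile

def zip_entries_to_lines(kv):
    name, content = kv  # binaryFiles yields (file name, file content as bytes)
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for entry in zf.namelist():
            with zf.open(entry) as fh:
                for line in io.TextIOWrapper(fh, encoding="utf-8"):
                    yield line.rstrip("\n")

lines = sc.binaryFiles("/path/to/archives/*.zip").flatMap(zip_entries_to_lines)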

Since Apache Spark uses Hadoop input formats, we can look at the Hadoop documentation on how to process zip files and see if there is something that works.
This site gives us an idea of how to use this (namely, we can use the ZipFileInputFormat). That being said, since zip files are not splittable (see this), your request to have a single compressed file isn't really well supported. Instead, if possible, it would be better to have a directory containing many separate zip files.
This question is similar to this other question, however it adds the additional question of whether it would be possible to have a single zip file (which, since zip isn't a splittable format, isn't a good idea).
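To illustrate the "directory of many separate zip files" layout, a rough PySpark sketch (paths are placeholders and unzip_to_lines is a hypothetical helper, e.g. along the lines of the Python answer further down):
zips = sc.binaryFiles("hdfs:///data/zips/")       # one (path, bytes) record per archive
lines = zips.flatMap(unzip_to_lines)              # decompress on the executors
lines = lines.repartition(sc.defaultParallelism)  # rebalance, since archives are not splittable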

You can use sc.binaryFiles to open the zip file in binary format, then unzip it into text format. Unfortunately, the zip file is not splittable, so you need to wait for the decompression and then perhaps repartition (shuffle) to balance the data across partitions.
Here is an example in Python. More info is at http://gregwiki.duckdns.org/index.php/2016/04/11/read-zip-file-in-spark/
import io
import zipfile

file_RDD = sc.binaryFiles(HDFS_path + data_path)

def Zip_open(binary_stream_string):  # treat the byte string from binaryFiles as a zipped file
    try:
        pseudo_file = io.BytesIO(binary_stream_string)
        zf = zipfile.ZipFile(pseudo_file)
        return zf
    except Exception:
        return None

def read_zip_lines(zipfile_object):
    file_iter = zipfile_object.open('diff.txt')  # 'diff.txt' is the entry to read from each archive
    data = file_iter.readlines()
    return data

My_RDD = file_RDD.map(lambda kv: (kv[0], Zip_open(kv[1])))
# then extract the lines and rebalance the partitions, since zips are not splittable
Lines_RDD = My_RDD.flatMap(lambda kv: read_zip_lines(kv[1]) if kv[1] else []).repartition(sc.defaultParallelism)

You can use sc.binaryFiles to read the zip as a binary file:
val rdd = sc.binaryFiles(path).map {
  case (name: String, content: PortableDataStream) => new ZipInputStream(content.open)
} //=> RDD[ZipInputStream]
And then you can map the ZipInputStream to list of lines:
val zis = rdd.first
val entry = zis.getNextEntry
val br = new BufferedReader(new InputStreamReader(zis, "UTF-8"))
val res = Stream.continually(br.readLine()).takeWhile(_ != null).toList
But the problem remains that the zip file is not splittable.

Below is an example which searches a directory for .zip files and creates an RDD using a custom FileInputFormat called ZipFileInputFormat and the newAPIHadoopFile API on the SparkContext. It then writes those files to an output directory.
allzip.foreach { x =>
  val zipFileRDD = sc.newAPIHadoopFile(
    x.getPath.toString,
    classOf[ZipFileInputFormat],
    classOf[Text],
    classOf[BytesWritable], hadoopConf)
  zipFileRDD.foreach { y =>
    ProcessFile(y._1.toString, y._2)
  }
}
https://github.com/alvinhenrick/apache-spark-examples/blob/master/src/main/scala/com/zip/example/Unzip.scala
The ZipFileInputFormat used in the example can be found here: https://github.com/cotdp/com-cotdp-hadoop/tree/master/src/main/java/com/cotdp/hadoop

Related

opensmile in Python save as .arff

I am using Python with the library opensmile. My target is to generate *.arff files to use in Weka3 ML-Tool.
My problem is that it is rather unclear to me how to save the extracted features into an *.arff file.
for example:
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
y = smile.process_file('audio.wav')
# TODO: save y in arff
It should be possible since there are questions about the generated files, e.g. here.
However, I can't find anything specific about that.
Instead of generating ARFF directly, you could generate a CSV file in a format that Weka can load:
import csv
import pandas as pd
import opensmile

# configure feature generation
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# the audio files to generate features from
audio_files = [
    '000000.wav',
    '000001.wav',
    '000002.wav',
]

# generate features
ys = []
for audio_file in audio_files:
    y = smile.process_file(audio_file)
    ys.append(y)

# combine features and save as CSV
data = pd.concat(ys)
data.to_csv('audio.csv', quotechar='\'', quoting=csv.QUOTE_NONNUMERIC)
As a second (and optional) step, convert the CSV file to ARFF using the CSVLoader class from the command-line:
java -cp weka.jar weka.core.converters.CSVLoader audio.csv > audio.arff
NB: You will need to adjust the paths to audio files, weka.jar, CSV and ARFF file to fit your environment, of course.
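If you would rather skip the CSV step, here is a rough sketch of writing the ARFF file by hand from the combined DataFrame (this assumes all features are numeric and that the column names are plain enough not to need ARFF quoting):
def to_arff(df, path, relation='opensmile'):
    with open(path, 'w') as f:
        f.write('@relation %s\n\n' % relation)
        for col in df.columns:
            f.write('@attribute %s numeric\n' % col)
        f.write('\n@data\n')
        for _, row in df.iterrows():
            f.write(','.join(str(v) for v in row.values) + '\n')

to_arff(data, 'audio.arff')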

Uploading multiple files to Google Cloud Storage via Python Client Library

The GCP python docs have a script with the following function:
def upload_pyspark_file(project_id, bucket_name, filename, file):
    """Uploads the PySpark file in this directory to the configured
    input bucket."""
    print('Uploading pyspark file to GCS')
    client = storage.Client(project=project_id)
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(filename)
    blob.upload_from_file(file)
I've created an argument parsing function in my script that takes in multiple arguments (file names) to upload to a GCS bucket. I'm trying to adapt the above function to parse those multiple args and upload those files, but am unsure how to proceed. My confusion is with the 'filename' and 'file' variables above. How can I adapt the function for my specific purpose?
I don't suppose you're still looking for something like this?
from google.cloud import storage
import os

files = os.listdir('data-files')

client = storage.Client.from_service_account_json('cred.json')
bucket = client.get_bucket('xxxxxx')

def upload_pyspark_file(filename, file):
    # """Uploads the PySpark file in this directory to the configured
    # input bucket."""
    # print('Uploading pyspark file to GCS')
    # client = storage.Client(project=project_id)
    # bucket = client.get_bucket(bucket_name)
    print('Uploading from', file, 'to', filename)
    blob = bucket.blob(filename)
    blob.upload_from_filename(file)  # file is a local path, so upload by filename

for f in files:
    upload_pyspark_file(f, os.path.join('data-files', f))
The difference between file and filename is, as you may have guessed, that file is the source path and filename is the destination name in the bucket.
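Since the question mentions an argument-parsing function, here is a rough sketch of how the multiple file names could be wired in (bucket name, credentials and paths are placeholders):
import argparse
import os
from google.cloud import storage

parser = argparse.ArgumentParser()
parser.add_argument('files', nargs='+', help='local files to upload')
parser.add_argument('--bucket', required=True, help='destination GCS bucket')
args = parser.parse_args()

client = storage.Client()
bucket = client.get_bucket(args.bucket)
for path in args.files:
    blob = bucket.blob(os.path.basename(path))  # destination object name
    blob.upload_from_filename(path)             # source is the local path
    print('Uploaded', path, 'to', blob.name)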

ZIP Archive Created Within ZIP Archive

I recently wrote a python script to select certain files within a directory and save them to a new archive within that directory. The script works with the exception that it creates a duplicate archive within the new archive. I think it has something to do with the arcname I used and the loop but I'm really not sure. As I'm sure is obvious by looking at my code I am a beginner so I am sure there is plenty of room for improvement here. Any ideas as to where the problem is? Also if you have any suggestions for improving the code I'm all ears.
import os, arcpy, zipfile

inputfc = arcpy.GetParameterAsText(0)  # User inputs feature class path
desc = arcpy.Describe(inputfc)
fcname = desc.basename
zname = fcname + ".zip"
gpath = os.path.dirname(inputfc)
zpath = os.path.join(gpath, zname)

zfile = zipfile.ZipFile(zpath, "w")
for f in os.listdir(gpath):
    fpath = os.path.join(gpath, f)
    if f.startswith(fcname):
        zfile.write(fpath, f, compress_type=zipfile.ZIP_DEFLATED)
zfile.close()
Edit: After aruisdante answered my question I decided to just change the zname variable to
zname = "zip" + fcname + ".zip" #ugly but it worked thanks
This:
zfile = zipfile.ZipFile(zpath, "w")
Creates a new Zip file at zpath
for f in os.listdir(gpath):
Iterates through all of the files at gpath. Since gpath is also the root of zpath, the zip file you just created will be one of the files in gpath, so it gets included in the archive. You will need to exclude it:
for f in (filename for filename in os.listdir(gpath) if filename != zname):
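Putting it together, a sketch of the corrected loop with the new archive excluded (untested against arcpy, but the zipfile part is standard library only):
zfile = zipfile.ZipFile(zpath, "w")
for f in os.listdir(gpath):
    if f.startswith(fcname) and f != zname:  # skip the archive we are writing
        fpath = os.path.join(gpath, f)
        zfile.write(fpath, f, compress_type=zipfile.ZIP_DEFLATED)
zfile.close()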

Beginner, python - how to read a list from a file

I have a Word document that is literally a list of lists, that is 8 pages long. Eg:
[['WTCS','Dec 21'],['THWD','Mar 22']...]
I am using Linux Mint, Python 3.2 and the IDLE interface, plus my own .py programs. I need to read and reference this list frequently and when I stored it inside .py programs it seemed to slow down the window considerably as I was editing code. How can I store this information in a separate file and read it into python? I have it in a .txt file now and tried the following code:
def readlist():
    f = open(r'/home/file.txt', 'r')
    info = list(f.read())
    return(info)
but I get each character as an element of a list. I also tried info = f.read() but I get a string. Thanks!
You can convert the Python list that you read from the text file as a string into an actual list using the ast module:
>>> import ast
>>> s = "[['WTCS','Dec 21'],['THWD','Mar 22']]"
>>> ast.literal_eval(s)
[['WTCS', 'Dec 21'], ['THWD', 'Mar 22']]
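So the readlist function from the question could become something along these lines (same path as in the question):
import ast

def readlist():
    with open('/home/file.txt', 'r') as f:
        return ast.literal_eval(f.read())

info = readlist()  # info is now a list of lists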

Getting "new-line character seen in unquoted field" when parsing csv document using django-storages

I am trying to parse csv files that have been uploaded to Amazon S3 using django-storages. I keep getting "Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?". The normal workaround for this is to open the file with "rU", but that does not seem to work with django-storages. If I drop the file directly on the server and open it from there it works; I just want to avoid storing the files directly on the server if possible. Here is the code I am using:
import csv
from django.core.files.storage import default_storage as s3_storage
n = 'csvdumps/130331548894.csv'
csvf = s3_storage.open(n, "rU")
csvReader = csv.reader(csvf)
for item in csvReader:
    print item
I can see that this is a reported django-storages bug here http://jgrid.org/david/django-storages/issue/80/trying-to-parse-csv-file-from-django but perhaps you can try reading the whole content and splitting the lines yourself:
csvf = s3_storage.open(n, "r")
csvReader = csv.reader(csvf.read().splitlines())
Would also be great if you could share a link to access some of your S3 (sample) csv files though so I can open them to check the line endings.