Read a parquet file from HDFS using PyArrow

I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect().
I also know I can read a parquet file using pyarrow.parquet's read_table().
However, read_table() accepts a file path, whereas hdfs.connect() gives me a HadoopFileSystem instance.
Is it somehow possible to use just pyarrow (with libhdfs3 installed) to get hold of a parquet file/folder residing in an HDFS cluster? What I wish to get to is the to_pydict() function, so I can then pass the data along.

Try
fs = pa.hdfs.connect(...)
fs.read_parquet('/path/to/hdfs-file', **other_options)
or
import pyarrow.parquet as pq
with fs.open(path) as f:
    pq.read_table(f, **read_options)
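For example, to get all the way to the to_pydict() step mentioned in the question (a minimal sketch; the host, port, and file path are placeholders):
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder connection details and path -- adjust for your cluster
fs = pa.hdfs.connect('namenode-host', 8020)
with fs.open('/path/to/file.parquet') as f:
    table = pq.read_table(f)

# Convert the Arrow Table into a plain dict of column name -> list of values
data = table.to_pydict()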
I opened https://issues.apache.org/jira/browse/ARROW-1848 about adding some more explicit documentation about this

I tried the same via the Pydoop library with engine='pyarrow' and it worked perfectly for me. Here is the generalized method.
!pip install pydoop pyarrow

import logging
import pandas as pd
import pydoop.hdfs as hd

logger = logging.getLogger(__name__)

# read a parquet file from HDFS via Pydoop and return a pandas DataFrame
def readParquetFilesPydoop(path):
    with hd.open(path) as f:
        df = pd.read_parquet(f, engine='pyarrow')
    logger.info('file: ' + path + ' : ' + str(df.shape))
    return df
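For example (the HDFS path below is just a placeholder):
df = readParquetFilesPydoop('/user/hadoop/data/my_table.parquet')
print(df.shape)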

Related

Where is the output file created after running FileWriter on AWS EMR?

This is how I am writing to a file (Scala code):
import java.io.FileWriter
val fw = new FileWriter("my_output_filename.txt", true)
fw.write("something to write into output file")
fw.close()
This is part of a Spark job I'm running on AWS EMR. The job runs and finishes successfully. The problem is that I'm unable to locate my_output_filename.txt anywhere once it's done.
A bit more context:
What I'm trying to do: some processing on each row of a DataFrame, writing the result into a file. So it looks something like this:
myDF.collect().foreach( row => {
  import java.io.FileWriter
  val fw = new FileWriter("my_output_filename.txt", true)
  fw.write("row data to be written into file")
  fw.close()
})
How I checked:
When I ran it locally, I found the newly created file in the same directory as the code, but I couldn't find it on the remote node.
I ran find / -name "my_output_filename.txt".
I also checked in HDFS: hdfs dfs -find / -name "my_output_filename.txt"
Where can I find the output file?
Is there a better way to do this?
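One thing to keep in mind: java.io.FileWriter writes to the local filesystem of whichever JVM runs it, not to HDFS or S3, so on EMR the file ends up in the working directory of the driver (or executor) container and is easy to lose. A more robust pattern is usually to let Spark write the output itself. Here is a rough PySpark sketch of that idea (the Scala DataFrameWriter/RDD API is analogous; the S3 path and process_row function are placeholders, myDF is the question's DataFrame):
# Hypothetical per-row processing; replace with the real logic
def process_row(row):
    return "row data to be written into file"

# Let Spark write the results instead of using FileWriter on the driver.
# "s3://my-bucket/output/" is a placeholder destination (an HDFS path works too).
myDF.rdd.map(process_row).saveAsTextFile("s3://my-bucket/output/")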

Loading images in Cloud ML

This is the main code, which works on a CPU machine. It loads all images and masks from folders, resizes them, and saves them as two numpy arrays.
import numpy as np
from glob import glob
from skimage.transform import resize as imresize
from skimage.io import imread

def create_data(dir_input, img_size):
    img_files = sorted(glob(dir_input + '/images/*.jpg'))
    mask_files = sorted(glob(dir_input + '/masks/*.png'))

    X = []
    Y = []

    for img_path, mask_path in zip(img_files, mask_files):
        img = imread(img_path)
        img = imresize(img, (img_size, img_size), mode='reflect', anti_aliasing=True)

        mask = imread(mask_path)
        mask = imresize(mask, (img_size, img_size), mode='reflect', anti_aliasing=True)

        X.append(img)
        Y.append(mask)

    path_x = dir_input + '/images-{}.npy'.format(img_size)
    path_y = dir_input + '/masks-{}.npy'.format(img_size)

    np.save(path_x, np.array(X))
    np.save(path_y, np.array(Y))
Here is the Cloud Storage hierarchy:
gs://my_bucket
|
|----inputs
| |----images/
| |-----masks/
|
|----outputs
|
|----trainer
dir_input should be gs://my_bucket/inputs
This doesn't work. What is the proper way to load images from that path in the cloud, and to save the numpy arrays to the inputs folder?
Preferably with skimage, which is loaded in setup.py.
Most Python libraries such as numpy don't natively support reading from and writing to object stores like GCS or S3. There are a few options:
Copy the data to local disk first (see this answer).
Try using the GCS python SDK (docs)
Use another library, like TensorFlow's FileIO abstraction. Here's some code similar to what you're trying to do (read/write numpy arrays).
The latter is particularly useful if you are using TensorFlow, but can still be used even if you are using some other framework.
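To make the third option concrete, here is a minimal sketch that reads the images straight from GCS with TensorFlow's tf.io.gfile and writes the .npy file back. It assumes a reasonably recent TensorFlow (where tf.io.gfile is available) and only shows the images side; masks work the same way. The img_size value in the call is just an example.
import io
import numpy as np
import tensorflow as tf
from skimage.io import imread
from skimage.transform import resize as imresize

def create_data_gcs(dir_input, img_size):
    # tf.io.gfile understands gs:// paths as well as local ones
    img_files = sorted(tf.io.gfile.glob(dir_input + '/images/*.jpg'))
    X = []
    for img_path in img_files:
        with tf.io.gfile.GFile(img_path, 'rb') as f:
            img = imread(io.BytesIO(f.read()))
        img = imresize(img, (img_size, img_size), mode='reflect', anti_aliasing=True)
        X.append(img)
    # np.save accepts a file object, so the array can be written back to GCS directly
    with tf.io.gfile.GFile(dir_input + '/images-{}.npy'.format(img_size), 'wb') as f:
        np.save(f, np.array(X))

create_data_gcs('gs://my_bucket/inputs', 128)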

Pandas to PySpark giving OOM error instead of spilling to disk [duplicate]

This question already has answers here:
Why does SparkContext.parallelize use memory of the driver?
(3 answers)
Closed 5 years ago.
I have a use case where I want to iteratively load data into Pandas dataframes, do some processing using outside functions (i.e. xgboost, not shown in the example code), and then push the result into a single PySpark object (RDD or DF).
I've tried to get PySpark to spill to disk when storing data either as an RDD or a DataFrame, again where the source is a pandas DataFrame. Nothing seems to be working; I keep crashing the Java driver and can't load my data in. Alternatively, I've tried loading my data without processing, just using a basic textFile RDD, and it worked like a charm. I'm wondering if this is a PySpark bug, or whether there is a workaround.
Sample Code:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
import pyspark

try:
    SparkContext.stop(sc)
except NameError:
    pass

SparkContext.setSystemProperty('spark.executor.memory', '200g')
SparkContext.setSystemProperty('spark.driver.memory', '200g')

sc = SparkContext("local", "App Name")
sql_sc = SQLContext(sc)

chunk_100k = pd.read_csv("CData.csv", chunksize=100000)
empty_df = pd.read_csv("CData.csv", nrows=0)
infer_df = pd.read_csv("CData.csv", nrows=10).fillna('')

my_schema = SQLContext.createDataFrame(sql_sc, infer_df).schema
SparkDF = SQLContext.createDataFrame(sql_sc, empty_df, schema=my_schema)

for chunk in chunk_100k:
    SparkDF = SparkDF.union(SQLContext.createDataFrame(sql_sc, chunk, schema=my_schema))
Crashes after a few iterations with:
Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.readRDDFromFile. :
java.lang.OutOfMemoryError: Java heap space
Working direct load to RDD code:
my_rdd = sc.textFile("CData.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line) > 1) \
    .map(lambda line: (line[0], line[1]))
Update:
I have changed the code to demonstrate failure when loading into Spark DataFrames instead of RDDs, note that the issue still persists and the error message is still referencing RDDs.
Prior to changing the example code, saving to RDDs was already found to be problematic when using 'parallelize', for the following reason:
Why does SparkContext.parallelize use memory of the driver?
Create a spark-defaults.conf file in apache-spark/1.5.1/libexec/conf/ and add the following lines to it:
spark.driver.memory 45G
spark.driver.maxResultSize 10G
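Note that in local/client mode spark.driver.memory has to be set before the JVM is launched, which is why a spark-defaults.conf entry (or spark-submit --driver-memory) works where setting it from inside the script may not. Another way to sidestep the driver-memory issue altogether is to let Spark read the CSV itself rather than funnelling pandas chunks through the driver. A rough sketch, assuming the file is reachable from Spark and a Spark version with SparkSession:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("App Name").getOrCreate()

# Spark reads and partitions the file itself, so the data never has to be
# serialized through the Python driver the way parallelized pandas chunks are.
spark_df = spark.read.csv("CData.csv", header=True, inferSchema=True)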

Flink: read data from HDFS

I'm new to Flink and I am wondering how to read data from HDFS. Can anybody give me some advice or some simple examples? Thank you all.
If your files are in text format, you can use the 'readTextFile' method of the 'ExecutionEnvironment' object.
Here is an overview of the various data sources: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/batch/index.html#data-sources
Flink can read HDFS data in any of the usual formats, such as text, JSON, or Avro.
Support for Hadoop input/output formats is part of the flink-java Maven module, which is required when writing Flink jobs.
Sample 1: read a text file named JsonSeries and print it to the console
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> lines = env.readTextFile("hdfs://localhost:9000/user/hadoop/input/JsonSeries.txt")
        .name("HDFS File read");
lines.print();
Sample 2: using a Hadoop input format
DataSet<Tuple2<LongWritable, Text>> inputHadoop =
        env.createInput(HadoopInputs.readHadoopFile(new TextInputFormat(),
                LongWritable.class, Text.class, "hdfs://localhost:9000/user/hadoop/input/JsonSeries.txt"));
inputHadoop.print();
With Flink 1.13, Hadoop 3.1.2, and Java 1.8.0 on a CentOS 7 machine, I was able to read from HDFS.
HADOOP_HOME and HADOOP_CLASSPATH were already exported. I think something changed from version 1.11 on, and I couldn't find even a simple example, so I'm sharing mine.
I added the following dependency to pom.xml:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.1.2</version>
</dependency>
My Scala code:
package com.vbo.datastreamapi
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
object ReadWriteHDFS extends App {

  val env = StreamExecutionEnvironment.getExecutionEnvironment

  val stream = env.readTextFile("hdfs://localhost:9000/user/train/datasets/Advertising.csv")

  stream.print()

  env.execute("Read Write HDFS")
}

Libtorrent - Given a magnet link, how do you generate a torrent file?

I have read through the manual and I cannot find the answer. Given a magnet link, I would like to generate a torrent file so that it can be loaded on the next startup to avoid redownloading the metadata. I have tried the fast resume feature, but I still have to fetch the metadata when I do it, and that can take quite a bit of time. The examples I have seen are for creating torrent files for a new torrent, whereas I would like to create one matching a magnet URI.
Solution found here:
http://code.google.com/p/libtorrent/issues/detail?id=165#c5
See creating torrent:
http://www.rasterbar.com/products/libtorrent/make_torrent.html
Modify first lines:
file_storage fs;
// recursively adds files in directories
add_files(fs, "./my_torrent");
create_torrent t(fs);
To this:
torrent_info ti = handle.get_torrent_info();
create_torrent t(ti);
"handle" is from here:
torrent_handle add_magnet_uri(session& ses, std::string const& uri, add_torrent_params p);
Also, before creating the torrent you have to make sure that the metadata has been downloaded; do this by calling handle.has_metadata().
UPDATE
It seems the libtorrent Python API is missing some of the important C++ API required to create a torrent from a magnet link; the example above won't work in Python because the create_torrent Python class does not accept a torrent_info as a parameter (C++ has it available).
So I tried another way, but also hit a brick wall that makes it impossible. Here is the code:
if handle.has_metadata():
    torinfo = handle.get_torrent_info()

    fs = libtorrent.file_storage()
    for file in torinfo.files():
        fs.add_file(file)

    torfile = libtorrent.create_torrent(fs)
    torfile.set_comment(torinfo.comment())
    torfile.set_creator(torinfo.creator())

    for i in xrange(0, torinfo.num_pieces()):
        hash = torinfo.hash_for_piece(i)
        torfile.set_hash(i, hash)

    for url_seed in torinfo.url_seeds():
        torfile.add_url_seed(url_seed)

    for http_seed in torinfo.http_seeds():
        torfile.add_http_seed(http_seed)

    for node in torinfo.nodes():
        torfile.add_node(node)

    for tracker in torinfo.trackers():
        torfile.add_tracker(tracker)

    torfile.set_priv(torinfo.priv())

    f = open(magnet_torrent, "wb")
    f.write(libtorrent.bencode(torfile.generate()))
    f.close()
There is an error thrown on this line:
torfile.set_hash(i, hash)
It expects hash to be a const char*, but torrent_info.hash_for_piece(int) returns a big_number class, which has no API to convert it back to a const char*.
When I find some time I will report this missing API to the libtorrent developers, as it is currently impossible to create a .torrent file from a magnet URI when using the Python bindings.
torrent_info.orig_files() is also missing from the Python bindings; I'm not sure whether torrent_info.files() is sufficient.
UPDATE 2
I've created an issue on this, see it here:
http://code.google.com/p/libtorrent/issues/detail?id=294
Star it so they fix it fast.
UPDATE 3
It is fixed now; there is a 0.16.0 release. Binaries for Windows are also available.
Just wanted to provide a quick update using the modern libtorrent Python package: libtorrent now has the parse_magnet_uri method which you can use to generate a torrent handle:
import libtorrent, os, time

def magnet_to_torrent(magnet_uri, dst):
    """
    Args:
        magnet_uri (str): magnet link to convert to torrent file
        dst (str): path to the destination folder where the torrent will be saved
    """
    # Parse magnet URI parameters
    params = libtorrent.parse_magnet_uri(magnet_uri)

    # Download torrent info
    session = libtorrent.session()
    handle = session.add_torrent(params)
    print("Downloading metadata...")
    while not handle.has_metadata():
        time.sleep(0.1)

    # Create torrent and save to file
    torrent_info = handle.get_torrent_info()
    torrent_file = libtorrent.create_torrent(torrent_info)
    torrent_path = os.path.join(dst, torrent_info.name() + ".torrent")
    with open(torrent_path, "wb") as f:
        f.write(libtorrent.bencode(torrent_file.generate()))
    print("Torrent saved to %s" % torrent_path)
If saving the resume data didn't work for you, you can generate a new torrent file using the information from the existing connection.
fs = libtorrent.file_storage()
libtorrent.add_files(fs, "somefiles")
t = libtorrent.create_torrent(fs)
t.add_tracker("http://10.0.0.1:312/announce")
t.set_creator("My Torrent")
t.set_comment("Some comments")
t.set_priv(True)
libtorrent.set_piece_hashes(t, "C:\\", lambda x: 0)
f = open("mytorrent.torrent", "wb")
f.write(libtorrent.bencode(t.generate()))
f.close()
I doubt that it'll make the resume faster than the function built specifically for this purpose.
Have a look at this code: http://code.google.com/p/libtorrent/issues/attachmentText?id=165&aid=-5595452662388837431&name=java_client.cpp&token=km_XkD5NBdXitTaBwtCir8bN-1U%3A1327784186190
It uses add_magnet_uri, which I think is what you need.