I'm using Python with the pyarrow library and I'd like to write a pandas DataFrame to HDFS. Here is the code I have:
import pandas as pd
import pyarrow as pa
fs = pa.hdfs.connect(namenode, port, username, kerb_ticket)
df = pd.DataFrame(...)
table = pa.Table.from_pandas(df)
According to the documentation, I should use the following code to write a pyarrow.Table to HDFS:
import pyarrow.parquet as pq
pq.write_parquet(table, 'filename.parquet')
What I don't understand is where I should use my connection (fs): if I don't use it in the write call, then how does it know where HDFS is?
Based on the document: https://arrow.apache.org/docs/python/api/formats.html#parquet-files
you can use either the write_table or the write_to_dataset function:
write_table
write_table takes multiple parameters; a few of them are below:
table -> pyarrow.Table
where -> this can be a string (path) or a file object
filesystem -> default is None
Example
pq.write_table(table, path, filesystem = fs)
or
with fs.open(path, 'wb') as f:
    pq.write_table(table, f)
write_to_dataset
You can use write_to_dataset in case you want to partition data based on a certain column in the table. Example:
pq.write_to_dataset(table, path, filesystem = fs, partition_cols = [col1])
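As a sanity check, the partitioned output can be read back through the same connection. A minimal sketch, reusing the fs, path and col1 placeholders from the examples above:

import pyarrow.parquet as pq

# ParquetDataset discovers the partition directories created by write_to_dataset
# and exposes the partition column (col1) as a regular column when read back.
dataset = pq.ParquetDataset(path, filesystem=fs)
df_roundtrip = dataset.read().to_pandas()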
You can do this
with fs.open(path, 'wb') as f:
    pq.write_table(table, f)
I opened a JIRA about adding some more documentation about this
https://issues.apache.org/jira/browse/ARROW-6239
Related
If one uses create_dynamic_frame_from_catalog(), you supply the database name and table name, e.g. created from a Glue crawler, which effectively names a specific input file. I want to be able to do the same (name a specific input file) without the crawler and database.
I've tried using create_dynamic_frame_from_options(), but the "path" connection option doesn't allow me to name the file, apparently. Is there any way to do this?
IIUC, you want to read multiple files from a specific S3 path and want the filename in your dataframe. You can achieve this by using the Spark session and reading the data as a PySpark DataFrame:
from pyspark.context import SparkContext
from pyspark.sql.functions import input_file_name
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
path = 's3://bucket/folder'
df = spark.read.csv(path)
df = df.withColumn('FileName', input_file_name())
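If you need a Glue DynamicFrame afterwards (e.g. for Glue writers or transforms), the DataFrame can be converted back. A small sketch, assuming the glueContext from the snippet above:

from awsglue.dynamicframe import DynamicFrame

# wrap the PySpark DataFrame (including its FileName column) as a DynamicFrame
dyf = DynamicFrame.fromDF(df, glueContext, "df_with_filename")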
I have pyspark dataframe df containing paths to text files. I want to create a new column with the content of the text files.
import pyspark.sql.functions as F
from pyspark.sql.types import *
def read_file(filepath):
    import s3fs
    s3 = s3fs.S3FileSystem()
    with s3.open(filepath) as f:
        return f.read()
read_file_udf = F.udf(read_file, StringType())
df.withColumn('raw_text', read_file_udf('filepath')).show()
+---------------------+-----------+
| file | raw_text|
+---------------------+-----------+
|s3://bucket/file1.txt| [B#aa2a4f3|
|s3://bucket/file2.txt|[B#138664c5|
|s3://bucket/file3.txt| [B#3bcc67e|
|s3://bucket/file4.txt|[B#70b735c4|
|s3://bucket/file5.txt|[B#6fad821d|
+---------------------+-----------+
Instead of getting the actual file content, I am getting these strange [B# codes. What are they, why am I getting them and how do I fix this?
To answer my own question... I was getting [B# because the read_file() function was returning the byte representation of a string. Defining:
def read_file(filepath):
    import s3fs
    s3 = s3fs.S3FileSystem()
    with s3.open(filepath) as f:
        return f.read().decode("utf-8")
will fix the issue.
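Putting the corrected function together with the UDF registration from the question, the full flow would look roughly like this:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def read_file(filepath):
    import s3fs
    s3 = s3fs.S3FileSystem()
    with s3.open(filepath) as f:
        # decode the raw bytes so the UDF returns a proper string
        return f.read().decode("utf-8")

read_file_udf = F.udf(read_file, StringType())
df = df.withColumn('raw_text', read_file_udf('filepath'))
df.show()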
I'm trying to read all YAML files in a directory, but I am having trouble, partly because I am using Python 2.7 (and I cannot change to 3) and all of my files are UTF-8 (and they also need to stay that way).
import os
import yaml
import codecs
def yaml_reader(filepath):
    with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
        data = yaml.load_all(file_descriptor)
        return data

def yaml_dump(filepath, data):
    with open(filepath, 'w') as file_descriptor:
        yaml.dump(data, file_descriptor)

if __name__ == "__main__":
    filepath = os.listdir(os.getcwd())
    data = yaml_reader(filepath)
    print data
When I run this code, python gives me the message:
TypeError: coercing to Unicode: need string or buffer, list found.
I want this program to show the content of the files. Can anyone help me?
I guess the issue is with filepath. os.listdir(os.getcwd()) returns the list of all the files in the directory, so you are passing a list to codecs.open() instead of a single filename.
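A minimal sketch of fixing that, reusing the yaml_reader from the question and passing one filename at a time:

import os

for filename in os.listdir(os.getcwd()):
    # only hand individual YAML filenames to the reader
    if filename.endswith(('.yaml', '.yml')):
        data = yaml_reader(filename)
        print data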
There are multiple problems with your code, apart from the fact that, in the way you formatted it, it is invalid Python. Your reader explicitly decodes the file to Unicode:
def yaml_reader(filepath):
    with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
        data = yaml.load_all(file_descriptor)
        return data
However, it is not necessary to do the decoding yourself; PyYAML is perfectly capable of processing UTF-8:
def yaml_reader(filepath):
    with open(filepath, "rb") as file_descriptor:
        data = yaml.load_all(file_descriptor)
        return data
I hope you realise that you are loading multiple documents and will always get a list as a result in data, even if your file contains only one document.
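To illustrate, a tiny sketch with a made-up two-document YAML string:

import yaml

sample = "a: 1\n---\nb: 2\n"
# load_all yields one object per document; materialising it shows the list
docs = list(yaml.load_all(sample))
print docs   # [{'a': 1}, {'b': 2}]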
Then the line:
filepath = os.listdir(os.getcwd())
gives you a list of files, so you need to do:
filepath = os.listdir(os.getcwd())[0]
or decide in some other way which of the files you want to open. If you want to combine all the files (assuming they are YAML) into one big YAML file, you need to do:
if __name__ == "__main__":
    data = []
    for filepath in os.listdir(os.getcwd()):
        data.extend(yaml_reader(filepath))
    print data
And your dump routine would need to change to:
def yaml_dump(filepath, data):
    with open(filepath, 'wb') as file_descriptor:
        yaml.dump(data, file_descriptor, allow_unicode=True, encoding='utf-8')
However, this all brings you to the biggest problem: you are using PyYAML, which will mangle your YAML, dropping flow style, comments, anchor names, special ints/floats, quotes around scalars, etc. Apart from that, PyYAML has not been updated to support YAML 1.2 documents (which has been the standard since 2009). I recommend you switch to ruamel.yaml (disclaimer: I am the author of that package), which supports YAML 1.2 and leaves comments etc. in place.
And even if you are bound to Python 2, you should use Python-3-like syntax, e.g. for print, which you can get with from __future__ imports.
So I recommend you do:
pip install pathlib2 ruamel.yaml
and then use:
from __future__ import absolute_import, unicode_literals, print_function
from pathlib import Path
from ruamel.yaml import YAML
if __name__ == "__main__":
    data = []
    yaml = YAML()
    yaml.preserve_quotes = True
    for filepath in Path('.').glob('*.yaml'):
        data.extend(yaml.load_all(filepath))
    print(data)
    yaml.dump(data, Path('your_output.yaml'))
I'm going to connect to an S3 bucket, get the CSV files and copy the rows to an RDS DB. In this script we are using arcpy; I'm not that familiar with this package, and I'm just trying to get the CSV file directly from the S3 bucket as the source, without downloading it onto the server. The code is as follows:
import arcpy
from boto.s3.key import Key
import StringIO
import pandas as pd
import boto
import boto.s3.connection
access_key = ''
secret_key = ''
conn = boto.connect_s3(aws_access_key_id = access_key,aws_secret_access_key = secret_key,host = 's3.amazonaws.com')
b = conn.get_bucket('mybucket')
#for key in b.list:
b_key = b.get_key('file1.csv')
arcpy.env.overwriteOutput = True
b_url = b_key.generate_url(0, query_auth=False, force_http=True)
print b_url
##Read file
k = Key(b, 'file1.csv')
content = k.get_contents_as_string()
sourcefile_csv = pd.read_csv(StringIO.StringIO(content))
##CopyRows_management (in_rows, out_table, {config_keyword})
#http://pro.arcgis.com/en/pro-app/tool-reference/data-management/copy-rows.htm
arcpy.CopyRows_management(sourcefile_csv, "RDSTablePath", "")
print("copy rows done")
Error in CopyRows: arcgisscripting.ExecuteError: Failed to execute. Parameters are not valid.
If we use a path on the server as source path like below it works fine:
sourcefile_csv = "D:\\DEV\\file1.csv"
arcpy.CopyRows_management(sourcefile_csv, "RDSTablePath", "")
Any help would be appreciated.
It looks like you are trying to use the pandas DataFrame as the table that CopyRows_management reads from? I don't think that is a valid input for the function, hence the "Parameters are not valid" error. The documentation says that in_rows should be "The rows from a feature class, layer, table, or table view to be copied." I think the use of pandas is unnecessary here anyway.
So either save the csv somewhere that the script can access it (as you did when you used the path on the server) or, if you don't want to save the file anywhere, just read the contents of the csv and iterate through it using an Insert Cursor to write it to your table/feature class.
See this post on how to read a csv from a string using the csv module. Then just loop through the rows of the csv and use the Insert Cursor to write to your table.
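A hedged sketch of that approach (the field names are taken from the CSV header here and are assumed to match the target table's schema; "RDSTablePath" is the same placeholder as above):

import csv
import StringIO

# parse the CSV text that was already read from S3 into `content`
rows = csv.reader(StringIO.StringIO(content))
header = next(rows)

# write each data row into the target table with an insert cursor
with arcpy.da.InsertCursor("RDSTablePath", header) as cursor:
    for row in rows:
        cursor.insertRow(row)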
If your RDS happens to be Aurora MySQL then you should take a look at the Loading Data from S3 feature, which lets you skip the code and load the file straight into your DB, line by line.
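For the Aurora MySQL route, a minimal sketch issuing the load from Python; the endpoint, credentials and table name are placeholders, and the cluster is assumed to have an IAM role that allows it to read from the bucket:

import pymysql

conn = pymysql.connect(host='your-aurora-endpoint', user='user',
                       password='password', db='yourdb')
with conn.cursor() as cur:
    # Aurora MySQL reads the file server-side, so nothing is downloaded locally
    cur.execute(
        "LOAD DATA FROM S3 's3://mybucket/file1.csv' "
        "INTO TABLE your_table "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
        "IGNORE 1 LINES"
    )
conn.commit()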
I'm trying to get data from a zipped CSV file. Is there a way to do this without unzipping the whole archive? If not, how can I unzip the files and read them efficiently?
I used the zipfile module to import the ZIP directly to pandas dataframe.
Let's say the file name is "intfile" and it's inside a .zip named "THEZIPFILE":
import pandas as pd
import zipfile
zf = zipfile.ZipFile('C:/Users/Desktop/THEZIPFILE.zip')
df = pd.read_csv(zf.open('intfile.csv'))
If you aren't using Pandas it can be done entirely with the standard lib. Here is Python 3.7 code:
import csv
from io import TextIOWrapper
from zipfile import ZipFile
with ZipFile('yourfile.zip') as zf:
    with zf.open('your_csv_inside_zip.csv', 'r') as infile:
        reader = csv.reader(TextIOWrapper(infile, 'utf-8'))
        for row in reader:
            # process the CSV here
            print(row)
A quick solution is to use the code below:
import pandas as pd
#pandas support zip file reads
df = pd.read_csv("/path/to/file.csv.zip")
zipfile also supports the with statement.
So adding onto yaron's answer of using pandas:
with zipfile.ZipFile('file.zip') as zip:
    with zip.open('file.csv') as myZip:
        df = pd.read_csv(myZip)
I thought Yaron had the best answer, but I thought I would add code that iterates through multiple files inside a zip folder. It will then append the results:
import os
import pandas as pd
import zipfile
curDir = os.getcwd()
zf = zipfile.ZipFile(curDir + '/targetfolder.zip')
text_files = zf.infolist()
list_ = []
print ("Uncompressing and reading data... ")
for text_file in text_files:
    print(text_file.filename)
    df = pd.read_csv(zf.open(text_file.filename))
    # do df manipulations
    list_.append(df)

df = pd.concat(list_)
Yes. You want the module 'zipfile'
You open the zip file itself with zipfile.ZipFile(filename[, mode]).
You can then use ZipFile.infolist() to enumerate each file within the zip, and extract it with ZipFile.open(name[, mode[, pwd]])
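Putting those pieces together, a minimal sketch (the archive and member names are placeholders):

import zipfile

with zipfile.ZipFile('yourfile.zip') as zf:
    # list every member in the archive
    for info in zf.infolist():
        print(info.filename)
    # read one member without extracting it to disk
    with zf.open('data.csv') as member:
        contents = member.read()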
This is the simplest approach, and the one I always use:
import pandas as pd
df = pd.read_csv("Train.zip",compression='zip')
Supposing you are downloading a zip file that contains a CSV and you don't want to use temporary storage. Here is what a sample implementation looks like:
#!/usr/bin/env python3
from csv import DictReader
from io import TextIOWrapper, BytesIO
from zipfile import ZipFile
import requests
def all_tickers():
    url = "https://simfin.com/api/bulk/bulk.php?dataset=industries&variant=null"
    r = requests.get(url)
    zip_ref = ZipFile(BytesIO(r.content))
    for name in zip_ref.namelist():
        print(name)
        with zip_ref.open(name) as file_contents:
            reader = DictReader(TextIOWrapper(file_contents, 'utf-8'), delimiter=';')
            for item in reader:
                print(item)
This takes care of all python3 bytes/str issues.
Modern pandas, since version 0.18.1, natively supports compressed CSV files: its read_csv method has a compression parameter: {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
If you have a file named my_big_file.csv and you zip it with the same base name as my_big_file.zip, you may simply do this:
df = pd.read_csv("my_big_file.zip")
Note: check your pandas version first (not applicable for older versions)