I'm trying to get data from a zipped CSV file. Is there a way to do this without unzipping the whole file? If not, how can I unzip the files and read them efficiently?
I used the zipfile module to import the ZIP directly to pandas dataframe.
Let's say the file name is "intfile" and it's in .zip named "THEZIPFILE":
import pandas as pd
import zipfile
zf = zipfile.ZipFile('C:/Users/Desktop/THEZIPFILE.zip')
df = pd.read_csv(zf.open('intfile.csv'))
If you aren't using Pandas it can be done entirely with the standard lib. Here is Python 3.7 code:
import csv
from io import TextIOWrapper
from zipfile import ZipFile
with ZipFile('yourfile.zip') as zf:
    with zf.open('your_csv_inside_zip.csv', 'r') as infile:
        reader = csv.reader(TextIOWrapper(infile, 'utf-8'))
        for row in reader:
            # process the CSV here
            print(row)
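The same standard-library approach can be wrapped in a small generator so that rows are produced while the member is being decompressed. This is only a sketch: the function name `iter_csv_rows` and the archive and member names are made up for illustration, and a throwaway archive is created first so the example runs on its own.

```python
import csv
import os
import tempfile
from io import TextIOWrapper
from zipfile import ZipFile


def iter_csv_rows(zip_path, member, encoding="utf-8"):
    """Yield the rows of one CSV member without extracting the archive."""
    with ZipFile(zip_path) as zf:
        with zf.open(member) as raw:
            # newline="" lets the csv module handle line endings itself
            text = TextIOWrapper(raw, encoding=encoding, newline="")
            for row in csv.reader(text):
                yield row


# Build a small archive (hypothetical names) so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "example.zip")
with ZipFile(zip_path, "w") as zf:
    zf.writestr("data.csv", "id,name\n1,alpha\n2,beta\n")
    zf.writestr("other.csv", "x\n9\n")

rows = list(iter_csv_rows(zip_path, "data.csv"))
print(rows)
```

Only the requested member is read; the other file in the archive is never decompressed.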
A quick solution is the code below:
import pandas as pd
# pandas can read a zipped CSV directly
df = pd.read_csv("/path/to/file.csv.zip")
zipfile also supports the with statement.
So adding onto yaron's answer of using pandas:
# zf rather than zip, so the built-in zip() is not shadowed
with zipfile.ZipFile('file.zip') as zf:
    with zf.open('file.csv') as csv_file:
        df = pd.read_csv(csv_file)
I thought Yaron had the best answer, but I wanted to add code that iterates through multiple files inside a zip archive and then concatenates the results:
import os
import pandas as pd
import zipfile
curDir = os.getcwd()
zf = zipfile.ZipFile(curDir + '/targetfolder.zip')
text_files = zf.infolist()
list_ = []
print("Uncompressing and reading data... ")
for text_file in text_files:
    print(text_file.filename)
    df = pd.read_csv(zf.open(text_file.filename))
    # do df manipulations
    list_.append(df)
df = pd.concat(list_)
Yes. You want the zipfile module.
You open the zip file itself with zipfile.ZipFile(file[, mode[, compression[, allowZip64]]])
You can then use ZipFile.infolist() to enumerate each file within the zip, and read one with ZipFile.open(name[, mode[, pwd]])
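As a rough sketch of that enumerate-then-open pattern, the snippet below builds a small archive in memory and picks out only the CSV members. The member names are invented for the example, and `ZipInfo.is_dir()` assumes Python 3.6 or later.

```python
import io
import zipfile

# Build an in-memory archive with a directory entry and mixed file types.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("reports/", "")                 # directory entry
    zf.writestr("reports/a.csv", "x,y\n1,2\n")
    zf.writestr("reports/notes.txt", "not a csv")
    zf.writestr("b.csv", "x,y\n3,4\n")

buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    # infolist() returns one ZipInfo per entry, in archive order
    csv_members = [
        info.filename
        for info in zf.infolist()
        if not info.is_dir() and info.filename.lower().endswith(".csv")
    ]
    # open() gives a binary file-like object for a single member
    with zf.open(csv_members[0]) as member:
        first_bytes = member.read()

print(csv_members)
```

Filtering on the ZipInfo entries first avoids handing directories or unrelated files to a CSV parser.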
This is the simplest approach, and the one I always use:
import pandas as pd
df = pd.read_csv("Train.zip",compression='zip')
Suppose you are downloading a zip file that contains a CSV and you don't want to use temporary storage. Here is a sample implementation:
#!/usr/bin/env python3
from csv import DictReader
from io import TextIOWrapper, BytesIO
from zipfile import ZipFile
import requests
def all_tickers():
    url = "https://simfin.com/api/bulk/bulk.php?dataset=industries&variant=null"
    r = requests.get(url)
    zip_ref = ZipFile(BytesIO(r.content))
    for name in zip_ref.namelist():
        print(name)
        with zip_ref.open(name) as file_contents:
            reader = DictReader(TextIOWrapper(file_contents, 'utf-8'), delimiter=';')
            for item in reader:
                print(item)
This takes care of all python3 bytes/str issues.
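To try the in-memory part without a network call, the downloaded bytes can be replaced by a payload built locally. This is a sketch under that assumption: `rows_from_zip_bytes`, the member name, and the column names are all hypothetical stand-ins, with `payload` playing the role of `r.content`.

```python
from csv import DictReader
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile


def rows_from_zip_bytes(payload, delimiter=";"):
    """Parse every member of an in-memory zip as delimited text."""
    result = {}
    with ZipFile(BytesIO(payload)) as zf:
        for name in zf.namelist():
            with zf.open(name) as member:
                # Decode bytes to str on the fly; nothing touches the disk
                text = TextIOWrapper(member, encoding="utf-8", newline="")
                reader = DictReader(text, delimiter=delimiter)
                result[name] = [dict(row) for row in reader]
    return result


# Stand-in for the bytes an HTTP response would carry.
fake_download = BytesIO()
with ZipFile(fake_download, "w") as zf:
    zf.writestr("industries.csv", "IndustryId;Sector\n100001;Industrials\n")
payload = fake_download.getvalue()

parsed = rows_from_zip_bytes(payload)
print(parsed)
```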
Pandas has natively supported compressed CSV files since version 0.18.1: its read_csv method has a compression parameter: {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
If you have a file named my_big_file.csv and zip it under the same name, as my_big_file.zip,
you can simply do this:
df = pd.read_csv("my_big_file.zip")
Note: check your pandas version first (this does not work on older versions), and make sure the archive contains only that one CSV file.
Related
I'm using python with pyarrow library and I'd like to write a pandas dataframe on HDFS. Here is the code I have
import pandas as pd
import pyarrow as pa
fs = pa.hdfs.connect(namenode, port, username, kerb_ticket)
df = pd.DataFrame(...)
table = pa.Table.from_pandas(df)
According to the documentation I should use the following code to write a pyarrow.Table on HDFS
import pyarrow.parquet as pq
pq.write_table(table, 'filename.parquet')
What I don't understand is where I should use my connection (fs): if I don't pass it to write_table, how does it know where the HDFS is?
Based on the document: https://arrow.apache.org/docs/python/api/formats.html#parquet-files
you can either use write_table or write_to_dataset function:
write_table
write_table takes multiple parameters; a few of them are:
table -> pyarrow.Table
where -> a path string or a file-like object
filesystem -> the filesystem to write through; default is None
Example
pq.write_table(table, path, filesystem = fs)
or
with fs.open(path, 'wb') as f:
    pq.write_table(table, f)
write_to_dataset
You can use write_to_dataset if you want to partition the data by certain columns of the table. Example:
pq.write_to_dataset(table, path, filesystem = fs, partition_cols = [col1])
You can do this:
with fs.open(path, 'wb') as f:
    pq.write_table(table, f)
I opened a JIRA about adding some more documentation about this
https://issues.apache.org/jira/browse/ARROW-6239
I have pyspark dataframe df containing paths to text files. I want to create a new column with the content of the text files.
import pyspark.sql.functions as F
from pyspark.sql.types import *
def read_file(filepath):
    import s3fs
    s3 = s3fs.S3FileSystem()
    with s3.open(filepath) as f:
        return f.read()
read_file_udf = F.udf(read_file, StringType())
df.withColumn('raw_text', read_file_udf('filepath')).show()
+---------------------+-----------+
| file | raw_text|
+---------------------+-----------+
|s3://bucket/file1.txt| [B#aa2a4f3|
|s3://bucket/file2.txt|[B#138664c5|
|s3://bucket/file3.txt| [B#3bcc67e|
|s3://bucket/file4.txt|[B#70b735c4|
|s3://bucket/file5.txt|[B#6fad821d|
+---------------------+-----------+
Instead of getting the actual file content, I am getting these strange [B# codes. What are they, why am I getting them and how do I fix this?
To answer my own question: I was getting [B# because read_file() was returning bytes rather than a string (s3.open() opens the file in binary mode by default). Defining:
def read_file(filepath):
    import s3fs
    s3 = s3fs.S3FileSystem()
    with s3.open(filepath) as f:
        return f.read().decode("utf-8")
will fix the issue.
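The bytes-versus-string distinction can be reproduced without S3 or Spark. In this sketch an `io.BytesIO` object stands in for the file handle (an assumption made purely for illustration), and two ways of getting text out of it are shown.

```python
import io

# Stand-in for a file opened in binary mode.
binary_file = io.BytesIO("caf\u00e9 au lait\n".encode("utf-8"))

data = binary_file.read()
print(type(data).__name__)  # the read returns bytes, not str

# Option 1: decode the bytes explicitly, as in the fix above.
text = data.decode("utf-8")

# Option 2: wrap the binary stream so that reads return str directly.
binary_file.seek(0)
wrapped_text = io.TextIOWrapper(binary_file, encoding="utf-8").read()

print(text == wrapped_text)
```

Either way, the UDF declared with StringType() then receives an actual string instead of a byte array.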
I'm trying to read all YAML files in a directory, but I am having trouble. I am using Python 2.7 (and cannot change to 3), and all of my files are UTF-8 (and need to stay that way).
import os
import yaml
import codecs
def yaml_reader(filepath):
    with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
        data = yaml.load_all(file_descriptor)
    return data
def yaml_dump(filepath, data):
    with open(filepath, 'w') as file_descriptor:
        yaml.dump(data, file_descriptor)
if __name__ == "__main__":
    filepath = os.listdir(os.getcwd())
    data = yaml_reader(filepath)
    print data
When I run this code, python gives me the message:
TypeError: coercing to Unicode: need string or buffer, list found.
I want this program to show the content of the files. Can anyone help me?
I guess the issue is with filepath.
os.listdir(os.getcwd()) returns a list of all the files in the directory, so you are passing a list to codecs.open() instead of a single filename.
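One way to turn that list into individual paths is to filter by extension and join each name with the directory. The helper name `yaml_paths` and the demo file names below are made up for this sketch, which creates a temporary directory so it can run anywhere.

```python
import os
import tempfile


def yaml_paths(directory):
    """Return the full paths of the YAML files in a directory, sorted."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)      # listdir yields bare names
        if name.endswith((".yaml", ".yml"))
    )


# Create a scratch directory with two YAML files and one unrelated file.
demo_dir = tempfile.mkdtemp()
for name in ("b.yaml", "a.yml", "readme.txt"):
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write("key: value\n")

paths = yaml_paths(demo_dir)
for path in paths:
    print(path)
```

Each element of `paths` is a single string, which is what `open()` or `codecs.open()` expects.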
There are multiple problems with your code, starting with the indentation, which has to look like this to be valid Python:
def yaml_reader(filepath):
    with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
        # load_all is lazy, so consume it while the file is still open
        data = list(yaml.load_all(file_descriptor))
    return data
However, it is not necessary to do the decoding yourself; PyYAML is perfectly capable of processing UTF-8:
def yaml_reader(filepath):
    with open(filepath, "rb") as file_descriptor:
        # load_all is lazy, so consume it while the file is still open
        data = list(yaml.load_all(file_descriptor))
    return data
I hope you realise you're loading multiple documents, so data holds a sequence of documents even if your file contains only one.
Then the line:
filepath = os.listdir(os.getcwd())
gives you a list of files, so you need to do:
filepath = os.listdir(os.getcwd())[0]
or decide in some other way, which of the files you want to open. If you want to combine all files (assuming they are YAML) in one big YAML file, you need to do:
if __name__ == "__main__":
    data = []
    for filepath in os.listdir(os.getcwd()):
        data.extend(yaml_reader(filepath))
    print data
And your dump routine would need to change to:
def yaml_dump(filepath, data):
    with open(filepath, 'wb') as file_descriptor:
        yaml.dump(data, file_descriptor, allow_unicode=True, encoding='utf-8')
However, this all brings you to the biggest problem: you are using PyYAML, which will mangle your YAML, dropping flow style, comments, anchor names, special int/float representations, quotes around scalars, etc. Apart from that, PyYAML has not been updated to support YAML 1.2 documents (which has been the standard since 2009). I recommend you switch to ruamel.yaml (disclaimer: I am the author of that package), which supports YAML 1.2 and leaves comments etc. in place.
And even if you are bound to Python 2, you should use Python 3-like syntax, e.g. for print, which you can get with from __future__ imports.
So I recommend you do:
pip install pathlib2 ruamel.yaml
and then use:
from __future__ import absolute_import, unicode_literals, print_function
from pathlib2 import Path  # the backport installed above; on Python 3, use pathlib
from ruamel.yaml import YAML
if __name__ == "__main__":
    data = []
    yaml = YAML()
    yaml.preserve_quotes = True
    for filepath in Path('.').glob('*.yaml'):
        data.extend(yaml.load_all(filepath))
    print(data)
    yaml.dump(data, Path('your_output.yaml'))
I feel like a dunce. I'm trying to interact with a zip file and can't seem to use the zipfile library. I'm fairly new to Python.
from zipfile import *
#set filename
fpath = '{}_{}_{}.zip'.format(strDate, day, week)
#use zipfile to get info about ftp file
zip = zipfile.Zipfile(fpath, mode='r')
# doesn't matter if I use
#zip = zipfile.Zipfile(fpath, mode='w')
#or zip = zipfile.Zipfile(fpath, 'wb')
I'm getting this error
zip = zipfile.Zipfile(fpath, mode='r')
NameError: name 'zipfile' is not defined
if I just use import zipfile I get this error:
TypeError: 'module' object is not callable
Two ways to fix it:
1) use from, and in that case drop the zipfile namespace:
from zipfile import *
#set filename
fpath = '{}_{}_{}.zip'.format(strDate, day, week)
#use zipfile to get info about ftp file
zip = ZipFile(fpath, mode='r')
2) use direct import, and in that case use full path like you did:
import zipfile
#set filename
fpath = '{}_{}_{}.zip'.format(strDate, day, week)
#use zipfile to get info about ftp file
zip = zipfile.ZipFile(fpath, mode='r')
And there's a sneaky typo in your code: Zipfile should be ZipFile (capital F), so I feel slightly bad for answering...
So the lesson learnt is:
avoid from x import y, because editors have a harder time completing names
with a proper import zipfile and an editor that offers completion, you would never have had this problem in the first place.
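The two error messages in the question follow from how the names are bound, which a few checks can illustrate. This is a small sketch; the variable names are chosen just for the demonstration.

```python
import zipfile
from zipfile import ZipFile

# Both import styles end up referring to the very same class object.
same_class = zipfile.ZipFile is ZipFile

# The misspelled name does not exist in the module at all.
has_typo_name = hasattr(zipfile, "Zipfile")

# The module itself is not callable, hence "'module' object is not callable"
# when zipfile(...) is called instead of zipfile.ZipFile(...).
module_callable = callable(zipfile)

print(same_class, has_typo_name, module_callable)
```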
Easiest way to zip a file using Python:
import zipfile
zf = zipfile.ZipFile("targetZipFileName.zip",'w', compression=zipfile.ZIP_DEFLATED)
zf.write("FileTobeZipped.txt")
zf.close()
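A variant of the same idea uses the with statement, so the archive is closed even if writing fails, and passes arcname so that only the file name (not the full path) is stored. The file names and the temporary directory are assumptions made so the sketch is self-contained.

```python
import os
import tempfile
import zipfile

# Create a scratch file to compress.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "FileToBeZipped.txt")
with open(source, "w") as handle:
    handle.write("hello\n" * 100)

target = os.path.join(work_dir, "target.zip")
with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    # arcname controls the name stored inside the archive
    zf.write(source, arcname=os.path.basename(source))

# Re-open the archive to confirm what was written.
with zipfile.ZipFile(target) as zf:
    names = zf.namelist()
    info = zf.getinfo(names[0])

print(names, info.file_size, info.compress_size)
```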
I have written this code
import csv as csv
import numpy as np
csv_file_object = csv.reader(open('C:\Users\hostname\Desktop\spyder\train.csv', 'rb'))
header = csv_file_object.next()
data=[]
for row in csv_file_object:
    data.append(row)
data = np.array(data)
but Error ([Errno 22] invalid mode ('rb') or filename:) appears.
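I'd suggest using numpy's genfromtxt, with the path written as a raw string:
import numpy as np
np.genfromtxt(r'C:\Users\hostname\Desktop\spyder\train.csv',delimiter=',',dtype=None)
The r prefix matters: in an ordinary string literal the \t in \train.csv is read as a tab character, which is the most likely cause of the "invalid mode or filename" error. You'll have to adjust the delimiter and dtype parameters based on your csv file.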
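The effect of a backslash sequence in a Windows path can be checked directly. The shortened path fragment below is only an illustration, not the path from the question.

```python
# In an ordinary literal, backslash followed by t is an escape for a tab.
escaped = "spyder\train.csv"

# A raw string keeps the backslash and the letter t as two characters.
raw = r"spyder\train.csv"

# Doubling the backslash is an equivalent alternative to the r prefix.
doubled = "spyder\\train.csv"

print(repr(escaped))
print(repr(raw))
print(len(escaped), len(raw))
```

Forward slashes also work in Windows paths passed to open(), which sidesteps the escaping issue entirely.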