Issues with merging of CSV files - Python

I am reading a book (Foundations for Analytics with Python) and trying to merge CSV files. I searched for this issue, but didn't find a relevant answer that resolves it.
My issue is:
input_path = sys.argv[1] IndexError: list index out of range
My code is:
import csv
import glob
import os
import sys
input_path = sys.argv[1]
output_file = sys.argv[2]
first_file = True
for input_file in glob.glob(os.path.join(input_path, 'csv_*')):
    print(os.path.basename(input_file))
    with open(input_file, 'r', newline='') as csv_in_file:
        with open(output_file, 'a', newline='') as csv_out_file:
            filereader = csv.reader(csv_in_file)
            filewriter = csv.writer(csv_out_file)
            if first_file:
                for row in filereader:
                    filewriter.writerow(row)
                first_file = False
            else:
                header = next(filereader)
                for row in filereader:
                    filewriter.writerow(row)
Please help me with it.

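I solved it (with my colleague's help, to be exact).
I hadn't passed the input path and the output file as arguments when I ran the script from cmd, so sys.argv had nothing at index 1.
Such a foolish question from a foolish questioner! The script needs to be run like this: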
csv_merge.py "C:\pathpathpath\csv_merge" output.csv
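To make a missing argument easier to spot than a bare IndexError, the script could check sys.argv before indexing it. The following is only a sketch: parse_args is a hypothetical helper name and the usage text is an assumption, neither comes from the book.

```python
import sys


def parse_args(argv):
    """Return (input_path, output_file), or exit with a usage message."""
    # argv[0] is the script name, so two user arguments mean len(argv) == 3.
    if len(argv) != 3:
        raise SystemExit("usage: csv_merge.py <input_folder> <output_file>")
    return argv[1], argv[2]


# Demonstration with an explicit list; a real script would pass sys.argv.
input_path, output_file = parse_args(["csv_merge.py", "csv_merge", "output.csv"])
print(input_path, output_file)
```

In the merge script, the two sys.argv lookups would then become a single call to parse_args(sys.argv).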

Related

Reading multiple files in a directory with pyyaml

I'm trying to read all YAML files in a directory, but I am having trouble. I am using Python 2.7 (and I cannot change to 3), and all of my files are UTF-8 (and I need them to stay that way).
import os
import yaml
import codecs
def yaml_reader(filepath):
    with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
        data = yaml.load_all(file_descriptor)
    return data
def yaml_dump(filepath, data):
    with open(filepath, 'w') as file_descriptor:
        yaml.dump(data, file_descriptor)
if __name__ == "__main__":
    filepath = os.listdir(os.getcwd())
    data = yaml_reader(filepath)
    print data
When I run this code, Python gives me the message:
TypeError: coercing to Unicode: need string or buffer, list found.
I want this program to show the content of the files. Can anyone help me?

I guess the issue is with filepath.
os.listdir(os.getcwd()) returns a list of all the entries in the directory, so you are passing a list to codecs.open() instead of a single filename.
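Since os.listdir() returns bare names rather than one path, each name has to be turned into its own path before it is opened. A minimal sketch of that step, assuming the YAML files end in .yaml or .yml; yaml_paths is a hypothetical helper name:

```python
import os


def yaml_paths(directory):
    """Return full paths of the .yaml/.yml files directly inside directory."""
    paths = []
    for name in sorted(os.listdir(directory)):  # listdir yields bare names
        full_path = os.path.join(directory, name)
        if os.path.isfile(full_path) and name.lower().endswith((".yaml", ".yml")):
            paths.append(full_path)
    return paths
```

Each returned path can then be passed to the reader function one at a time.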

There are multiple problems with your code, apart from it being invalid Python in the way you formatted it here.
def yaml_reader(filepath):
    with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
        data = yaml.load_all(file_descriptor)
    return data
However, it is not necessary to do the decoding yourself; PyYAML is perfectly capable of processing UTF-8:
def yaml_reader(filepath):
    with open(filepath, "rb") as file_descriptor:
        # load_all() is lazy, so build the list while the file is still open
        data = list(yaml.load_all(file_descriptor))
    return data
I hope you realise you are trying to load multiple documents, so you always get a list as the result in data, even if your file contains only one document.
Then the line:
filepath = os.listdir(os.getcwd())
gives you a list of files, so you need to do:
filepath = os.listdir(os.getcwd())[0]
or decide in some other way which of the files you want to open. If you want to combine all files (assuming they are all YAML) into one big YAML file, you need to do:
if __name__ == "__main__":
    data = []
    for filepath in os.listdir(os.getcwd()):
        data.extend(yaml_reader(filepath))
    print data
And your dump routine would need to change to:
def yaml_dump(filepath, data):
    with open(filepath, 'wb') as file_descriptor:
        yaml.dump(data, file_descriptor, allow_unicode=True, encoding='utf-8')
However, this all brings you to the biggest problem: you are using PyYAML, which will mangle your YAML when you load and dump it, dropping flow style, comments, anchor names, special int/float representations, quotes around scalars, etc. Apart from that, PyYAML has not been updated to support YAML 1.2 documents (YAML 1.2 has been the standard since 2009). I recommend you switch to ruamel.yaml (disclaimer: I am the author of that package), which supports YAML 1.2 and leaves comments etc. in place.
And even if you are bound to Python 2, you should use Python 3-like syntax, e.g. for print, which you can get with from __future__ imports.
So I recommend you do:
pip install pathlib2 ruamel.yaml
and then use:
from __future__ import absolute_import, unicode_literals, print_function
from pathlib2 import Path  # Python 2 backport; on Python 3 use: from pathlib import Path
from ruamel.yaml import YAML
if __name__ == "__main__":
    data = []
    yaml = YAML()
    yaml.preserve_quotes = True
    for filepath in Path('.').glob('*.yaml'):
        data.extend(yaml.load_all(filepath))
    print(data)
    yaml.dump(data, Path('your_output.yaml'))

remove u letters from the csv in python 2.7

I have already searched for solutions online, but none of them worked for the following Python script. I am getting the letter u in the CSV file.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import codecs
import csv
cursor = db.jobs.find( {}, {'_id':1, 'locationIds':1, 'termMap.en':1 })
with codecs.open('job_test.csv', 'w', encoding='utf-8') as outfile:
    fields = ['_id', 'locationIds', 'termMap.en']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for x in cursor:
        x_id = x['_id']
        x_locationIds =x.get('locationIds')
        x_termMap =x['termMap'].get('en')
        z = {
            '_id': x_id,
            'locationIds':str(x_locationIds).encode('ascii', 'ignore'),
            'termMap.en':str(x_termMap).encode('ascii', 'ignore') }
        write.writerow(z)
The output is:
_id locationIds termMap.en
51dc52fec0d988a9547b5201 [u'00aaaaaaaaaaaaaaa5913490', u'00aaaaaaaaaaaaaaa6118158', u'00aaaaaaaaaaaaaaa5946768'] [u'abc', u'acuity', u'analyze', u'become', u'beverage']
51dc52fec0d988a9547b5202 [u'00aaaaaaaaaaaaaaa5946768'] [u'abc', u'air', u'angles', u'apples', u'banana', u'bananas', u'because', u'beings', u'birds', u'bottom']
I have tried many different approaches, but I still cannot remove the u letters. Can someone please help me?
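A likely cause, judging from the code above: str() applied to a list returns the list's repr, and in Python 2 the repr of each unicode element carries a u prefix. One possible way around it is to join the elements instead of stringifying the whole list. This is only a sketch, assuming the values are lists of strings as in the output shown; join_values is a hypothetical helper name and the separator is an arbitrary choice.

```python
def join_values(values, separator=", "):
    """Turn a list of strings into one plain string for a CSV cell."""
    if values is None:  # x.get(...) returns None when the key is missing
        return ""
    # Joining the elements avoids the list repr, which is where the
    # u'...' prefixes and the square brackets come from in Python 2.
    return separator.join(values)
```

In the script, join_values(x_locationIds) would take the place of str(x_locationIds).encode('ascii', 'ignore').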

Can't import python library 'zipfile'

I feel like a dunce. I'm trying to interact with a zip file and can't seem to use the zipfile library. I'm fairly new to Python.
from zipfile import *
#set filename
fpath = '{}_{}_{}.zip'.format(strDate, day, week)
#use zipfile to get info about ftp file
zip = zipfile.Zipfile(fpath, mode='r')
# doesn't matter if I use
#zip = zipfile.Zipfile(fpath, mode='w')
#or zip = zipfile.Zipfile(fpath, 'wb')
I'm getting this error
zip = zipfile.Zipfile(fpath, mode='r')
NameError: name 'zipfile' is not defined
If I just use import zipfile, I get this error:
TypeError: 'module' object is not callable

Two ways to fix it:
1) use from, and in that case drop the zipfile namespace:
from zipfile import *
#set filename
fpath = '{}_{}_{}.zip'.format(strDate, day, week)
#use zipfile to get info about ftp file
zip = ZipFile(fpath, mode='r')
2) use direct import, and in that case use full path like you did:
import zipfile
#set filename
fpath = '{}_{}_{}.zip'.format(strDate, day, week)
#use zipfile to get info about ftp file
zip = zipfile.ZipFile(fpath, mode='r')
And there's a sneaky typo in your code: Zipfile should be ZipFile (capital F), so I feel slightly bad for answering...
So the lesson learnt is:
Avoid from x import y, because editors have a harder time completing names.
With a proper import zipfile and an editor that proposes completions, you would never have had this problem in the first place.

Easiest way to zip a file using Python:
import zipfile
zf = zipfile.ZipFile("targetZipFileName.zip",'w', compression=zipfile.ZIP_DEFLATED)
zf.write("FileTobeZipped.txt")
zf.close()
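ZipFile can also be used as a context manager, which closes the archive even if writing fails partway. A small sketch of the same idea; zip_one_file is a hypothetical helper name, and storing only the base name via arcname is a design choice, not a requirement:

```python
import os
import zipfile


def zip_one_file(source_path, zip_path):
    """Compress one file into a new zip archive and return its member names."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname stores just the file name instead of the full source path
        zf.write(source_path, arcname=os.path.basename(source_path))
    # Reopen the archive read-only to confirm what was written.
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()
```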

PDF to Word Doc in Python

I've read through the other Stack Overflow questions regarding this, but they don't answer my issue, so downvote away. I'm on Python 2.7.
All I want to do is use Python to convert a PDF to a Word doc, or at minimum convert it to text so I can copy and paste it into a Word doc.
This is the code I have so far. All it prints is the female gender symbol.
Is my code wrong? Am I approaching this wrong? Do some PDFs just not work with PDFMiner? Do you know of any other alternatives to accomplish my goal of converting a PDF to Word, besides using PyPDF2 or PDFMiner?
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file('Bottom Dec.pdf', 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
print convert_pdf_to_txt(1)

from pdf2docx import Converter
# raw strings keep the backslashes in Windows paths from being read as escapes
pdf_file = r'E:\Muhammad UMER LAR.pdf'
doc_file = r'E:\Lari.docx'
c=Converter(pdf_file)
c.convert(doc_file)
c.close()

Another alternative is the Aspose.Words Cloud SDK for Python; you can install it with pip and use it for PDF to DOC conversion.
import asposewordscloud
import asposewordscloud.models.requests
api_client = asposewordscloud.ApiClient()
api_client.configuration.host = 'https://api.aspose.cloud'
# Get AppKey and AppSID from https://dashboard.aspose.cloud/
api_client.configuration.api_key['api_key'] = 'xxxxxxxxxxxxxxxxxxxxx' # Put your appKey here
api_client.configuration.api_key['app_sid'] = 'xxxxxxxxx-xxxx-xxxxx-xxxx-xxxxxxxxxx' # Put your appSid here
words_api = asposewordscloud.WordsApi(api_client)
filename = '02_pages.pdf'
remote_name = 'TestPostDocumentSaveAs.pdf'
dest_name = 'TestPostDocumentSaveAs.doc'
#upload PDF file to storage
request_storage = asposewordscloud.models.requests.UploadFileRequest(filename,remote_name)
response = words_api.upload_file(request_storage)
#Convert PDF to DOC and save to storage
save_options = asposewordscloud.SaveOptionsData(save_format='doc', file_name=dest_name)
request = asposewordscloud.models.requests.SaveAsRequest(remote_name, save_options)
result = words_api.save_as(request)
print("Result {}".format(result))
I'm a developer evangelist at Aspose.

Reading csv zipped files in python

I'm trying to get data from a zipped CSV file. Is there a way to do this without unzipping the whole file? If not, how can I unzip the files and read them efficiently?

I used the zipfile module to read the CSV inside the ZIP directly into a pandas DataFrame.
Let's say the file is named "intfile.csv" and it's in a .zip named "THEZIPFILE.zip":
import pandas as pd
import zipfile
zf = zipfile.ZipFile('C:/Users/Desktop/THEZIPFILE.zip')
df = pd.read_csv(zf.open('intfile.csv'))

If you aren't using Pandas it can be done entirely with the standard lib. Here is Python 3.7 code:
import csv
from io import TextIOWrapper
from zipfile import ZipFile
with ZipFile('yourfile.zip') as zf:
    with zf.open('your_csv_inside_zip.csv', 'r') as infile:
        reader = csv.reader(TextIOWrapper(infile, 'utf-8'))
        for row in reader:
            # process the CSV here
            print(row)

A quick solution is the code below:
import pandas as pd
# pandas supports reading zipped CSV files directly
df = pd.read_csv("/path/to/file.csv.zip")

zipfile also supports the with statement.
So, adding onto Yaron's answer of using pandas:
import zipfile
import pandas as pd
with zipfile.ZipFile('file.zip') as zip:
    with zip.open('file.csv') as myZip:
        df = pd.read_csv(myZip)

I thought Yaron had the best answer, but I wanted to add code that iterates through multiple files inside a zip folder and then concatenates the results:
import os
import pandas as pd
import zipfile
curDir = os.getcwd()
zf = zipfile.ZipFile(curDir + '/targetfolder.zip')
text_files = zf.infolist()
list_ = []
print ("Uncompressing and reading data... ")
for text_file in text_files:
    print(text_file.filename)
    df = pd.read_csv(zf.open(text_file.filename))
    # do df manipulations
    list_.append(df)
df = pd.concat(list_)

Yes. You want the module 'zipfile'.
You open the zip file itself with zipfile.ZipFile(file[, mode[, compression[, allowZip64]]]). (zipfile.ZipInfo describes a single member of the archive; it does not open the archive.)
You can then use ZipFile.infolist() to enumerate each file within the zip, and read it with ZipFile.open(name[, mode[, pwd]])
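The steps above can be sketched with the standard library only (Python 3). The sketch assumes the CSV members are UTF-8 encoded and end in .csv; read_csv_members is a hypothetical helper name.

```python
import csv
import io
import zipfile


def read_csv_members(zip_source):
    """Return {member_name: list_of_rows} for every .csv member of a zip archive."""
    tables = {}
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            if not info.filename.lower().endswith(".csv"):
                continue  # skip directories and non-CSV members
            with zf.open(info) as raw:
                # zf.open() gives a binary stream; wrap it so csv sees text
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[info.filename] = list(csv.reader(text))
    return tables
```

zip_source may be a path or a file-like object. Each member is decompressed on demand as it is read, and nothing is extracted to disk.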

This is the simplest approach, and the one I always use.
import pandas as pd
df = pd.read_csv("Train.zip",compression='zip')

Suppose you are downloading a zip file that contains a CSV and you don't want to use temporary storage. Here is what a sample implementation looks like:
#!/usr/bin/env python3
from csv import DictReader
from io import TextIOWrapper, BytesIO
from zipfile import ZipFile
import requests
def all_tickers():
    url = "https://simfin.com/api/bulk/bulk.php?dataset=industries&variant=null"
    r = requests.get(url)
    zip_ref = ZipFile(BytesIO(r.content))
    for name in zip_ref.namelist():
        print(name)
        with zip_ref.open(name) as file_contents:
            reader = DictReader(TextIOWrapper(file_contents, 'utf-8'), delimiter=';')
            for item in reader:
                print(item)
This takes care of all the Python 3 bytes/str issues.

Pandas, since version 0.18.1, natively supports compressed CSV files: its read_csv method has a compression parameter that accepts {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, with 'infer' as the default.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

If you have a file named my_big_file.csv and you zip it under the same base name, my_big_file.zip,
you can simply do this:
df = pd.read_csv("my_big_file.zip")
Note: check your pandas version first (this does not work in older versions).
