I'm using the psycopg2 module to make queries against QuestDB from Python. I've had some trouble using the copy_from() cursor method to get CSV data into a table. What's the best way to get this data into the database?
I'm trying the following:
import pandas as pd
import numpy as np
import psycopg2
import os
conn = psycopg2.connect(user="admin",
                        password="quest",
                        host="127.0.0.1",
                        port="8812",
                        database="qdb")
cursor = conn.cursor()
dest_table = "eur_fr_bulk"
temp_dataframe = "./temp_dataframe.csv"
# input
df = pd.read_csv("./data/eur_fr.csv")
df.to_csv(temp_dataframe, index_label='id', header=False)
f = open(temp_dataframe, 'r')
cursor = conn.cursor()
try:
    cursor.copy_from(f, dest_table)
    conn.commit()
except (Exception, psycopg2.DatabaseError) as error:
    os.remove(temp_dataframe)
    print("Error: %s" % error)
    conn.rollback()
    cursor.close()
cursor.close()
The copy_from() wrapper in psycopg2 executes some SQL in the background that's not yet supported in QuestDB. Specifically, it will run
COPY my_table FROM stdin WITH DELIMITER AS ' ' NULL AS '\\N'
The DELIMITER keyword is not yet implemented. As a workaround, you can either make the request via HTTP in Python, which might be the most convenient:
import requests
csv = {'data': ('my_table_import', open('./data/eur_fr.csv', 'r'))}
server = 'http://localhost:9000/imp'
response = requests.post(server, files=csv)
print(response.text)
or you can specify a copy directory in the server.conf file which allows loading CSV files. This is documented on the COPY documentation page.
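If you go the COPY route, a minimal sketch of the SQL looks like this (the table and file names are illustrative, the CSV is assumed to already sit in the configured copy directory, and the exact syntax your version supports is on the COPY documentation page); it can be executed through the same psycopg2 cursor:
# file name is resolved against the copy directory configured in server.conf
cursor.execute("COPY eur_fr_bulk FROM 'eur_fr.csv';")
conn.commit()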
Related
I want to copy certain data from a Vertica cluster (let's say a test cluster) to another Vertica cluster (let's say a QA cluster). Manually, I can do this by dumping the result of a query into a CSV file and then importing it on the other cluster. But how can I do it in a Python script without using os or system commands? I want to do it purely using some Python module or adapter. As of now I am using the python-vertica adapter; I am able to connect to the test cluster and get the data into a Python list, but I am unable to export it to a CSV file natively using the adapter (i.e. without using the Python csv module). Also, how can I import the CSV file into my QA cluster using the same adapter (or a different Vertica module for Python)?
You can do it with COPY FROM VERTICA for simple problems. Read here for more info.
For Python you can use my template:
Environment:
python=2.7.x
vertica-python==0.7.3
Vertica Analytic Database v8.1.1-10
Source code example:
#!/usr/bin/env python2
# coding: UTF-8
import csv
import cStringIO
import vertica_python
# connection info: username, password, etc
SRC_DB_INFO = {...}
DST_DB_INFO = {...}
csvbuffer = cStringIO.StringIO()
csvwriter = csv.writer(csvbuffer, delimiter='|', lineterminator='\n', quoting=csv.QUOTE_MINIMAL)
# establish connection to source database
connection = vertica_python.connect(**SRC_DB_INFO)
cursor = connection.cursor()
cursor.execute('SELECT * FROM A')
# convert data to csv format
for row in cursor.iterate():
    csvwriter.writerow(row)
# cleanup
cursor.close()
connection.close()
# establish connection to destination database
connection = vertica_python.connect(**DST_DB_INFO)
cursor = connection.cursor()
# copy data
cursor.copy('COPY B FROM STDIN ABORT ON ERROR', csvbuffer.getvalue())
connection.commit()
# cleanup
cursor.close()
connection.close()
I want to connect to an S3 bucket, get the CSV files, and copy the rows to an RDS DB. This script uses arcpy; I'm not that familiar with this package, and I'm just trying to read the CSV file directly from the S3 bucket as the source without downloading it to the server. The code is as follows:
import arcpy
from boto.s3.key import Key
import StringIO
import pandas as pd
import boto
import boto.s3.connection
access_key = ''
secret_key = ''
conn = boto.connect_s3(aws_access_key_id=access_key, aws_secret_access_key=secret_key, host='s3.amazonaws.com')
b = conn.get_bucket('mybucket')
#for key in b.list:
b_key = b.get_key('file1.csv')
arcpy.env.overwriteOutput = True
b_url = b_key.generate_url(0, query_auth=False, force_http=True)
print b_url
##Read file
k = Key(b, 'file1.csv')
content = k.get_contents_as_string()
sourcefile_csv = pd.read_csv(StringIO.StringIO(content))
##CopyRows_management (in_rows, out_table, {config_keyword})
#http://pro.arcgis.com/en/pro-app/tool-reference/data-management/copy-rows.htm
arcpy.CopyRows_management(sourcefile_csv, "RDSTablePath", "")
print("copy rows done")
Error in CopyRows: arcgisscripting.ExecuteError: Failed to execute. Parameters are not valid.
If we use a path on the server as source path like below it works fine:
sourcefile_csv = "D:\\DEV\\file1.csv"
arcpy.CopyRows_management(sourcefile_csv, "RDSTablePath", "")
Any help would be appreciated.
It looks like you are trying to use the Pandas dataframe as the table to read from with CopyRows_management? I don't think that is a valid input for the function, hence the "Parameters are not valid" error. The documentation says that in_rows should be "The rows from a feature class, layer, table, or table view to be copied." I think the use of pandas is unnecessary here anyway.
So either save the csv somewhere that the script can access it (as you did when you used the path on the server) or, if you don't want to save the file anywhere, just read the contents of the csv and iterate through it using an Insert Cursor to write it to your table/feature class.
See this post on how to read a csv from a string using the csv module. Then just loop through the rows of the csv and use the Insert Cursor to write to your table, as in the sketch below.
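A minimal sketch of that approach (kept in Python 2 to match the rest of the script; the field names "FieldA" and "FieldB" and the header handling are assumptions about your CSV layout):
import csv
import StringIO
import arcpy

# content is the CSV text already pulled from S3 with k.get_contents_as_string()
rows = csv.reader(StringIO.StringIO(content))
next(rows)  # skip the header row, assuming the CSV has one
# "FieldA" and "FieldB" are hypothetical field names matching the target table
with arcpy.da.InsertCursor("RDSTablePath", ["FieldA", "FieldB"]) as ic:
    for row in rows:
        ic.insertRow(row)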
If your RDS happens to be Aurora MySQL, then you should take a look at the Load Data from S3 feature, which lets you skip the code and load the file into your DB directly.
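A hedged sketch of what that could look like from Python (pymysql is just one possible client; the endpoint, table, and column layout here are assumptions, and the Aurora cluster needs an IAM role that allows reading the bucket):
import pymysql

# LOAD DATA FROM S3 is an Aurora MySQL extension; plain MySQL/RDS MySQL does not support it
conn = pymysql.connect(host='my-aurora-endpoint', user='user', password='pw', db='mydb')
with conn.cursor() as cur:
    cur.execute("""
        LOAD DATA FROM S3 's3://mybucket/file1.csv'
        INTO TABLE my_table
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES
    """)
conn.commit()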
I am trying to convert this Amazon sample Snap grocery JSON data to a Pandas dataframe in IBM Bluemix (using Python 2.x) and then analyze it with Apache Spark.
I have unzipped the JSON file and uploaded it to an Apache Spark Container.
Here is my container connection:
# In[ ]:
credentials_1 = {
'auth_uri':'',
'global_account_auth_uri':'',
'username':'myUname',
'password':"myPw",
'auth_url':'https://identity.open.softlayer.com',
'project':'object_storage_988dfce6_5b93_48fc_9575_198bbed3abfc',
'project_id':'2c05de8a36d74d32bdbe0eeec7e5a372',
'region':'dallas',
'user_id':'4976489bab7d489f8d2eba681adacb78',
'domain_id':'8b6bc3e989d644858d7b74f24119447a',
'domain_name':'1079761',
'filename':'meta_Grocery_and_Gourmet_Food.json',
'container':'grocery',
'tenantId':'s31d-8e24c13d9c36f4-43b43b7b993d'
}
I then used Apache Spark's sample code for importing data from the container into a StringIO object.
# In[ ]:
import requests, StringIO, pandas as pd, json, re
# In[ ]:
def get_file_content(credentials):
    """For given credentials, this function returns a StringIO object containing the file content."""
    url1 = ''.join([credentials['auth_url'], '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'], 'domain': {'id': credentials['domain_id']},
            'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if (e1['type'] == 'object-store'):
            for e2 in e1['endpoints']:
                if (e2['interface'] == 'public' and e2['region'] == credentials['region']):
                    url2 = ''.join([e2['url'], '/', credentials['container'], '/', credentials['filename']])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.get(url=url2, headers=headers2)
    return StringIO.StringIO(resp2.content)
I then converted the string content to a strict JSON pattern by appending [ at the beginning and ] at the end and by separating the records with commas.
print('----------------------\n')
import json
myDf=[];
def parse(data):
    for l in data:
        yield json.dumps(eval(l))

def getDF(data):
    st = '['
    i = 0
    df = []
    for d in parse(data):
        if i < 100:
            i += 1
            #print(str(d))
            st = st + str(d) + ','
            #print('----------------\n')
    st = st[:-1]
    st = st + ']'
    #js=json.loads(st)
    #print(json.dumps(js))
    return pd.read_json(st)
content_string = get_file_content(credentials_1)
df = getDF(content_string)
df.head()
I am getting a perfectly desirable result.
Output of the code
The problem is that when I remove the i < 100 condition, it just never completes and the kernel remains busy for over an hour.
Is there a more elegant way to convert the data into a dataframe?
Also, ijson is not available with Bluemix Notebook.
Let me answer this in two parts:
You can install ijson in your Bluemix Spark service using the command below, then import ijson and use it as needed.
!pip install --user ijson
You can use sqlContext.jsonFile to read the JSON from object storage rather than going to the trouble of defining your own schema.
This will even infer the schema for you, and then you can run Spark SQL queries to do whatever you want with the dataframe.
df = sqlContext.jsonFile("swift://" + objectStorageCreds['container'] + "." + objectStorageCreds['name'] + "/" + objectStorageCreds['filename'])
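As a minimal sketch of the Spark SQL part (the temporary table name "grocery", the query, and the brand column are just illustrations of what the inferred schema might contain; this uses the same Spark 1.x sqlContext as the line above):
# register the dataframe with its inferred schema as a temp table and query it
# (column names are assumptions about the grocery metadata)
df.registerTempTable("grocery")
top_brands = sqlContext.sql("SELECT brand, count(*) AS n FROM grocery GROUP BY brand ORDER BY n DESC LIMIT 10")
top_brands.show()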
Here is the link to the complete notebook.
If you have to work with a pandas dataframe, you can simply convert it with
df.toPandas().head()
but this will pull everything to the driver node (use carefully).
Thanks,
Charles.
I'm trying to use openpyxl to write data to an existing xlsx workbook and save it as a separate file. I would like to write a dataframe to a sheet called 'Data' and write some values to another sheet called 'Summary', then save the workbook object as 'test.xlsx'. I have the following code that writes the calculations I need to the summary sheet, but I get a TypeError for an unexpected keyword argument 'font' when I try to write the dataframe, which doesn't make much sense to me...
I'm using the following code which I've adapted from here
from openpyxl import load_workbook
import pandas as pd
book = load_workbook('template.xlsx')
# Write values to summary sheet
ws = book.get_sheet_by_name('Summary')
ws['A1'] = 'TEST'
book.save('test.xlsx')
# Write df
writer = pd.ExcelWriter('test.xlsx', engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, sheet_name="Data", index=False)
writer.save()
The traceback:
File "C:\Users\me\AppData\Local\Continuum\Anaconda\lib\site- packages\pandas\core\frame.py", line 1274, in to_excel
startrow=startrow, startcol=startcol)
File "C:\Users\me\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\excel.py", line 778, in write_cells
xcell.style = xcell.style.copy(**style_kwargs)
File "C:\Users\me\AppData\Local\Continuum\Anaconda\lib\site-packages\openpyxl-2.2.2-py2.7.egg\openpyxl\compat\__init__.py", line 67, in new_func
return obj(*args, **kwargs)
TypeError: copy() got an unexpected keyword argument 'font'
I think the error arises from breaking changes in the openpyxl styles API, which is why pandas installs an older version of openpyxl. You should be okay, therefore, if you remove openpyxl 2.2 so that the older version is picked up instead.
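If it helps to confirm which copy of openpyxl is actually being imported, here is a quick check (purely illustrative; the version cutoff is an assumption based on the styles API change in openpyxl 2.x):
import openpyxl
# pandas' older to_excel code path expects the pre-2.x styles API
print(openpyxl.__version__)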
I'm trying to get data from a zipped CSV file. Is there a way to do this without unzipping the whole file? If not, how can I unzip the files and read them efficiently?
I used the zipfile module to import the ZIP directly to pandas dataframe.
Let's say the file name is "intfile" and it's in .zip named "THEZIPFILE":
import pandas as pd
import zipfile
zf = zipfile.ZipFile('C:/Users/Desktop/THEZIPFILE.zip')
df = pd.read_csv(zf.open('intfile.csv'))
If you aren't using Pandas it can be done entirely with the standard lib. Here is Python 3.7 code:
import csv
from io import TextIOWrapper
from zipfile import ZipFile
with ZipFile('yourfile.zip') as zf:
    with zf.open('your_csv_inside_zip.csv', 'r') as infile:
        reader = csv.reader(TextIOWrapper(infile, 'utf-8'))
        for row in reader:
            # process the CSV here
            print(row)
A quick solution is to use the code below:
import pandas as pd
# pandas supports zip file reads
df = pd.read_csv("/path/to/file.csv.zip")
zipfile also supports the with statement.
So adding onto yaron's answer of using pandas:
with zipfile.ZipFile('file.zip') as zip:
    with zip.open('file.csv') as myZip:
        df = pd.read_csv(myZip)
I thought Yaron had the best answer, but I wanted to add code that iterates through multiple files inside a zip archive and then appends the results:
import os
import pandas as pd
import zipfile
curDir = os.getcwd()
zf = zipfile.ZipFile(curDir + '/targetfolder.zip')
text_files = zf.infolist()
list_ = []
print ("Uncompressing and reading data... ")
for text_file in text_files:
    print(text_file.filename)
    df = pd.read_csv(zf.open(text_file.filename))
    # do df manipulations
    list_.append(df)
df = pd.concat(list_)
Yes. You want the 'zipfile' module.
You open the zip file itself with zipfile.ZipFile(filename[, mode]).
You can then use ZipFile.infolist() to enumerate each file within the zip, and read each one with ZipFile.open(name[, mode[, pwd]]), as in the sketch below.
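A minimal sketch of that approach (the archive and member names are just placeholders):
import zipfile

with zipfile.ZipFile('archive.zip') as zf:
    for info in zf.infolist():                 # enumerate every member of the zip
        print(info.filename)
        with zf.open(info.filename) as f:      # read a member without extracting it to disk
            print(f.readline())                # e.g. peek at the CSV header line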
This is the simplest approach and the one I always use:
import pandas as pd
df = pd.read_csv("Train.zip",compression='zip')
Supposing you are downloading a zip file that contains a CSV and you don't want to use temporary storage. Here is what a sample implementation looks like:
#!/usr/bin/env python3
from csv import DictReader
from io import TextIOWrapper, BytesIO
from zipfile import ZipFile
import requests
def all_tickers():
    url = "https://simfin.com/api/bulk/bulk.php?dataset=industries&variant=null"
    r = requests.get(url)
    zip_ref = ZipFile(BytesIO(r.content))
    for name in zip_ref.namelist():
        print(name)
        with zip_ref.open(name) as file_contents:
            reader = DictReader(TextIOWrapper(file_contents, 'utf-8'), delimiter=';')
            for item in reader:
                print(item)
This takes care of all the Python 3 bytes/str issues.
Modern pandas (since version 0.18.1) natively supports compressed CSV files: its read_csv method has a compression parameter: {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
If you have a file named my_big_file.csv and you zip it with the same name as my_big_file.zip,
you may simply do this:
df = pd.read_csv("my_big_file.zip")
Note: check your pandas version first (not applicable for older versions)