I am working on a Django project that uses a MySQL database. One of the methods uses two cursor connections to execute queries, and both cursors are closed correctly.
The MySQL database has a table called GeometryTable with two geometry-type columns: GeomPolygon (polygon values) and GeomPoint (point data). I wrote a MySQL query in my Python project that returns the selected points and polygons within a given polygon. The table (GeometryTable) has 9 million rows.
When I run the query in MySQL Workbench, it takes a few seconds, but in the project it takes several minutes to return the values. Can anyone please help me optimize the code? Thanks.
The method is :
def GeometryShapes(polygon):
    geom = 'POLYGON((' + ','.join(['%s %s' % v for v in polygon]) + '))'
    query1 = 'SELECT GeomId, AsText(GeomPolygon)' \
             ' FROM GeometryTable' \
             ' WHERE MBRWithin(GeomPolygon, GeomFromText("%s")) AND length(GeomPolygon) < 10000000' % (geom)
    query2 = 'SELECT GeomId, AsText(GeomPoint)' \
             ' FROM GeometryTable' \
             ' WHERE MBRWithin(GeomPoint, GeomFromText("%s")) AND length(GeomPoint) < 10000000' % (geom)
    if query1:
        cursor = connection.cursor()
        cursor.execute(query1)
        .....
        cursor.close()
    if query2:
        cursor = connection.cursor()
        cursor.execute(query2)
        .......
        cursor.close()
Here, the cursor executions (cursor.execute(query1) and cursor.execute(query2)) take several minutes each. I have indexed both columns in the table.
Can anyone help me optimize the code?
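For illustration only (this is not the original method), here is a minimal sketch of how the first query could be issued with a bound parameter instead of % string formatting, using the table and column names from the question; the length filter is omitted for brevity:

from django.db import connection

def geometry_shapes(polygon):
    # Build the WKT string from (x, y) pairs, as in the original method
    geom = 'POLYGON((' + ','.join('%s %s' % v for v in polygon) + '))'
    query = (
        'SELECT GeomId, AsText(GeomPolygon) '
        'FROM GeometryTable '
        'WHERE MBRWithin(GeomPolygon, GeomFromText(%s))'
    )
    with connection.cursor() as cursor:
        # The WKT is passed as a query parameter rather than interpolated
        cursor.execute(query, [geom])
        return cursor.fetchall()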
I'm trying to store temporal embeddings in PyTables. There are 12 tables, each with more than 130,000 rows, and each table has two columns (word varchar, embedding float(numpy.array(300,))). What I want is to calculate the cosine similarity of a given word against all the words in a given table, and repeat this for all 12 tables. At present I do it sequentially by iterating over each table, but it takes around 15 minutes to cover all 12 tables.
So my question is: is it possible to read all the tables concurrently? I tried multithreading, but I get an error:
Segmentation fault: 11
Below is my code snippet:
def synchronized_open_file():
    with lock:
        return tb.open_file(FILENAME, mode="r", title="embedding DB")

def synchronized_close_file(self, *args, **kwargs):
    with lock:
        return self.close(*args, **kwargs)

outqueue = queue.Queue()
threads = []
for table in list_table:
    thread = threading.Thread(target=self.top_n_similar, args=(table,))
    thread.start()
    threads.append(thread)

try:
    for _ in range(len(threads)):
        result = outqueue.get()
        if isinstance(result, Exception):
            raise result
        else:
            top_n_neighbor_per_period[result[0]] = result[1]
finally:
    for thread in threads:
        thread.join()

def top_n_similar(table_name):
    try:
        H5FILE = synchronized_open_file()
        do work()
        outqueue.put(result)
    finally:
        synchronized_close_file(H5FILE)
Yes, you can access multiple PyTables objects simultaneously. A simple example is provided below that creates 3 tables using a (300,2) NumPy record array filled with random data. It demonstrates that you can access all 3 tables as Table objects -OR- as NumPy arrays (or both).
I have not done multi-threading with PyTables, so I can't help with that. I suggest you get your code working in serial before adding multi-threading. Also, review the PyTables docs. I know h5py has specific procedures to use mpi4py for multi-threading; PyTables may have similar requirements.
Code Sample
import tables as tb
import numpy as np
h5f = tb.open_file('SO_55445040.h5',mode='w')
mydtype = np.dtype([('word',float),('varchar',float)])
arr = np.random.rand(300,2)
recarr = np.core.records.array(arr,dtype=mydtype)
h5f.create_table('/', 'table1', obj=recarr )
recarr = np.core.records.array(2.*arr,dtype=mydtype)
h5f.create_table('/', 'table2', obj=recarr )
recarr = np.core.records.array(3.*arr,dtype=mydtype)
h5f.create_table('/', 'table3', obj=recarr )
h5f.close()
h5f = tb.open_file('SO_55445040.h5',mode='r')
# Return Table objects:
tb1 = h5f.get_node('/table1')
tb2 = h5f.get_node('/table2')
tb3 = h5f.get_node('/table3')
# Return numpy arrays:
arr1 = h5f.get_node('/table1').read()
arr2 = h5f.get_node('/table2').read()
arr3 = h5f.get_node('/table3').read()
h5f.close()
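As a follow-up to the "get it working in serial first" suggestion, here is a hedged sketch of a purely serial scan over several tables in one file, computing cosine similarities with NumPy. The file name, the column names word and embedding, and the table layout are assumptions based on the question, not on the sample above.

import numpy as np
import tables as tb

def top_n_similar_serial(filename, table_names, query_vec, n=10):
    """Scan each table once and return the n rows most similar to query_vec."""
    results = {}
    q = query_vec / np.linalg.norm(query_vec)
    with tb.open_file(filename, mode='r') as h5f:
        for name in table_names:
            table = h5f.get_node('/', name)
            words = table.col('word')                  # assumed string column
            emb = np.asarray(table.col('embedding'))   # assumed (rows, 300) float column
            emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
            sims = emb.dot(q)                          # cosine similarity per row
            top = np.argsort(sims)[::-1][:n]
            results[name] = list(zip(words[top], sims[top]))
    return results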
This is the code I'm running in Python. The table has already been created in the DB. I'm calling commit, so I don't know why it's not working.
The code executes just fine, but no data is inserted into the table. I ran the same insert statement directly via the sqlite command line and it worked just fine.
import os
import sqlite3
current_dir = os.path.dirname(__file__)
db_file = os.path.join(current_dir, '../data/trips.db')
trips_db = sqlite3.connect(db_file)
c = trips_db.cursor()
print 'inserting data into aggregate tables'
c.execute(
'''
insert into route_agg_data
select
pickup_loc_id || ">" || dropoff_loc_id as ride_route,
count(*) as rides_count
from trip_data
group by
pickup_loc_id || ">" || dropoff_loc_id
'''
)
trips_db.commit
trips_db.close
I changed the last 2 lines of my code to this:
trips_db.commit()
trips_db.close()
Thanks #thesilkworm
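For context (this was not part of the original answer): writing trips_db.commit without parentheses only looks up the bound method object and never calls it, so the transaction is silently left uncommitted. A quick way to see the difference:

import sqlite3

db = sqlite3.connect(':memory:')
print(db.commit)   # prints the bound method object; nothing is committed
db.commit()        # actually commits the (empty) transaction
db.close()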
Could you write a stored procedure inside SQLite?
Then try in Python:
cur = connection.cursor()
cur.callproc('insert_into_route_agg_data', [request.data['value1'],
request.data['value2'], request.data['value3'] ] )
results = cur.fetchone()
cur.close()
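A note to hedge the suggestion above: SQLite itself has no stored procedures, and Python's sqlite3 cursors do not implement callproc, so the snippet above will not run against SQLite. A minimal sketch of the usual substitute is to wrap the aggregate insert in a plain Python function (table and column names are taken from the question; the file path is a placeholder):

import sqlite3

def insert_into_route_agg_data(conn):
    """Rebuild route_agg_data from trip_data inside one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            'insert into route_agg_data '
            'select pickup_loc_id || ">" || dropoff_loc_id as ride_route, '
            '       count(*) as rides_count '
            'from trip_data '
            'group by pickup_loc_id || ">" || dropoff_loc_id'
        )

conn = sqlite3.connect('trips.db')  # placeholder path
insert_into_route_agg_data(conn)
conn.close()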
Initially I tried using pd.read_sql().
Then I tried using sqlalchemy and query objects, but none of these methods helped: the SQL keeps executing for a long time and never finishes. I also tried using hints.
I guess the problem is the following: pandas creates a cursor object in the background, and with cx_Oracle we cannot influence the arraysize parameter that will be used, i.e. the default value of 100 is always used, which is far too small.
CODE:
import pandas as pd
import Configuration.Settings as CS
import DataAccess.Databases as SDB
import sqlalchemy
import cx_Oracle
dfs = []
DBM = SDB.Database(CS.DB_PRM,PrintDebugMessages=False,ClientInfo="Loader")
sql = '''
WITH
l AS
(
SELECT DISTINCT /*+ materialize */
hcz.hcz_lwzv_id AS lwzv_id
FROM
pm_mbt_materialbasictypes mbt
INNER JOIN pm_mpt_materialproducttypes mpt ON mpt.mpt_mbt_id = mbt.mbt_id
INNER JOIN pm_msl_materialsublots msl ON msl.msl_mpt_id = mpt.mpt_id
INNER JOIN pm_historycompattributes hca ON hca.hca_msl_id = msl.msl_id AND hca.hca_ignoreflag = 0
INNER JOIN pm_tpm_testdefprogrammodes tpm ON tpm.tpm_id = hca.hca_tpm_id
inner join pm_tin_testdefinsertions tin on tin.tin_id = tpm.tpm_tin_id
INNER JOIN pm_hcz_history_comp_zones hcz ON hcz.hcz_hcp_id = hca.hca_hcp_id
WHERE
mbt.mbt_name = :input1 and tin.tin_name = 'x1' and
hca.hca_testendday < '2018-5-31' and hca.hca_testendday > '2018-05-30'
),
TPL as
(
select /*+ materialize */
*
from
(
select
ut.ut_id,
ut.ut_basic_type,
ut.ut_insertion,
ut.ut_testprogram_name,
ut.ut_revision
from
pm_updated_testprogram ut
where
ut.ut_basic_type = :input1 and ut.ut_insertion = :input2
order by
ut.ut_revision desc
) where rownum = 1
)
SELECT /*+ FIRST_ROWS */
rcl.rcl_lotidentifier AS LOT,
lwzv.lwzv_wafer_id AS WAFER,
pzd.pzd_zone_name AS ZONE,
tte.tte_tpm_id||'~'||tte.tte_testnumber||'~'||tte.tte_testname AS Test_Identifier,
case when ppd.ppd_measurement_result > 1e15 then NULL else SFROUND(ppd.ppd_measurement_result,6) END AS Test_Results
FROM
TPL
left JOIN pm_pcm_details pcm on pcm.pcm_ut_id = TPL.ut_id
left JOIN pm_tin_testdefinsertions tin ON tin.tin_name = TPL.ut_insertion
left JOIN pm_tpr_testdefprograms tpr ON tpr.tpr_name = TPL.ut_testprogram_name and tpr.tpr_revision = TPL.ut_revision
left JOIN pm_tpm_testdefprogrammodes tpm ON tpm.tpm_tpr_id = tpr.tpr_id and tpm.tpm_tin_id = tin.tin_id
left JOIN pm_tte_testdeftests tte on tte.tte_tpm_id = tpm.tpm_id and tte.tte_testnumber = pcm.pcm_testnumber
cross join l
left JOIN pm_lwzv_info lwzv ON lwzv.lwzv_id = l.lwzv_id
left JOIN pm_rcl_resultschipidlots rcl ON rcl.rcl_id = lwzv.lwzv_rcl_id
left JOIN pm_pcm_zone_def pzd ON pzd.pzd_basic_type = TPL.ut_basic_type and pzd.pzd_pcm_x = lwzv.lwzv_pcm_x and pzd.pzd_pcm_y = lwzv.lwzv_pcm_y
left JOIN pm_pcm_par_data ppd ON ppd.ppd_lwzv_id = l.lwzv_id and ppd.ppd_tte_id = tte.tte_id
'''
#method1: using query objects
Q = DBM.getQueryObject(sql)
Q.execute({"input1": 'xxxx', "input2": 'yyyy'})
while not Q.AtEndOfResultset:
    print Q

#method2: using sqlalchemy
connectstring = ("oracle+cx_oracle://username:Password@(description="
                 "(address_list=(address=(protocol=tcp)(host=tnsconnect string)"
                 "(port=pertnumber)))(connect_data=(sid=xxxx)))")
engine = sqlalchemy.create_engine(connectstring, arraysize=10000)
df_p = pd.read_sql(sql, params={"input1": 'xxxx', "input2": 'yyyy'}, con=engine)

#method3: using pd.read_sql_query()
df_p = pd.read_sql_query(SQL_PCM, params={"input1": 'xxxx', "input2": 'yyyy'},
                         coerce_float=True, con=DBM.Connection)
It would be great if someone could help me out with this. Thanks in advance.
And yet another possibility to adjust the array size without needing to create oraaccess.xml as suggested by Chris. This may not work with the rest of your code as is, but it should give you an idea of how to proceed if you wish to try this approach!
import cx_Oracle
import pandas
import sqlalchemy

class Connection(cx_Oracle.Connection):
    def __init__(self):
        super(Connection, self).__init__("user/pw@dsn")

    def cursor(self):
        c = super(Connection, self).cursor()
        c.arraysize = 5000
        return c

engine = sqlalchemy.create_engine("oracle+cx_oracle://", creator=Connection)
pandas.read_sql(sql, engine)
Here's another alternative to experiment with.
Set a prefetch size by using the external configuration available to Oracle Call Interface programs like cx_Oracle. This overrides internal settings used by OCI programs. Create an oraaccess.xml file:
<?xml version="1.0"?>
<oraaccess xmlns="http://xmlns.oracle.com/oci/oraaccess"
           xmlns:oci="http://xmlns.oracle.com/oci/oraaccess"
           schemaLocation="http://xmlns.oracle.com/oci/oraaccess
                           http://xmlns.oracle.com/oci/oraaccess.xsd">
  <default_parameters>
    <prefetch>
      <rows>1000</rows>
    </prefetch>
  </default_parameters>
</oraaccess>
If you use tnsnames.ora or sqlnet.ora for cx_Oracle, then put the oraaccess.xml file in the same directory. Otherwise, create a new directory and set the environment variable TNS_ADMIN to that directory name.
cx_Oracle needs to be using Oracle Client 12c, or later, libraries.
Experiment with different sizes.
See OCI Client-Side Deployment Parameters Using oraaccess.xml.
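If you want to measure the effect of different fetch sizes outside of pandas first, here is a hedged sketch of timing a plain cx_Oracle fetch while varying cursor.arraysize; the credentials and the query are placeholders, not values from the question.

import time
import cx_Oracle

conn = cx_Oracle.connect("user", "password", "dsn")  # placeholder credentials
for size in (100, 1000, 10000):
    cursor = conn.cursor()
    cursor.arraysize = size                           # rows fetched per round trip
    start = time.time()
    cursor.execute("SELECT * FROM some_table")        # placeholder query
    rows = cursor.fetchall()
    print(size, len(rows), time.time() - start)
    cursor.close()
conn.close()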
I have the following sample code. The actual dataframe is about 1.1 million rows and approximately 100 MB in size. When run on AWS Glue with 10 / 30 DPUs, the code takes about 1.15 minutes, which I believe is too long.
import mysql.connector
df2= sqlContext.createDataFrame([("xxx1","81A01","TERR NAME 55","NY"),("xxx2","81A01","TERR NAME 55","NY"),("x103","81A01","TERR NAME 01","NJ")], ["zip_code","territory_code","territory_name","state"])
upsert_list = df2.collect()
db = mysql.connector.connect(host="xxxxxxxx.us-east-1.rds.amazonaws.com", user="user1", password="userpasswd", database="ziplist")
for r in upsert_list:
    cursor = db.cursor()
    insertQry = "INSERT INTO TEMP1 (zip_code, territory_code, territory_name, state) VALUES(%s, %s, %s, %s);"
    n = cursor.execute(insertQry, (r.zip_code, r.territory_code, r.territory_name, r.state))
    #print (" CURSOR status :", n)
db.commit()
db.close()
FOR UPSERTS:
cursor = db.cursor()
insertQry = "INSERT INTO TEMP2 (zip_code, territory_code, territory_name, state, city) SELECT tmp.zip_code, tmp.territory_code, tmp.territory_name, tmp.state, tmp.city from TEMP1 tmp ON DUPLICATE KEY UPDATE territory_name = tmp.territory_name, state = tmp.state, city = tmp.city;"
n = cursor.execute(insertQry)
print (" CURSOR status :", n)
db.commit()
db.close()
I found that the issue is caused by the second line, df2.collect(), which brings the entire dataframe to the driver and runs everything on a single node. I am trying to implement parallelism so that the database operations happen on multiple nodes. Is there any way I can do this?
How can I implement parallelism for RDS upserts? Please share some sample code or pseudocode.
Thanks
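Not from an accepted answer, but a minimal sketch of one common pattern: pushing the inserts to the executors with foreachPartition so each partition opens its own connection, instead of collecting everything to the driver. The connection details are copied from the question, the partition count of 10 is arbitrary, and this has not been tested on Glue.

import mysql.connector

def write_partition(rows):
    """Open one connection per partition and batch-insert its rows."""
    db = mysql.connector.connect(host="xxxxxxxx.us-east-1.rds.amazonaws.com",
                                 user="user1", password="userpasswd", database="ziplist")
    cursor = db.cursor()
    insertQry = ("INSERT INTO TEMP1 (zip_code, territory_code, territory_name, state) "
                 "VALUES (%s, %s, %s, %s)")
    batch = [(r.zip_code, r.territory_code, r.territory_name, r.state) for r in rows]
    if batch:
        cursor.executemany(insertQry, batch)
        db.commit()
    cursor.close()
    db.close()

# Run the inserts on the executors, one connection per partition
df2.repartition(10).foreachPartition(write_partition)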
I am trying to automate 100 Google searches (one per individual string in a row, returning the URLs for each query) on a specific column in a CSV (via Python 2.7); however, I am unable to get pandas to read the row contents into the Google Search automator.
*GoogleSearch source = https://breakingcode.wordpress.com/2010/06/29/google-search-python/
Overall, I can print Urls successfully for a query when I utilize the following code:
from google import search

query = "apples"
for url in search(query, stop=5, pause=2.0):
    print(url)
However, when I add pandas (to read each "query"), the rows are not read and queried as intended; i.e. the literal string "data.irow(n)" is being queried instead of the row contents, one at a time.
from google import search
import pandas as pd
from pandas import DataFrame

query_performed = 0
querying = True
query = 'data.irow(n)'

#read the excel file at column 2 (i.e. "Fruit")
df = pd.read_csv('C:\Users\Desktop\query_results.csv', header=0, sep=',', index_col='Fruit')

# need to specify "Column2" and one "data.irow(n)" queried at a time
while querying:
    if query_performed <= 100:
        print("query")
        query_performed += 1
    else:
        querying = False
        print("Asked all 100 query's")

    #prints initial urls for each "query" in a google search
    for url in search(query, stop=5, pause=2.0):
        print(url)
Incorrect output I receive at the command line:
query
Asked all 100 query's
query
Asked all 100 query's
Asked all 100 query's
http://www.irondata.com/
http://www.irondata.com/careers
http://transportation.irondata.com/
http://www.irondata.com/about
http://www.irondata.com/public-sector/regulatory/products/versa
http://www.irondata.com/contact-us
http://www.irondata.com/public-sector/regulatory/products/cavu
https://www.linkedin.com/company/iron-data-solutions
http://www.glassdoor.com/Reviews/Iron-Data-Reviews-E332311.htm
https://www.facebook.com/IronData
http://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=35267805
http://www.indeed.com/cmp/Iron-Data
http://www.ironmountain.com/Services/Data-Centers.aspx
FYI: My Excel .CSV format is the following:
B
1 **Fruit**
2 apples
3 oranges
4 mangos
5 mangos
6 mangos
...
101 mangos
Any advice on next steps is greatly appreciated! Thanks in advance!
Here's what I got. Like I mentioned in my comment, I couldn't get the stop parameter to work like I thought it should. Maybe I'm misunderstanding how it's used. I'm assuming you only want the first 5 URLs per search.
a sample df
d = {"B" : ["mangos", "oranges", "apples"]}
df = pd.DataFrame(d)
Then
stop = 5
urlcols = ["C", "D", "E", "F", "G"]

# Here I'm using an apply() to call the google search for each 'row'
# and a list is built from the urls returned by search()
df[urlcols] = df["B"].apply(lambda fruit: pd.Series([url for url in
                            search(fruit, stop=stop, pause=2.0)][:stop]))  # get 5 by slicing
which gives you the following (the formatting is a bit rough here):
B C D E F G
0 mangos http://en.wikipedia.org/wiki/Mango http://en.wikipedia.org/wiki/Mango_(disambigua... http://en.wikipedia.org/wiki/Mangifera http://en.wikipedia.org/wiki/Mangifera_indica http://en.wikipedia.org/wiki/Purple_mangosteen
1 oranges http://en.wikipedia.org/wiki/Orange_(fruit) http://en.wikipedia.org/wiki/Bitter_orange http://en.wikipedia.org/wiki/Valencia_orange http://en.wikipedia.org/wiki/Rutaceae http://en.wikipedia.org/wiki/Cherry_Orange
2 apples https://www.apple.com/ http://desmoines.citysearch.com/review/692986920 http://local.yahoo.com/info-28919583-apple-sto... http://www.judysbook.com/Apple-Store-BtoB~Cell... https://tr.foursquare.com/v/apple-store/4b466b...
If you'd rather not specify the columns (i.e. ["C", "D", ...]), you could do the following.
df.join(df["B"].apply(lambda fruit : pd.Series([url for url in
search(fruit, stop=stop, pause=2.0)][:stop])))
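For completeness, a hedged sketch of the simplest fix to the loop in the question: iterate over the actual column values instead of the literal string 'data.irow(n)'. The file path and the 'Fruit' column name are taken from the question; here the CSV is read without making Fruit the index so the column can be accessed directly.

import pandas as pd
from google import search

df = pd.read_csv(r'C:\Users\Desktop\query_results.csv', header=0, sep=',')

# Query each fruit in the 'Fruit' column, up to 100 rows
for query in df['Fruit'].head(100):
    print(query)
    for url in search(query, stop=5, pause=2.0):
        print(url)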