I am using a pyarrow dataset to query a parquet file in GCP; the code is straightforward:
import pyarrow.dataset as ds
import duckdb
import json
lineitem = ds.dataset("gs://duckddelta/lineitem",format="parquet")
lineitem_partition = ds.dataset("gs://duckddelta/delta2",format="parquet", partitioning="hive")
con = duckdb.connect()
def Query(request):
    SQL = request.get_json().get('name')
    df = con.execute(SQL).df()
    return json.dumps(df.to_json(orient="records")), 200, {'Content-Type': 'application/json'}
Then I call that function with a SQL query:
SQL = '''
SELECT
l_returnflag,
l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM
lineitem
GROUP BY 1,2
ORDER BY 1,2 ;
'''
I know that local SSD storage is faster, but I am seeing a massive difference:
The query takes 4 seconds when the file is saved on my laptop.
It takes 54 seconds when run from a Google Cloud Function in the same region.
It takes 3 minutes when I run it in Colab.
It seems to me there is a bottleneck somewhere in Google Cloud Functions; I was expecting better performance.
Edit for more context: the file is 1.2 GB, the region is us-central1 (Iowa), and the Cloud Function is gen 2 with 8 GB of memory and 8 CPUs.
Sorry, it turns out this is a possible bug in DuckDB; apparently DuckDB is not multi-threaded in this particular scenario:
https://github.com/duckdb/duckdb/issues/4525
For anyone hitting the same wall, a quick way to confirm whether DuckDB is actually allowed to use all the cores is to inspect the threads setting; this is only a sketch, assuming the PRAGMA/function names from the DuckDB docs of that era:
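import duckdb
con = duckdb.connect()
con.execute("PRAGMA threads=8")  # ask DuckDB to use all 8 vCPUs of the function
print(con.execute("SELECT current_setting('threads')").fetchall())
# If the scan over the Arrow dataset still pegs a single core, that matches the
# single-threaded behaviour described in duckdb/duckdb#4525.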
Related
I am trying to read a CSV file from an HDFS location, and three columns, batchid, load timestamp, and a delete indicator, need to be added at the beginning. I am using Spark 2.3.2 and Python 2.7.5. Sample values for the three columns to be added are given below.
batchid - YYYYMMdd (int)
load timestamp - current timestamp (timestamp)
delete indicator - blank (string)
Your question is a little bit vague, but you can do something along these lines. First, create your timestamp using plain Python:
import time
import datetime
timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
Then, assuming you use the DataFrame API, you plug that into your column :
import pyspark.sql.functions as psf

df = (df
      .withColumn('time',
                  psf.unix_timestamp(
                      psf.lit(timestamp), 'yyyy-MM-dd HH:mm:ss'
                  ).cast("timestamp"))
      .withColumn('batchid', psf.date_format('time', 'yyyyMMdd').cast('int'))
      .withColumn('delete', psf.lit('')))
To reorder your columns:
df = df.select(*["time", "batchid", "delete"] + [k for k in df.columns if k not in ["time", "batchid", "delete"]])
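If it helps, here is an equivalent sketch that stays inside Spark's own functions, so the timestamp is taken by Spark rather than built on the driver; the column names batchid / load_timestamp / delete_indicator are just my guesses at the final schema, not something the question fixes:

import pyspark.sql.functions as psf

# current_timestamp() is evaluated once per query, so all rows get the same load time
df = (df
      .withColumn('load_timestamp', psf.current_timestamp())
      .withColumn('batchid',
                  psf.date_format(psf.current_timestamp(), 'yyyyMMdd').cast('int'))
      .withColumn('delete_indicator', psf.lit('')))

df = df.select(['batchid', 'load_timestamp', 'delete_indicator'] +
               [c for c in df.columns
                if c not in ['batchid', 'load_timestamp', 'delete_indicator']])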
I have the following sample code. The actual dataframe is about 1.1 million rows and approximately 100 MB in size. The code, when run on AWS Glue with 10/30 DPUs, takes about 1.15 minutes, which I believe is too long.
import mysql.connector

df2 = sqlContext.createDataFrame([("xxx1","81A01","TERR NAME 55","NY"),("xxx2","81A01","TERR NAME 55","NY"),("x103","81A01","TERR NAME 01","NJ")], ["zip_code","territory_code","territory_name","state"])
upsert_list = df2.collect()

db = mysql.connector.connect(host="xxxxxxxx.us-east-1.rds.amazonaws.com", user="user1", password="userpasswd", database="ziplist")
for r in upsert_list:
    cursor = db.cursor()
    insertQry = "INSERT INTO TEMP1 (zip_code, territory_code, territory_name, state) VALUES(%s, %s, %s, %s);"
    n = cursor.execute(insertQry, (r.zip_code, r.territory_code, r.territory_name, r.state))
    #print (" CURSOR status :", n)
db.commit()
db.close()
FOR UPSERTS:
cursor = db.cursor()
insertQry = "INSERT INTO TEMP2 (zip_code, territory_code, territory_name, state, city) SELECT tmp.zip_code, tmp.territory_code, tmp.territory_name, tmp.state, tmp.city from TEMP1 tmp ON DUPLICATE KEY UPDATE territory_name = tmp.territory_name, state = tmp.state, city = tmp.city;"
n = cursor.execute(insertQry)
print (" CURSOR status :", n)
db.commit()
db.close()
I found that the issue is caused by the second line, df2.collect(), which pulls the entire dataframe to the driver and runs the inserts on a single node. I am trying to implement parallelism so that the database operations happen on multiple nodes; is there any way I can do this?
How can I implement parallelism for RDS upserts? Please share some sample code or pseudocode.
Thanks
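One common pattern for this (only a sketch under the assumption that the same TEMP1 table and credentials are used, and that mysql.connector is available on the worker nodes) is to push the inserts down to the executors with foreachPartition, so each Spark task opens its own connection instead of everything going through the driver:

import mysql.connector

def upsert_partition(rows):
    # one connection per partition, reused for every row in that partition
    db = mysql.connector.connect(host="xxxxxxxx.us-east-1.rds.amazonaws.com",
                                 user="user1", password="userpasswd",
                                 database="ziplist")
    cursor = db.cursor()
    insertQry = "INSERT INTO TEMP1 (zip_code, territory_code, territory_name, state) VALUES(%s, %s, %s, %s);"
    cursor.executemany(insertQry,
                       [(r.zip_code, r.territory_code, r.territory_name, r.state) for r in rows])
    db.commit()
    db.close()

# each partition is written by its own executor task instead of the driver
df2.rdd.foreachPartition(upsert_partition)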
So I recently downloaded the yahoo_finance API, version 1.4.0. I got it a few days ago, and .get_historical() was working fine. Now, however, it doesn't. Here's what it's doing:
import yahoo_finance as yf
apple=yf.Share('AAPL')
apple_price=apple.get_price()
print apple.get_historical('2016-02-15', '2016-04-29')
The error I get is: YQLResponseMalformedError: Response malformed. Is there a bug in the API, or am I forgetting something?
The Yahoo stock price API, which a lot of modules are based on, unfortunately doesn't work anymore.
Alternatively, you could use Google's API:
https://www.google.com/finance/getprices?q=1101&x=TPE&i=86400&p=3d&f=d,c,h,l,o,v
q=1101 is the stock symbol
x=TPE is the exchange (list of exchanges here: https://www.google.com/googlefinance/disclaimer/ )
i=86400 is the interval in seconds (86400 sec = 1 day)
p=3d is how far back the data goes (3 days here)
f= selects the fields of data (d=date, c=close, h=high, l=low, o=open, v=volume)
Data would look like this:
EXCHANGE%3DTPE
MARKET_OPEN_MINUTE=540
MARKET_CLOSE_MINUTE=810
INTERVAL=86400
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME
DATA=
TIMEZONE_OFFSET=480
a1496295000,24.4,24.75,24.35,24.75,11782000
1,24.5,24.5,24.3,24.4,10747000
a1496295000 is the Unix timestamp of the first row of data.
The 1 in the second row is the interval offset from the first row (an offset of 1 day).
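If it helps, here is a rough sketch of fetching and decoding that interval-offset format with requests (assuming the endpoint still responds in the shape shown above; it may have been retired since):

import datetime
import requests

url = ("https://www.google.com/finance/getprices"
       "?q=1101&x=TPE&i=86400&p=3d&f=d,c,h,l,o,v")
lines = requests.get(url).text.splitlines()

interval = 86400          # fall-back, overwritten by the INTERVAL= header line
base_ts = None
for line in lines:
    if line.startswith("INTERVAL="):
        interval = int(line.split("=")[1])
    elif line and (line[0] == "a" or line[0].isdigit()):
        cols = line.split(",")
        if cols[0].startswith("a"):
            base_ts = int(cols[0][1:])                 # absolute Unix timestamp row
            ts = base_ts
        else:
            ts = base_ts + int(cols[0]) * interval     # offset in intervals from the base row
        date = datetime.datetime.fromtimestamp(ts)
        close, high, low, open_, volume = cols[1:6]
        print("%s close=%s high=%s low=%s open=%s vol=%s"
              % (date.date(), close, high, low, open_, volume))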
1. Background
HDF is a superb file format for data storage and management.
I have source data (365 .csv files) containing the air quality data (1 h time resolution) for all monitoring sites (more than 1,500) of China. Each file consists of many features (particulate matter, SO2, etc.) and their corresponding times.
I have uploaded some template files here for anyone interested.
My goal ==> Merge all the files into one dataframe for efficient management
2. My code
# -*- coding: utf-8 -*-
import pandas as pd
from pandas import HDFStore, DataFrame
from pandas import read_hdf
import os, sys, string
import numpy as np

### CREATE AN EMPTY HDF5 FILE
hdf = HDFStore("site_2016_whole_year.h5")

### READ THE CSV FILES AND SAVE THEM INTO HDF5 FORMAT
os.chdir("./site_2016/")
files = os.listdir("./")
files.sort()

### Read a template file to get the column names
test_file = "china_sites_20160101.csv"
test_f = pd.read_csv(test_file, encoding='utf_8')
site_columns = list(test_f.columns[3:])
print site_columns[1]

feature = ['pm25', 'pm10', 'O3', 'O3_8h', 'CO', "NO2", 'SO2', "aqi"]
fe_dict = {"pm25": 1, "aqi": 0, 'pm10': 3, 'SO2': 5, 'NO2': 7, 'O3': 9, "O3_8h": 11, "CO": 13}

for k in range(0, len(feature), 1):
    # one dict of columns per feature, keyed by date/hour plus one column per site
    data_2016 = {"date": [], 'hour': []}
    for i in range(0, len(site_columns), 1):
        data_2016[site_columns[i]] = []
    for file in files[0:]:
        filename, extname = os.path.splitext(file)
        if extname == ".csv":
            datafile = file
            f_day = pd.read_csv(datafile, encoding='utf_8')
            site_columns = list(f_day.columns[3:])
            # every block of 15 rows holds one hour; pick the row for the current feature
            for i in range(0, len(f_day), 15):
                datetime = str(f_day["date"].iloc[i])
                hour = "%02d" % (f_day["hour"].iloc[i])
                data_2016["date"].append(datetime)
                data_2016["hour"].append(hour)
                for t in range(0, len(site_columns), 1):
                    data_2016[site_columns[t]].append(
                        f_day[site_columns[t]].iloc[i + fe_dict[feature[k]]])
    data_2016 = pd.DataFrame(data_2016)
    hdf.put(feature[k], data_2016, format='table', encoding="utf-8")
3. My problem
Using my code above, the HDF5 file can be created, but very slowly.
My lab has a Linux cluster with a 32-core CPU. Is there any way to turn my program into a multi-processing one?
Maybe I do not understand your problem properly, but I would use something like this:
import os
import pandas as pd

indir = <'folder with downloaded 12 csv files'>
indata = []
for i in os.listdir(indir):
    indata.append(pd.read_csv(indir + i))
out_table = pd.concat(indata)

hdf = pd.HDFStore("site_2016_whole_year.h5", complevel=9, complib='blosc')
hdf.put('table1', out_table)
hdf.close()
For 12 input files this takes 2.5 seconds on my laptop, so even for 365 files it should be done in a minute or so. I do not see a need for parallelisation in this case.
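That said, if reading the files ever does become the bottleneck, a minimal sketch (my own assumption about the folder layout, not tested against the real data) that parallelises the pd.read_csv calls with multiprocessing would look like this:

import os
from multiprocessing import Pool
import pandas as pd

indir = "./site_2016/"

def read_one(fname):
    # each worker process parses one CSV independently
    return pd.read_csv(os.path.join(indir, fname), encoding='utf_8')

if __name__ == "__main__":
    csv_files = sorted(f for f in os.listdir(indir) if f.endswith(".csv"))
    pool = Pool(8)                      # tune to the number of cores available
    frames = pool.map(read_one, csv_files)
    pool.close()
    pool.join()
    out_table = pd.concat(frames, ignore_index=True)
    hdf = pd.HDFStore("site_2016_whole_year.h5", complevel=9, complib='blosc')
    hdf.put('table1', out_table)
    hdf.close()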
I am trying to automate 100 Google searches (one per individual string in a row, returning the URLs for each query) on a specific column in a CSV (via Python 2.7); however, I am unable to get pandas to pass the row contents to the Google search automator.
*GoogleSearch source = https://breakingcode.wordpress.com/2010/06/29/google-search-python/
Overall, I can print URLs successfully for a query when I use the following code:
from google import search
query = "apples"
for url in search(query, stop=5, pause=2.0):
    print(url)
However, when I add pandas (to read each "query"), the rows are not read and queried as intended, i.e. the literal string "data.irow(n)" is being queried instead of the row contents, one at a time.
from google import search
import pandas as pd
from pandas import DataFrame

query_performed = 0
querying = True
query = 'data.irow(n)'

#read the excel file at column 2 (i.e. "Fruit")
df = pd.read_csv('C:\Users\Desktop\query_results.csv', header=0, sep=',', index_col= 'Fruit')

# need to specify "Column2" and one "data.irow(n)" queried at a time
while querying:
    if query_performed <= 100:
        print("query")
        query_performed += 1
    else:
        querying = False
        print("Asked all 100 query's")

    #prints initial urls for each "query" in a google search
    for url in search(query, stop=5, pause=2.0):
        print(url)
Incorrect output I receive at the command line:
query
Asked all 100 query's
query
Asked all 100 query's
Asked all 100 query's
http://www.irondata.com/
http://www.irondata.com/careers
http://transportation.irondata.com/
http://www.irondata.com/about
http://www.irondata.com/public-sector/regulatory/products/versa
http://www.irondata.com/contact-us
http://www.irondata.com/public-sector/regulatory/products/cavu
https://www.linkedin.com/company/iron-data-solutions
http://www.glassdoor.com/Reviews/Iron-Data-Reviews-E332311.htm
https://www.facebook.com/IronData
http://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=35267805
http://www.indeed.com/cmp/Iron-Data
http://www.ironmountain.com/Services/Data-Centers.aspx
FYI: My Excel .CSV format is the following:
B
1 **Fruit**
2 apples
3 oranges
4 mangos
5 mangos
6 mangos
...
101 mangos
Any advice on next steps is greatly appreciated! Thanks in advance!
Here's what I got. As I mentioned in my comment, I couldn't get the stop parameter to work the way I thought it should; maybe I'm misunderstanding how it's used. I'm assuming you only want the first 5 URLs per search.
A sample df:
d = {"B" : ["mangos", "oranges", "apples"]}
df = pd.DataFrame(d)
Then
stop = 5
urlcols = ["C","D","E","F","G"]
# Here i'm using an apply() to call the google search for each 'row'
# and a list is built for the urls return by search()
df[urlcols] = df["B"].apply(lambda fruit: pd.Series([url for url in
                            search(fruit, stop=stop, pause=2.0)][:stop]))  # get 5 by slicing
which gives you the following (the formatting is a bit rough here):
B C D E F G
0 mangos http://en.wikipedia.org/wiki/Mango http://en.wikipedia.org/wiki/Mango_(disambigua... http://en.wikipedia.org/wiki/Mangifera http://en.wikipedia.org/wiki/Mangifera_indica http://en.wikipedia.org/wiki/Purple_mangosteen
1 oranges http://en.wikipedia.org/wiki/Orange_(fruit) http://en.wikipedia.org/wiki/Bitter_orange http://en.wikipedia.org/wiki/Valencia_orange http://en.wikipedia.org/wiki/Rutaceae http://en.wikipedia.org/wiki/Cherry_Orange
2 apples https://www.apple.com/ http://desmoines.citysearch.com/review/692986920 http://local.yahoo.com/info-28919583-apple-sto... http://www.judysbook.com/Apple-Store-BtoB~Cell... https://tr.foursquare.com/v/apple-store/4b466b...
If you'd rather not specify the columns (i.e. ["C", "D", ...]), you could do the following:
df.join(df["B"].apply(lambda fruit : pd.Series([url for url in
search(fruit, stop=stop, pause=2.0)][:stop])))
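And to run the same thing against the CSV from the question instead of the toy frame (assuming the column really is named Fruit as in the sample layout above, and dropping the index_col so Fruit stays a regular column):

df = pd.read_csv(r'C:\Users\Desktop\query_results.csv')   # raw string avoids the backslash issue
results = df.join(df["Fruit"].apply(lambda fruit: pd.Series(
    [url for url in search(fruit, stop=stop, pause=2.0)][:stop])))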