How to Make Pandas DataFrame SQL Fetch Progress Bar? - python-2.7

I'm trying to make a progress bar that reflects the progress of building a pandas DataFrame from SQL. Currently I have a table with 9 columns containing 1000 records.
import datetime
import pandas as pd
import psycopg2 as ps
import pandas.io.sql as psql

conn = ps.connect(user="user", password="password", database="database")
sql = "select * from table"
a = datetime.datetime.now()
df = psql.read_frame(sql, con=conn)  # older pandas API; newer versions use pd.read_sql
# ... and some little post-processing functions
b = datetime.datetime.now()
print b - a
Instead of printing the delta between start and end time, I would prefer to show a progress bar to the end user (in case the data gets bigger), so they have an idea of how long it will take. Is that possible? How would I do it?
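One possible approach (not from the original thread, just a hedged sketch): read the query in chunks with pandas and update a progress bar per chunk. This assumes the tqdm package is installed, that an up-front COUNT(*) is acceptable, and it reuses the placeholder credentials and table name from the question.
import pandas as pd
import psycopg2 as ps
from tqdm import tqdm  # assumption: tqdm is installed

conn = ps.connect(user="user", password="password", database="database")

# total row count so the bar can show a percentage (placeholder table name from the question)
total_rows = pd.read_sql("select count(*) from table", con=conn).iloc[0, 0]

chunks = []
with tqdm(total=total_rows) as bar:
    for chunk in pd.read_sql("select * from table", con=conn, chunksize=100):
        chunks.append(chunk)
        bar.update(len(chunk))

df = pd.concat(chunks, ignore_index=True)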

Related

How do I compute moving averages on QuestDB?

I'm using the database QuestDB and trying to compute moving averages on some market data. Is this something that is already available out of the box (as in InfluxDB or kdb+), or is there a way around it?
If you are using pandas / Python, the following example shows how to calculate the moving average after connecting to the database over the Postgres wire protocol:
import psycopg2
import pandas as pd

connection = psycopg2.connect(user="admin",
                              password="quest",
                              host="127.0.0.1",
                              port="8812",
                              database="qdb")

# get query result as a dataframe
df_trades = pd.read_sql_query("select * from my_table", connection)

# simple moving average, k=10, of the column called 'close'
df_trades['moving_av_10'] = df_trades['close'].rolling(window=10).mean()
print(df_trades.tail())

BigQuery - copy a query into a new table

I wrote a query against one of my BigQuery tables, called historical, and I would like to copy the result of this query into a new BigQuery table called historical_recent. I'm having difficulty figuring out how to do this operation with Python. Right now, I am able to execute my query and get the expected result:
SELECT * FROM gcp-sandbox.dailydev.historical
WHERE (date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00')
I am also able to copy my BigQuery table without making any changes, with this script:
from google.cloud import bigquery

client = bigquery.Client()
job = client.copy_table(
    'gcp-sandbox.dailydev.historical',
    'gcp-sandbox.dailydev.historical_copy')
How can I combine both using Python?
You can use an INSERT statement, as in the example below:
INSERT `gcp-sandbox.dailydev.historical_recent`
SELECT *
FROM `gcp-sandbox.dailydev.historical`
WHERE date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00'
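If you want to run that INSERT from Python as well, a minimal sketch (assuming historical_recent already exists with a matching schema) is to submit it as a query job:
from google.cloud import bigquery

client = bigquery.Client()
insert_sql = """
INSERT `gcp-sandbox.dailydev.historical_recent`
SELECT *
FROM `gcp-sandbox.dailydev.historical`
WHERE date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00'
"""
client.query(insert_sql).result()  # waits for the DML job to finish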
Alternatively, using Python to save your query result directly to a destination table:
from google.cloud import bigquery

client = bigquery.Client()

# Target table to save results
table_id = "gcp-sandbox.dailydev.historical_recent"

job_config = bigquery.QueryJobConfig(
    allow_large_results=True,
    destination=table_id,
    use_legacy_sql=False  # standard SQL, matching the backtick-quoted table names below
)
sql = """
SELECT * FROM `gcp-sandbox.dailydev.historical`
WHERE (date BETWEEN '2015-11-05 00:00:00' AND '2015-11-07 23:00:00')
"""
query = client.query(sql, job_config=job_config)
query.result()
print("Query results loaded to the table {}".format(table_id))
This example is based on the Google documentation.

Creation of formatted DataFrame and then adding data line by line

I have a continuous stream of data coming in, so I want to define the DataFrame beforehand so that I don't have to tell pandas to format the data or set the index later.
So I want to create a DataFrame like
df = pd.DataFrame(columns=["timestamp","stockname","price","volume"])
but I want to tell pandas that the index should be timestamp and that its format is "%Y-%m-%d %H:%M:%S:%f". Once this is set, I would read through the file and pass data to the initialized DataFrame.
I get data in variables like these, populated on each pass of a loop:
for line in filehandle:
    timestamp, stockname, price, volume = fetch(line)
Here I want to update df. This update would go on while I keep using a copy of df, say tempdf, to do re-sampling or any other task at any given point in time, because the original DataFrame df is being updated continuously.
import numpy as np
import pandas as pd
import datetime as dt
import time

# create df with timestamp as index
df = pd.DataFrame(columns=["timestamp", "stockname", "price", "volume"])
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%Y-%m-%d %H:%M:%S:%f")
df.set_index('timestamp', inplace=True)

for i in range(10):  # for the purposes of functioning demo code
    time.sleep(0.01)  # give jupyter notebook a moment
    timestamp = dt.datetime.now()  # to be used as index
    df.loc[timestamp] = ['AAPL', np.random.randint(1000), np.random.randint(10)]  # replace with your database read

tempdf = df.copy()  # snapshot you can resample or analyse while df keeps growing
If you are reading a file or database continuously, you can replace the for loop with what you described in your question, as sketched below. @MattR's questions should also be addressed; if you need to continuously log or update data, I am not sure pandas is the best solution.
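A minimal sketch of that replacement, continuing from the df set up above (the file name, filehandle, and the fetch(line) parser are placeholders taken from the question, not real helpers):
with open("stream.csv") as filehandle:                       # placeholder file name
    for line in filehandle:
        timestamp, stockname, price, volume = fetch(line)    # hypothetical parser from the question
        ts = pd.to_datetime(timestamp, format="%Y-%m-%d %H:%M:%S:%f")
        df.loc[ts] = [stockname, price, volume]
        tempdf = df.copy()                                   # snapshot for resampling at any point in time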

How do I add a current timestamp (extra column) in the Glue job so that the output data has an extra column?

How do I add a current timestamp (extra column) in the Glue job so that the output data has an extra column? In this case:
Source table schema:
Col1, Col2
Destination schema after the Glue job:
Col1, Col2, Update_Date (current timestamp)
We do the following, and it works great without converting to a DataFrame via toDF():
from datetime import datetime
from awsglue.transforms import Map  # the Map transform from the Glue library

datasource0 = glueContext.create_dynamic_frame.from_catalog(...)

def AddProcessedTime(r):
    r["jobProcessedDateTime"] = datetime.today()  # timestamp of when we ran this
    return r

mapped_dyF = Map.apply(frame=datasource0, f=AddProcessedTime)
I'm not sure if there's a glue native way to do this with the DynamicFrame, but you can easily convert to a Spark Dataframe and then use the withColumn method. You will need to use the lit function to put literal values into a new column, as below.
from datetime import datetime
from pyspark.sql.functions import lit
glue_df = glueContext.create_dynamic_frame.from_catalog(...)
spark_df = glue_df.toDF()
spark_df = spark_df.withColumn('some_date', lit(datetime.now()))
Some references:
Glue DynamicFrame toDF()
Spark Dataframe withColumn()
In my experience working with Glue, the timezone where Glue runs is GMT, but my timezone is CDT. So, to get CDT time I need to convert it within the SparkContext. This specific case is for adding last_load_date to the target/sink.
So I created a function.
from datetime import datetime as dt
from pyspark.sql import SQLContext
from pyspark.sql.functions import from_utc_timestamp

def convert_timezone(sc):
    sqlContext = SQLContext(sc)
    local_time = dt.now().strftime('%Y-%m-%d %H:%M:%S')
    local_time_df = sqlContext.createDataFrame([(local_time,)], ['time'])
    CDT_time_df = local_time_df.select(from_utc_timestamp(local_time_df['time'], 'CST6CDT').alias('cdt_time'))
    CDT_time = [i['cdt_time'].strftime('%Y-%m-%d %H:%M:%S') for i in CDT_time_df.collect()][0]
    return CDT_time
And then call the function like ...
job_run_time = date_config.convert_timezone(sc)
datasourceDF0 = datasource0.toDF()
datasourceDF1 = datasourceDF0.withColumn('last_updated_date',lit(job_run_time))
As I have seen there is not a proper answer to this issue yet, I will try to explain my solution to the problem:
First, to clarify: the withColumn function is a good way to do this, but it is important to mention that it belongs to Spark's own DataFrame and is not part of the Glue DynamicFrame, which is AWS Glue's own library, so you need to convert between the frames to do this.
The first step is to get the Spark DataFrame from the DynamicFrame; the Glue library does this with the toDF() function. Once you have the Spark frame you can add the column and/or do whatever manipulation you require.
Then, since Glue expects its own frame type, we need to transform back from the Spark DataFrame to Glue's proprietary frame. To do so you can use the DynamicFrame's apply constructor, which requires importing the object:
import com.amazonaws.services.glue.DynamicFrame
and use the glueContext, which you should already have, like:
DynamicFrame(sparkDataFrame, glueContext)
In summary, the code should look like:
import org.apache.spark.sql.functions._
import com.amazonaws.services.glue.DynamicFrame
...
val sparkDataFrame = datasourceToModify.toDF().withColumn("created_date", current_date())
val finalDataFrameForGlue = DynamicFrame(sparkDataFrame, glueContext)
...
Note: the import org.apache.spark.sql.functions._ is there to bring in the current_date() function used to add the column with the date.
Hope this helps....
Use Spark's current_timestamp() function:
import org.apache.spark.sql.functions._
...
val timestampedDf = source.toDF().withColumn("Update_Date", current_timestamp())
val timestamped = DynamicFrame(timestampedDf, glueContext)
You can supposedly do this with built-in functionality now: see here...
Note: look for the glueContext.add_ingestion_time_columns section.
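A hedged sketch of what that might look like in a PySpark Glue job (assuming the usual glueContext boilerplate is already set up; check the AWS docs for the exact granularity options and the names of the generated columns):
from awsglue.dynamicframe import DynamicFrame

# same elided catalog read as in the answers above
datasource0 = glueContext.create_dynamic_frame.from_catalog(...)

# add_ingestion_time_columns operates on a Spark DataFrame
df_with_time = glueContext.add_ingestion_time_columns(datasource0.toDF(), "hour")

# convert back to a DynamicFrame for the rest of the job
dyf_with_time = DynamicFrame.fromDF(df_with_time, glueContext, "with_ingestion_time")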

Loop through csv with Pandas, specific column

With the csv module, I loop through the rows to execute logic:
import csv
with open("file.csv", "r") as csv_read:
r = csv.reader(csv_read, delimiter = ",")
next(r, None) #Skip headers first row
for row in rows:
#Logic here
I'm new to Pandas, and I want to execute the same logic, using the second column only in the csv as the input for the loop.
import pandas as pd
pd.read_csv("file.csv", usecols=[1])
Assuming the above is correct, what should I do from here to execute the logic based on the cells in column 2?
I want to use the cell values in column 2 as input for a web crawler. It takes each result and inputs it as a search term on a webpage, and then scrapes data from that webpage. Is there any way to grab each cell value in the array rather than the whole array at the same time?
Basically the pandas equivalent of your code is this:
import pandas as pd
df = pd.read_csv("file.csv", usecols=[1])
So passing usecols=[1] will only load the second column, see the docs.
Now, assuming this column has a name like 'url' (though it doesn't really matter), we can do something like:
def crawl(x):
    # do something
    pass

df.apply(crawl)
So in principle the above will crawl each URL in your column, one value at a time.
EDIT
You can pass the param axis=1 to apply so that it processes each row rather than the entire column:
df.apply(crawl, axis=1)
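If you prefer to work with the column as a Series and call the crawler once per cell value, a sketch along these lines should also work (crawl here is just a stand-in for your own scraping function):
import pandas as pd

def crawl(search_term):
    # placeholder: submit search_term to the webpage and scrape the result
    return "result for {}".format(search_term)

df = pd.read_csv("file.csv", usecols=[1])
second_col = df.iloc[:, 0]                 # the single column that was loaded, i.e. column 2 of the csv
df["scraped"] = second_col.apply(crawl)    # crawl is called once per cell value
print(df.head())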