How do I compute moving averages on QuestDB? - questdb

I'm using the database QuestDB, and trying to compute moving averages with some market data Is this something that is already available out of the box (like in Influxdb or kdb+), or is there a way around it?

If you are using Pandas / Python, the following example shows how to calculate the moving average after connecting to the database using postgres wire:
import psycopg2
import pandas as pd
df_trades = pd.DataFrame()
connection = psycopg2.connect(user="admin",
password="quest",
host="127.0.0.1",
port="8812",
database="qdb")
# get query as dataframe
df_trades = pd.read_sql_query("select * from my_table",connection)
# Simple moving average, k=10, of column called 'close'
df_trades['moving_av_10'] = df_trades['close'].rolling(window=10).mean()
print(df_trades.tail())

Related

GCP Cloud Function to write data to BigQuery runs with success but data doesn't appear in BigQuery table

I am running the following cloud function. It runs with success and indicates data was loaded to the table. But when I query the BigQuery no data has been added. I am getting no errors and no indication that it isn't working.
from google.cloud import bigquery
import pandas as pd
def download_data(event, context):
df = pd.read_csv('https://covid.ourworldindata.org/data/ecdc/full_data.csv')
# Create an empty list
Row_list =[]
# Iterate over each row
for index, rows in df.iterrows():
# Create list for the current row
my_list =[rows.date, rows.location, rows.new_cases, rows.new_deaths, rows.total_cases, rows.total_deaths]
#print(my_list)
# append the list to the final list
Row_list.append(my_list)
## Get Biq Query Set up
client = bigquery.Client()
table_id = "<project_name>.raw.daily_load"
table = client.get_table(table_id)
print(client)
print(table_id)
print(table)
errors = client.insert_rows(table, Row_list) # Make an API request.
if errors == []:
print("New rows have been added.")
Attempted so far;
Check data was being pulled -> PASSED, I printed out row_list and
data is there
Run locally from my machine -> PASSED, data appeared when I ran it from a python terminal
Print out the table details -> PASSED, see attached screenshot it all appears in the logs
Confirm it is able to find the table -> PASSED, I changed the name
of the table to one that didn't exist and it failed
Not sure what is next, any advice would be greatly appreciated
Maybe this post in Google Cloud documentation could help.
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#loading_csv_data_into_a_table
You can directly stream the data from the website to BigQuery using Cloud Functions but the data should be clean and conform to BigQuery standards else the e insertion will fail. One more point to note is that the dataframe columns must match the table columns for the data to be successfully inserted. I tested this out and saw insertion errors returned by the client when the column names didn’t match.
Writing the function
I have created a simple Cloud Function using the documentation and pandas example. The dependencies that need to be included are google-cloud-bigquery and pandas.
main.py
from google.cloud import bigquery
import pandas as pd
def hello_gcs(event,context):
df = pd.read_csv('https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv')
df.set_axis(["Month", "Year_1", "Year_2", "Year_3"], axis=1, inplace=True) ## => Rename the columns if necessary
table_id = "project.dataset.airtravel"
## Get BiqQuery Set up
client = bigquery.Client()
table = client.get_table(table_id)
errors = client.insert_rows_from_dataframe(table, df) # Make an API request.
if errors == []:
print("Data Loaded")
return "Success"
else:
print(errors)
return "Failed"
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-bigquery
pandas
Now you can directly deploy the function.
Output
Output Table
Assuming that the App Engine default service account has the default Editor role assigned and that you have a very simple schema for the BigQuery table. For example:
Field name Type Mode Policy tags Description
date STRING NULLABLE
location STRING NULLABLE
new_cases INTEGER NULLABLE
new_deaths INTEGER NULLABLE
total_cases INTEGER NULLABLE
total_deaths INTEGER NULLABLE
The following modification of your code should work for an HTTP triggered function. Notice that you were not including the Row_list.append(my_list) in the for loop to populate your list with the elements and that according to the samples on the documentation you should be using a list of tuples:
from google.cloud import bigquery
import pandas as pd
client = bigquery.Client()
table_id = "[PROJECT-ID].[DATASET].[TABLE]"
def download_data(request):
df = pd.read_csv('https://covid.ourworldindata.org/data/ecdc/full_data.csv')
# Create an empty list
Row_list =[]
# Iterate over each row
for index, rows in df.iterrows():
# Create list for the current row
my_list =(rows.date, rows.location, rows.new_cases, rows.new_deaths, rows.total_cases, rows.total_deaths)
# append the list to the final list
Row_list.append(my_list)
## Get Biq Query Set up
table = client.get_table(table_id)
errors = client.insert_rows(table, Row_list) # Make an API request.
if errors == []:
print("New rows have been added.")
With the very simple requirements.txt file:
# Function dependencies, for example:
# package>=version
pandas
google-cloud-bigquery

Elasticsearch-Hadoop formatting multi resource writes issue

I am interfacing Elasticsearch with Spark, using the Elasticsearch-Hadoop plugin and I am having difficulty writing a dataframe with a timestamp type column to Elasticsearch.
The problem is when I try to write using dynamic/multi resource formatting to create a daily index.
From the relevant documentation I get the impression that this is possible, however, the python example below fails to run unless I change my dataframe type to date.
import pyspark
conf = pyspark.SparkConf()
conf.set('spark.jars', 'elasticsearch-spark-20_2.11-6.1.2.jar')
conf.set('es.nodes', '127.0.0.1:9200')
conf.set('es.read.metadata', 'true')
conf.set('es.nodes.wan.only', 'true')
from datetime import datetime, timedelta
now = datetime.now()
before = now - timedelta(days=1)
after = now + timedelta(days=1)
cols = ['idz', 'name', 'time']
vals = [(0,'maria', before), (1, 'lolis', after)]
time_df = spark.createDataFrame(vals, cols)
When I try to write, I use the following:
time_df.write.mode('append').format(
'org.elasticsearch.spark.sql'
).options(
**{'es.write.operation': 'index' }
).save('xxx-{time|yyyy.MM.dd}/1')
Unfortunatelly this renders an error:
.... Caused by: java.lang.IllegalArgumentException: Invalid format:
"2018-03-04 12:36:12.949897" is malformed at " 12:36:12.949897" at
org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:945)
On the other hand this works perfectly fine if I use dates when I create my dataframe:
cols = ['idz', 'name', 'time']
vals = [(0,'maria', before.date()), (1, 'lolis', after.date())]
time_df = spark.createDataFrame(vals, cols)
Is it possible to format a dataframe timestamp to be written to daily indexes with this method, without also keeping a date column around? How about monthly indexes?
Pyspark version:
spark version 2.2.1
Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_151
ElasticSearch version
number "6.2.2" build_hash "10b1edd"
build_date "2018-02-16T19:01:30.685723Z" build_snapshot false
lucene_version "7.2.1" minimum_wire_compatibility_version "5.6.0"
minimum_index_compatibility_version "5.0.0"

Creation of formatted DataFrame and then adding data line by line

I have a continuous stream of data coming in so I want to define the DataFrame before hand so that I don't have tell pandas to format data or set index
So I want to create a DataFrame like
df = pd.DataFrame(columns=["timestamp","stockname","price","volume"])
but I want to tell Pandas that index of data should be timestamp and that the format would be
"%Y-%m-%d %H:%M:%S:%f"
and one this it set, then I would read through file and pass data to the DataFrame initialized
I get data in variables like these populated every time in loop like
for line in filehandle:
timestamp, stockname, price, volume = fetch(line)
here I want to update the "df"
this update would go on while I would keep using the copy of
df
let us say into a
tempdf
to do re-sampling or any other task at any given point in time because original dataframe
df
is getting updated continuously
import numpy as np
import pandas as pd
import datetime as dt
import time
# create df with timestamp as index
df = pd.DataFrame(columns=["timestamp","stockname","price","volume"], dtype = float)
pd.to_datetime(df['timestamp'], format = "%Y-%m-%d %H:%M:%S:%f")
df.set_index('timestamp', inplace = True)
for i in range(10): # for the purposes of functioning demo code
i += 1 # counter
time.sleep(0.01) # give jupyter notebook a moment
timestamp = dt.datetime.now() # to be used as index
df.loc[timestamp] = ['AAPL', np.random.randint(1000), np.random.randint(10)] # replace with your database read
tempdf = df.copy()
If you are reading a file or database continuously, you can replace the for: loop with what you described in your question. #MattR's questions should also be addressed; if you need to continuously log or update data, I am not sure if pandas is the best solution.

How to I add a current timestamp (extra column) in the glue job so that the output data has an extra column

How to I add a current timestamp (extra column) in the glue job so that the output data has an extra column. In this case:
Schema Source Table:
Col1, Col2
After Glue job.
Schema of Destination:
Col1, Col2, Update_Date(Current Timestamp)
We do the following and works great without converting toDF()
datasource0 = glueContext.create_dynamic_frame.from_catalog(...)
from datetime import datetime
def AddProcessedTime(r):
r["jobProcessedDateTime"] = datetime.today() #timestamp of when we ran this.
return r
mapped_dyF = Map.apply(frame = datasource0, f = AddProcessedTime)
I'm not sure if there's a glue native way to do this with the DynamicFrame, but you can easily convert to a Spark Dataframe and then use the withColumn method. You will need to use the lit function to put literal values into a new column, as below.
from datetime import datetime
from pyspark.sql.functions import lit
glue_df = glueContext.create_dynamic_frame.from_catalog(...)
spark_df = glue_df.toDF()
spark_df = spark_df.withColumn('some_date', lit(datetime.now()))
Some references:
Glue DynamicFrame toDF()
Spark Dataframe withColumn()
In my experience working with Glue the timezone where Glue runs is GMT. But my timezone is CDT. So, to get CDT timezone I need to convert the time within SparkContext. This specific case is to add last_load_date to the target/sink.
So I created a function.
def convert_timezone(sc):
sqlContext = SQLContext(sc)
local_time=dt.now().strftime('%Y-%m-%d %H:%M:%S')
local_time_df=sqlContext.createDataFrame([(local_time,)],['time'])
CDT_time_df = local_time_df.select(from_utc_timestamp(local_time_df['time'],'CST6CDT').alias('cdt_time'))
CDT_time=[i['cdt_time'].strftime('%Y-%m-%d %H:%M:%S') for i in CDT_time_df.collect()][0]
return CDT_time
And then call the function like ...
job_run_time = date_config.convert_timezone(sc)
datasourceDF0 = datasource0.toDF()
datasourceDF1 = datasourceDF0.withColumn('last_updated_date',lit(job_run_time))
As I have been seen there is not a properly answer to this issue I will try to explain my solution to this problem:
First thing is to clarify the withColumn function is a good way to do this but it is important to mention that this function is from the Dataframe from Spark itself and this function is not part of the glue DynamicFrame which is a own library from Glue AWS, so you need to covert the frames to do this....
First step is from the DynamicFrame get the Spark Dataframe, glue library does this with the function toDF() function, once with the Spark frame you can add the column and/or do whatever manipulation you require.
Then what we glue expect is his own frame so we need to transformed back from spark to glue proprietary frame, to do so you can use the apply function of the DynamicFrame, which requires to import the object:
import com.amazonaws.services.glue.DynamicFrame
and use the glueContext which you should already have it, like:
DynamicFrame(sparkDataFrame, glueContext)
In resume the code should looks like:
import org.apache.spark.sql.functions._
import com.amazonaws.services.glue.DynamicFrame
...
val sparkDataFrame = datasourceToModify.toDF().withColumn("created_date", current_date())
val finalDataFrameForGlue = DynamicFrame(sparkDataFrame, glueContext)
...
Note: the import org.apache.spark.sql.functions._ is to bring the current_date() function to add the column with the date.
Hope this helps....
Use Spark's current_timestamp() function:
import org.apache.spark.sql.functions._
...
val timestampedDf = source.toDF().withColumn("Update_Date", current_timestamp())
val timestamped = DynamicFrame(timestampedDf, glueContext)
You can do this supposedly with a built-in functionality now: see here...
Note to look for just the glueContext.add_ingestion_time_columns section

How to Make Pandas DataFrame SQL Fetch Progress Bar?

I'm trying to make a progress bar reflecting pandas dataframe build progress from sql. Currently I have a table with 9 columns containing 1000 records.
import pandas as pd
import psycopg2 as ps
import pandas.io.sql as psql
conn = ps.connect(user="user", password="password", database="database")
sql = "select * from table"
a = datetime.datetime.now()
df = psql.read_frame(sql, con=conn)
---and blablabla some little functions
b = datetime.datetime.now()
print b-a
instead of getting delta start & end time of the function, I would prefer and it would be nice to show progress bar to end user (just in case the data is getting bigger), so they have idea how long it would take. Is that possible? how to do it?