I have a continuous stream of data coming in, so I want to define the DataFrame beforehand, so that I don't have to tell pandas to format the data or set the index every time.
So I want to create a DataFrame like
df = pd.DataFrame(columns=["timestamp","stockname","price","volume"])
but I want to tell pandas that the index should be the timestamp column and that its format is
"%Y-%m-%d %H:%M:%S:%f"
Once this is set, I would read through the file and pass the data to the initialized DataFrame.
I get the data in variables like these, populated on every pass through the loop:
for line in filehandle:
    timestamp, stockname, price, volume = fetch(line)
Here I want to update df. The updates would keep going while I use a copy of df, say tempdf, to do resampling or any other task at any given point in time, because the original DataFrame df is being updated continuously.
import numpy as np
import pandas as pd
import datetime as dt
import time

# create df with timestamp as index
df = pd.DataFrame(columns=["timestamp", "stockname", "price", "volume"])
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%Y-%m-%d %H:%M:%S:%f")
df.set_index('timestamp', inplace=True)

for i in range(10):  # for the purposes of a functioning demo
    time.sleep(0.01)  # give jupyter notebook a moment
    timestamp = dt.datetime.now()  # to be used as index
    df.loc[timestamp] = ['AAPL', np.random.randint(1000), np.random.randint(10)]  # replace with your file/database read

tempdf = df.copy()  # snapshot to resample or analyse while df keeps growing
If you are reading a file or database continuously, you can replace the for loop with what you described in your question. @MattR's questions should also be addressed; if you need to continuously log or update data, I am not sure pandas is the best solution.
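A minimal sketch wiring that into the loop from the question (assuming fetch() and filehandle are as you described, and that fetch() returns the timestamp as a string in your stated format):
for line in filehandle:
    timestamp, stockname, price, volume = fetch(line)
    ts = pd.to_datetime(timestamp, format="%Y-%m-%d %H:%M:%S:%f")  # parse using your format
    df.loc[ts] = [stockname, float(price), float(volume)]          # add/update one row

tempdf = df.copy()  # work on a snapshot while df keeps being updated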
I am trying to load a CSV from GCS which contains timestamps in one of the columns.
When I upload via the BQ interface, I get the following error:
Could not parse '2018-05-03 10:25:18.257000000' as DATETIME for field creation_date (position 6) starting at location 678732930 with message 'Invalid datetime string "2018-05-03 10:25:18.257000000"'
Is the issue here the trailing zeros? How would I fix it using Python?
Thanks in advance
Yes, you are correct; the issue is the trailing zeros. A DATETIME field only allows six digits of sub-second precision.
Name | Range
DATETIME | 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999999
To remove the trailing zeros, you can use pandas to convert the value to a proper DATETIME format so it can be used in BigQuery. For testing purposes, I used a CSV file that contains a dummy value in column 0 and a DATETIME with trailing zeros in column 1.
Test,2018-05-03 10:25:18.257000000
Test1,2018-05-03 10:22:18.123000000
Test2,2018-05-03 10:23:18.234000000
Using this block of code, Pandas will convert column 1 to the proper DATETIME format:
import pandas as pd

df = pd.read_csv("data.csv", header=None)  # define your CSV file here
datetime_column = df.iloc[:, 1]  # change to the location of your DATETIME column
df.iloc[:, 1] = pd.to_datetime(datetime_column, format='%Y-%m-%d %H:%M:%S.%f')  # convert to the correct datetime format
df.to_csv("data.csv", header=False, index=False)  # write the new values back to data.csv
print(df)  # print the output for testing
This will result in:
Test,2018-05-03 10:25:18.257
Test1,2018-05-03 10:22:18.123
Test2,2018-05-03 10:23:18.234
You can now use the updated CSV file to write to BQ via the BQ interface.
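If you would rather not involve pandas just for the trimming, a minimal sketch with the standard csv module (assuming the same two-column layout as the sample above, with the timestamp in the second column) can simply cut the fractional part down to six digits, which is the maximum DATETIME accepts:
import csv

with open("data.csv") as src, open("data_fixed.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for value, ts in csv.reader(src):
        # "2018-05-03 10:25:18.257000000" -> "2018-05-03 10:25:18.257000"
        writer.writerow([value, ts[:26]])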
How do I add a current timestamp (an extra column) in the Glue job so that the output data has that extra column? In this case:
Schema of the source table:
Col1, Col2
After the Glue job, the schema of the destination should be:
Col1, Col2, Update_Date (current timestamp)
We do the following, and it works great without converting to a Spark DataFrame with toDF():
from datetime import datetime

datasource0 = glueContext.create_dynamic_frame.from_catalog(...)

def AddProcessedTime(r):
    r["jobProcessedDateTime"] = datetime.today()  # timestamp of when we ran this
    return r

mapped_dyF = Map.apply(frame=datasource0, f=AddProcessedTime)
I'm not sure if there's a Glue-native way to do this with the DynamicFrame, but you can easily convert it to a Spark DataFrame and then use the withColumn method. You will need the lit function to put a literal value into the new column, as below.
from datetime import datetime
from pyspark.sql.functions import lit
glue_df = glueContext.create_dynamic_frame.from_catalog(...)
spark_df = glue_df.toDF()
spark_df = spark_df.withColumn('some_date', lit(datetime.now()))
Some references:
Glue DynamicFrame toDF()
Spark Dataframe withColumn()
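If you need a Glue DynamicFrame again afterwards (for example, to write the result with a Glue sink), you can convert back with DynamicFrame.fromDF. A rough sketch, assuming the spark_df and glueContext from above:
from awsglue.dynamicframe import DynamicFrame

# wrap the timestamped Spark DataFrame back into a Glue DynamicFrame
timestamped_dyf = DynamicFrame.fromDF(spark_df, glueContext, "timestamped_dyf")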
In my experience, the timezone Glue runs in is GMT, but my timezone is CDT. So, to get the CDT timezone I need to convert the time within the SparkContext. This specific case adds last_load_date to the target/sink.
So I created a function.
from datetime import datetime as dt
from pyspark.sql import SQLContext
from pyspark.sql.functions import from_utc_timestamp

def convert_timezone(sc):
    sqlContext = SQLContext(sc)
    local_time = dt.now().strftime('%Y-%m-%d %H:%M:%S')
    local_time_df = sqlContext.createDataFrame([(local_time,)], ['time'])
    CDT_time_df = local_time_df.select(from_utc_timestamp(local_time_df['time'], 'CST6CDT').alias('cdt_time'))
    CDT_time = [i['cdt_time'].strftime('%Y-%m-%d %H:%M:%S') for i in CDT_time_df.collect()][0]
    return CDT_time
And then call the function like this (lit comes from pyspark.sql.functions, and date_config is the module where convert_timezone lives):
job_run_time = date_config.convert_timezone(sc)
datasourceDF0 = datasource0.toDF()
datasourceDF1 = datasourceDF0.withColumn('last_updated_date', lit(job_run_time))
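A possibly simpler variant of the same idea is to do the conversion directly inside withColumn using Spark's current_timestamp and from_utc_timestamp (a sketch, assuming the driver clock is UTC/GMT as noted above):
from pyspark.sql.functions import current_timestamp, from_utc_timestamp

# produces a timestamp column already shifted from UTC to CDT
datasourceDF1 = datasourceDF0.withColumn(
    'last_updated_date',
    from_utc_timestamp(current_timestamp(), 'CST6CDT')
)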
Since I have not seen a complete answer to this issue, I will try to explain my solution:
First, to clarify: the withColumn function is a good way to do this, but it belongs to Spark's own DataFrame; it is not part of the Glue DynamicFrame, which is AWS Glue's own library, so you need to convert between the frames.
The first step is to get the Spark DataFrame from the DynamicFrame; the Glue library does this with the toDF() function. Once you have the Spark frame you can add the column and/or do whatever manipulation you require.
Glue then expects its own frame back, so you need to transform the Spark DataFrame back into Glue's proprietary frame. To do so you can use the apply function of the DynamicFrame, which requires importing the object:
import com.amazonaws.services.glue.DynamicFrame
and using the glueContext, which you should already have:
DynamicFrame(sparkDataFrame, glueContext)
In summary, the code should look like:
import org.apache.spark.sql.functions._
import com.amazonaws.services.glue.DynamicFrame
...
val sparkDataFrame = datasourceToModify.toDF().withColumn("created_date", current_date())
val finalDataFrameForGlue = DynamicFrame(sparkDataFrame, glueContext)
...
Note: the import org.apache.spark.sql.functions._ brings in the current_date() function used to add the column with the date.
Hope this helps.
Use Spark's current_timestamp() function:
import org.apache.spark.sql.functions._
...
val timestampedDf = source.toDF().withColumn("Update_Date", current_timestamp())
val timestamped = DynamicFrame(timestampedDf, glueContext)
You can now do this with built-in functionality; see the Glue documentation and look for the glueContext.add_ingestion_time_columns section.
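Hedging heavily since I have not run this myself, but per that section of the docs a call looks roughly like the sketch below; check the documentation for the exact signature, the accepted granularities, and the frame type it expects:
# rough sketch based on the glueContext.add_ingestion_time_columns docs;
# it appends ingestion-time columns (ingest_year, ingest_month, ingest_day, ...)
datasource0 = glueContext.create_dynamic_frame.from_catalog(...)  # as in the answers above
with_ingest_time = glueContext.add_ingestion_time_columns(datasource0.toDF(), "hour")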
After reading in a .csv file using pandas and then converting it into an R dataframe using the rpy2 package, I created a model using some R functions (also via rpy2), and now I want to take the summary of the model and convert it into a pandas DataFrame (so that I can either save it as a .csv file or use it for other purposes).
I have followed the instructions on the pandas site (source: https://pandas.pydata.org/pandas-docs/stable/r_interface.html) to try to figure this out:
import pandas as pd
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
from rpy2.robjects import r, pandas2ri

pandas2ri.activate()

base = rpackages.importr('base')
stats = rpackages.importr('stats')
caret = rpackages.importr('caret')
broom = rpackages.importr('broom')

my_data = pd.read_csv("my_data.csv")
r_dataframe = pandas2ri.py2ri(my_data)

preprocessing = ["center", "scale"]
center_scale = StrVector(preprocessing)

# these are the columns in my data frame that will be the predictors in the model
predictors = ['predictor1', 'predictor2', 'predictor3']
predictors_vector = StrVector(predictors)

# this column from the dataframe is the outcome of the model
outcome = ['fluorescence']
outcome_vector = StrVector(outcome)

# this line extracts the columns of the predictors from the dataframe
columns_predictors = r_dataframe.rx(True, predictors_vector)

# this line extracts the column of the outcome from the dataframe
column_response = r_dataframe.rx(True, outcome_vector)

cvCtrl = caret.trainControl(method="repeatedcv", number=20, repeats=100)
model_R = caret.train(columns_predictors, column_response, method="glmStepAIC",
                      preProc=center_scale, trControl=cvCtrl)

summary_model = base.summary(model_R)
coefficients = stats.coef(summary_model)

pd_dataframe = pandas2ri.ri2py(coefficients)
pd_dataframe.to_csv("coefficents.csv")
Although this workflow is ostensibly correct, the output .csv file did not meet my needs: the names of the columns and rows were removed. When I ran type(pd_dataframe), I found that it is a <type 'numpy.ndarray'>. Although the information of the table is still present, the new formatting has removed the names of the columns and rows.
So I ran type(coefficients) and found that it is a <class 'rpy2.robjects.vectors.Matrix'>. Since this Matrix object still retains the names of my columns and rows, I tried to convert it into an R DataFrame object, but my efforts proved futile. Furthermore, I don't know why the line pd_dataframe = pandas2ri.ri2py(coefficients) did not yield a pandas DataFrame object, nor why it did not retain the names of my columns and rows.
Can anybody recommend an approach so I can get some kind of pandas DataFrame that retains the names of my columns and rows?
UPDATE
A method mentioned in the documentation of a slightly older version of the package, pandas2ri.ri2py_dataframe (source: https://rpy2.readthedocs.io/en/version_2.7.x/changes.html), now gives me a proper data frame instead of the numpy array. However, I still can't get the names of the rows and columns transferred properly. Any suggestions?
Maybe it should happen automatically during conversion, but in the meantime the row and column names can easily be obtained from the R object and added to the pandas DataFrame. For example, the column names for the R matrix are documented at: https://rpy2.github.io/doc/v2.9.x/html/vector.html#rpy2.robjects.vectors.Matrix.colnames
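A small sketch along those lines (assuming coefficients is the rpy2 Matrix from the question and that its dimnames are set):
import pandas as pd
from rpy2.robjects import pandas2ri

values = pandas2ri.ri2py(coefficients)      # plain numpy array of the values
pd_dataframe = pd.DataFrame(
    values,
    index=list(coefficients.rownames),      # R row names -> pandas index
    columns=list(coefficients.colnames),    # R column names -> pandas columns
)
pd_dataframe.to_csv("coefficients.csv")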
With the csv module, I loop through the rows to execute logic:
import csv

with open("file.csv", "r") as csv_read:
    r = csv.reader(csv_read, delimiter=",")
    next(r, None)  # skip the header row
    for row in r:
        pass  # logic here
I'm new to Pandas, and I want to execute the same logic, using the second column only in the csv as the input for the loop.
import pandas as pd
pd.read_csv("file.csv", usecols=[1])
Assuming the above is correct, what should I do from here to execute the logic based on the cells in column 2?
I want to use the cell values in column 2 as input for a web crawler. It takes each result and inputs it as a search term on a webpage, and then scrapes data from that webpage. Is there any way to grab each cell value in the array rather than the whole array at the same time?
Basically the pandas equivalent of your code is this:
import pandas as pd
df = pd.read_csv("file.csv", usecols=[1])
So passing usecols=[1] will only load the second column, see the docs.
Now, assuming this column has a name like 'url' (though it really doesn't matter), we can do something like:
def crawl(x):
    # do something with x
    pass

df.apply(crawl)
So in principle the above will run crawl over your column of urls.
EDIT
You can pass axis=1 to apply so that it processes each row rather than each column:
df.apply(crawl, axis=1)
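If you specifically want the crawler to receive one cell value at a time (rather than a whole row or column), a small sketch is to select the single column as a Series and apply element-wise, or simply iterate over it:
search_terms = df.iloc[:, 0]          # the only column loaded by usecols=[1]

results = search_terms.apply(crawl)   # crawl() is called once per cell value

# or, equivalently, a plain loop:
for term in search_terms:
    crawl(term)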
I'm trying to make a progress bar that reflects the progress of building a pandas DataFrame from SQL. Currently I have a table with 9 columns containing 1000 records.
import datetime
import pandas as pd
import psycopg2 as ps
import pandas.io.sql as psql

conn = ps.connect(user="user", password="password", database="database")
sql = "select * from table"

a = datetime.datetime.now()
df = psql.read_frame(sql, con=conn)
# ... and some little functions here ...
b = datetime.datetime.now()
print(b - a)
Instead of printing the time delta between the start and end of the function, I would prefer to show a progress bar to the end user (in case the data gets bigger), so they have an idea of how long it will take. Is that possible? How can I do it?
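One possible approach (a sketch, not tested against your setup): read the query in chunks with pandas' chunksize option and report progress as each chunk arrives. Note that pandas.io.sql.read_frame is the old name; current pandas exposes this as pandas.read_sql. The COUNT(*) query and chunk size below are assumptions you would adapt:
import pandas as pd

total_rows = pd.read_sql("select count(*) from table", conn).iloc[0, 0]

chunks = []
rows_done = 0
for chunk in pd.read_sql("select * from table", conn, chunksize=100):
    chunks.append(chunk)
    rows_done += len(chunk)
    print("loaded %d of %d rows" % (rows_done, total_rows))  # simple progress line

df = pd.concat(chunks, ignore_index=True)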