I am trying to create a new datetime column from existing columns (one datetime column and another integer column) in a pandas data frame. Here is my code:
import pandas as pd

df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = df['start_date'] + pd.Timedelta(df.total_waiting_days, unit='D')
But I got the following error:
ValueError: Value must be Timedelta, string, integer, float, timedelta or convertible
What did I do wrong here and how do I fix this? Thanks!
It seems like you want to convert the whole column to Timedelta:
df['end_date'] = df['start_date'] + df.total_waiting_days.apply(lambda x: pd.Timedelta(x, unit='D'))
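The element-wise apply works, but a vectorized conversion of the whole column should give the same result and is more idiomatic; a minimal sketch, assuming total_waiting_days holds plain integers:

import pandas as pd

# Vectorized alternative: convert the whole integer column to a timedelta in one call
df['end_date'] = df['start_date'] + pd.to_timedelta(df['total_waiting_days'], unit='D')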
I am trying to load a CSV from GCS which contains timestamps in one of the columns.
When I upload via BQ interface, I get the following error:
Could not parse '2018-05-03 10:25:18.257000000' as DATETIME for field creation_date (position 6) starting at location 678732930 with message 'Invalid datetime string "2018-05-03 10:25:18.257000000"'
Is the issue here the trailing 0's? How would I fix the issue using Python?
Thanks in advance
Yes, you are correct: the issue is the trailing 0s. A DATETIME field only allows 6 digits in the subsecond value.
Name | Range
DATETIME | 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999999
To remove the trailing 0s, you can use Pandas to convert it to a proper DATETIME format so it can be used in BigQuery. For testing purposes, I used a CSV file that contains a dummy value at column 0 and DATETIME with trailing 0s at column 1.
Test,2018-05-03 10:25:18.257000000
Test1,2018-05-03 10:22:18.123000000
Test2,2018-05-03 10:23:18.234000000
Using this block of code, Pandas will convert column 1 to the proper DATETIME format:
import pandas as pd
df = pd.read_csv("data.csv", header=None)  # define your CSV file here
datetime_column = df.iloc[:, 1]  # change to the location of your DATETIME column
df.iloc[:, 1] = pd.to_datetime(datetime_column, format='%Y-%m-%d %H:%M:%S.%f')  # convert to the correct DATETIME format
df.to_csv("data.csv", header=False, index=False)  # write the new values back to data.csv
print(df)  # print output for testing
This will result in:
Test,2018-05-03 10:25:18.257
Test1,2018-05-03 10:22:18.123
Test2,2018-05-03 10:23:18.234
You can now use the updated CSV file to write to BQ via the BQ interface.
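If you would rather not re-upload the CSV by hand, the cleaned DataFrame can also be written straight to BigQuery from Python; this is only a sketch using the pandas-gbq integration, and the column, dataset, table and project names below are placeholders:

import pandas as pd

df = pd.read_csv("data.csv", header=None)
df.columns = ['name', 'creation_date']  # placeholder column names for BigQuery
df['creation_date'] = pd.to_datetime(df['creation_date'], format='%Y-%m-%d %H:%M:%S.%f')

# Requires the pandas-gbq package; dataset, table and project IDs are placeholders
df.to_gbq('my_dataset.my_table', project_id='my-project', if_exists='append')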
After reading in a .csv file using pandas, and then converting it into an R dataframe using the rpy2 package, I created a model using some R functions (also via rpy2), and now want to take the summary of the model and convert it into a Pandas dataframe (so that I can either save it as a .csv file or use it for other purposes).
I have followed the instructions on the pandas site (source: https://pandas.pydata.org/pandas-docs/stable/r_interface.html) in order to figure it out:
import pandas as pd
import rpy2.robjects.packages as rpackages
from rpy2.robjects import r, pandas2ri
from rpy2.robjects.vectors import StrVector

pandas2ri.activate()

base = rpackages.importr('base')
stats = rpackages.importr('stats')
caret = rpackages.importr('caret')
broom = rpackages.importr('broom')

my_data = pd.read_csv("my_data.csv")
r_dataframe = pandas2ri.py2ri(my_data)

preprocessing = ["center", "scale"]
center_scale = StrVector(preprocessing)

# these are the columns in my data frame that consist of my predictors in the model
predictors = ['predictor1', 'predictor2', 'predictor3']
predictors_vector = StrVector(predictors)

# this column from the dataframe consists of the outcome of the model
outcome = ['fluorescence']
outcome_vector = StrVector(outcome)

# this line extracts the columns of the predictors from the dataframe
columns_predictors = r_dataframe.rx(True, predictors_vector)

# this line extracts the column of the outcome from the dataframe
column_response = r_dataframe.rx(True, outcome_vector)

cvCtrl = caret.trainControl(method="repeatedcv", number=20, repeats=100)
model_R = caret.train(columns_predictors, column_response, method="glmStepAIC",
                      preProc=center_scale, trControl=cvCtrl)

summary_model = base.summary(model_R)
coefficients = stats.coef(summary_model)

pd_dataframe = pandas2ri.ri2py(coefficients)
pd_dataframe.to_csv("coefficents.csv")
Although this workflow is ostensibly correct, the output .csv file did not meet my needs, as the names of the columns and rows were removed. When I ran the command type(pd_dataframe), I found that it is a <type 'numpy.ndarray'>. Although the information in the table is still present, the conversion has stripped the names of the columns and rows.
So I ran the command type(coefficients) and found that it was a <class 'rpy2.robjects.vectors.Matrix'>. Since this Matrix object still retained the names of my columns and rows, I tried to convert it into an R objects DataFrame, but my efforts proved to be futile. Furthermore, I don't know why the line pd_dataframe = pandas2ri.ri2py(coefficients) did not yield a pandas DataFrame object, nor why it did not retain the names of my columns and rows.
Can anybody recommend an approach so I can get some kind of pandas DataFrame that retains the names of my columns and rows?
UPDATE
A new method called pandas2ri.ri2py_dataframe is mentioned in the documentation of a slightly older version of the package (source: https://rpy2.readthedocs.io/en/version_2.7.x/changes.html), and now I have a proper data frame instead of the numpy array. However, I still can't get the names of the rows and columns to be transferred properly. Any suggestions?
Maybe it should happen automatically during conversion, but in the meantime row and column names can easily be obtained from the R object and added to the pandas DataFrame. For example, the column names for the R matrix should be at: https://rpy2.github.io/doc/v2.9.x/html/vector.html#rpy2.robjects.vectors.Matrix.colnames
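As a rough sketch of that idea, assuming coefficients is the rpy2 Matrix from the question and that its row and column names are set:

import pandas as pd
from rpy2.robjects import pandas2ri

# Convert the R matrix of coefficients to a plain numpy array
values = pandas2ri.ri2py(coefficients)

# Pull the row/column names off the R matrix and re-attach them in pandas
row_names = list(coefficients.rownames)
col_names = list(coefficients.colnames)

pd_dataframe = pd.DataFrame(values, index=row_names, columns=col_names)
pd_dataframe.to_csv("coefficents.csv")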
I have two datetime columns in a pandas dataframe: time_col & override_col. If override_col isn't null I want to replace the corresponding time_col record with the override_col record. The following code does this but it converts the override_col records to a long and not a datetime... then the time_col is a mix of datetime and long dtypes.
df.loc[~df['override_col'].isnull(), 'time_col'] = df[~df['override_col'].isnull()]['override_col']
Any ideas why this is happening?!
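It is hard to say for certain without seeing the dtypes, but a common workaround is to coerce both columns to datetime64 first and then let fillna pick the override; a minimal sketch, assuming both columns can be parsed by pd.to_datetime:

import pandas as pd

# Make sure both columns share the datetime64 dtype before combining them
df['time_col'] = pd.to_datetime(df['time_col'])
df['override_col'] = pd.to_datetime(df['override_col'])

# Take override_col where it is not null, otherwise keep time_col
df['time_col'] = df['override_col'].fillna(df['time_col'])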
My Excel sheet contains details for many weeks. I need to print only the rows for the current week number; if a row matches the current week, it should be printed. Here is the piece of code for that condition, but it is showing an error:
rec_date = datetime.datetime(*xlrd.xldate_as_tuple(rec, inputfp.datemode)).isocalendar()[1]
if rec_date == date.today().isocalendar()[1]:
    print '\n', rec_date
    print str(out[rec])
It shows this error:
if rec_date == date.today():
AttributeError: 'float' object has no attribute 'today'
I presume you do not import date from datetime but just import datetime. So
if rec_date == datetime.date.today().isocalendar()[1]:
should work. Alternatively, use from datetime import date.
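Putting that together with the week-number comparison from the question, a minimal sketch of the corrected check (assuming rec_date, out and rec are set up as in the original loop):

import datetime

current_week = datetime.date.today().isocalendar()[1]  # ISO week number of today
if rec_date == current_week:
    print '\n', rec_date
    print str(out[rec])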
I have three variables stored as number, string and string, as shown below.
load_id = 100
t_date = '2014-06-18'
p_date = '19-JUN-14 10.51.45.378196'
I would like to insert them into a SQL Server table using Python 2.7. The SQL Server table structure is as follows
load_id = float
t_date = date
p_date = timestamp
In Oracle, we tend to use TO_DATE or TO_TIMESTAMP to convert the string to DATE or TIMESTAMP field.
I would like to know how I can do similar conversion while inserting into an SQL Server table.
Thanks in advance.
Convert with:
import datetime
import calendar
thedate=datetime.datetime.strptime(p_date,'%d-%b-%y %H.%M.%S.%f')
thetimestamp=calendar.timegm(thedate.utctimetuple())
https://community.toadworld.com/platforms/sql-server/b/weblog/archive/2012/04/18/convert-datetime-to-timestamp
DECLARE @DateTimeVariable DATETIME

SELECT @DateTimeVariable = GETDATE()

SELECT @DateTimeVariable AS DateTimeValue,
CAST(@DateTimeVariable AS TIMESTAMP) AS DateTimeConvertedToTimestampCAST

SELECT CAST(CAST(@DateTimeVariable AS TIMESTAMP) AS DATETIME) AS TimestampToDatetime
Do the conversion with SQL instead of trying to get Python to match the SQL format.
Neither format matches yours; however, the DATETIME type should be adequate.
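If you do want to handle it from Python instead, another option is to parse the strings into Python date/datetime objects and let the driver convert them via a parameterized insert; a minimal sketch using pyodbc, where the connection string, table and column names are placeholders and p_date is assumed to go into a DATETIME column rather than a SQL Server TIMESTAMP (rowversion) column:

import datetime
import pyodbc

load_id = 100
t_date = datetime.datetime.strptime('2014-06-18', '%Y-%m-%d').date()
p_date = datetime.datetime.strptime('19-JUN-14 10.51.45.378196', '%d-%b-%y %H.%M.%S.%f')

# Connection string, table and column names below are placeholders
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;UID=user;PWD=pass')
cursor = conn.cursor()
cursor.execute("INSERT INTO my_table (load_id, t_date, p_date) VALUES (?, ?, ?)",
               load_id, t_date, p_date)
conn.commit()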