This is the code where "rec" variable is used to read the dates in excel sheet but its printing float value how to print that in date format for example '2015:09:02'
for rec in sorted(out.keys()):
print rec #printing float values
print str(out[rec])
I got output:
42240.0
24
Excel internally stored date values as floats. So in xlrd if you want to read Excel date values as Python date values, you have to use the xldate_as_tuple method to get the date.
Documentation: http://www.lexicon.net/sjmachin/xlrd.html#xlrd.xldate_as_tuple-function
Here's a generic Example:
import datetime, xlrd
book = xlrd.open_workbook("myexcelfile.xls")
sh = book.sheet_by_index(0)
a1 = sh.cell_value(rowx=0, colx=0)
a1_as_datetime = datetime.datetime(*xlrd.xldate_as_tuple(a1, book.datemode))
print 'datetime: %s' % a1_as_datetime
If you create the file myexcelfile.xls and enter a date in cell A1 and run the above code, you should be able to see the correct datetime value in the a1_as_datetime variable.
Related
I am trying to read a CSV file from HDFS location and to that 3 columns batchid,load timestamp and a delete indicator needs to be added at the beginning. I am using spark 2.3.2 and python 2.7.5. Sample values for 3 columns to be added is given below.
batchid- YYYYMMdd (int)
Load timestamp - current timestamp (timestamp)
delete indicator - blank (string)
Your question is a little bit obscure. You can do something in this flavor. First, create your timestamp using python functionalities :
import time
import datetime
timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
Then, assuming you use the DataFrame API, you plug that into your column :
import pyspark.sql.functions as psf
df = (df
.withColumn('time',
psf.unix_timestamp(
psf.lit(timestamp),'yyyy-MM-dd HH:mm:ss'
).cast("timestamp")
)
.withColumn('batchid', psf.date_format('time', 'yyyyMMdd/yyy'))
.withColumn('delete', psf.lit(''))
To reorder your columns:
df = df.select(*["time","batchid","delete"] + [k for k in colnames if k not in ["time","batchid","delete"]])
I have seen this question asked on Stack Overflow multiple times before, however, what I have not seen is anyone either ask nor answer the question properly here. So, here is my question: I have a .csv file named selected.csv, with three columns - Date (LT), AQI and Raw Conc. The date is in the format dd-mm-yyyy hh:mm. I want to convert the format of the date to yyyy-mm-dd, thereafter, saving the corrected data with three columns - Date, AQI and Raw Conc. as corrected.csv. I have tried the code typed below, but to no avail.
import csv
from datetime import datetime
output_file = open(r"C:\Users\Win-8.1\Desktop\delhi\corrected.csv", "wb")
fieldnames = ['Date', 'AQI' , 'Raw Conc.']
writer = csv.DictWriter(output_file, fieldnames = fieldnames)
writer.writeheader()
with open(r"C:\Users\Win-8.1\Desktop\delhi\selected.csv") as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
output_row = {}
output_row['Date'] = datetime.datetime.strptime(row['Date (LT)'], '%d-%m-%Y %H:%M').strftime('%Y-%m-%d')
output_row['AQI'] = row['AQI']
output_row['Raw Conc.'] = row['Raw Conc.']
writer.writerow(output_row)
output_file.close()
output_row['Date'] = datetime.strptime(row['Date (LT)'], '%d-%m-%Y %H:%M').strftime('%Y-%m-%d')
You are using also %H:%M when you mentioned the format is DD-MM-YYYY. Maybe you didn't intend this?
I have the below-mentioned dataset.
https://docs.google.com/spreadsheets/d/13GCAXHp5BU4vYU6PdX40wM-Jhp--LeRd9C5oUurbVY4/edit#gid=0
I want to find the cumulative values for sales for difference stores in one column. For example, the cumulative value for store 2106 the sales figure should be 176,849
I'm using the following function
df = df.groupby('storenumber')['sales'].cumsum() but i am not getting the correct result
Can someone help?
Here's what I did to solve this problem.
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv') # get data frame from csv file
You won't be able to run numerical operations on your data, as it is, because the Sale (Dollars) column in df is not formatted as a numerical type. The following piece of code will convert the data in the Sale (Dollars) and Suggested answer column to be of type float and remove the dollar sign and separating commas.
df[df.columns[2:]] = df[df.columns[2:]].replace('[\$,]', '', regex=True).astype(float)
Then, I used the following bit of code to get the cumulative value for each unique Store Number.
cum_sales_by_store_number = df.groupby('Store Number')['Sale (Dollars)'].agg(np.sum)
cum_sales_by_store_number = pd.DataFrame(cum_sales_by_store_number)
Output for cum_sales_by_store_number:
Sale (Dollars)
Store Number
2106 176849.97
I hope this answers your question. Happy coding!
I have a script that processes an Excel file. The department that sends it has a system that generated it, and my script stopped working.
I suddenly got the error Can only use .str accessor with string values, which use np.object_ dtype in pandas for the following line of code:
df['DATE'] = df['Date'].str.replace(r'[^a-zA-Z0-9\._/-]', '')
I checked the type of the date columns in the file from the old system (dtype: object) vs the file from the new system (dtype: datetime64[ns]).
How do I change the date format to something my script will understand?
I saw this answer but my knowledge about date formats isn't this granular.
You can use apply function on the dataframe column to convert the necessary column to String. For example:
df['DATE'] = df['Date'].apply(lambda x: x.strftime('%Y-%m-%d'))
Make sure to import datetime module.
apply() will take each cell at a time for evaluation and apply the formatting as specified in the lambda function.
pd.to_datetime returns a Series of datetime64 dtype, as described here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
df['DATE'] = df['Date'].dt.date
or this:
df['Date'].map(datetime.datetime.date)
You can use pd.to_datetime
df['DATE'] = pd.to_datetime(df['DATE'])
I am importing data from a JSON file and it has the date in the following format 1/7/11 9:15
What would be the best variable type/format to define in order to accept this date as it is? If not what would be the most efficient way to accomplish this task?
Thanks.
"What would be the best variable type/format to define in order to accept this date as it is?"
The DateTimeField.
"If not what would be the most efficient way to accomplish this task?"
You should use the datetime.strptime method from Python's builtin datetime library:
>>> from datetime import datetime
>>> import json
>>> json_datetime = "1/7/11 9:15" # still encoded as JSON
>>> py_datetime = json.loads(json_datetime) # now decoded to a Python string
>>> datetime.strptime(py_datetime, "%m/%d/%y %I:%M") # coerced into a datetime object
datetime.datetime(2011, 1, 7, 9, 15)
# Now you can save this object to a DateTimeField in a Django model.
If you take a look at https://docs.djangoproject.com/en/dev/ref/models/fields/#datetimefield, it says that django uses the python datetime library which is docomented at http://docs.python.org/2/library/datetime.html.
Here is a working example (with many debug prints and step-by-step instructions:
from datetime import datetime
json_datetime = "1/7/11 9:15"
json_date, json_time = json_datetime.split(" ")
print json_date
print json_time
day, month, year = map(int, json_date.split("/")) #maps each string in stringlist resulting from split to an int
year = 2000 + year #be ceareful here! 2 digits for a year may cause trouble!!! (could be 1911 as well)
hours, minutes = map(int, json_time.split(":"))
print day
print month
print year
my_datetime = datetime(year, month, day, hours, minutes)
print my_datetime
#Generate a json date:
new_json_style = "{0}/{1}/{2} {3}:{4}".format(my_datetime.day, my_datetime.month, my_datetime.year, my_datetime.hour, my_datetime.minute)
print new_json_style