I have a script that processes an Excel file. The department that sends it has a system that generated it, and my script stopped working.
I suddenly got the error Can only use .str accessor with string values, which use np.object_ dtype in pandas for the following line of code:
df['DATE'] = df['Date'].str.replace(r'[^a-zA-Z0-9\._/-]', '')
I checked the type of the date columns in the file from the old system (dtype: object) vs the file from the new system (dtype: datetime64[ns]).
How do I change the date format to something my script will understand?
I saw this answer but my knowledge about date formats isn't this granular.
You can use apply function on the dataframe column to convert the necessary column to String. For example:
df['DATE'] = df['Date'].apply(lambda x: x.strftime('%Y-%m-%d'))
Make sure to import datetime module.
apply() will take each cell at a time for evaluation and apply the formatting as specified in the lambda function.
pd.to_datetime returns a Series of datetime64 dtype, as described here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
df['DATE'] = df['Date'].dt.date
or this:
df['Date'].map(datetime.datetime.date)
You can use pd.to_datetime
df['DATE'] = pd.to_datetime(df['DATE'])
Related
I have a CSV file with various columns and everything worked perfectly for the past few months until I updated the file and got new information and now the one column does not appear to be picked up by Python. I am using Python 2.7 and have made sure I have the latest version of pandas.
When I downloaded the csv file from Yahoo Finance, I opened it in Excel and made changes to the format of the columns in order to make it more readable as all information was in one cell. I used the "Text to Column" feature and split up the data based on where the commas were.
Then I made sure that in each column there were no white spaces in the beginning of the cell using the Trim function in excel and left-aligning the data.
I tried the following and still get the same or similiar:
After the df = pd.read_csv("KIO.csv") I tried to read whether I can read the first few columns by using df.head() - but still got the same error.
I tried renaming the problematic column as suggested in a similiar post using:
df = df.rename(columns={"Close": "Closing"}) - here I got the same error again. "print df.columns" also led to the same issue.
"df[1]" - gave a long error with "KeyError: 1" at the end - I can print the entire thing if it it will assist.
Adding the "skipinitialspace=True" - no difference.
I thought the problem might be within the actual csv file information so I deleted all the columns and made my own information and I still got the same error.
Below is a portion of my code as the total code is very long:
enter code here
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as pltdate
import datetime
import matplotlib.animation as animation
import numpy as np
df = pd.read_csv("KIO.csv", skipinitialspace=True)
#df.head()
#Close = df.columns[0]
#df= df.rename(columns={"Close": "Closing"})
df1 = pd.read_csv("USD-ZAR.csv")
kio_close = pd.DataFrame(df.Close)
exchange = pd.DataFrame(df1.Value)
dates = df["Date"]
dates1 = df1["Date"]
The above variables have been used throughout the remaining code though so if this issue can be solved here the remaining code will be right.
This is copy/paste of the error:
Blockquote
Traceback (most recent call last):
File "C:/Users/User/Documents/PycharmProjects/Trading_GUI/GUI_testing.py", line 33, in
kio_close = pd.DataFrame(df.Close)
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 4372, in getattr
return object.getattribute(self, name)
AttributeError: 'DataFrame' object has no attribute 'Close'
Thank you so much in advance.
#Rip_027 This is in regards to your last comment. I used to have the same issue whenever I open a csv file by simply double clicking the file icon. You need to launch Excel first, then get external data. Link below has more details,which will serve as a guideline. Hope this helps.
https://www.hesa.ac.uk/support/user-guides/import-csv
I am new to python. I have a csv file which has cleaned tweets. I want to create a bag of words of these tweets.
I have the following code but its not working correctly.
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
data = pd.read_csv(open("Twidb11.csv"), sep=' ')
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data.Text)
count_vect.vocabulary_
Error:
.ParserError: Error tokenizing data. C error: Expected 19 fields in
line 5, saw 22
It's duplicated i think. U can see answer here. There are a lot of answers and comments.
So, solution can be:
data = pd.read_csv('Twidb11.csv', error_bad_lines=False)
Or:
df = pandas.read_csv(fileName, sep='delimiter', header=None)
"In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indeces for each field {0,1,2,...}."
I have a trouble with data frame. I have a csv file with ten columns, but all the data stores in the first column. How can i automatically extract data from the first column and put into other columns? Could you help me, please. enter image description here
This is my code:
import pandas as pd
import numpy as np
df = pd.read_csv('test_dataset.csv')
df.head(3)
one_column = df.iloc[:,0]
one_column.head(3)
This is the link for download file:
enter link description here
You can use parameter quoting=3 for no quoting in read_csv:
df = pd.read_csv('test_dataset.csv', quoting=3)
quoting : int or csv.QUOTE_* instance, default 0
Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
I have a file like so that I am reading from excel:
Year Month Day
1 2 1
2 1 2
I want to specify the column width that excel recognizes. I would like to do it in pandas but I don't see a option. I have tried to do it with the module StyleFrame.
This is my code:
from StyleFrame import StyleFrame
import pandas as pd
df=pd.read_excel(r'P:\File.xlsx')
excel_writer = StyleFrame.ExcelWriter(r'P:\File.xlsx')
sf=StyleFrame(df)
sf=sf.set_column_width(columns=['Year', 'Month'], width=4.0)
sf=sf.set_column_width(columns=['Day'], width=6.00)
sf=sf.to_excel(excel_writer=excel_writer)
excel_writer.save()
but the formatting isn't saved when I open the new file.
Is there a way to do it in pandas? I would even take a pure python solution to this, pretty much anything that works.
As for your question on how to remove the headers, you can simply pass header=False to to_excel:
sf.to_excel(excel_writer=excel_writer, header=False).
Note that this will still result with the first line of the table being bold.
If you don't want that behavior you can update to 0.1.6 that I just released.
I am trying to use matplotlib module to plot date vs. some values. I'd like to make some changes to the ticklables of the x_axis and to do that I am using the xaxis.get_ticklables(). As the matplotlib Artist tutorial says this function gives a list of Text instances. Now the question is, what is a Text instance and is there any way I can convert a Text instance to strings or numbers?
Thank you.
You can get the text, i.e. the string, for all tick labels with:
label_texts = [label.get_text() for label in ax.xaxis.get_ticklabels()]
A tick label is rich matplotlib object with many ttributes:
>>> tick_label = ax.xaxis.get_ticklabels()[0]
>>> len(dir(tick_label))
213
Type:
>>> dir(tick_label)
to see their names.