Python 2.7 pandas DataFrame.from_csv Date column not accessible

Surely I am missing something obvious - but I am baffled by this result:
Environment:
Ubuntu 16.04.1 LTS
Python 2.7.12
pandas 0.18.1
CSV File:
Date,Open,High,Low,Close,Volume
12-Aug-16,107.78,108.44,107.78,108.18,18660434
11-Aug-16,108.52,108.93,107.85,107.93,27484506
10-Aug-16,108.71,108.90,107.76,108.00,24008505
Code:
import pandas as pd
aapl = pd.DataFrame.from_csv('aapl.csv',index_col=None)
print aapl.columns
print aapl.Low.dtype
print aapl['Low'].dtype
# Fails - KeyError
print aapl['Date'].dtype
Output:
Index([u'Date', u'Open', u'High', u'Low', u'Close', u'Volume'], dtype='object')
float64
float64
KeyError: 'Date'
The mystery to me is that 'Date' appears in the columns list, but I cannot address the column. What am I missing?

To close this out, @Boud answered the question. Printing the columns with repr(), i.e. print(repr(aapl.columns[0])), showed encoded characters in the key string, which prevented the key from being found.
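For anyone hitting the same thing, here is a minimal sketch of how to diagnose and clean it up, assuming the hidden characters are a UTF-8 BOM at the start of the file (read_csv is used here in place of from_csv):
import pandas as pd
aapl = pd.read_csv('aapl.csv')
print repr(aapl.columns[0])  # e.g. u'\ufeffDate' would reveal a hidden BOM
# 'utf-8-sig' drops a UTF-8 BOM if one is present; strip() handles stray whitespace
aapl = pd.read_csv('aapl.csv', encoding='utf-8-sig')
aapl.columns = [c.strip() for c in aapl.columns]
print aapl['Date'].dtype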

Related

How to fix UnicodeEncodeError in PySpark while converting a Dataframe Row to a String

I have a simple dataframe with 3 columns.
+------------------+-------------------+-------+
| NM1_PROFILE| CURRENT_DATEVALUE| ID|
+------------------+-------------------+-------+
|XY_12345678 – Main|2019-12-19 00:00:00|myuser1|
+------------------+-------------------+-------+
All I want in the output is a single string consisting of all the values in the dataframe row, separated by a comma or pipe. Although there are many rows in the dataframe, I just want one row for my purpose.
XY_12345678 – Main,2019-12-19 00:00:00,myuser1
I have tried the code below; it has worked fine for my other dataframes, but for the one above it gives me an error.
df.rdd.map(lambda line: ",".join([str(x) for x in line])).take(1)[0]
The error occurs when it encounters the "–" (en dash, U+2013):
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 12: ordinal not in range(128)
I am using Spark 1.6 with Python 2 and tried:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
According to the Spark 1.6 documentation, you can use the concat_ws function, which, given a separator and a set of columns, concatenates them into one string. So this should solve your issue:
from pyspark.sql.functions import col, concat_ws
df.select(concat_ws(",", col("NM1_PROFILE"), col("CURRENT_DATEVALUE"), col("ID")).alias("concat")).collect()
Or, if you prefer a more generic way, you can use something like this:
from pyspark.sql.functions import col, concat_ws
cols = [col(column) for column in df.columns]
df.select(concat_ws(",", *cols).alias("concat")).collect()
For more information: https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.functions.concat_ws
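Alternatively, if you want to keep your rdd.map approach: the failure comes from str() implicitly encoding unicode values as ASCII, so joining with unicode instead avoids it. A sketch, assuming Python 2:
# unicode(x) and a unicode separator keep everything in unicode, so no ascii encode happens
df.rdd.map(lambda line: u",".join([unicode(x) for x in line])).take(1)[0]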
Hope this helps

Why is Pandas forcing unicode column names in place of strings?

Why is Pandas force-converting ASCII strings to unicode upon conversion from dictionary to dataframe? Is this a feature or a known bug?
I'm using Python 2.7.3 and Pandas 0.20.2
MWE included below.
import pandas as pd
sample_dict={}
sample_dict['A'] = {'Key_1': 'A1', 'Key-2': 'A2', 'Key_3': 'A3'}
sample_dict['B'] = {'Key_1': 'B1', 'Key-2': 'B2', 'Key_3': 'B3'}
sample_dict['C'] = {'Key_1': 'C1', 'Key-2': 'C2', 'Key_3': 'C3'}
print sample_dict['A'].keys()
sample_df = pd.DataFrame.from_dict(sample_dict, orient='index')
print sample_df.keys()
Results in:
['Key-2', 'Key_1', 'Key_3']
Index([u'Key-2', u'Key_1', u'Key_3'], dtype='object')
Addendum: I came across this similar question, but it's been inactive for a couple of years and does not discuss why this is happening.
The pandas DataFrame repr docstring says:
"""
Return a string representation for a particular object.
Yields Bytestring in Py2, Unicode String in py3.
"""
So in Python 3 you should not see any unicode prefix.
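Note that the u prefix is cosmetic in practice: in Python 2, an ASCII str and its unicode equivalent compare and hash equal, so the columns are still reachable with plain string keys. A quick check:
print sample_df['Key_1']            # works: u'Key_1' == 'Key_1' in Python 2
print 'Key_1' in sample_df.columns  # True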

cx_Oracle giving wrong division output

I am currently using Python 2.7.5 on AIX 5.1 with cx_Oracle version 5.2 to connect to Oracle 12c.
I am trying to execute a SQL query, and put its output in a csv file using the csv module. The query I am running is:
Select 1.563/100, 0.38/100 from dual; -- simplified query
However the output in file is:
0.015629999999999998,0.0038
When I expect it to be
0.01563,0.0038
After doing some research, I believe this is because floating-point numbers are represented in binary (base 2).
But I don't know how to resolve this.
I also tried
from __future__ import division
But it did not help.
The ROUND function is your friend:
SELECT ROUND(3.1415926,4),ROUND(3.1415926,5) FROM DUAL;
ROUND(3.1415926,4) ROUND(3.1415926,5)
------------------ ------------------
3.1416 3.14159
or, in Python:
print round(3.1415926, 4)
print round(3.1415926, 5)
3.1416
3.14159
Thank you, Zsigmond Lőrinczy.
It worked by using to_char(round()):
>>> import cx_Oracle
>>> con = cx_Oracle.connect('xxx/xxx@xxx')
>>> cur = con.cursor()
>>> cur.execute("select 1.563/100, round(1.563/100,5), to_char(round(1.563/100,5)) from dual")
>>> l_result = cur.fetchall()
>>> l_result
[(0.015629999999999998, 0.015629999999999998, '0.01563')]
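An alternative, if you want exact decimal values on the Python side rather than rounding in SQL, is an output type handler that fetches NUMBER columns as decimal.Decimal. A sketch based on the pattern documented for cx_Oracle, reusing the con from above (the function name is illustrative):
import decimal
import cx_Oracle

def numbers_as_decimal(cursor, name, default_type, size, precision, scale):
    # Route every NUMBER column through Decimal instead of a binary float
    if default_type == cx_Oracle.NUMBER:
        return cursor.var(str, 100, cursor.arraysize, outconverter=decimal.Decimal)

con.outputtypehandler = numbers_as_decimal
cur = con.cursor()
cur.execute("select 1.563/100 from dual")
print cur.fetchall()  # expected: [(Decimal('0.01563'),)]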

Python Pandas to_csv Output Returns Single Character for String/Object Values

I'm attempting to output a query result into a pandas DataFrame. When I print the DataFrame, the object values appear correct, but when I use the to_csv function on the DataFrame, my CSV output has only the first character of every string/object value.
df = pandas.DataFrame({'a':[u'u\x00s\x00']})
df.to_csv('test.csv')
I've also tried the following addition to the to_csv function:
df.to_csv('test_encoded.csv', encoding= 'utf-8')
But I am getting the same results:
>>> print df
a
0 us
(output in csv file)
u
For reference, I'm connecting to a Vertica database and using the following setup:
OS: Mac OS X Yosemite (10.10.5)
Python 2.7.10 |Anaconda 2.3.0 (x86_64)| (default, Sep 15 2015,
14:29:08)
pyodbc 3.0.10
pandas 0.16.2
ODBC: Vertica ODBC 6.1.3
Any help figuring out how to pass the entire object string using the to_csv function in pandas would be greatly appreciated.
I was facing the same problem and found this post UTF-32 in Python
To fix your problem, I believe you need to replace every '\x00' character with an empty string. I managed to write the correct CSV with the code below:
fixer = dict.fromkeys([0x00], u'')                   # map the NUL code point to empty
df['a'] = df['a'].map(lambda x: x.translate(fixer))  # strip NULs from every value
df.to_csv('test.csv')
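Another way to see the same data: u'u\x00s\x00' looks like UTF-16-LE byte pairs that were decoded as single-byte text, so re-encoding and decoding also recovers the string. A sketch:
raw = u'u\x00s\x00'
fixed = raw.encode('latin-1').decode('utf-16-le')  # u'us'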
To solve my problem with Vertica I had to change the encoding to UTF-16 in the file /Library/Vertica/ODBC/lib/vertica.ini with the configuration below
[Driver]
ErrorMessagesPath=/Library/Vertica/ODBC/messages/
ODBCInstLib=/usr/lib/libiodbcinst.dylib
DriverManagerEncoding=UTF-16
Best regards,
Anderson Neves

TypeError when inserting time into xlsxwriter

I'm importing two points of data from MySQLdb. The second point is a time which cursor.fetchall() returns as a timedelta. I had no luck trying to insert that info into xlsxwriter, always getting a "TypeError: Unknown or unsupported datetime type" error.
Ok... round 2
Now I'm trying to convert the timedelta into a datetime.datetime object:
for x in tempList:
    timeString = str(x[1])
    ctTime.append(datetime.datetime.strptime(timeString, "%H:%M:%S"))
Now in xlsxwriter, I setup formatting:
ctChart.set_x_axis({'name': 'Time', 'name_font': {'size': 14, 'bold': True}, 'num_font': {'italic': True},'date_axis': True})
And then I create a time format:
timeFormat = workbook.add_format({'num_format': 'hh:mm:ss'})
Then I attempt to insert data:
ctWorksheet.write_datetime('A1',ctTime,timeFormat)
But no matter what I do, no matter how I format the data, I always get the following error:
TypeError: Unknown or unsupported datetime type
Is there something stupidly obvious I'm missing?
******* EDIT 1 *******
jmcnamara - in response to your comment, here are more details:
I've tried using a list of timedeltas such as datetime.timedelta(0, 27453), which prints as 7:37:33, using the following code:
timeFormat = workbook.add_format({'num_format': 'hh:mm:ss'})
ctWorksheet.write_datetime('A1',ctTime,timeFormat)
I still get the error: TypeError: Unknown or unsupported datetime type
Even iterating through the list and attempting to insert the results fails:
timeFormat = workbook.add_format({'num_format': 'hh:mm:ss'})
i = 0
for t in ctTime:
    ctWorksheet.write_datetime(i, 0, t, timeFormat)
    i += 1
I finally got it working with my most recent code. The chart still isn't graphing correctly using the inserted times, but at least they are inserting correctly.
Since I was pulling the timedeltas from SQL, I had to change their format first. Raw timedeltas from SQL just weren't working:
for x in templist:
    timeString = datetime.datetime.strptime(str(x[1]), "%H:%M:%S")
    ctTime.append(timeString)
With those datetime.strptime-formatted times I was then able to insert into the worksheet successfully.
timeFormat = workbook.add_format({'num_format': 'hh:mm:ss'})
i = 0
for t in ctTime:
    ctWorksheet.write_datetime(i, 0, t, timeFormat)
    i += 1
The GitHub master version of XlsxWriter supports datetime.timedelta.
Try it out and let me know if it works. It will probably be uploaded to PyPI in the next week.
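For reference, a minimal self-contained sketch of writing a timedelta with a time format, assuming an XlsxWriter version that includes the timedelta support mentioned above:
import datetime
import xlsxwriter

workbook = xlsxwriter.Workbook('times.xlsx')
worksheet = workbook.add_worksheet()
timeFormat = workbook.add_format({'num_format': 'hh:mm:ss'})

delta = datetime.timedelta(0, 27453)  # 7:37:33, as in the question
worksheet.write_datetime(0, 0, delta, timeFormat)
workbook.close()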