Surely I am missing something obvious - but I am baffled by this result:
Environment:
Ubuntu 16.04.1 LTS
Python 2.7.12
pandas 0.18.1
CSV File:
Date,Open,High,Low,Close,Volume
12-Aug-16,107.78,108.44,107.78,108.18,18660434
11-Aug-16,108.52,108.93,107.85,107.93,27484506
10-Aug-16,108.71,108.90,107.76,108.00,24008505
Code:
import pandas as pd
aapl = pd.DataFrame.from_csv('aapl.csv',index_col=None)
print aapl.columns
print aapl.Low.dtype
print aapl['Low'].dtype
# Fails - KeyError
print aapl['Date'].dtype
Output:
Index([u'Date', u'Open', u'High', u'Low', u'Close', u'Volume'], dtype='object')
float64
float64
KeyError: 'Date'
The mystery to me is that 'Date' appears in the columns list, but I cannot address the column. What am I missing?
To close this out, @Boud answered the question. Printing the columns with repr(), i.e. print(repr(aapl.columns[0])), showed encoded characters in the key string, which prevented it from being found.
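For anyone hitting the same thing: the question does not show exactly which stray characters were in the name, but a UTF-8 byte-order mark at the start of the header line is a common cause, so (as an assumption) a quick way to inspect and clean the column names might look like this:
import pandas as pd

aapl = pd.read_csv('aapl.csv')
# repr() exposes hidden characters (e.g. a byte-order mark) that a plain print hides.
print [repr(c) for c in aapl.columns]

# If a BOM turns out to be the culprit, re-reading with the BOM-aware codec
# gives clean column names.
aapl = pd.read_csv('aapl.csv', encoding='utf-8-sig')
print aapl['Date'].dtype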
I have a simple dataframe with 3 columns.
+------------------+-------------------+-------+
| NM1_PROFILE| CURRENT_DATEVALUE| ID|
+------------------+-------------------+-------+
|XY_12345678 – Main|2019-12-19 00:00:00|myuser1|
+------------------+-------------------+-------+
All I want in the output is a single string consisting of all the values in the dataframe row, separated by a comma or pipe. Although there are many rows in the dataframe, I just want one row to solve my purpose.
XY_12345678 – Main,2019-12-19 00:00:00,myuser1
I have tried the code below, and it has worked fine for my other dataframes, but for the one above it gives me an error.
df.rdd.map(lambda line: ",".join([str(x) for x in line])).take(1)[0]
Error when it encounters "–":
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 12: ordinal not in range(128)
I am using Spark 1.6 with Python 2 and have tried:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
According to the Spark 1.6 documentation, you can use the concat_ws function, which, given a separator and a set of columns, concatenates them into one string. So this should solve your issue:
from pyspark.sql.functions import col, concat_ws
df.select(concat_ws(",", col("NM1_PROFILE"), col("CURRENT_DATEVALUE"), col("ID")).alias("concat")).collect()
Or, if you prefer a more generic way, you can use something like this:
from pyspark.sql.functions import col, concat_ws
cols = [col(column) for column in df.columns]
df.select(concat_ws(",", *cols).alias("concat")).collect()
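Note that collect() returns a list of Row objects rather than a plain string; a small sketch of pulling out just the single string the question asks for (first() is available in Spark 1.6):
# first() returns a single Row; index 0 is the aliased "concat" column,
# i.e. the comma-joined values of that row as one unicode string.
row = df.select(concat_ws(",", *cols).alias("concat")).first()
single_string = row[0]   # u'XY_12345678 – Main,2019-12-19 00:00:00,myuser1'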
For more information: https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.functions.concat_ws
Hope this helps
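If you would rather keep the original RDD approach, the error itself comes from calling str() on a unicode value that contains the en dash; a minimal sketch of a unicode-safe variant:
# u",".join with unicode() avoids the implicit ASCII encode that str() performs
# on non-ASCII values in Python 2.
df.rdd.map(lambda line: u",".join([unicode(x) for x in line])).take(1)[0]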
Why is Pandas force-converting ASCII strings to unicode upon conversion from dictionary to DataFrame? Is this a feature or a known bug?
I'm using Python 2.7.3 and Pandas 0.20.2
MWE included below.
import pandas as pd
sample_dict={}
sample_dict['A'] = {'Key_1': 'A1', 'Key-2': 'A2', 'Key_3': 'A3'}
sample_dict['B'] = {'Key_1': 'B1', 'Key-2': 'B2', 'Key_3': 'B3'}
sample_dict['C'] = {'Key_1': 'C1', 'Key-2': 'C2', 'Key_3': 'C3'}
print sample_dict['A'].keys()
sample_df = pd.DataFrame.from_dict(sample_dict, orient='index')
print sample_df.keys()
Results in:
['Key-2', 'Key_1', 'Key_3']
Index([u'Key-2', u'Key_1', u'Key_3'], dtype='object')
Addendum: I came across this similar question, but it's been inactive for a couple of years and does not discuss why this is happening.
From the pandas DataFrame repr docstring, it says:
"""
Return a string representation for a particular object.
Yields Bytestring in Py2, Unicode String in py3.
"""
So I am sure that in Python 3 you should not see any unicode prefix.
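It may also help to note that the u'' prefix is only a display artifact in Python 2: an ASCII byte string and its unicode counterpart compare and hash equal, so lookups with plain str keys keep working. A small check, reusing the sample_df from above:
# ASCII str and unicode compare/hash equal in Python 2, so the prefix is cosmetic.
print u'Key_1' == 'Key_1'           # True
print sample_df['Key_1'].tolist()   # ['A1', 'B1', 'C1'] (row order may vary)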
I am currently using Python 2.7.5 on AIX 5.1 with cx_Oracle version 5.2 to connect to Oracle 12c.
I am trying to execute a SQL query, and put its output in a csv file using the csv module. The query I am running is:
Select 1.563/100, 0.38/100 from dual; -- simplified query
However the output in file is:
0.015629999999999998,0.0038
When I expect it to be
0.01563,0.0038
After doing some research, I believe this is because floating-point numbers are represented in binary (base 2).
But I don't know how to resolve this.
I also tried
from __future__ import division
But it did not help.
The ROUND function is your friend:
SELECT ROUND(3.1415926,4),ROUND(3.1415926,5) FROM DUAL;
ROUND(3.1415926,4) ROUND(3.1415926,5)
------------------ ------------------
3.1416 3.14159
or, in Python:
print round(3.1415926, 4)
print round(3.1415926, 5)
3.1416
3.14159
Thank you, Zsigmond Lőrinczy.
It worked, using to_char(round()):
>>> import cx_Oracle
>>> con = cx_Oracle.connect('xxx/xxx@xxx')
>>> cur = con.cursor()
>>> cur.execute("select 1.563/100, round(1.563/100,5), to_char(round(1.563/100,5)) from dual")
>>> l_result = cur.fetchall()
>>> l_result
[(0.015629999999999998, 0.015629999999999998, '0.01563')]
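If changing the SQL is not an option, an alternative (sketched here with an assumed output file name) is to format the floats on the Python side before handing them to the csv module, instead of relying on the default repr of the binary float the driver returns:
import csv

with open('output.csv', 'wb') as f:          # 'wb' for the csv module on Python 2
    writer = csv.writer(f)
    for row in l_result:
        # %g keeps 6 significant digits, turning 0.015629999999999998 into 0.01563
        writer.writerow(['%g' % v if isinstance(v, float) else v for v in row])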
I'm attempting to output the result into a pandas data frame. When I print the data frame, the object values appear correct, but when I use the to_csv function on the data frame, my csv output has only the first character for every string/object value.
import pandas
df = pandas.DataFrame({'a': [u'u\x00s\x00']})
df.to_csv('test.csv')
I've also tried the following addition to the to_csv function:
df.to_csv('test_encoded.csv', encoding= 'utf-8')
But I am getting the same results:
>>> print df
a
0 us
(output in csv file)
u
For reference, I'm connecting to a Vertica database and using the following setup:
OS: Mac OS X Yosemite (10.10.5)
Python 2.7.10 |Anaconda 2.3.0 (x86_64)| (default, Sep 15 2015,
14:29:08)
pyodbc 3.0.10
pandas 0.16.2
ODBC: Vertica ODBC 6.1.3
Any help figuring out how to pass the entire object string using the to_csv function in pandas would be greatly appreciated.
I was facing the same problem and found this post: UTF-32 in Python.
To fix your problem, I believe you need to replace every '\x00' with an empty string. I managed to write the correct CSV with the code below:
# Map the NUL code point (0x00) to an empty string so translate() strips it.
fixer = dict.fromkeys([0x00], u'')
df['a'] = df['a'].map(lambda x: x.translate(fixer))
df.to_csv('test.csv')
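An equivalent shortcut, if you prefer pandas' vectorised string methods (this assumes the column holds unicode strings, as in the example above):
# Series.str.replace removes every embedded NUL before writing the CSV.
df['a'] = df['a'].str.replace(u'\x00', u'')
df.to_csv('test.csv')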
To solve my problem with Vertica I had to change the encoding to UTF-16 in the file /Library/Vertica/ODBC/lib/vertica.ini with the configuration below
[Driver]
ErrorMessagesPath=/Library/Vertica/ODBC/messages/
ODBCInstLib=/usr/lib/libiodbcinst.dylib
DriverManagerEncoding=UTF-16
Best regards,
Anderson Neves
I'm importing two points of data from MySQLdb. The second point is a time which cursor.fetchall() returns as a timedelta. I had no luck trying to insert that info into xlsxwriter, always getting a "TypeError: Unknown or unsupported datetime type" error.
Ok... round 2
Now I'm trying to convert the timedelta into a datetime.datetime object:
for x in tempList:
    timeString = str(x[1])
    ctTime.append(datetime.datetime.strptime(timeString, "%H:%M:%S"))
Now in xlsxwriter, I set up formatting:
ctChart.set_x_axis({'name': 'Time', 'name_font': {'size': 14, 'bold': True}, 'num_font': {'italic': True},'date_axis': True})
And then I create a time format:
timeFormat = workbook.add_format({'num_format': 'hh:mm:ss'})
Then I attempt to insert data:
ctWorksheet.write_datetime('A1',ctTime,timeFormat)
But no matter what I do, no matter how I format the data, I always get the following error:
TypeError: Unknown or unsupported datetime type
Is there something stupidly obvious I'm missing?
******* EDIT 1 *******
jmcnamara - In response to your comment, here are more details:
I've tried using a list of timedeltas such as datetime.timedelta(0, 27453), which prints as 7:37:33, with the following code:
timeFormat = workbook.add_format({'num_format': 'hh:mm:ss'})
ctWorksheet.write_datetime('A1',ctTime,timeFormat)
I still get the error: TypeError: Unknown or unsupported datetime type
Even iterating through the list and attempting to insert the results fails:
timeFormat = workbook.add_format({'num_format': 'hh:mm:ss'})
i = 0
for t in ctTime:
    ctWorksheet.write_datetime(i,0,t,timeFormat)
    i += 1
I finally got it working with my most recent code. The chart still isn't graphing correctly using the inserted times, but at least they are inserting correctly.
Since I was pulling the timedeltas from SQL, I had to change their format first. Raw timedeltas from SQL just weren't working:
for x in tempList:
    timeString = datetime.datetime.strptime(str(x[1]), "%H:%M:%S")
    ctTime.append(timeString)
With those datetime.strptime-formatted times I was then able to insert successfully into the worksheet.
timeFormat = workbook.add_format({'num_format': 'hh:mm:ss'})
i = 0
for t in ctTime:
    ctWorksheet.write_datetime(i,0,t,timeFormat)
    i += 1
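As a side note, a minimal alternative sketch (assuming every value stays under 24 hours) is to turn each timedelta into a datetime.time directly, without round-tripping through a string; XlsxWriter's write_datetime accepts datetime.time objects:
for x in tempList:
    td = x[1]                                            # datetime.timedelta from MySQLdb
    ctTime.append((datetime.datetime.min + td).time())   # -> datetime.time, e.g. 07:37:33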
The GitHub master version of XlsxWriter supports datetime.timedelta.
Try it out and let me know if it works. It will probably be uploaded to PyPI in the next week.