Is there a way to encode the index of my dataframe? I have a dataframe where the index is the name of international conferences.
df2 = pd.DataFrame(index=df_conf['Conference'], columns=['Citation1991', 'Citation1992'])
I keep getting:
KeyError: 'Leitf\xc3\xa4den der angewandten Informatik'
whenever my code references a foreign conference name containing non-ASCII characters.
I tried:
df.at[x.encode("utf-8"), 'col1']
df.at[x.encode('ascii', 'ignore'), 'col']
Is there a way around it? I tried to see if I could encode the dataframe itself when creating it, but it doesn't seem I can do that either.
If you're not reading from CSV and you want to encode your string index, this is what worked for me:
df.index = df.index.str.encode('utf-8')
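For example, a minimal round trip on toy data (the index label below is just the decoded form of the name in the KeyError above; in Python 2 the .at lookup key is the corresponding UTF-8 byte string):
import pandas as pd
df = pd.DataFrame({'col1': [1]}, index=[u'Leitf\xe4den der angewandten Informatik'])
df.index = df.index.str.encode('utf-8')  # index labels are now byte strings
print(df.at['Leitf\xc3\xa4den der angewandten Informatik', 'col1'])  # prints 1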
The encoding should be handled when reading the input file, using the encoding option:
df = pd.read_csv('bibliography.csv', delimiter=',', encoding="utf-8")
or, if the file uses a BOM,
df = pd.read_csv('bibliography.csv', delimiter=',', encoding="utf-8-sig")
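Once the file is decoded at read time, lookups with unicode labels should work. A minimal sketch, assuming bibliography.csv has the Conference and Citation1991 columns from the question:
df = pd.read_csv('bibliography.csv', encoding='utf-8', index_col='Conference')
value = df.at[u'Leitf\xe4den der angewandten Informatik', 'Citation1991']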
Just put a u prefix in front of UTF-8 string literals, so that
df2= pd.DataFrame(index=df_conf[u'Conference'], columns=[u'Citation1991',u'Citation1992'])
It will work.
I have the below kind of space-separated data. I want to parse it on spaces, but I run into an issue when a particular element has a space in it.
2018-02-13 17:21:52.809 “EWQRR.OOM” “ERW WERT11”
I have used the below code:
import shlex
rdd = line.map(lambda x: shlex.split(x))
but it's returning a result containing characters like \x00\x00\x00.
Use re.findall() with the regex “.+?”|\S+, or “[^”]*”|\S+ (suggested by @ctwheels), which performs better.
import re
rdd = line.map(lambda x: re.findall(r'“.+?”|\S+', x))
Input:
"1234" "ewer" "IIR RT" "OOO"
Actual output:
1234, ewer, IIR, RT, OOO
Desired output:
1234, ewer, IIR RT, OOO
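For what it's worth, the pattern can be sanity-checked outside Spark. A minimal sketch on the sample line from the question, with the curly quotes written as \u201c/\u201d escapes:
import re
line = u'2018-02-13 17:21:52.809 \u201cEWQRR.OOM\u201d \u201cERW WERT11\u201d'
print(re.findall(u'\u201c.+?\u201d|\S+', line))
# [u'2018-02-13', u'17:21:52.809', u'\u201cEWQRR.OOM\u201d', u'\u201cERW WERT11\u201d']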
By default, all text lines are decoded as unicode if you are using sparkContext's textFile API, as the API documentation for textFile says:
Read a text file from HDFS, a local file system (available on all
nodes), or any Hadoop-supported file system URI, and return it as an
RDD of Strings.
If use_unicode is False, the strings will be kept as str (encoding
as utf-8), which is faster and smaller than unicode. (Added in
Spark 1.2)
And by default this option is True:
@ignore_unicode_prefix
def textFile(self, name, minPartitions=None, use_unicode=True):
And that's the reason you are getting unicode characters like \x00\x00\x00 in the results.
You should include the use_unicode option while reading the data file into the RDD, as follows:
import shlex
rdd = sc.textFile("path to data file", use_unicode=False).map(lambda x: shlex.split(x))
Your results should be
['2018-02-13', '17:21:52.809', 'EWQRR.OOM', 'ERW WERT11']
You can even include utf-8 encoding in the map function, as follows:
import shlex
rdd = sc.textFile("path to the file").map(lambda x: shlex.split(x.encode('utf-8')))
I hope the answer is helpful.
I have two dataframes - an original file that looks like this:
Gene Symbol, 10555, 10529, 10519
Map7, .184, .026, .207
nan, .348, .041, .187
Cpm, .45, .278, .453
and a reference file that looks like this:
Experiment_Num, Microarray, Experiment_Name, Chip_Name
10555, Genechip, Famotidine-5d, RG230-2
10529, Genechip, Famotidine-3d, RG230-2
10519, MMchip, Dicyclomine-3d, R01
I am trying to merge them in a way that the header of the original file displays the Experiment_Name rather than just the Experiment_Num, as follows:
Gene symbol, Famotidine-5d, Famotidine-3d, Dicyclomine-3d
Map7, .184, .026, .207
nan, .348, .041, .187
Cpm, .45, .278, .453
My code is completely written using pandas and looks as follows:
import pandas as pd
df = pd.read_csv('ftp://anonftp.niehs.nih.gov/drugmatrix/Differentially_expressed_gene_lists_directly_from_DrugMatrix/Affymetrix/Affymetrix_annotation.txt', sep='\t', dtype=str)
# Reference file
df2.columns = df2.columns.to_series().replace(df.set_index('Experiment').Compound_Name)
#Original File
df2
I tried to convert the columns of the original DF to their series representation and then replace the old values, which were the Experiment_Num, with the new Experiment_Name retrieved from the reference DF, but I keep getting
KeyError: 'Experiment'
I tried figuring out what could be causing a KeyError, but found that there are so many possibilities, none of which seems to fix my particular issue.
Thanks for the help if possible!
Troy
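A minimal sketch of the intended rename, assuming the reference file's headers really are Experiment_Num and Experiment_Name as in the sample above (the KeyError suggests there is no column literally named 'Experiment'):
# df: reference file, df2: original file, as in the question
mapping = df.set_index('Experiment_Num')['Experiment_Name']
df2.columns = df2.columns.to_series().replace(mapping)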
I have a script that processes an Excel file. The department that sends it moved to a new system that generates it, and my script stopped working.
I suddenly got the error Can only use .str accessor with string values, which use np.object_ dtype in pandas for the following line of code:
df['DATE'] = df['Date'].str.replace(r'[^a-zA-Z0-9\._/-]', '')
I checked the type of the date columns in the file from the old system (dtype: object) vs the file from the new system (dtype: datetime64[ns]).
How do I change the date format to something my script will understand?
I saw this answer but my knowledge about date formats isn't this granular.
You can use the apply function on the dataframe column to convert the necessary column to String. For example:
df['DATE'] = df['Date'].apply(lambda x: x.strftime('%Y-%m-%d'))
Make sure to import the datetime module.
apply() will take each cell at a time for evaluation and apply the formatting as specified in the lambda function.
pd.to_datetime returns a Series of datetime64 dtype, as described here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
Since your column is already datetime64[ns], you can extract the date part with:
df['DATE'] = df['Date'].dt.date
or this (after import datetime):
df['Date'].map(datetime.datetime.date)
You can use pd.to_datetime:
df['DATE'] = pd.to_datetime(df['DATE'])
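If the downstream .str.replace line still expects strings, the parsed dates can be formatted back to text afterwards (the target format here is an assumption):
df['DATE'] = pd.to_datetime(df['DATE']).dt.strftime('%Y-%m-%d')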
I have a field/column in a .csv file that I am loading into Pandas that will not parse as a datetime data type in Pandas. I don't really understand why. I want both FirstTime and SecondTime to parse as datetime64 in Pandas DataFrame.
# Assigning a header for our data
header = ['FirstTime', 'Col1', 'Col2', 'Col3', 'SecondTime', 'Col4',
'Col5', 'Col6', 'Col7', 'Col8']
# Loading our data into a dataframe
df = pd.read_csv('MyData.csv', names=header, parse_dates=['FirstTime', 'SecondTime'])
The code above will only parse SecondTime as datetime64[ns]. FirstTime is left as an object data type. If I do the following code instead:
# Assigning a header for our data
header = ['FirstTime', 'Col1', 'Col2', 'Col3', 'SecondTime', 'Col4',
'Col5', 'Col6', 'Col7', 'Col8']
# Loading our data into a dataframe
df = pd.read_csv('MyData.csv', names=header, parse_dates=['FirstTime'])
It still will not parse FirstTime as a datetime64[ns].
The format for both columns is the same:
# Example FirstTime
# (%f is always .000)
2015-11-05 16:52:37.000
# Example SecondTime
# (%f is always .000)
2015-11-04 15:33:15.000
What am I missing here? Is the first column not able to be datetime by default or something in Pandas?
Did you try:
df = pd.read_csv('MyData.csv', names=header, parse_dates=True)
I had a similar problem, and it turned out that one of my date variables contained an integer cell, so Python recognized that column as "object" while the other was recognized as "int64". You need to make sure both variables are integers.
You can use df.dtypes to see the format of your variables.
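Along those lines, a minimal diagnostic sketch, assuming the same MyData.csv and header list as in the question: read without parse_dates, inspect the dtypes, then coerce explicitly so any bad cells become NaT instead of silently blocking the parse:
import pandas as pd
df = pd.read_csv('MyData.csv', names=header)
print(df.dtypes)  # see what FirstTime actually came in as
df['FirstTime'] = pd.to_datetime(df['FirstTime'], errors='coerce')
df['SecondTime'] = pd.to_datetime(df['SecondTime'], errors='coerce')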
I have a pandas dataframe (python 2.7) containing a u'\u2019' character that does not let me export my result as CSV.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 180: ordinal not in range(128)
Is there a way to query the dataframe and substitute this character with another one?
Try using a different encoding when saving to file (the default in pandas for Python 2.x is ascii, which is why you get the error: it can't handle unicode characters):
df.to_csv(path, encoding='utf-8')
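Alternatively, since the question asks about substituting the character, it can be replaced before exporting. A sketch, with the column name and replacement character as assumptions:
df['text_col'] = df['text_col'].str.replace(u'\u2019', u"'")  # hypothetical column
df.to_csv(path, encoding='utf-8')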
I did not manage to export the whole file. However, I managed to identify the rows with the characters causing problems and eliminate them:
faulty_rows = []
for i in range(len(outcome)):
    try:
        # try exporting each row on its own; rows that fail are the bad ones
        test = outcome.iloc[i]
        test.to_csv("/Users/john/test/test.csv")
    except:
        faulty_rows.append(i)
        print i

# drop the offending rows, then export the rest
tocsv = outcome.drop(outcome.index[faulty_rows])
tocsv.to_csv("/Users/john/test/test.csv")