How to replace the values of a single column, which consist of different values, with only one value?

I have a dataframe with a column whose values look like:
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tMon\n\t\t\t\t\t\t\t\t\t\t\t\t
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tWed\n\t\t\t\t\t\t\t\t\t\t\t\t
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tTue\n\t\t\t\t\t\t\t\t\t\t\t\t
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tSat\n\t\t\t\t\t\t\t\t\t\t\t\t
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tSun\n\t\t\t\t\t\t\t\t\t\t\t\t
I want to replace these values with only the month name, Jan, and have tried the following code to do that:
df['month'] = df['month'].apply(str)
df['month'] = df['month'].str.replace('Jan\n\t\t\t\t\t\t\t\t\t\t\t\tMon\n\t\t\t\t\t\t\t\t\t\t\t\t ', '1-stop')
I have also tried to rename the values using this command:
df.rename({'Jan\n\t\t\t\t\t\t\t\t\t\t\t\tMon\n\t\t\t\t\t\t\t\t\t\t\t\t': 'Jan'}, axis=1, inplace=True)
Also, I used:
import re
month = df['month']
result = str(month)
#result = re.sub(r'\s+', '', str(month))
But nothing seems to work. Also, I want to rename all the row values to 'Jan'. Can anyone help me with this?
Thanks in advance.
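For what it's worth, one approach (not from this thread) sidesteps matching the whitespace exactly: since each value starts with the month name, splitting on any whitespace and keeping the first token leaves just 'Jan'. A minimal sketch with toy data:

```python
import pandas as pd

# Toy data mimicking the question's tab/newline-padded values
df = pd.DataFrame({'month': ['Jan\n\t\tMon\n\t\t',
                             'Jan\n\t\tWed\n\t\t',
                             'Jan\n\t\tSun\n\t\t']})

# str.split() with no argument splits on any run of whitespace,
# so the first token is always the month name
df['month'] = df['month'].astype(str).str.split().str[0]
print(df['month'].tolist())
```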

Related

unexpected character after line continuation character. Also to keep rows after floating point rows in pandas dataframe

I have a dataset in which I want to keep row just after a floating value row and remove other rows.
For eg, a column of the dataframe looks like this:
17.3
Hi Hello
Pranjal
17.1
[aasd]How are you
I am fine[:"]
Live Free
So in this I want to preserve:
Hi Hello
[aasd]How are you
and remove the rest. I tried it with the following code, but an error showed up saying "unexpected character after line continuation character". Also, I don't know if this code will solve my purpose.
# Dropping extra rows
for ind in data.index:
    if re.search((([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?, ind):
        ind += 1
    else:
        data.drop(ind)
Your regex has to be a string; you can't just write it like that:
re.search('(([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?', ind):
Edit: but actually I think the rest of your code is wrong too.
What you really want is something more like this:
import pandas as pd

l = ['17.3',
     'Hi Hello',
     'Pranjal',
     '17.1',
     '[aasd]How are you',
     'I am fine[:"]',
     'Live Free']
data = pd.DataFrame(l, columns=['col'])
data[data.col.str.match(r'\d+\.\d*').shift(1) == True]
Logic:
If you have a dataframe with a column that is all string type (this won't work for mixed decimal and string types), you can find the decimal/int entries with the regex '\d+\.?\d*'. If you shift this mask by one, it gives you the entries after the matches; use that to select the rows you want in your dataframe.
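Run end to end, that logic can be verified on the sample data; only the rows that immediately follow a numeric row survive:

```python
import pandas as pd

l = ['17.3', 'Hi Hello', 'Pranjal', '17.1',
     '[aasd]How are you', 'I am fine[:"]', 'Live Free']
data = pd.DataFrame(l, columns=['col'])

# True where a row looks numeric; shifting by 1 marks the row after it
mask = data.col.str.match(r'\d+\.\d*').shift(1) == True
kept = data[mask].col.tolist()
print(kept)  # ['Hi Hello', '[aasd]How are you']
```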

Modify values across all column pyspark

I have a pyspark data frame and I'd like to have a conditional replacement of a string across multiple columns, not just one.
To be more concrete: I'd like to replace the string 'HIGH' with 1, and everything else in the column with 0. [Or at least replace every 'HIGH' with 1.] In pandas I would do:
df[df == 'HIGH'] = 1
Is there a way to do something similar? Or can I do a loop?
I'm new to pyspark so I don't know how to generate example code.
You can use the replace method for this:
>>> df.replace("HIGH", "1")
Keep in mind that you'll need to replace like for like datatypes, so attempting to replace "HIGH" with the integer 1 will throw an exception.
Edit: You could also use regexp_replace to address both parts of your question, but you'd need to apply it to all columns:
>>> from pyspark.sql.functions import regexp_replace
>>> df = df.withColumn("col1", regexp_replace("col1", "^(?!HIGH).*$", "0"))
>>> df = df.withColumn("col1", regexp_replace("col1", "^HIGH$", "1"))
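For comparison, the pandas idiom the asker mentions really does hit every column at once; a quick local check (plain pandas, not pyspark, with made-up column names):

```python
import pandas as pd

df = pd.DataFrame({'a': ['HIGH', 'LOW'], 'b': ['MED', 'HIGH']})

# Boolean mask over the whole frame, then assignment:
# every cell equal to 'HIGH' becomes 1, whichever column it sits in
df[df == 'HIGH'] = 1
print(df)
```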

How to get column names of a pandas.DataFrame from below given description of the data

Every column name ends with a colon, and each new column name starts on a new line once the previous line has ended with a full stop, so there should be a way to get a list of column names from the string:
data_description = '''age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school.
education-num: continuous.'''
How can I get the below output?
Columns = ['age','workclass','fnlwgt','education','education-num']
The title of your post says "get column names of a pandas.DataFrame", but I don't see pandas code written anywhere in your explanation.
You could do this very easily through pandas:
First create your dictionary like this:
data_description = {'age': ['continuous.'],
                    'workclass': ['Private, Self-emp-not-inc, Self-emp-inc, Federal-gov.'],
                    'fnlwgt': ['continuous.'],
                    'education': ['Bachelors, Some-college, 11th, HS-grad, Prof-school.'],
                    'education-num': ['continuous.']}
Then create a dataframe using the above dict:
df = pd.DataFrame(data_description)
Then just say list(df.columns) and it will give you all the column names in a list.
In [1009]: list(df.columns)
Out[1009]: ['age', 'education', 'education-num', 'fnlwgt', 'workclass']
Try this:
>>> Columns = [i.split(':')[0] for i in data_description.split() if ':' in i]
>>> Columns
['age', 'workclass', 'fnlwgt', 'education', 'education-num']
Using regular expressions, capture the non-space (\S) characters before the colon; the parentheses are used to capture, and \S means the opposite of a space. In this case, you can simply do:
import re
re.findall(r'(\S+):',data_description)
['age', 'workclass', 'fnlwgt', 'education', 'education-num']
If you need to take the \n into consideration, perhaps because the data might contain tokens that are followed by a colon yet are not column names, then:
re.findall(r'(?:^|\n)(\S+):',data_description)
['age', 'workclass', 'fnlwgt', 'education', 'education-num']
I would first remove all the \n characters that are imported with the string, and then apply the split() and filter() methods, like this:
data_description = data_description.replace("\n", "")
columns = [i.split(":")[0] for i in list(filter(None, data_description.split(".")))]
Now you get the name of each column:
columns
['age', 'workclass', 'fnlwgt', 'education', 'education-num']
There is no general rule. For each case you have to think about how to remove leading and trailing whitespace, and try to use methods like split() in a way that gets you what you need.
This is a simple one-liner.
print([every_line.split(':')[0] for every_line in data_description.split('\n')])
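All of the answers above can be checked against the sample string; the regex variant, for instance:

```python
import re

data_description = '''age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school.
education-num: continuous.'''

# Anchor at start-of-string or a newline so only leading tokens
# followed by a colon are treated as column names
columns = re.findall(r'(?:^|\n)(\S+):', data_description)
print(columns)  # ['age', 'workclass', 'fnlwgt', 'education', 'education-num']
```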

KeyError Pandas Dataframe (encoding index)

I'm running the code below. It creates a couple of dataframes whose index is taken from a column of another dataframe that holds a list of conference names.
df_conf = pd.read_sql("select distinct Conference from publications where year>=1991 and length(conference)>1 order by conference", db)
for index, row in df_conf.iterrows():
    row[0] = row[0].encode("utf-8")

df2 = pd.DataFrame(index=df_conf['Conference'], columns=['Citation1991', 'Citation1992'])
df2 = df2.fillna(0)
df_if = pd.DataFrame(index=df_conf['Conference'], columns=['IF1994', 'IF1995'])
df_if = df_if.fillna(0)

df_pubs = pd.read_sql("select Conference, Year, count(*) as totalPubs from publications where year>=1991 group by conference, year", db)
for index, row in df_pubs.iterrows():
    row[0] = row[0].encode("utf-8")

df_pubs = df_pubs.pivot(index='Conference', columns='Year', values='totalPubs')
df_pubs.fillna(0)

for index, row in df2.iterrows():
    df_if.ix[index, 'IF1994'] = df2.ix[index, 'Citation1992'] / (df_pubs.ix[index, 1992] + df_pubs.ix[index, 1993])
The last line keeps giving me the following error:
KeyError: 'Analyse dynamischer Systeme in Medizin, Biologie und \xc3\x96kologie'
Not quite sure what I'm doing wrong. I tried encoding the indexes; it won't work. I even tried .at; it still won't work.
I know it has to do with encoding, as it always stops at indexes with non-ascii characters.
I'm using python 2.7
I think the problem with this:
for index, row in df_conf.iterrows():
    row[0] = row[0].encode("utf-8")
is that it may or may not work; I'm surprised it didn't raise a warning.
Besides that, it's much quicker to use the vectorised str method to encode the series:
df_conf['col_name'] = df_conf['col_name'].str.encode('utf-8')
If needed you can also encode the index in a similar fashion:
df.index = df.index.str.encode('utf-8')
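A minimal illustration of the vectorised index encode (shown here on Python 3 pandas, although the question itself is Python 2):

```python
import pandas as pd

# A one-row frame with a non-ASCII label, like the conference names
df = pd.DataFrame({'v': [1]}, index=['Ökologie'])

# Index.str.encode converts every label from str to bytes in one pass
df.index = df.index.str.encode('utf-8')
print(df.index[0])  # b'\xc3\x96kologie'
```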
Does it happen in the line in the last part of the code?
df_if.ix[index,'IF1994'] = df2.ix[index,'Citation1992'] / (df_pubs.ix[index,1992]+df_pubs.ix[index,1993])
If so, try
df_if.ix[index,u'IF1994'] = df2.ix[index,u'Citation1992'] / (df_pubs.ix[index,1992]+df_pubs.ix[index,1993])
That should work. Dataframe indexing with UTF-8 strings works in a strange way even when the script is declared with "# -*- coding: utf-8 -*-". Just put the "u" prefix on UTF-8 strings when you use dataframe columns and indexes with UTF-8 strings.

Adding data to a Pandas dataframe

I have a dataframe that contains Physician_Profile_City, Physician_Profile_State and Physician_Profile_Zip_Code. I ultimately want to stratify an analysis based on state, but unfortunately not all of the Physician_Profile_States are filled in. I started looking around to try and figure out how to fill in the missing states. I came across the pyzipcode module, which takes a zip code as input and returns the state as follows:
In [39]: from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
zipcode = zcdb[54115]
zipcode.state
Out[39]: u'WI'
What I'm struggling with is how I would iterate through the dataframe and add the appropriate "Physician_Profile_State" when that variable is missing. Any suggestions would be most appreciated.
No need to iterate. If the data takes the form of a dict, then you should be able to perform the following:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].map(zcdb)
Otherwise you can call apply like so:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].apply(lambda x: zcdb[x].state)
In the case where the above won't work because it can't generate a Series to align with your df, you can apply row-wise, passing axis=1 to the df:
df['Physician_Profile_State'] = df[['Physician_Profile_Zip_Code']].apply(lambda x: zcdb[x].state, axis=1)
By using double square brackets we return a df, allowing you to pass the axis param.
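A sketch of the fill-only-missing variant, with a plain dict standing in for the pyzipcode lookup (the dict and its values here are illustrative):

```python
import pandas as pd

# Hypothetical zip -> state lookup standing in for ZipCodeDatabase
zip_to_state = {54115: 'WI', 10001: 'NY'}

df = pd.DataFrame({
    'Physician_Profile_Zip_Code': [54115, 10001, 54115],
    'Physician_Profile_State': [None, 'NY', None],
})

# Map every zip to a state, then fillna so rows that already
# have a state keep it and only the gaps are filled
mapped = df['Physician_Profile_Zip_Code'].map(zip_to_state)
df['Physician_Profile_State'] = df['Physician_Profile_State'].fillna(mapped)
print(df['Physician_Profile_State'].tolist())  # ['WI', 'NY', 'WI']
```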