I have a dataframe with a column containing values like:
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tMon\n\t\t\t\t\t\t\t\t\t\t\t\t
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tWed\n\t\t\t\t\t\t\t\t\t\t\t\t
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tTue\n\t\t\t\t\t\t\t\t\t\t\t\t
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tSat\n\t\t\t\t\t\t\t\t\t\t\t\t
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tSun\n\t\t\t\t\t\t\t\t\t\t\t\t
I want to replace these values with just the month name, Jan, and have tried the following code to do that:
df['month'] = df['month'].apply(str)
df['month'] = df['month'].str.replace('Jan\n\t\t\t\t\t\t\t\t\t\t\t\tMon\n\t\t\t\t\t\t\t\t\t\t\t\t ', '1-stop')
I have also tried to rename the values using this command:
df.rename({'Jan\n\t\t\t\t\t\t\t\t\t\t\t\tMon\n\t\t\t\t\t\t\t\t\t\t\t\t': 'Jan'}, axis=1, inplace=True)
I also tried:
import re
month = df['month']
result = str(month)
#result = re.sub(r'\s+', '', str(month))
But nothing seems to work. I want to replace all the row values with the name 'Jan'. Can anyone help me with this?
Thanks in advance.
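One approach (a sketch, assuming every value is a string starting with the month name followed by whitespace): split on whitespace and keep the first token, so the literal pattern never has to be spelled out. The sample data here is a shortened, hypothetical version of the values above.

```python
import pandas as pd

# Hypothetical data mimicking the messy values described above
df = pd.DataFrame({'month': ['Jan\n\t\t\t\tMon\n\t\t\t\t',
                             'Jan\n\t\t\t\tWed\n\t\t\t\t',
                             'Jan\n\t\t\t\tSat\n\t\t\t\t']})

# Split on any run of whitespace and keep only the first token ("Jan")
df['month'] = df['month'].astype(str).str.split().str[0]
print(df['month'].tolist())  # ['Jan', 'Jan', 'Jan']
```

Because str.split() with no arguments splits on any whitespace run, this also tolerates values where the number of tabs or newlines varies from row to row.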
I have a dataset in which I want to keep only the row just after a floating-point value row and remove the other rows.
For eg, a column of the dataframe looks like this:
17.3
Hi Hello
Pranjal
17.1
[aasd]How are you
I am fine[:"]
Live Free
So in this I want to preserve:
Hi Hello
[aasd]How are you
and remove the rest. I tried the following code, but an error showed up saying "unexpected character after line continuation character". I also don't know whether this code will serve my purpose.
# Dropping extra rows
for ind in data.index:
    if re.search((([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?, ind):
        ind += 1
    else:
        data.drop(ind)
Your regex has to be a string; you can't just write it inline like that.
re.search(r'(([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?', ind):
Edit: actually, I think the rest of your code is wrong too.
What you really want is something more like this:
import pandas as pd
l = ['17.3',
'Hi Hello',
'Pranjal',
'17.1',
'[aasd]How are you',
'I am fine[:"]',
'Live Free']
data = pd.DataFrame(l, columns=['col'])
data[data.col.str.match(r'\d+\.\d*').shift(1) == True]
Logic:
If you have a dataframe with a column that is all string type (this won't work for a mixed column of decimals and strings), you can find the decimal/int entries with the regex '\d+\.\d*'. If you shift this mask by one, it gives you the entries after the matches; use that to select the rows you want in your dataframe.
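For the mixed-type case the note above rules out, a sketch of one workaround: cast the column to str first, so .str.match sees a string for every row, and use shift's fill_value so the first row gets a clean False instead of NaN.

```python
import pandas as pd

# Hypothetical mixed-type column: real floats mixed in with strings
data = pd.DataFrame({'col': [17.3, 'Hi Hello', 'Pranjal', 17.1,
                             '[aasd]How are you', 'I am fine[:"]', 'Live Free']})

# Cast everything to str so the regex match works on every row
mask = data['col'].astype(str).str.match(r'\d+\.\d*')

# Shift the mask down one row to select the entries *after* each match
kept = data[mask.shift(1, fill_value=False)]
print(kept['col'].tolist())  # ['Hi Hello', '[aasd]How are you']
```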
I have a pyspark data frame and I'd like to have a conditional replacement of a string across multiple columns, not just one.
To be more concrete: I'd like to replace the string 'HIGH' with 1, and everything else in the column with 0. [Or at least replace every 'HIGH' with 1.] In pandas I would do:
df[df == 'HIGH'] = 1
Is there a way to do something similar? Or can I do a loop?
I'm new to pyspark so I don't know how to generate example code.
You can use the replace method for this:
>>> df.replace("HIGH", "1")
Keep in mind that you'll need to replace like for like datatypes, so attempting to replace "HIGH" with 1 will throw an exception.
Edit: You could also use regexp_replace (from pyspark.sql.functions) to address both parts of your question, but you'd need to apply it to all columns:
>>> df = df.withColumn("col1", regexp_replace("col1", "^(?!HIGH).*$", "0"))
>>> df = df.withColumn("col1", regexp_replace("col1", "^HIGH$", "1"))
Every column name ends with a colon, and each new column name starts on a new line after the previous line ends with a full stop, so there should be a way to get a list of column names from the string:
data_description = '''age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school.
education-num: continuous.'''
How can I get the output below?
Columns = ['age','workclass','fnlwgt','education','education-num']
The title of your post says to get the column names of a pandas.DataFrame from the text below, but I don't see any pandas code in your explanation.
You could do this very easily through pandas:
First create your dictionary like this:
data_description = {'age': ['continuous.'],
'workclass': ['Private, Self-emp-not-inc, Self-emp-inc, Federal-gov.'],
'fnlwgt': ['continuous.'],
'education':[ 'Bachelors, Some-college, 11th, HS-grad, Prof-school.'],
'education-num': ['continuous.']}
Then create a dataframe from the dict above:
df = pd.DataFrame(data_description)
Then just say list(df.columns) and it will give you all the column names in a list:
In [1009]: list(df.columns)
Out[1009]: ['age', 'education', 'education-num', 'fnlwgt', 'workclass']
Try this:
>>> Columns = [i.split(':')[0] for i in data_description.split() if ':' in i]
>>> Columns
['age', 'workclass', 'fnlwgt', 'education', 'education-num']
Using regular expressions, capture the non-space characters that come before a colon: parentheses are used to capture, and \S matches anything that is not whitespace. In this case, you can simply do:
import re
re.findall(r'(\S+):',data_description)
['age', 'workclass', 'fnlwgt', 'education', 'education-num']
If you need to take the \n into consideration (maybe because the data could contain tokens followed by a colon that are not column names), anchor the match to the start of a line:
re.findall(r'(?:^|\n)(\S+):',data_description)
['age', 'workclass', 'fnlwgt', 'education', 'education-num']
I would first remove all the \n characters imported with the string, and then apply the split() and filter() methods, like this:
data_description = data_description.replace("\n", "")
columns = [i.split(":")[0] for i in list(filter(None, data_description.split(".")))]
Now you get the name of each column:
columns
['age', 'workclass', 'fnlwgt', 'education', 'education-num']
There is no general rule. For each case you have to think about how to remove leading and trailing whitespace, and use methods like split in a way that gets you what you need.
This is a simple one-liner.
print([every_line.split(':')[0] for every_line in data_description.split('\n')])
I'm running the code below. It creates a couple of dataframes that use a column from another dataframe, which holds a list of conference names, as their index.
df_conf = pd.read_sql("select distinct Conference from publications where year>=1991 and length(conference)>1 order by conference", db)
for index, row in df_conf.iterrows():
    row[0] = row[0].encode("utf-8")
df2= pd.DataFrame(index=df_conf['Conference'], columns=['Citation1991','Citation1992'])
df2 = df2.fillna(0)
df_if= pd.DataFrame(index=df_conf['Conference'], columns=['IF1994','IF1995'])
df_if = df_if.fillna(0)
df_pubs=pd.read_sql("select Conference, Year, count(*) as totalPubs from publications where year>=1991 group by conference, year", db)
for index, row in df_pubs.iterrows():
    row[0] = row[0].encode("utf-8")
df_pubs= df_pubs.pivot(index='Conference', columns='Year', values='totalPubs')
df_pubs.fillna(0)
for index, row in df2.iterrows():
    df_if.ix[index,'IF1994'] = df2.ix[index,'Citation1992'] / (df_pubs.ix[index,1992]+df_pubs.ix[index,1993])
The last line keeps giving me the following error:
KeyError: 'Analyse dynamischer Systeme in Medizin, Biologie und \xc3\x96kologie'
Not quite sure what I'm doing wrong. I tried encoding the indexes; it won't work. I even tried .at; it still won't work.
I know it has to do with encoding, as it always stops at indexes with non-ascii characters.
I'm using python 2.7
I think the problem with this:
for index, row in df_conf.iterrows():
    row[0] = row[0].encode("utf-8")
is that it may or may not work; I'm surprised it didn't raise a warning.
Besides that, it's much quicker to use the vectorised str method to encode the series:
df_conf['col_name'] = df_conf['col_name'].str.encode('utf-8')
If needed you can also encode the index in a similar fashion:
df.index = df.index.str.encode('utf-8')
Does it happen on this line in the last part of the code?
df_if.ix[index,'IF1994'] = df2.ix[index,'Citation1992'] / (df_pubs.ix[index,1992]+df_pubs.ix[index,1993])
If so, try:
df_if.ix[index,u'IF1994'] = df2.ix[index,u'Citation1992'] / (df_pubs.ix[index,1992]+df_pubs.ix[index,1993])
That should work. Dataframe indexing with UTF-8 strings works in a strange way even when the script is declared with "# -*- coding: utf-8 -*-". Just put a "u" prefix on the strings when you index dataframe columns and indexes with UTF-8 strings.
I have a dataframe that contains Physician_Profile_City, Physician_Profile_State and Physician_Profile_Zip_Code. I ultimately want to stratify an analysis based on state, but unfortunately not all of the Physician_Profile_States are filled in. I started looking around to try and figure out how to fill in the missing States. I came across the pyzipcode module which can take as an input a zip code and returns the state as follows:
In [39]: from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
zipcode = zcdb[54115]
zipcode.state
Out[39]: u'WI'
What I'm struggling with is how I would iterate through the dataframe and add the appropriate "Physician_Profile_State" when that variable is missing. Any suggestions would be most appreciated.
No need to iterate. If the data takes the form of a dict, then you should be able to do the following:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].map(zcdb)
Otherwise you can call apply like so:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].apply(lambda x: zcdb[x].state)
In the case where the above won't work, because it can't generate a Series to align with your df, you can apply row-wise, passing axis=1 to the df:
df['Physician_Profile_State'] = df[['Physician_Profile_Zip_Code']].apply(lambda x: zcdb[x].state, axis=1)
By using double square brackets we return a df, allowing you to pass the axis param.
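Putting this together, a sketch of filling in only the rows where the state is missing (the sample data is hypothetical, and a plain dict stands in for pyzipcode's ZipCodeDatabase so the example is self-contained):

```python
import pandas as pd

# Stand-in for zcdb: zip -> state (in real code, look up via pyzipcode)
zip_to_state = {54115: 'WI', 10001: 'NY'}

df = pd.DataFrame({
    'Physician_Profile_Zip_Code': [54115, 10001, 54115],
    'Physician_Profile_State': ['WI', None, None],
})

# Fill only the rows where the state is missing, leaving existing values alone
missing = df['Physician_Profile_State'].isna()
df.loc[missing, 'Physician_Profile_State'] = (
    df.loc[missing, 'Physician_Profile_Zip_Code'].map(zip_to_state)
)
print(df['Physician_Profile_State'].tolist())  # ['WI', 'NY', 'WI']
```

Restricting the assignment with the boolean mask avoids overwriting states that are already filled in, in case the zip-code lookup ever disagrees with the original data.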