Randomly set one-third of na's in a column to one value and the rest to another value - python-2.7

I'm trying to impute missing values in a dataframe df. I have a column A with 300 NaN's. I want to randomly set 2/3rd of it to value1 and the rest to value2.
Please help.
EDIT: I'm actually trying to do this on dask, which does not support item assignment. This is what I have currently. Initially, I thought I'd try to convert all NAs to value1:
da.where(df.A.isnull() == True, 'value1', df.A)
I got the following error:
ValueError: need more than 0 values to unpack

As the comment suggests, you can solve this with Series.where.
The following will work, but I cannot promise how efficient it is. (I suspect it may be better to produce a whole column of replacements at once with numpy.random.choice.)
import random
df['A'] = df['A'].where(~df['A'].isnull(),
                        lambda df: df.map(
                            lambda x: random.choice(['value1', 'value1', x])))
Explanation: if the value is not null (NaN), keep the original. Where it is null, replace it with the corresponding value of the series produced by the first lambda, which maps each value of the column (chunk) to a random choice: the original value with probability 1/3 and 'value1' otherwise. The NaNs that remain (about 1/3 of them) can then be set to 'value2' with fillna.
Note that, depending on your data, this likely has changed the data type of the column.
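For an exact 2/3 to 1/3 split rather than a per-value random draw, a minimal sketch in plain pandas (not dask) could build all the replacements at once and shuffle them. The function name fill_na_split and its parameters are made up for illustration, and the column is cast to object on the assumption that the fill values are strings.

```python
import numpy as np
import pandas as pd


def fill_na_split(series, value1, value2, frac_value1=2.0 / 3, seed=None):
    """Fill NaNs so that a fixed fraction gets value1 and the rest value2."""
    rng = np.random.default_rng(seed)
    mask = series.isna().to_numpy()
    n_missing = int(mask.sum())
    n_value1 = int(round(n_missing * frac_value1))
    # Build exactly n_value1 copies of value1 and fill the remainder with value2.
    fill = np.array([value1] * n_value1 + [value2] * (n_missing - n_value1),
                    dtype=object)
    # Shuffle in place so the two values land on random NaN positions.
    rng.shuffle(fill)
    out = series.astype(object)  # astype returns a new object-dtype copy
    out.loc[mask] = fill
    return out


# Example: 300 NaNs plus a few existing values.
s = pd.Series([np.nan] * 300 + ['x'] * 10)
filled = fill_na_split(s, 'value1', 'value2', seed=0)
```

With 300 NaNs this assigns 'value1' to 200 of them and 'value2' to the other 100, leaving existing values untouched.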

Related

Code to missing values if all Items of an Item battery have value 1

I have a large data set in Stata.
There are several item batteries in this data set.
One item battery consists of 8 items (v1 - v8), each scaled from 1 to 7.
I want to recode the items as missing in every observation where all eight items take the value 1.
That is, if v1 to v8 all have the value "1", all rows to which this applies are to be replaced with missings.
I know how to code missing values with the if qualifier, but the selection with the complex condition causes me difficulties.
The code for R would probably solve this via rowSums, but I need the solution for Stata.
I assume in R it would work like this:
df[rowSums(df[,c("v1", ... "v8")]!=1)==0, c("v1", .... "v8")] <- NA
If I understood this correctly, you want
egen rowall = concat(v1-v8)
mvdecode v1-v8 if rowall == 8 * "1", mv(1)
That is, all instances of 1 in v1-v8 are recoded as missing if and only if the values of those variables are all 1 in a given observation.

"unexpected character after line continuation character" error; also, how to keep only the rows that follow floating-point rows in a pandas dataframe

I have a dataset in which I want to keep only the row just after each floating-point value row and remove the other rows.
For example, a column of the dataframe looks like this:
17.3
Hi Hello
Pranjal
17.1
[aasd]How are you
I am fine[:"]
Live Free
So in this I want to preserve:
Hi Hello
[aasd]How are you
and remove the rest. I tried it with the following code, but an error showed up saying "unexpected character after line continuation character". Also I don't know if this code will solve my purpose
# Dropping extra rows
for ind in data.index:
    if re.search((([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?, ind):
        ind+=1
    else:
        data.drop(ind)
Your regex has to be a string; you can't just write it like that.
if re.search(r'(([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?', ind):
Edit: actually, I think the rest of your code is wrong too.
What you really want is something more like this:
import pandas as pd
l = ['17.3',
'Hi Hello',
'Pranjal',
'17.1',
'[aasd]How are you',
'I am fine[:"]',
'Live Free']
data = pd.DataFrame(l, columns=['col'])
data[data.col.str.match(r'\d+\.\d*').shift(1) == True]
Logic:
If you have a dataframe with a column that is all string type (this won't work for a column that mixes decimal and string types), you can find the decimal entries with the regex '\d+\.\d*' (use '\d+\.?\d*' to match integers as well). Shifting this mask by one gives you the entries right after the matches. Use that to select the rows you want in your dataframe.
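For a column that mixes real floats and strings, one possible variant is to detect numeric rows with pd.to_numeric instead of a regex. This is only a sketch under that assumption; the helper name rows_after_numbers is made up for illustration.

```python
import pandas as pd


def rows_after_numbers(frame, column):
    """Keep only the rows that come directly after a numeric row."""
    # Anything that cannot be parsed as a number becomes NaN.
    is_number = pd.to_numeric(frame[column], errors="coerce").notna()
    # Shift the mask down by one row; the first row has no predecessor.
    follows_number = is_number.shift(1, fill_value=False)
    return frame[follows_number]


data = pd.DataFrame({"col": [17.3,
                             "Hi Hello",
                             "Pranjal",
                             "17.1",
                             "[aasd]How are you",
                             'I am fine[:"]',
                             "Live Free"]})
kept = rows_after_numbers(data, "col")
```

Here kept holds the rows "Hi Hello" and "[aasd]How are you". Note that strings such as "inf" also parse as numbers under pd.to_numeric.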

Join strings from the same column in pandas using a placeholder condition

I have a series of data that I need to filter.
The df consists of one column of information in which the groups are separated by a row with the value NaN.
I would like to join all of the rows that occur before each NaN into a single string in a new column.
For example my data looks something like:
the
car
is
red
NaN
the
house
is
big
NaN
the
room
is
small
My desired result is
B
the car is red
the house is big
the room is small
Thus far, I am approaching this problem by building a function and applying it to each row in my dataframe. See below for my code so far.
def joinNan(row):
    newRow = []
    placeholder = 'NaN'
    if row is not placeholder:
        newRow.append(row)
    if row == placeholder:
        return newRow
df['B'] = df.loc[0].apply(joinNan)
For some reason, the first row of my data is being used as the index or column title, which is why I am using 'loc[0]' here instead of a specific column name.
If there is a more straightforward way to approach this by iterating directly over the column, I am open to that suggestion too.
For now, I am trying to reach my desired solution and have not found any other similar case on Stack Overflow or the web in general to help me.
I think you need isna to test for NaNs, then create a helper Series with cumsum and aggregate with join via groupby:
df=df.groupby(df[0].isna().cumsum())[0].apply(lambda x: ' '.join(x.dropna())).to_frame('B')
#for older versions of pandas
df=df.groupby(df[0].isnull().cumsum())[0].apply(lambda x: ' '.join(x.dropna())).to_frame('B')
Another solution is to filter out all NaNs before the groupby:
mask = df[0].isna()
#mask = df[0].isnull()
df['g'] = mask.cumsum()
df = df[~mask].groupby('g')[0].apply(' '.join).to_frame('B')
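As a self-contained illustration of the cumsum idea on the sample data from the question, the sketch below assumes the single column is labelled 0 and that the separators are real NaN values rather than the string 'NaN'.

```python
import numpy as np
import pandas as pd

words = ["the", "car", "is", "red", np.nan,
         "the", "house", "is", "big", np.nan,
         "the", "room", "is", "small"]
df = pd.DataFrame({0: words})

# True on the separator rows.
mask = df[0].isna()
# Each NaN bumps the counter, so every block of words gets its own group id.
group_id = mask.cumsum()

# Drop the separators, then join the words of each group with spaces.
joined = df.loc[~mask, 0].groupby(group_id[~mask]).apply(" ".join)
result = joined.reset_index(drop=True).to_frame("B")
```

result then has one row per sentence in column B.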

Compare two dictionaries: one with a list of float values per key, the other with a single value per key (Python)

I have a query sequence that I blasted online using NCBIWWW.qblast. In my XML BLAST result file I obtained a list of hits (i.e. gi| identifiers) for the query sequence. Each hit (gi|) has multiple HSPs. I made a dictionary my_dict1 with the gi| identifiers as keys and appended the bit scores as values, so there are multiple values for each key.
my_dict1 = {
    'gi|1002819492|': [437.702, 384.47, 380.86, 380.86, 362.83],
    'gi|675820360|': [2617.97, 2614.37, 122.112],
    'gi|953764029|': [414.258, 318.66, 122.112, 86.158],
    'gi|675820410|': [450.653, 388.08, 386.27] }
Then I looked for max value in each key using:
for key, value in my_dict1.items():
    max_value = max(value)
And made a second dictionary my_dict2:
my_dict2 = {
    'gi|1002819492|': 437.702,
    'gi|675820360|': 2617.97,
    'gi|953764029|': 414.258,
    'gi|675820410|': 450.653 }
I want to compare both dictionaries so that I can extract the HSP with the highest bit score. I am also including other parameters like query coverage and identity percentage (not shown here). The end goal is to get the best gi| with the highest bit score, coverage and identity percentage.
I tried many things to compare both dictionaries, like this:
First code:
matches[]
if my_dict1.keys() not in my_dict2.keys():
matches[hit_id] = bit_score
else:
matches = matches[hit_id], bit_score
Second code:
if hit_id not in matches.keys():
matches[hit_id]= bit_score
else:
matches = matches[hit_id], bit_score
Third code:
intersection = set(set(my_dict1.items()) & set(my_dict2.items()))
However, I always end up with 2 types of errors:
1 ) TypeError: list indices must be integers, not unicode
2 ) ... float not iterable...
Please I need some help and guidance. Thank you very much in advance for your time. Best regards.
It's not clear what you're trying to do. What is hit_id? What is bit_score? It looks like your second dict is always going to have the same keys as your first if you're creating it by pulling the max value for each key of the first dict.
You say you're trying to compare them, but don't really state what you're actually trying to do. Find those with values under a certain max? Find those with the highest max?
Your first code doesn't work because I'm assuming you're trying to use a dict key value as an index to matches, which you define as a list. That's probably where your first error is coming from, though you haven't given the lines where the error is actually occurring.
See in-code comments below:
# First off, this needs to be a dict.
matches = {}
# This will never happen if you've created these dicts as you stated.
if my_dict1.keys() not in my_dict2.keys():
    matches[hit_id] = bit_score # Not clear what bit_score is?
else:
    # Also not sure what you're trying to do here. This will assign a tuple
    # to matches with whatever the value of matches[hit_id] is and bit_score.
    matches = matches[hit_id], bit_score
Regardless, we really need more information and the full code to figure out your actual goal and what's going wrong.
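If the goal turns out to be simply "highest bit score per hit, then the hit with the highest score overall", which is an assumption since the question does not say so, a minimal sketch could look like this. The name best_hit is made up for illustration.

```python
my_dict1 = {
    'gi|1002819492|': [437.702, 384.47, 380.86, 380.86, 362.83],
    'gi|675820360|': [2617.97, 2614.37, 122.112],
    'gi|953764029|': [414.258, 318.66, 122.112, 86.158],
    'gi|675820410|': [450.653, 388.08, 386.27],
}

# Highest bit score for each hit (this reproduces my_dict2 from the question).
my_dict2 = {hit_id: max(scores) for hit_id, scores in my_dict1.items()}

# Hit whose best HSP has the highest bit score overall.
best_hit = max(my_dict2, key=my_dict2.get)
best_score = my_dict2[best_hit]
```

Coverage and identity percentage are not part of this sketch, because the question does not show how they are stored.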

Python Pandas: How to get column values to be a list?

I have a dataframe with one column and 20 rows. I want to use
dataframe[column].apply(lambda x : some_func(x))
to get second column. The function returns a list. Pandas is not giving me what I want. It is filling the second column with NaN instead of the list items that some_func() is returning.
Is there a clever or simple way to fix this?
It seems that the error was caused because I forgot to include:
axis = 1
My full line of code should have been:
dataframe[column].apply(lambda x : some_func(x), axis = 1)
You can just assign it like a dictionary:
dataframe['column2'] = dataframe['column1'].apply(lambda x : some_func(x))
Simple as that.
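A small runnable illustration of that assignment, with a made-up some_func standing in for the real function (its behaviour here is purely an assumption):

```python
import pandas as pd


def some_func(value):
    # Hypothetical stand-in: returns a list for every input value.
    return [value, value * 2]


dataframe = pd.DataFrame({"column1": [1, 2, 3]})
# Each cell of the new column holds the list returned for that row.
dataframe["column2"] = dataframe["column1"].apply(some_func)
```

The new column has object dtype, with one list per row.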