Split if two consecutive characters are not same - python-2.7

I have an input string like
a = '4433555555666666'
I want the values to be separated wherever a character is not the same as the next one.
In this case:
44, 33, 555555, 666666
I'm new to Python, so I don't know how to deal with this. I have tried, but only the first group comes out right, i.e.
['44', '', '555555666666']
Also, what if two consecutive character groups are the same?
i.e.
a = 'chchdfch'
then 'ch' should be replaced, giving
a = '**df*'

You can use itertools.groupby()
[''.join(v) for k, v in itertools.groupby(a)]
Demo:
>>> import itertools
>>> a = '4433555555666666'
>>> [''.join(value) for key, value in itertools.groupby(a)]
['44', '33', '555555', '666666']
This construct is called a list comprehension: a compact way of building a list by iterating over the elements one at a time.
Another way of representing this is:
>>> for k, v in itertools.groupby(a):
...     print k, v
...
4 <itertools._grouper object at 0x100b90710>
3 <itertools._grouper object at 0x100b90750>
5 <itertools._grouper object at 0x100b90710>
6 <itertools._grouper object at 0x100b90750>
>>> for k, v in itertools.groupby(a):
...     print k, "".join(v)
...
4 44
3 33
5 555555
6 666666
>>>
If you only need the groups, just ignore the key k that the iterator generates.
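For reference, here is a minimal sketch of the same grouping written as an explicit loop, plus the second part of the question, which the answer above does not cover. It assumes that part simply means "replace every 'ch' with '*'"; the names groups, b and masked are made up for this example. It is written to run on both Python 2.7 and Python 3.

```python
import itertools

a = '4433555555666666'

# Explicit-loop equivalent of the list comprehension above.
groups = []
for key, run in itertools.groupby(a):
    # run is an iterator over one stretch of equal characters
    groups.append(''.join(run))

print(groups)

# Second part of the question, read as "replace every 'ch' with '*'".
b = 'chchdfch'
masked = b.replace('ch', '*')
print(masked)
```

If the intended rule for 'ch' is something else (for example, only replacing a group when it repeats back to back), str.replace alone would not be enough.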


Use regular expression to extract elements from a pandas data frame

From the following data frame:
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
My ultimate goal is to extract the letters a, b or c (as string) in a pandas series. For that I am using the .findall() method from the re module, as shown below:
# import the module
import re
# define the patterns
pat = 'a|b|c'
# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)
The problem is that the output i.e. the letters a, b or c, in each row, will be present in a list (of a single element), as shown below:
Out[301]:
0 [a]
1 [b]
2 [c]
3 [a]
While I would like to have the letters a, b or c as string, as shown below:
0 a
1 b
2 c
3 a
I know that if I combine re.search() with .group() I can get a string, but if I do:
df['col1'].str.search(pat).group()
I will get the following error message:
AttributeError: 'StringMethods' object has no attribute 'search'
Using .str.split() won't do the job because, in my original dataframe, I want to capture strings that might contain the delimiter (e.g. I might want to capture a-b)
Does anyone know a simple solution for that, perhaps avoiding iterative operations such as a for loop or list comprehension?
Use extract with capturing groups:
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
result = df['col1'].str.extract('(a|b|c)')
print(result)
Output
0
0 a
1 b
2 c
3 a
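Note that the output above is a one-column DataFrame (hence the 0 header). As a small sketch, passing expand=False to str.extract with a single capturing group should give a plain Series of strings instead, which is what the question asks for. The name letters is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']})

# With one capturing group and expand=False, the result is a Series
# rather than a one-column DataFrame.
letters = df['col1'].str.extract('(a|b|c)', expand=False)

print(type(letters).__name__)
print(letters.tolist())
```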
To fix your code, take the first element of each list with .str[0]:
pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]:
0 a
1 b
2 c
3 a
Name: col1, dtype: object
Simply try str.split(), like this: df["col1"].str.split("-", n=1, expand=True), and keep the first resulting column:
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
df['col1'] = df["col1"].str.split("-", n=1, expand=True)[0]  # column 0 holds the text before the first "-"
print(df.head())
print(df.head())
Output:
col1
0 a
1 b
2 c
3 a

Reading In Integers in Python

So, my question is simple; I'm just struggling with the syntax here. I need to read in a set of integers: 3, 11, 2, 4, 4, 5, 6, 10, 8, -12. I want to place those integers in a list as I read them. The first integer, n, is the size of the n x n array in which the rest are presented, so if n = 3, I will be passed something like 3 \n 11 2 4 \n 4 5 6 \n 10 8 -12 (\n symbolizing a new line in the input file).
n = int(raw_input().strip())
a = []
for a_i in xrange(n):
    value = int(raw_input().strip())
    a.append(value)
print(a)
I receive this error from the above code:
value = int(raw_input().strip())
ValueError: invalid literal for int() with base 10: '11 2 4'
The actual challenge can be found here, https://www.hackerrank.com/challenges/diagonal-difference .
I have already completed this in Java and C++ and am simply trying to do it in Python now, but I'm not good at Python. If someone wants to (they don't have to), I'd like to see the proper way to read in an entire line, say " 11 2 4 ", create a new list out of that line, and add it to an already existing list. Then all I have to do is look up the desired index, as in list[ desiredInternalList[ ] ].
You can split the string at white space and convert the entries into integers.
This gives you one list:
for a_i in xrange(n):
    a.extend([int(x) for x in raw_input().split()])
and this a list of lists:
for a_i in xrange(n):
    a.append([int(x) for x in raw_input().split()])
You get this error because each input line holds several integers, and int() cannot parse a whole line such as '11 2 4' at once. To handle this, you may use this code:
n = int(raw_input().strip())
a = []
while len(a)< n*n:
    x = raw_input().strip()
    x = map(int, x.split())
    a.extend(x)
print(a)
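As a rough sketch of the list-of-lists approach, here is the same parsing done on a list of lines instead of raw_input(), so it can be run without typing input. The helper read_matrix and the sample lines are made up for this example; the diagonal sums at the end are only there to show the matrix[i][j] indexing that the linked challenge needs.

```python
def read_matrix(lines):
    """Parse a size line followed by n rows of space-separated integers."""
    n = int(lines[0].strip())
    # One inner list per row; split() with no argument handles extra spaces.
    return [[int(x) for x in line.split()] for line in lines[1:n + 1]]

# Hypothetical input, matching the example in the question.
sample = ["3", " 11 2 4 ", "4 5 6", "10 8 -12"]
matrix = read_matrix(sample)

print(matrix)
print(matrix[1][2])  # row 1, column 2

size = len(matrix)
primary = sum(matrix[i][i] for i in range(size))
secondary = sum(matrix[i][size - 1 - i] for i in range(size))
print(abs(primary - secondary))
```

With real input, the list of lines could come from sys.stdin.readlines() or from repeated raw_input() calls.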

Find empty or NaN entry in Pandas Dataframe

I am trying to search through a Pandas Dataframe to find where it has a missing entry or a NaN entry.
Here is a dataframe that I am working with:
cl_id a c d e A1 A2 A3
0 1 -0.419279 0.843832 -0.530827 text76 1.537177 -0.271042
1 2 0.581566 2.257544 0.440485 dafN_6 0.144228 2.362259
2 3 -1.259333 1.074986 1.834653 system 1.100353
3 4 -1.279785 0.272977 0.197011 Fifty -0.031721 1.434273
4 5 0.578348 0.595515 0.553483 channel 0.640708 0.649132
5 6 -1.549588 -0.198588 0.373476 audio -0.508501
6 7 0.172863 1.874987 1.405923 Twenty NaN NaN
7 8 -0.149630 -0.502117 0.315323 file_max NaN NaN
NOTE: The blank entries are empty strings - this is because there was no alphanumeric content in the file that the dataframe came from.
If I have this dataframe, how can I find a list with the indexes where the NaN or blank entry occurs?
np.where(pd.isnull(df)) returns the row and column indices where the value is NaN:
In [152]: import numpy as np
In [153]: import pandas as pd
In [154]: np.where(pd.isnull(df))
Out[154]: (array([2, 5, 6, 6, 7, 7]), array([7, 7, 6, 7, 6, 7]))
In [155]: df.iloc[2,7]
Out[155]: nan
In [160]: [df.iloc[i,j] for i,j in zip(*np.where(pd.isnull(df)))]
Out[160]: [nan, nan, nan, nan, nan, nan]
Finding values which are empty strings could be done with applymap:
In [182]: np.where(df.applymap(lambda x: x == ''))
Out[182]: (array([5]), array([7]))
Note that using applymap requires calling a Python function once for each cell of the DataFrame. That could be slow for a large DataFrame, so it would be better if you could arrange for all the blank cells to contain NaN instead so you could use pd.isnull.
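A minimal sketch of that suggestion, assuming the blanks really are empty strings: replace them with NaN once, then a single isnull() call finds both kinds of missing cell. The tiny DataFrame here is a made-up stand-in for the A2/A3 columns in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data: one blank, two NaN rows.
df = pd.DataFrame({'A2': [1.5, '', np.nan], 'A3': [-0.2, 1.1, np.nan]})

# Turn empty strings into NaN so one vectorized check covers everything.
cleaned = df.replace('', np.nan)

# Row and column positions of every missing cell, in row-major order.
rows, cols = np.where(pd.isnull(cleaned).values)
locations = [(int(r), int(c)) for r, c in zip(rows, cols)]
print(locations)
```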
Try this:
df[df['column_name'] == ''].index
and for NaNs you can try:
pd.isna(df['column_name'])
Check if the columns contain NaN using .isnull() and check for empty strings using .eq(''), then join the two together using the bitwise OR operator |.
Sum along axis 0 to find the columns with missing data, then sum along axis 1 to find the index locations of the rows with missing data.
missing_cols, missing_rows = (
    (df2.isnull().sum(x) | df2.eq('').sum(x))
    .loc[lambda x: x.gt(0)].index
    for x in (0, 1)
)
>>> df2.loc[missing_rows, missing_cols]
A2 A3
2 1.10035
5 -0.508501
6 NaN NaN
7 NaN NaN
I've resorted to
df[ (df[column_name].notnull()) & (df[column_name]!=u'') ].index
lately. That selects the rows that are neither null nor an empty string; negate the condition to get the null and empty-string cells in one go.
In my opinion, don't waste time: just replace the empty values with NaN, then search for all NaN entries. (This is reasonable because empty values are missing values anyway.)
import numpy as np # to use np.nan
import pandas as pd # to use replace
df = df.replace(r'^\s*$', np.nan, regex=True) # turn empty or whitespace-only strings into NaN
nan_values = df[df.isna().any(axis=1)] # to get all rows with NaN
nan_values # view df with NaN rows only
Partial solution: for a single string column
tmp = df['A1'].fillna(''); isEmpty = tmp==''
gives a boolean Series that is True where there are empty strings or NaN values.
You can also do something like this:
text_empty = df['column name'].str.len() == 0
df.loc[text_empty].index
The result is the index of the rows whose value in that column is an empty string.
Another option, covering cases where there might be several spaces, is to use Python's str.isspace() method.
df[df.col_name.apply(lambda x: x.isspace() == False)] # returns only the rows that are not whitespace-only
also excluding NaN values:
df[(df.col_name.apply(lambda x: x.isspace() == False)) & (~df.col_name.isna())]
To obtain all the rows that contain an empty cell in a particular column:
DF_new_row=DF_raw.loc[DF_raw['columnname']=='']
This gives the subset of DF_raw that satisfies the condition.
You can use string methods with a regex to find (here, count) cells that contain no word characters:
df[~df.column_name.str.contains(r'\w')].column_name.count()
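Several of the answers above handle only one of NaN, empty strings or whitespace-only strings. Here is a small sketch that treats all three as missing for a single column, building on the fillna('') idea. The Series col and its values are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical column with an empty string, a whitespace-only string and a NaN.
col = pd.Series(['text76', '', '   ', np.nan, 'audio'])

# NaN becomes '', then stripping whitespace leaves '' for every "blank" cell.
is_missing = col.fillna('').astype(str).str.strip().eq('')

missing_index = is_missing[is_missing].index.tolist()
print(missing_index)
```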

Why Python doesn't iterate over repeated list into one list

Right now I'm working with Python 2.7 and I have a small issue.
To explain: I need to get the index of a list that is inside another list:
students=[["A","B"],["A","B"]]
for m in students:
    if "A" in m and "B" in m:
        print m
When I run this code I got this:
['A', 'B']
['A', 'B']
That seems right: it iterates over students and prints ['A', 'B'] twice because it's repeated. But if I run this code:
for m in students:
    if "A" in m and "B" in m:
        print students.index(m)
it prints this:
0
0
It seems that it only iterates over the first element; to me, the correct output should be:
0
1
Could anyone explain why Python does that and how to fix it? Thank you.
students.index(m) returns the first index, i, where students[i] equals m.
Since students contains the same item twice, 0 is returned both times.
So the loop is iterating over both items in students, but since students[0] == students[1], when m is bound to students[1], students.index(students[1]) still returns 0.
If you simply want to report the current index of the loop, then use enumerate:
students = [["A","B"],["A","B"]]
for i, m in enumerate(students):
    if "A" in m and "B" in m:
        print i
prints
0
1
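To make the difference concrete, here is a short sketch that shows list.index() returning the first equal element and then collects every matching position with enumerate. The names first_match and matches are made up for this example; the code runs on both Python 2.7 and Python 3.

```python
students = [["A", "B"], ["A", "B"]]

# index() compares by equality, so the second item is "found" at position 0.
first_match = students.index(students[1])
print(first_match)

# enumerate() gives the real position of each item during iteration.
matches = [i for i, m in enumerate(students) if "A" in m and "B" in m]
print(matches)
```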

Enumerate list values into a list of dictionaries

I have a list of dictionaries, and I'm trying to assign dictionary key:value pairs based on the values of other variables stored in lists. I'd like to assign the "ith" value of each variable list to the ith dictionary in block_params_list, with the variable name (as a string) as the key. The problem is that while the code appropriately assigns the values (as demonstrated by "pprint(item)"), when the entire enumerate loop is finished, each item in "block_params_list" is equal to the value of the last item.
I'm at a loss to explain this behavior. Can someone help? Thanks!
'''empty list of dictionaries'''
block_params_list = [{}] * 5
'''variable lists to go into the dictionaries'''
ran_iti = [False]*2 + [True]*3
iti_len = [1,2,4,8,16]
trial_cnt = [5,10,15,20,25]
'''the loops'''
param_list= ['iti_len','trial_cnt','ran_iti']#key values, also variable names
for i,item in enumerate(block_params_list):
    for param in param_list:
        item[param] = eval(param)[i]
    pprint(item) #check what each item value is after assignment
pprint(block_params_list) #prints a list of dictionaries that are
                          #only equal to the very last item assigned
You've hit a common 'gotcha' in Python, on your first line of code:
# Create a list of five empty dictionaries
>>> block_params_list = [{}] * 5
The instruction [{}] * 5 is equivalent to doing this:
>>> d = {}
>>> [d, d, d, d, d]
The list contains five references to the same dictionary. You say "each item in 'block_params_list' is equal to the value of the last item" - that's an illusion, there's effectively only one item in "block_params_list" and you are assigning to it, then looking at it, five times over through five different references to it.
You need to use a loop to create your list, to make sure you create five different dictionaries:
block_params_list = []
for i in range(5):
    block_params_list.append({})
or
block_params_list = [{} for i in range(5)]
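A quick way to check the difference yourself, as a sketch: the is operator tells you whether two list entries are the very same object. The names shared and distinct are made up for this example.

```python
shared = [{}] * 3                  # three references to ONE dictionary
distinct = [{} for _ in range(3)]  # three separate dictionaries

same_object_shared = shared[0] is shared[1]
same_object_distinct = distinct[0] is distinct[1]
print(same_object_shared)
print(same_object_distinct)

# Assigning through one entry shows up in every entry of the shared list,
# but only in the first entry of the distinct list.
shared[0]['test'] = 8
distinct[0]['test'] = 8
print(shared)
print(distinct)
```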
NB. You can safely do [1] * 5 for a list of numbers, or [True] * 5 for a list of True, or ['A'] * 5 for a list of character 'A'. The distinction is whether you end up changing the list, or whether you change a thing referenced by the list. Top level or second level.
E.g. making a list of numbers and assigning to it does this:
before:
nums = [1] * 3
list_start
entry 0 --> 1
entry 1 --> 1
entry 2 --> 1
list_end
nums[0] = 8
after:
list_start
entry 0 -xx 1
\-> 8
entry 1 --> 1
entry 2 --> 1
list_end
Whereas making a list of dictionaries the way you are doing, and assigning to it, does this:
before:
blocks = [{}] * 3
list_start
entry 0 --> {}
entry 1 --/
entry 2 -/
list_end
first_block = blocks[0]
first_block['test'] = 8
after:
list_start
entry 0 --> {'test':8}
entry 1 --/
entry 2 -/
list_end
In the first example, one of the references in the list has to change. You can't pull a number out of a list and change the number, you can only put a different number in the list.
In the second example, the list itself doesn't change at all, you're assigning to a dictionary referenced by the list. So while it feels like you are updating every element in the list, you really aren't, because the list doesn't "have dictionaries in it", it has references to dictionaries in it.
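Putting it together, here is one possible way to build the asker's list so that each entry is its own dictionary. This is only a sketch, not the original code: it uses zip instead of eval to pair up the ith values, which is a swapped-in technique, and the loop variable names are made up.

```python
from pprint import pprint

# Variable lists from the question.
ran_iti = [False] * 2 + [True] * 3
iti_len = [1, 2, 4, 8, 16]
trial_cnt = [5, 10, 15, 20, 25]

# A new dict literal is evaluated on every iteration, so nothing is shared.
block_params_list = [
    {'iti_len': length, 'trial_cnt': count, 'ran_iti': ran}
    for length, count, ran in zip(iti_len, trial_cnt, ran_iti)
]

pprint(block_params_list)
```

The original enumerate loop with eval should also work once block_params_list is created with [{} for i in range(5)]; zip just avoids looking variables up by name.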