Split Pandas Column by values that are in a list - python-2.7

I have three lists that look like this:
age = ['51+', '21-30', '41-50', '31-40', '<21']
cluster = ['notarget', 'cluster3', 'allclusters', 'cluster1', 'cluster2']
device = ['htc_one_2gb','iphone_6/6+_at&t','iphone_6/6+_vzn','iphone_6/6+_all_other_devices','htc_one_2gb_limited_time_offer','nokia_lumia_v3','iphone5s','htc_one_1gb','nokia_lumia_v3_more_everything']
I also have a column in a DataFrame that looks like this:
campaign_name
0 notarget_<21_nokia_lumia_v3
1 htc_one_1gb_21-30_notarget
2 41-50_htc_one_2gb_cluster3
3 <21_htc_one_2gb_limited_time_offer_notarget
4 51+_cluster3_iphone_6/6+_all_other_devices
I want to split the column into three separate columns based on the values in the above lists. Like so:
     age    cluster   device
0    <21    notarget  nokia_lumia_v3
1    21-30  notarget  htc_one_1gb
2    41-50  cluster3  htc_one_2gb
3    <21    notarget  htc_one_2gb_limited_time_offer
4    51+    cluster3  iphone_6/6+_all_other_devices
My first thought was to do a simple test like this:
ages_list = []
for i in age:
    if i in df['campaign_name'][0]:
        ages_list.append(i)
print ages_list
>>> ['<21']
I was then going to convert ages_list to a Series and combine it with the remaining two to get the end result above, but I assume there is a more Pythonic way of doing it?

The idea is to build a regular expression from the values you already have. For example, to capture any value from your age list you can join the values with '|', as in '|'.join(age), and do the same for cluster and device.
Some values need special care because they contain a + sign ('51+' in age, 'iphone_6/6+_...' in device), and + means "one or more" in a regex. Escaping each value, for example with re.escape, makes the + match literally. Order matters too: alternation takes the first alternative that matches, so 'htc_one_2gb' would win over 'htc_one_2gb_limited_time_offer' unless longer values are tried first.
import re
import pandas as pd
df = pd.DataFrame({'campaign_name' : ['notarget_<21_nokia_lumia_v3' , 'htc_one_1gb_21-30_notarget' , '41-50_htc_one_2gb_cluster3' , '<21_htc_one_2gb_limited_time_offer_notarget' , '51+_cluster3_iphone_6/6+_all_other_devices'] })
def make_pattern(values):
    # escape regex metacharacters and try longer values first
    return '|'.join(re.escape(v) for v in sorted(values, key=len, reverse=True))
def split_df(row):
    campaign_name = row['campaign_name']
    row['age'] = re.findall(make_pattern(age), campaign_name)[0]
    row['cluster'] = re.findall(make_pattern(cluster), campaign_name)[0]
    row['device'] = re.findall(make_pattern(device), campaign_name)[0]
    return row
df.apply(split_df, axis=1)
If you want to drop the original column you can do this:
df.apply(split_df, axis=1).drop('campaign_name', axis=1)
Here I'm assuming that every row matches one value from each list; if that is not the case, add a check before taking [0].
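The same split can also be done without a row-wise apply. The following is a minimal sketch, assuming pandas' str.extract with a single capturing group; build_pattern is a hypothetical helper name introduced only for this illustration. It escapes each value and tries longer values first so that the overlapping device names resolve to the longest match.

```python
import re
import pandas as pd

age = ['51+', '21-30', '41-50', '31-40', '<21']
cluster = ['notarget', 'cluster3', 'allclusters', 'cluster1', 'cluster2']
device = ['htc_one_2gb', 'iphone_6/6+_at&t', 'iphone_6/6+_vzn',
          'iphone_6/6+_all_other_devices', 'htc_one_2gb_limited_time_offer',
          'nokia_lumia_v3', 'iphone5s', 'htc_one_1gb',
          'nokia_lumia_v3_more_everything']


def build_pattern(values):
    # Hypothetical helper: escape each value, put longer values first,
    # and wrap the alternation in one capturing group for str.extract.
    ordered = sorted(values, key=len, reverse=True)
    return '(' + '|'.join(re.escape(v) for v in ordered) + ')'


df = pd.DataFrame({'campaign_name': [
    'notarget_<21_nokia_lumia_v3',
    'htc_one_1gb_21-30_notarget',
    '41-50_htc_one_2gb_cluster3',
    '<21_htc_one_2gb_limited_time_offer_notarget',
    '51+_cluster3_iphone_6/6+_all_other_devices',
]})

# One vectorized extract per target column; rows without a match get NaN.
for name, values in [('age', age), ('cluster', cluster), ('device', device)]:
    df[name] = df['campaign_name'].str.extract(build_pattern(values), expand=False)

result = df.drop('campaign_name', axis=1)
print(result)
```

Rows that match none of the values end up as NaN instead of raising an IndexError, which may or may not be what you want.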

filtering random objects in django

How can I filter 12 random objects from a model in Django?
I tried the following, but it does not work and just returns one object.
max = product.objects.aggregate(id = Max('id'))
max_p = int(max['id'])
l = []
for s in range(1 , 13):
    l.append(random.randint(1 , max_p))
for i in l:
    great_proposal = product.objects.filter(id=i)
products = product.objects.all().order_by('-id')[:50]
great_proposal1 = random.sample(list(products) , 12)
Hi. It worked with this code!
Try this:
product.objects.order_by('?')[:12]
The '?' will "sort" randomly and "[:12]" will get only 12 objects.
I'm pretty sure the code is correct, but great_proposal is a single variable that gets overwritten on every iteration rather than a list, so you end up with only the last result.
Try:
result_array = []
for i in l:
    result_array.append(product.objects.filter(id=i))
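One more thing to watch with the randint approach is that it can draw the same id twice, or an id that no longer exists. Here is a minimal sketch of sampling distinct ids in plain Python, with no Django involved. pick_random_ids is a hypothetical helper and existing_ids is stand-in data; in a real project the ids could presumably come from something like product.objects.values_list('id', flat=True), and the chosen ones could be passed to product.objects.filter(id__in=chosen).

```python
import random


def pick_random_ids(existing_ids, k=12):
    # Hypothetical helper: sample k distinct ids, or all of them if fewer exist.
    ids = list(existing_ids)
    return random.sample(ids, min(k, len(ids)))


# Stand-in for the ids a query would return; the gaps mimic deleted rows.
existing_ids = [1, 2, 3, 5, 8, 9, 11, 14, 15, 16, 20, 21, 25, 30, 31]

chosen = pick_random_ids(existing_ids)
print(chosen)
```

Because random.sample draws without replacement from ids that actually exist, the result always has 12 distinct, valid ids (or fewer if the table is smaller).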

Use regular expression to extract elements from a pandas data frame

From the following data frame:
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
My ultimate goal is to extract the letters a, b or c (as string) in a pandas series. For that I am using the .findall() method from the re module, as shown below:
# import the module
import re
# define the patterns
pat = 'a|b|c'
# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)
The problem is that the output i.e. the letters a, b or c, in each row, will be present in a list (of a single element), as shown below:
Out[301]:
0 [a]
1 [b]
2 [c]
3 [a]
While I would like to have the letters a, b or c as string, as shown below:
0 a
1 b
2 c
3 a
I know that if I combine re.search() with .group() I can get a string, but if I do:
df['col1'].str.search(pat).group()
I will get the following error message:
AttributeError: 'StringMethods' object has no attribute 'search'
Using .str.split() won't do the job because, in my original dataframe, I want to capture strings that might contain the delimiter (e.g. I might want to capture a-b)
Does anyone know a simple solution for that, perhaps avoiding iterative operations such as a for loop or list comprehension?
Use extract with capturing groups:
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
result = df['col1'].str.extract('(a|b|c)')
print(result)
Output
0
0 a
1 b
2 c
3 a
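Note that the result above is a one-column DataFrame. If a Series is wanted, as in the question, a minimal sketch is to pass expand=False, which makes str.extract return a Series when the pattern has a single capturing group. The variable name letters is only illustrative.

```python
import pandas as pd

d = {'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)

# With one capturing group and expand=False, the result is a Series of strings.
letters = df['col1'].str.extract('(a|b|c)', expand=False)
print(type(letters).__name__)
print(letters.tolist())
```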
Fix your code
pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]:
0 a
1 b
2 c
3 a
Name: col1, dtype: object
Simply try str.split(), like this: df["col1"].str.split("-", n = 1, expand = True)[0]
import pandas as pd
d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
# expand=True returns a DataFrame; [0] keeps only the part before the first "-"
df['col1'] = df["col1"].str.split("-", n = 1, expand = True)[0]
print(df.head())
Output:
col1
0 a
1 b
2 c
3 a

Excel | Get all column/row names in which a specific text is as a list

It is difficult for me to describe the problem in the title, so excuse any misleading description.
The easiest way to describe what I need is with an example. I have a table like:
    A   B   C
1   x
2   x   x
3       x   x
What I want is a formula in one cell per column and per row that lists, for every x placed there, the corresponding row or column names. For the example:
            A     B     C
            1,2   2,3   3
1   A       x
2   A, B    x     x
3   B, C          x     x
The column and row names are not the same as Excel's own column letters and row numbers. It works with a simple IF (WENN in German Excel) formula for single cells (=IF(C3="x";C1)), but not for a range of them (=IF(C3:E3="x";C1:E1)). What could such a formula look like?
I found the answer to my problem. Excel provides the normal CONCATENATE function, but what is needed is something like a CONCATENATEIF (in German: verkettenwenn) function. The function verkettenwenn can be made available by adding a VBA module, based on a 2011 thread by ransi on ms-office-forum.net. The code for the German module looks like this:
Option Explicit
Public Function verkettenwenn(Bereich_Kriterium, Kriterium, Bereich_Verketten)
    Dim mydic As Object
    Dim L As Long
    Set mydic = CreateObject("Scripting.Dictionary")
    For L = 1 To Bereich_Kriterium.Count
        If Bereich_Kriterium(L) = Kriterium Then
            mydic(L) = Bereich_Verketten(L)
        End If
    Next
    verkettenwenn = Join(mydic.items, ", ")
End Function
With that module in place, one of the formulas for the example above looks like this: =verkettenwenn(C3:E3;"x";$C$1:$K$1)
The English code for a CONCATENATEIF function should probably be:
Option Explicit
Public Function CONCATENATEIF(Criteria_Area, Criterion, Concate_Area)
    Dim mydic As Object
    Dim L As Long
    Set mydic = CreateObject("Scripting.Dictionary")
    For L = 1 To Criteria_Area.Count
        If Criteria_Area(L) = Criterion Then
            mydic(L) = Concate_Area(L)
        End If
    Next
    CONCATENATEIF = Join(mydic.items, ", ")
End Function

Taking First Two Elements in List

I am trying to script a dynamic way to take only the first two elements in a list, and I am having some trouble. Below is a breakdown of what I have in my List.
Declaration:
Set List = CreateObject("Scripting.Dictionary")
List Contents:
List(0) = 0-0-0-0
List(1) = 0-1-0-0
List(2) = 0-2-0-0
Code so far:
for count = 0 To UBound(List) -1 step 1
//not sure how to return
next
What I currently have does not work.
Desired Return List:
0-0-0-0
0-1-0-0
You need to use the Items method of the Dictionary.
For example:
Dim a, i
a = List.Items
For i = 0 To List.Count - 1
MsgBox(a(i))
Next i
or if you just want the first 2:
For i = 0 To 1
MsgBox(a(i))
Next i
UBound() is for arrays, not dictionaries. You need to use the Count property of the Dictionary object.
' Show all dictionary items...
For i = 0 To List.Count - 1
MsgBox List(i)
Next
' Show the first two dictionary items...
For i = 0 To 1
MsgBox List(i)
Next

Removing duplicates from the data

I have already loaded 20 CSV files with:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
I combined all of those files into one:
all_data = do.call(rbind.fill, list_of_data)
The new table has a column called "Accession". After combining, many of the names (Accession) are repeated, and I would like to remove all of the duplicates.
Another problem is that some of those names are ALMOST the same: the name itself is identical, and only the dot and number that follow it differ.
Let me show you how it looks:
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
<-- = Same sample, different names. Should be treated as one. So just ignore dot and a number after.
Tried this one:
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))
Error in `$<-.data.frame`(`*tmp*`, "CleanedAccession", value = character(0)) :
You can use this command to both subset and rename the values:
subset(transform(all_data, Accession = sub("\\..*", "", Accession)),
       !duplicated(Accession))
   Accession
1 AT3G26450
2 AT5G44520
3 AT4G24770
4 AT2G37220
5 AT3G02520
6 AT5G05270
7 AT1G32060
8 AT3G52380
9 AT2G43910
10 AT2G19760
What about
df <- data.frame(Accession = c("AT3G26450.1",
                               "AT5G44520.2",
                               "AT4G24770.1",
                               "AT2G37220.2",
                               "AT3G02520.1",
                               "AT5G05270.1",
                               "AT1G32060.1",
                               "AT3G52380.1",
                               "AT2G43910.2",
                               "AT2G19760.1",
                               "AT3G26450.2"))
df[!duplicated(unlist(lapply(strsplit(as.character(df$Accession),
                                      ".", fixed = TRUE), "[", 1))), ]