Pandas - identify unique triplets from a df - python-2.7

I have a dataframe that represents unique items. Each item is uniquely identified by a set of varA, varB, and varC values (so each item has 0 to n values for varA, varB, and varC). My df has multiple rows per unique item, with various combinations of varA, varB, and varC.
The df looks like this (ID is unique per row, but it doesn't identify the unique item).
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'varA': ['a', 'd', 'a', 'm', 'Z'],
                   'varB': ['b', 'e', 'k', 'e', np.nan],
                   'varC': ['c', 'f', 'l', np.nan, 't']})
So in the df here, you can see that:
Rows 1 and 3 are the same item, with {varA: [a], varB: [b, k], varC: [c, l]}.
Rows 2 and 4 are also the same: {varA: [d, m], varB: [e], varC: [f]}.
I would like to identify every unique item, give them a unique id, and store their information.
The code I have written is terribly inefficient:
Step 1: I walk through each row of the dataframe and keep a list of each variable's values.
When all three variables are new, it's a new item and I give it an id.
When any of the variables is known, I store the new values in their respective lists and keep walking to the next row.
Step 2: Once I have walked the whole dataframe, I have two subsets:
one with a unique id,
the other without a unique id, but whose information can be found in the rows that have one, via varA, varB, or varC. So, rather inelegantly, I merge successively on each variable to find the unique id.
Result: I have the same df as at the start, but with a column of repeated unique identifiers.
This works well on 20,000 input rows with varA and varB. On 100,000 rows it runs very slowly and dies before the end (between Step 1 and Step 2), and I need it to run on 1,000,000 rows.
Any pandanique way of doing this?

You can use chained boolean indexing with duplicated (pd.Series.duplicated):
If you want to keep the first occurrence of each duplicated value:
myfilter = ~df.varA.duplicated(keep='first') & \
~df.varB.duplicated(keep='first') & \
~df.varC.duplicated(keep='first')
If you don't want to keep any occurrence of a duplicated value:
myfilter = ~df.varA.duplicated(keep=False) & \
~df.varB.duplicated(keep=False) & \
~df.varC.duplicated(keep=False)
Then you can, for example, give these an incremental uniqueID:
df.loc[myfilter, 'uniqueID'] = np.arange(myfilter.sum(), dtype='int')
df
ID varA varB varC uniqueID
0 1 a b c 0.0
1 2 d e f 1.0
2 3 a k l NaN
3 4 m e NaN NaN
4 5 Z NaN t 2.0
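Note that this filter only tags the first row of each group; rows 2 and 3 in the output above keep NaN. If every row needs to be mapped to its item, including the transitive matches the question describes, one option is a union-find pass over the three columns. A minimal sketch, not the answerer's method, assuming a reasonably recent pandas and a default RangeIndex on df:
import pandas as pd

# minimal union-find over row positions
parent = list(range(len(df)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# rows sharing a value in any of the three columns belong to the same item
for col in ['varA', 'varB', 'varC']:
    first_seen = {}
    for i, v in df[col].items():
        if pd.isna(v):
            continue
        if v in first_seen:
            union(i, first_seen[v])
        else:
            first_seen[v] = i

df['uniqueID'] = pd.factorize([find(i) for i in range(len(df))])[0]
On the sample df this yields uniqueID values [0, 1, 0, 1, 2], grouping rows 1/3 and 2/4 as the question wants, and it only needs a single pass per column.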

Related

Recode to missing values if all items of an item battery have the value 1

I have a large data set in Stata.
There are several item batteries in this data set.
One item battery consists of 8 items (v1 - v8), each scaled from 1 to 7.
I want to recode the items as missing for every observation in which all items take the value 1. That is, if v1 to v8 all have the value 1, every row to which this applies should be set to missing. I know how to code missing values with the if qualifier, but selecting on this complex condition is causing me difficulties.
In R this would probably be solved via rowSums; I assume it would work like this:
df[rowSums(df[, c("v1", ... "v8")] != 1) == 0, c("v1", ... "v8")] <- NA
But I need a solution for Stata.
If I understood this correctly, you want
egen rowall = concat(v1-v8)
mvdecode v1-v8 if rowall == 8 * "1", mv(1)
That is, all instances in v1-v8 of 1 are recoded as missing if and only if the values of those variables are all 1 in any observation.
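For comparison, the same logic sketched in pandas (assuming the data has been loaded into a DataFrame df with columns v1 through v8):
import numpy as np

cols = ['v%d' % i for i in range(1, 9)]  # v1 .. v8
all_ones = df[cols].eq(1).all(axis=1)    # rows where every item equals 1
df.loc[all_ones, cols] = np.nan          # recode those rows as missing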

Python run set.intersection with set of sets as input

I am working with biological datasets, going straight from the transcriptome (RNA) to finding certain protein sequences. I have a set of protein names for each dataset and want to find which are common to all datasets. Due to how the data is processed, I end up with one variable that contains all the subsets.
Due to how the set.intersection() command works, it requires at least 2 sets as input:
IDs = set.intersection(transc1, transc2)
However, I only have one input, colA, which contains 30 sets of 80 to 100 entries each. Here is what I have so far:
from glob import glob
import pandas

for file in glob('*_query.tsv'):  # input all 30 datasets, first column with protein IDs
    sources = file
    colnames = ['a', 'b', 'c', 'd', 'e', 'f']
    df = pandas.read_csv(sources, sep='\t', names=colnames)  # colnames headers for df construction
    colA = df.a.tolist()  # turn column a, the protein IDs, into a list
    IDs = set(colA)  # turn the list into a set
If I print IDs on each pass, the output is something like this (two of the sets shown):
set(['ID2', 'ID8', 'ID35', 'ID77', 'ID78', 'ID199', 'ID211'])
set(['ID1', 'ID5', 'ID8', 'ID88', 'ID105', 'ID205'])
At this point I get stuck. I can't get set.intersection() working with IDs as a set of sets. I also tried pandas.merge(*IDs), for which the syntax seemed to work, but the number of entries for comparison exceeded the maximum (12).
I wanted to use sets because, unlike with lists, it should be quick to find the common IDs between all the datasets. If there is a better way, I am all for it.
Help is much appreciated.
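One way to get unstuck is to collect the per-file sets into a list and unpack it with *, since set.intersection accepts any number of set arguments. A sketch along those lines, reusing the question's filename pattern and column names:
from glob import glob
import pandas

all_sets = []
for file in glob('*_query.tsv'):
    df = pandas.read_csv(file, sep='\t', names=['a', 'b', 'c', 'd', 'e', 'f'])
    all_sets.append(set(df.a))  # one set of protein IDs per dataset

common_IDs = set.intersection(*all_sets)  # IDs present in every dataset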

Reformat a dataframe based on final empty columns in python

I am working on scraping a table that has major and minor column names. When I do this, the table comes in having read both the column names and the column groups, so the column names are misaligned in the dataframe, like so (simplified):
unnamed1 unnamed2 unnamed3 Year Passing Rushing Receiving
2015 NA 200 60 NA NA NA
2014 NA 180 70 NA NA NA
My challenge is in shifting the column names so that 'Year' aligns over '2015' and so forth. The problem is then that the number of columns to shift does not remain constant from table to table (this is only one of many). My code at the moment looks like the following:
from pandas import read_html

table1 = read_html('http://www.pro-football-reference.com/players/T/TyexWi00.htm')
df = table1[0]
to_shift = len(df.dropna(how='all', axis=1).columns)  # number of non-empty columns to shift by
df2 = df.dropna(how='all', axis=1)  # drop the empty columns
df2.columns = df.columns[-to_shift:]  # shift all column names left by the number I've found
The problem is that for a player that has none of one stat (passing in this simple example), there are completely blank columns in the middle of the dataframe as well as at the right end, so that the code shifts too far. Is there a clean way of counting the columns from right to left until one is not completely empty?
Much thanks, and I hope my question is clear!
Is there a clean way of counting the columns from right to left until one is not completely empty?
from itertools import takewhile
len(df.columns) - len(list(takewhile(lambda col: df[col].isnull().all(), reversed(df.columns)))) - 1
Explanation:
takewhile returns all elements of a list (beginning at the front) until the given condition is False. When we call it on reversed(df.columns), we get all elements from the end. With df[col].isnull().all() we can check whether all entries of a column are null (a.k.a. nan). Consequently the above takewhile expression returns the suffix of columns which are completely 'empty'. By calculating total_length - bad_suffix_length - 1, we get the first index for which the condition is not satisfied.
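A toy illustration of the same arithmetic, with 0 standing in for an all-empty column:
from itertools import takewhile

vals = [5, 3, 0, 0]  # two 'empty' entries at the end
trailing = list(takewhile(lambda v: v == 0, reversed(vals)))
len(vals) - len(trailing) - 1  # -> 1, the index of the last non-empty entry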
Adding to the correct response from Michael Hoff (thank you very much!), the code has been edited to:
to_shift = len(df.columns) - len(list(takewhile(lambda col: df[col].isnull().all(), reversed(df.columns))))  # number of original columns to keep
df2 = df.drop(list(takewhile(lambda col: df[col].isnull().all(), reversed(df.columns))), axis=1)  # drop the empty right-side columns
colnames = df.columns[-to_shift:]
df2.columns = colnames

Based on a count value I have to create a number of rows; is that possible without a Java transformation?

Hey guys, does anyone know how to create a number of rows based on a count value, without using a Java transformation, in Informatica 9.6 (for a flat file)? Please help me with that.
You can create an auxiliary table with n rows for each possible count value between 1 and N:
1
2
2
3
3
3
...
N  (the last value appears in N rows)
Join this table to the source data using the count value as the key and you will get n copies of each source row.
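To make the idea concrete outside of Informatica, here is a small pandas sketch of the same auxiliary-table join (the column names are made up):
import pandas as pd

src = pd.DataFrame({'item': ['x', 'y'], 'count': [2, 3]})  # hypothetical source data

# auxiliary table: the value n appears n times, for n = 1 .. N
N = src['count'].max()
aux = pd.DataFrame({'n': [n for n in range(1, N + 1) for _ in range(n)]})

# joining on the count value yields n copies of each source row
result = src.merge(aux, left_on='count', right_on='n')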

Create prioritization log in Excel - Two lists

I am trying to create a prioritization list. I have 6 distinct values that the user inputs into a worksheet (by way of a VBA GUI). Excel calculates these values and creates a prioritization number. I need to list them (through one or more functions) in two tables. The problem comes into play when there are duplicate values (i.e. ProjA = 23 and ProjB = 23).
I don't care which one is listed first, but everything I have tried has secondary issues. There are two sheets in my workbook. The first is where the "raw" data is entered and the second is where I would like the two lists to be located. I do not want to use pivots for these lists.
Priority Number Proj Name
57 Project Alpha c
57 DUI Button Project
56 asdf
57 asdfsdfg
56 asdfasdf
56 Project Alpha a
56 Project Alpha b
18 Project BAS
List A would include a value range of 1-20, and List B would include a value range of 20-inf.
So, I want it to look like this:
Table 1 (High Priority) Table 2 (Low Priority)
Project BAS Project Alpha c
DUI Button Project
Etc.
Generally these open-ended questions aren't well received on Stack Overflow. You should make an attempt to demonstrate what you've tried so far, and exactly where you're becoming confused. Otherwise people are doing your work for you, rather than trying to solve specific errors.
However, because you're new here, I've made an exception.
You can begin solving your issue by looping through the priority list and copying the values into the appropriate lists. For starters, I assumed that priority values begin at cell A2 and project names begin at cell B2 (cells A1 and B1 would be the headers). I also assumed we're using a worksheet called Sheet1.
Now I need to know the length of the priority/project-name list. I can determine this by using an Integer called maxRows, calculated by .Cells(1, 1).End(xlDown).Row. This gives the number of rows in the regular table (including the header, A1).
I continue by setting the columns for each priority list (high/low). In my example, I set these to columns 3 and 4. Then I clear these columns to remove any values that already existed there.
Then I create some tracking variables that will help me determine how many items I've already added to each list (highPriorityCount and lowPriorityCount).
Finally, I loop through the original list and check if the priority value is low (< 20) or high (the else condition). The project names are placed into the appropriate column, using the tracking variables I created above.
Note: Anywhere that uses a 2 as an offset is due to the fact that I am accounting for the header cells (row 1).
Option Explicit
Sub CreatePriorityTables()
With Worksheets("Sheet1")
' Determine the length of the main table
Dim maxRows As Integer
maxRows = .Cells(1, 1).End(xlDown).Row
' Set the location of the priority lists
Dim highPriorityColumn As Integer
Dim lowPriorityColumn As Integer
highPriorityColumn = 3
lowPriorityColumn = 4
' Empty the priority lists
.Columns(highPriorityColumn).Clear
.Columns(lowPriorityColumn).Clear
' Create headers for priority lists
.Cells(1, highPriorityColumn).Value = "Table 1 (High Priority)"
.Cells(1, lowPriorityColumn).Value = "Table 2 (Low Priority)"
' Create some useful counts to track
Dim highPriorityCount As Integer
Dim lowPriorityCount As Integer
highPriorityCount = 0
lowPriorityCount = 0
' Loop through all values and copy into priority lists
Dim currentColumn As Integer
Dim i As Integer
For i = 2 To maxRows
' Determine column by priority value
If (.Cells(i, 1) < 20) Then
.Cells(lowPriorityCount + 2, lowPriorityColumn).Value = .Cells(i, 2)
lowPriorityCount = lowPriorityCount + 1
Else
.Cells(highPriorityCount + 2, highPriorityColumn).Value = .Cells(i, 2)
highPriorityCount = highPriorityCount + 1
End If
Next i
End With
End Sub
This should produce the expected behavior.