How do I add information to specific cells in a Pandas DataFrame based on the variable name of the column as well as the row? - python-2.7

Note: I have simplified this code and some background information quite a bit in order to pinpoint the exact issue. If you want me to explain additional aspects of this code, please ask in the comments.
I am currently trying to write code that aggregates two or more dictionaries and ultimately produces a .csv file with the following basic layout:
                TF-A       TF-A  TF-B       TF-B
ids  gene name  score sum  hits  score sum  hits  ...  gene description
id1  gene A     53.85      14    37.65      7     ...  stuff
id2  gene B     97.55      11    63.94      8     ...  stuff
id3  gene C     88.67      9     79.43      12    ...  stuff
id4  gene D     69.35      12    13.03      13    ...  stuff
...  ......     ......     ...   ...        ...   ...  ..........
idx  gene Z     49.32      8     84.03      10    ...  stuff
The dictionaries contain the names of genes as keys, and each corresponding value is an array of scores (these scores are generated by calculating the probability of a transcription factor, a.k.a. TF, binding to a gene at a certain position). Each TF has one dict, and its keys only contain genes for which it has at least one score. After the dictionaries are opened, set intersection is used to generate a list of genes that all given transcription factors have in common, which are then placed in the "gene name" column of the dataframe in a for loop. Because of class structures I built earlier (not shown), I can easily retrieve the common name and description of each gene and place them in the "gene name" and "gene description" columns using df.set_value.
for id in common_names:
    gene_name = idconverter.getgene(id).gene_name
    gene_description = idconverter.getgene(id).description
    df.set_value(id, 'gene name', gene_name)
    df.set_value(id, 'gene description', gene_description)
However, the number of columns in the dataframe depends on the number of transcription factors the user wishes to analyze. So, a user putting in two TFs will add four columns to the dataframe: two columns for TF-A (sum of scores and number of scores, or 'hits'), and two columns for TF-B. Inputting three TFs will yield six columns, inputting four TFs will yield eight columns, and so on. The ID, gene name, and description columns are constant.
So, before I build the dataframe, I make a list that expands for every TF given.
ColumnList = []
for tf_id in tf_id_list:
    ColumnList += ['{} Total Sum of Hits'.format(tf_id), '{} Number of Hits'.format(tf_id)]
With this list, I concatenate it with the other column names, and then instantiate my dataframe.
df= DataFrame(columns= ['ids','gene name']+ColumnList+['gene description'])
As shown above, I can easily set the names and description in the correct cell in the dataframe. I can also easily calculate the number of hits and total sum of scores for each gene and TF from the original dicts, but I have NO idea how to place this information in the corresponding cells. Because the number of columns depends on the inputted TFs, I do not know what kind of code I should write to accommodate this variable number of columns, or how to specify a column based on the TF it belongs to. Can anybody recommend the proper code setup and/or methods?
I have done some research, and I did see a method whereby one can add in a certain piece of information into a cell based on what kind of information is present in another column (but in the same row):
Modifying a subset of rows in a pandas dataframe
df.ix[df.A==0, 'B'] = np.nan
If you read the code in the link provided above, this piece of code adds a NaN into the 'B' column whenever a zero is present in the 'A' column. I thought I could utilize this methodology, but I need to add the number of hits and total sum of scores based on whether they relate to the first TF, second TF, third TF, and so on. Would one write:
for id in common_genes:
    for tf_id in tf_id_list:
        df.ix[df.'{} Number of Hits'.format(tf_id)== tf_id]= number_hits
        df.ix[df.'{} Total Sum of Scores'.format(tf_id)== tf_id]= sum_scores
I don't believe that's correct, since my IDE says the syntax does not compile. I should also note that I have simplified the above code a little: the variables "number_hits" and "sum_scores" are actually derived from a dictionary that contains gene names as keys, and a list of hits, score sum, and pertaining TF name as values.

In the end, I decided to make a dict of dicts; I realized that this data structure was actually ideal for what I needed. The information from the original dicts is stored inside a dict (the "total_dict"), and it is accessed if and only if the gene is present in the list of common_genes (which was derived from the set of genes that all the transcription factors have in common). A for loop goes through the TFs of the total_dict to determine whether the gene id is present in the TF dict, and if it is, the information in the value (i.e. sum of scores and number of hits) is added to the correct row (based on the gene name, or id) and column (based on the TF at hand).
for id in common_names:
    gene_name = idconverter.getgene(id).gene_name
    gene_description = idconverter.getgene(id).description
    # df.set_value was removed in pandas 1.0; df.at[row, column] = value is its replacement
    df.set_value(id, 'gene name', gene_name)
    df.set_value(id, 'gene description', gene_description)
    for tf_id in total_dict.keys():
        if id in total_dict[tf_id].keys():
            # assuming index 0 holds the score sum and index 1 the number of hits
            df.set_value(id, '{} Score Sum'.format(tf_id), total_dict[tf_id][id][0])
            df.set_value(id, '{} Number of Hits'.format(tf_id), total_dict[tf_id][id][1])
print(df.head())
Long story short, if you are dealing with a lot of different kinds of data, make SURE the kind of data structures you work with can be manipulated in order to build the kind of results you want to generate.
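For readers on a current pandas version, the same dict-of-dicts idea can be sketched roughly as follows. This is only an illustration under assumptions: the contents of total_dict are made-up sample values, each inner value is assumed to be a (score sum, number of hits) pair, and the gene name and description columns are left out because idconverter is not shown. It uses df.at, which addresses a single cell by row label and column label.

```python
import pandas as pd

# Hypothetical input: one dict per TF, mapping gene id -> (score sum, number of hits).
total_dict = {
    "TF-A": {"id1": (53.85, 14), "id2": (97.55, 11), "id3": (88.67, 9)},
    "TF-B": {"id1": (37.65, 7), "id2": (63.94, 8), "id4": (13.03, 13)},
}

# Genes shared by every TF (set intersection), sorted for a stable row order.
common_genes = sorted(set.intersection(*(set(d) for d in total_dict.values())))

# Two columns per TF, however many TFs were supplied.
tf_columns = []
for tf_id in total_dict:
    tf_columns += ["{} Score Sum".format(tf_id), "{} Number of Hits".format(tf_id)]

# Empty (NaN-filled) frame indexed by gene id.
df = pd.DataFrame(index=common_genes, columns=tf_columns, dtype=float)

for gene_id in common_genes:
    for tf_id, scores in total_dict.items():
        score_sum, hits = scores[gene_id]
        # The column label is built from the TF name, so it works for any number of TFs.
        df.at[gene_id, "{} Score Sum".format(tf_id)] = score_sum
        df.at[gene_id, "{} Number of Hits".format(tf_id)] = hits

print(df)
```

Because the column label is computed from tf_id inside the loop, no code changes are needed when a third or fourth TF is added to total_dict.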

Related

UNIQUE formula in Google Sheets for multiple ranges

I have a list of participants in column A. A full employee list in column B. I want to get the list of non-participants in column C. Basically 'B-A' but in list form.
'January' is the participants list:
try:
=FILTER(A:A; NOT(COUNTIF(B:B; A:A)))
It is always an added challenge to write formulas when we don't have access to actual data. But based on what I can see, try this formula in the top cell of any empty column:
=ArrayFormula({"My Header"; FILTER(R2:R,ISERROR(VLOOKUP(TRIM(R2:R),TRIM(T2:T),1,FALSE)))})
You can change "My Header" to something meaningful.
The next part means "FILTER in anything in the range R2:R that cannot be found [i.e., ISERROR(VLOOKUP(...))] in T2:T."
TRIM is used just to account for any accidental/stray spaces that may occur in either list, since that would result in no match if one or the other had extra space.
If this does not do what you expect, please share a link to a sample spreadsheet.
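For anyone who wants to check the logic outside the spreadsheet, the same "keep what cannot be found in the other list" filter can be sketched in Python. The names and sample lists below are invented for illustration; normalize() is a hypothetical helper that mimics TRIM by removing leading, trailing, and repeated spaces.

```python
def normalize(text):
    """Mimic TRIM: drop leading/trailing spaces and collapse repeated ones."""
    return " ".join(text.split())


def not_in(candidates, exclude):
    """Return the non-blank items of `candidates` that have no match in `exclude`."""
    excluded = {normalize(name) for name in exclude if normalize(name)}
    return [
        name for name in candidates
        if normalize(name) and normalize(name) not in excluded
    ]


# Made-up sample data with stray spaces on purpose.
employees = ["Ann", "Bob ", "Cid", "Dee"]
participants = ["Bob", " Dee"]

non_participants = not_in(employees, participants)
print(non_participants)
```

The comparison is done on the normalized text, while the original cell content is what gets returned, which matches how the formula filters R2:R itself rather than the trimmed copy.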

DAX: How to achieve the effect of a join/relationship between a bitmask column and a table of bit values?

I have a table CarHistoryFact (CarHistoryFactId, CarId, CarHistoryFactTime, CarHistoryFactConditions) that tracks the historical status of cars. CarHistoryFactConditions is a 25-bit (in binary) int column that encodes the status of 25 various conditions the car may be in at a given point in time.
I have a dimension table CarConditions with a row for each of the conditions, and their base 10 bit value.
How can I implement a "relationship" between the fact and dimension, giving a list of all the conditions a given car is in?
I can come up with bit parsing code, but I'm not sure how to hook it up to the dimension table to get just the currently applicable conditions at a car-time.
Bitmask parsing in DAX can be seen here:
https://radacad.com/quick-dax-convert-number-to-binary
You can create a CROSSJOIN table in which every fact record is repeated 25 times (once per condition) and then filter out the ones that do not apply:
CarHistoryConditions =
var temp = CROSSJOIN(CarHistoryFact; CarConditions)
return FILTER(temp; MOD(TRUNC(CarHistoryFact[CarHistoryFactConditions] / CarConditions[bit]); 2) = 1)
Note: I assumed that CarHistoryFactConditions and bit are integers, not strings of bits; you can adapt the expression if they are not.
The result is a table with one row per condition that applies to each car. E.g. if car one has 2 conditions and car two has 5 conditions, you get 7 rows.
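The cross-join-then-filter idea is easy to verify on a small example outside DAX. The following Python sketch uses invented sample rows (only the column names come from the question) and the same test as the DAX expression: integer-divide the mask by the bit value and check whether the result is odd.

```python
from itertools import product

# Hypothetical sample data; column names follow the question.
car_history_fact = [
    {"CarHistoryFactId": 1, "CarId": 1, "CarHistoryFactConditions": 0b00101},  # 2 bits set
    {"CarHistoryFactId": 2, "CarId": 2, "CarHistoryFactConditions": 0b11111},  # 5 bits set
]
# One dimension row per condition, each with its power-of-two bit value.
car_conditions = [
    {"Condition": "condition {}".format(i), "bit": 1 << i} for i in range(25)
]

# Cross join every fact row with every condition row, then keep only the
# pairs where that condition's bit is set in the fact's mask.
car_history_conditions = [
    {**fact, **cond}
    for fact, cond in product(car_history_fact, car_conditions)
    if (fact["CarHistoryFactConditions"] // cond["bit"]) % 2 == 1
]

print(len(car_history_conditions))
```

In Python the same test is more commonly written as `mask & bit != 0`; the division form is kept here only because it mirrors the MOD/TRUNC expression.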

How to create dependent drop down lists containing unique distinct alphabetically sorted text values, with ignore Blanks?

According to this picture there is Table1 in "Sheet1" containing initial data.
I need to use List data validation for the last 3 columns of Table2 in Sheet2,
i.e. Columns B, C, D in Sheet2 should have data validation lists that are comprised of the unique sorted values of Sheet1 columns A,B,C respectively. They should dynamically update as new items are added to Sheet1.
Note: I want to prepare the dependent data validation lists without using a helper sheet or helper column.
In related questions, answers, and articles on the same issue (e.g. get-digital-help, microsoft, and extendoffice), the requirements below are not all covered together:
dependent drop down lists with several (more than 2) columns.
unique distinct alphabetically sorted text values
ignore Blanks
without using helper sheet or helper column
With formulas:
1) I tried for ignoring blanks:
{=IFERROR(INDEX(Table1[order '# id], SMALL(IF(ISBLANK(Table1[order '# id]),"", ROW(Table1[order '# id])-MIN(ROW(Table1[order '# id]))+1), ROW(A2))),"")}
2) With helper columns I tried:
{=INDEX(order,MATCH(0,COUNTIF($A$1:A1,Table1[order '# id]),0))} for unique value and {=INDEX(continent,MATCH(0,COUNTIF($C$1:C1,Table1[Continent])+(order<>Sheet1!$E$2)+(product<>Sheet1!$E$5),0))}
But I need to combine the 2 steps of remove blanks and unique values in one step and write in a Name if possible.
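To make the target behaviour concrete, independent of how it ends up being written in Excel, here is a small Python sketch of what each dropdown should contain. The rows and the column names (order, product, continent) are invented stand-ins for Table1, and dropdown() is a hypothetical helper, not an Excel feature.

```python
# Hypothetical rows standing in for Table1; "" represents a blank cell.
table1 = [
    {"order": "A-2", "product": "pear", "continent": "Europe"},
    {"order": "A-1", "product": "apple", "continent": "Asia"},
    {"order": "A-2", "product": "apple", "continent": ""},
    {"order": "", "product": "kiwi", "continent": "Africa"},
    {"order": "A-2", "product": "pear", "continent": "Asia"},
]


def dropdown(rows, column, **chosen):
    """Unique, non-blank, alphabetically sorted values of `column`,
    limited to rows that match every already-chosen parent value."""
    values = {
        row[column]
        for row in rows
        if row[column].strip() and all(row[k] == v for k, v in chosen.items())
    }
    return sorted(values, key=str.casefold)


orders = dropdown(table1, "order")
products = dropdown(table1, "product", order="A-2")
continents = dropdown(table1, "continent", order="A-2", product="pear")
print(orders, products, continents)
```

Each list depends on the selections made in the columns to its left, which is the "dependent" part; blanks are dropped and duplicates collapse before sorting.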

Sort column with repeated values by another column

In Power BI Desktop, I'm trying to order the following column with repeated values by an ID column (contains primary key).
This returns the error: "There can't be more than one value in "Nível2"...."
In this other post it seems the suggestion is to concatenate the values of the column so they don't get duplicate.
But I want them to be repeated so they can aggregate values in visuals.
So, what's the workaround for this situation?
Thanks in advance for helping!
The issue is that your sort column (i.e. your ID column) contains multiple values for each value in the column you are trying to sort (i.e. your Nivel2 column).
You need to ensure that your sort column contains only one distinct value for each value in the column you are trying to sort.
One way to achieve this would be to create a new (calculated) sort column based on your ID column. It could be defined like this:
SortColumn:=CALCULATE(MAX('YourTable'[ID]),ALLEXCEPT('YourTable','YourTable'[Nivel2]))
Here is an example of how the SortColumn would behave:
Id Nivel2 SortColumn
1 Caixa 4
2 Caixa 4
3 Caixa 4
4 Caixa 4
5 Depósitos à ordem 7
6 Depósitos à ordem 7
7 Depósitos à ordem 7
You can now sort Nivel2 by SortColumn.
EDIT - The implementation of the SortColumn should be done in the data source
There seems to be a limitation in PowerBI where it checks the implementation of the sort column rather than the data in the sort column. Therefore the above solution does not work, even though the data in the sort column is perfectly valid. The above solution will throw this error when you attempt to sort [Nivel2] by SortColumn:
This column can't be sorted by a column that is already sorted, directly or indirectly, by this column.
The implementation of the SortColumn should be moved to the data source instead. I.e. if your data source is an Excel sheet, then the SortColumn should be created inside the Excel sheet.
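If the data passes through a script before it reaches Power BI, the sort column can be precomputed there. The sketch below is one possible way to do it in plain Python, using the example rows from the table above; how the rows are loaded and written back is left out and would depend on your data source.

```python
# Example rows taken from the table above.
rows = [
    {"Id": 1, "Nivel2": "Caixa"},
    {"Id": 2, "Nivel2": "Caixa"},
    {"Id": 3, "Nivel2": "Caixa"},
    {"Id": 4, "Nivel2": "Caixa"},
    {"Id": 5, "Nivel2": "Depósitos à ordem"},
    {"Id": 6, "Nivel2": "Depósitos à ordem"},
    {"Id": 7, "Nivel2": "Depósitos à ordem"},
]

# First pass: highest Id seen for each Nivel2 value.
max_id = {}
for row in rows:
    key = row["Nivel2"]
    max_id[key] = max(row["Id"], max_id.get(key, row["Id"]))

# Second pass: every row gets the single sort value of its Nivel2 group.
for row in rows:
    row["SortColumn"] = max_id[row["Nivel2"]]

for row in rows:
    print(row["Id"], row["Nivel2"], row["SortColumn"])
```

Since SortColumn then arrives as ordinary data, each Nivel2 value still maps to exactly one sort value, as required.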
The above answer explains the issue and the resolution correctly. The only change is that the SortColumn must be implemented outside of the tabular model (PowerBI) to ensure that PowerBI does not know about the dependency between the SortColumn and the [Nivel2] column.
In my case, I calculate the levels from a parent-child hierarchy
Path = Path([id],[father])
For each level:
Level1 = LOOKUPVALUE([Name],[id], PathItem([Path],1))
Level2 = LOOKUPVALUE([Name],[id], PathItem([Path],2))
.....
Then I created a new column for each level to sort the column Level:
SortL1 = LOOKUPVALUE([nID],[id], PathItem([Path],1))
SortL2 = LOOKUPVALUE([nID],[id], PathItem([Path],2))
.....
"id" and "nID" hold the same numeric value, but "id" is stored as a string because Path does not support numeric values.
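As a rough illustration of what these columns compute, here is a Python sketch of the same parent-child lookup. The nodes table and its values are made up, and path() and level() are hypothetical helpers that only loosely mirror PATH, PATHITEM, and LOOKUPVALUE.

```python
# Hypothetical parent-child table: id -> (name, nID, father id or None for the root).
nodes = {
    "1": ("Assets", 10, None),
    "2": ("Cash", 20, "1"),
    "3": ("Deposits", 30, "2"),
}


def path(node_id):
    """Ids from the root down to node_id."""
    chain = []
    while node_id is not None:
        chain.append(node_id)
        node_id = nodes[node_id][2]
    return chain[::-1]


def level(node_id, n, field):
    """Name (field=0) or nID (field=1) of the n-th item on the path, 1-based.
    Returns None when the path is shorter than n."""
    p = path(node_id)
    return nodes[p[n - 1]][field] if n <= len(p) else None


print(path("3"))          # ids from root to node
print(level("3", 2, 0))   # Level2 name for node 3
print(level("3", 2, 1))   # SortL2 value for node 3
```

Every row that shares a Level2 name also shares the same SortL2 value, because both are looked up from the same ancestor.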

Searching for unmatched names when comparing spreadsheets

In one spreadsheet I have 3 columns with a first and last name of a person combined. In the 2nd spreadsheet, I have column a = first name and column b = last name.
I want to know which names in spreadsheet one cannot be found in spreadsheet two. I also need to verify the data to make sure that the formula was accurate on finding the correct lookup.
Do I have to combine my columns in spreadsheet 2 to make the first and last name in the same column to make this work?
Which formula would you use for either scenario?
Use this:
=ISNA(MATCH($A1&" "&$B1,Sheet2!$A:$A,FALSE))
Where (in order):
A1 is the first name column in Sheet1
B1 is the last name column in Sheet1
Sheet2 is the sheet that stores the first and last name combined in one cell
$A:$A is the column in Sheet2 that holds the combined names
FALSE is because it's an exact match
This will return TRUE if the name does not exist in Sheet2, and FALSE if it does
You can also use:
=VLOOKUP($A1&" "&$B1,Sheet2!$A:$D,3,FALSE)
If you want to retrieve data for a match.
Finally, if you need to do your lookups the other way, take a look at this thread for some ideas on how to split the string into two pieces.
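If it helps to sanity-check the spreadsheet results, the comparison can be reproduced in a few lines of Python. The sample names are invented, and the sketch assumes the combined names are always "first last" separated by a single space. Joining the separate columns once makes it possible to check in both directions without splitting any strings.

```python
# Made-up sample data.
separate = [("Ann", "Lee"), ("Bob", "Ray"), ("Cy", "Poe")]  # first name, last name
combined = ["Ann Lee", "cy poe", "Dan Fox"]                  # full name in one cell

# MATCH ignores case, so compare case-insensitively here as well.
combined_keys = {name.casefold() for name in combined}
joined_keys = {"{} {}".format(first, last).casefold() for first, last in separate}

# Rows with separate names that have no counterpart in the combined list.
missing_from_combined = [
    (first, last) for first, last in separate
    if "{} {}".format(first, last).casefold() not in combined_keys
]

# Combined names that have no counterpart among the separate rows.
missing_from_separate = [
    name for name in combined if name.casefold() not in joined_keys
]

print(missing_from_combined)
print(missing_from_separate)
```

Comparing the two result lists against the TRUE/FALSE column produced by the formula is a simple way to verify that the lookup matched the right rows.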