How to create a variable flagging when another variable takes its highest value, with conditions on countries?

I'm trying to create a new variable under conditions on other variables. I have countries in Africa with each country divided into constituencies; for each I have the number of votes for a candidate.
I am trying to work on one country at a time (country = ctr) and to create the value within each constituency (cst).
I would like to create a variable win1 = 2 when the votes (cv1) take the highest value in a given constituency, and in a given country.
I have tried:
by cst : replace win1=2 if cv1=max(cv1) in (ctr==566)

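Use bysort with the sort variable in parentheses, so that the largest value comes last within each group:
bysort ctr cst (cv1) : replace win1 = 2 if cv1 == cv1[_N]
Errors in your attempt:
in is for observation numbers. It's not an alternative to if.
You need == to test equality, not =.
max() as a Stata function requires two or more arguments and works rowwise, not across groups of observations.
The code above assumes no missing values in cv1: Stata sorts missing values last, so cv1[_N] would be missing in any group that contains one.
It's also easier than you think, in so far as you can work with several countries at once.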
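For readers who think in pandas rather than Stata, the same group-wise logic can be sketched as follows. This is only an illustration under assumptions: the data values are made up, and only the column names (ctr, cst, cv1, win1) come from the question.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "ctr": [566, 566, 566, 566, 404, 404],
    "cst": [1, 1, 2, 2, 1, 1],
    "cv1": [120, 300, 80, 45, 10, 25],
})

# Highest vote count within each country/constituency group,
# broadcast back to every row of that group.
group_max = df.groupby(["ctr", "cst"])["cv1"].transform("max")

# win1 is 2 on the row(s) holding the group maximum, missing elsewhere.
df["win1"] = float("nan")
df.loc[df["cv1"] == group_max, "win1"] = 2
```

As with the Stata comparison against cv1[_N], tied maxima within a group are all flagged.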

Related

DAX syntax for conditional measure to format a matrix based on value of a column?

I have a matrix in Power BI which shows counts by Category and by Country (please see table 1 below for example).
I want to create a measure to 'suppress' any counts in that matrix (by replacing the count with "...") that are lower than 3 IF the value of another column in the same source table = 1. IF the value is not equal to 1, I want to suppress any counts in that same matrix that are lower than 10.
To clarify, countries 1 to 5 have either a value of 1 or 2 for the column on which the suppression depends, and within the table it would look like this:
With this measure in the values field, this would produce a matrix that looks like this:
I have been able to achieve the first part (suppressing counts less than 3) using the following DAX syntax to create the measure:
Less than 3 = IF([COUNT]<3,"...",[FORMCOUNT])
But I have struggled to incorporate the second condition. I imagined it would look something like this:
Less than 3 or 10 = IF('Table'[conditional column] = 1
,(IF(AND([COUNT]>0,[COUNT]<3),"...",[COUNT]))
,(IF(AND([COUNT]>0,[COUNT]<10),"...",[COUNT])))
But that doesn't work. I have tried various variations of this syntax using functions such as FILTER that allow you to specify the column and condition but no luck.
Any pointers would be greatly appreciated, thanks!
This may not be the perfect answer but the below appears to do what is required:
Two measures are required:
Measure1 = SELECTEDVALUE('Dataset'[Suppression category],"NA")
Measure2 = IF('Dataset'[Measure1] = 1,IF([COUNT]<3,"...",[COUNT]),IF([COUNT]<10,"...",[COUNT]))
I can't fully explain the first measure, but as far as I can tell SELECTEDVALUE returns the suppression category when the current filter context contains exactly one distinct value of it (and "NA" otherwise), which lets the second measure distinguish between suppression categories 1 and 2.
The second measure uses the first as the condition of a nested IF: if the suppression category is 1, the count is suppressed when it is less than 3; otherwise it is suppressed when it is less than 10. Counts at or above the relevant threshold are shown unchanged.
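The branching in the second measure can be sketched outside DAX as a plain function. This is a minimal illustration of the rule only, not of how Power BI evaluates measures; the function name suppress and its parameters are hypothetical.

```python
def suppress(count, category):
    """Return "..." for counts below the category's threshold, else the count.

    Hypothetical helper: category 1 uses a threshold of 3,
    any other category uses a threshold of 10.
    """
    threshold = 3 if category == 1 else 10
    return "..." if count < threshold else count


# One example per branch of the nested IF.
examples = [suppress(2, 1), suppress(5, 1), suppress(5, 2), suppress(10, 2)]
```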

Get total count of each distinct value

If I for example have a column of countries that might repeat and the list follows like this: Spain, Spain, Italy, Spain
I want to take the number of times a country appears in the column and divide it by the total number of rows. I have tried:
CountRows = DIVIDE(DISTINCTCOUNT('Report (7)'[Country]); COUNT('Report (7)'[Country]) )
Any suggestions? Do I need a new column for that?
The easiest way to do this is to add a calculated column that divides the number of rows sharing the current row's country by the total number of rows in the table.
You need the EARLIER function to refer to the current row's value inside the FILTER.
Assuming a table named Table1 with a column Country, something like:
DIVIDE(COUNTROWS(FILTER(Table1, Table1[Country] = EARLIER(Table1[Country]))), COUNTROWS(Table1))
Don't forget to format the new column as a percentage, or show some decimal places, to see the correct values.
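To make the arithmetic concrete, here is the same per-row share computed in pandas on the example list from the question (Spain, Spain, Italy, Spain). It is only a sketch of the calculation, not of DAX row context; the variable names are made up.

```python
import pandas as pd

# The example column from the question.
countries = pd.Series(["Spain", "Spain", "Italy", "Spain"], name="Country")

# Share of rows per country: occurrences divided by the total row count.
share_by_country = countries.value_counts(normalize=True)

# Attach each row's own country share, like the calculated column does.
share_per_row = countries.map(share_by_country)
```

Spain appears in 3 of 4 rows, so its rows get 0.75, and the Italy row gets 0.25.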

DAX: How to achieve the effect of a join/relationship between a bitmask column and a table of bit values?

I have a table CarHistoryFact (CarHistoryFactId, CarId, CarHistoryFactTime, CarHistoryFactConditions) that tracks the historical status of cars. CarHistoryFactConditions is a 25-bit (in binary) int column that encodes the status of 25 various conditions the car may be in at a given point in time.
I have a dimension table CarConditions with a row for each of the conditions, and their base 10 bit value.
How can I implement a "relationship" between the fact and dimension, giving a list of all the conditions a given car is in at a given time?
I can come up with bit parsing code, but I'm not sure how to hook it up to the dimension table to get just the currently applicable conditions at a car-time.
Bitmask parsing in DAX is shown here:
https://radacad.com/quick-dax-convert-number-to-binary
You can create a CROSSJOIN table in which every fact row is repeated 25 times (once per condition), and then filter out the combinations that do not apply:
CarHistoryConditions =
var temp = CROSSJOIN(CarHistoryFact; CarConditions)
return FILTER(temp; MOD(TRUNC(CarHistoryFact[CarHistoryFactConditions] / CarConditions[bit]); 2) = 1)
Note: I assumed CarHistoryFactConditions and bit are integers, not strings of bits. You can of course change that.
The result is a table with one row per applicable condition for each fact row. E.g. if car one has 2 conditions and car two has 5 conditions, you get 7 rows.
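The filter test (divide by the bit value, truncate, take the remainder modulo 2) can be checked with a few lines of Python. This is a sketch of the arithmetic only; the condition names and the mask value are invented for illustration.

```python
# Hypothetical dimension rows: condition name -> base 10 bit value.
car_conditions = {"parked": 1, "locked": 2, "charging": 4, "in service": 8, "rented": 16}


def active_conditions(mask, conditions):
    """Return the names whose bit is set in mask.

    (mask // bit) % 2 == 1 mirrors MOD(TRUNC(mask / bit); 2) = 1.
    """
    return [name for name, bit in conditions.items() if (mask // bit) % 2 == 1]


# 13 = 8 + 4 + 1, so bits 1, 4 and 8 are set.
result = active_conditions(13, car_conditions)
```

Each name returned corresponds to one row that survives the FILTER over the cross join for that fact row.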

Power BI remove duplicates based on max value

I have 2 columns: ID CODE and value.
The Remove Duplicates function removes the rows with the higher value and keeps the lower one. Is there any way to remove the lower ones instead? The result I expected was like this.
I've tried the Buffer Table function before but it doesn't work. It seems Buffer Table only works with date-related data (newest-latest).
You could use SUMMARIZE, which works much like a SQL GROUP BY query: it takes the MIN value of one column, grouped by another column.
In the example below, MIN([value]) is computed and given the new column name "MinValue", grouped by IDCode. This returns the minimum value for each IDCode.
NewCalculatedTable =
SUMMARIZE(yourTablename, yourTablename[IDCode], "MinValue", MIN(yourTablename[value]) )
Alternatively, if you want the higher values, as in your question, just replace the MIN function with MAX.
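The grouping idea is the same in any tool. As a rough pandas analogue of the MAX variant, assuming made-up data and the column names IDCode and value from the answer:

```python
import pandas as pd

# Hypothetical rows with duplicate IDCode values.
df = pd.DataFrame({
    "IDCode": ["A", "A", "B", "B", "C"],
    "value": [1, 5, 7, 3, 2],
})

# One row per IDCode, keeping the highest value in each group.
max_per_id = df.groupby("IDCode", as_index=False)["value"].max()
```

Swapping max() for min() gives the behaviour of the MIN version.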

How do I add information to specific cells in a Pandas DataFrame based on the variable name of the column as well as the row?

Note: I have simplified this code and some background information quite a bit in order to pinpoint the exact issue. If you want me to explain additional aspects of this code, please ask in the comments.
I am currently trying to write code capable of aggregating two or more dictionaries to ultimately make a .csv file with the following basic setup:
                TF-A       TF-A  TF-B       TF-B
ids  gene name  score sum  hits  score sum  hits  ...  gene description
id1  gene A     53.85      14    37.65      7     ...  stuff
id2  gene B     97.55      11    63.94      8     ...  stuff
id3  gene C     88.67      9     79.43      12    ...  stuff
id4  gene D     69.35      12    13.03      13    ...  stuff
...  ......     ......     ...   ...        ...   ...  ..........
idx  gene Z     49.32      8     84.03      10    ...  stuff
The dictionaries contain the names of genes as keys, with the corresponding values being arrays of scores (these scores are generated by calculating the probability of a transcription factor, a.k.a. TF, binding to a gene at a certain position). Each TF has one dict, and its keys only contain genes for which it has at least one score. After the dictionaries are opened, set intersection is used to generate a list of the genes that all given transcription factors have in common, which are then placed in the "gene name" column of the dataframe in a for loop. Because of class structures I built previously (not shown), I can easily retrieve the common name and description of each gene and place them in the "gene name" and "gene description" columns using df.set_value.
for id in common_names:
    gene_name = idconverter.getgene(id).gene_name
    gene_description = idconverter.getgene(id).description
    # set_value was removed in pandas 1.0; df.at[id, column] = value is its replacement
    df.set_value(id, 'gene name', gene_name)
    df.set_value(id, 'gene description', gene_description)
However, the number of columns in the dataframe is dependent on the number of transcription factors the user wishes to analyze. So, a user putting in two TFs will add four columns to the dataframe– two columns for TF-A (sum of scores and number of scores, or 'hits'), and two columns for TF-B. Inputting three TFs will yield six columns, inputting four TFs will yield eight columns, and so on. The ID, gene name, and description columns are constant.
So, before I build the dataframe, I make a list that expands for every TF given.
ColumnList = []
for tf_id in tf_id_list:
    ColumnList += ['{} Total Sum of Hits'.format(tf_id), '{} Number of Hits'.format(tf_id)]
With this list, I concatenate it with the other column names, and then instantiate my dataframe.
df= DataFrame(columns= ['ids','gene name']+ColumnList+['gene description'])
As shown above, I can easily set the names and descriptions in the correct cells of the dataframe. I can also easily calculate the number of hits and the total sum of scores for each gene and each TF from the original dicts, but I have NO idea how to place this information in the corresponding cells. Because the number of columns depends on the TFs provided, I do not know what kind of code I should write to accommodate this variable number of columns, or how to select a column based on the TF it belongs to. Can anybody recommend the proper code setup and/or methods?
I have done some research, and I did see a method whereby one can add in a certain piece of information into a cell based on what kind of information is present in another column (but in the same row):
Modifying a subset of rows in a pandas dataframe
df.ix[df.A==0, 'B'] = np.nan
If you read the code in the link provided above, this piece of code puts a NaN in the 'B' column wherever a zero is present in the 'A' column. I thought I could use the same approach, except that I need to add the number of hits and the total sum of scores depending on whether they relate to the first TF, second TF, third TF, and so on. Would one write:
for id in common_genes:
    for tf_id in tf_id_list:
        df.ix[df.'{} Number of Hits'.format(tf_id)== tf_id]= number_hits
        df.ix[df.'{} Total Sum of Scores'.format(tf_id)== tf_id]= sum_scores
I don't believe that's correct, since my IDE says the syntax does not compile. I should also note that I have simplified the above code a little bit: the variables "number_hits" and "sum_scores" are actually derived from a dictionary that contains gene names as keys, and a list of hits, score sum, and pertaining TF name as values.
In the end, I decided to make a dict of dicts, as I realized that this data structure was the best fit for what I needed. The information from the original dicts is stored inside one dict (the "total_dict"), and it is accessed if and only if the gene is present in the list of common_genes (which was derived from the set of genes that all the transcription factors have in common). A for loop goes through the TFs of the total_dict to check whether the gene id is present in each TF's dict, and if it is, the information in the value (i.e. sum of scores and number of hits) is added to the correct row (based on the gene id) and column (based on the TF at hand).
for id in common_names:
    gene_name = idconverter.getgene(id).gene_name
    gene_description = idconverter.getgene(id).description
    # .loc assignment replaces set_value, which was removed in pandas 1.0
    df.loc[id, 'gene name'] = gene_name
    df.loc[id, 'gene description'] = gene_description
    for tf_id in total_dict.keys():
        if id in total_dict[tf_id].keys():
            # [0] is the sum of scores, [1] the number of hits; the column
            # names must match those used when the DataFrame was created
            df.loc[id, '{} Score Sum'.format(tf_id)] = total_dict[tf_id][id][0]
            df.loc[id, '{} Number of Hits'.format(tf_id)] = total_dict[tf_id][id][1]
print(df.head())
Long story short, if you are dealing with a lot of different kinds of data, make SURE the kind of data structures you work with can be manipulated in order to build the kind of results you want to generate.
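As a self-contained illustration of the dict-of-dicts idea, the sketch below builds the whole table in one step instead of assigning cell by cell. It rests on assumptions: the layout {tf_id: {gene_id: [score_sum, hits]}} is inferred from the description above, the numbers are made up, and idconverter is left out because it is not shown.

```python
import pandas as pd

# Hypothetical stand-in for total_dict: {tf_id: {gene_id: [score_sum, hits]}}.
total_dict = {
    "TF-A": {"id1": [53.85, 14], "id2": [97.55, 11]},
    "TF-B": {"id1": [37.65, 7], "id2": [63.94, 8], "id3": [79.43, 12]},
}

# Genes that every transcription factor has in common.
common_genes = sorted(set.intersection(*(set(genes) for genes in total_dict.values())))

# Build one row dict per gene; the column names depend on the TFs supplied,
# so the number of columns grows with the number of TFs.
rows = []
for gene_id in common_genes:
    row = {}
    for tf_id, genes in total_dict.items():
        score_sum, hits = genes[gene_id]
        row["{} Score Sum".format(tf_id)] = score_sum
        row["{} Number of Hits".format(tf_id)] = hits
    rows.append(row)

df = pd.DataFrame(rows, index=common_genes)
```

Building the rows first and constructing the DataFrame once avoids having to know the column list in advance.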