Descriptive Statistics for Binary Variables - stata

I am attempting to create a descriptive table that provides me with the count of each specific variable mentioned. When I use the following code it just gives me the total number of observations. How do I get the total count of award_winner when it is equal to one only? award_winner is a binary variable for 1 = received the award and 0 = did not receive the award. registered is the total number of businesses.
asdoc summ registered i.award_winner

How to replace missing values in panel data?

I am looking into weekly earnings data, which I have split into pre-pandemic and post-pandemic earnings. For some individuals the post-pandemic value is missing, and I want to replace it with their pre-pandemic earnings. However, I am struggling with the code for this panel data. I was hoping someone could help me with this. Thanks in advance.
It is always easier if you share example data (see dataex) or at least list what variables you have. The example below will therefore most likely need to be edited.
* Sort the data by individual id and the time unit
* that indicates whether the obs is pre or post pandemic
sort id time
* This replaces a missing earnings value with the value on the
* preceding row if the id var is the same as on the preceding row
* AND the earnings var is missing on the current row
replace earnings = earnings[_n-1] if id == id[_n-1] & missing(earnings)
This assumes that all individuals are indeed represented in each time period and that you have a unique id variable (id) in your data set.
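For readers who want to check the logic outside Stata, the fill asked for in the question (a missing post-pandemic value takes the same individual's pre-pandemic value) can be sketched in plain Python. The rows, the field names and the 0/1 coding of time are assumptions made for illustration, not the asker's data.

```python
# Hypothetical panel rows; time 0 = pre-pandemic, 1 = post-pandemic (assumed coding).
rows = [
    {"id": 1, "time": 0, "earnings": 500.0},
    {"id": 1, "time": 1, "earnings": None},   # missing post-pandemic value
    {"id": 2, "time": 0, "earnings": 620.0},
    {"id": 2, "time": 1, "earnings": 640.0},
]


def fill_from_previous(rows):
    """Return sorted copies of the rows in which a missing earnings value
    takes the value from the preceding row of the same id."""
    # Sort by id, then time, so the pre-pandemic row directly precedes
    # the post-pandemic row of the same individual.
    ordered = sorted((dict(r) for r in rows), key=lambda r: (r["id"], r["time"]))
    for prev, cur in zip(ordered, ordered[1:]):
        if cur["id"] == prev["id"] and cur["earnings"] is None:
            cur["earnings"] = prev["earnings"]
    return ordered


filled = fill_from_previous(rows)
```

As with the Stata version, this only works if every individual has a row in each period and the id is unique per individual.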

How to create a variable for when other variable takes up highest value, with conditions on countries?

I'm trying to create a new variable under conditions on other variables. I have countries in Africa with each country divided into constituencies; for each I have the number of votes for a candidate.
I am trying to work for one country at a time (country=ctr) and to create the value in each constituency (cst)
I would like to create a variable win1 = 2 when the votes take the highest value in a given constituency, and in a given country.
I have tried :
by cst : replace win1=2 if cv1=max(cv1) in (ctr==566)
This should work, assuming win1 already exists:
bysort ctr cst (cv1) : replace win1 = 2 if cv1 == cv1[_N]
Errors in your attempt:
in is for observation numbers. It's not an alternative to if.
You need == to test equality, not =.
max() as a Stata function requires two or more arguments and works rowwise, not across groups of observations.
This code assumes no missing values.
It's also easier than you think in so far as you can work with several countries at once.
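The idea behind the groupwise approach (find the largest vote count within each country and constituency, then flag the rows that match it) can be sketched in plain Python. The rows below are invented, and leaving win1 as None for non-winning rows is an assumption, since the question does not say what value those rows should hold.

```python
from collections import defaultdict

# Hypothetical rows: country code (ctr), constituency (cst), votes (cv1).
rows = [
    {"ctr": 566, "cst": "A", "cv1": 1200},
    {"ctr": 566, "cst": "A", "cv1": 3400},
    {"ctr": 566, "cst": "B", "cv1": 900},
    {"ctr": 404, "cst": "A", "cv1": 150},
    {"ctr": 404, "cst": "A", "cv1": 80},
]


def flag_winners(rows):
    """Set win1 = 2 on rows holding the highest cv1 within each (ctr, cst) group."""
    # First pass: highest vote count per (country, constituency).
    top = defaultdict(lambda: float("-inf"))
    for r in rows:
        key = (r["ctr"], r["cst"])
        top[key] = max(top[key], r["cv1"])
    # Second pass: flag rows equal to their group's maximum (ties are all flagged).
    return [
        dict(r, win1=2 if r["cv1"] == top[(r["ctr"], r["cst"])] else None)
        for r in rows
    ]


flagged = flag_winners(rows)
```

Because the grouping key includes the country, all countries are handled in one pass, which is the point made above about not needing to work one country at a time.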

DAX: How to achieve the effect of a join/relationship between a bitmask column and a table of bit values?

I have a table CarHistoryFact (CarHistoryFactId, CarId, CarHistoryFactTime, CarHistoryFactConditions) that tracks the historical status of cars. CarHistoryFactConditions is a 25-bit (in binary) int column that encodes the status of 25 various conditions the car may be in at a given point in time.
I have a dimension table CarConditions with a row for each of the conditions, and their base 10 bit value.
How can I implement a "relationship" between the fact and dimension, giving a list of all the conditions a given car is in at a given time?
I can come up with bit parsing code, but I'm not sure how to hook it up to the dimension table to get just the currently applicable conditions at a car-time.
Bitmask parsing in DAX can be seen here:
https://radacad.com/quick-dax-convert-number-to-binary
You can create a CROSSJOIN table in which every fact record is repeated 25 times (once per condition) and then filter out the combinations where the bit is not set:
CarHistoryConditions =
VAR temp = CROSSJOIN(CarHistoryFact; CarConditions)
RETURN FILTER(temp; MOD(TRUNC(CarHistoryFact[CarHistoryFactConditions] / CarConditions[bit]); 2) = 1)
note: I assumed the CarHistoryFactConditions and bit to be an integer, not a string of bits. For sure you can change that.
The result is a table with one row per applicable condition for each car. E.g., if car one has 2 conditions and car two has 5 conditions, you get 7 rows.
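The arithmetic behind that filter is easy to check outside DAX. The following Python sketch pairs every fact row with every condition row and keeps a pair only when integer division of the mask by the bit value leaves an odd number, i.e. the bit is set. The condition names and the sample masks are made up for illustration; only the column names come from the question.

```python
from itertools import product

# Hypothetical data: condition names and bit values are illustrative only.
car_history = [
    {"CarHistoryFactId": 1, "CarId": 1, "CarHistoryFactConditions": 5},  # bits 1 and 4
    {"CarHistoryFactId": 2, "CarId": 2, "CarHistoryFactConditions": 2},  # bit 2
]
car_conditions = [
    {"condition": "in_service", "bit": 1},
    {"condition": "needs_wash", "bit": 2},
    {"condition": "rented", "bit": 4},
]


def expand_conditions(facts, conditions):
    """Cross join facts with conditions and keep pairs whose bit is set in the mask."""
    return [
        {**fact, **cond}
        for fact, cond in product(facts, conditions)
        # Same test as MOD(TRUNC(mask / bit), 2) = 1 in the DAX formula.
        if (fact["CarHistoryFactConditions"] // cond["bit"]) % 2 == 1
    ]


expanded = expand_conditions(car_history, car_conditions)
```

Here the first car yields two rows and the second yields one, so the expanded table has three rows in total.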

Power BI Dashboard where the core filter condition is a disjunction on numeric fields

We are trying to implement a dashboard that displays various tables, metrics and a map where the dataset is a list of customers. The primary filter condition is the disjunction of two numeric fields. We want the user to be able to select a threshold for [field 1] and a separate threshold for [field 2] and then impose the condition [field 1] >= <threshold> OR [field 2] >= <threshold>.
After that, we want to also allow various other interactive slicers so the user can restrict the data further, e.g. by country or account manager.
Power BI naturally imposes AND between all filters and doesn't have a neat way to specify OR. Can you suggest a way to define a calculation using the two numeric fields that is then applied as a filter within the same interactive dashboard screen? Alternatively, is there a way to first prompt the user for the two threshold values before the dashboard is displayed -- so when they click Submit on that parameter-setting screen they are then taken to the main dashboard screen with the disjunction already applied?
Added in response to a comment:
The data can be quite simple: no complexity there. The complexity is in getting the user interface to enable a disjunction.
Suppose the data was a list of customers with customer id, country, gender, total value of transactions in the last 12 months, and number of purchases in last 12 months. I want the end-user (with no technical skills) to specify a minimum threshold for total value (e.g. $1,000) and number of purchases (e.g. 10) and then restrict the data set to those where total value of transactions in the last 12 months > $1,000 OR number of purchases in last 12 months > 10.
After doing that, I want to allow the user to see the data set on a dashboard (e.g. with a table and a graph) and from there select other filters (e.g. gender=male, country=Australia).
The key here is to create separate parameter tables and combine conditions using a measure.
Suppose we have the following Sales table:
Customer Value Number
-----------------------
A 568 2
B 2451 12
C 1352 9
D 876 6
E 993 11
F 2208 20
G 1612 4
Then we'll create two new tables to use as parameters. You could do a calculated table like
Number = VALUES(Sales[Number])
Or something more complex like
Value = GENERATESERIES(0, ROUNDUP(MAX(Sales[Value]),-2), ROUNDUP(MAX(Sales[Value]),-2)/10)
Or define the table manually using Enter Data or some other way.
In any case, once you have these tables, name their columns what you want (I used MinCount and MinValue) and write your filtering measure
Filter = IF(MAX(Sales[Number]) > MIN(Number[MinCount]) ||
            MAX(Sales[Value]) > MIN('Value'[MinValue]),
            1, 0)
Then put your Filter measure as a visual level filter where Filter is not 0 and use the MinCount and MinValue columns as slicers.
If you select 10 for MinCount and 1000 for MinValue then your table should look like this:
Notice that C, E and G only exceed one of the thresholds and that A and D are excluded.
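To see which rows survive the OR condition, here is a small Python check over the Sales table above, using the same thresholds (10 for the count, 1000 for the value). It only reproduces the logic of the measure; the function name is made up and nothing here is Power BI code.

```python
# The Sales table from above: (Customer, Value, Number).
sales = [
    ("A", 568, 2),
    ("B", 2451, 12),
    ("C", 1352, 9),
    ("D", 876, 6),
    ("E", 993, 11),
    ("F", 2208, 20),
    ("G", 1612, 4),
]


def passes_filter(value, number, min_value, min_count):
    """Mirror of the measure: keep the row if either threshold is exceeded."""
    return number > min_count or value > min_value


kept = [customer for customer, value, number in sales
        if passes_filter(value, number, min_value=1000, min_count=10)]
# kept == ['B', 'C', 'E', 'F', 'G']
```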
To my knowledge, there is no such built-in slicer feature in Power BI at the moment. There is, however, a suggestion in the Power BI forum requesting functionality like this. If you are willing to use the Power Query Editor, it's easy to obtain the values you're looking for, but only with hard-coded values for your limits or thresholds.
Let me show you how for a synthetic dataset that should fit the structure of your description:
Dataset:
CustomerID,Country,Gender,TransactionValue12,NPurchases12
51,USA,M,3516,1
58,USA,M,3308,12
57,USA,M,7360,19
54,USA,M,2052,6
51,USA,M,4889,5
57,USA,M,4746,6
50,USA,M,3803,3
58,USA,M,4113,24
57,USA,M,7421,17
58,USA,M,1774,24
50,USA,F,8984,5
52,USA,F,1436,22
52,USA,F,2137,9
58,USA,F,9933,25
50,Canada,F,7050,16
56,Canada,F,7202,5
54,Canada,F,2096,19
59,Canada,F,4639,9
58,Canada,F,5724,25
56,Canada,F,4885,5
57,Canada,F,6212,4
54,Canada,F,5016,16
55,Canada,F,7340,21
60,Canada,F,7883,6
55,Canada,M,5884,12
60,UK,M,2328,12
52,UK,M,7826,1
58,UK,M,2542,11
56,UK,M,9304,3
54,UK,M,3685,16
58,UK,M,6440,16
50,UK,M,2469,13
57,UK,M,7827,6
Desktop table:
Here you see an Input table and a subset table using two Slicers. If the forum suggestion gets implemented, it should hopefully be easy to change a subset like below to an "OR" scenario:
Transaction Value > 1000 OR Number of purchases > 10 using Power Query:
If you use Edit Queries > Advanced filter you can set it up like this:
The last step under Applied Steps will then contain this formula:
= Table.SelectRows(#"Changed Type2", each [NPurchases12] > 10 or [TransactionValue12] > 1000)
Now your original Input table will look like this:
Now, if only we were able to replace the hardcoded 10 and 1000 with a dynamic value, for example from a slicer, we would be fine! But no...
I know this is not what you were looking for, but it was the best 'negative answer' I could find. I guess I'm hoping for a better solution just as much as you are!

How do I add information to specific cells in a Pandas DataFrame based on the variable name of the column as well as the row?

Note: I have simplified this code and some background information quite a bit in order to pinpoint the exact issue. If you want me to explain additional aspects of this code, please ask in the comments.
I am currently trying to make a code capable of aggregating two or more dictionaries together to ultimately make a .csv file containing the following basic set up:
                TF-A       TF-A  TF-B       TF-B
ids  gene name  score sum  hits  score sum  hits  ...  gene description
id1  gene A     53.85      14    37.65      7     ...  stuff
id2  gene B     97.55      11    63.94      8     ...  stuff
id3  gene C     88.67      9     79.43      12    ...  stuff
id4  gene D     69.35      12    13.03      13    ...  stuff
...  ......     ......     ...   ...        ...   ...  ..........
idx  gene Z     49.32      8     84.03      10    ...  stuff
The dictionaries contain the names of genes as keys, with corresponding values being an array of scores (these scores are generated by calculating the probability of a transcription factor, a.k.a. TF, binding to a gene at a certain position). Each TF has one dict, and its keys only contain genes that it has at least one score with. After the dictionaries are opened, set intersection is used in order to generate a list of genes that all given transcription factors have in common, which are then organized into the dataframe in the "gene name" column in a for loop. Because of previous class structures I built before (not shown), I can retrieve the common name and description of each gene easily and place it in the "gene description" column using df.set_value.
for id in common_names:
    gene_name = idconverter.getgene(id).gene_name
    gene_description = idconverter.getgene(id).description
    df.set_value(id, 'gene name', gene_name)
    df.set_value(id, 'gene description', gene_description)
However, the number of columns in the dataframe is dependent on the number of transcription factors the user wishes to analyze. So, a user putting in two TFs will add four columns to the dataframe– two columns for TF-A (sum of scores and number of scores, or 'hits'), and two columns for TF-B. Inputting three TFs will yield six columns, inputting four TFs will yield eight columns, and so on. The ID, gene name, and description columns are constant.
So, before I build the dataframe, I make a list that expands for every TF given.
ColumnList = []
for tf_id in tf_id_list:
    ColumnList += ['{} Total Sum of Hits'.format(tf_id), '{} Number of Hits'.format(tf_id)]
With this list, I concatenate it with the other column names, and then instantiate my dataframe.
df= DataFrame(columns= ['ids','gene name']+ColumnList+['gene description'])
As shown above, I can easily set the names and description in the correct cell in the dataframe. I can also easily calculate the number of hits and total sum of scores for each gene and TF from the original dicts, but I have NO idea how to place this information in the corresponding cell. Because the number of columns depends on the inputted TFs, I do not know what kind of code I should write to accommodate this variable number of columns, or how to specify a column based on its corresponding TF. Can anybody recommend the proper code setup and/or methods?
I have done some research, and I did see a method whereby one can add in a certain piece of information into a cell based on what kind of information is present in another column (but in the same row):
Modifying a subset of rows in a pandas dataframe
df.ix[df.A==0, 'B'] = np.nan
If you read the code in the link provided above, this piece of code adds a NaN to the 'B' column whenever a zero is present in the 'A' column. I thought I could use this approach, but I need to add the number of hits and total sum of scores based on whether they relate to the first TF, second TF, third TF, and so on. Would one write:
for id in common_genes:
    for tf_id in tf_id_list:
        df.ix[df.'{} Number of Hits'.format(tf_id)== tf_id]= number_hits
        df.ix[df.'{} Total Sum of Scores'.format(tf_id)== tf_id]= sum_scores
I don't believe that's correct, since my IDE says the syntax does not compile. I should also note that I have simplified the above code a little bit: the variables "number_hits" and "sum_scores" are actually derived from a dictionary that contains gene names as keys, and a list of hits, score sum, and pertaining TF name as values.
In the end, I decided to make a dict of dicts, having realized that this data structure was the best fit for what I needed. The information from the original dicts is stored inside one dict (the "total_dict"), and it is accessed if and only if the gene is present in the list of common_genes (which was derived from the set of genes that all the transcription factors have in common). A for loop goes through the TFs of the total_dict to determine whether the gene id is present in the TF dict, and if it is, the information in the value (i.e. sum of scores and number of hits) is added to the correct row (based on the gene name, or id) and column (based on the TF at hand).
for id in common_names:
    gene_name = idconverter.getgene(id).gene_name
    gene_description = idconverter.getgene(id).description
    # df.loc replaces df.set_value, which was removed in pandas 1.0
    df.loc[id, 'gene name'] = gene_name
    df.loc[id, 'gene description'] = gene_description
    for tf_id in total_dict.keys():
        if id in total_dict[tf_id].keys():
            # assumed layout: index 0 is the score sum, index 1 the number of hits
            df.loc[id, '{} Score Sum'.format(tf_id)] = total_dict[tf_id][id][0]
            df.loc[id, '{} Number of Hits'.format(tf_id)] = total_dict[tf_id][id][1]
print(df.head())
Long story short, if you are dealing with a lot of different kinds of data, make SURE the kind of data structures you work with can be manipulated in order to build the kind of results you want to generate.
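As a rough illustration of why the dict of dicts helps, the sketch below builds one row per common gene with column names generated from each TF id, then writes the rows as CSV. It deliberately uses only the standard library (csv.DictWriter) rather than pandas, so it is a stand-in for the approach above and not the original code. The contents of total_dict, the (score sum, hits) tuple layout and the column labels are all assumptions.

```python
import csv
import io

# Hypothetical input: total_dict[tf_id][gene_id] = (score sum, number of hits).
total_dict = {
    "TF-A": {"id1": (53.85, 14), "id2": (97.55, 11)},
    "TF-B": {"id1": (37.65, 7), "id2": (63.94, 8), "id3": (79.43, 12)},
}

# Genes shared by every transcription factor (set intersection of the keys).
common_ids = sorted(set.intersection(*(set(genes) for genes in total_dict.values())))

# Two columns per TF, however many TFs were supplied.
columns = ["ids"]
for tf_id in total_dict:
    columns += ["{} Score Sum".format(tf_id), "{} Number of Hits".format(tf_id)]

# One row per gene; the TF id picks the column, the gene id picks the row.
rows = []
for gene_id in common_ids:
    row = {"ids": gene_id}
    for tf_id, genes in total_dict.items():
        score_sum, hits = genes[gene_id]
        row["{} Score Sum".format(tf_id)] = score_sum
        row["{} Number of Hits".format(tf_id)] = hits
    rows.append(row)

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

The same list of row dicts can also be handed to pandas.DataFrame(rows, columns=columns) if a dataframe is still wanted before export.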