Sum calculations in Stata

I am having trouble managing data in Stata. I tried searching, but perhaps my search terms were wrong, so I apologize if this question has already been asked.
As the picture shows, I need to count countries: for every id, the number of times each country occurs. The dataset is huge, so I need a fast way to do this.
Please ask if anything is unclear.

Not entirely clear what you have in mind, but it sounds like you want:
bysort id country: generate count = _N
If not, a clearer example with fewer countries would be helpful.
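To see what that line produces, here is a minimal sketch on a made-up dataset. The values and the six-observation layout are assumptions for illustration only; only the variable names id and country come from the question.

```stata
* Toy data (hypothetical): two ids, a few countries each
clear
input id str10 country
1 "France"
1 "France"
1 "Spain"
2 "Italy"
2 "Italy"
2 "Italy"
end

* _N is the number of observations in each id-country group
bysort id country: generate count = _N

list, sepby(id)
```

Every observation carries the size of its own (id, country) group, so a country that occurs twice for an id shows 2 on both rows.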

Related

Ranking sum of data by two categories

I have data that I would like to rank by two separate categories, State and ServiceType. Essentially, there are multiple years of data for each ServiceType across various states, and I was hoping to get the sum of all years for each ServiceType by State, meaning each State is treated independently and the sums of the various categories are ranked only within that state, not nationally.
I've tried
bys State ServiceCategory (quant_variable): ///
egen rank_quant_variable= rank(sum(quant_variable)), field
as well as a version of the above in which I used a pre-calculated sum variable. Neither really works.
This lacks a reproducible example: you neither give your data nor phrase your problem in terms of a dataset we could download, such as one of the example datasets that can be loaded from within Stata. There is no need to give the full dataset, just a minimal example with the same structure.
The call to sum() here would be to Stata's sum() function, which yields the cumulative or running sum, which evidently isn't what you want. So that case is easy to dismiss.
The remaining problem is what exactly you did in the code you don't show, the version with a pre-calculated sum.
At a guess you worked out
bys State ServiceCategory: egen sum = total(quant_variable)
and then pushed that sum through rank(). But that would use each value of sum as many times as it occurred.
Perhaps you want something more like this:
egen tag = tag(State ServiceCategory)
bysort State: egen rank_quant_variable = rank(sum) if tag, field
bysort State ServiceCategory (rank_quant_variable): replace rank_quant_variable = rank_quant_variable[1]
But it's really hard (for me) to visualize this without details on what you did or an example to work on.
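For anyone wanting to experiment, the approach can be tried on a small sandbox dataset. This is only a sketch: the data values and the names total_quant and rank_total are invented here, and the last step is run within each State and ServiceCategory group so that each group's own rank is copied to all of its observations.

```stata
* Toy data (hypothetical): two states, two service categories, two years
clear
input str2 State str1 ServiceCategory year quant_variable
"NY" "A" 2019 10
"NY" "A" 2020 20
"NY" "B" 2019 50
"NY" "B" 2020 5
"TX" "A" 2019 7
"TX" "A" 2020 8
"TX" "B" 2019 1
"TX" "B" 2020 2
end

* Total over all years for each State-ServiceCategory combination
bysort State ServiceCategory: egen total_quant = total(quant_variable)

* Tag one observation per combination so each total is ranked only once
egen tag = tag(State ServiceCategory)

* Rank the totals within each State; field gives rank 1 to the largest
bysort State: egen rank_total = rank(total_quant) if tag, field

* Missing values sort last, so [1] is the tagged observation's rank;
* copy it to the other observations of the same combination
bysort State ServiceCategory (rank_total): replace rank_total = rank_total[1]

list, sepby(State)
```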

Clarification on tabstat use after bysort in Stata

I have a rather simple question regarding the output of tabstat command in Stata.
To be more specific, I have a large panel dataset containing several hundred thousands of observations, over a 9 year period.
The context:
bysort year industry: egen total_expenses=total(expenses)
This line should create total expenses by year and industry (or sum of all expenses by all id's in one particular year for one particular industry).
Then I'm using:
tabstat total_expenses, by(country)
As far as I understand, tabstat should show in a table format the means of expenses. Please do note that ids are different from countries.
In this case, does tabstat calculate the means over all 9 years and all industries for a particular country, or is it just the mean of one year and one industry for each country in my panel data?
What would happen if this command is used in the following context:
bysort year industry: egen mean_expenses=mean(expenses)
tabstat mean_expenses, by(country)
Does tabstat create means of means? This is a little confusing.
I don't know what is confusing you about what tabstat does, but you need to be clear about what calculating means implies. Your dataset is far too big to post here, but for your sake as well as ours creating a tiny sandbox dataset would help you see what is going on. You should experiment with examples where the correct answer (what you want) is obvious or at least easy to calculate.
As a detail, your explanation that ids are different from countries is itself confusing. My guess is that your data are on firms and the identifier concerned identifies the firm. Then you have aggregations by industry and by country and separately by year.
bysort year industry: egen total_expenses = total(expenses)
This does calculate totals and assigns them to every observation. Thus if there are 123 observations for industry A and 2013, there will be 123 identical values of the total in the new variable.
tabstat total_expenses, by(country)
The important detail is that tabstat by default calculates and shows a mean. It just works on all the observations available, unless you specify otherwise. Stata has no memory or understanding of how total_expenses was just calculated. The mean will take no account of different numbers in each (industry, year) combination. There is no selection of individual values for (industry, year) combinations.
Your final question really has the same flavour. What your command asks for is a brute force calculation using all available data. In effect your calculations are weighted by the numbers of observations in whatever combinations of industry, country and year are being aggregated.
I suspect that you need to learn about two commands (1) collapse and (2) egen, specifically its tag() function. If you are using Stata 16, frames may be useful to you. That should apply to any future reader of this using a later version.
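A tiny sandbox along these lines may make the difference visible. It is a sketch under assumptions: the data values and the names tag and industry_total are made up, and only year, industry, country, expenses and total_expenses come from the question.

```stata
* Toy data (hypothetical): firms in two countries and two industries
clear
input id str1 country str1 industry year expenses
1 "X" "A" 2013 10
2 "X" "A" 2013 20
3 "Y" "A" 2013 30
4 "Y" "B" 2013 40
1 "X" "A" 2014 50
4 "Y" "B" 2014 60
end

bysort year industry: egen total_expenses = total(expenses)

* Brute force: every observation counts, so a total that is
* repeated on many rows gets correspondingly more weight
tabstat total_expenses, by(country)

* One value per (industry, year) combination instead
egen tag = tag(industry year)
tabstat total_expenses if tag, statistics(mean)

* collapse reduces the data to one row per combination
preserve
collapse (sum) industry_total = expenses, by(year industry)
list
restore
```

Comparing the two tabstat results shows how much the repeated totals influence the first mean.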

DAX function to check the amount of dates that is greater than another set of dates in 2 different columns

I'm currently doing an internship where I have been asked to make a few visuals in Power BI.
I've searched around and tried a couple of things, but the truth is I am very much a beginner at coding and functions in general. I only had basic courses in different languages during my education and, to be fair, this is a bit outside my scope of work.
So I have 2 columns I need to compare in order to find out how many dates in column 2 are greater than the dates in column 1.
So I'm imagining something like:
Measure = IF[(Investments(Expected closure)]<[(Investments(Actualclosure)]
Basically I want an overview of how many investments have a later closure date than expected.
Next thing would possibly be to create a boxplot showing the distribution (by how far we are off).
I know this is very basic, and possibly not formulated in the best way possible, please let me know if you need any more information.
Thanks in advance
You can use a calculated column as a flag to identify if actual date > expected date and then count the flag.
Flag = IF('Table'[Act] > 'Table'[Exp], 1, 0)
Hope this helps. Thanks.
Welcome to the community. Be sure to read this before posting questions.
For your question, you can use the following code. You filter the table by your logical expression with FILTER and then count the rows of a column with COUNTA.
Measure = CALCULATE(COUNTA('Investments'[Actualclosure]),FILTER(ALL('Investments'),'Investments'[Actualclosure]>'Investments'[Expected closure]))
Hope this solves your problem.

Create link between observations clustered within village

Hi, I am trying to create dyads from household ids clustered within villages in Stata. My problem is that I do not know how to do something like a vlookup in order to get, for every household, a list of the household ids linked to it.
Without a bit more information this question is tough to answer, but here are some places to look. First, tabulate lets you see your data broken down by variables. Second, the bysort and gen commands together will probably give you the answer you're looking for, although it is tough to tell from the question. Finally, if your village variable is a string, you may want to look into encode, which will give you a unique id for each village.
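One possibility not named in the answer above is joinby, which forms all pairwise combinations of observations within groups. The following is only a sketch under assumptions: the variable names village, hhid and hhid_partner and the data values are hypothetical, since the question does not show its data.

```stata
* Toy data (hypothetical): households nested in villages
clear
input village hhid
1 101
1 102
1 103
2 201
2 202
end

* Save a copy in which the household id has a different name
tempfile partners
preserve
rename hhid hhid_partner
save `partners'
restore

* Pair every household with every household of the same village
joinby village using `partners'

* Drop the pairing of each household with itself
drop if hhid == hhid_partner

sort village hhid hhid_partner
list, sepby(village)
```

This keeps both orderings of each pair; adding a condition such as hhid < hhid_partner to the drop would keep each dyad once.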

How do I get a group minimum over level combinations of factors?

I would like to find minimum values within groups. In stata, I think it is simply "by group, sort : egen minvalue=min(value)"...
I tried to mess around with ave and rowsum, but to no avail.
ave(value, group, FUN=min) did not work.
Sorry this answer is a little late but, in case you are still looking for the answer or for future searchers here goes....
You are onto the right track with the -by- command. Here is what I'd do to find the lowest price of cars in the auto.dta dataset by domestic/foreign grouping.
sysuse auto, clear
bysort foreign : egen minprice = min(price)
What this does is create a new variable 'minprice' that holds the minimum price of domestic cars if a given car (observation) is domestic, and the minimum price of foreign cars if it is foreign. So this new variable will have just two values in this example, and you can check that by doing:
tabulate minprice
Depending on why you wanted to find the minimum values by group this may not be what you had in mind but hopefully someone finds it helpful.
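As a quick check of the result, a short sketch using the same auto.dta example (the assert and list steps are additions for illustration, not part of the answer above):

```stata
sysuse auto, clear
bysort foreign: egen minprice = min(price)

* Within each group no price can be below the group minimum
assert price >= minprice

* Show the cheapest car(s) in each group
list make foreign price if price == minprice
```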