I am currently developing a sentiment index using Google search frequencies taken from Google Trends.
I am using Stata 12 on Windows.
My approach is as follows:
I downloaded ~150 business-related search queries from Google Trends, covering Jan 2004 to Dec 2013.
I now want to construct an index using the 30 queries that are, at each point in time, most relevant to the market I observe.
To achieve that I want to use monthly expanding-window regressions (looking backward from each index month) of each query on the market.
Thus I need to regress the 150 items one-by-one on the market 120 times (12 months x 10 years), using different time windows, and then extract the 30 queries with the most negative t-statistics.
To exemplify the procedure: if I wanted to construct the sentiment index for January 2010, I would regress the query terms on the market over the period from Jan 2004 to Dec 2009 and then extract the 30 queries with the most negative t-statistics.
Now I am looking for a way to automate this as much as possible. I guess I should be able to run the 150 items at once, and I can specify the time window using the time stamps. Using Excel commands to create a do-file with all the regression commands in it (which would be quite large), I could probably set up the regressions relatively efficiently (although it depends on how much Stata can handle; does anyone have experience with that?).
What I would need to make the data extraction much easier is a command I can use to rank the regression results according to their t-statistics. Does anyone have an efficient approach to this, or general advice?
If you are using Stata: once you run a ttest, you can type return list and you will see the scalars Stata stores (after estimation commands such as regress, use ereturn list). Once you run a loop, you can store these values in a number of different ways; check out the postfile and post commands.
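For the setup described in the question, a minimal sketch of such a loop, assuming a monthly tsset dataset in which the market is a variable named market and the queries are named q1-q150 (all hypothetical names):

capture postclose results
postfile results str32 query double tstat using tstats, replace
foreach v of varlist q1-q150 {
    quietly regress `v' market if tin(2004m1, 2009m12)
    post results ("`v'") (_b[market]/_se[market])
}
postclose results
use tstats, clear
sort tstat
list query tstat in 1/30   // the 30 most negative t-statistics

Wrapping the tin() window in a second loop over end months would give the 120 expanding windows without hand-writing a huge do-file.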
I'm trying to figure out a solution to my problem. Basically, we get a monthly report with about 3,000 records, a bunch of reporting is done on it, and there are calculations based on various columns, e.g.
Date         Total usage   Recommended reduction   Product
01.01.2022   1000          500                     A
01.01.2022   1300          70                      B
01.01.2022   2000          900                     C
...          ...           ...                     ...
At the end of it, Power BI kindly sums up the columns, which is great. What I am trying to do now is take the sums of these columns and store them in a summary table like the one below, so that I can use it for a time series visual:
Month      Sum Total Usage   Sum Recommended Reduction
January    59720             12040
February   81020             20580
...        ...               ...
I have no idea how to go about doing this. Is this the right way to go? Or is there a way to create the visual without having to create a summary table? I'm at a bit of a loss, so any suggestions would be really appreciated.
You don't need any DAX calculations for that. Simply pull your data onto the fields of a line chart visual. Note that you have to drill down from Year to Month to actually see the lines.
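If you would still prefer a physical summary table (for example, to reuse the monthly sums elsewhere), here is a minimal sketch of a DAX calculated table; the table name 'Data' and a pre-existing Month column are assumptions for illustration:

SummaryTable =
SUMMARIZECOLUMNS (
    'Data'[Month],
    "Sum Total Usage", SUM ( 'Data'[Total usage] ),
    "Sum Recommended Reduction", SUM ( 'Data'[Recommended reduction] )
)

For just the line chart, though, the drill-down approach above is simpler.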
I'm new to google automl tables and have a basic question about which data is worthwhile including in the training of my model.
I have a dataset of golfers and will be looking at the averages of scores over different periods. For example, average over the past 3 months, 6 months, 1 year etc.
My question is: is it also worthwhile including the sample size for each date range for each player? For example, over the past 3 months some players will have a sample size of 28 while some will only have 2, and those players with 28 rounds will have more accurate averages than those with 2. However, I didn't know whether Google AutoML Tables would pick up this link automatically, whether I could create a separate weighting/reliability variable, or whether there's a way to specify a link between columns. Or is this automated kind of AutoML simply not suitable for this?
Thanks in advance
I am simulating pga tournaments using Stata. My simulation results table consists of:
column 1: the names of the 30 players in the tournament
columns 2 to 30,001: the 4-round results of my 30,000 Monte Carlo simulations.
What I am trying to do is create a 30 x 30 matrix with the golfers' names down the first column and across the column headers, where each cell represents the percentage of times Golfer A beat Golfer B outright across the 30,000 simulations. Is this possible to do in Stata? Thanks
I tend to say that everything is always possible in all programming languages, but some things are much more difficult to do in some languages than in others. I do not think that Stata is a great tool for what you intend to do.
You would need to provide some code examples for us to be able to help you further, but here is one thing I can say. Stata has two programming languages: one is often called Stata (but is called ado on StataCorp's website) and the other is Mata. If you for some reason need to use the software Stata, you should do this in the language Mata, which has more matrix operators than ado. Also, in ado you can't store text in a matrix, so if you want to store the golfers' names you need to use Mata; alternatively, you can use row and column indexes to keep track of the golfers.
With that said, Stata is primarily a tool for operating on and analyzing a single dataset loaded into memory (support for multiple datasets has been added recently). So to answer your question: yes, this can be done in Stata, but you are probably much better off doing it in a language with more support for multidimensional arrays/vectors, for example R or Python.
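For illustration, here is a minimal Mata sketch under assumptions the question does not state: the golfer names sit in the first variable, each remaining variable is one simulated 4-round total, and a lower total beats a higher one (ties count as losses here):

mata:
    S = st_data(., (2..st_nvar()))     // 30 x 30,000 matrix of simulated totals
    n = rows(S)
    W = J(n, n, .)                     // head-to-head win percentages
    for (i = 1; i <= n; i++) {
        for (j = 1; j <= n; j++) {
            // share of simulations in which golfer i scored lower than golfer j
            if (i != j) W[i, j] = 100 * mean((S[i, .] :< S[j, .])')
        }
    }
    st_matrix("winpct", W)             // export the matrix back to Stata
end
matrix list winpct

The rows and columns of winpct follow the row order of the dataset, so the names in the first variable identify the golfers.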
I have a couple of different tables in my report. For demonstration purposes, let's say I have one data source of actual invoice amounts and another table of forecasted amounts. Each table has several dimensions that are the same between them, let's say Country, Region, Product Classification, and Product.
What I want is to be able to display a table/matrix that pulls information from both of these data sources, like this:
Description              Invoice   Forecast   vs Forecast
USA                      300       325        92%
  East                   150       175        86%
    Product Grouping 1   125       125        100%
      Product 1          50        75         67%
      Product 2          75        50         150%
    Product Grouping 3   25        50         50%
      Product 3          25        50         50%
  West                   150       150        100%
    Product Grouping 1   75        100        75%
      Product 1          25        50         50%
      Product 2          50        50         100%
    Product Grouping 3   75        50         150%
      Product 3          75        50         150%
I have not been able to figure out a way to combine the information from multiple data sources into a single matrix table, so any help would be appreciated. The one thing I did find was somebody who hard-coded the structure of the rows into a separate data source and then used DAX expressions to pull the pieces of information into the columns, but I don't like this solution because the structure of the rows is not constant.
What you're asking about is a common part of the star schema: combining facts from different fact tables together into a single visual or report.
What Not To Do (That You Might Be Tempted To)
What you don't want to do is combine the 2 fact tables into a single table in your Power BI data model. That's a lot of work and there's absolutely no need. Especially since there are likely dimensions that the 2 fact tables do not have in common (e.g. actual amounts might be associated with a customer dimension, but forecast amounts wouldn't be).
What you also don't want to do is relate the 2 fact tables to each other in any way. Again, that's a lot of work. (Especially since there's no natural way to relate them at the row level.)
What To Do
Generally, how you handle 2 fact tables is the same as you handle a single fact table. First, you have your dimensions (country, region, classification, product, date, customer). Then you load your fact tables, and join them to the dimensions. You do not join your fact tables to each other. You then create measures (i.e. DAX expressions).
When you want to combine measures from the two facts together in a single matrix, you only use rows/columns that are meaningful to both fact tables. For example, actual amounts might be associated with a customer, but forecast amounts aren't. So you can't include customer information in the matrix. Another possibility is that actual amounts are recorded each day, whereas forecasts were done for the whole month. In this situation, you could put month in your matrix (since that's meaningful to both), but you wouldn't want to use date because Power BI wouldn't know how to divide up forecasts to individual dates.
As long as you're only using dimensions & attributes that are meaningful to both fact tables, you can easily create a matrix as you envision above. Simply drag on the attributes you want, then add the measures (i.e. DAX expressions).
The Invoice & Forecast columns would both be measures. The two measures from different fact tables can be combined into a 3rd measure for the vs. Forecast measure. Everything will work as long as you're just using dimensions/attributes that mean something to both fact tables.
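For example, a minimal sketch of the three measures; the fact table and column names (FactInvoice[Amount], FactForecast[Amount]) are hypothetical:

Invoice = SUM ( FactInvoice[Amount] )
Forecast = SUM ( FactForecast[Amount] )
vs Forecast = DIVIDE ( [Invoice], [Forecast] )

DIVIDE is used instead of / so the third measure returns blank rather than an error wherever the forecast is zero or missing.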
I don't see anything in your proposed pivot table that strikes me as problematic.
Other Situations
If you have a situation where forecasts are at a month level and actual is at a date level, then you may be wondering how you'd relate them both to the same date dimension. This situation is called having different granularities, and there's a good article here I'd recommend reading that has advice: https://www.daxpatterns.com/handling-different-granularities/. Indeed, there's a whole section on comparing budget with revenue that you might find useful.
Finally, you mention that someone hard-coded the structure of the rows and used DAX expressions to build everything. This does, admittedly, sound like overkill. The goal with Power BI is flexibility. Once you have your facts, measures & dimensions, you can combine them in any way that makes sense. Hard-coding the rows eliminates that flexibility, and is a good clue that something isn't right. (Another good clue that something isn't right is when DAX expressions seem really complicated for something that should be easy.)
I hope my answer helps. It's a general answer since your question is general. If you have specific questions about your specific situation, definitely post additional questions. (Sample data, a description of the model, the problem you're seeing, and what you want to see is helpful to get a good answer.)
If you're brand new to Power BI, data models, and the star schema, Alberto Ferrari and Marco Russo have an excellent book that I'd recommend reading to get a crash course: https://www.sqlbi.com/books/analyzing-data-with-microsoft-power-bi-and-power-pivot-for-excel/
I have a small project where I need to tabulate a dataset with frequencies in various ways and export those tables in a large Excel sheet. Unfortunately, copy and paste truncates text-labels and causes lots of other issues for us.
Is there a way to save/export the result into a CSV or Excel format?
That is, something similar to the write.table command in R (R itself I can't install at work).
Update 1:
The Stata FAQ provided three solutions which would work for us: http://www.stata.com/support/faqs/data-management/copying-tables/, but Stata support followed up shortly after with a mail pointing to tabout, whose tutorial displayed some truly beautiful tabulations.
We've had some progress with tabout, though we are not really sure it will do everything we need. So far, creating tabulations with tabout D7 using test.xls works nicely, although without the proper alignment of labels and such that you would get when generating LaTeX.
Update 2:
OK, so lots of tables weren't as straightforward as with tabulate combined with the by command; some programming was required (not doable at my current Stata skill level). The lack of native support for simply exporting any result is a real pain!
outreg is not going to work, as it only works with estimation (regression-like) results. xml_tab can probably produce anything you like (findit xml_tab to install). Obviously, you can export excel your data, although if you need frequency tables, you would probably want to collapse (count) ..., by(varlist) your data first. (I hate collapse, though, as I think it is a poor design that you need to destroy and reload your data; this is one example where R's concept of objects comes in handier than Stata's idea of having only one dataset in memory at a time.)
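A minimal sketch of that collapse-then-export route, wrapped in preserve/restore so the data in memory survive (the file name freqtable.xlsx is arbitrary; auto is just a toy dataset to make the example self-contained):

sysuse auto, clear
preserve
* one row per rep78 x foreign cell, with the cell frequency
collapse (count) freq = mpg, by(rep78 foreign)
export excel using freqtable.xlsx, firstrow(variables) replace
restore

Note that export excel requires Stata 12 or newer; on older versions, outsheet writes the same table as tab-delimited text.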
When I want tabulated output from anything, whether tabulate or regress or clogit, I always close the current log file and begin a new one, not in the .smcl format but with a .log suffix. That is handy because usually I want to keep a lot of the values that clogit returns.
something along the lines of...
* close any open log, even if there isn't one
capture log close
log using NAMEOFOUTPUT.log
* run something like tab, reg, or clogit here
log close
Your tabulated results from whichever command will then be in that .log file.
Could outreg be a solution?
http://www.kellogg.northwestern.edu/rc/stata-outreg.htm
Since the above will only do regression tables, estout is a good alternative. Its companion command estpost, I believe, creates tables for tabulations:
http://repec.org/bocode/e/estout/estpost.html
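A minimal sketch of that estpost route, following the pattern documented on the estout site (the output file name is arbitrary):

ssc install estout
sysuse auto, clear
estpost tabulate rep78
* b = frequencies, pct = percentages, cumpct = cumulative percentages
esttab using rep78.csv, cells("b pct cumpct") replace

esttab writes CSV, RTF, or LaTeX depending on the file extension you give it.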
For one-way frequency tables, the fre module can be quite handy too. Output can be written to a tab-delimited table or to LaTeX.
sysuse auto, clear
fre rep78
rep78 -- Repair Record 1978
-----------------------------------------------------------
| Freq. Percent Valid Cum.
--------------+--------------------------------------------
Valid 1 | 2 2.70 2.90 2.90
2 | 8 10.81 11.59 14.49
3 | 30 40.54 43.48 57.97
4 | 18 24.32 26.09 84.06
5 | 11 14.86 15.94 100.00
Total | 69 93.24 100.00
Missing . | 5 6.76
Total | 74 100.00
-----------------------------------------------------------
Download and more info on SSC:
http://ideas.repec.org/c/boc/bocode/s456835.html