How to retain manually selected features rather than blindly doing a PCA or LDA?

While doing a PCA, the features get reduced from, say, 200,000 to 2,000, but some features we need can be lost in the process. How can we manually retain the features we select?

Related

Drawing a multi-line histogram in Power-BI

I have a set of courses in PowerBI, and I have the student grades for each course. I would like to compare the grade distribution and for this purpose, I would like to present a histogram (or a line) for each course in the same visualization.
Microsoft deprecated its Histogram visualization, so I looked for third-party tools. There are two options, but neither meets my need. I know I could somehow define grade bins and a measure that counts the number of students in each bin. However, it seems to me that such a fundamental visualization must have a better solution.
Can you suggest an alternative?

Independent variable to find seasonality effect?

I'm not sure if this is the right place to ask, but any help is greatly appreciated. I'm working in SAS Forecast Studio.
This is my time series dataset (quarterly data):
Date e.g. 1-Jan-80, 1-Apr-80, 1-Jul-80
DateQ e.g. 1980Q1, 1980Q2, 1980Q3
Year e.g. 1980, 1981, 1982
GDP (dependent variable) e.g. 2650.1
T e.g. 1, 2, 3
Which of these variables should I use (or should I create a new quarterly variable?) as an independent variable in a linear regression to evaluate whether there is a seasonal effect?
Seasonal effects should not be identified using a simple linear regression on the time variable when analyzing time-series data. But to answer your question directly: use the date with the intnx() function to convert it to a quarter.
data want;
    set have;
    format quarter yyq.;                      /* display as 1980Q1, 1980Q2, ... */
    quarter = intnx('quarter', date, 0, 'B'); /* beginning-of-quarter date */
run;
Seasonal effects can be identified in a number of ways:
1. Graphing it
If a time series has a seasonal effect, it will usually be obvious: simply plotting the data will show whether it is seasonal at your chosen interval.
In sashelp.air, it's very clear that there is a 12-month season.
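For example, outside Forecast Studio a quick proc sgplot of sashelp.air makes the yearly pattern easy to spot (a minimal sketch; it assumes ODS graphics is enabled):
ods graphics on;
proc sgplot data=sashelp.air;
    series x=date y=air / markers;  /* repeating yearly peaks are clearly visible */
run;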
2. Spectral Density Analysis
proc timeseries will give you a spectral density analysis to help identify significant seasons within the data. Peaks indicate possible cycles or seasons. You will need to filter the output to a reasonable range of periods, since the density can increase sharply beyond a certain point without representing a true season.
Forecast Studio and Time Series Studio will do this for you and can give you similar output to the below.
/* Periodogram of the series to look for candidate seasons and cycles */
proc timeseries data=sashelp.air outspectra=outspectra;
    id date interval=month;
    var air;
    spectra;
run;

/* Plot the spectral density against the period, restricted to periods 1-24 */
proc sgplot data=outspectra;
    where period between 1 and 24;
    scatter x=period y=p;
    series x=period y=p;
run;
We can see a strong indication of a 12-month season. We also see some potential 3-month and 6-month cycles that could be tested for significance within a model.
3. ACF/PACF/IACF plots
Your ACF/PACF/IACF plots in Forecast Studio will also help you identify clear seasons.
The classic decaying "suspension bridge" look is indicative of a seasonal effect: the autocorrelations rise as the lag approaches 12 and then fall off again. Additionally, the significant negative spike at lag 12 in the PACF and IACF plots is another indicator of a significant seasonal effect at 12.
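If you want the same plots outside Forecast Studio, a plain identify step in proc arima will produce ACF, PACF, and IACF panels (a minimal sketch on sashelp.air; the lag count is only illustrative):
ods graphics on;
proc arima data=sashelp.air;
    identify var=air nlag=36;  /* ACF/PACF/IACF panels out to lag 36 */
run;
quit;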
Model Building and Testing
Tools like the seasonal augmented Dickey-Fuller test that are available in Forecast Studio can help you determine whether you've captured the seasonality and achieved stationarity after differencing.
The selection boxes in the Series view allow you to quickly add simple or seasonal differencing. Selecting (1) for simple differencing will add one simple difference, i.e.:
y = y - lag(y)
Selecting (1) for seasonal differencing will add one seasonal difference. Note that when you create a project in Forecast Studio, the season is automatically diagnosed and assumed; this should be checked against the diagnostics above, which give our best guess as to what the true season is. In our case, we've assumed our season is 12. This would be equivalent to:
y = y - lag12(y)
We can then use stationarity tests to ensure we've achieved stationarity. In our case, we'll add 1 simple and seasonal difference.
Notice how our white noise plot has improved and our spikes at 12 have decreased to non-significance. Additionally, our stationarity tests are looking good and significant - that is, there is no unit root present.
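As a rough coded counterpart to what Forecast Studio is doing here, proc arima can apply one simple and one seasonal difference and run augmented Dickey-Fuller tests in the same identify step (a sketch only, again on sashelp.air; the ADF lag orders are illustrative):
proc arima data=sashelp.air;
    /* air(1,12) = one simple difference plus one seasonal difference at lag 12 */
    identify var=air(1,12) stationarity=(adf=(0,1,2));
run;
quit;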
Adding Seasonal or Cyclical Effects
Your model choice will dictate how seasonal or cyclical effects are added. Differencing in an ARIMA model will take care of seasonality. Dummy variables can be used for additional cyclical effects in the ARIMA model. For example:
data want;
    set have;
    q1 = (qtr(date) = 1);  /* 1 if the observation falls in Q1, else 0 */
    q2 = (qtr(date) = 2);
    q3 = (qtr(date) = 3);  /* Q4 is the omitted reference quarter */
run;
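Those dummies could then enter an ARIMA model as regressors via crosscorr= and input= in proc arima (a minimal sketch; the ARMA order is illustrative, and gdp and date are the variable names from the question):
proc arima data=want;
    identify var=gdp crosscorr=(q1 q2 q3);     /* quarterly dummies as input series */
    estimate p=1 input=(q1 q2 q3) method=ml;   /* regression with AR(1) errors */
run;
quit;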
UCMs can take care of all of these by adding both seasonal and cyclical effects. Holt-Winters ESMs take care of trend and seasonality without requiring dummy variables. Your modeling goals and performance considerations for each type of model will dictate which model you choose.
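For a sketch of what those alternatives might look like for the quarterly GDP series in the question (the dataset name have, the variables gdp and date, and the chosen components are all assumptions):
/* Unobserved components model: trend, season, and a cycle estimated together */
proc ucm data=have;
    id date interval=qtr;
    model gdp;
    irregular;
    level;
    slope;
    season length=4 type=trig;
    cycle;
    estimate;
run;

/* Holt-Winters exponential smoothing: trend + seasonality, no dummy variables needed */
proc esm data=have outfor=gdp_fore lead=8;
    id date interval=qtr;
    forecast gdp / model=winters;
run;
The UCM's cycle component plays the role of the dummy variables above, while the Winters ESM needs no explicit seasonal regressors at all.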

How to display data from different data source tables in a single table in Power BI

I have a couple of different tables in my report. For demonstration purposes, let's say that I have one data source with actual invoice amounts and another table with forecasted amounts. Each table has several dimensions that are the same between them, let's say Country, Region, Product Classification, and Product.
What I want is to be able to display a table/matrix that pulls information from both of these data sources, like this:
Description              Invoice   Forecast   vs Forecast
USA                          300        325           92%
  East                       150        175           86%
    Product Grouping 1       125        125          100%
      Product 1               50         75           67%
      Product 2               75         50          150%
    Product Grouping 3        25         50           50%
      Product 3               25         50           50%
  West                       150        150          100%
    Product Grouping 1        75        100           75%
      Product 1               25         50           50%
      Product 2               50         50          100%
    Product Grouping 3        75         50          150%
      Product 3               75         50          150%
I have not been able to figure out a way to combine the information from the multiple data sources into a single matrix table, so any help would be appreciated. The one thing that I did find was somebody who hard-coded the structure of the rows into a separate data source and then used DAX expressions to pull the pieces of information into the columns, but I don't like this solution because the structure of the rows is not constant.
What you're asking about is a common part of the star schema: combining facts from different fact tables together into a single visual or report.
What Not To Do (That You Might Be Tempted To)
What you don't want to do is combine the 2 fact tables into a single table in your Power BI data model. That's a lot of work and there's absolutely no need, especially since there are likely dimensions that the 2 fact tables do not have in common (e.g. actual amounts might be associated with a customer dimension, but forecast amounts wouldn't be).
What you also don't want to do is relate the 2 fact tables to each other in any way. Again, that's a lot of work. (Especially since there's no natural way to relate them at the row level.)
What To Do
Generally, how you handle 2 fact tables is the same as you handle a single fact table. First, you have your dimensions (country, region, classification, product, date, customer). Then you load your fact tables, and join them to the dimensions. You do not join your fact tables to each other. You then create measures (i.e. DAX expressions).
When you want to combine measures from the two facts together in a single matrix, you only use rows/columns that are meaningful to both fact tables. For example, actual amounts might be associated with a customer, but forecast amounts aren't. So you can't include customer information in the matrix. Another possibility is that actual amounts are recorded each day, whereas forecasts were done for the whole month. In this situation, you could put month in your matrix (since that's meaningful to both), but you wouldn't want to use date because Power BI wouldn't know how to divide up forecasts to individual dates.
As long as you're only using dimensions & attributes that are meaningful to both fact tables, you can easily create a matrix as you envision above. Simply drag on the attributes you want, then add the measures (i.e. DAX expressions).
The Invoice & Forecast columns would both be measures. The two measures from the different fact tables can then be combined into a third measure for the vs Forecast column. Everything will work as long as you're just using dimensions/attributes that mean something to both fact tables.
I don't see anything in your proposed pivot table that strikes me as problematic.
Other Situations
If you have a situation where forecasts are at a month level and actual is at a date level, then you may be wondering how you'd relate them both to the same date dimension. This situation is called having different granularities, and there's a good article here I'd recommend reading that has advice: https://www.daxpatterns.com/handling-different-granularities/. Indeed, there's a whole section on comparing budget with revenue that you might find useful.
Finally, you mention that someone hard-coded the structure of the rows and used DAX expressions to build everything. This does, admittedly, sound like overkill. The goal with Power BI is flexibility. Once you have your facts, measures & dimensions, you can combine them in any way that makes sense. Hard-coding the rows eliminates that flexibility, and is a good clue that something isn't right. (Another good clue that something isn't right is when DAX expressions seem really complicated for something that should be easy.)
I hope my answer helps. It's a general answer since your question is general. If you have specific questions about your specific situation, definitely post additional questions. (Sample data, a description of the model, the problem you're seeing, and what you want to see is helpful to get a good answer.)
If you're brand new to Power BI, data models, and the star schema, Alberto Ferrari and Marco Russo have an excellent book that I'd recommend reading to get a crash course: https://www.sqlbi.com/books/analyzing-data-with-microsoft-power-bi-and-power-pivot-for-excel/

Stata: Groupwise regressions and ranking

I am currently developing a sentiment index using Google search frequencies taken from Google Trends.
I am using Stata 12 on Windows.
My approach is as follows:
I downloaded approximately 150 business-related search queries from Google Trends, covering Jan 2004 to Dec 2013.
I now want to construct an index using the 30 queries that are, at each point in time, most relevant to the market I observe.
To achieve that, I want to use monthly expanding, backward-looking rolling regressions of each query on the market.
Thus I need to regress the 150 items one by one on the market 120 times (12 months x 10 years), using different time windows, and then extract the 30 queries with the most negative t-statistics.
To illustrate the procedure: if I wanted to construct the sentiment index for January 2010, I would regress the query terms on the market over the period from Jan 2004 to December 2009 and then extract the 30 queries with the most negative t-statistics.
Now I am looking for a way to make this as automated as possible. I guess I should be able to run the 150 items at once, and I can specify the time windows using the time stamps. Using Excel commands and creating a do-file with all the regression commands in it (which would be quite large), I could probably create the regressions relatively efficiently (although it depends on how much Stata can handle - any experience with that?).
What would make the data extraction much easier is a command I can use to rank the regression results by their t-statistics. Does anyone have an efficient approach to this, or general advice?
If you are using Stata, once you run a ttest you can type return list to see the scalars that Stata stores. Once you run this inside a loop, you can save those values in a number of different ways; check out the post command.

Is there an absolute column limit for Google's Charts?

I have finally gotten a column chart working for my data set. However, it only outputs fifteen columns, and the data set has 36 columns. It will output fifteen columns (or fewer if I limit the set to only items that are non-zero...but my boss wants all of the data shown) no matter what width the graph is set to.
Is there an absolute hard-coded column limit for graphs made by Google's Charts API, and if not, is there a way I can tell the graph to output everything?
I've just run into this myself, almost 7 years after the original problem report. Columns representing the right side of my data are silently not drawn.
Let's look at the big picture. Somebody provides a charting library; they should be expected to show the data as best they can. In the case of a column chart, that would mean showing the first and last columns, and then choosing which intermediate columns to show based on an algorithm that takes the available pixels into account. It would then let the user zoom in to see the full set of columns within the selected range. This gives the developer using the chart the freedom to show an unlimited amount of data without having to worry that someday the columns at the end are simply not drawn.
Google is already choosing to not print some of the column labels due to space constraints, so they're already halfway to understanding the big picture.
Nowhere in the documentation does it explain this truncation of columns due to space constraints, or for any other chart type that I've seen. But you sure can choose your background colors in great levels of detail.
If I had known this restriction going in, I would have chosen a different chart package and not wasted my time. My choices now are to break my "Lifetime" data into yearly graphs that fit in the available space, which is clunky as hell, or migrate to a different chart package. Thanks Google. :^(
P.S. I tried to post this as a comment to the OP, but after using SO for years I don't have enough points...