Google AutoML Tables - google-cloud-platform

I'm new to Google AutoML Tables and have a basic question about which data is worth including when training my model.
I have a dataset of golfers and will be looking at averages of scores over different periods: for example, the average over the past 3 months, 6 months, 1 year, etc.
My question is: is it worthwhile to also include the sample size for each date range for each player? For example, over the past 3 months some players will have a sample size of 28 while others will only have 2. Those players with 28 rounds will have more accurate averages than those with 2. However, I don't know whether Google AutoML Tables would pick up this link automatically, whether I could create a separate weighting/reliability variable, or whether there's a way to specify a link between columns. Or is this kind of automated AutoML not really suitable for the problem?
Thanks in advance

Related

How to create an evenly spaced timeseries for forecasting

I am trying to create a forecast, but I get an error.
I am working with about 300,000 rows of data. Most of the report has already been built. My data just doesn't contain certain dates. How can I solve this issue?
So the issue boils down to the problem of how to create an evenly spaced timeline. You can achieve this easily in Power Query (see the sketch after these steps):
1. Create a separate daily date table.
2. Outer join your observations onto the dates, which will give you null for the unobserved days.
3. Apply the fill down operation on your values column, which basically means the last value is repeated until a new observation appears.
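As a minimal sketch of those three steps in Power Query M (assuming a source query named Observations with Date and Value columns; those names are placeholders, not taken from your report):

    // Minimal sketch; "Observations", [Date] and [Value] are placeholder names.
    let
        Source    = Observations,
        StartDate = List.Min(Source[Date]),
        EndDate   = List.Max(Source[Date]),
        DayCount  = Duration.Days(EndDate - StartDate) + 1,
        // 1. Separate daily date table covering the full range
        Dates     = Table.FromList(
                        List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
                        Splitter.SplitByNothing(), {"Date"}),
        // 2. Outer join the observations onto the dates
        //    (unobserved days get null in [Value])
        Joined    = Table.NestedJoin(Dates, {"Date"}, Source, {"Date"},
                                     "Obs", JoinKind.LeftOuter),
        Expanded  = Table.ExpandTableColumn(Joined, "Obs", {"Value"}),
        // 3. Fill down: repeat the last value until a new observation appears
        Filled    = Table.FillDown(Expanded, {"Value"})
    in
        Filled

The same three steps can also be done entirely through the Power Query UI: Merge Queries with a Left Outer join against the date table, expand the value column, then Transform > Fill > Down.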
This evenly spaced time series is suitable for ML forecasting, at least when it comes to predicting trends. But the real power of this feature in Power BI is in predicting seasonality, and you most likely won't get that right with the above interpolation.

How to display data from different data source tables in a single table in Power BI

I have a couple of different tables in my report. For demonstration purposes, let's say that I have one data source that is actual invoice amounts and another table that is forecasted amounts. Each table has several dimensions that are the same between them, let's say Country, Region, Product Classification and Product.
What I want is to be able to display a table/matrix that pulls information from both of these data sources, like this:
Description             Invoice   Forecast   vs Forecast
USA                         300        325           92%
  East                      150        175           86%
    Product Grouping 1      125        125          100%
      Product 1              50         75           67%
      Product 2              75         50          150%
    Product Grouping 3       25         50           50%
      Product 3              25         50           50%
  West                      150        150          100%
    Product Grouping 1       75        100           75%
      Product 1              25         50           50%
      Product 2              50         50          100%
    Product Grouping 3       75         50          150%
      Product 3              75         50          150%
I have not been able to figure out a way to combine the information from the multiple data sources into a single matrix table, so any help would be appreciated. The one thing I did find was somebody who hard-coded the structure of the rows into a separate data source and then used DAX expressions to pull the pieces of information into the columns, but I don't like this solution because the structure of the rows is not constant.
What you're asking about is a common part of the star schema: combining facts from different fact tables together into a single visual or report.
What Not To Do (That You Might Be Tempted To)
What you don't want to do is combine the 2 fact tables into a single table in your Power BI data model. That's a lot of work and there's absolutely no need, especially since there are likely dimensions that the 2 fact tables do not have in common (e.g. actual amounts might be associated with a customer dimension, but forecast amounts wouldn't be).
What you also don't want to do is relate the 2 fact tables to each other in any way. Again, that's a lot of work. (Especially since there's no natural way to relate them at the row level.)
What To Do
Generally, you handle 2 fact tables the same way you handle a single fact table. First, you have your dimensions (country, region, classification, product, date, customer). Then you load your fact tables and join them to the dimensions. You do not join your fact tables to each other. You then create measures (i.e. DAX expressions).
When you want to combine measures from the two facts together in a single matrix, you only use rows/columns that are meaningful to both fact tables. For example, actual amounts might be associated with a customer, but forecast amounts aren't. So you can't include customer information in the matrix. Another possibility is that actual amounts are recorded each day, whereas forecasts were done for the whole month. In this situation, you could put month in your matrix (since that's meaningful to both), but you wouldn't want to use date because Power BI wouldn't know how to divide up forecasts to individual dates.
As long as you're only using dimensions & attributes that are meaningful to both fact tables, you can easily create a matrix as you envision above. Simply drag on the attributes you want, then add the measures (i.e. DAX expressions).
The Invoice & Forecast columns would both be measures. The two measures from the different fact tables can then be combined into a third measure for the vs. Forecast column. Everything will work as long as you're just using dimensions/attributes that mean something to both fact tables.
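As a rough sketch of what those three measures might look like in DAX (assuming the fact tables are named Invoices and Forecasts, each with an Amount column; the table and column names are placeholders):

    -- Placeholder names: 'Invoices' and 'Forecasts' are the two fact tables.
    Invoice = SUM ( Invoices[Amount] )
    Forecast = SUM ( Forecasts[Amount] )
    vs Forecast = DIVIDE ( [Invoice], [Forecast] )

Because each measure is filtered through the shared dimensions, dragging Country, Region, Product Classification and Product onto the matrix rows and these three measures into values should produce a layout like the one above.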
I don't see anything in your proposed pivot table that strikes me as problematic.
Other Situations
If you have a situation where forecasts are at a month level and actuals are at a date level, then you may be wondering how you'd relate them both to the same date dimension. This situation is called having different granularities, and there's a good article with advice that I'd recommend reading: https://www.daxpatterns.com/handling-different-granularities/. Indeed, there's a whole section on comparing budget with revenue that you might find useful.
Finally, you mention that someone hard-coded the structure of the rows and used DAX expressions to build everything. This does, admittedly, sound like overkill. The goal with Power BI is flexibility. Once you have your facts, measures & dimensions, you can combine them in any way that makes sense. Hard-coding the rows eliminates that flexibility, and is a good clue that something isn't right. (Another good clue that something isn't right is when DAX expressions seem really complicated for something that should be easy.)
I hope my answer helps. It's a general answer since your question is general. If you have specific questions about your specific situation, definitely post additional questions. (Sample data, a description of the model, the problem you're seeing, and what you want to see is helpful to get a good answer.)
If you're brand new to Power BI, data models, and the star schema, Alberto Ferrari and Marco Russo have an excellent book that I'd recommend reading to get a crash course: https://www.sqlbi.com/books/analyzing-data-with-microsoft-power-bi-and-power-pivot-for-excel/

How to estimate monthly/daily sales of an item using Amazon Advertising API

I have looked into the responses of the ItemSearch() and ItemLookup() functions in the Amazon Advertising API and could not find a way to get the daily/monthly sales of an item.
Popular product research software like JungleScout, ProfitPhonix, AMZ Tracker, etc. does display the number of monthly sales, but they all show different results.
Does Amazon provide this information? If not, how is the above software estimating it?
I think when they fetch the ASIN information they store something in their DB, and the next time the same ASIN is pulled, the estimated sales are roughly calculated based on the previous value/score in the DB.
Any help will be highly appreciated.
Thanks
It is not a solution, but here is a reply from UnicornSmasher that I found; it may help save time searching for something that doesn't exist.
constantine We just took all of the bulk data from the products that are being tracked in AMZ Tracker and applied a formula to it all. If you have specific products that are way off please let us know! Certain categories we had less data on. This is version 1 of the research tool, so I'm sure it will continue to improve quickly over time.
Here is the link to the question and answer:
amz forum
So, now, the question is 'What formula do they use?'
Let me know if you come up with an idea :)
Let me tell you first that if you're not part of the Amazon data team, you can't get the sales numbers of any product. And it's probably not easy to estimate sales using the Amazon Advertising API alone; you need to constantly track a huge number of products to estimate sales. Here I can explain how AMZ Insight, an Amazon tracking tool, estimates the sales of any product.
They constantly track a few thousand products from all the categories and collect massive amounts of data. Their in-house data scientists then analyze the data to form the sales-estimation algorithm. The relationship between the various data points is scattered, which of course means the sales estimates are not 100 percent right.
Data is continuously gathered and analyzed by tracking the Best Seller Rank (BSR), Buy Box, reviews and other factors. A relationship between these data points is then formed to come up with unit sales. Once this relationship is in place, it is much easier to estimate monthly sales and revenue for the product.

How do I choose an appropriate number of customers for cluster analysis?

I am currently doing a customer segmentation project in SAS.
I have identified 2,700 customers who have made a purchase in each of the 4 years I am analysing. For the cluster analysis, the more purchases per customer each year, the better the data quality is. However, as I become more selective over the number of purchases needed each year per customer, fewer customers can be considered in the cluster analysis.
How should I go about choosing the cutoff point for the number of purchases per customer per year necessary to be considered for the analysis? I am struggling with this trade-off between data quality and having enough customers for the analysis.
Thanks a lot! :)
There is no correct way. It entirely depends on your data.
Clustering such data is "magic" and the results tend to be anything but statistically sound. More like random guesses.
Because of this, always try multiple parameters and carefully inspect the results. No equation will ever tell you what a good clustering is.
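One way to act on that advice in SAS is simply to run the clustering over a grid of candidate cutoffs and cluster counts and compare the resulting segmentations side by side. The sketch below is purely illustrative: it assumes a table work.customers with a purchases_per_year column and placeholder clustering inputs recency, frequency and monetary.

    /* Illustrative sketch only: loop over minimum-purchase cutoffs and
       cluster counts; dataset and variable names are placeholders. */
    %macro try_params;
      %do cutoff = 2 %to 10 %by 2;
        %do k = 3 %to 6;
          proc fastclus data=work.customers(where=(purchases_per_year >= &cutoff))
                        maxclusters=&k out=work.clus_&cutoff._&k;
            var recency frequency monetary;
          run;
        %end;
      %end;
    %mend try_params;
    %try_params

No statistic will pick the "right" combination for you; the point is to see how sensitive the resulting segments are to the cutoff you choose.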

Stata: Groupwise regressions and ranking

I am currently developing a sentiment index using Google search frequencies taken from Google Trends.
I am using Stata 12 on Windows.
My approach is as follows:
I downloaded approximately 150 business-related search queries from Google Trends from Jan 2004 to Dec 2013.
I now want to construct an index using the 30 queries that, at that point in time, are most relevant to the market I observe.
To achieve that, I want to use monthly expanding, backward-looking rolling regressions of each query on the market.
Thus I need to regress 150 items one by one on the market 120 times (12 months x 10 years), using different time windows, and then extract the 30 queries with the most negative t-statistics.
To exemplify the procedure: if I wanted to construct the sentiment for January 2010, I would regress the query terms on the market over the period from Jan 2004 to December 2009 and then extract the 30 queries with the most negative t-statistics.
Now I am looking for a way to make this as automated as possible. I guess I should be able to run the 150 items at once, and I can specify the time window using the time stamps. Using Excel commands and creating a do-file with all the regression commands in it (which would be quite large), I could probably create the regressions relatively efficiently (although it depends on how much Stata can handle - any experience with that?).
What I would need to make the data extraction much easier is a command I can use to rank the results of the regressions according to their t-statistics. Does someone have an efficient approach to this, or general advice?
If you are using Stata, once you run a ttest you can type return list and you will get the scalars that Stata stores. Once you run a loop, you can store these values in a number of different ways; check out the post command.
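To make that concrete, here is a rough sketch of how the loop and the ranking might look in Stata. It assumes a monthly tsset dataset with the market series in market and the queries named query1-query150, and it hard-codes a single example window; all variable names and the window are placeholders.

    * Rough sketch: regress each query on the market over one expanding
    * window (here Jan 2004 - Dec 2009), post the t-statistics to a file,
    * then rank them. Names and window are placeholders.
    postfile results str32 query double tstat using tstats, replace
    forvalues i = 1/150 {
        quietly regress query`i' market if tin(2004m1, 2009m12)
        post results ("query`i'") (_b[market] / _se[market])
    }
    postclose results

    use tstats, clear
    sort tstat
    list query tstat in 1/30    // the 30 most negative t-statistics

Wrapping the window's end date in another loop over the 120 months (and posting it as an extra variable) would give you all of the expanding windows in one pass.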