How do I choose an appropriate number of customers for cluster analysis?

I am currently doing a customer segmentation project in SAS.
I have identified 2700 customers who have made a purchase in each of the 4 years I am analysing. For the cluster analysis, the more purchases per customer each year, the better the data quality. However, the more selective I am about the number of purchases required per customer each year, the fewer customers can be considered in the cluster analysis.
How should I go about choosing the cutoff point for the number of purchases required per customer per year to be considered for analysis? I am struggling with this trade-off between data quality and having enough customers for analysis.
Thanks a lot! :)

There is no correct way. It depends entirely on your data.
Clustering such data is "magic", and the results tend to be anything but statistically sound; they are often closer to random guesses.
Because of this, always try multiple parameters and carefully inspect the results. No equation will ever tell you what a good clustering is.
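
One way to act on the advice to try multiple parameters is to tabulate, for each candidate cutoff, how many customers would survive it. The following is a minimal sketch in Python (the question uses SAS; Python is used here only for illustration). The function name, the customer IDs, and the purchase counts are all made up; only the "at least N purchases in each of the 4 years" rule comes from the question.

```python
def customers_retained(purchases_by_customer, cutoffs):
    """Return {cutoff: number of customers with at least `cutoff`
    purchases in every year}."""
    retained = {}
    for cutoff in cutoffs:
        retained[cutoff] = sum(
            1
            for yearly_counts in purchases_by_customer.values()
            if min(yearly_counts) >= cutoff
        )
    return retained


# Hypothetical data: purchases per year over 4 years for each customer.
example = {
    "A": [1, 2, 1, 3],
    "B": [4, 5, 6, 4],
    "C": [2, 2, 3, 2],
    "D": [10, 8, 9, 12],
}

table = customers_retained(example, cutoffs=[1, 2, 4, 8])
for cutoff, count in table.items():
    print(f"min purchases/year >= {cutoff}: {count} customers")
```

A table like this does not pick the cutoff for you, but it makes the trade-off visible, so you can cluster at two or three cutoffs and compare the results by inspection.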

Related

Google AutoML tables

I'm new to Google AutoML Tables and have a basic question about which data is worth including when training my model.
I have a dataset of golfers and will be looking at the averages of scores over different periods. For example, the average over the past 3 months, 6 months, 1 year, etc.
My question is: is it worthwhile to also include the sample size for each date range for each player? For example, over the past 3 months, some players will have a sample size of 28 while some will only have 2. Those players that have 28 rounds will have more accurate averages than those with 2. However, I don't know whether Google AutoML Tables would pick up this link automatically, whether I could create a different weighting/reliability variable, or whether there's a way to specify a link between columns. Or is this automated type of AutoML not really suitable?
Thanks in advance

How many rows per partition are required for good performance in BigQuery?

Every day I receive 100 rows from an application. Good practice in my company is to partition every table by day. I don't think that is a good idea for the new table I will create, which will receive only about a hundred rows per day. I want to partition the data by year instead; is that a good idea?
How many rows per partition are required for the best performance?

It also depends on the queries you are going to run on this table, that is, what kind of date filters you are going to use and which columns you will join on. Refer to the answers below, which should help you decide.
Answer1
Answer2

Keep in mind that the number of partitions is limited (to 4000). Partitioning is therefore best suited to low-cardinality columns. Partitioning per day is perfect (4000 days is about 11 years).
If you have higher cardinality, a customer ID for example (and I hope you have more than 4000 customers!), clustering is the way to speed up the query.
When you partition and cluster your data, you create small bags of data. The less data you have to process (load, read, store in cache, ...), the faster your query will be! Of course, with only 100 rows, you won't see any difference.
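
The arithmetic behind the "about 11 years" figure, and the partition sizes at 100 rows per day, can be checked with a short Python sketch. The 4000 limit is the figure quoted in this answer; the function names and the 365.25 days-per-year approximation are assumptions made for illustration.

```python
MAX_PARTITIONS = 4000  # limit quoted in the answer above

# Approximate number of partitions created per year for each granularity.
PARTITIONS_PER_YEAR = {"day": 365.25, "month": 12, "year": 1}


def years_covered(granularity, max_partitions=MAX_PARTITIONS):
    """How many years of data fit before the partition limit is reached."""
    return max_partitions / PARTITIONS_PER_YEAR[granularity]


def rows_per_partition(rows_per_day, granularity):
    """Average number of rows landing in one partition."""
    return rows_per_day * 365.25 / PARTITIONS_PER_YEAR[granularity]


for granularity in PARTITIONS_PER_YEAR:
    print(
        granularity,
        round(years_covered(granularity), 1),
        round(rows_per_partition(100, granularity)),
    )
```

At 100 rows per day, even a yearly partition holds only a few tens of thousands of rows, which is consistent with the remark that no difference will be visible at this volume.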

Efficiency of measures in Power BI

Let's say you had a table.
OrderNumber, OrderDate, City, & Sales
The Sales field is given to you. No need to calculate it.
When you bring in this data into Power BI, say you want to analyze Sales by City (in a table format).
You can just straight away drag the two fields into the table.
No need to create a measure.
So now, suppose you created a measure, though.
Total Sales = Sum(Sales).
Is there any advantage to it, in this scenario?
Is it more efficient to use: City, Total Sales
than it is to use: City, Sales
Both display the same information.

When you drag the field into the table, what Power BI does is create an implicit measure automatically based on its best guess of what aggregation (e.g. sum, max, count) it thinks you want.
So in this case, using an explicitly defined measure or an implicitly generated measure should perform the same since it is doing the same thing in the background, i.e., SUM(TableName[Sales]).
It's generally considered best practice to use explicit measures.
You may be interested in this video discussing the differences.

I was told that it is good to always create explicit measures, and that measures are more efficient. Whether that is right or wrong, I don't know, but from a policy perspective it is a good idea, since measures protect you from column name changes. In general, I think a reasonable rule of thumb is to explicitly define any measure you want to report on. That said, the answer above could also be correct; Stack Exchange doesn't let you accept multiple answers.

How to display data from different data source tables in a single table in Power BI

I have a couple of different tables in my report. For demonstration purposes, let's say that I have one data source that is actual invoice amounts and another table that is forecasted amounts. The two tables share several dimensions, let's say Country, Region, Product Classification and Product.
What I want is to be able to display a table/matrix that pulls information from both of these data sources, like this:
Description                 Invoice  Forecast  vs Forecast
USA                             300       325          92%
  East                          150       175          86%
    Product Grouping 1          125       125         100%
      Product 1                  50        75          67%
      Product 2                  75        50         150%
    Product Grouping 3           25        50          50%
      Product 3                  25        50          50%
  West                          150       150         100%
    Product Grouping 1           75       100          75%
      Product 1                  25        50          50%
      Product 2                  50        50         100%
    Product Grouping 3           75        50         150%
      Product 3                  75        50         150%
I have not been able to figure out a way to combine the information from the multiple data sources into a single matrix, so any help would be appreciated. The one thing I did find was that somebody hard-coded the structure of the rows into a separate data source and then used DAX expressions to pull the pieces of information into the columns, but I don't like this solution because the structure of the rows is not constant.

What you're asking about is a common part of the star schema: combining facts from different fact tables together into a single visual or report.
What Not To Do (That You Might Be Tempted To)
What you don't want to do is combine the 2 fact tables into a single table in your Power BI data model. That's a lot of work and there's absolutely no need. Especially, since there are likely dimensions that the 2 fact tables do not have in common (e.g. actual amounts might be associated with a customer dimension, but forecast amounts wouldn't be).
What you also don't want to do is relate the 2 fact tables to each other in any way. Again, that's a lot of work. (Especially since there's no natural way to relate them at the row level.)
What To Do
Generally, you handle 2 fact tables the same way you handle a single fact table. First, you have your dimensions (country, region, classification, product, date, customer). Then you load your fact tables and join them to the dimensions. You do not join your fact tables to each other. You then create measures (i.e. DAX expressions).
When you want to combine measures from the two facts together in a single matrix, you only use rows/columns that are meaningful to both fact tables. For example, actual amounts might be associated with a customer, but forecast amounts aren't. So you can't include customer information in the matrix. Another possibility is that actual amounts are recorded each day, whereas forecasts were done for the whole month. In this situation, you could put month in your matrix (since that's meaningful to both), but you wouldn't want to use date because Power BI wouldn't know how to divide up forecasts to individual dates.
As long as you're only using dimensions & attributes that are meaningful to both fact tables, you can easily create a matrix as you envision above. Simply drag on the attributes you want, then add the measures (i.e. DAX expressions).
The Invoice & Forecast columns would both be measures. The two measures from different fact tables can be combined into a 3rd measure for the vs. Forecast measure. Everything will work as long as you're just using dimensions/attributes that mean something to both fact tables.
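
To make the idea concrete outside of Power BI, here is a rough Python sketch of what the three measures compute: each fact table is summed independently over the shared dimensions, and the third measure divides one total by the other. This is not DAX and not how Power BI evaluates measures internally; the function names are hypothetical, and the rows are the product-level figures from the question's example.

```python
from collections import defaultdict

# Fact rows as (country, region, product, amount), taken from the
# product-level lines of the example matrix in the question.
invoices = [
    ("USA", "East", "Product 1", 50),
    ("USA", "East", "Product 2", 75),
    ("USA", "East", "Product 3", 25),
    ("USA", "West", "Product 1", 25),
    ("USA", "West", "Product 2", 50),
    ("USA", "West", "Product 3", 75),
]
forecasts = [
    ("USA", "East", "Product 1", 75),
    ("USA", "East", "Product 2", 50),
    ("USA", "East", "Product 3", 50),
    ("USA", "West", "Product 1", 50),
    ("USA", "West", "Product 2", 50),
    ("USA", "West", "Product 3", 50),
]


def total_by(rows, level):
    """Sum amounts grouped by the first `level` dimension columns."""
    totals = defaultdict(int)
    for *dims, amount in rows:
        totals[tuple(dims[:level])] += amount
    return dict(totals)


def invoice_vs_forecast(level):
    """Combine both facts on shared dimensions only; never row to row."""
    inv = total_by(invoices, level)
    fc = total_by(forecasts, level)
    result = {}
    for key in sorted(set(inv) | set(fc)):
        i, f = inv.get(key, 0), fc.get(key, 0)
        result[key] = (i, f, i / f if f else None)
    return result


by_region = invoice_vs_forecast(level=2)
for key, (i, f, ratio) in by_region.items():
    print(key, i, f, "n/a" if ratio is None else f"{ratio:.0%}")
```

Note that the two lists are never joined to each other; they only meet after each has been aggregated to the same dimension level, which is the point made above.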
I don't see anything in your proposed pivot table that strikes me as problematic.
Other Situations
If you have a situation where forecasts are at a month level and actual is at a date level, then you may be wondering how you'd relate them both to the same date dimension. This situation is called having different granularities, and there's a good article here I'd recommend reading that has advice: https://www.daxpatterns.com/handling-different-granularities/. Indeed, there's a whole section on comparing budget with revenue that you might find useful.
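
As a rough illustration of the granularity issue (not the DAX pattern from the linked article), the Python sketch below rolls daily actuals up to the month level so they can sit next to monthly forecasts. All names and numbers are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical data: actuals are recorded per day, forecasts per month.
daily_actuals = [
    (date(2024, 1, 5), 40),
    (date(2024, 1, 20), 60),
    (date(2024, 2, 3), 30),
]
monthly_forecast = {(2024, 1): 120, (2024, 2): 25}


def actuals_by_month(rows):
    """Aggregate daily amounts up to (year, month), the coarser grain."""
    totals = defaultdict(int)
    for day, amount in rows:
        totals[(day.year, day.month)] += amount
    return dict(totals)


monthly_actuals = actuals_by_month(daily_actuals)

# Compare at month level only; a forecast cannot be split back into days.
comparison = {
    month: (monthly_actuals.get(month, 0), forecast)
    for month, forecast in monthly_forecast.items()
}
print(comparison)
```

The comparison is only defined at the coarser grain: going from days up to months is a simple sum, whereas going from a monthly forecast down to days would require an allocation rule that the data does not contain.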
Finally, you mention that someone hard-coded the structure of the rows and used DAX expressions to build everything. This does, admittedly, sound like overkill. The goal with Power BI is flexibility. Once you have your facts, measures & dimensions, you can combine them in any way that makes sense. Hard-coding the rows eliminates that flexibility, and is a good clue that something isn't right. (Another good clue that something isn't right is when DAX expressions seem really complicated for something that should be easy.)
I hope my answer helps. It's a general answer since your question is general. If you have specific questions about your specific situation, definitely post additional questions. (Sample data, a description of the model, the problem you're seeing, and what you want to see is helpful to get a good answer.)
If you're brand new to Power BI, data models, and the star schema, Alberto Ferrari and Marco Russo have an excellent book that I'd recommend reading to get a crash course: https://www.sqlbi.com/books/analyzing-data-with-microsoft-power-bi-and-power-pivot-for-excel/

How to estimate monthly/daily sales of an item using Amazon Advertising API

I have looked into the responses of the "ItemSearch()" and "lookUp()" functions in the Amazon Advertising API and could not find a way to get the daily/monthly sales of an item.
Popular product research software such as JungleScout, ProfitPhonix, AMZ Tracker, etc. does display the number of monthly sales, but they all show different results.
Does Amazon provide this information? If not, how is the above software estimating it?
I think that when they fetch the ASIN information, they store "something" in their DB, and the next time the same ASIN is pulled, the estimated sales are roughly calculated based on the previous value/score in the DB.
Any help will be highly appreciated.
Thanks

It is not a solution, but here is a reply from UnicornSmasher that I found; it may save you time searching for something that doesn't exist.
constantine We just took all of the bulk data from the products that are being tracked in AMZ Tracker and applied a formula to it all. If you have specific products that are way off please let us know! Certain categories we had less data on. This is version 1 of the research tool, so I'm sure it will continue to improve quickly over time.
Here is the link to the question and answer:
amz forum
So, now, the question is 'What formula do they use?'
Let me know if you come up with an idea :)

Let me tell you first that unless you are part of the Amazon data team, you can't get the sales numbers of any product. And it's probably not easy to estimate sales using the Amazon Advertising API. You need to constantly track a huge number of products to estimate sales. Here I can explain how AMZ Insight, an Amazon tracking tool, estimates the sales of any product.
They constantly track a few thousand products across all categories and collect massive amounts of data. Their in-house data scientists then analyze the data to build the sales-estimating algorithm. Plotting the relationship between multiple data points gives a scatter plot, which of course means the sales estimates are not 100 percent accurate.
Data is continuously gathered and analyzed by tracking the Best Seller Rank (BSR), Buy Box, reviews, and other factors. A relationship between these data points is then derived to come up with unit sales. Once this relationship is in place, it is much easier to estimate monthly sales and revenue for a product.
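
The actual formula these tools use is not public, so the following is only a guess at the general shape of such an approach, not AMZ Insight's method. It assumes, purely for illustration, that unit sales follow a power law in BSR and fits that curve by least squares on log-transformed values. The data points are invented and the function names are hypothetical.

```python
import math

# Made-up (BSR, units sold per month) pairs, for illustration only.
# A real tool would collect these by tracking many products over time.
tracked = [(100, 100.0), (400, 50.0), (2500, 20.0), (10000, 10.0)]


def fit_power_law(points):
    """Least-squares fit of log(sales) = a + b * log(bsr)."""
    xs = [math.log(bsr) for bsr, _ in points]
    ys = [math.log(sales) for _, sales in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def estimate_sales(bsr, a, b):
    """Predict monthly unit sales for a BSR from the fitted curve."""
    return math.exp(a + b * math.log(bsr))


a, b = fit_power_law(tracked)
print(round(estimate_sales(1600, a, b), 1))
```

Real tracked data would scatter around any such curve and would likely need a separate fit per category, which is one plausible reason why different tools report different numbers for the same product.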
