Creating and doing Market basket analysis from raw data - data-mining

I have a data set containing many items and their sales data, in terms of amount and quantity sold, rolled up per week. I want to figure out whether there is some correlation between items, i.e. to assess whether the sales of one item affect another item's sales, either positively or negatively.
Consider the following type of data:
Week #   Product #   Sale($)   Quantity
Week 1   Product 1   1         1
Week 1   Product 2   2         1
Week 1   Product 3   3         1
Week 2   Product 1   3         2
Week 2   Product 3   2         1
Week 2   Product 6   2         2
Week 3   Product 4   2         1
Week 3   Product 3   1         2
Week 3   Product 5   4         2
So, from the above weekly data, I want to figure out how I can convert it into a market basket form using the parameters available to me, since there isn't any market basket data available.
The parameters I could think of are:
To use the count or occurrences of each product in a given week.
To use the total quantity sold
To use the total sales to find correlation.
So, basically, I have to come up with a way to measure how one item is correlated with another, i.e. the affinity of one product with another product, whether positively or negatively correlated. The only issue is that I do not have a primary key to bind the items to a basket or an order number, since the sales are rolled up.
Any answers or help on this topic are highly appreciated. If you find anything incomplete, let me know and I will provide further clarity.

You can't do this, because you have no information about co-occurrence. Your data have also been rolled up from daily grain to weekly grain, and aggregates won't permit this kind of analysis.
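For what it's worth (this is not part of the answer above), the most these weekly aggregates support is week-level co-movement rather than true basket affinity. A minimal pandas sketch of that idea, with the example data typed in by hand and illustrative column names:

# Sketch only: week-level co-movement between products, not basket affinity.
import pandas as pd

# Assumed layout: one row per (week, product) with the quantity sold that week.
df = pd.DataFrame({
    "week":     ["Week 1", "Week 1", "Week 1",
                 "Week 2", "Week 2", "Week 2",
                 "Week 3", "Week 3", "Week 3"],
    "product":  ["Product 1", "Product 2", "Product 3",
                 "Product 1", "Product 3", "Product 6",
                 "Product 4", "Product 3", "Product 5"],
    "quantity": [1, 1, 1, 2, 1, 2, 1, 2, 2],
})

# Pivot to a week x product matrix; missing combinations become 0 sales.
wide = df.pivot_table(index="week", columns="product",
                      values="quantity", fill_value=0)

# Pairwise correlation of weekly quantities across products.
print(wide.corr())

A high correlation here only says the weekly totals move together; it says nothing about two products appearing in the same basket, which is exactly the information the aggregation has destroyed.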

Related

Changing organization of data so that each observation represents a new variable (I tried)

I am working in Stata with a dataset on electric vehicle charging stations. Variables include
station_name name of charging station
review_text all of the customer reviews for a specific station delimited by }{
num_reviews number of customer reviews.
I'm trying to make a new file where each observation represents one customer review in a new variable customer_review and another variable station_id has the name of the corresponding station. So, if the original dataset had 100 observations (one per station) with 5 reviews each, the new file should have 500 observations.
How can I do this? I would include some code I have tried but I have no idea how to start.
If your data look like this:
     station              reviews   n
  1.       1   {good}{bad}{great}   3
  2.       2    {poor}{excellent}   2
Then the following:
* split the delimited string into reviews1, reviews2, ... on the "}{" separator
split reviews, parse("}{")
drop reviews n
* go long: one observation per (station, review number)
reshape long reviews, i(station) j(review_num)
drop if reviews == ""
* strip the leftover braces at either end
replace reviews = subinstr(reviews, "}", "", .)
replace reviews = subinstr(reviews, "{", "", .)
will produce:
     station   review~m     reviews
  1.       1          1        good
  2.       1          2         bad
  3.       1          3       great
  4.       2          1        poor
  5.       2          2   excellent
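As a cross-check only (the question is about Stata, so this is purely an illustration), the same split-and-reshape can be sketched in pandas with the example columns:

# Rough pandas equivalent of the Stata steps above, for comparison only.
import pandas as pd

df = pd.DataFrame({
    "station": [1, 2],
    "reviews": ["{good}{bad}{great}", "{poor}{excellent}"],
    "n": [3, 2],
})

long = (
    df.assign(reviews=df["reviews"].str.strip("{}").str.split(r"\}\{"))
      .explode("reviews")                  # one row per review
      .drop(columns="n")
)
long["review_num"] = long.groupby("station").cumcount() + 1
print(long)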

How to deal with multiple ids multiple categories table to reach THIS on Power BI

I have a problem that I have been trying to solve for 3 days, and I am not able to.
I have the following tables:
Companies
company_id   sales
1            2000
2            3000
3            4000
4            1000
Categories
company_id   category
1            medical
1            sports
2            industrial
3            consumption
4            medical
4            consumption
All I want to reach is a COLUMN CHART with a CATEGORY SLICER, where I choose the category and I see the TOP 5 companies by category and sales. Yes, in this example the TOP is not needed, but in my real case I have 400 companies, so I want to:
Select and show only the required category.
In that category, show only the 5 best companies by sales.
The problem here is that Power BI takes all the companies for the TOP N filter, so if I choose a category and also try a top 5, and the category's companies are not in the overall top 5 list, it doesn't show anything.
Thanks!
If you always want to show the same Top N values in your visual, you can use the filter pane to achieve that.
Below is a walkthrough.
Then, to add the Top N filtering, I add the following:
It is in Dutch, so a little clarification:
I add a 'filter on this visual'.
I selected Populairste N, which is Top N.
And as a value I dragged and dropped the maximum of sales.
Results:
Things to keep in mind:
You are using a many-to-many relationship; make sure that this is activated in the Power BI model.
Make sure the direction of filtering is from category to sales, otherwise the slicer will not work. It looks like this:
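Outside Power BI entirely, and purely as an illustration of the intended result, here is a small pandas sketch of "top 5 companies by sales within one selected category" using the example tables above; the selected value stands in for the slicer choice:

# Illustrative sketch (not Power BI): top-N companies by sales within a category.
import pandas as pd

companies = pd.DataFrame({"company_id": [1, 2, 3, 4],
                          "sales": [2000, 3000, 4000, 1000]})
categories = pd.DataFrame({
    "company_id": [1, 1, 2, 3, 4, 4],
    "category": ["medical", "sports", "industrial",
                 "consumption", "medical", "consumption"],
})

selected = "medical"              # value the slicer would provide
top5 = (
    categories[categories["category"] == selected]
    .merge(companies, on="company_id")
    .nlargest(5, "sales")         # rank only within the selected category
)
print(top5)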

Increment ID# by 1 if Same month ArrayFormula

I'm trying to set up an array formula in a google sheet to save filling in a simple formula for ID#s.
The sheet is populated by a google form, so it receives a timestamp. Let's say these are orders.
If the month of the order matches that of the previous one, I want to increase the ID# by one, essentially counting this month's orders. The complete ID# is actually made up of several factors, the order count being just one of them (so that they are unique), but for the sake of this exercise, I'll keep it simple.
If the month of the order does not match the previous one, then it is safe to say we've entered a new month and the ID should restart at 01.
I have a column that has the extracted month from the timestamp. So the data looks like this:
A B
ID# MONTH
1 1
2 1
3 1
4 1
5 1
6 1
1 2
2 2
3 2
1 3
2 3
3 3
4 3
I can't get the arrayformula to work! I've tried numerous countIfs and Ifs, something like
=ARRAYFORMULA(if(len(B2:B),if(B3:B<>B2:B,1,A2:A+1),""))
Does anyone have any suggestions for this?
I found it hard to Google for and have tried a few search terms!
try:
=ARRAYFORMULA(IF(B1:B<>"", COUNTIFS(B1:B, B1:B, ROW(B1:B), "<="&ROW(B1:B)), ))
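As a side note (not part of the answer itself), the formula is a running count of each month value down the column. A rough pandas sketch of the same logic, using the example MONTH column:

# Running count per month: restarts the ID at 1 whenever a new month starts.
import pandas as pd

months = pd.Series([1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3], name="MONTH")

# How many times each month value has appeared so far, 1-based.
ids = months.groupby(months).cumcount() + 1
print(pd.DataFrame({"ID#": ids, "MONTH": months}))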

Creating Relationships while avoiding ambiguities

I have a flat table like this,
R#   Cat   SWN   CWN   CompBy   ReqBy    Department
1    A     1     1     Team A   Team B   Department 1
2    A     1     3     Team A   Team B   Department 1
3    B     1     3     Team A   Team B   Department 1
4    B     2     3     Team A   Team C   Department 1
5    B     2     3     Team D   Team C   Department 2
6    C     2     2     Team D   Team C   Department 2
R# indicates the Request Number,
Cat indicates the Category,
SWN indicates the Submitted Week Number,
CWN indicates the Completed Week Number,
CompBy indicates Completed By,
ReqBy indicates Requested By,
Department indicates the Department Name.
I would like to create a data model that avoids ambiguity and at the same time allows me to report on Category, SWN, CWN (needs to be only a week number), CompBy, ReqBy, and Department through a single filter.
For example, the dashboard will have a single filter choice to select a week number. If that week number is selected, it will show the details of the requests by both submitted and completed week number. I understand this requires the creation of a calendar table or something like that.
I am looking for a data model that explains the cardinality and the direction (single or both). If possible, kindly post the PBIX file and repost the link here.
What I have tried: I am not able to establish one of the four connections.
Update: Providing a bounty for this question because I would like to see how the star schema would look for this flat table.
One of the reasons I am looking for a star schema over a flat table is this: for example, a restaurant menu is a dimension and the purchased food is a fact. If you combined these into one table, how would you identify which food has never been ordered? For that matter, prior to your first order, how would you identify what food was available on the menu?
The scope of your question is rather unclear, so I'm just addressing this part of the post:
the dashboard will have a single filter choice to select a week number. If that week number is selected, it will show the details of these requests from submitted and completed week number.
One way to get OR logic is to use a disconnected parameter table and write your measures using the parameters selected. For example, consider this schema:
If you put WN on a slicer, then you can write a measure to filter the table based on the number selected.
WN Filter =
IF (
    COUNTROWS (
        INTERSECT (
            VALUES ( WeekDimension[WN] ),
            UNION ( VALUES ( MasterTable[SWN] ), VALUES ( MasterTable[CWN] ) )
        )
    ) > 0,
    1,
    0
)
Then if you use that measure as a visual level filter, you can see all the records that correspond to your WN selection.
If you can clarify your question to more closely approach a mcve, then you'll likely get better responses. I can't quite determine the specific idea you're having trouble with.
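Purely as an illustration of the OR logic the measure encodes (this is not Power BI and not part of the answer above), a quick pandas sketch against the flat table from the question: keep every request whose SWN or CWN equals the selected week number.

# Keep requests whose submitted OR completed week matches the selected week.
import pandas as pd

requests = pd.DataFrame({
    "R#":  [1, 2, 3, 4, 5, 6],
    "SWN": [1, 1, 1, 2, 2, 2],
    "CWN": [1, 3, 3, 3, 3, 2],
})

selected_wn = 3               # value the week-number slicer would provide
mask = (requests["SWN"] == selected_wn) | (requests["CWN"] == selected_wn)
print(requests[mask])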

Dynamic Rolling Window in SAS for correlation calculation

Problem: I have a data set as below -
Comp   date        time       returns
1      12-Aug-97   10:23:38   0.919292648
1      12-Aug-97   10:59:43   0.204139521
1      13-Aug-97   11:03:12   0.31909242
1      14-Aug-97   11:10:02   0.989339371
1      14-Aug-97   11:19:27   0.08394389
1      15-Aug-97   11:56:17   0.481199854
1      16-Aug-97   13:53:45   0.140404929
1      17-Aug-97   10:09:03   0.538569786
2      14-Aug-97   11:43:49   0.427344962
2      14-Aug-97   11:48:32   0.154836294
2      15-Aug-97   14:03:47   0.445415114
2      15-Aug-97   9:38:59    0.696953041
2      15-Aug-97   13:59:23   0.577391987
2      15-Aug-97   9:10:12    0.750949097
2      15-Aug-97   10:22:38   0.077787596
2      15-Aug-97   11:07:57   0.515822161
2      16-Aug-97   11:37:26   0.862673945
2      17-Aug-97   11:42:33   0.400670247
2      19-Aug-97   11:59:34   0.109279307
These are nothing but share price returns for every company at a date and time level.
I need to calculate the autocorrelation (degree 1) of returns over a period of 10 days for each Comp and date value combination. As you can see, my time series is not continuous; it has breaks for weekends and public holidays. In such cases, if I need to take a 10-day range, I can't use the INTNX function, as adding 10 days to the date column might include a Saturday/Sunday for which I don't have data, and hence my autocorrelation value will be compromised. How do I make this range dynamic?
I found this question Calculating rolling correlations in SAS that I thought might help but then again, there is the same intnx problem.
You can use the INTERVALDS system option to define a custom interval that fits your needs. See this article for more details.
The basic concept is that you create a dataset containing all of your possible dates (or datetimes) and define an interval value for each one, then tell SAS via the system option to use that dataset when you use a particular interval name. Then use INTNX as normal.
Otherwise, you could just do a PROC FREQ on your data to get the unique days, and then use that to create a day counter; then, instead of creating your fromDate with INTNX, you can just use SQL to grab the row whose day counter is 10 less than the current one.
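To illustrate the second (day-counter) approach outside SAS, here is a rough pandas sketch with assumed column names: rank the distinct dates per company and window on that rank rather than on calendar days.

# Illustrative sketch (not SAS): number the distinct trading days per company,
# then window on that counter instead of calendar dates.
import pandas as pd

df = pd.DataFrame({
    "comp":    [1, 1, 1, 1, 1],
    "date":    pd.to_datetime(["1997-08-12", "1997-08-12", "1997-08-13",
                               "1997-08-14", "1997-08-15"]),
    "returns": [0.9193, 0.2041, 0.3191, 0.9893, 0.4812],
})

# Dense rank of distinct dates within each company = trading-day counter.
df["day_num"] = (df.groupby("comp")["date"]
                   .transform(lambda s: s.rank(method="dense"))
                   .astype(int))

# Lag-1 autocorrelation of returns for one company over a window of the last
# 10 trading days (toy data, so the window covers everything available).
comp1 = df[df["comp"] == 1]
last_day = comp1["day_num"].max()
window = comp1[comp1["day_num"].between(last_day - 9, last_day)]
print(window["returns"].autocorr(lag=1))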