Changing organization of data so that each observation represents a new variable (I tried) - stata

I am working in Stata with a dataset on electric vehicle charging stations. Variables include:
station_name: name of the charging station
review_text: all of the customer reviews for a specific station, delimited by }{
num_reviews: number of customer reviews
I'm trying to make a new file where each observation represents one customer review in a new variable customer_review and another variable station_id has the name of the corresponding station. So, if the original dataset had 100 observations (one per station) with 5 reviews each, the new file should have 500 observations.
How can I do this? I would include some code I have tried but I have no idea how to start.

If your data look like this:
    station  reviews             n
1.  1        {good}{bad}{great}  3
2.  2        {poor}{excellent}   2
Then the following:
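* split on the "}{" delimiter into reviews1, reviews2, ...
split reviews, parse("}{")
drop reviews n
* one observation per station-review pair
reshape long reviews, i(station) j(review_num)
drop if reviews==""
* remove the leftover outer braces
replace reviews = subinstr(reviews, "}","",.)
replace reviews = subinstr(reviews, "{","",.)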
will produce:
    station  review~m  reviews
1.  1        1         good
2.  1        2         bad
3.  1        3         great
4.  2        1         poor
5.  2        2         excellent
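For readers who want to check the split-then-reshape logic outside Stata, here is a minimal Python sketch of the same steps. The `stations` list and the `split_reviews` helper are hypothetical names made up for illustration; the data simply mirrors the two-station example above.

```python
# Hypothetical stand-in for the two-station example (not real data).
stations = [
    (1, "{good}{bad}{great}"),
    (2, "{poor}{excellent}"),
]


def split_reviews(rows, delimiter="}{"):
    """Return one (station, review_num, review) tuple per review."""
    long_rows = []
    for station, text in rows:
        # Like `split ..., parse("}{")`: cut the string at each delimiter.
        pieces = text.split(delimiter)
        for review_num, piece in enumerate(pieces, start=1):
            # Like the two subinstr() calls: remove any remaining braces.
            review = piece.replace("{", "").replace("}", "")
            if review:  # like `drop if reviews==""`
                long_rows.append((station, review_num, review))
    return long_rows


for row in split_reviews(stations):
    print(row)
```

Each printed tuple corresponds to one observation in the long file: station, review number, review text.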


Summarize a skills rating table with an unknown number of skill columns in Google Sheet using pure formulas and no GAS

I am unable to share a working sheet.
I have a sheet with a table like so:
Name   Skill 1  Skill 2  Skill 3  ...
one    high     medium   low      none
two
three  low      medium   high     none
four   low      high     hig      ...
It has an unknown number of rows
It has 3 skill columns today, but more skills may be added later
Not all rows are filled out
I want to summarize the table like so:
Skill    high  medium  low  none
Skill 1  1     0       2    0
Skill 2  1     2       0    0
Skill 3  1     1       1    0
...
Basically I am showing each skill and how many high/medium/low/none they have.
I am trying to use formulas so everything is dynamic. Meaning, if more names are added, or if more skills are added, then the table automatically shows it.
I can get a list of skills from the first table like so:
={
"Area";
TRANSPOSE(SORT(Ratings!B1:1))
}
But that is as far as I got.
use:
=ARRAYFORMULA(QUERY(SPLIT(FLATTEN(IF(Ratings!B2:5000="",,Ratings!B1:1&"×"&Ratings!B2:5000)), "×"),
"select Col1,count(Col1) where Col2 is not null group by Col1 pivot Col2 label Col1'Skill'"))

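The formula above pairs every non-empty rating with its column header, splits the pairs back into two columns, and then counts them per skill with one pivot column per rating. A minimal Python sketch of that flatten-and-pivot idea follows. The `header` and `rows` values are hypothetical sample data chosen to reproduce the counts in the question's summary table, and `summarize` is an illustrative name, not part of any Sheets API.

```python
from collections import Counter

# Hypothetical sample sheet; blank strings stand for unfilled cells.
header = ["Name", "Skill 1", "Skill 2", "Skill 3"]
rows = [
    ["one", "high", "medium", "low"],
    ["two", "", "", ""],
    ["three", "low", "medium", "high"],
    ["four", "low", "high", "medium"],
]


def summarize(header, rows):
    """Count ratings per skill, for any number of skill columns."""
    counts = Counter()
    for row in rows:
        # Flatten: one (skill, rating) pair per non-empty cell.
        for skill, rating in zip(header[1:], row[1:]):
            if rating:
                counts[(skill, rating)] += 1
    # Pivot: one row per skill, one column per rating that occurs.
    skills = sorted({skill for skill, _ in counts})
    ratings = sorted({rating for _, rating in counts})
    return {s: {r: counts[(s, r)] for r in ratings} for s in skills}


for skill, row in summarize(header, rows).items():
    print(skill, row)
```

In this sketch a rating that never occurs in the data (such as none here) gets no column, so a fixed list of ratings would be needed if empty columns must always be shown.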
Increment ID# by 1 if Same month ArrayFormula

I'm trying to set up an array formula in a google sheet to save filling in a simple formula for ID#s.
The sheet is populated by a google form, so it receives a timestamp. Let's say these are orders.
If the month of the order matches that of the previous order, I want to increase the ID# by one, essentially counting this month's orders. The complete ID# is actually made up of several factors, the order count being just one of them (so that they are unique), but for the sake of this exercise, I'll keep it simple.
If the month of the order does not match the previous one, then it is safe to say we've entered a new month and the ID should restart at 01.
I have a column that has the extracted month from the timestamp. So the data looks like this:
A B
ID# MONTH
1 1
2 1
3 1
4 1
5 1
6 1
1 2
2 2
3 2
1 3
2 3
3 3
4 3
I can't get the arrayformula to work! I've tried numerous countIfs and Ifs, something like
=ARRAYFORMULA(if(len(B2:B),if(B3:B<>B2:B,1,A2:A+1),""))
Does anyone have any suggestions for this?
I found it hard to Google for and have tried a few search terms!
try:
=ARRAYFORMULA(IF(B1:B<>"", COUNTIFS(B1:B, B1:B, ROW(B1:B), "<="&ROW(B1:B)), ))
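The COUNTIFS formula above amounts to a running count: for each row, count how many rows up to and including it carry the same month. A small Python sketch of that logic, using the month column from the question as sample input (`running_count` is an illustrative name):

```python
from collections import defaultdict


def running_count(months):
    """For each entry, return how many times its value has appeared so far."""
    seen = defaultdict(int)
    ids = []
    for month in months:
        seen[month] += 1  # one more order for this month
        ids.append(seen[month])
    return ids


# Month column from the question: six orders in month 1, three in 2, four in 3.
months = [1] * 6 + [2] * 3 + [3] * 4
print(running_count(months))
```

Note that this counts every earlier row with the same month, not only consecutive ones, so if the same month number reappears later (for example, in the following year) the count continues rather than restarting.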

Creating Relationships while avoiding ambiguities

I have a flat table like this,
R#  Cat  SWN  CWN  CompBy  ReqBy   Department
1   A    1    1    Team A  Team B  Department 1
2   A    1    3    Team A  Team B  Department 1
3   B    1    3    Team A  Team B  Department 1
4   B    2    3    Team A  Team C  Department 1
5   B    2    3    Team D  Team C  Department 2
6   C    2    2    Team D  Team C  Department 2
R# indicates the Request Number,
Cat indicates the Category,
SWN indicates the Submitted Week Number,
CWN indicates the Completed Week Number,
CompBy indicates Completed By,
ReqBy indicates Requested By,
Department indicates the Department Name.
I would like to create a data model that avoids ambiguity and at the same time lets me report on Category, SWN, CWN (which needs to be only a week number), CompBy, ReqBy, and Department through a single filter.
For example, the dashboard will have a single filter choice to select a week number. If that week number is selected, it will show the details of these requests from submitted and completed week number. I understand this requires the creation of a calendar table or something like that.
I am looking for a data model that explains the cardinality and direction (single or both). If possible, kindly post the PBIX file and share the link here.
What I have tried: Not able to establish one of the four connections
Update: I am offering a bounty on this question because I would like to see what the star schema would look like for this flat table.
One of the reasons I am looking for a star schema over a flat table is this: for example, a restaurant menu is a dimension and the purchased food is a fact. If you combined these into one table, how would you identify which food has never been ordered? For that matter, prior to your first order, how would you identify what food was available on the menu?
The scope of your question is rather unclear, so I'm just addressing this part of the post:
the dashboard will have a single filter choice to select a week number. If that week number is selected, it will show the details of these requests from submitted and completed week number.
One way to get OR logic is to use a disconnected parameter table and write your measures using the parameters selected. For example, consider this schema:
If you put WN on a slicer, then you can write a measure to filter the table based on the number selected.
WN Filter = IF(COUNTROWS(
    INTERSECT(
        VALUES(WeekDimension[WN]),
        UNION(
            VALUES(MasterTable[SWN]),
            VALUES(MasterTable[CWN])))) > 0, 1, 0)
Then if you use that measure as a visual level filter, you can see all the records that correspond to your WN selection.
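To make the OR logic concrete outside Power BI, here is a minimal Python sketch of what the measure does for each record: keep the row when the selected week numbers intersect the set {SWN, CWN}. The rows are taken from the flat table in the question (only the columns needed here), and `filter_by_week` is a hypothetical helper name.

```python
# Rows from the question's flat table, reduced to the relevant columns.
requests = [
    {"R#": 1, "SWN": 1, "CWN": 1},
    {"R#": 2, "SWN": 1, "CWN": 3},
    {"R#": 3, "SWN": 1, "CWN": 3},
    {"R#": 4, "SWN": 2, "CWN": 3},
    {"R#": 5, "SWN": 2, "CWN": 3},
    {"R#": 6, "SWN": 2, "CWN": 2},
]


def filter_by_week(rows, selected_weeks):
    """Keep rows submitted OR completed in any of the selected weeks."""
    selected = set(selected_weeks)
    # A non-empty intersection plays the role of COUNTROWS(...) > 0.
    return [r for r in rows if selected & {r["SWN"], r["CWN"]}]


print([r["R#"] for r in filter_by_week(requests, [3])])
```

Because the week table is disconnected, nothing filters the fact rows automatically; the intersection test is the only thing tying the selection to the records.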
If you can clarify your question so that it comes closer to a minimal, complete, and verifiable example (MCVE), then you'll likely get better responses. I can't quite determine the specific idea you're having trouble with.

Creating and doing Market basket analysis from raw data

I have a data set with many items and their sales data, in terms of amount and quantity sold, rolled up per week. I want to figure out whether there is any correlation between items, that is, to assess whether the sales of one item affect the sales of another, positively or negatively.
Consider the following type of data:
Week #   Product #   Sale($)   Quantity
Week 1   Product 1   1         1
         Product 2   2         1
         Product 3   3         1
Week 2   Product 1   3         2
         Product 3   2         1
         Product 6   2         2
Week 3   Product 4   2         1
         Product 3   1         2
         Product 5   4         2
So, from the weekly data above, I want to figure out how to convert it into a form of market basket data using the parameters available to me, since no actual market basket data is available.
The parameters I could think of are:
To use the count of occurrences of each product in a given week.
To use the total quantity sold.
To use the total sales to find correlation.
So, basically, I have to come up with a way to measure how one item is correlated with another, or the affinity of one product for another, no matter whether the correlation is positive or negative. The only issue is that I do not have any primary key to bind the items to a basket or an order number, since the sales are rolled up.
Any answers or help on this topic would be highly appreciated. If you find the question incomplete, let me know and I will clarify.
You can't do this because you have no information about the co-occurrence. You also have data muddled from daily grain to weekly grain. Aggregates won't permit this.
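To illustrate why rolled-up totals cannot recover co-occurrence, the following Python sketch builds two hypothetical sets of orders (invented purely for illustration) that roll up to identical weekly quantities but contain different product pairs. The helper names `weekly_totals` and `pair_counts` are also made up for this example.

```python
from collections import Counter
from itertools import combinations


def weekly_totals(orders):
    """Roll order-level baskets up to per-product quantities."""
    totals = Counter()
    for basket in orders:
        totals.update(basket)
    return totals


def pair_counts(orders):
    """Count how often each pair of products shares a basket."""
    pairs = Counter()
    for basket in orders:
        pairs.update(combinations(sorted(set(basket)), 2))
    return pairs


# Two hypothetical weeks of orders with the same products sold overall.
orders_a = [["Product 1", "Product 2"], ["Product 3"]]
orders_b = [["Product 1"], ["Product 2", "Product 3"]]

print(weekly_totals(orders_a) == weekly_totals(orders_b))  # same rollup
print(pair_counts(orders_a) == pair_counts(orders_b))      # different baskets
```

Since both order sets produce the same weekly table, no computation on that table alone can tell which products were actually bought together.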

what is this program doing exactly? (SAS)

I was confused by the following SAS code. Here, the SAS data set named WORK.SALARY contains 10 observations for each department, and is currently ordered by Department. The following SAS program is submitted:
data WORK.TOTAL;
  set WORK.SALARY(keep=Department MonthlyWageRate);
  by Department;
  if First.Department=1 then Payroll=0;
  Payroll+(MonthlyWageRate*12);
  if Last.Department=1;
run;
So, what exactly is First.Department and Last.Department? Many thanks for your time and attention.
Your data step calculates the total PAYROLL for each DEPARTMENT.
The FIRST. and LAST. variables are generated automatically when you use a BY statement. They are true when the current observation is the first (or last) observation in the BY group. See "How the DATA Step Identifies BY Groups" in the SAS documentation.
The sum statement (Syntax: var+expression;) for PAYROLL means that the value of PAYROLL is retained (or carried over) to the next observation.
The IF/THEN statement initializes the value to zero when a new group starts.
The subsetting IF statement will make sure that only the final observation for each department is output.
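As a rough cross-check of the three points above, here is a Python sketch that emulates the DATA step row by row. The `salary` rows are hypothetical sample data (three rows per department instead of ten), and the sketch only keeps the department and the total, whereas the SAS output would also carry the other kept variables.

```python
# Hypothetical sample rows, already sorted by department: (department, wage).
salary = [
    (1, 100), (1, 200), (1, 300),
    (2, 1000), (2, 2000), (2, 3000),
]


def total_payroll(rows):
    """Emulate the DATA step: one output row per department."""
    output = []
    payroll = 0  # retained across rows, like a sum-statement variable
    for i, (dept, wage) in enumerate(rows):
        # FIRST./LAST. flags: compare with the neighbouring rows.
        first = i == 0 or rows[i - 1][0] != dept
        last = i == len(rows) - 1 or rows[i + 1][0] != dept
        if first:
            payroll = 0           # if First.Department=1 then Payroll=0;
        payroll += wage * 12      # Payroll+(MonthlyWageRate*12);
        if last:                  # if Last.Department=1; (subsetting IF)
            output.append((dept, payroll))
    return output


print(total_payroll(salary))
```

As in SAS BY-group processing, this assumes the input is already ordered by department; unsorted input would produce several partial totals per department.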
As explained, it calculates the payroll for each department.
First.Department is set to 1 when the first record for a department is read. Last.Department is set to 1 when the last record for that department is read.
So if you have :
Department Wage
1 100
1 200
1 300
2 1000
2 2000
2 3000
With the first. and last. assigned, it will look like this:
Department  Wage  first.department  last.department
1           100   1                 0
1           200   0                 0
1           300   0                 1
2           1000  1                 0
2           2000  0                 0
2           3000  0                 1
Now you can follow your logic as to what happens when first.department = 1.
Note that the final statement, if Last.Department=1;, has no THEN clause because it is a subsetting IF: only the last observation of each department is written to WORK.TOTAL.