Apache Pig Student Marks Average Calculation - mapreduce

I have a dataset in the following format:
student_id|name|subject|marks
2 John English 50
3 mark Maths 50
3 mark English 50
This data is loaded into HDFS. I need to calculate the average marks across all subjects for each student using Pig. What would be the Pig approach to do this?

Group by student and take the average. Assuming you have loaded the data into relation A with the field names from the question:
B = GROUP A BY (student_id);
C = FOREACH B GENERATE group,AVG(A.marks);
DUMP C;
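To make the grouping step concrete, here is a minimal Python sketch of the same group-then-average logic, run on the three sample rows from the question. It is only an illustration of what GROUP and AVG compute, not Pig code; the function name average_marks and the tuple layout are assumptions made for the example.

```python
from collections import defaultdict

# Sample rows from the question: (student_id, name, subject, marks)
rows = [
    (2, "John", "English", 50),
    (3, "mark", "Maths", 50),
    (3, "mark", "English", 50),
]

def average_marks(records):
    """Group records by student_id and average the marks (GROUP BY + AVG)."""
    marks_by_student = defaultdict(list)
    for student_id, _name, _subject, marks in records:
        marks_by_student[student_id].append(marks)
    # One average per student, across all of that student's subjects
    return {sid: sum(m) / len(m) for sid, m in marks_by_student.items()}

print(average_marks(rows))
```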

Related

Wrong calculations for rows in Power BI matrix

I am trying to calculate market share, but struggling with doing it correctly.
I have a matrix where I have Category, Name as rows, Channel as column, and Market Share as value.
My dataset also has the columns ABS_COMPANY (sales are entered there only if company = "A", so some rows are blank) and ABS_TOTAL (sales are entered in every row).
My Market Share measure is:
Market Share = SUM(table[ABS_COMPANY]) / SUM(table[ABS_TOTAL])
This calculates the values correctly for each Category, but when I expand the drop-down to see Name, the Market Share of each Name equals 100%. What is the problem and how do I fix it?
e.g. What is now:
Market Share
Pens 43%
pen 1 100%
pen 2 100%
pen 3 100%
Pencils 29%
penc 1 100%
penc 2 100%
penc 3 100%
I've tried using CALCULATE(), but it does not work the way I want it to.
Unfortunately, I cannot share the data as it is sensitive.
Structure of dataset:
NAME STRING
CATEGORY STRING
CHANNEL STRING
ABS_COMPANY DECIMAL(20,2) - value of sales for each name
ABS_TOTAL DECIMAL(20,2) - it is a value grouped by CHANNEL AND CATEGORY at the backend

Calculating running count and percentages from long-type data

Hello lovely people of SO,
I have a dataset that looks like so:
ID_SALE PRODUCT STORE
SE_056 AAA NORTH
XT-558 AAA NORTH
8547Y AAA NORTH
TY856 BBB NORTH
D-895 BBB SOUTH
ER5H CCC SOUTH
5F6F-GD CCC SOUTH
65-FFD TTT SOUTH
56-YU UUU SOUTH
I want to be able to plot a table that shows the count of each PRODUCT, each PRODUCT's share of the overall total, and the cumulative percentage, like so:
PRODUCT Subtotal Percentage running %
AAA 3 0,33333333 0,33333333
BBB 2 0,22222222 0,55555556
CCC 2 0,22222222 0,77777778
TTT 1 0,11111111 0,88888889
UUU 1 0,11111111 1
I also want a filter on STORE in the PBI sheet, so that if I choose "NORTH" my table shows the following:
PRODUCT Subtotal Percentage running %
AAA 3 0,75 0,75
BBB 1 0,25 1
First, THANKS A LOT guys. PBI is truly coming for me and my mental health. Although I have used the quick-measure feature to get the cumulative total, I can't get it to sort my data in order, so I figured that DAX is the only way.
If you guys can help me out I will be so thankful, and I will be very attentive to your responses.
Assuming your table is named "Table".
Subtotal = COUNTROWS('Table')
Percentage = [Subtotal] / CALCULATE(COUNTROWS('Table'), REMOVEFILTERS())
running % =
VAR cursor = MAX('Table'[PRODUCT])
RETURN
    CALCULATE([Percentage], REMOVEFILTERS(), 'Table'[PRODUCT] <= cursor)
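As a way to check the expected numbers, here is a minimal Python sketch of the same three quantities (count, share of total, running share) on the sample data from the question. It is not DAX and not a substitute for the measures above; the function name product_table and its store argument are assumptions for illustration. In this sketch the store filter is applied before the totals are computed, which is the behaviour the question asks for.

```python
from collections import Counter

# Sample rows from the question: (ID_SALE, PRODUCT, STORE)
sales = [
    ("SE_056", "AAA", "NORTH"), ("XT-558", "AAA", "NORTH"),
    ("8547Y", "AAA", "NORTH"), ("TY856", "BBB", "NORTH"),
    ("D-895", "BBB", "SOUTH"), ("ER5H", "CCC", "SOUTH"),
    ("5F6F-GD", "CCC", "SOUTH"), ("65-FFD", "TTT", "SOUTH"),
    ("56-YU", "UUU", "SOUTH"),
]

def product_table(rows, store=None):
    """Return (product, subtotal, percentage, running_pct) rows sorted by product."""
    if store is not None:
        rows = [r for r in rows if r[2] == store]
    counts = Counter(product for _id, product, _store in rows)
    total = sum(counts.values())
    table, running = [], 0
    for product in sorted(counts):
        running += counts[product]  # cumulative count up to this product
        table.append((product, counts[product],
                      counts[product] / total, running / total))
    return table

for row in product_table(sales, store="NORTH"):
    print(row)
```

Calling product_table(sales) without a store gives the unfiltered table.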

Changing organization of data so that each observation represents a new variable (I tried)

I am working in Stata with a dataset on electric vehicle charging stations. Variables include
station_name: name of the charging station
review_text: all of the customer reviews for a specific station, delimited by }{
num_reviews: number of customer reviews
I'm trying to make a new file where each observation represents one customer review in a new variable customer_review and another variable station_id has the name of the corresponding station. So, if the original dataset had 100 observations (one per station) with 5 reviews each, the new file should have 500 observations.
How can I do this? I would include some code I have tried but I have no idea how to start.
If your data look like this:
station reviews n
1. 1 {good}{bad}{great} 3
2. 2 {poor}{excellent} 2
Then the following:
split reviews, parse("}{")
drop reviews n
reshape long reviews, i(station) j(review_num)
drop if reviews==""
replace reviews = subinstr(reviews, "}","",.)
replace reviews = subinstr(reviews, "{","",.)
will produce:
station review~m reviews
1. 1 1 good
2. 1 2 bad
3. 1 3 great
4. 2 1 poor
5. 2 2 excellent
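For readers who want to see the wide-to-long step outside Stata, here is a minimal Python sketch of the same idea on the two example stations above: split each review string on the }{ delimiter and emit one row per review. The function name split_reviews and the tuple layout are assumptions for illustration, not part of the Stata answer.

```python
def split_reviews(stations):
    """Turn (station, "{a}{b}") pairs into one (station, review_num, review) row per review."""
    long_rows = []
    for station, reviews in stations:
        # Strip the outer braces, then split on the "}{" delimiter
        parts = reviews.strip("{}").split("}{")
        for review_num, review in enumerate(parts, start=1):
            if review:  # skip empty reviews, like `drop if reviews==""`
                long_rows.append((station, review_num, review))
    return long_rows

# Example data from the answer above
stations = [(1, "{good}{bad}{great}"), (2, "{poor}{excellent}")]

for row in split_reviews(stations):
    print(row)
```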

How do I group my certain rows into a new heading in power BI?

I have a matrix visual with a list of KPIs taken from different tables.
Sample matrix visual data (A, B, C, D, E, F are KPIs):
Australia NZ India Korea China
A
B
C
D
E
F
I have to add another column to the left of A, B, C, etc., called "Pillar", which will group the existing KPIs into larger groups.
Expected Output :
Australia NZ India Korea China
Dog A
Cat B
Cat C
Bird D
Bird E
Bird F
Bird G
I have to create this new Pillar column and put it in my matrix visual.
The challenge I am facing is:
I do not have the list of KPIs in one table, so do I add this new column to all my tables in the data transformation step?
How do I achieve this?

Get the last 5 results of column C or D if column A or B is equal to ___?

I know the title is a horrible description, sorry.
Basically I have a sheet with results from basketball games. So in column A I have the home team. In column B I have the away team. In column C the home team's points. In column D the away team's points. There's about 500 rows worth of data at the minute.
What I want to do is the following:
Say I want to get the average points scored by the New York Knicks in their last 5 games. The most recent games are at the bottom of the sheet, and the first/oldest ones at the top of the sheet.
So across the bottom/last 5 instances of "New York Knicks" in column A and B, I want the average of the results of C (if New York Knicks is in column A) and D (if in column B).
I know how to do this if I would want just the last 5 home games for instance (so in that instance I basically query the bottom 5 results of column C in the last 5 occurrences of column A being New York Knicks). I don't know how to do it when I am looking for when New York Knicks occurs in either column A or B, and then have to get the averages from column C or D.
Can anyone help?
This will transform your 4 columns into 2 columns:
=ARRAYFORMULA(SPLIT(TRANSPOSE(SPLIT(QUERY(TRANSPOSE(QUERY(TRANSPOSE(
IF(A2:B<>"", "♠"&A2:B&"♦"&C2:D, )),,999^99)),,999^99), "♠")), "♦"))
And this gives the average score of the last 5 games:
=ARRAYFORMULA(AVERAGE(QUERY(SPLIT(TRANSPOSE(SPLIT(QUERY(TRANSPOSE(QUERY(TRANSPOSE(
IF(A2:B<>"", "♠"&A2:B&"♦"&C2:D, )),,999^99)),,999^99), "♠")), "♦"),
"select Col2
where lower(Col1) contains 'new york knicks'
offset "&COUNTIF(A2:B, "new york knicks")-5)))
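The logic behind the formula (stack home and away into one team/points list, keep the team's rows, take the last 5, average) can be sketched in Python as follows. This is only an illustration of the idea, not a Sheets formula; the function name last_n_average and the game rows are made up for the example and do not come from the question's sheet.

```python
def last_n_average(games, team, n=5):
    """Average the points `team` scored in its last `n` games, home or away."""
    team = team.lower()
    points = []
    for home, away, home_pts, away_pts in games:
        if home.lower() == team:
            points.append(home_pts)   # team played at home: use column C
        elif away.lower() == team:
            points.append(away_pts)   # team played away: use column D
    recent = points[-n:]              # most recent games are at the bottom
    return sum(recent) / len(recent) if recent else None

# Hypothetical rows: (home team, away team, home points, away points)
games = [
    ("New York Knicks", "Boston Celtics", 100, 90),
    ("Chicago Bulls", "New York Knicks", 95, 110),
    ("Boston Celtics", "Chicago Bulls", 101, 99),
    ("New York Knicks", "Chicago Bulls", 90, 88),
    ("Boston Celtics", "New York Knicks", 105, 100),
    ("New York Knicks", "Boston Celtics", 120, 115),
    ("Chicago Bulls", "New York Knicks", 98, 80),
]

print(last_n_average(games, "New York Knicks"))
```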