REPtree algorithm output - weka

Can anyone tell me how to interpret the output of the REPTree algorithm for a given dataset?
Here is part of the output:
won < 36.5
| team = BOS
| | year < 1961.5 : 36.6 (4/41.19) [1/3.06]
| | year >= 1961.5
| | | pace < 96.31 : 43.4 (2/1) [3/86]
| | | pace >= 96.31 : 51.86 (3/2) [4/73]
| team = CH1 : 20 (0/0) [1/913.28]
| team = CL1 : 30 (1/0) [0/0]
| team = DE1 : 40 (1/0) [0/0]
| team = NYK
| | year < 1958
| | | year < 1952.5 : 26.75 (3/1.56) [1/40.11]
| | | year >= 1952.5 : 36.67 (2/0) [1/1]
| | year >= 1958 : 51.06 (12/91.24) [4/18.92]
| team = PH1 : 36.89 (7/126.49) [2/15.44]

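I just found a good answer to your question.
The text before the : is the conditional rule.
The first number after the : is the predicted class value.
The first figure in each pair of brackets is the number of instances matching the rule. As far as I know, the round brackets refer to the training data used to grow the tree and the square brackets to the data held out for pruning.
The second figure in each pair is the error of the rule on those instances. For a nominal class it is the number of instances misclassified; for a numeric class like this one it is an error measure (I believe the mean squared error), which is why values such as 913.28 appear. It is not a percentage.
For more detail, see this link.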
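To make the layout of a leaf line concrete, here is a minimal Python sketch that splits one line of the output above into its parts. The regular expression, the function name parse_leaf and the dictionary keys are my own and not part of Weka; it also assumes a numeric class value, as in this output.

```python
import re

# Pattern for a REPTree leaf such as "| | year < 1961.5 : 36.6 (4/41.19) [1/3.06]".
LEAF = re.compile(
    r"^(?P<indent>(?:\|\s*)*)"              # one "| " per tree level
    r"(?P<rule>.+?) : "                     # the test that leads to this leaf
    r"(?P<value>\S+) "                      # predicted class value
    r"\((?P<n1>[\d.]+)/(?P<e1>[\d.]+)\) "   # first pair: count / error
    r"\[(?P<n2>[\d.]+)/(?P<e2>[\d.]+)\]$"   # second pair: count / error
)

def parse_leaf(line):
    """Return the parts of a leaf line, or None for an inner node."""
    m = LEAF.match(line.strip())
    if m is None:
        return None
    return {
        "depth": m.group("indent").count("|"),
        "rule": m.group("rule"),
        "prediction": float(m.group("value")),  # assumes a numeric class
        "first_pair": (float(m.group("n1")), float(m.group("e1"))),
        "second_pair": (float(m.group("n2")), float(m.group("e2"))),
    }

leaf = parse_leaf("| | year < 1961.5 : 36.6 (4/41.19) [1/3.06]")
print(leaf)
```

Lines without a " : " part, such as "| team = BOS", are inner nodes and come back as None.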

How to extract all values that contain part of a particular number and then delete them?

How do you extract all values containing part of a particular number and then delete them?
I have data where the IDs have different lengths, and I want to find every ID that ends in a particular code. For example, if the ID ends in "-00", "02" or "-01", I want to pull those rows so I can see the hit rate, and then delete the code from the ID. Is there a more efficient way to write this code?
I tried slicing the ID with the substring function, but a fixed position does not work for every ID.
Code:
Proc sql;
Create table work.data1 AS
SELECT Product, Amount_sold, Price_per_unit,
CASE WHEN Product Contains "Pen" and Length(ID) >= 9 Then SUBSTR(ID,1,9)
WHEN Product Contains "Book" and Length(ID) >= 11 Then SUBSTR(ID,1,11)
WHEN Product Contains "Folder" and Length(ID) >= 12 Then SUBSTR(ID,1,12)
...
END AS ID
FROM A;
Quit;
Have:
+------------------+-----------------+-------------+----------------+
| ID | Product | Amount_sold | Price_per_unit |
+------------------+-----------------+-------------+----------------+
| 123456789 | Pen | 30 | 2 |
| 63495837229-01 | Book | 20 | 5 |
| ABC134475472 02 | Folder | 29 | 7 |
| AB-1235674467-00 | Pencil | 26 | 1 |
| 69598346-02 | Correction pen | 15 | 1.50 |
| 6970457688 | Highlighter | 15 | 2 |
| 584028467 | Color pencil | 15 | 10 |
+------------------+-----------------+-------------+----------------+
Want:
+------------------+-----------------+-------------+----------------+
| ID | Product | Amount_sold | Price_per_unit |
+------------------+-----------------+-------------+----------------+
| 123456789 | Pen | 30 | 2 |
| 63495837229 | Book | 20 | 5 |
| ABC134475472 | Folder | 29 | 7 |
| AB-1235674467 | Pencil | 26 | 1 |
| 69598346 | Correction pen | 15 | 1.50 |
| 6970457688 | Highlighter | 15 | 2 |
| 584028467 | Color pencil | 15 | 10 |
+------------------+-----------------+-------------+----------------+
Test whether the string contains an embedded space or hyphen and whether the last word, delimited by space or hyphen, is 00, 01 or 02. If both hold, chop off the last three characters.
data have;
infile cards dsd dlm='|' truncover ;
input id :$20. product :$20. amount_sold price_per_unit;
cards;
123456789 | Pen | 30 | 2 |
63495837229-01 | Book | 20 | 5 |
ABC134475472 02 | Folder | 29 | 7 |
AB-1235674467-00 | Pencil | 26 | 1 |
69598346-02 | Correction pen | 15 | 1.50 |
6970457688 | Highlighter | 15 | 2 |
584028467 | Color pencil | 15 | 10 |
;
data want;
set have ;
if indexc(trim(id),'- ') and scan(id,-1,'- ') in ('00' '01' '02') then
id = substrn(id,1,length(id)-3)
;
run;
Result
                                          amount_      price_
Obs    id               product              sold    per_unit
  1    123456789        Pen                    30         2.0
  2    63495837229      Book                   20         5.0
  3    ABC134475472     Folder                 29         7.0
  4    AB-1235674467    Pencil                 26         1.0
  5    69598346         Correction pen         15         1.5
  6    6970457688       Highlighter            15         2.0
  7    584028467        Color pencil           15        10.0
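For readers who want to check the rule outside SAS, the same test can be sketched in Python. This is only an illustration of the logic, not a replacement for the data step; the names SUFFIX and strip_suffix are made up for the example.

```python
import re

# A space or hyphen followed by 00, 01 or 02 at the very end of the ID.
SUFFIX = re.compile(r"[- ](?:00|01|02)$")

def strip_suffix(raw_id):
    """Drop a trailing 00/01/02 code and its separator, if present."""
    return SUFFIX.sub("", raw_id.strip())

ids = ["123456789", "63495837229-01", "ABC134475472 02",
       "AB-1235674467-00", "69598346-02", "6970457688", "584028467"]
cleaned = [strip_suffix(i) for i in ids]

# "Hit rate": how many IDs actually carried one of the codes.
hits = sum(before != after for before, after in zip(ids, cleaned))
print(cleaned, hits)
```

IDs such as 6970457688 are left alone because the final digits are not preceded by a space or hyphen.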
There may be other solutions, but you will need some string functions either way. Here I used substr, reverse (which reverses the string) and indexc (which returns the position of the first of the given characters in the string):
data have;
input text $20.;
datalines;
12345678
AB-142353 00
AU-234343-02
132453 02
221344-09
;
run;
data want (drop=reverted pos);
set have;
if countw(text) gt 1
then do;
reverted=strip(reverse(text));
pos=indexc(reverted,'- ')+1;
new=strip(reverse(substr(reverted,pos)));
end;
else new=text;
run;
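Note that this second approach removes whatever follows the last space or hyphen, not only 00, 01 or 02. A rough Python equivalent (the function name strip_last_word is mine) makes that visible:

```python
def strip_last_word(text):
    """Cut the string at its last hyphen or space, if it has one."""
    text = text.strip()
    cut = max(text.rfind("-"), text.rfind(" "))
    # No separator: a single word, returned unchanged.
    return text if cut == -1 else text[:cut].strip()

samples = ["12345678", "AB-142353 00", "AU-234343-02", "132453 02", "221344-09"]
print([strip_last_word(s) for s in samples])
```

So 221344-09 loses its -09 as well, which may or may not be what you want.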

Power Bi, Dax - calculations, filters, balance

Could you please help me solve this problem? I am totally new to DAX and English is not my first language, so I am struggling even to phrase the question correctly.
Here's the problem.
I have two tables:
start_balance
+------+---------------+
| Type | Start balance |
+------+---------------+
| A | 0 |
| B | 10 |
+------+---------------+
in_out
+------+-------+------+----+-----+
| Year | Month | Type | In | Out |
+------+-------+------+----+-----+
| 2020 | 1 | A | 20 | 20 |
| 2020 | 1 | A | 0 | 10 |
| 2020 | 2 | B | 20 | 0 |
| 2020 | 2 | B | 20 | 10 |
+------+-------+------+----+-----+
I'd like to get the result as follows:
Unfiltered:
+------+-------+------+---------+----+-----+------+
| Year | Month | Type | Balance | In | Out | Left |
+------+-------+------+---------+----+-----+------+
| 2020 | 1 | A | 0 | 20 | 20 | 0 |
| 2020 | 1 | B | 10 | 20 | 10 | 20 |
| 2020 | 2 | A | 0 | 20 | 10 | 10 |
| 2020 | 2 | B | 20 | 20 | 10 | 30 |
+------+-------+------+---------+----+-----+------+
Filtered (for example year/month 2020/2):
+------+-------+------+---------+----+-----+------+
| Year | Month | Type | Balance | In | Out | Left |
+------+-------+------+---------+----+-----+------+
| 2020 | 2 | A | 0 | 20 | 10 | 10 |
| 2020 | 2 | B | 20 | 20 | 10 | 30 |
+------+-------+------+---------+----+-----+------+
So when a year/month is selected in a slicer, it should calculate the balance from everything before the selected year/month and then show the values for the selected year/month.
Edit: corrected start_balance table.
Is the sample data correct?
A -> the starting balance is 10, but in your unfiltered table example, it is 0.
Do you have any relationship between these tables?
Does opening balance always apply to the current year? What if 2021 appears in the in_out table? How do you know when the start balance started?
Example without a starting balance:
If you want a value that ignores a given filter, use ALL or the REMOVEFILTERS function (REMOVEFILTERS is available in Analysis Services 2019 and in Power BI since October 2019).
CALCULATE(SUM('in_out'[In]) - SUM('in_out'[Out]), ALL('in_out'[Year], 'in_out'[Month]))
More helpful information:
https://www.sqlbi.com/articles/managing-all-functions-in-dax-all-allselected-allnoblankrow-allexcept/
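To show the arithmetic the measure has to do, here is a small Python sketch of the idea: the opening balance ignores the period filter and adds up everything before the selected year/month, while In and Out respect it. It uses the two tables exactly as posted, so its numbers will not match the desired-result tables in the question, which seem to be based on different figures. The function name balances and the way the start balance is added are my assumptions.

```python
from collections import defaultdict

start_balance = {"A": 0, "B": 10}
in_out = [  # (year, month, type, in, out)
    (2020, 1, "A", 20, 20),
    (2020, 1, "A", 0, 10),
    (2020, 2, "B", 20, 0),
    (2020, 2, "B", 20, 10),
]

def balances(year, month):
    """Per type: (opening balance, in, out, left) for one selected period."""
    opening = dict(start_balance)
    moved_in = defaultdict(int)
    moved_out = defaultdict(int)
    for y, m, kind, qty_in, qty_out in in_out:
        if (y, m) < (year, month):
            # Earlier periods feed the opening balance, ignoring the slicer.
            opening[kind] = opening.get(kind, 0) + qty_in - qty_out
        elif (y, m) == (year, month):
            # Only the selected period contributes to In and Out.
            moved_in[kind] += qty_in
            moved_out[kind] += qty_out
    return {
        kind: (opening[kind], moved_in[kind], moved_out[kind],
               opening[kind] + moved_in[kind] - moved_out[kind])
        for kind in sorted(opening)
    }

print(balances(2020, 2))
```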

How to iterate through a list of ranges in Google Apps Script

I have a Google Sheet with many names and hours that need to be organized. I tried using built-in functions, but this sheet is the result of inputs on other sheets, so the number of rows varies.
Sheet 1
A B C D E T U
Project| Name1 | Hours1 | Name2 | Hours2 | ... | ... | Name10 | Hours10|
————————————————————————————————————————————————————————————————————————
P1 | Larry | 10 | Bob | 20 | ... | ... | Tim | 10 |
P2 | Bob | 15 | Tim | 15 | ... | ... | Larry | 15 |
.... | ... | ... | ... | ... | ... | ... | ... | ... |
Pnth | Tim | 20 | Larry | 10 | ... | ... | Bob | 10 |
So far I have tried to iterate through the whole sheet, using a list of names to sort by, but I need it to handle a variable number of rows.
function organize(){
var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
var rangeList = sheet.getRangeList(['B1:C','D1:E','F1:G','H1:I','J1:K','L1:M','N1:O','P1:Q','R1:S','T1:U']);
What I want it to look like (on a separate sheet): list of names and total hours
Sheet 2
Name | Total hours | Number Projects Assigned|
——————————————————————————————————————————————
Larry| TOTAL NUMBER | 4 (P1,P2,Pnth) |
Tim | TOTAL NUMBER | 4 (P1,P2,Pnth) |
Bob | TOTAL NUMBER | 4 (P1,P2,Pnth) |
Flow:
Get all the values in the range
Loop through them vertically and then horizontally
Create an object with each name as key and [hours, projects] as value.
Map the object back to a 2D array
Snippet:
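function organize(values) {
  var out = {}; // name -> [total hours, number of projects]
  values.forEach(function(row) {
    // Column 0 is the project; names sit in columns 1, 3, 5, ... with hours right after
    for (var col = 1, l = row.length; col < l; col += 2) {
      var name = row[col];
      if (name === '') continue; // skip empty name/hours pairs
      out[name] = out[name] || [0, 0]; // [hours, projects]
      out[name][0] += Number(row[col + 1]) || 0; // hours sum
      out[name][1]++; // projects count
    }
  });
  // Map the object back to a 2D array for the sheet
  return Object.keys(out).map(function(name) {
    return [name, out[name][0], out[name][1]]; // [name, hours, projects]
  });
}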
If used as a custom function,
=organize(A2:U4)
returns name, hours and projects.
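As a worked check of the flow, here is the same aggregation sketched in Python on the three sample rows from the question, with the elided "..." columns left out. Skipping blank names is my assumption about how unused pairs appear in the sheet.

```python
def organize(values):
    """Sum hours and count projects per name from rows shaped like
    [project, name1, hours1, name2, hours2, ...]."""
    out = {}  # name -> [total hours, number of projects]
    for row in values:
        # Names sit in columns 1, 3, 5, ...; hours follow in the next column.
        for col in range(1, len(row) - 1, 2):
            name = row[col]
            if name == "":
                continue  # assumption: blank cells mark unused pairs
            totals = out.setdefault(name, [0, 0])
            totals[0] += row[col + 1]
            totals[1] += 1
    return [[name, hours, projects] for name, (hours, projects) in out.items()]

sample = [
    ["P1", "Larry", 10, "Bob", 20, "Tim", 10],
    ["P2", "Bob", 15, "Tim", 15, "Larry", 15],
    ["Pnth", "Tim", 20, "Larry", 10, "Bob", 10],
]
print(organize(sample))
```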

Rank categories by sum (Power BI)

I need to rank products for my dashboard. Each day, we store the sales of each product. As a result, we have a dataset like this example:
+-----------+------------+-------+
| product | date | sales |
+-----------+------------+-------+
| coffee | 11/03/2019 | 15 |
| coffee | 12/03/2019 | 10 |
| coffee | 13/03/2019 | 28 |
| coffee | 14/03/2019 | 1 |
| tea | 11/03/2019 | 5 |
| tea | 12/03/2019 | 2 |
| tea | 13/03/2019 | 6 |
| tea | 14/03/2019 | 7 |
| Chocolate | 11/03/2019 | 30 |
| Chocolate | 11/03/2019 | 4 |
| Chocolate | 11/03/2019 | 15 |
| Chocolate | 11/03/2019 | 10 |
+-----------+------------+-------+
My attempt
I actually managed to rank my products, but not in the way I wanted: the rank increases by the number of rows. For example, chocolate is first, but it has 4 rows, so coffee is ranked 5 and not 2.
+-----------+------------+-------+-----+------+
| product | date | sales | sum | rank |
+-----------+------------+-------+-----+------+
| coffee | 11/03/2019 | 15 | 54 | 5 |
| coffee | 12/03/2019 | 10 | 54 | 5 |
| coffee | 13/03/2019 | 28 | 54 | 5 |
| coffee | 14/03/2019 | 1 | 54 | 5 |
| tea | 11/03/2019 | 5 | 20 | 9 |
| tea | 12/03/2019 | 2 | 20 | 9 |
| tea | 13/03/2019 | 6 | 20 | 9 |
| tea | 14/03/2019 | 7 | 20 | 9 |
| Chocolate | 11/03/2019 | 30 | 59 | 1 |
| Chocolate | 11/03/2019 | 4 | 59 | 1 |
| Chocolate | 11/03/2019 | 15 | 59 | 1 |
| Chocolate | 11/03/2019 | 10 | 59 | 1 |
+-----------+------------+-------+-----+------+
sum field formula:
sum =
SUMX(
FILTER(
Table1;
Table1[product] = EARLIER(Table1[product])
);
Table1[sales]
)
rank field formula:
rank = RANKX(
ALL(Table1);
Table1[sum]
)
As you can see, we get the following ranking:
1 : Chocolate
5 : Coffee
9 : Tea
Improvements
I would like to transform the previous result into :
1 : Chocolate
2 : Coffee
3 : Tea
Can you help me improve my ranking system and get a marvelous 1, 2, 3 instead of this ugly and impractical 1, 5, 9?
If you don't know the answer, you can help by simply upvoting the question ♥
Fortunately, this is an easy fix.
If you look at the documentation for the RANKX function, you'll notice an optional ties argument which you can set to Skip or Dense. The default is Skip but you want Dense. Try this:
rank =
RANKX(
ALL(Table1);
Table1[sum];
;;
"Dense"
)
(Those extra ; delimiters are there since we aren't specifying the optional value or order arguments.)
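If the difference between the two tie modes is not obvious, this small Python sketch reproduces both on the per-row sums from the question (54 for coffee, 20 for tea, 59 for chocolate, four rows each). The rank function is only an illustration of the two rules, not how RANKX is implemented.

```python
def rank(values, dense=False):
    """Rank values in descending order, one rank per input value."""
    if dense:
        # Dense: rank = position among the distinct values.
        order = sorted(set(values), reverse=True)
        return [order.index(v) + 1 for v in values]
    # Skip: rank = 1 + number of entries that are strictly larger.
    return [1 + sum(other > v for other in values) for v in values]

sums = [54] * 4 + [20] * 4 + [59] * 4  # coffee, tea, chocolate rows

print(rank(sums))              # skip: coffee 5, tea 9, chocolate 1
print(rank(sums, dense=True))  # dense: coffee 2, tea 3, chocolate 1
```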

Django calculate percentages within group by

I have a model for which I want to perform a group-by on two values and calculate the percentages of each value per outer grouping.
Currently I just make a query to get all the rows and put them into a pandas dataframe and perform something similar to the answer here. Although this works I'm sure it would be more efficient if I could make the query return the information I require directly.
I am currently running Django 2.0.5 with a backend DB on PostgreSQL 9.6.8
I think window functions could be the solution as indicated here but I cannot construct a successful combination of annotate and values to give me the desired output.
Another possible solution could be rollup introduced in PostgreSQL 9.5 if I can find a way to get the summary row as a set of extra columns for each row? But I also think it's not yet supported by Django.
Model:
from django.db import models

class ModelA(models.Model):
    grouper1 = models.CharField()
    grouper2 = models.CharField()
    metric1 = models.IntegerField()
All rows:
grouper1 | grouper2 | metric1
---------+----------+---------
A | C | 2
A | C | 2
A | C | 2
A | D | 4
A | D | 4
A | D | 4
B | C | 5
B | C | 5
B | C | 5
B | D | 6
B | D | 4
B | D | 5
Desired output:
grouper1 | grouper2 | sum(metric1) | Percentage
---------+----------+--------------+-----------
A | C | 6 | 40
A | D | 12 | 60
B | C | 15 | 50
B | D | 15 | 50
I got close to what I expected with
from django.db.models import F, Sum, Window

ModelA.objects.all(
).values(
    'grouper1',
    'grouper2'
).annotate(
    SumMetric1=Window(expression=Sum('metric1'), partition_by=[F('grouper1'), F('grouper2')]),
    GroupSumMetric1=Window(expression=Sum('metric1'), partition_by=[F('grouper1')])
)
However this returns a row for every original row in the database like so:
grouper1 | grouper2 | sum(metric1) | Percentage
---------+----------+--------------+-----------
A | C | 6 | 40
A | C | 6 | 40
A | C | 6 | 40
A | D | 12 | 60
A | D | 12 | 60
A | D | 12 | 60
B | C | 15 | 50
B | C | 15 | 50
B | C | 15 | 50
B | C | 15 | 50
B | C | 15 | 50
B | D | 15 | 50
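In this situation .distinct() might help: the window functions repeat the same values on every row of a partition, so chaining .distinct() onto the queryset should collapse the duplicates.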
More information is here.
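To see why DISTINCT collapses the rows, here is a sketch that runs the equivalent SQL against an in-memory SQLite database from plain Python, with no Django involved. The table name modela and the column aliases are assumptions for the example, and it needs SQLite 3.25 or newer for window functions. With the rows as posted, group A totals 18, so A/C comes out at a third rather than the 40 shown in the desired output.

```python
import sqlite3

rows = (
    [("A", "C", 2)] * 3 + [("A", "D", 4)] * 3 + [("B", "C", 5)] * 3
    + [("B", "D", 6), ("B", "D", 4), ("B", "D", 5)]
)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE modela (grouper1 TEXT, grouper2 TEXT, metric1 INTEGER)")
con.executemany("INSERT INTO modela VALUES (?, ?, ?)", rows)

# Same two window sums as the annotate() call, plus DISTINCT to drop repeats.
query = """
    SELECT DISTINCT
        grouper1,
        grouper2,
        SUM(metric1) OVER (PARTITION BY grouper1, grouper2) AS sum_metric1,
        SUM(metric1) OVER (PARTITION BY grouper1) AS group_sum_metric1
    FROM modela
    ORDER BY grouper1, grouper2
"""

# The percentage is computed in Python from the two sums.
result = [
    (g1, g2, total, round(100.0 * total / group_total, 2))
    for g1, g2, total, group_total in con.execute(query)
]
print(result)
```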