Weka discretising attribute where one value is most common by far - weka

I have a dataset in which there's a numerical attribute for the 'number of days since last contact', but the value -1 is used to indicate that there hasn't been a last contact. It is by far the most common value for this attribute.
My idea is to discretise this attribute but how can I ensure there is a 'no contact'/-1 bin?
Also, is this the correct approach to this problem?

The proper approach supposedly is to
Split the data into -1 and everything else
Apply binning to the values in the 'everything else' set only
Concatenate the data sets again (it may be good to shuffle, too)

If anyone else has this question and can't find an answer, here's how I did it based on Anony-Mousse's method. The filter documentation for MathExpression gives a good example of splitting into arbitrary bins.
Split using the MathExpression filter, e.g. ifelse(A>0, 2, 1) splits into two bins: above 0, and 0 or below. I used ifelse(A>0, ifelse(A>400, 21, ceil(A/20)+1), 1) to put my -1 values in bin 1, my >400 values in bin 21, and the values in between into bins of width 20.
Convert the result to a nominal attribute using the NumericToNominal filter.
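To check which bin a given value lands in before running the filter, the second expression above can be mirrored in a few lines of Python. This is only a sketch of the arithmetic, not Weka code; the function name contact_bin is made up for illustration. Note that, as written, the expression sends 0 to bin 1 together with -1, and sends both 381-400 and >400 to bin 21.

```python
import math

def contact_bin(days: float) -> int:
    """Mirror ifelse(A>0, ifelse(A>400, 21, ceil(A/20)+1), 1)."""
    if days > 0:
        if days > 400:
            return 21                        # everything above 400
        return math.ceil(days / 20) + 1      # width-20 bins, starting at label 2
    return 1                                 # -1 (no contact) and 0

if __name__ == "__main__":
    for d in (-1, 1, 20, 21, 400, 401):
        print(d, contact_bin(d))
```

If the >400 group should be kept separate from 381-400, the inner constant would need to be a label the ceil term never produces (22, for instance).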

Related

Excel Alternative to nested IF

I have a couple of rather large nested if functions in my spreadsheet. It sure would be nice to have an alternative method. Problem is I'm using a wildcard (*) in my lookup because the source text has slight variations (date for example).
For example, if my list of data contains:
VENMO PAYMENT 220828 1022093447487 BRENDA HOSPY
VENMO PAYMENT 220813 1031323447487 BRENDA HOSPY
I want these to show in an adjacent column of cells as just Venmo
Currently my if function in that second column of cells is:
=IF(COUNTIF($F10,"*APPLE.COM/BILL*"),"AP",
IF(COUNTIF($F10,"IIA VOYA*"),"VOYA",
IF(COUNTIF($F10,"VENMO PAYMENT*"),"Venmo",
IF(COUNTIF($F10,etc...
This works fine but quickly gets unruly as more things get added.
I've spent a great deal of time searching for functions and processes that would make this easier, or at least more compact, but I can't find a way with typical functions like vlookup or index/match.
If I've explained this in a comprehensible fashion perhaps you've seen or experienced a similar situation and could offer a suggestion. It would be appreciated!
I'm not opposed to using a programming function.
I've looked at, and for, various Excel functions or combinations with no luck on my own or online.
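I have created a structure as below: the source text is in column A, the strings to search for are in E2:E9, and the labels to return are in F2:F9.
The formula in B2 is: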
=IFERROR(INDEX($F$2:$F$9,MIN(IF(COUNTIF(A2,"*"&$E$2:$E$9&"*")>0,ROW($E$2:$E$9),9999999)-1)),"---")
Enter it as an Array Formula using Ctrl+Shift+Enter
It searches A2 for every string in column E. For each string that is found it returns that string's row number in column E, and for each string that is not found it returns 9999999. MIN then picks the first matching row. Because the data starts in row 2, I subtract 1 so that the row number becomes an index into F2:F9, and INDEX returns the value at that index in column F. Finally, IFERROR shows --- when nothing matched, because the index derived from 9999999 is out of range and INDEX fails.
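Since the question mentions being open to a programming approach, the same "first matching pattern wins" lookup can be sketched in Python. The rule list stands in for columns E and F and is hypothetical; only the three patterns shown in the question are included, and the function name categorize is an assumption.

```python
# Hypothetical pattern/label pairs standing in for columns E and F.
RULES = [
    ("APPLE.COM/BILL", "AP"),
    ("IIA VOYA", "VOYA"),
    ("VENMO PAYMENT", "Venmo"),
]

def categorize(text: str, rules=RULES, default: str = "---") -> str:
    """Return the label of the first pattern found anywhere in text."""
    upper = text.upper()  # COUNTIF wildcard matching ignores case
    for pattern, label in rules:
        if pattern.upper() in upper:
            return label
    return default  # same role as the IFERROR fallback

if __name__ == "__main__":
    print(categorize("VENMO PAYMENT 220828 1022093447487 BRENDA HOSPY"))
```

Adding a new category is then a one-line change to the rule list rather than another nesting level.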

Report Builder- Nested If Statements with Multiple Values to Categorize

I am working in Report Builder and having issues creating a calculated field to categorize data from another column.
To simplify and explain my goal:
I’d like to create a calculated field with 4 distinct categories and I’m assuming the best way to do that is a nested if statement. Feel free to correct me if that is not the best function to use.
Category 1: Let’s just call it “A”
Category 2: “B“
Category 3: “C“
Category 4: “D”
Values from the other column:
Simplified Example-
Numbers 1-10 would be category A,
numbers 11-20 would be B,
numbers 21-30 would be C,
numbers 31-40 would be category D
However in my particular case the values aren’t nicely organized in those 10 consecutive ranges. For example, I have a 33 value that would be an A category, which makes it so I can’t use the greater than or less than operators.
Having explained my issue and goal- my question is how to write the syntax for an if statement when I have multiple discrete values that aren’t neatly organized in consecutive numerical order?
I hope this question makes sense.
I tried using just one argument to get it going and got stumped when it didn’t work:
Iif(field data = 1,2,3,4,5,6,7,8,9,10,33, “A”, “Other”)
It doesn’t work with the commas and I tried inserting the Or Operator between each value and that didn’t work either.
Thanks for any syntax tips you can provide.
There are a few ways you can do this.
Option 1: In your database design
The best way, in my opinion, is to do this in your database. Create a table with these values/category pairs and simply join to that whenever you need to include the categorised view of the data.
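As a rough illustration of the lookup-table idea, here is a self-contained sketch using Python's built-in sqlite3 module. The table and column names (category_map, facts, my_data) are invented for the example, and the real database will differ; the point is only the shape of the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category_map (value INTEGER PRIMARY KEY, category TEXT NOT NULL);
    CREATE TABLE facts (id INTEGER PRIMARY KEY, my_data INTEGER);
""")

# One row per value/category pair; 33 is mapped to "A" like 1-10.
pairs = [(v, "A") for v in list(range(1, 11)) + [33]]
pairs += [(v, "B") for v in range(11, 21)]
conn.executemany("INSERT INTO category_map VALUES (?, ?)", pairs)
conn.executemany("INSERT INTO facts (my_data) VALUES (?)", [(5,), (33,), (12,), (99,)])

# LEFT JOIN keeps unmapped values; COALESCE labels them "Other".
rows = conn.execute("""
    SELECT f.my_data, COALESCE(m.category, 'Other')
    FROM facts AS f
    LEFT JOIN category_map AS m ON m.value = f.my_data
    ORDER BY f.id
""").fetchall()
print(rows)
```

Changing a categorisation then means updating a row in the mapping table, with no report changes.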
Option 2: In your report design
If you really have to do this in the report design, then using SWITCH() will probably be easier, certainly to read.
Given your second example, and expanding it a little you could do something like this...
=SWITCH(
(Fields!myData.Value >=1 AND Fields!myData.Value <=10) OR Fields!myData.Value = 33, "A",
(Fields!myData.Value >=11 AND Fields!myData.Value <=15) OR Fields!myData.Value = 34 OR Fields!myData.Value = 39, "B",
True, "Other"
)
SWITCH uses pairs of values: when the first value in a pair evaluates to true, the second value in that pair is returned.
The final True, "Other" acts like an else. If no previous criteria matched, then the final pair always evaluates to true, so "Other" would be returned.
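The pair-by-pair evaluation described above can be modelled in a few lines of Python. This is a sketch of the logic only, not Report Builder syntax; the helper names switch and category_for are made up, and returning None when nothing matches is a choice made for this sketch.

```python
def switch(*pairs):
    """Return the value paired with the first true condition, else None."""
    for condition, value in zip(pairs[0::2], pairs[1::2]):
        if condition:
            return value
    return None

def category_for(v: int) -> str:
    # Same conditions as the SWITCH expression above.
    return switch(
        (1 <= v <= 10) or v == 33, "A",
        (11 <= v <= 15) or v == 34 or v == 39, "B",
        True, "Other",
    )

if __name__ == "__main__":
    for v in (7, 33, 39, 20):
        print(v, category_for(v))
```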

How to apply conditional formatting (if cell is in another range) to a range of cells

So I have searched through several different questions related to this. None of them seem to be asking exactly what I'm looking for and none of the solutions I've found have worked for me thus far.
I have several columns of data (Player names) where each column's values are generated from a formula in the 2nd row of that column. The 1st row is a header (Game name). This whole range is the collection of which players are willing to play which games. These are columns D-J(ish, the list is dynamically generated with another formula, based on form responses)
I have another range of data where the 1st column is the Player and the 2nd is the player's PREFERRED game. This data is also generated with a formula based on form responses. These are columns A-B.
Here's what I'm trying to do
Using conditional formatting in columns D-J, I want to highlight the player's name if this game (in row 1 of this column) is their preferred game (range A2:B).
I've tried several different variations of VLOOKUPS, MATCHES, and FILTERS in the conditional formatting, but so far nothing has worked. The problem I run into every time is that I can't figure out how to reference the cell that the formatting is applying to, but still have it reference each individual cell over the whole range.
I know I could do this if I applied an individual conditional formatting rule to each individual cell. However, that is a very time-consuming and inelegant solution, considering I'm expecting my data range to be much larger in the future. I need a conditional formatting formula that will work across the whole range or, at the very least, for an entire column.
This is a mock of what I'm trying to accomplish:
This is a link to a mock of my sheet so that you can clearly see the data layout and specific formulas I'm using:
https://docs.google.com/spreadsheets/d/1wy1T6dWJwNC_EfdCAbkuxtkJH7y4Cg3x4IyEk6R567M/edit?usp=sharing
use:
=REGEXMATCH(D3, TEXTJOIN("|", 1, FILTER($A$3:$A, $B$3:$B=D$2)))
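To see what that formula does for one cell, here is a Python sketch of the three steps: FILTER keeps the players whose preferred game equals the column header, TEXTJOIN joins them with |, and REGEXMATCH tests the cell against that pattern. The sample data and the function name should_highlight are hypothetical. Two choices here are assumptions of the sketch, not of the formula: names are escaped with re.escape, and an empty filter result is treated as "no highlight".

```python
import re

# Hypothetical stand-in for columns A:B (player, preferred game).
PREFERRED = [("Ann", "Chess"), ("Bob", "Go"), ("Cy", "Chess")]

def should_highlight(cell_name: str, game: str, prefs=PREFERRED) -> bool:
    """True if cell_name matches a player whose preferred game is `game`."""
    names = [player for player, pref in prefs if pref == game]  # FILTER step
    if not names:
        return False  # assumption: nothing to match means no formatting
    pattern = "|".join(re.escape(n) for n in names)             # TEXTJOIN step
    return re.search(pattern, cell_name) is not None            # REGEXMATCH step

if __name__ == "__main__":
    print(should_highlight("Ann", "Chess"), should_highlight("Bob", "Chess"))
```

Because the match is partial, a pattern of Ann would also match a cell containing Annie; anchoring the pattern would be needed if that matters.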

Formula Moving Data To Second Page If Criteria in Range is Met

I'm working on a Google Sheets project that will move data from one page to another. I need the formula to search a range ('Booth Placement'!O2:O1000) for a set value (133) and, wherever a cell equals that value, write the data from the same row of 'Booth Placement'!A2:A1000.
I know the IF can only work for one column and not a range spanning multiple columns. What should I switch the formula below to?
=IF('Booth Placement'!O2:O1000=133,'Booth Placement'!A2:A1000,"")
I am trying to keep this formula as simple as possible since I will have to change the value it is searching for on each cell on the second page. I've googled this for two days and I'm pretty sure I'm just missing the obvious. Any/All Help is appreciated.
try:
=FILTER('Booth Placement'!A2:A1000; 'Booth Placement'!O2:O1000=133)
or:
=ARRAYFORMULA(IF('Booth Placement'!O2:O1000=133; 'Booth Placement'!A2:A1000; ))
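The two formulas differ in the shape of their output: FILTER returns only the matching rows, packed together, while ARRAYFORMULA(IF(...)) returns one result per input row and leaves non-matching rows empty. A small Python sketch with made-up data shows the difference; the lists stand in for columns O and A of the 'Booth Placement' sheet.

```python
# Hypothetical stand-ins for 'Booth Placement'!O2:O and 'Booth Placement'!A2:A.
booth_numbers = [133, 201, 133, 150]
names = ["Ann", "Bob", "Cy", "Di"]
target = 133

# FILTER-like: only the matching rows, with no gaps.
filtered = [n for n, b in zip(names, booth_numbers) if b == target]

# ARRAYFORMULA(IF(...))-like: same length as the input, blanks where no match.
aligned = [n if b == target else "" for n, b in zip(names, booth_numbers)]

print(filtered)
print(aligned)
```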

How to set up COUNTIF or COUNTIFS formula in Google Sheets to compare columns?

I am trying to compare the numerical data in two columns in Google Sheets (say, Col. A and B) and return a count of all of the times that they vary by say, more than 1 (e.g., if A3 = 5 and B3 = 2, this should get counted). The two-column arrays will always be of equal size.
At first, I thought that either COUNTIF or COUNTIFS would be my go-to tool, but I can't get this to work with either formula. These formulas seem to handle criteria within a cell, but - as far as I can tell - can't handle criteria comparing data within two different (adjacent) cells.
Can someone help me with some super syntax work-around to get COUNTIF/COUNTIFS to work... or is there a more appropriate formula to the job (perhaps involving FILTER)?
*Quick Edit: I know I could always add an additional column, which would be very simple in this example. But my real-world spreadsheets are a lot more complex and are already suffering from column overload. A lot of other formulas are already set up around existing columns, and I was hoping to discover a more elegant solution that would allow me to come up with the count without having to add a new column for each and every comparison calculation.
=ARRAYFORMULA(IF(LEN(A:A&B:B), IF(A:A-B:B>1, 1, )+IF(B:B-A:A>1, 1, ), ))
if you want final sum instead of "per row" count use:
=SUM(ARRAYFORMULA(IF(LEN(A:A&B:B), IF(A:A-B:B>1, 1, )+IF(B:B-A:A>1, 1, ), )))
Add a third column, containing e.g. =ABS(SUM(A3-B3)). (The ABS gives you the positive difference regardless of which value is larger.)
At the bottom of that column, use COUNTIF like =COUNTIF(C1:C25, ">1") (where C1:C25 is the range of cells containing those positive differences).
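Both answers compute the same thing: flag each row whose absolute difference exceeds 1, then total the flags. A minimal Python sketch of that logic follows, with made-up sample columns. Rows where both cells are blank are skipped, as the LEN(A:A&B:B) test does in the array formula, and a single blank is treated as 0 in this sketch.

```python
# Hypothetical sample data; None represents a blank cell.
col_a = [5, 3, 7, None, 2]
col_b = [2, 3, 9, None, 1]

flags = []
for a, b in zip(col_a, col_b):
    if a is None and b is None:
        continue                      # skip fully blank rows
    diff = abs((a or 0) - (b or 0))   # positive difference, either direction
    flags.append(1 if diff > 1 else 0)

total = sum(flags)
print(flags, total)
```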