Merge cells with similar but different data, different spelling - cell

I am trying Tableau with data extracted from Salesforce. The input includes a "Country" field where the rows have different spellings for the same thing.
Example: Cananda, CANADA, CAnada etc.
Is there a way to fix this in Tableau?

The easiest solution is to create a group field based on your Country field.
Select Country in the data pane in the left sidebar, right-click, and choose Create Group. Select the elements that you want to group together and put them into a single group, say Canada, that contains all the spelling variations.
This new group field is initially named Country (group). You may want to rename it Country_Corrected. (Or even better, rename the original field to Country_Original and call the group field simply Country. Then you can hide Country_Original.)
Groups are implemented using SQL case statements. They have many uses, but one application is to easily tolerate some inconsistent spellings in your data source without having to change your data. In general, you can specify several transformations like this that take effect at query and visualization time. For very large data sets, or for very complicated transformations, you may eventually want to push some of them upstream in your data pipeline to get better performance. But make those optimizations later when you've proven the necessity.
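As a rough illustration of what a group does conceptually, the sketch below maps each spelling variant to one canonical label, much as a CASE expression would. It is plain Python, not anything Tableau generates, and the names COUNTRY_GROUPS and canonical_country are hypothetical.

```python
# Hypothetical mapping from observed spellings to a canonical label,
# analogous to the members you would place into a single Tableau group.
COUNTRY_GROUPS = {
    "Cananda": "Canada",
    "CANADA": "Canada",
    "CAnada": "Canada",
}


def canonical_country(raw: str) -> str:
    """Return the group label for a raw value, or the value itself if ungrouped."""
    return COUNTRY_GROUPS.get(raw, raw)


rows = ["Cananda", "CANADA", "CAnada", "Mexico"]
corrected = [canonical_country(r) for r in rows]
print(corrected)
```

Values that are not listed in the mapping pass through unchanged, which matches how ungrouped members keep their own label.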

If the differences are just in case (upper vs lower), you can right-click the Country dimension, and create a calculated field called something like "New Country", and use the following formula to make the case consistent:
upper([Country])
Use this new "New Country" calc dimension instead of your "Country" dimension, and it will group them all without case sensitivity, and display as uppercase. Or you can use "lower" instead of "upper" if preferred.
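To see what case normalization does and does not fix, here is a small Python sketch (not Tableau code) that counts values after upper-casing them. The sample list is taken from the spellings in the question.

```python
from collections import Counter

countries = ["Canada", "CANADA", "CAnada", "Cananda"]

# Upper-casing merges variants that differ only in letter case.
counts = Counter(c.upper() for c in countries)
print(counts)
```

Note that a genuine misspelling such as Cananda still lands in its own bucket, so case normalization alone only helps when the differences really are just in case.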

Related

AWS Personalize items attributes

I'm trying to implement personalization and am having problems with the Items schema.
Imagine I'm Amazon: I have products, their brands, and their categories. In what kind of Items schema should I include this information?
Should I include brand name as string as categorical field? Should I rather include brand ID as string or numeric? or should I include both?
What about categories? I've the same questions.
Metadata Fields
Metadata includes string or non-string fields that aren't required or don't use a reserved keyword. Metadata schemas have the following restrictions:
Users and Items schemas require at least one metadata field.
Users and Interactions datasets can contain up to five metadata fields. An Items dataset can contain up to 50 metadata fields.
If you add your own metadata field of type string, it must include the categorical attribute. Otherwise, Amazon Personalize won't use the field when training a model.
https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html
There are simply two ways to include your metadata in Items/Users datasets:
If it can be represented as a numeric value, then provide the actual value if it makes sense.
If it can be represented as a string, then provide the string value and make sure that categorical is set to true.
But let's look at why they need you to mark your string metadata as categorical. The answer is pretty simple.
Let's start with an example.
If your Items were Amazon.com products and you wanted to provide a ratings metadata field, then:
1. You could take all of the ratings, including the full review text sent by customers, and simply put it in as a metadata field.
2. You could take just the star ratings, calculate the average, and put that in as a metadata field.
The second one probably makes more sense in general. Having random, long product reviews as metadata changes pretty much nothing. Personalize doesn't understand whether the review itself is good or bad, or whether the author also recommends another product, so it doesn't really add anything to the recommendations.
However, if you simply "cut" your dataset down and calculate the average rating, as in the second option, it makes a lot more sense. Maybe some of your customers like crappy products? Maybe they want to buy them because they are famous YouTubers who create videos about them? Based on their previous interactions and much more, Personalize will be able to perform slightly better, because now it knows that this product has a rating of 5/5 or 3/5.
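A minimal sketch of that averaging step, in Python, might look like the following. The review records and item IDs are made up for illustration; in practice they would come from your own review data before you export the Items dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (item_id, stars) review records.
reviews = [("item-1", 5), ("item-1", 4), ("item-2", 3), ("item-2", 2), ("item-2", 1)]

# Collect the star ratings per item, ignoring any review text.
stars_by_item = defaultdict(list)
for item_id, stars in reviews:
    stars_by_item[item_id].append(stars)

# One numeric value per item, suitable for a numeric metadata field.
average_rating = {item: round(mean(stars), 2) for item, stars in stars_by_item.items()}
print(average_rating)
```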
I wanted to show you that in some cases providing Items metadata as a string makes no sense. That's why your string metadata must be categorical. It means that it should be a finite set of values, so that it adds some knowledge for Personalize about a given Item and why some people might want to interact with it.
Going back to your question:
Should I include brand name as string as categorical field? Should I rather include brand ID as string or numeric? or should I include both?
I would simply go with the brand ID as a string. You could also go with the brand name, but a brand can be renamed while still being the same brand, so the ID is more stable. Also, two different brands could have the same name because they operate in different markets, and using the ID solves that too.
The "categorical": true switch in your schema just tells Personalize:
Hey, do you see that string field? It's a categorical, finite set of values. If you train a model for me, please include this one during the training, it's important!
And as the documentation says, if you provide a string metadata field that is not marked as categorical, then Personalize will "think":
Hmm.. this field is a string, it has pretty random values and it's not marked as categorical. It's probably just a leftover from Items export job. Let's ignore that.
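To make this concrete, here is a sketch of what an Items schema along these lines could look like, built as a Python dict and serialized to JSON. The overall shape follows the Avro-style format described in the linked documentation; the field names BRAND_ID, CATEGORY_ID, and AVERAGE_RATING are my own hypothetical choices, so check the current docs before relying on the details.

```python
import json

# Sketch of an Items schema; metadata field names are hypothetical.
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        # String metadata must be flagged as categorical to be used in training.
        {"name": "BRAND_ID", "type": "string", "categorical": True},
        {"name": "CATEGORY_ID", "type": "string", "categorical": True},
        # Numeric metadata is passed as a plain value.
        {"name": "AVERAGE_RATING", "type": "float"},
    ],
    "version": "1.0",
}

schema_json = json.dumps(items_schema, indent=2)
print(schema_json)
```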

Negative filtering by filter_box or some other mechanism

Let's say I have a column named Column1. There are more than 10k different values for this column, but my goal is to display all data on a dashboard except for a few of them. Is it possible to achieve this in Superset? As far as I understand, the only option for filtering a dashboard is a filter_box, and I have to choose values explicitly in the filter_box, so there is no way to use a negative filter. Is that true, or is there some hidden mechanism?
You can use the "limit selector values" option to filter out the values you don't need, by specifying the column name and the list of values you would like to ignore, using the appropriate condition such as equals, not equals, etc.

Group by similar words

Is there any way to group a table by a text field, taking into account that this text field is not always exactly the same?
Example:
select city_hotel, count(city_hotel)
from hotels, temp_grid
where st_intersects(hotels.geom, temp_grid.geom)
and potential=1
and part=4
group by city_hotel
order by (city_hotel) desc
The output I get is the expected, for example, City name and count:
"Vassiliki ";1
"Vassiliki";1
"Vassilias, Skiathos";1
"Vassilias";5
"Vasilikí";25
"Vasiliki";23
"Vasilias";1
But I'd like to group this field further and get only one "Vasiliki" (or an array with all of them, that is not a problem) and a count of all the cells containing something similar.
I do not know whether this is possible. Maybe some text-analysis function or something similar?
SELECT COUNT(*), `etc` FROM table GROUP BY textfield LIKE '%sili%';
-- The '%' is a SQL wildcard that matches any sequence of zero or more characters.
You could do something like the above, choosing a word for the 'like' that best fits the spellings that your users have used.
Something that can help with that would be to do a
SELECT COUNT(*), textfield FROM table GROUP BY textfield ORDER BY textfield;
And selecting the most 'average' spelling for your words.
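To try the LIKE-based grouping described above without touching a real database, here is a small sketch using Python's built-in sqlite3 module. The spellings and counts come from the question's output; the "Athens" rows are a hypothetical addition so that there is something the pattern does not match. Other databases may treat grouping by a boolean expression differently, so take this as an illustration only.

```python
import sqlite3

# Spellings and counts from the question; "Athens" is a hypothetical
# non-matching value added for contrast.
observed = [
    ("Vassiliki ", 1), ("Vassiliki", 1), ("Vassilias, Skiathos", 1),
    ("Vassilias", 5), ("Vasilikí", 25), ("Vasiliki", 23), ("Vasilias", 1),
    ("Athens", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotels (city_hotel TEXT)")
for name, n in observed:
    # Insert one row per counted hotel.
    conn.executemany("INSERT INTO hotels VALUES (?)", [(name,)] * n)

# Group rows by whether the city name matches the pattern.
grouped = conn.execute(
    "SELECT city_hotel LIKE '%sili%' AS is_match, COUNT(*) "
    "FROM hotels GROUP BY is_match ORDER BY is_match"
).fetchall()
print(grouped)
```

One thing this makes visible: a pattern as loose as '%sili%' also matches the Vassilias spellings, which may well be a different place, so the choice of pattern matters.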
Otherwise you're starting to get into a bit of language processing, and for that you will want to write some code outside of SQL.
This would be something like the Damerau-Levenshtein distance (https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance), which lets you find words that are the same within an arbitrary margin of error.
There is a MySQL implementation here that you should be able to adapt as needed:
https://stackoverflow.com/a/6392380/1287480
(credit https://stackoverflow.com/a/3515291/1287480)
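If you would rather do this outside the database, the idea can be sketched in Python as follows. This is not the linked MySQL implementation: it uses the optimal string alignment variant (a restricted form of Damerau-Levenshtein), and the greedy group_similar helper and the max_distance threshold of 1 are assumptions chosen for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]


def group_similar(names, max_distance=1):
    """Greedily put each name into the first group whose representative is close enough."""
    groups = {}
    for name in names:
        key = name.strip()  # ignore stray leading/trailing whitespace
        for rep in groups:
            if osa_distance(key.lower(), rep.lower()) <= max_distance:
                groups[rep].append(name)
                break
        else:
            groups[key] = [name]
    return groups


names = ["Vasiliki", "Vasilikí", "Vassiliki", "Vassiliki ", "Vasilias", "Vassilias"]
groups = group_similar(names)
print(groups)
```

The greedy grouping depends on input order and on the threshold, so the first spelling seen becomes the representative. For real data you would probably want to sort by frequency first so that the most common spelling wins.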
(Personal thoughts on the topic)
You Really Really want to think about limiting the input from users that can give you this issue in the first place. It's far far better to give the users a list of places to select from, than it is to push potentially 'dirty' information into your database. That eventually always winds up with you trying to clean the information at a later time. A problem that has kept many people employed for many years.

Tableau: Creating Dynamic Filter to exclude names

I have been working on this for the better part of the day and would like to crowd source as I must just be missing something simple.
I would like to use a parameter control to create a dynamic filter that would exclude the names of individuals that have already participated in an event. For example in the following list of two fields:
Name-Event Name
Carl-Agriculture
Carl-Agriculture
Carl-Agriculture
Jodie-Business
Jodie-Agriculture
Jodie-Agriculture
Pam-Business
Pam-Business
Pam-Business
If the parameter were set to Agriculture, only Pam would show up on the list, and if it were set to Business, only Carl would show up. This list will help stakeholders send invitations to potentially interested parties.
I have tried so many calculations including the parameter itself in IF statements, IIF statements, CASE statements, etc. I've also tried creating a second calculation to work off of the first but am still striking out.
Any ideas?
You got most of the way there on your own. To finish the job:
Place Name on the filter shelf
Select "Use all" on the General tab of the filter panel
Select "By field:" on the Condition tab of the filter panel
Choose the "If Exclusion Statement" field, Count aggregation function (NOT Sum in this case), and set the test to "= 0"
The effect of this filter is equivalent to the SQL group by Name having Count(If Exclusion Statement) = 0
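To check the logic against the sample data, here is a Python sketch of the same "group by Name, keep groups with a zero count" idea. The definition of the "If Exclusion Statement" field is not shown above, so this assumes it returns a non-null value only on rows where Event Name equals the parameter; names_to_invite is a hypothetical helper name.

```python
from collections import defaultdict

# Rows from the question: (Name, Event Name).
rows = [
    ("Carl", "Agriculture"), ("Carl", "Agriculture"), ("Carl", "Agriculture"),
    ("Jodie", "Business"), ("Jodie", "Agriculture"), ("Jodie", "Agriculture"),
    ("Pam", "Business"), ("Pam", "Business"), ("Pam", "Business"),
]


def names_to_invite(rows, selected_event):
    """Names whose count of rows matching the selected event is zero."""
    matches = defaultdict(int)
    for name, event in rows:
        # Mirrors COUNT over a field that is non-null only on matching rows.
        matches[name] += 1 if event == selected_event else 0
    return sorted(name for name, count in matches.items() if count == 0)


print(names_to_invite(rows, "Agriculture"))
print(names_to_invite(rows, "Business"))
```

With the sample rows, Agriculture leaves only Pam and Business leaves only Carl, which is the behavior the question asks for.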

Infragistics UltraGrid - How to use displayed values in group by headers when using an IEditorDataFilter?

I have a situation where I'm using the IEditorDataFilter interface within a custom UltraGrid editor control to automatically map values from a bound data source when they're displayed in the grid cells. In this case it's converting guid-based key values into user-friendly values, and it works well by displaying what I need in the cell, but retaining the GUID values as the 'value' behind the scenes.
My issue is what happens when I enable the built-in group by functionality and the user groups by a column using my editor. In that case the group by headers default to using the cell's value, which is the guid in my case, so I end up with headers like this:
Column A: 7F720CE8-123A-4A5D-95A7-6DC6EFFE5009 (10 items)
What I really want is the cell's display value to be used instead so it's something like this:
Column A: Item 1 (10 items)
What I've tried so far
Infragistics provides a couple mechanisms for modifying what's shown in group by rows:
GroupByRowDescriptionMask property of the grid (http://bit.ly/1g72t1b)
Manually set the row description via the InitializeGroupByRow event (http://bit.ly/1ix1CbK)
Option 1 doesn't appear to give me what I need because the cell's display value is not exposed in the set of tokens they provide. Option 2 looks promising but it's not clear to me how to get at the cell's display value. The event argument only appears to contain the cell's backing value, which in my case is the GUID.
Is there a proper approach for using the group by functionality when you're also using an IEditorDataFilter implementation to convert values?
This may be frowned upon, but I asked my question on the Infragistics forums as well, and a complete answer is available there (along with an example solution demonstrating the problem):
http://www.infragistics.com/community/forums/p/88541/439210.aspx
In short, I was applying my custom editors at the cell level, which made them unavailable when the rows were grouped together. A better approach would be to apply the editor at the column level, which would make the editor available at the time of grouping, and would provide the expected behavior.