Group by similar words - regex

Is there any way to group a table by a text field, having in count that this text field is not always exactly the same?
Example:
select city_hotel, count(city_hotel)
from hotels, temp_grid
where st_intersects(hotels.geom, temp_grid.geom)
and potential=1
and part=4
group by city_hotel
order by (city_hotel) desc
The output I get is the expected, for example, City name and count:
"Vassiliki ";1
"Vassiliki";1
"Vassilias, Skiathos";1
"Vassilias";5
"Vasilikí";25
"Vasiliki";23
"Vasilias";1
But I'd want to group more this field, and get only one "Vasiliki" (or an array with all, this is not a problem) and a count of all the cells containing something similar between them.
I do not know if could this be possible. Maybe some function to text analysis or something similar?

SELECT COUNT(*), `etc` FROM table GROUP BY textfield LIKE '%sili%';
// The '%' is a SQL wildcard, which matches as many of any character as required.
You could do something like the above, choosing a word for the 'like' that best fits the spellings that your users have used.
Something that can help with that would be to do a
SELECT COUNT(*), textfield FROM table GROUP BY textfield ORDER BY textfield;
And selecting the most 'average' spelling for your words.
Otherwise you're starting to get into a bit of language processing, and for that you will want to write some code outside of SQL.
This would be something like https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
To find word's that are the same within an arbitrary margin of error.
There is a MySQL implementation here that you should be able to transpose as needed
https://stackoverflow.com/a/6392380/1287480
(credit https://stackoverflow.com/a/3515291/1287480)
.
(Personal thoughts on the topic)
You Really Really want to think about limiting the input from users that can give you this issue in the first place. It's far far better to give the users a list of places to select from, than it is to push potentially 'dirty' information into your database. That eventually always winds up with you trying to clean the information at a later time. A problem that has kept many people employed for many years.

Related

Merge cells with similar but different data, different spelling

I am trying Tableau with data extracted from Salesforce. The input includes a "Country" record were the row have different spellings for the same thing.
Example: Cananda, CANADA, CAnada etc.
Is there a way to fix this in Tableau?
The easiest solution is create a group field based on your Country field.
Select Country in the data pane on the left side bar, right click and choose Create Group. Select elements that you want to group together put them into a single group, say Canada, that contains all variations of spelling.
This new group field initially has a name of Country (group). You may want to rename it Country_Corrected. (Or even better, rename the first field, Country_Original, and call the group field simply Country. Then you can hide Country_Original)
Groups are implemented using SQL case statements. They have many uses, but one application is to easily tolerate some inconsistent spellings in your data source without having to change your data. In general, you can specify several transformations like this that take effect at query and visualization time. For very large data sets, or for very complicated transformations, you may eventually want to push some of them upstream in your data pipeline to get better performance. But make those optimizations later when you've proven the necessity.
If the differences are just in case (upper vs lower), you can right-click the Country dimension, and create a calculated field called something like "New Country", and use the following formula to make the case consistent:
upper([Country])
Use this new "New Country" calc dimension instead of your "Country" dimension, and it will group them all without case sensitivity, and display as uppercase. Or you can use "lower" instead of "upper" if preferred.

Tableau: Creating Dynamic Filter to exclude names

I have been working on this for the better part of the day and would like to crowd source as I must just be missing something simple.
I would like to use a parameter control to create a dynamic filter that would exclude the names of individuals that have already participated in an event. For example in the following list of two fields:
Name-Event Name
Carl-Agriculture
Carl-Agriculture
Carl-Agriculture
Jodie-Business
Jodie-Agriculture
Jodie-Agriculture
Pam-Business
Pam-Business
Pam-Business
if the parameter was set to Agriculture, only Pam would show up on the list, and if it was set to Business only Carl would show up. This list will help stakeholders send invitations to potentially interested parties.
I have tried so many calculations including the parameter itself in IF statements, IIF statements, CASE statements, etc. I've also tried creating a second calculation to work off of the first but am still striking out.
Any ideas?
You got most of the way there on your own. To finish the job:
Place Name on the filter shelf
Select "Use all" on the General tab of the filter panel
Select "By field:" on the Condition tab of the filter panel
Choose the "If Exclusion Statement" field, Count aggregation function (NOT Sum in this case), and set the test to "= 0"
The effect of this filter is equivalent to the SQL group by Name having Count(If Exclusion Statement) = 0

Select Statement Vs Find in Ax

while writing code we can either use select statement or select field list or find method on table for fetching the records.
I wonder which of the statement helps in better performance
It really depends on what you actually need.
find() methods must return the whole table buffer, that means, all of the columns are projected into the buffer returned by it, so you have the complete record selected. But sometimes you only need a single column, or just a few. In such cases it can be a waste to select the whole record, since you won't use the columns selected anyway.
So if you're dealing with a table that has lots of columns and you only need a few of them, consider writing a specific select statement for that, listing the columns you need.
Also, keep in mind that select statements that only project a few columns should not be made public. That means that you should NOT extract such statements into a method, because imagine the surprise of someone consuming that method and trying to figure out why column X was empty...
You can look at the find() method on the table and find out the same 'select'-statement there.
It can be the same 'select; statement as your own an the performance will be the same in this case.
And it can be different select statement then your own and the performance will be depend on indexes on the table, select statement, collected statistics and so on.
But there is no magic here. All of them is just select statement - no matter which method do you use.

How to determine if a given word or phrase from a list is within an anchor tag?

We have a ColdFusion based site that involves a large number of 'document authors' that have little or no knowledge of HTML. The 'documents' they create are comprised of HTML stored in a table in the database. They use a CKEDITOR interface. The content that they create is output into specific area of the page. The document frequently has tons of technical terms that readers may not be familiar with that we would like to have tooltips automatically show up for.
I and the other programmer want to have some code insert 'tooltip' code into the page based on a list of words in a table on our SQL server. The 'dictionary' table in our database has a unique ID, the word/phrase we will look for and a corresponding definition that would be displayed in the tooltip.
For instance, one of the word/phrases we will be looking for is 'Scrum Master'. If it occurs in the document area, we need to insert code around the words to create a tooltip. To do that, we need to see if certain conditions exist. Are the words within an anchor tag? If yes, is there already a title value for the tag (title is used to contain the info to be displayed in a tooltip)? If a title tag exists, don't do anything. If the words are not in an anchor tag, then we would put anchor tags around the words along with the title that will contain the definition.
The tooltip code we use is via jQuery (http://jqueryui.com/tooltip/). It is quick and simple to use. We just need to figure out how to use it dynamically based on our dictionary table.
Do you have any suggestions of how to go about this?
I was hoping that jSoup might have a function that I could use, but that doesn't seem to be the right technology for what I want to do, but I could be wrong and I am happy to be corrected!
We have a large number of these documents and so manually inserting and maintaining the tooltip code is just not an option.
Update you content with something like:
strOut = ReplaceList(strIn, ValueList(qryTT.find), ValueList(qryTT.replace));
Since words are delimited by spaces, the qryTT.find needs to have spaces. The replace column is going to need to include some of the original content. You are going to have to be careful with words followed by a comma or a period too.
I would cache the results because I would expect it to be memory intensive.

How do I find if any names in a list of names appears in a paragraph with ColdFusion?

Assume that I have a list of employee names from a database (thousands, potentially tens-of-thousands in the near future). To make the problem simpler assume that each firstname/lastname combination is unique (a big if, but a tangent).
I also have a RSS stream of news content that pertains to the business (again, could be in the hundreds of items per day).
What I would like to do is detect if an employees name appears in the several paragraph news item and, if so, 'tag' the item with the person its talking about.
There may be more than one employee named in a single news item so breaking the loop after the first positive match isn't a possibility.
I can certainly brute force things: for every news item, loop over each and every employee name and if a regex expression returns a match, make note of it.
Is there a simpler way in ColdFusion or should I just get on with my nested loops?
Just throwing this out there as something you could do...
It sounds like you'll almost unanimously have significantly more employee names than words per post. Here's how I might handle it:
Have an always-running CF app that will pull in the feeds and onAppStart
Grab all employees from your db
Create an app-scoped look up struct with first names as keys and a struct of last names as values ( you could also add middle names sibling to last names with a 3rd tier if desired ).
So one key in the look up might be "Vanessa" with a struct with 2 keys ( "Johnson" and "Forta" ) as its value.
Then, each article you parse, just listToArray with a space as a delimiter and loop through the array doing a simple structKeyExists with each token. For matches, check the next item in the array as a last name.
I'd guess this would be much more performant processingwise than doing however many searches and also take almost no time to code and you can feed in any future sources extremely simply ( your checker takes one argument, any text on Earth ).
Interested to see what route you go and whether your experiments expose anything new about performance in CF.
Matthew, you have a tall order there, and there are really multiple parts to the challenge/solution. But just in terms of comparing a list of values to a given set of text to see if one of them occur in there, you'll find that there's no one could CF function. BEcause of that, I created a new one, findList, available at cflib:
http://cflib.org/index.cfm?event=page.udfbyid&udfid=1908
It's not perfect, nor as optimal as it could be, but it may be a useful first step or you, or give you some ideas. That said, it suited my need (determine if a given blog comment had reference to any of the blacklisted words). I show it comparing a list of URLs, but it could be any words at all. Hope that's a little helpful.
Another option worth exploring is leveraging the Solr engine that ships with CF now. It will do the string search heavy lifting for you and you can probably focus on dynamically keeping your collections up to date and optimized as new feed items come in.
Good luck!