Search across multiple fields in the table view - powerbi

Let's say I don't want the text "Winter" across any of the fields in my current table. I can find and replace, but is there a way to find out, before replacing the values, whether they even exist, and how many exist?
Using find and replace is useful, but it doesn't help me understand if the data is present in the dataset, and it doesn't give an indication even after finishing, whether any data was replaced.

Related

Quicksight breaking up strings for use of all aspects

I was wondering if anyone has ever had experience with breaking a string up in QuickSight and using certain parts of the string. My example is a data set that returns tags like this: "animals|funny|dog-park". I have used "split(tags,'|',1)", but then all that gets returned is the first part (animals). I have also tried a combination of ifelse->locate->split with no luck. Is there a way to split these tags so that they are all usable, e.g. (animals) & (funny) or (funny) & (dog-park), etc.? That is, so the associated article is counted under one tag but also, separately, under another? I know this will most likely end up being a calculated field. Thank you in advance!
Since QuickSight does not support any form of nested fields (including objects and list) in analysis, you need to normalise this into separate rows before feeding the data to QuickSight.
Otherwise, if you leave it as is, you would be limited to filtering using string contains and doing string lookup in calculated fields - nevertheless you would not be able to use these tags as categories (such as in colours field well of visuals).
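As a rough illustration of that normalisation step, here is a minimal Python sketch that turns one row per article into one row per tag before the data is loaded. The field names `article_id`, `tags`, and `tag` are assumptions made for the example, not anything QuickSight requires.

```python
def explode_tags(rows, tag_field="tags", delimiter="|"):
    """Yield one output row per tag so each tag becomes its own category value."""
    for row in rows:
        for tag in row[tag_field].split(delimiter):
            tag = tag.strip()
            if not tag:
                continue  # skip empty pieces, e.g. from a trailing delimiter
            # Copy every other column and replace the packed field with a single tag.
            out = {key: value for key, value in row.items() if key != tag_field}
            out["tag"] = tag
            yield out


# Hypothetical input: one article carrying three tags in a single string.
articles = [{"article_id": 1, "tags": "animals|funny|dog-park"}]
tag_rows = list(explode_tags(articles))
```

Each article then appears once per tag, so the `tag` column can be used as a normal category. Anything that counts articles should count distinct `article_id` values to avoid double counting.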

Group by similar words

Is there any way to group a table by a text field, taking into account that this text field is not always exactly the same?
Example:
select city_hotel, count(city_hotel)
from hotels, temp_grid
where st_intersects(hotels.geom, temp_grid.geom)
and potential=1
and part=4
group by city_hotel
order by (city_hotel) desc
The output I get is as expected, for example, city name and count:
"Vassiliki ";1
"Vassiliki";1
"Vassilias, Skiathos";1
"Vassilias";5
"Vasilikí";25
"Vasiliki";23
"Vasilias";1
But I'd like to group this field further and get only one "Vasiliki" (or an array with all the variants, which is not a problem) and a count of all the cells containing something similar.
I do not know whether this is possible. Maybe there is some text-analysis function or something similar?
SELECT COUNT(*), `etc` FROM table GROUP BY textfield LIKE '%sili%';
-- The '%' is a SQL wildcard that matches any sequence of zero or more characters.
You could do something like the above, choosing a word for the 'like' that best fits the spellings that your users have used.
Something that can help with that would be to do a
SELECT COUNT(*), textfield FROM table GROUP BY textfield ORDER BY textfield;
And selecting the most 'average' spelling for your words.
Otherwise you're starting to get into a bit of language processing, and for that you will want to write some code outside of SQL.
This would be something like https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
to find words that are the same within an arbitrary margin of error.
There is a MySQL implementation here that you should be able to transpose as needed
https://stackoverflow.com/a/6392380/1287480
(credit https://stackoverflow.com/a/3515291/1287480)
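For readers who would rather do this step outside the database, here is a minimal Python sketch of the same idea. It is not the linked MySQL implementation: it uses the optimal string alignment distance (the restricted form of Damerau-Levenshtein), strips accents and whitespace first, and merges names greedily. The function names and the `max_distance=1` threshold are assumptions chosen for the example.

```python
import unicodedata


def normalise(name):
    """Trim whitespace, lowercase, and strip accents ('Vasilikí ' -> 'vasiliki')."""
    decomposed = unicodedata.normalize("NFKD", name.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]


def group_counts(counts, max_distance=1):
    """Greedily merge (name, count) pairs whose normalised names are close."""
    groups = []  # each entry: [representative, total, [original spellings]]
    # Visit the most frequent spellings first so they become the representatives.
    for name, count in sorted(counts, key=lambda item: -item[1]):
        key = normalise(name)
        for group in groups:
            if osa_distance(key, group[0]) <= max_distance:
                group[1] += count
                group[2].append(name)
                break
        else:
            groups.append([key, count, [name]])
    return groups


# The (city, count) pairs from the question's query output.
counts = [("Vassiliki ", 1), ("Vassiliki", 1), ("Vassilias, Skiathos", 1),
          ("Vassilias", 5), ("Vasilikí", 25), ("Vasiliki", 23), ("Vasilias", 1)]
groups = group_counts(counts)
```

With this threshold the four "Vasiliki" spellings collapse into one group with a count of 50, while "Vassilias" and "Vasilias" form a second group. Whether those two groups are really the same place is a judgement call that no distance threshold can settle on its own.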
(Personal thoughts on the topic)
You Really Really want to think about limiting the input from users that can give you this issue in the first place. It's far far better to give the users a list of places to select from, than it is to push potentially 'dirty' information into your database. That eventually always winds up with you trying to clean the information at a later time. A problem that has kept many people employed for many years.

I have large file contents that I want to make searchable on AWS CloudSearch but the maximum document size is 1MB - how do I deal with this?

I could split the file contents up into separate search documents but then I would have to manually identify this in the results and only show one result to the user - otherwise it will look like there are 2 files that match their search when in fact there is only one.
Also the relevancy score would be incorrect. Any ideas?
So the response from AWS support was to split the files up into separate documents. In response to my concerns regarding relevancy scoring and multiple hits they said the following:
You do raise two very valid concerns here for your more challenging use case. With regard to relevance, you face a very significant problem already in that it is harder to establish a strong 'signal' and degrees of differentiation with large bodies of text. If the documents you have are much like reports or whitepapers, a potential workaround to this may be in indexing the first X number of characters (or the first identified paragraph) into a "thesis" field. This field could be weighted to better indicate what the document subject matter may be without manual review.
With regard to result duplication, this will require post-processing on your end if you wish to filter it. You can create a new field that can generate a unique "Parent" id that will be shared for each chunk of the whole document. The post-processing can check to see if this "Parent" id has already been returned (the first result should be seen as most relevant), and if it has, filter the subsequent results. What is doubly useful in such a scenario is that you can include a refinement link in your results that could filter on all matches within that particular Parent id.
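A minimal Python sketch of the chunk-and-filter approach described in that reply might look like the following. The field names (`id`, `parent_id`, `content`), the `max_chars` default, and the shape of the hit dictionaries are all assumptions for illustration. This is not the CloudSearch response format, and the real limit is measured in bytes rather than characters.

```python
def chunk_document(parent_id, text, max_chars=500_000):
    """Split one large document into several search documents sharing a parent id."""
    return [
        {
            "id": f"{parent_id}_{n}",    # unique id per chunk
            "parent_id": parent_id,      # shared by every chunk of the same file
            "content": text[start:start + max_chars],
        }
        for n, start in enumerate(range(0, len(text), max_chars))
    ]


def dedupe_by_parent(hits):
    """Keep only the first (most relevant) hit for each parent id."""
    seen = set()
    unique = []
    for hit in hits:  # hits are assumed to arrive sorted by relevance
        if hit["parent_id"] not in seen:
            seen.add(hit["parent_id"])
            unique.append(hit)
    return unique


# Hypothetical example: a 1200-character file split into 500-character chunks.
chunks = chunk_document("report-42", "x" * 1200, max_chars=500)

# Hypothetical search hits in relevance order; two belong to the same file.
hits = [
    {"id": "a_1", "parent_id": "a"},
    {"id": "b_0", "parent_id": "b"},
    {"id": "a_0", "parent_id": "a"},
]
results = dedupe_by_parent(hits)
```

Because filtering happens after the search, a page of results can come back shorter than requested. Fetching a few extra hits per page is one way to compensate.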

How to determine if a given word or phrase from a list is within an anchor tag?

We have a ColdFusion-based site that involves a large number of 'document authors' who have little or no knowledge of HTML. The 'documents' they create consist of HTML stored in a table in the database. They use a CKEDITOR interface. The content that they create is output into a specific area of the page. The documents frequently contain many technical terms that readers may not be familiar with, and we would like tooltips to show up for these terms automatically.
I and the other programmer want to have some code insert 'tooltip' code into the page based on a list of words in a table on our SQL server. The 'dictionary' table in our database has a unique ID, the word/phrase we will look for and a corresponding definition that would be displayed in the tooltip.
For instance, one of the word/phrases we will be looking for is 'Scrum Master'. If it occurs in the document area, we need to insert code around the words to create a tooltip. To do that, we need to see if certain conditions exist. Are the words within an anchor tag? If yes, is there already a title value for the tag (title is used to contain the info to be displayed in a tooltip)? If a title tag exists, don't do anything. If the words are not in an anchor tag, then we would put anchor tags around the words along with the title that will contain the definition.
The tooltip code we use is via jQuery (http://jqueryui.com/tooltip/). It is quick and simple to use. We just need to figure out how to use it dynamically based on our dictionary table.
Do you have any suggestions of how to go about this?
I was hoping that jSoup might have a function that I could use, but that doesn't seem to be the right technology for what I want to do, but I could be wrong and I am happy to be corrected!
We have a large number of these documents and so manually inserting and maintaining the tooltip code is just not an option.
Update your content with something like:
strOut = ReplaceList(strIn, ValueList(qryTT.find), ValueList(qryTT.replace));
Since words are delimited by spaces, the qryTT.find needs to have spaces. The replace column is going to need to include some of the original content. You are going to have to be careful with words followed by a comma or a period too.
I would cache the results because I would expect it to be memory intensive.
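The anchor-tag condition from the question is hard to express with a plain list replace, so here is a hedged Python sketch of a different technique: split the HTML into tags and text, track whether the current text sits inside an `<a>` element, and wrap dictionary terms only outside anchors. The function name, the sample definition, and the regex-based tag splitting are assumptions for illustration. A real HTML parser would be safer for messy markup, and the case where an existing anchor lacks a title is not handled.

```python
import html
import re

TAG_SPLIT = re.compile(r"(<[^>]+>)")  # capture group keeps the tags in the split result


def add_tooltips(document_html, dictionary):
    """Wrap dictionary terms in <a title="..."> unless they already sit inside an anchor."""
    if not dictionary:
        return document_html
    # Longest terms first so 'Scrum Master' wins over a shorter term like 'Scrum'.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")

    def wrap(match):
        term = match.group(1)
        title = html.escape(dictionary[term], quote=True)
        return f'<a title="{title}">{term}</a>'

    out = []
    anchor_depth = 0
    # re.split with a capture group alternates: text, tag, text, tag, ...
    for index, piece in enumerate(TAG_SPLIT.split(document_html)):
        if index % 2 == 1:  # a tag
            lowered = piece.lower()
            if re.match(r"<a[\s>]", lowered):
                anchor_depth += 1
            elif re.match(r"</a\s*>", lowered):
                anchor_depth = max(0, anchor_depth - 1)
            out.append(piece)
        elif anchor_depth == 0:  # text outside any anchor
            out.append(pattern.sub(wrap, piece))
        else:  # text inside an anchor is left alone
            out.append(piece)
    return "".join(out)


# Hypothetical dictionary row and document fragment.
dictionary = {"Scrum Master": "Facilitates the Scrum team"}
doc = '<p>Ask the Scrum Master or see <a href="/sm">Scrum Master notes</a>.</p>'
tooltipped = add_tooltips(doc, dictionary)
```

Matching here is case-sensitive and uses word boundaries, which sidesteps the trailing comma and period problem. Because the output depends only on the document and the dictionary, it can be cached and regenerated when either changes.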

How do I find if any names in a list of names appears in a paragraph with ColdFusion?

Assume that I have a list of employee names from a database (thousands, potentially tens-of-thousands in the near future). To make the problem simpler assume that each firstname/lastname combination is unique (a big if, but a tangent).
I also have a RSS stream of news content that pertains to the business (again, could be in the hundreds of items per day).
What I would like to do is detect whether an employee's name appears in a news item of several paragraphs and, if so, 'tag' the item with the person it's talking about.
There may be more than one employee named in a single news item so breaking the loop after the first positive match isn't a possibility.
I can certainly brute-force things: for every news item, loop over each and every employee name and, if a regular expression returns a match, make note of it.
Is there a simpler way in ColdFusion or should I just get on with my nested loops?
Just throwing this out there as something you could do...
It sounds like you'll almost always have significantly more employee names than words per post. Here's how I might handle it:
Have an always-running CF app that will pull in the feeds and onAppStart
Grab all employees from your db
Create an app-scoped look up struct with first names as keys and a struct of last names as values ( you could also add middle names sibling to last names with a 3rd tier if desired ).
So one key in the look up might be "Vanessa" with a struct with 2 keys ( "Johnson" and "Forta" ) as its value.
Then, each article you parse, just listToArray with a space as a delimiter and loop through the array doing a simple structKeyExists with each token. For matches, check the next item in the array as a last name.
I'd guess this would perform much better than running however many separate searches. It would also take almost no time to code, and you can feed in any future sources very simply (your checker takes one argument: any text on Earth).
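The lookup idea above translates to most languages. Here is a minimal Python sketch of it rather than CFML, using a dict of sets where the answer uses a struct of structs. It also tokenises with a regular expression instead of splitting on spaces so that trailing punctuation does not hide a match. The function names and sample sentence are assumptions for illustration.

```python
import re
from collections import defaultdict


def build_lookup(employees):
    """Map each first name to the set of last names seen with it."""
    lookup = defaultdict(set)
    for first, last in employees:
        lookup[first].add(last)
    return lookup


def find_employees(text, lookup):
    """Return every (first, last) pair that appears as two adjacent words in text."""
    tokens = re.findall(r"[\w'-]+", text)
    found = set()
    # Pair each token with the one after it and test both against the lookup.
    for first, last in zip(tokens, tokens[1:]):
        if first in lookup and last in lookup[first]:
            found.add((first, last))
    return found


# The example names from the answer, plus a hypothetical news sentence.
lookup = build_lookup([("Vanessa", "Johnson"), ("Vanessa", "Forta")])
article = "Today Vanessa Johnson met Vanessa Forta, and Vanessa Smith."
mentioned = find_employees(article, lookup)
```

The cost per article grows with the number of words rather than the number of employees, which is the point of the approach. Possessives ("Forta's"), middle names, and multi-word surnames would need extra handling.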
Interested to see what route you go and whether your experiments expose anything new about performance in CF.
Matthew, you have a tall order there, and there are really multiple parts to the challenge/solution. But just in terms of comparing a list of values to a given set of text to see if any of them occur in it, you'll find that there's no single CF function for it. Because of that, I created a new one, findList, available at cflib:
http://cflib.org/index.cfm?event=page.udfbyid&udfid=1908
It's not perfect, nor as optimal as it could be, but it may be a useful first step for you, or give you some ideas. That said, it suited my need (determining whether a given blog comment referred to any of the blacklisted words). I show it comparing a list of URLs, but it could be any words at all. Hope that's a little helpful.
Another option worth exploring is leveraging the Solr engine that ships with CF now. It will do the string search heavy lifting for you and you can probably focus on dynamically keeping your collections up to date and optimized as new feed items come in.
Good luck!