Power Query check if string contains strings from a list - list

Is there a way to check a text field to see if it contains any of the strings from a list?
Example Strings to Check:
The raisin is green
The pear is red
The apple is yellow
List Example to Validate Against
red
blue
green
The result would be
either:
green
red
null
or:
TRUE
TRUE
FALSE

Daniel has a decent solution, but it only matches whole words, so it won't work if the example strings aren't space-separated. For example, with his approach "The brick is reddish" would not detect red as a substring, whereas the formula below would.
You can create a custom column with this formula instead:
(C) => List.AnyTrue(List.Transform(Words, each Text.Contains(C[Texts], _)))
This takes the list Words = {"red","blue","green"} and checks whether each of the colors in the list is contained in the [Texts] column for that row. If any of them is, it returns TRUE; otherwise it returns FALSE.
The whole query looks like this:
let
    TextList = {"The raisin is green","The pear is red","The apple is yellow"},
    Texts = Table.FromList(TextList, Splitter.SplitByNothing(), {"Texts"}, null, ExtraValues.Error),
    Words = {"red","blue","green"},
    #"Added Custom" = Table.AddColumn(Texts, "Check", (C) => List.AnyTrue(List.Transform(Words, each Text.Contains(C[Texts], _))))
in
    #"Added Custom"

This will do the trick. It's Power Query ("M") code:
let
    Texts = {"The raisin is green","The pear is red","The apple is yellow"},
    Words = {"red","blue","green"},
    TextsLists = List.Transform(Texts, each Text.Split(_," ")),
    Output = List.Transform(TextsLists, each List.Count(List.Intersect({_,Words}))>0)
in
    Output
There are two lists: the sentences (Texts) and the words to check (Words). The first thing to do is to convert the sentences into lists of words by splitting the strings using " " as the delimiter.
TextsLists = List.Transform(Texts, each Text.Split(_," ")),
Then you "cross" the new lists with the list of Words. The results are lists of the elements (strings) that appear in both lists (TextsLists and Words). Now you count the items in each of these new lists and check whether the count is greater than zero.
Output = List.Transform(TextsLists, each List.Count(List.Intersect({_,Words}))>0)
Output is a new list {true, true, false}.
Alternatively, you can replace the Output line with this one:
Output = List.Transform(TextsLists, each List.Intersect({_,Words}){0}?)
This will return a list containing the first match for each sentence, or null if there's no match. In the example: {"green", "red", null}
Hope this helps you.

each if Text.Remove([Texts], {"The raisin is green","The pear is red","The apple is yellow"})<>[Texts] then ...

Related

How can I search a list of text strings for more than one word or collection of words in Power Query?

I have a table of data that I converted into a list using Table.ColumnNames. From this list I want to select multiple items, put them into a new list, and remove all the items I did not select. For example, my current list contains {Apple, Pear, Orange, Banana}, and I want to extract "Apple" and "Banana" from it into a new list.
I tried doing this with List.Contains or List.FindText, but you can only pass one value to them, such as "Apple" or "Banana", not both.
If anyone has a solution for this it would be great!!
You want List.Intersect or List.Difference. See the documentation at:
https://learn.microsoft.com/en-us/powerquery-m/list-difference
https://learn.microsoft.com/en-us/powerquery-m/list-intersect
This looks for [Apple Pear Dog] from list of [Apple Pear Orange Banana] and returns [Apple Pear]
= List.Intersect ({{"Apple", "Pear", "Orange", "Banana"},{"Apple", "Pear", "Dog"}})

Powerquery, does string contain an item in a list

I would like to filter on whether any of several text columns ([Name], [GenericName], or [SimpleGenericName]) contains a substring from a list. The text is also mixed case, so I need to do a Text.Lower([Column]) in there as well.
I've tried the formula:
= Table.SelectRows(#"Sorted Rows", each List.Contains(MED_NAME_LIST, Text.Lower([Name])))
However, this does not work as the Column [Name] does not exactly match those items in the list (e.g. it won't pick up "Methylprednisolone Tab" if the list contains "methylprednisolone")
An example of a working filter, with some of the list written out, is:
= Table.SelectRows(#"Sorted Rows", each Text.Contains(Text.Lower([Name]), "methylprednisolone") or Text.Contains(Text.Lower([Name]), "hydroxychloroquine") or Text.Contains(Text.Lower([Name]), "remdesivir") or Text.Contains(Text.Lower([GenericName]), "methylprednisolone") or Text.Contains(Text.Lower([GenericName]), "hydroxychloroquine") or Text.Contains([GenericName], "remdesivir") or Text.Contains(Text.Lower([SimpleGenericName]), "methylprednisolone") or Text.Contains(Text.Lower([SimpleGenericName]), "hydroxychloroquine") or Text.Contains([SimpleGenericName], "remdesivir"))
I would like to make this cleaner than having to write all of this out, as I would also like to be able to expand the list from a referenced table to make this a dynamic search.
Thank you in advance
If I have a list of medicines:
and I need to filter my table:
to only keep rows where certain columns (we'll specify which ones exactly later) contain case-insensitive, partial matches for any of the items in the above list of medicines, then one way to do this might be:
let
    MED_NAME_LIST = {"MEthYlprednisolone", "hYdroxychloroquine", "rEMdesivir"},
    initialTable = Table.FromRows({
        {"Methylprednisolone Tab", "train", "car", "bike"},
        {"no", "no", "no", "no"},
        {"tram", "teleport", "hydroxychloroQuine Tab", "jet"},
        {"no", "no", "no", "yes"},
        {"REMdesivir Tab", "bus", "taxi", "concord"}
    }, type table [Name = text, GenericName = text, SimpleGenericName = text, SomeOtherColumn = text]),
    filtered = Table.SelectRows(initialTable, each List.ContainsAny(
        {[Name], [GenericName], [SimpleGenericName]},
        MED_NAME_LIST,
        (rowValue as text, medicineFromList as text) as logical => Text.Contains(rowValue, medicineFromList, Comparer.OrdinalIgnoreCase)
    ))
in
    filtered
In filtered, List.ContainsAny is used to determine if any of the specified columns (Name, GenericName, SimpleGenericName) contain a "match" for any of the values in MED_NAME_LIST.
The criteria for a "match" are that:
case sensitivity must be ignored (hence Comparer.OrdinalIgnoreCase is used)
the match must be partial (hence Text.Contains is used)
The above code gives me the following, which I believe is the filtering behaviour you described:

keyword inspection based on words present in multiple lists

I have a dictionary similar to this:
countries = ["usa", "france", "japan", "china", "germany"]
fruits = ["mango", "apple", "passion-fruit", "durion", "bananna"]
cf_dict = {k:v for k,v in zip(["countries", "fruits"], [countries, fruits])}
and I also have a list of strings similar to this:
docs = ["mango is a fruit that is very different from Apple","I like to travel, last year I was in Germany but I like France.it was lovely"]
I would like to inspect docs and see whether each string contains any of the keywords in any of the lists in cf_dict (the values of cf_dict are lists). If a keyword is present, I want to return the corresponding key for that string as output.
So, for instance, if I inspect the list docs, the output will be [fruits, countries].
I want something similar to this answer, but that one checks only one list, whereas I would like to check multiple lists.
The following returns a dict of sets in case a string matches values in more than one list (e.g. 'apple grows in USA' should be mapped to {'fruits', 'countries'}).
print({s: {k for k, l in cf_dict.items() for w in l if w in s.lower()} for s in docs})
This outputs:
{'mango is a fruit that is very different from Apple': {'fruits'}, 'I like to travel, last year I was in Germany but I like France.it was lovely': {'countries'}}
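If the one-liner is hard to read, the same substring check can be unpacked into a small helper. This is only a sketch: the function name categories_for is made up for illustration, and it keeps the same case-insensitive substring matching as above (so "apple" would also match inside "pineapple").

```python
def categories_for(text, keyword_lists):
    """Return the set of keys whose keyword list has at least one
    case-insensitive substring match in text."""
    lowered = text.lower()
    return {
        key
        for key, words in keyword_lists.items()
        if any(word in lowered for word in words)
    }


countries = ["usa", "france", "japan", "china", "germany"]
fruits = ["mango", "apple", "passion-fruit", "durion", "bananna"]
cf_dict = {"countries": countries, "fruits": fruits}

docs = [
    "mango is a fruit that is very different from Apple",
    "I like to travel, last year I was in Germany but I like France.it was lovely",
]

# One sorted list of matching keys per document, in the order of docs.
labels = [sorted(categories_for(doc, cf_dict)) for doc in docs]
print(labels)  # [['fruits'], ['countries']]
```

Sorting each set gives a stable order when a string matches more than one list.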

Join strings from the same column in pandas using a placeholder condition

I have a series of data that I need to filter.
The df consists of one column of information, in which groups of rows are separated by a row with the value NaN.
I would like to join all of the rows that occur until each NaN in a new column.
For example my data looks something like:
the
car
is
red
NaN
the
house
is
big
NaN
the
room
is
small
My desired result is
B
the car is red
the house is big
the room is small
Thus far, I am approaching this problem by building a function and applying it to each row in my dataframe. See below for my code so far.
def joinNan(row):
    newRow = []
    placeholder = 'NaN'
    if row is not placeholder:
        newRow.append(row)
    if row == placeholder:
        return newRow
df['B'] = df.loc[0].apply(joinNan)
For some reason, the first row of my data is being used as the index or column title, hence why I am using 'loc[0]' here instead of a specific column name.
If there is a more straightforward way to approach this by iterating directly over the column, I am open to that suggestion too.
For now, I am trying to reach my desired solution and have not found any other similar case on Stack Overflow or the web in general to help me.
I think you need to use isna to test for NaNs, then create a helper Series with cumsum and aggregate with join in a groupby:
df=df.groupby(df[0].isna().cumsum())[0].apply(lambda x: ' '.join(x.dropna())).to_frame('B')
# for older versions of pandas
df=df.groupby(df[0].isnull().cumsum())[0].apply(lambda x: ' '.join(x.dropna())).to_frame('B')
Another solution is to filter out all NaNs before the groupby:
mask = df[0].isna()
#mask = df[0].isnull()
df['g'] = mask.cumsum()
df = df[~mask].groupby('g')[0].apply(' '.join).to_frame('B')
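To see why the cumulative sum works as a group key, here is a rough sketch of the same idea in plain Python, without pandas. It assumes the placeholder rows are represented by None rather than NaN, and the variable names are only illustrative.

```python
from itertools import accumulate, groupby

values = ["the", "car", "is", "red", None,
          "the", "house", "is", "big", None,
          "the", "room", "is", "small"]

# True where the placeholder occurs (the role isna() plays above).
is_missing = [v is None for v in values]

# Running count of placeholders seen so far (the role cumsum() plays above).
# Every row between two placeholders ends up with the same number.
group_ids = list(accumulate(int(m) for m in is_missing))

# Group consecutive rows by that number and join the non-missing words.
sentences = [
    " ".join(v for _, v in group if v is not None)
    for _, group in groupby(zip(group_ids, values), key=lambda pair: pair[0])
]
print(sentences)  # ['the car is red', 'the house is big', 'the room is small']
```

Each placeholder row starts a new group and is then dropped from it, which is what x.dropna() does in the first pandas version.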

Find String from One List within Another List and Return String Found

I found part of what I was looking for at Matchlists/tables in power query, but I need a bit more.
Using the "Flags only" example provided at Matchlists/tables in power query, I’m comparing two lists, ListA and ListB, to check if ListB’s row content appears in ListA’s row content at all. I can’t do a one-for-one match of both rows’ contents (like with List.Intersect) because the content of a row in ListB might only be part of the content of a row in ListA.
Note that, in the query below, ListB includes “roo”, which is the first three letters in the word room. I would want to know that “roo” is in ListA’s row that has “in my room.”
The "Flags only" example provided by Matchlists/tables in power query already determines that “roo” is part of ListA’s row that has “in my room.” I built on the example to assign “yes,” instead of true when there is such a match between the ListA and ListB.
What I’d like to do is to replace “yes” with the actual value from ListB — the value “roo,” for instance. I tried to simply substitute wordB for “yes” but I got an error that wordB wasn’t recognized.
let
ListA = {"help me rhonda", "in my room", "good vibrations", "god only knows"},
ListB = {"roo", "me", "only"},
contains_word=List.Transform(ListA, (lineA)=>if List.MatchesAny(ListB, (wordB)=>Text.Contains(lineA, wordB)) = true then "yes" else "no")
in
contains_word
The current query results in this:
List
1 yes
2 yes
3 no
4 yes
I want the query results to be:
List
1 roo
2 me
3
4 only
Any idea how to make it so?
(p.s. I'm extremely new to Power Query / M)
Thanks
I would do it this way:
let
ListA = {"help me rhonda", "in my room", "good vibrations", "god only knows"},
ListB = {"roo", "me", "only"},
contains_word=List.Transform(ListA, (lineA)=>List.Select(List.Transform(ListB, (wordB)=>if Text.Contains(lineA, wordB) = true then wordB else null), (x)=>x <> null){0}?)
in
contains_word
[edited]
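The idea is to use List.Transform twice. The inner one maps each item of ListB to itself if it is contained in the current line and to null otherwise, and List.Select then drops the nulls. The first remaining value (or null if there is none, thanks to {0}?) replaces the string from ListA in the outer List.Transform.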
Edit: I think you switched the first 2 elements of the result?
You can use the following code:
let
    ListA = {"help me rhonda", "in my room", "good vibrations", "god only knows"},
    ListB = {"roo", "help", "me", "only"},
    TableA = Table.FromList(ListA,null,{"ListA"}),
    AddedListBMatches = Table.AddColumn(TableA, "ListBMatches", (x) => List.Select(ListB, each Text.PositionOf(x[ListA], _) >= 0)),
    ExtractedValues = Table.TransformColumns(AddedListBMatches, {"ListBMatches", each Text.Combine(List.Transform(_, Text.From), ","), type text}),
    Result = ExtractedValues[ListBMatches]
in
    Result
The "ExtractedValues" step is the result of pressing the expand button in the header of the "ListBMatches" column and choosing Extract Values, comma separated.
This option was added in the January 2017 update.
I added "help" to ListB so the first element of ListA has 2 matches that are both returned.