Searching column with string for wildcard match - regex

I can't seem to figure out the VLOOKUP magic needed to make this work as I want it to.
See, what I've got is a column B containing filenames, like this:
[COLUMN B]
./11001 Boogie Oogie Oogie (A Taste Of Honey).wav
./11001 Rescue Me (A Taste Of Honey).wav
./11001 Sukiyaki (A Taste Of Honey).wav
./11002 Memory (Acker Bilk).wav
./11002 Stuck On You (Acker Bilk).wav
./11002 Could I Have This Dance (Acker Bilk).wav
./11002 Do That To Me One More Time (Acker Bilk).wav
./11002 This Masquerade (Acker Bilk).wav
./11002 Just Once (Acker Bilk).wav
And so on for 6220 entries.
I have another column, Column E, which contains a TRACK NAME which is present within the filename. Looks like this:
American Patrol
Artistry In Rhythm
Begin The Beguine
Big John's Special
Cherokee
For example. So what I want to do is, in another column I want to search through Column B using the strings from Column E and then returning the matched string from Column B.
So if we imagine I put this formula in the C Column starting in the same row as the American Patrol track name, it would search through the range in Column B and return this:
./11249 American Patrol (BBC Big Band).wav
./11249 Artistry In Rhythm (BBC Big Band).wav
./11249 Begin The Beguine (BBC Big Band).wav
And so on.
I tried doing this formula
=VLOOKUP(E2;B2:B6235;2;TRUE)
So, this returns a file name, but it seems to have matched all the filenames and are just returning whichever result I specify in the col_index variable, so now it returns the second match (basically, just the second row in Column B) and if I put a 3 instead, it would just return the third hit, again having matched all the file names, it seems..
I'm not that familiar with Excel functions, so I'm not sure where to look for the solution beyond this.

You should not be using TRUE as a VLOOKUP function's range_lookup parameter on unsorted data. You can, however, wrap your track title in wildcards to achieve the search you are looking for.
      
The formula in C1 is,
=INDEX(B:B, MATCH("*"&E1&"*",B:B, 0))
... or,
=VLOOKUP("*"&E1&"*",B:B, 1, FALSE)
They accomplish the same thing.

Related

Googlesheet IF with multiple cases

in my Google sheet table I have the first list with summary of invoices which are then separated to 4 lists according to parameters (manually). I need to know about all invoices from the first list, on which category/list they are.
So for example - lists: Alphabet, abc, def, mno, xyz. In Alphabet is column "list".
How to write function which found invoice on another list according to ID (column B) from Alphabet and write name of the correct list to column "list". I tried to write this function using IF, match, etc. But I still don't have solution. Can you help me please? Sorry for my English :-)
So here is an example which you could adapt. In columns E:H on the first sheet (and I could hide these columns later, starting in row2 and dragging down as needed, I put the following formulas:
=IF(LEN(iferror(query(abc!$A$2:$A,"select A where A='" & $A2 &"'"),""))>0,"abc","")
=IF(LEN(iferror(query(def!$A$2:$A,"select A where A='" & $A2 &"'"),""))>0,"def","")
=IF(LEN(iferror(query(mno!$A$2:$A,"select A where A='" & $A2 &"'"),""))>0,"mno","")
=IF(LEN(iferror(query(xyz!$A$2:$A,"select A where A='" & $A2 &"'"),""))>0,"xyz","")
Probably I could have simplified a little by putting the sheet names in E1:H1, but you get the idea.
Each of these looks for the ID. If the query succeeds, it returns the name of the sheet. If it fails, it returns the empty string.
Now in column B where I actually want the results, I put this formula in B2 and drag to copy as needed.
=if(E2&F2&G2&H2="","nowhere",E2&F2&G2&H2)
It says put those strings together, and if there is nothing there say nowhere, otherwise say the list. If it appears on more than one, and that can really happen, you could use JOIN instead.

How to extract a column based on it's content in PowerBI

I have a column in my table which looks like below.
ResourceIdentifier
------------------
arn:aws:ec2:us-east-1:7XXXXXX1:instance/i-09TYTYTY79716
arn:aws:glue:us-east-1:5XXXXXX85:devEndpoint/etl-endpoint
i-075656565f7fea3
i-02c3434343f22
qa-271111145-us-east-1-raw
prod-95756565631-us-east-1-raw
prod-957454551631-us-east-1-isin-repository
i-02XXXXXXf0
I want a new column called 'Trimmed Resource Identifier' which looks at ResourceIdentifier and if the value starts with "arn", then returns value after last "/", else returns the whole string.
For eg.
arn:aws:ec2:us-east-1:7XXXXXX1:instance/i-09TYTYTY79716  ---> i-09TYTYTY797168
i-02XXXXXXf0 --> i-02XXXXXXf0
How do I do this ? I tried creating a new column called "first 3 letters" by extracting first 3 letters of the ResourceIdentifier column but I am getting stuck at the step of adding conditional column. Please see the image below.
Is there a way I can do all of this in one step using DAX instead of creating a new intermediate column ?
Many Thanks
The GUI is too simple to do exactly what you want but go ahead and use it to create the next step, which we can then modify to work properly.
Filling out the GUI like this
will produce a line of code that looks like this (turn on the Formula Bar under the View tab in the query editor if you don't see this formula).
= Table.AddColumn(#"Name of Previous Step Here", "Custom",
each if Text.StartsWith([ResourceIdentifier], "arn") then "output" else [ResourceIdentifier])
The first three letters bit is already handled with the operator I chose, so all that remains is to change the "output" placeholder to what we actually want. There's a handy Text.AfterDelimiter function we can use for this.
Text.AfterDelimiter([ResourceIdentifier], "/", {0, RelativePosition.FromEnd})
This tells it to take the text after the first / (starting from the end). Replace "output" with this expression and you should be good to go.

How to create new column that parses correct values from a row to a list

I am struggling on creating a formula with Power Bi that would split a single rows value into a list of values that i want.
So I have a column that is called ID and it has values such as:
"ID001122, ID223344" or "IRRELEVANT TEXT ID112233, MORE IRRELEVANT;ID223344 TEXT"
What is important is to save the ID and 6 numbers after it. The first example would turn into a list like this: {"ID001122","ID223344"}. The second example would look exactly the same but it would just parse all the irrelevant text from between.
I was looking for some type of an loop formula where you could use the text find function to find ID starting point and use middle function to extract 8 characters from the start but I had no progress in finding such. I tried making lists from comma separator but I noticed that not all rows had commas to separate IDs.
The end results would be that the original value is on one column next to the list of parsed values which then could be expanded to new rows.
ID Parsed ID
"Random ID123456, Text;ID23456" List {"ID123456","ID23456"}
Any of you have former experience?
Hey I found the answer by myself using a good article similar to my problem.
Here is my solution without any further text parsing which i can do later on.
each let
PosList = Text.PositionOf([ID],"ID",Occurrence.All),
List = List.Transform(PosList, (x) => Text.Middle([ID],x,8))
in List
For example this would result "(ID343137,ID352973) ID358388" into {ID343137,ID352973,ID358388}
Ended up being easier than I thought. Suppose the solution relied again on the lists!

How to find a substring anywhere in a string

This should be easy, but I'm finding it difficult.
I just want to find whether a substring exists anywhere in a string. In my case, whether the name of a website exists in the title of a product.
My code is like this:
#FindNoCase("Amazon.com", "Google Chromecast available at Amazon")#
The above returns a 0 which is correct because the entire substring "Amazon.com" doesn't exist in the main string. But some of it does, namely the "Amazon" part.
How could I achieve what I'm trying to do which is just see if ANY of the substring (at least more than 2 character in length) exists in the main string?
So I need something like FindOneOf() but actually "find at least three of". It should then look at the word "Amazon" in the product title and check if at least 3 characters in the sequence of "Amazon.com" exists. When it sees that "Ama" exists, then it just needs to return a true value. Can it be done using the existing built-in functions somehow?
Update: Very simple solution. I used Left("amazon", 3).
There's a lot of danger in false positives, like if someone was buying the Alabama state flag.
Because of store names that contain spaces, this is a little tricky (Wal Mart is often written with a space).
If your string always contains at [store], you can extract the store name by finding the last at in the sentence and creating a string by chopping off everything else.
Because it looks for occurrences of at only as a whole word, there's no danger with store names such as Beats Audio, or Sam's Meat Shop. I can't think of any any stores with the word at in the name. While that would technically trip it up, there's much lower risk, and you can do a pre-replace on such store names.
<cfset mystring = "Google Chromecast available at Amazon">
<cfset SellerName = REReplaceNoCase(mystring,".*\b(?:at)\b(?!.*\b(?:at)\b)\s*","")>
<cfoutput>Seller: #Sellername#</cfoutput>
You can then do your comparisons much more safely.
Per your comment, If you know all possible patterns, you can still obtain the data if you want to (false positives can either be embarrassing or catastrophic, depending on the action). If you know the stores you're working with, you can use a regex to pull out the string like this
<cfset mystring = "Google Chromecast available at Amazon.co.uk">
<cfset SellerName = REReplaceNoCase(mystring,".*\b((Google|Amazon|Wal[\W]*Mart|E[\W]*bay)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #Sellername#</cfoutput>
The only part you need to update is the pipe-delimited list You might add K-Mart as K[\W]*Mart the [\W]* permits any special character or space so it covers kMart, K-Mart, k*Mart, but not Kwik-E-Mart.
Update #2, per more comments
<cfset mystring = "Google Chromecast available at Toys-R-US">
<cfset SellerNameRE = REReplace(rsProduct.sellername,"[\W]+","[\W]*","ALL")>
<cfset TheSellerName = REReplaceNoCase(mystring,".*\b((#sellernameRE#)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #TheSellername# (#SellerNameRE#)</cfoutput>
This replaces any symbols with the wildcard character so that symbols aren't required so that if something says Wal*Mart, it will still match WalMart.
You could also load a seperate column with "Regex Names" so that you're not doing this each time.
So your table would look something like
SellerID SellerName RegexName
1 Wal-Mart Wal[\W]*Mart
2 Toys-R-US Toys[\W]*R[\W]*US
<cfset mystring = "Google Chromecast available at Toys-R-US">
<cfset TheSellerName = REReplaceNoCase(mystring,".*\b((#rsProduct.RegexName#)(\.[a-z]+)*)\b","\1")>
<cfoutput>Seller: #TheSellername# (#SellerNameRE#)</cfoutput>
Solved it by doing this
#FindNoCase(left("Amazon.com", 3), "Google Chromecast available at Amazon")#
Yes there is potential it won't do what I need in cases where the seller name less than 3 characters long. But I think its rare enough to be ok.

Stata: Efficient way to replace numerical values with string values

I have code that currently looks like this:
replace fname = "JACK" if id==103
replace lname = "MARTIN" if id==103
replace fname = "MICHAEL" if id==104
replace lname = "JOHNSON" if id==104
And it goes on for multiple pages like this, replacing an ID name with a first and last name string. I was wondering if there is a more efficient way to do this en masse, perhaps by using the recode command?
I will echo the other answers that suggest a merge is the best way to do this.
But if you absolutely must code the lines item-wise (again, messy) you can generate a long list ("pages") of replace commands by using MS Excel to "help" you write the code. Here is a picture of your Excel sheet with one example, showing the MS Excel formula:
columns:
A B C D
row: 1 last first id code
2 MARTIN JACK 103 ="replace fname=^"&B2&"^ if id=="&C2
You type that in, make sure it looks like Stata code when the formula calculates (aside from the carets), and copy the formula in column D down to the end of your list. Then copy the whole block of Stata code in column D generated by the formulas into your do-file, and do a find and replace (be careful here if you are using the caret elsewhere for mathematical uses!!) for all ^ to be replaced with ", which will end up generating proper Stata syntax.
(This is truly a brute force way of doing this, and is less dynamic in the case that there are subsequent changes to your generation list. All--apologies in advance for answering a question here advocating use of Excel :) )
You don't explain where the strings you want to add come from, but what is generally the best technique is explained at
http://www.stata.com/support/faqs/data-management/group-characteristics-for-subsets/index.html
Create an associative array of ids vs Fname,Lname
103 => JACK,MARTIN
104 => MICHAEL,JOHNSON
...
Replace
id => hash{id} ( fname & lname )
The efficiency of doing this will be taken care by the programming language used