Match multiple strings using re.search within an if condition - regex

I am writing a program to create a list from a spreadsheet based on a position value in another cell. My code looks like this:
for j in xrange(1,13):
    for sheet in wb.sheets():
        for i in xrange(1,12*15):
            team=sheet.cell(i,0)
            position=sheet.cell(i,2)
            games=sheet.cell(i,23)
            if re.match(owner[j], str(team.value)) and (not re.findall('Defense' or 'K,' or 'KFG' or 'KKO', str(position.value))):
                try:
                    list.append(int(games.value))
                except ValueError:
                    list.append(0)
            else:
                pass
    print list
    list=[]
The goal is to append to the list when a row matches the owner in the first column and the position column does not contain Defense, K, KFG, or KKO.
Unfortunately, the values for K, KFG and KKO all show up in my lists, while the Defense values are correctly excluded. How can I ensure the other filtering criteria are met as well?
As a side note, these positions sit in amongst other bits of text, so I need to search the whole string rather than use match().

"Defense" is a 'truthy' value, so the result of:
'Defense' or 'K,' or 'KFG' or 'KKO'
is 'Defense'.
Therefore, the condition you have is no different from:
re.match(owner[j], str(team.value)) and (not re.findall('Defense', str(position.value)))
If you want alternatives in a regex, use | in the pattern:
re.match(owner[j], str(team.value)) and (not re.findall('Defense|K,|KFG|KKO', str(position.value)))
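Both points can be checked in isolation with a small Python sketch. The sample position strings are made up for illustration; only the four patterns come from the question.

```python
import re

# `or` returns its first truthy operand, so only 'Defense' survives.
first_truthy = 'Defense' or 'K,' or 'KFG' or 'KKO'

# Alternation inside the pattern checks all four substrings.
EXCLUDED = re.compile(r'Defense|K,|KFG|KKO')

def is_excluded(position_text):
    """Return True if any excluded position appears anywhere in the text."""
    return EXCLUDED.search(position_text) is not None

# Hypothetical cell contents, not taken from the asker's spreadsheet.
samples = ['NYG Defense', 'Smith K, NE', 'Brady QB, NE']
kept = [s for s in samples if not is_excluded(s)]
```

Because the positions sit inside longer text, search() (or findall()) is the appropriate call here; match() would only look at the start of the string.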

Related

Google Sheets: How can I extract partial text from a string based on a column of different options?

Goal: I have a bunch of keywords I'd like to categorise automatically based on topic parameters I set. Categories that match must be in the same column so the keyword data can be filtered.
e.g. If I have "Puppies" as a first topic, it shouldn't appear as a secondary or third topic otherwise the data cannot be filtered as needed.
Example Data: https://docs.google.com/spreadsheets/d/1TWYepApOtWDlwoTP8zkaflD7AoxD_LZ4PxssSpFlrWQ/edit?usp=sharing
Video: https://drive.google.com/file/d/11T5hhyestKRY4GpuwC7RF6tx-xQudNok/view?usp=sharing
Parameters Tab: I will add words in columns D-F that change based on the keyword data set and there will often be hundreds, if not thousands, of options for larger data sets.
Categories Tab: I'd like to have a formula or script that goes down the columns D-F in Parameters and fills in a corresponding value (in Categories! columns D-F respectively) based on partial match with column B or C (makes no difference to me if there's a delimiter like a space or not. Final data sheet should only have one of these columns though).
Things I've Tried:
I've tried a bunch of things. Nested IF formula with regexmatch works but seems clunky.
e.g. this formula in Categories! column D
=IF(REGEXMATCH($B2,LOWER(Parameters!$D$3)),Parameters!$D$3,IF(REGEXMATCH($B2,LOWER(Parameters!$D$4)),Parameters!$D$4,""))
I nested more statements, swapping in the next cell of the Parameters!D column each time (that is, manually adding $D$5, $D$6 and so on), but this seems inefficient for a list thousands of words long. For example, the third topic will get very long once all dog breed types are added.
Any tips?
Functionality I haven't worked out:
If a string in Categories B or C contains more than one of the topics I set out in the parameters, is there a way to show the first two instead of just the first one?
e.g. Cell A14 in Categories, how can I get a formula/automation to add both "Akita" & "German Shepherd" into the third topic? Concatenation with a CHAR(10) to add to new line is ideal format here. There will be other keywords that won't have both in there in which case these values will just show up individually.
Since this data set has a bunch of mixed breeds and all breeds are added as a third topic, it would be great to differentiate interest in mixes vs pure breeds without confusion.
Any ideas will be greatly appreciated! Also, I'm open to variations in layout and functionality of the spreadsheet in case you have a more creative solution. I just care about efficiently automating a tedious task!!
Try using custom function:
To create custom function:
1. Create or open a spreadsheet in Google Sheets.
2. Select the menu item Tools > Script editor.
3. Delete any code in the script editor and copy and paste the code below into the script editor.
4. At the top, click Save.
To use custom function:
1. Click the cell where you want to use the function.
2. Type an equals sign (=) followed by the function name and any input value (for example, =DOUBLE(A1)) and press Enter.
3. The cell will momentarily display Loading..., then return the result.
Code:
function matchTopic(p, str) {
  var params = p.flat(); //Convert 2d array into 1d
  var buildRegex = params.map(i => '(' + i + ')').join('|'); //convert array into series of capturing groups. Example (Dog)|(Puppies)
  var regex = new RegExp(buildRegex,"gi");
  var results = str.match(regex);
  if(results){
    // The for loops below will convert the first character of each word to Uppercase
    for(var i = 0 ; i < results.length ; i++){
      var words = results[i].split(" ");
      for (let j = 0; j < words.length; j++) {
        words[j] = words[j][0].toUpperCase() + words[j].substr(1);
      }
      results[i] = words.join(" ");
    }
    return results.join(","); //return with comma separator
  }else{
    return ""; //return blank if result is null
  }
}
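For readers who want to test the matching logic outside of Sheets, here is a rough Python sketch of the same idea. The name match_topic and the sample inputs are hypothetical. Unlike the script above, this version skips empty cells and escapes each keyword with re.escape, which is an assumption about what is wanted rather than something the original code does.

```python
import re

def match_topic(params, text):
    """Return every keyword found in text, capitalised and comma-separated."""
    # Flatten the 2D range (list of rows) and drop empty cells.
    keywords = [cell for row in params for cell in row if cell]
    if not keywords:
        return ""
    # Join the keywords with "|" so the regex matches any one of them.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    matches = pattern.findall(text)
    # Upper-case the first character of each word, as the script does.
    fixed = [
        " ".join(word[:1].upper() + word[1:] for word in match.split(" "))
        for match in matches
    ]
    return ",".join(fixed)
```

With a hypothetical parameter range of [["Akita"], ["German Shepherd"], [""]] and the text "akita german shepherd mix puppies", this sketch returns "Akita,German Shepherd".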
Example usage (the original screenshots are not preserved): Parameters, First Topic, Second Topic, Third Topic.
Reference:
Custom Functions
I've added a new sheet ("Erik Help") with separate formulas (highlighted in green currently) for each of your keyword columns. They are each essentially the same except for specific column references, so I'll include only the "First Topic" formula here:
=ArrayFormula({"First Topic";IF(A2:A="",,IFERROR(REGEXEXTRACT(LOWER(B2:B&C2:C),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))) & IFERROR(CHAR(10)&REGEXEXTRACT(REGEXREPLACE(LOWER(B2:B&C2:C),IFERROR(REGEXEXTRACT(LOWER(B2:B&C2:C),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))),""),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))))})
This formula first creates the header (which can be changed within the formula itself as you like).
The opening IF condition leaves any row in the results column blank if the corresponding cell in Column A of that row is also blank.
JOIN is used to form a concatenated string of all keywords separated by the pipe symbol, which REGEXEXTRACT interprets as OR.
IFERROR(REGEXEXTRACT(LOWER(B2:B&C2:C),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))) will attempt to extract any of the keywords from each concatenated string in Columns B and C. If none is found, IFERROR will return null.
Then a second-round attempt is made:
& IFERROR(CHAR(10)&REGEXEXTRACT(REGEXREPLACE(LOWER(B2:B&C2:C),IFERROR(REGEXEXTRACT(LOWER(B2:B&C2:C),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>""))))),""),JOIN("|",LOWER(FILTER(Parameters!D3:D,Parameters!D3:D<>"")))))
Only this time, REGEXREPLACE is used to replace the results of the first round with null, thus eliminating them from being found in round two. This will cause any second listing from the JOIN clause to be found, if one exists. Otherwise, IFERROR again returns null for round two.
CHAR(10) is the new-line character.
I've written each of the three formulas to return up to two results for each keyword column. If that is not your intention for "First Topic" and "Second Topic" (i.e., if you only wanted a maximum of one result for each of those columns), just select and delete the entire round-two portion of the formula shown above from the formula in each of those columns.
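The two-round idea (extract one keyword, remove it from the text, then extract again) can be sketched in Python as follows. This is an illustration of the technique only, not a translation of the formula; the function name and sample data are hypothetical, and keywords are escaped with re.escape, which the formula does not do.

```python
import re

def first_two_matches(text, keywords):
    """Return up to two distinct keyword hits, separated by a newline."""
    cleaned = [k.lower() for k in keywords if k]
    if not cleaned:
        return ""
    pattern = re.compile("|".join(re.escape(k) for k in cleaned))
    lowered = text.lower()

    # Round one: leftmost keyword in the text.
    first = pattern.search(lowered)
    if first is None:
        return ""

    # Round two: blank out every occurrence of the first hit, then search again.
    remaining = lowered.replace(first.group(0), "")
    second = pattern.search(remaining)
    if second is None:
        return first.group(0)
    return first.group(0) + "\n" + second.group(0)
```

The newline plays the role of CHAR(10) in the formula, and returning an empty string corresponds to IFERROR returning null.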

How to create new column that parses correct values from a row to a list

I am struggling to create a formula in Power BI that would split a single row's value into a list of the values that I want.
So I have a column that is called ID and it has values such as:
"ID001122, ID223344" or "IRRELEVANT TEXT ID112233, MORE IRRELEVANT;ID223344 TEXT"
What is important is to save the ID and 6 numbers after it. The first example would turn into a list like this: {"ID001122","ID223344"}. The second example would look exactly the same but it would just parse all the irrelevant text from between.
I was looking for some type of loop formula where you could use the text find function to locate each ID starting point and the middle function to extract 8 characters from there, but I made no progress in finding one. I tried making lists using the comma as a separator, but I noticed that not all rows had commas separating the IDs.
The end results would be that the original value is on one column next to the list of parsed values which then could be expanded to new rows.
ID | Parsed ID
"Random ID123456, Text;ID23456" | List {"ID123456","ID23456"}
Does anyone have experience with this?
I found the answer myself with the help of a good article on a similar problem.
Here is my solution, without any further text parsing, which I can do later on.
each let
    PosList = Text.PositionOf([ID], "ID", Occurrence.All),
    List = List.Transform(PosList, (x) => Text.Middle([ID], x, 8))
in List
For example, this would turn "(ID343137,ID352973) ID358388" into {ID343137,ID352973,ID358388}.
It ended up being easier than I thought. Once again, the solution relied on lists!
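For comparison, the same extraction can be sketched in Python with a regular expression. This is a different technique from the position-based Power Query solution above: it assumes an ID is always the literal "ID" followed by exactly six digits, as described in the question, so shorter fragments are skipped. The function name is hypothetical.

```python
import re

# "ID" followed by exactly six digits, per the format stated in the question.
ID_PATTERN = re.compile(r"ID\d{6}")

def extract_ids(text):
    """Return every ID-plus-six-digits token in text, in order of appearance."""
    return ID_PATTERN.findall(text)
```

Note that the position-based approach takes 8 characters after every occurrence of "ID", whatever they are, whereas this pattern only accepts digits.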

Excel, duplicates in string, single cell iteration

I'm trying to extract certain pieces of data from a very long string within a single cell. For the sake of this exercise, this is the data I have in cell A1.
a:2:{s:15:"info_buyRequest";a:5:{s:4:"uenc";s:252:"WN0aW9uYWwuaHRlqdyZ2dC1hdD0lN0JhZHR5cGUlN0QmdnQtcHRpPSU3QmFkd29yZHNfcHJvZHVjdHRhcmdldGlkJTdEJiU3Qmlnbm9y,";s:7:"product";s:4:"1253";s:8:"form_key";s:16:"wyfg89N";s:7:"options";a:6:{i:10144;s:5:"73068";i:10145;s:5:"63085";i:10141;s:5:"73059";i:10143;s:5:"73064";i:13340;s:5:"99988";i:10142;s:5:"73063";}s:3:"qty";s:1:"1";}s:7:"options";a:6:{i:0;a:7:{s:5:"label";s:5:"Color";s:5:"value";s:11:"White";s:11:"print_value";s:11:"White";s:9:"option_id";s:5:"10144";s:11:"option_type";s:9:"drop_down";s:12:"option_value";s:5:"73068";s:11:"custom_view";b:0;}i:1;a:7:{s:5:"label";s:4:"Trim";s:5:"value";s:11:"Black";s:11:"print_value";s:11:"Black";s:9:"option_id";s:5:"10145";s:11:"option_type";s:9:"drop_down";s:12:"option_value";s:5:"63085";s:11:"custom_view";b:0;}i:2;a:7:{s:5:"label";s:7:"Material";s:5:"value";s:15:"Vinyl";s:11:"print_value";s:15:"Vinyl";s:9:"option_id";s:5:"10141";s:11:"option_type";s:9:"drop_down";s:12:"option_value";s:5:"73059";s:11:"custom_view";b:0;}i:3;a:7:{s:5:"label";s:6:"Orientation";s:5:"value";s:17:"Left Side";s:11:"print_value";s:17:"Left Side";s:9:"option_id";s:5:"10143";s:11:"option_type";s:9:"drop_down";s:12:"option_value";s:5:"73064";s:11:"custom_view";b:0;}i:4;a:7:{s:5:"label";s:12:"Table";s:5:"value";s:16:"YES! Add Table";s:11:"print_value";s:16:"YES! Add Table";s:9:"option_id";s:5:"13340";s:11:"option_type";s:9:"drop_down";s:12:"option_value";s:5:"99988";s:11:"custom_view";b:0;}i:5;a:7:{s:5:"label";s:8:"Shipping";s:5:"value";s:20:"Front Door Delivery";s:11:"print_value";s:20:"Front Door Delivery";s:9:"option_id";s:5:"10142";s:11:"option_type";s:9:"drop_down";s:12:"option_value";s:5:"73063";s:11:"custom_view";b:0;}}}
The end result would be to separate out the values for Color, Trim, Material, Orientation, etc.
The formula I was using is this:
=MID(LEFT(A4,FIND("print_value",A4)-9),FIND("Color",A4)+25,LEN(A4))
This basically looks in between two points and trims out the fat. It works, but only for the first iteration of "print_value". If I were to use this searching for "Trim"...
=MID(LEFT(A4,FIND("print_value",A4)-9),FIND("Trim",A4)+25,LEN(A4))
...I get an empty result. This happens because print_value appears several times in the string and FIND always returns the position of the first occurrence, so LEFT cuts the string off before "Trim" is ever reached.
Even though there are unique factors within this string that I could essentially attach myself to (and arrive at the desired result), I CAN NOT use them as they will not be consistent and will render the formula useless when applied to other cells.
That said, here is what I need. Within this formula, I need a way to either A) tell the formula which iteration of print_value to find or B) change print_value to print_value(1,2,3,4, etc) and then run my trimming formula.
Few options based on this link:
1) VBA - Using a User Defined Function
If you're new to these then follow this tutorial.
Function FindN(sFindWhat As String, _
    sInputString As String, N As Integer) As Integer
    Dim J As Integer
    Application.Volatile
    FindN = 0
    For J = 1 To N
        FindN = InStr(FindN + 1, sInputString, sFindWhat)
        If FindN = 0 Then Exit For
    Next
End Function
2) Using a Formula
=FIND(CHAR(1),SUBSTITUTE(A1,"c",CHAR(1),3))
c is the character you want to find
A1 is the text you want to look in
3 is the nth instance
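The loop in the VBA function (search again starting one character past the previous hit) can be sketched in Python to make the logic easier to follow. The name find_nth and the short sample string are hypothetical; note that Python indexes are 0-based and the sketch returns -1 when there is no n-th occurrence, whereas InStr is 1-based and returns 0.

```python
def find_nth(haystack, needle, n):
    """Return the 0-based index of the n-th occurrence of needle, or -1."""
    position = -1
    for _ in range(n):
        # Resume the search one character after the previous hit.
        position = haystack.find(needle, position + 1)
        if position == -1:
            break
    return position

sample = "a-print_value-b-print_value"
second = find_nth(sample, "print_value", 2)
```

With the position of the n-th "print_value" known, the same MID/LEFT style of slicing can be applied relative to that position instead of the first one.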

compare two dictionary, one with list of float value per key, the other one a value per key (python)

I have a query sequence that I blasted online using NCBIWWW.qblast. In the XML BLAST result, the query sequence has a list of hits (i.e. gi| identifiers), and each hit has multiple HSPs. I built a dictionary my_dict1 with the gi| identifier as the key and appended each bit score to its value, so every key has multiple values.
my_dict1 = {
    "gi|1002819492|": [437.702, 384.47, 380.86, 380.86, 362.83],
    "gi|675820360|": [2617.97, 2614.37, 122.112],
    "gi|953764029|": [414.258, 318.66, 122.112, 86.158],
    "gi|675820410|": [450.653, 388.08, 386.27] }
Then I looked for the max value for each key using:
for key, value in my_dict1.items():
    max_value = max(value)
And made a second dictionary my_dict2:
my_dict2 = {
    "gi|1002819492|": 437.702,
    "gi|675820360|": 2617.97,
    "gi|953764029|": 414.258,
    "gi|675820410|": 450.653 }
I want to compare both dictionaries so that I can extract the HSP with the highest bit score. I am also including other parameters such as query coverage and identity percentage (not shown here). The end goal is to get the best gi| hit with the highest bit score, coverage and identity percentage.
I tried many things to compare the two dictionaries, like this:
First code:
matches[]
if my_dict1.keys() not in my_dict2.keys():
    matches[hit_id] = bit_score
else:
    matches = matches[hit_id], bit_score
Second code:
if hit_id not in matches.keys():
    matches[hit_id]= bit_score
else:
    matches = matches[hit_id], bit_score
Third code:
intersection = set(set(my_dict1.items()) & set(my_dict2.items()))
However, I always end up with two types of errors:
1) TypeError: list indices must be integers, not unicode
2) ... float not iterable...
I would appreciate some help and guidance. Thank you very much in advance for your time. Best regards.
It's not clear what you're trying to do. What is hit_id? What is bit_score? It looks like your second dict is always going to have the same keys as your first if you're creating it by pulling the max value for each key of the first dict.
You say you're trying to compare them, but don't really state what you're actually trying to do. Find those with values under a certain max? Find those with the highest max?
I'm assuming your first code doesn't work because you're trying to use a dict key as an index into matches, which you define as a list. That's probably where your first error is coming from, though you haven't given the lines where the error actually occurs.
See in-code comments below:
# First off, this needs to be a dict.
matches = {}
# This will never happen if you've created these dicts as you stated.
if my_dict1.keys() not in my_dict2.keys():
    matches[hit_id] = bit_score # Not clear what bit_score is?
else:
    # Also not sure what you're trying to do here. This will assign a tuple
    # to matches with whatever the value of matches[hit_id] is and bit_score.
    matches = matches[hit_id], bit_score
Regardless, we really need more information and the full code to figure out your actual goal and what's going wrong.
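In the meantime, here is a minimal sketch under one possible reading of the goal: keep the highest bit score for each hit, then pick the hit whose best score is highest overall. That reading is an assumption, and the names best_per_hit and best_hit are made up for this example; the data is copied from the question. Coverage and identity are not handled because they are not shown.

```python
my_dict1 = {
    "gi|1002819492|": [437.702, 384.47, 380.86, 380.86, 362.83],
    "gi|675820360|": [2617.97, 2614.37, 122.112],
    "gi|953764029|": [414.258, 318.66, 122.112, 86.158],
    "gi|675820410|": [450.653, 388.08, 386.27],
}

# Highest bit score per hit (this is what my_dict2 holds in the question).
best_per_hit = {hit_id: max(scores) for hit_id, scores in my_dict1.items()}

# The hit whose best score is the highest of all.
best_hit = max(best_per_hit, key=best_per_hit.get)
```

Building the second dict with a comprehension avoids the loop that overwrites max_value on every pass, and no comparison between the two dicts is needed afterwards.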

Compare a portion of String value present in 2 Lists

The code below extracts a particular value from the list srchlist and checks for it in the list rplzlist. The contents of srchlist and rplzlist look like this:
srchlist = ["DD='A'\n", "SOUT='*'\n", 'PGM=FTP\n', 'PGM=EMAIL']
rplzlist = ['A=ZZ.VVMSSB\n', 'SOUT=*\n', 'SALEDB=TEST12']
I am extracting the characters after the '=' (equals) sign and within the single quotes using a combination of the split and translate functions.
Of the elements in srchlist, only 'SOUT' matches rplzlist.
Please let me know why the code below does not work, and suggest a better approach for comparing part of a string held in a list.
for ele in srchlist:
    sYmls = ele.split('=')
    vAlue = sYmls[1].translate(None,'\'')
    for elem in rplzlist:
        rPls = elem.split('=')
        if vAlue in rPls:
            print("vAlue")
Here is a more Pythonic approach to what you want to do:
>>> list(set([(i.split('='))[1].translate(None,'\'') for i in srchlist]) & set([j.split('=')[1] for j in rplzlist]))
['*\n']
I used set() and then converted the whole output to a list; you may use .join() instead.
Inside set(), a list comprehension is used, which is usually faster than an equivalent explicit for loop.
Another solution, using join() and replace() in place of translate():
>>> "".join(set([(i.split('='))[1].replace('\'','') for i in srchlist]) & set([j.split('=')[1] for j in rplzlist]))
'*\n'
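As for why the original loop seems not to work: print("vAlue") prints the literal text vAlue rather than the variable, and the two-argument translate(None, ...) form only exists in Python 2. Below is a sketch of a Python 3 variant that also strips the trailing newline, so the result is '*' rather than '*\n'. The helper name value_of is hypothetical, and stripping whitespace is an assumption about what is wanted.

```python
srchlist = ["DD='A'\n", "SOUT='*'\n", 'PGM=FTP\n', 'PGM=EMAIL']
rplzlist = ['A=ZZ.VVMSSB\n', 'SOUT=*\n', 'SALEDB=TEST12']

def value_of(entry):
    """Return the text after the first '=', without whitespace or single quotes."""
    _, _, value = entry.partition("=")
    return value.strip().strip("'")

# Values that appear on the right-hand side in both lists.
common = {value_of(e) for e in srchlist} & {value_of(e) for e in rplzlist}
```

Set comprehensions avoid building intermediate lists, and partition() does not raise an error if an entry has no '=' in it.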