Why am I facing a regex detection problem inside a loop in Python? - regex

I'm facing a very weird problem while using regex.
weight='abcd'
description='ml'
for symbol in Syntax['symbol']:
    print(symbol)
    weight=re.findall(symbol,description)
print(weight)
output --> []
Syntax is a data frame that contains different units, including " ml " in its symbol column. I have manually printed the symbol variable and it prints the required unit, "ml", which is used as the pattern in the loop, but the search still returns [] (or None when using re.match()).
But when i try code below
description='ml'
pattern='ml'
print(re.findall(pattern,description))
it prints ['ml']. Why? The two snippets are logically the same.

In the top code, you're only printing the result of the final regex search, since print(weight) is outside your loop. It's all well and good if "ml" is somewhere in your data frame, but if the last value of symbol doesn't match anything in description, the final search finds no matches and you get an empty list.
Try printing weight inside the for loop and see what output you get.

description='ml'
weight=0
for symbol in self.units['symbol']:
    print("units in Syntax dataframe : ",symbol)
    weight=re.findall(symbol,description)
    if weight!=[]:
        break
print(weight)
I have understood the problem: I was not stopping the loop when 'ml' was found, so weight was overwritten by the later symbols that did not match. That's why it was printing [] or None.
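A minimal sketch of the same fix as a reusable function, assuming the symbol column can be treated as a plain list of strings (the symbols list below is made up and stands in for the data frame column). Stripping whitespace and calling re.escape are assumptions on my part, not something the thread confirms: the question mentions " ml " with surrounding spaces, and a pattern with stray spaces would not match 'ml'.

```python
import re

def first_matching_unit(symbols, description):
    """Return the matches for the first symbol found in description, or []."""
    for symbol in symbols:
        # Strip stray whitespace and escape regex metacharacters so the
        # symbol is matched literally (an assumption about the data).
        pattern = re.escape(symbol.strip())
        if not pattern:
            continue  # skip empty symbols, which would match everywhere
        matches = re.findall(pattern, description)
        if matches:
            return matches  # stop at the first hit instead of overwriting it
    return []

# Hypothetical stand-in for the symbol column of the data frame.
symbols = ["kg", " ml ", "g"]
print(first_matching_unit(symbols, "ml"))
```

Returning from inside the loop plays the same role as the break above: the first non-empty result is kept rather than replaced by later misses.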

Related

Matlab: What's the most efficient approach to parse a large table or cell array with regexp when sometimes there is no match?

I am working with a messy manually maintained "database" that has a column containing a string with name,value pairs. I am trying to parse the entire column with regexp to pull out the values. The column is huge (>100,000 entries). As a proxy for my actual data, let's use this code:
line1={'''thing1'': ''-583'', ''thing2'': ''245'', ''thing3'': ''246'', ''morestuff'':, '''''};
line2={'''thing1'': ''617'', ''thing2'': ''239'', ''morestuff'':, '''''};
line3={'''thing1'': ''unexpected_string(with)parens5'', ''thing2'': 245, ''thing3'':''246'', ''morestuff'':, '''''};
mycell=vertcat(line1,line2,line3);
This captures the general issues encountered in the database. I want to extract what thing1, thing2, and thing3 are in each line using cellfun to output a scalar cell array. They should normally be 3 digit numbers, but sometimes they have an unexpected form. Sometimes thing3 is completely missing, without the name even showing up in the line. Sometimes there are minor formatting inconsistencies, like single quotes missing around the value, spaces missing, or dashes showing up in front of the three digit value. I have managed to handle all of these, except for the case where thing3 is completely missing.
My general approach has been to use expressions like this:
expr1='(?<=thing1''):\s?''?-?([\w\d().]*?)''?,';
expr2='(?<=thing2''):\s?''?-?([\w\d().]*?)''?,';
expr3='(?<=thing3''):\s?''?-?([\w\d().]*?)''?,';
This looks behind for thingX' and then tries to match : followed by zero or one spaces, followed by 0 or 1 single quote, followed by zero or one dash, followed by any combination of letters, numbers, parentheses, or periods (this is defined as the token), using a lazy match, until zero or one single quote is encountered, followed by a comma. I call regexp as regexp(___,'tokens','once') to return the matching token.
The problem is that when there is no match, regexp returns an empty array. This prevents me from using, say,
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),mycell);
unless I call it with 'UniformOutput',false. The problem with that is twofold. First, I need to then manually find the rows where there was no match. For example, I can do this:
emptyout=cellfun(@(x) isempty(x),out);
emptyID=find(emptyout);
backfill=cell(length(emptyID),1);
[backfill{:}]=deal('Unknown');
out(emptyID)=backfill;
In this example, emptyID has a length of 1 so this code is overkill. But I believe this is the correct way to generalize for when it is longer. This code will change every empty cell array in out with the string Unknown. But this leads to the second problem. I've now got a 'messy' cell array of non-scalar values. I cannot, for example, check unique(out) as a result.
Pardon the long-windedness but I wanted to give a clear example of the problem. Now my actual question is in a few parts:
Is there a way to accomplish what I'm trying to do without using 'UniformOutput',false? For example, is there a way to have regexp pass a custom string if there is no match (e.g. pass 'Unknown' if there is no match)? I can think of one 'cheat', which would be to use the | operator in the expression, and if the first token is not matched, look for something that is ALWAYS found. I would then still need to double back through the output and change every instance of that result to 'Unknown'.
If I take the 'UniformOutput',false approach, how can I recover a scalar cell array at the end to easily manipulate it (e.g. pass it through unique)? I will admit I'm not 100% clear on scalar vs nonscalar cell arrays.
If there is some overall different approach that I'm not thinking of, I'm also open to it.
Tangential to the main question, I also tried using a single expression to run regexp using 3 tokens to pull out the values of thing1, thing2, and thing3 in one pass. This seems to require 'UniformOutput',false even when there are no empty results from regexp. I'm not sure how to get a scalar cell array using this approach (e.g. an Nx1 cell array where each cell is a 3x1 cell).
At the end of the day, I want to build a table using these results:
mytable=table(out1,out2,out3);
Edit: Using celldisp sheds some light on the problem:
celldisp(out)
out{1}{1} =
246
out{2} =
Unknown
out{3}{1} =
246
I assume that I need to change the structure of out so that the contents of out{1}{1} and out{3}{1} are instead just out{1} and out{3}. But I'm not sure how to accomplish this if I need 'UniformOutput',false.
Note: I've not used MATLAB and this doesn't answer the "efficient" aspect, but...
How about forcing there to always be a match?
Just thinking about you really wanting a match to skip this problem, how about an empty match?
Looking on the MATLAB help page here I can see a 'emptymatch' option, perhaps this is something to try.
E.g.
the_thing_i_want_to_find|
Match "the_thing_i_want_to_find" or an empty match, note the | character.
In capture group it might look like this:
(the_thing_i_want_to_find|)
As a workaround, I have found that regexprep can be used to find entries where thing3 is missing. For example:
replace='$1 ''thing3'': ''Unknown'', ''morestuff''';
missingexpr='(?<=thing2'':\s?)(''?-?[\w\d().]*?''?,) ''morestuff''';
regexprep(mycell{2},missingexpr,replace)
ans =
''thing1': '617', 'thing2': '239', 'thing3': 'Unknown', 'morestuff':, '''
Applying it to the entire array:
fixedcell=cellfun(@(x) regexprep(x,missingexpr,replace),mycell);
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),fixedcell,'UniformOutput',false);
This feels a little roundabout, but it works.
cellfun can be replaced with a plain old for loop. Your code will either be equally fast, or maybe even faster. cellfun is implemented with a loop anyway, there is no advantage of using it other than fewer lines of code. In your explicit loop, you can then check the output of regexp, and build your output array any way you like.
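The explicit-loop idea can be sketched as follows. This is a Python analogue rather than MATLAB, and the pattern is a simplified adaptation of expr3, not the asker's exact expression; extract_value and the sample lines are illustrative. The point is that the fallback 'Unknown' is chosen inside the loop, so the result is a flat list of strings from the start and needs no backfilling.

```python
import re

def extract_value(line, name, default="Unknown"):
    """Return the value following 'name': in line, or default if absent."""
    # Optional quote, optional dash, then a lazy run of word characters,
    # parentheses or periods, up to an optional quote and a comma.
    pattern = r"'" + re.escape(name) + r"':\s?'?-?([\w().]*?)'?,"
    match = re.search(pattern, line)
    return match.group(1) if match else default

# Sample lines modelled on the proxy data in the question.
lines = [
    "'thing1': '-583', 'thing2': '245', 'thing3': '246', 'morestuff':, ''",
    "'thing1': '617', 'thing2': '239', 'morestuff':, ''",
    "'thing1': 'unexpected_string(with)parens5', 'thing2': 245, 'thing3':'246', 'morestuff':, ''",
]

out3 = [extract_value(line, "thing3") for line in lines]
print(out3)
print(sorted(set(out3)))  # unique values are easy to get from a flat list
```

Because every element is a plain string, operations such as taking the unique values or building a table column work without unwrapping nested containers.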

printing a character when no other output was printed

How do I print a particular character when up to that point in my code, no condition has been satisfied?
Example :
If I have an array of 100 numbers and if I traverse the array by 'for' loop and do not find my number, how do I print a single line " Not Found " ?
I am using C++.
You would typically create a boolean variable before the loop, maybe called found and set it to false. If the number is found in the loop, you would set found to true. After the loop is done, you would use an if statement to test if found is false and, if found to be so, output the " Not Found " message.
There are other ways that are usually better. But this is the simplest one that everyone should learn first.
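The flag approach can be sketched as below. It is written in Python rather than C++, but the structure carries over directly: a boolean set inside the loop and tested once after it. report_search and the sample numbers are made up for illustration.

```python
def report_search(numbers, target):
    """Return 'Found' if target is in numbers, otherwise 'Not Found'."""
    found = False  # nothing has matched yet
    for number in numbers:
        if number == target:
            found = True
            break  # no need to keep looking
    # The message is decided once, after the loop has finished.
    return "Found" if found else "Not Found"

numbers = list(range(100))  # an array of 100 numbers: 0..99
print(report_search(numbers, 42))
print(report_search(numbers, 250))
```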

Cell appears to be empty but keeps saying it contains values in it

I have this formula in a cell:
=ARRAYFORMULA(UNIQUE(TRANSPOSE(SPLIT(QUERY(TRANSPOSE(QUERY(TRANSPOSE( REGEXREPLACE(REGEXEXTRACT(INDIRECT("C2:C"&COUNTA(C2:C)+1), REPT("(.)", LEN(INDIRECT("C2:C"&COUNTA(C2:C)+1)))), "['A-Za-z\.-]", )),,999^99)),,999^99), " "))))
When no diacritics appear in the search column, the cell is supposed to be empty. But when I copy that cell to another one, it comes back as if there were values in it; it seems to contain several spaces together.
The LEN function also reports a non-zero length, even though the cell looks empty. I would like it to be totally empty when no diacritics are found in the list of names.
I would like help making the cell truly empty when no diacritics are found.
Here is the link to the spreadsheet if it becomes easier to understand the situation:
https://docs.google.com/spreadsheets/d/1yfB8GskVU_ciFKuzae9XQF-pi3y6jsYtsanN46vmNOs/edit?usp=sharing
You can add TRIM to fix this issue:
=ARRAYFORMULA(UNIQUE(TRIM(TRANSPOSE(SPLIT(QUERY(TRANSPOSE(QUERY(TRANSPOSE(
REGEXREPLACE(REGEXEXTRACT(INDIRECT("C2:C"&COUNTA(C2:C)+1), REPT("(.)",
LEN(INDIRECT("C2:C"&COUNTA(C2:C)+1)))), "['A-Za-z\.-]", )),,999^99)),,999^99), " ")))))

Python printing Deepdiff value

I am using the deepdiff function to find the difference between 2 dictionaries, which gives the output as: A = {'dictionary_item_added': set(["root['mismatched_element']"])}. How to print just 'mismatched_element'?
Try this:
set_item = A['dictionary_item_added'].pop()
print(set_item[set_item.find("['")+2 : set_item.find("']")])
The first line gets the element from the set, the second removes the [] and everything around them, and prints.
This code does the specific task you asked for, but it's hard to generalize the solution without a more general question.
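A slightly more general sketch, assuming the set may hold several paths of the form root['key'] and that you want the last key of each. It works on the path strings only, so the set below is typed in by hand rather than produced by deepdiff, and last_key is a made-up helper name.

```python
import re

def last_key(path):
    """Return the last ['key'] component of a path such as "root['a']['b']"."""
    keys = re.findall(r"\['([^']*)'\]", path)
    return keys[-1] if keys else None

# Hand-written stand-in for the diff output shown in the question.
A = {"dictionary_item_added": {"root['mismatched_element']"}}

# Iterating (instead of pop) leaves the original set untouched.
for path in sorted(A["dictionary_item_added"]):
    print(last_key(path))
```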

Unable to find string using Regex in Word 2010

I am trying to find a string in Word. The search finds 3 of the strings in the document, but the remaining 600+ are not found.
I'm trying to search using (this is the regex in the external tool I used initially):
(ABC-\d+)
Using a tool to search in Word I searched for
(ABC.*)
and all of the results ended up being some form of the following:
ABCNormal -13
I don't have a clue how to find out what that even means in this context.
I tried searching IN Word for the following REGEX and it doesn't find any except the 3 that don't have the "normal " thing.
ABC?@[0-9]@
That should mean: look for ABC, followed by some number of characters, followed by some number of digits.
I have tried turning on hidden text and the like in the display options, the paragraph icon on the ribbon, and anything else I can think of.
Any ideas how to figure out how to SEE what this is, and either fix it, or work around it?
In the external tool [(ABC)[^0-9]+(\d+)] finally worked, but I still don't understand how to remove the Normal Text that is in the string that is NOT visible.
For example the string I visibly see
ABC-13
the text Regex is seeing is
ABCNormal -13
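To illustrate why the last pattern matches where (ABC-\d+) does not, here is a small Python sketch run against the string the tool reported. It only demonstrates the regex behaviour on that text, using Python's re module as a stand-in for the unnamed external tool; it does not explain where the hidden "Normal " comes from in the Word file. normalize is a made-up helper name.

```python
import re

# Same pattern that finally worked in the external tool.
PATTERN = re.compile(r"(ABC)[^0-9]+(\d+)")

def normalize(text):
    """Rewrite matches such as 'ABCNormal -13' to the visible form 'ABC-13'."""
    return PATTERN.sub(r"\1-\2", text)

seen_by_regex = "ABCNormal -13"

# The original pattern fails because "Normal " sits between ABC and -13.
print(re.search(r"ABC-\d+", seen_by_regex))

# Keeping only the two captured groups drops the unwanted text.
print(normalize(seen_by_regex))
```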