How to get correct size of substring? - regex

I would like to match substring correctly.
re:run("étude", "é",[unicode]).
The result of running this code is {match,[{0,2}]}. This result looks as if I were using an unnormalized Unicode string.
So next I try to add normalization:
re:run(unicode:characters_to_nfc_list("étude"), unicode:characters_to_nfc_list("é"),[unicode]).
The result was the same: {match,[{0,2}]}
Which option do I need to set in Erlang to get the correct character length? I would like to get {match,[{0,1}]}.

Try the ucp option instead of the unicode option.
>re:run("étude", "é",[ucp]).
{match,[{0,1}]}
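The difference between the two results is consistent with the offsets being counted in UTF-8 bytes rather than in characters when the unicode option is set. The Python sketch below is only an illustration of that byte-versus-character distinction (it is not Erlang, and the variable names are made up for the example): the same match has length 2 when measured in UTF-8 bytes and length 1 when measured in code points.

```python
import re

text = "\u00e9tude"   # "étude" with a precomposed é (U+00E9)
needle = "\u00e9"

# Match on code points: the span is measured in characters.
char_span = re.search(needle, text).span()

# Match on the UTF-8 encoding: the span is measured in bytes.
byte_span = re.search(needle.encode("utf-8"), text.encode("utf-8")).span()

print(char_span)  # (0, 1)
print(byte_span)  # (0, 2)
```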

Matlab: What's the most efficient approach to parse a large table or cell array with regexp when sometimes there is no match?

I am working with a messy manually maintained "database" that has a column containing a string with name,value pairs. I am trying to parse the entire column with regexp to pull out the values. The column is huge (>100,000 entries). As a proxy for my actual data, let's use this code:
line1={'''thing1'': ''-583'', ''thing2'': ''245'', ''thing3'': ''246'', ''morestuff'':, '''''};
line2={'''thing1'': ''617'', ''thing2'': ''239'', ''morestuff'':, '''''};
line3={'''thing1'': ''unexpected_string(with)parens5'', ''thing2'': 245, ''thing3'':''246'', ''morestuff'':, '''''};
mycell=vertcat(line1,line2,line3);
This captures the general issues encountered in the database. I want to extract what thing1, thing2, and thing3 are in each line using cellfun to output a scalar cell array. They should normally be 3 digit numbers, but sometimes they have an unexpected form. Sometimes thing3 is completely missing, without the name even showing up in the line. Sometimes there are minor formatting inconsistencies, like single quotes missing around the value, spaces missing, or dashes showing up in front of the three digit value. I have managed to handle all of these, except for the case where thing3 is completely missing.
My general approach has been to use expressions like this:
expr1='(?<=thing1''):\s?''?-?([\w\d().]*?)''?,';
expr2='(?<=thing2''):\s?''?-?([\w\d().]*?)''?,';
expr3='(?<=thing3''):\s?''?-?([\w\d().]*?)''?,';
This looks behind for thingX' and then tries to match : followed by zero or one spaces, followed by 0 or 1 single quote, followed by zero or one dash, followed by any combination of letters, numbers, parentheses, or periods (this is defined as the token), using a lazy match, until zero or one single quote is encountered, followed by a comma. I call regexp as regexp(___,'tokens','once') to return the matching token.
The problem is that when there is no match, regexp returns an empty array. This prevents me from using, say,
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),mycell);
unless I call it with 'UniformOutput',false. The problem with that is twofold. First, I need to then manually find the rows where there was no match. For example, I can do this:
emptyout=cellfun(@(x) isempty(x),out);
emptyID=find(emptyout);
backfill=cell(length(emptyID),1);
[backfill{:}]=deal('Unknown');
out(emptyID)=backfill;
In this example, emptyID has a length of 1, so this code is overkill. But I believe this is the correct way to generalize for when it is longer. This code will replace every empty cell in out with the string Unknown. But this leads to the second problem. I've now got a 'messy' cell array of non-scalar values. I cannot, for example, check unique(out) as a result.
Pardon the long-windedness but I wanted to give a clear example of the problem. Now my actual question is in a few parts:
Is there a way to accomplish what I'm trying to do without using 'UniformOutput',false? For example, is there a way to have regexp pass a custom string if there is no match (e.g. pass 'Unknown' if there is no match)? I can think of one 'cheat', which would be to use the | operator in the expression, and if the first token is not matched, look for something that is ALWAYS found. I would then still need to double back through the output and change every instance of that result to 'Unknown'.
If I take the 'UniformOutput',false approach, how can I recover a scalar cell array at the end to easily manipulate it (e.g. pass it through unique)? I will admit I'm not 100% clear on scalar vs nonscalar cell arrays.
If there is some overall different approach that I'm not thinking of, I'm also open to it.
Tangential to the main question, I also tried using a single expression to run regexp using 3 tokens to pull out the values of thing1, thing2, and thing3 in one pass. This seems to require 'UniformOutput',false even when there are no empty results from regexp. I'm not sure how to get a scalar cell array using this approach (e.g. an Nx1 cell array where each cell is a 3x1 cell).
At the end of the day, I want to build a table using these results:
mytable=table(out1,out2,out3);
Edit: Using celldisp sheds some light on the problem:
celldisp(out)
out{1}{1} =
246
out{2} =
Unknown
out{3}{1} =
246
I assume that I need to change the structure of out so that the contents of out{1}{1} and out{3}{1} are instead just out{1} and out{3}. But I'm not sure how to accomplish this if I need 'UniformOutput',false.
Note: I've not used MATLAB and this doesn't answer the "efficient" aspect, but...
How about forcing there to always be a match?
Just thinking about you really wanting a match to skip this problem, how about an empty match?
On the MATLAB help page for regexp I can see an 'emptymatch' option; perhaps this is something to try.
E.g.
the_thing_i_want_to_find|
Match "the_thing_i_want_to_find" or an empty match, note the | character.
In capture group it might look like this:
(the_thing_i_want_to_find|)
As a workaround, I have found that regexprep can be used to find entries where thing3 is missing. For example:
replace='$1 ''thing3'': ''Unknown'', ''morestuff''';
missingexpr='(?<=thing2'':\s?)(''?-?[\w\d().]*?''?,) ''morestuff''';
regexprep(mycell{2},missingexpr,replace)
ans =
''thing1': '617', 'thing2': '239', 'thing3': 'Unknown', 'morestuff':, '''
Applying it to the entire array:
fixedcell=cellfun(@(x) regexprep(x,missingexpr,replace),mycell,'UniformOutput',false);
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),fixedcell,'UniformOutput',false);
This feels a little roundabout, but it works.
cellfun can be replaced with a plain old for loop. Your code will be equally fast, or maybe even faster. cellfun is implemented with a loop anyway, so there is no advantage to using it other than fewer lines of code. In your explicit loop, you can then check the output of regexp and build your output array any way you like.
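The explicit-loop idea can be sketched as follows. This is a minimal Python illustration rather than MATLAB, so the helper name extract, the columns dictionary, and the 'Unknown' default handling are assumptions made for the example. The pattern is a direct transliteration of expr1 to expr3 from the question, and the sample lines are the three proxy lines with the MATLAB quote doubling removed.

```python
import re

lines = [
    "'thing1': '-583', 'thing2': '245', 'thing3': '246', 'morestuff':, ''",
    "'thing1': '617', 'thing2': '239', 'morestuff':, ''",
    "'thing1': 'unexpected_string(with)parens5', 'thing2': 245, 'thing3':'246', 'morestuff':, ''",
]

def extract(line, name, default="Unknown"):
    """Return the value stored under `name`, or `default` when the name is absent."""
    # Same structure as the question's expressions: look behind for name',
    # then ':', optional space, optional quote, optional dash, lazy token.
    pattern = r"(?<=" + re.escape(name) + r"'):\s?'?-?([\w().]*?)'?,"
    match = re.search(pattern, line)
    return match.group(1) if match else default

# One flat list of strings per field; a missing field is filled in the loop itself.
columns = {name: [] for name in ("thing1", "thing2", "thing3")}
for line in lines:
    for name, values in columns.items():
        values.append(extract(line, name))

print(columns["thing3"])  # ['246', 'Unknown', '246']
```

Because the default is substituted inside the loop, every entry is a plain string and no second pass over empty results is needed.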

Regex-Match while ignoring a char from Searchword

I am using an engineering program which lets me write formulas in order to filter out specific lines in a database. I am trying to look for a certain line in the database which contains e.g. "concrete" as a property.
In the formulas I can use regular expressions.
The regex I was using so far looked like this:
".*(concrete).*";
So if the line in the database contains concrete, I will get the wanted result.
Now the problem is: I would like to replace the word concrete with a variable, so that it looks like this:
".*(#VARIABLE1).*";
(The syntax with the # works in the program, by the way.)
The problem is: if I set the variable to concrete, the program automatically substitutes 'concrete' for it. Obviously, the word concrete can't be found anymore, since the search term now contains the two ' symbols at the beginning and at the end.
Is there a way to ignore those two characters using the right regex?
What I want it to do is the following:
If a line in the database contains "25cm concrete in Grey"
I should get a match from the regex.
With the search term ".*(concrete).*"; it works; with the variable ".*(#VARIABLE1).*"; it doesn't.
EDIT:
The whole "formula" in the program looks like this:
if(Match(QTO(Typ:="Attribut{FloorsLayer_02_MaterialName}");".*(#V_QUALITY).*" ;"regex") ;QTO(Typ:="Attribut{Fläche}");0)
I want the if-condition to be true, when the match inside is true.
The whole QTO function is just the program's syntax for passing a certain attribute into the match function; the middle part is my problem. I really don't know the programming language or anything, I'm new to this. Hope it helps!
That's more of a hack than a real solution, and I'm not sure if it even works:
If you use the regex
.*(#VARIABLE1)?).*
and the string ?concrete(
this will result in a regex looking like this:
.*('?concrete(')?).*
which makes the additional characters optional.
This uses the following assumption:
the string (#VARIABLE1) gets replaced by ('<content of VARIABLE1>')
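The effect of the hack can be checked outside the program with a small simulation. The sketch below assumes, as the answer does, that the program replaces #VARIABLE1 with the variable's content wrapped in single quotes; the substitute helper is hypothetical, and Python's regex engine may differ from the one the program uses.

```python
import re

def substitute(template, value):
    # Assumption: the program inserts the value wrapped in single quotes.
    return template.replace("#VARIABLE1", "'" + value + "'")

plain = substitute(".*(#VARIABLE1).*", "concrete")
hacked = substitute(".*(#VARIABLE1)?).*", "?concrete(")

line = "25cm concrete in Grey"
print(plain, bool(re.fullmatch(plain, line)))    # .*('concrete').* False
print(hacked, bool(re.fullmatch(hacked, line)))  # .*('?concrete(')?).* True
```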

Starting position for replace function in db2

I'm converting some Access VBA functionality to DB2 and found a vital difference. VBA lets you specify the starting point in the character string you're working on. DB2 doesn't have that option. It starts from position 1 and replaces whatever you want to be replaced in the whole string. How can I make DB2 start the replace at a specified place in the string? For example, my string is "Incongruent Plastics Incorporated" and I want to replace the "Incorporated" that starts at position 22 (the second "Inc") with "Inc". I'm doing this in a WHILE loop, going through long strings, replacing parts of them until they are less than a specified maximum (15 or 30 depending on the field).
I looked at the Locate function, but I'm not sure that's right.
Replace(a.PAYEE_STD_NAME, B.FullWord, B.abbreviation, B.mLastWord)
Where a.PAYEE_STD_NAME is the string I'm looking at, B.FullWord is what I want to replace, B.abbreviation is what I want to replace it with, and B.mLastWord is the position where I want to start replacing. Something like Replace("Incongruent Plastics Incorporated","Incorporated","Inc",22)
I expect the characters to be replaced starting in the position I need, towards the back of the string, not in the beginning.
Thanks!
I'm not that good at DB2, but that limitation can generally be worked around by using SUBSTR.
The equivalent of Replace(a.PAYEE_STD_NAME, B.FullWord, B.abbreviation, B.mLastWord) would be:
CONCAT(SUBSTR(a.PAYEE_STD_NAME, 1, B.mLastWord - 1), Replace(SUBSTR(a.PAYEE_STD_NAME, b.mLastWord), B.FullWord, B.abbreviation))
This assumes b.mLastWord is greater than 1; if it's 1 you can use a normal REPLACE.
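The same split-then-replace logic, written as a short Python sketch to make the positions easy to check. The function name replace_from is made up for the example; start is 1-based, like the SUBSTR arguments above.

```python
def replace_from(text, old, new, start):
    """Replace `old` with `new` only at or after the 1-based position `start`."""
    head = text[: start - 1]   # left untouched, like SUBSTR(text, 1, start - 1)
    tail = text[start - 1 :]   # like SUBSTR(text, start)
    return head + tail.replace(old, new)

name = "Incongruent Plastics Incorporated"
print(replace_from(name, "Incorporated", "Inc", 22))  # Incongruent Plastics Inc
print(replace_from(name, "Inc", "X", 22))             # Incongruent Plastics Xorporated
```

The second call shows that the leading "Inc" of "Incongruent" is left alone because it lies before position 22.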
Maybe consider using REGEXP_REPLACE https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r0061496.html
and possibly consider recursive SQL rather than looping logic.

Matching multiple hex characters in a PGSQL Regex

I'm trying to find some very specific multibyte characters in PostgreSQL using regex. I know I have the option to write a long CASE WHEN, but I decided to check if there is a different way to find these.
My current Regex looks like this E'\xf0\x9f\x98\x83'
This works pretty well, except that I would need to find all from \xf0\x9f\x98\x80 to \xf0\x9f\x98\x99.
In JS I would just be able to write something like \xf0\x9f\x98[\x80-\x89] but for whatever reason this returns an error in PGSQL. Is there a shortcut like this, or am I doomed to writing 20 CASE WHEN-s?
I have realized my mistake. PGSQL Error was caused because I'm looking for 4 byte characters and I just wanted to mess with the last byte. I realized I'd have to write it like this: E'[\xf0\x9f\x98\x80-\xf0\x9f\x98\x90]'
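The reason both endpoints have to be complete four-byte sequences can be illustrated in Python, on the assumption that the regex engine works on whole characters rather than on individual bytes. The sketch decodes the byte sequences from the question into code points and builds a character class from them; the variable names are illustrative only.

```python
import re

# The four bytes from the question decode to a single code point.
smiley = b"\xf0\x9f\x98\x83".decode("utf-8")

# A character class ranges over whole code points, so both endpoints
# must be complete characters (here U+1F600 through U+1F610).
low = b"\xf0\x9f\x98\x80".decode("utf-8")
high = b"\xf0\x9f\x98\x90".decode("utf-8")
emoji_range = re.compile("[" + low + "-" + high + "]")

print(hex(ord(smiley)))                              # 0x1f603
print(bool(emoji_range.search("hello " + smiley)))   # True
print(bool(emoji_range.search("hello")))             # False
```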

Hbase RegexStringComparator Filter giving more rows than expected

I have a FilterList with several RegexStringComparator filters. I have an issue when the regex string is similar to .*15.0.0. This will pick up rows such as xxx15.0 which I am not interested in. I assume this is because xxx15.0 is effectively acting as xxx15.0.* for the matching. Is there any way around this in hbase?
Based on your comment, it looks like you need to specify how the string is to be terminated. You don't really provide enough information, so I'll give you your options and you can pick the one that fits your situation.
If the version string appears in another string, such as shockwave:15.0 installed or the like, what you really want is to say "match the string shockwave:15.0 that's NOT followed by a period". You can do that like this:
shockwave:15\.0[^.]
If the string appears at the end of a line, you can just specify the end-of-line anchor:
shockwave:15\.0$
If it could be either (in the middle of the line or at the end of it), you can combine the two:
shockwave:15\.0($|[^.])
That should cover all the cases....
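A quick way to check the combined pattern against the cases described is shown below. HBase's RegexStringComparator uses Java regular expressions, so this Python sketch is only an approximation of its behavior, and the sample strings are made up for the example.

```python
import re

# Combined form: "15.0" must be followed by end of input or a non-period.
pattern = re.compile(r"shockwave:15\.0($|[^.])")

samples = [
    "shockwave:15.0 installed",    # version in the middle of the string
    "shockwave:15.0",              # version at the end of the string
    "shockwave:15.0.0",            # longer version, should be rejected
    "shockwave:15.0.1 installed",  # longer version, should be rejected
]

results = {s: bool(pattern.search(s)) for s in samples}
for sample, matched in results.items():
    print(sample, matched)
```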