Inconsistent behaviour of my Regular Expression

I have a regular expression which is matching correctly when parameters are in their reversed order but not when they are in the intended order:
^\s*Proc\s+[a-z_][0-9a-z_]+\s*\({1}\s*([0-9a-z_ ,.]+)\s* as (?:pin|bit|byte|word|dword|float|sbyte|sword|sdword|Long|slong|double|string\*[0-9]+,*)
matches this text just like I want to:
proc HMI_SendNumber(Value As Sword, Object As String*10)
But if I reverse the order of the parameters I am looking for...:
proc HMI_SendNumber(Object As String*10,Value As Sword)
...I only get a match on the first one, i.e. Object. It only occurs when String* is present, so I guess it has to do with the *10 element of it. Is there a way around this?

No, you don't get "two matches", you only get one of:
Value As Sword, Object As String
See how *10 is missing? That's because [0-9a-z_ ,.]+ does not allow * to match. Likewise, your other text only has one match of:
Object As String
What do you really want? One match covering all parameters, or multiple matches, one for each parameter? Listing all the types in as (1|2|3...) is redundant, because they are already matched by your initial character class. Your whole regex can be reduced to:
^\s*Proc\s+[a-z_][0-9a-z_]+\s*\(\s*([0-9a-z_ ,.]+)\s*\)
if there were no String*10 data type. That can be fixed by including * in the character class, as in:
^\s*Proc\s+[a-z_][0-9a-z_]+\s*\(\s*([0-9a-z_ ,.*]+)\s*\)
Beware that this still is only one match, not multiple matches. The match itself may have your desired multiple parameters.
Also this has nothing to do with Delphi. It's slightly Visual Basic at best.
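To illustrate the "one match, several parameters" point, here is a minimal Python sketch (not the asker's regex engine). It assumes case-insensitive matching and that parameters are comma-separated "Name As Type" pairs; the helper name parse_params is made up for this example.

```python
import re

# The fixed pattern from the answer: '*' is now allowed inside the parameter list.
PROC_RE = re.compile(
    r"^\s*Proc\s+[a-z_][0-9a-z_]+\s*\(\s*([0-9a-z_ ,.*]+)\s*\)",
    re.IGNORECASE,
)
# Assumption: each parameter looks like "Name As Type".
PARAM_RE = re.compile(r"^\s*(\w+)\s+as\s+(\S+)\s*$", re.IGNORECASE)


def parse_params(line):
    """Return a list of (name, type) pairs, or [] if the line is not a Proc."""
    match = PROC_RE.match(line)
    if match is None:
        return []
    params = []
    # One regex match captures the whole list; splitting yields each parameter.
    for part in match.group(1).split(","):
        param = PARAM_RE.match(part)
        if param:
            params.append((param.group(1), param.group(2)))
    return params


print(parse_params("proc HMI_SendNumber(Value As Sword, Object As String*10)"))
print(parse_params("proc HMI_SendNumber(Object As String*10,Value As Sword)"))
```

Both parameter orders give the same two pairs, because the order no longer matters once the whole list is captured and split afterwards.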

Related

Matlab: What's the most efficient approach to parse a large table or cell array with regexp when sometimes there is no match?

I am working with a messy manually maintained "database" that has a column containing a string with name,value pairs. I am trying to parse the entire column with regexp to pull out the values. The column is huge (>100,000 entries). As a proxy for my actual data, let's use this code:
line1={'''thing1'': ''-583'', ''thing2'': ''245'', ''thing3'': ''246'', ''morestuff'':, '''''};
line2={'''thing1'': ''617'', ''thing2'': ''239'', ''morestuff'':, '''''};
line3={'''thing1'': ''unexpected_string(with)parens5'', ''thing2'': 245, ''thing3'':''246'', ''morestuff'':, '''''};
mycell=vertcat(line1,line2,line3);
This captures the general issues encountered in the database. I want to extract what thing1, thing2, and thing3 are in each line using cellfun to output a scalar cell array. They should normally be 3 digit numbers, but sometimes they have an unexpected form. Sometimes thing3 is completely missing, without the name even showing up in the line. Sometimes there are minor formatting inconsistencies, like single quotes missing around the value, spaces missing, or dashes showing up in front of the three digit value. I have managed to handle all of these, except for the case where thing3 is completely missing.
My general approach has been to use expressions like this:
expr1='(?<=thing1''):\s?''?-?([\w\d().]*?)''?,';
expr2='(?<=thing2''):\s?''?-?([\w\d().]*?)''?,';
expr3='(?<=thing3''):\s?''?-?([\w\d().]*?)''?,';
This looks behind for thingX' and then tries to match : followed by zero or one spaces, followed by 0 or 1 single quote, followed by zero or one dash, followed by any combination of letters, numbers, parentheses, or periods (this is defined as the token), using a lazy match, until zero or one single quote is encountered, followed by a comma. I call regexp as regexp(___,'tokens','once') to return the matching token.
The problem is that when there is no match, regexp returns an empty array. This prevents me from using, say,
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),mycell);
unless I call it with 'UniformOutput',false. The problem with that is twofold. First, I need to then manually find the rows where there was no match. For example, I can do this:
emptyout=cellfun(@(x) isempty(x),out);
emptyID=find(emptyout);
backfill=cell(length(emptyID),1);
[backfill{:}]=deal('Unknown');
out(emptyID)=backfill;
In this example, emptyID has a length of 1, so this code is overkill, but I believe this is the correct way to generalize for when it is longer. This code replaces every empty cell in out with the string Unknown. But this leads to the second problem: I've now got a 'messy' cell array of non-scalar values. I cannot, for example, call unique(out) on the result.
Pardon the long-windedness but I wanted to give a clear example of the problem. Now my actual question is in a few parts:
Is there a way to accomplish what I'm trying to do without using 'UniformOutput',false? For example, is there a way to have regexp pass a custom string if there is no match (e.g. pass 'Unknown' if there is no match)? I can think of one 'cheat', which would be to use the | operator in the expression, and if the first token is not matched, look for something that is ALWAYS found. I would then still need to double back through the output and change every instance of that result to 'Unknown'.
If I take the 'UniformOutput',false approach, how can I recover a scalar cell array at the end to easily manipulate it (e.g. pass it through unique)? I will admit I'm not 100% clear on scalar vs nonscalar cell arrays.
If there is some overall different approach that I'm not thinking of, I'm also open to it.
Tangential to the main question, I also tried using a single expression to run regexp using 3 tokens to pull out the values of thing1, thing2, and thing3 in one pass. This seems to require 'UniformOutput',false even when there are no empty results from regexp. I'm not sure how to get a scalar cell array using this approach (e.g. an Nx1 cell array where each cell is a 3x1 cell).
At the end of the day, I want to build a table using these results:
mytable=table(out1,out2,out3);
Edit: Using celldisp sheds some light on the problem:
celldisp(out)
out{1}{1} =
246
out{2} =
Unknown
out{3}{1} =
246
I assume that I need to change the structure of out so that the contents of out{1}{1} and out{3}{1} are instead just out{1} and out{3}. But I'm not sure how to accomplish this if I need 'UniformOutput',false.

Note: I've not used MATLAB and this doesn't answer the "efficient" aspect, but...
How about forcing there to always be a match?
Just thinking about you really wanting a match to skip this problem, how about an empty match?
Looking at the MATLAB help page, I can see an 'emptymatch' option; perhaps this is something to try.
E.g.
the_thing_i_want_to_find|
Match "the_thing_i_want_to_find" or an empty match, note the | character.
In capture group it might look like this:
(the_thing_i_want_to_find|)

As a workaround, I have found that regexprep can be used to find and fill in entries where thing3 is missing. For example:
replace='$1 ''thing3'': ''Unknown'', ''morestuff''';
missingexpr='(?<=thing2'':\s?)(''?-?[\w\d().]*?''?,) ''morestuff''';
regexprep(mycell{2},missingexpr,replace)
ans =
''thing1': '617', 'thing2': '239', 'thing3': 'Unknown', 'morestuff':, '''
Applying it to the entire array:
fixedcell=cellfun(@(x) regexprep(x,missingexpr,replace),mycell);
out=cellfun(#(x) regexp(x,expr3,'tokens','once'),fixedcell,'UniformOutput',false);
This feels a little roundabout, but it works.

cellfun can be replaced with a plain old for loop. Your code will be equally fast, or maybe even faster. cellfun is implemented with a loop anyway; there is no advantage to using it other than fewer lines of code. In your explicit loop, you can then check the output of regexp and build your output array any way you like.
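The explicit-loop idea can be sketched as follows. This is Python rather than MATLAB, so it only illustrates the control flow: search each line, and fall back to a default when there is no match. The function name extract and the 'Unknown' default are assumptions for this example; the pattern is a direct port of the question's expressions.

```python
import re

lines = [
    "'thing1': '-583', 'thing2': '245', 'thing3': '246', 'morestuff':, ''",
    "'thing1': '617', 'thing2': '239', 'morestuff':, ''",
    "'thing1': 'unexpected_string(with)parens5', 'thing2': 245, 'thing3':'246', 'morestuff':, ''",
]


def extract(line, name, default="Unknown"):
    """Return the value following name, or default when name is absent."""
    pattern = r"(?<=" + re.escape(name) + r"'):\s?'?-?([\w().]*?)'?,"
    match = re.search(pattern, line)
    # The no-match case is handled here, so the output is always a plain string.
    return match.group(1) if match else default


out3 = []
for line in lines:  # explicit loop instead of cellfun
    out3.append(extract(line, "thing3"))

print(out3)  # ['246', 'Unknown', '246']
```

Because every element is a plain string, the result can be deduplicated or placed in a table column without a clean-up pass.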

Jax-RS overloading methods/paths order of execution

I am writing an API for my app, and I am confused about how Jax-RS deals with certain scenarios
For example, I define two paths:
@Path("user/{name : [a-zA-Z]+}")
and
@Path("user/me")
The first path that I specified clearly encompasses the second path since the regular expression includes all letters a-z. However, the program doesn't seem to have an issue with this. Is it because it defaults to the most specific path (i.e. /me and then looks for the regular expression)?
Furthermore, what happens if I define two regular expressions as the path with some overlap. Is there a default method which will be called?
Say I want to create three paths for three different methods:
@Path("user/{name : [a-zA-Z]+}")
@Path("user/{id : \\d+}")
@Path("user/me")
Is this best practice/appropriate? How will it know which method to call?
Thank you in advance for any clarification.

This is in the spec in "Matching Requests to Resource Methods"
Sort E using (1) the number of literal characters in each member as the primary key (descending order), (2) the number of capturing groups as a secondary key (descending order), (3) the number of capturing groups with non-default regular expressions (i.e. not ‘([^ /]+?)’) as the tertiary key (descending order), ...
What happens is that the candidate methods are sorted by the ordered "keys" listed in that passage.
The first sort key is the number of literal characters. So for these three
@Path("user/{name : [a-zA-Z]+}")
@Path("user/{id : \\d+}")
@Path("user/me")
if the requested URI is ../user/me, the last one will always be chosen, as it has the most literal characters (7, / counts). The others only have 5.
Aside from ../user/me, anything else under ../user/.. will depend on the regex. In your case, one matches only numbers and one matches only letters. There is no way for these two regexes to overlap, so the request will be matched accordingly.
Now just for fun, let's say we have
@Path("user/{name : .*}")
@Path("user/{id : \\d+}")
@Path("user/me")
If you look at the top two, we now have overlapping regexes. The first will match all numbers, as will the second one. So which one will be used? We can't make any assumptions. This is a level of ambiguity not specified and I've seen different behavior from different implementations. AFAIK, there is no concept of a "best matching" regex. Either it matches or it doesn't.
But what if we wanted {id : \\d+} to always be checked first? If it matches numbers, then that should be selected. We can hack it based on the specification. The spec talks about "capturing groups", which are basically the {..}s. The second sorting key is the number of capturing groups. The way we could hack it is to add another "optional" group:
@Path("user/{name : .*}")
@Path("user/{id : \\d+}{dummy: (/)?}")
Now the latter has more capturing groups, so it will always be ahead in the sort. All it does is allow an optional /, which doesn't really affect the API, but ensures that if the request URI is all numbers, this path will always be chosen.
You can see a discussion with some test cases in this answer
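The three sort keys quoted above can be modelled with a rough Python sketch. This is a simplified illustration, not how any JAX-RS implementation is written: it assumes template parameters have the form {name} or {name : regex}, that the regex itself contains no braces, and it ignores everything else in the matching algorithm. The names sort_key and rank are made up for this example.

```python
import re

# Assumption: a template parameter is "{name}" or "{name : regex}" without nested braces.
PARAM = re.compile(r"\{\s*\w+\s*(?::\s*([^{}]*?))?\s*\}")


def sort_key(template):
    """(literal characters, capturing groups, groups with a custom regex)."""
    regexes = PARAM.findall(template)   # '' for a parameter with the default regex
    literals = PARAM.sub("", template)  # what remains once the parameters are removed
    custom = [r for r in regexes if r]
    return (len(literals), len(regexes), len(custom))


def rank(templates):
    """Order candidates the way the quoted sort keys describe (all descending)."""
    return sorted(templates, key=sort_key, reverse=True)


paths = ["user/{name : [a-zA-Z]+}", r"user/{id : \d+}", "user/me"]
print([(p, sort_key(p)) for p in rank(paths)])

hacked = ["user/{name : .*}", r"user/{id : \d+}{dummy: (/)?}"]
print(rank(hacked)[0])
```

In this model "user/me" sorts first with 7 literal characters against 5, and the template with the extra dummy group sorts ahead of the catch-all because it has two capturing groups instead of one.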

How do I find strings that only differ by their diacritics?

I'm comparing three lexical resources. I use entries from one of them to create queries — see first column — and see if the other two lexicons return the right answers. All wrong answers are written to a text file. Here's a sample out of 3000 lines:
réincarcérer<IND><FUT><REL><SG><1> réincarcèrerais réincarcérerais réincarcérerais
réinsérer<IND><FUT><ABS><PL><1> réinsèrerons réinsérerons réinsérerons
macérer<IND><FUT><ABS><PL><3> macèreront macéreront macéreront
répéter<IND><FUT><ABS><PL><1> répèterons répéterons répéterons
The first column is the query, the second is the reference. The third and fourth columns are the results returned by the lexicons. The values are tab-separated.
I'm trying to identify answers that only differ from the reference by their diacritics. That is, répèterons répéterons should match because the only difference between the two is that the second part has an acute accent on the e rather than a grave accent.
I'd like to match the entire line. I'd be grateful for a regex that would also identify answers that differ by their gemination — the following two lines should match because martellerait has two ls while martèlerait only has one.
modeler<IND><FUT><ABS><SG><2> modelleras modèleras modèleras
marteler<IND><FUT><REL><SG><3> martellerait martèlerait martèlerait
The last two values will always be identical. You can focus on values #2 and 3.

The first part can be achieved by doing a lossy conversion to ASCII and then doing a direct string comparison. Note, converting to ASCII effectively removes the diacritics.
To do the second part is not possible (as far as I know) with a regex pattern. You will need to do some research into things like the Levenshtein distance.
EDIT:
This regex will match duplicate consonants. It might be helpful for your gemination problem.
([b-df-hj-np-tv-xz])\\1+
Which means:
([b-df-hj-np-tv-xz]) # Match only consonants
\\1+ # Match one or more times again what was captured in the first capture group
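Both ideas can be combined in a short Python sketch. It assumes that "removing diacritics" means dropping Unicode combining marks after NFD decomposition, and that gemination can be neutralised by collapsing doubled consonants with the regex above; the function names are made up for this example.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collapse_geminates(text):
    """Reduce runs of the same consonant to a single letter."""
    return re.sub(r"([b-df-hj-np-tv-xz])\1+", r"\1", text)


def differs_only_by_diacritics(a, b):
    return a != b and strip_diacritics(a) == strip_diacritics(b)


def differs_only_by_diacritics_or_gemination(a, b):
    def simplify(text):
        return collapse_geminates(strip_diacritics(text))
    return a != b and simplify(a) == simplify(b)


line = "répéter<IND><FUT><ABS><PL><1>\trépèterons\trépéterons\trépéterons"
query, reference, answer, _ = line.split("\t")
print(differs_only_by_diacritics(reference, answer))
print(differs_only_by_diacritics_or_gemination("martellerait", "martèlerait"))
```

Splitting each line on tabs and comparing columns 2 and 3 sidesteps the need for a single whole-line regex.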

Regex to Find/replace argument pattern in a function-call across all files

I have a large codebase, where we need to make a pattern-change in the argument of a specific function.
i.e. all arguments to a function foo() that have the format something.anotherThing are to be renamed to something_anotherThing
The arguments can be anything but will always be in a str1.str2 format. It is to be done for arguments of this one function only, all other code should remain untouched.
e.g.
foo(a.x) --> foo(a_x)
foo(a4.b6) --> foo(a4_b6)
Is there any way I can achieve this using a regular expression or a tool, where I can do it in one step for all the files, for one specific function?

If the function had only one argument, it would be easy:
Use a tool that is able to search and replace in multiple files, e.g. TextCrawler.
Then select the regular expression tab and fill in:
RegExp:
(foo\([^)]+)(\.)([^)]+\))
Replace:
$1_$3
This will not handle every argument in a single pass if the function has more than one argument. But you can click the "Replace" button again and again until it says that no result was found. You will have to do it at most n times, where n is the maximum number of arguments in any call.
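A scripted alternative can handle several arguments in one pass by using a replacement function. The following Python sketch is only an illustration: it assumes calls to foo contain no nested parentheses and that every argument has the str1.str2 form described in the question; rename_foo_args is a made-up helper name, and applying it to each file is left out.

```python
import re

# Assumption: the argument list of foo(...) contains no nested parentheses.
CALL = re.compile(r"\bfoo\(([^()]*)\)")
DOTTED = re.compile(r"\b(\w+)\.(\w+)\b")


def rename_foo_args(source):
    """Rewrite str1.str2 as str1_str2, but only inside calls to foo()."""
    def fix_call(match):
        args = DOTTED.sub(r"\1_\2", match.group(1))
        return "foo(" + args + ")"
    return CALL.sub(fix_call, source)


print(rename_foo_args("foo(a.x)"))                   # foo(a_x)
print(rename_foo_args("foo(a4.b6, c.d); bar(e.f)"))  # foo(a4_b6, c_d); bar(e.f)
```

The outer pattern limits the change to foo, and the inner substitution runs over the whole argument list, so no repeated passes are needed.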

regex matching multiple values when they might not exist

I am trying to write a preg_match_all to match horse race distances.
My source lists races as:
xmxfxy
I want to match the m value, the f value, the y value. However different races will maybe only have m, or f, or y, or two of them or even all three.
// e.g. $raw = 5f213y;
preg_match_all('/(\d{1,})m|(\d{1,})f|(\d{1,})y/', $raw, $distance);
The above sort of works, but for some reason the matches appear in unpredictable positions in the returned array. I guess it is because it is running the match 3 times, once for each OR. How do I match all three (that may or may not exist) in a single run?
EDIT
A full sample string is:
Hardings Catering Services Handicap (Div I) Cl6 5f213y

If I understand you correctly, you're processing listings (like the one in your question) one at a time. If that's the case, you should be using preg_match, not preg_match_all, and the regex should match the whole "distance" code, not individual components of it. Try this:
preg_match('#\b(?:(?<M>\d+)m|(?<F>\d+)f|(?<Y>\d+)y){1,3}\b#',
$raw, $distance);
The results are now stored in a one-dimensional array, but you don't need to worry about the group numbers anyway; you can access them by name instead (e.g., $distance['M'], $distance['F'], $distance['Y']).
Note that, while this regex matches codes with one, two, or three components, it doesn't require the letters to be unique. There's nothing to stop it from matching something like 1m2m3m (a weakness shared by your own approach, by the way).
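The same "match the whole distance code, then read the parts by name" idea can be sketched in Python rather than PHP. This is an illustration under assumptions: the distance code is a standalone word made of one to three number-plus-unit parts, and parse_distance is a made-up helper name.

```python
import re

# The whole code, e.g. "5f213y": one to three number+unit components.
DISTANCE = re.compile(r"\b(?:\d+[mfy]){1,3}\b")
PART = re.compile(r"(\d+)([mfy])")


def parse_distance(raw):
    """Return a dict keyed by unit; units that are absent stay None."""
    result = {"m": None, "f": None, "y": None}
    code = DISTANCE.search(raw)
    if code:
        for number, unit in PART.findall(code.group(0)):
            result[unit] = int(number)
    return result


print(parse_distance("Hardings Catering Services Handicap (Div I) Cl6 5f213y"))
# {'m': None, 'f': 5, 'y': 213}
```

Keying the result by unit means a missing component never shifts the position of the others. As with the regex above, nothing prevents a repeated unit such as 1m2m3m; in this sketch the last value would win.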

You can use "?" to make each group optional:
preg_match_all('/((\d{1,})m)?|((\d{1,})f)?|((\d{1,})y)?/', $raw, $distance);

If I understand what you're asking correctly, you would like to get each number from these values separately? This works for me:
$input = "Hardings Catering Services Handicap (Div I) Cl6 5f213y";
preg_match_all('/((\d+)(m|f|y))/', $input, $matches);
After the preg_match_all() executes, $matches[2] holds an array of the numbers that matched (in this case, $matches[2][0] is 5 and $matches[2][1] is 213).
If all three values exist, m will be in $matches[2][0], f in $matches[2][1], and y in $matches[2][2]. If any values are missing, the next value gets bumped up a spot. It may also come in handy that $matches[3] will hold an array of the corresponding letter matched on, so if you need to check whether it was an m, f, or y, you can.
If this isn't what you're after, please provide an example of the output you would like to see for this or another sample input.
