I have a regular expression which is matching correctly when parameters are in their reversed order but not when they are in the intended order:
^\s*Proc\s+[a-z_][0-9a-z_]+\s*\({1}\s*([0-9a-z_ ,.]+)\s* as (?:pin|bit|byte|word|dword|float|sbyte|sword|sdword|Long|slong|double|string\*[0-9]+,*)
matches this text just like I want to:
proc HMI_SendNumber(Value As Sword, Object As String*10)
But if I reverse the order of the parameters I am looking for...:
proc HMI_SendNumber(Object As String*10,Value As Sword)
...I only get a match on the first one, i.e. Object. It only occurs when String* is present, so I guess it has to do with the *10 element of it. Is there a way around this?
No, you don't get "two matches", you only get one of:
Value As Sword, Object As String
See how *10 is missing? That's because [0-9a-z_ ,.]+ does not allow * to match. Likewise, your other text only has one match:
Object As String
What do you really want? One match covering all parameters, or multiple matches, one for each parameter? Listing every type in the as (?:pin|bit|...) group is redundant, because the character class before it already matches those type names. Your whole regex can be reduced to:
^\s*Proc\s+[a-z_][0-9a-z_]+\s*\(\s*([0-9a-z_ ,.]+)\s*\)
if there were no String*10 data type. It can be fixed by including * in the character class:
^\s*Proc\s+[a-z_][0-9a-z_]+\s*\(\s*([0-9a-z_ ,.*]+)\s*\)
Beware that this still is only one match, not multiple matches. The match itself may have your desired multiple parameters.
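A minimal Python sketch of the fixed pattern, assuming case-insensitive matching (the sample text writes proc and As while the pattern writes Proc and as). The split_params helper is hypothetical; it only illustrates that the single match can be split into individual parameters afterwards.

```python
import re

# The answer's pattern with * added to the character class.
# re.IGNORECASE is an assumption based on the mixed case in the sample text.
PROC = re.compile(
    r"^\s*Proc\s+[a-z_][0-9a-z_]+\s*\(\s*([0-9a-z_ ,.*]+)\s*\)",
    re.IGNORECASE,
)

def split_params(line):
    """Return (name, type) pairs from a Proc declaration, or [] if it does not match."""
    m = PROC.match(line)
    if m is None:
        return []
    pairs = []
    # Group 1 holds the whole parameter list as one match; split it by comma.
    for param in m.group(1).split(","):
        # Each parameter is assumed to look like "<name> As <type>".
        name, _as, type_ = param.split()
        pairs.append((name, type_))
    return pairs

print(split_params("proc HMI_SendNumber(Object As String*10,Value As Sword)"))
```

Both parameter orders from the question give two pairs here, because the split happens after the single regex match rather than inside it.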
Also this has nothing to do with Delphi. It's slightly Visual Basic at best.
I am working with a messy manually maintained "database" that has a column containing a string with name,value pairs. I am trying to parse the entire column with regexp to pull out the values. The column is huge (>100,000 entries). As a proxy for my actual data, let's use this code:
line1={'''thing1'': ''-583'', ''thing2'': ''245'', ''thing3'': ''246'', ''morestuff'':, '''''};
line2={'''thing1'': ''617'', ''thing2'': ''239'', ''morestuff'':, '''''};
line3={'''thing1'': ''unexpected_string(with)parens5'', ''thing2'': 245, ''thing3'':''246'', ''morestuff'':, '''''};
mycell=vertcat(line1,line2,line3);
This captures the general issues encountered in the database. I want to extract what thing1, thing2, and thing3 are in each line using cellfun to output a scalar cell array. They should normally be 3 digit numbers, but sometimes they have an unexpected form. Sometimes thing3 is completely missing, without the name even showing up in the line. Sometimes there are minor formatting inconsistencies, like single quotes missing around the value, spaces missing, or dashes showing up in front of the three digit value. I have managed to handle all of these, except for the case where thing3 is completely missing.
My general approach has been to use expressions like this:
expr1='(?<=thing1''):\s?''?-?([\w\d().]*?)''?,';
expr2='(?<=thing2''):\s?''?-?([\w\d().]*?)''?,';
expr3='(?<=thing3''):\s?''?-?([\w\d().]*?)''?,';
This looks behind for thingX' and then tries to match : followed by zero or one spaces, followed by 0 or 1 single quote, followed by zero or one dash, followed by any combination of letters, numbers, parentheses, or periods (this is defined as the token), using a lazy match, until zero or one single quote is encountered, followed by a comma. I call regexp as regexp(___,'tokens','once') to return the matching token.
The problem is that when there is no match, regexp returns an empty array. This prevents me from using, say,
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),mycell);
unless I call it with 'UniformOutput',false. The problem with that is twofold. First, I need to then manually find the rows where there was no match. For example, I can do this:
emptyout=cellfun(@(x) isempty(x),out);
emptyID=find(emptyout);
backfill=cell(length(emptyID),1);
[backfill{:}]=deal('Unknown');
out(emptyID)=backfill;
In this example, emptyID has a length of 1, so this code is overkill. But I believe this is the correct way to generalize for when it is longer. This code replaces every empty cell in out with the string Unknown. But this leads to the second problem: I've now got a 'messy' cell array of non-scalar values. I cannot, for example, check unique(out) as a result.
Pardon the long-windedness but I wanted to give a clear example of the problem. Now my actual question is in a few parts:
Is there a way to accomplish what I'm trying to do without using 'UniformOutput',false? For example, is there a way to have regexp pass a custom string if there is no match (e.g. pass 'Unknown' if there is no match)? I can think of one 'cheat', which would be to use the | operator in the expression, and if the first token is not matched, look for something that is ALWAYS found. I would then still need to double back through the output and change every instance of that result to 'Unknown'.
If I take the 'UniformOutput',false approach, how can I recover a scalar cell array at the end to easily manipulate it (e.g. pass it through unique)? I will admit I'm not 100% clear on scalar vs nonscalar cell arrays.
If there is some overall different approach that I'm not thinking of, I'm also open to it.
Tangential to the main question, I also tried using a single expression to run regexp using 3 tokens to pull out the values of thing1, thing2, and thing3 in one pass. This seems to require 'UniformOutput',false even when there are no empty results from regexp. I'm not sure how to get a scalar cell array using this approach (e.g. an Nx1 cell array where each cell is a 3x1 cell).
At the end of the day, I want to build a table using these results:
mytable=table(out1,out2,out3);
Edit: Using celldisp sheds some light on the problem:
celldisp(out)
out{1}{1} =
246
out{2} =
Unknown
out{3}{1} =
246
I assume that I need to change the structure of out so that the contents of out{1}{1} and out{3}{1} are instead just out{1} and out{3}. But I'm not sure how to accomplish this if I need 'UniformOutput',false.
Note: I've not used MATLAB and this doesn't answer the "efficient" aspect, but...
How about forcing there to always be a match?
Just thinking about you really wanting a match to skip this problem, how about an empty match?
Looking at the MATLAB help page, I can see an 'emptymatch' option; perhaps this is something to try.
E.g.
the_thing_i_want_to_find|
Match "the_thing_i_want_to_find" or an empty match, note the | character.
In capture group it might look like this:
(the_thing_i_want_to_find|)
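A short Python sketch of this idea, with one caveat that applies when searching rather than matching at a fixed position: the empty alternative succeeds at the very first position, before the engine ever reaches the wanted text. The anchored variant below is an assumption of mine, not part of the suggestion above, and value_or_unknown is a hypothetical helper name.

```python
import re

# Naive form: at position 0 "thing3" fails, the empty alternative succeeds,
# so search() reports an empty match even if "thing3" occurs later.
naive = re.compile(r"(thing3|)")

# Anchored form (an assumption): the optional group scans forward for the
# key and is skipped entirely when the key is absent, so a match always exists.
anchored = re.compile(r"^(?:.*?'thing3':\s?'?(\w+))?")

def value_or_unknown(line):
    """Return the value of thing3, or 'Unknown' when the key is missing."""
    m = anchored.search(line)
    # group(1) is None when the optional group did not participate.
    return m.group(1) or "Unknown"

print(repr(naive.search("x thing3").group(1)))
print(value_or_unknown("'thing2': '239', 'thing3': '246', 'morestuff':, ''"))
print(value_or_unknown("'thing2': '239', 'morestuff':, ''"))
```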
As a workaround, I have found that regexprep can be used to fill in entries where thing3 is missing. For example:
replace='$1 ''thing3'': ''Unknown'', ''morestuff''';
missingexpr='(?<=thing2'':\s?)(''?-?[\w\d().]*?''?,) ''morestuff''';
regexprep(mycell{2},missingexpr,replace)
ans =
''thing1': '617', 'thing2': '239', 'thing3': 'Unknown', 'morestuff':, '''
Applying it to the entire array:
fixedcell=cellfun(@(x) regexprep(x,missingexpr,replace),mycell,'UniformOutput',false);
out=cellfun(@(x) regexp(x,expr3,'tokens','once'),fixedcell,'UniformOutput',false);
This feels a little roundabout, but it works.
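For comparison, the same backfilling idea as a Python sketch. Python's re module only accepts fixed-width lookbehinds, so this version swaps the lookbehind for a capture group that includes the thing2 key; that substitution is mine, and the sample string is the second proxy line with MATLAB's doubled quotes undone.

```python
import re

# Matches a thing2 entry that is immediately followed by 'morestuff',
# i.e. a line where the thing3 entry is missing.
MISSING = re.compile(r"('thing2':\s?'?-?[\w().]*?'?,) 'morestuff'")
REPLACEMENT = r"\1 'thing3': 'Unknown', 'morestuff'"

def backfill(line):
    """Insert a placeholder thing3 entry when the line has none."""
    return MISSING.sub(REPLACEMENT, line)

line2 = "'thing1': '617', 'thing2': '239', 'morestuff':, ''"
print(backfill(line2))
```

Lines that already contain thing3 are left unchanged, because thing2 is then not directly followed by 'morestuff'.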
cellfun can be replaced with a plain old for loop. Your code will either be equally fast, or maybe even faster. cellfun is implemented with a loop anyway, there is no advantage of using it other than fewer lines of code. In your explicit loop, you can then check the output of regexp, and build your output array any way you like.
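The explicit-loop approach can be sketched as follows, in Python rather than MATLAB, so treat it as an illustration of the control flow only. The extract helper and the columns dictionary are hypothetical names; the pattern is the one from the question, and the sample lines are the proxy data with MATLAB's doubled quotes undone.

```python
import re

lines = [
    "'thing1': '-583', 'thing2': '245', 'thing3': '246', 'morestuff':, ''",
    "'thing1': '617', 'thing2': '239', 'morestuff':, ''",
    "'thing1': 'unexpected_string(with)parens5', 'thing2': 245, 'thing3':'246', 'morestuff':, ''",
]

def extract(line, name, default="Unknown"):
    """Return the token for one key, or the default when there is no match."""
    pattern = r"(?<=" + re.escape(name) + r"'):\s?'?-?([\w().]*?)'?,"
    m = re.search(pattern, line)
    return m.group(1) if m else default

# Build one flat list of strings per key; the no-match case is handled
# inside the loop, so no second pass over the output is needed.
columns = {name: [] for name in ("thing1", "thing2", "thing3")}
for line in lines:
    for name in columns:
        columns[name].append(extract(line, name))

print(columns["thing3"])
```

Because every entry is a plain string, the resulting lists can be deduplicated or placed into a table directly.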
I am looking for a regular expression which allows only the time offset values.
I have used:
^(?:[+-](?:2[0-3]|[01][0-9]):[0-5][0-9])$
The ONLY strings I need to match:
-12:00
+14:00
-11:00
-10:00
-09:30
-09:00
-08:00
-07:00
-06:00
-05:00
-04:00
-03:30
-03:00
-02:00
-01:00
00:00
+01:00
+04:00
+03:30
+03:00
+02:00
+04:30
+05:00
+05:30
+05:45
+06:00
+06:30
+07:00
+08:00
+08:45
+09:00
+09:30
+10:00
+10:30
+11:00
+12:00
+12:45
+13:00
+14:00
Please check here for what I have tried so far, and the values I want it to allow.
It works fine for all the values except for 00:00.
Also, it allows some extra values such as -19:30 +23:00 22:30 21:00 which should not be allowed.
I want it to allow only those values which have been mentioned in my aforesaid link.
I was able to achieve the results you wanted by slightly tweaking your regex.
This is also short and concise.
^(?:(?:[+-](?:1[0-4]|0[1-9]):[0-5][0-9])|00:00)$
You can check the results and test it further here
Note that this pattern accepts any minute value for hours 01 to 14 with either sign, so it also passes values that are not in your list (for example -13:00 or +14:59). Reading the comments on the question, I feel it is better to have it this way, for future proofing in case the valid offsets change. (You would need to tweak the regex to allow hours greater than 14.)
If you strictly want to limit it to the values which you have listed, enumeration would be a better approach to go about it.
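A Python sketch contrasting the two options: the range-based pattern above against a plain enumeration of the values listed in the question. The names LOOSE, ALLOWED and is_listed_offset are hypothetical.

```python
import re

# The range-based pattern from this answer.
LOOSE = re.compile(r"^(?:(?:[+-](?:1[0-4]|0[1-9]):[0-5][0-9])|00:00)$")

# Enumeration of the offsets listed in the question (duplicates removed).
ALLOWED = frozenset(
    "-12:00 -11:00 -10:00 -09:30 -09:00 -08:00 -07:00 -06:00 -05:00 -04:00 "
    "-03:30 -03:00 -02:00 -01:00 00:00 +01:00 +02:00 +03:00 +03:30 +04:00 "
    "+04:30 +05:00 +05:30 +05:45 +06:00 +06:30 +07:00 +08:00 +08:45 +09:00 "
    "+09:30 +10:00 +10:30 +11:00 +12:00 +12:45 +13:00 +14:00".split()
)

def is_listed_offset(text):
    """Strict check: accept only the enumerated offsets."""
    return text in ALLOWED

for candidate in ("-13:00", "+14:59", "-19:30", "22:30", "+05:45"):
    print(candidate, bool(LOOSE.match(candidate)), is_listed_offset(candidate))
```

The set lookup is trivially easy to extend when an offset is added or retired, which is the main argument for enumeration here.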
To only match these strings of yours, you can use
^(?:\+(?:0(?:[12]:00|[34]:[03]0|5:(?:[03]0|45)|6:[03]0|7:00|8:(?:00|45)|9:[03]0)|1(?:0:[03]0|1:00|2:(?:00|45)|[34]:00))|-(?:0(?:[12]:00|3:[03]0|[4-8]:00|9:[03]0)|1[0-2]:00)|00:00)$
See the regex demo
Use online/external tools to build word list regexps like this (e.g. My Regex Tester, etc.).
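One way to gain confidence in a generated pattern like this is to run it against every candidate string. A Python sketch, assuming the candidate space is an optional sign followed by HH:MM with hours 00 to 23 and minutes 00 to 59:

```python
import re

STRICT = re.compile(
    r"^(?:\+(?:0(?:[12]:00|[34]:[03]0|5:(?:[03]0|45)|6:[03]0|7:00|8:(?:00|45)|9:[03]0)"
    r"|1(?:0:[03]0|1:00|2:(?:00|45)|[34]:00))"
    r"|-(?:0(?:[12]:00|3:[03]0|[4-8]:00|9:[03]0)|1[0-2]:00)"
    r"|00:00)$"
)

# Every "HH:MM", "+HH:MM" and "-HH:MM" string in the assumed candidate space.
candidates = [
    f"{sign}{hour:02d}:{minute:02d}"
    for sign in ("", "+", "-")
    for hour in range(24)
    for minute in range(60)
]

accepted = [c for c in candidates if STRICT.match(c)]
print(len(accepted))
print(accepted)
```

The printed list can then be compared by eye, or programmatically, with the list of offsets in the question.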
I am working in an environment without a JSON parser, so I am using regular expressions to parse some JSON. The value I'm looking to isolate may be either a string or an integer.
For instance
Entry1
{"Product_ID":455233, "Product_Name":"Entry One"}
Entry2
{"Product_ID":"455233-5", "Product_Name":"Entry One"}
I have been attempting to create a single regex pattern to extract the Product_ID whether it is a string or an integer.
I can successfully extract both results with separate patterns using look around with either (?<=Product_ID":")(.*?)(?=") or (?<=Product_ID":)(.*?)(?=,)
however since I don't know which one I will need ahead of time I would like a one size fits all.
I have tried to use [^"] in the pattern; however, I just can't seem to piece it together.
I expect to receive 455233-5 and 455233 but currently I receive "455233-5"
(?<="Product_ID"\s*:\s*"?)[^"]+(?="?\s*,)
, try it here.
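The pattern above relies on a variable-width lookbehind, which not every engine supports; Python's built-in re module, for instance, requires fixed-width lookbehinds. A sketch of an equivalent using a capture group instead, assuming Product_ID is always followed by a comma as in both sample entries (product_id is a hypothetical helper name):

```python
import re

# Optional quotes sit outside the capture group, so the group holds the
# bare value whether it was written as a string or as an integer.
PRODUCT_ID = re.compile(r'"Product_ID"\s*:\s*"?([^",]+)"?\s*,')

def product_id(entry):
    """Return the Product_ID value as text, or None if it is not found."""
    m = PRODUCT_ID.search(entry)
    return m.group(1) if m else None

print(product_id('{"Product_ID":455233, "Product_Name":"Entry One"}'))
print(product_id('{"Product_ID":"455233-5", "Product_Name":"Entry One"}'))
```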
I need some help building a regular expression for a string which may contain 1, 2, 3, or 4 fields. Each field has a format of: tag=value.
Below is a comprehensive list of all possible strings I can have. The code tag is a three-digit number:
type=buy&code=123&time=yes&save=yes
type=buy&code=123&time=yes&save=no
type=buy&code=123&time=no&save=yes
type=buy&code=123&time=no&save=no
type=buy&code=123&time=yes
type=buy&code=123&time=no
type=sell&code=123&time=yes&save=yes
type=sell&code=123&time=yes&save=no
type=sell&code=123&time=no&save=yes
type=sell&code=123&time=no&save=no
type=sell&code=123&time=yes
type=sell&code=123&time=no
type=long&code=123
type=short&code=123
type=fill&code=123
type=confirm&code=123
type=cancelall
type=resendall
So these are the possible values for the four tags:
type={buy|sell|long|short|fill|confirm|cancelall|resendall}
code=[[:digit:]]{3}
time={yes|no}
save={yes|no}
This is what I have right now:
value={buy|sell|long|short|fill|confirm|cancelall|resendall}&code=[[:digit:]]{3}&time={yes|no}&save={yes|no}
It is obviously not correct; I do not know how to make the number of fields variable.
I want to use regular expression to check if the string is in correct format from C++ code. I am already doing it by parsing the string and using multiple "if" statements which makes tens of lines of code and it is also error prone.
Thank you!
This regex will do it:
/^type=(?:(?:buy|sell)&code=\d{3}&time=(?:yes|no)(?:&save=(?:yes|no))?|(?:long|short|fill|confirm)&code=\d{3}|cancelall|resendall)$/
(using two anchors, an optional item and lots of alternations in non-capturing groups)
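A quick check of the pattern in Python (the question targets C++, so this only exercises the regex itself, written without the surrounding / delimiters). The list of valid strings is generated from the rules in the question; is_valid is a hypothetical helper name.

```python
import re

PATTERN = re.compile(
    r"^type=(?:(?:buy|sell)&code=\d{3}&time=(?:yes|no)(?:&save=(?:yes|no))?"
    r"|(?:long|short|fill|confirm)&code=\d{3}"
    r"|cancelall|resendall)$"
)

def is_valid(text):
    return PATTERN.match(text) is not None

# Rebuild the 18 valid strings from the question.
valid = []
for kind in ("buy", "sell"):
    for time in ("yes", "no"):
        base = f"type={kind}&code=123&time={time}"
        valid.append(base)
        valid.extend(f"{base}&save={save}" for save in ("yes", "no"))
valid.extend(f"type={kind}&code=123" for kind in ("long", "short", "fill", "confirm"))
valid.extend(["type=cancelall", "type=resendall"])

print(all(is_valid(s) for s in valid))
print(is_valid("type=long&code=123&time=yes"))
```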
I am already doing it by parsing the string and using multiple "if" statements
For checking rules, this might be the better alternative. You still might use regexes for tokenizing your string.
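The suggestion above (tokenize with a regex, then check the rules in code) could look roughly like this in Python. The check function, its rule layout and its error messages are hypothetical; the rules themselves are taken from the list of valid strings in the question.

```python
import re

# One field: lowercase tag, "=", lowercase letters or digits.
TOKEN = re.compile(r"([a-z]+)=([a-z0-9]+)")

def check(query):
    """Return an error message, or None if the string is valid."""
    fields = []
    for part in query.split("&"):
        m = TOKEN.fullmatch(part)
        if m is None:
            return f"malformed field: {part!r}"
        fields.append(m.groups())
    names = [name for name, _ in fields]
    values = dict(fields)
    kind = values.get("type")
    # Which fields are expected, and in which order, depends on the type.
    if kind in ("cancelall", "resendall"):
        expected = ["type"]
    elif kind in ("long", "short", "fill", "confirm"):
        expected = ["type", "code"]
    elif kind in ("buy", "sell"):
        expected = ["type", "code", "time"]
        if "save" in values:
            expected.append("save")
    else:
        return f"unknown type: {kind!r}"
    if names != expected:
        return f"expected fields {expected}, got {names}"
    if "code" in values and not re.fullmatch(r"[0-9]{3}", values["code"]):
        return "code must be three digits"
    for key in ("time", "save"):
        if key in values and values[key] not in ("yes", "no"):
            return f"{key} must be yes or no"
    return None

print(check("type=buy&code=123&time=yes&save=no"))
print(check("type=long&code=123&time=yes"))
```

Compared with the single regex, this is longer, but each failure comes with a reason.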
You also might want to have a look at a parser generator, since you already seem to have a grammar available. The generator will yield parser code from that, which can be called to check the validity of your inputs and will return helpful error messages.