Chemical formula parser C++

I am currently working on a program that can parse a chemical formula and return molecular weight and percent composition. The following code works very well with compounds such as H2O, LiOH, CaCO3, and even C12H22O11. However, it is not capable of understanding compounds with polyatomic ions that lie within parentheses, such as (NH4)2SO4.
I am not necessarily looking for someone to write the program for me, just a few tips on how I might accomplish such a task.
Currently, the program iterates through the inputted string, raw_molecule, first finding each element's atomic number to store in a vector (I use a map<string, int> to store names and atomic numbers). It then finds the quantities of each element.
bool Compound::parseString() {
    map<string,int>::const_iterator search;
    string s_temp;
    int i_temp;
    for (int i=0; i<=raw_molecule.length(); i++) {
        if ((isupper(raw_molecule[i]))&&(i==0))
            s_temp=raw_molecule[i];
        else if(isupper(raw_molecule[i])&&(i!=0)) {
            // New element- so, convert s_temp to atomic # then store in v_Elements
            search=ATOMIC_NUMBER.find (s_temp);
            if (search==ATOMIC_NUMBER.end())
                return false;// There is a problem
            else
                v_Elements.push_back(search->second); // Add atomic number into vector
            s_temp=raw_molecule[i]; // Replace temp with the new element
        }
        else if(islower(raw_molecule[i]))
            s_temp+=raw_molecule[i]; // E.g. N+=a which means temp=="Na"
        else
            continue; // It is a number/parentheses or something
    }
    // Whatever's in temp must be converted to atomic number and stored in vector
    search=ATOMIC_NUMBER.find (s_temp);
    if (search==ATOMIC_NUMBER.end())
        return false;// There is a problem
    else
        v_Elements.push_back(search->second); // Add atomic number into vector
    // --- Find quantities next --- //
    for (int i=0; i<=raw_molecule.length(); i++) {
        if (isdigit(raw_molecule[i])) {
            if (toInt(raw_molecule[i])==0)
                return false;
            else if (isdigit(raw_molecule[i+1])) {
                if (isdigit(raw_molecule[i+2])) {
                    i_temp=(toInt(raw_molecule[i])*100)+(toInt(raw_molecule[i+1])*10)+toInt(raw_molecule[i+2]);
                    v_Quantities.push_back(i_temp);
                }
                else {
                    i_temp=(toInt(raw_molecule[i])*10)+toInt(raw_molecule[i+1]);
                    v_Quantities.push_back(i_temp);
                }
            }
            else if(!isdigit(raw_molecule[i-1])) { // Look back to make sure the digit is not part of a larger number
                v_Quantities.push_back(toInt(raw_molecule[i])); // This will not work for polyatomic ions
            }
        }
        else if(i<(raw_molecule.length()-1)) {
            if (isupper(raw_molecule[i+1])) {
                v_Quantities.push_back(1);
            }
        }
        // If there is no number, there is only 1 atom. Between O and N for example: O is upper, N is upper, O has 1.
        else if(i==(raw_molecule.length()-1)) {
            if (isalpha(raw_molecule[i]))
                v_Quantities.push_back(1);
        }
    }
    return true;
}
This is my first post, so if I have included too little (or maybe too much) information, please forgive me.

While you might be able to do an ad-hoc scanner-like thing that can handle one level of parens, the canonical technique used for things like this is to write a real parser.
And there are two common ways to do that:
1. Recursive descent
2. Machine-generated bottom-up parser based on a grammar-specification file.
(And technically, there is a third category, PEG, that is machine-generated-top-down.)
Anyway, for case 1, you need to code a recursive call to your parser when you see a ( and then return from this level of recursion on the ) token.
Typically a tree-like internal representation, called a syntax tree, is created. In your case you can probably skip that and just return the atomic weight from the recursive call, adding it to the running total of the level that made the call.
For case 2, you need to use a tool like yacc to turn a grammar into a parser.
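For case 1, a minimal sketch might look like the following. It assumes a tiny, hypothetical weight table (kAtomicWeight) and hypothetical function names (readCount, parseWeight, molecularWeight); none of them come from the question's code. Each ( triggers one recursive call, and the weight it returns is multiplied by whatever count follows the matching ).

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical, deliberately tiny weight table; a real program needs every element.
const std::map<std::string, double> kAtomicWeight = {
    {"H", 1.008}, {"C", 12.011}, {"N", 14.007}, {"O", 15.999}, {"S", 32.06}};

// Reads an optional run of digits at pos; a missing count means 1.
int readCount(const std::string& s, std::size_t& pos) {
    if (pos >= s.size() || !std::isdigit(static_cast<unsigned char>(s[pos]))) return 1;
    int n = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])))
        n = n * 10 + (s[pos++] - '0');
    return n;
}

// Returns the weight of everything from pos up to the end of input or an
// unconsumed ')'. Each '(' triggers one recursive call.
double parseWeight(const std::string& s, std::size_t& pos) {
    double total = 0.0;
    while (pos < s.size() && s[pos] != ')') {
        double unit = 0.0;
        if (s[pos] == '(') {
            ++pos;                       // consume '('
            unit = parseWeight(s, pos);  // weight of the inner group
            if (pos >= s.size()) throw std::runtime_error("missing ')'");
            ++pos;                       // consume ')'
        } else if (std::isupper(static_cast<unsigned char>(s[pos]))) {
            std::string symbol(1, s[pos++]);
            while (pos < s.size() && std::islower(static_cast<unsigned char>(s[pos])))
                symbol += s[pos++];
            auto it = kAtomicWeight.find(symbol);
            if (it == kAtomicWeight.end())
                throw std::runtime_error("unknown element: " + symbol);
            unit = it->second;
        } else {
            throw std::runtime_error("unexpected character");
        }
        total += unit * readCount(s, pos);  // the count applies to the atom or group
    }
    return total;
}

// Top-level entry point: a ')' left over here has no matching '('.
double molecularWeight(const std::string& formula) {
    std::size_t pos = 0;
    double weight = parseWeight(formula, pos);
    if (pos != formula.size()) throw std::runtime_error("unmatched ')'");
    return weight;
}
```

Calling molecularWeight("(NH4)2SO4") walks into the parenthesised group once, doubles its weight, and then adds the S and O4 terms at the outer level.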

Your parser understands certain things. It knows that when it sees N, this means "Atom of Nitrogen type". When it sees O, it means "Atom of Oxygen type".
This is very similar to the concept of identifiers in C++. When the compiler sees int someNumber = 5;, it says, "there exists a variable named someNumber of int type, into which the number 5 is stored". If you later use the name someNumber, it knows that you're talking about that someNumber (as long as you're in the right scope).
Back to your atomic parser. When your parser sees an atom followed by a number, it knows to apply that number to that atom. So O2 means "2 Atoms of Oxygen type". N2 means "2 Atoms of Nitrogen type."
This means something for your parser. It means that seeing an atom isn't sufficient. It's a good start, but it is not sufficient to know how many of that atom exists in the molecule. It needs to read the next thing. So if it sees O followed by N, it knows that the O means "1 Atom of Oxygen type". If it sees O followed by nothing (the end of the input), then it again means "1 Atom of Oxygen type".
That's what you have currently. But it's wrong. Because numbers don't always modify atoms; sometimes, they modify groups of atoms. As in (NH4)2SO4.
So now, you need to change how your parser works. When it sees O, it needs to know that this is not "Atom of Oxygen type". It is a "Group containing Oxygen". O2 is "2 Groups containing Oxygen".
A group can contain one or more atoms. So when you see (, you know that you're creating a group. Therefore, when you see (...)3, you see "3 Groups containing ...".
So, what is (NH4)2? It is "2 Groups containing [1 Group containing Nitrogen followed by 4 Groups containing Hydrogen]".
The key to doing this is understanding what I just wrote. Groups can contain other groups. There is nesting in groups. How do you implement nesting?
Well, your parser looks something like this currently:
NumericAtom ParseAtom(input)
{
    Atom = ReadAtom(input); //Gets the atom and removes it from the current input.
    if(IsNumber(input)) //Returns true if the input is looking at a number.
    {
        int Count = ReadNumber(input); //Gets the number and removes it from the current input.
        return NumericAtom(Atom, Count);
    }
    return NumericAtom(Atom, 1);
}
vector<NumericAtom> Parse(input)
{
    vector<NumericAtom> molecule;
    while(IsAtom(input))
        molecule.push_back(ParseAtom(input));
    return molecule;
}
Your code calls ParseAtom() until the input runs dry, storing each atom+count in an array. Obviously you have some error-checking in there, but let's ignore that for now.
What you need to do is stop parsing atoms. You need to parse groups, which are either a single atom, or a group of atoms denoted by () pairs.
Group ParseGroup(input)
{
    Group myGroup; //Empty group
    if(IsLeftParen(input)) //Are we looking at a `(` character?
    {
        EatLeftParen(input); //Removes the `(` from the input.
        myGroup.SetSequence(ParseGroupSequence(input)); //RECURSIVE CALL!!!
        if(!IsRightParen(input)) //Groups started by `(` must end with `)`
            throw ParseError("Inner groups must end with `)`.");
        else
            EatRightParen(input); //Remove the `)` from the input.
    }
    else if(IsAtom(input))
    {
        myGroup.SetAtom(ReadAtom(input)); //Group contains one atom.
    }
    else
        throw ParseError("Unexpected input."); //error
    //Read the number.
    if(IsNumber(input))
        myGroup.SetCount(ReadNumber(input));
    else
        myGroup.SetCount(1);
    return myGroup;
}
vector<Group> ParseGroupSequence(input)
{
    vector<Group> groups;
    //Groups continue until the end of input or `)` is reached.
    while(!IsRightParen(input) and !IsEndOfInput(input))
        groups.push_back(ParseGroup(input));
    return groups;
}
The big difference here is that ParseGroup (the analog to the ParseAtom function) will call ParseGroupSequence. Which will call ParseGroup. Which can call ParseGroupSequence. Etc. A Group can either contain an atom or a sequence of Groups (such as NH4), stored as a vector<Group>.
When functions can call themselves (either directly or indirectly), it is called recursion. Which is fine, so long as it doesn't recurse infinitely. And there's no chance of that, because it will only recurse every time it sees (.
So how does this work? Well, let's consider some possible inputs:
NH3
ParseGroupSequence is called. It isn't at the end of input or ), so it calls ParseGroup.
ParseGroup sees an N, which is an atom. It adds this atom to the Group. It then sees an H, which is not a number. So it sets the Group's count to 1, then returns the Group.
Back in ParseGroupSequence, we store the returned group in the sequence, then iterate in our loop. We don't see the end of input or ), so it calls ParseGroup:
ParseGroup sees an H, which is an atom. It adds this atom to the Group. It then sees a 3, which is a number. So it reads this number, sets it as the Group's count, and returns the Group.
Back in ParseGroupSequence, we store the returned Group in the sequence, then iterate in our loop. We don't see ), but we do see the end of input. So we return the current vector<Group>.
(NH3)2
ParseGroupSequence is called. It isn't at the end of input or ), so it calls ParseGroup.
ParseGroup sees a (, which is the start of a Group. It eats this character (removing it from the input) and calls ParseGroupSequence on the Group.
ParseGroupSequence isn't at the end of input or ), so it calls ParseGroup.
ParseGroup sees an N, which is an atom. It adds this atom to the Group. It then sees an H, which is not a number. So it sets the group's count to 1, then returns the Group.
Back in ParseGroupSequence, we store the returned group in the sequence, then iterate in our loop. We don't see the end of input or ), so it calls ParseGroup:
ParseGroup sees an H, which is an atom. It adds this atom to the Group. It then sees a 3, which is a number. So it reads this number, sets it as the Group's count, and returns the Group.
Back in ParseGroupSequence, we store the returned group in the sequence, then iterate in our loop. We don't see the end of input, but we do see ). So we return the current vector<Group>.
Back in the first call to ParseGroup, we get the vector<Group> back. We stick it into our current Group as a sequence. We check to see if the next character is ), eat it, and continue. We see a 2, which is a number. So it reads this number, sets it as the Group's count, and returns the Group.
Now, way, way back at the original ParseGroupSequence call, we store the returned Group in the sequence, then iterate in our loop. We don't see ), but we do see the end of input. So we return the current vector<Group>.
This parser uses recursion to "descend" into each group. Therefore, this kind of parser is called a "recursive descent parser" (there's a formal definition for this kind of thing, but this is a good lay-understanding of the concept).
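To make the pseudocode concrete, here is one possible C++ rendering. It is only a sketch: the Group struct, the FormulaParser class and the addTotals helper are names invented for this illustration, and keeping a std::vector<Group> inside Group itself relies on C++17's support for vectors of incomplete types. The two parse functions mirror ParseGroup and ParseGroupSequence above.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// A Group is either a single atom or a parenthesised sequence of Groups.
struct Group {
    std::string atom;             // non-empty for a single-atom group
    std::vector<Group> sequence;  // filled for a "( ... )" group
    int count = 1;
};

class FormulaParser {
public:
    explicit FormulaParser(std::string text) : text_(std::move(text)) {}

    // Parses the whole input; a ')' left over at the top level is an error.
    std::vector<Group> parse() {
        std::vector<Group> groups = parseGroupSequence();
        if (pos_ < text_.size()) throw std::runtime_error("Unexpected `)`.");
        return groups;
    }

private:
    char peek() const { return pos_ < text_.size() ? text_[pos_] : '\0'; }
    static bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
    static bool isUpper(char c) { return std::isupper(static_cast<unsigned char>(c)) != 0; }
    static bool isLower(char c) { return std::islower(static_cast<unsigned char>(c)) != 0; }

    Group parseGroup() {
        Group group;
        if (peek() == '(') {
            ++pos_;                                 // eat '('
            group.sequence = parseGroupSequence();  // recursive call
            if (peek() != ')') throw std::runtime_error("Inner groups must end with `)`.");
            ++pos_;                                 // eat ')'
        } else if (isUpper(peek())) {
            group.atom += text_[pos_++];            // e.g. "N"
            while (isLower(peek())) group.atom += text_[pos_++];  // e.g. "Na"
        } else {
            throw std::runtime_error("Unexpected input.");
        }
        if (isDigit(peek())) {                      // optional count, default 1
            group.count = 0;
            while (isDigit(peek())) group.count = group.count * 10 + (text_[pos_++] - '0');
        }
        return group;
    }

    // Groups continue until the end of input or ')' is reached.
    std::vector<Group> parseGroupSequence() {
        std::vector<Group> groups;
        while (pos_ < text_.size() && peek() != ')') groups.push_back(parseGroup());
        return groups;
    }

    std::string text_;
    std::size_t pos_ = 0;
};

// Flattens the tree into per-element totals, multiplying counts on the way down.
void addTotals(const std::vector<Group>& groups, int factor,
               std::map<std::string, int>& totals) {
    for (const Group& g : groups) {
        if (!g.atom.empty()) totals[g.atom] += factor * g.count;
        else addTotals(g.sequence, factor * g.count, totals);
    }
}
```

A caller would run addTotals(FormulaParser("(NH4)2SO4").parse(), 1, totals) and then look up each element's count in totals to compute weight and percent composition.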

It is often helpful to write down the rules of the grammar for the strings you want to read and recognise. A grammar is just a bunch of rules which say what sequence of characters is acceptable, and by implication which are not acceptable. It helps to have the grammar before and while writing the program, and it might be fed into a parser generator (as described by DigitalRoss).
For example, the rules for a simple compound, without polyatomic ions, look like:
Compound: Component { Component };
Component: Atom [Quantity]
Atom: 'H' | 'He' | 'Li' | 'Be' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'
[...] is read as optional, and will be an if test in the program (either it is there or missing)
| is alternatives, and so is an if .. else if .. else or switch 'test', it is saying the input must match one of these
{ ... } is read as repetition of 0 or more, and will be a while loop in the program
Characters between quotes are literal characters which will be in the string. All the other words are names of rules, and for a recursive descent parser, end up being the names of the functions which get called to chop up, and handle the input.
For example, the function that implements the 'Quantity' rule just needs to read one or more digit characters and convert them to an integer. The function that implements the Atom rule reads enough characters to figure out which atom it is, and stores that away.
A nice thing about recursive descent parsers is that the error messages can be quite helpful, of the form "Expecting an Atom name, but got %c" or "Expecting a ')' but reached the end of the string". It is a bit complicated to recover after an error, so you might want to throw an exception at the first error.
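As an illustration of how the rules turn into functions, here is a sketch for the simple grammar above (no parentheses). The function names simply follow the rule names; the Component alias and the isDigitAt helper are invented for this example, and the Atom function only checks the letter case rather than looking the symbol up in a table of real elements.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Component = std::pair<std::string, int>;  // atom symbol and quantity

bool isDigitAt(const std::string& s, std::size_t pos) {
    return pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]));
}

// Atom: an upper-case letter followed by lower-case letters.
// Assumption: the symbol is not validated against the periodic table here.
std::string parseAtom(const std::string& s, std::size_t& pos) {
    if (pos >= s.size())
        throw std::runtime_error("Expecting an Atom name, but reached the end of the string");
    if (!std::isupper(static_cast<unsigned char>(s[pos])))
        throw std::runtime_error(std::string("Expecting an Atom name, but got ") + s[pos]);
    std::string symbol(1, s[pos++]);
    while (pos < s.size() && std::islower(static_cast<unsigned char>(s[pos])))
        symbol += s[pos++];
    return symbol;
}

// Quantity: Digit { Digit }   (the repetition becomes a while loop)
int parseQuantity(const std::string& s, std::size_t& pos) {
    int value = 0;
    while (isDigitAt(s, pos)) value = value * 10 + (s[pos++] - '0');
    return value;
}

// Component: Atom [Quantity]   (the optional part becomes an if test)
Component parseComponent(const std::string& s, std::size_t& pos) {
    std::string atom = parseAtom(s, pos);
    int quantity = isDigitAt(s, pos) ? parseQuantity(s, pos) : 1;
    return {atom, quantity};
}

// Compound: Component { Component }
std::vector<Component> parseCompound(const std::string& s) {
    std::size_t pos = 0;
    std::vector<Component> components;
    components.push_back(parseComponent(s, pos));
    while (pos < s.size()) components.push_back(parseComponent(s, pos));
    return components;
}
```

With this layout, an input such as H2(O) fails inside parseAtom, which is exactly the place that knows what was expected at that position.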
So are polyatomic ions just one level of parentheses? If so, the grammar might be:
Compound: Component { Component }
Component: Atom [Quantity] | '(' Component { Component } ')' [Quantity];
Atom: 'H' | 'He' | 'Li' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'
Or is it more complex, with a notation that must allow for nested parentheses? Once that is clear, you can figure out an approach to parsing.
I do not know the entire scope of your problem, but recursive descent parsers are relatively straightforward to write, and look adequate for your problem.

Consider re-structuring your program as a simple Recursive Descent Parser.
First, you need to change the parseString function to take a string to be parsed, and the current position from which to start the parse, passed by reference.
This way you can structure your code so that when you see a ( you call the same function at the next position, get a Composite back, and consume the closing ). When you see a ) by itself, you return without consuming it. This lets you consume formulas with unlimited nesting of ( and ), although I am not sure if it is necessary (it's been more than 20 years since the last time I saw a chemical formula).
This way you'd write the code for parsing composite only once, and re-use it as many times as needed. It will be easy to supplement your reader to consume formulas with dashes etc., because your parser will need to deal only with the basic building blocks.
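A rough sketch of that shape follows. The Composite alias (a map from element symbol to atom count) and the function name parseComposite are assumptions made for this illustration; the point is the signature, which takes the string plus a position passed by reference, and the rule that a ) seen by itself is left for the caller to consume.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

using Composite = std::map<std::string, int>;  // element symbol -> atom count

// Parses from pos until the end of input or a ')' seen by itself.
// The ')' is NOT consumed here; the caller that saw the '(' consumes it.
Composite parseComposite(const std::string& s, std::size_t& pos) {
    auto isDigit = [](char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; };
    auto isUpper = [](char c) { return std::isupper(static_cast<unsigned char>(c)) != 0; };
    auto isLower = [](char c) { return std::islower(static_cast<unsigned char>(c)) != 0; };

    Composite result;
    while (pos < s.size() && s[pos] != ')') {
        Composite part;
        if (s[pos] == '(') {
            ++pos;                          // consume '('
            part = parseComposite(s, pos);  // same function, next position
            if (pos >= s.size()) throw std::runtime_error("missing ')'");
            ++pos;                          // consume the closing ')'
        } else if (isUpper(s[pos])) {
            std::string symbol(1, s[pos++]);
            while (pos < s.size() && isLower(s[pos])) symbol += s[pos++];
            part[symbol] = 1;
        } else {
            throw std::runtime_error("unexpected character");
        }
        // Optional multiplier that applies to the atom or bracketed part.
        int count = 0;
        bool hasCount = false;
        while (pos < s.size() && isDigit(s[pos])) {
            count = count * 10 + (s[pos++] - '0');
            hasCount = true;
        }
        if (!hasCount) count = 1;
        for (const auto& entry : part) result[entry.first] += entry.second * count;
    }
    return result;
}
```

The top-level caller starts with pos set to 0 and can treat pos ending anywhere other than the end of the string as a stray ).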

Maybe you can get rid of the brackets before parsing. You need to find how deeply the brackets are nested (sorry for my English) and rewrite the formula step by step, beginning with the deepest bracket:
(NH4(Na2H4)3Zn)2SO4 (this formula doesn't mean anything, actually...)
(NH4Na6H12Zn)2SO4
N2H8Na12H24Zn2SO4
No brackets are left, so run your code with N2H8Na12H24Zn2SO4. Note that every atom inside a bracket gets multiplied, including one with an implicit count of 1 (N becomes N2).
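One way to sketch this rewriting in C++ is shown below. The helper names multiplyFlat and expandBrackets are made up for the example. The first ) in the string always closes an innermost bracket, so the loop expands that one, splices the result back in, and repeats until no ) remains. Be aware that the flattened string can mention the same element more than once (H8 and H24 in the formula above), so whatever consumes it has to add repeated symbols together.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Multiplies every atom count in a bracket-free fragment such as "Na2H4".
// Assumes the fragment is well formed (each symbol starts with a capital).
std::string multiplyFlat(const std::string& flat, int factor) {
    std::string out;
    std::size_t i = 0;
    while (i < flat.size()) {
        std::string symbol(1, flat[i++]);
        while (i < flat.size() && std::islower(static_cast<unsigned char>(flat[i])))
            symbol += flat[i++];
        int count = 0;
        bool hasCount = false;
        while (i < flat.size() && std::isdigit(static_cast<unsigned char>(flat[i]))) {
            count = count * 10 + (flat[i++] - '0');
            hasCount = true;
        }
        if (!hasCount) count = 1;  // an atom without a number counts once
        int product = count * factor;
        out += symbol;
        if (product != 1) out += std::to_string(product);
    }
    return out;
}

// Repeatedly replaces an innermost "( ... )n" with its multiplied contents.
std::string expandBrackets(std::string formula) {
    for (;;) {
        std::size_t close = formula.find(')');  // first ')' closes an innermost bracket
        if (close == std::string::npos) break;
        std::size_t open = formula.rfind('(', close);
        if (open == std::string::npos) throw std::runtime_error("unmatched ')'");
        std::size_t end = close + 1;            // one past the multiplier digits
        int factor = 0;
        bool hasFactor = false;
        while (end < formula.size() && std::isdigit(static_cast<unsigned char>(formula[end]))) {
            factor = factor * 10 + (formula[end++] - '0');
            hasFactor = true;
        }
        if (!hasFactor) factor = 1;
        std::string inner = formula.substr(open + 1, close - open - 1);
        formula.replace(open, end - open, multiplyFlat(inner, factor));
    }
    if (formula.find('(') != std::string::npos) throw std::runtime_error("unmatched '('");
    return formula;
}
```

For the example above, expandBrackets goes through the same intermediate step, (NH4Na6H12Zn)2SO4, before producing the bracket-free string.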

Related

Multi line comment flex lex

I'm trying to match a multi-line comment with these conditions:
It starts with ''' and finishes with '''
It can't contain exactly three ''' inside. Examples:
'''''' Correct
'''''''' Correct
'''a'a''a''''a''' Correct
''''''''' Incorrect
'''a'''a''' Incorrect
This is my approximation, but I'm not able to write the correct expression for this:
'''([^']|'[^']|''[^']|''''+[^'])*'''+

The easy solution is to use a start condition. (Note that this doesn't pass on all your test cases, because I think the problem description is ambiguous. See below.)
In the following, I assume that you want to return the matched token, and that you are using a yacc/bison-generated parser which includes char* str as one of the union types. The start-condition block is a Flex extension; in the unlikely event that you're using some other lex derivative, you'll need to write out the patterns one per line, each one with the <SC_TRIPLE_QUOTE> prefix (and no space between that and the pattern).
%x SC_TRIPLE_QUOTE
%%
'''               { BEGIN(SC_TRIPLE_QUOTE); }
<SC_TRIPLE_QUOTE>{
  '''             { yylval.str = strndup(yytext, yyleng - 3);
                    BEGIN(INITIAL);
                    return STRING_LITERAL;
                  }
  [^']+           |
  ''?/[^']        |
  ''''+           { yymore(); }
  <<EOF>>         { yyerror("Unterminated triple-quoted string");
                    return 0;
                  }
}
I hope that's more or less self-explanatory. The four patterns inside the start condition match the ''' terminator, any sequence of characters other than ', no more than two ', and at least four '. yymore() causes the respective matches to be accumulated. The call to strndup excludes the delimiter.
Note:
The above code won't provide what you expect from the second example, because I don't think it is possible (or, alternatively, you need to be clearer about which of the two possible analyses is correct, and why). Consider the possible comments:
'''a'''
'''a''''
'''a''''a'''
According to your description (and your third example), the third one should match, with the internal value a''''a, because '''' is more than three quotes. But according to your second example (slightly modified), the second one should match, with the internal value ', because the final ''' is taken as a terminator. The question is, how are these two possible interpretations supposed to be distinguished? In other words, what clue does the lexical scanner have that in the second case, the token ends at ''' while in the third one it doesn't? Since both of these are part of an input stream, there could be arbitrary text following. And since these are supposedly multi-line comments, there's no a priori reason to believe that the newline character isn't part of the token.
So I made an arbitrary choice about which interpretation to choose. I could have made the other arbitrary choice, but then a different example wouldn't work.

Regex Multiple rows [duplicate]

I'm trying to get the list of all digits preceding a hyphen in a given string (let's say in cell A1), using a Google Sheets regex formula :
=REGEXEXTRACT(A1, "\d-")
My problem is that it only returns the first match... how can I get all matches?
Example text:
"A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq"
My formula returns 1-, whereas I want to get 1-2-2-2-2-2-2-2-2-2-3-3- (either as an array or concatenated text).
I know I could use a script or another function (like SPLIT) to achieve the desired result, but what I really want to know is how I could get a re2 regular expression to return such multiple matches in a "REGEX.*" Google Sheets formula.
Something like the "global - Don't return after first match" option on regex101.com
I've also tried removing the undesired text with REGEXREPLACE, with no success either (I couldn't get rid of other digits not preceding a hyphen).
Any help appreciated!
Thanks :)

You can actually do this in a single formula using regexreplace to surround all the values with a capture group instead of replacing the text:
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
Basically, it surrounds every instance of \d- with a capture group; REGEXEXTRACT then neatly returns all the captures. If you want to join them back into a single string, you can use JOIN to pack them into a single cell.

You may create your own custom function in the Script Editor:
function ExtractAllRegex(input, pattern,groupId) {
return [Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId])];
}
Or, if you need to return all matches in a single cell joined with some separator:
function ExtractAllRegex(input, pattern,groupId,separator) {
return Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId]).join(separator);
}
Then, just call it like =ExtractAllRegex(A1, "\d-", 0, ", ").
Description:
input - current cell value
pattern - regex pattern
groupId - Capturing group ID you want to extract
separator - text used to join the matched results.

Edit
I came up with more general solution:
=regexreplace(A1,"(.)?(\d-)|(.)","$2")
It replaces any text except the second group match (\d-) with just the second group $2.
"(.)?(\d-)|(.)"
1 2 3
Groups are in ()
---------------------------------------
"$2" -- means return the group number 2
Learn regular expressions: https://regexone.com
Try this formula:
=regexreplace(regexreplace(A1,"[^\-0-9]",""),"(\d-)|(.)","$1")
It will handle string like this:
"A1-Nutrition;A2-ActPhysiq;A2-BioM---eta;A2-PH3-Généti***566*9q"
with output:
1-2-2-2-3-

I wasn't able to get the accepted answer to work for my case. I'd like to do it that way, but needed a quick solution and went with the following:
Input:
1111 days, 123 hours 1234 minutes and 121 seconds
Expected output:
1111 123 1234 121
Formula:
=split(REGEXREPLACE(C26,"[a-z,]"," ")," ")

The shortest possible regex:
=regexreplace(A1,".?(\d-)|.", "$1")
Which returns 1-2-2-2-2-2-2-2-2-2-3-3- for "A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq".
Explanation of regex:
.? -- optional character
(\d-) -- capture group 1 with a digit followed by a dash (use (\d+-) to allow multiple digits)
| -- logical or
. -- any character
the replacement "$1" uses just the capture group 1, and discards anything else
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex

This seems to work and I have tried to verify it.
The logic is
(1) Replace letter followed by hyphen with nothing
(2) Replace any digit not followed by a hyphen with nothing
(3) Replace everything which is not a digit or hyphen with nothing
=regexreplace(A1,"[a-zA-Z]-|[0-9][^-]|[a-zA-Z;/é]","")
Result
1-2-2-2-2-2-2-2-2-2-3-3-
Analysis
I had to step through these procedurally to convince myself that this was correct. According to this reference when there are alternatives separated by the pipe symbol, regex should match them in order left-to-right. The above formula doesn't work properly unless rule 1 comes first (otherwise it reduces all characters except a digit or hyphen to null before rule (1) can come into play and you get an extra hyphen from "Patho-jour").
Here are some examples of how I think it must deal with the text

The solution of adding capture groups with REGEXREPLACE and then calling REGEXEXTRACT works here too, but there is a catch.
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
If the cell you are extracting values from contains special characters such as a parenthesis "(" or a question mark "?", the solution provided won't work.
In my case, I was trying to list all the "variable texts" contained in the cell. These were written like this: "{example_name}". But the full content of the cell had special characters that broke the regex formula. Once I removed these special characters, I could list all the captured groups as the solution does.

There are two general ('Excel' / 'native' / non-Apps Script) solutions to return an array of regex matches in the style of REGEXEXTRACT:
Method 1)
insert a delimiter around matches, remove junk, and call SPLIT
Regexes work by iterating over the string from left to right, and 'consuming'. If we are careful to consume junk values, we can throw them away.
(This gets around the problem faced by the currently accepted solution, which is that as Carlos Eduardo Oliveira mentions, it will obviously fail if the corpus text contains special regex characters.)
First we pick a delimiter, which must not already exist in the text. The proper way to do this is to parse the text to temporarily replace our delimiter with a "temporary delimiter", like if we were going to use commas "," we'd first replace all existing commas with something like "<<QUOTED-COMMA>>" then un-replace them later. BUT, for simplicity's sake, we'll just grab a random character such as  from the private-use unicode blocks and use it as our special delimiter (note that it is 2 bytes... google spreadsheets might not count bytes in graphemes in a consistent way, but we'll be careful later).
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
"xyzSixSpaces:[ ]123ThreeSpaces:[ ]aaaa 12345",".*?( |$)",
"$1"
)
),
""
)
We just use a lambda to define temp="match1match2match3", then use that to remove the last delimiter into "match1match2match3", then SPLIT it.
Taking COLUMNS of the result will prove that the correct result is returned, i.e. {" ", " ", " "}.
This is a particularly good function to turn into a Named Function, and call it something like REGEXGLOBALEXTRACT(text,regex) or REGEXALLEXTRACT(text,regex), e.g.:
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
text,
".*?("&regex&"|$)",
"$1"
)
),
""
)
Method 2)
use recursion
With LAMBDA (which lets you define a function, as in any other programming language), you can use some tricks from the well-studied lambda calculus and functional programming: you have access to recursion. Defining a recursive function is confusing because there's no easy way for it to refer to itself, so you have to use a trick/convention:
trick for recursive functions: to actually define a function f which needs to refer to itself, instead define a function that takes a parameter of itself and returns the function you actually want; pass in this 'convention' to the Y-combinator to turn it into an actual recursive function
The plumbing which makes such a function work is called the Y-combinator. Here is a good article to understand it if you have some programming background.
For example to get the result of 5! (5 factorial, i.e. implement our own FACT(5)), we could define:
Named Function Y(f)=LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) ) (this is the Y-combinator and is magic; you don't have to understand it to use it)
Named Function MY_FACTORIAL(n)=
Y(LAMBDA(self,
LAMBDA(n,
IF(n=0, 1, n*self(n-1))
)
))
result of MY_FACTORIAL(5): 120
The Y-combinator makes writing recursive functions look relatively easy, like an introduction to programming class. I'm using Named Functions for clarity, but you could just dump it all together at the expense of sanity...
=LAMBDA(Y,
Y(LAMBDA(self, LAMBDA(n, IF(n=0,1,n*self(n-1))) ))(5)
)(
LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) )
)
How does this apply to the problem at hand? Well a recursive solution is as follows:
in pseudocode below, I use 'function' instead of LAMBDA, but it's the same thing:
// code to get around the fact that you can't have 0-length arrays
function emptyList() {
    return {"ignore this value"}
}
function listToArray(myList) {
    return OFFSET(myList,0,1)
}
function allMatches(text, regex) {
    return allMatchesHelper(emptyList(), text, regex)
}
function allMatchesHelper(resultsToReturn, text, regex) {
    currentMatch = REGEXEXTRACT(...)
    if (currentMatch succeeds) {
        textWithoutMatch = SUBSTITUTE(text, currentMatch, "", 1)
        return allMatchesHelper(
            {resultsToReturn,currentMatch},
            textWithoutMatch,
            regex
        )
    } else {
        return listToArray(resultsToReturn)
    }
}
Unfortunately, the recursive approach is quadratic order of growth (because it's appending the results over and over to itself, while recreating the giant search string with smaller and smaller bites taken out of it, so 1+2+3+4+5+... = big^2, which can add up to a lot of time), so may be slow if you have many many matches. It's better to stay inside the regex engine for speed, since it's probably highly optimized.
You could of course avoid using Named Functions by doing temporary bindings with LAMBDA(varName, expr)(varValue) if you want to use varName in an expression. (You can define this pattern as a Named Function =cont(varValue) to invert the order of the parameters to keep code cleaner, or not.)
Whenever I use varName = varValue, write that instead.
to see if a match succeeds, use ISNA(...)
It would look something like:
Named Function allMatches(resultsToReturn, text, regex):
UNTESTED:
LAMBDA(helper,
OFFSET(
helper({"ignore"}, text, regex),
0,1)
)(
Y(LAMBDA(helperItself,
LAMBDA(results, partialText,
LAMBDA(currentMatch,
IF(ISNA(currentMatch),
results,
LAMBDA(textWithoutMatch,
helperItself({results,currentMatch}, textWithoutMatch)
)(
SUBSTITUTE(partialText, currentMatch, "", 1)
)
)
)(
REGEXEXTRACT(partialText, regex)
)
)
))
)

How can I allow my program to continue when a regex doesn't match?

I want to use the regex crate and capture numbers from a string.
let input = "abcd123efg";
let re = Regex::new(r"([0-9]+)").unwrap();
let cap = re.captures(input).unwrap().get(1).unwrap().as_str();
println!("{}", cap);
It worked if numbers exist in input, but if numbers don't exist in input I get the following error:
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value'
I want my program to continue if the regex doesn't match. How can I handle this error?

You probably want to (re-)read the chapter on "Error Handling" in the Rust book. Error handling in Rust is mostly done via the types Result<T, E> and Option<T>, both representing an optional value of type T with Result<T, E> carrying additional information about the absence of the main value.
You are calling unwrap() on each Option or Result you encounter. unwrap() is a method saying: "if there is no value of type T, let the program explode (panic)". You only want to call unwrap() if an absence of a value is not expected and thus would be a bug! (NB: actually, the unwrap() in your second line is a perfectly reasonable use!)
But you use unwrap() incorrectly twice: on the result of captures() and on the result of get(1). Let's tackle captures() first; it returns an Option<_> and the docs say:
If no match is found, then None is returned.
In most cases, the input string not matching the regex is to be expected, thus we should deal with it. We could either just match the Option (the standard way to deal with those possible errors, see the Rust book chapter) or we could use Regex::is_match() before, to check if the string matches.
Next up: get(1). Again, the docs tell us:
Returns the match associated with the capture group at index i. If i does not correspond to a capture group, or if the capture group did not participate in the match, then None is returned.
But this time, we don't have to deal with that. Why? Our regex (([0-9]+)) is constant and we know that the capture group exists and encloses the whole regex. Thus we can rule out both possible situations that would lead to a None. This means we can unwrap(), because we don't expect the absence of a value.
The resulting code could look like this:
let input = "abcd123efg";
let re = Regex::new(r"([0-9]+)").unwrap();
match re.captures(input) {
    Some(caps) => {
        let cap = caps.get(1).unwrap().as_str();
        println!("{}", cap);
    }
    None => {
        // The regex did not match. Deal with it here!
    }
}

You can either check with is_match first, or use the return value of captures(input) directly (it's an Option<Captures<'t>>) and handle it with a match instead of unwrapping it (see how to handle options).

Using regex to obtain terminator int in for-loop

I searched a lot but was unable to find a way to detect a for loop inside my variable.
I have a string like "for(i=0;i<=1000;i++)", or it may be any kind of for loop.
I want to get the first part of the for loop, like for(i=0;, and the 1000, i.e. how many times it runs, if a number is given.
For now I use the following, which gives me the first part, but it is not perfect:
str.replace(/for\(\w+=\w;/g,"something");
It fails when I put some spaces in between!
Is there any way to get only that part of the for loop using a regex?

This works:
\bfor\s*\([^;]+;.+\b(\d+)\b\s*;
This allows spaces between the for and open-paren, and also between the number in capture group one (the number you want) and the following semi-colon.
As far as how you "get" the number you want: It's in capture group one. I don't know what language you're using, but in Java, you'd retrieve it with something like:
Matcher m = Pattern.
    compile("\\bfor\\s*\\([^;]+;.+\\b(\\d+)\\b\\s*;").
    matcher(theSourceCode);
int theNumberYouWant = -1;
if(m.find()) {
    //Safe to translate to int, since the match guarantees it's a number
    theNumberYouWant = Integer.parseInt(m.group(1));
}
If you also need to capture the first number, before the first semicolon, just duplicate the current capture-group. So change this
\bfor\s*\([^;]+;.+\b(\d+)\b\s*;
to
\bfor\s*\(.+\b(\d+)\b\s*;.+\b(\d+)\b\s*;
and now the first number is in capture group 1, and the second in capture group 2.
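For readers working in C++ rather than Java, the single-group pattern can be tried with std::regex, whose default ECMAScript grammar understands \b, \s and \d. This is only a sketch: the function name loopBound is made up for the example, std::optional requires C++17, and std::stoi would throw if the digits did not fit in an int.

```cpp
#include <optional>
#include <regex>
#include <string>

// Returns the number that precedes the second semicolon of a for header,
// or std::nullopt when the text does not match the pattern.
std::optional<int> loopBound(const std::string& source) {
    static const std::regex pattern(R"(\bfor\s*\([^;]+;.+\b(\d+)\b\s*;)");
    std::smatch match;
    if (!std::regex_search(source, match, pattern)) return std::nullopt;
    return std::stoi(match[1].str());  // group 1 holds only digits
}
```

Returning an optional keeps the "no number found" case separate from any real value, instead of relying on a sentinel such as -1.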

Regex for Markdown style emphasis: with x amount of *

I want to write a regular expression for matching strings that are wrapped in * characters, much like markdown, which uses them to **make things bold**.
But I also want the number of *'s at the start and end to be variable. The number of stars indicates how important that string is.
At the moment I'm using this:
/(\*\*\*|\*\*|\*)(.*?)\1/
Which works for up to ***three*** on either side. This returns both the string between the *'s and the string containing the ***. I then count the length of that string to get the number of *'s.
In Ruby, this looks like:
"*this is important*, but this is ***very important***".scan(/(\*\*\*|\*\*|\*)(.*?)\1/).each do |match|
  points << { :str => match[1], :importance => match[0].length }
end
The regex is working fine in most parts, but if I wanted to get ********something really important********; the expression would get a bit out of hand - doing it the way I have so far.
I understand my current pattern is searching for an amount of *'s and finding the text between that and another occurrence of the same string. But it would also be nice to account for human error, such as a string like;
**This is quite important*, but ***this is really important****.
Thanks all!

What about simply the below?
/(\*+)(.*?)\1/
\*+ is one or more *'s.
Or, if you want to limit it to a specific amount:
/(\*{1,5})(.*?)\1/
\*{1,5} means anywhere between 1 and 5 *'s. You're obviously free to change 1 and 5 as you see fit.
Different lengths on either side:
The above will work for the same amount of *'s on both sides (because of the back-reference \1).
If you want to allow for different amounts on either side, you can use \*+ instead of \1, so:
/(\*+)(.*?)\*+/
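The back-reference version can also be tried outside Ruby. The sketch below uses C++'s std::regex, whose default ECMAScript grammar supports both the lazy .*? and the \1 back-reference; the Emphasis struct and the findEmphasis function are names invented for the example.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

struct Emphasis {
    std::string text;        // what sits between the star runs
    std::size_t importance;  // number of '*' in the opening run
};

// Collects every "*...*" span whose closing run equals its opening run.
std::vector<Emphasis> findEmphasis(const std::string& input) {
    static const std::regex pattern(R"((\*+)(.*?)\1)");
    std::vector<Emphasis> found;
    for (std::sregex_iterator it(input.begin(), input.end(), pattern), end; it != end; ++it) {
        const std::smatch& m = *it;
        found.push_back({m[2].str(), static_cast<std::size_t>(m[1].length())});
    }
    return found;
}
```

Because group 1 is one variable-length run of stars, the importance is simply its length, with no need to spell out each possible run as an alternative.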
