Dart regex for capturing groups but ignoring certain similar patterns - regex

I'm trying to capture a group from a string with ~, ~~ and ~~~ symbols. I was successful with extracting single symbols but it doesn't ignore the other occurrences in the string.
This is my code I tried experimenting with:
String f = '~the calculator is on and working~I entered 50 into the calculator'+
'~~I press add button~~holding equal button ~~~The result should be 50';
List<String>givens = f.split(RegExp(r'~+'));
List<String>whens = f.split(RegExp(r'~~+'));
List<String>thens = f.split(RegExp(r'~~~+'));
for(String ss in givens){
print(ss);
}
print('xxxxxxxxxxxx');
for(String ss in whens){
print(ss);
}
print('xxxxxxxxxxxx');
for(String ss in thens){
print(ss);
}
Which will result with:
The givens capture group also captured the ones with ~~ and ~~~ which is not intended.
The whens capture group also captured the ones single ~ which made it very confusing.
Lastly, the thens capture group also captured the others which is also not intended.
I only need to capture the strings starting with the specific pattern but will stop when they see a different one.
Example: givens should only capture 'the calculator is on and working' and 'I entered 50 into the calculator' only.
Any hints or help is greatly appreciated!

I think the problem is that you started off by splitting the string into pieces. But it might be easier to search for the elements with a pattern that will look for some text preceeded with either one, two or three ~ chars.
This can be done with regex positive lookbehind patterns.
Typically, if you want to find a string preceeded by one tild then you have to avoid that it matches if we have other tilds before it.
Find givens
(?<=(?:[^~]|^)~)[^~]+ would be the pattern to find only givens.
Test it here: https://regex101.com/r/9WLbM3/2
Explanation
[^~] means search for any character which is not a ~. This is because [abc] means any char which is in the list, so a, b or c. If you add the ^ char at the beginning of the list then it means "not these chars".
[^~]+ means search for one or multiple times a character which is not ~. This will capture phrases between the tilds.
A positive lookbehind is done with (?<=something present). We want to search for a tild so we would put (?<=~) as positive lookbehind. But the problem is that it will also match the ones with several tilds in front. To avoid that we can say that the tild should either be prefixed by ^ (meaning the beginning of a string) or by [^~] (meaning not a tild). To say "either this or that", we use the syntax (this|that|or even that). But using parenthesis will capture the content and we don't need that. To disable group capturing we can add ?: at the beginning of the group, leading finally to (?:[^~]|^) meaning either a non-tild char or the beginning of the string, without capturing it.
Find whens and thens
The regular expression is almost the same. It's just that we replace ~ by ~{2} or ~{3}.
Pattern for whens: (?<=(?:[^~]|^)~{2})[^~]+
Pattern for thens: (?<=(?:[^~]|^)~{3})[^~]+

Related

Regex to match(extract) string between dot(.)

I want to select some string combination (with dots(.)) from a very long string (sql). The full string could be a single line or multiple line with new line separator, and this combination could be in start (at first line) or a next line (new line) or at both place.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
, can I also add some prefix or suffix words or space which should be present where ever this logic gets checked. In above case if its update regex should pick test.table, but if the statement is like select test.table regex should not pick it up this combinations and same applies for suffix.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that understands any word character \w that matches between 0 and unlimited times, followed by a dot, followed by another series of word character that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w][.][\w])+/gm;
in Java:
final String regex = "([\w][.][\w])+";
in Python:
regex = r"([\w][.][\w])+"
in PHP:
$re = '/([\w][.][\w])+/m';
Note that: this solution is written for your use case (to be used for SQL strings), because now if you have something like '.word' or 'word..word', it will still catch it which I assume you don't have a string like that.
See this screenshot for more details

Regex to find after particular word inside a string

I am using regex to find few keywords after colon(:) and the best I have reached so far is:
sample test case
test {
test1 {
sadffd(test: "aff", aaa: "aa1") {}
}
}
Now I have to find a keyword inside () brackets and its working for 'aaa' but when I add test it fails, it matches entire words in string.
my regex so far
\btest(.*\w") (failed case) expected "aff" returned "aff", aaa: "aa1"
\baaa(.*\w") (pass case) returned "aa1"
please let me know if more information is needed
You may try
:\s*"(.*?)"
And the data you need is in the first capturing group.
Explanation
:\s*"(.*?)"
: colon
\s* followed by optionally any number of spaces
" followed by quote
( ) capturing group, containing...
.*? any number of character, matching as few as possible
" followed by quote
Demo:
https://regex101.com/r/WnvzdG/1
Update:
If you want to match ONLY after specific keywords, followed by colon, you can do something like:
(KEYWORD1|KEYWORD2|KEYWORD3)\s*:\s*"(.*?)"
First capture group will be the keyword matched, second capture group will be the value.
One more approach (executed in Python)
items = ['test{test1 {sadffd(test: "aff", aaa: "aa1") {}}}']
for item in items:
print(re.findall(r'"(\w+)"',item))
print(re.findall(r'(?<=: )"(\w+)"',item))
Output
['aff', 'aa1']
['aff', 'aa1']
I believe a simple regex would work to get everything inside the double quotes in your case:
("\w+")
Note that your question above says you want to capture "aff" and not just aff so I've included the surrounding quotes within the capturing group.
Example from regex101:
It's pretty crude but this should be OK for the input you've presented. (It wouldn't handle things like an escaped double quote in the string, for example).

Regex to MATCH number string (with optional text) in a sentence

I am trying to write a regex that matches only strings like this:
89-72
10-123
109-12
122-311(a)
22-311(a)(1)(d)(4)
These strings are embedded in sentences and sometimes there are 2 potential matches in the sentence like this:
In section 10-123 which references section 122-311(a) there is a phone number 456-234-2222
I do not want to match the phone. Here is my current working regex
\d{2,3}\-\d{2,3}(\([a-zA-Z0-9]\))*
see DEMO
I've been looking on Stack and have not found anything yet. Any help would be appreciated. Will be using this in a google sheet and potentially postgres.
Based on regex, suggested by #Wiktor Stribiżew:
=REGEXEXTRACT(A1,REPT("\b(\d{2,3}-\d{2,3}\b(?:\([A-Za-z0-9]\))*)(?:[^-]|$)(?:.*)",LEN(REGEXREPLACE(REGEXREPLACE(A1,"\b(\d{2,3}-\d{2,3}\b(?:\([A-Za-z0-9]\))*)(?:[^-]|$)", char (9)),"[^"&char(9)&"]",""))))
The formula will return all matches.
String:
A
In 22-311(a)(1)(d)(4) section 10-123 which ... 122-311(a) ... number 456-234-2222
Output:
B C D
22-311(a)(1)(d)(4) 10-123 122-311(a)
Solution
To extract all matches from a string, use this pattern:
=REGEXEXTRACT(A1,
REPT(basic_regex & "(?:.*)",
LEN(REGEXREPLACE(REGEXREPLACE(A1,basic_regex, char (9)),"[^"&char(9)&"]",""))))
The tail of a function:
LEN(REGEXREPLACE(REGEXREPLACE(A1,basic_regex, char (9)),"[^"&char(9)&"]","")))
is just for finding number 3 -- how many entries of a pattern in a string.
To not match the phone number you have to indicate that the match must neither be preceded nor followed by \d or -. Google spreadsheet uses RE2 which does not support look around assertion (see the list of supported feature) so as far as I can tell, the only solution is to add a character before and after the match, or the string boundary:
(?:^|[^-\d])\d{2,3}\-\d{2,3}(\([a-zA-Z0-9]\))*(?:$|[^-\d])
(?:^|[^-\d]) means either the start of a line (^) or a character that is not - or \d (you might want to change that, and forbid all letters as well). $ is the end of a line. ^ and $ only do what you want with the /m flag though
As you can see here this finds the correct strings, but with additional spaces around some of the matches.

regex needed for parsing string

I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the fda.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours to find a regex solution to retrieve both the delimiter and the value that follows it and, though there seems to be posts that handle this, the code found in the post haven't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance: delimiters are used such as "=", "=,", "/=". In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>'
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
Working demo
The idea is that the first group contains the delimiters and the 2nd group the content. I assume the content can be word characters (a to z and digits with underscore). You have then to grab the content of every capturing group.
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
See live demo.
If your values can contain non-alphanumeric characters, it get complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
See live demo.
Code the delimiters longest first, eg "=," before "=", otherwise the alternation, which matches left to right, will match "=" and the comma will become part of the value.
This uses a negative look ahead to stop matching past the next delimiter.

Regex Group not starting with

I'm having trouble to compute 2 regex in one (used to deal with .ini files)
I've got this one (I suggest you to use rubular with theses examples to understand)
^(?<key>[^=;\r\n]+)=((?<value>\"*.*;*.*\"[^;\r\n]*);?(?<comment>.*)[^\r\n]*)
to match :
This="isnot;acomment"
This="isa";comment
This="isa;special";case
And I've got this one :
^(?<key>[^=;\r\n]+)=(?<value>[^;\r\n]*);?(?<comment>[^\r\n]*)
to match
This=isasimplecase
This=isasimple;comment
And I'm trying to merge the 2 regex, sadly I do not manage to say "If my value group is not starting with \" use the second one if not use the first one".
Right now i've got this :
^(?<key>[^=;\r\n]+)=(((?<value>\"*.*;*.*\"[^;\r\n]*);?(?<comment>.*)[^\r\n]*)|(?<value>[^;\r\n]*);?(?<comment>[^\r\n]*))
But it's creating 2 more sections unnamed for the simple case without quoted. I was thinking that maybe by adding "the first item of the value group for the simple case must not start with \". But I didn't manage to do it.
PS : I suggest you to use rubular to understand better my problem. Sorry if I wasn't clear enough
How about this?
^(?<key>[^=;\r\n]+)=(?<value>"[^"]*"|[^;\n\r]*);?(?<comment>.*)
DEMO
(?<key>[^=;\r\n]+) Matches the part before the = symbol.
"[^"]*" Matches the string within the double quotes , ex strings like "foobar". If there is no " then the regex engine move on to the next pattern that is [^;\n\r]* and it matches upto the first ; or newline or \r character. These matched characters are stored into a named group called value.
;? Optional semicolon.
(?<comment>.*) Remaining characters are stored into the comment group.