Regular Expression : Splitting a string of list of multivalues - regex

My goal is splitting this string with regular expression:
AA(1.2,1.3)+,BB(125)-,CC(A,B,C)-,DD(QWE)+
in a list of:
AA(1.2,1.3)+
BB(125)-
CC(A,B,C)-
DD(QWE)+
Regards.

This regex works with your sample string:
,(?![^(]+\))
This splits on comma, but uses a negative lookahead to assert that the next bracket character is not a right bracket. It will still split even if there are no following brackets.
Here's some java code demonstrating it working with your sample plus some general input showing its robustness:
String input = "AA(1.2,1.3)+,BB(125)-,FOO,CC(A,B,C)-,DD(QWE)+,BAR";
String[] split = input.split(",(?![^(]+\\))");
for (String s : split) System.out.println(s);
Output:
AA(1.2,1.3)+
BB(125)-
FOO
CC(A,B,C)-
DD(QWE)+
BAR

I don't know what language you are working with, but this makes it in grep:
$ grep -o '[A-Z]*([A-Z0-9.,]*)[^,]*' file
AA(1.2,1.3)+
BB(125)-
CC(A,B,C)-
DD(QWE)+
Explanation
[A-Z]*([A-Z0-9.,]*)[^,]*
^^^^^^ ^^^^^^^^^^^ ^^^^^
| ^ | ^ |
| | | | everything but a comma
| ( char | ) char
| A-Z 0-9 . or , chars
list of chars from A to Z

Related

C++ getline - Extracting a substring using regex

I have a file with contents like this -
Random text
+-------------------+------+-------+-----------+-------+
| Data | A | B | C | D |
+-------------------+------+-------+-----------+-------+
| Data 1 | 1403 | 0 | 2520 | 55.67 |
| Data 2 | 1365 | 2 | 2520 | 54.17 |
| Data 3 | 1 | 3 | 1234 | 43.12 |
Some more random text
I want to extract the value of column D of row Data 1 i.e. I want to extract the value 55.67 from the example above. I am parsing this file line by line using getline -
while(getline(inputFile1,line)) {
if(line.find("| Data 1") != string::npos) {
subString = //extract the desired value
}
How can I extract the desired sub string from the line. Is there any way using boost::regex that I can extract this substring?
While regex may have its uses, it's probably overkill for this.
Bring in a trim function and:
char delim;
std::string line, data;
int a, b, c;
double d;
while(std::getline(inputFile1, line)) {
std::istringstream is(line);
if( std::getline(is >> delim, data, '|') >>
a >> delim >> b >> delim >> c >> delim >> d >> delim)
{
trim(data);
if(data == "Data 1") {
std::cout << a << ' ' << b << ' ' << c << ' ' << d << '\n';
}
}
}
Demo
Yes, it is easily possible to extract your substring with a regex. There is no need to use boost, you can also use the existing C++ regex library.
The resulting program is ultra simple.
We read all lines of the source file in a simple for loop. Then we use std::regex_match to match a just read line against our regex. If we have found a match, then the result will be in the std::smatch sm, group 1.
And because we will design the regex for finding double values, we will get exactly what we need, without any additional spaces.
This we can convert to a double and show the result on the screen. And because we defined the regex to find a double, we can be sure that std::stod will work.
The resulting program is rather straightforward and easy to understand:
#include <iostream>
#include <string>
#include <sstream>
#include <regex>
// Please note. For std::getline, it does not matter, if we read from a
// std::istringstream or a std::ifstream. Both are std::istream's. And because
// we do not have files here on SO, we will use an istringstream as data source.
// If you want to read from a file later, simply create an std::ifstream inputFile1
// Source File with all data
std::istringstream inputFile1{ R"(
Random text
+-------------------+------+-------+-----------+-------+
| Data | A | B | C | D |
+-------------------+------+-------+-----------+-------+
| Data 1 | 1403 | 0 | 2520 | 55.67 |
| Data 2 | 1365 | 2 | 2520 | 54.17 |
| Data 3 | 1 | 3 | 1234 | 43.12 |
Some more random text)"
};
// Regex for finding the desired data
const std::regex re(R"(\|\s+Data 1\s+\|.*?\|.*?\|.*?\|\s*([-+]?[0-9]*\.?[0-9]+)\s*\|)");
int main() {
// The result will be in here
std::smatch sm;
// Read all lines of the source file
for (std::string line{}; std::getline(inputFile1, line);) {
// If we found our matching string
if (std::regex_match(line, sm, re)) {
// Then extract the column D info
double data1D = std::stod(sm[1]);
// And show it to the user.
std::cout << data1D << "\n";
}
}
}
For most people the tricky part is how to define the regular expression. There are pages like Online regex tester and debugger. There is also a breakdown for the regex and a understandable explanation.
For our regex
\|\s+Data 1\s+\|.*?\|.*?\|.*?\|\s*([-+]?[0-9]*\.?[0-9]+)\s*\|
we get the following explanation:
\|
matches the character | literally (case sensitive)
\s+
matches any whitespace character (equal to [\r\n\t\f\v ])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Data 1 matches the characters Data 1 literally (case sensitive)
\s+
matches any whitespace character (equal to [\r\n\t\f\v ])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\|
matches the character | literally (case sensitive)
.*?
matches any character (except for line terminators)
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\|
matches the character | literally (case sensitive)
.*?
matches any character (except for line terminators)
*? Quantifier — Matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\|
matches the character | literally (case sensitive)
.*?
matches any character (except for line terminators)
\|
matches the character | literally (case sensitive)
\s*
matches any whitespace character (equal to [\r\n\t\f\v ])
1st Capturing Group ([-+]?[0-9]*\.?[0-9]+)
\s*
matches any whitespace character (equal to [\r\n\t\f\v ])
\|
matches the character | literally (case sensitive)
By the way, a more safe (more secure matching) regex would be:
\|\s+Data 1\s+\|\s*?\d+\s*?\|\s*?\d+\s*?\|\s*?\d+\s*?\|\s*([-+]?[0-9]*\.?[0-9]+)\s*\|

How to match something not defined

if I have defined something like that
COMMAND = "HI" | "HOW" | "ARE" | "YOU"
How can i say "if u match something that is not a COMMAND"?..
I tried with this one
[^COMMAND]
But didn't work..
As far as I can tell this is not possible with (current) JFlex.
We would need an effective tempered negative lookahead: ((?!bad).)*
There are two ways to do a negative lookahead in JFlex:
negation in the lookahead: x / !(y [^]*) (match x if not followed by y in the lookahead).
lookahead with negated elements: x / [^y]|y[^z] (match if x is followed by something that is !a or a!b.
Otherwise, you may get some ideas from this answer (specifically the lookaround alternatives): https://stackoverflow.com/a/37988661/8291949
Well, you can just match anything else, then
COMMAND = "HI" | "HOW" | "ARE" | "YOU"
. {throw new RuntimeException("Illegal character: <" + yytext() + ">");}

Matching five characters in Excel/VBA using RegEx, with first character being dependant on cell value

I need your help! I’d like to use RegEx in a Excel/VBA environment. I do have an approach, but I’m kind of reaching my limits...
I need to match 5 characters within a great many lines of string (the string being in column B of my excel sheet, A comes later). The 5 characters can be 5 digits or a „K“ followed by 4 digits (ex. 12345, 98765, K2345). This would be covered by (\d{5}|K\d{4}).
Them five can be preceeded or followed by letters or special characters, but not by numbers. Meaning no leading zeros are allowed and also the digits shouldn’t just be matched within a longer number. That's one point where I'm stuck.
If there’s more than one possible match in a string, I need them all to be matched. If the same number has been matched within a line already, I’d like it not to be matched again. For these two requirements, I do have a sort of solution already, that works as part of the VBA code at the end of this posting: (\d{5}|K\d{4})(?!.*?\1.*$)
In addition, I do have a specific single digit (or a „K“) in column A. I need the five characters to start with this specific character, or otherwise not be matched.
Example of strings (numbered). The two columns A and B are separated by "|" for better readability
(1) | 1 | 2018/ID11298 00000012345 PersoNR: 889899 Bridgestone BNPN
(2) | 3 | Kompo 32280EP ###Baukasten### 3789936690 ID PFK Carbon0
(3) | 2 | 20613, 20614, Mietop Antragsnummer C300Coup IVS 33221 ABF
(4) | 2 | Q21009 China lokal produzierte Derivate f/Radverbund 991222 VV
(5) | 6 | ID:61953 F-Pace Enfantillages (Machine arriere) VvSKPMG Lyon09
(6) | 2 | 2017/22222 22222 21895 Einzelkostenprob. 28932 ZürichMP KOS
(7) | K | ID:K1245 Panamera Nitsche Radlager Derivativ Bayreumion PwC
(8) | 7 | LaunchSupport QBremsen BBG BFG BBD 70142,70119 KK 70142
The results that I'm looking for here are:
(1) | 11298 | ............................. [but don't match 12345, since no preceeding numbers allowed]
(2) | 32280 | ............................. [but don't match 37899 within 3789936690]
(3) | 20613 | 20614 | ................ [match both starting with a 2, don't match the one starting with 3]
(4) | 21009 | ............................. [preceeded by a letter, which is perfectly fine
(5) | 61953 | ..............................[random example]
(6) | 22222 | 21895 | 28932 | ... [match them all, but no duplicates]
(7) | K1245 | ............................. [special case with a "K"]
(8) | 70142 | 70119 | ................ [ignore second 70142]
The RegEx/VBA Code that I've put together so far is:
Sub RegEx()
Dim varOut() As Variant
Dim objRegEx As Object
Dim lngColumn As Long
Dim objRegA As Object
Dim varArr As Variant
Dim lngUArr As Long
Dim lngTMP As Long
On Error GoTo Fin
With Worksheets("Sheet1")
varArr = .Range("B2:B50")
Set objRegEx = CreateObject("VBScript.Regexp")
With objRegEx
.Pattern = "(\d{5}|K\d{4})(?!.*?\1.*$)" 'this is where the magic happens
.Global = True
For lngUArr = 1 To UBound(varArr)
Set objRegA = .Execute(varArr(lngUArr, 1))
If objRegA.Count >= lngColumn Then
lngColumn = objRegA.Count
End If
Set objRegA = Nothing
Next lngUArr
If lngColumn = 0 Then Exit Sub
ReDim varOut(1 To UBound(varArr), 1 To lngColumn)
For lngUArr = 1 To UBound(varArr)
Set objRegA = .Execute(varArr(lngUArr, 1))
For lngTMP = 1 To objRegA.Count
varOut(lngUArr, lngTMP) = objRegA(lngTMP - 1)
Next lngTMP
Set objRegA = Nothing
Next lngUArr
End With
.Cells(2, 3).Resize(UBound(varOut), UBound(varOut, 2)) = varOut
End With
Fin:
Set objRegA = Nothing
Set objRegEx = Nothing
If Err.Number <> 0 Then MsgBox "Error: " & Err.Number & " " & Err.Description
End Sub
This code is checking the string from column B and delivering its matches in columns C, D, E etc. It's not matching duplicates. It is however matching numbers within larger numbers, which is a problem. \b for example doesn't work for me, because I still want to match 12345 in EP12345.
Also, I have no idea how to implement the character from column A to be the very first character.
I've uploaded my excel file here: mollmell.de/RegEx.xlsm
Thank you so much for suggestions
Stephan
To sort out the numbers which are too long, you can use a negative lookbehind and lookahead that doesn't match preceding and successing digits:
(?x) (?<!\d) (\d{5} | K\d{4}) (?!\d)
https://regex101.com/r/RBnoMo/1
To match only numbers with the key in column 2 is rather hard. Maybe you match either the key or the numbers and do the logic afterwards:
(?x)
\|[ ](?<key>.)[ ]\| |
(?<!\d) (?<number>\d{5} | K\d{4}) (?!\d)
https://regex101.com/r/60d0yT/2

Parse arbitrary delimiter character using Antlr4

I try to create a grammar in Antlr4 that accepts regular expressions delimited by an arbitrary character (similar as in Perl). How can I achieve this?
To be clear: My problem is not the regular expression itself (which I actually do not handle in Antlr, but in the visitor), but the delimiter characters. I can easily define the following rules to the lexer:
REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;
This will use the forward slash as the delimiter (like it is commonly used in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).
If I had to solve this using a regular expression itself, I would use a backreference like this:
m(.)(.+?)\1
(which is an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences seem not to be available in Antlr4.
It would be even better when I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all different variants.
Can this be solved with Antlr4?
You could do something like this:
lexer grammar TLexer;
REGEX
: REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
| '{' REGEX_ATOM+ '}'
| '(' REGEX_ATOM+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [/~##]
;
fragment REGEX_ATOM
: '\\' .
| ~[\\]
;
If you run the following class:
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));
for (Token t : lexer.getAllTokens()) {
System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
}
}
}
you will see the following output:
REGEX /foo/
ANY
ANY /
ANY b
ANY a
ANY r
ANY \
ANY
REGEX ~\~~
ANY
REGEX {mu}
ANY
ANY (
ANY b
ANY l
ANY a
ANY (
The {...}? is called a predicate:
Syntax of semantic predicates in Antlr4
Semantic predicates in ANTLR4?
The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to continue matching characters as long as the character matched by REGEX_DELIMITER is not ahead in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure there actually is a closing delimiter matched by the first chararcter (which is a REGEX_DELIMITER, of course).
Tested with ANTLR 4.5.3
EDIT
And to get a delimiter preceded by m + some optional spaces to work, you could try something like this (untested!):
lexer grammar TLexer;
#lexer::members {
boolean delimiterAhead(String start) {
return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
}
}
REGEX
: '/' ( '\\' . | ~[/\\] )+ '/'
| 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
| 'm' SPACES? '{' ( '\\' . | ~'}' )+ '}'
| 'm' SPACES? '(' ( '\\' . | ~')' )+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [~##]
;
fragment SPACES
: [ \t]+
;

Regular expression for string between any quotation mark

I want to write a regular expression for a string starts with quotation mark and ends with same mark. It consists of alpha numeric words (e.g. "PL", or 'CS', . . . ).
I thought about [^"].*[^"] , but this is only work for "" these.
i want output like
input: "CS300"
output: 1 tSTRING
or
input:'a'
ouput: 1 tSTRING
Thanks
my code is
%{
int linecounter=1;
%}
%%
\n linecounter++;
(['"])[^'"]*\1 printf("%d tSTRING \n", linecounter);
%%
main()
{
yylex();
}
Use negated character class and backreference:
(['"]).*?\1
Explanation:
(['"]) : matches a single or a double quote and keep it in group1
.*? : matches what is between
\1 : backreference, same quote as in group 1
If your regex flavor doesn't support lazy quantifiers:
(['"])[^'"]*\1
I write \"(\.|[^"])\" and \'(\.|[^'])\' now its working.