Regular expression to extract n-values from a string - regex

Can regex extract the values embedded within a string, as identified by a variable template defined earlier within the same string? Or is this better handled in Java?
For example: "2012 Ferrari [F12] - Ostrich Leather interior [F12#OL] - Candy Red Metallic [F12#3]" The variable template is the first string encountered with square brackets, e.g. [F12], and the desired variables are found within subsequent instances of that template, e.g. 'OL' and '3'.

Since you are mentioning Java, I'll assume you are using the Java implementation, Pattern.
Java's Pattern supports so-called back references, which match the same text that an earlier capturing group matched.
Unfortunately, a capturing group that matches repeatedly only retains its last match, so you cannot extract multiple values from a single capturing group. If you want to do this with a single pattern, you'll have to hardcode the number of variables you want to match.
For one variable, it could look like this:
\[(.*?)\].*?\[\1#(.*?)\]
  ^^^^^          ^^^^^ variable
  template    ^^ back reference to whatever template matched
You can add more optional matches by wrapping them in optional non-capturing groups like this:
\[(.*?)\].*?\[\1#(.*?)\](?:.*?\[\1#(.*?)\])?(?:.*?\[\1#(.*?)\])?
                        ^ optional group    ^ another one
This would match up to three variables:
String s = "2012 Ferrari [F12] - Ostrich Leather interior [F12#OL] - Candy Red Metallic [F12#3]";
String pattern = "\\[(.*?)\\].*?\\[\\1#(.*?)\\](?:.*?\\[\\1#(.*?)\\])?(?:.*?\\[\\1#(.*?)\\])?";
Matcher matcher = Pattern.compile(pattern).matcher(s);
if (matcher.find()) {
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println(matcher.group(i));
    }
}
// prints F12, OL, 3, null
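The limitation described above can be seen in a small sketch: when the variable part is wrapped in a repeated group, the match spans both [F12#OL] and [F12#3], yet group 2 only holds the value from the last repetition. The class and method names here are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedGroupDemo {
    // Same idea as above, but the variable part is repeated with + instead of
    // being spelled out once per expected variable.
    private static final Pattern REPEATED =
            Pattern.compile("\\[(.*?)\\](?:.*?\\[\\1#(.*?)\\])+");

    // Hypothetical helper: returns what group 2 holds after a match, or null.
    static String lastVariable(String s) {
        Matcher m = REPEATED.matcher(s);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String s = "2012 Ferrari [F12] - Ostrich Leather interior [F12#OL] - Candy Red Metallic [F12#3]";
        // Group 2 matched "OL" first, but that value was overwritten by "3".
        System.out.println(lastVariable(s)); // prints 3
    }
}
```

The "OL" value is matched along the way but is no longer reachable once the group has matched a second time.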
If you need to match any number of variables, however, you'll have to extract the template in a first pass and then embed it in a second pattern:
// compile once and store in a static variable
Pattern templatePattern = Pattern.compile("\\[(.*?)\\]");
String s = "2012 Ferrari [F12] - Ostrich Leather interior [F12#OL] - Candy Red Metallic [F12#3]";
Matcher templateMatcher = templatePattern.matcher(s);
if (!templateMatcher.find()) {
    return;
}
String template = templateMatcher.group(1);
Pattern variablePattern = Pattern.compile("\\[" + Pattern.quote(template) + "#(.*?)\\]");
Matcher variableMatcher = variablePattern.matcher(s);
while (variableMatcher.find()) {
    System.out.println(variableMatcher.group(1));
}
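As a sketch, the two-pass approach can be wrapped in a method that returns the variables as a list. The class name TemplateVariables and the method extract are hypothetical, and the second input is an invented example to show why Pattern.quote matters when the template contains regex metacharacters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemplateVariables {
    // First pass: the template is the first bracketed string.
    private static final Pattern TEMPLATE = Pattern.compile("\\[(.*?)\\]");

    // Returns every value found in [template#value]; empty if there is no template.
    static List<String> extract(String s) {
        List<String> variables = new ArrayList<>();
        Matcher templateMatcher = TEMPLATE.matcher(s);
        if (!templateMatcher.find()) {
            return variables;
        }
        String template = templateMatcher.group(1);
        // Second pass: the quoted template is matched literally.
        Pattern variablePattern =
                Pattern.compile("\\[" + Pattern.quote(template) + "#(.*?)\\]");
        Matcher variableMatcher = variablePattern.matcher(s);
        while (variableMatcher.find()) {
            variables.add(variableMatcher.group(1));
        }
        return variables;
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "2012 Ferrari [F12] - Ostrich Leather interior [F12#OL] - Candy Red Metallic [F12#3]"));
        // prints [OL, 3]
        System.out.println(extract("[A.B] [A.B#x] [AxB#y]"));
        // prints [x], because the quoted '.' does not match the 'x' in AxB
    }
}
```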

Related

How to Get substring from given QString in Qt

I have a QString like this:
QString fileData = "SOFT_PACKAGES.ABC=MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10086-1.myApplication"
What I need to do is create substrings as follows:
SoftwareName = MY_DISPLAY_OS //text after ':'
Version = 10.25.10086-1
Release = 2022-3
I tried using QString QString::sliced(qsizetype pos, qsizetype n) const, but it didn't work because I'm using Qt 5.9 and sliced() is only supported from Qt 6.0 on.
QString fileData = "SOFT_PACKAGES.ABC=MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10086-1.myApplication";
QString SoftwareName = fileData.sliced(fileData.lastIndexOf(':'), fileData.indexOf('.'));
Please help me to code this in Qt.
Use QString::split 3 times:
Split by QLatin1Char('=') to two parts:
SOFT_PACKAGES.ABC
MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10086-1.myApplication
Next, split the second part by QLatin1Char(':'). If only the first colon is a separator, split into just two parts again, so that the second part can itself contain colons:
MY_DISPLAY_OS
MY-Display-OS.2022-3.10.25.10086-1.myApplication
Finally, split 2nd part of previous step by QLatin1Char('.'):
MY-Display-OS
2022-3
10
25
10086-1
myApplication
Now just assemble your required output strings from these parts. If the exact number of parts is unknown, you can get Version = 10.25.10086-1 by removing the first two elements and the last element from the final list above, and then joining the rest with QLatin1Char('.'). If the indexes are known and fixed, you can just use QStringLiteral("%1.%2.%3").arg(....
One way is using
QString::mid(int startIndex, int howManyChar);
so you probably want something like this:
QString fileData = "SOFT_PACKAGES.ABC=MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10086-1.myApplication";
// take the text between '=' and ':'
QString SoftwareName = fileData.mid(fileData.indexOf('=')+1, (fileData.indexOf(':') - fileData.indexOf('=')-1));
To extract the other parts you requested, and if the number of '.' characters stays constant across all the strings you want to check, you can use the second argument of indexOf to shift the starting location and skip a known number of occurrences of '.', so for example
// The version starts right after the third '.'.
int StartIndex = -1;
for (int i=0; i<3; i++) {
    StartIndex = fileData.indexOf('.', StartIndex+1);
}
StartIndex += 1;
// The version contains two '.' itself, so it ends at the third '.' found from there.
int EndIndex = StartIndex-1;
for (int i=0; i<3; i++) {
    EndIndex = fileData.indexOf('.', EndIndex+1);
}
should give the right indices to be cut out with
QString SoftwareVersion = fileData.mid(StartIndex, EndIndex - StartIndex);
If the strings to be parsed are less consistent than this, try switching to regular expressions, which are the more flexible approach.
In my experience, using regular expressions for these types of tasks is generally simpler and more robust. You can do this with a regular expression as follows:
// Create the regular expression.
// Using C++ raw string literal to reduce use of escape characters.
QRegularExpression re(R"(.+=([\w_]+):[\w-]+\.(\d+-\d+)\.(\d+\.\d+\.\d+-?\d+))");
// Match against your string
auto match = re.match("SOFT_PACKAGES.ABC=MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10086-1.myApplication");
// Now extract the portions you are interested in
// match.captured(0) is always the full string that matched the entire expression
const auto softwareName = match.captured(1);
const auto version = match.captured(3);
const auto release = match.captured(2);
Of course for this to make sense, you have to understand regex, so here is my explanation of the regex used here:
.+=([\w_]+):[\w-]+\.(\d+-\d+)\.(\d+\.\d+\.\d+-?\d+)
.+=
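match all characters up to and including the equals sign (.+ is greedy, so with several equals signs it would run up to the last one that still lets the rest of the pattern match)
([\w_]+)
capture one or more word characters (letters, digits and underscores; \w already includes the underscore, so the explicit _ is redundant but harmless)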
:
a colon
[\w-]+\.
one or more alphanumeric or dash characters followed by a single period
(\d+-\d+)
capture one or more digits followed by a dash followed by one or more digits
\.
a single period
(\d+\.\d+\.\d+-?\d+)
capture three sets of digits with periods in between, then an optional dash, and one or more digits (when there is no dash, the last digit of the third set is matched here)
I think it is generally easier to write a regex that copes with changes to the input (let's say the version becomes 10.25.10087) than to parse things manually by index.
Regex is a powerful tool once you get used to it, but it can certainly seem daunting at first.
Example of this regex on regex101.com: https://regex101.com/r/dj3Z4U/1
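The expression itself is not Qt-specific. As a rough cross-check, here is a sketch that runs the pattern from the QRegularExpression code above through java.util.regex, using find() because QRegularExpression::match is likewise unanchored. The class and helper names are invented for this example. It also tries the 10.25.10087 variation mentioned above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PackageLineRegexCheck {
    // Same pattern as in the QRegularExpression code, escaped for a Java string.
    private static final Pattern LINE = Pattern.compile(
            ".+=([\\w_]+):[\\w-]+\\.(\\d+-\\d+)\\.(\\d+\\.\\d+\\.\\d+-?\\d+)");

    // Hypothetical helper: returns {softwareName, release, version}, or null.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] a = parse(
                "SOFT_PACKAGES.ABC=MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10086-1.myApplication");
        System.out.println(String.join(" | ", a));
        // prints MY_DISPLAY_OS | 2022-3 | 10.25.10086-1

        String[] b = parse(
                "SOFT_PACKAGES.ABC=MY_DISPLAY_OS:MY-Display-OS.2022-3.10.25.10087.myApplication");
        System.out.println(String.join(" | ", b));
        // prints MY_DISPLAY_OS | 2022-3 | 10.25.10087
    }
}
```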

Regex greedy to pull only required information

I have the following scenario:
CF-123/NAME-ANUBHAV/RT-INR 450/SI-No smoking/SC-123
The regex should be compatible with Java, and it needs to be done in one statement.
I have to pick some pieces of information out of this string, which are prefixed with predefined tags, and put them in named groups.
(CF-) confirmationNumber = 123
(NAME-) name = ANUBHAV
(RT-) rate = INR 450
(SI-) specialInformation = No smoking
(SC-) serviceCode = 123
I have written below regex:
^(CF-(?<confirmationNumber>.*?)(\/|$))?(([^\s]+)(\/|$))?(NAME-(?<name>.*?)(\/|$))?([^\s]+(\/|$))?(RT-(?<rate>.*?)(\/|$))?([^\s]+(\/|$))?(SI-(?<specialInformation>.*?)(\/|$))?([^\s]+(\/|$))?(SC-(?<serviceCode>.*)(\/|$))?
There can be certain scenarios.
**1st:** CF-123/**Ignore**/NAME-ANUBHAV/RT-INR 450/SI-No smoking/SC-123
**2nd:** CF-123//NAME-ANUBHAV/RT-INR 450/SI-No smoking/SC-123
**3rd:** CF-123/NAME-ANUBHAV/RT-INR 450/**Ignore**/SI-No smoking/SC-123
There can be certain tags in between, separated by /, which we don't need to capture in our named groups.
Basically we need to pick CF-, NAME-, RT-, SI-, SC- and assign them to confirmationNumber, name, rate, specialInformation, serviceCode. Anything else that appears in between does not need to be captured.
To find the five pieces of information you are interested in, you can use a pattern with named groups, compiled with java.util.regex.Pattern.
Then you can use a Matcher to find the groups:
String line = "CF-123/**Ignore**/NAME-ANUBHAV/RT-INR 450/SI-No smoking/SC-123";
String pattern = "CF-(?<confirmationNumber>[^/]+).*NAME-(?<name>[^/]+).*RT-(?<rate>[^/]+).*SI-(?<specialInformation>[^/]+).*SC-(?<serviceCode>[^/]+).*";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
After that, you can work with the matched groups:
if (m.find()) {
    String confirmationNumber = m.group("confirmationNumber");
    String name = m.group("name");
    String rate = m.group("rate");
    String specialInformation = m.group("specialInformation");
    String serviceCode = m.group("serviceCode");
    // continue with your processing
} else {
    System.out.println("NO MATCH");
}
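To check the pattern against the three scenarios from the question, the pieces above can be put together roughly like this. The class name TagExtractor, the parse method and the choice to return a map are assumptions made for the sketch. The emphasis markers around Ignore are left out of the inputs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagExtractor {
    private static final String[] GROUPS = {
        "confirmationNumber", "name", "rate", "specialInformation", "serviceCode"
    };

    // Same pattern as above; [^/]+ stops each value at the next slash.
    private static final Pattern TAGS = Pattern.compile(
            "CF-(?<confirmationNumber>[^/]+).*NAME-(?<name>[^/]+).*RT-(?<rate>[^/]+)"
            + ".*SI-(?<specialInformation>[^/]+).*SC-(?<serviceCode>[^/]+).*");

    // Returns group name -> value, or an empty map if the line does not match.
    static Map<String, String> parse(String line) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = TAGS.matcher(line);
        if (m.find()) {
            for (String group : GROUPS) {
                result.put(group, m.group(group));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {
            "CF-123/Ignore/NAME-ANUBHAV/RT-INR 450/SI-No smoking/SC-123",
            "CF-123//NAME-ANUBHAV/RT-INR 450/SI-No smoking/SC-123",
            "CF-123/NAME-ANUBHAV/RT-INR 450/Ignore/SI-No smoking/SC-123"
        };
        for (String line : lines) {
            // Each line prints the same five values.
            System.out.println(parse(line));
        }
    }
}
```

Note that none of the tags is optional in this pattern, so all five must be present and in this order. A line with a missing tag yields no match at all.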

how to get a number between two characters?

I have this string:
String values="[52,52,73,52],[23,32],[40]";
How to only get the number 40?
I'm trying this pattern "\\[^[0-9]*$\\]", I've had no luck.
Can someone provide me with the appropriate pattern?
There is no need to use ^.
The correct regex here is \\[([0-9]+)\\]$
If you are sure there is only a single number inside the [], this simpler regex would do:
\\[(\\d+)\\]
You could update your pattern to use a capturing group and a quantifier + after the character class, and omit the ^ anchor, which asserts the start of the string.
Move the end-of-string anchor $ to the end of the pattern:
\\[([0-9]+)\\]$
   ^     ^^
For example:
String regex = "\\[([0-9]+)\\]$";
String string = "[52,52,73,52],[23,32],[40]";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
    System.out.println(matcher.group(1)); // 40
}
Given that you appear to be using Java, I recommend taking advantage of String#split here:
String values = "[52,52,73,52],[23,32],[40]";
String[] parts = values.split("(?<=\\]),(?=\\[)");
String[][] contents = new String[parts.length][];
for (int i=0; i < parts.length; ++i) {
    contents[i] = parts[i].replaceAll("[\\[\\]]", "").split(",");
}
// now access any element at any position, e.g.
String forty = contents[2][0];
System.out.println(forty);
What the above snippet generates is a jagged 2D Java String array, where the first index corresponds to the array in the initial CSV, and the second index corresponds to the element inside that array.
Why not just use String.substring if you need the content between the last [ and last ]:
String values = "[52,52,73,52],[23,32],[40]";
String wanted = values.substring(values.lastIndexOf('[')+1, values.lastIndexOf(']'));
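The regex and substring approaches agree on the input from the question but behave differently when the last bracket holds more than one number. A small sketch of that difference follows, with a made-up class name and a second input invented for the comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LastBracket {
    // A bracket with digits only, directly before the end of the input.
    private static final Pattern LAST_SINGLE = Pattern.compile("\\[([0-9]+)\\]$");

    // Returns the number, or null if the last bracket is not a single number.
    static String viaRegex(String s) {
        Matcher m = LAST_SINGLE.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    // Returns whatever sits between the last '[' and the last ']'.
    static String viaSubstring(String s) {
        return s.substring(s.lastIndexOf('[') + 1, s.lastIndexOf(']'));
    }

    public static void main(String[] args) {
        String single = "[52,52,73,52],[23,32],[40]";
        System.out.println(viaRegex(single));      // prints 40
        System.out.println(viaSubstring(single));  // prints 40

        String several = "[52,52,73,52],[23,32]";
        System.out.println(viaRegex(several));     // prints null
        System.out.println(viaSubstring(several)); // prints 23,32
    }
}
```

Which behaviour is preferable depends on whether a trailing group with several numbers should count as "no result" or be returned as-is.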

java regex pattern.compile Vs matcher

I'm trying to find whether a word contains consecutive identical characters or not, using java.util.regex.Pattern. When testing a regex with a Matcher, it returns true. But if I only use it like this:
System.out.println("test:" + scanner.hasNext(Pattern.compile("(a-z)\\1")));
it returns false.
public static void test2() {
    String[] strings = { "Dauresselam", "slab", "fuss", "boolean", "clap", "tellme" };
    String regex = "([a-z])\\1";
    Pattern pattern = Pattern.compile(regex);
    for (String string : strings) {
        Matcher matcher = pattern.matcher(string);
        if (matcher.find()) {
            System.out.println(string);
        }
    }
}
This returns true. Which one is correct?
The pattern ([a-z])\\1 uses a capturing group to match a single lowercase character, followed by a backreference to whatever was captured in group 1.
If you have Dauresselam, for example, it would match the first s in the capturing group and then match the second s with the backreference. So if you want to match consecutive identical characters, you could use that pattern.
The pattern (a-z)\\1 uses a capturing group to match a-z literally and then uses a backreference to what is captured in group 1. So that would match a-za-z.
It depends on what you want. Here you use parentheses:
Pattern.compile("(a-z)\\1").
Here you use square brackets inside parentheses:
String regex = "([a-z])\\1";
To compare, you should obviously use the same pattern.
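A minimal sketch comparing the two patterns side by side. It also shows a second difference that may be in play, assuming the scanner in the question reads whitespace-separated words: Scanner.hasNext(Pattern) requires the whole next token to match, while Matcher.find() only needs a matching substring. The class and helper names are hypothetical.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

final class DoubledLetterDemo {
    static final Pattern LITERAL = Pattern.compile("(a-z)\\1");    // the text "a-z" twice
    static final Pattern CLASS = Pattern.compile("([a-z])\\1");    // any lowercase letter twice

    // Substring search, as in the loop with matcher.find().
    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    // Whole-token match, as done by scanner.hasNext(pattern).
    static boolean tokenMatches(Pattern p, String s) {
        try (Scanner scanner = new Scanner(s)) {
            return scanner.hasNext(p);
        }
    }

    public static void main(String[] args) {
        System.out.println(found(CLASS, "fuss"));          // prints true
        System.out.println(found(LITERAL, "fuss"));        // prints false
        System.out.println(found(LITERAL, "a-za-z"));      // prints true
        System.out.println(tokenMatches(CLASS, "fuss"));   // prints false
        System.out.println(tokenMatches(CLASS, "ss"));     // prints true
    }
}
```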

Capture number inside tag in Qt

My tag structure looks like this:
<sml8/>
a combination of <, sml, one or two digits, and />
Is there any way to capture the number inside the tag?
For example, above I want to capture the 8.
I've defined a regular expression and tried to capture it by digit position, but it's not working for me.
QRegExp rxlen("<sml(.*)/>");
int index = rxlen.pos(3);
I guess this is not the correct way: it gives me the position of the digit, whereas I want the value of the digit (or digits).
You need to use capturedTexts() together with the <sml(\\d{1,2})/> regex (it matches <sml literally, then 1 or 2 digits, capturing them into group 1, then />):
QString str = "<sml8/>";
QRegExp rxlen("<sml(\\d{1,2})/>");
int pos = rxlen.indexIn(str);
QStringList list = rxlen.capturedTexts();
QString my_number = list[1];