Parsing of a string with the length specified within the string - regex

Example data:
029Extract this specific string. Do not capture anything else.
In the example above, I would like to capture the first n characters immediately after the 3-digit entry that defines the value of n, i.e. the 29 characters "Extract this specific string."
I can do this within a loop, but it is slow. I would like (if it is possible) to achieve this with a single regex statement instead, using some kind of backreference. Something like:
(\d{3})(.{\1})

With Perl, you can do:
use feature 'say';
my $str = '029Extract this specific string. Do not capture anything else.';
$str =~ s/^(\d{3})(.*)$/substr($2,0,$1)/e;
say $str;
output:
Extract this specific string.

You cannot do this with a single regex, but you can use the position where the regex match ends as the starting point for substr. For example, in JavaScript you can do something like this (http://jsfiddle.net/75Tm5/):
var input = "blahblah 011I want this, and 029Extract this specific string. Do not capture anything else.";
var regex = /(\d{3})/g;
var matches;
while ((matches = regex.exec(input)) !== null) {
    // regex.lastIndex is the position just after the 3-digit length prefix
    alert(input.substr(regex.lastIndex, parseInt(matches[0], 10)));
}
This returns both strings:
I want this
Extract this specific string.
Depending on what you really want, you can modify the regex to match only numbers at the beginning of a line, to take only the first match, and so on.
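The same idea can be sketched in Python: search for the length prefix, slice out that many characters, then resume searching after the slice. The function name and the digits parameter are made up for illustration, and the sketch assumes the prefix is always a fixed number of digits.

```python
import re


def extract_length_prefixed(text, digits=3):
    """Return every substring announced by a fixed-width numeric length prefix.

    Hypothetical helper: `digits` is the assumed width of the prefix.
    """
    pattern = re.compile(r"\d{%d}" % digits)
    results = []
    pos = 0
    while True:
        match = pattern.search(text, pos)
        if match is None:
            break
        length = int(match.group())
        start = match.end()  # first character after the prefix
        results.append(text[start:start + length])
        pos = start + length  # resume after the extracted text
    return results


sample = ("blahblah 011I want this, and 029Extract this specific string. "
          "Do not capture anything else.")
for item in extract_length_prefixed(sample):
    print(item)
```

Resuming the search after the extracted text means that digits inside a captured string are not mistaken for a new length prefix.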

Are you sure you need a regex?
From https://stackoverflow.com/tags/regex/info:
Fools Rush in Where Angels Fear to Tread
The tremendous power and expressivity of modern regular expressions
can seduce the gullible — or the foolhardy — into trying to use
regular expressions on every string-related task they come across.
This is a bad idea in general, ...
Here's a Python three-liner:
foo = "029Extract this specific string. Do not capture anything else."
substr_len = int(foo[:3])
print(foo[3:substr_len + 3])
And here's a PHP three-liner:
$foo = "029Extract this specific string. Do not capture anything else.";
$substr_len = (int) substr($foo,0,3);
echo substr($foo, 3, $substr_len);

Related

How do I do regex substitutions with multiple capture groups?

I'm trying to allow users to filter strings of text using a glob pattern whose only control character is *. Under the hood, I figured the easiest way to filter the list of strings would be to use Js.Re.test (https://rescript-lang.org/docs/manual/latest/api/js/re#test_), and it is (easy).
Ignoring the * on the user filter string for now, what I'm having difficulty with is escaping all the RegEx control characters. Specifically, I don't know how to replace the capture groups within the input text to create a new string.
So far, I've got this, but it's not quite right:
let input = "test^ing?123[foo";
let escapeRegExCtrl = searchStr => {
  let re = [%re("/([\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+][^\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+]*)/g")];
  let break = ref(false);
  while (!break.contents) {
    switch (Js.Re.exec_ (re, searchStr)) {
    | Some(result) => {
        let match = Js.Re.captures(result)[0];
        Js.log2("Matching: ", match)
      }
    | None => {
        break := true;
      }
    }
  }
};
input -> escapeRegExCtrl
If I disregard the "test" portion of the string being skipped, the above output will produce:
Matching: ^ing
Matching: ?123
Matching: [foo
With the above example, at the end of the day, what I'm trying to produce is this (with leading and trailing .*):
.*test\^ing\?123\[foo.*
But I'm unsure how to achieve creating a contiguous string from the matched capture groups.
(echo "test^ing?123[foo" | sed -r 's_([\^\?\[])_\\\1_g' would get the work done on the command line)
EDIT
Based on Chris Maurer's answer, there is a method in the JS library that does what I was looking for. A little digging exposed the ReasonML proxy for that method:
https://rescript-lang.org/docs/manual/latest/api/js/string#replacebyre
Let me see if I have this right; you want to implement a character matcher where everything is literal except *. Presumably the * is supposed to work like that in Windows dir commands, matching zero or more characters.
Furthermore, you want to implement it by passing a user-entered character string directly to a Regexp match function after suitably sanitizing it to only deal with the *.
If I have this right, then it sounds like you need to do two things to get the string ready for Js.Re.test:
Quote all the special regex characters, and
Turn all instances of * into .* or maybe .*?
Let's keep this simple and process the string in two steps, each one using a regex replace. The special characters in a regex are [ \ ^ $ . | ? * + ( ). Quoting all of these except * for replace:
str.replace(/[\[\\\^\$\.\|\?\+\(\)]/g, '\\$&')
This character class is just all those special characters quoted. The $& in the replacement string inserts whatever matched, so each special character ends up preceded by a backslash.
Then pass that result to a second replace for the * to .*? transformation:
str.replace(/\*+/g, '.*?')
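For comparison, here is a minimal Python sketch of the same two-step idea. It does not use the ReasonML API; it splits on * first so that the literal pieces can be escaped with re.escape, then joins them with .* and wraps the result in a leading and trailing .* as the question asks. The function name glob_to_regex is made up for illustration.

```python
import re


def glob_to_regex(glob):
    """Build a regex source string from a glob whose only wildcard is '*'.

    Hypothetical helper, not part of any library.
    """
    # Split on '*' so each literal piece can be escaped on its own.
    literal_parts = glob.split("*")
    # re.escape quotes the regex metacharacters inside a literal piece.
    escaped_parts = [re.escape(part) for part in literal_parts]
    # Every '*' becomes '.*'; the outer '.*' allow a match anywhere in the text.
    return ".*" + ".*".join(escaped_parts) + ".*"


pattern = glob_to_regex("test^ing?123[foo")
print(pattern)
print(bool(re.search(pattern, "xx test^ing?123[foo yy")))
```

Splitting before escaping avoids having to special-case * inside the escaping step, at the cost of not supporting a literal asterisk in the filter.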

Parse string using regex

I need to come up with a regular expression to parse my input string. My input string is of the format:
[alphanumeric].[alpha][numeric].[alpha][alpha][alpha].[julian date: yyyyddd]
eg:
A.A2.ABC.2014071
3.M1.MMB.2014071
I need to substring it from the 3rd position and was wondering what would be the easiest way to do it.
Desired result:
A2.ABC.2014071
M1.MMB.2014071
(?i) will be considered as case insensitive.
(?i)^[a-z\d]\.[a-z]\d\.[a-z]{3}\.\d{7}$
Here a-z means any letter from a to z, and \d means any digit from 0 to 9.
Now, if you want to remove the first section before the dot, then use this regex and replace the match with $1 (or maybe \1, depending on the language):
(?i)^[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})$
Another option is replace below with empty:
(?i)^[a-z\d]\.
If the input string is just the long form, then you want everything except the first two characters. You could arrange to substitute them with nothing:
s/^..//
Or you could arrange to capture everything except the first two characters:
/^..(.*)/
If the expression is part of a larger string, then the breakdown of the alphanumeric components becomes more important.
The details vary depending on the language that is hosting the regex. The notations written above could be Perl or PCRE (Perl Compatible Regular Expressions). Many other languages would accept these regexes too, but other languages would require tweaks.
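As one concrete instance, a Python sketch of the capture-group variant might look like the following. It assumes the whole input is a single record in the format from the question; the function name is hypothetical.

```python
import re

# [alphanumeric].[alpha][numeric].[alpha x3].[yyyyddd], case-insensitive.
RECORD = re.compile(r"^[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})$", re.IGNORECASE)


def strip_first_field(record):
    """Return the record without its first field, or None if the format is wrong."""
    match = RECORD.match(record)
    return match.group(1) if match else None


for line in ["A.A2.ABC.2014071", "3.M1.MMB.2014071"]:
    print(strip_first_field(line))
```

Unlike a plain slice from the third character, this also rejects input that does not follow the expected layout.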
Use this regex:
\w\.[A-Z]\d\.[A-Z]{3}\.\d{7}
Use the above regex like this:
String[] in = {
    "A.A2.ABC.2014071", "3.M1.MMB.2014071"
};
// The dots are escaped so that they match only a literal "."
Pattern p = Pattern.compile("\\w\\.[A-Z]\\d\\.[A-Z]{3}\\.\\d{7}");
for (String s : in) {
    Matcher m = p.matcher(s);
    while (m.find()) {
        // Drop the first field and its dot
        System.out.println("Result: " + m.group().substring(2));
    }
}
Live demo: http://ideone.com/tns9iY

Using Regex is there a way to match outside characters in a string and exclude the inside characters?

I know I can exclude outside characters in a string using look-ahead and look-behind, but I'm not sure about characters in the center.
What I want is to get a match of ABCDEF from the string ABC 123 DEF.
Is this possible with a Regex string? If not, can it be accomplished another way?
EDIT
For more clarification, in the example above I can use the regex string /ABC.*?DEF/ to sort of get what I want, but this includes everything matched by .*?. What I want is to match with something like ABC(match whatever, but then throw it out)DEF resulting in one single match of ABCDEF.
As another example, I can do the following (in pseudo-code and regex):
string myStr = "ABC 123 DEF";
string tempMatch = RegexMatch(myStr, "(?<=ABC).*?(?=DEF)"); //Returns " 123 "
string FinalString = myStr.Replace(tempMatch, ""); //Returns "ABCDEF". This is what I want
Again, is there a way to do this with a single regex string?
Since the regex replace feature in most languages does not change the string it operates on (but produces a new one), you can do it as a one-liner in most languages. Firstly, you match everything, capturing the desired parts:
^.*(ABC).*(DEF).*$
(Make sure to use the single-line/"dotall" option if your input contains line breaks!)
And then you replace this with:
$1$2
That will give you ABCDEF in one assignment.
Still, as outlined in the comments and in Mark's answer, the engine does match the stuff in between ABC and DEF. It's only the replacement convenience function that throws it out. But that is supported in pretty much every language, I would say.
Important: this approach will of course only work if your input string contains the desired pattern only once (assuming ABC and DEF are actually variable).
Example implementation in PHP:
$output = preg_replace('/^.*(ABC).*(DEF).*$/s', '$1$2', $input);
Or JavaScript (which had no single-line mode before the s flag was added in ES2018):
var output = input.replace(/^[\s\S]*(ABC)[\s\S]*(DEF)[\s\S]*$/, '$1$2');
Or C#:
string output = Regex.Replace(input, @"^.*(ABC).*(DEF).*$", "$1$2", RegexOptions.Singleline);
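A Python version of the same replace-based approach could look like this sketch. The function name and its left/right parameters are made up for illustration; re.DOTALL plays the role of the single-line option.

```python
import re


def keep_outer(text, left="ABC", right="DEF"):
    """Replace the whole input with just the two delimiters, if both occur in order."""
    pattern = re.compile(
        r"^.*(%s).*(%s).*$" % (re.escape(left), re.escape(right)),
        re.DOTALL,  # let '.' match line breaks as well
    )
    # If the pattern does not match, the text is returned unchanged.
    return pattern.sub(r"\1\2", text, count=1)


print(keep_outer("ABC 123 DEF"))
```

As with the other implementations, the engine still matches the text in between; only the replacement discards it.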
A regular expression can contain multiple capturing groups. Each group must consist of consecutive characters so it's not possible to have a single group that captures what you want, but the groups themselves do not have to be contiguous so you can combine multiple groups to get your desired result.
Regular expression
(ABC).*(DEF)
Captures
ABC
DEF
See it online: rubular
Example C# code
string myStr = "ABC 123 DEF";
Match m = Regex.Match(myStr, "(ABC).*(DEF)");
if (m.Success)
{
    string result = m.Groups[1].Value + m.Groups[2].Value; // Gives "ABCDEF"
    // ...
}

Capture multiple texts.

I have a problem with Regular Expressions.
Consider we have a string
S= "[sometext1],[sometext],[sometext]....,[sometext]"
The number of "sometexts" is unknown; it is user input and can vary from one to, for example, 1000.
[sometext] is some sequence of characters, none of which is ",", so we can describe it as [^,].
I want to capture the texts with a regular expression and then iterate through them in a loop.
QRegExp p("???");
p.exactMatch(S);
for (int i = 1; i <= p.captureCount(); i++)
{
    SomeFunction(p.cap(i));
}
For example, if the number of sometexts is 3, we can use something like this:
([^,]*),([^,]*),([^,]*).
So, I don't know what to write instead of "???" for an arbitrary n.
I'm using Qt 4.7, and I didn't find how to do this on the class reference page.
I know we can do it with loops and no regexps, or generate the regex itself in a loop, but these solutions don't suit me because the actual problem is a bit more complex than this.
A possible regular expression to match what you want is:
([^,]+?)(,|$)
This will match strings that end with a comma "," or at the end of the line. I was not sure whether the last element would have a comma or not.
An example using this regex in C#:
String textFromFile = "[sometext1],[sometext2],[sometext3],[sometext4]";
foreach (Match match in Regex.Matches(textFromFile, "([^,]+?)(,|$)"))
{
    String placeHolder = match.Groups[1].Value;
    System.Console.WriteLine(placeHolder);
}
This code prints the following to screen:
[sometext1]
[sometext2]
[sometext3]
[sometext4]
Using an example for QRegExp that I found online, here is an attempt at a solution closer to what you are looking for
(the example I found was at: http://doc.qt.nokia.com/qq/qq01-seriously-weird-qregexp.html):
QRegExp rx("([^,]+?)(,|$)");
rx.setMinimal(true); // this is in case QRegExp does not understand the +? non-greedy notation.
int pos = 0;
while ((pos = rx.indexIn(text, pos)) != -1)
{
    someFunction(rx.cap(1));
    pos += rx.matchedLength(); // advance past this match, otherwise the loop never ends
}
I hope this helps.
We can do that: use a non-capturing group to hook in the comma, and then ask for many repetitions of that block.
Try:
QRegExp p("([^,]*)(?:,([^,]*))*[.]");
Non-capturing is explained in the docs: http://doc.qt.nokia.com/latest/qregexp.html
Note that I also bracketed the . since it has meaning in RegExp and you seemed to want it to be a literal period.
As far as I know, only .NET lets you retrieve a variable number of captures from a single expression, for example (capture.*me)+
It creates a capture object that can be iterated over. Even then, it only simulates what every other regex engine provides.
Most engines provide incremental matching from within a loop until no matches are left. The global flag tells the engine to keep matching from where the last successful match left off.
Example (in Perl):
while ( $string =~ /([^,]+)/g ) { print $1,"\n" }
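The same incremental-matching loop, sketched in Python for comparison: re.finditer yields one match per item, so no fixed number of capture groups is needed. The function name is hypothetical.

```python
import re


def iter_items(text):
    """Yield each comma-separated item, one regex match at a time."""
    for match in re.finditer(r"[^,]+", text):
        yield match.group()


for item in iter_items("[sometext1],[sometext2],[sometext3],[sometext4]"):
    print(item)
```

Note that [^,]+ skips empty items between consecutive commas; a pattern such as [^,]* would need extra care to avoid spurious empty matches.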

How to match the pattern between a specified characters?

I have a fixed message delimited by "|"... tag=value is the pair between the delimiter;
(8=FIX.4.2|9=0360|35=8|49=BLPFT|56=ESP|34=8415|52=20110201-15:59:59|50=MBA|143=LN|115=MSET|57=2457172|30=CHIX|60=20110201-15:59:59.121|150=1|31=56.3100|151=71785|32=137|6=56.4058|37=9D9ZIhgu4BGU9sBtfHcYeQA|38=97370|39=1|40=1|11=20110201-05529|12=0.0012|13=2|14=25585|15=EUR|76=CHIXCCP|17=272674|47=A|167=CS|18=1|48=FR0000131104|20=0|21=1|22=4|113=N|54=1|55=BNP|207=FP|29=1|59=0|10=205|)
How can I extract the data between "11=" and the first occurrence of "|" after it?
For example, I want the data
20110201-05529
which is between "|11=" and "|".
Can you please tell me the regular expression?
The best approach will depend on how much you know about the data you are trying to match. If you know it will be comprised of numbers and dashes only:
m/11=([0-9\-]+)/
Conversely, if the data could contain any kind of characters, use:
m/11=([^|]+)/
Which matches anything that isn't a pipe character. This is probably the most reliable expression.
In both cases, the data you want is captured into the $1 special variable.
If you don't always want to match the value for the key 11, you can use variables in the pattern, so:
my $key = 11; # or any other tag number
if ($text =~ m/$key=([^|]+)/) {
    print "I found $1"; # prints "I found 20110201-05529"
}
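A Python sketch of the variable-key lookup is shown below. It goes one step beyond the pattern above by requiring the tag to start at the beginning of the message or right after a "|", so that looking up tag 11 cannot accidentally match inside a tag such as 111. The function name and the shortened sample message are made up for illustration.

```python
import re


def fix_value(message, tag):
    """Return the value of `tag` in a '|'-delimited tag=value message, or None."""
    # (?:^|\|) anchors the tag to the start of a field.
    pattern = r"(?:^|\|)%s=([^|]*)" % re.escape(str(tag))
    match = re.search(pattern, message)
    return match.group(1) if match else None


message = "8=FIX.4.2|9=0360|111=zzz|11=20110201-05529|12=0.0012|"
print(fix_value(message, 11))
```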
As always, there is more than 1 way to solve the problem. Therefore, there is no such thing as "the regular expression".
But you will definitely want to read perldoc -f split.
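The split-based alternative can be sketched in Python roughly as follows, assuming every field has the form tag=value and that tags are unique within a message. The function name and the shortened sample message are hypothetical.

```python
def parse_fix(message, delimiter="|"):
    """Split a delimited tag=value message into a dict keyed by tag (as a string)."""
    fields = {}
    for pair in message.split(delimiter):
        if "=" not in pair:
            continue  # skip empty pieces such as the one after a trailing delimiter
        tag, value = pair.split("=", 1)  # split only on the first '='
        fields[tag] = value
    return fields


fields = parse_fix("8=FIX.4.2|9=0360|11=20110201-05529|12=0.0012|")
print(fields["11"])
```

Parsing once into a dictionary is convenient when several tags are needed from the same message.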
Something like this will match everything other than =, then the =, and then capture everything other than |:
[^=]+=([^|]+)