Regular expression for string between any quotation mark - regex

I want to write a regular expression for a string that starts with a quotation mark and ends with the same mark. It consists of alphanumeric words (e.g. "PL", or 'CS', . . . ).
I thought about [^"].*[^"], but this only works for double quotes ("").
I want output like
input: "CS300"
output: 1 tSTRING
or
input: 'a'
output: 1 tSTRING
Thanks
My code is
%{
int linecounter=1;
%}
%%
\n linecounter++;
(['"])[^'"]*\1 printf("%d tSTRING \n", linecounter);
%%
main()
{
yylex();
}

Use a capturing group and a backreference:
(['"]).*?\1
Explanation:
(['"]) : matches a single or a double quote and keep it in group1
.*? : matches what is between
\1 : backreference, same quote as in group 1
If your regex flavor doesn't support lazy quantifiers:
(['"])[^'"]*\1

I wrote \"(\.|[^"])\" and \'(\.|[^'])\' and now it's working.
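For readers who want to try the backreference idea outside flex, here is a minimal Python sketch. It assumes the token is made only of ASCII letters and digits between two matching quotes, and the helper name is_quoted_string is made up for illustration. As far as I know, flex patterns do not support backreferences, which would explain why two separate rules were needed in the end.

```python
import re

# Group 1 captures the opening quote; \1 demands the same character to close.
QUOTED = re.compile(r"""(['"])[A-Za-z0-9]*\1""")


def is_quoted_string(token: str) -> bool:
    """Return True if token is alphanumerics wrapped in matching quotes."""
    return QUOTED.fullmatch(token) is not None


print(is_quoted_string('"CS300"'))  # True
print(is_quoted_string("'a'"))      # True
print(is_quoted_string("\"PL'"))    # False, the quotes do not match
```

Using fullmatch rather than search keeps a stray quote elsewhere in the input from producing a partial match.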

Related

Regex to replace single occurrence of character in C++ with another character

I am trying to replace a single occurrence of a character '1' in a String with a different character.
This same character can occur multiple times in the String which I am not interested in.
For example, in the below string I want to replace the single occurrence of 1 with 2.
input:-0001011101
output:-0002011102
I tried the regex below, but it is giving me wrong results
regex b1("(1){1}");
S1=regex_replace(S, b1, "2");
Any help would be greatly appreciated.
If you used boost::regex (the Boost regex library), you could simply use a lookaround-based solution like
(?<!1)1(?!1)
And then replace with 2.
With std::regex, you cannot use lookbehinds, but you can use a regex that captures either start of string or any one char other than your char, then matches your char, and then makes sure your char does not occur immediately on the right.
Then, you may replace with $01 backreference to Group 1 (the 0 is necessary since the $12 replacement pattern would be parsed as Group 12, an empty string here since there is no Group 12 in the match structure):
regex reg("([^1]|^)1(?!1)");
S1=std::regex_replace(S, reg, "$012");
See the C++ demo online:
#include <iostream>
#include <regex>
int main() {
std::string S = "-0001011101";
std::regex reg("([^1]|^)1(?!1)");
std::cout << std::regex_replace(S, reg, "$012") << std::endl;
return 0;
}
// => -0002011102
Details:
([^1]|^) - Capturing group 1: any char other than 1 ([^...] is a negated character class) or start of string (^ is a start of string anchor)
1 - a 1 char
(?!1) - a negative lookahead that fails the match if there is a 1 char immediately to the right of the current location.
Use a negative lookahead in the regexp to match a 1 that isn't followed by another 1:
regex b1("1(?!1)");
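The three patterns suggested in this thread can be compared side by side. The sketch below uses Python's re module instead of C++ (an assumption made purely for brevity; Python supports lookbehind, so the Boost-style pattern runs unchanged) and applies each pattern to the sample input from the question.

```python
import re

s = "-0001011101"

# Lookbehind plus lookahead: a 1 with no 1 on either side.
isolated = re.sub(r"(?<!1)1(?!1)", "2", s)

# Capture-group variant mirroring the std::regex answer: keep the preceding char.
captured = re.sub(r"([^1]|^)1(?!1)", r"\g<1>2", s)

# Lookahead only: this also rewrites the last 1 of a run such as 111.
lookahead_only = re.sub(r"1(?!1)", "2", s)

print(isolated)        # -0002011102
print(captured)        # -0002011102
print(lookahead_only)  # -0002011202
```

The lookahead-only pattern differs from the expected output because the final 1 in the run 111 is not followed by another 1, so it is replaced too. The \g<1>2 replacement plays the same role as $012 in the C++ answer: it stops the engine from reading the digits as a reference to group 12.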

Dart Regex: Only allow dot and numbers

I need to format a price string in Dart.
String can be: ₹ 2,19,990.00
String can be: $1,114.99
String can be: $14.99
What I tried:
void main() {
String str = "₹ 2,19,990.00";
RegExp regexp = RegExp("(\\d+[,.]?[\\d]*)");
RegExpMatch? match = regexp.firstMatch(str);
str = match!.group(1)!;
print(str);
}
What my output is: 2,19
What my output is: 1,114
What my output is: 14.99
Expected output: 219990.00
Expected output: 1114.99
Expected output: 14.99 (This one is correct because there is no comma)
The simplest solution would be to replace all non-digit/non-dot characters with nothing.
The most efficient way to do that is:
final re = RegExp(r"[^\d.]+");
String sanitizeCurrency(String input) => input.replaceAll(re, "");
You can't do it by matching because a match is always contiguous in the source string, and you want to omit the embedded ,s.
You can use this regex for search:
^\D+|(?<=\d),(?=\d)
And replace with an empty string i.e. "".
RegEx Details:
^: Start
\D+: Match 1+ non-digit characters
|: OR
(?<=\d),(?=\d): Match a comma if it is surrounded by digits on both sides
Code: Using replaceAll method:
str = str.replaceAll(RegExp(r'^\D+|(?<=\d),(?=\d)'), '');
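Both replacement strategies can be checked against the three sample prices. This is a rough Python sketch rather than Dart; the function names sanitize_currency and strip_prefix_and_commas are illustrative only, and the two patterns are copied from the answers above.

```python
import re


def sanitize_currency(price: str) -> str:
    """Drop every character that is not a digit or a dot."""
    return re.sub(r"[^\d.]+", "", price)


def strip_prefix_and_commas(price: str) -> str:
    """Drop leading non-digits and any comma sitting between two digits."""
    return re.sub(r"^\D+|(?<=\d),(?=\d)", "", price)


for raw in ("₹ 2,19,990.00", "$1,114.99", "$14.99"):
    print(sanitize_currency(raw), strip_prefix_and_commas(raw))
# 219990.00 219990.00
# 1114.99 1114.99
# 14.99 14.99
```

The two agree on these inputs. They would differ on input with trailing text: the first pattern removes it, while the second only touches the prefix and the commas.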

regex - how to specify the expressions to exclude

I need to replace the two characters { and } with {\n and \n}.
But they must not be enclosed in '' or "".
I tried this code to achieve that
text = 'hello(){imagine{myString("HELLO, {WORLD}!")}}'
replaced = re.sub(r'{', "{\n", text)
Ellipsis...
Naturally, this code also replaces curly brackets that are enclosed in quote marks.
What are the negative statements like ! or not that can be used in regular expressions?
And the following is what I wanted.
hello(){
imagine{
puts("{HELLO}")
}
}
In a nutshell - what I want to do is
Search for { and }.
If the brace is not enclosed in '' or "",
replace { or } with {\n or \n}.
In the opposite case, I can solve it with (?P<a>\".*){(?P<b>.*?\").
But I have no clue how I can solve it in my case.
First replace all { characters with {\n. You will also be replacing {" with {\n". Now, you can replace back all {\n" characters with {".
text = 'hello(){imagine{puts("{HELLO}")}}'
replaced = text.replace('{', '{\n').replace('{\n"','{"')
You may match single and double quoted (C-style) string literals (those that support escape entities with backslashes) and then match { and } in any other context that you may replace with your desired values.
See Python demo:
import re
text = 'hello(){imagine{puts("{HELLO}")}}'
dblq = r'(?<!\\)(?:\\{2})*"[^"\\]*(?:\\.[^"\\]*)*"'
snlq = r"(?<!\\)(?:\\{2})*'[^'\\]*(?:\\.[^'\\]*)*'"
rx = re.compile(r'({}|{})|[{{}}]'.format(dblq, snlq))
print(rx.pattern)
def repl(m):
    if m.group(1):
        return m.group(1)
    elif m.group() == '{':
        return '{\n'
    else:
        return '\n}'
# Examples
print(rx.sub(repl, text))
print(rx.sub(repl, r'hello(){imagine{puts("Nice, Mr. \"Know-all\"")}}'))
print(rx.sub(repl, "hello(){imagine{puts('MORE {HELLO} HERE ')}}"))
The pattern that is generated in the code above is
((?<!\\)(?:\\{2})*"[^"\\]*(?:\\.[^"\\]*)*"|(?<!\\)(?:\\{2})*'[^'\\]*(?:\\.[^'\\]*)*')|[{}]
It can actually be reduced to
(?<!\\)((?:\\{2})*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'))|[{}]
See the regex demo.
Details:
The pattern matches 2 main alternatives. The first one matches single- and double-quoted string literals.
(?<!\\) - no \ immediately to the left is allowed
((?:\\{2})*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')) - Group 1:
(?:\\{2})* - 0+ repetitions of two consecutive backslashes
(?: - a non-capturing group:
"[^"\\]*(?:\\.[^"\\]*)*" - a double quoted string literal
| - or
'[^'\\]*(?:\\.[^'\\]*)*' - a single quoted string literal
) - end of the non-capturing group
| - or
[{}] - a { or }.
In the repl function, Group 1 is checked for a match. If it matched, a single- or double-quoted string literal was found, and it must be put back unchanged. Otherwise, if the match value is {, it is replaced with {\n; if not, it is replaced with \n}.
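To see why matching the string literals first matters, the reduced pattern can be compared with a chained str.replace approach on an input where the quoted braces are not adjacent to the quotes. This is only a sketch: the names break_braces and naive and the sample string are made up for illustration.

```python
import re

DQ = r'"[^"\\]*(?:\\.[^"\\]*)*"'   # double-quoted literal with backslash escapes
SQ = r"'[^'\\]*(?:\\.[^'\\]*)*'"   # single-quoted literal with backslash escapes
RX = re.compile(r"(?<!\\)((?:\\{2})*(?:" + DQ + "|" + SQ + r"))|[{}]")


def break_braces(text: str) -> str:
    def repl(m):
        if m.group(1) is not None:   # a quoted literal: put it back untouched
            return m.group(0)
        return "{\n" if m.group(0) == "{" else "\n}"
    return RX.sub(repl, text)


def naive(text: str) -> str:
    # Chained replacements that only repair braces directly next to a quote.
    return (text.replace("{", "{\n").replace("}", "\n}")
                .replace('"{\n', '"{').replace('\n}"', '}"'))


sample = 'f{puts("a {b} c")}'
print(repr(break_braces(sample)))  # 'f{\nputs("a {b} c")\n}'
print(repr(naive(sample)))         # 'f{\nputs("a {\nb\n} c")\n}'
```

The regex version leaves the quoted text alone, while the chained replacements split the literal because its braces do not touch the quote characters.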
Replace { with {\n:
text.replace('{', '{\n')
Replace } with \n}:
text.replace('}', '\n}')
Now to fix the braces that were quoted:
text.replace('"{\n','"{')
and
text.replace('\n}"', '}"')
Combined together:
replaced = text.replace('{', '{\n').replace('}', '\n}').replace('"{\n','"{').replace('\n}"', '}"')
Output
hello(){
imagine{
puts("{HELLO}")
}
}
You can check the similarities with the input and try to match them.
text = 'hello(){imagine{puts("{HELLO}")}}'
replaced = text.replace('){', '){\n').replace('{puts', '{\nputs').replace('}}', '\n}\n}')
print(replaced)
output:
hello(){
imagine{
puts("{HELLO}")
}
}
UPDATE
try this: https://regex101.com/r/DBgkrb/1

Regex doesn't fetch the nested curly braces

Curly braces match in some cases and don't in a few others.
My Code:
use strict;
use warnings;
my $str1 = '$$\eqalign{&\cases{\mathdot{\bf x}=A{\bf x}+Bu\cr y=H{\bf x}}\quad{\rm with}\{\bf x}=\left(\matrix{x\cr\mathdot{x}\cr\theta\cr\mathdot{\theta}}\right),\cr&A\!=\!\!\left(\matrix{0&1&0&0\cr 0&0&-{m_{a}\over M}g&0\cr 0&0&0&1\cr 0&0&{(M\!+\!m_{a})\over Ml}g&0}\right)\!,\ B\!=\!\left(\matrix{0\cr{a\over M}\cr 0\cr-{a\over Ml}}\right)\!,\ H^{T}\!=\!\left(\matrix{1\cr 0\cr 1\cr 0}\right)\!.}$$';
my $str2 = "\\bibcite{Airdetal2013}{{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}";
my $regex = qr/(?:[^{}]*(?:{(?:[^{}]*(?:{(?:[^{}]*(?:{[^{}]*})*[^{}]*)})*[^{}]*)*})*[^{}]*)*/;
if($str1=~m/\{$regex\}/) { print "str1: $&\n"; }
if($str2=~m/\{$regex\}/) { print "str2: $&\n"; }
OUTPUT:
str1: {&\cases{\mathdot{\bf x}=A{\bf x}+Bu\cr y=H{\bf x}}\quad{\rm with}\ {\bf x}=\left(\matrix{x\cr\mathdot{x}\cr\theta\cr\mathdot{\theta}}\right),\cr&A\!=\!\!\left(\matrix{0&1&0&0\cr 0&0&-{m_{a}\over M}g&0\cr 0&0&0&1\cr 0&0&{(M\!+ !m_{a})\over Ml}g&0}\right)\!,\ B\!=\!\left(\matrix{0\cr{a\over M}\cr 0\cr-{a\over Ml}}\right)\!,\ H^{T}\!=\!\left(\matrix{1\cr 0\cr 1\cr 0}\right)\!.}
str2: {2}
The str1 output is correct. The str2 output is incorrect.
Expected Output on str2 is:
str2: {{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}
In the first sample, str1, the nested curly braces are matched. However, in the second sample, str2, the nested curly braces are not matched.
My question is how to match the nested curly braces. I am clueless. It would be great if someone could point out my mistake.
Thanks in advance.
Since your actual requirements (discussed in the chat) are to match substrings starting with \bib followed with {...} substrings or any chars other than { and }, you should use a regex with a subroutine:
/\\bib(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])*/g
Details:
\\bib - \bib literal text
(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])* - 0+ occurrences of:
({(?:[^{}]++|(?1))*}) - Group 1 (that will be recursed with (?1)) matching
{ - a literal {
(?:[^{}]++|(?1))* - 0 or more occurrences of 1+ chars other than { and } or the whole Group 1 subpattern
} - a literal }
| - or
(?!\\bib)[^{}] - a char other than { and } not starting a \bib literal char sequence.
See the sample Perl code:
use strict;
use warnings;
use feature 'say';
my $str2 = "\\bibcite{Airdetal2013}{{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}";
while($str2 =~ /\\bib(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])*/g) {
say "$&";
}
Note: The edit in the question adds \\bibcite{Airdetal2013} in front. However, this doesn't change the analysis below, as it doesn't change the overall nesting levels.
This has got to be possible to do in a better way. There is the recursive regex offered by Wiktor Stribiżew in the comments. There are modules for recursive parsing. And there are tools for parsing LaTeX.
However, out of curiosity ...
Your string, shortened suitably
my $str2 = "{{2}{2017}{{{John}{et~al.}}}{{{James}, ... {Gnali}, \& {Giito}}}}";
or, with C standing for a pair of curlies with something inside (no nesting)
"{ C C { { C C } { C, ... \& C } } }"
So you have three levels of nesting, to get down to the last pair {...} (no further nesting).
Your regex, spread out and with $nc = qr/[^{}]*/ (Non-Curlies), so that we can look at it
my $regex = qr/
    (?: $nc
        (?: {
            (?: $nc
                (?: {
                    (?: $nc (?: { $nc } )* $nc )
                }
                )* $nc
            )*
        }
        )* $nc
    )*/x;
I can count two levels here. (The $nc has no curlies so { $nc } matches my C above.)
Thus this regex cannot match that whole string.
How to fix it? Best, find another way so as not to drown in this.
Or, write it out like above, very carefully, and add the missing level.
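As one possible "other way", nesting of any depth can be handled without a regex by counting braces. The sketch below is in Python, not Perl, and swaps the regex for a plain depth counter; the function name balanced_group is hypothetical, the sample string is a shortened version of str2, and it assumes backslash-escaped braces such as \{ do not need special treatment.

```python
from typing import Optional


def balanced_group(text: str, start: int = 0) -> Optional[str]:
    """Return the first balanced {...} group at or after start, or None."""
    open_at = text.find("{", start)
    if open_at == -1:
        return None
    depth = 0
    for i in range(open_at, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:          # the opening brace has been closed
                return text[open_at:i + 1]
    return None                     # ran out of text while still nested


sample = r"\bibcite{Airdetal2013}{{2}{2017}{{{John} {et~al.}}}}"
first = balanced_group(sample)
second = balanced_group(sample, sample.index("}") + 1)
print(first)   # {Airdetal2013}
print(second)  # {{2}{2017}{{{John} {et~al.}}}}
```

Unlike the hand-unrolled pattern, the counter does not need to know the maximum nesting depth in advance.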

How to validate a string to have only certain letters by perl and regex

I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
$validEntry = 1;
print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
$validEntry = 1;
print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
$validEntry = 1;
print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
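Python has no tr operator, so the following is only a rough analogue of the counting idea, with a made-up helper name count_outside. Like the tr version above, it is case-sensitive and treats an empty string as valid, since there is nothing outside the set to count.

```python
def count_outside(seq: str, allowed: str = "ACGT") -> int:
    """Count the characters of seq that are not in allowed (like tr/ACGT//c)."""
    return sum(1 for ch in seq if ch not in allowed)


for seq in ("AACGGGTTA", "AAYYGGTTA"):
    bad = count_outside(seq)
    print(seq, "Valid" if bad == 0 else f"Invalid ({bad} bad characters)")
# AACGGGTTA Valid
# AAYYGGTTA Invalid (2 bad characters)
```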
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can match a newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (although it doesn't make a difference whether you use that or ^ in this case, using \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
$validEntry = 1;
print "Valid\n";
}
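The three pitfalls above (a | inside the character class, the missing anchors, and the trailing newline) can be reproduced in a few lines. This is a Python sketch rather than Perl; re.fullmatch stands in for the \A ... \z anchoring, and the function name is_dna is illustrative.

```python
import re


def is_dna(seq: str) -> bool:
    """True only if seq consists entirely of A, C, G, T (any case)."""
    return re.fullmatch(r"[ACGT]+", seq, flags=re.IGNORECASE) is not None


print(is_dna("AACGGGTTA"))  # True
print(is_dna("AAYYGGTTA"))  # False
print(is_dna("ACGT\n"))     # False, the trailing newline is rejected
print(is_dna("A|C"))        # False

# The original unanchored class accepts any string containing one listed
# character, and the class also contains the literal | character.
print(bool(re.search(r"[A|C|G|T]", "AAYYGGTTA", re.IGNORECASE)))  # True
print(bool(re.search(r"[A|C|G|T]", "|")))                         # True
```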