regex - how to specify the expressions to exclude - regex

I need to replace the two characters { and } with {\n and \n}.
But only when they are not enclosed in '' or "".
I tried this code to achieve that:
text = 'hello(){imagine{myString("HELLO, {WORLD}!")}}'
replaced = re.sub(r'{', "{\n", text)
Ellipsis...
Naturally, this code also replaces curly brackets that are enclosed in quote marks.
Are there negation operators, like ! or not, that can be used in regular expressions?
The following is the output I want:
hello(){
imagine{
puts("{HELLO}")
}
}
In a nutshell, what I want to do is:
Search for { and }.
If the match is not enclosed in '' or "",
replace { or } with {\n or \n}.
In the opposite case, I can solve it with (?P<a>\".*){(?P<b>.*?\").
But I have no clue how I can solve it in my case.

First replace all { characters with {\n. This also turns "{ into "{\n. Then replace every "{\n back with "{.
text = 'hello(){imagine{puts("{HELLO}")}}'
replaced = text.replace('{', '{\n').replace('"{\n', '"{')

You may match single and double quoted (C-style) string literals (those that support escape entities with backslashes) and then match { and } in any other context that you may replace with your desired values.
See Python demo:
import re
text = 'hello(){imagine{puts("{HELLO}")}}'
dblq = r'(?<!\\)(?:\\{2})*"[^"\\]*(?:\\.[^"\\]*)*"'
snlq = r"(?<!\\)(?:\\{2})*'[^'\\]*(?:\\.[^'\\]*)*'"
rx = re.compile(r'({}|{})|[{{}}]'.format(dblq, snlq))
print(rx.pattern)
def repl(m):
    if m.group(1):
        return m.group(1)
    elif m.group() == '{':
        return '{\n'
    else:
        return '\n}'
# Examples
print(rx.sub(repl, text))
print(rx.sub(repl, r'hello(){imagine{puts("Nice, Mr. \"Know-all\"")}}'))
print(rx.sub(repl, "hello(){imagine{puts('MORE {HELLO} HERE ')}}"))
The pattern that is generated in the code above is
((?<!\\)(?:\\{2})*"[^"\\]*(?:\\.[^"\\]*)*"|(?<!\\)(?:\\{2})*'[^'\\]*(?:\\.[^'\\]*)*')|[{}]
It can actually be reduced to
(?<!\\)((?:\\{2})*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'))|[{}]
See the regex demo.
Details:
The pattern matches 2 main alternatives. The first one matches single- and double-quoted string literals.
(?<!\\) - no \ immediately to the left is allowed
((?:\\{2})*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')) - Group 1:
(?:\\{2})* - 0+ repetitions of two consecutive backslashes
(?: - a non-capturing group:
"[^"\\]*(?:\\.[^"\\]*)*" - a double quoted string literal
| - or
'[^'\\]*(?:\\.[^'\\]*)*' - a single quoted string literal
) - end of the non-capturing group
| - or
[{}] - a { or }.
In the repl function, Group 1 is checked for a match. If it matched, a single- or double-quoted string literal was found and must be put back unchanged. Otherwise, if the match value is {, it is replaced with {\n; any other match is } and is replaced with \n}.
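The reduced pattern above is only shown on its own, so here is a minimal sketch that plugs it into the same kind of replacement callback. The names REDUCED and break_braces are made up for this illustration and are not part of the answer.

```python
import re

# Reduced pattern quoted above: group 1 captures a quoted string literal,
# the second alternative matches a bare { or }.
REDUCED = re.compile(
    r'''(?<!\\)((?:\\{2})*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'))|[{}]'''
)

def break_braces(text):
    """Add a newline after each { and before each } outside string literals."""
    def repl(m):
        if m.group(1) is not None:  # a quoted literal: put it back unchanged
            return m.group(1)
        return '{\n' if m.group() == '{' else '\n}'
    return REDUCED.sub(repl, text)

print(break_braces('hello(){imagine{puts("{HELLO}")}}'))
```

Testing group 1 against None rather than for truthiness makes the intent explicit; a quoted literal is never empty here, so both checks behave the same.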

Replace { with {\n:
text.replace('{', '{\n')
Replace } with \n}:
text.replace('}', '\n}')
Now to fix the braces that were quoted:
text.replace('"{\n','"{')
and
text.replace('\n}"', '}"')
Combined together:
replaced = text.replace('{', '{\n').replace('}', '\n}').replace('"{\n','"{').replace('\n}"', '}"')
Output
hello(){
imagine{
puts("{HELLO}")
}
}

You can check the similarities with the input and try to match them.
text = 'hello(){imagine{puts("{HELLO}")}}'
replaced = text.replace('){', '){\n').replace('{puts', '{\nputs').replace('}}', '\n}\n}')
print(replaced)
output:
hello(){
imagine{
puts("{HELLO}")
}
}
UPDATE
try this: https://regex101.com/r/DBgkrb/1

Related

Regex Express Return All Chars before a '/' but if there are 2 '/' Return all before that

I have been trying to get a regex expression to return me the following in the following situations.
XX -> XX
XXX -> XXX
XX/XX -> XX
XX/XX/XX -> XX/XX
XXX/XXX/XX -> XXX/XXX
I tried the following regexes; however, they do not work.
^[^/]+ => https://regex101.com/r/xvCbNB/1
=========
([A-Z])\w+ => https://regex101.com/r/xvCbNB/2
They are close but are not there.
Any Help would be appreciated.
You want to get all text from the start till the last occurrence of a specific character or till the end of string if the character is missing.
Use
^(?:.*(?=\/)|.+)
See the regex demo and the regex graph:
Details
^ - start of string
(?:.*(?=\/)|.+) - a non-capturing group that matches either of two alternatives; if the first one matches, the second is not tried:
.*(?=\/) - any 0+ chars other than line break chars, as many as possible, up to but excluding the last /
| - or
.+ - any 1+ chars other than line break chars, as many as possible.
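As a quick check of the pattern described above, here is a small Python sketch run against the sample inputs from the question. The helper name up_to_last_slash is hypothetical, and the / is left unescaped because Python's re does not need the escape.

```python
import re

# Same pattern as above; (?=/) is a lookahead, so the slash itself is not consumed.
PREFIX = re.compile(r'^(?:.*(?=/)|.+)')

def up_to_last_slash(s):
    """Return the text before the last '/', or the whole string if there is none."""
    m = PREFIX.match(s)
    return m.group() if m else None

for sample in ['XX', 'XXX', 'XX/XX', 'XX/XX/XX', 'XXX/XXX/XX']:
    print(sample, '->', up_to_last_slash(sample))
```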
It will be easier to use a replace here to match / followed by non-slash characters before end of line:
Search regex:
/[^/]*$
Replacement String:
""
Updated RegEx Demo 1
If you're looking for a regex match then use this regex:
^(.*?)(?:/[^/]*)?$
Updated RegEx Demo 2
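Both variants from this answer can be tried side by side. The sketch below uses Python's re module as an assumption (the question does not name a language), and the function names are invented for the example.

```python
import re

def strip_last_segment(s):
    """Replace variant: delete the last '/' and everything after it."""
    return re.sub(r'/[^/]*$', '', s)

def capture_prefix(s):
    """Match variant: group 1 holds the text before the optional last segment."""
    m = re.match(r'^(.*?)(?:/[^/]*)?$', s)
    return m.group(1) if m else None

for sample in ['XX', 'XX/XX', 'XX/XX/XX']:
    print(sample, '->', strip_last_segment(sample), capture_prefix(sample))
```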
Any special reason it has to be a regular expression? How about just splitting the string at the slashes, removing the last item and rejoining:
function removeItemAfterLastSlash(string) {
  const list = string.split(/\//);
  // No slash at all: nothing to remove
  if (list.length === 1) {
    return string;
  }
  list.pop();
  return list.join("/");
}
Or look for the last slash and cut the string there:
function removeItemAfterLastSlash(string) {
  const index = string.lastIndexOf("/");
  if (index === -1) {
    return string;
  }
  // slice, not splice: strings have no splice method
  return string.slice(0, index);
}
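For comparison, the same non-regex idea can be sketched in Python with str.rpartition, which splits on the last occurrence of the separator. The function name is just a snake_case rendering of the one above.

```python
def remove_item_after_last_slash(s):
    """Drop the last '/' and everything after it; leave slash-free strings alone."""
    head, sep, _tail = s.rpartition('/')
    # rpartition returns an empty separator when '/' does not occur in s
    return head if sep else s

print(remove_item_after_last_slash('XX/XX/XX'))
```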

c# Regex expression to extract all non-numeric values in brackets

This is the regex I have built so far: \{([^{]*[^0-9])\}.
"This is the sample string {0} {1} {} {abc} {12abc} {abc123}"
I wish to extract everything within the string that includes brackets and that does not contain only an integer.
(e.g) '{}'
'{abc}' '{12abc}' '{abc123}'
However the last one which contains numbers at the end is not extracted with the rest.
{abc123}
How can I extract all values in the string that are in curly brackets and do not consist only of an integer?
You may use
var res = Regex.Matches(s, @"{(?!\d+})[^{}]*}")
    .Cast<Match>()
    .Select(x => x.Value)
    .ToList();
See the regex demo and the online C# demo.
Pattern details
{ - a { char
(?!\d+}) - no 1+ digits and then } allowed immediately to the right of the current location
[^{}]* - 0+ chars other than { and }
} - a } char.
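The lookahead logic described above is not specific to .NET. As a rough cross-check, here is the same pattern in Python, with the braces escaped for safety, applied to the sample string from the question. The constant name is made up for the example.

```python
import re

# { not followed by "digits then }", then any run of non-brace chars, then }
NON_NUMERIC_BRACES = re.compile(r'\{(?!\d+\})[^{}]*\}')

sample = "This is the sample string {0} {1} {} {abc} {12abc} {abc123}"
print(NON_NUMERIC_BRACES.findall(sample))
# ['{}', '{abc}', '{12abc}', '{abc123}']
```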

Groovy complaining about illegal character range in regex

Groovy 2.4 here. I am trying to build a regex that will filter out all the following characters:
`,./;[]-&<>?:"()|
Here's my best attempt:
static void main(String[] args) {
// `,./;[]-&<>?:"()|
String regex = "`,./;[]-&<>?:\"()|"
String test = "ooekrofkrofor ` oxkeoe , wdkeodeko / kodek ] woekoedk \" swjiej ' wsjwdjeiji :"
println test.replaceAll(regex, "")
}
However this produces a compile error on the regex string definition, complaining:
illegal character range (to < from)
Not sure if this is a Java or Groovy thing, but I can't figure out how to define the regex properly so that it quiets the error and correctly strips these "illegal characters" out of my string. Any ideas?
It seems to me you want to remove all the characters listed in your regex variable. The problem is that you declared a sequence while you need a character class (enclose the characters with []).
See Groovy demo:
String regex = "[`,./;\\[\\]&<>?:\"()|-]+"
                ^     ^^^^^^          ^ ^
String test = "ooekrofkrofor ` oxkeoe , wdkeodeko / kodek ] woekoedk \" swjiej ' wsjwdjeiji :"
println test.replaceAll(regex, "")
Output: ooekrofkrofor oxkeoe wdkeodeko kodek woekoedk swjiej ' wsjwdjeiji
The pattern now contains a character class matching any of the characters defined inside it - [`,./;\[\]&<>?:\"()|-] - one or more times due to the + quantifier. Note that inside the character class, ] and [ must always be escaped, and the - can be left unescaped when placed at the start/end of the character class.
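The difference between the bare sequence and the character class can be reproduced outside Groovy. The sketch below uses Python's re module, which is an assumption for illustration only (its error message differs from Java's); the names compiles, UNBRACKETED, CHAR_CLASS and strip_punctuation are invented here.

```python
import re

def compiles(pattern):
    """Return True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The original string: the stray [ opens a class in which ]-& is a reversed range.
UNBRACKETED = '`,./;[]-&<>?:"()|'
# The fixed pattern: everything inside one class, brackets escaped, - placed last.
CHAR_CLASS = r'[`,./;\[\]&<>?:"()|-]+'

def strip_punctuation(s):
    return re.sub(CHAR_CLASS, '', s)

print(compiles(UNBRACKETED), compiles(CHAR_CLASS))
print(strip_punctuation("a`b,c/d]e\"f'g:"))
```

The apostrophe survives because it is not one of the listed characters.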
You need to escape a few special characters in your pattern:
String regex = "[`,./;\\[\\]\\-&<>?:\"\\(\\)|]+"
Note the double \\: each pair becomes a single \ in the string, so when the pattern is parsed, the next character is escaped.

Regex to remove bracket but not its contents

I would like to remove constant text and brackets from a String using regex. e.g.
INPUT                     Expected OUTPUT
var x = CONST_STR(ABC)    var x = ABC
var y = CONST_STR(DEF)    var y = DEF
How to achieve this?
Try with regex:
(?<==\s)\w+\(([^)]+)\)
DEMO
which means:
(?<==\s) - lookbehind for = and a whitespace character (\s); the match must follow these characters,
\w+ - one or more word characters (A-Za-z_0-9) this might be upgraded to naming rules of language from code you are processing,
\( - opening bracket,
([^)]+) - capturing group (\1 or $1) for one or more characters, not closing brackets,
\) - closing bracket,
Just remember that in a Java string literal each backslash \ must be doubled,
like in:
public class Test {
public static void main(String[] args) {
System.out.println("var y = CONST_STR(DEF)".replaceAll("(?<==\\s)\\w+\\(([^)]+)\\)", "$1"));
}
}
with output:
var y = DEF
The $1 in replaceAll() refers to capture group 1, i.e. the text captured by this fragment of the regex: ([^)]+) - the text inside the brackets.
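The same substitution can be sketched in Python to show that the pattern itself carries over; only the replacement syntax changes (\1 instead of $1), and a raw string avoids the doubled backslashes. The function name unwrap_call is hypothetical.

```python
import re

def unwrap_call(line):
    """Replace NAME(args) that follows '= ' with just args."""
    return re.sub(r'(?<==\s)\w+\(([^)]+)\)', r'\1', line)

print(unwrap_call('var x = CONST_STR(ABC)'))
print(unwrap_call('var y = CONST_STR(DEF)'))
```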

How to validate a string to have only certain letters by perl and regex

I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
$validEntry = 1;
print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
$validEntry = 1;
print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
$validEntry = 1;
print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
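The counting idea above, "valid when nothing falls outside the set", has a rough Python analogue using set difference. This is a sketch under the same assumptions as the tr version: it is case-sensitive and treats the empty string as valid. ALLOWED and only_allowed are names invented for the example.

```python
ALLOWED = set("ACGT")

def only_allowed(s):
    """True when no character of s lies outside ALLOWED (also true for '')."""
    return not (set(s) - ALLOWED)

print(only_allowed("AACGGGTTA"), only_allowed("AAYYGGTTA"))
```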
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can match before a trailing newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (although it doesn't make a difference whether you use that or ^ in this case, using \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
$validEntry = 1;
print "Valid\n";
}
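The three points above (the literal | in the class, the missing anchors, and $ tolerating a trailing newline) can be observed in Python as well, where re.fullmatch plays the role of \A...\z. This is only an illustrative sketch; is_valid_dna is a made-up name.

```python
import re

def is_valid_dna(s):
    """True only if the whole string is one or more of A, C, G, T (any case)."""
    return re.fullmatch(r'[ACGT]+', s, re.IGNORECASE) is not None

# Unanchored class from the question: a single allowed letter anywhere is enough.
print(re.search(r'[A|C|G|T]', 'AAYYGGTTA', re.IGNORECASE) is not None)
# The | inside the class is a literal character.
print(re.search(r'[A|C|G|T]', '|') is not None)
# ^...$ still accepts a trailing newline; fullmatch does not.
print(re.match(r'^[ACGT]+$', 'ACGT\n') is not None, is_valid_dna('ACGT\n'))
```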