Replace patterns that are inside delimiters using a regular expression call - regex

I need to clip out all the occurrences of the pattern '--' that are inside single quotes in a long string (leaving intact the ones that are outside single quotes).
Is there a RegEx way of doing this?
(using it with an iterator from the language is OK).
For example, starting with
"xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
I should end up with:
"xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb"
So I am looking for a regex that could be run from the following languages as shown:
+-------------+------------------------------------------+
| Language | RegEx |
+-------------+------------------------------------------+
| JavaScript | input.replace(/someregex/g, "") |
| PHP | preg_replace('/someregex/', "", input) |
| Python | re.sub(r'someregex', "", input) |
| Ruby | input.gsub(/someregex/, "") |
+-------------+------------------------------------------+

I found another way to do this from an answer by Greg Hewgill at Qn138522
It is based on using this regex (adapted to contain the pattern I was looking for):
--(?=[^\']*'([^']|'[^']*')*$)
Greg explains:
"What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string. This relies on your assumption that the quotes are always balanced. This is also not very efficient."
The usage examples would be :
JavaScript: input.replace(/--(?=[^']*'([^']|'[^']*')*$)/g, "")
PHP: preg_replace('/--(?=[^\']*\'([^\']|\'[^\']*\')*$)/', "", $input)
Python: re.sub(r"--(?=[^']*'([^']|'[^']*')*$)", "", input)
Ruby: input.gsub(/--(?=[^\']*'([^']|'[^']*')*$)/, "")
I have tested this for Ruby and it provides the desired result.
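The Python form can be checked the same way. A minimal sketch using the sample string from the question (the names INSIDE_QUOTES, text and result are illustrative, not from the original post), and still assuming the quotes are balanced:

```python
import re

# "--" that is followed by an odd number of single quotes, i.e. inside a quoted region
INSIDE_QUOTES = re.compile(r"--(?=[^']*'([^']|'[^']*')*$)")

text = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
result = INSIDE_QUOTES.sub("", text)
print(result)
```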

This cannot be done with regular expressions, because you need to maintain state on whether you're inside single quotes or outside, and regex is inherently stateless. (Also, as far as I understand, single quotes can be escaped without terminating the "inside" region).
Your best bet is to iterate through the string character by character, keeping a boolean flag on whether or not you're inside a quoted region - and remove the --'s that way.
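A minimal sketch of that character-by-character approach in Python. The function name strip_dashes_inside_quotes is made up for this example, and it assumes single quotes are never escaped:

```python
def strip_dashes_inside_quotes(text):
    """Remove every "--" that occurs between a pair of single quotes."""
    out = []
    inside = False  # True while we are between an opening and a closing quote
    i = 0
    while i < len(text):
        if text[i] == "'":
            inside = not inside       # a quote toggles the state
            out.append("'")
            i += 1
        elif inside and text.startswith("--", i):
            i += 2                    # drop the two dashes
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


sample = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(strip_dashes_inside_quotes(sample))
```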

If bending the rules a little is allowed, this could work:
import re
p = re.compile(r"((?:^[^']*')?[^']*?(?:'[^']*'[^']*?)*?)(-{2,})")
txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(re.sub(p, r'\1-', txt))
Output:
xxxx rt / $ 'dfdf-fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '-ggh-' vcbcvb
The regex:
( # Group 1
(?:^[^']*')? # Start of string, up till the first single quote
[^']*? # Inside the single quotes, as few characters as possible
(?:
'[^']*' # No double dashes inside these single quotes, jump to the next.
[^']*?
)*? # as few as possible
)
(-{2,}) # The dashes themselves (Group 2)
If there were different delimiters for start and end, you could use something like this:
-{2,}(?=[^'`]*`)
Edit: I realized that if the string does not contain any quotes, it will match all double dashes in the string. One way of fixing it would be to change
(?:^[^']*')?
in the beginning to
(?:^[^']*'|(?!^))
Updated regex:
((?:^[^']*'|(?!^))[^']*?(?:'[^']*'[^']*?)*?)(-{2,})

Hm. There might be a way in Python if there are no quoted apostrophes, given that there is the (?(id/name)yes-pattern|no-pattern) construct in regular expressions, but it goes way over my head currently.
Does this help?
def remove_double_dashes_in_apostrophes(text):
    return "'".join(
        part.replace("--", "") if (ix&1) else part
        for ix, part in enumerate(text.split("'")))
Seems to work for me. What it does, is split the input text to parts on apostrophes, and replace the "--" only when the part is odd-numbered (i.e. there has been an odd number of apostrophes before the part). Note about "odd numbered": part numbering starts from zero!

You can use the following sed script, I believe:
:again
s/'\(.*\)--\(.*\)'/'\1\2'/g
t again
Store that in a file (rmdashdash.sed) and do whatever exec magic in your scripting language allows you to do the following shell equivalent:
sed -f rmdashdash.sed < file containing your input data
What the script does is:
:again <-- just a label
s/'\(.*\)--\(.*\)'/'\1\2'/g
substitute, for the pattern ' followed by anything followed by -- followed by anything followed by ', just the two anythings within quotes.
t again <-- feed the resulting string back into sed again.
Note that this script will convert '----' into '', since it is a sequence of two --'s within quotes. However, '---' will be converted into '-'.
Ain't no school like old school.

Related

Do not select if additional character is included

Suppose I have the following numbers:
3,000mt
300mt
44,000m
320m
And I want 44,000m and 320m to be selected.
What regex should I use to only select the numbers (comma separated) that have "m" in the end and not the ones that have "mt"?
This is what I have tried:
\d+[,]?\d+m.
I have no idea how to negate mt though.
You are very close to the solution and only missed the possibility to check for a word boundary (represented by regex character \b). So instead of using any character . at the end of your regular expression, you will probably only look if the string is ended by a word boundary (e.g. spaces or newlines or nothing more):
\d+(,\d+)?m\b
where
\d+ looks for any digits (at least one)
(,\d+)? looks for a comma followed by one digit or more (it's grouped by using parentheses and the whole group is completely optional using the ? sign)
m\b as explained above looks for a literal m at the end of a word
With this regex you can also match strings with one digit only followed by m like 9m or similar. This is a slight change in comparison to your regex (grouping comma followed by digits).
I tested the regex in Python and added some more edge cases:
>>> import re
>>> text = "3,000mt 300mt 44,000m 1m 1mt 1,3mt 320m"
>>> re.findall(r"\d+(?:,\d+)?m\b", text) # (?:...) is a non-capturing group, so findall returns whole matches
['44,000m', '1m', '320m']
How about a Unix solution like the one below?
> echo "3,000mt 300mt 44,000m 320m" | tr ' ' '\n' | awk -F" " ' $0~/m$/ { print } '
44,000m
320m
>

Regex match until third occurrence of a char is found, counting occurrence of said char starting from the end of string

Let's dive in : Input :
p9_rec_tonly_.cr_called.seg
p9_tonly_.cr_called.seg
p10_nor_nor_.cr_called.seg
p10_rec_tn_.cr_called.seg
p10_tn_.cr_called.seg
p26_rec_nor_nor_.cr_called.seg
p26_rec_tn_.cr_called.seg
p26_tn_.cr_called.seg
Desired output :
p9_rec
p9
p10_nor
p10_rec
p10
p26_rec_nor
p26_rec
p26
Starting from the beginning of my string, I need to match until the third occurrence of " _ " (underscore) is found, but I need to count " _ " (underscore) occurrence starting from end of string.
Any tips are appreciated,
Best regards
I believe this regex should do the trick!
^.*?(?=_[^_]*_[^_]*_[^_]*$)
Online Demo
Explanation:
^ the start of the line
.*? matches as few characters as possible (lazy)
(?=...) asserts that its contents follow our match
_[^_]*_[^_]*_[^_]* Looks for exactly three underscores after our match.
$ the end of the line
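To see the lookahead at work, here is a short Python check against a few of the sample names. It is only a sketch; PREFIX, names and prefixes are names chosen for this example:

```python
import re

# Lazy prefix, followed by exactly three underscores before the end of the string
PREFIX = re.compile(r"^.*?(?=_[^_]*_[^_]*_[^_]*$)")

names = [
    "p9_rec_tonly_.cr_called.seg",
    "p9_tonly_.cr_called.seg",
    "p10_nor_nor_.cr_called.seg",
    "p26_rec_nor_nor_.cr_called.seg",
]
prefixes = [PREFIX.search(name).group(0) for name in names]
print(prefixes)
```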
You should think beyond regex to solve this problem. For example, if you are using Python just use rsplit with a limit of 3 and get the first resulting string:
>>> data = [
...     'p9_rec_tonly_.cr_called.seg',
...     'p9_tonly_.cr_called.seg',
...     'p10_nor_nor_.cr_called.seg',
...     'p10_rec_tn_.cr_called.seg',
...     'p10_tn_.cr_called.seg',
...     'p26_rec_nor_nor_.cr_called.seg',
...     'p26_rec_tn_.cr_called.seg',
...     'p26_tn_.cr_called.seg',
... ]
>>> for d in data:
...     print(d.rsplit('_', 3)[0])
...
p9_rec
p9
p10_nor
p10_rec
p10
p26_rec_nor
p26_rec
p26
bash you say? Well it's not a regular expression but you can do pattern substitutions (or stripping with bash):
while read var ; do echo ${var%_*_*_*} ; done <<EOT
p9_rec_tonly_.cr_called.seg
p9_tonly_.cr_called.seg
p10_nor_nor_.cr_called.seg
p10_rec_tn_.cr_called.seg
p10_tn_.cr_called.seg
p26_rec_nor_nor_.cr_called.seg
p26_rec_tn_.cr_called.seg
p26_tn_.cr_called.seg
EOT
${var%_*_*_*} expands the variable var, stripping the shortest suffix that matches _*_*_*.
Otherwise to perform regex operations in shell, you could normally ask a utility like sed for help and feed your lines through for instance this:
sed -e 's#_[^_]*_[^_]*_[^_]*$##'
or for short:
sed -e 's#\(_[^_]*\)\{3\}$##'
Find three groups of _ and zero or more characters of not _ at the end of line $ replacing them with nothing ('').

Replace a sequence of characters with a sequence of different characters of same length using regular expressions

I have a string which starts with spaces. I want to replace the leading spaces with equal number of dashes -. I don't want to replace any other spaces which may occur elsewhere in the string.
If I use /^\s*/-/, it only replaces with a single dash. If I use /^\s/-/, it only replaces the first space with a dash. If I remove the anchor /\s/-/, it replaces every occurrence of a space in the string, which is not acceptable.
My string looks like this in general:
<n-leading-spaces><a-non-space-character><remaining-characters>
Example (pipes added to show the boundary):
|   ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn |
After substitution (pipes added to show the boundary):
|---ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn |
NOTE: I cannot use any code snippet. I just want to know whether this can be done using just regex patterns. (Forgive my formatting as I'm new to markdown. I welcome formatting corrections)
You can use the following solution to replace a sequence of characters with a sequence of different characters of same length using regular expressions:
my $string = '    ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn ';
$string =~ s/^(\s+)/"-" x length($1)/eg;
print $string;
Returns '----ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn '
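For comparison, the same idea in Python passes a function as the replacement, which plays the role of Perl's /e flag. Like the Perl version, this is a callback and not a pattern-only solution, so it does not meet the "regex patterns only" constraint in the question. The name dash_leading_spaces is made up for this sketch:

```python
import re


def dash_leading_spaces(s):
    # Replace the run of leading whitespace with the same number of dashes;
    # whitespace elsewhere in the string is left alone.
    return re.sub(r"^\s+", lambda m: "-" * len(m.group(0)), s)


print(dash_leading_spaces("   ajfn ssfdjn ng "))
```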

Trim end of string

I'm having trouble trimming off some characters at the end of a string. The string usually looks like:
C:\blah1\blah2
But sometimes it looks like:
C:\blah1\blah2.extra
I need to extract out the string 'blah2'. Most of the time, that's easy with a substring command. But on the rare occasions when the '.extra' portion is present, I need to first trim that part off.
The thing is, '.extra' always begins with a dot, but then is followed by various combinations of letters with various lengths. So wildcards will be necessary. Essentially, I need to script, "If the string contains a dot, trim off the dot and anything following it."
$string.replace(".*","") doesn't work. Nor does $string.replace(".\*",""). Nor does $string.replace(".[A-Z]","").
Also, I can't get at it from the beginning of the string either. 'blah1' is unknown and of various lengths. I have to get at 'blah2' from the end of the string.
Assuming that the string is always a path to a file with or without an extension (such as ".extra"), you can use Path.GetFileNameWithoutExtension():
PS C:\> [System.IO.Path]::GetFileNameWithoutExtension("C:\blah1\blah2")
blah2
PS C:\> [System.IO.Path]::GetFileNameWithoutExtension("C:\blah1\blah2.extra")
blah2
The path doesn't even have to be rooted:
PS C:\> [System.IO.Path]::GetFileNameWithoutExtension("blah1\blah2.extra")
blah2
If you want to implement similar functionality on your own, that should be fairly simple as well - use String.LastIndexOf() to find the last \ in the string and use that as your starting argument for Substring():
function Extract-Name {
    param($NameString)
    # Extract part after the last occurrence of \
    if($NameString -like '*\*') {
        $NameString = $NameString.Substring($NameString.LastIndexOf('\') + 1)
    }
    # Remove anything after a potential . (assign the result, otherwise both values are output)
    if($NameString -like '*.*') {
        $NameString = $NameString.Remove($NameString.IndexOf("."))
    }
    $NameString
}
And you'll see similar results:
PS C:\> Extract-Name "C:\blah1\blah2.extra"
blah2
PS C:\> Extract-Name "C:\blah124323\blah2.extra"
blah2
PS C:\> Extract-Name "C:\blah124323\blah2"
blah2
PS C:\> Extract-Name "abc124323\blah2"
blah2
As the other posters have said, you can use special file name manipulators for this. If you'd like to do it with regular expressions, use the -replace operator (the .Replace() method matches its argument literally):
$string -replace "\..*",""
The \..* regex matches a dot (\.) and then any string of characters (.*).
Let me address each of the non-working regexes individually:
$string.replace(".*","")
The reason this doesn't work is that . and * are both special characters in regular expressions: . is a wildcard character that matches any character, and * means "match the previous character zero or more times." So .* means "any string of characters."
$string.replace(".\*","")
In this instance, you're escaping the * character, meaning that the regex treats it literally, so the regex matches any single character (.) followed by a star (\*).
$string.replace(".[A-Z]","")
In this case, the regex will match any character (.) followed by any single capital letter ([A-Z]).
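The behaviour of these patterns can be checked quickly in any regex engine. Here is a sketch using Python's re module, on the assumption that these simple patterns behave the same way there as in .NET; the names path and results are illustrative:

```python
import re

path = r"C:\blah1\blah2.extra"

# Apply each pattern as a regex and remove whatever it matches.
results = {
    pattern: re.sub(pattern, "", path)
    for pattern in (r".*", r".\*", r".[A-Z]", r"\..*")
}
for pattern, result in results.items():
    print(f"{pattern!r:12} -> {result!r}")
```

Only the last pattern, an escaped dot followed by .*, removes just the extension; the first removes everything and the other two find nothing to match in this string.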
If the strings are actual paths, using Get-Item would be another option:
$path = 'C:\blah1\blah2.something'
(Get-Item $path).BaseName
The String.Replace() method can't be used here, because it doesn't support wildcards or regular expressions; the -replace operator does take a regular expression.

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';
/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);
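The same pattern can be tried in Python against the string from the question. This is a sketch; a raw string is used so that the backslashes are literal characters in the text, and the names s, QUOTED and m are illustrative:

```python
import re

# Raw string: the text really contains \' and \" (backslash followed by a quote).
s = r'function(){ return " It\'s big \"problem "; }'

# A quote, then any run of (non-quote, non-backslash) or (backslash + any char), then a quote
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')

m = QUOTED.search(s)
print(m.group(0))
```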
This one comes from nanorc.sample, available in many Linux distros. It is used for syntax highlighting of C-style strings:
\"(\\.|[^\"])*\"
As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
Most of the solutions provided here use alternative repetition paths i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some pattern compilers implement this using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.
/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string
"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, for it is known beforehand that a string without a closing quote wouldn't match in any case.
This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
Every quoted string starts with Char: " ;
It may contain any number of any characters: .*? {Lazy match}; ending with non escape character [^\\];
Statement (2) is Lazy(!) optional because string can be empty(""). So: (.*?[^\\])??
Finally, every quoted string ends with Char("), but it can be preceded by an even number of escape signs, i.e. pairs (\\\\)+; and this is Greedy(!) optional: ((\\\\)+)?+ {Greedy matching}, because the string can be empty or without ending pairs!
An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Let's say you had the following string: String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
// Match close
'([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
'((?:' +
// Match escaped close quote
'(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
// Match everything thats not the close quote
'(?:(?!\\1).)' +
'){0,})' +
// Match open
'(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
'g'
);
// Reverse the matched strings.
const matches = myString
// Reverse the string.
.split('').reverse().join('')
// '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
// Match the quoted
.match(regExp)
// ['"hctam "\dluohs"\ siht"', '"dluohs"']
// Reverse the matches
.map(x => x.split('').reverse().join(''))
// ['"this \"should\" match"', '"should"']
// Re order the matches
.reverse();
// ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
The regexp can easily be broken into three pieces, as follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # A pair of escape characters
\\ # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form: generated using Jex's Regulex
Image on github (JavaScript Regular Expression Visualizer.)
Sorry, I don't have a high enough reputation to include images, so, it's just a link for now.
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js
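Here is one that works with both " and ', and you can easily add others at the start.
("|')(?:\\\1|[^\1])*?\1
It uses the backreference (\1) to match exactly what is in the first group (" or '). Note that in most regex flavors \1 inside a character class is not a backreference, so [^\1] does not actually exclude the opening quote; (?!\1). expresses that intent.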
http://www.regular-expressions.info/backref.html
One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and linear, manual seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
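As an illustration of the cursor approach, here is a small Python scanner that walks the text once and collects quoted substrings while skipping backslash-escaped characters. It is a sketch under stated assumptions: the name find_quoted and its parameters are made up for this example, backslash is the only escape character, and an escaped quote outside a string is not treated as an opening quote:

```python
def find_quoted(text, quote='"', escape="\\"):
    """Return the quoted substrings of text (quotes included), honouring escapes."""
    found = []
    start = None  # index of the opening quote, or None while outside a string
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == escape:
            i += 2            # skip the escape and the character it protects
            continue
        if ch == quote:
            if start is None:
                start = i     # opening quote
            else:
                found.append(text[start:i + 1])  # closing quote
                start = None
        i += 1
    return found


print(find_quoted(r'String \"this "should" NOT match\" and "this \"should\" match"'))
```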
A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)
If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"
I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
If your IDE is IntelliJ IDEA, you can forget all these headaches: store your regex in a String variable, and when you paste it inside the double quotes it is automatically escaped into a form Java accepts.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.
(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "
Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"