Regexp so backslash means 'delete next char' - regex

I need a regular expression that will do the following transformation:
Input: ab\xy
Output: aby
Input: ab\\xy
Output: ab\xy
Consider all of those backslashes as LITERAL backslashes. That is, the first input is the sequence of characters ['a', 'b', '\', 'x', 'y'], and the second is ['a', 'b', '\', '\', 'x', 'y'].
The rule is "in a string of characters, if a backslash is encountered, delete it and the following character ... unless the following character is a backslash, in which case delete only one of the two backslashes."
This is escape sequence hell and I can't seem to find my way out.

You may use
(?s)\\(\\)|\\.
and replace with $1 to restore the \ when a double backslash is found.
Details:
(?s) - a dotall modifier so that . can match any char, including line break chars
\\(\\) - matches a backslash and then matches and captures another backslash into Group 1
| - or
\\. - matches any escape sequence (a backslash + any char).
See the regex demo and a PHP demo:
$re = '/\\\\(\\\\)|\\\\./s';
$str = 'ab\\xy ab\\\\xy ab\\\\\\xy';
echo $result = preg_replace($re, '$1', $str);
// => aby ab\xy ab\y
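For comparison, here is a minimal Python sketch of the same two-alternative pattern. The function name strip_escapes is only an illustration, and a callback is used for the replacement so that an unmatched Group 1 turns into an empty string explicitly:

```python
import re

# Either a doubled backslash (second one captured) or a backslash plus any char.
ESCAPE = re.compile(r"\\(\\)|\\.", re.DOTALL)

def strip_escapes(text: str) -> str:
    """Delete each backslash together with the character after it,
    but keep one backslash out of every doubled pair."""
    # group(1) is None when the second alternative matched, so fall back to "".
    return ESCAPE.sub(lambda m: m.group(1) or "", text)

if __name__ == "__main__":
    for sample in (r"ab\xy", r"ab\\xy", r"ab\\\xy"):
        print(sample, "=>", strip_escapes(sample))
```

Raw string literals (r"...") keep the sample inputs readable, since every backslash in them is a literal backslash.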

Related

perl regex for variable substitution

I want to substitute variables marked by a "#" and terminated by a dot or a non-alphanumeric character.
Example: Variable #name should be substituted by "Peter"
abc#name.def => abcPeterdef
abc#namedef => abc#namedef
abc#name-def => abcPeter-def
So if the variable is terminated with a dot, it is replaced and the dot removed. If it is terminated by any other non-alphanumeric character, it is replaced as well and that character is kept.
I use the following:
s/#name\./Peter/i
s/#name(\W)/Peter$1/i
This works but is it possible to merge it into one expression?
There are several possible approaches.
s/#name(\W)/"Peter" . ($1 eq "." ? "" : $1)/e
Here we use /e to turn the replacement part into an expression, so we can inspect $1 and choose the replacement string dynamically.
s/#name(?|\.()|([^.\w]))/Peter$1/
Here we use (?| ) to reset the numbering of capture groups between branches, so both \.() and ([^.\w]) set $1. If a . is matched, $1 becomes the empty string; otherwise it contains the matched character.
You may use
s/#name(?|\.()|(\W))/Peter$1/i
Details
#name - matches the literal substring
(?|\.()|(\W)) - a branch reset group matching either of the two alternatives:
\.() - a dot and then captures an empty string into $1
| - or
(\W) - any non-word char captured into $1.
So, upon a match, $1 placeholder is either empty or contains any non-word char other than a dot.
You can do this by using either a literal dot or a word boundary for the terminator
Like this
s/#name(?:\.|\b)/Peter/i
Here's a complete program that reproduces the required output shown in your question
use strict;
use warnings 'all';
for my $s ( 'abc#name.def', 'abc#namedef', 'abc#name-def' ) {
( my $s2 = $s ) =~ s/#name(?:\.|\b)/Peter/i;
printf "%-12s => %-s\n", $s, $s2;
}
output
abc#name.def => abcPeterdef
abc#namedef => abc#namedef
abc#name-def => abcPeter-def
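Python's re module has no branch reset group, but the callback approach and the word-boundary approach carry over. The following is only a sketch; the function names and the default value "Peter" are placeholders for illustration:

```python
import re

def substitute_callback(text: str, value: str = "Peter") -> str:
    # Counterpart of the /e approach: look at the terminator and
    # drop it only when it is a dot.
    return re.sub(
        r"#name(\W)",
        lambda m: value + ("" if m.group(1) == "." else m.group(1)),
        text,
        flags=re.IGNORECASE,
    )

def substitute_boundary(text: str, value: str = "Peter") -> str:
    # Counterpart of s/#name(?:\.|\b)/Peter/i: consume a dot if present,
    # otherwise only require a word boundary after "name".
    return re.sub(r"#name(?:\.|\b)", lambda m: value, text, flags=re.IGNORECASE)

if __name__ == "__main__":
    for s in ("abc#name.def", "abc#namedef", "abc#name-def"):
        print(s, "=>", substitute_callback(s), "|", substitute_boundary(s))
```

One difference to be aware of: when #name is the very last thing in the string, the (\W) variant finds no terminator and leaves the text alone, while the \b variant still substitutes.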

Regular expression exclude double character

I need a regular expression that finds a character which is not followed by the same character, i.e. one that excludes doubled or further repeated occurrences of that character.
For example, when I need to find the 'e' character in the string "need only single characteeer", the result for each word would be as follows:
"need" > not match because it has double 'e'
"only" > not match because no 'e'
"single" > match because has only single 'e'
"characteeer" > not match because has multiple of 'e'
Not sure whether it is possible or not. Any answer or comment will be highly appreciated. Thanks in advance.
UPDATE
Maybe my question above is still ambiguous. Actually I need to find the 'e' character itself rather than the words. I am going to replace it with a doubled character, so occurrences that are already doubled will not be replaced.
The main purpose is to replace 'e' with 'ee' for example. But the one which has 'ee' or 'eee' already, or even more 'e', will be untouched.
UPDATE:
(?<!e)e(?!e)
Matches an e that is not adjacent to another e: the negative lookbehind (?<!e) rejects a preceding e, and the negative lookahead (?!e) rejects a following e.
Can be checked here
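As a quick illustration of the replacement the question asks for, here is a small Python sketch using the same lookbehind/lookahead pair (the function name double_lone_e is made up for this example):

```python
import re

# An "e" with no "e" directly before it and no "e" directly after it.
LONE_E = re.compile(r"(?<!e)e(?!e)")

def double_lone_e(text: str) -> str:
    # Only isolated e's are doubled; "ee", "eee", ... are left untouched.
    return LONE_E.sub("ee", text)

if __name__ == "__main__":
    print(double_lone_e("need only single characteeer"))
    # need only singlee characteeer
```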
\b(([A-Za-z])(?!\2))+\b
Will match a word (a sequence of one or more characters in A-Za-z), using a negative lookahead that prevents the following character from being the same as the one just captured in group 2. The same pattern with a non-capturing outer group:
/\b(?:([A-Za-z])(?!\1))+\b/g
However, "only" will also match, because it contains no repeated character.
To match a word containing e but no ee:
/(?<![a-z])(?=[a-z]*e)(?![a-z]*ee)[a-z]+/gi
/\b([a-df-z]*e[a-df-z]*)\b\s*/g
You could add the flag case insensitive /i if needed.
Explanation:
/ : regex delimiter
\b : word boundary
( : start group 1
[a-df-z]* : 0 or more letter that is not "e"
e : 1 letter "e"
[a-df-z]* : 0 or more letter that is not "e"
) : end group 1
\b : word boundary
\s* : 0 or more spaces
/g : regex delimiter, global flag
As you didn't give which language you're using, here is a perl script:
use feature 'say';
use Data::Dumper;
my $str = "need only single characteeer";
my @list = $str =~ /\b([a-df-z]*e[a-df-z]*)\b\s*/g;
say Dumper\@list;
Output:
$VAR1 = [
'single'
];
And a php script:
$str = "need only single characteeer";
preg_match_all("/\b([a-df-z]*e[a-df-z]*)\b\s*/", $str, $match);
print_r($match);
Output:
Array
(
[0] => Array
(
[0] => single
)
[1] => Array
(
[0] => single
)
)
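The two word-level patterns in this thread are not equivalent, which a short Python sketch can show. The extra word "there" is added to the sample purely for illustration; it contains two separate e's but no "ee":

```python
import re

text = "need only single characteeer there"

# Exactly one "e" in the word: letters other than e, one e, letters other than e.
exactly_one_e = re.findall(r"\b[a-df-z]*e[a-df-z]*\b", text)

# At least one "e" somewhere in the word, but never two in a row.
no_doubled_e = re.findall(
    r"(?<![a-z])(?=[a-z]*e)(?![a-z]*ee)[a-z]+", text, flags=re.IGNORECASE
)

if __name__ == "__main__":
    print(exactly_one_e)  # ['single']
    print(no_doubled_e)   # ['single', 'there']
```

Which of the two is wanted depends on whether a word like "there" should count as having a "single" e.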

Filter out chars not in a set

I am trying to filter all strings that I pass through my system, so that I only send out valid chars.
The following are allowed.
a-z
A-Z
"-" (hyphen, 0x2D)
" " (space, 0x20)
"'" (single quote, 0x27)
"~" (tilde, 0x7E)
Now I can come up with a regex that searches for chars in this set. But what I need is a regex that matches chars outside this set, so I can replace them with nothing.
Any ideas?
Here is a way you can do it. You tagged Perl, so i will give you a perlish solution:
my $string = q{That is a ~ v%^&*()ery co$ol ' but not 4 realistic T3st};
print $string . "\n";
$string =~ s{[^-a-zA-Z '~]}{}g;
print $string . "\n";
Prints:
That is a ~ v%^&*()ery co$ol ' but not 4 realistic T3st
That is a ~ very cool ' but not  realistic Tst
To make it clear:
$string =~ s{[^-a-zA-Z '~]}{}g;
This matches every character that is not listed inside the negated class [^...] and replaces it with nothing. The g flag at the end of the substitution makes it replace every such character rather than only the first one.
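The same negated character class works unchanged in Python. A minimal sketch, with keep_allowed as a made-up helper name:

```python
import re

# Anything that is NOT a hyphen, an ASCII letter, a space, an apostrophe or a tilde.
# The hyphen comes first inside the class so it is taken literally.
DISALLOWED = re.compile(r"[^-a-zA-Z '~]")

def keep_allowed(text: str) -> str:
    return DISALLOWED.sub("", text)

if __name__ == "__main__":
    sample = "That is a ~ v%^&*()ery co$ol ' but not 4 realistic T3st"
    print(keep_allowed(sample))
```

Note that removing the "4" leaves the two spaces around it in place, so the result contains a double space at that point.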
A regular expression matching the allowed characters listed in the question is:
[a-zA-Z \-~]|\x27
For further information refer http://www.regular-expressions.info/quickstart.html

when [:punct:] is too much [duplicate]

I'm cleaning text strings in R. I want to remove all the punctuation except apostrophes and hyphens. This means I can't use the [:punct:] character class (unless there's a way of saying [:punct:] but not '-).
! " # $ % & ( ) * + , . / : ; < = > ? @ [ \ ] ^ _ { | } ~. and backtick must come out.
For most of the above, escaping is not an issue. But for square brackets, I'm really having issues. Here's what I've tried:
gsub('[abc]', 'L', 'abcdef') #expected behaviour, shown as sanity check
# [1] "LLLdef"
gsub('[[]]', 'B', 'it[]') #only 1 substitution, ie [] treated as a single character
# [1] "itB"
gsub('[\[\]]', 'B', 'it[]') #single escape, errors as expected
Error: '\[' is an unrecognized escape in character string starting "'[\["
gsub('[\\[\\]]', 'B', 'it[]') #double escape, single substitution
# [1] "itB"
gsub('[\\]\\[]', 'B', 'it[]') #double escape, reversed order, NO substitution
# [1] "it[]"
I'd prefer not to use fixed=TRUE with gsub since that will prevent me from using a character class. So, how do I include square brackets in a regex character class?
ETA additional trials:
gsub('[[\\]]', 'B', 'it[]') #double escape on closing ] only, single substitution
# [1] "itB"
gsub('[[\]]', 'B', 'it[]') #single escape on closing ] only, expected error
Error: '\]' is an unrecognized escape in character string starting "'[[\]"
ETA: the single substitution was caused by not setting perl=T in my gsub calls. ie:
gsub('[[\\]]', 'B', 'it[]', perl=T)
You can use [:punct:], when you combine it with a negative lookahead
(?!['-])[[:punct:]]
This way a [:punct:] character is only matched if it is not in ['-]. The negative lookahead assertion (?!['-]) ensures this condition: it fails when the next character is a ' or a -, and then the complete expression fails at that position.
Inside a character class you only need to escape the closing square bracket:
Try using '[[\\]]' or '[[\]]' (I am not sure about escaping the backslash as I don't know R.)
See this example.
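For readers outside R, the lookahead trick can be sketched in Python as well. Python's re module has no [[:punct:]] class, so this sketch assumes string.punctuation (ASCII punctuation only) as a stand-in; the names are illustrative:

```python
import re
import string

# Stand-in for [[:punct:]]: a class built from the ASCII punctuation characters.
PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"

# Match a punctuation character only if it is neither ' nor -.
PUNCT_EXCEPT = re.compile(r"(?!['-])" + PUNCT_CLASS)

# Square brackets inside a character class, both escaped.
BRACKETS = re.compile(r"[\[\]]")

def strip_punct(text: str) -> str:
    return PUNCT_EXCEPT.sub("", text)

if __name__ == "__main__":
    print(strip_punct("it[]'s well-known, really!"))  # it's well-known really
    print(BRACKETS.sub("B", "it[]"))                  # itBB
```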

How can I extract substrings from a string in Perl?

Consider the following strings:
1) Scheme ID: abc-456-hu5t10 (High priority) *****
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *****
and so on in the above format; the parts in bold are what changes across the strings.
==> Imagine I've many strings of format Shown above.
I want to pick 3 substrings (As shown in BOLD below) from the each of the above strings.
1st substring containing the alphanumeric value (in eg above it's "abc-456-hu5t10")
2nd substring containing the word (in eg above it's "High priority")
3rd substring containing * (IF * is present at the end of the string ELSE leave it )
How do I pick these 3 substrings from each string shown above? I know it can be done using regular expressions in Perl... Can you help with this?
You could do something like this:
my $data = <<END;
1) Scheme ID: abc-456-hu5t10 (High priority) *
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *
END
foreach (split(/\n/,$data)) {
$_ =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/ || next;
my ($id,$word,$star) = ($1,$2,$3 // ''); # $3 is undef when no * follows
print "$id $word $star\n";
}
The key thing is the Regular expression:
Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?
Which breaks up as follows.
The fixed String "Scheme ID: ":
Scheme ID:
Followed by one or more of the characters a-z, 0-9 or -. We use the brackets to capture it as $1:
([a-z0-9-]+)
Followed by one or more whitespace characters:
\s+
Followed by an opening bracket (which we escape) followed by any number of characters which aren't a close bracket, and then a closing bracket (escaped). We use unescaped brackets to capture the words as $2:
\(([^)]+)\)
Followed by some spaces and maybe a *, captured as $3:
\s*(\*)?
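The breakdown above can be tried out with the same pattern in Python. This is only a sketch; extract is a hypothetical helper, and a missing star is returned as an empty string:

```python
import re

SCHEME = re.compile(r"Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?")

lines = [
    "1) Scheme ID: abc-456-hu5t10 (High priority) *",
    "2) Scheme ID: frt-78f-hj542w (Balanced)",
    "3) Scheme ID: 23f-f974-nm54w (super formula run) *",
]

def extract(line: str):
    """Return (id, description, star) or None if the line does not match."""
    m = SCHEME.search(line)
    if m is None:
        return None
    # default="" replaces the unmatched optional star group.
    return m.groups(default="")

if __name__ == "__main__":
    for line in lines:
        print(extract(line))
```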
You could use a regular expression such as the following:
/([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/
So for example:
$s = "abc-456-hu5t10 (High priority) *";
$s =~ /([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/;
print "$1\n$2\n$3\n";
prints
abc-456-hu5t10
High priority
*
(\S*)\s*\((.*?)\)\s*(\*?)
(\S*) picks up anything which is NOT whitespace
\s* 0 or more whitespace characters
\( a literal open parenthesis
(.*?) anything, non-greedy so stops on first occurrence of...
\) a literal close parenthesis
\s* 0 or more whitespace characters
(\*?) 0 or 1 occurrences of literal *
Well, a one liner here:
perl -lne 'm|Scheme ID:\s+(.*?)\s+\((.*?)\)\s?(\*)?|g&&print "$1:$2:$3"' file.txt
Expanded to a simple script to explain things a bit better:
#!/usr/bin/perl -ln
#-w : warnings
#-l : print newline after every print
#-n : apply script body to stdin or files listed at commandline, dont print $_
use strict; #always do this.
my $regex = qr{ # precompile regex
Scheme\ ID: # to match beginning of line.
\s+ # 1 or more whitespace
(.*?) # Non greedy match of all characters up to
\s+ # 1 or more whitespace
\( # parenthesis literal
(.*?) # non-greedy match to the next
\) # closing literal parenthesis
\s* # 0 or more whitespace (trailing * is optional)
(\*)? # 0 or 1 literal *s
}x; #x switch allows whitespace in regex to allow documentation.
#values trapped in $1 $2 $3, so do whatever you need to:
#Perl lets you use any characters as delimiters, i like pipes because
#they reduce the amount of escaping when using file paths
m|$regex| && print "$1 : $2 : $3";
#alternatively if(m|$regex|) {doOne($1); doTwo($2) ... }
Though if it were anything other than formatting, I would implement a main loop to handle files and flesh out the body of the script rather than relying on the commandline switches for the looping.
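A rough Python equivalent of the commented /x regex with an explicit loop might look like the following. It is a sketch under assumptions: the group names and the parse function are invented here, and any iterable of lines (a list, an open file) can be passed in:

```python
import re
from typing import Iterable, Iterator, Tuple

SCHEME = re.compile(
    r"""
    Scheme\ ID:            # fixed prefix; the space is escaped because of re.VERBOSE
    \s+                    # 1 or more whitespace
    (?P<id>.*?)            # non-greedy: everything up to the next whitespace run
    \s+
    \( (?P<label>.*?) \)   # text between the literal parentheses
    \s*
    (?P<star>\*)?          # optional trailing star
    """,
    re.VERBOSE,
)

def parse(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    # Explicit main loop instead of relying on command-line switches.
    for line in lines:
        m = SCHEME.search(line)
        if m:
            yield m.group("id"), m.group("label"), m.group("star") or ""

if __name__ == "__main__":
    sample = [
        "1) Scheme ID: abc-456-hu5t10 (High priority) *",
        "2) Scheme ID: frt-78f-hj542w (Balanced)",
    ]
    for scheme_id, label, star in parse(sample):
        print(f"{scheme_id} : {label} : {star}")
```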
Long time no Perl
while(<STDIN>) {
next unless /:\s*(\S+)\s+\(([^\)]+)\)\s*(\*?)/;
print "|$1|$2|$3|\n";
}
This just requires a small change to my last answer:
my ($guid, $scheme, $star) = $line =~ m{
The [ ] Scheme [ ] GUID: [ ]
([a-zA-Z0-9-]+) #capture the guid
[ ]
\( (.+) \) #capture the scheme
(?:
[ ]
([*]) #capture the star
)? #if it exists
}x;
String 1:
$input =~ /^\S+/;
$s1 = $&;
String 2:
$input =~ /\(.*\)/;
$s2 = $&;
String 3:
$input =~ /\*?$/;
$s3 = $&;