regular expression - replace a word by a space - regex

I have this line, which replaces dots and underscores with a space. Now I would also like to replace the word "German" (without the quotes) with a space.
Can anybody help?
preg_replace('/\(.*?\)|\.|_/i', ' ',
Best regards
Edit:
public function parseMovieName($releasename)
{
    $cat = new Category;
    if (!$cat->isMovieForeign($releasename))
    {
        preg_match('/^(?P<name>.*)[\.\-_\( ](?P<year>19\d{2}|20\d{2})/i', $releasename, $matches);
        if (!isset($matches['year']))
            preg_match('/^(?P<name>.*)[\.\-_ ](?:dvdrip|bdrip|brrip|bluray|hdtv|divx|xvid|proper|repack|real\.proper|sub\.?fix|sub\.?pack|ac3d|unrated|1080i|1080p|720p|810p)/i', $releasename, $matches);
        if (isset($matches['name']))
        {
            $name = preg_replace('/\(.*?\)|\.|_/i', ' ', $matches['name']);
            $year = (isset($matches['year'])) ? ' ('.$matches['year'].')' : '';
            return trim($name).$year;
        }
    }
    return false;
}
The string is, for example, "movieName German 2015", but the output should be "movieName 2015" (without the quotes).
Solved:
I changed the line $name = preg_replace('/\(.*?\)|\.|_/i', ' ', $matches['name']); to $name = preg_replace('/\h*\bGerman\b|\([^()]*\)|[._]/', ' ', $matches['name']);
Thanks @Wiktor Stribiżew

To add an alternative to an alternation group, you just need to use
$name = preg_replace('/\h*\bGerman\b|\([^()]*\)|[._]/', ' ', $matches['name']);
^^^^^^ ^^^^^^^
Note that \h matches horizontal whitespace only (no line breaks); if you need to match line breaks as well, use \s.
The \h*\bGerman\b part matches zero or more horizontal whitespace characters followed by the whole word "German" (since \b is a word boundary, a word such as "Germanic" will not be matched).
Also, \.|_ matches the same text as [._], but a character class [...] is more efficient when matching single characters.
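For readers who want to try the extended alternation outside PHP, here is a minimal Python sketch of the same replacement. The function name clean_release_name is made up for illustration, and [ \t]* only approximates \h* because Python's re module has no \h escape.

```python
import re

# Python port of the pattern above; [ \t]* approximates PCRE's \h*.
CLEANUP = re.compile(r"[ \t]*\bGerman\b|\([^()]*\)|[._]")


def clean_release_name(name):
    """Replace 'German', parenthesised parts, dots and underscores by a space."""
    return CLEANUP.sub(" ", name).strip()


print(clean_release_name("movieName German"))     # movieName
print(clean_release_name("movie.Name German"))    # movie Name
print(clean_release_name("The Germanic Tribes"))  # The Germanic Tribes
```

One thing to watch: _ counts as a word character, so \b does not match between _ and G, and a name such as movie_German would keep its "German".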

Related

Capturing a delimiter that isn't in between single quotes

Like the question says, is it possible to use a single Regex string to get a delimiter that isn't in between some quotes?
For example, I want to split this string with the delimiter &:
"example=3&testing='f&tmp'"
should produce
["example=3", "testing='f&tmp'"]
Essentially, things inside single quotes (' ') should remain untouched.
I found out how to get things within quotes with expression: (?:'.*?')
The closest I could get to a tangible solution was: (.[^']&[^'])
It is not an easy task for a String#split, but is quite a feasible task for Matcher#find if you use
[^&\s=]+=(?:'[^']*'|[^\s&]*)
(see this regex demo) and this Java code:
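import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "example=3&testing='f&tmp'";
Pattern p = Pattern.compile("[^&\\s=]+=(?:'[^']*'|[^\\s&]*)");
Matcher m = p.matcher(text);
List<String> res = new ArrayList<>();
// collect every key=value match, keeping quoted values intact
while (m.find()) {
    res.add(m.group());
}
System.out.println(res);
// => [example=3, testing='f&tmp']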
Details
[^&\s=]+ - one or more chars other than &, = and whitespace
= - a = char
(?:'[^']*'|[^\s&]*) - a non-capturing group matching either ', zero or more chars other than ' and then a ', or zero or more chars other than whitespace and &.
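The "match the pieces instead of splitting" idea is not specific to Java. As a rough sketch, the same pattern can be used with Python's re.findall; the function name split_pairs is an illustrative choice, not something from the answer.

```python
import re

# Same pattern as above: key, "=", then either a quoted value or a bare value.
PAIR = re.compile(r"[^&\s=]+=(?:'[^']*'|[^\s&]*)")


def split_pairs(text):
    """Return every key=value chunk, leaving & inside single quotes alone."""
    return PAIR.findall(text)


print(split_pairs("example=3&testing='f&tmp'"))
# ['example=3', "testing='f&tmp'"]
```

Because the group is non-capturing, findall returns the whole match for each pair rather than just the group contents.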

Ruby Regex: empty space at beginning and end of string

I want to find all users with a first name that has an empty space at the beginning or ending.
It could look like: "Juliette " or " Juliette"
For now I only have the regex to match when the space is at the end of string:
^[ab]:[[:space:]]|$
I couldn't find out how to match the empty space at the beginning of the string, and I don't know whether it's possible to handle both of these conditions in one regex.
Thanks for your help.
Test for Strippable Whitespace without Regexp
There's a little trick you can use with String#strip!, which returns nil if it can't find whitespace to strip. For example:
# returns a one-entry hash mapping str to true if it has
# leading/trailing whitespace, otherwise to false
def strippable?(str)
  { str => !!str.dup.strip! }
end
# leading space, trailing space, no space
test_values = [ ' foo', 'foo ', 'foo' ]
test_values.map { |str| strippable? str }
#=> [{" foo"=>true}, {"foo "=>true}, {"foo"=>false}]
This doesn't rely on a regular expression, but rather on properties of the String and the Boolean result of an inverted #strip!. Regardless of whether the Ruby engine uses regular expressions under the hood, these types of String methods are often faster than comparable Regexp matches, but your mileage and specific use cases may vary.
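The same regex-free idea can be sketched in other languages by comparing a string with its stripped copy. Below is a small Python illustration; has_edge_whitespace is a hypothetical helper name, and Python's str.strip stands in for Ruby's strip!.

```python
def has_edge_whitespace(s):
    """True when stripping would change the string, i.e. it has leading or trailing whitespace."""
    return s != s.strip()


# leading space, trailing space, no space
test_values = [" foo", "foo ", "foo"]
print({s: has_edge_whitespace(s) for s in test_values})
# {' foo': True, 'foo ': True, 'foo': False}
```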
Alternatives with Regexp
Using the same test data as above, you could do something similar with a regular expression. For example:
# leading space, trailing space, no space
test_values = [ ' foo', 'foo ', 'foo' ]
# test start/end of string
test_values.grep(/\A\s+|\s+\z/)
#=> [" foo", "foo "]
# test start/end of line
test_values.grep(/^\s+|\s+$/)
#=> [" foo", "foo "]
Benchmarks
require 'benchmark'
ITERATIONS = 1_000_000
TEST_VALUES = [ ' foo', 'foo ', 'foo' ]
def regex_grep(array)
  array.grep(/^\s+|\s+$/)
end
def string_strip(array)
  array.map { |str| { str => !!str.dup.strip! } }
end
Benchmark.bmbm do |x|
  n = ITERATIONS
  x.report('regex') { n.times { regex_grep TEST_VALUES } }
  x.report('strip') { n.times { string_strip TEST_VALUES } }
end
            user     system      total        real
regex   1.539269   0.001325   1.540594 (  1.541438)
strip   1.256836   0.001357   1.258193 (  1.259955)
A quarter second over a million iterations may not seem like a big difference, but on significantly larger data sets or iterations it can add up. Whether or not it's enough for you to care for this particular use case is up to you, but the general pattern is that native String methods (regardless of how they're implemented by the interpreter under the hood) are generally faster than regular expression pattern matching. Of course there are edge cases, but that's what benchmarks are for!
You can use
/\A([a-zA-Z]+ | [a-zA-Z]+)\z/
/\A(?:[[:alpha:]]+[[:space:]]|[[:space:]][[:alpha:]]+)\z/
/\A(?:\p{L}+[\p{Z}\t]|[\p{Z}\t]\p{L}+)\z/
See the Rubular demo (with line anchors instead of string anchors used for the demo purposes)
Details:
\A - a string start anchor
(...) - a capturing group
(?:...) - a non-capturing group (it is preferred here since you are not extracting, just validating)
[a-zA-Z]+ - any one or more ASCII letters
\p{L}+ - any one or more Unicode letters
| - or
\z - end of string anchor.
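To see what the first pattern accepts and rejects, here is a hedged Python sketch. Python spells the end-of-string anchor \Z rather than Ruby's \z, and has_one_edge_space is an illustrative name only.

```python
import re

# ASCII-letter variant from above, with Python's \Z as the end-of-string anchor.
EDGE_SPACE = re.compile(r"\A(?:[a-zA-Z]+ | [a-zA-Z]+)\Z")


def has_one_edge_space(name):
    """True for letters with exactly one space at the start or at the end."""
    return EDGE_SPACE.search(name) is not None


for candidate in ["Juliette ", " Juliette", "Juliette", " Juliette "]:
    print(repr(candidate), has_one_edge_space(candidate))
# 'Juliette ' True
# ' Juliette' True
# 'Juliette' False
# ' Juliette ' False
```

As written, the pattern only accepts a single space on one side, so a name padded on both sides is not reported.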

Regex doesn't fetch the nested curly braces

My regex matches nested curly braces in some cases but not in others.
My Code:
use strict;
use warnings;
my $str1 = '$$\eqalign{&\cases{\mathdot{\bf x}=A{\bf x}+Bu\cr y=H{\bf x}}\quad{\rm with}\{\bf x}=\left(\matrix{x\cr\mathdot{x}\cr\theta\cr\mathdot{\theta}}\right),\cr&A\!=\!\!\left(\matrix{0&1&0&0\cr 0&0&-{m_{a}\over M}g&0\cr 0&0&0&1\cr 0&0&{(M\!+\!m_{a})\over Ml}g&0}\right)\!,\ B\!=\!\left(\matrix{0\cr{a\over M}\cr 0\cr-{a\over Ml}}\right)\!,\ H^{T}\!=\!\left(\matrix{1\cr 0\cr 1\cr 0}\right)\!.}$$';
my $str2 = "\\bibcite{Airdetal2013}{{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}";
my $regex = qr/(?:[^{}]*(?:{(?:[^{}]*(?:{(?:[^{}]*(?:{[^{}]*})*[^{}]*)})*[^{}]*)*})*[^{}]*)*/;
if($str1=~m/\{$regex\}/) { print "str1: $&\n"; }
if($str2=~m/\{$regex\}/) { print "str2: $&\n"; }
OUTPUT:
str1: {&\cases{\mathdot{\bf x}=A{\bf x}+Bu\cr y=H{\bf x}}\quad{\rm with}\ {\bf x}=\left(\matrix{x\cr\mathdot{x}\cr\theta\cr\mathdot{\theta}}\right),\cr&A\!=\!\!\left(\matrix{0&1&0&0\cr 0&0&-{m_{a}\over M}g&0\cr 0&0&0&1\cr 0&0&{(M\!+ !m_{a})\over Ml}g&0}\right)\!,\ B\!=\!\left(\matrix{0\cr{a\over M}\cr 0\cr-{a\over Ml}}\right)\!,\ H^{T}\!=\!\left(\matrix{1\cr 0\cr 1\cr 0}\right)\!.}
str2: {2}
The output for str1 is correct. The output for str2 is incorrect.
The expected output for str2 is:
str2: {{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}
In the first sample, str1, the nested curly braces are matched. However, in the second sample, str2, the nested curly braces are not matched.
My question is how to match the nested curly braces in both cases. I am clueless. It would be great if someone could point out my mistake.
Thanks in advance.
Since your actual requirements (discussed in the chat) are to match substrings starting with \bib followed by {...} substrings or any chars other than { and }, you should use a regex with a subroutine:
/\\bib(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])*/g
Details:
\\bib - \bib literal text
(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])* - 0+ occurrences of:
  ({(?:[^{}]++|(?1))*}) - Group 1 (that will be recursed with (?1)) matching
    { - a literal {
    (?:[^{}]++|(?1))* - 0 or more occurrences of 1+ chars other than { and } or the whole Group 1 subpattern
    } - a literal }
  | - or
  (?!\\bib)[^{}] - a char other than { and } not starting a \bib literal char sequence.
See the sample Perl code:
use strict;
use warnings;
use feature 'say';
my $str2 = "\\bibcite{Airdetal2013}{{2}{2017}{{{John} {et~al.}}}{{{James}, {Flexi}, {Buella}, {Curren}, {Mozes}, {Sam}, {Kandan}, {Alexander}, {Alfonsa}, {Fireknight}, {Georgen}, {Karims}, {Merloni}, {Nanda}, {Terra}, {Alvato}, {Nini}, {Winski}, {Shankar}, {Gnali}, \& {Giito}}}}";
while ($str2 =~ /\\bib(?:({(?:[^{}]++|(?1))*})|(?!\\bib)[^{}])*/g) {
    say "$&";
}
Note: The edit in the question adds \\bibcite{Airdetal2013} in front. However, this doesn't change the analysis below as it doesn't change the overall nesting levels.
There has got to be a better way to do this. There is the recursive regex offered by Wiktor Stribiżew in the comments. There are modules for recursive parsing. And there are tools for parsing LaTeX.
However, out of curiosity ...
Your string, shortened suitably
my $str2 = "{{2}{2017}{{{John}{et~al.}}}{{{James}, ... {Gnali}, \& {Giito}}}}";
or, with C standing for a pair of curlies with something inside (no nesting)
"{ C C { { C C } { C, ... \& C } } }"
So you have three levels of nesting, to get down to the last pair {...} (no further nesting).
Your regex, spread out and with $nc = qr/[^{}]*/ (Non-Curlies), so that we can look at it
my $regex = qr/
    (?: $nc
        (?: {
            (?: $nc
                (?: {
                    (?: $nc (?: { $nc } )* $nc )
                }
                )* $nc
            )*
        }
        )* $nc
    )*/x;
I can count two levels here. (The $nc has no curlies so { $nc } matches my C above.)
Thus this regex cannot match that whole string.
How to fix it? Ideally, find another way, so as not to drown in this.
Or, write it out like above, very carefully, and add the missing level.
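One such "other way" is to skip the regex and count brace depth directly, which works for any nesting level. The Python sketch below is only an illustration under stated assumptions: top_level_groups is a made-up name, the sample string is a shortened stand-in for str2, and escaped braces such as \{ are not treated specially.

```python
def top_level_groups(text):
    """Return each outermost {...} group, however deeply it is nested inside."""
    groups = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i          # remember where an outermost group opens
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:         # the outermost group just closed
                groups.append(text[start:i + 1])
    return groups


sample = r"\bibcite{Key2013}{{2}{2017}{{{John} {et~al.}}}}"
print(top_level_groups(sample))
# ['{Key2013}', '{{2}{2017}{{{John} {et~al.}}}}']
```

Unbalanced closing braces are ignored here, and an unclosed group is simply dropped; a stricter parser might raise an error instead.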

Need help converting "sassy" to "$a55y" using a regular expression?

Any s at the beginning of the word should be converted to a $.
Any s inside the word should be converted to a 5.
To match an s at the start of the word, use \b to match word boundaries and \w to match alphanumerics:
/\bs\w/
(as @Matthew points out, the \w is really superfluous:)
/\bs/
Once you've replaced all s at the start of a word, then the only remaining ones are inside the word (I'm assuming that you also want to replace s at the end of a word with 5) so you can simply use
/s/
For completeness, here's how to put it all together (I'm going to assume JavaScript):
function pimpMyEsses(str)
{
    return str.replace(/\bs/gi, '$').replace(/s/gi, '5');
}
console.log(pimpMyEsses('slither quantum Sassy. arcades'));
// > "$lither quantum $a55y. arcade5"
Depending on the language it may be possible to capture the substitutions with a single regular expression and replace them procedurally. Here's a PHP example:
<?php
$word = 'sassy';
preg_match_all('/\b(s)|([^s]+)|(s)/', $word, $matches, PREG_SET_ORDER);
/* captures:
* $matches = array(
* array('s','s'),
* array('a','','a'),
* array('s','','','s'),
* array('s','','','s'),
* array('y','','y')
* )
*/
$newword = '';
foreach ($matches as $m){
    if ($m[1]) $newword .= '$'; # leading s --> $
    elseif ($m[2]) $newword .= $m[2]; # not an s --> as-is
    else $newword .= '5'; # any other s --> 5
}
echo $newword;
Because I've used \b to match a word-boundary before the "leading s", the string 'sassy socks' becomes '$a55y $ock5'
If you want only the s at the start of "sassy" to become a $, change the regular expression to:
'/^(s)|([^s]+)|(s)/'
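The single-pass idea can also be expressed with a replacement callback instead of a loop over the matches. This is a minimal Python sketch, not the PHP answer's method: leetify_esses is an illustrative name, and the pattern is reduced to (\bs)|s since only the s characters need replacing.

```python
import re

# Group 1 participates only for an s at a word boundary; any other s hits the second branch.
ESS = re.compile(r"(\bs)|s", re.IGNORECASE)


def leetify_esses(text):
    """Turn a word-initial s into $ and every other s into 5, in one pass."""
    return ESS.sub(lambda m: "$" if m.group(1) else "5", text)


print(leetify_esses("sassy"))        # $a55y
print(leetify_esses("sassy socks"))  # $a55y $ock5
```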
You can do:
/^(s)/ to select only the first "s";
/(?:[^s])(?:(s)[^s]*)+/ to select all other "s". Note that the first character will be skipped (whatever it is);
Explanation: ignore the first character; then repeat one or more times: capture an "s" and skip any following characters that are not "s".
Next step: you need to decide which language you will use.

I want to replace ',' on the 150th location in a String with a <br>

My String is : PI Last Name equal to one of
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI','ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA','AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE','ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN','ALVAREZ','AMARYAN','AMBESI-IMPIOMBATO','AMEGBETO','AMOWITZ', 'ANAGNOSTARAS','ANAND','ANDERSEN','ANDERSON', 'ANDRADE','ANDREEFF','ANDROPHY','ANGER','ANHOLT','ANTHONY','ANTLE','ANTONELLI','ANTONY', 'ANZULOVICH', 'APODACA','APOSHIAN','APPEL','APPLEBY','APRIL','ARAUJO','ARBIB','ARBOLEDA', 'ARCHAKOV','ARCHER', 'ARECHAVALETA-VELASCO','ARENS','ARGON','ARGYROKASTRITIS', 'ARIAS','ARIZAGA','ARMSTRONG','ARNON', 'ARSHAVSKY','ARVIN','ASATRYAN','ASCOLI','ASKENASE','ASSI','ATALAY','ATANASOVA','ATKINSON','ATTYGALLE','ATWEH','AU','AVETISYAN','AWE','AYOUB','AZAD','BACSO','BAGASRA','BAKER','BALAS', 'BALCAZAR','BALK','BALKAY','BALLOU','BALRAJ','BALSTER','BANERJEE','BANKOLE','BANTA','BARAL','BARANOWSKA','BARBAS', 'BARBER','BARILLAS-MURY','BARKHOLT','BARNES','BARNETT','BARRETT','BARRIA','BARROW','BARROWS','BARTKE','BARTLETT','BASSINGTHWAIGHTE','BASSIOUNY','BASU','BATES','BATTAGLIA','BATTERMAN','BAUER','BAUERLE','BAUM','BAUME', 'BAUMLER','BAVISTER','BAWA','BAYNE','BEASLEY','BEATTY','BEATY','BEBENEK','BECK','BECKER','BECKMAN','BECKMAN-SUURKULA' ,'BEDFORD','BEDOLLA','BEEBE','BEEMON','BEHETS','BEHRMAN','BEIER','BEKKER','BELL','BELLIDO','BELMAIN', 'BENATAR','BENBENISHTY','BENBROOK','BENDER','BENEDETTI','BENNETT','BENNISH','BENZ','BERG','BERGER','BERGEY','BERGGREN','BERK','BERKOWITZ','BERLIN','BERLINER','BERMAN','BERTINO','BERTOZZI','BERTRAND','BERWICK','BETHONY','BEYERS','BEYRER' 
,'BEZPROZVANNY','BHAGWAT','BHANDARI','BHARGAVA','BHARUCHA','BHUJWALLA','BIANCO','BIDLACK','BIELERT','BIER','BIESSMANN','BIGELOW' ,'BILLER','BILLINGS','BINDER','BINDMAN','BINUTU','BIRBECK','BIRGE','BIRNBAUM','BIRO','BIRT','BISHAI','BISHOP','BISSELL','BJORKEGREN','BJORNSTAD','BLACK','BLANCHARD','BLASS','BLATTNER','BLIGNAUT','BLOCH','BLOCK','BLOOM','BLOOM,','BLUM','BLUMBERG' ,'BLUMENTHAL','BLYUKHER','BODDULURI','BOFFETTA','BOGOLIUBOVA', 'BOLLINGER','BOLLS','BOMSZTYK','BONANNO','BONNER','BOOM','BOOTHROYD','BOPPANA','BORAWSKI','BORG','BORIS-LAWRIE','BORISY','BORLONGAN','BORNSTEIN','BORODOVSKY','BORST','BOS','BOTO','BOWDEN','BOWEN','BOYCE-JACINO','BRADEN','BRADY' ,'BRAITHWAITE','BRANN','BRASH','BRAUNSTEIN', 'BREMAN','BRENNAN','BRENNER','BRETSCHER','BREW','BREYSSE','BRIGGS','BRITES','BRITT','BRITTENHAM','BRODIE','BRODY','BROOK','BROOTEN','BROSCO','BROSNAN','BROWN','BROWNE','BRUCKNER','BRUNENGRABER','BRYL','BRYSON','BU','BUCHAN','BUDD','BUDNIK', 'BUEKENS','BUKRINSKY','BULLMORE','BULUN','BURBANO','BURGENER','BURGESS','BURKS','BURMEISTER','BURNETT','BURNHAM','BURNS','BURRIDGE','BURTON','BUSCIGLIO','BUSHEK','BUSIJA','BUZSAKI','BZYMEK','CABA')
I need a regex which will greedily look for up to 150 characters, with the last character being a ',', and then replace that last ',' of the 150 with a <br />.
Any suggestions, please?
I used this ','(?=[^()]*\)) but this one replaces all the occurrences. I want only the 150th ones to be replaced.
Thanks everyone for your suggestions. I managed to do it with Java code instead of regex.
StringBuilder sb = new StringBuilder(html);
int i = 0;
while ((i = sb.indexOf("','", i + 150)) != -1) {
int j = sb.lastIndexOf("','", i + 150);
sb.insert(i+1, "<BR>");
}
return sb.toString();
However, this breaks at the first occurrence of ',' in the 150 chars.
Can anyone help me modify my code so that it breaks at the last occurrence of ',' within the 150 chars?
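For the index-based approach, one possible way to break at the last ',' inside each window is to search backwards from the window's end. The Python sketch below illustrates that idea; it is not a drop-in fix for the Java code above. The name insert_breaks, the width parameter and the fallback for a window without any separator are all assumptions.

```python
SEP = "','"  # closing quote, comma, opening quote between two names


def insert_breaks(text, width=150, tag="<br>"):
    """Insert tag after the last "'," that fits inside each window of width chars."""
    pieces = []
    pos = 0
    while len(text) - pos > width:
        cut = text.rfind(SEP, pos, pos + width)   # last separator inside the window
        if cut == -1:
            cut = text.find(SEP, pos + width)     # none fits: take the next one instead
            if cut == -1:
                break
        cut += 2                                  # keep "'," on the current line
        pieces.append(text[pos:cut])
        pos = cut
    pieces.append(text[pos:])
    return tag.join(pieces)


names = ",".join("'NAME%03d'" % i for i in range(60))
print(insert_breaks(names))
```

Removing the tags from the result gives back the original string, since the function only inserts text and never drops any.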
You'll want something like this:
Look for every occurrence of \([^)]+,[^)]+\) (that is, find a parenthesis-wrapped string with a comma in it) and then run the following regular expression on each of the matched elements:
(.{135,150}[^,]*?),
The first number is the minimum number of characters you want to match before you add a break tag -- the second is the maximum number of characters you would like to match before inserting a break tag. If there is no , between the characters in question then the regular expression will continue to consume characters until it finds a comma.
You could probably do it like this:
regex ~ /(^.{1,14}),/
replacement ~ '\1<replacement>' or "$1<insert your text>"
In Perl:
$target = ','x 22;
$target =~ s/(^ .{1,14}) , /$1<15th comma>/x;
print $target;
Output
,,,,,,,,,,,,,,<15th comma>,,,,,,,
Edit: As an alternative, if you want to break the string up into successive chunks of 150 characters or fewer, you could do it this way:
regex ~ /(.{1,150},)/sg
replacement ~ '\1<br/>' or "$1<br\/>"
// That is, the regex is applied globally (/g) and the dot also matches newlines (/s)
In Perl:
$target = "
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI','ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA','AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE','ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN', ... )
";
if ($target =~ s/( .{1,150} , )/$1<br\/>/sxg) {
print $target;
}
Output:
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI',<br/>'ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA',<br/>'AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE',<br/>'ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN',<br/> ... )
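The global substitution carries over to other regex engines more or less unchanged. As a rough Python sketch (break_after_commas is an illustrative name, and re.DOTALL plays the role of Perl's /s flag):

```python
import re


def break_after_commas(text, width=150, tag="<br/>"):
    """Append tag after the last comma that follows at most width characters, repeatedly."""
    # Greedy .{1,width} backtracks until the next character is a comma.
    pattern = re.compile(r"(.{1,%d},)" % width, re.DOTALL)
    return pattern.sub(lambda m: m.group(1) + tag, text)


names = ",".join("'NAME%03d'" % i for i in range(60))
print(break_after_commas(names))
```

Each chunk produced this way is at most width + 1 characters long (up to width characters plus the comma), and a tail without any comma is left untouched.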