Need help converting "sassy" to "$a55y" using a regular expression? - regex

Any s at the beginning of the word should be converted to a $.
Any s inside the word should be converted to a 5.

To match an s at the start of the word, use \b to match word boundaries and \w to match alphanumerics:
/\bs\w/
(As @Matthew points out, the \w is really superfluous:)
/\bs/
Once you've replaced all s at the start of a word, then the only remaining ones are inside the word (I'm assuming that you also want to replace s at the end of a word with 5) so you can simply use
/s/
For completeness, here's how to put it all together (I'm going to assume JavaScript):
function pimpMyEsses(str)
{
return str.replace(/\bs/gi, '$').replace(/s/gi, '5');
}
console.log(pimpMyEsses('slither quantum Sassy. arcades'));
// > "$lither quantum $a55y. arcade5"

Depending on the language it may be possible to capture the substitutions with a single regular expression and replace them procedurally. Here's a PHP example:
<?php
$word = 'sassy';
preg_match_all('/\b(s)|([^s]+)|(s)/', $word, $matches, PREG_SET_ORDER);
/* captures:
* $matches = array(
* array('s','s'),
* array('a','','a'),
* array('s','','','s'),
* array('s','','','s'),
* array('y','','y')
* )
*/
$newword = '';
foreach ($matches as $m){
if ($m[1]) $newword .= '$'; # leading s --> $
elseif ($m[2]) $newword .= $m[2]; # not an s --> as-is
else $newword .= '5'; # any other s --> 5
}
echo $newword;
Because I've used \b to match a word-boundary before the "leading s", the string 'sassy socks' becomes '$a55y $ock5'
If you want only the s at the start of "sassy" to become a $, change the regular expression to:
'/^(s)|([^s]+)|(s)/'
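For comparison, the same single-pass idea can be sketched in JavaScript with a replacement callback. This is only an illustration: the function name leetEsses is made up here, and the capture group is used solely to tell a word-leading s apart from any other s.

```javascript
// Single-pass variant: group 1 is set only when the "s" sits at a word boundary.
function leetEsses(str) {
  return str.replace(/\b(s)|s/gi, (match, leading) => (leading ? '$' : '5'));
}

console.log(leetEsses('sassy socks'));
```

Because the callback's return value is inserted literally, the '$' needs no escaping here.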

You can do:
/^(s)/ to select only the first "s";
/(?:[^s])(?:(s)[^s]*)+/ to select all the other "s" characters. Note that the first character is skipped, whatever it is.
Explanation: ignore the first character, then repeat one or more times: match an "s" and skip any following characters that are not "s".
Next step: you need to determine which language you will use.

Related

Search for substring and store another part of the string as variable in perl

I am revamping an old mail tool and adding MIME support. I have a lot of it working but I'm a perl dummy and the regex stuff is losing me.
I had:
foreach ( @{$body} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
if ( $delimit ) {
next if (/$delimit/ && ! $tp);
last if (/$delimit/ && $tp);
$tp = 1, next if /text.plain/;
$tp = 0, next if /text.html/;
s/<[^>]*>//g;
$newbody .= $_ if $tp;
} else {
s/<[^>]*>//g;
$newbody .= $_ ;
}
} # End Foreach
Now I have $body_text as the plain text mail body thanks to MIME::Parser. So now I just need this part to work:
foreach ( @{$body_text} ) {
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
} # End Foreach
The actual challenge is to find NEMS=12345 or NEMS=1234567 and set $nems=12345 if found. I think I have a very basic syntax problem with the test because I'm not exposed to perl very often.
A coworker suggested:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/.*?(\d{5,7}).*/;
$nems = $1;
next;
}
}
Which seems to be working, but it may not be the preferred way?
edit:
So this is the most current version based on tips here and testing:
foreach (split(/\n/,$body_text)){
next if /^$/;
if ( /NEMS/i ) {
/^\s*NEMS\s*=\s*(\d+)/i;
$nems = $1;
next;
}
}
Match the last two digits as optional and capture the first five, and assign the capture directly
($nems) = /(\d{5}) (?: \d{2} )?/x; # /x allows spaces inside
The construct (?: ) only groups what's inside, without capture. The ? after it means to match that zero or one time. We need parens so that it applies to that subpattern only. So the last two digits are optional -- five digits or seven digits match. I removed the unneeded .*? and .*
However, by what you say it appears that the whole thing can be simplified
if ( ($nems) = /^\s*NEMS \s* = \s* (\d{5}) (?:\d{2})?/ix ) { next }
where there is now no need for if (/NEMS/) and I've adjusted to the clarification that NEMS is at the beginning and that there may be spaces around =. Then you can also say
my $nems;
foreach ( split /\n/, $body_text ) {
# ...
next if ($nems) = /^\s*NEMS\s*=\s*(\d{5})(?:\d{2})?/i;
# ...
}
which reflects the clarification that the new $body_text is a multiline string.
Since $nems is needed outside of the loop, it is declared before it.
This allows yet more digits to follow; it will match on 8 digits as well (but capture only the first five). This is what your trailing .* in the regex implies.
Edit: It's been clarified that there can only be 5 or 7 digits. The regex can then be tightened to check whether the input is as expected, but it should work as it stands, too.
A few notes, let me know if more would be helpful
The match operator returns a list so we need the parens in ($nems) = /.../;
The ($nems) = /.../ syntax is a nice shortcut, for ($nems) = $_ =~ /.../;.
If you are matching on a variable other than $_ then you need the whole thing.
You always want to start Perl programs with
use warnings 'all';
use strict;
This directly helps and generally results in better code.
The latest clarification of the problem states that all digits following = need to be captured into $nems (there may be 5, 7, 8, 9, or 10 digits, but not 6). Then the regex is simply
($nems) = /^\s*NEMS\s*=\s*(\d+)/i;
where \d+ means one or more digits, so it captures a run of digits (the match fails if there are none).
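For readers more at home outside Perl, here is a rough JavaScript sketch of the same line-by-line extraction. The function name extractNems is hypothetical, and unlike the Perl loop above it returns at the first matching line rather than continuing through the body.

```javascript
// Scan a multiline body for a line such as "NEMS=12345" and return the digits.
function extractNems(bodyText) {
  for (const line of bodyText.split('\n')) {
    // Optional leading spaces, NEMS (any case), optional spaces around "=", then digits.
    const m = /^\s*NEMS\s*=\s*(\d+)/i.exec(line);
    if (m) return m[1];
  }
  return null; // no NEMS line found
}

console.log(extractNems('Hello\n\n nems = 1234567 more\nBye'));
```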

regular expression which should allow limited special characters

Can anyone tell me a regular expression for a text field that should not allow the following characters but should accept other special characters, letters, numbers, and so on:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ # &
This will not match a string that contains any of the characters in the character class, anywhere in the string:
^(?!.*[+\-&|!(){}[\]^"~*?:#&]+).*$
Brief Explanation
Assert position at the beginning of a line (at beginning of the string or after a line break character) ^
Assert that it is impossible to match the regex below starting at this position (negative lookahead) (?!.*[+\-&|!(){}[\]^"~*?:#&]+)
    Match any single character that is not a line break character .*
        Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
    Match a single character present in the list below [+\-&|!(){}[\]^"~*?:#&]+
        Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
        The character "+" +
        A "-" character \-
        One of the characters "&|!(){}[" &|!(){}[
        A "]" character \]
        One of the characters "^"~*?:#&" ^"~*?:#&
Match any single character that is not a line break character .*
    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
Assert position at the end of a line (at the end of the string or before a line break character) $
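As a rough illustration, for single-line input the same check can be written as a plain "does it contain a forbidden character" test. The sketch below is JavaScript; the names FORBIDDEN and isAllowed are made up for this example, and the class also includes the backslash, which the question lists but the pattern above does not.

```javascript
// Any one of these characters makes the input invalid.
// Assumption: backslash is included because the question lists it.
const FORBIDDEN = /[+\-&|!(){}\[\]^"~*?:\\#]/;

function isAllowed(text) {
  return !FORBIDDEN.test(text);
}

console.log(isAllowed('plain text_123')); // no forbidden character
console.log(isAllowed('a+b'));            // contains "+"
```

Negating a simple match is often easier to read than a negative lookahead, at the cost of behaving differently on input that contains line breaks.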
It's usually better to whitelist the characters you allow than to blacklist the characters you don't, both from a security standpoint and from an ease-of-implementation standpoint.
If you do go down the blacklist route, here is an example, but be warned, the syntax is not simple.
http://groups.google.com/group/regex/browse_thread/thread/0795c1b958561a07
If you want to whitelist all the accent characters, perhaps using unicode ranges would help? Check out this link.
http://www.regular-expressions.info/unicode.html
I recognize those as the characters which need to be escaped for Solr. If this is the case, and if you are coding in PHP, then you should use my PHP utility functions from Github. Here is one of the Solr functions from there:
/**
* Escape values destined for Solr
*
* @author Dotan Cohen
* @version 2013-05-30
*
* @param value to be escaped. Valid data types: string, array, int, float, bool
* @return Escaped string, NULL on invalid input
*/
function solr_escape($str)
{
if ( is_array($str) ) {
foreach ( $str as &$s ) {
$s = solr_escape($s);
}
return $str;
}
if ( is_int($str) || is_float($str) || is_bool($str) ) {
return $str;
}
if ( !is_string($str) ) {
return NULL;
}
$str = addcslashes($str, "+-!(){}[]^\"~*?:\\");
$str = str_replace("&&", "\\&&", $str);
$str = str_replace("||", "\\||", $str);
return $str;
}

Regex to extract value at fixed position index

I have the following string of characters:
73746174652C313A312C310D
                     |
                     - extract the value at this position
I would like to extract the value 1 (the 1 at the end of the string) using regex.
So basically a regex that acts as a charAt(index).
I need this solution for a 3rd party application that only supports regular expressions. Note that the application cannot access capture groups and does not support negative lookbehinds.
In C#:
(?<=^.{21})(.)
in JS:
/.(?=.{2}$)/
You could try:
(?<=^.{21}).
It won't work in JavaScript engines that lack lookbehind support (lookbehind was only added in ES2018), but perhaps it will work in your app.
It means: a single character preceded (?<= ... ) by the beginning of the string ^ plus 21 characters .{21}. So, in the end, it returns the 22nd character.
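A quick way to try both capture-free patterns is a short JavaScript check. This is only a sketch: it assumes an engine that implements ES2018 lookbehind (older engines reject the first pattern), and the variable names are made up for the example.

```javascript
const input = '73746174652C313A312C310D';

// Lookbehind: one character preceded by exactly 21 characters from the start.
const fromStart = input.match(/(?<=^.{21})./)[0];

// Lookahead: one character followed by exactly 2 characters before the end.
const fromEnd = input.match(/.(?=.{2}$)/)[0];

// Both should agree with ordinary indexing (index 21 is the 22nd character).
console.log(fromStart, fromEnd, input.charAt(21));
```

In both cases the whole match is the wanted character, so no capture group is needed.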
The 22nd character is in capture group 1.
/^.{21}(.)/
But what system are you in that requires this instead of normal string processing?
Depends how you want to match it ( x distance from the beginning or x distance from the end )
/(.).{2}$/ Third from the end (capturing group 1)
/^.{21}(.)/ 22nd character (capturing group 1)
//PHP
$str = '73746174652C313A312C310D';
$char = preg_replace('/^.*(.).{2}$/','$1',$str); //3rd from last (the leading ^.* makes the replacement cover the whole string)
preg_match('/(.).{2}$/',$str,$chars); //3rd from last
$char = $chars[1];
preg_match('/^.{21}(.)/',$str,$chars); //22nd character
$char = $chars[1];
//JS
var str = '73746174652C313A312C310D';
var ch = str.replace(/^.*(.).{2}$/,'$1'); //3rd from last (the leading ^.* makes the replacement cover the whole string)
var ch = str.match(/(.).{2}$/)[1]; //3rd from last
var ch = str.match(/^.{21}(.)/)[1]; //22nd character
If you're having to use the result of the First match: bit of your tool, run it twice:
73746174652C313A312C310D - ^.{21}. = 73746174652C313A312C31
73746174652C313A312C31 - .$ = 1

How to validate a string to have only certain letters by perl and regex

I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
$validEntry = 1;
print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
$validEntry = 1;
print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
$validEntry = 1;
print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can match a newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (although it doesn't make a difference whether you use that or ^ in this case, using \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
$validEntry = 1;
print "Valid\n";
}
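The same whole-string check carries over to other regex flavours. Here is a minimal JavaScript sketch, with the hypothetical helper name isDna; in JavaScript, without the m flag, ^ and $ match only at the very start and end of the input, so they play the role of \A and \z.

```javascript
// True only if the whole string consists of one or more of A, C, G, T (any case).
function isDna(input) {
  return /^[ACGT]+$/i.test(input);
}

console.log(isDna('AACGGGTTA'));
console.log(isDna('AAYYGGTTA'));
```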

I want to replace ',' on the 150th location in a String with a <br>

My String is : PI Last Name equal to one of
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI','ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA','AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE','ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN','ALVAREZ','AMARYAN','AMBESI-IMPIOMBATO','AMEGBETO','AMOWITZ', 'ANAGNOSTARAS','ANAND','ANDERSEN','ANDERSON', 'ANDRADE','ANDREEFF','ANDROPHY','ANGER','ANHOLT','ANTHONY','ANTLE','ANTONELLI','ANTONY', 'ANZULOVICH', 'APODACA','APOSHIAN','APPEL','APPLEBY','APRIL','ARAUJO','ARBIB','ARBOLEDA', 'ARCHAKOV','ARCHER', 'ARECHAVALETA-VELASCO','ARENS','ARGON','ARGYROKASTRITIS', 'ARIAS','ARIZAGA','ARMSTRONG','ARNON', 'ARSHAVSKY','ARVIN','ASATRYAN','ASCOLI','ASKENASE','ASSI','ATALAY','ATANASOVA','ATKINSON','ATTYGALLE','ATWEH','AU','AVETISYAN','AWE','AYOUB','AZAD','BACSO','BAGASRA','BAKER','BALAS', 'BALCAZAR','BALK','BALKAY','BALLOU','BALRAJ','BALSTER','BANERJEE','BANKOLE','BANTA','BARAL','BARANOWSKA','BARBAS', 'BARBER','BARILLAS-MURY','BARKHOLT','BARNES','BARNETT','BARRETT','BARRIA','BARROW','BARROWS','BARTKE','BARTLETT','BASSINGTHWAIGHTE','BASSIOUNY','BASU','BATES','BATTAGLIA','BATTERMAN','BAUER','BAUERLE','BAUM','BAUME', 'BAUMLER','BAVISTER','BAWA','BAYNE','BEASLEY','BEATTY','BEATY','BEBENEK','BECK','BECKER','BECKMAN','BECKMAN-SUURKULA' ,'BEDFORD','BEDOLLA','BEEBE','BEEMON','BEHETS','BEHRMAN','BEIER','BEKKER','BELL','BELLIDO','BELMAIN', 'BENATAR','BENBENISHTY','BENBROOK','BENDER','BENEDETTI','BENNETT','BENNISH','BENZ','BERG','BERGER','BERGEY','BERGGREN','BERK','BERKOWITZ','BERLIN','BERLINER','BERMAN','BERTINO','BERTOZZI','BERTRAND','BERWICK','BETHONY','BEYERS','BEYRER' 
,'BEZPROZVANNY','BHAGWAT','BHANDARI','BHARGAVA','BHARUCHA','BHUJWALLA','BIANCO','BIDLACK','BIELERT','BIER','BIESSMANN','BIGELOW' ,'BILLER','BILLINGS','BINDER','BINDMAN','BINUTU','BIRBECK','BIRGE','BIRNBAUM','BIRO','BIRT','BISHAI','BISHOP','BISSELL','BJORKEGREN','BJORNSTAD','BLACK','BLANCHARD','BLASS','BLATTNER','BLIGNAUT','BLOCH','BLOCK','BLOOM','BLOOM,','BLUM','BLUMBERG' ,'BLUMENTHAL','BLYUKHER','BODDULURI','BOFFETTA','BOGOLIUBOVA', 'BOLLINGER','BOLLS','BOMSZTYK','BONANNO','BONNER','BOOM','BOOTHROYD','BOPPANA','BORAWSKI','BORG','BORIS-LAWRIE','BORISY','BORLONGAN','BORNSTEIN','BORODOVSKY','BORST','BOS','BOTO','BOWDEN','BOWEN','BOYCE-JACINO','BRADEN','BRADY' ,'BRAITHWAITE','BRANN','BRASH','BRAUNSTEIN', 'BREMAN','BRENNAN','BRENNER','BRETSCHER','BREW','BREYSSE','BRIGGS','BRITES','BRITT','BRITTENHAM','BRODIE','BRODY','BROOK','BROOTEN','BROSCO','BROSNAN','BROWN','BROWNE','BRUCKNER','BRUNENGRABER','BRYL','BRYSON','BU','BUCHAN','BUDD','BUDNIK', 'BUEKENS','BUKRINSKY','BULLMORE','BULUN','BURBANO','BURGENER','BURGESS','BURKS','BURMEISTER','BURNETT','BURNHAM','BURNS','BURRIDGE','BURTON','BUSCIGLIO','BUSHEK','BUSIJA','BUZSAKI','BZYMEK','CABA')
I need a regex that will greedily look for up to 150 characters, with the last character being a ',', and then replace that last ',' of the 150 with a <br />
Any suggestions pls?
I used this ','(?=[^()]*\)) but this one replaces all the occurrences. I want only the 150th one to be replaced.
Thanks everyone for your suggestions. I managed to do it with Java code instead of regex.
StringBuilder sb = new StringBuilder(html);
int i = 0;
while ((i = sb.indexOf("','", i + 150)) != -1) {
int j = sb.lastIndexOf("','", i + 150);
sb.insert(i+1, "<BR>");
}
return sb.toString();
However, this breaks at the first occurrence of ',' in the 150 chars.
Can anyone help me modify my code so that it breaks at the last occurrence of ',' within the 150 chars?
You'll want something like this:
Look for every occurrence of \([^)]*,[^)]*\) (that is, find a parenthesis-wrapped string with a comma in it), and then run the following regular expression on each of the matched elements:
(.{135,150}[^,]*?),
The first number is the minimum number of characters you want to match before you add a break tag -- the second is the maximum number of characters you would like to match before inserting a break tag. If there is no , between the characters in question then the regular expression will continue to consume characters until it finds a comma.
You could probably do it like this:
regex ~ /(^.{1,14}),/
replacement ~ '\1<replacement>' or "$1<insert your text>"
In Perl:
$target = ','x 22;
$target =~ s/(^ .{1,14}) , /$1<15th comma>/x;
print $target;
Output
,,,,,,,,,,,,,,<15th comma>,,,,,,,
Edit: As an alternative, if you want to break the string up into successive chunks of 150 characters or fewer,
you could do it this way:
regex ~ /(.{1,150},)/sg
replacement ~ '\1<br/>' or "$1<br\/>"
// That is a regex of type global (/g) and include newlines (/s)
In Perl:
$target = "
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI','ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA','AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE','ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN', ... )
";
if ($target =~ s/( .{1,150} , )/$1<br\/>/sxg) {
print $target;
}
Output:
('AARONSON','ABDEL MEGUID','ABDEL-LATIF','ABDOOL KARIM','ABELL','ABRAMS','ACKERMAN','ADAIR','ADAMS','ADAMS-CAMPBELL', 'ADASHI','ADEBAMOWO','ADHIKARI',<br/>'ADIMORA','ADRIAN', 'ADZERIKHO','AGADJANYAN','AGARWAL','AGOT', 'AGUIRRE-CRUZ','AHMAD','AHMED','AIKEN', 'AINAMO', 'AISENBERG','AJAIYEOBA','AKA',<br/>'AKHTAR','AKINGBEMI','AKINYINKA','AKKERMAN','AKSOY','AKYUREK', 'ALBEROLA-ILA','ALBERT','ALCANTARA' ,'ALCOCK','ALEMAN', 'ALEXANDER','ALEXANDRE',<br/>'ALEXANDROV','ALEXANIAN','ALLAND','ALLEN','ALLISON','ALPER', 'ALTMAN',<br/> ... )
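The greedy-chunk idea is easy to try on a small input. The JavaScript sketch below uses a hypothetical function insertBreaks with a configurable width (150 in the question), and relies on the s flag, which needs an ES2018-capable engine.

```javascript
// After each run of up to `width` characters that is followed by a comma,
// insert a <br/>. Greedy matching picks the last comma that still fits.
function insertBreaks(text, width = 150) {
  const chunk = new RegExp(`(.{1,${width}},)`, 'gs');
  return text.replace(chunk, '$1<br/>');
}

// Small width so the effect is visible on a short string.
console.log(insertBreaks('aa,bb,cc,dd', 5));
```

A trailing piece with no comma after it is left untouched, which matches the Perl output above.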