Regex find most center - regex

I need to manually hyphenate words that are too long. Using hyphen.js, I get soft hyphens between every syllable, like below.
I want to find the hyphen closest to the middle. All words will be more than 14 characters long. I'm looking for a regex that works in https://regex101.com/ or a Node/JS example.
Basically, find the middle character excluding hyphens, check if there is a hyphen there, then step backwards one step, then forwards one step, then backwards two steps, etc.
re-spon-si-bil-i-ties => [re-spon-si,-bil-i-ties]
com-pe-ten-cies. => [com-pe,-ten-cies.]
ini-tia-tives. => [ini-tia,-tives.]
vul-ner-a-bil-i-ties => [vul-ner-a,-bil-i-ties]

Here's a simple js approach based on string splitting. There could be a binary search style algorithm as you mentioned which would avoid the array allocation but that seems overkill for these small data sets.
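function halve(str) {
  // Split into syllables, then move the first half (rounded up) into `left`.
  var right = str.split('-');
  var left = right.splice(0, Math.ceil(right.length / 2));
  // Re-attach the hyphen to the start of the right half, as in the question.
  return right.length > 0 ? [left.join('-'), '-' + right.join('-')] : left;
}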
console.log(halve('re-spon-si-bil-i-ties'));
console.log(halve('com-pe-ten-cies.'));
console.log(halve('ini-tia-tives.'));
console.log(halve('vul-ner-a-bil-i-ties'));
console.log(halve('none')); // no hyphens returns ["none"]

You can work this out with this method:
Get middle point of string
From the middle point, and checking each character in both directions (left from middle, right from middle) check if that position is the - character. Set the index to the first such match.
If it matches that character, stop the loop and split the string on that index, otherwise return the original word.
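const words = [
  're-spon-si-bil-i-ties',
  'com-pe-ten-cies.',
  'ini-tia-tives.',
  'vul-ner-a-bil-i-ties',
  'test',
  '-aa',
  'aa-'
];
const split = '-';
for (const word of words) {
  const m = Math.floor(word.length / 2); // middle index, hyphens included
  let offset = 0;
  let i = null;
  do {
    if (word[m - offset] === split) i = m - offset;      // look left first
    else if (word[m + offset] === split) i = m + offset; // then right
    else offset++;
  } while (offset <= m && i === null);
  // i > 0 rejects a leading hyphen, which would leave an empty left half
  if (i !== null && i > 0) console.log([word.substring(0, i), word.substring(i)]);
  else console.log(word);
}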
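The two approaches above find the middle either by syllable count or with hyphens included in the index. A minimal sketch of the stricter rule from the question, where distance is counted in non-hyphen characters, might look like this. The name `splitAtCentralHyphen` is hypothetical, and resolving ties toward the left is an assumption based on the "step backwards first" wording.

```javascript
// Sketch: pick the hyphen whose position, counted in non-hyphen characters,
// is closest to half the word's non-hyphen length. Ties go to the left.
function splitAtCentralHyphen(word, sep = '-') {
  const letterCount = word.split(sep).join('').length;
  const target = letterCount / 2;
  let best = -1;
  let bestDist = Infinity;
  let seen = 0; // non-hyphen characters seen so far
  for (let i = 0; i < word.length; i++) {
    if (word[i] !== sep) {
      seen++;
      continue;
    }
    // Ignore leading/trailing hyphens: they would leave an empty half.
    const interior = i > 0 && i < word.length - 1;
    const dist = Math.abs(seen - target);
    if (interior && dist < bestDist) {
      best = i;
      bestDist = dist;
    }
  }
  return best === -1 ? [word] : [word.slice(0, best), word.slice(best)];
}

console.log(splitAtCentralHyphen('re-spon-si-bil-i-ties')); // [ 're-spon-si', '-bil-i-ties' ]
console.log(splitAtCentralHyphen('com-pe-ten-cies.'));      // [ 'com-pe', '-ten-cies.' ]
```

This is a single pass with no regex, so it is easy to adjust the tie-breaking rule if the right-hand hyphen should win instead.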

You can achieve this with:
var words = [
're-spon-si-bil-i-ties',
'com-pe-ten-cies.',
'ini-tia-tives.',
'vul-ner-a-bil-i-ties',
're-ports—typ-i-cal-ly',
'none'
];
for (var i = 0; i < words.length; ++i) {
  var matches = words[i]
    .match(
      new RegExp(
        '^((?:[^-]+?-?){' // Start the regex
        + parseInt(
          words[i].replace(/-/g, '').length / 2 // Round down the halfway point of this word's length without the hyphens
        )
        + '})(-.+)?$' // End the regex
      )
    )
    .slice(1); // Remove position 0 because it is the entire word
  console.log(matches);
}
Regex explanation for re-spon-si-bil-i-ties:
^((?:[^-]+?-?){8})(-.+)?$
^( - start the capture group leading up to the half way point
(?:[^-]+?-?) - find everything not a hyphen with an optional hyphen after it. Make the hyphen optional so that the second capture group can greedily claim it
{8} - 8 times; this will get us half way
) - close the half way capture group
(-.+)?$ - greedily get the hyphen and everything after it till the end of the string
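To see the pattern being assembled for one word, here is a small sketch that wraps the same regex in a function. The name `splitWithBuiltRegex` is hypothetical; returning the word unchanged when nothing matches, and dropping the empty second group, are assumptions added for convenience.

```javascript
// Sketch: build the pattern from the word's hyphen-free length, then match.
function splitWithBuiltRegex(word) {
  const half = Math.floor(word.replace(/-/g, '').length / 2);
  const pattern = new RegExp('^((?:[^-]+?-?){' + half + '})(-.+)?$');
  const m = word.match(pattern);
  if (m === null) return [word]; // e.g. a word that starts with a hyphen
  // Drop the full match and any group that did not participate.
  return m.slice(1).filter(part => part !== undefined);
}

// half is 8 here, so the pattern is ^((?:[^-]+?-?){8})(-.+)?$
console.log(splitWithBuiltRegex('re-spon-si-bil-i-ties')); // [ 're-spon-si', '-bil-i-ties' ]
```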

Related

Remove all underscores until last number

I'm faced with the following problem. I need to remove all underscores between the start of the string and the last digit in the string (for example, 123_456__ becomes 123456__). I used an ordinary loop for it, which goes from string.length - 1 down to 0; when the character is a digit, I start a new loop from 0 to i, where i is the position of the found digit, and build a new string that skips underscores. It seems there should be a way to replace this with a regex or more "Kotlin-style" code, but I do not know how. Is it possible to do this in a more convenient way?
One way to do this is to use string functions like takeLastWhile / drop etc.
val s = "123_456__"
val end = s.takeLastWhile { !it.isDigit() }
val start = s.dropLast(end.length).filter { it != '_' } // or replace("_", "")
val result = start + end
println(result)
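Since the question also asks about a regex, here is a minimal sketch of one, written in JavaScript for easy testing on regex101. The idea is to remove every underscore that still has a digit somewhere after it. The function name is hypothetical, the input is assumed to be a single line, and while the same pattern should carry over to Kotlin's Regex class, that is not verified here.

```javascript
// Sketch: an underscore is removed only if a digit follows it later in the string.
function stripUnderscoresBeforeLastDigit(s) {
  return s.replace(/_(?=.*\d)/g, '');
}

console.log(stripUnderscoresBeforeLastDigit('123_456__')); // 123456__
```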

How to find any non-digit characters using RegEx in ABAP

I need a Regular Expression to check whether a value contains any other characters than digits between 0 and 9.
I also want to check the length of the value.
The RegEx I've made: ^([0-9]\d{6})$
My test values are: 123Z45 and 123456
The ABAP code:
FIND ALL OCCURRENCES OF REGEX '^([0-9]\d{6})$' IN L_VALUE RESULTS DATA(LT_RESULTS).
I'm expecting a result in LT_RESULTS when I test the first value, '123Z45', because it contains a non-digit character.
But LT_RESULTS is empty in nearly every test case.
Your expression ^([0-9]\d{6})$ translates to:
^ - start of input
( - begin capture group
[0-9] - a character between 0 and 9
\d{6} - six digits (digit = character between 0 and 9)
) - end capture group
$ - end of input
So it will only match 1234567 (7 digit strings), not 123456, or 123Z45.
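If you just need to find a string that contains non-digits, you could use the following instead: ^\d*[^\d]+\d*$ (note that this only matches when the non-digits form a single contiguous run; \D on its own finds any non-digit)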
* - previous element may occur zero, one or more times
[^\d] - ^ right after [ means "NOT", i.e. any character which is not a digit
+ - previous element may occur one or more times
Example:
const expression = /^\d*[^\d]+\d*$/;
const inputs = ['123Z45', '123456', 'abc', 'a21345', '1234f', '142345'];
console.log(inputs.filter(i => expression.test(i)));
You can also use this character class if you want to extract a group of alphabetic characters:
DATA(l_guid) = '0074162D8EAA549794A4EF38D9553990680B89A1'.
DATA(regx) = '[[:alpha:]]+'.
DATA(substr) = match( val = l_guid
regex = regx
occ = 1 ).
It finds the first group of alphabetic (non-digit) characters and returns it.
If you just want to check whether such characters exist, or how many groups of them the string contains, the built-in count function is your friend:
DATA(how_many) = count( val = l_guid regex = regx ).
DATA(yes) = boolc( count( val = l_guid regex = regx ) > 0 ).
Match and count exist since ABAP 7.50.
If you don't need a Regular Expression for something more complex, ABAP has some nice comparison operators CO (Contains Only), CA, NA etc for you. Something like:
IF L_VALUE CO '0123456789' AND STRLEN( L_VALUE ) = 6.
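The same "digits only, exactly six long" check can also be expressed as a single anchored pattern. A minimal sketch in JavaScript, so it can be tried on regex101; the name `isSixDigits` is hypothetical and the length of 6 is taken from the CO example above:

```javascript
// Sketch: true only for strings made of exactly six digits.
function isSixDigits(value) {
  return /^\d{6}$/.test(value);
}

console.log(isSixDigits('123456')); // true
console.log(isSixDigits('123Z45')); // false
```

Unlike the original ^([0-9]\d{6})$, this asks for six digits in total rather than one digit followed by six more.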

Regex replace phone numbers with asterisks pattern

I want to apply a mask to my phone numbers replacing some characters with "*".
The specification is as follows:
Phone entry: (123) 123-1234
Output: (1**) ***-**34
I was trying this pattern: "\B\d(?=(?:\D*\d){2})" and then replacing the matches with a "*".
But the final result is something like (123)465-7891 -> (1**)4**-7*91
That is pretty close to what I want, but with two extra matches. I was thinking of finding a way to use the match-zero-or-once option (??), but I'm not sure how.
Try this Regex:
(?<!\()\d(?!\d?$)
Replace each match with *
Explanation:
(?<!\() - negative lookbehind to find the position which is not immediately preceded by (
\d - matches a digit
(?!\d?$) - negative lookahead to find the position not immediately followed by an optional digit followed by end of the line
Alternative without lookarounds:
match \((\d)\d{2}\)\s+\d{3}-\d{2}(\d{2})
replace by (\1**) ***-**\2
In my opinion you should avoid lookarounds when possible. I find them less readable, they are less portable and often less performant.
Testing Gurman's regex and mine on regex101's php engine, mine completes in 14 steps while Gurman's completes in 80 steps
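Both patterns can be tried side by side. The sketch below assumes a JavaScript engine with lookbehind support; the function names are hypothetical, and `$1`/`$2` are the JavaScript spelling of the `\1`/`\2` backreferences used in the replacement above.

```javascript
// Sketch 1: lookaround version. Mask every digit except one directly after "("
// and the last two digits of the string.
const maskWithLookarounds = phone => phone.replace(/(?<!\()\d(?!\d?$)/g, '*');

// Sketch 2: capture-group version. Only matches the exact "(ddd) ddd-dddd" layout.
const maskWithGroups = phone =>
  phone.replace(/\((\d)\d{2}\)\s+\d{3}-\d{2}(\d{2})/, '($1**) ***-**$2');

console.log(maskWithLookarounds('(123) 123-1234')); // (1**) ***-**34
console.log(maskWithGroups('(123) 123-1234'));      // (1**) ***-**34
```

The capture-group version leaves input untouched when the layout differs (for example, no space after the closing parenthesis), whereas the lookaround version masks digits regardless of the separators.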
Some "quickie":
function maskNumber(number){
var getNumLength = number.length;
// The number of asterisk, when added to 4 should correspond to length of the number
var asteriskLength = getNumLength - 4;
var maskNumber = number.substr(-4);
for (var i = 0; i < asteriskLength; i++) maskNumber+= '*';
var mask = maskNumber.split(''), maskLength = mask.length;
for(var i = maskLength - 1; i > 0; i--) {
var j = Math.floor(Math.random() * (i + 1));
var tmp = mask[i];
mask[i] = mask[j];
mask[j] = tmp;
}
return mask.join('');
}

Regex to extract value at fixed position index

I have the following string of characters:
73746174652C313A312C310D
                     |
                     - extract the value at this position
I would like to extract the value 1 (the 1 at the end of the string) using regex.
So basically a regex that acts as a charAt(index).
I need this solution for a 3rd party application that only supports regular expressions. Note that the application cannot access capture groups and does not support negative lookbehinds.
In C#:
(?<=^.{21})(.)
in JS:
/.(?=.{2}$)/
You could try:
(?<=^.{21}).
It won't work in JavaScript engines that lack lookbehind support (lookbehind was only added in ES2018), but perhaps it will work in your app.
It means: a single character preceded (?<= ... ) by the beginning of the string ^ plus 21 characters .{21} . So, in the end, it returns the 22nd character.
The 22nd character is in capture group 1.
/^.{21}(.)/
But what system are you in that requires this instead of normal string processing?
Depends how you want to match it ( x distance from the beginning or x distance from the end )
/(.).{2}$/ Third from the end (capturing group 1)
/^.{21}(.)/ 22nd character (capturing group 1)
//PHP
$str = '73746174652C313A312C310D';
$char = preg_replace('/(.).{2}$/','$1',$str); //3rd from last
preg_match('/(.).{2}$/',$str,$chars); //3rd from last
$char = $chars[1];
preg_match('/^.{21}(.)/',$str,$chars); //22nd character
$char = $chars[1];
//JS
var str = '73746174652C313A312C310D';
var ch = str.replace(/(.).{2}$/,'$1'); //3rd from last
var ch = str.match(/(.).{2}$/)[1]; //3rd from last
var ch = str.match(/^.{21}(.)/)[1]; //22nd character
If you have to work with the "First match:" result of your tool, run it twice:
73746174652C313A312C310D - ^.{21}. = 73746174652C313A312C31
73746174652C313A312C31 - .$ = 1
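For a tool that can only read the whole match (no capture groups), the two lookaround variants suggested above can be checked with a short sketch like this one. The helper names are hypothetical, and the lookbehind version assumes an engine that supports it.

```javascript
// Sketch: the whole match is the wanted character, so no capture group is needed.
const thirdFromEnd = s => s.match(/.(?=.{2}$)/)[0];       // lookahead only
const twentySecondChar = s => s.match(/(?<=^.{21})./)[0]; // needs lookbehind support

const str = '73746174652C313A312C310D';
console.log(thirdFromEnd(str));     // 1
console.log(twentySecondChar(str)); // 1
```

The two only agree because this string is 24 characters long; for inputs of varying length, pick the variant that matches how the position is defined (from the start or from the end).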

use regular expression to find and replace but only every 3 characters for DNA sequence

Is it possible to do a find/replace using regular expressions on a string of dna such that it only considers every 3 characters (a codon of dna) at a time.
for example I would like the regular expression to see this:
dna="AAACCCTTTGGG"
as this:
AAA CCC TTT GGG
If I use the regular expressions right now and the expression was
Regex.Replace(dna,"ACC","AAA") it would find a match, but in this case of looking at 3 characters at a time there would be no match.
Is this possible?
Why use a regex? Try this instead, which is probably more efficient to boot:
// requires: using System; using System.Text;
public string DnaReplaceCodon(string input, string match, string replace) {
    if (match.Length != 3 || replace.Length != 3)
        throw new ArgumentOutOfRangeException();
    var output = new StringBuilder(input.Length);
    int i = 0;
    while (i + 2 < input.Length) {
        if (input[i] == match[0] && input[i+1] == match[1] && input[i+2] == match[2]) {
            output.Append(replace);
        } else {
            // copy the codon unchanged
            output.Append(input[i]);
            output.Append(input[i+1]);
            output.Append(input[i+2]);
        }
        i += 3;
    }
    // pick up trailing letters.
    while (i < input.Length) output.Append(input[i++]);
    return output.ToString();
}
Solution
It is possible to do this with regex. Assuming the input is valid (contains only A, T, G, C):
Regex.Replace(input, @"\G((?:.{3})*?)" + codon, "$1" + replacement);
If the input is not guaranteed to be valid, you can just do a check with the regex ^[ATCG]*$ (allow non-multiple of 3) or ^([ATCG]{3})*$ (sequence must be multiple of 3). It doesn't make sense to operate on invalid input anyway.
Explanation
The construction above works for any codon. For the sake of explanation, let the codon be AAA. The regex will be \G((?:.{3})*?)AAA.
The whole regex actually matches the shortest substring that ends with the codon to be replaced.
\G # Must be at beginning of the string, or where last match left off
((?:.{3})*?) # Match any number of codons, lazily. The text is also captured.
AAA # The codon we want to replace
We make sure the matches only start from positions whose index is a multiple of 3 with:
\G which asserts that the match starts from where the previous match left off (or the beginning of the string)
And the fact that the pattern ((?:.{3})*?)AAA can only match a sequence whose length is multiple of 3.
Due to the lazy quantifier, we can be sure that in each match, the part before the codon to be replaced (matched by ((?:.{3})*?) part) does not contain the codon.
In the replacement, we put back the part before the codon (which is captured in capturing group 1 and can be referred to with $1), followed by the replacement codon.
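JavaScript regexes have no \G anchor, so the sketch below swaps in a different technique to get the same codon-aligned behaviour: a global match on every 3-character chunk with a replacement callback. The function name `replaceCodon` is hypothetical, and the input is assumed to contain only A, T, G and C.

```javascript
// Sketch: walk the string in 3-letter chunks and swap only chunks equal to `codon`.
function replaceCodon(dna, codon, replacement) {
  if (codon.length !== 3 || replacement.length !== 3) {
    throw new RangeError('codon and replacement must be 3 letters long');
  }
  // /.{3}/g consumes aligned chunks; a trailing 1 or 2 letters are left as they are.
  return dna.replace(/.{3}/g, chunk => (chunk === codon ? replacement : chunk));
}

console.log(replaceCodon('AAACCCTTTGGG', 'ACC', 'AAA')); // AAACCCTTTGGG (ACC straddles two codons)
console.log(replaceCodon('AAACCCTTTGGG', 'CCC', 'AAA')); // AAAAAATTTGGG
```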
NOTE
As explained in the comment, the following is not a good solution! I leave it in so that others will not fall for the same mistake
You can usually find out where a match starts and ends via m.start() and m.end(). If m.start() % 3 == 0 you found a relevant match.