Substitute every character after a certain position with a regex - regex

I'm trying to use a regex to replace each character after a given position (say, 3) with a placeholder character, for an arbitrary-length string (the output length should be the same as that of the input). I think a lookahead (lookbehind?) can do it, but I can't get it to work.
What I have right now is:
regex: /.(?=.{0,2}$)/
input string: 'hello there'
replace string: '_'
current output: 'hello th___' (last 3 substituted)
The output I'm looking for would be 'hel________' (everything but the first 3 substituted).
I'm doing this in TypeScript, to replace some old JavaScript that uses ugly split/concatenate logic. However, I know how to make the regex calls, so the answer should be pretty language agnostic.

If you know the string is longer than the given position n, the start part can be optionally captured:
(^.{3})?.
and replaced with e.g. $1_ (capture of first group and _). Won't work if string length is <= n.
See this demo at regex101
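For the TypeScript case in the question, the optional-capture idea could be sketched roughly as below. The helper name maskWithCapture and its parameters are made up for illustration; the pattern is the (^.{3})?. one from above with the position made configurable.

```typescript
// Hypothetical helper (not from the original answer): masks every character
// after the first `keep` characters using the optional-capture pattern.
function maskWithCapture(input: string, keep: number, placeholder: string = "_"): string {
  // For keep = 3 this builds the pattern (^.{3})?. with the global flag.
  const pattern = new RegExp(`(^.{${keep}})?.`, "g");
  // Group 1 is undefined on every match except the one at the start of the string.
  // A function replacement is used so a `$` in `placeholder` is not interpreted.
  return input.replace(
    pattern,
    (_match: string, head: string | undefined) => (head ?? "") + placeholder
  );
}

console.log(maskWithCapture("hello there", 3)); // "hel________"
console.log(maskWithCapture("abc", 3)); // "___" (the length <= n caveat)
```

The second call shows the caveat mentioned above: when the string is not longer than n, the optional group cannot take part in a match, so everything gets masked.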
Another option, where supported, is to use a lookbehind to check that the character is preceded by n characters.
(?<=.{3}).
See other demo at regex101 (replace just with underscore) - String length does not matter here.
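A minimal TypeScript sketch of the lookbehind variant, assuming a runtime whose regex engine supports lookbehind (for example a recent Node.js or browser). The helper name maskWithLookbehind is hypothetical.

```typescript
// Hypothetical helper: masks every character that is preceded by at least
// `keep` characters. Requires lookbehind support in the regex engine.
function maskWithLookbehind(input: string, keep: number, placeholder: string = "_"): string {
  // Built from a string so `keep` can vary; for keep = 3 this is (?<=.{3}). with the g flag.
  const pattern = new RegExp(`(?<=.{${keep}}).`, "g");
  return input.replace(pattern, () => placeholder);
}

console.log(maskWithLookbehind("hello there", 3)); // "hel________"
console.log(maskWithLookbehind("hi", 3)); // "hi" (short strings are left alone)
```

Unlike the optional-capture version, the replacement here is just the placeholder, and strings shorter than the position come back unchanged.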
It's worth mentioning that in PHP/PCRE the start part can simply be skipped like this: ^.{1,3}(*SKIP)(*F)|.

Related

Regular expression to extract string from urls

I need to extract a string from a URL. Here are some examples:
Input: https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html – Output: bas-026-009
Input: https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html – Output: aw18-245-b86
Input: https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html – Output: ss20-028-e70
I want to be able to extract the string that goes from the first character after the "/eur_en/" until the third dash. Can someone help me? Thanks
You're looking for regexp: \/eur_en\/([^-]+-[^-]+-[^-]+)
Play & test it at regex101: https://regex101.com/r/RvGROG/1
You need something like this:
const urls = [
"https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html",
"https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html",
"https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html",
]
const rg = new RegExp(`\/eur_en\/([^-]+-[^-]+-[^-]+)`)
const strs = urls.map(url => url.match(rg)[1])
console.log(strs)
// Output:
// [
// "bas-026-009",
// "aw18-245-b86",
// "ss20-028-e70"
// ]
Of course, this is a simple example. In real code, don't forget to check that .match did not return null (which it does when nothing matches) before reading index 1.
In the returned array, the first element is the full matched string, and the second and subsequent elements are the substrings captured by the parentheses.
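A guarded version of that lookup might look like the sketch below. The function name extractCode is an assumption for illustration; the pattern is the one given above.

```typescript
// Hypothetical helper: returns the captured code, or null when the URL
// does not contain /eur_en/ followed by three dash-separated parts.
function extractCode(url: string): string | null {
  const match = url.match(/\/eur_en\/([^-]+-[^-]+-[^-]+)/);
  // `match` is null when nothing matched; index 0 is the full match,
  // index 1 is the first capture group.
  return match?.[1] ?? null;
}

console.log(extractCode("https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html")); // "bas-026-009"
console.log(extractCode("https://www.example.net/about.html")); // null
```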
We can improve and complicate our regex like so:
\/((?:[^-\/]+-){2}[^-\/]+)
This lets us avoid depending on the specific anchor /eur_en/ and control the number of dash-separated parts.
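As a rough sketch of "controlling the number of parts", the repetition count can be filled in when the pattern is built. The helper name leadingParts is hypothetical, and it assumes parts is at least 2. Note that without the /eur_en/ anchor the pattern picks the first slash-delimited segment that has enough dash-separated pieces, so a hyphenated host name could match first.

```typescript
// Hypothetical helper: returns the first `parts` dash-separated pieces of the
// first URL segment that contains at least that many pieces (assumes parts >= 2).
function leadingParts(url: string, parts: number): string | null {
  // For parts = 3 this builds /((?:[^-/]+-){2}[^-/]+) as in the pattern above.
  const pattern = new RegExp(`/((?:[^-/]+-){${parts - 1}}[^-/]+)`);
  const match = url.match(pattern);
  return match?.[1] ?? null;
}

const sampleUrl =
  "https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html";
console.log(leadingParts(sampleUrl, 3)); // "aw18-245-b86"
console.log(leadingParts(sampleUrl, 2)); // "aw18-245"
```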
The expression you're looking for is the following:
/(?<=eur_en\/)[^-]*-[^-]*-[^-]*/
Here is how it works:
(?<=eur_en\/): will look behind for eur_en/ but will not include it in the output
[^-]*: it will match any character that is not a dash. So it will get everything up to the first dash (not including the dash)
[^-]*: it will match any character that is not a dash. So it will get everything up to the second dash (not including the dash)
[^-]*: it will match any character that is not a dash. So it will get everything up to the third dash (not including the dash).
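With the lookbehind form, the wanted text is the whole match (index 0) rather than a capture group. A small TypeScript sketch, assuming the runtime supports lookbehind; the function name extractWithLookbehind is made up for illustration.

```typescript
// Hypothetical helper: uses a lookbehind so the full match is already the
// wanted substring. Requires lookbehind support in the regex engine.
function extractWithLookbehind(url: string): string | null {
  // Built from a string; equivalent to the literal /(?<=eur_en\/)[^-]*-[^-]*-[^-]*/
  const pattern = new RegExp("(?<=eur_en/)[^-]*-[^-]*-[^-]*");
  const match = url.match(pattern);
  return match?.[0] ?? null;
}

console.log(extractWithLookbehind("https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html")); // "ss20-028-e70"
```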
/(?<=\/eur_en\/)\w+-\w+-\w+/g
Tokens and their descriptions:
(?<=\/eur_en\/): Lookbehind. Matches only at a position immediately preceded by /eur_en/; the /eur_en/ itself is not part of the match.
\w+-\w+-\w+: Three runs of one or more word characters ([A-Za-z0-9_]) separated by two literal hyphens.
Review: https://regex101.com/r/Ge0zA3/1

Matlab: How to replace dynamic part of string with regexprep

I have strings like
#(foo) 5 + foo.^2
#(bar) bar(1,:) + bar(4,:)
and want the expression in the first group of parentheses (which could be anything) to be replaced by x in the whole string
#(x) 5 + x.^2
#(x) x(1,:) + x(4,:)
I thought this would be possible with regexprep in one step somehow, but after reading the documentation and fiddling around for quite a while, I have not found a working solution yet.
I know, one could use two commands: First, grab the string to be matched with regexp and then use it with regexprep to replace all occurrences.
However, I have the gut feeling this should be somehow possible with the functionality of dynamic expressions and tokens or the like.
Without the support of an infinite-width lookbehind, you cannot do that in one step with a single call to regexprep.
Use the first idea: extract the first word and then replace it with x when found in between word boundaries:
s = '#(bar) bar(1,:) + bar(4,:)';
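tok = regexp(s, '^#\((\w+)\)', 'tokens', 'once'); % cell array holding the captured name
word = tok{1}; % MATLAB cannot index a function call result directly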
s = regexprep(s, strcat('\<',word,'\>'), 'x');
Output: #(x) x(1,:) + x(4,:)
The ^#\((\w+)\) regex matches the #( at the start of the string, then captures alphanumeric or _ chars into Group 1 and then matches a ). tokens option allows accessing the captured substring, and then the strcat('\<',word,'\>') part builds the whole word matching regex for the regexprep command.
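For comparison, the same two-step idea can be sketched in TypeScript rather than MATLAB (this is only an illustration of the logic, not a regexprep replacement). The function name renameArgument is hypothetical, and \b stands in for MATLAB's \< and \> word anchors.

```typescript
// Hypothetical helper: step 1 extracts the name inside the leading "#(...)",
// step 2 replaces every whole-word occurrence of that name.
function renameArgument(expr: string, replacement: string): string {
  const header = expr.match(/^#\((\w+)\)/);
  const word = header?.[1];
  if (word === undefined) {
    return expr; // no leading "#(name)": leave the string unchanged
  }
  // `word` consists only of \w characters, so it needs no regex escaping.
  return expr.replace(new RegExp(`\\b${word}\\b`, "g"), replacement);
}

console.log(renameArgument("#(bar) bar(1,:) + bar(4,:)", "x")); // "#(x) x(1,:) + x(4,:)"
console.log(renameArgument("#(foo) 5 + foo.^2", "x")); // "#(x) 5 + x.^2"
```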

Regex: Separate a string of characters with a non-consistent pattern (Oracle) (POSIX ERE)

EDIT: This question pertains to Oracle implementation of regex (POSIX ERE) which does not support 'lookaheads'
I need to separate a string of characters with a comma, however, the pattern is not consistent and I am not sure if this can be accomplished with Regex.
Corpus: 1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25
The pattern is basically 4 digits, followed by 4 characters, followed by a dot, followed by 1, 2, or 3 digits. To make the string above clear, this is what it looks like separated by spaces: 1710ABCD.13 1711ABCD.43 1711ABCD.4 1711ABCD.404 1711ABCD.25
So the output of a replace operation should look like this:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
I was able to match the pattern using this regex:
(\d{4}\w{4}\.\d{1,3})
It does insert a comma but after the third digit beyond the dot (wrong, should have been after the second digit), but I cannot get it to do it in the right position and globally.
Here is a link to a fiddle
https://regex101.com/r/qQ2dE4/329
All you need is a lookahead at the end of the regular expression, so that the greedy \d{1,3} backtracks until it's followed by 4 digits (indicating the start of the next substring):
(\d{4}\w{4}\.\d{1,3})(?=\d{4})
                     ^^^^^^^^^
https://regex101.com/r/qQ2dE4/330
To expand on @CertainPerformance's answer, if you want to be able to match the last token, you can use an alternative match of $:
(\d{4}\w{4}\.\d{1,3})(?=\d{4}|$)
Demo: https://regex101.com/r/qQ2dE4/331
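In an engine that does support lookaheads, the two patterns above could be applied as in this TypeScript sketch (function names are made up for illustration). The difference between them shows up only at the end of the string: the |$ variant also matches the last token, so it gets a trailing comma.

```typescript
// Hypothetical helpers applying the two lookahead patterns with a global replace.
function separateBetweenTokens(corpus: string): string {
  // The lookahead forces \d{1,3} to backtrack until 4 digits follow.
  return corpus.replace(/(\d{4}\w{4}\.\d{1,3})(?=\d{4})/g, "$1,");
}

function separateIncludingLast(corpus: string): string {
  // The extra |$ alternative lets the final token match as well.
  return corpus.replace(/(\d{4}\w{4}\.\d{1,3})(?=\d{4}|$)/g, "$1,");
}

const sampleCorpus = "1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25";
console.log(separateBetweenTokens(sampleCorpus));
// "1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25"
console.log(separateIncludingLast(sampleCorpus));
// "1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25,"
```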
EDIT: Since you now mentioned in the comment that you're using Oracle's implementation, you can simply do:
regexp_replace(corpus, '(\d{1,3})(\d{4})', '\1,\2')
to get your desired output:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
Demo: https://regex101.com/r/qQ2dE4/333
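The lookahead-free idea is not specific to Oracle. As a sketch, the same pattern and replacement translate to TypeScript like this (the function name is hypothetical; $1 and $2 play the role of \1 and \2):

```typescript
// Hypothetical helper: inserts a comma between the 1-3 trailing digits of one
// token and the 4 leading digits of the next, without any lookahead.
function separateWithoutLookahead(corpus: string): string {
  // \d{1,3} is greedy but backtracks until exactly 4 digits can follow it.
  return corpus.replace(/(\d{1,3})(\d{4})/g, "$1,$2");
}

console.log(separateWithoutLookahead("1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25"));
// "1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25"
```

This relies on the corpus shape described in the question: the only place where five or more digits run together is the boundary between two tokens.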
In order to continue finding matches after the first one you must use the global flag /g. The pattern is very tricky but it's feasible if you reverse the string.
Demo
var str = `1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25`;
// Reverse the string
var rts = str.split("").reverse().join("");
// Reversed version of the pattern; the `g`lobal flag keeps
// searching after the first match
var rgx = /(\d{1,3}\.\w{4}\d{4})/g;
// Replace on the reversed string with a reversed substitution
// (the comma goes in front of each match)
var res = rts.replace(rgx, `,$1`);
// Reverse the result back and drop the comma left after the last token
var ser = res.split("").reverse().join("").replace(/,$/, "");
console.log(ser);

How to change the case of a word's 1st letter from upper to lower if the same word occurs at least once in lower case, using regex only

I've created the following RegEx in Python 3 to find all lower case words in a text and back reference the first letter and the tail of that word. Example:
w          ord
^          ^^^
|           |
1st letter tail
Afterwards I use a for loop to replace every occurrence of the capitalized form of each match (the first group converted to uppercase, followed by the unaltered tail) with the lowercase first letter followed by the unaltered tail.
import re
str = "Some text here and some more after that. Something that should remain untouched."
for match in re.finditer(r"\b([a-z])([a-z]+)\b", str):
    # print(match.group(1).upper() + match.group(2))  # just for debugging
    str = re.sub(r"\b" + match.group(1).upper() + match.group(2) + r"\b", match.group(1) + match.group(2), str)
print(str)  # print the desired result
Is there a way to do this in Python 3 with a single regular expression and no additional procedural code? It feels like there should be a more elegant way but I don't see it (yet).
For completeness: If the code is applied to the string stored in str this is the result:
some text here and some more after that. Something that should remain untouched.
Please note that the RegEx replace may only match whole words, not partial words. The 5th word in my text is "some"; this causes the 1st letter of the 1st word ("Some") to be converted to lower case but leaves the word "Something", which the 2nd sentence starts with, untouched.
You can't do that with the re module, since it doesn't support variable-length lookbehind and since an inline modifier like (?i) applies to the whole pattern and can't be turned off. It is possible with the newer regex module, using this pattern:
\b([A-Z][a-z]*)\b(?:(?=.*\b(?=[a-z]+\b)(?i)\1\b)|(?<=\b(?=[a-z]+\b)(?i)\1\b.+))
However, I'm not sure this is a more "elegant" way.
It is possible to test the pattern with regexstorm.net/tester (since .net regex engine allows variable length lookbehinds too.)
Note that the scope of the inline modifier is limited to the subpattern after it and ends at the first closing parenthesis.

RegEx Lookaround issue

I am using Powershell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters). The time/date part will always be _ 8 characters _ 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.
If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, you want to substitute the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}', ''. You can't simply pull two bits (name and extension) of the string out as one match - regex patterns match contiguous chunks only.
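Both suggestions (capture the two ends, or delete the middle) can be sketched in TypeScript as below. This is an illustration of the patterns only, not PowerShell; the function names are hypothetical.

```typescript
// Hypothetical helper: deletes the _DDDDDDDD_DDDDDD timestamp block.
function stripTimestamp(fileName: string): string {
  return fileName.replace(/_[0-9]{8}_[0-9]{6}/, "");
}

// Hypothetical helper: keeps group 1 (base name) and group 2 (extension)
// and drops what lies between them.
function joinNameAndExtension(fileName: string): string {
  return fileName.replace(/^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$/, "$1$2");
}

console.log(stripTimestamp("my_file_name_01012013_111546.xls")); // "my_file_name.xls"
console.log(joinNameAndExtension("my_file_name_01012013_111546.xls")); // "my_file_name.xls"
```

If the file name does not have the expected shape, both functions return it unchanged, because replace leaves the string alone when the pattern does not match.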
Try this (not tested); it should work for any 'my_file_name' length, any number of digits, and any kind of extension.
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
non regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")
The original regex you specified returns the longest match that is followed by the 16 characters the lookahead requires (an underscore, 8 characters, another underscore, 6 characters). Note that a plain length check such as (?=.{16}) is not equivalent, because it drops the two underscore positions.
Once you've added .{3}, it returns that same match plus the next 3 characters. This is why you're getting this result.
The approach described by Inductiveload is probably better in case you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*). Otherwise, I'd do it in two separate stages:
get the initial part
get the extension
The reason you get my_file_name_01 when you add that is because lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is followed by something matching _.{8}_.{6}. However, once that match is found, you've only consumed my_file_name, so the addition of .{3} will then consume the next 3 characters, namely _01.
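The zero-width behaviour is easy to reproduce outside PowerShell. Here is a small TypeScript sketch running both patterns from the question; the helper firstMatch is made up for illustration.

```typescript
// Hypothetical helper: returns the text of the first match, or null.
function firstMatch(pattern: RegExp, input: string): string | null {
  const found = input.match(pattern);
  return found?.[0] ?? null;
}

const sampleFile = "my_file_name_01012013_111546.xls";

// The lookahead only checks what follows; it consumes nothing.
console.log(firstMatch(/.*(?=_.{8}_.{6})/, sampleFile)); // "my_file_name"

// So .{3} continues right after "my_file_name" and takes "_01".
console.log(firstMatch(/.*(?=_.{8}_.{6}).{3}/, sampleFile)); // "my_file_name_01"
```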
As for a regex that would fit your needs, others have posted viable alternatives.