remove repeated character between words - regex

I am trying out the quiz from Regex 101
In Task 6, the question is
Oh no! It seems my friends spilled beer all over my keyboard last night and my keys are super sticky now. Some of the time when I press a key, I get two duplicates. Can you pppllleaaaseee help me fix this? Content in bold should be removed.
I have tried this regex
([a-z])(\1{2})
But couldn't get the solution.

The solution for the riddle on that website is:
/(.)\1{2}/g
Since any key on the keyboard can get stuck, we need to use . (which matches any character).
\1 in the regex means match whatever the 1st capturing group (.) matches.
Replacement is $1 or \1.
The rest of your regex is correct, just that there are unnecessary capturing groups.
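For anyone who wants to try this outside the quiz page, here is a minimal sketch in Python (my choice of language; the quiz itself does not depend on one). The sample word comes from the task text, and the function name is only illustrative.

```python
import re

def fix_sticky_keys(text):
    # (.) captures any single character; \1{2} requires two more copies of it.
    # Each run of exactly three identical characters is replaced by one copy.
    return re.sub(r"(.)\1{2}", r"\1", text)

print(fix_sticky_keys("pppllleaaaseee"))  # please
```

Python accepts \1 in the replacement string; other engines may want $1 instead.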

Your regex is correct if you want to match exactly three characters. If you want to match at least three, that is
([a-z])(\1{2,})
or
([a-z])(\1\1+)
Since you don't need to capture anything but the first occurrence, these are slightly better:
([a-z])\1{2} # your original regex (exactly three occurrences)
([a-z])\1{2,}
([a-z])\1\1+
Now, the replacement should be exactly one occurrence of the character, and nothing more:
\1
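To see the difference between "exactly three" and "at least three", here is a small Python sketch. The input is a made-up word with a run of five identical letters; it is not taken from the quiz.

```python
import re

sample = "h" + "e" * 5 + "llo"  # "heeeeello", hypothetical input

# Exactly three: only the first three e's collapse, two are left over.
exactly_three = re.sub(r"([a-z])\1{2}", r"\1", sample)

# At least three: the whole run of five collapses to one character.
at_least_three = re.sub(r"([a-z])\1{2,}", r"\1", sample)

print(exactly_three)   # heeello
print(at_least_three)  # hello
```

Note that the legitimate double l survives in both cases, because a run of two is too short to match.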

Replace:
(.)\1+
with:
\1
This of course requires that your regex engine supports backreferences. Also, depending on the regex engine, \1 in the replacement may have to be written as $1.
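A quick Python sketch of this variant, with made-up inputs. One thing worth knowing before using it: \1+ also collapses ordinary double letters, which may or may not be what you want.

```python
import re

def collapse_runs(text):
    # (.)\1+ matches a character followed by one or more copies of itself,
    # so any run of two or more is reduced to a single character.
    return re.sub(r"(.)\1+", r"\1", text)

print(collapse_runs("aabbbc"))  # abc
print(collapse_runs("hello"))   # helo
```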

I'd do it with (\w)(\1+)? but can't find out how to "remove" within the given site...
The best way would be to replace whatever the second capturing group matched with an empty string.

Related

Notepad++: Can I use regex to find some values and remove only one character instead of the whole pattern?

I want to use regex in notepad to find this pattern: "[0-9]+[\.][0-9]+[,][0-9]+" e.g. 1.010,80260
However from these kind of numbers I just want to remove the '.' , so the new value should be 1010,80260 .
So far I can only replace the whole pattern. Is there a way to do it?
Thank you in advance!
You can make use of the \K meta escape since PCRE doesn't support variable width lookbehinds:
regex:
[0-9]+\K[\.](?=[0-9]+[,][0-9]+)
[0-9]+ - match digits
\K - forget what we've matched so far (it is dropped from the reported match)
[\.] - match a period; just \. can be used, no need for the char class brackets
(?=[0-9]+[,][0-9]+) - ahead of me should be digits followed by a comma and digits
replace:
Nothing
\K is bugged in Notepad++ so you could use this regex instead since you only care that at least one digit is behind the period:
(?<=\d)\.(?=[0-9]+[,][0-9]+)
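If you want to check the lookaround version outside Notepad++, here is a minimal Python sketch. Python's built-in re module has no \K, so only the lookbehind/lookahead form is shown; the function name is my own.

```python
import re

def drop_thousands_dot(text):
    # Remove a dot only when a digit precedes it and "digits,digits" follows.
    # Both lookarounds are zero-width, so only the dot itself is replaced.
    return re.sub(r"(?<=\d)\.(?=[0-9]+,[0-9]+)", "", text)

print(drop_thousands_dot("1.010,80260"))  # 1010,80260
print(drop_thousands_dot("version 1.2"))  # version 1.2
```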
You can use \K, which basically says throw away whatever was matched up until that point, then add a lookahead. Like so
[0-9]+\K\.(?=[0-9]+[,][0-9]+)
Change the regular expression to: ([0-9]+)[\.]([0-9]+[,][0-9]+)
The () pieces are groups which you can refer to in the replace with \1 for the first group, and \2 for the second group.
The docs also explain this here: https://npp-user-manual.org/docs/searching/#substitution-grouping (even better, and in more detail, than my usage in this answer...)
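The same group-based replacement, sketched in Python for illustration (in Notepad++ you would type \1\2 into the "Replace with" box). The function name and the surrounding "price:" text are made up.

```python
import re

NUMBER = re.compile(r"([0-9]+)\.([0-9]+,[0-9]+)")

def remove_dot(text):
    # Group 1 is the digits before the dot, group 2 is "digits,digits" after it.
    # Writing both groups back without the dot removes only that one character.
    return NUMBER.sub(r"\1\2", text)

print(remove_dot("price: 1.010,80260"))  # price: 1010,80260
```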
EDIT: I just wanted to share the animated gif showing that 'Replace' in Notepad++ 7.9.5 does not seem to work.

Regex to match url with or without 'folder'

I'm struggling to get the right regex to match the following;
content/foo/B6128/8918/foo+bar+foo
OR
content/foo/B6128/8918/foo+bar+foo/randomstringnumsletters
I'm sure this isn't that complicated and I'm nearly there, just can't get it perfected. Here's what I've tried;
content\/(\w+)\/(\w+)\/(\d+)\/([^\/]+[\w]+)\/?(\w*)$
using this online tester: http://regex101.com/r/sB8rR5/2
It still matches a 5th item with this string content/foo/B6128/8918/foo+bar+foo;
And while technically this pattern does match either OR url structures. I don't want it to match the 5th item when there's no randomstringnumsletters present.
After playing around with it for a bit, I do realise some elements are redundant with what I've tried, but I'm not getting anywhere with it...
Just turn the last capturing group into an optional one, and change \w* to \w+ in that group in order to prevent an empty string from being captured by the 5th group.
content\/(\w+)\/(\w+)\/(\d+)\/([^\/]+[\w]+)\/?(\w+)?$
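Here is how that behaves in Python, as a rough sketch (the helper name is mine). When the optional group does not take part in the match, Python reports it as None rather than as an empty string.

```python
import re

URL_PATTERN = re.compile(r"content\/(\w+)\/(\w+)\/(\d+)\/([^\/]+[\w]+)\/?(\w+)?$")

def split_url(url):
    # Returns the five groups, or None when the URL does not match at all.
    match = URL_PATTERN.search(url)
    return match.groups() if match else None

print(split_url("content/foo/B6128/8918/foo+bar+foo"))
# ('foo', 'B6128', '8918', 'foo+bar+foo', None)
print(split_url("content/foo/B6128/8918/foo+bar+foo/randomstringnumsletters"))
# ('foo', 'B6128', '8918', 'foo+bar+foo', 'randomstringnumsletters')
```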
Looks like your REAL pattern should be:
content\/((?:\w+\/?)+)
or am I wrong? This will match the whole string (after content/) and return it all / delimited. You can parse each variable from there.
You can take each part as an array, then take the part that you need...
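A sketch of the "take each part as an array" idea in Python. Note that I use a plain string split here rather than a regex, partly because \w does not match the + in foo+bar+foo. The function name and the prefix parameter are my own.

```python
def url_parts(url, prefix="content/"):
    # Strip the fixed prefix, then split the remainder on "/".
    if not url.startswith(prefix):
        return None
    return url[len(prefix):].split("/")

short = url_parts("content/foo/B6128/8918/foo+bar+foo")
long_ = url_parts("content/foo/B6128/8918/foo+bar+foo/randomstringnumsletters")

print(short)       # ['foo', 'B6128', '8918', 'foo+bar+foo']
print(len(long_))  # 5
```

The optional last item is then simply the fifth element, if the list has one.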

How to match text when part of it was already matched previously?

I have a string like aaa**b***c****ddd, and I want to get a sequence of matched text of pattern [^*]\*+[^*], which I think should be [a**b, b***c, c***d]. However, when I test this in a text editor like vim or emacs, the second (b***c) is not matched.
aaa**b***c***ddd
  |--|   |---|
  first  third
     |---|
     second, which I think should be matched but not
How should I modify the regular expression to match the second?
Yes you can. The trick is to put everything in a capturing group inside a lookahead, which allows overlapping results:
(?=([^*]\*+[^*]))
But you can't use this to do replacements, since the pattern itself only matches the empty string (or perhaps you can, if you can get the capture group length and the current offset).
EDIT:
it seems to be possible to obtain the capture group length with vim with strlen(submatch(1))
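The question is about vim/emacs, but the lookahead trick is easy to check in Python, so here is a small sketch. It uses the string from the diagram in the question (three stars before ddd).

```python
import re

text = "aaa**b***c***ddd"

# Plain pattern: each match consumes its characters, so "b" is used up
# by the first match and b***c is never found.
plain = re.findall(r"[^*]\*+[^*]", text)

# Lookahead version: each match is zero-width, so the engine only advances
# one position at a time and the capture group can overlap earlier results.
overlapping = re.findall(r"(?=([^*]\*+[^*]))", text)

print(plain)        # ['a**b', 'c***d']
print(overlapping)  # ['a**b', 'b***c', 'c***d']
```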
@CommuSoft is correct. One way to approach this problem would be to match the whole string against this regex and then, the second time around, match the regex against the substring that starts at (index_of_first_previous_match + 1) and runs until the end of the string. Hope that is clear.
So if the index of your first match above (a**b) was 2. Then the new substring that you match against the regex the second time should start from index 3 till the end of the string. This will give you the two results.
However, Casimir's answer is much simpler.
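For completeness, the restart-one-past-the-previous-match approach could look roughly like this in Python. The function name is hypothetical, and the sample string is the one from the diagram in the question.

```python
import re

def overlapping_matches(pattern, text):
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while True:
        # Search again, starting one character after the previous match began.
        match = compiled.search(text, pos)
        if match is None:
            break
        results.append(match.group(0))
        pos = match.start() + 1
    return results

print(overlapping_matches(r"[^*]\*+[^*]", "aaa**b***c***ddd"))
# ['a**b', 'b***c', 'c***d']
```

The first match starts at index 2, so the second search starts at index 3, as described above.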

Regex for deleting characters before a certain character?

I'm very new at regex, and to be completely honest it confounds me. I need to grab the string after a certain character is reached in said string. I figured the easiest way to do this would be using regex, however like I said I'm very new to it. Can anyone help me with this or point me in the right direction?
For instance:
I need to check the string "23444:thisstring" and save "thisstring" to a new string.
If this is your string:
I'm very new at regex, and to be completely honest it confounds me
and you want to grab everything after the first "c", then this regular expression will work:
/c(.*)/s
It will return this match in the first matched group:
"ompletely honest it confounds me"
Explanation:
The c is the character you are looking for
.* (in combination with /s) matches everything left
(.*) captures what .* matched, making it available in $1 and returned in list context.
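The answer above is written in Perl-style notation; an approximate Python equivalent is sketched below. re.DOTALL plays the role of /s, and the helper name is my own.

```python
import re

def text_after(char, text):
    # Escape the character in case it is special in regex, then capture
    # everything after its first occurrence (including newlines).
    match = re.search(re.escape(char) + r"(.*)", text, re.DOTALL)
    return match.group(1) if match else None

sentence = "I'm very new at regex, and to be completely honest it confounds me"
print(text_after("c", sentence))            # ompletely honest it confounds me
print(text_after(":", "23444:thisstring"))  # thisstring
```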
Regex for deleting characters before a certain character!
You can use lookahead like this
.*(?=x)
where x is a particular character, word, or string. (Characters like ., $, ^, *, + have special meaning in regex, so don't forget to escape them when using them within x.)
EDIT
for your sample string it would be
.*(?=thisstring)
.* matches 0 to many characters up to thisstring
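A minimal Python sketch of the lookahead approach, with the escaping handled by re.escape. The function name is made up. Because .* is greedy, everything up to the last occurrence of the marker is removed.

```python
import re

def strip_before(text, marker):
    # .* consumes as much as possible; the lookahead then requires that the
    # (escaped) marker starts right after the consumed part.
    pattern = r".*(?=" + re.escape(marker) + r")"
    return re.sub(pattern, "", text, count=1, flags=re.DOTALL)

print(strip_before("23444:thisstring", "thisstring"))  # thisstring
print(strip_before("a.b.c", "."))                      # .c
```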
Here is a one-line solution for matching everything after "before"
print $1."\n" if "beforeafter" =~ m/before(.*)/;
Edit:
While using lookbehind is possible, it's not required. Grouping provides an easier solution.
To get the string after : in your example, you have to use [^:][^:]*:\(.*\). Notice that you should have at least one [^:] followed by any number of [^:]s followed by an actual :, the character you are searching for. The \(.*\) group then captures everything after it.
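That pattern uses sed-style escaped parentheses. Translated to Python syntax (plain parentheses, and [^:]+ in place of [^:][^:]*), a sketch might look like this; the function name is hypothetical.

```python
import re

def after_colon(text):
    # One or more non-colon characters, a literal colon, then the captured rest.
    match = re.match(r"[^:]+:(.*)", text)
    return match.group(1) if match else None

print(after_colon("23444:thisstring"))  # thisstring
print(after_colon(":no prefix"))        # None
```

The second call returns None because the pattern insists on at least one character before the colon.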

Regular expression to find a lowercase letter followed by an uppercase

I have difficulty using Regular Expression (Grep) in TextWrangler to find occurrences of lowercase letter followed by uppercase. For example:
This announcement meansStudents are welcome.
In fact, I want to split the occurrence by adding a colon so that it becomes means: Students
I have tried:
[a-z][A-Z]
But this expression does not work in TextWrangler.
*EDIT: here are the exact contexts in which the occurrences appear (I mean only with these font colors).*
<font color =#48B700> - Stột jlăm wẻ baOne hundred and three<br></font>
<font color =#C0C0C0> »» Qzống pguộc lyời ba yghìm fảy dyổiTo live a life full of vicissitudes, to live a life marked by ups and downs<br></font>
"baOne" and "dyổiTo" must be "ba: One" and "dyổi: To"
Could anyone help? Many thanks.
I do believe (don't have TextWrangler at hand though) that you need to search for ([a-z])([A-Z]) and replace it with: \1: \2
Hope this helps.
Replace ([a-z])([A-Z]) with \1:\2 - I don't have TextWrangler, but it works on Notepad++
The parentheses are for capturing the data, which is referred to using \1 syntax in the replacement string
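I can't test TextWrangler either, but the same search and replace can be checked in Python, where matching is case sensitive by default. A small sketch using the sentence from the question (the function name is mine):

```python
import re

def split_lower_upper(text):
    # Group 1 is the lowercase letter, group 2 the uppercase letter after it;
    # the replacement puts ": " between them.
    return re.sub(r"([a-z])([A-Z])", r"\1: \2", text)

print(split_lower_upper("This announcement meansStudents are welcome."))
# This announcement means: Students are welcome.
```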
This question is ages old, but I stumbled upon it, so someone else might, as well. The OP's comment to Igor's response clarified how the task was meant to be described (& could have been added to the description).
To match only those font-specific lines of the HTML replace
(?<=<font color =#(?:48B700|C0C0C0)>)(.*?[a-z])([A-Z])
with \1: \2
Explanation:
(?<=[fixed-length regex]) is a positive lookbehind and means "if my match has this just before it"
(?:48B700|C0C0C0) is an unnamed group to match only 2 colours. Since they are of the same length, they work in a lookbehind (that needs to be of fixed length)
(.*?[a-z])([A-Z]) will match everything after the > of those begin font tags up to your Capital letters.
The \1: \2 replacement is the same as in Igor's response, only that \1 will match the entire first string that needs separating.
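Here is a rough Python check of the lookbehind pattern. Python's re also requires fixed-width lookbehinds, so the two equal-length colour codes work there too. The sample lines are shortened, ASCII-only stand-ins for the ones in the question, and FF0000 is a made-up colour used to show a line that should be left alone.

```python
import re

FONT_SPLIT = re.compile(r"(?<=<font color =#(?:48B700|C0C0C0)>)(.*?[a-z])([A-Z])")

def split_in_font_lines(line):
    # Only text directly after one of the two font tags can start a match,
    # so lines with other colours are left untouched.
    return FONT_SPLIT.sub(r"\1: \2", line)

print(split_in_font_lines("<font color =#48B700> - baOne hundred and three<br></font>"))
# <font color =#48B700> - ba: One hundred and three<br></font>
print(split_in_font_lines("<font color =#FF0000> - baOne hundred and three<br></font>"))
# <font color =#FF0000> - baOne hundred and three<br></font>
```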
Addition:
Your input strings contain special characters and the part you want to split may very well end in one. In this case they won't be caught by [a-z] alone. You will need to add a character range that captures all the letters you care about, something like
(?<=<font color =#(?:48B700|C0C0C0)>)(.*?[a-zḁ-ῼ])([A-Z])
That is the correct pattern for identifying lowercase and uppercase letters; however, you will need to make sure matching is set to Case Sensitive in the Find/Replace dialogue.