I am having difficulty using regular expressions (Grep) in TextWrangler to find occurrences of a lowercase letter followed by an uppercase letter. For example:
This announcement meansStudents are welcome.
In fact, I want to split the occurrence by adding a colon so that it becomes means: Students
I have tried:
[a-z][A-Z]
But this expression does not work in TextWrangler.
*EDIT: here are the exact contexts in which the occurrences appear (I mean only with these font colors).*
<font color =#48B700> - Stột jlăm wẻ baOne hundred and three<br></font>
<font color =#C0C0C0> »» Qzống pguộc lyời ba yghìm fảy dyổiTo live a life full of vicissitudes, to live a life marked by ups and downs<br></font>
"baOne" and "dyổiTo" must be "ba: One" and "dyổi: To"
Could anyone help? Many thanks.
I do believe (don't have TextWrangler at hand though) that you need to search for ([a-z])([A-Z]) and replace it with: \1: \2
Hope this helps.
Replace ([a-z])([A-Z]) with \1: \2 - I don't have TextWrangler, but it works in Notepad++
The parentheses capture the matched text, which is then referred to with the \1 syntax in the replacement string
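For readers without TextWrangler or Notepad++ at hand, here is a minimal sketch of the same capture-and-replace idea in Python. The use of Python's re module is an assumption of this example; the answers above only mention the two editors.

```python
import re

# Sample sentence taken from the question.
text = "This announcement meansStudents are welcome."

# Group 1 captures the lowercase letter, group 2 the uppercase letter;
# the replacement re-emits both with ": " in between.
fixed = re.sub(r"([a-z])([A-Z])", r"\1: \2", text)

print(fixed)  # This announcement means: Students are welcome.
```

Matching is case sensitive by default here, which is what the pattern relies on.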
This question is ages old, but I stumbled upon it, so someone else might as well. The OP's comment on Igor's response clarified how the task was meant to be described (and could have been added to the description).
To match only those font-specific lines of the HTML replace
(?<=<font color =#(?:48B700|C0C0C0)>)(.*?[a-z])([A-Z])
with \1: \2
Explanation:
(?<=[fixed-length regex]) is a positive lookbehind and means "if my match has this just before it"
(?:48B700|C0C0C0) is a non-capturing group that matches only these 2 colours. Since both are the same length, they work inside a lookbehind (which needs to be of fixed length)
(.*?[a-z])([A-Z]) will match everything after the > of those begin font tags up to your Capital letters.
The \1: \2 replacement is the same as in Igor's response, only that \1 will match the entire first string that needs separating.
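The lookbehind approach can be tried outside the editor too. The sketch below assumes Python's re module, which also requires fixed-width lookbehinds. The first two lines are copied from the question; the third, with colour #FF0000, is a hypothetical line added to show that other colours are left alone.

```python
import re

lines = [
    "<font color =#48B700> - Stột jlăm wẻ baOne hundred and three<br></font>",
    "<font color =#C0C0C0> »» Qzống pguộc lyời ba yghìm fảy dyổiTo live a life "
    "full of vicissitudes, to live a life marked by ups and downs<br></font>",
    # Hypothetical line with a colour that should NOT be touched.
    "<font color =#FF0000> baOne hundred and three<br></font>",
]

# The lookbehind anchors the match right after one of the two font tags.
# Group 1 lazily takes everything up to the first lowercase letter that is
# directly followed by an uppercase letter (group 2).
pattern = re.compile(r"(?<=<font color =#(?:48B700|C0C0C0)>)(.*?[a-z])([A-Z])")

fixed = [pattern.sub(r"\1: \2", line) for line in lines]

for line in fixed:
    print(line)
```

Because the match has to start directly after the tag, only the first lowercase/uppercase pair in each tagged line is split.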
Addition:
Your input strings contain special characters, and the part you want to split may very well end in one. In that case it won't be caught by [a-z] alone. You will need to add a character range that covers all the letters you care about, something like
(?<=<font color =#(?:48B700|C0C0C0)>)(.*?[a-zḁ-ῼ])([A-Z])
That is the correct pattern for matching a lowercase letter followed by an uppercase letter; however, you need to tick the Case Sensitive option in the Find/Replace dialogue, otherwise [a-z] and [A-Z] match the same letters.
Related
I have a regular expression as follows:
te\b"[^Haste]"
I want to find all words ending with "te" in each segment, but I need to exclude the word "Haste" and possibly a few other words, as they sometimes flood the list of errors with false positives.
Any help would be gratefully appreciated :-)
I tried to look it up here and there with no success. Also, many tries on regex101 with no success.
Try this:
\b(?!(?:Haste|AAAte)\b)\w*te\b
\b word boundary.
(?!(?:Haste|AAAte)\b) a negative lookahead: the word starting here must not be Haste or AAAte.
\w* zero or more word characters.
te the string te.
\b word boundary.
See regex demo
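A minimal sketch of the negative-lookahead pattern, assuming Python's re module; the sample sentence is made up for illustration.

```python
import re

# Same pattern as above: a word ending in "te" that is not Haste or AAAte.
pattern = re.compile(r"\b(?!(?:Haste|AAAte)\b)\w*te\b")

# Hypothetical sample text, not taken from the question.
sample = "Haste makes waste, so write the note late."

matches = pattern.findall(sample)
print(matches)  # ['waste', 'write', 'note', 'late']
```

The exclusion is case sensitive as written, so a lowercase "haste" would still match unless it is added to the list.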
One way is to match, but not capture, what you don't want and capture what you do want. Suppose we wanted to skip over "haste" and "paste". We could then use the following regular expression.
\b(?:haste|paste|(\w*te))\b
Suppose the string were as follows.
"In the surgeon's haste to amputate he removed the wrong leg."
The string pointer maintained by the regex engine would move from left to right one character at a time until it matched a word in the sentence ending in "te". The first would be "haste". That would be matched but not captured. We therefore pay no attention to that match.
Next, "amputate" is matched by
(\w*te)
As it is captured as well we find that "amputate" is a valid match.
Demo.
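The match-but-don't-capture idea can be sketched as follows, assuming Python's re module. Matches where group 1 is unset are the skipped words.

```python
import re

sentence = "In the surgeon's haste to amputate he removed the wrong leg."

# "haste" and "paste" are matched by the first two alternatives, which do
# not capture; any other word ending in "te" lands in group 1.
pattern = re.compile(r"\b(?:haste|paste|(\w*te))\b")

all_matches = [m.group(0) for m in pattern.finditer(sentence)]
wanted = [m.group(1) for m in pattern.finditer(sentence) if m.group(1) is not None]

print(all_matches)  # ['haste', 'amputate']
print(wanted)       # ['amputate']
```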
Text example,
We will NOT let these Caravans, which are also made up of some very
bad thugs and gang members, into the U.S. Our Border is sacred, must
come in legally.
I want to replace NOT with not, but leave We, Caravans, U.S. Our Border unchanged.
I found this Regex replace uppercase with lowercase letters
Find: (\w) Replace With: \L$1
This replaces all the upper cases.
PS: My reason for doing this is that I use a TTS engine to produce audio. NOT gets pronounced as N.O.T., and I just want it read as not. I don't want to replace too much, because the reader will also see the text, so keeping the original where possible is better.
Since \L before a capture group reference lowercases it in your environment, it sounds like all you need to do is find runs of two or more uppercase characters: match
([A-Z]{2,})
and replace with
\L$1
If you only want full words to be replaced, then add word boundaries as well:
(\b[A-Z]{2,}\b)
(you may be able to omit the capture group and use $& instead of $1)
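Not every engine supports \L in the replacement string. Python's re module, for one, does not, so the sketch below swaps in a replacement function that lowercases each match. The choice of Python is an assumption of this example, not something the question specifies.

```python
import re

text = ("We will NOT let these Caravans, which are also made up of some very "
        "bad thugs and gang members, into the U.S. Our Border is sacred, must "
        "come in legally.")

# Whole words made of two or more capital letters are lowercased;
# "We", "Caravans" and "U.S." do not contain such a run and are untouched.
lowered = re.sub(r"\b[A-Z]{2,}\b", lambda m: m.group(0).lower(), text)

print(lowered)
```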
The question you linked, Regex replace uppercase with lowercase letters, already covers this.
The second answer there can solve your problem
find:
([A-Z]{2,})
replace:
\L$1
I think I'd go with this:
(\b[A-Z]{2,}\b)
The \b is a word boundary, the [A-Z] is simply capital letters, and the {2,} is two or more.
https://regex101.com/r/qHmE1V/1
And in your replace \L$1
I'm using the below regex string to match the word "kohls" which is located in a group of other words.
\W*((?i)kohls(?-i))\W*
It works great when the word is alone, but if the word is in a url, the match includes a period on both sides.
See the below examples:
Thank you for shopping at Kohls - returns a match for kohls.
https://www.kohls.com - returns a match for .kohls.
Edit. https://www.KohlsAndMichaels.com - doesn't return any match for kohls.
I want it to only extract the exact match for kohls without periods or any other symbols/text in front or behind it. Can you tell me what I'm doing wrong?
In cases like that you can always use a site like regex101.com, which explains the regular expression and shows the matches with colors. So this is how your regular expression currently works:
As you can see in blue color, the problem with the dots is in the \W*, which matches any non-word character. In order to fix this, you can use the following regular expression:
\b((?i)kohls(?-i))\b
The \b (before and after the word you want to match) asserts that the position is at a word boundary. See how this works on that website now:
If you still have questions, look at the explanation of the regular expression provided by that website. It is worth looking.
The \W metacharacter matches non-word characters, so adding a star operator will match 0 or more of these non-word characters (like periods). Did you mean to use a word boundary instead?
\b(?i)kohls(?-i)\b
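The difference between \W* and \b can be checked with a short sketch. It assumes Python's re module, and because Python treats inline flags differently, it passes re.IGNORECASE instead of writing (?i)...(?-i). The helper whole_match is a hypothetical name introduced for this example.

```python
import re

samples = [
    "Thank you for shopping at Kohls",
    "https://www.kohls.com",
    "https://www.KohlsAndMichaels.com",
]

loose = re.compile(r"\W*(kohls)\W*", re.IGNORECASE)   # original pattern
strict = re.compile(r"\b(kohls)\b", re.IGNORECASE)    # word-boundary version

def whole_match(pattern, s):
    """Return the full matched text, or None when nothing matches."""
    m = pattern.search(s)
    return m.group(0) if m else None

for s in samples:
    print(repr(whole_match(loose, s)), repr(whole_match(strict, s)))
```

With \W* the whole match for the URL is ".kohls.", while with \b it is just "kohls". The \b version finds nothing in KohlsAndMichaels because there is no word boundary between "Kohls" and "And".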
Replace both \W* with [\W,\.\-]* etc.
Should be enough.
I've got strings like this in a text file:
10.Divide using the divider at 12C. and pressure at 3.0.
11.Form into cylinders and put on boards, don't handle too much.
This Regex (\d+\.)[A-Z] correctly finds a numeric value, followed by a period, followed by a capital letter.
I want to insert a space between the period and the capital letter. How do I do this?
Actually your regex is wrong:
(\d+.)[A-Z] matches one or more digits followed by ANY CHARACTER, because an unescaped . in a regex means any character. The more correct version is \d+\.[A-Z] (I omitted the group too, as it is not required for matching. Note that the . is escaped).
To insert the space, apart from the two-group solution given in another answer (find (\d+\.)([A-Z]), with the dot fixed, and replace with \1 \2), you may also consider using lookarounds:
Find (?<=\d\.)(?=[A-Z]) and replace with a single space. This regex finds a position that is preceded by a digit and a dot and followed by a capital letter, and the replacement puts a space at that position. (Note that the lookahead and lookbehind groups are not part of the matched text.)
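Both variants, and the effect of leaving the dot unescaped, can be compared in a small sketch. It assumes Python's re module; the input line is copied from the question.

```python
import re

line = "10.Divide using the divider at 12C. and pressure at 3.0."

# Two capture groups with an escaped dot.
with_groups = re.sub(r"(\d+\.)([A-Z])", r"\1 \2", line)

# Zero-width lookarounds: the match is the empty position itself.
with_lookaround = re.sub(r"(?<=\d\.)(?=[A-Z])", " ", line)

# Unescaped dot: "." also matches the "2" in "12C", which breaks "12C".
unescaped = re.sub(r"(\d+.)([A-Z])", r"\1 \2", line)

print(with_groups)      # 10. Divide using the divider at 12C. and pressure at 3.0.
print(with_lookaround)  # 10. Divide using the divider at 12C. and pressure at 3.0.
print(unescaped)        # 10. Divide using the divider at 12 C. and pressure at 3.0.
```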
You're mostly there. When you wrap a regex subexpression in parentheses, you can refer to it in the "Replace" field of a find and replace operation. So...
Find what: (\d+\.)([A-Z])
Replace with: \1 \2
(See similar questions like this one.)
I am trying out the quiz from Regex 101
In Task 6, the question is
Oh no! It seems my friends spilled beer all over my keyboard last night and my keys are super sticky now. Some of the time when I press a key, I get two duplicates. Can you pppllleaaaseee help me fix this? Content in bold should be removed.
I have tried this regex
([a-z])(\1{2})
But couldn't get the solution.
The solution for the riddle on that website is:
/(.)\1{2}/g
Since any key on the keyboard can get stuck, we need to use . (any character) rather than [a-z].
\1 in the regex means match whatever the 1st capturing group (.) matches.
Replacement is $1 or \1.
The rest of your regex is correct, just that there are unnecessary capturing groups.
Your regex is correct if you want to match exactly three characters. If you want to match at least three, that is
([a-z])(\1{2,})
or
([a-z])(\1\1+)
Since you don't need to capture anything but the first occurrence, these are slightly better:
([a-z])\1{2} # your original regex (exactly three occurrences)
([a-z])\1{2,}
([a-z])\1\1+
Now, the replacement should be exactly one occurrence of the character, and nothing more:
\1
Replace:
(.)\1+
with:
\1
This of course requires that your regex engine supports backreferences. Also, depending on the regex engine, \1 in the replacement part may have to be written as $1.
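A short sketch of the backreference replacement, assuming Python's re module (which writes the replacement as \1). The first input is the phrase from the quiz text; the second is a phrase from the same task, used here to show why (.)\1+ can be too eager.

```python
import re

sticky = "Can you pppllleaaaseee help me fix this?"

# Exactly three identical characters in a row collapse to one.
exactly_three = re.sub(r"(.)\1{2}", r"\1", sticky)
print(exactly_three)  # Can you please help me fix this?

# (.)\1+ collapses ANY repeated run, including legitimate double letters.
phrase = "my friends spilled beer"
one_or_more = re.sub(r"(.)\1+", r"\1", phrase)
print(one_or_more)    # my friends spiled ber

# The exactly-three version leaves ordinary double letters alone.
untouched = re.sub(r"(.)\1{2}", r"\1", phrase)
```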
I'd do it with (\w)(\1+)? but can't find out how to "remove" within the given site...
The best way would be to replace the text matched by the second group with an empty string