Putting space in camel case string using regular expression - regex

I am deriving my question from "add a space between two words".
Requirement: Split a camel case string and put a space just before each capital letter that is followed by a lowercase letter or by nothing. No space should occur between consecutive capital letters.
E.g. CSVFilesAreCoolButTXT is a string I want to turn into CSV Files Are Cool But TXT.
I derived a regular expression this way:
"LightPurple".replace(/([a-z])([A-Z])/, '$1 $2')
If you have more than 2 words, then you'll need to use the g flag, to match them all.
"LightPurpleCar".replace(/([a-z])([A-Z])/g, '$1 $2')
If you are trying to split words like CSVFile then you might need to use this regexp instead:
"CSVFilesAreCool".replace(/([a-zA-Z])([A-Z])([a-z])/g, '$1 $2$3')
But still it does not serve the way I have put my requirements.

var rex = /([A-Z])([A-Z])([a-z])|([a-z])([A-Z])/g;
"CSVFilesAreCoolButTXT".replace( rex, '$1$4 $2$3$5' );
// "CSV Files Are Cool But TXT"
And also
"CSVFilesAreCoolButTXTRules".replace( rex, '$1$4 $2$3$5' );
// "CSV Files Are Cool But TXT Rules"
The text of the subject string that matches the regex pattern will be replaced by the replacement string '$1$4 $2$3$5', where the $1, $2 etc. refer to the substrings matched by the pattern's capture groups ().
$1 refers to the substring matched by the first ([A-Z]) sub-pattern, and $3 refers to the substring matched by the first ([a-z]) sub-pattern etc.
Because of the alternation character |, to make a match the regex will have to match either the ([A-Z])([A-Z])([a-z]) sub-pattern or the ([a-z])([A-Z]) sub-pattern, so if a match is made several of the capture groups will remain unmatched. These capture groups can still be referenced in the replacement string but they have no effect upon it: effectively, they reference an empty string.
The space in the replacement string ensures a space is inserted in the subject string every time a match is made (the trailing g flag means the regular expression engine will look for more than one match).
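To make the point about unmatched groups concrete, here is a minimal sketch in JavaScript. The helper names spaceCamelCase and describeMatches are made up for illustration; the pattern and replacement string are the ones from the answer above.

```javascript
// Same pattern as above: either "upper, upper, lower" or "lower, upper".
const rex = /([A-Z])([A-Z])([a-z])|([a-z])([A-Z])/g;

// Hypothetical helper: applies the replacement string from the answer.
function spaceCamelCase(input) {
  // Groups that did not take part in a match expand to an empty string,
  // so '$1$4 $2$3$5' works for whichever alternative matched.
  return input.replace(rex, '$1$4 $2$3$5');
}

// Hypothetical helper: lists each match with its five groups,
// showing unmatched groups as empty strings.
function describeMatches(input) {
  return [...input.matchAll(rex)].map((m) => ({
    match: m[0],
    groups: m.slice(1).map((g) => (g === undefined ? '' : g)),
  }));
}

console.log(spaceCamelCase('CSVFilesAreCoolButTXT'));
console.log(describeMatches('CSVFiles'));    // first alternative: groups 1-3 filled
console.log(describeMatches('LightPurple')); // second alternative: groups 4-5 filled
```

Only one side of the alternation fills its groups for any given match, which is why the replacement string can safely mention all five.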

If the first character is always lowercase.
'camelCaseString'.replace(/([A-Z]+)/g, ' $1')
If the first character is uppercase.
'CamelCaseString'.replace(/([A-Z]+)/g, ' $1').replace(/^ /, '')

Splitting CamelCase with regex in .NET :
Regex.Replace(input, "((?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z]))", " $1").Trim();
Example :
Regex.Replace("TheCapitalOfTheUAEIsAbuDhabi", "((?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z]))", " $1").Trim();
Output :
The Capital Of The UAE Is Abu Dhabi

This worked for me
let camelCase = "CSVFilesAreCoolButTXTRules"
let re = /[A-Z-_\&](?=[a-z0-9]+)|[A-Z-_\&]+(?![a-z0-9])/g
let delimited = camelCase.replace(re,' $&').trim()
The above code works for almost all the use cases I had. I had a few peculiarities where '&' and '_' should be treated as equivalent to an uppercase character.
ThisIsASlug ---> This Is A Slug
loremIpsum ---> lorem Ipsum
PAGS_US ---> PAGS_US
TheCapitalOfTheUAEIsAbuDhabi ---> The Capital Of The UAE Is Abu Dhabi
eclipseRCPExt ---> eclipse RCP Ext
VALUE ---> VALUE
SG&A ---> SG&A
A brief explanation
[A-Z-_\&](?=[a-z0-9]+)
// Matches the start of a normal word, i.e. one uppercase character that is followed by one or more lowercase letters or digits
[A-Z-_\&]+(?![a-z0-9])
// Matches acronyms & abbreviations, i.e. a sequence of uppercase characters that is not followed by a lowercase letter or digit
Check out the regexr fiddle here
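As a rough check of the two alternatives explained above, the sketch below wraps the pattern in a small function and runs it over the sample inputs. The function name delimitCamelCase is an assumption for illustration, and the character class is written as [A-Z\-_&], which should be equivalent to the original [A-Z-_\&].

```javascript
// Pattern from the answer above; the hyphen is escaped for readability.
const boundary = /[A-Z\-_&](?=[a-z0-9]+)|[A-Z\-_&]+(?![a-z0-9])/g;

// Hypothetical helper: puts a space before every match, then trims
// the leading space that appears when the string starts with a match.
function delimitCamelCase(input) {
  return input.replace(boundary, ' $&').trim();
}

const samples = [
  'ThisIsASlug',
  'loremIpsum',
  'PAGS_US',
  'TheCapitalOfTheUAEIsAbuDhabi',
  'eclipseRCPExt',
  'VALUE',
  'SG&A',
];

for (const s of samples) {
  console.log(`${s} ---> ${delimitCamelCase(s)}`);
}
```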

Camel-case replacement for Javascript using lookaheads / behinds:
"TheCapitalOfTheUAEIsAbuDhabi".replace(/([A-Z](?=[a-z]+)|[A-Z]+(?![a-z]))/g, ' $1').trim()
// "The Capital Of The UAE Is Abu Dhabi"

Related

How can I allow or ignore apostrophes?

I am looking for a regex expression that allows (or ignores) an apostrophe? I'm fairly new to regex and I looked at other similar questions but didn't find the help I need.
I am using a textbox to search an RTB and match all words with a specific or common ending (i.e. the search term inserted in the textbox). Then, I need to pass all matches to a second RTB.
I have tried many different expressions including: \b\w*[-']\w*\b but the program either separates the word at the apostrophe, finds only words with an apostrophe, or lists all words as matches?
My sample list of words to search is:
mi'iria, mi'i, piraria, makuptiaria, netap, hap, kuap, uimikuaptiaria, uhyt, set, uipu'aptiaria, mu'ap, atat, hat, haria, yat. (commas are not in the original list)!
As you can see, there are words that end in "ria" which contain an apostrophe and words that do not. I want to match all words that end with "ria," but I get results like: mi as one match, iria as another match and piraria, makuptiaria, uimikuaptiaria and haria aren't matched?
I need an expression that will allow (or ignore) the apostrophe so that all words that end in "ria" are matched independent of whether they contain an apostrophe or not. Also, words which contain an apostrophe (i.e. similar to mi'iria) should not be separated because of the apostrophe. Can anyone help on this? I am very grateful for any help! Thanks!
Okay, I spent some time tinkering on https://regex101.com/r/X4oL0y/1 and came up with the following expression which matches all words that end with "ria" including those with and those without an apostrophe:
\b\w+\'?\w+ria\w*\b
However, the \w+ria part of this regex contains literal characters. This limits the functionality to words that end with "ria." Is there a way to generically declare the search term the user enters in the textbox as the character(s) to match so that all whole words that end with the search term are matched?
This is my code so far:
'Set index:
Dim index As Integer = 0
'Find and highlight all search term occurrences:
While index < RichTextBox1.Text.LastIndexOf(TextBox1.Text)
RichTextBox1.Find(TextBox1.Text, index, RichTextBox1.TextLength, RichTextBoxFinds.None)
RichTextBox1.SelectionBackColor = ColorTranslator.FromOle(RGB(255, 255, 192))
index = RichTextBox1.Text.IndexOf(TextBox1.Text, index) + 1
End While
' Input string.
Dim value As String = RichTextBox1.Text
' Call Regex.Matches method.
Dim matches As MatchCollection = Regex.Matches(value, "\b\w+\'?\w+ria\w*\b")
' Loop over matches.
For Each m As Match In matches
' Loop over captures.
For Each c As Capture In m.Captures
' Display.
RichTextBox2.Text += String.Format("Index={0}, Value={1}" & Chr(13), c.Index, c.Value)
Next
Next
If you want the whole word to be matched, you could make the character class optional [-']? and add ria to the end right before the word boundary
\b\w*[-']?\w*ria\b
See a .NET regex demo
As per the comment of @ctwheels, using an optional non-capturing group is more efficient.
\b\w*(?:[-']\w*)?ria\b
\b Word boundary
\w* Match 0+ word chars
(?: Non capturing group
[-']\w* Match either - or ' and 0+ word chars
)? Close group and make it optional
ria Match literally
\b Word boundary
See another .NET regex demo
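The question also asks how to use whatever the user types instead of the literal "ria". One possible approach, sketched here in JavaScript rather than VB.NET, is to escape the search term and splice it into the pattern above. The names escapeRegExp and wordsEndingWith are hypothetical; in .NET the same idea would presumably rely on Regex.Escape.

```javascript
// Escape characters that have a special meaning in a regex, so the
// user's search term is always treated as literal text.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: returns every whole word that ends with `ending`,
// allowing one optional - or ' inside the word.
function wordsEndingWith(text, ending) {
  const pattern = "\\b\\w*(?:[-']\\w*)?" + escapeRegExp(ending) + "\\b";
  return text.match(new RegExp(pattern, 'g')) || [];
}

const list =
  "mi'iria mi'i piraria makuptiaria netap hap kuap uimikuaptiaria " +
  "uhyt set uipu'aptiaria mu'ap atat hat haria yat";

console.log(wordsEndingWith(list, 'ria'));
console.log(wordsEndingWith(list, 'ap'));
```

Escaping matters as soon as the search term can contain characters such as . or +, which would otherwise change the meaning of the pattern.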
Assuming the list is a file like this:
mi'iria
mi'i
piraria
makuptiaria
netap
hap
kuap
uimikuaptiaria
uhyt
set
uipu'aptiaria
mu'ap
atat
hat
haria
yat
Try this one
\b[a'-z]*.ria\b

Validating a string's first 3 letters as uppercase with regex

I have a question on Classic ASP regarding validating a string's first 3 letters to be uppercase while the last 4 characters should be in numerical form using regex.
For e.g.:
dim myString = "abc1234"
How do I validate that it should be "ABC1234" instead of "abc1234"?
Apologies for my broken English and for being a newbie in Classic ASP.
@ndn has a good regex pattern for you. To apply it in Classic ASP, you just need to create a RegExp object that uses the pattern and then call the Test() function to test your string against the pattern.
For example:
Dim re
Set re = New RegExp
re.Pattern = "^[A-Z]{3}.*[0-9]{4}$" ' @ndn's pattern
If re.Test(myString) Then
' Match. First three characters are uppercase letters and last four are digits.
Else
' No match.
End If
^[A-Z]{3}.*[0-9]{4}$
Explanation:
Surround everything with ^$ (start and end of string) to ensure you are matching everything
[A-Z] - gives you all capital letters in the English alphabet
{3} - three of those
.* - optionally, there can be something in between (if there can't be, you can just remove this)
[0-9] - any digit
{4} - 4 of those
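The effect of the optional .* part can be seen in a short sketch. It is written in JavaScript rather than Classic ASP purely for illustration, and the names loose and strict are made up; the loose pattern is the one explained above, and the strict one is the same pattern with .* removed.

```javascript
// Pattern from the answer: anything may sit between the letters and digits.
const loose = /^[A-Z]{3}.*[0-9]{4}$/;
// Same pattern without ".*": exactly three uppercase letters, then four digits.
const strict = /^[A-Z]{3}[0-9]{4}$/;

for (const s of ['ABC1234', 'abc1234', 'ABCxy1234', 'ABC123']) {
  console.log(s, 'loose:', loose.test(s), 'strict:', strict.test(s));
}
```

If the string must be exactly seven characters long, as in the "ABC1234" example, the strict form is probably the one to use.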

Regex in PHP: take all the words after the first one in string and truncate all of them to the first character

I'm quite terrible at regexes.
I have a string that may have 1 or more words in it (generally 2 or 3), usually a person name, for example:
$str1 = 'John Smith';
$str2 = 'John Doe';
$str3 = 'David X. Cohen';
$str4 = 'Kim Jong Un';
$str5 = 'Bob';
I'd like to convert each as follows:
$str1 = 'John S.';
$str2 = 'John D.';
$str3 = 'David X. C.';
$str4 = 'Kim J. U.';
$str5 = 'Bob';
My guess is that I should first match the first word, like so:
preg_match( "^([\w\-]+)", $str1, $first_word )
then all the words after the first one... but how do I match those? should I use again preg_match and use offset = 1 in the arguments? but that offset is in characters or bytes right?
Anyway, after I have matched the words following the first, if they exist, should I do something like this for each of them:
$second_word = substr( $following_word, 1 ) . '. ';
Or my approach is completely wrong?
Thanks
ps - it would be a boon if the regex could maintain the whole first two words when the string contain three or more words... (e.g. 'Kim Jong U.').
It can be done in single preg_replace using a regex.
You can search using this regex:
^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+
And replace by:
$1.
Code:
$name = preg_replace('/^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+/', '$1.', $name);
Explanation:
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a neat way around the restriction that a variable-length lookbehind cannot be used in the above regex.
^\w+(?:$| +)(*SKIP)(*F) matches first word in a name and skips it (does nothing)
(\w)\w+ matches all other words and replaces it with first letter and a dot.
You could use a positive lookbehind assertion.
(?<=\h)([A-Z])\w+
OR
Use this regex if you want to turn Bob F to Bob F.
(?<=\h)([A-Z])\w*(?!\.)
Then replace the matched characters with \1.
Code would be like,
preg_replace('~(?<=\h)([A-Z])\w+~', '\1.', $string);
(?<=\h)([A-Z]) Captures each uppercase letter which is preceded by a horizontal space character.
\w+ matches one or more word characters.
Replace the matched chars with the chars inside the group index 1 \1 plus a dot will give you the desired output.
A simple solution with only look-ahead and word boundary check:
preg_replace('~(?!^)\b(\w)\w+~', '$1.', $string);
(\w)\w+ is a word in the name, with the first character captured
(?!^)\b performs a word boundary check \b, and makes sure the match is not at the start of the string (?!^).
Demo
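The (*SKIP)(*F) verbs and \h are PCRE features, but the last pattern uses only a lookahead and a word boundary, so it should carry over to other engines. A minimal JavaScript sketch of that idea follows; the function name abbreviateName is made up for illustration.

```javascript
// Hypothetical helper: shortens every word except the first to its
// initial followed by a dot, using the pattern from the last answer.
function abbreviateName(name) {
  // (?!^) rules out a match at the start of the string, \b anchors at a
  // word start, (\w) keeps the initial and \w+ consumes the rest.
  return name.replace(/(?!^)\b(\w)\w+/g, '$1.');
}

for (const name of ['John Smith', 'John Doe', 'David X. Cohen', 'Kim Jong Un', 'Bob']) {
  console.log(`${name} -> ${abbreviateName(name)}`);
}
```

Because \w+ needs at least one more character after the initial, an existing initial such as "X." is left untouched.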

Recognize numbers in french format inside document using regex

I have a document containing numbers in various formats, french, english, custom formats.
I wanted a regex that could catch ONLY numbers in french format.
This is a complete list of numbers I want to catch (d represents a digit, decimal separator is comma , and thousands separator is space)
d,d d,dd d,ddd
dd,d dd,dd dd,ddd
ddd,d ddd,dd ddd,ddd
d ddd,d d ddd,dd d ddd,ddd
dd ddd,d dd ddd,dd dd ddd,ddd
ddd ddd,d ddd ddd,dd ddd ddd,ddd
d ddd ddd,d...
dd ddd ddd,d...
ddd ddd ddd,d...
This is the regex I have
(\d{1,3}\s(\d{3}\s)*\d{3}(\,\d{1,3})?|\d{1,3}\,\d{1,3})
It catches French formats like those above, so I am on the right track, but it also catches numbers like d,ddd.dd (because it catches d,ddd) or d,ddd,ddd (because it catches d,ddd).
What should I add to my regex ?
The VBA code I have:
Sub ChangeNumberFromFRformatToENformat()
Dim SectionText As String
Dim RegEx As Object, RegC As Object, RegM As Object
Dim i As Integer
Set RegEx = CreateObject("vbscript.regexp")
With RegEx
.Global = True
.MultiLine = False
.Pattern = "(\d{1,3}\s(\d{3}\s)*\d{3}(\,\d{1,3})?|\d{1,3}\,\d{1,3})"
' regular expression used by the macro to recognise FR formatted numbers
End With
For i = 1 To ActiveDocument.Sections.Count()
SectionText = ActiveDocument.Sections(i).Range.Text
If RegEx.test(SectionText) Then
Set RegC = RegEx.Execute(SectionText)
' RegC: regular expression matches collection, holding French format numbers
For Each RegM In RegC
Call ChangeThousandAndDecimalSeparator(RegM.Value)
Next 'For Each RegM In RegC
Set RegC = Nothing
Set RegM = Nothing
End If
Next 'For i = 1 To ActiveDocument.Sections.Count()
Set RegEx = Nothing
End Sub
The user stema gave me a nice solution. The regex should be:
(?<=^|\s)\d{1,3}(?:\s\d{3})*(?:\,\d{1,3})?(?=\s|$)
But VBA complains that the regexp has unescaped characters. I found one in (?: \d{3}), between (?: and \d{3}): a blank character, which I can replace with \s. The second one, I think, is in (?:,\d{1,3}), between ?: and \d: the comma character, which becomes \, when escaped.
So the regex is now (?<=^|\s)\d{1,3}(?:\s\d{3})*(?:\,\d{1,3})?(?=\s|$) and it works fine in RegExr but my VBA code will not accept it.
NEW LINE IN POST :
I have just discovered that VBA doesn't agree with this sequence of the regex ?<=^
What about this?
\b\d{1,3}(?: \d{3})*(?:,\d{1,3})?\b
See it here on Regexr
\b are word boundaries
At first (\d{1,3}) match 1 to 3 digits, then there can be 0 or more groups of a leading space followed by 3 digits ((?: \d{3})*) and at last there can be an optional fraction part ((?:,\d{1,3})?)
Edit:
if you want to avoid 1,111.1 then the \b anchors are not good for you. Try this:
(?<=^|\s)\d{1,3}(?: \d{3})*(?:,\d{1,3})?(?=\s|$)
This regex requires now a whitespace or the start of the string before and a whitespace or the end of the string after the number to match.
Edit 2:
Since look behinds are not supported you can change to
(?:^|\s)\d{1,3}(?: \d{3})*(?:,\d{1,3})?(?=\s|$)
This changes nothing at the start of the string, but if the number is preceded by whitespace, that whitespace is now included in the match. If the result of the match is used for something, the leading whitespace has to be stripped first (I am quite sure VBA has a method for that; try Trim()).
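The lookbehind-free variant and the trimming step can be sketched together as follows. This is JavaScript rather than VBA, shown only to illustrate the behaviour; the function name findFrenchNumbers and the sample sentence are assumptions.

```javascript
// Lookbehind-free pattern from the answer: the leading (?:^|\s) may pull
// one whitespace character into the match.
const frenchNumber = /(?:^|\s)\d{1,3}(?: \d{3})*(?:,\d{1,3})?(?=\s|$)/g;

// Hypothetical helper: collects all matches and strips the whitespace
// that the leading group may have consumed.
function findFrenchNumbers(text) {
  return (text.match(frenchNumber) || []).map((m) => m.trim());
}

// "1,234.56" is in English format and should be skipped.
console.log(findFrenchNumbers('Total 1 234,56 and 1,234.56 and 12,5'));
```

The trailing check stays a lookahead, so the whitespace after one number is still available as the leading whitespace of the next one.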
If you are reading on a line by line basis, you might consider adding anchors (^ and $) to your regex, so you will end up with something like so:
^(\d{1,3}\s(\d{3}\s)*\d{3}(\,\d{1,3})?|\d{1,3}\,\d{1,3})$
This instructs the RegEx engine to start matching from the beginning of the line till the very end.

Regular expression for duplicate words

I'm a regular expression newbie and I can't quite figure out how to write a single regular expression that would "match" any duplicate consecutive words such as:
Paris in the the spring.
Not that that is related.
Why are you laughing? Are my my regular expressions THAT bad??
Is there a single regular expression that will match ALL of the bold strings above?
Try this regular expression:
\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
Regex101 example here
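As a quick check of that pattern against the three sentences from the question, here is a minimal JavaScript sketch. The function name findDoubledWords is made up for illustration.

```javascript
// Hypothetical helper: returns every "word word" pair found in the text.
function findDoubledWords(text) {
  // \b(\w+) captures a whole word, \s+ skips the gap, and \1\b requires
  // the very same word again, ending at a word boundary.
  return text.match(/\b(\w+)\s+\1\b/g) || [];
}

const sentences = [
  'Paris in the the spring.',
  'Not that that is related.',
  'Why are you laughing? Are my my regular expressions THAT bad??',
];

for (const s of sentences) {
  console.log(findDoubledWords(s));
}
```

The word boundaries are what keep "This is" from being reported as a doubled "is".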
I believe this regex handles more situations:
/(\b\S+\b)\s+\b\1\b/
A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
The below expression should work correctly to find any number of duplicated words. The matching can be case insensitive.
String regex = "\\b(\\w+)(\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0), m.group(1));
}
Sample Input : Goodbye goodbye GooDbYe
Sample Output : Goodbye
Explanation:
The regex expression:
\b : Start of a word boundary
\w+ : Any number of word characters
(\s+\1\b)+ : One or more whitespace characters followed by a word that matches the previous word and ends at a word boundary. Wrapping the whole group in + finds one or more repetitions.
Grouping :
m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe
m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye
Replace method shall replace all consecutive matched words with the first instance of the word.
Try this with the RE below:
\b start of word (word boundary)
\W+ one or more non-word characters
\1 the same word that was already matched
\b end of word
()* repeat the group again
public static void main(String[] args) {
String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*"; // a word followed by any number of repeats of itself
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE); // ignore case when comparing repeats
Scanner in = new Scanner(System.in);
int numSentences = Integer.parseInt(in.nextLine());
while (numSentences-- > 0) {
String input = in.nextLine();
Matcher m = p.matcher(input);
// Check for subsequences of input that match the compiled pattern
while (m.find()) {
input = input.replaceAll(m.group(0),m.group(1));
}
// Prints the modified sentence.
System.out.println(input);
}
in.close();
}
Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)
Try this regex that can catch 2 or more duplicate words and only leave behind one single word. And the duplicate words need not even be consecutive.
/\b(\w+)\b(?=.*?\b\1\b)/ig
Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.
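One thing worth noting is that this pattern matches the earlier occurrences of a repeated word, so removing the matches keeps the last occurrence. A minimal JavaScript sketch, assuming a made-up helper name keepLastOccurrence and an extra whitespace clean-up step that is not part of the original answer:

```javascript
// Hypothetical helper: removes every word that appears again later on,
// then tidies up the spaces left behind.
function keepLastOccurrence(text) {
  return text
    // A word is matched only if the lookahead finds the same word later.
    .replace(/\b(\w+)\b(?=.*?\b\1\b)/gi, '')
    // Assumption: collapse the gaps left by the removed words.
    .replace(/\s+/g, ' ')
    .trim();
}

console.log(keepLastOccurrence('apple banana apple cherry banana'));
console.log(keepLastOccurrence('The cat saw the dog'));
```

Since . does not match a line break by default, duplicates on different lines would not be detected by this sketch.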
The widely-used PCRE library can handle such situations (you won't achieve the same with POSIX-compliant regex engines, though):
(\b\w+\b)\W+\1
Here is one that catches multiple words multiple times:
(\b\w+\b)(\s+\1)+
No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.
This is the regex I use to remove duplicate phrases in my twitch bot:
(\S+\s*)\1{2,}
(\S+\s*) looks for any string of characters that isn't whitespace, followed by optional whitespace.
\1{2,} then requires at least two further consecutive copies of that phrase. If there are 3 or more identical phrases in a row, it matches.
Since some developers are coming to this page in search of a solution which not only eliminates duplicate consecutive non-whitespace substrings, but triplicates and beyond, I'll show the adapted pattern.
Pattern: /(\b\S+)(?:\s+\1\b)+/ (Pattern Demo)
Replace: $1 (replaces the fullstring match with capture group #1)
This pattern greedily matches a "whole" non-whitespace substring, then requires one or more copies of the matched substring which may be delimited by one or more whitespace characters (space, tab, newline, etc).
Specifically:
\b (word boundary) characters are vital to ensure partial words are not matched.
The second parenthetical is a non-capturing group, because this variable width substring does not need to be captured -- only matched/absorbed.
the + (one or more quantifier) on the non-capturing group is more appropriate than * because * will "bother" the regex engine to capture and replace singleton occurrences -- this is wasteful pattern design.
*note if you are dealing with sentences or input strings with punctuation, then the pattern will need to be further refined.
The example in Javascript: The Good Parts can be adapted to do this:
var doubled_words = /([A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+)\s+\1(?:\s|$)/gi;
\b uses \w for word boundaries, where \w is equivalent to [0-9A-Z_a-z]. If you don't mind that limitation, the accepted answer is fine.
This expression (inspired from Mike, above) seems to catch all duplicates, triplicates, etc, including the ones at the end of the string, which most of the others don't:
/(^|\s+)(\S+)(($|\s+)\2)+/g, "$1$2")
I know the question asked to match duplicates only, but a triplicate is just 2 duplicates next to each other :)
First, I put (^|\s+) to make sure it starts with a full word, otherwise "child's steak" would go to "child'steak" (the "s"'s would match). Then, it matches all full words ((\b\S+\b)), followed by an end of string ($) or a number of spaces (\s+), the whole repeated more than once.
I tried it like this and it worked well:
var s = "here here here here is ahi-ahi ahi-ahi ahi-ahi joe's joe's joe's joe's joe's the result result result";
print( s.replace( /(\b\S+\b)(($|\s+)\1)+/g, "$1"))
--> here is ahi-ahi joe's the result
Try this regular expression it fits for all repeated words cases:
\b(\w+)\s+\1(?:\s+\1)*\b
I think another solution would be to use named capture groups and backreferences like this:
.* (?<mytoken>\w+)\s+\k<mytoken> .*
OR
.*(?<mytoken>\w{3,}).+\k<mytoken>.*
Kotlin:
val regex = Regex(""".* (?<myToken>\w+)\s+\k<myToken> .*""")
val input = "This is a test test data"
val result = regex.find(input)
println(result!!.groups["myToken"]!!.value)
Java:
var pattern = Pattern.compile(".* (?<myToken>\\w+)\\s+\\k<myToken> .*");
var matcher = pattern.matcher("This is a test test data");
var isFound = matcher.find();
var result = matcher.group("myToken");
System.out.println(result);
JavaScript:
const regex = /.* (?<myToken>\w+)\s+\k<myToken> .*/;
const input = "This is a test test data";
const result = regex.exec(input);
console.log(result.groups.myToken);
// OR
const regexGlobal = /.* (?<myToken>\w+)\s+\k<myToken> .*/g;
const allMatches = [...input.matchAll(regexGlobal)];
console.log(allMatches[0].groups.myToken);
All the above detect the test as the duplicate word.
Tested with Kotlin 1.7.0-Beta, Java 11, Chrome and Firefox 100.
You can use this pattern:
\b(\w+)(?:\W+\1\b)+
This pattern can be used to match all duplicated word groups in sentences. :)
Here is a sample util function written in java 17, which replaces all duplications with the first occurrence:
public String removeDuplicates(String input) {
    var regex = "\\b(\\w+)(?:\\W+\\1\\b)+";
    var pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    // Replace each run of repeats with its first word. Using the Matcher's
    // replaceAll avoids re-interpreting matched text (e.g. "why? why") as a regex.
    return pattern.matcher(input).replaceAll("$1");
}
As far as I can see, none of these would match:
London in the
the winter (with "the winter" on a new line)
Although matching duplicates on the same line is fairly straightforward, I haven't been able to come up with a solution for the situation in which they stretch over two lines (with Perl).
To find duplicate words that have no leading or trailing non whitespace character(s) other than a word character(s), you can use whitespace boundaries on the left and on the right making use of lookarounds.
The pattern will have a match in:
Paris in the the spring.
Not that that is related.
The pattern will not have a match in:
This is $word word
(?<!\S)(\w+)\s+\1(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left of the current location
(\w+) Capture group 1, match 1 or more word characters
\s+ Match 1 or more whitespace characters (note that this can also match a newline)
\1 Backreference to match the same as in group 1
(?!\S) Negative lookahead, assert not a non whitespace char to the right of the current location
See a regex101 demo.
To find 2 or more duplicate words:
(?<!\S)(\w+)(?:\s+\1)+(?!\S)
This part of the pattern (?:\s+\1)+ uses a non capture group to repeat 1 or more times matching 1 or more whitespace characters followed by the backreference to match the same as in group 1.
See a regex101 demo.
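The whitespace-boundary pattern can be tried out with a short JavaScript sketch. It assumes an engine with lookbehind support (ES2018 or later), and the helper name collapseRepeats is made up for illustration.

```javascript
// Pattern from above: a word repeated one or more times, with whitespace
// (or the string edge) on both sides of the whole run.
const repeated = /(?<!\S)(\w+)(?:\s+\1)+(?!\S)/g;

// Hypothetical helper: replaces each run of repeats with a single copy.
function collapseRepeats(text) {
  return text.replace(repeated, '$1');
}

console.log(collapseRepeats('Paris in the the spring.'));
console.log(collapseRepeats('This is $word word'));     // unchanged: "$" precedes the first "word"
console.log(collapseRepeats('London in the\nthe winter')); // \s+ also spans the line break
```

Because \s+ also matches a newline, a duplicate that is split across two lines is collapsed as well, and the line break is dropped along with the repeated word.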
Alternatives without using lookarounds
You could also make use of a leading and trailing alternation matching either a whitespace char or assert the start/end of the string.
Then use a capture group 1 for the value that you want to get, and use a second capture group with a backreference \2 to match the repeated word.
Matching 2 duplicate words:
(?:\s|^)((\w+)\s+\2)(?:\s|$)
See a regex101 demo.
Matching 2 or more duplicate words:
(?:\s|^)((\w+)(?:\s+\2)+)(?:\s|$)
See a regex101 demo.
Use this in case you want case-insensitive checking for duplicate words.
(?i)\\b(\\w+)\\s+\\1\\b