Split words from CamelCase string - regex

I have a string
string = 'one Two9three four_Five 67SixSevenEightNine';
I need to split it into the words:
'one' 'two' 'three' 'four' 'five' 'six' 'seven' 'eight' 'nine'
I managed to separate all except the CamelCase, when the lowercase letter is followed by uppercase:
while ~isempty(string)
[str,string] = ...
strtok(string, ...
[' ~#$/#.-:&*+=[]?!(){},''">_<;%' char(9) char(10) char(13) '0-9']);
str = regexprep(str, '[0-9]','');
end
I can also find where the pattern occurs. If I knew how to insert a space or some other character at that point, I could run the code above once more to split into words:
pattern = '[a-z][A-Z]+';
[pat,idx] = regexp(str, pattern, 'match', 'start');
Any ideas?
Thanks!

Why not replace the camelCase before you do your other processing?
newstring = regexprep(string, '([a-z])([A-Z])', '$1 $2');
while ~isempty(newstring)
...
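For comparison, here is a minimal sketch of the same idea in Python rather than MATLAB: insert a space at every lowercase-to-uppercase boundary, then pull out the runs of letters. The function name split_words is made up for illustration, and lowercasing the result is an assumption based on the expected output in the question.

```python
import re

def split_words(text):
    # Insert a space wherever a lowercase letter is directly followed by an uppercase one
    spaced = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)
    # Keep only runs of ASCII letters; digits, underscores and spaces act as separators
    return [word.lower() for word in re.findall(r'[A-Za-z]+', spaced)]

print(split_words('one Two9three four_Five 67SixSevenEightNine'))
# ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']
```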

Related

regex in Python to remove commas and spaces

I have a string with multiple commas and spaces as delimiters between words. Here are some examples:
ex #1: string = 'word1,,,,,,, word2,,,,,, word3,,,,,,'
ex #2: string = 'word1 word2 word3'
ex #3: string = 'word1,word2,word3,'
I want to use a regex to convert either of the above 3 examples to "word1, word2, word3" - (Note: no comma after the last word in the result).
I used the following code:
import re
input_col = 'word1 , word2 , word3, '
test_string = ''.join(input_col)
test_string = re.sub(r'[,\s]+', ' ', test_string)
test_string = re.sub(' +', ',', test_string)
print(test_string)
I get the output as "word1,word2,word3,". Whereas I actually want "word1, word2, word3". No comma after word3.
What kind of regex and re methods should I use to achieve this?
You can use split to create a list and then filter out the empty strings:
import re
s = 'word1 , word2 , word3, '
r = re.split(r"[^a-zA-Z\d]+", s)
ans = ', '.join([i for i in r if len(i) > 0])
How about adding the following statement to the end of your program:
test_string = re.sub(',+$', '', test_string)
This removes any commas at the end of the string.
One approach is to first split on an appropriate pattern, then join the resulting list with commas:
import re
string = 'word1,,,,,,, word2,,,,,, word3,,,,,,'
parts = re.split(r"[,\s]+", string)
sep = ','
output = re.sub(',$', '', sep.join(parts))
print(output)
word1,word2,word3
Note that I make a final call to re.sub to remove a possible trailing comma.
You can use [ ]+ to detect extra spaces and ,\s*$ to detect the last comma. Then substitute each [ ]+,[ ]+ with a comma followed by one space, and the last comma with an empty string:
import re
input_col = 'word1 , word2 , word3, '
test_string = re.sub(r'[ ]+,[ ]+', ', ', input_col) # remove extra space
test_string = re.sub(r',\s*$', '', test_string) # remove last comma
print(test_string)
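A minimal sketch that handles all three examples from the question in one step: instead of cleaning up the delimiters, extract the words and join them. The helper name normalize_words is hypothetical, and the sketch assumes a "word" is any run of characters that is neither a comma nor whitespace.

```python
import re

def normalize_words(text):
    # Every run of characters that is neither a comma nor whitespace counts as a word
    words = re.findall(r'[^,\s]+', text)
    # Joining the words never produces a leading or trailing separator
    return ', '.join(words)

examples = (
    'word1,,,,,,, word2,,,,,, word3,,,,,,',
    'word1 word2 word3',
    'word1,word2,word3,',
)
for example in examples:
    print(normalize_words(example))  # each line prints: word1, word2, word3
```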

Retaining punctuations in a word

How can I remove punctuation from a line, but keep the punctuation that is inside a word, using re?
For example:
Input = "Hello!!!, i don't like to 'some String' .... isn't"
Output = ['hello', 'i', "don't", 'like', 'to', 'some', 'string', "isn't"]
I am trying to do this:
re.sub('\W+', ' ', myLine.lower()).split()
But this is splitting the words like "don't" into don and t.
You can use lookarounds in your regex:
>>> input = "Hello!!!, i didn''''t don't like to 'some String' .... isn't"
>>> regex = r'\W+(?!\S*[a-z])|(?<!\S)\W+'
>>> print(re.sub(regex, '', input, flags=re.IGNORECASE).split())
['Hello', 'i', "didn''''t", "don't", 'like', 'to', 'some', 'String', "isn't"]
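The first alternative, \W+(?!\S*[a-z]), matches a run of non-word characters when no letter follows before the next whitespace (for example the !!!, after Hello). The second alternative, (?<!\S)\W+, matches a run of non-word characters at the start of a whitespace-delimited token (for example the opening quote of 'some). An apostrophe inside a word matches neither alternative, so don't and isn't stay intact.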
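An alternative sketch, not the lookaround approach above: match the words directly instead of deleting the punctuation around them. The function name words_with_apostrophes is invented for this example, and it assumes words consist of ASCII letters only, with apostrophes allowed strictly between letters.

```python
import re

def words_with_apostrophes(line):
    # A word is a run of letters, optionally followed by "'" + letters groups (don't, isn't)
    return re.findall(r"[a-z]+(?:'[a-z]+)*", line.lower())

line = "Hello!!!, i don't like to 'some String' .... isn't"
print(words_with_apostrophes(line))
# ['hello', 'i', "don't", 'like', 'to', 'some', 'string', "isn't"]
```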

Splitting the string except for symbols

I tried using these 2 codes:
Dim splitQuery() As String = Regex.Split(TextBoxQuery.Text, "\s+")
and
Dim splitQuery() As String = TextBoxQuery.Text.Split(New Char() {" "c})
My example query is "a dog ." (note the single space between "dog" and "."). When I check the length of splitQuery, it gives me 3, and the split words are "a", "dog", and ".".
How can I stop it from counting . and other symbols as word? I want words/terms (alphanumeric) only to be stored in my splitQuery array. Thanks.
I suggest doing that in 2 steps:
Use txt = Regex.Replace(TextBoxQuery.Text, "\W+$", "", RegexOptions.RightToLeft) to remove the non-word characters from the end of the string
Then, split with \s+: splits = Regex.Split(txt, "\s+")
If you prefer to split with any non-word chars, you may use
splits = Regex.Split(Regex.Replace(TextBoxQuery.Text, "^\W+|\W+$", ""), "\W+")
Here, Regex.Replace(TextBoxQuery.Text, "^\W+|\W+$", "") removes non-word chars both at the start and end of string.
You should also be able to create a string of unwanted characters, trim them off, and split with StringSplitOptions.RemoveEmptyEntries.
Dim unwanted As String = "./?!#"
Dim splitQuery() As String = yourString.Trim(unwanted.ToCharArray()).Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
I would tackle this problem in two parts.
I would split up the text by spaces like you're doing
I would then run through that list of words and remove any query terms that are non-alphanumeric.
The following is an example of that:
Imports System.Collections
Imports System.Text.RegularExpressions
' ... Your Other Code ...
' A function to determine if a string is alphanumeric
Private Function IsAlphaNum(ByVal strInputText As String) As Boolean
    Return Regex.IsMatch(strInputText, "^[a-zA-Z0-9]+$")
End Function
' A function to get the words from the textbox
Private Function GetWords() As String()
    ' Get a raw list of all words separated by spaces
    Dim splitQuery() As String = Regex.Split(TextBoxQuery.Text, "\s+")
    ' ArrayList to place all words into:
    Dim alWords As New ArrayList()
    ' Loop all words and check them:
    For Each word As String In splitQuery
        If IsAlphaNum(word) Then
            ' Word is alphanumeric
            ' Add it to the list of alphanumeric words
            alWords.Add(word)
        End If
    Next
    ' Convert the ArrayList of words to a primitive array of strings
    Dim words As String() = CType(alWords.ToArray(GetType(String)), String())
    ' Return the list of filtered words
    Return words
End Function
This code does the following:
splits up the textbox's text
declares an ArrayList for the filtered query terms/words
loops through all the words in the split up array of terms/words
it then checks if the term is alphanumeric
If the term is alphanumeric, it is added to the ArrayList. If it's not alphanumeric, the term is disregarded.
Finally, it casts the terms/words in the ArrayList back to a normal String array and returns.
Because this solution uses an ArrayList, it requires System.Collections as an import.
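The split-then-filter steps above can be sketched compactly in Python (not VB.NET; the name alnum_terms is hypothetical). Like the code above, this drops any term that contains a non-alphanumeric character, so "dog." would be discarded rather than trimmed to "dog".

```python
import re

def alnum_terms(query):
    # Split on whitespace, then keep only terms made entirely of ASCII letters and digits
    return [term for term in query.split() if re.fullmatch(r'[A-Za-z0-9]+', term)]

print(alnum_terms('a dog .'))  # ['a', 'dog']
```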

Python 2.7 : Remove elements from a multidimensional list

Basically, I have a 3-dimensional list (a list of tokens, where the first dimension is the text, the second the sentence, and the third the word).
An element of the list (let's call it mat) can be addressed like this, for example:
mat[2][3][4]. That gives us the fifth word of the fourth sentence in the third text.
But, some of the words are just symbols like '.' or ',' or '?'. I need to remove all of them. I thought to do that with a procedure:
def removePunc(mat):
    newMat = []
    newText = []
    newSentence = []
    for text in mat:
        for sentence in text:
            for word in sentence:
                if word not in " !##$%^&*()-_+={}[]|\\:;'<>?,./\"":
                    newSentence.append(word)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
Now, when I try to use that:
finalMat = removePunc(mat)
it is giving me the same list (mat is a 3 dimensional list). My idea was to iterate over the list and remove only the 'words' which are actually punctuation symbols.
I don't know what I am doing wrong but surely there is a simple logical mistake.
Edit: I need to keep the structure of the array. So, words of the same sentence should still be in the same sentence (just without the 'punctuation symbol' words). Example:
a = [[['as', '.'], ['w', '?', '?']], [['asas', '23', '!'], ['h', ',', ',']]]
after the changes should be:
a = [[['as'], ['w']], [['asas', '23'], ['h']]]
Thanks for reading and/or giving me a reply.
I would suspect that your data are not organized as you think they are. And although I am usually not the one to propose regular expressions, I think in your case they may be among the best solutions.
I would also suggest that instead of eliminating non-alphabetic characters from words, you process whole sentences:
>>> import re
>>> non_word = re.compile(r'\W+') # If your sentences may
>>> sentence = '''The formatting sucks, but the only change that I've made to your code was shortening the "symbols" string to one character. The only issue that I can identify is either with the "symbols" string (though it looks like all chars in it are properly escaped) that you used, or the punctuation is not actually separate words'''
>>> words = re.split(non_word, sentence)
>>> words
['The', 'formatting', 'sucks', 'but', 'the', 'only', 'change', 'that', 'I', 've', 'made', 'to', 'your', 'code', 'was', 'shortening', 'the', 'symbols', 'string', 'to', 'one', 'character', 'The', 'only', 'issue', 'that', 'I', 'can', 'identify', 'is', 'either', 'with', 'the', 'symbols', 'string', 'though', 'it', 'looks', 'like', 'all', 'chars', 'in', 'it', 'are', 'properly', 'escaped', 'that', 'you', 'used', 'or', 'the', 'punctuation', 'is', 'not', 'actually', 'separate', 'words']
>>>
The code you wrote seems solid and it looks like "it should work", but only if this:
But, some of the words are just symbols like '.' or ',' or '?'
is actually fulfilled.
I would actually expect the symbols to not be separate from words, so instead of:
["Are", "you", "sure", "?"] #example sentence
you would rather have:
["Are", "you", "sure?"] #example sentence
If this is the case, you would need to go along the lines of:
def removePunc(mat):
    newMat = []
    for text in mat:
        newText = []  # fresh list for every text
        for sentence in text:
            newSentence = []  # fresh list for every sentence
            for word in sentence:
                newWord = ""  # reset for every word
                for char in word:
                    if char not in " !##$%^&*()-_+={}[]|\\:;'<>?,./\"":
                        newWord += char
                if newWord:  # skip words that were punctuation only
                    newSentence.append(newWord)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
Finally, found it. As expected, it was a very small logical mistake that was always there but couldn't see it. Here is the working solution:
def removePunc(mat):
    newMat = []
    for text in mat:
        newText = []
        for sentence in text:
            newSentence = []
            for word in sentence:
                if word not in " !##$%^&*()-_+={}[]|\\:;'<>?,./\"":
                    newSentence.append(word)
            newText.append(newSentence)
        newMat.append(newText)
    return newMat
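The same structure-preserving filter can also be written as a nested list comprehension. This is only a sketch: the names remove_punc and is_punct are invented here, and it assumes a "punctuation word" is a non-empty token made up entirely of characters from string.punctuation.

```python
import string

PUNCT = set(string.punctuation)

def is_punct(token):
    # True for tokens such as '.', '?' or '...' that contain nothing but punctuation
    return token != '' and all(ch in PUNCT for ch in token)

def remove_punc(mat):
    # Rebuild text -> sentence -> word, dropping punctuation-only tokens at the innermost level
    return [[[word for word in sentence if not is_punct(word)]
             for sentence in text]
            for text in mat]

a = [[['as', '.'], ['w', '?', '?']], [['asas', '23', '!'], ['h', ',', ',']]]
print(remove_punc(a))  # [[['as'], ['w']], [['asas', '23'], ['h']]]
```

Because each comprehension builds a new inner list, nothing is shared between sentences or texts, which avoids the mistake of reusing one accumulator across iterations.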

Challenging regular expression

There is a string in the following format:
It can start with any number of strings enclosed by double braces, possibly with white space between them (whitespace may or may not occur).
It may also contain strings enclosed by double-braces in the middle.
I am looking for a regular expression that can separate the start from the rest.
For example, given the following string:
{{a}}{{b}} {{c}} def{{g}}hij
The two parts are:
{{a}}{{b}} {{c}}
def{{g}}hij
I tried this:
/^({{.*}})(.*)$/
But, it captured also the g in the middle:
{{a}}{{b}} {{c}} def{{g}}
hij
I tried this:
/^({{.*?}})(.*)$/
But, it captured only the first a:
{{a}}
{{b}} {{c}} def{{g}}hij
The pattern below (JavaScript) repeatedly matches {{, one or more characters that are neither { nor }, then }}, followed by optional whitespace, and stores the whole run in the first group. The rest of the string goes into the second group. If the string does not start with any parts surrounded by {{ and }}, the first group is empty.
var str = "{{a}}{{b}} {{c}} def{{g}}hij";
str.match(/^\s*((?:\{\{[^{}]+\}\}\s*)*)(.*)/)
// [whole match, group 1, group 2]
// ["{{a}}{{b}} {{c}} def{{g}}hij", "{{a}}{{b}} {{c}} ", "def{{g}}hij"]
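For readers working in Python, the same pattern can be sketched with the re module. The names PREFIX_RE and split_prefix are made up for this example; re.DOTALL is an added assumption so that the remainder may span several lines.

```python
import re

# Group 1: leading run of {{...}} blocks with optional whitespace; group 2: everything else
PREFIX_RE = re.compile(r'\s*((?:\{\{[^{}]+\}\}\s*)*)(.*)', re.DOTALL)

def split_prefix(text):
    # Every part of the pattern is optional, so match() always succeeds
    head, rest = PREFIX_RE.match(text).groups()
    return head, rest

print(split_prefix('{{a}}{{b}} {{c}} def{{g}}hij'))
# ('{{a}}{{b}} {{c}} ', 'def{{g}}hij')
```

As in the JavaScript version, the first part keeps its trailing whitespace; call head.rstrip() if that is not wanted.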
How about using preg_split:
$str = '{{a}}{{b}} {{c}} def{{g}}hij';
$list = preg_split('/(\s[^{].+)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($list);
output:
Array
(
[0] => {{a}}{{b}} {{c}}
[1] => def{{g}}hij
)
I think I got it:
var string = "{{a}}{{b}} {{c}} def{{g}}hij";
console.log(string.match(/((\{\{\w+\}\})\s*)+/g));
// Output: [ '{{a}}{{b}} {{c}} ', '{{g}}' ]
Explanation:
( starts a group.
( another;
\{\{\w+\}\} looks for {{A-Za-z_0-9}}
) closes second group.
\s* Counts whitespace if it's there.
)+ closes the first group and looks for one or more occurrences of it.
When it reaches any data that is not of the {{something}} form, it stops.
P.S. Complex regexes cost CPU time.
You can use this:
(java)
String[] result = yourstr.split("\\s+(?!\\{)");
(php)
$result = preg_split('/\s+(?!{)/', '{{a}}{{b}} {{c}} def{{g}}hij');
print_r($result);
I don't know exactly why you want to split, but if the string always contains def and you want to split the string there into two halves, you can try something like:
string text = "{{a}}{{b}} {{c}} def{{g}}hij";
Regex r = new Regex("def");
string[] split = new string[2];
int index = r.Match(text).Index;
split[0] = string.Join("", text.Take(index).Select(x => x.ToString()).ToArray<string>());
split[1] = string.Join("", text.Skip(index).Take(text.Length - index).Select(x => x.ToString()).ToArray<string>());
// Output: [ '{{a}}{{b}} {{c}} ', 'def{{g}}hij' ]