The following example is taken from the Python re documentation:
re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']
'\b' matches the empty string at the beginning or end of a word, so every match is zero-length. When I run this code (Jupyter notebook, Python 3.6), it raises an error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-128-f4d2d57a2022> in <module>
1 reg = re.compile(r"\b")
----> 2 re.split(reg, "Words, word, word.")
/usr/lib/python3.6/re.py in split(pattern, string, maxsplit, flags)
210 and the remainder of the string is returned as the final element
211 of the list."""
--> 212 return _compile(pattern, flags).split(string, maxsplit)
213
214 def findall(pattern, string, flags=0):
ValueError: split() requires a non-empty pattern match.
Since \b only matches empty strings, split() never gets the non-empty pattern match it requires. I have seen various questions about split() and empty strings. For some of them I can see why you would want this in practice, for example the question here. Answers vary from "just can't do it" to (older ones) "it's a bug".
My question is this:
Since this is still an example on the Python web page, should this be possible? Is it possible in the bleeding-edge release?
The question in the link above involved
re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
It was asked in 2015, and at the time there was no way to meet the requirements with re.split() alone. Is this still the case?
In Python 3.7 re, you can split with zero-length matches:
Changed in version 3.7: Added support of splitting on a pattern that could match an empty string.
Also, note that
Empty matches for the pattern split the string only when not adjacent to a previous empty match.
>>> re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']
>>> re.split(r'\W*', '...words...')
['', '', 'w', 'o', 'r', 'd', 's', '', '']
>>> re.split(r'(\W*)', '...words...')
['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
Also, with
re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
I get ['foobar', 'barbaz', 'bar'] in Python 3.7.
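For code that also has to run on interpreters older than 3.7, one possible workaround is to collect the match positions with re.finditer() and slice the string yourself. The sketch below is only an illustration: the helper name split_at_matches is made up here and is not part of the re module, and it assumes the pattern is zero-width (lookarounds, \b and similar). Unlike re.split(), it never emits empty pieces.

```python
import re

def split_at_matches(pattern, text):
    """Cut text at the start of every match of pattern.

    Hypothetical helper, not part of the re module. Matched text is kept
    in the piece that follows, so it is meant for zero-width patterns.
    """
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        # A cut at the current position would only add an empty piece.
        if m.start() > last:
            pieces.append(text[last:m.start()])
            last = m.start()
    tail = text[last:]
    if tail or not pieces:
        pieces.append(tail)
    return pieces

print(split_at_matches(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
# ['foobar', 'barbaz', 'bar']
print(split_at_matches(r'\b', 'Words, words, words.'))
# ['Words', ', ', 'words', ', ', 'words', '.']
```

Note that the second result has no leading '' element, which re.split() in 3.7+ does produce for this input.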
I have a mathematical expression (formula) in string format. I'm trying to split that string into an array of all its operators, digits, and words. I'm doing this by passing a regex to the split() function (I'm new to regex, so I put the expression together by trial and error). With this expression I get an array separated into operators, digits, and words, but I also get an extra blank element after each element. See below for exactly what I mean.
My mathematical expression (formula):
1+2-(0.67*2)/2%2=O_AnnualSalary
Regex that I'm using to split it into an array:
this.createdFormula.split(/([?=+-*/%,()])/)
The array I expect to get:
['1', '+', '2', '-', '(', '0', '.', '6', '7', '*', '2', ')', '/', '2', '%', '2', '=', 'O_AnnualSalary']
This is what I'm getting:
['', '1', '', '+', '', '2', '', '-', '', '(', '', '0', '', '.', '', '6', '', '7', '', '*', '', '2', '', ')', '', '/', '', '2', '', '%', '', '2', '', '=', 'O_AnnualSalary']
So far I've tried these expressions from many posts on SO:
this.createdFormula.split(/([?=+-\\*\\/%,()])/)
this.createdFormula.split(/([?=\\\W++-\\*\\/%,()])/)
this.createdFormula.split(/([?=//\s++-\\*\\/%,()])/)
this.createdFormula.split(/([?=+-\\*\\/%,()])(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)/)
this.createdFormula.split(/([?=+-\\*\\/%0-9,()])/)
Can anyone help me to fix this expression to get the desired result? If you need any more information please feel free to ask.
Any help is really appreciated.
Thanks
Assuming you have string match() available, we can use:
var input = "1+2-(0.67*2)/2%2=O_AnnualSalary,HwTotalDays";
var parts = input.match(/(?:[0-9.,%()=/*+-]|\w+)/g);
console.log(parts);
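The same matching approach carries over to Python's re.findall(), shown here as a rough sketch for comparison with the other Python snippets on this page. Matching the tokens directly, instead of splitting on them, avoids the blank elements because findall() returns only the matches and never the (possibly empty) text between two adjacent delimiters. The names TOKEN and formula are illustrative.

```python
import re

# Either one single-character token (digit, '.', ',', operator, parenthesis)
# or a run of word characters such as an identifier.
TOKEN = re.compile(r"[0-9.,%()=/*+-]|\w+")

formula = "1+2-(0.67*2)/2%2=O_AnnualSalary"
tokens = TOKEN.findall(formula)
print(tokens)
# ['1', '+', '2', '-', '(', '0', '.', '6', '7', '*', '2', ')', '/', '2', '%', '2', '=', 'O_AnnualSalary']
```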
I am trying to implement a tokenizer that splits a string into words.
The special conditions I have are: split the punctuation marks . , ! ? into separate strings,
and otherwise split on spaces, i.e. I have a dog!'-4# -> 'I', 'have', 'a', 'dog', '!', "'-4#"
Something like this.
I don't plan on using the nltk package. I have looked at re.split and re.findall, but I'm stuck in both cases:
re.split = I don't know how to split off punctuation that is attached to a word, such as 'Dog,'
re.findall = Sure, it returns all the matched strings, but what about the unmatched ones?
If you have any suggestions, I'd be very happy to try them.
Are you trying to split on a delimiter(punctuation) while keeping it in the final results? One way of doing that would be this:
import re
import string
sent = "I have a dog!'-4#"
punc_Str = str(string.punctuation)
print(re.split(r"([.,;:!^ ])", sent))
This is the result I get.
['I', ' ', 'have', ' ', 'a', ' ', 'dog', '!', "'-4#"]
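If the space tokens are not wanted in the result, one option is to filter the list afterwards. This is a small sketch built on the same pattern as above; the filtering step is an addition, not part of the original snippet.

```python
import re

sent = "I have a dog!'-4#"

# Split on the captured delimiters, then drop empty strings and
# pieces that consist only of whitespace.
tokens = [t for t in re.split(r"([.,;:!^ ])", sent) if t.strip()]
print(tokens)
# ['I', 'have', 'a', 'dog', '!', "'-4#"]
```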
Try:
re.findall(r'[a-z]+|[.!?]|(?:(?![.!?])\S)+', txt, re.I)
Alternatives in the regex:
[a-z]+ - a non-empty sequence of letters (ignore case),
[.!?] - any single character from your list (note that between brackets
neither a dot nor a '?' needs to be escaped),
(?:(?![.!?])\S)+ - a non-empty sequence of non-white characters,
other than in your list.
E.g. for text containing I have a dog!'-4#?. the result is:
['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']
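Wrapped up as a runnable sketch (the names TOKEN_RE and tokenize are illustrative), the pattern above can be tried like this. It also covers the 'Dog,' case from the question: because the three alternatives together match every non-whitespace character, findall() leaves nothing unmatched except the spaces.

```python
import re

# letters | one of . ! ? | any other run of non-whitespace characters
TOKEN_RE = re.compile(r"[a-z]+|[.!?]|(?:(?![.!?])\S)+", re.I)

def tokenize(txt):
    """Return the tokens of txt in order of appearance."""
    return TOKEN_RE.findall(txt)

print(tokenize("I have a dog!'-4#?."))
# ['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']
print(tokenize("Dog,"))
# ['Dog', ',']
```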
I created a batch file that writes user names to a file. It works perfectly and cleans up net user and writes the user names to a file so it would look like this:
Administrator Michael Guest
Pianoman Billy George
I don't know how many user names there will be or how long they are, so I can't know how many spaces separate them. My question is: how can I clean up this whitespace between an undetermined number of names?
My Python program is supposed to read these names from a file and turn them into a list. I was planning on just using .split(" "), so ideally someone could suggest a way to reduce the gap to one space between each name. I already looked at the .format method, and it doesn't seem to be up to the task. I'm also open to a reasonably readable way (doubtful) of formatting this in batch.
BTW: I considered simply redirecting the output of dir /B C:\Users, but that doesn't work in my situation.
Use .split() without the sep argument:
string.split(s[, sep[, maxsplit]])
Return a list of the words of the string s. If the optional second
argument sep is absent or None, the words are separated by
arbitrary strings of whitespace characters (space, tab, newline,
return, formfeed). If the second argument sep is present and not
None, it specifies a string to be used as the word separator. The
returned list will then have one more item than the number of
non-overlapping occurrences of the separator in the string. If
maxsplit is given, at most maxsplit number of splits occur, and
the remainder of the string is returned as the final element of the
list (thus, the list will have at most maxsplit+1 elements). If
maxsplit is not specified or -1, then there is no limit on the
number of splits (all possible splits are made).
The behavior of split on an empty string depends on the value of
sep. If sep is not specified, or specified as None, the result
will be an empty list. If sep is specified as any string, the result
will be a list containing one element which is an empty string.
Example:
>>> x='Administrator CLIENT1 Guest'
>>> x.split(' ')
['Administrator', '', '', '', '', '', '', '', '', '', '', '', 'CLIENT1', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '','Guest']
>>> x.split()
['Administrator', 'CLIENT1', 'Guest']
>>>
Another approach (Python 2 only; the string.split() module function was removed in Python 3):
>>> import string
>>> x='Administrator CLIENT1 Guest'
>>> string.split(x,' ')
['Administrator', '', '', '', '', '', '', '', '', '', '', '', 'CLIENT1', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '','Guest']
>>> string.split(x)
['Administrator', 'CLIENT1', 'Guest']
>>>
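In Python 3 the str method is the way to go. A minimal sketch, with the sample text standing in for the contents of the file written by the batch script (the function name and the sample spacing are assumptions made for illustration):

```python
def parse_usernames(text):
    """Return the whitespace-separated names found in text."""
    # str.split() with no arguments splits on runs of any whitespace
    # (spaces, tabs, newlines) and never yields empty strings.
    return text.split()

sample = "Administrator      Michael       Guest\nPianoman     Billy    George\n"
print(parse_usernames(sample))
# ['Administrator', 'Michael', 'Guest', 'Pianoman', 'Billy', 'George']
```

For a real file, passing the result of f.read() inside a with open(...) block to the function works the same way. One caveat: a user name that itself contains a space would be split into two entries.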
I use the following regex to split sentences into words:
"('?\w[\w']*(?:-\w+)*'?)"
For example:
import re
re.split("('?\w[\w']*(?:-\w+)*'?)","'cos I like ice-cream!")
gives:
['', "'cos", ' ', 'I', ' ', 'like', ' ', 'ice-cream', '!']
However, formatting tags sometimes appear in my text and my regex obviously can't process them as I would like:
re.split("('?\w[\w']*(?:-\w+)*'?)","'cos I <i>like</i> ice-cream!")
gives:
['', "'cos", ' ', 'I', ' <', 'i', '>', 'like', '</', 'i', '> ', 'ice-cream', '!']
while I would like:
['', "'cos", ' ', 'I', ' <i>', 'like', '</i> ', 'ice-cream', '!']
How would you go about solving this?
You could use a word boundary regex, specifying exclusions of matches using negative lookbehind and lookahead assertions:
^|(?<!['<\/-])\b(?![>-])
Regex demo.
Unfortunately, before Python 3.7 re.split() did not support splitting on zero-width matches, so on older versions you have to use a workaround:
import re
a = re.sub(r"^|(?<!['<\/-])\b(?![>-])", "|", "'cos I <i>like</i> ice-cream!").split('|')
print(a)
# ['', "'cos", ' ', 'I', ' <i>', 'like', '</i> ', 'ice-cream', '!']
Python demo.
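On Python 3.7 or later, re.split() accepts patterns that can match an empty string, so the substitution step is optional there. The sketch below applies the same pattern directly; it also sidesteps a weakness of the "|" workaround, which would mis-split any text that already contains a "|" character. The variable names are illustrative.

```python
import re

text = "'cos I <i>like</i> ice-cream!"
pattern = r"^|(?<!['<\/-])\b(?![>-])"

# Requires Python 3.7+: earlier versions do not split on empty matches.
parts = re.split(pattern, text)
print(parts)
# ['', "'cos", ' ', 'I', ' <i>', 'like', '</i> ', 'ice-cream', '!']
```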
# I added a negative lookahead to your pattern so that a match cannot end directly before a closing '>'
import re
print(re.split(r"('?\w[\w']*(?:-\w+)*'?(?!>))", "'cos I <i>like</i> ice-cream!"))
[Output]
['', "'cos", ' ', 'I', ' <i>', 'like', '</i> ', 'ice-cream', '!']
I am looking to capitalize the first letter of words in a string. I've managed to put together something by reading examples on here. However, I'm trying to get any names that start with O' to separate into 2 strings so that each gets capitalized. I have this so far:
\b([^\W_\d](?!')[^\s-]*) *
which skips the X' in any string X'XYZ. That works for capitalizing the part after the ', but it doesn't capitalize the X'. Furthermore, i'm becomes i'M because the pattern isn't specific to O'. To state the goal:
o'malley should go to O'Malley
o'malley's should go to O'Malley's
don't should go to Don't
i'll should go to I'll
(as an aside, I want to omit any strings that start with numbers, like 23F, that seems to work with what I have)
How can I make it specific to strings that start with O'? Thanks.
If you use the following pattern:
([oO])'([\w']+)|([\w']+)
then each match returned by findall is a tuple of three groups:
match[0] == 'o' and match[1] == 'name' # if the word is "o'name"
match[2] == 'word' # if the word is "word"
Only one of the two alternatives matches a given word, and the groups of the other alternative are empty. For example, if the word is "word", then
match[0] == match[1] == ""
since there is no o' prefix.
Test Example:
>>> import re
>>> string = "o'malley don't i'm hello world"
>>> match = re.findall(r"([oO])'([\w']+)|([\w']+)",string)
>>> match
[('o', 'malley', ''), ('', '', "don't"), ('', '', "i'm"), ('', '', 'hello'), ('', '', 'world')]
NOTE: This is for Python. It might not work in all regex engines.
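To go from the matched groups to the capitalized text, one possibility is to pass a replacement function to re.sub(). This is a sketch under the assumption that the four goals listed in the question are the whole specification; the names WORD, _capitalize and capitalize_words are made up for illustration.

```python
import re

WORD = re.compile(r"([oO])'([\w']+)|([\w']+)")

def _capitalize(m):
    """Replacement callback: upper-case the first letter of each part."""
    if m.group(1):
        # o'name form: capitalize both the O and the part after the apostrophe.
        rest = m.group(2)
        return "O'" + rest[0].upper() + rest[1:]
    # Any other word: only the first character changes.
    word = m.group(3)
    return word[0].upper() + word[1:]

def capitalize_words(text):
    return WORD.sub(_capitalize, text)

print(capitalize_words("o'malley o'malley's don't i'll 23F"))
# O'Malley O'Malley's Don't I'll 23F
```

Words that start with a digit, such as 23F, pass through unchanged because upper-casing a digit has no effect. Every word that starts with o' is treated as a name, so o'clock would become O'Clock.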