In Python I can do:
import re
re.split('(o)', 'hello world')
and get:
['hell', 'o', ' w', 'o', 'rld']
With Crystal:
"hello world".split(/(o)/)
I get:
["hell", " w", "rld"]
But I want to keep the matches in the array like in the python example. Is it possible?
http://crystal-lang.org/api/String.html
This just got added, see this issue.
Until that lands in a release, you can work around it with lookaround expressions:
"hello world".split(/(?<=o)|(?=o)/) #=> ["hell", "o", " w", "o", "rld"]
I am trying to implement a tokenizer to split a string into words.
The special conditions I have are: split the punctuation marks . , ! ? into separate strings
and split on whitespace, i.e. I have a dog!'-4# -> 'I', 'have', 'a', 'dog', '!', "'-4#"
Something like this.
I don't plan on using the nltk package, and I have looked at re.split and re.findall, but I have a problem with each:
re.split = I don't know how to split out words with punctuation next to them, such as 'Dog,'
re.findall = Sure, it returns all the matched strings, but what about the unmatched ones?
If you have any suggestions, I'd be very happy to try them.
Are you trying to split on a delimiter (punctuation) while keeping it in the final results? One way of doing that would be this:
import re
sent = "I have a dog!'-4#"
# The capturing group keeps each delimiter in the result list.
print(re.split(r"([.,;:!^ ])", sent))
This is the result I get.
['I', ' ', 'have', ' ', 'a', ' ', 'dog', '!', "'-4#"]
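If the whitespace tokens are not wanted, one option is to filter them out after splitting. The following is only a sketch: the function name tokenize and the choice of . , ! ? as the punctuation set are assumptions taken from the question, not part of the answer above.

```python
import re

def tokenize(sentence):
    # Split on a single punctuation mark or a run of whitespace; the
    # capturing group keeps each delimiter in the result list.
    parts = re.split(r"([.,!?]|\s+)", sentence)
    # Drop empty strings and whitespace-only pieces.
    return [p for p in parts if p and not p.isspace()]

print(tokenize("I have a dog!'-4#"))
```

Punctuation directly attached to a word, as in 'Dog,', ends up as its own token because the comma is one of the delimiters.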
Try:
re.findall(r'[a-z]+|[.!?]|(?:(?![.!?])\S)+', txt, re.I)
Alternatives in the regex:
[a-z]+ - a non-empty sequence of letters (ignore case),
[.!?] - any (single) char from your list (note that between brackets
neither a dot nor a '?' need to be quoted),
(?:(?![.!?])\S)+ - a non-empty sequence of non-whitespace characters
other than those in your list.
E.g. for text containing I have a dog!'-4#?. the result is:
['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']
I have a string as follows:
str = 'chem biochem chem chemi hem achem abcchemde chem\n asd chem\n'
I want to replace the word "chem" with "chemistry" while preserving the end-of-line character ('\n'). I also want the regex not to match words like 'biochem', 'chemi', 'hem', 'achem' and 'abcchemde'. How can I do this?
Here's what I'm using but it doesn't work:
import re
re.sub(r'[ ^c|c]hem[$ ]', r' chemistry ', str)
Thank you
Use word boundaries:
>>> s = 'chem biochem chem chemi hem achem abcchemde chem\n asd chem\n'
>>> import re
>>> re.sub(r'\bchem\b','chemistry',s)
'chemistry biochem chemistry chemi hem achem abcchemde chemistry\n asd chemistry\n'
Just a note: don't use str as a variable name, since that shadows the built-in str type.
You need to use \b to match a word boundary:
import re
re.sub(r'\bchem\b', r'chemistry', mystring)
(And as R Nar pointed out, you should avoid using str as a variable name.)
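The word-boundary approach can be wrapped in a small helper. This is a minimal sketch; the name replace_word and its parameters are hypothetical, and it assumes the word starts and ends with a word character, since \b only matches between a word character and a non-word character.

```python
import re

def replace_word(text, word, replacement):
    # \b on both sides restricts the match to the whole word; re.escape
    # guards against regex metacharacters in `word`.
    pattern = r"\b" + re.escape(word) + r"\b"
    # A function replacement inserts the text literally, so backslashes
    # in `replacement` are not interpreted as escapes.
    return re.sub(pattern, lambda m: replacement, text)

mystring = 'chem biochem chem chemi hem achem abcchemde chem\n asd chem\n'
print(repr(replace_word(mystring, 'chem', 'chemistry')))
```

Because \b is zero-width, the surrounding spaces and the '\n' characters are left untouched.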
I just found the answer. Thanks to @Jota.
The super-simple regex is as follows:
re.sub(r'\bchem\b', 'chemistry', str)
I have been tokenizing English strings with a simple \b split. However, given the string Hello, "Joe!", a split on \b gives back these tokens:
print join "\n", split /\b/, 'Hello, "Joe!"';
Hello
, "
Joe
!"
I need separate punctuation to be separate tokens. What I need is this list below:
print join "\n", split /awesome regex here/, 'Hello, "Joe!"';
Hello
,
"
Joe
!
"
I can process the whitespace afterwards, but I can't think of a quick regex way to split the string properly. Any ideas?
EDIT
A better test case is "Hello there, Joe!", since it checks that words are split correctly.
(?=\W)|(?<=\W)|\s+
You can try this. See demo.
https://regex101.com/r/fX3oF6/4
Do matching instead of splitting.
[A-Za-z]+|[^\w\s]
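For illustration, here is the matching approach sketched in Python rather than Perl (a translation, not the answerer's code). Note that with this pattern digits and underscores fall into neither alternative, so they are silently dropped.

```python
import re

# Either a run of ASCII letters, or a single character that is neither a
# word character nor whitespace (i.e. one punctuation or symbol character).
TOKEN = re.compile(r"[A-Za-z]+|[^\w\s]")

tokens_quoted = TOKEN.findall('Hello, "Joe!"')
tokens_plain = TOKEN.findall("Hello there, Joe!")

print(tokens_quoted)
print(tokens_plain)
```

Since whitespace never matches, no post-processing step is needed to remove it.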
You can use a regex with lookarounds to get this:
print join "\n", split /\s+|(?=\p{P})|(?<=\p{P})/, 'Hello, "Joe!"';
Output:
Hello
,
"
Joe
!
"
\p{P} matches any punctuation character.
Example 2:
print join "\n", split /\s+|(?=\p{P})|(?<=\p{P})/, 'hello there, Joe!';
hello
there
,
Joe
!
Using Regex, how do you match everything except four digits in a row? Here is a sample text that I might be using:
foo1234bar
baz 1111bat
asdf 0000 fdsa
a123b
Matches might look something like the following:
"foo", "bar", "baz ", "bat", "asdf ", " fdsa", "a123b"
Here are some regular expressions I've come up with on my own that have failed to capture everything I need:
[^\d]+ (this one includes a123b)
^.*(?=[\d]{4}) (this one does not include the line after the 4 digits)
^.*(?=[\d]{4}).* (this one includes the numbers)
Any ideas on how to get matches before and after a four digit sequence?
You haven't specified your app language, but practically every app language has a split function, and you'll get what you want if you split on \d{4}.
E.g., in Java:
String[] stuffToKeep = input.split("\\d{4}");
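The same idea can be sketched in Python. This assumes the text is processed line by line, so that pieces never span a line break, and that empty pieces (for example when a line starts with four digits) should be discarded; the variable names are illustrative.

```python
import re

sample = "foo1234bar\nbaz 1111bat\nasdf 0000 fdsa\na123b"

pieces = []
for line in sample.splitlines():
    # Split the line at every run of four consecutive digits and keep
    # only the non-empty pieces.
    pieces.extend(p for p in re.split(r"\d{4}", line) if p)

print(pieces)
```

One caveat: a run of five or more digits is split after its first four, so the leftover digits remain in the output.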
You can use a negative lookahead:
(?!\b\d{4}\b)(\b\w+\b)
Demo
In Python the following is very close to what you want:
In [1]: import re
In [2]: sample = '''foo1234bar
...: baz 1111bat
...: asdf 0000 fdsa
...: a123b'''
In [3]: re.findall(r"([^\d\n]+\d{0,3}[^\d\n]+)", sample)
Out[3]: ['foo', 'bar', 'baz ', 'bat', 'asdf ', ' fdsa', 'a123b']
With Python (regex module), I am trying to substitute 'x' for each letter 'c' in those substrings of a text that are:
delimited by 'a' on the left and 'b' on the right, and
contain no other 'a's or 'b's.
Example:
cuacducucibcl -> cuaxduxuxibcl
How can I do this?
Thank you.
With the standard re module in Python, you can use a[^ab]+b to match a substring that starts with a, ends with b, and has no occurrence of a or b in between, then supply a replacement function to take care of replacing c:
>>> import re
>>> re.sub('a[^ab]+b', lambda m: m.group(0).replace('c', 'x'), 'cuacducucibcl')
'cuaxduxuxibcl'
See the documentation of re.sub for reference.
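The replacement-function approach can be generalized a little. The helper below is a hypothetical sketch (the name replace_between and its parameters are not from the answer) and assumes the two delimiters are single characters.

```python
import re

def replace_between(text, left, right, old, new):
    lft, rgt = re.escape(left), re.escape(right)
    # left delimiter, one or more characters that are neither delimiter,
    # then the right delimiter.
    pattern = f"{lft}[^{lft}{rgt}]+{rgt}"
    # Only the matched span is rewritten; text outside it is untouched.
    return re.sub(pattern, lambda m: m.group(0).replace(old, new), text)

print(replace_between('cuacducucibcl', 'a', 'b', 'c', 'x'))
```

Because the pattern requires the closing delimiter, a span that opens with 'a' but is never closed by 'b' is left unchanged.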
Use the regex below to replace the matched c's with x. For this, you need to install the external regex module.
>>> import regex
>>> s = 'cuacducucibcl'
>>> regex.sub(r'((?:a|(?<!^)\G)[^abc\n]*)c', r'\1x', s)
'cuaxduxuxibcl'
DEMO