regex for tokenizing words and punctuation


I have been tokenizing English strings with a simple \b split. However, given the string Hello, "Joe!", a split on \b gives back these tokens:
print join "\n", split /\b/, 'Hello, "Joe!"';
Hello
, "
Joe
!"
I need each punctuation mark to be a separate token. What I need is the list below:
print join "\n", split /awesome regex here/, 'Hello, "Joe!"';
Hello
,
"
Joe
!
"
I can process the whitespace afterwards, but I can't think of a quick regex way to split the string properly. Any ideas?
EDIT
A better test case is "Hello there, Joe!", since it checks that words are split correctly.

(?=\W)|(?<=\W)|\s+
You can try this. See the demo:
https://regex101.com/r/fX3oF6/4

Do matching instead of splitting.
[A-Za-z]+|[^\w\s]
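As a rough illustration of matching rather than splitting, here is a minimal Python sketch. The `tokenize` helper name is hypothetical; the pattern is the one given above, so it only treats ASCII letters as word characters.

```python
import re

# Pattern from the answer above: a run of ASCII letters, or a single
# character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"[A-Za-z]+|[^\w\s]")

def tokenize(text):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN_RE.findall(text)

print(tokenize('Hello, "Joe!"'))
print(tokenize("Hello there, Joe!"))
```

Note that digits and underscores are matched by neither alternative, so with this pattern they are silently dropped; widen the first alternative if they matter.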

You can use a regex with lookarounds to get this:
print join "\n", split /\s+|(?=\p{P})|(?<=\p{P})/, 'Hello, "Joe!"';
Output:
Hello
,
"
Joe
!
"
\p{P} matches any punctuation character.
Example 2:
print join "\n", split /\s+|(?=\p{P})|(?<=\p{P})/, 'hello there, Joe!';
hello
there
,
Joe
!

Related

Replace pattern with pattern in vb.net string

I want to replace the patterns "0A ","0B ",...,"1A ","1B ",... with "0A|","0B|",...,"1A|","1B|",... in a VB.NET string.
I can write individual replace lines like
string = string.Replace("0A ", "0A|")
string = string.Replace("0B ", "0B|")
.
.
.
string = string.Replace("0Z ", "0Z|")
But I would have to write too many lines (26*10*2; the factor of two is because this scenario occurs twice), and that doesn't seem like a good solution. Can someone give me a good regex solution for this?

Use Regex.Replace:
result = Regex.Replace(string, "(\d+[A-Z]+) ", "$1|")
I used the pattern \d+[A-Z]+ to represent the text under the assumption that your series of data might see more than one digit/letter. This seems to be working in the demo below.
Demo
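For comparison, the same capture-and-replace idea can be sketched in Python. This is only an illustration under the same assumption as the answer (one or more digits followed by one or more capital letters); the `pipe_after_codes` name and the sample input are made up.

```python
import re

def pipe_after_codes(text):
    # One or more digits, then one or more capital letters, then a space.
    # The captured code is kept (\1) and the space becomes "|".
    return re.sub(r"(\d+[A-Z]+) ", r"\1|", text)

print(pipe_after_codes("0A first 1B second"))
```

Capturing the code and re-emitting it means only the space directly after a code is touched; other spaces in the string are left alone.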

Regex: \s Substitution: |
Details:
\s Matches any whitespace character
Regex demo
VB.NET code:
Regex.Replace("0A ", "\s", "|")
Output: 0A|

python regex split on repeating character

I have a string for example
--------------------------------
hello world !
--------------------------------
world hello !
--------------------------------
! hello world
and I want to split the text on the lines of hyphens. The runs of hyphens can be of variable length, which is why I decided to use a regex. The information I want to extract is ['hello world !', 'world hello !', '! hello world']. Splitting on a fixed number of hyphens works, but I am not sure how to handle a variable length. I have tried:
re.split(r'\-{3,}', str1)
however that did not seem to work

You may strip the unnecessary whitespace from the input and resulting split chunks with a .strip() method:
import re
p = re.compile(r'(?m)^-{3,}$')
t = "--------------------------------\nhello world !\n--------------------------------\nworld hello !\n--------------------------------\n! hello world"
result = [x.strip() for x in p.split(t.strip("-\n\r"))]
print(result)
As for the regex, I suggest limiting to the hyphen-only lines with (?m)^-{3,}$ that matches 3 or more hyphens between the start of line (^) and end of line ($) (due to (?m), these anchors match the line boundaries, not the string boundaries).
See the IDEONE demo

Perl: string manipulation - surrounding a word with a character '@'

I am trying to extract email addresses from a txt file. I've thought about surrounding words that contain the '@' character. Does anybody know an expression to do that?

Whenever you need some reasonably common matching problem resolved in Perl, you should always first check the Regexp::Common family on CPAN. In this case: Regexp::Common::Email::Address. From the POD synopsis:
use Regexp::Common qw[Email::Address];
use Email::Address;
while (<>) {
my (@found) = /($RE{Email}{Address})/g;
my (@addrs) = map $_->address, Email::Address->parse("@found");
print "X-Addresses: ", join(", ", @addrs), "\n";
}

Here's a very quick and dirty regex which will match non-whitespace characters on either side of an @:
/\S+@\S+/
This will match john.smith@example.com in
some rubbish text john.smith@example.com more rubbish text
Hope this helps.
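The quick-and-dirty pattern can be tried out with a few lines of Python. This is a sketch only, assuming the separator is the usual @ of an email address. It finds candidates rather than validated addresses, so trailing punctuation such as a comma would be included in a match.

```python
import re

text = "some rubbish text john.smith@example.com more rubbish text"

# \S+@\S+ : one or more non-whitespace characters on both sides of an "@".
candidates = re.findall(r"\S+@\S+", text)
print(candidates)
```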

Attach a newline to every sentence

I was wondering how to turn a paragraph into bulleted sentences.
before:
sentence1. sentence2. sentence3. sentence4. sentence5. sentence6. sentence7.
after:
sentence1.
sentence2.
sentence3
sentence4.
sentence5.

Since all the other answers so far show how to do it in various programming languages and you have tagged the question with Vim, here's how to do it in Vim:
:%s/\.\(\s\+\|$\)/.\r\r/g
I've used two carriage returns to match the output format you showed in the question. There are a number of alternative regular expression forms you could use:
" Using a look-behind
:%s/\.\@<=\( \|$\)/\r\r/g
" Using 'very magic' to reduce the number of backslashes
:%s/\v\.( |$)/.\r\r/g
" Slightly different formation: this will also break if there
" are no spaces after the full-stop (period).
:%s/\.\s*$\?/.\r\r/g
and probably many others.
A non-regexp way of doing it would be:
:let s = getline('.')
:let lineparts = split(s, '\.\@<=\s*')
:call append('.', lineparts)
:delete
See:
:help pattern.txt
:help change.txt
:help \@<=
:help :substitute
:help getline()
:help append()
:help split()
:help :d

You can use a regex
/\.( |$)/g
That will match the end of the sentence, then you can add newlines.
Or you can use a split function with ". " (dot space) and "." (dot) as separators, then join the parts with newlines.

Just replace the space at every end of sentence, /(?<=\.) /, with two newline characters, /\n\n/. The lookbehind leaves the period in place, so it does not need to be re-inserted. The syntax would of course depend on the language you are using.
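A minimal Python sketch of the lookbehind approach follows, assuming the dot is escaped so that only a literal period counts as an end of sentence. The variable names are illustrative.

```python
import re

paragraph = "sentence1. sentence2. sentence3."

# (?<=\.) is a lookbehind: it requires a literal period before the space
# but does not consume it, so only the space itself is replaced.
bulleted = re.sub(r"(?<=\.) ", "\n\n", paragraph)
print(bulleted)
```

Because the period is never part of the match, the replacement string contains only the two newlines.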

Using Perl:
perl -e "$_ = <>; s/\.\s*/.\n/g; print"
Longer, somewhat more readable version:
my $input = 'foo. bar. baz.';
$input =~ s/
\. # A literal '.'
\s* # Followed by 0 or more space characters
/.\n/gx; # g for all occurrences, x to allow comments and whitespace in regex
print $input;
Using Python:
import re
input = 'foo. bar. baz.'
print(re.sub(r'\.\s*', '.\n', input))

An example using Ruby:
ruby-1.9.2 > a = "sentence1. sentence2. sentence3. and array.split(). the end."
=> "sentence1. sentence2. sentence3. and array.split(). the end."
ruby-1.9.2 > puts a.gsub(/\.(\s+|$)/, ".\n\n")
sentence1.
sentence2.
sentence3.
and array.split().
the end.
In other words: every . that is followed by one or more whitespace characters, or by the end of the line, is replaced with a . and two newline characters.

Using awk:
$ awk '{$1=$1}1' OFS="\n" file
sentence1.
sentence2.
sentence3.
sentence4.
sentence5.
sentence6.
sentence7

In PHP:
<?php
$input = "sentence. sentence. sentence.";
$output = preg_replace("/(.*?)\\.[\\s]+/", "$1\n", $input);
?>
Also, regular expressions are a blast, but not necessary for this problem. You can also try:
<?php
$input = "sentence. sentence. sentence.";
$arr = explode('.', $input);
foreach ($arr as $k => $v) $arr[$k] = trim($v);
$output = implode("\n", $arr);
?>

I figured out how to do this in RegExr
Search String is
(\-=?\s+)
--
Replace String is
\n\n
This is the generated information for the current regex
RegExp: /(\-=?\s+)/g
pattern: (\-=?\s+)
flags: g
capturing groups: 1
group 1: (\-=?\s+)
This will find every - in the sentence below and replace it with two newlines
Sentence 1- Sentence 2- Sentence 3- Sentence 4- Sentence 5-
The end result is
Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5

I have a really simple, naive solution using a capturing group.
:%s/\([.!?]\)/\1\r\r/g
The main drawback is that this won't handle ellipses or multiple punctuation marks.
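One possible way around that drawback, sketched in Python rather than Vim: capture a whole run of terminators as a single group. The `break_sentences` name is hypothetical. This sketch still breaks on abbreviations such as "e.g.", so treat it as a starting point only.

```python
import re

def break_sentences(text):
    # Capture a whole run of terminators ("...", "?!") as one group,
    # swallow any following whitespace, and put a blank line after it.
    # The final rstrip drops the blank line added after the last sentence.
    return re.sub(r"([.!?]+)\s*", r"\1\n\n", text).rstrip("\n")

print(break_sentences("Wait... Really?! Yes."))
```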

Regex line by line: How to match triple quotes but not double quotes

I need to check to see if a string of many words / letters / etc, contains only 1 set of triple double-quotes (i.e. """), but can also contain single double-quotes (") and double double-quotes (""), using a regex. Haven't had much success thus far.

A regex with negative lookahead can do it:
(?!.*"{3}.*"{3}).*"{3}.*
I tried it with these lines of Java code:
String good = "hello \"\"\" hello \"\" hello ";
String bad = "hello \"\"\" hello \"\"\" hello ";
String regex = "(?!.*\"{3}.*\"{3}).*\"{3}.*";
System.out.println( good.matches( regex ) );
System.out.println( bad.matches( regex ) );
...with output:
true
false

Try using the number of occurrences operator to match exactly three double-quotes.
\"{3}
["]{3}
[\"]{3}
I've quickly checked using http://www.regextester.com/, seems to work fine.
How you correctly compile the regex in your language of choice may vary, though!

Depends on your language, but you should only need to match for three double quotes (e.g., /\"{3}/) and then count the matches to see if there is exactly one.
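The count-the-matches idea might look like this in Python. It is a sketch only; the function name is made up, and the two sample strings are the ones used in the Java answer above.

```python
import re

def has_exactly_one_triple_quote(line):
    # findall returns non-overlapping matches, scanning left to right,
    # so each run of three double quotes is counted once.
    return len(re.findall(r'"{3}', line)) == 1

good = 'hello """ hello "" hello '
bad = 'hello """ hello """ hello '
print(has_exactly_one_triple_quote(good))
print(has_exactly_one_triple_quote(bad))
```

With this counting, a run of four or five quotes counts as one triple, while a run of six counts as two. Whether that is the desired behaviour depends on the input format.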

There are probably plenty of ways to do this, but a simple one is to merely look for multiple occurrences of triple quotes then invert the regular expression. Here's an example from Perl:
use strict;
use warnings;
my $match = 'hello """ hello "" hello';
my $no_match = 'hello """ hello """ hello';
my $regex = '[\"]{3}.*?[\"]{3}';
if ($match !~ /$regex/) {
print "Matched as it should!\n";
}
if ($no_match !~ /$regex/) {
print "You shouldn't see this!\n";
}
Which outputs:
Matched as it should!
Basically, you are telling it to find the thing you DON'T want, then inverting the truth. Hope that makes sense. I can help you convert the example to another language if you need.

This may be a good start for you.
^(\"([^\"\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"|'([^'\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*'|\"\"\"((?!\"\"\")[^\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"\"\")$
See it in action at regex101.com.
