Regex for removing periods at the end of sentences but not in abbreviations - regex

Looking for some ideas on how to remove the period character at the end of sentences without removing the periods in abbreviations. For instance:
"The N.J. turnpike is long. Today is a beautiful day."
Would be changed to:
"The N.J. turnpike is long Today is a beautiful day"

This is a hard problem. Lingua::EN::Sentence makes a three-quarters assed attempt to solve it. It knows about common abbreviations in American English and has hooks for you to add other abbreviations you know about.
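As a rough illustration of the abbreviation-list idea (in Python rather than Perl, and not how Lingua::EN::Sentence is actually implemented), one could treat every whitespace-delimited token ending in a period as a candidate and keep the period only when the token is on a known list. The ABBREVIATIONS set and the function name below are made up for this sketch, and the list is far too small for real text:

```python
import re

# Hypothetical, deliberately tiny abbreviation list; extend it for real text.
ABBREVIATIONS = {"N.J.", "U.S.", "Dr.", "Mr.", "Mrs.", "etc."}

def strip_sentence_periods(text, abbreviations=ABBREVIATIONS):
    """Drop a trailing period from each token unless the token is a known abbreviation."""
    def fix(match):
        token = match.group()
        if token in abbreviations:
            return token          # keep the abbreviation untouched
        return token[:-1]         # drop the sentence-final period

    # A candidate is a run of non-space characters ending in a period
    # that is followed by whitespace or the end of the text.
    return re.sub(r"\S+\.(?=\s|$)", fix, text)

print(strip_sentence_periods("The N.J. turnpike is long. Today is a beautiful day."))
# The N.J. turnpike is long Today is a beautiful day
```

This still gets a sentence that ends in an abbreviation wrong, which is part of why the general problem is hard.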

As others have said, this is a very difficult task in the general case. If you want to learn more, you should start out by reading more about "sentence segmentation" or "sentence boundary disambiguation," which is the task of dividing a text into sentences. Here are a few links to get you started:
http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
http://en.wikipedia.org/wiki/Text_segmentation#Sentence_segmentation
http://www.robincamille.com/2012-02-18-nltk-sentence-tokenizer/
http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html#sec-further-examples-of-supervised-classification
http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html#punkt-tokenizer

Why would you want to remove periods at the end of a sentence in light of abbreviations? Go BIG: remove all dots, or go NONE!

Related

Regex select words longer than 4 characters but only one instance if duplicates

I am trying to format text in InDesign using GREP Style.
The goal is to select words of 4 or more letters in a paragraph, but if a word is repeated in the paragraph, only its first instance should be selected.
This is sample text:
"The Lord's right hand is lifted high; the Lord's right hand has done mighty things!"
The solution should give
Lord right hand lifted high done mighty things
I have done the first part:
[[:word:]]{4,}
but don't have a clue how to deal with those duplicates.
Is order a requirement? If not, how about words longer than 4 characters not followed by that same word later in the text? See:
([[:word:]]{4,})(?!.*\1)
https://regex101.com/r/Ug4dLZ/1
Result: lifted high Lord right hand done mighty things
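To be more comprehensive, include word breaks (i.e. count "Person" and "Personhood" as 2 separate words). Anchor the captured word itself as well, otherwise the engine can backtrack to a partial word such as "righ", which never appears as a whole word later and so slips through:
\b([[:word:]]{4,})\b(?!.*\b\1\b)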
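For anyone who wants to try the idea outside InDesign, here is a small Python sketch. Python's re module has no [[:word:]] class, so \w is used as a stand-in, and the word is anchored with \b on both sides. The second expression is not a GREP solution at all; it only shows that keeping the first instance (what the question asks for) is easy once a scripting language is available:

```python
import re

text = ("The Lord's right hand is lifted high; "
        "the Lord's right hand has done mighty things!")

# Lookahead approach: a word is kept only if it does not occur again later,
# so the LAST occurrence of each repeated word survives.
last_instances = re.findall(r"\b(\w{4,})\b(?!.*\b\1\b)", text)

# Order-preserving de-duplication: the FIRST occurrence of each word survives.
first_instances = list(dict.fromkeys(re.findall(r"\b\w{4,}\b", text)))

print(last_instances)
# ['lifted', 'high', 'Lord', 'right', 'hand', 'done', 'mighty', 'things']
print(first_instances)
# ['Lord', 'right', 'hand', 'lifted', 'high', 'done', 'mighty', 'things']
```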

Regex lookbehind - excluding words from searches

I need to search my corpus for words such as game or shame, but I would like the search to exclude three kinds of strings: a game/a shame, A game/A shame, and a/an/A/An WORD game or a/an/A/An WORD shame, where WORD is a modifier, e.g., a great game or a great shame.
If someone could help me out, that would be great, thanks!
In my corpus, the optional WORD between the indefinite article a/an and game or shame is most commonly great or real, so even excluding just these two would already help me a lot.
The lookbehind below works perfectly to exclude a/A
(?<!a\s|A\s)\bshame\b
To exclude the modifying WORD, I tried using ?\w in the lookbehind, but it just wouldn't work. The pattern below, without the ?, runs and still excludes examples such as a shame, but it also returns undesired examples such as a great shame or a crying shame (see concordance lines 3 and 4 in the sample text below):
(?<!a\s|A\s|a\b\w\b|A\b\w\b)\bshame\b
The tool I'm using to implement regex is AntConc, which supports Perl regular expressions.
Sample text with two irrelevant examples (3 & 4) after using the search string below
(?<!a\s|A\s)\bshame\b
1 (match shame)
, people ogling from the sidelines. If you want a closer look, you have to ring for entry and wait to be admitted. I guess me and Saul just have no shame (or just know the benefits of our bank accounts being in hard currencies), because we wandered into plenty. Lots and lots of little boutiques and edgily designed fashion stores with music blaring.& abbutterflie.txt 47 1
2 (match shame)
last twenty years and I've experienced all sorts of biggotry but I seriously thought that anti black nazism in football wass a thing of the past. You should all hang your heads in shame, bunch of [badword]s. adamdphillips.txt 57 1
3 (don't match shame)
me monetarily as I wasn't that close to her, but she was really good friends with the other girl and it's messed that up for them a bit, which is a great shame. Anyway, Holly and I have since found somewhere to move in just the two of us. It's going to cost an absolute fortune and I'm going to be eating basics beans on aderyn.txt 60 1
4 (don't match shame)
are loads of amazingly good bands out there, gigging up and down the country who will never get signed because no-one can figure out how to market them, and this is a crying shame. There are artists out there like Thea Gilmore and <a href="http://blog.amandapalmer.net/" rel="nofollow"> Amanda Palmer& aderyn.txt 60 2
5 (match shame)
/><br />"There is no better time to show these terrorists that we have no fear of them. Instead we are forced, through the cowardly acts of our superiors, to hide in shame."<br /><br />But Herb Wiseman, high school consultant for Lee County, Florida, pointed to the July 7 London bombings.<br /><br />"What happens if kids get on aggy91.txt 64 1
Because variable length negative lookbehinds are not allowed, the approach in your previous question's answer won't transfer to this one.
I've gone with a (*SKIP)(*FAIL) pattern. This will match and discard the disqualified matches, and only retain qualifying matches:
/[Aa]n?( \w+)? shame(*SKIP)(*FAIL)|shame/ (3844 steps)
Or if you wish to include word boundary metacharacters:
/\b[Aa]n?( \w+)? shame\b(*SKIP)(*FAIL)|\bshame\b/ (4762 steps)
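If the engine at hand lacks the (*SKIP)(*FAIL) verbs (Python's standard re module is one example), a similar "match it and throw it away" effect can be approximated with a capture group: the unwanted phrases are matched by a non-capturing alternative, and only hits where the capturing alternative fired are kept. This is a swapped-in technique, not the verbs themselves; the function name and the shortened sample lines are invented for the sketch:

```python
import re

# First alternative swallows "a shame", "A great shame", "an utter shame", ...
# Only the second alternative captures, so group 1 is set only for wanted hits.
PATTERN = re.compile(r"\b[Aa]n?(?: \w+)? shame\b|\b(shame)\b")

def wanted_hits(text):
    """Return the start offsets of 'shame' that are not part of an excluded phrase."""
    return [m.start(1) for m in PATTERN.finditer(text) if m.group(1)]

samples = [
    "I guess me and Saul just have no shame",
    "You should all hang your heads in shame, bunch of",
    "which is a great shame. Anyway, Holly and I",
    "and this is a crying shame. There are artists",
    "to hide in shame.",
]

for s in samples:
    print(bool(wanted_hits(s)), s)
```

Only the first, second and fifth samples print True, mirroring concordance lines 1, 2 and 5 above.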

Excluding % from a Regex number search

I'm attempting to create a Regex that finds only 2-digit integers or numbers with a precision of 2 decimal points.
In the example string at the bottom, I want to find only the following:
21 and 10.50
Using this expression, 100% is getting captured, in addition to the strings I desire to capture:
(\d){1,2}(\.?)([0-9]?[0-9]?){1,2}
I know I need to use ^% somewhere, but I can't figure out where it goes. Any suggestions are greatly appreciated.
Here's my sample string:
Earn Up to $21 Per Hour - Deliver Food with !!
Delivery Drivers work when they want and make great money when they do.
All orders are prepaid, just pick them up and deliver them to hungry diners. No waiting in line or fumbling with receipts and prepaid cards.
It's fast and easy to start working. Get started today.
Apply Now
Why choose ?
More orders than any other takeout platform
100% of our restaurants are official partners
Competitive pay: Per order fee + mileage + tips
We guarantee an hourly minimum of $10.50/hour*
Create your own schedule & work the hours you want
Word boundaries in your regular expression will grant you a bit more control.
Since word boundaries are a bit strict, we need to introduce an OR condition to address both cases which will satisfy your regex.
(\b[\d]{2}\.[\d]{2}\b)|(\b[\d]{2}\b)
Edit: Try this one,
\b[\d]{2}\b(\.[\d]{2})?
The first example has a chance to fail as it is order dependent due to the way it short-circuits. This I believe should address multiple cases properly.
I think this should work:
(?<!\d)((\d+\.\d\d)|(\d\d))(?!%|\d)
EDIT:
Improved version:
(?<!\d)(\d{1,2}(?:\.\d{1,2})?)(?!%|\d)
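A quick way to check the improved pattern is to run it over a few lines of the sample text. The snippet below uses Python, which is an assumption since the question does not name a language; the lookbehind is fixed-width, so the standard re module accepts it:

```python
import re

sample = (
    "Earn Up to $21 Per Hour - Deliver Food\n"
    "100% of our restaurants are official partners\n"
    "We guarantee an hourly minimum of $10.50/hour*\n"
)

# (?<!\d)  : not in the middle of a longer number
# (?!%|\d) : not followed by a percent sign or another digit
NUMBER = re.compile(r"(?<!\d)(\d{1,2}(?:\.\d{1,2})?)(?!%|\d)")

print(NUMBER.findall(sample))
# ['21', '10.50']
```

Note that \d{1,2} also accepts single digits, so a lone "7" elsewhere in the text would be matched too.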
You can try this variant: (\d{1,}|[\d.])\b(?!%)
It uses a negative lookahead (?!%) to exclude digits followed by a % sign.

Using RegEx to Find a Block of Text

I'm attempting to block a long string of unnecessary text that's on every page of a document.
Ex: "36075 This is another page and this is the date March 4 2013"
I know this must be very simple, but I'm hoping there is a way to block text verbatim. Is the only way to block this text by using a lot of /d/s/w+/+ etc., or is there a way to say, "match 36075 This is another page and this is the date March 4 2013"?
This would be SO HELPFUL to know. Thank you for helping!
From what you wrote, I assume you need to get the leading number from the string. To do that, you just need this pattern: ^\d+ which, from this input:
36075 This is another page and this is the date March 4 2013
will return this:
36075
In the future, when asking questions like this, please provide an example string and the expected output, as well as what you have tried.
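Both readings of the question can be sketched in a few lines of Python (the language is an assumption, as the question does not name one). The first part pulls out the leading number with ^\d+; the second matches the whole line verbatim, using re.escape so that no character is treated as a regex metacharacter. The page text is invented for the example:

```python
import re

line = "36075 This is another page and this is the date March 4 2013"

# Reading 1: extract the leading number.
leading = re.match(r"^\d+", line)
print(leading.group())  # 36075

# Reading 2: match the text verbatim and strip it from a page.
verbatim = re.compile(re.escape(line))
page = "Some body text.\n" + line + "\nMore body text."  # made-up page
cleaned = verbatim.sub("", page)
print(line in cleaned)  # False
```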
I realized the issue I was having. I didn't need to use RegEx. The program I was using has the functionality to match specific words or groups of words and pronounce them differently. What I discovered is that it will not match the words unless the word groups are input exactly the way the program typically reads them.
Ergo --> The channel saw
the end of the British hold over
Would have to be listed as one group for, "The channel saw" and a second group for "the end of the British hold over"
In addition, there were some numbers --> 11960_30_o_ho_
and if the program naturally read 119 and then 60_3 and then _o_ho_ then three strings would need to be input for each section.
A few frustrating hours later, problem solved :) Thank you for your assistance.

replacing specific characters in between specific elements

I'd like to use a regular expression to replace a space in a string. The space in question is the only space between two elements in the string. The string itself, however, contains many more elements and spaces. So far I've tried
(<-)[\s]*?(->)
But that doesn't work. It is supposed to take
<-word anotherword->
and allow me to replace the space in it.
As \s selects all spaces, and
(<-)[\s\S]*?(->)
selects all characters in between the <- and ->, I tried to re-use the expression but for the spaces only.
I'm not so good at these expressions, and I can't for the life of me find an answer anywhere.
If anyone could just point me to the answer, that would be great. Thanks.
It's difficult to be sure what you want; please post some before and after examples, and specify what language you are using.
But, it looks like (<-\S+)\s*(\S+->) should probably do it (deletes spaces).
If the <- and -> are NOT to be preserved, move them out of the parentheses, like so:
<-(\S+)\s*(\S+)->
Here's what it would look like in JavaScript:
var before = "Ten years ago a crack <-flegan esque-> unit was sent to prison by a military "
+ "court for a crime they didn't commit.\n"
+ "These men promptly escaped from a maximum security stockade to the "
+ "<-flargon 666-> underground.\n"
+ "Today, still wanted by the government, they survive as soldiers of fortune.\n"
+ "If you have a problem and no one else can help, and if you can find them, "
+ "maybe you can hire the <-flugen 9->.\n"
;
var after = before.replace (/(<-\S+)\s*(\S+->)/g, "$1$2");
alert (after);
Which yields:
Ten years ago a crack <-fleganesque-> unit was sent to prison by a military court for a crime they didn't commit.
These men promptly escaped from a maximum security stockade to the <-flargon666-> underground.
Today, still wanted by the government, they survive as soldiers of fortune.
If you have a problem and no one else can help, and if you can find them, maybe you can hire the <-flugen9->.
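If the goal is to replace the space with something rather than delete it, or if there may be more than one space between the markers, a replacement callback is one option. The sketch below is in Python rather than JavaScript, and the function name and the "_" replacement character are assumptions made for the example:

```python
import re

def replace_inner_spaces(text, replacement="_"):
    """Replace every whitespace run inside <-...-> with `replacement`."""
    def fix(match):
        # Only the text between (and including) the markers is touched.
        return re.sub(r"\s+", replacement, match.group())

    # Non-greedy, so each <-...-> pair is handled on its own.
    return re.sub(r"<-.*?->", fix, text)

print(replace_inner_spaces("a crack <-flegan esque-> unit was sent to prison"))
# a crack <-flegan_esque-> unit was sent to prison
print(replace_inner_spaces("hire the <-flugen 9->.", replacement=""))
# hire the <-flugen9->.
```

Spaces outside the markers are left alone because the outer pattern never matches them.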