I would like to add sentence numbers to a text file:
Put [1], [2], [3], ... in front of each sentence.
[1] Sentence one. [2] Sentence two. ...
A sentence ends with one of the characters ., ! or ?.
I have no clue how to do it in Clojure. Here is my attempt:
(def text "Martin Luther King, Jr.
I Have a Dream
delivered 28 August 1963, at the Lincoln Memorial, Washington D.C.
I am happy to join with you today in what will go down in history as the greatest demonstration for freedom in the history of our nation.
Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation. This momentous decree came as a great beacon light of hope to millions of Negro slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of their captivity.
But one hundred years later, the Negro still is not free. One hundred years later, the life of the Negro is still sadly crippled by the manacles of segregation and the chains of discrimination. One hundred years later, the Negro lives on a lonely island of poverty in the midst of a vast ocean of material prosperity. One hundred years later, the Negro is still languished in the corners of American society and finds himself an exile in his own land. And so we've come here today to dramatize a shameful condition.")
Define sentence ending:
(def sentence-ending #"[.!?]")
Use replace function:
(require '[clojure.string :as str])
(str/replace text sentence-ending "[number]")
I know this is logically wrong! It just replaces every ., ! and ? with a string. Perhaps string replacement is not the right approach. How can I tackle this problem?
You can split the text into a sequence of sentences, map over them to prepend [number] to each one, and join the sentences again into one string.
(->> (clojure.string/split text #"[.?!]")        ; split text
     (map-indexed #(str "[" (inc %1) "] " %2))   ; prepend number
     (apply str))                                ; join to one string
But this splitting condition is naive. It drops the punctuation, and as you can see, some words contain a . that does not end a sentence (for example Jr. and D.C.). You should refine the sentence termination condition.
One way to get the full sentence (including the punctuation) is to match each whole sentence with a regex and use a matcher. I don't know if this is the best way, but it works.
After that, I think interleave works nicely for this kind of problem.
(let [matcher   (re-matcher #"[^.!?]*[.!?]" text)
      sentences (take-while seq (repeatedly #(re-find matcher)))
      numbers   (map #(str "[" % "] ") (iterate inc 1))] ; start at 1; (range) would start at 0
  (apply str (interleave numbers sentences)))
I am trying to match spaces just after punctuation marks so that I can split up a large corpus of text, but I am seeing some common edge cases with places, titles and common abbreviations:
I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith
I am using this with the re.split function in Python 3. I want to get this:
["I am from New York, N.Y. and I would like to say hello!",
"How are you today?",
"I am well.",
"I owe you $6. 00 because you bought me a No. 3 burger.",
"-Sgt. Smith"]
This is currently my regex:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)(?<=[^N]..)(?<=[^o].)
I decided to try to fix the No. case first, with the last two conditions. But it relies on matching the N and the o independently, which I think is going to cause false positives elsewhere. I cannot figure out how to make it match just the string No behind the period. I will then use a similar approach for Sgt. and any other "problem" strings I come across.
I am trying to use something like:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)^(?<=^No$)
But it doesn't capture anything after that. How can I get it to exclude certain strings which I expect to have a period in them, and not capture them?
Here is a regexr of my situation: https://regexr.com/4sgcb
This is the closest regex I could get (the trailing space is the one we match):
(?<=(?<!(No|\.\w))[\.\?\!])(?! *\d+ *)
which will split also after Sgt. for the simple reason that a lookbehind assertion has to be fixed width in Python (what a limitation!).
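One possible workaround for the fixed-width restriction, sketched here as an assumption rather than as the answerer's method, is to chain several negative lookbehinds, one per abbreviation, since each of them is fixed width on its own. The sample sentence and the names SPLIT and variable_width_ok are made up for illustration.

```python
import re

# A single lookbehind whose alternatives differ in width is rejected by
# the standard re module.
try:
    re.compile(r"(?<!No|Sgt)\.")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Each lookbehind below has a fixed width, so they can be chained.
# Split on whitespace that follows ., ? or ! unless the text just before
# it is "No." or "Sgt.".
SPLIT = re.compile(r"(?<!No\.)(?<!Sgt\.)(?<=[.?!])\s+")

sample = "Order a No. 3 burger. -Sgt. Smith left! Bye."
sentences = SPLIT.split(sample)
print(sentences)
```

This only scales to a short, known list of abbreviations, and it never splits after "No." even when "No." really does end a sentence.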
This is how I would do it in vim, which has no such limitation (the trailing space is the one we match):
\(\(No\|Sgt\|\.\w\)\#<![?.!]\)\( *\d\+ *\)\#!\zs
For the OP as well as the casual reader, this question and the answers to it are about lookarounds and are very interesting.
You may consider a matching approach instead; it gives you better control over the entities you want to treat as single words rather than as sentence-break signals.
Use a pattern like
\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))
See the regex demo
It is very similar to what I posted here, but it contains a pattern to match poorly formatted float numbers, added No. and Sgt. abbreviation support and a better handling of strings not ending with final sentence punctuation.
Python demo:
import re
p = re.compile(r'\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))')
s = "I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith"
for m in p.findall(s):
    print(m)
Output:
I am from New York, N.Y. and I would like to say hello!
How are you today?
I am well.
I owe you $6. 00 because you bought me a No. 3 burger.
-Sgt. Smith
Pattern details
\s* - matches 0 or more whitespace (used to trim the results)
(?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+ - one or more occurrences of several alternatives:
\d+\.\s*\d+ - 1+ digits, ., 0+ whitespace characters, 1+ digits
(?:No|M[rs]|[JD]r|S(?:r|gt))\. - abbreviated strings like No., Mr., Ms., Jr., Dr., Sr., Sgt.
\.(?!\s+-?[A-Z0-9]) - a dot that is not followed by 1+ whitespace characters, an optional -, and an uppercase letter or digit
| - or
[^.!?] - any character other than ., ! and ?
(?:[.?!]|$) - a ., ! or ?, or the end of the string.
As mentioned in my comment above, if you are not able to define a fixed set of edge cases, this might not be possible without false positives or false negatives. Again, without context you are not able to distinguish between abbreviations like "-Sgt. Smith" and ends of sentences like "Sergeant is oftentimes abbreviated as Sgt. This makes it shorter.".
However, if you can define a fixed set of edge cases, it's probably easier and much more readable to do this in multiple steps.
1. Identify your edge cases
For example, you can distinguish "I'll have a No. 3" and "No. I am your father" by checking for a subsequent number. So you would identify that edge case with a regex like this: No\. \d. (Again, context matters. Sentences like "Is 200 enough? No. 200 is not enough." will still give you a false positive.)
2. Mask your edge cases
For each edge case, mask the string with a respective string that will 100% not be part of the original text. E.g. "No." => "======NUMBER======"
3. Run your algorithm
Now that you got rid of your unwanted punctuations, you can run a simpler regex like this to identify the true positives: [\.\!\?]\s
4. Unmask your edge cases
Turn "======NUMBER======" back into "No."
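The four steps above can be sketched in Python roughly as follows. The EDGE_CASES table, its two entries, and the split_sentences name are hypothetical and only illustrate the mask, split, unmask idea; a real corpus would need its own list of edge cases.

```python
import re

# Hypothetical edge cases: (pattern to mask, original literal, placeholder).
# The placeholders are assumed never to occur in the real text.
EDGE_CASES = [
    (re.compile(r"\bNo\.(?=\s+\d)"), "No.", "======NUMBER======"),
    (re.compile(r"\bSgt\."), "Sgt.", "======SGT======"),
]

def split_sentences(text):
    # Steps 1 and 2: identify the edge cases and mask them.
    for pattern, _literal, placeholder in EDGE_CASES:
        text = pattern.sub(placeholder, text)
    # Step 3: run the simple splitter on the masked text.
    parts = re.split(r"(?<=[.!?])\s+", text)
    # Step 4: turn the placeholders back into the original strings.
    restored = []
    for part in parts:
        for _pattern, literal, placeholder in EDGE_CASES:
            part = part.replace(placeholder, literal)
        restored.append(part)
    return restored

print(split_sentences("I am well. I bought a No. 3 burger. -Sgt. Smith"))
```

Because "No." is only masked when a number follows, "No. I am your father." is still split after "No.", as described in step 1.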
Doing it with only one regex will be tricky - as stated in comments, there are lots of edge cases.
Personally, I would do it in three steps:
Replace spaces that should stay with some special character (re.sub)
Split the text (re.split)
Replace the special character with space
For example:
import re
zero_width_space = '\u200B'
s = 'I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith'
s = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+|(?<=Sgt\.)\s+', zero_width_space, s)
s = re.split(r'(?<=[.?!])\s+', s)
from pprint import pprint
pprint([line.replace(zero_width_space, ' ') for line in s])
Prints:
['I am from New York, N.Y. and I would like to say hello!',
'How are you today?',
'I am well.',
'I owe you $6. 00 because you bought me a No. 3 burger.',
'-Sgt. Smith']
I want to create series of puzzle games where you change one letter in a word to create a new word, with the aim of reaching a given target word. For example, to change "this" to "that":
this
thin
than
that
What I want to do is create a regex which will scan a list of words and choose all those that do not match the current word by all but one letter. For example, if my starting word is "pale" and my list of words is...
pale
male
sale
tale
pile
pole
pace
page
pane
pave
palm
peal
leap
play
help
pack
... I want all the words from "peal" to "pack" to be selected. This means that I can delete them from my list, leaving only the words that could be the next match. (It's OK for "pale" itself to be unselected.)
I can do this in parts:
^.(?!ale).{3}\n selects words not like "*ale"
^.(?<!p).{3}\n|^.{2}(?!le).{2}\n selects words not like "p*le"
^.{2}(?<!pa).{2}\n|^.{3}(?!e).\n selects words not like "pa*e"
^.{3}(?<!pal).\n selects words not like "pal*".
However, when I put them together...
^.(?!ale).{3}\n|^.(?<!p).{3}\n|^.{2}(?!le).{2}\n|^.{2}(?<!pa).{2}\n|^.{3}(?!e).\n|^.{3}(?<!pal).\n
... everything but "pale" is matched.
I need some way to create an AND relationship between the different regexes, or (more likely) a completely different approach.
You can use the Python regex module that allows fuzzy matching:
>>> import regex
>>> regex.findall(r'(?:pale){s<=1}', "male sale tale pile pole pace page pane pave palm peal leap play help pack")
['male', 'sale', 'tale', 'pile', 'pole', 'pace', 'page', 'pane', 'pave', 'palm']
In this case, a match may contain zero or one substitution.
Or consider the TRE library and its command-line tool agrep, which supports a similar syntax.
Given:
$ echo $s
male sale tale pile pole pace page pane pave palm peal leap play help pack
You can filter to a list of a single substitution:
$ echo $s | tr ' ' '\n' | agrep '(?:pale){ 1s <2 }'
male
sale
tale
pile
pole
pace
page
pane
pave
palm
Here's a solution that uses cool python tricks and no regex:
def almost_matches(word1, word2):
    # Count positions where the letters agree; 3 of 4 means one letter differs.
    return sum(map(str.__eq__, word1, word2)) == 3

for word in "male sale tale pile pole pace page pane pave palm peal leap play help pack".split():
    print(almost_matches("pale", word))
A completely different approach: Levenshtein distance
...the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
PHP example:
$words = array(
"pale",
"male",
"sale",
"tale",
"pile",
"pole",
"pace",
"page",
"pane",
"pave",
"palm",
"peal",
"leap",
"play",
"help",
"pack"
);
foreach ($words as $word)
    if (levenshtein("pale", $word) > 1)
        echo $word . "\n";
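For readers without PHP's built-in levenshtein, here is a minimal Python sketch of the same filter. The levenshtein function below is a plain dynamic-programming implementation written for illustration, not a library call, and the variable names are made up.

```python
def levenshtein(a, b):
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

words = ("pale male sale tale pile pole pace page pane pave palm "
         "peal leap play help pack").split()

# Words more than one edit away from "pale" are the ones to delete.
to_delete = [word for word in words if levenshtein("pale", word) > 1]
print(to_delete)
```

Levenshtein distance also counts insertions and deletions, so with words of mixed length it would keep "pal" or "pales" as neighbours of "pale". For a puzzle with fixed-length words that makes no difference.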
This assumes the word on the first line is the keyword. Just a brute force parallel letter-match and count gets the job done:
awk 'BEGIN{FS=""}
NR==1{n=NF;for(i=1;i<=n;++i)c[i]=$i}
NR>1{j=0;for(i=1;i<=n;++i)j+=c[i]==$i;if(j<n-1)print}'
A regexp general solution would need to be a 2-stepper I think -- generate the regexp in first step (from the keyword), run the regexp against the file in the second step.
By the way, the way to do an "and" of regexps is to string lookaheads together (and the lookaheads don't need to be as complicated as the ones you had above, I think):
^(?!.ale)(?!p.le)(?!pa.e)(?!pal.)
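The two-step idea (generate the regexp from the keyword, then run it against the list) could look roughly like this in Python. The function name one_letter_off_filter is made up for this sketch, and it assumes every word in the list has the same length as the keyword.

```python
import re

def one_letter_off_filter(keyword):
    # Step 1: build one negative lookahead per position, with that
    # position replaced by a wildcard, e.g. (?!.ale)(?!p.le)...
    lookaheads = "".join(
        "(?!{}.{})".format(re.escape(keyword[:i]), re.escape(keyword[i + 1:]))
        for i in range(len(keyword))
    )
    return re.compile("^" + lookaheads)

words = ("pale male sale tale pile pole pace page pane pave palm "
         "peal leap play help pack").split()

# Step 2: run the generated regexp against the word list.
pattern = one_letter_off_filter("pale")
to_delete = [word for word in words if pattern.match(word)]

print(pattern.pattern)
print(to_delete)
```

The pattern matches only words that fail every "one wildcard" template, which is the AND relationship the question asks for. "pale" itself is not selected.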
Can anyone explain why sort-by is reacting like this with these keyfunctions?
user=> (sort-by number? [1 2 13 4 "s" 0 "a"])
("s" "a" 1 2 13 4 0)
user=> (sort-by str [1 2 3 4 "s" 0 "a"])
(0 1 2 3 4 "a" "s")
My guess is that it is dividing the elements of the vector into strings and numbers. Is there anything more to what's happening here?
And my second question: does sort-by travel through every item of the vector before returning the result?
number? returns true or false depending on whether the input is a number. false is apparently less than true for comparisons.
str returns a string whose value depends on the input. e.g. (str 1) => "1". String comparison is somewhat complicated, but, in general, numerals are less than uppercase letters are less than lowercase letters and letters are sorted in alphabetical order.
I'm not sure exactly what behavior you want, but it would seem that (sort-by number? ...) did indeed "divide the vector into strings and numbers" by giving you strings at the start of the list and numbers at the end.
If you want to separate strings from numbers, use (group-by number? ...)
As for your second question, sort-by uses the keyfn for comparisons during a merge sort.
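The same behaviour can be reproduced outside Clojure. The sketch below is a Python analogue, not Clojure code: sorted with a key function plays the role of sort-by, and the helper is_number is a made-up stand-in for number?. It shows that a boolean key only splits the input into two groups (False sorts before True) and leaves the order within each group untouched.

```python
def is_number(value):
    # Stand-in for Clojure's number? (bool is excluded on purpose).
    return isinstance(value, (int, float)) and not isinstance(value, bool)

items = [1, 2, 13, 4, "s", 0, "a"]

# Key is False for strings and True for numbers; False sorts first,
# and Python's sort is stable, so each group keeps its original order.
by_type = sorted(items, key=is_number)

# Key is the string form, so elements are compared as text.
by_text = sorted([1, 2, 3, 4, "s", 0, "a"], key=str)

print(by_type)
print(by_text)
```

The key function is applied to every element, and the sort only compares the resulting keys, never the mixed strings and numbers directly.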
I need to read through a string (which is a sequence) 3 characters at a time. I know of take-while and of take 3 and since take returns nil when there is no more input it seems like the perfect predicate for take-while but I cannot figure out how to wrap the string sequence so that it returns string of the next 3 characters at a time. If this was an object oriented language I'd wrap the sequence's read call or something, but with Clojure I have no idea how to proceed further.
You can use partition or partition-all
user=> (partition 3 "abcdef")
((\a \b \c) (\d \e \f))
The docs for both are
clojure.core/partition
([n coll] [n step coll] [n step pad coll])
Returns a lazy sequence of lists of n items each, at offsets step
apart. If step is not supplied, defaults to n, i.e. the partitions
do not overlap. If a pad collection is supplied, use its elements as
necessary to complete last partition upto n items. In case there are
not enough padding elements, return a partition with less than n items.
nil
clojure.core/partition-all
([n coll] [n step coll])
Returns a lazy sequence of lists like partition, but may include
partitions with fewer than n items at the end.
nil
If your string is not guaranteed to have a length that is a multiple of three, you should probably use partition-all. The last partition may then contain fewer than 3 elements. If you want to use partition instead, then to avoid having characters chopped off the end of the string, pass step=3 and a padding collection to fill the holes in the last partition.
To turn every tuple to a string, you can use apply str on every tuple. So you'd want to use map here.
user=> (map (partial apply str) (partition-all 3 "abcdef"))
("abc" "def")
You can do this without boxing every character:
(re-seq #"..." "Some words to split")
;("Som" "e w" "ord" "s t" "o s" "pli")
If, as your comment on @turingcomplete's answer indicates, you want every other triple,
(take-nth 2 (re-seq #"..." "Some words to split"))
;("Som" "ord" "o s")
I am trying to create a fill-in-the-blanks worksheet from a chunk of text, and I think regex and a replace function in a text editor will greatly expedite my project.
Example text:
HAMLET O, that this too too solid flesh would melt Thaw and resolve
itself into a dew! Or that the Everlasting had not fix'd His canon
'gainst self-slaughter! O God! God! How weary, stale, flat and
unprofitable, Seem to me all the uses of this world! Fie on't! ah fie!
'tis an unweeded garden, That grows to seed; things rank and gross in
nature Possess it merely. That it should come to this! But two months
dead: nay, not so much, not two: So excellent a king; that was, to
this, Hyperion to a satyr; so loving to my mother That he might not
beteem the winds of heaven Visit her face too roughly. Heaven and
earth! Must I remember? why, she would hang on him, As if increase of
appetite had grown By what it fed on: and yet, within a month-- Let me
not think on't--Frailty, thy name is woman!-- A little month, or ere
those shoes were old With which she follow'd my poor father's body,
Like Niobe, all tears:--why she, even she-- O, God! a beast, that
wants discourse of reason, Would have mourn'd longer--married with my
uncle, My father's brother, but no more like my father Than I to
Hercules: within a month: Ere yet the salt of most unrighteous tears
Had left the flushing in her galled eyes, She married. O, most wicked
speed, to post With such dexterity to incestuous sheets! It is not nor
it cannot come to good: But break, my heart; for I must hold my
tongue.
Replace every other text set with a blank made of "_" characters that has the same length as the text it replaces, where a text set is defined as a group of words ending with "!", ",", "--", "?", etc.
So the above text from Hamlet becomes like
HAMLET O, ___________________ Or that the
Everlasting had not fix'd His canon 'gainst self-slaughter! __
God! _____, stale, ________ ......
What is the regex that I should use to achieve this end?
Here is an attempt using perl regex:
perl -pe 's/(.*?)([\!\?\,;\.]|--)(.*?)([\!\?\,;\.]|--)/\1\2________________\4/g' file
Output:
HAMLET O,_______! Or that the Everlasting had not fix'd His
canon 'gainst self-slaughter!_______! God!_______,
stale,_______, Seem to me all the uses of this
world!_______! ah fie!_______, That grows to
seed;_______. That it should come to this!_______,
not so much,_______; that was,_______, Hyperion to a
satyr;_______. Heaven and earth!_______?
why,_______, As if increase of appetite had grown By what it
fed on: and yet,_______-- Let me not think
on't--_______, thy name is woman!_______-- A little
month,_______, Like Niobe,_______--why
she,_______-- O,_______! a beast,_______,
Would have mourn'd longer--_______, My father's
brother,_______, She married._______, most wicked
speed,_______! It is not nor it cannot come to good: But
break,_______; for I must hold my tongue.
This solution inserts a fixed number of "_" characters; I have yet to figure out how to make the blank match the length of the text it replaces.
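Matching the blank to the length of the replaced text is awkward in a plain search-and-replace, but it is straightforward with a small script. The following Python sketch is one way it might be done, under stated assumptions: the delimiter set is taken from the Perl attempt ("--" plus ! ? , ; .), and the name blank_alternate_segments is made up for illustration.

```python
import re

# Delimiters that end a "text set"; the capture group makes re.split
# keep them in the result.
DELIMITER = re.compile(r"(--|[!?,;.])")

def blank_alternate_segments(text):
    # parts = [segment, delimiter, segment, delimiter, ..., segment]
    parts = DELIMITER.split(text)
    # Segments sit at even indices; blank every second one (2, 6, 10, ...).
    for i in range(2, len(parts), 4):
        stripped = parts[i].strip()
        if stripped:
            # One underscore per character; line breaks inside the
            # segment are kept so the layout of the text survives.
            blank = re.sub(r"[^\n]", "_", stripped)
            parts[i] = parts[i].replace(stripped, blank, 1)
    return "".join(parts)

print(blank_alternate_segments("O, that this! Or that, fix'd."))
```

The result has exactly the same length as the input, and the surrounding whitespace and delimiters are left in place.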