Get more than 1 quotations in text paragraph in R regex - regex

First: find the text inside the quotation marks ("I want everything inside here").
Second: extract the one sentence before each quotation.
I would like to produce the output below using a lookbehind regex in R, if possible.
Example:
Yoyo. He is sad. Oh no! "Don't sad!" Yeah: "Testing... testings," Boys. Sun. Tree... 0.2% green,"LL" "WADD" HOLA.
Desired Output:
[1] Oh no! "Don't sad!"
[2] Yeah: "Testing... testings"
[3] Tree... 0.2% green, "LL"
[4] Tree... 0.2% green, "LL" "WADD"
dput:
"Yoyo. He is sad. Oh no! \"Don't sad!\" Yeah: \"Testing... testings,\" Boys. Sun. Tree... 0.2% green,\"LL\" \"WAAD\" HOLA."
I tried this, but it doesn't work:
str_extract(t, "(?<=\\.\\s)[^.:]*[.:]\\s*\"[^\"]*\"")
Also tried:
regmatches(t , gregexpr('^[^\\.]+[\\.\\,\\:]\\s+(.*(?:\"[^\"]+\\")).*$', t))
regmatches(t , gregexpr('\"[^\"]*\"(?<=\\s[.?][^\\.\\s])', t))
Tried your method, @naurel:
> regmatches(t, regexpr("(?:\"? *([^\"]*))(\"[^\"]*\")", t, perl=T))
[1] " Yoyo. He is sad. Oh no! \"Don't sad!\""

Since you only want the last sentence, I've cleaned up the regex for you.
Explanation:
First, you are looking for something between quotes. If several quoted strings follow one another, you want them to match as one.
(\"[^\"]*\"(?: *\"[^\"]*\")*)
does the trick. Then you want to match the sentence before this group. A sentence starts with a capital letter, so we start the match at the last capital letter before the previously defined group (i.e. one that is not followed by any other capital letter).
([A-Z](?:[a-z0-9\W\s])*)
Put it together and you obtain:
([A-Z](?:[a-z0-9\W\s])*)(\"[^\"]*\"(?: *\"[^\"]*\")*)
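For readers working outside R, the combined pattern can be sketched in Python as follows. This is only an illustration: it assumes Python's re module handles the pattern the same way as R's Perl-compatible engine for this input, the variable names are made up for the example, and the text is the dput string from the question.

```python
import re

# Sample text taken from the question's dput output.
text = ('Yoyo. He is sad. Oh no! "Don\'t sad!" Yeah: "Testing... testings," '
        'Boys. Sun. Tree... 0.2% green,"LL" "WAAD" HOLA.')

# Group 1: a capital letter followed by lowercase letters, digits and
#          non-word characters (so no further capital letter).
# Group 2: one or more quoted strings separated only by spaces.
pattern = re.compile(r'([A-Z](?:[a-z0-9\W\s])*)("[^"]*"(?: *"[^"]*")*)')

matches = [m.group(0) for m in pattern.finditer(text)]
for match in matches:
    print(match)
```

Because consecutive quoted strings are merged into a single match, this yields three results rather than the four listed in the question.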

Related

Regex match characters when not preceded by a string

I am trying to match spaces just after punctuation marks so that I can split up a large corpus of text, but I am seeing some common edge cases with places, titles and common abbreviations:
I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith
I am using this with the re.split function in Python 3. I want to get this:
["I am from New York, N.Y. and I would like to say hello!",
"How are you today?",
"I am well.",
"I owe you $6. 00 because you bought me a No. 3 burger.",
"-Sgt. Smith"]
This is currently my regex:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)(?<=[^N]..)(?<=[^o].)
I decided to try to fix the No. first, with the last two conditions. But it relies on matching the N and the o independently, which I think is going to cause false positives elsewhere. I cannot figure out how to make it match just the string No before the period. I will then use a similar approach for Sgt. and any other "problem" strings I come across.
I am trying to use something like:
(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)^(?<=^No$)
But it doesn't capture anything after that. How can I get it to exclude certain strings which I expect to have a period in it, and not capture them?
Here is a regexr of my situation: https://regexr.com/4sgcb
This is the closest regex I could get (the trailing space is the one we match):
(?<=(?<!(No|\.\w))[\.\?\!])(?! *\d+ *)
which will split also after Sgt. for the simple reason that a lookbehind assertion has to be fixed width in Python (what a limitation!).
This is how I would do it in vim, which has no such limitation (the trailing space is the one we match):
\(\(No\|Sgt\|\.\w\)\#<![?.!]\)\( *\d\+ *\)\#!\zs
For the OP as well as the casual reader, this question and the answers to it are about lookarounds and are very interesting.
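To make the fixed-width restriction concrete, here is a small Python sketch. It assumes the standard re module (not the third-party regex package), and the sample sentence and variable names are invented for illustration. Chaining one fixed-width negative lookbehind per abbreviation is one possible workaround, not the approach taken in the answer above.

```python
import re

# A lookbehind whose alternatives differ in length is rejected by `re`.
try:
    re.compile(r'(?<!No|Sgt)\.')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Workaround: one fixed-width negative lookbehind per abbreviation.
splitter = re.compile(r'(?<=[.?!])(?<!No\.)(?<!Sgt\.)\s+')

# Hypothetical sample sentence, not the one from the question.
sample = "Order a No. 3 burger. Ask Sgt. Smith! Thanks."
sentences = splitter.split(sample)

print(variable_width_ok)
print(sentences)
```

Each abbreviation needs its own lookbehind, so this only scales to a short, known list of exceptions.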
You may consider a matching approach, it will offer you better control over the entities you want to count as single words, not as sentence break signals.
Use a pattern like
\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))
See the regex demo
It is very similar to what I posted here, but it contains a pattern to match poorly formatted float numbers, added No. and Sgt. abbreviation support and a better handling of strings not ending with final sentence punctuation.
Python demo:
import re
p = re.compile(r'\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))')
s = "I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith"
for m in p.findall(s):
    print(m)
Output:
I am from New York, N.Y. and I would like to say hello!
How are you today?
I am well.
I owe you $6. 00 because you bought me a No. 3 burger.
-Sgt. Smith
Pattern details
\s* - matches 0 or more whitespace (used to trim the results)
(?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+ - one or more occurrences of several alternatives:
\d+\.\s*\d+ - 1+ digits, ., 0+ whitespaces, 1+ digits
(?:No|M[rs]|[JD]r|S(?:r|gt))\. - abbreviated strings like No., Mr., Ms., Jr., Dr., Sr., Sgt.
\.(?!\s+-?[A-Z0-9]) - matches a dot not followed by 1 or more whitespaces and then an optional - and an uppercase letter or digit
| - or
[^.!?] - any character but a ., !, or ?
(?:[.?!]|$) - a ., !, or ?, or the end of the string.
As mentioned in my comment above, if you are not able to define a fixed set of edge cases, this might not be possible without false positives or false negatives. Again, without context you cannot distinguish between abbreviations like "-Sgt. Smith" and ends of sentences like "Sergeant is oftentimes abbreviated as Sgt. This makes it shorter.".
However, if you can define a fixed set of edge cases, it's probably easier and much more readable to do this in multiple steps.
1. Identify your edge cases
For example, you can distinguish "I'll have a No. 3" and "No. I am your father" by checking for a subsequent number. So you would identify that edge case with a regex like No\. \d (again, context matters: sentences like "Is 200 enough? No. 200 is not enough." will still give you a false positive).
2. Mask your edge cases
For each edge case, mask the string with a respective string that will 100% not be part of the original text. E.g. "No." => "======NUMBER======"
3. Run your algorithm
Now that you got rid of your unwanted punctuations, you can run a simpler regex like this to identify the true positives: [\.\!\?]\s
4. Unmask your edge cases
Turn "======NUMBER======" back into "No."
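The four steps above could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions: only the single "No. followed by a digit" edge case is handled, the function name split_sentences is hypothetical, and the split uses a lookbehind so that the punctuation stays attached to each sentence (a small variation on the [\.\!\?]\s pattern from step 3).

```python
import re

# Placeholder assumed never to occur in the original text.
MASK = "======NUMBER======"

def split_sentences(text):
    """Hypothetical helper: mask, split, then unmask."""
    # Steps 1 and 2: identify "No." followed by a digit and mask it.
    masked = re.sub(r'No\.(?= \d)', MASK, text)
    # Step 3: split on whitespace that follows sentence punctuation.
    parts = re.split(r'(?<=[.!?])\s+', masked)
    # Step 4: turn the placeholder back into "No.".
    return [part.replace(MASK, "No.") for part in parts]

print(split_sentences("I'll have a No. 3 burger. No. I am your father."))
```

Further edge cases would each need their own mask string and identifying pattern.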
Doing it with only one regex will be tricky - as stated in comments, there are lots of edge cases.
Myself I would do it with three steps:
Replace spaces that should stay with some special character (re.sub)
Split the text (re.split)
Replace the special character with space
For example:
import re
zero_width_space = '\u200B'
s = 'I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith'
s = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+|(?<=Sgt\.)\s+', zero_width_space, s)
s = re.split(r'(?<=[.?!])\s+', s)
from pprint import pprint
pprint([line.replace(zero_width_space, ' ') for line in s])
Prints:
['I am from New York, N.Y. and I would like to say hello!',
'How are you today?',
'I am well.',
'I owe you $6. 00 because you bought me a No. 3 burger.',
'-Sgt. Smith']

regular expression to match all except the last 2 strings in every line <python>

Here's my data to play around with.
The quick 12 apple
brown8 fox jumped 67 banana
sam 20 ace over 2.5 orange
the13 lazy dog 88.09 grapes
The data is consistent: there is always a number followed by a word (e.g. 12 apple) at the end of every line. I would like the output to be something like: The quick, brown8 fox jumped, sam 20 ace over, the13 lazy dog
You can use the following Regex:
/(?: (?:\d+(?:\.\d+)?) \w+\s?)/g
Then you need to replace the matches with an empty string.
That should give you: 'The quick, brown8 fox jumped, over, the13 lazy dog'.
The regex uses non-capturing groups (which start with (?: ). It starts by matching a space, then one or more digits (\d+), optionally a decimal point (\.) followed by one or more digits (\d+), then another space, and finally one or more word characters and an optional whitespace character.
The global flag should give all results, which must be replaced.
Edit:
Seems you want a comma (,) where the matches are, so you should replace with comma instead.
Edit2:
According to new info:
/(?: (?:\d+(?:\.\d+)?) \w+$)/gm
You now have to specify the multiline option.
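Since the question is tagged python, the final pattern could be applied there along these lines. This is a sketch, assuming the data is held in a single multi-line string; the variable names are illustrative, and re.MULTILINE plays the role of the m flag.

```python
import re

data = """The quick 12 apple
brown8 fox jumped 67 banana
sam 20 ace over 2.5 orange
the13 lazy dog 88.09 grapes"""

# A space, a number (optionally with decimals), a space and a word,
# anchored to the end of each line by `$` together with re.MULTILINE.
trailing = re.compile(r' \d+(?:\.\d+)? \w+$', re.MULTILINE)

cleaned = trailing.sub('', data)
result = ', '.join(cleaned.splitlines())
print(result)
```

Anchoring with `$` is what keeps "20 ace" in the third line from being removed.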

Lookbehind to get the text in R regex [duplicate]

This question already has an answer here:
Get more than 1 quotations in text paragraph in R regex
(1 answer)
Closed 7 years ago.
I have data like this:
Good afternoon. Hello. My bro's name is John... and he said softly 0.8% : "Don't you think I am handsome??" HAHA. jiji. koko.
I would like to get the sentence before the quotation, plus the text inside the quotation, by using a lookbehind regex in R.
First: I want to look for quotation marks in a bunch of text.
Second: Look back and extract 1 sentence before the quotations. If there is no sentence, it's fine. Still extract the text in the quotations.
Below is what I would like to achieve:
My bro's name is John... and he said softly 0.8%: "Don't you think I am handsome??"
I tried using this, but I would like to seek help by using Look Behind regex. Thank you.
regmatches(x, gregexpr('[^\\.]+[\\.\\:]"([^"]*)"', x))
dput :
"Good afternoon. Hello. My bro's name is John... and he said softly 0.8% : \"Don't you think I am handsome?? \" HAHA. jiji. koko."
We can also use gsub. We match one or more characters that are not a ., followed by a . and one or more spaces (\\s+), or one or more spaces followed by one or more non-space characters up to the end of the string ($), and replace the match with ''.
gsub('[^.]+\\.\\s+|\\s+[^ ]+$', '', str1)
#[1] "My bro's name is John... and he said softly 0.8% : \"Don't you think I am handsome?? \""
Or we match one or more characters that are not a ., followed by a . and one or more spaces (\\s+); then we capture the rest of the string up to the closing ", match any remaining characters (.*) to the end of the string, and replace everything with the capture group (\\1).
gsub('^[^.]+\\.\\s+(.*(?:"[^"]+")).*$', '\\1', str1, perl=TRUE)
#[1] "My bro's name is John... and he said softly 0.8% : \"Don't you think I am handsome?? \""

regexpr and only matching prices and not other digits

I'm trying to come up with code that will extract only the price from a line of text.
Motivated by RegEx for Prices?, I came up with the following command:
gregexpr('\\d+(\\.\\d{1,2})', '23434 34.232 asdf 3.12 ')
[[1]]
[1] 7 19
attr(,"match.length")
[1] 5 4
attr(,"useBytes")
[1] TRUE
However, in my case, I would only like 3.12 to match and not 34.232. Any suggestions?
I think this should work:
'\\d+\\.\\d{1,2}(?!\\d)'
Negative lookahead needs perl=TRUE in R (the default regex engine does not support it), so here is an alternative:
\\d+\\.\\d{1,2}(?:[^\\d]|$)
Or: one or more digits, followed by a point, followed by 1 or 2 digits, followed by whitespace or the end of the string:
\\d+\\.\\d{1,2}(\\s|$)
Edit: as per comments, R uses double-escape
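As a quick cross-check outside R, the lookahead version can be tried in Python, where raw strings avoid the double escaping. This is a sketch only; the variable names are illustrative, and the input is the sample line from the question.

```python
import re

line = '23434 34.232 asdf 3.12 '

# One or more digits, a point, then one or two digits that are
# not followed by another digit.
price = re.compile(r'\d+\.\d{1,2}(?!\d)')

prices = price.findall(line)
print(prices)
```

The lookahead rejects 34.232 because every one- or two-digit fraction in it is followed by a further digit.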

remove comma from a digits portion string

How can I remove commas from the numeric parts of a string (the fastest way is preferable) without affecting the other commas in the string? In the example below I want to remove the commas from the number portions, but the comma after dogs should remain (yes, I know the comma placement in 102,345,5 is wrong; I'm just throwing a corner case out there).
What I have:
x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
Desired outcome:
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
Stipulation: it must be done in base R, with no add-on packages.
Thank you in advance.
EDIT:
Thank you Dason, Greg and Dirk. All of your responses worked very well. I was playing with something close to Dason's response but had the comma inside the parentheses. Looking at it now, that doesn't even make sense. I microbenchmarked the responses as I need speed here (text data):
Unit: microseconds
         expr     min      lq  median      uq     max
1  Dason_0to9  14.461  15.395  15.861  16.328  25.191
2 Dason_digit  21.926  23.791  24.258  24.725  65.777
3        Dirk 127.354 128.287 128.754 129.686 154.410
4      Greg_1  18.193  19.126  19.127  19.594  27.990
5      Greg_2 125.021 125.954 126.421 127.353 185.666
+1 to all of you.
You could replace anything with the pattern (comma followed by a number) with the number itself.
x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
gsub(",([[:digit:]])", "\\1", x)
#[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
#or
gsub(",([0-9])", "\\1", x)
#[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
Using Perl regexp, and focusing on "digit comma digit" we then replace with just the digits:
R> x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
R> gsub("(\\d),(\\d)", "\\1\\2", x, perl=TRUE)
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
R>
Here are a couple of options:
> tmp <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
> gsub('([0-9]),([0-9])','\\1\\2', tmp )
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
> gsub('(?<=\\d),(?=\\d)','',tmp, perl=TRUE)
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
>
They both match a digit followed by a comma followed by a digit. The [0-9] and \d (the extra \ escapes the second one so that it makes it through to the regular expression) both match a single digit.
The first expression captures the digit before the comma and the digit after the comma and uses them in the replacement string. Basically pulling them out and putting them back (but not putting the comma back).
The second version uses zero-length matches, the (?<=\\d) says that there needs to be a single digit before the comma in order for it to match, but the digit itself is not part of the match. The (?=\\d) says that there needs to be a digit after the comma in order for it to match, but it is not included in the match. So basically it matches a comma, but only if preceded and followed by a digit. Since only the comma is matched, the replacement string is empty meaning delete the comma.
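The difference between the two versions can be illustrated with a short Python sketch (Python's re module supports the same lookaround syntax). The variable names and the extra "1,2,3" input are assumptions added for illustration and do not come from the question.

```python
import re

x = "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"

# Version 1: capture the digits around the comma and put them back.
with_groups = re.sub(r'(\d),(\d)', r'\1\2', x)

# Version 2: match only the comma, using zero-width lookarounds.
with_lookarounds = re.sub(r'(?<=\d),(?=\d)', '', x)

# The two differ when single digits are separated by commas, because
# version 1 consumes the digit that the next match would need.
groups_edge = re.sub(r'(\d),(\d)', r'\1\2', '1,2,3')
lookaround_edge = re.sub(r'(?<=\d),(?=\d)', '', '1,2,3')

print(with_groups)
print(with_lookarounds)
print(groups_edge, lookaround_edge)
```

On the question's input both versions give the same result; the lookaround form is the safer choice when digits may be separated by single commas.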