Using regular expression to extract repeated phrase in R - regex

I am attempting to locate (and then extract) a repeated phrase using the code below. I need phrases beginning with "approximately" and ending in "closed".
For example "approximately $162.9 million in total assets and $144.5 million in total deposits was closed"
str_locate(x,"(\b[Aa]pproximately\b)(.*)(\b[Cc]losed\b)")
str_extract(x,"(\b[Aa]pproximately\b)(.*)(\b[Cc]losed\b)")
The above code returns NA for phrase start and end points.
Here is a sample of the character vector where the phrases are located (it is a webpage of publicly available FDIC information).
"206-4662).\r\n\r\nDecember \r\n\r\n\r\n Western National Bank, Phoenix, AZ with approximately $162.9 million in total assets and $144.5 million in total deposits was closed. Washington Federal, Seattle, WA has agreed to assume all deposits excluding certain brokered deposits.\r\n(PR-195-2011) \r\n\r\n\r\n\r\n Premier Community Bank of the Emerald Coast, Crestview, FL with approximately $126.0 million in total assets and $112.1 million in total deposits was closed. Summit Bank, N.A., Panama City, FL has agreed to assume all deposits.\r\n(PR-194-2011)"
I may be using regular expressions incorrectly, as I am new to them, so any advice is much appreciated.

Inside an R string literal, \b is the ASCII backspace character. You need to escape the backslashes if you want it to mean "word boundary":
str_locate(x,"(\\b[Aa]pproximately\\b)(.*)(\\b[Cc]losed\\b)")
Also, you don't need the parentheses around your keywords, unless you want to check their capitalization later. And you can match case-insensitively with the (?i) modifier when using the perl() function for your regexes.
Lastly, be aware that .* will not match if there are newlines between approximately and closed (this can be fixed with (?s)), and it may yield unwanted results if more than one pair of keywords is present in the string.
Therefore, you should probably change your regex to
str_locate(x, perl("(?is)\\bapproximately\\b(.*?)\\bclosed\\b"))
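For comparison, here is a minimal Python sketch of the same idea (case-insensitive, dot matching newlines, lazy quantifier). It is not the R code from the answer; the names SAMPLE, PHRASE and extract_phrases are made up for illustration, and SAMPLE is a shortened version of the text from the question.

```python
import re

# Shortened copy of the sample text from the question (assumption: trimmed for brevity).
SAMPLE = (
    "December \r\n\r\n Western National Bank, Phoenix, AZ with approximately "
    "$162.9 million in total assets and $144.5 million in total deposits was closed. "
    "Washington Federal, Seattle, WA has agreed to assume all deposits.\r\n"
    "(PR-195-2011) \r\n\r\n Premier Community Bank of the Emerald Coast, Crestview, FL "
    "with approximately $126.0 million in total assets and $112.1 million in total "
    "deposits was closed. Summit Bank, N.A., Panama City, FL has agreed to assume all deposits."
)

# IGNORECASE plays the role of (?i), DOTALL the role of (?s);
# .*? is lazy so each match stops at the nearest "closed".
PHRASE = re.compile(r"\bapproximately\b.*?\bclosed\b", re.IGNORECASE | re.DOTALL)


def extract_phrases(text):
    """Return every 'approximately ... closed' phrase found in text."""
    return [m.group(0) for m in PHRASE.finditer(text)]


for phrase in extract_phrases(SAMPLE):
    print(phrase)
```

With a greedy .* the single match would run from the first "approximately" to the last "closed", which is the unwanted result the answer warns about.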

Related

Regex to extract UK Currency including £ symbol and Pence (p)

I am fairly new to RegEx and have had a search around online but am unable to find a regex that fits my requirements.
The ultimate aim is to search a string of text and extract the lowest monetary amount. However, as the string may contain more than one £ amount, I'm happy for a regex to just extract all the monetary values it can find; I can then write a calculation to return the lowest amount.
The string may have numbers that are not monetary values / numerous amounts, therefore the regex should always look for a £ symbol first OR it could end with a "p" or "P" to signify pence. For example "I need 2 of these at £10 each and one of those at 50p" - should return 10.00 & 0.50 - I can then calculate that 0.50 is the lowest amount.
As people also write their amounts in various ways, I need the regex to be able to spot different patterns - including the "," for every thousand. All below values should be valid:
£0
£0.00
£0.00p
£0000
£0000.00
£0000.00p
£0,000
£0,000.00
£0,000.00p
0p
Hopefully someone may be able to advise the best way to approach this.
Thanks
This works on your data set:
(?=^£|.*p$)£?\d*(?:,\d{3})*(\.\d{2})?p?
But it may improperly match some edge cases as well because everything is optional...
https://regex101.com/r/WptUn6/3
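The pattern above is anchored with ^ and $, so it appears to target one amount per line. For amounts embedded in free text, as in the question's example sentence, a rough Python sketch could look like the following. The pattern MONEY and the function extract_amounts are assumptions made for illustration, not the answer's regex, and they deliberately require either a leading £ or a trailing p.

```python
import re
from decimal import Decimal

# Alternative 1: "£" followed by digits (optionally with thousands commas and
# two decimals), with an optional trailing "p" that is simply ignored.
# Alternative 2: a bare whole number immediately followed by "p" (pence).
MONEY = re.compile(r"£(\d[\d,]*(?:\.\d{2})?)p?|\b(\d+)p\b", re.IGNORECASE)


def extract_amounts(text):
    """Return all monetary values in text as Decimal pounds."""
    amounts = []
    for pounds, pence in MONEY.findall(text):
        if pounds:
            # Drop thousands separators before converting.
            amounts.append(Decimal(pounds.replace(",", "")))
        else:
            # Pence-only value: convert to pounds.
            amounts.append(Decimal(pence) / 100)
    return amounts


values = extract_amounts("I need 2 of these at £10 each and one of those at 50p")
print(min(values))
```

Plain numbers such as the "2" in the example are skipped because they have neither a £ prefix nor a p suffix. Decimal is used instead of float so that pence values compare exactly.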

How to get a string before the last occurrence of a specific character before a maximum character count?

I have some long but variable-length texts that are divided into sections marked by ********************. I need to post those texts into a field that only accepts 2048 characters, so I will need to divide that text into groups of no more than 2048 characters but which do not contain an incomplete section.
My regex so far is ^([\s\S]{1,2048})([\s\S]{1,2048})([\s\S]{1,2048})
However, this has two problems:
1) It divides the text into groups that can include an incomplete section. What I want is a complete section, even if it is not a full 2048 characters. Assume the example below is at the end of 2048 characters.
Here's my actual result. Notice that the "7 Minute Workout" section is cut off mid-section
********************
Maybe Baby™ Period & Fertility (📱)
Popular app for tracking your periods and predicting times of fertility; recommended; avg 4.5/5 stars (3,500+ ratings); 50% off, $3.99 ↘️ $1.99!
https://example.com/2019/07/29/maybe-baby-period-fertility-7-29-19/
********************
7 Minute Workout: Lose Weight (📱)
Scientifically-proven and featured by the New York Times, a 7-minute high intensity workout proven to lose weig
Here's my desired result. Notice that the "7 Minute Workout" section is entirely omitted because it could not be included in its entirety while staying under the 2048 character limit.
********************
Maybe Baby™ Period & Fertility (📱)
Popular app for tracking your periods and predicting times of fertility; recommended; avg 4.5/5 stars (3,500+ ratings); 50% off, $3.99 ↘️ $1.99!
https://example.com/2019/07/29/maybe-baby-period-fertility-7-29-19/
2) The second problem with this regex is that the text I need to input varies greatly in length; it may be less than 2048 or it could be 10,000+ characters. My regex obviously only works for texts up to 6,144 characters long. Do I just keep duplicating the regex a crazy number of times to get longer than the longest text I could enter, or is there a way to get it to repeat?
Addendum: Several asked about the use case/environment for this question. No, it’s not a spambot 🙂. Rather, I’m trying to use Apple’s Shortcuts app to cross-post items from my website to followers on Kik. Unfortunately, Kik has a 2048 character limit, so I can’t post it all at once. I’m trying to use regex to split the text into appropriate sections so I can copy them from Shortcuts and paste them one at a time into Kik.
A few notes:
There is no need to use groups at all; use the match results directly, since each match represents one section.
Use a lazy quantifier instead of a greedy one by adding ? after {1,2048}, so that the match ends at the nearest section boundary.
In my regex, I used only the global flag g, without the multiline flag m.
The regex below only works with sections of 2048 characters or fewer. A section longer than 2048 characters will be skipped.
The regex uses a positive lookahead to detect the end of the section without consuming it.
Here is the regex:
^|\*[\s\S]{1,2048}?(?=\n\*|$)
Example: https://regex101.com/r/hezvu5/1/
==== Update ====
To make the results greedy, to match as many sections as possible without splitting the last section, use this regex:
^|\*[\s\S]{1,2048}(?=\n\*|$)
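Outside of a regex-only environment, the same goal (as many whole sections as fit under the limit) can be sketched as a simple greedy packing loop. This is a different technique from the regex above, shown in Python only as an illustration; SEPARATOR, split_sections and pack_sections are hypothetical names, and the sketch assumes the limit is counted in Python characters (code points), which may not match how Kik counts emoji.

```python
SEPARATOR = "*" * 20  # the 20-asterisk section marker from the question


def split_sections(text):
    """Split text into sections, each starting with the separator line."""
    parts = text.split(SEPARATOR)
    return [SEPARATOR + part for part in parts if part.strip()]


def pack_sections(text, limit=2048):
    """Group whole sections into chunks of at most `limit` characters."""
    chunks = []
    current = ""
    for section in split_sections(text):
        if len(section) > limit:
            # Unlike the regex, which skips such sections, fail loudly.
            raise ValueError("a single section exceeds the limit")
        if current and len(current) + len(section) > limit:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks


demo = (SEPARATOR + "\n" + "x" * 900 + "\n") * 3
for chunk in pack_sections(demo):
    print(len(chunk))
```

Because the loop runs until the input is exhausted, it handles texts of any length, which addresses the second problem in the question without repeating a group a fixed number of times.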

Regex (Python) data extraction - overlapping or incomplete results

I'm trying to extract data from some WHO codebooks that I've converted from PDF to text with Python slate library.
The text I want to hit starts with 2 digits, dash, 2 digits, followed by some text and ends with "Q"+1 or 2 digits and again "Q"+1 or 2 digits
17-17How old are you?Q1Q1
31-31During the past 30 days, how many times per day did you usually eat fruit, such as bananas, apples, oranges, dates, or any other fruits?Q7Q11
Sometimes those phrases end with a blank, and sometimes the next question starts immediately (here are three questions; observe Q4Q424-29 and Q5Q530-30):
20-23How tall are you without your shoes on? (Note: Data are in meters.)Q4Q424-29How much do you weigh without your shoes on? (Note: Data are in kilograms.)Q5Q530-30During the past 30 days, how often did you go hungry because there was not enough food in your home?Q6Q7
With
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d(\d)*?
I get pretty close, but I'm missing the second digit when the second "Q" has two digits.
I've tried to add a negative lookahead
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d((\d)(?!\d\d-))
to exclude the start of the pattern with two digits and a dash.
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d{1,2}
includes the second digit of the "Q" but generates overlapping results e.g. at Q4Q424-29 where the first string ends with Q4Q42 and the second string starts with 4-29.
The regex with parts of the original sample text is here: https://regex101.com/r/d9Dlga/2/
Any suggestions on how to extract the correct strings, like:
17-17How old are you?Q1Q1
20-23How tall are you without your shoes on? (Note: Data are in meters.)Q4Q4
24-29How much do you weigh without your shoes on? (Note: Data are in kilograms.)Q5Q5
31-31During the past 30 days, how many times per day did you usually eat fruit, such as bananas, apples, oranges, dates, or any other fruits?Q7Q11
Thanks!
I see the problem now. New attempt that I think works:
\d{2}-\d{2}.+?Q\d{1,2}Q\d{1,2}(?!\d-\d{2})
I put a negative lookahead at the end to test if a new section has begun.
9 matches
Correctly grabs the full 2-digit endings
Demo
The following pattern should work:
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d(\d(?!\d-))?
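A small Python check of this last pattern against the run-together sample from the question might look like this. The only change is that the trailing group is made non-capturing, which does not alter what is matched; QUESTION, RUN_TOGETHER and split_questions are names chosen for this sketch.

```python
import re

# Same pattern as above, with the optional trailing group made non-capturing.
# The second digit after the last "Q" is accepted only if it is not followed
# by "digit + dash", i.e. only if it is not the start of the next "NN-NN".
QUESTION = re.compile(
    r"\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d(?:\d(?!\d-))?"
)

# Three questions without separators, copied from the question.
RUN_TOGETHER = (
    "20-23How tall are you without your shoes on? (Note: Data are in meters.)Q4Q4"
    "24-29How much do you weigh without your shoes on? (Note: Data are in kilograms.)Q5Q5"
    "30-30During the past 30 days, how often did you go hungry because there was "
    "not enough food in your home?Q6Q7"
)


def split_questions(text):
    """Return each 'NN-NN ... QnQn' record found in text."""
    return [m.group(0) for m in QUESTION.finditer(text)]


for record in split_questions(RUN_TOGETHER):
    print(record)
```

One ambiguity remains in the data itself: an ending such as Q5Q5 followed directly by 30-30 is textually identical to Q5Q53 followed by 0-30, so the pattern relies on the column range always having two digits before the dash.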

RegEx to clean VISA merchant names (remove random strings)

I am trying to develop a regex (.NET flavor) that I can use to clean VISA merchant names.
Examples:
Norton *AP1223506209 --> Norton *AP
Norton *AP1223511428
EUROWINGS VYJD6J_123001 --> EUROWINGS
EUROWINGS W6PDFI_125626
AER LINGUCB22QKM2 --> AER LINGUCB
AER LINGUCB248L2W
AIR FRANCE JWNCSC --> AIR FRANCE
AIR FRANCE K8L7TT
PAYPAL *AIRBNB HMQXBW --> PAYPAL *AIRBNB
PAYPAL *AIRBNB HMQXNZ
SAS 1174565172360 --> SAS
SAS 1174565172368
I would like to keep the first "name" part, but remove the second "gibberish" part.
The following regex works for Norton and Aer Lingu, as well as for Eurowings and Air France if they contain numbers in the gibberish part. It fails completely for PAYPAL *AIRBNB and other strings that don't contain any numbers in the gibberish part, and also for SAS, probably because the name is too short or there are too many spaces:
Search:
([A-z *-]{2,50}[A-z]{2,50})(.{0,3}([0-9-]{0,3}[A-z *+.#-/]{0,3}){1,10})
Replace:
$1
Is there any way to make this work for gibberish parts that don't contain numbers? I have something like this in mind, but don't manage to create an according RegEx:
Group 1 (to keep)
Must contain consonants and vowels
Can contain few numbers, spaces or punctuation signs (e.g.: "7x7: Taxi Service")
Group 2 (to be removed)
Consists of sequences of numbers, letters and optional punctuation signs
OR: consists of consonants, only
OR: consists of numbers, only
Thanks for any help and best regards
Pesche
Edit:
If I add more examples, Linden's solution still works quite well, but it does not recognize all of the examples, and in some cases it matches too much of the string. I tried to adjust it, but with my limited skills I didn't quite succeed:
https://regex101.com/r/7y9zGl/4
The following problems remain:
with a length of 6 for the last \w, longer patterns would not be matched in full length (e.g. after easyjet and after EMP Merchan). Increasing it, however, causes other strings to be truncated (e.g. AER LINGU, potentially also HOTELS.COM if > 12 was used).
The merchant names after PAYPAL * and GOOGLE * should not be deleted, as they are true merchant names. I tried to exclude strings containing GOOGLE * with a negative lookbehind, but it does not seem to work like that.
Whereas the merchant name after PAYPAL * should generally remain, in some cases it is followed by gibberish, e.g. PAYPAL *AIRBNB HMQXBW. If the negative lookbehind worked, those cases would no longer be cleaned.
if the merchant name is not followed by gibberish, part of the name itself may be deleted (e.g. EMP Merchan)
As the full list of merchant names is long and versatile, the approach to detect "gibberish" should be as generic as possible (i.e. not rely on a certain length of the gibberish part). Hence my original, now slightly modified "pattern":
Consists of sequences of numbers, letters and optional punctuation signs
OR: contains no or very few vowels (EASYJET 000ESJ5TWN -> the gibberish contains only one vowel, EASYJET three of them; PAYPAL *NITSCHKE -> NITSCHKE should not be matched, as it contains two vowels)
OR: consists of numbers, only
Is such a thing even possible? The goal is to use SQL to clean the merchant names. If necessary, this can be done in several run throughs (for different kind of patterns).
Thx again!
Updated regex based on extended sample and desired results:
[\s*<]+\d+$|[\s*<]+(?![A-Z]{6}.*)\w*\d[\w>]*$|\d{6,}$|[\s*<]+[A-Z]{6}$|(?![A-Z]+$)(?<=[A-Z])\w{6}$
Demo
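As an alternative to a single pattern, the vowel/digit heuristic described in the question can be sketched procedurally. The Python below is only an illustration of that idea, not the .NET regex above and not an SQL solution; clean_merchant is a hypothetical helper, "Y" is assumed not to count as a vowel, and only the last whitespace-separated token is ever treated as possible gibberish.

```python
import re

VOWELS = set("aeiouAEIOU")  # assumption: "Y" is treated as a consonant


def clean_merchant(name):
    """Strip a trailing 'gibberish' part from a merchant name (heuristic)."""
    tokens = name.split()
    if len(tokens) < 2:
        return name.strip()
    # Step 1: in the last token, drop everything from the first digit onward.
    last = re.sub(r"\d.*$", "", tokens[-1])
    tokens = tokens[:-1] + ([last] if last else [])
    # Step 2: drop what is left of the last token if it has no vowels,
    # unless it is the only token remaining.
    if len(tokens) > 1 and not (VOWELS & set(tokens[-1])):
        tokens.pop()
    return " ".join(tokens)


for raw in ["Norton *AP1223506209", "AIR FRANCE JWNCSC", "SAS 1174565172360"]:
    print(clean_merchant(raw))
```

The sketch normalises runs of spaces to a single space, and it would wrongly trim a genuine name that ends in a number or in a vowel-less word, so it would need tuning against the full merchant list.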
I cannot validate as I'm only on my phone, but can you try something like this?
^([0-9A-Za-z\*][ ]{0,2})
The idea is to take the digits, the letters (upper and lower case), the star, and at most 2 spaces from the beginning of the line.
Please check the () but I guess the idea is here.
Sorry, it seems wrong when there is no double space.
According to your examples, you want to take all the characters up to 2 spaces or 2 digits.
.* {2}|.*[0-9]{2}
Is it better?
Regards,
Thomas

Removing specific numbers using regex in notepad++

To illustrate my problem with an example, in the following paragraph,
1.If we may believe the Egyptians, Hephaestus was the son of the Nile, and with him philosophy began, priests and prophets being its chief exponents. 2. Hephaestus lived 48,863 years before Alexander of Macedon, and in the interval there occurred 373 solar and 832 lunar eclipses. The date of the Magians, beginning with Zoroaster the Persian, was 5000 years before the fall of Troy, as given by Hermodorus the Platonist in his work on mathematics; but Xanthus the Lydian reckons 6000 years from Zoroaster to the expedition of Xerxes, and after that event he places a long line of Magians in succession, bearing the names of Ostanas, Astrampsychos, Gobryas, and Pazatas, down to the conquest of Persia by Alexander,
I want to remove the "1." and "2.", but not the "373", "832", or any of the other numbers. The document contains much more than just this example, so just removing single digit numbers won't work. I assume this is fairly straightforward, but I'm new to using regex and I've been finding it difficult.
Find \d+\. (with the search mode set to "Regular expression") and replace it with an empty string.
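The effect of that pattern can be checked outside Notepad++ with a short Python sketch. The sample is a shortened version of the paragraph from the question, and strip_numbering is a name made up for this example.

```python
import re

# Shortened version of the paragraph from the question.
SAMPLE = (
    "1.If we may believe the Egyptians, Hephaestus was the son of the Nile. "
    "2. Hephaestus lived 48,863 years before Alexander of Macedon, and in the "
    "interval there occurred 373 solar and 832 lunar eclipses."
)


def strip_numbering(text):
    """Remove every run of digits that is immediately followed by a period."""
    return re.sub(r"\d+\.", "", text)


print(strip_numbering(SAMPLE))
```

Numbers such as 48,863, 373 and 832 survive because none of them is directly followed by a period. A sentence that happens to end in a number (for example "... in 1999.") would lose that number too, so it is worth reviewing the matches before replacing all.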