Regex select words longer than 4 characters but only one instance if duplicates - regex

I am trying to format text in InDesign using GREP Style.
The goal is to select words longer then 4 letters in a paragraph but if the word has been duplicated in a paragraph it should not select more then first instance of this word.
This is sample text:
"The Lord's right hand is lifted high; the Lord's right hand has done mighty things!"
The solution should give
Lord right hand lifted high done mighty things
i have done the first part
[[:word:]]{4,}
but don't have a clue how to deal with those duplicates.

Is order a requirement? If not, how about words longer than 4 characters not followed by that same word later in the text? See:
([[:word:]]{4,})(?!.*\1)
https://regex101.com/r/Ug4dLZ/1
Result: lifted high Lord right hand done many things
To be more comprehensive, include word breaks (i.e. count "Person" and "Personhood" as 2 separate words):
([[:word:]]{4,})(?!.*\b\1\b)

Related

Regex - How can you identify strings which are not words?

Got an interesting one, and can't come up with any solid ideas, so thought maybe someone else may have done something similar.
I want to be able to identify strings of letters in a longer sentence that are not words and remove them. Essentially things like kuashdixbkjshakd
Everything annoyingly is in lowercase which makes it more difficult, but since I only care about English, I'm essentially looking for the opposite of consonant clusters, groups of them that don't make phonetically pronounceable sounds.
Has anyone heard of/done something like this before?
EDIT: this is what ChatGpt tells me
It is difficult to provide a comprehensive list of combinations of consonants that have never appeared in a word in the English language. The English language is a dynamic and evolving language, and new words are being created all the time. Additionally, there are many regional and dialectal variations of the language, which can result in different sets of words being used in different parts of the world.
It is also worth noting that the frequency of use of a particular combination of consonants in the English language is difficult to quantify, as the existing literature on the subject is limited. The best way to determine the frequency of use of a particular combination of consonants would be to analyze a large corpus of written or spoken English.
In general, most combinations of consonants are used in some words in the English language, but some combinations of consonants may be relatively rare. Some examples of relatively rare combinations of consonants in English include "xh", "xw", "ckq", and "cqu". However, it is still possible that some words with these combinations of consonants exist.
You could try to pass every single word inside the sentence to a function that checks wether the word is listed inside a dictionary. There is a good number of dictionary text files on GitHub. To speed up the process: use a hash map :)
You could also use an auto-corretion API or a library.
Algorithm to combine both methods:
Run sentence through auto correction
Run every word through dictionary
Delete words that aren't listed in the dictionary
This could remove typos and words that are non-existent.
You could train a simple model on sequences of characters which are permitted in the language(s) you want to support, and then flag any which contain sequences which are not in the training data.
The LangId language detector in SpamAssassin implements the Cavnar & Trenkle language-identification algorithm which basically uses a sliding window over the text and examines the adjacent 1 to 5 characters at each position. So from the training data "abracadabra" you would get
a 5
ab 2
abr 2
abra 2
abrac 1
b 2
br 2
bra 2
brac 1
braca 1
:
With enough data, you could build a model which identifies unusual patterns (my suggestion would be to try a window size of 3 or smaller for a start, and train it on several human languages from, say, Wikipedia) but it's hard to predict how precise exactly this will be.
SpamAssassin is written in Perl and it should not be hard to extract the language identification module.
As an alternative, there is a library called libtextcat which you can run standalone from C code if you like. The language identification in LibreOffice uses a fork which they adapted to use Unicode specifically, I believe (though it's been a while since I last looked at that).
Following Cavnar & Trenkle, all of these truncate the collected data to a few hundred patterns; you would probably want to extend this to cover up to all the 3-grams you find in your training data at least.
Perhaps see also Gertjan van Noord's link collection: https://www.let.rug.nl/vannoord/TextCat/
Depending on your test data, you could still get false positives e.g. on peculiar Internet domain names and long abbreviations. Tweak the limits for what you want to flag - I would think that GmbH should be okay even if you didn't train on German, but something like 7 or more letters long should probably be flagged and manually inspected.
This will match words with more than 5 consonants (you probably want "y" to not be considered a consonant, but it's up to you):
\b[a-z]*[b-z&&[^aeiouy]]{6}[a-z]*\b
See live demo.
5 was chosen because I believe witchcraft has the longest chain of consonants of any English word. You could dial back "6" in the regex to say 5 or even 4 if you don't mind matching some outliers.

Regex (Python) data extraction - overlapping or incomplete results

I'm trying to extract data from some WHO codebooks that I've converted from PDF to text with Python slate library.
The text I want to hit starts with 2 digits, dash, 2 digits, followed by some text and ends with "Q"+1 or 2 digits and again "Q"+1 or 2 digits
17-17How old are you?Q1Q1
31-31During the past 30 days, how many times per day did you usually eat fruit, such as bananas, apples, oranges, dates, or any other fruits?Q7Q11
Sometimes those phrases end with a blank, sometimes the next questions starts immediately (here are three question), observe Q4Q424-29 and Q5Q530-30
20-23How tall are you without your shoes on? (Note: Data are in meters.)Q4Q424-29How much do you weigh without your shoes on? (Note: Data are in kilograms.)Q5Q530-30During the past 30 days, how often did you go hungry because there was not enough food in your home?Q6Q7
With
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d(\d)*?
I get pretty close, but I'm missing the second digit when the second "Q" has two digits.
I've tried to add a negative lookahead
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d((\d)(?!\d\d-))
to exclude the start of the pattern with two digits and a dash.
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d{1,2}
includes the second digit of the "Q" but generates overlapping results e.g. at Q4Q424-29 where the first string ends with Q4Q42 and the second string starts with 4-29.
The regex with parts of the original sample text is here: https://regex101.com/r/d9Dlga/2/
Any suggestions who to extract the correct strings like:
17-17How old are you?Q1Q1
20-23How tall are you without your shoes on? (Note: Data are in meters.)Q4Q4
24-29How much do you weigh without your shoes on? (Note: Data are in kilograms.)Q5Q5
31-31During the past 30 days, how many times per day did you usually eat fruit, such as bananas, apples, oranges, dates, or any other fruits?Q7Q11
Thanks!
I see the problem now. New attempt that I think works:
\d{2}-\d{2}.+?Q\d{1,2}Q\d{1,2}(?!\d-\d{2})
I put a negative lookahead at the end to test if a new section has begun.
9 matches
Correctly grabs the full 2-digit endings
Demo
The following pattern should work:
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d(\d(?!\d-))?

how to find a specific word having random located newline

As I stated on the title.
I'm try to find regex result on a specific word(like apple) having random newline(\r\n) special character.
Illustrate more detail...
Let's find a word 'apple' on the text file. but We don't know where is exact position of newline(\r\n) on the file like below...
ap
ple
or
appl
e
I also googled many pages but I couldn't find the answer.
Should I have to write beginner regex like below?
(a\r\npple|ap\r\nple|app\r\nle|appl\r\ne|apple\r\n|)
I need to find more smarter regex to find exact word.
updated.
the word can be vary like "ripe apple", "rotten apple" and "brightapple".
In the case of third item, white space removed by writer.
updated
i have many txt files. i have to find the string within those.
So remove /r/n is not useful and cannot handle(too much menory and time required).
You have to take the naive approach ("beginner regex") if you want to use regular expressions, since they belong to the type 3 grammars and cannot express the state needed (see also The difference between Chomsky type 3 and Chomsky type 2 grammar)

Using RegEx to Find a Block of Text

I'm attempting to block a long string of unnecessary text that's on every page of a document.
Ex: "36075 This is another page and this is the date March 4 2013"
I know this must be very simple, but I'm hoping there is a way to block text verbatim. Is the only way to block this text by using a lot of /d/s/w+/+ etc or is there is a way to say, "match 36075 This is another page and this is the date March 4 2013".
This would be SO HELPFUL to know. Thank you for helping!
From what you wrote I assume you need to get leading numbers from string, to do it you just need to use this pattern: ^\d+ which from this input:
36075 This is another page and this is the date March 4 2013
will return this:
36075
For future, in case of such questions please provide example string and expected output. As well as what you have tried.
I realized the issue I was having. I didn't need to use RegEx. The program I was using has the functionality to match specific words or groups of words and pronounce them differently. What I discovered is that it will not match the words unless the word groups are input exactly the way the program typically reads them.
Ergo --> The channel saw
the end of the British hold over
Would have to be listed as one group for, "The channel saw" and a second group for "the end of the British hold over"
In addition, there were some numbers --> 11960_30_o_ho_
and if the program naturally read 119 and then 60_3 and then _o_ho_ then three strings would need to be input for each section.
A few frustrating hours later, problem solved :) Thank you for your assistance.

Regular Expression - Matching and extracting complicated conditions

I'm trying to write a regular expression that will match these conditions:
Maximum of 8000 characters (any characters, including "\r\n")
Maximum of 10 lines (separated by \r\n).
to extract from the matched text only the first 4 lines.
Can't find a good way do it...:/
Thanks!!
Regular expressions are not what you need. They are used to match a certain pattern, not a certain length. If you are holding the data in a string, myString.length <= 8000 is all you need for the character count (using the correct syntax for your language, of course). For the number of lines, you will have to count the number of \r\n sequences in your string (can be done iteratively). To get the first four lines, just find the 4th \r\n and get everything before that with a substring method.
Description
This expression does the following:
validates the input string is between zero and 8,000 characters
validates there are at most 10 line of new line delimited text
then captures the first 4 new line delimited lines of text
\A(?=.{0,8000}\Z)(?=(?:^.*?(?:\r|\n|\Z)){0,10}\Z)(?:^.*?[\r\n\Z]+){0,4} This requires options: m multiline, and s dot matches all characters
Expanded
\A anchor to the begining of the string, this anchor allows the use of the s option which allows the . to match new line and line feed characters
(?=.{0,8000}\Z) look ahead and validate there are between zero and 8000 characters
(?=(?:^.*?(?:\r|\n|\Z)){0,10}\Z) look ahead and validate there are no more then 10 new line delimited lines
(?:^.*?[\r\n\Z]+){0,4} match the first 4 lines of text
PHP Code Example:
You didn't specify a language so I'm including this PHP example to show how it works and the sample output.
Input Text
This input test is 8 lines of new line delimited strings. There are only 1779 characters here.
Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small
river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about
the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were
thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of
the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then
she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word "and"
and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe
and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her.
Code
<?php
$sourcestring="your source string";
preg_match('/\A(?=.{0,8000}\Z)(?=(?:^.*?(?:\r|\n|\Z)){0,10}\Z)(?:^.*?[\r|\n\Z]+){0,4}/ims',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
Matches
$matches Array:
(
[0] => Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small
river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about
the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were
thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of
)