Regexp exact word match - regex

I need to match words from lines. For example:
The blue bird is dancing.
Yellow card is drawn
The day is perfect rainy
blue bird is eating
The four lines are in a text file l2.
I want to match "blue bird", "Yellow card" and "day", and whenever a matching line is printed, the matched word should be printed before the line.
y=regexp(l2,('^(?=.*blue bird)|(?=.*day)|(?=.*Yellow card)$'));
Is this how it works? I can't get the result.
sprintf('[%s]',y,l2);

MATLAB's regex engine does not use \b as a word-boundary anchor; it uses \< and \> instead.
So your regex would become
y = regexp(l2, '^(?=.*\<(?:blue bird|day|Yellow card)\>).*', 'lineanchors');
assuming that l2 is a multiline string.

Try this reg exp.
(?:blue bird|yellow card|day)
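For comparison, here is a minimal Python sketch of the same idea: find the first listed word in each line and print it in brackets before the line. This is not MATLAB code. It assumes Python's re module, where \b is the word-boundary anchor, and it uses the capitalized "Yellow card" from the sample lines. The names lines, pattern and tag_lines are illustrative only.

```python
import re

# Sample lines from the question (stand-in for the contents of l2).
lines = [
    "The blue bird is dancing.",
    "Yellow card is drawn",
    "The day is perfect rainy",
    "blue bird is eating",
]

# Alternation wrapped in word boundaries so that only whole words match.
pattern = re.compile(r"\b(?:blue bird|Yellow card|day)\b")

def tag_lines(lines):
    """Return '[matched word] line' for every line that contains a match."""
    tagged = []
    for line in lines:
        m = pattern.search(line)
        if m:
            tagged.append(f"[{m.group(0)}] {line}")
    return tagged

for entry in tag_lines(lines):
    print(entry)
```

Passing re.IGNORECASE to re.compile would make the match case insensitive if the capitalization in the file varies.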


Regexp_Replace in HIVE to keep only certain words

I am trying to use regexp_replace in HIVE as a way to only keep certain words of a string.
I am trying to use the following:
select regexp_replace('I AM WANTING BLUE SHOES AND YELLOW STRINGS', '([^BLUE|SHOES|YELLOW|STRINGS])',' ')
But it is giving me "I W NTING BLUE SHOES N YELLOW STRINGS" instead of "BLUE SHOES YELLOW STRINGS"
I have tried using \b \s and any number of things to no avail. Any tips on getting this to work?
I also considered using regexp_replace but for my particular use case there are too many variables for it to be useful.
You can use
select regexp_replace('I AM WANTING BLUE SHOES AND YELLOW STRINGS', '\\s*\\b(?!(?:BLUE|SHOES|YELLOW|STRINGS)\\b)\\w+','')
Details
\s* - 0+ whitespaces
\b - a word boundary
(?!(?:BLUE|SHOES|YELLOW|STRINGS)\b) - a negative lookahead that fails the match if BLUE, SHOES, YELLOW or STRINGS appears as a whole word immediately to the right
\w+ - 1 or more word chars (letters, digits or _)
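To try the pattern outside Hive, here is a minimal Python sketch. It assumes that Python's re module handles this particular pattern the same way as the Java regex engine behind Hive's regexp_replace; the variable names are illustrative. Note that the first kept word retains the whitespace in front of it, so the raw result starts with a space and needs trimming (in Hive, wrapping the call in trim() should do the same).

```python
import re

text = "I AM WANTING BLUE SHOES AND YELLOW STRINGS"

# Remove every word (plus the whitespace before it) that is NOT one of
# the listed whole words. Single backslashes here, since this is a raw
# Python string rather than a Hive string literal.
pattern = r"\s*\b(?!(?:BLUE|SHOES|YELLOW|STRINGS)\b)\w+"

raw = re.sub(pattern, "", text)   # kept words, with a leading space
cleaned = raw.strip()             # trim the leftover whitespace

print(repr(raw))
print(repr(cleaned))
```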

Regex measurement like 100x100

I have some text containing various measurements that I'm trying to extract with regex.
The text can look something like this:
Ipsum Lorem 3. 100x210 cm
Ipsum Lorem Lorem, 100x210 cm
I have got as far as extracting the measurements, but when there is an integer in the middle of the text (as in the first line) my regex fails.
([0-9x]+)(?:\^(-?\d+))?
Gets me
Match 1 : 100x210
Match 2 : 3
Match 3 : 100X210
Any suggestions on how I can skip match 2 and only match INTxINT?
Thanks in advance
The character class [0-9x]+ can also match just xxx or, as in this case, just 3.
The optional group in your pattern could also match 100x210^-2. I'm not sure that is intended, as \^ matches a literal caret.
To match both the lower and uppercase variant of x, you could use a character class [xX] or make the regex case insensitive.
Using word boundaries \b on the left and right:
\b\d+[xX]\d+\b
Or a more specific pattern with a capturing group that also requires the cm part afterwards:
\b(\d+[xX]\d+) cm\b
You may use a regex like
\d+x\d+
It matches one or more digits, an x character, and one or more digits. Note that it does not match an uppercase X unless the regex is made case insensitive.
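The word-boundary patterns above can be checked with a short Python sketch. The sample text reuses the two lines from the question, with an uppercase X in the second line to mirror the reported match; the variable names are illustrative.

```python
import re

text = "Ipsum Lorem 3. 100x210 cm\nIpsum Lorem Lorem, 100X210 cm"

# Digits, an x in either case, digits, all delimited by word boundaries.
# The lone "3" is skipped because no x and second number follow it.
sizes = re.findall(r"\b\d+[xX]\d+\b", text)

# Stricter variant: the measurement must be followed by " cm";
# findall returns only the capturing group.
sizes_with_unit = re.findall(r"\b(\d+[xX]\d+) cm\b", text)

print(sizes)
print(sizes_with_unit)
```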

python regex: how can I get the smallest substring from a certain word to the end of the text?

I'm analyzing a text and I'd like to extract the smallest substring starting from the occurrence of a certain word until the end of the text. My particular problem is that that word can be in several parts of my text.
I've tried the following:
pattern = re.compile('(word)(.*?)$', re.DOTALL)
result = re.search(pattern, MY_TEXT).group()
My problem is that this doesn't result in the smallest possible string being returned, but in the largest string found in the text (i.e: the first occurrence of word until the end of the text, instead of the last occurrence). I was sure that adding the ? character after .* inside the second parenthesis would have solved the problem, but it didn't.
Example input:
text = "Pokémon is a media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures.\nThe franchise began as Pokémon Red and Green (later released outside of Japan as Pokémon Red and Blue)."
word = 'Pokémon'
I'd expect my result to be the string: Pokémon Red and Blue)., but right now I'm getting the whole text as a result.
How can I get what I expect?
Thanks in advance.
Your current pattern (Pokémon)(.*?)$ starts matching at the first occurrence of the word, because the regex engine searches from left to right. The lazy second group is then forced to expand until $ matches at the end of the string, so making it lazy does not help.
To get to the last occurrence, you could use .*Pokémon, as .* will first match until the end of the string and then backtrack until it can fit Pokémon.
The rest of the string is then matched by the following .* and the value is in the first capturing group.
^.*(Pokémon .*)$
To create a more dynamic pattern (re.DOTALL lets . match the newline in the text):
import re
text = "Pokémon is a media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures.\nThe franchise began as Pokémon Red and Green (later released outside of Japan as Pokémon Red and Blue)."
word = "and"
pattern = r"^.*(" + re.escape(word) + ".*)$"
regex = re.compile(pattern, re.DOTALL)
result = re.search(regex, text).group(1)
print(result)
Result
and Blue).
If the word can also be the last word in the text, you can use the negative lookahead (?!\S) to assert that it is not followed by a non-whitespace character.
^.*(Pokémon(?!\S).*)$
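I'm guessing that you wish to extract everything from the last instance of Pokémon to the end of the input string, which this expression is likely to do:
^.*(Pokémon.*)$
Since the example text contains a newline, compile it with re.DOTALL in Python so that . also matches the newline.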
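To make the difference between the lazy and the greedy approach concrete, here is a minimal Python sketch using the text from the question. The helper names tail_lazy, tail_greedy and tail_rfind are made up for illustration. The last one is a plain-string alternative using str.rfind, which is a different technique from the regex answers and is included only for comparison.

```python
import re

text = ("Pokémon is a media franchise managed by The Pokémon Company, "
        "a Japanese consortium between Nintendo, Game Freak, and Creatures.\n"
        "The franchise began as Pokémon Red and Green "
        "(later released outside of Japan as Pokémon Red and Blue).")

def tail_lazy(word, text):
    """The question's approach: starts at the FIRST occurrence."""
    m = re.search("(" + re.escape(word) + ")(.*?)$", text, re.DOTALL)
    return m.group(0) if m else None

def tail_greedy(word, text):
    """Greedy .* backtracks from the end, so group 1 starts at the LAST occurrence."""
    m = re.search(r"^.*(" + re.escape(word) + r".*)$", text, re.DOTALL)
    return m.group(1) if m else None

def tail_rfind(word, text):
    """Non-regex alternative: slice from the last index of the word."""
    i = text.rfind(word)
    return text[i:] if i != -1 else None

print(tail_greedy("Pokémon", text))
```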

Match the nth word in a line

In the app I use, I cannot select match group 1.
The only result I can use is the full match of a regex,
but I need the 5th word "jumps" as the match result and not the complete match "The quick brown fox jumps".
^(?:[^ ]*\ ){4}([^ ]*)
The quick brown fox jumps over the lazy dog
Here is a link https://regex101.com/r/nB9yD9/6
Since you need the entire match to be only the n-th word, you can try to use 'positive lookbehind', which allows you to only match something, if it is preceded by something else.
To match only the fifth word, you want to match the first word that has four words before it.
To match four words (i.e. word characters followed by a space character):
(\w+\s){4}
To match a single word, but only if it was preceded by four other words:
(?<=(\w+\s){4})(\w+)
Test the result here https://regex101.com/r/QIPEkm/1
To find the 3rd word of a sentence, use:
^(?:\w+ ){2}\K\w+
Explanation:
^ # beginning of line
(?: # start non capture group
\w+ # 1 or more word character
# a space
){2} # group must appear twice (change {2} to {3} to get the 4th word and so on)
\K # forget all we have seen until this position
\w+ # 1 or more word character
It works with PCRE: https://regex101.com/r/pR22LK/2. Your app doesn't seem to support it, but I don't know how your app works. I think you would have to extract all the words into an array and then select the ones you want. – Toto
Hello Toto, your solution works in the app too, like PCRE, thanks! – gsxr1300
To match "the first" four words (i.e. word characters followed by a space character):
^(\w+\s){4}
To match a single word, but only if it was preceded by "the first" four other words:
(?<=^(\w+\s){4})(\w+)
Note the added ^ anchor.
If you want to know what "?<=" means, check this:
https://stackoverflow.com/a/2973495/11280142
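For readers who want to check the "n-th word" logic in Python: the built-in re module supports neither \K nor variable-length lookbehind, so neither of the full-match patterns above can be used there as written. The sketch below therefore falls back to a capture group, which is the approach the question wanted to avoid, so it only illustrates the counting logic. The function name nth_word is hypothetical.

```python
import re

def nth_word(line, n):
    """Return the n-th word (1-based) of a line, or None if there are fewer words.

    Skips n-1 'word + space' units from the start of the line, then
    captures the next run of word characters.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    pattern = r"^(?:\w+ ){%d}(\w+)" % (n - 1)
    m = re.search(pattern, line)
    return m.group(1) if m else None

sentence = "The quick brown fox jumps over the lazy dog"
print(nth_word(sentence, 5))
```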

Actionscript Regex to determine full stop

Another actionscript problem. I need to extract the first sentence from a block of text but if the first sentence does not contain more than 80 characters then I need to extract the first and second sentence.
The example code below is an attempt to find the sentences without getting confused by the other periods/full stops in the example text.
I have this test code:
import flash.text.TextField;
var descr:String =
"The temperature was 32.8 degrees Celsius. His B.Sc. degree was deemed insufficient. the Dr. owed B. the bank USD 4000.50 which he had not paid back. On 27.07.2004 a major earthquake occurred. It was 17.05 by the clock.";
var array:Array;
array = descr.split(/\s[a-zA-Z]{3,30}\.\s/);
trace(descr); // print the original above the output for checking
trace(array+"\n"+array.length); // output
Any suggestions would be appreciated. Will check back when I get up.
Thanks
You could try using a lazy quantifier of the form {m,n}? and a positive lookahead to make sure that the period is one which matches at the end of the sentence:
^.{0,79}?(?=\.(?:$| [A-Z]))\..+?(?=\.(?:$| [A-Z]))\.|^.{80,}?(?=\.(?:$| [A-Z]))\.
The regex has two parts:
^.{0,79}?(?=\.(?:$| [A-Z]))\..+?(?=\.(?:$| [A-Z]))\.
This matches the first two sentences if the first sentence is shorter than 80 characters.
^.{80,}?(?=\.(?:$| [A-Z]))\.
This matches only the first sentence (when the first part fails, that is, when the first sentence is 80 characters or longer).
^ matches the beginning of the string.
.{0,79}? matches at most 79 characters and will stop at the closest sentence period.
.{80,}? matches at least 80 characters and will stop at the closest sentence period.
.+? is for the second sentence and can contain any number of characters.
(?=\.(?:$| [A-Z])) is a positive lookahead which matches a period that is either at the end of the string (\.$) OR, a period followed by a whitespace and a capital letter (\. [A-Z]).
Then match the period with \..
NOTE: This is a regex to match and not split.
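The two-part pattern can be exercised with a minimal Python sketch rather than ActionScript. It assumes that Python's re module treats these constructs the same way for single-line input. The sample texts are made up from sentences in the question, and the names SENTENCE_END, PATTERN and leading_sentences are illustrative.

```python
import re

# A period counts as a sentence end only at the end of the string or
# when followed by a space and a capital letter.
SENTENCE_END = r"(?=\.(?:$| [A-Z]))\."

PATTERN = re.compile(
    r"^.{0,79}?" + SENTENCE_END + r".+?" + SENTENCE_END   # short first sentence: take two
    + r"|^.{80,}?" + SENTENCE_END                         # long first sentence: take one
)

def leading_sentences(text):
    """Return the first sentence, or the first two if the first is short."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

short_first = ("It was 17.05 by the clock. On 27.07.2004 a major "
               "earthquake occurred. More text here.")
long_first = ("The bank USD 4000.50 loan which he had not paid back was finally "
              "settled by his B.Sc. colleague in May. Then it rained.")

print(leading_sentences(short_first))
print(leading_sentences(long_first))
```

As the lookahead only accepts a capital letter after the period, a sentence that starts in lowercase (such as "the Dr. owed ..." in the question's text) is treated as a continuation of the previous sentence.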