Regex to match sentences with jumbled words but preserving sentence order - regex

I want to match sentences in such a way that the words within each sentence can be in any order, but the sentences themselves must stay in the same order.
e.g.
My name is Sam. I love regex.
Acceptable input:
My Sam is name. regex I love.
name is My Sam. I regex love.
Invalid input:
I love regex. My name is Sam.
regex I love. is My name Sam.
Sample regex I have come up with so far to solve the above problem:
^((?=.*\bMy\b)(?=.*\bSam\b)(?=.*\bis\b)(?=.*\bname\b))((?=.*\bregex\b)(?=.*\bI\b)(?=.*\blove\b)).*$
Which is not working as expected.
Can this problem be solved by regex? What would be the recommended approach to solve this?
Note: Please ignore the "." characters. I am using them just for clarity.

I think you are looking for something other than regex. The most efficient way to do this would be to compare against an array of expected words and check that each one appears exactly once in a sentence. The details depend on the context you are working in. If you need a regex that literally matches what you stated in your example, you could use something like this:
/(My|name|is|Sam) (My|name|is|Sam) (My|name|is|Sam) (My|name|is|Sam)\. (I|love|regex) (I|love|regex) (I|love|regex)\./g
But as you can see, this regex grows quickly as your sentences get longer, and it does not stop a word from being repeated (it would also accept "My My My My."). Also, it's really inefficient compared to parsing the text with something else.
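The word-comparison idea could be sketched in Python roughly as follows. This is only an illustration: the function names `words` and `matches_jumbled` are made up here, and it assumes the expected sentences are known in advance and that the candidate text has the same number of words per sentence, so it does not rely on "." as a delimiter.

```python
import re
from collections import Counter


def words(text):
    # Tokenise on word characters, so "." and other punctuation are ignored.
    return re.findall(r"\w+", text)


def matches_jumbled(expected_sentences, candidate):
    """Return True if candidate has the expected sentences in order,
    with the words inside each sentence in any order."""
    tokens = words(candidate)
    pos = 0
    for sentence in expected_sentences:
        expected = words(sentence)
        chunk = tokens[pos:pos + len(expected)]
        # Counter compares the words as a multiset: order is ignored,
        # but every word must appear the right number of times.
        if Counter(chunk) != Counter(expected):
            return False
        pos += len(expected)
    # Reject trailing words that belong to no expected sentence.
    return pos == len(tokens)


expected = ["My name is Sam", "I love regex"]
print(matches_jumbled(expected, "My Sam is name. regex I love."))  # True
print(matches_jumbled(expected, "I love regex. My name is Sam."))  # False
```

Because each chunk is compared as a multiset, a repeated word such as "My My My My" is rejected, which the alternation regex above cannot do.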

I couldn't achieve this with a single regex. Instead, I did the following:
Virtually divided the text into multiple sentence blocks.
Maintained a sentence block -> regex configuration.
The regex configuration depends on the rule applicable to that sentence block.
Applied each regex to the text to identify whether its block exists.
Finally, verified that the blocks appear in the configured order.
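The steps above might look something like the following Python sketch. The actual configuration and rules are not shown in this answer, so `block_pattern` and `blocks_in_order` are hypothetical names, and the per-block rule used here (every listed word must occur once within a run of that many words) is an assumption chosen to fit the question's example.

```python
import re


def block_pattern(block_words):
    # Hypothetical per-block rule: each lookahead asserts that one word
    # occurs within the next n words, then n words are consumed.
    n = len(block_words)
    lookaheads = "".join(
        r"(?=(?:\w+\W+){0,%d}%s\b)" % (n - 1, re.escape(word))
        for word in block_words
    )
    return re.compile(r"\b" + lookaheads + r"(?:\w+\W*){%d}" % n)


def blocks_in_order(blocks, text):
    pos = 0
    for block_words in blocks:
        # Search only after the previous block, which enforces the order.
        match = block_pattern(block_words).search(text, pos)
        if match is None:
            return False
        pos = match.end()
    return True


blocks = [["My", "name", "is", "Sam"], ["I", "love", "regex"]]
print(blocks_in_order(blocks, "name is My Sam. I regex love."))  # True
print(blocks_in_order(blocks, "regex I love. is My name Sam."))  # False
```

This sketch only checks that the blocks exist and appear in order. It does not reject extra text between blocks, and it assumes the words within a block are distinct.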

Related

Matching within matches by extending an existing Regex

I'm trying to see if it's possible to extend an existing arbitrary regex by prepending or appending another regex to match within matches.
Take the following example:
The original regex is cat|car|bat so matching output is
cat
car
bat
I want to add to this regex and output only matches that start with 'ca',
cat
car
I specifically don't want to parse the whole regex (which could be quite a long operation) and then change its internal content to produce the output, as in:
^ca[tr]
or run the original regex and then the second one over the results. I'm taking the original regex as an argument in Python but want to 'prefilter' the matches by adding the additional code.
This is probably a slight abuse of regex, but I'm still interested if it's possible. I have tried what I know of subgroups and the following examples but they're not giving me what I need.
Things I've tried:
^ca(cat|car|bat)
(?<=ca(cat|car|bat))
(?<=^ca(cat|car|bat))
It may not be possible but I'm interested in what any regex gurus think. I'm also interested if there is some way of doing this positionally if the length of the initial output is known.
A slightly more realistic example of the initial query might be [a-z]{4}, but if I create (?<=^ca([a-z]{4})) it matches against 6-letter strings starting with ca, not 4-letter ones.
Thanks for any solutions and/or opinions on it.
EDIT: See solution including @Nick's contribution below. The tool I was testing this with (exrex) seems to have a slight bug that, following the examples given, would create matches 6 characters long.
You were not far off with what you tried, only you don't need a lookbehind, but rather a lookahead assertion, and a parenthesis was misplaced. The right thing is: Put the original pattern in parentheses, and prepend (?=ca):
(?=ca)(cat|car|bat)
(?=ca)([a-z]{4})
In the second example (without | alternative), the parentheses around the original pattern wouldn't be required.
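Since the question mentions taking the original regex as an argument in Python, the lookahead approach could be wrapped up roughly like this. It is a minimal sketch: `prefilter` is a made-up helper name, and it uses a non-capturing group `(?:...)` instead of the capturing parentheses shown above so that any group numbers in the original pattern stay unchanged.

```python
import re


def prefilter(original, prefix):
    # (?=prefix) asserts the prefix without consuming it, and the
    # non-capturing group keeps an alternation like cat|car|bat contained.
    return re.compile(r"(?=%s)(?:%s)" % (re.escape(prefix), original))


pattern = prefilter(r"cat|car|bat", "ca")
print([w for w in ["cat", "car", "bat"] if pattern.fullmatch(w)])
# ['cat', 'car']

four_letters = prefilter(r"[a-z]{4}", "ca")
print(bool(four_letters.fullmatch("cars")))    # True
print(bool(four_letters.fullmatch("cabins")))  # False: 6 letters, not 4
```

Because the lookahead is zero-width, the prefix is not added to the length of the match, so the four-letter pattern still only matches four-letter strings.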
OK, thanks to @Armali I've come to the conclusion that (?=ca)(^[a-z]{4}$) works (see https://regexr.com/3f4vo). However, I'm trying this with the great exrex tool to attempt to produce matching strings, and it's producing matches that are 6 characters long rather than 4. This may be a limitation of exrex rather than the regex, which seems to work in other cases.
See @Nick's comment.
I've also raised an issue on the exrex GitHub for this.

What is the proper way to check if a string contains a set of words in regex?

I have a string, let's say, jkdfkskjak some random string containing a desired word
I want to check if the given string has a word from a set of words, say {word1, word2, word3} in latex.
I can easily do it in Java, but I want to achieve it using regex. I am very new to regular expressions.
If you only want to recognise the words, even when they are part of a larger word, then use:
(word1|word2|...|wordn)
(see first demo)
If you want them to appear as isolated words, then
\b(word1|word2|...|wordn)\b
should be the answer (see second demo)
I don't fully understand the context, such as what kind of text you have or what kind of words these will be, but I can offer an easy solution. You can generate a regex such as (dormammu|bargain) programmatically and then search for it in text like "dormammu I come to bargain". I have no clue about LaTeX, but I think that is not your question.
For more information you can tinker with it at [regex101][1].
If you are having trouble understanding it, [regexone][2] is the place to go. For beginners it's a good start.
[1]: http://regex101.com [2]: https://regexone.com/
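Generating the alternation programmatically could look like the Python sketch below. The helper name `word_set_pattern` and its `whole_words` flag are invented for this illustration. `re.escape` is used in case a word contains regex metacharacters, and longer words are listed first so that a shorter word cannot shadow a longer one that starts the same way.

```python
import re


def word_set_pattern(words, whole_words=True):
    # Escape each word and join them into one alternation.
    ordered = sorted(words, key=len, reverse=True)
    body = "(?:%s)" % "|".join(re.escape(w) for w in ordered)
    # \b on both sides restricts matches to isolated words.
    return re.compile(r"\b%s\b" % body if whole_words else body)


pattern = word_set_pattern({"dormammu", "bargain"})
print(pattern.findall("dormammu I come to bargain"))
# ['dormammu', 'bargain']
print(bool(pattern.search("no such words here")))  # False
```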

Negating a regex query

I have looked at multiple posts about this, and am still having issues.
I am attempting to write a regex query that finds the names of S3 buckets that do not follow the naming scheme we want. The scheme we want is as follows:
test-bucket-logs**-us-east-1**
The bolded part is optional. Meaning, the following two are valid bucket names:
test-bucket-logs
test-bucket-logs-us-east-1
Now, what I want to do is negate this. So I want to catch all buckets that do not follow the scheme above. I have successfully formed a query that will match for the naming scheme, but am having issues forming one that negates it. The regex is below:
^(.*-bucket-logs)(-[a-z]{2}-[a-z]{4,}-\d)?$
So some more valid bucket names:
example-bucket-logs-ap-northeast-1
something-bucket-logs-eu-central-1
Invalid bucket names (we want to match these):
Iscrewedthepooch
test-bucket-logs-us-ee
bucket-logs-us-east-1
Thank you for the help.
As Mr. Barmar said, probably the best approach in these circumstances is to solve it programmatically. You could write the usual regex for matching the right pattern, and exclude the matching names from the collection.
But you can try this:
^(?:.(?!-bucket-logs-[a-z]{2}-[a-z]{4,}-\d|-bucket-logs$))*$
which is a typical solution using a negative lookahead, (?!...), a zero-width assertion that does not capture anything. Basically, it states that you want every line in which no character is followed by the valid pattern.
EDITED
As Ibrahim pointed out (thank you!), there was a little issue with my first regex. I fixed it and I think it is OK now. I had forgotten to make the last part of the inner regex optional (?).
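The programmatic approach could be sketched in Python as below, reusing the matching regex from the question unchanged. For comparison, the sketch also includes a single-regex variant that puts one negative lookahead at the start of the string. That variant is an alternative written for this example, not the regex given above, and the names `VALID`, `INVALID` and `invalid_buckets` are made up.

```python
import re

# The question's regex for names that follow the scheme.
VALID = re.compile(r"^(.*-bucket-logs)(-[a-z]{2}-[a-z]{4,}-\d)?$")

# Alternative single regex: fail if the whole name fits the scheme.
INVALID = re.compile(r"^(?!.*-bucket-logs(?:-[a-z]{2}-[a-z]{4,}-\d)?$).*$")


def invalid_buckets(names):
    # Keep only the names the valid-name regex does NOT match.
    return [name for name in names if not VALID.match(name)]


names = [
    "test-bucket-logs",
    "example-bucket-logs-ap-northeast-1",
    "Iscrewedthepooch",
    "test-bucket-logs-us-ee",
    "bucket-logs-us-east-1",
]
print(invalid_buckets(names))
# ['Iscrewedthepooch', 'test-bucket-logs-us-ee', 'bucket-logs-us-east-1']
```

Filtering in code keeps the regex readable, while the lookahead form is useful when the tool only accepts a single pattern.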

Re: Regex Question (Help)

I am new here. I just discovered this tool "Everything Search Engine". It allows the use of regex in the search. I posted in their forum for a bit help here http://forum.voidtools.com/viewtopic.php?f=5&t=1343. This section explains how regex can be used in the tool http://www.voidtools.com/faq.php#How_do_I_use_regex.
The question I am asking is:
What is the correct regex to use in the search to obtain the desired results described below?
For example, I am searching for the word "dog" in the file name. And it returns a result "Pseudogout" which is a file name.
Notice the word "dog" is inside the word "Pseudogout".
How do I use the regex to eliminate such results?
I would appreciate some help here.
Thanks.
Use anchors: ^dog$. The caret matches the beginning of a string, and the dollar sign, the end.
If you want to match the string "dog" inside another string, but not inside another word, you might be able to use something like ^dog$|^dog[^A-Za-z]|[^A-Za-z]dog[^A-Za-z]|[^A-Za-z]dog$ but that is obviously somewhat cumbersome to type and use.
(This subsumes some information from the comments below.)
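For illustration, here is the same "not inside another word" idea written with lookarounds and checked in Python. Whether the search tool's regex engine supports lookarounds has not been verified here, so treat this as a sketch of the logic rather than something to paste into the tool. The name `WHOLE_WORD_DOG` is made up.

```python
import re

# "dog" with no ASCII letter directly before or after it. This accepts
# the same strings as the four-way alternation above, in a shorter form.
WHOLE_WORD_DOG = re.compile(r"(?<![A-Za-z])dog(?![A-Za-z])")

for name in ["dog", "my dog.txt", "Pseudogout", "hotdog", "dog_2.jpg"]:
    print(name, bool(WHOLE_WORD_DOG.search(name)))
# dog True
# my dog.txt True
# Pseudogout False
# hotdog False
# dog_2.jpg True
```

Note that `\bdog\b` is similar but not identical, because `\b` treats digits and underscores as word characters, so it would not match "dog_2.jpg".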

Regex, how to select all items outside of selection group

I'm a Regex noob and am pretty sure I'm not going about this in the most efficient way - wanted to get some advice.
I have a Regex expression ((\w+\b.*?){100}){1} which selects the first 100 words of my string, the length of which varies.
What I want is to select the entire string except for the first 100 words.
Is there syntax I can add to my current expression to do this, or am I better off trying to directly select the rest of the text instead?
Also, if anyone has any good resources for improving my Regex knowledge, I'd be very appreciative. Thus far I've found http://gskinner.com/RegExr/ to be very helpful.
Thanks in advance!
If you use this, you can refer to everything after the first 100 words as group 3, written as $3.
This one will treat hyphenated words as one word.
(\w+(-\w+|\b).*?){100}(.*)
Regex training Here
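A quick way to try the group-3 idea in Python is sketched below, with the word count turned into a parameter so it can be tested on a short string. The function name `text_after_first_words` is made up, and `re.DOTALL` is an addition here so that `.` also crosses line breaks in multi-line text.

```python
import re


def text_after_first_words(text, n):
    # Same shape as the pattern above, with {100} replaced by {n}.
    pattern = re.compile(r"(\w+(-\w+|\b).*?){%d}(.*)" % n, re.DOTALL)
    match = pattern.search(text)
    # Group 3 is everything after the first n words; None if the text
    # has fewer than n words.
    return match.group(3) if match else None


rest = text_after_first_words("one two-three four five six", 3)
print(rest.strip())  # five six
```

Group 3 starts right after the last counted word, so it usually begins with the separating whitespace, hence the `strip()` in the usage line.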