What is the proper way to check if a string contains a set of words in regex? - regex

I have a string, let's say, jkdfkskjak some random string containing a desired word
I want to check if the given string has a word from a set of words, say {word1, word2, word3} in latex.
I can easily do it in Java, but I want to achieve it using regex. I am very new to regular expressions.

If you only need to recognise the words anywhere in the text, even as part of a longer word, then use:
(word1|word2|...|wordn)
(see first demo)
If you want them to appear as isolated words, then
\b(word1|word2|...|wordn)\b
should be the answer (see second demo)
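A minimal sketch of the difference between the two patterns, assuming Python's re module (the question does not name a regex engine). The words cat and dog and the sample strings are made up for illustration and stand in for word1 ... wordn.

```python
import re

# Hypothetical word set standing in for {word1, word2, ...}.
substring_pattern = re.compile(r"(cat|dog)")        # matches inside longer words too
whole_word_pattern = re.compile(r"\b(cat|dog)\b")   # matches isolated words only

def contains(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None

in_longer_word = contains(substring_pattern, "a concatenated string")
as_whole_word = contains(whole_word_pattern, "a concatenated string")
```

Here the first pattern finds "cat" inside "concatenated", while the \b version does not, because \b only matches at a boundary between a word character and a non-word character.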

I don't fully understand the context, such as what kind of text you have or what kind of words these will be, but I can offer an easy solution. You can generate a regex such as (dormammu|bargain) programmatically from your word list and then search for it in a text like "dormammu I come to bargain". I have no clue about LaTeX, but I think that is not your question.
For more information you can tinker with it at [regex101][1].
If you have trouble understanding it, [regexone][2] is the place to go. It's a good start for beginners.
[1]: http://regex101.com
[2]: https://regexone.com/
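Generating the alternation from a word list might look like the sketch below. It assumes Python's re module; the helper name build_word_regex is hypothetical and not part of any library. re.escape is used so that words containing regex metacharacters are treated literally.

```python
import re

def build_word_regex(words):
    """Join the words into one alternation, escaping regex metacharacters."""
    return re.compile("(" + "|".join(re.escape(w) for w in words) + ")")

pattern = build_word_regex(["dormammu", "bargain"])
found = pattern.findall("dormammu I come to bargain")
```

With this input the compiled pattern is (dormammu|bargain), and found holds the words from the set that occur in the text, in order of appearance.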

Related

Regex to match sentences with jumbled words but preserving sentence order

I want to match sentences in such a way that the words within a sentence can be in any order, but the sentences themselves must stay in the same order.
e.g.
My name is Sam. I love regex.
Acceptable input:
My Sam is name. regex I love.
name is My Sam. I regex love.
Invalid input:
I love regex. My name is Sam.
regex I love. is My name Sam.
Here is the sample regex I have come up with so far to solve the above problem:
^((?=.*\bMy\b)(?=.*\bSam\b)(?=.*\bis\b)(?=.*\bname\b))((?=.*\bregex\b)(?=.*\bI\b)(?=.*\blove\b)).*$
Which is not working as expected.
Can this problem be solved by regex? What would be the recommended approach to solve this?
Note: Please ignore . I am using it just for clarity.
I think you are looking for something else than regex. If you would want to do this, the most efficient way would be to compare an array of expected words and 'check' if they all appear once in a sentence. This is completely dependent on which context you are using. If you need a regex that literally finds what you stated in your example, you could use something like this:
/(My|name|is|Sam) (My|name|is|Sam) (My|name|is|Sam) (My|name|is|Sam)\. (I|love|regex) (I|love|regex) (I|love|regex)\./g
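But as you can see, this regex grows quickly as the sentences get longer: every additional word adds another group and another alternative to every existing group. It also accepts repeated words (for example "My My My My."), and it's really inefficient compared to parsing the text with something else.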
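The non-regex approach mentioned above (comparing the expected words of each sentence) can be sketched roughly as follows. This is only an illustration in Python, assuming sentences are separated by periods as in the example; the function names are hypothetical.

```python
def sentence_bags(text):
    """Return one sorted word list per sentence, keeping sentence order."""
    sentences = [s for s in text.split(".") if s.strip()]
    return [sorted(s.split()) for s in sentences]

def same_sentences_any_word_order(expected, candidate):
    """True if both texts have the same words per sentence, sentence by sentence."""
    return sentence_bags(expected) == sentence_bags(candidate)

EXPECTED = "My name is Sam. I love regex."
ok = same_sentences_any_word_order(EXPECTED, "My Sam is name. regex I love.")
bad = same_sentences_any_word_order(EXPECTED, "I love regex. My name is Sam.")
```

Sorting the words of each sentence makes the word order irrelevant, while comparing the lists position by position keeps the sentence order significant.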
I couldn't achieve this with a single regex; instead I did the following:
Virtually divided the sentence into multiple blocks.
Maintained a sentence block -> regex configuration.
The regex configured for a block depends on the rule that applies to that sentence block.
Applied each regex to the sentence to check whether its block exists.
Finally, verified that the blocks appear in the configured order.
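One possible reading of these steps, sketched in Python. The answer does not show its actual configuration, so everything here is an assumption: each block is a period-terminated sentence, its rule is "all listed words occur somewhere in the block", and the names block_regex and blocks_in_order are made up.

```python
import re

def block_regex(words):
    """Regex for one sentence block: every word must occur before the next period."""
    lookaheads = "".join(rf"(?=[^.]*\b{re.escape(w)}\b)" for w in words)
    return re.compile(lookaheads + r"[^.]+\.")

# Block -> regex configuration, listed in the required order (hypothetical).
BLOCKS = [
    block_regex(["My", "name", "is", "Sam"]),
    block_regex(["I", "love", "regex"]),
]

def blocks_in_order(text):
    """Find each block after the end of the previous one; fail if any is missing."""
    pos = 0
    for rx in BLOCKS:
        match = rx.search(text, pos)
        if match is None:
            return False
        pos = match.end()
    return True
```

The lookaheads use [^.]* rather than .* so that they cannot look past the end of the current sentence. This sketch does not reject extra or repeated words inside a block.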

Exclude a certain String from variable in regex

Hi, I have a stylesheet where I use xsl:analyze-string with the following regex:
(&journal_abbrevs;)[\s ]*([0-9]{{4}})[,][\s ][S]?[\.]?[\s ]?([0-9]{{1,4}})([\s ][(][0-9]{{1,4}}[)])?
You don't need to look at the whole thing :)
&journal_abbrevs; looks like this:
"example-String1|example-String2|example-String3|..."
What I need to do now is exclude one of the strings in &journal_abbrevs; from this regex. E.g. I don't want example-String1 to be matched.
Any ideas on how to do that?
It seems XSLT regex does not support look-around. So I don't think you'll be able to get a solution for this that does not involve writing out all strings from journal_abbrevs in your regex. Related question.
To minimize the amount of writing out, you could split journal_abbrevs into, say, journal_abbrevs1, journal_abbrevs2 and journal_abbrevs3 (or however many you decide to use) and only write out whichever one contains the string you wish to exclude. If journal_abbrevs1 contains the string, you'd then end up with something like:
((&journal_abbrevs2;)|(&journal_abbrevs3;)|example-String2|example-String3|...)...
If it supported look-around, you could've used a very simple:
(?!example-String1)(&journal_abbrevs;)...
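To illustrate what the look-ahead version would do in an engine that supports it, here is a small sketch using Python's re module (not XSLT). The alternation is written out by hand in place of the &journal_abbrevs; entity, and the sample text is invented.

```python
import re

# Stand-in for the expanded &journal_abbrevs; entity (hypothetical values).
JOURNALS = "example-String1|example-String2|example-String3"

# The negative lookahead rejects any position where example-String1 starts.
pattern = re.compile(rf"(?!example-String1)({JOURNALS})")

matches = pattern.findall("example-String1 2001, example-String2 2002")
```

Note that the lookahead as written also rejects any longer abbreviation that merely begins with example-String1.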

Re: Regex Question (Help)

I am new here. I just discovered the tool "Everything Search Engine", which allows the use of regex in searches. I posted in their forum for a bit of help here: http://forum.voidtools.com/viewtopic.php?f=5&t=1343. This section explains how regex can be used in the tool: http://www.voidtools.com/faq.php#How_do_I_use_regex.
The question I am asking is:
What is the correct regex to use in the search to obtain the desired results described below?
For example, I am searching for the word "dog" in the file name. And it returns a result "Pseudogout" which is a file name.
Notice the word "dog" is inside the word "Pseudogout".
How do I use the regex to eliminate such results?
I would appreciate some help here.
Thanks.
Use anchors: ^dog$. The caret matches the beginning of a string, and the dollar sign, the end.
If you want to match the string "dog" inside another string, but not inside another word, you might be able to use something like ^dog$|^dog[^A-Za-z]|[^A-Za-z]dog[^A-Za-z]|[^A-Za-z]dog$ but that is obviously somewhat cumbersome to type and use.
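The behaviour of that longer pattern can be checked with a short sketch. It uses Python's re module rather than the Everything search tool, and the file names are invented for illustration.

```python
import re

# The pattern from the answer: "dog" delimited by string edges or non-letters.
DOG = re.compile(r"^dog$|^dog[^A-Za-z]|[^A-Za-z]dog[^A-Za-z]|[^A-Za-z]dog$")

def is_standalone_dog(name):
    """True if "dog" occurs in the name without a letter directly next to it."""
    return DOG.search(name) is not None

rejected = is_standalone_dog("Pseudogout")
accepted = is_standalone_dog("my dog.txt")
```

As written the pattern is case-sensitive, so "Dog" would need an extra case-insensitive flag or option.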
(This subsumes some information from the comments below.)

RegExp get string inside string

Let's presume we have something like this:
<div1>
<h1>text1</h1>
<h1>text2</h1>
</div1>
<div2>
<h1>text3</h1>
</div2>
Using RegExp we need to get text1 and text2 but not text3.
How to do this?
Thanks in advance.
EDIT:
This is just an example.
The text I'm parsing could be just plain text.
The main thing I want to accomplish is list all strings from a specific section of a document.
I gave this HTML code for example as it perfectly resembles the thing I need to get.
(?siU)<h1>(.*)</h1> would parse all three strings, but how to get only first two?
EDIT2:
Here is another rather dumb example. :)
Section1
This is a "very" nice sentence.
It has "just" a few words.
Section2
This is "only" an example.
The End
I need quoted words from first but not from second section.
Yet again, (?siU)"(.*)" returns quoted words from whole text,
and I need only those between words Section1 and Section2.
This is for the "Rainmeter" application, which apparently uses Perl regex syntax.
I'm sorry, but I can't explain it better. :)
For the general case of the two examples provided -- for use in Rainmeter regex -- you can use:
(?siU)<h1>(.*)</h1>(?=.+<div2>) for the first sample and
(?siU)"(.*)"(?=.+Section2) for the second.
Note that Rainmeter seems to escape things for you, but you might need to change " to \", above.
These both use positive lookahead, but beware: both solutions will fail in the case of nested tags/structures or if there are multiple Section1's and Section2's. Regex is not the best tool for this kind of parsing.
But maybe this is good enough for your current needs?
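For readers who want to try the lookahead idea outside Rainmeter, here is a rough Python equivalent for the second sample. Python's re module has no ungreedy (?U) flag, so the lazy quantifier .*? is written explicitly; this is a translation of the pattern above, not Rainmeter syntax.

```python
import re

TEXT = '''Section1
This is a "very" nice sentence.
It has "just" a few words.
Section2
This is "only" an example.
The End'''

# A quoted string counts only if "Section2" still appears somewhere after it.
QUOTED_BEFORE_SECTION2 = re.compile(r'"(.*?)"(?=.+Section2)', re.S | re.I)

words = QUOTED_BEFORE_SECTION2.findall(TEXT)
```

The re.S flag lets the dot match newlines, which is what allows the lookahead to see Section2 on a later line.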
Use a DOM library: call getElementsByTagName('div') and you'll get a nodeList back. You can reference the first item with ->item(0), then call getElementsByTagName('h1') using that div as the context node and grab the text with the ->nodeValue property.
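A comparable DOM-based sketch in Python, using the standard xml.dom.minidom module. Two assumptions are made: the sample is wrapped in a made-up root element so that it parses as XML, and the tag names div1 and div2 are taken literally from the question. In minidom the text lives on the child text node, so firstChild.nodeValue is used.

```python
from xml.dom.minidom import parseString

# The question's sample, wrapped in a hypothetical <root> so it is well-formed.
SAMPLE = (
    "<root>"
    "<div1><h1>text1</h1><h1>text2</h1></div1>"
    "<div2><h1>text3</h1></div2>"
    "</root>"
)

doc = parseString(SAMPLE)
first_section = doc.getElementsByTagName("div1").item(0)

# Search for h1 elements only inside the first section.
texts = [h1.firstChild.nodeValue for h1 in first_section.getElementsByTagName("h1")]
```

Because the second search starts from first_section rather than from the document, text3 is never visited.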

Algorithm to get a Regex

Something like this is on my mind: I put one or a few strings in, and the algorithm shows me a matching regex.
Is there an "easy" way to do this, or does something like this already exist?
Edit 1: Yes, I'm trying to find a way to generate regex.
Edit 2: Regulazy is not what I am looking for. The common use for the code I want is to find a correct RegEx; for example, article numbers:
I put in 123456, the regex should be \d{6}
I put in nb-123456, the regex should be \w{2}-\d{6}
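A deliberately naive sketch of how the two examples above could be produced, written in Python. It is only one of many possible generalisations: digits become \d, letters become \w, anything else is kept as a literal, and runs of the same token are counted. The function names are hypothetical.

```python
import re
from itertools import groupby

# Characters that need a backslash when used literally outside a character class.
_SPECIAL = set(r".^$*+?{}[]\|()")

def char_token(ch):
    """Map one character to a regex token (a crude classification)."""
    if "0" <= ch <= "9":
        return r"\d"
    if ch.isalpha():
        return r"\w"
    return "\\" + ch if ch in _SPECIAL else ch

def guess_regex(sample):
    """Collapse runs of identical tokens into token{n}."""
    parts = []
    for token, run in groupby(char_token(c) for c in sample):
        n = len(list(run))
        parts.append(token if n == 1 else f"{token}{{{n}}}")
    return "".join(parts)

article = guess_regex("nb-123456")
```

A single sample cannot tell the generator whether the length or the character classes are supposed to vary, so the result is only a guess at the intended pattern.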
If you have Emacs you can use regexp-opt. For example, evaluating:
(regexp-opt (list "my" "list" "of" "some" "strings" "to" "search"))
returns
"list\\|my\\|of\\|s\\(?:earch\\|ome\\|trings\\)\\|to"
Perl can do it: http://www.hakank.org/makeregex/
So does ruby: http://www.toolbox-mag.de/data/makeregex.html
Note: not a perfect solution.
And there is a CLI tool: txt2regex.
There was txt2re, once upon a time...
It sounds like you want an algorithm to generate a regular grammar based on some samples. In a lot of cases, there are many possible grammars for a given set of examples; there can even be infinitely many. Of course, the possibilities can be limited by a second set of required non-matches, which can reduce them to zero if the non-matching strings are too inclusive.
txt2re does something like this.
How about the following (matches every string)?
.*
I think that Regulazy by Roy Osherove does this to a certain extent, or it may be Regulator. Both are on this page:
http://weblogs.asp.net/rosherove/pages/tools-and-frameworks-by-roy-osherove.aspx
If your input strings are not random but follow some rules, you can use a parser generator (e.g. JFlex) to build a regex generator that produces a regex for the given strings.
Look at txt2re.
This site holds a form that takes a sample string and generates a regex pattern that can match the given string.
Then it generates the corresponding script for the following languages: Perl, PHP, Python, Java, Javascript, ColdFusion, C, C++, Ruby, VB, VBScript, J#.net, C#.net, C++.net, VB.net