RegExp get string inside string - regex

Let's presume we have something like this:
<div1>
<h1>text1</h1>
<h1>text2</h1>
</div1>
<div2>
<h1>text3</h1>
</div2>
Using RegExp we need to get text1 and text2 but not text3.
How to do this?
Thanks in advance.
EDIT:
This is just an example.
The text I'm parsing could be just plain text.
The main thing I want to accomplish is list all strings from a specific section of a document.
I gave this HTML code for example as it perfectly resembles the thing I need to get.
(?siU)<h1>(.*)</h1> would parse all three strings, but how to get only first two?
EDIT2:
Here is another rather dumb example. :)
Section1
This is a "very" nice sentence.
It has "just" a few words.
Section2
This is "only" an example.
The End
I need quoted words from first but not from second section.
Yet again, (?siU)"(.*)" returns quoted words from whole text,
and I need only those between words Section1 and Section2.
This is for the "Rainmeter" application, which apparently uses Perl regex syntax.
I'm sorry, but I can't explain it better. :)

For the general case of the two examples provided -- for use in Rainmeter regex -- you can use:
(?siU)<h1>(.*)</h1>(?=.+<div2>) for the first sample and
(?siU)"(.*)"(?=.+Section2) for the second.
Note that Rainmeter seems to escape things for you, but you might need to change " to \", above.
Both use a positive lookahead, but beware: both solutions will fail with nested tags/structures or if there are multiple Section1's and Section2's. Regex is not the best tool for this kind of parsing.
But maybe this is good enough for your current needs?
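As a rough check of the lookahead idea outside Rainmeter, here is a minimal sketch using Python's re module. It is only a stand-in: Python has no (?U) ungreedy flag, so the lazy quantifier .*? is used instead, and the variable names are made up for illustration.

```python
import re

html_sample = """<div1>
<h1>text1</h1>
<h1>text2</h1>
</div1>
<div2>
<h1>text3</h1>
</div2>"""

text_sample = """Section1
This is a "very" nice sentence.
It has "just" a few words.
Section2
This is "only" an example.
The End"""

# re.S lets "." cross newlines, re.I ignores case (the "s" and "i" of (?siU)).
FLAGS = re.S | re.I

# A match only counts if the end marker still appears somewhere after it.
headings = re.findall(r"<h1>(.*?)</h1>(?=.+<div2>)", html_sample, FLAGS)
quoted = re.findall(r'"(.*?)"(?=.+Section2)', text_sample, FLAGS)

print(headings)  # ['text1', 'text2']
print(quoted)    # ['very', 'just']
```

The same caveat applies here: a second Section1/Section2 pair or nested structures would break it.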

Use a DOM library: getElementsByTagName('div') returns a nodeList. Reference the first item with ->item(0), call getElementsByTagName('h1') using that div as the context node, and read the text from the ->nodeValue property.
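The answer above reads like PHP's DOM extension. A minimal sketch of the same steps using Python's xml.dom.minidom follows; the markup is an assumed, well-formed variant of the question's sample (plain div tags inside a made-up root element). In minidom the text lives on the element's child text node, so firstChild.nodeValue is used rather than nodeValue on the element itself.

```python
from xml.dom import minidom

# Assumed sample: the question's structure, rewritten with real <div> tags.
markup = (
    "<root>"
    "<div><h1>text1</h1><h1>text2</h1></div>"
    "<div><h1>text3</h1></div>"
    "</root>"
)

doc = minidom.parseString(markup)

# First <div> in document order, used as the context node.
first_div = doc.getElementsByTagName("div").item(0)

# Only the <h1> elements inside that div are visited.
titles = [h1.firstChild.nodeValue for h1 in first_div.getElementsByTagName("h1")]

print(titles)  # ['text1', 'text2']
```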

Related

What is the proper way to check if a string contains a set of words in regex?

I have a string, let's say, jkdfkskjak some random string containing a desired word
I want to check if the given string has a word from a set of words, say {word1, word2, word3} in latex.
I can easily do it in Java, but I want to achieve it using regex. I am very new to regular expressions.
If you only want to recognise the words anywhere, even as part of a longer word, then use:
(word1|word2|...|wordn)
(see first demo)
If you want them to appear as isolated words, then
\b(word1|word2|...|wordn)\b
should be the answer (see second demo)
I don't know the full context, such as what kind of text you have or what the words will be, but I can offer an easy, literal solution: generate a regex such as (dormammu|bargain) programmatically and then search the text with it, for example "dormammu I come to bargain". I have no clue about LaTeX, but I don't think that is your question.
For more information you can tinker with it at [regex101][1].
If you are having trouble understanding it, [regexone][2] is the place to go; it's a good start for beginners.
[1]: http://regex101.com [2]: https://regexone.com/
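A minimal sketch of generating such a pattern programmatically, written in Python rather than the asker's Java. The helper name build_word_pattern and its whole_words parameter are hypothetical; re.escape guards against words that contain regex metacharacters.

```python
import re

def build_word_pattern(words, whole_words=True):
    """Compile an alternation that matches any of the given words."""
    alternation = "|".join(re.escape(word) for word in words)
    # \b ... \b restricts matches to isolated words.
    template = r"\b(?:{})\b" if whole_words else r"(?:{})"
    return re.compile(template.format(alternation))

pattern = build_word_pattern(["dormammu", "bargain"])
found = pattern.search("dormammu I come to bargain") is not None
print(found)  # True
```

With whole_words=False the pattern also matches inside longer words, which corresponds to the first form given above.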

Regex to match sentences with jumbled words but preserving sentence order

I want to match sentences in such a way that the words within a sentence can be in any order, but the sentences themselves must stay in the same order.
e.g.
My name is Sam. I love regex.
Acceptable input:
My Sam is name. regex I love.
name is My Sam. I regex love.
Invalid input:
I love regex. My name is Sam.
regex I love. is My name Sam.
Sample regex I have come up with so far to solve the above problem:
^((?=.*\bMy\b)(?=.*\bSam\b)(?=.*\bis\b)(?=.*\bname\b))((?=.*\bregex\b)(?=.*\bI\b)(?=.*\blove\b)).*$
Which is not working as expected.
Can this problem be solved by regex? What would be the recommended approach to solve this?
Note: Please ignore . I am using it just for clarity.
I think you are looking for something other than regex. The most efficient way to do this would be to compare against an array of expected words and check that each of them appears exactly once in the sentence, though the details depend on the context you are working in. If you need a regex that literally matches what you stated in your example, you could use something like this:
/(My|name|is|Sam) (My|name|is|Sam) (My|name|is|Sam) (My|name|is|Sam)\. (I|love|regex) (I|love|regex) (I|love|regex)\./g
But as you can see, this regex grows quickly (roughly quadratically) with the number of words in a sentence, and it does not check that each word is used only once, so it would also accept "My My My My.". It's also really inefficient compared to parsing the text some other way.
I couldn't achieve this with a single regex; instead I did the following:
Virtually divided the text into multiple sentence blocks.
Maintained a sentence block -> regex configuration.
The regex configuration depends on the rule applicable to that sentence block.
Applied the regex to the text to identify whether each block exists.
Finally, verified that the blocks appear in the configured order.
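The block-by-block idea can also be sketched without any regex. The Python below is not the poster's implementation: it assumes a period separates sentences and whitespace separates words, and the function names are made up. Each sentence becomes a bag of words, and the bags are compared in order.

```python
from collections import Counter

def sentence_blocks(text):
    """Split text on '.' and turn each non-empty sentence into a bag of words."""
    return [Counter(part.split()) for part in text.split(".") if part.strip()]

def same_sentences_in_order(expected, candidate):
    """True if both texts have the same sentences in the same order,
    ignoring the order of the words inside each sentence."""
    return sentence_blocks(expected) == sentence_blocks(candidate)

expected = "My name is Sam. I love regex."
print(same_sentences_in_order(expected, "My Sam is name. regex I love."))  # True
print(same_sentences_in_order(expected, "I love regex. My name is Sam."))  # False
```

Using Counter rather than a set means a repeated or missing word makes the comparison fail.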

Exclude a certain String from variable in regex

Hi, I have a stylesheet where I use xsl:analyze-string with the following regex:
(&journal_abbrevs;)[\s ]*([0-9]{{4}})[,][\s ][S]?[\.]?[\s ]?([0-9]{{1,4}})([\s ][(][0-9]{{1,4}}[)])?
You don't need to look at the whole thing :)
&journal_abbrevs; looks like this:
"example-String1|example-String2|example-String3|..."
What I need to do now is exclude one of the strings in &journal_abbrevs; from this regex. E.g. I don't want example-String1 to be matched.
Any ideas on how to do that ?
It seems XSLT regex does not support look-around. So I don't think you'll be able to get a solution for this that does not involve writing out all strings from journal_abbrevs in your regex. Related question.
To minimize the amount of writing out, you could split journal_abbrevs into, say, journal_abbrevs1, journal_abbrevs2 and journal_abbrevs3 (or however many you decide to use) and only write out the one that contains the string you wish to exclude. If journal_abbrevs1 contains the string, you'd then end up with something like:
((&journal_abbrevs2;)|(&journal_abbrevs3;)|example-String2|example-String3|...)...
If it supported look-around, you could've used a very simple:
(?!example-String1)(&journal_abbrevs;)...
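To illustrate how that negative lookahead behaves in an engine that does support look-around, here is a small sketch using Python's re module (not XSLT). The three-item alternation stands in for the expanded &journal_abbrevs; entity and is an assumption.

```python
import re

# Stand-in for the expanded &journal_abbrevs; entity (assumed contents).
journal_abbrevs = "example-String1|example-String2|example-String3"

# The lookahead rejects any position where the excluded string starts.
pattern = re.compile(r"(?!example-String1)(" + journal_abbrevs + r")")

excluded = pattern.search("example-String1 2010")
allowed = pattern.search("example-String2 2010")

print(excluded)          # None
print(allowed.group(1))  # example-String2
```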

Searching my code with regex

It happens all the time: I need to scan my code for places where two or more keywords appear together.
For example $json["VALID"]
So, I would need to find json, and VALID.
Some places in the code may contain:
// a = $json['VALID']; // (note the apostrophes)
(I am using EditPlus which is a great text editor, letting me use regex in my searches)
What would be the string in the regex to find json and VALID (in this example) ?
Thanks in advance!
Use this regex:
\$json\[["']VALID['"]\]
A looser alternative would find $json, then any 2 characters, then VALID:
\$json.{2}VALID
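A quick way to compare the two patterns is to run them over a few sample lines. The sketch below uses Python's re module instead of EditPlus, and the sample lines are invented for illustration.

```python
import re

# Strict form: bracket, either quote style, VALID, quote, closing bracket.
strict = re.compile(r"""\$json\[["']VALID['"]\]""")
# Loose form: $json, any two characters, then VALID.
loose = re.compile(r"\$json.{2}VALID")

sample_lines = [
    '$x = $json["VALID"];',
    "// a = $json['VALID'];",
    '$y = $json["INVALID"];',
    '$z = $other["VALID"];',
]

strict_hits = [line for line in sample_lines if strict.search(line)]
loose_hits = [line for line in sample_lines if loose.search(line)]

print(strict_hits)
print(loose_hits)
```

On these samples both patterns pick out the first two lines only; the loose form would additionally accept any other two characters between $json and VALID.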

Re: Regex Question (Help)

I am new here. I just discovered this tool "Everything Search Engine". It allows the use of regex in the search. I posted in their forum for a bit of help here http://forum.voidtools.com/viewtopic.php?f=5&t=1343. This section explains how regex can be used in the tool http://www.voidtools.com/faq.php#How_do_I_use_regex.
The question I am asking is:
What is the correct regex to use in the search to obtain the desired results described below?
For example, I am searching for the word "dog" in the file name. And it returns a result "Pseudogout" which is a file name.
Notice the word "dog" is inside the word "Pseudogout".
How do I use the regex to eliminate such results?
I would appreciate some help here.
Thanks.
Use anchors: ^dog$. The caret matches the beginning of a string, and the dollar sign, the end.
If you want to match the string "dog" inside another string, but not inside another word, you might be able to use something like ^dog$|^dog[^A-Za-z]|[^A-Za-z]dog[^A-Za-z]|[^A-Za-z]dog$ but that is obviously somewhat cumbersome to type and use.
(This subsumes some information from the comments below.)
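The longer pattern above can be tried against a few file names. This is a sketch in Python's re module, not in Everything itself, and the file names other than "Pseudogout" are made up.

```python
import re

# The four alternatives: whole name, at the start, in the middle, at the end.
dog_word = re.compile(r"^dog$|^dog[^A-Za-z]|[^A-Za-z]dog[^A-Za-z]|[^A-Za-z]dog$")

names = ["Pseudogout", "dog", "dog.txt", "my dog.jpg", "hot dog", "hotdog"]
matches = [name for name in names if dog_word.search(name)]

print(matches)  # ['dog', 'dog.txt', 'my dog.jpg', 'hot dog']
```

The pattern is case-sensitive as written, so a name such as "Dog.txt" would need a case-insensitive search option.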