Exclude a certain String from variable in regex - regex

Hi I have a Stylesheet where i use xsl:analyze-string with the following regex:
(&journal_abbrevs;)[\s ]*([0-9]{{4}})[,][\s ][S]?[\.]?[\s ]?([0-9]{{1,4}})([\s ][(][0-9]{{1,4}}[)])?
You don't need to look at the whole thing :)
&journal_abbrevs; looks like this:
"example-String1|example-String2|example-String3|..."
What I need to do know is exclude one of the strings in &journal_abbrevs; from this regex. E.g. I don't want example-String1 to be matched.
Any ideas on how to do that ?

It seems XSLT regex does not support look-around. So I don't think you'll be able to get a solution for this that does not involve writing out all strings from journal_abbrevs in your regex. Related question.
To minimize the amount of writing out, you could split journal_abbrevs into say journal_abbrevs1, journal_abbrevs2 and journal_abbrevs3 (or how many you decide to use) and only write out whichever one that contains the string you wish to exclude. If journal_abbrevs1 contains the string, you'd then end up with something like:
((&journal_abbrevs2;)|(&journal_abbrevs3;)|example-String2|example-String3|...)...
If it supported look-around, you could've used a very simple:
(?!example-String1)(&journal_abbrevs;)...

Related

What is the proper way to check if a string contains a set of words in regex?

I have a string, let's say, jkdfkskjak some random string containing a desired word
I want to check if the given string has a word from a set of words, say {word1, word2, word3} in latex.
I can easily do it in Java, but I want to achieve it using regex. I am very new to regular expressions.
if you want only to recognise the words as part of a word, then use:
(word1|word2|...|wordn)
(see first demo)
if you want them to appear as isolated words, then
\b(word1|word2|...|wordn)\b
should be the answer (see second demo)
I am not able to understand the complete context like what kind of text you have or what kind of words will this be but I can offer you a easy solution the literal way programmatically you can generate this regex (dormammu|bargain) and then search this in text like this "dormammu I come to bargain". I have no clue about latex but I think that is not your question.
For more information you can tinker with it at [regex101][1].
If you are having trouble understanding it [regexone][2] this is the place to go. For beginners its a good start.
[1]: http://regex101.com [2]: https://regexone.com/

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairsfrom a particular tag:
<aaa>key0:val0 key1:val1 key2:va2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s] which will also match correctly in groups on this:
key0:val0 key1:val1 key2:va2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with - (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>) - there are two matching groups. The bigger one matches everything inside the tags, while the other matches a single "key:value" occurrence. The regex engine cannot give each individual occurence because it does not work that way. So it just gives you the last one.
If you think in python, on the matcher object obtained after applying you regex, you will have access to matcher.group(1) and matcher.group(2), because you have two matching ( ) groups in the regex.
But what you want is the n occurences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
I uploaded this one at parsemarket, and I'm not sure its what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.

Regex exclude value from repetition

I'm working with load files and trying to write a regex that will check for any rows in the file that do not have the correct number of delimiters. Let's pretend the delimiter is % (I'm not sure this text field supports the delimiters that are in the load file). The regex I wrote that finds all rows that are correct is:
^%([^%]*%){20}$
Sometimes it's more beneficial to find any rows that do not have the correct number of delimiters, so to accomplish that, I wrote this:
(^%([^%]*%){0,19}$)|(^%([^%]*%){21,}$)
I'm concerned about the efficiency of this (and any regex I write in general), so I'm wondering if there's a better way to write it, or if the way I wrote it is fine. I thought maybe there would be some way to use alternation with the repetition tokens, such as:
{0,19}|{21,}
but that doesn't seem to work.
If it's helpful to know, I'm just searching through the files in Sublime Text, which I believe uses PCRE. I'm also open to suggestions for making the first regex better in general, although I've found it to work pretty well even in exceptionally large load files.
If your regex engine supports negative lookaheads, you can slightly modify your original regex.
^(?!%([^%]*%){20}$)
The regex above is useful for test only. If you want to capture, then you need to add .* part.
^(?!%([^%]*%){20}$).*$

Regex Pattern Matching Concatenation

Is it possible to concatenate the results of Regex Pattern Matching using only Regex syntax?
The specific instance is a program is allowing regex syntax to pull info from a file, but I would like it to pull from several portions and concatenate the results.
For instance:
Input string: 1234567890
Desired result string: 2389
Regex Pattern match: (?<=1).+(?=4)%%(?<=7).+(?=0)
Where %% represents some form of concatenation syntax. Using starting and ending with syntax is important since I know the field names but not the values of the field.
Does a keyword that functions like %% exist? Is there a more clever way to do this? Must the code be changed to allow multiple regex inputs, automatically concatenating?
Again, the pieces to be concatenated may be far apart with unknown characters in between. All that is known is the information surrounding the substrings.
2011-08-08 edit: The program is written in C#, but changing the code is a major undertaking compared to finding a regex-based solution.
Without knowing exactly what you want to match and what language you're using, it's impossible to give you an exact answer. However, the usual way to approach something like this is to use grouping.
In C#:
string pattern = #"(?<=1)(.+)(?=4).+(?<=7)(.+)(?=0)";
Match m = Regex.Match(input, pattern);
string result = m.Groups[0] + m.Groups[1];
The same approach can be applied to many other languages as well.
Edit
If you are not able to change the code, then there's no way to accomplish what you want. The reason is that in C#, the regex string itself doesn't have any power over the output. To change the result, you'd have to either change the called method of the Regex class or do some additional work afterwards. As it is, the method called most likely just returns either a Match object or a list of matching objects, neither of which will do what you want, regardless of the input regex string.

RegExp get string inside string

Let presume we have something like this:
<div1>
<h1>text1</h1>
<h1>text2</h1>
</div1>
<div2>
<h1>text3</h1>
</div2>
Using RegExp we need to get text1 and text2 but not text3.
How to do this?
Thanks in advance.
EDIT:
This is just an example.
The text I'm parsing could be just plain text.
The main thing I want to accomplish is list all strings from a specific section of a document.
I gave this HTML code for example as it perfectly resembles the thing I need to get.
(?siU)<h1>(.*)</h1> would parse all three strings, but how to get only first two?
EDIT2:
Here is another rather dumb example. :)
Section1
This is a "very" nice sentence.
It has "just" a few words.
Section2
This is "only" an example.
The End
I need quoted words from first but not from second section.
Yet again, (?siU)"(.*)" returns quoted words from whole text,
and I need only those between words Section1 and Section2.
This is for the "Rainmeter" application, which apparently uses Perl regex syntax.
I'm sorry, but I can't explain it better. :)
For the general case of the two examples provided -- for use in Rainmeter regex -- you can use:
(?siU)<h1>(.*)</h1>(?=.+<div2>) for the first sample and
(?siU)"(.*)"(?=.+Section2) for the second.
Note that Rainmeter seems to escape things for you, but you might need to change " to \", above.
These both use Positive Lookahead but beware: both solutions will fail in the case of nested tags/structures or if there are mutiple Section1's and Section2's. Regex is not the best tool for this kind of parsing.
But maybe this is good enough for your current needs?
Use a DOM library and getElementsByTagName('div') and you'll get a nodeList back. You can reference the first item with ->item(0) and then getElementsByTagName('h1') using the div as a context node, grab the text with ->nodeValue property.