Regex to get everything between 2 words - regex

I am trying to get through a lot of content and extract some data from it, so I need to pick out the information between two sets of characters.
It looks like this
***some text*** li> ***data to capture*** </li ***more text***
What regex can I use to get everything that is enclosed between li> and </li ?

Basically it will be like this:
li>(.*?)(?:</li)
Depending on your language environment, certain characters may need to be escaped or the way of retrieving the matched string may differ. Typically you would need to escape / by prepending a backslash, resulting in this new version:
li>(.*?)(?:<\/li)
Here's a live demo:
https://regex101.com/r/zV4uN6/1
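For illustration only, here is a minimal Python sketch of the same pattern. Python is an assumption (the question does not name a language), and `extract_between` and `sample` are hypothetical names. Python's `re` does not need `/` escaped, and the non-capturing group around the closing delimiter is dropped because it has no effect on the result.

```python
import re

# Lazy match: capture as little as possible between "li>" and "</li".
# re.DOTALL lets "." cross line breaks, which is an assumption about the input.
BETWEEN_LI = re.compile(r"li>(.*?)</li", re.DOTALL)


def extract_between(text: str) -> list:
    """Return every captured span, in order of appearance."""
    return BETWEEN_LI.findall(text)


sample = "some text li> data to capture </li more text li>second</li"
print(extract_between(sample))
```

Surrounding whitespace is part of the capture, so call `.strip()` on each result if it is not wanted.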

Related

Regex for fixing YAML strings

I am trying to create a bunch of YAML files, mostly composed of strings of text. Now when using apostrophes in words, they must be escaped by typing a double apostrophe, because I’m using apostrophes to wrap the strings.
I want to create a regex that will check for apostrophes in the text that aren’t double. What I have is this:
^([^'\n]*?)'(([^'\n]*?)'(?!')([^'\n]+?))*?'$\n
https://regex101.com/r/v4nUTn/3
My issue is that as soon as my string has a double apostrophe, but also has an apostrophe which isn’t a double apostrophe, it doesn’t match because my negative lookahead doesn’t match as soon as it sees the double apostrophe. (for example the string t''e'st won’t match even though it is missing a double apostrophe after the e)
How can I make it so that my negative lookahead will not fail as soon as it sees one double apostrophe?
This regex should work:
\w'\w
My guess is that maybe an expression similar to
('[^'\r\n]*'|[^\r\n\w']+)|([\w']*)
would be an option to look into.
If the second capturing group returns true, then the string is undesired.
If you wish to explore, simplify, or modify the expression, it is explained on the top right panel of regex101.com, where you can also see how it matches against some sample inputs.
One suggestion would be to do this in two steps.
For example, if every 'candidate' value looks like this: - 'something here' (where you want to test the apostrophes in the something here content of the string), then first isolate that content via:
/^\s*- '(.+)'$/im
And then make sure all apostrophes appear as you want them to within match group 1 of the result.
Then, replace the original match with your 'sanitised' match.
Doing this means you don't have to be concerned with the bounding apostrophes causing complications to the check for apostrophes in the value.
Note: there may well be a perfect one-step regex to do this, but understanding that you can break tasks into several steps is useful if you spend a lot of time with regular expressions, and can help you sidestep 'perfect regex paralysis'.
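The two-step idea can be sketched in Python roughly as follows. This is only an illustration under assumptions: Python is not mentioned in the question, the function names are hypothetical, and the second step (dropping every doubled apostrophe and seeing whether one is left over) is one possible way to do the check, not something the answer prescribes.

```python
import re

# Step 1: isolate the content between the wrapping apostrophes.
# The pattern is the one from the answer; the "- '...'" line shape is the
# answer's own assumption about what a candidate value looks like.
VALUE_LINE = re.compile(r"^\s*- '(.+)'$", re.IGNORECASE | re.MULTILINE)


def has_lone_apostrophe(content: str) -> bool:
    """Step 2: remove every doubled apostrophe; any apostrophe left is unescaped."""
    return "'" in content.replace("''", "")


def find_bad_values(yaml_text: str) -> list:
    """Return the isolated values that still contain an unescaped apostrophe."""
    return [v for v in VALUE_LINE.findall(yaml_text) if has_lone_apostrophe(v)]


text = "- 'it''s fine'\n- 't''e'st'\n- 'plain'\n"
print(find_bad_values(text))
```

Because the wrapping apostrophes are stripped in step 1, the check in step 2 never has to distinguish them from apostrophes inside the value.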
If you want the string to match when there is at least one single apostrophe between the wrapping apostrophes, the parts of the regex that consume text must accept either text with no apostrophe in it or a doubled apostrophe. Adding |'' as an alternative in those parts lets the regex consume either non-apostrophe text or a pair of apostrophes.
Try this updated regex demo and see whether it works the way you wanted:
https://regex101.com/r/v4nUTn/4

Are there any characters that are not allowed/used in regex

I have the somewhat unusual requirement that several regexes be passed as one single string to a Jenkins plugin.
They are entered in a single text field, and I have to split this string into a list of regexes later on.
The issue is that I can't think of a way to delimit the regexes in the string so that I can split it later, since a character like , could also be part of a regex itself.
E.g. if I used a , for the two regexes "(\d+,?\s+\d{1})\.xls" and "\w+\.exe" :
"(\d+,?\s+\d{1})\.xls,\w+\.exe"
would be split into 3 regexes: "(\d+", "?\s+\d{1})\.xls" and "\w+\.exe"
where the first 2 are obviously invalid.
So my actual question is, are there any characters, that can never appear in a regex which I could use to delimit my regexes?
No, any and all characters can appear in a regex. Use any serialisation format to serialise your list of strings into a clearly expressed list format, e.g. JSON:
["(\\d+,?\\s+\\d{1})\\.xls", "\\w+\\.exe"]
Alternatively CSV or anything else that can express a list of things and properly escapes characters used to denote item separators.
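As a rough sketch of the JSON route, the round trip could look like this in Python. A Jenkins plugin would normally do this in Java or Groovy, so the language, the `parse_regex_list` helper, and the variable names are all assumptions made for illustration.

```python
import json
import re

# The two regexes from the question, as raw strings.
patterns = [r"(\d+,?\s+\d{1})\.xls", r"\w+\.exe"]

# What would be pasted into the single text field.
serialised = json.dumps(patterns)


def parse_regex_list(field_value: str) -> list:
    """Decode the JSON array and compile each entry as its own regex."""
    return [re.compile(p) for p in json.loads(field_value)]


compiled = parse_regex_list(serialised)
print(serialised)
print(len(compiled))
```

The comma inside the first regex survives because JSON delimits items with quoted strings and escapes the backslashes, rather than relying on a bare separator character.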

Regex for removing spaces and random trailing chars

I am successfully validating an ID such as:
ZFA1G2H34J5K6L7P5
using this regex:
([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]){17}$
This ID sometimes arrives corrupted (it comes from an OCR process), and then the previous regex does not work. I need to support the most common form of corruption, which is a space within the ID:
ZFA1G2H34 J5K6L7P5
The regex should remove the space and compose just the allowed 17 chars of the ID.
Please note I cannot use scripting (.replace for example) because the software where this regex is used does not support it.
As a bonus, sometimes the ID contains trailing chars which I would like to remove as well:
ZFA1G2H34 J5K6L7P5...ç
You can use one of the following regular expressions to validate the query:
^(?:(?![iIoO])[ ç0-9a-zA-Z]){17,}$
^([ ça-hA-Hj-nJ-Np-zP-Z0-9]){17,}$
And then, you can use the following regular expression to only match characters you like:
(?:(?![iIoO])[0-9a-zA-Z])
[a-hA-Hj-nJ-Np-zP-Z0-9]
Don't use , in a set like [A-Z,a-z], because commas are actually part of the set and not a separator between the character ranges.
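The asker cannot run scripts in the target software, so the following Python sketch is only meant to show what the character classes above match. The `clean_id` helper and its `length` parameter are hypothetical; the class is the second "characters you like" pattern from the answer.

```python
import re
from typing import Optional

# Character class from the answer: letters and digits, excluding i, I, o and O.
ALLOWED = re.compile(r"[a-hA-Hj-nJ-Np-zP-Z0-9]")


def clean_id(raw: str, length: int = 17) -> Optional[str]:
    """Keep only allowed characters and return the first `length` of them."""
    chars = ALLOWED.findall(raw)
    if len(chars) < length:
        return None  # not enough valid characters to form an ID
    return "".join(chars[:length])


print(clean_id("ZFA1G2H34 J5K6L7P5...ç"))

# A comma inside a set is a literal member, not a separator between ranges.
print(re.fullmatch(r"[A-Z,a-z]", ",") is not None)
```

Collecting matches one character at a time is what removes the space and the trailing characters; a single match in an engine without replacement support cannot skip over the space.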

Elasticsearch Regex to match url starting with one string and not ending with another, without look ahead/behind

I have two groups of strings that take the formats
http://example.com/foo/something
and
http://example.com/foo/something/something-else/bar/1
Where example.com, foo and bar are fixed, something and something else could be any string and 1 is any number.
I want to use regex to match strings following the first format (they must start with http://example.com/foo/) and not the second. The exclusion could be around number of slashes, the "bar" string or ending in a number.
I don't have support for look ahead or look back.
What's the best approach?
Examples of strings that should match
http://example.com/foo/apple
http://example.com/foo/bear-bear
http://example.com/foo/cake-cake
Examples of strings that should NOT match
http://example.com/baa/apple
http://example.com/foo/apple/cake/bar/1
http://example.com/foo/bear-apple/camel/bar/2
Examples of strings that wouldn't exist in the data set
(So it doesn't matter if they match or not)
http://example.com/foo/bear-bear/cake/bar/two
http://example.com/foo/bear/camel/tar/2
http://example.com/foo/bear-bear/camel
http://example.com/foo/bear/camel/
http://example.com/foo/bear-bear/camel/tar/2
UPDATE
It turns out that the application I'm using this in relies on the Elasticsearch regex engine, so this documentation (and one of our developers) was helpful: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
The end solution was:
(http://example.com/foo.*)&~(.*bar.*)
All your examples have a specific prefix URL, followed by one-and-only-one path element. If this is the general case, you can do this by simply looking for the prefix URL followed by a word which doesn't contain a path separator, followed by EOL.
You didn't say what engine you're using, so here's an example with GNU grep in bash:
grep -e '^http://example.com/foo/[^/]\+$'
Bash makes for readable examples, because single-quoting means very few characters need escaping. The sole exception in my example is the + character.
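The same "prefix plus exactly one path element" idea, sketched in Python against the example URLs from the question. Python and the `is_first_format` name are assumptions for illustration; unlike the grep pattern, the dot in the host name is escaped here so that it only matches a literal dot.

```python
import re

# Fixed prefix followed by one or more characters that are not "/".
# fullmatch anchors both ends, so no "^" or "$" is needed.
SINGLE_SEGMENT = re.compile(r"http://example\.com/foo/[^/]+")


def is_first_format(url: str) -> bool:
    """True when the URL is the prefix plus a single path element."""
    return SINGLE_SEGMENT.fullmatch(url) is not None


for url in (
    "http://example.com/foo/apple",
    "http://example.com/baa/apple",
    "http://example.com/foo/apple/cake/bar/1",
):
    print(url, is_first_format(url))
```

No lookahead or lookbehind is involved: the exclusion comes entirely from the negated character class forbidding a further slash.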

Need assistance regex matching a single quote, but do not include the quote in the result

I'm trying to find out a way to match the following test string:
token = '1866FB352F4DF76BCB92C3482DB7D7B4F562';
The data I want returned is...
1866FB352F4DF76BCB92C3482DB7D7B4F562
I've tried the following; the closest I have is this, but it includes the single quote at the end:
(?!token = ')(\w+)';
Another one comes close, but it also includes the last single quote:
'([^']+)'
Anyone want to take a stab at this?
Update: After looking at what I need to parse, I found the same value in the html, in the form area, which looks like it might be easier to grab:
name="token" value="482CD1FE037F68D5A36F4C961A6D57D9"
Again, I just need the contents within value="*"
However, the regex will have to parse the entire html source, so I assume I will need to search for name="token" value= but not include that in the result set.
If your regex engine supports lookaround, you can use
(?<=')\w+(?=')
This matches an alphanumeric word if it's surrounded by single quotes, without making those quotes a part of the actual match. If you only want to match hexadecimal numbers, use
(?i)(?<=')[0-9A-F]+(?=')
EDIT:
Since you have now added that you're using JMeter, and because JMeter doesn't support lookbehind assertions for reasons incomprehensible to me (because Java itself does support it just fine), you can possibly cheat like this:
\b[0-9A-F]+(?=')
This only checks whether an entire hex number occurs right before a ' character. It does not check for the presence of an opening quote, but chances are that this won't matter.
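To compare the variants side by side, here is a small Python sketch run against the two sample strings from the question. Python stands in for JMeter purely for illustration, the function names are hypothetical, and the capture-group version for the HTML attribute is an added assumption, not something the answer proposes.

```python
import re

js_line = "token = '1866FB352F4DF76BCB92C3482DB7D7B4F562';"
html = 'name="token" value="482CD1FE037F68D5A36F4C961A6D57D9"'


def first_match(pattern: str, text: str, group: int = 0):
    """Return the requested group of the first match, or None."""
    m = re.search(pattern, text)
    return m.group(group) if m else None


def token_lookaround(text: str):
    # Quotes are asserted on both sides but are not part of the match.
    return first_match(r"(?<=')[0-9A-F]+(?=')", text)


def token_lookahead_only(text: str):
    # Fallback without lookbehind: only the closing quote is asserted.
    return first_match(r"\b[0-9A-F]+(?=')", text)


def token_from_html(text: str):
    # Assumed alternative: match the attribute text, keep only group 1.
    return first_match(r'name="token" value="([0-9A-F]+)"', text, group=1)


print(token_lookaround(js_line))
print(token_lookahead_only(js_line))
print(token_from_html(html))
```

Where the tool lets you choose which group to return, the capture-group form sidesteps lookaround support altogether.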