Regex matching terminating quote only if quote at the beginning - regex

I want to match the following element with regex
target="#MOBILE"
and all valid variants.
I've written the regex
target[\s\S]*#MOBILE[^>^\s]*
which matches the following
target="#MOBILE"
target = "#MOBILE"
target=#MOBILE
target="#MOBILE" (followed directly by >)
but it doesn't match
target=" #MOBILE "
properly (note the extra space). It only matches
target=" #MOBILE
missing out the final quote
What I need is the terminating expression [^>^\s]* to match a quote only if it matches a quote at the beginning. It also needs to work with single quotes. The terminating expression also needs to end with a whitespace or > char as it does currently.
I'm sure there is a way to do this - but I'm not sure how. It's probably standard stuff - I just don't know it
Incidently I'm not sure that [^>^\s]* is the best way to terminate if the regex hits a space or > char but it's the only way that I can get it to work.

You can use a backreference, similar to jensgram's suggestion:
target\s*=\s*(?:(")\s*)?#Mobile\s*\1
(?:(")\s*)? - Optional non-capturing group that contains a quote (which is captured), and additional optional spaces. If it matched, \1 will contain a quote.
Working example: http://regexr.com?2vkkq
A better alternative for .Net (mainly because you want single quotes, and \1 behaves differently for uncaptured groups):
target\s*=\s*(["']?)\s*?#Mobile\s*\1
Working example: Regex Storm

Try the following if you need to check that your quotes are in pairs:
target\s*=\s*(['"])(?=\1)\s*#MOBILE\s*(?<=\1)\1
But it really depends if your regex engine supports positive look-(ahead|behind) syntax. And if it supports back-referencing.

Without quotes target\s*=\s*#MOBILE
With double quotes target\s*=\s*"\s*#MOBILE\s*"
With single quotes target\s*=\s*'\s*#MOBILE\s*'
All together
(target\s*=\s*#MOBILE)|(target\s*=\s*"\s*#MOBILE\s*")|(target\s*=\s*'\s*#MOBILE\s*')
Or someone can make it neater.

Related

What is the difference between `(\S.*\S)` and `^\s*(.*)\s*$` in regex?

I'm doing the RegexOne regex tutorial and it has a question about writing a regular expression to remove unnecessary whitespace.
The solution provided in the tutorial is
We can just skip all the starting and ending whitespace by not capturing it in a line. For example, the expression ^\s*(.*)\s*$ will catch only the content.
The setup for the question does indicate the use of the hat at the beginning and the dollar sign at the end, so it makes sense that this is the expression that they want:
We have previously seen how to match a full line of text using the hat ^ and the dollar sign $ respectively. When used in conjunction with the whitespace \s, you can easily skip all preceding and trailing spaces.
That said, using \S instead, I was able to come up with what seems like a simpler solution - (\S.*\S).
I've found this Stack Overflow solution that match the one in the tutorial - Regex Email - Ignore leading and trailing spaces? and I've seen other guides that recommend the same format but I'm struggling to find an explanation for why the \S is bad.
Additionally, this validates as correct in their tool... so, are there cases where this would not work as well as the provided solution? Or is the recommended version just a standard format?
The tutorial's solution of ^\s*(.*)\s*$ is wrong. The capture group .* is greedy, so it will expand as much as it can, all the way to the end of the line - it will capture trailing spaces too. The .* will never backtrack, so the \s* that follows will never consume any characters.
https://regex101.com/r/584uVG/1
Your solution is much better at actually matching only the non-whitespace content in the line, but there are a couple odd cases in which it won't match the non-space characters in the middle. (\S.*\S) will only capture at least two characters, whereas the tutorial's technique of (.*) may not capture any characters if the input is composed of all whitespace. (.*) may also capture only a single character.
But, given the problem description at your link:
Occasionally, you'll find yourself with a log file that has ill-formatted whitespace where lines are indented too much or not enough. One way to fix this is to use an editor's search a replace and a regular expression to extract the content of the lines without the extra whitespace.
From this, matching only the non-whitespace content (like you're doing) probably wouldn't remove the undesirable leading and trailing spaces. The tutorial is probably thinking to guide you towards a technique that can be used to match a whole line with a particular pattern, and then replace that line with only the captured group, like:
Match ^\s*(.*\S)\s*$, replace with $1: https://regex101.com/r/584uVG/2/
Your technique would work given the problem if you had a way to make a new text file containing only the captured groups (or all the full matches), eg:
const input = ` foo
bar
baz
qux `;
const newText = (input.match(/\S(?:$|.*\S)/gm) || [])
.join('\n');
console.log(newText);
Using \S instead of . is not bad - if one knows a particular location must be matched by a non-space character, rather than by a space, using \S is more precise, can make the intent of the pattern clearer, and can make a bad match fail faster, and can also avoid problems with catastrophic backtracking in some cases. These patterns don't have backtracking issues, but it's still a good habit to get into.

Regex everything after, but not including

I am trying to regex the following string:
https://www.amazon.com/Tapps-Top-Apps-and-Games/dp/B00VU2BZRO/ref=sr_1_3?ie=UTF8&qid=1527813329&sr=8-3&keywords=poop
I want only B00VU2BZRO.
This substring is always going to be a 10 characters, alphanumeric, preceded by dp/.
So far I have the following regex:
[d][p][\/][0-9B][0-9A-Z]{9}
This matches dp/B00VU2BZRO
I want to match only B00VU2BZRO with no dp/
How do I regex this?
Here is one regex option which would produce an exact match of what you want:
(?<=dp\/)(.*)(?=\/)
Demo
Note that this solution makes no assumptions about the length of the path fragment occurring after dp/. If you want to match a certain number of characters, replace (.*) with (.{10}), for example.
Depending on your language/method of application, you have a couple of options.
Positive look behind. This will make your regex more complicated, but will make it match what you want exactly:
(<=dp/)[0-9A-Z]{10}
The construct (<=...) is called a positive look behind. It will not consume any of the string, but will only allow the match to happen if the pattern between the parens is matched.
Capture group. This will make the regex itself slightly simpler, but will add a step to the extraction process:
dp/([0-9A-Z]{10})
Anything between plain parens is a capture group. The entire pattern will be matched, including dp/, but most languages will give you a way of extracting the portion you are interested in.
Depending on your language, you may need to escape the forward slash (/).
As an aside, you never need to create a character class for single characters: [d][p][\/] can equally well be written as just dp\/.

Regex expression - single quote without comma

I need a regex expression to find single quotes that does not have a comma neither right before nor right after it. Also the single quotes should not be the first character or the last character in the string and should have an alphanumeric character on each side
Example "Jane's book" would detect while "'apples','oranges'"
Can anyone help?
You can use this regex with lookarounds:
(?<=[a-zA-Z0-9])'(?=[a-zA-Z0-9])
RegEx Demo
Something like:
(?<=[A-Za-z0-9])\'(?=[A-Za-z0-9])
should give you matches in the languages that support positive lookaheads and positive lookbehinds (JavaScript only supports lookaheads if I remember correctly). I didn't test the above, but I'm not sure you would even need to escape the single quote...
You need your language-appropriate variation of:
.+'[^,]+.*
' finds you a single quote. You generally do not need to escape a single quotation mark.
[^,] allows any character but a comma and + indicates that you require at least one such character
.* says you can have as many of any character as you like, so putting it before and after what you care about says your expression can occur anywhere in the string. .+ means you must have at least one of any character not a comma.
Note that I'm making some assumptions, like that you'll only have one ' in the string that you want to find. Also I'm assuming you don't care about , except for right after '. If that's not true, you need to be more specific about your requirements.

Ant regex expression

Quite a simple one in theory but can't quite get it!
I want a regex in ant which matches anything as long as it has a slash on the end.
Below is what I expect to work
<regexp id="slash.end.pattern" pattern="*/"/>
However this throws back
java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*/
^
I have also tried escaping this to \*, but that matches a literal *.
Any help appreciated!
Your original regex pattern didn't work because * is a special character in regex that is only used to quantify other characters.
The pattern (.)*/$, which you mentioned in your comment, will match any string of characters not containing newlines, however it uses a possibly unnecessary capturing group. .*/$ should work just as well.
If you need to match newline characters, the dot . won't be enough. You could try something like [\s\S]*/$
On that note, it should be mentioned that you might not want to use $ in this pattern. Suppose you have the following string:
abc/def/
Should this be evaluated as two matches, abc/ and def/? Or is it a single match containing the whole thing? Your current approach creates a single match. If instead you would like to search for strings of characters and then stop the match as soon as a / is found, you could use something like this: [\s\S]*?/.

Regular expression using negative lookbehind not working in Notepad++

I have a source file with literally hundreds of occurrences of strings flecha.jpg and flecha1.jpg, but I need to find occurrences of any other .jpg image (i.e. casa.jpg, moto.jpg, whatever)
I have tried using a regular expression with negative lookbehind, like this:
(?<!flecha|flecha1).jpg
but it doesn't work! Notepad++ simply says that it is an invalid regular expression.
I have tried the regex elsewhere and it works, here is an example so I guess it is a problem with NPP's handling of regexes or with the syntax of lookbehinds/lookaheads.
So how could I achieve the same regex result in NPP?
If useful, I am using Notepad++ version 6.3 Unicode
As an extra, if you are so kind, what would be the syntax to achieve the same thing but with optional numbers (in this case only '1') as a suffix of my string? (even if it doesn't work in NPP, just to know)...
I tried (?<!flecha[1]?).jpg but it doesn't work. It should work the same as the other regex, see here (RegExr)
Notepad++ seems to not have implemented variable-length look-behinds (this happens with some tools). A workaround is to use more than one fixed-length look-behind:
(?<!flecha)(?<!flecha1)\.jpg
As you can check, the matches are the same. But this works with npp.
Notice I escaped the ., since you are trying to match extensions, what you want is the literal .. The way you had, it was a wildcard - could be any character.
About the extra question, unfortunately, as we can't have variable-length look-behinds, it is not possible to have optional suffixes (numbers) without having multiple look-behinds.
Solving the problem of the variable-length-negative-lookbehind limitation in Notepad++
Given here are several strategies for working around this limitation in Notepad++ (or any regex engine with the same limitation)
Defining the problem
Notepad++ does not support the use of variable-length negative lookbehind assertions, and it would be nice to have some workarounds. Let's consider the example in the original question, but assume we want to avoid occurrences of files named flecha with any number of digits after flecha, and with any characters before flecha. In that case, a regex utilizing a variable-length negative lookbehind would look like (?<!flecha[0-9]*)\.jpg.
Strings we don't want to match in this example
flecha.jpg
flecha1.jpg
flecha00501275696.jpg
aflecha.jpg
img_flecha9.jpg
abcflecha556677.jpg
The Strategies
Inserting Temporary Markers
Begin by performing a find-and-replace on the instances that you want to avoid working with - in our case, instances of flecha[0-9]*\.jpg. Insert a special marker to form a pattern that doesn't appear anywhere else. For this example, we will insert an extra . before .jpg, assuming that ..jpg doesn't appear elsewhere. So we do:
Find: (flecha[0-9]*)(\.jpg)
Replace with: $1.$2
Now you can search your document for all the other .jpg filenames with a simple regex like \w+\.jpg or (?<!\.)\.jpg and do what you want with them. When you're done, do a final find-and-replace operation where you replace all instances of ..jpg with .jpg, to remove the temporary marker.
Using a negative lookahead assertion
A negative lookahead assertion can be used to make sure that you're not matching the undesired file names:
(?<!\S)(?!\S*flecha\d*\.jpg)\S+\.jpg
Breaking it down:
(?<!\S) ensures that your match begins at the start of a file name, and not in the middle, by asserting that your match is not preceded by a non-whitespace character.
(?!\S*flecha\d*\.jpg) ensures that whatever is matched does not contain the pattern we want to avoid
\S+\.jpg is what actually gets matched -- a string of non-whitespace characters followed by .jpg.
Using multiple fixed-length negative lookbehinds
This is a quick (but not-so-elegant) solution for situations where the pattern you don't want to match has a small number of possible lengths.
For example, if we know that flecha is only followed by up to three digits, our regex could be:
(?<!flecha)(?<!flecha[0-9])(?<!flecha[0-9][0-9])(?<!flecha[0-9][0-9][0-9])\.jpg
Are you aware that you're only matching (in the sense of consuming) the extension (.jpg)? I would think you wanted to match the whole filename, no? And that's much easier to do with a lookahead:
\b(?!flecha1?\b)\w+\.jpg
The first \b anchors the match to the beginning of the name (assuming it's really a filename we're looking at). Then (?!flecha1?\b) asserts that the name is not flecha or flecha1. Once that's done, the \w+ goes ahead and consumes the name. Then \.jpg grabs the extension to finish off the match.