What is the difference between `(\S.*\S)` and `^\s*(.*)\s*$` in regex? - regex

I'm doing the RegexOne regex tutorial and it has a question about writing a regular expression to remove unnecessary whitespace.
The solution provided in the tutorial is
We can just skip all the starting and ending whitespace by not capturing it in a line. For example, the expression ^\s*(.*)\s*$ will catch only the content.
The setup for the question does indicate the use of the hat at the beginning and the dollar sign at the end, so it makes sense that this is the expression that they want:
We have previously seen how to match a full line of text using the hat ^ and the dollar sign $ respectively. When used in conjunction with the whitespace \s, you can easily skip all preceding and trailing spaces.
That said, using \S instead, I was able to come up with what seems like a simpler solution - (\S.*\S).
I've found this Stack Overflow solution that matches the one in the tutorial - Regex Email - Ignore leading and trailing spaces? and I've seen other guides that recommend the same format but I'm struggling to find an explanation for why the \S is bad.
Additionally, this validates as correct in their tool... so, are there cases where this would not work as well as the provided solution? Or is the recommended version just a standard format?

The tutorial's solution of ^\s*(.*)\s*$ is wrong. The capture group .* is greedy, so it will expand as much as it can, all the way to the end of the line, and so it will capture trailing spaces too. Because the \s* that follows can match zero characters, the .* never needs to backtrack, and the trailing \s* never consumes anything.
https://regex101.com/r/584uVG/1
Your solution is much better at actually matching only the non-whitespace content in the line, but there are a couple of odd cases in which it won't match the non-space characters in the middle. (\S.*\S) requires at least two non-whitespace characters, so it won't match a line whose content is a single character, whereas the tutorial's (.*) can match a single character, or nothing at all if the input is all whitespace.
But, given the problem description at your link:
Occasionally, you'll find yourself with a log file that has ill-formatted whitespace where lines are indented too much or not enough. One way to fix this is to use an editor's search a replace and a regular expression to extract the content of the lines without the extra whitespace.
From this, matching only the non-whitespace content (like you're doing) probably wouldn't remove the undesirable leading and trailing spaces. The tutorial is probably thinking to guide you towards a technique that can be used to match a whole line with a particular pattern, and then replace that line with only the captured group, like:
Match ^\s*(.*\S)\s*$, replace with $1: https://regex101.com/r/584uVG/2/
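The difference between the patterns discussed so far can be checked with a short script. This is only a sketch: the sample strings are made up for illustration and are not taken from the tutorial.

```javascript
// Hypothetical sample line with two leading and two trailing spaces.
const line = "  foo bar  ";

// Tutorial pattern: the greedy (.*) runs to the end of the line,
// so the trailing spaces end up inside the capture group.
const tutorial = line.match(/^\s*(.*)\s*$/)[1];

// Variant from this answer: forcing the group to end on \S
// leaves the trailing spaces to the final \s*.
const fixed = line.match(/^\s*(.*\S)\s*$/)[1];

// Search-and-replace form: swap each whole line for its captured content.
const trimmed = "  foo  \n bar\nbaz ".replace(/^\s*(.*\S)\s*$/gm, "$1");

// (\S.*\S) needs two non-whitespace characters, so a lone "a" is missed.
const single = " a ".match(/(\S.*\S)/);

console.log(JSON.stringify(tutorial)); // "foo bar  "
console.log(JSON.stringify(fixed));    // "foo bar"
console.log(JSON.stringify(trimmed));  // "foo\nbar\nbaz"
console.log(single);                   // null
```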
Your technique would work given the problem if you had a way to make a new text file containing only the captured groups (or all the full matches), eg:
const input = ` foo
bar
baz
qux `;
// \S(?:.*\S)? also matches a lone non-space character followed by trailing spaces
const newText = (input.match(/\S(?:.*\S)?/gm) || [])
  .join('\n');
console.log(newText);
Using \S instead of . is not bad. If you know a particular position must be matched by a non-space character rather than by a space, using \S is more precise, makes the intent of the pattern clearer, lets a bad match fail faster, and can avoid catastrophic backtracking in some cases. These patterns don't have backtracking issues, but it's still a good habit to get into.

Related

Finding ALL matches after a certain character(s)

I'd like to think I'm ok at writing RegEx's, but there's one thing I can't seem to crack:
I want to start looking for multiple, identical matches after a certain set of characters and capture all of them. Here's an example string:
Dialogue: 0,0:05:47.99,0:05:50.74,JoJo-main,Koichi,0000,0000,0000,,What are you doing, Giorno Giovanna?!
For this example, I want to start looking for matches after ,,. I want to find all instances of Gio i.e.
Dialogue: 0,0:05:47.99,0:05:50.74,JoJo-main,Koichi,0000,0000,0000,,What are you doing, {Gio}rno {Gio}vanna?!
I've tried first using non-capturing groups like /(?:,,.*?)(Gio)/g then lookbehinds like /(?<=,,.*?)(Gio)/g, /(?<=,,)(?:.*?)(Gio)/g and /(?<=,,)((?:.*?)(Gio))+/g to avoid consuming the ,,
None of these give me the behaviour I want, as I want individual matches as if I just used Gio, but without the chance of accidentally capturing stuff before the ,,
I could, of course, run one RegEx to find the ,, then feed that position to another RegEx to look for Gios after that point.
However I have thousands of lines like these to parse and thousands of words to look for on each line (I separate them with |), so I'd ideally like to do it with one RegEx and without a loop.
You may consider the following option for the .NET or modern ECMAScript 2018+ compliant JS environments:
/(?<=,,.*?)Gio/g
See the regex demo.
The (?<=,,.*?)Gio pattern matches Gio when it is preceded with ,, and any 0+ chars other than line break chars, as few as possible.
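As a rough illustration of the lookbehind variant, the sketch below runs it in JavaScript. The sample line is a hypothetical variation on the question's string, with "Giorno" placed in the speaker field so that there is a Gio before the ,, that must not match.

```javascript
// Hypothetical line: the speaker field also contains "Gio" before the ",,".
const line =
  "Dialogue: 0,0:05:47.99,0:05:50.74,JoJo-main,Giorno,0000,0000,0000,,What are you doing, Giorno Giovanna?!";

// Gio only counts when ",," appears somewhere earlier on the same line.
const pattern = /(?<=,,.*?)Gio/g;

const textStart = line.indexOf(",,") + 2;
const hits = [...line.matchAll(pattern)].map((m) => m.index);

console.log(hits.length);                       // 2
console.log(hits.every((i) => i >= textStart)); // true
```

Each hit is an individual match of Gio, so the indices can be used directly for highlighting or replacement.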
The following variant will work with PCRE/Onigmo regex engines:
/(?:\G(?!^)|,,).*?\KGio/
See another regex demo. Here, (?:\G(?!^)|,,) matches either the end of the previous successful match or ,, and then .*? matches and consumes any 0+ chars other than line break chars, as few as possible, then \K will reset the match buffer and Gio will land right there.

Regex everything after, but not including

I am trying to regex the following string:
https://www.amazon.com/Tapps-Top-Apps-and-Games/dp/B00VU2BZRO/ref=sr_1_3?ie=UTF8&qid=1527813329&sr=8-3&keywords=poop
I want only B00VU2BZRO.
This substring is always going to be 10 alphanumeric characters, preceded by dp/.
So far I have the following regex:
[d][p][\/][0-9B][0-9A-Z]{9}
This matches dp/B00VU2BZRO
I want to match only B00VU2BZRO with no dp/
How do I regex this?
Here is one regex option which would produce an exact match of what you want:
(?<=dp\/)(.*)(?=\/)
Demo
Note that this solution makes no assumptions about the length of the path fragment occurring after dp/. If you want to match a certain number of characters, replace (.*) with (.{10}), for example.
Depending on your language/method of application, you have a couple of options.
Positive lookbehind. This will make your regex more complicated, but will make it match what you want exactly:
(?<=dp/)[0-9A-Z]{10}
The construct (?<=...) is called a positive lookbehind. It will not consume any of the string, but will only allow the match to happen if the pattern between the parens matches immediately before the current position.
Capture group. This will make the regex itself slightly simpler, but will add a step to the extraction process:
dp/([0-9A-Z]{10})
Anything between plain parens is a capture group. The entire pattern will be matched, including dp/, but most languages will give you a way of extracting the portion you are interested in.
Depending on your language, you may need to escape the forward slash (/).
As an aside, you never need to create a character class for single characters: [d][p][\/] can equally well be written as just dp\/.
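Both options can be tried side by side. The sketch below assumes a JavaScript engine with lookbehind support (ES2018+); the variable names are made up for illustration.

```javascript
const url =
  "https://www.amazon.com/Tapps-Top-Apps-and-Games/dp/B00VU2BZRO/ref=sr_1_3?ie=UTF8&qid=1527813329&sr=8-3&keywords=poop";

// Option 1: lookbehind, so the whole match is already the ID.
const viaLookbehind = url.match(/(?<=dp\/)[0-9A-Z]{10}/)[0];

// Option 2: capture group, so the ID is read from group 1 instead of the full match.
const viaGroup = url.match(/dp\/([0-9A-Z]{10})/)[1];

console.log(viaLookbehind); // B00VU2BZRO
console.log(viaGroup);      // B00VU2BZRO
```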

Language Syntax Highlight - Comment Line Starts With * may or may not have following words

I am creating a syntax highlight file for a language and I have everything mapped out and working with one exception.
I cannot come up with a regex that will match the following conditions for a specific line comment style.
If the first non white-space character is an asterisk (*) the line is considered a comment.
I have created many samples that work in regexr but it never captures in vscode.
For example, regexr is cool with this:
^(?:\s*)\*+(?:.*)?\n
So I convert it into the proper format for the tmlanguage.json file:
^(?:\\s*)\\*+(?:.*)?\\n
But it is not capturing properly, if the first character of the line is an *, it does not catch, but if the first character is a whitespace character followed by an * it does work.
I suck at formatting on Stack Overflow, so <tab> represents a chr(9) tab character and <space> is a space.
*******************************
*****************************
<tab>*************************
* comment
* comment
<tab>* comment
But it shouldn't work in these cases:
string *******************************
string ***************************** string
<tab>string *************************
x *= 3
I am guessing that either the anchor ^ isn't working in my regex or I am escaping something incorrectly.
Any advice?
Please see sample image attached: screenshot
I don't know the regex engine you're using. I'm just going to give you some general tips on how it should be done.
First off, if you're reading a string with more than one newline in it, the anchor ^, in an engine's default state, means Beginning of String (BOS).
What you want in this case is Multi-Line Mode. This makes the anchor ^ match at the Beginning of Line (BOL) as well as the BOS.
Second, you don't need those non capture groups (?:\s*) (?:.*), they encapsulate single constructs.
Third, it is redundant to make a group optional when its enclosed contents are optional (?:.*)?
Fourth, you don't need the newline \n construct at the end, since it should not be highlighted anyway, and it might not be present on the last line of text.
The latter will make it not match.
So, putting it all together, the modified regex would be (?m)^\s*\*.*
Explained
(?m) # Inline modifier: Multi-line mode
^ # Beginning of line
\s* # Optional many whitespace
\* # Required at least a single asterisk
.* # Optional rest of non-newline characters
Note that you could put a single capture group around the data
if you need to reference it in a replace (?m)^(\s*\*.*)
Also, the language you're using should have a way to specify options when compiling the regex. If the engine doesn't accept inline modifiers (?m) take it out and specify that option when compiling the regex.
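A quick way to sanity-check the modified regex outside the editor is sketched below in JavaScript. It passes multi-line mode as the m flag, which is the option-based route mentioned above, and the sample text is adapted from the question with "\t" standing in for <tab>. It says nothing about how VS Code's own highlighter feeds lines to the engine.

```javascript
// Sample lines adapted from the question ("\t" stands in for <tab>).
const source = [
  "*******************************",
  " *****************************",
  "\t*************************",
  "* comment",
  " * comment",
  "\t* comment",
  "string *******************************",
  "string ***************************** string",
  "\tstring *************************",
  "x *= 3",
].join("\n");

// The "m" flag makes ^ match at the start of every line;
// "g" collects every comment line instead of only the first.
const commentLine = /^\s*\*.*/gm;
const comments = source.match(commentLine) || [];

console.log(comments.length); // 6
```

Note that \s can also match a newline, so on text containing blank lines a match may start on the blank line before a comment; [ \t]* is a stricter alternative if that matters.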
Apparently VS Code's syntax highlighter is single-line. No matter how many times I tried regexes that match across several lines, they never worked.
Second, if you're designing a language, I suggest not using an arithmetic operator for comments.
Third, apparently you can match newlines in the begin and end attributes. You can try it there.

Regex to find last occurrence of pattern in a string

My string being of the form:
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
I only want to match against the last segment of whitespace before the last period(.)
So far I am able to capture whitespace but not the very last occurrence using:
\s+(?=\.\w)
How can I make it less greedy?
In a general case, you can match the last occurrence of any pattern using the following scheme:
pattern(?![\s\S]*pattern)
(?s)pattern(?!.*pattern)
pattern(?!(?s:.*)pattern)
where [\s\S]* matches any zero or more chars as many as possible. (?s) and (?s:.) can be used with regex engines that support these constructs so as to use . to match any chars.
In this case, rather than \s+(?![\s\S]*\s), you may use
\s+(?!\S*\s)
See the regex demo. Note the \s and \S are inverse classes, thus, it makes no sense using [\s\S]* here, \S* is enough.
Details:
\s+ - one or more whitespace chars
(?!\S*\s) - that are not immediately followed with any 0 or more non-whitespace chars and then a whitespace.
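The specific pattern and the general scheme can be compared with a small JavaScript sketch. The three spaces before .COM are an assumption added here so that the matched run is easy to see; the rest of the string is the one from the question.

```javascript
// The question's string, with three spaces before ".COM" added here
// as an assumption so the matched run is easy to see.
const text = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd   .COM";

// Specific form from the answer: whitespace with no later whitespace.
const specific = text.match(/\s+(?!\S*\s)/);

// General scheme pattern(?![\s\S]*pattern), here with pattern = \s+.
const general = text.match(/\s+(?![\s\S]*\s+)/);

console.log(JSON.stringify(specific[0]));                     // "   "
console.log(text.slice(specific.index + specific[0].length)); // .COM
console.log(general.index === specific.index);                // true
```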
You can try like so:
(\s+)(?=\.[^.]+$)
(?=\.[^.]+$) Positive look ahead for a dot and characters except dot at the end of line.
Demo:
https://regex101.com/r/k9VwC6/3
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=((?<=\S)\s+)).*
replaced by `>\1<`
> <
As a more generalized example
This example defines several needles and finds the last occurrence of either one of them. In this example the needles are:
defined word findMyLastOccurrence
whitespaces (?<=\S)\s+
dots (?<=[^\.])\.+
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*
replaced by `>\1<`
>..<
Explanation:
Part 1 .*
is greedy and finds everything as long as the needles are found. Thus, it also captures all needle occurrences until the very last needle.
edit to add:
in case we are interested in the first hit, we can prevent the greediness by writing .*?
Part 2 (?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+))
defines the 'break' condition for the greedy 'find-all'. It consists of several parts:
(?=(needles))
positive lookahead: ensure that previously found everything is followed by the needles
findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+
several needles for which we are looking. Needles are patterns themselves.
In case we look for a collection of whitespaces, dots or other needleparts, the pattern we are looking for is actually: anything which is not a needlepart, followed by one or more needleparts (thus needlepart is +). See the example for whitespaces \s negated with \S, actual dot . negated with [^.]
Part 3 .*
as we aren't interested in the remainder, we match it and don't use it any further. We could capture it with parentheses and use it as another group, but that's out of scope here
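The two replacements shown in this answer can be reproduced in JavaScript, where the replacement reference is written $1 rather than \1. This is a sketch assuming an engine with lookbehind support (ES2018+).

```javascript
const text =
  "as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM";

// Single needle: the last whitespace run that follows a non-space character.
const lastSpace = text.replace(/.*(?=((?<=\S)\s+)).*/, ">$1<");

// Several needles: the greedy .* backs off from the end of the string until
// one of the alternatives matches, so group 1 holds the last needle found.
const lastNeedle = text.replace(
  /.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^.])\.+)).*/,
  ">$1<"
);

console.log(lastSpace);  // > <
console.log(lastNeedle); // >..<
```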
SIMPLE SOLUTION for a COMMON PROBLEM
All of the answers that I have read through are way off topic, overly complicated, or just simply incorrect. This question is a common problem that regex offers a simple solution for.
Breaking Down the General Problem
THE STRING
The generalized problem is such that there is a string that contains several characters.
THE SUB-STRING
Within the string is a sub-string made up of a few characters. Often this is a file extension (e.g. .c, .ts, or .json) or a top-level domain (e.g. .com, .org, or .io), but it could be something as arbitrary as MC Donald's Mulan Szechuan Sauce. The point is, it may not always be something simple.
THE BEFORE VARIANCE (Most important part)
The before variance is an arbitrary character, or characters, that always comes just before the sub-string. In this question, the before variance is an unknown amount of white-space. It's a variance because the amount of white-space that needs to be matched varies (or has a dynamic quantity).
Describing the Solution in Reference to the Problem
(Solution Part 1)
Often when working with regular expressions it's necessary to work in reverse.
We will start at the end of the problem described above and work backwards; that is, we are going to start at The Before Variance (or #3).
So, as mentioned above, The Before Variance is an unknown amount of white-space. We know that it includes white-space, but we don't know how much, so we will use the meta sequence for Any Whitespace with the one or more quantifier.
The Meta Sequence for "Any Whitespace" is \s.
The "One or More" quantifier is +
so we will start with...
NOTE: In ECMAScript regex literals, the / characters are like quotes around a string.
const regex = /\s+/g
I also included the g to tell the engine to set the global flag to true. I won't explain flags, for the sake of brevity, but if you don't know what the global flag does, you should DuckDuckGo it.
(Solution Part 2)
Remember, we are working in reverse, so the next part to focus on is the Sub-string. In this question it is .com, but the author may want it to match against a value with variance, rather than just the static string of characters .com, therefore I will talk about that more below, but to stay focused, we will work with .com for now.
It's necessary that we use a concept here that's called ZERO LENGTH ASSERTION. We need a "zero-length assertion" because we have a sub-string that is significant, but is not what we want to match against. "Zero-length assertions" allow us to move the point in the string where the regular expression engine is looking at, without having to match any characters to get there.
The Zero-Length Assertion that we are going to use is called LOOK AHEAD, and its syntax is as follows.
Look-ahead Syntax: (?=Your-SubStr-Here)
We are going to use the look ahead to match against a variance that comes before the pattern assigned to the look-ahead, which will be our sub-string. The result looks like this:
const regex = /\s+(?=\.com)/gi
I added the insensitive flag to tell the engine not to be concerned with the case of the letters. In other words, the regular expression /\s+(?=\.cOM)/gi is the same as /\s+(?=\.Com)/gi, and both are the same as /\s+(?=\.com)/gi and /\s+(?=\.COM)/gi. Every one of the regular expressions just listed is equivalent so long as the i flag is set.
That's it! The link HERE (REGEX101) will take you to an example where you can play with the regular expression if you like.
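For readers who prefer to run it locally, here is a minimal sketch of the finished expression applied to the question's string. The variable names are made up for illustration.

```javascript
const text = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM";

// Whitespace that is immediately followed by ".com", in any letter case.
const regex = /\s+(?=\.com)/gi;

const found = [...text.matchAll(regex)];
const stripped = text.replace(regex, "");

console.log(found.length);                                // 1
console.log(found[0].index === text.lastIndexOf(" "));    // true
console.log(stripped); // as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd.COM
```

The lookahead keeps .COM out of the match, so replacing the match removes only the whitespace.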
I mentioned above working with a sub-string that has more variance than .com.
You could use (\s*)(?=\.\w{3,}) for instance.
The problem with this regex, is even though it matches .txt, .org, .json, and .unclepetespurplebeet, the regex isn't safe. When using the question's string of...
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
as an example, you can see at the LINK HERE (Regex101) that there are 3 lines in the string. Those lines represent areas where the sub-string's lookahead assertion returned true. Each time the assertion was true, a possibility for an incorrect final match was created. Though only one match was returned in the end, and it was the correct one, when this is implemented in a program or website that's running in production, you can pretty much guarantee that the regex is not only going to fail, but it's going to fail horribly and you will come to hate it.
You can try this. It will capture the last white space segment - in the first capture group.
(\s+)\.[^\.]*$

What mistake did I do for this unexpected negative lookahead subpattern?

I am actually working with a .tsv database whose headers are full of meaningful things for me.
I thus wanted to rip them off from the header to something that I & others users (non proficient with relational databases, so we mostly use Excel in the end to organize data and process it) would be more able to handle with Excel, by breaking them up with tabs.
Example header:
>(name1)database-ID:database2-ID:value1:value2
(I know this seems strange to put values in an header but this is descriptive of parameters of the third value associated to the header, that we don't have to mess here)
output as:
name1\tdatabase-ID\tdatabase2-ID\tvalue1\tvalue2\n
I thus pasted my data (headers, one per line) in EmEditor (BOOST syntax) and came with this regex:
>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n
with each capturing group being then separated from others by inserting tabs between each others. It works, with perfect matches, no problem.
But I became aware there were malformed lines that didn't respect the logic of the whole database, and I wanted to make an expression to separate them at once.
If I make it with wrong lines it would be:
>(name1)database-ID:database2-ID:value1-1:value1-2\n
>(name2)database-ID:database2-ID:value2-1:value2-2\n
>(name3)database-ID:database2-ID:value3-1value3-2\n
Last line is ill-formed because it lacks the : between both last values.
I want it to be matched by working around the original expression that recognizes well-formed lines.
I perfectly know that I could come up with different solutions by slightly tweaking my first expression to eliminate the good lines and retrieve the malformed ones afterwards, but
I don't want a solution to my process, I just want to understand what I made not well there; so that I become more educated (and not just more tricky by being able to circumvent my mistakes that I can't resolve):
I tried a negation of the above mentioned expression:
([^(>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n)])
That doesn't match with anything.
I tried a negative lookahead, but It will be extremely, painfully slow then will match every 0-length matches possible in the document:
(?!(^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))
I thus added a group capture for a string of characters behind,
but it doesn't work either:
(?!(^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))(^.*?)
So please explain me where I have been wrong with the negating group ([^whatever]) and the use of the negative lookahead?
So please explain me where I have been wrong with the negating group ([^whatever]) and the use of the negative lookahead?
Let's address the question first: What does [^(pattern)] do?
You seem to have a misunderstanding and expect it to:
Match everything except the subpattern pattern. (Negation)
What it actually does is to:
Match any single character that isn't one of (, p, a, t, ... n, ).
Therefore, the pattern
([^(>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n)])
... Matches a single character that isn't one of (, >, (, ... \n, ).
As for the negative lookahead, you're simply doing it wrong. The anchor ^ is in the wrong position, therefore your assertion will fail to provide any useful help. It's also not what negative lookaheads are for altogether.
(?!(^>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))
I'll explain what this does:
(?! Open negative lookahead group: Assert the position does not match this pattern, without moving the pointer position.
( Capturing group. The use of capturing groups in negative lookaheads is useless, as the subpattern in a negative lookahead never matches when the assertion succeeds.
^ Assert position at start of string.
>\( Literal character sequence ">(".
(.*) Capturing group which matches as many characters as possible except newlines, then backtracks.
\) Literal character ")".
(.*?) Capturing group with reluctant zero-or-more match of any characters except newlines.
\: Literal character ":".
(.*?)\:(.*?)\:(.*?)
\n A new line.
) Closes capturing group.
) Closes negative lookahead group. When this assertion is finished, the pointer position is same as beginning, and thus the resulting match is zero-length.
Note that the anchor is nested within the negative lookahead group. It should be at the start:
^(?!(>\((.*)\)(.*?)\:(.*?)\:(.*?)\:(.*?)\n))
While this doesn't return anything useful, it explains what is wrong, since you don't need a solution. ;)
In case you are in need of a solution suddenly, please refer to this relevant answer of mine (I'm not adding anything else into the post):
Rails 3 - Precompiling all css, sass and scss files in a folder
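Both points in this answer can be seen in a short JavaScript sketch. It is not the EmEditor/Boost expression from the question: the character class used to describe a well-formed line, the use of $ with the m flag in place of \n, and the sample headers are assumptions made for illustration.

```javascript
// A character class negates single characters, not a whole sub-pattern:
// [^(ab)] means "one character that is not (, a, b or )".
const singleChar = /^[^(ab)]$/;
console.log(singleChar.test("a")); // false
console.log(singleChar.test("x")); // true

// Hypothetical header lines, modelled on the question's examples.
const headers = [
  ">(name1)database-ID:database2-ID:value1-1:value1-2",
  ">(name2)database-ID:database2-ID:value2-1:value2-2",
  ">(name3)database-ID:database2-ID:value3-1value3-2",
  ">(name3)database-ID::database2-ID:value3-1:value3-2",
].join("\n");

// Anchor first, then the negative lookahead describing a well-formed line,
// then .+ to consume whatever line failed that description.
const malformed =
  /^(?!>\([A-Za-z0-9_'-]*\)(?:[A-Za-z0-9_'-]*:){3}[A-Za-z0-9_'-]*$).+/gm;
const badLines = headers.match(malformed) || [];

console.log(badLines.length); // 2
```

With the anchor outside the lookahead, the assertion is tested once per line start, and the trailing .+ gives the match something to consume, so the result is the malformed lines themselves rather than a series of zero-length matches.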
You could do this simply through PCRE Verb (*SKIP)(*F). The below regex would match all the bad-lines.
(?:^>\([^()]*\):[^:]*:[^:]*:[^:]*:[^:\n]*$)(*SKIP)(*F)|^.+
DEMO
Based on what I have been reading from Unihedron;
This is what I came for in emEditor:
^(?!>\(([A-Za-z0-9_\'\-]*?)\)(([A-Za-z0-9_\'\-]*?)\:){3}([A-Za-z0-9_\'\-]*?)\n).*\n
>(name1)database-ID:database2-ID:value1-1:value1-2
(NOT MATCH)
>(name2)database-ID:database2-ID:value2-1:value2-2
(NOT MATCH)
>(name3)database-ID:database2-ID:value3-1value3-2
(MATCH)
>(name3)database-ID::database2-ID:value3-1:value3-2
(MATCH)
(the character class avoids discarding names that include special characters, without making it possible to have two subsequent ":".)
I also could achieve the same results with:
(?!^>\(([A-Za-z0-9_\'\-]*?)\)(([A-Za-z0-9_\'\-]*?)\:){3}([A-Za-z0-9_\'\-]*?)\n)^.*\n
So I guess that all along capturing groups were what was messing with my lookahead.
Now I acknowledge that Avinash Raj is more efficient with the (*SKIP)(*F)|^.+ pattern, just that I didn't know about those functions and I also wanted to understand my logic / syntax mistake. (Thanks to Unihedron for that)