I'm using a custom blog syndication tool and having problems with its regex syntax.
Example:
The original code
<img src="http://www.mydomain.com/some image.png">
I tried:
/\<img src\=\"http\:\/\/www\.mydomain\.com\/some\%20image\.png\"\>/
and
/\<img src\=\"http\:\/\/www\.mydomain\.com\/some image\.png\"\>/
But neither of them seems to work.
Any suggestions?
The pattern delimiters (/) at either end tell the regex engine where the pattern ends, so a space inside the pattern shouldn't confuse it. Are you sure there isn't something else wrong with the pattern? I suspect that is the most likely problem.
You might like to try the %20 without the % being escaped.
Another thing to try could be an escaped space (with a preceding backslash).
Otherwise, you could try using \s to match a space - it's reasonably standard in regex engines (but also matches tabs, line feeds and carriage returns).
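To compare the suggestions above, here is a minimal JavaScript sketch. The syndication tool's own regex engine is unknown, so this only shows how a typical engine treats the space. The names html, candidates and matchResults are made up for the illustration.

```javascript
// The tag from the question, with a literal space in the file name.
const html = '<img src="http://www.mydomain.com/some image.png">';

// Candidate patterns; only the part around the space differs.
const candidates = {
  literalSpace: /<img src="http:\/\/www\.mydomain\.com\/some image\.png">/,
  escapedSpace: /<img src="http:\/\/www\.mydomain\.com\/some\ image\.png">/,
  whitespaceClass: /<img src="http:\/\/www\.mydomain\.com\/some\simage\.png">/,
  percent20: /<img src="http:\/\/www\.mydomain\.com\/some%20image\.png">/,
};

// Report, for each candidate, whether it matches the tag.
function matchResults(text) {
  const results = {};
  for (const [name, pattern] of Object.entries(candidates)) {
    results[name] = pattern.test(text);
  }
  return results;
}

console.log(matchResults(html));
```

In this engine the literal space, the escaped space and \s all match, while the %20 variant only matches if the source text really contains the characters %20.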
I'm doing the RegexOne regex tutorial and it has a question about writing a regular expression to remove unnecessary whitespace.
The solution provided in the tutorial is
We can just skip all the starting and ending whitespace by not capturing it in a line. For example, the expression ^\s*(.*)\s*$ will catch only the content.
The setup for the question does indicate the use of the hat at the beginning and the dollar sign at the end, so it makes sense that this is the expression that they want:
We have previously seen how to match a full line of text using the hat ^ and the dollar sign $ respectively. When used in conjunction with the whitespace \s, you can easily skip all preceding and trailing spaces.
That said, using \S instead, I was able to come up with what seems like a simpler solution - (\S.*\S).
I've found this Stack Overflow solution that matches the one in the tutorial - Regex Email - Ignore leading and trailing spaces? and I've seen other guides that recommend the same format, but I'm struggling to find an explanation for why the \S is bad.
Additionally, this validates as correct in their tool... so, are there cases where this would not work as well as the provided solution? Or is the recommended version just a standard format?
The tutorial's solution of ^\s*(.*)\s*$ is wrong. The capture group .* is greedy, so it will expand as much as it can, all the way to the end of the line - it will capture trailing spaces too. Because the \s* that follows can match the empty string, the .* never needs to backtrack, so that \s* will never consume any characters.
https://regex101.com/r/584uVG/1
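The greedy behaviour can be checked with a small JavaScript sketch (the sample line and the helper name firstGroup are made up for illustration):

```javascript
// Return the first capture group of `pattern` applied to `text`, or null.
function firstGroup(pattern, text) {
  const match = pattern.exec(text);
  return match ? match[1] : null;
}

const line = '   some content   ';

// Tutorial pattern: the greedy (.*) swallows the trailing spaces.
const tutorialCapture = firstGroup(/^\s*(.*)\s*$/, line);

// Forcing the group to end in a non-space character avoids that.
const anchoredCapture = firstGroup(/^\s*(.*\S)\s*$/, line);

console.log(JSON.stringify(tutorialCapture)); // "some content   "
console.log(JSON.stringify(anchoredCapture)); // "some content"
```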
Your solution is much better at actually matching only the non-whitespace content in the line, but there are a couple of odd cases where the two patterns differ. (\S.*\S) needs at least two non-whitespace characters, so it will not match a line whose content is a single character. The tutorial's (.*) can capture a single character, and it captures nothing at all if the input is composed entirely of whitespace.
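A short JavaScript sketch of those edge cases (the helper capture and the sample strings are illustrative only):

```javascript
// Return capture group 1, or null when the pattern does not match at all.
function capture(pattern, text) {
  const match = pattern.exec(text);
  return match ? match[1] : null;
}

const twoEnded = /(\S.*\S)/;      // the pattern from the question
const tutorial = /^\s*(.*)\s*$/;  // the pattern from the tutorial

const cases = {
  twoEndedNormal: capture(twoEnded, '  ab  '),   // 'ab'
  twoEndedSingle: capture(twoEnded, '  a  '),    // null: needs two non-space characters
  tutorialSingle: capture(tutorial, '  a  '),    // 'a  ': matches, but keeps trailing spaces
  tutorialBlank: capture(tutorial, '    '),      // '': matches with an empty capture
};

console.log(cases);
```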
But, given the problem description at your link:
Occasionally, you'll find yourself with a log file that has ill-formatted whitespace where lines are indented too much or not enough. One way to fix this is to use an editor's search a replace and a regular expression to extract the content of the lines without the extra whitespace.
From this, matching only the non-whitespace content (like you're doing) probably wouldn't remove the undesirable leading and trailing spaces. The tutorial is probably thinking to guide you towards a technique that can be used to match a whole line with a particular pattern, and then replace that line with only the captured group, like:
Match ^\s*(.*\S)\s*$, replace with $1: https://regex101.com/r/584uVG/2/
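Outside an editor, the same match-and-replace idea could look like the JavaScript sketch below. It processes one line at a time, because with the multiline flag \s* may also consume newlines and merge lines. The function name trimLines is hypothetical.

```javascript
// Replace each line with its captured content; lines that contain
// only whitespace do not match the pattern and are left unchanged.
function trimLines(text) {
  return text
    .split('\n')
    .map((line) => line.replace(/^\s*(.*\S)\s*$/, '$1'))
    .join('\n');
}

console.log(trimLines('   foo  \nbar\n\t baz qux \t'));
```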
Your technique would work given the problem if you had a way to make a new text file containing only the captured groups (or all the full matches), eg:
const input = ` foo
bar
baz
qux `;
const newText = (input.match(/\S(?:$|.*\S)/gm) || [])
.join('\n');
console.log(newText);
Using \S instead of . is not bad. If you know a particular position must be matched by a non-space character rather than by a space, \S is more precise, makes the intent of the pattern clearer, lets a bad match fail faster, and can avoid problems with catastrophic backtracking in some cases. These patterns don't have backtracking issues, but it's still a good habit to get into.
I currently have this regex: this\.(.*)?\s[=,]\s, however I have come across a pickle I cannot fix.
I tried the following Regex, which works, but it captures the space as well which I don't want: this\.(.*)?(?<=\s)=|(?<!\s),. What I'm trying to do is match identifier names. An example of what I want and the result is this:
this.""W = blah; which would match ""W. The second regex above does this almost perfectly, however it also captures the space before the = in the first group. Can someone point me in the correct direction to fix this?
EDIT: The reason for not simply using [^\s] in the wildcard group is that sometimes I can get lines like this: this. "$ = blah;
EDIT2: Now I have another issue. It's not matching lines like param1.readBytes(this.=!3,0,param1.readInt()); properly. Instead of matching =!3 it's matching =!3,0. Is there a way to fix this? Again, I cannot simply use a [^,] because there could be a name like param1.readBytes(this.,3$,0,param1.readInt()); which should match ,3$.
(.*) will match any character including whitespace.
To force it not to end in whitespace, change it to (.*[^\s]), which is the same as (.*\S).
Eg:
this\.(.*[^\s])?\s?[=,]\s
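As a rough check, here is the pattern above applied to the two sample lines from the question in JavaScript. The question does not say which regex engine is in use, so treat this as a sketch; identifierOf is a made-up helper name.

```javascript
// The suggested pattern: the group must end in a non-whitespace character.
const identifierPattern = /this\.(.*[^\s])?\s?[=,]\s/;

// Return the captured identifier, or null when the line does not match.
function identifierOf(line) {
  const match = identifierPattern.exec(line);
  return match ? match[1] : null;
}

console.log(JSON.stringify(identifierOf('this.""W = blah;'))); // "\"\"W"
console.log(JSON.stringify(identifierOf('this. "$ = blah;'))); // " \"$"
```

The trailing space before = is no longer captured, while a leading space inside the name (as in the first EDIT) is kept.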
For your second edit, it seems like you are doing a language parser. Even though regular expressions are powerful, they do have limits. You need a grammar parser for that.
Maybe you can tell in your first block to capture non space characters, instead of any.
this\.(\S*)?(?<=\s)=|(?<!\s),
I'm trying to append "/index.html" to some folder paths in a list like this:
path/one/
/another/index.html
other/file/index.html
path/number/two
this/is/the/third/path/
path/five
sixth/path/goes/here/
Obviously the text only needs to be added where it does not exist yet. I could achieve some good results with (vim command):
:%s/^\([^.]*\)$/\1\/index.html/
The only problem is that after running this command, some lines like the 1st, 5th and 7th in the previous example end up with duplicated slashes. That's easy to solve too; all I have to do is search for duplicates and replace them with a single slash.
But the question is:
Isn't there a better way to achieve the correct result at once?
I'm a Vim beginner, and not a regex master either. Any tips are really appreciated!
Thanks!
So very close :)
Just add an optional slash to the end of the regex:
\/\?
Then you need to change the rest of the pattern to a non-greedy match so that it ignores a trailing slash. The syntax for a non-greedy match in vim (replacing the *) is:
\{-}
So we end up with:
:%s/^\([^\.]\{-}\)\/\?$/\1\/index.html/
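(Escaping the period is not required here: inside [] a period is already literal. In Vim, [^\.] additionally excludes backslashes, which is harmless for these paths.)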
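The same idea (a non-greedy body followed by an optional trailing slash) can be tried outside Vim. The JavaScript sketch below is only an analogue of the substitute command, not Vim syntax, and appendIndex is a hypothetical name.

```javascript
// Lines without a period get "/index.html" appended; the lazy [^.]*?
// leaves a trailing slash to the optional \/? so it is not duplicated.
function appendIndex(line) {
  return line.replace(/^([^.]*?)\/?$/, '$1/index.html');
}

const paths = [
  'path/one/',
  '/another/index.html',
  'other/file/index.html',
  'path/number/two',
  'this/is/the/third/path/',
  'path/five',
  'sixth/path/goes/here/',
];

console.log(paths.map(appendIndex).join('\n'));
```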
Vim's regex supports matching a bit of text foo only if it is, or is not, preceded or followed by some other text bar, without matching bar itself, and this is exactly the sort of thing you're looking for. Here you want to match the end of line with an optional /, but only if it isn't preceded by index.html, and then replace it with /index.html. A quick look at Vim's help tells me \@<! is exactly what to use. It tells Vim that the preceding atom must not match just before the current position, and it does not become part of the match. With a little experimentation, I get
:%s;/\?\(index\.html\)\@<!$;/index.html;
I use ; to delimit the parts of the :s command so that I don't have to escape any / in the regex or replacement expression. In this particular situation, it's not a big deal though.
The / is optional, and we say so with \?.
We need to group index.html together because otherwise our special \@<! would only affect the l.
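For comparison, the negative look-behind approach can be sketched in JavaScript, which writes it as (?<!...). This is an analogue rather than Vim syntax, it assumes an engine with look-behind support (for example a recent Node.js), and appendIndexLookbehind is a made-up name.

```javascript
// Match an optional trailing slash at the end of the line, but only
// when the line does not already end in "index.html".
function appendIndexLookbehind(line) {
  return line.replace(/\/?(?<!index\.html)$/, '/index.html');
}

const samplePaths = [
  'path/one/',
  '/another/index.html',
  'other/file/index.html',
  'path/number/two',
  'this/is/the/third/path/',
  'path/five',
  'sixth/path/goes/here/',
];

console.log(samplePaths.map(appendIndexLookbehind).join('\n'));
```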
I just stumbled on a case where I had to remove quotes surrounding a specific regex pattern in a file, and the immediate conclusion I came to was to use vim's search and replace util and just escape each special character in the original and replacement patterns.
This worked (after a little tinkering), but it left me wondering if there is a better way to do these sorts of things.
The original regex (quoted): '/^\//' to be replaced with /^\//
And the search/replace pattern I used:
s/'\/\^\\\/\/'/\/\^\\\/\//g
Thanks!
You can use almost any character as the regex delimiter. This will save you from having to escape forward slashes. You can also use groups to extract the regex and avoid re-typing it. For example, try this:
:s#'\(\\^\\//\)'#\1#
I do not know if this will work for your case, because the example you listed and the regex you gave do not match up. (The regex you listed will match '/^\//', not '\^\//'. Mine will match the latter. Adjust as necessary.)
Could you avoid using regex entirely by using a nice simple string search and replace?
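If the file can be processed by a script instead of inside Vim, a plain string replacement avoids escaping altogether. The JavaScript sketch below is one way to do it; the sample input line and the name unquote are invented for the example.

```javascript
// String.raw keeps backslashes as written, so the two literals read
// exactly like the text in the file.
const quoted = String.raw`'/^\//'`;
const bare = String.raw`/^\//`;

// split/join replaces every occurrence without any regex or "$" handling.
function unquote(text) {
  return text.split(quoted).join(bare);
}

console.log(unquote(String.raw`preg_match('/^\//', $path)`));
```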
Please check whether this works for you. Either give a line number before the substitute command or place the cursor on the line:
:s:'\(.*\)':\1:
I used Vim 7.1 for this. You can also visually select an area first (use "v" or "V" and move the cursor accordingly) and run the expression on that selection.
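The capture-group idea behind that command can be sketched in JavaScript as follows (stripQuotes is a hypothetical name). Note that the greedy .* spans from the first quote on the line to the last one.

```javascript
// Remove the outermost pair of single quotes on a line,
// keeping whatever was between them.
function stripQuotes(line) {
  return line.replace(/'(.*)'/, '$1');
}

console.log(stripQuotes(String.raw`pattern = '/^\//';`));
console.log(stripQuotes("'a' and 'b'")); // only the first and last quote are removed
```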
I am trying to get an issueUrlBuilder to work in my CruiseControl.NET config, and cannot figure out why they aren't working.
The first one I tried is this:
<cb:define name="issueTracker">
<issueUrlBuilder type="regexIssueTracker">
<find>^.*Issue (\d*).|\n*$</find>
<replace>https://issuetracker/ViewIssue.aspx?ID=$1</replace>
</issueUrlBuilder>
</cb:define>
Then, I reference it in the sourceControl block:
<sourcecontrol type="vaultplugin">
...
<issueTracker/>
</sourcecontrol>
My checkin comments look like this:
[Issue 1234] This is a test comment
I cannot find anywhere in the build reports/logs/etc. where that issue number is converted to a link. Is my regex wrong?
I've also tried the default issueUrlBuilder:
<cb:define name="issueTracker">
<issueUrlBuilder type="defaultIssueTracker">
<url>https://issuetracker/ViewIssue.aspx?ID={0}</url>
</issueUrlBuilder>
</cb:define>
Again, same comments and no links anywhere.
Does anyone have any ideas?
It looks like you're trying to match a potentially multiline comment by using .|\n instead of just ., which doesn't match newlines by default. Your first problem is that | has the lowest precedence of all regex constructs, so it's dividing your whole regex into the alternatives ^.*Issue (\d*). or \n*$. You would need to enclose the alternation in a group: (?:.|\n)*.
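How the ungrouped | splits the pattern can be seen in a small JavaScript sketch. The two-line comment is an invented sample, and JavaScript stands in for whatever engine CCNET uses.

```javascript
// A sample check-in comment with a second line (made up for the demo).
const comment = '[Issue 1234] This is a test comment\nwith a second line';

// Original pattern: read as (^.*Issue (\d*).) OR (\n*$).
const ungrouped = /^.*Issue (\d*).|\n*$/;

// Grouped alternation: "any character or newline", repeated.
const grouped = /^.*Issue (\d*)(?:.|\n)*$/;

const ungroupedMatch = ungrouped.exec(comment);
const groupedMatch = grouped.exec(comment);

console.log(JSON.stringify(ungroupedMatch[0])); // "[Issue 1234]"
console.log(groupedMatch[0] === comment);       // true: the whole comment is matched
console.log(groupedMatch[1]);                   // 1234
```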
Another potential problem is that the lines might be separated by \r\n (carriage-return plus linefeed) instead of just \n. If CCNET uses the .NET regex engine under the hood, that won't be a problem because the dot matches \r. But that's not true of all flavors, and anyway, there's always a better way to match anything including newlines than (?:.|\n)*. I suggest you try
<find>^.*Issue (\d*)(?s:.*)$</find>
or
<find>(?s)^.*Issue (\d*).*$</find>
(?s) and (?s:...) are inline modifiers which allow the dot to match line separator characters.
EDIT: It looks like this is a known bug in CCNET. If the inline modifier doesn't work, try replacing . with [\s\S], as you would in a JavaScript regex. Example:
<find>^.*Issue (\d*)[\s\S]*$</find>
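To sanity-check the find/replace pair outside CCNET, here is a JavaScript sketch. How CCNET applies the pattern internally is not shown in this thread, so the helper issueUrl and the sample comment are assumptions. JavaScript's s flag plays the role of the (?s) modifier.

```javascript
// [\s\S] matches any character, including line separators.
const findAnyChar = /^.*Issue (\d*)[\s\S]*$/;

// Equivalent using the dotAll flag, so "." also matches line separators.
const findDotAll = /^.*Issue (\d*).*$/s;

const replacement = 'https://issuetracker/ViewIssue.aspx?ID=$1';

// Turn a check-in comment into an issue URL using the given pattern.
function issueUrl(comment, pattern) {
  return comment.replace(pattern, replacement);
}

const sample = '[Issue 1234] This is a test comment\r\nmore details';
console.log(issueUrl(sample, findAnyChar));
console.log(issueUrl(sample, findDotAll));
```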