Regex to replace whitespace in Markdown URLs - regex

I've a bunch of Markdown links with whitespace, and I need to replace the whitespace with %20. So far I've hacked a few solutions, but none that work in VSCode, or do exactly what I'm looking for.
This is the URL format conversion I need:
[My link](../../_resources/my resource.jpg)
[My link](../../_resources/my%20resource.jpg)
\s+(?=[^(\)]*\)) will work on any whitespace inside brackets - but gives false positives as it works on anything with brackets.
(?:\]\(|(?!^)\G)[^]\s]*\K\h+ does the job, but I'm getting some "Invalid Escape Character" messages in VSCode, so I assume the language isn't compatible.
I've been trying to identify the link on the characters ]( but as I'm relatively new to regex, struggling a bit.
I tried with this regex: (?<=\]\()s\+ as this (?<=\]\().+ correctly identifies the url, but it doesn't work.
Where am I going wrong here? Thanks in advance!
EDIT: VSCode find in files doesn't support variable length lookbehind, even though find/replace in the open file does support this. Open to any other solutions before I dive into writing a script!

VSCode regex does not support \K, \G, or \h, but it does support Lookbehinds with non-fixed width. So, you may use something like the following:
(?<=\]\([^\]\r\n]*)[^\S\r\n]+

You can use
(?<=\]\([^\]]*)\s+(?=[^()]*\))
Replace with %20.
Details:
(?<=\]\([^\]]*) - a positive lookbehind that matches a location that is immediately preceded with ]( and then any zero or more chars other than ]
\s+ - any one or more whitespace chars (other than line break chars: in Visual Studio Code, if there is no \n or \r in the regex, \s does not match line break chars)
(?=[^()]*\)) - a positive lookahead that matches a location that is immediately followed with zero or more chars other than ( and ) and then a ) char.
Since you are using it in Find/Replace in Files, this lookbehind solution won't work.
You can use Notepad++ with
(\G(?!\A)|\[[^][]*]\()([^()\s]*)\s+(?=[^()]*\))
and $1$2%20 replacement pattern. In Notepad++, press CTRL+SHIFT+F and after filling out the necessary fields, hit Replace in Files.

In the end, as I'm on a Mac and didn't want to fire up a virtual PC to run Notepad++ (Sublime uses the same engine and Atom doesn't allow you to exclude files), I used a combination of a Python script and @Wiktor Stribizew's answer for individual files that weren't picked up by the pattern for whatever reason.
import os
import re
# Captures "[text]" (group 1) and the URL between the parentheses (group 3)
md_url_pattern = r'(\[(.+)\])\(([^\)]+)\)'
def remove_spacing(match_obj):
    # Rebuild the link with each run of whitespace in the URL replaced by %20
    fixed = match_obj.group(1) + "(" + re.sub(r"\s+", "%20", match_obj.group(3)) + ")"
    print("Match Object: " + fixed)
    return fixed
# THIS_FOLDER = os.path.dirname(os.path.abspath(__file__))
this_folder = '<my_document_folder>'  # fixed folder path
note_path = '<note_folder>'  # change this
full_path = os.path.join(this_folder, note_path)
directory = os.listdir(full_path)
os.chdir(full_path)
for file in directory:
    # Read the whole note and close the handle before reopening it for writing
    with open(file, 'r') as open_file:
        read_file = open_file.read()
    read_file = re.sub(md_url_pattern, remove_spacing, read_file)
    if not read_file:
        print("Empty file!")
    else:
        with open(file, 'w') as write_file:
            write_file.write(read_file)
This script could do with a bit of tidying up and debugging (the odd weird empty file and no subfolder compatibility) but it was the best I could do.
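One possible tidy-up of the script above, as a sketch only: it assumes the notes are UTF-8 .md files and that link targets contain no nested parentheses. The names encode_link_spaces and fix_tree are made up for illustration. It anchors on the ]( sequence from the question and walks subfolders with pathlib.

```python
import re
from pathlib import Path

# "](" followed by a target without parentheses or line breaks, then ")"
LINK_TARGET = re.compile(r"(\]\()([^()\n]*)(\))")


def encode_link_spaces(text):
    """Replace each run of spaces/tabs inside a Markdown link target with %20."""
    def fix(match):
        target = re.sub(r"[^\S\r\n]+", "%20", match.group(2))
        return match.group(1) + target + match.group(3)
    return LINK_TARGET.sub(fix, text)


def fix_tree(root):
    """Rewrite every .md file under root (subfolders included) that needs changes."""
    changed = []
    for path in Path(root).rglob("*.md"):
        original = path.read_text(encoding="utf-8")
        updated = encode_link_spaces(original)
        if updated != original:  # only touch files that actually change
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Calling fix_tree('<my_document_folder>') would return the list of files it rewrote. Files without matching links are never opened for writing, which avoids accidentally truncating them.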

Related

Using Regex selecting text match everything after a word or patterns (similar topic but text is not fix patterns except 1 character)

I am trying to use Regex in notepad++ to select everything after v+(number|character)*, but the selection should exclude the v+(num|char)* part itself.
e.g. master\_\move_consolidate_archives_html_to_move_base_v2kjkj_(2021_01_19_11h43m59s-fi_m_dt xx-) - Copy (2).bat"
I am expecting
_(2021_01_19_11h43m59s-fi_m_dt xx-) - Copy (2).bat"
so far I can use this line (?i)(v\d[0-9a-z]*)
to select v2kjkj
but I can't get this to work with lookbehind (?<=xxxx).
I am also trying to use an if-then-else condition, but no luck so far. I still don't understand it well enough to use it.
The issue is that the text after the "v" follows different patterns, so I can't hard-code a fixed string:
v2
v23
v2kjkj
v2343434
Test string:
mmaster\_\move_consolidate_archives_html_to_move_base_v2_16_.bat"
master\_\move_consolidate_archiv es_html_to_move_base_v23_17_.bat"
master\_\move_consolidate_archives_html_to_move_base_v2_17_(2021_01_19_12h37m19s-fi_m_dt xx-).bat"
master\_\move_consolidate_archives_html_to_move_base_v2_(2021_01_19_11h43m59s-fi_m_dt xx-) - CopyCopy.bat"
master\_\move_consolidate_archives_html_to_move_base_v2kjkj_(2021_01_19_11h43m59s-fi_m_dt xx-) - Copy (2).bat"
master\_\move_consolidate_archives_html_to_move_base_v2343434_(2021_01_19_11h43m59s-fi_m_dt xx-) - Copy (3).bat"
I have been reading and searching for a day but I can't apply anything I have seen so far.
The closest ones I found were:
Regexp match everything after a word
Getting the text that follows after the regex match
I welcome any comments.
Ctrl+H
Find what: v\d[0-9a-z]*\K.*$
Replace with: LEAVE EMPTY
UNCHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
v # a "v"
\d # a digit
[0-9a-z]* # 0 or more alphanum
\K # forget all we have seen until this position
.* # 0 or more any character but newline
$ # end of line
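Python's built-in re module has no \K, so the same idea can be sketched there with a capturing group instead. This is an illustration only; the function name text_after_version is hypothetical and not part of the answer above.

```python
import re

# "v", a digit, then any further letters/digits; group 1 is everything after it
VERSION_TAG = re.compile(r"v\d[0-9a-z]*(.*)$", re.IGNORECASE)


def text_after_version(line):
    """Return the text following the first v<digit>... tag, or None if absent."""
    match = VERSION_TAG.search(line)
    return match.group(1) if match else None
```

For the v2kjkj test string in the question, this returns the part starting at _(2021_01_19_11h43m59s... and leaves the version tag out of the result.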

Exclude special characters with Regex

I am trying to get version number of my assembly. I use regex for it and here is my pattern.
$pattern = '\[assembly: AssemblyVersion\("(.*)"\)\]'
It works well, but in AssemblyInfo.cs and AssemblyInfo.vb there are some special characters, for example
in cs file
// by using the '*' as shown below:
// [assembly: AssemblyVersion("1.0.*")]
[assembly: AssemblyVersion("3.7.1.0")]
in .vb file
' <Assembly: AssemblyVersion("1.0.*")>
<Assembly: AssemblyVersion("3.2.0.0")>
<Assembly: AssemblyFileVersion("1.0.0.0")>
So I want to exclude the // and ' characters in my pattern. I tried to exclude them with [^//] but it does not work. I tried something else but it did not work either.
And the second question is
in .vb file, there are different starting.
<Assembly: AssemblyVersion("3.2.0.0")>
and in c# file there are different starting
[assembly: AssemblyVersion("3.7.1.0")]
How can I also include the vb version in my pattern?
Where is the problem?
You can use negative lookbehind if your library supports it
(?<!\/\/\s|\'\s)\[assembly: AssemblyVersion\("(.*)"\)\]
Edit:
For matching different brackets you can use a character class such as [<\[] for the opening bracket and [>\]] for the closing one
(?<!\/\/\s|\'\s)[<\[][Aa]ssembly: AssemblyVersion\("(.*)"\)[>\]]
You want to "exclude" rows which start with either / or '.
Start by setting the m (multi-line) flag in your regex.
It ensures that ^ matches start of each line (not the whole string).
Then start the regex from:
^ - acting now (in multi-line mode) as a row start marker,
(?!\/|') - a negative lookahead group with two variants
inside (do not allow that the line starts from either / or '),
\s* - an optional sequence of spaces.
and then goes your regex.
So the whole regex should look like below:
^(?!\/|')\s*\[assembly: AssemblyVersion\("(.*)"\)\]
(remember about the m flag).
The negative lookbehind solution mentioned in the other answer has a flaw:
if a row starts with ' or // but has no space after it,
that regex still matches the commented-out line.
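As a rough Python sketch of the line-anchored approach, combined with the bracket character classes from the other answer so that both the .cs and .vb forms are covered. The name assembly_versions is made up for the example, and allowing leading spaces or tabs before the comment marker is an added assumption.

```python
import re

# ^ anchors at each line start (MULTILINE); the lookahead rejects lines whose
# first non-blank characters are // or ' (i.e. commented-out attributes).
VERSION_LINE = re.compile(
    r"""^(?![ \t]*(?://|'))[ \t]*[<\[][Aa]ssembly: AssemblyVersion\("([^"]*)"\)[>\]]""",
    re.MULTILINE,
)


def assembly_versions(source):
    """Return the version strings of all uncommented AssemblyVersion attributes."""
    return VERSION_LINE.findall(source)
```

AssemblyFileVersion lines are not picked up, because the pattern requires the opening parenthesis directly after AssemblyVersion.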

replace all line breaks not preceded by a period with a regular expression?

Is it possible to select only line breaks that are not preceded by a period using regular expressions?
I am editing subtitle files for students. To make the printed version dead tree friendly I am trying to replace all the line breaks not preceded by a period or question mark with a space.
option 1
Select all the line breaks not preceded by a period or question mark. The regex [a-z]\n works for that, but it of course also selects the last letter of the word before the line break.
-> Is it possible to somehow capture the last letter of the word before the line break and insert it together with a space using regular expressions, or do I have to write a script for that (say in PHP)?
option 2
Select only line breaks that are preceded by a character. I tried looking into lookbehind.
While writing this question the solution hit me.
To select a line break preceded by a character do (?<=[a-z])\n and then replace with a space.
I searched stack overflow and could not really find what I was looking for. I hope I will not offend anybody by posting the question and solution at the same time. It might help someone else in the future.
I have had this problem recently, I solved it like this:
search:
"(?<!\.|\?)(\r\n)+([^?\.]+)"
replace: (Be careful! There is a space!!)
" $2"
(?<!\.|\?) -> There can't be ./?
(\r\n)+ -> one or more newlines
([^?\.]+) -> selects everything of the new line except ?/.
" $2" -> second capture group with SPACE before.
I used Regex Buddy, if it doesn't work for you, I can try to convert it for you to another programming language using Regex Buddy.
The syntax can vary depending on what you are using to replace the text (Java, Perl, PHP, sed, vi, etc.).
In Java you could try this :
str.replaceAll("([^\\.!?])\r?\n", "$1 ").replaceAll(" +", " ");
In perl :
perl -p -e 's/([^\.!?])\n/\1 /g; s/ +/ /g;' file.txt
You can also read this answer to a similar question :
How can I replace a newline (\n) using sed?
Let's define a line break first. In some regex flavors, Java 8 / PHP (PCRE), Ruby (Onigmo), you may use a \R shorthand character class that matches any line break style. In Java 8 regex reference, \R is defined as:
\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
Now, you want to find this pattern if it is not preceded with . char. You need to use a negative lookbehind, (?<!\.). It fails the match once it finds a . immediately to the left of the current location. So, here are some examples of how to remove the line break not preceded with a dot in some languages:
PHP (demo): preg_replace('~(\.\R+)|\R+~', '$1', $s)
Java 7 (demo): String rx_R = "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])"; String res = s.replaceAll("(\\." + rx_R + ")|" + rx_R, "$1");
Ruby (demo): s.gsub(/(\.\R+)|\R+/, '\1')
C# (see demo): var rx_R = @"(?:\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])"; var res = Regex.Replace(txt, $@"(\.{rx_R})|{rx_R}", "$1");
Python (both 2.x and 3.x) (demo): rx_R = r'(?:\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])' and then re.sub(r'(\.{0})|{0}'.format(rx_R), lambda x: x.group(1) if x.group(1) else '', s)
JavaScript: older engines (before ES2018) have no support for lookbehind, so use a ([^.]|^) capturing group and a backreference ($1 to reference it from the replacement string) to keep the char other than . before a line break:
var s = "Line1\u000D\u000A Line2\u000B Line3\u000C Line4\u0085 Line5\u2028 Line6\u2029 Line7";
var rx = /([^.]|^)(?:\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])/g;
console.log(s.replace(rx, '$1'));
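For the original subtitle use case (join lines unless the previous line ends in a period or question mark), a minimal Python sketch with a fixed-width negative lookbehind might look like this. The name join_soft_breaks is hypothetical, and the code only handles \n and \r\n line endings.

```python
import re

# A line break that is NOT preceded by "." or "?".
# "\r" is also excluded in the lookbehind so that the "\n" of a ".\r\n"
# pair is not matched on its own after the "\r" has been skipped.
SOFT_BREAK = re.compile(r"(?<![.?\r])\r?\n")


def join_soft_breaks(text):
    """Replace line breaks that do not end a sentence with a single space."""
    return SOFT_BREAK.sub(" ", text)
```

Line breaks after a period or question mark are left untouched, so sentence boundaries stay on their own lines.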

Regex for extracting filename from path

I need to extract just the filename (no file extension) from the following path....
\\my-local-server\path\to\this_file may_contain-any&character.pdf
I've tried several things, most based off of something like http://regexr.com?302m5 but can't quite get there
^\\(.+\\)*(.+)\.(.+)$
This regex has been tested on these two examples:
\var\www\www.example.com\index.php
\index.php
First block "(.+\\)*" matches directory path.
Second block "(.+)" matches file name without extension.
Third block "(.+)$" matches extension.
This will get the filename but will also get the dot. You might want to truncate the last character (the dot) from it in your code.
[\w-]+\.
Update
@Geoman if you have spaces in the file name then use the modified pattern below
[ \w-]+\. (space added in brackets)
This is just a slight variation on @hmd's so you don't have to truncate the .
[ \w-]+?(?=\.)
Really, thanks goes to @hmd. I've only slightly improved on it.
Try this:
[^\\]+(?=\.pdf$)
It matches everything except back-slash followed by .pdf at the end of the string.
You can also (and maybe it's even better) take the part you want into the capturing group like that:
([^\\]+)\.pdf$
But how you refer to this group (the part in parentheses) depends on the language or regexp flavor you're using. In most cases it'll be something like $1, or \1, or the library will provide some method for getting a capturing group by its number after a regexp match.
I use @"[^\\]+$"
That gives the filename including the extension.
I'm using this regex to replace the filename of the file with index. It matches a contiguous string of characters that doesn't contain a slash and is followed by a . and a string of word characters at the end of the string. It will retrieve the filename including spaces and dots but will ignore the full file extension.
const regex = /[^\\/]+?(?=\.\w+$)/
console.log('/path/to/file.png'.match(regex))
console.log('/path/to/video.webm'.match(regex))
console.log('/path/to/weird.file.gif'.match(regex))
console.log('/path with/spaces/and file.with.spaces'.match(regex))
If anyone is looking for a JavaScript regular expression for Windows absolute (and relative) file paths:
var path = "c:\\my-long\\path_directory\\file.html";
(/(\w?\:?\\?[\w\-_\\]*\\+)([\w-_]+)(\.[\w-_]+)/gi).exec(path);
Output is:
[
"c:\my-long\path_directory\file.html",
"c:\my-long\path_directory\",
"file",
".html"
]
Here's a slight modification to Angelo's excellent answer that allows for spaces in the path, filename and extension as well as missing parts:
function parsePath (path) {
  var parts = (/(\w?\:?\\?[\w\-_ \\]*\\+)?([\w-_ ]+)?(\.[\w-_ ]+)?/gi).exec(path);
  return {
    path: parts[0] || "",
    folder: parts[1] || "",
    name: parts[2] || "",
    extension: parts[3] || "",
  };
}
If you want to return the file name with its extension, Regex should be as below:
[A-Za-z0-9_\-\.]+\.[A-Za-z0-9]+$
works for
path/to/your/filename.some
path/to/your/filename.some.other
path\to\your\filename.some
path\to\your\filename.some.other
http://path/to/your/filename.some
http://path/to/your/filename.some.other
And so on
Which returns full file name with extension(eg: filename.some or filename.some.other)
If you want to return file name without the last extension Regex should be as below:
[A-Za-z0-9_\-\.]+(?=\.[A-Za-z0-9]+$)
Which returns full file name without last extension(eg: "filename" for "filename.some" and "filename.some" for "filename.some.other")
This is specific to the pdf extension.
^.+\\([^.]+)\.pdf$
This works for any extension, not just pdf.
^.+\\([^.]+)\.[^\.]+$
([^.]+)
This is the $1 capture group to extract the filename without the extension.
\\my-local-server\path\to\this_file may_contain-any&character.pdf
will return
this_file may_contain-any&character
^(.*[\\\/])?(.*?)(\.[^.]*?|)$
example:
/^(.*[\\\/])?(.*?)(\.[^.]*?|)$/.exec("C:\\folder1\\folder2\\foo.ext1.ext")
result:
0: "C:\folder1\folder2\foo.ext1.ext"
1: "C:\folder1\folder2\"
2: "foo.ext1"
3: ".ext"
the $1 capture group is the folder
the $2 capture group is the name without extension
the $3 capture group is the extension (only the last)
works for:
C:\folder1\folder2\foo.ext
C:\folder1\folder2\foo.ext1.ext
C:\folder1\folder2\name-without extension
only name
name.ext
C:\folder1\folder2\foo.ext
/folder1/folder2/foo.ext
C:\folder1\folder2\foo
C:\folder1\folder2\
C:\special&chars\folder2\f [oo].ext1.e-x-t
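The same three-group pattern carries over to Python's re module unchanged. A small sketch, where split_path is a made-up helper name and the input is assumed to be a single-line path:

```python
import re

# group 1: folder (optional), group 2: name, group 3: last extension or empty
PATH_PARTS = re.compile(r"^(.*[\\/])?(.*?)(\.[^.]*?|)$")


def split_path(path):
    """Split a path into (folder, name, extension) using the regex above."""
    match = PATH_PARTS.match(path)
    if match is None:  # only happens if the path contains a line break
        raise ValueError("path must be a single line")
    folder, name, extension = match.groups()
    return folder or "", name, extension
```

A missing folder comes back from the regex as None, so the helper normalises it to an empty string to keep the tuple uniform.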
Answer with:
File name and directory space support
Named capture group
Gets unlimited file extensions (captures file.tar.gz, not just file.tar)
*NIX and Win support
^.+(\\|\/)(?<file_name>([^\\\/\n]+)(\.)?[^\n\.]+)$
Explanation:
^.+(\\|\/) Gets anything up to the final / or \ in a file path
(?<file_name> Begin named capture group
([^\\\/\n]+) get anything except for a newline or new file
(\.)?[^\n\.]+ Not really needed but it works well for issues with odd characters in file names
)$ End named capture group and end line
Note that if you're putting this in a string and you need to escape backslashes (such as with C) you'll be using this string:
"^.+(\\\\|\/)(?<file_name>([^\\\/\n]+)(\.)?[^\n\.]+)$"
Here is an alternative that works on windows/unix:
"^(([A-Z]:)?[\.]?[\\{1,2}/]?.*[\\{1,2}/])*(.+)\.(.+)"
First block: path
Second block: dummy
Third block: file name
Fourth block: extension
Tested on:
".\var\www\www.example.com\index.php"
"\var\www\www.example.com\index.php"
"/var/www/www.example.com/index.php"
"./var/www/www.example.com/index.php"
"C:/var/www/www.example.com/index.php"
"D:/var/www/www.example.com/index.php"
"D:\\var\\www\\www.example.com\\index.php"
"\index.php"
"./index.php"
This regular expression extracts the file extension: if the inner group (group 2) isn't null, it's the extension.
.*\\(.*\.(.+)|.*$)
also one more for file in dir and root
^(.*\\)?(.*)(\..*)$
for file in dir
Full match 0-17 `\path\to\file.ext`
Group 1. 0-9 `\path\to\`
Group 2. 9-13 `file`
Group 3. 13-17 `.ext`
for file in root
Full match 0-8 `file.ext`
Group 2. 0-4 `file`
Group 3. 4-8 `.ext`
For most of the cases ( that is some win , unx path , separator , bare file name , dot , file extension ) the following one is enough:
// grab the dir part (1), the bare file name (2), the extension with its dot (3); keep only the bare file name
path.replaceAll("""^(.*)[\\|\/](.*)([.]{1}.*)""","$2")
Direct approach:
To answer your question as it's written, this will provide the most exact match:
^\\\\my-local-server\\path\\to\\(.+)\.pdf$
General approach:
This regex is short and simple, matches any filename in any folder (with or without extension) on both windows and *NIX:
.*[\\/]([^.]+)
If a file has multiple dots in its name, the above regex will capture the filename up to the first dot. This can easily be modified to match until the last dot if you know that you will not have files without extensions or that you will not have a path with dots in it.
If you know that the folder will only contain .pdf files or you are only interested in .pdf files and also know that the extension will never be misspelled, I would use this regex:
.*[\\/](.+)\.pdf$
Explanation:
. matches anything except line terminators.
* repeats the previous match from zero to as many times as possible.
[\\/] matches the last backslash or forward slash (previous ones are consumed by .*). It is possible to omit either the backslash or the forward slash if you know that only one type of environment will be used.
If you want to capture the path, surround .* or .*[\\/] in parenthesis.
Parenthesis will capture what is matched inside them.
[^.] matches anything that is not a literal dot.
+ repeats the previous match one or more times, as many as possible.
\. matches a literal dot.
pdf matches the string pdf.
$ asserts the end of the string.
If you want to match files with zero, one or multiple dots in their names placed in a variable path which also may contain dots, it will start to get ugly. I have not provided an answer for this scenario as I think it is unlikely.
Edit: To also capture filenames without a path, replace the first part with (?:.*[\\/])?, which is an optional non-capturing group.
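To make the "last dot" variant concrete, here is a hedged Python sketch that uses a lazy match plus a lookahead for the final extension (the idea from the JavaScript answer in this thread) and falls back to the whole last component when there is no extension. The name file_stem is hypothetical.

```python
import re

# Lazily match non-separator characters up to the LAST ".ext" at end of string
STEM = re.compile(r"[^\\/]+?(?=\.[^.\\/]+$)")
# Fallback: everything after the last separator (files without an extension)
TAIL = re.compile(r"[^\\/]*$")


def file_stem(path):
    """Return the file name without its last extension, for \\ or / paths."""
    match = STEM.search(path)
    if match:
        return match.group(0)
    return TAIL.search(path).group(0)
```

Where plain code is an option, pathlib.PureWindowsPath(path).stem from the standard library gives the same kind of result for Windows-style paths without any regex.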
Does this work...
.*\/(.+)$
Posting here so I can get feedback
Here is a solution to extract the file name without the dot of the extension.
I begin with the answer from @Hammad Khan and add the dot to the character class, so dots can be part of the file name:
[ \w-.]+\.
Then use a regex lookahead (?= ) for a dot, so the search stops at the last dot (the dot before the extension) and the dot does not appear in the result:
[ \w-.]+(?=[.])
Reordered; it's not necessary but looks better:
[\w-. ]+(?=[.])
try this
[^\\]+$
you can also add extension for specificity
[^\\]+pdf$

extract fileName using Regex

If I want to match only fileName, i.e,
in C://Directory/FileName.cs, somehow ignore everything before FileName.cs using Regex.
How can I do it?
I need this for a Compiled UI I am working on ... can't use programming language as it only accepts Regex.
Any ideas?
Something like this might work:
[^/]*$
It matches all characters to the end of the line that are not "/".
If you want to match paths that use the "\" path separator you would change the regex to:
[^\]*$
But do make sure to escape the "\" character if your programming language or environment requires it. For instance you might have to write something like this:
[^\\]*$
EDIT
I removed the leading "/" and trailing "/" as they may be confusing since they are not really part of the regEx but they are very common of representing a regular expression.
And of course, depending on the features that the regEx engine supports you may be able to use look-ahead/look-behind and capturing to craft a better regEx.
What language are you using? Why are you not using the standard path mechanisms of that language?
How about http://msdn.microsoft.com/en-us/library/system.io.path.aspx ?
Based on your comment of needing to exclude paths that do not match 'abc', try this:
^.+/(?:(?!abc)[^/])+$
Completely split out in regex comment mode, that is:
(?x) # flag to enable comments
^ # start of line
.+ # match any character (except newline)
# greedily one or more times
/ # a literal slash character
(?: # begin non-capturing group
(?! # begin negative lookahead
# (contents must not appear after the current position)
abc # literal text abc
) # end negative lookahead
[^/] # any character that is not a slash
) # end non-capturing group
+ # repeat the above nc group one or more times
# (essentially, we keep looking for non-slashes that are not 'abc')
$ # end of line
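In Python the same commented layout can be tried with re.VERBOSE. This is only a sketch; the constant name NO_ABC_FILENAME is invented for the example.

```python
import re

NO_ABC_FILENAME = re.compile(
    r"""
    ^            # start of the string
    .+ /         # everything up to the last slash (greedy)
    (?:          # one character of the file name at a time:
        (?!abc)  #   'abc' must not start at this position
        [^/]     #   any character that is not a slash
    )+
    $            # end of the string
    """,
    re.VERBOSE,
)

print(bool(NO_ABC_FILENAME.search("C:/Directory/FileName.cs")))
print(bool(NO_ABC_FILENAME.search("C:/Directory/my_abc_file.cs")))
```

The first path matches because its file name never contains abc; the second is rejected because the lookahead fails partway through the file name and no other split at a slash can succeed.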
The regex expression that did it for me was
[^\/]*$
I'm way late to the party and I'm also ignoring the requirement of regex because, as J-16 SDiZ pointed out, sometimes there is a better solution. Even though the question is 4 years old, people looking for a simple solution deserve choices.
Try using the following:
public string ConvertFileName(string filename)
{
    string[] temparray = filename.Split('\\');
    filename = temparray[temparray.Length - 1];
    return filename;
}
This method splits the string on the "\" character, stores the resulting strings in an array and returns the last element of the array (the filename).
Though the OP seems to be writing for UNIX, it doesn't take much to figure out how to tailor it to your particular need.
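The same split-and-take-last idea, sketched in Python so that it handles both separators at once. The helper name last_component is made up for illustration.

```python
import re


def last_component(path):
    """Split on either \\ or / and return the final piece (the file name)."""
    return re.split(r"[\\/]", path)[-1]


print(last_component("C:\\Directory\\FileName.cs"))
print(last_component("C://Directory/FileName.cs"))
```

A path that ends in a separator yields an empty string, since nothing follows the last split point.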
Since "filename" can be interpreted by some as the basename, this example extracts the filename/basename, including for files that have no extension for some reason. It can also get the last directory in the same fashion.
You can see how it works and test it here.
https://regexr.com/4ht5v
The regexp is:
.+?\\(?=\w+)|\.\w+$|\\$
Before:
C:\Directory\BaseFileName.ext
C:\Directory\BaseFileName
C:\This is a Directory\Last Directory With trailing backslash\
C:\This is a Directory\Last Directory Without trailing backslash
After:
BaseFileName
BaseFileName
Last Directory With trailing backslash
Last Directory Without trailing backslash
For the sake of completion, this is how it would work with JavaScript should anyone require it.
// Example of getting a BaseFileName from a path
var path = "C:\\Directory\\FileName.cs";
var result = path.replace(/.+?\\(?=\w+)|\.\w+$|\\$/gm,"");
console.log(result);
Try this (working with / and \):
[^\/|\\]*$
I would use: .*/(.*$)
The parentheses mark a group which is the file name.
The regular expression you use may vary depending on the regex syntax (PCRE, POSIX)
I suggest you use a regex tool, there are several for windows and linux:
Windows - http://sourceforge.net/projects/regexcreator/
Windows - http://weitz.de/regex-coach/
Linux - kodos
Hope it helps
just a variation on miky's that works for both filesystem path characters:
[^\\/]*$
Suppose the file name has special characters, especially when supporting Mac, where special characters are allowed in file names: server-side Path.GetFileName(fileName) fails and throws an error because of illegal characters in the path. The following code using regex comes to the rescue.
The following regex takes care of two things:
In IE, when a file is uploaded, the file path contains folders as well (i.e. c:\samplefolder\subfolder\sample.xls). The expression below will replace all folders with an empty string and retain the file name.
On a Mac, the file name is the only thing supplied, as the Safari browser allows special chars in the file name.
var regExpDir = @"(^[\w]:\\)([\w].+\w\\)";
fileName = Regex.Replace(fileName, regExpDir, string.Empty);
I did this without RegEx in Powershell:
Put the link in a variable
$Link = "http://some.url/some/path/file.name"
Split the link on the "/" character
$split = $Link.Split("/")
Count the splits
$SplitCount = $Split.Count
Target the filename
$Split[$SplitCount -1]
Full code :
$Link = "http://some.url/some/path/file.name"
$Split = $Link.Split("/")
$SplitCount = $Split.Count
$Split[$SplitCount -1]
A rather elegant solution with lookahead and lookbehind wasn't mentioned:
(?<=.+)(?=.cs)