Regex to parse file paths - regex

I have this text:
Unexpected error creating debug information file
'c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.PDB' --
'c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.pdb: The system
cannot find the path specified.
I need to parse out the file paths c:\Users\Path1\Path2\Strategies\Path3 or c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.PDB, whatever is easier. I tried to use the following Regex
\w:.+[.]\w{3}
But this regex doesn't stop at the first file extension and continues to match the second instance of the path, stopping at the second instance of .pdb, thus putting both file paths in one regex match.
What do I need to change in order for the regex to parse the two paths as two separate matches? Thanks.

Non-greedy re:
\w:.+?[.]\w{3}
Note ? after +.
Also, if your path contains no dots except the last one, you can write it like this:
\w:[^.]+[.]\w{3}
If you are not sure that the extension consists of three letters, you must specify the range:
\w:[^.]+[.]\w{1,3}
And when you are not sure that your path has an extension at all, but it contains no spaces, then:
\w:\S+
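To see the difference between the greedy and non-greedy versions, here is a minimal Python sketch. It assumes the error text from the question has been joined onto a single line; the variable names are only illustrative.

```python
import re

# The error text from the question, joined onto one line (an assumption;
# the original message is wrapped across several lines).
text = (
    r"Unexpected error creating debug information file "
    r"'c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.PDB' -- "
    r"'c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.pdb: The system "
    r"cannot find the path specified."
)

# Greedy: .+ runs as far as it can, so both paths end up in one match.
greedy = re.findall(r"\w:.+[.]\w{3}", text)

# Non-greedy: .+? stops at the first dot followed by three word characters.
lazy = re.findall(r"\w:.+?[.]\w{3}", text)

print(len(greedy))  # 1
for path in lazy:
    print(path)
```

The non-greedy pattern yields the two paths as two separate matches, which is what the question asks for.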

What about this
\w:\\(?:[^\\\s]+\\)+
See it here on Regexr
\w:\\ matches a word character, a : and a backslash
(?:[^\\\s]+\\)+ matches the directories: one or more characters that are neither a backslash nor whitespace, followed by a backslash, and this group repeated one or more times.
So this matches the directory part of both paths, c:\Users\Path1\Path2\Strategies\Path3\ (including the trailing backslash). It works as long as the directory names do not contain spaces.
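A short Python check of this pattern, as a sketch only. The sample string is the quoted part of the error message from the question, and stripping the trailing backslash is an optional extra step that is not part of the answer.

```python
import re

# Only the part of the message that holds the two quoted paths.
text = (
    r"'c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.PDB' -- "
    r"'c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.pdb: The system"
)

# Drive letter, colon, backslash, then one or more "name\" segments.
dirs = re.findall(r"\w:\\(?:[^\\\s]+\\)+", text)

# Each match ends with a backslash; strip it if only the directory is wanted.
clean = [d.rstrip("\\") for d in dirs]

for d in clean:
    print(d)
```

Because the file name is not followed by a backslash, it is left out of the match, so both results are the directory only.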

Actually, here you may as well do without regex at all.
Split the text by ' and use the second part.
As for regex, I would use something more complicated, but allowing to catch other filenames, not just those ending with a 3-letter extension:
'([a-z]:(?:[\\/][^\\/]*)+?)' --
(and use first subpattern from the match)
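Both suggestions can be sketched in Python as follows. This assumes the message is on one line; the re.IGNORECASE flag is an addition of this sketch (so that an upper-case drive letter would also match) and is not part of the answer.

```python
import re

text = (
    r"Unexpected error creating debug information file "
    r"'c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.PDB' -- "
    r"'c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.pdb: The system "
    r"cannot find the path specified."
)

# Without regex: split on the single quote; element 1 is the text between
# the first pair of quotes.
by_split = text.split("'")[1]

# With regex: the first capture group holds the path.
# re.IGNORECASE is an assumption added here, not part of the answer.
m = re.search(r"'([a-z]:(?:[\\/][^\\/]*)+?)' --", text, re.IGNORECASE)
by_regex = m.group(1) if m else None

print(by_split)
print(by_regex)
```

Both approaches return the first path, including a file name of any extension length.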

Related

Find file paths that have spaces

I am trying to create a regular expression to select file paths that contain spaces and are not wrapped in quotes. In addition, I only want paths that begin with a volume letter (e.g., C:\, D:\, E:) and I want to ignore any switches or commands that come after the path.
Take for instance the following list, I have highlighted in bold all of the text I want to match and return:
C:\This path has spaces\system.sys -switch /command
C:\Thispathhasnospaces\filename.exe
\sytem32\ThisDidNotBeginWithADriveLetter\something.doc
D:\This path also has spaces\something.xlsx
"C:\I don't care if it is wrapped in quotes\something.abc" -switch
So far what I have come up with is:
^\w:\(.+)(.\w\w\w)
Which sort of works, but it selects paths both with spaces and without spaces. It also doesn't select the full filename if the path has a four-character extension, such as .xlsx
Any help would be very much appreciated. If you do post a better regex, if you added some explanation it would really help because I am trying to learn it.
Thanks!
I would go by
^[A-Z]:\\.+\s.+\.\S+
^ is an anchor for the start of the string
[A-Z]:\\ matches a letter followed by colon and backslash
.+ matches any character, 1 or more times
\s matches a single whitespace character
.+\.\S+ matches any characters followed by dot and non-spaces
See https://regex101.com/r/fC5tF8/2 for a demo
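As a quick check, here is a small Python sketch that runs this pattern over the five sample lines from the question. The list and variable names are illustrative only.

```python
import re

pattern = re.compile(r"^[A-Z]:\\.+\s.+\.\S+")

lines = [
    r"C:\This path has spaces\system.sys -switch /command",
    r"C:\Thispathhasnospaces\filename.exe",
    r"\sytem32\ThisDidNotBeginWithADriveLetter\something.doc",
    r"D:\This path also has spaces\something.xlsx",
    r'''"C:\I don't care if it is wrapped in quotes\something.abc" -switch''',
]

matches = []
for line in lines:
    m = pattern.match(line)
    if m:
        # group(0) is the path up to and including the extension.
        matches.append(m.group(0))

for path in matches:
    print(path)
```

Only the two unquoted paths that contain spaces are returned, and the switches are left out because none of them contains a dot. A switch with a dot in it would extend the match, so treat this as a sketch for input shaped like the samples.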
a regex I would use is
^\w:[^\s]+[\.]*[^\s]+
This will find anything that starts with an alphanumeric character followed by a colon and contains no spaces.

Searching for multiple keywords in the same file within bigger project in Sublime Text 3

I would like to search a whole project for files that have both ABCKeyword and XYZKeyword in the same file. Is this possible?
When I search a whole project with regex (ABCKeyword|XYZKeyword), it returns files that have one or the other but not necessarily both.
You can use positive look-aheads:
(?s)^(?=.*\bABCKeyword\b)(?=.*\bXYZKeyword\b)
Only the text that has both will be matched.
See demo
(?s) makes . match newline characters as well. The (?=.*...) look-aheads check but do not consume characters, thus only asserting that both ABCKeyword and XYZKeyword occur further in the text. The \b word boundaries make sure we only match whole words (if you need to match them partially, inside longer words, remove the \b's).
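The same look-ahead pattern can be tried outside the editor. The following Python sketch applies it to whole-file strings; has_both is a hypothetical helper name, and Python's re engine is assumed to treat this pattern the same way as the editor's engine.

```python
import re

# Two look-aheads anchored at the start of the text; (?s) lets . cross newlines.
BOTH = re.compile(r"(?s)^(?=.*\bABCKeyword\b)(?=.*\bXYZKeyword\b)")

def has_both(text):
    """Return True if the text contains both keywords as whole words."""
    return BOTH.search(text) is not None

print(has_both("x = ABCKeyword\ny = XYZKeyword\n"))  # True
print(has_both("only ABCKeyword here"))              # False
print(has_both("XYZKeyword\nABCKeyword"))            # True
print(has_both("ABCKeywords and XYZKeyword"))        # False
```

The order of the keywords does not matter, because each look-ahead scans from the start of the text independently.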

Specific search pattern using regex

I would like to search for a pattern in following type of strings.
I have both of these patterns
"<deliveries!ntg5!intel!api!ntg5!avt!tuner!src>CDAVTTunerTVProxy.cpp"
and
"<.>api/sys/mocca/pf/comm/component/src\HBServices.hpp"
I would like to extract the file names from the patterns above
I tried the following
if(m/(\|>[0-9a-zA-Z_]\.cpp"$|\.hpp"$|\.h"$|\.c")$/){
Above expression is not listing file names with " >xxxxx.cpp" ( or .hpp, or .h, or .c)
Any idea would be of great help.
There are a few mistakes in your regex
if(m/(\|>[0-9a-zA-Z_]\.cpp"$|\.hpp"$|\.h"$|\.c")$/){
I assume that \|> is supposed to match either \ or >, but this is incorrect. It will try to match a pipe | followed by >. Backslash is used to escape characters, and so if you want to match a literal backslash, you need to escape it: \\. This is the wrong way to use an alternation, though (see more below), and there is a better way, which is to use a character class: [\\>].
[0-9a-zA-Z_] is a character class that is represented by \w, so it makes sense to use that instead to make your regex more readable. Also, you are only matching one character. If you want to match more than that, you need to supply a quantifier, such as +, which is suitable in this case. The quantifier + means to match 1 or more times.
Your alternations | are mixed up. Unless you group them properly, each alternative spans the entire pattern on its side of the |. Your regex as it is now would capture strings like:
|>A.cpp"
.hpp"
.c"
Which is not what you want. If you want to apply the different extensions to the main file name body, you have to group the alternate extensions properly:
\w+\.(?:cpp|hpp|h|c)"$
Non-capturing parentheses (?: ... ) are suitable for grouping. As you can also see, there is no need to repeat the parts of the string which are identical for all extensions.
So what do we end up with?
/([\\>]\w+\.(?:cpp|hpp|h|c)")$/
Although I do not think that you really want to include the leading [\\>] in the match, or the trailing ". So more properly it would be
/[\\>](\w+\.(?:cpp|hpp|h|c))"$/
Note that, as I said in the comment, there is a module to use if these are paths and you want to extract the file name: File::Basename has been included in the Perl core since version 5.
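The final pattern is not specific to Perl. As a sketch, here is the same regex in Python, applied to the two sample strings from the question (the surrounding double quotes are assumed to be part of each string):

```python
import re

samples = [
    '"<deliveries!ntg5!intel!api!ntg5!avt!tuner!src>CDAVTTunerTVProxy.cpp"',
    r'"<.>api/sys/mocca/pf/comm/component/src\HBServices.hpp"',
]

# A backslash or > precedes the name; the closing quote ends the string.
pattern = re.compile(r'[\\>](\w+\.(?:cpp|hpp|h|c))"$')

names = []
for s in samples:
    m = pattern.search(s)
    if m:
        # Group 1 is the file name without the leading separator or quote.
        names.append(m.group(1))

for name in names:
    print(name)
```

In the second string the > after <. is tried first, but the match fails there because api is followed by a slash rather than a dot, so the engine moves on to the backslash before HBServices.hpp.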
Please try this regex:
m/([0-9a-zA-Z_]+\.(?:cpp|hpp|h|c))$/
This one looks for the extension cpp, hpp, h or c at the end of the string (using $) and then for the file name just before the period (.) that precedes the extension.

Notepad++ Regex: Find all 1 and 2 letter words

I'm working with a text file with 200,000+ lines in Notepad++. Each line has only one word. I need to strip out all words that contain only one letter (e.g., I) and words that contain only two letters (e.g., as).
I thought I could just paste in a regular regex like [a-zA-Z]{1,2}, but it does not recognize anything (I'm trying to Mark them).
I've done a manual search and I know that words of that length do exist, so it can only be my regex that's wrong. Does anyone know how to do this in Notepad++?
Cheers,
- Mestika
If you want to remove only the words but leave the lines empty, this works:
^[a-zA-Z]{1,2}$
Replace this with an empty string. ^ and $ are anchors for the beginning and the end of a line (because Notepad++'s regexes work in multi-line mode).
If you want to remove the lines completely, search for this:
^[a-zA-Z]{1,2}\r\n
And replace with an empty string. However, this won't work before Notepad++ 6, so make sure yours is up-to-date.
Note that you will have to replace \r\n with the specific line-endings of your file!
As Tim Pietzker suggested, a platform independent solution that also removes empty lines would be:
^[a-zA-Z]{1,2}[\r\n]+
A platform-independent solution that does not remove empty lines but only those with one or two letters would be:
^[a-zA-Z]{1,2}(\r\n?|\n)
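The difference between the last two patterns is easiest to see on a small sample. This is a Python sketch rather than Notepad++ itself: re.MULTILINE stands in for Notepad++'s multi-line behaviour, and the sample text is made up for illustration.

```python
import re

# Made-up sample: short words, one blank line, and one CRLF line ending.
text = "I\nhello\nas\n\nworld\nab\r\nxyz\n"

# Removes each short-word line together with exactly one line ending.
keep_blank = re.sub(r"^[a-zA-Z]{1,2}(\r\n?|\n)", "", text, flags=re.MULTILINE)

# Removes each short-word line and any run of line endings after it,
# which also swallows blank lines that follow a removed word.
drop_blank = re.sub(r"^[a-zA-Z]{1,2}[\r\n]+", "", text, flags=re.MULTILINE)

print(repr(keep_blank))
print(repr(drop_blank))
```

With the first pattern the blank line after "as" survives; with the second it is removed along with the word.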
I don't use Notepad++ but my guess is it could be because you have too many matches - try including word boundaries (your exp will match every set of 2 letters)
\b[a-zA-Z]{1,2}\b
The regex you specified should find 1-or-2 characters (even in Notepad++'s Find dialog), but not in the way you'd think. You want the regex to make sure it starts at the beginning of the line and ends at the end, with ^ and $ respectively:
^[a-zA-Z]{1,2}$
Notepad++ version 6.0 introduced the PCRE engine, so if this doesn't work in your current version try updating to the most recent.
You seem to use the version of Notepad++ that doesn't support explicit quantifiers: that's why there's no match at all (as { and } are treated as literals, not special symbols).
The solution is to use their somewhat more lengthy replacement:
\w\w?
... but that's only part of the story, as this regex will match one or two word characters anywhere, including inside longer words, and not just short words. To match only short words, you need something like this:
^\w\w?$

how to extract filename in this situation?

my input strings look like this:
1 warning: rg: W, MULT: file 'filename_a.h' was listed twice.
2 warning: rg: W, SCOP: scope redefined in '/proj/test/site_a/filename_b.c'.
3 warning: rg: W, ATTC: file /proj/test/site_b/filename_c.v is not resolved.
4 warning: rg: W, MULTH: property file filename_d.vu was listed outside.
They come in four different flavors as listed above. I read these from a log file line by line.
For the ones with a path specified (lines 2 and 3) I can extract the filename using $file=~s#.*/##; and it seems to work fine. Is there a way to extract the filename without conditional statements for the different types? I want to use just one clean regex. Perl's File::Basename will not work in this case either.
I am using Perl.
You could do it in two steps:
extract path from each line
get basename from the path
Example
#!/usr/bin/perl -n
use feature 'say';
use File::Basename;
#NOTE: assume that unquoted path has no spaces in it
say basename($1.$2) if /(?:file|redefined in)\s+(?:'([^']+)'|(\S+))/;
Output
filename_a.h
filename_b.c
filename_c.v
filename_d.vu
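For comparison, the same two-step idea can be sketched in Python. This is a port for illustration, not the answer's own code; posixpath.basename plays the role of Perl's basename, and the same assumption holds that unquoted paths contain no spaces.

```python
import posixpath
import re

log_lines = [
    "1 warning: rg: W, MULT: file 'filename_a.h' was listed twice.",
    "2 warning: rg: W, SCOP: scope redefined in '/proj/test/site_a/filename_b.c'.",
    "3 warning: rg: W, ATTC: file /proj/test/site_b/filename_c.v is not resolved.",
    "4 warning: rg: W, MULTH: property file filename_d.vu was listed outside.",
]

# Step 1: capture the path, either quoted (group 1) or unquoted (group 2).
pattern = re.compile(r"(?:file|redefined in)\s+(?:'([^']+)'|(\S+))")

names = []
for line in log_lines:
    m = pattern.search(line)
    if m:
        path = m.group(1) or m.group(2)
        # Step 2: reduce the path to its last component.
        names.append(posixpath.basename(path))

for name in names:
    print(name)
```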
Your problem needs more constraints. For example, what's a good way to characterize a string as a "path" (or "filename") or not? You might say, "Hey, when I see a single dot immediately followed by letters and numbers (but not symbols), and there are a bunch of characters before that dot too, then it might be a path or filename!"
\s+([^\s]+\.\w+)
But this doesn't catch all paths, nor files without an extension. So we might latch on an alternation to say, "Either the above, or, a string with at least one slash in it."
\s+([^\s]+\.\w+|[^\s]*\/[^\s]*)
(Note that you may not need to escape the slash in the above example, since you seem to be using # as your delimiter.)
What I'm getting at, in any case, is that you need to specify your problem more rigorously, and this will automatically bring you to a satisfying solution. Of course, there is no truly "correct" solution using regexes alone: you'd need to do file tests to do that.
To go further with this example, perhaps you want to define a list of extensions:
\s+([^\s]+\.(?:c|h|cc|cpp)|[^\s]*\/[^\s]*)
Or, perhaps you want to be more generic, but allow only extensions up to 4 characters long:
\s+([^\s]+\.\w{1,4}|[^\s]*\/[^\s]*)
Perhaps you only consider something a path if it begins with a slash, but you still want at least one other slash somewhere in it:
\s+([^\s]+\.\w{1,4}|/[^\s]*\/[^\s]*)
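To get a feel for how such a heuristic behaves, here is a Python sketch that runs the last pattern over the four log lines from the question. Stripping the single quote afterwards is an extra step assumed here, because the pattern itself keeps the leading quote of quoted names.

```python
import re

log_lines = [
    "1 warning: rg: W, MULT: file 'filename_a.h' was listed twice.",
    "2 warning: rg: W, SCOP: scope redefined in '/proj/test/site_a/filename_b.c'.",
    "3 warning: rg: W, ATTC: file /proj/test/site_b/filename_c.v is not resolved.",
    "4 warning: rg: W, MULTH: property file filename_d.vu was listed outside.",
]

# Either "something.ext" with a 1-4 character extension, or a token that
# starts with a slash and contains at least one more slash.
pattern = re.compile(r"\s+([^\s]+\.\w{1,4}|/[^\s]*\/[^\s]*)")

candidates = []
for line in log_lines:
    m = pattern.search(line)
    if m:
        # The raw capture may still start with a quote; strip it.
        candidates.append(m.group(1).strip("'"))

for c in candidates:
    print(c)
```

This yields the file name or full path from each line. Reducing a full path to its base name would be a separate step.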
Good luck.
/\w*\.\w*/
This will match the file name expressed in the four different warning logs. \w will match any word character (letters, digits, and underscores), so this regex looks for any number of word characters, followed by a dot followed by more word characters.
This works because the only other dot in your logs is at the end of the log.
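A quick Python check of this idea, assuming the dot is meant literally and is therefore escaped as \. in the pattern. With an unescaped dot, the first match on each line would be the leading "1 warning" style text instead of the file name.

```python
import re

log_lines = [
    "1 warning: rg: W, MULT: file 'filename_a.h' was listed twice.",
    "2 warning: rg: W, SCOP: scope redefined in '/proj/test/site_a/filename_b.c'.",
    "3 warning: rg: W, ATTC: file /proj/test/site_b/filename_c.v is not resolved.",
    "4 warning: rg: W, MULTH: property file filename_d.vu was listed outside.",
]

# Word characters, a literal dot, then more word characters.
escaped = [re.search(r"\w*\.\w*", line).group(0) for line in log_lines]

# Same pattern with an unescaped dot, which matches any character.
unescaped_first = re.search(r"\w*.\w*", log_lines[0]).group(0)

for name in escaped:
    print(name)
print(unescaped_first)
```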