how to extract filename in this situation? - regex

My input strings look like this:
1 warning: rg: W, MULT: file 'filename_a.h' was listed twice.
2 warning: rg: W, SCOP: scope redefined in '/proj/test/site_a/filename_b.c'.
3 warning: rg: W, ATTC: file /proj/test/site_b/filename_c.v is not resolved.
4 warning: rg: W, MULTH: property file filename_d.vu was listed outside.
They come in four different flavors as listed above. I read these from a log file line by line.
For the ones with a path specified (lines 2 and 3), I can extract the filename using $file=~s#.*/##; and it seems to work fine. Is there a way to extract the filename without conditional statements for the different types? I want to use just one clean regex. Perl's File::Basename will not work in this case either.
I am using Perl.

You could do it in two steps:
extract path from each line
get basename from the path
Example
#!/usr/bin/perl -n
use feature 'say';
use File::Basename;
#NOTE: assume that unquoted path has no spaces in it
say basename($1.$2) if /(?:file|redefined in)\s+(?:'([^']+)'|(\S+))/;
Output
filename_a.h
filename_b.c
filename_c.v
filename_d.vu
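If you want to try the same two-step idea outside Perl, here is a rough Python sketch. It keeps the same assumption as above (an unquoted path has no spaces in it); the name extract_filename is made up for this example, and the leading line numbers from the question are left out of the sample lines.

```python
import posixpath
import re

# Step 1: find the path after "file" or "redefined in", quoted or not.
# Assumption: an unquoted path contains no spaces.
PATH_RE = re.compile(r"(?:file|redefined in)\s+(?:'([^']+)'|(\S+))")


def extract_filename(line):
    """Return the file name mentioned in a warning line, or None."""
    match = PATH_RE.search(line)
    if match is None:
        return None
    # Exactly one of the two groups takes part in a match.
    path = match.group(1) or match.group(2)
    # Step 2: take the basename of the path.
    return posixpath.basename(path)


lines = [
    "warning: rg: W, MULT: file 'filename_a.h' was listed twice.",
    "warning: rg: W, SCOP: scope redefined in '/proj/test/site_a/filename_b.c'.",
    "warning: rg: W, ATTC: file /proj/test/site_b/filename_c.v is not resolved.",
    "warning: rg: W, MULTH: property file filename_d.vu was listed outside.",
]
for line in lines:
    print(extract_filename(line))
```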

Your problem needs more constraints. For example, what's a good way to characterize a string as a "path" (or "filename") or not? You might say, "Hey, when I see a single dot immediately followed by letters and numbers (but not symbols), and there are a bunch of characters before that dot too, then it might be a path or filename!"
\s+([^\s]+\.\w+)
But this doesn't catch all paths, nor files without an extension. So we might latch on an alternation to say, "Either the above, or, a string with at least one slash in it."
\s+([^\s]+\.\w+|[^\s]*\/[^\s]*)
(Note that you may not need to escape the slash in the above example, since you seem to be using # as your delimiter.)
What I'm getting at, in any case, is that you need to specify your problem more rigorously, and this will automatically bring you to a satisfying solution. Of course, there is no truly "correct" solution using regexes alone: you'd need to do file tests to do that.
To go further with this example, perhaps you want to define a list of extensions:
\s+([^\s]+\.(?:c|h|cc|cpp)|[^\s]*\/[^\s]*)
Or, perhaps you want to be more generic, but allow only extensions up to 4 characters long:
\s+([^\s]+\.\w{1,4}|[^\s]*\/[^\s]*)
Perhaps you only consider something a path if it begins with a slash, but you still want at least one other slash somewhere in it:
\s+([^\s]+\.\w{1,4}|/[^\s]*\/[^\s]*)
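To see how far such a heuristic gets you, here is a small Python sketch that applies the last pattern to a log line. The function name guess_filename is hypothetical, and stripping the quotes afterwards is my own addition, since the pattern itself knows nothing about quoting.

```python
import posixpath
import re

# A token with an extension of 1 to 4 word characters, or a token that
# starts with a slash and contains at least one more slash.
CANDIDATE_RE = re.compile(r"\s+([^\s]+\.\w{1,4}|/[^\s]*/[^\s]*)")


def guess_filename(line):
    """Return the basename of the first path-like token, or None."""
    match = CANDIDATE_RE.search(line)
    if match is None:
        return None
    # The captured token may still carry a single quote from the log text.
    token = match.group(1).strip("'")
    return posixpath.basename(token)


print(guess_filename("warning: rg: W, MULT: file 'filename_a.h' was listed twice."))
print(guess_filename("warning: rg: W, ATTC: file /proj/test/site_b/filename_c.v is not resolved."))
```

It is still only a guess: any token that happens to look like "name.ext" would be picked up just as readily.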
Good luck.

/\w+\.\w+/
This will match the file name in each of the four warning formats. \w matches any word character (letters, digits, and underscores), so this regex looks for one or more word characters, followed by a literal dot, followed by one or more word characters. The dot has to be escaped, because an unescaped . matches any character.
This works because the only other dot in your logs is at the end of the line, where no word character follows it.

Related

Elasticsearch Regex to match url starting with one string and not ending with another, without look ahead/behind

I have two groups of strings that take the formats
http://example.com/foo/something
and
http://example.com/foo/something/something-else/bar/1
Where example.com, foo and bar are fixed, something and something-else could be any string, and 1 is any number.
I want to use regex to match strings following the first format (they must start with http://example.com/foo/) and not the second. The exclusion could be around number of slashes, the "bar" string or ending in a number.
I don't have support for look ahead or look back.
What's the best approach?
Examples of strings that should match
http://example.com/foo/apple
http://example.com/foo/bear-bear
http://example.com/foo/cake-cake
Examples of strings that should NOT match
http://example.com/baa/apple
http://example.com/foo/apple/cake/bar/1
http://example.com/foo/bear-apple/camel/bar/2
Examples of strings that wouldn't exist in the data set
(So it doesn't matter if they match or not)
http://example.com/foo/bear-bear/cake/bar/two
http://example.com/foo/bear/camel/tar/2
http://example.com/foo/bear-bear/camel
http://example.com/foo/bear/camel/
http://example.com/foo/bear-bear/camel/tar/2
UPDATE
It turns out that the regex engine the application I'm using this in is from Elasticsearch, so this documentation (and one of our developers) was helpful: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
The end solution was:
(http://example.com/foo.*)&~(.*bar.*)
All your examples have a specific prefix URL, followed by one-and-only-one path element. If this is the general case, you can do this by simply looking for the prefix URL followed by a word which doesn't contain a path separator, followed by EOL.
You didn't say what engine you're using, so here's an example with GNU grep in bash:
grep -e '^http://example.com/foo/[^/]\+$'
Bash makes for readable examples, because single-quoting means very few characters need escaping. The sole exception in my example is the + character.
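The same idea (fixed prefix, then exactly one path element, then end of string) can be sketched in Python without any lookahead. This is just an illustration; the function name is_single_element_url is made up, and I have escaped the dot in example.com, which the grep version leaves unescaped.

```python
import re

# Fixed prefix followed by exactly one path element (no further slash).
SINGLE_ELEMENT_RE = re.compile(r"http://example\.com/foo/[^/]+")


def is_single_element_url(url):
    """True if url is the prefix plus exactly one path element."""
    # fullmatch anchors the pattern at both ends, like ^...$ in grep.
    return SINGLE_ELEMENT_RE.fullmatch(url) is not None


for url in (
    "http://example.com/foo/apple",
    "http://example.com/baa/apple",
    "http://example.com/foo/apple/cake/bar/1",
):
    print(url, is_single_element_url(url))
```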

Specific search pattern using regex

I would like to search for a pattern in following type of strings.
I have both of these patterns
"<deliveries!ntg5!intel!api!ntg5!avt!tuner!src>CDAVTTunerTVProxy.cpp"
and
"<.>api/sys/mocca/pf/comm/component/src\HBServices.hpp"
I would like to extract the file names from the patterns above
I tried the following
if(m/(\|>[0-9a-zA-Z_]\.cpp"$|\.hpp"$|\.h"$|\.c")$/){
The expression above does not match file names of the form ">xxxxx.cpp" (or .hpp, .h, or .c).
Any idea would be of great help.
There are a few mistakes in your regex
if(m/(\|>[0-9a-zA-Z_]\.cpp"$|\.hpp"$|\.h"$|\.c")$/){
I assume that \|> is supposed to match either \ or >, but this is incorrect. It will try to match a pipe | followed by >. Backslash is used to escape characters, and so if you want to match a literal backslash, you need to escape it: \\. This is the wrong way to use an alternation, though (see more below), and there is a better way, which is to use a character class: [\\>].
[0-9a-zA-Z_] is a character class that is represented by \w, so it makes sense to use that instead to make your regex more readable. Also, you are only matching one character. If you want to match more than that, you need to supply a quantifier, such as +, which is suitable in this case. The quantifier + means to match 1 or more times.
Your alternations | are mixed up. Unless you group them properly, each alternative applies to the whole pattern rather than just to the extension. Your regex as it is now would match strings like:
|>A.cpp"
.hpp"
.c"
Which is not what you want. If you want to apply the different extensions to the main file name body, you have to group the alternate extensions properly:
\w+\.(?:cpp|hpp|h|c)"$
Non-capturing parentheses (?: ... ) are suitable for grouping. As you can also see, there is no need to repeat the parts of the string which are identical for all extensions.
So what do we end up with?
/([\\>]\w+\.(?:cpp|hpp|h|c)")$/
Although I do not think that you really want to include the leading [\\>] in the match, or the trailing ". So more properly it would be
/[\\>](\w+\.(?:cpp|hpp|h|c))"$/
Note that as I said in the comment, there is a module to use if these are paths, and you want to extract the file name. File::Basename is included in Perl core since version 5.
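For a quick check of the final pattern against the two sample strings, here is a Python sketch. It assumes the double quotes are really part of the strings (as the trailing " in the pattern implies); the name extract_source_name is hypothetical.

```python
import re

# A backslash or ">" right before the name, a known extension, then the
# closing double quote at the end of the string.
FILENAME_RE = re.compile(r'[\\>](\w+\.(?:cpp|hpp|h|c))"$')


def extract_source_name(text):
    """Return the captured file name, or None if the pattern does not match."""
    match = FILENAME_RE.search(text)
    return match.group(1) if match else None


samples = [
    r'"<deliveries!ntg5!intel!api!ntg5!avt!tuner!src>CDAVTTunerTVProxy.cpp"',
    r'"<.>api/sys/mocca/pf/comm/component/src\HBServices.hpp"',
]
for sample in samples:
    print(extract_source_name(sample))
```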
Please try this regex:
m/([0-9a-zA-Z_]+\.(?:cpp|hpp|h|c))$/
This one looks for the extension cpp, hpp, h or c at the end of the string (using $), and then for the file name just before the period (.) that precedes the extension.

Regex for SublimeText Snippet

I've been stuck for a while on this Sublime Snippet now.
I would like to display the correct package name when creating a new class, using TM_FILEPATH and TM_FILENAME.
When printing TM_FILEPATH variable, I get something like this:
/Users/caubry/d/[...]/src/com/[...]/folder/MyClass.as
I would like to transform this output, so I could get something like:
com.[...].folder
This includes:
Removing anything before /com/[...]/folder/MyClass.as;
Removing the TM_FILENAME, with its extension; in this example MyClass.as;
And finally finding all the slashes and replacing them by dots.
So far, this is what I've got:
${1:${TM_FILEPATH/.+(?:src\/)(.+)\.\w+/\l$1/}}
and this displays:
com/[...]/folder/MyClass
I do understand how to replace slashes with dots, such as:
${1:${TM_FILEPATH/\//./g/}}
However, I'm having difficulty combining this logic with the previous one, as well as removing the TM_FILENAME at the end.
I'm really inexperienced with Regex, thanks in advance.
:]
EDIT: [...] indicates variable number of folders.
We can do this in a single replacement with some trickery. What we'll do is, we put a few different cases into our pattern and do a different replacement for each of them. The trick to accomplish this is that the replacement string must contain no literal characters, but consist entirely of "backreferences". In that case, those groups that didn't participate in the match (because they were part of a different case) will simply be written back as an empty string and not contribute to the replacement. Let's get started.
First, we want to remove everything up until the last src/ (to mimic the behaviour of your snippet - use an ungreedy quantifier if you want to remove everything until the first src/):
^.+/src/
We just want to drop this, so there's no need to capture anything - nor to write anything back.
Now we want to match subsequent folders until the last one. We'll capture the folder name, also match the trailing /, but write back the folder name and a .. But I said no literal text in the replacement string! So the . has to come from a capture as well. Here comes the assumption into play, that your file always has an extension. We can grab the period from the file name with a lookahead. We'll also use that lookahead to make sure that there's at least one more folder ahead:
^.+/src/|\G([^/]+)/(?=[^/]+/.*([.]))
And we'll replace this with $1$2. Now if the first alternative catches, groups $1 and $2 will be empty, and the leading bit is still removed. If the second alternative catches, $1 will be the folder name, and $2 will have captured a period. Sweet. The \G is an anchor that ensures that all matches are adjacent to one another.
Finally, we'll match the last folder and everything that follows it, and only write back the folder name:
^.+/src/|\G([^/]+)/(?=[^/]+/.*([.]))|\G([^/]+)/[^/]+$
And now we'll replace this with $1$2$3 for the final solution.
A conceptually similar variant would be:
^.+/src/|\G([^/]+)/(?:(?=[^/]+/.*([.]))|[^/]+$)
replaced with $1$2. I've really only factored out the beginning of the second and third alternative.
Finally, if Sublime is using Boost's extended format string syntax, it is actually possible to get characters into the replacement conditionally (without magically conjuring them from the file extension):
^.+/src/|\G/[^/]+$|\G(/)?([^/]+)
Now we have the first alternative for everything up to src (which is to be removed), the second alternative for the last slash and file name (which is to be removed), and the last alternative for all folders you want to keep. The file-name alternative has to come before the folder alternative, because alternatives are tried from left to right and the folder pattern would otherwise swallow the file name as well. This time I put the slash to be replaced optionally at the beginning. With a conditional replacement we can write a . there if and only if that slash was matched:
(?1.:)$2
Unfortunately, I can't test this right now and I don't know an online tester that uses Boost's regex engine. But this should do the trick just fine.
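If you want to check what the result should look like before fighting with the snippet syntax, the target transformation can be sketched in plain Python. Note that this is not the single-substitution trick from above (Python's re module has no \G), just the three steps from the question done one after the other. The function name package_name and the sample path are made up for illustration, and the sketch assumes at least one folder below src.

```python
import re


def package_name(filepath):
    """Turn .../src/<folders>/<File>.<ext> into a dotted package name."""
    # 1. Drop everything up to and including the last "/src/" (greedy .+).
    relative = re.sub(r"^.+/src/", "", filepath)
    # 2. Drop the trailing "/FileName.ext".
    folders = re.sub(r"/[^/]+$", "", relative)
    # 3. Replace the remaining slashes with dots.
    return folders.replace("/", ".")


# Hypothetical path; the real one in the question is elided with [...].
print(package_name("/Users/someone/project/src/com/example/folder/MyClass.as"))
```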

Regex to parse file paths

I have this text:
Unexpected error creating debug information file
'c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.PDB' --
'c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.pdb: The system
cannot find the path specified.
I need to parse out the file paths c:\Users\Path1\Path2\Strategies\Path3 or c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.PDB, whatever is easier. I tried to use the following Regex
\w:.+[.]\w{3}
But this regex doesn't stop at the first file extension and continues to match the second instance of the path, stopping at the second instance of .pdb, thus putting both file paths in one regex match.
What do I need to change in order for the regex to parse the two paths as two separate matches? Thanks.
Non-greedy re:
\w:.+?[.]\w{3}
Note ? after +.
Also, if your path contains no dots except the last one, you can write it so:
\w:[^.]+[.]\w{3}
If you are not sure that the extension consists of three letters, you must specify the range:
\w:[^.]+[.]\w{1,3}
And when you are not sure that your path has extension at all, but it contains no spaces, then:
\w:\S+
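To see the difference between the greedy and the non-greedy version, here is a short Python sketch. It joins the message from the question into a single line, since . does not match a newline by default in Python.

```python
import re

# The error message from the question, joined into one line.
text = (
    r"Unexpected error creating debug information file "
    r"'c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.PDB' -- "
    r"'c:\Users\Path1\Path2\Strategies\Path3\CustomStrategy.pdb: The system "
    r"cannot find the path specified."
)

# Greedy: .+ runs as far as it can, so both paths end up in one match.
greedy = re.findall(r"\w:.+[.]\w{3}", text)

# Non-greedy: .+? stops at the first dot followed by three word characters.
lazy = re.findall(r"\w:.+?[.]\w{3}", text)

print(len(greedy), len(lazy))
for path in lazy:
    print(path)
```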
What about this
\w:\\(?:[^\\\s]+\\)+
\w:\\ matches a word character, a : and a backslash
(?:[^\\\s]+\\)+ matches the directories: one or more characters that are neither a backslash nor whitespace, up to the next backslash, and this is repeated.
So, this would match the directory part of both paths, c:\Users\Path1\Path2\Strategies\Path3\ (including the trailing backslash). It works as long as the directory names do not contain spaces.
Actually, here you may as well do without regex at all.
Split the text by ' and use the second part.
As for regex, I would use something more complicated, but allowing to catch other filenames, not just those ending with a 3-letter extension:
'([a-z]:(?:[\\/][^\\/]*)+?)' --
(and use first subpattern from the match)

find all text before using regex

How can I use regex to find all text before the text "All text before this line will be included"?
I have included some sample text below as an example.
This can include deleting, updating, or adding records to your database, which would then be reflex.
All text before this line will be included
You can make this a bit more sophisticated by encrypting the random number and then verifying that it is still a number when it is decrypted. Alternatively, you can pass a value and a key instead.
Starting with an explanation... skip to end for quick answers
To match up to a specific piece of text, and confirm it's there but not include it in the match, you can use a positive lookahead, using the notation (?=regex)
This confirms that 'regex' exists at that position, but matches the start position only, not the contents of it.
So, this gives us the expression:
.*?(?=All text before this line will be included)
Where . is any character, and *? is a lazy match (consumes least amount possible, compared to regular * which consumes most amount possible).
However, in almost all regex flavours . will exclude newline, so we need to explicitly use a flag to include newlines.
The flag to use is s, (which stands for "Single-line mode", although it is also referred to as "DOTALL" mode in some flavours).
And this can be implemented in various ways, including...
Globally, for /-based regexes:
/regex/s
Inline, global for the regex:
(?s)regex
Inline, applies only to bracketed part:
(?s:reg)ex
And as a function argument (depends on which language you're doing the regex with).
So, probably the regex you want is this:
(?s).*?(?=All text before this line will be included)
However, there are some caveats:
Firstly, not all regex flavours support lazy quantifiers - you might have to use just .*, (or potentially use more complex logic depending on precise requirements if "All text before..." can appear multiple times).
Secondly, not all regex flavours support lookaheads, so you will instead need to use captured groups to get the text you want to match.
Finally, you can't always specify flags, such as the s above, so may need to either match "anything or newline" (.|\n) or maybe [\s\S] (whitespace and not whitespace) to get the equivalent matching.
If you're limited by all of these (I think the XML implementation is), then you'll have to do:
([\s\S]*)All text before this line will be included
And then extract the first sub-group from the match result.
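As one concrete example of the "function argument" way of setting the flag, here is a minimal Python sketch of the lazy-match-plus-lookahead version. The helper name text_before is made up, and the sample text is shortened from the question.

```python
import re

MARKER = "All text before this line will be included"


def text_before(marker, text):
    """Return everything before the first occurrence of marker, or None."""
    # re.DOTALL is Python's name for the s flag: . also matches newlines.
    # re.escape keeps the marker from being read as regex syntax.
    pattern = re.compile(r".*?(?=" + re.escape(marker) + r")", re.DOTALL)
    match = pattern.search(text)
    return match.group(0) if match else None


sample = (
    "This can include deleting, updating, or adding records to your database.\n"
    + MARKER + "\n"
    + "You can make this a bit more sophisticated.\n"
)
print(text_before(MARKER, sample))
```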
(.*?)All text before this line will be included
Depending on what particular regular expression framework you're using, you may need to include a flag to indicate that . can match newline characters as well.
The first (and only) subgroup will include the matched text. How you extract that will again depend on what language and regular expression framework you're using.
If you want to include the "All text before this line..." text, then the entire match is what you want.
This should do it:
<?php
$str = "This can include deleting, updating, or adding records to your database, which would then be reflex.
All text before this line will be included
You can make this a bit more sophisticated by encrypting the random number and then verifying that it is still a number when it is decrypted. Alternatively, you can pass a value and a key instead.";
echo preg_filter("/(.*?)All text before this line will be included.*/s","\\1",$str);
?>
Returns:
This can include deleting, updating, or adding records to your database, which would then be reflex.