Compact syntax to exclude file extensions from search - regex

Basically I wonder if there's a shorter way to do this:
*.* !*.exe !*.zip !*.jar
Problem is, I write software that generates files with extensions I can't predict - neither in what I want to exclude, nor include, so my only option is to append exclusions as I go. But my list's length now exceeds N++'s limit, so I wonder if it can be shorter, like
*.* !*.[exe zip jar]

No:
The Filters list is a space-separated list of wildcard expressions that cmd.exe can understand, like *.doc foo.*.
Wildcards can include * for zero or more of any character, and ? for exactly one of any character.
Most characters work as literals. However, space is used as the separator, and thus cannot be used as a literal in your filter. Some punctuation characters have special meanings (like the ? and * wildcards, or the ! exclusion or !+\ for recursive exclusion), and cannot be used as literals. Also, the ; causes problems, so even though Microsoft allows it in file and path names, using a ; in the Filters box will not work as you might hope. If you want to match a space or a semicolon ; or other problematic-punctuation in your file or folder for your Filter (whether for inclusion or exclusion), then use the * and/or ? wildcards instead. (So x?y.txt will match the file x;y.txt or x y.txt (with a space between x and y).) And sorry, no, you cannot use quotes around a path-with-spaces to allow the spaces to work as literals: the space is a separator in this field.
That's because the Filters field simply does not take regular expressions. Like the command prompt, it only understands wildcards; it accepts neither regular expressions nor a mix of the two (e.g. square brackets). Likewise, the "Directory" field cannot take a regex either.
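To make the filter rules above concrete, here is a rough Python model of them. It is only a sketch of the behaviour described in this answer, not Notepad++'s actual implementation: the function names are hypothetical, case-insensitive matching is an assumption, and the `!+\` recursive folder exclusion is not modelled.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard pattern: * is any run of characters, ? is exactly one."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            # Everything else, including [ and ], is treated as a literal.
            parts.append(re.escape(ch))
    # Assumption: matching is case-insensitive, as file names usually are on Windows.
    return re.compile("".join(parts), re.IGNORECASE)

def matches_filter(filename, filters):
    """Return True if filename passes a space-separated filter list."""
    includes, excludes = [], []
    for token in filters.split():  # space is the separator, never a literal
        if token.startswith("!"):
            excludes.append(wildcard_to_regex(token[1:]))
        else:
            includes.append(wildcard_to_regex(token))
    included = any(rx.fullmatch(filename) for rx in includes)
    excluded = any(rx.fullmatch(filename) for rx in excludes)
    return included and not excluded

print(matches_filter("notes.txt", "*.* !*.exe !*.zip !*.jar"))
print(matches_filter("tool.exe", "*.* !*.exe !*.zip !*.jar"))
# The bracket form is split on spaces into unrelated tokens, so nothing is excluded.
print(matches_filter("tool.exe", "*.* !*.[exe zip jar]"))
```

In this model the hoped-for `!*.[exe zip jar]` falls apart into the tokens `!*.[exe`, `zip` and `jar]`, which is why it cannot work as a shorthand.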

Related

Replacing with regex: How to insert a number right after a group match

How to insert a number after a group match in a find-replace regex? Like this:
mat367 -> mat0367
fis434 -> fis0434
chm185 -> chm0185
I was renaming those files with the rename command line tool. I tried the following regex
rename 's/([a-z]{3})(.+)/$1\0$2/g' *
s at the beginning means replace
* at the end means every file.
([a-z]{3})(.+) is the regex to match the name of the files.
$1\0$2 is the replacement.
I thought the regex above would insert a 0 after the first group match ($1), but it doesn't insert anything. So I tried:
rename 's/([a-z]{3})(.+)/$10$2/g' *
However, this makes the regex think that I'm referring to $10 (group number ten), and throws errors.
I'd like to know if it is possible to accomplish my goal in a single regex, that is, without running the rename command twice or more. For example, I could use rename to insert a letter instead of 0 and then replace that letter with 0, but this would require two regexes and two commands. Using only one regex may be useful in contexts other than renaming files.
Note: It seems like the regex used by rename is based on perl. That may help if someone knows perl.
The argument is evaluated as Perl code, and you are correct about Perl seeing $10.
In a double-quoted string literal (which the replacement expression is), you can only safely escape non-word characters. Like letters, digits are word characters. Specifically, \0 refers to the NUL character. So using \0 is not acceptable.
The solution is to use curlies to delimit the var name.
rename 's/([a-z]{3})(.+)/${1}0$2/g' *
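The same ambiguity exists outside Perl. As a side note for readers working in Python, its `re` module also reads `\10` in a replacement as group 10 and offers `\g<1>` as the delimited form. A minimal sketch (the helper name `insert_zero` is made up for illustration):

```python
import re

def insert_zero(name):
    # \g<1> is the delimited form of \1, so the literal 0 that follows
    # is not read as part of the group number.
    return re.sub(r"([a-z]{3})(.+)", r"\g<1>0\2", name)

for name in ("mat367", "fis434", "chm185"):
    print(name, "->", insert_zero(name))
```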
Another way to address the problem in this case is by side-stepping it. Since there's no need to replace the text before the insertion point, we don't need to capture it.
rename 's/[a-z]{3}\K(.+)/0$1/g' *
We can further simplify the second solution.
The .+ ensures there's going to be at most one match per line, so the above can be simplified to the following (assuming none of the file names contain a line feed):
rename 's/[a-z]{3}\K(.)/0$1/' *
We could even avoid the remaining capture with a look-ahead.
rename 's/[a-z]{3}\K(?=.)/0/' *
But is there really a reason to look-ahead? The following isn't equivalent as it doesn't require anything to follow the letters, but I don't think that's a problem.
rename 's/[a-z]{3}\K/0/' *
Finally, if the goal is to add a zero before the number (and thus before the first digit encountered), I'd use
rename 's/(?=\d)/0/' *
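For comparison, the capture-free variants can be approximated in Python. The standard `re` module has no `\K`, so this sketch substitutes a fixed-width look-behind for it; that is a swapped-in technique, not what the Perl one-liners do internally. The function names are hypothetical.

```python
import re

def zero_after_letters(name):
    # Look-behind stands in for \K: the match itself is empty and sits
    # right after three lowercase letters. count=1 mirrors the missing /g.
    return re.sub(r"(?<=[a-z]{3})", "0", name, count=1)

def zero_before_first_digit(name):
    # Look-ahead version: insert at the first position followed by a digit.
    return re.sub(r"(?=\d)", "0", name, count=1)

for name in ("mat367", "fis434", "chm185"):
    print(name, zero_after_letters(name), zero_before_first_digit(name))
```

Both helpers replace an empty match, so nothing from the original name has to be captured and re-emitted.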
You can wrap your variable name $1 in curly braces.
$ rename 's/([a-z]{3})(.+)/${1}0$2/g' *
This is Perl's way to enclose variable names inside strings.

Regex to find two words on the page

I'm trying to find all pages which contain words "text1" and "text2".
My regex:
text1(.|\n)*text2
It doesn't work.
If your IDE supports the s (single-line) flag (so the . character can match newlines), you can search for your items with:
(text1).*(text2)|\2.*\1
If the IDE does not support the s flag, you will need to use [\s\S] in place of .:
(text1)[\s\S]*(text2)|\2[\s\S]*\1
Some languages use $1 and $2 in place of \1 and \2, so you may need to change that.
EDIT:
Alternately, if you want to simply match that a file contains both strings (but not actually select anything), you can utilize look-aheads:
(?s)^(?=.*?text1)(?=.*?text2)
This doesn't care about the order (or number) of the arguments, and for each additional text that you want to search for, you simply append another (?=.*?text_here). This approach is nice, since you can even include regex instead of just plain strings.
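A small Python check of the look-ahead approach, assuming the page contents are already loaded into a string (the helper name `contains_both` is made up for illustration):

```python
import re

# (?s) lets . match newlines; ^ without the multiline flag anchors at the
# very start of the text, so each look-ahead scans the whole page once.
BOTH_WORDS = re.compile(r"(?s)^(?=.*?text1)(?=.*?text2)")

def contains_both(page):
    return BOTH_WORDS.search(page) is not None

print(contains_both("intro text2\nmiddle\ntext1 end"))
print(contains_both("text1 only\nno second word"))
```

Because the pattern only asserts and consumes nothing, the order of the two words in the page does not matter.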
text0[\s\S]*text1
Try this; it should do it for you.
[\s\S]* matches any run of characters, including across line breaks, similar to using .* with the s flag.
\s takes care of spaces, newlines, and tabs.
\S takes care of any non-whitespace character.
If you want the regex to match over several lines I would try:
text1[\w\W]*text2
Using . is not a good choice, because it usually doesn't match over multiple lines. Also, for matching single characters I think using square brackets is more idiomatic than using ( ... | ... )
If you want the match to be order-independent then use this:
(?:text1[\w\W]*text2)|(?:text2[\w\W]*text1)
Adding a response for IntelliJ
Building on @OnlineCop's answer: to swap the order of two expressions in IntelliJ, you would write the search pattern as in the accepted response, but since IntelliJ doesn't allow a one-line version, you have to put the replacement in a separate field. Also, IntelliJ uses $ to refer to captured groups instead of \.
For example, I tend to put my nulls at the end of my comparisons, but some people prefer it otherwise. So, to keep things consistent at work, I used this regex pattern to swap the order of my comparisons:
Notice that IntelliJ shows in a tooltip what the result of the replacement will be.
For me, text1*{0,}(text2){0,} works.
With {0,} you match your keyword zero or more times; with {1,x} you match it between 1 and x times (however often you want).

how to extract filename in this situation?

my input strings look like this:
1 warning: rg: W, MULT: file 'filename_a.h' was listed twice.
2 warning: rg: W, SCOP: scope redefined in '/proj/test/site_a/filename_b.c'.
3 warning: rg: W, ATTC: file /proj/test/site_b/filename_c.v is not resolved.
4 warning: rg: W, MULTH: property file filename_d.vu was listed outside.
They come in four different flavors as listed above. I read these from a log file line by line.
For the ones with a path specified (lines 2 and 3), I can extract the filename using $file=~s#.*/##; and it seems to work fine. Is there a way to extract the filename without conditional statements for the different types? I want to use just one clean regex. Perl's File::Basename will not work in this case either.
I am using Perl.
You could do it in two steps:
extract path from each line
get basename from the path
Example
#!/usr/bin/perl -n
use feature 'say';
use File::Basename;
#NOTE: assume that unquoted path has no spaces in it
say basename($1.$2) if /(?:file|redefined in)\s+(?:'([^']+)'|(\S+))/;
Output
filename_a.h
filename_b.c
filename_c.v
filename_d.vu
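For readers who want to try the same two-step idea outside Perl, here is a Python sketch of it. It reuses the regex from the Perl one-liner and keeps its assumption that an unquoted path contains no spaces; the function name `extract_filename` is hypothetical.

```python
import os.path
import re

# Step 1: capture the path, either quoted (group 1) or unquoted (group 2).
PATH_RE = re.compile(r"(?:file|redefined in)\s+(?:'([^']+)'|(\S+))")

def extract_filename(line):
    match = PATH_RE.search(line)
    if match is None:
        return None
    path = match.group(1) or match.group(2)
    # Step 2: strip any directory part.
    return os.path.basename(path)

lines = [
    "1 warning: rg: W, MULT: file 'filename_a.h' was listed twice.",
    "2 warning: rg: W, SCOP: scope redefined in '/proj/test/site_a/filename_b.c'.",
    "3 warning: rg: W, ATTC: file /proj/test/site_b/filename_c.v is not resolved.",
    "4 warning: rg: W, MULTH: property file filename_d.vu was listed outside.",
]
for line in lines:
    print(extract_filename(line))
```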
Your problem needs more constraints. For example, what's a good way to characterize a string as a "path" (or "filename") or not? You might say, "Hey, when I see a single dot immediately followed by letters and numbers (but not symbols), and there are a bunch of characters before that dot too, then it might be a path or filename!"
\s+([^\s]+\.\w+)
But this doesn't catch all paths, nor files without an extension. So we might latch on an alternation to say, "Either the above, or, a string with at least one slash in it."
\s+([^\s]+\.\w+|[^\s]*\/[^\s]*)
(Note that you may not need to escape the slash in the above example, since you seem to be using # as your delimiter.)
What I'm getting at, in any case, is that you need to specify your problem more rigorously, and this will automatically bring you to a satisfying solution. Of course, there is no truly "correct" solution using regexes alone: you'd need to do file tests to do that.
To go further with this example, perhaps you want to define a list of extensions:
\s+([^\s]+\.(?:c|h|cc|cpp)|[^\s]*\/[^\s]*)
Or, perhaps you want to be more generic, but allow only extensions up to 4 characters long:
\s+([^\s]+\.\w{1,4}|[^\s]*\/[^\s]*)
Perhaps you only consider something a path if it begins with a slash, but you still want at least one other slash somewhere in it:
\s+([^\s]+\.\w{1,4}|/[^\s]*\/[^\s]*)
Good luck.
/\w+\.\w+/
This will match the file name in each of the four warning formats. \w matches any word character (letters, digits, and underscores), so this regex looks for one or more word characters, followed by a literal dot, followed by more word characters.
This works because the only other dot in your logs is at the end of the log.
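A quick Python check of this idea against the four sample lines, written with an escaped dot and + quantifiers (\w+\.\w+) so that the dot is literal and the match cannot be empty. The helper name is made up, and the approach still assumes that no directory name in the path contains a dot.

```python
import re

DOTTED_NAME = re.compile(r"\w+\.\w+")

def first_dotted_name(line):
    # Return the first "word.word" token in the line, or None if there is none.
    match = DOTTED_NAME.search(line)
    return match.group(0) if match else None

samples = [
    "1 warning: rg: W, MULT: file 'filename_a.h' was listed twice.",
    "2 warning: rg: W, SCOP: scope redefined in '/proj/test/site_a/filename_b.c'.",
    "3 warning: rg: W, ATTC: file /proj/test/site_b/filename_c.v is not resolved.",
    "4 warning: rg: W, MULTH: property file filename_d.vu was listed outside.",
]
for line in samples:
    print(first_dotted_name(line))
```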

find all text before using regex

How can I use regex to find all text before the text "All text before this line will be included"?
I have included some sample text below as an example:
This can include deleting, updating, or adding records to your database, which would then be reflex.
All text before this line will be included
You can make this a bit more sophisticated by encrypting the random number and then verifying that it is still a number when it is decrypted. Alternatively, you can pass a value and a key instead.
Starting with an explanation... skip to end for quick answers
To match up to a specific piece of text, and confirm it's there but not include it in the match, you can use a positive lookahead, using the notation (?=regex)
This confirms that 'regex' exists at that position, but matches the start position only, not the contents of it.
So, this gives us the expression:
.*?(?=All text before this line will be included)
Where . is any character, and *? is a lazy match (consumes least amount possible, compared to regular * which consumes most amount possible).
However, in almost all regex flavours . will exclude newline, so we need to explicitly use a flag to include newlines.
The flag to use is s (which stands for "Single-line mode", although it is also referred to as "DOTALL" mode in some flavours).
And this can be implemented in various ways, including...
Globally, for /-based regexes:
/regex/s
Inline, global for the regex:
(?s)regex
Inline, applies only to bracketed part:
(?s:reg)ex
And as a function argument (depends on which language you're doing the regex with).
So, probably the regex you want is this:
(?s).*?(?=All text before this line will be included)
However, there are some caveats:
Firstly, not all regex flavours support lazy quantifiers - you might have to use just .*, (or potentially use more complex logic depending on precise requirements if "All text before..." can appear multiple times).
Secondly, not all regex flavours support lookaheads, so you will instead need to use captured groups to get the text you want to match.
Finally, you can't always specify flags, such as the s above, so may need to either match "anything or newline" (.|\n) or maybe [\s\S] (whitespace and not whitespace) to get the equivalent matching.
If you're limited by all of these (I think the XML implementation is), then you'll have to do:
([\s\S]*)All text before this line will be included
And then extract the first sub-group from the match result.
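As an illustration in one concrete flavour, here are both variants in Python: the lookahead with the inline s flag, and the captured-group fallback. The sample text is shortened and the function names are made up for this sketch.

```python
import re

MARKER = "All text before this line will be included"

def before_with_lookahead(text):
    # Lazy match up to the marker; the lookahead keeps the marker out of the match.
    match = re.search(r"(?s).*?(?=" + re.escape(MARKER) + ")", text)
    return match.group(0) if match else None

def before_with_group(text):
    # Fallback without flags or lookaheads: capture everything before the marker.
    match = re.search(r"([\s\S]*)" + re.escape(MARKER), text)
    return match.group(1) if match else None

sample = (
    "This can include deleting, updating, or adding records.\n"
    "All text before this line will be included\n"
    "You can make this a bit more sophisticated.\n"
)
print(repr(before_with_lookahead(sample)))
print(repr(before_with_group(sample)))
```

The two functions agree when the marker appears once. If it can appear several times, the lazy version stops at the first occurrence and the greedy `[\s\S]*` at the last.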
(.*?)All text before this line will be included
Depending on what particular regular expression framework you're using, you may need to include a flag to indicate that . can match newline characters as well.
The first (and only) subgroup will include the matched text. How you extract that will again depend on what language and regular expression framework you're using.
If you want to include the "All text before this line..." text, then the entire match is what you want.
This should do it:
<?php
$str = "This can include deleting, updating, or adding records to your database, which would then be reflex.
All text before this line will be included
You can make this a bit more sophisticated by encrypting the random number and then verifying that it is still a number when it is decrypted. Alternatively, you can pass a value and a key instead.";
echo preg_filter("/(.*?)All text before this line will be included.*/s","\\1",$str);
?>
Returns:
This can include deleting, updating, or adding records to your database, which would then be reflex.

How to do a proper search/replace using GREP Regular Expressions in a text editor?

I am trying to run some regular expressions (grep) on a text file of about 4K lines. The main portion that I need replaced looks like this:
1,"An Internet-Ready Resume",1,2,"","
And I need it to look like this:
<item>
<title>An Internet-Ready Resume</title>
<category>1</category>
<author>2</author>
<content>
So far, this is what I was trying to no avail:
[0-9]{1}\,\"*\"\,[0-9]\,[0-9]\,\"\"\,\"
You should start with doing a little reading on regular expressions. There are tons of useful resources online. Then you would see that:
you needn't escape everything (such as commas or quotes)
the asterisk * doesn't mean "anything"; it means "zero or more times" and applies to whatever precedes it
the "any character" symbol is the . character, so .* means any character any number of times (or "anything")
if you need to make substitutions where you need atoms of what you're searching, you have to set those atoms by using (<atom content>) where <atom content> is a bit of a regexp.
A tip to start: instead of \"*\" try ".*"; Check the reference.
Also note that the part regarding the replacement will depend on the text editor/tool you're using. Usually a regexp such as (a)(b) (where a,b are regexp atoms) being replaced by x\1y\2z would produce xaybz.
The error is the \"*\" part. When you use the * operator you need to tell it what is to be repeated. As written it is going to repeat the previous quote character. Instead of that you should tell it to repeat any character (.), thus: \".*\"
A secondary comment is that you have a lot of unnecessary backslashes. In fact, none of them are necessary as far as I can tell. Without them your regex looks like:
[0-9],".*",[0-9],[0-9],"","
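To show how the cleaned-up pattern and the group idea fit together, here is a Python sketch of the whole substitution. Several things in it are assumptions rather than part of the answers above: the leading number is treated as an id and dropped, the third and fourth fields are taken as category and author, and `.*?` (lazy) is used instead of `.*` so the title stops at the first closing quote. Your text editor's replacement syntax may differ (`\1` versus `$1`).

```python
import re

# Groups: 1 = title, 2 = category, 3 = author. The first field is not captured.
PATTERN = re.compile(r'[0-9],"(.*?)",([0-9]),([0-9]),"","')

REPLACEMENT = "\n".join([
    "<item>",
    r"<title>\1</title>",
    r"<category>\2</category>",
    r"<author>\3</author>",
    "<content>",
])

def convert(line):
    return PATTERN.sub(REPLACEMENT, line)

print(convert('1,"An Internet-Ready Resume",1,2,"","'))
```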