Regex to remove version history numbering from end of file names - regex

I'm fairly new to regex and I'm trying to write an expression to remove version history numbering from the end of a big group of file names for use in File Renamer.
These are the various forms of version history I am trying to remove:
1.0 2.1a 3.5b v4.6 v5.7a v6.8b V7.9 V8.0a V9.1b And any of the previous forms
can be enclosed in () [] or {} Maximum 5 Character + 2 brackets.
Examples followed by the desired result:
File-Name(v1.0).txt - File-Name.txt
fn1[V1.1].txt - fn1.txt
FN2(v1.2a).txt - FN2.txt
filename1.3.txt - filename.txt
file(name)1.4b.txt - file(name).txt
[file]name-V1.5.txt - [file]name-.txt
Author - [book 01](1.6).txt - Author - [book 01].txt
This is the expression I've written to remove them:
\s?([\[\(\{]?[vV]?[1-9]\.[0-9][ab]?[\]\)\}]?)\.txt
REPLACE: .txt
I do not want it to remove anything larger than 5 characters in brackets, for example:
Author - [Book 1.0].txt
Should NOT be changed.
I want to make the closing brackets conditional on the presence of the opening bracket in a position where they can contain a maximum of 5 characters, which must include a "."
between two numbers.
Other examples of edge cases it must ignore:
FN-volume1.txt
FN[Vol3.0].txt
FileName - book 2.txt
EDIT: Eventually got it to work in JS by removing the < symbol:
s?([\[({][vV]?\d+\.\d+[ab]?[\])}]|(?!([\[({]))[vV]?\d+\.\d+[ab]?(?![\])}]))(?=\W?\.\w{3})

Use an alternation that first tries to match with both brackets, then without brackets (using negative look arounds) effectively making the pair optional - ie both there or both absent:
\s?([\[({][vV]?\d+\.\d+[ab]?[\])}]|(?<![\[({])[vV]?\d+\.\d+[ab]?(?![\])}]))(?=\W?\.\w{3})
See live demo
This regex matches just the version, so replace with blank.
I have also made the file extension flexible - all file types ae handled (if this is unwanted, replace \w{3} with txt).
The regex for the version number has been simplified, and unnecessary escaping in character classes removed.
Also note that your approach is slightly naive as bracket types don't have to match, eg (v1.2] would match. If that's not a possibility, don't worry, but if it is, you would need a separate alternation for each bracket type pair.

Related

Regex to remove double lines ignoring the punctation marks and spaces in Notepad++ [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 6 months ago.
Is it possible to remove duplications with ignoring the punctation marks and spaces in Notepad++? I would keep one of them matching lines (doesn't matter which to keep).
My examples are from the txt file:
Rough work iconoclasm but the only way to get the truth. Oliver Wendell Holmes
Rough work, iconoclasm, but the only way to get the truth. Oliver Wendell Holmes
Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett
Rule No.1: Never lose money. Rule No.2: Never forget rule No.1. Warren Buffett
Self-esteem isn't everything, it's just that there's nothing without it. Gloria Steinem
Self-esteem isn't everything it's just that there's nothing without it. Gloria Steinem
You said she's a senior? Babe we're all crazy.
You said, she's a senior! Babe we're ALL crazy.
You said, she's a senior? Babe we're ALL crazy!
Result I need:
Rough work iconoclasm but the only way to get the truth. Oliver Wendell Holmes
Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett
Self-esteem isn't everything, it's just that there's nothing without it. Gloria Steinem
You said, she's a senior! Babe we're ALL crazy.
I can delete 100% matching duplications with regex, but can't find a regex rule to ignore spaces and marks.
I don't think regex is the best tool for this task, but it's a nice challenge. You can match single words using a nested structure like:
((\w+)\W+((\w+)\W+( ... ((\w+)\W+)? ... )?)?(\w*))
When matching this, capture groups 2 to n contain the words 1 to n-1 of a line. The nested structure is necessary to make it non-ambiguous - otherwise, running the regex takes too long.
To match the duplicate lines, we use a similar structure with back-references:
\1\W+(\2\W+( ... (\9\W+)? ... )?)?
This will also match lines that are substrings of the previous line, which is again helpful to improve performance.
Notice that you have to use the \g{n}-notation when using more than 9 references in Notepad++. Moreover, to avoid matching line breaks you should use [^\w\n\r] instead of \W. To further improve performance, unnecessary groups should be non-matching, i.e., (?: ... ).
To generate the rather long regex that solves the problem for, e.g., up to 20 words per line, you can use the following script:
MAX_WORDS = 20
punct = "[^\\w\\n\\r]"
backref = (i) => `\\g{${i}}`
patternKeep = (i) => "(\\w+)[^\\w\\n\\r]+" + (i < 0 ? "" : `(?:${patternKeep(i-1)})?`)
patternRemove = (i) => `${backref(MAX_WORDS-i + 2)}(?:${punct}+` + (i < 0 ? "" : patternRemove(i-1)) + ")?"
console.log("^(" + patternKeep(MAX_WORDS) + "(\\w*))(\\r?\\n" + patternRemove(MAX_WORDS)+ `${punct}*${backref(MAX_WORDS+4)}${punct}*)+$`)
When copying this to Notepad++ with settings "Wrap around" on and "Match case" off and replacing with $1, it will remove all duplicate lines in your example.
I doubt that it can be done purely with regular expressions. If it can then I imagine that the expression would be difficult to understand and difficult to maintain. Instead I would suggest a multi-step approach.
Step 1 - modify each line to be: original-line separator original-line.
Step 2 - convert it to be line-without-punctuation separator original-line.
Step 3 - sort the lines
Step 4 - remove duplicated lines
Step 5 - remove line-without-punctuation and separator leaving just the original line.
In more detail:
In all the replaces below: select "Wrap around", unselect "Dot matches newline", unselect "Match whole word only" and unselect "Match case".
Step 1 - choose a separator, some text that is not punctuation and does not occur in the file. Here I use qqq. Do a regular expression replace of ^(.+)$ with \1qqq\1.
Step 2 - remove any punctuation before the separator. Repeatedly do a regular expression replace of [!',-.:?]+(.*qqq) with \1 until no more replacements are made. This expression matches all the punctuation in the example, but you may need to add more for your full text. Also need to reduce multiple spaces to singles, so repeatedly do a regular expression replace of +(.*qqq) with \1 until no more replacements are made. One final step to handle spaces before the qqq do a regular expression replace of qqq with qqq (this could also use a non-regular expression replace).
Step 3 - sort the lines lexicographically.
Step 4 - remove duplicated lines. Repeatedly do a regular expression replace of ^(.*qqq).*\R\1 with \1 until no more replacements are made.
Step 5 - Remove unwanted text leaving the original line. Do a regular expression replace of ^.*qqq with nothing (the empty string).
If all punctuation can be deleted and the result being a line without punctuation then could simple do a regular expression replace of [!',-.:? ]+ with , a sort and finally a remove duplicates.
Previously this question attracted an answer, but the author deleted it. To me it was so interesting because a special technique was illustrated. In a comment the answerer pointed me towards another thread to read more about it.
After experimenting a bit with that answer, an idea was the following pattern. Settings in NP++ are to uncheck: [ ] match case, [ ] .matches newline - Replace with emptystring.
^(?>[^\w\n]*(\w++)(?=.*\R(\2?+[^\w\n]*\1\b)))+[^\w\n]*\R(?=\2[^\w\n]*$)
Here is the demo in Regex101 - Assumption is, that duplicate lines are consecutive (like sample).
Most of the used regex-tokens can be looked up in the Stack Overflow Regex FAQ.
In short words, the mechanism used is to capture words from one line to the first group (\w++) while inside the lookahead (?=.*\R(\2?+...\1\b))) a second group in the consecutive line is "growing" from itself plus the captures until \R(?=\2...$) it either matches all words or fails.
Illustration of some steps from the regex101 debugger:
The second group holds the substring of the consecutive line that matches words and order of the previous line. It expands at each repetition from optionally itself and a word from the previous line. Separated by [^\w\n]* any amount of characters that are not word characters or newline.
For making it work, matching is done without giving back at crucial points (prevent backtracking).

PostgreSQL: .csv regex - test for repeating substrings within a string (digits)

Introduction:
I have the following scenario in PostgreSQL whereby I want to perform some data validation on a .csv string prior to inserting it into a table (see the fiddle here).
I've managed to get a regex (in a CHECK constraint) which disallows spaces within strings (e.g. "12 34") and also disallows preceding zeros ("00343").
Now, the icing on the cake would be if I could use regular expressions to disallow strings which contain a repeat of an integer - i.e. if a sequence \d+ matched another \d+ within the same string.
Is this beyond the capacities of regular expressions?
My table is as follows:
CREATE TABLE test
(
data TEXT NOT NULL,
CONSTRAINT d_csv_only_ck
CHECK (data ~ '^([ ]*([1-9]\d*)+[ ]*)(,[ ]*([1-9]\d*)+[ ]*)*$')
);
And I can populate it as follows:
INSERT INTO test VALUES
('992,1005,1007,992,456,456,1008'), -- want to make this line unnacceptable - repeats!
('44,1005,1110'),
('13, 44 , 1005, 10078 '), -- acceptable - spaces before and after integers
('11,1203,6666'),
('1,11,99,2222'),
('3435'),
(' 1234 '); -- acceptable
But:
INSERT INTO test VALUES ('23432, 3433 ,00343, 567'); -- leading 0 - unnacceptable
fails (as it should), and also fails (again, as it should)
INSERT INTO test VALUES ('12 34'); -- spaces within numbers - unnacceptable
The question:
However, if you notice the first string, it has repeats of 992and 456.
I would like to be able to match these.
All of these rules do not have to be in the same regex - I can use a second CHECK constraint.
I would like to know if what I am asking is possible using Regular Expressions?
I did find this post which appears to go some (all?) of the way to solving my issue, but I'm afraid it's beyond my skillset to get it to work - I've included a small test at the bottom of the fiddle.
Please let me know should you require any further information.
p.s. as an aside, I'm not very experienced with regexes and I would welcome any input on my basic one above.
Since PostegreSQL regex does not support backreferences, you cannot apply this restriction because you would need a negative lookahead with a backreference in it.
Have a look at this PCRE regex:
^(?!.*\b(\d+)\b.*\b\1\b) *[1-9]\d* *(?:, *[1-9]\d* *)*$
See this regex demo.
Details:
^ - start of string
(?!.*\b(\d+)\b.*\b\1\b) - no same two numbers as whole word allowed anywhere in the string
* - zero or more spaces
[1-9]\d* - a non-zero digit and then any zero or more digits
* - zero or more spaces
(?:, *[1-9]\d* *)* - zero or more occurrences of
, * - comma and zero or more spaces
[1-9]\d* - a non-zero digit and then any zero or more digits
* - zero or more spaces
$ - end of string.
Even if you replace \b with \y (PostgreSQL regex word boundaries) in the PostgreSQL code, it won't work due to the drawback mentioned at the top of the answer.

Regex Erasing all except numbers with limited digits

What I want to do is erase everything except \d{4,7} only by replacing.
Any ideas to get this?
ex)
G-A15239L → 15239
(G-A and L should be selected and replaced by empty strings)
now200316stillcovid19asdf → 200316
(now and stillcovid19asdf should be selected and replaced by empty strings)
Also, replacing text is not limited as empty string.
substitutions such as $1 are possible too.
Using Regex in 'Kustom' apps. (including KLCK, KLWP, KWGT)
I don't know which engine it's using because there are no information about it
You may use
(\d{4,7})?.?
Or
(\d{4,7})|.
and replace with $1. See the regex demo.
Details
(\d{4,7})? - an optional (due to ? at the end - if it is missing, then the group is obligatory) capturing group matching 1 or 0 occurrences of 4 to 7 digits
| - or
.? - any one char other than line break chars, 1 or 0 times when ? is right after it.
So, any match of 4 to 7 digits is kept (since $1 refers to the Group 1 value) and if there is a char after it, it is removed.
It looks as if the regex is Java based since all non-matching groups are replaced with null:
So, the only possible solution is to use a second pass to post-process the results, just replace null with some kind of a delimiter, a newline for example.
Search: .*?(\d{4,7})[^\d]+|.*
Replace: $1
in for instance Notepad++ 6.0 or better (which comes with built-in PCRE support) works with your examples:
jalsdkfilwsehf
now200316stillcovid19asdf
G-A15239L
becomes:
200316
15239

How to apply conditional treatment with line.endswith(x) where x is a regex result?

I am trying to apply conditional treatment for lines in a file (symbolised by list values in a list for demonstration purposes below) and would like to use a regex function in the endswith(x) method where x is a range page-[1-100]).
import re
lines = ['http://test.com','http://test.com/page-1','http://test.com/page-2']
for line in lines:
if line.startswith('http') and line.endswith('page-2'):
print line
So the required functionality is that if the value starts with http and ends with a page in the range of 1-100 then it will be returned.
Edit: After reflecting on this, I guess the corollary questions are:
How do I make a regex pattern ie page-[1-100] a variable?
How do I then use this variable eg x in endswith(x)
Edit:
This is not an answer to the original question (ie it does not use startswith() and endswith()), and I have no idea if there are problems with this, but this is the solution I used (because it achieved the same functionality):
import re
lines = ['http://test.com','http://test.com/page-1','http://test.com/page-100']
for line in lines:
match_beg = re.search( r'^http://', line)
match_both = re.search( r'^http://.*page-(?:[1-9]|[1-9]\d|100)$', line)
if match_beg and not match_both:
print match_beg.group()
elif match_beg and match_both:
print match_both.group()
I don't know python well enough to paste usable code, but as far as the regular expression is concerned, this is rather trivial to do:
page-(?:[2-9]|[1-9]\d|100)$
What this expression will match:
page- is just a fixed string that will be matched 1:1 (case insensitive if you set Options for that).
(?:...) is a non-capturing group that's just used for separating the following branching.
| all act as "either or" with the expressions being to their left/right.
[2-9] will match this numerical range, i.e. 2-9.
[1-9]\d will match any two Digit number (10-99); \d matches any digit.
100 is again a plain and simple match.
$ will match the line end or end of string (again based on settings).
Using this expression you don't use any specific "ends with" functionality (that's given through using $).
Considering this will have to parse the whole string anyway, you may include the "begins with" check as well, which shouldn't cause any additional overhead (at least none you'd notice):
^http://.*page-(?:[2-9]|[1-9]\d|100)$
^ matches the beginning of the line or string (based on settings).
http:// is once again a plain match.
. will match any character.
* is a quantifier "none or more" for the previous expression.
To get you going in the right direction, the Regex that matches your needed range of pages is:
^http.*page-([2-9]?|[1-9][0-9]|100)$
this will match lines that start with http and end with page-<2 to 100> inclusive.

RegEx Lookaround issue

I am using Powershell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters. The time/date part will always be _ 8 characters _ 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.
If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, you want to substitute the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}' ''. You can't simply pull two bits (name and extension) of the string out as one match - regex patterns match contiguous chunks only.
try this ( not tested), but it should works for any 'my_file_name' lenght , any lenght of digit and any kind of extension.
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
non regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")
The original regex you specified returns the maximum match that has 14 characters after it (you can change to (?=.{14}) who is the same).
Once you've changed it, it returns the maximum match that has 14 characters after it + the next 3 characters. This is why you're getting this result.
The approach described by Inductiveload is probably better in case you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*) Otherwise, I'd do it in two separate stages
get the initial part
get the extension
The reason you get my_filename_01 when you add that is because lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is is followed by something matching _.{8}_.{6}, however once that match is found, you've only consumed my_file_name, so the addition of .{3} will then consume the next 3 characters, namely _01.
As for a regex that would fit your needs, others have posted viable alternatives.