Remove stuff, retrieve numbers, retrieve text with spaces in place of dots, remove the rest - regex

This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of files a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3-letter uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
cià
ale

.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
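The difference is easy to check by running both patterns against the sample name. The sketch below uses Python's re module purely for illustration (the renaming app's regex flavor may differ), drops the extension because the app preserves it anyway, and turns the dots in the captured title into spaces in a second step.

```python
import re

# Sample name from the question, without the extension.
name = "Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here"

# Original pattern: the lazy (.*?) is allowed to cross dots, so it matches.
lazy = re.compile(r".*x(\d+)\.(.*?)\.[A-Z]{3}.*")
# Changed pattern: [^.]*? can never cross a dot, so the dotted title cannot fit.
no_dots = re.compile(r".*x(\d+)\.([^.]*?)\.[A-Z]{3}.*")

m = lazy.fullmatch(name)
number, title = m.group(1), m.group(2)

# Second step: replace the dots inside the captured title with spaces.
new_name = f"{number} {title.replace('.', ' ')}"
print(new_name)                  # 212 The Actual Title Of the Chapter

print(no_dots.fullmatch(name))   # None
```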

Since you are on a Mac, you could use the shell:
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo "$filename" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
    newfile=${file#*x}
    newfile=${newfile%.[A-Z][A-Z][A-Z].*}
    # or
    # newfile=$(echo "$file" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
    mv "$file" "$newfile"
done

To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve you just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, if there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.
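The points about anchoring and the leading .* can be tried out in a few lines. This is only a sketch in Python (the asker's renaming app is a different tool), and the input string is made up so that the pattern can match in two places.

```python
import re

# Hypothetical input with two places where x(\d+)\. could match.
s = "a.1x11.first.ABC.2x22.second.DEF"
pattern = r"x(\d+)\.(.*?)\.[A-Z]{3}"

# Unanchored search: the match starts at the first place the pattern fits.
plain = re.search(pattern, s)
# With a leading greedy .*, the engine gives characters back from the end of
# the string, so the "real" part of the match starts at the last candidate.
padded = re.search(r".*" + pattern, s)

print(plain.groups())    # ('11', 'first')
print(padded.groups())   # ('22', 'second')

# re.match anchors at the start only; re.fullmatch anchors at both ends.
print(re.match(pattern, s))                                   # None
print(re.fullmatch(r".*" + pattern + r".*", s) is not None)   # True
```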


Use regex to match certain number of lines that follow the line containing the occurrence of a specific string

I am working in InDesign, formatting large quantities of text. Here is a sample of the text.
NEW! Certificate in Office Operations (3 parts)
Office Operations
Cyber Security for Managers
Embracing Sustainability in the Workplace
Intro to 3D Printing
Intro to Maker Tech: The New Shop Class
I need to be able to match the three lines that follow a line containing the string "(3 parts)".
My thought would be to try a positive look-behind like this:
(?<=\(3 parts\)$)^.*$
but it doesn't match anything.
The lookbehind part is correct, but the use of the symbols ^ (Begin Paragraph) and $ (End Paragraph) is restricted to matching the position only – not the actual 'Hard return' characters. That is the reason your expression fails: by default, the . "match all" character does not match returns. So that makes the first test (?<=\(3 parts\)$)^. fail: neither the $ in the lookbehind nor the ^ consumed the return, and the following . does not match it either, per this default rule.
It is possible to put GREP into Single Line mode – a funny description that may put you on the wrong foot. From the perspective of GREP, it allows . to match a return as well; and so an entire running text, hard returns and all, can be considered a "single (long) line". The code for that is (?s), and is typically put at the very front of your expression.
That in itself is not enough to make it work, because
(?s)(?<=\(3 parts\)$)^.
still expects a return between the $ and ^ (otherwise either one would be wrong!). Anyway, it's not a good way to match a certain number of paragraphs. The adjusted expression
(?s)(?<=\(3 parts\)$).^.*
works correctly in consuming the hard returns, but selects everything up to the end as well.
I propose a much simpler approach: if you want to grab a certain number of hard returns, just include them right away in your expression – their GREP code is \r.
That leads to the following:
(?<=\(3 parts\)\r)(.*\r){3}
where the lookbehind is what you already got, plus a return to end that particular line (and it's in the lookbehind because you don't want to grab that return as well), followed by three repetitions of a sequence to grab an entire line, .*\r.
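The same idea can be tried outside InDesign. The sketch below is an approximation in Python, not InDesign GREP: it assumes the paragraphs are separated by \n rather than \r, and uses a non-capturing group for the repeated line.

```python
import re

# Sample text from the question, with \n standing in for InDesign's hard return.
text = (
    "NEW! Certificate in Office Operations (3 parts)\n"
    "Office Operations\n"
    "Cyber Security for Managers\n"
    "Embracing Sustainability in the Workplace\n"
    "Intro to 3D Printing\n"
    "Intro to Maker Tech: The New Shop Class\n"
)

# The lookbehind checks for "(3 parts)" plus the line break without consuming
# them; (?:.*\n){3} then grabs exactly three complete lines.
pattern = re.compile(r"(?<=\(3 parts\)\n)(?:.*\n){3}")

m = pattern.search(text)
following = m.group(0).splitlines()
print(following)
```

Only the three lines after the "(3 parts)" line end up in following; the heading line itself and the two "Intro to" lines are left out.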
You can use -A option in grep:
grep -A 3 -F '(3 parts)' file
NEW! Certificate in Office Operations (3 parts)
Office Operations
Cyber Security for Managers
Embracing Sustainability in the Workplace
Would this be something for you?
\Q(3 parts)\E\r((?:.*$\R){3})
See a demo on regex101.com. As @Jongware pointed out, it seems to be \r (lowercase) in Adobe InDesign.

RegEx "replace all but" for Notepad++ v6.3

First timer and relatively inexperienced with RegEx and Notepad++. What I am trying to do is replace everything but the policy numbers in these two firewall sessions. Mind you, I have multiple lists that are 700+ lines long, so I want to replace everything in one pass, leaving just the policy number for each line.
id 1978781/s23,vsys 0,flag 00200440/4000/0003,policy 4332,time 5972, dip 0 module 0
id 1997645/s23,vsys 0,flag 00200440/4000/0003,policy 30562,time 6283, dip 0 module 0
There are thousands of different policy numbers, so a simple search won't do.
I would like my lines to look like this after a replace.
4332
30562
After two hours of trying to learn RegEx for this one problem, I realized it's more involved than I expected, and I need to spend time learning this since it's a very powerful tool. This could really save a lot of time, which unfortunately I don't have at the moment. I'm looking forward to learning more about RegEx and appreciate any help or direction you could give me.
Given the fact the lines always look the same you can use the following
^.+policy (\d+).+$
Replace by : $1
The dot is a wildcard, so .+ means find everything before the word "policy ". Then find a group of digits (\d+ is for finding digits) and save them (that's what the parentheses are for in many regex engines). Then find all the characters till the end of the line.
The ^ character means start of line. The $ means end of line.
You can try the following:
Find:
^.*policy ([0-9]+).*$
Replace with:
\1
Why does this work?
The dot matches any character, and the star means "zero or more of" the character preceding it. This means that .* matches any run of characters on the line, including an empty one.
What you want is to match everything before and after the policy and erase it, and keep just the policy number, so between your everything matchers you look for the string "policy xxxxx" where the xxxxx are numbers.
Each term surrounded by parentheses in your regex is saved to be used in the replacement. I put parentheses around the number matcher, [0-9]+, and then use what was matched in the replace part with \1. If your regex contains several parenthesized parts, you can get them with \1, \2, \3...
Regexes are really powerful, you should read a tutorial about them to learn what they can offer.
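To see the find/replace pair from the answers above at work outside Notepad++, here is a small sketch in Python. It assumes every line contains exactly one "policy " field, as in the two sample lines from the question.

```python
import re

lines = [
    "id 1978781/s23,vsys 0,flag 00200440/4000/0003,policy 4332,time 5972, dip 0 module 0",
    "id 1997645/s23,vsys 0,flag 00200440/4000/0003,policy 30562,time 6283, dip 0 module 0",
]

# Whole line is matched; only the digits after "policy " are captured.
pattern = re.compile(r"^.*policy ([0-9]+).*$")

# Replacing the whole match with \1 leaves just the captured policy number.
policies = [pattern.sub(r"\1", line) for line in lines]
print(policies)   # ['4332', '30562']
```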

Regex to match when a string is present twice

I am horrible at RegEx expressions and I just don't use them often enough for me to remember the syntax between uses.
I am using grepWin to search my files. I need to do a search that will return the files that have a given string twice.
So, for example, if I was searching on the word "how", then file one would not match:
Hello
how are you today?
but file two would:
Hello
how are you today?
I am fine, how are you?
Anyone know how to make a RegEx that will match that?
something like this (depends on language and your specific task)
\(how.*){2}\
Edit:
according to @CodeJockey
\^(([^h]|h[^o]|ho[^w])*how([^h]|h[^o]|ho[^w])*){2,2}$\
(it becomes more complicated)
@CodeJockey: Thanks for comments
I don't know what grepWin supports, but here's what I came up with to make something match exactly twice.
/^((?!how).)*how((?!how).)*how((?!how).)*$/
Explanation:
/^ # start of subject
((?!how).)* # any text that does not contain "how"
how # the word "how"
((?!how).)* # any text that does not contain "how"
how # the word "how"
((?!how).)* # any text that does not contain "how"
$/ # end of subject
This ensures that you find two "how"s, but the texts between the "how"s and to either side of them do not contain "how".
Of course, you can substitute any string for "how" in the expression.
If you want to "simplify" by only writing the search expression twice, you can use backreferences thus:
/^(?:(?!how).)*(how)(?:(?!\1).)*\1(?:(?!\1).)*$/
Refiddle with this expression
Explanation:
I added ?: to make the negative lookaheads' text non-capturing. Then I added parentheses around the regular how to make that a capturing subpattern (the first and only one).
I had to include "how" again in the first lookahead because it's a negative lookahead (meaning any capture would not contain "how") and the captured "how" is not captured yet at that point.
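As a rough check of the "exactly twice" expression, the sketch below runs it in Python against the two sample files from the question plus a made-up third one. It is not grepWin; it assumes the whole file content is one subject string, and uses re.DOTALL so that the dot also crosses line breaks.

```python
import re

# Non-capturing version of the expression above; DOTALL lets "." match newlines.
exactly_twice = re.compile(
    r"^(?:(?!how).)*how(?:(?!how).)*how(?:(?!how).)*$",
    re.DOTALL,
)

one = "Hello\nhow are you today?"
two = "Hello\nhow are you today?\nI am fine, how are you?"
three = two + "\nAnd how is the family?"   # hypothetical third file

for label, content in [("one", one), ("two", two), ("three", three)]:
    print(label, bool(exactly_twice.search(content)))
# one False
# two True
# three False
```

Note that the expression looks for the letters "how" anywhere, so a word such as "however" counts as an occurrence too.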
This is significantly harder than I originally thought it would be, and requires variable-length lookbehind, which grepWin does not support...
this expression:
(?<!blah.{0,99999})blah(?=.*?blah)(?!.*blah.*blah)
was successfully used in Eclipse, using the "Search > File" dialog to exclude files with one and three instances of blah and to include files with exactly two instances of blah.
Eclipse does not permit a .* in lookbehind, so I used .{0,99999} instead.
It is possible with the right tool, but it isn't pretty to get it to work with grepWin (see answer above). Can you use other tools (such as Eclipse), and what did you want to do with the files afterwards?
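This works with grep or Python. It reports a match when "how" occurs at least twice, but with grep only when both occurrences are on the same line, because grep works line by line:
grep "how.*how" your_file
In Python (with re imported), pass re.DOTALL so that the dot also matches line breaks and the two occurrences may sit on different lines:
re.search(r"how.*how", your_text, re.DOTALL)
The match spans from the first "how" to the last one (the dot means any character and the star means zero or more repetitions), and you can build your own script around it.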

how to eliminate dots from filenames, except for the file extension

I have a bunch of files that look like this:
A.File.With.Dots.Instead.Of.Spaces.Extension
Which I want to transform via a regex into:
A File With Dots Instead Of Spaces.Extension
It has to be in one regex (because I want to use it with Total Commander's batch rename tool).
Help me, regex gurus, you're my only hope.
Edit
Several people suggested two-step solutions. Two steps really make this problem trivial, and I was really hoping to find a one-step solution that would work in TC. I did, BTW, manage to find a one-step solution that works as long as there's an even number of dots in the file name. So I'm still hoping for a silver bullet expression (or a proof/explanation of why one is strictly impossible).
It appears Total Commander's regex library does not support lookaround expressions, so you're probably going to have to replace a number of dots at a time, until there are no dots left. Replace:
([^.]*)\.([^.]*)\.([^.]*)\.([^.]*)$
with
$1 $2 $3.$4
(Repeat the sequence and the number of backreferences for more efficiency. You can go up to $9, which may or may not be enough.)
It doesn't appear there is any way to do it with a single, definitive expression in Total Commander, sorry.
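The "repeat until nothing changes" idea can be sketched as a loop. This is Python rather than Total Commander, and the two-dot fallback pattern is an assumption added here (it is not part of the answer): it handles the case where the three-dot pattern leaves exactly two dots behind.

```python
import re

# Pattern and replacement from the answer above: removes two dots per pass.
THREE_DOTS = re.compile(r"([^.]*)\.([^.]*)\.([^.]*)\.([^.]*)$")
# Assumed fallback (not in the answer): removes one leftover dot.
TWO_DOTS = re.compile(r"([^.]*)\.([^.]*)\.([^.]*)$")


def dots_to_spaces(name: str) -> str:
    """Apply the replacement repeatedly until the name stops changing."""
    while True:
        renamed = THREE_DOTS.sub(r"\1 \2 \3.\4", name)
        if renamed == name:
            break
        name = renamed
    # At most two dots remain here; drop the first one if there are two.
    return TWO_DOTS.sub(r"\1 \2.\3", name)


print(dots_to_spaces("A.File.With.Dots.Instead.Of.Spaces.Extension"))
# A File With Dots Instead Of Spaces.Extension
```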
Basically:
/\.(?=.*?\.)//
will do it in pure regex terms. This means, replace any period that is followed by a string of characters (non-greedy) and then a period with nothing. This is a positive lookahead.
In PHP this is done as:
$output = preg_replace('/\.(?=.*?\.)/', '', $input);
Other languages vary but the principle is the same.
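For comparison, here is the same lookahead in Python, with a space as the replacement because the question asks for spaces rather than for the dots to be deleted. It is only a sketch and assumes a regex flavor with lookahead support, which Total Commander reportedly lacks.

```python
import re

name = "A.File.With.Dots.Instead.Of.Spaces.Extension"

# Replace every dot that still has another dot somewhere after it;
# the last dot fails the lookahead and is therefore kept.
renamed = re.sub(r"\.(?=.*\.)", " ", name)
print(renamed)   # A File With Dots Instead Of Spaces.Extension

# A name with a single dot is left untouched.
single = re.sub(r"\.(?=.*\.)", " ", "notes.txt")
print(single)    # notes.txt
```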
Here's one based on your almost-solution:
/\.([^.]*(\.[^.]+$)?)/\1/
This is, roughly, "any dot stuff, minus the dot, and maybe plus another dot stuff at the end of the line." I couldn't quite tell if you wanted the dots removed or turned to spaces - if the latter, change the substitution to " \1" (minus the quotes, of course).
[Edited to change the + to a *, as Helen's below.]
Or substitute all dots with space, then substitute [space][Extension] with .[Extension]
A.File.With.Dots.Instead.Of.Spaces.Extension
to
A File With Dots Instead Of Spaces Extension
to
A File With Dots Instead Of Spaces.Extension
Another pattern to find all dots but the last in a (windows) filename that I've found works for me in Mass File Renamer is:
(?!\.\w*$)\.
I don't know how useful that is to other users, but this page was an early search result and if that had been on here it would have saved me some time.
It excludes the result if it's followed by an uninterrupted sequence of alphanumeric characters leading to the end of the input (filename) but otherwise finds all instances of the dot character.
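A short Python sketch of this pattern shows both the intended result and one limitation that follows from \w: if the extension contains a non-word character, the final dot is replaced too. The second file name is made up for the demonstration.

```python
import re

# A dot is matched unless it is followed only by word characters up to the end.
all_but_last_dot = re.compile(r"(?!\.\w*$)\.")

normal = all_but_last_dot.sub(" ", "A.File.With.Dots.Instead.Of.Spaces.Extension")
print(normal)   # A File With Dots Instead Of Spaces.Extension

# Hypothetical name whose "extension" contains "-", which \w does not match,
# so the lookahead no longer protects the last dot.
odd = all_but_last_dot.sub(" ", "some.name.tar-gz")
print(odd)      # some name tar-gz
```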
You can do that with Lookahead. However I don't know which kind of regex support you have.
/\.(?=.*\.)//
Which roughly translates to Any dot /\./ that has something and a dot afterwards. Obviously the last dot is the only one not complying. I leave out the "optionality" of something between dots, because the data looks like something will always be in between and the "optionality" has a performance cost.
Check:
http://www.regular-expressions.info/lookaround.html

Need a regex to exclude certain strings

I'm trying to get a regex that will match:
somefile_1.txt
somefile_2.txt
somefile_{anything}.txt
but not match:
somefile_16.txt
I tried
somefile_[^(16)].txt
with no luck (it includes even the "16" record)
Some regex libraries allow lookahead:
somefile_(?!16\.txt$).*?\.txt
Otherwise, you can still use multiple character classes:
somefile_([^1].|1[^6]|.|.{3,})\.txt
or, to achieve maximum portability:
somefile_([^1].|1[^6]|.|....*)\.txt
[^(16)] means: Match any character but parentheses, 1, and 6.
The best solution has already been mentioned:
somefile_(?!16\.txt$).*\.txt
This works, and is greedy enough to take anything coming at it on the same line. If you know, however, that you want a valid file name, I'd suggest also limiting invalid characters:
somefile_(?!16)[^?%*:|"<>]*\.txt
If you're working with a regex engine that does not support lookahead, you'll have to consider how to make up that !16. You can split files into two groups, those that start with 1, and aren't followed by 6, and those that start with anything else:
somefile_(1[^6]|[^1]).*\.txt
If you want to allow somefile_16_stuff.txt but NOT somefile_16.txt, these regexes above are not enough. You'll need to set your limit differently:
somefile_(16.|1[^6]|[^1]).*\.txt
Combine this all, and you end up with two possibilities, one which blocks out the single instance (somefile_16.txt), and one which blocks out all families (somefile_16*.txt). I personally think you prefer the first one:
somefile_((16[^?%*:|"<>]|1[^6?%*:|"<>]|[^1?%*:|"<>])[^?%*:|"<>]*|1)\.txt
somefile_((1[^6?%*:|"<>]|[^1?%*:|"<>])[^?%*:|"<>]*|1)\.txt
The same two expressions without the invalid-character restrictions, so they are easier to read:
somefile_((16.|1[^6]|[^1]).*|1)\.txt
somefile_((1[^6]|[^1]).*|1)\.txt
To adhere strictly to your specification and be picky, you should rather use:
^somefile_(?!16\.txt$).*\.txt$
so that somefile_1666.txt which is {anything} can be matched ;)
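The difference between the strict and the looser lookahead is easy to see on a handful of names. The sketch below uses Python's re module; the file names beyond those in the question are made up for the test.

```python
import re

# Strict: only the exact name somefile_16.txt is rejected.
strict = re.compile(r"^somefile_(?!16\.txt$).*\.txt$")
# Loose: anything that starts with somefile_16 is rejected.
loose = re.compile(r"^somefile_(?!16).*\.txt$")

names = [
    "somefile_1.txt",
    "somefile_2.txt",
    "somefile_16.txt",
    "somefile_1666.txt",
    "somefile_16_stuff.txt",
]

strict_hits = [n for n in names if strict.search(n)]
loose_hits = [n for n in names if loose.search(n)]

print(strict_hits)
# ['somefile_1.txt', 'somefile_2.txt', 'somefile_1666.txt', 'somefile_16_stuff.txt']
print(loose_hits)
# ['somefile_1.txt', 'somefile_2.txt']
```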
but sometimes it is just more readable to use...:
ls | grep -e 'somefile_.*\.txt' | grep -v -e 'somefile_16\.txt'
somefile_(?!16).*\.txt
(?!16) means: Assert that it is impossible to match the regex "16" starting at that position.
Sometimes it's just easier to use two regular expressions. First look for everything you want, then ignore everything you don't. I do this all the time on the command line where I pipe a regex that gets a superset into another regex that ignores stuff I don't want.
If the goal is to get the job done rather than find the perfect regex, consider that approach. It's often much easier to write and understand than a regex that makes use of exotic features.
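The two-pass idea translates directly into code: one expression selects the superset, a second one throws out the unwanted names. A minimal sketch in Python, with a made-up list of file names:

```python
import re

wanted = re.compile(r"somefile_.*\.txt")      # first pass: everything we might want
unwanted = re.compile(r"somefile_16\.txt")    # second pass: what to ignore

# Hypothetical directory listing.
names = ["somefile_1.txt", "somefile_16.txt", "somefile_1666.txt", "other.txt"]

result = [
    n for n in names
    if wanted.fullmatch(n) and not unwanted.fullmatch(n)
]
print(result)   # ['somefile_1.txt', 'somefile_1666.txt']
```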
Without using lookahead
somefile_(|.|[^1].+|10|11|12|13|14|15|17|18|19|.{3,})\.txt
Read it like: somefile_ followed by either:
nothing.
one character.
any one character except 1 and followed by any other characters.
three or more characters.
one of 10 through 19 (note that 16 has been left out).
and finally followed by .txt.
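A quick test of this lookahead-free expression, sketched in Python with the dot before txt escaped and the whole name required to match. The name somefile_1a.txt is a made-up case showing a gap: two characters starting with 1 are only accepted when they are one of the listed numbers.

```python
import re

pattern = re.compile(
    r"somefile_(|.|[^1].+|10|11|12|13|14|15|17|18|19|.{3,})\.txt"
)

names = [
    "somefile_.txt",
    "somefile_1.txt",
    "somefile_15.txt",
    "somefile_16.txt",
    "somefile_1666.txt",
    "somefile_1a.txt",
]

results = {n: bool(pattern.fullmatch(n)) for n in names}
for n, ok in results.items():
    print(n, ok)
# somefile_.txt True
# somefile_1.txt True
# somefile_15.txt True
# somefile_16.txt False
# somefile_1666.txt True
# somefile_1a.txt False
```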