I have a bunch of files that look like this:
A.File.With.Dots.Instead.Of.Spaces.Extension
Which I want to transform via a regex into:
A File With Dots Instead Of Spaces.Extension
It has to be in one regex (because I want to use it with Total Commander's batch rename tool).
Help me, regex gurus, you're my only hope.
Edit
Several people suggested two-step solutions. Two steps really make this problem trivial, and I was really hoping to find a one-step solution that would work in TC. I did, BTW, manage to find a one-step solution that works as long as there's an even number of dots in the file name. So I'm still hoping for a silver bullet expression (or a proof/explanation of why one is strictly impossible).
It appears Total Commander's regex library does not support lookaround expressions, so you're probably going to have to replace a number of dots at a time, until there are no dots left. Replace:
([^.]*)\.([^.]*)\.([^.]*)\.([^.]*)$
with
$1 $2 $3.$4
(Repeat the sequence and the number of backreferences for more efficiency. You can go up to $9, which may or may not be enough.)
It doesn't appear there is any way to do it with a single, definitive expression in Total Commander, sorry.
Basically:
/\.(?=.*?\.)//
will do it in pure regex terms. This means, replace any period that is followed by a string of characters (non-greedy) and then a period with nothing. This is a positive lookahead.
In PHP this is done as:
$output = preg_replace('/\.(?=.*?\.)/', '', $input);
Other languages vary but the principle is the same.
Here's one based on your almost-solution:
/\.([^.]*(\.[^.]+$)?)/\1/
This is, roughly, "any dot stuff, minus the dot, and maybe plus another dot stuff at the end of the line." I couldn't quite tell if you wanted the dots removed or turned to spaces - if the latter, change the substitution to " \1" (minus the quotes, of course).
[Edited to change the + to a *, as Helen's below.]
Or substitute all dots with space, then substitute [space][Extension] with .[Extension]
A.File.With.Dots.Instead.Of.Spaces.Extension
to
A File With Dots Instead Of Spaces Extension
to
A File With Dots Instead Of Spaces.Extension
Another pattern to find all dots but the last in a (windows) filename that I've found works for me in Mass File Renamer is:
(?!\.\w*$)\.
I don't know how useful that is to other users, but this page was an early search result and if that had been on here it would have saved me some time.
It excludes the result if it's followed by an uninterrupted sequence of alphanumeric characters leading to the end of the input (filename) but otherwise finds all instances of the dot character.
You can do that with Lookahead. However I don't know which kind of regex support you have.
/\.(?=.*\.)//
Which roughly translates to Any dot /\./ that has something and a dot afterwards. Obviously the last dot is the only one not complying. I leave out the "optionality" of something between dots, because the data looks like something will always be in between and the "optionality" has a performance cost.
Check:
http://www.regular-expressions.info/lookaround.html
Related
I am trying to write a regex expression that can be used to identify long sentences in a document. I my case a scientific manuscript. I aim to be doing that either in libre office or any text editor with regex search.
So far I got the following expression to work on most occasions:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+){24,}?(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
btw, I got inspired from this post
It contains:
group1:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+)
a repetition element (stating how many words n - 1):
{24,}?
group2:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
The basic functioning is:
group1 matches any number of word characters OR other characters that are present in the text followed by one or more spaces
group1 has to be repeated 24 times (or as many as you want the sentences to be long)
group2 matches any number of word characters OR other characters that are present in the text followed by a full stop, exclamation mark, question mark or paragraph break.
Any string that fulfills all the above would then be highlighted.
What I can't solve so far is to make it work when a dot appears in the text with another meaning than a full stop. Things like: i.e., e.g., et al., Fig., 1.89, etc....
Also I don't like that I had to manually adjust it to be able to handle sentences that contain non-word characters such as , [ ( % - # µ " ' and so on. I would have to extend the expression every time I come across some other uncommon character.
I'd be happy for any help or suggestions of other ways to solve this.
You can do a lot with the swiss-army-knife that is regular expressions, but the problem you've presented approaches regex's limits. Some of the things you want to detect can probably be handled with really small changes, while others are a bit harder. If your goal is to have some kind of tool that accurately measures sentence length for every possible mutation of characters, you'll probably need to move outside LibreOffice to a dedicated custom piece of software or a third-party tool.
But, that said, there are a few tricks you can worm into your existing regex to make it work better, if you want to avoid programming or another tool. Let's look at a few of the techniques that might be useful to you:
You can probably tweak your regex for a few special cases, like Fig. and Mr., by including them directly. Where you currently have [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, which is basically [\w]+ with a bunch of other "special" characters, you could use something like ([\w|...]+|Mr\.|Mrs\.|Miss\.|Fig\.) (substituting in all the special characters where I wrote ..., of course). Regexes are "greedy" algorithms, and will try to consume as much of the text as possible, so by including special "dot words" directly, you can make the regex "skip over" certain period characters that are problematic in your text. Make sure that when you want to add a "period to skip" that you always precede it with a backslash, like in i\.e\., so that it doesn't get treated as the special "any" character.
A similar trick can capture numbers better by assuming that digits followed by a period followed by more digits are supposed to "eat" the period: ([\w|...]+|\d+\.\d+|...) That doesn't handle everything, and if your document authors are writing stuff like 0. in the middle of sentences then you have a tough problem, but it can at least handle pi and e properly.
Also, right now, your regex consumes characters until it reaches any terminating punctuation character — a ., or !, or ?, or the end of the document. That's a problem for things like i.e., and 3.14, since as far as your regex is concerned, the sentence stops at the .. You could require your regex to only stop the sentence once ._ is reached — a period followed by a space. That wouldn't fix mismatches for words like Mr., but it would treat "words" like 3.14 as a word instead of as the end of a sentence, which is closer than you currently are. To do this, you'll have to include an odd sequence as part of the "word" regex, something like (\.[^ ]), which says "dot followed by not-a-space" is part of the word; and then you'll have to change the terminating sequence to (\. |!|?|$). Repeat the changes similarly for ! and ?.
Another useful trick is to take advantage of character-code ranges, instead of encoding each special character directly. Right now, you're doing it the hard way, by spelling out every accented character and digraph and diacritic in the universe. Instead, you could just say that everything that's a "special character" is considered to be part of the "word": Instead of [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, write [\w|\-|\/|\u0080-\uFFFF], which captures every character except emoji and a few from really obscure dead languages. LibreOffice seems to have Unicode support, so using \uXXXX patterns should work inside [ character ranges ].
This is probably enough to make your regex somewhat acceptable in LibreOffice, and might even be enough to answer your question. But if you're really intent on doing more complex document analysis like this, you may be better off exporting the document as plain text and then running a specialized tool on it.
I have a latex file in which I want to get rid of the last \\ before a \end{quoting}.
The section of the file I'm working on looks similar to this:
\myverse{some text \\
some more text \\}%
%
\myverse{again some text \\
this is my last line \\}%
\footnote{possibly some footnotes here}%
%
\end{quoting}
over several hundred lines, covering maybe 50 quoting environments.
I tried with :%s/\\\\}%\(\_.\{-}\)\\end{quoting}/}%\1\\end{quoting}/gc but unfortunately the non-greedy quantifier \{-} is still too greedy.
It catches starting from the second line of my example until the end of the quoting environment, I guess the greedy quantifier would catch up to the last \end{quoting} in the file. Is there any possibility of doing this with search and replace, or should I write a macro for this?
EDIT: my expected output would look something like this:
this is my last line }%
\footnote{possibly some footnotes here}%
%
\end{quoting}
(I should add that I've by now solved the task by writing a small macro, still I'm curious if it could also be done by search and replace.)
I think you're trying to match from the last occurrence of \\}% prior to end{quoting}, up to the end{quoting}, in which case you don't really want any character (\_.), you want "any character that isn't \\}%" (yes I know that's not a single character, but that's basically it).
So, simply (ha!) change your pattern to use \%(\%(\\\\}%\)\#!\_.\)\{-} instead of \_.\{-}; this means that the pattern cannot contain multiple \\}% sequences, thus achieving your aims (as far as I can determine them).
This uses a negative zero-width look-ahead pattern \#! to ensure that the next match for any character, is limited to not match the specific text we want to avoid (but other than that, anything else still matches). See :help /zero-width for more of these.
I.e. your final command would be:
:%s/\\\\}%\(\%(\%(\\\\}%\)\#!\_.\)\{-}\)\\end{quoting}/}%\1\\end{quoting}/g
(I note your "expected" output does not contain the first few lines for some reason, were they just omitted or was the command supposed to remove them?)
You’re on the right track using the non-greedy multi. The Vim help files
state that,
"{-}" is the same as "*" but uses the shortest match first algorithm.
However, the very next line warns of the issue that you have encountered.
BUT: A match that starts earlier is preferred over a shorter match: "a{-}b" matches "aaab" in "xaaab".
To the best of my knowledge, your best solution would be to use the macro.
I've got a practical application for a vim regex where I'd like to remove numbers from the end of file location links. For example, if the developer is sloppy and just adds files and doesn't reuse file locations, you'll end up with something awful like this:
PATH_TO_MY_FILES>
PATH_TO_MY_FILES1>
...
PATH_TO_MY_FILES22>
PATH_TO_MY_FILES_ELSEWHERE>
PATH_TO_MY_FILES_ELSEWHERE1>
...
So all I want to do is to S&R and replace PATH_TO_MY_FILES*\d+ with PATH_TO_MY_FILES* using regex. Obviously I am not doing it quite right, so I was hoping someone here could not spoon feed the answer necessarily, but throw a regex buzzword my way to get me on track.
Here's what I have tried:
:%s\(PATH_TO_MY_FILES\w*\)\(\d+\)>:gc
But this doesn't work, i.e. if I just do a vim search on that, it doesn't find anything. However, if I use this:
:%s\(PATH_TO_MY_FILES\w*\)\(\d\)>:gc
It will match the string, but the grouping is off, as expected. For example, the string PATH_TO_MY_FILES22 will be grouped as (PATH_TO_MY_FILES2)(2), presumably because the \d only matches the 2, and the \w match includes the first 2.
Question 1: Why doesn't \d+ work?
If I go ahead and use the second string (which is wrong), Vim appears to find a match (even though the grouping is wrong), but then does the replacement incorrectly.
For example, given that we know the \d will only match the last number in the string, I would expect PATH_TO_MY_FILES22> to get replaced with PATH_TO_MY_FILES2>. However, instead it replaces it with this:
PATH_TO_MY_FILES2PATH_TO_MY_FILES22>gt
So basically, it looks like it finds PATH_TO_MY_FILES22>, but then replaces only the & with group 1, which is PATH_TO_MY_FILES2.
I tried another regex at Regexr.com to see how it would interpret my grouping, and it looked correct, but maybe a hack around my lack of regex understanding:
(PATH_TO_\D*)(\d*)>
This correctly broke my target string into the PATH part and the entire number, so I was happy. But then when I used this in Vim, it found the match, but still replaced only the &.
Question 2: Why is Vim only replacing the &?
Answer 1:
You need to escape the + or it will be taken literally. For example \d\+ works correctly.
Answer 2:
An unescaped & in the replacement portion of a substitution means "the entire matched text". You need to escape it if you want a literal ampersand.
I want to match two consecutive lines, with the first line having no lower-case letter and the second having lower-case letter(s), e.g.
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
Why would the Regex ^(?!.*[:lower:]).*$\n^(.*[:lower:]).*$ match each of the following two-line examples?
("1.3.3 Disks 24" "#52")
("1.3.4 Tapes 25" "#53")
("1.5.4 Input/Output 41" "#69")
("1.5.5 Protection 42" "#70")
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
Thanks and regards!
ADDED:
For a example such as:
("3.1 NO MEMORY ABSTRACTION 174" "#202")
("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205")
("3.3.1 Paging 187" "#215")
("3.3.2 Page Tables 191" "#219")
How shall I match only the middle two lines not the first three lines or all the four lines?
To use a POSIX "character class" like [:lower:], you have to enclose it in another set of square brackets, like this: [[:lower:]]. (According to POSIX, the outer set of brackets form a bracket expression and [:lower:] is a character class, but to everyone else the outer brackets define a character class and the inner [:lower:] is obsolete.)
Another problem with your regex is that the first part is not required to consume any characters; everything is optional. That means your match can start on the blank line, and I don't think you want that. Changing the second .* to .+ fixes that, but it's just a quick patch.
This regex seems to match your specification:
^(?!.*[[:lower:]]).+\n(?=.*[[:lower:]]).*$
But I'm a little puzzled, because there's nothing in your sample data that matches. Is there supposed to be?
Using Rubular, we can see what's matched by your initial expression, and then, by adding a few excess capturing groups, see why it matches.
Essentially, the negative look-ahead followed by .* will match anything. If you merely want to check that the first line has no lower-case letters, check that explicitly, e.g.
^(?:[^a-z]+)$
Finally, I'd assuming you want the entire second line, you can do this for the second part:
^(.*?(?=[:lower:]).*?)$
Or to match your inital version:
^(.*?(?=[:lower:])).*?$
The reluctant qualifiers (*?) seemed to be necessary to avoid matching across lines.
The final version I ended up with, thus, is:
^(?:[^a-z]+)$\n^(.*?(?=[:lower:]).*?)$
This can be seen in action with your test data here. It only captures the line ("3.2 A MEMORY ABSTRACTION: ADDRESS SPACES 177" "#205").
Obviously, the regex I've used might be quite specific to Ruby, so testing with your regex engine may be somewhat different. There are many easily Google-able online regex tests, I just picked on Rubular since it does a wonderful job of highlighting what is being matched.
Incidentally, if you're using Python, the Python Regex Tool is very helpful for online testing of Python regexes (and it works with the final version I gave above), though I find the output visually less helpful in trouble-shooting.
After thinking about it a little more, Alan Moore's point about [[:lower:]] is spot on, as is his point about how the data would match. Looking back at what I wrote, I got a little too involved in breaking-down the regex and missed something about the problem as described. If you modify the regex I gave above to:
^(?:[^[:lower:]]+)$\n^(.*?(?=[[:lower:]]).*?)$
It matches only the line ("3.3.1 Paging 187" "#215"), which is the only line with lowercase letters following a line with no lowercase letters, as can be seen here. Placing a capturing group in Alan's expression, yielding ^(?!.*[[:lower:]]).+\n((?=.*[[:lower:]]).*)$ likewise captures the same text, though what, exactly, is matched is different.
I still don't have a good solution for matching multiple lines.
This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of file a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3 digit Uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
cià
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
since you are on Mac, you could use the shell
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo $filename | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
newfile=${file#*x}
newfile=${newfile%.[A-Z][A-Z][A-Z].*}
# or
# newfile=$(echo $file | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
mv "$file" "$newfile"
done
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve your just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, it there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.