I am working through a lab on RegEx which asks me to:
Search the 'countries' file for all the words with nine characters and
the letter i. How many results are found?
I am working in a generic Linux command prompt in an online emulated environment. I am allowed to use grep, awk, or sed, though I lean toward grep.
(I am 100% a noob when it comes to RegEx so please explain it to me like I'm 5)
In a previous lab I already used something like the command below, which finds all countries that have 9 characters; however, I cannot find a way to make it find all words which have 9 characters AND contain the letter i in any position.
grep -E '\b\w{9}\b' countries
The | operator does not help because it's an OR operator: it finds all instances where i appears plus all words which are 9 characters, and I need both conditions to hold at the same time. I also tried chaining multiple grep statements, but the emulator does not seem to accept that.
I am also trying to stick to [] character sets as the next question asks for multiple letters within the 9 letter word.
One way of solving this problem is to use grep twice, and pipe one result to the next.
First, we find all words with length 9, like you did on the previous exercise:
grep -Eo '\b\w{9}\b' countries
I'm using the -o flag, which prints only the matching parts of each line, one match per line.
Next, we use Linux pipe (not regex OR) to feed the output of the first grep to a second grep:
grep -Eo '\b\w{9}\b' countries | grep 'i'
The final output will be all words with nine characters that also contain an i.
Depending on your requirements, this approach may be considered "cheating" if you're more focused on Regex, but a good solution if you're also learning Linux.
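If you also need the count the lab asks for, a minimal sketch (using the countries file name from your question) is to append wc -l to the pipeline:

grep -Eo '\b\w{9}\b' countries | grep 'i' | wc -l   # count the matching words

Because -o prints each matching word on its own line, counting lines gives the number of results.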
The fact that you are looking for words (as opposed to whole lines of the file) complicates the regex, but it is also possible to come up with a single regex that matches these words.
\b(?=\w*i)\w{9}\b
This builds on \b\w{9}\b you already have. (?=\w*i) is the AND condition. After we find the beginning of the word (\b), we look ahead for \w*i (zero or more letters, and then our i). We're using \w* in the lookahead, not .*, so we are looking at the same word. (?=.*i) would have matched any i also after the nine characters.
After the lookahead finds the i, we continue with \w{9}\b to make sure the word is exactly 9 letters.
Working example: https://regex101.com/r/G5EVdM/1
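Note that -E (POSIX ERE) has no lookahead support, so to try this pattern with grep you would need a PCRE-enabled build and the -P flag. A minimal sketch with some made-up sample words instead of the real countries file:

printf 'Argentina\nIndonesia\nPortugal\nSingapore\n' |
  grep -oP '\b(?=\w*i)\w{9}\b'   # prints Argentina, Indonesia, Singapore

Portugal is skipped because it has only 8 letters, even though it contains no i anyway; the other three are 9 letters long and contain an i.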
Note:
* Python is used to illustrate behaviors, but this question is language-agnostic.
* For the purpose of this discussion, assume single-line input only, because the presence of newlines (multi-line input) introduces variations in behavior of $ and . that are incidental to the questions at hand.
Most regex engines:
accept a regex that explicitly tries to match an expression after the end of the input string[1].
$ python -c "import re; print(re.findall('$.*', 'a'))"
[''] # !! Matched the hypothetical empty string after the end of 'a'
when finding / replacing globally, i.e., when looking for all non-overlapping matches of a given regex, and having reached the end of the string, unexpectedly try to match again[2], as explained in this answer to a related question:
$ python -c "import re; print(re.findall('.*$', 'a'))"
['a', ''] # !! Matched both the full input AND the hypothetical empty string
Perhaps needless to say, such match attempts succeed only if the regex in question matches the empty string (and the engine, by default or by configuration, reports zero-length matches).
These behaviors are at least at first glance counter-intuitive, and I wonder if someone can provide a design rationale for them, not least because:
it's not obvious what the benefit of this behavior is.
conversely, in the context of finding / replacing globally with patterns such as .* and .*$, the behavior is downright surprising.[3]
To ask the question more pointedly: Why does functionality designed to find multiple, non-overlapping matches of a regex - i.e., global matching - decide to even attempt another match if it knows that the entire input has been consumed already, irrespective of what the regex is (although you'll never see the symptom with a regex that doesn't at least also match the empty string)?
The following languages/engines exhibit the surprising behavior: .NET, Python (both 2.x and 3.x)[2], Perl (both 5.x and 6.x), Ruby, Node.js (JavaScript)
Note that regex engines vary in behavior with respect to where to continue matching after a zero-length (empty-string) match.
Either choice (start at the same character position vs. start at the next) is defensible - see the chapter on zero-length matches at www.regular-expressions.info.
By contrast, the .*$ case discussed here is different in that, with any non-empty input, the first match for .*$ is not a zero-length match, so the behavior difference does not apply - instead, the character position should advance unconditionally after the first match, which of course is impossible if you're already at the end.
Again, my surprise is at the fact that another match is attempted nonetheless, even though there's by definition nothing left.
[1] I'm using $ as the end-of-input marker here, even though in some engines, such as .NET's, it can mark the end of the input, optionally followed by a trailing newline. However, the behavior equally applies when you use the unconditional end-of-input marker, \z.
[2] Python 2.x and 3.x up to 3.6.x seemingly special-cased replacement behavior in this context:
python -c "import re; print(re.sub('.*$', '[\g<0>]', 'a'))" used to yield just [a] - that is, only one match was found and replaced.
Since Python 3.7, the behavior is now like in most other regex engines, where two replacements are performed, yielding [a][].
[3] You can avoid the problem by either (a) choosing a replacement method that is designed to find at most one match or (b) using ^.* to prevent multiple matches from being found, via start-of-input anchoring.
(a) may not be an option, depending on how a given language surfaces functionality; for instance, PowerShell's -replace operator invariably replaces all occurrences; consider the following attempt to enclose all array elements in "...":
'a', 'b' -replace '.*', '"$&"'. Due to matching twice, this yields elements "a""" and "b""";
option (b), 'a', 'b' -replace '^.*', '"$&"', fixes the problem.
I am giving this answer just to demonstrate why a regex would want to allow any code appearing after the final $ anchor in the pattern. Suppose we needed to create a regex to match a string with the following rules:
starts with three numbers
followed by one or more letters, numbers, hyphen, or underscore
ends with only letters and numbers
We could write the following pattern:
^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$
But this is a bit bulky, because we have to use two similar character classes adjacent to each other. Instead, we could write the pattern as:
^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)
or
^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$
Here, we eliminated one of the character classes, and instead used a negative lookbehind after the $ anchor to assert that the final character was not underscore or hyphen.
Other than a lookbehind, it makes no sense to me why a regex engine would allow something to appear after the $ anchor. My point here is that a regex engine may allow a lookbehind to appear after the $, and there are cases for which it logically makes sense to do so.
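To make this concrete, here is a minimal sketch (assuming a GNU grep built with PCRE support, since plain -E has no lookbehind; the test strings are made up):

printf '123abc_\n123abc9\n' |
  grep -P '^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)'   # prints only 123abc9

123abc_ is rejected: the lookbehind placed after $ sees the trailing underscore, and backtracking the + cannot rescue the match because then $ no longer sits at the end of the line.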
Recall several things:
^ and $ are zero-width assertions: ^ matches at the logical start of the string (or after each line ending in multiline mode, i.e. with the m flag in most regex implementations), and $ matches at the logical end of the string (or at the end of each line, BEFORE the line-ending character or characters, in multiline mode).
.* is potentially a zero-length match, i.e. it may match no text at all. A zero-length-only version would be something like $(?:end of line){0} (which is useful as a comment, I guess...).
. does not match \n (unless you have the s flag) but does match the \r in Windows CRLF line endings. So $.{1} only matches Windows line endings for example (but don't do that. Use the literal \r\n instead.)
There is no particular benefit other than simple side effect cases.
The regex $ is useful;
.* is useful.
Regexes like ^ followed by a lookahead, or a lookbehind followed by $, are common and useful.
Regexes with a lookaround before ^ or after $ are potentially useful.
The regex $.* is not useful, and rare enough not to warrant a special optimization that makes the engine stop looking in that edge case. Most regex engines do a decent job of catching syntax problems - a missing brace or parenthesis, for example - but recognizing that $.* is pointless would require the engine to analyze the meaning of that regex, treating $.* differently from $(something else).
What you get will be highly dependent on the regex flavor and the status of the s and m flags.
For examples of replacements, consider the following Bash script output from some major regex flavors:
#!/bin/bash
echo "perl"
printf "123\r\n" | perl -lnE 'say if s/$.*/X/mg' | od -c
echo "sed"
printf "123\r\n" | sed -E 's/$.*/X/g' | od -c
echo "python"
printf "123\r\n" | python -c "import re, sys; print re.sub(r'$.*', 'X', sys.stdin.read(),flags=re.M) " | od -c
echo "awk"
printf "123\r\n" | awk '{gsub(/$.*/,"X")};1' | od -c
echo "ruby"
printf "123\r\n" | ruby -lne 's=$_.gsub(/$.*/,"X"); print s' | od -c
Prints:
perl
0000000 X X 2 X 3 X \r X \n
0000011
sed
0000000 1 2 3 \r X \n
0000006
python
0000000 1 2 3 \r X \n X \n
0000010
awk
0000000 1 2 3 \r X \n
0000006
ruby
0000000 1 2 3 X \n
0000005
What is the reason behind using .* with the global modifier on? Either someone expects an empty string to be returned as a match, or they aren't aware of what the * quantifier does; otherwise the global modifier shouldn't be set. .* without g doesn't return two matches.
it's not obvious what the benefit of this behavior is.
There shouldn't be a benefit. Actually, you are questioning the existence of zero-length matches; you are asking why a zero-length string exists at all.
We have three valid places that a zero-length string exists:
Start of subject string
Between two characters
End of subject string
We should look for the reason, rather than the benefit, of that second zero-length match produced by .* with the g modifier (or by a function that searches for all occurrences). That zero-length position following an input string has some logical uses. The state diagram below was generated by Debuggex for .*; I added an epsilon on the direct transition from the start state to the accept state to illustrate the definition:
[State diagram of .* with an epsilon transition directly from the start state to the accept state. Source: pbrd.co]
That's a zero-length match (read more about epsilon transition).
This all relates to greediness and non-greediness. Without zero-length positions, a regex like .?? wouldn't have a meaning: it doesn't attempt the dot first, it skips it, matching a zero-length string in order to move the current state to a temporarily acceptable state.
Without a zero-length position, .?? could never skip a character in the input string, and that would result in an entirely different flavor.
The very definition of greediness / laziness leads to zero-length matches.
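A minimal sketch of that skip in action (assuming Python 3; any engine with lazy quantifiers behaves similarly):

python3 -c "import re; print(repr(re.match(r'.??', 'abc').group()))"   # '' - lazy: prefer the zero-length match
python3 -c "import re; print(repr(re.match(r'.?',  'abc').group()))"   # 'a' - greedy: consume the character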
Note:
My question post contains two related, but distinct questions, for which I should have created separate posts, as I now realize.
The other answers here focus on one of the questions each, so in part this answer provides a road map to what answers address which question.
As for why patterns such as $<expr> are allowed (i.e., matching something after the input's end) / when they make sense:
dawg's answer argues that nonsensical combinations such as $.+ probably aren't prevented for pragmatic reasons; ruling them out may not be worth the effort.
Tim's answer shows how certain expressions can make sense after $, namely negative lookbehind assertions.
The second half of ivan_pozdeev's answer cogently synthesizes dawg's and Tim's answers.
As for why global matching finds two matches for patterns such as .* and .*$:
revo's answer contains great background information about zero-length (empty-string) matching, which is what the problem ultimately comes down to.
Let me complement his answer by relating it more directly to how the behavior contradicts my expectations in the context of global matching:
From a purely common-sense perspective, it stands to reason that once the input has been fully consumed while matching, there is by definition nothing left, so there is no reason to look for further matches.
By contrast, most regex engines consider the character position after the last character of the input string - the position known as end of subject string in some engines - a valid starting position for a match and therefore attempt another one.
If the regex at hand happens to match the empty string (produces a zero-length match; e.g., regexes such as .*, or a?), it matches that position and returns an empty-string match.
Conversely, you won't see an extra match if the regex doesn't (also) match the empty string - while the additional match is still attempted in all cases, no match will be found in this case, given that the empty string is the only possible match at the end-of-subject-string position.
While this provides a technical explanation of the behavior, it still doesn't tell us why matching after the last character was implemented.
The closest thing we have is an educated guess by Wiktor Stribiżew in a comment (emphasis added), which again suggests a pragmatic reason for the behavior:
... as when you get an empty string match, you might still match the next char that is still at the same index in the string. If a regex engine did not support it, these matches would be skipped. Making an exception for the end of string was probably not that critical for regex engine authors.
The first half of ivan_pozdeev's answer explains the behavior in more technical detail by telling us that the void at the end of the [input] string is a valid position for matching, just like any other character-boundary position.
However, while treating all such positions the same is certainly internally consistent and presumably simplifies the implementation, the behavior still defies common sense and has no obvious benefit to the user.
Further observations re empty-string matching:
Note: In all code snippets below, global string replacement is performed to highlight the resulting matches: each match is enclosed in [...], whereas non-matching parts of the input are passed through as-is.
In summary, 3 different, independent behaviors apply in the context of empty(-string) matches, and different engines use different combinations:
Whether the POSIX ERE spec's longest-leftmost rule is obeyed (thanks, revo).
In global matching:
Whether or not the character position is advanced after an empty match.
Whether or not another match is attempted for the by-definition empty string at the very end of the input (the 2nd question in my question post).
Matching at the end-of-subject-string position is not limited to those engines where matching continues at the same character position after an empty match.
For instance, the .NET regex engine does not do so (PowerShell example):
PS> 'a1' -replace '\d*|a', '[$&]'
[]a[1][]
That is:
\d* matched the empty string before a
a itself then did not match, which implies that the character position was advanced after the empty match.
1 was matched by \d*
The end-of-subject-string position was again matched by \d*, resulting in another empty-string match.
Perl 5 is an example of an engine that does resume matching at the same character position:
$ "a1" | perl -ple "s/\d*|a/[$&]/g"
[][a][1][]
Note how a was matched too.
Interestingly, Perl 6 not only behaves differently, but exhibits yet another behavior variant:
$ "a1" | perl6 -pe "s:g/\d*|a/[$/]/"
[a][1][]
Seemingly, if an alternation finds both an empty and a non-empty match, only the non-empty one is reported.
Perl 6's behavior appears to be following the longest leftmost rule.
While sed and awk follow it as well, they don't attempt another match at the end of the string:
sed, both the BSD/macOS and GNU/Linux implementations:
$ echo a1 | sed -E 's/[0-9]*|a/[&]/g'
[a][1]
awk - both the BSD/macOS and GNU/Linux implementations as well as mawk:
$ echo a1 | awk '1 { gsub(/[0-9]*|a/, "[&]"); print }'
[a][1]
"Void at the end of the string" is a separate position for regex engines because a regex engine deals with positions between input characters:
|a|b|c| <- input line
^ ^ ^ ^
positions at which a regex engine can "currently be"
All other positions can be described as "before the Nth character", but for the end there's no character to refer to.
As per Zero-Length Regex Matches -- Regular-expressions.info, it's also needed to support zero-length matches (which not all regex flavors support):
E.g. a regex \d* over string abc would match 4 times: before each letter, and at the end.
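For instance, a minimal sketch in Python (any flavor that reports zero-length matches behaves the same way):

python3 -c "import re; print(re.findall(r'\d*', 'abc'))"   # ['', '', '', ''] - one empty match per position, including the end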
$ is allowed anywhere in the regex for uniformity: it's treated the same as any other token and matches at that magical "end of string" position. Making it "finalize" the match would introduce an unnecessary inconsistency into how the engine works and would rule out other useful things that can match there, e.g. a lookbehind or \b (basically, anything that can produce a zero-length match) -- i.e. it would be both a design complication and a functional limitation with no benefit whatsoever.
Finally, to answer why a regex engine may or may not try to match "again" at the same position, let's refer to Advancing After a Zero-Length Regex Match -- Zero-Length Regex Matches -- Regular-expressions.info:
Say we have the regex \d*|x and the subject string x1.
The first match is a blank match at the start of the string. Now, how do we give other tokens a chance while not getting stuck in an infinite loop?
The simplest solution, which is used by most regex engines, is to start the next match attempt one character after the end of the previous match
This may give counterintuitive results -- e.g. the above regex will match '' at start, 1 and '' at the end -- but not x.
The other solution, which is used by Perl, is to always start the next match attempt at the end of the previous match, regardless of whether it was zero-length or not. If it was zero-length, the engine makes note of that, as it must not allow a zero-length match at the same position.
Which "skips" matches less at the cost of some extra complexity. E.g. the above regex will produce '', x, 1 and '' at the end.
The article goes on to show that there aren't established best practices here and various regex engines are actively trying new approaches to try and produce more "natural" results:
One exception is the JGsoft engine. The JGsoft engine advances one
character after a zero-length match, like most engines do. But it has
an extra rule to skip zero-length matches at the position where the
previous match ended, so you can never have a zero-length match
immediately adjacent to a non-zero-length match. In our example the
JGsoft engine only finds two matches: the zero-length match at the
start of the string, and 1.
Python 3.6 and prior advance after zero-length matches. The sub()
function to search-and-replace skips zero-length matches at the
position where the previous non-zero-length match ended, but the
finditer() function returns those matches. So a search-and-replace in
Python gives the same results as the Just Great Software applications,
but listing all matches adds the zero-length match at the end of the
string.
Python 3.7 changed all this. It handles zero-length matches like Perl.
sub() does now replace zero-length matches that are adjacent to
another match. This means regular expressions that can find
zero-length matches are not compatible between Python 3.7 and prior
versions of Python.
PCRE 8.00 and later and PCRE2 handle zero-length matches like Perl by
backtracking. They no longer advance one character after a zero-length
match like PCRE 7.9 used to do.
The regexp functions in R and PHP are based on PCRE, so they avoid
getting stuck on a zero-length match by backtracking like PCRE does.
But the gsub() function to search-and-replace in R also skips
zero-length matches at the position where the previous non-zero-length
match ended, like gsub() in Python 3.6 and prior does. The other
regexp functions in R and all the functions in PHP do allow
zero-length matches immediately adjacent to non-zero-length matches,
just like PCRE itself.
I don't know where the confusion comes from.
Regex engines are basically stupid.
They're like Mikey, they'll eat anything.
$ python -c "import re; print(re.findall('$.*', 'a'))"
[''] # !! Matched the hypothetical empty string after the end of 'a'
You could put a thousand optional expressions after $ and it will still match the
EOS. Engines are stupid.
$ python -c "import re; print(re.findall('.*$', 'a'))"
['a', ''] # !! Matched both the full input AND the hypothetical empty string
Think of it this way: there are two independent expressions here,
.* and $. The reason is that the first expression is optional (it can match nothing).
It just happens to butt against the EOS assertion.
Thus you get 2 matches on a non-empty string.
Why does functionality designed to find multiple, non-overlapping matches of a regex - i.e., global matching - decide to even attempt another match if it knows that the entire input has been consumed already,
The class of things called assertions doesn't exist at character positions.
They exist only BETWEEN character positions.
If they appear in the regex, you don't know whether the entire input has been consumed.
If they can be satisfied as an independent step, but only once, they will match
independently.
Remember, regex is a left-to-right proposition.
Also remember, engines are stupid.
This is by design.
Each construct is a state in the engine, it's like a pipeline.
Adding complexity will surely doom it to failure.
As an aside, does .*a actually start from the beginning and check each character?
No. .* immediately starts at the end of string (or line, depending) and starts
backtracking.
Another funny thing: I see a lot of novices using .*? at the end of their
regex, thinking it will get all the remaining cruft from the string.
It's useless there; it will never match anything.
Even a standalone .*? regex will always match nothing, once for as many characters
as there are in the string.
Good luck! Don't fret it, regex engines are just ... well, stupid.
I'm looking for a logical solution, using regex, so that I can grep for pattern and not catch pattern2. Some kind of 'stop', or 'up until', logic.
This question is about performing this type of query, not about naming conventions. I'm not looking for a workaround, just the regexp logic.
For the sake of argument, let's make the context an up-to-date Ubuntu bash. But what I really want is something that only uses the regex logic itself.
For a list as below
entry
entry1
entry2
entry.qualifier
entry.qualifier2
pseudo command: grep("entry")
Note that this will match all of the entries, because there is no 'stop' logic. I'm sure the solution is actually quite simple; I just haven't used regex in a long time.
Something like 'not anything after the pattern'?
grep supports word boundaries, so a pure regex-based answer would be:
grep '\bentry\b' file
However, grep also supports the -w flag (match whole words), so you can also use:
grep -w 'entry' file
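A minimal sketch against the list from the question (fed in via printf here rather than a file):

printf 'entry\nentry1\nentry2\nentry.qualifier\nentry.qualifier2\n' |
  grep -w 'entry'
# prints: entry, entry.qualifier, entry.qualifier2 - the dot is not a word character,
# so 'entry' still ends on a word boundary there; entry1 and entry2 are excluded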
If you're using GNU grep, what can help here are the word-boundary anchor operators \< and \> that it supports; that is to say, \<entry\>.
POSIX doesn't specify \b or \<, or a -w command-line option. What if you have to use a grep that doesn't have them? The problem can be solved by testing each line of the file against a pure regular expression which must match it completely.
Suppose we want to pick out lines which contain the identifier entry that isn't a substring of a longer identifier name. Suppose identifiers are strings of English letters, digits and underscores. We can use this:
grep -E '^(|.*[^A-Za-z_0-9])entry([^A-Za-z_0-9].*|)$'
Note that the entire pattern is anchored on both ends, so that it must completely match an entire line. It matches any occurrence of entry which:
is either not preceded by anything, or else is preceded by a non-identifier character, possibly with other characters in front of it; and
is either not followed by anything, or else followed by a non-identifier character, possibly followed by other characters.
This approach is also useful if you have a specific idea of what constitutes a "word" which differs from the definition used by the GNU grep \b or \< operators. Suppose the file format is such that entry123 is in fact two different tokens entry and 123, and thus has to match. However entryabc must not match. For this, the GNU grep pattern \bentry\b or \<entry\> won't help; it will not match entry123. However, the above trick can readily be adapted to work:
grep -E '^(|.*[^A-Za-z])entry([^A-Za-z].*|)$'
I.e. entry surrounded by nothing, or else by characters that are not upper- or lower-case letters. So this is worth keeping in your back pocket.
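A minimal sketch of the adapted pattern on made-up sample lines:

printf 'entry123\nentryabc\nfoo entry bar\n' |
  grep -E '^(|.*[^A-Za-z])entry([^A-Za-z].*|)$'
# prints: entry123 and 'foo entry bar'; entryabc is rejected because a letter follows entry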
I would like to get all the results with grep or egrep from a file on my computer.
I've just worked out that the regex for finding the string
'+33. ... ... ..' is the following:
\+33.[0-9].[0-9].[0-9].[0-9] - or is this not correct?
My grep command is:
grep '\+31.[0-9].[0.9].[0.9].[0-9]' Samsung\ GT-i9400\ Galaxy\ S\ II.xry >> resultaten.txt
The output file is only giving me the following:
"Binary file Samsung GT-i9400 .xry matches"
..... and no results were given.
Can someone help me please with getting the results and writing to a file?
Firstly, the default behavior of grep is to print the line containing a match. Because binary files do not contain lines, it only prints a message when it finds a match in a binary file. However, this can be overridden with the -a flag.
But then, you end up with the problem that the "lines" it prints are not useful. You probably want to add the -o option to only print the substrings which actually matched.
Finally, your regex isn't correct at all. The lone dot . is a metacharacter which matches any character, including a control character or other non-text character. Given the length of your regex, you are unlikely to catch false positives, but you might want to think about what you want the dot to match. I have replaced it with [ ._-], which matches a space and some punctuation characters which are common in phone numbers. Maybe extend or change it, depending on what punctuation you expect in your phone numbers.
In regular grep, a plus simply matches itself. With grep -E the syntax changes, and you would need to backslash the plus to keep it literal; but in the absence of that option the backslash is superfluous, and actually wrong in this context in some dialects. In GNU grep, a backslashed plus selects the extended "one or more" meaning, which is of course a syntax error at the beginning of a pattern, where there is no preceding expression to repeat; GNU grep just silently ignores it rather than report an error.
On the other hand, your number groups are also wrong. [0-9] matches a single digit, where apparently the intention is to match multiple digits. For convenience, I will use the grep -E extension which enables + to match one or more repetitions of the previous character. Then we also get access to ? to mark the punctuation expressions as optional.
Wrapping up, try this:
grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}' \
    'Samsung GT-i9400 Galaxy S II.xry' >resultaten.txt
In human terms, this requires a literal +33 followed by required additional digits, then followed by three number groups of one or more digits, each optionally preceded by punctuation.
This will overwrite resultaten.txt which is usually what you want; the append operation you had also makes sense in many scenarios, so change it back if that's actually what you want.
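For a quick sanity check, here is a minimal sketch against a made-up text line (the real .xry file is binary, which is why the command above also needs -a):

printf 'office: +33612345678, mobile: +336 12 34 56\n' |
  grep -Eo '\+33[0-9]+([ ._-]?[0-9]+){3}'
# prints: +33612345678 and +336 12 34 56, each on its own line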
If each dot in your template +33. ... ... .. represents a required number, and the spaces represent required punctuation, the following is closer to what you attempted to specify:
\+33[0-9]([ ._-][0-9]{3}){2}[ ._-][0-9]{2}
That is, there is one required digit after 33, then two groups of exactly three digits and one of two, each group preceded by one non-optional spacing or punctuation character.
(Your exposition has +33 while your actual example has +31. Use whichever is correct, or perhaps allow any sequence of numbers for the country code, too.)
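A minimal sketch of this stricter pattern, again on a made-up sample line shaped like the '+33. ... ... ..' template:

printf 'reach me at +336 123 456 78 today\n' |
  grep -Eo '\+33[0-9]([ ._-][0-9]{3}){2}[ ._-][0-9]{2}'
# prints: +336 123 456 78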
It means that you did find a match, but the file you're grepping isn't a text file; it's a binary containing non-printable bytes. If you really want to grep that file, try:
strings Samsung\ GT-i9400\ Galaxy\ S\ II.xry | grep '+31.[0-9].[0-9].[0-9].[0-9]' >> resultaten.txt
This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of files a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3 digit Uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
cià
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
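You can see this quickly with a PCRE-capable grep (a minimal sketch; the renaming app from the question may of course behave slightly differently):

f='Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext'
echo "$f" | grep -oP '.*x(\d+)\.(.*?)\.[A-Z]{3}.*'    # matches (prints the whole name)
echo "$f" | grep -oP '.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*' # no output: [^.]*? cannot cross the dots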
Since you are on Mac, you could use the shell:
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo "$filename" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
newfile=${file#*x}
newfile=${newfile%.[A-Z][A-Z][A-Z].*}
# or
# newfile=$(echo $file | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
mv "$file" "$newfile"
done
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve you just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
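A minimal sketch of the trimmed pattern at work (using a PCRE-capable grep only to visualize the match; your renaming app would use the captured groups in the replacement, as in your original attempt):

f='Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext'
echo "$f" | grep -oP 'x(\d+)\.(.*?)\.[A-Z]{3}'
# prints: x212.The.Actual.Title.Of.the.Chapter.DOC - the whole match, including the x
# and the .DOC that the pattern itself consumes; $1 and $2 hold 212 and the dotted title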
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, if there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.