Exactly two capitalized words on a line - regex

I want to create a regular expression which can replace lines that contain exactly two words beginning with an uppercase with the character 'X'.
I'm currently using this:
sed -e '/\b[A-Z][a-z]*\b/ c X' /home/Morgan/desktop/test
The problem is the following: it only changes lines that contain one or more words matching the regular expression in my test.txt.
I don't know how to say that I want an X only on lines with exactly two words beginning with an uppercase letter. Either word can occur anywhere within the line.
My test.txt contains:
Bonjour oui oui Bonjour -> this must be replaced by X
Bonjour Bonjour Bonjour -> this mustn't
Bonjour Oui bonjour oui -> this must be replaced by X

You seem to be attempting to use the Perl/PCRE word boundary \b, but typical sed implementations do not understand this regular expression dialect. By your problem description, you are looking for the beginning and end of the line anyway; these are very basic regex anchors that were present already in the original grep: ^ matches the beginning of a line, and $ matches the end.
Without anchors, a regular expression will match anywhere in the line. To say "only two" you really must check the entire line and make sure there are not three or more of what you're looking for.
"Find a line with exactly two words which begin with uppercase" needs to be rephrased or massaged a bit before you can attempt to write a regex. If we -- provisionally, for this discussion -- define w to mean "word which does not begin with uppercase" and W to mean one which does, you want ^w*Ww*Ww*$ -- exactly two uppercase words, and zero or more non-uppercase words in any position before, between, or after them.
A word which begins with uppercase is [A-Z][a-z]* (this requires all the subsequent characters to be lowercase) and a word which doesn't is [a-z][a-z]* (or [a-z]\+ if your sed supports that regex variation).
Because words need spaces between them, an optional word expression needs to be parenthesized so you can say "zero or more of this entire sequence". Typically, sed regex requires grouping parentheses to be backslashed as well, though this differs between versions.
So, try this:
sed 's/^\([a-z][a-z]* \)*[A-Z][a-z]*\( [a-z][a-z]*\)* [A-Z][a-z]*\( [a-z][a-z]*\)*$/X/' file
If indeed you have GNU sed, this can be simplified a bit:
sed -r 's/^([a-z]+ )*[A-Z][a-z]*( [a-z]+)* [A-Z][a-z]*( [a-z]+)*$/X/' file
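As a quick sanity check against the sample input from the question (assuming GNU sed; the file contents below are just the three example lines):
printf 'Bonjour oui oui Bonjour\nBonjour Bonjour Bonjour\nBonjour Oui bonjour oui\n' > test.txt
sed -r 's/^([a-z]+ )*[A-Z][a-z]*( [a-z]+)* [A-Z][a-z]*( [a-z]+)*$/X/' test.txt
This should print X, then the untouched Bonjour Bonjour Bonjour line, then X again, matching the expected replacements.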
This definition of "word" might not be sufficient; perhaps you can refine it to suit your circumstances. In particular, the spacing is assumed to be regular (exactly one space between words; no leading or trailing whitespace on the lines) and no text may contain characters outside of spaces and the alphabetics a-z in upper or lower case. (Whether accented characters like è and Á are also considered alphabetics in this range depends on your locale settings. Maybe set LC_ALL=fr_FR.utf-8 in your script if French locale settings are important.)
Notice also how the sed substitution command requires exactly three delimiter characters -- traditionally, we use a slash, but you can use any punctuation character. The form is s/regex/replacement/flags where the regex, the replacement, and the flags can all be empty, but the s and the delimiters are always required.

Related

Why do regex engines allow / automatically attempt matching at the end of the input string?

Note:
* Python is used to illustrate behaviors, but this question is language-agnostic.
* For the purpose of this discussion, assume single-line input only, because the presence of newlines (multi-line input) introduces variations in behavior of $ and . that are incidental to the questions at hand.
Most regex engines:
accept a regex that explicitly tries to match an expression after the end of the input string[1].
$ python -c "import re; print(re.findall('$.*', 'a'))"
[''] # !! Matched the hypothetical empty string after the end of 'a'
when finding / replacing globally, i.e., when looking for all non-overlapping matches of a given regex, and having reached the end of the string, unexpectedly try to match again[2], as explained in this answer to a related question:
$ python -c "import re; print(re.findall('.*$', 'a'))"
['a', ''] # !! Matched both the full input AND the hypothetical empty string
Perhaps needless to say, such match attempts succeed only if the regex in question matches the empty string (and the engine, by default or by configuration, reports zero-length matches).
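For example, with a regex that cannot match the empty string, the extra attempt at the end-of-input position simply finds nothing (same Python illustration style as above):
$ python -c "import re; print(re.findall('.+$', 'a'))"
['a'] # only one match; .+ cannot match the hypothetical empty string at the end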
These behaviors are at least at first glance counter-intuitive, and I wonder if someone can provide a design rationale for them, not least because:
it's not obvious what the benefit of this behavior is.
conversely, in the context of finding / replacing globally with patterns such as .* and .*$, the behavior is downright surprising.[3]
To ask the question more pointedly: Why does functionality designed to find multiple, non-overlapping matches of a regex - i.e., global matching - decide to even attempt another match when it knows that the entire input has already been consumed, irrespective of what the regex is (although you'll never see the symptom with a regex that doesn't at least also match the empty string)?
The following languages/engines exhibit the surprising behavior: .NET, Python (both 2.x and 3.x)[2], Perl (both 5.x and 6.x), Ruby, Node.js (JavaScript)
Note that regex engines vary in behavior with respect to where to continue matching after a zero-length (empty-string) match.
Either choice (start at the same character position vs. start at the next) is defensible - see the chapter on zero-length matches at www.regular-expressions.info.
By contrast, the .*$ case discussed here is different in that, with any non-empty input, the first match for .*$ is not a zero-length match, so the behavior difference does not apply - instead, the character position should advance unconditionally after the first match, which of course is impossible if you're already at the end.
Again, my surprise is at the fact that another match is attempted nonetheless, even though there's by definition nothing left.
[1] I'm using $ as the end-of-input marker here, even though in some engines, such as .NET's, it can mark the end of the input optionally followed by a trailing newline. However, the behavior equally applies when you use the unconditional end-of-input marker, \z.
[2] Python 2.x and 3.x up to 3.6.x seemingly special-cased replacement behavior in this context:
python -c "import re; print(re.sub('.*$', '[\g<0>]', 'a'))" used to yield just [a] - that is, only one match was found and replaced.
Since Python 3.7, the behavior is now like in most other regex engines, where two replacements are performed, yielding [a][].
[3] You can avoid the problem by either (a) choosing a replacement method that is designed to find at most one match or (b) using ^.* to prevent multiple matches from being found via start-of-input anchoring.
(a) may not be an option, depending on how a given language surfaces functionality; for instance, PowerShell's -replace operator invariably replaces all occurrences; consider the following attempt to enclose all array elements in "...":
'a', 'b' -replace '.*', '"$&"'. Due to matching twice, this yields elements "a""" and "b""";
option (b), 'a', 'b' -replace '^.*', '"$&"', fixes the problem.
I am giving this answer just to demonstrate why a regex engine would want to allow anything to appear after the final $ anchor in the pattern. Suppose we needed to create a regex to match a string with the following rules:
starts with three numbers
followed by one or more letters, numbers, hyphen, or underscore
ends with only letters and numbers
We could write the following pattern:
^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$
But this is a bit bulky, because we have to use two similar character classes adjacent to each other. Instead, we could write the pattern as:
^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)
or
^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$
Here, we eliminated one of the character classes, and instead used a negative lookbehind after the $ anchor to assert that the final character was not underscore or hyphen.
Other than a lookbehind, it makes no sense to me why a regex engine would allow something to appear after the $ anchor. My point here is that a regex engine may allow a lookbehind to appear after the $, and there are cases for which it logically makes sense to do so.
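As a quick check of both variants (this assumes a grep with PCRE support, e.g. GNU grep's -P option; the sample strings are made up):
printf '123abc\n123abc_\n123_abc9\n' | grep -P '^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)'
printf '123abc\n123abc_\n123_abc9\n' | grep -P '^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$'
Both commands print 123abc and 123_abc9, and reject 123abc_ because of the trailing underscore.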
Recall several things:
^ and $ are zero-width assertions - ^ matches at the logical start of the string (or right after each line ending in multiline mode, i.e. with the m flag, in most regex implementations), and $ matches at the logical end of the string (or at the end of each line, BEFORE the line-ending character or characters, in multiline mode).
.* is potentially a zero-length match, i.e. it may match nothing at all. The zero-length-only version would be $(?:end of line){0} (which is useful as a comment, I guess...)
. does not match \n (unless you have the s flag) but does match the \r in Windows CRLF line endings. So $.{1} only matches Windows line endings for example (but don't do that. Use the literal \r\n instead.)
There is no particular benefit other than simple side effect cases.
The regex $ is useful;
.* is useful.
The regex ^(?a lookahead) and (?a lookbehind)$ are common and useful.
The regex (?a lookaround)^ or $(?a lookaround) are potentially useful.
The regex $.* is not useful and rare enough not to warrant implementing some optimization to have the engine stop looking in that edge case. Most regex engines do a decent job of catching syntax errors, such as a missing brace or parenthesis. But to reject $.* as not useful, the engine would have to parse the meaning of that regex and treat it differently from $(something else).
What you get will be highly dependent on the regex flavor and the status of the s and m flags.
For examples of replacements, consider the following Bash script output from some major regex flavors:
#!/bin/bash
echo "perl"
printf "123\r\n" | perl -lnE 'say if s/$.*/X/mg' | od -c
echo "sed"
printf "123\r\n" | sed -E 's/$.*/X/g' | od -c
echo "python"
printf "123\r\n" | python -c "import re, sys; print re.sub(r'$.*', 'X', sys.stdin.read(),flags=re.M) " | od -c
echo "awk"
printf "123\r\n" | awk '{gsub(/$.*/,"X")};1' | od -c
echo "ruby"
printf "123\r\n" | ruby -lne 's=$_.gsub(/$.*/,"X"); print s' | od -c
Prints:
perl
0000000 X X 2 X 3 X \r X \n
0000011
sed
0000000 1 2 3 \r X \n
0000006
python
0000000 1 2 3 \r X \n X \n
0000010
awk
0000000 1 2 3 \r X \n
0000006
ruby
0000000 1 2 3 X \n
0000005
What is the reason behind using .* with the global modifier on? Either someone expects an empty string to be returned as a match, or they aren't aware of what the * quantifier does; otherwise the global modifier shouldn't be set. .* without g doesn't return two matches.
it's not obvious what the benefit of this behavior is.
There shouldn't be a benefit. Actually, you are questioning the existence of zero-length matches; you are asking why a zero-length string exists at all.
We have three valid places that a zero-length string exists:
Start of subject string
Between two characters
End of subject string
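A quick way to see all three positions (a small Python illustration; the input string 'ab' is made up):
$ python -c "import re; print([m.span() for m in re.finditer(r'', 'ab')])"
[(0, 0), (1, 1), (2, 2)] # start of the string, between 'a' and 'b', and the end of the string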
We should look for the reason, rather than the benefit, behind that second zero-length match produced by .* with the g modifier (or by a function that searches for all occurrences). The zero-length position following an input string has some logical uses. The state diagram referred to below was generated by debuggex for .*, with an epsilon added on the direct transition from the start state to the accept state to demonstrate a definition:
[state diagram of .* omitted: it shows a direct epsilon transition from the start state to the accept state]
That epsilon transition is a zero-length match (read more about epsilon transitions).
This all relates to greediness and laziness. Without zero-length positions, a regex like .?? wouldn't have a meaning: it doesn't attempt the dot first, it skips it, matching a zero-length string in order to move the current state to a temporarily accepting one.
Without a zero-length position, .?? could never skip a character of the input string, and that would result in a whole new flavor.
The definition of greediness / laziness leads to zero-length matches.
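As a small illustration of that laziness (Python; the pattern and input are made up): the lazy .?? prefers the zero-length option, while the greedy .? consumes a character when it can:
$ python -c "import re; print(re.match('a.??', 'ab').group())"
a
$ python -c "import re; print(re.match('a.?', 'ab').group())"
ab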
Note:
My question post contains two related, but distinct questions, for which I should have created separate posts, as I now realize.
The other answers here focus on one of the questions each, so in part this answer provides a road map to what answers address which question.
As for why patterns such as $<expr> are allowed (i.e., matching something after the input's end) / when they make sense:
dawg's answer argues that nonsensical combinations such as $.+ probably aren't prevented for pragmatic reasons; ruling them out may not be worth the effort.
Tim's answer shows how certain expressions can make sense after $, namely negative lookbehind assertions.
The second half of ivan_pozdeev's answer cogently synthesizes dawg's and Tim's answers.
As for why global matching finds two matches for patterns such as .* and .*$:
revo's answer contains great background information about zero-length (empty-string) matching, which is what the problem ultimately comes down to.
Let me complement his answer by relating it more directly to how the behavior contradicts my expectations in the context of global matching:
From a purely common-sense perspective, it stands to reason that once the input has been fully consumed while matching, there is by definition nothing left, so there is no reason to look for further matches.
By contrast, most regex engines consider the character position after the last character of the input string - the position known as end of subject string in some engines - a valid starting position for a match and therefore attempt another one.
If the regex at hand happens to match the empty string (produces a zero-length match; e.g., regexes such as .*, or a?), it matches that position and returns an empty-string match.
Conversely, you won't see an extra match if the regex doesn't (also) match the empty string - while the additional match is still attempted in all cases, no match will be found in this case, given that the empty string is the only possible match at the end-of-subject-string position.
While this provides a technical explanation of the behavior, it still doesn't tell us why matching after the last character was implemented.
The closest thing we have is an educated guess by Wiktor Stribiżew in a comment (emphasis added), which again suggests a pragmatic reason for the behavior:
... as when you get an empty string match, you might still match the next char that is still at the same index in the string. If a regex engine did not support it, these matches would be skipped. Making an exception for the end of string was probably not that critical for regex engine authors.
The first half of ivan_pozdeev's answer explains the behavior in more technical detail by telling us that the void at the end of the [input] string is a valid position for matching, just like any other character-boundary position.
However, while treating all such positions the same is certainly internally consistent and presumably simplifies the implementation, the behavior still defies common sense and has no obvious benefit to the user.
Further observations re empty-string matching:
Note: In all code snippets below, global string replacement is performed to highlight the resulting matches: each match is enclosed in [...], whereas non-matching parts of the input are passed through as-is.
In summary, 3 different, independent behaviors apply in the context of empty(-string) matches, and different engines use different combinations:
Whether the POSIX ERE spec's longest-leftmost rule is obeyed (thanks, revo).
In global matching:
Whether or not the character position is advanced after an empty match.
Whether or not another match is attempted for the by-definition empty string at the very end of the input (the 2nd question in my question post).
Matching at the end-of-subject-string position is not limited to those engines where matching continues at the same character position after an empty match.
For instance, the .NET regex engine does not do so (PowerShell example):
PS> 'a1' -replace '\d*|a', '[$&]'
[]a[1][]
That is:
\d* matched the empty string before a
a itself then did not match, which implies that the character position was advanced after the empty match.
1 was matched by \d*
The end-of-subject-string position was again matched by \d*, resulting in another empty-string match.
Perl 5 is an example of an engine that does resume matching at the same character position:
$ "a1" | perl -ple "s/\d*|a/[$&]/g"
[][a][1][]
Note how a was matched too.
Interestingly, Perl 6 not only behaves differently, but exhibits yet another behavior variant:
$ "a1" | perl6 -pe "s:g/\d*|a/[$/]/"
[a][1][]
Seemingly, if an alternation finds both an empty and a non-empty match, only the non-empty one is reported.
Perl 6's behavior appears to be following the longest leftmost rule.
While sed and awk do as well, they don't attempt another match at the end of the string:
sed, both the BSD/macOS and GNU/Linux implementations:
$ echo a1 | sed -E 's/[0-9]*|a/[&]/g'
[a][1]
awk - both the BSD/macOS and GNU/Linux implementations as well as mawk:
$ echo a1 | awk '1 { gsub(/[0-9]*|a/, "[&]"); print }'
[a][1]
"Void at the end of the string" is a separate position for regex engines because a regex engine deals with positions between input characters:
|a|b|c| <- input line
^ ^ ^ ^
positions at which a regex engine can "currently be"
All other positions can be described as "before Nth character" but for the end, there's no character to refer to.
As per Zero-Length Regex Matches -- Regular-expressions.info, it's also needed to support zero-length matches (which not all regex flavors support):
E.g. a regex \d* over string abc would match 4 times: before each letter, and at the end.
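For instance, in Python:
$ python -c "import re; print(re.findall(r'\d*', 'abc'))"
['', '', '', ''] # four zero-length matches: before a, before b, before c, and at the end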
$ is allowed anywhere in the regex for uniformity: it's treated the same as any other token and matches at that magical "end of string" position. Making it "finalize" the regex work would lead to an unnecessary inconsistency in engine work and prevent other useful things that can match there, like e.g. lookbehind or \b (basically, anything that can be a zero-length match) -- i.e. would be both a design complication and a functional limitation with no benefit whatsoever.
Finally, to answer why a regex engine may or may not try to match "again" at the same position, let's refer to Advancing After a Zero-Length Regex Match -- Zero-Length Regex Matches -- Regular-expressions.info:
Say we have the regex \d*|x, the subject string x1
The first match is a blank match at the start of the string. Now, how do we give other tokens a chance while not getting stuck in an infinite loop?
The simplest solution, which is used by most regex engines, is to start the next match attempt one character after the end of the previous match
This may give counterintuitive results -- e.g. the above regex will match '' at start, 1 and '' at the end -- but not x.
The other solution, which is used by Perl, is to always start the next match attempt at the end of the previous match, regardless of whether it was zero-length or not. If it was zero-length, the engine makes note of that, as it must not allow a zero-length match at the same position.
Which "skips" matches less at the cost of some extra complexity. E.g. the above regex will produce '', x, 1 and '' at the end.
The article goes on to show that there aren't established best practices here and various regex engines are actively trying new approaches to try and produce more "natural" results:
One exception is the JGsoft engine. The JGsoft engine advances one
character after a zero-length match, like most engines do. But it has
an extra rule to skip zero-length matches at the position where the
previous match ended, so you can never have a zero-length match
immediately adjacent to a non-zero-length match. In our example the
JGsoft engine only finds two matches: the zero-length match at the
start of the string, and 1.
Python 3.6 and prior advance after zero-length matches. The sub()
function to search-and-replace skips zero-length matches at the
position where the previous non-zero-length match ended, but the
finditer() function returns those matches. So a search-and-replace in
Python gives the same results as the Just Great Software applications,
but listing all matches adds the zero-length match at the end of the
string.
Python 3.7 changed all this. It handles zero-length matches like Perl.
sub() now replaces zero-length matches that are adjacent to
another match. This means regular expressions that can find
zero-length matches are not compatible between Python 3.7 and prior
versions of Python.
PCRE 8.00 and later and PCRE2 handle zero-length matches like Perl by
backtracking. They no longer advance one character after a zero-length
match like PCRE 7.9 used to do.
The regexp functions in R and PHP are based on PCRE, so they avoid
getting stuck on a zero-length match by backtracking like PCRE does.
But the gsub() function to search-and-replace in R also skips
zero-length matches at the position where the previous non-zero-length
match ended, like sub() in Python 3.6 and prior does. The other
regexp functions in R and all the functions in PHP do allow
zero-length matches immediately adjacent to non-zero-length matches,
just like PCRE itself.
I don't know where the confusion comes from.
Regex engines are basically stupid.
They're like Mikey, they'll eat anything.
$ python -c "import re; print(re.findall('$.*', 'a'))"
[''] # !! Matched the hypothetical empty string after the end of 'a'
You could put a thousand optional expressions after $ and it will still match the
EOS. Engines are stupid.
$ python -c "import re; print(re.findall('.*$', 'a'))"
['a', ''] # !! Matched both the full input AND the hypothetical empty string
Think of it this way: there are two independent expressions here,
.* and $. The reason is that the first expression is optional.
It just happens to butt up against the EOS assertion.
Thus you get 2 matches on a non-empty string.
Why does functionality designed to find multiple, non-overlapping matches of a regex - i.e., global matching - decide to even attempt another match if it knows that the entire input has been consumed already,
The class of things called assertions don't exist at character positions.
They exist only BETWEEN character positions.
If they exist in the regex, you don't know if the entire input has been consumed.
If they can be satisfied as an independent step, but only once, they will match
independently.
Remember, regex is a left-to-right proposition.
Also remember, engines are stupid.
This is by design.
Each construct is a state in the engine, it's like a pipeline.
Adding complexity will surely doom it to failure.
As an aside, does .*a actually start from the beginning and check each character?
No. .* immediately runs to the end of the string (or line, depending) and then starts
backtracking.
Another funny thing. I see a lot of novices using .*? at the end of their
regex, thinking it will get all the remaining kruft from the string.
It's useless, it will never match anything.
Even a standalone .*? regex will always match nothing, for as many characters
as there are in the string.
Good luck! Don't fret it, regex engines are just ... well, stupid.

RegExp space character

I have this regular expression: ^[a-zA-Z]\s{3,16}$
What I want is to match any name with any spaces, for example, John Smith, that is 3 to 16 characters long.
What am I doing wrong?
Background
There are a couple of things to note here. First, a quantifier (in this case, {3,16}) only applies to the last regex token. So what your current regex really is saying is to "Match any string that has a single alphabetical character (case-insensitive) followed by 3 to 16 whitespace characters (e.g. spaces, tabs, etc.)."
Second, a name can have more than 2 parts (a middle name, certain ethnic names like "De La Cruz") or include special characters such as accented vowels. You should consider if this is something you need to account for in your program. These things are important and should be considered for any real application.
Assumptions and Answer
Now, let's just assume you only want a certain format for names that consists of a first name, a last name, and a space. Let's also assume you only want simple ASCII characters (i.e. no special characters or accented characters). Furthermore, both the first and last names should start with a capital character followed by only lower-case characters. Other than that, there are no restrictions on the length of the individual parts of the name. In this case, the following regex would do the trick:
^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$
Notes
The first token after the ^ character is what is called a positive lookahead. Basically, a positive lookahead will match the regex between the opening (?= and the closing ) without actually moving the position of the cursor that is matching the string.
Notice I removed the \s token, since you usually want only a (space). The space can be replaced with the \s token, if tabs and other whitespace is desired there.
I also added a restriction that a name must start with a capital letter followed by only lower-case letters.
Crude English Translation
To help your understanding, here is a simple English translation of what the regex is really doing:
"Match any string that is 3 to 16 characters long, starts with a capital letter followed by one or more (+) lowercase letters, followed by a single space, followed by a capital letter followed by one or more (+) lowercase letters, and ends with a lowercase letter."
Tools
There are a couple of tools I like to use when I am trying to tackle a challenging regex. They are listed below in no particular order:
https://regex101.com/ - Allows you to test regex expressions in real time. It also has a nifty little library to help you along.
http://www.regular-expressions.info/ - Basically a repository of knowledge on regex.
Edit/Update
You mentioned in your comments that you are using your regex in JavaScript. JavaScript uses forward slashes surrounding the regex to mark a regex literal. For this simple case, there are 2 options for using a regex to match a string.
First, use String's match method as follows
"John Smith".match(/^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$/);
Second, create a regex and use its test() method. For example,
/^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$/.test("John Smith");
The latter is probably what you want as it simply returns true or false depending on whether the regex actually matches the string or not.

Regex: grep('pattern') catches 'pattern2'

I'm looking for logical solution, using regex, so that I can query grep for pattern and not catch pattern2. Some kind of 'stop', or 'up until' logic.
This question is about performing this type of query, not about naming conventions. I'm not looking for a workaround, just the regexp logic.
For the sake of argument, let's make the context 'up to date' ubuntu bash. But what I really want is something that only utilizes the regexp logic.
For a list as below
entry
entry1
entry2
entry.qualifier
entry.qualifier2
pseudo command: grep("entry")
Note, this will match all of the entries, because there is no 'stop' logic. I'm sure the solution is actually quite simple, I just haven't used regex in a long time.
Something like 'not anything after the pattern'?
grep supports word boundary so a pure regex based answer would be:
grep '\bentry\b' file
However grep also supports -w flag (match words) so you can also use:
grep -w 'entry' file
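For the sample list from the question (saved here as file; the file name is just illustrative), both forms pick out the lines where entry appears as a whole word:
printf 'entry\nentry1\nentry2\nentry.qualifier\nentry.qualifier2\n' > file
grep -w 'entry' file
This prints entry, entry.qualifier and entry.qualifier2: the digits in entry1 and entry2 are word characters, so those lines are excluded, while the dot is not a word character, so the .qualifier lines still match.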
If you're using GNU grep, what can help here are the word boundary anchor operators \< and \> that it supports. That is to say, \<entry\>.
POSIX doesn't specify any \b or \< or a -w command line option. What if you have to use a grep that doesn't have them? The problem can be solved by testing each line of the file with a pure regular expression which must match it completely.
Suppose we want to pick out lines which contain the identifier entry that isn't a substring of a longer identifier name. Suppose identifiers are strings of English letters, digits and underscores. We can use this:
grep -E '^(|.*[^A-Za-z_0-9])entry([^A-Za-z_0-9].*|)$'
Note that the entire pattern is anchored on both ends, so that it must completely match an entire line. It matches any occurrence of entry which:
is either not preceded by anything, or else is preceded by a non-identifier character, possibly with other characters in front of it; and
is either not followed by anything, or else followed by a non-identifier character, possibly followed by other characters.
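For example, against a few illustrative lines (the sample input is an assumption, not from the question):
printf 'entry\nentry1\nentry.qualifier\nfoo entry bar\n' | grep -E '^(|.*[^A-Za-z_0-9])entry([^A-Za-z_0-9].*|)$'
This prints entry, entry.qualifier and foo entry bar, but not entry1.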
This approach is also useful if you have a specific idea of what constitutes a "word" which differs from the definition used by the GNU grep \b or \< operators. Suppose the file format is such that entry123 is in fact two different tokens entry and 123, and thus has to match. However entryabc must not match. For this, the GNU grep pattern \bentry\b or \<entry\> won't help; it will not match entry123. However, the above trick can readily be adapted to work:
grep -E '^(|.*[^A-Za-z])entry([^A-Za-z].*|)$'
I.e. entry surrounded by nothing, or else by characters that are not upper or lower case letters. So this is worth keeping in your back pocket.

Find results with grep and write to file

I would like to get all the results with grep or egrep from a file on my computer.
I just discovered that the regex for finding the string
'+33. ... ... ..' is the following:
\+33.[0-9].[0-9].[0-9].[0-9]. Or is this not correct?
My grep command is:
grep '\+31.[0-9].[0.9].[0.9].[0-9]' Samsung\ GT-i9400\ Galaxy\ S\ II.xry >> resultaten.txt
The output file is only giving me as following:
"Binary file Samsung GT-i9400 .xry matches"
..... and no results were given.
Can someone help me please with getting the results and writing to a file?
Firstly, the default behavior of grep is to print the line containing a match. Because binary files do not contain lines, it only prints a message when it finds a match in a binary file. However, this can be overridden with the -a flag.
But then, you end up with the problem that the "lines" it prints are not useful. You probably want to add the -o option to only print the substrings which actually matched.
Finally, your regex isn't correct at all. The lone dot . is a metacharacter which matches any character, including a control character or other non-text character. Given the length of your regex, you are unlikely to catch false positives, but you might want to think about what you want the dot to match. I have replaced it with [ ._-], which matches a space and some punctuation characters which are common in phone numbers. Maybe extend or change it, depending on what punctuation you expect in your phone numbers.
In regular grep, a plus simply matches itself. With grep -E the syntax would change and you would need to backslash the plus; but in the absence of this option, the backslash is superfluous. (It is actually wrong in this context in some dialects, including GNU grep, where a backslashed plus selects the extended meaning, which is a syntax error at the beginning of a pattern, where there is no preceding expression to repeat one or more times; GNU grep will just silently ignore it rather than report an error.)
On the other hand, your number groups are also wrong. [0-9] matches a single digit, where apparently the intention is to match multiple digits. For convenience, I will use the grep -E extension which enables + to match one or more repetitions of the previous character. Then we also get access to ? to mark the punctuation expressions as optional.
Wrapping up, try this:
grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}' \
'Samsung GT-i9400 Galaxy S II.xry' >resultaten.txt
In human terms, this requires a literal +33 followed by required additional digits, then followed by three number groups of one or more digits, each optionally preceded by punctuation.
This will overwrite resultaten.txt which is usually what you want; the append operation you had also makes sense in many scenarios, so change it back if that's actually what you want.
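Since the .xry file itself can't be reproduced here, here is the same pattern run against a made-up sample line (the phone number is hypothetical):
echo 'Bericht van +33612 345 678 ontvangen' | grep -Eao '\+33[0-9]+([ ._-]?[0-9]+){3}'
This prints just the extracted number, +33612 345 678.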
If each dot in your template +33. ... ... .. represents a required number, and the spaces represent required punctuation, the following is closer to what you attempted to specify:
\+33[0-9]([ ._-][0-9]{3}){2}[ ._-][0-9]{2}
That is, there is one required digit after 33, then two groups of exactly three digits and one of two, each group preceded by one non-optional spacing or punctuation character.
(Your exposition has +33 while your actual example has +31. Use whichever is correct, or perhaps allow any sequence of numbers for the country code, too.)
It means that grep found a match, but the file you're grepping isn't a text file; it's a binary containing non-printable bytes. If you really want to grep that file, try:
strings Samsung\ GT-i9400\ Galaxy\ S\ II.xry | grep '+31.[0-9].[0-9].[0-9].[0-9]' >> resultaten.txt

Regex code question

I'm new to this site and don't know if this is the right place to ask this question.
I was wondering if someone can explain the 3 regex code examples below in detail?
Thanks.
Example 1
`&([a-z]{1,2})(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);`i
Example 2
\\1
Example 3
`[^a-z0-9]`i','`[-]+`
The first regex looks like it'll match the HTML entities for accented characters (e.g., &eacute; is é; &oslash; is ø; &aelig; is æ; and &Acirc; is Â).
To break it down, & will match an ampersand (the start of the entity), ([a-z]{1,2}) will match any lowercase letter one or two times, (acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig) will match one of the terms in the pipe-delimited list (e.g., circ, grave, cedil, etc.), and ; will match a semicolon (the end of the entity). The i after the closing backtick is most likely the case-insensitive modifier, with the backticks serving as the pattern delimiters (as in PHP's preg_* functions); it's not part of the regex itself.
All told, it will match the HTML entities for accented/diacritic/ligature characters. Compared, though, to this page, it doesn't seem that it matches all of the valid entities (although it does catch many of them). Unless you run in case-insensitive mode (which that trailing i suggests), the [a-z] will only match lowercase letters. It will also never match the entities &eth; or &thorn; (ð, þ, respectively) or their capital versions (&ETH;, &THORN;, also respectively).
The second regex is simpler. \1 in a regex (or in regex find-replace) simply looks for the contents of the first capturing group (denoted by parentheses ()) and (in a regex) matches them or (in the replace of a find) inserts them. What you have there \\1 is the \1, but it's probably written in a string in some other programming language, so the coder had to escape the backslash with another backslash.
For your third example, I'm less certain what it does, but I can explain the regexes. [^a-z0-9] will match any character that's not a lowercase letter or number (or, if running in case-insensitive mode, anything that's not a letter or a number). The caret (^) at the beginning of the character class (that's anything inside square brackets []) means to negate the class (i.e., find anything that is not specified, instead of the usual find anything that is specified). [-]+ will match one or more hyphens (-). The i',' between the regexes is most likely the case-insensitive modifier for the first pattern, followed by a comma separating two quoted patterns in a list; you didn't say what language this is written in, but the backtick-delimited patterns look like PHP's preg syntax.
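One common place where these three pieces appear together is a URL-slug generator. Purely as an illustration (the sample string and the way the pieces are chained together are assumptions, not taken from your code), here is how the three patterns behave in sequence:
python3 - <<'PY'
import re
s = 'Caf&eacute; con leche!'
# Example 1 (with the \1 backreference from example 2): reduce &eacute; to just "e"
s = re.sub(r'&([a-z]{1,2})(acute|uml|circ|grave|ring|cedil|slash|tilde|caron|lig);', r'\1', s, flags=re.I)
# Example 3, first pattern: replace anything that is not a letter or digit with "-"
s = re.sub(r'[^a-z0-9]', '-', s, flags=re.I)
# Example 3, second pattern: collapse runs of hyphens into one
s = re.sub(r'-+', '-', s)
print(s.strip('-').lower())   # prints: cafe-con-leche
PY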