Regular expression: match up to 3rd whitespace - regex

I have the following line:
Data 5 in:out:40 Files
I want to match everything up to the 3rd whitespace. So, in this case, I want to get back
Data 5 in:out:40

How about:
^(\S+\s+\S+\s+\S+)
Let's break this down:
^ # start from string beginning
( # match everything inside (begin)
\S+ # match all non-whitespace(s)
\s+ # whitespace(s)
\S+ # match all non-whitespace(s)
\s+ # whitespace(s)
\S+ # match all non-whitespace(s)
) # match everything inside (end)
You can test the regex in a debugger.
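As a quick check, here is a minimal Python sketch that applies the pattern to the sample line. Python's re module is an assumption (the question does not name a language), and the function name is only illustrative.

```python
import re

# Three runs of non-whitespace separated by whitespace, anchored at the start.
PATTERN = re.compile(r'^(\S+\s+\S+\s+\S+)')

def up_to_third_whitespace(line):
    """Return the text before the 3rd whitespace run, or None if the line has fewer than three fields."""
    match = PATTERN.match(line)
    return match.group(1) if match else None

print(up_to_third_whitespace('Data 5 in:out:40 Files'))  # Data 5 in:out:40
```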

Related

Regex match text after last '-'

I am really stuck with the following regex problem:
I want to remove the last piece of a string, but only if '-' occurs more than once in the string.
Example:
BOL-83846-M/L -> Should match -M/L and remove it
B0L-026O1 -> Should not match
D&F-176954 -> Should not match
BOL-04134-58/60 -> Should match -58/60 and remove it
BOL-5068-4 - 6 jaar -> Should match -4 - 6 jaar and remove it (maybe in multiple search/replace steps)
It would be no problem if the regex needs two (or more) steps to remove it.
Now I have
[^-]*$
But in Sublime Text it matches B0L-026O1 and D&F-176954
Need your help please
You can match the first - in a capture group, and then match the second - till the end of the string to remove it.
In the replacement use capture group 1.
^([^-\n]*-[^-\n]*)-.*$
^ Start of string
( Capture group 1
[^-\n]*-[^-\n]* Match the first - between chars other than - (or a newline if you don't want to cross lines)
) Close capture group 1
-.*$ Match the second - and the rest of the line
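A small Python sketch of the replacement step, run against the strings from the question. Python is an assumption here (the question uses Sublime Text), and the function name is hypothetical.

```python
import re

# Group 1 keeps everything up to, but not including, the second hyphen.
SECOND_HYPHEN = re.compile(r'^([^-\n]*-[^-\n]*)-.*$')

def strip_after_second_hyphen(code):
    # Strings with fewer than two hyphens do not match and come back unchanged.
    return SECOND_HYPHEN.sub(r'\1', code)

for sample in ['BOL-83846-M/L', 'B0L-026O1', 'D&F-176954', 'BOL-5068-4 - 6 jaar']:
    print(sample, '->', strip_after_second_hyphen(sample))
```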
You can match with the following regular expression.
^[^-\r\n]*(?:$|-[^-\r\n]*(?=-|$))
If the string contains two or more hyphens this returns the beginning of the string up to, but not including, the second hyphen; else it returns the entire string.
The regular expression can be broken down as follows.
^ # match the beginning of the string
[^-\r\n]* # match zero or more characters other than hyphens,
# carriage returns and linefeeds
(?: # begin a non-capture group
$ # match the end of the string
| # or
- # match a hyphen
[^-\r\n]* # match zero or more characters other than hyphens,
# carriage returns and linefeeds
(?= # begin a positive lookahead
- # match a hyphen
| # or
$ # match the end of the string
) # end positive lookahead
) # end non-capture group
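Because this expression matches the part to keep rather than the part to remove, no replacement string is needed. A minimal Python sketch, assuming single-line inputs; the function name is illustrative.

```python
import re

# Matches the string up to, but not including, the second hyphen (if any).
FIRST_TWO_PARTS = re.compile(r'^[^-\r\n]*(?:$|-[^-\r\n]*(?=-|$))')

def keep_before_second_hyphen(code):
    match = FIRST_TWO_PARTS.match(code)
    # Fall back to the input if the pattern does not match (e.g. multi-line input).
    return match.group(0) if match else code

print(keep_before_second_hyphen('BOL-04134-58/60'))  # BOL-04134
print(keep_before_second_hyphen('D&F-176954'))       # D&F-176954
```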

What is regex formula to match two group of words in a body of text

Please show me the formula to match "alpha" OR "beta" if "delta" OR "gamma" is included in the body of text.
Text example:
James is alpha but not gamma but he may also be delta
This should be a match because "alpha" is in the text as well as "gamma".
And I would like it also to have matched because "alpha" is in the text as well as "delta".
The match formula should also apply if "alpha" was replaced by "beta" in the text example.
Depending on your regex flavour, this works for you:
^ # beginning of line
(?= # start lookahead: zero-length assertion that makes sure the line contains
.* # 0 or more any character but newline
\b # word boundary
(?: # start non capture group
delta # literally "delta"
| # OR
gamma # literally "gamma"
) # end group
\b # word boundary
) # end lookahead
.* # 0 or more any character but newline
\b # word boundary
( # start group 1
alpha # literally "alpha"
| # OR
beta # literally "beta"
) # end group
\b # word boundary
.* # 0 or more any character but newline
$ # end of line
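In flavours that support a verbose (free-spacing) mode, a commented breakdown like the one above can be compiled almost as written. The sketch below assumes Python's re.VERBOSE flag and condenses the comments; the variable name is illustrative.

```python
import re

WORDS = re.compile(r'''
    ^                             # beginning of line
    (?= .* \b(?:delta|gamma)\b )  # lookahead: delta or gamma occurs somewhere in the line
    .* \b(alpha|beta)\b .*        # the line contains alpha or beta (captured in group 1)
    $                             # end of line
''', re.VERBOSE)

text = 'James is alpha but not gamma but he may also be delta'
match = WORDS.search(text)
print(match.group(1) if match else None)  # alpha
```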
If you need to match the pairs in either order, you can use lookahead assertions:
^(?=.*\b(?:alpha|beta)\b)(?=.*\b(?:gamma|delta)\b).*
Test it live on regex101.com.
Explanation:
Each lookahead checks that one of the two terms is present somewhere in the string. Both lookaheads need to succeed in order for the match to proceed. The .* at the end is not strictly necessary (just to visualize the match in the regex tester); if you only need to check for match/non-match, then you can remove it. In that case, the match result will be an empty string.
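To illustrate the match/non-match use, here is a minimal Python sketch with the trailing .* removed, so a successful match is an empty string. Python and the helper name are assumptions made for illustration.

```python
import re

# Two independent lookaheads; nothing is consumed, so the match itself is empty.
BOTH_GROUPS = re.compile(r'^(?=.*\b(?:alpha|beta)\b)(?=.*\b(?:gamma|delta)\b)')

def mentions_both(text):
    return BOTH_GROUPS.search(text) is not None

print(mentions_both('James is alpha but not gamma but he may also be delta'))  # True
print(mentions_both('delta comes before beta'))                                # True
print(mentions_both('alpha and beta only'))                                    # False
```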

Exclude specific name in regex

I have multiple directories with names like app1.6.11, app1.7.12, app1.8.34, test1, test2.
I want to match regex for all the directories which start with app and to exclude app1.8.34.
I have tried:
^(app.+)[^(app1.8.34)]
If you want to match just the dot, you should escape it as \. or else it would match any character.
You could use a negative lookahead:
^app(?!1\.8\.34).+$
That would match
^ # The beginning of the string
app # Match app
(?! # Negative lookahead that asserts what follows is not
1\.8\.34 # Match 1.8.34
) # Close negative lookahead
.+ # Match any character one or more times
$ # End of the string
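A short Python sketch that filters the directory names from the question with this pattern (Python is an assumption; the question does not say which tool runs the regex).

```python
import re

APP_DIR = re.compile(r'^app(?!1\.8\.34).+$')

names = ['app1.6.11', 'app1.7.12', 'app1.8.34', 'test1', 'test2']
matching = [name for name in names if APP_DIR.match(name)]
print(matching)  # ['app1.6.11', 'app1.7.12']

# Variant (an assumption, not part of the answer above): anchoring the lookahead
# with $ excludes only the exact name app1.8.34, not longer names that start with it.
APP_DIR_EXACT = re.compile(r'^app(?!1\.8\.34$).+$')
```

Note that the unanchored lookahead also rejects any longer name that begins with app1.8.34, such as a hypothetical app1.8.345; the anchored variant keeps it.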

Regex pattern without one case

I would like to remove some strings from filename.
I want to remove every string in brackets, but not if it contains the string "remix" or "Remix" or "REMIX".
Now I have got
sed "s/\s*\(\s?[A-z0-9. ]*\)//g"
but how do I exclude cases where the string contains remix?
You can use a capture group:
sed 's/\(\s*([^)]*remix[^)]*)\)\|\s*(\s\?[a-z0-9. ]*)/\1/gi'
When the "remix branch" doesn't match, the capture group is not defined and the matched part is replaced with an empty string.
When the "remix branch" succeeds, the matched part is replaced by the content of the capture group, so by itself.
Note: if it helps to avoid false positives, you can add word boundaries around "remix": \bremix\b
pattern details:
\( # open the capture group 1
\s* # zero or more white-spaces
( # a literal parenthesis
[^)]* # zero or more characters that are not a closing parenthesis
remix
[^)]*
)
\) # close the capture group 1
\| # OR
# something else between parentheses
\s* # note that it is essential that the two branches are able to
# start at the same position. If you remove \s* in the first
# branch, the second branch will always win when there's a space
# before the opening parenthesis.
(\s\?[a-z0-9. ]*)
\1 is the reference to the capture group 1
i makes the pattern case-insensitive
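The same "keep the captured branch, drop the other" idea can be sketched outside sed. The Python version below is an illustration under assumptions: the file name is made up, and Python's syntax reverses the escaping (plain parentheses group, escaped ones are literal).

```python
import re

# Branch 1 (captured): a parenthesized part containing "remix".
# Branch 2 (not captured): any other parenthesized part to remove.
BRACKETS = re.compile(r'(\s*\([^)]*remix[^)]*\))|\s*\(\s?[a-z0-9. ]*\)', re.IGNORECASE)

def strip_brackets_keep_remix(name):
    # group(1) is None when branch 2 matched, so that text is replaced with ''.
    return BRACKETS.sub(lambda m: m.group(1) or '', name)

print(strip_brackets_keep_remix('Song (Live) (Club Remix) (2012).mp3'))  # Song (Club Remix).mp3
```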
[EDIT]
If you want to do it in a POSIX-compliant way, you must use a different approach, because several GNU features are not available: in particular the alternation \|, but also the i modifier, the \s character class, and the optional quantifier \?.
This other approach matches any characters that are not an opening parenthesis and any substrings enclosed in parentheses with "remix" inside, followed by optional white-spaces and an optional substring enclosed in parentheses.
As you can see, everything is optional and the pattern can match an empty string, but that isn't a problem.
Everything before the parenthesized part to remove is captured in group 1.
sed 's/\(\([^(]*([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*)[^ \t(]*\([ \t]\{1,\}[^ \t(]\{1,\}\)*\)*\)\([ \t]*([^)]*)\)\{0,1\}/\1/g;'
pattern details:
\( # open the capture group 1
\(
[^(]* # all that is not an opening parenthesis
# substring enclosed in parentheses containing "remix"
( [^)]* [Rr][Ee][Mm][Ii][Xx] [^)]* )
# Reach the next parenthesis without matching the white-spaces
# before it (otherwise the leading white-spaces are not removed)
[^ \t(]* # all that is not a white-space or an opening parenthesis
# optional groups of white-spaces followed by characters that are
# neither white-spaces nor opening parentheses
\( [ \t]\{1,\} [^ \t(]\{1,\} \)*
\)*
\) # close the capture group 1
\(
[ \t]* # leading white-spaces
([^)]*) # parenthesis
\)\{0,1\} # makes this part optional (this avoids removing a "remix" part
# alone at the end of the string)
Word boundaries aren't available in this mode either. So the only way to emulate them is to list the four possibilities:
([Rr][Ee][Mm][Ii][Xx]) # poss1
([Rr][Ee][Mm][Ii][Xx][^a-zA-Z][^)]*) # poss2
([^)]*[^a-zA-Z][Rr][Ee][Mm][Ii][Xx]) # poss3
([^)]*[^a-zA-Z][Rr][Ee][Mm][Ii][Xx][^a-zA-Z][^)]*) # poss4
and to replace ([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*) with:
\(poss1\)\{0,\}\(poss2\)\{0,\}\(poss3\)\{0,\}\(poss4\)\{0,\}
Just skip the lines matching "remix":
sed '/([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*)/! s/([^)]*)//g'
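The address-and-negate logic of that sed command can be sketched in Python as follows. This is only an illustration: the function name and sample titles are made up, and, like the sed version, it leaves every bracket on a line untouched as soon as one of them contains "remix".

```python
import re

REMIX_BRACKET = re.compile(r'\([^)]*remix[^)]*\)', re.IGNORECASE)
ANY_BRACKET = re.compile(r'\([^)]*\)')

def strip_unless_remix(line):
    # Lines with a "remix" bracket are skipped entirely (the sed "!" address).
    if REMIX_BRACKET.search(line):
        return line
    # Otherwise remove every parenthesized part; surrounding spaces are kept.
    return ANY_BRACKET.sub('', line)

print(repr(strip_unless_remix('Song (Live)')))               # 'Song '
print(repr(strip_unless_remix('Song (Live) (Club Remix)')))  # 'Song (Live) (Club Remix)'
```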
where the brackets are (US): []
sed '/remix\|REMIX\|Remix/ !s/\[[^]]*]//g'
where the brackets are (ROW): ()
sed '/remix\|REMIX\|Remix/ !s/([^)]*)//g'
assuming:
- there are no nested brackets
- other forms of remix (ReMix, ...) are not recognized, so their brackets are removed
- remix may appear anywhere in the title, not only inside brackets (i love remix) [if needed, specify which to keep and which to remove]

Regex to fail if multiple matches found

Take the following regex:
P[0-9]{6}(\s|\.|,)
This is designed to check for a 6 digit number preceded by a "P" within a string - works fine for the most part.
Problem is, we need the match to fail if more than one match is found - is that possible?
i.e. make Text 4 in the following screenshot fail but still keep all the others failing / passing as shown:
(this RegEx is being executed in a SQL .net CLR)
If the regex engine used by this tool is indeed the .NET engine, then you can use
^(?:(?!P[0-9]{6}[\s.,]).)*P[0-9]{6}[\s.,](?:(?!P[0-9]{6}[\s.,]).)*$
If it's the native SQL engine, then you can't do it with a single regex match because those engines don't support lookaround assertions.
Explanation:
^ # Start of string
(?: # Start of group which matches...
(?!P[0-9]{6}[\s.,]) # unless it's the start of Pnnnnnn...
. # any character
)* # any number of times
P[0-9]{6}[\s.,] # Now match Pnnnnnn exactly once
(?:(?!P[0-9]{6}[\s.,]).)* # Match anything but Pnnnnnn
$ # until the end of the string
Test it live on regex101.com.
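A minimal sketch of this pattern in Python, whose re module accepts the same lookahead syntax as .NET for this expression. The sample strings are hypothetical, since the screenshot from the question is not available.

```python
import re

CODE = r'P[0-9]{6}[\s.,]'
# Any characters that do not start a code, one code, then again no further code.
EXACTLY_ONE = re.compile(rf'^(?:(?!{CODE}).)*{CODE}(?:(?!{CODE}).)*$')

def has_single_code(text):
    return EXACTLY_ONE.search(text) is not None

print(has_single_code('Ref P123456, done'))      # True
print(has_single_code('P123456, and P654321.'))  # False
print(has_single_code('no code here'))           # False
```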
or use this pattern
^(?!(.*P[0-9]{6}[\s.,]){2})(.*P[0-9]{6}[\s.,].*)$
Demo
Basically, check that the pattern exists and is not repeated twice.
^ Start of string
(?! Negative Look-Ahead
( Capturing Group \1
. Any character except line break
* (zero or more)(greedy)
P "P"
[0-9] Character Class [0-9]
{6} (repeated {6} times)
[\s.,] Character Class [\s.,]
) End of Capturing Group \1
{2} (repeated {2} times)
) End of Negative Look-Ahead
( Capturing Group \2
. Any character except line break
* (zero or more)(greedy)
P "P"
[0-9] Character Class [0-9]
{6} (repeated {6} times)
[\s.,] Character Class [\s.,]
. Any character except line break
* (zero or more)(greedy)
) End of Capturing Group \2
$ End of string
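As a sanity check, the sketch below compares this pattern with a plain count of matches. It assumes Python and single-line input; the sample strings and the count-based cross-check are illustrative additions, not part of the answer above.

```python
import re

CODE = re.compile(r'P[0-9]{6}[\s.,]')
NOT_TWICE = re.compile(r'^(?!(.*P[0-9]{6}[\s.,]){2})(.*P[0-9]{6}[\s.,].*)$')

def exactly_one(text):
    return NOT_TWICE.search(text) is not None

samples = ['Ref P123456, done', 'P123456, and P654321.', 'no code here', 'P1234567,']
for text in samples:
    # Cross-check: the single-regex answer should agree with counting the matches.
    print(text, exactly_one(text), len(CODE.findall(text)) == 1)
```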