Excluding pattern in regular expression search and replace - regex

I have this string
asp.net somedomain.com
I need to strip out the domain dot extension part only except in certain cases. So I want this:
asp.net somedomain
Any time there is vb.net, asp.net etc.. I do not want to strip out the extension.
I tried this in perl with no effect.
$company =~ s/(?=\w+)(?!=asp|vb|c#)\.[a-zA-Z]{2,6}\b/\1/g;
My logic is stuff before the dot must be one or more alpha and not asp or vb or c#.

You can use a negative lookbehind. You were almost there, but you used lookaheads, which inspect the text after the current position rather than the text before it.
RegExp: (?<!asp|vb|c\#)\.[a-zA-Z]{2,6}\b
Replace with nothing
Explained demo here: http://regex101.com/r/tG5rO1
To work around the variable-length lookbehind error, use this: (?<!asp)(?<!vb|c\#)\.[a-z]{2,6}\b
Edit: separate LookBehind group for different length excluded word
This will only find TLDs that are not preceded by one of the excluded words.
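As a rough illustration of the split lookbehind, here is a minimal Python sketch. Python's built-in re module also insists on fixed-width lookbehinds, so the same two-group workaround applies; the helper name strip_tld and the extra sample strings are assumptions made for the example.

```python
import re

# Two separate lookbehinds: "asp" is 3 characters wide, while "vb" and "c#"
# are both 2 wide, and each lookbehind must have a single fixed width.
TLD = re.compile(r"(?<!asp)(?<!vb|c\#)\.[a-z]{2,6}\b", re.IGNORECASE)

def strip_tld(text):
    """Remove dot-extensions unless they directly follow asp, vb or c#."""
    return TLD.sub("", text)

print(strip_tld("asp.net somedomain.com"))
```

Note that this simple form also keeps the extension of any word that merely ends in an excluded word (wasp.net, for instance).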
Update:
To take care of special cases: don't match words that merely end in an excluded word, and match any capitalization of an excluded word (e.g. vB, VB, vb, Vb).
RegExp: \b(?<!\b[aA][sS][pP])(?<!\b[vV][bB]|\b[cC]\#)\.[a-zA-Z]{2,6}\b
Explained demo: http://regex101.com/r/bR3kJ8
Or: \b(?<!\basp)(?<!\bvb|\bc\#)\.[a-z]{2,6}\b
When used with case insensitive RegEx modifier i
Update #2
Safer as it cares only about .net TLD and excluded words for it:
s/(^|\s)(?!(?:visual)?(?:basic|studio|asp|v[bs]|c\#)\.net)(\w+)(?:\.com?\.[a-z]{2}|\.[a-z]{2,6})\b/\1\2/gi
Unlike the previous variants, this one needs a replacement string (\1\2), because it also captures the text that must be kept.
Explained demo: http://regex101.com/r/kL5mQ5

Just match the last one:
my $s = q{asp.net somedomain.com};
my ($company) = ($s =~ / ([A-Za-z]{2,}) [.] (?:[A-Za-z]{2,}) \z /x);
print $company, "\n";
Or, split on space and dot:
my $s = q{asp.net somedomain.com};
my ($company) = split /[.]/, (split ' ', $s)[-1];
print $company, "\n";
How much work you want to put into the pattern depends on how much variation there is in your input. The examples above are based on the sample input you provided.
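The split-based idea carries over to other languages almost unchanged. A minimal Python sketch, assuming the company name is always the part before the first dot of the last space-separated token (company_name is a made-up helper name):

```python
def company_name(text):
    """Return the part before the first dot of the last whitespace-separated token."""
    last_token = text.split()[-1]      # "somedomain.com"
    return last_token.split(".")[0]    # "somedomain"

print(company_name("asp.net somedomain.com"))
```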

Related

Regex to match text from multiple links

How to extract links which contain a certain word?
For example:
https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text
How can I match only the links containing "word", starting from the regex below?
((https:).*?(###))
The result should be like this
https://www.test.com/text/word/2
https://www.test.com/text/text/word/3
https://www.test.com/word/3/text/text
Let's try to build such regex. First we need to find the beginning of url:
/(https?:\/\//
We add ? after https for http urls.
Then we need to find any text except ###, so we need to add:
(?:(?!###).)*
which means - any amount of characters not starting a ### sequence.
Also we need to add word itself and previous sub-expression again, since word can be surrounded by any text:
word(?:(?!###).)*
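Strictly speaking, the last sub-expression already consumes every character up to the next ###, so the match could end here. To pin the end of the match down explicitly, we can also require that the last matched character is followed by ### or the end of the string:
.(?=###|$)
which means any character followed by ### or the end of the string (note that this requires at least one character after word). The final expression will look like:
/(https?:\/\/(?:(?!###).)*word(?:(?!###).)*.(?=###|$))/g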
But I believe it's better to just split the text by ### and then check each part for the needed word with String.prototype.includes.
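To compare the two routes side by side, here is a small Python sketch. It uses the tempered pattern without the trailing .(?=###|$) part, which this sample does not need; the variable names are assumptions made for the example.

```python
import re

urls = ("https://www.test.com/text/1###https://www.test.com/text/word/2###"
        "https://www.test.com/text/text/word/3###https://www.test.com/3/text###"
        "https://www.test.com/word/3/text/text")

# Tempered dot: (?:(?!###).)* consumes characters one at a time and stops
# in front of the next "###", so a match can never spill into the next URL.
TEMPERED = re.compile(r"https?://(?:(?!###).)*word(?:(?!###).)*")
by_regex = TEMPERED.findall(urls)

# Same result without a lookahead: split first, then do a substring test.
by_split = [u for u in urls.split("###") if "word" in u]

print(by_regex)
print(by_split)
```

Both lists hold the same three URLs; the split version is easier to read and to change later.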
If the word has to be a part of the pathname, you might use filter in combination with URL and check if the parts of the pathname contain word.
let str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
let filteredUrls = str.split("###")
  .filter(s =>
    new URL(s).pathname
      .split('/')
      .includes('word')
  );
console.log(filteredUrls);
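For readers outside JavaScript, the same pathname check can be sketched in Python with the standard urllib.parse module. This is an illustrative translation, not part of the original answer, and the variable names are assumptions.

```python
from urllib.parse import urlsplit

text = ("https://www.test.com/text/1###https://www.test.com/text/word/2###"
        "https://www.test.com/text/text/word/3###https://www.test.com/3/text###"
        "https://www.test.com/word/3/text/text")

# Keep a URL only if "word" is a complete path segment, not just a substring.
filtered_urls = [
    url for url in text.split("###")
    if "word" in urlsplit(url).path.split("/")
]

print(filtered_urls)
```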
If you want to use regex only and possessive quantifiers are supported (The javascript tag has been removed) you might use:
https?://[^#w]*(?:#(?!##)|w(?!ord)|[^#w]*)++word.*?(?=###|$)
Previous answer
You are probably looking for this regular expression:
https://www\.test\.com/(text/)*word/\d+(/text)*
Here is how you can use it in JavaScript (every slash / is escaped with a backslash: \/):
var str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var urls = str.match(/https:\/\/www\.test\.com\/(text\/)*word\/\d+(\/text)*/g);
console.log(urls);
In the array you get exactly the elements you wanted.
Update, after the question was edited and the author added a comment
If you need to extract the words themselves from your example string, then you have to use a slightly more complex regular expression:
var str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var urls = str.match(/(?<=\/)\w+(?=\/\d+\/\w)|(?<=(\w\/\w+\/))\w+(?=\/\d)/g);
console.log(urls);
Explanation
Here is the regular expression /(?<=\/)\w+(?=\/\d+\/\w)|(?<=(\w\/\w+\/))\w+(?=\/\d)/g, delimited by /.../ and with the g flag, which makes the search return every occurrence rather than only the first.
The regular expression has two alternatives ...|...
The first one (?<=\/)\w+(?=\/\d+\/\w) captures cases when the searched word is directly behind the slash (?<=\/) and before more words behind the number (?=\/\d+\/\w).
https://www.test.com/word/3/text/text
The second alternative (?<=(\w\/\w+\/))\w+(?=\/\d) captures cases where the word is preceded by other words following the domain (?<=(\w\/\w+\/)) (in fact two slashes separated by alphanumeric characters) and the searched word is immediately before the slash followed by the number (?=\/\d).
https://www.test.com/text/word/2
https://www.test.com/text/text/word/3
All slashes must be escaped: \/.
The construction (?<=...) means lookbehind in regular expressions and (?=...) means lookahead in regular expressions.
Note 1. The example above currently works well only in the Chrome browser, because:
(...) now lookbehind is part of the ECMAScript 2018 specification. As of this writing (late 2018), Google's Chrome browser is the only popular JavaScript implementation that supports lookbehind. So if cross-browser compatibility matters, you can't use lookbehind in JavaScript.
Note 2. Lookbehind, even where it is supported, must contain a fixed-length expression in most regular expression engines. The example above does not respect that restriction, yet it is still valid in the engines that allow variable-length lookbehind: Google Chrome's JavaScript engine, the JGsoft engine and the .NET Framework RegEx classes.
Note 3. The lookbehind syntax or its poorer \K replacement are widely supported by many regular expression engines used in a large group of programming languages.
More explanation about regular expressions which I used you can find for example here.
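As a small illustration of the lookbehind and lookahead constructions, here is a Python sketch that runs only the first alternative. Python's built-in re module accepts only fixed-width lookbehind, so the second alternative, whose lookbehind has variable length, is left out on purpose.

```python
import re

urls = ("https://www.test.com/text/1###https://www.test.com/text/word/2###"
        "https://www.test.com/text/text/word/3###https://www.test.com/3/text###"
        "https://www.test.com/word/3/text/text")

# (?<=/)        lookbehind: the segment must be preceded by a slash
# (?=/\d+/\w)   lookahead: it must be followed by /digits/ and a word character
# Neither assertion becomes part of the match itself.
segments = re.findall(r"(?<=/)\w+(?=/\d+/\w)", urls)

print(segments)
```

On the sample string this finds a single segment, the word in the last URL, which matches the case described for the first alternative.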
You may first split by ### then check whether /word/ exists in each element:
var s = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var result = [];
s.split(/###/).forEach(function(el) {
  if (el.includes('/word/'))
    result.push(el);
});
// or else by using filter
// result = s.split(/###/).filter(el => el.includes('/word/'))
console.log(result);

Regex - skip over expressions and parse the rest

I use regular expressions for sorting data into groups. The lines look somewhat like:
testword test
test testword
tes.w. tes.
tes tes.w.
tes.w othertexttobefound
sometexttobefound testword somemoretextwhichdoesnotmatter
The word test is to be found as well as othertexttobefound and sometexttobefound.
Now I am trying to tell my parser that it is supposed to plainly ignore testword and its derivatives while searching and focus on the rest of my data entries. The "good words" and the "bad words" can be anywhere in each line.
I have tried [^w] which is fine for the beginning of strings, but in my versions not for the other cases. Also (?:w) didn't do the trick. I cannot use lookarounds as these would keep the whole line from being detected.
After long searches on the internet I am hoping for help here!
After much appreciated help from Naxos84, I am adding some German real life examples:
sozialabgabe sozialarbeiter
soz.abg. sozialarbeiter
sozarbeiter soz.abg.
sozialarbeiter otherirrelevantstuff
otherirrelevantstuff soz abg
otherirrelevantstuff sozabg
otherirrelevantstuff sozialabgabe
If I search with:
sozial["^\ab"]|soz["^\ab"]|sometexttobefound|othertexttobefound
Lines 6 and 7 get marked as well, but I don't want those.
What am I doing wrong?
To find all the matches you want, i.e. any occurrence of "test", "sometexttobefound" and "othertexttobefound", you can try the following regex:
test[^\w]|sometexttobefound|othertexttobefound
This regex means:
Find every "test" that is followed by a non-word character, OR sometexttobefound, OR othertexttobefound.
I tried this regex with the following text (I added a few more occurrences of test)
testword test
test testword
tes.w. testtes.
tes tes.w. test
tes.w othertexttobefound
sometexttobefound testword somemoretextwhichdoesnotmatter
at regexr (when using the global flag)
If you also want to find things like "tes" I guess you should add it. (I'm not a regex expert)
Like:
test[^\w]|tes[^\w]|sometexttobefound|othertexttobefound
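One detail of test[^\w] is that the character class has to consume a character, so a "test" at the very end of the input is missed and the matched text includes the following separator. The Python sketch below contrasts it with a negative lookahead, test(?!\w). The lookahead variant is a suggested alternative, not something from the answer above, and it only helps where lookarounds are acceptable.

```python
import re

lines = ["testword test", "test testword", "tes.w othertexttobefound"]

# Variant from the answer: "test" must be followed by one non-word character.
consuming = re.compile(r"test[^\w]|sometexttobefound|othertexttobefound")
# Alternative: "test" must NOT be followed by a word character (zero-width check).
lookahead = re.compile(r"test(?!\w)|sometexttobefound|othertexttobefound")

consuming_hits = {line: consuming.findall(line) for line in lines}
lookahead_hits = {line: lookahead.findall(line) for line in lines}

for line in lines:
    print(repr(line), consuming_hits[line], lookahead_hits[line])
```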
If you want to get all words from the text except from some special words, you could use:
@words = grep{$_ ne 'testword'} split /\P{L}+/, $str;
(if $str is your complete string)
See perl docs for \P{...}. Instead of \P{L}, you could also use \W, but those are locale-dependent.
But if you need to use regexps only, then you could use
@words = $str =~ /\b(?!testword)\p{L}+\b/g;
But \b is locale-dependent too, so you might want to use \b{...} or rebuild the word-boundary matches with \p{L}:
@words = $str =~ /
(?:(?<=\p{L})(?!\p{L})|(?<!\p{L})(?=\p{L}))
(?!testword)\p{L}+
(?:(?<=\p{L})(?!\p{L})|(?<!\p{L})(?=\p{L}))
/gx;
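The two approaches can be sketched in Python as follows. Python's built-in re module has no \p{L}, so this example assumes ASCII letters ([A-Za-z]) as a stand-in; the variable names are likewise only illustrative.

```python
import re

text = "sometexttobefound testword somemoretextwhichdoesnotmatter"

# Split on runs of non-letters, then drop the unwanted word (and empty strings).
by_split = [w for w in re.split(r"[^A-Za-z]+", text) if w and w != "testword"]

# Regex only: the negative lookahead at a word boundary rejects any word
# that starts with "testword".
by_regex = re.findall(r"\b(?!testword)[A-Za-z]+\b", text)

print(by_split)
print(by_regex)
```

The two are not strictly equivalent: the split version removes only the exact word testword, while the lookahead version also skips longer words that begin with it.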

How can I match multiple hits between 2 delimiters?

Hi, my fellow RegEx'ers ;)
I'm trying to match multiple Texts between every two quotes
Here's my text:
...random code
someArray[] = ["Come and",
"get me,",
"or fail",
"trying!",
"Yours truly"]
random code...
So far, I managed to get the correct matches with two patterns, executed after each other:
(?s)someArray\[\].*?=.*?\[(.*?)\]
this extracts the text between the two brackets and on the result, I use this one:
"(.*?)"
This is working just fine, but I'd love to get the Texts in one regex.
Any help is highly appreciated!
Consider using \G. With its help, you may match "(.*?)" preceded by either someArray[] = [ or previous match of "(.*?)" (well, strictly speaking previous match of entire regex). Then just grab first capture groups from all matches:
(?:(?s).*someArray\[\].*?=.*?\[|\G[^"\]]+)"(.*?)"
Demo: https://regex101.com/r/eBQWdU/3
How you grab the first capture groups depends on the language you're using regex in. For example, in PHP you may do something like this:
preg_match_all('/(?:(?s).*someArray\[\].*?=.*?\[|\G[^"\]]+)"(.*?)"/', $input, $matches);
$array_items = $matches[1];
Demo: https://ideone.com/mZgU1x
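Not every engine offers \G; Python's built-in re module, for instance, does not. In that case the two-pass approach from the question remains the straightforward route. A minimal sketch, with array_items as a made-up helper name:

```python
import re

source = '''...random code
someArray[] = ["Come and",
"get me,",
"or fail",
"trying!",
"Yours truly"]
random code...'''

def array_items(code):
    """Return the quoted strings inside the someArray[] literal."""
    # Pass 1: isolate the text between the brackets after "someArray[] =".
    block = re.search(r"someArray\[\].*?=.*?\[(.*?)\]", code, re.DOTALL)
    if block is None:
        return []
    # Pass 2: pull every double-quoted string out of that text.
    return re.findall(r'"(.*?)"', block.group(1))

print(array_items(source))
```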

Is there a regex engine that supports "for each captured group" in replacement strings?

Here's my example. If I want to use a regex to replace tabs in the code with spaces, but wanted to preserve tab characters in the middle or end of a line of code, I would use this as my search string to capture each tab character at the start of a line: ^(\t)+
Now, how could I write a search string that replaces each captured group with four spaces? I'm thinking there must be some way to do this with backreferences?
I've found I can work around this by running similar regex-replacements (like s/^\t/    /g, s/^    \t/        /g, ...) multiple times until no more matches are found, but I wonder if there's a quicker way to do all the necessary replacements at once.
Note: I used sed format in my example, but I'm not sure if this is possible with sed. I'm wondering if sed supports this, and if not, is there a platform that does? (e.g., there's a Python/Java/bash extended regex lib that supports this.)
With perl and other languages that support this feature (Java, PCRE(PHP, R, libboost), Ruby, Python(the new regex module), .NET), you can use the \G anchor that matches the position after the last match or the start of the string:
s/(?:\G|^)\t/    /gm
This works in Perl. Maybe sed too, I don't know sed.
It relies on doing an eval, basically a callback.
It takes the length of $1, then repeats '    ' (four spaces) that many times.
Perl sample.
my $str = "
\t\t\tThree
\t\tTwo
\tOne
None";
$str =~ s/^(\t+)/ '    ' x length($1) /emg;
print "$str\n";
Output
            Three
        Two
    One
None
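The same callback idea works in any language whose replace function accepts a function as the replacement. A minimal Python sketch, where expand_leading_tabs and its width parameter are names assumed for the example:

```python
import re

def expand_leading_tabs(text, width=4):
    """Replace each tab at the start of a line with `width` spaces."""
    return re.sub(
        r"^\t+",                                   # run of tabs at line start
        lambda m: " " * (width * len(m.group())),  # one block of spaces per tab
        text,
        flags=re.MULTILINE,
    )

sample = "\t\t\tThree\n\t\tTwo\n\tOne\tstill a tab\nNone"
print(expand_leading_tabs(sample))
```

Tabs in the middle or at the end of a line are untouched, because the pattern is anchored to the start of each line.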
Just another idea that came to me: this could also be solved with a positive lookbehind:
s/(?<=^[\t]*)\t/    /gm
It's ugly, but it works in engines that allow variable-length lookbehind (.NET, for example).
sed ':a
s/^\(\t*\)\t/\1    /
ta' YourFile
This repeats a single substitution in sed until it no longer matches; it's a workaround.

How to return the first five digits using Regular Expressions

How do I return the first 5 digits of a string of characters in Regular Expressions?
For example, if I have the following text as input:
15203 Main Street
Apartment 3 63110
How can I return just "15203".
I am using C#.
This isn't really the kind of problem that's ideally solved by a single-regex approach -- the regex language just isn't especially meant for it. Assuming you're writing code in a real language (and not some ill-conceived embedded use of regex), you could do perhaps (examples in perl)
# Capture all the digits into an array
my @digits = $str =~ /(\d)/g;
# Then take the first five and put them back into a string
my $first_five_digits = join "", @digits[0..4];
or
# Copy the string, removing all non-digits
(my $digits = $str) =~ tr/0-9//cd;
# And cut off all but the first five
$first_five_digits = substr $digits, 0, 5;
If for some reason you really are stuck doing a single match, and you have access to the capture buffers and a way to put them back together, then wdebeaum's suggestion works just fine, but I have a hard time imagining a situation where you can do all that, but don't have access to other language facilities :)
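For comparison, the collect-then-slice idea takes only a couple of lines in Python. This is a sketch; first_five_digits is a hypothetical helper name.

```python
import re

def first_five_digits(text):
    """Collect every digit in order, then keep at most the first five."""
    return "".join(re.findall(r"\d", text))[:5]

print(first_five_digits("15203 Main Street\nApartment 3 63110"))
```

If the input holds fewer than five digits, the slice simply returns whatever digits there are.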
It depends on your regex flavor and programming language (C#, Perl, etc.), but in C# you'd do something like
string rX = @"\D+";
input = Regex.Replace(input, rX, "");
return input.Substring(0, 5);
Note: I'm not sure about that pattern (others here may have a better one), but basically, since a regex itself doesn't "replace" anything and only matches patterns, you'd have to look for any non-digit characters; once you've matched them, you'd replace them with your language's version of the empty string (string.Empty or "" in C#), and then grab the first 5 characters of the resulting string.
You could capture each digit separately and put them together afterwards, e.g. in Perl:
$str =~ /(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)/;
$digits = $1 . $2 . $3 . $4 . $5;
I don't think a regular expression is the best tool for what you want.
Regular expressions are to match patterns... the pattern you are looking for is "a(ny) digit"
Your logic external to the pattern is "five matches".
Thus, you either want to loop over the first five digit matches, or capture five digits and merge them together.
But look at that Perl example -- that's not one pattern -- it's one pattern repeated five times.
Can you do this via a regular expression? Just like parsing XML -- you probably could, but it's not the right tool.
Not sure this is best solved by regular expressions since they are used for string matching and usually not for string manipulation (in my experience).
However, you could make a call to:
strInput = Regex.Replace(strInput, @"\D+", "");
to remove all non number characters and then just return the first 5 characters.
If you are wanting just a straight regex expression which does all this for you I am not sure it exists without using the regex class in a similar way as above.
A different approach -
# Copy over
$temp = $str;
# Remove all non-digits
$temp =~ s/\D//g;
# Get the first 5 digits, exactly.
$temp =~ /(\d{5})/;
# Grab the match. ASSUMES that there will be a match.
$first_digits = $1;
$result =~ s/^(\d{5}).*/$1/;
This replaces any text that starts with exactly five digits (\d{5}), followed by any number of other characters (.*), with $1, which is whatever the parentheses captured: the first five digits.
If you want the first five characters of any kind:
$result =~ s/^(.{5}).*/$1/;
Use whatever programming language you are working in to evaluate this, e.g.:
Regex.Replace(text, "^(.{5}).*", "$1");
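One caveat with the ^(\d{5}).* replacement on multi-line input such as the address in the question: by default . does not match a newline, so only the first line is trimmed. The Python sketch below shows the difference, assuming the whole address is held in one string (the variable names are made up for the example).

```python
import re

address = "15203 Main Street\nApartment 3 63110"

# Default mode: ".*" stops at the end of the first line.
first_line_only = re.sub(r"^(\d{5}).*", r"\1", address)

# DOTALL lets "." match newlines too, so the rest of the input is dropped.
whole_input = re.sub(r"^(\d{5}).*", r"\1", address, flags=re.DOTALL)

print(repr(first_line_only))
print(repr(whole_input))
```

This variant only helps when the five digits are contiguous and sit at the very start of the string.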