Regex to remove the end of a URL

First off, I am using the built-in .NET regex engine (at least, that is what I was told I am using). I match with the group pattern A(.*?)B and then replace the match with nothing, which basically removes it. What I am doing is removing some unwanted text from the end of a URL I am scraping.
The problem is that for "B" I am using the quote character, which needs to stay in the output. Is there a way to say: remove everything between A and B, but not A and B themselves? A and B still have to be used as the markers in this example. I hope I explained this well enough.
In case I didn't, here is an example: random words and spaces "example.com" more random text. Nothing on either side can be used as an indicator; sometimes there is a space, sometimes not, sometimes words, letters, etc. Everything changes on each side, including the spaces.
I need to keep the quotes, so I can't just use "(.*?)", because once I use the replace function the quotes are removed along with the content.
To reword: with A(.*?)B and a replace I am removing A, B and everything in between, but I want to keep A and B. I cannot use any characters or words before or after A and B because they are random and change. For example, how would you remove the content of "example.com" but keep the quotes, when everything before the quotes and inside the quotes keeps changing?

You can use lookaround assertions.
I don't know the exact syntax for the regex flavour you're using, but you can adapt this to your language:
replace (?<=A).*?(?=B) with nothing
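As a rough illustration of the lookaround idea, here is a minimal Python sketch. The literal markers A and B and the helper name strip_between are placeholders chosen for this example, not anything from the question; the real markers (such as the quote character) would be substituted in.

```python
import re

# "A" and "B" stand in for the real markers. Lookarounds only assert
# that the marker is present; they do not consume it, so it survives
# the replacement.
PATTERN = re.compile(r'(?<=A).*?(?=B)')

def strip_between(text: str) -> str:
    """Delete whatever lies between A and B, keeping A and B."""
    return PATTERN.sub('', text)

print(strip_between('xxAjunkByy'))  # xxAByy
```

Because the lookbehind and lookahead match positions rather than characters, only the text between the markers is part of the match and gets replaced.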

Related

Regex for fixing YAML strings

I am trying to create a bunch of YAML files, mostly composed of strings of text. Now when using apostrophes in words, they must be escaped by typing a double apostrophe, because I’m using apostrophes to wrap the strings.
I want to create a regex that will check for apostrophes in the text that aren’t double. What I have is this:
^([^'\n]*?)'(([^'\n]*?)'(?!')([^'\n]+?))*?'$\n
https://regex101.com/r/v4nUTn/3
My issue is that as soon as my string has a double apostrophe, but also has an apostrophe which isn’t a double apostrophe, it doesn’t match because my negative lookahead doesn’t match as soon as it sees the double apostrophe. (for example the string t''e'st won’t match even though it is missing a double apostrophe after the e)
How can I make it so that my negative lookahead will not fail as soon as it sees one double apostrophe?

This regex should work:
\w'\w

My guess is that maybe an expression similar to
('[^'\r\n]*'|[^\r\n\w']+)|([\w']*)
would be an option to look into.
If the second capturing group returns true, then the string is undesired.
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.

One suggestion would be to do this in two steps.
For example, if every 'candidate' value looks like this: - 'something here' (where you want to test the apostrophes in the something here content of the string), then first isolate that content via:
/^\s*- '(.+)'$/im
And then make sure all apostrophes appear as you want them to appear within match group 1 of the result.
Then, replace the original match with your 'sanitised' match.
Doing this means you don't have to be concerned with the bounding apostrophes causing complications to the check for apostrophes in the value.
Note: there may well be a perfect one-step regex to do this, but understanding that you can break tasks into several steps is useful if you spend a lot of time with regular expressions, and can help you sidestep 'perfect regex paralysis'.
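The two-step idea could be sketched in Python roughly as follows. The isolating pattern is the one from this answer; the second step (treating any odd-length run of apostrophes as unescaped) and the function name bad_values are assumptions made for this illustration.

```python
import re

# Step 1: isolate the quoted value of lines shaped like:  - 'something here'
LINE = re.compile(r"^\s*- '(.+)'$", re.MULTILINE)
# Step 2 (assumption for this sketch): inside the value, every run of
# apostrophes must have even length, i.e. consist of doubled apostrophes.
RUN = re.compile(r"'+")

def bad_values(text: str) -> list[str]:
    """Return the values that contain an unpaired apostrophe."""
    bad = []
    for match in LINE.finditer(text):
        value = match.group(1)
        if any(len(run.group()) % 2 for run in RUN.finditer(value)):
            bad.append(value)
    return bad

sample = "- 'it''s fine'\n- 't''e'st'\n"
print(bad_values(sample))  # ["t''e'st"]
```

Since the bounding apostrophes are stripped in step 1, the check in step 2 only ever sees the content of the string.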

If you want your string to match when there is at least one lone single quote inside the single-quoted string, then the repeated part of your regex should be allowed to consume either text that contains no single quote or a pair of single quotes. To do that, add |'' as an alternative in your regex, so that it consumes either non-single-quote text or a doubled single quote.
Try this updated regex demo and see whether it works the way you wanted:
https://regex101.com/r/v4nUTn/4

Regex captures all occurrences but the last of certain characters

I want to exclude common punctuation from my URL Regex detector when my clients type a sentence with a URL in it. A common scenario would be the URL example.com?q=this (which obviously needs to include the ?) versus a sentence saying
What do you think of example.com?
This expression suits my needs just fine:
(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#/]\S*)?
However it includes all punctuation at the end, so I am iterating through each match to find and use this captured group to exclude said punctuation:
(.*?)[?,!.;:]+$
However, I'm not sure how to leverage the "end of string" technique when scanning the entire block of text which may have multiple URLs. Was hoping there'd be a way to capture the right blocks from the get-go without the extra work.

Just require non-whitespace after the punctuation instead of making it optional.
(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#\/]\S+)?
You will of course lose the valid trailing character of some URLs (for example, example.com/ becomes example.com), but as far as I know there is no difference between the two.
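To see the effect of requiring \S+ after the punctuation, here is a small Python check using the pattern from this answer unchanged. The sample sentence and the name URL are made up for the illustration.

```python
import re

# Pattern from the answer: the optional tail now needs at least one
# non-whitespace character after ?, # or /.
URL = re.compile(r'(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#\/]\S+)?')

text = ('What do you think of example.com? '
        'Try example.com?q=this or http://example.org/path too.')

print(URL.findall(text))
# ['example.com', 'example.com?q=this', 'http://example.org/path']
```

The question mark that ends the first sentence is followed by a space, so the optional tail does not match there and the URL stops at example.com.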

Regex for SublimeText Snippet

I've been stuck for a while on this Sublime Snippet now.
I would like to display the correct package name when creating a new class, using TM_FILEPATH and TM_FILENAME.
When printing TM_FILEPATH variable, I get something like this:
/Users/caubry/d/[...]/src/com/[...]/folder/MyClass.as
I would like to transform this output, so I could get something like:
com.[...].folder
This includes:
Removing anything before /com/[...]/folder/MyClass.as;
Removing the TM_FILENAME, with its extension; in this example MyClass.as;
And finally finding all the slashes and replacing them by dots.
So far, this is what I've got:
${1:${TM_FILEPATH/.+(?:src\/)(.+)\.\w+/\l$1/}}
and this displays:
com/[...]/folder/MyClass
I do understand how to replace slashes with dots, such as:
${1:${TM_FILEPATH/\//./g/}}
However, I'm having difficulty adding this logic to the previous one, as well as removing the TM_FILENAME at the end.
I'm really inexperienced with Regex, thanks in advance.
:]
EDIT: [...] indicates variable number of folders.

We can do this in a single replacement with some trickery. What we'll do is, we put a few different cases into our pattern and do a different replacement for each of them. The trick to accomplish this is that the replacement string must contain no literal characters, but consist entirely of "backreferences". In that case, those groups that didn't participate in the match (because they were part of a different case) will simply be written back as an empty string and not contribute to the replacement. Let's get started.
First, we want to remove everything up until the last src/ (to mimic the behaviour of your snippet - use an ungreedy quantifier if you want to remove everything until the first src/):
^.+/src/
We just want to drop this, so there's no need to capture anything - nor to write anything back.
Now we want to match subsequent folders until the last one. We'll capture the folder name, also match the trailing /, but write back the folder name and a "." character. But I said no literal text in the replacement string! So the "." has to come from a capture as well. This is where the assumption comes into play that your file always has an extension. We can grab the period from the file name with a lookahead. We'll also use that lookahead to make sure that there's at least one more folder ahead:
^.+/src/|\G([^/]+)/(?=[^/]+/.*([.]))
And we'll replace this with $1$2. Now if the first alternative catches, groups $1 and $2 will be empty, and the leading bit is still removed. If the second alternative catches, $1 will be the folder name, and $2 will have captured a period. Sweet. The \G is an anchor that ensures that all matches are adjacent to one another.
Finally, we'll match the last folder and everything that follows it, and only write back the folder name:
^.+/src/|\G([^/]+)/(?=[^/]+/.*([.]))|\G([^/]+)/[^/]+$
And now we'll replace this with $1$2$3 for the final solution. Demo.
A conceptually similar variant would be:
^.+/src/|\G([^/]+)/(?:(?=[^/]+/.*([.]))|[^/]+$)
replaced with $1$2. I've really only factored out the beginning of the second and third alternative. Demo.
Finally, if Sublime is using Boost's extended format string syntax, it is actually possible to get characters into the replacement conditionally (without magically conjuring them from the file extension):
^.+/src/|\G(/)?([^/]+)|\G/[^/]+$
Now we have the first alternative for everything up to src (which is to be removed), the third alternative for the last slash and file name (which is to be removed), and the middle alternative for all folders you want to keep. This time I put the slash to be replaced optionally at the beginning. With a conditional replacement we can write a . there if and only if that slash was matched:
(?1.:)$2
Unfortunately, I can't test this right now and I don't know an online tester that uses Boost's regex engine. But this should do the trick just fine.
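For checking what the snippet should produce, the intended transformation can be sketched in Python as three plain steps. This is not the single-pass pattern above: Python's built-in re module has no \G anchor, so the sketch uses separate substitutions instead. The function name package_name and the sample path are hypothetical.

```python
import re

def package_name(filepath: str) -> str:
    """Turn .../src/com/x/y/MyClass.as into com.x.y (three-step sketch)."""
    # 1. Drop everything up to and including the last "src/" (greedy .+).
    rest = re.sub(r'^.+/src/', '', filepath)
    # 2. Drop the final "/FileName.ext" component.
    rest = re.sub(r'/[^/]+$', '', rest)
    # 3. Turn the remaining slashes into dots.
    return rest.replace('/', '.')

# Hypothetical path, made up for this example.
print(package_name('/home/user/project/src/com/acme/util/MyClass.as'))
# com.acme.util
```

The output of this function is what any of the single-replacement variants should produce for the same input path.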

Notepad++ masschange using regular expressions

I am having trouble performing a mass change in a huge log file.
Apart from the file size, which causes problems for Notepad++, I have a problem using 10 or more parameters in the replacement; up to 9 it works fine.
I need to change numerical values in a file where these values are enclosed in quotation marks with a leading and a trailing comma: ."123,456,789,012.999",
I used this expression to find them and change the format to:
,123456789012.999, (so that there are no quotation marks and no commas within the numeric value)
The expression used to find is:
([,])(["])([0-9]+)([,])([0-9]+)([,])([0-9]+)([,])([0-9]+)([\.])([0-9]+)(["])([,])
and the exp to replace is:
\1\3\5\7\9\10\11\13
The problem is that parameters \11 and \13 are not working (the characters, e.g. .999 in the example, do not appear in the changed values).
So the question is: is there a limit on the number of parameters?
It seems to me that it does not work above 10. For shorter numeric values, where I only need up to 9 parameters, the search and replacement strings work fine. For the example above the search works but the replacement does not: the end of the changed value gets corrupted.
It also occurred to me that instead of using Notepad++ I could change the log file directly on the Unix server; however, I had trouble building the correct Perl syntax. Could anyone help with that?

After having a little play myself, it looks like back-references \11-\99 are invalid in Notepad++ (which is not that surprising, since this is commonly omitted from regex flavours). However, there are several things you can do to improve that regular expression, in order to make this work.
Firstly, you should consider using fewer groups, or alternatively non-capturing groups. Did you really need to store 13 variables in that regex, in order to do the replacement? Clearly not, since you're not even using half of them!
To put it simply, you could just remove some brackets from the regex:
[,]["]([0-9]+)[,]([0-9]+)[,]([0-9]+)[,]([0-9]+)[.]([0-9]+)["][,]
And replace with:
,\1\2\3\4.\5,
...But that's not all! Why are you using square brackets to say "match anything inside", if there's only one thing inside?? We can get rid of these, too:
,"([0-9]+),([0-9]+),([0-9]+),([0-9]+)\.([0-9]+)",
(Note I added a "\" before the ".", so that it matches a literal "." rather than "anything".)
Also, although this isn't a big deal, you can use "\d" instead of "[0-9]".
This makes your final, optimised regex:
,"(\d+),(\d+),(\d+),(\d+)\.(\d+)",
And replace with:
,\1\2\3\4.\5,
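The final pattern and replacement can be tried outside Notepad++ with a short Python sketch. Python's re.sub accepts the same \1 style backreferences; the sample line and the function name flatten are made up for the illustration.

```python
import re

# Final pattern from this answer: five capturing groups, all of them used.
PATTERN = re.compile(r',"(\d+),(\d+),(\d+),(\d+)\.(\d+)",')

def flatten(line: str) -> str:
    """Drop the quotes and the thousands separators inside the value."""
    return PATTERN.sub(r',\1\2\3\4.\5,', line)

print(flatten('a,"123,456,789,012.999",b'))  # a,123456789012.999,b
```

With only five groups, no two-digit backreference is needed, which sidesteps the limit discussed above.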

Not sure whether regex groups have a limit, but you could use lookarounds to save 2 groups, and you could also merge some groups in your example. But first, let's get rid of some useless character classes:
(\.)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+)(\.)([0-9]+)(")(,)
We could merge those groups:
(\.)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+)(\.)([0-9]+)(")(,)
                                        ^^^^^^^^^^^^^^^^^^^^
We get:
(\.)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+\.[0-9]+)(")(,)
Let's add lookarounds:
(?<=\.)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+\.[0-9]+)(")(?=,)
The replacement would be \2\4\6\8.

If you have a fixed length of digits at all times, it's fairly simple to do what you have done. Even though your expression is poorly written, it does the job. If this is the case, look at Tom Lord's answer.
I played around with it a little bit myself, and I would probably use two expressions, which makes it much easier. If you have to do it in one, this would work, but it is pretty unsafe:
(?:"|(\d+),)|(\.\d+)"(?=,) replace by \1\2
Live demo: http://regex101.com/r/zL3fY5

Regex to replace email address domains?

I need a regex to obfuscate emails in a database dump file I have. I'd like to replace all domains with a set domain like @fake.com so I don't risk sending out emails to real people during development. The emails do have to be unique to match database constraints, so I only want to replace the domain and keep the usernames.
I currently have this regex for finding emails:
\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
How do I convert this search regex into a regex I can use in a find and replace operation in either Sublime Text or SED or Vim?
EDIT:
Just a note, I just realized I could replace all strings found by @[A-Z0-9.-]+\.[A-Z]{2,4}\b in this case, but academically I am still interested in how you could treat each section of the email regex as a token and replace the username / domain independently.

SublimeText
SublimeText uses Boost syntax, which supports quite a large subset of features in Perl regex. But for this task, you don't need all those advanced constructs.
Below are 2 possible approaches:
If you can assume that @ doesn't appear in any other context (which is quite a fair assumption for normal text), then you can just search for the domain part @[A-Z0-9.-]+\.[A-Z]{2,4}\b and replace it.
Otherwise, you can use a capturing group (pattern) and a backreference in the replacement string.
Find what
\b([A-Z0-9._%-]+)@[A-Z0-9.-]+\.[A-Z]{2,4}\b
([A-Z0-9._%-]+) is the first (and only) capturing group in the regex.
Replace with
$1@fake.com
$1 refers to the text captured by the first capturing group.
Note that for both methods above, you need to turn off case-sensitivity (indicated as the 2nd button on the lower left corner), unless you specifically want to remove only emails written in ALL CAPS.
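The capturing-group approach can also be checked outside the editor. Below is a minimal Python sketch written with the usual @ separator; Python spells the backreference \1 rather than $1, and re.IGNORECASE plays the role of the editor's case-insensitivity toggle. The sample addresses and the name obfuscate are made up.

```python
import re

# Group 1 keeps the user name; the domain part is matched but not captured.
EMAIL = re.compile(r'\b([A-Z0-9._%-]+)@[A-Z0-9.-]+\.[A-Z]{2,4}\b',
                   re.IGNORECASE)

def obfuscate(text: str) -> str:
    """Replace every email domain with fake.com, keeping the user name."""
    return EMAIL.sub(r'\1@fake.com', text)

print(obfuscate('alice@example.org, Bob.Smith@mail.co.uk'))
# alice@fake.com, Bob.Smith@fake.com
```

The user names are left untouched, so addresses that were unique before the replacement stay unique as long as the user names themselves differ.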

You may use the following command for Vim:
:%s/\(\<[A-Za-z0-9._%-]\+@\)[A-Za-z0-9.-]\+\.[A-Za-z]\{2,4}\>/\1fake.com/g
Everything between \( and \) becomes a group that can be referenced in the replacement by its escaped group number (\1 in this case). I've also modified the regexp to match lowercase letters and to use Vim-compatible syntax.
You can also turn off case sensitivity by putting \c anywhere in your regexp, like this:
:%s/\c\(\<[A-Z0-9._%-]\+@\)[A-Z0-9.-]\+\.[A-Z]\{2,4}\>/\1fake.com/g
Please also note that % at the beginning of the command asks Vim to do the replacement in the whole file, and g at the end asks it to do multiple replacements on the same line.
One more approach is to use zero-width matching (\@<=):
:%s/\c\(\<[A-Z0-9._%-]\+@\)\@<=[A-Z0-9.-]\+\.[A-Z]\{2,4}\>/fake.com/g
