I need a RegExp that's special - regex

I need a custom RegExp. In a large text I want to remove every link (a href tag) that points to a certain URL. The catch is that those URLs are generated by a server and end with an extra segment made of upper-case letters, lower-case letters and digits.
So I would like Notepad++ to search for, and replace with nothing, every string that consists of an a href + http://www.gymglish.com/workbook/show-lesson/ + an extra string like xwSzAdM45jL6 + </a>.
With http://www.gymglish.com/workbook/show-lesson/[a-zA-Z0-9/] Notepad++ finds the string but performs the replacement only up to the first character of the extra segment (e.g. xDghdS5jkA becomes DghdS5jkA).
My simple reasoning was: if the replacement stops after the first character, I must repeat the character class for the next 14 characters, thus
http://www.gymglish.com/workbook/show-lesson/[a-zA-Z0-9\/][a-zA-Z0-9\/][a-zA-Z0-9\/][a-zA-Z0-9\/][a-zA-Z0-9\/][a-zA-Z0-9\/][a-zA-Z0-9\/][a-zA-Z0-9\/][a-zA-Z0-9\/][a-zA-Z0-9\/]/[a-zA-Z0-9\/][a-zA-Z0-9\/][a-zA-Z0-9\/][a-zA-Z0-9\/]>*</[a|A]> :-) however that's a dumb regexp

This should do the trick: (edited to use the new URL)
<[aA] (href|HREF)=['"]http:\/\/www\.gymglish\.com\/workbook\/show-lesson[\/a-zA-Z0-9]*['"]>[a-zA-Z0-9 ]*<\/[aA]>
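The difference between the one-character class in the question and a quantified class can be checked outside Notepad++. The following is a minimal JavaScript sketch, not Notepad++ syntax; the sample text and the helper name stripLessonLinks are made up for illustration, and the link text is assumed to contain no nested tags.

```javascript
// A character class without a quantifier consumes exactly one character,
// which is why only the first character of the extra segment disappeared.
const oneChar = /http:\/\/www\.gymglish\.com\/workbook\/show-lesson\/[a-zA-Z0-9\/]/;
const leftover = "http://www.gymglish.com/workbook/show-lesson/xDghdS5jkA".replace(oneChar, "");
console.log(leftover); // DghdS5jkA

// With * the class covers the whole segment, whatever its length.
// The i flag handles <A HREF=...> as well as <a href=...>.
const lessonLink = /<a\s+href=["']http:\/\/www\.gymglish\.com\/workbook\/show-lesson\/[a-zA-Z0-9\/]*["']>[^<]*<\/a>/gi;

// Hypothetical helper: removes every matching link, including its text.
function stripLessonLinks(html) {
  return html.replace(lessonLink, "");
}

const sampleHtml = 'See <a href="http://www.gymglish.com/workbook/show-lesson/xwSzAdM45jL6">lesson</a> here';
console.log(stripLessonLinks(sampleHtml)); // "See  here"
```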

Related

Regex to remove hashtags but to keep first hashtag

I want to remove all hashtags from a text but it should keep the first hashtag
Example text:
This is an example #DoNotRemoveThis #removethis #removethis #removethis
Expected result:
This is an example #DoNotRemoveThis
I'm using this
\#\w+\s?
but it removes all the hashtags. I want to keep the first hashtag.
This may require further knowledge as to what flavour of regex you are using. For example, if .NET (C#) you can do variable length look-behind, and thus the following pattern will do what you need:
(?<=#.*)(#\w+)
However, this won't work in most other engines.
It sounds to me like you want to match everything up to but not including the second hash symbol, right?
/^[^#]*#[^#]*/
That will match any number of non-hash characters, then a hash character, then more non-hash characters.
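For engines without variable-length look-behind, one alternative is to keep the asker's original pattern and decide per match whether to keep it. This is a minimal JavaScript sketch of that idea, not taken from either answer; the function name keepFirstHashtag is an assumption.

```javascript
// Hypothetical helper: removes every hashtag except the first one.
// It reuses the asker's pattern #\w+\s? and counts matches in the callback.
function keepFirstHashtag(text) {
  let seen = 0;
  return text
    .replace(/#\w+\s?/g, (tag) => (seen++ === 0 ? tag : ""))
    .trimEnd(); // drop the space left behind the kept hashtag
}

const input = "This is an example #DoNotRemoveThis #removethis #removethis #removethis";
console.log(keepFirstHashtag(input)); // This is an example #DoNotRemoveThis
```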

Regex match everything excluding hashtags that are separated by white space

In a string, I want to match everything that is not a hashtag so I can replace the matches with nothing. The match should include any single # symbols as well as any anchor links that you may find in a URL, for example http:\www.abc.com#anchor. These conditions are what I am struggling with.
So far, I have only been able to match everything that is not a hashtag, but I cannot include these possible anchor links. I am using this regex:
\s*(?<![\#\w\ß\Ä\Ü\Ö\ä\ö\ü\"\!\?\.\,\;\:\^\°\$\&\+\-\*\/\#\§\%\{\}\[\]\(\)])[\d\w\s.$&+,:;=?#|()<>\".°^*%!~§\ß\Ä\Ü\Ö\ä\ö\ü\-\/\/\[\]\}\\]+
And you can find an example here: https://regex101.com/r/o6EQ1B/4
There are two main parts to the regex:
(?<![\#\w\ß\Ä\Ü\Ö\ä\ö\ü\"\!\?\.\,\;\:\^\°\$\&\+\-\*\/\#\§\%\{\}\[\]\(\)])
This part is used to ignore all kinds of characters that may be used to form a hashtag. I need something like this because there may be some hashtags like this that are valid in my use case: #sprint(IPs:1)
[\d\w\s.$&+,:;=?#|()<>\".°^*%!~§\ß\Ä\Ü\Ö\ä\ö\ü\-\/\/\[\]\}\\]
This part is being used to match any possible character that is not part of a hashtag.
I am not very experienced with regex, so I am not even sure if what I am trying to achieve with a single regex is even possible. I also apologize in advance if the regex I wrote is too complicated, but at least I got it somewhat working for me.
UPDATE:
I think I found a way to achieve what I was after: https://regex101.com/r/3CWVUy/2
I will use this regex to strip out everything matched so I am left with the hashtags I am interested in. The key was in this part: (?<=\w\#).*.
The fragment (the hashtag part) of a URL is its last part, and it is optional. Apparently you want to get rid of the rest of the URL and keep only the fragment, including the # character, and for that you want to match everything before the hashtag. I think the opposite makes more sense: extract the piece you are interested in and ignore the rest, so you just have to match from the # character (if found) to the end.
If you specifically want to use a regex for that you can do:
'http:\\www.abc.com#anchor'.match(/#.*/)
Alternatively, you can extract the substring from the # character to the end. I don't know what language you're using but this is an example in Javascript:
const url = 'http:\\www.abc.com#anchor';
let fragment = '';
const index = url.indexOf('#');
if (index >= 0) {
  fragment = url.substring(index);
}
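If the goal is the one stated in the question (keep whitespace-separated hashtags, drop everything else, including lone # symbols and URL anchors), extracting the hashtags directly may be simpler than matching everything around them. A minimal sketch, assuming a hashtag is any whitespace-delimited token that starts with # and has at least one more character; the function name extractHashtags and the sample sentence are made up.

```javascript
// Hypothetical helper: returns the whitespace-separated tokens that start with '#'.
// A lone "#" is skipped, and "http://www.abc.com#anchor" is skipped because
// the token does not start with '#'.
function extractHashtags(text) {
  return text
    .split(/\s+/)
    .filter((token) => token.length > 1 && token.startsWith("#"));
}

const sampleText = "See http://www.abc.com#anchor # alone #sprint(IPs:1) and #done";
console.log(extractHashtags(sampleText)); // [ '#sprint(IPs:1)', '#done' ]
```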

How to apply correct regex?

I have a special task which requires lots of regex and javascript parsing.
My head is almost exploding, so maybe I'm just tired and have overlooked some small thing. I'm not a newbie to regex, so perhaps someone can point me in the right direction and show me where I made a mistake.
So I have this regex code:
((?<=\ffmpg=).+(?=////u0026cs=nt))
to get the value of the substring between two strings. The first string is:
ffmpg= and the match should start right after it and end just before the other string, //u0026cs=nt, starts.
The problem is that it only works as long as the html page contains a single parameter with that name; the source html contains dozens of ffmpg parameters, each followed by the same end string cs=nt.
I cannot even make the regex count characters, because the number of characters differs every time you visit the html page, sometimes by +3, sometimes by +10. So the only way is to get this string from the start of param1 to the end of param2.
This is the string I need to get: 1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012
This is the source html example:
\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\\u0026doc=IcuU5Oy8\u0026pen=V9PXaHoOp1gKD25rgAg\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\u0026cs=nt\u0026token=gHgig8eLY3qsQ0bXa\
I have pasted the same fragment three times just for this purpose, because the html source is very big and I doubt I can upload it here.
Thanks for your help.
In your question, you use (?<=\ffmpg=), where \f matches a form feed character, which is not present in the example data. If you meant to use \\f, it will match \f, which is also not present in the example data.
You could get the match using a capturing group instead of lookarounds, as lookbehinds are not supported by all browsers.
If you just want to get a single match, you can omit the /g global flag.
If you use .+ you will match too much, as .+ first matches until the end of the string and then backtracks until it can match \\u0026cs=nt, so the match only ends at the last occurrence of that string.
What you could do instead is be specific about what you allow to match, which for the current string is the character class [AC0-9%]+
You could broaden the character class with a range to match chars A-Z instead of AC for example and add more chars or ranges as required.
ffmpg=([AC0-9%]+)\\\\u0026cs=nt
For example
const regex = /ffmpg=([AC0-9%]+)\\\\u0026cs=nt/;
const str = `\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\\\\\u0026doc=IcuU5Oy8\\\\u0026pen=V9PXaHoOp1gKD25rgAg\\\\u0026ffmpg=1714248%2C23851735%2C23804281%2C23839597%2C23357901%2C3313341%2C3316343%2C23848795%2C3300132%2C26853996%2C3300114%2C3315790%2C23857451%2C23856472%2C23851936%2C3300161%2C3314786%2C23856652%2C23859863%2C23837993%2C23833479%2C23861502%2C23842630%2C23842986%2C23861012\\\\u0026cs=nt\\\\u0026token=gHgig8eLY3qsQ0bXa\\\\`;
console.log(str.match(regex)[1]);
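The over-matching described above is easier to see on a short string. This sketch uses a shortened, made-up sample with two repeated parameter blocks and single backslashes as in the pasted html; String.raw keeps the backslashes literal. The broader range A-Z is the variation suggested in the answer, not a requirement.

```javascript
// Shortened, hypothetical sample: two parameter blocks separated by "\\".
const sample = String.raw`\u0026ffmpg=17%2C23\u0026cs=nt\u0026token=a\\u0026ffmpg=17%2C23\u0026cs=nt\u0026token=b`;

// Greedy .+ runs to the end, then backtracks to the LAST "\u0026cs=nt".
const greedy = sample.match(/ffmpg=(.+)\\u0026cs=nt/)[1];

// A specific character class cannot cross the backslash, so it stops
// at the end of the first value.
const specific = sample.match(/ffmpg=([A-Z0-9%]+)\\u0026cs=nt/)[1];

console.log(greedy.includes("token=a")); // true: spans into the second block
console.log(specific); // 17%2C23
```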
Try this:
(?<=ffmpg=)([A-F0-9%]+)
Explanation
Since your string only consists of URL-encoded characters, you can use the [A-F0-9%]+ character class to capture it. It will stop before the next parameter starts because there is a backslash, which the class does not match.

Regex for the value of an HTML Property

I have a load of links that look like this:
Taboola - Content you may like
I want to delete the entire ICON and ADD_DATE attributes and their values.
I'm using sublime with a regex find/replace but I'm not sure how to write the regex to grab everything in between ICON=" AND "
Any help would be appreciated!
This should work (escaping quotes as necessary):
ICON="[^"]*"
The reason ICON="(.*)" won't work is that regex quantifiers are greedy by default: if the pattern can match more of the string and still succeed, it will, so .* can run past the closing quote of ICON up to a later quote.
You can either specify a non-greedy search, such as ICON=".*?", or explicitly match only characters that are not quotes, as in the pattern above.
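The three variants can be compared on a single line. The original link markup is not shown in the question, so the bookmark line below is a hypothetical reconstruction, and the sketch is JavaScript rather than Sublime's find/replace; the patterns themselves are the ones discussed above.

```javascript
// Hypothetical bookmark line; the real markup was not included in the question.
const link = '<A HREF="http://example.com/" ICON="data:image/png;base64,AAAA" ADD_DATE="1400000000">Taboola - Content you may like</A>';

// Greedy: .* runs to the last quote on the line and swallows ADD_DATE too.
const greedyMatch = link.match(/ICON=".*"/)[0];

// Lazy and negated-class versions both stop at ICON's own closing quote.
const lazyMatch = link.match(/ICON=".*?"/)[0];
const classMatch = link.match(/ICON="[^"]*"/)[0];

// Removing both attributes together with the whitespace in front of them.
const cleaned = link.replace(/\s+(?:ICON|ADD_DATE)="[^"]*"/g, "");

console.log(greedyMatch); // ICON="data:image/png;base64,AAAA" ADD_DATE="1400000000"
console.log(lazyMatch === classMatch); // true
console.log(cleaned); // <A HREF="http://example.com/">Taboola - Content you may like</A>
```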

How to capture text between two markers?

For clarity, I have created this:
http://rubular.com/r/ejYgKSufD4
My strings:
http://blablalba.com/foo/bar_soap/foo/dir2
http://blablalba.com/foo/bar_soap/dir
http://blablalba.com/foo/bar_soap
My Regular expression:
\/foo\/(.*)
This returns:
/foo/bar_soap/foo/dir2
/foo/bar_soap/dir
/foo/bar_soap
But I only want
/foo/bar_soap
Any ideas how I can achieve this? As illustrated above, I want everything after foo up until the first forward slash.
Thanks in advance.
Edit: I only want the text after foo until the next forward slash. Some directories may also be named foo, and this would produce incorrect results. Thanks
. will match anything, so you should change it to [^/] (not slash) instead:
\/foo\/([^\/]*)
Some of the other answers use + instead of *. That might be correct depending on what you want to do. Using + forces the regex to match at least one non-slash character, so this URL would not match since there isn't a trailing character after the slash:
http://blablalba.com/foo/
Using * instead would allow that to match since it matches "zero or more" non-slash characters. So, whether you should use + or * depends on what matches you want to allow.
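The difference between the two quantifiers shows up on the trailing-slash URL. A small JavaScript sketch of both patterns (the variable names are arbitrary):

```javascript
const withStar = /\/foo\/([^\/]*)/;
const withPlus = /\/foo\/([^\/]+)/;

// Nothing follows the slash: * matches with an empty group, + does not match.
const starResult = withStar.exec("http://blablalba.com/foo/");
const plusResult = withPlus.exec("http://blablalba.com/foo/");
console.log(starResult[1] === ""); // true
console.log(plusResult); // null

// On the first URL from the question both capture the same segment.
const deep = "http://blablalba.com/foo/bar_soap/foo/dir2";
console.log(withStar.exec(deep)[1], withPlus.exec(deep)[1]); // bar_soap bar_soap
```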
Update
If you want to filter out query strings too, you could also filter against ?, which must come at the front of all query strings. (I think the examples you posted below are actually missing the leading ?):
\/foo\/([^?\/]*)
However, rather than rolling your own solution, it might be better to just use split from the URI module. You could use URI::split to get the path part of the URL, then use String#split to split it up by /, and grab the first one. This would handle all the weird cases for URLs. One that you probably haven't thought of yet is a URL with a specified fragment, e.g.:
http://blablalba.com/foo#bar
You would need to add # to your filtered-character class to handle those as well.
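The same parse-then-split idea can be sketched in JavaScript with the standard URL class in place of Ruby's URI module; this is a swapped-in equivalent, not the code the answer refers to. The function name segmentAfterFoo and the choice to return null when nothing follows foo are assumptions.

```javascript
// Hypothetical helper: returns the path segment that follows the first "foo"
// segment, or null if there is none. Query strings and fragments are not part
// of pathname, so they need no special handling.
function segmentAfterFoo(urlString) {
  const segments = new URL(urlString).pathname.split("/").filter(Boolean);
  const i = segments.indexOf("foo");
  return i >= 0 && i + 1 < segments.length ? segments[i + 1] : null;
}

console.log(segmentAfterFoo("http://blablalba.com/foo/bar_soap/foo/dir2")); // bar_soap
console.log(segmentAfterFoo("http://blablalba.com/foo/bar_soap?x=1")); // bar_soap
console.log(segmentAfterFoo("http://blablalba.com/foo#bar")); // null
```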
You can try this regular expression
/\/foo\/([^\/]+)/
\/foo\/([^\/]+)
[^\/]+ gives you a series of characters that are not a forward slash.
The parentheses cause the regex engine to store the matched contents in a group, ([^\/]+), so you can get bar_soap out of the entire match /foo/bar_soap
For example, in javascript you would get the matched group as follows:
const regexp = /\/foo\/([^\/]+)/;
const match = regexp.exec("/foo/bar_soap/dir");
console.log(match[1]); // prints bar_soap