I have this text:
<Path Fill="None"
PathData="M244.87,363.97 L245.38,363.91 M245.38,363.91 L245.46,363.84 M245.46,363.84 L245.52,363.75 M245.52,363.75 L245.54,363.7 M245.54,363.7 L246.07,370.18 M246.07,370.18 L245.95,370.25 M245.95,370.25 L245.8,370.37 M245.8,370.37 L245.63,370.54 M245.63,370.54 L245.52,370.73 M245.52,370.73 L245.42,370.9 M245.42,370.9 L245.17,368.03 M245.17,368.03 L244.87,363.97"
Stroke="#898989" StrokeWidth="0.5"/>
<Path Fill="None"
PathData="M247.4,371.21 L247.49,371.16 M247.49,371.16 L247.91,371.13 M247.91,371.13 L249.74,371.01 M249.74,371.01 L252.52,370.82 M252.52,370.82 L252.72,370.83 M252.72,370.83 L252.72,370.84 M252.72,370.84 L252.71,370.89 M252.71,370.89 L252.72,370.95 M252.72,370.95 L252.75,371.38 M252.75,371.38 L251.86,371.44 M251.86,371.44 L249.62,371.63 M249.62,371.63 L247.55,371.79 M247.55,371.79 L247.51,371.35 M247.51,371.35 L247.47,371.28 M247.47,371.28 L247.42,371.22 M247.42,371.22 L247.4,371.21"
Stroke="#878787" StrokeWidth="0.5"/>
<Path Fill="None"
PathData="M246.46,372.67 L246.47,372.05 M246.47,372.05 L246.47,372.05 M246.47,372.05 L246.52,372.07 M246.52,372.07 L246.58,372.09 M246.58,372.09 L247.44,372.02 M247.44,372.02 L248.68,371.91 M248.68,371.91 L248.81,373 M248.81,373 L248.07,373.06 M248.07,373.06 L247.88,373.07 M247.88,373.07 L248.54,379.11 M248.54,379.11 L247.62,379.18 M247.62,379.18 L247.2,379.21 M247.2,379.21 L247.15,379.24 M247.15,379.24 L247.12,379.27 M247.12,379.27 L247.06,379.17 M247.06,379.17 L246.83,376.84 M246.83,376.84 L246.46,372.67"
Stroke="#898989" StrokeWidth="0.5"/>
And I am trying to find and delete the paths which are not of a certain color, i.e. - #898989. I would like to use regex to find the non-matching strings.
I am trying the following:
.*(<Path Fill).*(\r\n|\r|\n).*(\r\n|\r|\n).*(?!#898989).*(\r\n|\r|\n)
But this returns the same as the one I would use to find the matching strings:
.*(<Path Fill).*(\r\n|\r|\n).*(\r\n|\r|\n).*(#898989).*(\r\n|\r|\n)
I thought the ?! was a negative lookahead, and would exclude those strings. It seems to not change the results, though.
Any help?
There are many regex solutions to your problem. Let's first discuss why the regex you proposed does not work as expected.
Problem
.*(<Path Fill).*(\r\n|\r|\n).*(\r\n|\r|\n).*(?!#898989).*(\r\n|\r|\n)
The problem occurs at the part
.*(?!#898989).*(\r\n|\r|\n)
The regex simply says: match as much of anything as you can, then check that #898989 does not start at the current position, then match as much as you can again.
The "match as much of anything as you can" part is what causes the problem. The first .* actually consumes the whole line:
Stroke="#898989" StrokeWidth="0.5"/>
Then (?!#898989) comes into play, and it succeeds because nothing that follows the closing > can start #898989. To make it obvious, change the regex to:
.*(?:<Path Fill).*[\r\n].*[\r\n](.*)(?!#898989).*
This regex does the same thing. In this regex, (\r\n|\n|\r) is replaced with [\r\n]. Nothing is being captured by the starting brackets (?:<Path Fill). However, this time the .* before #898989 is surrounded by (...) to highlight the text being captured by it.
Observe the yellow lines to see what is being captured by the .* before the #898989. Here is the link: https://regex101.com/r/2R54uW/1
Correction
As already mentioned in the comments, the regex can be corrected by forcing the .* to stop at Stroke=" and then making the position check.
.*(?:<Path Fill).*[\r\n].*[\r\n].*Stroke=\"(?!#898989).*[\r\n]
Here is another regex that does the same thing -
.*(?:<Path Fill).*[\r\n].*[\r\n]((?!#898989).)*/>
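To see the difference outside regex101, here is a minimal Python sketch comparing the original pattern with the corrected one. It assumes \n line endings and uses shortened PathData values as stand-ins for the real ones; the variable names are made up for illustration.

```python
import re

# Shortened stand-ins for the three <Path> elements in the question.
svg = (
    '<Path Fill="None"\n'
    'PathData="M244.87,363.97 L245.38,363.91"\n'
    'Stroke="#898989" StrokeWidth="0.5"/>\n'
    '<Path Fill="None"\n'
    'PathData="M247.4,371.21 L247.49,371.16"\n'
    'Stroke="#878787" StrokeWidth="0.5"/>\n'
    '<Path Fill="None"\n'
    'PathData="M246.46,372.67 L246.47,372.05"\n'
    'Stroke="#898989" StrokeWidth="0.5"/>\n'
)

# Lookahead is tested only after the greedy .* has eaten the whole line.
broken = re.compile(r'.*(?:<Path Fill).*[\r\n].*[\r\n].*(?!#898989).*[\r\n]')
# Lookahead is tested right after Stroke=", where the color actually starts.
fixed = re.compile(r'.*(?:<Path Fill).*[\r\n].*[\r\n].*Stroke="(?!#898989).*[\r\n]')

broken_hits = broken.findall(svg)
fixed_hits = fixed.findall(svg)

print(len(broken_hits))  # every path matches, whatever its color
print(len(fixed_hits))   # only the path whose stroke is not #898989
```

Anchoring the lookahead to a literal such as Stroke=" is what keeps the greedy .* from skipping past the position that matters.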
Final Thoughts
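Try using [\r\n] in place of (\r\n|\r|\n), since a character class is faster than alternation. Note that [\r\n] matches exactly one character, so depending on the engine a \r\n pair may need \r?\n or [\r\n]+ instead.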
If you have any additional doubts please comment.
If there cannot be a < or > character inside the Path element, you could assert that the color does not occur before the closing />:
<Path Fill(?![^<>]*#898989[^<>]*/>)[^<>]*/>
If it should be the Stroke specifically
<Path Fill(?![^<>]*Stroke="#898989"[^<>]*/>)[^<>]*/>
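As a rough illustration of how the Stroke-specific pattern could be used to actually delete the unwanted paths, here is a small Python sketch. The sample data is shortened, and the optional trailing \n in the pattern is an addition made here so that no blank line is left behind.

```python
import re

# Shortened stand-ins for the <Path> elements in the question.
svg = (
    '<Path Fill="None"\n'
    'PathData="M244.87,363.97 L245.38,363.91"\n'
    'Stroke="#898989" StrokeWidth="0.5"/>\n'
    '<Path Fill="None"\n'
    'PathData="M247.4,371.21 L247.49,371.16"\n'
    'Stroke="#878787" StrokeWidth="0.5"/>\n'
    '<Path Fill="None"\n'
    'PathData="M246.46,372.67 L246.47,372.05"\n'
    'Stroke="#898989" StrokeWidth="0.5"/>\n'
)

# [^<>]* can cross line breaks but never leaves the current element.
not_grey = re.compile(r'<Path Fill(?![^<>]*Stroke="#898989"[^<>]*/>)[^<>]*/>\n?')

cleaned = not_grey.sub('', svg)
print(cleaned)
```

Because the negated class [^<>] also matches newlines, this version does not depend on the element being laid out on exactly three lines.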
I'm working with Emergency Services data in the NEMSIS XSD. I have a field, which is constrained to only 50 characters. I've searched this site extensively, and tried many solutions - Notepad++ rejects all of them, saying not found.
Here's an XML Sample:
<E09>
<E09_01>-5</E09_01>
<E09_02>-5</E09_02>
<E09_03>-5</E09_03>
<E09_04>-5</E09_04>
<E09_05>this one is too long Non-Emergency - PT IS BEING DISCHARGED FROM H AFTER BEING ADMITTED FOR FAILURE TO THRIVE AND ALCOHOL WITHDRAWAL</E09_05>
</E09>
<E09>
<E09_01>-5</E09_01>
<E09_02>-5</E09_02>
<E09_03>-5</E09_03>
<E09_04>-5</E09_04>
<E09_05>this one is is okay</E09_05>
</E09>
I've tried solutions naming the E09_05 tag in different ways, using <\/E09_05> for the closing tag as I've seen in some examples, and as just </E09_05> as I've seen in others. I've tried ^.{50,}$ between them, or [a-zA-Z]{50,}$ between them, I've tried wrapping those in-between expressions in () and without. I even tried just [\s\S]*? in between the tags. The only thing that Notepad++ finds is when I use ^.{50,}$ by itself with no XML tags ... but then I wind up hitting on all the E13_01 tags (which are EMS narratives, and always > 50 characters) -- making for painstaking and wrist-aching clicks.
I wanted to XSLT this, but there is too much individual, hands-on tweaking of each E09_05 field for automating it. Perl is not an option in this environment (and not a tool I know at all anyway).
To be truly sublime, both E09_05 and E09_08 fields with string lengths >50 need to be what is selected on the search ... but no other elements of any kind or length.
Thanks in advance. I'm sure I'm just missing some subtle \, or () or [] somewhere ... hopefully ...
The following regex will find the text content of <E09_05> elements with more than 50 characters.
(?<=<E09_05>).{51,}?(?=</E09_05>)
Explanation
(?<=<E09_05>)    Start matching right after <E09_05>
.{51,}?          Match 51 or more characters (in a single line)
                 The ? makes it reluctant, so it'll stop at the first </E09_05>
(?=</E09_05>)    Stop matching right before </E09_05>
For truly sublime matching, i.e. both E09_05 and E09_08 fields with string lengths >50, use:
(?<=<(E09_0[58])>).{51,}?(?=</\1>)
Explanation
<(E09_0[58])>    Match <E09_05> or <E09_08>, and capture the name as group 1
</\1>            Use the \1 backreference to match the same name inside </name>
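If a scripting language ever becomes an option, the same idea can be sketched in Python. This is a variation rather than the Notepad++ pattern itself: it consumes the tags instead of using lookarounds, and the sample XML is cut down from the question.

```python
import re

xml = (
    "<E09>\n"
    "<E09_04>-5</E09_04>\n"
    "<E09_05>this one is too long Non-Emergency - PT IS BEING DISCHARGED "
    "FROM H AFTER BEING ADMITTED FOR FAILURE TO THRIVE</E09_05>\n"
    "</E09>\n"
    "<E09>\n"
    "<E09_04>-5</E09_04>\n"
    "<E09_05>this one is is okay</E09_05>\n"
    "</E09>\n"
)

# Group 1 is the tag name, group 2 the over-long text (51+ characters).
too_long = re.compile(r'<(E09_0[58])>(.{51,}?)</\1>')

hits = [(m.group(1), len(m.group(2))) for m in too_long.finditer(xml)]
for tag, length in hits:
    print(tag, length)
```

Only E09_05 and E09_08 can match, so long narrative fields such as E13_01 are left alone.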
If you want to shorten the text with ellipsis at the end, e.g. Hello World with max length 8 becomes Hello..., use:
Find what: (?<=<(E09_0[58])>)(.{47}).{4,}(?=</\1>)
Replace with: \2...
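The ellipsis replacement can be sketched the same way in Python. The function name and the limit parameter are hypothetical; the opening tag is captured instead of looked behind, so the group numbers differ from the Notepad++ version.

```python
import re

def truncate_fields(xml, limit=50):
    """Shorten E09_05 / E09_08 text longer than `limit` to `limit` chars ending in '...'."""
    keep = limit - 3  # room for the three dots
    # Group 1: opening tag, group 2: tag name, group 3: the text to keep.
    pattern = re.compile(r'(<(E09_0[58])>)(.{%d}).{4,}(?=</\2>)' % keep)
    return pattern.sub(r'\1\3...', xml)

long_field = '<E09_05>' + 'x' * 60 + '</E09_05>'
print(truncate_fields(long_field))
```

Fields that are already at or under the limit do not match, because the pattern needs at least keep + 4 characters between the tags.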
I'm currently working on a big svg sprite.
The different images are always 2000px apart.
What I have is:
<g transform="translate(0,0)">
<g transform="translate(0,2000)">
<g transform="translate(0,4000)">
After the regex replace I want this, i.e. just 2000 added to the second number:
<g transform="translate(0,2000)">
<g transform="translate(0,4000)">
<g transform="translate(0,6000)">
My issue now is that some new images have to be put at the top of the document, which means I would need to change all the numbers, and there are quite a lot of them.
I was thinking about using regular expressions and even found out that they work in the search bar of VS Code. The thing is, I have never worked with regex and I'm kind of confused.
Could someone give me a solution and an explanation for incrementing all the sample numbers by 2000?
I hope I understand it afterwards so I can get a foothold in the topic.
I'm also happy with just links to tutorials, either general or for my specific use case.
Thank you very much :)
In VSCode, you can't replace with an incremented value inside a match/capture. You can only do that inside a callback function passed as the replacement argument to a regex replace function/method.
You may use Notepad++ to perform these replacements after installing Python Script plugin. Follow these instructions and then use the following Python code:
def increment_after_openparen(match):
    # Group 1 is the prefix up to the comma; Group 2 is the number to increase.
    return "{0}{1}".format(match.group(1),str(int(match.group(2))+2000))

editor.rereplace(r'(transform="translate\(\d+,\s*)(\d+)', increment_after_openparen)
See the regex demo.
Note:
(transform="translate\(\d+,\s*)(\d+) matches and captures into Group 1 transform="translate( followed by 1 or more digits, then , and 0 or more whitespace characters (with (transform="translate\(\d+,\s*)), and then captures into Group 2 one or more digits (with (\d+)).
match.group(1) is the Group 1 contents, match.group(2) is the Group 2 contents.
Basically, any group is formed with a pair of unescaped parentheses and the group count starts with 1. So, if you use a pattern like (Item:\s*)(\d+)([.;]), you will need to use return "{0}{1}{2}".format(match.group(1),str(int(match.group(2))+2000), match.group(3)). Or, return "{}{}{}".format(match.group(1),str(int(match.group(2))+2000), match.group(3)).
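The callback idea is not specific to Notepad++. Here is a minimal sketch with Python's standard re.sub, which also accepts a function as the replacement; the function name add_2000 and the sample string are made up for illustration.

```python
import re

def add_2000(match):
    # Group 1: everything up to the second number; Group 2: the number itself.
    return match.group(1) + str(int(match.group(2)) + 2000)

svg = (
    '<g transform="translate(0,0)">\n'
    '<g transform="translate(0,2000)">\n'
    '<g transform="translate(0,4000)">'
)

shifted = re.sub(r'(transform="translate\(\d+,\s*)(\d+)', add_2000, svg)
print(shifted)
```

The function is called once per match, which is what makes arithmetic on the captured number possible.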
You can use the extension Regex Text Generator:
Select the numbers with Multi Cursor; this can be done with a Regex Find and Alt+Enter in the find box
Run command: Generate text based on regular expression
As Match Expression use: (\d+)
As generator expression use: {{=N[1]+2000}}
You get a preview of the result.
Press Enter if OK, or Esc to abort
You can save this type of search and replace as a predefined entry in the setting regexTextGen.predefined
"regexTextGen.predefined": {
  "Add/Subtract a number" : {
    "originalTextRegex": "(\\d+)",
    "generatorRegex": "{{=N[1]+1}}"
  }
}
You can edit the expressions (change the 1) if you choose a predefined.
SublimeText3 with the Text-Pastry add-in can also do \i
I wrote an extension, Find and Transform, to make these math operations on find and replaces with regex's quite simple (and much more like path variables, conditionals, string operations, etc.). In this case, this keybinding (in your keybindings.json) will do what you want:
{
  "key": "alt+r",                    // whatever keybinding you want
  "command": "findInCurrentFile",
  "args": {
    "find": "(?<=translate\\(\\d+,\\s*)(\\d+)",   // double-escaped
    "replace": "$${ return $1 + 2000 }$$",
    "isRegex": true,
    // "restrictFind": "document",   // or line/once/selections/etc.
  }
}
That could also be a setting in your settings.json if you wanted that - see the README.
(?<=translate\\(\\d+,\\s*) a positive lookbehind, you can use non-fixed length items in the lookbehind, like \\d+.
(\\d+) capture group 1
The replace: $${ return $1 + 2000 }$$
$${ <your string or math operation here> }$$
return $1 + 2000 add 2000 to capture group 1
I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This will trim the trailing ) and . characters from your output.
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
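For a version that runs on its own, here is a small sketch with two of the output lines from the question pasted in as a stand-in for your_output. The escapes inside the character class are dropped because . and ) are literal there.

```python
import re

# Two of the matches from the question, one per line.
raw = (
    "http://rda.ucar.edu/datasets/ds117.0/.\n"
    "http://www.discover-earth.org/index.html)."
)

# (?m) makes $ match at the end of every line, not just the end of the text.
trimmed = re.sub(r'(?m)[.)]+$', '', raw)
print(trimmed)
```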
And this regex seems workable to extract URL from your original sample text.
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
(edited from https?:[\S]*\/)
A Python script might look something like this:
import re

ss=""" this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx= re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
    print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring .[-;:&=+\$,\w]+-class (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
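Since the full pattern starts with ^, it expects each candidate URL on its own. A minimal Python sketch of that usage, assuming the raw matches from the question have already been collected into a list (the list and variable names are illustrative):

```python
import re

url_pattern = re.compile(
    r'^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+'
    r'(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).'
)

# Raw matches as printed in the question, trailing punctuation included.
raw_urls = [
    "http://rda.ucar.edu/datasets/ds117.0/.",
    "http://www.discover-earth.org/index.html).",
    "http://community.eosdis.nasa.gov/measures/).",
]

cleaned = [url_pattern.match(u).group(0) for u in raw_urls]
for url in cleaned:
    print(url)
```

The final (?!\.|\)). forces the engine to back off until the last consumed character is neither a dot nor a closing parenthesis.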
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
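A quick way to check both observations is a short Python sketch: first that $-_ really is a range that swallows /, ), . and :, and then that the simplified pattern still takes the trailing ). with it. The variable names are illustrative.

```python
import re

range_class = re.compile(r'[$-_]')    # '$' through '_' as an ASCII range
literal_class = re.compile(r'[$_-]')  # '$', '_' and a literal '-'

probe = '/).:'
range_matches = [c for c in probe if range_class.fullmatch(c)]
literal_matches = [c for c in probe if literal_class.fullmatch(c)]
print(range_matches)    # characters accepted only because of the range
print(literal_matches)  # none of them is accepted once '-' is literal

simplified = re.compile(r'https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+')
match = simplified.search("see http://www.discover-earth.org/index.html). next")
print(match.group(0))   # still ends with ").", as described above
```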
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
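import re

# des is the text being searched (defined elsewhere)
des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # Keep only the part before the last "/"
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)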
I've searched around quite a bit now, but I can't get any suggestions to work in my situation. I've seen success with negative lookahead or lookaround, but I really don't understand it.
I wish to use RegExp to find URLs in blocks of text but ignore them when quoted. While not perfect yet I have the following to find URLs:
(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?
I want it to match the following:
www.test.com:50/stuff
http://player.vimeo.com/video/63317960
odd.name.amazone.com/pizza
But not match:
"www.test.com:50/stuff
http://plAyerz.vimeo.com/video/63317960"
"odd.name.amazone.com/pizza"
Edit:
To clarify, I could be passing a full paragraph of text through the expression. Sample paragraph of what I'd like below:
I would like the following link to be found www.example.com. However this link should be ignored "www.example.com". It would be nice, but not required, to have "www.example.com and www.example.com" ignored as well.
A sample of a different one I have working below. language is php:
$articleEntry = "Hey guys! Check out this cool video on Vimeo: player.vimeo.com/video/63317960";
$pattern = array('/\n+/', '/(https?\:\/\/)?(player\.vimeo\.com\/video\/[0-9]+)/');
$replace = array('<br/><br/>',
'<iframe src="http://$2?color=40cc20" width="500" height="281" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe>');
$articleEntry = preg_replace($pattern,$replace,$articleEntry);
The result of the above will replace any newlines "\n" with a double break "<br/><br/>" and will embed the Vimeo video by replacing the Vimeo address with an iframe and link.
I've found a solution!
(?=(([^"]+"){2})*[^"]*$)((https?:\/\/)?(\w+\.)+\w{2,}(:[0-9]+)?((\/\w+)+(\.\w+)?)?\/?)
The first part, from (?= to *$), is what makes it work for me. I found this as an answer to "java Regex - split but ignore text inside quotes?" by https://stackoverflow.com/users/548225/anubhava
While I had read that question before, I had overlooked his answer because it wasn't the one that "solved" the question. I just changed the single quote to double quote and it works out for me.
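The question is about PHP, but the lookahead behaves the same in other engines, so here is a rough Python sketch of it on part of the sample paragraph. It only works if the quotes in the text are balanced, since the lookahead counts the quotes that remain up to the end of the text.

```python
import re

# The lookahead succeeds only if an even number of quotes follows,
# i.e. the current position is outside any "..." pair.
outside_quotes_url = re.compile(
    r'(?=(([^"]+"){2})*[^"]*$)'
    r'((https?:\/\/)?(\w+\.)+\w{2,}(:[0-9]+)?((\/\w+)+(\.\w+)?)?\/?)'
)

text = ('I would like the following link to be found www.example.com. '
        'However this link should be ignored "www.example.com".')

found = [m.group(0) for m in outside_quotes_url.finditer(text)]
print(found)
```

The lookahead is zero-width, so group(0) contains just the URL.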
Add ^ and $ to your regex:
^(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?$
Please note that you might need to escape the slashes after http (meaning https?\:\/\/).
Update:
If you want it to be case sensitive, you shouldn't use \w but [a-z]. \w matches all letters, digits, and the underscore, so you should be careful when using it.
I'm trying to extract email addresses from some content. I have a problem with false positives.
My regex for: example@site.com
[^\.^\w+](\w+) *?@ *?(\w+) *?(?:\.|dot) *?(\w+)
Regex for: example@sub.site.com
[^\.^\w+](\w+) *?@ *?(\w+) *?(?:\.|dot) *?(\w+) *?(?:\.|dot) *?(\w+)
I want the first regex not to match:
example@sub.site
How can I fix it?
The only way to distinguish example@site.com from example@sub.site is to maintain a list of valid top-level domains (yes, I'm sorry).
I.e., replace your last (\w+) with (com|org|info|ly|...) and so on.
There is no universal way.
Also, you could do it with only one regex.
Also, my address could be example@sub1.sub2.site.com, so be careful...
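A minimal sketch of the single-regex idea in Python, assuming the separator is written as @ and using a deliberately tiny, hypothetical TLD list (a real list would be much longer and would have to be maintained):

```python
import re

# Hypothetical, incomplete allowlist of top-level domains.
TLDS = ("com", "org", "info")

address = re.compile(
    r'\b(\w+) *@ *'                # local part and separator
    r'((?:\w+ *(?:\.|dot) *)+)'    # one or more domain labels, each ending in . or dot
    r'(' + "|".join(TLDS) + r')\b' # final label must be a known TLD
)

samples = ["example@site.com", "example@sub1.sub2.site.com", "example@sub.site"]
accepted = [s for s in samples if address.search(s)]
print(accepted)
```

The repeated label group is what lets one pattern cover both site.com and sub1.sub2.site.com, while the allowlist is what rejects sub.site.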