I am using Sublime Text 2's regex search and replace tool and would like to search text that includes the \r and \n special characters, but I cannot work out how.
For example, I have the text:
<div class="head">\r\n
\r\n Keep this text\r\n</div>
Which I would like to transform into:
<h1>Keep this text</h1>
I would also like to factor in the eventuality that these \r\n characters may not be present.
How might I search accounting for \r\n being present and absent, and then remove them as per above? If two regex are required that's fine too.
So far I have <div class="head">(\w)+</div>, however this is stalled by the aforementioned \r\n.
I think you're looking for \s, which matches white space.
So your regex should be something like the following:
<div class="head">\s*(.+?)\s*</div>
If you can do this in ST2, then I think it would fit your need:
Find:
<div class="head">[\s\r\n]*([\w ]+)[\s\r\n]*<\/div>
Replace by:
<h1>$1</h1>
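For anyone who wants to check the \s* approach outside Sublime Text, here is a minimal Python sketch. It assumes Python's re module, which writes the backreference as \1 rather than $1; the variable names are only illustrative.

```python
import re

# Pattern from the first answer: \s* absorbs any \r\n (or nothing at all).
pattern = re.compile(r'<div class="head">\s*(.+?)\s*</div>')

with_newlines = '<div class="head">\r\n\r\n Keep this text\r\n</div>'
without_newlines = '<div class="head">Keep this text</div>'

# Python writes the backreference as \1 where Sublime Text uses $1.
result_a = pattern.sub(r'<h1>\1</h1>', with_newlines)
result_b = pattern.sub(r'<h1>\1</h1>', without_newlines)
print(result_a)
print(result_b)
```

Both inputs should come out as the same h1 line, since \s* also matches the empty string.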
I want to extract data from HTML. The problem is that I can't extract two strings, one at the top and one at the bottom of my pattern.
I want to extract 23423423423 and 1234523453245, but only if the string Allan appears between them:
<h4>###### </h4> said12:49:32
</div>
<a href="javascript:void(0)" onclick="replyAnswer(##########,'GET','');" class="reportLink">
report </a>
</div>
<div class="details">
<p class="content">
Hi there, Allan.
</p>
<div id="AddAnswer1234523453245"></div>
Of course, I can do something like this: Profile\/(\d+).*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*\s*.*Allan.*\s*.*\s*.*AddAnswer(\d+). But the code is horrible. Is there a way to make it shorter?
I was thinking about:
Profile\/(\d+)(.\sAllan)*AddAnswer(\d+)
or
Profile\/(\d+)(.*Allan\s*)*AddAnswer(\d+)
but neither of them works properly. Do you have any ideas?
You can construct a character class that matches any character, including newlines, with [\S\s]: all whitespace plus all non-whitespace characters is every character.
With that, your attempts were reasonably close:
/Profile\/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)/
This looks for Profile/ and the number after it, then any characters up to Allan, then any characters up to AddAnswer, and finally the number after that. If you have single-line mode available (/s), you can use dots instead.
/Profile\/(\d+).*Allan.*AddAnswer(\d+)/s
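A short Python sketch of both variants follows. The question does not show the markup around "Profile/", so the first line of the sample input is an assumption made purely for illustration; re.DOTALL is Python's name for the s flag.

```python
import re

# Hypothetical input: the "Profile/..." link is not shown in the question,
# so its markup here is an assumption made for illustration.
html = '''<a href="/Profile/23423423423">user</a>
<div class="details">
<p class="content">
Hi there, Allan.
</p>
<div id="AddAnswer1234523453245"></div>'''

# [\S\s] matches any character, newlines included.
m1 = re.search(r'Profile/(\d+)[\S\s]*Allan[\S\s]*AddAnswer(\d+)', html)

# Same idea with plain dots: re.DOTALL lets . match newlines too.
m2 = re.search(r'Profile/(\d+).*Allan.*AddAnswer(\d+)', html, re.DOTALL)

ids_class = m1.groups()
ids_dotall = m2.groups()
print(ids_class, ids_dotall)
```

Because both quantifiers are greedy, a document with several such blocks would be swallowed as one match, so this sketch only suits a single block.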
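In some flavors, such as Ruby, the m flag makes . match newlines; in most others, m only changes how ^ and $ behave, and s is the flag that does this. Note that the pattern below does not check for Allan.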
/Profile\/(\d+).+AddAnswer(\d+)/m
Better use a parser instead. If you must use regular expressions for whatever reason, you might get along with a tempered greedy solution:
Profile/(\d+) # Profile followed by digits
(?:(?!Allan)[\S\s])+ # any character except when there's Allan ahead
Allan # Allan literally
(?:(?!AddAnswer)[\S\s])+ # same construct as above
AddAnswer(\d+) # AddAnswer, followed by digits
See a demo on regex101.com
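The commented pattern above can be tried in Python with re.VERBOSE, which ignores the layout whitespace and the # comments. This is only a sketch: the two-block input and its markup are invented for illustration, since the question does not show the Profile link.

```python
import re

pattern = re.compile(r'''
    Profile/(\d+)              # Profile followed by digits
    (?:(?!Allan)[\S\s])+       # any character, unless Allan starts here
    Allan                      # Allan literally
    (?:(?!AddAnswer)[\S\s])+   # same construct as above
    AddAnswer(\d+)             # AddAnswer, followed by digits
''', re.VERBOSE)

# Hypothetical input with two answer blocks (markup is an assumption).
html = (
    '<a href="/Profile/111">a</a>\n<p>Hi there, Allan.</p>\n'
    '<div id="AddAnswer222"></div>\n'
    '<a href="/Profile/333">b</a>\n<p>Bye, Allan.</p>\n'
    '<div id="AddAnswer444"></div>\n'
)

# findall returns one (profile_id, answer_id) tuple per match.
pairs = pattern.findall(html)
print(pairs)
```

Each block is matched separately here because the tempered tokens stop at the first Allan and the first AddAnswer. A block that does not mention Allan could still run on into the next one, which is one more reason to prefer a parser.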
In my html page I have a lot of strings inside tags.
like
<p>Some string 1</p>
<p>Some string 2</p>
<p>Any string 3</p>
I need to move each of them into a translate attribute, lowercase them, and replace all spaces inside the strings with underscores.
So I multi-select all of them while holding Ctrl, press Ctrl+K, Ctrl+L to make them lowercase, Ctrl+X to cut, press the left arrow twice to move inside the tags, and type translate="PASTE HERE".
Now I have
<p translate="some string 1"></p>
<p translate="some string 2"></p>
<p translate="any string 3"></p>
Next step - I need to make underscores instead of spaces.
To find all translate strings I use regex (?s)translate=".+?"
But how do I do the replacement?
Press Ctrl+H, then use a negative lookbehind to find spaces that are not preceded by p:
(?<!p)\h+
\h matches only horizontal spaces.
Now replace-all it with _.
This is simple, but it works and is faster than hunting for a smarter answer.
Find this: translate="(.*) (.*)"
Replace with this: translate="\1_\2"
Keep using Replace All until all your unwanted spaces are underscores (in the example you gave, twice).
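If the files can be processed with a script instead of the editor, a replacement callback does the whole job in one pass. This is a different technique from the repeated Replace All above, sketched here in Python; underscore_spaces is a hypothetical helper name.

```python
import re

html = '''<p translate="some string 1"></p>
<p translate="any string 3"></p>'''

def underscore_spaces(match):
    # Rebuild the attribute with spaces in its value turned into underscores.
    return 'translate="' + match.group(1).replace(' ', '_') + '"'

# [^"]* keeps each match inside a single attribute value.
result = re.sub(r'translate="([^"]*)"', underscore_spaces, html)
print(result)
```

The space between the tag name and the attribute is left alone because only the captured value is rewritten.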
I need to match text between two tags, but starting at a specific occurrence of the tag.
Imagine this text:
Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>
In my example, I would like to match here. And some.
I successfully matched the text between the first occurrence (between the first and second br tags) with:
<br>(.*?)<br>
But I am looking for the text in the next match (which would be between the second and third br tags). This is probably more obvious than I realize, but regex is not my strong suit.
Just extend your regex:
<br>(.*?)<br>(.*?)<br>
or, for an unlimited number of matches, and trimming the spaces:
<br>\s*(.*?)(?=\s*<br>)
EDIT: Now that I see that you are parsing an HTML document, be aware that regular expressions may not be the best tool for that job, especially if your parsing requirements are complex.
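To see what the lookahead version returns, here is a small Python sketch using re.findall on the sample text from the question. Because the lookahead does not consume the closing br tag, that tag can open the next match.

```python
import re

text = ('Some long <br> text goes <br> here. And some <br> '
        'more can <br> go here.<br>')

# The lookahead leaves each closing <br> available to start the next match.
segments = re.findall(r'<br>\s*(.*?)(?=\s*<br>)', text)
print(segments)

# The text between the second and third <br> is the second list item.
wanted = segments[1]
print(wanted)
```

Indexing the list picks whichever occurrence is needed without lengthening the pattern.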
I have 500 HTML files in my project where the casing and quotes (" or ') of the title attribute vary across pages. See a few examples below:
<button title="Next" id="next"> Next</button>
<button title="next"> Next </buton>
<button title=""please go back">Check</button>
I want to change all title attributes to Title Case:
<button title="Next" id="next"> Next</button>
<button title="Next"> Next </buton>
<button title="Please Go Back">Check</button>#
I have tried find and replace with the Regular Expression and Case sensitive options enabled:
Find What: title=(".*")\s
Replace With: title="\u$"
But I had no success. Please tell me what I am doing wrong.
UPDATE: I also want to remove the extra ' or " characters; see the line marked #.
To further my comment, first it's the issue of .* being 'greedy' instead of 'lazy', meaning it is matching as much as possible (i.e. Next"> Next</button><button title="Next in your example).
The quick fix is to use a 'lazy' .* instead, i.e. .*? (I also added a ? after \s to make the trailing space optional, because it is not always present in your examples):
title=(".*?")\s?
To improve performance, you would use a negated class:
title=("[^"]+")\s?
Where [^"]+ matches any character except ".
And to cope with the different quotes, you can use:
title=("[^"]+"|'[^']+')\s?
Which basically means either "[^"]+" or '[^']+' for the part within the parentheses.
For the replace and consecutive quotes issue:
title=(?:"+([^"]+)"+|'+([^']+)'+)\s?
Replace with:
title="\u$1$2"
The only thing is that the last line will be <button title="Please go back">Check</button>, if that's not an issue...
EDIT: \G actually works. Use a second replace:
(?:(?<=title=")|(?<!^)\G)[^\s"]+\s?
Replace with:
\u$0
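Outside the editor, the same quote-tolerant pattern can be combined with a callback so that every word is capitalized in one pass. This is a rough Python sketch, not the editor workflow above: it drops the trailing \s? so the spacing after the attribute is preserved, the sample lines are adapted from the question, and title_case is a hypothetical helper name.

```python
import re

lines = [
    '<button title="next"> Next </button>',
    "<button title='next'> Next </button>",
    '<button title=""please go back">Check</button>',
]

# One or more quotes of either kind around the value; group 1 or 2 holds it.
pattern = re.compile(r"""title=(?:"+([^"]+)"+|'+([^']+)'+)""")

def title_case(match):
    value = match.group(1) or match.group(2)
    # str.title() capitalizes every word and lowercases the other letters.
    return 'title="' + value.title() + '"'

fixed = [pattern.sub(title_case, line) for line in lines]
for line in fixed:
    print(line)
```

Note that str.title() also lowercases the rest of each word, which may not suit titles containing acronyms.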
(?<=title=('|")).+?(?=('|"))
This should give you the matches Next, next, and please go back.
You can use the index of each match to locate it in the original string if you want to uppercase the lowercase letters.
Or use title=('|").+?(\1) to find any title attribute in your text, including the quotation marks.
I'm referring to this page: http://ergoemacs.org/emacs/emacs_regex.html
which says that to capture a pattern in Emacs Regexp, you need to escape the paren like this: \(myPattern\).
It further says that the syntax for capturing a sequence of ASCII characters is [[:ascii:]]+
In my document, I'm trying to match all strings that occur between <p class="calibre3"> and </p>
So, following the syntax above, I do a replace-regexp for
<p class="calibre3">\([[:ascii:]]+\)</p>
but it finds no matches.
Suggestions?
Regexps are not good for general-purpose HTML parsing, but as paragraph tags cannot be validly nested, the following is going to be fine (provided the mark-up is valid & well-formed).
<p class="calibre3">\(.*?\)</p>
*? is the non-greedy zero-or-more repetitions operator, so it will match as little as possible -- in this case everything until the next </p> (as opposed to the greedy version, which would match everything until the final </p> in the text).
The [^<] approach is fine if it fits the data in question, but it won't work if there are other tags within the paragraphs.
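The greedy versus non-greedy difference is easy to see in a short sketch. It is written in Python rather than Emacs Lisp, so the capture group uses plain parentheses instead of \( \); the two-paragraph input is made up for illustration.

```python
import re

text = '<p class="calibre3">first</p><p class="calibre3">second</p>'

# Greedy: .* runs on to the final </p> in the text.
greedy = re.findall(r'<p class="calibre3">(.*)</p>', text)

# Non-greedy: .*? stops at the next </p>.
lazy = re.findall(r'<p class="calibre3">(.*?)</p>', text)

print(greedy)
print(lazy)
```

The greedy form returns a single match spanning both paragraphs, while the non-greedy form returns one match per paragraph.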
I would use [^<] instead of [[:ascii:]], like so (the angle brackets should not be escaped, because in Emacs \< and \> match word boundaries):
<p class="calibre3">\([^<]+\)</p>
Source: #TooTone