bbedit - how to change multiple strings to title case? - regex

In BBEdit there is a feature where you select the text, choose Text -> Change Case -> Make Title Case from the menu, and it will act accordingly.
Is there a way to select multiple strings of text across files in your project then apply the same text formatting?
I know you can do some regex stuff to change case, but none of that is true title case, which ignores words like "of", "and", "the", etc. The title case command works great; I just need to apply it to many items.
For example, 5 HTML files have <h2>THIS IS THE TITLE</h2>, so now I go to each file, select the text, and use the menu item above. That is fine if there are 5, but if there are 2500 that I want to turn into <h2>This is the Title</h2>, then I need to be able to select more than one at a time.
Thanks in advance!
------edits
So if you were searching across multiple files for all your <h2> tags and you get back several across different files.....
<h2>MY TITLE</h2>
<h2>this is a title</h2>
<h2>Another title</h2>
The title case would change each of them accordingly to:
<h2>My Title</h2>
<h2>This is a Title</h2>
<h2>Another Title</h2>
Currently you select each one individually to do so via the menu. We'd like to do this with a find-all and change case if that makes sense....
F:<h2>(.*?)</h2>
R: <h2>\1</h2>[make this a certain case]
Thanks.

Almost any operation which involves repetition in BBEdit can be automated using AppleScript. :-)
Here's the text of an AppleScript script which will perform the operation you describe. You can copy and paste this into the AppleScript editor, and save it in BBEdit's "Scripts" folder for future use as needed.
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
tell application "BBEdit"
    -- make sure we start at the top, because searches will (by default) proceed from the end of the selection range.
    tell text of document 1 to select insertion point before first character
    repeat
        tell text of document 1
            -- Find matches for a string inside of heading tags (any level)
            -- NOTE extra backslashes in the search pattern to keep AppleScript happy
            set aSearchResult to find "(<(h\\d)>)(.+?)(</\\2>)" options {search mode:grep}
        end tell
        if (not found of aSearchResult) then
            exit repeat -- we're done
        end if
        -- the opening tag is the first capture group. We'll use this below
        set openingTagText to grep substitution of "\\1"
        -- the title is the third capture group
        set titleText to grep substitution of "\\3"
        -- use "change case" to titlecase the title
        set changedTitleText to change case (titleText as string) making title case
        -- select the range of text containing the title, so that we can replace it
        set rangeStart to (characterOffset of found object of aSearchResult) + (length of openingTagText)
        set rangeEnd to (rangeStart + (length of changedTitleText) - 1)
        select (characters rangeStart through rangeEnd of text of document 1)
        -- replace the range
        set text of selection to changedTitleText
        -- select found object of aSearchResult
    end repeat
    -- put the insertion point back at the top, because it's a nice thing to do
    tell text of document 1 to select insertion point before first character
end tell
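For comparison, here is a rough Python sketch of the same idea outside BBEdit: find each heading with the same grep pattern the script uses and title-case only the text between the tags. The small-word list and the helper names (`SMALL_WORDS`, `title_case`, `retitle_headings`) are assumptions made for illustration. They are chosen to reproduce the examples in the question and will not match BBEdit's own title-case rules exactly.

```python
import re

# Assumed list of words left in lowercase; BBEdit's real list may differ.
SMALL_WORDS = {"a", "an", "and", "as", "at", "but", "by", "for", "in", "is",
               "of", "on", "or", "the", "to"}

# Same pattern as the AppleScript: opening tag, title text, matching close.
HEADING = re.compile(r"(<(h\d)>)(.+?)(</\2>)")


def title_case(text):
    """Capitalize every word except small words in the middle of the title."""
    words = text.lower().split(" ")
    last = len(words) - 1
    result = []
    for i, word in enumerate(words):
        if i in (0, last) or word not in SMALL_WORDS:
            result.append(word[:1].upper() + word[1:])
        else:
            result.append(word)
    return " ".join(result)


def retitle_headings(html):
    """Title-case the text inside every <hN>...</hN> pair on one line."""
    return HEADING.sub(
        lambda m: m.group(1) + title_case(m.group(3)) + m.group(4), html)
```

Running `retitle_headings` over the contents of each file would cover the "2500 headings" case in one go, at the cost of maintaining the small-word list yourself.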

Related

Possible Bug using Regex in Notepad++ with Replace All?

Have I found a bug in Notepad++ or am I doing something wrong?
Background info
(Please note that I do know that one is not supposed to use regex to parse HTML, but I think this is a special case that should work, without the possible Notepad++ bug ;-)
I have exported Apple Notes as HTML using Exporter 3.0 on a Mac. In the HTML output every Note line is between <div> - </div> elements and also "header/title lines" like <h1> - </h1> or <h2> - </h2> etc. Each "header/title line" is often split in several unnecessary HTML header elements as in the following simplified example.
<div><h1>TEST </h1><h1>Title<br></h1></div>
<div><b><h2>T1</h2><u><h2>T2</h2></u><h2> </h2></b><h2>(</h2><h2>T3</h2><u><h2>T4</h2></u><h2>)</h2><b><h2><br></h2></b></div>
This HTML can't be imported into OneNote giving the same result as seen in Apple Notes i.e. each "header/title" line is split in multiple lines. That's true even when changing the <h1>/<h2> block elements to inline elements using an initial <style>h1, h2 {display: inline;}</style> statement. (Maybe that is a bug or restriction in OneNote, but I need to find a workaround.)
Therefore, I need to clean the example HTML output above from the unnecessary HTML header <h1> or <h2> (all but the first in every line) and </h1> or </h2> (all but the last in every line), to get the following result that can be imported to OneNote without problem.
<div><h1>TEST Title<br></h1></div>
<div><b><h2>T1<u>T2</u> </b>(T3<u>T4</u>)<b><br></h2></b></div>
Solution ? - Developed Regex
I'm quite new to regex, especially advanced regex, but I think I have found a way to clean the erroneous HTML code using TWO different expressions as follows.
Both work well when tested using regex101.com, I think.
The first one is used to remove unnecessary </h1> or </h2> elements and uses a positive lookahead (it works both in regex101 and in Notepad++)
(</h[1-6]>)(?=.*?\1)
(Demo)
Picture 1 shows a working Find All + Mark All in Notepad++
Picture 2 shows a working Replace All
The second one is used to remove unnecessary <h1> or <h2> elements and uses a positive lookbehind (it works in regex101 but NOT fully in Notepad++)
(?<=(<(h[1-6])>))(?:.*?)\K\1
(Demo)
Picture 3 shows a working Find All + Mark All in Notepad++ = All 8 occurrences found
Picture 4 shows a NOT working Replace All in Notepad++ = Only 5 occurrences (of the 8 found) are replaced
If I redo the same Replace All a second time, 2 of the remaining 3 occurrences are replaced.
If I redo the same Replace All a third time, the last remaining occurrence is replaced.
BUG ?
Is this a bug in Notepad++ or is this behavior normal or am I doing something strange here? Please help me understand.
So, rather than make multiple passes through your data, you can get it all in one pass with this:
(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)
and replace it with \1\2. The first capture group skips the first <h#> on each line and is null after line start. The second capture group captures everything up to the next <h#> tag. The optional slash (/?) scans and deletes both open and close tags. The last part is a positive lookahead to make sure the last </h#> is not deleted.
In the two lines of your examples all the header levels were the same on the line, so this regex is fine. If the first open and last close don't match, then you have a problem, but I think your solutions have that same problem. In any case you can fix that in a second pass with ^(.*<h)([1-6])(.*</h)[1-6] and replace it with \1\2\3\2.
I would also point out that this creates unbalanced HTML with a <b>, followed by <h1>, followed by </b>, followed by </h1>. I don't know if that is OK for your case. If not, it might be better to remove ALL the <h#> tags and anchor new ones just inside the <div> </div> pair.
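The single-pass pattern and the follow-up level fix can be checked outside Notepad++ with a short Python sketch. This is only an illustration: Python's `re` engine is not the one Notepad++ uses, `re.MULTILINE` stands in for Notepad++'s line-by-line handling of ^ and $, and the second pattern assumes its third group is meant to reach the closing tag, i.e. `(.*</h)`. The names `STRIP_INNER`, `MATCH_LEVELS` and `merge_headings` are made up for this example.

```python
import re

# Pass 1: keep the first <hN> and the last </hN> on each line, drop the rest.
STRIP_INNER = re.compile(
    r"(^.*?<h[1-6]>)?(.*?)</?h[1-6]>(?=.*</h[1-6]>.*?$)", re.MULTILINE)

# Pass 2 (assumed form): copy the opening level onto the last closing tag.
MATCH_LEVELS = re.compile(r"^(.*<h)([1-6])(.*</h)[1-6]", re.MULTILINE)


def merge_headings(html):
    """Collapse the split heading tags on every line of html."""
    merged = STRIP_INNER.sub(r"\1\2", html)
    return MATCH_LEVELS.sub(r"\1\2\3\2", merged)
```

On the two sample lines from the question this gives the desired output in a single call, and a line whose first open and last close differ in level ends up with the level of the opening tag.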
In any event here is a REGEX101 screenprint with this regex working on your examples:

preg_replace replace incomplete tags

I am trying to read an HTML page using file_get_contents. After I process the data, there are some incomplete tags, for example:
</p><p> test test test test</p>
In this case there is no <p> to open the first </p>,
or
<font color="#333333">abc</font><div><p>go go go go </p>
In this case there is no </div> to close the <div>.
So I want to use preg_replace to remove all these incomplete tags; in my examples, the extra </p> and <div> should be removed. How can I do that? These tags can be any valid HTML5 tags.
First, you need to understand what a "well-formed markup document" is in XHTML.
Well-formedness alone does not guarantee that the two tags chosen as an open/close pair are the right two if there is a spare unpaired tag.
Second, you will need a loop that takes one tag type per iteration from an array of tag types. The tags in the array should be literals.
Take each tag's length (an int) and set it in the loop before testing for the tag's presence.
When an open/close pair is matched, preg_match puts the matched section into its matches array; take the length of the match and its start position from that result array (print the array for debugging while developing the script).
Inside each matched open/close pair, run a sub-loop that repeats the same action to check the nested tags.
Synopsis:
Building such a system as a custom script is on a par with writing an XML well-formedness parser and debugger of any real efficiency; it would effectively be a markup debugger for an IDE.
Good luck.
You should investigate the use of the PHP Tidy extension (http://php.net/manual/en/book.tidy.php). You can use Tidy to clean up malformed HTML based on whatever DOCTYPE you are attempting to validate.
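To make the pairing idea from the first answer concrete, here is a minimal Python sketch that drops tags with no partner. It swaps preg_replace for a stack over the tokens produced by the standard library's `html.parser`, so it is not what either answer literally proposes. The names `TagScanner` and `drop_unpaired_tags` are invented for this example, closing tags are re-emitted in lowercase, and comments and doctype declarations are simply discarded.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are never "unpaired".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagScanner(HTMLParser):
    """Collect the input as a flat list of [kind, tag, raw_text] tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        kind = "void" if tag in VOID_TAGS else "open"
        self.tokens.append([kind, tag, self.get_starttag_text()])

    def handle_startendtag(self, tag, attrs):
        self.tokens.append(["void", tag, self.get_starttag_text()])

    def handle_endtag(self, tag):
        self.tokens.append(["close", tag, "</%s>" % tag])

    def handle_data(self, data):
        self.tokens.append(["text", None, data])

    def handle_entityref(self, name):
        self.tokens.append(["text", None, "&%s;" % name])

    def handle_charref(self, name):
        self.tokens.append(["text", None, "&#%s;" % name])


def drop_unpaired_tags(html):
    """Return html without opening or closing tags that have no partner."""
    scanner = TagScanner()
    scanner.feed(html)
    scanner.close()
    tokens = scanner.tokens

    # Text and void tags are always kept; paired tags are marked below.
    keep = [kind in ("text", "void") for kind, _, _ in tokens]
    open_stack = []  # indexes of opening tags still waiting for a close
    for i, (kind, tag, _) in enumerate(tokens):
        if kind == "open":
            open_stack.append(i)
        elif kind == "close":
            # Look for the nearest unclosed opener with the same name.
            for depth in range(len(open_stack) - 1, -1, -1):
                if tokens[open_stack[depth]][1] == tag:
                    keep[open_stack[depth]] = keep[i] = True
                    # Openers above it were never closed: leave them dropped.
                    del open_stack[depth:]
                    break
    return "".join(raw for (_, _, raw), kept in zip(tokens, keep) if kept)
```

This only removes the stray tags; it makes no attempt to guess where a missing partner should have gone, which is the kind of repair Tidy is built for.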

regex replace in dreamweaver

I'm trying to replace tags in my code, but keeping the text inside "as is".
example string:
<p class="negrita">text1</p>
or
<p class="negrita">text2</p>
I need to get those replaced like so:
<h3>text1</h3>
and
<h3>text2</h3>
I'm searching (and matching fine) with this,
<p class="negrita">([^>]*)</p>
but I have no idea on how to keep the text inside, as
<h3>$1</h3>
is not working.
Instead of using regular expressions to do this, use Dreamweaver's Specific Tag search option. While regular expressions are possible in Dreamweaver, the support is a bit crippled and sometimes a little buggy.
To do this with Specific Tag, invoke the search using Control-F.
Change the search option to "Specific Tag".
Next to that, enter p to search for all <p> tags.
Hit the + icon below the box to add an option, select "With Attribute", "Class", and "=" from the first three pulldowns, and type "negrita" in the fourth.
For the action, select "Change Tag" and then select h3.
Run this and you will get <h3 class="negrita">text1</h3>.
Then repeat the steps above, searching for all h3 tags with the class negrita, and choose Remove Attribute "class" for the action. Two steps instead of one, I admit, but it will work every time.
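If the files can be processed outside Dreamweaver, the capture-group replacement the question was attempting can be sketched in Python. This uses the question's own search pattern unchanged; the only differences are that Python's `re` module writes the backreference as `\1` rather than `$1`, and the names `NEGRITA` and `negrita_to_h3` are made up for the example.

```python
import re

# The search pattern from the question; group 1 is the text to keep.
NEGRITA = re.compile(r'<p class="negrita">([^>]*)</p>')


def negrita_to_h3(html):
    """Turn every <p class="negrita">...</p> into <h3>...</h3>."""
    return NEGRITA.sub(r"<h3>\1</h3>", html)
```

Like the original pattern, this only matches paragraphs whose content contains no nested tags, because `[^>]*` cannot cross a `>`.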

regex for all characters on yahoo pipes

I have an apparently simple regex question for Pipes: I need to truncate each item from its <img> tag onwards. I thought a Loop with a String Regex replacing <img[.]* with a blank field would have taken care of it, but to no avail.
Obviously I'm missing something basic here - can someone point it out?
The item as it stands goes along something like this:
sample text title
<a rel="nofollow" target="_blank" href="http://example.com"><img border="0" src="http://example.com/image.png" alt="Yes" width="20" height="23"/></a>
<a.... (a bunch of irrelevant hyperlinks I don't need)...
Essentially I only want the title text and hyperlink that's why I'm chopping the rest off
Going one better, since all I'm really doing here is making the item string more manageable by cutting it down before further manipulation: does anyone know if it's possible to extract an href from a certain link in the page (in this case the first one) using regex in Yahoo Pipes? I've seen the regex answer to this SO question, but I'm not sure how to use it to map a URL to an item attribute in a Pipes module.
You need to remove the line returns first: use a Regex module to replace the pattern [\r\n] with empty text on the content or description field, so that it becomes a single line of text. Then you can use the .* wildcard, which will run to the end of the line.
http://www.yemkay.com/2008/06/30/common-problems-faced-in-yahoo-pipes/
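The two-step fix can be tried outside Pipes with a small Python sketch. Python's `re` module stands in for the Pipes Regex module here, so details may differ, and the function name `truncate_at_img` is invented for the example. It also shows why the original attempt did nothing useful: inside a character class, `[.]` matches only a literal dot, so `<img[.]*` removes just the text `<img`.

```python
import re


def truncate_at_img(item):
    """Join the item onto one line, then cut everything from <img onwards."""
    one_line = re.sub(r"[\r\n]", "", item)   # step 1: remove line returns
    return re.sub(r"<img.*", "", one_line)   # step 2: .* now runs to the end


def original_attempt(item):
    """The pattern from the question: [.]* matches zero or more literal dots."""
    return re.sub(r"<img[.]*", "", item)
```

After the line returns are removed, the `.*` wildcard reaches the end of the whole item, so everything after the title text and the first link's opening tag is dropped.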

Any better way to create MediaWiki numbered lists?

When using MediaWiki's markup language, the only thing that I hate is creating numbered lists. The only way I know to create a list is to do something like this:
#Item1
#Item2
However, if I want to add spaces or some other text between those lines, the numbering gets lost. For example, the following will create text that has two number one items:
#Item1
Somestuff
#Item2
Is there any way around this, or should I just use bullet points instead? I noticed just now that the stackoverflow system does not allow numbering like this, you have to do it all manually.
Like this:
#Item1
#:Somestuff
#Item2
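If such lists are generated by a script rather than typed by hand, the `#:` convention is easy to emit. A minimal Python sketch, assuming each list item arrives as a title plus a list of continuation lines (the function name and the input shape are made up for this example):

```python
def numbered_list(items):
    """Build wikitext for a numbered list.

    items is a list of (title, continuation_lines) pairs; continuation
    lines are written with "#:" so they do not restart the numbering.
    """
    lines = []
    for title, continuation_lines in items:
        lines.append("#" + title)
        lines.extend("#:" + text for text in continuation_lines)
    return "\n".join(lines)
```

Calling `numbered_list([("Item1", ["Somestuff"]), ("Item2", [])])` produces the three lines shown above.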
I use <ol></ol> and <li></li> to embed the <pre></pre> code formatting portions. Works great for me! :-)
There are a couple of options, but you can start an ordered list from an arbitrary number like this:
#Item1
Something
<ol start="2">
#Item2
</ol>
You can also use "#:" if you don't mind "Something" being indented a lot:
#Item1
#:
#: Something
#:
#Item2
There are quite a lot of options with lists, you can find more info on Wiki's Help Pages:List.
update
Newer versions work more like regular HTML markup: the old syntax will now give you a double indent and will not adjust the start offset, but the following works well, even with the source/syntaxhighlight tag.
<ol>
<li>Item1</li>
Something
</ol>
<ol start="2">
<li>Item2</li>
<source lang=javascript>
var a = 1;
</source>
</ol>
In short, everything within the ol tag will have the same indentation and will not be numbered if it is outside a li tag. The following will now work, and it means you don't have to offset groups manually.
<ol>
<li>Item1</li>
Something
<li>Item2</li>
<source lang=javascript>
var a = 1;
</source>
</ol>
The #: works, but you cannot create a section with spaces, so I would prefer the non-working option. Does anyone know a similar syntax that does the trick (start numbering at a given value)?
This response is probably a bit late, but I figure I'll add it in case anyone stumbles across this, as I have.
You can create a section with spaces by doing something like:
# Item 1
#:
#:
# Item 2
This will appear as:
1. Item 1
2. Item 2
Now, before you say this doesn't work, the trick is to add a no-break space (U+00A0) after the #: rather than simply hitting the spacebar. On Windows you can type it by holding ALT and typing 0160 on the numeric keypad. Doing this should add the usual Wiki paragraph formatting while preserving your numbering between #s.
Hope that helps!
"#:" will not work with other tags like
<source lang=javascript>
//...
</source>
I'm using Mediawiki 1.13.3 and this works:
#Item1
Somestuff
<ol start="2">
<li>Item2 </li>
</ol>
And for cases where you want to have some block text within your numbered wiki list try this
# one
#:<pre>
#:some stuff
#:some more stuff</pre>
# two
Which produces:
1. one
some stuff
some more stuff
2. two
From the Wiki Help Page I was able to get the numbering in a list to stay consistent using <p> and <pre>:
# Item 1
# Item 2 <p><pre>Item 2 Pre Stuff</pre></p>
# Item 3
Would generate
1. Item 1
2. Item 2
[ Item 2 Pre Stuff ]
3. Item 3
Following the link to Wiki Help, I found an example that meets what I think are the implied requirements:
(1) The list needs to keep numbering.
(2) Sometimes the "Somestuff" should be on its own line in the source.
To get (1) there are a few solutions proposed, but one way is to use paragraph delimiters around the extra "somestuff".
Example 1:
# Paragraph 1.<p>Paragraph 2.</p><p>Paragraph 3.</p>
# Second item.
To meet (2), you use paragraph marking in combination with commenting out the new lines (with <!-- newline-->).
Example 2:
# Paragraph 1.<!--
--><p>Paragraph 2.</p><!--
--><p>Paragraph 3.</p>
# Second item.
Both examples display as
Result:
1. Paragraph 1.
Paragraph 2.
Paragraph 3.
2. Second item
Note that the comment eats all of the white space between the end of one element and the start of the next, which seems to be standard practice, and makes sense if you're trying to have whitespace without the "wiki effects" of the white space.
Extension:ComplexList
https://www.mediawiki.org/w/index.php?oldid=2126533
was put together but not maintained (for lack of time). It works with 1.26.2 of MediaWiki.
For example.
<cl>
1. list item A1
* list item A2
continuing list item A2
further continuing list item A2
* list item A3
</cl>
becomes
list item A1
list item A2continuing list item A2further continuing list item A2
list item A3
You can do:
# one
# two<br />spanning more lines<br />doesn't break numbering
# three
## three point one
## three point two
Regular old <br> works as well but probably pisses off someone.
You can put additional HTML formatting in as well, to do <pre> formatting and the like without breaking the numbering. This also works with other list formats.
From:
http://www.mediawiki.org/wiki/Help:Formatting
edit: Also found that inside a <pre></pre> many of my old tricks don't work, but using &#10; works as a newline and allows multi-line blocks. The cost is that you jam all your lines onto one line.
# one
#: <pre>foo&#10;bar</pre>