BBEdit GREP Find Replace - replace

I'm working with a tab-delimited file in BBEdit. The file looks like this:
00:15:50;11 text1 text2
00:35:17;03 text4 text5
00:35:20;03 text6
00:35:20;22 text7
Basically, it has:
Timecode Tab Text Tab Text Etc
I want to take the timecode from the following line and add it after the timecode on the current line. I want it to look like this:
00:15:50;11 00:35:17;03 text1 text2
00:35:17;03 00:35:20;03 text4 text5
00:35:20;03 00:35:20;22 text6
00:35:20;22 text7
I've tried using this piece of GREP code:
FIND:
`(?-m)([0-9][0-9][; :][0-9][0-9][; :][0-9][0-9][; :][0-9][0-9])(.*)\r([0-9][0-9][; :][0-9][0-9][; :][0-9][0-9][; :][0-9][0-9])`
REPLACE:
`\1\t\3\2\r\3`
My problem is that it only searches and replaces every other line. If I do a find/replace all, it looks like this:
00:15:50;11 00:35:17;03 text1 text2
00:35:17;03 text4 text5
00:35:20;03 00:35:20;22 text6
00:35:20;22 text7
It's skipping every other line. I want to do a search/replace all in several hundred files. I'm wondering if there's something that I can change to make sure it gets every single line.
Thank you.

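I took your regex and modified it slightly.
Your pattern consumes the second timecode as part of each match, so the next match cannot begin with that timecode; that is why every other line is skipped.
The trick is to not match the timecode at the beginning of the line. So, use a positive lookbehind, which checks that the timecode is there without consuming it.
FIND:
`(?<=([0-9][0-9][; :][0-9][0-9][; :][0-9][0-9][; :][0-9][0-9]))(.*)\r([0-9][0-9][; :][0-9][0-9][; :][0-9][0-9][; :][0-9][0-9])`
Here,
(?<=(...)) - lookbehind to see if the timecode exists, but don't match it. The parentheses still make it the first capture group.
(.*) - the rest of the line (group 2)
\r - the line break
([0-9][0-9][; :][0-9][0-9][; :][0-9][0-9][; :][0-9][0-9]) - the timecode of the next line (group 3)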
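For anyone who wants to check the behaviour outside BBEdit, here is a minimal Python sketch of both patterns. It rests on a few assumptions that are not in the answer: the sample uses `\n` instead of BBEdit's `\r`, the function names are made up for illustration, and the replacement for the lookbehind version (`\t\3\2\n\3`, with no `\1` since the leading timecode is no longer part of the match) is inferred rather than quoted.

```python
import re

# Timecode such as 00:15:50;11. It is fixed width (11 characters),
# which Python's re module requires inside a lookbehind.
TC = r"[0-9][0-9][; :][0-9][0-9][; :][0-9][0-9][; :][0-9][0-9]"

SAMPLE = (
    "00:15:50;11\ttext1\ttext2\n"
    "00:35:17;03\ttext4\ttext5\n"
    "00:35:20;03\ttext6\n"
    "00:35:20;22\ttext7"
)

def original_attempt(text):
    # The leading timecode is consumed, so consecutive matches cannot
    # share a timecode and every other line is skipped.
    return re.sub(rf"({TC})(.*)\n({TC})", r"\1\t\3\2\n\3", text)

def with_lookbehind(text):
    # The leading timecode is only asserted, not consumed, so the next
    # match can start right after the timecode the previous match ended on.
    return re.sub(rf"(?<=({TC}))(.*)\n({TC})", r"\t\3\2\n\3", text)

skipped = original_attempt(SAMPLE).split("\n")
fixed = with_lookbehind(SAMPLE).split("\n")

for line in fixed:
    print(line)
```

In `skipped`, the second line keeps a single timecode, as in the question; in `fixed`, every line except the last gets the following line's timecode.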


Regex : Keep text between 2 keywords but only if another keyword exists inside them

I am using EmEditor and I am trying to isolate about 2 million articles containing keyword3 from a French Wikipedia dump .xml file (20 GB, 338 million rows, 4.8 million articles in total).
I would like to keep the text contained between 2 keywords (keyword1 and keyword2) but only if another keyword (keyword3) exists inside them.
List of keywords :
keyword1 = <page>
keyword2 = </page>
keyword3 = {{Infobox
Example A:
keyword1 = <page>
text to consider without keyword3
keyword2 = </page>
Result => do not extract (or keep or split) this part.
Example B:
keyword1 = <page>
text to consider with keyword3
keyword2 = </page>
Result => extract (or keep or split) this part.
The author of EmEditor helped me with the following:
Find (choose regular expression):
<page>(.*?{{Infobox.*?)</page>
Replace with
\1
And in Advanced... : search in 2500 lines
It seems to work fine overall, but some errors appear from time to time:
I am attaching some small samples here: https://www.cjoint.com/c/JErsTJnVQpD
I also added a small XML file with the desired results.
As you can see in the attached image, the part highlighted in blue (2 articles) should not have been included in the result, as those articles don't contain the keyword {{Infobox .
Note: It would also be nice if the <page> tags were kept in the results.
Thanks in advance ;)
If you use EmEditor, in the Replace dialog box:
Find:
<page>((?:(?!<page>).)*?{{Infobox.*?)</page>
Replace with:
<page>\1</page>
Make sure New Document is selected in the menu displayed when you click ▼ by the Extract button.
In the Advanced dialog:
Set the Regular Expressions “.” Can Match Newline Characters check box.
Enter 3000 (or the maximum number of lines you need to extract from one occurrence of regex) at the Additional Lines to Search for Regular Expressions text box
Finally, click the Extract button in the Replace dialog box.
I have left the placeholder keywords in; substitute your own as needed.
Since you have gigabytes of data, this should be the fastest way to do it.
Try:
(?s)keyword1.*?(?:(?:keyword1|keyword2)(*SKIP)(*FAIL)|keyword3).*?(?:keyword1(*SKIP)(*FAIL)|keyword2)
Or, with the keywords substituted:
Find: (?s)<page>(.*?(?:(?:<page>|</page>)(*SKIP)(*FAIL)|{{Infobox).*?)(?:<page>(*SKIP)(*FAIL)|</page>)
Replace: $1
I won't explain what a quantifier is, as some answers do; that is not what this question is about, and I expect you to know the basics.
You need to exclude the keyword1 from matching between keyword1 and keyword3. Use
Find What: (?s)<page>((?:(?!<page>).)*?{{Infobox.*?)</page>
Replace with: \1
Here,
(?s) - a DOTALL modifier (same as if . matches newlines were ON)
<page> - keyword1 text
((?:(?!<page>).)*?{{Infobox.*?) - Group 1, which consists of:
(?:(?!<page>).)*? - any char, 0 or more occurrences but as few as possible, that does not start a <page> char sequence
{{Infobox - keyword3 text
.*? - any 0 or more chars, as few as possible
</page> - keyword2 text
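To see why excluding `<page>` matters, here is a small Python sketch comparing the original pattern with the tempered one. The sample text, the names `NAIVE` and `TEMPERED`, and the helper `extract_pages` are made up for illustration; Python's `re` module stands in for EmEditor's engine, and `re.S` plays the role of `(?s)`.

```python
import re

# Two tiny articles: only the second one contains {{Infobox.
SAMPLE = (
    "<page>\nplain article\n</page>\n"
    "<page>\nintro {{Infobox person}} body\n</page>\n"
)

# Original pattern: .*? may run past </page> and the next <page>
# while looking for {{Infobox.
NAIVE = re.compile(r"<page>(.*?\{\{Infobox.*?)</page>", re.S)

# Tempered pattern: each char is consumed only if it does not
# start another <page>, so a match stays inside one article.
TEMPERED = re.compile(r"<page>((?:(?!<page>).)*?\{\{Infobox.*?)</page>", re.S)

def extract_pages(pattern, text):
    # Re-wrap group 1 so the <page> tags are kept in the result.
    return ["<page>" + m.group(1) + "</page>" for m in pattern.finditer(text)]

naive_hits = extract_pages(NAIVE, SAMPLE)
tempered_hits = extract_pages(TEMPERED, SAMPLE)

print(len(naive_hits), len(tempered_hits))
```

The naive pattern returns one hit that swallows the article without an infobox as well; the tempered pattern returns only the second article.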

RegEx for multiple newlines with different tags

<p>HISTORY</p>
<p>1. Vicky Mears 1st dog's owner.</p>
<p>2. Paul Nash 2nd dog's owner.</p>
<p>3. Died 39 months of age</p>
</info>
I want to match from <p>HISTORY</p> through the closing </info> tag, including everything in between, no matter how many lines there are, as shown above.
Sorry for my English. I'm new here and had a hard time posting my code because it wouldn't show up; it took several edits to get this right, I guess. :(
The following regex will match exactly the content that you have described:
(?<=<p>HISTORY</p>)((.*\r\n)+)*(?=</info>)
If you need to extract it, you can use the backreference \1
If you use UNIX end of line, you can omit the \r
Cheers!
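As a rough check, here is a Python sketch of the same idea. It is a variant rather than the exact regex above: the inner group is non-capturing and lazy to avoid the nested quantifier, and `\r?\n` accepts both Windows and UNIX line endings. The sample string and variable names are assumptions for illustration.

```python
import re

SAMPLE = (
    "<p>HISTORY</p>\n"
    "<p>1. Vicky Mears 1st dog's owner.</p>\n"
    "<p>2. Paul Nash 2nd dog's owner.</p>\n"
    "<p>3. Died 39 months of age</p>\n"
    "</info>\n"
)

# Lookbehind anchors the match right after the HISTORY paragraph;
# whole lines are then consumed lazily until </info> comes next.
PATTERN = re.compile(r"(?<=<p>HISTORY</p>)((?:.*\r?\n)*?)(?=</info>)")

match = PATTERN.search(SAMPLE)
history_lines = match.group(1).strip().splitlines()

for line in history_lines:
    print(line)
```

Group 1 holds everything between the HISTORY paragraph and `</info>`, here the three numbered paragraphs.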

Regular expression group word and sentence

I would like to make a regular expression that does the following:
Gets the whole line of a text file
Gets the first word of that line
Outputs into an input
Currently I can do each of those separately but as one call it is getting hairy:
Whole Line
^\b(.*)\b
First Word
^\b(\w*)\b
Replace for Input
<div class="field"><label><input class="input-checkbox" id="Foo$1" name="Foo" type="checkbox" value="$1" /> <span>$1</span> </label></div>
I would like to use $1 and $2 to separate between the full line for the text display and the first word for the value and ID. Any thoughts? I really like regular expressions for their usefulness and speed as long as I don't hit a knowledge road block like this
Use the entire match:
Search: ^(\w+).*
Replace: First word is $1, whole line is $&
In your case, the replacement term would be:
<div class="field"><label><input class="input-checkbox" id="Foo$1" name="Foo" type="checkbox" value="$1" /> <span>$0</span> </label></div>
Note that the entire match in Atom is coded as $&, so write $& instead of $0 there.
Most other tools/languages use group zero, $0, for the entire match.
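The same idea can be sketched in Python, where the whole match is written `\g<0>` in the replacement string and the first word is `\1`. The sample lines and the names `TEMPLATE` and `to_checkboxes` are assumptions for illustration.

```python
import re

# \1 is the first word; \g<0> is the entire match (the whole line).
TEMPLATE = (
    r'<div class="field"><label><input class="input-checkbox" '
    r'id="Foo\1" name="Foo" type="checkbox" value="\1" /> '
    r'<span>\g<0></span> </label></div>'
)

def to_checkboxes(text):
    # re.M makes ^ match at the start of every line.
    return re.sub(r"^(\w+).*", TEMPLATE, text, flags=re.M)

html = to_checkboxes("Apple pie recipe\nBanana bread")
first_line = html.split("\n")[0]
print(first_line)
```

One capture group is enough: the first word comes from group 1 and the full line from the match itself, so no second group is needed.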

Regex: how to find new line in code

I have a lot of HTML files containing text without <p> tags in the code.
I tried find and replace with Adobe Brackets and Sublime Text 2:
Find <br><br>\n
Replace </p>\n<p>
But they do not find the \n in the code
Simplified, now I have:
Some sentence, some sentence<br><br>
(I have one space here in the code)
Some sentence, some sentence<br><br>
I would like to convert:
Some sentence, some sentence</p>
<p>Some sentence, some sentence</p>
(I know I will have to add manually just one <p> at the beginning, this is not important and it is not the point of this question)
Match a br followed by whitespace (in regex, \s includes \n, \r, \t, ...):
<br\s*\/?>\s*
You can then replace it with your string using a global search.
Edit: I saw that your replacement is not just a carriage return, which will be messy with my example.
I would go for two steps: replace every br with \n, then add your p elements by replacing runs of \n\s*.
Find: (.*)<br><br>\n?
Replace: <p>\1</p>\n
Input:
Some sentence, some sentence<br><br>
Some sentence, some sentence<br><br>
Output:
<p>Some sentence, some sentence</p>
<p>Some sentence, some sentence</p>
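A quick way to try this find/replace outside the editor is a short Python sketch. The function name `wrap_paragraphs` is made up for illustration, and the input is the two-line sample above without the stray space line from the question.

```python
import re

SAMPLE = (
    "Some sentence, some sentence<br><br>\n"
    "Some sentence, some sentence<br><br>\n"
)

def wrap_paragraphs(text):
    # Capture each line up to the double <br>, drop the optional
    # newline, and wrap the captured text in <p>...</p>.
    return re.sub(r"(.*)<br><br>\n?", r"<p>\1</p>\n", text)

result = wrap_paragraphs(SAMPLE)
print(result, end="")
```

Because `.` does not match a newline by default, `(.*)` stays on one line, so each sentence gets its own paragraph.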

Parsing with regular expressions

I have some text like
some text [http://abc.com/a.jpg] here will be long text
can be multiple line breaks again [http://a.com/a.jpg] here will be other text
blah blah
Which I need to transform into
<div>some text</div><img src="http://abc.com/a.jpg"/><div>here will be long text
can be multiple line breaks again</div><img src="http://a.com/a.jpg"/><div>here will be other text
blah blah</div>
To get the <img> tags, I replaced \[(.*?)\] with <img src="$1"/>, resulting in
some text<img src="http://abc.com/a.jpg"/>here will be long text
can be multiple line breaks again<img src="http://a.com/a.jpg"/>here will be other text
blah blah
However, I have no idea how to wrap the text in a <div>.
I'm doing everything on the iPhone with RegexKitLite
Here's the simplest approach:
Replace all occurrences of \[(.*?)\] with </div><img src="$1"/><div>
Prepend a <div>
Append a </div>
That does have a corner case where the result starts or ends with <div></div>, but this probably doesn't matter.
If it does, then:
Replace all occurrences of \](.*?)\[ with ]<div>$1</div>[
Replace all occurrences of \[(.*?)\] with <img src="$1"/>
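The simplest approach (replace, then prepend and append) can be sketched in Python as below. RegexKitLite is not used here; Python's `re` module stands in for it, `\1` replaces `$1`, and the function name `wrap_in_divs` is an assumption for illustration.

```python
import re

SAMPLE = (
    "some text [http://abc.com/a.jpg] here will be long text\n"
    "can be multiple line breaks again [http://a.com/a.jpg] here will be other text\n"
    "blah blah"
)

def wrap_in_divs(text):
    # Each [url] closes the current <div>, emits the image,
    # and opens the next <div>.
    body = re.sub(r"\[(.*?)\]", r'</div><img src="\1"/><div>', text)
    # Prepend and append the outermost tags.
    return "<div>" + body + "</div>"

result = wrap_in_divs(SAMPLE)
print(result)
```

Spaces next to the brackets stay inside the surrounding `<div>` elements; if the text starts or ends with a bracketed URL, the result contains an empty `<div></div>`, which is the corner case mentioned above.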