What's the best way to select all text between 2 comment tags? E.g.
<!-- Text 1
Text 2
Text 3
-->
<\!--.* will capture <!-- Text 1 but not Text 2, Text 3, or -->
Edit
As per Basti M's answer, <\!--((?:.*\n)*)--> will select everything between the first <!-- and last -->. I.e. lines 1 to 11 below.
How would I modify this to select just lines within separate tags? i.e. lines 1 to 4:
1 <!-- Text 1 //First
2 Text 2
3 Text 3
4 -->
5
6 More text
7
8 <!-- Text 4
9 Text 5
10 Text 6
11 --> //Last
Depending on your underlying engine, use the s-modifier (and add --> at the end of your expression).
This will make the . match newline characters as well.
If the s-flag is not available to you, you may use
<!--((?:.*\r?\n?)*)-->
Explanation:
<!-- #start of comment
( #start of capturing group
(?: #start of non-capturing group
.*\r?\n? #match every character including a line-break
)* #end of non-capturing group, repeated between zero and unlimited times
) #end of capturing group
--> #end of comment
To match multiple comment blocks you can use
/(?:<!--((?:.*?\r?\n?)*)-->)+/g
Demo @ Regex101
Use the s modifier to match new lines. E.g.:
/<!--(.*)-->/s
Demo: http://regex101.com/r/lH0jK9
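For the edited question (selecting each comment block separately rather than everything from the first <!-- to the last -->), here is a minimal sketch in Python. It assumes re.DOTALL as the counterpart of the s modifier and uses a lazy .*? so that each match stops at the nearest -->. The names COMMENT_RE and sample are illustrative only.

```python
import re

# DOTALL lets "." match newlines; the lazy ".*?" stops at the first "-->"
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)

sample = """<!-- Text 1
Text 2
Text 3
-->

More text

<!-- Text 4
Text 5
Text 6
-->"""

# One entry per comment block, without the <!-- and --> delimiters
blocks = COMMENT_RE.findall(sample)

for block in blocks:
    print(repr(block))
```

With a greedy (.*) instead, the same input would produce a single match spanning both blocks and the text between them.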
Regex is not the right tool for parsing HTML or XML; use a proper parser. I use XPath here:
$ cat file.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<test>
<!-- Text 1
Text 2
Text 3
-->
</test>
The test :
$ xmllint --xpath '/test/comment()' file.xml
<!-- Text 1
Text 2
Text 3
-->
If you are parsing HTML, use the --html switch.
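The same parser-based idea can be sketched in Python with the standard library's xml.dom.minidom, which keeps comment nodes in the tree. This is only an illustration and not a replacement for xmllint; the helper name iter_comments is hypothetical.

```python
from xml.dom import minidom

XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<test>
<!-- Text 1
Text 2
Text 3
-->
</test>
"""


def iter_comments(node):
    """Yield the text of every comment node below `node`, in document order."""
    for child in node.childNodes:
        if child.nodeType == minidom.Node.COMMENT_NODE:
            yield child.data
        else:
            yield from iter_comments(child)


comments = list(iter_comments(minidom.parseString(XML)))
print(comments)
```

Each entry holds the comment text without the <!-- and --> delimiters.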
I have several tags, and I want to select part of their content (leaving out some words) and replace it with something else. For example:
<title>WORD_1 WORD_2 | Blahhhhhh<title>
<title>WORD_3 WORD_4<title>
<title>WORD_5 WORD_6<title>
<title>WORD_7 WORD_8 | Dammmmmm <title>
The desired selection for replacement:
WORD_1 WORD_2
WORD_3 WORD_4
WORD_5 WORD_6
WORD_7 WORD_8
In other words, I want to select the content of each tag up to the second part (up to the |).
You could accomplish this using the following regex ...
(?<=<title>).*?(?=\||<title>)
(?<=<title>) looks behind for <title>
.*? matches any character, as few times as possible
(?=\||<title>) looks forward for | or <title>
see regex demo
EDIT 1 :
To keep only the words until | and delete all the tags ...
search with : .*?(?<=<title>)(.*?)(?=\||<title>).*
replace by : $1
EDIT 2 :
To keep only the words after | and delete all the tags ...
search with : .*?(?<=\|)(.*?)(?:\||<title>)
replace by : $1
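As a rough check of the first pattern and of EDIT 1, here is a sketch in Python's re module (an assumption; the answer does not name an engine). The sample reuses the question's lines, including their <title> closing markers as written. The matches carry a trailing space before the |, so the sketch strips them.

```python
import re

titles = (
    "<title>WORD_1 WORD_2 | Blahhhhhh<title>\n"
    "<title>WORD_3 WORD_4<title>\n"
    "<title>WORD_5 WORD_6<title>\n"
    "<title>WORD_7 WORD_8 | Dammmmmm <title>\n"
)

# Lookbehind for <title>, lazy match, lookahead for | or the next <title>
BEFORE_PIPE = re.compile(r"(?<=<title>).*?(?=\||<title>)")
selected = [m.strip() for m in BEFORE_PIPE.findall(titles)]

# EDIT 1: keep only the words before | and drop the tags ($1 is \1 in Python)
kept = re.sub(r".*?(?<=<title>)(.*?)(?=\||<title>).*", r"\1", titles)
kept_lines = [line.strip() for line in kept.splitlines()]

print(selected)
print(kept_lines)
```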
While the previous answer is good, I would suggest a faster (optimized) regex pattern:
(<title>).+?(?=\||<title>)
https://regex101.com/r/8gCnCy/1
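One difference worth noting: because this pattern consumes <title> instead of looking behind for it, the overall match includes the opening tag. A small sketch in Python (an assumed engine) shows this, together with a hypothetical variant, TEXT_RE, that moves the capturing group onto the text when only the words are wanted.

```python
import re

titles = (
    "<title>WORD_1 WORD_2 | Blahhhhhh<title>\n"
    "<title>WORD_3 WORD_4<title>\n"
    "<title>WORD_5 WORD_6<title>\n"
    "<title>WORD_7 WORD_8 | Dammmmmm <title>\n"
)

# Pattern as suggested: the whole match starts with the literal <title>
FAST_RE = re.compile(r"(<title>).+?(?=\||<title>)")
whole = [m.group(0) for m in FAST_RE.finditer(titles)]

# Hypothetical variant: capture only the text after the tag
TEXT_RE = re.compile(r"<title>(.+?)(?=\||<title>)")
texts = [t.strip() for t in TEXT_RE.findall(titles)]

print(whole)
print(texts)
```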
Performance comparison:
with PHP(PCRE) flavor:
(<title>).+?(?=\||<title>) - 4 matches, 260 steps (~229ms)
(?<=<title>).*?(?=\||<title>) - 4 matches, 433 steps (~288ms)
with Python flavor:
(<title>).+?(?=\||<title>) - 4 matches, 370 steps (~270ms)
(?<=<title>).*?(?=\||<title>) - 4 matches, 973 steps (~529ms)
I have a subtitle file of a movie, like below:
2
00:00:44,687 --> 00:00:46,513
Let's begin.
3
00:01:01,115 --> 00:01:02,975
Very good.
4
00:01:05,965 --> 00:01:08,110
What was your wife's name?
5
00:01:08,943 --> 00:01:12,366
- Mary.
- Mary, alright.
6
00:01:15,665 --> 00:01:18,938
He seeks the spirit
of Mary Browning.
7
00:01:20,446 --> 00:01:24,665
Mary, we invite you
into our circle.
8
00:01:28,776 --> 00:01:32,834
Mary Browning,
we invite you into our circle.
....
Now I want to match only the actual subtitle text content like,
- Mary.
- Mary, alright.
Or
He seeks the spirit
of Mary Browning.
including any special characters, numbers and/or newline characters it may contain. I don't want to match the timestamps or the serial numbers.
So basically I want to match every line in which numbers and special characters appear alongside letters, but not lines that consist only of numbers and special characters, such as the timestamps and serial numbers.
How can I match each subtitle with a regex and wrap it in a <font color="#FFFF00">[subtitle text any...]</font> tag?
That is, like below:
<font color="#FFFF00">He seeks the spirit
of Mary Browning.</font>
After analysing the file carefully, I worked out how to match all the subtitle text lines.
First, remove the unnecessary carriage-return characters (\r) from the subtitle (.srt) file.
Find: \r+
Replace with:
(nothing, i.e. an empty string)
Then match the lines that do not start with a digit and are not blank, and replace each with its own text wrapped in a <font> tag with the color value, as below:
Find: ^([^\d\n].*)
Replace with: <font color="#FFFF00">\1</font>
(The space after each colon is only for presentation and is not part of the pattern.)
Hope this helps everyone head-banging with subtitles everyday.
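The two find/replace steps above can be sketched in Python as follows. The function name colorize and the sample text are illustrative assumptions; re.MULTILINE is needed so that ^ matches at the start of every line. Note that this wraps each text line separately, and a subtitle line that itself begins with a digit would be skipped.

```python
import re


def colorize(srt_text: str, color: str = "#FFFF00") -> str:
    """Wrap every line that is neither blank nor digit-initial in a <font> tag."""
    # Step 1: drop carriage returns so every line ends in a bare \n
    text = re.sub(r"\r+", "", srt_text)
    # Step 2: wrap lines that do not start with a digit and are not empty
    return re.sub(
        r"^([^\d\n].*)",
        rf'<font color="{color}">\1</font>',
        text,
        flags=re.MULTILINE,
    )


sample = (
    "5\r\n"
    "00:01:08,943 --> 00:01:12,366\r\n"
    "- Mary.\r\n"
    "- Mary, alright.\r\n"
    "\r\n"
    "6\r\n"
    "00:01:15,665 --> 00:01:18,938\r\n"
    "He seeks the spirit\r\n"
    "of Mary Browning.\r\n"
)

result = colorize(sample)
print(result)
```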
Here is my simple text file:
1. Text About Question 1
2. Text About Question 2
.
.
20. Text About Question 20
I have 250 text files, each containing exactly 20 questions. I want to convert these files to XML by adding a "question" tag at the start of every numbered line, so that they look like:
<question>1. Text About Question 1
<question>2. Text About Question 2
.
.
<question>20. Text About Question 20<question>
I have tried this regex: find (\d{1}.) and replace with <question>\1, which only works for the numbers 1 to 9. From 10 onwards it splits the number, like:
1<question>0. Text About Question 10
As a second attempt, the regex (\d{2}.) only affects the numbers 10 to 20, so the result looks like:
1. Text About Question 1
2. Text About Question 2
.
.
<question>20. Text About Question 20</question>
I couldn't then follow up with (\d{1}.), because that regex adds the tag again to the numbers between 10 and 20, so the result looks like:
<question>1. Text About Question 1 </question>
<question>2. Text About Question 2</question>
.
.
<question><question>20. Text About Question 20</question>
Is there a proper way to tag each question from 1 to 20 using regex?
You want to match all numbers between 1 and 20. Here is the regex for that:
^[1-9]\.$|^1[0-9]\.$|^20\.$
Breakdown
^ - Start of line
[1-9] - Any digit between 1 and 9. Note 0 is not included
\. - An escaped (literal) period; unescaped, . would match any character
$ - End of line
| - Or
^1[0-9]\.$ - Starts with a 1 followed by one digit, i.e. 10 to 19
|^20\.$ - Or is exactly 20.
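With the $ anchors, the pattern above matches only a line that consists of the number and the period and nothing else. To tag whole question lines, one possible adaptation (a sketch in Python, not the answer's exact pattern) keeps the same 1 to 20 alternation but lets the rest of the line follow the period. The names QUESTION_RE and tag_questions are hypothetical.

```python
import re

# Same 1-20 alternation, then a literal period and the rest of the line
QUESTION_RE = re.compile(r"^((?:[1-9]|1[0-9]|20)\..*)$", re.MULTILINE)


def tag_questions(text: str) -> str:
    """Wrap each line that starts with 1. through 20. in <question> tags."""
    return QUESTION_RE.sub(r"<question>\1</question>", text)


sample = (
    "1. Text About Question 1\n"
    "2. Text About Question 2\n"
    ".\n"
    ".\n"
    "10. Text About Question 10\n"
    "20. Text About Question 20\n"
)

tagged = tag_questions(sample)
print(tagged)
```

Because the alternation is anchored at the start of the line, 10 is matched as a whole number and is not split after the 1.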
I am writing a Perl script, part of which captures data that does not begin with a number. I have tried (\w)\s+(\d+)\s+(\S+)\s+(\d.+). Below are some parts of the file (it is too big to include in full).
The text I want to capture is
BBACCap 8 N/A 48,46,44,42,40,38,36,34,32,
or can be
IG-XL_DataTool N/A N/A N/A
or
DC-30 1 N/A 1,0
The regex does match the data I need above; however, I am also capturing data that I don't want, such as
1 2
2 3, 4
and also (which I don't want)
1.0 BBAC-15 805-004-50 0301B5C5 0829-E 5445
aka: 805-004-02,805-004-03
but only E 5445
aka: 805-004-02,805-004-03 from the above.
Any help on this?
It's hard to be sure what you need, but it looks like you can split each line on whitespace and select just the first three fields, rejecting any line whose first field starts with a decimal digit.
Here's a demonstration which reads from files specified on the command line:
while ( <> ) {
    my @fields = split;               # split the line on whitespace
    next unless @fields;              # skip blank lines
    next if $fields[0] =~ /^[0-9]/;   # reject lines whose first field starts with a digit
    print "@fields[0..2]\n";          # print the first three fields
}
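For comparison, the same split-and-filter idea can be sketched in Python on the sample lines from the question. The function name first_three_fields is hypothetical.

```python
def first_three_fields(lines):
    """Return the first three whitespace-separated fields of each kept line."""
    kept = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and lines whose first field starts with a digit
        if not fields or fields[0][0].isdigit():
            continue
        kept.append(" ".join(fields[:3]))
    return kept


sample = [
    "BBACCap 8 N/A 48,46,44,42,40,38,36,34,32,",
    "IG-XL_DataTool N/A N/A N/A",
    "DC-30 1 N/A 1,0",
    "1 2",
    "2 3, 4",
    "1.0 BBAC-15 805-004-50 0301B5C5 0829-E 5445",
]

rows = first_three_fields(sample)
print(rows)
```

A continuation line such as "aka: 805-004-02,805-004-03" does not start with a digit, so this filter alone would still let it through; it would need an extra rule of its own.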