Regex add tag to subtitles

Regex add tag to subtitles - regex

I have a subtitle file of a movie, like below:
2
00:00:44,687 --> 00:00:46,513
Let's begin.
3
00:01:01,115 --> 00:01:02,975
Very good.
4
00:01:05,965 --> 00:01:08,110
What was your wife's name?
5
00:01:08,943 --> 00:01:12,366
- Mary.
- Mary, alright.
6
00:01:15,665 --> 00:01:18,938
He seeks the spirit
of Mary Browning.
7
00:01:20,446 --> 00:01:24,665
Mary, we invite you
into our circle.
8
00:01:28,776 --> 00:01:32,834
Mary Browning,
we invite you into our circle.
....
Now I want to match only the actual subtitle text content like,
- Mary.
- Mary, alright.
Or
He seeks the spirit
of Mary Browning.
including the special characters, numbers and/or newline characters they may contain. But I don't want to match the time string and serial numbers.
So basically I want to match all lines that contains numbers and special characters only with alphabets, not numbers and special characters which are alone on other lines like time-string and serial numbers.
How can I match and add tag <font color="#FFFF00">[subtitle text any...]</font> to each subtitle I matched with Regex's help ?
Means like below:
<font color="#FFFF00">He seeks the spirit
of Mary Browning.</font>

Well I just figured out by checking and analysing carefully, the key to match all the subtitle text lines.
First from any subtitle(.srt) file I have to remove unnecessary "line-feed" characters, i.e. \r.
Find: \r+
Replace with:
(nothing i.e. null character)
Then I just have to match those lines not starting with digits & newlines(i.e. blank lines) at all and then replace them with their own text wrapped around with <font> tag with color values as below:
Find: ^([^\d^\n].*)
Replace with: <font color="#FFFF00">\1</font>
(space after colon are just for better presentation and not included in code).
Hope this helps everyone head-banging with subtitles everyday.

Related

Regex for name with non-latin characters in python [duplicate]

For website validation purposes, I need first name and last name validation.
For the first name, it should only contain letters, can be several words with spaces, and has a minimum of three characters, but a maximum at top 30 characters. An empty string shouldn't be validated (e.g. Jason, jason, jason smith, jason smith, JASON, Jason smith, jason Smith, and jason SMITH).
For the last name, it should be a single word, only letters, with at least three characters, but at most 30 characters. Empty strings shouldn't be validated (e.g. lazslo, Lazslo, and LAZSLO).

Don't forget about names like:
Mathias d'Arras
Martin Luther King, Jr.
Hector Sausage-Hausen
This should do the trick for most things:
/^[a-z ,.'-]+$/i
OR Support international names with super sweet unicode:
/^[a-zA-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð ,.'-]+$/u

You make false assumptions on the format of first and last name. It is probably better not to validate the name at all, apart from checking that it is empty.

After going through all of these answers I found a way to build a tiny regex that supports most languages and only allows for word characters. It even supports some special characters like hyphens, spaces and apostrophes. I've tested in python and it supports the characters below:
^[\w'\-,.][^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]]{2,}$
Characters supported:
abcdefghijklmnopqrstwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
áéíóúäëïöüÄ'
陳大文
łŁőŐűŰZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųū
ÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁ
ŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ.-
ñÑâê都道府県Федерации
আবাসযোগ্য জমির걸쳐 있는

I have created a custom regex to deal with names:
I have tried these types of names and found working perfect
John Smith
John D'Largy
John Doe-Smith
John Doe Smith
Hector Sausage-Hausen
Mathias d'Arras
Martin Luther King
Ai Wong
Chao Chang
Alzbeta Bara
My RegEx looks like this:
^([a-zA-Z]{2,}\s[a-zA-Z]{1,}'?-?[a-zA-Z]{2,}\s?([a-zA-Z]{1,})?)
MVC4 Model:
[RegularExpression("^([a-zA-Z]{2,}\\s[a-zA-Z]{1,}'?-?[a-zA-Z]{2,}\\s?([a-zA-Z]{1,})?)", ErrorMessage = "Valid Charactors include (A-Z) (a-z) (' space -)") ]
Please note the double \\ for escape characters
For those of you that are new to RegEx I thought I'd include a explanation.
^ // start of line
[a-zA-Z]{2,} // will except a name with at least two characters
\s // will look for white space between name and surname
[a-zA-Z]{1,} // needs at least 1 Character
\'?-? // possibility of **'** or **-** for double barreled and hyphenated surnames
[a-zA-Z]{2,} // will except a name with at least two characters
\s? // possibility of another whitespace
([a-zA-Z]{1,})? // possibility of a second surname

I have searched and searched and played and played with it and although it is not perfect it may help others making the attempt to validate first and last names that have been provided as one variable.
In my case, that variable is $name.
I used the following code for my PHP:
if (preg_match('/\b([A-Z]{1}[a-z]{1,30}[- ]{0,1}|[A-Z]{1}[- \']{1}[A-Z]{0,1}
[a-z]{1,30}[- ]{0,1}|[a-z]{1,2}[ -\']{1}[A-Z]{1}[a-z]{1,30}){2,5}/', $name)
# there is no space line break between in the above "if statement", any that
# you notice or perceive are only there for formatting purposes.
#
# pass - successful match - do something
} else {
# fail - unsuccessful match - do something
I am learning RegEx myself but I do have the explanation for the code as provided by RegEx buddy.
Here it is:
Assert position at a word boundary «\b»
Match the regular expression below and capture its match into backreference number 1
«([A-Z]{1}[a-z]{1,30}[- ]{0,1}|[A-Z]{1}[- \']{1}[A-Z]{0,1}[a-z]{1,30}[- ]{0,1}|[a-z]{1,2}[ -\']{1}[A-Z]{1}[a-z]{1,30}){2,5}»
Between 2 and 5 times, as many times as possible, giving back as needed (greedy) «{2,5}»
* I NEED SOME HELP HERE WITH UNDERSTANDING THE RAMIFICATIONS OF THIS NOTE *
Note: I repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «{2,5}»
Match either the regular expression below (attempting the next alternative only if this one fails) «[A-Z]{1}[a-z]{1,30}[- ]{0,1}»
Match a single character in the range between “A” and “Z” «[A-Z]{1}»
Exactly 1 times «{1}»
Match a single character in the range between “a” and “z” «[a-z]{1,30}»
Between one and 30 times, as many times as possible, giving back as needed (greedy) «{1,30}»
Match a single character present in the list “- ” «[- ]{0,1}»
Between zero and one times, as many times as possible, giving back as needed (greedy) «{0,1}»
Or match regular expression number 2 below (attempting the next alternative only if this one fails) «[A-Z]{1}[- \']{1}[A-Z]{0,1}[a-z]{1,30}[- ]{0,1}»
Match a single character in the range between “A” and “Z” «[A-Z]{1}»
Exactly 1 times «{1}»
Match a single character present in the list below «[- \']{1}»
Exactly 1 times «{1}»
One of the characters “- ” «- » A ' character «\'»
Match a single character in the range between “A” and “Z” «[A-Z]{0,1}»
Between zero and one times, as many times as possible, giving back as needed (greedy) «{0,1}»
Match a single character in the range between “a” and “z” «[a-z]{1,30}»
Between one and 30 times, as many times as possible, giving back as needed (greedy) «{1,30}»
Match a single character present in the list “- ” «[- ]{0,1}»
Between zero and one times, as many times as possible, giving back as needed (greedy) «{0,1}»
Or match regular expression number 3 below (the entire group fails if this one fails to match) «[a-z]{1,2}[ -\']{1}[A-Z]{1}[a-z]{1,30}»
Match a single character in the range between “a” and “z” «[a-z]{1,2}»
Between one and 2 times, as many times as possible, giving back as needed (greedy) «{1,2}»
Match a single character in the range between “ ” and “'” «[ -\']{1}»
Exactly 1 times «{1}»
Match a single character in the range between “A” and “Z” «[A-Z]{1}»
Exactly 1 times «{1}»
Match a single character in the range between “a” and “z” «[a-z]{1,30}»
Between one and 30 times, as many times as possible, giving back as needed (greedy) «{1,30}»
I know this validation totally assumes that every person filling out the form has a western name and that may eliminates the vast majority of folks in the world. However, I feel like this is a step in the proper direction. Perhaps this regular expression is too basic for the gurus to address simplistically or maybe there is some other reason that I was unable to find the above code in my searches. I spent way too long trying to figure this bit out, you will probably notice just how foggy my mind is on all this if you look at my test names below.
I tested the code on the following names and the results are in parentheses to the right of each name.
STEVE SMITH (fail)
Stev3 Smith (fail)
STeve Smith (fail)
Steve SMith (fail)
Steve Sm1th (passed on the Steve Sm)
d'Are to Beaware (passed on the Are to Beaware)
Jo Blow (passed)
Hyoung Kyoung Wu (passed)
Mike O'Neal (passed)
Steve Johnson-Smith (passed)
Jozef-Schmozev Hiemdel (passed)
O Henry Smith (passed)
Mathais d'Arras (passed)
Martin Luther King Jr (passed)
Downtown-James Brown (passed)
Darren McCarty (passed)
George De FunkMaster (passed)
Kurtis B-Ball Basketball (passed)
Ahmad el Jeffe (passed)
If you have basic names, there must be more than one up to five for the above code to work, that are similar to those that I used during testing, this code might be for you.
If you have any improvements, please let me know. I am just in the early stages (first few months of figuring out RegEx.
Thanks and good luck,
Steve

I've tried almost everything on this page, then I decided to modify the most voted answer which ended up working best. Simply matches all languages and includes .,-' characters.
Here it is:
/^[\p{L} ,.'-]+$/u

First name would be
"([a-zA-Z]{3,30}\s*)+"
If you need the whole first name part to be shorter than 30 letters, you need to check that seperately, I think. The expression ".{3,30}" should do that.
Your last name requirements would translate into
"[a-zA-Z]{3,30}"
but you should check these. There are plenty of last names containing spaces.

As maček said:
Don't forget about names like:
Mathias d'Arras
Martin Luther King, Jr.
Hector Sausage-Hausen
and to remove cases like:
..Mathias
Martin king, Jr.-
This will cover more cases:
^([a-z]+[,.]?[ ]?|[a-z]+['-]?)+$

This regex work for me (was using in Angular 8) :
([a-zA-Z',.-]+( [a-zA-Z',.-]+)*){2,30}
It will be invalid if there is:-
Any whitespace start or end of the name
Got symbols e.g. #
Less than 2 or more than 30
Example invalid First Name (whitespace)
Example valid First Name :

I'm working on the app that validates International Passports (ICAO). We support only english characters. While most foreign national characters can be represented by a character in the Latin alphabet e.g. è by e, there are several national characters that require an extra letter to represent them such as the German umlaut which requires an ‘e’ to be added to the letter e.g. ä by ae.
This is the JavaScript Regex for the first and last names we use:
/^[a-zA-Z '.-]*$/
The max number of characters on the international passport is up to 31.
We use maxlength="31" to better word error messages instead of including it in the regex.
Here is a snippet from our code in AngularJS 1.6 with form and error handling:
class PassportController {
constructor() {
this.details = {};
// English letters, spaces and the following symbols ' - . are allowed
// Max length determined by ng-maxlength for better error messaging
this.nameRegex = /^[a-zA-Z '.-]*$/;
}
}
angular.module('akyc', ['ngMessages'])
.controller('PassportController', PassportController);
.has-error p[ng-message] {
color: #bc111e;
}
.tip {
color: #535f67;
}
<script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.6.6/angular.min.js"></script>
<script src="https://code.angularjs.org/1.6.6/angular-messages.min.js"></script>
<main ng-app="akyc" ng-controller="PassportController as $ctrl">
<form name="$ctrl.form">
<div name="lastName" ng-class="{ 'has-error': $ctrl.form.lastName.$invalid} ">
<label for="pp-last-name">Surname</label>
<div class="tip">Exactly as it appears on your passport</div>
<div ng-messages="$ctrl.form.lastName.$error" ng-if="$ctrl.form.$submitted" id="last-name-error">
<p ng-message="required">Please enter your last name</p>
<p ng-message="maxlength">This field can be at most 31 characters long</p>
<p ng-message="pattern">Only English letters, spaces and the following symbols ' - . are allowed</p>
</div>
<input type="text" id="pp-last-name" ng-model="$ctrl.details.lastName" name="lastName"
class="form-control" required ng-pattern="$ctrl.nameRegex" ng-maxlength="31" aria-describedby="last-name-error" />
</div>
<button type="submit" class="btn btn-primary">Test</button>
</form>
</main>

Read almost all highly voted posts (only some are good). After understanding the problem in detail & doing research, here are the tight regexes:
1). ^[A-Z][a-z]*(([,.] |[ '-])[A-Za-z][a-z]*)*(\.?)$
name Z is allowed contrary to the assumption made by some in the thread.
No leading or trailing spaces are allowed, empty string is NOT allowed, string containing only spaces is NOT allowed
Supports English alphabets only
Supports hyphens (Some-Foobarbaz-name, Some foobarbaz-Name), apostrophes (David D'Costa, David D'costa, David D'costa R'Costa p'costa), periods (Dr. L. John, Robert Downey Jr., Md. K. P. Asif) and commas (Martin Luther, Jr.).
First alphabet of only the first word of a name MUST be capital.
NOT Allowed: John sTeWaRT, JOHN STEWART, Md. KP Asif, John Stewart PhD
Allowed: John Stewart, John stewart, Md. K P Asif
you can easily modify this condition.
If you also want to allow names like Queen Elizabeth 2 or Henry IV:
2). ^[A-Z][a-z]*(([,.] |[ '-])[A-Za-z][a-z]*)*([.]?| (-----)| [1-9][0-9]*)$
replace ----- with roman numeral's regex (which itself is long) OR you can use this alternative regex which is based on KISS philosophy [IVXLCDM]+ (here I, V, X, ... in ANY random order will satisfy the regex).
I personally suggest to use this regex:
3). ^[A-Z][a-z]*(([,.] |[ '-])[A-Za-z][a-z]*)*(\.?)( [IVXLCDM]+)?$
Feel free to try this regex HERE & make any modifications of your choice.
I have provided with tight regex which covers every possible name I found on my research with no bug. Modify these regexes to relax some of the unwanted constraints.
[UPDATE - March, 2022]
Here are 4 more regexes:
^[A-Za-z]+(([,.] |[ '-])[A-Za-z]+)*([.,'-]?)$
^((([,.'-]| )(?<!( {2}|[,.'-]{2})))*[A-Za-z]+)+[,.'-]?$
^( ([A-Za-z,.'-]+|$))+|([A-Za-z,.'-]+( |$))+$
^(([ ,.'-](?<!( {2}|[,.'-]{2})))*[A-Za-z])+[ ,.'-]?$
It's been a while since I looked back at these 4 regexes so I forgot their specifications. These 4 regexes are not tight, unlike the previous ones but do the job very well. These regexes distinguish 3 parts of a name: English alphabet, space and special character. Which one you need out of these 4 depends on your answer (Yes/No) to these questions:
have at least 1 alphabet?
can start with a space or a special character?
can end with a space or a special character?
are 2 consecutive spaces allowed?
are 2 consecutive special characters allowed?
Note: name validation should ONLY serve as a warning NOT a necessity a name should fulfill because there is no fixed naming pattern, if there is one it can change overnight and thus, any tight regex you come across will become obsolete somewhere in future.

There is one issue with the top voted answer here which recommends this regex:
/^[a-z ,.'-]+$/i
It takes spaces only as a valid name!
The best solution in my opinion is to add a negative look forward to the beginning:
^(?!\s)([a-z ,.'-]+)$/i

I use:
/^(?:[\u00c0-\u01ffa-zA-Z'-]){2,}(?:\s[\u00c0-\u01ffa-zA-Z'-]{2,})+$/i
And test for maxlength using some other means

I didn't find any answer helpful for me simply because users can pick a non-english name and simple regex are not helpful. In fact it's actually very hard to find the right expression that works for all languages.
Instead, I picked a different approach and negated all characters that should not be in the name for the valid match. Below pattern negates numerical, special characters, control characters and '\', '/'
Final regex
without punctuations: ["] ['] [,] [.], etc. :
^([^\p{N}\p{S}\p{C}\p{P}]{2,20})$
with punctuations:
^([^\p{N}\p{S}\p{C}\\\/]{2,20})$
With this, all these names are valid:
alex junior
沐宸
Nick
Sarah's Jane ---> with punctuation support
ביממה
حقیقت
Виктория
And following names become invalid:
🤣 Maria
k
١١١١١
123John
This means all names that don't have numerical characters, emojis, \ and are between 2-20 characters are allowed. You can edit the above regex if you want to add more characters to exclusion list.
To get more information about available patterns to include / exclude checkout this:
https://www.regular-expressions.info/unicode.html#prop

^\p{L}{2,}$
^ asserts position at start of a line.
\p{L} matches any kind of letter from any language
{2,} Quantifier — Matches between 2 and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
So it should be a name in any language containing at least 2 letters(or symbols) without numbers or other characters.

If you are searching a simplest way, just check almost 2 words.
/^[^\s]+( [^\s]+)+$/
Valid names
John Doe
pedro alberto ch
Ar. Gen
Mathias d'Arras
Martin Luther King, Jr.
No valid names
John
陳大文

For simplicities sake, you can use:
(.*)\s(.*)
The thing I like about this is that the last name is always after the first name, so if you're going to enter this matched groups into a database, and the name is John M. Smith, the 1st group will be
John M., and the 2nd group will be Smith.

So, with customer we create this crazy regex:
(^$)|(^([^\-!#\$%&\(\)\*,\./:;\?#\[\\\]_\{\|\}¨ˇ“”€\+<=>§°\d\s¤®™©]| )+$)

For first and last names theres are really only 2 things you should be looking for:
Length
Content
Here is my regular expression:
var regex = /^[A-Za-z-,]{3,20}?=.*\d)/
1. Length
Here the {3,20} constrains the length of the string to be between 3 and 20 characters.
2. Content
The information between the square brackets [A-Za-z] allows uppercase and lowercase characters. All subsequent symbols (-,.) are also allowed.

The following expression will work on any language supported by UTF-16 and will ensure that there's a minimum of two components to the name (i.e. first + last), but will also allow any number of middle names.
/^(\S+ )+\S+$/u
At the time of this writing it seems none of the other answers meet all of that criteria. Even ^\p{L}{2,}$, which is the closest, falls short because it will also match "invisible" characters, such as U+FEFF (Zero Width No-Break Space).

Try these solutions, for maximum compatibility, as I have already posted here:
JavaScript:
var nm_re = /^(?:((([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-.\s])){1,}(['’,\-\.]){0,1}){2,}(([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-. ]))*(([ ]+){0,1}(((([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-\.\s])){1,})(['’\-,\.]){0,1}){2,}((([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-\.\s])){2,})?)*)$/;
HTML5:
<input type="text" name="full_name" id="full_name" pattern="^(?:((([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-.\s])){1,}(['’,\-\.]){0,1}){2,}(([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-. ]))*(([ ]+){0,1}(((([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-\.\s])){1,})(['’\-,\.]){0,1}){2,}((([^0-9_!¡?÷?¿/\\+=##$%ˆ&*(){}|~<>;:[\]'’,\-\.\s])){2,})?)*)$" required>

This is what I use.
This regex accepts only names with minimum characters, from A-Z a-z ,space and -.
Names example:
Ionut Ionete, Ionut-Ionete Cantemir, Ionete Ionut-Cantemirm Ionut-Cantemir Ionete-Second
The limit of name's character is 3. If you want to change this, modify {3,} to {6,}
([a-zA-Z\-]+){3,}\s+([a-zA-Z\-]+){3,}

This seems to do the job for me:
[\S]{2,} [\S]{2,}( [\S]{2,})*

I usually write:
return /^[a-zA-Z\-\s\.\'\`\u00E0-\u00FC]+$/.test(firstName);

Fullname with only one whitespace:
^[a-zA-Z'\-\pL]+(?:(?! {2})[a-zA-Z'\-\pL ])*[a-zA-Z'\-\pL]+$

A simple function using preg_match in php
<?php
function name_validation($name) {
if (!preg_match("/^[a-zA-Z ]*$/", $name) === false) {
echo "$name is a valid name";
} else {
echo "$name is not a valid name";
}
}
//Test
name_validation('89name');
?>

If you want the whole first name to be between 3 and 30 characters with no restrictions on individual words, try this :
[a-zA-Z ]{3,30}
Beware that it excludes all foreign letters as é,è,à,ï.
If you want the limit of 3 to 30 characters to apply to each individual word, Jens regexp will do the job.

var name = document.getElementById('login_name').value;
if ( name.length < 4 && name.length > 30 )
{
alert ( 'Name length is mismatch ' ) ;
}
var pattern = new RegExp("^[a-z\.0-9 ]+$");
var return_value = var pattern.exec(name);
if ( return_value == null )
{
alert ( "Please give valid Name");
return false;
}

re.findall between two strings (but dismiss numeric digits)

I am trying to parse many txt files. The following textis just a part of a bigger txt files.
<P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0; text-align: justify">Prior to this primary offering, there has
been no public market for our common stock. We anticipate that the public offering price of the shares will be between $5.00 and
$6.00. We have applied to list our common stock on the Nasdaq Capital Market (“Nasdaq”) under the symbol “HYRE.”
If our application is not approved or we otherwise determine that we will not be able to secure the listing of our common stock
on the Nasdaq, we will not complete this primary offering.</P>
My desired output: be between $5.00 and and $6.00. So, I need to extract anything between the be betweenuntil the following . (but not taking into account the decimal 5.00 point!). I tried the following (Python 3.7):
shareprice = re.findall(r"be between\s\$.+?\.", text, re.DOTALL)
But this code gives me: be between $5. (stops at the decimal point). I initially add a \s at the end of the string to require a white space after the . which would keep the 5.00 point decimal, but many other txt files do not have a white space right after the ending . of the sentence.
Is there anyway I can specify in my string that I want to "skip" numeric digits after the \.?
Thank you very much. I hope it was clear.
Best

After parsing the plain text out of the HTML, you may consider matching any 0+ chars as few as possible followed with a . that is not followed with a digit:
r"be between\s*\$.*?\.(?!\d)"
See the regex demo.
Alternatively, if you only want to ignore the dot STRICTLY in between two digits you may use
r"be between\s*\$.*?\.(?!(?<=\d\.)\d)"
See this regex demo. The (?!(?<=\d\.)\d) makes sure the \d\.\d pattern is skipped up to the first matching ., and not just \.\d.

Adding characters in front and end of specific subtitle lines in Notepad++?

I want to add a dash in front of a continuing subtitle line. Like this:
Example sub (.srt):
1
00:00:48,966 --> 00:00:53,720
Today he was so angry and happy
at the same time,
2
00:00:53,929 --> 00:00:57,683
he went to the store and bought a
couple of books. Then the walked home
3
00:00:57,849 --> 00:01:01,102
with joy and jumped in the pool.
4
00:00:57,849 --> 00:01:01,102
One day he was in a bad mood and he
didn't get happier when he read.
TO THIS:
1
00:00:48,966 --> 00:00:53,720
Today he was so angry and happy
at the same time-
2
00:00:53,929 --> 00:00:57,683
-he went to the store and bought a
couple of books. Then the walked home-
3
00:00:57,849 --> 00:01:01,102
-with joy and jumped in the pool.
4
00:00:57,849 --> 00:01:01,102
One day he was in a bad mood and he
didn't get happier when he read.
The original subtitle is in Swedish. This is the standard for scandinavian subtitles.
How do I format it with regex in Notepad++? How should I write the tags and what if the subtitle contains italic tags in front and end?

You can use this regex with the g and m modifiers:
(?:,|([^.?!]<[^>]+>|[^>.?!]))$(\n\n.*\n.*\n)
Use $1-$2- as the substitution.
I'm using a simple definition of sentence. If there is one of .?!, that's counted as the end of a sentence. While this may not be a perfect definition, you're only looking at the ends of sentences.
Depending on several factors (for example, a line ending in ), you may need to tweak it a little.
Essentially, the regex is two parts.
The first part matches one of three things at the end of a line. If it matches a comma, that comma is removed. Otherwise, it looks to see if the last letter (if there is a tag, the letter before that) is NOT any of .?!.
The second part matches all the lines before the one that needs the dash. This also helps ensure that the end of the line you just matched is followed by a new line (and not more text).

How can i change mulitple line within a tag into one line in notepad++

These are 2 lines within P tag:
<p>2 years of teaching experience on English second paper from class six to ten in Biddapath and Onnesha Coaching Centre, Mirpur 1.
Moreover, I have one year of experience on Private teaching in my home district. I had 3 batches of about 40 students, class 9,10 & 11 on English Second Paper.</p>
And here is a string with 3 lines within p tag:
<p>2 years of teaching experience on English second paper from class six to ten in Biddapath and Onnesha Coaching Centre, Mirpur 1.
Moreover,
I have one year of experience on Private teaching in my home district. I had 3 batches of about 40 students, class 9,10 & 11 on English Second Paper.</p>
There are strings with 4 lines within p tag.
I want to change all lines within p into one single lines.
Is that possible with Notepad++ regex?

The regex to remove all linebreaks inside a non-nested p tag, is
(<p>|(?!^)\G)(?:(?!</?p>)[^\r\n])*\K\R+
See this regex demo.
In short,
(<p>|(?!^)\G) - Group 1 matching a <p> or the end of the previous successful match
(?:(?!</?p>)[^\r\n])* - matches 0+ sequences of any char other than a LF/CR that does not start a </p> or <p> sequence
\K - omits the matched text
\R+ - what we only match is this, 1+ newline sequences (CRLFs, CRs or LFs).
NOTE that unrolling the pattern as
(<p>|\G(?!^))[^\r\n<]*(?:<(?!/?p>)[^\r\n<]*)*\K\R+
will boost the S&R performance (see demo).
VVVVVVV

Regex to match specific-length string with white space in the middle (anywhere)

I need a regex which will match a phrase (with specific length and structure) even if there is additional white space in the middle (anywhere).
Let's say we have some description:
Serial numbers: ABC1234567890 XYZ0987654321
Then we want to find all phrases matching to regex [A-Z]{3}[0-9]{10}, but that description is malformed because of processing by external service. That service splits description to chunks, 12 digits each. So it will be:
Serial numbe
rs: ABC12345
67890 XYZ098
7654321
Important: "Serial numbers:" isn't fixed, it can be everything, so required phrases can be split anywhere (ABC1 234567890, ABC1234567 890 etc.). New line and space have the same meaning from the phrase matching perspective, but in special cases there can be more white chars between parts of phrase (for example, space as last char of chunk + new line, multiple spaces in source description). It just simply should treat whole "white space" between two strings as 1 space (ABC1 234567890 = ABC1234 567890, also with new line break). Those serials can be anywhere in malformed description (as I wrote: "Serial numbers:" part is optional, can be anything), also there can be more serial numbers within description. [A-Z]{3}[0-9]{10} also is only an example, I want to know how to achieve matching with optional white space in the middle, but base regex can be different.
EXPECTED RESULT: collection of matched phrases (serial numbers from the example).
ABC1234567890
XYZ0987654321
Info: result can contain white chars within phrase (from above example it would be: ABC12345 67890 and XYZ098 7654321). Most important thing is to match the base phrase (serial number).
Is it possible to make regex which will match it? I think it would be rather simple algorithm to match it without regex, but maybe it can be done with regular expression and make it "oneliner".

this will handle multiple spaces multiple times
(([A-Z]\s*){3}([0-9]\s*){10})
will match AB C A A A A AD E12 34567890
since AD E12 34567890 fits the pattern
https://regex101.com/r/bK3sF8/1

Edit:
Just considering one(you can adjust for multiples) \n (break lines) in and outside the word here:([\w\n?]*)
You should try grouping the result
in this case:
/(([\w\n?]*)\s([\w\n?]*):\s([\w\n?]*)\n?\n?\s([\w\n]*))/ig
you can get the serial number by groups $3 and $4
http://regexr.com/3d67n

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex add tag to subtitles - regex

Related

Regex for name with non-latin characters in python [duplicate]

re.findall between two strings (but dismiss numeric digits)

Adding characters in front and end of specific subtitle lines in Notepad++?

How can i change mulitple line within a tag into one line in notepad++

Regex to match specific-length string with white space in the middle (anywhere)

Categories

Resources