Adding characters in front and end of specific subtitle lines in Notepad++? - regex

I want to add a dash in front of a continuing subtitle line. Like this:
Example sub (.srt):
1
00:00:48,966 --> 00:00:53,720
Today he was so angry and happy
at the same time,
2
00:00:53,929 --> 00:00:57,683
he went to the store and bought a
couple of books. Then the walked home
3
00:00:57,849 --> 00:01:01,102
with joy and jumped in the pool.
4
00:00:57,849 --> 00:01:01,102
One day he was in a bad mood and he
didn't get happier when he read.
TO THIS:
1
00:00:48,966 --> 00:00:53,720
Today he was so angry and happy
at the same time-
2
00:00:53,929 --> 00:00:57,683
-he went to the store and bought a
couple of books. Then the walked home-
3
00:00:57,849 --> 00:01:01,102
-with joy and jumped in the pool.
4
00:00:57,849 --> 00:01:01,102
One day he was in a bad mood and he
didn't get happier when he read.
The original subtitle is in Swedish. This is the standard for Scandinavian subtitles.
How do I format it with regex in Notepad++? How should I write the pattern, and what if a subtitle line has italic tags at its start or end?

You can use this regex with the g and m modifiers:
(?:,|([^.?!]<[^>]+>|[^>.?!]))$(\n\n.*\n.*\n)
Use $1-$2- as the substitution.
I'm using a simple definition of sentence. If there is one of .?!, that's counted as the end of a sentence. While this may not be a perfect definition, you're only looking at the ends of sentences.
Depending on several factors (for example, a line ending in ), you may need to tweak it a little.
Essentially, the regex is two parts.
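The first part matches one of three things at the end of a line. If it matches a comma, that comma is removed (it is not captured, so it does not appear in the substitution). Otherwise, it checks that the last character (or, if the line ends with a tag, the character before that tag) is NOT any of .?!.
The second part matches everything between that line and the one that needs the leading dash: the blank line, the next subtitle's number, and its timestamp. This also helps ensure that the end of the line you just matched is followed by a new line (and not more text). Note that it expects one blank line between subtitle blocks, as in a regular .srt file.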
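Outside Notepad++, the same substitution can be tried in a script. The following is only a minimal Python sketch of the answer's pattern, assuming "\n" line endings and one blank line between subtitle blocks; the function name add_continuation_dashes and the sample text are made up for illustration.

```python
import re

# Pattern from the answer. re.MULTILINE plays the role of the "m" modifier,
# and re.sub replaces every match, which corresponds to the "g" modifier.
PATTERN = re.compile(r"(?:,|([^.?!]<[^>]+>|[^>.?!]))$(\n\n.*\n.*\n)", re.MULTILINE)


def add_continuation_dashes(srt: str) -> str:
    """Put a dash after a line that ends mid-sentence and before the next subtitle's text."""
    # Assumption: the pattern only knows "\n", so Windows line endings are normalised first.
    srt = srt.replace("\r\n", "\n")
    # \1 is the kept last character (empty when a comma was matched), \2 the lines in between.
    return PATTERN.sub(r"\1-\2-", srt)


# Hypothetical sample, modelled on the question's example.
sample = (
    "1\n00:00:48,966 --> 00:00:53,720\n"
    "Today he was so angry and happy\nat the same time,\n\n"
    "2\n00:00:53,929 --> 00:00:57,683\n"
    "he went to the store and bought a\ncouple of books. Then he walked home\n\n"
    "3\n00:00:57,849 --> 00:01:01,102\n"
    "with joy and jumped in the pool.\n"
)
print(add_continuation_dashes(sample))
```

Lines ending in an italic tag are covered by the first alternative inside the capture group, but they are not exercised by this sample, so that case should be checked against real files.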

Related

Regex: Only match if one word of both wordlist are present

I am trying to apply a regex filter to news headlines. I want the filter to match only if at least one word from each of the two wordlists is present in the headline. Furthermore, it should generate only one match per headline (not multiple matches if several tokens apply).
These are my wordlists (and my regex, which currently doesn't work):
(Threat actor|attack|skimm|malware|exploit|fraud|inject|trojan|ransom|\bRCE\b)+(\bATM\b|bank|\bAustria\b)
The regex should match only if "ATM", "bank" or "Austria" AND a word from the first list (in the first pair of parentheses) are present in the news headline, not if only "ATM", ... is present.
Example: A match should appear only if both "exploit" AND "ATM" are found in the headline.
Given the four headlines below, only headline 2 should return a match.
1. An APT Group Exploiting a 0-day in FatPipe WARP, MPVPN, and IPVPN Software
2. Ares Malware: The Grandson of the Kronos Banking Trojan that targets German Flag of Germany Banks.
3. In human-operated ransomware attacks, threat actors use predictable methods to enter a device but eventually rely on hands-on-keyboard activities.
4. Kotak Mahindra Bank launches new transactions across India
Example 1 has only a word from the first list. Example 4 has only a word from the second list.
Only example 2 has occurrences of words from BOTH lists.
Example 3 also has two occurrences from the first list, but none from the second list, therefore NO MATCH.
I would be very grateful if you could provide a working regex filter for this case.
Regards, Michael
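You could match the two groups in either order (with the case-insensitive flag enabled, since the headlines capitalise words such as "Malware" and "Banking"):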
(Threat actor|attack|skimm|malware|exploit|fraud|inject|trojan|ransom|\bRCE\b).*(\bATM\b|bank|\bAustria\b)|(\bATM\b|bank|\bAustria\b).*(Threat actor|attack|skimm|malware|exploit|fraud|inject|trojan|ransom|\bRCE\b)
Regex demo
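To check the idea against the four headlines from the question, here is a small Python sketch. It builds the "either order" pattern from the two wordlists and relies on re.IGNORECASE; the names LIST_A, LIST_B, BOTH and mentions_both are illustrative only, and each headline is assumed to be a single line of text.

```python
import re

# Wordlists copied from the question.
LIST_A = r"Threat actor|attack|skimm|malware|exploit|fraud|inject|trojan|ransom|\bRCE\b"
LIST_B = r"\bATM\b|bank|\bAustria\b"

# A term from one list, any text, then a term from the other list, in either order.
BOTH = re.compile(
    rf"(?:{LIST_A}).*(?:{LIST_B})|(?:{LIST_B}).*(?:{LIST_A})",
    re.IGNORECASE,
)


def mentions_both(headline: str) -> bool:
    # re.search stops at the first hit, so each headline yields at most one match.
    return BOTH.search(headline) is not None


headlines = [
    "An APT Group Exploiting a 0-day in FatPipe WARP, MPVPN, and IPVPN Software",
    "Ares Malware: The Grandson of the Kronos Banking Trojan that targets German Flag of Germany Banks.",
    "In human-operated ransomware attacks, threat actors use predictable methods to enter a device "
    "but eventually rely on hands-on-keyboard activities.",
    "Kotak Mahindra Bank launches new transactions across India",
]
for number, headline in enumerate(headlines, start=1):
    print(number, mentions_both(headline))
```

Using re.search rather than re.findall is what keeps the result to a single yes/no answer per headline.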

Add carriage return using regex pattern of any number

I have text that looks like this:
3 Q I think I started out, I said when4 you first noticed
the oyster beds, it sounded5 like it didn't really concern you, you did not6 believe that the dredging material or the berm7 building material could reach the oyster beds?8 A That's correct.9 Q
I need output that starts a new line before each numeric sequence (a number such as "10" should count as one match, not as separate matches for 1 and 0) and looks like this (minus the spaces I had to put between each line):
3 Q I think I started out, I said when
4 you first noticed the oyster beds, it sounded
5 like it didn't really concern you, you did not
6 believe that the dredging material or the berm
7 building material could reach the oyster beds?
8 A That's correct.
9 Q
Here, we might just want to capture the (\d+), then replace it with a new line and $1:
RegEx
If this expression wasn't desired, it can be modified/changed in regex101.com.
Demo
We can try matching on the pattern:
(?<=.)(\d+)
This says to match and capture a number of any size, provided that at least one character precedes it, i.e. it does not sit at the very start of the text. This avoids adding an unwanted newline before the first line beginning with 3. Then, we can replace with a newline followed by that captured number. Here is a working script:
Dim regex As Regex = new Regex("(?<=.)(\d+)")
Console.WriteLine(regex.Replace("1 stuff10 more stuff", vbCrLf & "$1"))
This outputs:
1 stuff
10 more stuff
Be certain to include Imports System.Text.RegularExpressions for the Regex class and Imports Microsoft.VisualBasic to be able to use vbCrLf in your code.
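For comparison, the same replacement can be sketched in Python. This version deliberately swaps the lookbehind (?<=.) for (?<=\D), which is my own variation rather than part of either answer: requiring a non-digit before the number also keeps a multi-digit number at the very start of the text (for example "12 ...") from being split. The function name break_before_numbers is made up for illustration.

```python
import re


def break_before_numbers(text: str) -> str:
    """Insert a newline before every number that directly follows a non-digit character."""
    # (?<=\D) needs a non-digit right before the number, so a number at the very
    # start of the text is left alone and digits inside one number are never split.
    return re.sub(r"(?<=\D)(\d+)", r"\n\1", text)


print(break_before_numbers("1 stuff10 more stuff"))
```

Any digit inside ordinary words (such as "0-day") would also trigger a line break, so this only suits text where the embedded numbers really are line numbers.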

Regex add tag to subtitles

I have a subtitle file of a movie, like below:
2
00:00:44,687 --> 00:00:46,513
Let's begin.
3
00:01:01,115 --> 00:01:02,975
Very good.
4
00:01:05,965 --> 00:01:08,110
What was your wife's name?
5
00:01:08,943 --> 00:01:12,366
- Mary.
- Mary, alright.
6
00:01:15,665 --> 00:01:18,938
He seeks the spirit
of Mary Browning.
7
00:01:20,446 --> 00:01:24,665
Mary, we invite you
into our circle.
8
00:01:28,776 --> 00:01:32,834
Mary Browning,
we invite you into our circle.
....
Now I want to match only the actual subtitle text content like,
- Mary.
- Mary, alright.
Or
He seeks the spirit
of Mary Browning.
including the special characters, numbers and/or newline characters they may contain. But I don't want to match the time string and serial numbers.
So basically I want to match every line that contains letters (possibly mixed with numbers and special characters), but not the lines that consist only of numbers and special characters, such as the time strings and serial numbers.
How can I match each subtitle and wrap it in a <font color="#FFFF00">[subtitle text any...]</font> tag with the help of regex?
Like this:
<font color="#FFFF00">He seeks the spirit
of Mary Browning.</font>
After checking and analysing carefully, I figured out the key to matching all the subtitle text lines.
First, remove the unnecessary carriage-return characters (\r) from the subtitle (.srt) file:
Find: \r+
Replace with:
(nothing, i.e. an empty string)
Then match the lines that start with neither a digit nor a newline (this skips the serial numbers, the time strings and the blank lines) and replace each one with its own text wrapped in a <font> tag with the color value, as below:
Find: ^([^\d^\n].*)
Replace with: <font color="#FFFF00">\1</font>
(The spaces after the colons are only there for presentation and are not part of the patterns.)
Hope this helps everyone head-banging with subtitles everyday.
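The two steps above can also be run from a script. This is a minimal Python sketch of the same line-by-line approach, not a full .srt parser; the function name tag_subtitle_text and the color parameter are illustrative, and the character class is written as [^\d\n] because the second ^ in the Notepad++ pattern is only a literal caret.

```python
import re


def tag_subtitle_text(srt: str, color: str = "#FFFF00") -> str:
    """Wrap every line that starts with neither a digit nor a newline in a <font> tag."""
    # Step 1: drop carriage returns so that every line ends in a plain "\n".
    srt = srt.replace("\r", "")
    # Step 2: re.MULTILINE makes ^ match at the start of each line.
    return re.sub(
        r"^([^\d\n].*)",
        rf'<font color="{color}">\1</font>',
        srt,
        flags=re.MULTILINE,
    )


sample = "6\r\n00:01:15,665 --> 00:01:18,938\r\nHe seeks the spirit\r\nof Mary Browning.\r\n\r\n7\r\n"
print(tag_subtitle_text(sample))
```

Like the Notepad++ version, this tags each text line separately and leaves untouched any subtitle line that happens to begin with a digit.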

How can I change multiple lines within a tag into one line in Notepad++

These are 2 lines within P tag:
<p>2 years of teaching experience on English second paper from class six to ten in Biddapath and Onnesha Coaching Centre, Mirpur 1.
Moreover, I have one year of experience on Private teaching in my home district. I had 3 batches of about 40 students, class 9,10 & 11 on English Second Paper.</p>
And here is a string with 3 lines within p tag:
<p>2 years of teaching experience on English second paper from class six to ten in Biddapath and Onnesha Coaching Centre, Mirpur 1.
Moreover,
I have one year of experience on Private teaching in my home district. I had 3 batches of about 40 students, class 9,10 & 11 on English Second Paper.</p>
There are also strings with 4 lines within a p tag.
I want to join all the lines within each p tag into one single line.
Is that possible with Notepad++ regex?
The regex to remove all linebreaks inside a non-nested p tag, is
(<p>|(?!^)\G)(?:(?!</?p>)[^\r\n])*\K\R+
See this regex demo.
In short,
(<p>|(?!^)\G) - Group 1 matching a <p> or the end of the previous successful match
(?:(?!</?p>)[^\r\n])* - matches 0+ sequences of any char other than a LF/CR that does not start a </p> or <p> sequence
\K - discards the text matched so far from the match result
\R+ - the only part left in the match: 1+ line break sequences (CRLFs, CRs or LFs).
NOTE that unrolling the pattern as
(<p>|\G(?!^))[^\r\n<]*(?:<(?!/?p>)[^\r\n<]*)*\K\R+
will boost the S&R performance (see demo).
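Python's standard re module has no \G, \K or \R, so the pattern above cannot be reused there as is. As a rough equivalent, the sketch below swaps in a different technique: it finds each non-nested <p>...</p> block and collapses the line breaks inside it with a callback. The function name join_lines_in_p and the sep parameter are my own; pass sep="" to remove the line breaks outright instead of replacing them with a space.

```python
import re

# Non-greedy match of one <p>...</p> block; re.DOTALL lets "." cross line breaks.
# Assumption: <p> tags are not nested and carry no attributes, as in the question.
P_BLOCK = re.compile(r"<p>.*?</p>", re.DOTALL)


def join_lines_in_p(html: str, sep: str = " ") -> str:
    """Replace each run of line breaks inside a <p> block with sep."""
    def collapse(match: re.Match) -> str:
        # Handles CRLF, CR and LF; a run of several breaks becomes a single sep.
        return re.sub(r"(?:\r\n|\r|\n)+", sep, match.group(0))

    return P_BLOCK.sub(collapse, html)


print(join_lines_in_p("<p>Mirpur 1.\nMoreover,\r\nI have one year of experience.</p>\n<p>Next</p>\n"))
```

Line breaks between separate p blocks are left untouched, because only the text inside each match is rewritten.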

Using Regex to clean a csv file in R

This is my first post so I hope it is clear enough.
I am having a problem regarding cleaning my CSV files before I can read them into R and have spent the entire day trying to find a solution.
My data is supposed to be in the form of two columns. The first column is a timestamp consisting of 10 digits and the second is an ID consisting of 11 or 12 letters and numbers (the first 6 are always numbers).
For example:
logger10 |
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F0831102744
010700EDDA0F|
would become:
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F
0831102744 | 010700EDDA0F
(please excuse the lines in the middle, that was my attempt at separating the columns...).
The csv file seems to occasionally be missing a comma which means that sometimes one row will end up like this:
0923120531,010300033ADD0925075301,010700EDD00A
My hardware also adds the word logger10 (or whichever number logger this is) whenever it restarts which gives a similar problem e.g. logger10logger100831102744.
I think I have managed to solve the logger text problem (see code) but I am sure this could be improved. Also, I really don't want to delete any of the data.
My real trouble is making sure there is a line break in the right place after the ID and, if not, I would like to add one. I thought I could use regex for this but I'm having difficulty understanding it.
Any help would be greatly appreciated!
Here is my attempt:
temp <- list.files(pattern="*.CSV") # list of each csv/logger file
for (i in temp) {
  # clean each csv
  tmp <- readLines(i) # check each line in file
  tmp <- gsub("logger([0-9]{2})", "", tmp) # remove logger text
  pattern <- ("[0-9]{10}\\,[0-9]{6}[A-Z,0-9]{5,6}") # regex pattern ??
  if (tmp != pattern) {
    # I have no idea where to start here...
  }
}
here is some raw data:
logger01
0729131218,020700EE1961
0729131226,020700EE1961
0831103159,0203000316DB
0831103207,0203000316DB0831103253,010700EDE28C
0831103301,010700EDE28C
0831103522,010300029815
0831103636,010300029815
0831103657,020300029815
If you want to do this in a single pass:
(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?
can be replaced with
\1\t\2\n
What this does is look for any of those rogue logger01 entries (including the space after it) optionally: That trailing ? after the group means that it can match 0 or 1 time: if it does match, it will. If it's not there, the match just keeps going anyway.
Following that, you look for (and capture) exactly 10 hex values (either digits or A-F). The ,? means that if a comma exists, it will match, but it can match 0 or 1 time as well (making it optional).
Following that, look for (and capture) exactly 12 hex values. Finally, to get rid of any stray trailing space, the final " ?" (a space character followed by ?) optionally matches it.
Your replacement will replace the first captured group (the 10 hex digits), add in a tab, replace the second captured group (the 12 hex digits), and then a newline.
You can see this in use on regex101 to see the results. You can use code generator on the left side of that page to get some preformatted PHP/Javascript/Python that you can just drop into a script.
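If you're doing this from the command line, perl could be used (the added \n? consumes each input line's own newline, so the \n in the replacement does not leave blank lines behind):
perl -pe 's/(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?\n?/$1\t$2\n/g'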
If another language, you may need to adapt it slightly to fit your needs.
EDIT
Re-reading the OP and comments, a slightly more rigid regex could be
(?:logger\d\d\ )?([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})\ ?
I updated the regex101 link with the changes.
This still looks for the first 10 hex values, but now looks for exactly 6 digits, followed by 5-6 hex values, so the total number of characters matched is 11 or 12.
The replacement would be the same.
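As a rough illustration of the stricter pattern, here is a Python sketch that works on the whole file content at once instead of line by line. It assumes two-digit logger numbers (as in the patterns above) and emits one "timestamp,ID" string per record; the names LOGGER, ROW and clean_log are hypothetical, and the sample is the raw data from the question.

```python
import re

# Assumption: the restart marker is "logger" plus exactly two digits.
LOGGER = re.compile(r"logger\d\d")
# 10 hex characters, an optional comma, then 6 digits followed by 5-6 hex characters.
ROW = re.compile(r"([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})")


def clean_log(raw: str) -> list[str]:
    """Return one 'timestamp,ID' string per record found in the raw text."""
    # Strip the logger markers first so their digits cannot be read as a timestamp.
    raw = LOGGER.sub("", raw)
    return [f"{timestamp},{tag}" for timestamp, tag in ROW.findall(raw)]


raw_data = """logger01
0729131218,020700EE1961
0729131226,020700EE1961
0831103159,0203000316DB
0831103207,0203000316DB0831103253,010700EDE28C
0831103301,010700EDE28C
0831103522,010300029815
0831103636,010300029815
0831103657,020300029815
"""
for row in clean_log(raw_data):
    print(row)
```

Because {5,6} is greedy, an 11-character ID that runs straight into the next timestamp would absorb that timestamp's first digit, so files containing 11-character IDs need a closer look before trusting the output.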
Paste your regex into https://regex101.com/ to see whether it catches all cases. The 5 or 6 letters or digits could pose an issue, as the pattern may catch the first digit of the next timestamp when the logger misses out a comma. Appending a '\n' to the end of the tmp string should work, provided the regex catches all cases.