How can I delete this part of the text with regex? - regex

I have a problem that I hope somebody can help me with. I want to delete some parts of the text in a Notepad++ document using regex. If there is other software I can use to delete this part of the text, please let me know; I am a complete beginner with regex.
My document looks like this:
1
00:00:00,859 --> 00:00:03,070
text over here
2
00:00:03,070 --> 00:00:09,589
text over here
3
00:00:09,589 --> 00:00:10,589
some numbers here
4
00:00:10,589 --> 00:00:12,709
Text over here
5
00:00:12,709 --> 00:00:18,610
More text with numbers here
How can I delete the first 2 lines of numbers of every entry in the whole document, so that I get only the text parts (the "text over here" parts)?
I would really appreciate any kind of help!

My solution:
^[\s\S]{1,5}\d{1,3}:\d{1,3}:\d{1,3},\d{1,5}\s-->\s*?\d{1,3}:\d{1,3}:\d{1,3},\d{1,5}\s
This solution matches both types: either all the data on one line, or the numbers on one line and the data on the second.
Demo: https://regex101.com/r/nKD0DQ/1/

Simplest solution:
\d+(\r\n|\r|\n)\d{2}:\d{2}.*(\r\n|\r|\n)
\d+ matches the line that contains only a number, and (\r\n|\r|\n) matches its line break.
\d{2}:\d{2} then matches the start of the next line (two 2-digit numbers separated by a colon), .* matches the rest of that line, and the final (\r\n|\r|\n) matches its line break. There is no need to match the whole timestamp: we already know we are on the right line, because a subtitle file has a well-defined, predictable structure.
In Notepad++, open Search -> Replace..., put this pattern in Find what:, set Search Mode to Regular expression, and leave Replace with: empty. This gives the expected result: the text lines, with an empty line between each.
See it in action on regex101.
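For anyone who would rather script this than use Notepad++, here is a minimal Python sketch of the same replacement. It assumes the subtitle text is already in a string; the names SAMPLE, HEADER and text_only are made up for illustration.

```python
import re

# Two entries in the same shape as the sample from the question.
SAMPLE = (
    "1\n"
    "00:00:00,859 --> 00:00:03,070\n"
    "text over here\n"
    "2\n"
    "00:00:03,070 --> 00:00:09,589\n"
    "text over here\n"
)

# Same pattern as above: a number line, then a line starting with NN:NN.
HEADER = re.compile(r"\d+(\r\n|\r|\n)\d{2}:\d{2}.*(\r\n|\r|\n)")

# Replacing every match with the empty string leaves only the text lines.
text_only = HEADER.sub("", SAMPLE)
print(text_only)
```

Since . does not match a line break by default in Python, .* stops at the end of the timestamp line, just as it does in Notepad++ when ". matches newline" is unticked.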

For subtitles, you can use this for accuracy:
\d+(\r\n|\n|\r)(\d\d:){2}\d\d,\d{3}\s*-->\s*(\d\d:){2}\d\d,\d{3}(\r\n|\n|\r)
Tick Regular expression, put this in Find what, and leave Replace with empty.
Regex Demo
SRT subtitles follow a fixed order, and it is better to be accurate than to lose text.
\d : a single digit.
+ : one or more occurrences of the preceding character or group.
\r\n : carriage return followed by line feed (a Windows newline).
* : zero or more occurrences of the preceding character or group.
| : or; match either alternative.
{3} : match the preceding character or group exactly three times.
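As a rough illustration of why the stricter pattern can matter, the sketch below writes it as a commented Python pattern and runs it next to the simpler \d{2}:\d{2}.* pattern. The sample is contrived on purpose: the subtitle text itself happens to look like a number followed by a time. All names (STRICT_HEADER, LOOSE_HEADER, tricky) are hypothetical.

```python
import re

# The strict pattern, spelled out with re.VERBOSE so each part can be commented.
STRICT_HEADER = re.compile(
    r"""
    \d+ (?:\r\n|\n|\r)           # cue number on its own line
    (?:\d\d:){2} \d\d , \d{3}    # start time, hh:mm:ss,mmm
    \s* --> \s*                  # arrow, optional whitespace around it
    (?:\d\d:){2} \d\d , \d{3}    # end time, hh:mm:ss,mmm
    (?:\r\n|\n|\r)               # line break after the timestamp line
    """,
    re.VERBOSE,
)

# The simpler pattern, for comparison.
LOOSE_HEADER = re.compile(r"\d+(\r\n|\r|\n)\d{2}:\d{2}.*(\r\n|\r|\n)")

# One entry whose two text lines are "42" and "10:30 sharp".
tricky = "1\n00:00:00,859 --> 00:00:03,070\n42\n10:30 sharp\n"

strict_result = STRICT_HEADER.sub("", tricky)  # keeps both text lines
loose_result = LOOSE_HEADER.sub("", tricky)    # also removes the text lines
print(repr(strict_result), repr(loose_result))
```

On ordinary subtitle text both patterns give the same result; the difference only shows up when the text resembles a header.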

I'm going for a less specific regex:
^[0-9]*\n[0-9:,]*\s-->\s[0-9:,]*
Demo on regex101

Related

How to Match Tilde-Delimited Data Using Regex

I have data like this:
~10~682423~15~Test Data~10~68276127~15~More Data~10~6813~15~Also Data~
I'm trying to use Notepad++ to find and replace the values within tag 10 (682423, 68276127, 6813) with zeroes. I thought the syntax below would work, but it selects the first occurrence of the text I want and the rest of the line, instead of just the text I want (~10~682423~, for example). I also tried dozens of variations from searching online, but they also either did the same thing or wouldn't return any results.
~10~.*~
You can use: (?<=~10~)\d+(?=~) and replace with 0. This uses lookarounds to check that ~10~ precedes the digit sequence and the (?=~) ensures a ~ follows the digit sequence. If any character could be after the ~10~ field, use (?<=~10~)[^~]+(?=~).
The problem with ~10~.*~ is that * is greedy: .* matches as many characters as it can, including every ~ along the way, so the match runs up to the last ~ on the line.
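The difference is easy to check outside Notepad++. The following Python sketch uses the sample line from the question; the lazy variant ~10~.*?~ is not part of the answer above and is included only to contrast it with the greedy one. Variable names are illustrative.

```python
import re

line = "~10~682423~15~Test Data~10~68276127~15~More Data~10~6813~15~Also Data~"

# Greedy: .* runs to the last ~ on the line.
greedy = re.search(r"~10~.*~", line).group(0)

# Lazy: .*? stops at the first ~ it can.
lazy = re.search(r"~10~.*?~", line).group(0)

# Lookarounds: only the digits between ~10~ and the next ~ are replaced.
zeroed = re.sub(r"(?<=~10~)\d+(?=~)", "0", line)

print(greedy == line)
print(lazy)
print(zeroed)
```

Because the lookbehind and lookahead consume no characters, the surrounding ~10~ and ~ stay in place and only the value is replaced.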
Use
\b10~\d+
Replace with 10~0. See proof. \b10~ matches 10 only as a whole number (no match inside 210 is allowed), and \d+ matches one or more digits.

Remove columns from CSV

I don't know anything about Notepad++ Regex.
This is the data I have in my CSV:
6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389
How can I use a Regex to remove the 5th and 6th column? The numbers in the 5th and 6th column are variable in length.
Another problem is that the User column can also contain a |, to make it even worse.
I could use a macro to fix this, but the file is a few million lines long.
This is the final result I want to achieve:
6454345|User1-2ds3|62562012032|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|411b0fdf54fe29745897288c6ad699f7be30f389
I am open to suggestions on how to do this with another program or command line utility, on either Linux or Windows.
Match \|[^|]+\|[^|]+(\|[^|]+$)
Replace with $1
Basically, anchor to the end of the line and remove columns [-3] and [-2], keeping the last column [-1]. (I assume columns can't be empty; replace + with * if they can.)
If you need finer control than that, I'd recommend writing a Java or Python script to manually parse and rewrite the file for you.
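A Python script along those lines does not need a regex at all. The sketch below is one possible approach, not a tested tool for the real file: it assumes every row has at least four columns and that the last three fields never contain a |. The function name drop_columns and the rows list are made up for the example.

```python
def drop_columns(line: str, sep: str = "|") -> str:
    """Remove the two fields that sit just before the last field."""
    # rsplit counts separators from the right, so extra "|" characters
    # inside the user name (on the left) do not shift the columns.
    head, _fifth, _sixth, last = line.rsplit(sep, 3)
    return head + sep + last


rows = [
    "6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1",
    "3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433",
    "8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389",
]

cleaned = [drop_columns(row) for row in rows]
for row in cleaned:
    print(row)
```

For a file with millions of lines, the same function could be applied line by line while reading, so the whole file never has to sit in memory.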
I've captured three groups and given them names. If you use a replace utility like sed or vimregex, you can replace remove with nothing. Or you can use a programming language to concatenate keep_before and keep_after for the desired result.
^(?<keep_before>(?:[^|]+\|){3})(?<remove>(?:[^|]+\|){2})(?<keep_after>.*)$
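You may have to remove the group names and use \1 etc. instead, depending on what environment you use. Note that this pattern counts columns from the left, so it only works on rows whose user column contains no |.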
Demo
In Notepad++, press Ctrl+H, then enter the following in the dialog:
Find what: \|\d+\|\d+(\|[0-9a-z]+)$
Replace with: $1
Search mode: Regular Expression
Click Replace All and you're done.
Regex explained:
\|\d+ : matches the first field to remove, a | followed by a number
\|\d+ : matches the second field to remove, a | followed by a number
(\|[0-9a-z]+) : matches and captures the field after the second number.
$ : forces the match to end at the end of the line.
Replacement:
$1 : replaces the whole match with the contents of the capture group, i.e. whatever (\|[0-9a-z]+) matched.
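The same find-and-replace can be sketched in Python, where the replacement is written \1 instead of $1. This is only an illustration on the three sample rows; the variable names are hypothetical, and re.MULTILINE is needed so that $ matches at the end of every line rather than only at the end of the text.

```python
import re

text = "\n".join([
    "6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1",
    "3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433",
    "8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389",
])

# Two numeric fields followed by the captured last field, anchored at line end.
pattern = re.compile(r"\|\d+\|\d+(\|[0-9a-z]+)$", re.MULTILINE)

# \1 puts back only the captured last field, dropping the two numeric ones.
result = pattern.sub(r"\1", text)
print(result)
```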

Extracting address with Regex

I'm trying to look for Street|St|Drive|Dr and then get all the contents of the line to extract the address:
(?:(?!\s{2,}|\$).)*(Street|St|Drive|Dr).*?(?=\s{2,})
.. but it also matches:
Full match 420-442 ` Tax Invoice/Statement`
Group 1. 433-435 `St`
Full match 4858-4867 `163.66 DR`
Group 1. 4865-4867 `DR`
Full match 11053-11089 ` Permanent Water Saving Plan, please`
Group 1. 11077-11079 `Pl`
How do I match only whole words and not substrings, so that it ignores words that contain those words (the first match, for example)?
One option is to use the word-boundary anchor, \b, to accomplish this:
(?:(?!\s{2,}|\$).)*\b(Street|St|Drive|Dr)\b.*?(?=\s{2,})
If you provide an example of the raw text you're parsing, I'll be able to give additional help if this doesn't work.
Edit:
From the link you posted in a comment, it seems that the \b solution solves your question:
How do I match only whole words and not substrings, so that it ignores words that contain those words (the first match, for example)?
However, it seems like there are additional issues with your regex.
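To see the effect of \b in isolation, here is a small Python sketch that tests only the alternation, without the rest of the address pattern. The sample strings other than "Tax Invoice/Statement" are invented for illustration, and the match is case-sensitive.

```python
import re

plain = re.compile(r"(Street|St|Drive|Dr)")
bounded = re.compile(r"\b(Street|St|Drive|Dr)\b")

samples = ["Tax Invoice/Statement", "12 High St", "4 Elm Drive"]

# Record what each pattern finds in each sample (None means no match).
plain_hits = [m.group(1) if (m := plain.search(s)) else None for s in samples]
bounded_hits = [m.group(1) if (m := bounded.search(s)) else None for s in samples]

print(plain_hits)
print(bounded_hits)
```

Without the boundaries, the St inside "Statement" counts as a match; with them, it does not, because there is no word boundary between the t and the a that follows it.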

How do you "quantify" a variable number of lines using a regexp?

Say you know the starting and ending lines of some section of text, but the chars in some lines and the number of lines between the starting and ending lines are variable, à la:
aaa
bbbb
cc
...
...
...
xx
yyy
Z
What quantifier do you use, something like:
aaa\nbbbb\ncc\n(.*\n)+xx\nyyy\nZ\n
to parse those sections of text as a group?
You can use the s flag (which makes the dot match newlines) to match multi-line text, like this:
~\w+ ~s.
There is a similar question here:
Javascript regex multiline flag doesn't work
If I understood correctly, you know that your text begins with aaa\nbbbb\ncc and ends with xx\nyyy\nZ\n. You could use aaa.+?bbbb.+?cc(.+?)xx.+?yyy.+?Z so that all quantifiers are non-greedy and you don't accidentally capture two sections at once. The text in between would be in match group 1. You also need to turn on the setting that makes the dot match newlines.
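The point about non-greedy quantifiers can be checked with a short Python sketch. It uses the literal-newline form of the pattern from the question rather than the .+? form above, with re.DOTALL standing in for the "dot matches newline" setting. The sample text with two sections and a line of junk between them is invented for the example.

```python
import re

text = (
    "aaa\nbbbb\ncc\n"
    "one\ntwo\n"
    "xx\nyyy\nZ\n"
    "junk\n"
    "aaa\nbbbb\ncc\n"
    "three\n"
    "xx\nyyy\nZ\n"
)

# Lazy body: stops at the first closing xx/yyy/Z it meets.
lazy = re.compile(r"aaa\nbbbb\ncc\n(.*?)xx\nyyy\nZ\n", re.DOTALL)
# Greedy body: runs on to the last closing xx/yyy/Z in the text.
greedy = re.compile(r"aaa\nbbbb\ncc\n(.*)xx\nyyy\nZ\n", re.DOTALL)

lazy_bodies = lazy.findall(text)      # one body per section
greedy_bodies = greedy.findall(text)  # a single body spanning both sections

print(lazy_bodies)
print(len(greedy_bodies))
```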
Try this:
aaa( |\n)bbbb( |\n)cc( |\n)( |\n){0,1}(.|\n)*xx( |\n)yyy( |\n)Z
( |\n) matches a space or a newline (so your starting and ending phrases can be split into different lines)
RegExr
At the end of the day what worked for me using Kate was:
( )+aaa\n( )+bbbb\n( )+cc\n(.|\n)*( )+xx\n( )+yyy\n( )+Z\n
Using such regexps, you can clear pages of quite a bit of junk.

RegEx in Notepad++ to find a wild character and replace the whole word

I have a test file with number values as below:
32405494
32405495
32405496
32407498
Using Notepad++, what I am trying to achieve here is to search the first 4 digits using regular expression and replace the whole number with G3E_STYLERULE_SEQ.NEXTVAL
I am able to find these values using 3240*. My question is, how do I replace the whole number with G3E_STYLERULE_SEQ.NEXTVAL?
When I click the Replace All button, I get the following output:
G3E_STYLERULE_SEQ.NEXTVAL5494
G3E_STYLERULE_SEQ.NEXTVAL5495
G3E_STYLERULE_SEQ.NEXTVAL5496
G3E_STYLERULE_SEQ.NEXTVAL7498
However, I am expecting the following:
G3E_STYLERULE_SEQ.NEXTVAL
G3E_STYLERULE_SEQ.NEXTVAL
G3E_STYLERULE_SEQ.NEXTVAL
G3E_STYLERULE_SEQ.NEXTVAL
Any ideas to achieve this? Is it even possible through Notepad++? Are there any other text editors which I can use to achieve this?
Use something like this:
3240.*
. is the wildcard character in regex and * means that the previous character is to be repeated 0 or more times (your current regex actually matches 324 and then 0 which appears 0 or more times).
3240.* will therefore match 3240 and any other following characters.
You might also want to add a line anchor:
^3240.*
So that you don't replace numbers having 3240 in the middle too.
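A quick Python sketch reproduces both behaviours on the numbers from the question. It is only meant to show what each pattern matches; Notepad++ itself behaves the same way for these two patterns as long as ". matches newline" is unticked. Variable names are illustrative.

```python
import re

numbers = "32405494\n32405495\n32405496\n32407498"
replacement = "G3E_STYLERULE_SEQ.NEXTVAL"

# 3240* means "324 followed by zero or more 0s", so only "3240" is replaced.
partial = re.sub(r"3240*", replacement, numbers)

# ^3240.* matches from the start of each line to its end.
whole = re.sub(r"^3240.*", replacement, numbers, flags=re.MULTILINE)

print(partial)
print(whole)
```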
In Notepad++, you can use this regex:
^3240\d+
It matches the four digits you're searching for at the beginning of the line, followed by one or more digits.
Try this -
Search this - ^3240\d*$
Replace with- G3E_STYLERULE_SEQ.NEXTVAL