I have text that looks like this:
3 Q I think I started out, I said when4 you first noticed
the oyster beds, it sounded5 like it didn't really concern you, you did not6 believe that the dredging material or the berm7 building material could reach the oyster beds?8 A That's correct.9 Q
I need an output that splits before the first digit of each numeric sequence (i.e. "10" should count as one match, not as separate matches for 1 and 0) and looks like this (minus the spaces I had to put between each line):
3 Q I think I started out, I said when
4 you first noticed the oyster beds, it sounded
5 like it didn't really concern you, you did not
6 believe that the dredging material or the berm
7 building material could reach the oyster beds?
8 A That's correct.
9 Q
Here, we might just want to capture the (\d+), then replace it with a new line and $1:
RegEx
If this expression wasn't desired, it can be modified/changed in regex101.com.
We can try matching on the pattern:
(?<=.)(\d+)
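This says to match and capture a number of any size, provided that it is preceded by at least one character, i.e. it does not sit at the very start of the text. This avoids adding an unwanted newline before the first line beginning with 3. (If the text can start with a multi-digit number, use (?<=\D)(\d+) instead, so that a match cannot begin in the middle of that number.) Then, we can replace with a newline followed by that captured number. Here is a working script: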
Dim regex As Regex = new Regex("(?<=.)(\d+)")
Console.WriteLine(regex.Replace("1 stuff10 more stuff", vbCrLf & "$1"))
This outputs:
1 stuff
10 more stuff
Be certain to include Imports System.Text.RegularExpressions for the Regex class, and Imports Microsoft.VisualBasic to be able to use vbCrLf in your code.
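For comparison, here is a minimal Python sketch of the same replacement. The function name break_before_numbers is made up for illustration, and the lookbehind (?<=\D) is a swapped-in variant of the (?<=.) used above, chosen so that a multi-digit number at the very start of the text is not split in the middle.

```python
import re

def break_before_numbers(text):
    # Insert a newline before every run of digits that directly follows
    # a non-digit character. (?<=\D) is an assumption of this sketch,
    # not the pattern from the answer above.
    return re.sub(r"(?<=\D)(\d+)", r"\n\1", text)

print(break_before_numbers("1 stuff10 more stuff"))
```

A number at position 0 has no preceding character, so the lookbehind fails there and no leading blank line is produced.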
I have a subtitle file of a movie, like below:
2
00:00:44,687 --> 00:00:46,513
Let's begin.
3
00:01:01,115 --> 00:01:02,975
Very good.
4
00:01:05,965 --> 00:01:08,110
What was your wife's name?
5
00:01:08,943 --> 00:01:12,366
- Mary.
- Mary, alright.
6
00:01:15,665 --> 00:01:18,938
He seeks the spirit
of Mary Browning.
7
00:01:20,446 --> 00:01:24,665
Mary, we invite you
into our circle.
8
00:01:28,776 --> 00:01:32,834
Mary Browning,
we invite you into our circle.
....
Now I want to match only the actual subtitle text content like,
- Mary.
- Mary, alright.
Or
He seeks the spirit
of Mary Browning.
including the special characters, numbers and/or newline characters they may contain. But I don't want to match the time string and serial numbers.
So basically I want to match the lines that contain letters (possibly together with numbers and special characters), but not the lines made up only of numbers and special characters, such as the time strings and serial numbers.
How can I use a regex to match each subtitle and wrap it in a tag like <font color="#FFFF00">[subtitle text any...]</font>?
That is, like this:
<font color="#FFFF00">He seeks the spirit
of Mary Browning.</font>
Well, after checking and analysing carefully, I figured out the key to matching all the subtitle text lines.
First, I have to remove the unnecessary carriage-return characters, i.e. \r, from the subtitle (.srt) file.
Find: \r+
Replace with:
(nothing, i.e. an empty string)
Then I just have to match the lines that do not start with a digit and are not blank, and replace each one with its own text wrapped in a <font> tag with the color value, as below:
Find: ^([^\d\n].*)
Replace with: <font color="#FFFF00">\1</font>
(The space after each colon is just for presentation and is not part of the pattern.)
Note that this wraps every line of a multi-line subtitle in its own tag, and it skips subtitle lines that happen to start with a digit.
Hope this helps everyone head-banging with subtitles everyday.
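If a script is an option, the whole text of each subtitle can be wrapped in a single tag instead of line by line. The following is only a sketch in Python: colorize_srt is a hypothetical helper, and it assumes a regular .srt layout where each block is an index line, a time line containing "-->", the text lines, and then a blank line.

```python
import re

def colorize_srt(srt_text, color="#FFFF00"):
    # Normalize Windows line endings, then split into blocks on blank lines
    # (assumption: blocks are separated by at least one blank line).
    blocks = re.split(r"\n{2,}", srt_text.replace("\r\n", "\n").strip())
    out = []
    for block in blocks:
        lines = block.split("\n")
        # Expected block shape: index, "start --> end", one or more text lines.
        if len(lines) >= 3 and "-->" in lines[1]:
            body = "\n".join(lines[2:])
            lines = lines[:2] + [f'<font color="{color}">{body}</font>']
        out.append("\n".join(lines))
    return "\n\n".join(out) + "\n"

sample = (
    "5\n00:01:08,943 --> 00:01:12,366\n- Mary.\n- Mary, alright.\n\n"
    "6\n00:01:15,665 --> 00:01:18,938\nHe seeks the spirit\nof Mary Browning.\n"
)
print(colorize_srt(sample))
```

Because the text is located by its position in the block rather than by its first character, a subtitle line that starts with a digit is still wrapped.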
I want to add a dash in front of a continuing subtitle line. Like this:
Example sub (.srt):
1
00:00:48,966 --> 00:00:53,720
Today he was so angry and happy
at the same time,
2
00:00:53,929 --> 00:00:57,683
he went to the store and bought a
couple of books. Then the walked home
3
00:00:57,849 --> 00:01:01,102
with joy and jumped in the pool.
4
00:00:57,849 --> 00:01:01,102
One day he was in a bad mood and he
didn't get happier when he read.
TO THIS:
1
00:00:48,966 --> 00:00:53,720
Today he was so angry and happy
at the same time-
2
00:00:53,929 --> 00:00:57,683
-he went to the store and bought a
couple of books. Then the walked home-
3
00:00:57,849 --> 00:01:01,102
-with joy and jumped in the pool.
4
00:00:57,849 --> 00:01:01,102
One day he was in a bad mood and he
didn't get happier when he read.
The original subtitle is in Swedish. This is the standard for Scandinavian subtitles.
How do I format it with regex in Notepad++? How should I write the pattern, and what if the subtitle has italic tags at the start and end?
You can use this regex with the g and m modifiers:
(?:,|([^.?!]<[^>]+>|[^>.?!]))$(\n\n.*\n.*\n)
Use $1-$2- as the substitution.
I'm using a simple definition of sentence. If there is one of .?!, that's counted as the end of a sentence. While this may not be a perfect definition, you're only looking at the ends of sentences.
Depending on several factors (for example, a line ending in ), you may need to tweak it a little.
Essentially, the regex is two parts.
The first part matches one of three things at the end of a line. If it matches a comma, that comma is removed. Otherwise, it looks to see if the last letter (if there is a tag, the letter before that) is NOT any of .?!.
The second part matches all the lines before the one that needs the dash. This also helps ensure that the end of the line you just matched is followed by a new line (and not more text).
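To try the pattern outside an editor, here is a rough Python translation. It is a sketch under a few assumptions: the function name add_continuation_dashes is invented, the file uses \n line endings with exactly one blank line between blocks, and re.MULTILINE stands in for the m modifier (re.sub already replaces every match, which covers g).

```python
import re

# Same pattern as above; \1 is empty when the comma branch matched.
CONTINUATION = re.compile(
    r"(?:,|([^.?!]<[^>]+>|[^>.?!]))$(\n\n.*\n.*\n)", re.MULTILINE
)

def add_continuation_dashes(srt_text):
    # Append "-" to a line that does not end a sentence and prepend "-"
    # to the first text line of the following subtitle block.
    return CONTINUATION.sub(r"\1-\2-", srt_text)

sample = (
    "1\n00:00:48,966 --> 00:00:53,720\n"
    "Today he was so angry and happy\nat the same time,\n\n"
    "2\n00:00:53,929 --> 00:00:57,683\n"
    "he went to the store and bought a\ncouple of books. Then he walked home\n\n"
    "3\n00:00:57,849 --> 00:01:01,102\n"
    "with joy and jumped in the pool.\n"
)
print(add_continuation_dashes(sample))
```

Files saved with \r\n line endings would need the endings normalized first (or the \n in the pattern adjusted), since the pattern looks for two consecutive \n characters.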
I work with text files, and I need to be able to see when the gps (last 3 columns of csv) "hangs up" for more than a few lines.
So for example, usually, part of a text file looks like this:
5451,1667,180007,35.7397387,97.8161897,375.8
5448,1053z,180006,35.7397407,97.8161814,375.7
5444,1667,180005,35.7397445,97.8161674,375.6
5439,1668,180004,35.7397483,97.8161526,375.5
5435,1669,180003,35.7397518,97.8161379,375.5
5431,1669,180002,35.7397554,97.8161269,375.6
5426,1054z,180001,35.7397584,97.8161115,375.6
5420,1670,175959,35.7397649,97.8160931,375.9
But sometimes there is an error with the gps and it looks like this:
36859,1598,202603.00,35.8867316,99.2515545,555.700
36859,1598,202608.00,35.8867316,99.2515545,555.700
36859,1142z,202610.00,35.8867316,99.2515545,555.700
36859,1597,202612.00,35.8867316,99.2515545,555.700
36859,1597,202614.00,35.8867316,99.2515545,555.700
36859,1596,202616.00,35.8867316,99.2515545,555.700
36859,1595,202618.00,35.8867316,99.2515545,555.700
I need to be able to figure out a way to search for matching strings of 7 different numbers, (the decimal portion of the gps) but so far I've only been able to figure out how to search for repeating #s or consecutive numbers.
Any ideas?
If you were to find such repetitions in an editor (such as Notepad++), you could use the following regex to find 4 or more repeating lines:
([^,]+(?:,[^,]+){2})\v+(?:(?:[^,]+,){3}\1(?:\v+|$)){3,}
In a bit more detail:
([^,]+(?:,[^,]+){2})\v+ is a group of three comma-separated fields (one or more non-commas, then twice a comma followed by one or more non-commas), followed by vertical whitespace (a line break) that is not part of the group (e.g. 1,1,1\n)
(?:[^,]+,){3} matches one or more non-commas followed by a comma, three times (the columns that don't have to be considered)
\1 is a backreference to group 1; it matches only text identical to what group 1 captured
(?:\v+|$) matches either more vertical whitespace or the end of the text
{3,} requires 3 or more repetitions; increase it if you want more
However, if you are using a programming language to check this, I wouldn't go down the regex path, as checking for those repetitions can be done much more easily. Here is one example in Python; I hope you can adapt it to your needs:
oldcoords = None
repetitions = 0
with open(r'C:\temp\gps.csv') as f:
    for line in f:
        # the last three columns hold the GPS values
        gpscoords = line.rstrip('\n').split(',')[3:6]
        if gpscoords == oldcoords:
            repetitions += 1
        else:
            oldcoords = gpscoords
            repetitions = 0
        if repetitions == 4:  # or however you define "more than a few"
            print(', '.join(gpscoords) + ' is repeated')
If you can use Perl, and if I understood you correctly, this should do the trick:
perl -ne 'm/^[^,]*,[^,]*,[^,]*,([^,]*,[^,]*,[^,]*$)/g; $current_line=$1; ++$line_number; if ($prev_line eq $current_line){$equals++} else {if ($equals>=6){ print "Last three fields in lines ".($line_number-$equals-1)." to ".($line_number-1)." are equals to:\n$prev_line" } ; $equals=0}; $prev_line=$current_line' < onlyreplacethiswithyourfilepath
The fields are compared as strings with eq, since a numeric == would look only at the leading latitude value.
Sample output:
Last three fields in lines 1 to 7 are equals to:
35.8867316,99.2515545,555.700
Last three fields in lines 16 to 22 are equals to:
37.8782116,99.7825545,572.810
Last three fields in lines 31 to 44 are equals to:
36.6868916,77.2594245,581.358
Last three fields in lines 57 to 63 are equals to:
35.5128764,71.2874545,575.631
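For a Python version that reports line ranges in the same spirit as the output above, something like the following sketch could work. The name find_gps_hangs and the min_run parameter are made up here, and the code assumes every line has at least six comma-separated columns with the GPS values in columns 4 to 6.

```python
from itertools import groupby

def find_gps_hangs(lines, min_run=4):
    """Return (first_line, last_line, coords) for each run of identical GPS fields."""
    def gps_fields(numbered_line):
        _, line = numbered_line
        return tuple(line.strip().split(",")[3:6])

    hangs = []
    # groupby only groups consecutive items, which is exactly a "hang".
    for coords, run in groupby(enumerate(lines, start=1), key=gps_fields):
        run = list(run)
        if len(run) >= min_run:
            hangs.append((run[0][0], run[-1][0], ",".join(coords)))
    return hangs

sample = [
    "5451,1667,180007,35.7397387,97.8161897,375.8",
    "36859,1598,202603.00,35.8867316,99.2515545,555.700",
    "36859,1598,202608.00,35.8867316,99.2515545,555.700",
    "36859,1142z,202610.00,35.8867316,99.2515545,555.700",
    "36859,1597,202612.00,35.8867316,99.2515545,555.700",
    "5420,1670,175959,35.7397649,97.8160931,375.9",
]
print(find_gps_hangs(sample))
```

Unlike the one-liner, this also reports a run that lasts until the final line of the input, because the grouping does not depend on seeing a different line afterwards.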
I'm trying to split text file by line numbers,
for example, if I have text file like:
1 ljhgk uygk uygghl \r\n
1 ljhg kjhg kjhg kjh gkj \r\n
1 kjhl kjhl kjhlkjhkjhlkjhlkjhl \r\n
2 ljkih lkjhl kjhlkjhlkjhlkjhl \r\n
2 lkjh lkjh lkjhljkhl \r\n
3 asdfghjkl \r\n
3 qweryuiop \r\n
I want to split it into 3 parts (1, 2, 3).
How can I do this? The text is very large (~20,000,000 characters) and I need an efficient way (like regex).
Another idea: you can use LINQ to get the groups you're after by grouping the lines on their first word. Note that this takes the first word of every line, so make sure you only have numbers there. This uses the split/join antipattern, but it seems to work nicely here.
var lines = from line in s.Split("\r\n".ToCharArray(),
                                 StringSplitOptions.RemoveEmptyEntries)
            let lineNumber = line.Split(" ".ToCharArray(), 2).FirstOrDefault()
            group line by lineNumber
            into g
            select String.Join("\n", g);
Notes:
GroupBy is guaranteed to yield the groups in the order their keys first appear, with the lines inside each group in their original order.
If a block appears more than once (e.g. "1 1 2 2 3 3 1"), all blocks with the same number will be merged.
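You can use a regex, but Split will not work too well. Instead, you can Match the following pattern (with the Multiline and IgnorePatternWhitespace options, since it anchors per line and contains inline comments):
^(\d+)\b.*$ # Match first line, capture number
([\r\n]+^\1\b.*$)* # Match additional lines that begin with the same number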
I did try to split by $(?<=^(\d+).*)[\r\n]+^(?!\1), but it adds the line numbers as additional elements in the array.
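The same idea can also be sketched without regex. The Python below is only an illustration: split_by_leading_number is a hypothetical name, and it assumes every non-blank line starts with its number followed by whitespace.

```python
from itertools import groupby

def split_by_leading_number(text):
    # splitlines() handles \r\n as well as \n; blank lines are dropped.
    lines = [line for line in text.splitlines() if line.strip()]

    def leading_number(line):
        # First whitespace-separated token, e.g. "1" in "1 ljhgk uygk".
        return line.split(None, 1)[0]

    # groupby groups consecutive lines only, so a number that shows up
    # again later starts a new part instead of being merged.
    return ["\n".join(group) for _, group in groupby(lines, key=leading_number)]

parts = split_by_leading_number("1 a\r\n1 b\r\n2 c\r\n3 d\r\n3 e\r\n")
print(parts)
```

This makes a single pass over the lines, and comparing the whole first token means that lines starting with 1 and with 10 are never grouped together.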