Use regex to modify an SRT file?

One format of SRT file looks like this:
0:00:04 --> 00:00:10
and another format looks like this:
0:00:04,000 --> 00:00:10,000
I want to process the first kind of file and append ,000 to each timestamp for compatibility, so that it matches the second format shown above.
I was thinking of using string functions like mid(), right(), and instring(), but wondered whether regex might do the job better. Any suggestions on how to do this?

You can use this regex to match both timestamps on a timing line:
^([0-9]{1,2}:[0-9]{2}:[0-9]{2}) --> ([0-9]{1,2}:[0-9]{2}:[0-9]{2})$
Then replace $1 with $1 followed by ",000" and $2 with $2 followed by ",000".
Since you don't indicate which language you use, here is a simple example in PHP:
<?php
// Append ",000" to both timestamps of a timing line in the short format.
$pattern = '/^([0-9]{1,2}:[0-9]{2}:[0-9]{2}) --> ([0-9]{1,2}:[0-9]{2}:[0-9]{2})$/';
$replacement = '${1},000 --> ${2},000';
echo preg_replace($pattern, $replacement, '0:00:04 --> 00:00:10');
// output: 0:00:04,000 --> 00:00:10,000
?>
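For comparison, here is a minimal Python sketch of the same replacement applied to a whole file's text. The function name `add_milliseconds` and the sample text are made up for illustration; the pattern is the one from the answer above, with optional trailing spaces or tabs allowed.

```python
import re

# Timing line in the short format, e.g. "0:00:04 --> 00:00:10".
# re.MULTILINE makes ^ and $ match at every line boundary.
TIMING = re.compile(
    r"^(\d{1,2}:\d{2}:\d{2}) --> (\d{1,2}:\d{2}:\d{2})[ \t]*$",
    re.MULTILINE,
)


def add_milliseconds(srt_text: str) -> str:
    """Append ',000' to both timestamps of every short-format timing line."""
    return TIMING.sub(r"\g<1>,000 --> \g<2>,000", srt_text)


if __name__ == "__main__":
    sample = "1\n0:00:04 --> 00:00:10\nHello\n"
    print(add_milliseconds(sample))
```

Lines that already carry milliseconds do not match the pattern, so running this over a mixed file should leave them untouched.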

With sed (it's available on Windows too):
sed -i 's_^\([0-9]\+:[0-9]\+:[0-9]\+\)\s*-->\s*\([0-9]\+:[0-9]\+:[0-9]\+\)\s*$_\1,000 --> \2,000_' INPUT.srt
The file is edited in place.
(I know this is not a strict regex for time definitions, but it works for this job. GNU sed has no \d shorthand for digits, so the pattern uses [0-9].)

Sure, that sounds like a good idea. A simple approach would be to match for (\d?\d:\d\d:\d\d) and replace it with the match itself plus ,000 (for "the match itself" use a back reference, which might be something like \1 or $1, depending on your language).
Try implementing this, and if you need further help, start a new question where you mention what you have tried, where you are stuck and which language you are using.

Why not simply
sed -e '/ --> /s/ -->\|$/,000&/g' old.srt >new.srt
provided that old.srt consistently contains only the shorter format.

Related

regex replace in lines starting with {\s between first space to ;}

I have some corrupt RTF files with lines like this:
{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 Fußzeile Zchn;}
^----------------------------^
I want to replace all characters matching [^a-zA-Z0-9_\{}; ]
but only in lines beginning with "{\s" and ending with ";}", and only between the first space and the ";}".
The first space and the ";}" themselves should not be replaced.
You didn't specify a language; here is a Regex101 example:
({\\s.+?\s)(.*)(})
So, I'm unsure what language/technology you'd like to use here, but if using C# is an option, you can check out this previous question. The answer gets you almost all the way there.
For your example:
var text = @"{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 Fußzeile Zchn;}";
var pattern = @"^({\\s\S*\s[a-zA-Z0-9_\{}; ]*)([^a-zA-Z0-9_\{}; ]*)([^}]*})";
var replaced = System.Text.RegularExpressions.Regex.Replace(text, pattern, "$1$3");
This will get you to replace one contiguous blob of bad characters, which addresses your example, but unfortunately, not your question. There is probably a more elegant solution, but I think you'll have to iteratively run that expression until the input and output of Regex.Replace() are equal.
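One way to avoid the repeated passes is to split a matching line into head, body and tail, and clean only the body. The following Python sketch shows that idea; it is an illustration, not the C# answer's method. The name `clean_line` is hypothetical, and the default of deleting the unwanted characters is an assumption, since the question does not say what they should be replaced with.

```python
import re

# head: "{\s...", up to and including the first space
# body: everything between that space and the closing ";}"
# tail: the closing ";}"
LINE = re.compile(r"^(\{\\s\S* )(.*)(;\})$")

# Characters the question wants replaced (anything outside this set).
BAD = re.compile(r"[^a-zA-Z0-9_{}; ]")


def clean_line(line: str, replacement: str = "") -> str:
    """Clean one line (without its trailing newline); other lines pass through."""
    m = LINE.match(line)
    if m is None:
        return line
    head, body, tail = m.groups()
    return head + BAD.sub(replacement, body) + tail


if __name__ == "__main__":
    print(clean_line(r"{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 Fußzeile Zchn;}"))
```

Because the substitution runs over the whole body at once, every unwanted character is handled in a single call, however many separate runs of them there are.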
If you can use sed in a terminal, you could do something like this.
sed -i 's/^\({\\s[^ ]*\s\).*\(\;}\)\(}\)\?$/\1\2/' filename
Turned my file containing:
{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 Fußzeile Zchn;}
To:
{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 ;}

Need regex to strip away remaining part of a path

I am trying to write a regex which will strip away the rest of the path after a particular folder name.
If Input is:
/Repository/Framework/PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces/IDemoReader.cs
Output should be:
/Repository/Framework/PITA/branches/ChangePack-6a7B6
Some constraints:
ChangePack- will be followed by a change pack ID, which is a mix of digits and letters (a-z or A-Z) in any order. There is no limit on the length of the change pack ID.
ChangePack- is a constant. It will always be there.
The text before ChangePack can also change. For example, it can also be:
/Repository/Demo1/Demo2/4.3//PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces
My regex-fu is bad. What I have come up with till now is:
^(.*?)\-6a7B6
I need to make this generic.
Any help will be much appreciated.
Below regex can do the trick.
^(.*?ChangePack-[\w]+)
Input:
/Repository/Framework/PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces/IDemoReader.cs
/Repository/Demo1/Demo2/4.3//PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces
Output:
/Repository/Framework/PITA/branches/ChangePack-6a7B6
/Repository/Demo1/Demo2/4.3//PITA/branches/ChangePack-6a7B6
Check out the live regex demo here.
^(.*?ChangePack-[a-zA-Z0-9]+)
Try this. Instead of replacing, grab the match with $1 or \1. See the demo.
https://regex101.com/r/iY3eK8/17
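As a minimal sketch of "grab the match instead of replacing", here is the same pattern used from Python. The function name `change_pack_root` is made up for illustration; the two paths are the ones from the question.

```python
import re
from typing import Optional

# Lazily take everything up to "ChangePack-" plus the alphanumeric ID after it.
CHANGE_PACK = re.compile(r"^(.*?ChangePack-[a-zA-Z0-9]+)")


def change_pack_root(path: str) -> Optional[str]:
    """Return the path up to and including the change pack folder, or None."""
    m = CHANGE_PACK.match(path)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(change_pack_root(
        "/Repository/Framework/PITA/branches/ChangePack-6a7B6/core/src/"
        "Pita.x86.Interfaces/IDemoReader.cs"
    ))
```

Returning None for paths without a ChangePack- folder keeps the "no match" case explicit rather than silently returning the input.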
Will you always have '/Repository/Framework/PITA/branches/' at the beginning? If so, this will do the trick:
/Repository/Framework/PITA/branches/\w+-\w*
Instead of regex you could use split and join functions. Example in Python:
path = "/a/b/c/d/e"
folders = path.split("/")  # the leading "/" yields an empty first element
newpath = "/".join(folders[:3])  # keeps the first two folders, trims the rest
print(newpath)  # prints "/a/b"
If you really want regex, try something like ^.*\/folder\/ where folder is the name of the directory you want to match.

Regex gets more results than are available in the text

I have a really weird problem: I am searching for URLs on an HTML page and want only a specific part of each URL. In my test HTML page the link occurs only once, but instead of one result I get about 20...
This is the regex I'm using:
perl -ne 'm/http\:\/\myurl\.com\/somefile\.php.+\/afolder\/(.*)\.(rar|zip|tar|gz)/; print "$1.$2\n";'
sample input would be something like this:
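<html><body><a href="http://myurl.com/somefile.php?x=foo&y=bla&z=sdf&path=/foo/bar/afolder/testfile.zip&more=arguments&and=evenmore">Somelinknme</a></body></html>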
That is a very simple example; in reality the link would appear on a normal website with content around it.
My result should be something like this:
testfile.zip
but instead I see this line many times. Is this a problem with the regex or with something else?
Yes, the regex is greedy.
Use an appropriate tool for HTML instead: HTML::LinkExtor or one of the link methods in WWW::Mechanize, then URI to extract a specific part.
use 5.010;
use WWW::Mechanize qw();
use URI qw();
use URI::QueryParam qw();
my $w = WWW::Mechanize->new;
$w->get('file:///tmp/so10549258.html');
for my $link ($w->links) {
    my $u = URI->new($link->url);
    # 'http://myurl.com/somefile.php?x=foo&y=bla&z=sdf&path=/foo/bar/afolder/testfile.zip&more=arguments&and=evenmore'
    say $u->query_param('path');
    # '/foo/bar/afolder/testfile.zip'
    $u = URI->new($u->query_param('path'));
    say (($u->path_segments)[-1]);
    # 'testfile.zip'
}
Are there 20 lines following in the file after your link?
Your problem is that the match variables are not reset. When your link matches the first time, $1 and $2 get their values. On the following lines the regex does not match, but $1 and $2 still hold the old values. You should therefore print only if the regex matches, not on every line.
From perlre, see section Capture Groups
NOTE: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.
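The "print only when the line matched" rule can be sketched in Python as follows. This illustrates the control flow rather than translating the Perl one-liner: the function name `archive_names` and the simplified pattern are assumptions made for this example.

```python
import re

# Hypothetical, simplified pattern: the last path segment of a "path="
# query parameter that ends in one of the archive extensions.
LINK = re.compile(r"path=[^\"'&\s]*/([^/\"'&\s]+)\.(rar|zip|tar|gz)")


def archive_names(lines):
    """Collect 'name.ext' once per matching line, skipping all other lines."""
    names = []
    for line in lines:
        m = LINK.search(line)
        if m:  # use the captured groups only when this line actually matched
            names.append(f"{m.group(1)}.{m.group(2)}")
    return names


if __name__ == "__main__":
    page = [
        '<a href="http://myurl.com/somefile.php?x=foo&path=/foo/bar/afolder/testfile.zip&more=arguments">Somelinknme</a>',
        "<p>unrelated line</p>",
        "<p>another unrelated line</p>",
    ]
    print(archive_names(page))
```

In Python a failed search returns None, so stale captures cannot leak into later iterations; in Perl the equivalent guard is to make the print conditional on the match succeeding.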
This should do the trick for your sample input & output.
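$Str = '<html><body><a href="http://myurl.com/somefile.php?x=foo&y=bla&z=sdf&path=/foo/bar/afolder/testfile.zip&more=arguments&and=evenmore">Somelinknme</a></body></html>';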
@Matches = ($Str =~ m#path=.+/(\w+\.\w+)#g);
print @Matches;

Extract some strings

I need some help with this regex magic.
I have this:
<a href="/en/node/1032/delete?destination=node%2F5%2Fblog">delete</a>
and this:
(<a)*([^>]*>)[^<]*(</a>)
$1 = <a
$2 = href="/en/node/1032/delete?destination=node%2F5%2Fblog">
$3 = </a>
I need some additional strings:
1032
href="/en/ (en is dynamic!)
How can I get these strings?
This is used in PHP.
Your sample could be captured with
(<a)\b.*?((href="/en/).*?(?<=/)(\d+)/.*?").*?>.*?(</a>)
...but perhaps replacing the "en" with something broader, depending on what you want to capture.
HOWEVER, and I want to emphasize this, don't use regex to parse HTML. The above regex won't work for certain HTML-valid input, and due to the limitations of regex it cannot be refined to work for every possible case. You'll get better, more correct results with an HTML or XML parser.
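To show what the parser route can look like, here is a minimal sketch using only Python's standard library. The question is about PHP, so treat this as an illustration of the approach rather than a drop-in answer; the names `LinkCollector` and `language_and_node_id` are hypothetical, and taking the first all-digit path segment as the node ID is an assumption based on the sample link.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def language_and_node_id(href):
    """Return (first path segment, first all-digit path segment) of a link."""
    segments = [s for s in urlsplit(href).path.split("/") if s]
    lang = segments[0] if segments else None
    node_id = next((s for s in segments if s.isdigit()), None)
    return lang, node_id


if __name__ == "__main__":
    collector = LinkCollector()
    collector.feed('<a href="/en/node/1032/delete?destination=node%2F5%2Fblog">delete</a>')
    for href in collector.hrefs:
        print(language_and_node_id(href))
```

Splitting the URL path into segments means the language prefix does not have to be hard-coded as "en".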
([^/ ]). That will give you
href="
en
node
1032

Regex Negation : Matching patterns other than specific strings

I am using a voice-to-text application that produces transcription files as output. The transcribed text contains a few tags such as (s) (sentence beginning), (/s) (sentence end), and (VOCAL_NOISE) (unrecognized words), but it also contains unwanted tags like (VOCAL_N), (VOCAL_NOISED), (VOCAL_SOUND), and (UNKNOWN). I am using sed to process the text, but cannot write an appropriate regex that replaces all tags except (s), (/s) and (VOCAL_NOISE) with the tag ~NS. I would appreciate it if someone could help me with this.
Example text:
(s) Hi Stacey , this is Stanley (/s) (s) I would (VOCAL_N) appreciate if you could call (UNKNOWN) and let him know I want an appointment (VOCAL_NOISE) with him (/s)
Output should be:
(s) Hi Stacey , this is Stanley (/s) (s) I would ~NS appreciate if you could call ~NS and let him know I want an appointment (VOCAL_NOISE) with him (/s)
This should take care of it:
sed 's|([^)]*)|\n&\n|g;s#\n\((/\?s)\|(VOCAL_NOISE)\)\n#\1#g;s|\n\(([^)]*)\)\n|~NS|g' inputfile
Explanation:
s|([^)]*)|\n&\n|g - divide the line by putting every parenthesized string between two newlines
s#\n\((/\?s)\|(VOCAL_NOISE)\)\n#\1#g - remove the newlines around "(s)", "(/s)" and "(VOCAL_NOISE)" (keepers)
s|\n\(([^)]*)\)\n|~NS|g - replace anything else between newlines that is within parentheses with "~NS"
This works since newlines are guaranteed not to appear within a newly read line of text.
Edit: Shortened the command by using alternation \(foo\|bar\)
Previous version:
sed 's|([^)]*)|\n&\n|g;s|\n\((/\?s)\)\n|\1|g; s|\n\((VOCAL_NOISE)\)\n|\1|g;s|\n\(([^)]*)\)\n|~NS|g' inputfile
This is a dirty trick that is far from being optimal but it should work for you:
sed '
s|(\(/\?\)s)|[\1AAA]|g;
s|(VOCAL_NOISE)|[BBB]|g;
s/([^)]*)/~NS/g;
s|\[\(/\?\)AAA\]|(\1s)|g;
s|\[BBB\]|(VOCAL_NOISE)|g'
The trick is to replace (s), (/s) and (VOCAL_NOISE) with placeholders that are not present in the input text (in this case [AAA], [/AAA] and [BBB]); then we replace every remaining parenthesized tag with ~NS; finally we restore the placeholders to their original values.
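In a regex flavor with lookahead, which sed does not offer, the same whitelist can be written as a single substitution. Here is a minimal Python sketch; the function name `mask_unknown_tags` is made up for illustration, and the sample line is the one from the question.

```python
import re

# A parenthesized tag whose content is NOT exactly "s", "/s" or "VOCAL_NOISE".
# The negative lookahead checks the whole tag up to its closing parenthesis,
# so "(VOCAL_NOISED)" is still treated as unwanted.
UNWANTED_TAG = re.compile(r"\((?!(?:/?s|VOCAL_NOISE)\))[^)]*\)")


def mask_unknown_tags(text: str) -> str:
    """Replace every tag outside the whitelist with ~NS."""
    return UNWANTED_TAG.sub("~NS", text)


if __name__ == "__main__":
    line = (
        "(s) Hi Stacey , this is Stanley (/s) (s) I would (VOCAL_N) appreciate "
        "if you could call (UNKNOWN) and let him know I want an appointment "
        "(VOCAL_NOISE) with him (/s)"
    )
    print(mask_unknown_tags(line))
```

Extending the whitelist only requires adding another alternative inside the lookahead.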
I could suggest this using vim:
:%s/\((\w\+)\)\&\(\((s)\|(VOCAL_NOISE)\)\@!\)/\~NS/g
Using a shell (bash) you can do the following:
vim file -c '%s/\((\w\+)\)\&\(\((s)\|(VOCAL_NOISE)\)\@!\)/\~NS/g' -c "wq"
Make a backup first, I am not responsible for any damage if this is wrong.
Simply this?
sed -E 's/\((VOCAL_N|UNKNOWN)\)/~NS/g'
In this case you'd have a blacklist (you know what to filter out). Or do you absolutely need a whitelist (you know what NOT to filter out)?
awk -vRS=")" -vFS="(" '$2!~/s|\\s|VOCAL_NOISE/{$2="~NS"}RT' ORS=")" file |sed 's/~NS)/~NS/g'