Regex negation: matching patterns other than specific strings

I am using a voice-to-text application that produces transcription files as output. The transcribed text contains a few tags: (s) for sentence beginning, (/s) for sentence end, and (VOCAL_NOISE) for unrecognized words. The text also contains unwanted tags such as (VOCAL_N), (VOCAL_NOISED), (VOCAL_SOUND) and (UNKNOWN). I am using sed to process the text, but I cannot write an appropriate regex that replaces every tag other than (s), (/s) and (VOCAL_NOISE) with the tag ~NS. I would appreciate it if someone could help me with this.
Example text:
(s) Hi Stacey , this is Stanley (/s) (s) I would (VOCAL_N) appreciate if you could call (UNKNOWN) and let him know I want an appointment (VOCAL_NOISE) with him (/s)
Output should be:
(s) Hi Stacey , this is Stanley (/s) (s) I would ~NS appreciate if you could call ~NS and let him know I want an appointment (VOCAL_NOISE) with him (/s)

This should take care of it:
sed 's|([^)]*)|\n&\n|g;s#\n\((/\?s)\|(VOCAL_NOISE)\)\n#\1#g;s|\n\(([^)]*)\)\n|~NS|g' inputfile
Explanation:
s|([^)]*)|\n&\n|g - divide the line by putting every parenthesized string between two newlines
s#\n\((/\?s)\|(VOCAL_NOISE)\)\n#\1#g - remove the newlines around "(s)", "(/s)" and "(VOCAL_NOISE)" (keepers)
s|\n\(([^)]*)\)\n|~NS|g - replace anything else between newlines that is within parentheses with "~NS"
This works since newlines are guaranteed not to appear within a newly read line of text.
Edit: Shortened the command by using alternation \(foo\|bar\)
Previous version:
sed 's|([^)]*)|\n&\n|g;s|\n\((/\?s)\)\n|\1|g; s|\n\((VOCAL_NOISE)\)\n|\1|g;s|\n\(([^)]*)\)\n|~NS|g' inputfile
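For comparison, a regex flavour with lookahead can express the whitelist directly, without the newline markers. A minimal Python sketch, assuming tags never contain a nested ")" (the name replace_unwanted_tags is made up for illustration):

```python
import re

# Match any "(...)" tag unless it is exactly (s), (/s) or (VOCAL_NOISE).
# The negative lookahead is evaluated right after the opening parenthesis.
UNWANTED_TAG = re.compile(r"\((?!(?:/?s|VOCAL_NOISE)\))[^)]*\)")

def replace_unwanted_tags(text: str) -> str:
    """Replace every tag that is not on the whitelist with ~NS."""
    return UNWANTED_TAG.sub("~NS", text)

if __name__ == "__main__":
    sample = "(s) I would (VOCAL_N) call (UNKNOWN) now (VOCAL_NOISE) (/s)"
    print(replace_unwanted_tags(sample))
```

Because the lookahead requires the closing parenthesis, near misses such as (VOCAL_NOISED) are still replaced.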

This is a dirty trick that is far from being optimal but it should work for you:
sed '
s|(\(/\?\)s)|[\1AAA]|g;
s|(VOCAL_NOISE)|[BBB]|g;
s/([^)]*)/~NS/g;
s|\[\(/\?\)AAA\]|(\1s)|g;
s|\[BBB\]|(VOCAL_NOISE)|g'
The trick is to replace (s), (/s) and (VOCAL_NOISE) with placeholders that are not present in the input text (here [AAA], [/AAA] and [BBB]), then replace every remaining parenthesized tag, matched by ([^)]*), with ~NS, and finally restore the placeholders to their original values.

I could suggest this using Vim:
:%s/\(\((s)\|(VOCAL_NOISE)\)\@!\)\&\((\w\+)\)/\~NS/g
From a shell (bash) you can do the following:
vim file -c '%s/\(\((s)\|(VOCAL_NOISE)\)\@!\)\&\((\w\+)\)/\~NS/g' -c "wq"
Make a backup first; I am not responsible for any damage if this is wrong.

Simply this?
sed -E 's/\((VOCAL_N|UNKNOWN)\)/~NS/g'
This is a blacklist approach (you know what to filter out). Or do you absolutely need a whitelist (you know what NOT to filter out)?
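If a whitelist is what is needed, one way to keep it readable is to match every parenthesized tag and decide per match whether to keep it. A small Python sketch of that idea (the set KEEP and the function name whitelist_filter are illustrative, not part of any answer above):

```python
import re

# Tags that must survive unchanged, taken from the question.
KEEP = {"(s)", "(/s)", "(VOCAL_NOISE)"}

def whitelist_filter(text: str) -> str:
    """Keep whitelisted tags and replace any other parenthesized tag with ~NS."""
    return re.sub(
        r"\([^)]*\)",
        lambda m: m.group(0) if m.group(0) in KEEP else "~NS",
        text,
    )
```

Adding or removing an allowed tag then only means editing the set, not the regex.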

awk -vRS=")" -vFS="(" -vOFS="" '$2!~/^(\/?s|VOCAL_NOISE)$/{$2="~NS"}RT' ORS=")" file | sed 's/~NS)/~NS/g'

Related

Regex that deletes everything except tags that contain a specific string

I need a regex that can be used in the Vim editor or in bash (grep command) and that deletes everything in a file, leaving only the tags that contain a specific string:
<generic>
stuff1
stuff2
stuff3
</generic>
and
<generic>
stuff1
stuff2
DESIRED_STRING
stuff3
</generic>
The first one would be wiped and the second one would remain because of the DESIRED_STRING.
In the end, I need a file with lots of tags that all contain a given modifier. This process will be executed several times to split one huge file into multiple others.
This (?<=\<custom_item\>).*?(?=\<\/custom_item\>) got me to the point where I could match the content inside the tags, but I am not able to filter it.
The file will always follow this structure
<tag>
system : "Linux"
type : CHECK
</tag>
Where 'CHECK' is the modifier and the word I am looking for
Thank you!!
You may use this approach using awk:
awk '/<generic>/ { tag=1 }
tag && /DESIRED_STRING/ { p=1 }
tag { s = s $0 RS }
/<\/generic>/ { if (p) printf "%s", s; tag=p=0; s="" }' file
We use two flags to track our state here: tag is set while we are between the opening and closing tags, and p is set once we have found the desired string inside that block.
Here's an alternative in Vim. It is much easier to match than to avoid matching, so:
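The same two-flag logic can be written in other languages as well. A minimal Python sketch, assuming the tags sit on their own lines as in the question (the function name keep_blocks and its parameters are made up for illustration):

```python
def keep_blocks(lines, open_tag="<generic>", close_tag="</generic>",
                wanted="DESIRED_STRING"):
    """Return only the lines of blocks that contain the wanted string."""
    kept, buffer = [], []
    inside = found = False
    for line in lines:
        if open_tag in line:
            inside = True            # same role as the awk flag "tag"
        if inside:
            buffer.append(line)      # collect the current block
            if wanted in line:
                found = True         # same role as the awk flag "p"
        if close_tag in line:
            if found:
                kept.extend(buffer)  # emit the block only if it qualified
            buffer = []
            inside = found = False
    return kept
```

Lines outside any block are dropped, which matches the awk program above.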
Gmz:1,'z g/DESIRED_STRING/norm yat:$pu<Ctrl-V><Enter><Enter>'zdgg
where <Ctrl-V> and <Enter> are supposed to be keys, not actual text to be entered.
Gmz will set a z mark at the last line. Then, we search for the DESIRED_STRING, and at each one, yank the tag, then paste it to the bottom of the file (in order). Then 'zdgg to delete the original (from the mark z to the top of the file).
Basically, instead of trying to delete everything and making exceptions for the desired content, pull the desired content out first, then delete everything.
Bonus: This will work even with tags that don't align with line breaks (even though OP doesn't have those). For example,
outside<tag>inside
foo DESIRED_STRING inside</tag>outside
will correctly produce
<tag>inside
foo DESIRED_STRING inside</tag>
With Vim regex:
:%s/<\([^>]*\)>\(\_.\(DESIRED_STRING\)\@!\)\{-}<\/\1>//
This regex uses a negative lookahead, \@!, to match all blocks of text not containing DESIRED_STRING. These blocks are then removed with the :%s command.
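The negative-lookahead idea carries over to other regex engines. A rough Python counterpart, assuming blocks are not nested and tag names contain neither "/" nor ">" (the function name drop_blocks_without is hypothetical):

```python
import re

def drop_blocks_without(text: str, wanted: str = "DESIRED_STRING") -> str:
    """Delete every <tag>...</tag> block that does not contain `wanted`."""
    pattern = re.compile(
        r"<([^>/]*)>"                                # opening tag, name in group 1
        r"(?:(?!" + re.escape(wanted) + r").)*?"     # any char not starting `wanted`
        r"</\1>\n?",                                 # matching closing tag
        re.DOTALL,
    )
    return pattern.sub("", text)
```

The lookahead is checked before each consumed character, so a block that contains the wanted string can never be matched as a whole and is left in place.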

regex replace in lines starting with {\s between first space to ;}

I have some corrupt RTF files with lines like this:
{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 Fußzeile Zchn;}
^----------------------------^
I want to replace all characters matching [^a-zA-Z0-9_\{}; ],
but only in lines beginning with "{\s" and ending with ";}", and only from the first space up to the ";}".
The first space and the ";}" themselves should not be replaced.
You didn't specify a language; here is a Regex101 example:
({\\s.+?\s)(.*)(})
So, I'm unsure what language/technology you'd like to use here, but if using C# is an option, you can check out this previous question. The answer gets you almost the way there.
For your example:
var text = @"{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 Fußzeile Zchn;}";
var pattern = @"^({\\s\S*\s[a-zA-Z0-9_\{}; ]*)([^a-zA-Z0-9_\{}; ]*)([^}]*})";
var replaced = System.Text.RegularExpressions.Regex.Replace(text, pattern, "$1$3");
This will get you to replace one contiguous blob of bad characters, which addresses your example, but unfortunately, not your question. There is probably a more elegant solution, but I think you'll have to iteratively run that expression until the input and output of Regex.Replace() are equal.
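One way to avoid running the expression repeatedly is to split the work in two: first isolate the span between the first space and the closing ";}", then clean only that span. A Python sketch of this, under two assumptions that the question leaves open: unwanted characters are simply removed, and the allowed set is letters, digits, underscore, braces, semicolon and space (the name clean_style_line is made up):

```python
import re

# head = "{\s..." up to and including the first space, tail = the final ";}".
LINE = re.compile(r"^(\{\\s\S* )(.*)(;\})$")
# Anything outside the allowed set (assumption: see the lead-in).
BAD = re.compile(r"[^a-zA-Z0-9_{}; ]")

def clean_style_line(line: str, replacement: str = "") -> str:
    """Clean the text between the first space and ';}' on matching lines."""
    m = LINE.match(line)
    if m is None:
        return line  # not a "{\s ... ;}" line, leave it untouched
    head, body, tail = m.groups()
    return head + BAD.sub(replacement, body) + tail
```

All bad characters in the span are handled in a single pass, however many separate runs of them there are.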
If you can use sed in a terminal, you could do something like this.
sed -i 's/^\({\\s[^ ]*\s\).*\(\;}\)\(}\)\?$/\1\2/' filename
Turned my file containing:
{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 Fußzeile Zchn;}
To:
{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 ;}

regular expression for finding alphanumeric & hyphen PATTERN in a text file

From the text below I want to fetch the pattern LETTERS-NUMBERS (letters and numbers separated by a hyphen), for example "MGRAPP-713" in this file. The value will change for each file, but the pattern will remain constant.
123.txt
REPORTS:
restrict CBD [Jawad Hameed] [2018-01-31 16:31:00 -0500]
debug [Jawad Hameed] [2018-01-31 16:09:08 -0500]
debug [Jawad Hameed] [2018-01-31 15:59:52 -0500]
Merge pull request #65 from HotelKey/MGRAPP-713 [GitHub] [2018-01-31 11:35:30 0100]
MGRAPP-713 [sabrio] [2018-01-30 15:30:56 +0100]
I'm using: grep '[A-Z0-9-]' 123.txt
Your question isn't very clear. Are you always looking for that particular string? Do you want the whole line, or just that field? Is the string you're looking for always at the beginning of the line?
Based on my guess about what you meant, I'd suggest:
$ awk '/^[A-Z]+-[0-9]/ {print $1}' mgrapp
MGRAPP-713
Whenever you want to print part of line matching a pattern, awk is your friend.
Edit
In your comment, you clarify your objective somewhat. Here's a slightly more elaborate solution:
$ awk '/^[A-Z]+-[0-9]/ {
match($1, /^[A-Z]+-[0-9]+/)
printf "%.*s\n", RLENGTH, $1 }' mgrapp
MGRAPP-713
But I can't write your program for you. I'm just demonstrating that awk lets you write simple programs to grab strings out of text files. Like any powerful tool, it takes time to learn. It's time well spent because, you know, "Luck favors the prepared mind."
Use this:
[A-Z]+-[0-9]+
[A-Z]+ -- for matching all caps word,
- -- for hyphen,
[0-9]+ -- for numbers
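As an illustration of how that pattern behaves on the sample log, here is a small Python sketch (the function name find_ticket_ids is made up; the word boundaries are an addition to the pattern above):

```python
import re

# Uppercase letters, a hyphen, then digits, bounded by word boundaries.
TICKET = re.compile(r"\b[A-Z]+-[0-9]+\b")

def find_ticket_ids(text: str) -> list:
    """Return the distinct IDs in order of first appearance."""
    seen = []
    for match in TICKET.findall(text):
        if match not in seen:
            seen.append(match)
    return seen
```

Dates such as 2018-01-31 are not picked up because the part before the hyphen must consist of uppercase letters.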

Regex Assistance for a url filepath

Can someone assist in creating a Regex for the following situation:
I have about 2000 records in which I need to do a search/replace on a known item that looks like this:
<li>View Product Information</li>
The FILEPATH and FILE are variable, but the surrounding HTML is always the same. Can someone assist with what kind of Regex I would substitute for the "FILEPATH/FILE" part of the search?
You can match the constant parts and use grouping to put them back:
(<li>View Product Information</li>)
Then replace the match with $1your_replacement$2, where $1 is the first matching group and $2 the second (in Python, for instance, you would call Match.group(1) and Match.group(2)).
You would have to escape \ chars if you're using Java instead.
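A sketch of the grouping idea in Python. The markup in the question appears to have lost its link element, so the <a href="..."> wrapper used here is purely an assumption for illustration, as are the names LINK and replace_path:

```python
import re

# ASSUMED markup: <li><a href="FILEPATH/FILE">View Product Information</a></li>
# Group 1 and group 2 capture the constant text around the variable path.
LINK = re.compile(r'(<li><a href=")[^"]*(">View Product Information</a></li>)')

def replace_path(record: str, new_path: str) -> str:
    """Swap the variable path while keeping the surrounding HTML intact."""
    # A function replacement avoids backslash or group-reference
    # surprises if new_path itself contains "\" characters.
    return LINK.sub(lambda m: m.group(1) + new_path + m.group(2), record)
```

The constant parts are re-emitted from the captured groups, so only the path between the quotes changes.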

use regex to modify srt file?

one format of srt file looks like this:
0:00:04 --> 00:00:10
and another format looks like this
0:00:04,000 --> 00:00:10,000
I want to process the first kind of file and append ,000 to each timestamp for compatibility purposes, so that it uses the same ,000 formatting as the second example above.
I was thinking of trying to use some string functions like mid(), right(), instring() but wondered if regex might do the job better, any suggestions on how to do this?
You can use this regex to match the first format:
^([0-9]{1,2}:[0-9]{2}:[0-9]{2}) --> ([0-9]{1,2}:[0-9]{2}:[0-9]{2})$
And then replace $1 by $1 + ",000" and $2 by $2 + ",000"
Since you don't indicate which language you used, I did a simple example in PHP :
<?php
// Append ,000 to both timestamps of a short-format timing line.
$pattern = '/^([0-9]{1,2}:[0-9]{2}:[0-9]{2}) --> ([0-9]{1,2}:[0-9]{2}:[0-9]{2})$/';
$replacement = '$1,000 --> $2,000';
echo preg_replace($pattern, $replacement, '0:00:04 --> 00:00:10');
// output : 0:00:04,000 --> 00:00:10,000
?>
With sed (it's available on Windows too):
sed -i -E 's/^([0-9]+:[0-9]+:[0-9]+) *--> *([0-9]+:[0-9]+:[0-9]+) *$/\1,000 --> \2,000/' INPUT.srt
The -i option makes sed edit the file in place.
(And I know it's not the correct regex to capture time definitions... but it works for this job.)
Sure, that sounds like a good idea. A simple approach would be to match for (\d?\d:\d\d:\d\d) and replace it with the match itself plus ,000 (for "the match itself" use a back reference, which might be something like \1 or $1, depending on your language).
Try implementing this, and if you need further help, start a new question where you mention what you have tried, where you are stuck and which language you are using.
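To make the back-reference idea concrete, here is a minimal Python sketch. It only touches lines containing "-->" and skips timestamps that already carry a fractional part; the function name add_milliseconds is made up for illustration:

```python
import re

# A timestamp like 0:00:04 or 00:00:04 that is NOT already followed
# by a comma or another digit (so 00:00:04,000 is left alone).
TIME = re.compile(r"\b(\d?\d:\d\d:\d\d)(?![,\d])")

def add_milliseconds(srt_text: str) -> str:
    """Append ,000 to short-format timestamps on timing lines only."""
    out = []
    for line in srt_text.splitlines(keepends=True):
        if "-->" in line:                     # leave subtitle text untouched
            line = TIME.sub(r"\1,000", line)  # \1 is the matched timestamp
        out.append(line)
    return "".join(out)
```

Restricting the substitution to timing lines means a time mentioned in the subtitle text itself is not altered.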
Why not simply
sed -e '/ --> /s/ -->\|$/,000&/g' old.srt >new.srt
provided that old.srt consistently contains the shorter format only.