regex replace in lines starting with {\s between first space to ;} - regex

I have some corrupt RTF files with lines like this:
{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 Fußzeile Zchn;}
^----------------------------^
I want to replace all characters matching [^a-zA-Z0-9_\{}; ],
but only in lines beginning with "{\s" and ending with "};", and only from the first "space" to "};".
The first "space" and "};" themselves should not be replaced.

You didn't specify a language; here is a Regex101 example:
({\\s.+?\s)(.*)(})

So, I'm unsure what language/technology you'd like to use here, but if using C# is an option, you can check out this previous question. The answer gets you almost the way there.
For your example:
var text = @"{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 Fußzeile Zchn;}";
var pattern = @"^({\\s\S*\s[a-zA-Z0-9_\{}; ]*)([^a-zA-Z0-9_\{}; ]*)([^}]*})";
var replaced = System.Text.RegularExpressions.Regex.Replace(text, pattern, "$1$3");
This will get you to replace one contiguous blob of bad characters, which addresses your example, but unfortunately, not your question. There is probably a more elegant solution, but I think you'll have to iteratively run that expression until the input and output of Regex.Replace() are equal.

If you can use sed in a terminal, you could do something like this.
sed -i 's/^\({\\s[^ ]*\s\).*\(\;}\)\(}\)\?$/\1\2/' filename
Turned my file containing:
{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 Fußzeile Zchn;}
To:
{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 ;}
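A single pass is also possible if the cleanup is applied only to the captured middle part of the line. The following Python sketch is one way to do that; the function name `clean_style_line`, the `replacement` parameter, and the choice to match the `;}` ending from the sample line (rather than the `};` written in the question) are assumptions made for illustration.

```python
import re

# Assumption: characters to keep are exactly those listed in the question's class.
BAD_CHARS = re.compile(r"[^a-zA-Z0-9_{}; ]")

# Assumption: a style line starts with "{\s", runs without spaces up to the
# first space, and ends with ";}" as in the sample line.
STYLE_LINE = re.compile(r"^(\{\\s\S* )(.*)(;\})$")


def clean_style_line(line, replacement=""):
    """Replace unwanted characters only between the first space and ';}'."""
    match = STYLE_LINE.match(line)
    if match is None:
        return line  # not a style line: leave it untouched
    head, body, tail = match.groups()
    return head + BAD_CHARS.sub(replacement, body) + tail


sample = r"{\s39\li0\fi0\ri0\sb0\sa0\ql\vertalt\fs22 Fußzeile Zchn;}"
print(clean_style_line(sample))
```

Because the substitution runs on the captured body only, the control words before the first space and the closing `;}` are never touched, and no repeated passes are needed.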

Related

Powershell with regex: Unable to find and replace ALL occurrences of specified string in a set of data

I am new to regular expressions and stackoverflow. Any help would be greatly appreciated.
I am trying to remove unwanted data from a data set. The data is contained in a .csv file column with multiple cells, each cell containing data similar to this:
OSVDB #109124,OSVDB #109125,OSVDB #109126,OSVDB #109127,OSVDB #109128,OSVDB #109129,OSVDB #109130,OSVDB #109131,OSVDB #109132,OSVDB #109133,OSVDB #109134,OSVDB #109135,OSVDB #109136,OSVDB #109137,OSVDB #109138,OSVDB #109139,OSVDB #109140,OSVDB #109141,OSVDB #109142,OSVDB #109143,VMSA #2014-0012,OSVDB #102715,OSVDB #104972,OSVDB #106710,OSVDB #115364,IAVA #2014-A-0191,IAVB #2014-B-0160,IAVB #2014-B-0162,IAVB #2015-B-0007
I want to replace the above data with each occurrence of the strings beginning "IAV...". So, the above cell would read:
IAVA #2014-A-0191,IAVB #2014-B-0160,IAVB #2014-B-0162,IAVB #2015-B-0007
Below is a snippet of the script that imports the .csv and gets the column containing the data.
My regex, within powershell is:
$reg1 = '$1'
$reg2 = '(IAV[A|B]\s#[0-9]{4}-[A|B]-[0-9]{4}){1,}'
ForEach-Object {$_.IAVM = [regex]::replace($_.IAVM,$reg2,$reg1); $_}
The result is:
The entire cell contents posted above.
From my understanding {1,} at the end of the regex should return each occurrence of the string pattern, but I'm returning all contents of every cell containing my regex string.
Maybe instead of trying to pick out your string you just delete the stuff you don't want? Try something like:
$reg1=''
$reg2='((OSVDB|VMSA)\s#[M-S0-9-]{6,9}[,]?)'
You have .* in that regex at the very beginning. This will capture everything up to the last match of the pattern that follows it. In your case I don't think you need that part anyway.
Also note that PowerShell has a handy -replace operator, so there's often no reason to use the static methods on the Regex type.
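Replacing each match with $1 puts back exactly what was matched and leaves all unmatched text in place, which is why the cell comes back unchanged. A different approach is to collect the wanted entries and rebuild the cell from them. Here is a minimal Python sketch of that idea; the name `keep_iav` is hypothetical, and the pattern assumes the entries always look like the IAVA/IAVB samples in the question.

```python
import re

# Assumption: wanted entries always have the form "IAVA #2014-A-0191".
IAV_PATTERN = re.compile(r"IAV[AB] #\d{4}-[AB]-\d{4}")


def keep_iav(cell):
    """Return only the IAVA/IAVB entries of a cell, comma-separated."""
    return ",".join(IAV_PATTERN.findall(cell))


cell = "OSVDB #109124,VMSA #2014-0012,IAVA #2014-A-0191,IAVB #2014-B-0160"
print(keep_iav(cell))
```

Note the character class `[AB]`: writing `[A|B]` would also accept a literal `|`.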

Need regex to strip away remaining part of a path

I am trying to write a regex which will strip away the rest of the path after a particular folder name.
If Input is:
/Repository/Framework/PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces/IDemoReader.cs
Output should be:
/Repository/Framework/PITA/branches/ChangePack-6a7B6
Some constraints:
ChangePack- will be followed by the change pack id, which is a mix of digits and letters (a-z or A-Z only) in any order. There is no limit on the length of the change pack id.
ChangePack- is a constant. It will always be there.
And the text before the ChangePack can also change. Like it can also be:
/Repository/Demo1/Demo2/4.3//PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces
My regex-fu is bad. What I have come up with till now is:
^(.*?)\-6a7B6
I need to make this generic.
Any help will be much appreciated.
The regex below can do the trick.
^(.*?ChangePack-[\w]+)
Input:
/Repository/Framework/PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces/IDemoReader.cs
/Repository/Demo1/Demo2/4.3//PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces
Output:
/Repository/Framework/PITA/branches/ChangePack-6a7B6
/Repository/Demo1/Demo2/4.3//PITA/branches/ChangePack-6a7B6
Check out the live regex demo here.
^(.*?ChangePack-[a-zA-Z0-9]+)
Try this. Instead of replacing, grab the match $1 or \1. See demo.
https://regex101.com/r/iY3eK8/17
Will you always have '/Repository/Framework/PITA/branches/' at the beginning? If so, this will do the trick:
/Repository/Framework/PITA/branches/\w+-\w*
Instead of regex you can use split and join functions. Example in Python:
path = "/a/b/c/d/e"
folders = path.split("/")
newpath = "/".join(folders[:3]) #trims off everything from the third folder over
print(newpath) #prints "/a/b"
If you really want regex, try something like ^.*\/folder\/ where folder is the name of the directory you want to match.
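For reference, the ChangePack patterns suggested above can be tried out in Python as follows. This is only a sketch: the function name `strip_after_change_pack` is made up here, and returning the input unchanged when nothing matches is an assumption about the desired fallback.

```python
import re

# Lazy prefix, then the constant "ChangePack-" followed by an alphanumeric id.
CHANGE_PACK = re.compile(r"^.*?/ChangePack-[A-Za-z0-9]+")


def strip_after_change_pack(path):
    """Cut the path right after the ChangePack-<id> folder."""
    match = CHANGE_PACK.match(path)
    # Assumption: paths without a ChangePack folder are returned unchanged.
    return match.group(0) if match else path


paths = [
    "/Repository/Framework/PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces/IDemoReader.cs",
    "/Repository/Demo1/Demo2/4.3//PITA/branches/ChangePack-6a7B6/core/src/Pita.x86.Interfaces",
]
for p in paths:
    print(strip_after_change_pack(p))
```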

How to make a regex to replace the value of a key in a json file

I want to make a regex so I can do a "Search/Replace"
over a json file with many object.
Every object has a key named "resource"
containing a URL.
Take a look at these examples:
"resource":"http://www.img/qwer/123/image.jpg"
"resource":"io.nl.info/221/elephant.gif"
"resource":"simgur.com/icon.png"
I want to make a regex to replace the whole url with
a string like this: img/filename.format.
This way, the result would be:
"resource":"img/image.jpg"
"resource":"img/elephant.gif"
"resource":"img/icon.png"
I'm just starting with regular expressions and I'm
completely lost. I was thinking that one valid idea would
be to write something starting with this pattern "resource":"
and ending with the last five characters. But I don't even know how to try
that.
How could I write the regular expression?
Thanks in advance!
Try this:
Find: "resource":\s*"[^"]+?([^\/"]+)"
Replace: "resource":"img/\1"
Using [^"]+? ensures the match won't roll off the end of the current entry and gobble up too much input, and it's reluctant (with the added ?) so the group captures the whole image file name (instead of just the last character).
Edit:
I added optional whitespace after the key, which your pastebin has.
See a live demo of this regex with your pastebin.
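The same find/replace can be checked outside an editor. Below is a small Python sketch using `re.sub`; the function name `localize` is hypothetical, and the replacement string includes the closing quote because the find pattern consumes it.

```python
import re

# Find pattern from the answer above; group 1 is the file name after the last "/".
FIND = re.compile(r'"resource":\s*"[^"]+?([^/"]+)"')


def localize(text):
    """Rewrite every resource URL to img/<file name>."""
    return FIND.sub(r'"resource":"img/\1"', text)


sample = '''"resource":"http://www.img/qwer/123/image.jpg"
"resource":"io.nl.info/221/elephant.gif"
"resource": "simgur.com/icon.png"'''
print(localize(sample))
```

One limitation of this pattern: a URL with no slash at all would lose its first character to the reluctant part, so it assumes every resource contains at least one "/".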
Regex
.*\/
Debuggex Demo
This will find the text you want to replace. Replace it with img/. If you want to match the whole text, you'll need the following regex:
("resource":").*\/
Debuggex Demo
Then replace with $1img/. This should give you group 1 followed by the img part.
Let me know if there are any questions
Note: since you have JSON, I personally would parse it into objects, iterate over them, and change each resource independently rather than looking for a magic bullet.
If your JSON is an array of objects containing a resource field, I would do it in three steps: convert to an object, find the resources and replace them, then convert back to a string (optional).
var tmp = JSON.parse('<your json>');
for (var i = 0; i < tmp.length; ++i) {
  for (var e in tmp[i]) {
    if (e === 'resource') {
      // Drop everything up to the last "/" and prepend "img/"
      tmp[i][e] = 'img/' + tmp[i][e].replace(/^.*\//, '');
    }
  }
}
tmp = JSON.stringify(tmp);
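The parse, modify, serialize approach looks much the same in Python. This is a sketch under the assumption that the file holds a JSON array of objects; `localize_resources` and its `prefix` parameter are illustrative names, not part of any library.

```python
import json


def localize_resources(json_text, prefix="img/"):
    """Point every "resource" value at prefix + its file name."""
    items = json.loads(json_text)  # assumption: top level is a list of objects
    for item in items:
        if "resource" in item:
            # Keep only the part after the last "/" (the whole value if none).
            filename = item["resource"].rsplit("/", 1)[-1]
            item["resource"] = prefix + filename
    return json.dumps(items)


data = '[{"resource": "http://www.img/qwer/123/image.jpg"}, {"resource": "simgur.com/icon.png"}]'
print(localize_resources(data))
```

Working on parsed objects avoids accidentally matching the text "resource" inside some other string value.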

Article spinner with 2 tiers

I made an article spinner that used regex to find words in this syntax:
{word1|word2}
And then split them up at the "|", but I need a way to make it support tier 2 brackets, such as:
{{word1|word2}|{word3|word4}}
What my code does when presented with such a line, is take "{{word1|word2}" and "{word3|word4}", and this is not as intended.
What I want is when presented with such a line, my code breaks it up as "{word1|word2}|{word3|word4}", so that I can use this with the original function and break it into the actual words.
I am using c#.
Here is the pseudo code of how it might look like:
Check string for regex match to "{{word1|word2}|{word3|word4}}" pattern
If found, store each one as "{word1|word2}|{word3|word4}" in MatchCollection (mc1)
Split the word at the "|" but not the one inside the brackets, and select a random one (aka, "{word1|word2}" or "{word3|word4}")
Store the new results aka "{word1|word2}" and "{word3|word4}" in a new MatchCollection (mc2)
Now search the string again, this time looking for "{word1|word2}" only and ignore the double "{{" "}}"
Store these in mc2.
I can not split these up normally
Here is the regex I use to search for "{word1|word2}":
Regex regexObj = new Regex(@"\{.*?\}", RegexOptions.Singleline);
MatchCollection m = regexObj.Matches(originalText); //How I store them
Hopefully someone can help, thanks!
Edit: I solved this using a recursive method. I was building an article spinner btw.
That is not parsable using a regular expression; instead you have to use a recursive descent parser. Map it to JSON by replacing:
{ with [
} with ]
| with ,
wordX with "wordX" (regex \w+)
Then your input
{{word1|word2}|{word3|word4}}
becomes valid JSON
[["word1","word2"],["word3","word4"]]
and will map directly to PHP arrays when you call json_decode.
In C#, the same should be possible with JavaScriptSerializer.
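As a rough illustration of the JSON mapping described above, here is a Python sketch. The name `spintax_to_lists` is hypothetical, and the sketch assumes every option is a single `\w+` word, as in the example; options containing spaces or punctuation would need a more careful tokenizer.

```python
import json
import re


def spintax_to_lists(text):
    """Turn {a|b} spintax into nested Python lists via a JSON round trip."""
    # Quote every word first so the result is valid JSON.
    quoted = re.sub(r"\w+", lambda m: json.dumps(m.group(0)), text)
    # Then swap the spintax delimiters for JSON array syntax.
    as_json = quoted.replace("{", "[").replace("}", "]").replace("|", ",")
    return json.loads(as_json)


print(spintax_to_lists("{{word1|word2}|{word3|word4}}"))
```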
I'm really not completely sure WHAT you're asking for, but I'll give it a go:
If you want to get {word1|word2}|{word3|word4} out of any occurrence of {{word1|word2}|{word3|word4}} but not {word1|word2} or {word3|word4}, then use this:
@"\{(\{[^}]*\}\|\{[^}]*\})\}"
...which will match {{word1|word2}|{word3|word4}}, but with {word1|word2}|{word3|word4} in the first matching group.
I'm not sure if this will be helpful or even if it's along the right track, but I'll try to check back every once in a while for more questions or clarifications.
s = "{Spinning|Re-writing|Rotating|Content spinning|Rewriting|SEO Content Machine} is {fun|enjoyable|entertaining|exciting|enjoyment}! try it {for yourself|on your own|yourself|by yourself|for you} and {see how|observe how|observe} it {works|functions|operates|performs|is effective}."
print(spin(s))
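The `spin` function called above is not shown here. One possible way to write it, offered only as a sketch, is to resolve the innermost groups first and repeat until no braces remain; the optional `rng` parameter is an assumption added to make the choice source swappable.

```python
import random
import re

# An innermost group is a brace pair with no braces inside it.
INNERMOST = re.compile(r"\{([^{}]*)\}")


def spin(text, rng=random):
    """Resolve nested {a|b} groups from the inside out."""
    while True:
        text, count = INNERMOST.subn(
            lambda m: rng.choice(m.group(1).split("|")), text
        )
        if count == 0:  # no groups left to resolve
            return text


print(spin("{{word1|word2}|{word3|word4}} is {fun|enjoyable}!"))
```

Each pass removes one level of nesting, so `{{word1|word2}|{word3|word4}}` first becomes something like `{word1|word4}` and then a single word. This handles any nesting depth, not just two tiers.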
If you want to use the [square|brackets|syntax] use this line in the process function:
'/\[(((?>[^\[\]]+)|(?R))*)\]/x',

Regex Negation : Matching patterns other than specific strings

I am using a Voice-to-Text application which gives transcription files as output. The transcribed text contains a few tags like (s) (for sentence beginning), (/s) (for sentence end) and (VOCAL_NOISE) (for unrecognized words), but the text also contains unwanted tags like (VOCAL_N), (VOCAL_NOISED), (VOCAL_SOUND) and (UNKNOWN). I am using sed to process the text, but cannot write an appropriate regex to replace all tags other than (s), (/s) and (VOCAL_NOISE) with the tag ~NS. I would appreciate it if someone could help me with it.
Example text:
(s) Hi Stacey , this is Stanley (/s) (s) I would (VOCAL_N) appreciate if you could call (UNKNOWN) and let him know I want an appointment (VOCAL_NOISE) with him (/s)
Output should be:
(s) Hi Stacey , this is Stanley (/s) (s) I would ~NS appreciate if you could call ~NS and let him know I want an appointment (VOCAL_NOISE) with him (/s)
This should take care of it:
sed 's|([^)]*)|\n&\n|g;s#\n\((/\?s)\|(VOCAL_NOISE)\)\n#\1#g;s|\n\(([^)]*)\)\n|~NS|g' inputfile
Explanation:
s|([^)]*)|\n&\n|g - divide the line by putting every parenthesized string between two newlines
s#\n\((/\?s)\|(VOCAL_NOISE)\)\n#\1#g - remove the newlines around "(s)", "(/s)" and "(VOCAL_NOISE)" (keepers)
s|\n\(([^)]*)\)\n|~NS|g - replace anything else between newlines that is within parentheses with "~NS"
This works since newlines are guaranteed not to appear within a newly read line of text.
Edit: Shortened the command by using alternation \(foo\|bar\)
Previous version:
sed 's|([^)]*)|\n&\n|g;s|\n\((/\?s)\)\n|\1|g; s|\n\((VOCAL_NOISE)\)\n|\1|g;s|\n\(([^)]*)\)\n|~NS|g' inputfile
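In regex flavors that support lookahead, which POSIX sed does not, the keepers can be excluded directly instead of being protected with newlines. A minimal Python sketch of that variant follows; the function name `mask_tags` is made up for illustration.

```python
import re

# Match any parenthesized tag unless it is exactly (s), (/s) or (VOCAL_NOISE).
UNWANTED_TAG = re.compile(r"\((?!/?s\)|VOCAL_NOISE\))[^)]*\)")


def mask_tags(text, marker="~NS"):
    """Replace every tag except the three keepers with the marker."""
    return UNWANTED_TAG.sub(marker, text)


line = ("(s) Hi Stacey , this is Stanley (/s) (s) I would (VOCAL_N) appreciate "
        "if you could call (UNKNOWN) and let him know I want an appointment "
        "(VOCAL_NOISE) with him (/s)")
print(mask_tags(line))
```

The lookahead includes the closing parenthesis, so near misses such as (VOCAL_NOISED) are still replaced.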
This is a dirty trick that is far from being optimal but it should work for you:
sed '
s|(\(/\?\)s)|[\1AAA]|g;
s|(VOCAL_NOISE)|[BBB]|g;
s/([^)]*)/~NS/g;
s|\[\(/\?\)AAA\]|(\1s)|g;
s|\[BBB\]|(VOCAL_NOISE)|g'
The trick is to replace (s), (/s) and (VOCAL_NOISE) with patterns which are not present in the input text (in this case [AAA], [/AAA] and [BBB]); then we replace every instance of (.*) with ~NS; in the end we get back the fake patterns to their original value.
I could suggest this using vim:
:%s/\((\w\+)\)\&\(\((s)\|(VOCAL_NOISE)\)\@!\)/\~NS/g
Using a shell (bash) you can do the following:
vim file -c '%s/\((\w\+)\)\&\(\((s)\|(VOCAL_NOISE)\)\@!\)/\~NS/g' -c "wq"
Make a backup first, I am not responsible for any damage if this is wrong.
Simply this?
sed -E 's/\((VOCAL_N|UNKNOWN)\)/~NS/g'
In this case, you'd have a blacklist (you know what to filter out). Or do you absolutely need a whitelist (you know what to NOT filter out) ?
awk -vRS=")" -vFS="(" '$2!~/s|\\s|VOCAL_NOISE/{$2="~NS"}RT' ORS=")" file |sed 's/~NS)/~NS/g'
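The whitelist idea raised above can also be written with a replacement callback, which keeps the list of allowed tags as plain data rather than baking it into the pattern. This Python sketch uses hypothetical names (`KEEP`, `replace_unknown_tags`) and assumes tags never contain a closing parenthesis.

```python
import re

# Whitelist: tags that must survive unchanged.
KEEP = {"(s)", "(/s)", "(VOCAL_NOISE)"}

# Any parenthesized run without a ")" inside counts as a tag.
TAG = re.compile(r"\([^)]*\)")


def replace_unknown_tags(text, marker="~NS"):
    """Keep whitelisted tags; replace every other tag with the marker."""
    return TAG.sub(lambda m: m.group(0) if m.group(0) in KEEP else marker, text)


print(replace_unknown_tags("(s) I would (VOCAL_N) call (UNKNOWN) now (VOCAL_NOISE) (/s)"))
```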