Finding and replacing a pattern with bold and normal characters - regex

So as the title suggests I have a crazy thing that I need to do and was wondering if there is a faster way to do it. Basically I have a list in Word format. On each line there is data that looks like this:
Bold Text Normal Text
I need to insert something between the bold and normal text. Is there any way to find only the places that match that pattern (i.e. B space here N)? I could then easily insert what I need. Maybe something with regex?

OK, this is a slightly extreme idea.
Is the document a .docx file? If not, you should be able to convert it to one.
I tried this on a .docx file without a regex, but I'm sure you'll be able to take care of that part :)
So!
Extract the docx file as a zip archive
You can add .zip to the file name, as an extension, or just open with an archiver - such as 7zip.
Navigate to the folder named word, under the extracted folder.
Open document.xml with your preferred editor
Every part of the text that changes its style gets its own tag.
Find a string that looks like this: <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000"><w:rPr><w:b w:val="1"/><w:rtl w:val="0"/></w:rPr><w:t xml:space="preserve">bold text </w:t></w:r>
That is what one styled run of text looks like.
The tag <w:b w:val="1"/>, with the value 1, indicates that the string inside it ("bold text ") has the bold style.
Create a string that looks like the one shown above and put the text you want in it. If, for example, you want the new text to have another style, such as italic, use <w:i w:val="1"/> (with i instead of b).
My example:
It looks like:
Before: bold text normal text
After: bold text hi im new normal text
The XMLs example:
https://gist.github.com/arieljannai/08756ef562962eee0798
So the only thing you need to do now is build a regex that finds the parts with w:b tags and everything surrounding them, and then you have it :)
Good luck!
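The extract, edit and repack steps above could also be scripted. The following is only a minimal sketch, not part of the original answer: the function name rewrite_document_xml and the transform callback are hypothetical, and it assumes word/document.xml is UTF-8 encoded.

```python
import zipfile


def rewrite_document_xml(src_path, dst_path, transform):
    """Copy a .docx archive, passing word/document.xml through `transform`.

    `transform` is a hypothetical callback: it receives the XML as a str
    and returns the modified XML as a str.
    """
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                # Assumption: the part is UTF-8 encoded.
                data = transform(data.decode("utf-8")).encode("utf-8")
            # Every other part of the archive is copied unchanged.
            dst.writestr(item, data)
```

Writing to a separate destination file keeps the original document untouched in case the edited XML turns out to be invalid.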
EDIT: Here is a regex example I made that matches a styled run like the one in the example above:
(<w:r.*?>(?:<w:b\s{1}.*?\/>){1}.*?(?:<w:t\s{1}.*?>(.*?)<\/w:t>)<\/w:r>)
The regex matches one section enclosed in the <w:r> tag (first capturing group).
The first non-capturing group makes sure it has the bold tag ((?:<w:b\s{1}.*?\/>)).
The second non-capturing group finds the tag that contains the text (the <w:t> tag).
Inside the second non-capturing group is the second capturing group, (.*?), which holds the actual text of that styled run.
So you have the whole styled run in the first group, and only the actual text in the second group.
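To illustrate the idea of finding a bold run followed by a normal run and putting a new run between them, here is a rough Python sketch. It does not use the exact regex above: the patterns RUN and BOLD and the helpers is_bold and insert_between are hypothetical names, the sample XML is simplified, and treating the XML as a plain string is an assumption that may not hold for every document.

```python
import re
from xml.sax.saxutils import escape

# One <w:r>...</w:r> run. The optional attribute part keeps <w:rPr> from matching.
RUN = re.compile(r"<w:r(?:\s[^>]*)?>.*?</w:r>", re.DOTALL)
# A bold tag such as <w:b/> or <w:b w:val="1"/> (but not <w:bCs/>).
BOLD = re.compile(r"<w:b(?:\s[^>]*)?/>")


def is_bold(run):
    """True if the run has a bold tag that is not explicitly switched off."""
    m = BOLD.search(run)
    return m is not None and not re.search(r'w:val="(?:0|false)"', m.group(0))


def insert_between(xml, text):
    """Insert a plain run holding `text` wherever a bold run is directly
    followed by a non-bold run."""
    new_run = '<w:r><w:t xml:space="preserve">' + escape(text) + "</w:t></w:r>"
    pieces = []
    cursor = 0
    prev = None  # (end offset, is_bold) of the previous run
    for m in RUN.finditer(xml):
        bold = is_bold(m.group(0))
        if prev == (m.start(), True) and not bold:
            pieces.append(xml[cursor:m.start()])
            pieces.append(new_run)
            cursor = m.start()
        prev = (m.end(), bold)
    pieces.append(xml[cursor:])
    return "".join(pieces)


bold_run = ('<w:r><w:rPr><w:b w:val="1"/></w:rPr>'
            '<w:t xml:space="preserve">bold text </w:t></w:r>')
normal_run = '<w:r><w:t xml:space="preserve">normal text</w:t></w:r>'
print(insert_between("<w:p>" + bold_run + normal_run + "</w:p>", "hi im new "))
```

Only runs that sit directly next to each other are considered, so a bold run followed by another bold run is left alone.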

Related

NotePad++ regex match and replace and also keep match to convert to different markdown image reference link

I have the following link syntax that needs to be changed:
![[afoldernamenolongerneededandwillbedeprecated/somemarkdownfilename_image1.png]]
I tried (successfully) with this regex to match:
![[].*[\/].*_image[0-9].png[]]]
Although I have a hunch it may not be what I should use. As a novice, I suspect it is only good for matching, not for replacing. All images are PNGs, by the way. All filenames have _image in them, prefixed by the markdown file name.
Desired end format:
![image](imagenamefromabovestring1,2,orhowevermanythereare.png)
The
![]()
is a known syntax in markdown to reference images. Images will be populated in subdirectories the program/app will find.
It goes without saying I want to run find and replace recursively on some 4000 files containing image references.
I put up the unfinished substitution example here:
https://regex101.com/r/Bl8HJC/1
To clarify what I need: the old folder name should be gone, since I no longer need it. After the slash comes the image name, which always follows the same scheme: the name of the markdown file currently being processed by NotePad++ (it can be Ab, Aba, Abracadabra, etc.) as a prefix, then an underscore, then 'image' plus a number that depends on how many images are attached to that markdown file. The file names that go in the attachment folder will look like this:
AB_image3.png
Abracadabra_image2.png
.
.
.
Zodiac_image45.png
I am looking for the right syntax as I couldn't figure it out with the dollar sign.
Cheers,
Otto
I have modified your example to get it working here. What you needed to do is escape the square brackets so they would be interpreted literally, since they have special meaning in regex, and you needed to use a capture group to store the matching value in $1 so you could use it in the replacement.
Regular expression:
!\[\[.*\/(.*_image[0-9]{1,2}\.png)\]\]
Substitution format:
![image]\($1\)
Edit: Question was revised to state that the folder name was unwanted in the final output, so matches are delimited after the final / character in the file path.
Edit 2: Support for file numbers 1 through 99.
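For anyone who prefers scripting the replacement instead of using Notepad++, the same pattern can be tried in Python. This is only a sketch: convert_links and convert_tree are hypothetical names, and it assumes the files are UTF-8 encoded markdown with a .md extension.

```python
import re
from pathlib import Path

# Same pattern as above; group 1 is the image name after the last slash.
OLD_LINK = re.compile(r"!\[\[.*/(.*_image[0-9]{1,2}\.png)\]\]")


def convert_links(text):
    """Rewrite ![[folder/name_imageN.png]] as ![image](name_imageN.png)."""
    return OLD_LINK.sub(r"![image](\1)", text)


def convert_tree(root):
    """Apply convert_links to every .md file below `root` (assumed UTF-8)."""
    for path in Path(root).rglob("*.md"):
        original = path.read_text(encoding="utf-8")
        updated = convert_links(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")


print(convert_links("![[somefolder/Abracadabra_image2.png]]"))
```

Because .* is greedy and does not cross line breaks, the pattern assumes at most one such link per line.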

Regex select XML Element (containing hyphen) and inside content

I'm working with an enterprise CMS and in order to properly create our weekly-updated dropdown menu without republishing our entire site, I have an XML document being created which has a various number of useful XML elements. However, when pulling in a link with the CMS, the generated XML also outputs the link's contents (the entire HTML for the page). Needless to say, with roughly 50 items, the XML file is too big for use on the web (as it stands I think it's over 600KB). The element is <page-content>filler here</page-content>.
What I'm trying to do is use TextWrangler to find and replace all <page-content> tags as well as their containing content.
I've tried a few different regexes, but I can't seem to match the closing tag, so the match just trails on.
Here's what I've tried:
(<page-content>)(.*?)
The above will match up until the next starting <page-content> tag, which is not what I want.
(<page-content>)(.*?)(<\/page-content>)
(<page-content>)(.*?)(<\/page\-content>)
The above finds no matches, even though the below will find the 7 matches it should.
(<content>)(.*?)(<\/content>)
I don't know if there's a special way to deal with hyphens (I'm inexperienced in regular expressions), but if anyone could help me out, it would be greatly appreciated.
Thanks!
EDIT: Before you tell me that regex isn't meant to parse HTML, I know that, but there seems to be no other way for me to easily find and replace this. There are too many occurrences to delete manually and save the file again every week.
It seems the problem is that your . is not matching newlines that exist between your open and close tags.
An easy solution for this would be to add the s flag in order for your . to match over newlines. TextWrangler appears to support inline modifiers (?s). You could do it like this:
(<page-content>)(?s)(.*?)(<\/page-content>)
More information on modifiers here.
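The effect of the s flag can be checked outside TextWrangler as well. Here is a small Python sketch, where re.DOTALL plays the role of (?s); the function name strip_page_content and the sample XML are made up for illustration.

```python
import re

# Without DOTALL, "." stops at a newline, so multi-line elements never match.
WITHOUT_S = re.compile(r"<page-content>.*?</page-content>")
# With DOTALL (the "s" flag), "." also matches newlines.
WITH_S = re.compile(r"<page-content>.*?</page-content>", re.DOTALL)


def strip_page_content(xml):
    """Remove every <page-content> element together with its content."""
    return WITH_S.sub("", xml)


sample = (
    "<item><title>A</title><page-content>\n<html>\n</html>\n</page-content></item>"
    "<item><page-content>x</page-content></item>"
)
print(WITHOUT_S.findall(sample))   # only the single-line element is found
print(strip_page_content(sample))
```

The lazy .*? keeps each match from running past the first closing tag it meets, so separate elements are removed one by one.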

How to Easily Remove Unwanted Parts in HTML Table Cells Using Notepad++

I have series of different occurrences of table cells in some html files as shown in this image:
http://screencast.com/t/MqGHN2iwfd
Apart from the beginning and end of each cell, they have the following parts in common:
.net/?mobile=true
/spotlightProfile.htm?f=mkt&v=
/#stats
I want to either be able to remove all the parts that look like that once
OR be able to remove one-by-one in notepad++:
the url part that precede .net/?mobile=true
the url parts before and after /spotlightProfile.htm?f=mkt&v= and
the url part before /#stats
Furthermore, please, I also want to be able to remove the duplicate occurrence also in Notepad++
Thanks a lot in anticipation for helping out.
Regex would look something like this.
Search for: (.*?)(\/\?mobile=true|\/spotlightProfile\.htm\?f=mkt&v=|\/#stats)(.*)
Replace With: \1\3
Basically we create 3 groups:
the text before the expression you match,
the expression you are trying to remove,
the rest of the line.
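As a rough illustration of removing the three fragments listed in the question, here is a Python sketch. It matches only the fragments themselves rather than whole lines, and the name remove_fragments and the sample URLs are hypothetical.

```python
import re

# The three fragments from the question, with regex metacharacters escaped.
FRAGMENTS = re.compile(
    r"/\?mobile=true|/spotlightProfile\.htm\?f=mkt&v=|/#stats"
)


def remove_fragments(line):
    """Delete every occurrence of the three fragments from a line."""
    return FRAGMENTS.sub("", line)


print(remove_fragments("http://example.net/?mobile=true"))
print(remove_fragments("http://example.com/spotlightProfile.htm?f=mkt&v=123"))
print(remove_fragments("http://example.com/#stats"))
```

Matching just the fragment means the text before and after it never has to be captured and put back.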

Regex problems in VB.net, how do I match this?

I never understand how to create regular expressions, and now I need one badly. It would be great if someone knows how to do this.
I need to match these examples with a regex and then append text before the third comma:
Examples:
1.
Örjan,,;Svensson,,,,, and then it
continues like this
Needs to become:
Örjan,,;SvenssonNEWTEXTHERE,,,,, and
then it continues like this
2.
Patric,The-Man,Black,,,,,,,,, and then
it continues like this
Needs to become:
Patric,The-Man,BlackNEWTEXTHERE,,,,,,,,, and then
it continues like this
If I would use wildcards to do this it would look like this:
*,*,*,*
And I would like to add text just before the last comma. But I still need the whole string: the text should just be added there, and I don't want the characters that come after the added text to disappear.
This is a .CSV contact file btw so you better understand the structure of the text.
Is this possible?
The regular expression for a CSV field, i.e. “any text not containing a comma”, is [^,]*. If you want to skip to the end of the third field, you’ll use
[^,]*,[^,]*,[^,]*
Now, if you want to modify the string, you can use something like
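' Requires: Imports System.Text.RegularExpressions
Dim str = "Örjan,,;Svensson,,,,, and then it continues like this"
Dim re As New Regex("^[^,]*,[^,]*,[^,]*") ' the first three fields, anchored at the start
Dim pos = re.Match(str).Length ' index just past the third field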
Now you’ve got the position of where you want to put the additional string in pos, and you can do whatever you want with it.
Note that a CSV file can generally contain fields which contain literal commas and need to be quoted (e.g. Patric,"The,Man",Black,...). It may even contain a linebreak, which makes it quite difficult to parse properly, especially with regular expressions (and the code above would not work with such data). Can you be sure your CSV file does not contain quoted fields?
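Both points can be sketched in Python: the plain regex approach for unquoted data, and a variant based on the standard csv module that copes with quoted fields on a single line. The function names are hypothetical, and neither version handles fields that span several lines.

```python
import csv
import io
import re

# The first three comma-separated fields, anchored at the start of the line.
THIRD_FIELD_END = re.compile(r"^[^,]*,[^,]*,[^,]*")


def insert_after_third_field(line, text):
    """Regex version: only correct when no field contains a quoted comma."""
    m = THIRD_FIELD_END.match(line)
    if m is None:  # fewer than three fields
        return line
    pos = m.end()
    return line[:pos] + text + line[pos:]


def insert_after_third_field_csv(line, text):
    """csv version: parses quoted fields, then writes the row back out."""
    row = next(csv.reader([line]), [])
    if len(row) < 3:
        return line
    row[2] += text
    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow(row)
    return out.getvalue()


print(insert_after_third_field(
    "Örjan,,;Svensson,,,,, and then it continues like this", "NEWTEXTHERE"))
print(insert_after_third_field_csv('Patric,"The,Man",Black,,', "NEWTEXTHERE"))
```

The csv version rewrites the whole row, so quoting in the output follows the csv module's defaults rather than the exact quoting of the input.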

In Yahoo-Pipes, how to use regex when you can't see non-printable characters and html tags?

I keep having trouble extracting data with regex: my result is not what I wanted because the string may contain newlines, spaces, HTML tags, etc. Is there any way to actually see what is in the string? The debugger seems to show only the real text. How do you deal with this?
If the content of the string is HTML then debugger gives you a choice of viewing "HTML" or "Source". Source should show you any HTML tags that are there.
However if your concern is white space, this may not be enough. Your only option is to "view source" on the original page.
The best course of action is to explicitly handle these possibilities in your regex. For example, if you think you might be getting white space in your target string, use the \s* pattern in the critical positions. That will match zero or more spaces, tabs, and new lines (you must also have the "s" option checked in the regex panel for new lines).
However, without specific examples of source text and the regex you are using - advice can only be generic.
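As a generic illustration of placing \s* at the critical positions, here is a small Python sketch. It is not Yahoo Pipes code: the sample string, the patterns and the name extract_number are invented for the example.

```python
import re

# Hypothetical fragment: the whitespace between the tags is easy to overlook.
SAMPLE = '<a href="/x">Title</a>  \r\n\t<span>42</span>'

# Assumes the tags touch each other, so it fails on the sample.
STRICT = re.compile(r"</a><span>(\d+)</span>")
# Allows any run of spaces, tabs and line breaks between the tags.
TOLERANT = re.compile(r"</a>\s*<span>(\d+)</span>")


def extract_number(html):
    """Return the number in the <span> after the link, or None."""
    m = TOLERANT.search(html)
    return m.group(1) if m else None


print(STRICT.search(SAMPLE))
print(extract_number(SAMPLE))
```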
What I do is use a regex tester (whichever uses the same regex engine that you are using) and I test my pattern on it. I've tried using text editors that display invisible characters but to me they only add to the confusion.
So I just go by trial and error. For instance, if a line ends in:
</a>
Then I'll try the following patterns on the regex tester until I find one that works:
</a>.
</a>..
</a>\s
</a>\s*
</a>\n
</a>\r
</a>\r\n
Etc.
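The trial-and-error loop can be automated where a scripting language is at hand. The sketch below is in Python, so it only approximates whatever engine is actually in use; the name matching_candidates is hypothetical, and repr() is used to make the invisible characters visible.

```python
import re

# The candidate patterns from the list above.
CANDIDATES = [
    r"</a>.", r"</a>..", r"</a>\s", r"</a>\s*",
    r"</a>\n", r"</a>\r", r"</a>\r\n",
]


def matching_candidates(line_end):
    """Return the candidates that match the whole of `line_end`."""
    return [p for p in CANDIDATES if re.fullmatch(p, line_end)]


line_end = "</a>\r\n"
print(repr(line_end))                 # shows the \r and \n explicitly
print(matching_candidates(line_end))
```

Requiring the whole string to match shows which candidates describe the complete line ending rather than only part of it.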