Regex to unwrap paragraphs: Remove returns and new lines at end of lines that have content, but not empty lines - regex

I'm using Text Soap by Unmarked Software on Mac OS which is pretty much PCRE but uses ICU Regular Expression Syntax for its regex find and replace tool. I'm still new to Regex so I'm still learning the many intricacies. Please be patient with me.
I'm struggling to capture new lines or returns at the end of lines that have content, but not capture the new lines or returns of empty lines, or if there is an empty line immediately following.
I've tried using positive lookbehind, and positive lookahead with multiline mode but haven't been able to figure it out. With a bit of trial and error I did figure out that $ is after newline/carriage return.
I am essentially trying to unwrap paragraphs but maintain them as paragraphs.
I want input such as this example:
"I need to unblock," someone may have breathed out.\n
\n
"I know how to do it," I may have responded, picking up\n
the cue. My life has always included strong internal directives.\n
Marching orders) I call them.\n
\n
In any case, I suddenly knew that I did know how to un-\n
block people and that I was meant to do so, starting then and\n
there with the lessons I myself had learned.\n
\n
Where did the lessons come from?\n
\n
In 1978, in January, I stopped drinking. I had never\n
thought drinking made me a writer, but now I suddenly\n
thought not drinking might make me stop. In my mind,\n
drinking and writing went together like, well, scotch and\n
soda. For me, the trick was always getting past the fear and\n
onto the page. I was playing beat the clock-trying to write be-\n
fore the booze closed in like fog and my window of creativity\n
was blocked again.\n
To output this:
"I need to unblock," someone may have breathed out.\n
\n
"I know how to do it," I may have responded, picking up the cue. My life has always included strong internal directives. Marching orders) I call them.\n
\n
In any case, I suddenly knew that I did know how to un-block people and that I was meant to do so, starting then and there with the lessons I myself had learned.\n
\n
Where did the lessons come from?\n
\n
In 1978, in January, I stopped drinking. I had never thought drinking made me a writer, but now I suddenly thought not drinking might make me stop. In my mind, drinking and writing went together like, well, scotch and soda. For me, the trick was always getting past the fear and onto the page. I was playing beat the clock-trying to write be-fore the booze closed in like fog and my window of creativity was blocked again.\n

If I understand correctly, you can use this regex:
(?<!\n)\n(?!\n)
and replace each match with an empty string (or with a single space, if the wrapped lines carry no trailing space and the words would otherwise run together).
If your line endings are something other than \n, substitute that sequence for every \n in the pattern. For example, if your newline is \r\n, use:
(?<!\r\n)\r\n(?!\r\n)
Essentially, the regex finds a newline that is neither preceded nor followed by another newline. Replacing it with an empty string removes it.
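To try the pattern outside Text Soap, here is a minimal sketch in Python. It uses Python's re module rather than the ICU engine, so treat it as an approximation. The names unwrap and joiner are made up for this example, and it assumes plain \n line endings.

```python
import re

# A newline that is neither preceded nor followed by another newline.
SINGLE_NEWLINE = re.compile(r"(?<!\n)\n(?!\n)")

def unwrap(text: str, joiner: str = " ") -> str:
    """Join wrapped lines inside a paragraph; blank-line breaks are left alone."""
    return SINGLE_NEWLINE.sub(joiner, text)

sample = "picking up\nthe cue.\nMarching orders.\n\nWhere did the lessons come from?"
print(unwrap(sample))
```

Two caveats with this sketch: a single newline at the very end of the text also matches (nothing follows it), and a line that contains only spaces does not count as empty, so such lines would need to be cleaned up first.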

I cobbled together this rudimentary Regex, but I'm assuming that it may miss certain kinds of visually empty lines. I would greatly appreciate feedback if there are ways this regex could fail or how it could be greedier than I expected, or how to improve on it. I welcome others to play around with my current solution on regex101.com and fork it etc to play or teach me something.
(?<=.$)([\r\n\f\v]?)(?!^$)
Substitute $1 with \s.

Related

Notepad++ / RegEx - Find & replace multiple parts of a line while ignoring any words between

I appreciate what I'm asking may be very simple to more experienced folks. I've spent several hours trying to get my head around RegEx and have gotten close to what I need, but as this is something I'm trying to achieve for a hobby project (RegEx is not something I require in my day job) I'm hoping some of you may be able to help me out.
In short, I have a very large file with tens of thousands of lines of code that I am converting to be readable by another program. All I need to accomplish this is to change some formatting.
I need to find every instance where the tag "{#graphic examplename}" is used, and change it so that only "examplename" remains, wrapped in double square brackets [[ ]].
Examples of how the tags currently appear (example names can be either single words or multiple):
"{#graphic example1}",
"{#graphic example2}",
"{#graphic example3 with multiple words}"
What I want them to look like when done (replacing the { with [[, removing #graphic, and replacing } with ]]):
"[[example1]]",
"[[example2]]",
"[[example3 with multiple words]]"
It's easy enough to do a simple find-and-replace to turn "{#graphic " into "[[", as the #graphic tag is something I want to remove universally. However, I can't do the same with the "}" at the end, because I can't find a way to specify that I only want to replace a "}" that comes after an instance of "{#graphic " while leaving the words in between (the examplename) intact.
Any assistance gratefully received - if the above needs any elaboration please don't hesitate to ask, I understand I may be putting this in amateurish terms.
Regards,
K
Many programs can capture a group and reference it later, often with $.
So find: {#graphic ([^}]+)}
Replace with: [[$1]]
This captures whatever is inside the () and makes it available in the replacement as $1, i.e. the first "capture group".
Regex 101 is an excellent resource for trying these things out:
https://regex101.com/r/r5OX8I/1
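The same capture-and-reference idea, sketched in Python for anyone who wants to script the conversion instead of using an editor. The names GRAPHIC_TAG and convert_tags are invented for this example; the braces are escaped here to be safe, and Python spells the backreference \1 rather than $1.

```python
import re

# The pattern from the answer, with the literal braces escaped.
GRAPHIC_TAG = re.compile(r"\{#graphic ([^}]+)\}")

def convert_tags(text: str) -> str:
    # \1 inserts whatever the first capture group matched.
    return GRAPHIC_TAG.sub(r"[[\1]]", text)

print(convert_tags('"{#graphic example3 with multiple words}", {other}'))
```

Because [^}]+ stops at the first closing brace, other {...} constructs in the file are left untouched.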

Regex question with spaces ruining the targeted words

I'm having issues with writing out an expression to block some words.
This is my current code.
I am currently just using regex101.com to test it.
(^[GgƓɠḠḡǴǵĜĝǦǧĞğĢģǤǥĠġ](?:[^a-zA-Z]*)([ÂâÅåÀàÁáÃãÄäEeAaÆæ4#]+\s{0,30})([.*\S*]{0,1}$)|(?:[^a-zA-Z]*)[ÝýŶŷŸÿỸỹYy].)
I need this to find the word "gay", but if someone writes "gay man" with a space, it doesn't even pick up the word "gay". I'm just trying to figure out why the space lets the word through; I tried moving things around and adding whatever might make sense, but nothing seems to click.
I want to share this post with you: https://stackoverflow.com/a/1732454/4867612
I'm not sure I understood correctly, but with only this [GgƓɠḠḡǴǵĜĝǦǧĞğĢģǤǥĠġ][ÂâÅåÀàÁáÃãÄäEeAaÆæ4#][ÝýŶŷŸÿỸỹYy]
in https://regex101.com/r/jYpHJS/1 you can detect if the word gay is in text
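A small Python sketch of the simplified pattern, for testing outside regex101. The names WORD and contains_word are made up for this example. The likely reason the original fails on "gay man" is that its first alternative is anchored with ^ and $, so any text after the word defeats it; the three character classes below are unanchored and are used with a search.

```python
import re

# Three character classes in a row, copied from the answer; no anchors.
WORD = re.compile(
    r"[GgƓɠḠḡǴǵĜĝǦǧĞğĢģǤǥĠġ][ÂâÅåÀàÁáÃãÄäEeAaÆæ4#][ÝýŶŷŸÿỸỹYy]"
)

def contains_word(text: str) -> bool:
    # search() scans the whole string, so surrounding words do not matter.
    return WORD.search(text) is not None

print(contains_word("gay man"))
print(contains_word("hello world"))
```

Note that this also matches inside longer words, so word boundaries would be needed if that matters for the filter.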

Stripping 3 parentheses in regex

I run a small forum that has an issue with people using parentheses to bracket statements. They do it to signify they are talking about Jews. I guess it is called echoes or something. So they will put a name like (((Prominent Person))) like that in the middle of a conversation.
I have recently been trying to combat this without just banning people that can't behave. I have a decent word filter but it doesn't block that. I recently installed something that allows me to use regex to strip things out but I am having trouble finding the proper string that doesn't break everything else.
"/\W{3}(.*)\W{3}/","$1"
The first is the matching string and the comma separates what is left. This string works, it strips the parentheses out and leaves everything else alone. The problem is that the string is too broad. It also strips out any [ brackets as well which breaks all of the bbcode in a post. Any post that has any number of at least 3 brackets will be broken after that.
I have been playing with different strings on regex101 but not finding the best solution. I need any time that ((( or ))) is seen to strip out those and replace it with nothing, like it never happened. It has to be exactly three and only ((( and not the other brackets it could trigger on.
Does anyone have a good solution?
\({3}(.*)\){3}
https://regex101.com/r/wD5TMb/1
So in your format probably: "/\({3}(.*)\){3}/","$1"
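One thing to watch: .* is greedy, so two bracketed names on the same line would be treated as one long match. Below is a Python sketch comparing the answer's pattern with a non-greedy variant. The variant is my assumption, not part of the answer, and the names GREEDY, LAZY and strip_echo are invented for the example.

```python
import re

GREEDY = re.compile(r"\({3}(.*)\){3}")   # pattern from the answer
LAZY = re.compile(r"\({3}(.*?)\){3}")    # assumed variant: stops at the first )))

def strip_echo(text: str, pattern: re.Pattern = LAZY) -> str:
    # Keep only the captured text between the triple parentheses.
    return pattern.sub(r"\1", text)

line = "(((A))) and (((B)))"
print(strip_echo(line, GREEDY))  # greedy: spans from the first ((( to the last )))
print(strip_echo(line))          # lazy: handles each pair separately
```

Square brackets are never touched by either pattern, so bbcode such as [b]...[/b] survives.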

In what ways can I improve this regular expression?

I have written this regex that works, but honestly, it’s like 75% guesswork.
The goal is this: I have lots of imports in Xcode, like so:
#import <UIKit/UIKit.h>
#import "NSString+MultilineFontSize.h"
and I only want to return the categories that contain +. There are also lots of lines of code throughout the source which include + in other contexts.
Right now, this returns all of the proper lines throughout the Xcode project. But if there is one thing I’ve learned from googling and searching Stack Overflow for regex tutorials, it is that there are LOTS of different ways to do things. I’d love to see all of the different ways you guys can come up with that make it either more efficient or more bulletproof regarding potential spoofs or misses.
^\#import+.[\"]*+.(?:(?!\+).)*+.*[\"]
Thanks in advance for all of your help.
Update
Also I suppose I’ll accept the answer of whoever does this with the shortest string, without missing any possible spoofs. But again, thanks to everyone who participates in this learning experience.
Resources from answers
This is an awesome resource for practicing regex from Dan Rasmussen: RegExr
The first thing I notice is that your + characters are misplaced: t+. matches t one or more times, followed by any single character (.). I'm assuming you wanted to match the end of import, followed by one or more of any character: import.+
Secondly, # doesn't need to be escaped.
Here's what I came up with: ^#import\s+(.*\+.*)$
\s+ matches one or more whitespace characters, so you're guaranteed that the line actually starts with #import and not #importbutnotreally or anything else.
I'm not familiar with Xcode syntax, but the following part of the expression, (.*\+.*), simply matches any string with a + character somewhere in it. This means invalid imports may be matched, but I'm working under the assumption you're trying to match valid code. If not, this will need to be modified to validate the import syntax as well.
P.S. To test your expression, try RegExr. You can hover over characters to check what they do.
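For a quick check outside Xcode, here is a Python sketch that applies the expression above to a block of source text. The names CATEGORY_IMPORT and category_imports are made up for the example; re.MULTILINE is needed so that ^ and $ match at each line rather than only at the ends of the whole string.

```python
import re

# The expression from the answer, applied line by line.
CATEGORY_IMPORT = re.compile(r"^#import\s+(.*\+.*)$", re.MULTILINE)

def category_imports(source: str) -> list[str]:
    # With a single capture group, findall returns just the captured text.
    return CATEGORY_IMPORT.findall(source)

source = (
    "#import <UIKit/UIKit.h>\n"
    '#import "NSString+MultilineFontSize.h"\n'
    "int total = a + b;\n"
)
print(category_imports(source))
```

Lines that contain a + but do not start with #import, such as the arithmetic line in the sample, are ignored.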
sed -n 's:^#import \(.*[+].*\):\1:p' FILE
will display
"NSString+MultilineFontSize.h"
for your sample.

Create a valid CSV with regular expressions

I have a horribly formatted, tab-delimited "CSV" that I'm trying to clean up.
I would like to quote all the fields; currently only some of them are. I'm trying to go through, tab by tab, and add quotes if necessary.
This RegEx will give me all the tabs.
\t
This RegEx will give me the tabs that do not END with a ".
\t(?!")
How do I get the tabs that do not start with a "?
Generally, for these kinds of problems, if it's a one-time occurrence, I will use Excel's capabilities or other applications (SSIS? T-SQL?) to produce the desired output.
A general-purpose regex will usually run into bizarre exceptions; getting it just right often takes longer and is prone to missing cases your regex didn't anticipate.
If this is going to happen regularly, try to fix the problem at the source and/or create a special utility program to do it.
Use negative lookbehind: (?<!")\t
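To see the two lookarounds side by side, here is a short Python sketch that reports where each pattern matches in a sample line. The helper name tab_positions and the sample line are made up for the example.

```python
import re

def tab_positions(line: str, pattern: str) -> list[int]:
    # Index of every tab the pattern matches.
    return [m.start() for m in re.finditer(pattern, line)]

line = 'a\t"b"\tc'   # the tabs sit at indexes 1 and 5

print(tab_positions(line, r'(?<!")\t'))  # tabs not preceded by a quote
print(tab_positions(line, r'\t(?!")'))   # tabs not followed by a quote
```

The lookbehind finds the tab after the unquoted field a, and the lookahead finds the tab before the unquoted field c; neither consumes the neighbouring character, so a replacement can insert the missing quote next to the tab.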
For one-shots like this I usually just write a little program to clean up the data; that way I can also add some validation to make sure it really has converted properly after the run. I have nothing against regex, but in my case it often takes longer for me to figure out the regex than to write a small program. :)
edit: come to think about it, the main motivator is that it is more fun - for me at least :)
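A "little program" of that kind might look like the following in Python, using the standard csv module instead of a regex. This is only a sketch under assumptions: the function name quote_all_fields is invented, the input is assumed to fit in memory, and existing quotes are assumed to be well formed.

```python
import csv
import io

def quote_all_fields(tsv_text: str) -> str:
    """Re-write tab-delimited text so that every field is quoted."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(
        out, delimiter="\t", quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    for row in reader:
        writer.writerow(row)  # the reader has already removed existing quotes
    return out.getvalue()

print(quote_all_fields('a\t"b c"\td\n'))
```

Validation, such as checking that every row has the same number of fields, could be added inside the loop.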