Issue with selecting items using regex

I have text separated by white spaces and a search range of more than 1000 words.
Approximately 70% of the words follow the pattern foo-bar-...-N, where N, the number of hyphen-separated parts, is unknown. The words are separated from each other by a blank space.
What I need is for the script to select everything after the foo-bar up until the blank space.
I know how to select the whole word, but not how to solve this.
Here is an example of what I mean:
foo-bar foo-bar-thing foo-bar-stuff-my-gosh ... foo-bar-for-educational-purposes
And regex should select them like so:
[foo-bar] [foo-bar]-thing [foo-bar]-stuff-my-gosh ... [foo-bar]-for-educational-purposes

You want the regex to match a phrase and extract a substring from it.
To do that you need a capturing group.
So here is the pattern you want:
foo-bar([\w-]*)
The pattern ends with a literal space after the closing parenthesis; don't forget it. You need to set the global flag, as you can see in the demo. Your string has to end with a space if you want to match the last word. If the input is multiline, don't forget the multiline flag too.
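For what it's worth, here is a minimal Python sketch of the same capture-group idea. It is only an illustration under a few assumptions: the function name suffixes_after_prefix is made up, and a lookahead (?=\s|$) stands in for the trailing space so that the last word matches without padding the string.

```python
import re

# (?<!\S) makes sure "foo-bar" starts a word (start of text or after whitespace).
# Group 1 captures everything after "foo-bar" up to the next whitespace.
# (?=\s|$) replaces the literal trailing space, so the last word matches too.
PATTERN = re.compile(r"(?<!\S)foo-bar([\w-]*)(?=\s|$)")

def suffixes_after_prefix(text):
    """Return the part after 'foo-bar' for every word that starts with it."""
    return PATTERN.findall(text)

sample = ("foo-bar foo-bar-thing foo-bar-stuff-my-gosh "
          "other-word foo-bar-for-educational-purposes")
print(suffixes_after_prefix(sample))
# ['', '-thing', '-stuff-my-gosh', '-for-educational-purposes']
```

findall plays the role of the global flag here: it returns group 1 for every match in the text.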

Related

Notepad++ and regex (multiline)

I have been facing a challenge. I have a text file with the following pattern:
SOME RANDOM TITLE IN CAPS (nnnn)
text text text
more text
...
SOME OTHER RANDOM TITLE IN CAPS (nnnn)
What is certain is that the lines I want to extract contain a date in brackets, e.g. (2015) or (2008).
After the (nnnn) there is no text, sometimes space and CR LF, sometimes just CR LF
I would like to delete everything else and keep just the TITLE LINE with the brackets
In the time I have spent I could have done it by hand (there are 100 lines), but I like the challenge :)
I thought I could find the issue but I am stuck.
I have tried something along this line:
^.*\(\d\d\d\d\)(?s)(.*)(^.*\(\d\d\d\d\))
But I don't get what I want. I can't seem to stop the (?s)(.*) going all the way to the end of the text instead of stopping at the next occurrence.
I suggest using the Search > Mark feature. Use a pattern like \(\d{4}\) and check the "Bookmark Line" option then click "Mark All". Then use Search > Bookmark > Remove Unmarked Lines. This will remove all lines except the ones that have matched your pattern.
Note: If it's possible to have parentheses with 4 digits within your other lines you could add $ to the end of the expression to ensure that the pattern only matches the end of the line. E.g. more text (1234) and other stuff would be matched by the pattern I gave above but if you use pattern \(\d{4}\)$ it will no longer match.
If you want to be even more specific with your pattern by looking for those lines with only uppercase letters and spaces followed by parentheses with 4 digits inside where the parentheses are at the end of the line, then you could use a pattern like this: [A-Z ]+\(\d{4}\)$
Sample input:
SOME RANDOM TITLE IN CAPS (2008)
text text text
more text
...
SOME OTHER RANDOM TITLE IN CAPS (2010)
Marking the lines as described above and clicking "Mark All" bookmarks the two title lines. Search > Bookmark > Remove Unmarked Lines then leaves only those two lines.
The following regex matches the two lines with brackets containing four digits:
.*?\(\d{4}\)\s*
It lazily matches any characters from the start (zero or more), then an opening bracket, four digits and a closing bracket, and finally any trailing whitespace, including the line break.
If you want to remove all lines except the ones that end with four digits in brackets, you can try this:
^(?!.*\(\d{4}\)\h*$).*(?:\r?\n|\z)
Replace with: (nothing)
See demo
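If a script is an option, the line filtering described in the answers above can be sketched in Python. This is only a sketch under assumptions: the function name keep_title_lines is made up, and [ \t]* stands in for \h*, since Python's re module has no \h.

```python
import re

# A title line ends with four digits in parentheses,
# optionally followed by spaces or tabs.
TITLE_LINE = re.compile(r"\(\d{4}\)[ \t]*$")

def keep_title_lines(text):
    """Return only the lines that end with (nnnn); splitlines handles CR LF."""
    return [line for line in text.splitlines() if TITLE_LINE.search(line)]

sample = (
    "SOME RANDOM TITLE IN CAPS (2008)\r\n"
    "text text text\r\n"
    "more text (1234) and other stuff\r\n"
    "...\r\n"
    "SOME OTHER RANDOM TITLE IN CAPS (2010) \r\n"
)
for title in keep_title_lines(sample):
    print(title)
```

Working line by line avoids the problem from the question, where (?s)(.*) runs to the end of the text.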

Regex Find Spaces between single quotes and replace with underscore

I have a database table that I have exported. I need to replace the spaces in the image file names with underscores and would like to use Notepad++ and regex to do so. I have:
'data/green tea powder.jpg'
'data/prod_img/lumina herbal shampoo.JPG'
'data/ALL GREEN HERBS.jpeg'
'data/prod_img/PSORIASIS KIT (640x530) (2).jpg'
and need to make them look like this:
'data/green_tea_powder.jpg'
'data/prod_img/lumina_herbal_shampoo.JPG'
'data/ALL_GREEN_HERBS.jpeg'
'data/prod_img/PSORIASIS_KIT_(640x530)_(2).jpg'
I just want to change the spaces between the quotes (I don't want to change the capitalization). To be more specific I would like to replace any and all spaces between 'data/ and ' because there are other spaces between quotes in the DB, for example:
'data/ REPLACE ANY SPACE HERE '
I found this:
\s(?!(?:[^']*'[^']*')*[^']*$)
but there are other places where there are spaces between quotes, so I'd like to search for data/ at the beginning and not just a single quote, but I can't figure out how. I tried \s(?!(?:[^'data\/]*'[^']*')*[^']*$) but it didn't work, and I am not familiar enough with regex to make it do so.
An example of a full line from the database is:
(712, 'GRTE-P', '', 'data/green tea powder.jpg', '2014-03-12 22:52:03'),
I don't want to replace the spaces in the date and time stamp at the end of the line, just the ones in the image file names.
Thanks in advance for your help!
You have to use a \G based pattern to ensure that matches are contiguous.
search: (?:\G(?!^)|'data/)[^' ]*\K[ ]
replace: _
The first match uses the second branch of the alternation, then the next matches are contiguous and use the first branch.
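Outside Notepad++, note that Python's standard re module supports neither \G nor \K, so the pattern above cannot be used there as is. A different technique gives the same result: match each quoted 'data/...' string as a whole and replace the spaces inside it in a callback. A minimal sketch, where the function name underscore_image_paths is made up:

```python
import re

# One whole single-quoted string that starts with data/
QUOTED_PATH = re.compile(r"'data/[^']*'")

def underscore_image_paths(line):
    """Replace spaces with underscores, but only inside 'data/...' strings."""
    return QUOTED_PATH.sub(lambda m: m.group(0).replace(" ", "_"), line)

row = "(712, 'GRTE-P', '', 'data/green tea powder.jpg', '2014-03-12 22:52:03'),"
print(underscore_image_paths(row))
# (712, 'GRTE-P', '', 'data/green_tea_powder.jpg', '2014-03-12 22:52:03'),
```

The spaces after the commas and inside the timestamp are left alone because they are outside every match.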

Remove columns from CSV

I don't know anything about Notepad++ Regex.
This is the data I have in my CSV:
6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389
How can I use a Regex to remove the 5th and 6th column? The numbers in the 5th and 6th column are variable in length.
Another problem is the User row can also contain a |, to make it even worse.
I can use a macro to fix this, but the file is a few millions lines long.
This is the final result I want to achieve:
6454345|User1-2ds3|62562012032|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|411b0fdf54fe29745897288c6ad699f7be30f389
I am open for suggestions on how to do this with another program, command line utility, either Linux or Windows.
Match \|[^|]+\|[^|]+(\|[^|]+$)
Replace $1
Basically, anchor to the end of the line and remove the two columns before the last one (I assume columns can't be empty; replace + with * if they can).
If you need finer control than that, I'd recommend writing a Java or Python script to manually parse and rewrite the file for you.
I've captured three groups and given them names. If you use a replace utility like sed or Vim's regex, you can replace the remove group with nothing. Or you can use a programming language to concatenate keep_before and keep_after for the desired result.
^(?<keep_before>(?:[^|]+\|){3})(?<remove>(?:[^|]+\|){2})(?<keep_after>.*)$
Note that this counts columns from the start of the line, so it assumes the user field itself contains no |. You may have to remove the group names and use \1 etc. instead, depending on what environment you use.
Demo
In Notepad++ press Ctrl+H, then enter the following in the dialog:
Find what: \|\d+\|\d+(\|[0-9a-z]+)$
Replace with: $1
Search mode: Regular Expression
Click replace and done.
Regex explanation:
\|\d+ : matches the first field to remove: a | followed by digits
\|\d+ : matches the second field to remove: a | followed by digits
(\|[0-9a-z]+) : matches and captures the last field, i.e. the hash after the second number
$ : forces the match to end at the end of the line
Replacement:
$1 : replaces the whole match with the content of the capture group, i.e. whatever (\|[0-9a-z]+) matched
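Since other programs were mentioned as an option, here is a minimal Python sketch of the end-anchored approach from the first answer. The function name drop_two_before_last is made up, and the sketch assumes, as that answer does, that the last three columns are never empty.

```python
import re

# Anchored to the end of the line: two columns to drop, then the
# last column, which is captured and kept.
LAST_THREE = re.compile(r"\|[^|]+\|[^|]+(\|[^|]+)$")

def drop_two_before_last(line):
    """Remove the two columns that come right before the last column."""
    return LAST_THREE.sub(r"\1", line)

rows = [
    "6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1",
    "8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389",
]
for row in rows:
    print(drop_two_before_last(row))
```

Because the match is counted from the end of the line, extra | characters in the user field do not matter. For a file with millions of lines, the function can be applied line by line while streaming the file.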

Regular expression for selecting trailing whitespace except first space after last character in line

I'm editing text in Atom.
Beginning with the regex, $\s , I haven't been able to figure out how to anchor my selection from the second blank space after the line.
I want to remove the thousands of line returns in a text file ( originally formatted as an .srt video transcript ) and replace them with a single, blank space so as to not join together any words.
For example, my file looks like this:
This content is
difficult to read
because the lines
break after too
few characters.
$\s will select all trailing whitespace, something that I don't want to do, because if I delete all the space selected by that regex then I will cause lots of words to join up into nonsense.
I want to start trimming the trailing whitespace of each line from the second blank space, not the first, so that the expected output would be:
"This content is difficult to read because the lines break after too few characters."
Instead of:
"This content isdifficult to readbecause the linesbreak after toofew characters."
I have solved this problem using MS Word's Find and Replace; substituting a single space ( by literally hitting the space bar once ) for all the hard returns ( enter ^p in the Find field ).
I don't know why the Atom regex engine didn't recognise the answer from regex101.com that was provided in the comments. It solves my problem in the regex101 tester.
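For comparison, the same join can be sketched outside the editor in Python. This is only a sketch under assumptions: the function name join_lines is made up, and it replaces every line break, together with any spaces or tabs around it, with exactly one space, which is the Word-style substitution described above.

```python
import re

def join_lines(text):
    """Replace each run of line breaks (plus surrounding blanks) with one space."""
    return re.sub(r"[ \t]*(?:\r?\n)+[ \t]*", " ", text.strip())

sample = (
    "This content is\n"
    "difficult to read\n"
    "because the lines\n"
    "break after too\n"
    "few characters.\n"
)
print(join_lines(sample))
# This content is difficult to read because the lines break after too few characters.
```

Replacing the break with a space, rather than deleting it, is what keeps the words from running together.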

How do you "quantify" a variable number of lines using a regexp?

Say you know the starting and ending lines of some section of text, but the chars in some lines and the number of lines between the starting and ending lines are variable, à la:
aaa
bbbb
cc
...
...
...
xx
yyy
Z
What quantifier do you use, something like:
aaa\nbbbb\ncc\n(.*\n)+xx\nyyy\nZ\n
to parse those sections of text as a group?
You can use the s flag to match multiline text, like this:
~\w+ ~s.
There is a similar question here:
Javascript regex multiline flag doesn't work
If I understood correctly, you know that your text begins with aaa\nbbbb\ncc and ends with xx\nyyy\nZ\n. You could use aaa.+?bbbb.+?cc(.+?)xx.+?yyy.+?Z so that all quantifiers are non-greedy and you don't accidentally capture two sections at once. The text in between would be in match group 1. You also need to turn on the setting that makes the dot match newlines.
Try this:
aaa( |\n)bbbb( |\n)cc( |\n)( |\n){0,1}(.|\n)*xx( |\n)yyy( |\n)Z
( |\n) matches a space or a newline (so your starting and ending phrases can be split into different lines)
RegExr
At the end of the day what worked for me using Kate was:
( )+aaa\n( )+bbbb\n( )+cc\n(.|\n)*( )+xx\n( )+yyy\n( )+Z\n
Using such regexps you can clear quite a bit of junk from pages.
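The non-greedy, dot-matches-newline approach suggested above can be sketched in Python. This is an illustration under assumptions: the function name section_bodies is made up, and re.DOTALL is Python's name for the setting that lets the dot match line breaks.

```python
import re

# Fixed start and end lines around a lazily matched body.
# re.DOTALL lets "." cross line breaks; "*?" stops at the first
# possible end marker instead of the last one in the text.
SECTION = re.compile(r"aaa\nbbbb\ncc\n(.*?)xx\nyyy\nZ\n", re.DOTALL)

def section_bodies(text):
    """Return the variable middle part of every section found in text."""
    return SECTION.findall(text)

text = (
    "junk\n"
    "aaa\nbbbb\ncc\nline 1\nline 2\nxx\nyyy\nZ\n"
    "more junk\n"
    "aaa\nbbbb\ncc\nother\nxx\nyyy\nZ\n"
)
print(section_bodies(text))
# ['line 1\nline 2\n', 'other\n']
```

With a greedy .* instead of .*?, the two sections in this sample would be returned as a single match running from the first aaa to the last Z.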