Regex in Notepad++ to Find-Replace or Remove Partial Strings

This data is adapted from an online dataset of "Customer Complaints". It was modified in Excel and Notepad++, and that manipulation left an extra double quote directly after each index number (1, 2, 3, ...) that follows the string VALUES (. I would like to remove only that extra quote and keep the sequential index numbers, which range from one to five digits. This is in preparation for working with a proprietary database having 1.35 million lines of code.
This rather clumsy regex finds the string containing the quote, but a replacement that keeps the index numbers eludes me. Any help would be appreciated.
REGEX
\s\(([0-9])",|\s\(([0-9][0-9])",|\s\(([0-9][0-9][0-9])",|\s\(([0-9][0-9][0-9][0-9])",|\s\(([0-9][0-9][0-9][0-9][0-9])",
DATA STRINGS
INSERT INTO Complaints VALUES (1","2013-07-29","consumer loan","managing the loan or lease","Wells Fargo & Company","VA","24540","phone","2013-07-30","closed with explanation","468882");
INSERT INTO Complaints VALUES (2","2013-07-29","bank account or service","using a debit or ATM card","Wells Fargo & Company","CA","95992","web","2013-07-31","closed with explanation","468889");
INSERT INTO Complaints VALUES (3","2013-07-29","bank account or service","account opening, closing, or management","Santander Bank US","NY","10065","fax","2013-07-31","closed","468879");

Find VALUES \((\d+)" - the inner parentheses will capture the digits (\d) one or more times (+) until a " is encountered.
You can then replace with VALUES \($1 where $1 is the corresponding captured value.
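For readers who would rather script the fix than run it in the editor, here is a minimal sketch of the same capture-and-replace idea using Python's re module. The helper name remove_stray_quote and the shortened sample row are made up for illustration; note that Python spells the back-reference \1 where Notepad++ uses $1.

```python
import re

# Same idea as the pattern above: capture the digits, match the stray quote.
STRAY_QUOTE = re.compile(r'VALUES \((\d+)"')

def remove_stray_quote(line):
    """Drop the quote that follows the index number (hypothetical helper)."""
    # \1 re-inserts the captured digits, so only the quote disappears.
    return STRAY_QUOTE.sub(r'VALUES (\1', line)

# Shortened, made-up row in the same shape as the data strings in the question.
sample = 'INSERT INTO Complaints VALUES (12345","2013-07-29","consumer loan");'
print(remove_stray_quote(sample))
```

Because \d+ accepts any number of digits, the five alternatives in the original attempt collapse into one.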

Ctrl+H
Find what: VALUES\h*\(\d+\K"
Replace with: LEAVE EMPTY
check Wrap around
check Regular expression
Replace all
Explanation:
VALUES # literally
\h* # 0 or more horizontal spaces
\( # opening parenthesis
\d+ # 1 or more digits
\K # forget all we have seen until this position
" # a double quote


Regex for text (and numbers and special characters) between multiple commas [duplicate]

I'm going nuts trying to get a regex to detect spam of keywords in the user inputs. Usually there is some normal text at the start and the keyword spam at the end, separated by commas or other chars.
What I need is a regex to count the number of keywords to flag the text for a human to check it.
The text is usually like this:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8...
I've tried several regex to count the matches:
-This only gets one out of two keywords
[,-](\w|\s)+[,-]
-This also matches the random text
(?:([^,-]*)(?:[^,-]|$))
Can anyone tell me a regex to do this? Or should I take a different approach?
Thanks!
Per your answer to my question, here is a regexp to match a string that occurs between two commas.
(?<=,)[^,]+(?=,)
This regexp does not match, and hence do not consume, the delimiting commas.
This regexp would match " and hence do not consume" in the previous sentence.
The fact that your regexp matched and consumed the commas was the reason why your attempted regexp only matched every other candidate.
Also, if the whole input is a single string, you will want to keep matches from running across line breaks. In that case you will want to use:
(?<=,)[^,\n]+(?=,)
http://www.phpliveregex.com/p/1DJ
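The difference between consuming the commas and merely looking at them is easy to see side by side. The following is a small sketch in Python's re module (not the PHP flavor used in the linked demo); the test string is made up.

```python
import re

text = ",a,b,c,d,"

# Consuming version: each match swallows its closing comma, so the next
# candidate has no opening comma left and every other item is skipped.
consuming = re.findall(r",[^,]+,", text)

# Lookaround version: the commas are checked but not consumed, so every
# item between two commas is found.
lookaround = re.findall(r"(?<=,)[^,]+(?=,)", text)

print(consuming)
print(lookaround)
```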
As others have said this is potentially a very tricky thing to do... It suffers from all of the same failures as general "word filtering" (e.g. people will "mask" the input). It is made even more difficult without plenty of example posts to test against...
Solution
Anyway, assuming that keywords will be on separate lines to the rest of the input and separated by commas you can match the lines with keywords in like:
Regex
#(?:^)((?:(?:[\w\.]+)(?:, ?|$))+)#m
Input
Taken from your question above:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8
Output
// preg_match_all('#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m', $string, $matches);
// var_dump($matches);
array(2) {
[0]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(31) "Keyword6, keyword7, keyword8..."
}
[1]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(31) "Keyword6, keyword7, keyword8"
}
}
Explanation
#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m
1. # => Starting delimiter
2. (?:^) => Matches start of line in a non-capturing group (you could just use ^ I was using |\n originally and didn't update)
3. ( => Start a capturing group
4. (?: => Start a non-capturing group
5. (?:[\w]+) => A non-capturing group to match one or more word characters a-zA-Z0-9_ (Using a character class so that you can add to it if you need to....)
6. (?:, ?|$) => A non-capturing group to match either a comma (with an optional space) or the end of the string/line
7. )+ => End the non-capturing group (4) and repeat 5/6 to find multiple matches in the line
8. ) => Close the capture group 3
9. # => Ending delimiter
10. m => Multi-line modifier
Follow up from number 2:
#^((?:(?:[\w]+)(?:, ?|$))+)#m
Counting keywords
Having now returned an array of lines only containing key words you can count the number of commas and thus get the number of keywords
$key_words = implode(', ', $matches[1]); // Join lines returned by preg_match_all
echo substr_count($key_words, ','); // 8
N.B. In most circumstances this will return NUMBER_OF_KEY_WORDS - 1 (i.e. in your case 7); it returns 8 because you have a comma at the end of your first line of key words.
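As a rough cross-check of the counting step, here is a sketch of the same approach in Python rather than PHP. It uses the [\w.] variant of the character class, and instead of counting commas it counts the non-empty pieces after splitting, which sidesteps the trailing-comma caveat. The function name count_keywords is an assumption for illustration.

```python
import re

# Lines made up only of keywords, each followed by a comma (optionally a
# space) or by the end of the line. Dots are allowed inside a keyword.
KEYWORD_LINE = re.compile(r"^((?:[\w.]+(?:, ?|$))+)", re.MULTILINE)

def count_keywords(text):
    """Count keywords on keyword-only lines (hypothetical helper)."""
    total = 0
    for line in KEYWORD_LINE.findall(text):
        # Count non-empty pieces so a trailing comma adds nothing.
        total += sum(1 for part in line.split(",") if part.strip())
    return total

sample = (
    "[random text, with commas, dots and all]\n"
    "keyword1, keyword2, keyword3, keyword4, keyword5,\n"
    "Keyword6, keyword7, keyword8..."
)
print(count_keywords(sample))
```

A single-word line in the free text would also be counted as one keyword, so the result is best treated as a signal for human review, as the question intends.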
Links
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://www.regular-expressions.info/
http://php.net/substr_count
Why not just use explode and trim?
$keywords = array_map ('trim', explode (',', $keywordstring));
Then do a count() on $keywords.
If you think keywords with spaces in them are spam, then you can iterate over the $keywords array and look for any that contain whitespace. There might be legitimate reasons for having spaces in a keyword though. If you're talking about superheroes on your system, for example, someone might enter The Tick or Iron Man as a keyword.
I don't think counting keywords and looking for spaces in keywords are really very good strategies for detecting spam though. You might want to look into other bot protection strategies instead, or even use manual moderation.
How to match on the String of text between the commas?
This SO post was marked as a duplicate of my question. It is not a duplicate, and none of the answers here address how to also match on the strings between the commas, so below I show how to take this a step further.
How to Match on single digit values in a CSV String
For example, suppose the task is to find a single 7, 8 or 9 between the commas, without matching combinations such as 17, 77 or 78.
The answer is to use lookarounds and place your search pattern between them:
(?<=^|,)[789](?=,|$)
See live demo.
The pattern above is the more concise one; both patterns offered as solutions to that question are listed below:
(?<=^|,)[789](?=,|$) Provided by @Bohemian and chosen as the Correct Answer
(?:(?<=^)|(?<=,))[789](?:(?=,)|(?=$)) Provided in comments by @Ouroborus
Demo: https://regex101.com/r/fd5GnD/1
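One portability note, sketched in Python: Python's re module rejects (?<=^|,) because the two lookbehind alternatives have different widths. An equivalent that it does accept is the pair of negative lookarounds "not preceded by a non-comma" and "not followed by a non-comma". The sample string below is made up.

```python
import re

# (?<![^,])  start of string, or the previous character is a comma
# (?![^,])   end of string, or the next character is a comma
SINGLE_789 = re.compile(r"(?<![^,])[789](?![^,])")

csv_line = "7,17,8,77,78,9"
print(SINGLE_789.findall(csv_line))
```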
Your first regexp doesn't need a preceding comma
[\w\s]+[,-]
A regex that will match strings between two commas or start or end of string is
(?<=,|^)[^,]*(?=,|$)
Or, a bit more efficient:
(?<![^,])[^,]*(?![^,])
See the regex demo #1 and demo #2.
Details:
(?<=,|^) / (?<![^,]) - start of string or a position immediately preceded with a comma
[^,]* - zero or more chars other than a comma
(?=,|$) / (?![^,]) - end of string or a position immediately followed with a comma
If people still search for this in 2021
([^,\n])+
Match anything except new line and comma
regexr.com/60eme
I think the difficulty is that the random text can also contain commas.
If the keywords are all on one line and it is the last line of the text as a whole, trim the whole text removing new line characters from the end. Then take the text from the last new line character to the end. This should be your string containing the keywords. Once you have this part singled out, you can explode the string on comma and count the parts.
<?php
$string = " some gibberish, some more gibberish, and random text
keyword1, keyword2, keyword3
";
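$text = trim($string); // search and cut the same trimmed string so the offsets agree
$lastEOL = strrpos($text, PHP_EOL); // assumes the text uses the platform's line ending
$keywordLine = $lastEOL === false ? $text : substr($text, $lastEOL + strlen(PHP_EOL));
$keywords = explode(',', $keywordLine);
echo "Number of keywords: " . count($keywords);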
I know it is not a regex, but I hope it helps nevertheless.
The only way to find a solution is to find something that separates the random text from the keywords and does not occur within the keywords. If the keywords themselves contain a new line, you cannot use a single new line as the separator. But what about two consecutive new lines, or some other character sequence?
$string = " some gibberish, some more gibberish, and random text

keyword1, keyword2, keyword3,
keyword4, keyword5, keyword6,
keyword7, keyword8, keyword9
";
$text = trim($string); // search and cut the same trimmed string so the offsets agree
$separator = PHP_EOL . PHP_EOL; // 2 end of lines after random text
$lastEOL = strrpos($text, $separator);
$keywordBlock = $lastEOL === false ? $text : substr($text, $lastEOL + strlen($separator));
$keywords = explode(',', $keywordBlock);
echo "Number of keywords: " . count($keywords);
(edit: added example for more new lines - long shot)

Regex for SQL Query

Hello everyone, I have the following problem:
I have a long list of SQL queries that I need to adapt to a change I made. In the end it is a renaming problem, and I am afraid I am making the solution more complicated than it needs to be.
The query looks like this:
INSERT member (member, prename, name, street, postalcode, town, tel1, tel2, fax, bem, anrede, salutation, email, name2, name3, association, project) VALUES (2005, N'John', N'Doe', N'Street 4711', N'1234', N'Town', N'1234-5678', N'1234-5678', N'1234-5678', N'Leader', NULL, N'Dear Mr. Doe', N'a@b.com', N'This is the text i want to delete', N'Name2', N'Name3', NULL, NULL);
The INSERT column list used to contain another column, which I removed in Notepad++ by searching for its name ("example, ") and replacing it with nothing. The corresponding entry in VALUES cannot be removed this way, because its text varies from row to row. So far I have only worked with the text file in which I adjusted the list of queries.
So as you can see there is one more entry in Values than in the insertions (there was another column here, but it was removed by my change).
It is the entry after the email address. I would like to remove this including the comma (N'This is the text i want to delete',).
My idea was to form a group and say that the 14th value in the comma-separated list should be removed. However, even after some research I do not know how to do this.
I thought it could look like this (tried in https://regex101.com/)
VALUES\s?\((,) something here
Is this even the right approach or is there another method? I only knew Regex to solve this problem, because of course the values look different here.
And how can I finally use the regex to get the queries adapted (because the queries are local to my computer and not yet included in the code).
Short summary:
Change the query from
VALUES (... test5, test6, test7 ...)
To
VALUES (... test5, test7 ...)
As per my comment, you could use find/replace, where you search for:
(\bVALUES +\((?:[^,]+,){13})[^,]+,
And replace with $1
See the online demo
( - Open 1st capture group.
\bVALUES +\( - Match a word boundary, the literal 'VALUES', at least one space, and a literal opening parenthesis.
(?: - Open non-capturing group.
[^,]+, - Match anything but a comma at least once followed by a comma.
){13} - Close non-capture group and repeat it 13 times.
) - Close 1st capture group.
[^,]+, - Match anything but a comma at least once followed by a comma.
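The same find/replace can be tried outside the editor. Below is a minimal sketch in Python's re module, where the back-reference is written \1 instead of $1. The function name and the placeholder values v1 to v16 are made up; like the pattern itself, the sketch assumes that none of the first 14 values contains a comma.

```python
import re

# Group 1 keeps "VALUES (" plus the first 13 values with their commas;
# the 14th value and its comma are matched outside the group and dropped.
FOURTEENTH_VALUE = re.compile(r"(\bVALUES +\((?:[^,]+,){13})[^,]+,")

def drop_14th_value(sql):
    """Remove the 14th entry of the VALUES list (hypothetical helper)."""
    return FOURTEENTH_VALUE.sub(r"\1", sql)

# Placeholder statement with 16 values, standing in for the real rows.
values = [f"v{i}" for i in range(1, 17)]
before = "INSERT member (...) VALUES (" + ", ".join(values) + ");"
print(drop_14th_value(before))
```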
You may use the following to remove / replace the value you need:
Find What: \bVALUES\s*\((\s*(?:N'[^']*'|\w+))(?:,(?1)){12}\K,(?1)
Replace With: (empty string, or whatever value you need)
See the regex demo
Details
\bVALUES - whole word VALUES
\s* - 0+ whitespaces
\( - a (
(\s*(?:N'[^']*'|\w+)) - Group 1: 0+ whitespaces and then either N' followed with any 0 or more chars other than ' and then a ', or 1+ word chars
(?:,(?1)){12} - twelve repetitions of , followed with the Group 1 pattern
\K - match reset operator that discards the text matched so far from the match memory buffer
, - a comma
(?1) - Group 1 pattern.

Regular Expression - joining two lines, but first number of joined 2nd line is deleted

I have some sample data (simplified extract below - the real file contains 52,000 lines, with pairs of lines, the 2nd line of each pair is always a date field, and there are always 2 blank lines between each data pair):
The colour of money 20170233434
10-DEC-2015
SOME TEST DATA 32423412123
19-OCT-2015
I want to join each line up, using a Regular Expression (I am using TextPad, but I think the RegEx syntax is generic).
I am doing a replace search, and want to end up with this:
The colour of money 20170233434 10-DEC-2015
SOME TEST DATA 32423412123 19-OCT-2015
I am using this in the "Find what" field:
\n^[0|1|2|3|4|5|6|7|8|9]
And replacing with NULL.
The end result I am getting is almost there:
The colour of money 20170233434 0-DEC-2015
SOME TEST DATA 32423412123 9-OCT-2015
But not quite, because the first digit of the date values are being stripped out.
How would I modify the RegEx to not delete the first number of the 2nd line? I tried to replace with [0|1|2|3|4|5|6|7|8|9] but that just put that entire string in front of each date field, and still stripped out the first number of the date.
Just search for this
\r?\n(\d{1,2}\-)
And replace it with $1. See the live example here.
If you want to replace it with null, you can also use a lookahead:
\r?\n(?=\d{1,2}\-)
And replace it with null. See the live example here.
Those regular expressions match only a newline (\n on UNIX, \r\n on Windows) that is followed by one or two digits and then a dash. If you want to be more specific, you could also use this regular expression:
\r?\n(\d{1,2}\-[A-Z]{3}\-\d{4})
Or with a lookahead respectively:
\r?\n(?=\d{1,2}\-[A-Z]{3}\-\d{4})
You could even check for the double linebreaks after the statement (live example):
\r?\n(\d{1,2}\-[A-Z]{3}\-\d{4}(?:\r?\n){2})
Or with a lookahead respectively (live example):
\r?\n(?=\d{1,2}\-[A-Z]{3}\-\d{4}(?:\r?\n){2})
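To see the lookahead variant at work outside TextPad, here is a small sketch in Python's re module. The sample reproduces the data from the question under two assumptions: each text line ends with a trailing space (which is why replacing with nothing still leaves a space before the date), and the pairs are separated by two blank lines as described. The function name is made up.

```python
import re

# Match a line break only when a date such as 10-DEC-2015 follows it.
# The lookahead checks the date without consuming it, so no digit is lost.
BREAK_BEFORE_DATE = re.compile(r"\r?\n(?=\d{1,2}-[A-Z]{3}-\d{4})")

def join_date_lines(text):
    """Pull each date line up onto the line before it (hypothetical helper)."""
    return BREAK_BEFORE_DATE.sub("", text)

sample = (
    "The colour of money 20170233434 \n"
    "10-DEC-2015\n"
    "\n"
    "\n"
    "SOME TEST DATA 32423412123 \n"
    "19-OCT-2015\n"
)
print(join_date_lines(sample))
```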

Seeking regex in Notepad++ to search and replace CRLF between two quotation marks ["] only

I've got a CSV file with some 600 records where I need to replace some [CRLF] with a [space] but only when the [CRLF] is positioned between two ["] (quotation marks). When the second ["] is encountered then it should skip the rest of the line and go to the next line in the text.
I don't really have a starting point. Hope someone comes up with a suggestion.
Example:
John und Carol,,Smith,,,J.S.,,,,,,,,,,,,,+11 22 333 4444,,,,,"streetx 21[CRLF]
New York City[CRLF]
USA",streetx 21,,,,New York City,,,USA,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Normal,,My Contacts,[CRLF]
In this case the two [CRLF] after the first ["] need to be replaced with a space [ ]. When the second ["] is encountered, skip the end of the line and go to next line.
Then again, now on the next line, after the first ["] is encountered replace all [CRLF] until the second ["] is encountered. The [CRLF]s vary in numbers.
In the CSV-file the amount of commas [,] before (23) and after (65) the 2 quotation marks ["] is constant.
So maybe a comma counter could be used. I don't know.
Thanks for feedback.
This will work using one regex only (tested in Notepad++):
Enter this regex in the Find what field:
((?:^|\r\n)[^"]*+"[^\r\n"]*+)\r\n([^"]*+")
Enter this string in the Replace with field:
$1 $2
Make sure the Wrap around check box (and Regular expression radio button) are selected.
Do a Replace All as many times as required (until the "0 occurrences were replaced" dialog pops up).
Explanation:
(
(?:^|\r\n) Begin at start of file or before the CRLF before the start of a record
[^"]*+ Consume all chars up to the opening "
" Consume the opening "
[^\r\n"]*+ Consume all chars up to either the first CRLF or the closing "
) Save as capturing group 1 (= everything in record before the target CRLF)
\r\n Consume the target CRLF without capturing it
(
[^"]*+ Consume all chars up to the closing "
" Consume the closing "
) Save as capturing group 2 (= the rest of the string after the target CRLF)
Note: The *+ is a possessive quantifier. Use them appropriately to speed up execution.
Update:
This more general version of the regex will work with any line break sequence (\r\n, \r or \n):
((?:^|[\r\n]+)[^"]*+"[^\r\n"]*+)[\r\n]+([^"]*+")
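The "Replace All until nothing is replaced" loop can also be scripted. The sketch below uses Python's re module and the CRLF-only version of the pattern, with plain * in place of the possessive *+ (possessive quantifiers are only available in recent Python versions, and here they affect speed rather than the result). The function name and the two-record sample are made up, and the sketch assumes every quoted field is properly closed.

```python
import re

# Group 1: record start up to the CRLF inside the quoted field.
# Group 2: the rest of the quoted field, up to and including the closing quote.
TARGET = re.compile(r'((?:^|\r\n)[^"]*"[^\r\n"]*)\r\n([^"]*")')

def join_quoted_linebreaks(text):
    """Replace CRLFs inside quoted fields with spaces (hypothetical helper)."""
    while True:
        # Each pass removes at most one CRLF per quoted field, so repeat
        # until a pass replaces nothing, as in the manual procedure.
        text, replaced = TARGET.subn(r"\1 \2", text)
        if replaced == 0:
            return text

sample = 'a,b,"street 21\r\nNew York\r\nUSA",x,y\r\nc,d,"one\r\ntwo",z\r\n'
print(repr(join_quoted_linebreaks(sample)))
```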
Maybe do it in three steps (assuming you have 88 fields in the CSV, because you said there are 23 commas before, and 65 after each second ")
Step 1: replace all CR/LF with some character not anywhere in the file, like ~
Search: \r\n Replace: ~
Step 2: replace all ~ after every 88th 'comma group' (or however many fields in CSV) with \r\n -- to reinsert the required CSV linebreaks:
Search: ((?:[^,]*?,){88})~ Replace: $1\r\n
Step 3: replace all remaining ~ with space
Search ~ Replace: <space>
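The three steps can be sketched in Python as follows. This is an illustration under stated assumptions, not a general CSV repair tool: every record must end with a comma directly before its line break (as in the sample row), no field may contain a comma, and the marker must not occur in the data. The function name, the parameter names, and the tiny three-comma records are made up; for the file in the question the count would be 88.

```python
import re

def rejoin_records(text, commas_per_record, marker="~"):
    """Three-step marker approach from the answer above (hypothetical helper)."""
    if marker in text:
        raise ValueError("marker already occurs in the input")
    # Step 1: flatten the file by turning every CRLF into the marker.
    flat = text.replace("\r\n", marker)
    # Step 2: after every complete group of commas, turn the marker back
    # into a real line break.
    record_end = re.compile(
        r"((?:[^,]*?,){%d})%s" % (commas_per_record, re.escape(marker))
    )
    restored = record_end.sub(lambda m: m.group(1) + "\r\n", flat)
    # Step 3: any marker still left was inside a field; make it a space.
    return restored.replace(marker, " ")

sample = 'a,"x\r\ny",c,\r\nd,"p\r\nq\r\nr",f,\r\n'
print(repr(rejoin_records(sample, 3)))
```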
In this case the source data is generated by the export function in GMail for your contacts.
After the modification outlined below (without RegEx) the result can be used to tidy up your contacts database and re-import it to GMail or to MS Outlook.
Yes, I am standing on the shoulders of @alan and @robinCTS. Thank you both.
Instructions in 5 steps:
use Notepad++ / find replace / extended search mode / wrap around = on
-1- replace all [CRLF] with a unique set of characters or a string (I used [~~])
find: \r\n and replace with: ~~
The file contents are now on one line only.
-2- Now we need to separate the header line. To do this, move to where the first record starts, exactly before the 88th comma (including the word after the 87th comma [,]), and enter the [CRLF] manually by hitting the return key. There are two lines now: header and records.
-3- now find all [,~~] and replace with [,\r\n] The result is one record per line.
-4- remove the remaining [~~] find: ~~ and replace with: [ ] a space.
The file is now clean of unwanted [CRLF]s.
-5- Save the file and use it as intended.

notepad++ - trying to reformat some stuff

I have a CSV that basically has rows that look like:
06444|WidgetAdapter 6444|Description:
Here is a description.
Maybe some more.
|0
The text in the third field is always different and varying, and I'm trying to replace all newlines within it only with <br>, so it ends up as
06444|WidgetAdapter 6444|Description: <br>Here is a description.<br>Maybe some more.<br>|0
edit:
I basically need to get rid of all linebreaks so each line is a proper VALUE|VALUE|VALUE|VALUE. Normalize/beautify/clean it.
None of my tools can import this properly, phpMyAdmin chokes, etc.
There are linebreaks within the field, there are doublequotes that are not escaped, etc.
Example other field:
08681|Book 08681|"Testimonial" - Person
You should buy this.|
Example of another field:
39338|Itemizer||
If you know you have 4 columns, you can easily parse your data. For example, here's a PHP line that results in an array with all data. Each line in the array is another array with all capturing groups: [0] has the whole match, and [1]-[4] with each column:
$pattern = '/^([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)$/m';
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);
The pattern is extremely simple: it takes 4 values (not pipe signs), separated by 3 pipes. Once you have the data, you can easily rebuild it the way you want, for example by using nl2br.
Note that you cannot reliably parse the data if the first and last columns can also contain new lines.
Working example: http://ideone.com/gG0K3
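For comparison, here is a sketch of the same parse-then-rebuild idea in Python rather than PHP. It makes two assumptions beyond the answer: the first and last columns are explicitly forbidden from containing line breaks (in line with the note above), and the input uses \n line endings. The names ROW and normalize_rows are made up.

```python
import re

# Four pipe-separated columns per record; only the two middle columns may
# span several lines. MULTILINE lets ^ and $ anchor at line boundaries.
ROW = re.compile(r"^([^|\r\n]*)\|([^|]*)\|([^|]*)\|([^|\r\n]*)$", re.MULTILINE)

def normalize_rows(data):
    """Return one VALUE|VALUE|VALUE|VALUE string per record (hypothetical helper)."""
    rows = []
    for match in ROW.finditer(data):
        col1, col2, col3, col4 = match.groups()
        # Replace the line breaks inside the third column with <br>.
        col3 = col3.replace("\n", "<br>")
        rows.append("|".join((col1, col2, col3, col4)))
    return rows

sample = (
    "06444|WidgetAdapter 6444|Description:\n"
    "Here is a description.\n"
    "Maybe some more.\n"
    "|0\n"
    "39338|Itemizer||\n"
)
for row in normalize_rows(sample):
    print(row)
```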
If needed, it is possible to target these newlines using a regular expression. The idea is to find only newlines that are followed by one extra value, and then only whole lines. We can check the number of values after the current newline is 1 modulo 4, so we know we're at the 3rd column:
(?:\r\n?|\n)(?=[^|]*\|[^\n\r|]*\s*(?:^(?:[^|]*\|){3}[^\n\r|]*$\s*)*\Z)
Or, with (some) explanations:
(?:\r\n?|\n) # Match a newline
(?= # that is before...
[^|]*\|[^\n\r|]*\s* # one more separator and value
(?:^(?:[^|]*\|){3}[^\n\r|]*$\s*)* # and some lines with 4 values.
\Z # until the end of the string.
)
I couldn't get it to work on Notepad++ (it didn't even match [\r\n]), but it seems to work well on other engines:
Rubular (Ruby): http://rubular.com/r/NsbTNg9vCT
RegExr (Action Script): http://regexr.com?2u1iu
Regex Hero (.Net): http://regexhero.net/tester/?id=215ac2bb-811b-48dd-8c00-6dcfadfae2f2