Regex match all lines that don't end with ,0 and ,1 - regex

I have a malformed CSV file which has two columns: Text,Value
The value is either 1 or 0, but some lines are malformed and span two lines:
1. "This line is fine, but there are some that are not like this",0
2. "Another good line",1
4. "Oh, I'm so bad!!
5. I spanned two lines!",0
6. "Why did you break me? FileHelpers can't read two lines!!",1
Lines 4 and 5 are supposed to be one line, but the CSV file I got is broken and they span two lines. This causes the FileHelpers engine to fail while reading the CSV file.
I have two CSV files with about 3000 lines each and I will only need to fix them once. I want to use Notepad++ to find all the lines that do not end in ,0 or ,1. What kind of regex can I use for that? Or maybe two regular expressions, one for the ,0 case and the other for the ,1 case.
Update:
Dan's answer works without the comma [^01]$ instead of ,[^01]$, but it only matches lines that are not ending with 0 or 1... it works sufficiently well in my case, but it does skip lines that are broken and actually end with 0 or 1.

I don't know how the other answer would work:
Something like the below is what I would use in Notepad++
[^,][^01]$
Here are the steps I did:
Use ([^,][^01])$ to match the lines and replaced with \1{marked}
Then switched to extended mode and replaced {marked}\r\n with nothing (an empty string) to get a single line.

The expression you would use is
([^,].|,[^01])$
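But unfortunately, Notepad++ does not support alternation (the | operator). [1] This applies to older releases; the PCRE-style engine introduced with Notepad++ 6.0 does support it.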
You can match the broken lines with these two expressions then:
[^,].$
,[^01]$
Except, of course, if the "Text" part does end in ,0 or ,1 itself. :-)
[1] http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Unsupported_Regex_Operators
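Outside Notepad++, the combined expression can be tried directly. Below is a minimal Python sketch, assuming the file content is already loaded into a string; the name broken_lines and the sample data are illustrative only and not part of the original answer.

```python
import re

# Alternation of the two cases: the second-to-last character is not a comma,
# or the line ends with a comma followed by something other than 0 or 1.
BROKEN = re.compile(r"(?:[^,].|,[^01])$")

def broken_lines(text):
    """Return (line_number, line) pairs for lines that do not end in ,0 or ,1."""
    return [(number, line)
            for number, line in enumerate(text.splitlines(), start=1)
            if BROKEN.search(line)]

sample = "\n".join([
    '"This line is fine, but there are some that are not like this",0',
    '"Another good line",1',
    '"Oh, I\'m so bad!!',
    'I spanned two lines!",0',
    '"Why did you break me? FileHelpers can\'t read two lines!!",1',
])
print(broken_lines(sample))
```

Only the first half of the broken record is reported, because its second half does end in ,0. Lines shorter than two characters are never reported by this expression.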

,[^01]$
Make sure regex mode is on.

General considerations
In general, to match a line that does not end with a specific pattern, you may use
^(?!.*pattern$).*$
where ^ matches the start of a line, (?!.*pattern$) is a negative lookahead that fails the match if there are 0 or more chars other than line break chars, as many as possible (.*), followed by pattern at the end of the line ($), and the .*$ actually matches the line.
To remove a line that does not end with some pattern together with a line break at the end, use
^(?!.*pattern$).*\R?
where \R? is an optional line break sequence.
In case of several fixed strings, you may use
^(?!.*(?:pattern|pattern2|patternN)$).*\R?
If there is just one or two fixed strings to check at the end of the line, you may use a bit quicker regex like
^.*$(?<!a)(?<!bcd)
that will match any line that ends with neither a nor bcd.
^.*$(?<!1)(?<!0)
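As a rough illustration of the lookahead form with several fixed strings, here is a small Python sketch. The helper name lines_not_ending_with and the sample text are assumptions made for this example. Python's re module has no \R, so the sketch only matches the lines and leaves the line breaks alone.

```python
import re

def lines_not_ending_with(text, suffixes):
    """Return the lines of text that end with none of the given fixed strings."""
    # re.escape keeps characters such as . or | in a suffix literal.
    alternatives = "|".join(re.escape(suffix) for suffix in suffixes)
    pattern = re.compile(rf"^(?!.*(?:{alternatives})$).*$", re.MULTILINE)
    return pattern.findall(text)

text = "alpha,0\nbeta,1\ngamma\ndelta,2"
print(lines_not_ending_with(text, [",0", ",1"]))
```

With re.MULTILINE, ^ and $ anchor at every line, which corresponds to how the patterns above behave in Notepad++.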
Current problem solution
So, for the current issue, to match a line not ending with 1 or 0, you may use
^(?!.*[01]$).*$ # without the line break
^(?!.*[01]$).*$\R? # with the line break
Or,
^.*(?<![01])$ # without the line break
^.*(?<![01])$\R? # with the line break
To remove/replace a line break on a line that does not end with a specific pattern you may use
(?<![01])$\R?
Replace with either an empty string (to remove the line break) or with any other delimiter string or character.
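For a one-off fix outside Notepad++, the same idea can be sketched in Python. This is only an illustration under some assumptions: the text uses \n line endings (as it does after reading a file in Python's default text mode), the function name join_broken_lines is made up for this example, and the lookbehind is the stricter ,[01] rather than [01], so that a broken first half ending in a bare 0 or 1 is still joined.

```python
import re

def join_broken_lines(text, joiner=" "):
    """Replace every line break that is not preceded by ,0 or ,1 with joiner."""
    # (?<!,[01]) is a fixed-width negative lookbehind of two characters.
    return re.sub(r"(?<!,[01])\n", joiner, text)

sample = "\n".join([
    '"This line is fine, but there are some that are not like this",0',
    '"Another good line",1',
    '"Oh, I\'m so bad!!',
    'I spanned two lines!",0',
    '"Why did you break me? FileHelpers can\'t read two lines!!",1',
])
fixed = join_broken_lines(sample)
print(fixed)
```

As noted elsewhere in this thread, a text field that itself ends in ,0 or ,1 right before the break cannot be detected this way.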

Related

Regex to remove double lines ignoring the punctuation marks and spaces in Notepad++ [duplicate]

Is it possible to remove duplicates while ignoring the punctuation marks and spaces in Notepad++? I would keep one of the matching lines (it doesn't matter which one).
My examples are from the txt file:
Rough work iconoclasm but the only way to get the truth. Oliver Wendell Holmes
Rough work, iconoclasm, but the only way to get the truth. Oliver Wendell Holmes
Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett
Rule No.1: Never lose money. Rule No.2: Never forget rule No.1. Warren Buffett
Self-esteem isn't everything, it's just that there's nothing without it. Gloria Steinem
Self-esteem isn't everything it's just that there's nothing without it. Gloria Steinem
You said she's a senior? Babe we're all crazy.
You said, she's a senior! Babe we're ALL crazy.
You said, she's a senior? Babe we're ALL crazy!
Result I need:
Rough work iconoclasm but the only way to get the truth. Oliver Wendell Holmes
Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett
Self-esteem isn't everything, it's just that there's nothing without it. Gloria Steinem
You said, she's a senior! Babe we're ALL crazy.
I can delete 100% matching duplications with regex, but can't find a regex rule to ignore spaces and marks.
I don't think regex is the best tool for this task, but it's a nice challenge. You can match single words using a nested structure like:
((\w+)\W+((\w+)\W+( ... ((\w+)\W+)? ... )?)?(\w*))
When matching this, capture groups 2 to n contain the words 1 to n-1 of a line. The nested structure is necessary to make it non-ambiguous - otherwise, running the regex takes too long.
To match the duplicate lines, we use a similar structure with back-references:
\1\W+(\2\W+( ... (\9\W+)? ... )?)?
This will also match lines that are substrings of the previous line, which is again helpful to improve performance.
Notice that you have to use the \g{n}-notation when using more than 9 references in Notepad++. Moreover, to avoid matching line breaks you should use [^\w\n\r] instead of \W. To further improve performance, unnecessary groups should be non-capturing, i.e., (?: ... ).
To generate the rather long regex that solves the problem for, e.g., up to 20 words per line, you can use the following script:
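// Depth of the nested optional groups generated below
const MAX_WORDS = 20
// Any character that is neither a word character nor a line break
const punct = "[^\\w\\n\\r]"
// Notepad++ needs the \g{n} notation for references above 9
const backref = (i) => `\\g{${i}}`
// Nested capture groups for the words of the line that is kept
const patternKeep = (i) => "(\\w+)[^\\w\\n\\r]+" + (i < 0 ? "" : `(?:${patternKeep(i-1)})?`)
// Matching nested back-references for the duplicate lines that are removed
const patternRemove = (i) => `${backref(MAX_WORDS-i + 2)}(?:${punct}+` + (i < 0 ? "" : patternRemove(i-1)) + ")?"
console.log("^(" + patternKeep(MAX_WORDS) + "(\\w*))(\\r?\\n" + patternRemove(MAX_WORDS)+ `${punct}*${backref(MAX_WORDS+4)}${punct}*)+$`)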
When copying this to Notepad++ with settings "Wrap around" on and "Match case" off and replacing with $1, it will remove all duplicate lines in your example.
I doubt that it can be done purely with regular expressions. If it can then I imagine that the expression would be difficult to understand and difficult to maintain. Instead I would suggest a multi-step approach.
Step 1 - modify each line to be: original-line separator original-line.
Step 2 - convert it to be line-without-punctuation separator original-line.
Step 3 - sort the lines
Step 4 - remove duplicated lines
Step 5 - remove line-without-punctuation and separator leaving just the original line.
In more detail:
In all the replaces below: select "Wrap around", unselect "Dot matches newline", unselect "Match whole word only" and unselect "Match case".
Step 1 - choose a separator, some text that is not punctuation and does not occur in the file. Here I use qqq. Do a regular expression replace of ^(.+)$ with \1qqq\1.
Step 2 - remove any punctuation before the separator. Repeatedly do a regular expression replace of [!',-.:?]+(.*qqq) with \1 until no more replacements are made. This expression matches all the punctuation in the example, but you may need to add more for your full text. You also need to reduce multiple spaces to single spaces, so repeatedly do a regular expression replace of two or more spaces followed by (.*qqq) with a single space followed by \1 until no more replacements are made. As a final step, to handle spaces before the qqq, do a regular expression replace of one or more spaces followed by qqq with qqq (this could also use a non-regular-expression replace).
Step 3 - sort the lines lexicographically.
Step 4 - remove duplicated lines. Repeatedly do a regular expression replace of ^(.*qqq).*\R\1 with \1 until no more replacements are made.
Step 5 - Remove unwanted text leaving the original line. Do a regular expression replace of ^.*qqq with nothing (the empty string).
If all punctuation can be deleted, so that the result is just the line without punctuation, then you could simply do a regular expression replace of [!',-.:? ]+ with a single space, followed by a sort and finally a remove duplicates.
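If a script is an option, the same "compare a stripped copy, keep the original" idea can be sketched in a few lines of Python. This is not the Notepad++ procedure above, just an illustration of the principle: it keeps the first occurrence, needs no sorting, and preserves the original order. The names normalized_key and remove_near_duplicates are made up for this example.

```python
import re

def normalized_key(line):
    """Drop everything except letters and digits, and ignore case."""
    return re.sub(r"[\W_]+", "", line).lower()

def remove_near_duplicates(lines):
    """Keep the first line for each normalized key, in the original order."""
    seen = set()
    kept = []
    for line in lines:
        key = normalized_key(line)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

lines = [
    "Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett",
    "Rule No.1: Never lose money. Rule No.2: Never forget rule No.1. Warren Buffett",
    "You said she's a senior? Babe we're all crazy.",
    "You said, she's a senior! Babe we're ALL crazy.",
    "You said, she's a senior? Babe we're ALL crazy!",
]
for line in remove_near_duplicates(lines):
    print(line)
```

Because the key removes spaces as well as punctuation, "No. 1" and "No.1" compare equal, which is what the question asks for.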
Previously this question attracted an answer, but the author deleted it. To me it was so interesting because a special technique was illustrated. In a comment the answerer pointed me towards another thread to read more about it.
After experimenting a bit with that answer, I came up with the following pattern. In Notepad++, uncheck both "Match case" and ". matches newline", and replace with an empty string.
^(?>[^\w\n]*(\w++)(?=.*\R(\2?+[^\w\n]*\1\b)))+[^\w\n]*\R(?=\2[^\w\n]*$)
Here is the demo in Regex101 - Assumption is, that duplicate lines are consecutive (like sample).
Most of the used regex-tokens can be looked up in the Stack Overflow Regex FAQ.
In short words, the mechanism used is to capture words from one line to the first group (\w++) while inside the lookahead (?=.*\R(\2?+...\1\b))) a second group in the consecutive line is "growing" from itself plus the captures until \R(?=\2...$) it either matches all words or fails.
The second group holds the substring of the consecutive line that matches words and order of the previous line. It expands at each repetition from optionally itself and a word from the previous line. Separated by [^\w\n]* any amount of characters that are not word characters or newline.
For making it work, matching is done without giving back at crucial points (prevent backtracking).

Remove duplicate lines containing same starting text

So I have a massive list of numbers where all lines contain the same format.
#976B4B|B|0|0
#970000|B|0|1
#974B00|B|0|2
#979700|B|0|3
#4B9700|B|0|4
#009700|B|0|5
#00974B|B|0|6
#009797|B|0|7
#004B97|B|0|8
#000097|B|0|9
#4B0097|B|0|10
#970097|B|0|11
#97004B|B|0|12
#970000|B|0|13
#974B00|B|0|14
#979700|B|0|15
#4B9700|B|0|16
#009700|B|0|17
#00974B|B|0|18
#009797|B|0|19
#004B97|B|0|20
#000097|B|0|21
#4B0097|B|0|22
#970097|B|0|23
#97004B|B|0|24
#2C2C2C|B|0|25
#979797|B|0|26
#676767|B|0|27
#97694A|B|0|28
#020202|B|0|29
#6894B4|B|0|30
#976B4B|B|0|31
#808080|B|1|0
#800000|B|1|1
#803F00|B|1|2
#808000|B|1|3
What I am trying to do is remove all duplicate lines that contain the same hex codes, regardless of the text after it.
Example, in the first line #976B4B|B|0|0 the hex #976B4B shows up in line 32 as #976B4B|B|0|31. I want all lines EXCEPT the first occurrence to be removed.
I have been attempting to use regex to solve this, and found that replacing ^(.*)(\r?\n\1)+$ with $1 can remove duplicate lines, but that is obviously not what I need. I'm looking for some guidance and maybe a chance to learn from this.
You can use the following regex replacement, make sure you click Replace All as many times as necessary, until no match is found:
Find What: ^((#[[:xdigit:]]+)\|.*(?:\R.+)*?)\R\2\|.*
Replace With: $1
Details:
^ - start of a line
((#[[:xdigit:]]+)\|.*(?:\R.+)*?) - Group 1 ($1, it will be kept):
(#[[:xdigit:]]+) - Group 2: # and one or more hex chars
\| - a | char
.* - the rest of the line
(?:\R.+)*? - any zero or more non-empty lines (if they can be empty, replace .+ with .*)
\R\2\|.* - a line break, Group 2 value, | and the rest of the line.
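For a very long list, repeatedly clicking Replace All may become tedious. As an alternative, here is a minimal Python sketch of the same "keep the first line per hex code" rule. It is not part of the regex answer; the function name keep_first_per_key is made up for this example, and the comparison is assumed to be case-sensitive (all hex codes in the sample are upper case).

```python
def keep_first_per_key(lines, sep="|"):
    """Keep only the first line for each value found before the first separator."""
    seen = set()
    kept = []
    for line in lines:
        key = line.split(sep, 1)[0]  # e.g. "#976B4B"
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

lines = [
    "#976B4B|B|0|0",
    "#970000|B|0|1",
    "#970000|B|0|13",
    "#976B4B|B|0|31",
    "#808080|B|1|0",
]
for line in keep_first_per_key(lines):
    print(line)
```

Unlike the regex, this makes a single pass and does not require the duplicates to be reachable by one match.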

Regex which grabs everything between two characters at the end of a line

I'm looking to create a regex which grabs the text between two ":"s but only if it is the "last set", for example:
\--- org.codehaus.groovy.modules.http-builder:http-builder:0.7.1
should return:
http-builder
It should be noted that it's possible to get something like:
\--- org::codehaus::groovy::modules::http-builder:http-builder:0.7.1
because the input does not necessarily follow conventions (based on the problem at hand) but the required information is ALWAYS in the last two ":"s.
I've tried some of the following (minus the end of line):
1) (?<=\:).*(?=\:)
2) [^(.*:)].*[^(:.*)]
3) :.*: (this was the most successful, although I got the ":"s with the result but there are issues when there is more than one set of ":"s)
Further information:
I need to use Groovy for this
I can read it using a stream or a file (in case that matters)
Thanks for reading and any help!
:([^:]*):[^:]*$
That means:
Sequence must start with a :
Then start capturing (
Capture all characters that are not colons [^:]*
End capturing ) ...
... at the next colon :
Then there's another sequence of chars [^:]*
And after that sequence the line must end $ (no more sequence)
Or if you can use non-greedy matches, you can also use
:(.*?):[^:]*$
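.* means capture as many characters as possible, while .*? means capture as few characters as possible. Not all regex implementations support that, though. Note that a match still starts at the leftmost : that allows one, so this variant captures only the last segment when the line contains exactly two colons; for the org::codehaus::... input, use the [^:]* version.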
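To see how the two patterns behave on both sample inputs, here is a small sketch. It uses Python rather than Groovy purely for illustration (the patterns themselves are unchanged), and the names LAST_PAIR, LAZY and between_last_colons are made up for this example.

```python
import re

LAST_PAIR = re.compile(r":([^:]*):[^:]*$")  # capture cannot contain a colon
LAZY = re.compile(r":(.*?):[^:]*$")         # capture may contain colons

def between_last_colons(line):
    """Return the text between the last two colons, or None if there are fewer than two."""
    match = LAST_PAIR.search(line)
    return match.group(1) if match else None

simple = r"\--- org.codehaus.groovy.modules.http-builder:http-builder:0.7.1"
messy = r"\--- org::codehaus::groovy::modules::http-builder:http-builder:0.7.1"

print(between_last_colons(simple))
print(between_last_colons(messy))
print(LAZY.search(messy).group(1))
```

On the second input the lazy pattern starts at the first colon and has to stretch until the last one, so its capture contains everything in between. The [^:]* version is the one that fits the "only the last set" requirement.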
How about splitting on the : and grabbing the next-to-last segment?
['org.codehaus.groovy.modules.http-builder:http-builder:0.7.1',
/\--- org::codehaus::groovy::modules::http-builder:http-builder:0.7.1/].each { line ->
assert 'http-builder' == line.split(':')[-2]
}

i need help in regex

so i have (matlab) code .. and one of the lines doesn't have (;) at the end of the line
i want to find that line
for a starter :
sad= sdfsdf ; %this is comment
sad = awaww ;
n= sdfdsfd ;
m = (asd + adsf(asd,asd)) %this is comment
lets say i want to find the 4th line because it doesnt have (;) at the end of line ..
so far im stuck at this :
/(^[-a-zA-Z0-9]+\s*=[-a-zA-Z0-9#:%,_\+.()~#?&//= ]+)(?!;)$/gim
so this will work fine.. it will find the fourth line only
but what if i wanted (;) in middle of the line but not at end or before the comment .. ?
w=sss (;)aaa **;** % i dont want this line to be selected
w=sss (;)aaa %i want this line to be selected
http://regexr.com/3cfor
Well, let's find all lines which end with a semicolon:
^.+?;
optionally followed by horizontal whitespace:
^.+?;[ \t]*
and an optional comment:
^.+?;[ \t]*(?:%.*)?
This expression easily matches all the lines you don't want. So, invert it:
^(?!.+?;[ \t]*(?:%.*)?$).+
Unfortunately, that's too easy. It fails to match lines which contain a semicolon in a comment. We could replace .+? with [^%\r\n]+? but this would fail on lines containing a % in a string.
If you need a more robust pattern, you'll have to account for all of this.
So let's start the same way, by defining what a "correct" line should look like. I'll use the PCRE syntax for atomic grouping, so you'll have to use perl = TRUE.
A string is: '(?>[^']+|'')*'
Other code (except string, comments and semicolons) is covered by: [^%';\r\n]+
So "normal" code is:
(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?
Then, we add the required semicolon and optional comment:
(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?;[ \t]*(?:%.*)?$
Finally, we invert all of this:
^(?!(?>[^%';\r\n]+|'(?>[^']+|'')*'|;)+?;[ \t]*(?:%.*)?$).+
And we have the final pattern. Demo.
You don't need to fully tokenize the input, you only have to recognize the different "lexer modes". I hope handling strings and comments is enough, but I didn't check the Matlab syntax thoroughly.
You could use this with other regex engines that do not support atomic groups by replacing (?> with (?: but you'll expose yourself to the catastrophic backtracking problem.
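The "lexer modes" idea can also be written as a tiny scanner instead of a regex, which avoids the backtracking concern entirely. The following Python sketch is only an illustration under the same assumptions as the pattern above: every ' opens or closes a string (so the transpose operator and double-quoted strings are not handled), and % outside a string starts a comment. The function names strip_comment and lacks_semicolon are made up for this example.

```python
def strip_comment(line):
    """Return the code part of a line, cutting at the first % outside a string."""
    in_string = False
    for index, char in enumerate(line):
        if char == "'":
            # An escaped quote ('') toggles twice, so the mode is unchanged after it.
            in_string = not in_string
        elif char == "%" and not in_string:
            return line[:index]
    return line

def lacks_semicolon(line):
    """True when the code part is non-empty and does not end with a semicolon."""
    code = strip_comment(line).rstrip()
    return bool(code) and not code.endswith(";")

lines = [
    "sad= sdfsdf ; %this is comment",
    "sad = awaww ;",
    "n= sdfdsfd ;",
    "m = (asd + adsf(asd,asd)) %this is comment",
    "w=sss (;)aaa ; % i dont want this line to be selected",
    "w=sss (;)aaa %i want this line to be selected",
]
for number, line in enumerate(lines, start=1):
    if lacks_semicolon(line):
        print(number, line)
```

Comment-only and empty lines are deliberately not reported, since they have no statement to terminate.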

notepad++ - trying to reformat some stuff

I have a CSV that basically has rows that look like:
06444|WidgetAdapter 6444|Description:
Here is a description.
Maybe some more.
|0
The text in the third field is always different and varying, and I'm trying to replace all newlines within it only with <br>, so it ends up as
06444|WidgetAdapter 6444|Description: <br>Here is a description.<br>Maybe some more.<br>|0
edit:
I basically need to get rid of all linebreaks so each line is a proper VALUE|VALUE|VALUE|VALUE. Normalize/beautify/clean it.
None of my tools can import this properly, phpMyAdmin chokes, etc.
There are linebreaks within the field, there are doublequotes that are not escaped, etc.
Example other field:
08681|Book 08681|"Testimonial" - Person
You should buy this.|
Example of another field:
39338|Itemizer||
If you know you have 4 columns, you can easily parse your data. For example, here's a PHP line that results in an array with all data. Each line in the array is another array with all capturing groups: [0] has the whole match, and [1]-[4] with each column:
$pattern = '/^([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)$/m';
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);
The pattern is extremely simple: it takes 4 values (not pipe signs), separated by 3 pipes. Once you have the data, you can easily rebuild it the way you want, for example by using nl2br.
Note that you cannot reliably parse the data if the first and last columns can also contain newlines.
Working example: http://ideone.com/gG0K3
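For readers without PHP at hand, roughly the same approach can be sketched in Python. This is an illustration, not a port of the linked example: the names ROW and normalize are made up, the text is assumed to use \n line endings, and the first and last fields are additionally forbidden from containing line breaks (in line with the note above) so that a trailing newline is not captured as part of the last field.

```python
import re

# Four values separated by three pipes; only the middle fields may span lines.
ROW = re.compile(r"^([^|\r\n]*)\|([^|]*)\|([^|]*)\|([^|\r\n]*)$", re.MULTILINE)

def normalize(data):
    """Rebuild each record on one line, turning newlines in the third field into <br>."""
    rows = []
    for first, second, third, fourth in ROW.findall(data):
        third = third.replace("\n", "<br>")
        rows.append("|".join([first, second, third, fourth]))
    return "\n".join(rows)

data = (
    "06444|WidgetAdapter 6444|Description:\n"
    "Here is a description.\n"
    "Maybe some more.\n"
    "|0\n"
    '08681|Book 08681|"Testimonial" - Person\n'
    "You should buy this.|\n"
    "39338|Itemizer||"
)
print(normalize(data))
```

Each record comes out as a single VALUE|VALUE|VALUE|VALUE line, which is the shape the import tools expect.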
If needed, it is possible to target these newlines using a regular expression. The idea is to find only newlines that are followed by one extra value, and then only whole lines. We can check the number of values after the current newline is 1 modulo 4, so we know we're at the 3rd column:
(?:\r\n?|\n)(?=[^|]*\|[^\n\r|]*\s*(?:^(?:[^|]*\|){3}[^\n\r|]*$\s*)*\Z)
Or, with (some) explanations:
(?:\r\n?|\n) # Match a newline
(?= # that is before...
[^|]*\|[^\n\r|]*\s* # one more separator and value
(?:^(?:[^|]*\|){3}[^\n\r|]*$\s*)* # and some lines with 4 values.
\Z # until the end of the string.
)
I couldn't get it to work on Notepad++ (it didn't even match [\r\n]), but it seems to work well on other engines:
Rubular (Ruby): http://rubular.com/r/NsbTNg9vCT
RegExr (ActionScript): http://regexr.com?2u1iu
Regex Hero (.Net): http://regexhero.net/tester/?id=215ac2bb-811b-48dd-8c00-6dcfadfae2f2