Regular expression to delete row from csv - regex

I have a line from a CSV file:
first decimal;;;first text;;second text with newlines, special symbols, including semicolons;second decimal, always present;first dot separated float, may not present;second dot separated float, may not present;third text that present only if present previous float
I need to delete the second text field (with its newlines and special symbols).
For now I have an expression like:
(?<=;;)(.*?)(?=;\d+)
The first part of it does not work, and I don't know how to make it select text preceded by exactly two semicolons (for now it selects text preceded by two or more semicolons and, if I turn on dotall, the first decimal preceded by semicolons plus a newline). Besides, I do not know how to make (.*?) match newline characters.

If you have a CSV file that contains semicolons and newlines as part of quoted fields, then regex is not the right tool for this. Imagine what would happen if you had a field like "This is one field;;don't split this;42"...
If you're sure that you'll never have two semicolons before or within a quoted field, then you may give regex a try. But a dedicated CSV parser would definitely be a safer bet.
That said, let's see why your regex fails:
Imagine the line 1;;;2;3. Your regex will match ;2 because it fulfills all the requirements - there are two semicolons before it, and a semicolon plus digit after it. It's also the shortest possible match at this position in the string.
What can you do? You could use another lookbehind assertion to make sure that it's not possible to match three semicolons before the current position:
(?<=;;)(?<!;;;)(.*?)(?=;\d+)
Give it a try - but look into CSV libraries too, because they will solve your problem better.
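As a rough illustration of the CSV-library route, here is a minimal Python sketch using the standard csv module. It assumes that the fields are separated by semicolons, that the second text field is quoted (so the parser can tell its embedded semicolons and newlines from real delimiters), and that it sits at index 5 when the empty fields are counted. The drop_field helper and the sample row are made up for the example.

```python
import csv
import io


def drop_field(text, index, delimiter=";"):
    """Return the CSV text with the field at `index` removed from every row."""
    # newline="" leaves line endings untouched, so the csv module itself
    # deals with newlines embedded in quoted fields.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in reader:
        if index < len(row):  # skip rows that are too short to have the field
            del row[index]
        writer.writerow(row)
    return out.getvalue()


# Hypothetical row in the layout from the question; field 5 is the quoted text.
sample = '12;;;first text;;"second; text\nwith newline";34;1.5;2.5;third text\n'
print(drop_field(sample, 5), end="")
```

If the text field is not quoted in the real file, a parser cannot recover the field boundaries either, and the input would need to be fixed at the source.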

Related

Notepad++ Regex Remove Character from Markdown Formatted Footnote

This is a follow-up question to what was solved yesterday:
Notepad++ Regex Replace Makeshift Footnotes format With Proper Markdown format
I managed to find a regex to remove the offending colons in the main text area, but only by cutting out the text and pasting back the result, which can only be done one by one.
I'm not sure how this can be done, but the expert can tell me.
So I have footnote references in markdown format. Two instances of the same thing:
[^1]:
[^2]:
.
.
.
[^99]:
I might not have 99 in a document but I wanted to show I need to match two digits here again.
As I said, there are two instances of these numbered references in the text. One in the main text pointing to the footnote and the footnote at the end of the document.
What I need is to delete the colons from the main text and leave the
[^3]:
[^15]:
etc.
references at the end intact.
Because the main text references come after a word or at the end of a sentence (usually before the sentence-ending period), there is never a case where a reference would start a sentence (even if they seem to appear there once or twice because of word wrap).
I provided the exact opposite of my needs in a Regex101 example.
I put in the exact opposite of what I want because I already knew of the ^ sign to match anything that is at the front of the line.
Now I would like to negate this, if possible, so that I delete the colons in the main text, not down at the bottom.
Of course, it is likely that my approach is not good and you'll come up with a completely different approach. Especially because there doesn't seem to be a NOT operator in Regex, if I read correctly.
I repeat: the Regex101 example with the match and substitution is exactly the opposite of what I want.
I am not sure if you can play around in the substitution line to get the desired negative effect.
I could probably have asked for removing the first occurrence of the colons, but I thought the important part of tackling the problem is that the items not to be matched are always at the start of the line, unlike the others.
Thanks for any suggestions
In Notepad++ you can use a negative lookbehind to assert that the match does not begin at the start of a line, and use \K to reset the start of the match so that only the colon is matched and replaced by an empty string.
(?<!^)\[\^\d{1,2}]\K:
Explanation
(?<!^) Negative lookbehind, assert that the current position is not the start of a line
\[\^ Match [^
\d{1,2} Match 1 or 2 digits
] Match ] literally
\K Forget what is matched so far
: Match a colon
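For checking the idea outside Notepad++, here is a small Python sketch. Python's re module does not support \K, so a capture group and a \1 backreference in the replacement are used instead; this is a stand-in for the pattern above, not the same expression. The sample text and the strip_inline_colons name are invented for the example.

```python
import re

# (?<!^) with re.MULTILINE rejects references that start a line; the capture
# group stands in for \K, which Python's re module does not support.
INLINE_REF = re.compile(r"(?<!^)(\[\^\d{1,2}\]):", re.MULTILINE)


def strip_inline_colons(text):
    """Remove the colon after footnote references that do not start a line."""
    return INLINE_REF.sub(r"\1", text)


# Made-up sample document: two inline references and two footnote definitions.
sample = (
    "A claim[^1]: and another one[^12]:.\n"
    "\n"
    "[^1]: First footnote\n"
    "[^12]: Second footnote\n"
)
print(strip_inline_colons(sample), end="")
```

The footnote definitions at the start of a line keep their colon; only the inline references lose it.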

Adjust existing regex to ignore semicolon inside quotes

I am using a regex to read CSV files and split their columns. The content of the files changes frequently and is unpredictable (the format is not). I already use the following regex to read the CSV file and split the columns:
;(?=(?:[^\"]*\"*[^\"]*\")*[^\"]*$)
It was working until I faced input like this:
'02'.'018'.'7975';PRODUCT 1;UN;02
'02'.'018'.'7976';PRODUCT 2;UN;02
'02'.'018'.'7977';PRODUCT 3;UN;02
'02'.'018'.'7978';"PRODUCT 4 ; ADDITIONAL INFO";UN;02 // Problem
'02'.'018'.'7979';"PRODUCT 5 ; ADDITIONAL INFO";UN;02 // Problem
I would like to understand how I can adjust my regex and adapt it to ignore semicolon inside quotes.
I am using Java with the method split from String class.
Bear in mind that you should probably use a parser for this, but if you must use regex, here's one that should work:
;(?=[^"]*(?:(?:"[^"]*){2})*$)
Explanation
; matches the semicolon.
(?=...) is a positive lookahead. It checks that the pattern contained in it matches at this position, without consuming any characters.
[^"]*(?:(?:"[^"]*){2})*$ ensures that there is an even number of quotes in the rest of the string.
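The question uses Java's String.split, but the pattern relies only on syntax that Python's re module shares, so the behaviour can be sketched in Python. The split_columns name is made up, and the input is one of the problem lines from the question with the trailing comment removed.

```python
import re

# Split on a semicolon only when an even number of double quotes follows it,
# i.e. when the semicolon is not inside a quoted field.
SPLIT_OUTSIDE_QUOTES = re.compile(r';(?=[^"]*(?:(?:"[^"]*){2})*$)')


def split_columns(line):
    """Split one line into columns, ignoring semicolons inside double quotes."""
    return SPLIT_OUTSIDE_QUOTES.split(line)


line = "'02'.'018'.'7978';\"PRODUCT 4 ; ADDITIONAL INFO\";UN;02"
for column in split_columns(line):
    print(column)
```

The quotes stay part of the second column. The lookahead rescans the rest of the line for every semicolon, which is one more reason to prefer a parser for large inputs.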

searching for text that contain and not contain text and symbol

This regex works correctly for finding lines that begin with an exclamation mark and do not contain the colon (:) symbol:
^!([^:\n]*)$
In addition to the regex above, I need it to match only lines that contain the word "spelling", like the code below, which does not work:
^!([^:\n]spelling*)$
You could do this:
^![^:\n]*spelling[^:\n]*$
If you are looping through a file line by line, as is typical, there is no need to exclude the newlines from the match:
^![^:]*spelling[^:]*$
Another option to consider when you have complex requirements is breaking the match down into multiple steps. This makes for simpler, easier to understand code that is less error-prone:
if (/^!/ and /spelling/ and not /:/)
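To compare the two approaches, here is a short Python sketch, assuming the input is processed one line at a time without trailing newlines. The function names and sample lines are invented for the example.

```python
import re

# Single-regex version: starts with "!", contains "spelling", has no colon.
PATTERN = re.compile(r"^![^:]*spelling[^:]*$")


def matches_regex(line):
    return PATTERN.search(line) is not None


def matches_steps(line):
    # Same three conditions, checked one at a time.
    return line.startswith("!") and "spelling" in line and ":" not in line


lines = [
    "!check spelling here",
    "!spelling: wrong",
    "no bang spelling",
    "!nothing relevant",
]
for line in lines:
    print(line, matches_regex(line), matches_steps(line))
```

Both functions accept only the first sample line; the step-by-step version is easier to extend when another condition is added later.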
spelling* matches spellin, spelling, spellingg, spellinggg, etc. You were trying for
^([^:\n]*spelling[^\n]*)$
aka
^([^:\n]*spelling.*)$ # Assuming /s isn't used
But that would allow : after spelling, so you really want
^([^:\n]*spelling[^:\n]*)$
What about ^([^:\n]*spelling.*)$ ?
Adding .* allows any character (except newline) to be present after 'spelling'

Use REGEX to find line breaks within a wrapped content

The direct question: How can I use REGEX lookarounds to find instances of \r\n that occur between a set of characters (stand in open and closing tags), "[ and ]" with arbitrary characters and line breaks inside as well?
The situation:
I have a large database exported to tab- or comma-delimited text files that I'm trying to import into Excel. The problem is that some of the cells come from text areas that contain line breaks and are qualified by double quotes. When importing into Excel, these line breaks are treated as new rows. I cannot adjust how the file is exported. The data needs to be preserved, but the exact format doesn't, so I was planning on using some placeholder for the returns, such as ~
Here's a generic illustration of the format of my data:
column1rowA column2rowA column3rowA column4rowA
column1rowB column2rowB "column3rowB
3Bcont
3Bcont
3Bcont
" column4rowB
column1rowC column2rowC column4rowC
column1rowD column2rowD "column3rowD
3Dcont" column4rowD
My thought has been to try to select and replace line breaks within the quotes using REGEX search and replace in Notepad++. To try to make it simpler I have tried adding a character to the double quotes to help indicate whether it is an opening or closing quote:
"[column3rowB
3Bcont
3Bcont
3Bcont
]"
I am new to REGEX. The progress I've made (which isn't much) is:
(?<="[) missing some sort of wildcard \r\n(?=.*]")
Every iteration I've tried has also included every line break between the first "[ and last ]"
I would also appreciate any other approaches that solve the underlying problem
If you can use some tool other than Notepad++, you can use this regex (see my working example on regex101):
(?!\n(([^"]*"){2})*[^"]*$)\n
It uses a negative lookahead to find line breaks only when not followed by an even number of quotes. You could replace them with <br>, spaces, or whatever is appropriate.
Breakdown:
(?! ... ) This is the negative lookahead, necessary because it's zero-width. Anything matched by it will still be available to match again.
(([^"]*"){2})* This is the other key piece. It matches quotes in pairs, each quote preceded by any number of non-quote characters, so the total number of quotes matched is even.
[^"]*$ This is ensuring that there are no more quotes from there until the end of the string.
Caveat:
I couldn't get it to work in Notepad++ because it always recognizes $ as the end of a line, not the end of the entire string.
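One tool where $ can mean the end of the whole string is Python's re module, as long as re.MULTILINE is not set. The sketch below applies the pattern to made-up tab-separated data modelled on the question, and assumes the whole file fits in memory and that quotes are balanced. The replace_quoted_newlines name and the ~ placeholder are choices made for the example.

```python
import re

# Without re.MULTILINE, $ matches only at the end of the string (or just
# before a final newline), so the lookahead sees every remaining quote.
NEWLINE_IN_QUOTES = re.compile(r'(?!\n(([^"]*"){2})*[^"]*$)\n')


def replace_quoted_newlines(text, placeholder="~"):
    """Replace newlines inside double-quoted cells with a placeholder."""
    return NEWLINE_IN_QUOTES.sub(placeholder, text)


# Made-up sample: two rows, each with a multi-line quoted cell.
sample = 'a\tb\t"c\nc2\nc3\n"\td\ne\tf\t"g\nh"\ti\n'
print(replace_quoted_newlines(sample), end="")
```

Files with \r\n line endings would need the \r handled as well, and because every newline triggers a scan to the end of the text, this is slow on very large files.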
Great answer from Brian. I adapted it to treat both \n and \r as line breaks for my CSV file. The alternation has to be grouped, otherwise the lookahead only applies to part of the pattern:
(?!(?:\n|\r)(([^"]*"){2})*[^"]*$)(?:\n|\r)

How to read this command to remove all blanks at the end of a line

I happened across this page full of super useful and rather cryptic vim tips at http://rayninfo.co.uk/vimtips.html. I've tried a few of these and I understand what is happening enough to be able to parse it correctly in my head so that I can possibly recreate it later. One thing I'm having a hard time getting my head wrapped around, though, is the following two commands to remove all spaces from the end of every line:
:%s=  *$== : delete end of line blanks
:%s= \+$== : Same thing
I'm interpreting %s as string replacement on every line in the file, but after that I am getting lost in what looks like some gnarly variation of :s and regex. I'm used to seeing and using :s/regex/replacement. But the above is super confusing.
What do the above commands mean in English, step by step?
The regex delimiters don't have to be slashes, they can be other characters as well. This is handy if your search or replacement strings contain slashes. In this case I don't know why they use equal signs instead of slashes, but you can pretend that the equals are slashes:
:%s/  *$//
:%s/ \+$//
Does that make sense? The first one searches for a space followed by zero or more spaces, and the second one searches for one or more spaces. Each one is anchored at the end of the line with $. And then the replacement string is empty, so the spaces are deleted.
I understand your confusion, actually. If you look at :help :s you have to scroll down a few pages before you find this note:
*E146*
Instead of the '/' which surrounds the pattern and replacement string, you
can use any other character, but not an alphanumeric character, '\', '"' or
'|'. This is useful if you want to include a '/' in the search pattern or
replacement string. Example:
:s+/+//+
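For readers more used to regex outside Vim, the effect of both substitutions can be sketched in Python. This is only an analogy: the pattern " +" corresponds to Vim's " \+", and re.MULTILINE makes $ match at the end of each line, which Vim does by working line by line. The strip_trailing_spaces name is made up for the example.

```python
import re


def strip_trailing_spaces(text):
    """Delete runs of one or more spaces at the end of every line."""
    # With re.MULTILINE, $ matches before each newline and at the end of text.
    return re.sub(r" +$", "", text, flags=re.MULTILINE)


sample = "first  \nsecond\n  third \n"
print(repr(strip_trailing_spaces(sample)))
```

Leading spaces, as in the third line, are left alone because the pattern is anchored at the end of the line.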
I do not know vim syntax, but it looks to me like these are sed-style substitution operators. In sed, the / (in s/REGEX/REPLACEMENT/) can be uniformly replaced with any other single character. Here it appears to be =. So if you mentally replace = with /, you'll get
:%s/  *$//
:%s/ \+$//
which should make more sense to you.