RegEx to compare any character between 59 tabs - regex

I recently received a tab separated file that has 60 fields. Each field can have any character in it. The export I received also has linefeeds and carriage returns in some of the fields. This is causing the tab separated file to not import correctly. Is there a way to remove linebreaks and carriage returns if the line does not have 59 tabs on it? There may or may not be data between each tab.
Sample File
Lines 3, 4, and 5 are the issue I'm trying to fix.

Warning: I'm assuming that there are no tabs within a column's data. If there are, then you need something far more capable than what I have here.
The following works with the sample input provided:
First, replace all of the line breaks with a character that doesn't occur anywhere in your file. You can even use characters that you can't type with your keyboard.
Find what: (\r\n?|\n)
Replace with: \xB6
Then, match your 60-field rows and give them line-breaks (I'm going with Windows-style):
Find what: ^(([^\t]*\t){59}[^\t\xB6]*)\xB6
Replace with: $1\r\n
I'm making one huge assumption here: that column 60 never contains a line break. If this is false, then you're going to have some of column 60's data ending up in column 1 of the next record.
Now, if you don't like that paragraph symbol showing up in your data, you can either purge it or replace it with whatever you like:
Find what: \xB6
Replace with:
Explanation of matching patterns:
(\r\n?|\n) matches any of the three kinds of line breaks: a single \r, a single \n, or the Windows-style \r\n. Wikipedia has a whole article about this.
See http://regex101.com/r/iB6fK9 to explore the ^(([^\t]*\t){59}[^\t\xB6]*)\xB6 pattern.
I'm matching the beginning of the line with ^ at the start.
I have a group of zero or more characters that are not a tab, followed by a tab, that I match exactly 59 times with ([^\t]*\t){59}. That gets us the first 59 tab-separated columns. Only column 59 is captured by this group.
For column 60, I match zero or more characters that are neither a tab nor our special character with [^\t\xB6]*.
I capture the 60 columns with parentheses, but I leave our special character outside of the captured group so that it gets replaced with the \r\n that we insert with the $1\r\n replacement.
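The three find-and-replace steps above can also be sketched outside the editor. The following is a minimal Python sketch, not the exact recipe above: the function name repair_records and its n_fields parameter are hypothetical, and the leading ^ is dropped because re.sub scans the flattened text in one pass, so each match simply begins where the previous record ended. It relies on the same assumptions: no tabs inside a field, no line break in the last column, and a placeholder character that never occurs in the data.

```python
import re

MARK = "\xb6"  # placeholder; assumed never to occur in the data


def repair_records(text: str, n_fields: int = 60) -> str:
    """Remove line breaks embedded in fields of tab-separated records."""
    # Step 1: replace every line break (\r\n, \r or \n) with the placeholder.
    flat = re.sub(r"\r\n?|\n", MARK, text)
    # Step 2: a complete record is n_fields - 1 tab-terminated columns plus a
    # last column free of tabs and placeholders; give it a real line break.
    record = re.compile(
        r"((?:[^\t]*\t){%d}[^\t%s]*)%s" % (n_fields - 1, MARK, MARK)
    )
    restored = record.sub(lambda m: m.group(1) + "\r\n", flat)
    # Step 3: purge the placeholders that are left inside fields.
    return restored.replace(MARK, "")


# Three-field example: the second record is broken inside its second field.
sample = "a\tb\tc\r\nd\tbro\r\nken\tf\r\n"
fixed = repair_records(sample, n_fields=3)
```

With n_fields=3 the broken second record is rejoined into a single line; for the 60-field export the default n_fields=60 would apply.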

What I understand from your question is that you want to remove the Windows \r\n from your file. To do this, you can use the Replace dialog (Ctrl+H).
Under Search Mode, select Extended (\n, \r, ...), then enter \r\n in "Find what" and leave "Replace with" empty (or replace it with whatever you want).

I'd do:
Find what: ^((?:[^\t]*\t[^\t]*){1,58})[\r\n]+
Replace with: $1
This will replace the line break with nothing if there are fewer than 59 occurrences of the \t character in a line.
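As a cross-check on the regex approaches, the same repair can be done by counting tabs line by line. This is a hedged sketch of a different technique, not the regex above: merge_short_lines and n_tabs are hypothetical names, and it still assumes that the last column never contains a line break and that no field contains a tab.

```python
def merge_short_lines(lines, n_tabs=59):
    """Join physical lines until a record holds at least n_tabs tabs."""
    records = []
    buffer = ""
    for line in lines:
        # Drop the line terminator, then glue the fragment onto the record
        # that is currently being assembled.
        buffer += line.rstrip("\r\n")
        if buffer.count("\t") >= n_tabs:
            records.append(buffer)
            buffer = ""
    if buffer:
        # Trailing fragment that never reached the expected tab count.
        records.append(buffer)
    return records


# Three-field example (two tabs per record); the second record is split.
records = merge_short_lines(["a\tb\tc\n", "d\tbro\n", "ken\tf\n"], n_tabs=2)
```

Each returned string is one complete record without its line terminator, so the caller can write the records back out with whichever line ending the import expects.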

Related

Replace Certain Line Breaks with Equivalent of Pressing delete key on Keyboard NotePad++ Regex

I'm using Notepad++ Find and Replace, and I have a regex, [^|]\r, which finds the end of the line that starts with 8778.
8778|44523|0||TENNESSEE|ADMINISTRATION||ROLL 169 BATCH 8|1947-09-22|0|OnBase
See Also 15990TT|
I want to basically merge that line with the one below it, so it becomes this:
8778|44523|0||TENNESSEE|ADMINISTRATION||ROLL 169 BATCH 8|1947-09-22|0|OnBase See Also 15990TT|
I've tried replacing with a blank space, but it grabs the last character on that line (an e in this case) and replaces that with a space, so it produces
8778|44523|0||TENNESSEE|ADMINISTRATION||ROLL 169 BATCH 8|1947-09-22|0|OnBas
See Also 15990TT|
Is there any way to make it essentially merge the two lines?
\r only matches a carriage return symbol. To match a line break, you need \R, which matches any line break sequence.
To keep a part of a pattern after replacement, capture that part with parentheses, and then use a backreference to that group.
So you may use
([^|\r])\R
Replace with $1, or with $1 followed by a space if you need to append a space.
Details
([^|\r]) - Capturing group 1 ($1 is the backreference that refers to the group value from the replacement pattern): any char other than | and CR
\R - any line break char sequence, LF, CR or CRLF.
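For readers who want to try the capture-and-backreference idea outside Notepad++, here is a minimal Python sketch. Python's re module does not support \R, so the explicit alternation (?:\r\n|\r|\n) stands in for it; that substitution is an assumption of this sketch, not part of the answer. The replacement appends a space, as in the desired output.

```python
import re

text = (
    "8778|44523|0||TENNESSEE|ADMINISTRATION||ROLL 169 BATCH 8|1947-09-22|0|OnBase\r\n"
    "See Also 15990TT|\r\n"
)

# Group 1 keeps the last character of the line (anything but | or CR); the
# line break that follows it is replaced by a single space.
merged = re.sub(r"([^|\r])(?:\r\n|\r|\n)", r"\1 ", text)
```

Lines that already end in | keep their line break, because the character class excludes the pipe.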
The issue is you're using [^|] to match anything that's not a pipe character before the carriage return, which, on replacement, will remove that character (hence why you're losing an e).
If it's imperative that you match only carriage returns that follow non-pipe characters, capture the preceding character ([^|])\r$ and then put it back in the replacement using $1.
You're also missing a \n in your regex, which is why the replacement isn't concatenating the two lines. So your search should be ([^|])\r\n$ and your replace should be $1.
Find
(\r\n)+
For "Replace" - don't put anything in (not even a space)

Trying to replace a space ' ' at a specific position in every line of a .txt file using regex

I have a tab-delimited text file that I can import into Excel. There is one problem: a tab is missing between 2 sets of data that should go into 2 different columns when I import. I would simply like to replace the space character at position 40 with a tab character on every line in the file.
Example Data:
02/01 02/04 24123069033893031235753 CHESTER LAKE BUENA VIFL $86.16
I would like to:
02/01 02/04 24123069033893031235753 CHESTER LAKE BUENA VIFL $86.16
I have made many attempts at a regex replace in Sublime Text with no luck. I feel like this should have a simple solution, but I have been searching Stack Overflow for 2 hours and trying different solutions.
Here's an example way if the character is always the 40th character in a line:
use the regex ^(.{39})( ) with a replacement of $1\t and it will replace the 40th character (which has to be a space) with a tab.
Essentially the regex is just grabbing the first 39 character into the first capturing group ($1) and then the space afterwards. Then you replace that with the first capturing group and a tab to replace the space.
https://regex101.com/r/bhORsu/2
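The capturing-group version can be checked quickly in Python. This is only a sketch: the function name space_to_tab_at_40 and the two test lines are made up for illustration, and re.MULTILINE is used so that ^ anchors at the start of every line of a multi-line string.

```python
import re


def space_to_tab_at_40(text: str) -> str:
    # Group 1 holds the first 39 characters of a line; the space right after
    # it is the 40th character and is replaced by a tab.
    return re.sub(r"^(.{39}) ", r"\1\t", text, flags=re.MULTILINE)


line_with_space = "x" * 39 + " $86.16"   # 40th character is a space
line_without = "y" * 39 + "-$86.16"      # 40th character is not a space
converted = space_to_tab_at_40(line_with_space + "\n" + line_without)
```

Only the first line changes; a line whose 40th character is not a space is left alone.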
You could also avoid re-inserting the first capturing group by using a positive lookbehind instead, which lets you match only that one character. The regex would be (?<=^(.{39})) followed by a single literal space, and the replacement would be \t, but I don't know for sure whether Sublime supports positive lookbehind.
https://regex101.com/r/bhORsu/4

regex I am trying to find a comma at the end of a field which has no more fields after the comma

I have a comma-separated (CSV) file that I have opened in Notepad++.
I am using the Regular Expression option from the Search\Replace dialog.
E.g. data in the file
One,two,three
One,,three
One,two,
One,two,
In row 3 there is a comma at the last field with no space.
In row 4 there is a comma at the last field with a space.
I am trying to find the comma without the space at row 3
I have tried the following regular expression [,^\s$]|[,^[a-z$]]
It finds all of the commas.
It is interesting it even finds the comma without a space. I thought ^\s means not include a space. i.e. ^ means Not, \s means space.
I would just like to find the last field at the end of the record which has a comma without a space and without any characters.
What regular expression do I use for this?
Thanks!
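You can use a simple regex to check that a comma is immediately followed by the end of the line:
,(?! )$
The lookahead is redundant here, because $ already requires that nothing, not even a space, follows the comma, so ,$ alone matches the same commas.
Do not quantify the lookahead to cover multiple spaces or other whitespace: ,(?! *)$ and ,(?!\s*)$ can never match, because " *" and \s* also match the empty string, so the negative lookahead always fails.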
Just do this:
Find what: ,$
Make sure that "Regular expression" is checked.
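A quick way to convince yourself that ,$ is enough is to run it over the four sample rows. The sketch below uses Python's re module rather than Notepad++, and row 4 is written with its trailing space made explicit. It also checks that wrapping an optional-space pattern in a negative lookahead, as in ,(?! *)$, cannot match.

```python
import re

rows = ["One,two,three", "One,,three", "One,two,", "One,two, "]

# 1-based numbers of the rows whose very last character is a comma.
hits = [n for n, row in enumerate(rows, 1) if re.search(r",$", row)]

# " *" can match the empty string, so the negative lookahead always fails
# and this variant finds nothing, even in row 3.
never = re.search(r",(?! *)$", "One,two,")
```

Here hits contains only row 3, and never is None.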

regex in Notepad++ to remove blank lines

I have multiple HTML files, and some of them contain runs of blank lines. I need a regex that collapses every run of blank lines to a single blank line, so it removes anything beyond one blank line and leaves alone the places that have just one blank line or none (none meaning lines that have text in them).
It also needs to handle lines that are not totally blank, since some lines could contain spaces or tabs (characters that don't show). Those lines should be treated as blank and removed whenever there is more than one such line in a row.
Search for
^([ \t]*)\r?\n\s+$
and replace with
\1
Explanation:
^ # Start of line
([ \t]*) # Match any number of spaces or tabs, capture them in group 1
\r?\n # Match one linebreak
\s+ # Match any following whitespace
$ # until the last possible end of line.
\1 will then contain the first line of whitespace characters, so when you use that as the replacement string, only the first line of whitespace will be preserved (excluding the linebreak at the end).
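The goal from the question (collapse every run of blank or whitespace-only lines to a single blank line) can also be sketched in Python. Note that this is a different pattern from the one explained above, chosen because it behaves predictably with Python's re module; the function name collapse_blank_lines is hypothetical.

```python
import re


def collapse_blank_lines(text: str) -> str:
    # Two or more consecutive lines holding only spaces or tabs become one
    # empty line. Group 1 keeps the final line break of the run, so files
    # with CRLF line endings stay CRLF.
    return re.sub(r"(?:^[ \t]*(\r?\n)){2,}", r"\1", text, flags=re.MULTILINE)


collapsed = collapse_blank_lines("a\n\n \n\t\nb\n")
```

A single blank line between two lines of text is left untouched, since the pattern requires at least two blank lines in a row.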
This worked for me in Notepad++ v6.5.1 (Unicode) on Windows 7.
Search for: ^[ \t]*\r\n
Replace with: nothing, leave blank
Search mode: Regular expression.
search for (\r?\n(\t| )*){3,}, replace by \r\n\r\n, check "Regular expression" and ". matches newline".
Tested with Notepad++ 6.2
This will replace successive blank lines (whether or not they contain whitespace) with one new line.
Search for
(\s*\r?\n){3,}
replace with
\r\n
You can work out for yourself what you need to replace with:
\n\n, \n\r\n, \r\n\r\n, etc. You can also modify the regular expression ^([ \t]*)\r?\n\s+$ according to your needs.
I tested all of the above suggestions, and each one deleted either too little or too much: either no blank line was left where there had been at least one, or not enough was deleted (whitespace was left behind, etc.). Unfortunately I cannot write comments yet. I tested with 6.1.5, then updated to 6.2 and tested again. Depending on how many files there are, I would suggest using
Edit->Blank Operations->Trim trailing whitespace
Followed by Ctrl+A and
TextFX -> TextFX Edit -> Delete surplus blank lines
A macro I tried to record didn't work. There's even a macro just for removing trailing whitespace (Alt+Shift+S, see Settings | Shortcut Mapper... | Macros). There's also
Edit->Blank Operations->Remove unnecessary EOL and whitespace
but that deletes every EOL and puts everything in a single line.
In Notepad++ v8.4.7 there are the options:
Edit > Line Operations > Remove Empty Lines (Containing Blank characters)
or
Edit > Line Operations > Remove Empty Lines
So there is no need to use a regular expression for this. But this only works for one file at a time.
I searched for ^\r\n and clicked "Replace All" with nothing (empty) in the "Replace with" textbox.

How to remove newline from previous line based on current line's first character?

I exported some data in CSV format that has some line breaks within text fields. I can't get Excel to handle this correctly. I'd like to just edit the file to remove these line breaks.
A valid record begins with a number. I've tried putting \n^([^\d]) in the "Find what" box of Notepad++'s find/replace, to match any line beginning with a non-number and the preceding newline. It matches correctly. In the "replace with" box, I put a space followed by \1 to replace the newline with a space and leave the matched character. However, the replace isn't working at all, nothing gets changed.
What am I doing wrong?
Sample text:
123,0,1,"This is a single line comment","bob","jim"
124,0,1,"This is a multi line comment w/ newline.
This is the second line of the comment","ted","alfred"
125,0,1,"This is another single line comment","jim","bob"
I want to replace the newline just before "This is the second..." with a space so that the file looks like this:
123,0,1,"This is a single line comment","bob","jim"
124,0,1,"This is a multi line comment w/ newline. This is the second line of the comment","ted","alfred"
125,0,1,"This is another single line comment","jim","bob"
I figured it out. I used (\n)^(?!\d+,\d+) to match any newline followed by the beginning of a line that's not followed by at least one number, a comma, and at least one number. In "replace with" I just put a space.
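The same negative-lookahead idea can be sketched in Python for scripted clean-up. This is an illustration, not the exact Notepad++ pattern: the ^ is dropped because the position right after a line break is always the start of a line, \r? is added so CRLF files work too, and \Z is added (an assumption of this sketch) so the file's final line break is not turned into a space. The function name join_continuation_lines is hypothetical.

```python
import re


def join_continuation_lines(text: str) -> str:
    # Replace a line break with a space unless the next line starts like a
    # record ("digits,digits") or the break is the last thing in the text.
    return re.sub(r"\r?\n(?!\d+,\d+|\Z)", " ", text)


sample = (
    '123,0,1,"This is a single line comment","bob","jim"\n'
    '124,0,1,"This is a multi line comment w/ newline.\n'
    'This is the second line of the comment","ted","alfred"\n'
    '125,0,1,"This is another single line comment","jim","bob"\n'
)
joined = join_continuation_lines(sample)
```

Like the original pattern, this would wrongly keep a line break if a continuation line inside a comment happened to start with digits, a comma, and more digits.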
\n works in Notepad++ if you set the linefeeds to Unix (LF).
If it's in Windows (CR LF), then \r\n should work, or convert the returns to Unix (from the bottom bar).