I need a regex to repair lines split at column 80 - regex

Problem - Multiline, Semi-colon delimited file has been split at column 79 or 80 (not always the same for some strange reason).
It seems to me that a Regex would be the appropriate solution, so now I have two problems.
Lines are:
1sdf.............................mno[cr][lf]
pqr........xyz......................[cr][lf]
.....|.....|.....|.....|.....|.....|[cr][lf]
2sdf.............................mno[cr][lf]
pqr........xyz......................[cr][lf]
.....|.....|.....|.....|.....|.....|[cr][lf]
3sdf.............................mno[cr][lf]
pqr........xyz......................[cr][lf]
.....|.....|.....|.....|.....|.....|[cr][lf]
4sdf.............................mno[cr][lf]
pqr........xyz......................[cr][lf]
.....|.....|.....|.....|.....|.....|[cr][lf]
... 10000 rows ...
Where the pipe is a non-space whitespace character (possibly a tab)
I need:
1sdf.............................mnopqr........xyz......................[cr][lf]
2sdf.............................mnopqr........xyz......................[cr][lf]
3sdf.............................mnopqr........xyz......................[cr][lf]
4sdf.............................mnopqr........xyz......................[cr][lf]
I managed to get the job done with
Pass 1:
Replace ^\s*\r\n with \rxxx\n
// Replace Blank lines with \rxxx\n leaving
1sdf.............................mno[cr][lf]
pqr........xyz......................[cr][lf]
[cr]xxx[lf]
2sdf.............................mno[cr][lf]
pqr........xyz......................[cr][lf]
Pass 2:
Replace \r\n with [empty]
//leaving:
1sdf.............................mnopqr........xyz......................[cr]
xxx[lf]
2sdf.............................mnopqr........xyz......................
Pass 3:
Replace \rxxx\n with \r\n
//leaving:
1sdf.............................mnopqr........xyz......................[cr][lf]
2sdf.............................mnopqr........xyz......................
And the rest of the cleanup is trivial.
Is there any way of doing this in a single step? The output is from a common financial application, and I'd rather be able to fix the files myself rather than try and get many multiple clients to adjust their output.

In Notepad++ (using regular expression mode) you can use this:
Find what: \r\n(\s*\r\n)?
Replace with: \1
Then run "Replace All" exactly once. However, make sure you update to Notepad++ 6! Otherwise matching \r\n with a regular expression won't work in Notepad++.

Assuming that ^\s*\r\n match the line you want to remove as you said above, I believe you could do it with replacing \r\n\s*\r\n|\r\n by \r\n
It's my first regex, so if it doesn't work, don't be to harsh :-)
Good luck

Related

Notepad++ Regex Replace Makeshift Footnotes format With Proper Markdown format

In Word, I had to convert my footnotes to lines appearing at the end of each file to able to make changes in formatting. Some macro I found online was using braces and I ended up using also highlighting so I can see easily where my footnotes used to be. In this way, I have the following strings twice in my documents in the main text and also at the end of each document, sort of like makeshift endnotes.
=={1}==
.
.
.
=={99}==
I want to be able to match those instances in the text and convert them to proper markdown now. The problem is that the in-text format
[^1], [^2], etc.
will be different from what needs to come at the bottom with a semi-colon added:
[^1]:
etc.
So I'm guessing I'll have to live with replacing my old formatting with the new ones with semi-colons and deleting the semi-colons individually while I edit/clean up my text in the future. Without adding the semi-colon, it won't work.
My question is how to use the regex to match the two-digit strings with braces and equation marks.
This
==(\{d{1,2}\})==
did not work.
Also, as I am no pro, I would need the replacement as well. It probably will be
[^($1)]:
I reckon. Apparently, the equal mark doesn't have to be escaped.
Current format:
...some text...makeshift footnote in the format of
=={one- or two-digit number with no spaces in between}==
For example,
=={1}==
=={23}==
etc.
Desired result for all occurences recursively:
[^1]:
.
.
.
[^99]:
The markdown format is single square brackets with a caret and a number, also a semi-colon with the actual footnotes. Usually the number goes up to 42-45 maximum but it doesn't matter, the two digit regex is needed. As I said, the semi-colon will be needed in all instances.
Cheers
You have just some errors in your regex, you forget to escaped the d for digit, it should be \d and the capture group must not include the curly braces.
Use:
Ctrl+H
Find what: =={(\d{1,2})}==
Replace with: [^$1]:
TICK Wrap around
SELECT Regular expression
Replace all
Explanation:
=={ # literally
(\d{1,2}) # group 1, 1 or 2 digits
}== # literally
Screenshot (before):
Screenshot (after):

Remove columns from CSV

I don't know anything about Notepad++ Regex.
This is the data I have in my CSV:
6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389
How can I use a Regex to remove the 5th and 6th column? The numbers in the 5th and 6th column are variable in length.
Another problem is the User row can also contain a |, to make it even worse.
I can use a macro to fix this, but the file is a few millions lines long.
This is the final result I want to achieve:
6454345|User1-2ds3|62562012032|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|411b0fdf54fe29745897288c6ad699f7be30f389
I am open for suggestions on how to do this with another program, command line utility, either Linux or Windows.
Match \|[^|]+\|[^|]+(\|[^|]+$)
Repalce $1
Basically, Anchor to the end of the line, and remove columns [-1] and [-2] (I assume columns can't be empty. Replace + with * if they can)
If you need finer detail then that, I'd recommend writing a Java or Python script to manual parse and rewrite the file for you.
I've captured three groups and given them names. If you use a replace utility like sed or vimregex, you can replace remove with nothing. Or you can use a programming language to concatenate keep_before and keep_after for the desired result.
^(?<keep_before>(?:[^|]+\|){3})(?<remove>(?:[^|]+\|){2})(?<keep_after>.*)$
You may have to remove the group namings and use \1 etc. instead, depending on what environment you use.
Demo
From Notepad++ hit ctrl + h then enter the following in the dialog:
Find what: \|\d+\|\d+(\|[0-9a-z]+)$
Replace with: $1
Search mode: Regular Expression
Click replace and done.
Regex Explain:
\|\d+ : match 1st string that starts with | followed by number
\|\d+ : match 2nd string that starts with | followed by number
(\|[0-9a-z]+): match and capture the string after the 2nd number.
$ : This is will force regex search to match the end of the string.
Replacement:
$1 : replace the found string with whatever we have between the captured group which is whatever we have between the parentheses (\|[0-9a-z]+)

PCRE regex replace a text pattern within double quotes

In Notepad++ 6.5.1 I need to replace certain patterns within quote pairs. I want to save the replace as part of a macro, so all replacements need to happen in one step.
For example, in the following string, replace all 'a' characters within quote pairs with a dash, while leaving characters outside the quote pairs untouched:
Input: aa"bbabaavv"kdjhas"bbabaavv"x
Desired result: aa"bb-b--vv"kdjhas"bb-b--vv"x
Note that the quotes are matched up pairwise, such that the 'a' in kdjhas is untouched.
So far I have tried searching for (?:"[^"a]*|\G)\Ka([^"a]*) and replacing with -$1, but that simply replaces all the a's, with the result --"bb-b--vv"kdjh-s"bb-b--vv"x. I'm attempting PCRE regex that will let me recursively replace the quote-delimited text.
Edit: Quote marks within a quoted string are escaped with an extra quote, e.g. "". However, assume I will have already replaced these in a previous pass with a special character. Therefore a regex solution to this problem will not have to deal with escaped quotes.
It is hard to tell if this is possible as you've only provided one line of input text.
But assuming that input follows this pattern:
BOL|any text|string with two groups of a's|any text|string with two groups of a's|any text|EOL
aa "bbabaavv" kdjhas "bbabaavv" x
I was able to create this regexp search string:
^(.+?\".+?)([a]+)(.+?)([a]+)(.*?\")(.+?\".+?)([a]+)(.+?)([a]+)(.*?\".*)$
With this replace string:
\1-\3-\5\6-\8-\A
and it turn your input string from this:
aa"bbabaavv"kdjhas"bbabaavv"x
into this:
aa"bb-b-vv"kdjhas"bb-b-vv"x
Now naturally the search an replace will fail if the input varies from that pattern described as the search is looking for those four groups of a's inside the two groups of quoted strings.
Also I tested that regexp using Zeus which can create a regexp with more than 9 groups.
As you can see the regexp requires 10 groups.
I'm not familar with Notpad++ so I don't know if it supports that many groups.
If your data have variable number of occurrences of quoted strings, then it is not possible to perform replacements only via regex at least in its form offered by Notepad++.
To replace using regex, you would need to perform regex find in existing regex match. As far as I know such a functionality is not available in Notepad++ regexes.
Self-answer
I may have been reaching for the stars in trying to get Notepad++ to do this regex replace, but I think I found a workaround.
The actual task I was attempting involved creating a SQL Server VALUES list from an Excel spreadsheet, where I was copying and pasting selected cells into Notepad++. The delimiters are \t and \r\n. But, cells can have linefeeds too, which are delimited by ". So, I was going to replace these linefeeds with <br> (or something like it), so that
"line1
line2"
would become "line1<br>line2", before processing the actual end-of-row line feeds.
Having such parsing work reliably, especially when more than two lines were in a single cell, may have been too much to ask of Notepad++'s regex capability.
So I came up with a workaround that seems to be working:) Basically it starts with selecting a blank "dummy" column to the right of my column selection (which I can insert if I'm partially selecting from the middle). This will leave a trailing \t at the end of each row, which effectively sets these EOL's apart from ones that might exist with a text cell, freeing me from having to parse line feeds from a "..." field.
So I compiled a macro from the following steps, which seems to be working well:
replace ' with ''
replace \t\r\n with '\)\r\n, \('
replace \t with ', '
replace "" with ''
replace " with <blank>
replace ^ with \(' (cleanup - first row only)
replace ^, \('$ with <blank> (cleanup - last row only)
Example transformation:
from
line1 line 2
"line3
line3b
line3c" line 4
to
('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
which can now be easily modified into a SELECT statement:
SELECT *
FROM (VALUES('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
) t(a,b)

Unwrap text in Sublime Text 2

I'd like to unwrap lines so that I can turn them from lines with hard line-breaks to no line breaks.
Specifically, this means that contiguous runs of lines with non-whitespace should be joined together Essentially, any \n with no whitespace on either side should be replaced with a single space. Other linebreaks shouldn't get touched.
I feel like it ought to be a search-and-replace with a search string something like (?!\n)\n(?!\n) -> , but that doesn't work, as it doesn't match anything.
Is there an ST2 built-in command for this?
any \n with no whitespace on either side
(?<!\s)\n(?!\s)
other linebreaks shouldn't get touched.
(?<!(?:\s|\n))\n(?!\s)
Replace with ''
As #flow mentioned, there are built-ins for that task. Just select the lines you want to join and press Ctrl + J.
And your way should works too. Only you missed a bit. It should be (?<!\n)\n(?!\n)
The following solution works best for text copied from a console log with 80 columns. It only removes \n if the line touches the last column.
Find:
(.{80})\n
Replace:
$1

Notepad++ Regex for remove text in line after first two words

so I have a list that goes like this:
AudioQuest FLX-14/2
Abbey Road Cable Monitor Speaker Cable
and in in the first line I need to remove everything after first word and in the second one I need to remove everything in line after first TWO words. I figured out how to remove everything after first word, it's
.*?$
but I'm helpless with the second case. Help me out so I can toggle shortcuts on macros for both actions and process the list in the way semi-automatical way (Select and apply macros).
From what I can see, it seems the data is aligned. And from the example, only the first 10 characters are needed, the rest should be removed.
Find what: (.{10}).*
Replace with: $1
I'd do it in two passes..
Find:
"(^[a-z] [a-z]* )"
"(^[a-z] )"
Replace: "\1"