Java regexp to remove CRLF between quotes [duplicate] - regex

I have a string containing CSV lines. Some of its values contain CRLF characters, marked [CRLF] in the example below.
NOTE: "Line 1:" and "Line 2:" aren't part of the CSV; they are only labels for the discussion.
Line 1:
foo1,bar1,"john[CRLF]
dose[CRLF]
blah[CRLF]
blah",harry,potter[CRLF]
Line 2:
foo2,bar2,john,dose,blah,blah,harry,potter[CRLF]
Whenever a value in a line contains a CRLF, the whole value appears between quotes, as shown by line 1. I'm looking for a way to get rid of those CRLFs when they appear between quotes.
I tried regexps such as:
data.replaceAll("(,\".*)([\r\n]+|[\n\r]+)(.*\",)", "$1 $3");
Or just ([\r\n]+), \n+, etc., without success: the line still appears as if no replacement had been made.
EDIT:
Solution
Found the solution here:
String data = "\"Test Line wo line break\", \"Test Line \nwith line break\"\n\"Test Line2 wo line break\", \"Test Line2 \nwith line break\"\n";
StringBuffer result = new StringBuffer();
Matcher m = Pattern.compile("\"[^\"]*\"").matcher(data);
while (m.find()) {
// quoteReplacement keeps any $ or \ in the matched text literal
m.appendReplacement(result, Matcher.quoteReplacement(m.group().replaceAll("\\R+", "")));
}
m.appendTail(result);
System.out.println(result.toString());

With Java 9+ you can pass a function to Matcher#replaceAll and solve your problem using this code:
// pattern that captures quoted strings ignoring all escaped quotes
Pattern p = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");
String data1 = "\"Test Line wo line break\", \"Test Line \nwith line break\"\n\"Test Line2 wo line break\", \"Test Line2 \nwith line break\"\n";
// find every quoted string, then remove all line breaks from each
// matched substring; quoteReplacement keeps $ and \ in the result literal
String repl = p.matcher(data1).replaceAll(
    m -> Matcher.quoteReplacement(m.group().replaceAll("\\R+", ""))
);
System.out.println(repl);
Output:
"Test Line wo line break", "Test Line with line break"
"Test Line2 wo line break", "Test Line2 with line break"
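For comparison, here is a minimal Python sketch of the same idea: match each quoted field, then strip the line breaks inside the match. The names QUOTED_FIELD, LINE_BREAK and strip_breaks_in_quotes are made up for this illustration, and the pattern assumes quotes are never escaped inside a field.

```python
import re

# Matches one double-quoted field; assumes no escaped quotes inside it.
QUOTED_FIELD = re.compile(r'"[^"]*"')
# CRLF first so a Windows line break is treated as one unit.
LINE_BREAK = re.compile(r'\r\n|\r|\n')


def strip_breaks_in_quotes(data: str) -> str:
    """Remove line breaks that occur inside double-quoted fields only."""
    # A function replacement is used literally, so no escaping is needed.
    return QUOTED_FIELD.sub(lambda m: LINE_BREAK.sub('', m.group()), data)


csv_text = 'foo1,bar1,"john\r\ndose\r\nblah\r\nblah",harry,potter\r\n'
print(repr(strip_breaks_in_quotes(csv_text)))
```

The CRLF that ends the record is outside the quotes, so it is left alone.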

Related

Trying to remove newlines and carriage returns from a text. Why doesn't this code work? [duplicate]

I have a text in a textarea and I read it out using the .value attribute.
Now I would like to remove all line breaks (the character that is produced when you press Enter) from my text using .replace with a regular expression, but how do I indicate a line break in a regex?
If that is not possible, is there another way?
How you'd find a line break varies between operating system encodings. Windows uses \r\n, Linux and current macOS use \n, and classic Mac OS (before OS X) used \r.
I found this in JavaScript line breaks:
someText = someText.replace(/(\r\n|\n|\r)/gm, "");
That should remove all kinds of line breaks.
Line breaks (better: newlines) can be one of Carriage Return (CR, \r, on older Macs), Line Feed (LF, \n, on Unices incl. Linux) or CR followed by LF (\r\n, on WinDOS). (Contrary to another answer, this has nothing to do with character encoding.)
Therefore, the most efficient RegExp literal to match all variants is
/\r?\n|\r/
If you want to match all newlines in a string, use a global match:
/\r?\n|\r/g
Then proceed with the replace method as suggested in several other answers. (Probably you do not want to remove the newlines, but replace them with other whitespace, for example the space character, so that words remain intact.)
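As a rough Python sketch of the same pattern (the helper name replace_newlines and the sample string are assumptions made for this example), replacing each newline variant with a space keeps the words apart:

```python
import re

# \r\n is tried first via \r?\n, then a lone \r (classic Mac OS).
NEWLINE = re.compile(r'\r?\n|\r')


def replace_newlines(text: str, replacement: str = ' ') -> str:
    """Replace every CRLF, LF, or CR with `replacement`."""
    # The lambda makes the replacement literal (no backslash processing).
    return NEWLINE.sub(lambda _match: replacement, text)


sample = 'bar\r\nbaz\nfoo\rqux'
print(replace_newlines(sample))      # one space per line break
print(replace_newlines(sample, ''))  # line breaks removed entirely
```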
var str = " \n this is a string \n \n \n"
console.log(str);
console.log(str.trim());
String.trim() removes whitespace from the beginning and end of strings... including newlines.
const myString = " \n \n\n Hey! \n I'm a string!!! \n\n";
const trimmedString = myString.trim();
console.log(trimmedString);
// outputs: "Hey! \n I'm a string!!!"
Here's an example fiddle: http://jsfiddle.net/BLs8u/
NOTE! it only trims the beginning and end of the string, not line breaks or whitespace in the middle of the string.
You can use \n in a regex for newlines, and \r for carriage returns.
var str2 = str.replace(/\n|\r/g, "");
Different operating systems use different line endings, with varying mixtures of \n and \r. This regex will replace them all.
The simplest solution would be:
let str = '\t\n\r this \n \t \r is \r a \n test \t \r \n';
str = str.replace(/\s+/g, ' ').trim();
console.log(str); // logs: "this is a test"
.replace() with the /\s+/g regexp changes every run of whitespace characters in the string to a single space; .trim() then removes the leftover whitespace before and after the text.
The following are considered whitespace characters:
[ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
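The same collapse-then-trim approach can be sketched in Python. Both helper names below are invented for this example; the second variant relies on str.split() with no arguments, which splits on runs of whitespace and drops empty pieces.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Turn every run of whitespace into one space, then trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()


def collapse_whitespace_split(text: str) -> str:
    """Same result without a regex: split on whitespace runs and rejoin."""
    return ' '.join(text.split())


messy = '\t\n\r this \n \t \r is \r a \n test \t \r \n'
print(collapse_whitespace(messy))
print(collapse_whitespace_split(messy))
```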
If you want to remove all control characters, including CR and LF, you can use this:
myString.replace(/[^\x20-\x7E]/g, "")
It removes every character NOT within the ASCII hex range 0x20-0x7E, that is, everything except printable ASCII. Note that this also strips non-ASCII characters such as accented letters. Feel free to modify the hex range as needed.
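A small Python sketch shows what this character class keeps and what it drops (the helper name keep_printable_ascii and the sample text are assumptions for the illustration). Note that the accented letter disappears along with the control characters:

```python
import re

# Anything outside 0x20-0x7E: control characters and all non-ASCII text.
NON_PRINTABLE_ASCII = re.compile(r'[^\x20-\x7E]')


def keep_printable_ascii(text: str) -> str:
    """Drop every character that is not printable ASCII."""
    return NON_PRINTABLE_ASCII.sub('', text)


print(keep_printable_ascii('caf\u00e9\r\nmenu\t1'))
```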
This will replace each line break with an empty string.
someText = someText.replace(/(\r\n|\n|\r)/gm,"");
var str = "bar\r\nbaz\nfoo";
str.replace(/[\r\n]/g, '');
>> "barbazfoo"
To remove new line chars use this:
yourString.replace(/\r?\n?/g, '')
Then you can trim your string to remove leading and trailing spaces:
yourString.trim()
Use the function below to make your life easy.
The easiest approach is to use a regular expression to detect and replace newlines in the string. Here we call replace with an empty string as the replacement.
function remove_linebreaks(message) {
    return message.replace(/[\r\n]+/gm, "");
}
In the above expression, g and m are the global and multiline flags. Only g is needed here; m affects only ^ and $, which the pattern does not use.
I often use this regex for (HTML) strings inside JSON:
replace(/[\n\r\t\s]+/g, ' ')
The strings come from the HTML editor of a CMS or from i18n PHP files. The common scenarios are:
- lorem(.,)\nipsum
- lorem(.,)\n ipsum
- lorem(.,)\n
ipsum
- lorem ipsum
- lorem\n\nipsum
- ... many others with mixed whitespaces (\t\s) and even \r
The regex avoids ugly results like these:
lorem\nipsum => loremipsum
lorem,\nipsum => lorem,ipsum
lorem,\n\nipsum => lorem, ipsum
...
Surely not for all use cases and not the fastest one, but enough for most textareas and texts for websites or webapps.
The answer provided by PointedEars is everything most of us need. But by following Mathias Bynens's answer, I went on a Wikipedia trip and found this: https://en.wikipedia.org/wiki/Newline.
The following is a drop-in function that implements everything the above Wiki page considers "new line" at the time of this answer.
If something doesn't fit your case, just remove it. Also, if you're looking for performance this might not be it, but for a quick tool that does the job in any case, this should be useful.
// replaces all "new line" characters contained in `someString` with the given `replacementString`
const replaceNewLineChars = ((someString, replacementString = ``) => { // defaults to just removing
const LF = `\u{000a}`; // Line Feed (\n)
const VT = `\u{000b}`; // Vertical Tab
const FF = `\u{000c}`; // Form Feed
const CR = `\u{000d}`; // Carriage Return (\r)
const CRLF = `${CR}${LF}`; // (\r\n)
const NEL = `\u{0085}`; // Next Line
const LS = `\u{2028}`; // Line Separator
const PS = `\u{2029}`; // Paragraph Separator
const lineTerminators = [CRLF, LF, VT, FF, CR, NEL, LS, PS]; // all Unicode `lineTerminators`; CRLF first so it is replaced as one unit
let finalString = someString.normalize(`NFD`); // better safe than sorry? Or is it?
for (let lineTerminator of lineTerminators) {
if (finalString.includes(lineTerminator)) { // check if the string contains the current `lineTerminator`
let regex = new RegExp(lineTerminator.normalize(`NFD`), `gu`); // create the `regex` for the current `lineTerminator`
finalString = finalString.replace(regex, replacementString); // perform the replacement
};
};
return finalString.normalize(`NFC`); // return the `finalString` (without any Unicode `lineTerminators`)
});
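In Python, a comparable result can be sketched with str.splitlines(), which splits on \n, \r, \r\n, \v, \f, NEL, LS and PS. It also treats \x1c, \x1d and \x1e as boundaries, which goes slightly beyond the list above. The function name replace_line_boundaries is made up for this example.

```python
def replace_line_boundaries(text: str, replacement: str = '') -> str:
    """Join the pieces between Unicode line boundaries with `replacement`.

    CRLF counts as a single boundary, and a boundary at the very end of
    the text is dropped rather than replaced.
    """
    return replacement.join(text.splitlines())


mixed = 'a\r\nb\u2028c\x85d\x0be\x0cf\n'
print(replace_line_boundaries(mixed))
print(replace_line_boundaries(mixed, ' '))
```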
We can simply replace each newline with a space by using text.replace(/\n/g, " "):
const text = 'Students next year\n GO \n For Trip \n';
console.log("Original : ", text);
var removed_new_line = text.replace(/\n/g, " ");
console.log("New : ", removed_new_line);
A linebreak in regex is \n, so your script would be
var test = 'this\nis\na\ntest\nwith\newlines';
console.log(test.replace(/\n/g, ' '));
I am adding my answer as an addendum to the ones above.
I tried all the \n options and none of them worked. It turned out that my text was coming from the server with an escaped backslash (a literal backslash followed by n instead of a real newline), so I used this:
var fixedText = yourString.replace(/(\r\n|\n|\r|\\n)/gm, '');
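The difference between a real newline and the two-character sequence backslash + n is easy to miss, so here is a short Python sketch of the same pattern. The variable names are invented for the example.

```python
import re

# Real line breaks, plus the two-character text "\n" (backslash, then n).
BREAK_OR_ESCAPED = re.compile(r'\r\n|\n|\r|\\n')

# 'first', a literal backslash + n, 'second', a real newline, 'third'
from_server = 'first\\nsecond\nthird'

print(len('\\n'), len('\n'))  # the escaped form is two characters long
print(BREAK_OR_ESCAPED.sub('', from_server))
```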
Try the following code. It works on all platforms.
var break_for_winDOS = 'test\r\nwith\r\nline\r\nbreaks';
var break_for_linux = 'test\nwith\nline\nbreaks';
var break_for_older_mac = 'test\rwith\rline\rbreaks';
break_for_winDOS.replace(/(\r?\n|\r)/gm, ' ');
//output
'test with line breaks'
break_for_linux.replace(/(\r?\n|\r)/gm, ' ');
//output
'test with line breaks'
break_for_older_mac.replace(/(\r?\n|\r)/gm, ' ');
// Output
'test with line breaks'
If you would rather not use str.replace(/(\r\n|\n|\r)/gm, "") (for example, when the text also contains the HTML entity &nbsp;), you can use str.split('\n').join('') instead.
cheers
1st way:
const yourString = 'How are you \n I am fine \n Hah'; // Or textInput, something else
const newStringWithoutLineBreaks = yourString.replace(/(\r\n|\n|\r)/gm, "");
2nd way:
const yourString = 'How are you \n I am fine \n Hah'; // Or textInput, something else
const newStringWithoutLineBreaks = yourString.split('\n').join('');
On Mac, just use \n in the regexp to match line breaks, so the code will be string.replace(/\n/g, ''). (The trailing g means match all occurrences instead of just the first.)
On Windows, line breaks are \r\n.
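This will remove all newlines and spaces, along with every other non-word character (\W matches anything except letters, digits, and underscore), so use it only when that is really what you want: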
str = '\n \n\n\n\n\n\n\n\n\n\n\n\n \n \n \n \n Books\n \n \n \n \n\n\n'
console.log(str)
var output = str.replace(/\n|\r|\W/g, "");
console.log(output)
'Books'
const text = 'test\nwith\nline\nbreaks'
const textWithoutBreaks = text.split('\n').join(' ')

regex excluding newline

I have a simple word counter that works with one exception. It is splitting on the \n character.
The small sample text file is:
'''
A tree is a woody perennial plant,typically with branches.
I added this second line,just to add eleven more words.
'''
Line #1 has ten words, line #2 has eleven. Total word count = 21.
This code yields a count of 22 because it is including the \n character at the end of line #1:
import re
testfile = "d:\\python\\workbook\\words2.txt"
number_of_words = 0
with open(testfile, "r") as datafile:
    for line in datafile:
        number_of_words += len(re.split(",|\s", line))
print(number_of_words)
If I change my regex to: number_of_words += len(re.split(",|^\n|\s", line))
the word count (22) remains unchanged.
My question is: why is exclude newline [^\n] failing, or more broadly, what
should be the correct way to code my regex so that I exclude the trailing \n and have the above code arrive at the correct word total of 21.
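You can simply count the runs of word characters instead of splitting. re.split returns an extra empty string after the trailing \n, which is what inflates the count; re.findall does not:
number_of_words = 0
with open(testfile, "r") as datafile:
    for line in datafile:
        number_of_words += len(re.findall(r'\w+', line))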
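To see where the extra word comes from, here is a small self-contained sketch that uses the two sample lines directly instead of reading the file. It assumes the last line has no trailing newline, which is consistent with the reported total of 22.

```python
import re

lines = [
    "A tree is a woody perennial plant,typically with branches.\n",
    "I added this second line,just to add eleven more words.",
]

# Splitting on "," or whitespace leaves an empty string after the final \n.
print(re.split(r",|\s", lines[0])[-2:])

split_counts = [len(re.split(r",|\s", line)) for line in lines]
findall_counts = [len(re.findall(r"\w+", line)) for line in lines]

print(split_counts, sum(split_counts))
print(findall_counts, sum(findall_counts))
```

The attempted fix ",|^\n|\s" does not help because ^ outside square brackets is a start-of-string anchor, not a negation.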

how to skip multiple header lines using python

I am new to Python. I am trying to write a script that uses the numeric columns from a file which also contains a header. Here is an example of such a file:
#File_Version: 4
PROJECTED_COORDINATE_SYSTEM
#File_Version____________-> 4
#Master_Project_______->
#Coordinate_type_________-> 1
#Horizon_name____________->
sb+
#Horizon_attribute_______-> STRUCTURE
474457.83994 6761013.11978
474482.83750 6761012.77069
474507.83506 6761012.42160
474532.83262 6761012.07251
474557.83018 6761011.72342
474582.82774 6761011.37433
474607.82530 6761011.02524
I'd like to skip the header. Here is what I tried. It works, of course, if I know which characters will appear in the header, like "#" and "#". But how can I skip all lines containing any letter character?
import re

in_file1 = open(input_file1_short, 'r')
out_file1 = open(output_file1_short, "w")
lines = in_file1.readlines()
x = []
y = []
for line in lines:
    if "#" not in line and "#" not in line:
        strip_line = line.strip()
        replace_split = re.split(r'[ ,|;"\t]+', strip_line)
        x = (replace_split[0])
        y = (replace_split[1])
        out_file1.write("%s\t%s\n" % (str(x), str(y)))
in_file1.close()
out_file1.close()
Thank you very much!
I think you could use some built ins like this:
import string
for line in lines:
    if any([letter in line for letter in string.ascii_letters]):
        print("there is an ascii letter somewhere in this line")
This is only looking for ascii letters, however.
you could also:
import unicodedata
for line in lines:
    # on Python 2, decode byte strings to unicode before this check
    if any([unicodedata.category(letter).startswith('L') for letter in line]):
        print("there is a unicode letter somewhere in this line")
but only if I understand my unicode categories correctly....
Even cleaner (using suggestions from other answers. This works for both unicode lines and strings):
for line in lines:
    if any([letter.isalpha() for letter in line]):
        print("there is a letter somewhere in this line")
But, interestingly, if you do:
In [57]: u'\u2161'.isdecimal()
Out[57]: False
In [58]: u'\u2161'.isdigit()
Out[58]: False
In [59]: u'\u2161'.isalpha()
Out[59]: False
The unicode for the roman numeral "Two" is none of those,
but unicodedata.category(u'\u2161') does return 'Nl' indicating a numeric (and u'\u2161'.isnumeric() is True).
This checks the first character of each line and skips every line that doesn't start with a digit:
for line in lines:
    if line[0].isdigit():
        # we've got a line starting with a digit
        pass
Use a generator pipeline to filter your input stream.
This takes the lines from your original input lines, but first checks that there are no letters in the entire line.
from functools import reduce  # needed on Python 3; reduce is a builtin on Python 2
input_stream = (line for line in lines if
                reduce((lambda x, y: (not y.isalpha()) and x), line, True))
for line in input_stream:
    strip_line = ...
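Putting the pieces from these answers together, a minimal Python 3 sketch might look like the following. The names numeric_rows and FIELD_SEPARATOR are made up for the illustration, and the sample list stands in for the lines read from the file. One caveat: a value in scientific notation such as 1e5 contains a letter and would be skipped by this filter.

```python
import re

# Same separators as in the question's re.split call.
FIELD_SEPARATOR = re.compile(r'[ ,|;"\t]+')


def numeric_rows(lines):
    """Yield (x, y) string pairs from lines that contain no letters."""
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and any line with at least one letter.
        if not stripped or any(ch.isalpha() for ch in stripped):
            continue
        fields = FIELD_SEPARATOR.split(stripped)
        if len(fields) >= 2:
            yield fields[0], fields[1]


sample = [
    "#File_Version: 4\n",
    "PROJECTED_COORDINATE_SYSTEM\n",
    "#Horizon_name____________->\n",
    "sb+\n",
    "474457.83994 6761013.11978\n",
    "474482.83750 6761012.77069\n",
]
rows = list(numeric_rows(sample))
for x, y in rows:
    print("%s\t%s" % (x, y))
```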

C# Windows Forms: how to remove every line of a text file that contains $, ! or >?

My text file has many lines, and I need to remove some of them.
Here are some lines from the file:
adrian
jenkinson
adri
abby
abigal
adrian$
abbby%
jennefer!
jennef%
jenn^
So I need a solution: how can I use a regex to delete every line that contains a symbol character from the text file?
You can use the regex /[^A-Za-z\d\s]+/
It matches any character except:
Letters (A-Za-z)
Digits (\d)
Whitespace such as spaces, newlines, and tabs (\s)
var lines = text.Split('\n');
var newLines = (from line in lines
                where !line.Contains('$') && !line.Contains('!') && !line.Contains('>')
                select line).ToArray();
// string.Join also copes with the case where no lines are left
var newText = string.Join(Environment.NewLine, newLines);
newText now holds your text without those lines.
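The two answers filter differently: one drops only lines containing $, ! or >, while the regex flags any symbol. A short Python sketch makes the difference visible on the sample names. The function names and the shortened sample are assumptions for the illustration, not part of either answer.

```python
import re

# Any character that is not an ASCII letter, a digit, or whitespace.
SYMBOL = re.compile(r'[^A-Za-z\d\s]')


def drop_lines_containing(text: str, unwanted: str = '$!>') -> str:
    """Keep only lines that contain none of the characters in `unwanted`."""
    kept = [line for line in text.splitlines()
            if not any(ch in line for ch in unwanted)]
    return '\n'.join(kept)


def drop_lines_with_symbols(text: str) -> str:
    """Keep only lines made of letters, digits, and whitespace."""
    kept = [line for line in text.splitlines() if not SYMBOL.search(line)]
    return '\n'.join(kept)


names = 'adrian\njenkinson\nadrian$\nabbby%\njennefer!\njenn^'
print(drop_lines_containing(names))
print(drop_lines_with_symbols(names))
```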

Remove lines beginning with the same semi-colon delimited part with regex

I would like to use Notepad++ to remove lines with duplicate beginning of line. For example, I have a semi-colon separated file like below:
string at the beginning of line 1;second string line 1; final string line1;
string at the beginning of line 2;second string line 2; final string line2;
string at the beginning of line 1;second string line 3; final string line3;
string at the beginning of line 1;second string line 4; final string line4;
I would like to remove the third and fourth lines as they have the same first substring as the first line and get the following result:
string at the beginning of line 1;second string line 1; final string line1;
string at the beginning of line 2;second string line 2; final string line2;
You can try using the following regex:
^(([^;]*;).*\R(?:.*\R)*?)\2.*
Or
^(([^;]*;).*\R(?:.*\R)*?)\2.*(?:$|\R)
And replace with $1.
The idea is to capture the text at the beginning of a line up to and including the first semicolon into group 2 (([^;]*;)), then match the rest of that line (.*\R), then zero or more whole lines, as few as possible ((?:.*\R)*?), until a line that starts with the text captured in group 2 (\2), which is matched to its end. Everything before that duplicate line is captured in group 1, so replacing with $1 keeps it and drops the duplicate.
The drawback is that you will have to click Replace All several times until no match is found.
Thanks go to #nhahtdh, who noticed a bug in my previous regex, ^(([^;]*).*\R(?:.*\R)*?)\2.*, which can match more than intended.
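Outside Notepad++, the same cleanup can be done in a single pass by remembering which first fields have already been seen. This is only a sketch of that alternative, not the regex approach above; the function name drop_repeated_prefix is made up for the example.

```python
def drop_repeated_prefix(lines, delimiter=';'):
    """Keep the first line for each distinct first field, in order."""
    seen = set()
    kept = []
    for line in lines:
        # Text before the first delimiter identifies the line.
        key = line.split(delimiter, 1)[0]
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


rows = [
    "string at the beginning of line 1;second string line 1; final string line1;",
    "string at the beginning of line 2;second string line 2; final string line2;",
    "string at the beginning of line 1;second string line 3; final string line3;",
    "string at the beginning of line 1;second string line 4; final string line4;",
]
for row in drop_repeated_prefix(rows):
    print(row)
```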