Remove lines beginning with the same semi-colon delimited part with regex - regex

I would like to use Notepad++ to remove lines whose first semicolon-delimited field duplicates that of an earlier line. For example, I have a semicolon-separated file like the one below:
string at the beginning of line 1;second string line 1; final string line1;
string at the beginning of line 2;second string line 2; final string line2;
string at the beginning of line 1;second string line 3; final string line3;
string at the beginning of line 1;second string line 4; final string line4;
I would like to remove the third and fourth lines as they have the same first substring as the first line and get the following result:
string at the beginning of line 1;second string line 1; final string line1;
string at the beginning of line 2;second string line 2; final string line2;

You can try using the following regex:
^(([^;]*;).*\R(?:.*\R)*?)\2.*
Or
^(([^;]*;).*\R(?:.*\R)*?)\2.*(?:$|\R)
And replace with $1.
The idea is to capture, in group 2, the text at the beginning of a line that consists of non-semicolon characters up to and including the first ; ([^;]*;), then match the rest of that line (.*\R) and as few following lines as possible ((?:.*\R)*?) until a line that starts with the text captured in group 2. Everything before that duplicate line is captured in group 1, so replacing the match with $1 keeps the earlier lines and drops the duplicate.
The drawback is that matches cannot overlap, so a single pass does not catch every duplicate: you will have to click Replace All several times until no match is found.
Thanks go to @nhahtdh, who noticed a bug in my previous regex, ^(([^;]*).*\R(?:.*\R)*?)\2.*, which can overfire because group 2 may backtrack to a shorter prefix of the field.
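For readers who want to check the behaviour outside Notepad++, here is a minimal Python sketch of the same idea. It rests on a few assumptions that are not part of the answer: Python's re module has no \R, so \n stands in for it; the key group is written [^;\n]*; so that it cannot run across a line break; and the function names are made up for illustration. The loop plays the role of clicking Replace All until nothing matches. The second function is a different technique (a set of keys already seen, no regex) that does the job in one pass.

```python
import re

# Assumption: LF line endings, because Python's re has no \R.
# Group 1: everything up to the duplicate line; group 2: the key (first field plus ';').
DUPLICATE = re.compile(r'^(([^;\n]*;).*\n(?:.*\n)*?)\2.*(?:\n|$)', re.MULTILINE)


def drop_duplicates_regex(text):
    """Repeat the substitution until no match is left (like repeated Replace All)."""
    while True:
        text, count = DUPLICATE.subn(r'\1', text)
        if count == 0:
            return text


def drop_duplicates_single_pass(text):
    """Alternative without a regex: keep only the first line seen for each key."""
    seen = set()
    kept = []
    for line in text.splitlines(keepends=True):
        key = line.split(';', 1)[0]
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return ''.join(kept)


sample = (
    "string at the beginning of line 1;second string line 1; final string line1;\n"
    "string at the beginning of line 2;second string line 2; final string line2;\n"
    "string at the beginning of line 1;second string line 3; final string line3;\n"
    "string at the beginning of line 1;second string line 4; final string line4;\n"
)
expected = (
    "string at the beginning of line 1;second string line 1; final string line1;\n"
    "string at the beginning of line 2;second string line 2; final string line2;\n"
)
print(drop_duplicates_regex(sample), end='')
```

On the sample from the question, the regex version needs two substitution passes before a third pass finds nothing, which mirrors the repeated Replace All clicks.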

Related

Java regexp to remove CRLF between quotes [duplicate]

I have a string containing CSV lines. Some of its values contain CRLF characters, marked [CRLF] in the example below.
NOTE: the labels Line 1: and Line 2: aren't part of the CSV; they are only there for the discussion.
Line 1:
foo1,bar1,"john[CRLF]
dose[CRLF]
blah[CRLF]
blah",harry,potter[CRLF]
Line 2:
foo2,bar2,john,dose,blah,blah,harry,potter[CRLF]
Whenever a value in a line contains a CRLF, the whole value appears between quotes, as shown by line 1. I am looking for a way to get rid of those CRLFs when they appear between quotes.
I tried regexps such as:
data.replaceAll("(,\".*)([\r\n]+|[\n\r]+)(.*\",)", "$1 $3");
Or just ([\r\n]+), \n+, etc., without success: the lines still appear as if no replacement had been made.
EDIT:
Solution
Found the solution here:
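import java.util.regex.Matcher;
import java.util.regex.Pattern;
String data = "\"Test Line wo line break\", \"Test Line \nwith line break\"\n\"Test Line2 wo line break\", \"Test Line2 \nwith line break\"\n";
StringBuffer result = new StringBuffer();
// Match each double-quoted value
Matcher m = Pattern.compile("\"[^\"]*\"").matcher(data);
while (m.find()) {
    // Remove line breaks inside the quoted value; quoteReplacement keeps any $ or \ in it literal
    m.appendReplacement(result, Matcher.quoteReplacement(m.group().replaceAll("\\R+", "")));
}
m.appendTail(result);
System.out.println(result.toString());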
Using Java 9+, you can pass a function to Matcher#replaceAll and solve your problem with this code:
// pattern that matches quoted strings, allowing backslash-escaped quotes inside them
Pattern p = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");
String data1 = "\"Test Line wo line break\", \"Test Line \nwith line break\"\n\"Test Line2 wo line break\", \"Test Line2 \nwith line break\"\n";
// functional code to find all quoted strings and then remove all line
// breaks from the matched substrings; quoteReplacement keeps $ and \ literal
String repl = p.matcher(data1).replaceAll(
    m -> Matcher.quoteReplacement(m.group().replaceAll("\\R+", ""))
);
System.out.println(repl);
Output:
"Test Line wo line break", "Test Line with line break"
"Test Line2 wo line break", "Test Line2 with line break"
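The same two-step idea (find each quoted value, then strip the line breaks inside it) can be sketched in Python for comparison. This is only an illustration under stated assumptions: the function name is hypothetical, Python's re module has no \R so [\r\n]+ is used instead, and quoted values are assumed not to contain backslash-escaped quotes.

```python
import re

# A double-quoted value: a quote, any run of non-quote characters, a quote.
QUOTED = re.compile(r'"[^"]*"')


def strip_breaks_in_quotes(data):
    """Remove CR/LF characters, but only inside double-quoted values."""
    return QUOTED.sub(lambda m: re.sub(r'[\r\n]+', '', m.group()), data)


data = (
    '"Test Line wo line break", "Test Line \nwith line break"\n'
    '"Test Line2 wo line break", "Test Line2 \nwith line break"\n'
)
print(strip_breaks_in_quotes(data), end='')

# The row from the question, with real CRLF pairs in place of [CRLF].
csv_row = 'foo1,bar1,"john\r\ndose\r\nblah\r\nblah",harry,potter\r\n'
cleaned_row = strip_breaks_in_quotes(csv_row)
```

Line breaks outside the quotes, such as the CRLF that ends the record, are left untouched because they never fall inside a match.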

CSV Regex skipping first comma

I am using a regex for CSV processing where data may or may not be in quotes. But if the first column is empty, so that the row starts with a comma, that column is skipped.
Here is the regex I am using:
(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?|)(?=$|,)
Now the example data I am using is:
,"data",moredata,"Data"
This should produce 4 matches ["","data","moredata","Data"], but the empty first column is always skipped. It works if the first column is quoted or is not blank, but if it is empty with no quotes, it is ignored.
Here is the sample code I am using for testing purposes, written in Dart:
void main() {
  String delimiter = ",";
  String rawRow = ',,"data",moredata,"Data"';
  RegExp exp = new RegExp(r'(?:'+ delimiter + r'"|^")(^,|""|[\w\W]*?)(?="'+ delimiter + r'|"$)|(?:'+ delimiter + '(?!")|^(?!"))([^'+ delimiter + r']*?)(?=$|'+ delimiter + r')');
  Iterable<Match> matches = exp.allMatches(rawRow.replaceAll("\n","").replaceAll("\r","").trim());
  List<String> row = new List();
  matches.forEach((Match m) {
    //This checks to see which match group it found the item in.
    String cellValue;
    if (m.group(2) != null) {
      //Data found without speech marks
      cellValue = m.group(2);
    } else if (m.group(1) != null) {
      //Data found with speech marks (so it removes escaped quotes)
      cellValue = m.group(1).replaceAll('""', '"');
    } else {
      //Anything left
      cellValue = m.group(0).replaceAll('""', '"');
    }
    row.add(cellValue);
  });
  print(row.toString());
}
Investigating your expression
(,"|^")
(""|[\w\W]*?)
(?=",|"$)
|
(,(?!")|^(?!"))
([^,]*?|)
(?=$|,)
(,"|^")(""|[\w\W]*?)(?=",|"$) This part matches quoted strings, which seems to work for you.
Going through the second part, (,(?!")|^(?!"))([^,]*?|)(?=$|,):
(,(?!")|^(?!")) a comma not followed by ", OR the start of the line not followed by ".
([^,]*?|) zero or more non-comma characters, non-greedy, or nothing. The trailing | is redundant, since *? can already match an empty string.
(?=$|,) followed by the end of the line or a comma.
In CSV, the line ,,,3,4,5 should give 6 matches, but the above only gets 5.
You could add (^(?=,)) at the beginning of the second part, the part that matches unquoted fields.
The second part with the start-of-line match added, and with the groups made non-capturing:
(?:^(?=,))|(?:,(?!")|^(?!"))(?:[^,]*?)(?=$|,)
Complete: (?:,"|^")(?:""|[\w\W]*?)(?=",|"$)|(?:^(?=,))|(?:,(?!")|^(?!"))(?:[^,]*?)(?=$|,)
Here is another that might work:
(?:(?:"(?:[^"]|"")*"|(?<=,)[^,]*(?=,))|^[^,]+|^(?=,)|[^,]+$|(?<=,)$)
How that works I described here: Build CSV parser using regex
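The 5-versus-6 claim for ,,,3,4,5 can be checked with a short Python sketch. It is a rough check under assumptions, not a statement about Dart: the counts below rely on how Python 3.7+ continues after an empty match, and other engines (including Dart's) may advance differently, so they should be rechecked there. The csv module from Python's standard library is used only as an independent reference for the expected number of fields; the variable names are illustrative.

```python
import csv
import re

# The asker's expression and the "Complete" expression from the answer above.
ORIGINAL = re.compile(
    r'(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?|)(?=$|,)'
)
PATCHED = re.compile(
    r'(?:,"|^")(?:""|[\w\W]*?)(?=",|"$)|(?:^(?=,))|(?:,(?!")|^(?!"))(?:[^,]*?)(?=$|,)'
)

row = ",,,3,4,5"

# Count the matches each expression produces on the row (Python 3.7+ semantics).
original_count = sum(1 for _ in ORIGINAL.finditer(row))
patched_count = sum(1 for _ in PATCHED.finditer(row))

# Reference parse with the standard csv module.
fields = next(csv.reader([row]))

print(original_count, patched_count, len(fields))
```

If a full parser is acceptable, a dedicated CSV reader such as the one used for the reference parse avoids the empty-first-column edge case altogether.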

regex excluding newline

I have a simple word counter that works with one exception. It is splitting on the \n character.
The small sample text file is:
'''
A tree is a woody perennial plant,typically with branches.
I added this second line,just to add eleven more words.
'''
Line #1 has ten words, line #2 has eleven. Total word count = 21.
This code yields a count of 22 because it is including the \n character at the end of line #1:
import re
testfile = "d:\\python\\workbook\\words2.txt"
number_of_words = 0
with open(testfile, "r") as datafile:
    for line in datafile:
        number_of_words += len(re.split(r",|\s", line))
print(number_of_words)
If I change my regex to: number_of_words += len(re.split(",|^\n|\s", line))
the word count (22) remains unchanged.
My question is: why is excluding the newline with [^\n] failing, or more broadly, what is the correct way to write my regex so that the trailing \n is excluded and the above code arrives at the correct word total of 21?
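re.split returns an empty string after a trailing delimiter such as \n, and len() counts it as a word. You can simply count the words themselves instead:
number_of_words = 0
with open(testfile, "r") as datafile:
    for line in datafile:
        number_of_words += len(re.findall(r'\w+', line))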
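To see where the extra word comes from, here is a small self-contained sketch that uses the two sample lines directly instead of reading the file. It assumes the first line ends with \n and the second does not, which is consistent with the reported total of 22.

```python
import re

# The two sample lines; only the first is assumed to end with a newline.
lines = [
    "A tree is a woody perennial plant,typically with branches.\n",
    "I added this second line,just to add eleven more words.",
]

# Splitting on delimiters leaves an empty string after the trailing \n.
first_line_parts = re.split(r",|\s", lines[0])

# Total when counting split pieces versus counting \w+ matches.
split_total = sum(len(re.split(r",|\s", line)) for line in lines)
findall_total = sum(len(re.findall(r"\w+", line)) for line in lines)

print(first_line_parts[-1] == "")   # the spurious empty "word"
print(split_total, findall_total)
```

Note that \w+ counts runs of letters, digits and underscores, so a word such as "don't" would count as two; whether that matters depends on the text being counted.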

Regex_match doesn't match in a file

I want to recognize some lines in a text file using regex, but regex_match doesn't match any line, even if I use regex patron(".*"):
string dirin = "/home/user/in.srt";
string dirout = "/home/user/out.srt";
ifstream in(dirin.c_str());
ofstream out(dirout.c_str());
string line;
// regex patron("(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})\\s-->\\s(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})");
regex patron(".*");
smatch m;
while (getline(in, line)) {
    if (regex_match(line, m, patron)) {
        out << "ok";
    }
    out << line;
}
in.close();
out.close();
The code always writes the line to the out.srt file, but never the string "ok" from inside the if (regex_match(line, m, patron)) block.
I'm testing it with the following lines:
1
00:01:00,708 --> 00:01:01,800
You look at that river
2
00:01:02,977 --> 00:01:04,706
gently flowing by.
3
00:01:06,213 --> 00:01:08,238
You notice the leaves
Note that if the file uses CRLF line endings (common for .srt files) and is read on a system whose native line ending is LF, getline() returns each line with a trailing carriage return (CR), and the ECMAScript . pattern does not match CR because it treats it as a line terminator.
regex_match requires that the whole string matches the pattern.
Thus, you need to account for an optional carriage return at the end of the pattern. You can do that by appending \r? or \s* to the pattern:
regex patron("(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})\\s-->\\s(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})\\s*");
or
regex patron(".*\\s*");
Also, consider using raw string literals if your C++ version allows it:
regex patron(R"((\d{2}):(\d{2}):(\d{2}),(\d{3})\s-->\s(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*)");
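The whole-string requirement can be illustrated with a short Python sketch, where re.fullmatch plays a role similar to regex_match. One difference should be kept in mind: in Python, . does match \r (it only excludes \n by default), so the ".*" case from the question does not reproduce there; the sketch therefore uses the timestamp pattern. The variable names are illustrative.

```python
import re

TIMESTAMP = (
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s-->\s"
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

# A line as it would look after reading a CRLF file and splitting on LF only.
line = "00:01:00,708 --> 00:01:01,800\r"

# Fails: the trailing CR is not covered by the pattern.
strict = re.fullmatch(TIMESTAMP, line)

# Succeeds: optional trailing whitespace (including CR) is allowed.
tolerant = re.fullmatch(TIMESTAMP + r"\s*", line)

# Alternative: strip the CR before matching.
stripped = re.fullmatch(TIMESTAMP, line.rstrip("\r"))

print(strict is None, tolerant is not None, stripped is not None)
```

Stripping the carriage return before matching keeps the pattern itself unchanged, which may be preferable when the same pattern is reused on input that has already been cleaned.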

regex for String contains comma semicolon and Carriage return

I need to check if an input string contains a semicolon, a comma, a carriage return, or all of them:
Dim Input As String = "1298-673-4192,A08Z-931-468A;"
Dim pattern as string ="^[a-zA-Z0-9 \r , ; ]*$"
Dim regex As New Regex(pattern)
regex.Ismatch(Input)
I get False, even though a comma, a semicolon and a carriage return are present in the string.
This will work:
Dim Input As String = "1298-673-4192,A08Z-931-468A;"
Dim pattern as string ="[\r,;]+"
Dim regex As New Regex(pattern)
regex.Ismatch(Input)
I removed ^ and $ because they make the pattern check the entire string (^ means beginning of string, while $ means end).
I changed * to + because * checks for 0 or more while + checks for 1 or more.
I removed the spaces because they were not necessary.
I removed the a-zA-Z0-9 because that will also match all alphanumeric characters.
Before (With string begin/end characters, spaces, alphanumerical and *).
After (Without all the stuff that causes it to break).
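The difference between the two patterns can be reproduced with a small Python sketch (the variable names are illustrative, and Python's re stands in for the .NET Regex class here). The anchored pattern has to account for every character of the input, so the hyphens in the sample make it fail, while the unanchored character class only needs to find one of the delimiters somewhere in the string. Note that the sample input as written contains a comma and a semicolon but no carriage return.

```python
import re

text = "1298-673-4192,A08Z-931-468A;"

# Original pattern: the whole string must consist of the listed characters.
anchored = re.compile(r"^[a-zA-Z0-9 \r , ; ]*$")

# Revised pattern: at least one delimiter anywhere in the string.
contains = re.compile(r"[\r,;]+")

whole_string_ok = anchored.search(text) is not None
has_delimiter = contains.search(text) is not None

# Which of the delimiters actually occur in the input.
found = sorted(set(re.findall(r"[\r,;]", text)))

print(whole_string_ok, has_delimiter, found)
```

Adding the hyphen to the anchored character class would make the original pattern match this input too, but it would then validate the whole string rather than test for the presence of a delimiter.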