REGEX - Count Number of Occurrences Ignoring Escaped Characters - regex

My data looks like this: [No Empty Lines]
Number;Lastname or Company;Firstname;City;Postcode;Amount;
1;Trump;Donald;Washington;12345;4;
2;Bush;George;Washington;54321;1;
3;Lloyds\; and Firends;;11111;2;
4;Schuhmacher\;Frenzen\;Fettel; and Co;Company;Anywhere;22222;3;
5;Best\;Friends;Company\;Co;Nowhere;33333;4;
I am trying to validate this csv file by looking for lines that do not have 6 entries per row. I am doing this by counting the number of ; per line. The only catch is \; (escaped semicolon) should not be counted.
This is how I am doing it right now:
STEP 1
Find= \\;
Replace= \s
STEP 2
Find= ^([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)$
This will select all correct rows.[ In above case: All rows except 3: and 4:]
PROBLEM is this requires changing the data using substitution. Is there a way to do this with only regex and NO substitution.
I am basically struggling with the part where I have to ignore this pattern \;.
EDIT 1: I am using SUBLIME text editor.
EDIT 2: I have updated the sample text file with \;

You don't need substitutions if you consider matching escaped characters individually:
(?m)^(?:[^\\;\r\n]*(?:\\.[^\\;\r\n]*)*;){6}$
Live demo
Breakdown:
(?m) Set multiline flag
^ Assert beginning of line
(?: Start of non-capturing group 1
[^\\;\r\n]* Match any thing except \ ; \r and \n
(?: Start of NCG 2
\\.[^\\;\r\n]* Match an escaped char and repeat matching recent character class
)* As many as possible
; Match a semi-colon
){6} Six times exactly
$ Assert end of line

Just use "|" in the regex not working?
e.g. ^([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)|\\;$

I don't know what language your are using, but personally i thing you better use a split() follow by count() function. this is available in many languages.
Hope that's helps

Related

Regex to disregard partial matches across lines / matching too much

I have three lines of tab-separated values:
SELL 2022-06-28 12:42:27 39.42 0.29 11.43180000 0.00003582
BUY 2022-06-28 12:27:22 39.30 0.10 3.93000000 0.00001233
_____2022-06-28 12:27:22 39.30 0.19 7.46700000 0.00002342
The first two have 'SELL' or 'BUY' as first value but the third one has not, hence a Tab mark where I wrote ______:
I would like to capture the following using Regex:
My expression ^(BUY|SELL).+?\r\n\t does not work as it gets me this:
I do know why outputs this - adding an lazy-maker '?' obviously won't help. I don't get lookarounds to work either, if they are the right means at all. I need something like 'Match \r\n\t only or \r\n(?:^\t) at the end of each line'.
The final goal is to make the three lines look at this at the end, so I will need to replace the match with capturing groups:
Can anyone point me to the right direction?
Ctrl+H
Find what: ^(BUY|SELL).+\R\K\t
Replace with: $1\t
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(BUY|SELL) # group 1, BUY or SELL
.+ # 1 or more any character but newline
\R # any kind of linebreak
\K # forget all we have seen until this position
\t # a tabulation
Replacement:
$1 # content of group 1
\t # a tabulation
Screenshot (before):
Screenshot (after):
You can use the following regex ((BUY|SELL)[^\n]+\n)\s+ and replace with \1\2.
Regex Match Explanation:
((BUY|SELL)[^\n]+\n): Group 1
(BUY|SELL): Group 2
BUY: sequence of characters "BUY" followed by a space
|: or
SELL: sequence of characters "SELL" followed by a space
[^\n]+: any character other than newline
\n: newline character
\s+: any space characters
Regex Replace Explanation:
\1: Reference to Group 1
\2: Reference to Group 2
Check the demo here. Tested on Notepad++ in a private environment too.
Note: Make sure to check the "Regular expression" checkbox.
Regex

Regular expressions in notepad++ (Search and Replace)

I have a list of thousands of records within a .txt document.
some of them look like these records
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record similar to this occurs I'd like to get rid of the last words/numbers/etc in the last quotation marks (like "Senior Champion", "Junior Champion" etc. There are many possibilities here)
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK I forgot the . (dot) sign however even if I do not have a . (dot) sign it would not work. Not sure if it has anything to do when using the + sign used more than once.
I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline*
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
Screen capture (before):
Screen capture (after):
The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so probably easier opening in excel or parsing using code as a CSV.
You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more using + as quantifier "[a-zA-Z0-9]+" - if you can have zero or more, then you can use *, or if it's any other count from m to n, then you can use {m,n} as before.
Not a character count problem but the final column can also have dots but the regex doesn't account for. You can just add . inside the square brackets and it will only match dot characters. It's usually used as a wildcard but it loses its special meaning inside a character class ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ for whitespace instead of +. As a benefit, \s will also match other whitespace like tabs, so you don't have to manually account for those. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
If you want to get rid of the last words/numbers/etc in the last quotation marks you could capture in a group what is before that and match the last quotation marks and everything between it to remove it using a negated character class.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline)
Note that {17,17} and {8,8} may also be written as {17} and {8} which in this case should be {12} and {5}
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 1+ times
"[^"\r\n]+"
In the replacement use group 1 $1
Regex demo
Before
After

Regex for text file

I have a text file with the following text:
andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar
Is there any way i can split the name and the version using regex. I want to use the results to make two columns 'Name' and 'Version' in excel.
So i want the results from regex to look like
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
So far I have used ^(?:.*-(?=\d)|\D+) to get the Version and -\d.*$ to get the Name separately. The problem with this is that when i do it for a large text file, the results from the two regex are not in the same order. So is there any way to get the results in the way I have mentioned above?
Ctrl+H
Find what: ^(.+?)[-_](\d.*)$
Replace with: $1\t$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(.+?) # group 1, 1 or more any character but newline, not greedy
[-_] # a dash or underscore
(\d.*) # group 2, a digit then 0 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
\t # a tabulation, you may replace with what you want
$2 # content of group 2
Result for given example:
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
Not quite sure what you meant for the problem in large file, and I believe the two regex you showed are doing opposite as what you said: first one should get you the name and second one should give you version.
Anyway, here is the assumption I have to guess what may make sense to you:
"Name" may follow by - or _, followed by version string.
"Version" string is something preceded by - or _, with some digit, followed by a dot or underscore, followed by some digit, and then any string.
If these assumption make sense, you may use
^(.+?)(?:[-_](\d+[._]\d+.*))?$
as your regex. Group 1 is will be the name, Group 2 will be the Version.
Demo in regex101: https://regex101.com/r/RnwMaw/3
Explanation of regex
^ start of line
(.+?) "Name" part, using reluctant match of
at least 1 character
(?: )? Optional group of "Version String", which
consists of:
[-_] - or _
( ) Followed by the "Version" , which is
\d+ at least 1 digit,
[._] then 1 dot or underscore,
\d+ then at least 1 digit,
.* then any string
$ end of line

Notepad++ Regex to find string on a line and remove duplicates of the exact string

Anyone know how to match a random string and then remove and re-occurences of the same string on each line in a file.
Essentially I have a file:
00101 blah 0000202 thisisasentencethisisasentence 99929
00102 blah 0000202 thisisasentenc1thisisasentenc1 999292
I want to remove the duplicate sentence so it returns:
00101 blah 0000202 thisisasentence 99929
00102 blah 0000202 thisisasentenc1 999292
The width isn't fixed or anything like that.
I think this is close but I don't understand regex well and it highlights everything in the file except the last line - correctly finding the duplicate but only once.
Removing duplicate strings/words(not lines) using RegEx(notepad++)
Note I can also use the following to identify which parts of each line is duplicated - it highlights the duplicated values (thisisasentencethisisasentence) but I don't know how to split it
(.{5,})\1
Any help would be appreciated,
thanks.
EDIT I can reformat to create comma delimited (to some extent): (note with this, there is a chance a comma exists in the duplicated string but don't worry about that)
00101,blah,0000202,thisisasentencethisisasentence,99929
00102,blah,0000202,thisisasentenc1thisisasentenc1,999292
You can use this pattern in notepad++ with an empty string as replacement:
^(?>\S+[^\S\n]+){3,}?(\S+?)\K\1(?!\S)
demo
pattern details:
^ # anchors for the start of the line (by default in notepad++)
(?> # atomic group: a column and the following space
\S+ # all that is not a white-space character
[^\S\n]+ # white-spaces except newlines
){3,}? # repeat 3 or more times (non-greedy) until you find
(\S+?)\K\1(?!\S) # a column with a duplicate
details of the last subpattern:
(\S+?) # capture one or more non-white characters
# (non-greedy: until \1(?!\S) succeeds)
\K # discard all previous characters from whole match result
\1 # back-reference to the capture group 1
(?!\S) # ensure that the end of the column is reached
Note: using {5,} instead of + in \S+? (so \S{5,}?) is a good idea, if you are sure that columns contain at least five characters.
You say you are happy with what (.{5,})\1 matches. So, just use $1 as the replacement value. It will automatically replace the repeated part and its copy with a single copy of the text.

Regular Expressions - Greedy but stop before a string match

I have the some data and i'd like to convert it into a table format.
Here's the input data
1- This is the 1st line with a
newline character
2- This is the 2nd line
Each line may contain multiple newline characters.
Output
<td>1- This the 1st line with
a new line character</td>
<td>2- This is the 2nd line</td>
I've tried the following
^(\d{1,3}-)[^\d]*
but it seems to match only till the digit 1 in 1st.
I'd like to be able to stop matching after i find another \d{1,3}\- in my string.
Any suggestions?
EDIT:
I'm using EditPad Lite.
This is for vim, and uses zerowidth positive-lookahead:
/^\d\{1,3\}-\_.*[\r\n]\(\d\{1,3\}-\)\#=
Steps:
/^\d\{1,3\}- 1 to 3 digits followed by -
\_.* any number of characters including newlines/linefeeds
[\r\n]\(\d\{1,3\}-\)\#= followed by a newline/linefeed ONLY if it is followed
by 1 to 3 digits followed by - (the first condition)
EDIT: This is how it would be in pcre/ruby:
/(\d{1,3}-.*?[\r\n])(?=(?:\d{1,3}-)|\Z)/m
Note you need a string ending with a newline to match the last entry.
SEARCH: ^\d+-.*(?:[\r\n]++(?!\d+-).*)*
REPLACE: <td>$0</td>
[\r\n]++ matches one or more carriage-returns or linefeeds, so you don't have to worry about whether the file use Unix (\n), DOS (\r\n), or older Mac (\r) line separators.
(?!\d+-) asserts that the first thing after the line separator is not another line number.
I used the possessive + in [\r\n]++ to make sure it matches the whole separator. Otherwise, if the separator is \r\n, [\r\n]+ could match the \r and (?!\d+-) could match the \n.
Tested in EditPad Pro, but it should work in Lite as well.
You did not specify a language (there are many regexp implementations), but in general, what you are looking for is called "positive lookahead", which lets you add patterns that will influence the match, but will not become part of it.
Search for lookahead in the documentation of whatever language you are using.
Edit: the following sample seems to work in vim.
:%s#\v(^\d+-\_.{-})\ze(\n\d+-|%$)#<td>\1</td>
Annotation below:
% - for all lines
s# - substitute the following (you can use any delimiter, and slash is most
common, but as that will require that we escape slashes in the command
I chose to use the number sign)
\v - very magic mode, let's us use less backslashes
( - start group for back referencing
^ - start of line
\d+ - one or more digits (as many as possible)
- - a literal dash!
\_. - any character, including a newline
{-} - zero or more of these (as few as possible)
) - end group
\ze - end match (anything beyond this point will not be included in the match)
( - start a new group
[\n\r] - newline (in any format - thanks Alan)
\d+ - one or more digits
- - a dash
| - or
%$ - end of file
) - end group
# - start substitute string
<td>\1</td> - a TD tag around the first matched group
(\d+-.+(\r|$)((?!^\d-).+(\r|$))?)
You can match only the separators and split on them. In C#, for example, it could be done like this:
string s = "1- This is the 1st line with a \r\nnewline character\r\n2- This is the 2nd line";
string ss = "<td>" + string.Join("</td>\r\n<td>", Regex.Split(s.Substring(3), "\r\n\\d{1,3}- ")) + "</td>";
MessageBox.Show(ss);
Would it be good for you to do it in 3 steps?
(these are perl regex):
Replace the first:
$input =~ s/^(\d{1,3})/<td>\1/;
Replace the rest
$input =~ s/\n(\d{1,3})/<\/td>\n<td>\1/gm;
Add the last:
$input .= '</td>';