Match the line after delimiter and shorter than 9 - regex

Regex and I never get along.
Every day I get an email from my supervisor
that contains 1000+ lines that need to be sorted.
Each line looks like:
name|code
The goal is to separate them into 2 files.
example :
Garry Cooper|abc123h1n1
Andy Morray|abcd
John Travolta|123567
Simon Person | abcd1
What I do:
I look at what comes after the | character
and remove the whole line
if the code contains numbers only,
or contains letters only,
or is shorter than 9 characters.
The example list becomes:
Garry Cooper|abc123h1n1
I do these steps daily, and sometimes I get 2000 lines :/ a real pain.
I used to work with regex in Notepad++,
but I can't find the pattern for this one.
I am also not too bad at PHP.
Help me please.
UPDATE 01 :
regex found: (?i)^[^|]*\|\h*[a-z\d]{0,8}$\R?
Current question:
Writing a small PHP script, or maybe reusable classes.
Interface:
Submit the data from a text box (HTML form) or from a txt file.
Processing:
Lines that match the regex go into a downloadable txt file,
the others into a second file.
Output:
2 links to the files.
Thank you all in advance for your help.

If you just use greedy dot matching with .*, the length is not checked. You can check it with a limiting quantifier: to match just 0 to 8 symbols, use {0,8}. Everything except | can be matched with the negated character class [^|]*.
Use
(?i)^[^|]*\|\h*[a-z\d]{0,8}$\R?
See regex demo (note that gm flags are used by default in Notepad++ regex-based search and replace).
Explanation:
^ - start of a line
[^|]* - zero or more symbols other than a pipe
\| - a literal pipe symbol
\h* - zero or more horizontal whitespace
[a-z\d]{0,8} - letters a to z and A to Z (due to (?i) case insensitive modifier) or digits, zero to 8 occurrences
$ - end of line, and
\R? - one or zero (optional) line break.
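The question asks for a PHP script; as a rough illustration of the same splitting step, here is a minimal Python sketch. It assumes Python's re module, which has no \h or \R, so [ \t]* stands in for \h* and the text is split into lines beforehand. The name partition_lines is hypothetical. Like the regex above, it only applies the "shorter than 9" rule, not the digits-only or letters-only rules for longer codes.

```python
import re

# Same idea as the Notepad++ pattern above, adapted to Python's re module:
# [ \t]* replaces \h*, and fullmatch() replaces the ^ ... $ anchors.
SHORT_CODE = re.compile(r"[^|]*\|[ \t]*[a-z\d]{0,8}", re.IGNORECASE)


def partition_lines(text):
    """Return (matched, others): lines with a code under 9 chars, and the rest."""
    matched, others = [], []
    for line in text.splitlines():
        if SHORT_CODE.fullmatch(line):
            matched.append(line)
        else:
            others.append(line)
    return matched, others


sample = """Garry Cooper|abc123h1n1
Andy Morray|abcd
John Travolta|123567
Simon Person | abcd1"""

matched, others = partition_lines(sample)
print(others)   # lines to keep
print(matched)  # lines the regex would remove
```

Each of the two lists could then be written to its own txt file.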

Related

Remove all but the first four characters on each line

So I have a text file in Vscode that contains several lines of text like so:
1801: Joseph Marie Jacquard, a French merchant and inventor invent a loom that uses punched wooden cards to automatically weave fabric designs. Early computers would use similar punch cards.
So now I'm trying to isolate the year number/the first 4 characters of each line. I'm new to regex, and I know how to get the first 4 characters (I used ^.{4}) but how would I be able to find all EXCEPT for the first 4 characters so that I can replace them with nothing and be left with just the year numbers?
Find: (?<=^\d{4}).*
Replace: with nothing
regex101 Demo
(?<=^\d{4}) asserts that the line starts (^) with 4 digits; (?<=...) is a positive lookbehind
.* matches everything else up to the line terminator, so the : is included in the match
Because a lookbehind is never part of the match, the 4 digits you want to keep are not matched, so you don't have to worry about capture groups or a replacement.
You can use
Find:       ^(.{4}).+
Replace: $1
See the regex demo. Details:
^ - start of a line (in Visual Studio Code, ^ matches any line start)
(.{4}) - capturing group #1 that captures any four chars other than line break chars
.+ - one or more chars other than line break chars, as many as possible.
The $1 backreference in the replacement pattern replaces the match with Group 1 value.
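For comparison, both approaches can be tried outside the editor. This is a small Python sketch with made-up sample lines in the same "YYYY: text" shape; Python writes the backreference as \1 rather than $1, and needs re.MULTILINE for ^ to match at every line start.

```python
import re

# Made-up sample lines in the same "YYYY: text" shape as the question.
text = "1801: A loom that uses punched cards.\n1843: Another sample entry."

# Capture-group approach: keep group 1 (the first four chars), drop the rest.
by_group = re.sub(r"^(.{4}).+", r"\1", text, flags=re.MULTILINE)

# Lookbehind approach: match only what follows four leading digits.
by_lookbehind = re.sub(r"(?<=^\d{4}).*", "", text, flags=re.MULTILINE)

print(by_group)
print(by_lookbehind)
```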

Sublime & Regex: How to find all lines that are 3 letters or less, excluding certain lines?

I have a document that is 10,000 lines long.
I would like to remove all lines that are 3 letters or less, excluding any lines that start with a § symbol or excluding any lines that are all in caps.
Example:
Before removal:
§day
DOG
Happy
Monday
Now
Yes
Sunday
§new day.txt
DIY
Leg
Books
Car
Home
After removal:
§day
DOG
Happy
Monday
Sunday
§new day.txt
DIY
Books
Home
DOG & DIY are not affected as they are all capitals.
Lines that start with § are also not affected.
My attempts
I know that this code can be used to make Regex ignore all lines that are in capitals and all lines that start with a § (In the example, the code is searching for many or north or one).
(^(?:(?:§.*|[^[:alpha:]\n\r]*[[:upper:]]+(?:[^[:alpha:]\n\r]+[[:upper:]]+)*[^[:alpha:]\n\r]*))$|(?i:\b(?:many|north|one)\s+of\b))|(?i:\bof\b)
I also know that this code can be used to find all words that are 3 letters or less
'^.{1,3}$'
Is there any way I can combine them?
I tried replacing many|north|one with '^.{1,3}$' but it didn’t work.
You can search using this regex:
^(?!(?:§|[A-Z]+$)).{0,3}(?:[\r\n]+|\z)
and replace with an empty string.
Make sure mode m or MULTILINE is enabled in your regex.
RegEx Demo
RegEx Details:
^: Start
(?!(?:§|[A-Z]+$)): Negative lookahead to assert a failure if line starts with § or contains only uppercase letters
.{0,3}(?:[\r\n]+|\z): Match any character 0 to 3 times followed by a 1 or more line breaks or end of file
I suggest using
(?-i)^(?!§|[A-Z]+$).{1,3}$\R?
See the regex demo. Details:
(?-i) - turn on case sensitivity (or omit it and turn the Aa option on)
^ - start of a line
(?!§|[A-Z]+$) - a negative lookahead: the line must not start with § and must not consist only of uppercase ASCII letters
.{1,3} - 1 to 3 chars
$ - end of a line
\R? - an optional line break sequence.
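As a quick check of this pattern against the sample from the question, here is a Python sketch. It assumes \n line endings, because Python's re has no \R; matching is case sensitive by default, so (?-i) is not needed.

```python
import re

# \n? stands in for \R? (assumes Unix line endings).
SHORT_LINE = re.compile(r"^(?!§|[A-Z]+$).{1,3}$\n?", re.MULTILINE)

before = "\n".join([
    "§day", "DOG", "Happy", "Monday", "Now", "Yes", "Sunday",
    "§new day.txt", "DIY", "Leg", "Books", "Car", "Home",
])

# Remove every line of 1 to 3 chars that is neither §-prefixed nor all caps.
after = SHORT_LINE.sub("", before)
print(after)
```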

REGEX - Count Number of Occurrences Ignoring Escaped Characters

My data looks like this: [No Empty Lines]
Number;Lastname or Company;Firstname;City;Postcode;Amount;
1;Trump;Donald;Washington;12345;4;
2;Bush;George;Washington;54321;1;
3;Lloyds\; and Firends;;11111;2;
4;Schuhmacher\;Frenzen\;Fettel; and Co;Company;Anywhere;22222;3;
5;Best\;Friends;Company\;Co;Nowhere;33333;4;
I am trying to validate this csv file by looking for lines that do not have 6 entries per row. I am doing this by counting the number of ; per line. The only catch is \; (escaped semicolon) should not be counted.
This is how I am doing it right now:
STEP 1
Find= \\;
Replace= \s
STEP 2
Find= ^([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)$
This selects all correct rows (in the above case, all rows except 3 and 4).
The PROBLEM is that this requires changing the data with a substitution. Is there a way to do this with regex only and NO substitution?
I am basically struggling with the part where I have to ignore the pattern \;.
EDIT 1: I am using SUBLIME text editor.
EDIT 2: I have updated the sample text file with \;
You don't need substitutions if you consider matching escaped characters individually:
(?m)^(?:[^\\;\r\n]*(?:\\.[^\\;\r\n]*)*;){6}$
Live demo
Breakdown:
(?m) Set multiline flag
^ Assert beginning of line
(?: Start of non-capturing group 1
[^\\;\r\n]* Match anything except \, ;, \r and \n
(?: Start of NCG 2
\\.[^\\;\r\n]* Match an escaped character, then match the same character class again
)* As many as possible
; Match a semi-colon
){6} Six times exactly
$ Assert end of line
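To see which rows of the sample this pattern accepts, here is a short Python sketch. The helper name has_six_fields is hypothetical; fullmatch() on a single line takes the place of (?m), ^ and $.

```python
import re

# Six groups of "unescaped text, optionally with \x escapes, then a semicolon".
SIX_FIELDS = re.compile(r"(?:[^\\;\r\n]*(?:\\.[^\\;\r\n]*)*;){6}")


def has_six_fields(line):
    """True if the line has exactly six fields, each ended by an unescaped ';'."""
    return SIX_FIELDS.fullmatch(line) is not None


rows = [
    r"Number;Lastname or Company;Firstname;City;Postcode;Amount;",
    r"1;Trump;Donald;Washington;12345;4;",
    r"2;Bush;George;Washington;54321;1;",
    r"3;Lloyds\; and Firends;;11111;2;",
    r"4;Schuhmacher\;Frenzen\;Fettel; and Co;Company;Anywhere;22222;3;",
    r"5;Best\;Friends;Company\;Co;Nowhere;33333;4;",
]

bad_rows = [row for row in rows if not has_six_fields(row)]
for row in bad_rows:
    print(row)
```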
Does just using "|" in the regex not work?
e.g. ^([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)([^;]*;)|\\;$
I don't know what language you are using, but personally I think you would be better off using a split() followed by a count() function. This is available in many languages.
Hope that helps.
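A plain split on ; would also split at \;, so the split has to skip escaped semicolons. One possible sketch in Python, with a hypothetical helper split_fields, uses a negative lookbehind for that. It assumes a backslash is only ever used to escape a semicolon (an escaped backslash right before a ; would be misread).

```python
import re


def split_fields(line):
    """Split on semicolons that are not preceded by a backslash."""
    fields = re.split(r"(?<!\\);", line)
    # A row that ends with ';' yields one trailing empty string; drop it.
    if fields and fields[-1] == "":
        fields.pop()
    return fields


good = r"5;Best\;Friends;Company\;Co;Nowhere;33333;4;"
short = r"3;Lloyds\; and Firends;;11111;2;"
long_ = r"4;Schuhmacher\;Frenzen\;Fettel; and Co;Company;Anywhere;22222;3;"

for row in (good, short, long_):
    print(len(split_fields(row)), row)
```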

Vimgrep before any empty line

I have a lot of files which start with some tags I defined.
Example:
=Title
#context
!todo
#topic
#subject
#etc
And some text (notice the blank line just before this text).
Foo
Bar
I'd like to write a Vim search command (with vimgrep) to match something before an empty line.
How do I grep only in the lines before the first blank line? Will it make the grep faster? Please, no need to mention :grep or binaries like Ag (the silver searcher).
I know \_.* matches everything including EOL. I know the negation [^foo]. I managed to match everything but empty lines with /[^^$]. But I didn't manage to compose my :vimgrep command. Thank you for your help!
If you want a general solution that works for any file content, then AFAIK you can't get one with text in that form. You may ask why?
Explanation:
vimgrep requires a pattern argument and searches line by line, which behaves exactly like the :global command.
In your case we need the first part, preceding the first blank line. (This can be extended to: get the first non-blank text.)
Let's call:
A :Every block of text not containing any single blank line inside
x :Blank lines
Only with these 5 file layouts can you get the first A block with vimgrep (cases 2, 4 and 5 are obvious):
1 | 2 | 3 | 4 | 5
x | x | A | x | A
A | A | x | A | x
x | x | A |   |
A |   |   |   |
Looking at your file, it has this form:
A
x
A
x
A
The middle block causes a problem; that is why you cannot split off the first A unless you delimit it with some known UNIQUE text.
So the only solution I can come up with, for those 5 cases only, is:
:vimgrep /\_.\{-}\(\(\n\s*\n\)\+\)\@=/ %
AFAIK the most you can do with :vimgrep is use the \%<XXl atom to restrict the search to lines above a specific line number:
:vim /\%<20lfunction/ *.vim
That command will find all instances of function above line 20 in the given files.
See :help \%l.
[...] always matches a single character. [^^$] matches a character that is not ^ or $. This is not what you want.
One of the things you can do is:
/\%^\%(.\+\n\)\{-}.\{-}\zsfoo/
This matches
\%^ - the beginning of the file
\%( \) - a non-capturing group
\{-} - ... repeated 0 or more times (as few as possible)
.\+ - 1 or more non-newline characters
\n - a newline
.\{-} - 0 or more non-newline characters (as few as possible)
\zs - the official start of the match
This will find the first occurrence of foo, starting from the beginning of the file, searching only non-empty lines. But that's all it does: You can't use it to find multiple matches.
Alternatively:
/\%(^\n\_.*\)\@<!foo/
\%( \) - a non-capturing group
\@<! - not-preceded-by modifier
^ - beginning of line
\n - newline
\_.* - 0 or more of any character
This matches every occurrence of foo that is not preceded anywhere by an empty line (i.e. a beginning-of-line / newline combo).
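Outside Vim, the same idea (search only the lines before the first blank line) can be written directly. This Python sketch is not a :vimgrep replacement, only an illustration of the logic; grep_before_blank is a hypothetical name, and the sample adds a second #topic after the blank line to show that it is ignored. Unlike ^\n in Vim, it also treats whitespace-only lines as blank.

```python
import re
from itertools import takewhile


def grep_before_blank(text, pattern):
    """Return (line_number, line) pairs that match, stopping at the first blank line."""
    # takewhile stops at the first blank (or whitespace-only) line.
    header = takewhile(lambda line: line.strip() != "", text.splitlines())
    rx = re.compile(pattern)
    return [(n, line) for n, line in enumerate(header, start=1) if rx.search(line)]


sample = "=Title\n#context\n!todo\n#topic\n#subject\n#etc\n\nFoo #topic\nBar\n"
hits = grep_before_blank(sample, r"#topic")
print(hits)
```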

Regular Expressions - Greedy but stop before a string match

I have some data and I'd like to convert it into a table format.
Here's the input data
1- This is the 1st line with a
newline character
2- This is the 2nd line
Each line may contain multiple newline characters.
Output
<td>1- This the 1st line with
a new line character</td>
<td>2- This is the 2nd line</td>
I've tried the following
^(\d{1,3}-)[^\d]*
but it seems to match only till the digit 1 in 1st.
I'd like to be able to stop matching after i find another \d{1,3}\- in my string.
Any suggestions?
EDIT:
I'm using EditPad Lite.
This is for vim, and uses a zero-width positive lookahead:
/^\d\{1,3\}-\_.*[\r\n]\(\d\{1,3\}-\)\@=
Steps:
/^\d\{1,3\}- 1 to 3 digits followed by -
\_.* any number of characters including newlines/linefeeds
[\r\n]\(\d\{1,3\}-\)\@= followed by a newline/linefeed ONLY if it is followed
by 1 to 3 digits followed by - (the first condition)
EDIT: This is how it would be in pcre/ruby:
/(\d{1,3}-.*?[\r\n])(?=(?:\d{1,3}-)|\Z)/m
Note you need a string ending with a newline to match the last entry.
SEARCH: ^\d+-.*(?:[\r\n]++(?!\d+-).*)*
REPLACE: <td>$0</td>
[\r\n]++ matches one or more carriage-returns or linefeeds, so you don't have to worry about whether the file use Unix (\n), DOS (\r\n), or older Mac (\r) line separators.
(?!\d+-) asserts that the first thing after the line separator is not another line number.
I used the possessive + in [\r\n]++ to make sure it matches the whole separator. Otherwise, if the separator is \r\n, [\r\n]+ could match the \r and (?!\d+-) could match the \n.
Tested in EditPad Pro, but it should work in Lite as well.
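The same search-and-replace can be sketched in Python. This is an adaptation, not the EditPad pattern verbatim: it assumes \n line endings and uses a plain \n in place of the possessive [\r\n]++, since possessive quantifiers are only available in newer Python versions. wrap_entries is a hypothetical name, and \g<0> plays the role of $0.

```python
import re

# A numbered line, plus any following lines that do not start a new number.
ENTRY = re.compile(r"^\d+-.*(?:\n(?!\d+-).*)*", re.MULTILINE)


def wrap_entries(text):
    """Wrap each numbered entry, including its continuation lines, in <td> tags."""
    return ENTRY.sub(r"<td>\g<0></td>", text)


data = "1- This is the 1st line with a\nnewline character\n2- This is the 2nd line"
result = wrap_entries(data)
print(result)
```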
You did not specify a language (there are many regexp implementations), but in general, what you are looking for is called "positive lookahead", which lets you add patterns that will influence the match, but will not become part of it.
Search for lookahead in the documentation of whatever language you are using.
Edit: the following sample seems to work in vim.
:%s#\v(^\d+-\_.{-})\ze(\n\d+-|%$)#<td>\1</td>
Annotation below:
% - for all lines
s# - substitute the following (you can use any delimiter, and slash is most
common, but as that will require that we escape slashes in the command
I chose to use the number sign)
\v - very magic mode, lets us use fewer backslashes
( - start group for back referencing
^ - start of line
\d+ - one or more digits (as many as possible)
- - a literal dash!
\_. - any character, including a newline
{-} - zero or more of these (as few as possible)
) - end group
\ze - end match (anything beyond this point will not be included in the match)
( - start a new group
[\n\r] - newline (in any format - thanks Alan)
\d+ - one or more digits
- - a dash
| - or
%$ - end of file
) - end group
# - start substitute string
<td>\1</td> - a TD tag around the first matched group
(\d+-.+(\r|$)((?!^\d-).+(\r|$))?)
You can match only the separators and split on them. In C#, for example, it could be done like this:
string s = "1- This is the 1st line with a \r\nnewline character\r\n2- This is the 2nd line";
string ss = "<td>" + string.Join("</td>\r\n<td>", Regex.Split(s.Substring(3), "\r\n\\d{1,3}- ")) + "</td>";
MessageBox.Show(ss);
Would it work for you to do it in 3 steps?
(These are Perl regexes.)
Replace the first:
$input =~ s/^(\d{1,3})/<td>$1/;
Replace the rest:
$input =~ s/\n(\d{1,3})/<\/td>\n<td>$1/gm;
Add the last:
$input .= '</td>';