Remove all but the first four characters on each line

I have a text file in VS Code that contains several lines of text like this:
1801: Joseph Marie Jacquard, a French merchant and inventor invent a loom that uses punched wooden cards to automatically weave fabric designs. Early computers would use similar punch cards.
I'm trying to isolate the year, i.e. the first 4 characters of each line. I'm new to regex, and I know how to match the first 4 characters (I used ^.{4}), but how can I match everything EXCEPT the first 4 characters, so that I can replace it with nothing and be left with just the years?

Find: (?<=^\d{4}).*
Replace: (leave empty)
regex101 Demo
(?<=^\d{4}) - a positive lookbehind (?<=...) asserting that the current position is preceded by the line start ^ and 4 digits
.* - matches everything else up to the line terminator, so the : is included in the match
Because a lookbehind is zero-width, the 4 digits are never part of the match. They stay in place, so you don't need any capture groups or a replacement string.

You can use
Find: ^(.{4}).+
Replace: $1
See the regex demo. Details:
^ - start of a line (in Visual Studio Code, ^ matches any line start)
(.{4}) - capturing group #1 that captures any four chars other than line break chars
.+ - one or more chars other than line break chars, as many as possible.
The $1 backreference in the replacement pattern replaces the match with Group 1 value.
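
Outside the editor, both answers can be tried with Python's re module. This is only a sketch: the first sample line is shortened from the question, the second is a made-up placeholder, and re.MULTILINE is needed so that ^ matches at every line start, as it does in VS Code.

```python
import re

# Sample input: first line shortened from the question, second line is a made-up placeholder.
text = "1801: Joseph Marie Jacquard invents a loom\n1900: placeholder entry"

# Approach 1: lookbehind. Only the part after the four digits is matched and removed.
years_lookbehind = re.sub(r"(?<=^\d{4}).*", "", text, flags=re.MULTILINE)

# Approach 2: capture group. The whole line is matched and replaced by group 1.
years_group = re.sub(r"^(.{4}).+", r"\1", text, flags=re.MULTILINE)

print(years_lookbehind)
print(years_group)
```

The lookbehind version only touches lines that start with four digits, while the capture-group version keeps the first four characters of any line that has more than four characters.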

Related

regex matching consecutive characters from start and end

I'm trying to match a string that contains three consecutive identical characters at the beginning of the line and six of the same character at the end.
For example:
CCC i love regex CCCCCC
The C's would be highlighted by the search.
I have found a way to match the first three and the last six using these two regexes, but I'm struggling to combine them:
^([0-9]|[aA-zZ])\1\1 and ([0-9]|[aA-zZ])\1\1\1\1\1$
I'd appreciate any help.

If you want just one regular expression to "highlight" only the 1st three characters and last six, maybe use:
(?:^([0-9A-Za-z])\1\1(?=.*\1{6}$)|([0-9A-Za-z])\2{5}(?<=^\2{3}.*)$)
See an online demo
(?: - Open non-capture group to allow for alternations;
^([0-9A-Za-z])\1\1(?=.*\1{6}$) - Start-line anchor with a 1st capture group followed by two backreferences to that same group. This is followed by a positive lookahead to assert that the very last 6 characters are the same;
| - Or;
([0-9A-Za-z])\2{5}(?<=^\2{3}.*)$ - The alternative matches a 2nd capture group followed by 5 backreferences to it, then a positive lookbehind (zero-width) to check that the first three characters are that same character. This lookbehind is variable-width, so it needs a regex flavor that supports that.
Now, if you don't want to be too strict about "highlighting" the other parts, just use capture groups:
^(([0-9A-Za-z])\2\2).*(\2{6})$
See an online demo. You can now refer to capture groups 1 and 3.
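
The capture-group variant needs no lookbehind, so it can be tried directly in Python's re module. A minimal sketch; the helper name leading_and_trailing is made up for illustration:

```python
import re

# Group 2 is the repeated character, group 1 the leading triple, group 3 the trailing six.
pattern = re.compile(r"^(([0-9A-Za-z])\2\2).*(\2{6})$")

def leading_and_trailing(line):
    """Return (first three, last six) if the line matches, else None."""
    m = pattern.match(line)
    return (m.group(1), m.group(3)) if m else None

print(leading_and_trailing("CCC i love regex CCCCCC"))
print(leading_and_trailing("CCC i love regex DDDDDD"))
```

The second call returns None because the trailing run does not use the same character as the leading one.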

Regex positive lookbehind after digits

I'm trying to get everything from the digits to the end of the line in the text below, using a positive lookbehind like this:
(?<=below customer.\s.*\n.* )(.*)
I order standardinstalation to below customer.
Paul Rilley
Abbeyroad 55
It works (gives 55) if the road name does not contain a space, but it does not work with e.g. (High Tory road). There could also be letters after the digits (55b), which I need to capture as well.
I need to look behind the words (below customer) since the first line is the only part that is always the same.

You can use
(?m)(?<=below customer\.\r?\n(?:.+\n)*?.+ )(\d+[A-Za-z]*)\r?$
See the .NET regex demo.
Details:
(?m) - multiline mode, so that $ matches at the end of any line
(?<=below customer\.\r?\n(?:.+\n)*?.+ ) - a lookbehind that matches below customer., then a line ending sequence, then zero or more lines each with a line ending, as few as possible, and then one or more chars other than newline up to a space, which must be immediately followed by
(\d+[A-Za-z]*) - Group 1: one or more digits and then zero or more letters
\r?$ - an optional CR char and the end of line.
It will also match 55b.

In most regex flavors, a lookbehind must be fixed width; .NET also supports variable width.
The pattern below keeps the lookbehind fixed width and uses a capture group instead, so it works in both PCRE and .NET:
/(?<=below customer\.)\r?\n.*\r?\n.* (\w+)$/gm
Demo for PCRE
Demo for .NET
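
Because the second pattern only uses a fixed-width lookbehind, it should also run in Python's re module. A small sketch, assuming \n line endings and a hypothetical helper name house_number:

```python
import re

# Fixed-width lookbehind, then skip the name line and capture the last word of the address line.
pattern = re.compile(r"(?<=below customer\.)\r?\n.*\r?\n.* (\w+)$", re.MULTILINE)

def house_number(text):
    """Return the last word of the address line, or None if the layout does not match."""
    m = pattern.search(text)
    return m.group(1) if m else None

order = "I order standardinstalation to below customer.\nPaul Rilley\nHigh Tory road 55b"
print(house_number(order))
```

The greedy .* followed by a space backtracks to the last space on the address line, which is why road names containing spaces are handled.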

Regex (PCRE): Match all digits in a line following a line which includes a certain string

Using PCRE, I want to capture only and all digits in a line which follows a line in which a certain string appears. Say the string is "STRING99". Example:
car string99 house 45b
22 dog 1 cat
women 6 man
In this case, the desired result is:
221
I asked a similar question some time ago; back then, however, I was trying to capture the numbers in the SAME line where the string appears ( Regex (PCRE): Match all digits conditional upon presence of a string ). While the question is similar, I don't think the answer, if there is one at all, will be similar. The approach using the line-start anchor ^ does not work in this case.
I am looking for a single regular expression without any other programming code. It would be easy to accomplish with two consecutive regex operations, but this is not what I'm looking for.

Maybe you could try:
(?:\bstring99\b.*?\n|\G(?!^))[^\d\n]*\K\d
See the online demo
(?: - Open non-capture group:
\bstring99\b - Literally match "string99" between word-boundaries.
.*?\n - Lazy match up to (including) nearest newline character.
| - Or:
\G(?!^) - Asserts the position at the end of the previous match; the negative lookahead prevents it from matching at the start of the string for the first match.
) - Close non-capture group.
[^\d\n]* - Match 0+ non-digit/newline characters.
\K - Resets the starting point of the reported match.
\d - Match a digit.
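
Python's built-in re module has neither \G nor \K, so the pattern above cannot be pasted into it as-is. To check the expected result, the same logic can be emulated in two steps, which is explicitly not the single-regex solution the question asks for. The function name digits_after_marker is made up for this sketch:

```python
import re

def digits_after_marker(text, marker="string99"):
    """Collect the digits of every line that directly follows a line containing marker."""
    lines = text.split("\n")
    marker_re = re.compile(rf"\b{re.escape(marker)}\b")
    digits = []
    # Walk over (line, next line) pairs and harvest digits from the follower line.
    for previous, current in zip(lines, lines[1:]):
        if marker_re.search(previous):
            digits.extend(re.findall(r"\d", current))
    return "".join(digits)

sample = "car string99 house 45b\n22 dog 1 cat\nwomen 6 man"
print(digits_after_marker(sample))  # 221
```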

Match certain string on second line of text with regex

I'm new to regex, and would appreciate some guidance/help.
Currently, I'm looking to write an expression, that derives a certain part of text from the 2nd line of the provided text.
Here is the text:
123 anywhere Avenue
Winnipeg, Manitoba R3E 0L7
Canada
Pharmacy Manager: person person
Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd.
My goal is to extract the 'Manitoba' string from the second line, but I'd like to make it dynamic rather than hard-coding an expression that always fetches Manitoba. I used the code below to target the second line:
(.*)(?=(\n.*){3}$)
(It matches 3 lines up from the last line, thus targeting the desired line)
I noticed that, within the dataset, the province (Manitoba) is always between two spaces.
Is there any addition I can make to the code so that the expression only targets the second line and then matches the first string between spaces?
Perhaps using a lazy expression with a positive lookaround?
If I target all matches between spaces, it would take both 'Manitoba' and 'R3E 0L7', which I don't want.
I want it to only match the first piece of text in between spaces on the second line.
Any help is much appreciated :-)
Thanks.

One option could be to match the first line, then capture the second word of the second line in capturing group 1.
Then match the rest of the second line and assert that exactly 3 more lines follow.
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)
In parts:
^ Start of string
.*\r?\n Match the whole first line and a newline
\S+ Match 1+ non whitespace chars (the first "word")
[^\S\r\n]+ Match 1+ times a whitespace char except newlines
(\S+) Capture group 1 Match 1+ times a non whitespace char (the second "word")
.* Match the rest of the line
(?= Positive lookahead, assert what follows on the right is
(?:\r?\n.*){3}$ Match 3 times a newline followed by 0+ times any except a newline and assert the end of the string
) Close lookahead
Regex demo
You could also turn the lookahead into a match instead
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?:\r?\n.*){3}$
Regex demo
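
As a quick check, the lookahead version can be run against the sample address in Python. This is a sketch that assumes the whole address is one string with \n line endings; without re.MULTILINE, ^ and $ anchor to the start and end of the string, as described above.

```python
import re

address = (
    "123 anywhere Avenue\n"
    "Winnipeg, Manitoba R3E 0L7\n"
    "Canada\n"
    "Pharmacy Manager: person person\n"
    "Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd."
)

# Group 1 is the second whitespace-separated word of the second line.
pattern = re.compile(r"^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)")
m = pattern.search(address)
province = m.group(1) if m else None
print(province)  # Manitoba
```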

Regular Expressions - Greedy but stop before a string match

I have some data and I'd like to convert it into a table format.
Here's the input data:
1- This is the 1st line with a
newline character
2- This is the 2nd line
Each line may contain multiple newline characters.
Output
<td>1- This the 1st line with
a new line character</td>
<td>2- This is the 2nd line</td>
I've tried the following:
^(\d{1,3}-)[^\d]*
but it seems to match only up to the digit 1 in 1st.
I'd like to stop matching when I find another \d{1,3}\- in my string.
Any suggestions?
EDIT:
I'm using EditPad Lite.

This is for vim, and uses a zero-width positive lookahead:
/^\d\{1,3\}-\_.*[\r\n]\(\d\{1,3\}-\)\#=
Steps:
/^\d\{1,3\}- 1 to 3 digits followed by -
\_.* any number of characters including newlines/linefeeds
[\r\n]\(\d\{1,3\}-\)\#= followed by a newline/linefeed ONLY if it is followed by 1 to 3 digits followed by - (the first condition)
EDIT: This is how it would be in pcre/ruby:
/(\d{1,3}-.*?[\r\n])(?=(?:\d{1,3}-)|\Z)/m
Note you need a string ending with a newline to match the last entry.

SEARCH: ^\d+-.*(?:[\r\n]++(?!\d+-).*)*
REPLACE: <td>$0</td>
[\r\n]++ matches one or more carriage-returns or linefeeds, so you don't have to worry about whether the file use Unix (\n), DOS (\r\n), or older Mac (\r) line separators.
(?!\d+-) asserts that the first thing after the line separator is not another line number.
I used the possessive + in [\r\n]++ to make sure it matches the whole separator. Otherwise, if the separator is \r\n, [\r\n]+ could backtrack to match only the \r, and (?!\d+-) would then succeed at the \n.
Tested in EditPad Pro, but it should work in Lite as well.
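
The same search-and-replace idea can be sketched in Python. Note the swap: the possessive [\r\n]++ is replaced here by \r?\n, because possessive quantifiers are only available in Python's re module from version 3.11 on, and \r?\n cannot be split by backtracking in the way described above.

```python
import re

data = "1- This is the 1st line with a\nnewline character\n2- This is the 2nd line"

# A numbered line, then any following lines that do not start with another "<digits>-".
entry = re.compile(r"^\d+-.*(?:\r?\n(?!\d+-).*)*", re.MULTILINE)

# \g<0> is the whole match, like $0 in the EditPad replacement.
table_cells = entry.sub(r"<td>\g<0></td>", data)
print(table_cells)
```

A continuation line that merely starts with a digit (such as "1st ...") is still treated as part of the entry, since the lookahead only rejects digits followed by a dash.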

You did not specify a language (there are many regexp implementations), but in general, what you are looking for is called "positive lookahead", which lets you add patterns that will influence the match, but will not become part of it.
Search for lookahead in the documentation of whatever language you are using.
Edit: the following sample seems to work in vim.
:%s#\v(^\d+-\_.{-})\ze(\n\d+-|%$)#<td>\1</td>
Annotation below:
% - for all lines
s# - substitute the following (you can use any delimiter; slash is most common, but since that would require escaping the slashes in the command, I chose the number sign)
\v - very magic mode, lets us use fewer backslashes
( - start group for back referencing
^ - start of line
\d+ - one or more digits (as many as possible)
- - a literal dash!
\_. - any character, including a newline
{-} - zero or more of these (as few as possible)
) - end group
\ze - end match (anything beyond this point will not be included in the match)
( - start a new group
[\n\r] - newline (in any format - thanks Alan)
\d+ - one or more digits
- - a dash
| - or
%$ - end of file
) - end group
# - start substitute string
<td>\1</td> - a TD tag around the first matched group

(\d+-.+(\r|$)((?!^\d-).+(\r|$))?)

You can match only the separators and split on them. In C#, for example, it could be done like this:
string s = "1- This is the 1st line with a \r\nnewline character\r\n2- This is the 2nd line";
string ss = "<td>" + string.Join("</td>\r\n<td>", Regex.Split(s.Substring(3), "\r\n\\d{1,3}- ")) + "</td>";
MessageBox.Show(ss);
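
The split-on-separators idea can also be sketched in Python. Unlike the C# snippet, which consumes the numbering as part of the separator, this variant uses a lookahead so the labels stay in the output; that choice is an assumption based on the desired output shown in the question.

```python
import re

s = "1- This is the 1st line with a \r\nnewline character\r\n2- This is the 2nd line"

# Split on line breaks that are immediately followed by a "<1-3 digits>-" label.
# The lookahead keeps the label in the next piece instead of consuming it.
entries = re.split(r"\r?\n(?=\d{1,3}-)", s)
cells = "\r\n".join(f"<td>{e}</td>" for e in entries)
print(cells)
```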

Would it be good for you to do it in 3 steps?
(these are perl regex):
Replace the first:
$input =~ s/^(\d{1,3}-)/<td>$1/;
Replace the rest (the dash is included so that continuation lines starting with a digit are left alone):
$input =~ s/\n(\d{1,3}-)/<\/td>\n<td>$1/gm;
Add the last:
$input .= '</td>';
