Perl, delete everything after first three characters - regex

I promise you all I've searched the site for about two hours now. I've found several that should have worked, but they didn't.
I have a line that consists of a varying amount of numbers separated by spaces. I want to delete everything after the third number.
I should say that everything I've been writing has assumed that \S\s\S\s\S would match the first three numbers, with spaces between 1 and 2, and between 2 and 3.
I anticipated the following working:
s/^.*?[\S\s\S\s\S].{5}//s;
but it did the exact opposite of what I wanted.
I would like 2 3 0 4 5 6 7 1 0 1 2 to become 2 3 0
I would really prefer to keep it substitution. I've tried look-behind as one person mentioned and I had no luck. Should I be saving the first 3 numbers as a string before I'm trying these commands?
EDIT:
I should have clarified that these numbers could also be in the form 1.57 or 1.00E01. I used integers while I was trying to get a baseline version working.

\S\s\S\s\S will indeed match three non-space characters separated by space characters. However, ^.*?[\S\s\S\s\S].{5} does something completely different:
^ matches the beginning of the line.
.*? matches as few characters as possible (rather than as many as it can) before the rest of the pattern can match. Since you specify /s, . will match newline as well.
[\S\s\S\s\S] is a character class, and so is the same as [\S\s]—match either \S or \s, which is to say anything.
.{5} will match five characters.
Since [\S\s] and . with /s match the same things, the .*? will never match any characters as it wants to match as little as possible. Thus, this is the same as s/^.{6}//s—delete the first six characters from the string. As you can see, that's not what you wanted!
One way to keep the first three numbers is to explicitly match them: s/^(\d \d \d).*/$1/s. Here, each \d matches a single digit (0-9), and the literal spaces in the pattern match the spaces between them. We match the first three numbers followed by anything at all, and then replace the whole match (since it ends in .*, that's the whole string) with just the bit in between the parentheses, i.e. the first three numbers. If your numbers can be more than one digit long, then s/^(\d+ \d+ \d+).*/$1/s will do what you want; if you can have arbitrary space-like characters (space, tab, newline) separating them, then s/^(\d\s\d\s\d).*/$1/s is what you want (or use \s+ if you can have multiple spaces). If you want to catch lines which have things other than digits, such as 1.57 or 1.00E01, you can use \S or \S+, just as you were.
Another approach, using lookbehind, would be s/(?<=^\d \d \d).*//s. In other words, delete any characters which are preceded by ^\d \d \d—the beginning of the string followed by three space-separated numbers. There's no real advantage to this approach—I'd probably do it the other way—but since you mentioned lookbehind, here's how you can do it. (Again, things like s/(?<=^\S\s\S\s\S).*//s are more general.)
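To see the difference concretely, here is a minimal sketch that runs both the original substitution and a capture-based one on the sample line from the question. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $line = "2 3 0 4 5 6 7 1 0 1 2";

# The original attempt: [\S\s\S\s\S] matches any one character, so together
# with .{5} this removes the first six characters of the string.
(my $attempt = $line) =~ s/^.*?[\S\s\S\s\S].{5}//s;
print "attempt: $attempt\n";   # attempt: 4 5 6 7 1 0 1 2

# Capture the first three whitespace-separated fields and drop the rest.
(my $kept = $line) =~ s/^(\S+\s+\S+\s+\S+).*/$1/s;
print "kept: $kept\n";         # kept: 2 3 0
```

Because the second pattern uses \S+ rather than \d, it should also cope with values such as 1.57 or 1.00E01.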

So match the first three numbers explicitly, and drop everything else.
s/^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$/$1 $2 $3/;
This works as follows:
$ perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(q{^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$})->explain;'
The regular expression:
(?-imsx:^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\dE.]+ any character of: digits (0-9), 'E', '.'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[\dE.]+ any character of: digits (0-9), 'E', '.'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[\dE.]+ any character of: digits (0-9), 'E', '.'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
(Updated in consideration of the changes the OP made to the original specification.)
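As a quick check of that substitution, here is a small sketch that applies it to the integer line from the question and to a made-up line using the 1.57 and 1.00E01 forms mentioned in the edit. The second sample line is an assumption for illustration only.

```perl
use strict;
use warnings;

my @lines = (
    "2 3 0 4 5 6 7 1 0 1 2",
    "1.57 1.00E01 3 4.2 7",    # hypothetical line with decimals and an exponent
);

my @trimmed;
for my $line (@lines) {
    # Keep the first three numbers, normalise the separators to single spaces.
    (my $copy = $line) =~ s/^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$/$1 $2 $3/;
    push @trimmed, $copy;
    print "$copy\n";
}
# prints:
# 2 3 0
# 1.57 1.00E01 3
```

Note that the class [\dE.] has no sign characters and no lowercase e, so values like -1.5 or 1e-3 would need those added to the class.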

Your code where you say s/^.*?[\S\s\S\s\S].{5}//s;
I would write as: s/^(\S\s\S\s\S).*$/$1/
You're forgetting to use parentheses to capture the part that you want to keep and $1 to put it back in the replacement, and having a .* at the beginning could lead to the starting numbers being removed instead of the trailing ones.
Also, I'm not sure if you have some guarantee of single digit numbers, or of single whitespace characters, so you could write the code with s/^(\S+\s+\S+\s+\S+).*$/$1/ to capture all of the spaces and all of the digits.
Let me know if I need to clarify that a little more.
Here's a website I find super helpful for Perl regex: http://pubcrawler.org/perl-reference.html

The question is, why do you want to do such a thing with a regexp? It seems easier to me with:
substr $string, 0, 5;
or, if you really want a regex (I didn't test this):
s/^(.{5})(.*)/$1/
Parentheses allow you to "remember" parts of a pattern; this is the way to say that you want to replace pretty much everything with just the first part of the pattern (the first five characters). This pattern will match any line of text that has at least five characters and leave just the first five. You may want to modify it to match 3 digits with spaces between them.
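A minimal sketch of the non-regex route, assuming single-digit numbers for the substr call. The split-based variant is an addition for the multi-character numbers mentioned in the question's edit; it is not part of the answer above.

```perl
use strict;
use warnings;

my $string = "2 3 0 4 5 6 7 1 0 1 2";

# Offset 0, length 5: the first five characters.
my $first_five = substr $string, 0, 5;
print "$first_five\n";     # 2 3 0

# For numbers of any width, split on whitespace and keep the first three fields.
my $longer = "1.57 1.00E01 3 4.2 7";    # hypothetical sample line
my $first_three = join ' ', (split ' ', $longer)[0 .. 2];
print "$first_three\n";    # 1.57 1.00E01 3
```

With only two arguments, substr $string, 5 returns everything from offset 5 onwards, which is the part the question wants to discard.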

How do I write regex to pull patch version out of semver

I'm trying to use regex to pull just the patch version out of some semver strings in the form v1.2.3
I've got some regex which can match the v1.2. part however I'm struggling to get the other part, the 3 (which I actually want back)
I'm using ^v\d+\.\d+\. to select the first part.
I'm trying to use a negative lookahead with this to then select everything after it with (?!(v\d+\.\d+\.)).* but this just seems to return everything after the v rather than everything after the group
Any pointers would be really appreciated, thanks!
In this special case, where the major and minor versions are each a single digit:
'(?<=^v\d\.\d\.)[[:alnum:]]+'
The regular expression matches as follows:
Node            Explanation
(?<=            look behind to see if there is:
  ^             start of string
  v             v
  \d            digits (0-9)
  \.            .
  \d            digits (0-9)
  \.            .
)               end of look-behind
[[:alnum:]]+    any character of: letters and digits (1 or more times (matching the most amount possible))
A more generic solution that works with any number of digits:
'^v\d+\.\d+\.\K[[:alnum:]]+'
The regular expression matches as follows:
Node            Explanation
^               start of string
v               v
\d+             digits (0-9) (1 or more times (matching the most amount possible))
\.              .
\d+             digits (0-9) (1 or more times (matching the most amount possible))
\.              .
\K              resets the start of the match (what is Kept), as a shorter alternative to using a look-behind assertion
[[:alnum:]]+    any character of: letters and digits (1 or more times (matching the most amount possible))
Check man tr | grep -FA1 '[:' for the POSIX character classes such as [[:alnum:]].
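For readers who want to try the \K approach in Perl itself (5.10 or later, where \K is available), here is a minimal sketch. The version strings are made-up samples, and the pattern is written without a stray character between \K and the character class.

```perl
use strict;
use warnings;

my %patch;
for my $version ("v1.2.3", "v10.20.30", "v1.2.3rc1") {    # hypothetical inputs
    # Everything before \K must match but is not part of the reported match.
    if ($version =~ /^v\d+\.\d+\.\K[[:alnum:]]+/) {
        $patch{$version} = $&;
        print "$version -> $&\n";
    }
}
# prints:
# v1.2.3 -> 3
# v10.20.30 -> 30
# v1.2.3rc1 -> 3rc1
```

A plain capture group, as in /^v\d+\.\d+\.(\d+)/, is another option when only the numeric patch part is wanted.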

Match a not 8 character (digit/a-z/A-Z) long word through regex

This is the first question I post so sorry for anything I might screw up.
I've spent the past hour experimenting and searching for a way to replace "not 8 character (digit/a-z/A-Z) long word" with blank space (or in other words delete anything but those words) in Notepad++ through regular expressions.
I managed to bookmark the lines containing them but I'm stuck with the whole line that has that word, I just want those specific words. I'd appreciate any help, thanks a lot!
Edit2: A better way to phrase this is:
To remove anything that isn't: 8 character long that starts with an S and only contains digits and letters. In other words, remove anything that isn't S******* where *=digit,letter
Edit: I realized that's not enough to understand the situation so here's an example. I want to process this:
Here's your first code: S284JF2B
Here's your second code: SKE093JF
Here's your third code: S28fka30
And get this output:
S284JF2B
SD34EQ5M
SASFKA30
The actual file has lots of other characters that are not just digits/letters and the codes I want on the output are always 8 character long (digits/Uppercase letters) always starting with an S.
I have two possible solutions. Both solutions require the string to be 8 characters long and begin with an S.
Given this sample text:
the problem is it's not the words that do not contain any words that I don't want
but actually any string that isn't a string that starts with an S and is 8 character long.
Example: S294KS12 this is the type of string I want on the document. Contains 8 characters
that are either digits or letters and starts with an S
SOMETIME
S294KS12
S1234567
S123456A
Option 1
This solution only finds strings which are 8 characters long and start with an S.
\bS[A-Z0-9]{7}\b
Live Demo
https://regex101.com/r/lK0aO9/1
Matches from Sample
S294KS12
SOMETIME
S294KS12
S1234567
S123456A
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
S 'S'
----------------------------------------------------------------------
[A-Z0-9]{7} any character of: 'A' to 'Z', '0' to '9'
(7 times)
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
Option 2
This solution does additional checking to ensure that the characters after the S include at least one letter and at least one digit.
\bS(?=[A-Z]*[0-9])(?=[0-9]*[A-Z])[A-Z0-9]{7}\b
Live Demo
https://regex101.com/r/vH4lX2/3
Matches from Sample
S294KS12
S294KS12
S123456A
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
S 'S'
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[A-Z]* any character of: 'A' to 'Z' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
[0-9] any character of: '0' to '9'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[0-9]* any character of: '0' to '9' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
[A-Z] any character of: 'A' to 'Z'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
[A-Z0-9]{7} any character of: 'A' to 'Z', '0' to '9'
(7 times)
----------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
----------------------------------------------------------------------
Putting it all together
To replace everything else, I'd incorporate the regular expression into ( ... )\s?|. which will match everything, including the desired strings.
If you then use $1 in the "Replace with" field in Notepad++, you'll be left with just your desired strings.
I recommend using option 2 above, and inserting that into the expression so it looks like this:
(\bS(?=[A-Z]*[0-9])(?=[0-9]*[A-Z])[A-Z0-9]{7}\b)\s?|.
Replace with: $1
Live Demo
https://regex101.com/r/gO7zV7/1
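The two options and the combined replacement can also be tried outside Notepad++. Below is a minimal Perl sketch on a shortened, made-up version of the sample text. The /e replacement with $1 // '' is an assumption standing in for Notepad++'s behaviour of substituting nothing when group 1 did not take part in the match.

```perl
use strict;
use warnings;

my $text = join "\n",
    "Example: S294KS12 is wanted",    # hypothetical sample lines
    "SOMETIME",
    "S1234567",
    "S123456A";

my $option1 = qr/\bS[A-Z0-9]{7}\b/;
my $option2 = qr/\bS(?=[A-Z]*[0-9])(?=[0-9]*[A-Z])[A-Z0-9]{7}\b/;

my @all   = $text =~ /$option1/g;
my @mixed = $text =~ /$option2/g;
print "Option 1: @all\n";      # Option 1: S294KS12 SOMETIME S1234567 S123456A
print "Option 2: @mixed\n";    # Option 2: S294KS12 S123456A

# ( ... )\s?|.  keeps group 1 when a code matched and drops any other character.
(my $only_codes = $text) =~ s{($option2)\s?|.}{$1 // ''}ge;
print "$only_codes\n";
```

Because . does not match a newline here, the line breaks survive, so lines without a code are left behind as empty lines.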
Description
Lacking any proper examples, this will find substrings that are 8 characters long and contain no letters, digits, or whitespace. The substring must be delimited by whitespace or by the beginning or end of the string.
(?<=\s|^)[^a-zA-Z0-9\s]{8}(?=\s|$)
Example
Live Demo
https://regex101.com/r/gS9uN7/1
Sample text
I've spent the past hour experimenting and searching for a way to replace "not 8 character (digit/a-z/A-Z) $##!#$>< fd long word" with blank space (or in other words delete anything but those words) in Notepad++ through regular expressions.
Sample Matches
$##!#$><
After Replacement
I've spent the past hour experimenting and searching for a way to replace "not 8 character (digit/a-z/A-Z) fd long word" with blank space (or in other words delete anything but those words) in Notepad++ through regular expressions.
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
(?<= look behind to see if there is:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
^ after an optional start of the string
----------------------------------------------------------------------
) end of look-behind
----------------------------------------------------------------------
[^a-zA-Z0-9\s]{8} any character except: 'a' to 'z', 'A' to
'Z', '0' to '9', whitespace (\n, \r, \t, \f, and " ")
(8 times)
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
$ before an optional \n, and the end of a
"line"
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
To match everything but the tokens from your example:
(^|\s)(?!S\w{7}\b)\S*
For a live demo, see https://regex101.com/r/rW8mF0/4
To match any non-8 character word:
\b\w{1,7}\b|\b\w{9,}\b
It matches words of length 1 - 7 OR words of length 9 and more.
For a live demo, see https://regex101.com/r/fX2sE5/1
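A small Perl sketch of the length-based pattern, run on a made-up line, to show which words it picks out and that 8-character words are skipped:

```perl
use strict;
use warnings;

my $text = "a S284JF2B toolongword12 code SKE093JF";    # hypothetical sample

# Words of 1 to 7 characters, or of 9 and more; 8-character words never match.
my @not_eight = $text =~ /\b\w{1,7}\b|\b\w{9,}\b/g;
print join(", ", @not_eight), "\n";    # a, toolongword12, code
```

Note that this only considers length, so an 8-character word that does not start with S would be kept as well.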
I tried the following text, which reproduces your problem with additional non-alphanumeric characters.
<protocol="toto" john="doe" Here's your first code: S284JF2B sign="+" />
<protocol="toto" john="doe" Here's your second code: SKE093JF sign="+" />
<protocol="toto" john="doe" 8char="s2345678" Here's your third code: S28fka30 sign="+" />
I used the following regular expression:
\b(\w{1,7}|(?=[^S])\w{8}|S(?![A-Za-z0-9]{7})\w{7}|\w{9,})\b|[^\w\r\n]
I obtained only the codes by replacing with nothing, with the "Match case" option selected in the Replace window.
With a live demo here: https://regex101.com/r/wW3eB1
Explanations:
\b(...)\b: word between boundaries
\w{1,7}: word of 1 to 7 characters
(?=[^S])\w{8}: word of 8 characters not starting with 'S'
S(?![A-Za-z0-9]{7})\w{7}: word of 8 characters starting with 'S' but containing something other than alphanumerics (i.e. an underscore)
\w{9,}: word of 9 characters or more
[^\w\r\n]: character that is neither a word character nor an EOL character
I'd use the following regex, ^.*(S[A-Z0-9]{7})(?![A-Z0-9]).*$:
Ctrl+H
Find what: ^.*(S[A-Z0-9]{7})(?![A-Z0-9]).*$
Replace with: $1
DO NOT check . matches newline
Replace all
Explanation:
^ : beginning of line
.* : any character 0 or more times
( : start group 1
S[A-Z0-9]{7} : S followed by 7 alphanumeric characters
) : end group
(?![A-Z0-9]) : negative lookahead to make sure there is no alphanumeric character after
.* : any character 0 or more times
$ : end of line
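The same line-by-line replacement can be sketched in Perl, using /m so that ^ and $ work per line, as they do in Notepad++. The lookahead is written as (?!...) with no = after the !, and the third input line is a made-up example of a line without a code.

```perl
use strict;
use warnings;

my $text = "Here's your first code: S284JF2B\n"
         . "Here's your second code: SKE093JF\n"
         . "nothing here\n";    # hypothetical line without a code

# Replace each matching line with just the captured code.
(my $codes = $text) =~ s/^.*(S[A-Z0-9]{7})(?![A-Z0-9]).*$/$1/mg;
print $codes;
# prints:
# S284JF2B
# SKE093JF
# nothing here
```

Lines that contain no code do not match at all and are therefore left untouched; they would need a separate step to remove.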

regex filename of a unixpath without the first two digits

I have a filename in a Unix path that starts with two digits. How can I extract the name without the digits and without the extension?
/this/is/my/path/to/the/file/01filename.ext should be filename
I currently have [^/]+(?=\.ext$) so I get 01filename, but how do I get rid of the first two digits?
You can add a look-behind in front of what you already have, looking for two digits:
(?<=\d\d)[^/]+(?=\.ext$)
This only works if you have exactly two digits! Unfortunately, in most regex engines it is not possible to use quantifiers like * or + in lookbehinds.
(?<=\d\d) - checks for two digits before the match
[^/]+ - matches 1 or more characters, except /
(?=\.ext$) - checks for .ext after the match
Try this one:
/\d\d(.*?)\.\w{3}$
Explanation:
/\d\d : slash followed by two digits
(.*?) : the capture
\.\w{3} : a dot followed by three word characters
$ : end of string
It works for me on Expresso
Consider the following Regex...
(?<=\d{2})[^/]+(?=\.ext$)
Good Luck!
A more general regex:
(?:^|\/)[\d]+([^.]+)\.[\w.]+$
Explanation:
(?:          group, but do not capture:
  ^          the beginning of the string
  |          OR
  \/         '/'
)            end of grouping
[\d]+        any character of: digits (0-9) (1 or more times (matching the most amount possible))
(            group and capture to \1:
  [^.]+      any character except: '.' (1 or more times (matching the most amount possible))
)            end of \1
\.           '.'
[\w.]+       any character of: word characters (a-z, A-Z, 0-9, _), '.' (1 or more times (matching the most amount possible))
$            before an optional \n, and the end of the string
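To compare the capture-group and lookaround styles from the answers above, here is a minimal Perl sketch run on the path from the question. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $path = "/this/is/my/path/to/the/file/01filename.ext";

# Capture group: consume the slash and the two digits, capture the name.
my ($by_capture) = $path =~ m{/\d\d([^/]+)\.ext$};

# Lookarounds: the digits and the extension are checked but not matched.
my ($by_lookaround) = $path =~ m{((?<=\d\d)[^/]+(?=\.ext$))};

print "$by_capture\n";       # filename
print "$by_lookaround\n";    # filename
```

Both patterns escape the dot before ext so that it matches a literal period only.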

I can't find proper regexp

I have the following file (it follows this scheme, but is much longer):
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
There are 2 words (e.g. LSE ZTX) in every line, with optional spaces and/or tabs at the beginning and at the end, and always some in between.
Could someone help me to match these 2 words with regexp? Following the example I wish to have LSE in $1 and ZTX in $2 for the first line, SWX in $1 and ZURN in $2 for the second etc.
I have tried something like:
$line =~ /(\t|\s)*?(.*?)(\t|\s)*?(.*?)/msgi;
$line =~ /[\t*\s*]?(.*?)[\t*\s*]?(.*?)/msgi;
I don't know how to say that there could be either spaces or tabs (or both of them mixed, e.g. \t\s\t).
If you want to just match the two first words, the most basic thing is to just match any sequence of characters that are not whitespace:
my ($word1, $word2) = $line =~ /\S+/g;
This will capture the first two words in $line into the variables, if they exist. Note that capturing parentheses are not required in the pattern when using the /g modifier in list context, because each match of the whole pattern is returned. Use an array instead if you want to capture all existing matches.
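A minimal sketch of this on lines with mixed leading, trailing, and separating spaces and tabs. The sample lines are made up to exercise the whitespace handling, and the anchored capture-group form is shown next to it for comparison.

```perl
use strict;
use warnings;

# Hypothetical lines with mixed spaces and tabs around the two words.
my @lines = ("  LSE ZTX", "\tSWX \t ZURN  ", "LSE\tZYT", "   NYSE   CGI\t");

my @pairs;
for my $line (@lines) {
    # List-context /g returns every run of non-whitespace; keep the first two.
    my ($word1, $word2) = $line =~ /\S+/g;

    # Equivalent anchored form with explicit capture groups ($1 and $2).
    my ($first, $second) = $line =~ /^\s*(\S+)\s+(\S+)/;

    push @pairs, "($word1, $word2)";
    print "($word1, $word2) ($first, $second)\n";
}
```

Since \s matches both spaces and tabs, no special handling for \t is needed in either form.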
Since there are always two words and you don't need to match the entire line, your simplest regex would be:
/(\w+)\s+(\w+)/
I think this is what you want
^\s*([A-Z]+)\s+([A-Z]+)
See it here on Regexr. You will find the first code of a row in group 1 and the second in group 2. \s is a whitespace character; it includes e.g. spaces, tabs and newline characters.
In Perl it is something like this:
($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/i;
I think you are reading the text file row by row, so you don't need the modifiers s and m, and g is also not needed.
In case the codes are not only ASCII letters, then replace [A-Z] with \p{L}. \p{L} is a Unicode property that will match every letter in every language.
\s also matches tabs, so your regex could look like:
$line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;
the first word is in the first group ($1) and the second in $2.
You can change [A-Z] to whatever's more convenient with your needs.
Here is the explanation from YAPE::Regex::Explain
The regular expression:
(?-imsx:^\s*([A-Z]+)\s+([A-Z]+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[A-Z]+ any character of: 'A' to 'Z' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[A-Z]+ any character of: 'A' to 'Z' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
With option "Multiline" this Regex:
^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$
Will give you N matches each containing 2 groups named:
- word1
- word2
^\s*([A-Z]{3,4})\s+([A-Z]{3,4})$
What this does
^ // Matches the beginning of a string
\s* // Matches a space/tab character zero or more times
([A-Z]{3,4}) // Matches any letter A-Z either 3 or 4 times and captures to $1
\s+ // Then matches at least one tab or space
([A-Z]{3,4}) // Matches any letter A-Z either 3 or 4 times and captures to $2
$ // Matches the end of a string
You can use split here:
use strict;
use warnings;
while (<DATA>) {
my ( $word1, $word2 ) = split;
print "($word1, $word2)\n";
}
__DATA__
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Output:
(LSE, ZTX)
(SWX, ZURN)
(LSE, ZYT)
(NYSE, CGI)
Assuming the spaces at the start of the line are what you use to identify your codes you want, try this:
Split your string up at newlines, then try this regex:
^\s+(\w+\s+){2}$
This will only match lines that start with some space, followed by a (word - some space - word), then end with some space.
# ^ --> String start
# \s+ --> One or more whitespace characters
# (\w+\s+){2} --> A (word followed by some space)x2
# $ --> String end.
However, if you want to capture the codes alone, try this:
$line =~ /^\s*(\w+)\s+(\w+)/;
# \s* --> Zero or more whitespace,
# (\w+) --> Followed by a word (group #1),
# \s+ --> Followed by some whitespace,
# (\w+) --> Followed by a word (group #2),
This will match all your codes
/[A-Z]+/

regex decompiler

I found this regex and want to understand it. Are there any regex decompilers that will translate what the following regex does into words? It is really complicated.
$text =~ /(((\w)\W*(?{$^R.(0+( q{a}lt$3))})) {8}(?{print +pack"B8" ,$^Rand ""})) +/x;
Using YAPE::Regex::Explain (not sure if it is good, but it's the first result in searching):
use YAPE::Regex::Explain;
my $REx = qr/(((\w)\W*(?{$^R.(0+( q{a}lt$3))})) {8}(?{print +pack"B8" ,$^Rand ""})) +/x;
my $exp = YAPE::Regex::Explain->new($REx)->explain;
print $exp;
I've got the explanation as:
(                         group and capture to \1 (1 or more times (matching the most amount possible)):
----------------------------------------------------------------------
  (                       group and capture to \2 (8 times):
----------------------------------------------------------------------
    (                     group and capture to \3:
----------------------------------------------------------------------
      \w                  word characters (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
    )                     end of \3
----------------------------------------------------------------------
    \W*                   non-word characters (all but a-z, A-Z, 0-9, _) (0 or more times (matching the most amount possible))
----------------------------------------------------------------------
    (?{$^R.(0+(q{a}lt$3))})
                          run this block of Perl code
----------------------------------------------------------------------
  ){8}                    end of \2 (NOTE: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \2)
----------------------------------------------------------------------
  (?{print +pack"B8",$^Rand ""})
                          run this block of Perl code
----------------------------------------------------------------------
)+                        end of \1 (NOTE: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
There are 2 blocks of Perl code, which must be analyzed independently.
In the first block:
$^R . (0 + (q{a} lt $3))
here, $^R is "the result of evaluation of the last successful (?{ code }) regular expression assertion", and the expression (0 + (q{a} lt $3)) gives 1 if the 3rd capture is in [b-z], 0 otherwise.
In the second block:
print +pack "B8", $^R and ""
it interprets the accumulated result as a (big-endian) binary string, gets the number, converts it to the corresponding character, and finally prints it out.
Together, the regex takes every group of 8 alphanumeric characters, then treats those in [b-z] as the binary digit 1 and all others as 0. These 8 binary digits are then interpreted as a character code, and that character is printed out.
For instance, the letter 'H' = 0b01001000 would be printed when matching the string
$text = 'OvERfLOW';
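The bit-building step can be reproduced without the embedded code blocks. This is a minimal sketch using the same eight-letter string as the example; the explicit loop stands in for the (?{ ... }) blocks and is not the original author's code.

```perl
use strict;
use warnings;

my $text = 'OvERfLOW';

# One bit per word character: 1 if it sorts after 'a' (i.e. b-z), else 0.
my $bits = '';
for my $char ($text =~ /\w/g) {
    $bits .= ('a' lt $char) ? 1 : 0;
}

# Interpret the 8 bits, most significant first, as a single byte.
my $decoded = pack "B8", $bits;
print "$bits -> $decoded\n";    # 01001000 -> H
```

Uppercase letters, digits and the underscore all compare lower than 'a', which is why only lowercase letters after a contribute a 1 bit.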
I'm not sure what all is in that statement, but for analyzing regexes I use this site:
http://xenon.stanford.edu/~xusch/regexp/analyzer.html
I always found OptiPerl's Regex Editor to be really good at this type of thing