Regular expression for text consisting only of integers separated by newline characters - regex

There is the following regular expression:
^[1-9]\d{0,8}[\r\n]?$
It describes one line of text.
How can I indicate that this expression applies to 1 or more lines of text? I accept that the expression above may need to change.

You have a ^[line_pattern]$ regex. To expand it to validate a multiline string where each line should meet the same [line_pattern], use ^[line_pattern](?:\r?\n[line_pattern])*$. In engines that support the \R line break construct, you can replace \r?\n with it.
You may use
^[1-9]\d{0,8}(?:\r?\n[1-9]\d{0,8})*$
or
^[1-9]\d{0,8}(?:\R[1-9]\d{0,8})*$
It matches
^ - start of a string
[1-9]\d{0,8} - a non-zero digit followed by 0 to 8 digits
(?:\r?\n[1-9]\d{0,8})* - 0 or more repetitions of
\r?\n - a CRLF or an LF only line ending (\R matches any line break sequence)
[1-9]\d{0,8} - a non-zero digit followed by 0 to 8 digits
$ - end of string.
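As a rough illustration, the first pattern could be checked in Python as sketched below. The name is_integer_lines is hypothetical. re.fullmatch is used in place of ^ and $ because Python's $ also matches just before a trailing newline, and re.ASCII restricts \d to 0-9.

```python
import re

# One line: a non-zero digit followed by up to 8 more digits.
LINE = r"[1-9]\d{0,8}"
# One line, then zero or more repetitions of (line break + line).
MULTILINE = re.compile(rf"{LINE}(?:\r?\n{LINE})*", re.ASCII)


def is_integer_lines(text: str) -> bool:
    """Return True if every line is an integer of 1 to 9 digits with no leading zero."""
    return MULTILINE.fullmatch(text) is not None


print(is_integer_lines("12\n345\r\n6"))  # LF and CRLF separators are both accepted
print(is_integer_lines("12\n045"))       # second line starts with a zero
print(is_integer_lines("12\n"))          # trailing line break is not allowed by this pattern
```

Note that this variant rejects a trailing line break after the last number; the pattern would need an optional final (?:\r?\n)? to allow one.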

Your expression was for a single line. I have simply changed your expression to say there will be one or more of them using parentheses and a plus to indicate 'one or more'.
I have also edited the way you define the end of the line. I am assuming that there is always a CRLF or LF at the end of each number:
^([1-9]\d{0,8}\r?\n)+$

How to use Ruby gsub with regex to do partial string substitution

I have a pipe delimited file which has a line
H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||
I want to substitute the date (28092017) with a regex "[0-9]{8}" if the first character is "H"
I tried the following example to test my understanding, where I'm trying to substitute "a" with "i".
str = "|123||a|"
str.gsub /\|(.*?)\|(.*?)\|(.*?)\|/, "\|\\1\|\|\\1\|i\|"
But this gives the output
"|123||123|i|"
Any clue how this can be achieved?
You may replace the first occurrence of 8 digits inside pipes if a string starts with H using
s = "H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||"
p s.gsub(/\A(H.*?\|)[0-9]{8}(?=\|)/, '\100000000')
# or
p s.gsub(/\AH.*?\|\K[0-9]{8}(?=\|)/, '00000000')
Here, the value is replaced with 8 zeros.
Pattern details
\A - start of string (^ is the start of a line in Ruby)
(H.*?\|) - Capturing group 1 (you do not need it when using the variation with \K): H and then any 0+ chars as few as possible
\K - match reset operator that discards the text matched so far
[0-9]{8} - eight digits
(?=\|) - the next char must be |, but it is not added to the match value since it is a positive lookahead that does not consume text.
The \1 in the first gsub is a replacement backreference to the value in Group 1.
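For comparison, here is a minimal Python sketch of the capturing-group variant. Python's built-in re module has no \K, so only that variant is shown. The replacement uses \g<1> rather than \1 so that the zeros that follow are not read as part of the group number.

```python
import re

s = "H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||"

# Group 1 keeps everything from the leading H up to the pipe before the date.
# The lookahead requires a pipe right after the 8 digits without consuming it.
result = re.sub(r"\A(H.*?\|)[0-9]{8}(?=\|)", r"\g<1>00000000", s, count=1)

print(result)
```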

Regex for excluding strings that start with consecutive leading zeroes or are only alphabets [duplicate]

I am looking for a regex that selects only the strings below in which the part before the underscore does not consist solely of zeroes or solely of letters.
For ex:
ABC_DE-001 is invalid
abc is invalid (only alphabets)
0_DE-001 is invalid (1 zero before underscore)
000_DE-001 is invalid (sequence of 3 consecutive zeroes)
00_DE-001 is invalid (sequence of 2 consecutive zeroes)
01_DE-001 is valid (0 followed by some other number is valid)
10_DE-001 is valid (starts with 1)
100_DE-001 is valid (starts with 1)
One approach I tried was:
(0[1-9]+|[1-9][0-9]+|0[0*$][1-9])_[A-Z0-9]+[-][0-9]{3}
I am not sure though if any scenario is missed with this. Also, how can the same thing be achieved using negative or positive lookaround?
For your example data, you might match using an optional zero ^0? since a single leading zero can occur, but not more than one.
^0?[1-9][0-9]*_[A-Z]+-[0-9]{3}$
That will match
^0? An optional zero at the start of the string
[1-9][0-9]* Match a digit 1-9 followed by 0+ digits
_[A-Z]+ Match an _ followed by 1+ times A-Z
-[0-9]{3} Match - followed by 3 digits
$ Assert the end of the string
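A small Python sketch that runs this pattern over the sample strings from the question follows. re.fullmatch stands in for the ^ and $ anchors; the names PATTERN and samples are only for illustration.

```python
import re

# Optional single zero, then a number starting with 1-9, then _LETTERS-ddd.
PATTERN = re.compile(r"0?[1-9][0-9]*_[A-Z]+-[0-9]{3}")

samples = [
    "ABC_DE-001", "abc", "0_DE-001", "000_DE-001",
    "00_DE-001", "01_DE-001", "10_DE-001", "100_DE-001",
]

for s in samples:
    print(s, "valid" if PATTERN.fullmatch(s) else "invalid")
```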
You can try with negative look ahead groups:
grep -Pi '^(?![a-z]+(?:_|$|\s)|0+(?:_|$|\s))' test.txt
Explanation:
-Pi - use PCRE and ignore case. These options are grep-specific; adapt them to your tool. If your regex processor cannot ignore case, replace [a-z] with [a-zA-Z]. PCRE support is required for the lookahead.
^ - beginning of the line
(?!rgx) - looks ahead without moving the cursor to check that the line does not match the enclosed regular expression rgx.
[a-z]+(?:_|$|\s)|0+(?:_|$|\s) :
don't keep consecutive letters ([a-z]+) followed by an underscore, the end of the line, or a whitespace character ((?:_|$|\s))
don't keep consecutive zeroes (0+) followed by an underscore, the end of the line, or a whitespace character ((?:_|$|\s))
(?:) stands for a non-capturing group (the matched content is not stored, which can improve performance)
Output:
01_DE-001 is valid (0 followed by some other number is valid)
10_DE-001 is valid (starts with 1)
100_DE-001 is valid (starts with 1)
Since grep only prints matching lines by default, the lines that are not shown were treated as invalid.
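The same negative lookahead can be tried outside grep. The following Python sketch assumes the bare strings from the question as input; re.IGNORECASE plays the role of grep's -i option, and the names KEEP and kept are hypothetical.

```python
import re

# Reject a line that starts with only letters or only zeroes before
# an underscore, the end of the line, or whitespace.
KEEP = re.compile(r"^(?![a-z]+(?:_|$|\s)|0+(?:_|$|\s))", re.IGNORECASE)

lines = [
    "ABC_DE-001", "abc", "0_DE-001", "000_DE-001",
    "00_DE-001", "01_DE-001", "10_DE-001", "100_DE-001",
]

kept = [line for line in lines if KEEP.search(line)]
print(kept)
```

Unlike the fully anchored pattern, this lookahead only filters on the prefix and does not validate the rest of the line.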

Notepad++ regex to insert character every nth character from a starting position

How do you use regex to insert | every two characters from a starting position to the end of the line?
Using regex on the following sample (tshark output of packet data), the regex inserts | after the first two characters and the next two characters, but does not apply the pattern to the rest of the line. I think the issue is with a repeated pattern on the 2nd grouping (or lack thereof).
Sample:
1478646603.255173000 10.10.10.1 0000000000000000000000
^(.{34})(..) replace with \1|\2| OR ^(.{34})(.*?(..)) replace with \1|\2
Produces this:
1478646603.255173000 10.10.10.1 00|00|000000000000000000
What I want is:
1478646603.255173000 10.10.10.1 00|00|00|00|00|00|00|00|00|00|00
You may use
(?:\G(?!^)|^.{36})\K..(?!$)
and replace with $&|.
Details:
(?:\G(?!^)|^.{36}) - matches the location at the end of the previous successful match (with \G(?!^)) or (|) the start of a line (^) and the first 36 characters other than linebreak chars (.{36})
\K - the match reset operator that discards the whole text matched so far
.. - any 2 chars other than linebreak chars
(?!$) - that are not at the end of the string.
The replacement pattern only contains the backreference to the whole match ($&) and a | pipe symbol (a literal symbol in the replacement pattern).
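Python's built-in re module supports neither \G nor \K, so the sketch below swaps in a different technique: slice off the fixed-width prefix, then insert a pipe after every pair of characters in the remainder except the last. The function name insert_pipes and the prefix_len parameter are hypothetical, and the prefix length is taken from the sample as it appears here (single spaces), which may differ from the original file.

```python
import re


def insert_pipes(line: str, prefix_len: int) -> str:
    """Insert | after every 2 characters following the first prefix_len characters."""
    head, tail = line[:prefix_len], line[prefix_len:]
    # ..(?!$) matches a pair only when it is not the last pair on the line.
    return head + re.sub(r"..(?!$)", r"\g<0>|", tail)


prefix = "1478646603.255173000 10.10.10.1 "
sample = prefix + "0" * 22

print(insert_pipes(sample, len(prefix)))
```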

Regular Expressions - Greedy but stop before a string match

I have some data and I'd like to convert it into a table format.
Here's the input data
1- This is the 1st line with a
newline character
2- This is the 2nd line
Each line may contain multiple newline characters.
Output
<td>1- This the 1st line with
a new line character</td>
<td>2- This is the 2nd line</td>
I've tried the following
^(\d{1,3}-)[^\d]*
but it seems to match only till the digit 1 in 1st.
I'd like to be able to stop matching after I find another \d{1,3}\- in my string.
Any suggestions?
EDIT:
I'm using EditPad Lite.
This is for vim, and uses a zero-width positive lookahead:
/^\d\{1,3\}-\_.*[\r\n]\(\d\{1,3\}-\)\#=
Steps:
/^\d\{1,3\}- 1 to 3 digits followed by -
\_.* any number of characters including newlines/linefeeds
[\r\n]\(\d\{1,3\}-\)\#= followed by a newline/linefeed ONLY if it is followed
by 1 to 3 digits followed by - (the first condition)
EDIT: This is how it would be in pcre/ruby:
/(\d{1,3}-.*?[\r\n])(?=(?:\d{1,3}-)|\Z)/m
Note you need a string ending with a newline to match the last entry.
SEARCH: ^\d+-.*(?:[\r\n]++(?!\d+-).*)*
REPLACE: <td>$0</td>
[\r\n]++ matches one or more carriage-returns or linefeeds, so you don't have to worry about whether the file uses Unix (\n), DOS (\r\n), or older Mac (\r) line separators.
(?!\d+-) asserts that the first thing after the line separator is not another line number.
I used the possessive + in [\r\n]++ to make sure it matches the whole separator. Otherwise, if the separator is \r\n, [\r\n]+ could match the \r and (?!\d+-) could match the \n.
Tested in EditPad Pro, but it should work in Lite as well.
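The same idea can be sketched in Python. This is a simplified version that assumes \n line separators only, so the possessive [\r\n]++ is replaced by a plain \n; the names ENTRY and wrap_entries are hypothetical.

```python
import re

# An entry starts with digits and a dash, and continues over following
# lines as long as the next line does not start a new numbered entry.
ENTRY = re.compile(r"^\d+-.*(?:\n(?!\d+-).*)*", re.MULTILINE)


def wrap_entries(text: str) -> str:
    """Wrap each numbered entry, including its continuation lines, in <td> tags."""
    return ENTRY.sub(r"<td>\g<0></td>", text)


data = "1- This is the 1st line with a\nnewline character\n2- This is the 2nd line"
print(wrap_entries(data))
```

A continuation line that itself begins with digits and a dash would be treated as a new entry, which is the same limitation the search pattern above has.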
You did not specify a language (there are many regexp implementations), but in general, what you are looking for is called "positive lookahead", which lets you add patterns that will influence the match, but will not become part of it.
Search for lookahead in the documentation of whatever language you are using.
Edit: the following sample seems to work in vim.
:%s#\v(^\d+-\_.{-})\ze(\n\d+-|%$)#<td>\1</td>
Annotation below:
% - for all lines
s# - substitute the following (you can use any delimiter, and slash is most
common, but as that will require that we escape slashes in the command
I chose to use the number sign)
\v - very magic mode, lets us use fewer backslashes
( - start group for back referencing
^ - start of line
\d+ - one or more digits (as many as possible)
- - a literal dash!
\_. - any character, including a newline
{-} - zero or more of these (as few as possible)
) - end group
\ze - end match (anything beyond this point will not be included in the match)
( - start a new group
[\n\r] - newline (in any format - thanks Alan)
\d+ - one or more digits
- - a dash
| - or
%$ - end of file
) - end group
# - start substitute string
<td>\1</td> - a TD tag around the first matched group
(\d+-.+(\r|$)((?!^\d-).+(\r|$))?)
You can match only the separators and split on them. In C#, for example, it could be done like this:
string s = "1- This is the 1st line with a \r\nnewline character\r\n2- This is the 2nd line";
string ss = "<td>" + string.Join("</td>\r\n<td>", Regex.Split(s.Substring(3), "\r\n\\d{1,3}- ")) + "</td>";
MessageBox.Show(ss);
Would it be good for you to do it in 3 steps?
(these are perl regex):
Replace the first:
$input =~ s/^(\d{1,3})/<td>$1/;
Replace the rest
$input =~ s/\n(\d{1,3})/<\/td>\n<td>$1/gm;
Add the last:
$input .= '</td>';

Reg Ex question

What does the following reg ex code mean?
'/^\w{4,20}$/'
It means that string should contain from 4 to 20 word characters (letters, digits, and underscores). Here:
^ (caret) matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well
$ (dollar) matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break
\w shorthand character class matching word characters (letters, digits, and underscores). Can be used inside and outside character classes.
{n,m} where n >= 0 and m >= n Repeats the previous item between n and m times. Greedy, so repeating m times is tried before reducing the repetition to n times
Let me show you a usage example. Say, we have the file with the following contents:
[spongebob@conductor /tmp]$ cat file.txt
between4and20
therearetoomanyalphanumcharacters
foo
okay
Now you want to get only those strings which match your pattern '/^\w{4,20}$/':
[spongebob@conductor /tmp]$ grep -E '^\w{4,20}$' file.txt
between4and20
okay
The output contains only the lines that match the regular expression.
Also, don't confuse the ^ (caret) anchor with a ^ placed immediately after the opening [ of a character class. The latter negates the class, so it matches any single character not listed in it. For example, [^a-d] matches x (any character except a, b, c or d). A caret placed anywhere else inside the class is a literal caret.
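The same filter can be sketched in Python. re.fullmatch replaces the ^ and $ anchors, and re.ASCII limits \w to letters, digits, and underscore as described above (by default Python's \w also matches non-ASCII word characters). The variable names are only for illustration.

```python
import re

WORD_4_TO_20 = re.compile(r"\w{4,20}", re.ASCII)

lines = [
    "between4and20",
    "therearetoomanyalphanumcharacters",
    "foo",
    "okay",
]

# Keep only the lines made of 4 to 20 word characters.
matching = [line for line in lines if WORD_4_TO_20.fullmatch(line)]
print(matching)
```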
It means:
^ Between the beginning,
$ and the end of a given string,
\w{4,20} there should be only 4-20 Alphanumeric characters (like
a,b,c,d,1,2,3...etc, and also _)
I think you'll find Wikipedia's page on Regular Expressions a big, big help while learning regexes.
And just so there is no confusion, ^ and $ don't necessarily need each other.
If the regex was:
'/^\w{4,20}/'
That'd mean: The match should be at the start of the string, followed by 4-20 alphanumeric characters.
Example: in Foobar baz, the match is Foobar
And if the regex pattern was:
'/\w{4,20}$/'
That'd mean: The match should be at the end of the string, preceded by 4-20 alphanumeric characters
Example: in Foo barbaz, the match is barbaz
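A short Python sketch of the difference between anchoring at the start only, at the end only, and at both ends is given below. The variable names are hypothetical.

```python
import re

# Anchored at the start only: the match may stop before the end of the string.
start_only = re.search(r"^\w{4,20}", "Foobar baz")
# Anchored at the end only: the match may begin anywhere in the string.
end_only = re.search(r"\w{4,20}$", "Foo barbaz")
# Anchored at both ends: the space makes the whole string fail.
both = re.search(r"^\w{4,20}$", "Foobar baz")

print(start_only.group())
print(end_only.group())
print(both)
```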
/ opening delimiter
^ = start of string
\w = word character
{x,y} min max
$ = end of string
/ closing delimiter