Positive lookbehind doesn't work with plus - regex

I'm trying to select the position right after the pattern <Art><dot><digits><dot>
Sample text:
Art. 83.
xxx xxx xxx
Art. 3.
xxx xxx xxx
So far I have tried the pattern below, but the selection fails as soon as I add + to \d. Why?
(?<=Art..\d\d.).
How can I select the text that follows the pattern when the number of digits varies?
Edit 1
OK, I need to add a new line after every occurrence of the pattern Art. <digits, length unknown>.
Input
Art. 3.
xxx xxx xxx
Output
Art. 3.

xxx xxx xxx
Edit 2
I am looking for a solution for Java / Android, or for Notepad++.

You are using lookbehind, not lookahead, and lookbehind has limitations in most implementations. From http://www.regular-expressions.info/lookaround.html:
The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. The regular expression engine needs to be able to figure out how many characters to step back before checking the lookbehind. When evaluating the lookbehind, the regex engine determines the length of the regex inside the lookbehind, steps back that many characters in the subject string, and then applies the regex inside the lookbehind from left to right just as it would with a normal regex.
In your case, maybe you can use an expression that matches text and uses the match in the replacement. For example, in Java:
String original = "Art. 3.\nxxx xxx xxx";
String replaced = original.replaceAll("Art\\. \\d+\\.", "$0\n");
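The same limitation and the same workaround can be seen in Python's built-in re module, which also rejects a variable-length lookbehind. This is only a sketch: the sample text is taken from the question, and the variable names are made up for illustration.

```python
import re

text = "Art. 83.\nxxx xxx xxx\nArt. 3.\nxxx xxx xxx"

# Python's re module refuses a lookbehind whose length can vary (\d+).
try:
    re.compile(r"(?<=Art\. \d+\.)")
    variable_lookbehind_ok = True
except re.error:
    variable_lookbehind_ok = False

# Workaround: match the marker itself, then re-emit it (\g<0>) plus a newline.
result = re.sub(r"Art\. \d+\.", r"\g<0>\n", text)

print(variable_lookbehind_ok)
print(result)
```

Because the marker is matched and written back, no lookbehind is needed and the digit count can be anything.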

You are not able to use variable length look behinds in most implementations of regex. However, you should be able to solve this without the look behind:
# Match your string in a group
/(Art\.\s\d+\.)/g
# Replace and append a new line to $1 match group
$1\n
Example: http://regex101.com/r/fW5jO7
You didn't say which language you are using, but here is a PHP implementation:
preg_replace('/(Art\.\s\d+\.)/', "$1\n", $text);

In perl:
for (<DATA>) {
print;
print("\n") if (/Art\. \d+\./);
}
__DATA__
whatver
stuff
Art. 83.
123 456 789
Art. 3.
987 654 321
more
stuff

Use grouping instead of lookbehind
(Art\.\s\d+\.)(.)
and then get group 2 (enable the dot-matches-newline option if the character right after the pattern is a line break)
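If what you need is the position after the pattern rather than a character, the end offset of each match gives it directly. A minimal Python sketch, assuming the sample text from the question; `positions` and `following` are names invented here.

```python
import re

text = "Art. 83.\nxxx xxx xxx\nArt. 3.\nxxx xxx xxx"

# m.end() is the index just past each "Art. <digits>." marker,
# i.e. the place a lookbehind would have selected.
positions = [m.end() for m in re.finditer(r"Art\. \d+\.", text)]

# Take the line of text that follows each marker.
following = [text[p:].lstrip("\n").split("\n", 1)[0] for p in positions]

print(positions)
print(following)
```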

A Notepad++ solution:
Find what: ^(Art\.\s*\d+\.)
Replace with: $1\n
You may want CRLF line endings instead; in that case use Replace with: $1\r\n

Related

regex to select only the zipcode

,Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,
I want to select only 13732 from the line. I came up with this regex:
(\d)(\s*\d+)*(\,y,,)
But it also selects the ,y,, part. If I remove that part from the regex, the regex also matches the date. Please help me with this.
Generally, if you want to require some text next to your match without including it in the match, use a zero-length lookaround (lookahead or lookbehind). In your case, you can use a lookahead:
(\d)(\s*\d+)*(?=\,y,,)
The syntax (?=<stuff>) means "followed by <stuff>, without matching it".
More information on lookarounds can be found in this tutorial.
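To see that the lookahead really leaves ,y,, out of the match, here is a small Python check using the pattern above on the line from the question. The variable names are only for illustration.

```python
import re

line = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"

# The lookahead (?=,y,,) must succeed right after the digits,
# but the text it inspects is not part of the match.
m = re.search(r"(\d)(\s*\d+)*(?=\,y,,)", line)
zipcode = m.group(0) if m else None
rest = line[m.end():] if m else None

print(zipcode)
print(rest)
```

The date is skipped because it is not followed by ,y,, and the match ends exactly before that suffix.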
Regex: \D*(\d{5})\D*
Explanation: match 5 digits surrounded by zero or more non-digits on both sides. Then you can extract group containing the match.
Here's the code in Python:
import re
string = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"
# Raw string so that \D and \d reach the regex engine unchanged
search = re.search(r"\D*(\d{5})\D*", string)
print(search.group(1))
Output:
13732

How do you "quantify" a variable number of lines using a regexp?

Say you know the starting and ending lines of some section of text, but the chars in some lines and the number of lines between the starting and ending lines are variable, à la:
aaa
bbbb
cc
...
...
...
xx
yyy
Z
What quantifier do you use, something like:
aaa\nbbbb\ncc\n(.*\n)+xx\nyyy\nZ\n
to parse those sections of text as a group?
You can use the s flag, which makes the dot match line breaks too, so a pattern can span multiple lines. You can do it like:
~\w+ ~s.
There is a similar question here:
Javascript regex multiline flag doesn't work
If I understood correctly, you know that your text begins with aaa\nbbbb\ncc and ends with xx\nyyy\nZ\n. You could use aaa.+?bbbb.+?cc(.+?)xx.+?yyy.+?Z, where every quantifier is lazy so that you don't accidentally capture two sections at once. The text in between would be in match group 1. You also need to turn on the setting that makes the dot match newlines.
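In Python the setting in question is the re.DOTALL flag. The sketch below assumes three made-up middle lines and spells out the known start and end lines literally; only the lazy group in the middle is allowed to span lines.

```python
import re

# "line 1".."line 3" are hypothetical stand-ins for the variable section.
text = "aaa\nbbbb\ncc\nline 1\nline 2\nline 3\nxx\nyyy\nZ\n"

# re.DOTALL lets "." match "\n"; the lazy (.*?) stops at the first
# occurrence of the closing lines instead of the last one.
pattern = re.compile(r"aaa\nbbbb\ncc\n(.*?)xx\nyyy\nZ\n", re.DOTALL)
m = pattern.search(text)
middle_lines = m.group(1).splitlines() if m else []

print(middle_lines)
```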
Try this:
aaa( |\n)bbbb( |\n)cc( |\n)( |\n){0,1}(.|\n)*xx( |\n)yyy( |\n)Z
( |\n) matches a space or a newline (so your starting and ending phrases can be split into different lines)
RegExr
At the end of the day what worked for me using Kate was:
( )+aaa\n( )+bbbb\n( )+cc\n(.|\n)*( )+xx\n( )+yyy\n( )+Z\n
Using such regexps you can clear quite a bit of junk out of pages.

regular expression to match strings with decimals

I'm trying to create a regex which will do the following:
Name description: "QUARTERLY PATCH FOR XAQE (JUL 2013 - 11.2.0.3.20) : (125546467)"
Val version : 11.2.0.3.4
In order to output:
"Name, 11.2.0.3.20"
"Val, 11.2.0.3.4"
I have created the following regex: /^([\w]+).*([\d\.\d]+).*/, but it is only matching the last number in the 2nd group, i.e. in 11.2.0.3.4 it will only match 4. Could anyone help?
Also, there could be more than the two lines given above, so it needs to account for arbitrary lines where the version number could be anywhere in the line.
You can use a one-liner for this as well:
perl -lne '/(\w+).*?(\d+(\.\d+)+)/; print "$1, $2"' <filename>
__END__
Name, 11.2.0.3.20
Val, 11.2.0.3.4
If you are only planning for the output and not doing any processing over the captured groups, then this will do:
$str =~ s/([\n\r]|^)(Name|Val).*?(\d+(\.\d+)+).*/$1"$2, $3"/g;
Your problem is that .* is greedy and will consume as much as it can whilst the pattern still matches. One solution is to make it lazy: .*?
Also [\d\.\d]+ means match one of \d, \. and \d, so it's the same as [\d.]+ which isn't what you want since it would match "2013" in the first line. \d+(\.\d+)+ is more suitable.
After those 2 changes you have:
^([\w]+).*?(\d+(\.\d+)+).*
RegExr
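For comparison, here is the question's pattern next to the corrected one, run in Python on the two sample lines. It is only a sketch; the inner group is made non-capturing here, which is my own small change and does not affect what group 2 holds.

```python
import re

lines = [
    'Name description: "QUARTERLY PATCH FOR XAQE (JUL 2013 - 11.2.0.3.20) : (125546467)"',
    "Val version : 11.2.0.3.4",
]

# Pattern from the question: the greedy .* leaves a single character for group 2.
greedy = re.compile(r"^([\w]+).*([\d\.\d]+).*")
# Corrected pattern: lazy .*? followed by an explicit digits(.digits)+ version.
lazy = re.compile(r"^([\w]+).*?(\d+(?:\.\d+)+).*")

greedy_versions = [greedy.match(line).group(2) for line in lines]
results = [", ".join(lazy.match(line).group(1, 2)) for line in lines]

print(greedy_versions)
for r in results:
    print(r)
```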

Regex - how to get time and date and get ISO8601 timestamp

I have this text
2014-01-30 10:15 some text here
2014-01-30 10:20 some other text here
I need a regex that matches a timestamp group in ISO 8601 format.
Required output:
2014-01-30T10:15Z
2014-01-30T10:20Z
With this regex I can't get what I want: replace the space with 'T' and append a 'Z' at the end.
^(?<timestamp>\S+ \S+)
Does anyone know how to solve this problem?
--- UPDATE ---
BTW, I'm using http://rubular.com/ to test my regex
You could perhaps modify your current regex a bit to:
^(\S+) (\S+).*
And replace with $1T$2Z
regex101 demo
\d{4}-\d{2}-\d{2} \d{2}:\d{2} will match the required format – validation is another story though (if you need it).
You can do something like if (regex match) { replace " " with "T"; append "Z" }
If this doesn't help you or it is unclear it is because your question was vague.
Edit: you didn't specify what language you're writing this in. That is how you would do your replacements.
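If validation matters, one option is to parse the timestamp rather than only pattern-match it. This is a rough Python sketch: `to_iso8601` is a hypothetical helper, and appending Z assumes the logged times are already in UTC, which the question does not state.

```python
from datetime import datetime

def to_iso8601(line):
    """Convert a leading 'YYYY-MM-DD HH:MM' into 'YYYY-MM-DDTHH:MMZ'.

    Raises ValueError if the first two fields are not a real date and time.
    Assumption: the timestamp is already UTC, so 'Z' is appropriate.
    """
    stamp = " ".join(line.split(" ", 2)[:2])
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    return parsed.strftime("%Y-%m-%dT%H:%MZ")

print(to_iso8601("2014-01-30 10:15 some text here"))
```

Unlike \d{4}-\d{2}-\d{2}, this rejects impossible values such as month 13.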
In php:
preg_replace('/^(\S+) (\S+).*/', "$1T$2Z", $str);
In perl:
$str =~ s/^(\S+) (\S+).*/$1T$2Z/;
In notepad++
Find what: ^(\S+) (\S+).*
Replace with: $1T$2Z
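The same find and replace, sketched in Python for a multi-line input. The re.MULTILINE flag is what makes ^ match at the start of every line; the variable names are illustrative.

```python
import re

text = "2014-01-30 10:15 some text here\n2014-01-30 10:20 some other text here"

# \1 and \2 are the date and time fields; .* drops the rest of each line.
converted = re.sub(r"^(\S+) (\S+).*", r"\1T\2Z", text, flags=re.MULTILINE)

print(converted)
```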
With:
(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}) (?:.*)
You can capture the date (2014-01-30) and the time (10:15) in separate groups, with the spaces kept outside the groups and the remaining text ignored.
Then you put 'T' between the first and second groups and append 'Z' after the second group (10:15).
See demo at:
http://rubular.com/r/4icGfcIixa
Regex is a bit different from language to language, it could help if you told us what language you are using.
For example, in javascript, you can do something like this:
"2014-01-30 10:15 some text here".replace(/(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2})\s?.*/,"$1T$2Z")
Where the string can be a variable.
If you have multi-line text, then you should add a g at the end of the regex:
"2014-01-30 10:15 some text here\n2014-01-30 10:20 some other text here".replace(/.*(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2})\s?.*/g,"$1T$2Z")

Regex expression to find file extension in a file with multiple periods

How would you write a regular expression to find the file extension of the following files, keeping in mind that what I am looking for is the ".pdf" or ".xls" portion of the string?
REPORTPDF.20130810.pdf.pgp
REPORTXLS.20130810.xls.pgp
EDIT:
The resulting filenames I want to end up with are the following:
REPORT20130810.PDF
REPORT20130810.XLS
I am on a Windows platform. I've played around with this a bit at http://regexpal.com/ but so far I can only figure out how to match the date:
([0-9]{4}[0-9]{2}[0-9]{2})
Using sed:
sed 's/^\(.*[^.]*\)\.[^.]*$/\1/' <<< "REPORTPDF.20130810.pdf.pgp"
REPORTPDF.20130810.pdf
Using grep -P (PCRE regex):
grep -oP '^.+[^.]+(?=\.[^.]+$)' <<< "REPORTPDF.20130810.pdf.pgp"
REPORTPDF.20130810.pdf
.+\.(\w+)\.\w+$ delivers the last-but-one extension as group 1. How you access that group depends on the host language you run the regex in.
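As one example of accessing that group, here is how it might look in Python, using the two file names from the question. The variable names are placeholders.

```python
import re

names = ["REPORTPDF.20130810.pdf.pgp", "REPORTXLS.20130810.xls.pgp"]

# Group 1 is the extension that sits just before the final one (.pgp).
pattern = re.compile(r".+\.(\w+)\.\w+$")
extensions = [pattern.match(name).group(1) for name in names]

print(extensions)
```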
If you don't need the file extension to be capitalized, this should work
([a-zA-Z]+)\.([0-9]{4}[0-9]{2}[0-9]{2})\.(xls|pdf)\.pgp
Matches:
REPORTXLS.20130810.xls.pgp
And then the groups you'd use are two and three
REPORT\2.\3
Result:
REPORT20130810.xls
Problem is that you don't provide much context for how you're going about changing these file names.
You don't say what language/library you're using, but this Perl one-liner does the trick:
perl -lpe "s/^([^.]*)(...)\.(\d+)(\.\2)\.pgp/\1\3\4/i; $_=uc"
I think this will work for you :)
^(([A-Z a-z]*)(?:XLS.|PDF.)(\d{8})(.pdf|.xls))
^ starts at the beginning of the string
([A-Z a-z]*) any letters or spaces before
\d any digit 0-9
{8} exactly 8 repetitions of the preceding element (in this case 8 digits 0-9)
?: makes a group non-capturing
I wrapped the capture groups into one large one so the thing that you want will be in the first capture group :).
This can probably be replaced
([A-Z a-z]*)
with
(REPORT)
This (.*?(?:\..*)?)(\..*) will hold things like:
'hello.1a.2bb.3' ---> group(1) == 'hello.1a.2bb', group(2) == '.3'
'yep.1' ---> group(1) == 'yep', group(2) == '.1'
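A quick way to check those two claims is to run the pattern in Python. This is just a verification sketch; `split_last_extension` is a name made up for it.

```python
import re

pattern = re.compile(r"(.*?(?:\..*)?)(\..*)")

def split_last_extension(name):
    """Return (everything before the last dot, last extension including the dot)."""
    m = pattern.match(name)
    return (m.group(1), m.group(2)) if m else None

for name in ("hello.1a.2bb.3", "yep.1"):
    print(name, "--->", split_last_extension(name))
```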
If the format is pretty much fixed you could use
(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)
and cherry pick replacement based on what you want
I used Java here, but the regex match would still be the same
String a = "REPORTPDF.20130810.pdf.pgp".replaceAll(
    "(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
    "$1--$2--$3--$4--$5");
String b = "REPORTXLS.20130810.xls.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1--$2--$3--$4--$5");
System.out.println(a);
System.out.println(b);
REPORT--PDF--20130810--pdf--pgp
REPORT--XLS--20130810--xls--pgp
In your case, use "$1$3.$2" as the replacement:
String b = "REPORTXLS.20130810.xls.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1$3.$2");
which produces the intended result:
REPORT20130810.XLS
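For anyone not on Java, a rough Python equivalent is sketched below. Python's re only accepts possessive quantifiers such as ++ from version 3.11 on, so this version uses plain [^.]+ instead, which matches the same text for these file names. The `rename` helper is hypothetical.

```python
import re

PATTERN = re.compile(r"(REPORT)([^.]+)[.]([^.]+)[.]([^.]+)[.](pgp)")

def rename(name):
    # Groups: 1=REPORT, 2=PDF/XLS, 3=date, 4=pdf/xls, 5=pgp.
    # \g<1>\g<3>.\g<2> mirrors the Java replacement "$1$3.$2".
    return PATTERN.sub(r"\g<1>\g<3>.\g<2>", name)

print(rename("REPORTPDF.20130810.pdf.pgp"))
print(rename("REPORTXLS.20130810.xls.pgp"))
```

The upper-case extension comes from group 2 (the PDF or XLS inside the original name), so no case conversion is needed.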