Matching from a starting delimiter to an end delimiter (Regex/pattern matching) - regex

I am trying to match for a certain block of text.
The format of the text I want to match is
<pevz:url>https://some.server.com/arbitraryFoo.jpeg</pevz:url>
where only <pevz:url> and </pevz:url> are known.
My naive try was to match with
<pevz:url>*([0-9a-zA-Z:/._-])<\/pevz:url>
but that didn't work. I am using gedit to match with the default search and replace (no advanced-find).
How can I match for the whole string?
Best regards,
Joe Cocker

You can try:
<pevz:url>(.*?)<\/pevz:url>
or
<pevz:url>([^<]+)<\/pevz:url>
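For a quick check outside gedit, here is a minimal Python sketch of both approaches. The sample string comes from the question; the helper name extract_url is made up for illustration, and the negated-class variant is written with [^<] so that it stops at the opening bracket of the closing tag.

```python
import re

# Lazy variant: (.*?) expands only as far as needed to reach the closing tag.
LAZY = re.compile(r"<pevz:url>(.*?)</pevz:url>")
# Negated-class variant: the URL is any run of characters that are not '<'.
NEGATED = re.compile(r"<pevz:url>([^<]+)</pevz:url>")


def extract_url(text, pattern=LAZY):
    """Return the text between the tags, or None if there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else None


sample = "<pevz:url>https://some.server.com/arbitraryFoo.jpeg</pevz:url>"
print(extract_url(sample))
print(extract_url(sample, NEGATED))
```

Both calls should yield the same URL for this input. The variants only differ when the content between the tags can itself contain a <.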

Remove columns from CSV

I don't know anything about Notepad++ Regex.
This is the data I have in my CSV:
6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389
How can I use a regex to remove the 4th and 5th columns (the two numeric columns just before the final hash)? The numbers in those columns are variable in length.
To make it even worse, the User column can itself contain a |.
I could use a macro to fix this, but the file is a few million lines long.
This is the final result I want to achieve:
6454345|User1-2ds3|62562012032|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|411b0fdf54fe29745897288c6ad699f7be30f389
I am open to suggestions on how to do this with another program or command line utility, on either Linux or Windows.
Match: \|[^|]+\|[^|]+(\|[^|]+$)
Replace: $1
Basically, anchor to the end of the line and remove the two columns before the last one, i.e. columns [-3] and [-2]. (I assume columns can't be empty; replace + with * if they can.)
If you need finer control than that, I'd recommend writing a Java or Python script to manually parse and rewrite the file for you.
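A minimal Python sketch of such a script, reusing the end-anchored pattern from this answer. The function name drop_two_before_last is a placeholder, not anything from the question, and the rows are the sample data from the question.

```python
import re

# Two non-empty columns followed by the last column, anchored at line end.
PATTERN = re.compile(r"\|[^|]+\|[^|]+(\|[^|]+)$")


def drop_two_before_last(line):
    """Remove the two columns that precede the final column."""
    return PATTERN.sub(r"\1", line)


rows = [
    "6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1",
    "3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433",
    "8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389",
]
for row in rows:
    print(drop_two_before_last(row))
```

Because the pattern is anchored at the end of the line, extra | characters in the user column do not matter. For a file with millions of lines, the same function can be applied while iterating over the input file line by line and writing each result to a second file, so the whole file never has to sit in memory.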
I've captured three groups and given them names. If you use a replace utility like sed or Vim's regex, you can replace the remove group with nothing. Or you can use a programming language to concatenate keep_before and keep_after for the desired result.
^(?<keep_before>(?:[^|]+\|){3})(?<remove>(?:[^|]+\|){2})(?<keep_after>.*)$
You may have to remove the group names and use \1 etc. instead, depending on which environment you use.
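In Python, for instance, named groups are written (?P<name>...) rather than (?<name>...). The sketch below adapts the pattern accordingly; the helper name remove_middle is hypothetical. Because this pattern counts columns from the start of the line, it only gives the desired result when the user column contains no |.

```python
import re

# Same three groups as above, in Python's (?P<name>...) syntax.
PATTERN = re.compile(
    r"^(?P<keep_before>(?:[^|]+\|){3})"
    r"(?P<remove>(?:[^|]+\|){2})"
    r"(?P<keep_after>.*)$"
)


def remove_middle(line):
    """Concatenate keep_before and keep_after; leave non-matching lines alone."""
    match = PATTERN.match(line)
    if match is None:
        return line
    return match.group("keep_before") + match.group("keep_after")


row = "3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433"
print(remove_middle(row))
```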
In Notepad++, press Ctrl+H, then enter the following in the dialog:
Find what: \|\d+\|\d+(\|[0-9a-z]+)$
Replace with: $1
Search mode: Regular Expression
Click Replace All, and you're done.
Regex explained:
\|\d+ : matches the first | followed by one or more digits
\|\d+ : matches the second | followed by one or more digits
(\|[0-9a-z]+) : matches and captures the | and the string after the second number
$ : forces the match to end at the end of the line
Replacement:
$1 : replaces the matched text with the contents of the capture group, i.e. whatever (\|[0-9a-z]+) matched

regex to select only the zipcode

,Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,
I want to select only 13732 from the line. I came up with this regex:
(\d)(\s*\d+)*(\,y,,)
But it also selects the ,y,, part. If I remove that part from the regex, the regex also matches the date. Please help me with this.
Generally, if you want to require some text without including it in the match, use a zero-length lookaround (lookahead or lookbehind). In your case, you can use a lookahead:
(\d)(\s*\d+)*(?=\,y,,)
The syntax (?=<stuff>) means "followed by <stuff>, without consuming it".
More information on lookarounds can be found in this tutorial.
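As a small illustration of the lookahead, here is a Python sketch using the line from the question. It simplifies the digit part to \d+, which is an assumption on my side that the zip code contains no spaces.

```python
import re

line = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"

# (?=,y,,) asserts that ",y,," follows, but leaves it out of the match.
match = re.search(r"\d+(?=,y,,)", line)
zipcode = match.group(0) if match else None
print(zipcode)
```

The date and the street number are skipped because neither is immediately followed by ,y,,.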
Regex: \D*(\d{5})\D*
Explanation: match 5 digits surrounded by zero or more non-digits on both sides. Then you can extract the capture group containing the digits.
Here's the code in Python:
import re
string = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"
# Raw string so the backslashes reach the regex engine unchanged
search = re.search(r"\D*(\d{5})\D*", string)
print(search.group(1))
Output:
13732

Regex validation of filename failing

I'm trying to validate a filename having letters "CAT" or "DOG" followed by 8 numerics, and ending in ".TXT".
Examples:
CAT20000101.TXT
DOG20031212.TXT
This would NOT match:
ATA12330000.TXT
CAT200T0101.TXT
DOG20031212.TX1
Here's the regex I am trying to make work:
(([A-Z]{3})([0-9]{8})([\.TXT]))\w+
Why is the last section (.TXT) failing against non-matching file extensions?
See example: http://regexr.com/3a7fo
Inside a character class there is no grouping, so [\.TXT] matches a single character (., T or X) rather than the string .TXT.
You can use this regex:
^[A-Z]{3}[0-9]{8}\.TXT$
For only matching CAT and DOG use:
^(CAT|DOG)[0-9]{8}\.TXT$
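To sanity-check the pattern against the examples from the question, here is a short Python sketch. The helper name is_valid is invented for this illustration, and fullmatch takes the place of the ^ and $ anchors.

```python
import re

# CAT or DOG, exactly eight digits, then a literal ".TXT".
FILENAME = re.compile(r"(?:CAT|DOG)[0-9]{8}\.TXT")


def is_valid(name):
    """True only if the whole name matches the pattern."""
    return FILENAME.fullmatch(name) is not None


names = [
    "CAT20000101.TXT",
    "DOG20031212.TXT",
    "ATA12330000.TXT",
    "CAT200T0101.TXT",
    "DOG20031212.TX1",
]
for name in names:
    print(name, is_valid(name))
```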
lose the unnecessary parentheses
[A-Z]{3}[0-9]{8}[\.TXT]\w+
lose the unnecessary/pattern-breaking character class [] around \.TXT
[A-Z]{3}[0-9]{8}\.TXT\w+
lose the \w+ at the end
[A-Z]{3}[0-9]{8}\.TXT
change [A-Z]{3} to (?:CAT|DOG).
(?:CAT|DOG)[0-9]{8}\.TXT
voilà.
It's failing because \.TXT is in square brackets, which makes it match just one character out of ., T and X. Just use (\.TXT).
Remove the square brackets, changing [\.TXT] to \.TXT.
Your example modified http://regexr.com/3a7fu

VIM - Replace based on a search regex

I've got a file with several (1000+) records like :
lbc3.*'
ssa2.*'
lie1.*'
sld0.*'
ssdasd.*'
I can find them all by :
/s[w|l].*[0-9].*$
What I want to do is replace the final part of each matched line with \.*'
I can't do :%s//s[w|l].*[0-9].*$/\\\\\.\*' because it'll replace the whole string, and I only need to replace the end of it, from
.'
to
\.'
So the file output looks like:
lbc3\\.*'
ssa2\\.*'
lie1\\.*'
sld0\\.*'
ssdasd\\.*'
Thanks.
In general, the solution is to use a capture. Put \(...\) around the part of the regex that matches what you want to keep, and use \1 to include whatever matched that part of the regex in the replacement string:
s/\(s[w|l].*[0-9].*\)\.\*'$/\1\\.*'/
Since you're really just inserting a backslash between two strings that you aren't changing, you could use a second set of parens and \2 for the second one:
s/\(s[w|l].*[0-9].*\)\(\.\*'\)$/\1\\\2/
Alternatively, you could use \zs and \ze to delimit just the part of the string you want to replace:
s/s[w|l].*[0-9].*\zs\ze\.\*'$/\\/
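The same capture-and-backreference idea works outside Vim. As a rough Python sketch (the function name escape_tail is invented here, and the s[w|l]...[0-9] prefix filter from the question is left out for brevity), this inserts one backslash before a trailing .*':

```python
import re

# Group 1 captures the trailing .*' so it can be re-emitted after the backslash.
TAIL = re.compile(r"(\.\*')$")


def escape_tail(line):
    # In the replacement template, \\ is a literal backslash and \1 is group 1.
    return TAIL.sub(r"\\\1", line)


for record in ["lbc3.*'", "ssa2.*'", "ssdasd.*'"]:
    print(escape_tail(record))
```

Lines that do not end in .*' are returned unchanged.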

Regex - how to get time and date and get ISO8601 timestamp

I have this text
2014-01-30 10:15 some text here
2014-01-30 10:20 some other text here
I need a regex that matches a timestamp group in ISO 8601 format.
Required output:
2014-01-30T10:15Z
2014-01-30T10:20Z
With this regex I can't get what I want: replace the space with 'T' and append a 'Z' at the end.
^(?<timestamp>\S+ \S+)
Does anyone know how to solve this problem?
--- UPDATE ---
BTW, I'm using http://rubular.com/ to test my regex
You could perhaps modify your current regex a bit to:
^(\S+) (\S+).*
And replace with $1T$2Z
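For reference, a Python sketch of the same replacement applied to the two sample lines. Python writes backreferences as \g<1> or \1 instead of $1, and re.MULTILINE is needed so that ^ matches at the start of each line.

```python
import re

text = (
    "2014-01-30 10:15 some text here\n"
    "2014-01-30 10:20 some other text here"
)

# re.MULTILINE makes ^ match at the start of every line.
# \g<1> is used instead of \1 so the group number is unambiguous.
result = re.sub(r"^(\S+) (\S+).*", r"\g<1>T\g<2>Z", text, flags=re.MULTILINE)
print(result)
```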
\d{4}-\d{2}-\d{2} \d{2}:\d{2} will match the required format – validation is another story though (if you need it).
You can do something like if (regex match) { replace " " with "T"; append "Z" }
If this doesn't help you or it is unclear it is because your question was vague.
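The match-then-rewrite idea, including the validation step, could look roughly like this in Python. The function name to_iso is hypothetical, and appending Z assumes the timestamps are already in UTC, which the question implies but does not state.

```python
import re
from datetime import datetime

# Shape check only: four digits, two, two, then HH:MM at the start of the line.
STAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}")


def to_iso(line):
    """Return the leading timestamp as YYYY-MM-DDTHH:MMZ, or None."""
    match = STAMP.match(line)
    if match is None:
        return None
    try:
        # strptime rejects impossible values such as month 13 or hour 25.
        parsed = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M")
    except ValueError:
        return None
    return parsed.strftime("%Y-%m-%dT%H:%MZ")


print(to_iso("2014-01-30 10:15 some text here"))
```

The regex only checks the shape of the text; the strptime call is what rejects values that look right but are not real dates or times.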
Edit: you didn't specify what language you're writing this in. This is how you would do the replacement.
In PHP:
$str = preg_replace('/^(\S+) (\S+).*/', "$1T$2Z", $str);
In Perl:
$str =~ s/^(\S+) (\S+).*/$1T$2Z/;
In Notepad++:
Find what: ^(\S+) (\S+).*
Replace with: $1T$2Z
With:
(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}) (?:.*)
you can capture 2014-01-30 and 10:15 in separate groups (and ignore the rest of the text in a non-capturing group).
Then you join the two groups with 'T' and append 'Z' at the end.
See demo at:
http://rubular.com/r/4icGfcIixa
Regex syntax differs a bit from language to language, so it would help if you told us which language you are using.
For example, in JavaScript, you can do something like this:
"2014-01-30 10:15 some text here".replace(/(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2})\s?.*/,"$1T$2Z")
Where the string can be a variable.
If you have multi-line text, then you should add the g flag at the end of the regex:
"2014-01-30 10:15 some text here\n2014-01-30 10:20 some other text here".replace(/.*(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2})\s?.*/g,"$1T$2Z")