How to better this regex? - regex

I have a list of strings like this:
/soccer/poland/ekstraklasa-2008-2009/results/
/soccer/poland/orange-ekstraklasa-2007-2008/results/
/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/
From each string I want to take a middle part resulting in respectively:
ekstraklasa
orange ekstraklasa
orange ekstraklasa youth
My code here does the job but it feels like it can be done in fewer steps and probably with regex alone.
name = re.search(r'/([-a-z\d]+)/results/', string).group(1)  # take the middle part
name = re.search(r'[-a-z]+', name).group()  # trim numbers
if name.endswith('-'):
    name = name[:-1]  # trim trailing `-` if needed
name = name.replace('-', ' ')
Can anyone see how to make it better?

This regex should do the work:
/(?:\/\w+){2}\/([\w\-]+)(?:-\d+){2}/
Explanation:
(?:\/\w+){2} - eat the first two words delimited by /
\/ - eat the next /
([\w\-]+) - match the word characters or hyphens (this is what we're looking for)
(?:-\d+){2} - eat the hyphens and the numbers after the part we're looking for
The result is in the first match group
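A minimal Python sketch of this answer, assuming the input always ends in two hyphenated numbers as in the question. The names PATTERN and league_name are made up for illustration, and the slashes are left unescaped because Python does not use / as a regex delimiter.

```python
import re

# Two leading path segments, then the name (captured), then two "-digits" groups.
PATTERN = re.compile(r'(?:/\w+){2}/([\w-]+)(?:-\d+){2}')

def league_name(path):
    """Return the middle part with hyphens turned into spaces, or None."""
    m = PATTERN.search(path)
    if m is None:
        return None
    return m.group(1).replace('-', ' ')

for p in ('/soccer/poland/ekstraklasa-2008-2009/results/',
          '/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/'):
    print(league_name(p))
```

The greedy ([\w-]+) first swallows the years as well, then backtracks until the two trailing -\d+ groups can match, which is what leaves only the name in group 1.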

I can't test it because I am not using Python, but I would use an expression like
^(/soccer/poland/)([a-z\-]*)(.*)$
or
^(/[a-z]*/[a-z]*/)([a-z\-]*)(.*)$
This expression matches "/soccer/poland/" at the beginning, then any run of lowercase a to z or "-", and then the rest of the string.
Then take the 2nd group.
The groups should hold these strings:
/soccer/poland/
orange-ekstraklasa-youth-
2010-2011/results/
Then simply replace "-" with " " and trim the surrounding spaces.
PS: If you are using regex101.com, for example, you need to escape / and test with a single line of input.
Expression
^(\/soccer\/poland\/)([a-z\-]*)(.*)$
And one line of your input:
/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/
If you prefer to use the expression for more than just soccer and poland, use
^(\/[a-z]*\/[a-z]*\/)([a-z\-]*)(.*)$
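Since the answer above was not tested in Python, here is a small sketch of the same three-group idea. The names PATTERN and middle_part are hypothetical, and the slashes are not escaped because Python's re module does not need that.

```python
import re

# Group 1: two leading path segments; group 2: letters and hyphens; group 3: the rest.
PATTERN = re.compile(r'^(/[a-z]*/[a-z]*/)([a-z-]*)(.*)$')

def middle_part(path):
    """Take group 2, turn hyphens into spaces and trim the trailing space."""
    m = PATTERN.match(path)
    if m is None:
        return None
    return m.group(2).replace('-', ' ').strip()

print(middle_part('/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/'))
```

Group 2 stops at the first digit, so it still ends with a hyphen; that is why the trim step after the replacement is needed.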

Related

Regular expression to extract string from urls

I need to extract a string from an URL. Here are some examples:
Input: https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html – Output: bas-026-009
Input: https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html – Output: aw18-245-b86
Input: https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html – Output: ss20-028-e70
I want to be able to extract the string that goes from the first character after the "/eur_en/" until the third dash. Can someone help me? Thanks
You're looking for regexp: \/eur_en\/([^-]+-[^-]+-[^-]+)
Play & test it at regex101: https://regex101.com/r/RvGROG/1
You need something like this:
const urls = [
  "https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html",
  "https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html",
  "https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html",
]
const rg = new RegExp(`\/eur_en\/([^-]+-[^-]+-[^-]+)`)
const strs = urls.map(url => url.match(rg)[1])
console.log(strs)
// Output:
// [
//   "bas-026-009",
//   "aw18-245-b86",
//   "ss20-028-e70"
// ]
Of course, this is a simple example. In real code, don't forget to check that .match did not return null before indexing into the result.
The first element of the returned array is the full matched string, and the second (and any later ones) are the substrings captured by the parentheses.
We can improve and complicate our regex like so:
\/((?:[^-\/]+-){2}[^-\/]+)
This lets us drop the specific /eur_en/ anchor and control the number of dash-separated parts.
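For comparison, a Python sketch of both patterns from this answer with the no-match check included. ANCHORED, GENERIC and product_code are illustrative names, not part of the original answer.

```python
import re

# Pattern tied to the /eur_en/ prefix.
ANCHORED = re.compile(r'/eur_en/([^-]+-[^-]+-[^-]+)')
# Generic variant: any path segment that starts with three dash-separated parts.
GENERIC = re.compile(r'/((?:[^-/]+-){2}[^-/]+)')

def product_code(url, pattern=ANCHORED):
    """Return the captured code, or None when the URL does not match."""
    m = pattern.search(url)
    return m.group(1) if m else None

url = "https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html"
print(product_code(url))
print(product_code(url, GENERIC))
```

In Python, search returns None rather than an array when nothing matches, so the check is on the match object itself.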
The expression you're looking for is the following:
/(?<=eur_en\/)[^-]*-[^-]*-[^-]*/
Here is how it works:
(?<=eur_en\/): looks behind for eur_en/ but does not include it in the match
[^-]*: matches any run of characters that are not a dash, so it takes everything up to the first dash (not including the dash)
-[^-]*: matches the first dash, then everything up to the second dash (not including it)
-[^-]*: matches the second dash, then everything up to the third dash (not including it).
/(?<=\/eur_en\/)\w+-\w+-\w+/g
Tokens and descriptions:
(?<=\/eur_en\/) - Lookbehind: if /eur_en/ is found, match whatever follows it.
\w+-\w+-\w+ - Three runs of one or more word characters ([A-Za-z0-9_]) separated by literal hyphens.
Review: https://regex101.com/r/Ge0zA3/1
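The lookbehind variant can be sketched in Python as well, assuming the fixed prefix /eur_en/ (Python's re module only accepts fixed-width lookbehinds, which this one is). CODE and extract_code are hypothetical names.

```python
import re

# The lookbehind keeps /eur_en/ out of the match, so group(0) is the code itself.
CODE = re.compile(r'(?<=/eur_en/)\w+-\w+-\w+')

def extract_code(url):
    """Return the three dash-separated parts after /eur_en/, or None."""
    m = CODE.search(url)
    return m.group(0) if m else None

print(extract_code("https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html"))
```

With a lookbehind there is no capture group to index, which is the main practical difference from the capture-group answers.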

Get segment of string in between characters

I have a giant data set that includes lots of file names with various parts of strings that I need to grab.
I have this code segment currently:
def fps(data):
    for i in data:
        pattern = r'.(\d{4}).'  # finds data in between the periods
        frames = re.findall(pattern, ' '.join(data))  # puts info into frames list
        frames.sort()
        for i in range(len(frames)):  # turns the str into integers
            frames[i] = int(frames[i])
        return frames
This is great and all but it only returns 4 characters after and before a period.
How would I grab part of the string after a period and before the next period.
Preferably without using regular expressions, because they're a little too complex for a simpleton like me.
For example:
One string may look like this
string = ['filename.0530.extension']
while the others may look like this
string2 = ['filename.042.extension']
string3 = ['filename.045363.extension']
I would need to output the numbers in between the periods on the terminal so:
0530, 042, 045363
To match your example data you could match a dot, capture one or more digits in a group with \d+ (instead of exactly four with \d{4}), and then match another dot:
\.(\d+)\.
If you want to match everything between the dots, you might use a negated character class [^.] to match any character that is not a dot:
\.([^.]+)\.
Note that if you want to match a literal dot you should escape it \.
Demo
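A short sketch of the \.(\d+)\. pattern applied to the file names from the question, plus a regex-free alternative since the asker would prefer one. Both function names are made up, and the results are kept as strings because converting to int would drop the leading zeros.

```python
import re

def frame_numbers(names):
    """Collect the digits found between two dots in each file name."""
    frames = []
    for name in names:
        frames.extend(re.findall(r'\.(\d+)\.', name))
    return frames

def frame_numbers_split(names):
    """Regex-free variant: take the second dot-separated part of each name."""
    return [name.split('.')[1] for name in names if name.count('.') >= 2]

names = ['filename.0530.extension', 'filename.042.extension', 'filename.045363.extension']
print(', '.join(frame_numbers(names)))
print(', '.join(frame_numbers_split(names)))
```

The split variant assumes the number is always the second part of the name; the regex variant only assumes it sits between two dots.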
To match the numbers between your periods in your example, you can use this:
^.*\.[^.\s]*?\.?(\d+)\..*$
Here's an online example

Regex to grab formulas

I am trying to parse a file that contains parameter attributes. The attributes are setup like this:
w=(nf*40e-9)*ng
but also like this:
par_nf=(1) * (ng)
The issue is, all of these parameter definitions are on a single line in the source file, and they are separated by spaces. So you might have a situation like this:
pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0
The current algorithm just splits the line on spaces and then for each token, the name is extracted from the LHS of the = and the value from the RHS. My thought is if I can create a Regex match based on spaces within parameter declarations, I can then remove just those spaces before feeding the line to the splitter/parser. I am having a tough time coming up with the appropriate Regex, however. Is it possible to create a regex that matches only spaces within parameter declarations, but ignores the spaces between parameter declarations?
Try this RegEx:
(?<=^|\s) # Start of each formula (start of line OR [space])
(?:.*?) # Attribute Name
= # =
(?: # Formula
(?!\s\w+=) # DO NOT Match [space] Word Characters = (Attr. Name)
[^=] # Any Character except =
)* # Formula Characters repeated any number of times
When checking formula characters, it uses a negative lookahead to check for a Space, followed by Word Characters (Attribute Name) and an =. If this is found, it will stop the match. The fact that the negative lookahead checks for a space means that it will stop without a trailing space at the end of the formula.
Live Demo on Regex101
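A Python adaptation of this idea, as a sketch only: Python's re module requires fixed-width lookbehinds, so the (?<=^|\s) part is dropped here and \w+ stands in for the attribute name. DECLARATION and split_declarations are hypothetical names.

```python
import re

# name, "=", then formula characters until a space that starts the next "name=".
DECLARATION = re.compile(r'\w+=(?:(?!\s\w+=)[^=])*')

def split_declarations(line):
    """Return each 'name=formula' declaration, keeping spaces inside formulas."""
    return [m.group(0) for m in DECLARATION.finditer(line)]

line = "pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0"
for decl in split_declarations(line):
    print(decl)
```

Each match stops right before the space that precedes the next parameter name, so par_nf=(1) * (ng) stays in one piece.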
Thanks to @Andy for the tip:
In this case I'll probably just match on the parameter name and equals, but replace the preceding whitespace with some other "parse-able" character to split on, like so:
(\s*)\w+[a-zA-Z_]=
Now my first capturing group can be used to insert something like a colon, semicolon, or line-break.
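A related sketch in Python, assuming parameter names consist of word characters only: instead of inserting a new separator, split directly on the whitespace that precedes a name followed by "=". The function name parse_params is made up for illustration.

```python
import re

def parse_params(line):
    """Split on whitespace that is followed by 'name=', then split name and value."""
    params = {}
    for token in re.split(r'\s+(?=\w+=)', line.strip()):
        name, _, value = token.partition('=')
        params[name] = value.replace(' ', '')  # drop spaces inside the formula
    return params

line = "pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0"
print(parse_params(line))
```

The lookahead leaves the parameter name in the next token, so only the separating whitespace is consumed by the split.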
You need to add Perl tag. :-( Maybe this will help:
I ended up using this in C#. The idea is to break the line into name/value pairs, using a negative lookahead for the next key to stop one match and start a new one. Maybe this helps:
var data = @"pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0";
var pattern = @"
(?<Key>[a-zA-Z_\s\d]+)           # Key is any alpha, digit and _
=                                # = is a hard anchor
(?<Value>[.*+\-\\\/()\w\s]+)     # Value is any combinations of text with space(s)
(\s|$)                           # Soft anchor of either a \s or EOB
((?!\s[a-zA-Z_\d\s]+\=)|$)       # Negative lookahead to stop matching if a space then key then equal found or EOB
";
Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture)
     .OfType<Match>()
     .Select(mt => new
     {
         LHS = mt.Groups["Key"].Value,
         RHS = mt.Groups["Value"].Value
     });

Regex: Removing Space Between Quotes, And Stopping Before a Colon (With Yahoo Pipes)

I've been working on this for a while, but it's beyond my understanding of regex.
I'm using Yahoo Pipes on an RSS, and I want to create hashtags from titles; so, I'd like to remove space from everything between quotes, but, if there's a colon within the quotes, I only want the space removed between the words before the colon.
And, it would be great if I could also capture the unspaced words as a group, to be able to use: #$1 to output the hashtag in one step.
So, something like:
"The New Apple: Worlds Within Worlds" Before We Begin...
Could be substituted like #$1 - with this result:
"#TheNewApple: Worlds Within Worlds" Before We Begin...
After some work, I was able to come up with, this regex:
\s(?=\s)?|(‘|’|(Review)|:.*)
("Review" was a word that often came before colons and wouldn't be stripped, if it were later in the title; that's what that's for, but I would like to not require that, to be more universal)
But, it has two problems:
I have to use multiple steps. The result of that regex would be:
"TheNewApple: Worlds Within Worlds" Before We Begin...
And I could then add another regex step, to put the hash # in front
But, it only works if the quotes are first, and I don't know how to fix that...
You can do this all in one step with regex, with a caveat. You run into problems with a repeated capturing group because only the last iteration is available in the replacement string. Searching for ( (\w+))+ and replacing with $2 will replace all the words with just the last match - not what we want.
The way around this is to repeat the pattern an arbitrary number of times that will suffice for your use. Each separate group can be referenced.
Search: "(\w+)(?: (\w+))?(?: (\w+))?(?: (\w+))?(?: (\w+))?(?: (\w+))?
Replace: "#$1$2$3$4$5$6
This will replace up to 6-word titles, exactly as you need them. First, "(\w+) matches any word following a quote. In the replacement string, it is put back as "#$1, adding the hashtag. The rest is a repeated list of (?: (\w+))? matches, each matching a possible space and word. Notice the space is part of a non-capturing group; only the word is part of the inner capture group. In the replacement string, I have $1$2$3$4$5$6, which puts back the words, without the spaces. Notice that a colon will not match any part of this, so it will stop once it hits a colon.
Examples:
"The New Apple: Worlds Within Worlds" Before We Begin...
"The New Apple" Before We Begin...
"One: Two"
only "One" word
this has "Two Words"
"The Great Big Apple Dumpling"
"The Great Big Apple Dumpling Again: Part 2"
Results:
"#TheNewApple: Worlds Within Worlds" Before We Begin...
"#TheNewApple" Before We Begin...
"#One: Two"
only "#One" word
this has "#TwoWords"
"#TheGreatBigAppleDumpling"
"#TheGreatBigAppleDumplingAgain: Part 2"
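Outside Yahoo Pipes, the same search can be tried in Python. This is only a sketch of the six-word pattern above, with made-up names SEARCH and hashtag_title; a replacement function joins whichever groups matched instead of relying on $1 to $6.

```python
import re

# A quote, one word, then up to five more " word" repetitions (six words total).
SEARCH = re.compile(r'"(\w+)' + r'(?: (\w+))?' * 5)

def hashtag_title(title):
    """Join the captured words without spaces and prefix them with '#'."""
    return SEARCH.sub(lambda m: '"#' + ''.join(g or '' for g in m.groups()), title)

for t in ('"The New Apple: Worlds Within Worlds" Before We Begin...',
          'this has "Two Words"',
          '"The Great Big Apple Dumpling Again: Part 2"'):
    print(hashtag_title(t))
```

The closing quote is never rewritten here because it is not directly followed by a word character in these titles, and the colon ends the run of words just as described above.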
You can match the text with
"([^:]*)(.*?)"(.*)
then use some programming language to output the result like this:
'"#' + removeSpace($1) + $2 + '"' + $3
I have no idea what language you're using, but this seems like a poor choice for regex. In Python I'd do this:
# Python 3
import re
titles = ['''"The New Apple: Worlds Within Worlds" Before We Begin...''',
          '''"Made Up Title: For Example Only" So We Can Continue...''']
hashtagged_titles = list()
for title in titles:
    # partition() keeps the rest of the title as one string, even with extra colons
    hashtagme, _, restofstring = title.partition(":")
    hashtag = '"#' + hashtagme[1:].translate(str.maketrans('', '', " "))
    result = "{}:{}".format(hashtag, restofstring)
    hashtagged_titles.append(result)
Do a global search for
\ (?=.*:)
and replace each match with nothing.
You'll need a second search on the results of that if you want to capture "TheNewApple" as a single word.

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';
/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
    alert(m);
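The same pattern can be checked in Python. This is a minimal sketch in which QUOTED and first_quoted are illustrative names, and the sample is written as a raw string so the backslashes reach the regex unchanged.

```python
import re

# A quote, then any mix of ordinary characters or backslash-escaped characters, then a quote.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')

def first_quoted(text):
    """Return the first double-quoted string in text, escapes included, or None."""
    m = QUOTED.search(text)
    return m.group(0) if m else None

s = r' function(){ return " It\'s big \"problem "; }'
print(first_quoted(s))
```

The \\. branch consumes a backslash together with the character after it, which is why an escaped quote cannot end the match.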
This one comes from nanorc.sample, available in many Linux distros. It is used for syntax highlighting of C-style strings.
\"(\\.|[^\"])*\"
As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
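A sketch of the single-or-double-quote variant in Python. To keep the escaping readable, the pattern is built by a small hypothetical helper, quoted, and the groups are made non-capturing so that findall returns whole matches.

```python
import re

def quoted(q):
    """Unrolled pattern for a string delimited by q, allowing backslash escapes."""
    return r'{q}[^{q}\\]*(?:\\.[^{q}\\]*)*{q}'.format(q=q)

# Either a double-quoted or a single-quoted string.
ANY_QUOTED = re.compile(quoted('"') + '|' + quoted("'"))

text = r'''say "a \" b" and 'c \' d' end'''
print(ANY_QUOTED.findall(text))
```

Each alternative handles one quote character, so a single quote inside a double-quoted string (or the reverse) does not end the match.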
Most of the solutions provided here use alternative repetition paths i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some regex engines implement this kind of repetition using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.
/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string
"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, since it is known beforehand that a string without a closing quote cannot match in any case.
This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
1. Every quoted string starts with the character ".
2. It may contain any number of any characters: .*? (lazy match), ending with a non-escape character [^\\].
3. Statement (2) is lazily optional because the string can be empty (""). So: (.*?[^\\])??
4. Finally, every quoted string ends with the character ", but it can be preceded by any number of escape sign pairs (\\\\)+; this part is optional and possessive (greedy, without backtracking): ((\\\\)+)?+, because the string can be empty or have no trailing pairs.
An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Let's say you had the following string: String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
    // Match close
    '([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
    '((?:' +
        // Match escaped close quote
        '(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
        // Match everything that's not the close quote
        '(?:(?!\\1).)' +
    '){0,})' +
    // Match open
    '(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
    'g'
);
// Reverse the matched strings.
const matches = myString
    // Reverse the string.
    .split('').reverse().join('')
    // '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
    // Match the quoted
    .match(regExp)
    // ['"hctam "\dluohs"\ siht"', '"dluohs"']
    // Reverse the matches
    .map(x => x.split('').reverse().join(''))
    // ['"this \"should\" match"', '"should"']
    // Re order the matches
    .reverse();
    // ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
This regexp can easily be broken into three pieces, as follows:
# Part 1
(['"])              # Match a closing quotation mark " or '
(?!                 # As long as it's not followed by
  (?:[\\]{2})*      # any number of escape-character pairs
  [\\]              # and a single escape
  (?![\\])          # As long as that's not followed by an escape
)
# Part 2
((?:                # Match inside the quotes
  (?:               # Match option 1:
    \1              # Match the closing quote
    (?=             # As long as it's followed by
      (?:\\\\)*     # any number of escape-character pairs
      \\            # and a single escape
      (?![\\])      # As long as that's not followed by an escape
    )
  )|                # OR
  (?:               # Match option 2:
    (?!\1).         # Any character that isn't the closing quote
  )
)*)                 # Match the group 0 or more times
# Part 3
(\1)                # Match an open quotation mark that is the same as the closing one
(?!                 # As long as it's not followed by
  (?:[\\]{2})*      # any number of escape-character pairs
  [\\]              # and a single escape
  (?![\\])          # As long as that's not followed by an escape
)
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js
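Here is one that works with both " and ', and you can easily add other quote characters at the start.
("|')(?:\\\1|[^\1])*?\1
It uses the backreference (\1) to match exactly what is in the first group (" or '). Note that in most engines \1 is not a backreference inside a character class, so [^\1] does not mean "anything but the quote"; it is the lazy quantifier *? that stops the match at the closing quote.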
http://www.regular-expressions.info/backref.html
One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and linear, manual seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)
If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"
I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
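The two-step idea translates directly to Python. A minimal sketch, with mask_quoted as a made-up name, and with the same caveat as the Java version: the content of the quoted strings is thrown away, so it only suits cases where they just need to be neutralised.

```python
import re

def mask_quoted(line):
    """Step 1: turn escaped quotes into apostrophes. Step 2: blank out quoted strings."""
    line = line.replace('\\"', "'")
    return re.sub(r'"[^"]*"', '"x"', line)

print(mask_quoted(r'print("a \"b\" c", d)'))
```

Once the escaped quotes are gone, the simple "[^"]*" pattern is enough because every remaining quote is a real delimiter.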
If your IDE is IntelliJ IDEA, you can forget all these headaches: store your regex in a String variable, and when you paste it between the double quotes it is automatically escaped into a valid string literal.
Example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
Now you can use this variable in your regexp or anywhere else.
(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "
Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"