Get segment of string in between characters - regex

I have a giant data set that includes lots of file names with various parts of strings that I need to grab.
I have this code segment currently:
def fps(data):
for i in data:
pattern = r'.(\d{4}).' # finds data in between the periods
frames = re.findall(pattern, ' '.join(data)) #puts info into frames list
frames.sort()
for i in range(len(frames)): #Turns the str into integers
frames[i] = int(frames[i])
return frames
This is great and all but it only returns 4 characters after and before a period.
How would I grab part of the string after a period and before the next period.
Preferably without using regular edit because it's a little too complex for a simpleton like me.
For example:
One string may look like this
string = ['filename.0530.extension']
while the others may look like this
string2 = ['filename.042.extension']
string3 = [filename.045363.extension']
I would need to output the numbers in between the periods on the terminal so:
0530, 042, 045363

To match your example data your could match a dot, capture in a group one or more digits \d+ (instead of exactly 4 \d{4}) followed by matching a dot:
\.(\d+)\.
If you want to match all between the dots you might use a negating character class [^.] to match not a dot:
\.([^.]+)\.
Note that if you want to match a literal dot you should escape it \.
Demo

To match the numbers between your periods in your example, you can use this:
^.*\.[^.\s]*?\.?(\d+)\..*$
Here's an online example

Related

Regular expression to extract string from urls

I need to extract a string from an URL. Here are some examples:
Input: https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html – Output: bas-026-009
Input: https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html – Output: aw18-245-b86
Input: https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html – Output: ss20-028-e70
I want to be able to extract the string that goes from the first character after the "/eur_en/" until the third dash. Can someone help me? Thanks
You're looking for regexp: \/eur_en\/([^-]+-[^-]+-[^-]+)
Play & test it at regex101: https://regex101.com/r/RvGROG/1
You need something like this:
const urls = [
"https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html",
"https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html",
"https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html",
]
const rg = new RegExp(`\/eur_en\/([^-]+-[^-]+-[^-]+)`)
const strs = urls.map(url => url.match(rg)[1])
console.log(strs)
// Output:
// [
// "bas-026-009",
// "aw18-245-b86",
// "ss20-028-e70"
// ]
Of course, it's a simple example. In real cases don't forget to check that .match returned array with length greater than 1.
So, the first element is full captured string and the second (as third and next) it's a sub-strings, which is captured by parentheses.
We can improve and complicate our regex like so:
\/((?:[^-\/]+-){2}[^-\/]+)
It'll allow us to not to use a specific anchor /eur_en/ and control the number of dash divided parts.
The expression you're looking for is the following:
/(?<=eur_en\/)[^-]*-[^-]*-[^-]*/
Here is how it works:
(?<=eur_en\/): will look behind for eur_env/ but will not use it in the output
[^-]*: it will match any character that is not a dash. So it will get everything up to the first dash (not including the dash)
[^-]*: it will match any character that is not a dash. So it will get everything up to the second dash (not including the dash)
[^-]*: it will match any character that is not a dash. So it will get everything up to the third dash (not including the dash).
/(?<=\/eur_en\/)\w+-\w+-\w+/g
Tolkens
Description
(?<=\/eur_en\/)
Look behind - If /eur_en/ is found, match whatever proceeds it.
\w+-\w+-\w+
One or more Word character = [A-Za-z0-9] and a literal hyphen three consecutive times.
Review: https://regex101.com/r/Ge0zA3/1

Don't get number if precede by month python

I'm pulling some data out of the web utilizing python in the Jupyter notebook. I have pulled down the data, parsed, and created the data frame. I have extracted a number out of a string that I have in a variable in the data frame. I utilizing this regex to do it:
number = []
for note in df["person_notes"]:
match = re.search(r'\d+', note)
if match:
number.append(note[match.start(): match.end()])
else:
number.append("")
df["number"] = number
Some strings are missing the number I'm looking for. For those cases, I will like to number.append(""). Those strings have instead a full date like so... "September 20, 2016" and my re.search() is pulling the number 20 out of that full date. If the string has a data like so, I want to ignore the 20 and instead I want to number.append("").
How can I modify the re.search() to ignore the number if the number is preceded by a month?
I suggest useing the old JS regex trick: enclose the pattern you wouldenclose with a negative lookbehind with an optional capturing group, and if it is a success, discard the match (here, append a ""). Else, grab the other capturing group contents (here, the digits).
See the Python demo:
import re
number = []
p = re.compile(r'((?:Jan|Febr)(?:uary)?|Ma(?:y|r(?:ch)?)|A(?:ug(?:ust)?|pr(?:il)?)|Ju(?:ne?|ly?)|Oct(?:ober)?|(?:Sept|Nov|Dec)(?:ember)?)? *(\d+)')
match = p.search('September 20, 2016')
if match and not match.group(1): # Did the string match and did Group 1 fail?
number.append(match.group(2)) # Yes, then add digits
else:
number.append("") # Else, add an empty value
print(number)
If you do not care about the shortened month names and keep it readable, you may use a simpler regex:
p = re.compile(r'(January|February|March|April|May|June|July|August|September‌​|October|November|De‌​cember)? *(\d+)')
The regex matches:
((?:Jan|Febr)(?:uary)?|Ma(?:y|r(?:ch)?)|A(?:ug(?:ust)?|pr(?:il)?)|Ju(?:ne?|ly?)|Oct(?:ober)?|(?:Sept|Nov|Dec)(?:ember)?)? - months (full or short names)
* - zero or more spaces
(\d+) - Group 2: one or more digits.

Regex to grab formulas

I am trying to parse a file that contains parameter attributes. The attributes are setup like this:
w=(nf*40e-9)*ng
but also like this:
par_nf=(1) * (ng)
The issue is, all of these parameter definitions are on a single line in the source file, and they are separated by spaces. So you might have a situation like this:
pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0
The current algorithm just splits the line on spaces and then for each token, the name is extracted from the LHS of the = and the value from the RHS. My thought is if I can create a Regex match based on spaces within parameter declarations, I can then remove just those spaces before feeding the line to the splitter/parser. I am having a tough time coming up with the appropriate Regex, however. Is it possible to create a regex that matches only spaces within parameter declarations, but ignores the spaces between parameter declarations?
Try this RegEx:
(?<=^|\s) # Start of each formula (start of line OR [space])
(?:.*?) # Attribute Name
= # =
(?: # Formula
(?!\s\w+=) # DO NOT Match [space] Word Characters = (Attr. Name)
[^=] # Any Character except =
)* # Formula Characters repeated any number of times
When checking formula characters, it uses a negative lookahead to check for a Space, followed by Word Characters (Attribute Name) and an =. If this is found, it will stop the match. The fact that the negative lookahead checks for a space means that it will stop without a trailing space at the end of the formula.
Live Demo on Regex101
Thanks to #Andy for the tip:
In this case I'll probably just match on the parameter name and equals, but replace the preceding whitespace with some other "parse-able" character to split on, like so:
(\s*)\w+[a-zA-Z_]=
Now my first capturing group can be used to insert something like a colon, semicolon, or line-break.
You need to add Perl tag. :-( Maybe this will help:
I ended up using this in C#. The idea was to break it into name value pairs, using a negative lookahead specified as the key to stop a match and start a new one. If this helps
var data = #"pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0";
var pattern = #"
(?<Key>[a-zA-Z_\s\d]+) # Key is any alpha, digit and _
= # = is a hard anchor
(?<Value>[.*+\-\\\/()\w\s]+) # Value is any combinations of text with space(s)
(\s|$) # Soft anchor of either a \s or EOB
((?!\s[a-zA-Z_\d\s]+\=)|$) # Negative lookahead to stop matching if a space then key then equal found or EOB
";
Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture)
.OfType<Match>()
.Select(mt => new
{
LHS = mt.Groups["Key"].Value,
RHS = mt.Groups["Value"].Value
});
Results:

How to better this regex?

I have a list of strings like this:
/soccer/poland/ekstraklasa-2008-2009/results/
/soccer/poland/orange-ekstraklasa-2007-2008/results/
/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/
From each string I want to take a middle part resulting in respectively:
ekstraklasa
orange ekstraklasa
orange ekstraklasa youth
My code here does the job but it feels like it can be done in fewer steps and probably with regex alone.
name = re.search('/([-a-z\d]+)/results/', string).group(1) # take the middle part
name = re.search('[-a-z]+', name).group() # trim numbers
if name.endswith('-'):
name = name[:-1] # trim tailing `-` if needed
name = name.replace('-', ' ')
Can anyone see how make it better?
This regex should do the work:
/(?:\/\w+){2}\/([\w\-]+)(?:-\d+){2}/
Explanation:
(?:\/\w+){2} - eat the first two words delimited by /
\/ - eat the next /
([\w\-]+)- match the word characters of hyphens (this is what we're looking for)
(?:-\d+){2} - eat the hyphens and the numbers after the part we're looking for
The result is in the first match group
I cant test it because i am not using python, but i would use an Expression like
^(/soccer/poland/)([a-z\-]*)(.*)$
or
^(/[a-z]*/[a-z]*/)([a-z\-]*)(.*)$
This Expressen works like "/soccer/poland/" at the beginning, than "everything with a to z (small) or -" and the rest of the string.
And than taking 2nd Group!
The Groups should hold this Strings:
/soccer/poland/
orange-ekstraklasa-youth-
2010-2011/results/
And then simply replacing "-" with " " and after that TRIM Spaces.
PS: If ur Using regex101.com e.g., u need to escape / AND just use one Row of String!
Expression
^(\/soccer\/poland\/)([a-z\-]*)(.*)$
And one Row of ur String.
/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/
If u prefere to use the Expression not just for soccer and poland, use
^(\/[a-z]*\/[a-z]*\/)([a-z\-]*)(.*)$

Regular expression help - comma delimited string

I don't write many regular expressions so I'm going to need some help on the one.
I need a regular expression that can validate that a string is an alphanumeric comma delimited string.
Examples:
123, 4A67, GGG, 767 would be valid.
12333, 78787&*, GH778 would be invalid
fghkjhfdg8797< would be invalid
This is what I have so far, but isn't quite right: ^(?=.*[a-zA-Z0-9][,]).*$
Any suggestions?
Sounds like you need an expression like this:
^[0-9a-zA-Z]+(,[0-9a-zA-Z]+)*$
Posix allows for the more self-descriptive version:
^[[:alnum:]]+(,[[:alnum:]]+)*$
^[[:alnum:]]+([[:space:]]*,[[:space:]]*[[:alnum:]]+)*$ // allow whitespace
If you're willing to admit underscores, too, search for entire words (\w+):
^\w+(,\w+)*$
^\w+(\s*,\s*\w+)*$ // allow whitespaces around the comma
Try this pattern: ^([a-zA-Z0-9]+,?\s*)+$
I tested it with your cases, as well as just a single number "123". I don't know if you will always have a comma or not.
The [a-zA-Z0-9]+ means match 1 or more of these symbols
The ,? means match 0 or 1 commas (basically, the comma is optional)
The \s* handles 1 or more spaces after the comma
and finally the outer + says match 1 or more of the pattern.
This will also match
123 123 abc (no commas) which might be a problem
This will also match 123, (ends with a comma) which might be a problem.
Try the following expression:
/^([a-z0-9\s]+,)*([a-z0-9\s]+){1}$/i
This will work for:
test
test, test
test123,Test 123,test
I would strongly suggest trimming the whitespaces at the beginning and end of each item in the comma-separated list.
You seem to be lacking repetition. How about:
^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$
I'm not sure how you'd express that in VB.Net, but in Python:
>>> import re
>>> x [ "123, $a67, GGG, 767", "12333, 78787&*, GH778" ]
>>> r = '^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$'
>>> for s in x:
... print re.match( r, s )
...
<_sre.SRE_Match object at 0xb75c8218>
None
>>>>
You can use shortcuts instead of listing the [a-zA-Z0-9 ] part, but this is probably easier to understand.
Analyzing the highlights:
[a-zA-Z0-9 ]+ : capture one or more (but not zero) of the listed ranges, and space.
(?:[...]+,)* : In non-capturing parenthesis, match one or more of the characters, plus a comma at the end. Match such sequences zero or more times. Capturing zero times allows for no comma.
[...]+ : capture at least one of these. This does not include a comma. This is to ensure that it does not accept a trailing comma. If a trailing comma is acceptable, then the expression is easier: ^[a-zA-Z0-9 ,]+
Yes, when you want to catch comma separated things where a comma at the end is not legal, and the things match to $LONGSTUFF, you have to repeat $LONGSTUFF:
$LONGSTUFF(,$LONGSTUFF)*
If $LONGSTUFF is really long and contains comma repeated items itself etc., it might be a good idea to not build the regexp by hand and instead rely on a computer for doing that for you, even if it's just through string concatenation. For example, I just wanted to build a regular expression to validate the CPUID parameter of a XEN configuration file, of the ['1:a=b,c=d','2:e=f,g=h'] type. I... believe this mostly fits the bill: (whitespace notwithstanding!)
xend_fudge_item_re = r"""
e[a-d]x= #register of the call return value to fudge
(
0x[0-9A-F]+ | #either hardcode the reply
[10xks]{32} #or edit the bitfield directly
)
"""
xend_string_item_re = r"""
(0x)?[0-9A-F]+: #leafnum (the contents of EAX before the call)
%s #one fudge
(,%s)* #repeated multiple times
""" % (xend_fudge_item_re, xend_fudge_item_re)
xend_syntax = re.compile(r"""
\[ #a list of
'%s' #string elements
(,'%s')* #repeated multiple times
\]
$ #and nothing else
""" % (xend_string_item_re, xend_string_item_re), re.VERBOSE | re.MULTILINE)
Try ^(?!,)((, *)?([a-zA-Z0-9])\b)*$
Step by step description:
Don't match a beginning comma (good for the upcoming "loop").
Match optional comma and spaces.
Match characters you like.
The match of a word boundary make sure that a comma is necessary if more arguments are stacked in string.
Please use - ^((([a-zA-Z0-9\s]){1,45},)+([a-zA-Z0-9\s]){1,45})$
Here, I have set max word size to 45, as longest word in english is 45 characters, can be changed as per requirement