Regex groups for dash delimited filename in URL - regex

I have a URL that is structured like so: <domain>/<subdirectory>/<filename>-<semantic_version>-<hash>.<filetype>
For example, it could look like: https://cdn.example.com/sample_files/some_file-1.2.3-56857cfc709d3996f057252c16ec4656f5292802.css
So far I have the following regex, which gives me the entire filename. However, I'd like to get the filename, semantic_version, and hash individually, as defined above. You can assume that the filename will not have dashes in the name.
([^/\\&\?]+)$(?<=(?:.js))

You could match the protocol and then until the last forward slash.
After that, capture 1+ word chars in group 1 for the file name, digits separated by dots in group 2 for the version, and in group 3 a character class that matches the characters of the hash.
^http\S+\/(\w+)-(\d+(?:\.\d+)+)-([0-9a-f]+)\.\w+$
Explanation
^ Start of string
http\S+\/ Match the protocol followed by 1+ non whitespace chars, then backtrack till the last /
(\w+)- Capture group 1, match 1+ word chars followed by -
(\d+(?:\.\d+)+)- Capture group 2, match digits divided by dots followed by -
([0-9a-f]+)\.\w+ Capture group 3, match 1+ times the chars from the hash followed by . and 1+ word chars
$ End of string
Regex demo
If the hash always has 40 characters, you could match [0-9a-f]{40} instead of [0-9a-f]+ to be a bit more precise.
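As a rough illustration of how the three groups could be read out, here is a minimal Python sketch using the pattern from this answer. The function name split_asset_url is made up for the example, and the escaped slash is written as a plain / because Python does not need it escaped.

```python
import re

# Pattern from the answer above: group 1 = file name, group 2 = version, group 3 = hash.
URL_PATTERN = re.compile(r"^http\S+/(\w+)-(\d+(?:\.\d+)+)-([0-9a-f]+)\.\w+$")


def split_asset_url(url):
    """Return (filename, version, hash) or None if the URL does not match."""
    m = URL_PATTERN.match(url)
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(3)


url = (
    "https://cdn.example.com/sample_files/"
    "some_file-1.2.3-56857cfc709d3996f057252c16ec4656f5292802.css"
)
print(split_asset_url(url))
```

Returning None for a non-matching URL is just one possible choice; raising an exception would work as well.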

Use multiple capture groups that don't match - characters.
([^-/\\&\?]+)-([^-/\\&\?]+)-([^-/\\&\?]+)\.[a-z]+$(?<=(?:.js))

Related

How to make optional capturing groups be matched first

For example I want to match three values, required text, optional times and id, and the format of id is [id=100000], how can I match data correctly when text contains spaces.
my reg: (?<text>[\s\S]+) (?<times>\d+)? (\[id=(?<id>\d+)])?
example source text: hello world 1 [id=10000]
In this example, the entire source text is matched by the text group.
The problem with your pattern is that [\s\S]+ matches any whitespace or non-whitespace character an unlimited number of times, so it captures everything and leaves nothing for the other capture groups. With a positive lookahead and an alternation (|), we can make the last two capture groups optional.
The final pattern (?<text>[a-zA-Z ]+)(?=$|(?<times>\d+)? \[id=(?<id>\d+)])
Group text will match any letter and spaces.
The lookahead avoids consuming characters: it asserts that either the string ends here, or an optional number followed by [id=number] comes next.
See regex101 for further explanation and some examples.
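To see how the lookahead version behaves, here is a small Python sketch. It assumes Python's (?P<name>...) spelling for named groups, and the helper name parse_message is hypothetical. The text group can end with a trailing space (because [a-zA-Z ]+ includes spaces), so the sketch strips it.

```python
import re

# Same pattern as above, written with Python's (?P<name>...) named-group syntax.
PATTERN = re.compile(
    r"(?P<text>[a-zA-Z ]+)(?=$|(?P<times>\d+)? \[id=(?P<id>\d+)])"
)


def parse_message(source):
    """Return (text, times, id); times and id are None when absent."""
    m = PATTERN.match(source)
    if m is None:
        return None
    # Groups captured inside the lookahead are still available on the match.
    return m.group("text").strip(), m.group("times"), m.group("id")


print(parse_message("hello world 1 [id=10000]"))
print(parse_message("hello world [id=7]"))
print(parse_message("hello world"))
```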
You could use:
:\s*(?<text>[^][:]+?)\s*(?<times>\d+)? \[id=(?<id>\d+)]
Explanation
: Match literally
\s* Match optional whitespace chars
(?<text> Group text
[^][:]+? match 1+ occurrences of any char except [ ] :
) Close group text
\s* Match optional whitespace chars
(?<times>\d+)? Optional group times, match 1+ digits
\[id= Match [id=
(?<id>\d+) Group id, match 1+ digits
] Match literally
Regex demo

How can I extract non digit characters and digit characters in the end of a string?

I have a string that has the following structure:
digit-word(s)-digit.
For example:
2029 AG.IZTAPALAPA 2
I want to extract the word(s) in the middle, and the digit at the end of the string.
I want to extract AG.IZTAPALAPA and 2 in the same capture group to extract like:
AG.IZTAPALAPA 2
I managed to capture them as individual capture groups but not as a single:
town_state['municipality'] = town_state['Town'].str.extract(r'(\D+)', expand=False)
town_state['number'] = town_state['Town'].str.extract(r'(\d+)$', expand=False)
Thank you for your help!
You can use a single capturing group for the example string. It matches a single "word" that consists of uppercase chars A-Z, optionally with dots in the middle (but not at the start or end), followed by a space and 1 or more digits.
\b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b
Explanation
\b A word boundary
\d+ Match 1+ digits followed by a space
( Capture group 1
[A-Z]+ Match 1+ occurrences of an uppercase char A-Z
(?:\.[A-Z]+)* \d+ Repeat 0+ times matching a dot and 1+ chars A-Z, then match a space and 1+ digits
) Close group 1
\b A word boundary
Regex demo
Or you can make the pattern a bit broader matching either a dot or a word character
\b\d+ ([\w.]+(?: [\w.]+)* \d+)\b
Regex demo
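A quick way to check both patterns is with Python's re module. This is only a sketch on the example string from the question; the variable names are made up for illustration. The same patterns should also be usable with the str.extract(..., expand=False) call shown in the question, since each contains exactly one capture group.

```python
import re

# Stricter pattern: uppercase "word" with optional inner dots, then a space and digits.
STRICT = re.compile(r"\b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b")
# Broader pattern: one or more "words" made of word chars or dots, then digits.
BROAD = re.compile(r"\b\d+ ([\w.]+(?: [\w.]+)* \d+)\b")

town = "2029 AG.IZTAPALAPA 2"

strict_match = STRICT.search(town)
broad_match = BROAD.search(town)

# Group 1 holds the word(s) and the trailing number together.
print(strict_match.group(1))
print(broad_match.group(1))
```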
You can use the following simple regex:
[0-9]+\s([A-Z]+.[A-Z]+(?: [0-9]+)*)
Note:
(?: [0-9]+)* makes the trailing digits optional.

Regex to pull first two fields from a comma separated file

I want to pull the second string in a comma delimited list where the first value is numeric and the second is alpha.
I'm using \d[^,]+(?=,) to pull the numeric value in the first field and just need help with pulling the second value from the "Name" column.
Here's part of a sample file that I'm trying to extract data from:
Address Number,Name,Employee Master Exist(Y/N),Auto-Deposit Exists(Y/N),Supplier Master Exists(Y/N),Supplier Master Created,ACH Account Exists(Y/N),ACH Account Created,ACH Same as Auto-deposit(Y/N)
//line break here is for clarity and does not exist in file//
4398,Presley Elvis Aaron,Y,N,Y,N,Y,N,N
10154,Shepard Alan Barrett,Y,Y,Y,N,Y,N,N
You could make use of a capturing group if you want to match the second string by first matching 1+ digits and a comma.
Then capture in a group matching 1+ chars a-zA-Z and match the trailing comma.
^\d+,([a-zA-Z]+(?: [a-zA-Z]+)*),
^ Start of string
\d+, Match 1+ digits and a comma (Or use (\d+), if the digits should also be a group)
( Capture group 1
[a-zA-Z]+ Match 1+ chars a-zA-Z
(?: [a-zA-Z]+)* Repeat matching the same as previous preceded by a space
), Close capturing group and match trailing comma
Regex demo
To get a bit broader match you could use this pattern to match at least a single char a-zA-Z
\d+,([a-zA-Z ]*[a-zA-Z][a-zA-Z ]*),
Regex demo
Note that the \d[^,]+ part of your pattern does not match only digits: it matches 1 digit followed by 1+ of any char except a comma, so it would also match 4a$, for example.
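As a sketch of how this could be applied to a whole file, the Python snippet below uses the variant with (\d+) so that both fields are captured, and adds re.MULTILINE so that ^ matches at the start of every line. The sample text is a shortened copy of the data in the question, and the names ROW_PATTERN and rows are made up for the example.

```python
import re

# Group 1 = address number, group 2 = name (letters separated by single spaces).
ROW_PATTERN = re.compile(r"^(\d+),([a-zA-Z]+(?: [a-zA-Z]+)*),", re.MULTILINE)

# Shortened sample: the header row is truncated for readability.
sample = (
    "Address Number,Name,Employee Master Exist(Y/N)\n"
    "4398,Presley Elvis Aaron,Y,N,Y,N,Y,N,N\n"
    "10154,Shepard Alan Barrett,Y,Y,Y,N,Y,N,N\n"
)

# findall returns one (number, name) tuple per matching line;
# the header is skipped because it does not start with a digit.
rows = ROW_PATTERN.findall(sample)
print(rows)
```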
You could try this regex:
^\d+,([^,]+),
This will look for lines:
starting with one or more digits
followed by a comma
capture anything that is not a comma
followed by a comma
See it at Regex 101
If not all lines contain a name, then change the + to a *:
^\d+,([^,]*),
See alternative regex

C# Regex Expression to extract field name and values in SQL Condition

Consider following 2 SQL conditions.
1.) AssetView.[PROPTYPE] NOT IN ('B15/30','SFD','SFA')
2.) AssetView.[FICO] IN (500,600,700)
I want to break this SQL using RegEx so that I can have table name, field name, function type and field values into 4 different parts.
e.g.
Table Name - AssetView
Field Name - PROPTYPE
Function - NOT IN
Field Values (Together or separate): B15/30, SFD, SFA
Here is the regex I tried (https://rubular.com/r/WGiyz0oGrooyiA) but I am not able to split TableName, Field Name and Function type into its own group.
(.*?)[^=]['(]+(.*?)[')]
In your pattern (.*?)[^=]['(]+(.*?)[')] you make use of the character classes ['(] and [')], each of which matches any one of the listed characters. That means the pattern can also match an opening ' first and then a closing )
For your example data, you might use:
(\w+)\.\[(\w+)\] +(\w+(?: \w+)*) +\(([^)\n]+)\)
(\w+) Capture 1+ word chars in group 1
\. Match a dot
\[(\w+)\] + Capture 1+ word chars between square brackets in group 2 and 1+ spaces
(\w+(?: \w+)*) + Capture 1+ word chars followed by repeating 0+ times matching a space and 1+ word chars in group 3 and 1+ spaces
\(([^)\n]+)\) Capture in group 4, between parentheses, 1+ chars that are not a closing parenthesis or a newline
Rubular regex | .NET regex (click on the Table tab)
If you want to allow more characters to match than \w you could extend that using a character class.
For example, if you also want to allow a hyphen and a space use [\w -]+, or if you want to match everything between the brackets you could make use of a negated character class, for example \[([^\]]+)\]
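The question targets C#, but the grouping can be tried out quickly in Python as well, since this pattern uses no engine-specific syntax. The sketch below is only an illustration: parse_condition is a hypothetical helper, and splitting the value list on commas assumes that no value contains a comma itself.

```python
import re

# Group 1 = table, group 2 = field, group 3 = function, group 4 = raw value list.
CONDITION = re.compile(r"(\w+)\.\[(\w+)\] +(\w+(?: \w+)*) +\(([^)\n]+)\)")


def parse_condition(sql):
    """Return (table, field, function, values) or None if nothing matches."""
    m = CONDITION.search(sql)
    if m is None:
        return None
    table, field, function, raw_values = m.groups()
    # Naive split: assumes values never contain commas; strips optional quotes.
    values = [v.strip().strip("'") for v in raw_values.split(",")]
    return table, field, function, values


print(parse_condition("AssetView.[PROPTYPE] NOT IN ('B15/30','SFD','SFA')"))
print(parse_condition("AssetView.[FICO] IN (500,600,700)"))
```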

Use class content inside REGEX

I want to parse a nested structure like this one in MATLAB :
structure NAME_PART_1
Some content
block NAME_PART_2
Some other content
end NAME_PART_2
block NAME_PART_3
subblock NAME_PART_4
Some content++
end NAME_PART_4
end NAME_PART_3
end NAME_PART_1
structure
NAME_PART_5
end NAME_PART_5
First, I would like to extract the content of each structure. It's quite easy because a structure content is always between "structure NAME" and "end NAME".
So, I would like to use regex. But I don't know in advance what the structure name will be.
So, I wrote my regex like this :
\bstructure\s+([\w.-]*)((?:\s|.)*)\bend\b\s+XXXX
But I don't know what to replace "XXXX" with in order to reference the content captured by the first group of this regex. Is that even possible?
Try this Regex:
structure\s+([\w.-]+)\s*((?:(?!end\s+\1)[\s\S])*)end\s+\1
Click for Demo
Explanation:
structure - matches structure
\s+ - matches 1+ occurrences of a white-space
([\w.-]+) - matches 1+ occurrences of either a word character or a . or a -. This sub-match which contains the structure name is captured in Group 1.
\s* - matches 0+ occurrences of a white-space
((?:(?!end\s+\1)[\s\S])*) - Tempered Greedy Token - Matches 0+ occurrences of any character [\s\S] that does not start the sequence end followed by the contents of Group 1 (\1), i.e. the structure name. This sub-match is captured in Group 2, which contains the contents of the structure
end\s+\1 - matches the word end followed by 1+ white-spaces followed by Structure Name contained in Group 1 \1.
Apart from making use of a backreference \1 to refer what is captured, you might replace the alternation in the capturing group ((?:\s|.)*) with matching a newline followed by 0+ characters and repeat that while capturing it ((?:\n.*)+)
Also, you might omit the word boundary after end in end\b\s+, since 1+ whitespace characters must follow end anyway, and instead add a word boundary after \1 so that the name is not part of a larger word.
\bstructure\s+([\w.-]+)((?:\n.*)+)\bend\s+\1\b
Regex demo
Explanation
\bstructure\s+ Match structure followed by 1+ whitespace chars
([\w.-]+) Capture in a group repeating 1+ times any of the listed chars
( Capturing group
(?:\n.*)+ Match newline followed by 0+ times any char except a newline
) Close capturing group
\bend Match end
\s+\1\b Match 1+ times a whitespace char followed by a backreference to group 1 and end with a word boundary.
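The question is about MATLAB, but the backreference idea can be checked in any engine that supports \1. Below is a minimal Python sketch that combines the tempered-token pattern from the first answer with the word boundaries suggested in the second; it is an assumption-laden illustration, not a drop-in MATLAB solution. The sample text is a shortened version of the one in the question, and the variable names are made up.

```python
import re

# \1 refers back to the structure name captured in group 1.
STRUCTURE = re.compile(
    r"\bstructure\s+([\w.-]+)\s*((?:(?!\bend\s+\1\b)[\s\S])*)\bend\s+\1\b"
)

source = "\n".join([
    "structure NAME_PART_1",
    "Some content",
    "block NAME_PART_2",
    "Some other content",
    "end NAME_PART_2",
    "end NAME_PART_1",
    "structure",
    "NAME_PART_5",
    "end NAME_PART_5",
])

# One (name, content) pair per structure; nested blocks stay inside the content.
structures = [(m.group(1), m.group(2).strip()) for m in STRUCTURE.finditer(source)]
for name, content in structures:
    print(name, repr(content))
```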