Regex about url encoded string - regex

I would like to write a regex that extracts the URL-encoded string from the line below:
<topicref href="%E4%BA%B0.txt"/>
When I used a regex like (%[A-Z][0-9])+\.txt, it only matched %B0.txt. What can I do to get the whole URL-encoded string, i.e. %E4%BA%B0.txt?
Thanks a lot.

Proper URL encoding uses hex digits only, A-F not A-Z. The encoded URL could contain non-encoded characters anywhere. Also, you should escape the full stop.
((%[0-9A-F]{2}|[^<>'" %])+)\.txt
is a quick ad-hoc fix for your regex. For production code, though, you probably shouldn't use a regex for this at all, or should at the very least use a well-defined and properly tested URL regex like the one you can find in the HTTP RFC.
Putting the + quantifier outside the capturing parentheses will only return the last repetition. I added a second set of parentheses to put the quantifier inside the first capture group, which assumes you are doing something to extract the first capture group in particular. (If your regex dialect has non-capturing groups, you could make the second opening parenthesis non-capturing, i.e. write (?: instead of (.)
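The point about where the quantifier sits can be checked with a short Python sketch. It assumes Python's built-in re module and the sample line from the question. The names LINE, OUTER and INNER are made up for illustration, and urllib.parse.unquote is only there to show what the captured escapes decode to.

```python
import re
from urllib.parse import unquote

LINE = '<topicref href="%E4%BA%B0.txt"/>'

# Quantifier outside the group: the group is overwritten on every repetition.
OUTER = re.compile(r"(%[0-9A-F]{2})+\.txt")
# Quantifier inside the capture group, with a non-capturing inner group.
INNER = re.compile(r"""((?:%[0-9A-F]{2}|[^<>'" %])+)\.txt""")

outer = OUTER.search(LINE)
inner = INNER.search(LINE)

print(outer.group(0))           # the whole match, including .txt
print(outer.group(1))           # only the last repetition of the group
print(inner.group(1))           # the complete encoded name
print(unquote(inner.group(1)))  # the name decoded as UTF-8
```

Both patterns match the same text overall. They differ only in what group 1 holds afterwards.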

You need to change your regex to
([%\dA-Z]+)\.txt
([%\dA-Z]+) - Match %, digits and uppercase letters one or more times
\.txt - Match .txt
whereas your regex means
(%[A-Z][0-9])+.txt
(%[A-Z][0-9])+
% - Match %
[A-Z] - Match A to Z one time
[0-9] - Match any digit one time
+ - Match the captured group one or more times
.txt - Match single character (anything except new line) followed by txt
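To see why the original pattern stops at %B0.txt, here is a small Python sketch, assuming the built-in re module. The helper name first_match is hypothetical. %E4 fits "letter then digit", but %BA does not (A is not a digit), so the repetition breaks there and the only place a match can succeed is at %B0.

```python
import re
from typing import Optional

LINE = '<topicref href="%E4%BA%B0.txt"/>'

def first_match(pattern: str, text: str) -> Optional[str]:
    """Return the text of the first match of pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Pattern from the question: % + one letter + one digit, repeated.
original = first_match(r"(%[A-Z][0-9])+\.txt", LINE)
# Character-class version: any run of %, digits and uppercase letters.
fixed = first_match(r"([%\dA-Z]+)\.txt", LINE)

print(original)
print(fixed)
```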

Related

RegEx - double condition to find some string

I'd like to find the word RADU3_ or RADU3- in a sentence that begins with xlink:href= and ends with .svg
How can I do this?
I've tried the following, but it does not give the result I'm expecting.
(?=\wxlink:href=|\wsvg\b)|\bRADU3_|\bRADU3-
Only the last line in the example should produce a match (RADU3_):
ProductionGraphics\GP1RADU3-11_HeatingFurnaceF1.svg
PB:ExpressionText id="RADU3_FUEL GAS _SUM_EX" PBD:LinkUses
xlink:href="C:\ProcBookImport\MaintenanceGraphics\RADU3_AI.svg"
I'm not sure exactly how you want to use it, but the pattern below finds the string. I put the RADU3 part in its own group, which matches RADU3 followed by - or _ ([_-])
(xlink:href=.*)(RADU3[_-]*)(.*\.svg)
Edit: handling multiple occurrences
If a string might contain the pattern several times, add ? after the quantifiers to make them lazy, so that each match stops at the first .svg instead of the last
(RADU3[_-]*?)(.*?\.svg?)
The three-group pattern further up could be used in a replace expression like
\1someotherword\3
where \2, the second group, is the part that gets replaced
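A minimal Python sketch of that replacement, assuming the built-in re module. The function name replace_word is made up, and the sketch uses [_-] rather than [_-]* so that exactly one separator is required, which is a small adjustment to the pattern above. A function is used as the replacement so that backslashes in the new word need no escaping.

```python
import re

# Group 1: everything up to the word, group 2: the word, group 3: the rest.
PATTERN = re.compile(r"(xlink:href=.*)(RADU3[_-])(.*\.svg)")

def replace_word(line: str, new_word: str) -> str:
    """Keep groups 1 and 3 and put new_word where group 2 was."""
    return PATTERN.sub(lambda m: m.group(1) + new_word + m.group(3), line)

sample = r'xlink:href="C:\ProcBookImport\MaintenanceGraphics\RADU3_AI.svg"'
other = 'PB:ExpressionText id="RADU3_FUEL GAS _SUM_EX" PBD:LinkUses'

print(replace_word(sample, "someotherword"))
print(replace_word(other, "someotherword"))  # unchanged, no xlink:href=
```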
If you want to make sure that the string starts with xlink:href= and ends with .svg, you could use anchors to assert the start ^ and the end $ of the string.
Use one capturing group to make sure xlink:href= comes before RADU3 followed by an underscore or a hyphen. Then you can match it and, in the replacement, use that capturing group followed by your replacement text.
You could use a positive lookahead to assert that the string ends with .svg
That will match:
^(xlink:href=.*)\bRADU3[_-](?=.*\.svg$)
^ Assert the start of the string
(xlink:href=.*) Capturing group, match everything up to the last occurrence of the text that follows
\bRADU3[_-] Word boundary to prevent matching part of a larger word. Match RADU3 followed by an underscore or hyphen
(?=.*\.svg$) Positive lookahead to assert the string ends with .svg
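Here is a hedged Python sketch of that approach, assuming the built-in re module. The name replace_word is hypothetical. Note one assumption about the input: because of the $ anchor, the string must really end with .svg, so the sample below drops the closing quote that appears in the question's line. With the quote still attached, the pattern does not match.

```python
import re

PATTERN = re.compile(r"^(xlink:href=.*)\bRADU3[_-](?=.*\.svg$)")

def replace_word(line: str, new_word: str) -> str:
    """Replace RADU3_ / RADU3- while keeping the captured prefix."""
    return PATTERN.sub(lambda m: m.group(1) + new_word, line)

# Closing quote removed so that the string ends with .svg.
good = r'xlink:href="C:\ProcBookImport\MaintenanceGraphics\RADU3_AI.svg'
quoted = good + '"'
other = r"ProductionGraphics\GP1RADU3-11_HeatingFurnaceF1.svg"

print(replace_word(good, "someotherword"))
print(replace_word(quoted, "someotherword"))  # unchanged: ends with a quote
print(replace_word(other, "someotherword"))   # unchanged: no xlink:href=
```

The lookahead does not consume anything, so the text after the word (AI.svg here) stays in place without being captured.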
It sounds like you only want the word (substring) if it is in a specific context?
In your case, you can reset the start of the match midway through the pattern if you want starting and ending conditions (multiple conditions) for a string, but only want to use these conditions as "if-statements" and not as part of the result.
The following uses this method: the reset operator \K (available in PCRE and Perl, among others) discards everything matched so far, so only the substring you are looking for is returned.
# The string has to start with "xlink:href="
xlink:href=
# Fetch everything up to our match, and then restart the match
.*\K
# The strings we are looking for
(RADU3[-_])
# String has to end with ".svg"
(?=(.*\.svg))
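Python's built-in re module does not support \K, so the sketch below swaps in a capturing group to get the same substring. It keeps the commented layout by using re.VERBOSE. The function name extract_word is made up for illustration.

```python
import re
from typing import Optional

PATTERN = re.compile(
    r"""
    xlink:href=     # the string has to contain xlink:href= first
    .*              # skip everything up to the word we want
    (RADU3[-_])     # group 1: the word we are looking for
    (?=.*\.svg)     # .svg has to follow somewhere later
    """,
    re.VERBOSE,
)

def extract_word(line: str) -> Optional[str]:
    """Return RADU3_ or RADU3- if the line meets both conditions."""
    m = PATTERN.search(line)
    return m.group(1) if m else None

sample = r'xlink:href="C:\ProcBookImport\MaintenanceGraphics\RADU3_AI.svg"'
other = 'PB:ExpressionText id="RADU3_FUEL GAS _SUM_EX" PBD:LinkUses'

print(extract_word(sample))
print(extract_word(other))
```

Reading group 1 instead of the whole match plays the role that \K plays in the pattern above.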
If you want the entire string matching our rules you are looking for something like this:
#The string has to start with "xlink:href"
^(xlink:href=).*
# The strings we are looking for
(RADU3[-_])
# String has to end with ".svg"
(\w+\.svg)
#Get everything after .svg too
.*
If you only want the ending " after the .svg, you'd want to modify the last part where I just take everything after .svg
You can play around with what I have come up with at regex101 (no affiliation, just love their site): https://regex101.com/r/g0v07V/3/

Use Regular Expressions to find URLs without certain word patterns

I am trying to write a Regular Expression that can match URLs that don't have a certain pattern. The URLs I am trying to filter out shouldn't have an ID in them, which is 40 Hex uppercase characters.
For example, If I have the following URLs:
/dev/api/appid/A1B2C3D4E5A1B2C3D4E5A1B2C3D4E5A1B2C3D4E5/users
/dev/api/apps/list
/dev/api/help/apps/applicationname/apple/osversion/list/
(urls are made up, but the idea is that there are some endpoints with 40-length IDs, and some endpoints that don't, and some endpoints that are really long in total characters)
I want to make sure that the regular expression is only able to match the last 2 URLs, and not the first one.
I wrote the following regex,
\S+(?:[0-9A-F]{40})\S+
and it matches endpoints that do have the long ID in them, but skips over the ones that should be filtered. If I try to negate the regex,
\S+(?![0-9A-F]{40})\S+
It matches all endpoints, because some URLs have lengths that are greater than what the ID should be (40 characters).
How can I use a regular expression to filter out exactly the URLs I need?
Try this regex:
^(?!.*\/[0-9A-F]{40}\/).*$
Explanation:
^ - asserts the start of the string/url
(?!.*\/[0-9A-F]{40}\/) - Negative lookahead that checks for a / followed by exactly 40 hex characters followed by another / somewhere in the string. Since it is a negative lookahead, any string/url containing this pattern will not be matched.
.* - matches 0+ occurrences of any character except a newline character
$ - asserts the end of the string
^((?![A-F0-9]{40}).)*$
Uses a negative lookahead to match any line that doesn't have 40 hex digits in a row.
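Both patterns can be compared with a short Python sketch, assuming the built-in re module and the made-up URLs from the question. The names SLASH_DELIMITED, ANY_RUN and without_id are hypothetical.

```python
import re

ID = "A1B2C3D4E5" * 4  # 40 uppercase hex characters

URLS = [
    f"/dev/api/appid/{ID}/users",
    "/dev/api/apps/list",
    "/dev/api/help/apps/applicationname/apple/osversion/list/",
]

# Rejects a 40-character hex ID only when it sits between two slashes.
SLASH_DELIMITED = re.compile(r"^(?!.*/[0-9A-F]{40}/).*$")
# Rejects any run of 40 hex characters, wherever it appears.
ANY_RUN = re.compile(r"^((?![A-F0-9]{40}).)*$")

def without_id(urls, pattern):
    """Return the URLs that the pattern accepts (those without an ID)."""
    return [u for u in urls if pattern.match(u)]

print(without_id(URLS, SLASH_DELIMITED))
print(without_id(URLS, ANY_RUN))
```

The two differ at the edges: an ID at the very end of a URL, with no trailing slash, gets past the first pattern but not the second.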

RegEx for capturing everything except numbers and one word

I am quite stuck with a regex I can't get to work. It should capture everything except digits and the word fiktiv (not single characters of it!). Objective is to get rid of this content.
I have tried something like (?!\d|fiktiv).* on my sample string 123456788daswqrt fiktiv
https://regex101.com/r/kU8mF3/1
However this does match the fiktiv at the end as well.
One possibility would be to use a negated character class, which you get by putting a ^ at the start of the [] brackets. So you basically say: don't match digits, then match as many non-digits as you can get until whitespace occurs and the word fiktiv appears.
This capturing will be "saved" in the capturing group 1 for later use.
([^\d]+)\s+fiktiv
Testing could be done here:
https://regex101.com/
It should capture everything except digits and the word fiktiv (not single characters of it!). Objective is to get rid of this content.
So, you want to remove any character that is not a digit (that is, \D or [^0-9] pattern) and not a fiktiv char sequence.
You may use a regex with a capturing group and alternation:
(fiktiv)|[^0-9]
and replace with the contents of Group 1 using a $1 backreference, fiktiv, to restore it in the replaced string.
C# implementation:
Regex.Replace(input, "(fiktiv)|[^0-9]", "$1")
Also, see Use RegEx in SQL with CLR Procs.
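For comparison, a minimal Python sketch of the same capture-and-restore idea, assuming the built-in re module. The function name keep_digits_and_fiktiv is made up. A function is used as the replacement so that a non-participating group 1 turns into an empty string explicitly.

```python
import re

# Either the word fiktiv (captured) or any single non-digit character.
PATTERN = re.compile(r"(fiktiv)|[^0-9]")

def keep_digits_and_fiktiv(text: str) -> str:
    """Remove everything except digits and the word fiktiv."""
    return PATTERN.sub(lambda m: m.group(1) or "", text)

print(keep_digits_and_fiktiv("123456788daswqrt fiktiv"))
```

Digits are never matched, so they stay. fiktiv is matched but written straight back, and every other character is matched by [^0-9] and replaced with nothing.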

how to code correct URI regex

I have different URI patterns and am trying to find a regex that covers all of them, for example:
1) href="http://site.example.com/category/
and
2) href="http://site.example.com/en/page/
Using href=".+..+..+/(.+?)" works for the first URL, but for the second URL it skips en/page.
How can I read everything after href="http://site.example.com/ ?
This should do it:
[^\./]+\.[^\./]+\.[^\./]+(?:/(.*))?
That is:
[^\./]+ = (anything but . and /)
\. = dot
...? = Zero or one occurrence(s) of ...
(?:...)? = Zero or one of ..., which is more than one character, but without capturing ....
(?:/(.*))? = Capture everything after the first / that follows the host name, if there is one.
Tested here.
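A quick Python check of this pattern, assuming the built-in re module and the two URLs from the question without the surrounding href=" text. The function name path_after_host is hypothetical.

```python
import re
from typing import Optional

PATTERN = re.compile(r"[^\./]+\.[^\./]+\.[^\./]+(?:/(.*))?")

def path_after_host(url: str) -> Optional[str]:
    """Return what follows the three-part host name, or None."""
    m = PATTERN.search(url)
    return m.group(1) if m else None

print(path_after_host("http://site.example.com/category/"))
print(path_after_host("http://site.example.com/en/page/"))
print(path_after_host("http://site.example.com"))
```

Since .* is greedy, a closing quote left at the end of the input would end up in the captured group, so the sketch assumes it has already been stripped.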
. in regex means any character (except \n newline), + means one or more of the previous expression, ? means 0 or 1 of previous expression; also forces minimal matching when an expression might match several strings within a search string (e.g. http://regexlib.com/CheatSheet.aspx).
A literal dot is matched by \..
So your regex boils down to: at least five arbitrary characters, a slash, and then at least one more character, matched as few times as possible.
Meaning it matches even http:/. And it does match both of your examples (tested with egrep and grep -P), but only if you replace href=" by href=\" and leave the last " out. Otherwise it will match none.
What you probably wanted was something like:
.+\..+\..+/.*
Or, if you want to be sure to match only urls, you might consider
http[s]?://([a-z]+\.)?[a-z]+\.[a-z]+/?[a-z/]*
The fixed part http[s]?: starts the expression (the s covers links to secure sites). [a-z] matches only lowercase letters. As you might stumble upon sites without a subdomain in the name, like stackoverflow.com, the first [a-z]+\. is made optional with ?, and so is the slash at the end of the host name. The final [a-z/]* matches any run of lowercase letters and slashes.

Regex to match a specific number in a string

I'm trying to fix a regex I created.
I have a URL like this:
http://www.demo.it/prodotti/822/Panasonic-TXP46G20E.html
and I have to match the product ID (822).
I wrote this regex
(?<=prodotti\/).*(?<=\/)
and the result is "822/"
What I want to match is always a group of digits between two slashes.
You're almost there!
Simply use:
(?<=prodotti\/).*?(?=\/)
instead of:
(?<=prodotti\/).*(?<=\/)
And you're good ;)
See it working here on regex101.
I've actually just changed two things:
replaced that lookbehind of yours ((?<=\/)) by its matching lookahead... so it asserts that we can match a / AFTER the last character consumed by .*.
changed the greediness of your matching pattern, by using .*? instead of .*. Without that change, for a URL that has several / following prodotti/, the match would not have stopped at the first one.
i.e., given the input string: http://www.demo.it/prodotti/822/Panasonic/TXP46G20E.html, it would have matched 822/Panasonic.
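The three variants discussed here can be compared side by side in Python, assuming the built-in re module (its lookbehind needs a fixed-width pattern, which prodotti/ is). The names ORIGINAL, GREEDY and LAZY are made up for illustration.

```python
import re

URL_1 = "http://www.demo.it/prodotti/822/Panasonic-TXP46G20E.html"
URL_2 = "http://www.demo.it/prodotti/822/Panasonic/TXP46G20E.html"

ORIGINAL = re.compile(r"(?<=prodotti/).*(?<=/)")  # lookbehind: keeps the slash
GREEDY = re.compile(r"(?<=prodotti/).*(?=/)")     # lookahead, still greedy
LAZY = re.compile(r"(?<=prodotti/).*?(?=/)")      # lookahead and lazy

print(ORIGINAL.search(URL_1).group(0))
print(GREEDY.search(URL_2).group(0))
print(LAZY.search(URL_1).group(0))
print(LAZY.search(URL_2).group(0))
```

Only the lazy version with the lookahead stops right after the product ID on both URLs.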