Extracting String Parts with Regular Expressions - regex

This is a string:
http://news.ycombinator.com/page?vasya=pupkin&b=b news.ycombinator.com/page news.ycombinator.com/page.php news.ycombinator.com/page
I am extracting a host with page. So I wrote the following regular expression:
([a-zA-Z0-9\.]*[a-zA-Z0-9]+[^\/][\.][a-zA-Z0-9\/\.]+)
It returns me these (in bold):
http://news.ycombinator.com/page?vasya=pupkin&b=b news.ycombinator.com/page news.ycombinator.com/page.php news.ycombinator.com/page
This is not exactly what I need. Regexp should not see a host with page in case of this string: http://news.ycombinator.com/page?vasya=pupkin&b=b, because it is a link, which should be treated differently.
Should be rejected:
"http://news.ycombinator.com/page?vasya=pupkin&b=b", "http://news.ycombinator.com/page", "http://news.ycombinator.com/","http://news.ycombinator.com".
Should not be rejected:
"news.ycombinator.com/page","news.ycombinator.com/page.php", "news.ycombinator.com/page/index", "news.ycombinator.com/page/index.php"
How can I improve this regexp so that it selects only those string parts that have no word characters nearby?

I'm not sure exactly what you are using to run your regex, but you've actually solved your own problem: you just need the regex to match whole words. The details depend on the program you are using, but here is a guideline (POSIX-style regex; note that a class such as [:space:] is only valid inside a bracket expression, hence [[:space:]]):
([[:space:]][a-zA-Z0-9\.]*[a-zA-Z0-9]+[^\/][\.][a-zA-Z0-9\/\.]+[[:space:]])
or maybe ([[:space:]]([a-zA-Z0-9]*[\.\/])+[a-zA-Z0-9]+[[:space:]])
In the second one, you will have to make the inner group non-capturing.
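As a rough illustration of the whole-word idea in Python, here is a minimal sketch. The pattern is an assumption, not the answer's exact regex: it uses the lookarounds (?<!\S) and (?!\S) in place of [[:space:]] so that tokens at the very start or end of the text also qualify, and it requires at least one /page segment after the host. The name HOST_WITH_PAGE and the sample text are made up for the example.

```python
import re

# Hypothetical pattern: a dotted host followed by one or more /path segments,
# with whitespace (or the string boundary) on both sides of the token.
HOST_WITH_PAGE = re.compile(
    r"(?<!\S)"                          # nothing but whitespace just before
    r"(?:[a-zA-Z0-9]+\.)+[a-zA-Z0-9]+"  # host such as news.ycombinator.com
    r"(?:/[a-zA-Z0-9.]+)+"              # at least one /page segment
    r"(?!\S)"                           # nothing but whitespace just after
)

text = (
    "http://news.ycombinator.com/page?vasya=pupkin&b=b "
    "http://news.ycombinator.com/ "
    "news.ycombinator.com/page news.ycombinator.com/page.php "
    "news.ycombinator.com/page/index.php"
)

# All groups are non-capturing, so findall returns the full matches.
matches = HOST_WITH_PAGE.findall(text)
print(matches)
```

The tokens that start with http:// are skipped because the host part cannot get past the colon, and no later position inside them is preceded by whitespace.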

Related

Reg expression to get a string starting from particular string

I'm trying to write a regular expression which returns a string after a particular string.
For example:
The string is
"<https://meraki/api/v1/sm/devices?fields%5B%5D=imei%2Ciccid%2ClastConnected%2CownerEmail%2C+ownerUsername%2CphoneNumber&perPage=1000&startingAfter=0>; rel=first"
The result I'm expecting is: first.
Here is the expression I'm using:
(?<=rel=\s").*(?=\)
Okay so this should work:
(?<=rel[=])[^"]*
I would advise looking over the syntax of regex again, because yours was not even matching the colons correctly. A lookbehind (?<=pattern) asserts that the pattern occurs immediately before the text you want to capture. Likewise, a lookahead (?=pattern) asserts that the pattern occurs immediately after it.
You can test your regex online here (or many other sites). They will show you the matching groups and errors, but will also explain what certain parts of the pattern do.
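A minimal Python sketch of the lookbehind approach, assuming the string from the question is held in a variable (the name header is made up for the example):

```python
import re

# The string from the question, including its surrounding double quotes.
header = (
    '"<https://meraki/api/v1/sm/devices?fields%5B%5D=imei%2Ciccid'
    "%2ClastConnected%2CownerEmail%2C+ownerUsername%2CphoneNumber"
    '&perPage=1000&startingAfter=0>; rel=first"'
)

# (?<=rel=) requires "rel=" right before the match without consuming it;
# [^"]* then takes everything up to the closing double quote.
match = re.search(r'(?<=rel=)[^"]*', header)
rel = match.group(0) if match else None
print(rel)  # first
```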

Trying to extract repeating pattern from string in php/javascript

The following is in PHP but the regex will also be used in javascript.
Trying to extract repeating patterns from a string
string can be any of the following:
"something arbitrary"
"D123"
"D111|something"
"D197|what.org|when.net"
"D297|who.197d234.whatever|when.net|some other arbitrary string"
I'm currently using the following regex: /^D([0-9]{3})(?:\|([^\|]+))*/
This correctly does not match the first string and matches the second and third correctly. The problem is that for the fourth and fifth strings it only captures the Dxxx and the last string. I need each of the strings between the '|' to be matched.
I'm hoping to use a regex as it makes it a single step. I realize I could just detect the leading Dxxx then use explode or split as appropriate to break the strings out. I've just gotten stuck on wanting a single regular expression match step.
This same regex may be used in Python as well so just want a generic regex solution.
There is no way to have a dynamic number of capture groups in a regular expression, but if you know some upper limit to how many parts you would have in one string, you can just repeat the pattern that many times:
/^D([0-9]{3})(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)/
So after the initial ^D([0-9]{3})(?:$|\|) you just repeat (.*?)(?:$|\|) as many times as you need it.
When the string has fewer elements, those remaining capture groups will match the empty string.
See regex tester.
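Since the question mentions Python, here is a small sketch of the fixed-number-of-groups idea there. It assumes an upper limit of three parts after the Dxxx prefix; the name FIXED and the limit are choices made for the example.

```python
import re

# Prefix plus three optional slots; each slot ends at a "|" or at the end.
FIXED = re.compile(
    r"^D([0-9]{3})(?:$|\|)"
    r"(.*?)(?:$|\|)"
    r"(.*?)(?:$|\|)"
    r"(.*?)(?:$|\|)"
)

for s in ["something arbitrary", "D123", "D197|what.org|when.net"]:
    m = FIXED.match(s)
    # Unused slots come back as empty strings, not None.
    print(s, "->", m.groups() if m else None)
```

For D197|what.org|when.net this gives ('197', 'what.org', 'when.net', ''), so the caller still has to drop the empty trailing slots.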
Is something like preg_match_all() (the PHP variant of a global match) also acceptable for you?
Then you could use:
^(?|D([0-9]{3})|^.+$|(?!^)\|([^|\n]*)(?=\||$))
This will match everything in a string in different matches, e.g. take your string:
D197|what.org|when.net
It will then give you three matches:
D197
what.org
when.net
Running live: https://regex101.com/r/jL2oX6/4 (Everything in green are your group matches. Ignore what's in blue.)
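Note that the branch-reset group (?|...) is a PCRE/Perl feature; Python's built-in re module does not support it. For Python, a two-step sketch along the lines the question already mentions (validate the Dxxx prefix, then split) might look like this. The function name parse and the returned tuple shape are assumptions for the example.

```python
import re

# Group 1: the three digits. Group 2: the whole "|part|part..." tail, if any.
PREFIX = re.compile(r"^D([0-9]{3})((?:\|[^|]+)*)$")

def parse(s):
    """Return (code, parts) for a Dxxx string, or None if it does not match."""
    m = PREFIX.match(s)
    if m is None:
        return None
    code, tail = m.groups()
    # The tail starts with "|", so the first split element is always empty.
    parts = tail.split("|")[1:] if tail else []
    return code, parts

print(parse("D197|what.org|when.net"))
print(parse("D123"))
print(parse("something arbitrary"))
```

This keeps the regex responsible only for validation, which avoids the limit on the number of capture groups.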

Improving a regex

I am looking for alternate methods to get john from the provided example.
My expression works as is but was hoping for some examples of better methods.
Example: john&home
my regexp: [a-z]{3,6}[^&home]
I'm matching any characters of length 3-6 up to but not including &home
Every item I run the regexp on is in the same format: 3-6 characters followed by &home
I have looked at other posts but was hoping for a reply specific to my regexp.
Most regex engines allow you to capture parts of a regex with capture groups. For instance:
^([A-Za-z]{3,6})&home$
The parentheses here mean that you are interested in the part before the &home. The ^ and $ mean that you want to match the entire string. Without them, averylongname&homeofsomeone will be matched as well.
Since you use rubular, I assume you use the Ruby regex engine. In that case you can for instance use:
full = "john&home"
name = full.match(/^([A-Za-z]{3,6})&home$/).captures.first
And name will in this case contain john (captures returns an array of all groups, so .first picks the only one).
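For comparison, here is a hedged Python sketch of the same capture-group idea. It also shows why the original [a-z]{3,6}[^&home] only works by accident: [^&home] is a character class meaning "one character that is not &, h, o, m or e", not "up to &home". The helper name extract_name and the sample input mike&home are made up for the example.

```python
import re

ANCHORED = re.compile(r"([A-Za-z]{3,6})&home")

def extract_name(item):
    """Return the 3-6 letter name before &home, or None if the format differs."""
    m = ANCHORED.fullmatch(item)  # fullmatch plays the role of ^ ... $
    return m.group(1) if m else None

print(extract_name("john&home"))
print(extract_name("averylongname&homeofsomeone"))

# The original pattern matches "joh" and then "n" as the non-&home character.
ORIGINAL = re.compile(r"[a-z]{3,6}[^&home]")
print(ORIGINAL.search("john&home").group(0))
# A name ending in one of the excluded letters is not found at all.
print(ORIGINAL.search("mike&home"))
```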

How to extract file location using Regular Expressions(VB.NET)

I am facing a problem whereby I am given a string that contains a path to a file and the file's name and I only want to extract the path (without the file's name)
For example, I will receive something like
C:\Users\OopsD\Projects\test.acdbd
and from that string I want to extract only
C:\Users\OopsD\Projects
I was trying to create a RegEx to match a backslash followed by a word, followed by a dot followed by another word - this is to match the
\test.acdbd
part and replace it with empty string so that the final result is
C:\Users\OopsD\Projects
Can anyone, familiar with RegEx, help me on this one? Also, I will be using regular expressions quite a lot in the future. Is there a (free) program I can download to create regular expressions?
Are you really sure you need to be using Regex for such a simple task? How about this:
Dim file As New IO.FileInfo("C:\Users\OopsD\Projects\test.acdbd")
MsgBox(file.Directory.FullName)
Regarding the free program on Regex, I would definitely recommend http://www.gskinner.com/RegExr/ - using it all the time. But you always have to consider alternatives, before going the Regex way.
The regex that you are looking for is as below (the path separator in the example is a backslash, so that is the character to exclude):
\\[^\\]+$
It matches the last backslash and the file name after it, so replacing the match with an empty string leaves only the directory.
where,
^ (caret): Outside a character class, matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well. Inside square brackets, as in [^\\], a leading caret instead negates the class: it matches any character except those listed (here, any character other than a backslash).
$ (dollar): Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break.
+ (plus): Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only once.
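As an illustration only (the question is about VB.NET), here is a Python sketch that strips the last backslash-delimited component from the example path with a regex, next to the standard-library alternative. The pattern \\[^\\]+$ is an assumption that the separator is a backslash, as in the question's path.

```python
import re
from pathlib import PureWindowsPath

path = r"C:\Users\OopsD\Projects\test.acdbd"

# Remove the final backslash and everything after it (the file name).
directory = re.sub(r"\\[^\\]+$", "", path)
print(directory)

# Library alternative: no regex, works on any OS because the path is "pure".
parent = str(PureWindowsPath(path).parent)
print(parent)
```

Both lines print C:\Users\OopsD\Projects; the library call is usually the safer choice because it already knows the path rules.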
More reference material can be found at this link.
Many regex tools are available. Some of them are:
www.gskinner.com/RegExr/
www.txt2re.com
Rubular- It is not just for Ruby.

How to remove a small part of the string in the big string using RegExp

Hey guys, I don't know RegExp yet. I know a little about it, but I'm not an experienced user.
Suppose that I run a RegExp match on a website and the matches are:
Data: Informations
Data: Liberty
Then I want to extract only Informations and Liberty, I don't want the Data: part.
Does Data: always appear at the beginning of a line?
Can there be multiple spaces between the : and the next word?
Do you know about groups?
What do you want: lazy matching vs greedy matching?
If so, you can use (with lazy matching):
^Data:\s+(.*?)$
With character classes:
^Data:\s+(\w+)$
if you know that it'll always be a word. Try this website.
Can't be absolutely sure without knowing more about the potential matches, but this should be at least a good starting point:
Data: (.*)$
That will return everything after "Data: " to the end of the line.
Search for a regular expression like
Data: (.*)
Then use the "first submatch", which is often referred to by "$1" or "\1", depending on the language you are using.
Regular expression engines support what are commonly called "capturing groups". If you surround a pattern or part of a pattern with (), the part of the string matched by that part of the regular expression will be captured.
The command(s) you use to do the matching will determine how to get these captured values. They may be stored in special variables (eg: $1, $2) or you may be able to specify the names of the variables either embedded within the regular expression or as arguments to the regular expression command. Exactly how depends on what language you are using.
So, read up on the regexp commands for the language of your choice and look for the term "capturing groups" or maybe just "groups".
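To make the capturing-group idea concrete, here is a minimal Python sketch. It assumes the matched lines are available as one multi-line string; the variable names are made up for the example.

```python
import re

text = "Data: Informations\nData: Liberty"

# re.MULTILINE lets ^ and $ match at each line boundary.
# With exactly one capturing group, findall returns just that group's text.
values = re.findall(r"^Data:\s+(.*?)$", text, re.MULTILINE)
print(values)  # ['Informations', 'Liberty']

# The same group on a single line, read through the match object.
m = re.search(r"Data: (.*)", "Data: Liberty")
first_submatch = m.group(1)
print(first_submatch)  # Liberty
```

In Python the "first submatch" is m.group(1); other languages expose the same value as $1 or \1.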