Parsing whitespace-oriented conf file with Regex - regex

I'm trying to parse a gitolite.conf file, which is a whitespace-oriented conf file with a few regexes. The worst problem is that some options might appear anywhere:
#staff = dilbert alice # line 1
#projects = foo bar # line 2
repo #projects baz # line 3
RW+ = #staff # line 4
- master = ashok # line 5
RW = ashok # line 6
R = wally # line 7
config hooks.emailprefix = '[%GL_REPO] ' # line 8
Check the "master" attribute. Some repos have them, others do not. It's a real pain.

This answer assumes a goal of extracting key/value pairs into capturing groups, where key consists of contiguous non-whitespace before = and value includes everything after = but before #, trimmed of leading/trailing whitespace.
Basic version
([^\s]+)\s*=\s*((?:\s*[^\s#]+)*)
More advanced version
The regex above doesn't handle quoted strings very well (e.g. prefix = ' Quoted with # and leading/trailing whitespace '). Regex isn't great at this kind of thing but simple cases can be handled as follows:
([^\s]+)\s*=\s*('[^']*'|"[^"]*"|(?:(?:\s*[^\s#]+)*))
Here's the demo if you need to see what is captured and play around with it more: Debuggex Demo

First, you should know that this isn't entirely possible with Regex. Regex is a great tool for parsing regular languages (including some types of configuration files), but as soon as you get into "Well, this line is actually a header line and we need all lines under it, and some lines might have this token, and others might not", it gets quite messy. I'm not saying it's impossible, but you're going to waste a lot of time debugging your Regex pattern instead of just writing a parser in whatever language you're using this with.
Second, if you're going to ask a quesiton about Regex, it is always helpful to know what you want out of the expression. Do you want to tokenize everything, do you only want the configuration keys, do you only want the comments?
That being said, I took my best guess, here's an expression to get you started:
^(?:([^=#]+?)\s.?=?\s.?([^=#]+?)\s.?(?:#|$))
With this expression, please apply the g and m flags (global and multiline). In PCRE, this would look like:
/^(?:([^=#]+?)\s.?=?\s.?([^=#]+?)\s.?(?:#|$))/gm
There are two capture groups, one is whatever is before the = sign, and the other is whatever is after. If there is no = sign, the first capture group contains everything. Anything after "#" is ignored.
Here's a fiddle to demonstrate: http://www.rexfiddle.net/eQexbZU

Related

How to find "complicated" URLs in a text file

I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This will trim your output containing trail characters, ) .
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
And this regex seems workable to extract URL from your original sample text.
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
Demo,,, ( edited from https?:[\S]*\/)
Python script may be something like this
ss=""" this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx= re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring .[-;:&=+\$,\w]+-class (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+`
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]
In the original regex we have $-_. This creates a range so it actually inclues everything between $ and _ on the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
des_links = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
tmps = "/".join(i.split('/')[0:-1])
print(tmps)

Powershell: Using regex to match patterns and its variations (which has special characters)

I have several thousand text files containing form information (one text file for each form), including the unique id of each form.
I have been trying to extract just the form id using regex (which I am not too familiar with) to match the string of characters found before and after the form id and extract only the form ID number in between them. Usually the text looks like this: "... 12 ID 12345678 INDEPENDENT BOARD..."
The bolded 8-digit number is the form ID that I need to extract.
The code I used can be seen below:
$id= ([regex]::Match($text_file, "12 ID (.+) INDEPENDENT").Groups[1].Value)
This works pretty well, but I soon noticed that there were some files for which this script did not work. After investigation, I found that there was another variation to the text containing the form ID used by some of the text files. This variation looks like this: "... 12 ID 12345678 (a.12(3)(b),45)..."
So my first challenge is to figure out how to change the script so that it will match the first or the second pattern. My second challenge is to escape all the special characters in "(a.12(3)(b),45)".
I know that the pipe | is used as an "or" in regex and two backslashes are used to escape special characters, however the code below gives me errors:
$id= ([regex]::Match($text_one_line, "34 PR (.+) INDEPENDENT"|"34 PR (.+) //(a//.12//(3//)//(b//)//,45//)").Groups[1].Value)
Where have I gone wrong here and how I can fix my code?
Thank you!
When you approach a regex pattern always look for fixed vs. variable parts.
In your case the ID seems to be fixed, and it is, therefore, useful as a reference point.
The following pattern applies this suggestion: (?:ID\s+)(\d{8})
(click on the pattern for an explanation).
$str = "... 12 ID 12345678 INDEPENDENT BOARD..."
$ret = [Regex]::Matches($str, "(?:ID\s+)(\d{8})")
for($i = 0; $i -lt $ret.Count; $i++) {
$ret[0].Groups[1].Value
}
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. It contains a treasure trove of useful information.

Using regex multiple capture groups to split up a string

I have a file that looks like this...
"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123456","D","TEST1 "
"1234567123456","D","TEST 2~TEST3"
"1234567123456","R","TEST4~TEST5"
"1234567123457","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123457","D","TEST 6"
"1234567123457","D","TEST7"
"1234567123457","R","TEST 8~TEST9~TEST,10"
All I'm trying to do is parse the D and R lines. The ~ is used in this case as a separator. So the end results would be...
"1234567123456","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123456","D","TEST1 "
"1234567123456","D","TEST3"
"1234567123456","D","TEST3"
"1234567123456","R","TEST4"
"1234567123456","R","TEST5"
"1234567123457","V","0","0","BLAH","BLAH","BLAH","BLAH"
"1234567123457","D","TEST 6"
"1234567123457","D","TEST7"
"1234567123457","R","TEST 8"
"1234567123457","R","TEST9"
"1234567123457","R","TEST,10"
I'm using regex on applications like Textpad and Notepad++. I have not figured out how to use a regex like /.+/g because the applications do not like the forward slashes. So I don't think I can use things like the global modifier. I currently have the following regex...
//In a program like Textpad/Notepad++
<FIND> "(.{13})","D","([^~]*)~(.*)
<REPLACE> "\1","D","\2"\n"\1","D","\3
Now if I run a find and replace with the above params a few times it would work fine (for the D lines only). The problem is there is an unknown number of lines to be made. For example...
"1234567123456","D","TEST1~TEST2~TEST3~TEST4~TEST5"
"1234567123457","D","TEST1~TEST2~TEST3"
"1234567123458","D","TEST1~TEST2"
"1234567123459","D","TEST1~TEST2~TEST3~TEST4"
I was hoping to be able to use a MULTI capture group to make this work. I found this PAGE talking about the common mistake between repeating a capturing group and capturing a repeated group. I need to capture a repeated group. For some reason I just could not make mine work right though. Anyone else have an idea?
Note: If I could get rid of the leading and trailing spaces EX: "1234567123456","D","TEST1 " ending up as "1234567123456","D","TEST1" that would be even better but not necessary.
RESOURCES:
http://www.regular-expressions.info/captureall.html
http://regex101.com/

What is the regex required to find specific urls within content from a list of urls generated by a for loop?

As I write this I realise there are two parts to this question, however I think I am only really stuck on the first part and therefore the second is only provided for context:
Part A:
I need to search the contents of each value returned by a for loop (where each value is a url) for the following:
href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-n"
where:
the numerals 163472311232 could be any length (ie it could be 5478)
-text-text-text-text could be any number of different words
where page-n could be from page-2 up until any number
where matches are not returned more than once, ie only unique matches are returned and therefore only one of the following would be returned:
href="/dir/Sub_Dir/dir/5422-la-la/page-4
href="/dir/Sub_Dir/dir/5422-la-la/page-4
Part B:
So the logic would be something like:
list_of_urls = original_list
for url in list_of_urls:
headers = {'User-Agent' : 'Mozilla 5.0'}
request = urllib2.Request(url, None, headers)
url_for_re = urllib2.urlopen(request).read()
another_url = re.findall(r'href="(/dir/Sub_dir\/dir/[^"/]*)"', url_for_re, re.I)
file.write(url)
file.write('\n')
file.write(another_url)
file.write('\n')
Which i am hoping will give me output similar to:
a.html
a/page-2.html
a/page-3.html
a/page-4.html
b.html
b/page-2.html
b/page-3.html
b/page-4.html
So my question is (assuming the logic in part B is ok):
What is the required regex pattern to use for part A?
I am a newbie to python and regex so this will limit my understanding somewhat in regards to relatively complicated regex suggestions etc.
update:
after suggestions i tried to test the following regex which did not produce any results:
import re
content = 'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"'
matches = re.findall(r'href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"', content, re.I)
prefix = 'http://www.test.com'
for match in matches:
i = prefix + match + '\n'
print i
solution:
i think this is the regex that will work:
matches = re.findall(r'href="(/dir/Sub_Dir/dir/[^"/]*/page-[2-9])"', content, re.I)
You can have... most of what you want. Regexes don't really do the distinct thing, so I suggest you just use them to get all the URLs, and then remove duplicates yourself.
Off the top of my head it would be something like this:
href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9])|([1-9][0-9]+)"
Plus or minus escaping rules, specifics on what words are allowed, etc. I'm a Windows guy, there's a great tool called Expresso which is helpful for learning regexes. I hope there's an equivalent for whatever platform you're using, it comes in handy.

Regular Expression break down URL into parts

I've just recently started learning Regex so i'm not sure yet about a couple of aspects of the hole thing.
Right now my web page reads in the URL breaks it up into parts and only uses certain parts for processing:
E.g. 1) http://mycontoso.com/products/luggage/selloBag
E.g. 2) http://mycontoso.com/products/luggage/selloBag.sf404.aspx
For some reason Sitefinity is giving us both possibilities, which is fine, but what I need from this is only the actual product details as in "luggage/selloBag"
My current Regex expression is: "(.*)(map-search)(\/)(.*)(\.sf404\.aspx)", I combine this with a replace statement and extract the contents of group 4 (or $4), which is fine, but it doesn't work for example 2.
So the question is: Is it possible to match 2 possibilities with regular expressions where a part of a string might or might not be there and then still reference a group whose value you actually want to use?
RFC-3986 is the authority regarding URIs. Appendix B provides this regex to break one down into its components:
re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
# Where:
# scheme = $2
# authority = $4
# path = $5
# query = $7
# fragment = $9
Here is an enhanced (and commented) regex (in Python syntax) which utilizes named capture groups:
re_3986_enhanced = re.compile(r"""
# Parse and capture RFC-3986 Generic URI components.
^ # anchor to beginning of string
(?: (?P<scheme> [^:/?#\s]+): )? # capture optional scheme
(?://(?P<authority> [^/?#\s]*) )? # capture optional authority
(?P<path> [^?#\s]*) # capture required path
(?:\?(?P<query> [^#\s]*) )? # capture optional query
(?:\#(?P<fragment> [^\s]*) )? # capture optional fragment
$ # anchor to end of string
""", re.MULTILINE | re.VERBOSE)
For more information regarding the picking apart and validation of a URI according to RFC-3986, you may want to take a look at an article I've been working on: Regular Expression URI Validation
Depends on your regex implementation, but most support a syntax like
(\.sf404\.aspx|)
Assuming that's your group 4 (i.e. zero-indexed groups). The | lists two alternatives, one of which is the empty string.
You don't say if you're doing this in javascript, but if you are, the parseUri lib written by Steven Levithan does a pretty damn good job at parsing urls. You can get it from various places, including here (click on the "Source Code" tab) and here.