Parsing sectioned file with augeas

I am trying to create a module for parsing vim files which are sectioned in a specific manner. A sample file:
" My section {
set nocompatible " be iMproved
set encoding=utf-8
" }
" vim: set foldmarker={,} foldlevel=0 foldmethod=marker:
While writing the module, I've got stuck at this point:
module Vimrc =
autoload xfm
let section = del "\" " "\" " . key /[^\n]+/ . del "\n" "\n" . store /.*/ . del "\" " "\" "
let lns = [ section . del "\n" "\n" ] *
let filter = (incl "*.vim")
let xfm = transform lns filter
I'm aware that there are some other mistakes, but it complains about the regex key /[^\n]+/, saying:
/tmp/aug/vimrc.aug:3.36-.48:exception: The key regexp /[^ ]+/ matches
a '/'
I do not understand what the / character has got to do with this.

As the error says, your key regexp matches a slash: /[^\n]+/ accepts any character except a newline, and that includes /. This is illegal in a key because / is used as the level separator in the tree.
If your section names can contain slashes, you need to store them as a node value rather than as a label. So instead of:
{ "My section"
{ "set" = "nocompatible" { "#comment" = "be iMproved" } } }
you'll have to do:
{ "section" = "My section"
{ "set" = "nocompatible" { "#comment" = "be iMproved" } } }
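To see why the original key regexp is rejected, it can help to try the same character class outside Augeas. The sketch below uses Python's re module purely as an analogy (Augeas has its own regex engine and type checker); the names KEY_ANY and KEY_NO_SLASH and the sample section name are made up for the illustration.

```python
import re

# Same character class as in the lens: anything except a newline.
KEY_ANY = re.compile(r"[^\n]+")
# Hypothetical stricter variant that also excludes the path separator.
KEY_NO_SLASH = re.compile(r"[^\n/]+")

name = "My section/with slash"

# The original class accepts the whole name, slash included,
# which is what the error message is complaining about.
print(KEY_ANY.fullmatch(name) is not None)       # True
# The stricter class cannot match a name that contains '/'.
print(KEY_NO_SLASH.fullmatch(name) is not None)  # False
```

Excluding / from the key class is only an option if section names never contain slashes; otherwise storing the name as a value, as described above, is the safer route.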

Regex (JS Notation): Select spaces not in [ [], {}, "" ] to tokenize string

So I need to tokenize a string by all spaces not between quotes, I am using regex in Javascript notation.
For example:
" Test Test " ab c " Test" "Test " "Test" "T e s t"
becomes
[" Test Test ",ab,c," Test","Test ","Test","T e s t"]
For my use case however, the solution should work in the following test setting:
https://www.regextester.com/
All Spaces not within quotes should be highlighted in the above setting. If they are highlighted in the above setting they would be parsed correctly in my program.
For more specificity, I am using Boost::Regex C++ to do the parsing as follows:
...
std::string test_string("\" Test Test \" ab c \" Test\" \"Test \" \"Test\" \"T e s t\"");
// (,|;)?\\s+ : Split on ,\s or ;\s
// (?![^\\[]*\\]) : Ignore spaces inside []
// (?![^\\{]*\\}) : Ignore spaces inside {}
// (?![^\"].*\") : Ignore spaces inside "" !!! MY ATTEMPT DOESN'T WORK !!!
//Note the below regex delimiter declaration does not include the erroneous regex.
boost::regex delimiter("(,|;\\s|\\s)+(?![^\\[]*\\])(?![^\\(]*\\))(?![^\\{]*\\})");
std::vector<std::string> string_vector;
boost::split_regex(string_vector, test_string, delimiter);
For those of you who do not use Boost::regex or C++ the above link should enable testing of viable regex for the above use case.
Thank you all for your assistance. I hope you can help me with the above problem.

I would 100% not use regular expressions for this, first off because it is much easier to express as a PEG grammar. For example, with Boost Spirit X3:
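#include <boost/spirit/home/x3.hpp>
#include <string>
#include <string_view>
#include <vector>

std::vector<std::string> tokens(std::string_view input) {
    namespace x3 = boost::spirit::x3;
    std::vector<std::string> r;

    // An atom is a bracketed, braced or quoted group, or one printable character.
    auto atom                               //
        = '[' >> *~x3::char_(']') >> ']'    //
        | '{' >> *~x3::char_('}') >> '}'    //
        | '"' >> *~x3::char_('"') >> '"'    //
        | x3::graph;

    // A token is a run of atoms, exposed as the raw matched text.
    auto token = x3::raw[*atom];

    // Tokens are separated by one or more whitespace characters.
    parse(input.begin(), input.end(), token % +x3::space, r);
    return r;
}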
This, off the bat, already performs as you intend:
#include <iomanip>
#include <iostream>

int main() {
    for (std::string const input : {R"(" Test Test " ab c " Test" "Test " "Test" "T e s t")"}) {
        std::cout << input << "\n";
        for (auto& tok : tokens(input))
            std::cout << " - " << quoted(tok, '\'') << "\n";
    }
}
Output:
" Test Test " ab c " Test" "Test " "Test" "T e s t"
- '" Test Test "'
- 'ab'
- 'c'
- '" Test"'
- '"Test "'
- '"Test"'
- '"T e s t"'
BONUS
Where this really makes a difference is when you realize that you want to be able to handle nested constructs (e.g. "string" [ {1,2,"3,4", [true,"more [string]"], 9 }, "bye ]).
Regular expressions are notoriously bad at this, whereas Spirit grammar rules can be recursive. If you make your grammar description more explicit, I could show you examples.
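To make the point about nesting concrete without Spirit, here is a minimal hand-written scanner in Python. It is a sketch of the general idea (track open brackets and quotes, split only on whitespace at depth zero), not a port of the grammar above. The function name tokens is reused for convenience, and the scanner assumes well-formed input with no escaped quotes.

```python
def tokens(text: str) -> list[str]:
    """Split on whitespace that is outside quotes and outside any bracket pair."""
    closers = {"[": "]", "{": "}"}
    stack = []        # closing brackets we are still waiting for
    in_quote = False  # True while inside a "..." string
    current = []      # characters of the token being built
    result = []
    for ch in text:
        if in_quote:
            current.append(ch)
            if ch == '"':
                in_quote = False
        elif ch == '"':
            in_quote = True
            current.append(ch)
        elif ch in closers:
            stack.append(closers[ch])
            current.append(ch)
        elif stack and ch == stack[-1]:
            stack.pop()
            current.append(ch)
        elif ch.isspace() and not stack:
            # Top-level whitespace ends the current token.
            if current:
                result.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        result.append("".join(current))
    return result


print(tokens('" Test Test " ab c " Test" "Test " "Test" "T e s t"'))
print(tokens('"string" [ {1,2,"3,4", [true,"more [string]"], 9 } ] bye'))
```

The second call keeps the whole bracketed group as a single token, because the spaces inside it are never seen at depth zero.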

You can use multiple regexes if you are ok with that. The idea is to replace spaces inside quotes with a non-printable char (\x01), and restore them after the split:
const input = `" Test Test " ab c " Test" "Test " "Test" "T e s t"`;
let result = input
  .replace(/"[^"]*"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
  .split(/ +/) // split on spaces
  .map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
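The same mask, split and restore idea carries over to other languages. As a rough sketch, here it is in Python; the function name split_outside_quotes and the PLACEHOLDER constant are invented for the example, and the code assumes that \x01 never occurs in the input and that quotes are not escaped.

```python
import re

PLACEHOLDER = "\x01"  # assumed never to occur in the input


def split_outside_quotes(text: str) -> list[str]:
    # 1. Hide spaces inside "..." behind the placeholder.
    masked = re.sub(
        r'"[^"]*"',
        lambda m: m.group(0).replace(" ", PLACEHOLDER),
        text,
    )
    # 2. Split on the remaining (unquoted) runs of spaces, dropping empties.
    parts = [p for p in re.split(r" +", masked) if p]
    # 3. Put the hidden spaces back.
    return [p.replace(PLACEHOLDER, " ") for p in parts]


print(split_outside_quotes('" Test Test " ab c " Test" "Test " "Test" "T e s t"'))
```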
If you have escaped quotes within a string, such as "a \"quoted\" token", you can use the regex below instead. Note that the sample input uses String.raw, because a plain template literal would silently drop the backslashes before the quotes:
const input = String.raw`"A \"quoted\" token" " Test Test " ab c " Test" "Test " "Test" "T e s t"`;
let result = input
  .replace(/"(?:[^"\\]|\\.)*"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes, skipping over \" escapes
  .split(/ +/) // split on spaces
  .map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
If you want to parse nested brackets you need a proper language parser. You can also do that with regexes however: Parsing JavaScript objects with functions as JSON
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex

Cleaning up formatting after deletion using regex

I have a function similar to the one below appearing in multiple files. I want to use regex to get rid of all references to outputString, since clearly, they're wasteful.
... other functions, class declarations, etc
public String toString()
{
String outputString = "";
return ... some stuff
+ outputString;
}
... other functions, class declarations, etc
I'm happy to do this in multiple passes. So far I've got regexes to find the first and last lines (String outputString = "";$ and ( \+ outputString;)$). However, I've got two problems: first, I want to get rid of the whitespace that is left behind after deleting the two lines that refer to outputString. Second, I need the final ; on the second-to-last line to move up to the line above it.
As a bonus, I'd also like to know what's wrong with adding the line start anchor (^) to either of the regexes I specified. It seems like doing so would tighten them up, but when I try something like ^( \+ outputString;)$ I get zero results.
After all's said and done the function above should look like this:
... other functions, class declarations, etc
public String toString()
{
return ... some stuff;
}
... other functions, class declarations, etc
Here's an example of what "some stuff" might be:
"name" + ":" + getName()+ "," +
"id" + ":" + getId()+ "]" + System.getProperties().getProperty("line.separator") +
" " + "student = "+(getStudent()!=null?Integer.toHexString(System.identityHashCode(getStudent())):"null")
Here's a concrete example:
Current:
public void delete()
{
  Student existingStudent = student;
  student = null;
  if (existingStudent != null)
  {
    existingStudent.delete();
  }
}
public String toString()
{
  String outputString = "";
  return super.toString() + "["+
    "name" + ":" + getName()+ "," +
    "id" + ":" + getId()+ "]" + System.getProperties().getProperty("line.separator") +
    " " + "student = "+(getStudent()!=null?Integer.toHexString(System.identityHashCode(getStudent())):"null")
    + outputString;
}
public String getId()
{
  return id;
}
Required:
public void delete()
{
  Student existingStudent = student;
  student = null;
  if (existingStudent != null)
  {
    existingStudent.delete();
  }
}
public String toString()
{
  return super.toString() + "["+
    "name" + ":" + getName()+ "," +
    "id" + ":" + getId()+ "]" + System.getProperties().getProperty("line.separator") +
    " " + "student = "+(getStudent()!=null?Integer.toHexString(System.identityHashCode(getStudent())):"null");
}
public String getId()
{
  return id;
}

1st pass:
Find:
.*outputString.*\R
Replace with empty string.
Demo:
https://regex101.com/r/g3aYnp/2
2nd pass:
Find:
(toString\(\)[\s\S]+\))(\s*\R\s*?\})
Replace:
$1;$2
https://regex101.com/r/oxsNRW/3
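For anyone who wants to try the two passes outside an editor, here is a rough Python sketch. It is an approximation: Python's re module has no \R, so \r?\n stands in for it, and the shortened toString() body is invented for the example.

```python
import re

source = (
    "  public String toString()\n"
    "  {\n"
    '    String outputString = "";\n'
    '    return super.toString() + "[" +\n'
    '    "name" + ":" + getName()\n'
    "    + outputString;\n"
    "  }\n"
)

# Pass 1: drop every line that mentions outputString.
pass1 = re.sub(r".*outputString.*\r?\n", "", source)

# Pass 2: add the missing ';' after the last ')' before the closing brace.
pass2 = re.sub(r"(toString\(\)[\s\S]+\))(\s*\r?\n\s*?\})", r"\1;\2", pass1)

print(pass2)
```

Because [\s\S]+ is greedy, the second pattern reaches for the last ) that is followed only by whitespace and a closing brace, so it is worth checking the result on files that contain several methods.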

Assuming that the wanted part of the return expression does not contain any semicolons (i.e. ;), you can do it in one replace. Search for:
^ +String outputString = "";\R( +return [^;]+?)\R +\+ outputString;
and replace with:
\1;
The idea is to match all three parts in one go (the declaration, the return expression and the + outputString; line), keep the wanted part, and add the ;.
One interesting point about this replacement: my first attempt had ... return [^;]+)\R +\+ ... and it failed, whereas ... return [^;]+)\r\n +\+ ... worked. The \R version appeared to leave a line break before the final ;. Turning on menu => View => Show symbol => Show end of line revealed that the greedy term within the capture group had collected the \r, so the \R matched only the \n. Changing to the non-greedy form [^;]+? allowed the \R to match the entire \r\n.
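The greedy versus non-greedy effect can be reproduced outside the editor. The following is a sketch in Python rather than the editor's own regex engine: Python's re module has no \R, so \r?\n stands in for it, and the shortened return expression (a() and b()) as well as the names HEAD and TAIL are made up for the example.

```python
import re

# CRLF line endings, as in a file saved on Windows.
source = (
    '    String outputString = "";\r\n'
    "    return a() +\r\n"
    "    b()\r\n"
    "    + outputString;\r\n"
)

HEAD = r'^ +String outputString = "";\r?\n'
TAIL = r"\r?\n +\+ outputString;"

# Non-greedy capture: stops before the \r, so the full \r\n is consumed by TAIL.
lazy = re.sub(HEAD + r"( +return [^;]+?)" + TAIL, r"\1;", source, flags=re.M)
# Greedy capture: backtracks only as far as needed, so it keeps the \r.
greedy = re.sub(HEAD + r"( +return [^;]+)" + TAIL, r"\1;", source, flags=re.M)

print(repr(lazy))    # ends with 'b();\r\n'
print(repr(greedy))  # ends with 'b()\r;\r\n' (stray carriage return before the ;)
```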

Augeas: How to match dash?

I want to write a lens for duply exclude files. Example:
+ /etc
- /
So my lens looks like this:
module DuplyExclude =
let nl = del /[\n]+/ "\n"
let entry = [ label "entry" . [ label "op" . store /(\+|-)/ ] . del /[ \t]+/ " " . [ label "path" . store /\/[^ \t\n\r]+/ ] ]
let lns = ( entry . nl )*
test lns get "+ /hello\n+ /etc\n- /" = ?
This results in an error. From experimenting a bit, I believe that the regular expression /(\+|-)/ does not match the last line. The question is: why does the dash seem to be unmatchable, even when escaped with \?

The dash itself matches fine. The test fails for two other reasons:
1. The test string is missing a trailing \n. This matters because lns is defined as an entry followed by an unconditional newline. Note that this only really affects string tests with augparse: when loading files via the library, it adds a trailing \n to any file read in (since many lenses can't handle a missing EOL).
2. The path node is defined as a single / followed by at least one (+) other character, in store /\/[^ \t\n\r]+/. This won't match an entry consisting of a single /.
So with these two changes, this lens works:
module DuplyExclude =
let nl = del /[\n]+/ "\n"
let entry = [ label "entry" . [ label "op" . store /(\+|-)/ ] . del /[ \t]+/ " " . [ label "path" . store /\/[^ \t\n\r]*/ ] ]
let lns = ( entry . nl )*
test lns get "+ /hello\n+ /etc\n- /\n" = ?
Test result: /tmp/duplyexclude.aug:6.2-.44:
{ "entry"
{ "op" = "+" }
{ "path" = "/hello" }
}
{ "entry"
{ "op" = "+" }
{ "path" = "/etc" }
}
{ "entry"
{ "op" = "-" }
{ "path" = "/" }
}
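Both effects can be checked quickly with ordinary regular expressions. The sketch below is only an analogy in Python, not Augeas itself: it treats ( entry . nl )* as one big regex and ignores tree construction. The helper name entry_lines is hypothetical.

```python
import re


def entry_lines(path_quantifier: str) -> re.Pattern:
    """Regex analogue of `( entry . nl )*` with a configurable path quantifier."""
    entry = r"(?:\+|-)[ \t]+/[^ \t\n\r]" + path_quantifier
    return re.compile(r"(?:" + entry + r"\n+)*")


original = entry_lines("+")  # path needs at least one character after '/'
fixed = entry_lines("*")     # a bare '/' is accepted

text = "+ /hello\n+ /etc\n- /"

# The op regex on its own has no trouble with the dash.
print(re.fullmatch(r"(\+|-)", "-") is not None)     # True

print(original.fullmatch(text) is not None)         # False
print(fixed.fullmatch(text) is not None)            # False: no trailing newline
print(original.fullmatch(text + "\n") is not None)  # False: nothing follows the '/'
print(fixed.fullmatch(text + "\n") is not None)     # True
```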

QRegExp not extracting text as expected

I am trying to extract text from between square brackets on a line of text. I've been messing with the regex for some time now, and cannot get what I need. (I can't even explain why the output is what it is). Here's the code:
QRegExp rx_timestamp("\[(.*?)\]");
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
qDebug() << "timestamp cap: " <<rx_timestamp.cap(0);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(1);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(2);
} else qDebug() << "No indexin";
The input line is:
messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found
And the output is:
Captured texts: (".")
timestamp cap: "."
timestamp cap: ""
timestamp cap: ""
Can someone explain what is going on? Why is cap returning "." when no such character exists between the square brackets?
Can someone correct the regex to extract the timestamp from between the square brackets?

You are missing two things: escaping the backslashes, and using setMinimal. See below.
QString line = "messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '\"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found";
QRegExp rx_timestamp("\\[(.*)\\]");
rx_timestamp.setMinimal(true);
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
    qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
    qDebug() << "timestamp cap: " << rx_timestamp.cap(0);
    qDebug() << "timestamp cap: " << rx_timestamp.cap(1);
    qDebug() << "timestamp cap: " << rx_timestamp.cap(2);
} else {
    qDebug() << "No indexin";
}
Output:
Captured texts: ("[2013-10-08 09:13:41]", "2013-10-08 09:13:41")
timestamp cap: "[2013-10-08 09:13:41]"
timestamp cap: "2013-10-08 09:13:41"
timestamp cap: ""
UPDATE: What is going on:
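In C++ source code a backslash starts an escape sequence, such as \n. To get a literal backslash into the string, and so into the regular expression, you have to write two of them: \\. The regular expression engine then sees a single \, which is what you would write directly in Ruby, Perl or Python.
The square brackets need to be escaped too, because in a regex they normally delimit a character class (a set or range of characters).
That also explains the "." in your output: \[ is not a valid C++ escape sequence, so (typically after a compiler warning) the backslash is dropped and the engine receives [(.*?)]. That is a character class matching any one of ( . * ? ), and the first such character in the line is the . in chan_sip.c.
So for the regular expression engine to see a literal square bracket you need to send it
\[
but a C++ string literal can't contain a \ character without two of them in a row, so it turns into
\\[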
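The difference between the two patterns the engine can end up seeing is easy to try in another regex flavour. Here is a small sketch using Python's re module (not QRegExp, which, unlike Python, has no *? syntax and relies on setMinimal instead); raw strings are used so that the pattern reaches the engine exactly as written.

```python
import re

line = (
    "messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: "
    "Registration from '\"xx000 <sip:xx000#183.229.164.42:5060>' failed "
    "for '192.187.100.170' - No matching peer found"
)

# What the engine receives when the backslashes are lost:
# a character class containing ( . * ? )
broken = re.search(r"[(.*?)]", line)
print(broken.group(0))  # .

# What it receives when the backslashes survive:
# literal brackets around a lazy capture group.
fixed = re.search(r"\[(.*?)\]", line)
print(fixed.group(0))  # [2013-10-08 09:13:41]
print(fixed.group(1))  # 2013-10-08 09:13:41
```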
While learning regex, I liked using this regex tool by GSkinner. It has a listing on the right hand side of the page of unique codes and characters.
QRegExp doesn't follow the usual regex syntax exactly. If you study the documentation you will find a lot of small differences, such as how it handles greedy versus lazy matching: it does not support the *? syntax, so you call setMinimal(true) instead.
See also: QRegExp and double-quoted text for QSyntaxHighlighter
How the captures are listed is pretty typical of regex parsers, as far as I have seen. The list starts with the entire match, cap(0), followed by the first capture group, cap(1) (what was enclosed by the first set of parentheses), and so on.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#cap
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#capturedTexts
To find more matches, you have to iteratively call indexIn.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#indexIn
QString str = "offsets: 1.23 .50 71.00 6.00";
QRegExp rx("\\d*\\.\\d+"); // primitive floating point matching
int count = 0;
int pos = 0;
while ((pos = rx.indexIn(str, pos)) != -1) {
    ++count;
    pos += rx.matchedLength();
}
// pos will be 9, 14, 18 and finally 24; count will end up as 4
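For comparison, the same search-and-advance loop can be sketched in Python, where Pattern.search takes a start position much like indexIn does. This is an illustration only; the variable names are arbitrary.

```python
import re

text = "offsets: 1.23 .50 71.00 6.00"
number = re.compile(r"\d*\.\d+")  # primitive floating point matching

starts = []
pos = 0
while True:
    match = number.search(text, pos)
    if match is None:
        break
    starts.append(match.start())
    pos = match.end()  # continue right after this match

print(starts)       # [9, 14, 18, 24]
print(len(starts))  # 4
```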
Hope that helps.

Convert punctuation to space

I have a bunch of strings with punctuation in them that I'd like to convert to spaces:
"This is a string. In addition, this is a string (with one more)."
would become:
"This is a string In addition this is a string with one more "
I can go through and do this manually with the stringr package (str_replace_all()), one punctuation symbol at a time (, / . / ! / ( / ) / etc.), but I'm curious whether there's a faster way, presumably using regexes.
Any suggestions?

x <- "This is a string. In addition, this is a string (with one more)."
gsub("[[:punct:]]", " ", x)
[1] "This is a string  In addition  this is a string  with one more  "
See ?gsub for doing quick substitutions like this, and ?regex for details on the [[:punct:]] class, i.e.
‘[:punct:]’ Punctuation characters:
‘! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~’.
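For readers outside R, roughly the same substitution can be sketched in Python. The standard library has no [[:punct:]] class, so this example builds an equivalent character class from string.punctuation, which covers the same ASCII punctuation characters; the variable name punct_class is arbitrary.

```python
import re
import string

x = "This is a string. In addition, this is a string (with one more)."

# string.punctuation holds the ASCII punctuation characters; re.escape makes
# them safe to place inside a character class.
punct_class = "[" + re.escape(string.punctuation) + "]"

result = re.sub(punct_class, " ", x)
print(repr(result))
```

Each punctuation character becomes one space, so a punctuation mark followed by a space leaves two spaces behind.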

Have a look at ?regex:
library(stringr)
x <- "This is a string. In addition, this is a string (with one more)."
str_replace_all(x, '[[:punct:]]', ' ')
[1] "This is a string  In addition  this is a string  with one more  "
