How to recognize a string literal for a scanner in lex? - regex

I have been trying to write a regular expression in lex that recognizes a C-like string literal in the input. E.g. in printf("stackoverflow"), "stackoverflow" should be recognized as a string literal.
I have been trying the following:
"[.]+"
["][.]+["]
\"[.]+\"
[\"][.]+[\"]
"""[.]+"""
None of these is working. The lexeme recognized each time is " alone. What should I do?
Thanks in advance...

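Inside a character class, . is a literal dot rather than "any character", so [.]+ only matches a run of periods. Use a bare .* between escaped quotes instead: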
\".*\" printf("STRING[%s]", yytext);
\'.*\' printf("STRING[%s]", yytext);
When compiled and run, a quick test shows it correctly parses strings like
"hello world!"
STRING["hello world!"]
'hello world!'
STRING['hello world!']
'john\'s cat'
STRING['john\'s cat']
'mary said "hello!"'
STRING['mary said "hello!"']
"frank said \"goodbye!\""
these are words "which contain" a string
these are words STRING["which contain"] a string
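One caveat with the .* rules above: lex takes the longest match, so on a line holding two literals the rule matches from the first quote to the last one. A common alternative in lex is \"([^"\\\n]|\\.)*\", which stops at the first unescaped quote. Here is a rough sketch of the same idea using Python's re module instead of lex, just to make the behaviour easy to try. The names GREEDY, STRING_LITERAL and find_string_literals are made up for this example.

```python
import re

# The ".*" approach: everything between the first and the last quote on the line.
GREEDY = re.compile(r'".*"')

# Escape-aware approach: an opening quote, then any run of either
# "a character that is not a quote, backslash or newline" or
# "a backslash followed by any character", then the closing quote.
STRING_LITERAL = re.compile(r'"(?:[^"\\\n]|\\.)*"')


def find_string_literals(source):
    """Return every double-quoted literal in source, quotes included."""
    return STRING_LITERAL.findall(source)


line = 'puts("a"); puts("b");'
print(GREEDY.findall(line))          # one match spanning both literals
print(find_string_literals(line))    # two separate matches
print(find_string_literals(r'say("frank said \"goodbye!\"")'))
```

The same structure works for single-quoted literals by swapping the quote character in all three places.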

You may find these links helpful:
ANSI C grammar, Lex specification
ANSI C Yacc grammar

Related

Remove everything except numbers and alphabets from a string using google sheet or excel formulas

I have searched but only found Python and related solutions.
I have a string like
"Hello 'how' are % you?"
which I want to convert to the following by removing everything except numbers and letters:
Hello how are you
I am using REGEXREPLACE as follows, but I'm not sure what the replacement should be or whether it's the right approach:
=REGEXREPLACE(B2 , "([^A-Za-z0-9]+)")
The main things I want to remove from the string are characters like " and other strange symbols.
Can anyone help?
You can use:
=TRIM(REGEXREPLACE(B2,"[\W_]+"," "))
Or, include the space in your character class:
=TRIM(REGEXREPLACE(B2,"[\W_ ]+"," "))
Where: \W is short for [^A-Za-z0-9_], so to also remove underscores we add _ to the character class.
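If you want to check the [\W_]+ pattern outside of Sheets, here is a small sketch in Python. It is not a Sheets formula; re.ASCII is passed so that \W means [^A-Za-z0-9_] as described above, and keep_alphanumerics is a name invented for this example.

```python
import re


def keep_alphanumerics(text):
    """Replace each run of non-alphanumeric characters with one space, then trim."""
    # [\W_]+ : one or more characters that are not letters or digits.
    # \W alone would leave underscores in place, hence the extra "_".
    collapsed = re.sub(r"[\W_]+", " ", text, flags=re.ASCII)
    return collapsed.strip()  # plays the role of TRIM in the formula


print(keep_alphanumerics("\"Hello 'how' are % you?\""))  # Hello how are you
```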
You can use:
=TRIM(REGEXREPLACE(A1, "'|%|""", ))

Odd substitution behaviour in perl substitution of rtf file

I am trying to use the perl module "RTF::Writer" for strings of text that must be a mix of formats. This is proving more complicated than I anticipated. I am just trying a test at the moment with:
$rtf->paragraph( \'\b', "Name: $name, le\cf1 ng\cf0 th $len" );
but this writes:
{\pard
\b
Name: my_name, le\'061 ng\'060 th 7
\par}
where \'061 should be \cf1 and \'060 should be \cf0.
I then tried to remedy this with a perl 1-liner:
perl -pi -e "s/\'06/\cf/g"
but this made things worse. I do not know what "\^F" represents in vi, but that is what it shows.
It did not matter whether I escaped the backslashes or not.
Can anyone explain this behavior, and what to do about it?
Can anyone suggest how to get the RTF::Writer to create the file as desired from the start?
Thanks
\ is a special character in double-quoted string literals. If you want a string that contains \, you need to use \\ in the literal. To create the string \cf1, you need to use "\\cf1". ("\cf" means Ctrl-F, which is to say the byte 06.)
Alternatively, \ is only special if followed by \ or a delimiter in single-quoted string literals. So the string \cf1 could also be created from '\cf1'.
Both produce the string you want, but they don't produce the document you want. That's because there's a second problem.
When you pass a string to RTF::Writer, it's expected to be text to render. But you are passing a string you wanted included as is in the final document. You need to pass a reference to a string if you want to provide raw RTF. \'...', \"..." and \$str all produce a reference to a string.
Fixed:
use RTF::Writer qw( );
my $name = "my_name";
my $rtf = RTF::Writer->new_to_file("greetings.rtf");
$rtf->prolog( 'title' => "Greetings, hyoomon" );
$rtf->paragraph( \'\b', "Name: $name, le", \'\cf1', "ng", \'\cf0', "th".length($name));
$rtf->close;
Output from the call to paragraph:
{\pard
\b
Name: my_name, le\cf1
ng\cf0
th7
\par}
Note that I didn't use the following because it would be a code injection bug:
$rtf->paragraph(\("\\b Name: $name, le\\cf1 ng\\cf0 th".length($name)));
Don't pass text such as the contents of $name using \...; use that for raw RTF only.

Stanford NLP: how to preprocess the text

I have a sentence like this "The people working in #walman are not good"
I have a preprocessed text file which contains the mappings, similar to the following two lines:
#walman Walman
#text Test
For the above sentence I have to read through the text file and replace the word with any matching word found in the text file.
The above sentence will change to "The people working in Walman are not good"
I am looking for an API in Stanford NLP that reads the input text file and replaces the text.
The only NLP-related part here is tokenization. Read your text file into a map (e.g. a HashMap in Java), then tokenize each new sentence (e.g. with the Stanford tokenizer) and check whether each token is present in the map. If it is, replace the token with the value from the map; if not, leave the token unchanged.
Sample code for tokenization (taken from the link above):
String arg = "The people working in #walman is not good";
// Tokenize the sentence with the PTB tokenizer (empty options string)
PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<CoreLabel>(new StringReader(arg),
        new CoreLabelTokenFactory(), "");
for (CoreLabel label; ptbt.hasNext(); ) {
    label = ptbt.next();
    System.out.println(label);
}
So, label.toString() gives you the token without any suffixes.
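The map lookup itself is only a few lines. Below is a rough sketch in Python rather than Java; note that it splits on whitespace as a stand-in for the Stanford tokenizer, which is an assumption made to keep the example self-contained. load_mappings and replace_tokens are hypothetical names.

```python
def load_mappings(lines):
    """Build a token -> replacement dict from lines such as '#walman Walman'."""
    mappings = {}
    for line in lines:
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            mappings[parts[0]] = parts[1].strip()
    return mappings


def replace_tokens(sentence, mappings):
    """Replace every token that appears in the map; leave the others as they are."""
    # Assumption: whitespace splitting stands in for a real tokenizer here.
    return " ".join(mappings.get(token, token) for token in sentence.split())


mappings = load_mappings(["#walman Walman", "#text Test"])
print(replace_tokens("The people working in #walman are not good", mappings))
# The people working in Walman are not good
```

With a real tokenizer you would iterate over its tokens instead of sentence.split(), keeping the lookup unchanged.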

Using a regex capture directly in expression in C++

I'm trying to use a captured group directly in the regex. However, when I try to do this the program hangs indefinitely.
For example:
string input = "<Tag>blahblah</Tag>";
regex r1("<([a-zA-Z]+)>[a-z]+</\1>");
string result = regex_replace(result, regex, "");
If I add another slash to the capture "<([a-zA-Z]+)>[a-z]</\\1>", the program compiles but throws a "regex_error(regex_constants::error_backref)" exception.
Notes:
Compiler: Apple LLVM 5.1
I am using this as part of the process to clean junk from blocks of text. The document is not necessarily HTML/XML and desired text is not always within tags. So if possible, I would like to be able to do this with regular expressions, not a parser.
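The backslash character in string literals is an escape character, so "\1" is compiled as the single character with code 1 (an octal escape) and the regex engine never sees a backreference.
Either escape it, "<([a-zA-Z]+)>[a-z]+</\\1>", or use a raw string literal, R"(<([a-zA-Z]+)>[a-z]+</\1>)".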
With that, your program works as you would expect:
#include <regex>
#include <string>
#include <iostream>

int main()
{
    std::string input = "Hello<Tag>blahblah</Tag> World";
    // "\\1" in the literal reaches the regex engine as the backreference \1
    std::regex r1("<([a-zA-Z]+)>[a-z]+</\\1>");
    std::string result = std::regex_replace(input, r1, "");
    std::cout << "The result is '" << result << "'\n";
}
demo: http://coliru.stacked-crooked.com/a/ae20b09d46f975e9
The exception you're getting with \\1 suggests that your compiler is configured to use GNU libstdc++, where <regex> was not fully implemented before GCC 4.9. Look up how to set it up to use LLVM libc++, or use boost.regex.
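The same two layers of escaping (string literal first, regex engine second) can be observed in other languages too. Here is a rough sketch in Python rather than C++, purely because it is quick to try; Python happens to treat "\1" in an ordinary string literal as an octal escape as well. strip_tags is a made-up helper name.

```python
import re

plain = "<([a-zA-Z]+)>[a-z]+</\1>"     # "\1" becomes the character U+0001
escaped = "<([a-zA-Z]+)>[a-z]+</\\1>"  # "\\1" reaches the regex engine as \1
raw = r"<([a-zA-Z]+)>[a-z]+</\1>"      # raw literal: backslash left untouched


def strip_tags(text, pattern):
    """Remove every <Tag>lowercase</Tag> pair matched by pattern."""
    return re.sub(pattern, "", text)


text = "Hello<Tag>blahblah</Tag> World"
print(escaped == raw)                 # True: both spell the backreference \1
print("\x01" in plain)                # True: no backreference left in the pattern
print(repr(strip_tags(text, raw)))    # 'Hello World'
print(repr(strip_tags(text, plain)))  # unchanged, the pattern never matches
```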

Groovy Regex illegal Characters

I have a Groovy script that converts some very poorly formatted data into XML. This part works fine, but it's also happily passing some characters along that aren't legal in XML. So I'm adding some code to strip these out, and this is where the problem is coming from.
The code that isn't compiling is this:
def illegalChars = ~/[\u0000-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/
What I'm wondering is, why? What am I doing wrong here? I tested this regex in http://regexpal.com/ and it works as expected, but I'm getting an error compiling it in Groovy:
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] line 23:26: unexpected char: 0x0
The line above is line 23. The surrounding lines are just variable declarations that I haven't changed while working on the regex.
Thanks!
Update:
The code compiles, but it's not filtering as I'd expected it to.
In regexpal I put the regex:
[\u0000-\u0008\u000B-\u000C\u000E-\u001F\u007F-\u009F]
and the test data:
name='lang'>E</field><field name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc>
<doc><field name='page'>72-88</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field name='issue'>NUMBER</field>
<field name='auth'>Dvorak, A.</field><field name='pub'>KARGER</field><field
name='rr'>GBP013.51</field><field name='issn'>1660-2242</field><field
name='class1'>TS</field><field name='freq'>S</field><field
name='class2'>616.079</field><field name='text'>Subcellular Localization of the
Cytokines, Basic Fibroblast Growth Factor and Tumor Necrosis Factor- in Mast
Cells</field><field name='id'>RN170369808</field><field name='volume'>VOL 85</field>
<field name='year'>2005</field><field name='lang'>E</field><field
name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc><doc><field
name='page'>89-97</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field
It's a grab from a file with one of the illegal characters, so it's a little random. regexpal highlights only the illegal character, but Groovy is replacing even the '<' and '>' characters with empty strings, so it's basically annihilating the entire document.
The code snippet:
def List parseFile(File file){
    println "reading File name: ${file.name}"
    def lineCount = 0
    List data = new ArrayList()
    file.eachLine {
        String input ->
        lineCount ++
        String line = input
        if(input =~ illegalChars){
            line = input.replaceAll(illegalChars, " ")
        }
        Map document = new HashMap()
        elementNames.each(){
            token ->
            def val = getValue(line, token)
            if(val != null){
                if(token.equals("ISSUE")){
                    List entries = val.split(";")
                    document.putAt("year",entries.getAt(0).trim())
                    if(entries.size() > 1){
                        document.putAt("volume", entries.getAt(1).trim())
                    }
                    if(entries.size() > 2){
                        document.putAt("issue", entries.getAt(2).trim())
                    }
                } else {
                    document.putAt(token, val)
                }
            }
        }
        data.add(document)
    }
    println "done"
    return data
}
I don't see any reason that the two should behave differently; am I missing something?
Again, thanks!
line 23:26: unexpected char: 0x0
This error message points to this part of the code:
def illegalChars = ~/[\u0000-...
12345678901234567890123
It looks like, for some reason, the compiler doesn't like having the Unicode 0 character in the source code. That said, you should be able to fix this by doubling the backslash. This prevents Unicode escapes from being processed at the source code level, and lets the regex engine handle the Unicode escapes instead:
def illegals = ~/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/
Note that I've also combined the character classes into one instead of using alternation, and removed the ranges where they're not necessary.
References
regular-expressions.info/Character Classes
On doubling the slash
Here's the relevant quote from java.util.regex.Pattern
Unicode escape sequences such as \u2014 in Java source code are processed as described in JLS 3.3. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
To illustrate, in Java:
System.out.println("\n".matches("\\u000A")); // prints "true"
However:
System.out.println("\n".matches("\u000A"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is because \u000A, which is the newline character, is translated in the second snippet at the source code level, before the string literal is even parsed. The source code essentially becomes:
System.out.println("\n".matches("
"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is not legal Java source code.
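The "not equal, but same pattern" point from the Pattern quote can be tried in other languages as well. Here is a rough Python sketch (not Java): an ordinary literal plays the role of "\u000A" and a raw literal plays the role of "\\u000A". One difference to keep in mind is that Python resolves the escape while evaluating the string literal, not before tokenizing the source, so the first form does not break the program the way it does in Java.

```python
import re

literal_level = "\u000A"   # the language turns this into one newline character
regex_level = r"\u000A"    # six characters; the regex engine decodes the escape

print(len(literal_level), len(regex_level))           # 1 6
print(literal_level == regex_level)                   # False
print(bool(re.fullmatch(literal_level, "\n")))        # True
print(bool(re.fullmatch(regex_level, "\n")))          # True
```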
Try this regular expression to remove Unicode characters from the string:
/*\\u([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])/
OK here's my finding:
>>> print "XYZ".replaceAll(
/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/,
"-"
)
---
>>> print "X\0YZ".replaceAll(
/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/,
"-"
)
X-YZ
>>> print "X\0YZ".replaceAll(
"[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]",
"-"
)
X-YZ
In other words, my \\uNNNN answer within /pattern/ is WRONG. What happens is that 0-\ becomes part of the range, and this includes <, > and all capital letters.
The \\uNNNN only works in "pattern", not in /pattern/.
I will edit my official answer based on comments to this "answer".
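To see why the doubled backslashes wipe out so much, here is a rough reproduction in Python rather than Groovy. It assumes a Python raw string behaves like /pattern/ in the session above, i.e. the backslashes reach the regex engine untouched. With \\u0000-\\u0008 the engine reads a literal backslash, the letters u, 0, 0, 0, and then the range 0-\ (0x30 to 0x5C), which covers the digits, '<', '>' and all capital letters. DOUBLED, SINGLE and scrub are names invented for this sketch.

```python
import re

# Doubled backslashes: the class holds '\', 'u', digits and ranges such as 0-\
DOUBLED = r"[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]"
# Single backslashes: the regex engine decodes each \uNNNN into one code point
SINGLE = r"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]"


def scrub(text, pattern):
    """Replace every character matched by pattern with '-'."""
    return re.sub(pattern, "-", text)


print(scrub("XYZ", DOUBLED))      # ---
print(scrub("<doc>", DOUBLED))    # -doc-
print(scrub("X\0YZ", SINGLE))     # X-YZ
print(scrub("<doc>", SINGLE))     # <doc>
```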
Related questions
How to escape Unicode escapes in Groovy’s /pattern/ syntax
Try:
def illegalChars = ~/[\u0001-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/