Pythonic way to rewrite the following C++ string processing code - c++

Previously, I had some C++ string-processing code that does the following.
input -> Hello 12
output-> Hello
input -> Hello 12 World
output-> Hello World
input -> Hello12 World
output-> Hello World
input -> Hello12World
output-> HelloWorld
The following is the C++ code.
std::string Utils::toStringWithoutNumerical(const std::string& str) {
    std::string result;
    bool alreadyAppendSpace = false;
    for (int i = 0, length = str.length(); i < length; i++) {
        const char c = str.at(i);
        if (isdigit(c)) {
            continue;
        }
        if (isspace(c)) {
            if (false == alreadyAppendSpace) {
                result.append(1, c);
                alreadyAppendSpace = true;
            }
            continue;
        }
        result.append(1, c);
        alreadyAppendSpace = false;
    }
    return trim(result);
}
What is the Pythonic way to implement this in Python? Can a regular expression do it?
Thanks.

Edit: This reproduces more accurately what the C++ code does than the previous version.
s = re.sub(r"\d+", "", s)
s = re.sub(r"(\s)\s*", r"\1", s)
In particular, if the first whitespace in a run of several whitespaces is a tab, it will preserve the tab.
Further Edit: To replace by a space anyway, this works:
s = re.sub(r"\d+", "", s)
s = re.sub(r"\s+", " ", s)
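Putting the two substitutions together, a minimal sketch might look like this. The function name remove_digits is made up for illustration, and the final strip() is an assumption standing in for the trim() call in the C++ code.

```python
import re

def remove_digits(s):
    # Drop every run of digits.
    s = re.sub(r"\d+", "", s)
    # Collapse each run of whitespace into a single space.
    s = re.sub(r"\s+", " ", s)
    # Mirror the C++ trim(): remove leading and trailing whitespace.
    return s.strip()

for sample in ["Hello 12", "Hello 12 World", "Hello12 World", "Hello12World"]:
    print(repr(remove_digits(sample)))
```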

Python has a lot of built-in functions that can be very powerful when used together.
def RemoveNumeric(str):
    # Python 2 only: str.translate(None, chars) deletes the given characters.
    return ' '.join(str.translate(None, '0123456789').split())
>>> RemoveNumeric('Hello 12')
'Hello'
>>> RemoveNumeric('Hello 12 World')
'Hello World'
>>> RemoveNumeric('Hello12 World')
'Hello World'
>>> RemoveNumeric('Hello12World')
'HelloWorld'
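The two-argument translate(None, ...) call is specific to Python 2 strings. On Python 3, a rough equivalent builds a deletion table with str.maketrans; the name remove_numeric below is illustrative, not part of the answer above.

```python
import string

def remove_numeric(text):
    # The third argument of str.maketrans lists the characters to delete.
    table = str.maketrans("", "", string.digits)
    # split() with no arguments splits on runs of whitespace and ignores
    # leading/trailing whitespace, so join() leaves single spaces.
    return " ".join(text.translate(table).split())

print(remove_numeric("Hello 12 World"))
```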

import re
re.sub(r'[0-9]+', "", string)

import re
re.sub(r"(\s*)\d+(\s*)", lambda m: m.group(1) or m.group(2), string)
Breakdown:
\s* matches zero or more whitespace.
\d+ matches one or more digits.
The parentheses are used to capture the whitespace.
The replacement parameter is normally a string, but it can alternatively be a function which constructs the replacement dynamically.
lambda is used to create an inline function which returns whichever of the two capture groups is non-empty. This preserves a space if there was whitespace and returns an empty string if there wasn't any.
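As a sketch of how the function replacement behaves, the lambda can be written as a named function. The names surrounding_space and drop_numbers are illustrative. Note that for an input such as 'Hello 12' the substitution alone leaves a trailing space, so this sketch assumes a final strip() is wanted, as in the C++ original.

```python
import re

def surrounding_space(match):
    # Prefer the whitespace captured before the digits; fall back to the
    # whitespace captured after them (an empty string if there was none).
    return match.group(1) or match.group(2)

def drop_numbers(text):
    replaced = re.sub(r"(\s*)\d+(\s*)", surrounding_space, text)
    # Assumption: trim the ends, like trim() in the C++ code.
    return replaced.strip()

print(repr(drop_numbers("Hello12 World")))
```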

The regular expression answers are clearly the right way to do this. But if you're interested in a way to do it without a regex engine, here's how:
class filterstate(object):
    def __init__(self):
        self.seenspace = False
    def include(self, c):
        isspace = c.isspace()
        if (not c.isdigit()) and (not (self.seenspace and isspace)):
            self.seenspace = isspace
            return True
        else:
            return False
def toStringWithoutNumerical(s):
    fs = filterstate()
    return ''.join((c for c in s if fs.include(c)))

Related

regex keeps returning false even when regex101 returns match

I am doing a list.where filter:
String needleTemp = '';
final String hayStack =
    [itemCode, itemDesc, itemCodeAlt, itemDescAlt, itemGroup].join(' ');
for (final k in query.split(" ")) {
  needleTemp = '$needleTemp(?=.*\\Q$k\\E)';
}
var re = RegExp(needleTemp);
return re.hasMatch(hayStack);
I printed needleTemp and it looks the same as in my regex101 example.
In Dart it prints (?=.*\Qa/a\E)(?=.*\Qpatro\E)
That is basically the same, but nothing matches, not even a single letter.
Is Dart regex different, or do I need another syntax?
edit:
Simple example to test in DartPad:
void main() {
  print("(?=.*\\Qpatrol\\E)");
  var re = RegExp("(?=.*\\Q2020\\E)");
  print(re.hasMatch('A/A PATROL 2020'));
}
This still returns false.
Found the solution:
Dart's RegExp does not support \Q and \E, so I removed them and wrapped each search term in RegExp.escape(text_to_escape) when building the needle.
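The same escape-each-term idea, sketched in Python rather than Dart. Python's re module does not support \Q...\E either, and re.escape plays the role of RegExp.escape here. The function name matches_all_terms is hypothetical, and the match is case-sensitive because the Dart code above shows no case handling.

```python
import re

def matches_all_terms(query, haystack):
    # One lookahead per search term; re.escape makes characters such as
    # "." or "?" in a term match literally.
    needle = "".join("(?=.*{})".format(re.escape(term)) for term in query.split(" "))
    return re.search(needle, haystack) is not None

print(matches_all_terms("PATROL 2020", "A/A PATROL 2020"))
```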

Regex count number of replacements [duplicate]

Is there a way to count the number of replacements a Regex.Replace call makes?
E.g. for Regex.Replace("aaa", "a", "b"); I want to get the number 3 out (result is "bbb"); for Regex.Replace("aaa", "(?<test>aa?)", "${test}b"); I want to get the number 2 out (result is "aabab").
Ways I can think to do this:
Use a MatchEvaluator that increments a captured variable, doing the replacement manually
Get a MatchCollection and iterate it, doing the replacement manually and keeping a count
Search first and get a MatchCollection, get the count from that, then do a separate replace
Methods 1 and 2 require manual parsing of $ replacements; method 3 requires matching the string against the regex twice. Is there a better way?
Thanks to both Chevex and Guffa. I started looking for a better way to get the results and found that there is a Result method on the Match class that does the substitution. That's the missing piece of the jigsaw. Example code below:
using System.Text.RegularExpressions;
namespace regexrep
{
    class Program
    {
        static int Main(string[] args)
        {
            string fileText = System.IO.File.ReadAllText(args[0]);
            int matchCount = 0;
            string newText = Regex.Replace(fileText, args[1],
                (match) =>
                {
                    matchCount++;
                    return match.Result(args[2]);
                });
            System.IO.File.WriteAllText(args[0], newText);
            return matchCount;
        }
    }
}
With a file test.txt containing aaa, the command line regexrep test.txt "(?<test>aa?)" ${test}b will set %errorlevel% to 2 and change the text to aabab.
You can use a MatchEvaluator that runs for each replacement, that way you can count how many times it occurs:
int cnt = 0;
string result = Regex.Replace("aaa", "a", m => {
    cnt++;
    return "b";
});
The second case is trickier as you have to produce the same result as the replacement pattern would:
int cnt = 0;
string result = Regex.Replace("aaa", "(?<test>aa?)", m => {
    cnt++;
    return m.Groups["test"] + "b";
});
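For comparison only, here is the same counting-callback idea sketched in Python rather than C#. Match.expand applies a replacement template to a single match, much as Match.Result does in .NET, and re.subn returns the replacement count directly. The function name replace_and_count is illustrative.

```python
import re

def replace_and_count(pattern, template, text):
    count = 0

    def on_match(match):
        nonlocal count
        count += 1
        # expand() applies the replacement template to this one match.
        return match.expand(template)

    return re.sub(pattern, on_match, text), count

print(replace_and_count(r"(?P<test>aa?)", r"\g<test>b", "aaa"))  # ('aabab', 2)
print(re.subn(r"(?P<test>aa?)", r"\g<test>b", "aaa"))            # ('aabab', 2)
```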
This should do it.
int count = 0;
string result = Regex.Replace(text,
    @"(((http|ftp|https):\/\/|www\.)[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)", //Example expression. This one captures URLs.
    match =>
    {
        string replacementValue = String.Format("<a href='{0}'>{0}</a>", match.Value);
        count++;
        return replacementValue;
    });
I am not on my dev computer so I can't do it right now, but I am going to experiment later and see if there is a way to do this with lambda expressions instead of declaring the method IncrementCount() just to increment an int.
EDIT modified to use a lambda expression instead of declaring another method.
EDIT2 If you don't know the pattern in advance, you can still get all the groupings (The $ groups you refer to) within the match object as they are included as a GroupCollection. Like so:
int count = 0;
string result = Regex.Replace(text,
    @"(((http|ftp|https):\/\/|www\.)[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)", //Example expression. This one captures URLs.
    match =>
    {
        string replacementValue = String.Format("<a href='{0}'>{0}</a>", match.Value);
        count++;
        foreach (Group g in match.Groups)
        {
            string value = g.Value; //Do stuff with value
        }
        return replacementValue;
    });

Regex for custom parsing

Regex isn't my strongest point. Let's say I need a custom parser that strips a string of all letters, extra minus signs, and extra decimal points.
For example, for the input string "--1-2.3-gf5.47", the parser would return "-12.3547".
I could only come up with variations of this:
string.replaceAll("[^(\\-?)(\\.?)(\\d+)]", "")
which removes the letters but retains everything else. Any pointers?
More examples:
Input: -34.le.78-90
Output: -34.7890
Input: df56hfp.78
Output: 56.78
Some rules:
Consider only the first negative sign before the first number; everything else can be ignored.
Assume the negative sign, if there is one, will always occur before the decimal point.
I'm trying to do this using Java.
Just tested this on ideone and it seemed to work. The comments should explain the code well enough. You can copy/paste this into Ideone.com and test it if you'd like.
It might be possible to write a single regex pattern for it, but you're probably better off implementing something simpler/more readable like below.
The three examples you gave print out:
--1-2.3-gf5.47 -> -12.3547
-34.le.78-90 -> -34.7890
df56hfp.78 -> 56.78
import java.util.*;
import java.lang.*;
import java.io.*;
/* Name of the class has to be "Main" only if the class is public. */
class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        System.out.println(strip_and_parse("--1-2.3-gf5.47"));
        System.out.println(strip_and_parse("-34.le.78-90"));
        System.out.println(strip_and_parse("df56hfp.78"));
    }
    public static String strip_and_parse(String input)
    {
        //remove anything not a period or digit (including hyphens) for output string
        String output = input.replaceAll("[^\\.\\d]", "");
        //add a hyphen to the beginning of 'output' if the original string started with one
        if (input.startsWith("-"))
        {
            output = "-" + output;
        }
        //if the string contains a decimal point, remove all but the first one by splitting
        //the output string into two strings and removing all the decimal points from the
        //second half
        if (output.indexOf(".") != -1)
        {
            output = output.substring(0, output.indexOf(".") + 1)
                + output.substring(output.indexOf(".") + 1, output.length()).replaceAll("[^\\d]", "");
        }
        return output;
    }
}
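The same steps can be sketched in Python (the question asks for Java, so treat this as an illustration of the approach only). One deliberate difference, and an assumption on my part: the sign check looks for a '-' anywhere before the first digit, following the stated rule, rather than only at the very start of the string.

```python
import re

def strip_and_parse(text):
    # Keep a minus sign only if a '-' appears before the first digit.
    first_digit = re.search(r"\d", text)
    prefix = text[:first_digit.start()] if first_digit else text
    sign = "-" if "-" in prefix else ""
    # Drop everything except digits and dots.
    cleaned = re.sub(r"[^\d.]", "", text)
    # Keep the first dot and remove any later ones.
    head, dot, tail = cleaned.partition(".")
    return sign + head + dot + tail.replace(".", "")

for sample in ["--1-2.3-gf5.47", "-34.le.78-90", "df56hfp.78"]:
    print(sample, "->", strip_and_parse(sample))
```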
In terms of regex, the secondary, tertiary, etc., decimals seem tough to remove. However, this one should remove the additional dashes and alphas: (?<=.)-|[a-zA-Z]. (Hopefully the syntax is the same in Java; this is a Python regex but my understanding is that the language is relatively uniform).
That being said, it seems like you could just run a pretty short "finite state machine"-type piece of code to scan the string and rebuild the reduced string yourself like this:
a = "--1-2.3-gf5.47"
new_a = ""
dash = False
dot = False
nums = '0123456789'
for char in a:
    if char in nums:
        new_a = new_a + char # record a match to nums
        dash = True # since we saw a number first, turn on the dash flag, we won't use any dashes from now on
    elif char == '-' and not dash:
        new_a = new_a + char # if we see a dash and haven't seen anything else yet, we append it
        dash = True # activate the flag
    elif char == '.' and not dot:
        new_a = new_a + char # take the first dot
        dot = True # put up the dot flag
(Again, sorry for the syntax; I think you need some curly brackets around the statements in Java, versus Python's indentation-only style.)

how to use regular expressions in swift?

I am trying to write a JSON parser in Swift, with a separate function for each part of the JSON grammar. My string parser detects a string by looking for an opening \" and returns everything up to the next \" as a String. It fails on this JSON text:
{"gd$etag": "W\/\"D0QCQX4zfCp7I2A9XRZQFkw.\""}
The function fails here because the whole value should be recognised as one String, while mine collects only
W\/
since I defined a string as starting and ending with \".
Searching online suggested that this has something to do with regular expressions. How can I solve this?
Is this what you were looking for?
import Foundation
let str: NSString = "W\\/\\\"D0QCQX4zfCp7I2A9XRZQFkw.\\\""
let regex = "\\\\\".*\\\\\""
// Finds range that starts with \" and ends with \"
let range = str.rangeOfString(regex, options: .RegularExpressionSearch)
let match: NSString = str.substringWithRange(range)
//Removes the \" from the start and end.
let innerString = match.substringWithRange(NSMakeRange(2, match.length-4))
if first(jsonString) == "\"" {
    jsonString.removeAtIndex(jsonString.startIndex)
    for elem in jsonString {
        if elem == "\\" {
            s = 1
            parsedString.append(elem)
            jsonString.removeAtIndex(jsonString.startIndex)
            continue
        }
        if s == 1 && elem == "\""{
            parsedString.append(elem)
            jsonString.removeAtIndex(jsonString.startIndex)
            s = 0
            continue
        }
        else if elem == "\""{
            jsonString.removeAtIndex(jsonString.startIndex)
            break
        }
        parsedString.append(elem)
        jsonString.removeAtIndex(jsonString.startIndex)
        s = 0
    }
}
This code solved my issue while keeping it pure Swift.
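For reference, the escape-aware scan can also be expressed as a single regular expression. The sketch below uses Python rather than Swift, and the name JSON_STRING is illustrative. It only locates string literals and does not decode the escapes inside them.

```python
import re

# A JSON string literal: an opening quote, then any mix of ordinary
# characters or backslash-escaped characters, then a closing quote.
JSON_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')

text = r'{"gd$etag": "W\/\"D0QCQX4zfCp7I2A9XRZQFkw.\""}'
strings = JSON_STRING.findall(text)
print(strings[0])  # "gd$etag"
print(strings[1])  # "W\/\"D0QCQX4zfCp7I2A9XRZQFkw.\""
```

The key point is the same as in the loop above: a quote that follows a backslash is part of the string, so it must not end the match.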

Regular expression doesn't work in Go

Forgive me for being a regex amateur, but I'm really confused as to why this piece of code doesn't work in Go.
package main
import (
    "fmt"
    "regexp"
)
func main() {
    var a string = "parameter=0xFF"
    var regex string = "^.+=\b0x[A-F][A-F]\b$"
    result, err := regexp.MatchString(regex, a)
    fmt.Println(result, err)
}
// output: false <nil>
This seems to work OK in Python:
import re
p = re.compile(r"^.+=\b0x[A-F][A-F]\b$")
m = p.match("parameter=0xFF")
if m is not None:
    print(m.group())
# output: parameter=0xFF
All I want to do is match whether the input is in the format <anything>=0x[A-F][A-F]
Any help would be appreciated
Have you tried using a raw string literal (with backquotes instead of double quotes)?
Like this:
var regex string = `^.+=\b0x[A-F][A-F]\b$`
You must escape the \ in interpreted string literals, because there "\b" is a backspace character rather than a word boundary:
var regex string = "^.+=\\b0x[A-F][A-F]\\b$"
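The same pitfall can be reproduced in Python, which the question already uses for comparison: without the r prefix, "\b" is turned into a backspace character before the regex engine ever sees it. The variable names interpreted and raw are illustrative.

```python
import re

interpreted = "^.+=\b0x[A-F][A-F]\b$"  # "\b" becomes a backspace character (0x08)
raw = r"^.+=\b0x[A-F][A-F]\b$"         # "\b" reaches the regex engine as a word boundary

print("\b" == "\x08")                                    # True
print(re.match(interpreted, "parameter=0xFF") is None)   # True: no match
print(re.match(raw, "parameter=0xFF") is not None)       # True: match
```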
But in fact the \b word boundaries appear to be unnecessary in your expression.
It works without them:
var regex string = "^.+=0x[A-F][A-F]$"