I have a string that looks like a comma-separated list of "label:value" items.
package testParsers
import org.scalatest.{Matchers, FlatSpec}
class testReturnStrParser extends FlatSpec with Matchers{
import parsers.ReturnStringParser
"return string parser" should "find the height in ret string" in {
val teststr = "blahblah:123, height:80.3"
val s = ReturnStringParser.findVal("height", teststr)
s should have length 1
s.head shouldEqual ("80.3")
}
it should "work if it is in the middle" in {
val teststr = "blahblah:123, height:80.3,weight:100.0"
val s = ReturnStringParser.findVal("height", teststr)
s should have length 1
s.head shouldEqual ("80.3")
}
}
I am trying to make the class work when the label height is in the middle:
package parsers
object ReturnStringParser {
def findVal(fieldName: String, s: String) = {
val rx = s"(?<=$fieldName:)"+"(.*)*[^,\\s]*"
(rx.r)
.findAllIn(s)
.toList
}
}
This works:
val rx = s"(?<=$fieldName:)"+"([^,]*)"
https://regex101.com/r/aC4vA3/1
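For comparison, here is a minimal Python sketch of the same idea (the code above is Scala; `find_val` is a hypothetical name chosen to mirror `findVal`). The original pattern fails because `(.*)*` is greedy and swallows the rest of the string, whereas `[^,]*` stops at the next comma.

```python
import re


def find_val(field_name, s):
    """Return every value that directly follows 'field_name:' in s."""
    # The lookbehind anchors the match right after "label:";
    # [^,]* then consumes characters only up to the next comma.
    pattern = r"(?<=" + re.escape(field_name) + r":)[^,]*"
    return re.findall(pattern, s)


if __name__ == "__main__":
    print(find_val("height", "blahblah:123, height:80.3,weight:100.0"))
```

Note that this sketch assumes values never contain commas, as in the test strings above.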
Situation & Problem
1. Example:
Say you have a paragraph in which the word sentence is broken into sente-nce by a hyphen at a line break:
Imagine you have this sample sentence, which is a very long sente-
nce that has a word being broken down with a hyphen.
2. How can I detect that the word sente-nce has been broken by a hyphen, and correct it to sentence?
Notes:
Is there a library I can use for this (preferably Java or Python, but any software will do)?
A simple regex that matches every (\w)-(\w) and replaces it with $1$2 won't work in all cases.
For example, the word event-driven would become eventdriven, which is undesired.
You need to check whether the word belongs to the English vocabulary. Find all the matches, check each one against an English dictionary, and change the word only if it is not found. Something like:
import enchant
voc = enchant.Dict("en_US")
word = "sente-nce"
voc.check(word)
It returns False if it's not a word.
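The check-then-fix loop described above could be sketched as follows. This is only an illustration under stated assumptions: `fix_hyphens` and `DEMO_VOCAB` are hypothetical names, the tiny word set stands in for a real dictionary, and `is_word` can be any callable (for instance `enchant.Dict("en_US").check` from the snippet above).

```python
import re

# Hypothetical stand-in vocabulary; a real run would use a full dictionary.
DEMO_VOCAB = {"sentence", "event", "driven"}

# A hyphen between two letter runs, optionally followed by whitespace
# (which covers a word split across a line break).
HYPHENATED = re.compile(r"([A-Za-z]+)-\s*([A-Za-z]+)")


def fix_hyphens(text, is_word):
    """Join hyphen-split fragments only when the joined form is a known word."""
    def repl(match):
        joined = match.group(1) + match.group(2)
        # Keep the original text (hyphen included) unless the joined
        # form is recognised by the dictionary check.
        return joined if is_word(joined.lower()) else match.group(0)
    return HYPHENATED.sub(repl, text)


if __name__ == "__main__":
    sample = "a very long sente-\nnce about event-driven code"
    print(fix_hyphens(sample, DEMO_VOCAB.__contains__))
```

Checking the joined form rather than the hyphenated one keeps compounds such as event-driven intact, since eventdriven is not in the dictionary.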
Solution (may not be the best)
logic & usage
/*
#logic::
1. regex-match all words containing a hyphen -
2. loop over the matches and check each word against a dictionary
_ & fix it if the hyphen is misplaced
#to_use::
1. put your dictionary file where Path path = Paths.get("words_alpha.txt"); can find it <= https://github.com/dwyl/english-words
2. put the sentence to auto-correct in content_TESTING
3. execute & get the output
#note::
depending on the quality of the dictionary, the results may not be good.
#note::
if your words contain a space or newline \n -> modify the regex in String str_RegexPattern = "([a-zA-Z]+)-([a-zA-Z]+)";
#note::
this is not fully tested yet
*/
code
package com.ex.main.autoCorrectHypen;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// https://stackoverflow.com/questions/11607270/how-to-check-whether-given-string-is-a-word
// https://github.com/dwyl/english-words
// ~// https://github.com/first20hours/google-10000-english
class Dictionary {
private static HashSet<String> wordsSet = new HashSet<>();
public static void initDictionary() throws IOException {
Path path = Paths.get("words_alpha.txt");
byte[] readBytes = Files.readAllBytes(path);
String wordListContents = new String(readBytes, StandardCharsets.UTF_8);
String[] words = wordListContents.split("\\R"); // \R matches \r\n, \n and \r, so either line-ending style works
Collections.addAll(wordsSet, words);
}
static {
try {
initDictionary();
} catch (IOException e) {
e.printStackTrace();
}
}
public static boolean contains(String word) { return wordsSet.contains(word); }
}
public class AutoCorrectHypen {
public static String autoCorrectHypen(String content_ValidateOn) {
String content_SearchOn = content_ValidateOn;
String str_RegexPattern = "([a-zA-Z]+)-([a-zA-Z]+)";
Pattern pattern = Pattern.compile(str_RegexPattern);
Matcher matcher = pattern.matcher(content_SearchOn);
StringBuilder sb_ContentSearchOn = new StringBuilder(content_SearchOn);
StringBuilder content_Replaced = new StringBuilder();
int ind_MatchGroupEnd_prev = 0;
int ind_MatchGroupEnd_curr;
int ind_MatchGroupStart_curr;
while (matcher.find()) {
//
ind_MatchGroupStart_curr = matcher.start(0);
ind_MatchGroupEnd_curr = matcher.end(0);
String content_BeforeMatchGroup = sb_ContentSearchOn.substring(ind_MatchGroupEnd_prev, ind_MatchGroupStart_curr); // prev end to curr start, not start to end
content_Replaced.append(content_BeforeMatchGroup);
//
String content_SearchOn_innerMatch_G0 = matcher.group(0);
String content_SearchOn_innerMatch_G1 = matcher.group(1);
String content_SearchOn_innerMatch_G2 = matcher.group(2);
String content_Replaced_innerMatch = autoCorrectHypen_innerMatch(content_SearchOn_innerMatch_G0, content_SearchOn_innerMatch_G1, content_SearchOn_innerMatch_G2);
content_Replaced.append(content_Replaced_innerMatch);
//
ind_MatchGroupEnd_prev = ind_MatchGroupEnd_curr;
}
System.out.println("-------");
// append the content after the last match group
String content_AfterLastMatchGroup = sb_ContentSearchOn.substring(ind_MatchGroupEnd_prev, sb_ContentSearchOn.length());
content_Replaced.append(content_AfterLastMatchGroup);
return content_Replaced.toString();
}
protected static String autoCorrectHypen_innerMatch(String content_SearchOn_innerMatch_G0, String content_SearchOn_innerMatch_G1, String content_SearchOn_innerMatch_G2) {
System.out.printf("> %s; %s; %s; %n", content_SearchOn_innerMatch_G0, content_SearchOn_innerMatch_G1, content_SearchOn_innerMatch_G2);
String content_Replaced_innerMatch = null;
// #atten: order of the if stmt matters
if (Dictionary.contains(content_SearchOn_innerMatch_G0)) {
content_Replaced_innerMatch = content_SearchOn_innerMatch_G0;
System.out.printf(">> %s: %n%s %n", "whole word - with hypen, G0", content_Replaced_innerMatch);
} else if (Dictionary.contains(content_SearchOn_innerMatch_G1 + content_SearchOn_innerMatch_G2)) {
content_Replaced_innerMatch = content_SearchOn_innerMatch_G1 + content_SearchOn_innerMatch_G2;
System.out.printf(">> %s: %n%s %n", "whole word - remove hypen, G1 + G2", content_Replaced_innerMatch);
} else if (Dictionary.contains(content_SearchOn_innerMatch_G1) && Dictionary.contains(content_SearchOn_innerMatch_G2)) {
content_Replaced_innerMatch = content_SearchOn_innerMatch_G0;
System.out.printf(">> %s: %n%s %n", "whole word - with hypen, G1 && G2", content_Replaced_innerMatch);
} else {
content_Replaced_innerMatch = content_SearchOn_innerMatch_G0;
System.err.println(">> No such word");
}
return content_Replaced_innerMatch;
}
//################################################################################################
static final String content_TESTING_Simple = ""
+ "Check the word sente-nce, event-driven, family-owned, chocolate-covered, anti-clockwise.\n"
+ "samp-le, diff-erence, what-do-you-mean, how-ever, be-cause, other-wise, pill-ow";
static final String content_TESTING = ""
+ "Imagine you have this sample sentence, which is a very long sente-\n"
+ "nce that has a word being broken down with a hyphen. \n"
+ "\n"
+ "Check the word sente-nce, event-driven, family-owned, chocolate-covered, anti-clockwise.\n"
+ "";
public static void main(String[] args) throws Exception {
System.out.println(autoCorrectHypen(content_TESTING_Simple)); //
}
}
input
Check the word sente-nce, event-driven, family-owned, chocolate-covered, anti-clockwise.
samp-le, diff-erence, what-do-you-mean, how-ever, be-cause, other-wise, pill-ow
output
Check the word sentence, event-driven, family-owned, chocolate-covered, anticlockwise.
sample, difference, what-do-you-mean, however, because, otherwise, pillow
I've been running into weird issues with regex and TypeScript. I'm trying to have my expression replace the value test, except for the first instance when it is followed by another test. In other words, replace test in the first two lines below, but in the third line replace only the second test.
[test]
[test].[db]
[test].[test]
Where it should look like:
[newvalue]
[newvalue].[db]
[test].[newvalue]
I've tried many variations, but this is the one I thought was simple enough to solve it, and regex101 confirms that it works:
\[(\w+)\](?!\.\[test\])
But when used from TypeScript (a custom task in a VSTS build), it actually replaces the values like this:
[newvalue]
[newvalue].[db]
[newvalue].[test]
Update: a regex like (test)(?!.test) also breaks when I change the test cases to remove the square brackets, which makes me think the problem is somewhere in the code. Could the problem be with the index at which the value is replaced?
Some of the TypeScript code that calls this:
var filePattern = tl.getInput("filePattern", true);
var tokenRegex = tl.getInput("tokenRegex", true);
for (var i = 0; i < files.length; i++) {
var file = files[i];
console.info(`Starting regex replacement in [${file}]`);
var contents = fs.readFileSync(file).toString();
var reg = new RegExp(tokenRegex, "g");
// loop through each match
var match: RegExpExecArray;
// keep a separate var for the contents so that the regex index doesn't get messed up
// by replacing items underneath it
var newContents = contents;
while((match = reg.exec(contents)) !== null) {
var vName = match[1];
// find the variable value in the environment
var vValue = tl.getVariable(vName);
if (typeof vValue === 'undefined') {
tl.warning(`Token [${vName}] does not have an environment value`);
} else {
newContents = newContents.replace(match[0], vValue);
console.info(`Replaced token [${vName }]`);
}
}
}
The full code for the task I'm using this with: https://github.com/colindembovsky/cols-agent-tasks/blob/master/Tasks/ReplaceTokens/replaceTokens.ts
For me, this regex works the way you expect:
\[(test)\](?!\.\[test\])
with TypeScript code like this:
myString.replace(/\[(test)\](?!\.\[test\])/g, "[newvalue]");
The regex you are using, by contrast, also matches the [db] part.
I've tried with this code:
class Greeter {
myString1: string;
myString2: string;
myString3: string;
greeting: string;
constructor(str1: string, str2: string, str3: string) {
this.myString1 = str1.replace(/\[(test)\](?!\.\[test\])/g, "[newvalue]");
this.myString2 = str2.replace(/\[(test)\](?!\.\[test\])/g, "[newvalue]");
this.myString3 = str3.replace(/\[(test)\](?!\.\[test\])/g, "[newvalue]");
this.greeting = this.myString1 + "\n" + this.myString2 + "\n" + this.myString3;
}
greet() {
return "Hello, these are your replacements:\n" + this.greeting;
}
}
let greeter = new Greeter("[test]", "[test].[db]", "[test].[test]");
let button = document.createElement('button');
button.textContent = "Say Hello";
button.onclick = function() {
alert(greeter.greet());
}
document.body.appendChild(button);
Online playground here.
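The question's own suspicion about the replacement index can be illustrated with a small sketch. It is written in Python rather than TypeScript, and `VALUES`, `replace_first_occurrence`, and `replace_at_match` are hypothetical names; wrapping the value in brackets is also an assumption made to mirror the expected output. The loop in the question calls `newContents.replace(match[0], vValue)` with a plain string, which replaces the first occurrence of that text rather than the occurrence the regex matched. The first function mimics that behaviour, and the second replaces at the matched position instead.

```python
import re

TOKEN = re.compile(r"\[(\w+)\](?!\.\[test\])")
VALUES = {"test": "newvalue"}  # hypothetical environment lookup


def replace_first_occurrence(contents):
    """Mimic the loop in the question: replace by text, not by position."""
    new_contents = contents
    for m in TOKEN.finditer(contents):
        value = VALUES.get(m.group(1))
        if value is not None:
            # Replaces the FIRST remaining "[test]", wherever the match was.
            new_contents = new_contents.replace(m.group(0), f"[{value}]", 1)
    return new_contents


def replace_at_match(contents):
    """Replace each token exactly where the regex matched it."""
    def repl(m):
        value = VALUES.get(m.group(1))
        return m.group(0) if value is None else f"[{value}]"
    return TOKEN.sub(repl, contents)


if __name__ == "__main__":
    text = "[test]\n[test].[db]\n[test].[test]"
    print(replace_first_occurrence(text))
    print(replace_at_match(text))
```

In TypeScript the equivalent of the second function would be a single `contents.replace(reg, callback)` call, which also substitutes at the matched position.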
I'm just getting started with CoreNLP's TokenSequencePattern and I can't get simple matches to work. All I'm trying to do is match a token from the input text. The code below executes without errors but doesn't match anything. However, if you change the match expression to [], it matches the two sentences.
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation("This is sent 1. And here is sent 2");
pipeline.annotate(document);
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
Env env = TokenSequencePattern.getNewEnv();
env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
TokenSequencePattern pattern = TokenSequencePattern.compile(env,"[ { word:\"sent\" } ]");
TokenSequenceMatcher matcher = pattern.getMatcher(sentences);
while ( matcher.find() ) {
System.out.println( matcher.group() );
}
Thank you!
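Pass the list of tokens rather than the list of sentences to getMatcher, so that each element being matched is a token with a word annotation:
List<CoreLabel> tokens = document.get(CoreAnnotations.TokensAnnotation.class);
TokenSequencePattern pattern = TokenSequencePattern.compile("[ { word:\"sent\" } ]");
TokenSequenceMatcher matcher = pattern.getMatcher(tokens);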
while (matcher.find())
{
String matchedString = matcher.group();
List<CoreMap> matchedTokens = matcher.groupNodes();
System.out.println(matchedString + " " + matchedTokens);
}
I have the following program, and I want to extract these values from the string:
"planned", "not automated", "st3reporter", "functional", "report-upto3times-per2hrs", "st3-throttling-cdb"
import re
string='''
import org.testng.Assert;
import org.testng.annotations.AfterMethod;
import org.testng.annotations.BeforeMethod;
import org.testng.annotations.Test;
public class ReportUpToTimesEveryHours {
String objective = "04 - Report up to 3 times every 2 hours";
String testName = "ReportUpToTimesEveryHours";
#BeforeMethod(alwaysRun = true)
public void beforeMethod() {
logger.info(Constants.LOGGER_SEPERATOR);
logger.info("Start -- " + testName + " - " + objective);
}
#Test(groups = { "planned", "not automated", "st3reporter", "functional", "report-upto3times-per2hrs", "st3-throttling-cdb" },
description = "04 - Report up to 3 times every 2 hours")
public void testReportUpToTimesEveryHours() {
}
#AfterMethod(alwaysRun = true)
public void afterMethod() {
logger.info("End -- " + testName + " - " + objective);
logger.info(Constants.LOGGER_SEPERATOR);
}
}
'''
pattern=re.compile(r' (\#Test\(groups\s*=\s*\{)')
m= pattern.search(string)
print m.group()
Try something like this:
pattern=re.compile(r'#Test\(groups\s*=\s*\{([^\}]+)')
result_set = [i.strip() for i in
pattern.search(string).group(1).replace('"', '').split(',')]
result_set will then hold the list:
['planned', 'not automated', 'st3reporter', 'functional', 'report-upto3times-per2hrs', 'st3-throttling-cdb']
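As a variation on the approach above, the quoted values can be pulled out with a second regex instead of splitting on commas, which would also cope with a group name that itself contains a comma. This is a sketch only; `extract_groups` is a hypothetical helper name, and it assumes the annotation is written as `#Test(groups = { ... })` as in the question.

```python
import re


def extract_groups(source):
    """Return the quoted group names inside #Test(groups = { ... })."""
    # First isolate the text between the braces of the groups attribute.
    braces = re.search(r"#Test\(groups\s*=\s*\{([^}]*)\}", source)
    if braces is None:
        return []
    # Then collect every double-quoted string inside that span.
    return re.findall(r'"([^"]*)"', braces.group(1))


if __name__ == "__main__":
    sample = '#Test(groups = { "planned", "not automated" },\ndescription = "04")'
    print(extract_groups(sample))
```

Returning an empty list when the annotation is missing avoids the AttributeError that calling `.group(1)` on a failed search would raise.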
Could someone explain why this snippet:
// import com.google.gwt.regexp.shared.MatchResult;
// import com.google.gwt.regexp.shared.RegExp;
RegExp regExp = RegExp.compile("^$");
MatchResult matcher;
while ((matcher = regExp.exec("")) != null)
{
System.out.println("match " + matcher);
}
produces a seemingly endless number of matches? I tested the different flags allowed by GWT's implementation of compile(): g, i and m. It works only with m (multiline).
I just want to check for empty string.
[EDIT] the new method
private ArrayList<MatchResult> getMatches(String input, String pattern)
{
ArrayList<MatchResult> matches = new ArrayList<MatchResult>();
if(null == regExp)
{
regExp = RegExp.compile(pattern, "g");
}
if(input.isEmpty())
{
// empty string: just check whether the pattern validates and
// don't try to extract matches, as that would result in an infinite
// loop.
if(regExp.test(input))
{
matches.add(new MatchResult(0, "", new ArrayList<String>(0)));
}
}
else
{
for(MatchResult matcher = regExp.exec(input); matcher != null; matcher = regExp
.exec(input))
{
matches.add(matcher);
}
}
return matches;
}
Your regExp.exec("") with RegExp.compile("^$") will never return null: the empty string "" matches the regex ^$, which reads "nothing between the beginning and the end of the line/string", and every call to exec finds that same zero-length match again.
So your while loop is an infinite loop.
Also, your print statement is
System.out.println("match " + matcher);
...but you probably wanted to use
System.out.println("match " + matcher.getGroup(0));
Also see GWT checking if textbox is empty.
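The general hazard described above, a scanning loop that never moves past a zero-length match, can be sketched in Python as follows. This is not GWT code; `all_matches` is a hypothetical helper, and the guard shown (stepping one character forward after an empty match) is one common way to avoid the infinite loop.

```python
import re


def all_matches(pattern, text):
    """Collect (start, end) spans of every match, including empty ones."""
    regex = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        # A zero-length match leaves the position unchanged, so step
        # forward by one character to guarantee the loop terminates.
        pos = m.end() + 1 if m.end() == m.start() else m.end()
    return spans


if __name__ == "__main__":
    print(all_matches(r"^$", ""))
    print(all_matches(r"\d+", "a12b3"))
```

If the goal is only to test for an empty string, a plain emptiness check such as the `input.isEmpty()` call in the edited method avoids the regex loop altogether.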