I have a short test program to extract strings separated by tabs, but the output does not make sense to me. The idea is to find the next tab position and return the characters between the previous tab and the next one.
The output of my program is below. Where did the "a rob" come from?
fred ted rob a rob alex
program
<cfscript>
s="fred"&chr(9)&"ted"&chr(9)&"rob"&chr(9)&"alex";
oldp=0;
while(oldp<Len(s))
{
    p=Find(chr(9),s,oldp+1);
    if (p==0)
        break;
    m=Mid(s,oldp+1,p); // oldp is the previous tab position, p is the next one; get the string in between
    WriteOutput(m);
    WriteOutput(" ");
    oldp=p;
}
</cfscript>
Now if I change the program to print out oldp after each string the result is:
fred => 1
ted rob a => 6
rob alex => 10
I would expect to see 1, 5, 9. I don't understand why "ted rob" is the second string. I would expect to see "rob" instead.
Mid(s,oldp+1,p);
To answer your question, that is not how mid works. The third parameter p is the number of characters to return, not a position in the string.
mid(s, 6, 3); // this would return "ted"
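To make the position-versus-count distinction concrete, here is a minimal sketch of the same find-the-next-tab loop, written in JavaScript rather than CFML. The function name fieldsBetweenTabs is made up for illustration. The length of each field is computed from the two positions, which is the kind of value Mid expects as its third argument.

```javascript
// Hypothetical helper: returns the fields of a tab-separated string.
function fieldsBetweenTabs(s) {
    const fields = [];
    let start = 0; // 0-based index where the current field begins
    while (true) {
        const tab = s.indexOf("\t", start);
        if (tab === -1) {
            // No more tabs: the rest of the string is the last field.
            fields.push(s.slice(start));
            break;
        }
        const count = tab - start; // a number of characters, not an end position
        fields.push(s.slice(start, start + count));
        start = tab + 1; // continue just past the tab
    }
    return fields;
}

console.log(fieldsBetweenTabs("fred\tted\trob\talex"));
// [ 'fred', 'ted', 'rob', 'alex' ]
```

In the CFML loop from the question, where oldp and p are both tab positions, the equivalent count would be p - oldp - 1.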
If I can make a suggestion - it is much easier to treat the string as a list, separated by tabs. Then parse it with list functions.
<cfscript>
str = "red"& chr(9) &"ted"& chr(9) &"rob"& chr(9) &"alex";
for (i = 1; i <= listLen(str, chr(9)); i++) {
    WriteDump( listGetAt(str, i, chr(9)) );
}
</cfscript>
Note, most list functions ignore empty elements. If you wish to preserve them, use listToArray.
<cfscript>
str = "red"& chr(9) &"ted"& chr(9) &"rob"& chr(9) &"alex";
arr = listToArray(str, chr(9), true);
for (i = 1; i <= arrayLen(arr); i++) {
    WriteDump( arr[i] );
}
</cfscript>
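To show why empty elements matter, here is a small JavaScript sketch (used only for illustration, with a made-up sample row). JavaScript's split keeps empty fields, and filtering them out mimics the default behaviour of the list functions described above.

```javascript
// Sample row with an empty second field and a trailing tab (illustrative data).
const row = "fred\t\trob\t";

// split keeps empty fields, so positions stay aligned with the columns.
const allFields = row.split("\t");

// Dropping empty strings loses the column positions.
const nonEmpty = allFields.filter(field => field !== "");

console.log(allFields.length); // 4
console.log(nonEmpty.length);  // 2
```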
I want paragraphs to be up to 3 sentences only.
For that, my strategy is to loop on all paragraphs and find the 3rd sentence ending (see note). And then, to add a "\r" char after it.
This is the code I have:
for (var i = 1; i < paragraphs.length; i++) {
    ...
    sentEnds = paragraphs[i].getText().match(/[a-zA-Z0-9_\u0590-\u05fe][.?!](\s|$)|[.?!][.?!](\s|$)/g);
    //this array is used to count sentences in Hebrew/English/digits that end with 1 or more of either ".","?" or "!"
    ...
    if ((sentEnds != null) && (sentEnds.length > 3)) {
        lineBreakAnchor = paragraphs[i].getText().match(/.{10}[.?!](\s)/g);
        paragraphs[i].replaceText(lineBreakAnchor[2],lineBreakAnchor[2] + "\r");
    }
}
This works fine the first time. But if I run the code again, the text after the inserted "\r" character is not recognized as a new paragraph, so more "\r" characters (new lines) are inserted each time the script runs.
How can I make the script "understand" that "\r" means new, separate paragraph?
OR
Is there another character/approach that will do the trick?
Thank you.
Note: I use the last 10 characters of the sentence assuming the match will be unique enough to make only 1 replacement.
You can achieve this without modifying your own regular expression.
Try this approach to split the paragraphs:
Grab the whole content of the document and create an array of sentences.
Insert paragraphs with up to 3 sentences after original paragraphs.
Remove original paragraphs from hell.
function sentenceMe() {
    var doc = DocumentApp.getActiveDocument();
    var paragraphs = doc.getBody().getParagraphs();
    var sentences = [];
    // Split paragraphs into sentences
    for (var i = 0; i < paragraphs.length; i++) {
        var parText = paragraphs[i].getText();
        // Count sentences in Hebrew/English/digits that end with 1 or more of either ".","?" or "!"
        var sentEnds = parText.match(/[a-zA-Z0-9_\u0590-\u05fe][.?!](\s|$)|[.?!][.?!](\s|$)/g);
        if (sentEnds) {
            for (var j = 0; j < sentEnds.length; j++) {
                var initIdx = 0;
                var sentence = parText.substring(initIdx, parText.indexOf(sentEnds[j]) + 3);
                var parInitIdx = initIdx;
                initIdx = parText.indexOf(sentEnds[j]) + 3;
                parText = parText.substring(initIdx - parInitIdx);
                sentences.push(sentence);
            }
        }
        // console.log(sentences);
    }
    inThrees(doc, paragraphs, sentences);
}
function inThrees(doc, paragraphs, sentences) {
    // define offset
    var offset = paragraphs.length;
    // Create paragraphs with up to 3 sentences
    var k = 0;
    do {
        var parText = sentences.splice(0, 3).join(' ');
        doc.getBody().insertParagraph(k + offset, parText.concat('\n'));
        k++;
    }
    while (sentences.length > 0);
    // Remove paragraphs from hell
    for (var i = 0; i < offset; i++) {
        doc.getBody().removeChild(paragraphs[i]);
    }
}
In case you are wondering about the custom menu, here it is:
function onOpen() {
    var ui = DocumentApp.getUi();
    ui.createMenu('Custom Menu')
        .addItem("3's the magic number", 'sentenceMe')
        .addToUi();
}
References:
DocumentApp.Body.insertParagraph
Actually the detection of sentences is not an easy task.
A sentence does not always end with a dot, a question mark or an exclamation mark. If the sentence ends with a quote then punctuation rules in some countries force you to put the end of the sentence mark inside the quote:
John asked: "Who's there?"
Not every dot marks the end of a sentence. A dot after an uppercase letter usually does not end the sentence, because it follows an initial. The sentence does not end after J. here:
The latest Star Wars movie has been directed by J.J. Abrams.
However, sometimes the sentence does end after a capital letter followed by a dot:
This project has been sponsored by NASA.
And abbreviations can make it very hard:
For more information check the article in Phys. Rev. Letters 66, 2697, 2013.
Bearing these difficulties in mind, let's still try to find an expression that works in the "usual" cases.
Make a global match and substitution. Match
((?:[^.?!]+[.?!] +){3})
and substitute it with
\1\r
This looks for 3 sentences (a sentence is a sequence of not-dot, not-?, not-! characters followed by a dot, a ? or a ! and some spaces) and puts a \r after them.
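As a rough illustration, the match-and-substitute step can be tried on a plain JavaScript string (this does not touch the Google Docs API, and the function name breakEveryThree is made up). JavaScript writes the backreference as $1 rather than \1. The caveats about abbreviations, initials and quotes listed above still apply.

```javascript
// Insert a carriage return after every run of three "sentences".
// A sentence here is naive: non-terminator characters, one terminator, then spaces.
function breakEveryThree(text) {
    return text.replace(/((?:[^.?!]+[.?!] +){3})/g, "$1\r");
}

const sample = "One. Two? Three! Four. Five.";
console.log(JSON.stringify(breakEveryThree(sample)));
// "One. Two? Three! \rFour. Five."
```

The trailing "Four. Five." is left alone because it does not contain three sentences that are each followed by a space.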
UPDATED 2020-03-04
Try this:
var regex = new RegExp('((?:[a-zA-Z0-9_\\u0590-\\u05fe\\s]+[.?!]+\\s+){3})', 'gi');
for (var i = 1; i < paragraphs.length; i++) {
paragraphs[i].replaceText(regex, '$1\\r');
}
I need to split a string into an array whose elements are pairs of consecutive words, using Scala:
"Hello, it is useless text. Hope you can help me."
The result:
[[it is], [is useless], [useless text], [Hope you], [you can], [can help], [help me]]
One more example:
"This is example 2. Just\nskip it."
Result:
[[This is], [is example], [Just skip], [skip it]]
I tried this regex:
val re = """[a-zA-Z]+\s[a-zA-Z]+""".r
But the output is:
scala> for (m <- re.findAllIn("Hello, it is useless text. Hope you can help me.")) println(m)
it is
useless text
Hope you
can help
So it ignores some cases.
First split on the punctuation and digits, then split on the spaces, then slide over the results.
def doubleUp(txt :String) :Array[Array[String]] =
  txt.split("[.,;:\\d]+")
     .flatMap(_.trim.split("\\s+").sliding(2))
     .filter(_.length > 1)
usage:
val txt1 = "Hello, it is useless text. Hope you can help me."
doubleUp(txt1)
//res0: Array[Array[String]] = Array(Array(it, is), Array(is, useless), Array(useless, text), Array(Hope, you), Array(you, can), Array(can, help), Array(help, me))
val txt2 = "This is example 2. Just\nskip it."
doubleUp(txt2)
//res1: Array[Array[String]] = Array(Array(This, is), Array(is, example), Array(Just, skip), Array(skip, it))
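For readers more comfortable outside Scala, the same split-then-slide idea can be sketched in JavaScript. This is only an illustration under the same assumption as the answer above, namely that punctuation and digits mark the boundaries that pairs must not cross. The name wordPairs is made up.

```javascript
// Returns every pair of adjacent words that sit in the same segment.
function wordPairs(text) {
    const pairs = [];
    // Split on runs of punctuation and digits so pairs never cross a boundary.
    for (const segment of text.split(/[.,;:\d]+/)) {
        // Split on whitespace (including newlines) and drop empty strings.
        const words = segment.trim().split(/\s+/).filter(w => w.length > 0);
        // Slide a window of size 2 over the words.
        for (let i = 0; i + 1 < words.length; i++) {
            pairs.push([words[i], words[i + 1]]);
        }
    }
    return pairs;
}

console.log(wordPairs("This is example 2. Just\nskip it."));
// [ [ 'This', 'is' ], [ 'is', 'example' ], [ 'Just', 'skip' ], [ 'skip', 'it' ] ]
```

Segments with a single word, such as "Hello" before the comma in the first example, produce no pair.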
First process the string as it is by removing all escape characters.
scala> val string = "Hello, it is useless text. Hope you can help me."
val preprocessed = StringContext.processEscapes(string)
//preprocessed: String = Hello, it is useless text. Hope you can help me.
OR
scala> val string = "This is example 2. Just\nskip it."
val preprocessed = StringContext.processEscapes(string)
//preprocessed: String =
//This is example 2. Just
//skip it.
Then filter out the unwanted tokens (empty strings and words with trailing punctuation) and use the sliding function:
val result = preprocessed.split("\\s").filter(e => !e.isEmpty && !e.matches("(?<=^|\\s)[A-Za-z]+\\p{Punct}(?=\\s|$)") ).sliding(2).toList
//scala> res9: List[Array[String]] = List(Array(it, is), Array(is, useless), Array(useless, Hope), Array(Hope, you), Array(you, can), Array(can, help))
You need to use split to break the string down into words separated by non-word characters, and then sliding to pair up the words in the way that you want:
val text = "Hello, it is useless text. Hope you can help me."
text.trim.split("\\W+").sliding(2)
You may also want to remove escape characters, as explained in other answers.
Sorry I only know Python. I heard the two are almost the same. Hope you can understand
string = "it is useless text. Hope you can help me."
split = string.split(' ')  # splits on space (you can use regex for this)
result = []
no = 0
count = len(split)
for x in range(count):
    no += 1
    if no < count:
        pair = split[x] + ' ' + split[no]  # joins the current word with the next one
        result.append(pair)
The output will be:
['it is', 'is useless', 'useless text.', 'text. Hope', 'Hope you', 'you can', 'can help', 'help me.']
I am trying to use multiple characters as the delimiter in a ColdFusion list, such as ", " (comma and blank), but it ignores the blank.
I then tried to use:
<cfset title = listappend( title, a[ idx ].title, "#Chr(44)##Chr(32)#" ) />
But it also ignores the blank, and without blanks the list items are too difficult to read.
Any ideas?
With ListAppend you can only use one delimiter. As the docs say for the delimiters parameter:
If this parameter contains more than one character, ColdFusion uses only the first character.
I'm not sure what a[ idx ].title contains or exactly what the expected result is (would be better if you gave a complete example), but I think something like this will do what you want or at least get you started:
<cfscript>
a = [
    {"title"="One"},
    {"title"="Two"},
    {"title"="Three"}
];
result = "";
for (el in a) {
    result &= el.title & ", ";
}
}
writeDump(result);
</cfscript>
I think there's a fundamental flaw in your approach here. The list delimiter is part of the structure of the data, whereas you are also trying to use it for "decoration" when you come to output the data from the list. Whilst often conveniently this'll work, it's kinda conflating two ideas.
What you should do is eschew the use of lists as a data structure completely, as they're a bit crap. Use an array for storing the data, and then deal with rendering it as a separate issue: write a render function which puts whatever separator you want in your display between each element.
function displayArrayAsList(array, separator){
    var list = "";
    for (var element in array){
        list &= (len(list) ? separator : "");
        list &= element;
    }
    return list;
}
writeOutput(displayArrayAsList(["tahi", "rua", "toru", "wha"], ", "));
tahi, rua, toru, wha
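The same separation of storage and rendering can be sketched in JavaScript for comparison. The items array and the renderTitles name are made up for illustration. The data stays in an array, and the separator is chosen only at display time.

```javascript
// Data lives in an array; no delimiter is baked into it.
const items = [{ title: "One" }, { title: "Two" }, { title: "Three" }];

// Rendering picks the separator, which may be longer than one character.
function renderTitles(list, separator) {
    return list.map(item => item.title).join(separator);
}

console.log(renderTitles(items, ", ")); // One, Two, Three
```

Because join only places the separator between elements, there is no trailing ", " to strip afterwards.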
Use a two-step process. Step 1: create your comma-delimited list. Step 2:
yourList = replace(yourList, ",", ", ", "all");
I have code where the values come from the database as:
a,b,c
Now I want to remove c from the string based on a condition; c can be anywhere: first, last or in the middle.
I am using Replace to do it like this:
<cfset answer = Replace('a,b,c','c','','all')>
This works, but it leaves a trailing comma at the end or the start, or two commas in the middle, which breaks the whole string. What should my approach be here?
<cfscript>
input = 'a,b,c';
foundAt = listFind(input, 'c');
answer = foundAt ? listDeleteAt(input, foundAt) : input;
writeOutput(answer);
</cfscript>
See: List functions
OR use REReplace(). The solution was just one google search away: Regex for removing an item from a comma-separated string?
function listRemoveAll(list, item) {
    return REReplace(list, "\b#item#\b,|,\b#item#\b$", "", "all");
}
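As an alternative illustration that avoids regular expressions, the removal can be sketched in JavaScript by splitting, filtering and re-joining. The name removeListItem is made up. Because the delimiter is rebuilt by join, no stray commas are left regardless of where the item sat.

```javascript
// Remove every exact occurrence of item from a delimited list string.
function removeListItem(list, item, delimiter = ",") {
    return list
        .split(delimiter)
        .filter(element => element !== item) // exact match, so "cc" survives when removing "c"
        .join(delimiter);
}

console.log(removeListItem("a,b,c", "c")); // a,b
console.log(removeListItem("c,a,b", "c")); // a,b
console.log(removeListItem("a,c,b", "c")); // a,b
```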
Is there a regular expression for matching a string that is not necessarily complete?
Example:
some other supercalifragilisticexpialidocious random things
and maybe supercalifragilistic meaningless padding
lorem superca ipsum dolor
I would like to match whichever left part of supercalifragilisticexpialidocious there is each time. There are not necessarily spaces around words.
The expected result would be to find:
supercalifragilisticexpialidocious
supercalifragilistic
superca
This is similar to matching the same character any number of times, but more universal.
Thank you!
I know this isn't a regex, but I think your objective can be accomplished better using code. Here's an example of a JavaScript function that matches as much of an input string as it can:
function matchMost(find, string){
    for(var i = 0 ; i < find.length ; i++){
        for(var j = find.length ; j > i ; j--){
            if(string.indexOf(find.substring(i, j)) !== -1){
                return find.substring(i, j);
            }
        }
    }
    return false;
}
For example, if you call matchMost("supercalifragilisticexpialidocious", "lorem superca ipsum dolor"), it will return the string "superca". If string doesn't contain a single character from find, the function will return false.
Here's a JS Fiddle where you can test this code: http://jsfiddle.net/n252eyw1/
UPDATE
This function matches as much of the left side of an input string as it can:
function matchMostLeft(find, string){
    for(var j = find.length ; j > 0 ; j--){
        if(string.indexOf(find.substring(0, j)) !== -1){
            return find.substring(0, j);
        }
    }
    return false;
}
JS Fiddle: http://jsfiddle.net/sjy312ae/
There is, but it's not tidy at all (and probably not very performant either). This regex matches at least 3 characters on the left side and up to supercal as written; the way to extend it should be fairly plain.
sup(?:e(?:r(?:c(?:a(?:l)?)?)?)?)?
Paul's answer is likely far more useful in the general case.
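Since writing the nested optional groups by hand gets tedious for a long word, here is a hedged sketch that generates such a pattern programmatically. The function name prefixRegex and its minLength parameter are made up for illustration. Everything after the first minLength characters is wrapped in nested optional groups.

```javascript
// Build a regex that matches any prefix of word that is at least minLength long.
function prefixRegex(word, minLength) {
    if (minLength < 1 || minLength > word.length) {
        throw new RangeError("minLength must be between 1 and word.length");
    }
    // Escape characters that have a special meaning in a regex.
    const escape = s => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    let optionalTail = "";
    // Work from the last character inward so each group nests the next one.
    for (let i = word.length - 1; i >= minLength; i--) {
        optionalTail = "(?:" + escape(word[i]) + optionalTail + ")?";
    }
    return new RegExp(escape(word.slice(0, minLength)) + optionalTail, "g");
}

const partial = prefixRegex("supercalifragilisticexpialidocious", 3);
console.log("lorem superca ipsum dolor".match(partial)); // [ 'superca' ]
```

The groups are greedy, so each match extends as far along the word as the text allows.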
Below is an alternative solution. It is similar to Paul's in that it uses indexOf rather than regular expressions. This also makes it equally case-sensitive. My approach should perform better in exceptional cases where Paul's solution would cause excessive calls to indexOf; typically when:
the needle is very long (worse than supercalifragilisticexpialidocious), and
you have a lot of separate texts to scan, and
the majority of texts either do not match, or contain only short matches.
If this is not the case with you, then please use Paul's solution, as it is clean, simple and readable.
function getLongestMatchingPrefix(needle, haystack) {
    var len = 0;
    var i = 0;
    // Search from the last match position so that i stays an absolute index into haystack
    while (len < needle.length && (i = haystack.indexOf(needle.substring(0, len + 1), i)) >= 0) {
        while (++len < needle.length && haystack.substring(i, i + len + 1) == needle.substring(0, len + 1)) {}
    }
    return needle.substring(0, len);
}
Fiddle: http://jsfiddle.net/ov26msj5/