Regex for excluding lines starting with // from a word counter

I am building a novel-writing tool that includes in-line annotations designated by "//" a la JavaScript.
I want to be able to count all of the words that don't belong to an annotation (and therefore belong to the 'real' novel) so that a writer can use this to track their word count goals.
For word counts so far, I've been using: /\S+/g
I've successfully found a way to exclude full lines with a // prefix: /^(?!\/\/).+$/gm
But they don't work together when combined, i.e. /\S+^(?!\/\/).+$/gm
How would I also exclude words between a // and the end of a line? For example: These words are included.//but these aren't
Some example text with all cases:
// Scene Name - This is a scene description.
// !Location
// #John #David
Hello, I am very grateful to the Stack Overflow community for teaching me how to fix every problem I've ever had. //wow good content
And here's some more text. This is 30 words.
What am I missing?
[Edit: I am using /\S+/g for the word count regex, not /\w+/g, which counts contractions as two words]

I suggest you divide the operation in two. First, replace using the following (simple) regex:
/\/\/.*/gm
It matches two slashes followed by every character up to the end of the line.
Replace each match with an empty string. Now you have clean text without annotations, and you can run your word-counting regex on it.
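The two steps can be sketched in Python as follows. This is only an illustration of the approach, not the tool's actual code; the function name count_words is made up here, and the sketch assumes that every // starts an annotation (so a URL such as http://... inside the novel text would be cut as well).

```python
import re

def count_words(text: str) -> int:
    # Step 1: drop everything from "//" to the end of each line.
    # '.' does not match a newline, so each match stops at the line end.
    without_annotations = re.sub(r"//.*", "", text)
    # Step 2: count runs of non-whitespace characters, as in the question.
    return len(re.findall(r"\S+", without_annotations))

sample = """// Scene Name - This is a scene description.
// !Location
// #John #David
Hello, I am very grateful to the Stack Overflow community for teaching me how to fix every problem I've ever had. //wow good content
And here's some more text. This is 30 words."""

print(count_words(sample))  # 30
```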

This pattern should be what you need. ^.+?(?=//)|^(?!//).+
Let me know if you have any questions.
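As a rough check of this single-pattern approach, here is a Python sketch. The names NOVEL_TEXT and count_novel_words are hypothetical, and the sketch assumes multiline mode so that ^ matches at the start of every line.

```python
import re

# The pattern from the answer; re.MULTILINE makes ^ match at every line start.
NOVEL_TEXT = re.compile(r"^.+?(?=//)|^(?!//).+", re.MULTILINE)

def count_novel_words(text: str) -> int:
    # Each match is the non-annotation part of one line.
    kept_parts = NOVEL_TEXT.findall(text)
    return sum(len(re.findall(r"\S+", part)) for part in kept_parts)

sample = "// !Location\nThese words are included.//but these aren't\nPlain line here."
print(NOVEL_TEXT.findall(sample))
print(count_novel_words(sample))
```

The first alternative keeps the text in front of a trailing annotation, and the second keeps whole lines that do not start with //. Lines that begin with // produce no match at all.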

Related

Is there a way to capture any character along with newline at the end?

I'm working on this TDD tutorial exercise that centers around creating and testing a string calculator. I'm at the step where you create a custom delimiter. The input should look like //[delimiter]\n[numbers…]; for example, an accepted input would be //%\n5%2%5.
I'm adding to the regex step by step and I've hit a brick wall. I am currently only trying to match the //[delimiter]\n part. How do I match any character (including newline) while keeping the closing newline? For example, .* keeps gobbling up the entire string, .+? only takes one character. I have also tried to use //.*(?=\n) but still no match. I suspect that I have to use a lookahead but how do I implement this properly?
The link to this problem is http://osherove.com/tdd-kata-1/. Any pointers are appreciated and have a great day.
Why not just match numbers and then add the groups together?
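import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "//[***]\n1***2*\n**3";
// \d+ matches whole numbers, so "12" counts as twelve rather than 1 + 2
Pattern pattern = Pattern.compile("\\d+");
Matcher matcher = pattern.matcher(input);
int total = 0;
while (matcher.find()) {
    total += Integer.parseInt(matcher.group());
}
System.out.println(total); // prints 6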
1.3 of the question states to solve things as simply as possible. Matching the digits and adding them together works with any delimiter, since it doesn't matter what is between the numbers.
Verify if your program reads the data line-by-line (same as sed). If not, the following should work:
//(.*?)\n\d+(\1\d+)+
Here the delimiter can be any sequence of characters (other than a newline), and any number of numbers, at least two, can follow on the second line.
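To see how the backreference \1 ties the separators to the declared delimiter, here is a small Python sketch. The function name add and the error message are assumptions made for illustration; the kata itself does not prescribe them, and the sketch rejects an empty delimiter even though the pattern would allow one.

```python
import re

# Pattern from the answer: group 1 captures the delimiter, \1 reuses it.
HEADER_AND_NUMBERS = re.compile(r"//(.*?)\n\d+(\1\d+)+")

def add(text: str) -> int:
    match = HEADER_AND_NUMBERS.fullmatch(text)
    if match is None or not match.group(1):
        raise ValueError("expected //<delimiter>\\n<numbers>")
    delimiter = match.group(1)
    # Everything after the first newline is the line of numbers.
    numbers_line = text.split("\n", 1)[1]
    return sum(int(n) for n in numbers_line.split(delimiter))

print(add("//%\n5%2%5"))        # 12
print(add("//***\n1***2***3"))  # 6
```

Because . does not match a newline, the lazy (.*?) can only capture the text between // and the first line break.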

Regex - How to search for singular or plural version of word [duplicate]

I'm trying to do what should be a simple Regular Expression, where all I want to do is match the singular portion of a word whether or not it has an s on the end. So if I have the following words
test
tests
EDIT: Further examples. I need this to be possible for many words, not just those two:
movie
movies
page
pages
time
times
For all of them I need to get the word without the s on the end but I can't find a regular expression that will always grab the first bit without the s on the end and work for both cases.
I've tried the following:
([a-zA-Z]+)([s\b]{0,}) - This returns the full word as the first match in both cases
([a-zA-Z]+?)([s\b]{0,}) - This returns 3 different matching groups for both words
([a-zA-Z]+)([s]?) - This returns the full word as the first match in both cases
([a-zA-Z]+)(s\b) - This works for tests but doesn't match test at all
([a-zA-Z]+)(s\b)? - This returns the full word as the first match in both cases
I've been using http://gskinner.com/RegExr/ to try out the different regexes.
EDIT: This is for a Sublime Text snippet. For those who don't know, a snippet in Sublime Text is a shortcut: I can type, say, the name of my database and hit "run snippet", and it turns it into something like:
$movies = $this->ci->db->get_where("movies", "");
if ($movies->num_rows()) {
    foreach ($movies->result() AS $movie) {
    }
}
All I need is to turn "movies" into "movie" and auto-insert it into the foreach loop.
This means I can't just do a find and replace on the text, and I only need to take 60-70 words into account (it only runs against my own tables, not every word in the English language).
Thanks!
- Tim
Ok I've found a solution:
([a-zA-Z]+?)(s\b|\b)
Works as desired, then you can simply use the first match as the unpluralized version of the word.
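A quick way to check this pattern is the Python sketch below. The helper name singularize is made up for the example; only the pattern comes from the answer.

```python
import re

# Lazy letters in group 1, then either a final "s" at a word boundary
# or just the word boundary itself.
SINGULAR = re.compile(r"([a-zA-Z]+?)(s\b|\b)")

def singularize(word: str) -> str:
    match = SINGULAR.match(word)
    return match.group(1) if match else word

for table in ["tests", "test", "movies", "pages", "times"]:
    print(table, "->", singularize(table))
```

Note that the pattern strips any trailing s, so a singular word that ends in s (for example class) loses its last letter too. For a fixed list of 60-70 table names that is easy to check by hand.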
Thanks @Jahroy for helping me find it. I added this as an answer for future surfers who just want a solution, but please check out Jahroy's comment for more in-depth information.
For simple plurals, use this:
test(?=s| |$)
For more complex plurals, you're in trouble using regex. For example, this regex
part(y|i)(?=es | )
will return "party" or "parti", but I'm not sure what you would do with that.
Here's how you can do it with vi or sed:
s/\([A-Za-z]*\)[sS]$/\1/
That replaces a run of letters ending in s (or S) with everything but that final letter.
NOTE:
The escape chars (backslashes before the parens) might be different in different contexts.
ALSO:
The \1 (which refers to the first captured group) may also vary depending on context.
ALSO:
This will only work if your word is the only word on the line.
If your table name is one of many words on the line, you could probably replace the $ (which stands for the end of the line) with a wildcard that represents whitespace or a word boundary (these differ based on context).

RegExp Find skip letter in the word

I want to find a word even when it is written with a letter skipped.
For example, I want to find
references
I also want to find refrences or refernces, but not refer.
I wrote this regexp:
(\brefe?r?e?n?c?e?s?\b)
I want to add a check on the length of the matched group: it should be greater than 8.
Can I do this with regexp methods alone?
I don't think regex is a good tool for finding similar words the way you are trying to. What do you do if two letters are swapped, as in "refernece"? Your regex will not find it.
But to show the regex way to check for the length, you could use a lookahead like this:
(\b(?=\w{8,}\b)refe?r?e?n?c?e?s?\b)
The (?=\w{8,}\b) checks that at least 8 word characters ({8,}) lie between the first \b and the next one. Using \w rather than . keeps the check inside a single word; .{8,} could run across spaces and let a short word such as refer through when more text follows it.
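The length check can be tried out with the Python sketch below. It assumes the lookahead counts word characters (\w) rather than any character (.), because .{8,} can extend across spaces into the following words; the LOOSE pattern is included only to show that difference. All names are hypothetical.

```python
import re

# Lookahead restricted to word characters: the word itself must be 8+ long.
STRICT = re.compile(r"\b(?=\w{8,}\b)refe?r?e?n?c?e?s?\b")
# Same idea with '.', which may count characters from the following words.
LOOSE = re.compile(r"\b(?=.{8,}\b)refe?r?e?n?c?e?s?\b")

def is_near_match(word: str) -> bool:
    return STRICT.fullmatch(word) is not None

for word in ["references", "refrences", "refernces", "refer"]:
    print(word, is_near_match(word))

# In running text the '.' version lets the short word through.
print(LOOSE.search("refer to the docs"))
print(STRICT.search("refer to the docs"))
```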
I don't think regex is a good idea here. You need more powerful functions. For example, if you are programming in PHP, you need a function like similar_text. More details here: http://www.php.net/manual/en/function.similar-text.php
Basically you are asking for this (in pseudocode):
input == "references" or (levenshtein("references", input)==1 and length(input) == (length("references")-1))
Levenshtein distance is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
Since you want to detect only the strings where a char was skipped, you must add the constraint on the string length.
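The same condition can be sketched in Python without a Levenshtein library. Instead of computing the full edit distance, this version simply tries every single-character deletion of the target word, which accepts the same inputs as "distance 1 and one character shorter". The function name is made up for the example.

```python
def is_word_or_one_skip(candidate: str, target: str = "references") -> bool:
    if candidate == target:
        return True
    # A skipped letter makes the candidate exactly one character shorter.
    if len(candidate) != len(target) - 1:
        return False
    # Remove each character of the target in turn and compare.
    return any(
        target[:i] + target[i + 1:] == candidate
        for i in range(len(target))
    )

for word in ["references", "refrences", "refernces", "refer", "refernece"]:
    print(word, is_word_or_one_skip(word))
```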

Select capitalized & all-caps words using RegEx

I'm trying to find names of people and companies (everything that is capitalized but not in the beginning of a sentence) in a large body of text. The purpose is to find as many instances as possible so that they can be XML-tagged properly.
This is what I've come up with so far:
[^\W](\s\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+
It has two problems:
It selects two characters too many in front of the hit. In the sentence "Is this Beetle ugly?" it finds s Beetle, which complicates the subsequent tagging.
When a capitalized word is preceded by an apostrophe or a colon, it isn't found. If possible, I'd like to limit the characters used for determining a sentence boundary to just !?.
Here's the sample text I'm using to test it out:
John Adams is my hero. There's just no limits to his imagination! Is
this Beetle ugly? It sings at the: La Scala opera house. I have a
dream that I will find work at' Frame Store but not in the USA! This
way ILM could do whatever they pleased. ILM was very sweet. Visual
Effects did a good job... Neither did Animatronix?
I'm using jEdit (http://jedit.org) since I need something that works on both Windows and OS X.
Update: this now avoids matching at the start of the string.
(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+
(?<!(?:[!?\.]\s|^)) is a negative lookbehind that ensures the match is not preceded by one of !?. and a whitespace character, or by the start of a new row.
I tested it with jEdit.
Update to cover Names consisting of multiple words
(?<!(?:[!?\.]\s|^))(\b[\p{Lu}][\p{Lu}\p{Ll}]*\b(?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)*)+
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (added)
                                            ^ (changed)
I added the group (?:\s\b[\p{Lu}][\p{Lu}\p{Ll}]*\b)* to match optional following words starting with uppercase letters. I also changed the + to a * to match the A in your example, My company's called A Few Good Men. But this change now causes the regex to match I as a name.
See tchrist's comment. Names are not simple, and it gets really difficult if you want to cover the more complex cases.
This also works:
(?<!\p{P}\s)(\b[\p{Lu}][\p{Lu}|\p{Ll}]+\b)+
But \p{P} covers all punctuation, and I understood this is not what you want. Maybe you can find a property that fits your needs on regular-expressions.info/unicode.html.
Another mistake in your expression is the | in the character class. It's not needed: inside a class it is just a literal character, so the class would also match words like U|S|A. Just remove it:
(?<![!?\.]\s)(\b[\p{Lu}][\p{Lu}\p{Ll}]+\b)+
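For readers who want to try the lookbehind idea outside jEdit, here is a rough Python sketch. It is an approximation under two assumptions: Python's built-in re module has no \p{Lu} or \p{Ll}, so plain ASCII classes stand in for them, and it only accepts fixed-width lookbehinds, so the "start of row" case is written as a second lookbehind. The name NAME is hypothetical.

```python
import re

# Not preceded by "!", "?" or "." plus whitespace, and not at the start.
# [A-Z] and [A-Za-z] are ASCII stand-ins for \p{Lu} and [\p{Lu}\p{Ll}].
NAME = re.compile(r"(?<![!?.]\s)(?<!^)\b[A-Z][A-Za-z]+\b")

text = (
    "John Adams is my hero. There's just no limits to his imagination! "
    "Is this Beetle ugly? It sings at the: La Scala opera house. I have a "
    "dream that I will find work at' Frame Store but not in the USA! This "
    "way ILM could do whatever they pleased. ILM was very sweet. Visual "
    "Effects did a good job... Neither did Animatronix?"
)

print(NAME.findall(text))
```

On the sample text this finds Adams, Beetle, La, Scala, Frame, Store, USA, ILM, Effects and Animatronix. It skips John, Visual and the second ILM because they open a sentence, which shows the limits of the approach for names.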

How to Use Regex to Ensure Complete Words While Adding a Character Limit to Yahoo Pipes?

I'm pretty new to this, so excuse me if my question isn't that clear. I'm pulling an RSS Feed into Yahoo Pipes and using Regex to modify it. Here's what I'm trying to do:
Limit the number of characters in an entry, but...
Make sure the item includes complete words, and...
If the item is shortened, add an ellipsis, but...
If it falls within the limits nothing should be done to it
So, if a feed's Title is: "This article is important" and the limit is 20 characters, the result should be "This article is..." But if the Title is "Good Article," nothing should happen to it.
After doing some research I think that I want to combine an if/then statement with lookahead, i.e. go to the character limit and if there is a character following it that is a space, add an ellipsis; if it is a number or letter, go back to the final space within the limit and add an ellipsis; but if there isn't any character following it, don't do anything. Does this make sense? Is there an easier way to do what I'm going for?
I would really appreciate any help you could provide. Thanks!
Try replacing the title using the following pattern:
^(?=.{23})(.{0,20})(?=\s).*$
With the string
$1...
Working example: http://pipes.yahoo.com/pipes/pipe.info?_id=04158a7a5ea390b1b0b78ebccadcec79
How does it work?
(?=.{23}) - First, we check the length is at least 23 (that's for 20 + '...', you can play with that)
(.{0,20}) - Match at most 20 characters on the first group.
(?=\s) - Make sure there's a space after the last character. If not, it will try to match fewer characters.
.* - Match all the way to the end, so the rest of the line is removed.
An edge case here is a single word longer than 20 characters. If that's a problem, you can solve it by using:
^(?=.{23})(.{0,20}(?=\s)|\S{20}).*$
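Outside Yahoo Pipes, the same replacement can be tried with a short Python sketch. The function name shorten is made up here; the pattern is the edge-case version above, and \1 plays the role of $1 in the replacement string.

```python
import re

# At least 23 characters overall; keep up to 20 ending before whitespace,
# or fall back to the first 20 characters of one very long word.
TRUNCATE = re.compile(r"^(?=.{23})(.{0,20}(?=\s)|\S{20}).*$")

def shorten(title: str) -> str:
    # Titles the pattern does not match are returned unchanged.
    return TRUNCATE.sub(r"\1...", title)

print(shorten("This article is important"))  # This article is...
print(shorten("Good Article,"))              # Good Article,
```

Titles of 21 or 22 characters are left alone by design, since cutting them and adding three dots would not make them any shorter.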