I would like to search for all uppercase words in a file but I have no idea how to do it (or if it's possible). I found this solution here on Stack Overflow, but it doesn't work in Vim.
From command mode, assuming you do not have the option ignorecase set:
/\<[A-Z]\+\>
or
/\v<[A-Z]+>
Finds any run of one or more capital letters surrounded by word boundaries. The second form uses 'very magic' mode; see :help magic for details.
The shortest answer: /\<\u\+\>
If you want a list of all the matching uppercase words (i.e. you aren't interested in jumping from one word to the other), you can use:
echo filter(split(join(getline(1, '$'), ' '), '\v(\s|[[:punct:]])'), 'v:val =~ "\\v<\\u+>"')
With:
getline(1, '$'), which returns a list of all the lines in the current buffer
join(lines, ' '), which flattens this list of lines into a single string
split(all_text, separators_regex), which builds a list of word-like elements
and finally filter(words, uppercase-condition), which selects only the uppercase words.
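For comparison outside Vim, the same four steps (read the lines, join them, split on separators, filter) can be sketched in Python. This is only an illustration: the function name uppercase_words and the sample text are made up here, and [A-Z] covers ASCII capitals only.

```python
import re
import string

# Separators: whitespace plus ASCII punctuation, mirroring '\v(\s|[[:punct:]])'.
SEPARATORS = re.compile("[\\s" + re.escape(string.punctuation) + "]")

def uppercase_words(lines):
    """Return every all-uppercase word found in an iterable of lines."""
    all_text = " ".join(lines)            # like join(getline(1, '$'), ' ')
    words = SEPARATORS.split(all_text)    # like split(all_text, separators_regex)
    # like filter(words, uppercase-condition); empty strings never match
    return [w for w in words if re.fullmatch(r"[A-Z]+", w)]

print(uppercase_words(["The NASA and ESA teams,", "not Nasa or nasa."]))
```

Because the split removes punctuation first, the filter can use a plain full match instead of word-boundary anchors.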
So I have a list of all 5-letter words in the English language that I can interrogate when I'm really stuck at Wordle. I found this an excellent exercise for brushing up on my regular expressions in BBEdit, which is what I tell myself I'm doing.
The way Wordle works, I can have three conditions.
A letter that is somewhere in the word (and must be present)
A letter that is not present in the word
A letter that is correct in presence and position
Condition 3 is easy. If my start word "crone" has the n in the right place, my pattern is
...n.
And I can add condition 2 fairly easily with
^(?!.*[croe])...n.
If my next guess is "burns" I'll know there's an "s"
^(?!.*[croebur])^(?=.*s)...n.
And that it's not in the last position:
^(?!.*[croebur])^(?=.*s)...n[^s]
If my next (very poor) guess is 'stone' I'll know there's a 't'.
^(?!.*[croebur])^(?=.*s)^(?=.*t)sa.n.
So that's a workable formula.
But if my next guess were "wimpy" I'd know there was an 'i' in the answer, but I have to add an additional ^(?=.*i) which just feels inefficient. I tried grouping the letters that must be in the word by using a bracket set, ^(?=.*[ist]) but of course that will match targets that contain any one of those characters rather than all.
Is there a more efficient way to express the phrase "the word must contain all of the following letters to match" than a series of "start at the beginning, scan for an occurrence of this single character until the end" phrases?
If you enter a word into Wordle, it displays all the characters in your word that are matched in the right place. It also shows the characters that exist in the word but are in the wrong position.
Considering these requirements, I think you should create a separate rule for each letter position. This way, your regex pattern stays simple, and you get the search results quickly. Let me give an example:
Input word: crone
Matched Characters: ...n.
Characters in the wrong place: -
Next regex search pattern: ^[^crone][^crone][^crone]n[^crone]$
Input word: burns
Matched Characters: ...n.
Characters in the wrong place: s
Next regex search pattern: ^(?=\S*[s]\S*)[^bucrone][^bucrone][^bucrone]n[^bucrones]$ (Note the "s" in the last character class: we know it does not belong in that position.)
Input word: stone
Matched Characters: s..n.
Characters in the wrong place: t
Next regex search pattern: ^(?=\S*[t]\S*)s[^tsbucrone][^sbucrone]n[^sbucrones]$ (Note the "t" in the first negated character class: we know it does not belong in that position.)
^ => Start of the line
[^abc] => Any character except "a", "b", or "c"
(?=\S*[t]\S*) => There must be a "t" character in the given string
(?=\S*[t]\S*)(?=\S*[u]\S*) => There must be both "t" and "u" characters in the given string
$ => End of the line
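The per-position rules above can also be generated rather than typed by hand. The following is a rough Python sketch, not a definitive implementation: the function build_pattern and its three arguments are hypothetical names introduced here, letters are assumed to be plain lowercase a-z, and it uses the shorter (?=.*x) form of the lookahead.

```python
import re

def build_pattern(green, yellow, grey, length=5):
    """Build a Wordle filter regex (hypothetical helper).

    green:  {position: letter} for letters known to be in place (0-based)
    yellow: {letter: set of positions where that letter is known NOT to be}
    grey:   iterable of letters known to be absent
    """
    # One lookahead per letter that must appear somewhere in the word.
    lookaheads = "".join("(?=.*" + re.escape(ch) + ")" for ch in sorted(yellow))
    cells = []
    for pos in range(length):
        if pos in green:
            cells.append(re.escape(green[pos]))
            continue
        # Letters excluded here: absent ones plus misplaced ones tried at this position.
        banned = set(grey) | {ch for ch, spots in yellow.items() if pos in spots}
        cells.append("[^" + "".join(sorted(banned)) + "]" if banned else ".")
    return "^" + lookaheads + "".join(cells) + "$"

# State after guessing "crone" and then "burns", as in the example above.
pattern = build_pattern(green={3: "n"}, yellow={"s": {4}}, grey="croebu")
print(pattern)
print([w for w in ["saint", "signs", "giant", "stone"] if re.search(pattern, w)])
```

Each additional "must contain" letter still costs one lookahead, but the pattern no longer has to be edited by hand after every guess.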
In a performance test of the two regex patterns on a seven-word sample, my pattern found the result in 130 steps, whereas yours took 175 steps. The difference will increase as the word list grows. You can review it at the following links:
Suggested pattern: https://regex101.com/r/mvHL3J/1
Your pattern: https://regex101.com/r/Nn8EwL/1
Note: You need to click the "Regex Debugger" link in the left sidebar to see the steps.
Note 2: I updated my response to fix the bug in the following comment.
I want to find words in a document using only the letters in a given pattern, but those letters can appear at most once.
Suppose document.txt consists of "abcd abbcd"
What pattern (and what concepts are involved in writing such a pattern) will return "abcd" and not "abbcd"?
You could check if a character appears more than once and then negate the result (in your source code):
split your document into words
check each word with ([a-z])[a-z]*\1 (that matches abbcd, but not abcd)
negate the result
Explanation:
([a-z]) matches (and captures) any single lowercase letter
[a-z]* allows zero or more letters after the one matched above
\1 is a back reference to the character found at ([a-z])
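A minimal Python sketch of the three steps (split, check with the back reference, negate). The function name words_without_repeats is made up for this illustration, and note that it only rejects repeated letters; it does not restrict words to the letters of a given pattern.

```python
import re

# Matches when some lowercase letter occurs at least twice in the word.
REPEATED = re.compile(r"([a-z])[a-z]*\1")

def words_without_repeats(text):
    """Split text into words and keep those in which no letter repeats."""
    return [w for w in text.split() if not REPEATED.search(w)]

print(words_without_repeats("abcd abbcd"))  # prints ['abcd']
```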
There were already some good ideas here, but I wanted to offer an example implementation in python. This isn't necessarily optimal, but it should work. Usage would be:
$ python find.py -p abcd < file.txt
And the implementation of find.py is:
import argparse
import sys
from itertools import cycle
parser = argparse.ArgumentParser()
parser.add_argument('-p', required=True)
args = parser.parse_args()
for line in sys.stdin:
    for candidate in line.split():
        present = dict(zip(args.p, cycle((0,))))  # initialize dict of letter:count
        for ch in candidate:
            if ch in present:
                present[ch] += 1
        if all(x <= 1 for x in present.values()):
            print(candidate)
This handles your requirement of matching each character in the pattern at most once, i.e. it allows for zero matches. If you wanted to match each character exactly once, you'd change the second-to-last line to:
if all(x == 1 for x in present.values()):
Melpomene is right, regexps are not the best instrument to solve this task. Regexp is essentially a finite state machine. In your case current state can be defined as the combination of presence flags for each of the letters from your alphabet. Thus the total number of internal states in regex will be 2^N where N is the number of allowed letters.
The easiest way to define such a regex is to list all possible permutations of the available letters (and use ? so that shorter sequences do not have to be listed separately). For three letters (a, b, c) the regex looks like:
a?b?c?|a?c?b?|b?a?c?|b?c?a?|c?a?b?|c?b?a?
For the four letters (a,b,c,d) it becomes much longer:
a?b?c?d?|a?b?d?c?|a?c?b?d?|a?c?d?b?|a?d?b?c?|a?d?c?b?|b?a?c?d?|b?a?d?c?|b?c?a?d?|b?c?d?a?|b?d?a?c?|b?d?c?a?|c?a?b?d?|c?a?d?b?|c?b?a?d?|c?b?d?a?|c?d?a?b?|c?d?b?a?|d?a?b?c?|d?a?c?b?|d?b?a?c?|d?b?c?a?|d?c?a?b?|d?c?b?a?
As you can see, not that convenient.
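To see how quickly this grows, the alternation can be generated instead of written out. A small Python sketch, assuming the input consists of distinct plain letters (the helper name at_most_once_regex is invented here); the number of alternatives is N! for N letters.

```python
import re
from itertools import permutations

def at_most_once_regex(letters):
    """Alternation of every ordering of `letters`, each letter made optional."""
    return "|".join("".join(ch + "?" for ch in perm) for perm in permutations(letters))

print(at_most_once_regex("abc"))

# fullmatch anchors the alternation to the whole word.
matcher = re.compile(at_most_once_regex("abcd"))
print([w for w in "abcd abbcd dca".split() if matcher.fullmatch(w)])
```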
The solution without regexps depends on your toolset. I would write a simple program that processes the input text word by word. At the start of each word a BitSet is created, in which each bit represents the presence of the corresponding letter of the desired alphabet. While traversing the word, if the bit that corresponds to the current letter is zero, it is set to one. If an already marked bit is encountered, or the letter is not in the alphabet, the word is skipped. If the word is traversed completely, it is "valid".
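The bit-set idea described above might look like this in Python, using an integer as the bit set. This is a sketch under the assumption that the alphabet contains distinct characters; is_valid is a name chosen for the example.

```python
def is_valid(word, alphabet):
    """True if word uses only letters of `alphabet`, each at most once."""
    bit_of = {ch: 1 << i for i, ch in enumerate(alphabet)}
    seen = 0  # integer used as a bit set: one bit per letter of the alphabet
    for ch in word:
        bit = bit_of.get(ch)
        if bit is None or seen & bit:
            return False  # letter outside the alphabet, or bit already marked
        seen |= bit       # mark the letter as present
    return True

print([w for w in "abcd abbcd".split() if is_valid(w, "abcd")])  # prints ['abcd']
```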
grep -Pwo '[abc]+' | grep -Pv '([abc]).*\1' | awk 'length==3'
where:
first grep: a word composed by the pattern letters...
second grep: ... with no repeated letters ...
awk: ...whose length is the number of letters
I'm after something that I guess is pretty complex, and I am pretty bad with regexes, so you guys might be able to help.
See this data source:
User ID:
a123456
a12345f
a1234e6
d123d56
b12c456
c1b3456
ba23456
Basically, what I want to do is use a regex/sed to replace all occurrences of letters with numbers EXCEPT the first letter. Letters will always match their alphabet position, e.g. a = 1, b = 2, c = 3 etc.
So the result set should look like this:
User ID:
a123456
a123456
a123456
d123456
b123456
c123456
b123456
There will also never be any letters other than a-j, and the string will always be 7 chars long.
Can anyone shed some light? Thanks! :)
Here's one way you could do it using standard tools cut, paste and tr:
$ paste -d'\0' <(cut -c1 file) <(cut -c2- file | tr 'abcdef' '123456')
a123456
a123456
a123456
d123456
b123456
c123456
b123456
This joins the first character of each line with the result of tr on the rest of the line, using the empty string as the delimiter. tr replaces each character found in the first list with the corresponding character of the second list.
To replace a-j letters in a line by the corresponding digits except the first letter using perl:
$ perl -pe 'substr($_, 1) =~ tr/a-j/0-9/' input_file
a=0, not a=1 because j would be 10 (two digits) otherwise.
j = 0, and no, only numbers 0-9 are used, and letters simply replace their number counterpart, so there will never be a letter greater than j.
To make j=0 and a=1:
$ perl -pe 'substr($_, 1) =~ tr/ja-i/0-9/' input_file
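The same transliteration can be sketched in Python with str.translate. The names normalize, LETTER_TO_DIGIT and USER_ID are invented for this example, and the guard that leaves non-matching lines alone is an assumption based on the question's statement that IDs are always 7 characters drawn from a-j and digits.

```python
import re

# j maps to 0 and a..i map to 1..9, as clarified in the comments above.
LETTER_TO_DIGIT = str.maketrans("abcdefghij", "1234567890")
USER_ID = re.compile(r"[a-j][0-9a-j]{6}")

def normalize(line):
    """Translate letters to digits everywhere except the first character."""
    if not USER_ID.fullmatch(line):
        return line  # e.g. the "User ID:" header is left untouched
    return line[0] + line[1:].translate(LETTER_TO_DIGIT)

sample = ["User ID:", "a12345f", "d123d56", "ba23456"]
print([normalize(s) for s in sample])
```

Lines are assumed to have had their trailing newline stripped before being passed in.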
sed '/[a-j][0-9a-j]\{6\}$/{h;y/abcdefghij/1234567890/;G;s/.\(.\{6\}\).\(.\).*/\2\1/;}' YourFile
filter on lines that hold a "number" only
copy the line to the hold space (to keep the 1st letter)
change every letter to a digit (including the 1st)
append the original form of the number (as a second line in the buffer)
take the 1st letter of the second line and the last 6 characters of the first line, reorder them, and discard the other characters
$ awk 'BEGIN{FS=OFS=""} NR>1{for (i=2;i<=NF;i++) if(p=index("jabcdefghi",$i)) $i=p-1} 1' file
User ID:
a123456
a123456
a123456
d123456
b123456
c123456
b123456
Note that the above reproduces the header line User ID: as-is. So far, best I can tell, all of the other posted solutions would change the header line to Us5r ID: since they would do the letter-to-number translation on it just like on all of the subsequent lines.
I don't see the complexity. Your samples look like you just want to replace six of seven characters with the numbers 1-6:
s/^\([a-j0-9]\)[a-j0-9]\{6\}/\1123456/
Since the numbers to put there are defined by position, we don't care what the letter was (or even if it was a letter). The downside here is that we don't preserve the numbers, but they never varied in your sample data.
If we want to replace only letters, the first method I can think of involves simply using multiple substitutions:
s/^\([a-j0-9]\{1\}\)[a-j]/\11/
s/^\([a-j0-9]\{2\}\)[a-j]/\12/
s/^\([a-j0-9]\{3\}\)[a-j]/\13/
s/^\([a-j0-9]\{4\}\)[a-j]/\14/
s/^\([a-j0-9]\{5\}\)[a-j]/\15/
s/^\([a-j0-9]\{6\}\)[a-j]/\16/
Replacing letters with specific digits, excluding the first letter:
s/\(.\)a/\11/g
This pattern replaces two-character sequences, preserving the first character, so it would have to be run twice for each letter. Using the hold space we could store the first character and use a simple transliteration. The tricky part is joining the two sections, whereupon sed injects an unwanted newline.
# Store in hold space
h
# Remove the first character
s/^.//
# Transliterate letters
y/jabcdefghi/0123456789/
# Exchange pattern and hold space
x
# Keep the first character
s/^\(.\).*$/\1/
# Print it
#P
# Join
G
# Remove the newline
s/^\(.\)./\1/
Still learning about sed's capabilities :)
I want to split a text into its single words using regular expressions. The obvious solution would be to use the regex \\b; unfortunately, this one also splits words on the hyphen.
So I am looking for an expression that does exactly the same as \\b but does not split on hyphens.
Thanks for your help.
Example:
String s = "This is my text! It uses some odd words like user-generated and need therefore a special regex.";
String [] b = s.split("\\b+");
for (int i = 0; i < b.length; i++) {
    System.out.println(b[i]);
}
Output:
This
is
my
text
!
It
uses
some
odd
words
like
user
-
generated
and
need
therefore
a
special
regex
.
Expected output:
...
like
user-generated
and
....
@Matmarbon's solution is already quite close, but not a 100% fit; it gives me
...
like
user-
generated
and
....
This should do the trick, even if lookaheads are not available:
[^\w\-]+
For anyone else who needs this for another purpose (e.g. inserting something), this is closer to an equivalent of the \b solutions:
([^\w\-]|$|^)+
because:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
--- http://www.regular-expressions.info/wordboundaries.html
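As a quick check of the [^\w\-]+ approach, here is a small Python sketch (the question uses Java, where the pattern would be written "[^\\w\\-]+"). The function name split_words is made up; unlike a \b split, this discards punctuation and whitespace instead of returning them as tokens.

```python
import re

def split_words(text):
    """Split on runs of characters that are neither word characters nor hyphens."""
    # A separator at the very end (such as the final ".") leaves an empty string; drop it.
    return [w for w in re.split(r"[^\w\-]+", text) if w]

s = ("This is my text! It uses some odd words like user-generated "
     "and need therefore a special regex.")
print(split_words(s))
```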
You can use this:
(?<!-)\\b(?!-)
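A Python sketch of the same lookaround idea, for illustration only. It assumes Python 3.7 or later, where re.split also splits on empty matches; the helper name split_on_boundaries and the filtering of whitespace-only tokens are choices made for this example.

```python
import re

def split_on_boundaries(text):
    """Split at word boundaries that do not touch a hyphen; drop blank tokens."""
    # (?<!-) rejects a boundary right after a hyphen, (?!-) one right before it.
    tokens = re.split(r"(?<!-)\b(?!-)", text)
    return [t for t in tokens if t.strip()]

print(split_on_boundaries("words like user-generated and need"))
```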
I am new to Regular Expressions.
What is the expression that would find a long string of words that begin with a 3-digit number and place spaces at the beginning of capitalized words:
REPLACE:
013TheBlueCowJumpedOverTheFence1984.jpg
WITH:
013 The Blue Cow Jumped Over The Fence 1984
Note: removes the .jpg at the end
This will save me ooooodles of time.
I would not use regular expressions for this task. It's going to be ugly and hard to maintain. A better approach would be to loop through the string and rebuild the string as you go based on your input.
string retVal = "";
foreach (char s in myInput) {
    if (IsCapital(s)) {
        // start a new word before each capital letter
        retVal += " " + s;
    }
    // insert the rest of your conditions
}
Try this regular expression: \d+|[A-Z][a-z]*
It will collect all the matches, which you must then join with spaces.
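A minimal Python sketch of this collect-and-join approach (the function name space_out is made up for the example). The .jpg extension disappears simply because none of its characters match either alternative.

```python
import re

def space_out(filename):
    """Collect digit runs and capitalized words, then join them with spaces."""
    return " ".join(re.findall(r"\d+|[A-Z][a-z]*", filename))

print(space_out("013TheBlueCowJumpedOverTheFence1984.jpg"))
# prints: 013 The Blue Cow Jumped Over The Fence 1984
```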
This will need two operations since the replacement is different for each.
The first:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z])))/
Replace with: ' $1' (note the space)
Will put spaces between the words. The second:
/\s*(.*)\s*\..*$/
Replace with: '$1'
Will remove the leading space and the extension.
The first expression can be broken into parts: (?<![\d])\d finds a digit not preceded by another digit, and ((?<![A-Z])[A-Z](?![A-Z])) finds an uppercase letter neither preceded nor followed by an uppercase letter.
You'll likely have more rules that you will want to incorporate into this, such as how are you dealing with the string: 'BackInTheUSSR.jpg'?
Edit: This should handle that example:
/(((?<![\d])\d)|((?<![A-Z])[A-Z](?![A-Z]))|((?<![A-Z])[A-Z]+(?![a-z])))/
match:
'[A-Z][a-z]*'
replace with
' \0'
Note that this doesn't put a space before 1984, and it doesn't remove .jpg.
You can do the former by matching on
'[0-9]+|[A-Z][a-z]*'
instead. And the latter by removing it in a separate instruction, for example with a regexp replacement of '\.jpg$' with ''
Note that \'s need to be written as \\ in many languages.
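For illustration, the match-and-replace steps above could look like this in Python. Two assumptions are labeled in the comments: Python spells the whole-match reference \g<0> rather than \0, and the leading space produced before "013" is stripped because the desired output has none. The function name space_out_by_replacing is invented here.

```python
import re

def space_out_by_replacing(filename):
    # Put a space before every digit run or capitalized word.
    # In Python the whole match is written \g<0>; "\0" would be a NUL character.
    spaced = re.sub(r"[0-9]+|[A-Z][a-z]*", r" \g<0>", filename)
    # Remove the extension in a separate step.
    without_ext = re.sub(r"\.jpg$", "", spaced)
    # Assumption: the space inserted before the leading "013" is unwanted.
    return without_ext.lstrip()

print(space_out_by_replacing("013TheBlueCowJumpedOverTheFence1984.jpg"))
```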