I have a long string variable (many sentences separated by ".") with some important numerical information, generally with a decimal point (e.g., "6.5 lbs").
I would like to use a regex to remove all periods that appear at the end of a sentence, but leave them when they appear between numbers.
FROM:
First sentence. Second sentence contains a number 1.0 and more words. One more sentence.
TO:
First sentence Second sentence contains a number 1.0 and more words One more sentence
I am doing this in Stata, using Unicode regex functions which follow this standard: http://userguide.icu-project.org/strings/regexp
What I thought I was doing in the following is: `replace the period w/ a space when the previous character is a lowercase letter'.
gen new_variable = ustrregexrf(note_text, "(?<=[a-z])\.", " ")
I find that it will remove one period per line, but will not remove all of them. Maybe what I need to do is tell it: do this for all the periods you find satisfying the condition, but since it's not working the way I think it is already maybe I need an explanation of what it actually is doing.
Bonus points if you can tell me how to remove a period when there is a number followed by a space:
number is 1.0. Next sentence -> number is 1.0 Next sentence
EDIT: there are occasionally strings like end sentence.begin next sentence without spacing so separating on white space won't handle all of my cases.
Method 1
Maybe,
\.(?=\s|$)
might be OK to look into.
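As a rough check of Method 1, here is a minimal Python sketch using the standard `re` module (the sample strings come from the question; the variable names are only illustrative). `re.sub` replaces every match, which is the behaviour the question is after. In Stata, `ustrregexrf` replaces only the first match, while `ustrregexra` replaces all of them, which would explain why only one period per line was removed.

```python
import re

# Remove a period only when whitespace or the end of the string follows it.
pattern = re.compile(r"\.(?=\s|$)")

text = ("First sentence. Second sentence contains a number 1.0 "
        "and more words. One more sentence.")
result = pattern.sub("", text)
print(result)

# The "bonus" case: the decimal point stays, the sentence-final period goes.
bonus = pattern.sub("", "number is 1.0. Next sentence")
print(bonus)
```

Note that this pattern leaves the period in strings like `end sentence.begin next sentence`, because no whitespace follows it.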
Method 2
\d+\.\d+(*SKIP)(*FAIL)|\.
is another option to look at. It uses the (*SKIP)(*FAIL) verbs, which Python's built-in re module does not support, so it needs the third-party regex module:
$ pip3 install regex
Test
import regex as re
string = '''
First sentence. Second sentence contains a number 1.0 and more words. One more sentence.First sentence. Second sentence contains a number 1.0 and more words. One more sentence.
'''
expression = r'\d+\.\d+(*SKIP)(*FAIL)|\.'
print(re.sub(expression, '', string))
Output
First sentence Second sentence contains a number 1.0 and more words One more sentenceFirst sentence Second sentence contains a number 1.0 and more words One more sentence
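If installing the third-party module is not an option, a similar effect can probably be had with lookarounds in the standard `re` module. This is a sketch under the assumption that a period should survive only when it sits directly between two digits; the function name `strip_sentence_periods` is hypothetical.

```python
import re

# A period is removed unless it has a digit on BOTH sides:
# either no digit precedes it, or no digit follows it.
NON_DECIMAL_PERIOD = re.compile(r"(?<!\d)\.|\.(?!\d)")

def strip_sentence_periods(text, replacement=""):
    """Replace every period that is not a decimal point between two digits."""
    return NON_DECIMAL_PERIOD.sub(replacement, text)

print(strip_sentence_periods("number is 1.0. Next sentence"))
print(strip_sentence_periods("end sentence.begin next sentence", " "))
```

Unlike Method 1, this also handles `end sentence.begin next sentence`. Passing `" "` as the replacement keeps the two words apart in that case.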
Related
I have sentences within text I need to extract. They are formatted as such:
Less word word....etc Number% (<-- eg. 99%)
OR
More word word etc Number%
I would like to create a regular expression that will capture everything between the word Less or the word More and the ending percentage sign.
The challenge I'm having is that I cannot use the ^ or the $ characters as these sentences don't start with a new line.
Is there a way to signify that I'd like to capture each instance of SENTENCES (not lines) beginning with:
(less | more)
and ending with:
%
To clarify - I want to include Less and More and the percent symbol in what I'm capturing.
Here is what I have so far:
(?=Less|More)(.*)((\%|\s\bpercent\b))
The above code captures the first instance of the word 'Less' and everything after it. I would like every sentence to be captured separately.
For example, the two sentences below should be captured separately:
Less than a dollar 99,5%
More than a dollar and less than a cent 95%
EDIT FOR CLARIFICATION:
What I'm after is a solution that won't capture the ENTIRE text below.
More than 55 percent RANDOM STRING. RANDOM STRING. Less than a dollar
99,5% More than a dollar and less than a cent 95%
My aim is to capture three sentences from the above string, preferably all in one group:
More than 55 Percent
Less than a dollar 99,5%
More than a dollar and less than a cent 95%
If you want the capture included, then you simply include it instead of using a lookaround.
Please note that, according to your specification, the . at the end of a sentence is not included. Perhaps you want to allow for it at the end, like this: percent)\.?
We are also using the non-greedy quantifier .*? so that two sentences on the same line are captured separately; otherwise the match would run from the first More|Less to the last %|percent on the same line.
let regex = /(Less|More).*?(\%|\spercent)/g
let string = `The above code captures the first instance of the word 'Less' and everything after it. I would like every sentence to be captured separately.
More than 55 percent.
For example, the two sentences below should be captured separately:
Less than a dollar 99,5% but More than the least possible percent.
More than a dollar and less than a cent 95%`
let capture = string.match(regex);
console.log("Content of capture as array:");
console.log(capture);
capture.forEach((capturedString, index) => {
  console.log("Sentence number " + (index + 1) + ": " + capturedString);
});
let everythingAsOneSentence = capture.join(" ");
console.log(everythingAsOneSentence);
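For readers working in Python, roughly the same idea can be sketched with `re.findall`. The groups are made non-capturing here (a deviation from the pattern above) so that `findall` returns whole matches rather than tuples of groups. The sample text is the one from the question, joined onto one line.

```python
import re

# Non-capturing groups so that findall returns whole matches, not group tuples.
SENTENCE = re.compile(r"(?:Less|More).*?(?:%|\spercent)")

text = ("More than 55 percent RANDOM STRING. RANDOM STRING. "
        "Less than a dollar 99,5% More than a dollar and less than a cent 95%")

sentences = SENTENCE.findall(text)
for number, sentence in enumerate(sentences, start=1):
    print(f"Sentence number {number}: {sentence}")
```

Because `.` does not match a newline by default, a sentence that wraps across lines would need the `re.DOTALL` flag.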
I have a list of different items. Some of them have 8-10 digits in front of the name, some others have these 8-10 digits behind the name and some others again don't have these numbers in the name.
I have two expressions that I use to remove these digits, but I can not manage to combine them with | (or). They work each for themselves, but if I use the first expression first, then the second expression, I don't get the result I want to have.
I use these two expressions for now:
(?<=[\d]{8,10}) (.*)
.*?(?=[\d]{8,10})
But if I use them both (first one and then the other), then some of the lines become totally empty.
How can I combine these two to do what I want, or, if it's better, write a new expression that does what I want to do :)
List is like this:
12345678 Book
12345678 Book
Book 12345678
Book 12345678
Cabinet 120x30x145
Want this result:
Book
Book
Book
Book
Cabinet 120x30x145
Why not just use the following.
Check whether there are 8 to 10 digits at the beginning of the string or at the end of it, and remove them.
(^\d{8,10}\s*|\s*\d{8,10}$)
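It gives the wanted behaviour (with the multiline flag set if the whole list is processed as one string, so that ^ and $ match at each line).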
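A small Python sketch of this substitution, assuming the whole list is held in one multi-line string (hence `re.MULTILINE`, so that `^` and `$` apply to every line):

```python
import re

items = """12345678 Book
12345678 Book
Book 12345678
Book 12345678
Cabinet 120x30x145"""

# Strip an 8-10 digit number (plus adjacent whitespace) from the start or end of each line.
pattern = re.compile(r"^\d{8,10}\s*|\s*\d{8,10}$", re.MULTILINE)
cleaned = pattern.sub("", items)
print(cleaned)
```

The shorter digit runs in `120x30x145` are left alone because none of them reaches 8 digits.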
Instead of only matching everything but a number containing
8-10 digits + adjacent spaces, use a regex to substitute
such a number (also + adjacent spaces) with an empty string.
To match, use the following regex:
" *\d{8,10} *" (without the quotes; the pattern starts with a space)
That is:
" *" (a space followed by an asterisk) - a sequence of spaces (may be empty),
\d{8,10} - a sequence of 8 to 10 digits,
" *" - another sequence of spaces (may be empty).
The replacement string is (as I said) empty. Of course, you should use
g (global) option.
Note that you can not use \s instead of the space, as \s matches also
CR and LF and we don't want this.
For a working example see https://regex101.com/r/1hsGzT/1
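To illustrate the point about `\s`, here is a short Python comparison (the shortened sample list and the variable names are only for illustration). `re.sub` replaces all matches, so no separate global option is needed.

```python
import re

items = "12345678 Book\nBook 12345678\nCabinet 120x30x145"

# Literal spaces only: the line breaks are preserved.
spaces_only = re.sub(r" *\d{8,10} *", "", items)

# \s also matches the newline after a trailing number, so two lines get glued together.
with_backslash_s = re.sub(r"\s*\d{8,10}\s*", "", items)

print(repr(spaces_only))
print(repr(with_backslash_s))
```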
You need to use the \b word-boundary metasequence:
/\b[0-9\s]{8,10}\b/g;
var str = `12345678 Book
12345678 Book
Book 12345678
Book 12345678
Cabinet 120x30x145`;
var rgx = /\b[0-9\s]{8,10}\b/g;
var res = str.replace(rgx, `\n`)
console.log(res);
I need a RegEx pattern that will return the first N words using a custom word boundary that is the normal RegEx white space (\s) plus punctuation like .,;:!?-*_
EDIT #1: Thanks for all your comments.
To be clear:
I'd like to set the characters that would be the word delimiters
Let's call this the "Delimiter Set", or strDelimiters
strDelimiters = ".,;:!?-*_"
nNumWordsToFind = 5
A word is defined as any contiguous text that does NOT contain any character in strDelimiters
The RegEx word boundary is any contiguous text that contains one or more of the characters in strDelimiters
I'd like to build the RegEx pattern to get/return the first nNumWordsToFind using the strDelimiters.
EDIT #2: Sat, Aug 8, 2015 at 12:49 AM US CT
#maraca definitely answered my question as originally stated.
But what I actually need is to return the number of words ≤ nNumWordsToFind.
So if the source text has only 3 words, but my RegEx asks for 4 words, I need it to return the 3 words. The answer provided by maraca fails if nNumWordsToFind > number of actual words in the source text.
For example:
one,two;three-four_five.six:seven eight nine! ten
It would see this as 10 words.
If I want the first 5 words, it would return:
one,two;three-four_five.
I have this pattern using the normal \s whitespace, which works, but NOT exactly what I need:
([\w]+\s+){<NumWordsOut>}
where <NumWordsOut> is the number of words to return.
I have also found this word boundary pattern, but I don't know how to use it:
a "real word boundary" that detects the edge between an ASCII letter
and a non-letter.
(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])
However, I would want my words to allow numbers as well.
IAC, I have not been able to work out how to use the above custom word boundary pattern to return the first N words of my text.
BTW, I will be using this in a Keyboard Maestro macro.
Can anyone help?
TIA.
All you have to do is adapt your pattern ([\w]+\s+){<NumWordsOut>} as follows, handling some special cases:
^[\s.,;:!?*_-]*([^\s.,;:!?*_-]+([\s.,;:!?*_-]+|$)){<NumWordsOut>}
1. ^[\s.,;:!?*_-]* : match any number of delimiters before the first word
2. [^\s.,;:!?*_-]+ : match a word (= at least one non-delimiter)
3. [\s.,;:!?*_-]+ : the word has to be followed by at least one delimiter
4. |$ : or it can be at the end of the string (in case no delimiter follows at the end)
5. {<NumWordsOut>} : repeat 2. to 4. <NumWordsOut> times
Note how I changed the position of the -: inside a character class it has to be at the start or end, otherwise it needs to be escaped: \-.
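The pattern can also be built from the delimiter set programmatically. Below is a Python sketch, not the Keyboard Maestro setup from the question. The function name `first_n_words` is hypothetical, and the `{1,n}` quantifier is my assumption for the "at most N words" requirement in EDIT #2 (the pattern above uses an exact count).

```python
import re

def first_n_words(text, n, delimiters=".,;:!?-*_"):
    """Return the leading part of text containing at most n words."""
    # Character-class body: whitespace plus the escaped custom delimiters.
    cls = r"\s" + re.escape(delimiters)
    # Leading delimiters, then 1..n repetitions of (word + delimiters-or-end).
    pattern = rf"^[{cls}]*(?:[^{cls}]+(?:[{cls}]+|$)){{1,{n}}}"
    match = re.match(pattern, text)
    return match.group(0) if match else ""

sample = "one,two;three-four_five.six:seven eight nine! ten"
print(first_n_words(sample, 5))
print(first_n_words("one two three", 4))
```

Using `re.escape` avoids having to place the `-` by hand, since it is escaped automatically.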
Thanks to #maraca for providing the complete answer to my question.
I just wanted to post the Keyboard Maestro macro that I have built using #maraca's RegEx pattern for anyone interested in the complete solution.
See KM Forum Macro: Get a Max of N Words in String Using RegEx
I have thousands of article descriptions containing numbers.
they look like:
ca.2760h3x1000.5DIN345x1500e34
the resulting numbers should be:
2760
1000.5
1500
h3 (or 3) shall not be a result of the parsing, since h3 is only a tolerance.
The same goes for e34.
DIN345 is a norm and needs to be excluded (as does every number preceded by DIN or BN).
My current REGEX is:
[^hHeE]([-+]?([0-9]+\.[0-9]+|[0-9]+))
This solves everything BUT the norm. How can I get "DIN" and "BN" treated the same way as a single character?
Thanx, TomE
Try using this regular expression:
(?<=x)[+-]?0*[0-9]+(?:\.[0-9]+)?|[+-]?0*[0-9]+(?:\.[0-9]+)?(?=h|e)
It looks like every number you want to match in your test case, except the first one, is preceded by x. This is what the first part of the regex matches: (?<=x)[+-]?0*[0-9]+(?:\.[0-9]+)?
The second part of the regex matches a number that is followed by h or e: [+-]?0*[0-9]+(?:\.[0-9]+)?(?=h|e)
The sub-pattern [+-]?0*[0-9]+(?:\.[0-9]+)?, which appears in both parts, matches the number itself.
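Here is a quick Python check of this expression against the sample description from the question. It is a sketch only: the pattern relies on the surrounding x, h and e characters rather than on DIN or BN themselves, so descriptions laid out differently may need a different approach.

```python
import re

NUMBER = r"[+-]?0*[0-9]+(?:\.[0-9]+)?"
# A number preceded by "x", or a number followed by "h" or "e".
pattern = re.compile(rf"(?<=x){NUMBER}|{NUMBER}(?=h|e)")

description = "ca.2760h3x1000.5DIN345x1500e34"
numbers = pattern.findall(description)
print(numbers)
```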
If we can assume that the numbers are always going to be four digits long, you can use the regex:
(\d{4}\.\d+|\d{4})
Depending on the language you might need to replace \d with [0-9].
I have a column in an Excel spreadsheet that contains the following:
### - 3-digit number
#### - 4-digit number
A### - character with 3-digits
#A## - digit followed by character then 2 more digits
There may also be superfluous characters to the right of these strings.
I would like to sort the entire spreadsheet by this column in the following order (ascending or descending):
the first three types of strings alphabetically as expected (NOT ASCII-Betically!)
Then the #A## by the character first, then by the first digit.
Example:
000...999, 0000...9999, A000...Z999, 0A00...9A99, 0B00...9B99...9Z99
I feel there is a very simple solution using a regular expression or macro, but my VBA and RegExp are pretty rusty (a friend asked me for this, but I'm more of a C guy these days). I have read some solutions which involve splitting the data into additional columns, which I would be fine with.
I would settle for a link to a good guide. Eternal thanks in advance.
If you want to sort by the second character regardless of what comes before and after it, then the regex ^.(.) captures the second character in its first group...
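Outside of Excel, the same idea (pull out the parts that matter with a regex, then sort on them) could be sketched in Python as below. The function `sort_key` and the group numbering are my own assumptions about the ordering described in the question, not something the question or answer confirms. In a spreadsheet, the equivalent would be a helper column holding such a key.

```python
import re

def sort_key(code):
    """Sort key: ###, then ####, then A###, then #A## (by letter, then leading digit)."""
    m = re.match(r"(\d)([A-Za-z])(\d\d)", code)
    if m:  # #A## form: letter first, then the first digit, then the rest
        return (3, m.group(2).upper(), m.group(1), m.group(3))
    if re.match(r"\d{4}", code):  # 4-digit number
        return (1, code[:4])
    if re.match(r"\d{3}", code):  # 3-digit number
        return (0, code[:3])
    if re.match(r"[A-Za-z]\d{3}", code):  # letter followed by 3 digits
        return (2, code[:4].upper())
    return (4, code)  # anything else goes last

codes = ["9Z99", "0B00", "A000", "0000", "999", "1A23", "000", "Z999", "0A00 extra"]
ordered = sorted(codes, key=sort_key)
print(ordered)
```

Trailing characters to the right of a code are ignored by the key, since every pattern is matched only at the start of the string.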