I have some records as such, in a file:
A 20 year old programmer
A 52 year old politician
A 12 year old practitioner
Many many more of these...
I want to match only lines that contain a number less than 20. I have tried:
^[0-20]{2}$
But it works for only numbers 0-2. How should I construct a regular expression to match numbers < 20? For instance, it should match:
A 12 year old practitioner
But not
A 20 year old programmer
A 52 year old politician
You may use
\b1?[0-9]\b
See the regex demo
Details
\b - a word boundary
1?[0-9] - an optional 1 followed by any digit
\b - a word boundary
Word boundary variations
To match anywhere in a string, even if glued to a word:
(?<!\d)1?[0-9](?!\d)
To only match in between whitespaces:
(?<!\S)1?[0-9](?!\S)
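To see how the three boundary styles differ, here is a minimal Python sketch. Python, the helper name first_match and the sample strings are my own assumptions, not part of the question.

```python
import re

# The three variants discussed above, all built around the same core 1?[0-9].
WORD = re.compile(r"\b1?[0-9]\b")            # word boundaries
DIGIT = re.compile(r"(?<!\d)1?[0-9](?!\d)")  # no digit on either side
SPACE = re.compile(r"(?<!\S)1?[0-9](?!\S)")  # whitespace or string edge on either side


def first_match(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group() if m else None


for sample in ["age 12", "age12", "(12)", "age 20"]:
    print(sample, first_match(WORD, sample),
          first_match(DIGIT, sample), first_match(SPACE, sample))
```

The word-boundary version skips "age12" because a letter followed by a digit has no boundary between them, while the whitespace version skips "(12)" because of the parentheses.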
Using regex to match digit ranges is usually a bit clumsy, but here, you can do it pretty simply with:
\b1?\d\b
https://regex101.com/r/YCWmNo/2
In plain language: an optional one, followed by a digit. So, any standalone digit is allowed, but a two-digit number needs its first digit to be a 1.
If you want to permit leading zeros, change to \b[01]?\d\b.
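As a quick check against the sample records from the question, something like this should work in Python (the language choice and the variable names are assumptions on my part):

```python
import re

# An optional 1 followed by one digit, standing alone as a word.
UNDER_20 = re.compile(r"\b1?\d\b")

records = [
    "A 20 year old programmer",
    "A 52 year old politician",
    "A 12 year old practitioner",
]

# Keep only the lines that contain a standalone number below 20.
matching = [line for line in records if UNDER_20.search(line)]
print(matching)
```

Only the "12 year old" line should be kept.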
I have 10000 descriptions and I want to use regular expressions to extract the number associated with the word "arrested".
For example:
"police arrests 4 people"
"7 people were arrested".
The numbers range from 1-99.
I have tried the following code:
gen arrest= regexm(description, "(^[1-9][0-9]$)[ ]*(arrests|arrested)")
I cannot simply extract just the number, because the descriptions also mention numbers that have nothing to do with arrests.
You can use this regex:
(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))
It divides the search into 2 by alternation, whether the number is before or after 'arrests|arrested'.
Each alternative is a non-capturing group. In the first one, a capturing group matches the number (an optional digit from 1-9 followed by a digit from 0-9). This is followed by 0 - 20 letters or spaces (the other words) before it matches 'arrests' or 'arrested'. The second alternative covers the opposite situation (where the number comes last).
This will match, if the number is within 20 chars from 'arrests|arrested'.
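If you want to try the pattern outside Stata, here is a rough Python sketch. The function name arrest_count is hypothetical, and the test sentences are made up or borrowed from this thread.

```python
import re

# Same pattern as above: number before the verb, or verb before the number.
ARREST = re.compile(
    r"(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))"
    r"|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))"
)


def arrest_count(description):
    """Return the number next to 'arrests'/'arrested', or None if there is none."""
    m = ARREST.search(description)
    if m is None:
        return None
    # Only one of the two capture groups takes part in any given match.
    return int(m.group(1) if m.group(1) is not None else m.group(2))


print(arrest_count("police arrests 4 people"))
print(arrest_count("7 people were arrested"))
```

Because only one alternative can match, exactly one of group 1 and group 2 is set, so the code has to pick whichever is not None.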
Perhaps something like this?
(\d+)[^,.\d\n]+?(?=arrest|custody)|(?<=arrest|custody)[^,.\d\n]+?(\d+)
Regex101
Keep in mind, this will not match textual versions of the number (i.e., five people were arrested) - so you would have to incorporate that if desired.
Breaking down the pattern
(\d+)[^,.\d\n]+?(?=arrest|custody) First option if # comes before watched terms
(\d+) the number to capture, with + one or more digits
[^,.\d\n]+? matches anything except a comma ,, period ., digit \d, or new line \n. These prevent false positives from different sentences (the number and the word must be in the same sentence) - +? one or more times (lazy)
(?=arrest|custody) positive look ahead checking for either word:
(?<=arrest|custody)[^,.\d\n]+?(\d+) Second option if # comes after watched terms
(?<=arrest|custody) positive lookbehind checking that the word comes before #
[^,.\d\n]+? matches anything except a comma ,, period ., digit \d, or new line \n. These prevent false positives from different sentences (the number and the word must be in the same sentence) - +? one or more times (lazy)
(\d+) the number to capture, with + one or more digits
Miscellaneous Notes
If you want to add textual representations of your numbers, then you would incorporate that into the (\d+) capturing group.
If you have any additional terms to watch for other than arrested or custody, you would add those terms to both lookaround groups
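A note if you port this to Python: as far as I know, the built-in re module only accepts fixed-width lookbehinds, so (?<=arrest|custody) is rejected there because the two words differ in length. The sketch below assumes Python and splits the lookbehind in two; the function name watched_count is made up.

```python
import re

# Each word gets its own lookbehind, since Python's re needs a fixed width.
PATTERN = re.compile(
    r"(\d+)[^,.\d\n]+?(?=arrest|custody)"
    r"|(?:(?<=arrest)|(?<=custody))[^,.\d\n]+?(\d+)"
)


def watched_count(text):
    """Return the number in the same sentence as a watched term, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Group 1 is set when the number comes first, group 2 when it comes last.
    return int(m.group(1) if m.group(1) is not None else m.group(2))


print(watched_count("5 people taken into custody"))
print(watched_count("2 cars crashed. Nobody was arrested"))
```

The second call should find nothing, since the period stops the match from crossing into the next sentence.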
The following works for me (solution based on @PoulBak's idea):
clear
input strL var1
"This is 1 long string saying that police arrests 4 people"
"3 news outlets today reported that 7 people were arrested"
"several witnesses saw 5 people arrested and other 3 killed"
end
generate var2 = ustrregexs(0) if ustrregexm(var1, "(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))")
list
     +-------------------------------------------------------------------------------------+
     |                                                       var1                     var2 |
     |-------------------------------------------------------------------------------------|
  1. |  This is 1 long string saying that police arrests 4 people                arrests 4 |
  2. |  3 news outlets today reported that 7 people were arrested   7 people were arrested |
  3. | several witnesses saw 5 people arrested and other 3 killed        5 people arrested |
     +-------------------------------------------------------------------------------------+
I have multiple 24-hour time strings through several files. For example, 1234, which I wish to replace with 12:34.
Finding them is easy: just \d\d\d\d, which I understand and which works. However, what replace string do I need? In other words, if the result is xx:xx, what do I put in place of each x?
I've tried numbers of things to no avail. I'm obviously not understanding how I get it to remember the digits it found and to recall them in the replace string.
If the 4 digits in your example data represent 24-hour time strings, you could match 2 capturing groups between word boundaries to prevent a match with more than 4 digits. You can adjust the word boundaries to your requirements.
Match
\b(\d{2})(\d{2})\b
Replace
\1:\2 (group 1, a colon, then group 2; some tools write this as $1:$2)
Explanation
\b Match a word boundary
(\d{2}) Capture in a group 2 digits
(\d{2}) Capture in a group 2 digits
\b Match a word boundary
Note
Matching 4 digits does not verify a valid 24 hour time. You could match that using for example \b([01][0-9]|2[0-3])([0-5][0-9])\b and replace with \1:\2
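For what it's worth, here is how the stricter pattern behaves in a quick Python sketch. Python is an assumption (the question doesn't name a tool), and the function name add_colons and the sample text are mine.

```python
import re

# Hours 00-23 in group 1, minutes 00-59 in group 2, as a standalone 4-digit word.
TIME_24H = re.compile(r"\b([01][0-9]|2[0-3])([0-5][0-9])\b")


def add_colons(text):
    """Insert a colon into every standalone 4-digit 24-hour time."""
    return TIME_24H.sub(r"\1:\2", text)


print(add_colons("Depart 0905, arrive 1234; ignore 2561 and 12345."))
```

Here 2561 is left alone because 25 is not a valid hour, and 12345 because it has more than 4 digits.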
I have read through this question, but for Discover card, the starting digits are 6011, 622126-622925, 644-649, 65 instead of just 6011, 65. (Source)
For Discover cards, I picked up this regex from that question ^6(?:011|5[0-9]{2})[0-9]{12}$
I modified it to cover 6011, 644-649 and 65, but building a regex for 622126-622925 is hard because of my poor regex skills.
I have this regex so far 6(?:011|5[0-9]{2}|[4][4-9][0-9]|[2]{2}[1-9])[0-9]{2}$, but it only checks for 622[1-9]**.
How do I modify it so that it accepts only between 622126-622925 for 622*** case?
Here's your regex (demo):
^6(?:011\d{12}|5\d{14}|4[4-9]\d{13}|22(?:1(?:2[6-9]|[3-9]\d)|[2-8]\d{2}|9(?:[01]\d|2[0-5]))\d{10})$
Needless to say, I won't exactly call this pretty or easy to maintain. I would recommend parsing the number as an integer and using your programming language to do the checks.
You should also use the Luhn algorithm to check if the credit card number is valid, and while you could theoretically do this with regex, it would be many times worse than this.
Allow me to show you how I arrived at this monstrosity, step by step. First, here is how you match each of those ranges:
6011 # matches 6011
65 # matches 65
64[4-9] # matches 644-649
622(1(2[6-9]|[3-9]\d)|[2-8]\d{2}|9([01]\d|2[0-5]))
# matches 622126-622925
Now, you want to match the rest of the digits:
6011\d{12} # matches 6011 + 12 digits
65\d{14} # matches 65 + 14 digits
64[4-9]\d{13} # matches 644-649 + 13 digits
622(1(2[6-9]|[3-9]\d)|[2-8]\d{2}|9([01]\d|2[0-5]))\d{10}
# matches 622126-622925 + 10 digits
Now you can combine all four, and add start and end of line anchors:
^( # match start of string and open group
6011\d{12}| # matches 6011 + 12 digits
65\d{14}| # matches 65 + 14 digits
64[4-9]\d{13}| # matches 644-649 + 13 digits
622(1(2[6-9]|[3-9]\d)|[2-8]\d{2}|9([01]\d|2[0-5]))\d{10}
# matches 622126-622925 + 10 digits
)$ # close group and match end of string
The final product above is a slightly compacted version of the previous regex, and I also made groups non-capturing (that's what those ?: are for).
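To illustrate the recommendation to do the checks in code instead, here is a minimal Python sketch. The language and the function names has_discover_prefix and luhn_ok are my own choices. The 16-digit length and the prefix ranges are taken from the regex in this answer, not from any official specification.

```python
import re

# The regex from this answer, kept only to cross-check the plain-code version.
DISCOVER_RE = re.compile(
    r"^6(?:011\d{12}|5\d{14}|4[4-9]\d{13}"
    r"|22(?:1(?:2[6-9]|[3-9]\d)|[2-8]\d{2}|9(?:[01]\d|2[0-5]))\d{10})$"
)


def has_discover_prefix(number):
    """Check length and prefix with plain string and integer comparisons."""
    if len(number) != 16 or not (number.isascii() and number.isdigit()):
        return False
    return (
        number.startswith(("6011", "65"))
        or 644 <= int(number[:3]) <= 649
        or 622126 <= int(number[:6]) <= 622925
    )


def luhn_ok(number):
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


candidate = "6011111111111117"  # a commonly used test number, not a real card
print(has_discover_prefix(candidate), luhn_ok(candidate))
```

The range check is a single readable comparison, and changing a range later means editing two numbers rather than rebuilding a nested alternation.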
Here are your options:
Hack your way through it and build a really complicated regex. Regexes are not suited for this sort of integer comparison, so what you come up with will necessarily be long, complicated and unmaintainable. See Regex for number check below a value and similar SO questions on this topic.
Use integer comparison in your code.
For reference, one such complicated regex (for the six-digit prefix only) would be
62212[6-9]|6221[3-9]|622[2-8]|6229[01]|62292[0-5]
Even though this question is 3 years old, I ran into the same task and would like to share a regex for 622126-622925 :)
^(622[1-9]\\d(?<!10|11|9[3-9])\\d(?<!12[0-5]|92[6-9])\\d{10})$
which uses zero-width negative lookbehinds to exclude the unwanted numbers (the doubled backslashes are string-literal escaping; in a bare regex they are single).
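Range regexes are easy to get subtly wrong, so a brute-force comparison against the integer range is a cheap sanity check. The sketch below assumes Python and tests only the six-digit prefix part of the lookbehind pattern; the names PREFIX and mismatches are mine.

```python
import re

# Prefix part of the lookbehind pattern, with single backslashes because a
# Python raw string needs no extra escaping.
PREFIX = re.compile(r"622[1-9]\d(?<!10|11|9[3-9])\d(?<!12[0-5]|92[6-9])")

# Every value 622000..622999 where the pattern and the integer comparison
# disagree; an empty list means the pattern covers the range exactly.
mismatches = [
    n for n in range(622000, 623000)
    if bool(PREFIX.fullmatch(str(n))) != (622126 <= n <= 622925)
]
print(mismatches)
```

The same loop can be pointed at any other candidate pattern for the range.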
I have a barcode of the format 123456########. That is, the first 6 digits are always the same followed by 8 digits.
How would I check that a variable matches that format?
You haven't specified a language, but regex syntax is relatively uniform across implementations, so something like the following should work: 123456\d{8}
\d indicates numeric characters and is typically equivalent to the set [0-9].
{8} indicates repetition of the preceding character set precisely eight times.
Depending on how the input is coming in, you may want to anchor the regex thusly:
^123456\d{8}$
Where ^ matches the beginning of the line or string and $ matches the end. Alternatively, you may wish to use word boundaries, to ensure that your bar-code strings are properly separated:
\b123456\d{8}\b
Where \b matches the empty string but only at the edges of a word (normally defined as a sequence consisting exclusively of alphanumeric characters plus the underscore, but this can be locale-dependent).
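To make the difference between the three forms concrete, here is a small Python sketch (Python is an assumption, since no language was given; the helper name check and the sample strings are made up):

```python
import re


def check(value):
    """Return (unanchored, anchored, word-boundary) results for one string."""
    return (
        bool(re.search(r"123456\d{8}", value)),      # found anywhere
        bool(re.search(r"^123456\d{8}$", value)),    # must be the whole string
        bool(re.search(r"\b123456\d{8}\b", value)),  # must be a standalone token
    )


for sample in ["12345612345678", "id 12345612345678 ok", "9912345612345678"]:
    print(sample, check(sample))
```

The last sample shows why some kind of anchoring matters: the unanchored pattern happily finds a barcode inside a longer run of digits.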
123456\d{8}
123456 # Literals
\d # Match a digit
{8} # 8 times
You can change the {8} to any number of digits depending on how many are after your static ones.
Regexr will let you try out the regex.
123456\d{8}
should do it. This breaks down to:
123456 - the fixed bit; obviously substitute this for whatever your fixed bit is, and remember to escape any regex special characters in here, although with just numbers you should be fine
\d - a digit
{8} - the number of times the previous element must be repeated, 8 in this case.
The braces can also take two numbers, a minimum and a maximum, so you could write {6,8} if the previous element had to be repeated between 6 and 8 times.
The way you describe it, it's just
^123456[0-9]{8}$
...where you'd replace 123456 with your 6 known digits. I'm using [0-9] instead of \d because I don't know what flavor of regex you're using, and \d allows non-Arabic numerals in some flavors (if that concerns you).
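Python 3 is one flavor where this matters: for str patterns, \d matches any Unicode decimal digit unless the re.ASCII flag is set. A minimal sketch, assuming Python and a hypothetical helper is_barcode:

```python
import re


def is_barcode(value, prefix="123456"):
    """True when value is the fixed prefix followed by exactly 8 ASCII digits."""
    return re.fullmatch(re.escape(prefix) + r"[0-9]{8}", value) is not None


# Eight copies of ARABIC-INDIC DIGIT THREE (U+0663) after the fixed prefix.
odd = "123456" + "\u0663" * 8

print(is_barcode("12345600000000"))                    # plain ASCII digits
print(is_barcode(odd))                                 # rejected by [0-9]
print(re.fullmatch(r"123456\d{8}", odd) is not None)   # accepted by \d
```

re.fullmatch anchors at both ends, so no ^ or $ is needed, and re.escape covers the case where the fixed part contains regex special characters.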
Searching for a single character with a regex is easy.
E.g., at least one digit:
\d
But I need to match at least 2 digits in the text.
.*\d{2}.* or .*\d\d.* ("d2dr5" does not match; "d22r" and "d00r" do)
This does not work because the regex engine requires the digits to be consecutive. How can I count them across the whole string? For example:
I want to match text with at least 3 digits and 2 uppercase letters, and a maximum length of 12. How can I do this? An explained example would give me a starting point for further research.
Example match:
a9r2lDpDf2 - matches: at least 3 digits, 2 uppercase letters, and not exceeding 12 characters in total.
If you want to make sure there are exactly three digits in the string, you can try this (add start and end of string anchors if needed):
[^\d]*\d[^\d]*\d[^\d]*\d[^\d]*
[^\d]* - anything except digits.
Same pattern can be used to check for uppercase letters:
[^A-Z]*[A-Z][^A-Z]*[A-Z][^A-Z]*
RegEx is not the best tool to check length. The language you use has something like length(str) or str.length or str.length() etc.
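For comparison, the whole requirement from the question can be written without any regex. This is only a sketch in Python (the question does not name a language), and the function name and default limits are my reading of the question:

```python
def meets_rules(text, min_digits=3, min_upper=2, max_len=12):
    """Count ASCII digits and uppercase letters, then compare with the limits."""
    digits = sum(1 for ch in text if "0" <= ch <= "9")
    uppers = sum(1 for ch in text if "A" <= ch <= "Z")
    return digits >= min_digits and uppers >= min_upper and len(text) <= max_len


print(meets_rules("a9r2lDpDf2"))
print(meets_rules("d2dr5"))
```

Each rule is a separate, readable comparison, and the limits can be changed without touching a pattern.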
It can be done with lookaheads. This is how the regex looks in Perl:
/^(?=.*\d.*\d.*\d)(?=.*[A-Z].*[A-Z]).{1,12}$/
(?=.*\d.*\d.*\d) - "looks ahead" to see if there are at least 3 digits
(?=.*[A-Z].*[A-Z]) - "looks ahead" to see if there are at least 2 uppercase letters
.{1,12} - length must be at most 12 characters. Any character, 1 to 12 times.
I don't think regexes are the optimal solution here, but for academic interest:
(?=(.*[0-9]){3})(?=(.*[A-Z]){2}).{5,12}
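Note that this pattern has no anchors, so it needs whole-string matching for the length limit to mean anything. A small Python sketch (Python and the function name is_valid are assumptions on my part):

```python
import re

# Two lookaheads count digits and uppercase letters; .{5,12} bounds the length.
RULES = re.compile(r"(?=(.*[0-9]){3})(?=(.*[A-Z]){2}).{5,12}")


def is_valid(text):
    """fullmatch anchors both ends, so {5,12} applies to the whole string."""
    return RULES.fullmatch(text) is not None


print(is_valid("a9r2lDpDf2"))
print(is_valid("d2dr5"))
```

With a plain search instead of fullmatch, a 14-character string could still match on its first 12 characters, which is probably not what you want.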