I have always wondered why I can't replace an unknown whitespace character until just an hour ago that I decided to loop through it and using php ord function I found out that it is actually character ASCII number 13. I tried the following to remove it but didn't work:
preg_replace('/\x13/','',$string)
any help?
13 in hexadecimal is 19 in decimal, which is the ASCII control character DC3, which isn't properly whitespace.
You probably mean decimal 13, which is a carriage return. In hexadecimal, that's D, so you'd use \x0D instead.
Related
I know how to make a regex case-insensitive. This is not about case but about character width. I'm looking for something that is width-insensitive. In Japanese, you have half-width and full-width characters (consider 0123 vs 0123 or ABCD vs ABCD). You can make SQL Server databases width-insensitive with _WI (or width-sensitive with _WS). I was hoping there would be something similar for a regex.
I'm trying to find birth dates where the numbers can be half and full width. Here's an illustration of the problem
For a more specific date matching problem, here's another illustration:
So while \d{4} finds instances of 4 digits, it will not find 4 full-width digits, etc. The workaround I've found is to do something like [0123456789|\d]{4} like so:
But that feels really dirty. Is there a better way to do this?
To match ASCII or full-width digits you can use
[0-9\d]
Or, [\uFF10-\uFF19\d] if you need to abstract from using Unicode text in your source.
Note that in ECMAScript 2018 and later, you can use \p{N} or \p{Nd} to match all Unicode digits.
The current \p{N} range (encompassing Number, Decimal Digit (Nd), Number, Letter (Nl) and Number, Other (No) categories) regex matching 1,791 code points is
(?:[0-9\xB2\xB3\xB9\xBC-\xBE\u0660-\u0669\u06F0-\u06F9\u07C0-\u07C9\u0966-\u096F\u09E6-\u09EF\u09F4-\u09F9\u0A66-\u0A6F\u0AE6-\u0AEF\u0B66-\u0B6F\u0B72-\u0B77\u0BE6-\u0BF2\u0C66-\u0C6F\u0C78-\u0C7E\u0CE6-\u0CEF\u0D58-\u0D5E\u0D66-\u0D78\u0DE6-\u0DEF\u0E50-\u0E59\u0ED0-\u0ED9\u0F20-\u0F33\u1040-\u1049\u1090-\u1099\u1369-\u137C\u16EE-\u16F0\u17E0-\u17E9\u17F0-\u17F9\u1810-\u1819\u1946-\u194F\u19D0-\u19DA\u1A80-\u1A89\u1A90-\u1A99\u1B50-\u1B59\u1BB0-\u1BB9\u1C40-\u1C49\u1C50-\u1C59\u2070\u2074-\u2079\u2080-\u2089\u2150-\u2182\u2185-\u2189\u2460-\u249B\u24EA-\u24FF\u2776-\u2793\u2CFD\u3007\u3021-\u3029\u3038-\u303A\u3192-\u3195\u3220-\u3229\u3248-\u324F\u3251-\u325F\u3280-\u3289\u32B1-\u32BF\uA620-\uA629\uA6E6-\uA6EF\uA830-\uA835\uA8D0-\uA8D9\uA900-\uA909\uA9D0-\uA9D9\uA9F0-\uA9F9\uAA50-\uAA59\uABF0-\uABF9\uFF10-\uFF19]|\uD800[\uDD07-\uDD33\uDD40-\uDD78\uDD8A\uDD8B\uDEE1-\uDEFB\uDF20-\uDF23\uDF41\uDF4A\uDFD1-\uDFD5]|\uD801[\uDCA0-\uDCA9]|\uD802[\uDC58-\uDC5F\uDC79-\uDC7F\uDCA7-\uDCAF\uDCFB-\uDCFF\uDD16-\uDD1B\uDDBC\uDDBD\uDDC0-\uDDCF\uDDD2-\uDDFF\uDE40-\uDE48\uDE7D\uDE7E\uDE9D-\uDE9F\uDEEB-\uDEEF\uDF58-\uDF5F\uDF78-\uDF7F\uDFA9-\uDFAF]|\uD803[\uDCFA-\uDCFF\uDD30-\uDD39\uDE60-\uDE7E\uDF1D-\uDF26\uDF51-\uDF54\uDFC5-\uDFCB]|\uD804[\uDC52-\uDC6F\uDCF0-\uDCF9\uDD36-\uDD3F\uDDD0-\uDDD9\uDDE1-\uDDF4\uDEF0-\uDEF9]|\uD805[\uDC50-\uDC59\uDCD0-\uDCD9\uDE50-\uDE59\uDEC0-\uDEC9\uDF30-\uDF3B]|\uD806[\uDCE0-\uDCF2\uDD50-\uDD59]|\uD807[\uDC50-\uDC6C\uDD50-\uDD59\uDDA0-\uDDA9\uDFC0-\uDFD4]|\uD809[\uDC00-\uDC6E]|\uD81A[\uDE60-\uDE69\uDEC0-\uDEC9\uDF50-\uDF59\uDF5B-\uDF61]|\uD81B[\uDE80-\uDE96]|\uD834[\uDEE0-\uDEF3\uDF60-\uDF78]|\uD835[\uDFCE-\uDFFF]|\uD838[\uDD40-\uDD49\uDEF0-\uDEF9]|\uD83A[\uDCC7-\uDCCF\uDD50-\uDD59]|\uD83B[\uDC71-\uDCAB\uDCAD-\uDCAF\uDCB1-\uDCB4\uDD01-\uDD2D\uDD2F-\uDD3D]|\uD83C[\uDD00-\uDD0C]|\uD83E[\uDFF0-\uDFF9])
and the \p{Nd} (with 660 code points) converts to
(?:[0-9\u0660-\u0669\u06F0-\u06F9\u07C0-\u07C9\u0966-\u096F\u09E6-\u09EF\u0A66-\u0A6F\u0AE6-\u0AEF\u0B66-\u0B6F\u0BE6-\u0BEF\u0C66-\u0C6F\u0CE6-\u0CEF\u0D66-\u0D6F\u0DE6-\u0DEF\u0E50-\u0E59\u0ED0-\u0ED9\u0F20-\u0F29\u1040-\u1049\u1090-\u1099\u17E0-\u17E9\u1810-\u1819\u1946-\u194F\u19D0-\u19D9\u1A80-\u1A89\u1A90-\u1A99\u1B50-\u1B59\u1BB0-\u1BB9\u1C40-\u1C49\u1C50-\u1C59\uA620-\uA629\uA8D0-\uA8D9\uA900-\uA909\uA9D0-\uA9D9\uA9F0-\uA9F9\uAA50-\uAA59\uABF0-\uABF9\uFF10-\uFF19]|\uD801[\uDCA0-\uDCA9]|\uD803[\uDD30-\uDD39]|\uD804[\uDC66-\uDC6F\uDCF0-\uDCF9\uDD36-\uDD3F\uDDD0-\uDDD9\uDEF0-\uDEF9]|\uD805[\uDC50-\uDC59\uDCD0-\uDCD9\uDE50-\uDE59\uDEC0-\uDEC9\uDF30-\uDF39]|\uD806[\uDCE0-\uDCE9\uDD50-\uDD59]|\uD807[\uDC50-\uDC59\uDD50-\uDD59\uDDA0-\uDDA9]|\uD81A[\uDE60-\uDE69\uDEC0-\uDEC9\uDF50-\uDF59]|\uD835[\uDFCE-\uDFFF]|\uD838[\uDD40-\uDD49\uDEF0-\uDEF9]|\uD83A[\uDD50-\uDD59]|\uD83E[\uDFF0-\uDFF9])
If i have a series of strings in Python 3.x im iterating over, how do i check if they all have the right formatting of 1 letter and 12 numbers following it. I want a booleon output so i can use it in an if statment? Thanks
/[a-zA-Z]\d{12}/.test(string)
[a-zA-Z] matches any single letter capital or small
\d{12} matches 12 digits
and test() function returns true if a match is found
I am trying to write a regex that will match Roman numerals from 0 to 39 only. There are plenty of examples which match much larger Roman numerals, but I cannot figure out how to match this specific subset.
Got it. Try this:
/^(X{1,3})(I[XV]|V?I{0,3})$|^(I[XV]|V?I{1,3})$|^V$/
Update:
Zero doesn't exist in Roman numerals. Therefore feel free to tack on your own implementation for zero.
I'm not sure how to represent 0 using Roman numerals. I assume that it has separate token N (see Wikipedia).
Assuming the regex tries to match the whole string (like in Java) and you have lookahead, you can use this regex:
(?.)(X{0,3}(IX|IV|V?I{0,3})|N)
Explanation:
(?.): ensure at least one character
X{0,3}: define the tens (0, 10, 20, 30)
(...): define the final digit
IX: 9
IV: 4
V?I{0,3}: 0-3, 5-8 (0 not as whole number, require at least one X)
N: 0 (as whole number)
If you represent 0 as empty string, the regex is simpler:
X{0,3}(IX|IV|V?I{0,3})
since the lookahead and N in the previous regex is just to prevent empty string.
Assuming you know you have valid Roman numerals and want to fetch only the ones <= 39, that is easy:
^[XVI]*$
See it in action
If that is not the case, it's a little bit trickier, but you can still take advantage of the fact that all the numbers that can be represented only with X, V and I are 1..39:
^X{0,3}(?:V?I{0,3}|I[VX])$
See it in action
X{0,3} covers 10, 20, 30
X{0,3}V?I{0,3} covers all but the ones that end with 4 or 9 (14, 29, etc)
X{0,3}I[VX] exactly the ones ending with 4 or 9
Note: these will also match an empty string, which is my interpretation of a Roman zero. If that is not the case, you can replace the * with + for the first regex and add a positive lookahead at the start of the regex for the second ((?=.)).
Note 2: If they are not on separate lines (or in separate strings), you can replace ^ and $ with word boundaries (\b).
I'm trying to match numbers greater than 40. The good point is that all of them have 2 decimal places, so all of them are like: 3.25, 5.89, 999.75 and they don't use any leading zeros (except on the decimal part that always have 2 digits)...
At first I tried the following code but then I realized this wouldn't match numbers like 100, 1000... even if they are greater than 40.
[4-9][0-9]\.
I don't have to match the decimal part, so don't worry about matching that, just help me to find how to match numbers greater than 40 (up to 9999 would be fine).
Thanks for your help.
This should do the job:
([4-9][0-9]|\d{3,})\.
Check it here:
http://www.regexr.com/3a5v9
Don't use regular expressions for number comparison. If, for example, you're using Javascript:
var aNumber = parseFloat("50");
if (aNumber > 40) {
// yay!
}
If your regex flavour can use negative lookbehind to match the numbers from 41 to 9999 without decimal:
\b(?:[1-9][0-9]{2,3}|[5-9][0-9]|4[1-9])(?<!\.\d{1,2})\b
(40\.(?!0[^\d]|00)\d{1,2}|(((4[1-9](?!\d)|[5-9][0-9])(?![\d])|\d*[1-9]\d{2,})(\.\d{1,2})?))
This prevents false positives from leading 0s.
This worked for me.
It tries to match 40 followed by 1 or two decimals that are not 00.
It then tries to match 4 followed by 1-9, decimal optional.
If it can't match that it matches 5-9 followed by 0-9, decimal optional.
It then triese to match any digit, any number of times, followed by 1-9, followed by 1 or 2 digits, decimal optional.
If you want to require the decimal, just remove the last question mark.
This will do it:
([4-9][0-9]+|\d{3,})
This it will get all the numbers of two digits having the first one greater than 4 or any number with three digits.
As an example http://www.regexr.com/3a5v0
You can use brackets to indicate a minimum and, if desired, maximum number of characters to match. So,
([4-9][0-9]|[1-9][0-9]{2,})\.
matches 4-9 followed by one or more digits. Presumably there's a boundary of some sort at the beginning of this, but it sounds like you have that part worked out. This uses an OR to allow for two possible groups of first digits.
(Most of the other answer are perfect for me -- This is paranoia and a bad idea :)
for use with grep -Po or Perl we could use:
'\b(\d{3,}|[4-9]\d)\.\d\d'
but this would get 40.00 (not greater than 40)
'\b(\d{3,}|[5-9]\d|4[1-9])\.\d\d|\b40\.\d?[1-9]\d?'
Corresponding to:
DDD.DD
| [5-9]D.DD
| 4[1-9].DD
| 40.D[1-9]
| 40.[1-9]D
In flex(1) you have this code to parse strings and get numbers greater than 40:
pru.l:
%option noyywrap
%%
\+?(0*[4-9][0-9]|0*[1-9][0-9][0-9][0-9]*)(\.[0-9]*)? { printf("Greater than 40: %s\n", yytext); }
\-?[0-9]*(\.[0-9]*)? { printf("Lesser than 40: %s\n", yytext); }
\n |
. ;
%%
int main()
{ yylex(); }
Install flex and compile this file it with
make pru
Then run it as:
pru <filein >fileout
or just
pru
This code constructs a deterministic finite automaton from the regular expressions listed and prints the commands listed on the right when recognizes a value greater than 40. It allows a leading optional sign and leading zeros, and an optional fractional part composed of any number of digits. And it does this with only one asignment and one decision for each character read. You have access to the automaton state table generated by flex (it writes C code for you)
the regex that recognizes numbers greater than 40 (with decimals and leading sign and zeros) is:
\+?(0*[4-9][0-9]|0*[1-9][0-9][0-9][0-9]*)(\.[0-9]*)?
and can be abreviated as:
\+?(0*[4-9][0-9]|0*[1-9][0-9]{3,})(\.[0-9]*)?
explanation:
\+? matches an optional plus sign.
(...|...) two options:
0* optional arbitrary number of leadin zeros.
[4-9][0-9] the numbers 40 to 99
[1-9][0-9]{3,} the numbers 100 and up.
(.[0-9]*)? optional decimal point followed by an arbitrary number of digits.
Is there a way to decrement a character value alphabetically in C++?
For example, changing a variable containing
'b' to the value 'a' or a variable containing
'd' to the value 'c' ?
I tried looking at character sequence but couldn't find anything useful.
Characters are essentially one byte integers (although the representation may vary between compilers). While there are many encodings which map integer values to characters, almost all of them map 'a' to 'z' characters in successive numerical order. So, if you wanted to change the string "aaab" to "aaaa" you could do something like the following:
char letters [4] = {'a','a','a','b'};
letters[3]--;
Alphabet characters are part of the ASCII character table. 65 is uppercase letter A, and 32 bits later, which is 97, is lowercase letter A. Letters B through Z and b through z are 66 through 90 and 98 through 122, respectively) The original computer programmers made it 32 bits apart in the ASCII chart rather than 26 (letters in the alphabet) because bit manipulation can be done to either easily change from lowercase to uppercase (and vice-versa), as well as ignoring the case (by ignoring the 32 bit - 0010 0000).
This way, for example, the 84th character on the ASCII chart, which represents the letter T, is represented with the bits 0101 0100. Lowercase t is 116 which is 0111 0100. When ignoring the case, the 1 in the 32 bit (6th position from the right) is ignored. You can see all the other bits are exactly the same for uppercase and lowercase. This makes it more convenient for everyone and more optimal for the computer.
To decrement just convert the character to its ASCII character value, decrement by 1, then take that integer and convert it back into ASCII value. Be careful when you have an 'A' though (or 'a'), as that's a special case.