Regex to remove leading zeros from a string in Hive

I have a 19-character string in Hive that I need to split up, removing any leading zeros from each part.
Example:
7212092180052740029
I need it to be split like this
721 20 9218 00527 40029
So there are no leading zeros in 1st, 2nd, or 3rd section, and 00 would be removed from the 4th section; section 5 will be disregarded. My desired result would be
721209218527
My first-pass solution is
trim(concat_ws('', regexp_replace(substr(some_string, 1, 3), '^0*', '')
, regexp_replace(substr(some_string, 4, 2), '^0*', '')
, regexp_replace(substr(some_string, 6, 4), '^0*', '')
, regexp_replace(substr(some_string, 10, 5), '^0*', '')))
but this seems like extreme overkill. Any ideas how to do this with one line of regex?
Also note that none of the 5 sections, when split, will ever be all zeros (i.e. section one will never be 000). If one were, my 'solution' wouldn't work, since all of its zeros would be leading ones and '^0*' would strip the whole section.

^0*|(?<=^.{3})0*|(?<=^.{5})0*|(?<=^.{9})0*|(?<=^.{14}).*$
You can use this regex and replace each match with an empty string. See the demo:
https://regex101.com/r/rO0yD8/15
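As a quick sanity check outside Hive, the pattern can be tried in Python. This is only a sketch: it assumes Python's re module handles the look-behinds the same way Hive's Java regex engine does, and the function name strip_sections is made up for illustration.

```python
import re

# Same alternation as in the answer: leading zeros at offsets 0, 3, 5 and 9,
# plus everything from offset 14 onward (the fifth section).
PATTERN = re.compile(
    r"^0*|(?<=^.{3})0*|(?<=^.{5})0*|(?<=^.{9})0*|(?<=^.{14}).*$"
)


def strip_sections(value: str) -> str:
    """Remove per-section leading zeros and drop the fifth section."""
    return PATTERN.sub("", value)


print(strip_sections("7212092180052740029"))  # 721209218527
```

In Hive itself the equivalent call would presumably be regexp_replace(some_string, '<pattern>', ''), with backslashes escaped as Hive requires.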

Related

Split a String, take only 5 items. But limit characters to <20

a simple query perhaps.
I use a very useful formula:
=JOIN(replace(A1, find("|", SUBSTITUTE(A1, ", ", "|", 5)), len(A1), ""), "", )
...this takes a comma-separated cell (it may contain 50 strings) and returns only the first 5. Next I'd like to limit the returned strings to those under 20 characters. Is it possible to incorporate some magic into this formula? I currently use a regex "find and replace" with the value .{20,} and then delete everything that is over 20 chars. There must be a more beautiful way of doing this?
i.e. cell A1 =
here's a string, here's a very long string over 20 chars, string 3, string 4, another string, string 5, another string 7, number 8
would become
here's a string, string 3, string 4, another string, string 5
Also, in formulas... how do you handle errors? If my queried cell only contains 3 strings, and I'm asking for 5 I get an Error, or bad return, what's the trick to handle such an event?
Thanks so much for reading this!
=JOIN(", ",ARRAY_CONSTRAIN(FILTER(SPLIT(A1,", ",0),LEN(SPLIT(A1,", ",0))<=20),1,5))
SPLIT the cell by the delimiter
FILTER the array by LENgth of chars (here at most 20; use <20 for strictly under 20)
ARRAY_CONSTRAIN to keep only the first 5 items
JOIN the split, filtered array back together
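The same split / filter / constrain / join pipeline, written as a small Python sketch for anyone who wants to check the logic outside the spreadsheet. The function name and its parameters are invented here; the 20-character limit and the count of 5 come from the question.

```python
def first_short_items(cell: str, max_len: int = 20, limit: int = 5) -> str:
    """Keep at most `limit` items of at most `max_len` characters each."""
    parts = cell.split(", ")                        # SPLIT by delimiter
    kept = [p for p in parts if len(p) <= max_len]  # FILTER by length
    return ", ".join(kept[:limit])                  # ARRAY_CONSTRAIN + JOIN


cell = ("here's a string, here's a very long string over 20 chars, string 3, "
        "string 4, another string, string 5, another string 7, number 8")
print(first_short_items(cell))
# here's a string, string 3, string 4, another string, string 5
```

Slicing with kept[:limit] does not fail when fewer than 5 items survive, which is the behaviour the question asks about for short cells.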

Regex Expression on 16-digit number

I am stuck on this regex problem.
A 16-digit credit card number, with the first digit being a 5 and the second digit being a 1, 2, 3, 4, or 5 (the rest of the digits can be anything).
So far I have ^4[1,5]\d{14}. I know I'm missing a lot of things, but I don't know what.
Please help, and thanks!
Look at the start of your regex:
^4[1,5]
That says that the number must start with 4 (not 5), and that the second character must be 1, a comma, or 5.
You want this instead (followed by the rest, of course):
^5[1-5]
Note the use of - rather than , to indicate a range of characters.
The full regex you're looking for is the following
^5[1-5]\d{14}$
Demo
Your error lies in using 1,5 as a range: inside a character class this just matches the characters 1, a comma, or 5. To specify a range, put a - between the two endpoints.
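A short Python check of both points, as a sketch only (the helper name looks_like_card is made up, and the sample numbers are arbitrary digit strings, not real cards):

```python
import re

# Corrected pattern from the answers: 5, then 1-5, then 14 more digits.
CARD = re.compile(r"^5[1-5]\d{14}$")


def looks_like_card(number: str) -> bool:
    """True when the string has the 16-digit 5[1-5]... shape."""
    return CARD.match(number) is not None


print(looks_like_card("5112345678901234"))  # True
print(looks_like_card("4112345678901234"))  # False: starts with 4
print(looks_like_card("5612345678901234"))  # False: second digit is 6

# The original character class [1,5] really does accept a comma:
print(re.fullmatch(r"4[1,5]", "4,") is not None)  # True
```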

R regex: How to extract string with one or two digit number within title?

I have a bunch of filenames that are numbered that I would like to be able to extract based on a regex statement.
For example, say I have the following filenames:
file.names <- paste0("run", 0:99, ".dat.gz")
If I wanted to extract files 5 through 8, I would need a regex that returns the following:
grep("correct_regex", file.names, value=TRUE)
"run5.dat.gz" "run6.dat.gz" "run7.dat.gz" "run8.dat.gz"
Or if I wanted to return files 9 through 21, it would return the following:
grep("correct_regex", file.names, value=TRUE)
"run9.dat.gz" "run10.dat.gz" "run11.dat.gz" "run12.dat.gz" "run13.dat.gz" "run14.dat.gz" "run15.dat.gz" "run16.dat.gz" "run17.dat.gz" "run18.dat.gz" "run19.dat.gz" "run20.dat.gz" "run21.dat.gz"
The tricky part is developing a regex that matches on the number as a whole, as opposed to individual digits (e.g. [0-9]). Any tips to help with this?
I also think that Sam's answer is the correct one, but just in case you also need to quickly extract non-sequential items, here is how you can easily build the regex you need (these subpatterns are to be used between "^run" and "[.]dat[.]gz$"):
Use [5-8] to match all digits from 5 to 8 (as in the current example)
For non-sequential one-digit values, add the ranges separately ([1-37-9] will match 1, 2, 3, 7, 8, 9)
When you need to combine numbers of different length, use alternations with (...|...):
(1[2-4]|2[89]) - will match 12, 13, 14, 28 and 29
(2[3-5]|[0-2]) - will match 23, 24, 25, 0, 1, and 2
In your case, you can use
> file.names <- paste0("run", 0:99, ".dat.gz")
> grep("^run[5-8][.]dat[.]gz$", file.names, value=TRUE)
[1] "run5.dat.gz" "run6.dat.gz" "run7.dat.gz" "run8.dat.gz"
Note that ^ matches the start of string and $ matches the end of string (so, this regex ensures a full string match).
You could accomplish this with a simple function and avoid regexes:
get_numbered_filenames <- function(num_vec){
  target <- paste0("run", num_vec, ".dat.gz")
  file.names[file.names %in% target]
}
get_numbered_filenames(9:21)
[1] "run9.dat.gz" "run10.dat.gz" "run11.dat.gz" "run12.dat.gz" "run13.dat.gz" "run14.dat.gz"
[7] "run15.dat.gz" "run16.dat.gz" "run17.dat.gz" "run18.dat.gz" "run19.dat.gz" "run20.dat.gz"
[13] "run21.dat.gz"
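For the 9-through-21 case from the question, the alternation approach described above can be sketched like this. It is written in Python rather than R purely for illustration; the subpattern (9|1[0-9]|2[01]) should drop into the R grep() call unchanged.

```python
import re

file_names = [f"run{i}.dat.gz" for i in range(100)]

# 9, or 10-19, or 20-21, anchored so that run90 or run210 cannot slip through.
pattern = re.compile(r"^run(9|1[0-9]|2[01])[.]dat[.]gz$")

matches = [name for name in file_names if pattern.match(name)]
print(matches[0], matches[-1], len(matches))  # run9.dat.gz run21.dat.gz 13
```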

R: removing the last three dots from a string

I have a text data file that I likely will read with readLines. The initial portion of each string contains a lot of gibberish followed by the data I need. The gibberish and the data are usually separated by three dots. I would like to split the strings after the last three dots, or replace the last three dots with a marker of some sort telling R to treat everything to the left of those three dots as one column.
Here is a similar post on Stackoverflow that will locate the last dot:
R: Find the last dot in a string
However, in my case some of the data have decimals, so locating the last dot will not suffice. Also, I think ... has a special meaning in R, which might be complicating the issue. Another potential complication is that some of the dots are bigger than others. Also, in some lines one of the three dots was replaced with a comma.
In addition to gregexpr in the post above I have tried using gsub, but cannot figure out the solution.
Here is an example data set and the outcome I hope to achieve:
aa = matrix(c(
'first string of junk... 0.2 0 1',
'next string ........2 0 2',
'%%%... ! 1959 ... 0 3 3',
'year .. 2 .,. 7 6 5',
'this_string is . not fine .•. 4 2 3'),
nrow=5, byrow=TRUE,
dimnames = list(NULL, c("C1")))
aa <- as.data.frame(aa, stringsAsFactors=F)
aa
# desired result
#                          C1  C2 C3 C4
# 1      first string of junk 0.2  0  1
# 2         next string .....   2  0  2
# 3             %%%... ! 1959   0  3  3
# 4                 year .. 2   7  6  5
# 5 this_string is . not fine   4  2  3
I hope this question is not considered too specific. The text data file was created using the steps outlined in my post from yesterday about reading an MSWord file in R.
Some of the lines do not contain gibberish or three dots, but only data. However, that might be a complication for a follow up post.
Thank you for any advice.
This does the trick, though not especially elegant...
options(stringsAsFactors = FALSE)
# Search for three consecutive characters of your delimiters, then pull out
# all of the characters after that
# (in parentheses, represented in replace by \\1)
nums <- as.vector(gsub(aa$C1, pattern = "^.*[.,•]{3}\\s*(.*)", replace = "\\1"))
# Use strsplit to break the results apart at spaces, which gives one
# character vector of numbers per input string
# Use do.call(rbind, ...) to stack those vectors into a matrix with
# one row per input string
num.mat <- do.call(rbind, strsplit(nums, split = " "))
# Mash it back together with your original strings
result <- as.data.frame(cbind(aa, num.mat))
# Give it informative names
names(result) <- c("original.string", "num1", "num2", "num3")
This will get you most of the way there, and it will have no problems with numbers that include commas:
# First, use a regex to eliminate the bad pattern. This regex
# eliminates any three-character combination of periods, commas,
# and big dots (•), so long as the combination is followed by
# 0-2 spaces and then a digit.
aa.sub <- as.matrix(
  apply(aa, 1, function (x)
    gsub('[•.,]{3}(\\s{0,2}\\d)', '\\1', x, perl = TRUE)))
# Second: it looks as though you want your data split into columns.
# So this regex splits on spaces that are (a) preceded by a letter,
# digit, or space, and (b) followed by a digit. The result is a
# list, each element of which is a list containing the parts of
# one of the strings in aa.
aa.list <- apply(aa.sub, 1, function (x)
  strsplit(x, '(?<=[\\w\\d\\s])\\s(?=\\d)', perl = TRUE))
# Remove the second element in aa. There is no space before the
# first data column in this string. As a result, strsplit() split
# it into three columns, not 4. That in turn throws off the code
# below.
aa.list <- aa.list[-2]
# Make the data frame.
aa.list <- lapply(aa.list, unlist) # convert list of lists to list of vectors
aa.df <- data.frame(aa.list)
aa.df <- data.frame(t(aa.df), row.names = NULL, stringsAsFactors = FALSE)
The only thing remaining is to modify the regex for strsplit() so that it can handle the second string in aa. Or perhaps it's better just to handle cases like that manually.
Reverse the string
Reverse the pattern you're searching for if necessary - it's not in your case
Reverse the result
[haiku-pseudocode]
a = 'first string of junk... 0.2 0 1' // string to search
b = 'junk' // pattern to match
ra = reverseString(a) // now equals '1 0 2.0 ...knuj fo gnirts tsrif'
rb = reverseString (b) // now equals 'knuj'
// run your regular expression search / replace - search in 'ra' for 'rb'
// put the result in rResult
// and then unreverse the result
// apologies for not knowing the syntax for 'R' regex
[/haiku-pseudocode]
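An alternative to reversing the string is to let a greedy .* find the last run of three delimiter characters, which is the idea behind the first gsub() answer. Below is a minimal Python sketch of that idea. The helper name split_on_last_dots is made up, and returning the line unchanged when no delimiter run is found is an assumption, not something the question specifies.

```python
import re

# Greedy (.*) pushes the three-delimiter run as far right as possible,
# so the split happens at the LAST run of '.', ',' or '•'.
LAST_RUN = re.compile(r"^(.*)[.,•]{3}\s*(.*)$")


def split_on_last_dots(line: str):
    """Return (junk, [data fields]) split at the last three-delimiter run."""
    m = LAST_RUN.match(line)
    if m is None:
        return line, []  # assumed fallback for lines with no delimiter run
    return m.group(1).rstrip(), m.group(2).split()


print(split_on_last_dots("next string ........2 0 2"))
# ('next string .....', ['2', '0', '2'])
```

Because the decimal point in 0.2 is a single dot, it never satisfies the {3} quantifier, so decimals in the data are left alone.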

What regular expression will match valid international phone numbers?

I need to determine whether a phone number is valid before attempting to dial it. The phone call can go anywhere in the world.
What regular expression will match valid international phone numbers?
\+(9[976]\d|8[987530]\d|6[987]\d|5[90]\d|42\d|3[875]\d|
2[98654321]\d|9[8543210]|8[6421]|6[6543210]|5[87654321]|
4[987654310]|3[9643210]|2[70]|7|1)\d{1,14}$
Is the correct format for matching a generic international phone number. I replaced the US land line centric international access code 011 with the standard international access code identifier of '+', making it mandatory. I also changed the minimum for the national number to at least one digit.
Note that if you enter numbers in this format into your mobile phone address book, you may successfully call any number in your address book no matter where you travel. For land lines, replace the plus with the international access code for the country you are dialing from.
Note that this DOES NOT take into account national number plan rules - specifically, it allows zeros and ones in locations that national number plans may not allow and also allows number lengths greater than the national number plan for some countries (e.g., the US).
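To see what the pattern accepts and rejects, here is a small Python sketch. The helper name has_known_country_code is invented, and the sample numbers are arbitrary. As noted above, a match only says the prefix looks like an assigned country code, not that the number exists.

```python
import re

# The compact country-code pattern from the answer, split over three
# string literals only to keep the lines short.
INTL = re.compile(
    r"\+(9[976]\d|8[987530]\d|6[987]\d|5[90]\d|42\d|3[875]\d|"
    r"2[98654321]\d|9[8543210]|8[6421]|6[6543210]|5[87654321]|"
    r"4[987654310]|3[9643210]|2[70]|7|1)\d{1,14}$"
)


def has_known_country_code(number: str) -> bool:
    """True when the string is '+', a listed country code, then 1-14 digits."""
    return INTL.match(number) is not None


print(has_known_country_code("+14155552671"))   # True
print(has_known_country_code("+442071838750"))  # True
print(has_known_country_code("14155552671"))    # False: no leading +
print(has_known_country_code("+0123456"))       # False: no code starts with 0
```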
All country codes are defined by the ITU. The following regex is based on ITU-T E.164 and Annex to ITU Operational Bulletin No. 930 – 15.IV.2009. It contains all current country codes and codes reserved for future use. While it could be shortened a bit, I decided to include each code independently.
This is for calls originating from the USA. For other countries, replace the international access code (the 011 at the beginning of the regex) with whatever is appropriate for that country's dialing plan.
Also, note that ITU E.164 defines the maximum length of a full international telephone number to 15 digits. This means a three digit country code results in up to 12 additional digits, and a 1 digit country code could contain up to 14 additional digits. Hence the
[0-9]{0,14}$
at the end of the regex.
Most importantly, this regex does not mean the number is valid - each country defines its own internal numbering plan. This only ensures that the country code is valid.
^011(999|998|997|996|995|994|993|992|991|
990|979|978|977|976|975|974|973|972|971|970|
969|968|967|966|965|964|963|962|961|960|899|
898|897|896|895|894|893|892|891|890|889|888|
887|886|885|884|883|882|881|880|879|878|877|
876|875|874|873|872|871|870|859|858|857|856|
855|854|853|852|851|850|839|838|837|836|835|
834|833|832|831|830|809|808|807|806|805|804|
803|802|801|800|699|698|697|696|695|694|693|
692|691|690|689|688|687|686|685|684|683|682|
681|680|679|678|677|676|675|674|673|672|671|
670|599|598|597|596|595|594|593|592|591|590|
509|508|507|506|505|504|503|502|501|500|429|
428|427|426|425|424|423|422|421|420|389|388|
387|386|385|384|383|382|381|380|379|378|377|
376|375|374|373|372|371|370|359|358|357|356|
355|354|353|352|351|350|299|298|297|296|295|
294|293|292|291|290|289|288|287|286|285|284|
283|282|281|280|269|268|267|266|265|264|263|
262|261|260|259|258|257|256|255|254|253|252|
251|250|249|248|247|246|245|244|243|242|241|
240|239|238|237|236|235|234|233|232|231|230|
229|228|227|226|225|224|223|222|221|220|219|
218|217|216|215|214|213|212|211|210|98|95|94|
93|92|91|90|86|84|82|81|66|65|64|63|62|61|60|
58|57|56|55|54|53|52|51|49|48|47|46|45|44|43|
41|40|39|36|34|33|32|31|30|27|20|7|1)[0-9]{0,14}$
This is a further optimisation.
\+(9[976]\d|8[987530]\d|6[987]\d|5[90]\d|42\d|3[875]\d|
2[98654321]\d|9[8543210]|8[6421]|6[6543210]|5[87654321]|
4[987654310]|3[9643210]|2[70]|7|1)
\W*\d\W*\d\W*\d\W*\d\W*\d\W*\d\W*\d\W*\d\W*(\d{1,2})$
(i) allows for valid international prefixes
(ii) followed by 9 or 10 digits, with any type or placement of delimiters (except between the last two digits)
This will match:
+1-234-567-8901
+61-234-567-89-01
+46-234 5678901
+1 (234) 56 89 901
+1 (234) 56-89 901
+46.234.567.8901
+1/234/567/8901
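The listed examples can be checked mechanically. The sketch below uses Python's re module and assumes the two halves of the pattern are meant to be joined without a line break; the variable names are made up.

```python
import re

# Country-code alternation followed by eight single digits, each optionally
# preceded by delimiters, and a final group of one or two digits.
LOOSE = re.compile(
    r"\+(9[976]\d|8[987530]\d|6[987]\d|5[90]\d|42\d|3[875]\d|"
    r"2[98654321]\d|9[8543210]|8[6421]|6[6543210]|5[87654321]|"
    r"4[987654310]|3[9643210]|2[70]|7|1)"
    r"\W*\d\W*\d\W*\d\W*\d\W*\d\W*\d\W*\d\W*\d\W*(\d{1,2})$"
)

samples = [
    "+1-234-567-8901",
    "+61-234-567-89-01",
    "+46-234 5678901",
    "+1 (234) 56 89 901",
    "+1 (234) 56-89 901",
    "+46.234.567.8901",
    "+1/234/567/8901",
]

results = {s: LOOSE.match(s) is not None for s in samples}
print(all(results.values()))  # True

# A delimiter between the last two digits is rejected, as described in (ii).
print(LOOSE.match("+1-234-567-890-1") is not None)  # False
```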
You can use the library libphonenumber from Google.
PhoneNumberUtil phoneNumberUtil = PhoneNumberUtil.getInstance();
String decodedNumber = null;
PhoneNumber number;
try {
    number = phoneNumberUtil.parse(encodedHeader, null);
    decodedNumber = phoneNumberUtil.format(number, PhoneNumberFormat.E164);
} catch (NumberParseException e) {
    e.printStackTrace();
}
Modified @Eric's regular expression: added a list of all country codes (got them from xxxdepy @ Github).
I hope you will find it helpful:
/(\+|00)(297|93|244|1264|358|355|376|971|54|374|1684|1268|61|43|994|257|32|229|226|880|359|973|1242|387|590|375|501|1441|591|55|1246|673|975|267|236|1|61|41|56|86|225|237|243|242|682|57|269|238|506|53|5999|61|1345|357|420|49|253|1767|45|1809|1829|1849|213|593|20|291|212|34|372|251|358|679|500|33|298|691|241|44|995|44|233|350|224|590|220|245|240|30|1473|299|502|594|1671|592|852|504|385|509|36|62|44|91|246|353|98|964|354|972|39|1876|44|962|81|76|77|254|996|855|686|1869|82|383|965|856|961|231|218|1758|423|94|266|370|352|371|853|590|212|377|373|261|960|52|692|389|223|356|95|382|976|1670|258|222|1664|596|230|265|60|262|264|687|227|672|234|505|683|31|47|977|674|64|968|92|507|64|51|63|680|675|48|1787|1939|850|351|595|970|689|974|262|40|7|250|966|249|221|65|500|4779|677|232|503|378|252|508|381|211|239|597|421|386|46|268|1721|248|963|1649|235|228|66|992|690|993|670|676|1868|216|90|688|886|255|256|380|598|1|998|3906698|379|1784|58|1284|1340|84|678|681|685|967|27|260|263)(9[976]\d|8[987530]\d|6[987]\d|5[90]\d|42\d|3[875]\d|2[98654321]\d|9[8543210]|8[6421]|6[6543210]|5[87654321]|4[987654310]|3[9643210]|2[70]|7|1)\d{4,20}$/
No criticism of those great answers; I just want to present the simple solution I use for our admin content creators:
^(\+|00)[1-9][0-9 \-\(\)\.]{7,32}$
It forces the number to start with a plus or two zeros and to contain at least a few digits. White space, brackets, minus and point are optional; no other characters are allowed.
You can safely remove all non-digits and use the result in a tel: input. Numbers will have a common form of representation, and I do not have to worry about being too restrictive.
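A sketch of that validate-then-strip approach in Python. The function name normalize and the choice to return None for rejected input are illustrative assumptions, not part of the answer.

```python
import re
from typing import Optional

# The permissive pattern from the answer: '+' or '00', a non-zero digit,
# then 7 to 32 digits or common separators.
LOOSE_INTL = re.compile(r"^(\+|00)[1-9][0-9 \-\(\)\.]{7,32}$")


def normalize(raw: str) -> Optional[str]:
    """Return only the digits of an accepted number, or None if rejected."""
    if LOOSE_INTL.match(raw) is None:
        return None
    return re.sub(r"\D", "", raw)  # drop everything that is not a digit


print(normalize("+49 (123) 456-7890"))  # 491234567890
print(normalize("123 456 7890"))        # None: no + or 00 prefix
```

Note that stripping non-digits also removes the leading +, so a caller that needs it for dialing would have to add it back.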
I use this one:
/([0-9\s\-]{7,})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$/
Advantages: recognizes + or 011 beginnings, lets it be as long as needed, and handles many extension conventions. (#,x,ext,extension)
This will work for international numbers;
C#:
@"^((\+\d{1,3}(-| )?\(?\d\)?(-| )?\d{1,5})|(\(?\d{2,6}\)?))(-| )?(\d{3,4})(-| )?(\d{4})(( x| ext)\d{1,5}){0,1}$"
JS:
/^((\+\d{1,3}(-| )?\(?\d\)?(-| )?\d{1,5})|(\(?\d{2,6}\)?))(-| )?(\d{3,4})(-| )?(\d{4})(( x| ext)\d{1,5}){0,1}$/
Here's an "optimized" version of your regex:
^011(9[976]\d|8[987530]\d|6[987]\d|5[90]\d|42\d|3[875]\d|
2[98654321]\d|9[8543210]|8[6421]|6[6543210]|5[87654321]|
4[987654310]|3[9643210]|2[70]|7|1)\d{0,14}$
You can replace the \ds with [0-9] if your regex syntax doesn't support \d.
For iOS SWIFT I found this helpful,
let phoneRegEx = "^((\\+)|(00)|(\\*)|())[0-9]{3,14}((\\#)|())$"
Here is a regex for the following most common phone number scenarios. Although this is tailored from a US perspective for area codes it works for international scenarios.
The actual number should be 10 digits only.
For US numbers area code may be surrounded with parentheses ().
The country code can be 1 to 3 digits long. Optionally may be preceded by a + sign.
There may be dashes, spaces, dots or no spaces between country code, area code and the rest of the number.
A valid phone number cannot be all zeros.
^(?!\b(0)\1+\b)(\+?\d{1,3}[. -]?)?\(?\d{3}\)?([. -]?)\d{3}\3\d{4}$
Explanation:
^ - start of expression
(?!\b(0)\1+\b) - (?!...) is a negative lookahead. \b - word boundary around a '0' character. \1 is a backreference to the previous capturing group (the zero). Basically, don't match all zeros.
(\+?\d{1,3}[. -]?)? - '\+?' makes the plus sign before the country code optional. \d{1,3} - the country code can be 1 to 3 digits long. '[. -]?' - spaces, dots and dashes are optional. The last question mark makes the country code itself optional.
\(?\d{3}\)? - '\)?' is to make parentheses optional. \d{3} - match 3 digit area code.
([. -]?) - optional space, dash or dot
$ - end of expression
More examples and explanation - https://regex101.com/r/hTH8Ct/2/
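One consequence of the \3 backreference is that the separator after the area code must be repeated before the last four digits. A quick Python sketch to illustrate (the helper name is made up and the sample numbers are arbitrary):

```python
import re

# Pattern from the answer; group 3 captures the separator after the area
# code and \3 requires the same separator before the final four digits.
US_STYLE = re.compile(
    r"^(?!\b(0)\1+\b)(\+?\d{1,3}[. -]?)?\(?\d{3}\)?([. -]?)\d{3}\3\d{4}$"
)


def is_us_style(number: str) -> bool:
    return US_STYLE.match(number) is not None


print(is_us_style("234-567-8901"))     # True
print(is_us_style("+1 234 567 8901"))  # True
print(is_us_style("234-567.8901"))     # False: mixed separators
print(is_us_style("0000000000"))       # False: all zeros
```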
I have used this below:
^(\+|00)[0-9]{1,3}[0-9]{4,14}(?:x.+)?$
The format +CCC.NNNNNNNNNNxEEEE or 00CCC.NNNNNNNNNNxEEEE
Phone number must start with '+' or '00' for an international call.
where C is the 1–3 digit country code,
N is up to 14 digits,
and E is the (optional) extension.
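The leading plus sign (or 00) is required. Note that the regex as written has no place for the dot after the country code shown in the format above, so the digits must follow the prefix directly. The literal “x” character is required only if an extension is provided.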
I made a regexp for European phone numbers that checks the length of the number against the dial prefix.
const PhoneEuropeRegExp = () => {
  // eu phones map https://en.wikipedia.org/wiki/Telephone_numbers_in_Europe
  const phonesMap = {
    "43": [4, 13],
    "32": [8, 10],
    "359": [7, 9],
    "385": [8, 9],
    "357": 8,
    "420": 9,
    "45": 8,
    "372": 7,
    "358": [5, 12],
    "33": 9,
    "350": 8,
    "49": [3, 12],
    "30": 10,
    "36": [8, 9],
    "354": [7, 9],
    "353": [7, 9],
    "39": [6, 12],
    "371": 8,
    "423": [7, 12],
    "370": 8,
    "352": 8,
    "356": 8,
    "31": 9,
    "47": [4, 12],
    "48": 9,
    "351": 9,
    "40": 9,
    "421": 9,
    "386": 8,
    "34": 9,
    "46": [6, 9],
  };
  const regExpBuilt = Object.keys(phonesMap)
    .reduce(function(prev, key) {
      const val = phonesMap[key];
      if (Array.isArray(val)) {
        prev.push("(\\+" + key + `[0-9]\{${val[0]},${val[1]}\})`);
      } else {
        prev.push("(\\+" + key + `[0-9]\{${val}\})`);
      }
      return prev;
    }, [])
    .join("|");
  return new RegExp(`^(${regExpBuilt})$`);
};
alert(PhoneEuropeRegExp().test("+420123456789"))
I only check for valid characters and allow up to 30 characters. Numbers that include an extension are also possible.
^[\+\(\s.\-\/\d\)]{5,30}$
Matches the following:
(0123) 123 456 1
555-555-5555
0049 1555 532-3455
123 456 7890
0761 12 34 56
+49 123 1-234-567-8901
+61-234-567-89-01
+46-234 5678901
+1 (234) 56 89 901
+1 (234) 56-89 901
+46.234.567.8901
+1/234/567/8901
It works pretty well with 00xx and +xx:
^(?:00|\+)(9[976]\d|8[987530]\d|6[987]\d|5[90]\d|42\d|3[875]\d|2[98654321]\d|9[8543210]|8[6421]|6[6543210]|5[87654321]|4[987654310]|3[9643210]|2[70]|7|1)\d{1,14}$
There's obviously a multitude of ways to do this, as evidenced by all of the different answers given thus far, but I'll throw my $0.02 worth in here and provide the regex below, which is a bit more terse than nearly all of the above, but more thorough than most as well. It also has the nice side-effect of leaving the country code in $1 and the local number in $2.
^\+(?=\d{5,15}$)(1|2[078]|3[0-469]|4[013-9]|5[1-8]|6[0-6]|7|8[1-469]|9[0-58]|[2-9]..)(\d+)$
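To illustrate the two capture groups, here is a Python sketch. The function name split_number is invented, and Python exposes the groups as group(1) and group(2) rather than $1 and $2.

```python
import re
from typing import Optional, Tuple

# Pattern from the answer: the lookahead enforces 5-15 digits in total,
# group 1 is the country code, group 2 is the rest of the number.
E164_SPLIT = re.compile(
    r"^\+(?=\d{5,15}$)(1|2[078]|3[0-469]|4[013-9]|5[1-8]|6[0-6]|7|"
    r"8[1-469]|9[0-58]|[2-9]..)(\d+)$"
)


def split_number(number: str) -> Optional[Tuple[str, str]]:
    """Return (country_code, national_number), or None if there is no match."""
    m = E164_SPLIT.match(number)
    return (m.group(1), m.group(2)) if m else None


print(split_number("+14155552671"))   # ('1', '4155552671')
print(split_number("+442071838750"))  # ('44', '2071838750')
print(split_number("+35312345678"))   # ('353', '12345678')
```

The three-digit fallback [2-9].. accepts any three digits, so this version does not check that a three-digit code is actually assigned.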
A simple version for European numbers that matches numbers like 0034617393211 but also longer ones such as 004401484172842.
^0{2}[0-9]{11,}
Hope it helps :·)
public static boolean validateInternationalPhoneNumberFormat(String phone) {
    StringBuilder sb = new StringBuilder(200);
    // Country code
    sb.append("^(\\+{1}[\\d]{1,3})?");
    // Area code, with or without parentheses
    sb.append("([\\s])?(([\\(]{1}[\\d]{2,3}[\\)]{1}[\\s]?)|([\\d]{2,3}[\\s]?))?");
    // Phone number separator can be "-", "." or " "
    // Minimum of 5 digits (for fixed line phones in Solomon Islands)
    sb.append("\\d[\\-\\.\\s]?\\d[\\-\\.\\s]?\\d[\\-\\.\\s]?\\d[\\-\\.\\s]?\\d[\\-\\.\\s]?");
    // 4 more optional digits
    sb.append("\\d?[\\-\\.\\s]?\\d?[\\-\\.\\s]?\\d?[\\-\\.\\s]?\\d?$");
    return Pattern.compile(sb.toString()).matcher(phone).find();
}
The international numbering plan is based on the ITU E.164 numbering plan. I guess that's the starting point to your regular expression.
I'll update this if I get around to creating a regular expression based on the ITU E.164 numbering.
This Regex Expression works for India, Canada, Europe, New Zealand, Australia, United States phone numbers, along with their country codes:
"^(\+(([0-9]){1,2})[-.])?((((([0-9]){2,3})[-.]){1,2}([0-9]{4,10}))|([0-9]{10}))$"
This works for me, without 00, 001, 0011 etc prefix though:
/^\+*(\d{3})*[0-9,\-]{8,}/
Try this, it works for me.
^(00|\+)[1-9]{1}([0-9][\s]*){9,16}$
^\+[1-9]\d{10,14}$
This will match "e164 phone numbers"