I have written some basic regex:
/[0123BCDER]/g
I would like to extract the bold numbers from the lines below. The regex I have written extracts the character; however, I would only like to extract the character if it is surrounded by whitespace. Any help would be much appreciated.
721658A220421EE5867 AMBER YUR DE STE 30367887462580 **1** 00355132
172638A220421ER3028 NIKITA YUAN 318058763400580 **1** 00355133
982658A230421MC1234 SEAN D W MC100050420965155230421 **3** 14032887609303 00355134
Please note the character or digit will always be by itself.
You are looking for something like this: /\s\d\s/g.
\s - matches a whitespace character,
\d - matches any digit,
/g - matches all occurrences.
You can also replace \d with e.g. [0123BCDER] (your example) or [0-9A-Za-z] (all alphanumeric characters).
const input = `721658A220421EE5867 AMBER YUR DE STE 30367887462580 1 00355132
172638A220421ER3028 NIKITA YUAN 318058763400580 1 00355133 _
982658A230421MC1234 SEAN D W MC100050420965155230421 3 14032887609303 00355134
`
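// single digits surrounded by whitespace (each match includes that whitespace)
const res = input.match(/\s\d\s/g)
console.log(res)
// single alphanumeric characters surrounded by whitespace
const res2 = input.match(/\s[A-Za-z0-9]\s/g)
console.log(res2)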
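Note that the matches from /\s\d\s/g include the surrounding whitespace (for example " 1 "), and because each match consumes its trailing space, two single characters in a row (such as "D W") yield only the first one. Below is a minimal sketch of one way around both issues, assuming a JavaScript engine with lookbehind support (for example a recent Node.js). The helper name extractIsolated is made up for illustration.

```javascript
const input = `721658A220421EE5867 AMBER YUR DE STE 30367887462580 1 00355132
172638A220421ER3028 NIKITA YUAN 318058763400580 1 00355133
982658A230421MC1234 SEAN D W MC100050420965155230421 3 14032887609303 00355134
`;

// Hypothetical helper: returns every character matching `charClass` that has
// whitespace on both sides. The lookarounds check the neighbours without
// consuming them, so the whitespace is not part of the match.
function extractIsolated(text, charClass) {
  const pattern = new RegExp(`(?<=\\s)${charClass}(?=\\s)`, "g");
  return text.match(pattern) || [];
}

const digits = extractIsolated(input, "\\d");
const alnum = extractIsolated(input, "[A-Za-z0-9]");
console.log(digits); // [ '1', '1', '3' ]
console.log(alnum);  // [ '1', '1', 'D', 'W', '3' ]
```

If the character must also be picked up at the very start or end of the input, the lookarounds could be widened to (?<=^|\s) and (?=\s|$).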
I am trying to figure out how to replace spaces in a text like the example below, but I don't know how to deal with a different number of spaces in the same text.
This text:
E m  se guida,  a  e mpre sa  deu  ba ixa  e m
cerca  de  $82  b ilhões  ( ma is  de  75 %)  de  se us  a t ivos.
Should be:
Em seguida, a empresa deu baixa em
cerca de $82 bilhões (mais de 75%) de seus ativos.
Note that there are single spaces between characters and double spaces between words.
Could someone shed some light on this?
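I would approach this in two steps. First, I would use one regex to remove all of the single spaces, and then another to shorten the double spaces. To find only single spaces, you can use this regex:
(\S)\s(?=\S)
The lookahead (?=\S) checks the following character without consuming it, so consecutive single-letter fragments such as "a t ivos" are fully joined in one pass (with a plain (\S)\s(\S), the "t" would be consumed by the first match and "t ivos" would be left alone). If the text spans several lines, use a literal space instead of \s, since \s also matches line breaks.
Next, to find double spaces, you can use this regex:
\s\s+
So first, replace each match of the first regex with group one ($1), and then replace each match of the second regex with a single space.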
In the Atom editor, you can apply these two regexes with find and replace (with regex mode enabled). For the second one, you do have to enter one space in the replace field. Hope this helps!
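Outside an editor, the same two-step idea can be sketched in JavaScript. This is only an illustration under a few assumptions: the input really uses single spaces inside words and double spaces between words (as the question describes), the engine supports lookbehind, and literal spaces are used instead of \s so that line breaks are left alone. The helper name fixSpacing is hypothetical.

```javascript
// Sample text from the question, with the spacing it describes:
// single spaces inside words, double spaces between words.
const garbled =
  "E m  se guida,  a  e mpre sa  deu  ba ixa  e m\n" +
  "cerca  de  $82  b ilhões  ( ma is  de  75 %)  de  se us  a t ivos.";

// Hypothetical helper implementing the two steps.
function fixSpacing(text) {
  return text
    // Step 1: drop a single space that sits between two non-space characters.
    // The lookarounds do not consume the neighbours, so "a t ivos" -> "ativos".
    .replace(/(?<=\S) (?=\S)/g, "")
    // Step 2: collapse each run of two or more spaces into one space.
    .replace(/ {2,}/g, " ");
}

const fixed = fixSpacing(garbled);
console.log(fixed);
// Em seguida, a empresa deu baixa em
// cerca de $82 bilhões (mais de 75%) de seus ativos.
```

The order matters: collapsing the double spaces first would make word boundaries indistinguishable from the stray single spaces.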
I'm trying to break a string vector into several variables using regular expressions in R, preferably in a dplyr-tidyr way using the tidyr::extract command. For instance, in the vector below:
sasdic <- data.frame(a=c(
'#1 ANO_CENSO 5. /*Ano do Censo*/',
'#71 TP_SEXO $Char1. /*Sexo*/',
'#72 TP_COR_RACA $Char1. /*Cor/raça*/',
'#74 FK_COD_PAIS_ORIGEM 4. /*Código País de origem*/' ))
I would like:
the first number ([0-9]+) to go to variable "int_pos"
the variable name joined by underscores ([a-zA-Z_]+) to go to variable "var_name"
the second number or the term $Char1 (could be $Char2, etc.) to go to variable "x". I figured ([0-9]+|$Char[0-9]+) could select this?
lastly, whatever comes between "/* ... */" to go to variable "label" (I don't know the regex for this).
All other intermediate characters (blank spaces, ".", "/", "*") should be ignored.
This would be the result:
d <- data.frame(int_pos=c(1,71,72,74),
                var_name=c('ANO_CENSO','TP_SEXO','TP_COR_RACA','FK_COD_PAIS_ORIGEM'),
                x=c('5','$Char1','$Char1','4'),
                label=c('Ano do Censo','Sexo','Cor/raça','Código País de origem') )
I tried to construct a regular expression for this. This is what I have so far:
sasdic %>% extract(a, c('int_pos','var_name','x','label'),
"([0-9]+)([a-zA-Z_]+)([0-9]+|$Char[0-9]+)(something to get the label")
-> d
The regular expression above is incomplete. Also, I don't know how to make explicit, in the extract command syntax, which parts are to be captured and which parts are to be left out.
In the regex used, we match a punctuation character ([[:punct:]]), i.e. the #, followed by capturing the numeric part ((\\d+); this will be our first column of interest), followed by one or more whitespace characters (\\s+), followed by the second capture group ((\\S+): one or more non-whitespace characters, i.e. "ANO_CENSO" for the first row), followed by whitespace (\\s+). Then we capture the third group (([[:alnum:]$]+): one or more characters that are alphanumeric or $, so as to match $Char1). Next we match one or more characters that are not a letter ([^A-Za-z]+; this gets rid of the ".", the space, "/" and "*"), and in the last part we capture one or more characters that are not * (([^*]+)).
library(tidyr)  # provides extract(); also re-exports %>%
sasdic %>%
  extract(a, into=c('int_pos', 'var_name', 'x', 'label'),
          "[[:punct:]](\\d+)\\s+(\\S+)\\s+([[:alnum:]$]+)[^A-Za-z]+([^*]+)")
# int_pos var_name x label
#1 1 ANO_CENSO 5 Ano do Censo
#2 71 TP_SEXO $Char1 Sexo
#3 72 TP_COR_RACA $Char1 Cor/raça
#4 74 FK_COD_PAIS_ORIGEM 4 Código País de origem
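To make the role of each capture group easier to see, here is a rough JavaScript sketch of the same idea using named groups. It is not a translation of the tidyr call: JavaScript has no POSIX classes such as [[:punct:]], so the pattern below is a stricter variant written for the four sample lines, and the group names simply mirror the column names from the question.

```javascript
const lines = [
  "#1 ANO_CENSO 5. /*Ano do Censo*/",
  "#71 TP_SEXO $Char1. /*Sexo*/",
  "#72 TP_COR_RACA $Char1. /*Cor/raça*/",
  "#74 FK_COD_PAIS_ORIGEM 4. /*Código País de origem*/",
];

// One named group per target column. Everything outside the groups
// ("#", spaces, ".", "/*", "*/") is matched but not captured.
const pattern =
  /^#(?<int_pos>\d+)\s+(?<var_name>\w+)\s+(?<x>\$?\w+)\.\s*\/\*(?<label>.*?)\*\/$/;

const rows = lines.map((line) => {
  const m = line.match(pattern);
  return m ? { ...m.groups } : null; // null marks a line that did not match
});

console.log(rows);
```

All captured values are strings here, so int_pos would still need Number(...) if a numeric column is wanted.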
This is another option, though it uses the data.table package instead of tidyr:
library(data.table)
setDT(sasdic)
# split label
sasdic[, c("V1","label") := tstrsplit(a, "/\\*|\\*/")]
# remove leading "#", split remaining parts
sasdic[, c("int_pos","var_name","x") := tstrsplit(gsub("^#","",V1)," +")]
# remove unneeded columns
sasdic[, c("a","V1") := NULL]
sasdic
# label int_pos var_name x
# 1: Ano do Censo 1 ANO_CENSO 5.
# 2: Sexo 71 TP_SEXO $Char1.
# 3: Cor/raça 72 TP_COR_RACA $Char1.
# 4: Código País de origem 74 FK_COD_PAIS_ORIGEM 4.
This assumes that the "remaining parts" (aside from the label) are space-separated.
This could also be done in one block (which is what I would do):
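# the names on the left must be in the same order as the values built below:
# NULL drops "a", then the three space-separated parts, then the label
sasdic[, c("a","int_pos","var_name","x","label") := {
  x = tstrsplit(a, "/\\*|\\*/")
  x1s = tstrsplit(gsub("^#","",x[[1]])," +")
  c(list(NULL), x1s, x[2])
}]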
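For comparison, the split-based approach can be sketched for a single line in JavaScript. This is only an illustration of the splitting logic, not of data.table itself, and like the code above it assumes the parts before the label are space-separated.

```javascript
const line = "#72 TP_COR_RACA $Char1. /*Cor/raça*/";

// Split on the comment delimiters "/*" and "*/": the first piece holds the
// space-separated fields, the second piece is the label.
const [fields, label] = line.split(/\/\*|\*\//);

// Drop the leading "#", trim the trailing space, then split on runs of spaces.
const [int_pos, var_name, x] = fields.replace(/^#/, "").trim().split(/ +/);

console.log({ int_pos, var_name, x, label });
```

As in the data.table output, x keeps its trailing "." ("$Char1."), so an extra replace would be needed to strip it.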
You could use the unglue package:
library(unglue)
unglue_unnest(sasdic, a, "#{int_pos}{=\\s+}{varname}{=\\s+}{x}.{=\\s+}/*{label}*/")
#> int_pos varname x label
#> 1 1 ANO_CENSO 5 Ano do Censo
#> 2 71 TP_SEXO $Char1 Sexo
#> 3 72 TP_COR_RACA $Char1 Cor/raça
#> 4 74 FK_COD_PAIS_ORIGEM 4 Código País de origem
The regular expression below works for me in MATLAB / Octave:
A = 'Var Name 123.5'
[si ei xt mt] = regexp(A, '(\d)*(\.)?(\d)*$')
number = str2num(mt{1})
number = 123.50
But I get a syntax error with the code below, most likely caused by the ]:
A='[angle_deg = 75.01323334803705]'
[si ei xt mt] = regexp(A, '(\d)*(\.)?(\d)*$])
How can I fix this regular expression?
Your regular expression from the first method is fine, assuming that you are looking for a number at the end of the string. Because your new string has a closing ] character, that regular expression won't work: the string no longer ends in a number. Simply removing the $ anchor should work, as you want to search for one number that may or may not have a fractional part. You have three capturing groups in your regex: the first grabs the integer part of the number, the second optionally grabs a decimal point, and the last grabs the fractional part of your number.
You also did not close your string properly in your regex. It needs an ending single quotation. Therefore:
A='[angle_deg = 75.01323334803705]';
[si ei xt mt] = regexp(A, '(\d)*(\.)?(\d)*');
Displaying all of the output variables from regexp, this is what I get:
>> si
si =
14
>> ei
ei =
30
>> xt
xt =
[3x2 double]
>> mt
mt =
'75.01323334803705'
si denotes the starting index of where the match happened, which is index 14 in your string. ei denotes the ending index of where the match happened, which is index 30. xt shows you the starting and ending index that matched each token or capturing group of your regular expression. To display this, simply do:
>> xt{1}
ans =
14 15
16 16
17 30
Therefore, the first capturing group begins at index 14 and ends at index 15, which is the 75 portion of your number. The second capturing group begins at index 16 and also ends there, which denotes the . character. Finally, indices 17 to 30 cover the fractional part of your number, which is 01323334803705. To finish it all off, mt shows you the extracted string that matched the regular expression, which is the number inside the brackets. You can then convert this string into a number with str2num(mt{1}), as in your first example.
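For readers more familiar with JavaScript, the same extraction can be sketched as follows. Two differences from the MATLAB/Octave call are assumptions of this sketch: the pattern requires at least one digit (\d+), because a JavaScript regex would otherwise succeed with an empty match at the start of the string, and the repetition is placed inside the groups so that each group holds its whole part of the number. Indices are converted to 1-based so they can be compared with si and ei.

```javascript
const A = "[angle_deg = 75.01323334803705]";

// Integer part, optional decimal point, optional fractional part.
const m = /(\d+)(\.)?(\d*)/.exec(A);

const start = m.index + 1;         // 1-based start, comparable to si
const end = m.index + m[0].length; // 1-based inclusive end, comparable to ei
const number = parseFloat(m[0]);   // numeric value of the matched text

console.log(start, end); // 14 30
console.log(m[1], m[2], m[3]); // 75 . 01323334803705
console.log(number);
```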