Regular Expression - difficulties with comma - regex

I am facing two problems with commas:
I want to match DE 99, SF 99 and DE 99 SF 99 with the same pattern; the only difference is the comma. My input has a Data Element number (DE) and its Subfield number (SF). The SF isn't always present, which I managed to handle in the code below. The issue is that sometimes DE and SF are separated by "," and other times not.
The other problem is that a currency value, or any value containing ",", is cut off at the comma. Below is what I am doing, along with some test cases. Note that the value can be numeric or alphanumeric.
Found and read correctly the value
wholeLine: DE 3, SF 1 = 20
OUTPUT: DE 3, SF 1 = 20
Found and read correctly the value
wholeLine: DE 26 = 6538
OUTPUT: DE 26 = 6538
Found but read wrongly the value because only reads before “,”
wholeLine: DE 4 = 3,727
OUTPUT: DE 4 = 3
Not Found
wholeLine: DE 63 SF 2 = xyz
Pattern patternDE = Pattern.compile("DE \\d+(, SF \\d+)* = \\w+");
Matcher matcherDE = patternDE.matcher(wholeLine);
while (matcherDE.find()){
String wholeThing = matcherDE.group();
System.out.println(wholeThing);
}

It looks like you should be using
DE \\d+,?( SF \\d+)* = \\w+
? is a quantifier meaning zero or one, so this looks for DE followed by a space, then one or more digits, then an optional comma, followed by the rest of your regex, which already works.
The problem with the last part of your output is that you're matching word characters (\\w), which don't include commas. Try matching non-space characters instead: \\S+

The part (, SF \\d+)* acts as a single group, so the comma cannot be optional independently of the SF part. Moving the , out of the group and making it optional fixes this.
For the currency problem, try replacing \\w+ with [\\w,]+ so that commas are included.
DE \\d+(, SF \\d+)* = \\w+ // original
DE \\d+,?( SF \\d+)* = \\w+ // comma moved out of the group and made optional
DE \\d+,?( SF \\d+)* = [\\w,]+ // value may contain a comma separator
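As a quick check of the combined pattern against the four test lines from the question, here is a minimal sketch in Python (used only for illustration; in a Java string literal each backslash is doubled, as in the lines above). The helper name find_records is made up for this example, and writing the SF part as an optional non-capturing group is an assumption about the input, not something the question states.

```python
import re

# Optional comma after the DE number, optional SF part, and a value made of
# word characters and commas (so "3,727" is kept whole).
PATTERN = re.compile(r"DE \d+,?(?: SF \d+)? = [\w,]+")

def find_records(line):
    """Return every DE/SF record found in the line (hypothetical helper)."""
    return [m.group() for m in PATTERN.finditer(line)]

for line in ["DE 3, SF 1 = 20", "DE 26 = 6538", "DE 4 = 3,727", "DE 63 SF 2 = xyz"]:
    print(find_records(line))
```

One thing to keep in mind: [\w,]+ also accepts a trailing comma, so if several records can appear on one line separated by commas, the value class would need to be tightened.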

Related

REGEX Two expressions [duplicate]

I have written some basic regex:
/[0123BCDER]/g
I would like to extract the bold numbers from the below lines. The regex i have written extracts the character however i would only like to extract the character if it is surrounded by white space. Any help would be much appreciated.
721658A220421EE5867 AMBER YUR DE STE 30367887462580 **1** 00355132
172638A220421ER3028 NIKITA YUAN 318058763400580 **1** 00355133
982658A230421MC1234 SEAN D W MC100050420965155230421 **3** 14032887609303 00355134
Please note the character or digit will always be by itself.
You are looking for something like this: /\s\d\s/g.
\s - matches a whitespace character,
\d - matches any digit,
/g - matches all occurrences.
You can also replace \d with e.g. [0123BCDER] (your example) or [0-9A-Za-z] (all alphanumeric characters).
const input = `721658A220421EE5867 AMBER YUR DE STE 30367887462580 1 00355132
172638A220421ER3028 NIKITA YUAN 318058763400580 1 00355133 _
982658A230421MC1234 SEAN D W MC100050420965155230421 3 14032887609303 00355134
`
// with whitespaces
const res = input.match(/\s\d\s/g)
console.log(res)
// alphanumeric
const res2 = input.match(/\s[A-Za-z0-9]\s/g)
console.log(res2)
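One caveat with \s...\s is that the surrounding whitespace becomes part of each match, and two lone characters in a row (such as "D W" in the third line) share a space, so the second one is skipped. A possible alternative is to use lookarounds, which check for whitespace without consuming it. The sketch below uses Python purely for illustration (JavaScript accepts the same lookahead syntax, and lookbehind in recent engines); the sample text is the question's input without the bold markers.

```python
import re

text = (
    "721658A220421EE5867 AMBER YUR DE STE 30367887462580 1 00355132\n"
    "172638A220421ER3028 NIKITA YUAN 318058763400580 1 00355133\n"
    "982658A230421MC1234 SEAN D W MC100050420965155230421 3 14032887609303 00355134\n"
)

# Whitespace is consumed, so matches include it and "W" is missed after "D".
consuming = re.findall(r"\s[A-Za-z0-9]\s", text)

# Lookbehind/lookahead only assert the whitespace; matches are the bare characters.
lookaround = re.findall(r"(?<=\s)[A-Za-z0-9](?=\s)", text)
digits_only = re.findall(r"(?<=\s)\d(?=\s)", text)

print(consuming)    # [' 1 ', ' 1 ', ' D ', ' 3 ']
print(lookaround)   # ['1', '1', 'D', 'W', '3']
print(digits_only)  # ['1', '1', '3']
```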

How can I replace wrong spaces in a text using REGEX?

I am trying to figure out how to replace spaces in a text like the example below but I don't know how to deal with different number of spaces in the same text
This text:
E m se guida, a e mpre sa deu ba ixa e m
cerca de $82 b ilhões ( ma is de 75 %) de se us a t ivos.
Should be:
Em seguida, a empresa deu baixa em
cerca de $82 bilhões (mais de 75%) de seus ativos.
Note that there are single spaces between characters and double spaces between words.
Could someone shed some light on this?
I would approach this in two steps. First, I would use a regex to replace all of the single spaces, and then another to shorten the double spaces. To find only single spaces, you can use this regex:
(\S)\s(\S)
Next, to find double spaces, you can use this regex:
\s\s+
So first, replace single spaces with groups one and two from the first regex, and then replace double spaces with a single space using the second regex.
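In the Atom editor, you can apply these two regexes with find and replace (regex mode enabled): first search for (\S)\s(\S) and replace it with the two groups (e.g. $1$2), then search for \s\s+ and replace it with a single space.
Note that the replacement in the second step is exactly one space character, which is easy to miss. Hope this helps!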
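The two-step idea can be sketched in Python as follows. This is a variation rather than the exact regexes above: it uses lookarounds and a literal space, because (\S)\s(\S) consumes the neighbouring characters (so "a t ivos" would need a second pass) and \s would also remove a line break that sits between two words. The function name fix_spacing is hypothetical, and the sample assumes double spaces between words, as described in the question.

```python
import re

def fix_spacing(text):
    """Remove lone spaces inside words, then shrink longer runs to one space."""
    # Step 1: drop a single space that has a non-space character on both sides.
    joined = re.sub(r"(?<=\S) (?=\S)", "", text)
    # Step 2: collapse each run of two or more spaces into one space.
    return re.sub(r" {2,}", " ", joined)

sample = (
    "E m  se guida,  a  e mpre sa  deu  ba ixa  e m\n"
    "cerca  de  $82  b ilhões  ( ma is  de  75 %)  de  se us  a t ivos."
)
print(fix_spacing(sample))
```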

Break string into several columns using tidyr::extract regex

I'm trying to break a string vector into several variables using regular expressions in R, preferably in a dplyr-tidyr way using the tidyr::extract command. For instance, in the vector below:
sasdic <- data.frame(a=c(
'#1 ANO_CENSO 5. /*Ano do Censo*/',
'#71 TP_SEXO $Char1. /*Sexo*/',
'#72 TP_COR_RACA $Char1. /*Cor/raça*/',
'#74 FK_COD_PAIS_ORIGEM 4. /*Código País de origem*/' ))
I would like:
the first number ([0-9]+) to go to the variable "int_pos";
the variable name joined by underscores ([a-zA-Z_]+) to go to the variable "var_name";
the second number or the term $Char1 (could be $Char2, etc.) to go to the variable "x". I figured ([0-9]+|$Char[0-9]+) could select this?
Lastly, whatever comes between "/*" and "*/" to go to the variable "label" (I don't know the regex for this).
All other intermediate characters (blank spaces, ".", "/", "*") should be discarded.
This would be the result:
d <- data.frame(int_pos=c(1,71,72,74),
var_name=c('ANO_CENSO','TP_SEXO','TP_COR_RACA','FK_COD_PAIS_ORIGEM'),
x=c('5','$Char1','$Char1','4'),
label=c('Ano do Censo','Sexo','Cor/raça','Código País de origem') )
I tried to construct a regular expression for this. This is what I have so far:
sasdic %>% extract(a, c('int_pos','var_name','x','label'),
"([0-9]+)([a-zA-Z_]+)([0-9]+|$Char[0-9]+)(something to get the label")
-> d
The regular expression above is incomplete. Also, I don't know how to make explicit, in the extract command syntax, which parts should be captured and which should be left out.
In the regex used, we match a punctuation character ([[:punct:]]), i.e. the #, and then capture the numeric part ((\\d+), our first column of interest). This is followed by one or more whitespace characters (\\s+) and the second capture group ((\\S+), one or more non-whitespace characters, i.e. "ANO_CENSO" for the first row), then whitespace again (\\s+). The third group (([[:alnum:]$]+)) captures one or more characters that are alphanumeric or $, so that it also matches $Char1. Next we match one or more characters that are not a letter ([^A-Za-z]+), which gets rid of the ., the space and the /*. Finally, we capture one or more characters that are not * (([^*]+)), which is the label.
library(dplyr)
library(tidyr)
sasdic %>%
extract(a, into=c('int_pos', 'var_name', 'x', 'label'),
"[[:punct:]](\\d+)\\s+(\\S+)\\s+([[:alnum:]$]+)[^A-Za-z]+([^*]+)")
# int_pos var_name x label
#1 1 ANO_CENSO 5 Ano do Censo
#2 71 TP_SEXO $Char1 Sexo
#3 72 TP_COR_RACA $Char1 Cor/raça
#4 74 FK_COD_PAIS_ORIGEM 4 Código País de origem
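For readers who want to try the same capture-group idea outside R, here is a rough Python sketch. It is not the tidyr call above: the pattern is a slightly stricter variant that anchors on the literal "." and on the /* ... */ delimiters, and the names LINE and parse_line are hypothetical.

```python
import re

# Four capture groups: position, variable name, format (digits or $CharN), label.
LINE = re.compile(r"#(\d+)\s+(\S+)\s+([\w$]+)\.\s*/\*(.*?)\*/")

def parse_line(line):
    """Split one dictionary line into its four fields, or return None."""
    m = LINE.search(line)
    if m is None:
        return None
    int_pos, var_name, x, label = m.groups()
    return {"int_pos": int(int_pos), "var_name": var_name, "x": x, "label": label}

rows = [
    "#1 ANO_CENSO 5. /*Ano do Censo*/",
    "#71 TP_SEXO $Char1. /*Sexo*/",
    "#72 TP_COR_RACA $Char1. /*Cor/raça*/",
    "#74 FK_COD_PAIS_ORIGEM 4. /*Código País de origem*/",
]
for row in rows:
    print(parse_line(row))
```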
This is another option, though it uses the data.table package instead of tidyr:
library(data.table)
setDT(sasdic)
# split label
sasdic[, c("V1","label") := tstrsplit(a, "/\\*|\\*/")]
# remove leading "#", split remaining parts
sasdic[, c("int_pos","var_name","x") := tstrsplit(gsub("^#","",V1)," +")]
# remove unneeded columns
sasdic[, c("a","V1") := NULL]
sasdic
# label int_pos var_name x
# 1: Ano do Censo 1 ANO_CENSO 5.
# 2: Sexo 71 TP_SEXO $Char1.
# 3: Cor/raça 72 TP_COR_RACA $Char1.
# 4: Código País de origem 74 FK_COD_PAIS_ORIGEM 4.
This assumes that the "remaining parts" (aside from the label) are space-separated.
This could also be done in one block (which is what I would do):
sasdic[, c("a","label","int_pos","var_name","x") := {
x = tstrsplit(a, "/\\*|\\*/")
x1s = tstrsplit(gsub("^#","",x[[1]])," +")
# order must match the column names on the left: a (dropped), label, then the three split parts
c(list(NULL), x[2], x1s)
}]
You could use the package unglue :
library(unglue)
unglue_unnest(sasdic, a, "#{int_pos}{=\\s+}{varname}{=\\s+}{x}.{=\\s+}/*{label}*/")
#> int_pos varname x label
#> 1 1 ANO_CENSO 5 Ano do Censo
#> 2 71 TP_SEXO $Char1 Sexo
#> 3 72 TP_COR_RACA $Char1 Cor/raça
#> 4 74 FK_COD_PAIS_ORIGEM 4 Código País de origem

parsing regular expression syntax issue matlab / octave

I can parse the regular expression in matlab / octave below:
A = 'Var Name 123.5'
[si ei xt mt] = regexp(A, '(\d)*(\.)?(\d)*$')
number = str2num(mt{1})
number = 123.50
But I get a syntax error below, most likely caused by the ]
A='[angle_deg = 75.01323334803705]'
[si ei xt mt] = regexp(A, '(\d)*(\.)?(\d)*$])
How can I fix this regular expression?
Your regular expression from the first method is fine assuming that you are looking for a number at the end of the string. Because you have an ending ] character in your new string, your regular expression won't work because your string does not end in a number. As such, simply removing the $ character should work, as you want to search for one number that may or may not be floating point. You have three capturing groups in your regex, where the first capture group grabs the integer part of the number, the second capture group optionally grabs a decimal point, and the last capture group grabs the floating point portion of your number.
You also did not close your string properly in your regex. It needs an ending single quotation. Therefore:
A='[angle_deg = 75.01323334803705]';
[si ei xt mt] = regexp(A, '(\d)*(\.)?(\d)*');
Displaying all of the output variables from regexp, this is what I get:
>> si
si =
14
>> ei
ei =
30
>> xt
xt =
[3x2 double]
>> mt
mt =
'75.01323334803705'
si denotes the starting index of where the match happened, which is index 14 in your string. ei denotes the ending index of where the match happened, which is index 30. xt shows you the starting and ending index that matched each token or capturing group of your regular expression. To display this, simply do:
>> xt{1}
ans =
14 15
16 16
17 30
Therefore, the first capturing group begins at index 14 and ends at index 15, which is the 75 portion of your number. The second capturing group begins at index 16 and also ends there, which denotes the . character. Finally, index 17 to 30 denote the floating point portion of your number, which is 01323334803705. To finish it all off, mt shows you the extracted string that matched the regular expression, which is the number at the end of this string. You can certainly convert this string into a number by using str2num.
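To cross-check the indices above outside MATLAB/Octave, here is a small Python sketch of the same match. Two things differ and are assumptions of this example: the integer part is written as (\d+) instead of (\d)*, because Python's re.search would otherwise return an empty match at the start of the string, and Python spans are 0-based with an exclusive end, so the hypothetical helper one_based converts them to MATLAB-style 1-based inclusive indices.

```python
import re

text = "[angle_deg = 75.01323334803705]"

# Integer part, optional decimal point, fractional part.
match = re.search(r"(\d+)(\.)?(\d*)", text)

def one_based(span):
    """Convert a 0-based, end-exclusive span to 1-based, end-inclusive."""
    start, end = span
    return (start + 1, end)

whole = one_based(match.span())                          # like si and ei
tokens = [one_based(match.span(i)) for i in (1, 2, 3)]   # like xt{1}
number = float(match.group())                            # like str2num(mt{1})

print(whole)   # (14, 30)
print(tokens)  # [(14, 15), (16, 16), (17, 30)]
print(number)  # 75.01323334803705
```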