How can I replace wrong spaces in a text using REGEX?


I am trying to figure out how to replace spaces in a text like the example below, but I don't know how to deal with a varying number of spaces in the same text.
This text:
E m  se guida,  a  e mpre sa  deu  ba ixa  e m
cerca  de  $82  b ilhões  ( ma is  de  75 %)  de  se us  a t ivos.
Should be:
Em seguida, a empresa deu baixa em
cerca de $82 bilhões (mais de 75%) de seus ativos.
Note that there are single spaces between characters and double spaces between words.
Could someone shed some light on this?

I would approach this in two steps: first use one regex to remove the single spaces, then a second to collapse the double spaces. To find a single space between two non-space characters, you can use this regex:
(\S)\s(\S)
To find runs of two or more spaces, you can use this regex:
\s\s+
So first replace each match of the first regex with its groups one and two (which drops the space), and then replace each match of the second regex with a single space. Because each match of the first regex consumes the character on either side of the space, a chunk like "a t ivos" is only partly joined in one pass, so repeat the first replacement until nothing changes.
In the Atom editor, you can run both as regex find-and-replace operations. For the second one, the replacement field has to contain exactly one space, which is easy to miss. Hope this helps!
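For anyone who wants to script this rather than use an editor, here is a minimal Python sketch of the same two-step idea. It is not the answer's exact pattern: it swaps the capture groups for lookarounds, so neighbouring characters are not consumed and one pass is enough. It also assumes that only literal spaces (not newlines or tabs) need fixing. The function name `fix_spacing` is made up for illustration.

```python
import re

def fix_spacing(text: str) -> str:
    # Step 1: drop a single space that sits directly between two non-space characters.
    joined = re.sub(r"(?<=\S) (?=\S)", "", text)
    # Step 2: collapse each run of two or more spaces into one space.
    return re.sub(r" {2,}", " ", joined)

sample = "E m  se guida,  a  e mpre sa  deu  ba ixa  e m"
print(fix_spacing(sample))  # Em seguida, a empresa deu baixa em

# For comparison: one pass of the capture-group version leaves "a t ivos" half-joined.
print(re.sub(r"(\S)\s(\S)", r"\1\2", "a t ivos"))  # at ivos
```

Matching a literal space instead of \s also keeps line breaks intact, since a newline between two words would otherwise count as a "single space".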

Related

Notepad++ add new line above changing syntax with replace

I have a constant syntax of "Se ", but the number in front of it changes. I want to add a newline \n before the number. I've tried using \c to match any character (for the changing number) in the replacement, but I don't know how to get the number part to carry over.
This is what it currently looks like:
1 hinge 2pk
1 Se wall cabinet
4 door 15x40"
I want a blank line above any item that includes "Se", so that it looks like this:
1 hinge 2pk

1 Se wall cabinet
4 door 15x40"
This is what I've tried so far (not including the brackets):
REPLACE TOOL
Find what: [\C Se ]
Replace with: [\n\C Se ]
✓ = Regular expression
but this is what I get
1 hinge 2pk
C Se wall cabinet
4 door 15x40
How do I get the number to the left of "Se" to carry over (as this number is always changing)?

You can use:
^\d+\h+Se\b
^ Start of the line
\d+ Match 1+ digits
\h+ Match 1+ horizontal whitespace characters (spaces or tabs)
Se\b Match Se followed by a word boundary
In the replacement, use a newline followed by the full match: \n$0
Find what:
^\d+\h+Se\b
Replace with
\n$0
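To try the same substitution outside Notepad++, here is a minimal Python sketch. It makes a few substitutions that are assumptions on my part: Python's re module has no \h, so [ \t] stands in for horizontal whitespace; re.MULTILINE makes ^ match at the start of every line; and \g<0> plays the role of $0. The function name is made up for illustration.

```python
import re

def add_blank_line_before_se(text: str) -> str:
    # ^ anchors at each line start (MULTILINE); \g<0> re-inserts the whole match.
    return re.sub(r"^\d+[ \t]+Se\b", r"\n\g<0>", text, flags=re.MULTILINE)

items = '1 hinge 2pk\n1 Se wall cabinet\n4 door 15x40"'
print(add_blank_line_before_se(items))
```

The word boundary after Se means a line such as "12 Seat cover" is left alone.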

Well, try this simple code, hope it helps...
Find: ^(\d.*? Se .*\n)
Replace with: \n$1 or \n\1

Extract specific text surrounded by white space

I have written some basic regex:
/[0123BCDER]/g
I would like to extract the bold numbers from the lines below. The regex I have written extracts the character, but I only want to extract it if it is surrounded by white space. Any help would be much appreciated.
721658A220421EE5867 AMBER YUR DE STE 30367887462580 **1** 00355132
172638A220421ER3028 NIKITA YUAN 318058763400580 **1** 00355133
982658A230421MC1234 SEAN D W MC100050420965155230421 **3** 14032887609303 00355134
Please note the character or digit will always be by itself.

You are looking for something like this: /\s\d\s/g.
\s - match whitespace,
\d - match any digit,
/g - match all occurrences.
You can also replace \d with e.g. [0123BCDER] (your example) or [0-9A-Za-z] (all alphanumeric). Note that each match includes the surrounding whitespace, so trim it if you only need the character.
const input = `721658A220421EE5867 AMBER YUR DE STE 30367887462580 1 00355132
172638A220421ER3028 NIKITA YUAN 318058763400580 1 00355133 _
982658A230421MC1234 SEAN D W MC100050420965155230421 3 14032887609303 00355134
`
// single digits surrounded by whitespace (matches include that whitespace)
const res = input.match(/\s\d\s/g)
console.log(res)
// single alphanumeric characters surrounded by whitespace
const res2 = input.match(/\s[A-Za-z0-9]\s/g)
console.log(res2)
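If you want the character alone, without the whitespace around it, lookarounds are one option. The following is a minimal Python sketch of that variant, not the code above: (?<!\S) and (?!\S) assert "no non-space character on this side", so they also accept the start and end of the text. The variable names are mine.

```python
import re

text = """721658A220421EE5867 AMBER YUR DE STE 30367887462580 1 00355132
172638A220421ER3028 NIKITA YUAN 318058763400580 1 00355133
982658A230421MC1234 SEAN D W MC100050420965155230421 3 14032887609303 00355134"""

# A lone digit: nothing but whitespace (or the text boundary) on either side.
lone_digits = re.findall(r"(?<!\S)\d(?!\S)", text)
print(lone_digits)  # ['1', '1', '3']

# The character class from the question also picks up the lone "D" in "SEAN D W".
lone_chars = re.findall(r"(?<!\S)[0123BCDER](?!\S)", text)
print(lone_chars)  # ['1', '1', 'D', '3']
```

The second result shows why the choice of character class matters: a stray initial in a name qualifies just as well as the digit you are after.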

regex working with long lines

I have a lot of these strings in one txt file:
X00NAP-0111-OG02Flur-A 2 AIR-CAP2702I-E-K9 00:b8:b8:b8:7d:b8 0111-HGS DE 10.100.100.100 8
X006NAP-0500-EG00Grossrau-A 2 AIR-CAP2702I-E-K9 50:0f:80:94:82:c0 HGS 0500 DE 10.100.100.100 1
Y008NAP-8399-OG04OE3020-A 2 AIR-CAP2702I-E-K9 00:b8:b8:b8:7d:b8 HGS Erfurter Hof DE 10.100.100.100 1
A1234NAP-4101-OG02Raum237-A 2 AIR-CAP2602I-E-K9 00:b8:b8:b8:7d:b8 AP 2 Anmeldung V DE 10.100.100.100 0
I am only interested in the first string and the number at the end of each line. The number is at most 99.
So in the end I would like to have an output like this:
X00NAP-0111-OG02Flur-A 8
X006NAP-0500-EG00Grossrau-A 1
Y008NAP-8399-OG04OE3020-A 1
A1234NAP-4101-OG02Raum237-A 0
I tried a lot of things with regex, but nothing really worked.

Here is a general regex solution:
Find:
^(\S+).*\s(\d+)$
Replace:
$1 $2
The idea here is to match the first string and the final number as capture groups, which are the two terms in the pattern surrounded by parentheses. The \s before the final group matters: without it, the greedy .* would swallow every digit but the last, so a two-digit number such as 42 would come out as 2. The capture groups are available in the replacement as $1 and $2 (sometimes \1 and \2, depending on the regex tool/engine). Replacing each line with these two groups leaves you with the output you expect.
Note that this overwrites the original text, but if you are using a tool like Notepad++, you can simply copy the result out and then undo the replacement, or close the original file without saving.

The simplest way I can think of is:
Find: " .* "
Replace: " "
This replaces everything from the first space to the last space with a single space, achieving your goal.
Note: Quotes are only there to help show where spaces are in the regex.
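Both ideas can be checked quickly in a script. Below is a minimal Python sketch, assuming the file is processed one line at a time; the function names and the two-digit test line are made up for illustration. It also shows a pitfall worth knowing: a greedy .* placed directly before (\d+) keeps only the last digit of the number.

```python
import re

def first_and_last(line: str) -> str:
    # First token, anything in between, then whitespace and the trailing number.
    m = re.match(r"(\S+).*\s(\d+)$", line)
    return f"{m.group(1)} {m.group(2)}" if m else line

def first_and_last_by_spaces(line: str) -> str:
    # Greedy " .* " spans from the first space to the last space on the line.
    return re.sub(r" .* ", " ", line)

real = "X00NAP-0111-OG02Flur-A 2 AIR-CAP2702I-E-K9 00:b8:b8:b8:7d:b8 0111-HGS DE 10.100.100.100 8"
# Hypothetical variant of the line above with a two-digit number at the end.
made_up = "X00NAP-0111-OG02Flur-A 2 AIR-CAP2702I-E-K9 00:b8:b8:b8:7d:b8 0111-HGS DE 10.100.100.100 42"

for line in (real, made_up):
    print(first_and_last(line), "|", first_and_last_by_spaces(line))

# Pitfall: with ".*" directly before "(\d+)", only the final digit is captured.
print(re.match(r"([^\s]*).*(\d+)$", made_up).group(2))  # 2
```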

Diacritics and regular expressions in R

In R I have a column which should contain only one word. It is created by taking the contents of another column and using a regex to keep only the last word. However, for some rows this doesn't work, in which case R simply copies the content from the first column. Here is my R code:
df$precedingWord <- gsub(".*?\\W*(\\w+-?)\\W*$","\\1", df$leftContext, perl=TRUE)
precedingWord should only hold one word. It is extracted from leftContext with a regex. This works fine overall, but not with diacritics. A couple of rows in leftContext have letters with diacritics such as é and à. For some reason R ignores these items completely and simply copies the whole thing to precedingWord. I find this odd, because it is practically impossible that the regex matches the whole thing, as an online regex tester shows. In that test, the test string is leftContext and the substitution should be precedingWord.
The output in the online regex tester is different from the output I get in R, where I simply get an exact copy of leftContext. This does not mean that the tester's output is what I want: the tool considers letters with diacritics to be non-word characters, so it doesn't produce the output I am after. I actually want to treat them as word characters so that they are eligible for output.
If this is the input:
Un premier projet prévoit que l'établissement verserait 11 FF par an et par élève du secondaire et 30 FF par étudiant universitaire, une somme à évaluer et à
Outre le prêt-à-
And à
Sur base de ces données, on cherchera à
Ce sera encore le cas ce vendredi 19 juillet dans l'é
Then this is the output I expect
à
prêt-à-
à
à
é
This is the regex I already have
.*?\W*(\w+?-?)\W*$
I'm already using stringi in my project, so if that provides a solution I could use that.

In Perl-like regex, you can match any Unicode letter with the \p{L} shorthand class, and any character that is not a Unicode letter with the negated class \P{L}. See regular-expressions.info:
You can match a single character belonging to the "letter" category with \p{L}. You can match a single character not belonging to that category with \P{L}.
Thus, the regex you can use is
df$precedingWord <- gsub(".*?\\P{L}*(\\p{L}+-?)\\P{L}*$","\\1", df$leftContext, perl=TRUE)
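The behaviour of this pattern is easy to probe outside R. Here is a minimal Python sketch under one loud assumption: Python's built-in re module does not support \p{L}, so the class [^\W\d_] (a word character that is neither a digit nor an underscore) stands in for "letter" and [\W\d_] for "non-letter". The names LETTER, NON_LETTER and preceding_word are mine.

```python
import re

LETTER = r"[^\W\d_]"     # stand-in for \p{L}: word character, but not a digit or "_"
NON_LETTER = r"[\W\d_]"  # stand-in for \P{L}
LAST_WORD = re.compile(rf".*?{NON_LETTER}*({LETTER}+-?){NON_LETTER}*$")

def preceding_word(left_context: str) -> str:
    # Replace the whole string with the captured last run of letters.
    return LAST_WORD.sub(r"\1", left_context)

print(preceding_word("Sur base de ces données, on cherchera à"))  # à
print(preceding_word("Ce sera encore le cas ce vendredi 19 juillet dans l'é"))  # é
print(preceding_word("Outre le prêt-à-"))  # à-
```

Note the last line: a hyphen is not a letter, so for "Outre le prêt-à-" this pattern yields "à-" rather than the "prêt-à-" asked for in the question. Hyphenated words would need the hyphen allowed inside the captured group.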

Regular Expression - difficulties with comma

I am facing two problems with commas:
I want to match DE 99, SF 99 and DE 99 SF 99 with the same pattern. Note that the only difference is the comma. My input has a Data Element number (DE) and its Subfield number (SF). SF isn't always present, but I managed to deal with that in the code below. The issue is that sometimes DE and SF are separated by "," and other times not.
The other problem is that for a currency value, or any value containing ",", everything after the comma is lost. Below is what I am doing, along with some test cases. Note that the value can be numeric or alphanumeric.
Found, and the value is read correctly:
wholeLine: DE 3, SF 1 = 20
OUTPUT: DE 3, SF 1 = 20
Found, and the value is read correctly:
wholeLine: DE 26 = 6538
OUTPUT: DE 26 = 6538
Found, but the value is read wrongly because only the part before "," is read:
wholeLine: DE 4 = 3,727
OUTPUT: DE 4 = 3
Not found:
wholeLine: DE 63 SF 2 = xyz
Pattern patternDE = Pattern.compile("DE \\d+(, SF \\d+)* = \\w+");
Matcher matcherDE = patternDE.matcher(wholeLine);
while (matcherDE.find()) {
    String wholeThing = matcherDE.group();
    System.out.println(wholeThing);
}

Looks like you should be using
DE \\d+,?( SF \\d+)* = \\w+
? is a quantifier for zero or one, so you're looking for DE followed by a space, then one or more digits, then an optional comma, followed by the rest of your regex that's already working.
The problem you're having with the last part of your output is that you're matching word characters, which don't include commas. Try matching non-space characters instead: \\S+

The part (, SF \\d+)* acts as a group, so the comma cannot be optional independently of the rest of the group. Moving the , out of the group and making it optional fixes that.
For the currency problem, try replacing \\w+ with [\\w,]+ to include the comma.
DE \\d+(, SF \\d+)* = \\w+ // original
DE \\d+,?( SF \\d+)* = \\w+ // optional comma moved out of the group
DE \\d+,?( SF \\d+)* = [\\w,]+ // value may contain commas
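As a quick check of the final pattern against the four test lines from the question, here is a minimal Python sketch. Python raw strings take single backslashes where the Java string literal needs double ones, and the group is written as non-capturing since its contents are not used. The variable names are made up.

```python
import re

# Optional comma after the DE number, optional SF part, value may contain commas.
PATTERN = re.compile(r"DE \d+,?(?: SF \d+)* = [\w,]+")

lines = [
    "DE 3, SF 1 = 20",
    "DE 26 = 6538",
    "DE 4 = 3,727",
    "DE 63 SF 2 = xyz",
]

matches = [PATTERN.search(line).group() for line in lines]
for found in matches:
    print(found)
```

Each line should be matched in full, including the value after the comma in "3,727" and the comma-free "DE 63 SF 2" form.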
