Regex for removing repeating numbers on different lines [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
It's perhaps quite simple, but I can't figure it out:
I have a random number (can be 1,2,3 or 4 digits)
It's repeating on a second line:
2131
2131
How can I remove the first number?
EDIT: Sorry I didn't explain it better. These lines are in a plain text file, and I'm using BBEdit as my editor. The actual file looks like this (only with approx. 10,000 lines):
336
336
rinde
337
337
diving
338
338
graffiti
339
339
forest
340
340
mountain
If possible the result should look like this:
336 - rinde
337 - diving
338 - graffiti
339 - forest
340 - mountain

Search:
^(\d{1,4})\n(?:\1\n)+([a-z]+$)
Replace:
\1 - \2
I don't have access to BBEdit, but apparently you have to check the "Grep" option to enable regex search-n-replace. (I don't know why they call it that, since it seems to be powered by the PCRE library, which is much more powerful than grep.)
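For anyone who wants to try the same search and replace outside BBEdit, here is a minimal Python sketch. It assumes the file uses "\n" line endings and that the words are lowercase ASCII, as in the sample; the function name is made up for illustration.

```python
import re

# Same pattern as above: a 1-4 digit number, one or more repeats of that
# number on the following lines, then a lowercase word on its own line.
PATTERN = re.compile(r"^(\d{1,4})\n(?:\1\n)+([a-z]+$)", re.MULTILINE)


def merge_number_and_word(text: str) -> str:
    """Collapse 'number / number / word' blocks into 'number - word'."""
    return PATTERN.sub(r"\1 - \2", text)


sample = "336\n336\nrinde\n337\n337\ndiving\n"
print(merge_number_and_word(sample), end="")
```

The re.MULTILINE flag is what lets ^ and $ anchor at every line rather than only at the ends of the whole text.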

Since you didn't mention any programming language or tools, I assume the numbers are in a file, one per line, and that repeated numbers are always on neighbouring lines. In that case the uniq command can solve your problem:
kent$ echo "1234
dquote> 1234
dquote> 431
dquote> 431
dquote> 222
dquote> 222
dquote> 234"|uniq
1234
431
222
234
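If uniq is not available, the same idea (drop a line when it equals its neighbour) can be sketched in Python with itertools.groupby. The function name is only illustrative.

```python
from itertools import groupby


def drop_adjacent_duplicates(lines):
    """Keep one line from each run of identical neighbouring lines."""
    return [key for key, _run in groupby(lines)]


numbers = ["1234", "1234", "431", "431", "222", "222", "234"]
print("\n".join(drop_adjacent_duplicates(numbers)))
```

Like uniq, this only merges duplicates that sit next to each other; a value that reappears later is kept.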

Another way find: /^(\d{1,4})\n(?=\1$)/ replace: ""
modifiers mg (multi-line and global)
$str =
'1234
1234
431
431
222
222
222
234
234';
$str =~ s/^(\d{1,4})\n(?=\1$)//mg;
print $str;
Output:
1234
431
222
234
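A rough Python equivalent of the Perl substitution above, assuming "\n" line endings. The lookahead (?=\1$) only checks that the next line holds the same number, so that next line is never consumed and can itself be tested against the line after it.

```python
import re


def drop_repeated_numbers(text: str) -> str:
    # Delete a number line only when the following line is the same number.
    return re.sub(r"^(\d{1,4})\n(?=\1$)", "", text, flags=re.MULTILINE)


sample = "1234\n1234\n431\n431\n222\n222\n222\n234\n234"
print(drop_repeated_numbers(sample))
```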
Added On the revised sample, you could do something like this:
Find: /(?=^(\d{1,4}))(?:\1\n)+\s*([^\n\d]*$)/
Replace: $1 - $2
Mods: /mg (multi-line, global)
Test:
$str =
'
336
336
rinde
337
337
337
diving
338
338
graffiti
339
337
339
forest
340
340
mountain
';
$str =~ s/(?=^(\d{1,4}))(?:\1\n)+\s*([^\n\d]*$)/$1 - $2/mg;
print $str;
Output:
336 - rinde
337 - diving
338 - graffiti
339
337
339 - forest
340 - mountain
Added2 - I was more impressed with the OP's later desired output format than the original question. It has many elements to it so, unable to control myself, generated a way too complicated regex.
Search: /^(\d{1,4})\n+(?:\1\n+)*\s*(?:((?:(?:\w|[^\S\n])*[a-zA-Z](?:\w|[^\S\n])*))\s*(?:\n|$)|)/
Replace: $1 - $2\n
Modifiers: mg (multi-line, global)
Expanded-
# Find:
s{ # Find a single unique digit pattern on a line (group 1)
^(\d{1,4})\n+ # Grp 1, capture a digit sequence
(?:\1\n+)* # Optionally consume the sequence many times,
\s* # and whitespaces (cleanup)
# Get the next word (group 2)
(?:
# Either find a valid word
( # Grp2
(?:
(?:\w|[^\S\n])* # Optional \w or non-newline whitespaces
[a-zA-Z] # with at least one alpha character
(?:\w|[^\S\n])*
)
)
\s* # Consume whitespaces (cleanup),
(?:\n|$) # a newline
# or, end of string
|
# OR, dont find anything (clears group 2)
)
}
# Replace (rewrite the new block)
{$1 - $2\n}xmg; # modifiers expanded, multi-line, global

find:
((\d{1,4})\r(\D{1,10}))|(\d{1,6})
replace:
\2 - \3
You should be able to clean it up from there quite easily!

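Detecting such a pattern is not possible with a strictly regular expression, that is, one without backreferences.
If your tool lacks backreferences, you can split the string on "\n" and then compare neighbouring lines.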
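The split-and-compare approach could look roughly like this in Python. It is a sketch under the assumption that every record is a run of identical number lines followed by one word line; the function name is hypothetical.

```python
def merge_records(text: str) -> str:
    """Join each run of identical number lines with the word that follows."""
    out = []
    pending = None  # the number waiting for its word
    for line in text.split("\n"):
        if not line:
            continue  # ignore blank lines
        if line.isdigit():
            if pending is not None and line != pending:
                out.append(pending)  # a different number: keep the old one as is
            pending = line
        else:
            out.append(f"{pending} - {line}" if pending is not None else line)
            pending = None
    if pending is not None:
        out.append(pending)  # trailing number without a word
    return "\n".join(out)


print(merge_records("336\n336\nrinde\n337\n337\ndiving"))
```

A number that is not repeated, or that is interrupted by a different number, is left on its own line rather than dropped.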

Related

Regex for converting spaces to tabs but leaving word items in the middle alone?

I have a problem that my Googling tells me can be solved with Regex, but I'm completely unfamiliar and I tried following some tutorials but I'm entirely lost. I have this sample data set:
59 65 21366 CLEMENTINES 4.89 2.00 9.78
59 61 22384 PORK BACK RIBS 6.50 2.40 15.59
59 65 30669 BANANAS 1.89 1.00 1.89
59 13 391314 KODIAK POWER CAKES 14.69 1.00 14.69
59 65 392373 BAJA CHOPPED SALAD KIT 2.99 1.00 2.99
59 39 429227 FILA MENS ANKLE SOCK 6PK 9.99 1.00 9.99
59 65 1056187 ASIAN CASHEW SALAD KIT 2.99 1.00 2.99
59 28 1159696 SHOPKINS GG/TWOZIES ASST 5.97 1.00 5.97
59 13 1221327 KODIAK POWER CAKES -3.00 -3.00 COUPON
59 14 1270070 KLEENEX ULTRA SOFT 12 PCK 16.49 1.00 16.49
59 21 5221111 10 DRAWER STORAGE CART 29.99 1.00 29.99
59 17 1019 HALF + HALF 1 L 1.99 1.00 1.99
I want to import it into a spreadsheet. Visually I can see what I want: 3 numeric columns at the beginning, then a description that may or may not contain spaces, then usually 3 numeric columns, but sometimes 2 plus a word (see the line that ends in "COUPON").
But because of the spaces and lack of quotes, my Excel skills (which are also marginal) don't allow me to import this in a sensible way.
I thought of doing multiple processes: pull off the 3 columns at the left and then 3 columns at the right... but in Excel I see no way to operate "from the right".
Any help appreciated. Thanks.
[edit] I realize from the comments that my ignorance has resulted in a poor question.
I didn't realize "Regex" was specific to language, etc. I am trying to import a csv into Excel, but I was using Notepad++ to perform the regex operations. I don't know what "flavor" that uses but the answer below helped greatly.

You can match this with:
^(\S*) (\S*) (\S*) (.*) (\S*) (\S*) (\S*)$
^ matches the start of a line
\S* matches zero or more non-whitespace characters
.* matches anything, including spaces
the parentheses capture the matches into capture groups
$ matches the end of a line.
You haven't said what tool you intend to use to do this.
One way is with a Perl one-liner:
perl -pe 's/^(\S*) (\S*) (\S*) (.*) (\S*) (\S*) (\S*)$/"\1","\2","\3","\4","\5","\6","\7"/' input.txt
Returning:
"59","65","21366","CLEMENTINES","4.89","2.00","9.78"
...
"59","13","1221327","KODIAK POWER CAKES","-3.00","-3.00","COUPON"
... etc.
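If Perl is not at hand, the same capture groups can be fed to Python's csv module, which also takes care of the quoting. This is only a sketch: it assumes fields are separated by single spaces, as in the sample, and silently skips lines that do not fit the layout.

```python
import csv
import io
import re

# Three leading fields, a free-text description, three trailing fields.
ROW = re.compile(r"^(\S*) (\S*) (\S*) (.*) (\S*) (\S*) (\S*)$")


def to_csv(lines):
    """Return the matching lines as fully quoted CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in lines:
        match = ROW.match(line)
        if match:  # lines that do not fit the layout are skipped
            writer.writerow(match.groups())
    return buffer.getvalue()


rows = [
    "59 65 21366 CLEMENTINES 4.89 2.00 9.78",
    "59 13 1221327 KODIAK POWER CAKES -3.00 -3.00 COUPON",
]
print(to_csv(rows), end="")
```

The greedy (.*) in the middle backtracks just far enough to leave the last three space-separated fields for the trailing groups, which is why descriptions containing spaces end up in a single column.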

Regex match line starting with whitespace and first character is non-digit

I am trying to create a regex that matches every line except those that start with whitespace, followed by a 1-4 digit number, followed by one or more spaces. The purpose is to use it in the "Find and Replace" option of Notepad++ to remove any line that does not start with space(s) and then a number as its first non-space characters.
What I have now is allowing me to match the lines that start with whitespace and are followed with a group of digits and another space. However, these are the lines I want to keep. How can I modify the following regex so that it will match everything else other than these lines?
/^([\s]+\d[\s]|[\s]+\d\d[\s]|[\s]+\d\d\d[\s])/gm
Here's an example of the data we're using the regex on. The regex should only match the lines that DO NOT start with 1, 2, 49, 50, 99 and 100. Note that the lines that start with "40th" and "5/23/2017" should match.
Page 1
40th Marathon and 25th Marathon Relay
5/23/2017 USATF Certified Marathon (#RE98723UB) Downtown/City, ST
Timing: Race Services See our Calendar of Events at www.website.com
Results questions: http://www.website.com/fixresults
=====================================================================================
**** FINAL RESULTS IN NETTIME ORDER ****
Place Div/Tot Div Halfway 22miles Guntime Nettime Pace Name
===== ======== ===== ======= ======= ======= ======= ===== =======
1 1/153 M0139 1:15:08 2:05:50 2:29:20 2:29:20 5:42 Eric
2 2/153 M0139 1:15:07 2:06:29 2:29:56* 2:29:56 5:44 Bryan
Record 2:17:35 by Randy in 1986
49 8/77 M4049 1:36:48 2:54:03 3:37:02 3:36:59 8:17 Joshua
50 28/153 M0139 1:49:45 3:03:56 3:37:38# 3:37:22 8:18 Brian
# Under USATF OPEN guideline
99 1/16 M6069 1:56:30 3:15:24 3:51:06 3:50:46 8:49 Paul
100 3/35 F5059 1:50:06 3:11:37 3:51:03 3:50:47 8:49 Ashley
101 4/35 F5059 1:55:26 3:16:37 3:56:03 3:55:57 9:14 Joan
* Under USATF Age-Group guideline
% For an Explanation of AgeGraded Percentages, See Here: http://www.website.com/agegrading
So if we used the regex in Notepad++ to find the matching strings/lines and replace (delete) them, the desired end result would be as follows (in other words, the following lines would NOT match the regex):
1 1/153 M0139 1:15:08 2:05:50 2:29:20 2:29:20 5:42 Eric
2 2/153 M0139 1:15:07 2:06:29 2:29:56* 2:29:56 5:44 Bryan
49 8/77 M4049 1:36:48 2:54:03 3:37:02 3:36:59 8:17 Joshua
50 28/153 M0139 1:49:45 3:03:56 3:37:38# 3:37:22 8:18 Brian
99 1/16 M6069 1:56:30 3:15:24 3:51:06 3:50:46 8:49 Paul
100 3/35 F5059 1:50:06 3:11:37 3:51:03 3:50:47 8:49 Ashley
101 4/35 F5059 1:55:26 3:16:37 3:56:03 3:55:57 9:14 Joan
Any assistance would be greatly appreciated.

See regex in use here
^(?! +\d+ ).*\n*
^ Assert position at the start of the line
(?! +\d+ ) Negative lookahead ensuring what follows is not one or more spaces, then one or more digits, then a space
.* Match any character (except \n) any number of times
\n* Matches any number of newline characters
Result:
1 1/153 M0139 1:15:08 2:05:50 2:29:20 2:29:20 5:42 Eric
2 2/153 M0139 1:15:07 2:06:29 2:29:56* 2:29:56 5:44 Bryan
49 8/77 M4049 1:36:48 2:54:03 3:37:02 3:36:59 8:17 Joshua
50 28/153 M0139 1:49:45 3:03:56 3:37:38# 3:37:22 8:18 Brian
99 1/16 M6069 1:56:30 3:15:24 3:51:06 3:50:46 8:49 Paul
100 3/35 F5059 1:50:06 3:11:37 3:51:03 3:50:47 8:49 Ashley
101 4/35 F5059 1:55:26 3:16:37 3:56:03 3:55:57 9:14 Joan
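To check the pattern outside Notepad++, here is a small Python sketch. It assumes the result rows really begin with one or more spaces in the original file (the leading whitespace is not visible in the listing above); the shortened sample rows and the function name are made up for illustration.

```python
import re


def keep_result_rows(text: str) -> str:
    # Remove every line that does NOT start with spaces, digits and a space.
    return re.sub(r"^(?! +\d+ ).*\n*", "", text, flags=re.MULTILINE)


report = (
    "Page 1\n"
    "    1 1/153 Eric\n"
    "Record 2:17:35 by Randy in 1986\n"
    "   49 8/77 Joshua\n"
    "# Under USATF OPEN guideline\n"
    "  100 3/35 Ashley\n"
)
print(keep_result_rows(report), end="")
```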

If this is to use in the Find/Replace dialog then you can use a cunning trick...
^(pattern_I_want_to_keep)$|^.*$
And replace it with
\1
Anything that doesn't match what you want to keep will be removed, although it will leave a blank line. They can be removed with a plugin or another regex.
This is simpler to read than concocting a match for what you don't want to keep, or using a negative lookahead.
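A quick Python sketch of this keep-or-blank trick, followed by a second pass that drops the blank lines. The KEEP pattern here is just a hypothetical stand-in for "pattern_I_want_to_keep". In Python 3.5 and later, a group that did not take part in the match is substituted as an empty string.

```python
import re

KEEP = r" +\d+ .*"  # hypothetical "pattern I want to keep"


def keep_only(text: str, keep: str = KEEP) -> str:
    # Group 1 is empty whenever the second alternative matched, so every
    # unwanted line is replaced by an empty string.
    blanked = re.sub(rf"^({keep})$|^.*$", r"\1", text, flags=re.MULTILINE)
    # Second pass: drop the blank lines left behind.
    return "\n".join(line for line in blanked.split("\n") if line)


print(keep_only("Page 1\n  1 1/153 Eric\nRecord 2:17:35\n 49 8/77 Joshua\n"))
```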

How do I format a list of phone numbers using regular expression in vim commands?

Given the following list of phone numbers
8144658695
812 673 5748
812 453 6783
812-348-7584
(617) 536 6584
834-674-8595
Write a single regular expression (use vim on loki) to reformat the numbers so they look like this
814 465 8695
812 673 5748
812 453 6783
812 348 7584
617 536 6584
834 674 8595
I am using the search and replace command. My regular expression using back referencing:
:%s/\(\d\d\d\)\(\d\d\d\)\(\d\d\d\d\)/\1 \2 \3\g
only formats the first line.
Any ideas?

Try this:
:%s,.*\(\d\d\d\).*\(\d\d\d\).*\(\d\d\d\d\).*,\1 \2 \3,

First, use a count to match a pattern multiple times; repeating the pattern by hand is a bad habit:
\d\{3} "instead of \d\d\d
Then you also have to match the whitespace and other separators between the digit groups:
:%s/.*\(\d\{3}\).*\(\d\{3}\).*\(\d\{4}\).*/\1 \2 \3/g
Or even better, switch the whole regex to "very magic" mode with \v, so that groups and counts need no backslashes:
:%s/\v.*(\d{3}).*(\d{3}).*(\d{4}).*/\1 \2 \3/g
This greatly increases readability.
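The same substitution can be tried outside vim. Below is a minimal Python sketch of it, assuming every line holds exactly ten digits as in the sample list; the function name is illustrative.

```python
import re

# Three digits, three digits, four digits, with anything in between.
PHONE = re.compile(r".*(\d{3}).*(\d{3}).*(\d{4}).*")


def format_phone(line: str) -> str:
    return PHONE.sub(r"\1 \2 \3", line)


for raw in ["8144658695", "812-348-7584", "(617) 536 6584"]:
    print(format_phone(raw))
```

Because each line contains exactly ten digits, the greedy .* parts have only one way to leave room for the three groups. With extra digits on a line, the groups would latch onto the rightmost ones.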

Italian phone 10-digit number regex issue

I'm trying to use the regex from this site
/^([+]39)?((38[{8,9}|0])|(34[{7-9}|0])|(36[6|8|0])|(33[{3-9}|0])|(32[{8,9}]))([\d]{7})$/
for Italian mobile phone numbers, but a simple number such as 3491234567 is reported as invalid.
(I don't care about spaces, as I'll trim them.)
should pass:
349 1234567
+39 349 1234567
TODO: 0039 349 1234567
TODO: (+39) 349 1234567
TODO: (0039) 349 1234567
regex101 and regexr both pass the validation..what's wrong?
UPDATE:
To clarify:
The regex should match any number that starts with either
388/389/380 (38[{8,9}|0])|
or
347/348/349/340 (34[{7-9}|0])|
or
366/368/360 (36[6|8|0])|
or
333/334/335/336/337/338/339/330 (33[{3-9}|0])|
328/329 (32[{8,9}])
plus 7 digits ([\d]{7})
and the +39 at the start optionally ([+]39)?

The following regex appears to fulfill your requirements. I took out the syntax errors and guessed a bit, and added the missing parts to cover your TODO comments.
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[7-90]|36[680]|33[3-90]|32[89])\d{7}$
Demo: https://regex101.com/r/yF7bZ0/1
Your test cases fail to cover many of the variations captured by the regex; perhaps you'll want to beef up the test set to make sure it does what you want.
The beginning allows for an optional international prefix with or without the parentheses. The basic pattern is (00|\+)39 and it is repeated with or without parentheses around it. (Perhaps a better overall approach would be to trim parentheses and punctuation as well as whitespace before processing begins; you'll want to keep the plus as significant, of course.)
Updated with information from @Edoardo's answer; wrapped for legibility and added comments:
^ # beginning of line
(\((00|\+)39\)|(00|\+)39)? # country code or trunk code, with or without parentheses
( # followed by one of the following
32[89]| # 328 or 329
33[013-9]| # 33x where x != 2
34[04-9]| # 34x where x not in 1,2,3
35[01]| # 350 or 351
36[068]| # 360 or 366 or 368
37[019]| # 370 or 371 or 379
38[089]) # 380 or 388 or 389
\d{6,7} # ... followed by 6 or 7 digits
$ # and end of line
There are obvious accidental gaps which will probably also get filled over time. Generalizing this further is likely to improve resilience toward future changes, but of course may at the same time increase the risk of false positives. Make up your mind about which is worse.
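To experiment with the wrapped version, it can be dropped into Python with re.VERBOSE, which ignores whitespace and # comments inside the pattern. This is a sketch, not an authoritative validator: the prefix list is copied from the answer and may be out of date, and the helper name is made up. Spaces are stripped first, as the question says they will be trimmed.

```python
import re

ITALIAN_MOBILE = re.compile(
    r"""
    ^
    (\((00|\+)39\)|(00|\+)39)?   # optional +39 or 0039, with or without parentheses
    (
        32[89]    |
        33[013-9] |
        34[04-9]  |
        35[01]    |
        36[068]   |
        37[019]   |
        38[089]
    )
    \d{6,7}                      # subscriber number
    $
    """,
    re.VERBOSE,
)


def is_italian_mobile(number: str) -> bool:
    # Remove all whitespace before matching.
    return ITALIAN_MOBILE.match("".join(number.split())) is not None


for candidate in ["349 1234567", "(0039) 349 1234567", "321 1234567"]:
    print(candidate, is_italian_mobile(candidate))
```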

I found this and updated it with new operators and MVNO prefixes (Iliad, ho.):
^(\((00|\+)39\)|(00|\+)39)?(38[890]|34[4-90]|36[680]|33[13-90]|32[89]|35[01]|37[019])\d{6,7}$

I improved the regex by adding a case that handles spaces after the country code and between the digit groups:
^(\((00|\+)39\)|(00|\+)39)?\s?(38[890]|34[4-90]|36[680]|33[13-90]|32[89]|35[01]|37[019])(\s?\d{3}\s?\d{3,4}|\d{6,7})$
So, for example, it can match phone numbers like (0039) 349 123 4567 or 349 123 4567.

Following this doc:
https://it.qaz.wiki/wiki/Telephone_numbers_in_Italy
A simple regex for Italian mobile numbers without special characters is:
/^3[0-9]{8,9}$/
It matches a string starting with the digit '3' followed by 8 or 9 digits, e.g.:
3345678103
You can then add the Italian prefix, such as '+39 ' or '0039 ':
/^\+39 3[0-9]{8,9}$/ --- match --> +39 3345678103
/^0039 3[0-9]{8,9}$/ --- match --> 0039 3345678103
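A small Python sketch that folds the three variants into one pattern with an optional prefix. Note that the plus sign has to be written as \+ to be taken literally. The names are illustrative, and the check is deliberately loose: it only looks at the leading 3 and the length.

```python
import re

# Optional '+39 ' or '0039 ' prefix, then '3' followed by 8 or 9 digits.
SIMPLE_MOBILE = re.compile(r"^(?:\+39 |0039 )?3[0-9]{8,9}$")


def looks_like_mobile(number: str) -> bool:
    return SIMPLE_MOBILE.match(number) is not None


for candidate in ["3345678103", "+39 3345678103", "0039 3345678103", "2345678103"]:
    print(candidate, looks_like_mobile(candidate))
```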

Find numbers using regular expression with egrep in Linux terminal

I want to find all separated words (which means characters between two spaces), that are decimal numbers including plus and minus signs in Linux terminal using egrep.
My solution:
(?<= |\n|\t)[\+\-]?[0-9]+(?= |\n|\t)
Explanation:
(?<= |\n|\t) checks if there is a space or newline or tabulator before decimal number
(?= |\n|\t) checks if there is a space or newline or tabulator after decimal number.
This code works well in program Kiki 0.5.6 where I test implementation, but if I copy it to terminal, it doesn't work. I think that terminal doesn't recognize special parentheses constructions (?= or ?<=). Am I right? How can I apply to terminal?
For example: my text:
1.fasfa
123asfavdsvdas156
1safsavdsvsd1sdva5s31as35d1va
595s6dva2sdvas9
asd9as5dv92s
sd559vs fs5s94 4dfs dfa4s44 459 9dasf 8sdfa 5sfa
napr. uNIveRziTA
sfaf 2262 2226 56565 adss
uNiVerZita
uNIVERZITa
123
123 sadasf 123456 sfafs 134
-1234- -25- -5- 5- --55
-
-55
123 100 999 124 6262 62 6 2 62 62 65 26565 22 62 62652 +665 +0649 ---662 265 959 595 099 199 -059 -0245 -444
--1245 -555-5-55 --555- 555-
+25
-55
+++55 +5 ++5 ++55+665+
samo samo samo samo otec otec skola skola samo lamo samo lamo
re20. (?<=(\t|\n| ))([+-])?[1-9][0-9]*(?= |$|\n)
--- ---
doma doma doma doma doma doma doma doma doma
meno.priezvisko@tuke.sk meno.priezvisko.1@tuke.sk meno.priezvisko@student.tuke.sk meno.priezvisko.2@student.tuke.sk
23:56:59.555
00:00:00.000
23:59:59.999
31/12/2099
00/12/2054
01/01/2000
matches:
459
2262
2226
56565
123
123
123456
134
-55
123
100
999
124
6262
62
6
2
62
65
26565
22
62
62652
+665
+0649

egrep does not support lookaround assertions. However, GNU grep supports Perl-compatible regular expressions via the -P switch:
grep -oP '(?<=\s|^)[+-]?[0-9]+(?=\s|$)' input
Note that you can simplify " |\n|\t" to \s, which matches any whitespace character. To also match numbers that start at the beginning of a line or end at the end of a line, I've added ^ and $ as alternatives to \s.
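For testing the idea outside the terminal, here is a Python sketch. Python's re module rejects lookbehinds whose alternatives differ in width, so (?<=\s|^) is swapped for the equivalent (?<!\S), "not preceded by a non-space character", and (?=\s|$) for (?!\S). The function name is made up.

```python
import re

# A signed or unsigned integer that stands alone between whitespace or line ends.
NUMBER = re.compile(r"(?<!\S)[+-]?[0-9]+(?!\S)")


def find_numbers(text: str):
    return NUMBER.findall(text)


print(find_numbers("123 sadasf 123456 sfafs 134"))
print(find_numbers("+665 +0649 ---662 265"))
```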
