Removing duplicate strings from text file using RegEx - regex

I would like to remove strings from a file that have already appeared on an earlier (lower-numbered) line, using RegEx (Notepad++).
Example -
123 = 45,
789 = 321,
123 = 951
Should result in -
123 = 45,
789 = 321,
= 951

Well, this is a good example of how, even though RegEx is very powerful, it is not always the right tool for the job. For instance, the following RegEx will probably do what you want (I don't have Notepad++ installed, but it works in my RegEx client):
Search: (\b\d+\b)(.+?)\1
Replace: \1\2 (or $1$2, depending on your setup)
This takes an instance of a number, searches until it finds another instance of it, then replaces the entire match with itself minus the second instance. (In Notepad++ you will probably need the ". matches newline" option enabled, since the duplicate sits on a later line.)
However, aside from being pretty dirty, this type of thing would be much simpler using a quick script or even something like Excel.
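For example, here is a minimal Python sketch of the script route, assuming the input sits in a file called input.txt and that the number to the left of each '=' is the key to de-duplicate (both the file name and that assumption are placeholders, not taken from the question):

seen = set()
cleaned = []
with open("input.txt") as f:                 # placeholder file name
    for line in f:
        key = line.split("=", 1)[0].strip()  # the number to the left of '='
        if key and key in seen:
            line = line.replace(key, "", 1)  # drop only the repeated number
        elif key:
            seen.add(key)
        cleaned.append(line)
print("".join(cleaned), end="")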

Related

Removing unmatched text and building a table with the remaining matches

I have 30000 lines that look like the one below.
342800005013000 CON N GORE PT LOT 31 RP 11R2284 PT PART 1 RP 11R4541 PT PART 2
I would like to capture the 15 digit number at the beginning and any "11R***" numbers.
In Notepad++ I've used \d{15}|(11R\d*)* to match everything that I want. Ultimately I would like to get all the matched results into Excel. What would be the best way to do so?
Thanks for your help.
Notepad++ Matches
You could try this one
(^[0-9]*)|(11R[0-9A-Za-z]*)
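If you eventually want the results in Excel anyway, a small Python sketch along these lines could write the matches straight to a CSV file that Excel can open (the file names here are placeholders):

import csv
import re

with open("lines.txt") as src, open("matches.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for line in src:
        roll = re.match(r"\d{15}", line)      # the 15-digit number at the start
        plans = re.findall(r"11R\d+", line)   # any 11R... numbers on the line
        if roll:
            writer.writerow([roll.group()] + plans)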

MongoDB count and regex search count not matching

I have a huge mongoDB containing documents on which I am using a name as index.
So basically, I had a text file containing 48 000 016 entries. (I used wc -l to obtain that count.)
To give more context, the database contains a lot of names that were extracted from OCR (so a lot of junk) and also names in other languages (Japanese, Russian, etc.).
My MongoDB collection statistics tell me I have 48 000 016 documents, which is fine.
The problem happens because I want to query the items on their names (which are standard strings) using this regex:
/^([A-Z]|\W|\s|\d|_)/i
So my checklist :
any letter - check
case insensitive - check
any number - check
underscore - check
\W for anything that is not a number, letter or underscore.
So from what I understand, this regex should get me everything, since I am querying the database on string values with it. But the problem is that I am missing 5 items.
When I run the count on the result of the query, I have 48 000 011 items.
Any idea where these 5 could be? Given the nature of my problem I could simply go through all my items with a cursor, and I know it could be done that way, but I need a regex that can retrieve all my values.
I ran this query on the database, as suggested in the comments.
db.name.aggregate({$group:{_id:"uniqueDocs", count:{$sum:1}}})
Result is :
{ "result" : [ ], "ok" : 1 }
Thanks a lot !
I see you are using the anchor ^ to match the beginning of a line. It is possible that a line starts with a new line \n or carriage return character \r.
Try including \n and \r in your regex:
/^([A-Z]|\W|\s|\d|\r|\n|_)/i
Also check what happens when you remove the anchor:
/([A-Z]|\W|\s|\d|\r|\n|_)/i
As a last option, invert your regex to see which records are not included. These regex expressions should also match empty strings.
/^(?![.*])/i
I want to thank @Paul Wasilewski for giving me some great solutions. I found my problem, and it was not a regex problem at all.
My 5 entries were simply not indexed: they were more than 1024 bytes in length, so MongoDB could not index them.
That is why they could not be found by the regex query.
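For anyone chasing the same symptom, a rough pymongo sketch like the one below can list the documents whose key is too long to be indexed. The database, collection, and field names are assumptions for illustration, not taken from the question:

from pymongo import MongoClient

client = MongoClient()                 # assumes a local mongod
coll = client["mydb"]["names"]         # hypothetical database/collection names

# Find documents whose string field exceeds the 1024-byte index key limit
# mentioned above (the field name "name" is also a guess).
cursor = coll.aggregate([
    {"$match": {"name": {"$type": "string"}}},
    {"$match": {"$expr": {"$gt": [{"$strLenBytes": "$name"}, 1024]}}},
])
for doc in cursor:
    print(doc["_id"])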

Using Regex to clean a csv file in R

This is my first post so I hope it is clear enough.
I am having a problem cleaning my CSV files before I can read them into R, and I have spent the entire day trying to find a solution.
My data is supposed to be in the form of two columns. The first column is a timestamp consisting of 10 digits and the second an ID consisting of 11 or 12 letters and numbers (the first 6 are always numbers).
For example:
logger10 |
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F0831102744
010700EDDA0F|
would become:
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F
0831102744 | 010700EDDA0F
(please excuse the lines in the middle, that was my attempt at separating the columns...).
The CSV file occasionally seems to be missing a comma, which means that sometimes one row will end up like this:
0923120531,010300033ADD0925075301,010700EDD00A
My hardware also adds the word logger10 (or whichever number logger it is) whenever it restarts, which gives a similar problem, e.g. logger10logger100831102744.
I think I have managed to solve the logger text problem (see code) but I am sure this could be improved. Also, I really don't want to delete any of the data.
My real trouble is making sure there is a line break in the right place after the ID and, if there isn't one, adding it. I thought I could use regex for this but I'm having difficulty understanding it.
Any help would be greatly appreciated!
Here is my attempt:
temp <- list.files(pattern="*.CSV")  # list of each csv/logger file
for (i in temp) {
  # clean each csv
  tmp <- readLines(i)                                # check each line in file
  tmp <- gsub("logger([0-9]{2})", "", tmp)           # remove logger text
  pattern <- "[0-9]{10}\\,[0-9]{6}[A-Z,0-9]{5,6}"    # regex pattern ??
  if (tmp != pattern) {
    # I have no idea where to start here...
  }
}
here is some raw data:
logger01
0729131218,020700EE1961
0729131226,020700EE1961
0831103159,0203000316DB
0831103207,0203000316DB0831103253,010700EDE28C
0831103301,010700EDE28C
0831103522,010300029815
0831103636,010300029815
0831103657,020300029815
If you want to do this in a single pass:
(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?
can be replaced with
\1\t\2\n
What this does is optionally look for any of those rogue logger01 entries (including the space after them): the trailing ? after the group means it can match 0 or 1 time, so if the text is there it gets consumed, and if it's not there the match just keeps going anyway.
Following that, you look for (and capture) exactly 10 hex values (either digits or A-F). The ,? means that if a comma exists, it will match, but it can match 0 or 1 time as well (making it optional).
Following that, look for (and capture) exactly 12 hex values. Finally, to get rid of any strange trailing spaces, the ? (a space character followed by ?) will optionally match the trailing space.
The replacement writes back the first captured group (the 10 hex digits), adds a tab, writes the second captured group (the 12 hex digits), and then a newline.
You can see this in use on regex101. The code generator on the left side of that page will give you some preformatted PHP/JavaScript/Python that you can drop into a script.
If you're doing this from the command line, perl could be used:
perl -pe 's/(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?/\1\t\2\n/g'
If another language, you may need to adapt it slightly to fit your needs.
EDIT
Re-reading the OP and comments, a slightly more rigid regex could be
(?:logger\d\d\ )?([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})\ ?
I updated the regex101 link with the changes.
This still looks for the first 10 hex values, but now looks for exactly 6 digits, followed by 5-6 hex values, so the total number of characters matched is 11 or 12.
The replacement would be the same.
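If you would rather do the substitution from a script than from the perl one-liner, a hedged Python sketch of it might look like this (the file names are placeholders):

import re

pattern = re.compile(r"(?:logger\d\d )?([\dA-F]{10}),?(\d{6}[\dA-F]{5,6}) ?")

with open("logger01.CSV") as src:            # placeholder file name
    raw = src.read()

cleaned = pattern.sub(r"\1\t\2\n", raw)      # timestamp, tab, ID, newline
cleaned = re.sub(r"\n{2,}", "\n", cleaned)   # collapse any doubled line breaks
with open("logger01_clean.csv", "w") as dst:
    dst.write(cleaned)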
Paste your regex into https://regex101.com/ to see whether it catches all cases. The 5 or 6 letters or digits could pose an issue, as the pattern may catch the first digit of the timestamp when the logger misses out a comma. Appending a '\n' to the end of the tmp string should work, provided the regex catches all cases.

python regex repetition with capture question

Using Python 3's regex capabilities, is it possible to capture a variable number of capture groups, based on the number of repetitions found? For instance, in the following search strings, I want to capture all the digit strings with the same regex.
Search string 1 (trying to capture: 89, 45):
zzz89zzz45.mp3
Search string 2 (trying to capture: 98, 67, 89, 45):
zzz98zzz67zzz89zzz45.mp3
Search string 3 (trying to capture: 98, 67, 89, 45, 55, 111):
zzz98zzz67zzz89zzz45vdvd55lplp111.mp3
The following regex will match all the repetitions, though not all the values are available for later use (only one digit string is captured):
((\d+)\D*)*\.mp3$
The other two options are writing a different regex for every case, or using findall(). Is there a way to adjust the above regex so that it captures every digit string for later use, with a varying number of repetitions, using just regex facilities, or are you forced to use findall() in Python 3?
Most or all regular expression engines in common use, including in particular those based on the PCRE syntax (like Python's), label their capturing groups according to the numerical index of the opening parenthesis, as the regex is written. So no, you cannot use capturing groups alone to extract an arbitrary, variable number of subsequences from a string.
The closest you can get (as far as I know) is to manually write out a certain number of capturing groups, something like this:
import re

s = "zzz98zzz67zzz89zzz45vdvd55lplp111.mp3"  # e.g. the third search string above
res = re.match(r'\D*' + 25 * r'(?:(\d+)\D+)?', s)
numbers = [r for r in res.groups() if r is not None]
This will get you up to 25 groups of digits. If you need more, replace 25 with some higher number.
I wouldn't be surprised if this were less efficient than the iterative approach with findall(), although I haven't tested it.
This will match all the numbers before the dot:
s = "zzz98zzz67zzz89zzz45vdvd55lplp111.mp3"
res = re.findall("[0-9]+(?=.*\\.)", s)
print(res)

How can I parse this without regex?

A friend of mine said if the regex I'm using is too long, it's probably the wrong tool for the job. Any thoughts here on a better way to parse this text? I have a regex that returns everything to an array I can easily just chunk out, but if there's another simpler way I'd really like to see it.
Here's what it looks like:
2 AB 123A 01JAN M ABCDEF AA1 100A 200A 02JAN T /ABCD /E
Here's a break down of that:
2 is the line number; these range from 1 all the way to 99. If you can't see it because of formatting, there is a space character prepending numbers less than 10.
The space may or may not be replaced by an *
AB is an important unit of data (UOD).
AB may be prepended by /CD which is another important UOD.
123 is an important UOD. It can range from 1 (prepended by 4 spaces) to 99999.
A is an important UOD.
01JAN is a day/month combination, I need to extract both UODs.
M is a day name short form. This may be a number between 1 and 7.
ABC is an important UOD.
DEF is an important UOD.
The space after DEF may be an *
AA1 may be zero characters, or it may be 5. It is unimportant.
100A is a timestamp, but may be in the format 1300. The A may be N when the time is 1200 or P for times in the PM.
We then see another timestamp.
The next date part may not be there, for example, this is valid:
93*DE/QQ51234 30APR J QWERTY*QQ0 1250 0520 /ABCD*ASDFAS /E
The data where /ABCD*ASDFAS /E appears is irrelevant to the application, but this is where the second date stamp may appear. The forward slash may be something else (such as a letter).
Note:
It is not space-delimited; some parts of the body run into others. Character position is only accurate for the first two or three items on the list.
I don't think I left anything out, but, if there's an easier way to parse out a string like this than writing a regex, please let me know.
This is a perfect task for regular expressions. The text does not contain nesting and the items you're matching are fairly simple taken individually.
Most regular expression syntaxes have an extended (x) flag or mode that allows whitespace and comments to improve readability. For example:
$regex = '~
# 2 is the line number, these range from 1 all the way to 99.
# There is a space character prepending numbers less than 10.
# The space may or may not be replaced by an *.
(?:[ *]\d|\d\d)
\s
# AB is an important unit of data (UOD).
# AB may be prepended by /CD which is another important UOD.
(/CD)?AB
\s
# 123 is an important UOD. It can range from 1 (prepended by 4 spaces)
# to 99999.
(?:\s{4}\d{1}|\s{3}\d{2}|\s{2}\d{3}|\s{1}\d{4}|\d{5})
~x';
And so on.
A regex seems fine for this application, but for simplicity and readability, you might want to split this into several regexes (one for each field) so people can more easily follow which part of the regex corresponds to which variable.
You can always code your own parser by hand, but that would be more lines of code than a regex. The lines of code, however, will probably be simpler to follow for the reader.
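As a rough illustration of the one-regex-per-field idea, here is a hedged Python sketch; the field names and widths are guesses based on the description above, and it only covers the first few fields rather than being a complete parser:

import re

# One small pattern per field, so a reader can see which piece feeds which variable.
FIELD_PATTERNS = [
    ("line_no", r"[ *]?\d{1,2}"),               # line number 1-99, space or * padded
    ("ab",      r"\s*(?:/[A-Z]{2})?[A-Z]{2}"),  # AB, optionally preceded by /CD
    ("num",     r"\s*\d{1,5}"),                 # the 1-99999 UOD
    ("letter",  r"[A-Z]"),                      # the single-letter UOD
    ("date",    r"\s*\d{2}[A-Z]{3}"),           # day/month such as 01JAN
]

# Stitch the pieces into one named-group regex so each match stays easy to follow.
pattern = re.compile("".join(f"(?P<{name}>{pat})" for name, pat in FIELD_PATTERNS))

m = pattern.match("2 AB 123A 01JAN M ABCDEF AA1 100A 200A 02JAN T /ABCD /E")
if m:
    print({name: value.strip() for name, value in m.groupdict().items()})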
Simply write a custom parser that handles it line by line. It seems like everything is at a fixed position rather than space/comma-delimited, so simply use those as indices into what you need:
line_number = int(line_text[0:2])  # first two characters hold the line number (1-99)
ab_unit = line_text[3:5]           # the two-character AB unit of data
...
If it is indeed space-delimited, simply split() each line and then parse through each, splitting each chunk into component parts where appropriate.