Remove duplicate lines containing same starting text - regex

So I have a massive list of numbers where all lines contain the same format.
#976B4B|B|0|0
#970000|B|0|1
#974B00|B|0|2
#979700|B|0|3
#4B9700|B|0|4
#009700|B|0|5
#00974B|B|0|6
#009797|B|0|7
#004B97|B|0|8
#000097|B|0|9
#4B0097|B|0|10
#970097|B|0|11
#97004B|B|0|12
#970000|B|0|13
#974B00|B|0|14
#979700|B|0|15
#4B9700|B|0|16
#009700|B|0|17
#00974B|B|0|18
#009797|B|0|19
#004B97|B|0|20
#000097|B|0|21
#4B0097|B|0|22
#970097|B|0|23
#97004B|B|0|24
#2C2C2C|B|0|25
#979797|B|0|26
#676767|B|0|27
#97694A|B|0|28
#020202|B|0|29
#6894B4|B|0|30
#976B4B|B|0|31
#808080|B|1|0
#800000|B|1|1
#803F00|B|1|2
#808000|B|1|3
What I am trying to do is remove all duplicate lines that contain the same hex codes, regardless of the text after it.
For example, the hex code #976B4B from the first line, #976B4B|B|0|0, shows up again in line 32 as #976B4B|B|0|31. I want all lines EXCEPT the first occurrence to be removed.
I have been attempting to use regex to solve this, and found that replacing ^(.*)(\r?\n\1)+$ with $1 can remove duplicate lines, but that is obviously not what I need. Looking for some guidance and maybe a possibility to learn from this.

You can use the following regex replacement. Make sure you click Replace All as many times as necessary, until no match is found:
Find What: ^((#[[:xdigit:]]+)\|.*(?:\R.+)*?)\R\2\|.*
Replace With: $1
See the regex demo.
Details:
^ - start of a line
((#[[:xdigit:]]+)\|.*(?:\R.+)*?) - Group 1 ($1, it will be kept):
    (#[[:xdigit:]]+) - Group 2: # and one or more hex chars
    \| - a | char
    .* - the rest of the line
    (?:\R.+)*? - any zero or more non-empty lines (if they can be empty, replace .+ with .*)
\R\2\|.* - a line break, Group 2 value, | and the rest of the line.
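If the file can be processed outside the editor, the same "keep the first line per hex code" result can be had in one pass without regex. This is only a sketch of a different technique (a set of already-seen prefixes), not the replacement above; the function name drop_repeated_prefixes and the sample list are made up for illustration.

```python
def drop_repeated_prefixes(lines, sep="|"):
    """Keep only the first line for each distinct prefix before `sep`."""
    seen = set()
    kept = []
    for line in lines:
        key = line.split(sep, 1)[0]  # e.g. "#976B4B"
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


sample = [
    "#976B4B|B|0|0",
    "#970000|B|0|1",
    "#970000|B|0|13",
    "#976B4B|B|0|31",
    "#808080|B|1|0",
]
result = drop_repeated_prefixes(sample)
print("\n".join(result))
```

Unlike the Replace All approach, this needs only one pass, since the set remembers every prefix seen so far.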

Related

How can I delete the rest of the line after the second pipe character "|" for every line with python?

I am using Notepad++ and I want to get rid of everything after the second pipe character (including the second pipe character itself) on every line of my txt file.
Basically, the txt file has the following format:
3.1_1.wav|I like apples.|I like apples|I like bananas
3.1_2.wav|Isn't today a lovely day?|Right now it is 1 in the afternoon.|....
The result should be:
3.1_1.wav|I like apples.
3.1_2.wav|Isn't today a lovely day?
I have tried using \|.* but then everything after the first pipe character is matched.
In Notepad++ do this:
Find what: ^([^\|]*\|[^\|]*).*
Replace with: $1
check "Regular expression", and "Replace All"
Explanation:
^ - anchor at start of line
( - start group, can be referenced as $1
[^\|]* - scan over any character other than |
\| - scan over |
[^\|]* - scan over any character other than |
) - end group
.* - scan over everything until end of line
in replace reference the captured group with $1
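Since the question title mentions Python, here is a rough equivalent of the same find/replace using the re module. It is a sketch, assuming the whole file fits in a string; \n is added to the negated classes so that a line with fewer than two pipes cannot swallow the following line.

```python
import re

text = (
    "3.1_1.wav|I like apples.|I like apples|I like bananas\n"
    "3.1_2.wav|Isn't today a lovely day?|Right now it is 1 in the afternoon.|....\n"
)

# Group 1: everything up to (not including) the second pipe on each line.
pattern = re.compile(r"^([^|\n]*\|[^|\n]*).*", re.MULTILINE)
trimmed = pattern.sub(r"\1", text)
print(trimmed)
```

In Python the group is referenced as \1 (or \g<1>) in the replacement rather than $1.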
I'm not sure if this is the best way to do it, but try this:
[^wav]\|.*

Regex to match all lines after a specific string

Possible duplicate of Regex - find all lines after a match: although my need is a little different.
I want to parse a plain text file with multiple date/value data separated by specific strings. I want to skip the first half of the file until a specific line where I want to match the results.
Here is an example of the file in question (including the mess with tabulations and spaces):
I dont want to capture the following measures. This text is on a single line and contains tabs and spaces is also ends with this token : Token1
05/01/1969 0.01846
15/01/1969 0.16730
25/01/1969 0.33988
05/04/1969 0.81319
15/04/1969 0.76973
25/11/2011 0.24210
05/12/2011 0.25220
15/12/2011 0.31160
25/12/2011 0.36845
End : bla bla bla
This text is also on a single line and marks the beginning of a new series of results. These are the results that I want. it also ends with the following token : Token2
05/01/1969 109.46333
15/01/1969 110.06998 118.18000
25/01/1969 110.82954
05/02/1969 111.51394 118.83000
25/02/1969 112.36483
05/10/2011 114.38798 114.31000
05/10/2011 114.31000 114.38798 114.38798 114.38798 114.38798 114.38798 114.38798
25/12/2011 112.64000 112.41261 112.86301 113.25494 114.06421 115.93219 116.38780
05/01/2012 112.22834 112.92301 113.40561 114.78823 116.62931 117.43421
05/09/2012 110.01410 112.16391 112.88199 115.23640 117.04756 118.04632
15/09/2012 109.97572 112.00809 112.70266 114.91247 116.65256 117.57412
25/09/2012 109.93967 111.87272 112.53305 114.60381 116.26935 117.12756
End : Marks the end of the file
What I wish to do is to match every line after the line which ends with Token2. I have tried different solutions from the other similar questions but none work. I ended up matching all the results of the file and considered splitting it before applying the following pattern. Is there a pure regex solution to this ?
Here is the pattern that works for the whole file. With named capture groups :
(?P<date>\d\d\/\d\d\/\d\d\d\d)\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*){0,1}[\t ]*(?P<prev_no_rain>\d+\.*\d*){0,1}[\t ]*(?P<prev_10_dry>\d+\.*\d*){0,1}[\t ]*(?P<prev_20_dry>\d+\.*\d*){0,1}[\t ]*(?P<prev_50>\d+\.*\d*){0,1}[\t ]*(?P<prev_20_wet>\d+\.*\d*){0,1}[\t ]*(?P<prev_10_wet>\d+\.*\d*){0,1}
Regex101 link : https://regex101.com/r/a0mCZ2/3
You may leverage the \G operator that matches the start of string (that can be excluded with a negative lookaround) and the end of the previous successful match position. With the (?:\G(?!\A)|\bToken2[\r\n]+) we can tell the regex engine to find a whole word Token2 at the end of the line (with linebreak symbols) and then only find the following subpatterns if they follow in an immediate succession.
A regex that can be used:
(?:\G(?!\A)[\r\n]*|Token2[\r\n]+)\K(?P<date>\d\d\/\d\d\/\d{4})\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*)?[\t ]*(?P<prev_no_rain>\d+(?:\.\d+)*)?[\t ]*(?P<prev_10_dry>\d+\.*\d*)?[\t ]*(?P<prev_20_dry>\d+\.*\d*)?[\t ]*(?P<prev_50>\d+\.*\d*)?[\t ]*(?P<prev_20_wet>\d+\.*\d*)?[\t ]*(?P<prev_10_wet>\d+\.*\d*)?
See the regex demo. Note I replaced {0,1} with ? to shorten it a bit.
The part you are interested in is (?:\G(?!\A)[\r\n]*|Token2[\r\n]+)\K.
(?:\G(?!\A)[\r\n]*|Token2[\r\n]+) - 1 of two alternatives:
\G(?!\A)[\r\n]* - end of the previous successful match and 0+ linebreak symbols
| - or
Token2[\r\n]+ - Token2 followed with 1+ CR or LFs. (If you need to match Token2 as a whole word, you might add \b before it).
\K - omit the text matched so far.
The (?P<date>\d\d\/\d\d\/\d{4})\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*)?[\t ]*(?P<prev_no_rain>\d+(?:\.\d+)*)?[\t ]*(?P<prev_10_dry>\d+\.*\d*)?[\t ]*(?P<prev_20_dry>\d+\.*\d*)?[\t ]*(?P<prev_50>\d+\.*\d*)?[\t ]*(?P<prev_20_wet>\d+\.*\d*)?[\t ]*(?P<prev_10_wet>\d+\.*\d*)? is your pattern that I did not modify too much, and that matches a line with specific data (note that the fact it matches a line justifies the usage of [\r\n]* after (\G(?!\A))).
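Note that Python's built-in re module supports neither \G nor \K, so the pattern above will not compile there. A sketch of a different technique for that case: find the line ending with the token, then match the rows line by line and stop at the first line that is not a data row (which mirrors the "immediate succession" behaviour of \G). The names ROW and rows_after_token, and the simplified row pattern, are assumptions for illustration only.

```python
import re

# A date followed by one or more whitespace-separated decimal numbers.
ROW = re.compile(r"(\d\d/\d\d/\d{4})((?:[\t ]+\d+(?:\.\d+)?)+)[\t ]*")


def rows_after_token(text, token="Token2"):
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(token):
            start = i + 1
            break
    else:
        return []  # token line not found
    rows = []
    for line in lines[start:]:
        m = ROW.fullmatch(line)
        if m is None:
            break  # first non-data line ends the series
        rows.append((m.group(1), [float(v) for v in m.group(2).split()]))
    return rows


sample = (
    "Skip these. Token1\n"
    "05/01/1969 0.01846\n"
    "End : bla bla bla\n"
    "Wanted series. Token2\n"
    "05/01/1969 109.46333\n"
    "15/01/1969 110.06998 118.18000\n"
    "End : Marks the end of the file\n"
)
print(rows_after_token(sample))
```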

Parsing out particular text in a big text column in a Dataframe - R

Suppose I have the following data,
data
text
abc/1234&
qwertyabc/5555&
a&sdfghabc/ppp&plksa&
z&xabc/lkjh&poiuw&
lkjqwefasrjabc/855698&plkjdhweb
For example, I want to extract only the text between abc/ and the first & that follows it. That is, the text between the first occurrence of abc/ and the first occurrence of & after abc/.
My output should be as follows,
data
text parsed_out
abc/1234& 1234
qwertyabc/5555& 5555
a&sdfghabc/ppp&plksa& ppp
z&xabc/lkjh&poiuw& lkjh
lkjqwefasrjabc/855698&plkjdhweb 855698
Here is my attempt:
data1 = within(data, FOO<-data.frame(do.call('rbind', strsplit(as.character(text), 'abc/', fixed=TRUE))))
data2 = within(data1, FOO1<-data.frame(do.call('rbind', strsplit(as.character(FOO$X1), '&', fixed=TRUE))))
This uses too much memory, since the text file has 8 million rows, and data2 ends up with several columns because the text contains several '&'. Can anybody help me parse the text between these two markers into a single column, in a memory-efficient way?
x = "thesearepresentinthestartingwhichisnotneededhttp://google.com/needstobeparsedout&reoccurencenotneeded&"
here, the function should check for http://google.com/ and parse out until first & is found. Here the output should be needstobeparsedout.
new_x = "\"http://www.google.com/search?q=erykah+badu+with+hiatus+kaiyote,+august+3&""
Why is it not working with this link?
Thanks
I actually wanted to parse out few parts of the URL and for example, I want to parse out, the text between "http:www.google.com/" and first occurrence of "&".
Use
sub(".*?https?://(?:www\\.)?google\\.com/([^&]+).*", "\\1", x)
See the regex demo.
The pattern matches:
(optionally add a ^ in front to match the start of string position)
.*? - 0+ chars as few as possible from the start till the first
https?:// - either https:// or http:// followed with
(?:www\\.)? - 1 or 0 (optional) sequence www.
google\\.com/ - literal text google.com/
([^&]+) - 1 or more chars other than & (Capture group 1)
.* - any 0+ chars (up to the end of string).
In the replacement pattern, \1 refers to the subtext captured into Group 1.
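For readers who want to check the pattern outside R, here is a small Python sketch of the same idea. It uses a search for the capture group instead of a whole-string substitution, so it returns None when nothing matches (R's sub would return the input unchanged). The helper name parse_out is hypothetical.

```python
import re

# Same core pattern as above; backslashes are single inside a Python raw string.
PATTERN = re.compile(r"https?://(?:www\.)?google\.com/([^&]+)")


def parse_out(url_text):
    m = PATTERN.search(url_text)
    return m.group(1) if m else None


x = "thesearepresentinthestartingwhichisnotneededhttp://google.com/needstobeparsedout&reoccurencenotneeded&"
print(parse_out(x))
```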

Regular expression to get only the first word from each line

I have a text file
#sp_id int,
#sp_name varchar(120),
#sp_gender varchar(10),
#sp_date_of_birth varchar(10),
#sp_address varchar(120),
#sp_is_active int,
#sp_role int
Here, I want to get only the first word from each line. How can I do this? The spaces between the words may be space or tab etc.
Here is what I suggest:
Find what: ^([^ \t]+).*
Replace with: $1
Explanation: ^ matches the start of line, ([^ \t]+) matches 1 or more (due to +) characters other than space and tab (due to [^ \t]), and then any number of characters up to the end of the line with .*.
In case you might have leading whitespace, you might want to use
^\s*([^ \t]+).*
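The same expression can be tried line by line in Python. This is a sketch, assuming the text is already in a string; working per line avoids [^ \t]+ running across a line break when a line contains no space or tab. The name first_words is made up for the example.

```python
import re

# Optional leading whitespace, then the first run of non-space, non-tab chars.
FIRST_WORD = re.compile(r"\s*([^ \t]+)")


def first_words(text):
    words = []
    for line in text.splitlines():
        m = FIRST_WORD.match(line)
        if m:  # blank lines have no first word and are skipped
            words.append(m.group(1))
    return words


sample = "#sp_id int,\n  #sp_name\tvarchar(120),\n\n#sp_role int"
print(first_words(sample))
```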
I did something similar with this:
import re

with open('handles.txt', 'r') as handles:
    handlelist = [line.rstrip('\n') for line in handles]
# first run of word characters on each line (raises IndexError on a line without any)
newlist = [re.findall(r"\w+", line)[0] for line in handlelist]
This gets a list containing all the lines in the document,
then it changes each line to a string and uses regex to extract the first word (ignoring white spaces)
My file (handles.txt) contained info like this:
JoIyke - personal twitter link;
newMan - another twitter handle;
yourlink - yet another one.
The code will return this list:
['JoIyke', 'newMan', 'yourlink']
Find What: ^(\S+).*$
Replace by : \1
You can simply use this to get the first word. Here we are capturing the first word in a group and replacing the whole line with the captured group.
Find the first word of each line with /^\w+/gm.

Regex match all lines that don't end with ,0 and ,1

I have a malformed CSV file which has two columns: Text,Value
The value is either 1 or 0, but some lines are malformed and span two lines:
1. "This line is fine, but there are some that are not like this",0
2. "Another good line",1
4. "Oh, I'm so bad!!
5. I spanned two lines!",0
6. "Why did you break me? FileHelpers can't read two lines!!",1
Line 4 and 5 are supposed to be one line, but the CSV file I got is broken and they span two lines, this causes the FileHelpers engine to fail while reading the csv file.
I have two CSV files with about 3000 lines each and I will only need to fix them once. I want to use Notepad++ to find all the lines that do not end in ,0 or ,1. What kind of regex can I use for that? Or maybe two regular expressions, one for the ,0 case and the other for the ,1 case.
Update:
Dan's answer works without the comma [^01]$ instead of ,[^01]$, but it only matches lines that are not ending with 0 or 1... it works sufficiently well in my case, but it does skip lines that are broken and actually end with 0 or 1.
I don't know how the other answer would work:
Something like the below is what I would use in Notepad++
[^,][^01]$
Here are the steps I did:
Use ([^,][^01])$ to match the lines and replace with \1{marked}
Then switch to extended mode and replace {marked}\r\n with an empty string to get a single line.
The expression you would use is
([^,].|,[^01])$
But unfortunately, Notepad++ did not support alternation (the | operator) when this was written; releases from 6.0 on use a PCRE-style engine that does. [1]
You can match the broken lines with these two expressions then:
[^,].$
,[^01]$
Except, of course, if the "Text" part does end in ,0 or ,1 itself. :-)
[1] http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Unsupported_Regex_Operators
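With an engine that does support alternation, the combined expression can be tried directly. A quick Python sketch on lines modelled on the question's sample (the variable names are just for illustration):

```python
import re

# Last two characters are either "not a comma + anything" or "comma + not 0/1".
BROKEN = re.compile(r"([^,].|,[^01])$")

lines = [
    '"This line is fine, but there are some that are not like this",0',
    '"Another good line",1',
    '"Oh, I\'m so bad!!',
    'I spanned two lines!",0',
]
broken = [line for line in lines if BROKEN.search(line)]
print(broken)
```

Only the first half of the split record is flagged; the second half ends in ,0 and therefore looks like a good line.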
,[^01]$
Make sure regex mode is on.
General considerations
In general, to match a line that does not end with a specific pattern, you may use
^(?!.*pattern$).*$
where ^ matches the start of a line, (?!.*pattern$) is a negative lookahead that fails the match if there are 0 or more chars other than line break chars, as many as possible (.*), followed with pattern at the end of the line ($), and the .*$ actually matches the line.
To remove a line that does not end with some pattern together with a line break at the end, use
^(?!.*pattern$).*\R?
where \R? is an optional line break sequence.
In case of several fixed strings, you may use
^(?!.*(?:pattern|pattern2|patternN)$).*\R?
If there is just one or two fixed strings to check at the end of the line, you may use a bit quicker regex like
^.*$(?<!a)(?<!bcd)
that will match any line not ending with a or bcd.
^.*$(?<!1)(?<!0)
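Both general forms can be checked quickly in Python, whose re module supports lookaheads and fixed-width lookbehinds (but not \R, so the line-break variants are left out here). This is a sketch; lines_not_ending_with and the sample strings are made up for the example.

```python
import re


def lines_not_ending_with(text, *endings):
    # Build ^(?!.*(?:e1|e2|...)$).*$ from literal endings.
    alternatives = "|".join(re.escape(e) for e in endings)
    pattern = re.compile(rf"^(?!.*(?:{alternatives})$).*$", re.MULTILINE)
    return pattern.findall(text)


text = "row,0\nrow,1\nbroken row\nxbcd"
lookahead_hits = lines_not_ending_with(text, ",0", ",1")

# Lookbehind form: the line must end with neither "a" nor "bcd".
lookbehind_hits = re.findall(r"^.*$(?<!a)(?<!bcd)", "alpha\nxbcd\nok", re.MULTILINE)

print(lookahead_hits)
print(lookbehind_hits)
```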
Current problem solution
So, for the current issue, to match a line not ending with 1 or 0, you may use
^(?!.*[01]$).*$ # without the line break
^(?!.*[01]$).*$\R? # with the line break
Or,
^.*(?<![01])$ # without the line break
^.*(?<![01])$\R? # with the line break
To remove/replace a line break on a line that does not end with a specific pattern you may use
(?<![01])$\R?
Replace with either an empty string (to remove the line break) or with any other delimiter string or character.
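The last replacement can also be scripted. A minimal Python sketch, assuming the file content is in a string with \n line endings (Python's re has no \R, so \n is used instead); join_broken_lines is a hypothetical helper name. As noted in the question's update, a broken line whose first half happens to end in 0 or 1 is not repaired by this.

```python
import re


def join_broken_lines(text, delimiter=""):
    # Replace every line break that is not preceded by 0 or 1.
    return re.sub(r"(?<![01])\n", delimiter, text)


text = (
    '"Another good line",1\n'
    '"Oh, I\'m so bad!!\n'
    'I spanned two lines!",0\n'
)
fixed = join_broken_lines(text, delimiter=" ")
print(fixed)
```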