So I have a massive list of numbers where all lines contain the same format.
#976B4B|B|0|0
#970000|B|0|1
#974B00|B|0|2
#979700|B|0|3
#4B9700|B|0|4
#009700|B|0|5
#00974B|B|0|6
#009797|B|0|7
#004B97|B|0|8
#000097|B|0|9
#4B0097|B|0|10
#970097|B|0|11
#97004B|B|0|12
#970000|B|0|13
#974B00|B|0|14
#979700|B|0|15
#4B9700|B|0|16
#009700|B|0|17
#00974B|B|0|18
#009797|B|0|19
#004B97|B|0|20
#000097|B|0|21
#4B0097|B|0|22
#970097|B|0|23
#97004B|B|0|24
#2C2C2C|B|0|25
#979797|B|0|26
#676767|B|0|27
#97694A|B|0|28
#020202|B|0|29
#6894B4|B|0|30
#976B4B|B|0|31
#808080|B|1|0
#800000|B|1|1
#803F00|B|1|2
#808000|B|1|3
What I am trying to do is remove all duplicate lines that contain the same hex code, regardless of the text after it.
For example, the hex #976B4B from the first line, #976B4B|B|0|0, shows up again in line 32 as #976B4B|B|0|31. I want all lines EXCEPT the first occurrence to be removed.
I have been attempting to use regex to solve this, and found that replacing ^(.*)(\r?\n\1)+$ with $1 removes duplicate lines, but that is obviously not what I need. I am looking for some guidance and maybe a chance to learn from this.
You can use the following regex replacement. Make sure you click Replace All as many times as necessary, until no match is found:
Find What: ^((#[[:xdigit:]]+)\|.*(?:\R.+)*?)\R\2\|.*
Replace With: $1
Details:
^ - start of a line
((#[[:xdigit:]]+)\|.*(?:\R.+)*?) - Group 1 ($1, it will be kept):
(#[[:xdigit:]]+) - Group 2: # and one or more hex chars
\| - a | char
.* - the rest of the line
(?:\R.+)*? - any zero or more non-empty lines (if they can be empty, replace .+ with .*)
\R\2\|.* - a line break, Group 2 value, | and the rest of the line.
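Outside the editor, the same "keep only the first occurrence per hex code" rule can be sketched in a few lines of Python. This is an alternative to the regex replacement above, not a translation of it: it tracks the codes already seen in a set, so one pass is enough. The helper name dedupe_by_key and the sample list are illustrative assumptions.

```python
def dedupe_by_key(lines, sep="|"):
    """Keep only the first line seen for each key (the text before the first sep)."""
    seen = set()
    kept = []
    for line in lines:
        key = line.split(sep, 1)[0]  # e.g. "#976B4B"
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# A shortened sample in the same format as the question's data.
sample = [
    "#976B4B|B|0|0",
    "#970000|B|0|1",
    "#974B00|B|0|2",
    "#970000|B|0|13",
    "#976B4B|B|0|31",
]
result = dedupe_by_key(sample)
print("\n".join(result))
```

Because the set lookup does not depend on how far apart the duplicates are, this avoids clicking Replace All repeatedly on very large files.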
I am using Notepad++ and I want to get rid of everything after the second pipe character (including the second pipe character itself) on every line of my txt file.
Basically, the txt file has the following format:
3.1_1.wav|I like apples.|I like apples|I like bananas
3.1_2.wav|Isn't today a lovely day?|Right now it is 1 in the afternoon.|....
The result should be:
3.1_1.wav|I like apples.
3.1_2.wav|Isn't today a lovely day?
I have tried using \|.* but then everything after the first pipe character is matched.
In Notepad++ do this:
Find what: ^([^\|]*\|[^\|]*).*
Replace with: $1
Check "Regular expression", then click "Replace All".
Explanation:
^ - anchor at start of line
( - start group, can be referenced as $1
[^\|]* - scan over any character other than |
\| - scan over |
[^\|]* - scan over any character other than |
) - end group
.* - scan over everything until end of line
In the replacement, reference the captured group with $1.
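For anyone scripting this instead of using the editor, here is a minimal Python sketch of the same replacement. One deliberate deviation, labeled as such: the character classes are written as [^|\n] rather than [^\|], so a line with fewer than two pipes cannot make the match run on into the next line. The function name keep_first_two_fields is hypothetical.

```python
import re

# Capture everything up to (not including) the second pipe, then consume
# the rest of the line. [^|\n] keeps each field on a single line.
SECOND_PIPE = re.compile(r"^([^|\n]*\|[^|\n]*).*", flags=re.MULTILINE)


def keep_first_two_fields(text):
    """Drop the second pipe and everything after it on every line."""
    return SECOND_PIPE.sub(r"\1", text)


sample = (
    "3.1_1.wav|I like apples.|I like apples|I like bananas\n"
    "3.1_2.wav|Isn't today a lovely day?|Right now it is 1 in the afternoon.|...."
)
trimmed = keep_first_two_fields(sample)
print(trimmed)
```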
I'm not sure if this is the best way to do it, but try this:
[^wav]\|.*
Possible duplicate of Regex - find all lines after a match: although my need is a little different.
I want to parse a plain text file with multiple date/value data separated by specific strings. I want to skip the first half of the file until a specific line where I want to match the results.
Here is an example of the file in question (including the mess with tabulations and spaces):
I dont want to capture the following measures. This text is on a single line and contains tabs and spaces is also ends with this token : Token1
05/01/1969 0.01846
15/01/1969 0.16730
25/01/1969 0.33988
05/04/1969 0.81319
15/04/1969 0.76973
25/11/2011 0.24210
05/12/2011 0.25220
15/12/2011 0.31160
25/12/2011 0.36845
End : bla bla bla
This text is also on a single line and marks the beginning of a new series of results. These are the results that I want. it also ends with the following token : Token2
05/01/1969 109.46333
15/01/1969 110.06998 118.18000
25/01/1969 110.82954
05/02/1969 111.51394 118.83000
25/02/1969 112.36483
05/10/2011 114.38798 114.31000
05/10/2011 114.31000 114.38798 114.38798 114.38798 114.38798 114.38798 114.38798
25/12/2011 112.64000 112.41261 112.86301 113.25494 114.06421 115.93219 116.38780
05/01/2012 112.22834 112.92301 113.40561 114.78823 116.62931 117.43421
05/09/2012 110.01410 112.16391 112.88199 115.23640 117.04756 118.04632
15/09/2012 109.97572 112.00809 112.70266 114.91247 116.65256 117.57412
25/09/2012 109.93967 111.87272 112.53305 114.60381 116.26935 117.12756
End : Marks the end of the file
What I wish to do is to match every line after the line which ends with Token2. I have tried different solutions from the other similar questions but none work. I ended up matching all the results of the file and considered splitting it before applying the following pattern. Is there a pure regex solution to this ?
Here is the pattern that works for the whole file. With named capture groups :
(?P<date>\d\d\/\d\d\/\d\d\d\d)\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*){0,1}[\t ]*(?P<prev_no_rain>\d+\.*\d*){0,1}[\t ]*(?P<prev_10_dry>\d+\.*\d*){0,1}[\t ]*(?P<prev_20_dry>\d+\.*\d*){0,1}[\t ]*(?P<prev_50>\d+\.*\d*){0,1}[\t ]*(?P<prev_20_wet>\d+\.*\d*){0,1}[\t ]*(?P<prev_10_wet>\d+\.*\d*){0,1}
Regex101 link : https://regex101.com/r/a0mCZ2/3
You may leverage the \G operator, which matches either the start of the string (this can be excluded with the negative lookahead (?!\A)) or the end of the previous successful match. With (?:\G(?!\A)|\bToken2[\r\n]+) we can tell the regex engine to find the whole word Token2 at the end of a line (together with the line break symbols) and then to match the following subpatterns only if they follow in immediate succession.
A regex that can be used:
(?:\G(?!\A)[\r\n]*|Token2[\r\n]+)\K(?P<date>\d\d\/\d\d\/\d{4})\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*)?[\t ]*(?P<prev_no_rain>\d+(?:\.\d+)*)?[\t ]*(?P<prev_10_dry>\d+\.*\d*)?[\t ]*(?P<prev_20_dry>\d+\.*\d*)?[\t ]*(?P<prev_50>\d+\.*\d*)?[\t ]*(?P<prev_20_wet>\d+\.*\d*)?[\t ]*(?P<prev_10_wet>\d+\.*\d*)?
See the regex demo. Note I replaced {0,1} with ? to shorten it a bit.
The part you are interested in is (?:\G(?!\A)[\r\n]*|Token2[\r\n]+)\K.
(?:\G(?!\A)[\r\n]*|Token2[\r\n]+) - 1 of two alternatives:
\G(?!\A)[\r\n]* - end of the previous successful match and 0+ linebreak symbols
| - or
Token2[\r\n]+ - Token2 followed with 1+ CR or LFs. (If you need to match Token2 as a whole word, you might add \b before it).
\K - omit the text matched so far.
The (?P<date>\d\d\/\d\d\/\d{4})\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*)?[\t ]*(?P<prev_no_rain>\d+(?:\.\d+)*)?[\t ]*(?P<prev_10_dry>\d+\.*\d*)?[\t ]*(?P<prev_20_dry>\d+\.*\d*)?[\t ]*(?P<prev_50>\d+\.*\d*)?[\t ]*(?P<prev_20_wet>\d+\.*\d*)?[\t ]*(?P<prev_10_wet>\d+\.*\d*)? is your pattern that I did not modify too much, and that matches a line with specific data (note that the fact that it matches a line justifies the usage of [\r\n]* after (\G(?!\A))).
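Note that \G and \K are PCRE features; Python's built-in re module supports neither. If the file is processed in Python, one workaround (a different technique from the single-regex answer above) is to locate the line ending with Token2 first and then apply a row pattern only to the text after it. The sketch below assumes a simplified row pattern (a date followed by one or more numbers) instead of the nine named groups, and the names ROW and rows_after_token are made up for illustration.

```python
import re

# Simplified row: a dd/mm/yyyy date, then one or more numbers separated
# by tabs or spaces. The named groups of the original are collapsed.
ROW = re.compile(
    r"^(?P<date>\d\d/\d\d/\d{4})[\t ]+"
    r"(?P<values>\d+(?:\.\d+)?(?:[\t ]+\d+(?:\.\d+)?)*)[\t ]*$",
    flags=re.MULTILINE,
)


def rows_after_token(text, token="Token2"):
    """Return (date, [values]) for every data row after the line ending in token."""
    marker = re.search(rf"{re.escape(token)}[\t ]*$", text, flags=re.MULTILINE)
    if marker is None:
        return []
    tail = text[marker.end():]  # everything after the marker line
    return [
        (m.group("date"), [float(v) for v in m.group("values").split()])
        for m in ROW.finditer(tail)
    ]


sample = (
    "Unwanted header ending with this token : Token1\n"
    "05/01/1969\t0.01846\n"
    "End : bla bla bla\n"
    "Wanted header ending with the following token : Token2\n"
    "05/01/1969 109.46333\n"
    "15/01/1969\t110.06998 118.18000\n"
    "End : Marks the end of the file\n"
)
rows = rows_after_token(sample)
print(rows)
```

Slicing the text once keeps the row pattern itself free of any "after Token2" logic, which is the part \G handles in engines that support it.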
Suppose I have the following data,
data
text
abc/1234&
qwertyabc/5555&
a&sdfghabc/ppp&plksa&
z&xabc/lkjh&poiuw&
lkjqwefasrjabc/855698&plkjdhweb
For example, I want to parse out only the text between abc/ and the first occurrence of & that follows it. That is, I want the text between the first occurrence of abc/ and the first & after that abc/. How do I do this?
My output should be as follows,
data
text parsed_out
abc/1234& 1234
qwertyabc/5555& 5555
a&sdfghabc/ppp&plksa& ppp
z&xabc/lkjh&poiuw& lkjh
lkjqwefasrjabc/855698&plkjdhweb 855698
Here is my attempt:
data1 = within(data, FOO<-data.frame(do.call('rbind', strsplit(as.character(text), 'abc/', fixed=TRUE))))
data2 = within(data1, FOO1<-data.frame(do.call('rbind', strsplit(as.character(FOO$X1), '&', fixed=TRUE))))
This uses too much memory, since the text file has 8 million rows, and data2 ends up with several columns because the text contains several '&'. Can anybody help me parse the text between these two markers into a single column in a memory-efficient way?
x = "thesearepresentinthestartingwhichisnotneededhttp://google.com/needstobeparsedout&reoccurencenotneeded&"
here, the function should check for http://google.com/ and parse out until first & is found. Here the output should be needstobeparsedout.
new_x = "\"http://www.google.com/search?q=erykah+badu+with+hiatus+kaiyote,+august+3&""
Why is it not working with this link?
Thanks
I actually wanted to parse out few parts of the URL and for example, I want to parse out, the text between "http:www.google.com/" and first occurrence of "&".
Use
sub(".*?https?://(?:www\\.)?google\\.com/([^&]+).*", "\\1", x)
See the regex demo.
The pattern matches:
(optionally add a ^ in front to match the start of string position)
.*? - 0+ chars as few as possible from the start till the first
https?:// - either https:// or http:// followed with
(?:www\\.)? - 1 or 0 (optional) sequence www.
google\\.com/ - literal text google.com/
([^&]+) - 1 or more chars other than & (Capture group 1)
.* - any 0+ chars (up to the end of string).
In the replacement pattern, \1 refers to the subtext captured into Group 1.
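The same pattern carries over to other regex engines. As a rough Python sketch (the answer above is R; the function name text_after_host is hypothetical), searching for the capture group directly avoids the leading .*? and trailing .* that sub needs in order to consume the rest of the string:

```python
import re

# Optional "s" and optional "www." as in the pattern above; group 1 is the
# run of non-& characters that follows "google.com/".
PATTERN = re.compile(r"https?://(?:www\.)?google\.com/([^&]+)")


def text_after_host(s):
    """Return the text between google.com/ and the first &, or None if absent."""
    m = PATTERN.search(s)
    return m.group(1) if m else None


x = ("thesearepresentinthestartingwhichisnotneeded"
     "http://google.com/needstobeparsedout&reoccurencenotneeded&")
new_x = "http://www.google.com/search?q=erykah+badu+with+hiatus+kaiyote,+august+3&"

print(text_after_host(x))
print(text_after_host(new_x))
```

The second input only matches because of the optional (?:www\.)? part, which is why a pattern hard-coded to http://google.com/ fails on it.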
I have a text file
#sp_id int,
#sp_name varchar(120),
#sp_gender varchar(10),
#sp_date_of_birth varchar(10),
#sp_address varchar(120),
#sp_is_active int,
#sp_role int
Here, I want to get only the first word from each line. How can I do this? The spaces between the words may be space or tab etc.
Here is what I suggest:
Find what: ^([^ \t]+).*
Replace with: $1
Explanation: ^ matches the start of line, ([^ \t]+) matches 1 or more (due to +) characters other than space and tab (due to [^ \t]), and then any number of characters up to the end of the line with .*.
In case you might have leading whitespace, you might want to use
^\s*([^ \t]+).*
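A scripted version of this replacement might look like the Python sketch below. It is a slight variant rather than a copy of the patterns above: leading whitespace is matched with [ \t]* and the word with \S+, so neither part can run across a line break. The name first_words and the sample text are assumptions for illustration.

```python
import re

# Optional leading spaces/tabs, the first run of non-whitespace characters
# (captured), then the rest of the line.
FIRST_WORD = re.compile(r"^[ \t]*(\S+).*$", flags=re.MULTILINE)


def first_words(text):
    """Reduce every non-blank line to its first word."""
    return FIRST_WORD.sub(r"\1", text)


sample = "#sp_id int,\n\t#sp_name\tvarchar(120),\n  #sp_role int"
reduced = first_words(sample)
print(reduced)
```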
I did something similar with this:
import re

with open('handles.txt', 'r') as handles:
    handlelist = [line.rstrip('\n') for line in handles]
newlist = [re.findall(r"\w+", line)[0] for line in handlelist]
This gets a list containing all the lines in the document,
then it changes each line to a string and uses regex to extract the first word (ignoring white spaces)
My file (handles.txt) contained info like this:
JoIyke - personal twitter link;
newMan - another twitter handle;
yourlink - yet another one.
The code will return this list:
['JoIyke', 'newMan', 'yourlink']
Find What: ^(\S+).*$
Replace by : \1
You can simply use this to get the first word. Here we capture the first word in a group and replace the whole line with the captured group.
Find the first word of each line with /^\w+/gm.
I have a malformed CSV file which has two columns: Text,Value
The value is either 1 or 0, but some lines are malformed and span two lines:
1. "This line is fine, but there are some that are not like this",0
2. "Another good line",1
4. "Oh, I'm so bad!!
5. I spanned two lines!",0
6. "Why did you break me? FileHelpers can't read two lines!!",1
Line 4 and 5 are supposed to be one line, but the CSV file I got is broken and they span two lines, this causes the FileHelpers engine to fail while reading the csv file.
I have two CSV files with about 3000 lines each and I will only need to fix them once. I want to use Notepad++ to find all the lines that do not end in ,0 or ,1. What kind of regex can I use for that? Or maybe two regular expressions, one for the ,0 case and the other for the ,1 case.
Update:
Dan's answer works without the comma [^01]$ instead of ,[^01]$, but it only matches lines that are not ending with 0 or 1... it works sufficiently well in my case, but it does skip lines that are broken and actually end with 0 or 1.
I don't know how the other answer would work:
Something like the below is what I would use in Notepad++
[^,][^01]$
Here are the steps I did:
Use ([^,][^01])$ to match the lines and replace with \1{marked}
Then switch to extended mode and replace {marked}\r\n with `` ( empty ) to get a single line.
The expression you would use is
([^,].|,[^01])$
But unfortunately, older versions of Notepad++ do not support alternation (the | operator). [1]
You can match the broken lines with these two expressions then:
[^,].$
,[^01]$
Except, of course, if the "Text" part does end in ,0 or ,1 itself. :-)
[1] http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Unsupported_Regex_Operators
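In an engine that does support alternation, the combined expression can be used as is. The following Python sketch, with a hypothetical helper name broken_lines and sample data adapted from the question, reports the line number of every line that does not end in ,0 or ,1:

```python
import re

# Either "not a comma, then any char" or "a comma, then something other
# than 0 or 1", anchored at the end of the line.
BROKEN_END = re.compile(r"([^,].|,[^01])$")


def broken_lines(lines):
    """Return (line_number, line) pairs for lines not ending in ,0 or ,1."""
    return [
        (n, line)
        for n, line in enumerate(lines, start=1)
        if BROKEN_END.search(line)
    ]


sample = [
    '"This line is fine, but there are some that are not like this",0',
    '"Another good line",1',
    '"Oh, I\'m so bad!!',
    'I spanned two lines!",0',
    '"Why did you break me? FileHelpers can\'t read two lines!!",1',
]
flagged = broken_lines(sample)
print(flagged)
```

Only the first half of the split record is flagged; the second half ends in ,0 and looks valid on its own, which is the same caveat noted above.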
,[^01]$
Make sure regex mode is on.
General considerations
In general, to match a line that does not end with a specific pattern, you may use
^(?!.*pattern$).*$
where ^ matches the start of a line, (?!.*pattern$) is a negative lookahead that fails the match if there are 0 or more chars other than line break chars, as many as possible (.*), followed with pattern at the end of the line ($), and the .*$ actually matches the line.
To remove a line that does not end with some pattern together with a line break at the end, use
^(?!.*pattern$).*\R?
where \R? is an optional line break sequence.
In case of several fixed strings, you may use
^(?!.*(?:pattern|pattern2|patternN)$).*\R?
If there is just one or two fixed strings to check at the end of the line, you may use a bit quicker regex like
^.*$(?<!a)(?<!bcd)
that will match any line that ends with neither a nor bcd.
^.*$(?<!1)(?<!0)
Current problem solution
So, for the current issue, to match a line not ending with 1 or 0, you may use
^(?!.*[01]$).*$ # without the line break
^(?!.*[01]$).*$\R? # with the line break
Or,
^.*(?<![01])$ # without the line break
^.*(?<![01])$\R? # with the line break
To remove/replace a line break on a line that does not end with a specific pattern you may use
(?<![01])$\R?
Replace with either an empty string (to remove the line break) or with any other delimiter string or character.
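Python's re module has no \R, so a scripted version of this repair has to spell out the line break. The sketch below makes two assumptions that are not in the answer above: the text uses \n line endings (which is what Python's text mode yields by default), and the lookbehind is tightened to ,[01] so that only lines ending in ,0 or ,1 count as complete. The name join_broken_lines is hypothetical.

```python
import re


def join_broken_lines(text, joiner=" "):
    """Replace every newline that does not follow ,0 or ,1 with joiner."""
    return re.sub(r"(?<!,[01])\n", joiner, text)


broken = (
    '"Another good line",1\n'
    '"Oh, I\'m so bad!!\n'
    'I spanned two lines!",0\n'
    '"Why did you break me?",1\n'
)
fixed = join_broken_lines(broken)
print(fixed)
```

Passing joiner="" removes the line break outright instead of replacing it with a space.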