How to remove repeated words or phrases within the same string - regex

I am working with a string variable response in Stata. This variable stores complete sentences, and many of these sentences have repeated phrases.
For example:
how do you know how do you know what it is?
it was during the during the past thirty days
well well I would hope I would hope that they're doing that
I want to clean these strings by removing all repeated phrases.
In other words, I want to transform this sentence:
how do you know how do you know what it is?
to the one below:
how do you know what it is?
So far, I have tried to fix each case individually, but this is incredibly time-consuming as there are thousands of repeated words/phrases.
I would like to run code that can identify when a phrase is repeated within the same observation / string, and then remove one instance of that phrase (or word).
I imagine regular expressions would help, but I cannot figure out much more than this.

The following works for me:
clear
input str80 string
"Pearly Spencer how do you know how do you know what it is?"
"it was during the during the past thirty days"
"well well I would hope I would hope that they're doing that"
"well well they're doing that I would hope I would hope "
"well well I would hope I would hope that they're doing that but but they don't"
end
clonevar wanted = string
local stop = 0
while `stop' == 0 {
generate dup = ustrregexs(2) if ustrregexm(wanted, "(\W|^)(.+)\s\2")
replace wanted = subinstr(wanted, dup, "", 1)
capture assert dup == ""
if _rc == 0 local stop = 1
else drop dup
}
replace wanted = strtrim(stritrim(wanted))
list wanted
+----------------------------------------------------------+
| wanted |
|----------------------------------------------------------|
1. | Pearly Spencer how do you know what it is? |
2. | it was during the past thirty days |
3. | well I would hope that they're doing that |
4. | well they're doing that I would hope |
5. | well I would hope that they're doing that but they don't |
+----------------------------------------------------------+
The above solution uses a regular expression to first identify repeated words / phrases. It then removes the first occurrence of the repeated element from the string by replacing it with an empty string.
Because this particular regular expression does not find all sets in one pass (for example in the last observation there are three sets - well, I would hope and but), the process is repeated using a while loop until no repeated elements remain in the string.
In the final step, all unnecessary spaces are deleted to bring the string back to shape.
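For readers outside Stata, the same loop-until-stable idea can be sketched in Python. This is only an illustration under assumptions: the name remove_repeats and the word-boundary pattern are not part of the answer above, and a legitimately doubled word (such as "that that") would be collapsed as well.

```python
import re

# A phrase that starts on a word boundary and is immediately followed by
# whitespace and an exact copy of itself (hypothetical pattern).
DUPLICATE = re.compile(r"\b(.+?)\s+\1\b")


def remove_repeats(text):
    """Collapse adjacent repeated words/phrases until nothing changes."""
    previous = None
    while previous != text:
        previous = text
        # Keep one copy of every repeated phrase found in this pass.
        text = DUPLICATE.sub(r"\1", text)
    # Normalise internal and trailing whitespace, like strtrim(stritrim()).
    return " ".join(text.split())


if __name__ == "__main__":
    print(remove_repeats("it was during the during the past thirty days"))
```

The loop plays the role of the while loop in the Stata code: it repeats the substitution until a pass leaves the string unchanged.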

Using Regex to clean a csv file in R

This is my first post so I hope it is clear enough.
I am having a problem regarding cleaning my CSV files before I can read them into R and have spent the entire day trying to find a solution.
My data is supposed to be in the form of two columns. The first column is a timestamp consisting of 10 digits and the second is an ID consisting of 11 or 12 letters and numbers (the first 6 are always numbers).
For example:
logger10 |
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F0831102744
010700EDDA0F|
would become:
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F
0831102744 | 010700EDDA0F
(please excuse the lines in the middle, that was my attempt at separating the columns...).
The csv file seems to occasionally be missing a comma which means that sometimes one row will end up like this:
0923120531,010300033ADD0925075301,010700EDD00A
My hardware also adds the word logger10 (or whichever number logger this is) whenever it restarts which gives a similar problem e.g. logger10logger100831102744.
I think I have managed to solve the logger text problem (see code) but I am sure this could be improved. Also, I really don't want to delete any of the data.
My real trouble is making sure there is a line break in the right place after the ID and, if not, I would like to add one. I thought I could use regex for this but I'm having difficulty understanding it.
Any help would be greatly appreciated!
Here is my attempt:
temp <- list.files(pattern="*.CSV") #list of each csv/logger file
for(i in temp){
#clean each csv
tmp<-readLines(i) #check each line in file
tmp<-gsub("logger([0-9]{2})","",tmp) #remove logger text
pattern <- ("[0-9]{10}\\,[0-9]{6}[A-Z,0-9]{5,6}") #regex pattern ??
if (tmp!= pattern){
#I have no idea where to start here...
}
}
here is some raw data:
logger01
0729131218,020700EE1961
0729131226,020700EE1961
0831103159,0203000316DB
0831103207,0203000316DB0831103253,010700EDE28C
0831103301,010700EDE28C
0831103522,010300029815
0831103636,010300029815
0831103657,020300029815
If you want to do this in a single pass:
(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?
can be replaced with
\1\t\2\n
This optionally matches any of those rogue logger01 entries (including the space after it). The trailing ? after the group means that it can match 0 or 1 times: if the text is there, it is matched; if it is not, the match just keeps going anyway.
Following that, you look for (and capture) exactly 10 hex values (either digits or A-F). The ,? means that if a comma exists, it will match, but it can match 0 or 1 time as well (making it optional).
Following that, look for (and capture) exactly 12 hex values. Finally, to get rid of any strange trailing spaces, the ? (a space character followed by ?) will optionally match the trailing space.
Your replacement will replace the first captured group (the 10 hex digits), add in a tab, replace the second captured group (the 12 hex digits), and then a newline.
You can see this in use on regex101 to see the results. You can use code generator on the left side of that page to get some preformatted PHP/Javascript/Python that you can just drop into a script.
If you're doing this from the command line, perl could be used:
perl -pe 's/(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?/\1\t\2\n/g'
If another language, you may need to adapt it slightly to fit your needs.
EDIT
Re-reading the OP and comments, a slightly more rigid regex could be
(?:logger\d\d\ )?([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})\ ?
I updated the regex101 link with the changes.
This still looks for the first 10 hex values, but now looks for exactly 6 digits, followed by 5-6 hex values, so the total number of characters matched is 11 or 12.
The replacement would be the same.
Paste your regex here https://regex101.com/ to see whether it catches all cases. The 5 or 6 letters or digits could pose an issue, as the pattern may catch the first digit of the next timestamp when the logger misses out a comma. Appending a '\n' to the end of each match in the tmp string should work, provided the regex catches all cases.
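As a rough illustration of the approach discussed above, here is a Python sketch that pulls every timestamp/ID pair out of the raw text, wherever the line breaks happen to be. The names RECORD and clean_log are made up for this example, the timestamp is assumed to be 10 digits as stated in the question, and the greedy {5,6} carries the caveat just mentioned: an 11-character ID directly followed by a timestamp would absorb one digit.

```python
import re

# Optional logger marker, 10-digit timestamp, optional comma,
# then an ID made of 6 digits plus 5-6 hex characters.
RECORD = re.compile(r"(?:logger\d\d\s*)?(\d{10}),?(\d{6}[\dA-F]{5,6})")


def clean_log(raw):
    """Return one 'timestamp,id' string per record found in raw."""
    return [f"{stamp},{tag}" for stamp, tag in RECORD.findall(raw)]


if __name__ == "__main__":
    sample = (
        "logger01\n"
        "0729131218,020700EE1961\n"
        "0831103207,0203000316DB0831103253,010700EDE28C\n"
    )
    for row in clean_log(sample):
        print(row)
```

Because the records are found by searching rather than by splitting on line breaks, a missing newline between two records does not matter.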

Regex newbie: How to isolate 'num-num-num' in a string

I'm sure this is a super simple question for many of you, but I've only just started learning regex and at the moment can't for the life of me isolate what I'm after from the following:
June 2015 - Won / Void / Lost = 3-0-1
I need a solution to isolate the 'num-num-num' part at the end of the string that would work for any positive integers.
Thanks for any help
EDIT
So this line of code from a scrapy spider I'm writing produces the line above:
tips_str = sel.xpath('//*[@class="recent-picks"]//div[@class="title3"]/text()').extract()[0]
I've tried to isolate the part I'm after with:
tips_str = sel.xpath('//*[@class="recent-picks"]//div[@class="title3"]/text()').re(r'\d+-\d+-\d+$').extract()[0]
No luck though :(
The regex to capture that is:
\d+-\d+-\d+$
It works as follows:
\d+- means: match 1 or more digits (the numbers [0-9]), and then a "-".
$ means: you should now be at the end of the line.
Translating that into the full regex pattern:
Match 1 or more digits, then a hyphen, then 1 or more digits, then a hyphen, then 1 or more digits, and we should now be at the end of the string.
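If the three numbers are needed separately, the same pattern can be given capturing groups. A minimal sketch, assuming the input is a plain string like the one in the question; the function name parse_record is invented for illustration:

```python
import re

# Same pattern as above, with parentheses added around each number.
RECORD = re.compile(r"(\d+)-(\d+)-(\d+)$")


def parse_record(line):
    """Return (won, void, lost) as integers, or None if the line does not match."""
    match = RECORD.search(line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


if __name__ == "__main__":
    print(parse_record("June 2015 - Won / Void / Lost = 3-0-1"))
```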
EDIT: Addressing your edits and comments:
I'm not so sure what you mean by "isolate". I'll assume that you mean you want tips_str to equal "3-0-1".
I believe the easiest way would be to first use xpath to extract the string for the entire line without doing any regex. Then, when we're simply dealing with a string (instead of xpath stuff), it should be nice and easy to use regex and get the pattern out.
As far as I understand, sel.xpath('//*[@class="recent-picks"]//div[@class="title3"]/text()').extract()[0] (without .re()) is providing you with the string: "June 2015 - Won / Void / Lost = 3-0-1".
So then:
full_str = sel.xpath('//*[@class="recent-picks"]//div[@class="title3"]/text()').extract()[0]
Now that we've got the full string, we can use standard string regex to pluck the part we want out:
import re
tips_str = False
search = re.search(r'\d+-\d+-\d+$', full_str)
if search:
    tips_str = search.group(0)
Now tips_str will equal "3-0-1". If the pattern wasn't matched at all, it'd instead equal False.
If any of my assumptions are wrong then let me know what's actually happening (like if .extract()[0] isn't giving back a string, then what is it giving back?) and I'll try to adjust this response.
Any and all numbers, so negatives, scientific notation, etc? This will match it.
/(\-?[\.\d]+(e\+|e\-)?[\.\d]*)-(\-?[\.\d]+(e\+|e\-)?[\.\d]*)-(\-?[\.\d]+(e\+|e\-)?[\.\d]*)$/ig
Tested with these:
June 2015 - Won / Void / Lost = -1.1e+3-1.01-0.1e+2
June 2015 - Won / Void / Lost = 1-2-3
June 2015 - Won / Void / Lost = 0.1--5-5.6
If you take $ out of it, it will match on all lines at the same time.

Trying to check input textbox for time

I have to make this overview of questions and the user has to be able to insert a time.
To do this I made 2 textboxes, 1 is for the hour input and 1 is for the minute input.
What I want to do now is check that the values aren't too high to be correct.
Example:
The hour value can't be higher than 23 and the minute can't be higher than 59.
What is the best method for checking this?
I've been thinking about if statements but maybe there is a much more efficient way to get this done?
Maybe regular expressions, although I wouldn't know the correct syntax for this.
Thanks in advance.
If it has to be a regex:
^(?:2[0-3]|[01]?[0-9])$
will validate the hour and
^[0-5]?[0-9]$
will validate the minute.
Explanation for the "Hours" regex: (you can figure out the minutes yourself easily):
^ # Match start of string
(?: # Match either...
2[0-3] # 2, followed by 0, 1, 2 or 3,
| # or...
[01]? # 0 or 1 (optional; the empty string is OK, too), followed by
[0-9] # any digit
) # End of group
$ # Match end of string
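The question does not say which language the textboxes live in, so the following is only a sketch in Python of how the two patterns might be applied. The function name is_valid_time is hypothetical; re.fullmatch is used in place of the ^ and $ anchors.

```python
import re

# The two patterns from above, without anchors because fullmatch()
# already requires the whole string to match.
HOUR = re.compile(r"(?:2[0-3]|[01]?[0-9])")
MINUTE = re.compile(r"[0-5]?[0-9]")


def is_valid_time(hour_text, minute_text):
    """True if both textbox values look like a valid hour and minute."""
    return bool(HOUR.fullmatch(hour_text) and MINUTE.fullmatch(minute_text))


if __name__ == "__main__":
    print(is_valid_time("23", "59"))
    print(is_valid_time("24", "00"))
```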
If statements are definitely the way to go. There's no reason to use a regular expression for something so simple... it's like using a sledgehammer to place a small nail into a wall. If statements are also very efficient and easy to read... there's no reason to use regex for what you're doing.
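A sketch of that plain comparison approach, again in Python because the question does not name a language. The function name parse_time is made up for this example, and returning None for invalid input is just one possible convention:

```python
def parse_time(hour_text, minute_text):
    """Return (hour, minute) as integers, or None if either value is invalid."""
    try:
        hour = int(hour_text)
        minute = int(minute_text)
    except ValueError:
        # Not a number at all (empty textbox, letters, ...).
        return None
    if 0 <= hour <= 23 and 0 <= minute <= 59:
        return hour, minute
    return None


if __name__ == "__main__":
    print(parse_time("23", "59"))
    print(parse_time("24", "0"))
```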

How can I parse this without regex?

A friend of mine said if the regex I'm using is too long, it's probably the wrong tool for the job. Any thoughts here on a better way to parse this text? I have a regex that returns everything to an array I can easily just chunk out, but if there's another simpler way I'd really like to see it.
Here's what it looks like:
2 AB 123A 01JAN M ABCDEF AA1 100A 200A 02JAN T /ABCD /E
Here's a break down of that:
2 is the line number, these range from 1 all the way to 99. If you can't see because of formatting, there is a space character prepending numbers less than 10.
The space may or may not be replaced by an *
AB is an important unit of data (UOD).
AB may be prepended by /CD which is another important UOD.
123 is an important UOD. It can range from 1 (prepended by 4 spaces) to 99999.
A is an important UOD.
01JAN is a day/month combination, I need to extract both UODs.
M is a day name short form. This may be a number between 1 and 7.
ABC is an important UOD.
DEF is an important UOD.
The space after DEF may be an *
AA1 may be zero characters, or it may be 5. It is unimportant.
100A is a timestamp, but may be in the format 1300. The A may be N when the time is 1200 or P for times in the PM.
We then see another timestamp.
The next date part may not be there, for example, this is valid:
93*DE/QQ51234 30APR J QWERTY*QQ0 1250 0520 /ABCD*ASDFAS /E
The data where /ABCD*ASDFAS /E appears is irrelevant to the application, but, this is where the second date stamp may appear. The front-slash may be something else (such as a letter).
Note:
It is not space delimited, some parts of the body run into others. Character position is only accurate for the first two or three items on the list
I don't think I left anything out, but, if there's an easier way to parse out a string like this than writing a regex, please let me know.
This is a perfect task for regular expressions. The text does not contain nesting and the items you're matching are fairly simple taken individually.
Most regular expression syntaxes have an extended (x) flag or mode that allows whitespace and comments to improve readability. For example:
$regex = '~
# 2 is the line number, these range from 1 all the way to 99.
# There is a space character prepending numbers less than 10.
# The space may or may not be replaced by an *.
(?:[ *]\d|\d\d)
\s
# AB is an important unit of data (UOD).
# AB may be prepended by /CD which is another important UOD.
(/CD)?AB
\s
# 123 is an important UOD. It can range from 1 (prepended by 4 spaces)
# to 99999.
(?:\s{4}\d{1}|\s{3}\d{2}|\s{2}\d{3}|\s{1}\d{4}|\d{5})
~x';
And so on.
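The same commented style is available in Python through re.VERBOSE. The sketch below covers only the first few fields and rests on assumptions about the layout: the sample lines are invented, the /CD prefix is left out, and the names FIELDS and parse_prefix are hypothetical.

```python
import re

FIELDS = re.compile(r"""
    (?P<line_no> [ *]\d | \d\d )   # line number 1-99, padded with a space or *
    [ *]                           # separator: a space or *
    (?P<unit>    [A-Z]{2} )        # two-letter unit such as AB
    \s*                            # padding in front of the number
    (?P<number>  \d{1,5} )         # 1 to 99999
    (?P<suffix>  [A-Z] )           # single-letter unit such as A
""", re.VERBOSE)


def parse_prefix(line):
    """Return the leading fields of a line as a dict, or None if they do not match."""
    match = FIELDS.match(line)
    if match is None:
        return None
    return {
        "line_no": int(match.group("line_no").strip(" *")),
        "unit": match.group("unit"),
        "number": int(match.group("number")),
        "suffix": match.group("suffix"),
    }


if __name__ == "__main__":
    print(parse_prefix(" 2 AB   123A 01JAN M"))
```

Named groups keep each field tied to its comment, which addresses the readability worry that prompted the question.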
A regex seems fine for this application, but for simplicity and readability, you might want to split this into several regexes (one for each field) so people can more easily follow which part of the regex corresponds to which variable.
You can always code your own parser by hand, but that would be more lines of code than a regex. The lines of code, however, will probably be simpler to follow for the reader.
Simply write a custom parser that handles it line by line. It seems like everything is at a fixed position rather than space/comma-delimited, so simply use those as indices into what you need:
line_number = int(line_text[0:2])  # two characters: " 2" through "99"
ab_unit = line_text[3:5]  # the two-letter unit that follows
...
If it is indeed space-delimited, simply split() each line and then parse through each, splitting each chunk into component parts where appropriate.

Create shortest possible regex

I want to create a regex that will match any of these values
7-5
6-6 ((0-99) - (0-99))
6-4
6-3
6-2
6-1
6-0
0-6
1-6
2-6
3-6
4-6
the 6-6 example is a special case, here are some examples of values:
6-6 (23-8)
6-6 (4-25)
6-6 (56-34)
Is it possible to make one regex that can do this?
If so, is it possible to further extend that regex for the 6-6 special case such that the difference between the two numbers within the parentheses is equal to 2 or -2?
I could easily write this with procedural code, but i'm really curious if someone can devise a regex for this.
Lastly, if it could be further extended such that the individual digits were in their own match groups I'd be amazed. An example would be for 7-5, i could have a match group that just had the value 7, and another that had the value 5. However for 6-6 (24-26) I'd like a match group that had the first six, a match group for the second 6, a match group for the 24 and a match group for the 26.
This may be impossible, but some of you can probably get this part of the way there.
Good luck, and thanks for the help.
NO. The answer is "We can't," and the reason is because you're trying to use a hammer to dig a hole.
The problem with writing one long "clever" (this word causes a knee-jerk reaction in many people who are far more anti-regex than I) regex is that, six months from now, you'll have forgotten those clever regex features that you used so heavily, and you'll have written six months worth of code related to something else, and you'll get back to your impressive regex and have to tweak one detail, and you'll say, "WTF?"
This is what (I understand) you want, in Perl:
# data is in $_
if(/7-5|6-[0-4]|[0-4]-6|6-6 \((\d{1,2})-(\d{1,2})\)/) {
if(defined $1 and defined $2 and abs($1 - $2) == 2) {
# we have the right difference
}
}
Some might say that the given regex is a bit much, but I don't think it's too bad. If the \d{1,2} bit is a little too obscure you could use \d\d? (which is what I used at first, but didn't like the repetition).
You can do it like this:
7-5|6-[0-4]|[0-4]-6|6-6 \(\d\d?-\d\d?\)
Just add parens to get your match groups.
Off the top of my head (there may be some errors but the principle should be good):
6-6 \(\d+-\d+\)|\d-\d
And like with any regexp, you can surround what you want to extract with parentheses for match groups:
(6)-(6) \((\d+)-(\d+)\)|(\d)-(\d)
In the 6-6 case, the first two parentheses should get the sixes, and the second two should get the multi-digit values that come afterwards.
Here is one that will match only the numbers you want and let you get each digit by name:
p = r'(?P<a>[0-4]|6|7)-(?P<b>[0-4]|6|5) *(\((?P<c>\d{1,2})-(?P<d>\d{1,2})\))?'
To get each digit you could use:
values = re.search(p, string).group('a', 'b', 'c', 'd')
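Which will return a four-element tuple with the values you are looking for; c and d are None when the parenthesised part is absent. Note that re.search itself returns None if no match was found, so check for that before calling .group.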
One problem with this pattern is that it will match the stuff in the parentheses whether or not there was a match to '6-6'. This one will only match the final parentheses if 6-6 is matched:
p = r'(?P<a>[0-4]|(?P<tmp_a>6)|7)-(?P<b>(?(tmp_a)(?P<tmp_b>6)|([0-4]|5)))(?(tmp_b) *(\((?P<c>\d{1,2})-(?P<d>\d{1,2})\))?)'
I don't know of any way to look for a difference between the numbers in the parentheses; regex only knows about strings, not numerical values . . .
(I am assuming python syntax here; the perl syntax is slightly different, though perl supports the python way of doing things.)
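Since the numeric comparison cannot live in the regex, one option is to let the regex do the splitting and leave the arithmetic to ordinary code. The sketch below takes that route for the extended variant of the question (difference of exactly 2 in the 6-6 case). The names SCORE, ALLOWED and parse_set are invented here, and the rule set is an assumption read off the list in the question.

```python
import re

# One digit, a hyphen, one digit, then an optional " (n-n)" part.
SCORE = re.compile(r"(?P<a>\d)-(?P<b>\d)(?: \((?P<c>\d{1,2})-(?P<d>\d{1,2})\))?")

# Plain scores listed in the question: 7-5, 6-0..6-4 and 0-6..4-6.
ALLOWED = {(7, 5)} | {(6, n) for n in range(5)} | {(n, 6) for n in range(5)}


def parse_set(text):
    """Return the numbers as a tuple of ints, or None if the score is not accepted."""
    match = SCORE.fullmatch(text)
    if match is None:
        return None
    a, b = int(match.group("a")), int(match.group("b"))
    if match.group("c") is None:
        # No parenthesised part: must be one of the plain scores.
        return (a, b) if (a, b) in ALLOWED else None
    c, d = int(match.group("c")), int(match.group("d"))
    # The parenthesised part is only valid after 6-6, with a difference of 2.
    if (a, b) == (6, 6) and abs(c - d) == 2:
        return (a, b, c, d)
    return None


if __name__ == "__main__":
    print(parse_set("7-5"))
    print(parse_set("6-6 (24-26)"))
```

Dropping the abs(c - d) == 2 condition gives the looser behaviour from the first part of the question, where any two numbers from 0 to 99 are accepted after 6-6.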