I have a file full of URLs in a weird format, with every character separated by a space:
h t t p : / / w w w . y o u t u b e . c o m / u s e r / A S D
h t t p : / / m o r c c . c o m / f r m / i n d . p h p ? t o p i c = 5 7 . 0
I would like to make it look like this:
http://www.youtube.com/user/ASD
http://morcc.com/frm/ind.php?topic=57.0
I use Notepad++, and I think regex could take care of this problem for me, but unfortunately I don't know regex.
I want to remove the space character between the characters and keep the URLs one per line, so replacing \s with '' is not a solution, because it becomes a mess :/
I think I should also insert a \n before each occurrence of "http".
Can you not just replace a space ' ' with an empty string ''? Replacing \s is not working how you want because newlines are also matched.
If that doesn't work you could, as you say, replace \s with '' and then replace http with \nhttp.
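Outside Notepad++, the two suggestions above can be sketched in Python. This is only an illustration; the function names are made up for the example, and the fallback assumes that "http" never appears inside a URL other than at its start.

```python
import re

def unspace_urls(text: str) -> str:
    # Remove only literal spaces; newlines are untouched,
    # so each URL stays on its own line.
    return text.replace(" ", "")

def unspace_then_split(text: str) -> str:
    # Fallback: strip ALL whitespace (newlines included), then put a
    # newline back in front of every "http" except the very first one.
    squashed = re.sub(r"\s", "", text)
    return re.sub(r"(?!^)http", r"\nhttp", squashed)

spaced = (
    "h t t p : / / w w w . y o u t u b e . c o m / u s e r / A S D\n"
    "h t t p : / / m o r c c . c o m / f r m / i n d . p h p ? t o p i c = 5 7 . 0"
)
print(unspace_urls(spaced))
```

The first variant is usually enough; the second is only needed if the line breaks have already been lost.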
Regex is fairly basic. Check out the examples page. The second example seems to have what you're looking for: http://www.regular-expressions.info/examples.html
EDIT: Also, I assume you know this, but just to be sure, regex itself will not do what you want. What language are you planning on using regex with, so that people can provide more detailed responses?
Regex reference page [Bookmark it ;)] - http://www.regular-expressions.info/reference.html
I am trying to figure out a regex that matches a pattern plus all the characters after it, but stops before the next occurrence of the pattern so that matches do not overlap.
This is my current regex
[a-zA-Z]{2}\d{1}\s?\w?
The pattern is always two letters followed by a digit, like AE1 or BE3, but I also need all the characters that follow the pattern.
So AE1 A E F should match as a whole, but if the pattern occurs again in the string, as in
AE1 A D BE1 A D C, the matches must not overlap and there should be two separate matches.
So to clarify
AB3 D T B should be one match on the regex
ABC D A F DE3 D CD A
should have 2 matches, each with all the characters following it, because of the two-letter word and number.
How do I achieve this?
I'm not quite following the logic here, yet my guess would be that we might want something similar to this:
([A-Z]{2}\d\s([A-Z]+\s)+)|([A-Z]{3}\s([A-Z]+\s)+)
which allows two letters followed by a digit, or three letters, both followed by ([A-Z]+\s)+.
You have to consider where your pattern starts. What is the difference between AE1 A E F and BE1 A D C in AE1 A D BE1 A D C? You don't want to treat both the same way, so you have to separate them, and the only way to tell them apart is by which one sits at the start of the text.
So simply adding ^ to the start of your pattern will solve the problem.
Your regex should look like this:
^[a-zA-Z]{2}\d{1}\s?\w?
What you want to do is split the string so that each match of your pattern becomes the start of an extracted substring.
You may use
(?!^)(?=[a-zA-Z]{2}\d)
to split the string. Details
(?!^) - not at the start of the string
(?=[a-zA-Z]{2}\d) - a location in the string that is immediately followed with 2 ASCII letters and any digit.
See the Scala demo:
val s = "ABC D A F DE3 D CD A"
val rx = """(?!^)(?=[a-zA-Z]{2}\d)"""
val results = s.split(rx).map(_.trim)
println(results.mkString(", "))
// => ABC D A F, DE3 D CD A
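For comparison, here is a rough Python equivalent of the same split. It assumes Python 3.7 or later, where re.split also splits on empty (zero-width) matches; the function name is only illustrative.

```python
import re

# Zero-width split point: not at the string start, and immediately
# followed by two ASCII letters and a digit.
SPLIT_RX = re.compile(r"(?!^)(?=[a-zA-Z]{2}\d)")

def split_on_pattern(s: str) -> list:
    # Split at every position found above and trim the leftover spaces.
    return [part.strip() for part in SPLIT_RX.split(s)]

print(", ".join(split_on_pattern("ABC D A F DE3 D CD A")))
```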
You can just use this regex:
(?i)\b[a-z]{2}\d\b(?:(?:(?!\b[a-z]{2}\d\b).)+\s?)?
Demo and explanations: https://regex101.com/r/DtFU8j/1/
It uses a negative lookahead (?!\b[a-z]{2}\d\b) to add the constraint that the characters matched after the initial pattern (?i)\b[a-z]{2}\d\b must not contain that pattern again.
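A quick way to try this regex outside regex101 is a small Python sketch like the one below. The helper name is made up for the example. Note that, unlike the split approach, this only returns text that starts with a two-letters-plus-digit token, so anything before the first such token is skipped.

```python
import re

# Two letters + digit as a whole word, then any run of characters
# that does not start another such token.
BLOCK_RX = re.compile(r"(?i)\b[a-z]{2}\d\b(?:(?:(?!\b[a-z]{2}\d\b).)+\s?)?")

def find_blocks(s: str) -> list:
    # Return each full match with surrounding whitespace trimmed.
    return [m.group(0).strip() for m in BLOCK_RX.finditer(s)]

print(find_blocks("AE1 A D BE1 A D C"))
```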
I'm trying to remove all single characters except for the combination 'r d'. To be clearer, some examples:
a ball -> ball
r something -> something
d someone -> someone
r d something -> r d something
r something d -> something
So far I have managed to remove the single letters except for r or d, but this is not what I want. I want to keep only the combination (example 4). I use this:
\b(?!r|d)\w{1}\b
Any idea how to do it?
Edit: The regex engine supports lookbehinds.
You may capture the r d combination and use a backreference in the replacement pattern to restore that combination, and remove all other matches:
\b(r d)\b|\b\w\b\s*
See the regex demo (replace with $1 that will put the r d back into the result).
Details:
\b(r d)\b - a "whole word" r d that is captured into Group 1
| - or
\b\w\b\s* - a single whole word consisting of 1 letter/digit/underscore (\b\w\b) and followed with 0+ whitespaces (\s*, just for removing the excessive whitespace, might not be necessary).
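The capture-and-restore idea can be sketched in Python as follows. The function name is hypothetical, and the final strip() is an addition of mine to drop a space left behind when the removed character is the last word.

```python
import re

# Group 1 keeps the "r d" combination; the second branch matches any
# other single-character word plus the whitespace after it.
RX = re.compile(r"\b(r d)\b|\b\w\b\s*")

def drop_single_chars(text: str) -> str:
    # Put group 1 back when it matched, otherwise replace with nothing.
    cleaned = RX.sub(lambda m: m.group(1) or "", text)
    return cleaned.strip()

for sample in ["a ball", "r something", "d someone",
               "r d something", "r something d"]:
    print(sample, "->", drop_single_chars(sample))
```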
Let's say I have the phrase:
www w w w wwwcom com c o m
I want to block www, w w w, com (when it's not part of another word) and c o m.
I can do this by looking for each of the following:
www
\bw w w\b
\bcom\b
\bc o m\b
However, can I combine them into single statements that would search for both www and w w w, or com and c o m, leaving 2 regular expressions rather than 4?
You can use a pipe (|) to OR your regexes together. For example rega|regb. In your case it will be:
\b((www)|(w w w))\b
Strip the white space as part of the match:
www - \b\s*w\s*w\s*w\s*\b
com - \b\s*c\s*o\s*m\s*\b
both - \b\s*(?:w\s*w\s*w|c\s*o\s*m)\s*\b
Sabuj Hassan's answer is the only way to get precisely what you've specified. However, if you're okay with also matching ww w and w ww, then you could use:
\bw ?w ?w\b
This will allow up to one space between each pair of letters (tabs, multiple spaces, etc will not be matched).
The same can be done for com. You can combine this with the | approach to get one regex:
\b(w ?w ?w|c ?o ?m)\b
David Haney's answer (using \s*) is similar, but will match "phrases" that have any combination of spaces and tabs between letters. For instance, w\t\t\t \tw w (where \t is a tab character) will be considered a match.
\b(w\s*w\s*w|c\s*o\s*m)\b
This matches:
www
com
ww w
co m
w w w
c o m
in:
www.something => matches www
wwwword => no matches
word.com => matches com
wordcom => no matches
doesn't match:
wwwe
ewww
come
ecom
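The lists above can be checked with a short Python sketch. The helper name blocked_terms is just for illustration.

```python
import re

# www or com as whole words, with optional whitespace between letters.
RX = re.compile(r"\b(w\s*w\s*w|c\s*o\s*m)\b")

def blocked_terms(text: str) -> list:
    # Return every blocked term found in the text, in order.
    return [m.group(1) for m in RX.finditer(text)]

for sample in ["www.something", "wwwword", "word.com", "wordcom"]:
    print(sample, "=>", blocked_terms(sample))
```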
In R, is there a better/simpler way than the following of finding the location of the last dot in a string?
x <- "hello.world.123.456"
g <- gregexpr(".", x, fixed=TRUE)
loc <- g[[1]]
loc[length(loc)] # returns 16
This finds all the dots in the string and then returns the last one, but it seems rather clumsy. I tried using regular expressions, but didn't get very far.
Does this work for you?
x <- "hello.world.123.456"
g <- regexpr("\\.[^\\.]*$", x)
g
\. matches a dot
[^\.] matches everything but a dot
* specifies that the previous expression (everything but a dot) may occur between 0 and unlimited times
$ marks the end of the string.
Taking everything together: find a dot that is followed by anything but a dot until the string ends. R requires \ to be escaped, hence \\ in the expression above. See regex101.com to experiment with regex.
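For readers coming from other languages, the same regex can be tried in Python. This is only a sketch: the function names are made up, and the + 1 is there to mimic R's 1-based positions (with -1 for no match, as regexpr returns).

```python
import re

def last_dot_regex(s: str) -> int:
    # A dot followed only by non-dot characters up to the end.
    m = re.search(r"\.[^.]*$", s)
    return m.start() + 1 if m else -1

def last_dot_rfind(s: str) -> int:
    # Plain string search from the right, no regex needed.
    return s.rfind(".") + 1 if "." in s else -1

x = "hello.world.123.456"
print(last_dot_regex(x), last_dot_rfind(x))
```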
How about a minor syntax improvement?
This will work for your literal example where the input vector is of length 1. Use escapes to get a literal "." search, and reverse the result to get the last index as the "first":
rev(gregexpr("\\.", x)[[1]])[1]
A more proper vectorized version (in case x is longer than 1):
sapply(gregexpr("\\.", x), function(x) rev(x)[1])
and another tidier option to use tail instead:
sapply(gregexpr("\\.", x), tail, 1)
Someone posted the following answer which I really liked, but I notice that they've deleted it:
regexpr("\\.[^\\.]*$", x)
I like it because it directly produces the desired location, without having to search through the results. The regexp is also fairly clean, which is a bit of an exception where regexps are concerned :)
There is a slick stri_locate_last function in the stringi package, that can accept both literal strings and regular expressions.
To just find a dot, no regex is required, and it is as easy as
stringi::stri_locate_last_fixed(x, ".")[,1]
If you need to use this function with a regex, to find the location of the last regex match in the string, you should replace _fixed with _regex:
stringi::stri_locate_last_regex(x, "\\.")[,1]
Note the . is a special regex metacharacter and should be escaped when used in a regex to match a literal dot char.
See an R demo online:
x <- "hello.world.123.456"
stringi::stri_locate_last_fixed(x, ".")[,1]
stringi::stri_locate_last_regex(x, "\\.")[,1]
I have the following data:
a b c d FROM:<uniquepattern1>
e f g h TO:<uniquepattern2>
i j k l FROM:<uniquepattern1>
m n o p TO:<uniquepattern3>
q r s t FROM:<uniquepattern4>
u v w x TO:<uniquepattern5>
I would like a regex query that can find the contents of TO: when FROM:<uniquepattern1> is encountered, so the results would be uniquepattern2 and uniquepattern3.
I am hopeless with regex, I would appreciate any pointers on how to write this (lookahead parameters?) and any differences between regex on different platforms (eg the C# .NET Regex versus Grep vs Perl) that might be relevant here.
Thank you.
Try:
/FROM:<uniquepattern1>.*\r?\n.*?TO:<(.*?)>/
This works by first finding the FROM anchor and then using a dot wildcard (.*). The dot does not match a newline, so this consumes the rest of the line. After the line break, a non-greedy dot wildcard consumes up to the next TO, and the group captures what's between the angle brackets.
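As a sanity check, this is roughly how the pattern behaves in Python's re module (without the surrounding slashes). The variable and function names are only for the example.

```python
import re

DATA = """a b c d FROM:<uniquepattern1>
e f g h TO:<uniquepattern2>
i j k l FROM:<uniquepattern1>
m n o p TO:<uniquepattern3>
q r s t FROM:<uniquepattern4>
u v w x TO:<uniquepattern5>
"""

# FROM anchor, rest of that line, a line break, then the TO value.
RX = re.compile(r"FROM:<uniquepattern1>.*\r?\n.*?TO:<(.*?)>")

def find_recipients(text: str) -> list:
    # findall returns only the captured group for each match.
    return RX.findall(text)

print(find_recipients(DATA))
```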
Your requirement for file parsing is simple; there is no need to use a regular expression. Open the file for reading, go through each line checking for FROM:<uniquepattern1>, then get the next line and print it. Furthermore, the TO value is separated from the rest of its line only by ":", so you can use that as the field delimiter.
E.g. with awk:
$ awk -F":" '/FROM:<uniquepattern1>/{getline;print $2}' file
<uniquepattern2>
<uniquepattern3>
The same approach works in other languages and tools.
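For instance, a line-by-line version in Python might look like this. It is a sketch that mirrors the awk one-liner; the function and parameter names are hypothetical.

```python
def to_after_from(lines, sender="FROM:<uniquepattern1>"):
    # Walk the lines once; when the sender marker is seen, read the
    # NEXT line and keep its second ":"-separated field (awk's $2).
    results = []
    it = iter(lines)
    for line in it:
        if sender in line:
            nxt = next(it, None)
            if nxt is None:
                break  # marker was on the last line
            fields = nxt.rstrip("\r\n").split(":")
            results.append(fields[1] if len(fields) > 1 else "")
    return results

sample = [
    "a b c d FROM:<uniquepattern1>",
    "e f g h TO:<uniquepattern2>",
    "i j k l FROM:<uniquepattern1>",
    "m n o p TO:<uniquepattern3>",
    "q r s t FROM:<uniquepattern4>",
    "u v w x TO:<uniquepattern5>",
]
print(to_after_from(sample))
```

With a real file, passing the open file object instead of the list should work the same way, since it is iterated line by line.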