Only output matching regex pattern - regex

I have a CSV file that contains tens of thousands of rows. Each row has 8 columns. One of those columns contains text similar to this:
this is a row: http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
this is a row: http://yetanotherdomain.net
this is a row: https://hereisadomain.org | some_text
I'm currently accessing the data in this column this way:
import re

for row in csv_reader:  # csv_reader is the csv.reader over the file
    the_url = row[3]
    # this regex is used to find the hrefs
    href_regex = re.findall('(?:http|ftp)s?://.*', the_url)
    for link in href_regex:
        print(link)
Output from the print statement:
http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text
http://yetanotherdomain.net
https://hereisadomain.org | some_text
How do I obtain only the URLs?
http://somedomain.com
http://someanotherdomain.com
http://yetanotherdomain.net
https://hereisadomain.org

Just change your pattern to:
\b(?:http|ftp)s?://\S+
Instead of matching anything with .*, match only non-whitespace characters with \S+, so each match stops at the first space after the URL. You might want to add a word boundary before your non-capturing group, too.

Instead of repeating any character at the end
'(?:http|ftp)s?://.*'
                  ^
repeat any character except a space, to ensure that the pattern will stop matching at the end of a URL:
'(?:http|ftp)s?://[^ ]*'
                  ^^^^
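
To make the difference concrete, here is a minimal sketch that runs the original greedy pattern and the \S+ variant over the three sample rows from the question. The names rows, greedy, non_space and extract_urls are made up for this illustration and are not part of the original code.

```python
import re

# Sample column values copied from the question.
rows = [
    "this is a row: http://somedomain.com | some_text | http://someanotherdomain.com | some_more_text",
    "this is a row: http://yetanotherdomain.net",
    "this is a row: https://hereisadomain.org | some_text",
]

# Original pattern: .* runs to the end of the line.
greedy = re.compile(r"(?:http|ftp)s?://.*")
# Suggested pattern: \S+ stops at the first whitespace character.
non_space = re.compile(r"\b(?:http|ftp)s?://\S+")


def extract_urls(text, pattern=non_space):
    """Return every substring of text matched by pattern (hypothetical helper)."""
    return pattern.findall(text)


urls = [url for row in rows for url in extract_urls(row)]
for url in urls:
    print(url)
```

With the sample data, the greedy pattern returns one long match per row, while the \S+ pattern returns each URL separately. A URL followed directly by punctuation (no space) would still carry that punctuation, so real data may need a stricter character class.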

Related

Match a word in a list of words regex

I want the user to only be able to enter the values in the following regex:
^[AB | BC | MB | NB | NL | NS | NT | NU | ON |QC | PE | SK | YT]{2}$
My problem is that words like PP, AA and QQ are accepted.
I am not sure how I can prevent that. Thank you.
Site I use to verify the expression: https://regex101.com/
In most regex flavors, square brackets [] denote character classes; that is, a set of individual characters, any one of which can match at that position.
Because P is included in this character class (along with a quantifier of {2}), PP is matched.
Instead, you seem to want a group with alternatives; for that, you'd use parentheses () (while also eliminating the whitespace, which doesn't appear to have been intentional on your part):
^(AB|BC|MB|NB|NL|NS|NT|NU|ON|QC|PE|SK|YT)$
Note that the {2} quantifier has to go as well: each alternative is already two letters long, so ^(AB|BC|MB|NB|NL|NS|NT|NU|ON|QC|PE|SK|YT){2}$ would require two codes in a row and match things like ABBC, ABAB, NLBC, etc.
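
A quick way to see how the character class and the alternation group differ is to run both against a few inputs. The sketch below uses Python's re module; the names CODES, char_class, one_code and two_codes are invented for the illustration. It also shows that keeping {2} on the group demands two codes back to back.

```python
import re

# The two-letter codes from the question.
CODES = ["AB", "BC", "MB", "NB", "NL", "NS", "NT", "NU", "ON", "QC", "PE", "SK", "YT"]

# The pattern from the question: a character class, so any two of the
# listed letters (or spaces, or pipes) are accepted.
char_class = re.compile(r"^[AB | BC | MB | NB | NL | NS | NT | NU | ON |QC | PE | SK | YT]{2}$")

# Alternation group: exactly one of the listed codes.
one_code = re.compile(r"^(?:" + "|".join(CODES) + r")$")

# Alternation group with {2}: two codes in a row (four letters).
two_codes = re.compile(r"^(?:" + "|".join(CODES) + r"){2}$")

for candidate in ["ON", "PP", "AA", "ABBC"]:
    print(
        candidate,
        bool(char_class.match(candidate)),
        bool(one_code.match(candidate)),
        bool(two_codes.match(candidate)),
    )
```

For a fixed list like this, a plain membership test (candidate in CODES) would do the same job without a regex; the pattern is only needed where the validation has to be expressed as a regular expression.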

How can I delete the rest of the line after the second pipe character "|" for every line with python?

I am using Notepad++ and I want to get rid of everything after the second pipe character (including the second pipe character itself) on every line in my txt file.
Basically, the txt file has the following format:
3.1_1.wav|I like apples.|I like apples|I like bananas
3.1_2.wav|Isn't today a lovely day?|Right now it is 1 in the afternoon.|....
The result should be:
3.1_1.wav|I like apples.
3.1_2.wav|Isn't today a lovely day?
I have tried using \|.* but then everything after the first pipe character is matched.
In Notepad++ do this:
Find what: ^([^\|]*\|[^\|]*).*
Replace with: $1
Check "Regular expression" and click "Replace All".
Explanation:
^ - anchor at start of line
( - start group, can be referenced as $1
[^\|]* - scan over any character other than |
\| - scan over |
[^\|]* - scan over any character other than |
) - end group
.* - scan over everything until end of line
In the replacement, reference the captured group with $1.
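
Since the title asks about Python, here is a minimal sketch of the same idea with the re module, plus a split-based alternative. It assumes Python's \1 replacement syntax in place of Notepad++'s $1; the function names are hypothetical.

```python
import re

# Sample lines copied from the question.
lines = [
    "3.1_1.wav|I like apples.|I like apples|I like bananas",
    "3.1_2.wav|Isn't today a lovely day?|Right now it is 1 in the afternoon.|....",
]

# Same pattern as above: capture everything up to (not including) the
# second pipe, then match the rest of the line so it gets dropped.
pattern = re.compile(r"^([^|]*\|[^|]*).*")


def keep_first_two_fields(line):
    """Regex version: replace the whole line with captured group 1."""
    return pattern.sub(r"\1", line)


def keep_first_two_fields_split(line):
    """Regex-free version: split on | and rejoin the first two fields."""
    return "|".join(line.split("|")[:2])


for line in lines:
    print(keep_first_two_fields(line))
```

Lines that contain no pipe at all are left unchanged by both helpers.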
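I'm not sure if this is the best way to do it, but try this:
[^wav]\|.*
Be aware that [^wav] is a character class (any single character other than w, a or v), not the literal text "wav". The match therefore also consumes the character right before the second pipe (the "." in "I like apples."), and it relies on the first pipe always being preceded by w, a or v.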

Pyspark - Regex - Extract value from last brackets

I created the following regular expressions with the idea of extracting the last element in brackets. With only one pair of parentheses it works fine, but with two pairs it either extracts the first one (which is wrong) or extracts the last one with the brackets included.
Do you know how to solve it?
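from pyspark.sql.functions import col, regexp_extract

# assumes an active SparkSession named spark
tmp = spark.createDataFrame(
    [
        (1, 'foo (123) oiashdj (hi)'),
        (2, 'bar oiashdj (hi)'),
    ],
    ['id', 'txt']
)
# "old": first text found between parentheses
tmp = tmp.withColumn("old", regexp_extract(col("txt"), r"(?<=\().+?(?=\))", 0))
# "new": trailing parenthesized part, whole match (brackets included)
tmp = tmp.withColumn("new", regexp_extract(col("txt"), r"\(([^)]+)\)?$", 0))
tmp.show()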
+---+--------------------+---+----+
| id|                 txt|old| new|  needed
+---+--------------------+---+----+
|  1|foo (123) oiashdj...|123|(hi)|  hi
|  2|    bar oiashdj (hi)| hi|(hi)|  hi
+---+--------------------+---+----+
To extract the substring between parentheses with no other parentheses inside at the end of the string you may use
tmp = tmp.withColumn("new", regexp_extract(col("txt"), r"\(([^()]+)\)$", 1));
Details
\( - matches (
([^()]+) - captures into Group 1 any 1+ chars other than ( and )
\) - a ) char
$ - at the end of the string.
The 1 argument tells the regexp_extract to extract Group 1 value.
NOTE: To allow trailing whitespace, add \s* right before $: r"\(([^()]+)\)\s*$"
NOTE2: To match the last occurrence of such a substring in a longer string, with exactly the same code as above, use
r"(?s).*\(([^()]+)\)"
The .* will grab all the text up to the end, and then backtracking will do the job.
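
The two patterns can be tried outside Spark as well. The sketch below uses Python's re module rather than the Java regex engine behind regexp_extract; these particular patterns use only syntax common to both, but that equivalence is an assumption of the example. The helper last_bracketed is hypothetical and returns an empty string when nothing matches, mirroring regexp_extract.

```python
import re

# Sample values copied from the question.
rows = ["foo (123) oiashdj (hi)", "bar oiashdj (hi)"]

# Parenthesized text that sits at the very end of the string.
at_end = re.compile(r"\(([^()]+)\)$")
# Last parenthesized text anywhere: the greedy .* runs to the end and
# backtracks until the final (...) fits.
last_anywhere = re.compile(r"(?s).*\(([^()]+)\)")


def last_bracketed(text, pattern=at_end):
    """Return group 1 of the first match, or '' when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else ""


for row in rows:
    print(last_bracketed(row))
```

The second pattern matters when text follows the final bracket: for "foo (123) bar (hi) tail" the end-anchored pattern finds nothing, while the .* version still yields hi.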
This should work. Use it with the single line flag.
\([^\(\)]*?\)(?!.*\([^\(\)]*?\))
https://regex101.com/r/Qrnlf3/1

Rematch same or part of previous matched group

I'm looking for a way to match part of - or the whole - previously matched group. For instance, assume we've the following text:
this is a very long text "with" some quoted strings I "need" to extract in their own context
A regex like (.{1,20})(".*?")(.{1,20}) gives the following output:
# | 1st group | 2nd group | 3rd group
------------------------------------------------------------------
1 | is a very long text | "with" | some quoted strings
2 | I | "need" | to extract in their
The goal is to make the regex, while matching the 2nd quoted string, re-match part of the 3rd group from the 1st match (or the whole match, when the quoted strings are close together). Basically I'd like to have the following output instead:
# | 1st group | 2nd group | 3rd group
------------------------------------------------------------------
1 | is a very long text | "with" | some quoted strings
2 | me quoted strings I | "need" | to extract in their
Backreference support would probably do the trick, but Go's regex engine lacks it.
If you go back to the original problem, you need to extract the quotes in context.
Since you don't have lookahead, you could use regexp just to match the quotes (or even just strings.Index) and get their byte ranges, then add the context yourself by widening each range (this may require more work with multi-byte UTF-8 strings, since the offsets are bytes, not characters).
Something like:
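package main

import (
	"fmt"
	"regexp"
)

func main() {
	input := `this is a very long text "with" some quoted strings I "need" to extract in their own context`
	re := regexp.MustCompile(`(".*?")`)
	// Each match is a [start, end) pair of byte offsets into input.
	matches := re.FindAllStringIndex(input, -1)
	for _, m := range matches {
		// Widen the range by 20 bytes on each side, clamped to the string bounds.
		s := m[0] - 20
		e := m[1] + 20
		if s < 0 {
			s = 0
		}
		if e > len(input) {
			e = len(input)
		}
		fmt.Printf("%s\n", input[s:e])
	}
}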
https://play.golang.org/p/brH8v6OM-Fx

sed: Why does "s/TR[0-9]*//2g" work but not "s/TR[0-9].*//2g"?

My file looks like this:
>TR45672|c1_g1_i1|m.87632TR21000
sometextherethatmayincludeTRbutnonumbers
>TR10000|c0_g1_i1|m.83558TR1702000
sometextherethatmayincludeTRbutnonumbers
....
....
I want it to look like this:
>TR45672|c1_g1_i1|m.87632
sometextherethatmayincludeTRbutnonumbers
>TR10000|c0_g1_i1|m.83558
sometextherethatmayincludeTRbutnonumbers
....
....
In other words, I want to remove second occurrence of the pattern TR in the headers (rows that start with ">") and everything after that, but not touch any TR patterns in lines that are not headers. In non-header lines, TR will never ever be followed by a number.
I try to use the following code:
sed "s/TR[0-9].*//2g"
It will, as I have understood it, match TR and then a number and remove all instances but the first one. Since there are always exactly two occurrences of TR[0-9] in the header and no occurrences of TR[0-9] in non-headers, this will accomplish my goals...
...or so I thought. In reality, using the above code has no effect whatsoever.
If I instead skip the dot and use:
sed "s/TR[0-9]*//2g"
It produces what looks like the desired result for those lines I have manually checked.
Questions:
(1) How come it works without the dot but does not work with it? My understanding is that ".*" is the key to removing everything after a pattern.
(2) Removing the dot seems to work, but it is not possible for me to manually check through the entire file. Is there a reason to suspect something unexpected happens when skipping the dot in this case?
sed "s/TR[0-9].*//2g"
...matches the whole line from the first TR to the end of the line, which means there is no following match (there's nothing left of the line to match since it has all been matched)
sed "s/TR[0-9]*//2g"
...first matches only the first TR<number> sequence, then finds the second match in the rest of the line.
Analyze the first line of your input file against the regex with the dot:
|-------------------------------- (1) TR matches 'TR' literally
|  |----------------------------- (2) [0-9] matches a single digit
|  | |--------------------------- (3) .* matches any char till the end
|  | |
TR 4 5672|c1_g1_i1|m.87632TR21000
11 2 3333333333333333333333333333
---------------------------------
1st and only match, so there is no 2nd (or later) match to replace
So using TR[0-9].* you have a single match per line starting with TR.
If you use the second regex instead:
|---------------------------------- (m1) TR matches 'TR' literally
|  |------------------------------- (m1) [0-9]* match zero or more digits
|  |
|  |                       |------ (m2) TR matches 'TR' literally
|  |                       |  |--- (m2) [0-9]* match zero or more
TR 45672 |c1_g1_i1|m.87632 TR 21000
--------                   --------
1st match                  2nd match
By the way, since there are only two TR sections per header, you can skip the global flag and use:
sed 's/TR[0-9]*//2' file
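
The match counts described above can be checked with a short script. This is only a sketch in Python's re module, which is not the regex engine sed uses, although these two simple patterns behave the same way in both. The helper drop_from_nth is a made-up stand-in for the "2g" flag combination (which GNU sed treats as "replace the 2nd match and every later one").

```python
import re

# First header line from the question.
header = ">TR45672|c1_g1_i1|m.87632TR21000"

# With the dot: one match that swallows the rest of the line.
with_dot = [m.group() for m in re.finditer(r"TR[0-9].*", header)]
# Without the dot: each TR<digits> run is its own match.
without_dot = [m.group() for m in re.finditer(r"TR[0-9]*", header)]


def drop_from_nth(pattern, line, n=2):
    """Delete the n-th and all later matches of pattern (hypothetical helper)."""
    seen = 0

    def replace(match):
        nonlocal seen
        seen += 1
        # Keep matches before the n-th, blank out the rest.
        return match.group() if seen < n else ""

    return re.sub(pattern, replace, line)


print(with_dot)
print(without_dot)
print(drop_from_nth(r"TR[0-9]*", header))
```

With the dot there is a single match, so nothing qualifies as "second" and the line comes back unchanged; without it, the second match TR21000 is removed.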