The file app_ids.txt has the following format:
app1 = "0123456789"
app2 = "1234567890"
app3 = "2345678901"
app4 = "3456789012"
app5 = "4567890123"
I am trying to print the lines that match a given regex with the following code in the file find_app_id.jl:
#! /opt/julia/julia-1.1.0/bin/julia
function find_app_id()
    app_pattern = "r\"app2.*\"i"
    open("/path/to/app_ids.txt", "r") do apps
        for app in eachline(apps)
            if occursin(app_pattern, app)
                println(app)
            end
        end
    end
end
find_app_id()
Running $ /home/julia/find_app_id.jl does not print the second line, though it matches the regex!
How do I solve this problem?
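Your regular expression looks odd: app_pattern is an ordinary String whose contents merely look like a regex literal, so occursin searches for that literal text instead of applying a pattern. If you change the line which assigns to app_pattern to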
app_pattern = r"app2.*"
it should work better.
For example, the following prints "Found it" when run:
app_pattern = r"app2.*"
if occursin(app_pattern, "app2 = blah-blah-blah")
    println("Found it")
else
    println("Nothing there")
end
Best of luck.
I'm not sure how regex matching works in Julia; this post might help you figure it out.
However, in general, your pattern is quite simple, and you probably do not need regular expression matching to do this task.
This RegEx might help you to design your expression.
^app[0-9]+\s=\s\x22([0-9]+)\x22$
The capturing group ([0-9]+) in the middle holds the desired app id, and you can refer to it as $1.
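As a rough illustration of how the capturing group behaves, here is a minimal sketch in Python rather than Julia. The sample lines and the helper name extract_app_ids are assumptions made up for this example; the expression itself is the one above.

```python
import re

# The expression from above; \x22 is the hex escape for a double quote.
APP_LINE = re.compile(r'^app[0-9]+\s=\s\x22([0-9]+)\x22$')

def extract_app_ids(lines):
    """Return the captured id (group 1) for every line that matches."""
    ids = []
    for line in lines:
        match = APP_LINE.match(line.strip())
        if match:
            ids.append(match.group(1))
    return ids

# Hypothetical sample input in the app_ids.txt format.
sample = ['app1 = "0123456789"', 'app2 = "1234567890"', 'not an app line']
print(extract_app_ids(sample))
```

Lines that do not follow the app<N> = "<digits>" shape are simply skipped.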
I need to use Python to match a URL in my text file.
However, there is a special case:
i like 🤣pic.twitter.com/Sex8JaP5w5/a7htvq🤣
In this case I would like to keep the emoji next to the url and just match the url in the middle.
Ideally, I would like to have result like this:
i like 🤣<url>🤣
Since I am new to this, this is what I have so far.
pattern = re.compile("([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])")
but the returned result is unsatisfactory, like this:
i like 🤣<url>Sex8JaP5w5/a7htvq🤣
Would you please help me with this? Thank you so much
A solution using existing packages:
from urlextract import URLExtract
import emoji
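def remove_emoji(text):
    # Note: get_emoji_regexp() was removed in emoji 2.0; newer versions
    # provide emoji.replace_emoji(text, replace='') instead.
    return emoji.get_emoji_regexp().sub(r'', text)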
extractor = URLExtract()
source = "i like 🤣pic.twitter.com/Sex8JaP5w5/a7htvq🤣 "
urlsWithEmojis = extractor.find_urls(source)
urls = list(map(remove_emoji, urlsWithEmojis))
print(urls)
Output:
['pic.twitter.com/Sex8JaP5w5/a7htvq']
Inspired by How do you extract a url from a string using python? and removing emojis from a string in Python
It looks like you are missing a * or + after the last matching group, so it only matches one character. You want "([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])*" or "([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])+".
Now I don't know if this regex is simplified for your case, but it does not match all urls. For an example of that check out https://www.regextester.com/20
If you are attempting to match any url I would recommend rethinking your problem and trying to simplify down to more specific types of urls, like the example you provided.
EDIT: Also, why (.com)+? Is there really a case where multiple ".com"s appear, like .com.com.com?
Also, I think you have a small typo and it is supposed to be (\.com). Since you already have ([:///a-zA-Z////\.])+ it could be reduced to (com), but I think the explicit (\.com) makes the expression easier to read.
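One thing worth noting: the character class in the original pattern contains no digits, so even with a trailing * the match would stop at the 8 in Sex8JaP5w5. The following is a minimal sketch of a simplified alternative, assuming the URLs of interest are a host ending in .com followed by an optional ASCII path. The pattern and the helper name mask_urls are hypothetical and will not cover URLs in general.

```python
import re

# Hypothetical simplified pattern: a host ending in .com, then an optional
# path made of ASCII letters, digits, '/', '_', '.' and '-'.
# Emoji are not in either character class, so they are left untouched.
URL = re.compile(r'[A-Za-z0-9.-]+\.com(?:/[A-Za-z0-9/_.-]*)?')

def mask_urls(text):
    """Replace every URL-like match with the literal token <url>."""
    return URL.sub('<url>', text)

source = "i like 🤣pic.twitter.com/Sex8JaP5w5/a7htvq🤣"
print(mask_urls(source))  # i like 🤣<url>🤣
```

Because the classes only list ASCII characters, the emoji on either side act as natural boundaries for the match.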
I am using python to get user input and then by using regular expressions I want to check for certain words. In this case I want to check how the user is feeling and then store it in a list. The problem is that when I print the list it is empty.
import re
phrase = raw_input("How are you feeling ")
phrase = phrase.lower()
feel = re.findall(r"^(?=.*\bsad\b)(?=.*\bhappy\b)(?=.*\bjoyful\b)(?=.*\bmad\b)(?=.*\bsad\b)", phrase)
print feel
I'm not a python expert, but am fairly decent with regex. Why wouldn't you just use something like:
\b(happy|sad|joyful|mad)\b
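A minimal sketch of the alternation approach, written for Python 3 (the question uses Python 2's raw_input and print statement). The helper name find_feelings and the sample phrases are assumptions for illustration.

```python
import re

# Alternation with word boundaries: matches any one of the listed words.
FEELING = re.compile(r"\b(happy|sad|joyful|mad)\b")

def find_feelings(phrase):
    """Return every feeling word found in the phrase, in order of appearance."""
    return FEELING.findall(phrase.lower())

print(find_feelings("I am Happy but also a bit sad"))  # ['happy', 'sad']
print(find_feelings("pure madness"))                   # []
```

Unlike the chained lookaheads, which only succeed when every word is present, this returns whichever words actually occur.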
Lookaheads are zero-width, so add characters for the pattern to actually match after them:
...(?=.*\bsad\b).*
I'm trying to parse a gitolite.conf file, which is a whitespace-oriented conf file with a few regexes. The worst problem is that some options might appear anywhere:
@staff = dilbert alice # line 1
@projects = foo bar # line 2
repo @projects baz # line 3
    RW+ = @staff # line 4
    - master = ashok # line 5
    RW = ashok # line 6
    R = wally # line 7
    config hooks.emailprefix = '[%GL_REPO] ' # line 8
Check the "master" attribute. Some repos have them, others do not. It's a real pain.
This answer assumes a goal of extracting key/value pairs into capturing groups, where key consists of contiguous non-whitespace before = and value includes everything after = but before #, trimmed of leading/trailing whitespace.
Basic version
([^\s]+)\s*=\s*((?:\s*[^\s#]+)*)
More advanced version
The regex above doesn't handle quoted strings very well (e.g. prefix = ' Quoted with # and leading/trailing whitespace '). Regex isn't great at this kind of thing but simple cases can be handled as follows:
([^\s]+)\s*=\s*('[^']*'|"[^"]*"|(?:(?:\s*[^\s#]+)*))
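To see what the two groups capture, here is a minimal Python sketch that applies the advanced expression to a single line. The helper name parse_line is hypothetical, and the sample lines are adapted from the question.

```python
import re

# The "more advanced" expression from above: group 1 is the key,
# group 2 is a quoted string or the unquoted value up to a '#'.
KEY_VALUE = re.compile(r'''([^\s]+)\s*=\s*('[^']*'|"[^"]*"|(?:(?:\s*[^\s#]+)*))''')

def parse_line(line):
    """Return (key, value) for a line containing '=', else None."""
    match = KEY_VALUE.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(parse_line("RW = ashok # line 6"))
print(parse_line("config hooks.emailprefix = '[%GL_REPO] ' # line 8"))
print(parse_line("repo foo bar"))
```

Note that a quoted value keeps its surrounding quotes in group 2, and that the key is only the last whitespace-free token before the = sign (so "config" and the leading "-" are not part of it).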
Here's the demo if you need to see what is captured and play around with it more: Debuggex Demo
First, you should know that this isn't entirely possible with Regex. Regex is a great tool for parsing regular languages (including some types of configuration files), but as soon as you get into "Well, this line is actually a header line and we need all lines under it, and some lines might have this token, and others might not", it gets quite messy. I'm not saying it's impossible, but you're going to waste a lot of time debugging your Regex pattern instead of just writing a parser in whatever language you're using this with.
Second, if you're going to ask a question about Regex, it always helps to say what you want out of the expression. Do you want to tokenize everything, do you only want the configuration keys, do you only want the comments?
That being said, I took my best guess, here's an expression to get you started:
^(?:([^=#]+?)\s.?=?\s.?([^=#]+?)\s.?(?:#|$))
With this expression, please apply the g and m flags (global and multiline). In PCRE, this would look like:
/^(?:([^=#]+?)\s.?=?\s.?([^=#]+?)\s.?(?:#|$))/gm
There are two capture groups, one is whatever is before the = sign, and the other is whatever is after. If there is no = sign, the first capture group contains everything. Anything after "#" is ignored.
Here's a fiddle to demonstrate: http://www.rexfiddle.net/eQexbZU
As I write this I realise there are two parts to this question, however I think I am only really stuck on the first part and therefore the second is only provided for context:
Part A:
I need to search the contents of each value returned by a for loop (where each value is a url) for the following:
href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-n"
where:
the numerals 163472311232 could be any length (ie it could be 5478)
-text-text-text-text could be any number of different words
where page-n could be from page-2 up until any number
where matches are not returned more than once, ie only unique matches are returned and therefore only one of the following would be returned:
href="/dir/Sub_Dir/dir/5422-la-la/page-4
href="/dir/Sub_Dir/dir/5422-la-la/page-4
Part B:
So the logic would be something like:
list_of_urls = original_list
for url in list_of_urls:
    headers = {'User-Agent' : 'Mozilla 5.0'}
    request = urllib2.Request(url, None, headers)
    url_for_re = urllib2.urlopen(request).read()
    another_url = re.findall(r'href="(/dir/Sub_dir\/dir/[^"/]*)"', url_for_re, re.I)
    file.write(url)
    file.write('\n')
    file.write(another_url)
    file.write('\n')
which I am hoping will give me output similar to:
a.html
a/page-2.html
a/page-3.html
a/page-4.html
b.html
b/page-2.html
b/page-3.html
b/page-4.html
So my question is (assuming the logic in part B is ok):
What is the required regex pattern to use for part A?
I am a newbie to python and regex so this will limit my understanding somewhat in regards to relatively complicated regex suggestions etc.
update:
after suggestions i tried to test the following regex which did not produce any results:
import re
content = 'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"'
matches = re.findall(r'href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"', content, re.I)
prefix = 'http://www.test.com'
for match in matches:
    i = prefix + match + '\n'
    print i
solution:
I think this is the regex that will work:
matches = re.findall(r'href="(/dir/Sub_Dir/dir/[^"/]*/page-[2-9])"', content, re.I)
You can have... most of what you want. Regexes don't really do the distinct thing, so I suggest you just use them to get all the URLs, and then remove duplicates yourself.
Off the top of my head it would be something like this:
href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"
Plus or minus escaping rules, specifics on what words are allowed, etc. I'm a Windows guy, there's a great tool called Expresso which is helpful for learning regexes. I hope there's an equivalent for whatever platform you're using, it comes in handy.
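The "collect everything, then remove duplicates yourself" idea can be sketched as follows in Python 3. The sample HTML, the helper name unique_page_links, and the loosened [^"/]+ for the words part are assumptions for illustration, not the asker's actual data.

```python
import re

# Capture the whole path; the page number must be 2-9 or have two or more digits.
PAGE_LINK = re.compile(
    r'href="(/dir/Sub_Dir/dir/[0-9]+-[^"/]+/page-(?:[2-9]|[1-9][0-9]+))"',
    re.I,
)

def unique_page_links(html):
    """Return matching paths with duplicates dropped, keeping first-seen order."""
    # dict.fromkeys keeps insertion order (Python 3.7+), so it deduplicates
    # without reshuffling the results the way set() could.
    return list(dict.fromkeys(PAGE_LINK.findall(html)))

# Hypothetical page content with a repeated link and a page-1 link.
content = '''
<a href="/dir/Sub_Dir/dir/5422-la-la/page-4">4</a>
<a href="/dir/Sub_Dir/dir/5422-la-la/page-4">4</a>
<a href="/dir/Sub_Dir/dir/5422-la-la/page-12">12</a>
<a href="/dir/Sub_Dir/dir/5422-la-la/page-1">1</a>
'''
print(unique_page_links(content))
```

Here the repeated page-4 link is reported once and page-1 is ignored, since the alternation only accepts page numbers from 2 upward.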
I'm trying to search a field in a database to extract URLs. Sometimes there will be more than 1 URL in a field and I would like to extract those in to separate variables (or an array).
I know my regex isn't going to cover all possibilities. As long as I flag on anything that starts with http and ends with a space I'm ok.
The problem I'm having is that my efforts either seem to get only 1 URL per record or they get only the last letter from each URL. I've tried a couple of different techniques based on solutions others have posted, but I haven't found a solution that works for me.
Sample input line:
Testing http://marko.co http://tester.net Just about anything else you'd like.
Output goal
$var[0] = http://marko.co
$var[1] = http://tester.net
First try:
if ( $status =~ m/http:(\S)+/g ) {
    print "$&\n";
}
Output:
http://marko.co
Second try:
@statusurls = ($status =~ m/http:(\S)+/g);
print "@statusurls\n";
Output:
o t
I'm new to regex, but since I'm using the same regex for each attempt, I don't understand why it's returning such different results.
Thanks for any help you can offer.
I've looked at these posts and either didn't find what I was looking for or didn't understand how to implement it:
This one seemed the most promising (it's where I got the 2nd attempt from), but it didn't return the whole URL, just the last letter: How can I store regex captures in an array in Perl?
This has some great stuff in it. I'm curious if I need to look at the URL as a word since it's bookended by spaces: Regex Group in Perl: how to capture elements into array from regex group that matches unknown number of/multiple/variable occurrences from a string?
This one offers similar suggestions as the first two. How can I store captures from a Perl regular expression into separate variables?
Solution:
@statusurls = ($status =~ m/(http:\S+)/g);
print "@statusurls\n";
Thanks!
I think that you need to capture more than just one character. Try this regex instead:
m/http:(\S+)/g
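The difference comes down to where the + sits relative to the parentheses: a repeated group such as (\S)+ only remembers its last repetition. Python's re module behaves the same way for this case, so here is a minimal Python analogue (not Perl) of the three variants, using the sample line from the question:

```python
import re

status = "Testing http://marko.co http://tester.net Just about anything else you'd like."

# Quantifier outside the group: the group is overwritten on each repetition,
# so only the final character of each URL is kept.
last_chars = re.findall(r"http:(\S)+", status)

# Quantifier inside the group: everything after "http:" is captured.
tails = re.findall(r"http:(\S+)", status)

# Group around the whole pattern: the complete URL is captured.
urls = re.findall(r"(http:\S+)", status)

print(last_chars)  # ['o', 't']
print(tails)       # ['//marko.co', '//tester.net']
print(urls)        # ['http://marko.co', 'http://tester.net']
```

So moving the + inside the group fixes the single-letter problem, and widening the group to include http: gives the full URLs shown in the output goal.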