Regular expression to trim a string - regex

In my application, I am trying to get the name of a file from a string retrieved from a 'content-header' tag sent by a server. The filename looks like \"uploads/2014/03/filename.zip\" (quotation marks included in the value).
I have tried using Path.GetFileName(string); to get just the file name, but it throws an exception stating that there are illegal characters in the path.
What should I use to get just filename.zip returned? Is a regex the best way to trim this string, or is there a better one?
The \"uploads/2014/03/ part will always be the same length, although the numbers may vary. The filename.zip part can be any filename and extension; I'm just using that as an example. It sounds like a job for a regex to me, but I have no idea how to use regular expressions.

You can try something like this:
var inputString = @"\""uploads/2014/03/filename.zip\""";
var result = inputString.Trim('\\', '"').Split('/')[3];
This should work if the format is always like \"uploads/someNumber/someOtherNumber/filename\". To make it safer, you might want to use the Enumerable.Last method (from System.Linq) after Split:
var result = inputString.Trim('\\', '"').Split('/').Last();
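For comparison, here is a minimal Python sketch of the same trim-and-split idea, plus a regex variant since the question asks about one. The function names are hypothetical, and the input is assumed to always be wrapped in \" sequences as shown in the question.

```python
import re


def file_name_by_split(value: str) -> str:
    """Strip surrounding backslashes and quotes, then keep the last path segment."""
    trimmed = value.strip('\\"')
    return trimmed.rsplit('/', 1)[-1]


def file_name_by_regex(value: str) -> str:
    """Match the final run of characters that are not '/', '\\' or '"'."""
    match = re.search(r'[^/\\"]+(?=\\?"?$)', value)
    if match is None:
        raise ValueError("no file name found in %r" % value)
    return match.group(0)


header_value = '\\"uploads/2014/03/filename.zip\\"'
print(file_name_by_split(header_value))  # filename.zip
print(file_name_by_regex(header_value))  # filename.zip
```

Taking the last segment rather than a fixed index keeps working if the number of directory levels ever changes.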

Related

regular expression in python for match url

I need to use Python to match URLs in my text file.
However, there is a special case:
i like 🤣pic.twitter.com/Sex8JaP5w5/a7htvq🤣
In this case I would like to keep the emoji next to the url and just match the url in the middle.
Ideally, I would like to have result like this:
i like 🤣<url>🤣
Since I am new to this, this is what I have so far.
pattern = re.compile("([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])")
but the returned result is unsatisfactory, like this:
i like 🤣<url>Sex8JaP5w5/a7htvq🤣
Would you please help me with this? Thank you so much
A solution using existing packages:
from urlextract import URLExtract
import emoji
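def remove_emoji(text):
    # Note: get_emoji_regexp() was removed in emoji 2.0; on newer versions,
    # emoji.replace_emoji(text, replace='') does the same job.
    return emoji.get_emoji_regexp().sub(r'', text)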
extractor = URLExtract()
source = "i like 🤣pic.twitter.com/Sex8JaP5w5/a7htvq🤣 "
urlsWithEmojis = extractor.find_urls(source)
urls = list(map(remove_emoji, urlsWithEmojis))
print(urls)
Output:
['pic.twitter.com/Sex8JaP5w5/a7htvq']
Inspired by How do you extract a url from a string using python? and removing emojis from a string in Python
It looks like you are missing a * or + on the last matching group, so it only matches one character. So you want "([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])*" or "([:///a-zA-Z////\.])+(.com)+([:///a-zA-Z////\.])+".
Now I don't know if this regex is simplified for your case, but it does not match all urls. For an example of that check out https://www.regextester.com/20
If you are attempting to match any url I would recommend rethinking your problem and trying to simplify down to more specific types of urls, like the example you provided.
EDIT: Also, why (.com)+? Is there really a case where multiple ".com"s appear, like .com.com.com?
Also, I think you have a small typo and it is supposed to be (\.com). Since ([:///a-zA-Z////\.])+ already matches the dot, it could be reduced to (com); however, I think the explicit (\.com) makes the expression easier to read.
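As a rough illustration of the "simplify to specific types of URLs" advice, here is a minimal Python sketch that only accepts ASCII host and path characters, so emoji on either side are never consumed. The pattern name URL_RE and the helper mask_urls are hypothetical, and the pattern is deliberately narrow: it is not a general-purpose URL matcher.

```python
import re

# Assumption: URLs of interest look like "host.tld/path" with ASCII-only
# characters, optionally preceded by http:// or https://.
URL_RE = re.compile(
    r'(?:https?://)?'              # optional scheme
    r'(?:[A-Za-z0-9-]+\.)+'        # one or more dot-terminated host labels
    r'[A-Za-z]{2,}'                # top-level domain such as "com"
    r'(?:/[A-Za-z0-9._~%/-]*)?'    # optional ASCII path
)


def mask_urls(text: str) -> str:
    """Replace every matched URL with the literal placeholder <url>."""
    return URL_RE.sub('<url>', text)


print(mask_urls('i like 🤣pic.twitter.com/Sex8JaP5w5/a7htvq🤣'))
# i like 🤣<url>🤣
```

Because the emoji are outside every character class in the pattern, the match starts at "pic" and stops right before the trailing emoji.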

Regular Expression for String without a "?" character to redirect to string with "?" character

On our website we occasionally experience an error where dynamic links aren't building correctly.
URLs like this
https://www.test.url.edu/collections/&edan_fq[]=p.edanmdm.indexedstructured.object_type:%22Financial+records%22&edan_fq[]=p.edanmdm.descriptivenonrepeating.record_id:item_*
Should actually be this:
https://www.test.url.edu/collections/search?edan_fq[]=p.edanmdm.indexedstructured.object_type:%22Financial+records%22&edan_fq[]=p.edanmdm.descriptivenonrepeating.record_id:item_*
We want to create a regular expression to redirect
/collections/&edan_fq[]=
to
/collections/search?edan_fq[]=
But everything after "edan_fq[]=" can change dynamically--there are thousands of permutations of the string after that point.
Does anyone know how this would be done?
If you use \& without the global flag, the regex replaces only the first match. I've used JavaScript; please check this:
var data = "https://www.test.url.edu/collections/&edan_fq[]=p.edanmdm.indexedstructured.object_type:%22Financial+records%22&edan_fq[]=p.edanmdm.descriptivenonrepeating.record_id:item_*";
var regex = /\&/;
data = data.replace(regex,"search?");
console.log(data);
Please check Substitution example in Regex101.
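A slightly stricter variant is to anchor the match on the broken prefix itself, so that an unrelated "&" elsewhere in the URL is never touched. The sketch below is in Python purely for illustration; the function name fix_collections_url is hypothetical, and how the rewrite would be wired into the site's actual redirect mechanism is not known from the question.

```python
import re

# Match "/collections/&" only when it is directly followed by "edan_fq[]=".
# The lookahead leaves the query parameters themselves untouched.
BROKEN_PREFIX = re.compile(r'/collections/&(?=edan_fq\[\]=)')


def fix_collections_url(url: str) -> str:
    """Rewrite /collections/&edan_fq[]= to /collections/search?edan_fq[]=."""
    return BROKEN_PREFIX.sub('/collections/search?', url, count=1)


broken = ('https://www.test.url.edu/collections/&edan_fq[]=p.edanmdm.'
          'indexedstructured.object_type:%22Financial+records%22'
          '&edan_fq[]=p.edanmdm.descriptivenonrepeating.record_id:item_*')
print(fix_collections_url(broken))
```

Everything after "edan_fq[]=" passes through unchanged, so the thousands of possible suffixes do not need to be enumerated.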

regex to match everything except character

I have a payload that contains the following:
����\�p�a�t�r�i�c�k�-�t�e�s�t�-�f�i�l�e�.�t�x�t������x�SMB2
I'm looking to extract the file name of patrick-test-file.txt
I can get close by using this, but it continues to include everything (including ascii characters)
[\\\\](.*?)x�SMB2
With a result of this: �p�a�t�r�i�c�k�-�t�e�s�t�-�f�i�l�e�.�t�x�t������ for the capture group.
How would I just match the characters of the file name, which could be anything of variable length, and could contain alphanumeric characters? Is this possible with pure regex?
Any help is much appreciated.
Sometimes you just can't do a single language-agnostic Regular Expression to accomplish something. And sometimes (usually) it is more performant to do a series of string functions.
I wouldn't personally accept any solution which has hard-coded values, such as x�SMB2.
If you want to use Regular Expressions only, you can first select the File-Name portion like so: (([-\w\d.\\]+)[^-\w\d.\\]?)+, then go ahead and replace [^-\w\d.\\] with nothing "".
Honestly, given the limited detail, the best function is like so:
var fileName = @"\patrick-test-file.txt";
But half-joking aside, and with that limited detail, your best bet is to do a couple string functions:
var yuckyString = @"����\�p�a�t�r�i�c�k�-�t�e�s�t�-�f�i�l�e�.�t�x�t������x�SMB2";
var fileNameArea = yuckyString.Split(new[] { "��" }, StringSplitOptions.RemoveEmptyEntries)[0];
var fileName = fileNameArea.Replace("�", "");
Granted, there was no language listed, so I'm using C#. Also, the answer would change if there were irregularities with those special characters. With the limited info, the pattern seems clear.
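If the unprintable characters are in fact NUL bytes, that is, if the file name is UTF-16LE text inside a binary payload, a regex over the raw bytes can pick it out directly. That is an assumption; the question does not confirm the encoding. The Python sketch below uses a hypothetical function name and a made-up sample payload built under that assumption.

```python
import re
from typing import Optional

# Assumption: the name follows a UTF-16LE backslash, and every character of
# the name is an ASCII letter, digit, '.', '_' or '-' followed by a NUL byte.
NAME_RE = re.compile(rb'\\\x00((?:[A-Za-z0-9._-]\x00)+)')


def extract_file_name(payload: bytes) -> Optional[str]:
    """Return the first UTF-16LE file name after a backslash, or None."""
    match = NAME_RE.search(payload)
    if match is None:
        return None
    return match.group(1).decode('utf-16-le')


# Hypothetical payload shaped like the one in the question.
sample = (b'\x00\x00\x00\x00'
          + '\\patrick-test-file.txt'.encode('utf-16-le')
          + b'\x00\x00\x00\x00\x00\x00x\x00SMB2')
print(extract_file_name(sample))  # patrick-test-file.txt
```

This avoids hard-coding the trailing x�SMB2 marker: the match simply ends at the first byte pair that is not a name character followed by NUL.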

What is the regex required to find specific urls within content from a list of urls generated by a for loop?

As I write this I realise there are two parts to this question, however I think I am only really stuck on the first part and therefore the second is only provided for context:
Part A:
I need to search the contents of each value returned by a for loop (where each value is a url) for the following:
href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-n"
where:
the numerals 163472311232 could be any length (ie it could be 5478)
-text-text-text-text could be any number of different words
where page-n could be from page-2 up until any number
where matches are not returned more than once, ie only unique matches are returned and therefore only one of the following would be returned:
href="/dir/Sub_Dir/dir/5422-la-la/page-4
href="/dir/Sub_Dir/dir/5422-la-la/page-4
Part B:
So the logic would be something like:
list_of_urls = original_list
for url in list_of_urls:
    headers = {'User-Agent' : 'Mozilla 5.0'}
    request = urllib2.Request(url, None, headers)
    url_for_re = urllib2.urlopen(request).read()
    another_url = re.findall(r'href="(/dir/Sub_dir\/dir/[^"/]*)"', url_for_re, re.I)
    file.write(url)
    file.write('\n')
    file.write(another_url)
    file.write('\n')
which I am hoping will give me output similar to:
a.html
a/page-2.html
a/page-3.html
a/page-4.html
b.html
b/page-2.html
b/page-3.html
b/page-4.html
So my question is (assuming the logic in part B is ok):
What is the required regex pattern to use for part A?
I am a newbie to Python and regex, so this will limit my understanding somewhat with regard to relatively complicated regex suggestions.
Update:
After suggestions, I tried testing the following regex, which did not produce any results:
import re
content = 'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"'
matches = re.findall(r'href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"', content, re.I)
prefix = 'http://www.test.com'
for match in matches:
    i = prefix + match + '\n'
    print i
Solution:
I think this is the regex that will work:
matches = re.findall(r'href="(/dir/Sub_Dir/dir/[^"/]*/page-[2-9])"', content, re.I)
You can have... most of what you want. Regexes don't really do the distinct thing, so I suggest you just use them to get all the URLs, and then remove duplicates yourself.
Off the top of my head it would be something like this:
href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"
Plus or minus escaping rules, specifics on what words are allowed, etc. I'm a Windows guy, and there's a great tool called Expresso that is helpful for learning regexes. I hope there's an equivalent for whatever platform you're using; it comes in handy.
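The "get all the URLs, then remove duplicates yourself" approach can be sketched in Python 3 as follows. The names HREF_RE and unique_page_links are hypothetical. Note that the test string in the question's update contains five hyphenated words while the pattern it was tested with expects exactly four, which would explain the empty result; the sketch uses a repeated group so that any number of words is accepted.

```python
import re

HREF_RE = re.compile(
    r'href="(/dir/Sub_Dir/dir/'
    r'[0-9]+'                          # numeric id of any length
    r'(?:-[a-zA-Z]+)+'                 # one or more hyphen-separated words
    r'/page-(?:[2-9]|[1-9][0-9]+))"',  # page-2 and upwards
    re.I,
)


def unique_page_links(html: str) -> list:
    """Return matching paths in first-seen order with duplicates removed."""
    return list(dict.fromkeys(HREF_RE.findall(html)))


content = ('href="/dir/Sub_Dir/dir/5422-la-la/page-4" '
           'href="/dir/Sub_Dir/dir/5422-la-la/page-4" '
           'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"')
for path in unique_page_links(content):
    print('http://www.test.com' + path)
```

dict.fromkeys keeps the first occurrence of each path and preserves order, which a plain set would not.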

Rewrite URL-string with String.replace in Actionscript 3

I'm getting a string that looks like this from a database: ~\Uploads\Tree.jpg
And I would like to change it in Actionscript3 to Uploads/Tree.jpg
Any idea how I can do this in a neat way?
Assuming path is the string from the database, you can use this:
var newPath:String = path.replace(new RegExp("^~\\\\", "g"), "").replace(new RegExp("\\\\", "g"), "/");
If you always have the "~\" at the beginning, you can optimize it by using String.substring() instead. And if you are going to convert many strings at once, keep a reference to the regex and reuse it, so you do not create a new regex for each string.
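The substring idea needs no regex at all. Here it is sketched in Python rather than ActionScript, purely to show the logic; the function name normalize_upload_path is hypothetical.

```python
def normalize_upload_path(path: str) -> str:
    """Drop a leading "~\\" if present, then turn backslashes into slashes."""
    if path.startswith('~\\'):
        path = path[2:]  # skip the two-character "~\" prefix
    return path.replace('\\', '/')


print(normalize_upload_path('~\\Uploads\\Tree.jpg'))  # Uploads/Tree.jpg
```

Checking for the prefix before slicing keeps the function safe for paths that do not start with "~\".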