Regex Match and Replace path to file - regex

I am trying to do a regex match and replace for hotfile.com links to mp3 files I have in my database (WordPress).
I used to use hotfile for streaming mp3 files on my site; now I have switched to a CDN. Could someone kindly help me out with this:
Replace: http//hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3
With: http//p.music.cdndomain.com/vod/music.folder/2010/mp3_file_name.mp3
I have been trying a simple search and replace like this: http//hotfile.com/dl/%/%/, but it's not working.
It would have been easier to do a search and replace if hotfile.com didn't use a different folder for every file; below are two examples of the problem:
http//hotfile.com/dl/155490069/c7932d4/
http//hotfile.com/dl/165490070/c8745e7/
I have over 500 files to replace.
Thanks

Since you must be using MySQL for your WordPress database, you can do this replacement either with a regex, as you asked:
Regex pattern: #http://(www.)?hotfile.com/\w+/\w+/\w+/#
Replacement pattern: http//p.music.cdndomain.com/vod/music.folder/2010/
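To sanity-check that pattern before running anything against the database, here is a minimal Python sketch of the same substitution (the # delimiters dropped, the dots escaped, and a colon after http so the sample URL matches the pattern):

import re

# Sample URL in the form used in the test table below; the real rows live in the WordPress database.
old_url = "http://hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3"

pattern = r"http://(www\.)?hotfile\.com/\w+/\w+/\w+/"
new_prefix = "http//p.music.cdndomain.com/vod/music.folder/2010/"

print(re.sub(pattern, new_prefix, old_url))
# -> http//p.music.cdndomain.com/vod/music.folder/2010/mp3_file_name.mp3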
An alternative, simpler solution is to extract the mp3 file name using MySQL's plain string functions:
Use SUBSTRING or SUBSTRING_INDEX to extract the file name of your mp3 file, i.e. find the string after the last occurrence of "/" in your hotfile URL.
Use CONCAT to append the retrieved file name to the new URL prefix and update it in the database.
Here is an example; you can adapt it for your database:
mysql> select * from test_songs;
+---------------------------------------------------------------+
| song_url                                                      |
+---------------------------------------------------------------+
| http://hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3     |
| http://www.hotfile.com/dl/123412312/dd732d4/mp3_song_name.mp3 |
+---------------------------------------------------------------+
Taking substrings:
mysql> select SUBSTRING_INDEX(song_url,"/",-1) from test_songs;
+----------------------------------+
| SUBSTRING_INDEX(song_url,"/",-1) |
+----------------------------------+
| mp3_file_name.mp3                |
| mp3_song_name.mp3                |
+----------------------------------+
2 rows in set (0.03 sec)
Creating final update query:
mysql> Update test_songs set song_url =
CONCAT("http//p.music.cdndomain.com/vod/music.folder/2010/",
SUBSTRING_INDEX(song_url,"/",-1)) ;
Query OK, 2 rows affected (0.00 sec)
Rows matched: 2 Changed: 2 Warnings: 0
Checking the results:
mysql> select * from test_songs;
+---------------------------------------------------------------------+
| song_url                                                            |
+---------------------------------------------------------------------+
| http//p.music.cdndomain.com/vod/music.folder/2010/mp3_file_name.mp3 |
| http//p.music.cdndomain.com/vod/music.folder/2010/mp3_song_name.mp3 |
+---------------------------------------------------------------------+
2 rows in set (0.00 sec)
Done!

Something as simple as http://regex101.com/r/lK9wH4 should work:
/^.+\/(.+)$/ and replace with <your_new_url>\1.
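In Python terms, that capture-the-last-segment approach looks roughly like this (sketch only; the CDN prefix from the question stands in for <your_new_url>):

import re

url = "http//hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3"
new_prefix = "http//p.music.cdndomain.com/vod/music.folder/2010/"

# ^.+/(.+)$ greedily consumes everything up to the last slash and captures the file name.
print(re.sub(r"^.+/(.+)$", new_prefix + r"\1", url))
# -> http//p.music.cdndomain.com/vod/music.folder/2010/mp3_file_name.mp3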
Good luck.

You can use Notepad++ to search and replace across all your files.
For this particular sample, search and replace in regex mode:
Search: "http//hotfile.com/(.*)/(.*\.mp3) "
Replace: "http//p.music.cdndomain.com/vod/music.folder/2010/\2 "
Remove the quote marks but keep the space at the end.

Related

Splunk query not endswith

I am just getting into learning Splunk queries, and I'm trying to grab data from the myfile.csv file based on a regex expression.
In particular, I want to print only the rows where the column fqdn does not end with udc.net or htc.com.
Below is my query, which works, but I'm writing the exclusion twice.
| inputlookup myfile.csv
| regex support_group="^mygroup-Linux$"
| regex u_sec_dom="^Normal Secure$"
| regex fqdn!=".*?udc.net$"
| regex fqdn!=".*?htc.com$"
| where match(fqdn,".")
I tried to combine them, separated by a |, but it's not working:
| regex fqdn!="(.*?udc.net | ".*?htc.com)$"
You can do this with a search and where clause:
| inputlookup myfile.csv
| search support_group="mygroup-Linux" u_sec_dom="Normal Secure"
| where !match(fqdn,"udc.net$") AND !match(fqdn,"htc.com$")
Or just a single search clause:
| inputlookup myfile.csv
| search support_group="mygroup-Linux" u_sec_dom="Normal Secure" NOT (fqdn IN("*udc.net","*htc.com"))
You can also rewrite the IN() thusly:
(fqdn="*udc.net" OR fqdn="*htc.com")
The combined regex will work if you omit the spaces on either side of the |; the extra spaces become part of the regex and prevent matches.
There's also no need for the final where command: Splunk displays all remaining events by default, and match(fqdn, ".") is true for any non-empty fqdn, so it filters out nothing.
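To see why those spaces matter, here is a quick check in Python's re (close enough to Splunk's PCRE-style regex syntax for this point); the fqdn value is just a made-up example:

import re

fqdn = "server01.udc.net"

# With spaces around the |, the space becomes part of each alternative, so nothing matches.
print(bool(re.search(r"(.*?udc\.net | .*?htc\.com)$", fqdn)))   # False

# Without the spaces, the alternation behaves as intended.
print(bool(re.search(r"(.*?udc\.net|.*?htc\.com)$", fqdn)))     # True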

Extract hive count string using regex

I am trying to get the total number of records in a Hive table using paramiko. I know we can use PyHive or pyhs2, but they require certain configuration, and it would take a lot of time to get that done by my IT team.
So I am using paramiko to execute the command below and get the count:
beeline -u jdbc:hive2://localhost:10000 -n hive -e 'select count(*) from table_name'
And I get the following result:
+----------+--+
|   _c0    |
+----------+--+
| 1232322  |
+----------+--+
I need to extract this count from the output.
I have tried the following code and regex, but it's not working:
pattern="""
+----------+--+
|   _c0    |
+----------+--+
| [0-9]*   |
+----------+--+
"""
import re
import paramiko

# ssh_con is a paramiko.SSHClient connection established elsewhere
si, so, se = ssh_con.exec_command("beeline -u jdbc:hive2://localhost:10000 -n hive -e 'select count(*) from table_name'")
output = so.read().decode()  # read stdout once; a second read() would return an empty string
print(output)
print(re.match(pattern, output))
I am able to retrieve the count and print it; I'm just looking for a regular expression to extract the count from that output.
In Beeline, the result can be displayed in different formats. By default the result is printed as a table with a header. You can remove the header and the table borders, so there is no need to parse the result with a regexp. Add these options: --showHeader=false --outputformat=tsv2
beeline --showHeader=false --outputformat=tsv2 -u jdbc:hive2://localhost:10000 -n hive -e 'select count(*) from table_name'
Read more details about Output Formats.
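As a rough sketch of how this fits the paramiko call from the question (assuming ssh_con is the already-established SSHClient and that beeline writes its informational messages to stderr, as it usually does):

import paramiko  # ssh_con is assumed to be an already-connected paramiko.SSHClient

cmd = ("beeline --showHeader=false --outputformat=tsv2 "
       "-u jdbc:hive2://localhost:10000 -n hive "
       "-e 'select count(*) from table_name'")
si, so, se = ssh_con.exec_command(cmd)

# With the header and table borders gone, stdout is just the number.
count = int(so.read().decode().strip())
print(count)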
If you mean to match the whole output and just extract the number below the column name, here is a regex that fixes your approach:
^\+-+\+--\+\n\| *\w+ *\|\n\+-+\+--\+\n\| *(\d+) *\|\n\+-+\+--\+$
See the regex demo. The \w+ matches one or more word chars, so it matches any column name.
However, it seems all you need is a regex to match a number between | ... |.
Use
result = ''
m = re.search(r'\|\s*(\d+)\s*\|', so.read().decode())
if m:
    result = m.group(1)
See this regex demo.
Details
\| - a | char
\s* - 0+ whitespaces
(\d+) - Group 1: one or more digits
\s*\| - 0+ whitespaces and a | char.
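For a quick local check of that regex against the sample output from the question (sketch only):

import re

# The beeline table from the question, as a literal string.
sample = """+----------+--+
|   _c0    |
+----------+--+
| 1232322  |
+----------+--+"""

m = re.search(r'\|\s*(\d+)\s*\|', sample)
print(m.group(1))  # -> 1232322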

How to extract domain name using dynamic regex in Redshift?

I need to extract the domain name from a URL using Redshift PostgreSQL. Example: extract 'google.com' from 'www.google.com'. Each URL in my dataset has a different top-level domain (TLD). My approach was to first join the matching TLD to the dataset and then use a regex to extract 'first_string.TLD'. In Redshift, I'm getting the error 'The pattern must be a valid UTF-8 literal character expression'. Is there a way around this?
A sample of my dataset:
+---+------------------------+--------------+
| id| trimmed_domain         | tld          |
+---+------------------------+--------------+
| 1 | sample.co.uk           | co.uk        |
| 2 | www.sample.co.uk       | co.uk        |
| 3 | www3.sample.co.uk      | co.uk        |
| 4 | biz.sample.co.uk       | co.uk        |
| 5 | digital.testing.sam.co | co           |
| 6 | sam.co                 | co           |
| 7 | www.google.com         | com          |
| 8 | 1.11.220               |              |
+---+------------------------+--------------+
My code:
SELECT t1.extracted_domain, COUNT(DISTINCT(t1.id))
FROM (
    SELECT
        d.id,
        d.trimmed_domain,
        CASE
            WHEN d.tld IS null THEN d.trimmed_domain
            ELSE regexp_replace(d.trimmed_domain,'(.*\.)((.[a-z]*).*'||replace(tld,'.','\.')||')','\2')
        END AS "extracted_domain"
    FROM dataset d
) t1
GROUP BY 1
ORDER BY 2;
Expected output:
+------------------------+--------------+
| extracted_domain       | count        |
+------------------------+--------------+
| sample.co.uk           | 4            |
| sam.co                 | 2            |
| google.com             | 1            |
| 1.11.220               | 1            |
+------------------------+--------------+
I'm not so sure about the query itself. However, you can use this tool to design any expression you wish and modify your query.
My guess is that maybe this would help:
^(?!d|b|www3).*
You can list any prefix that you wish to exclude inside the alternation (?!d|b|www3).
You may also want to add your desired URLs to an expression similar to:
^(sam|www.google|1.11|www.sample|www3.sample).*
So, I've found a solution. Redshift does not support column-based regex patterns, so the alternative is to use a Python UDF:
Convert the tld column into a regex pattern.
Go row by row and extract the domain name using that regex-pattern column.
Group by the extracted_domain and count the users.
The SQL query is as below:
CREATE OR REPLACE FUNCTION extractor(col_domain varchar)
RETURNS varchar
IMMUTABLE AS $$
# build a per-row regex of the form ([^/.]+\.(<tld>)) from the tld value
import re
_regex = ''
for domain in col_domain:
    if domain is None:
        continue
    else:
        _regex += r'{}'.format(domain)
domain_regex = r'([^/.]+\.({}))'.format(_regex)
return domain_regex
$$ LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION regex_match(in_pattern varchar, input_str varchar)
RETURNS varchar
IMMUTABLE AS $$
# apply the per-row pattern; an empty pattern returns the input unchanged
import re
if in_pattern == '':
    a = str(input_str)
else:
    a = str(re.search(in_pattern, input_str).group())
return a
$$ LANGUAGE plpythonu;
SELECT
    t2.extracted_domain,
    COUNT(DISTINCT(t2.id)) AS "Unique Users"
FROM (
    SELECT
        t1.id,
        t1.trimmed_domain,
        regex_match(t1.regex_pattern, t1.trimmed_domain) AS "extracted_domain"
    FROM (
        SELECT
            id,
            trimmed_domain,
            CASE WHEN tld IS null THEN '' ELSE extractor(tld) END AS "regex_pattern"
        FROM dataset
    ) t1
) t2
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
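To see locally what the extractor UDF produces for one row and how regex_match then applies it, here is a small Python illustration (iterating a varchar yields its characters, so the loop in extractor effectively passes the tld string through unchanged):

import re

# Mirrors extractor('co.uk'): the generated pattern captures "<name>.<tld>".
tld = "co.uk"
domain_regex = r'([^/.]+\.({}))'.format(tld)
print(domain_regex)                                          # ([^/.]+\.(co.uk))

# Mirrors regex_match(domain_regex, trimmed_domain) for two sample rows.
print(re.search(domain_regex, "biz.sample.co.uk").group())   # sample.co.uk
print(re.search(domain_regex, "www.google.com"))             # None: that row's tld is 'com', giving ([^/.]+\.(com))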
The Python UDF seems to be slow on a large dataset, so I'm open to suggestions for improving the query.
If you know the prefixes you would like to remove from the domains, then why not just exclude them? The following query simply strips the known www/http/etc. prefixes from domain names and counts the normalized domain names.
SELECT COUNT(*) from
(select REGEXP_REPLACE(domain, '^(https|http|www|biz)') FROM domains)
GROUP BY regexp_replace;

How do I select a substring using a regexp in robot framework

In the Robot Framework library called String, there are several keywords that allow us to use a regexp to manipulate a string, but these manipulations don't seem to include selecting a substring from a string.
To clarify, what I intend is to have a price, i.e. € 1234,00 from which I would like to select only the 4 primary digits, meaning I am left with 1234 (which I will convert to an int for use in validation calculations). I have a regexp which will allow me to do that, which is as follows:
(\d+)[\.\,]
If I use Remove String Using Regexp with this regexp I will be left with exactly what I tried to remove. If I use Get Lines Matching Regexp, I will get the entire line rather than just the result I wanted, and if I use Get Regexp Matches I will get the right result except it will be in a list, which I will then have to manipulate again so that doesn't seem optimal.
Did I simply miss the keyword that would let me do this, or am I forced to write my own custom keyword for it? I am slightly amazed that this functionality doesn't seem to be available, as this is the first use case I would think of for using a regexp with a string...
You can use the Evaluate keyword to run some python code.
For example:
| Using 'Evaluate' to find a pattern in a string
| | ${string}= | set variable | € 1234,00
| | ${result}= | evaluate | re.search(r'\\d+', '''${string}''').group(0) | re
| | should be equal as strings | ${result} | 1234
Starting with robot framework 2.9 there is a keyword named Get regexp matches, which returns a list of all matches.
For example:
| Using 'Get regexp matches' to find a pattern in a string
| | ${string}= | set variable | € 1234,00
| | ${matches}= | get regexp matches | ${string} | \\d+
| | should be equal as strings | ${matches[0]} | 1234

notepad++: keep regex (multi occurence per line) and line structure, remove other characters

I have a 130k line text file with patent information and I just want to keep the dates (regex "[0-9]{4}-[0-9]{2}-[0-9]{2} ") for subsequent work in Excel. For this purpose I need to keep the line structure intact (also blank lines). My main problem is that I can't seem to find a way to identify and keep multiple occurrences of date information in the same line while deleting all other information.
Original file structure:
US20110228428A1 | US | | 7 | 2010-03-19 | SEAGATE TECHNOLOGY LLC
US20120026629A1 | US | | 7 | 2010-07-28 | TDK CORP | US20120127612A1 | US | | EXAMINER | 2010-11-24 | | US20120147501A1 | US | | 2 | 2010-12-09 | SAE MAGNETICS HK LTD,HEADWAY TECHNOLOGIES INC
Desired file structure:
2010-03-19
2010-07-28 2010-11-24 2010-12-09
Thank you for your help!
Search for
.*?(?:([0-9]{4}-[0-9]{2}-[0-9]{2})|$)
And replace with
" $1"
Don't include the quotes; they are just there to show that there is a space before the $1. This will also put a space before the first match in a row.
This regex matches as little as possible (the lazy .*?) before it finds either a date or the end of the line (the $). If a date is found, it is captured into $1 by the parentheses. The replacement then writes a space, to separate the found dates, followed by the captured date from $1.
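If you want to sanity-check that search-and-replace outside Notepad++, here is a rough Python equivalent using the two sample lines from the question (re.sub standing in for Notepad++'s replace; the final strip() just trims the leading/trailing spaces the replacement leaves behind):

import re

lines = [
    "US20110228428A1 | US | | 7 | 2010-03-19 | SEAGATE TECHNOLOGY LLC",
    "US20120026629A1 | US | | 7 | 2010-07-28 | TDK CORP | US20120127612A1 | US | "
    "| EXAMINER | 2010-11-24 | | US20120147501A1 | US | | 2 | 2010-12-09 "
    "| SAE MAGNETICS HK LTD,HEADWAY TECHNOLOGIES INC",
]

pattern = re.compile(r".*?(?:([0-9]{4}-[0-9]{2}-[0-9]{2})|$)")

for line in lines:
    # Each match becomes a space plus the captured date (empty when only $ matched).
    print(pattern.sub(lambda m: " " + (m.group(1) or ""), line).strip())
# -> 2010-03-19
# -> 2010-07-28 2010-11-24 2010-12-09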