Remove everything except numbers and alphabets from a string using google sheet or excel formulas - regex

I have search but found python and related solutions.
I have a string like
"Hello 'how' are % you?"
which I want to convert to below after Remove everything except numbers and alphabets
Hello how are you
I am using Regexreplace as follows but now sure what should be the replacement or if its a right approach
=REGEXREPLACE(B2 , "([^A-Za-z0-9]+)")
The main thing i want to remove from the string are the stuff like " or strange symbols
can anyone help?

You can use:
=TRIM(REGEXREPLACE(B2,"[\W_]+"," "))
Or, include the space in your character class:
=REGEXREPLACE(B2,"[\W_ ]+"," "))
Where: \W is short for [^A-Ba-b0-9_], so to include the underscore we added it to the character class.

you can use:
=TRIM(REGEXREPLACE(A1, "'|%|""", ))

Related

Tcl - How to Add Text after last character through regex?

I need a tip, tip or suggestion followed by some example of how I can add an extension in .txt format after the last character of a variable's output line.
For example:
set txt " ONLINE ENGLISH COURSE - LESSON 5 "
set result [concat "$txt" .txt]
Print:
Note that there is space in the start, means and fin of the variable phrase (txt). What must be maintained are the spaces of the start and means. But replace the last space after the end of the sentence, with the format of the extension [.txt].
With the built-in concat method of Tcl, it does not achieve the desired effect.
The expected result was something like this:
ONLINE ENGLISH COURSE - LESSON 5.txt
I know I could remove spaces with string map but I don't know how to remove just the last occurrence on the line.
And otherwise I don’t know how to remove the last space to add the text [.txt]
If anyone can point me to one or more solutions, thank you in advance.
set result "[string trimright $txt].txt"
or
set result [regsub {\s*$} $txt ".txt"]

Extract a text string with regex

I have a large set of data I need to clean with open refine.
I am quite bad with regex and I can't think of a way to get what I want,
which is extracting a text string between quotes that includes lots of special characters like " ' / \ # # -
In each cell, it has the same format
caption': u'text I want to extract', u'likes':
Any help would be highly appreciated!
If you want to extract text string that includes lots of special characters in between, and is located between quotes ' ', You can do it in general this way:
\'[\S\s]*?\'
Demo
.
In your case, if you want to extract only the medial quote from this: caption': u'text I want to extract', u'likes': , Try this Regex:
(?<=u\')[\V]*?(?=\'\,)
Demo
We designed OpenRefine with a few smart functions to handle common cases such as yours without using Regex.
Two other cool ways to handle this in OpenRefine.
Using drop down menu:
Edit Column
Split into several columns
by separator Separator '
Using smartSplit
(string s, optional string sep)
returns: array
Returns the array of strings obtained by splitting s with separator sep. Handles quotes properly. Guesses tab or comma separator if "sep" is not given.
value.smartSplit("'")[2]

Extract string of numbers from URL using regex PIG

I'm using PIG to generate a list of URLs that have been recently visited. In each of the URLs, there is a string of numbers that represents the product page visited. I'm trying to use a regex_extract_all() function to extract just the string of numbers, which vary in length from 6-8. The string of digits can be found directly after jobs2/view/ and usually ends with +&cd but sometimes they may end with ).
Here are a few example URLs:
(http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk)
(http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=cl k&gl=hk)
Here is the current regex I am using:
J = FOREACH jpage GENERATE FLATTEN(REGEX_EXTRACT_ALL(TEXTCOLUMN, '\/view\/(\d+)\+\&')) as (output:chararray)
I have also tried other forms such as:
'[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]', 'view.([0-9]+)', 'view\/([\d]+)\+',
'[0-9][0-9][0-9]+', and
'[0-9][0-9][0-9]*'; none of which work.
Can anybody assist here or have another way of going about it?
Much appreciated,
MM
Reason for"Unexpected character 'D'" is, you need to put double backslash instead of single backslash. eg just replace [\d+] to [\\d+]
Here your solution, please validate all your inputs strings
input.txt
http://a.com/search?q=cache:QD7vZRHkPQoJ:ca.xyz.com/jobs2/view/17069404+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:G9323j2oNbAJ:ca.xyz.com/jobs2/view/5977065+&cd=1&hl=en&ct=clnk&gl=ca
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11AJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clnk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928+&cd=2&hl=zh-TW&ct=clk&gl=hk
http://a.com/search?q=cache:aNspmG11qAJ:hk.xyz.com/jobs2/view/16988928)=2&hl=zh-TW&ct=clk&gl=hk
http://webcache.googleusercontent.com/search?q=cache:http://my.linkedin.com/jobs2/view/9919248
Updated Pigscript:
A = LOAD 'input.txt' as line;
B = FOREACH A GENERATE REGEX_EXTRACT(line,'.*/view/(\\d+)([+|&|cd|)?]+)?',1);
dump B;
(17069404)
(5977065)
(16988928)
(16988928)
(16988928)
(16988928)
I'm not familiar with PIG, but this regex will match your target:
(?<=/jobs2/view/)\d+
By using a (non-consuming) look behind, the entire match (not just a group of the match) is your number.

Regex: How to match a string that is not only numbers

Is it possible to write a regular expression that matches all strings that does not only contain numbers? If we have these strings:
abc
a4c
4bc
ab4
123
It should match the four first, but not the last one. I have tried fiddling around in RegexBuddy with lookaheads and stuff, but I can't seem to figure it out.
(?!^\d+$)^.+$
This says lookahead for lines that do not contain all digits and match the entire line.
Unless I am missing something, I think the most concise regex is...
/\D/
...or in other words, is there a not-digit in the string?
jjnguy had it correct (if slightly redundant) in an earlier revision.
.*?[^0-9].*
#Chad, your regex,
\b.*[a-zA-Z]+.*\b
should probably allow for non letters (eg, punctuation) even though Svish's examples didn't include one. Svish's primary requirement was: not all be digits.
\b.*[^0-9]+.*\b
Then, you don't need the + in there since all you need is to guarantee 1 non-digit is in there (more might be in there as covered by the .* on the ends).
\b.*[^0-9].*\b
Next, you can do away with the \b on either end since these are unnecessary constraints (invoking reference to alphanum and _).
.*[^0-9].*
Finally, note that this last regex shows that the problem can be solved with just the basics, those basics which have existed for decades (eg, no need for the look-ahead feature). In English, the question was logically equivalent to simply asking that 1 counter-example character be found within a string.
We can test this regex in a browser by copying the following into the location bar, replacing the string "6576576i7567" with whatever you want to test.
javascript:alert(new String("6576576i7567").match(".*[^0-9].*"));
/^\d*[a-z][a-z\d]*$/
Or, case insensitive version:
/^\d*[a-z][a-z\d]*$/i
May be a digit at the beginning, then at least one letter, then letters or digits
Try this:
/^.*\D+.*$/
It returns true if there is any simbol, that is not a number. Works fine with all languages.
Since you said "match", not just validate, the following regex will match correctly
\b.*[a-zA-Z]+.*\b
Passing Tests:
abc
a4c
4bc
ab4
1b1
11b
b11
Failing Tests:
123
if you are trying to match worlds that have at least one letter but they are formed by numbers and letters (or just letters), this is what I have used:
(\d*[a-zA-Z]+\d*)+
If we want to restrict valid characters so that string can be made from a limited set of characters, try this:
(?!^\d+$)^[a-zA-Z0-9_-]{3,}$
or
(?!^\d+$)^[\w-]{3,}$
/\w+/:
Matches any letter, number or underscore. any word character
.*[^0-9]{1,}.*
Works fine for us.
We want to use the used answer, but it's not working within YANG model.
And the one I provided here is easy to understand and it's clear:
start and end could be any chars, but, but there must be at least one NON NUMERICAL characters, which is greatest.
I am using /^[0-9]*$/gm in my JavaScript code to see if string is only numbers. If yes then it should fail otherwise it will return the string.
Below is working code snippet with test cases:
function isValidURL(string) {
var res = string.match(/^[0-9]*$/gm);
if (res == null)
return string;
else
return "fail";
};
var testCase1 = "abc";
console.log(isValidURL(testCase1)); // abc
var testCase2 = "a4c";
console.log(isValidURL(testCase2)); // a4c
var testCase3 = "4bc";
console.log(isValidURL(testCase3)); // 4bc
var testCase4 = "ab4";
console.log(isValidURL(testCase4)); // ab4
var testCase5 = "123"; // fail here
console.log(isValidURL(testCase5));
I had to do something similar in MySQL and the following whilst over simplified seems to have worked for me:
where fieldname regexp ^[a-zA-Z0-9]+$
and fieldname NOT REGEXP ^[0-9]+$
This shows all fields that are alphabetical and alphanumeric but any fields that are just numeric are hidden. This seems to work.
example:
name1 - Displayed
name - Displayed
name2 - Displayed
name3 - Displayed
name4 - Displayed
n4ame - Displayed
324234234 - Not Displayed

Regex to replace string with another string in MS Word?

Can anyone help me with a regex to turn:
filename_author
to
author_filename
I am using MS Word 2003 and am trying to do this with Word's Find-and-Replace. I've tried the use wildcards feature but haven't had any luck.
Am I only going to be able to do it programmatically?
Here is the regex:
([^_]*)_(.*)
And here is a C# example:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main()
{
String test = "filename_author";
String result = Regex.Replace(test, #"([^_]*)_(.*)", "$2_$1");
}
}
Here is a Python example:
from re import sub
test = "filename_author";
result = sub('([^_]*)_(.*)', r'\2_\1', test)
Edit: In order to do this in Microsoft Word using wildcards use this as a search string:
(<*>)_(<*>)
and replace with this:
\2_\1
Also, please see Add power to Word searches with regular expressions for an explanation of the syntax I have used above:
The asterisk (*) returns all the text in the word.
The less than and greater than symbols (< >) mark the start and end
of each word, respectively. They
ensure that the search returns a
single word.
The parentheses and the space between them divide the words into
distinct groups: (first word) (second
word). The parentheses also indicate
the order in which you want search to
evaluate each expression.
Here you go:
s/^([a-zA-Z]+)_([a-zA-Z]+)$/\2_\1/
Depending on the context, that might be a little greedy.
Search pattern:
([^_]+)_(.+)
Replacement pattern:
$2_$1
In .NET you could use ([^_]+)_([^_]+) as the regex and then $2_$1 as the substitution pattern, for this very specific type of case. If you need more than 2 parts it gets a lot more complicated.
Since you're in MS Word, you might try a non-programming approach. Highlight all of the text, select Table -> Convert -> Text to Table. Set the number of columns at 2. Choose Separate Text At, select the Other radio, and enter an _. That will give you a table. Switch the two columns. Then convert the table back to text using the _ again.
Or you could copy the whole thing to Excel, construct a formula to split and rejoin the text and then copy and paste that back to Word. Either would work.
In C# you could also do something like this.
string[] parts = "filename_author".Split('_');
return parts[1] + "_" + parts[0];
You asked about regex of course, but this might be a good alternative.