In Power Query, I have a list of email addresses, some of which are invalid. I would like to use M code to identify and "fix" them. For example, my list includes entries like "1234.my_email_gmail_com#error.invalid.com"
I am looking for Power Query to find similar email addresses, then produce an output of a valid email. For the example above, it should be "my_email#gmail.com"
Essentially, I want to do the following:
Remove the digits at the front (number of digits varies)
Remove the "#error.invalid.com"
Replace the first underscore "_" from the right with "."
Replace the second underscore "_" from the right with "#"
I'm still new to Power Query, especially M code. I appreciate any help and guidance I can get.
Try the function cleanEmailAddress below:
let
    cleanEmailAddress = (invalidEmailAddress as text) as text =>
        let
            removeLeadingNumbers = Text.AfterDelimiter(invalidEmailAddress, "."), // Assumes invalid numbers are followed by "." which itself also needs removing.
            removeInvalidDomain = Text.BeforeDelimiter(removeLeadingNumbers, "#"),
            replaceLastOccurrence = (someText as text, oldText as text, newText as text) as text =>
                let
                    lastPosition = Text.PositionOf(someText, oldText, Occurrence.Last),
                    replaced = if lastPosition >= 0 then Text.ReplaceRange(someText, lastPosition, Text.Length(oldText), newText) else someText
                in
                    replaced,
            overwriteTopLevelDomainSeparator = replaceLastOccurrence(removeInvalidDomain, "_", "."),
            overwriteAtSymbol = replaceLastOccurrence(overwriteTopLevelDomainSeparator, "_", "#")
        in
            overwriteAtSymbol,
    cleaned = cleanEmailAddress("1234.my_email_gmail_com#error.invalid.com")
in
    cleaned
Regarding:
"Remove the digits at the front (number of digits varies)"
Your question doesn't mention what to do with the leading . (which remains if you remove only the leading digits), but your expected output ("my_email#gmail.com") suggests it should be removed. Email addresses that do not have a . immediately after the leading digits will not be handled correctly (the logic of the removeLeadingNumbers expression would need to be improved).
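As a rough illustration of how the leading-digit step could be made more tolerant, here is the same four-step logic sketched in Python rather than M. The function name clean_email_address is hypothetical, and treating the "." after the digits as optional is an assumption, not something stated in the question.

```python
import re

def clean_email_address(raw: str) -> str:
    """Hypothetical helper mirroring the four steps listed in the question."""
    # Drop any run of leading digits and, if present, the "." right after them.
    core = re.sub(r"^\d+\.?", "", raw)
    # Drop the invalid domain: everything from the first "#" onward.
    core = core.split("#", 1)[0]
    # Split on the last two underscores: local part, domain name, top-level domain.
    parts = core.rsplit("_", 2)
    if len(parts) < 3:
        return core  # fewer than two underscores: nothing safe to rebuild
    local, domain, tld = parts
    return f"{local}#{domain}.{tld}"

print(clean_email_address("1234.my_email_gmail_com#error.invalid.com"))
```

Unlike the M function above, this sketch leaves the text untouched when fewer than two underscores remain, which may or may not be what you want.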
This seems to work too:
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Added Custom" = Table.AddColumn(Source, "Valid", each
        let
            core = Text.BetweenDelimiters([Column1], ".", "#"),
            withDot = Text.ReplaceRange(core, Text.PositionOf(core, "_", Occurrence.Last), 1, "."),
            withAt = Text.ReplaceRange(withDot, Text.PositionOf(withDot, "_", Occurrence.Last), 1, "#")
        in
            withAt)
in
#"Added Custom"
I have some data that looks like this (more than 400 columns):
year | ID | fake_num1 | fake_num2 | text1
2019 | 11 | 36 000    | 10'000    | text, 1
2020 | 12 | -1 275    | 1 000,00  | text 2
Columns fake_num1 and fake_num2 are stored as text. What I'm trying to achieve is
1. Identify those fake numbers columns
2. Clean the data (e.g. remove space, columns, replace comma by points) with a for loop
I need some help with step 1. I have to identify columns fake_num1 and fake_num2, while avoiding columns like text1. I was thinking of going with regexp but maybe there is another solution.
I used part of the code here: SO regexp, however I am not sure how to proceed from there.
Dim strPattern as String: strPattern = "^[0-9]$"
will find anything that starts and ends with a number, and only has numbers (if my comprehension is correct). What's the best way to handle the cases listed in the table above?
Please try the following code. It treats as "fake numbers columns" those where replacing the offending characters turns the string into a number:
Sub testMakeNumbers()
    Dim sh As Worksheet, lastR As Long, lastCol As Long, i As Long, rngCol As Range
    Set sh = ActiveSheet 'you can use here the necessary sheet
    lastR = sh.Range("A" & sh.Rows.Count).End(xlUp).Row
    lastCol = sh.Cells(1, sh.Columns.Count).End(xlToLeft).Column
    'determine the problematic columns (judged by the first data row):
    For i = 1 To lastCol
        If Not IsNumeric(sh.Cells(2, i).Value) And _
           IsNumeric(Replace(Replace(Replace(sh.Cells(2, i).Value, " ", ""), "'", ""), ",", ".")) Then
            If rngCol Is Nothing Then
                Set rngCol = sh.Cells(2, i)
            Else
                Set rngCol = Union(rngCol, sh.Cells(2, i))
            End If
        End If
    Next
    'nothing to clean if no such column was found (Intersect would fail on Nothing):
    If rngCol Is Nothing Then Exit Sub
    'replace the characters so that the strings become numbers;
    'xlPart is passed explicitly because Replace otherwise reuses the last LookAt setting:
    With Intersect(rngCol.EntireColumn, sh.Range("A2", sh.Cells(lastR, lastCol)))
        .Replace ",", ".", xlPart
        .Replace Chr(160), "", xlPart
        .Replace " ", "", xlPart
        .Replace "'", "", xlPart
    End With
End Sub
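The detection idea (a column is a "fake number" column if its values are not numeric as they stand but become numeric once separators are stripped) can be sketched in Python as follows. This is only an illustration under assumptions: the helper names are made up, the data is the sample table from the question, and unlike the VBA above it checks every row rather than only the first data row.

```python
def to_number(text):
    """Return a float if the text is numeric once separators are stripped, else None."""
    cleaned = (text.replace("\u00a0", "").replace(" ", "")
                   .replace("'", "").replace(",", "."))
    try:
        return float(cleaned)
    except ValueError:
        return None

def is_plain_number(text):
    """True if the text is already numeric without any cleaning."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def fake_number_columns(header, rows):
    """Names of columns that are numeric only after cleaning."""
    found = []
    for i, name in enumerate(header):
        values = [row[i] for row in rows]
        all_cleanable = all(to_number(v) is not None for v in values)
        any_dirty = any(not is_plain_number(v) for v in values)
        if all_cleanable and any_dirty:
            found.append(name)
    return found

header = ["year", "ID", "fake_num1", "fake_num2", "text1"]
rows = [
    ["2019", "11", "36 000", "10'000", "text, 1"],
    ["2020", "12", "-1 275", "1 000,00", "text 2"],
]
print(fake_number_columns(header, rows))
```

Checking all rows is slower on a wide sheet, but it avoids misclassifying a column whose first row happens to look clean.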
Take an input like "This is, in text, an example! Cool stuff"
I have some C# code that takes it, removes the punctuation, splits on the spaces, and returns the first 6 elements:
var title = new string(input.Where(c => !char.IsPunctuation(c)).ToArray()).Split(' ').Take(6);
so I get an array of:
["This", "is", "in", "text", "an", "example"]
From that array, how can I work backwards to match it to the original input? I've tried doing:
'This|is|in|text|an|example' but it's not precise enough, as I think it's matching the words with ORs instead of ANDs.
I'm going to use the regex expression in an SQL query, something like:
SELECT t.*, Max(e.Timestamp) As EventUpdated, Min(e.Timestamp) as Timestamp
From test t
Left Join edithistory e on t.IdTimelineinfo = e.IdTimelineinfo
where t.date = "2020-12-06" and t.Title REGEXP 'Testing|two|events|on|the';
I'm really new to regex and would appreciate any help.
I ended up using REGEX like the following:
DbTitle = string.Join("[^a-zA-Z]*", ArraryOfWords);
var title = $"[^a-zA-Z]*{DbTitle}";
SELECT t.*, Max(e.Timestamp) As EventUpdated, Min(e.Timestamp) as Timestamp
From test t
Left Join edithistory e on t.IdTimelineinfo = e.IdTimelineinfo
where t.date = @date and t.Title Regexp @title and Confirmed = 1;
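To see why joining the words with "zero or more non-letters" works where the alternation did not, here is a small Python sketch of the same pattern construction. The function name build_title_pattern is hypothetical, and the re.escape call is an addition of mine (the words have already had punctuation stripped, so it should be a no-op here).

```python
import re

def build_title_pattern(words):
    """Join the words with a 'zero or more non-letters' gap, as in the answer above."""
    gap = "[^a-zA-Z]*"
    return gap + gap.join(re.escape(w) for w in words)

words = ["This", "is", "in", "text", "an", "example"]
pattern = build_title_pattern(words)

# The words must appear in order, separated only by non-letters.
print(bool(re.search(pattern, "This is, in text, an example! Cool stuff")))
# An extra word between them breaks the match.
print(bool(re.search(pattern, "This is not in text an example")))
```

Because the pattern is not anchored at the end, it still matches titles that continue after the sixth word, which is what you need when only the first six words are kept.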
I've got a variable "Variable" in VBScript that will receive different values, based on names that come from XML files I don't trust. I can't let "Variable" contain forbidden characters (<, >, :, ", /, \, |, ?, *) or accented characters such as (Á, á, É, é, Â, â, Ê, ê, ñ, ã).
So, my question is: how can I write a script that finds and replaces any of these characters in the variable? I'm using the Replace function from the MSDN Library, but the way I'm using it doesn't let me replace more than one character.
Example:
(Assuming a Node.Text value of "Example A/S")
For Each Node In xmlDoc.SelectNodes("//NameUsedToRenameFile")
Variable = Node.Text
Next
Result = Replace(Variable, "<", "-")
Result = Replace(Variable, "/", "-")
WScript.Echo Result
The Echo above returns "Example A-S", but if I change the order of my Replace calls, like:
Result = Replace(Variable, "/", "-")
Result = Replace(Variable, "<", "-")
I get "Example A/S". How should I write it so that it handles any of these characters? Thanks!
As discussed, it might be easier to do things the other way around and create a list of allowed characters, as VBScript is not so good at handling Unicode characters. Whilst the characters you have listed may be fine, you may run into issues with certain character sets. Here's an example routine that could help:
Consider this command:
wscript.echo ValidateStr("This393~~_+'852Is0909A========Test|!:~#$%####")
Using the sample routine below, it should produce the following results:
This393852Is0909ATest
The sample routine:
Function ValidateStr (vsVar)
    Dim vsAllowed, vscan, vsValid, vsaCount
    vsAllowed = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
    ValidateStr = ""
    If VarType(vsVar) = vbString Then
        If Len(vsVar) > 0 Then
            For vscan = 1 To Len(vsVar)
                vsValid = False
                vsaCount = 1
                ' Stop scanning the allowed list as soon as a match is found
                Do While vsValid = False And vsaCount <= Len(vsAllowed)
                    If UCase(Mid(vsVar, vscan, 1)) = Mid(vsAllowed, vsaCount, 1) Then vsValid = True
                    vsaCount = vsaCount + 1
                Loop
                ' Keep the character (in its original case) only if it is allowed
                If vsValid Then ValidateStr = ValidateStr & Mid(vsVar, vscan, 1)
            Next
        End If
    End If
End Function
I hope this helps you with your quest. Enjoy!
EDIT: If you wish to continue down your original path, you will need to fix your Replace calls. They are not working because each line starts again from Variable, discarding the previous replacement. Pass in Variable the first time, then use Result every subsequent time.
You had:
Result = Replace(Variable, "/", "-")
Result = Replace(Variable, "<", "-")
You need to change this to:
Result = Replace(Variable, "/", "-")
Result = Replace(Result, "<", "-")
Result = Replace(Result, ...etc..)
Result = Replace(Result, ...etc..)
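The same "feed the result back in" idea, written as a loop so the list of characters lives in one place, might look like this in Python. It is a sketch only: the FORBIDDEN string is taken from the question, and the function name and the "-" placeholder are assumptions.

```python
# Characters the question lists as forbidden.
FORBIDDEN = '<>:"/\\|?*'

def replace_forbidden(name, placeholder="-"):
    result = name
    for ch in FORBIDDEN:
        # Feed the previous result back in, so earlier replacements are kept.
        result = result.replace(ch, placeholder)
    return result

print(replace_forbidden("Example A/S"))
```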
Edit: You could try Ansgar's regex, as the code is far simpler, but I am not sure it will work if, for example, your string contains Simplified Chinese characters.
I agree with Damien that replacing everything but known-good characters is the better approach. I would, however, use a regular expression for this, because it greatly simplifies the code. I would also recommend not removing "bad" characters but replacing them with a known-good placeholder (an underscore, for instance), because removing characters might yield undesired results.
Function SanitizeString(str)
    Set re = New RegExp
    re.Pattern = "[^a-zA-Z0-9]"
    re.Global = True
    SanitizeString = re.Replace(str, "_")
End Function
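For comparison, a Python sketch of the same whitelist-plus-placeholder approach is below. The accent-stripping step is my own assumption about what the asker might want (turning "Á" into "A" rather than "_"); it is not part of the VBScript above, and the function name sanitize is hypothetical.

```python
import re
import unicodedata

def sanitize(text, placeholder="_"):
    # Decompose accented letters (e.g. "é" becomes "e" plus a combining accent),
    # then drop the combining accents so the base letter survives.
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace everything outside the known-good set with the placeholder.
    return re.sub(r"[^a-zA-Z0-9]", placeholder, unaccented)

print(sanitize("Example A/S"))
print(sanitize("Ánimo ñ"))
```

Note that spaces are also replaced, exactly as with the pattern "[^a-zA-Z0-9]" in SanitizeString; add a space to the character class if you want to keep them.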
A Bloomberg futures ticker usually looks like:
MCDZ3 Curncy
where the root is MCD, the month letter and year is Z3 and the 'yellow key' is Curncy.
Note that the root can be of variable length, 2-4 letters or 1 letter and 1 whitespace (e.g. S H4 Comdty).
The letter-year allows only the letters listed below in expr and can have two-digit years.
Finally the yellow key can be one of several security type strings but I am interested in (Curncy|Equity|Index|Comdty) only.
In Matlab I have the following regular expression
expr = '[FGHJKMNQUVXZ]\d{1,2} ';
[rootyk, monthyear] = regexpi(bbergtickers, expr,'split','match','once');
where
rootyk{:}
ans =
'mcd' 'curncy'
and
monthyear =
'z3 '
I don't want to match the ' ' (space) in the monthyear. How can I do that?
Assuming there is no leading or trailing whitespace and the root contains only uppercase letters, this should work:
^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$
You've got root in the first group, letter-year in the second, yellow key in the third.
I don't know Matlab, nor whether it supports Perl-compatible regular expressions. If it fails, try e.g. a literal space instead of \s. Also, drop the ^...$ if you'd like to extract from a larger source text.
The expression you're feeding regexpi with contains a space and is used as a pattern for 'match'. This is why the matched monthyear string also has a space [1].
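As a quick way to check the three capture groups, here is the same pattern exercised in Python (chosen only because it is easy to run; the pattern itself is unchanged). The function name split_ticker is hypothetical.

```python
import re

# Group 1: root, group 2: month letter and year, group 3: yellow key.
TICKER = re.compile(
    r"^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$"
)

def split_ticker(ticker):
    """Return (root, month-year, yellow key), or None if the ticker does not match."""
    m = TICKER.match(ticker)
    return m.groups() if m else None

print(split_ticker("MCDZ3 Curncy"))
print(split_ticker("S H4 Comdty"))
```

For the one-letter root, the first group keeps its trailing space ("S "), so strip it afterwards if you need the bare letter.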
If you want to keep it simple and let regexpi do the work for you (instead of postprocessing its output), try a different approach and capture tokens instead of matching, and ignore the intermediate space:
%// <$1><----------$2---------> <$3>
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2}) (.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
You can also simplify the expression to a more generic '(.+)(\w{1}\d{1,2})\s+(.+)', if you wish.
Example
bbergtickers = 'MCDZ3 Curncy';
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2})\s+(.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
The result is:
tickinfo =
'MCD'
'Z3'
'Curncy'
[1] This expression is also used as a delimiter for 'split'. Removing the trailing space from it won't help, as it will reappear in the rootyk output instead.
Assuming you just want to get rid of the leading and/or trailing spaces, there is a very simple command for that:
monthyear = strtrim(monthyear)
For removing all spaces, you can do:
monthyear(isspace(monthyear))=[]
Here is a completely different approach. It finds the digits of the year and takes them together with the letter just before them:
s = 'MCDZ3 Curncy'
p = regexp(s,'\d')
s(min(p)-1:max(p))
I have spent many hours trying to solve this problem, without success.
All I need is to validate a textbox:
Valid strings:
10%
0%
1111111.12%
15.2%
10
2.3
Invalid strings:
.%
12.%
.02%
%
123456789123.123
I need to validate the textbox against these strings in the KeyPress event.
I tried:
Private Sub prices_KeyPress(ByVal sender As Object, ByVal e As System.Windows.Forms.KeyPressEventArgs) Handles wholeprice_input_new_item.KeyPress, dozenprice_input_new_item.KeyPress, _
        detailprice_input_new_item.KeyPress, costprice_input_new_item.KeyPress
    Dim TxtB As TextBox = CType(sender, TextBox)
    Dim fullText As String = TxtB.Text & e.KeyChar
    Dim rex As Regex = New Regex("^[0-9]{1,9}([\.][0-9]{1,2})?[\%]?$ ")
    If (Char.IsDigit(e.KeyChar) Or e.KeyChar.ToString() = "." Or e.KeyChar = CChar(ChrW(Keys.Back))) Then
        If (fullText.Trim() <> "") Then
            If (rex.IsMatch(fullText) = False And e.KeyChar <> CChar(ChrW(Keys.Back))) Then
                e.Handled = True
                MessageBox.Show("You are Not Allowed To Enter More then 2 Decimal!!")
            End If
        End If
    Else
        e.Handled = True
    End If
End Sub
NOTE: The regex has to allow at most 2 decimal places and 9 integer digits, with an optional percent sign.
Please help, I feel so frustrated trying to solve this problem without success.
I think that you almost had the right answer. When I run your regex against the samples you supplied, they all fail. But if I remove the extra space at the end of the regex I get the expected successes and failures.
So currently your regex looks like this:
Dim rex As Regex = New Regex("^[0-9]{1,9}([\.][0-9]{1,2})?[\%]?$ ")
and it should look like
Dim rex As Regex = New Regex("^[0-9]{1,9}([\.][0-9]{1,2})?[\%]?$")
EDIT:
Ok I understand the issue more. The problem with the regex is that it will only allow a period if it is followed by one or two numbers. That works fine if you are evaluating the textbox value after someone has finished typing. But in your code, you are evaluating for each keypress, so you don't have a chance to type a number after the "."
I can see two possible solutions:
1. Change the regex to allow 1. as a valid entry
2. Change when you evaluate the regex, perhaps trying to figure out a way to only evaluate the regex when the person has paused typing.
If you went with option 1, then we need to tweak the regex to something like this
"^[0-9]{1,9}((\.)|(\.[0-9]{1,2}(%)?)|(%))?$"
I changed the regex so that it accepts three optional endings: (\.) allows the string to end in a period, (\.[0-9]{1,2}(%)?) allows it to end in a period followed by one or two digits and an optional percent sign, and (%) allows it to end in a percent sign. I split the ending into three options because I didn't want something like 12.% to be valid. For this to work, you will also need to add the percent sign to your first If statement:
If (Char.IsDigit(e.KeyChar) Or e.KeyChar.ToString() = "." Or e.KeyChar.ToString() = "%" Or e.KeyChar = CChar(ChrW(Keys.Back))) Then
so that the regex runs when someone types the percent sign.
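To compare the two patterns against the sample strings from the question, here is a small check written in Python (the regex syntax used is the same in .NET for these patterns). The names FINAL, WHILE_TYPING and accepts_every_prefix are made up for this sketch.

```python
import re

# Pattern for the finished value (the original regex without the trailing space).
FINAL = re.compile(r"^[0-9]{1,9}([\.][0-9]{1,2})?[\%]?$")
# Relaxed pattern for per-keypress checks: also accepts a trailing ".".
WHILE_TYPING = re.compile(r"^[0-9]{1,9}((\.)|(\.[0-9]{1,2}(%)?)|(%))?$")

valid = ["10%", "0%", "1111111.12%", "15.2%", "10", "2.3"]
invalid = [".%", "12.%", ".02%", "%", "123456789123.123"]

def accepts_every_prefix(text):
    """True if each intermediate state while typing `text` passes WHILE_TYPING."""
    return all(WHILE_TYPING.match(text[:i]) for i in range(1, len(text) + 1))

for s in valid:
    print(s, bool(FINAL.match(s)), accepts_every_prefix(s))
for s in invalid:
    print(s, bool(FINAL.match(s)), bool(WHILE_TYPING.match(s)))
```

Since the relaxed pattern also accepts an unfinished value such as 12., you would probably still want to check the final text against the stricter pattern when the textbox loses focus.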