Regex: How to remove English words from sentences using Regex? - regex

I have a number of rows in SQLite; each row has one column that contains data like this:
prosperکامیاب شدن ، موفق شدن ، رونق یافتن
As you can see, the sentence starts with English words. I want to remove the English words at the start of each sentence. Is there any way to do that with a T-SQL query (using regex)?

You may try this. I have written it as a function you can call. Note that it removes every Latin letter in the string, not only the leading ones.
create function dbo.RemoveEngChars (@Unicode_string nvarchar(max))
returns nvarchar(max) as
begin
    declare @i int = 1; -- must start from 1, as SubString is 1-based
    -- nvarchar(max) here, so long inputs are not truncated
    declare @OriginalString nvarchar(max) = @Unicode_string collate SQL_Latin1_General_Cp1256_CS_AS;
    declare @ModifiedString nvarchar(max) = N'';
    while @i <= Len(@OriginalString)
    begin
        -- keep the character only if it is not a Latin letter
        if SubString(@OriginalString, @i, 1) not like '[a-zA-Z]'
        begin
            set @ModifiedString = @ModifiedString + SubString(@OriginalString, @i, 1);
        end
        set @i = @i + 1;
    end
    return @ModifiedString
end
-- To call the function, run the following script and pass the Unicode string with the N' prefix
select dbo.RemoveEngChars(N'prosperکامیاب شدن ، موفق شدن ، رونق یافتن')
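Since the question mentions SQLite, here is a minimal Python sketch of the same idea applied only to the leading English text. It assumes the data can be reached through Python's standard `sqlite3` module; the table name `words`, the column name `entry`, and the function name `strip_leading_english` are made up for illustration.

```python
import re
import sqlite3

# ASCII letters and whitespace at the very start of the string
LEADING_LATIN = re.compile(r"^[A-Za-z\s]+")

def strip_leading_english(text):
    """Remove ASCII letters and whitespace from the start of text."""
    if text is None:
        return None
    return LEADING_LATIN.sub("", text)

# Hypothetical in-memory table standing in for the real data
conn = sqlite3.connect(":memory:")
conn.create_function("strip_leading_english", 1, strip_leading_english)
conn.execute("CREATE TABLE words (entry TEXT)")
conn.execute("INSERT INTO words VALUES (?)", ("prosper\u0633\u0644\u0627\u0645",))

# Apply the Python function from inside SQL
conn.execute("UPDATE words SET entry = strip_leading_english(entry)")
cleaned = conn.execute("SELECT entry FROM words").fetchone()[0]
print(cleaned)
```

Unlike the T-SQL function above, this only touches the beginning of each value, so Latin letters that appear later in the sentence are left alone.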

Related

Regex for IFC with array attributed

IFC is a variation of STEP files used for construction projects. The IFC file contains information about the building being constructed. The file is text based and is easy to read. I am trying to parse this information into a Python dictionary.
The general format of each line will be similar to the following:
#2334=IFCMATERIALLAYERSETUSAGE(#2333,.AXIS2.,.POSITIVE.,-180.);
Ideally this should be parsed into #2334, IFCMATERIALLAYERSETUSAGE, #2333,.AXIS2.,.POSITIVE.,-180.
I found a solution for part of the problem in "Regex includes two matches in first match":
https://regex101.com/r/RHIu0r/10
However, in some cases the data contains arrays instead of single values, as in the example below:
#2335=IFCRELASSOCIATESMATERIAL('2ON6$yXXD1GAAH8whbdZmc',#5,$,$,(#40,#221,#268,#281),#2334);
This case needs to be parsed as #2335, IFCRELASSOCIATESMATERIAL, '2ON6$yXXD1GAAH8whbdZmc', #5,$,$, [#40,#221,#268,#281],#2334
where [#40,#221,#268,#281] is stored in a single variable as an array.
The array can be in the middle or be the last variable.
Would you be able to assist in creating a regular expression that produces the desired results?
I have created https://regex101.com/r/mqrGka/1 with cases to test.
Here's a solution that continues from the point you reached with the regular expression in the test cases:
import re

file = """\
#1=IFCOWNERHISTORY(#89024,#44585,$,.NOCHANGE.,$,$,$,1190720890);
#2=IFCSPACE(';;);',#1,$);some text);
#2=IFCSPACE(';;);',#1,$);
#2885=IFCRELAGGREGATES('1gtpBVmrDD_xsEb7NuFKc8',#5,$,$,#2813,(#2840,#2846,#2852,#2858,#2879));
#2334=IFCMATERIALLAYERSETUSAGE(#2333,.AXIS2.,.POSITIVE.,-180.);
#2335=IFCRELASSOCIATESMATERIAL('2ON6$yXXD1GAAH8whbdZmc',#5,$,$,(#40,#221,#268,#281),#2334);
""".splitlines()

d = dict()
for line in file:
    m = re.match(r"^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(((?:'[^']*'|[^;'])+)\);", line, re.I|re.M)
    if m is None:
        continue                            # skip lines that are not entity definitions
    attr = m.group(3)                       # attribute list string
    values = [m.group(2)]                   # first value is the entity type name
    while attr:
        start = 1
        if attr[0] == "'": start += attr.find("'", 1)   # don't split at comma within string
        if attr[0] == "(": start += attr.find(")", 1)   # don't split item within parentheses
        end = attr.find(",", start)         # search for a comma / end of item
        if end < 0: end = len(attr)
        value = attr[1:end-1].split(",") if attr[0] == "(" else attr[:end]
        if value[0] == "'": value = value[1:-1]         # remove quotes
        values.append(value)
        attr = attr[end+1:]                 # remove current attribute item
    d[m.group(1)] = values                  # store into dictionary
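As an alternative to the manual scanning loop, the attribute list can also be tokenized with a second regular expression. The following is only a sketch: `parse_ifc_line`, `LINE_RE` and `ITEM_RE` are names invented here, and it assumes lists are not nested and strings contain no escaped quotes.

```python
import re

# Same line pattern as in the answer above
LINE_RE = re.compile(r"^#(\d+)\s*=\s*([A-Za-z0-9]+)\s*\(((?:'[^']*'|[^;'])+)\);")
# One attribute item: a quoted string, a flat parenthesised list, or a bare value
ITEM_RE = re.compile(r"'[^']*'|\([^()]*\)|[^,]+")

def parse_ifc_line(line):
    """Return (id, [type, attr1, attr2, ...]) or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    items = []
    for token in ITEM_RE.findall(m.group(3)):
        if token.startswith("("):
            inner = token[1:-1]
            items.append(inner.split(",") if inner else [])  # list attribute
        elif token.startswith("'"):
            items.append(token[1:-1])                        # strip the quotes
        else:
            items.append(token)
    return "#" + m.group(1), [m.group(2)] + items

print(parse_ifc_line(
    "#2335=IFCRELASSOCIATESMATERIAL('2ON6$yXXD1GAAH8whbdZmc',#5,$,$,(#40,#221,#268,#281),#2334);"
))
```

The order of the alternatives in `ITEM_RE` matters: strings and lists are tried first so that commas inside them are not treated as separators.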

PostgreSQL return an Array or Record as a Row

I'm trying to return a variable with a PostgreSQL function that returns row/rows so I can use libpqxx on the client side to iterate over it for example using:
for (pqxx::result::const_iterator row = result.begin(); row != result.end(); row++)
{
    for (pqxx::const_row_iterator field = row.begin(); field != row.end(); field++)
    {
        cout << field << '\n';
    }
}
This is my PostgreSQL function:
CREATE OR REPLACE FUNCTION seal_diff_benchmark_pgsql(sealparams CHARACTER VARYING) RETURNS RECORD AS $outputVar$
DECLARE
    tempVar1 CHARACTER VARYING;
    tempVar2 CHARACTER VARYING;
    outputVar1 TEXT[];
    outputVar record;
    sealArray TEXT[];
    execTime NUMERIC[];
BEGIN
    FOR i IN 1..2 LOOP
        SELECT "Pickup_longitude", "Dropoff_longitude" INTO tempVar1, tempVar2 FROM public.nyc2015_09_enc WHERE id=i;
        sealArray := (SELECT public.seal_diff_benchmark(tempVar1, tempVar2, sealparams));
        outputVar1[i] := sealArray[1];
        execTime[i] := sealArray[2];
    END LOOP;
    SELECT UNNEST(outputVar1) INTO outputVAR;
    RETURN outputVar;
END;
$outputVar$ LANGUAGE plpgsql;
I also tried returning outputVar1 as TEXT[]. On the client side, my field variable holds {foo, bar} if I use RETURNS TEXT[], or (foo) if I use RETURNS RECORD. That is not what I need: I want the function to return plain rows from the TEXT[] array or the RECORD variable, without any (), [] or {} characters at the beginning and end of the output.
How can I change my PostgreSQL function to make it work? I think I'm missing something but I can't see what.
There are many approaches to do what you want.
If it really is just one column that you want, then you can simply do:
CREATE OR REPLACE FUNCTION seal_diff_benchmark_pgsql(sealparams CHARACTER VARYING)
RETURNS SETOF TEXT AS $outputVar$
DECLARE
    tempVar1 CHARACTER VARYING;
    tempVar2 CHARACTER VARYING;
    sealArray TEXT[];
    execTime NUMERIC[];
BEGIN
    FOR i IN 1..2 LOOP
        SELECT "Pickup_longitude", "Dropoff_longitude" INTO tempVar1, tempVar2
        FROM public.nyc2015_09_enc WHERE id=i;
        sealArray := (SELECT public.seal_diff_benchmark(tempVar1, tempVar2, sealparams));
        execTime[i] := sealArray[2];
        -- sealArray[1] is a single text value, so emit it directly as one row
        -- (FOREACH ... IN ARRAY would need an array expression, not an element)
        RETURN NEXT sealArray[1];
    END LOOP;
END;
$outputVar$ LANGUAGE plpgsql;
The returned column will be named just like the function.
SELECT seal_diff_benchmark_pgsql FROM seal_diff_benchmark_pgsql('stuff');
-- alternative
SELECT seal_diff_benchmark_pgsql('stuff');
You can also specify columns in function parameters:
CREATE OR REPLACE FUNCTION seal_diff_benchmark_pgsql(sealparams CHARACTER VARYING, OUT outputVar text)
Then returned column will be named outputVar. In case of returning just one column, Postgres forces RETURNS to be of that column type, so in this case SETOF TEXT or just TEXT if one row is expected. If you return more than one column, then you need to use RETURNS SETOF RECORD.
When you use named columns in function parameters, then you need to assign values to them just like you would to variables from DECLARE section:
LOOP
    outputVar := 'some value';
    outputVar2 := 'some value';
    outputVar3 := 'some value';
    RETURN NEXT;
END LOOP;
There are a few other examples on how to return sets from functions in my old answer here: How to return rows of query result in PostgreSQL's function?

T-SQL RegExp to find sequential repeated characters

I'm looking for a regular expression (RegExp) to find consecutive duplicated characters anywhere in a word in SQL Server. For example:
"AAUGUST" matches (AA)
"ANDREA" doesn't match (there are two "A" vowels, but they are separated)
"ELEEPHANT" matches (EE)
I was trying with:
SELECT field1
FROM exampleTable
WHERE field1 like '%([A-Z]){2}%'
But it doesn't work.
I'd appreciate your help.
Thanks!
You can't do what you're asking with T-SQL's LIKE.
Your best bet is to look at using the Common Language Runtime (CLR), but it can also be achieved (albeit painfully slowly) using, for example, a scalar value function as follows:
create function dbo.ContainsRepeatingAlphaChars(@str nvarchar(max)) returns bit
as begin
    declare @p int, -- the position we're looking at
            @c char(1) -- the previous char
    if @str is null or len(@str) < 2 return 0;
    select @c = substring(@str, 1, 1), @p = 1;
    while (1=1) begin
        set @p = @p + 1; -- move position pointer ahead
        if @p > len(@str) return 0; -- if we're at the end of the string and haven't already exited, we haven't found a match
        if @c like '[A-Z]' and @c = substring(@str, @p, 1) return 1; -- if last char is A-Z and matches the current char then return "found!"
        set @c = substring(@str, @p, 1); -- Get next char
    end
    return 0; -- this will never be hit but stops SQL Server complaining that not all paths return a value
end
GO
-- Example usage:
SELECT field1
FROM exampleTable
WHERE dbo.ContainsRepeatingAlphaChars(field1) = 1
Did I mention that it would be slow? Don't use this on a large table. Go CLR.
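For comparison, in a regex engine that supports backreferences (which LIKE does not), the check is a one-line pattern. The Python sketch below shows the kind of pattern a CLR or application-side solution could use; the function name `first_repeated_pair` is made up for this example.

```python
import re

# ([A-Z]) captures one uppercase letter; \1 requires the same letter again right after it
REPEATED_LETTER = re.compile(r"([A-Z])\1")

def first_repeated_pair(word):
    """Return the first doubled letter pair in word, or None if there is none."""
    m = REPEATED_LETTER.search(word)
    return m.group(0) if m else None

for word in ("AAUGUST", "ANDREA", "ELEEPHANT"):
    print(word, first_repeated_pair(word))
```

The pattern is case-sensitive as written; pass `re.IGNORECASE` to `re.compile` if lowercase input should match as well.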

Remove everything but numbers from a cell

I have an Excel sheet where I use the following formula to get numbers from a cell that contains free-form text:
=MID(D2;SEARCH("number";D2)+6;13)
It searches for the string "number" and returns the next 13 characters after it. But sometimes the result contains more than the number, because the text in these cells does not follow a consistent pattern, as in the examples below:
62999999990
21999999990
11999999990
6299999993) (
17999999999)
21914714753)
58741236714 P
18888888820
How do I avoid taking anything but numbers, or how do I remove everything but numbers from what I get?
You can use this User Defined Function (UDF), which returns only the digits inside a specific cell.
Code:
Function only_numbers(strSearch As String) As String
    Dim i As Integer, tempVal As String
    For i = 1 To Len(strSearch)
        ' keep the character only if it is numeric
        If IsNumeric(Mid(strSearch, i, 1)) Then
            tempVal = tempVal & Mid(strSearch, i, 1)
        End If
    Next
    only_numbers = tempVal
End Function
To use it, you must:
Press ALT + F11
Insert new Module
Paste code inside Module window
Now you can use the formula =only_numbers(A1) at your spreadsheet, by changing A1 to your data location.
Example Images:
Inserting code at module window:
Executing the function
Ps.: if you want to delimit the number of digits to 13, you can change the last line of code from:
only_numbers = tempVal
to
only_numbers = Left(tempVal, 13)
Alternatively, you can take a look at this topic to understand how to achieve this using formulas.
If you are going to go to a User Defined Function (aka UDF) then perform all of the actions; don't rely on the preliminary worksheet formula to pass a stripped number and possible suffix text to the UDF.
In a standard code module, add:
Function udfJustNumber(str As String, _
                       Optional delim As String = "number", _
                       Optional startat As Long = 1, _
                       Optional digits As Long = 13, _
                       Optional bCaseSensitive As Boolean = False, _
                       Optional bNumericReturn As Boolean = True)
    Dim c As Long
    udfJustNumber = vbNullString
    str = Trim(Mid(str, InStr(startat, str, delim, IIf(bCaseSensitive, vbBinaryCompare, vbTextCompare)) + Len(delim), digits))
    For c = 1 To Len(str)
        Select Case Asc(Mid(str, c, 1))
            Case 32
                'do nothing- skip over
            Case 48 To 57
                If bNumericReturn Then
                    udfJustNumber = Val(udfJustNumber & Mid(str, c, 1))
                Else
                    udfJustNumber = udfJustNumber & Mid(str, c, 1)
                End If
            Case Else
                Exit For
        End Select
    Next c
End Function
I've used your narrative to add several optional parameters. You can change these if your circumstances change. Most notable is whether to return a true number or text-that-looks-like-a-number with the bNumericReturn option. Note that the returned values are right-aligned as true numbers should be in the following supplied image.
By supplying FALSE to the sixth parameter, the returned content is text-that-looks-like-a-number and is now left-aligned in the worksheet cell.
If you don't want VBA and would like to use Excel Formulas only, try this one:
=SUMPRODUCT(MID(0&MID(D2,SEARCH("number",D2)+6,13),LARGE(INDEX(ISNUMBER(--MID(MID(D2,SEARCH("number",D2)+6,13),ROW($1:$13),1))* ROW($1:$13),0),ROW($1:$13))+1,1)*10^ROW($1:$13)/10)
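The logic shared by these answers (find the marker, take a fixed-width window after it, keep only the digits) can be sketched outside Excel as well. The Python version below is only an illustration; the function name `digits_after` and its parameters are invented here and mirror the "number" marker and 13-character window from the question.

```python
import re

def digits_after(text, marker="number", width=13):
    """Return the digits found in the `width` characters that follow `marker`."""
    pos = text.lower().find(marker.lower())    # case-insensitive, like SEARCH
    if pos < 0:
        return ""                              # marker not present
    start = pos + len(marker)
    window = text[start:start + width]         # same window as MID(...;13)
    return re.sub(r"\D", "", window)           # drop everything that is not a digit

print(digits_after("Phone number 6299999993) (more"))
print(digits_after("number62999999990 ok"))
```

Note that this removes every non-digit in the window, like the first UDF, whereas udfJustNumber stops at the first character that is neither a digit nor a space.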

Split a word with regexp in matlab; startIndex for 'split'?

My aim is to generate the phonetic transcription for any word according to a set of rules.
First, I want to split words into their syllables. For example, I want an algorithm to find 'ch' in a word and then separate it like shown below:
Input: 'aachbutcher'
Output: 'a' 'a' 'ch' 'b' 'u' 't' 'ch' 'e' 'r'
I have come this far:
check=regexp('aachbutcher','ch');
if (isempty(check{1,1})==0) % Returns 0, when 'ch' was found.
    [match split startIndex endIndex] = regexp('aachbutcher','ch','match','split')
    %Now I split the 'aa', 'but' and 'er' into single characters:
    for i = 1:length(split)
        SingleLetters{i} = regexp(split{1,i},'.','match');
    end
end
My problem is: How do I put the cells together, such that they are formatted like the desired output? I only have the starting indexes for the match parts ('ch') but not for the split parts ('aa', 'but','er').
Any ideas?
You don't need to work with the indices or lengths. The logic is simple: process the first element from split, then the first from match, then the second from split, and so on.
[match,split,startIndex,endIndex] = regexp('aachbutcher','ch','match','split');
%Now I split the 'aa', 'but' and 'er' into single characters:
SingleLetters=regexp(split{1,1},'.','match');
for i = 2:length(split)
    SingleLetters=[SingleLetters,match{i-1},regexp(split{1,i},'.','match')];
end
So, you know the length of 'ch': it's 2. You also know where it was found, because regexp returns those positions as start indices. I'm assuming (please correct me if I'm wrong) that you want to split all other letters of the word into single-letter cells, like in your output above. So you can walk through the word and use the start indices to decide, at each position, whether to emit 'ch' or a single letter:
word = 'aachbutcher';
startIndex = regexp(word,'ch');       % positions where 'ch' begins
output = {};
i = 1;
while i <= length(word)               % while loop, so the index can jump ahead
    if ismember(i, startIndex)
        output{end+1} = 'ch';
        i = i + 2;                    % skip both characters of 'ch'
    else
        output{end+1} = word(i);      % ordinary single letter
        i = i + 1;
    end
end
I don't have MATLAB right now, so I can't test it. I hope it works for you! If not, let me know and I'll take another shot at it.
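The same split can also be expressed as a single alternation that tries the multi-letter units first and falls back to any single character. Here is a small Python sketch of that idea; the function name `split_keeping_digraphs` and the `digraphs` parameter are made up for illustration.

```python
import re

def split_keeping_digraphs(word, digraphs=("ch",)):
    """Split word into single characters, keeping each listed digraph together."""
    # Longer units first, so they win over the single-character fallback "."
    alternatives = [re.escape(d) for d in sorted(digraphs, key=len, reverse=True)]
    alternatives.append(".")
    return re.findall("|".join(alternatives), word)

print(split_keeping_digraphs("aachbutcher"))
# ['a', 'a', 'ch', 'b', 'u', 't', 'ch', 'e', 'r']
```

MATLAB's regexp accepts the same kind of alternation with the 'match' option, so a pattern such as 'ch|.' may remove the need to recombine the match and split outputs at all.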