Identifying nearly identical messages in list

Identifying nearly identical messages in list - regex

It looks like a simple task, but how would you solve it? I don't get any solution right now.
ls_message-text = 'Pernr. 12345678 (Pete Peterson) is valid (06/2015).
append ls_message to lt_message.
ls_message-text = 'Pernr. 12345678 (Pete Peterson) is valid (07/2015).
append ls_message to lt_message.
This is the code I got, the thing is, this is the message I am showing in my application. The customer says that the 2 messages are the same. The second should be deleted.
How would you compare it to delete the line? The table might contain more then 2 lines and also with another text like "is not valid".
I can't extend the structure to have more fields for comparison, I can only use the string comparison on this one field. Are there string comparisons possible with a regex or something?

Maybe you could solve your requirement using the Levenshtein distance . ABAP has a built-in function "distance" that gives you the number of operations to convert one string into another. Ex:
DATA msg1 type string.
DATA msg2 type string.
msg1 = 'Levehnstein Distance 7/2015'.
msg2 = 'Levehnstein Distance 6/2015'.
data l_distance type i.
l_distance = distance( val1 = msg1 val2 = msg2 ).
if l_distance lt 2 .
"It's almost the same text
endif.
In this case l_distance will be 1, because only one operation is necessary (replacing).
Hope this helps,

Assuming you want to retain only one message for each unique Pernr. in lt_message, you can use regex to filter for the Pernr. and use that as "key". Now you can delete all but the first message of lt_message that matches this key.
Expand your regex if you want to keep only certain messages, e.g. only the "is valid" ones.

have you tried looking to program DEMO_REGEX_TOY.
Gives an idea on how to work with Regular expresion, that probably will save the problem

Related

How to create new column that parses correct values from a row to a list

I am struggling on creating a formula with Power Bi that would split a single rows value into a list of values that i want.
So I have a column that is called ID and it has values such as:
"ID001122, ID223344" or "IRRELEVANT TEXT ID112233, MORE IRRELEVANT;ID223344 TEXT"
What is important is to save the ID and 6 numbers after it. The first example would turn into a list like this: {"ID001122","ID223344"}. The second example would look exactly the same but it would just parse all the irrelevant text from between.
I was looking for some type of an loop formula where you could use the text find function to find ID starting point and use middle function to extract 8 characters from the start but I had no progress in finding such. I tried making lists from comma separator but I noticed that not all rows had commas to separate IDs.
The end results would be that the original value is on one column next to the list of parsed values which then could be expanded to new rows.
ID Parsed ID
"Random ID123456, Text;ID23456" List {"ID123456","ID23456"}
Any of you have former experience?

Hey I found the answer by myself using a good article similar to my problem.
Here is my solution without any further text parsing which i can do later on.
each let
PosList = Text.PositionOf([ID],"ID",Occurrence.All),
List = List.Transform(PosList, (x) => Text.Middle([ID],x,8))
in List
For example this would result "(ID343137,ID352973) ID358388" into {ID343137,ID352973,ID358388}
Ended up being easier than I thought. Suppose the solution relied again on the lists!

Exact match of string in pandas python

I have a column in data frame which ex df:
A
0 Good to 1. Good communication EI : tathagata.kar#ae.com
1 SAP ECC Project System EI: ram.vaddadi#ae.com
2 EI : ravikumar.swarna Role:SSE Minimum Skill
I have a list of of strings
ls=['tathagata.kar#ae.com','a.kar#ae.com']
Now if i want to filter out
for i in range(len(ls)):
df1=df[df['A'].str.contains(ls[i])
if len(df1.columns!=0):
print ls[i]
I get the output
tathagata.kar#ae.com
a.kar#ae.com
But I need only tathagata.kar#ae.com
How Can It be achieved?
As you can see I've tried str.contains But I need something for extact match

You could simply use ==
string_a == string_b
It should return True if the two strings are equal. But this does not solve your issue.
Edit 2: You should use len(df1.index) instead of len(df1.columns). Indeed, len(df1.columns) will give you the number of columns, and not the number of rows.
Edit 3: After reading your second post, I've understood your problem. The solution you propose could lead to some errors.
For instance, if you have:
ls=['tathagata.kar#ae.com','a.kar#ae.com', 'tathagata.kar#ae.co']
the first and the third element will match str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i])
And this is an unwanted behaviour.
You could add a check on the end of the string: str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i]+r'(?:\s|$)')
Like this:
for i in range(len(ls)):
df1 = df[df['A'].str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i]+r'(?:\s|$)')]
if len(df1.index != 0):
print (ls[i])
(Remove parenthesis in the "print" if you use python 2.7)

Thanks for the help. But seems like I found a solution that is working as of now.
Must use str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i])
This seems to solve the problem.
Although thanks to #IsaacDj for his help.

Why not just use:
df1 = df[df['A'].[str.match][1](ls[i])
It's the equivalent of regex match.

Applying regexp and finding the highest number in a list

I have got a list of different names. I have a script that prints out the names from the list.
req=urllib2.Request('http://some.api.com/')
req.add_header('AUTHORIZATION', 'Token token=hash')
response = urllib2.urlopen(req).read()
json_content = json.loads(response)
for name in json_content:
print name['name']
Output:
Thomas001
Thomas002
Alice001
Ben001
Thomas120
I need to find the max number that comes with the name Thomas. Is there a simple way to to apply regexp for all the elements that contain "Thomas" and then apply max(list) to them? The only way that I have came up with is to go through each element in the list, match regexp for Thomas, then strip the letters and put the remaining numbers to a new list, but this seems pretty bulky.

You don't need regular expressions, and you don't need sorting. As you said, max() is fine. To be safe in case the list contains names like "Thomasson123", you can use:
names = ((x['name'][:6], x['name'][6:]) for x in json_content)
max(int(b) for a, b in names if a == 'Thomas' and b.isdigit())
The first assignment creates a generator expression, so there will be only one pass over the sequence to find the maximum.

You don't need to go for regex. Just store the results in a list and then apply sorted function on that.
>>> l = ['Thomas001',
'homas002',
'Alice001',
'Ben001',
'Thomas120']
>>> [i for i in sorted(l) if i.startswith('Thomas')][-1]
'Thomas120'

Stata: Efficient way to replace numerical values with string values

I have code that currently looks like this:
replace fname = "JACK" if id==103
replace lname = "MARTIN" if id==103
replace fname = "MICHAEL" if id==104
replace lname = "JOHNSON" if id==104
And it goes on for multiple pages like this, replacing an ID name with a first and last name string. I was wondering if there is a more efficient way to do this en masse, perhaps by using the recode command?

I will echo the other answers that suggest a merge is the best way to do this.
But if you absolutely must code the lines item-wise (again, messy) you can generate a long list ("pages") of replace commands by using MS Excel to "help" you write the code. Here is a picture of your Excel sheet with one example, showing the MS Excel formula:
columns:
A B C D
row: 1 last first id code
2 MARTIN JACK 103 ="replace fname=^"&B2&"^ if id=="&C2
You type that in, make sure it looks like Stata code when the formula calculates (aside from the carets), and copy the formula in column D down to the end of your list. Then copy the whole block of Stata code in column D generated by the formulas into your do-file, and do a find and replace (be careful here if you are using the caret elsewhere for mathematical uses!!) for all ^ to be replaced with ", which will end up generating proper Stata syntax.
(This is truly a brute force way of doing this, and is less dynamic in the case that there are subsequent changes to your generation list. All--apologies in advance for answering a question here advocating use of Excel :) )

You don't explain where the strings you want to add come from, but what is generally the best technique is explained at
http://www.stata.com/support/faqs/data-management/group-characteristics-for-subsets/index.html

Create an associative array of ids vs Fname,Lname
103 => JACK,MARTIN
104 => MICHAEL,JOHNSON
...
Replace
id => hash{id} ( fname & lname )
The efficiency of doing this will be taken care by the programming language used

How to read semicolon separated certain values from a QString?

I am developing an application using Qt/KDE. While writing code for this, I need to read a QString that contains values like ( ; delimited)
<http://example.com/example.ext.torrent>; rel=describedby; type="application/x-bittorrent"; name="differentname.ext"
I need to read every attribute like rel, type and name into a different QString. The apporach I have taken so far is something like this
if (line.contains("describedby")) {
m_reltype = "describedby" ;
}
if (line.contains("duplicate")) {
m_reltype = "duplicate";
}
That is if I need to be bothered only by the presence of an attribute (and not its value) I am manually looking for the text and setting if the attribute is present. This approach however fails for attributes like "type" and name whose actual values need to be stored in a QString. Although I know this can be done by splitting the entire string at the delimiter ; and then searching for the attribute or its value, I wanted to know is there a cleaner and a more efficient way of doing it.

As I understand, the data is not always an URL.
So,
1: Split the string
2: For each substring, separate the identifier from the value:
id = str.mid(0,str.indexOf("="));
value = str.mid(str.indexOf("=")+1);
You can also use a RegExp:
regexp = "^([a-z]+)\s*=\s*(.*)$";
id = \1 of the regexp;
value = \2 of the regexp;

I need to read every attribute like rel, type and name into a different QString.
Is there a gurantee that this string will always be a URL?
I wanted to know is there a cleaner and a more efficient way of doing it.
Don't reinvent the wheel! You can use QURL::queryItems which would parse these query variables and return a map of name-value pairs.
However, make sure that your string is a well-formed URL (so that QURL does not reject it).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Identifying nearly identical messages in list - regex

have you tried looking to program DEMO_REGEX_TOY. Gives an idea on how to work with Regular expresion, that probably will save the problem

Related

How to create new column that parses correct values from a row to a list

Exact match of string in pandas python

Applying regexp and finding the highest number in a list

Stata: Efficient way to replace numerical values with string values

How to read semicolon separated certain values from a QString?

Categories

Resources