Unicode string extraction and comparison

Unicode string extraction and comparison - regex

1. val Matcher = """.+/(.*)""".r
2. val Matcher(title) = """http://en.wikipedia.org/wiki/Château_La_Louvière"""
3. val lowerCase = title.toLower
4. if (lowerCase.equals("château_la_louvière")) // do something
The comparison above returns false, I guess because line 2 results in Ch?teau_La_Louvi?re. Any ideas how I can make this comparison work?

As 4e6 says, the problem lies in the standard configuration of Java, which assumes all files are encoded in Latin-1.
1. val Matcher = """.+/(.*)""".r
2. val Matcher(title) = """http://en.wikipedia.org/wiki/Château_La_Louvière"""
This can be fixed by setting the following JAVA_OPTS:
export JAVA_OPTS='-Dfile.encoding=UTF-8'
Lines 1 and 2 will still work even if you don't change the encoding. The problem lies in lines 3 and 4.
3. val lowerCase = title.toLower
4. if (lowerCase.equals("château_la_louvière")) // do something
toLower will cause the test in line 4 to fail because "â" and "è" are interpreted wrongly. In UTF-8 each of these characters is encoded as two bytes (UTF-8 uses up to four bytes per character). When the source is read as Latin-1, each byte is treated as a separate character and lowercased independently, yielding a result completely different from château_la_louvière.
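The effect can be reproduced outside the JVM. The following is a minimal Python sketch (not the Scala code from the question), assuming the source bytes are UTF-8 but get read as Latin-1; the variable names are illustrative only.

```python
# Illustrative sketch: what happens when UTF-8 bytes are read as Latin-1.
title = "Château_La_Louvière"

utf8_bytes = title.encode("utf-8")       # "â" and "è" become two bytes each
misread = utf8_bytes.decode("latin-1")   # every byte turns into its own character
correct = utf8_bytes.decode("utf-8")     # decoding with the matching charset

# The misread string is two characters longer than the original.
print(len(title), len(utf8_bytes), len(misread))

# Lowercasing the misread string no longer matches the expected literal.
print(misread.lower() == "château_la_louvière")
print(correct.lower() == "château_la_louvière")
```

Only the string decoded with the matching charset compares equal after lowercasing, which is why fixing the file encoding resolves the comparison.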

Related

How to remove prefixed u from a unicode string?

I am reading the lines from a CSV file and applying the LDA algorithm to find the most common topic. After the data processing, every word in doc_processed carries a 'u' prefix. Why? Please suggest how to remove the 'u' from doc_processed. My code in Python 2.7 is:
import re
import string
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
data = [line.strip() for line in open("/home/dnandini/test/h.csv", 'r')]
stop = set(stopwords.words('english'))  # stop words
exclude = set(string.punctuation)  # to remove the punctuation
lemma = WordNetLemmatizer()  # to map with parts of speech
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    shortword = re.compile(r'\W*\b\w{1,2}\b')  # words of one or two characters
    output = shortword.sub('', normalized)
    return output
doc_processed = [clean(doc) for doc in data]
Output as doc_processed -
[u'amount', u'ze69heneicosadien11one', u'trap', u'containing', u'little', u'microg', u'zz69ket', u'attracted', u'male', u'low', u'population', u'level', u'indicating', u'potent', u'sex', u'attractant', u'trap', u'baited', u'z6ket', u'attracted', u'male', u'windtunnel', u'bioassay', u'least', u'100fold', u'attractive', u'male', u'zz69ket', u'improvement', u'trap', u'catch', u'occurred', u'addition', u'z6ket', u'various', u'binary', u'mixture', u'zz69ket', u'including', u'female', u'ratio', u'ternary', u'mixture', u'zz69ket']

The u'some string' format means it is a unicode string. See this question for more details on unicode strings themselves, but the easiest way to fix this is likely to call str.encode on the result before returning it from clean.
def clean(doc):
    # all as before until
    output = shortword.sub('', normalized).encode()
    return output
Note that attempting to encode unicode code points that don't translate directly to the default encoding (which appears to be ASCII; see sys.getdefaultencoding() on your system to check) will throw an error here. You can handle the error in various ways by defining the errors kwarg to encode.
s.encode(errors="ignore") # omit the code point that fails to encode
s.encode(errors="replace") # replace the failing code point with '?'
s.encode(errors="xmlcharrefreplace") # replace the failing code point with an XML character reference such as '&#233;'
# Note that U+FFFD, the Unicode Replacement Character, is what "replace" inserts when decoding, not when encoding.
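A small sketch of how the three error handlers behave, written for Python 3 rather than the Python 2.7 of the question (in Python 3 every str is Unicode and the u prefix no longer appears). The sample word is an assumption, chosen because it contains one code point with no ASCII equivalent.

```python
word = "café"  # hypothetical sample; "é" (U+00E9) cannot be encoded as ASCII

ignored = word.encode("ascii", errors="ignore")             # drops the "é"
replaced = word.encode("ascii", errors="replace")           # substitutes "?"
xml_ref = word.encode("ascii", errors="xmlcharrefreplace")  # substitutes "&#233;"

print(ignored, replaced, xml_ref)

# Without an errors argument the default "strict" handler raises instead.
strict_error = None
try:
    word.encode("ascii")
except UnicodeEncodeError as exc:
    strict_error = exc
    print("strict mode raised:", exc.reason)
```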

Exact match of string in pandas python

I have a column in a data frame, for example df:
A
0 Good to 1. Good communication EI : tathagata.kar#ae.com
1 SAP ECC Project System EI: ram.vaddadi#ae.com
2 EI : ravikumar.swarna Role:SSE Minimum Skill
I have a list of strings:
ls=['tathagata.kar#ae.com','a.kar#ae.com']
Now if I want to filter the rows with it:
for i in range(len(ls)):
    df1=df[df['A'].str.contains(ls[i])]
    if len(df1.columns!=0):
        print ls[i]
I get the output
tathagata.kar#ae.com
a.kar#ae.com
But I need only tathagata.kar#ae.com
How can this be achieved?
As you can see, I've tried str.contains, but I need something for an exact match.

You could simply use ==
string_a == string_b
It should return True if the two strings are equal. But this does not solve your issue.
Edit 2: You should use len(df1.index) instead of len(df1.columns). Indeed, len(df1.columns) will give you the number of columns, and not the number of rows.
Edit 3: After reading your second post, I've understood your problem. The solution you propose could lead to some errors.
For instance, if you have:
ls=['tathagata.kar#ae.com','a.kar#ae.com', 'tathagata.kar#ae.co']
the first and the third element will match str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i])
And this is an unwanted behaviour.
You could add a check on the end of the string: str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i]+r'(?:\s|$)')
Like this:
for i in range(len(ls)):
    df1 = df[df['A'].str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i]+r'(?:\s|$)')]
    if len(df1.index) != 0:
        print (ls[i])
(Remove the parentheses in the "print" if you use python 2.7)
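The boundary check can be tried without pandas. The sketch below uses only the standard re module on the three sample rows from the question; build_pattern is a hypothetical helper, and the re.escape call is an addition not present in the answer above (it stops the "." in an address from matching any character).

```python
import re

rows = [
    "Good to 1. Good communication EI : tathagata.kar#ae.com",
    "SAP ECC Project System EI: ram.vaddadi#ae.com",
    "EI : ravikumar.swarna Role:SSE Minimum Skill",
]
candidates = ["tathagata.kar#ae.com", "a.kar#ae.com", "tathagata.kar#ae.co"]


def build_pattern(needle):
    """Compile a pattern that matches needle only as a whole token."""
    return re.compile(
        r"(?:\s|^|Ei:|EI:|EI-)" + re.escape(needle) + r"(?:\s|$)"
    )


# Keep the candidates that occur as a whole token in at least one row.
matched = [
    candidate
    for candidate in candidates
    if any(build_pattern(candidate).search(row) for row in rows)
]
print(matched)
```

The same pattern string (build_pattern(candidate).pattern) should be usable as the argument to str.contains on the data frame column.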

Thanks for the help, but it seems I found a solution that works for now.
I had to use str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i])
This seems to solve the problem.
Thanks to @IsaacDj for his help, though.

Why not just use:
df1 = df[df['A'].str.match(ls[i])]
It's the equivalent of regex match.

Only one IF ELSE statement working SAS

Can someone explain to me why only the first IF ELSE statement in my code works? I am trying to combine multiple variables into one.
DATA BCMasterSet2;
    SET BCMasterSet;
    drop PositiveLymphNodes1;
    if PositiveLymphNodes1 = "." then PositiveLymphNodes = put(PositiveLymphNodes2, 2.);
    else PositiveLymphNodes = PositiveLymphNodes1;
    if PositiveLymphNodes2 = "." then new_posLymph = put(PositiveLymphNodes, 2.);
    else new_posLymph = PositiveLymphNodes2;
RUN;
Here is a screenshot of what the incorrect output looks like: OUTPUT
Thanks!

Hard to say without seeing all of your data, but I have a suspicion: is positivelymphnodes1 character or numeric? Is it ever actually equal to "."?
If you are trying to say "if PositiveLymphNodes1 is missing", then you can say that this way:
if missing(positivelymphnodes1) then ...
You can also do the same thing using coalesce or coalescec (the former returns a numeric value, the latter a character value). Each chooses the first nonmissing argument, so if the first argument is missing, it chooses the second.
positiveLymphNodes = coalescec(PositiveLymphNodes1, put(positiveLymphNodes2,2.));
new_posLymph = coalescec(positiveLymphNodes2, put(positiveLymphNodes,2.));
I would be curious why you're using put only in one place and not the other - use it in both or neither, I would suggest.
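For readers unfamiliar with the function, the "first nonmissing argument" behaviour can be sketched in Python. This is only an analogy, not SAS: the coalescec function below is a hypothetical stand-in that treats None and blank strings as missing, and returning None when everything is missing is an assumption of this sketch.

```python
def coalescec(*values):
    """Return the first argument that is not missing (None or blank)."""
    for value in values:
        if value is not None and str(value).strip() != "":
            return value
    return None  # assumption of this sketch: nothing usable was supplied


# Hypothetical row: the first variable is missing, the second holds 3.
nodes1 = None
nodes2 = 3
positive_lymph_nodes = coalescec(nodes1, format(nodes2, "2d"))
print(repr(positive_lymph_nodes))
```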

Identifying nearly identical messages in list

It looks like a simple task, but how would you solve it? I don't get any solution right now.
ls_message-text = 'Pernr. 12345678 (Pete Peterson) is valid (06/2015)'.
append ls_message to lt_message.
ls_message-text = 'Pernr. 12345678 (Pete Peterson) is valid (07/2015)'.
append ls_message to lt_message.
This is the code I have; these are the messages I am showing in my application. The customer says that the two messages are the same, so the second should be deleted.
How would you compare them to delete the line? The table might contain more than 2 lines, and also other texts like "is not valid".
I can't extend the structure to have more fields for comparison, I can only use the string comparison on this one field. Are there string comparisons possible with a regex or something?

Maybe you could solve your requirement using the Levenshtein distance. ABAP has a built-in function "distance" that gives you the number of operations needed to convert one string into another. Example:
DATA msg1 type string.
DATA msg2 type string.
msg1 = 'Levenshtein Distance 7/2015'.
msg2 = 'Levenshtein Distance 6/2015'.
data l_distance type i.
l_distance = distance( val1 = msg1 val2 = msg2 ).
if l_distance lt 2.
  "It's almost the same text
endif.
In this case l_distance will be 1, because only one operation (a replacement) is necessary.
Hope this helps.
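To make the idea concrete outside ABAP, here is a Python sketch of the Levenshtein distance together with a filter that drops messages lying within a small distance of one already kept. The function names and the threshold of 1 are assumptions for illustration; this is a textbook dynamic-programming version, not the implementation behind ABAP's distance.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def drop_near_duplicates(messages, max_distance=1):
    """Keep a message only if no already-kept message is within max_distance."""
    kept = []
    for message in messages:
        if all(levenshtein(message, other) > max_distance for other in kept):
            kept.append(message)
    return kept


messages = [
    "Pernr. 12345678 (Pete Peterson) is valid (06/2015).",
    "Pernr. 12345678 (Pete Peterson) is valid (07/2015).",
    "Pernr. 12345678 (Pete Peterson) is not valid (06/2015).",
]
unique_messages = drop_near_duplicates(messages)
print(unique_messages)
```

The threshold needs tuning: two months such as 06 and 12 differ in two characters, so a limit of 1 would treat those messages as distinct.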

Assuming you want to retain only one message for each unique Pernr. in lt_message, you can use regex to filter for the Pernr. and use that as "key". Now you can delete all but the first message of lt_message that matches this key.
Expand your regex if you want to keep only certain messages, e.g. only the "is valid" ones.
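The key-based approach could look roughly like this in Python. The pattern and the helper name are assumptions based on the two sample messages in the question, and the second personnel number is invented for the example.

```python
import re

# Assumed message layout: "Pernr. <digits> ..." as in the question's samples.
PERNR_PATTERN = re.compile(r"Pernr\.\s*(\d+)")


def first_message_per_pernr(messages):
    """Keep only the first message seen for each personnel number."""
    seen = set()
    kept = []
    for message in messages:
        match = PERNR_PATTERN.search(message)
        # Fall back to the full text when no personnel number is found.
        key = match.group(1) if match else message
        if key not in seen:
            seen.add(key)
            kept.append(message)
    return kept


messages = [
    "Pernr. 12345678 (Pete Peterson) is valid (06/2015).",
    "Pernr. 12345678 (Pete Peterson) is valid (07/2015).",
    "Pernr. 87654321 (Jane Doe) is valid (06/2015).",  # hypothetical second person
]
filtered = first_message_per_pernr(messages)
print(filtered)
```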

Have you tried looking at the program DEMO_REGEX_TOY?
It gives an idea of how to work with regular expressions, which will probably solve the problem.

URL-Encoding a Byte String?

I am writing a Bittorrent client. One of the steps involved requires that the program sends an HTTP GET request to the tracker containing an SHA1 hash of part of the torrent file. I have used Fiddler2 to intercept the request sent by Azureus to the tracker.
The hash that Azureus sends is URL-Encoded and looks like this: %D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR
The hash should look like this before it's URL-Encoded: d90c3ce39418f0c5d98358e03349262b608cbf52
I notice that it is not as simple as placing a '%' symbol every two characters, so how would I go about encoding this byte string to get the same result as Azureus?
Thanks in advance.

Actually, you can just place a % symbol every two characters. Azureus doesn't do that because, for example, R is a safe character in a URL, and 52 is the hexadecimal representation of R, so it doesn't need to percent-encode it. Using %52 instead is equivalent.
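Both forms can be compared with a short Python sketch (Python is used here for illustration; the question itself concerns a client written in another language). It uses quote_from_bytes and unquote_to_bytes from the standard urllib.parse module; the variable names are illustrative.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

hex_hash = "d90c3ce39418f0c5d98358e03349262b608cbf52"
raw = bytes.fromhex(hex_hash)  # the 20 raw bytes of the SHA1 hash

# Form 1: a "%" in front of every pair of hex digits.
every_byte_escaped = "".join(
    "%" + hex_hash[i:i + 2].upper() for i in range(0, len(hex_hash), 2)
)

# Form 2: escape only the bytes that are not URL-safe characters.
only_unsafe_escaped = quote_from_bytes(raw, safe="")

print(every_byte_escaped)
print(only_unsafe_escaped)

# Both forms decode back to the same 20 bytes.
print(unquote_to_bytes(every_byte_escaped) == unquote_to_bytes(only_unsafe_escaped))
```

The second form reproduces the string captured from Azureus, while the first is longer but equivalent once decoded.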

Go through the string from left to right. If you encounter a %, output the next two characters, converting upper-case to lower-case. If you encounter anything else, output the ASCII code for that character in hex using lower-case letters.
%D9 %0C %3C %E3 %94 %18 %F0 %C5 %D9 %83 X %E0 3 I %26 %2B %60 %8C %BF R
The ASCII code for X is 0x58, so that becomes 58. The ASCII code for 3 is 0x33.
(I'm kind of puzzled why you had to ask though. Your question clearly shows that you recognized this as URL-Encoded.)
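The left-to-right procedure described above can be written out directly. This is a minimal Python sketch; escaped_to_hex is a hypothetical name, and the input is assumed to be well formed (every "%" is followed by two hex digits, every other character is ASCII).

```python
def escaped_to_hex(escaped):
    """Convert a percent-encoded string to a lower-case hex string."""
    pieces = []
    i = 0
    while i < len(escaped):
        if escaped[i] == "%":
            # Copy the two hex digits after "%", lower-cased.
            pieces.append(escaped[i + 1:i + 3].lower())
            i += 3
        else:
            # Any other character contributes its ASCII code in hex.
            pieces.append(format(ord(escaped[i]), "02x"))
            i += 1
    return "".join(pieces)


escaped = "%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR"
print(escaped_to_hex(escaped))
```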

Even though I know well the original question was about C++, it might be useful somehow, sometimes to see alternative solutions. Therefore, for what it's worth (10 years later), here's
An alternative solution implemented in Python 3.6+
import binascii
import urllib.parse
# Caveat: Windows-1252 leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined,
# so hashes containing those bytes fail to decode with the default encoding.
def hex_str_to_esc_str(hex_str: str, *, encoding: str='Windows-1252') -> str:
    # decode the bytes of the hex string using the given encoding
    decoded_str = binascii.unhexlify(hex_str).decode(encoding)
    # escape string and return
    return urllib.parse.quote(decoded_str, encoding=encoding)
def esc_str_to_hex_str(esc_str: str, *, encoding: str='Windows-1252') -> str:
    # unescape the escaped string using the given encoding
    decoded_str = urllib.parse.unquote(esc_str, encoding=encoding)
    # encode string, hexlify, and return
    return decoded_str.encode(encoding).hex()
Two elementary tests:
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(hex_str_to_esc_str(hex_str) == esc_str) # True
print(esc_str_to_hex_str(esc_str) == hex_str) # True
Note
Windows-1252 (aka cp1252) emerged as the default encoding as a result of the following test:
import binascii
import chardet
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(
    chardet.detect(
        binascii.unhexlify(hex_str)
    )
)
...which gave a pretty strong clue:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

We Keep Coding
