XSLT 2.0 Convert a space separated string to a sequence - xslt

I'm looking for a function or simple method to convert a space-separated string into a sequence of strings. For example the string 'abcd ef ghi' would be converted to a three-string sequence: 'abcd','ef','ghi'. It can be assumed that there is only one space between the sets of characters. A string with no spaces would generate a one-string sequence.
I looked around through the usual references, but nothing jumped out at me. I'm using XSLT 2.0. Suggestions?

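You are describing the tokenize() function. For example, tokenize('abcd ef ghi', ' ') returns the sequence ('abcd', 'ef', 'ghi'), and a string with no spaces yields a one-string sequence. Note that the second argument is a regular expression.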

Related

XSLT/XPATH How to concatenate two strings containing Hebrew characters in a left to right direction?

I'm trying to concatenate two strings containing Hebrew characters in XSLT/XPath (not XSL-FO). However, when I try concat(stringA, stringB), the output I'm getting is string B + string A.
I guess this is because Hebrew characters have a right-to-left direction. What can I do to get string A + string B in the output? The output file I need to produce is a text file (neither XML nor HTML).
Any help would be appreciated. Thanks!
Update: Here is an example:
stringA: יוסף
stringB: בניון
then concat(stringA,stringB) gets me this: יוסףבניון instead of בניוןיוסף
Also, there's no guarantee that stringA and stringB will always contain Hebrew characters, so concat(stringB, stringA) would not work for me.
<stringA>יוסף</stringA>
<stringB>בניון</stringB>
then
concat(stringA,stringB)
gets me this:
יוסףבניון
instead of
בניוןיוסף
The result that you get is the correct result: stringA is before stringB.
Because the characters are RTL, the entire block is displayed from right-to-left (as one would expect). However, the order of the individual characters in the underlying string (as well as in the resulting text file) is:
י
ו
ס
ף
ב
נ
י
ו
ן
You can verify this by looking at the hex dump of the file.
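If a hex editor is not at hand, the same check can be done by listing the code points. The following is a minimal Python sketch, used here only as an inspection tool and not as part of the XSLT solution; the variable names are illustrative. It concatenates the two strings from the question and prints the characters and UTF-8 bytes in storage order.

```python
# The two strings from the question, written as escapes so that a
# bidi-aware editor cannot visually reorder the source code.
string_a = "\u05d9\u05d5\u05e1\u05e3"        # stringA from the question
string_b = "\u05d1\u05e0\u05d9\u05d5\u05df"  # stringB from the question

# Plain concatenation, the equivalent of concat(stringA, stringB).
result = string_a + string_b

# Code points in logical (storage) order, independent of display direction.
code_points = ["U+%04X" % ord(ch) for ch in result]
print(code_points)

# The bytes that would end up in a UTF-8 encoded text file.
utf8_hex = result.encode("utf-8").hex()
print(utf8_hex)
```

The first code point printed belongs to stringA and the last to stringB, which is the logical order the answer describes. Only the on-screen rendering runs right to left.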

replace the character in XSLT

I am using XSLT for a transformation. The input has one element that holds up to 240 characters, and that element can contain various special characters (e.g. %, ?, /, -, _, #, !, $, ^).
I need to replace those characters.
Is it possible in XSLT 1.0? If so, can you please give me the code with examples? Thanks.
Eg:
<remark> whfwlknf234#skl$ck?nvwkld^fnwlfn </remark>
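Is it possible in XSLT 1.0?
Yes, it is possible. Use the translate() function to replace them with ... oh, you didn't say with what. For example, translate(remark, '%?/-_#!$^', '') deletes those characters outright, because any character in the second argument that has no counterpart in the third is removed.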

find if string starts with \U in Python 3.3

I have a string and I want to find out if it starts with \U.
Here is an example
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
I was trying this:
myStr.startswith('\\U')
but I get False.
How can I detect \U in a string?
The larger picture:
I have a list of strings, most of them are normal English word strings, but there are a few that are similar to what I have shown in myStr, how can I distinguish them?
The original string does not have the character \U. It has the unicode escape sequence \U0001f64c, which is a single Unicode character.
Therefore, it does not make sense to try to detect \U in the string you have given.
Trying to detect the \U in that string is similar to trying to detect \x in the C string "\x90".
It makes no sense because the interpreter has read the sequence and converted it. Of course, if you want to detect the first Unicode character in that string, that works fine.
myStr.startswith('\U0001f64c')
Note that if you define the string with a real \U, like this, you can detect it just fine. Based on some experimentation, I believe Python 2.7.6 defaults to this behavior.
myStr = r'\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
myStr.startswith('\\U') # Returns True.
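To make the difference between the two literals concrete, here is a small sketch, assuming Python 3.3 or later; the names interpreted and raw are made up for illustration. It compares the lengths and prefixes of the two strings.

```python
# Ordinary literal: each \UXXXXXXXX escape becomes one character.
interpreted = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'

# Raw literal: the backslash, the U and the hex digits are kept as-is.
raw = r'\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'

print(len(interpreted))               # 5 characters
print(len(raw))                       # 50 characters (5 escapes of 10 each)
print(interpreted.startswith('\\U'))  # False
print(raw.startswith('\\U'))          # True
```

The interpreted string never contains a backslash at all, which is why startswith('\\U') cannot match it.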
Update: The OP requested a way to convert from the Unicode string into the raw string above.
I will show the solution in two steps.
First observe that we can view the raw hex for each character like this.
>>> [hex(ord(x)) for x in myStr]
['0x1f64c', '0x1f60d', '0x1f4a6', '0x1f445', '0x1f4af']
Next, we format it by using a format string.
myChars = [ord(x) for x in myStr]  # code points of the original (non-raw) string
formatString = "".join(r'\U%08x' for x in myStr)
output = formatString % tuple(myChars)
output.startswith("\\U") # Returns True.
Note, of course, that since we are converting a Unicode string and formatting it this way deliberately, it is guaranteed to start with \U. However, I assume your actual application is not just to detect whether it starts with \U.
Update2: If the OP is trying to differentiate between "normal English" strings and "Unicode Strings", the above approach will not work, because all characters have a corresponding Unicode representation.
However, one heuristic you might use to check whether a string looks like ASCII is to check whether any character's value falls outside the normal ASCII range. Assuming that you consider the normal ASCII range to be between 32 and 127 (you can consult an ASCII table and decide what you want to include), you can do something like the following.
def isNormal(myStr):
    myChars = [ord(x) for x in myStr]
    return all(x < 128 and x > 31 for x in myChars)
This can be done in one line, but I separated it to make it more readable.
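Applied to the "larger picture" of a list of strings, the heuristic could be used roughly as follows. This is only a sketch: the sample list and the name is_normal are invented for illustration, and the 32..127 range is the same assumption as above.

```python
def is_normal(text):
    """Return True if every character lies in the ASCII range 32..127."""
    return all(31 < ord(ch) < 128 for ch in text)

# Hypothetical input: mostly plain English words plus a few outliers.
words = ["hello", "world", "\U0001f64c\U0001f60d", "caf\u00e9"]

# Split the list into strings that pass the check and strings that do not.
normal = [w for w in words if is_normal(w)]
unusual = [w for w in words if not is_normal(w)]

print(normal)
print(len(unusual))
```

Note that a word with an accented letter lands in the unusual group together with the emoji string, so the range may need widening depending on the data.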
Your string:
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
is not foreign-language text. It is 5 Unicode characters, which are (in order):
PERSON RAISING BOTH HANDS IN CELEBRATION
SMILING FACE WITH HEART-SHAPED EYES
SPLASHING SWEAT SYMBOL
TONGUE
HUNDRED POINTS SYMBOL
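These names can be looked up with the standard unicodedata module. A minimal sketch, assuming a Python 3 build whose Unicode database includes these code points (variable names are illustrative):

```python
import unicodedata

my_str = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'

# Look up the official Unicode name of each character in the string.
names = [unicodedata.name(ch) for ch in my_str]

for ch, name in zip(my_str, names):
    print("U+%05X %s" % (ord(ch), name))
```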
If you want to get strings that only contain 'normal' characters, you can use something like this:
import re
if re.search(r'[^A-Za-z0-9\s]', myStr):
    # String contained 'weird' characters.
    print("String contained 'weird' characters.")
Note that this will also trip on characters like é, which will sometimes be used in English on words with a French origin.

XSLT - Replace a number of chars in a string

I have the following string,
';#6;#'
The above string could be anything, E.g.:
';#1;#' or ';#2;#' , or ';#3;#' ...
I need to be able to replace the contents between the ' and '
Is this possible using something like translate in XSLT 1.0?
This kind of thing is quite difficult in XSLT 1.0. Take a look at the library of string-handling functions available at www.exslt.org - some of them come with XSLT implementations that you can copy into your stylesheet and call (typically as xsl:call-template).
Use substring and concat functions.

How to compute a Unicode string whose bidirectional representation is specified?

Fellows, I have a rather perverse question. Please forgive me :)
There's an official algorithm that describes how bidirectional unicode text should be presented.
http://www.unicode.org/reports/tr9/tr9-15.html
I receive a string (from some third-party source) that contains Latin/Hebrew characters, as well as digits, whitespace, punctuation symbols, etc.
The problem is that the string I receive is already in its visual representation form, i.e. the sequence of characters I receive should simply be displayed from left to right.
Now, my goal is to find the Unicode string whose representation is exactly the same. That is, I need to pass that string to another entity, which will then render it according to the official algorithm, and the result should be the same.
Assuming the following:
The default text direction (of the rendering entity) is RTL.
I don't want to inject "special unicode characters" that explicitly override the text direction (such as RLO, RLE, etc.)
I suspect several solutions may exist. If so, I'd like to preserve the RTL look of the string as much as possible. The string usually consists mostly of Hebrew words. I'd like to preserve the correct order of those words and of the characters inside them, whereas other character sequences may (and should) be transposed.
One naive way to solve this is to reverse the whole string (this takes care of the Hebrew words) and then reverse the sequences of non-Hebrew characters inside it. This, however, doesn't always produce correct results, because the actual rules of representation are rather complex.
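The naive approach just described can be sketched in Python as follows. This is an illustration only, with assumptions: a character counts as "Hebrew" when its bidi class is R or AL, everything else (Latin letters, digits, spaces, punctuation) is grouped into one run, and the function names are invented here. As noted above, the result is not guaranteed to render correctly under the full bidirectional algorithm.

```python
import unicodedata

def is_rtl(ch):
    """True for strong right-to-left characters (Hebrew, Arabic)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")

def naive_visual_to_logical(visual):
    """Reverse the whole string, then restore each non-RTL run."""
    out = []
    run = []  # current run of non-RTL characters, in reversed order
    for ch in visual[::-1]:
        if is_rtl(ch):
            # Close the pending non-RTL run by reversing it back.
            out.extend(reversed(run))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(reversed(run))
    return "".join(out)

# Visual order, left to right: three Hebrew letters, "eng 123", three more.
visual = "\u05d0\u05d1\u05d2 eng 123 \u05d3\u05d4\u05d5"
logical = naive_visual_to_logical(visual)
print([hex(ord(c)) for c in logical])
```

Because spaces and digits are lumped together with Latin letters, the whole middle section is restored as a single block, which is exactly the kind of case where the real algorithm may order things differently.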
The only comprehensive algorithm I see so far is a brute-force check. The string can be divided into sequences of same-class characters. Those sequences may be joined in any order, and any of them may be reversed. I can check all those combinations to find the correct result.
This technique can also be optimized. For instance, the order of the Hebrew words is known, so we only have to check different combinations of the sequences that join them.
Any better ideas? If you have an idea, not necessarily the whole solution, that's OK. I'll appreciate any idea.
Thanks in advance.
If you want to check whether a character is bidirectional, you have to use the UCD (Unicode Character Database), which is provided by Unicode.org and includes lots of information about characters. One of the attributes in that database gives the bidirectionality of a character.
So you have to download the UCD, then write a class that looks up your character in the XML and returns the answer.
I did this in an open-source C# application, and you can find it here: http://Unicode.Codeplex.com
Please let me know whether this resolves your issue.
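For readers who do not want to parse the UCD XML themselves, the same per-character property is exposed by Python's standard unicodedata module. This is a sketch of an alternative lookup, not the C# application mentioned above; the helper name bidi_classes and the sample string are made up for illustration.

```python
import unicodedata

def bidi_classes(text):
    """Pair each character with its Unicode Bidi_Class value."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Hebrew alef, Latin letter, digit, space, comma, Arabic alef.
sample = "\u05d0b1 ,\u0627"

for ch, cls in bidi_classes(sample):
    print("U+%04X %s" % (ord(ch), cls))
```

The classes returned (such as R, L, EN, WS, CS, AL) are the inputs the bidirectional algorithm works from; knowing them is necessary but, as the follow-up below points out, not sufficient to reconstruct the logical string.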
Nasser, thanks for the answer.
Unfortunately it doesn't fully resolve my problem.
So far, for every character I can find out its directionality. Still, I don't see how I can compute the whole string so that its representation matches what I need.
Imagine you want to have the following text written from left to right, where Hebrew/Arabic characters are denoted by capital letters:
ABC eng 123 456 DEF
The correct string would be like this:
FED 456 123 eng CBA
or also:
FED eng 456 123 CBA
Or, if using explicit direction override codes it can be written like this:
FED eng 123 456 CBA
I currently solve this problem by injecting explicit directionality override codes into the string: I isolate sequences of Hebrew/Arabic words, and for all the joining LTR/weak/neutral characters I explicitly override the direction to LTR.
However, I'd like to do this without injecting explicit override codes.