There are two “text handling” types in Xojo – String and Text.
And they vary quite a bit in how they handle textual data.
While strings use UTF-8 as their default encoding, you still have to worry about which Unicode normalization form (composed or decomposed) the characters in the string are in. Strings don't deal with “characters” in the way you and I perceive them.
For instance, if you run this code:
    Dim s1 As String = "ü"
    Dim s2 As String = &u75 + &u308
    Break
What you see in the debugger as the text they hold is the same.
But if you change this to
    Dim s1 As String = "ü"
    Dim s2 As String = &u75 + &u308
    If s1 = s2 Then
      Break
    End If
what you will find is that while they appear to you and me to hold the same contents, they are not “the same”. This is because the first one is in composed form (the single code point U+00FC) and the second is in decomposed form (a plain “u”, U+0075, followed by the combining diaeresis, U+0308). Encoded as UTF-8, those are two different sequences of bytes.
And there's no built-in mechanism to know whether a string is in one form or the other, nor any to convert one into the other 🙁
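One way to see the difference for yourself is to dump the raw bytes behind each string. This is only a rough sketch: it assumes the IDE stores the "ü" literal in composed form, as in the example above, and it uses EncodeHex and System.DebugLog purely for illustration.

```xojo
Dim s1 As String = "ü"          // assumed to be stored composed (U+00FC)
Dim s2 As String = &u75 + &u308 // decomposed: "u" plus combining diaeresis

// EncodeHex returns the bytes of the string as hex digits;
// passing True inserts a space between each byte
System.DebugLog(EncodeHex(s1, True)) // composed ü is the two UTF-8 bytes C3 BC
System.DebugLog(EncodeHex(s2, True)) // u + U+0308 is the three UTF-8 bytes 75 CC 88
```

If the two lines in the log differ, the strings hold different bytes even though they look identical in the debugger.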
So years ago the “next” framework was created and a new type added – Text. And it deals with these issues much better. That same code, using Text, looks like
    Dim t1 As Text = "ü"
    Dim t2 As Text = &u75 + &u308
    If t1 = t2 Then
      Break
    End If
but this time when you run it you WILL hit the breakpoint. Text handles the different forms seamlessly and you get the result you expect.
And the differences go further than this. When you split a string up into “characters” you get different numbers of characters from the two apparently equal strings. Not so with Text.
    Dim s1 As String = "ü"
    Dim s2 As String = &u75 + &u308

    If s1 = s2 Then
      Break // won't stop here but you might expect it should
    End If

    Dim s1Chars() As String = s1.Split("")
    Dim s2Chars() As String = s2.Split("")

    Break
    // note that s1Chars.Ubound < s2Chars.Ubound
    // and the contents are totally different

    // these loops walk each string one byte at a time
    Dim s1CodePoints() As UInt32
    For i As Integer = 1 To s1.LenB
      s1CodePoints.Append AscB(s1.MidB(i,1))
    Next

    Dim s2CodePoints() As UInt32
    For i As Integer = 1 To s2.LenB
      s2CodePoints.Append AscB(s2.MidB(i,1))
    Next

    Break
    // again the ubounds are different - this time they should be !

    Dim t1 As Text = "ü"
    Dim t2 As Text = &u75 + &u308

    If t1 = t2 Then
      Break
    End If

    Dim t1Chars() As Text = t1.Split
    Dim t2Chars() As Text = t2.Split

    Break
    // note that t1Chars.Ubound = t2Chars.Ubound
    // and the chars are "the same" !!!!!!

    Dim t1CodePoints() As UInt32
    For Each cp As UInt32 In t1.Codepoints
      t1CodePoints.Append cp
    Next

    Dim t2CodePoints() As UInt32
    For Each cp As UInt32 In t2.Codepoints
      t2CodePoints.Append cp
    Next

    Break
    // these should differ since one uses the composed form
    // and one uses the decomposed form
Text just handles things seamlessly.
With the transition to API 2 it will be a shame if String doesn't adopt some of these capabilities AND there’s no framework-provided means to normalize strings so they all use composed or decomposed characters, which would let us deal with the inconsistencies that can arise.
2 Replies to “When characters arent characters”
2020r1 will get a String.Characters iterator, like Text.Characters. First step in the right direction.
Can't discuss unreleased versions.