{"id":683,"date":"2020-04-24T11:21:00","date_gmt":"2020-04-24T17:21:00","guid":{"rendered":"https:\/\/www.great-white-software.com\/blog\/?p=683"},"modified":"2020-04-16T20:20:18","modified_gmt":"2020-04-17T02:20:18","slug":"when-characters-arent-characters","status":"publish","type":"post","link":"https:\/\/www.great-white-software.com\/blog\/2020\/04\/24\/when-characters-arent-characters\/","title":{"rendered":"When characters aren&#8217;t characters"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">There are two &#8220;text handling&#8221; types in Xojo &#8211; String and Text.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And they vary quite a bit in how they handle textual data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">While strings use UTF-8 as their default encoding, you still have to worry about which Unicode normalization form (composed or decomposed) the characters in the string are in. Strings don&#8217;t deal with &#8220;characters&#8221; in the way you and I perceive them. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For instance, if you run this code<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Dim s1 As String = \"\u00fc\"\nDim s2 As String = &amp;u75 + &amp;u308\n\nBreak\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">What you see in the debugger as the text they hold is the same.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">But if you change this to<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\nDim s1 As String = \"\u00fc\"\n\nDim s2 As String = &amp;u75 + &amp;u308\n\nIf s1 = s2 Then \n  Break\nEnd If\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">what you will find is that, while they appear to you and me to hold the same contents, they are not &#8220;the same&#8221;. 
And this is because the first one uses the composed form (a single code point, U+00FC) and the second uses the decomposed form (the letter u, U+0075, followed by a combining diaeresis, U+0308), so their UTF-8 byte sequences differ.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And there&#8217;s no built-in mechanism to know whether a string is in one form or the other, nor any to convert one into the other \ud83d\ude41<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So years ago the &#8220;next&#8221; framework was created and a new type added &#8211; Text. And it deals with these issues much better. That same code, using Text, looks like<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Dim t1 As Text = \"\u00fc\"\nDim t2 As Text = &amp;u75 + &amp;u308\n\nIf t1 = t2 Then\n  Break\nEnd If\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">but this time when you run it you WILL hit the breakpoint. Text handles the different forms seamlessly and you get the result you expect.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">And the differences go further than this. When you split a string up into &#8220;characters&#8221; you get different numbers of characters from the two apparently equal strings. Not so with Text. <\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\nDim s1 As String = \"\u00fc\"\n\nDim s2 As String = &amp;u75 + &amp;u308\n\nIf s1 = s2 Then \n  Break \/\/ won't stop here but you might expect it should\nEnd If\n\nDim s1Chars() As String = s1.Split(\"\")\nDim s2Chars() As String = s2.Split(\"\")\n\nBreak \/\/ note that s1chars. 
ubound &lt; s2chars.ubound\n      \/\/ and the contents are totally different\n\nDim s1CodePoints() As UInt32\nFor i As Integer = 1 To s1.LenB\n  s1CodePoints.Append AscB(s1.MidB(i,1))\nNext\nDim s2CodePoints() As UInt32\nFor i As Integer = 1 To s2.LenB\n  s2CodePoints.Append AscB(s2.MidB(i,1))\nNext\n\nBreak \/\/ again the ubounds are different (these loops collect UTF-8 bytes) - this time they should be !\n\nDim t1 As Text = \"\u00fc\"\n\nDim t2 As Text = &amp;u75 + &amp;u308\n\nIf t1 = t2 Then\n  Break\nEnd If\n\nDim t1Chars() As Text = t1.Split\nDim t2Chars() As Text = t2.Split\n\nBreak \/\/ note that t1Chars.ubound = t2Chars.ubound\n      \/\/ and the chars are \"the same\" !!!!!!\n\nDim t1CodePoints() As UInt32\nFor Each cp As UInt32 In t1.Codepoints\n  t1CodePoints.Append cp\nNext\nDim t2CodePoints() As UInt32\nFor Each cp As UInt32 In t2.Codepoints\n  t2CodePoints.Append cp\nNext\n\nBreak \/\/ these should differ since one uses the composed form\n      \/\/  and one uses the decomposed form<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Text just handles things seamlessly.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With the transition to API 2 it will be a shame if String doesn&#8217;t adopt some of these capabilities AND there&#8217;s no framework-provided means to normalize strings so they all use either the composed or the decomposed form, so we can deal with the inconsistencies that can arise.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There are two &#8220;text handling&#8221; types in Xojo &#8211; String and Text. And they vary quite a bit in how they handle textual data. While strings use UTF-8 as their default encoding you still have to worry about what form of UTF-8 the characters in the string are in. 
Strings don&#8217;t deal with &#8220;characters&#8221; in &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.great-white-software.com\/blog\/2020\/04\/24\/when-characters-arent-characters\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;When characters aren&#8217;t characters&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-683","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.great-white-software.com\/blog\/wp-json\/wp\/v2\/posts\/683","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.great-white-software.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.great-white-software.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.great-white-software.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.great-white-software.com\/blog\/wp-json\/wp\/v2\/comments?post=683"}],"version-history":[{"count":1,"href":"https:\/\/www.great-white-software.com\/blog\/wp-json\/wp\/v2\/posts\/683\/revisions"}],"predecessor-version":[{"id":684,"href":"https:\/\/www.great-white-software.com\/blog\/wp-json\/wp\/v2\/posts\/683\/revisions\/684"}],"wp:attachment":[{"href":"https:\/\/www.great-white-software.com\/blog\/wp-json\/wp\/v2\/media?parent=683"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.great-white-software.com\/blog\/wp-json\/wp\/v2\/categories?post=683"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.great-white-software.com\/blog\/wp-json\/wp\/v2\/tags?post=683"}],"curies":[{"name":"wp","href":"https:
\/\/api.w.org\/{rel}","templated":true}]}}