I also reached out to them on Twitter but they directed me to this form. I followed up with them on Twitter with what happened in this screenshot but they are now ignoring me.

  • dan@upvote.au
    1 year ago

    ‘U’ and ‘u’ are two different symbols. And you have to make such rules for every language a part of your processing logic.

    Unicode has standard rules for case folding, which cover every language Unicode supports. Case-insensitive comparisons in all good programming languages use this data.

    Note that you can’t simply convert both strings to uppercase or lowercase to compare them, as then you’ll run into the Turkish i problem: https://haacked.com/archive/2012/07/05/turkish-i-problem-and-why-you-should-care.aspx/
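
    A minimal sketch of that difference, assuming Python 3, whose built-in `str.upper()` and `str.casefold()` use Unicode's default (locale-independent) mappings. The helper names `equal_by_upper` and `equal_by_casefold` are made up for illustration:

```python
def equal_by_upper(a: str, b: str) -> bool:
    # Naive approach: uppercase both sides, then compare.
    return a.upper() == b.upper()


def equal_by_casefold(a: str, b: str) -> bool:
    # Compare using Unicode default case folding instead.
    return a.casefold() == b.casefold()


dotless = "\u0131rmak"  # begins with U+0131, the Turkish dotless i
dotted = "irmak"        # begins with an ordinary ASCII i

# Both letters uppercase to "I", so the naive check treats them as equal.
print(equal_by_upper(dotless, dotted))        # True

# Default case folding leaves U+0131 alone, so the strings stay distinct.
print(equal_by_casefold(dotless, dotted))     # False

# Ordinary case differences still compare equal.
print(equal_by_casefold("UPVOTE", "upvote"))  # True
```

    Neither helper normalizes its input, so precomposed and decomposed forms of the same letter would still compare as different here.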

    • rottingleaf
      1 year ago

      So good that we all use Unicode now. No CP1251, no ISO single-byte encodings, no Japanese encoding hell.

    • labsin@sh.itjust.works
      1 year ago

      The issue is that capitalization is language-dependent, which email addresses shouldn’t be; I’d hope the rules for French aren’t different from those for Dutch. For instance é in Dutch is capitalized as E, but in French it is É. The eszett didn’t even have an official capital before 2017.

      In most programming languages, case-insensitive string compare without specifying the culture became deprecated. It should imo only be used for fuzzy searching doubles, which you probably will do with ToUpper for performance reasons, or maybe some UI validation.
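
      As a rough illustration, assuming Python 3: its built-in case mappings take no culture argument and always apply Unicode's default rules, so a language-specific convention such as the Dutch one above would need separate, locale-aware tooling. The variable names are made up for the example:

```python
e_acute = "\u00e9"         # é
eszett = "\u00df"          # ß
capital_eszett = "\u1e9e"  # ẞ, the capital form mentioned above

# The default mapping keeps the accent, regardless of language.
upper_e = e_acute.upper()
print(upper_e == "\u00c9")         # True (É)

# The default uppercase of ß is still the two letters "SS".
upper_eszett = eszett.upper()
print(upper_eszett)                # SS

# The capital eszett does lowercase back to ß.
print(capital_eszett.lower() == eszett)  # True

# Upper-then-lower is lossy: ß does not survive the round trip.
round_trip = eszett.upper().lower()
print(round_trip)                  # ss
```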

      • dan@upvote.au
        1 year ago

        For instance é in Dutch is capitalized as E, but in French it is É

        Sure, but we’re just talking about string comparison rules, and a Unicode comparison that ignores case and accents sees all three of those as being equal. For example, a search engine that uses proper case folding (plus accent folding) in its indexer should return results for “entrée” if you search for “entree”, “Čech” if you search for “cech”, etc.

        It should imo only be used for fuzzy searching doubles, which you probably will do with ToUpper

        You can’t just use ToUpper for comparisons due to issues like you mentioned, and the Turkish i problem. You need to do proper case-insensitive comparisons, which is where the Unicode case folding rules are used.
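
        One way such a search key could be built, sketched in Python 3 with only the standard library. It assumes that dropping combining marks after NFD decomposition is an acceptable stand-in for accent-insensitive matching (a real indexer would more likely use a collation library), and `search_key` is a hypothetical helper name:

```python
import unicodedata


def search_key(text: str) -> str:
    # Decompose so that accented letters become base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (assumption: accents are ignored for matching).
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Apply Unicode default case folding to the remaining letters.
    return without_marks.casefold()


print(search_key("entr\u00e9e") == search_key("entree"))  # True
print(search_key("\u010cech") == search_key("cech"))      # True

# Case folding on its own does not remove the accent.
print("entr\u00e9e".casefold() == "entree".casefold())    # False
```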

      • rottingleaf
        1 year ago

        offtopic: The eszett strictly speaking was a ligature for ‘sz’, which Hungarian orthography kinda preserved while for German the separated version is ‘ss’, and there’s plenty of such stuff in nature.

        In most programming languages, case-insensitive string compare without specifying the culture became deprecated. It should imo only be used for fuzzy searching doubles, which you probably will do with ToUpper for performance reasons, or maybe some UI validation.

        Thank you for saying that more clearly.