How to Implement String Case Insensitive Compare?

How would you implement a String Comparator used in String#compareToIgnoreCase? I may convert all character to upper case then compare like the following does:

int n1 = s1.length();
int n2 = s2.length();
int min = Math.min(n1, n2);
for (int i = 0; i < min; i++) {
    char c1 = Character.toUpperCase(s1.charAt(i));
    char c2 = Character.toUpperCase(s2.charAt(i));
    if (c1 != c2) {
       return c1 - c2;
    }
}
return n1 - n2;

It seems work and we may also use toLowerCase to replace toUpperCase. But the implementation in JDK source code doesn’t agree:

int n1 = s1.length();
int n2 = s2.length();
int min = Math.min(n1, n2);
for (int i = 0; i < min; i++) {
    char c1 = s1.charAt(i);
    char c2 = s2.charAt(i);
    if (c1 != c2) {
        c1 = Character.toUpperCase(c1);
        c2 = Character.toUpperCase(c2);
        if (c1 != c2) {
            c1 = Character.toLowerCase(c1);
            c2 = Character.toLowerCase(c2);
            if (c1 != c2) {
                // No overflow because of numeric promotion
                return c1 - c2;
            }
        }
    }
}
return n1 - n2;

Yes, it first convert it to upper case, then to lower case again! The logic seems to say that there exists some character that:

They have different upper case, but same lower case: so the second conversion is necessary to tell the difference;
They have different lower case, but same upper case: so the first conversion is necessary;

Feel confused about this logic, I am decided to find some examples to verify it. First I try to find in the JDK source code, which contains the mapping between lower case and upper case. But find a same one using eyes is too hard, so I write a simple program to find:

int charLimit = 65536;
for (int i = 0; i < charLimit; i++) {
  for (int j = i+1; j < charLimit; j++) {
    if (Character.toUpperCase(i) == Character.toUpperCase(j)
        && Character.toLowerCase(i) != Character.toLowerCase(j)) {
      System.out.format("same upper case: %c(%d) & %c(%d)\n", ((char) i), i, ((char) j), j);
    }
    if (Character.toLowerCase(i) == Character.toLowerCase(j)
        && Character.toUpperCase(i) != Character.toUpperCase(j)) {
      System.out.format("same lower case: %c(%d) & %c(%d)\n", ((char) i), i, ((char) j), j);
    }
  }
}

And I indeed find some examples:

same lower case: I(73) & İ(304)
same upper case: I(73) & ı(305)
same lower case: K(75) & K(8490)
same upper case: S(83) & ſ(383)

After some searching, I found this is related to different locales, as Infamous turkish locale bug introduced. It says in the Turkish Locale, the uppercase counterpart of i is not I, but İ. And in return, the I is not transformed to i, but a “dotless i”: ı. Showing in illustration can be the following:

i   ı     
|\  |
| \ |
|  \|
İ   I

So comparing a string is not so easy as we have thought. In the real wold, even such a simple task can be wrong. Actually, compare of locate related string should avoid using String#compareIgnoreCase, as the document of this comparator says:

Note that this method does not take locale into account, and will result in an unsatisfactory ordering for certain locales. The java.text package provides collators to allow locale-sensitive ordering.

And the following is the recommended way to compare string in real application:

//Get the Collator for US English and set its strength to PRIMARY
Collator usCollator = Collator.getInstance(Locale.US);
usCollator.setStrength(Collator.PRIMARY);
if( usCollator.compare("abc", "ABC") == 0 ) {
    System.out.println("Strings are equivalent");
}

For comparing Strings exactly once, the compare method provides the best performance. When sorting a list of Strings however, it is generally necessary to compare each String multiple times. In this case, CollationKeys provide better performance. The CollationKey class converts a String to a series of bits that can be compared bitwise against other CollationKeys. A CollationKey is created by a Collator object for a given String.

Ref

Written with StackEdit.

On teh way

Blog Search

How to Implement String Case Insensitive Compare?

Ref

标签

评论

发表评论

此博客中的热门博文

Spring Boot: Customize Environment

Elasticsearch: Join and SubQuery

Implement isdigit