How would you implement a String Comparator
used in String#compareToIgnoreCase
? I may convert all character to upper case then compare like the following does:
int n1 = s1.length();
int n2 = s2.length();
int min = Math.min(n1, n2);
for (int i = 0; i < min; i++) {
char c1 = Character.toUpperCase(s1.charAt(i));
char c2 = Character.toUpperCase(s2.charAt(i));
if (c1 != c2) {
return c1 - c2;
}
}
return n1 - n2;
It seems work and we may also use toLowerCase
to replace toUpperCase
. But the implementation in JDK source code doesn’t agree:
int n1 = s1.length();
int n2 = s2.length();
int min = Math.min(n1, n2);
for (int i = 0; i < min; i++) {
char c1 = s1.charAt(i);
char c2 = s2.charAt(i);
if (c1 != c2) {
c1 = Character.toUpperCase(c1);
c2 = Character.toUpperCase(c2);
if (c1 != c2) {
c1 = Character.toLowerCase(c1);
c2 = Character.toLowerCase(c2);
if (c1 != c2) {
// No overflow because of numeric promotion
return c1 - c2;
}
}
}
}
return n1 - n2;
Yes, it first convert it to upper case, then to lower case again! The logic seems to say that there exists some character that:
- They have different upper case, but same lower case: so the second conversion is necessary to tell the difference;
- They have different lower case, but same upper case: so the first conversion is necessary;
Feel confused about this logic, I am decided to find some examples to verify it. First I try to find in the JDK source code, which contains the mapping between lower case and upper case. But find a same one using eyes is too hard, so I write a simple program to find:
int charLimit = 65536;
for (int i = 0; i < charLimit; i++) {
for (int j = i+1; j < charLimit; j++) {
if (Character.toUpperCase(i) == Character.toUpperCase(j)
&& Character.toLowerCase(i) != Character.toLowerCase(j)) {
System.out.format("same upper case: %c(%d) & %c(%d)\n", ((char) i), i, ((char) j), j);
}
if (Character.toLowerCase(i) == Character.toLowerCase(j)
&& Character.toUpperCase(i) != Character.toUpperCase(j)) {
System.out.format("same lower case: %c(%d) & %c(%d)\n", ((char) i), i, ((char) j), j);
}
}
}
And I indeed find some examples:
same lower case: I(73) & İ(304)
same upper case: I(73) & ı(305)
same lower case: K(75) & K(8490)
same upper case: S(83) & ſ(383)
After some searching, I found this is related to different locales, as Infamous turkish locale bug introduced. It says in the Turkish Locale, the uppercase counterpart of i
is not I
, but İ
. And in return, the I
is not transformed to i
, but a “dotless i”: ı
. Showing in illustration can be the following:
i ı
|\ |
| \ |
| \|
İ I
So comparing a string is not so easy as we have thought. In the real wold, even such a simple task can be wrong. Actually, compare of locate related string should avoid using String#compareIgnoreCase
, as the document of this comparator says:
Note that this method does not take locale into account, and will result in an unsatisfactory ordering for certain locales. The java.text package provides collators to allow locale-sensitive ordering.
And the following is the recommended way to compare string in real application:
//Get the Collator for US English and set its strength to PRIMARY
Collator usCollator = Collator.getInstance(Locale.US);
usCollator.setStrength(Collator.PRIMARY);
if( usCollator.compare("abc", "ABC") == 0 ) {
System.out.println("Strings are equivalent");
}
For comparing Strings exactly once, the compare method provides the best performance. When sorting a list of Strings however, it is generally necessary to compare each String multiple times. In this case, CollationKeys provide better performance. The CollationKey class converts a String to a series of bits that can be compared bitwise against other CollationKeys. A CollationKey is created by a Collator object for a given String.
Ref
Written with StackEdit.
评论
发表评论