Modern CSV’s Sort Algorithm
To sort a series of strings, you need a way to compare two strings and decide which is “less” and which is “more”. The lesser one comes before the greater one (or, in a descending sort, after it). Modern CSV uses a combination of numerical and lexicographical sort. Here is a super-professional-looking flow chart to show how it works:
Are both strings numbers?
|     \
No     Yes
|       \
|        ˅
|     Numerical Sort
|
˅
Do both strings contain identical non-numeric characters interspersed by (perhaps different) numbers?
|     \
No     Yes
|       \
|        ˅
|     Compare only the number parts. (See explanation below)
|
˅
Straight lexicographical sort
Lexicographical sort is ill-suited for sorting numbers. If you have a series of numbers, say 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and you apply a lexicographical sort, it comes out as 1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9. That’s dumb. It’s better applied to words.
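To see the difference for yourself, here is a small Python sketch. It is only an illustration, not Modern CSV's code: Python's default string ordering happens to be lexicographical, so it reproduces the problem, and `key=int` stands in for a numerical sort.

```python
# The numbers 1 through 12, stored as strings the way a CSV cell would hold them.
values = [str(n) for n in range(1, 13)]

# Python compares strings character by character, so this is a lexicographical sort.
lexicographic = sorted(values)

# Converting each string to an integer for the sort key gives the numerical order.
numeric = sorted(values, key=int)

print(lexicographic)
print(numeric)
```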
Lexicographical comparison looks at the first character of the two strings. If one character is less, the whole string is less and it’s done. If they’re the same, it moves on to the second character. It repeats until it finds differing characters or one string ends. If one string is shorter but they’re otherwise identical, the shorter string is less.
This is why, in the example above, 10 comes before 2. The comparison starts with the first characters, 1 and 2, and since 1 is less than 2, 10 comes first. It’s better to compare these strings numerically.
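The character-by-character comparison described above can be sketched in a few lines of Python. The function name `lexicographic_compare` is made up for this example, and it assumes characters are ordered by their code point; Modern CSV's actual character ordering may differ.

```python
def lexicographic_compare(a: str, b: str) -> int:
    """Return -1 if a sorts before b, 1 if it sorts after, 0 if they are equal."""
    # Walk both strings in step and stop at the first position that differs.
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            return -1 if ch_a < ch_b else 1
    # Every shared position matched, so the shorter string (if any) is less.
    if len(a) == len(b):
        return 0
    return -1 if len(a) < len(b) else 1


print(lexicographic_compare("10", "2"))      # "1" is less than "2", so "10" sorts first
print(lexicographic_compare("abc", "abcd"))  # identical up to the end of the shorter string
```

Both calls return -1: the first because of the very first character, the second because the shorter string wins the tie.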
Compare only the number parts
Suppose you have a list of IP addresses to sort. It has to compare these two:
These strings won’t convert to numbers, but we obviously want 126.96.36.199 to come first. A lexicographical comparison won’t do that. The program sees both of these strings as number-dot-number-dot-number-dot-number. Since the non-number parts are identical, it compares the number parts numerically starting with the first. Once it gets to comparing 2 and 10, the one with 2 comes first.
This also works nicely on things like street addresses (423 Sesame St. vs. 1234 Sesame St.) and heights in the American system (5′3″ vs. 5′10″).
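Putting the three steps of the flow chart together, a comparison function might look roughly like the Python sketch below. This is a guess at the logic, not Modern CSV's implementation. The names `compare`, `three_way`, `NUMERIC` and `DIGITS` are invented for the example. It assumes a "number" is an optionally signed integer or decimal, that the "number parts" are runs of digits, and that the final fallback uses Python's built-in string ordering. The heights are written with plain ASCII quotes, and the dotted strings are made-up values.

```python
import re
from functools import cmp_to_key

# Assumption: a whole-string "number" is an optionally signed integer or decimal.
NUMERIC = re.compile(r"[+-]?\d+(?:\.\d+)?")
# Assumption: the "number parts" inside a longer string are runs of digits.
DIGITS = re.compile(r"\d+")


def three_way(x, y) -> int:
    """Return -1, 0 or 1 depending on whether x is less than, equal to or greater than y."""
    return (x > y) - (x < y)


def compare(a: str, b: str) -> int:
    # Step 1: both strings are numbers, so compare them numerically.
    if NUMERIC.fullmatch(a) and NUMERIC.fullmatch(b):
        return three_way(float(a), float(b))

    # Step 2: the text around the digit runs is identical, so compare only
    # the numbers, left to right.
    nums_a, nums_b = DIGITS.findall(a), DIGITS.findall(b)
    if nums_a and DIGITS.split(a) == DIGITS.split(b):
        result = three_way([int(n) for n in nums_a], [int(n) for n in nums_b])
        if result != 0:
            return result

    # Step 3: otherwise fall back to a straight lexicographical comparison.
    return three_way(a, b)


print(compare("423 Sesame St.", "1234 Sesame St."))  # 423 is less than 1234
print(compare("5'3\"", "5'10\""))                    # 5 ties, then 3 is less than 10
print(compare("10.0.2.1", "10.0.10.1"))              # decided at the third number

addresses = ["1234 Sesame St.", "423 Sesame St.", "9 Sesame St."]
print(sorted(addresses, key=cmp_to_key(compare)))
```

All three `compare` calls return -1, and the addresses come out in house-number order. The sketch makes no attempt to guarantee a consistent order when a column mixes plain numbers with other kinds of text.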
One last thing to note is that capitalized words come before lower-case. If you have the following strings:
it will be sorted as: