Why sort code? How to do it?

11 minute read Published: 2024-04-02

Your code is full of lists. And some of them should be sorted.

Why some code needs sorting

Have a small look at this Rust enum:

/// An error in codesort
#[derive(thiserror::Error, Debug)]
pub enum CsError {
    #[error("You can't specify both --around and --range")]
    RangeAndAround,

    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),

    #[error("Fmt error: {0}")]
    Fmt(#[from] std::fmt::Error), // only happens in debug

    #[error("No sortable range found around line {}", .0+1)]
    NoSortableRangeAround(LineIndex),

    #[error("Invalid range {}..{}", .start+1, .end+1)]
    InvalidRange {
        start: LineIndex,
        end: LineIndex,
    },

    #[error("Provided range not sortable (lang: {0:?})")]
    RangeNotSortable(Language),

    #[error("Unexpected closing brace: {0}")]
    UnexpectedClosingBrace(char),

    #[error("Unclosed char literal at line {}", .0+1)]
    UnclosedCharLiteral(LineIndex),

    #[error("Provided input not balanced")]
    InputNotBalanced,
}

It's a common enum, not especially big (because it must fit this blog post).

If this weren't code, the list would be alphabetically sorted.

Alphabetical order is the universal solution when there's no obvious natural ordering. Alpha sort works because we're trained not only to search those lists, but also to see them. Your eyes easily scan them, see what's present and what isn't.

They feel more satisfying, easier to grasp, and less straining.

It's important to note that it's not always the best order. There are often inherent properties which call for a different order, or for some grouping.

But when there's not, such a list should be alpha sorted, so that you more easily maintain it.

Why isn't this ordering more often used in code?

Because it's not so easy to do.

There's no shortcut which lets you have all those items alphabetically sorted, and you can't just sort lines of text or everything would be broken.

Lists everywhere

Now, let's take a wider look: you don't want just to sort the enum variants.

When you have a big enum, you often also have a big match defining what to do for each variant.

And maybe a big list of impl From not far away.

In fact, your code is full of lists:

Most of them should not be sorted, either because they're short enough or because there's a better ordering. Or because it would change the behavior, or be less efficient. But there are some that you do want to sort.

You may notice that I didn't mention lists which should always be sorted. Like imports or modules. Because your formatter and linter already deal with them. What's interesting to me is the in-between, all the lists you decide to sort or not, lists which often starts unordered and end up in need of sorting.

It would be so much easier, when editing the code, if you could ask the IDE to sort the items of the list in which you have your caret...

What do we precisely mean with sorting

Here's an enum with 2 items:

enum MyEnum {
    /// This item is simple
    SomeSimpleItem,
    /// This item is more complex
    #[serde(default)]
    ComplexOne {
        start: LineIndex,
        end: LineIndex,
    },
}

This is enough to explain what we want:

Comments and annotations must stick with code items, blocks must not be broken. Here items end with a comma but depending on the kind of list and the language, the end may be different.

Comments and annotations must not be part of the sorting key. Here, sorting keys are SomeSimpleItem and ComplexOne.

And there's another job to do if we want to be able to tell our IDE "sort the list here": we must determine where the list starts and ends, so that it doesn't try to sort the enum MyEnum line. What we want here is to sort the siblings of the current line, and they're delimited by the external curly braces.

If you look at the first illustration, you may notice we also want to keep the same kind of spacing between items (in this example, this was a blank line). There's not always spacing in a list but when there is, sorting must maintain it.

The task appears quite easy, at least to the human eye.

You don't even have to know Rust to sort such a list. And that's good, because we don't want a solution that wouldn't work when coding in another language.

The Sort operations we want in our IDE

In practice, 2 operations appear necessary.

Sort the "current list"

This is probably the most common need: sort the list of items you're in.

More precisely, the steps are

I call this operation "sort around cursor"

Sort a selected range

Sometimes, you don't want to select all the siblings.

For example there's one that you may want to keep at the end.

Or you just want to sort a few functions while keeping them together.

In such case, what you want is: sort the list of items forming the selection.

The steps here are

The CodeSort solution

In order to have those two commands in my IDE, I built CodeSort.

Here's a before/after of CodeSort applied to the enum we previously saw:

errors

CodeSort is both a library and a small program.

As a program, its input is either stdin, or a file.

And it can sort either around a given line, a given range, or the whole input.

Sorting the whole input is useful for binding a command to replacing the current selection in your IDE. Of course it means the input isn't always some syntactically valid file (you can't have a match arm at the start of your file), but that's not really a problem. After all, a human is able to sort a list without looking at the whole file.

There are some language specific features to consider: Rust has raw strings, JavaScript has single-quoted strings, annotations in Java start with a @ while they're like #[...] in Rust, etc. So there are also parameters to recognize or specify the language.

This makes it easy to bind keys to CodeSort. The README explains how to integrate CodeSort in vim/neovim and IntelliJ.

How CodeSort works

CodeSort is line-oriented. To ensure reliability, it doesn't write output code from a CST or another tree representation, it only swaps input lines.

There are 2 basic objects:

CodeSort identifies and qualifies the lines of the input, computes the range to sort, extracts the blocks which are the items of the list (those blocks are instances of LocList too), sorts those items in place, then writes the list to the output.

A Loc contains exactly what's in the input's line, including the CRLF or LF at end.

CodeSort doesn't care if your code is inconsistently indented or has a mix of CRLF and LF, or whatever: the output is guaranteed to be made of the input with just some lines swapped.

To know what's an annotation, to recognize where the item starts and ends, etc. an essential concept is the depth of anything in the code tree (think AST, CST).

Here's a major axiom: the depth just before the first character of the first line of an item is the depth just after the last character of the last line of the item. So, a Loc contains 2 fields: start_depth and end_depth.

To know what's the depth of items, CodeSort scans the input, and keeps track of the braces in a stack (its length is the depth in the syntactic tree).

There are some little details to pay attention to, because you can't track braces without knowing when you're in a string or char literal, or in comments. That's also the main reason why there are several analyzers: even among languages with a C-style syntax, you have different rules for string and char literals.

Different analyzers also can tell whether the code is an annotation and can more finely keep track of what should be part of the sort key.

You need to know what's an annotation first so that they stick with the item they precede. The annotations, blank lines, comments, are part of the consistent balanced block which follows them.

The sort_key is built when the input is scanned. It includes everything which isn't annotation, spacing, or comments, even literals.

Is that all?

Well, in fact, that's not really the end.

It works for almost everything but there are exceptions.

Here's an example, with 2 items:

impl<T> Foo for Blee<T>
where
    T: Copy,
{
    // ...
}

impl<T> Foo for Baz<T>
where
    T: Copy,
{
    // ...
}

If you use only what precedes, you break items just after the commas.

It turns out you can't sort everything with just the depth computed from the braces.

The rule that must be added is that an impl contains a block delimited by curly braces (at the same depth).

You have other similar cases, for example a function declaration, in Rust, either contains a curly block at the same depth, or is ended with a semicolon (eg in traits or type declaration).

To deal with such cases, the solution is to keep track, in each LOC, of what is expected after, to ensure we don't cut items.

Does it work now?

Yes, it does, at least in Rust, and here's how I checked that:

Stress testing the analyzer on the rustlang/rust codebase

To prove the analyzer was reliable enough, I made a tiny program. This program looks at all rust files in the rustlang/rust codebase and uses the codesort library to sort the variants of all enums whose annotations don't contain repr, serde, PartialOrd or Ord.

Here's the command I ran:

cargo run --release --example sort-all-enums -- ../rust/

The program outputs all enums and ends with this report:

Done in 2.371s

I analyzed 29373 files
28618 files were OK but I encountered 388 empty files, 292 incomplete files, 75 invalid files
I excluded 668 enums whose annotation contained excluding keywords
I sorted 4273 enums in 2192 files

And it works: Analyzing and sorting are safe enough so that x.py runs successfully after.

The "empty" files contain only blank lines and comments. It turns out there are many such files.

The incomplete files are files which don't end in a satisfying way for my analyzer so it doesn't want to sort them. It seems they all have weird macros which break its assumptions. If you ask codesort to sort a piece of code containing such strange construct, it will refuse to do so, which is kind of a failure, but it won't mess your code. You're very unlikely to want to sort such code anyway.

The invalid files are totally broken files in test, test-data and tests directories. I think all of them are purposely invalid in order to test the parser.

If you want to see the list of excluded files, run

cargo run --release --example sort-all-enums -- ../rust/

About the excluded enums:

The variants in those excluded enums should not be sorted. None of the 4273 enums which were sorted was badly sorted.

CodeSort can thus be considered reliable enough for Rust.

What about other languages

I gave a lot more attention to the Rust analyzer than to other languages.

JavaScript, Java, C, Zig, are supported but way less tested.

If you're interested in codesort for any language and it doesn't work as well as you'd wished, either make an issue with test cases or, better, come to miaou for a chat.