Unicode is hard: http://t.co/5MD6zLofXU
Related Posts
I often hear people say that Go is hard to search for online (hence "Golang"), but the vast majority of language names are common words too. Names with punctuation (C++, C#) are hard to search for as well.
Is this a big problem in practice? "Perl" isn't a dictionary word, but it's an exception.
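In LSP, a position is represented as a zero-based line number and a character offset within that line. The offset is counted in UTF-16 code units by default; since 3.17, client and server can negotiate UTF-8 or UTF-32 instead: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#position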
This is pretty elegant. Even if client and server disagree about how to count the offset, you'll still get the correct line, and the editor already knows the line number so it's cheap to compute.
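To make the unit question concrete, here is a minimal Python sketch (the helper name `column_offsets` is my own, not anything from the LSP spec) that measures the same column in the three encodings 3.17 lets you negotiate. It assumes the line is already decoded into a Python `str`, where indexing is by code point.

```python
def column_offsets(line: str, index: int) -> dict:
    """Column of line[index], measured in each negotiable LSP position encoding.

    `index` is a Python str index, i.e. a count of Unicode code points.
    """
    prefix = line[:index]
    return {
        "utf-8": len(prefix.encode("utf-8")),            # bytes
        "utf-16": len(prefix.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(prefix),                           # code points
    }


line = "a😀b"
print(column_offsets(line, line.index("b")))
# {'utf-8': 5, 'utf-16': 3, 'utf-32': 2}
```

Same character, three different column numbers. The line number is identical in all three cases, which is the part that survives a mismatch.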
On the challenge of writing accurate source spans on Unicode source code: https://reedmullanix.com/posts/unicode-source-spans.html
Also (see footnotes) a fair number of LSP clients assume UTF-8 despite early versions of LSP mandating UTF-16!
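A rough sketch of what a client has to do to honour a UTF-16 offset, assuming it holds the line as a Python `str`. The function name `utf16_offset_to_index` is hypothetical, and clamping an offset that lands inside a surrogate pair to the end of that character is my own choice here, not something I'm claiming the spec requires.

```python
def utf16_offset_to_index(line: str, utf16_offset: int) -> int:
    """Convert a UTF-16 code-unit offset into a Python str index (code points)."""
    units = 0
    for index, ch in enumerate(line):
        if units >= utf16_offset:
            return index
        # Characters outside the Basic Multilingual Plane take two UTF-16 units.
        units += 2 if ord(ch) > 0xFFFF else 1
    # Offsets past the end of the line fall back to the line length.
    return len(line)


line = "a😀b"
print(utf16_offset_to_index(line, 3))  # 2, the index of "b"
```

A client that instead reads that 3 as a UTF-8 byte offset lands in the middle of the four-byte emoji, so the bug only shows up on lines containing non-ASCII text.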