Quirks of parsing SSH configs — Store Halfword Byte-Reverse Indexed

Introduction

I've been using the VSCodium Open Remote - SSH extension recently to great results. I can treat everything as a single environment, without any worry about syncing between my local development files and the remote. This is very different to mounting the remote as a network drive and opening a local instance of VSCodium on it: in addition to crippling latency on every action, a locally mounted drive doesn't bring the build context that tools like clangd require (e.g., system headers).

Instead, the remote extension runs a server on the remote that performs most actions, and the local VSCodium instance acts as a client that buffers and caches data seamlessly, so the experience is nearly as good as developing locally.

For example, a project-wide file search on a network drive is unusably slow because every file and directory read requires a round trip back to the remote, and the latency is just too large to finish getting results back in a reasonable time. But with the client-server approach, the client just sends the search request to the server for it to fulfil, and all the server has to do is send the matches back. This eliminates nearly all the latency effects, except for the initial request and receiving any results.

However, there has been one issue with using this for everything: the extension failed to connect when I wasn't on the same network as the host machine. So I wasn't able to use it when working from home over a VPN. In this post we find out why this happened, and in the process look at some of the weird quirks of parsing an SSH config.

The issue

As above, I wasn't able to connect to my remote machines when working from home. The extension would abort with the following error:

[Error  - 00:23:10.592] Error resolving authority
Error: getaddrinfo ENOTFOUND remotename.ozlabs.ibm.com
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:109:26)

So it's a DNS issue. This would make sense, as the remote machine is not exposed to the internet, and must instead be accessed through a proxy. What's weird is that the integrated terminal in VSCodium has no problem connecting to the remote. So the extension seems to be doing something different than just a plain SSH connection.

You might think that the extension is not reading the SSH config. But the extension panel lists all the host aliases I've declared in the config, so it's clearly aware of the config at least. Possibly it doesn't understand the proxy config correctly? If it was trying to connect directly from the host, it would make sense to fail a DNS lookup.

Investigating

Enough theorising, time to debug the extension as it tries to connect.

From the error above, the string "Error resolving authority" looks like something I can search for. This takes me to the catch case for a large try-catch block. It could be annoying to narrow down which part of the block throws the exception, but fortunately debugging is as easy as installing the dependencies and running the pre-configured 'Extension' debug target. This opens a new window with the local copy of the extension active, and I can debug it in the original window.

In this block, there is a conditional statement on whether the ProxyJump field is present in the config. This is a good place to break on and see what the computed config looks like. If it doesn't find a proxy then of course it's going to run everything on the host.

And indeed, it doesn't think there is a proxy. This is progress, but why does the extension's view of the config not match up with what SSH does? After all, invoking SSH directly connects properly. Tracing back the source of the config in the extension, it ultimately comes from manually reading in and parsing the SSH config. When resolving the host argument it manually computes the config as per ssh_config(5). Yet somewhere it makes a mistake, because it doesn't include the ProxyJump field.

Parsing SSH config

To get to the bottom of this, we need to know the rules behind parsing SSH configs. The ssh_config(5) manpage does a pretty decent job of explaining this, but I'm going to go over the relevant information here. I reckon most people have a vague idea of how it works, and can write enough to meet their needs, but have never looked deeper into the actual rules behind how SSH parses the config.

For starters, the config is parsed line by line. Leading whitespace (i.e., indentation) is ignored. So, while indentation makes it look like you are configuring properties for a particular host, this isn't quite correct. Instead, the Host and Match lines are special statements that enable or disable all subsequent lines until the next Host or Match.

1. There is no backtracking; previous conditions and lines are not re-evaluated after learning more about the config later on.
2. When a config line is seen, and is active thanks to the most recent Host or Match succeeding, its value is selected if it is the first of that config to be selected. So the earliest place a value is set takes priority; this may be a little counterintuitive if you are used to having the latest value be picked, like enable/disable command line flags tend to work.
3. When HostName is set, it replaces the host value in Match matches. It is also used as the Host value during a final pass (if requested).
4. The last behaviour of interest is the Match final rule. There are several conditions a Match statement can have, and the final rule says make this active on the final pass over the config.

Wait, final pass? Multiple passes? Yes. If final is a condition on a Match, SSH will do another pass over the entire config, following all the rules above. Except this time, all the configs we read on the first pass are still set (and can't be changed). But all the Host and Match conditions are re-evaluated, allowing other configs to potentially be set. I guess that means rule (1) ought to have a big asterisk next to it.

Together, these rules can lead to some quirky behaviours. Consider the following config:

Match host="*.ozlabs.ibm.com"
    ProxyJump proxy

Host example
    HostName example.ozlabs.ibm.com

If I run ssh example on the command line, will it use the proxy?

By rule (1), no. When testing the first Match host condition, our host value is currently example. It is not until we reach the HostName config that we start using example.ozlabs.ibm.com for these matches.

But by rule (4), the answer turns into maybe. If we end up doing a second pass over the config thanks to a Match final that could be anywhere else, we would now be matching example.ozlabs.ibm.com against the first line on the second go around. This will pass, and, since nothing has set ProxyJump yet, we would gain the proxy.
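To make this concrete, here is a minimal Python sketch of a resolver that follows only the four rules described earlier. It is a toy, not OpenSSH's parser: it only understands Host, Match host, Match originalhost, Match all and Match final, it uses Python's fnmatch for patterns, and it ignores Include, negated patterns, canonicalisation and everything else. The names resolve, CONFIG and EXTRA are made up for illustration, and EXTRA is just a stand-in for whatever other part of the config happens to contain a Match final.

```python
import fnmatch
import shlex


def _criteria(tokens):
    """Yield (name, argument) pairs from the tokens following 'Match'."""
    it = iter(tokens)
    for tok in it:
        name, eq, arg = tok.partition("=")
        name = name.lower()
        if name in ("final", "all"):
            yield name, None            # these criteria take no argument
        elif eq:
            yield name, arg             # written as name=argument
        else:
            yield name, next(it, "")    # written as name argument


def resolve(config_text, host_arg):
    """Toy resolver: returns {lowercased keyword: first selected value}."""
    options = {}
    want_final = False
    host = host_arg
    for final_pass in (False, True):
        if final_pass:
            if not want_final:
                break                   # nobody asked for a second pass
            # Rule 3: HostName becomes the Host value on the final pass.
            host = options.get("hostname", host_arg)
        active = True                   # lines before any Host/Match apply
        for raw in config_text.splitlines():
            tokens = shlex.split(raw, comments=True)  # drops indentation
            if not tokens:
                continue
            keyword, eq, rest = tokens[0].partition("=")
            args = ([rest] if eq else []) + tokens[1:]
            keyword = keyword.lower()
            if keyword == "host":
                active = any(fnmatch.fnmatchcase(host, p) for p in args)
            elif keyword == "match":
                active = True
                for name, arg in _criteria(args):
                    if name == "final":
                        want_final = True   # rule 4: request another pass
                        ok = final_pass
                    elif name == "all":
                        ok = True
                    elif name == "host":
                        # Rule 3: HostName (if set so far) replaces the host.
                        current = options.get("hostname", host)
                        ok = fnmatch.fnmatchcase(current, arg)
                    elif name == "originalhost":
                        ok = fnmatch.fnmatchcase(host_arg, arg)
                    else:
                        ok = False          # unsupported in this sketch
                    active = active and ok
            elif active:
                # Rule 2: the first selected value wins; later ones are ignored.
                options.setdefault(keyword, " ".join(args))
    return options


CONFIG = """
Match host="*.ozlabs.ibm.com"
    ProxyJump proxy

Host example
    HostName example.ozlabs.ibm.com
"""

# Hypothetical extra block; only the 'Match final' part matters here.
EXTRA = """
Match final all
    ServerAliveInterval 60
"""

if __name__ == "__main__":
    print(resolve(CONFIG, "example"))          # single pass: no proxyjump key
    print(resolve(CONFIG + EXTRA, "example"))  # two passes: proxyjump is set
```

With only CONFIG, the Match host line is tested against example and fails, so ProxyJump is never selected. Once EXTRA is appended, the Match final triggers a second pass in which the same line is tested against example.ozlabs.ibm.com, and the proxy appears.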

You may think, yes, but we don't have a Match final in that example. But if you thought that, then you forgot about the system config.

The system config is effectively appended to the user config, to allow any system-wide settings. Most of the time this isn't an issue because of the first-come-first-served rule with config matches (rule 2). But if the system config includes a Match final, it will trigger the entire config to be re-parsed, including the user section. And it so happens that, at least on Fedora with the openssh-clients package installed, the system config does contain a Match final (see /etc/ssh/ssh_config.d).

But wait, there's more! If we want to specify a custom SSH config file, then we can use -F path/to/config in the command line. But this disables loading a system config, so we would no longer get the proxy!

To sum up, for the above config:

ssh example doesn't have a proxy
...unless a system config contains Match final
...but invoking it as ssh -F ~/.ssh/config example definitely won't have the proxy
...but if a subprocess invokes ssh example while trying to resolve another host, it'll probably not add the -F ~/.ssh/config, so we might get the proxy again (in the child process).

Wait, how did that last one slip in? Well, unlike environment variables, it's a lot harder for processes to propagate command line flags correctly. If resolving the config involves running a script that itself tries to run SSH, chances are the -F flag won't be propagated and you'll see some weird behaviour.

I swear that's all for now, you've probably learned more about SSH configs than you will ever need to care about.

Back to VSCodium

Alright, armed now with this knowledge on SSH config parsing, we can work out what's going on with the extension. It ends up being a simple issue: it doesn't apply rules (3) and (4), so all Host matches are done against the original host name.

In my case, there are several machines behind the proxy, but they all share a common suffix, so I had a Host *.ozlabs.ibm.com rule to apply the proxy. I also use aliases to refer to the machines without the .ozlabs.ibm.com suffix, so failing to follow rule (3) led to the situation where the extension didn't think there was a proxy.

However, even if this were to be fixed, it still doesn't respect rule (4), or most complex match logic in general. If the hostname bug is fixed then my setup would work, but it's less than ideal to keep playing whack-a-mole with parsing bugs. It would be a lot easier if there was a way to just ask SSH for the config that a given host name resolves to.

Enter ssh -G. The -G flag asks SSH to dump the complete resolved config, without actually opening the connection (it may execute arbitrary code while resolving the config however!). So to fix the extension once and for all, we could swap the manual parser to just invoking ssh -G example, and parsing the output as the final config. No Host or Match or HostName or Match final quirks to worry about.
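As a rough illustration of how little parsing is left once SSH does the resolving, here is a Python sketch of the idea. This is not the code from the extension or the pull request (the extension is not written in Python); the function names parse_ssh_g and resolve_with_ssh are hypothetical. It assumes the ssh -G output is one lowercase keyword per line followed by a space and its value, and it keeps a list per keyword because some keywords (identityfile, for example) can appear more than once.

```python
import subprocess


def parse_ssh_g(output):
    """Turn `ssh -G` output into {keyword: [values, ...]}."""
    config = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        keyword, _, value = line.partition(" ")
        config.setdefault(keyword.lower(), []).append(value)
    return config


def resolve_with_ssh(host, config_file=None):
    """Ask the local ssh binary for the resolved config of `host`.

    Resolving may run commands named in the config, so only call this
    with configs you trust.
    """
    if host.startswith("-"):
        # Refuse anything ssh could mistake for a command line option.
        raise ValueError(f"refusing suspicious host name: {host!r}")
    argv = ["ssh", "-G"]
    if config_file is not None:
        # Note that -F also stops ssh from loading the system config.
        argv += ["-F", config_file]
    argv.append(host)
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return parse_ssh_g(result.stdout)
```

A caller would then check something like resolve_with_ssh("example").get("proxyjump") instead of re-implementing the Host and Match rules itself.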

Sure enough, if we replace the config backend with this 'native' resolver, we can connect to all the machines with no problem. Hopefully the pull request to add this support will get accepted, and I can stop running my locally patched copy of the extension.

In general, I'd suggest avoiding any dependency on a second pass being done on the config. Resolve your aliases early, so that the rest of your matches work against the full hostname. If you later need to match against the name passed in the command line, you can use Match originalhost=example. The example above should always be written as

Host example
    HostName example.ozlabs.ibm.com

Match host="*.ozlabs.ibm.com"
    ProxyJump proxy

even if the reversed order might appear to work thanks to the weird interactions described above. And after learning these parser quirks, I find the idea of using Host match statements unreliable; that they may or may not be run against the HostName value allows for truly strange bugs to appear. Maybe you should remove this uncertainty by starting your config with Match final to at least always be parsed the same way.