author Alan Pearce 2024-06-23 11:00:37 +0200
committer Alan Pearce 2024-06-25 09:03:58 +0200
commit 4f5ca39f582d83ded08a6e1cbf1a195141623b1c
tree 5328c626aecef92beec4b34a44193627dc7d7ec4
parent 8df549bdce3a3fc0e3674768c2e75002673c987d
post: When Tailscale MagicDNS isn’t
-rw-r--r-- content/post/when-tailscale-magicdns-isn't.md 200
1 files changed, 200 insertions, 0 deletions
diff --git a/content/post/when-tailscale-magicdns-isn't.md b/content/post/when-tailscale-magicdns-isn't.md
new file mode 100644
index 0000000..2c611a4
--- /dev/null
+++ b/content/post/when-tailscale-magicdns-isn't.md
@@ -0,0 +1,200 @@
+---
+title: When Tailscale MagicDNS isn’t
+description: Frustrations of a NixOS user
+date: 2024-06-23T10:57:00+02:00
+taxonomies:
+  tags: [nixos]
+---
+
+On a router, I have [dnsmasq](https://dnsmasq.org/doc.html) and [kresd](https://knot-resolver.readthedocs.io/en/stable/) as DNS servers. Dnsmasq is accessible on the LAN interface and forwards queries to kresd, which is accessible on the loopback interface.  This has been working for a long time.
+
+I recently set up [Tailscale](https://tailscale.com/) and was confused as to why [MagicDNS](https://tailscale.com/kb/1081/magicdns) wasn’t working on this one device (I have two other NixOS devices that didn’t have any problems). I’m no stranger to investigating these problems, having used and tinkered with networking on Linux and FreeBSD for many years, and if there’s a problem, [it’s always DNS](https://isitdns.com/).
+
+My first look at `/etc/resolv.conf` suggested that things should be fine:
+
+```
+# Generated by resolvconf
+search my-network.ts.net
+nameserver 127.0.0.1
+nameserver ::1
+```
+
+:::{.aside}
+‘Why is this even generated by `resolvconf`?’, I ask myself: this is a router with a static networking configuration and custom upstream nameservers. I’ll investigate that later, I tell myself.
+:::
+
+Two things eventually made me realise that it should be using 100.100.100.100 as its nameserver:
+
+1. A working machine’s `/etc/resolv.conf` contains:
+
+	```
+	# Generated by resolvconf
+	search my-network.ts.net
+	nameserver 100.100.100.100
+	options edns0
+	```
+
+2. The Tailscale dashboard gives me a *hint* under the nameservers section:
+> ### Nameservers
+> Set the nameservers used by devices on your network to resolve DNS queries. [Learn more ↗](https://tailscale.com/kb/1054/dns)
+>
+> *my-network*.ts.net ✨MagicDNS\
+> 100.100.100.100
+
+The [linked documentation (DNS in Tailscale)](https://tailscale.com/kb/1054/dns)  doesn’t even mention 100.100.100.100, nor does the documentation for [MagicDNS](https://tailscale.com/kb/1081/magicdns). It is explained as [part of a blog post under the heading ‘how MagicDNS works’](https://tailscale.com/blog/2021-09-private-dns-with-magicdns#how-magicdns-works), but that’s not the first place I’d look.
+
+This led me to try to find what might be responsible for the `nameserver 127.0.0.1` setting:
+- I checked for systemd-resolved, but it wasn’t that (the nameserver would have been 127.0.0.53 if it were).
+- I didn’t think it was anything to do with dnsmasq because that’s configured to use the LAN interface, not the loopback.
+- I didn’t think it was kresd either, since another device uses kresd, but does not have the problem. kresd is not listening on a loopback address on the default port 53 on either machine, meaning that if it were doing this, DNS resolution would have been broken on both machines for a long time.
+
+Eventually I stumbled upon something that worked: setting `networking.resolvconf.useLocalResolver = false`. I then started to investigate in order to open an issue with NixOS, but found things got more confusing and I forgot why I was shaving a yak[^7].
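+
+As a minimal sketch (the option comes from the paragraph above; placing it in a standalone module is only an assumption for illustration), the workaround looks like this:
+
+```nix
+{ ... }:
+{
+  # Stop resolvconf from writing 127.0.0.1 (and ::1) as the system nameservers.
+  # A plain assignment should take precedence over a module that only sets
+  # this option as a default.
+  networking.resolvconf.useLocalResolver = false;
+}
+```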
+
+---
+
+### Unexpected behaviours
+
+1. The NixOS kresd module [blindly sets `networking.resolvconf.useLocalResolver` to default to `true`](https://github.com/NixOS/nixpkgs/blob/bfb7a882678e518398ce9a31a881538679f6f092/nixos/modules/services/networking/kresd.nix#L113) because [someone ran into resolver loops](https://github.com/NixOS/nixpkgs/pull/124391) and this was accepted on the grounds that it’s [‘good to be consistent’ (with pdns-recursor, for example)](https://github.com/NixOS/nixpkgs/pull/124391#pullrequestreview-667950510).
+    - I consider this ["spooky action at a distance"](https://en.wikipedia.org/wiki/Action_at_a_distance#%22Spooky_action_at_a_distance%22).  I do not think it should do this, but I can see how it *could* be helpful. In enough cases, though? I’m not sure.
+    - In my case, kresd was *never even listening* on localhost port 53, meaning that this default setting would have led to a broken DNS setup, which would at least have led me to investigate the right thing at that time.
+			:::{.aside}
+			Why didn’t it, then?
+			:::
+
+2. The NixOS dnsmasq module sets [`networking.resolvconf.useLocalResolver = true` if `services.dnsmasq.resolveLocalQueries = true`](https://github.com/NixOS/nixpkgs/blob/7780e5160e011b39019797a4c4b1a4babc80d1bf/nixos/modules/services/networking/dnsmasq.nix#L150-L151). This is at least less spooky and distant, but still faulty.
+
+	1. I had indeed set `services.dnsmasq.resolveLocalQueries = true`, because I would like to be able to `ping foo.lan` on the router and it made that possible.
+	2. I had probably assumed, wrongly, that this setting changes the system nameserver to match the listen address of dnsmasq, rather than setting `networking.resolvconf.useLocalResolver`. Setting `resolveLocalQueries = false` was one of the first things I tried and, unexpectedly, it didn’t make a difference (because kresd was also setting `useLocalResolver`).
+		:::{.aside}
+		If that weren’t enough, [`services.dnsmasq.resolveLocalQueries` sets `networking.nameservers` to include 127.0.0.1](https://github.com/NixOS/nixpkgs/blob/7780e5160e011b39019797a4c4b1a4babc80d1bf/nixos/modules/services/networking/dnsmasq.nix#L138-L139), which makes things more confusing, and is wrong because I haven’t configured dnsmasq to listen on this address.
+		:::
+	3. dnsmasq’s `--interface=<interface name>` setting doesn’t work either as I expected or as it is [documented](https://dnsmasq.org/docs/dnsmasq-man.html).
+
+		> **-i, --interface=\<interface name\>**\
+		> Listen only on the specified interface(s).
+
+		 I had this set to listen _only_ on the LAN interface and I could see it nevertheless listening on `*:53` in `lsof`, rather than the addresses of the LAN interface.
+
+		1. This behaviour *is* mentioned, under a different option, `--bind-interfaces`:
+
+			> On systems which support it, dnsmasq binds the wildcard address, even when it is listening on only some interfaces. It then discards requests that it shouldn’t reply to. This has the advantage of working even when interfaces come and go and change address. This option forces dnsmasq to really bind only the interfaces it is listening on. *About the only time when this is useful is when running another nameserver (or another instance of dnsmasq) on the same machine*.
+
+			I had enabled this setting when I set up kresd on the machine, because the last sentence applies to my case and interfaces are not ‘coming and going’[^4]. The naming is weird as it would suggest it works like `--interface`; namely that `--bind-interfaces=<interface name>` would be a reasonable use. Alas, it is a boolean flag and takes no value.
+
+		2. Even when `--interface` and `--bind-interfaces` *are* set, dnsmasq decides to ignore my intent and explicitly listen on loopback. This is weird, but, guess what, documented back under `--interface`:
+
+			> Dnsmasq automatically adds the loopback (local) interface to the list of interfaces to use when the **--interface** option is used.
+
+		:::{.aside}
+		This explains why enabling kresd didn’t break things before; dnsmasq was listening on 127.0.0.1, even though I thought I had told it not to.
+		:::
+
+3. The option name `networking.resolvconf.useLocalResolver` and its [documentation](https://search.nixos.org/options?channel=unstable&show=networking.resolvconf.useLocalResolver&from=0&size=50&sort=relevance&type=packages&query=uselocalresolver) are unclear.
+
+	> Use local DNS server for resolving.
+
+	This sentence adds nothing that isn’t already in the option name. Local to what? I’m on a *Local* Area Network and wish to use a DNS server on the LAN, should I enable this? No, that’s not what this option is for. Looking at [its usage](https://github.com/NixOS/nixpkgs/blob/bfb7a882678e518398ce9a31a881538679f6f092/nixos/modules/config/resolvconf.nix#L29-L32), it appears to mean local to *this host*. (I deliberately avoided combining the words ‘local’ and ‘host’, for reasons below.)
+	- Even if it had a clearer name like `useLocalhostResolver`, the inaccuracy would remain, as `getent hosts localhost` and `getent ahosts localhost` both prefer ::1 over 127.0.0.1, not the other way around.
+	- What would have happened if I had enabled this setting with a server listening on ::1 and *not* 127.0.0.1? It would work, but not well: applications would first try the 127.0.0.1 entry, where no nameserver is listening, before reaching the one on ::1.
+	- It might be confusing to reference a hostname when talking about nameserver reachability, since an IP address is required to reach a nameserver and a nameserver <del>is</del> <ins>could be</ins> required to resolve a hostname. Setting `nameserver localhost` in `/etc/resolv.conf` won’t work, I _thought_.
+
+		- I tried it. Neovim didn’t highlight it as an error[^2] and it did **not** break name resolution, presumably because [`localhost` is added to `/etc/hosts`](https://github.com/NixOS/nixpkgs/blob/bfb7a882678e518398ce9a31a881538679f6f092/nixos/modules/config/networking.nix#L179-L182) *and* is a [special case in `nss-myhostname`](https://www.man7.org/linux/man-pages/man8/nss-myhostname.8.html), the existence of which I might not have known about had I not previously set up [multicast DNS](https://en.wikipedia.org/wiki/Multicast_DNS) to resolve `.local` hostnames in `/etc/nsswitch.conf`:
+
+			> The hostnames "localhost" and "localhost.localdomain" (as
+					well as any hostname ending in ".localhost" or
+					".localhost.localdomain") are resolved to the IP addresses
+					127.0.0.1 and ::1.
+
+			Why is localhost added to `/etc/hosts` if it’s handled by `nss-myhostname` in `/etc/nsswitch.conf`? The man page of `nsswitch.conf` ([mirror](https://www.man7.org/linux/man-pages/man5/nsswitch.conf.5.html)) gives a small clue:
+
+        > `/etc/nsswitch.conf` is used by the GNU C Library and certain
+        other applications[^5]
+
+      Why isn’t even the choice of a name resolution mechanism for *localhost* unified in 2024?
+
+		- I tried on macOS and iOS. Both allow setting a named nameserver in the GUI, which surprised me. I remember that on earlier versions of Windows (at least on XP, Vista and 7) there were special input boxes that exclusively allowed an IPv4 address, although there was a separate dialogue to input IPv6 addresses. I wonder if that allows hostnames or not, but I don’t have Windows running at the moment to check.
+
+4. Setting `networking.resolvconf.enable = false` doesn’t appear to do.. well.. anything.
+	- The *generated by resolvconf* comment in `/etc/resolv.conf` remains, as do the previous nameserver entries.
+	- `/run/current-system/sw/bin/resolvconf` is not removed.
+	- `man resolvconf` has content (because it’s a link to `man resolvectl`, which is part of `systemd-resolved`, which isn’t even enabled on this system).
+	- I was surprised that `/etc/resolv.conf` was writable at all, as <del>all</del> <ins>many</ins> files under `/etc/` are symlinks to their namesakes under `/etc/static`, which itself is a symlink to a folder in the nix store which contains… more symlinks. Here I use `resolvconf.conf` as an example, i.e. the configuration of `resolvconf`, the program that manages `resolv.conf`.
+
+		```
+		lrwxrwxrwx 1 root root  27 May 26 23:14 /etc/resolvconf.conf -> /etc/static/resolvconf.conf
+		lrwxrwxrwx 1 root root  51 May 26 23:14 /etc/static -> /nix/store/pm0yi93ak5kcvfmidv5lckzfixrh2gck-etc/etc/
+		lrwxrwxrwx 4 root root  63 Jan  1  1970 /nix/store/pm0yi93ak5kcvfmidv5lckzfixrh2gck-etc/etc/resolvconf.conf -> /nix/store/kf0lrhiqqqrc6w96h4qm0sysffnccx2d-etc-resolvconf.conf
+		-r--r--r-- 3 root root 518 Jan  1  1970 /nix/store/kf0lrhiqqqrc6w96h4qm0sysffnccx2d-etc-resolvconf.conf
+		```
+
+		That’s too much indirection for me. If [‘we can solve any problem by introducing an extra level of indirection’](https://en.wikipedia.org/wiki/Fundamental_theorem_of_software_engineering), this suggests that *at least three* problems have been solved here.
+
+		Isn’t it odd that the symlinks are writable to all? I know that `/nix/store` is a read-only filesystem, but it looks odd. Upon searching the web for information, I was directed to the [coreutils `chmod` documentation](https://www.gnu.org/software/coreutils/manual/html_node/chmod-invocation.html):
+
+		> `chmod`  doesn’t change the permissions of symbolic links; the `chmod` system call cannot change their permissions on most systems, and most systems ignore permissions of symbolic links
+
+		_Most_ systems? What does this mean? Is it based on the filesystem used? I would assume that it doesn’t mean ‘this is the case on Linux’, given the reference to the system call of the same name and that Linux was not mentioned. The man page for the system call mentions:
+
+		> *flags* can either be 0, or include the following flag:
+    > \
+    >   **AT_SYMLINK_NOFOLLOW**\
+    >          If pathname is a symbolic link, do not dereference it:
+    >          instead operate on the link itself.  This flag is not
+    >          currently implemented.
+
+	- The default value of `networking.resolvconf.enable`  is [`!(config.environment.etc ? "resolv.conf")`](https://search.nixos.org/options?channel=unstable&show=networking.resolvconf.enable&from=0&size=50&sort=relevance&type=packages&query=networking.resolvconf.enable), which I understand as _if the content of `resolv.conf` isn’t otherwise assigned_.
+	- There are values in `networking.nameservers`, but these aren’t used as content for `resolv.conf`, which I thought would have been reasonable.
+
+	What *is* the point of `networking.resolvconf.enable`, then? And what about `networking.nameservers`? Where do its values even go? [^6]
+
+5. This issue pushed me to drop flakes on the router so that I could use `nixos-option`, as [it does not support flakes](https://github.com/NixOS/nixpkgs/issues/97855)[^3].
+
+6. The Tailscale module [adds `resolvconf` to its path conditionally](https://github.com/NixOS/nixpkgs/blob/7780e5160e011b39019797a4c4b1a4babc80d1bf/nixos/modules/services/networking/tailscale.nix#L87). The [commit adding this condition](https://github.com/NixOS/nixpkgs/commit/922351ec866dcfe1dca4d190bfd3c360933e5cd0) explains that ‘trying to use [resolvconf] always fails because
+`/etc/resolvconf.conf` contains an `exit 1`’, which sounds perfectly reasonable.
+	- If `resolvconf` weren’t in tailscaled’s path, Tailscale would fall back to overwriting resolv.conf, which I found out about because it is a common enough problem/question to warrant [a heading and its own page](https://tailscale.com/kb/1235/resolv-conf).
+
+		This document is the most concise and informative clarification of my original issue; the last paragraph tells me everything I needed to know:
+
+		> Even if you set `--accept-dns=false`, Tailscale’s MagicDNS server still replies at `100.100.100.100` (or `fd7a:115c:a1e0::53`), as long as MagicDNS is enabled on the tailnet. If you’d like to manually configure your DNS configuration, you can point `*.ts.net` queries at `100.100.100.100`.
+
+		Sadly I didn’t look at this page earlier as Tailscale isn’t the one overwriting `/etc/resolv.conf`: it would have set the nameserver to be `100.100.100.100` in that case. Its behaviour is reasonable as ‘there are [an incredible number of ways](https://tailscale.com/blog/sisyphean-dns-client-linux) to configure DNS on Linux’.
+
+		- This blog post suggests that the upcoming (as of April 2021) Tailscale 1.8 will prefer `systemd-resolved` for configuring the system resolver.
+		- It convinced me that `systemd-resolved` would be the right choice even on a router as the nameserver should depend on the interface. Thanks [Xe](https://xeiaso.net/), I always like your posts!
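+
+To make the quoted advice concrete: one way to ‘point `*.ts.net` queries at `100.100.100.100`’ would be a per-domain upstream in dnsmasq. This is only a sketch, assuming a NixOS release that has the `services.dnsmasq.settings` option and that dnsmasq is the resolver clients talk to; it is not the configuration described in this post.
+
+```nix
+{ ... }:
+{
+  services.dnsmasq.settings.server = [
+    # dnsmasq syntax: /domain/upstream sends queries for that domain and its
+    # subdomains to the given server, here Tailscale's MagicDNS address.
+    "/ts.net/100.100.100.100"
+  ];
+}
+```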
+
+This should be the end of my issues now then, right?
+
+What happens when I enable `systemd-resolved` and disable `resolvconf`? The hilarity continues:
+
+```
+# resolv.conf(5) file generated by tailscale
+# For more info, see https://tailscale.com/s/resolvconf-overwrite
+# DO NOT EDIT THIS FILE BY HAND -- CHANGES WILL BE OVERWRITTEN
+nameserver 100.100.100.100
+search my-network.ts.net lan my-network.ts.net
+```
+
+I expected Tailscale to configure `systemd-resolved` in this scenario rather than overwrite `resolv.conf` (adding the tailnet to the search domains without checking whether it is already present is yet another issue). I suspect a race condition that `tailscaled` won, perhaps because NixOS started both services at the same time, although `tailscaled.service` does have `After=systemd-resolved.service`. Explicitly restarting `systemd-resolved` and _then_ `tailscaled` did the right thing:
+
+```
+# This is /run/systemd/resolve/stub-resolv.conf managed by man:systemd-resolved(8).
+# Do not edit.
+# [...]
+nameserver 127.0.0.53
+options edns0 trust-ad
+search lan my-network.ts.net
+```
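+
+For reference, the ‘enable `systemd-resolved` and disable `resolvconf`’ step could be expressed roughly as below. The post does not show the exact options used, so treat both lines as assumptions.
+
+```nix
+{ ... }:
+{
+  # Run systemd-resolved, which serves the stub resolver on 127.0.0.53.
+  services.resolved.enable = true;
+  # Assumed to be how resolvconf was disabled; see the caveats about this
+  # option in the list above.
+  networking.resolvconf.enable = false;
+}
+```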
+
+DNS can be confusing sometimes!
+
+[^2]: vim’s syntax highlighting in `/etc/resolv.conf` marks `nameserver localhost` as an error, which is neat, but somewhat inaccurate here, as this does not appear to be invalid.
+
+[^3]: Having read through the issue, the title does not appear to be accurate, given that there are workarounds. However, on loading the page and seeing that the issue has been open since 2020 and has a small scroll bar (indicating many comments), it’s easy to assume that it continues to be a problem.
+
+[^4]: Since the router has a dynamic <del>public</del> <ins>CGNAT</ins> IP address, it’s true that the addresses are changing, but that’s not relevant to dnsmasq given that it was never configured to listen on this interface.
+
+[^5]: As a Wikipedia editor would ask, _which_?
+
+[^6]: I [searched nixpkgs on GitHub](https://github.com/search?q=repo%3ANixOS%2Fnixpkgs+networking.nameservers&type=code) and found it amusing that all the results where it is set correctly are *in tests*, but the other results show hardcoded definitions.
+
+[^7]: Whilst checking to see if [yak shaving](https://en.wiktionary.org/wiki/yak_shaving) was the right turn of phrase, Wiktionary suggested [‘when you're up to your neck in alligators, it's hard to remember that your initial objective was to drain the swamp’](https://en.wiktionary.org/wiki/when_you%27re_up_to_your_neck_in_alligators,_it%27s_hard_to_remember_that_your_initial_objective_was_to_drain_the_swamp#English) which would be more fitting, but I first learned of this expression today and don’t think it’s as widely-known.