Recurring Failure Mode: No DNS resolution in clients

I have a recurring failure mode that seems to occur randomly: DNS failure for Pritunl Clients.
Specifically, it’s the DNS server inside the VPN, the one that’s configured for the connection.

My most effective short-term solution is to restart the pritunl service weekly via crontab:
01 12 * * 0 /usr/bin/systemctl restart pritunl.service
This reduces the frequency of interventions and masks the issue but does nothing to prevent it.
I used to encounter this failure mode at least weekly; with the automated restarts, it's more like every other month.
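
A possible refinement (sketch only; the logger tag is arbitrary) would stamp the journal before restarting, so the scheduled restarts are easy to tell apart from manual ones in journalctl later:

# hypothetical variant: write a marker to the journal, then restart
01 12 * * 0 /usr/bin/logger -t pritunl-restart "scheduled weekly restart" && /usr/bin/systemctl restart pritunl.service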

I’d really like to find the root cause for this.
Does anyone have insights to share?
Are there configuration changes I can make for logging that might reveal the issue?

Logs:
When I check the Pritunl logs in the web console (GUI), the only entries are the scheduled crontab restarts, going back weeks; there is nothing in between.

Journal:
I’ve checked for this in the past but haven’t seen errors that appear meaningful.

Pritunl Host: RHEL - Linux 5.4.17-2136.335.4.el8uek.x86_64 #3 SMP Thu Aug 22 12:18:30 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux

Background:

I have a stable Pritunl configuration that works reliably 99% of the time.
I have scripts for all of my client-side configuration, so I'm confident everything is as consistent as it can be.
Every now and then DNS just stops working (POOF) for connected clients.
When this occurs, I can still ping IP addresses within the internal network.
I suspect this is an issue within the Pritunl service on the Pritunl host, but I can't tell why things fail.

I typically encounter this failure mode by:

  1. Connecting to Pritunl
  2. Establishing my split-tunnel routing configuration on my client
  3. Attempting to resolve hostnames of internal hosts, at which point DNS resolution times out

I diagnose this failure mode by:

  1. Performing a client-side DNS query for a known name against the IP address of the DNS server (nslookup foo.mydomain x.x.x.x): the query times out.
  2. Performing the same query from the Pritunl host immediately receives a valid response.

At this point, I know I’m in the DNS failure mode.
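
For reference, the two checks look roughly like this (foo.mydomain and 10.0.0.4 are placeholders for an internal name and the in-VPN DNS server IP):

# from the client, across the tunnel -- times out in the failure mode
nslookup foo.mydomain 10.0.0.4

# the same query run on the Pritunl host -- answers immediately
nslookup foo.mydomain 10.0.0.4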

The only fix I have is to restart the pritunl service or reboot the host.
Again, the logs don’t give me anything actionable.
Help?

Any errors from the pritunl-dns process are sent to the pritunl process stdout, which goes to the journal of the pritunl.service systemd unit. They will not show up in the Pritunl logs in the web console or in the /var/log/pritunl.log file. Run sudo journalctl -u pritunl to check for errors from the DNS process.
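
For example, to scan the last week of unit output for anything DNS-related (adjust the window to taste):

sudo journalctl -u pritunl --since "7 days ago" | grep -i dns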

I have had some test servers run out of memory during automatic DNF updates, and the pritunl-dns process seems to be the first to stop functioning. It will lock up or get stuck at 100% CPU usage, likely from the GC in Go. Verify the server is not running low on memory.
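
Generic checks along these lines (nothing Pritunl-specific) will show current headroom and whether the kernel OOM killer has fired:

# current memory headroom
free -m
# OOM-killer activity in the kernel log
sudo journalctl -k | grep -iE "oom|out of memory"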

For clarification, pritunl-dns is not running in my environment.
The DNS service is provided in Azure, on one of the network routes provided to Pritunl clients.
Does that change anything?

The only running pritunl processes (identified by ps aux | grep -i pritunl) are:

  • /usr/lib/pritunl/usr/bin/python3 /usr/lib/pritunl/usr/bin/pritunl start
  • pritunl-web
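
A quick way to confirm that nothing local is bound to port 53 (a generic check, not Pritunl-specific):

# listening TCP/UDP sockets with owning processes; expect no :53 entries here
sudo ss -lntup | grep ':53 '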

I checked the output of journalctl -u pritunl and see HTTP output since the last restart, but it doesn't contain any DNS-related information at this point. I'll make sure to check again the next time the issue presents itself.

Manually inspecting all available system.journal files with strings, I found some references to chrony and other Python DNS modules, but no obvious errors.
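
The sweep was along these lines (a sketch; journal paths vary by machine ID and rotation):

# dump printable strings from the on-disk journal files and filter for DNS-related text
sudo sh -c 'strings /var/log/journal/*/*.journal | grep -iE "dns|resolv"'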

I’m a bit skeptical that the service(s), as configured, would log DNS faults as-is.
What do you think?

[Unit]
Description=Pritunl Daemon

[Service]
LimitNOFILE=500000
ExecStart=/usr/lib/pritunl/usr/bin/pritunl start
SuccessExitStatus=SIGALRM
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
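
If logging turns out to be the gap, a generic systemd drop-in (created via sudo systemctl edit pritunl.service) can at least guarantee the daemon's stdout/stderr land in the journal. This is usually the default already, and it is not a Pritunl-specific setting:

[Service]
# route stdout/stderr to the journal explicitly so pritunl-dns errors cannot be lost
StandardOutput=journal
StandardError=journal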

Enabling Client DNS Mapping in the server settings will enable pritunl-dns and route DNS queries through the Pritunl server. These issues also occur with the AWS VPC DNS servers, and using pritunl-dns has fixed them there. It's possible it's an issue with access controls or rate limits on the VPC DNS servers.