Clients stuck on authentication, WireGuard, Okta SSO

drhackenbush · October 23, 2023, 9:09pm

software versions:
Pritunl Client: v1.3.3600.11
Pritunl Server: pritunl-1.32.3602.80-1.el8.oraclelinux.x86_64

(server is running on an EC2 instance in AWS, both Pritunl and MongoDB are on the same machine)

PROBLEM:
A few days ago some users reported that they could no longer establish VPN tunnels (we use WireGuard only). During the investigation we figured out that everything worked fine as long as a given user’s previous auth was still valid and cached. The moment a user needed a new auth, they hit the issue.

From the Okta side, everything looked good: successful auth, access granted.
The Pritunl server logs showed no information at all about the user trying to connect.
The Pritunl client hung on the ‘Authenticating’ status for some time, after which the logs showed that auth had failed because of a timeout.

To check whether this had something to do with SSO, I created a local user and unticked ‘SSO’ on one of the servers. This local user was able to connect just fine.

I’m writing this in the past tense, because everything works now. However I’m very worried because I wasn’t able to find the cause and I’m hoping I can get some help here.

How did I ‘fix’ this?

1. Neither restarting individual servers, nor restarting the whole pritunl systemd service, nor even rebooting the machine helped → still not OK (NOK)

In the meantime I noticed that the journal on the machine was flooded with AVC denials like this one (for mongodb):

Oct 23 13:35:02 pritunl-server kernel: audit: type=1400 audit(1698068102.000:11080): avc:  denied  { search } for  pid=729 comm="ftdc" name="fs" dev="proc" ino=13404 scontext=system_u:system_r:mongod_t:s0 tcontext=system_u:object_r:sysctl_fs_t:s0 tclass=dir permissive=0

2. Because of that I decided to test disabling SELinux temporarily (using setenforce 0) → NOK
3. I changed the SELinux mode in /etc/sysconfig/selinux to permissive and rebooted the server again → SUCCESS
4. I was super happy at this point because I thought I had found the cause of the issue. But I quickly decided to set SELinux back to enforcing and reboot once again. And… → SUCCESS

So, to me it looks like SELinux is not to blame for this, and that the reboot was what actually helped. But why didn’t the first reboot (mentioned in point 1) help?

Again, I’d love to be able to find the cause. If anyone can give me some clue what to look for, I’d be super, super happy!

zach · October 25, 2023, 11:14am

When the client requires single sign-on connection authentication, the server responds with a URL that the client opens in the web browser to complete the single sign-on. The client also sends a request to /key/wg_wait/<org_id>/<user_id>/<server_id>, which waits for the browser request to complete; once it does, the wait request returns. This request is retried every 10 seconds until the maximum time of 2 minutes is reached. The MongoDB errors would be unrelated: this is done only on the server, and the wait check doesn’t utilize the database. There is a small memory leak from the token events not being removed from the dictionary that hasn’t been finished yet, but restarting the service would correct that. The service logs on the client may provide more information. If it’s a load-balanced configuration, it’s possible that the initial request is sent to one host and the wait request is sent to a different host. The host public address should always be the IP address of that specific host.
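
To illustrate why a wait request that lands on the wrong host would time out, here is a rough Python sketch of the flow described above. It is not Pritunl's actual code: the `Host` class, its in-memory dictionary of events, and the `wait_for_sso` helper are all hypothetical names made up for the example. The only details taken from the description are the 10-second retry and the 2-minute maximum (modelled here as 12 attempts of 10 seconds each), and the fact that the wait check uses server memory rather than the database.

```python
import threading


class Host:
    """Hypothetical model of one host that keeps SSO completion events
    in process memory only (no database involved)."""

    def __init__(self, name):
        self.name = name
        self._events = {}  # token -> threading.Event, local to this host
        self._lock = threading.Lock()

    def _event(self, token):
        # Create the event on first use so wait and completion can
        # arrive in either order.
        with self._lock:
            return self._events.setdefault(token, threading.Event())

    def complete_sso(self, token):
        """Called when the browser finishes single sign-on on this host."""
        self._event(token).set()

    def wait(self, token, timeout):
        """One wait request: returns True if SSO completed on this host
        before the timeout, otherwise False."""
        return self._event(token).wait(timeout)


def wait_for_sso(host, token, attempts=12, per_request_timeout=10.0):
    """Client side: retry the wait request until it succeeds or the
    overall budget (attempts * per_request_timeout) is used up.
    Defaults model 12 x 10 s = 2 minutes."""
    for _ in range(attempts):
        if host.wait(token, per_request_timeout):
            return True
    return False
```

With two `Host` objects, calling `complete_sso("tok")` on the first and `wait_for_sso` on the second never returns `True`, because the second host has no record of the completed sign-on. Under this assumed model, the client would sit on ‘Authenticating’ until the timeout even though the identity provider reports a successful login, which matches the symptom in the original post.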