Error "Instance doc lost, stopping server." when stopping a Pritunl server, then server start times out

baud · October 17, 2023, 4:38pm

Hello,

I am currently deploying a Pritunl server on Ubuntu 22.04 LTS with pritunl v1.32.3602.80, backed by a MongoDB 7 replica set hosted on 3 separate hosts.

I am however encountering the following problem:
Every time I stop the pritunl server, the following error appears in the logs in /var/log/pritunl.log on the host machine:

Instance doc lost, stopping server. Check datetime setting

Then if I try to restart the server from the UI, after some time it eventually fails and the following message is printed:

Failed to start the server, server error occurred.

In the server logs, accessible from the UI, the following log appears:

[ERROR] Exception on /server/<server id>/operation/start [PUT]
Traceback (most recent call last):
  File "/usr/lib/pritunl/usr/lib/python3.9/site-packages/flask/app.py", line 2528, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/lib/pritunl/usr/lib/python3.9/site-packages/flask/app.py", line 1825, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/lib/pritunl/usr/lib/python3.9/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/pritunl/usr/lib/python3.9/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/usr/lib/pritunl/usr/lib/python3.9/site-packages/pritunl/auth/app.py", line 10, in _wrapped
    return call(*args, **kwargs)
  File "/usr/lib/pritunl/usr/lib/python3.9/site-packages/pritunl/handlers/server.py", line 1351, in server_operation_put
    svr.start()
  File "/usr/lib/pritunl/usr/lib/python3.9/site-packages/pritunl/server/server.py", line 1656, in start
    raise ServerStartError('Server start timed out', {
pritunl.exceptions.ServerStartError: Server start timed out.

The only way to fix this timeout is to restart the pritunl service on the host.
After restarting the service, the server starts successfully without any issue.

If I stop the server again, the same error log (Instance doc lost) happens again.

There is no time drift between the pritunl host and the mongo hosts.
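Since the error message says "Check datetime setting", one way to double-check the clocks is to compare the local time with the time a MongoDB node reports. Below is a minimal Python sketch, assuming pymongo is installed; the helper names drift_seconds and measure_drift are hypothetical and not part of Pritunl. It relies on the localTime field returned by MongoDB's hello command.

```python
from datetime import datetime, timezone


def drift_seconds(local_time: datetime, mongo_time: datetime) -> float:
    """Return the absolute difference in seconds between two UTC timestamps.

    Naive datetimes are treated as UTC, which is how pymongo decodes
    server dates unless tz_aware decoding is requested.
    """
    if local_time.tzinfo is None:
        local_time = local_time.replace(tzinfo=timezone.utc)
    if mongo_time.tzinfo is None:
        mongo_time = mongo_time.replace(tzinfo=timezone.utc)
    return abs((local_time - mongo_time).total_seconds())


def measure_drift(uri: str) -> float:
    """Compare this host's clock with the MongoDB node that answers `hello`."""
    from pymongo import MongoClient  # third-party: pip install pymongo

    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    try:
        before = datetime.now(timezone.utc)
        reply = client.admin.command("hello")
        after = datetime.now(timezone.utc)
    finally:
        client.close()
    # Use the midpoint of the round trip as the local reference time,
    # so network latency skews the result as little as possible.
    midpoint = before + (after - before) / 2
    return drift_seconds(midpoint, reply["localTime"])
```

Calling measure_drift once per replica set member, with a URI that targets that member only (for example with directConnection=true), would give a per-host figure. The result includes some network latency, so treat small values as noise.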

Does someone have an idea of where this error could come from?

Thank you

pjak · October 18, 2023, 4:41am

I am having the same issue. I run our setup on Kubernetes. When I stop the server, I can't start it again until I completely delete the host, bring up a new one, and attach it to the server.

zach · October 19, 2023, 2:28am

This could be caused by a corrupted messages collection, which sometimes happens when upgrading MongoDB. This is a known issue with the MongoDB 7 upgrade. You can either run sudo pritunl destroy-secondary to clear all of the cache from the database, or drop the messages collection and restart a host to recreate the collection. If the collection is dropped manually, you will likely need to stop all running hosts first, to prevent a host from adding to the collection before it can be created as a capped collection.
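As a rough illustration of the manual route, here is a minimal Python sketch that drops the messages collection. It assumes pymongo is installed and that the Pritunl database is named pritunl (an assumption; check the MongoDB URI in your Pritunl configuration). The helper name drop_messages_collection is hypothetical and not part of Pritunl. It is meant to be run only while all Pritunl hosts are stopped.

```python
def drop_messages_collection(db, coll_name: str = "messages") -> bool:
    """Drop the collection if it exists; return True if it was dropped.

    `db` is a pymongo Database (or any object exposing
    list_collection_names() and drop_collection()).
    """
    if coll_name not in db.list_collection_names():
        return False
    db.drop_collection(coll_name)
    return True


# Example usage (URI and database name are assumptions; adjust as needed):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017",
#                        serverSelectionTimeoutMS=5000)
#   drop_messages_collection(client["pritunl"])
#   client.close()
```

The sketch deliberately does not recreate the collection itself, since the capped size Pritunl expects is not stated here; restarting a host recreates it.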

The next release will have the command sudo pritunl clear-message-cache to clear only this collection.

pjak · October 19, 2023, 3:05am

Same issue for me after clearing the collections. We just moved MongoDB from a standalone instance to a replica set architecture and started noticing it then. It wasn't happening on the standalone setup. Could it have something to do with the replica set architecture? Maybe the lag is too high or something.

elafarge · October 19, 2023, 10:10am

Actually (I’m working with Baud), we also just switched from a single MongoDB instance to a replica set architecture (we’re trying to make Pritunl resilient to the loss of a cloud provider region).

elafarge · October 19, 2023, 3:44pm

So… we’ve performed multiple tests on a test Pritunl installation and… it seems that the issue disappears when using MongoDB 6.3.

When using MongoDB 7, even on a fresh installation, and despite dropping the messages collection and/or running sudo pritunl destroy-secondary, the issue persists: after stopping a Pritunl server, it’s impossible to restart it without first restarting the pritunl systemd unit on all hosts.

We will downgrade the MongoDB installation backing our production Pritunl deployment tonight and keep you posted.

Out of curiosity, @pjak, are you trying to run Pritunl against Mongo 7 too?

zach · October 19, 2023, 6:06pm

The Pritunl documentation still recommends MongoDB 6, and for now it is recommended to keep that version. I am currently testing on a MongoDB 7 replica set and have not seen any other issues. I have replicated the issue with the capped collection, and it was fixed by recreating the collection.

You may want to either stop all hosts in the cluster or run the command a few times. An existing collection cannot be converted to a capped collection. When inserting a document into a collection that doesn’t exist, MongoDB will create the collection automatically. I’m not sure if the await query would also trigger creating the collection. The collection is used frequently, and there is some delay in the initialization process between dropping all the collections and creating them. It may not be possible to complete this initialization if hosts are actively connected to the database.

Recreating the collection will disrupt all existing listeners on it. It may take some time for these to reset. Restarting the hosts will fix this immediately.
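To confirm whether the messages collection came back as a capped collection or was auto-created as a regular one by an early insert, its options can be inspected. The following is a minimal Python sketch, assuming pymongo is installed and the database is named pritunl (an assumption); the helper name messages_capped_state is hypothetical and not part of Pritunl.

```python
from typing import Optional


def messages_capped_state(db, coll_name: str = "messages") -> Optional[bool]:
    """Report whether the collection is capped.

    Returns None if the collection does not exist yet, True if it is
    capped, and False if it exists as a regular (non-capped) collection.
    `db` is a pymongo Database or an object with the same interface.
    """
    if coll_name not in db.list_collection_names():
        return None
    # Collection.options() returns the creation options as a dict,
    # including "capped": True for capped collections.
    options = db[coll_name].options()
    return bool(options.get("capped", False))


# Example usage (URI and database name are assumptions; adjust as needed):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017",
#                        serverSelectionTimeoutMS=5000)
#   print(messages_capped_state(client["pritunl"]))
#   client.close()
```

A result of False would suggest that a connected host inserted a document before the collection could be created as capped, in which case stopping all hosts and repeating the drop may be needed.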

elafarge · October 19, 2023, 7:48pm

First off, a piece of good news (which, on our end, means we consider the problem we opened this ticket for “solved”). Downgrading our production Pritunl installation to a Mongo 6.3 replica set did the trick.

We can now restart our Pritunl servers as we please, without any error.

I do confirm however that the tests we conducted today show that, with Mongo 7.0, even on brand new installations, even after dropping the messages collection with the Pritunl servers stopped, we do encounter the issue. It’s not a blocker for us any more but probably something you want to investigate deeper before considering Mongo 7 as GA for Pritunl.

@zach I’d be very happy to provide you with the Terraform code we use to provision Pritunl on GCP if you want to quickly get an installation where the bug can trivially be reproduced. Just contact me and/or @baud by DM if you’re interested.

pjak · October 19, 2023, 8:40pm

I can confirm once I downgraded to MongoDB 6 the issue was resolved and I can stop and start my pritunl servers with no issues. I also tried all the steps above on a fresh install of MongoDB 7 and still no luck. Will stick to Mongo 6 for now.