Nuxeo cluster and SQL Server shared instance problem
We have a Nuxeo-5.9.3 cluster consisting in two Tomcat server and a SQL Server geo-cluster. The SQL instance is shared among many applications.
We have observed that the cleanup of the table cluster_invals and cluster_nodes was not working properly, and Nuxeo performance plummeted (due to 30 rows in cluster_nodes and millions in cluster_invals). After doing some research, we found that the way Nuxeo handles the cleanup of those tables only works if you have a dedicated SQL Server instance.
We were able to work-around the problem by:
Adding a column to cluster_nodes (hostname nvarchar(128)).
Modify the file sqlserver.sql.txt included in nuxeo-core-storage-sql-5.9.3.jar. Now nodes are deleted and inserted this way:
-- DB Collation is CI but Nuxeo DB is CS DELETE FROM [cluster_nodes] WHERE hostname = (SELECT host_name COLLATE DATABASE_DEFAULT FROM sys.dm_exec_sessions S WHERE S.session_id = @@SPID) INSERT INTO [cluster_nodes] (nodeid, created, hostname) SELECT @@SPID, CURRENT_TIMESTAMP, host_name FROM sys.dm_exec_sessions S WHERE S.session_id = @@SPID
The work-around can be easily improved (it doesn't currently work for many nuxeo instances on the same node).
Do you think there is a better approach for this problem? Should it be reported as a bug?
Thanks in advance.
I can confirm that 5.9.4 indeed has this problem, but it's fixed for the next 5.9.5 (release in around 2 weeks). I don't have an exact ticket to pinpoint when the problem was introduced or when the fix occurred, but we've done a number of changes in the repository shutdown mechanism and one of them has fixed it.
You can try a 5.9.5-SNAPSHOT from http://qa.nuxeo.org/jenkins/job/IT-nuxeo-master-build/ if you want to confirm.
The problem is not the SQL cluster but the use of shared instance. The original cleanup is:
HAS\_PERMS\_BY\_NAME(null, null, 'VIEW SERVER STATE') = 1
AND NOT EXISTS(
SELECT 1 FROM sys.dm_exec_sessions S WHERE
S.is\_user\_process = 1 AND N.nodeid = S.session\_id);
When there are connections that are not originated from nuxeo this doesn't work as expected. Session_id is unique, but it may not be assigned to a nuxeo session under these circumstances. Let's consider the following:
- One nuxeo server is started and connects to the database with session_id=50 (one row with that session_id is added to cluster_nodes)
- That nuxeo server goes offline.
- Other session is started from a different application (e.g. Team Foundation Server), connects to database and gets session_id=50
- The nuxeo instance is started again. The row in cluster_node is not removed because there is an active session with session_id=50.
- A new row is added. As a result, the same node has added two rows to cluster_nodes. In a quite busy database, after many restarts you can get too many rows in cluster_nodes.
Adding host_name allows us to clean all previous rows inserted from one node. It's not a full solution and there are still some situation where it doesn't work:
- Multiple Nuxeo Instances in the same host.
- Another application in the same host that connects to the same database.
If there were some field or parameter that uniquely identifies the instance, using it would resolve the problem. Is there such a thing?
What you describe only occurs in case of a crash or network partition, and in that case indeed we don't support automatic cleanup, and the cleanup on next startup of a Nuxeo node will fail as you describe if the session_id is reused externally.
We don't have a uniquely identifying Nuxeo node instance id at the moment, although that's something we want to add (in order to facilitate clustering and not have issues like the one you describe).
We'll look for any error in the logs during shutdown and let you know what we find.
Thanks for your help.
I'm afraid a correct shut down does not delete the entry from cluster_nodes.
I created a trigger in cluster_nodes to log every delete. After shutting down the instance no delete operation was recorded. When I started it, the delete operation took place and the log appeared. No error or warning appears in Nuxeo server.log.
After downloading the source code I look for references to cluster_nodes and only find the txt files in nuxeo-core-storage-sql. Where could I find the cleanup code?
Additionaly, we are also going to modify the cleanup of cluster_invals. Since it also depends on the session, a SQLServer failover make the cleanup fail until the Nuxeo nodes are restarted (a failover will force the session to reconnect and get a new session_id).
Thanks in advance.
If you look at sqlserver.sql.txt you'll see a section:
#CATEGORY: removeClusterNode DELETE FROM [cluster_nodes] WHERE nodeid = @@SPID; # Remove orphan invalidations DELETE FROM [cluster_invals] WHERE [nodeid] IN ( SELECT DISTINCT [cluster_invals].[nodeid] FROM [cluster_invals] LEFT JOIN [cluster_nodes] ON [cluster_invals].[nodeid] = [cluster_nodes].[nodeid] WHERE [cluster_nodes].[nodeid] IS NULL )
which is the code executed at shutdown. If you uncomment this section in
<!-- Uncomment to debug SQL statements --> <!-- <category name="org.nuxeo.ecm.core.storage.sql.jdbc"> <priority value="TRACE" /> </category> -->
then you'll see all SQL statements executed in the
log/server.log file. At shutdown you should see among other things:
TRACE [JDBCLogger] (2) SQL: DELETE FROM [cluster_nodes] WHERE nodeid = @@SPID; TRACE [JDBCLogger] (2) SQL: DELETE FROM [cluster_invals] WHERE [nodeid] IN ( SELECT DISTINCT [cluster_invals].[nodeid] FROM [cluster_invals] LEFT JOIN [cluster_nodes] ON [cluster_invals].[nodeid] = [cluster_nodes].[nodeid] WHERE [cluster_nodes].[nodeid] IS NULL
This is the modified version log at startup: 2014-07-18 13:00:47,100 TRACE [localhost-startStop-1] org.nuxeo.ecm.core.storage.sql.jdbc.JDBCLogger SQL: DELETE FROM [cluster_nodes] where hostname =
(select host_name collate DATABASE_DEFAULT FROM sys.dm_exec_sessions S WHERE S.session_id = @@SPID)
But there is no DELETE at shutdown. Actually there is no much at shutdown in the log, only two rows (log is set at WARN for most components): 2014-07-18 13:01:44,352 INFO [localhost-startStop-2] [org.nuxeo.ecm.core.storage.sql.jdbc.QueryMakerServiceImpl] Unregistering QueryMaker 'NXQL': org.nuxeo.ecm.core.storage.sql.jdbc.NXQLQueryMaker 2014-07-18 13:01:45,615 INFO [localhost-startStop-2] [org.nuxeo.ecm.core.storage.sql.jdbc.QueryMakerServiceImpl] Unregistering QueryMaker 'CMISQL': org.nuxeo.ecm.core.opencmis.impl.server.CMISQLQueryMaker
I'll enable more traces and test it again.
Maybe It's because of the OS (Windows). I'll test with Linux and see what I get.
After enabling the jdbc logs and setting up the cluster, the behaviour is the same (DELETEs occur at startup but not during shutdown).
I'm configuring the cluster according to http://doc.nuxeo.com/display/public/ADMINDOC/Nuxeo+Clustering+Configuration.
Is there any additional step to consider? Could anyone else reproduce the problem?
Thanks for the report. I opened NXP-14795 to track this issue and its resolution.
We haven't tested or validated Nuxeo with a clustered SQL Server. Can you explain why the host_name is needed in addition to the session_id? The session_id is not unique globally on the server cluster? Also host_name is provided by the client application and marked not reliable in the documentation.