Nuxeo cluster and SQL Server shared instance problem

We have a Nuxeo-5.9.3 cluster consisting in two Tomcat server and a SQL Server geo-cluster. The SQL instance is shared among many applications.

We have observed that the cleanup of the table cluster_invals and cluster_nodes was not working properly, and Nuxeo performance plummeted (due to 30 rows in cluster_nodes and millions in cluster_invals). After doing some research, we found that the way Nuxeo handles the cleanup of those tables only works if you have a dedicated SQL Server instance.

We were able to work-around the problem by:

  • Adding a column to cluster_nodes (hostname nvarchar(128)).

  • Modify the file sqlserver.sql.txt included in nuxeo-core-storage-sql-5.9.3.jar. Now nodes are deleted and inserted this way:

    -- DB Collation is CI but Nuxeo DB is CS
    DELETE FROM [cluster_nodes] WHERE hostname =
      (SELECT host_name COLLATE DATABASE_DEFAULT FROM sys.dm_exec_sessions S
       WHERE S.session_id = @@SPID)
    
    INSERT INTO [cluster_nodes] (nodeid, created, hostname)
      SELECT @@SPID, CURRENT_TIMESTAMP, host_name FROM sys.dm_exec_sessions S
      WHERE S.session_id = @@SPID
    

The work-around can be easily improved (it doesn't currently work for many nuxeo instances on the same node).

Do you think there is a better approach for this problem? Should it be reported as a bug?

Thanks in advance.

0 votes

3 answers

1618 views

ANSWER



I can confirm that 5.9.4 indeed has this problem, but it's fixed for the next 5.9.5 (release in around 2 weeks). I don't have an exact ticket to pinpoint when the problem was introduced or when the fix occurred, but we've done a number of changes in the repository shutdown mechanism and one of them has fixed it.

You can try a 5.9.5-SNAPSHOT from http://qa.nuxeo.org/jenkins/job/IT-nuxeo-master-build/ if you want to confirm.

0 votes



And thanks for the report.
07/25/2014

Great!

I'll try to test it and let you know if it worked.

Thank you.

07/28/2014


Hi Florent,

The problem is not the SQL cluster but the use of shared instance. The original cleanup is:

DELETE FROM N FROM [cluster_nodes] N WHERE
HAS\_PERMS\_BY\_NAME(null, null, 'VIEW SERVER STATE') = 1
AND NOT EXISTS(
SELECT 1 FROM sys.dm_exec_sessions S WHERE
S.is\_user\_process = 1 AND N.nodeid = S.session\_id);

When there are connections that are not originated from nuxeo this doesn't work as expected. Session_id is unique, but it may not be assigned to a nuxeo session under these circumstances. Let's consider the following:

  1. One nuxeo server is started and connects to the database with session_id=50 (one row with that session_id is added to cluster_nodes)
  2. That nuxeo server goes offline.
  3. Other session is started from a different application (e.g. Team Foundation Server), connects to database and gets session_id=50
  4. The nuxeo instance is started again. The row in cluster_node is not removed because there is an active session with session_id=50.
  5. A new row is added. As a result, the same node has added two rows to cluster_nodes. In a quite busy database, after many restarts you can get too many rows in cluster_nodes.

Adding host_name allows us to clean all previous rows inserted from one node. It's not a full solution and there are still some situation where it doesn't work:

  • Multiple Nuxeo Instances in the same host.
  • Another application in the same host that connects to the same database.

If there were some field or parameter that uniquely identifies the instance, using it would resolve the problem. Is there such a thing?

0 votes



But in step 2, if the Nuxeo server is correctly shut down then it will clean up its entry in cluster_nodes.

What you describe only occurs in case of a crash or network partition, and in that case indeed we don't support automatic cleanup, and the cleanup on next startup of a Nuxeo node will fail as you describe if the session_id is reused externally.

We don't have a uniquely identifying Nuxeo node instance id at the moment, although that's something we want to add (in order to facilitate clustering and not have issues like the one you describe).

07/17/2014

In that case we are having some problem in our instances, because cluster_nodes is only cleaned up when starting an instance. It's been the behaviour since we install Nuxeo.

We'll look for any error in the logs during shutdown and let you know what we find.

Thanks for your help.

07/17/2014

Hi Florent,

I'm afraid a correct shut down does not delete the entry from cluster_nodes.

I created a trigger in cluster_nodes to log every delete. After shutting down the instance no delete operation was recorded. When I started it, the delete operation took place and the log appeared. No error or warning appears in Nuxeo server.log.

After downloading the source code I look for references to cluster_nodes and only find the txt files in nuxeo-core-storage-sql. Where could I find the cleanup code?

Additionaly, we are also going to modify the cleanup of cluster_invals. Since it also depends on the session, a SQLServer failover make the cleanup fail until the Nuxeo nodes are restarted (a failover will force the session to reconnect and get a new session_id).

Thanks in advance.

07/18/2014

If you look at sqlserver.sql.txt you'll see a section:

#CATEGORY: removeClusterNode

DELETE FROM [cluster_nodes] WHERE nodeid = @@SPID;

# Remove orphan invalidations
DELETE FROM [cluster_invals] WHERE [nodeid] IN (
  SELECT DISTINCT [cluster_invals].[nodeid]
    FROM [cluster_invals] LEFT JOIN [cluster_nodes] ON [cluster_invals].[nodeid] = [cluster_nodes].[nodeid]
    WHERE [cluster_nodes].[nodeid] IS NULL
)

which is the code executed at shutdown. If you uncomment this section in lib/log4j.xml:

<!-- Uncomment to debug SQL statements -->
<!--
<category name="org.nuxeo.ecm.core.storage.sql.jdbc">
  <priority value="TRACE" />
</category>
-->

then you'll see all SQL statements executed in the log/server.log file. At shutdown you should see among other things:

TRACE [JDBCLogger] (2) SQL: DELETE FROM [cluster_nodes] WHERE nodeid = @@SPID;
TRACE [JDBCLogger] (2) SQL: DELETE FROM [cluster_invals] WHERE [nodeid] IN (
  SELECT DISTINCT [cluster_invals].[nodeid]
    FROM [cluster_invals] LEFT JOIN [cluster_nodes] ON [cluster_invals].[nodeid] = [cluster_nodes].[nodeid]
    WHERE [cluster_nodes].[nodeid] IS NULL
07/18/2014

As the triggered test showed, apparently there is no DELETE is executed during shutdown.

This is the modified version log at startup: 2014-07-18 13:00:47,100 TRACE [localhost-startStop-1] org.nuxeo.ecm.core.storage.sql.jdbc.JDBCLogger SQL: DELETE FROM [cluster_nodes] where hostname =

(select host_name collate DATABASE_DEFAULT FROM sys.dm_exec_sessions S WHERE S.session_id = @@SPID)

But there is no DELETE at shutdown. Actually there is no much at shutdown in the log, only two rows (log is set at WARN for most components): 2014-07-18 13:01:44,352 INFO [localhost-startStop-2] [org.nuxeo.ecm.core.storage.sql.jdbc.QueryMakerServiceImpl] Unregistering QueryMaker 'NXQL': org.nuxeo.ecm.core.storage.sql.jdbc.NXQLQueryMaker 2014-07-18 13:01:45,615 INFO [localhost-startStop-2] [org.nuxeo.ecm.core.storage.sql.jdbc.QueryMakerServiceImpl] Unregistering QueryMaker 'CMISQL': org.nuxeo.ecm.core.opencmis.impl.server.CMISQLQueryMaker

I'll enable more traces and test it again.

07/18/2014

No sql traces during shutdown even if I enable more logging.
07/18/2014

It's working correctly in the current 5.9.5-SNAPSHOT version. We have refactored a few things related to repository shutdown. Maybe try 5.9.4 first? Your 5.9.3 is a Fast Track version, so you should always upgrade asap.
07/18/2014

I've tested both versions (5.9.3 and 5.9.4) and no delete was executed during shutdown. I tested in the lab with H2 database and I got the same result.

Maybe It's because of the OS (Windows). I'll test with Linux and see what I get.

07/21/2014

I've tested versions 5.9.3 and 5.9.4 with H2 databases using the virtual machines availables at: http://nuxeo.github.io/downloads.html

After enabling the jdbc logs and setting up the cluster, the behaviour is the same (DELETEs occur at startup but not during shutdown).

I'm configuring the cluster according to http://doc.nuxeo.com/display/public/ADMINDOC/Nuxeo+Clustering+Configuration.

Is there any additional step to consider? Could anyone else reproduce the problem?

07/23/2014


Thanks for the report. I opened NXP-14795 to track this issue and its resolution.

We haven't tested or validated Nuxeo with a clustered SQL Server. Can you explain why the host_name is needed in addition to the session_id? The session_id is not unique globally on the server cluster? Also host_name is provided by the client application and marked not reliable in the documentation.

0 votes