Which TNSNAMES ?
By: Date: March 12, 2014 Categories: Oracle

In my workplace we host a large number of customers running a complex application against an Oracle database (11.2.0.3) on Solaris 11.  Each customer is in their own database, running multiple database on a server.  Each of these databases has a link to a common lookup database (really, several identical lookup databases that they can fail over to).

We typically use LDAP for TNS lookups, but each server has a local tnsnames file that is used strictly to provide the lookup database.  This let’s the DBAs create the same database link across all databases, and then at the server level determine the primary and failover look up databases.

(Keep in mind that in the discussion that follows, all references to the listener are talking about the listener on the client or initial database, not the listener on the target of the database link).

So, we ran into this strange problem where with just one database on this particular server the database link failed over to the second database each time, after a one minute timeout.  The tnsnames.ora file in the Oracle database home looked like this:

mydb.world=
    (description=
     (failover=on)
     (load_balance=off)
     (address=
          (protocol=tcp)
          (host=myhost1)
          (port=1521))
     (address=
          (protocol=tcp)
          (host=myhost2)
          (port=1521))
     (connect_data=
        (service_name=mydb.world)
          (failover_mode=
              (type=session)
              (method=basic))))

So, in this case, the database was always connecting its link to myhost2, not myhost1, as it should have. All other databases on this host worked correctly. Whats troubling from a performance perspective is that the link only worked in the session once a 60-second connect timeout had expired.

So the question is:  why? My colleagues and I checked all the usual suspects, and found nothing. The only thing out of the ordinary about this host was that it was new, having replaced an old server recently, changing the name in the process.

Once we identified the problem, we temporarily worked around the customer performance issue by putting the lookup database schemas into the customer database in this instance, and avoiding the link.

One thing we identified early on is that performing a local connection without the listener involved would lead to a database link that worked correctly. (That is, setting the ORACLE_SID, and connecting without the @TNS.. connection string, yielded a connection that could use the link without issue.) Odd.

We duplicated the database onto the same server, and could also recreate the problem with the duplicated database. Beyond this duplication test, we moved this to the back burner for a day or two and let it simmer.

One of my colleagues continued to look at this problem at night, and found that connecting over the port 1521 listener yielded a session that could create the database link without issue, while connecting over the port 1522 listener yielded a session that had the timeout/failover issue. This was the key to unraveling this issue. It turns out that we had load balanced connections for this one database to take place on the 1522 listener (that is, our LDAP entry for this database directed client connections to the 1522 listener).

So, a quick fix here is to use the 1521 listener. However, while we might take a quick fix, we aren’t apt to leave it without understanding the whys of the issue. So, we sat down and started some testing.  Sure enough, each time we connected to the 1522 listener, the problem presented. When connecting to the 1521 listener, the issue was not present. Why in the world would the listener used in the initial database connection have anything to do with a database link created from within the database?

Some background here for those that don’t know. There are two ways for a listener to know about a database. First, static entries listed in the listener.ora file. Second, the database performs “self-registration” with its local listener. By default, the local listener is at port 1521. If you run a status of a listener that has both a static entry, and a self-registered entry, you’ll see something like:

Service "clientdb.world" has 2 instance(s).
Instance "clientdb", status UNKNOWN, has 1 handler(s) for this service...
Instance "clientdb", status READY, has 1 handler(s) for this service...

The UNKNOWN status is shown for the static listener entry, while the READY status is shown for the self-registered entry. Since the 1521 listener was the local listener, it shows both service entries whereas the 1522 listener showed only the UNKNOWN entry.

Our next step: Let’s move the local listener for this database to be the 1522 listener, rather than the 1521 listener. Bingo. Now we were able to say that connections made to the local listener did not have the problem, so there was nothing specific to either the 1521 or 1522 listener, but rather whether it was the local listener or not.

Ok, a little further down the path, but not to our destination. Why does the local listener status have any impact on my database link?

Next step: we created network database connections to the database and then used the solaris command “pargs -e” to show the environment of our process that was created:

Connection to non local listener:


[oracle@dominio]$ ps -elf | grep clientdb | grep NO
 0 S   oracle 25804     1   1  39 20        ? 483213        ? 15:14:30 ?           0:00 oracleclientdb (LOCAL=NO)
[oracle@domino]$ pargs -e 25804
25804:  oracleclientdb (LOCAL=NO)
envp[0]: ORACLE_SPAWNED_PROCESS=1
envp[1]: TNS_ADMIN=/export/home/oracle/grid/grid11gr2/network/admin/

BINGO! TNS_ADMIN was set in the process environment for those connections made through other-than the local listener, and that TNS_ADMIN environment variable was pointing to the /network/admin folder of the **LISTENER** home, not the database home!! (Our listeners are provided by the grid infrastructure that in 11G lives in a different home.) This means that for **CLIENT** connections made to the database over the non-local listener, database links created during the session would lookup their TNS information in the listener home. It turns out that our listener home tnsnames.ora file had an improper hostname (because the server was renamed when replaced), so the initial connection would fail, leading to a timeout and failover.

Summarizing what we discovered:

  • Listeners started manually using lsnrctl do not exhibit this behavior.  That is, they do not set the TNS_ADMIN variable in the database process, regardless of whether the local listener is used.
  • Listeners started by the grid infrastructure (i.e., srvctl start listener) do set the TNS_ADMIN environment variable for the non-local listener only, and the listener sets this TNS_ADMIN variable to point to the listener’s home, not the database’s home.

We had another, related issue to this, where during an upgrade we saw some slow performance when compiling objects that reference other objects across the database links.  We traced this to the same problem, though the characteristics were much different.

As I mentioned above, when a database starts up, it registers itself with the local listener.  However, there is a brief time when an external connection can be made to the database and the database will not yet have registered.  This connection is made to the non-local listener, and hence, will use the listener’s TNS home information when creating database links.

Our fix is really just to make sure we keep the TNSNAMES files synchronized between the database and grid (listener) homes.

This all came as a great surprise, as I don’t think any of us would have ever looked to the listener’s home when diagnosing database link connection issues.