AWS數據庫云服務器出現了連接超時的錯誤,于是查看相關時段的alert日志,發現了如下的錯誤:
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.4.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
Time: 25-OCT-2017 05:00:12
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=x.x.x.x)(PORT=4459))
Wed Oct 25 05:18:13 2017
***********************************************************************
網絡連接超時,鑒于以前也經常遇到類似的TNS錯誤,因此需要徹底搞定它。查看MOS后找到如下文章:
Alert Log Errors: 12170 TNS-12535/TNS-00505: Operation Timed Out (文檔 ID 1628949.1)
主要內容如下:
CAUSE
The 'nt secondary err code' identifies the underlying network transport, such as (TCP/IP) timeout limit. In the current case 60 identifies Windows underlying transport layer.
The "nt secondary err code" will be different based on the operating system:
Linux x86 or Linux x86-64: "nt secondary err code: 110"
HP-UX : "nt secondary err code: 238"
AIX: "nt secondary err code: 78"
Solaris: "nt secondary err code: 145"
The alert.log message indicates that a connection was terminated AFTER it was established to the instance. In this case, it was terminated 2 hours and 3 minutes after the listener
handed the connection to the database.
This would indicate an issue with a firewall where a maximum idle time setting is in place.
The connection would not necessarily be "idle". This issue can arise during a long running query or when using JDBC Thin connection pooling.
If there is no data 'on the wirte for lengthy periods of time for any reason, the firewall might terminate the connection.
SOLUTION
The non-Oracle solution would be to remove or increase the firewall setting for maximum idle time. In cases where this is not feasible,
Oracle offers the following suggestion:
The following parameter, set at the **RDBMS_HOME/network/admin/sqlnet.ora, can resolve this kind of problem.
DCD or SQLNET.EXPIRE_TIME can mimic data transmission between the server and the client during long periods of idle time.
SQLNET.EXPIRE_TIME=n Where <n> is a non-zero value set in minutes.
Once this change is in place, there is NO need to restart the listener or the database.
The change will be in place for all newly spawned server processes following the change.
Be aware that connections that were established prior to this setting would not be affected by this change. Therefore, you may continue to experience some timeouts until all remote
connection are established with this setting in place.
See the following : Note 257650.1 Resolving Problems with Connection Idle Timeout With Firewall
從文中我們看出,這可能是防火墻策略或者JDBC連接池的原因,服務器自動將長時間沒有響應的會話關閉,可以通過設置SQLNET.EXPIRE_TIME來解決。
文中也列出了不同操作系統報出的不同nt secondary err code。
問題的基本診斷思路如下:
1.查找問題時間段的alert日志和監聽日志,其位置分別是$ORACLE__BASE/diag/rdbms/db_name/$ORACLE_SID/trace/和$ORACLE__BASE/diag/tnslsnr/hostname/listener/trace/
2.如果alert日志顯示如上報錯,但listener日志顯示正常,那么說明確實是數據庫以外的問題,按如上文檔的solution解決即可。
3.如果監聽日志顯示了其他報錯,那么再根據報錯找到相應的解決辦法。
文章列表