I've documented in red what is going on with the system.
0:005> lmv msqlservr
start end module name
00000000`01000000 00000000`03662000 sqlservr (deferred)
Mapped memory image file: D:\SymCache\sqlservr.exe\4A7245F62662000\sqlservr.exe
Image path: C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn\sqlservr.exe
Image name: sqlservr.exe
Timestamp: Fri Jul 31 09:16:38 2009 (4A7245F6)
CheckSum: 025DBEE2
ImageSize: 02662000
File version: 2005.90.4230.0
Product version: 9.0.4230.0
File flags: 0 (Mask 3F)
File OS: 40000 NT Base
File type: 1.0 App
File date: 00000000.00000000
Translations: 0409.04e4
CompanyName: Microsoft Corporation
ProductName: Microsoft SQL Server
InternalName: SQLSERVR
OriginalFilename: SQLSERVR.EXE
ProductVersion: 9.00.4230.00
FileVersion: 2005.090.4230.00
FileDescription: SQL Server Windows NT - 64 Bit
LegalCopyright: © Microsoft Corp. All rights reserved.
LegalTrademarks: Microsoft® is a registered trademark of Microsoft Corporation. Windows(TM) is a trademark of Microsoft Corporation
Comments: NT AMD64
And you have 8 CPUs there installed on the server, while all of them were executing the following:
SOS_Scheduler:3BA4080 m_id:0 status:VISIBLE ONLINE
m_pCurrentWorker:9FF61C0 Debug id:147
# Child-SP RetAddr Call Site
00 00000000`48c4ba58 00000000`77d705d6 ntdll!ZwDelayExecution+0xa
01 00000000`48c4ba60 00000000`0193a631 kernel32!SleepEx+0xaf
02 00000000`48c4bb00 00000000`0168aeb7 sqlservr!SpinlockBase::Sleep+0x1566c1
03 00000000`48c4bc80 00000000`01dc31c6 sqlservr!SpinlockBase::SpinToAcquire+0x117
04 00000000`48c4bcd0 00000000`02173ddc sqlservr!TSyncHashTable_EntryAccessor
05 00000000`48c4bd10 00000000`019e2cc9 sqlservr!CQSIndexStatsMgr::AddNewMissingIndex+0x7c
06 00000000`48c4bda0 00000000`016df878 sqlservr!CIdxSuggestion::Register+0x18a819
07 00000000`48c4bfb0 00000000`0173b192 sqlservr!COptExpr::PqteConvert+0x1d8
08 00000000`48c4c040 00000000`016df77d sqlservr!CPhyOp_RestrRemap::PqteConvert+0xc2
09 00000000`48c4c2b0 00000000`0173ab18 sqlservr!COptExpr::PqteConvert+0xdd
0a 00000000`48c4c340 00000000`016ef4f7 sqlservr!COptExpr::PqteConvertTree+0x78
0b 00000000`48c4c4d0 00000000`01712037 sqlservr!COptContext::PcxteOptimizeQuery+0x7c7
0c 00000000`48c4e880 00000000`0170658d sqlservr!CQuery::Optimize+0x217
0d 00000000`48c4ea10 00000000`01509e87 sqlservr!CQuery::PqoBuild+0x54d
0e 00000000`48c4ee50 00000000`014f0c0a sqlservr!CStmtQuery::InitQuery+0x167
0f 00000000`48c4eef0 00000000`01724abd sqlservr!CStmtSelect::Init+0x8a
10 00000000`48c4ef80 00000000`0157f15b sqlservr!CCompPlan::FCompileStep+0x3ed
11 00000000`48c4f040 00000000`014e8af1 sqlservr!CSQLSource::FCompile+0x6eb
12 00000000`48c4f2d0 00000000`016786f7 sqlservr!CSQLSource::FCompWrapper+0x121
13 00000000`48c4f390 00000000`01669db2 sqlservr!CSQLSource::Transform+0x377
14 00000000`48c4f450 00000000`0166ae92 sqlservr!CSQLSource::Execute+0x292
15 00000000`48c4f5a0 00000000`0166a9ad sqlservr!process_request+0x312
16 00000000`48c4f830 00000000`0167139e sqlservr!process_commands+0x3fd
17 00000000`48c4faf0 00000000`01670b89 sqlservr!SOS_Task::Param::Execute+0xee
18 00000000`48c4fc00 00000000`01676ca4 sqlservr!SOS_Scheduler::RunTask+0xc9
19 00000000`48c4fc90 00000000`017afcb7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4
1a 00000000`48c4fd00 00000000`014c5a79 sqlservr!SchedulerManager::WorkerEntryPoint+0xe7
1b 00000000`48c4fda0 00000000`017b02c0 sqlservr!SystemThread::RunWorker+0x59
1c 00000000`48c4fde0 00000000`017f9b88 sqlservr!SystemThreadDispatcher::ProcessWorker+0x130
1d 00000000`48c4fe80 00000000`781337a7 sqlservr!SchedulerManager::ThreadEntryPoint+0x128
1e 00000000`48c4ff20 00000000`78133864 msvcr80!_callthreadstartex+0x17
1f 00000000`48c4ff50 00000000`77d6b6da msvcr80!_threadstartex+0x84
20 00000000`48c4ff80 00000000`00000000 kernel32!BaseThreadStart+0x3a
0:270> ub 00000000`0168aeb7
sqlservr!SpinlockBase::SpinToAcquire+0xf4:
00000000`0168ae94 e970ffffff jmp sqlservr!SpinlockBase::SpinToAcquire+0x69 (00000000`0168ae09)
00000000`0168ae99 458bcd mov r9d,r13d
00000000`0168ae9c 458bc6 mov r8d,r14d
00000000`0168ae9f 498bd4 mov rdx,r12
00000000`0168aea2 488bcf mov rcx,rdi
00000000`0168aea5 4889742428 mov qword ptr [rsp+28h],rsi
00000000`0168aeaa c744242001000000 mov dword ptr [rsp+20h],1
00000000`0168aeb2 e8b9901500 call sqlservr!SpinlockBase::Sleep (00000000`017e3f70)
0:147> u sqlservr!SpinlockBase::Sleep
sqlservr!SpinlockBase::Sleep:
00000000`017e3f70 4053 push rbx
00000000`017e3f72 55 push rbp
00000000`017e3f73 56 push rsi
00000000`017e3f74 57 push rdi
00000000`017e3f75 4154 push r12
00000000`017e3f77 4155 push r13
00000000`017e3f79 4157 push r15
00000000`017e3f7b 4881ec40010000 sub rsp,140h
0:147> dq 00000000`48c4bb00+0x140+8+8+8
00000000`48c4bc58 00000000`801c3428 00000000`03ba0508
00000000`48c4bc68 00000000`00001e68 00000000`00002710
00000000`48c4bc78 00000000`0168aeb7 00000000`00000000
00000000`48c4bc88 00000002`0f7e0a20 00000000`801b82c8
00000000`48c4bc98 00000000`01c5c471 00000000`00000001
00000000`48c4bca8 00000000`03ba0508 00000000`0009001f
00000000`48c4bcb8 00000000`48c4bdd8 00000000`00000001
00000000`48c4bcc8 00000000`01dc31c6 00000000`801c3418
So the rdi is on the stack, and we found it to be 00000000`801c3428, and it told us the spin lock was owned by the thread whose id is 7396,
0:147> ~~[0n7396]s
ntdll!ZwSignalAndWaitForSingleObject+0xa:
00000000`77ef1b5a c3 ret
0:270> kL
Child-SP RetAddr Call Site
00000000`5f3bbab8 00000000`77d44ebb ntdll!ZwSignalAndWaitForSingleObject+0xa
00000000`5f3bbac0 00000000`0165de56 kernel32!SignalObjectAndWait+0x133
00000000`5f3bbb70 00000000`01662d87 sqlservr!SOS_Scheduler::SwitchContext+0x316
00000000`5f3bbe60 00000000`0166f5a0 sqlservr!SOS_Scheduler::Suspend+0x97
00000000`5f3bbea0 00000000`019518a1 sqlservr!SOS_Event::Wait+0x150
00000000`5f3bbf00 00000000`01673a40 sqlservr!SOS_UnfairMutexPair::LongWait+0x3fef51
00000000`5f3bbf80 00000000`01676b20 sqlservr!CMemThread::Free+0x4e0
00000000`5f3bbfe0 00000000`02173e1d sqlservr!commondelete+0x30
0000000`5f3bc010 00000000`019e2cc9 sqlservr!QSIndexStatsMgr::AddNewMissingIndex+0xbd
00000000`5f3bc0a0 00000000`016df878 sqlservr!CIdxSuggestion::Register+0x18a819
00000000`5f3bc2b0 00000000`0173ab18 sqlservr!COptExpr::PqteConvert+0x1d8
00000000`5f3bc340 00000000`016ef4f7 sqlservr!COptExpr::PqteConvertTree+0x78
00000000`5f3bc4d0 00000000`01712037 sqlservr!COptContext::PcxteOptimizeQuery+0x7c7
00000000`5f3be880 00000000`0170658d sqlservr!CQuery::Optimize+0x217
00000000`5f3bea10 00000000`01509e87 sqlservr!CQuery::PqoBuild+0x54d
00000000`5f3bee50 00000000`014f0c0a sqlservr!CStmtQuery::InitQuery+0x167
00000000`5f3beef0 00000000`01724abd sqlservr!CStmtSelect::Init+0x8a
00000000`5f3bef80 00000000`0157f15b sqlservr!CCompPlan::FCompileStep+0x3ed
00000000`5f3bf040 00000000`014e8af1 sqlservr!CSQLSource::FCompile+0x6eb
00000000`5f3bf2d0 00000000`016786f7 sqlservr!CSQLSource::FCompWrapper+0x121
When the query optimizer of SQL Server 2005 generates a query plan, the query optimizer may wait for a spinlock to be released. The spinlock is used to access internal structure to record information about the missing indexes for the query plan. While the query optimizer is waiting for the spinlock, the query optimizer does not release the scheduler. The query optimizer does not obtain the spinlock and keeps holding the scheduler. All the threads blocked in the spin lock, which cause the SQL Server cannot respond to the CheckServcieAlive request from the Windows cluster, which resulted in the SQL Server resource failure and SQL group failover.
We can see it hit a known issue in this build of SQL Server:
974205 FIX: Error message in the Errorlog file of SQL Server 2005 or of SQL Server 2008 after the SQL Server service stops responding: "Timeout occurred while waiting for latch"
http://support.microsoft.com/default.aspx?scid=kb;EN-US;974205
It was fixed in the SQL Server 2005 SP3 CU6. You can download it from the following URL:
http://support.microsoft.com/kb/974648/LN/
If you get this error or similar install the SQL Server 2005 SP3 CU6 at your earliest convenience.
No comments:
Post a Comment