I am posting this incredibly detailed technical article for both posterity and my fellow administrative brethren.
Beginning late Monday afternoon, we found that TFS Deployer, a service that is critical to the instance of Microsoft Team Foundation Server (TFS) for which my group is responsible, was failing to start. The TFS Deployer service is a community add-on for TFS. The service listens for changes in build quality, and once it sees a change that meets a specified set of criteria, the build is deployed. A summary of events to this point follows.
- Maintenance on the physical host was performed on Friday, 10 July 2009. The server had the various Microsoft hotfixes applied to it, and the server was also rebooted.
- Unbeknownst to my team, the TFS Deployer service failed to restart once the server was restored to service. The error in the error log was that the service had hung upon restart. We were never alerted to the service being down, as we were not monitoring it.
- On Monday afternoon, we started receiving inquiries into an issue with the deployer service from different application teams.
- We investigated, and immediately discovered the service was down. Repeated attempts to restart the service were unsuccessful. Furthermore, no helpful diagnostic messages were received. The service would immediately stop once it was started.
At first, I thought the problem was possibly the password or the rights of the domain service account used to start the service. The service account has rights to servers and filesystems used by the various applications housed within our TFS instance. Unfortunately, two facts dispelled this notion as a cause of our problem. First, we could log in to the server console as the service account. Second, we reconfigured the service to use another account with the same privileges, and it still failed to start. Thus, the password and rights were ruled out.
Further complicating matters was that the server in our Stage environment was configured at the same patch level as the Production server. As a sanity check, we copied and installed the TFS Deployer service to this server. We configured the service in this environment with the same service account. The service successfully started in this environment, but we could not make use of it without some effort in reconfiguring the deployment scripts used for different applications. We then started digging a bit deeper into the problem.
I started looking through the event log on the server to see if there were any other glaring messages. Of all the messages, the following entry in the Security log seemed suspicious.
As highlighted above, the characters for the Logon Process seemed strange. Moreover, I could see a Success Audit entry for the same account before seeing two consecutive Failure Audits. So, I put together a Google search with the information I had. Nothing conclusive turned up, but several results consistently discussed disabling a loopback check.
Essentially, a security fix applied to Windows Server 2003 implemented a loopback check. The security fix was to prevent a reflection attack on the server. The security fix is more fully explained at Microsoft Security Bulletin MS08-068. The description of the issue follows.
A remote code execution vulnerability exists in the way that Microsoft Server Message Block (SMB) Protocol handles NTLM credentials when a user connects to an attacker’s SMB server. This vulnerability allows an attacker to replay the user’s credentials back to them and execute code in the context of the logged-on user. If a user is logged on with administrative user rights, an attacker who successfully exploited this vulnerability could take complete control of an affected system. An attacker could then install programs; view, change, or delete data; or create new accounts with full user rights. Users whose accounts are configured to have fewer user rights on the system could be less impacted than users who operate with administrative user rights.
Nonetheless, I deferred to our Infrastructure support team, and I escalated the problem in their direction. While troubleshooting this on Tuesday, we also learned a bit more about the TFS Deployer service. Specifically, the service has a debug mode:
Unfortunately, it took us most of the day before we discovered this switch. Once we ran the service with the debug switch, we discovered that the loopback security fix was preventing the service from starting up. With that, we employed the registry fix to disable the loopback check implemented in MS08-068. Further reflection on this would explain how the service could start without issue in our Stage environment. The configuration was for a different server, so there would be no use of loopback in that instance.
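The difference between Production and Stage comes down to whether the service is configured to connect back to the machine it runs on. A minimal sketch of that check follows. The function names and the example host names are hypothetical, and the sketch only compares the host in a configured URL against names the local machine is known by. It does not inspect how TFS Deployer itself reads its configuration.

```python
import socket
from urllib.parse import urlsplit


def local_machine_names():
    """Collect names this machine answers to (short name, FQDN, loopback aliases)."""
    names = {"localhost", "127.0.0.1", "::1", socket.gethostname(), socket.getfqdn()}
    return {n.lower() for n in names if n}


def points_at_local_machine(url, local_names):
    """Return True when the URL's host is one of the given local names."""
    # urlsplit lower-cases the hostname and returns None when there is no host.
    host = urlsplit(url).hostname
    return host is not None and host in {n.lower() for n in local_names}


if __name__ == "__main__":
    # Hypothetical URL; substitute the server URL from your own configuration.
    configured_url = "http://tfsprod01:8080/"
    if points_at_local_machine(configured_url, local_machine_names()):
        print("Service connects back to its own host; the loopback check may apply.")
    else:
        print("Service connects to a different host; no loopback involved.")
```

Passing the local names in as a parameter keeps the comparison easy to exercise on any machine. DNS aliases that are not in the list would have to be added by hand.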
In the interest of others who encounter this problem, below are the steps to edit the registry. As an extra precaution, we also removed the hotfix that was deployed on 10 July 2009.
- Click Start, click Run, type regedit in the Open box, and then click OK.
- Locate and then click the following subkey in the registry:
- On the Edit menu, point to New, and then click DWORD Value.
- Type DisableLoopbackCheck for the name of the DWORD, and then press ENTER.
- Right-click DisableLoopbackCheck, and then click Modify.
- In the Value data box, type 1, and then click OK.
- Exit Registry Editor, and then restart the computer.
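For anyone who prefers to script the change rather than click through Registry Editor, the steps above can be sketched roughly as follows. The subkey is not shown in the steps above. The path used here is an assumption taken from the commonly cited Microsoft KB 896861 guidance for DisableLoopbackCheck, so verify it against that article before use. The function names are hypothetical. The direct write requires Windows and an elevated process, and a restart is still needed afterwards.

```python
# Assumption: subkey path as documented in Microsoft KB 896861; verify before use.
LSA_SUBKEY = r"SYSTEM\CurrentControlSet\Control\Lsa"
VALUE_NAME = "DisableLoopbackCheck"


def render_reg_file(disable_check=True):
    """Return .reg file text that sets DisableLoopbackCheck to 1 (or 0 to restore the check)."""
    data = 1 if disable_check else 0
    return (
        "Windows Registry Editor Version 5.00\r\n\r\n"
        f"[HKEY_LOCAL_MACHINE\\{LSA_SUBKEY}]\r\n"
        f'"{VALUE_NAME}"=dword:{data:08x}\r\n'
    )


def apply_with_winreg(disable_check=True):
    """Write the DWORD directly. Windows only; run from an elevated process."""
    import winreg  # imported lazily so this module still loads on other platforms

    with winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE, LSA_SUBKEY, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(
            key, VALUE_NAME, 0, winreg.REG_DWORD, 1 if disable_check else 0
        )


if __name__ == "__main__":
    # Print the .reg text for review instead of changing the registry right away.
    print(render_reg_file())
```

Rendering the .reg text first gives a reviewable artifact that can be attached to a change request. Calling `render_reg_file(disable_check=False)` produces the file that restores the check.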
Once the computer restarted, the service successfully started. We tested it and confirmed that deployments were working as expected. Following that, the hotfix from 10 July was reinstalled, and the service was restarted successfully. Clearly, the loopback check had to be disabled.
Further investigation is being performed to determine the root cause. Specifically, why did the service fail to restart if the security patch in question dates back to late last year? The server is not behind in its patches, so we are going to investigate whether a new security policy was implemented and pushed down by our Active Directory Operations team. There were still some important lessons learned from the issue resolution:
- Ensure that monitoring for critical services is in place. My team dropped the ball on this front, as this service has been active for over a year. It was only early this year, however, that we declared our TFS implementation fully ready. The lack of monitoring has been remedied.
- Ensure that documentation, or links to critical documentation, is available for tools used in Production that are not directly supported by the vendor. In other words, if you are using freely available open-source or community-developed tools, be sure to have links to important documentation. My team thought that the TFS Deployer service was custom code developed in-house. We discovered that this was not the case, but we should have had ready links to documentation on the service to allow for easier troubleshooting. This is still pending.
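As an illustration of the first lesson, here is a minimal sketch of a service check built on the standard Windows `sc query` command. It is not the monitoring we actually put in place. The service name `TFSDeployer` is a hypothetical placeholder, the function names are made up for this example, and the parser assumes the English-language output format of `sc`.

```python
import re
import subprocess


def parse_sc_state(sc_output):
    """Extract the state name (e.g. RUNNING, STOPPED) from `sc query` output, or None."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None


def service_is_running(service_name):
    """Run `sc query <name>` (Windows only) and report whether the state is RUNNING."""
    result = subprocess.run(
        ["sc", "query", service_name], capture_output=True, text=True
    )
    return parse_sc_state(result.stdout) == "RUNNING"


if __name__ == "__main__":
    # Hypothetical service name; look up the real one in the Services console.
    name = "TFSDeployer"
    if not service_is_running(name):
        print(f"ALERT: service {name} is not running")
```

A check like this could run from a scheduled task after every maintenance window, with the alert routed to whatever notification channel the team already uses.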
While the resolution took longer than either I or the Infrastructure team we worked with would have liked, we both now know more about how this service works.