2009-03-10

Excessive CPU usage on Domain Controllers

Now this has nothing to do with Virtualization, but it is still very handy, so I thought I would share it with you all. Yesterday I started to receive alerts from our monitoring system that our DCs were constantly running at almost 100% CPU. Of course, this is not normal. Because of the high CPU usage, I then started to get alerts about FRS replication issues and DNS problems, and I could see a snowball slowly gathering momentum. First off, let me say that I am happy our monitoring system notified us of a problem. Next, I am not happy to announce that it could not identify what the problem was, so I had to dig further (we are looking into a better monitoring solution).
Time to put on my detective's cap and go to work (sorry for the shameless rip-off, Gabrie!!). The process eating up so much CPU was lsass.exe. As you probably know, this process is responsible for domain authentication. It was happening on both DCs at the same time and no viral activity was detected on our network, so I suspected that authentication problems were causing it. I therefore turned to Windows Server 2003 Performance Advisor (SPA) for some more wisdom and enlightenment. A quick overview of how to use this utility can be found here. This tool is very useful for diagnosing what is causing load on your server, and it can be used in many different ways to report problems and trends.
After gathering the info and reports from the DCs (which was not that easy with them running at almost 100% CPU), the results showed that one user account was trying to validate against the DCs at a rate of over 220 times per second (which is NOT NORMAL). To make things more complicated, most of the load on the second DC was coming from the first. After some investigation, it turned out that the user account's password had expired. According to Microsoft:
When a domain controller detects that an authentication attempt did not work and a condition of STATUS_WRONG_PASSWORD, STATUS_PASSWORD_EXPIRED, STATUS_PASSWORD_MUST_CHANGE, or STATUS_ACCOUNT_LOCKED_OUT is returned, the domain controller forwards the authentication attempt to the primary domain controller (PDC) emulator operations master. Essentially, the domain controller queries the PDC to authoritatively determine if the password is current. The domain controller queries the PDC for this information because the domain controller may not have the most current password for the user but, by design, the PDC emulator operations master always has the most current password.
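To make the forwarding rule a bit more concrete, here is a minimal Python model of the behavior Microsoft describes. It is only a sketch under my own assumptions: the names AuthStatus, should_forward_to_pdc and forwarded_requests_per_sec are hypothetical, and I assume the PDC emulator itself never forwards (it already holds the current password). It is not how lsass.exe is actually implemented.

```python
from enum import Enum


class AuthStatus(Enum):
    """NTSTATUS results an authentication attempt can end with (subset)."""
    SUCCESS = "STATUS_SUCCESS"
    WRONG_PASSWORD = "STATUS_WRONG_PASSWORD"
    PASSWORD_EXPIRED = "STATUS_PASSWORD_EXPIRED"
    PASSWORD_MUST_CHANGE = "STATUS_PASSWORD_MUST_CHANGE"
    ACCOUNT_LOCKED_OUT = "STATUS_ACCOUNT_LOCKED_OUT"


# Failure codes that, per the Microsoft text quoted above, make a DC
# re-check the attempt against the PDC emulator operations master.
FORWARD_TO_PDC = frozenset({
    AuthStatus.WRONG_PASSWORD,
    AuthStatus.PASSWORD_EXPIRED,
    AuthStatus.PASSWORD_MUST_CHANGE,
    AuthStatus.ACCOUNT_LOCKED_OUT,
})


def should_forward_to_pdc(status: AuthStatus, is_pdc_emulator: bool) -> bool:
    """Hypothetical helper: would this DC ask the PDC emulator to re-validate?"""
    if is_pdc_emulator:
        # Assumption: the PDC emulator already has the most current password,
        # so it has nobody to forward to.
        return False
    return status in FORWARD_TO_PDC


def forwarded_requests_per_sec(attempts_per_sec: float, status: AuthStatus,
                               is_pdc_emulator: bool) -> float:
    """Extra requests per second this DC would send on to the PDC emulator."""
    if should_forward_to_pdc(status, is_pdc_emulator):
        return attempts_per_sec
    return 0.0
```

In this model, a non-PDC DC receiving about 220 attempts per second that all fail with STATUS_PASSWORD_EXPIRED would pass roughly 220 requests per second on to the PDC emulator, while successful logons would pass on none. If the second DC held the PDC emulator role, that would fit with most of its load coming from the first DC.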
Aha! That is why it was affecting both DCs. (Why the account was not locked out, I still have to find out.) Once we knew the problem was coming from a particular account, which we also confirmed by running a quick network capture on the DC to see which IP address all this traffic was coming from, the offending services were shut down, and peace and calm returned to the land of the Domain Controllers.
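If you export a capture like that to something scriptable, finding the culprit comes down to counting packets per source IP. Below is a rough Python sketch. The Packet record is a hypothetical, minimal export format (not what any particular capture tool produces), and the port list (Kerberos 88, RPC endpoint mapper 135, LDAP 389, SMB 445) is my assumption about which DC traffic is worth counting.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical minimal record exported from a network capture."""
    timestamp: float  # seconds since the capture started
    src_ip: str
    dst_port: int


# Assumed authentication-related ports on a DC: Kerberos, RPC endpoint
# mapper, LDAP and SMB.
AUTH_PORTS = frozenset({88, 135, 389, 445})


def top_talkers(packets: Iterable[Packet],
                top_n: int = 5) -> List[Tuple[str, float]]:
    """Return (source IP, approximate packets/sec) for the busiest sources."""
    auth = [p for p in packets if p.dst_port in AUTH_PORTS]
    if not auth:
        return []
    # Approximate rate: count divided by the span of the capture window.
    span = max(p.timestamp for p in auth) - min(p.timestamp for p in auth)
    span = max(span, 1.0)  # avoid dividing by ~0 for very short captures
    counts = Counter(p.src_ip for p in auth)
    return [(ip, n / span) for ip, n in counts.most_common(top_n)]
```

A machine hammering the DCs the way ours was would sit at the top of that list with a rate in the hundreds per second, far above everything else.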
Now comes the stage where I go out with my big virtual cannon and boink some people over the head!!!