A customer called me in to look at an issue occurring on their Windows 8.1 image that was deployed in their XenDesktop site. Images were deployed using PVS, using the Write Cache in RAM with Overflow to Disk feature. Write Cache RAM size was set to 256MB, a fairly standard size for a desktop OS, and the write cache disk was 10GB. The issue the customer noticed was that the write cache was filling very rapidly, in some cases in as little as 30 minutes after startup. As PVS admins will know, once the write cache fills it's game over for that VM: it will blue screen or, in our case, just lock up and have to be forcibly reset using the hypervisor console.
Upon investigation, the write cache was certainly filling rapidly. As you will notice in the above screenshot, write cache usage is over 65%, and this was approximately 15 minutes after the VM booted. Obviously some process was performing a large number of disk writes to cause the cache to fill. A quick look at Task Manager led us to our first suspect, TiWorker.exe, which was consuming a large amount of CPU and also performing a lot of disk IO. In fact, this was occurring across all VMs on the farm and saturating the host, as per the below screenshot. Rather curiously, it seemed to become worse around midday. (It should be noted there were very few active users on the farm while all this resource usage was occurring.)
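As a rough sanity check on those numbers, here is a minimal sketch that extrapolates when the cache would hit 100%. It assumes the cache fills at a constant rate, which is a simplification on my part, and the function name minutes_until_full is purely illustrative and not part of any PVS tooling.

```python
def minutes_until_full(used_fraction, minutes_elapsed):
    """Estimate the minutes remaining until the write cache is full.

    Assumes a constant (linear) fill rate since boot, which is only an
    approximation of how a real workload behaves.
    """
    if not 0 < used_fraction <= 1:
        raise ValueError("used_fraction must be greater than 0 and at most 1")
    if minutes_elapsed <= 0:
        raise ValueError("minutes_elapsed must be positive")
    # Total time to reach 100% at the observed average rate.
    total_minutes = minutes_elapsed / used_fraction
    return total_minutes - minutes_elapsed


# Figures from the screenshot described above: 65% used, 15 minutes after boot.
remaining = minutes_until_full(0.65, 15)
print(f"Roughly {remaining:.0f} more minutes until the write cache is full")
```

With the figures above, this puts the VM only a few minutes from a full cache, which lines up with the lock-ups the customer was seeing inside half an hour of startup.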
Looking at an individual VM showed the following:
You can see a spike in CPU activity, with a corresponding increase in disk activity. So, on to finding the cause. A bit of research showed TiWorker.exe was part of the Windows Update engine, but there were no updates being installed at the time. In fact, none were available, as the customer had recently updated this image with the latest Windows Updates. I came across a Microsoft forum post here that started to shed some light on the problem. It turns out that, periodically, Windows Update will do some additional compression and cleanup work on the Windows component store (C:\Windows\WinSxS), where installed updates are kept, and compress the entire lot using the LZX compression routine that was introduced in March 2014 (see here).
What had happened is that the customer booted the PVS image in private mode, installed the latest batch of Windows Updates, then shut down and reverted to Standard mode (with Write Cache in RAM w/Overflow to Disk). Windows hadn't completed all of its post-update cleanup routines, however, so it kicked them off on the next startup, which happened to be when all VMs were in Standard mode. Because a Standard mode image discards its changes at reboot, the cleanup could never finish: it filled the write cache on the VMs, they would lock up/freeze/crash, reboot, and start the same process all over again.
The resolution was pretty simple: open the image in private mode again, and let it sit there for Windows Update to do its thing. In this instance, going by the XenCenter performance graph for the VM, it took around 4 hours for disk activity to subside. Once that was complete we reverted to Standard mode, and no more issues!
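If you would rather not watch a performance graph for hours, one possible extra check before sealing the image is whether TiWorker.exe is still running. The sketch below parses the CSV output of the built-in tasklist command (tasklist /FO CSV /NH). The helper names are my own invention for illustration, and a missing TiWorker.exe process is only a hint, not proof, that the cleanup has finished, so treat it as a supplement to the disk activity graph rather than a replacement.

```python
import csv
import io
import subprocess


def process_names(tasklist_csv):
    """Return the lowercase image names found in `tasklist /FO CSV /NH` output."""
    names = set()
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Real rows look like "name","pid","session","session#","mem".
        # Informational lines (e.g. "INFO: No tasks ...") have no PID column.
        if len(row) >= 2 and row[1].strip().isdigit():
            names.add(row[0].lower())
    return names


def servicing_active(tasklist_csv, image="TiWorker.exe"):
    """True if the given image name appears in the tasklist output."""
    return image.lower() in process_names(tasklist_csv)


def query_tasklist():
    """Run tasklist and return its CSV output (Windows only)."""
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On the private mode image you could call servicing_active(query_tasklist()) every few minutes and only shut down once it has returned False for a sustained period and disk activity has also dropped off.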