I'm writing this as a write-up of my issues with VMAN, specifically the backup process, and also as a place to gather information about how others are handling it.
We installed VMAN in June 2016. Details of the infrastructure are as follows:
VMAN running on ESX 5.5
Collecting data from 8 vCenters, totaling 17,000 VMs, 850 hosts, 2500 datastores, and 150 clusters
We were backing up the appliance using Avamar. Avamar backs up the VM by taking a snapshot, backing up the data, removing the snapshot, and running the disk consolidation job.
The process ran fine until the database grew to over 1TB. After that, the appliance started crashing without warning. When it crashed, it could not be recovered with a simple reboot because it would not accept any commands, including power off or vMotion. Recovery involved moving all of the remaining VMs off the host and rebooting the host. When the host rebooted, the appliance would migrate to another host, where it could be powered back on.
SolarWinds support was unable to find an answer because the VM was down. VMware was unable to find a solution because there were no error logs or other indication of what was occurring. SolarWinds said it was a VMware problem since the VM was down and not functioning, and VMware said it was a SolarWinds problem since we had no other problems in our environment.
The VM crashed beyond repair in early November, and we had no good backup. The VM was rebuilt and seemed to be working fine until mid-January, when the problem resurfaced.
After extensive research on my part, the root cause turned out to be that the disk consolidation job was not completing. ESX 5.5 changed the disk consolidation process to be less intrusive, preventing the VM from going into a stunned state. As a result, the disk consolidation job cannot keep up with the amount of IO from the data collections. The job times out, causing backend disk problems.
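A rough way to picture why the job never finishes: while the snapshot delta is being merged, new writes keep landing in it, so the delta only shrinks if the merge rate exceeds the write rate. The sketch below is a toy model of that idea, not VMware's actual algorithm, and every number in it is made up for illustration.

```python
def consolidation_hours(delta_gb, merge_gb_per_hr, write_gb_per_hr, timeout_hr):
    """Toy model: hours needed to drain a snapshot delta, or None if the
    job cannot finish before the timeout. Not VMware's actual algorithm."""
    net_rate = merge_gb_per_hr - write_gb_per_hr  # net shrink rate of the delta
    if net_rate <= 0:
        return None  # writes arrive at least as fast as they are merged
    hours = delta_gb / net_rate
    return hours if hours <= timeout_hr else None


# Hypothetical numbers, chosen only to show the two cases.
print(consolidation_hours(200, 50, 10, timeout_hr=12))  # 5.0
print(consolidation_hours(200, 50, 48, timeout_hr=12))  # None (would need 100 hours)
```

In the second case the job would eventually finish if it were allowed to run, which is consistent with removing the timeout helping only when the VM was powered off and generating no IO.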
VMware recommended removing the timeout condition. This worked with the VM completely powered off, but the job took about 2.5 days to complete. However, it later caused additional problems: the VM crashed during a consolidation job and we could not power it back on until the job finished.
We recently stopped using Avamar to back up the data and moved to TSM. The downside of TSM is that it does not have a PostgreSQL agent, so we have to back up the database to a flat file and then have TSM back up the flat file.
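For anyone unfamiliar with the dump-to-file approach, here is a minimal sketch of the general idea using the standard pg_dump tool. This is not the SolarWinds script, and I am assuming a plain pg_dump in custom format; the database name, user, host, and output directory are placeholders, not the appliance's real values.

```python
import subprocess
from datetime import date
from pathlib import Path


def build_dump_command(db_name, out_dir, user="postgres", host="localhost"):
    """Return (command, output_path) for a pg_dump custom-format dump."""
    out_path = Path(out_dir) / f"{db_name}-{date.today():%Y%m%d}.dump"
    cmd = [
        "pg_dump",
        "-h", host,
        "-U", user,
        "-Fc",                # custom format: compressed, restored with pg_restore
        "-f", str(out_path),  # write the dump to this flat file
        db_name,
    ]
    return cmd, out_path


def dump_database(db_name, out_dir):
    """Run the dump; the resulting file is what the file-level backup picks up."""
    cmd, out_path = build_dump_command(db_name, out_dir)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if pg_dump fails
    return out_path


if __name__ == "__main__":
    # "vman_db" and /backup/vman are placeholders for illustration only.
    cmd, path = build_dump_command("vman_db", "/backup/vman")
    print(" ".join(cmd))
```

Building the command separately from running it makes it easy to print and check before pointing it at a 1TB+ database.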
The backup script is referenced at Perform backups in the Virtualization Manager - SolarWinds Worldwide, LLC. Help and Support. The database is currently about 1.7TB. I chose the custom backup option; it takes about 20 hours to complete and results in a 100GB file. During the backup job, the application goes in and out of a usable state, so we have numerous missing data points, with gaps sometimes lasting multiple hours. After the backup completes, the application returns to a usable state. As a result, we cannot back up the database more than once a week.
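To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It assumes decimal units (1TB = 10^12 bytes) and that the whole 1.7TB is read during the 20 hours, which may not be exactly true.

```python
DB_BYTES = 1.7e12       # ~1.7TB database (decimal units assumed)
DUMP_BYTES = 100e9      # ~100GB backup file
BACKUP_HOURS = 20
HOURS_PER_WEEK = 7 * 24

throughput_mb_s = DB_BYTES / (BACKUP_HOURS * 3600) / 1e6
size_ratio = DB_BYTES / DUMP_BYTES
weekly_impact = BACKUP_HOURS / HOURS_PER_WEEK

print(f"effective read rate: {throughput_mb_s:.1f} MB/s")       # 23.6 MB/s
print(f"database-to-backup size ratio: {size_ratio:.0f}:1")     # 17:1
print(f"share of the week under backup: {weekly_impact:.1%}")   # 11.9%
```

So even a single weekly backup puts roughly an eighth of the week at risk of missing data points, which is why a nightly full backup is not an option at this size.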
Next steps are to come up with a way to do incremental backups. If anyone is already running a weekly full backup with nightly incrementals, I would be curious to hear how you have it configured.