Voice Impairment
Incident Report for VoiceNEXT
Postmortem

VoiceNEXT sincerely apologizes for the issues caused by last week’s service impairment. We understand the importance of reliable phone service for your business and the impact that even a brief outage can have on your operations. Please see the postmortem report below, along with the steps we have taken to learn from our mistakes and make our services better than ever.

Issue Summary:

On Friday, March 15th, at approximately 4:10 PM Eastern, VoiceNEXT became aware that phones were unable to make or receive calls. Customers reported seeing “No service” on their telephones and receiving fast busy signals on both inbound and outbound calls. By 4:30 PM, VoiceNEXT engineers had identified a critical memory spike on a core controller server. All systems were immediately rebooted to clear the memory cache and restore service. Service was temporarily restored, but several minutes later another memory spike brought the server down again.

Upon further analysis, it was determined that an error in a system script was continuously overloading the core server’s memory and eventually took the server down. Because of the failover logic in our infrastructure, the redundant servers also attempted to process the faulty script, causing a cascading failure across all servers. This continued for approximately 45 minutes while engineers worked to identify the source of the memory spike.
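For illustration only, the sketch below shows one way a memory watchdog can catch a runaway process and restart it before it exhausts a server. The service name, memory threshold, and restart command are placeholders and do not represent VoiceNEXT's production tooling.

    #!/usr/bin/env python3
    # Illustrative memory watchdog. The monitored service name, memory
    # threshold, and restart command are assumptions for this sketch.
    import subprocess
    import time

    import psutil

    PROCESS_NAME = "core-controller"    # hypothetical controller process name
    MEMORY_LIMIT_BYTES = 8 * 1024 ** 3  # hypothetical ceiling: 8 GiB resident
    CHECK_INTERVAL_SECONDS = 10

    def resident_memory(name):
        """Total resident memory (bytes) of all processes matching name."""
        total = 0
        for proc in psutil.process_iter(["name", "memory_info"]):
            if proc.info["name"] == name and proc.info["memory_info"]:
                total += proc.info["memory_info"].rss
        return total

    def main():
        while True:
            if resident_memory(PROCESS_NAME) > MEMORY_LIMIT_BYTES:
                # Restart only the runaway service instead of rebooting the host.
                subprocess.run(["systemctl", "restart", PROCESS_NAME], check=False)
            time.sleep(CHECK_INTERVAL_SECONDS)

    if __name__ == "__main__":
        main()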

Once we positively identified the cause of the memory spike, we increased memory allocation across all servers and corrected the faulty script, restoring service and normal performance. We actively monitored all services and made further repairs over the following 48 hours before determining the issue to be resolved.

Incident Timeline (all times Eastern):

4:10 PM: Customers began reporting issues to VoiceNEXT

4:17 PM: Incident lead assigned

4:30 PM: Memory spike identified

4:34 PM: Controller server rebooted

4:40 PM: Services temporarily restored

5:00 PM: Additional reports of service issues received

5:30 PM: Root cause of memory spike identified

5:45 PM: Server memory expanded and faulty script patched

6:00 PM: Servers began exhibiting normal behavior

6:30 PM: All servers up and operating normally

Follow-up Actions:

Server memory was expanded well beyond recommended levels to ensure future continuity of service. System developers were engaged over the following weekend to identify and permanently repair the faulty script, ensuring that, should a similar incident occur, the error is handled automatically by the system without downtime. Further analysis of our redundancy architecture is scheduled so we can learn from this incident and improve our network infrastructure.
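As an illustration of this kind of self-healing safeguard, the sketch below shows a maintenance script that caps its own memory use so a runaway loop fails inside the script rather than taking down the server. The limit and the (empty) maintenance routine are placeholders, assuming a Linux host; this is not VoiceNEXT's actual code.

    #!/usr/bin/env python3
    # Illustrative guard for a periodic maintenance script. The memory limit
    # and the maintenance routine are placeholders for this sketch.
    import resource

    # Cap this process's address space so a runaway loop fails inside the
    # script instead of exhausting the controller server's memory.
    LIMIT_BYTES = 2 * 1024 ** 3  # hypothetical 2 GiB budget for the script
    resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))

    def run_maintenance():
        # The actual maintenance work would go here.
        pass

    if __name__ == "__main__":
        try:
            run_maintenance()
        except MemoryError:
            # Fail fast and leave call processing untouched; a supervisor can
            # retry or alert without a service-wide outage.
            raise SystemExit("maintenance script exceeded its memory budget")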

We will also be rolling out an update to all servers in Q2 of this year that will further enhance system reliability and security and provide several exciting new features and improvements.

Posted Mar 20, 2024 - 19:29 EDT

Resolved
This incident has been resolved. We will continue to monitor and will send a Reason for Outage shortly.
Posted Mar 16, 2024 - 13:17 EDT
Update
We are seeing continued stability for all services. We will continue to monitor throughout the weekend and provide relevant updates.
Posted Mar 15, 2024 - 19:52 EDT
Monitoring
Systems are operating normally at this time. We will continue actively monitoring and provide further updates.
Posted Mar 15, 2024 - 18:37 EDT
Identified
VoiceNEXT has become aware that some phones are intermittently unable to make or receive calls. The issue has been identified and is being worked on.
Posted Mar 15, 2024 - 17:26 EDT
This incident affected: Hosted Servers (as.voicenext.com, as2.voicenext.com, as4.voicenext.com, as5.voicenext.com, as6.voicenext.com, as7.voicenext.com, nb.voicenext.com), Inbound Calling Services, and Outbound Calling Services.