Executive Summary
As organizations rapidly migrate to cloud-based services, a fundamental shift in network architecture is creating a critical visibility and accountability gap. Traditional private MPLS networks provided end-to-end service level agreements (SLAs), clear ownership boundaries, and deterministic troubleshooting paths. The move to public internet-based cloud services—especially internet-delivered security stacks such as SSE/SASE—reduces these guarantees, leaving organizations with limited diagnostic authority and fewer enforceable remedies when performance degrades.
This document examines the technical and operational implications of this architectural shift, focusing on the loss of network visibility, the limitations of current troubleshooting methodologies, and the contrast between public internet delivery versus private connectivity models that restore clearer accountability boundaries.
The Traditional Network Model: Accountability Through Ownership
MPLS Architecture and SLA Guarantees
In traditional enterprise networks, MPLS circuits provided:
- End-to-end SLAs covering uptime, latency, jitter, and packet loss (as defined by the provider contract)
- Single provider accountability for the managed path between endpoints
- Private infrastructure where the provider maintains visibility into intermediate hops
- Clear demarcation points establishing where customer responsibility ends and provider responsibility begins
- More predictable routing with fewer unknown transit networks in the delivery path
The Troubleshooting Advantage
When a performance issue occurred on MPLS:
- Clear SLA metrics identified when thresholds were breached
- Provider had monitoring and diagnostic access to their managed segment
- Ticket escalation paths were defined and contractually bounded
- Root cause analysis could often identify the responsible segment with high confidence
- Remediation timelines were driven by contractual commitments and penalties
The Internet-Based Cloud Model: The Accountability Void
Architecture of Modern Cloud Services
Services like Microsoft 365, internet-delivered security platforms (SSE/SASE), and other SaaS applications route traffic across the public internet, introducing:
- Multiple autonomous systems (AS) with independent policies and peering constraints
- Dynamic routing that can change based on BGP decisions, congestion, and peering agreements
- Tier 1, 2, and 3 ISP hand-offs where packets traverse multiple autonomous systems
- No enforceable end-to-end SLA for latency, jitter, or packet loss across the multi-AS path (beyond what your own ISP contracts cover)
- Limited visibility into intermediate hops and routing decisions outside the networks you directly contract with
- No direct contractual relationship with intermediate carriers, and no guaranteed ability to open support tickets with every transit provider involved
- An indirect escalation path where remediation is often “best effort” and timelines are uncertain
The SSE/SASE Use Case (Cloud Security as an Example)
SSE/SASE security stacks (SWG, CASB, ZTNA, DLP, firewall-as-a-service) commonly operate as cloud enforcement points that user or site traffic must reach before accessing SaaS applications or internal resources. Consider the path complexity:
User or Branch Site →
Local ISP →
ISP Point of Presence (PoP) →
One or More Transit/Peering Networks →
Security Vendor Enforcement PoP →
Additional Internet Transit [Return/Forward Path May Differ] →
Destination (SaaS Application or Internal Resource)
Each segment introduces variables:
- Peering relationship quality and capacity
- Congestion during peak hours
- Route instability and BGP policy shifts
- Provider-specific traffic shaping behaviors
- Geographic routing inefficiencies or suboptimal PoP selection
When thousands of users experience degraded performance through an internet-delivered security stack, you often cannot directly:
- Engage every intermediate ISP for investigation (you are not their customer)
- Access routing tables or traffic statistics from all transit providers
- Enforce remediation on networks you do not contract with
- Guarantee the path will remain consistent across subsequent connections
Current Troubleshooting Approaches and Their Limitations
MTR (My Traceroute) – Limited Diagnostic Value
MTR combines traceroute and ping functionality to show packet loss and latency per hop. However:
Limitations:
- ICMP deprioritization: Many networks rate-limit or deprioritize ICMP, making results unreliable for user traffic impact
- Asymmetric routing: Return paths differ from forward paths; MTR typically shows only one direction
- No context for packet loss: A hop showing loss may be ICMP-specific and not reflect actual application traffic loss
- No direct remediation authority: Identifying a problematic hop in a third-party AS often yields no direct path to a fix
- Dynamic routing: The tested path can differ from production paths minutes later
Even when MTR does pinpoint a problematic hop, you typically have:
- No relationship with that provider
- No proof that production TCP/UDP flows experience the same loss
- No guaranteed ability to influence routing to avoid that hop
- No enforceable escalation mechanism to force remediation
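As an illustration of what MTR output can and cannot tell you, here is a minimal sketch assuming a Linux host with an mtr build that supports JSON report output (roughly version 0.87 and later; field names vary slightly between versions). It runs a report toward a placeholder hostname and flags hops with sustained ICMP loss. A flagged hop still carries every limitation listed above: the loss may be ICMP-specific, the path may change, and you likely have no authority over the network that owns the hop.

```python
import json
import subprocess

# Hypothetical target: replace with your enforcement-point or SaaS hostname.
TARGET = "gateway.example-sse-vendor.com"
LOSS_THRESHOLD_PCT = 5.0  # hops above this get flagged for the escalation evidence bundle

def run_mtr(target: str, cycles: int = 20) -> dict:
    """Run mtr in report mode with JSON output."""
    result = subprocess.run(
        ["mtr", "--report", "--report-cycles", str(cycles), "--json", target],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

def summarize(report: dict) -> None:
    # JSON layout is roughly report -> hubs (one entry per hop); key names vary by mtr version.
    hubs = report.get("report", {}).get("hubs", [])
    for hop in hubs:
        host = hop.get("host", "?")
        loss = float(hop.get("Loss%", 0.0))
        avg = float(hop.get("Avg", 0.0))
        flag = "  <-- ICMP loss (may not affect real TCP/UDP flows)" if loss >= LOSS_THRESHOLD_PCT else ""
        print(f"{hop.get('count', '?'):>3}  {host:<40} loss={loss:5.1f}%  avg={avg:7.1f} ms{flag}")

if __name__ == "__main__":
    summarize(run_mtr(TARGET))
```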
ThousandEyes – Visibility Without Authority
Distributed monitoring platforms can provide global visibility into ISP outages, BGP events, and performance degradation from multiple vantage points.
Limitations:
- Detection without guaranteed resolution: Seeing elevated latency or loss doesn’t guarantee a corrective action path
- Correlation challenges: An ISP incident may not map cleanly to your specific user paths or apps
- Coverage gaps: Not every peering point or last-mile ISP is measured in every geography
- No enforcement mechanism: Identifying the likely root network does not grant control over that network
- Cost vs. value risk: Monitoring can confirm issues you may only be able to mitigate indirectly
Even when the platform identifies the likely problem network, you typically cannot:
- Open a ticket with that provider (you are not their customer)
- Force your ISP to permanently avoid specific transit networks
- Guarantee your traffic won’t route through that network tomorrow
- Hold every segment accountable for end-to-end SLA outcomes
The Diagnostic Dead End
The typical troubleshooting workflow reveals the constraint:
- Users report slowness accessing apps through an internet-delivered security stack
- Run MTR from multiple locations – identifies a latency spike at an intermediate hop
- Check distributed monitoring – confirms elevated latency in that region/provider
- Contact your ISP – they confirm the issue appears downstream, outside their directly managed network
- Contact the security/SaaS vendor – they confirm their infrastructure is healthy and/or traffic arrives degraded upstream
- Result: Likely problem segment identified, but remediation is indirect, authority is limited, and resolution timelines are uncertain
The Private Cloud Alternative: Maintaining Visibility and Control
ExpressRoute (Azure) and Direct Connect (AWS)
Private connectivity can restore clearer accountability boundaries and more deterministic paths for defined workloads:
Architecture:
- Dedicated circuits from your location to a colocation facility
- Cross-connects to cloud provider infrastructure within the same facility
- Reduced internet traversal for production traffic destined to supported cloud edges
- Provider SLAs covering defined segments (as contractually defined)
- More predictable routing with fewer unknown transit networks
Letter of Authorization/Connecting Facility Assignment (LOA-CFA):
When establishing ExpressRoute or Direct Connect, the cloud provider issues an LOA-CFA document containing:
- Authorization for your provider to connect equipment on your behalf
- Specific port locations (cage, cabinet, patch panel identifiers)
- Media type requirements (single-mode fiber, copper)
- Technical specifications for the cross-connect
- Circuit identifiers linking the physical connection to your virtual circuits
Restored Accountability Model
With private connectivity:
Clear ownership boundaries:
- Customer responsibility: On-premises equipment to demarcation point
- Carrier responsibility: Demarcation point to colocation facility/cross-connect segment (per contract)
- Cloud provider responsibility: Cross-connect to regional cloud infrastructure (per provider SLA)
Troubleshooting advantages:
- Each party has better diagnostic access to their segment
- SLAs define performance metrics and remediation processes for covered segments
- Support escalation paths are defined in contracts and provider support models
- Root cause identification can follow clearer demarcation logic
Example incident flow:
- Cloud monitoring indicates the issue aligns to a carrier circuit segment
- Ticket opened with the carrier per the connectivity support agreement
- Carrier confirms physical degradation in their infrastructure
- Remediation initiated with a committed support workflow
- Performance restored and root cause documentation provided
Hybrid Architecture Considerations
Organizations can implement hybrid models:
Private connectivity for:
- Business-critical SaaS and cloud workloads that support private on-ramps
- Internal applications and data center resources
- Voice and video conferencing (latency-sensitive)
- Healthcare/financial applications requiring compliance
Public internet for:
- General web browsing
- Non-critical SaaS applications
- Guest/BYOD traffic
This approach maintains clearer control where it matters most while accepting best-effort limitations where they are tolerable.
The SSE/SASE Dilemma: Security at the Cost of Determinism
Architectural Challenge
Internet-delivered security architectures route user and site traffic through cloud enforcement points. In many common deployments, reaching those enforcement points involves best-effort internet transit for at least part of the path. While some organizations can reduce exposure to uncontrolled transit for specific on-ramps (e.g., via colocation ecosystems and private interconnect constructs), this does not eliminate best-effort segments for all users—especially remote/home ISP users and mobile users.
For organizations with MPLS networks:
- Cloud security enforcement often requires an internet on-ramp before reaching provider PoPs
- This creates a parallel delivery model: private connectivity for internal paths, best-effort internet for cloud security insertion and many SaaS paths
- The network team loses deterministic control over a critical component of the application delivery chain
For distributed workforces:
- Remote users connect via home ISPs of varying quality and policies
- Traffic routes through multiple ISPs before reaching cloud security enforcement points
- Each user location introduces unique routing paths and potential failure modes
- Troubleshooting becomes location-specific with limited uniform remediation options
Private Peering Options for SSE/SASE Platforms
Many SSE/SASE vendors offer private connectivity alternatives that can restore MPLS-like characteristics for site-to-enforcement-point traffic. These options typically involve colocation facilities and direct interconnect models:
Common Private Connectivity Models:
- Colocation fabric interconnects (Equinix Fabric, Megaport, PacketFabric, CoreSite) providing direct Layer 2/3 connections to security vendor PoPs
- Cloud exchange platforms where enterprises and vendors meet at common peering points
- Direct cross-connects within shared data center facilities
- Vendor-specific programs (e.g., Zscaler Cloud Connector, Netskope Private Access via colocation partners) designed to bypass public internet for site connectivity
What Private Peering Restores:
- Clearer SLA boundaries between customer circuit, colocation provider, and security vendor infrastructure
- Reduced AS hops by eliminating best-effort internet transit for site-to-PoP segments
- More predictable routing with deterministic paths over private infrastructure
- Direct troubleshooting paths with each segment owner (circuit provider, colo facility, security vendor)
- Performance consistency for office/branch locations with private connectivity
Example SLA segmentation for a privately connected branch:
- Branch to colo: Carrier MPLS SLA
- Colo cross-connect: Equinix SLA
- PoP to cloud apps: Vendor SLA + destination provider paths
What Private Peering Does NOT Solve:
- Remote/mobile workforce connectivity: Home ISP users and mobile workers still traverse best-effort internet paths to reach enforcement points—this is often 40-60% of enterprise users today
- Last-mile variability: Even with private peering to PoPs, traffic destined to SaaS applications still traverses internet paths from the PoP to the destination
- Geographic distribution challenges: Not all user locations can economically reach colocation facilities; distant users may still hairpin through internet paths
- Cost and complexity: Private connectivity requires colocation presence, cross-connect fees, and increased circuit costs—prohibitive for some deployment scales
- Operational overhead: Managing hybrid connectivity models (private for sites, internet for remote users) increases architecture complexity
Remote and mobile users in particular remain subject to:
- Home ISP quality variations (cable, DSL, fiber, 5G with vastly different characteristics)
- Best-effort internet routing to reach enforcement points
- No enforceable SLA coverage for their access paths
- Geographic inconsistency (users in different regions experience different transit networks)
Result: Even with significant investment in private peering for sites, a substantial portion of your user base still experiences the internet visibility gap described throughout this document. This creates a two-tier troubleshooting reality: deterministic paths for sites, best-effort diagnostics for remote workers.
Cost-Benefit Considerations:
- When private peering makes sense: Large site deployments, compliance-driven environments, latency-sensitive applications, predictable traffic patterns, and budgets that support colocation infrastructure
- When internet delivery is acceptable: Distributed workforces, small/medium site counts, cost-constrained deployments, and applications tolerant of variable performance
- Hybrid models: Many enterprises use private connectivity for headquarters and major branches while accepting internet delivery for smaller sites and remote users
The Hybrid Architecture Reality
Most modern enterprises operate hybrid connectivity models that combine private and public paths:
Typical deployment pattern:
- Site traffic (30-50% of users): Private peering via colocation facilities to enforcement points, restoring deterministic performance and clearer SLA boundaries
- Remote/mobile traffic (40-60% of users): Best-effort internet transit from home ISPs and mobile carriers, subject to the visibility gaps described throughout this document
- SaaS-bound traffic (all users): Even after security inspection, traffic to SaaS destinations often traverses internet paths with limited visibility
This creates operational complexity:
- Two troubleshooting methodologies: Deterministic root cause for site traffic, best-effort diagnostics for remote users
- Inconsistent user experiences: Office users may have reliable performance while remote users experience variability
- Communication challenges: Explaining to leadership why “the same application” performs differently for different user populations
- Monitoring complexity: Tracking baselines and SLAs requires segmenting metrics by connectivity model (see the sketch after this list)
- Incident management: Distinguishing between “actionable with authority” vs. “best-effort mitigation” incidents
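One practical answer to the monitoring-complexity point is to tag every synthetic measurement with the connectivity model it traversed and compute baselines per model, never in aggregate. A minimal sketch with entirely hypothetical sample values:

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

@dataclass
class SyntheticResult:
    app: str
    connectivity_model: str   # e.g. "private-peering" (site) vs "internet" (remote)
    region: str
    ttfb_ms: float

def baselines_by_model(results: list[SyntheticResult]) -> dict[tuple[str, str], float]:
    """Average TTFB per (app, connectivity model) so site and remote populations are never mixed."""
    buckets: dict[tuple[str, str], list[float]] = defaultdict(list)
    for r in results:
        buckets[(r.app, r.connectivity_model)].append(r.ttfb_ms)
    return {key: mean(values) for key, values in buckets.items()}

# Hypothetical sample data: the same application, very different experience per model.
samples = [
    SyntheticResult("crm", "private-peering", "us-east", 180.0),
    SyntheticResult("crm", "private-peering", "us-east", 195.0),
    SyntheticResult("crm", "internet", "us-east", 420.0),
    SyntheticResult("crm", "internet", "us-east", 510.0),
]
for (app, model), avg_ttfb in baselines_by_model(samples).items():
    print(f"{app:<6} {model:<16} avg TTFB = {avg_ttfb:.0f} ms")
```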
The Support Escalation Problem
When users report performance issues:
What you can do:
- Verify your contracted circuits are performing within SLA where applicable
- Confirm DNS resolution and endpoint posture flows are functioning
- Validate local network configurations, tunnel health, MTU correctness, and capacity headroom
What you often cannot do directly:
- Diagnose or remediate issues in every upstream transit network between users and the provider
- Force routing changes across networks you do not control
- Hold intermediate networks accountable to your end-to-end performance expectations
- Guarantee consistent outcomes across different user geographies and ISPs
Vendor limitations:
- They can confirm their infrastructure and PoPs are healthy
- They can observe when traffic arrives degraded at their edges
- They cannot fully control upstream internet routing decisions across third-party networks
- They cannot provide enforceable SLAs covering customer ISP segments and uncontrolled internet transit
Cloud Security (SSE/SASE) Mitigations (practical levers you can actually pull):
- Control PoP selection where possible: Prefer the nearest/most stable enforcement points per region and avoid unnecessary PoP drift that introduces new transit paths
- Engineer egress diversity: If you have multiple DIA providers, ensure security-bound traffic can steer to the better-performing provider during incidents
- Build tunnel redundancy and health criteria: Use redundant GRE/IPsec paths with explicit failover triggers (latency/jitter/loss thresholds, not just “tunnel up/down”); a health-check sketch follows this list
- Validate MTU and fragmentation behavior end-to-end: Many “mysterious slowness” incidents trace back to PMTUD/MTU issues across tunnels and intermediate networks
- Baseline enforcement point performance by geography: Track DNS time, TLS handshake time, TTFB, and throughput so “internet variability” is measurable
- Use application-layer synthetic testing: Measure real login flows and critical SaaS transactions through the security stack, not only ICMP/MTR outputs
- Create an emergency bypass policy (pre-approved): For defined critical apps, establish a documented, security-approved temporary bypass with scope, logging, and rollback
- Instrument endpoint and client behavior: Confirm forwarding mode, PAC/proxy behaviors, certificate inspection impacts, and endpoint resource constraints aren’t the root cause
- Standardize an escalation evidence bundle: Include impacted regions, selected enforcement point, tunnel metrics, app timing breakdowns, and a known-good comparison path
- Align with ISPs on routing influence options: Where supported, use provider features (communities / preferred peer routing / managed reroutes) to avoid chronic congestion points
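To make the tunnel-redundancy lever concrete, the sketch below probes the far end of a primary tunnel with ICMP and evaluates explicit loss/latency thresholds to decide whether a backup path should be preferred. The peer address, thresholds, and ping syntax (Linux) are assumptions; a production version would probe all tunnels continuously and drive routing or vendor tunnel configuration rather than printing a decision.

```python
import re
import subprocess
from dataclasses import dataclass

@dataclass
class ProbeResult:
    loss_pct: float
    avg_rtt_ms: float

# Hypothetical values: tune per region and per tunnel.
PRIMARY_TUNNEL_PEER = "203.0.113.10"   # far end of the primary GRE/IPsec tunnel (example address)
LOSS_FAILOVER_PCT = 2.0
RTT_FAILOVER_MS = 150.0

def probe(target: str, count: int = 20) -> ProbeResult:
    """ICMP probe via the system ping (Linux syntax); parse packet loss and average RTT."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-i", "0.2", target],
        capture_output=True, text=True,
    ).stdout
    loss_match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    rtt_match = re.search(r"= [\d.]+/([\d.]+)/", out)  # rtt min/avg/max/mdev -> capture avg
    loss = float(loss_match.group(1)) if loss_match else 100.0
    avg = float(rtt_match.group(1)) if rtt_match else float("inf")
    return ProbeResult(loss_pct=loss, avg_rtt_ms=avg)

def should_fail_over(result: ProbeResult) -> bool:
    """Fail over on degradation, not only on 'tunnel down'."""
    return result.loss_pct >= LOSS_FAILOVER_PCT or result.avg_rtt_ms >= RTT_FAILOVER_MS

if __name__ == "__main__":
    health = probe(PRIMARY_TUNNEL_PEER)
    print(f"loss={health.loss_pct:.1f}%  avg_rtt={health.avg_rtt_ms:.1f} ms  "
          f"fail_over={should_fail_over(health)}")
```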
Risk Acceptance vs. Risk Management
Deploying internet-delivered services requires accepting that:
- Network performance is variable and can be influenced by factors outside your control
- Troubleshooting authority is constrained to the segments you own or contractually cover
- User experience will vary based on geography, last-mile ISP quality, and transit path selection
- Enforceable end-to-end SLAs are limited when the path includes networks you do not contract with
- Resolution timelines can be uncertain when upstream routing or peering issues occur
This isn’t a criticism of any single vendor—it is the operational reality of delivering applications and security over best-effort internet transit when private connectivity is not end-to-end.
Recommendations and Strategic Considerations
For Organizations Planning Cloud Migration
Conduct realistic assessments (measure the real path, not the ideal path):
- Map user-to-enforcement-point and user-to-app paths by region and access method (office, remote, mobile)
- Test from actual user vantage points using application-layer synthetic transactions, not only ICMP-based tools
- Establish baselines per geography (DNS time, TLS handshake, TTFB, throughput) before migrations; a timing sketch follows this list
- Define “good enough” thresholds and document which parts are best-effort vs. contract-backed
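To illustrate the baselining item above, here is a stdlib-only sketch that splits a single HTTPS request into DNS, TCP connect, TLS handshake, and time-to-first-byte components. The destination is a placeholder; in practice you would run this on a schedule from representative vantage points (office, remote, mobile) and store results by geography and connectivity model.

```python
import socket
import ssl
import time

def timed_fetch(host: str, path: str = "/") -> dict[str, float]:
    """Split one HTTPS GET into DNS, TCP connect, TLS handshake, and TTFB timings (milliseconds)."""
    timings: dict[str, float] = {}

    t0 = time.perf_counter()
    sockaddr = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)[0][4]
    timings["dns_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    sock = socket.create_connection((sockaddr[0], sockaddr[1]), timeout=10)
    timings["tcp_connect_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    tls = ssl.create_default_context().wrap_socket(sock, server_hostname=host)
    timings["tls_handshake_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    tls.sendall(f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode())
    tls.recv(1)  # first byte of the response headers
    timings["ttfb_ms"] = (time.perf_counter() - t0) * 1000

    tls.close()
    return timings

if __name__ == "__main__":
    # Placeholder destination; substitute an endpoint you are entitled to test.
    for metric, value in timed_fetch("www.example.com").items():
        print(f"{metric:>18}: {value:7.1f} ms")
```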
Evaluate connectivity and egress models (design for steering, not hope):
- Multi-provider DIA with deliberate egress policy and tested failover behavior
- Private connectivity (ExpressRoute/Direct Connect, Equinix Fabric, direct peering) for workloads with strict performance needs and sufficient user concentration
- Regional breakout strategies to reduce unnecessary internet AS hops
- Hybrid architectures that apply deterministic connectivity where it matters most
Set realistic expectations:
- Acknowledge authority limits to leadership and document as accepted risk where appropriate
- Redefine SLAs to reflect the portions of the network you control and contractually cover
- Adjust troubleshooting processes to focus on actionable segments and mitigations
- Plan for “best-effort transit variability” as a valid root cause category
For Troubleshooting Teams
Focus efforts where you have control:
- Your circuits and edge: Ensure they meet SLA, are properly sized, and have headroom
- Local infrastructure: Eliminate internal bottlenecks and configuration errors
- DNS and routing policies: Optimize resolvers, split-horizon behavior, and egress selection where possible
- Endpoint health: Verify devices, posture, and client behaviors aren’t contributing to slowness
Accept diagnostic limitations (and collect better evidence):
- MTR and traceroute indicate paths but rarely provide remediation authority
- Distributed monitoring provides visibility but does not guarantee fixes
- ISP support typically ends at their boundary; downstream remediation is indirect
- Vendor support can confirm edge health but cannot control upstream transit networks
Develop response playbooks:
- Document known-good baselines for comparison during incidents
- Create stakeholder communication templates explaining best-effort transit limitations clearly
- Establish escalation criteria distinguishing actionable issues from upstream variability; an evidence-bundle sketch follows this list
- Maintain ISP relationships to influence what routing you can within contracted networks
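One way to standardize the escalation evidence bundle described in the mitigations section is a fixed structure that can be serialized and attached to ISP or vendor tickets. Every field name and sample value below is an illustrative assumption; adapt it to your ticketing and monitoring tooling.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class EvidenceBundle:
    """Illustrative escalation evidence bundle; field names are placeholders, not a vendor schema."""
    incident_id: str
    impacted_regions: list[str]
    selected_enforcement_point: str
    tunnel_metrics: dict[str, float]          # e.g. loss % and average RTT per tunnel
    app_timing_breakdown: dict[str, float]    # e.g. dns_ms, tls_handshake_ms, ttfb_ms
    known_good_comparison_path: str           # description of a healthy baseline path
    collected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

bundle = EvidenceBundle(
    incident_id="INC-0001",
    impacted_regions=["us-east", "eu-west"],
    selected_enforcement_point="sse-pop-ashburn",  # hypothetical PoP identifier
    tunnel_metrics={"primary_loss_pct": 3.2, "primary_avg_rtt_ms": 185.0},
    app_timing_breakdown={"dns_ms": 41.0, "tls_handshake_ms": 310.0, "ttfb_ms": 620.0},
    known_good_comparison_path="branch via private peering to same PoP, TTFB 190 ms",
)
print(json.dumps(asdict(bundle), indent=2))
```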
For Leadership and Decision Makers
Understand the trade-offs:
- Cloud services provide scalability, reduced capital expenditure, and operational flexibility
- Internet delivery introduces variability, indirect escalation, and limited end-to-end authority
- This is not a failure of any vendor—it’s an inherent characteristic of multi-AS best-effort transit
Budget and resource implications:
- Monitoring tools provide visibility but do not create remediation authority
- Private connectivity restores clearer control but requires investment and operational planning
- Support training must shift from “fix everything” to “optimize what we control and mitigate what we can”
- User expectations must be managed; not all performance issues have deterministic fixes
Risk acceptance framework:
- Document best-effort segments as known limitations in your service delivery model
- Establish performance thresholds that account for geographic and ISP variability
- Create incident procedures differentiating controllable vs. uncontrollable contributors
- Communicate limitations to stakeholders before incidents occur
Conclusion
The migration to cloud-based services represents a fundamental architectural shift that extends beyond application hosting. Organizations are replacing private, controlled delivery paths with shared public infrastructure—and with that shift comes reduced determinism, limited authority over intermediate networks, and weaker end-to-end accountability models.
For internet-delivered services—including SSE/SASE and many SaaS platforms—enterprises must accept that:
- End-to-end enforceable SLAs are limited when the path includes networks you don’t contract with
- Troubleshooting can identify contributors you cannot directly remediate
- Performance can vary based on factors outside your influence
- “Best-effort transit variability” is a valid root cause category, even if it’s unsatisfying
Private connectivity options through colocation providers like Equinix can restore deterministic performance for site traffic, but the distributed workforce—now representing 40-60% of enterprise users—remains subject to best-effort internet constraints. This creates a two-tier operational reality that requires different troubleshooting approaches, monitoring strategies, and stakeholder expectations.
This isn’t an argument against cloud adoption—it’s a call for realistic assessment of what changes when you move from private connectivity models to internet delivery. Organizations that understand these constraints can:
- Make informed decisions about which workloads fit which connectivity models
- Set appropriate expectations with stakeholders and users
- Invest in monitoring and telemetry that leads to actionable mitigations
- Maintain private connectivity for truly critical applications where deterministic performance is required
- Accept best-effort variability as an operational reality for distributed workforces
The visibility and accountability gap is real. Acknowledging it is the first step toward making strategic decisions that balance cloud benefits against the operational realities of internet-based delivery.