Full Report
Editor’s note: For retailers, predicting consumers’ desires and demand is the holy grail. For retail IT, the goal is understanding the performance of your ecommerce applications. Here, Japanese online retailer Mercari shows how they used Cloud Profiler and Cloud Trace to understand a complex microservices-based application running on Google Cloud, to meet rigorous SLOs as demand shifts for their products.

The events of 2020 have accelerated ecommerce, increasing demand for and traffic on online marketplaces. Analyst eMarketer predicts that ecommerce sales in the United States will grow 18% in 2020, against an overall fall in total retail sales of 10.5% for the year.

Likewise, our business, the Japan-headquartered consumer-to-consumer marketplace Mercari Inc, is growing rapidly. In the United States alone, we have seen 74% year-on-year growth in monthly average users to 3.4 million. A big part of our success is our robust payment and deposit systems and AI-based fraud monitoring, which enable sellers to list items for purchase and buyers to complete transactions safely.

Mercari started as a monolithic application, but as complexity grew we decided to transition to a microservices architecture. And through it all, tools like Cloud Profiler and Cloud Trace helped us track down performance problems in our code, significantly improving latency.

A microservices menagerie

Today, we run 80+ microservices on Google Cloud with a mix of languages including Go, Python, JavaScript and Java. To deliver this new architecture, we created a gateway-like microservice to route traffic from the soon-to-be-migrated monolithic service to the Google Cloud microservices, which deliver a range of features.

After creating several microservices, we identified common requirements and created a template to accelerate their development. These common requirements included:

- Exporting metrics to Prometheus
- A gRPC server and interceptors
- Error Reporting, Cloud Trace and Cloud Profiler
Error Reporting counts, analyzes and aggregates crashes in running cloud services, while Cloud Trace provides a view of requests as they flow through microservices and Cloud Profiler shows how microservices consume CPU, memory and threads. We then used Python to create a template for machine learning services, also expediting the creation of new microservices. This has enabled us to grow the number of microservices we use in order to address new requirements. However, as our microservices proliferated, we needed to efficiently monitor and understand their performance.

Maintaining SLO a challenge

In particular, we needed to monitor the impact of new versions on the production environment and the efficiency of production operations, so we could maintain our service level objective (SLO) of a 99.95% success rate and 350 milliseconds for 95th-percentile latency. Our engineering team also uses canary deployments to detect issues with new versions of major services. However, despite applying these measures, we found it challenging to maintain our SLO when our business grew faster than expected or during unanticipated spikes in demand.

Some issues can be obvious or easy to detect. For example, if a service is experiencing high CPU utilization, we could simply add or fine-tune a horizontal pod autoscaler (HPA) to resolve the problem. However, other issues may be less obvious. For example, a drop in performance may not be directly tied to a specific release; it may instead be due to unexpected requests, or may arise from changes to multiple functions in a single code release.

Using Cloud Profiler and Cloud Trace to minimize performance issues

In particular, our business-critical UserStats service, which tracks the speed with which a user replies to a message and how fast and reliably a seller ships an item, recently started performing poorly. New feature requirements had prompted us to track how often a seller cancels an order and provide statistics.
However, the change that added this functionality also refactored other functions, so we were unable to identify which function was responsible for the reduced performance. Since most of our services are enabled with Cloud Profiler and Cloud Trace, we turned to these products to investigate and identify the root cause.

[Figure: Cloud Profiler view before the change]

[Figure: Cloud Profiler view after the change]

These two Cloud Profiler views show the CPU time of the call stack increased from 457 milliseconds to 904 milliseconds, with most of the delta attributable to the _UserStats_SellerCancelStats_Handler function. But because other functions also saw variations in their CPU consumption, and because calls occurred in parallel, we found it difficult to identify the cause of latency increases. Because this function call was necessary, we could not simply remove the entire function. We checked Cloud Trace and confirmed the function call had increased overall latency on some requests:

[Figure: Cloud Trace view of request latency]

We analyzed the service with Cloud Profiler and identified hot spots that were contributing to the increase in CPU time consumption. We optimized these hot functions, deployed the new code, and used Cloud Profiler to verify that the changes had the desired effect of reducing the CPU time. Doing so, we were able to improve latency by 10% to 15%!

Simplifying the DevOps experience

Before adopting Cloud Profiler, profiling production services was a tedious and manual undertaking that involved recompiling with debug flags, deploying to production environments, and using disparate tools to collect profiles and perform analysis. Containerization only increased this complexity, further reducing developer productivity. Cloud Profiler enables us to continuously profile production environments with small and simple code changes, replacing the tedious work previously required to set up environments for performance analysis.
Low-overhead continuous profiling with Cloud Profiler helps us react swiftly to changes in service performance by finding root causes and resolving issues quickly. Further, tools such as Cloud Trace and Cloud Profiler require minimal effort to set up and provide a consistent DevOps experience for our service owners. This is particularly important as we grow in the United States and elsewhere.

Without Google Cloud, monitoring, debugging and profiling across production environments that feature a mix of languages, technology stacks, frameworks and containers would be extremely challenging and time-consuming. The release of new features and experiences in tools such as Cloud Profiler makes us glad we chose Google Cloud as our primary cloud platform. We will continue to work with new features and provide feedback to Google Cloud, so it can continue to provide a better service to users.

Visit the Google Cloud website to learn more about Cloud Profiler and Cloud Trace.
Analysis Summary
# Best Practices: Performance Monitoring and Optimization in Microservices Architectures
## Overview
These recommendations focus on establishing continuous performance monitoring, root cause analysis, and optimization practices within complex, multi-language microservices environments, using cloud-native tools to maintain strict Service Level Objectives (SLOs). The core objective is to shift from tedious manual profiling to automated, low-overhead continuous measurement.
## Key Recommendations
### Immediate Actions
1. **Enable Continuous Profiling on Critical Services:** Immediately integrate and enable Cloud Profiler across all production microservices (especially those handling critical paths like payments and user statistics) to gather CPU, memory, and thread consumption data continuously.
2. **Implement Distributed Tracing:** Ensure Cloud Trace (or equivalent distributed tracing) is enabled across all 80+ microservices to visualize request flow, understand inter-service dependencies, and immediately pinpoint latency spikes across the entire request path.
3. **Establish Baseline SLOs and Monitoring:** Formally document and enforce rigorous SLOs, such as Mercari's 99.95% success rate and 350ms latency for the 95th percentile. Configure immediate alerting when these thresholds are approached or breached.
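
As a rough illustration of what checking these two thresholds involves, here is a minimal sketch in Python. Only the 99.95% and 350 ms figures come from the article; the `Request` record, the nearest-rank percentile method, and the function names are assumptions made for this example, not Mercari's implementation.

```python
from dataclasses import dataclass

SUCCESS_RATE_SLO = 0.9995    # 99.95% success rate (from the article)
P95_LATENCY_SLO_MS = 350.0   # 350 ms at the 95th percentile (from the article)


@dataclass
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float
    ok: bool


def p95(latencies):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def check_slo(requests):
    """Return (success_rate, p95_latency_ms, slo_met) for one window."""
    if not requests:
        raise ValueError("need at least one request to evaluate the SLO")
    success_rate = sum(r.ok for r in requests) / len(requests)
    p95_ms = p95([r.latency_ms for r in requests])
    met = success_rate >= SUCCESS_RATE_SLO and p95_ms <= P95_LATENCY_SLO_MS
    return success_rate, p95_ms, met
```

In practice the same comparison would normally be expressed as an alerting rule in the monitoring system rather than in application code; the sketch only makes the arithmetic explicit.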
### Short-term Improvements (1-3 months)
1. **Standardize Service Templates:** Formalize and enforce a development template for new microservices that mandatorily includes instrumentation for core observability tools (Error Reporting, Cloud Trace, Cloud Profiler).
2. **Integrate Metrics Export for Visualization:** Configure all services to export essential operational metrics to a centralized system (e.g., Prometheus). This allows for correlation between application performance metrics and infrastructure scaling (e.g., HPA adjustments).
3. **Develop Root Cause Analysis (RCA) Playbooks:** Create standardized procedures for engineers to combine Cloud Profiler and Cloud Trace data and quickly identify "hot spots" (inefficient functions contributing most to latency) following an SLO breach or performance degradation report.
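
One step such a playbook might include is ranking the spans of a slow trace by self time, that is, time spent in the span itself rather than in its children. The sketch below assumes spans have already been exported as plain dictionaries; the field names and the span names are hypothetical and do not reflect the Cloud Trace data model.

```python
from collections import defaultdict


def rank_spans_by_self_time(spans):
    """Rank spans by self time (own duration minus direct children).

    Each span is a dict with 'id', 'parent' (None for the root), 'name',
    'start_ms' and 'end_ms'. Children that run in parallel can overlap,
    so the result is clamped at zero and should be read as an estimate.
    """
    child_total = defaultdict(float)
    for span in spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["end_ms"] - span["start_ms"]

    ranked = []
    for span in spans:
        duration = span["end_ms"] - span["start_ms"]
        self_ms = max(0.0, duration - child_total[span["id"]])
        ranked.append((span["name"], self_ms))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical trace of one request; all names and timings are invented.
example_trace = [
    {"id": 1, "parent": None, "name": "GetUserStats", "start_ms": 0, "end_ms": 300},
    {"id": 2, "parent": 1, "name": "SellerCancelStats", "start_ms": 10, "end_ms": 250},
    {"id": 3, "parent": 1, "name": "ReplySpeedStats", "start_ms": 260, "end_ms": 290},
    {"id": 4, "parent": 2, "name": "db.query", "start_ms": 20, "end_ms": 60},
]
```

Calling `rank_spans_by_self_time(example_trace)` puts the span that spends the most time in its own code first, which is the one to open in the profiler next.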
### Long-term Strategy (3+ months)
1. **Automate Performance Verification in Deployments:** Integrate performance profiling checks (e.g., profiling post-deployment in a staging or canary environment) into the CI/CD pipeline to automatically reject deployments that introduce unwarranted performance regressions before they impact the main production SLOs.
2. **Proactive Capacity Planning via Demand-Pattern Analysis:** Use long-term performance data gathered by profilers during demand spikes to inform necessary resource tuning (e.g., optimizing HPA thresholds or scaling configurations) before future anticipated growth phases.
3. **Code Refactoring Based on Hotspot Data:** Prioritize discovered performance bottlenecks (hot functions identified by Profiler) in the engineering backlog for targeted optimization. In Mercari's case, optimizing the hot functions in one service improved latency by 10% to 15%.
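
An automated check in the deployment pipeline could compare canary latency against the current baseline and block promotion on a regression. The following is a minimal sketch under stated assumptions: the 5% regression budget, the function names, and the idea of feeding in raw latency samples are all illustrative choices, and only the 350 ms figure comes from the article.

```python
def p95(latencies):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def canary_passes(baseline_ms, canary_ms, max_regression=0.05, slo_ms=350.0):
    """Return True if the canary may be promoted.

    The canary must stay within the latency SLO and must not be more
    than `max_regression` (an assumed budget, here 5%) slower than the
    baseline at the 95th percentile.
    """
    base = p95(baseline_ms)
    canary = p95(canary_ms)
    within_slo = canary <= slo_ms
    within_budget = canary <= base * (1.0 + max_regression)
    return within_slo and within_budget
```

A pipeline step could call `canary_passes` with samples collected over the canary window and fail the job when it returns `False`.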
## Implementation Guidance
### For Small Organizations
- **Phased Rollout:** Start continuous profiling on the single most business-critical service. Use the performance gains seen there to justify extending the tooling budget and time allocation to other services.
- **Leverage Managed Services:** Opt for fully managed profiling and tracing solutions (like Cloud Profiler/Trace) to avoid the complexity of self-managing profiling agents and data aggregation tools.
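
For a Python service, enabling the managed profiler is typically a few lines at startup. The sketch below uses the `googlecloudprofiler.start` call from the `google-cloud-profiler` package as we understand it; the service name, version string, and the wrapper function are placeholders invented for this example, and the broad exception handling is a design choice so that a profiler failure never stops the service from starting.

```python
import logging


def start_cloud_profiler(service, version):
    """Try to start the Cloud Profiler agent; return True on success."""
    try:
        # Provided by the google-cloud-profiler package.
        import googlecloudprofiler
    except ImportError:
        logging.warning("google-cloud-profiler not installed; profiling disabled")
        return False

    try:
        # Starts a background thread that collects and uploads profiles.
        googlecloudprofiler.start(
            service=service,
            service_version=version,
            verbose=1,  # logging level for the agent itself
        )
    except Exception as exc:  # profiling is best-effort, never fatal
        logging.warning("Cloud Profiler failed to start: %s", exc)
        return False
    return True


if __name__ == "__main__":
    # "userstats" and "1.0.0" are hypothetical placeholder values.
    start_cloud_profiler("userstats", "1.0.0")
```

Calling the wrapper as early as possible in the service's entry point gives the agent the longest window to sample from.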
### For Medium Organizations
- **Template Enforcement:** Introduce mandatory service templates (as Mercari did) to ensure all new services are observability-ready from inception, reducing onboarding time and ensuring consistency across multiple development teams.
- **Cross-Team Training:** Conduct workshops focused specifically on interpreting Flame Graphs (from Profiler) and latency traces (from Trace) to distribute performance debugging skills beyond a core SRE team.
### For Large Enterprises
- **Polyglot Standardization:** Establish strict guidelines for deploying and configuring profiling agents compatible with the organization’s diverse technology stack (Go, Python, Java, JavaScript, etc.) to maintain a consistent DevOps experience across all language teams.
- **Governance for SLO Maintenance:** Formalize governance where performance monitoring review (using combined Profiler/Trace outputs) becomes a mandatory checkpoint before major feature releases are promoted past canary stages.
## Configuration Examples
The article highlights successful configuration patterns implemented by Mercari:
1. **Service Template Requirements:** Templates must include configurations for:
   * Metrics export to Prometheus.
   * A gRPC server and interceptors.
   * Integration with Error Reporting, Cloud Trace, and Cloud Profiler.
2. **Performance Identification Example:** Comparing Cloud Profiler views before and after a code change showed the CPU time of the call stack rising from 457 ms to 904 ms, with most of the increase attributable to a single function, `_UserStats_SellerCancelStats_Handler`. Because the same change also refactored other functions, the before/after comparison was what isolated the main contributor.
3. **Resolution Verification:** Use Cloud Profiler *after* optimization to verify the reduction in consumed resources (CPU time) for the previously identified hot function to confirm the latency improvement.
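
The before/after comparison in items 2 and 3 amounts to diffing per-function CPU time between two profiles. Here is a minimal sketch of that idea: the totals (457 ms and 904 ms) and the handler name come from the article, while the per-function split and the other two function names are invented purely so the example adds up.

```python
def rank_cpu_deltas(before_ms, after_ms):
    """Rank functions by change in CPU time between two profiles.

    Both arguments map function name -> CPU time in milliseconds.
    Functions missing from one profile count as 0 ms there.
    """
    names = set(before_ms) | set(after_ms)
    deltas = {n: after_ms.get(n, 0) - before_ms.get(n, 0) for n in names}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical breakdown: only the totals and the handler name are from
# the article; replyStats and shipStats are made-up function names.
before = {"replyStats": 250, "shipStats": 207}              # total 457 ms
after = {
    "replyStats": 262,
    "shipStats": 198,
    "_UserStats_SellerCancelStats_Handler": 444,
}                                                            # total 904 ms

ranked = rank_cpu_deltas(before, after)
```

The first entry of `ranked` is the function with the largest increase. Running the same comparison again after an optimization is one way to confirm that the hot function's CPU time actually went down.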
## Compliance Alignment
While the article primarily focuses on performance engineering and SLOs, the implementation of comprehensive monitoring directly supports foundational security and reliability compliance requirements:
* **NIST SP 800-53 (CA-7, RA-5):** Continuous performance monitoring supports the Continuous Monitoring control (CA-7) and can complement Vulnerability Monitoring and Scanning (RA-5) by surfacing anomalous resource usage that could indicate resource exhaustion attacks or insecure operational states.
* **ISO/IEC 27001:2013 (Annex A.12.1 - Operational Procedures and Responsibilities):** Maintaining clear procedures for monitoring system performance and detecting failures aligns with this control group, which includes capacity management (A.12.1.3).
## Common Pitfalls to Avoid
1. **Treating Profiling as an 'On-Demand' Activity:** Avoid the pre-Cloud Profiler approach of tedious manual setup, recompiling with debug flags, and deploying only for investigation. This leads to overlooking subtle performance degradation that accumulates slowly.
2. **Focusing Only on Infrastructure Scaling:** Do not assume all performance issues are solvable solely by tuning Horizontal Pod Autoscalers (HPA) or adding resources. Hidden application inefficiencies (like the refactored function in the UserStats service) will consume added resources without resolving the underlying latency.
3. **Skipping Performance Regression Reviews:** Avoid promoting code past canary stages without actively profiling it. A new feature might pass functional tests but drastically increase latency due to unforeseen code paths or interactions within the microservices graph.
## Resources
- Google Cloud documentation for **Cloud Profiler** (Continuous low-overhead profiling).
- Google Cloud documentation for **Cloud Trace** (Distributed request tracing).
- Mercari's strategy of **Service Templates** for microservices development as a pattern for enforcing observability standards.