This blog post relays the findings of a peer workshop on performance and reliability testing and monitoring. I will not provide the usual TL;DR summary this time – if you want this gold, you will have to sift it yourself.
On 22-24 Oct, I attended and facilitated WOPR24, a LAWST-inspired workshop for performance engineers. I have been involved in organizing WOPRs for about 6 years now, and have attended 18(!) WOPRs over the last 11 years.
The attendees of WOPR24 were Ajay Davuluri, Oliver Erlewein, James Davis, Andy Hohenner, Yury Makedonov, Eric Proegler, Mais Tawfik Ashkar, Andreas Grabner, Doug Hoffman, John Meza, Michael Pearl, Ben Simo, and Alon Girmonsky. We were hosted by BlazeMeter in their Mountain View, California office, which was a fantastic venue. Adi BenNun of BlazeMeter took great care of us, making us very comfortable, feeding us very well, and making sure we had everything we could want.
I could talk for a long while about WOPR, WOPR’s history, our mission of advancing the practice and community building, how WOPRs are put together, and how Ross Collard changed the trajectory of my career and life by inviting me to WOPR3. At some future date, I will. But for now, so that we can talk about the great content of WOPR24, I will just drop a link about WOPR and let you surf that site.
The Workshop Theme
Each WOPR has a Theme, to help focus the experience reports and lay out what we want to explore. From http://www.performance-workshop.org/wopr24/:
Production is where performance matters most, as it directly impacts our end users and ultimately decides whether our software will be successful or not. Efforts to create test conditions and environments exactly like Production will always fall short; nothing compares to production!
Modern Application Performance Management (APM) solutions are capturing every transaction, all the time. Detailed monitoring has become a standard operations practice – but is it making an impact in the product development cycle? How can we find actionable information with these tools, and communicate our findings to development and testing? How might they improve our testing?
Content – Experience Reports
The primary format of WOPR is Experience Reports (ERs): narratives supported by charts, graphs, results, and other relevant data. Each ER is followed by a facilitated discussion that it triggers. 7 attendees presented ERs over 3 days. Systems we discussed included the following:
- An online advertising auction system to algorithmically price and purchase ads for specific user profiles in < 40ms. Presenter used Splunk to model and characterize log events, web hits, application instrumentation, business metrics, and other content.
- An automotive parts sourcing SaaS application. Presenter discussed supplementing regular human-conducted load tests with CI automated tests. Lively discussion about thresholds and test environment control/resetting ensued.
- A mapping application company’s efforts to test virtualization of high-end video cards (https://www.cdw.com/shop/products/NVIDIA-GRID-K2-graphics-card-2-GPUs-GRID-K2-8-GB/3126398.aspx).
- An application that collects detailed operational metrics from a very large number of embedded systems distributed around the world, rolls them up into big data sets, and displays them back to the system’s owner.
- An order-taking website for a large retailer. Discussion was about the launch of the site, and the difficulty of enrolling reluctant development stakeholders in performance testing projects.
- A non-profit transportation industry service that publishes rate tables multiple times per day to and from all the vendors in that sector. Discussion of concurrency bugs and how they were reproduced.
- A large financial services SaaS provider shared some of the issues around testing mobile in-app video and chat. Generating traffic and evaluating results were some of the technical issues we discussed.
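The CI-automated load tests mentioned in the second report need machine-checkable pass/fail criteria to be useful in a pipeline. Here is a minimal Python sketch of that idea; the result format, thresholds, and function name are my own illustrations, not something presented at the workshop.

```python
# Hypothetical sketch: gating a CI build on load-test results.
# The results format (list of dicts) and thresholds are assumptions.

def check_thresholds(results, max_p90_ms=500, max_error_rate=0.01):
    """Return a list of failure messages; an empty list means the build may proceed."""
    failures = []
    latencies = sorted(r["elapsed_ms"] for r in results)
    p90 = latencies[int(0.9 * (len(latencies) - 1))]  # nearest-rank percentile
    error_rate = sum(1 for r in results if not r["success"]) / len(results)
    if p90 > max_p90_ms:
        failures.append(f"p90 latency {p90}ms exceeds {max_p90_ms}ms")
    if error_rate > max_error_rate:
        failures.append(f"error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")
    return failures

# 10 samples with one failed request: latency is fine, error rate trips
samples = [{"elapsed_ms": 100 + 10 * i, "success": i != 3} for i in range(10)]
print(check_thresholds(samples))
```

A CI job would fail the build when the returned list is non-empty, which is exactly where the workshop discussion about threshold choice and environment resetting comes in.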
Content – Exercises
Between ERs, we conducted several guided discussions to explore specific areas of interest. Findings from those are relayed here:
What should we alert on?
In this exercise, we just started calling out thresholds and notes. We definitely didn’t finish.
| Metric | Thresholds and notes |
|---|---|
| CPU | > 75% Web/App, > 50-60% DB. User + system CPU: for VMs with 4 or fewer cores and hyperthreading, warning at 50, alarm at 75, critical at 95; with more than 4 cores, or on metal with HT disabled, warning at 75, critical at 99. Context-dependent: physical vs. virtual CPUs, per core and overall. With a 1-minute monitor, how many sequential observations should trigger (2 at warning? 3 at alarm? 1 at critical)? One-minute or five-minute samples? |
| Order rate | Dynamic baseline (time/day/etc.). Oliver: Tuesday is max, Friday is min. Related business metrics: revenue, order rates, conversion rate, bounce rate. Business spend: advertising money out. |
| Failure rate | HTTP error rate against yesterday. Need to have a baseline; beware of bots, synthetic measures, etc. |
| Queue length: threads | Web/App; detectable at the app tier? Should have threads available (2 threads/core); beware of thread contention/concurrency events. Alert on contention/concurrency? |
| Connection pool utilization | JMX: DB connections, outgoing requests. DB connections, external web service calls; depends on the app server. Is the app waiting for a connection? |
| TCP inbound queue | Anything increasing in subsequent samples. |
| Message queue | Messages created/sec, size (no growth), message age, expiring messages present? |
| CPU queue | Load average/CPU queue length > 2 per core. Load average: 4x number of cores. |
| MIPS/second | 3500/second; threshold 2900. |
| Response time | Several times higher than SLA. TCP/response time can indicate app server overload. Distribution and median/thresholds vs. baseline. |
| Errors (particular) | Classes/types, rates, and severity. Error clusters. |
| CPU ready time | Any significant percentage. |
| GC time | Percentage of time – what’s the impact? Over 10%. Full collection? Partial? Age of objects leaving a generation. |
| Average HTTP status | Compare to baseline average literal numbers. By request (type) or by page; watch for dramatic changes. Watch for bots > 50%. Big difference between prod and test; remove synthetics in test. Load balancer, keep-alive, etc. |
| Throughput | Cloud $ rate; watch for bottleneck. Network throughput + context switches over CPU = ratio for sanity check; first bottleneck. |
| GPU utilization in virtualization | 75-85% utilization = efficient virtualization. |
| Redirects | Count/rate; redirects on keep-alives. Endless loops? Redirects per user; check for a max. |
| SAN I/O count | |
| I/O latency | Write latency. Queue lengths let you see the problem faster. |
| Thread context switches | |
| Wait time (DB) | Latches, locks; just look for increases. Also wait states. |
| Log volume (and by type) | |
| Cache hit ratio | Page life expectancy. |
| Virtual active memory vs. real memory | Check for disk swapping. |
| Physical memory free | 66%. Managed memory. Leave a small amount free. |
| Network I/O | Based on network link rating. |
| Network errors/packet discards | |
| Disk utilization | Space and I/Os. |
| Average SQL statements/request | Data-driven pattern detection problem. N+1 query results. |
| Disk space consumption rate | |
| Starting/stopping of monitoring agents | |
| Connection pool: available | |
| Rate of change in connection pool count | |
| Recycling of worker process/app pool | |
| Log messages by severity | |
| Thread status | Deadlocks? |
| Business transaction rates | Whatever those are. |
| Cloud node thrashing | Spin up/spin down. |
| Functional audit log/application log | Specific events. |
| Transaction span lifecycles | |
| Scale of code change on updates | This jar file is x times the previous version. |
| Page size (HTML) | Delta vs. previous. |
| Web server request queue | IIS. |
| %C1/%C2/%C3 (CPU C-states) | VMs in power-save mode? Detected by measuring against a baseline. |
| Response time total for all requests (sum) | Saw things that were not obvious. Correlate against load. |
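The CPU row above left open how many sequential observations should trigger each alert level. A small Python sketch of that "N consecutive samples over threshold" logic; the specific levels, thresholds, and counts are illustrative, taken from the numbers we tossed around rather than any agreed standard.

```python
# Illustrative alert levels: (name, threshold %, consecutive samples required).
# Highest severity first, so one critical sample wins over three alarm samples.
THRESHOLDS = [("critical", 95, 1), ("alarm", 75, 3), ("warning", 50, 2)]

def evaluate(samples):
    """Return the highest alert level whose threshold was exceeded by
    enough consecutive trailing samples, or None if nothing fired."""
    for level, threshold, needed in THRESHOLDS:
        recent = samples[-needed:]
        if len(recent) == needed and all(s > threshold for s in recent):
            return level
    return None

print(evaluate([60, 60]))          # two consecutive samples > 50 -> warning
print(evaluate([60, 80, 80, 80]))  # three consecutive samples > 75 -> alarm
print(evaluate([40, 98]))          # a single sample > 95 -> critical
```

Requiring multiple consecutive observations is one way to keep a one-minute monitor from paging anyone over a single transient spike.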
What Would an Ideal Dashboard Look Like?
For this activity, we broke up into groups to talk about different contexts. There are some metacomments in each section, reflecting when they came out.
SaaS Company Dashboard
This dashboard has horizontal bars to show different metrics for different consumers. The audience is the whole company. The checkmarks/Xs on the left provide a two-state indicator to easily, rapidly, and broadly detect when something is wrong, by functional area. Here is what each bar had:
- CEO – For each metric, the current time period against a previous time period. Demos, Signups, subscriptions, Retain/Drop Rates, and some support metric such as call count
- CFO – Cost per lead, cost per user, and a sparkline for Revenue
- CMO – Social media engagement, Mailing opens/click rates, ad conversion
- CIO – Availability, error rate, Response time, throughput
- CTO – Deployments, key resource metrics
Each one of these bars is designed for drill-down. Remember, this is the top level.
For E-Commerce, they decided not to draw a single dashboard. They defined five different functional areas they care about, and drew a Venn diagram to show that there are overlaps of interesting information, but very little that everyone wanted, needed, or would benefit from knowing.
Consider this continuum, generic to specific, from “easy for everyone to understand” to “doesn’t have to be understood except by the individual who uses the dashboard”. You could also think of these as spokes from the center outwards.
Company -> Department -> Team -> Individual
As the questions were posed by them:
- Who is looking at the dashboard?
- What is their role?
- What do they want to see?
- What actions would they take if they did see something?
The four areas they defined, and the notes on each:
- Performance Testing
- Business – Adoption Rate, Active/passive campaigns, usage by feature, dollars per aggregate
- Development – Separate dashboards for each scrum team
- OPS – Active support ticket counts
Here is a dashboard imagined for a very large company. It is intentionally light on detail to avoid having visitors/contractors get information they shouldn’t have. Some of the resulting discussion was about dysfunctions, but these are real safety issues in some organizations.
Consider the following characteristics in deciding what to have in the dashboard:
- Tooling – what can you show
- Time – both for measuring effort, and for displaying data in context
- Security – Don’t show something you shouldn’t – review content with Security officer?
- Liability – Knowing specific things – or being responsible for exposing specific things – might put a person in a position they don’t want to be
- Cost – Screens, software, etc.
- Political dimensions of having potentially out-of-context data be embarrassing to specific executives or individuals. Is this public shaming? Are people stuck with metrics that will never go green? Is that deliberate? What if we move the goalposts so that someone can see green – are they still worth tracking? Should we turn off our sales dashboard when we are clearly going to miss goals, so that we don’t depress morale?
- NO STOCK PRICES. Or other metrics that are not actually connected to current, actionable data.
Some specifics about this dashboard:
The “Christmas Tree” or dragstrip pattern is for a number of areas to indicate green/yellow/red. This is designed only to communicate whether something interesting is happening.
Other visualization methods: vertical meters or speedometers.
The other primary visual is a graph comparing a current time period against a previous, such as this day’s revenue against last year’s/quarter’s/week’s. The interesting suggestion here was to change the background color as an indicator.
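That background-color suggestion boils down to mapping "current period vs. previous period" onto a status color. A hedged Python sketch of one way to do it; the 10% band and the color names are my own assumptions, not anything the group specified.

```python
# Hypothetical sketch: pick a dashboard background color from the ratio of
# the current period's figure to a prior-period baseline. Bands are assumptions.

def period_status(current, baseline, warn_below=0.9, good_above=1.0):
    """Green at or above baseline, yellow if slightly below, red otherwise."""
    ratio = current / baseline
    if ratio >= good_above:
        return "green"
    if ratio >= warn_below:
        return "yellow"
    return "red"

print(period_status(current=10_500, baseline=10_000))  # above baseline -> green
print(period_status(current=9_300, baseline=10_000))   # within 10% below -> yellow
print(period_status(current=7_800, baseline=10_000))   # well below -> red
```

The interesting design question is the choice of baseline (last year, last quarter, last week), since seasonality can make any single one misleading.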
Focus on the information needs: What is actionable? What is relevant? What is helpful?
Go to the people consuming the data – what do they need? Answer that question, instead of starting with what is easy to provide.
Revenue metrics are key
What metrics do we talk about? What metrics do we present? Those metrics already have a life, so we should try to reuse them.
Test for information completeness – establish that the dashboards are truthful and accurate. Make sure the team knows what the data means and where it comes from.
Dashboards provide transparency into current states – but they require interpretation. They are a starting place for discussions across teams.
Leave space for temporary/rotating items
Bonus Screenshot of an Ops Dashboard Someone Built
Unclear what the news presenter sees on the dashboard that concerns her ;-p
What Would We Want to Monitor for Mobile Users?
In this exercise, we listed the attributes that would be interesting to capture for mobile monitoring.
On the device:
- Network conditions: latency, bandwidth, jitter, packet loss
- Geographic location
- Screen size/device
- OS version
- Battery State
- Memory – free/available
- Number (list?) of apps running
- Powersave mode
- Is the screen cracked or broken?
- Signal strength (wireless/cellular)
- Recent movement/direction/accelerometer data
- Contention? Drops/retransmits
- Carrier Plan status?
- Time: last restart/power cycle
- Date in Service
- Accessibility Options
- Geographically similar to other devices? People travelling together in a train, for example
In the app:
- Screen load RT
- Memory footprint
- Data transfer
- CPU cycles
- Gestures captured
- Round trips/waterfall
- Ad activity
- Freemium status?
- Known customer
- Time of day
- Date of Install
- Date last used
- Other app requests
- Phone connection right now?
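To make the lists above concrete, here is a hedged sketch of what a monitoring beacon carrying a few of these device and in-app attributes might look like. The field names, units, and values are purely illustrative; no such payload format was defined at the workshop.

```python
# Hypothetical mobile-monitoring beacon combining a few attributes from the
# lists above. All field names and example values are illustrative assumptions.
import json
from dataclasses import dataclass, asdict

@dataclass
class MobileBeacon:
    os_version: str            # "OS version"
    battery_pct: int           # "Battery state"
    signal_strength_dbm: int   # "Signal strength (wireless/cellular)"
    network_latency_ms: int    # "Network conditions: latency"
    screen_load_ms: int        # "Screen load RT"
    app_memory_mb: float       # "Memory footprint"
    freemium: bool             # "Freemium status?"

beacon = MobileBeacon(
    os_version="Android 14", battery_pct=62, signal_strength_dbm=-85,
    network_latency_ms=120, screen_load_ms=340, app_memory_mb=88.5,
    freemium=True,
)
payload = json.dumps(asdict(beacon))  # what the app might send home
print(payload)
```

In practice the hard questions are less about the payload shape and more about the privacy-sensitive items on the list (location, carrier plan, travel patterns), which would need consent and careful handling.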