WOPR24 Experience Report

This blog post relays the findings of a peer workshop on performance and reliability testing and monitoring. I will not provide the usual TL;DR summary this time – if you want this gold, you will have to sift it yourself.

On 22-24 Oct, I attended and facilitated WOPR24, a LAWST-inspired workshop for performance engineers. I have been involved in organizing WOPRs for about 6 years now, and have attended 18(!) WOPRs over the last 11 years.

The attendees of WOPR24 were Ajay Davuluri, Oliver Erlewein, James Davis, Andy Hohenner, Yury Makedonov, Eric Proegler, Mais Tawfik Ashkar, Andreas Grabner, Doug Hoffman, John Meza, Michael Pearl, Ben Simo, and Alon Girmonsky. We were hosted by BlazeMeter in their Mountain View, California office, which was a fantastic venue. Adi BenNun of BlazeMeter took great care of us, making us very comfortable, feeding us very well, and making sure we had everything we could want.

[Image: the WOPR24 group]

I could talk for a long while about WOPR, WOPR’s history, our mission of advancing the practice and building community, how WOPRs are put together, and how Ross Collard changed the trajectory of my career and life by inviting me to WOPR3. At some future date, I will. But for now, so that we can talk about the great content of WOPR24, I will just drop a link about WOPR and let you surf that site.

The Workshop Theme

Each WOPR has a Theme, to help focus the experience reports and lay out what we want to explore. From http://www.performance-workshop.org/wopr24/:

Production is where performance matters most, as it directly impacts our end users and ultimately decides whether our software will be successful or not. Efforts to create test conditions and environments exactly like Production will always fall short; nothing compares to production!

Modern Application Performance Management (APM) solutions are capturing every transaction, all the time. Detailed monitoring has become a standard operations practice – but is it making an impact in the product development cycle? How can we find actionable information with these tools, and communicate our findings to development and testing? How might they improve our testing?

Content – Experience Reports

The primary format of WOPR is Experience Reports (ERs), which are narratives supported by charts, graphs, results, and other relevant data. Each ER is followed by a facilitated discussion that it triggers. 7 attendees presented ERs over 3 days. Systems we discussed included the following:

  1. An online advertising auction system to algorithmically price and purchase ads for specific user profiles in < 40ms. Presenter used Splunk to model and characterize log events, web hits, application instrumentation, business metrics, and other content.
  2. An automotive parts sourcing SaaS application. The presenter discussed supplementing regular human-conducted load tests with CI-automated tests. A lively discussion about thresholds and test environment control/resetting ensued (a sketch of such a threshold gate follows this list).
  3. A mapping application company’s efforts to test with virtualized high-end video cards (https://www.cdw.com/shop/products/NVIDIA-GRID-K2-graphics-card-2-GPUs-GRID-K2-8-GB/3126398.aspx).
  4. An application that collects detailed operational metrics from a very large number of embedded systems distributed around the world, rolls them up into big data sets, and displays them back to the system’s owner.
  5. An order-taking website for a large retailer. Discussion was about the launch of the site, and the difficulty of enrolling reluctant development stakeholders in performance testing projects.
  6. A non-profit transportation industry service that publishes rate tables multiple times per day to and from all the vendors in that sector. Discussion of concurrency bugs and how they were reproduced.
  7. A large financial services SaaS provider shared some of the issues around testing mobile in-app video and chat. Generating traffic and evaluating results were some of the technical issues we discussed. 
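
Since the CI discussion from the second report keeps coming up in my own work, here is roughly what we meant by a threshold gate. This is a minimal sketch only, assuming JSON results from a load tool; the metric names and limits are invented for illustration, not what the presenter actually uses.

```python
# Hypothetical CI gate: fail the build if load test results exceed agreed thresholds.
# Metric names, limits, and the results-file format are illustrative only.
import json
import sys

THRESHOLDS = {
    "p95_response_ms": 800,   # 95th percentile response time budget
    "error_rate_pct": 1.0,    # acceptable percentage of failed requests
}

def evaluate(results):
    """Return a list of threshold violations; an empty list means the run passes."""
    violations = []
    for metric, limit in THRESHOLDS.items():
        value = results.get(metric)
        if value is None:
            violations.append("missing metric: %s" % metric)
        elif value > limit:
            violations.append("%s=%s exceeds limit %s" % (metric, value, limit))
    return violations

if __name__ == "__main__":
    with open(sys.argv[1]) as f:          # e.g. results.json written by the load tool
        problems = evaluate(json.load(f))
    for p in problems:
        print("FAIL:", p)
    sys.exit(1 if problems else 0)        # a non-zero exit fails the CI stage
```

The useful argument at the workshop was less about the code and more about who agrees to the limits and how the test environment gets controlled and reset between runs.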

Content – Exercises

Between ERs, we conducted several guided discussions to explore specific areas of interest. Findings from those are relayed here:

What should we alert on?

In this exercise, we just started calling out thresholds and notes. We definitely didn’t finish.

For each monitor, we captured a threshold and notes where we had them:

  • CPU – Threshold: > 75% Web/App, > 50–60% DB. Notes: user + system CPU. For VMs with 4 or fewer CPUs and hyperthreading: warning 50, alarm 75, critical 95; with more than 4 CPUs, or bare metal with HT disabled: warning 75, critical 99. Context-dependent. Watch physical and virtual CPUs, per core and overall. With a 1-minute monitor, how many sequential observations should trigger (two at which level? 3 for alarm? 1 for critical)? One-minute or 5-minute samples?
  • Order Rate – Threshold: dynamic baseline (time/day/etc.). Notes: Oliver: Tuesday is max and Friday is min for revenue, order rates, conversion rate, bounce rate. Business spend: advertising money out.
  • Failure Rate – Threshold: HTTP error rate against yesterday. Notes: need to have a baseline; beware of bots, synthetic measures, etc.
  • Queue Length: Threads – Threshold: Web/App, detectable at App? Notes: should have threads available, 2 threads/core; beware of thread contention/concurrency events. Alert on contention/concurrency?
  • Connection Pool Utilization – Threshold: JMX: DB connections, outgoing requests. Notes: DB connections, external web service calls; depends on the app server. Is the app waiting for a connection?
  • TCP Inbound Queue – Threshold: anything increasing in subsequent samples.
  • Message Queue – Threshold: messages created/sec, size (no growth), message age, expiring messages present?
  • CPU Queue – Threshold: load average/CPU queue length > 2 per core. Notes: load average at 4x the number of cores.
  • MIPS/second – Threshold: 3500/second, with the threshold at 2900.
  • Response Time – Threshold: several times higher than the SLA. Notes: TCP/response time can indicate app server overload; distribution and median/thresholds vs. baseline.
  • Errors (particular) – Threshold: class/types, rates, and severity. Notes: error clusters.
  • CPU Ready Time – Threshold: any significant percentage.
  • GC Time – Threshold: percentage of time (what’s the impact?), over 10%. Notes: full collection? Partial? Age of objects leaving a generation.
  • Averaging HTTP status – Threshold: compare to baseline average literal numbers. Notes: by request (type), or by page; watch for dramatic changes; watch for bots > 50%. Big difference between prod and test – remove synthetics in test. Load balancer, keep-alive, etc.
  • Throughput – Threshold: cloud $ rate, watch for the bottleneck. Notes: network throughput + context switches over CPU = ratio for a sanity check; first bottleneck.
  • GPU utilization in virtualization – Threshold: 75–85% utilization = efficient virtualization.
  • Redirects – Threshold: count/rate, redirects on keep-alives. Notes: endless loops? Redirects per user, check for a max.
  • SAN I/O count
  • I/O latency – Threshold: write latency. Notes: queue lengths let me see a problem faster.
  • Thread context switches
  • Wait time (DB) – Threshold: latches, locks; just look for increases. Notes: also wait states.
  • Log volume (and by type)
  • Cache hit ratio – Notes: page life expectancy.
  • DB connections
  • Virtual active memory vs. real memory – Notes: check for disk swapping.
  • Physical memory free – Threshold: 66%. Notes: managed memory; leave a small amount free.
  • Network I/O – Threshold: based on the network link rating.
  • Network errors/packet discards
  • Objects generated
  • Induced GCs
  • Live sessions
  • Disk utilization – space and I/Os
  • Average SQL statements/request – Notes: data-driven pattern detection problem; n+1 query results.
  • Disk space consumption rate
  • Starting/stopping of monitoring agents
  • Connection pool: available
  • Monitoring boxes
  • # restarts
  • Rate of change in connection pool count
  • Recycling of worker processes/app pools
  • Log messages by severity
  • Thread status – deadlocks?
  • Business transaction rates – whatever those are
  • Revenue/sec
  • Availability
  • Cloud node thrashing – Notes: spin up, spin down.
  • Functional audit log/application log – Notes: specific events.
  • Transaction span lifecycles
  • Scale of code change on updates – Notes: this jar file is x times the previous version.
  • Page size (HTML) – Notes: delta vs. previous.
  • User/system ratio
  • Web server request queue: IIS
  • Batches/sec – Threshold: 200.
  • %C1/2/3 – Notes: VMs in power-save mode? Detected by measuring against a baseline.
  • Response time total for all requests (sum) – Notes: saw things that were not obvious; correlate against load.
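
To make the “how many sequential observations should trigger” question from the CPU entry concrete, here is a minimal sketch of one way it could work, assuming 1-minute samples. The levels (50/75/95) are the example numbers above; the observation counts and the class name are mine, not from any particular monitoring tool.

```python
# Sketch: threshold evaluation with sequential-observation gating, using the example
# CPU levels from the list above (warning 50, alarm 75, critical 95). Counts are invented.
from collections import deque

# (level name, CPU %, consecutive samples required) - checked from most to least severe
LEVELS = [("critical", 95, 1), ("alarm", 75, 3), ("warning", 50, 2)]

class CpuAlerter:
    def __init__(self, history_size=5):
        self.samples = deque(maxlen=history_size)  # most recent 1-minute CPU % observations

    def observe(self, cpu_pct):
        """Record a sample; return the most severe level it sustains, or None."""
        self.samples.append(cpu_pct)
        for name, threshold, needed in LEVELS:
            recent = list(self.samples)[-needed:]
            if len(recent) == needed and all(s > threshold for s in recent):
                return name
        return None

alerter = CpuAlerter()
for minute, cpu in enumerate([60, 78, 80, 82, 96]):
    level = alerter.observe(cpu)
    if level:
        print("minute %d: %s (CPU %d%%)" % (minute, level, cpu))
```

Requiring more consecutive observations for the lower levels is one way to trade off noise against time-to-detect; we did not settle on the right counts.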

What Would an Ideal Dashboard Look Like?

For this activity, we broke up into groups to talk about different contexts. Some metacomments appear in each section, reflecting where they came up.

SaaS Company Dashboard

[Image: the SaaS company dashboard sketch]

This dashboard has horizontal bars to show different metrics for different consumers. The audience is the whole company. The checkbox/X marks on the left provide a two-state indicator to easily, rapidly, and broadly detect when something is wrong, by functional area. Here is what each bar had:

  • CEO – For each metric, the current time period against a previous time period. Demos, Signups, subscriptions, Retain/Drop Rates, and some support metric such as call count
  • CFO – Cost per lead, cost per user, and a sparkline for Revenue
  • CMO – Social media engagement, Mailing opens/click rates, ad conversion
  • CIO – Availability, error rate, Response time, throughput
  • CTO – Deployments, key resource metrics

Each one of these bars is designed for drill-down. Remember, this is the top level.
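
For a sense of how the checkbox/X column could be driven, the two-state indicator for each area might just roll up from whatever checks that area defines. This is only a sketch under that assumption – the area names and placeholder checks are illustrative, not anything the group specified.

```python
# Illustrative rollup: an area shows a check mark only if all of its checks pass.
# The checks below are placeholders; a real dashboard would query APM/BI data sources.
areas = {
    "CIO": [lambda: True,    # availability above target?
            lambda: True],   # error rate below target?
    "CTO": [lambda: False],  # e.g. a deployment check currently failing
}

def rollup(checks):
    """Collapse each area's checks into a single two-state indicator."""
    return {area: ("✔" if all(check() for check in fns) else "✘")
            for area, fns in checks.items()}

print(rollup(areas))  # {'CIO': '✔', 'CTO': '✘'}
```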

E-Commerce

For E-Commerce, they decided not to draw a single dashboard. They defined five different functional areas they care about, and drew a Venn diagram to show that there are overlaps of interesting information, but very little that everyone wanted, needed, or would benefit from knowing.

Consider this continuum, generic to specific, from “easy for everyone to understand” to “doesn’t have to be understood except by the individual who uses the dashboard”. You could also think of these as spokes from the center outwards.

Company -> Department -> Team -> Individual

The questions as they posed them:

  • Who is looking at the dashboard?
  • What is their role?
  • What do they want to see?
  • What actions would they take if they did see something?

The areas they defined, and the notes on each:

  1. Performance Testing
  2. Business – Adoption Rate, Active/passive campaigns, usage by feature, dollars per aggregate
  3. Development – Separate dashboards for each scrum team
  4. OPS – Active support ticket counts

BigCorp’s Dashboard

[Image: the BigCorp dashboard sketch]

Here is a dashboard imagined for a very large company. It is intentionally light on detail to avoid having visitors/contractors get information they shouldn’t have. Some of the resulting discussion was about dysfunctions, but these are real safety issues in some organizations.

Consider the following characteristics in deciding what to have in the dashboard:

  • Tooling – what can you show
  • Time – both for measuring effort, and for displaying data in context
  • Security – Don’t show something you shouldn’t – review content with a security officer?
  • Liability – Knowing specific things – or being responsible for exposing specific things – might put a person in a position they don’t want to be in
  • Cost – Screens, software, etc.
  • Political dimensions of having potentially out-of-context data be embarrassing to specific executives or individuals. Is this public shaming? Are people stuck with metrics that will never go green? Is that deliberate? What if we move the goalposts so that someone can see green – are they still worth tracking? Should we turn off our sales dashboard when we are clearly going to miss goals, so that we don’t depress morale?
  • NO STOCK PRICES. Or other metrics that are not actually connected to current, actionable data.

Some specifics about this dashboard:

The “Christmas Tree” or dragstrip pattern uses a set of areas, each indicating green/yellow/red. It is designed only to communicate whether something interesting is happening.

Other visualization methods include vertical meters or speedometers.

The other primary visual is a graph comparing a current time period against a previous, such as this day’s revenue against last year’s/quarter’s/week’s. The interesting suggestion here was to change the background color as an indicator.
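
A minimal sketch of that background-color suggestion, assuming a simple percentage-change rule; the cutoffs are arbitrary examples, not values anyone at the workshop proposed.

```python
# Pick a background color for a period-over-period comparison panel,
# e.g. today's revenue vs. the same day last week/quarter/year.
def comparison_color(current, previous, warn_drop=0.05, alert_drop=0.15):
    """Green at or near the prior period; yellow/red for increasingly large drops."""
    if previous <= 0:
        return "gray"                      # no meaningful baseline
    change = (current - previous) / previous
    if change >= -warn_drop:
        return "green"
    if change >= -alert_drop:
        return "yellow"
    return "red"

print(comparison_color(current=9200, previous=10000))   # 8% below baseline -> "yellow"
```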

Final Metacomments

Focus on the information needs: What is actionable? What is relevant? What is helpful?

Go to the people consuming the data – what do they need? Answer that question instead of starting with what is easy to provide.

Revenue metrics are key

Iterate!

What metrics do we talk about? What metrics do we present? Those metrics already have a life, so we should try to reuse them.

Test for information completeness – establish that the dashboards are truthful and accurate. Make sure the team knows what the data means and where it comes from.

Dashboards provide transparency into current states – but require interpretation. They are a starting place for discussions across teams.

Leave space for temporary/rotating items

Bonus Screenshot of an Ops Dashboard Someone Built

Unclear what the news presenter sees on the dashboard that concerns her ;-p

[Image: a news presenter in front of the ops dashboard]

What Would We Want to Monitor for Mobile Users?

In this exercise, we listed the attributes that would be interesting to capture for mobile monitoring. (A sketch of what a capture payload might look like follows the two lists.)

On the device:

  1. Network conditions: latency, bandwidth, jitter, packet loss
  2. Geographic location
  3. Screen size/device
  4. OS version
  5. Battery State
  6. Memory – free/available
  7. Number (list?) of apps running
  8. Powersave mode
  9. Chaperoned?
  10. Carrier
  11. Is the screen cracked or broken?
  12. Signal strength (wireless/cellular)
  13. Accessories
  14. Recent movement/direction/accelerometer data
  15. Contention? Drops/retransmits
  16. Carrier Plan status?
  17. Time: last restart/power cycle
  18. Rooted/Jailbroken?
  19. Date in Service
  20. Accessibility Options
  21. Geographically similar to other devices? People travelling together in a train, for example

In the app:

  1. Screen load RT
  2. Memory footprint
  3. Data transfer
  4. CPU cycles
  5. Gestures captured
  6. Round trips/waterfall
  7. Ad activity
  8. Freemium status?
  9. Known customer
  10. Time of day
  11. Demographics
  12. Date of Install
  13. Version
  14. Date last used
  15. Other app requests
  16. Phone connection right now?
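
To pull a handful of the attributes above into one place, a capture payload might look something like the sketch below. The field names and the particular attributes chosen are mine, purely to illustrate the shape of the data – not a schema anyone proposed at the workshop.

```python
# Illustrative shape of a mobile monitoring beacon combining device and in-app attributes.
# Field names are invented for the example; real SDKs define their own schemas.
from dataclasses import dataclass, asdict

@dataclass
class DeviceContext:
    os_version: str
    screen_size: str
    carrier: str
    network_latency_ms: float
    battery_pct: int
    signal_strength_dbm: int
    rooted_or_jailbroken: bool

@dataclass
class AppSample:
    app_version: str
    screen_name: str
    screen_load_ms: float
    memory_footprint_mb: float
    bytes_transferred: int
    device: DeviceContext

sample = AppSample(
    app_version="2.4.1", screen_name="checkout", screen_load_ms=640.0,
    memory_footprint_mb=182.5, bytes_transferred=48_312,
    device=DeviceContext(os_version="Android 5.0", screen_size="1080x1920",
                         carrier="ExampleCarrier", network_latency_ms=120.0,
                         battery_pct=43, signal_strength_dbm=-95,
                         rooted_or_jailbroken=False),
)
print(asdict(sample))  # ready to serialize and ship to a collection endpoint
```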