The Problem

I/O issues are currently detected using only the CPU data. We generally expect servers to be CPU bound, as node.js is very good at handling I/O concurrency. Thus, low CPU usage indicates that too much time is spent waiting for I/O.

However, by pressuring the server hard enough with a large number of requests, it is almost always possible to drive the CPU to 100%. The idea is therefore to use the additional handles data as a second metric for detecting I/O issues.

Getting the data

Matteo provided some sample files. To work with these in R, they are first converted to .csv files:

cat small-load.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > small-load.csv

cat heavy-load-1.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > heavy-load-1.csv
cat heavy-load-2.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > heavy-load-2.csv

cat mystery-1.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > mystery-1.csv
cat mystery-2.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > mystery-2.csv
cat mystery-3.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > mystery-3.csv
cat mystery-4.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > mystery-4.csv

cat issue-1.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > issue-1.csv
cat issue-2.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > issue-2.csv
cat issue-3.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > issue-3.csv

Unfortunately we currently don’t have any real data, and the servers without slow I/O have unrealistically constant handle trends. This makes it difficult to test our model against data without a handle issue. To work around this, the small-load data set is used as an example without an I/O issue. Even though it was specifically created to have an I/O issue, the I/O issue primarily shows up in the CPU data.

To load the data:

load.data = function (name, has.issue) {
  dat = read.csv(name)
  dat$name = name
  dat$has.issue = has.issue
  dat$time = as.POSIXct((dat$timestamp - min(dat$timestamp)) / 1000, origin="1970-01-01", tz="GMT")
  return(dat)
}

dat.small.load = load.data('small-load.csv', F)
dat.heavy.load.1 = load.data('heavy-load-1.csv', T)
dat.heavy.load.2 = load.data('heavy-load-2.csv', T)
dat.mystery.1 = load.data('mystery-1.csv', F)
dat.mystery.2 = load.data('mystery-2.csv', F)
dat.mystery.3 = load.data('mystery-3.csv', T)
dat.mystery.4 = load.data('mystery-4.csv', T)
dat.issue.1 = load.data('issue-1.csv', F)
dat.issue.2 = load.data('issue-2.csv', F)
dat.issue.3 = load.data('issue-3.csv', F)

The data is structured as:

print.data.table(head(dat.small.load))
timestamp interval delay cpu memory.rss memory.heapTotal memory.heapUsed memory.external handles
1.508255e+12 0 344.903667 0.8164976 43188224 22470656 16253032 482408 3
1.508255e+12 0 0.508562 0.1085781 43257856 22470656 16302568 482482 3
1.508255e+12 0 1.354164 0.0249248 43266048 22470656 16311496 482556 3
1.508255e+12 0 2.736907 0.0168016 43274240 22470656 16315992 482630 3
1.508255e+12 0 2.760665 0.0182592 43282432 22470656 16320296 482704 3
1.508255e+12 0 0.150944 0.0195056 43286528 22470656 16324600 482778 3

The interval column has the value 1 for the guessed analysis interval. To focus only on this data:

subset.interval = function (dat) {
  dat = dat[dat$interval == 1, ]
  dat$time = as.POSIXct((dat$timestamp - min(dat$timestamp)) / 1000, origin="1970-01-01", tz="GMT")
  return(dat)
}

dat.small.load = subset.interval(dat.small.load)
dat.heavy.load.1 = subset.interval(dat.heavy.load.1)
dat.heavy.load.2 = subset.interval(dat.heavy.load.2)
dat.mystery.1 = subset.interval(dat.mystery.1)
dat.mystery.2 = subset.interval(dat.mystery.2)
dat.mystery.3 = subset.interval(dat.mystery.3)
dat.mystery.4 = subset.interval(dat.mystery.4)
dat.issue.1 = subset.interval(dat.issue.1)
dat.issue.2 = subset.interval(dat.issue.2)
dat.issue.3 = subset.interval(dat.issue.3)

Reprinting the data shows that the interval = 0 observations have now been stripped.

print.data.table(head(dat.small.load))
timestamp interval delay cpu memory.rss memory.heapTotal memory.heapUsed memory.external handles
15 1.508255e+12 1 2.336333 0.4967440 43782144 22994944 16743760 483444 23
16 1.508255e+12 1 0.727231 0.0316018 43786240 22994944 16748880 483518 23
17 1.508255e+12 1 2.187926 0.0269939 43790336 22994944 16753776 483592 23
18 1.508255e+12 1 2.734569 0.0155482 43794432 22994944 16758672 483666 23
19 1.508255e+12 1 2.731688 0.0185364 43802624 22994944 16763568 483740 23
20 1.508255e+12 1 1.796073 0.1580187 43831296 22994944 16790752 483814 27

Finally, the data is combined for convenience:

dat.main = rbind(dat.small.load, dat.heavy.load.1, dat.heavy.load.2)
dat.mystery = rbind(dat.mystery.1, dat.mystery.2, dat.mystery.3, dat.mystery.4)
dat.issue = rbind(dat.issue.1, dat.issue.2, dat.issue.3)

The model hypothesis

The model hypothesis is made by looking at the data and considering one’s domain knowledge. The latter in particular should be the primary component when there isn’t any real data.

p = ggplot(dat.main, aes(x = time, y = handles, colour=has.issue))
p = p + geom_line()
p = p + facet_grid(name ~ ., scales='free_y')
p = p + scale_x_datetime(labels = date_format("%S sec"))
p = p + scale_y_continuous(limits = c(0, NA))
print(p)

The model hypothesis that Matteo provided was:

servers with an I/O issue will have an increasing number of handles that will occasionally decrease a lot.

This appears to fit the heavy-load data well and the small-load data poorly. This is also what we want, as small-load is treated as having no issue in the handles data.

Quantifying the model hypothesis

Coefficient of variation

The immediate thought is that heavy-load will have relatively more variance than small-load because it has more extreme values. A challenge here is that the variance will depend on the number of concurrent requests. To normalize against this, the standard deviation is divided by the mean. This is called the coefficient of variation (cv).

cv.unbiased = function (vec) {
  cv = sd(vec) / mean(vec)
  return(cv * (1 + 1/(4 * length(vec))))
}

analysis.cv = function (dat) {
  name = dat[1, 'name']
  return(data.frame(list(
    cv.unbiased = cv.unbiased(dat$handles)
  )))
}

kable(ddply(dat.main, "name", analysis.cv))
name cv.unbiased
heavy-load-1.csv 0.3652156
heavy-load-2.csv 0.1156849
small-load.csv 0.1299085

The results quickly show that this is not the case: heavy-load-1 actually has a smaller coefficient of variation than small-load.
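Even though the coefficient of variation did not separate the data sets, it is worth noting how easily such a statistic ports to JavaScript, where the detector will eventually have to live. A sketch with my own helper names, not the actual clinic-doctor code:

```javascript
// Sketch of the unbiased coefficient of variation in JavaScript,
// mirroring the cv.unbiased R function above.
function mean (vec) {
  return vec.reduce((a, b) => a + b, 0) / vec.length
}

// sample standard deviation (n - 1 denominator), matching R's sd()
function sd (vec) {
  const m = mean(vec)
  const squares = vec.reduce((a, b) => a + (b - m) * (b - m), 0)
  return Math.sqrt(squares / (vec.length - 1))
}

function cvUnbiased (vec) {
  const cv = sd(vec) / mean(vec)
  // small-sample bias correction, as in the R version
  return cv * (1 + 1 / (4 * vec.length))
}
```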

Heavy tail detection

Before moving on to the next idea, it can be useful to look at the data from other angles than just the plain graph. One such angle is the data distribution.

p = ggplot(dat.main, aes(handles, colour=has.issue))
p = p + geom_density(fill = 1, alpha = 0.1)
p = p + facet_wrap(~ name, scales='free')
print(p)

From this data, the idea is that the heavy-load data has more skewness (yes, this is a statistical term) than small-load. One could then use a fancy statistical test like Jarque-Bera to test the skewness value. However, fancy statistical tests are a nightmare to implement in JavaScript. Instead, the data is assumed to be normally distributed; from this assumption one can check whether there is a surprising amount of data at either of the distribution’s tails.

plot.heavy.tail = function (dat) {
  lower.bound = mean(dat$handles) - 1.96 * sd(dat$handles)
  upper.bound = mean(dat$handles) + 1.96 * sd(dat$handles)
  
  p = ggplot(dat, aes(x = time, y = handles, colour=has.issue))
  p = p + geom_line()
  p = p + scale_x_datetime(labels = date_format("%S sec"))
  p = p + geom_hline(yintercept = lower.bound)
  p = p + annotate("text", min(dat$time), lower.bound, vjust = -1, hjust=0, label = "lower bound")
  p = p + geom_hline(yintercept = upper.bound)
  p = p + annotate("text", max(dat$time), upper.bound, vjust = -1, hjust=1, label = "upper bound")
  print(p)
}

plot.heavy.tail(dat.small.load)

plot.heavy.tail(dat.heavy.load.1)

From the graphs, this appears to be a somewhat reasonable strategy: small-load is evenly distributed on both sides of the bounds, while heavy-load-1 clearly is not. This is then turned into a complete test function:

analysis.heavy.tail = function (dat, risk = 0.05) {
  name = dat[1, 'name']
  has.issue = dat[1, 'has.issue']
  t.multiplier = qt(1 - risk/2, nrow(dat))
  lower.bound = mean(dat$handles) - t.multiplier * sd(dat$handles)
  upper.bound = mean(dat$handles) + t.multiplier * sd(dat$handles)

  # extreme tail ratio
  upper.extreme.ratio = sum(dat$handles > upper.bound) / length(dat$handles)
  lower.extreme.ratio = sum(dat$handles < lower.bound) / length(dat$handles)
  issue.detected = max(upper.extreme.ratio, lower.extreme.ratio) > risk

  return(data.frame(list(
    lower = lower.extreme.ratio,
    upper = upper.extreme.ratio,
    detect.issue = issue.detected,
    has.issue = has.issue
  )))
}
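For illustration, the same test can be sketched in JavaScript. To keep it dependency-free, this sketch uses the fixed normal-approximation multiplier 1.96 rather than the exact t-quantile the R version computes with qt(); the function name is my own:

```javascript
// Sketch of the heavy tail test in JavaScript. Unlike the R version,
// which computes an exact t-quantile with qt(), this uses the fixed
// normal-approximation multiplier 1.96.
function heavyTailTest (handles, risk = 0.05) {
  const n = handles.length
  const mean = handles.reduce((a, b) => a + b, 0) / n
  const sd = Math.sqrt(
    handles.reduce((a, b) => a + (b - mean) * (b - mean), 0) / (n - 1)
  )

  // bounds that 95% of observations should fall within, assuming normality
  const lowerBound = mean - 1.96 * sd
  const upperBound = mean + 1.96 * sd

  // ratio of observations beyond each bound
  const upperRatio = handles.filter((h) => h > upperBound).length / n
  const lowerRatio = handles.filter((h) => h < lowerBound).length / n

  return Math.max(upperRatio, lowerRatio) > risk
}
```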

kable(ddply(dat.main, "name", analysis.heavy.tail))
name lower upper detect.issue has.issue
heavy-load-1.csv 0.0000000 0.0638767 TRUE TRUE
heavy-load-2.csv 0.0288889 0.0566667 TRUE TRUE
small-load.csv 0.0336818 0.0174216 FALSE FALSE

At this point life is good. However, Matteo later showed data where this strategy doesn’t work.

kable(ddply(dat.mystery, "name", analysis.heavy.tail))
name lower upper detect.issue has.issue
mystery-1.csv 0.0590031 0.0000000 TRUE FALSE
mystery-2.csv 0.0011521 0.0299539 FALSE FALSE
mystery-3.csv 0.0288889 0.0566667 TRUE TRUE
mystery-4.csv 0.0000000 0.0315789 FALSE TRUE

As seen from this result, the issue detection (detect.issue) doesn’t match the target (has.issue). Yet looking at the data, nothing appears to be unusual.

p = ggplot(dat.mystery, aes(x = time, y = handles, colour=has.issue))
p = p + geom_line()
p = p + facet_grid(name ~ ., scales='free_y')
p = p + scale_x_datetime(labels = date_format("%S sec"))
p = p + scale_y_continuous(limits = c(0, NA))
print(p)

Looking at the results for dat.mystery.4, where an issue should be detected but isn’t, nothing appears to be wrong. The detection logic just isn’t good enough. One could of course try to work around it with small hacks, but that is rarely the solution.

plot.heavy.tail(dat.mystery.4)

What one should take from this is that the model hypothesis might be slightly wrong. The drops that do appear in dat.mystery.4 are not sudden, but rather somewhat gradual.

New model hypothesis

The previous analysis inspires a new hypothesis.

servers with an I/O issue will have an increasing number of handles for a period, followed by a period with a decreasing number of handles.

Sign change test

The immediate thought might be to fit a sine function to this data. But without knowing what a good or bad period is, and because the period and amplitude may change over time, this is unlikely to be a good strategy.

Instead, increasing and decreasing trends are often best analysed by looking at the differential curve, that is, the difference from the previous time step.

diff.data = function (dat) {
  return(data.frame(list(
    timestamp = head(dat$timestamp, -1),
    time = head(dat$time, -1),
    handles.delta = diff(dat$handles),
    has.issue = head(dat$has.issue, -1)
  )))
}

dat.main.delta = ddply(dat.main, "name", diff.data)
dat.mystery.delta = ddply(dat.mystery, "name", diff.data)
p = ggplot(rbind(dat.main.delta, dat.mystery.delta), aes(x = time, y = handles.delta, colour=has.issue))
p = p + geom_line()
p = p + facet_grid(name ~ ., scales='free_y')
p = p + scale_x_datetime(labels = date_format("%S sec"))
print(p)

In the differential data, the increasing parts will be positive and the decreasing parts will be negative. We can understand this better by looking at just the signs.

dat.main.delta$handles.delta.sign = sign(dat.main.delta$handles.delta)
dat.mystery.delta$handles.delta.sign = sign(dat.mystery.delta$handles.delta)
p = ggplot(rbind(dat.main.delta, dat.mystery.delta), aes(x = time, y = handles.delta.sign, colour=has.issue))
p = p + geom_point(alpha=0.1)
p = p + facet_grid(name ~ .)
p = p + scale_x_datetime(labels = date_format("%S sec"))
print(p)

From this it is apparent that the samples with an issue have an uneven distribution of increasing and decreasing handle counts.

In classical statistics, a test that is often performed is the sign test. It says that if the data is normally distributed, then the next observation has a 50% chance of having the opposite sign compared to the previous observation. This actually holds not just for normally distributed data, but for symmetrically distributed data in general. Looking at the histograms from before, for dat.mystery there also appears to be a relation between symmetry and the number of handles, although the relation is not immediately apparent.

To get a better idea if this is true, a symmetry plot can be made:

symmetry.data = function (dat) {
  handles.median = median(dat$handles.delta)
  handles.sorted = dat$handles.delta[order(dat$handles.delta)]
  
  return(data.frame(
    symmetry.x = rev(handles.sorted) - handles.median,
    symmetry.y = handles.median - handles.sorted,
    has.issue = dat$has.issue
  ))
}

dat.main.delta.symmetry = ddply(dat.main.delta, "name", symmetry.data)
dat.mystery.delta.symmetry = ddply(dat.mystery.delta, "name", symmetry.data)

p = ggplot(rbind(dat.main.delta.symmetry, dat.mystery.delta.symmetry), aes(symmetry.x, symmetry.y, colour=has.issue))
p = p + geom_point(alpha=0.3)
p = p + geom_abline(intercept = 0, slope = 1)
p = p + facet_wrap(~ name, scales='free')
print(p)

In this plot, data samples that do not generally fall on the line can be said to be non-symmetrical. From this plot, the correlation between having an issue and asymmetry appears to be very strong.

The next step is then to implement the sign change test. To do this, we count the number of sign changes and compare it with the total number of observations. For symmetric data, the probability of a sign change is 0.5. However, this does not guarantee an exact split. To overcome this, the binomial distribution is used to tell the probability of observing the given number of sign changes, given n observations and a probability of 50%.

analysis.sign.change.v1 = function (dat, risk = 0.001) {
  name = dat[1, 'name']
  has.issue = dat[1, 'has.issue']
  
  differential.data = diff(dat$handles)
  differential.data.sign = sign(differential.data)
  
  # count changes
  sign.changes = sum(diff(differential.data.sign) != 0)
  
  # lower tail p.value
  p.value = pbinom(sign.changes, nrow(dat), 0.5)
  
  return(data.frame(list(
    name = name,
    p.value = p.value,
    detect.issue = p.value < risk,
    has.issue = has.issue
  )))
}

kable(ddply(rbind(dat.main, dat.mystery), "name", analysis.sign.change.v1))
name p.value detect.issue has.issue
heavy-load-1.csv 0.0000000 TRUE TRUE
heavy-load-2.csv 0.0000000 TRUE TRUE
mystery-1.csv 0.9999998 FALSE FALSE
mystery-2.csv 0.2595071 FALSE FALSE
mystery-3.csv 0.0000000 TRUE TRUE
mystery-4.csv 0.0000000 TRUE TRUE
small-load.csv 0.8767220 FALSE FALSE

Once again life is good. However, Matteo later showed data where this test also fails.

kable(ddply(dat.issue, "name", analysis.sign.change.v1))
name p.value detect.issue has.issue
issue-1.csv 0 TRUE FALSE
issue-2.csv 0 TRUE FALSE
issue-3.csv 0 TRUE FALSE

Improved sign change test

Luckily, the fix here is simple. The false detection occurs when the number of handles is almost constant:

p = ggplot(dat.issue, aes(x = time, y = handles, colour=has.issue))
p = p + geom_line()
p = p + facet_grid(name ~ ., scales='free_y')
p = p + scale_x_datetime(labels = date_format("%S sec"))
p = p + scale_y_continuous(limits = c(0, NA))
print(p)

This mostly happens in rather theoretical cases where there is very little or no I/O activity. A simple solution is to only consider the observations where the number of handles actually changed.

analysis.sign.change.v2 = function (dat, risk = 0.001) {
  name = dat[1, 'name']
  has.issue = dat[1, 'has.issue']

  differential.data = diff(dat$handles)
  differential.data.sign = sign(differential.data)

  # count changes
  sign.changes = sum(diff(differential.data.sign) != 0)
  # only consider observations where the handle count actually changed;
  # the max() guards against sign.changes exceeding that count
  num.non.constant.obs = sum(differential.data.sign != 0)
  num.non.constant.obs = max(num.non.constant.obs, sign.changes)

  # lower tail p.value
  p.value = pbinom(sign.changes, num.non.constant.obs, 0.5)

  return(data.frame(list(
    name = name,
    p.value = p.value,
    detect.issue = p.value < risk,
    has.issue = has.issue
  )))
}

kable(ddply(rbind(dat.main, dat.mystery, dat.issue), "name", analysis.sign.change.v2))
name p.value detect.issue has.issue
heavy-load-1.csv 0 TRUE TRUE
heavy-load-2.csv 0 TRUE TRUE
issue-1.csv 1 FALSE FALSE
issue-2.csv 1 FALSE FALSE
issue-3.csv 1 FALSE FALSE
mystery-1.csv 1 FALSE FALSE
mystery-2.csv 1 FALSE FALSE
mystery-3.csv 0 TRUE TRUE
mystery-4.csv 0 TRUE TRUE
small-load.csv 1 FALSE FALSE
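Since the final detector has to run in JavaScript, the improved sign change test can be sketched there as well. R’s pbinom is not available, but the lower-tail binomial CDF is easy to compute directly from the probability mass function, using log-factorials for numerical stability. This is a sketch with my own function names, not the actual clinic-doctor implementation:

```javascript
// Sketch of the improved sign change test in JavaScript.
function logFactorial (n) {
  let sum = 0
  for (let i = 2; i <= n; i++) sum += Math.log(i)
  return sum
}

// P(X <= x) for X ~ Binomial(n, p), i.e. R's pbinom(x, n, p)
function pbinom (x, n, p) {
  let cdf = 0
  for (let k = 0; k <= x; k++) {
    const logPmf = logFactorial(n) - logFactorial(k) - logFactorial(n - k) +
      k * Math.log(p) + (n - k) * Math.log(1 - p)
    cdf += Math.exp(logPmf)
  }
  return cdf
}

function signChangeTest (handles, risk = 0.001) {
  // differentiate the handle counts and reduce to signs
  const signs = []
  for (let i = 1; i < handles.length; i++) {
    signs.push(Math.sign(handles[i] - handles[i - 1]))
  }

  // count sign changes
  let signChanges = 0
  for (let i = 1; i < signs.length; i++) {
    if (signs[i] !== signs[i - 1]) signChanges += 1
  }

  // only consider observations where the handle count actually changed
  const nonConstant = Math.max(
    signs.filter((s) => s !== 0).length,
    signChanges
  )

  // lower-tail p-value: few sign changes indicate long monotone runs
  const pValue = pbinom(signChanges, nonConstant, 0.5)
  return pValue < risk
}
```

A constant series yields a p-value of 1 and is thus never flagged, matching the issue-1 to issue-3 results above.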

Conclusion

The final model appears to be quite good. The p-values are extremely clear-cut, so there is a lot of margin for error. Without any real data, it still remains unclear whether the model hypothesis and the sign change test hold in practice. But for now, this is the best we can do.