The Problem

I/O issues are currently detected using only the CPU data. We generally expect servers to be CPU bound, as node.js is very good at handling I/O concurrency. Thus, low CPU usage indicates that too much time is spent waiting for I/O.

However, by pressuring the server hard enough with a large number of requests, it is almost always possible to drive the CPU to 100%. The idea is therefore to use the additional handles data as a second metric for detecting I/O issues.

Getting the data

Matteo provided some sample files. To work with these in R, they are first converted to .csv files:

cat small-load.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > small-load.csv

cat heavy-load-1.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > heavy-load-1.csv
cat heavy-load-2.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > heavy-load-2.csv

cat mystery-1.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > mystery-1.csv
cat mystery-2.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > mystery-2.csv
cat mystery-3.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > mystery-3.csv
cat mystery-4.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > mystery-4.csv

cat issue-1.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > issue-1.csv
cat issue-2.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > issue-2.csv
cat issue-3.clinic-doctor-sample | node node-clinic-doctor/debug/sample-to-csv.js > issue-3.csv

Unfortunately we currently don’t have any real data, and the servers without slow I/O have unrealistically constant handle trends. This makes it difficult to test our model against data without a handle issue. To work around this, the small-load data set is used as an example without an I/O issue. Even though it was specifically created to have an I/O issue, the I/O issue primarily shows up in the CPU data.

To load the data:

load.data = function (name, has.issue) {
  dat = read.csv(name)
  dat$name = name
  dat$has.issue = has.issue
  dat$time = as.POSIXct((dat$timestamp - min(dat$timestamp)) / 1000, origin="1970-01-01", tz="GMT")
  return(dat)
}

dat.small.load = load.data('small-load.csv', F)
dat.heavy.load.1 = load.data('heavy-load-1.csv', T)
dat.heavy.load.2 = load.data('heavy-load-2.csv', T)
dat.mystery.1 = load.data('mystery-1.csv', F)
dat.mystery.2 = load.data('mystery-2.csv', F)
dat.mystery.3 = load.data('mystery-3.csv', T)
dat.mystery.4 = load.data('mystery-4.csv', T)
dat.issue.1 = load.data('issue-1.csv', F)
dat.issue.2 = load.data('issue-2.csv', F)
dat.issue.3 = load.data('issue-3.csv', F)

The data is structured as:

print.data.table(head(dat.small.load))
timestamp interval delay cpu memory.rss memory.heapTotal memory.heapUsed memory.external handles
1.508255e+12 0 344.903667 0.8164976 43188224 22470656 16253032 482408 3
1.508255e+12 0 0.508562 0.1085781 43257856 22470656 16302568 482482 3
1.508255e+12 0 1.354164 0.0249248 43266048 22470656 16311496 482556 3
1.508255e+12 0 2.736907 0.0168016 43274240 22470656 16315992 482630 3
1.508255e+12 0 2.760665 0.0182592 43282432 22470656 16320296 482704 3
1.508255e+12 0 0.150944 0.0195056 43286528 22470656 16324600 482778 3

The interval column has the value 1 for the guessed analysis interval. To focus only on this data:

subset.interval = function (dat) {
  dat = dat[dat$interval == 1, ]
  dat$time = as.POSIXct((dat$timestamp - min(dat$timestamp)) / 1000, origin="1970-01-01", tz="GMT")
  return(dat)
}

dat.small.load = subset.interval(dat.small.load)
dat.heavy.load.1 = subset.interval(dat.heavy.load.1)
dat.heavy.load.2 = subset.interval(dat.heavy.load.2)
dat.mystery.1 = subset.interval(dat.mystery.1)
dat.mystery.2 = subset.interval(dat.mystery.2)
dat.mystery.3 = subset.interval(dat.mystery.3)
dat.mystery.4 = subset.interval(dat.mystery.4)
dat.issue.1 = subset.interval(dat.issue.1)
dat.issue.2 = subset.interval(dat.issue.2)
dat.issue.3 = subset.interval(dat.issue.3)

Reprinting the data shows that the interval = 0 observations have now been stripped.

print.data.table(head(dat.small.load))
timestamp interval delay cpu memory.rss memory.heapTotal memory.heapUsed memory.external handles
15 1.508255e+12 1 2.336333 0.4967440 43782144 22994944 16743760 483444 23
16 1.508255e+12 1 0.727231 0.0316018 43786240 22994944 16748880 483518 23
17 1.508255e+12 1 2.187926 0.0269939 43790336 22994944 16753776 483592 23
18 1.508255e+12 1 2.734569 0.0155482 43794432 22994944 16758672 483666 23
19 1.508255e+12 1 2.731688 0.0185364 43802624 22994944 16763568 483740 23
20 1.508255e+12 1 1.796073 0.1580187 43831296 22994944 16790752 483814 27

Finally, the data is combined for convenience:

dat.main = rbind(dat.small.load, dat.heavy.load.1, dat.heavy.load.2)
dat.mystery = rbind(dat.mystery.1, dat.mystery.2, dat.mystery.3, dat.mystery.4)
dat.issue = rbind(dat.issue.1, dat.issue.2, dat.issue.3)

The model hypothesis

The model hypothesis is made by looking at the data and considering one’s domain knowledge. The latter in particular should be the primary component when there isn’t any real data.

p = ggplot(dat.main, aes(x = time, y = handles, colour=has.issue))
p = p + geom_line()
p = p + facet_grid(name ~ ., scales='free_y')
p = p + scale_x_datetime(labels = date_format("%S sec"))
p = p + scale_y_continuous(limits = c(0, NA))
print(p)

The model hypothesis that Matteo provided was:

servers with an I/O issue will have an increasing number of handles that will occasionally decrease a lot.

This appears to fit the heavy-load data well and the small-load data poorly. This is also what we want, as small-load is treated as having no issue in the handles data.

Quantifying the model hypothesis

Coefficient of variation

The immediate thought is that heavy-load will have relatively more variance than small-load because it has more extreme values. A challenge here is that the variance will depend on the number of concurrent requests. To normalize against this, the standard deviation is divided by the mean. This is called the coefficient of variation (cv).

cv.unbiased = function (vec) {
  cv = sd(vec) / mean(vec)
  return(cv * (1 + 1/(4 * length(vec))))
}

analysis.cv = function (dat) {
  name = dat[1, 'name']
  return(data.frame(list(
    cv.unbiased = cv.unbiased(dat$handles)
  )))
}

kable(ddply(dat.main, "name", analysis.cv))
name cv.unbiased
heavy-load-1.csv 0.3652156
heavy-load-2.csv 0.1156849
small-load.csv 0.1299085

The results quickly show that this is not the case: heavy-load-1 actually has a smaller coefficient of variation than small-load.
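Even though the coefficient of variation did not separate the data sets, it is worth noting how easily such a statistic ports to JavaScript, where the detector will eventually have to live. A sketch with my own helper names, not the actual clinic-doctor code:

```javascript
// Sketch of the unbiased coefficient of variation in JavaScript,
// mirroring the cv.unbiased R function above.
function mean (vec) {
  return vec.reduce((a, b) => a + b, 0) / vec.length
}

// sample standard deviation (n - 1 denominator), matching R's sd()
function sd (vec) {
  const m = mean(vec)
  const squares = vec.reduce((a, b) => a + (b - m) * (b - m), 0)
  return Math.sqrt(squares / (vec.length - 1))
}

function cvUnbiased (vec) {
  const cv = sd(vec) / mean(vec)
  // small-sample bias correction, as in the R version
  return cv * (1 + 1 / (4 * vec.length))
}
```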

Heavy tail detection

Before moving on to the next idea, it can be useful to look at the data from other angles than just the plain graph. One such angle is the data distribution.

p = ggplot(dat.main, aes(handles, colour=has.issue))
p = p + geom_density(fill = 1, alpha = 0.1)
p = p + facet_wrap(~ name, scales='free')
print(p)

From this data, the idea is that the heavy-load data has more skewness (yes, this is a statistical term) than small-load. One could then use a fancy statistical test like Jarque-Bera to test the skewness value. However, fancy statistical tests are a nightmare to implement in JavaScript. Instead, the data is assumed to be normally distributed; from this assumption one can check whether there is a surprising amount of data at either of the distribution’s tails.

plot.heavy.tail = function (dat) {
  lower.bound = mean(dat$handles) - 1.96 * sd(dat$handles)
  upper.bound = mean(dat$handles) + 1.96 * sd(dat$handles)
  
  p = ggplot(dat, aes(x = time, y = handles, colour=has.issue))
  p = p + geom_line()
  p = p + scale_x_datetime(labels = date_format("%S sec"))
  p = p + geom_hline(yintercept = lower.bound)
  p = p + annotate("text", min(dat$time), lower.bound, vjust = -1, hjust=0, label = "lower bound")
  p = p + geom_hline(yintercept = upper.bound)
  p = p + annotate("text", max(dat$time), upper.bound, vjust = -1, hjust=1, label = "upper bound")
  print(p)
}

plot.heavy.tail(dat.small.load)

plot.heavy.tail(dat.heavy.load.1)

From the graphs, this appears to be a somewhat reasonable strategy: small-load is evenly distributed on both sides of the bounds, while heavy-load-1 clearly is not. This is then turned into a complete test function:

analysis.heavy.tail = function (dat, risk = 0.05) {
  name = dat[1, 'name']
  has.issue = dat[1, 'has.issue']
  t.multiplier = qt(1 - risk/2, nrow(dat))
  lower.bound = mean(dat$handles) - t.multiplier * sd(dat$handles)
  upper.bound = mean(dat$handles) + t.multiplier * sd(dat$handles)

  # extreme tail ratio
  upper.extreme.ratio = sum(dat$handles > upper.bound) / length(dat$handles)
  lower.extreme.ratio = sum(dat$handles < lower.bound) / length(dat$handles)
  issue.detected = max(upper.extreme.ratio, lower.extreme.ratio) > risk

  return(data.frame(list(
    lower = lower.extreme.ratio,
    upper = upper.extreme.ratio,
    detect.issue = issue.detected,
    has.issue = has.issue
  )))
}
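For illustration, the same test can be sketched in JavaScript. To keep it dependency-free, this sketch uses the fixed normal-approximation multiplier 1.96 rather than the exact t-quantile the R version computes with qt(); the function name is my own:

```javascript
// Sketch of the heavy tail test in JavaScript. Unlike the R version,
// which computes an exact t-quantile with qt(), this uses the fixed
// normal-approximation multiplier 1.96.
function heavyTailTest (handles, risk = 0.05) {
  const n = handles.length
  const mean = handles.reduce((a, b) => a + b, 0) / n
  const sd = Math.sqrt(
    handles.reduce((a, b) => a + (b - mean) * (b - mean), 0) / (n - 1)
  )

  // bounds that 95% of observations should fall within, assuming normality
  const lowerBound = mean - 1.96 * sd
  const upperBound = mean + 1.96 * sd

  // ratio of observations beyond each bound
  const upperRatio = handles.filter((h) => h > upperBound).length / n
  const lowerRatio = handles.filter((h) => h < lowerBound).length / n

  return Math.max(upperRatio, lowerRatio) > risk
}
```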

kable(ddply(dat.main, "name", analysis.heavy.tail))
name lower upper detect.issue has.issue
heavy-load-1.csv 0.0000000 0.0638767 TRUE TRUE
heavy-load-2.csv 0.0288889 0.0566667 TRUE TRUE
small-load.csv 0.0336818 0.0174216 FALSE FALSE

At this point life is good. However, Matteo later showed data where this strategy doesn’t work.

kable(ddply(dat.mystery, "name", analysis.heavy.tail))
name lower upper detect.issue has.issue
mystery-1.csv 0.0590031 0.0000000 TRUE FALSE
mystery-2.csv 0.0011521 0.0299539 FALSE FALSE
mystery-3.csv 0.0288889 0.0566667 TRUE TRUE
mystery-4.csv 0.0000000 0.0315789 FALSE TRUE

As seen from this result, the issue detection (detect.issue) doesn’t match the target (has.issue). Yet looking at the data, nothing appears to be unusual.

p = ggplot(dat.mystery, aes(x = time, y = handles, colour=has.issue))
p = p + geom_line()
p = p + facet_grid(name ~ ., scales='free_y')
p = p + scale_x_datetime(labels = date_format("%S sec"))
p = p + scale_y_continuous(limits = c(0, NA))
print(p)

Looking at the results for dat.mystery.4, where an issue should be detected but isn’t, nothing appears to be wrong. The detection logic just isn’t good enough. One could of course try to work around it with small hacks, but that is rarely the solution.

plot.heavy.tail(dat.mystery.4)

What one should take from this is that the model hypothesis might be slightly wrong. The drops that do appear in dat.mystery.4 are not sudden, but rather somewhat gradual.

New model hypothesis

The previous analysis inspires a new hypothesis.

servers with an I/O issue will have an increasing number of handles for a period, followed by a period with a decreasing number of handles.

Sign change test

The immediate thought might be to fit a sine function to this data. But without knowing what a good or bad period is, and because the period and amplitude may change over time, this is unlikely to be a good strategy.

Instead, increasing and decreasing trends are often best analysed by looking at the differential curve, that is, the difference from the previous time step.

diff.data = function (dat) {
  return(data.frame(list(
    timestamp = head(dat$timestamp, -1),
    time = head(dat$time, -1),
    handles.delta = diff(dat$handles),
    has.issue = head(dat$has.issue, -1)
  )))
}

dat.main.delta = ddply(dat.main, "name", diff.data)
dat.mystery.delta = ddply(dat.mystery, "name", diff.data)
p = ggplot(rbind(dat.main.delta, dat.mystery.delta), aes(x = time, y = handles.delta, colour=has.issue))
p = p + geom_line()
p = p + facet_grid(name ~ ., scales='free_y')
p = p + scale_x_datetime(labels = date_format("%S sec"))
print(p)

In the differential data, the increasing parts will be positive and the decreasing parts will be negative. We can understand this better by looking at just the signs.

dat.main.delta$handles.delta.sign = sign(dat.main.delta$handles.delta)
dat.mystery.delta$handles.delta.sign = sign(dat.mystery.delta$handles.delta)
p = ggplot(rbind(dat.main.delta, dat.mystery.delta), aes(x = time, y = handles.delta.sign, colour=has.issue))
p = p + geom_point(alpha=0.1)
p = p + facet_grid(name ~ .)
p = p + scale_x_datetime(labels = date_format("%S sec"))
print(p)

From this it is apparent that the samples with an issue have an uneven distribution of increasing and decreasing handle counts.

In classical statistics, a test that is often performed is the sign test. It says that if the data is normally distributed, then the next observation has a 50% chance of having the opposite sign compared to the previous observation. This actually holds not just for normally distributed data, but for symmetrically distributed data in general. Looking at the histograms from before, for dat.mystery there also appears to be a relation between symmetry and the number of handles, although the relation is not immediately apparent.

To get a better idea if this is true, a symmetry plot can be made:

symmetry.data = function (dat) {
  handles.median = median(dat$handles.delta)
  handles.sorted = dat$handles.delta[order(dat$handles.delta)]
  
  return(data.frame(
    symmetry.x = rev(handles.sorted) - handles.median,
    symmetry.y = handles.median - handles.sorted,
    has.issue = dat$has.issue
  ))
}

dat.main.delta.symmetry = ddply(dat.main.delta, "name", symmetry.data)
dat.mystery.delta.symmetry = ddply(dat.mystery.delta, "name", symmetry.data)

p = ggplot(rbind(dat.main.delta.symmetry, dat.mystery.delta.symmetry), aes(symmetry.x, symmetry.y, colour=has.issue))
p = p + geom_point(alpha=0.3)
p = p + geom_abline(intercept = 0, slope = 1)
p = p + facet_wrap(~ name, scales='free')
print(p)

In this plot, data samples that do not generally fall on the line can be said to be non-symmetrical. From this plot, the correlation between having an issue and asymmetry appears to be very strong.

The next step is then to implement the sign change test. To do this, we count the number of sign changes and compare it with the total number of observations. For symmetric data, the probability of a sign change is 0.5. However, this does not guarantee an exact split. To overcome this, the binomial distribution is used to tell the probability of observing the given number of sign changes, given n observations and a probability of 50%.

analysis.sign.change.v1 = function (dat, risk = 0.001) {
  name = dat[1, 'name']
  has.issue = dat[1, 'has.issue']
  
  differential.data = diff(dat$handles)
  differential.data.sign = sign(differential.data)
  
  # count changes
  sign.changes = sum(diff(differential.data.sign) != 0)
  
  # lower tail p.value
  p.value = pbinom(sign.changes, nrow(dat), 0.5)
  
  return(data.frame(list(
    name = name,
    p.value = p.value,
    detect.issue = p.value < risk,
    has.issue = has.issue
  )))
}

kable(ddply(rbind(dat.main, dat.mystery), "name", analysis.sign.change.v1))
name p.value detect.issue has.issue
heavy-load-1.csv 0.0000000 TRUE TRUE
heavy-load-2.csv 0.0000000 TRUE TRUE
mystery-1.csv 0.9999998 FALSE FALSE
mystery-2.csv 0.2595071 FALSE FALSE
mystery-3.csv 0.0000000 TRUE TRUE
mystery-4.csv 0.0000000 TRUE TRUE
small-load.csv 0.8767220 FALSE FALSE

Once again life is good. However, Matteo later showed data where this test also fails.

kable(ddply(dat.issue, "name", analysis.sign.change.v1))
name p.value detect.issue has.issue
issue-1.csv 0 TRUE FALSE
issue-2.csv 0 TRUE FALSE
issue-3.csv 0 TRUE FALSE

Improved sign change test

Luckily, the fix here is simple. The false detection occurs when the number of handles is almost constant:

p = ggplot(dat.issue, aes(x = time, y = handles, colour=has.issue))
p = p + geom_line()
p = p + facet_grid(name ~ ., scales='free_y')
p = p + scale_x_datetime(labels = date_format("%S sec"))
p = p + scale_y_continuous(limits = c(0, NA))
print(p)

This mostly happens in rather theoretical cases where there is very little or no I/O activity. A simple solution is to only consider the observations where the number of handles actually changed.

analysis.sign.change.v2 = function (dat, risk = 0.001) {
  name = dat[1, 'name']
  has.issue = dat[1, 'has.issue']

  differential.data = diff(dat$handles)
  differential.data.sign = sign(differential.data)

  # count changes
  sign.changes = sum(diff(differential.data.sign) != 0)
  # only consider observations where the handle count actually changed;
  # the max() guards against sign.changes exceeding that count
  num.non.constant.obs = sum(differential.data.sign != 0)
  num.non.constant.obs = max(num.non.constant.obs, sign.changes)

  # lower tail p.value
  p.value = pbinom(sign.changes, num.non.constant.obs, 0.5)

  return(data.frame(list(
    name = name,
    p.value = p.value,
    detect.issue = p.value < risk,
    has.issue = has.issue
  )))
}

kable(ddply(rbind(dat.main, dat.mystery, dat.issue), "name", analysis.sign.change.v2))
name p.value detect.issue has.issue
heavy-load-1.csv 0 TRUE TRUE
heavy-load-2.csv 0 TRUE TRUE
issue-1.csv 1 FALSE FALSE
issue-2.csv 1 FALSE FALSE
issue-3.csv 1 FALSE FALSE
mystery-1.csv 1 FALSE FALSE
mystery-2.csv 1 FALSE FALSE
mystery-3.csv 0 TRUE TRUE
mystery-4.csv 0 TRUE TRUE
small-load.csv 1 FALSE FALSE
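Since the final detector has to run in JavaScript, the improved sign change test can be sketched there as well. R’s pbinom is not available, but the lower-tail binomial CDF is easy to compute directly from the probability mass function, using log-factorials for numerical stability. This is a sketch with my own function names, not the actual clinic-doctor implementation:

```javascript
// Sketch of the improved sign change test in JavaScript.
function logFactorial (n) {
  let sum = 0
  for (let i = 2; i <= n; i++) sum += Math.log(i)
  return sum
}

// P(X <= x) for X ~ Binomial(n, p), i.e. R's pbinom(x, n, p)
function pbinom (x, n, p) {
  let cdf = 0
  for (let k = 0; k <= x; k++) {
    const logPmf = logFactorial(n) - logFactorial(k) - logFactorial(n - k) +
      k * Math.log(p) + (n - k) * Math.log(1 - p)
    cdf += Math.exp(logPmf)
  }
  return cdf
}

function signChangeTest (handles, risk = 0.001) {
  // differentiate the handle counts and reduce to signs
  const signs = []
  for (let i = 1; i < handles.length; i++) {
    signs.push(Math.sign(handles[i] - handles[i - 1]))
  }

  // count sign changes
  let signChanges = 0
  for (let i = 1; i < signs.length; i++) {
    if (signs[i] !== signs[i - 1]) signChanges += 1
  }

  // only consider observations where the handle count actually changed
  const nonConstant = Math.max(
    signs.filter((s) => s !== 0).length,
    signChanges
  )

  // lower-tail p-value: few sign changes indicate long monotone runs
  const pValue = pbinom(signChanges, nonConstant, 0.5)
  return pValue < risk
}
```

A constant series yields a p-value of 1 and is thus never flagged, matching the issue-1 to issue-3 results above.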

Conclusion

The final model appears to be quite good. The p-values are extremely clear-cut, so there is a lot of margin for error. Without any real data, it still remains unclear whether the model hypothesis and the sign change test hold in practice. But for now, this is the best we can do.