= Tips on hacking the OCaml runtime system =

== Linking a test program with the debug runtime ==

Suppose you have a self-contained OCaml program `test.ml` that
crashes, and you are working in a development repository (not with an
installed version of OCaml). You probably want to run `test.ml` against
the "debug runtime", which in particular activates the `CAMLassert`
debug assertions.

If you want to use the bytecode compiler:

----
# build the runtime
make runtime -j

# compile as usual
./ocamlc.opt -nostdlib -I stdlib test.ml -o test

# run with the debug runtime (ocamlrund)
./runtime/ocamlrund ./test
----

If you want to use the native compiler:

----
# build the native runtime
make runtimeopt -j

# compile with "-runtime-variant d"
./ocamlopt.opt -nostdlib -I stdlib -runtime-variant d -I runtime test.ml -o test

./test
----

Note that the debug runtime does extra work, so it may slow down your
program -- and sometimes make the issue you are trying to debug
vanish.
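
When a crash becomes intermittent under the debug runtime, it can help
to rerun the program until it fails. The following is a minimal sketch;
`repeat_until_failure` is a hypothetical helper, not part of the OCaml
build system.

```shell
# Hypothetical helper: run a command up to $1 times and stop at the
# first run that exits with a non-zero status (including a crash).
repeat_until_failure() {
  max=$1
  shift
  i=1
  while [ "$i" -le "$max" ]; do
    if ! "$@"; then
      echo "run $i failed" >&2
      return 1
    fi
    i=$((i + 1))
  done
  return 0
}

# Example (assumes ./test was built as shown above):
#   repeat_until_failure 100 ./test
```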

== GC messages ==

The GC can send various messages about what it is doing, enabled with
the "v" option of OCAMLRUNPARAM. Various options are more or less
documented in
link:https://ocaml.org/manual/runtime.html#s:ocamlrun-options[].
You can enable all printing with

----
OCAMLRUNPARAM="v=0xffffffff" ./test
----

Note: in the runtime code, `caml_gc_log` can be used to print log
messages prefixed with the thread number. These messages are shown
with the more precise setting `v=0x800`.
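
The value of `v` is a bit mask, so individual message categories can be
combined. As a small sketch, the hypothetical helper `gc_verbosity`
below ORs a list of bits together; `0x001` is used only as an
illustrative second bit, and its meaning should be checked in the
manual page linked above.

```shell
# Hypothetical helper: OR verbosity bits together and print the
# corresponding OCAMLRUNPARAM setting.
gc_verbosity() {
  mask=0
  for bit in "$@"; do
    mask=$((mask | $bit))
  done
  printf 'v=0x%x\n' "$mask"
}

# Example (assumes ./test exists):
#   OCAMLRUNPARAM="$(gc_verbosity 0x001 0x800)" ./test
```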

== Heap verification ==

Another useful OCAMLRUNPARAM setting is `V=1`, which enables
additional sanity checks on the heap during major GC cycles.

----
OCAMLRUNPARAM="V=1" ./test
----
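
`OCAMLRUNPARAM` accepts several comma-separated settings, so heap
verification can be combined with GC messages. A small sketch, assuming
the program under test is `./test` as above:

```shell
# Enable both caml_gc_log messages (v=0x800) and heap verification (V=1)
# for every program started from this shell.
export OCAMLRUNPARAM="v=0x800,V=1"

# Then run the program as usual, for example:
#   ./test
```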

== Getting stack traces after assertion failures (Linux) ==

The output of a crashing OCaml program may end up like this:

----
[03] file domain.c; line 404 ### Assertion failed: domain_state->young_start == NULL
Aborted (core dumped)
----

The message "core dumped" indicates that a core dump, a snapshot of the crashed process that a debugger can inspect, was kept on disk.

On Linux, systemd-enabled systems tend to use a systemd tool (of course!) to store core dumps.

----
# ask your system how core dumps are handled.
$ cat /proc/sys/kernel/core_pattern
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
----

If your system is also using `systemd-coredump`, then the command
`coredumpctl dump` will show you information about the last "core
dump".

----
$ coredumpctl dump
           PID: 678260 (Domain0)
           UID: 1000 (gasche)
           GID: 1000 (gasche)
        Signal: 6 (ABRT)
     Timestamp: Fri 2022-02-25 09:30:32 CET (4min 30s ago)
  Command Line: ./test
    Executable: /home/gasche/Prog/ocaml/github-max_domains/test
 Control Group: [...]
                [...]
     Disk Size: 133.0K
       Message: Process 678260 (Domain0) of user 1000 dumped core.

                Stack trace of thread 678266:
                #0  0x00007f60ee4842a2 raise (libc.so.6 + 0x3d2a2)
                #1  0x00007f60ee46d8a4 abort (libc.so.6 + 0x268a4)
                #2  0x0000000000475022 n/a (/home/gasche/Prog/ocaml/github-max_domains/test + 0x75022)
Refusing to dump core to tty (use shell redirection or specify --output).
----

You can get a full backtrace using `echo bt | coredumpctl debug`:

----
$ echo bt | coredumpctl debug
[...]
Core was generated by `./test'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f60ee4842a2 in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7f60d77fe640 (LWP 678266))]
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.33-20.fc34.x86_64
(gdb) #0  0x00007f60ee4842a2 in raise () from /lib64/libc.so.6
#1  0x00007f60ee46d8a4 in abort () from /lib64/libc.so.6
#2  0x0000000000475022 in caml_failed_assert (
    expr=expr@entry=0x488498 "domain_state->young_start == NULL",
    file_os=file_os@entry=0x488218 "domain.c", line=line@entry=404) at misc.c:56
#3  0x0000000000461831 in caml_free_minor_heap () at domain.c:404
#4  0x000000000046237b in caml_reallocate_minor_heap (wsize=wsize@entry=786432) at domain.c:469
#5  0x0000000000474404 in caml_set_minor_heap_size (wsize=wsize@entry=786432) at minor_gc.c:130
#6  0x00000000004696b3 in caml_gc_set (v=<optimized out>) at gc_ctrl.c:222
#7  <signal handler called>
#8  0x000000000042a3b2 in camlTest__set_gc_280 () at test.ml:17
#9  0x000000000042a818 in camlTest__fun_529 () at test.ml:39
#10 0x000000000044947a in camlStdlib__Domain__body_694 () at domain.ml:204
#11 <signal handler called>
#12 0x000000000045fe38 in caml_callback_exn (closure=<optimized out>, arg=<optimized out>, arg@entry=1) at callback.c:169
#13 0x0000000000460369 in caml_callback (closure=<optimized out>, arg=arg@entry=1) at callback.c:253
#14 0x0000000000461f6a in domain_thread_func (v=0x7ffdd7357bb0) at domain.c:1034
#15 0x00007f60ee61f299 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f60ee547353 in clone () from /lib64/libc.so.6
(gdb) quit
----
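
On systems where `core_pattern` does not point to `systemd-coredump`,
a similar backtrace can usually be obtained by loading the core file
into `gdb` directly. The sketch below only classifies the
`core_pattern` value and prints a suggestion; `core_handler` is a
hypothetical helper, and the location and name of the core file depend
on your system configuration.

```shell
# Hypothetical helper: classify a core_pattern value.
core_handler() {
  case "$1" in
    *systemd-coredump*) echo systemd ;;
    "|"*)               echo pipe ;;
    *)                  echo file ;;
  esac
}

# Read the current setting (empty if the file does not exist).
pattern=$(cat /proc/sys/kernel/core_pattern 2>/dev/null || true)

case "$(core_handler "$pattern")" in
  systemd) echo "try: echo bt | coredumpctl debug" ;;
  pipe)    echo "core dumps are piped to another program: $pattern" ;;
  file)    echo "check 'ulimit -c', then try: gdb -batch -ex bt ./test <core file>" ;;
esac
```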

== Using `rr` for deterministic replay debugging ==

There is a lot of information on how to use `rr` to debug the OCaml
runtime on the OCaml Multicore wiki:
link:https://github.com/ocaml-multicore/ocaml-multicore/wiki/Debugging-the-OCaml-Multicore-runtime#rr[].
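
As a starting point, the basic `rr` workflow is to record one run and
then replay it under the debugger. The sketch below assumes that `rr`
is installed and that `./test` is the crashing executable; `rr_debug`
is a hypothetical wrapper.

```shell
# Hypothetical wrapper around the two basic rr commands.
rr_debug() {
  rr record "$@"   # run the program under rr and save a trace
  rr replay        # replay the most recent trace in a gdb session
}

# Example:
#   rr_debug ./test
```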

TODO: it would be nice to migrate some information here.

== Compiling with sanitizers ==

=== ThreadSanitizer ===

You can instrument the runtime to detect data races in it, by adding
`-fsanitize=thread` to both `CFLAGS` and `LDFLAGS`. It will however make the
compiler build rather slow.

Note that this is different from passing `--enable-tsan` to the configure
script. `--enable-tsan` instruments not only the runtime, but also the code
generated by ocamlopt. In addition, it suppresses a number of race reports from
the runtime to avoid clogging the output of user programs, and it reports a
slightly altered version of the real memory accesses to the TSan runtime (see
#12114).

=== Other sanitizers ===

TODO: I would be curious to know!

(For the brave, there are some scripts in
link:../tools/ci/inria/sanitizers/script[], but you probably don't
want to run them directly: in particular, they run `git clean -xfd`,
destroying changed/uncommitted files in your development repository!)
