Tuning

How to get the best performance out of Maestro Core.

Architecture

The Pool Manager is a multi-threaded application. Expect it to need memory to hold metadata; this footprint grows with the number of CDOs in the workflow.

The Pool Manager communicates with pool clients, that is, the Maestro resources on the client side. Applications using Maestro Core transparently start extra threads for Maestro's own use: threads for pool communication (MSTRO_PM_PC_NUM_THREADS) and threads that handle operations (MSTRO_OPERATIONS_NUM_THREADS), which parallelise pool operations at the network level.

Knobs

Maestro exposes a range of environment variables that can be tuned.

  • MSTRO_LOG_LEVEL

    Log level (Error|Warning|Info|Debug). (Optional)

  • MSTRO_LOG_MODULES

    Selection of modules that should log. (Optional) Typically:

    MSTRO_LOG_LEVEL=Info MSTRO_LOG_MODULES="stats"
    

    This tells Maestro Core to record logs up to Info level, but only for the stats module. It is typically used for a benchmarking run, where we do not want logging to slow us down but still want the stats report at mstro_finalize() time.

  • MSTRO_LOG_COLOR

    Log color. (green|blue|brightblue|…) (Optional) Typically helps to visually tell apart short logs coming from a couple of applications.

  • MSTRO_LOG_DST

    Select logging output channel. (stdout|stderr|syslog) (Optional)

  • MSTRO_TRANSPORT_DEFAULT

    Which transport method to choose by default. (RDMA|GFS|MIO) (Optional)

  • MSTRO_TRANSPORT_GFS_DIR

    Directory for GFS transport. (Optional)

  • MSTRO_OFI_NUM_RECV

    Number of concurrent receive operations per endpoint. (Optional)

  • MSTRO_PM_PC_NUM_THREADS

    Number of threads servicing OFI completion events. (Optional)

  • MSTRO_OPERATIONS_NUM_THREADS

    Number of threads handling Maestro operations. (Optional)

Maestro relies on OFI for network operations, so the usual OFI knobs can also be tuned.
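As an illustration, the knobs listed above could be applied from a small Python launcher. This is only a sketch: the helper names (maestro_env, run_with_knobs, BENCH_KNOBS) are hypothetical and not part of Maestro, the thread counts are placeholders rather than recommendations, and the child command is a stand-in for a real Maestro-enabled application.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping, Optional, Sequence

# Knobs for a benchmarking run. The log settings follow the example above;
# the thread counts are placeholder values, not recommendations.
BENCH_KNOBS = {
    "MSTRO_LOG_LEVEL": "Info",
    "MSTRO_LOG_MODULES": "stats",
    "MSTRO_PM_PC_NUM_THREADS": 2,       # placeholder value
    "MSTRO_OPERATIONS_NUM_THREADS": 4,  # placeholder value
}


def maestro_env(knobs: Mapping[str, object],
                base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the base environment with the given knobs set.

    Values are converted to strings because environment variables
    can only hold text.
    """
    env = dict(os.environ if base is None else base)
    for name, value in knobs.items():
        env[name] = str(value)
    return env


def run_with_knobs(argv: Sequence[str],
                   knobs: Mapping[str, object]) -> subprocess.CompletedProcess:
    """Run a command with the knobs applied to its environment only."""
    return subprocess.run(list(argv), env=maestro_env(knobs), check=True,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in child process: it just echoes one knob back, to show that
    # the setting reaches the child. Replace with the real application.
    child = [sys.executable, "-c",
             "import os; print(os.environ['MSTRO_LOG_MODULES'])"]
    print(run_with_knobs(child, BENCH_KNOBS).stdout.strip())
```

Building the environment per launch, rather than exporting variables globally, keeps different knob settings from leaking between runs when comparing configurations.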

Telemetry

Maestro Core log lines look like:

[I:pm] Simple_Pool_Manager:0 1 CQ-H-0-0 (nid00001 777) 22222479341864000: mstro_pm__handle_join_phase2(pool_manager.c:2540) JOIN message received. Caller Client:2 is now known as app #2

which reads as:

[<log level>:<log module>] <component_name>:<rank_id> <app_id> <thread_id> (<hostname> <pid>) <timestamp>: <function>(<file>:<lineno>) <message>
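The layout above lends itself to machine parsing. The following is a minimal sketch of a parser written against that layout and the single sample line shown here; the names LOG_LINE and parse_log_line are hypothetical, and the pattern assumes that component names, app ids and thread ids contain no whitespace, which may not hold for every log line.

```python
import re
from typing import Dict, Optional, Union

# One named group per field of the documented layout:
# [<log level>:<log module>] <component_name>:<rank_id> <app_id> <thread_id>
# (<hostname> <pid>) <timestamp>: <function>(<file>:<lineno>) <message>
LOG_LINE = re.compile(
    r"^\[(?P<level>[^:\]]+):(?P<module>[^\]]+)\]\s+"
    r"(?P<component>\S+):(?P<rank_id>\d+)\s+"
    r"(?P<app_id>\S+)\s+"
    r"(?P<thread_id>\S+)\s+"
    r"\((?P<hostname>\S+)\s+(?P<pid>\d+)\)\s+"
    r"(?P<timestamp>\d+):\s+"
    r"(?P<function>\w+)\((?P<file>[^:()]+):(?P<lineno>\d+)\)\s+"
    r"(?P<message>.*)$"
)

# Fields converted to int for easier sorting and arithmetic.
INT_FIELDS = ("rank_id", "pid", "timestamp", "lineno")

SAMPLE = (
    "[I:pm] Simple_Pool_Manager:0 1 CQ-H-0-0 (nid00001 777) "
    "22222479341864000: mstro_pm__handle_join_phase2(pool_manager.c:2540) "
    "JOIN message received. Caller Client:2 is now known as app #2"
)


def parse_log_line(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record: Dict[str, Union[str, int]] = dict(match.groupdict())
    for name in INT_FIELDS:
        record[name] = int(record[name])
    return record


if __name__ == "__main__":
    for key, value in parse_log_line(SAMPLE).items():
        print(f"{key:>10}: {value}")
```

Returning None for non-matching lines lets a caller skip unrelated output (for example application prints interleaved with Maestro logs) instead of failing on it.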

Profiling

A couple of utilities shipped with Maestro Core can usefully complement the reports of existing profiling tools when analysing Maestro-enabled workflows:

  • $(MAESTRO_PATH)/examples/core_bench runs a benchmark that shows some basic numbers

  • $(MAESTRO_PATH)/visualise/vis.py provides an in-browser interactive visualisation of a Maestro-enabled workflow

  • $(MAESTRO_PATH)/examples/transport_bars.py plots the timings of Maestro operations related to transport, taking Maestro logs as input

Scheduling

TODO