Re: [che-dev] How to collect and persist all workspace logs?

Hello team,
we've run into trouble implementing the writing of containers' stdout
logs to files. This is quite unfortunate, as we've spent some time
analyzing the feature, but that's life.

The original idea was to modify the command of the image so that it
redirects its output into a file, something like `<command> | tee
c1.log`. However, that is somewhere between very hard and impossible
to achieve. My idea was to pass it as `args` to the container command.
This does not work, and I think that's because the arguments are
handed to the process as literal strings rather than being interpreted
by a shell, so it becomes something like `<command> '| tee c1.log'`,
which does not do what we want. To actually update the command, we
would need to have the full image pulled and then somehow inspect it
to get the original command and update it. This would mean a very
invasive change to the current workspace startup logic, with an
uncertain result and high risk.
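
To illustrate the `args` problem, here is a minimal sketch in plain
Python (not Kubernetes itself). It assumes that container `args` behave
like an exec-style argument list, so no shell ever sees the pipe. The
names `ORIGINAL_COMMAND`, `run_with_args` and `run_wrapped_in_shell` are
made up for this example, and it needs `sh` and `tee` on the PATH.

```python
import shlex
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the image's original command: it just prints
# the arguments it received.
ORIGINAL_COMMAND = [sys.executable, "-c", "import sys; print('argv:', sys.argv[1:])"]


def run_with_args(extra_args):
    """Append extra arguments without a shell, as exec-style `args` are."""
    result = subprocess.run(ORIGINAL_COMMAND + extra_args,
                            capture_output=True, text=True, check=True)
    return result.stdout


def run_wrapped_in_shell(log_file):
    """Wrap the (known) original command in `sh -c` so the pipe is interpreted."""
    quoted = " ".join(shlex.quote(part) for part in ORIGINAL_COMMAND)
    script = f"{quoted} 2>&1 | tee {shlex.quote(str(log_file))}"
    result = subprocess.run(["sh", "-c", script],
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # The pipe arrives as one literal argument; no log file is written.
    print(run_with_args(["| tee c1.log"]), end="")
    with tempfile.TemporaryDirectory() as tmp:
        log_file = Path(tmp) / "c1.log"
        print(run_wrapped_in_shell(log_file), end="")
        print("log file content:", log_file.read_text(), end="")
```

The shell wrapper only works here because the original command is
known up front, which is exactly what we do not have for an arbitrary
image without pulling and inspecting it.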

So where to go next? We have a few ideas:

# Namespace log collector component (I have a working prototype of this)
  - will run in an extra pod in the namespace of the workspace
  - will watch for workspace pods and, once some are running, start
following the logs of all their containers and write them to files
  - one instance per namespace
  - lifecycle managed by che-server (can scale down when no workspace
is running and scale up before first workspace start)
  - should be quite gentle with hw resources (TODO: measure),
especially with many workspaces in the same namespace
  - outlives the workspace, so we should be able to get all the logs
  - logs could be provided to the backend within the same component
  - should be possible to manage file logs from inside the containers
with this component
  - needs either an extra PVC for logs or the workspace's PVC; the
latter means that all workspaces have to run on one node and the logic
has to be more complex to reflect the different Che PVC strategies
  - for "namespace per workspace" or "only one workspace per user"
scenarios, same hw requirements as a sidecar collector
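
The "follow all containers and write them to files" part could look
roughly like the sketch below. This is not the prototype, only an
illustration in plain Python under some assumptions: `follow` is a
hypothetical callable that yields log lines for one container (in a real
collector it would stream from the Kubernetes API), and the
`<workspace>/<pod>/<container>.log` layout is made up for the example.

```python
import threading
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

# (workspace id, pod name, container name); a hypothetical identifier shape.
ContainerRef = Tuple[str, str, str]


def log_path(root: Path, ref: ContainerRef) -> Path:
    """Assumed layout: <root>/<workspace>/<pod>/<container>.log"""
    workspace, pod, container = ref
    return root / workspace / pod / f"{container}.log"


def follow_to_file(root: Path, ref: ContainerRef,
                   follow: Callable[[ContainerRef], Iterable[str]]) -> None:
    """Append every line produced by `follow(ref)` to the container's log file."""
    path = log_path(root, ref)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as out:
        for line in follow(ref):
            out.write(line.rstrip("\n") + "\n")
            out.flush()  # keep the file current in case the collector dies


def collect(root: Path, refs: List[ContainerRef],
            follow: Callable[[ContainerRef], Iterable[str]]) -> None:
    """Follow all given containers concurrently, one thread per container."""
    threads = [threading.Thread(target=follow_to_file, args=(root, ref, follow))
               for ref in refs]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

One thread per container is the simplest thing that works for a sketch;
whether it is gentle enough on resources with many workspaces in one
namespace is the part that still has to be measured.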

# Workspace log collector sidecar
  - will run in the workspace pod as a sidecar
  - will follow all the container logs of the workspace and write them to PVC
  - no issues with PVC access from multiple pods
  - same lifecycle as the workspace, so it's easier to deploy with
current server logic ("just" add another sidecar)
  - easiest to get file logs from inside the containers as we're in the same pod
  - same lifecycle as the workspace, so we're not sure we get all the
logs before the collector is killed
  - extra hw resources consumed per workspace
  - we will need another component to send the logs to the backend, as
we can't rely on the workspace pod managing it in time when the
workspace stops

# Watch and collect from che-server
  - che-server will watch and follow the logs of all workspaces and
write them to PVC/Database/?
  - no extra hw resources per workspace/namespace
  - logs are collected directly to the place where they can be
requested so not much extra coding needed to make them accessible on
server API
  - higher network traffic workspace ⇔ che-server
  - has to keep connections to all workspaces open all the time
  - higher hw requirements on che-server
  - hard to impossible to get file logs from inside the containers,
probably will need another component that will run on-exit inside
workspace's namespace

An important question here is how hard a requirement it is to get the
file logs from inside the containers (e.g. language servers). The
answer may well decide which way we go.


On Thu, Jan 23, 2020 at 5:04 AM Michal Vala <mvala@xxxxxxxxxx> wrote:
> Hello team,
> we're currently working on improving the diagnostic capabilities[1] of workspaces,
> more concretely on how to get all logs from a workspace[2]. We're in the phase of
> investigating options and prototyping, and we've come up with several variants
> of how to achieve the goal. We would like to know your opinions and new ideas.
> Requirements:
>   - collect all logs of all containers from the workspace
>   - stdout/err as well as file logs inside the container
>   - keep history of last 5 runs of the workspace
>   - collect logs of crashed workspace
>   - make logs easily accessible to the user (rest API + dashboard view)
> I've split the effort into two sections:
>   ### How to collect:
>     # log everything to files to mounted PV
>       - just mount PV and log everything there
>       - pros
>         - not much extra overhead, only write stdout/err to the file
>           and mount PV
>         - don't need extra hw resources (memory/cpu)
>       - cons
>         - we might need to override `command` of all containers. They will
>           have to run with extra parameters to write stdout/err to the file.
>           Something like `<command> 2>&1 | tee ws.log`
>     # workspace collector sidecar (kubernetes/client-go app?)
>       - pros
>         - per workspace
>         - dynamic and powerful
>       - cons
>         - very custom solution and might be hard to manage/maintain
>         - unknown performance and hw resources requirements
>         - hard when the ws crashes
>         - need more memory per workspace, even if user does not use it and
>           everything works as expected
>     # watch and collect from master
>       - pros
>         - easy to grab logs and events
>         - easy to access archived logs
>       - cons
>         - only container's stderr/out
>         - keep the connection to ws
>         - more network traffic
>         - increases memory footprint of master
>     # kubernetes native
>       - change the logging backend of kubernetes [3]
>       - pros
>         - standard k8s way, "googleable"
>       - cons
>         - depends on kubernetes deployment
>         - needs extra cluster component/configuration
>         - only stdout/err of containers
>     # push logs directly from containers to logging backend
>       - cons
>         - customize all components to log to the backend
>         - performance and hw resources overhead
>     # collect on workspace exit
>       - mount PV and log there. When the workspace exits, start a collector pod
>           that grabs the logs and "archives" them.
>       - pros
>         - not much extra overhead
>       - cons
>         - don't have logs of running workspace
>         - custom collector pod
>   ### Where to store and how to access:
>     # Workspace PV
>       - pros
>         - easy to set quota per user
>       - cons
>         - harder to access (need to start some pod at workspace's namespace)
>         - lost when the namespace is deleted
>     # Che PV
>       - pros
>         - easier to access
>       - cons
>         - harder to set quota per user
>         - harder to scale and manage
>         - possible performance bottleneck
>     # PostgreSQL
>       - pros
>         - the easiest to access
>       - cons
>         - harder to set quota per user
>         - harder to scale and manage
>         - possible performance bottleneck
> There is one remaining and very important question we have not investigated
> much. We need to somehow configure all plugins/editors and other components to
> tell us where they keep the log files that should be collected. Otherwise, we
> would not be able to find the logs in the containers. We would need to handle
> that in the plugin's `meta.yaml` as well as in the devfile.
> What's next?
>   We would like to investigate and prototype the following solution:
>     - collect all ws logs into files and store them in a PV in the workspace
>     - watch ws events from master and, on exit, start the collector pod that will
>       collect all the logs and pass them to the backend. The logs backend is
>       still to be done. It might be only a PV dedicated to archiving logs, or
>       some new service, or Che master.
>     - prototype a new Che master API to access the logs. If we store them in
>       the workspace's PV, start the collector pod on demand to access the logs.
> We would very much welcome any opinions or ideas.
> [1] -
> [2] -
> [3] -

Michal Vala
Software Engineer, Eclipse Che
Red Hat Czech

