Re: [che-dev] How to collect and persist all workspace logs?

From: Michal Vala <mvala@xxxxxxxxxx>
Date: Wed, 5 Feb 2020 13:55:12 +0100
Hi Mario and everyone,

When we were analysing and designing this, we assumed these requirements:
  - the consumer is the user of the workspace
  - collect the output logs of all workspace containers and the file logs
inside the containers
  - archive at least 5 runs of each workspace
  - make the logs easily accessible to the user (dashboard)
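
The "archive at least 5 runs" requirement could be sketched as a small pruning step. This is only an illustration under stated assumptions: the layout (one directory per run, named so that sorting by name is chronological) and the `prune_runs` helper are hypothetical, not something Che provides.

```python
import shutil
import tempfile
from pathlib import Path


def prune_runs(workspace_log_dir: Path, keep: int = 5) -> list:
    """Delete all but the newest `keep` run directories; return the kept names.

    Assumes run directories are named so that sorting by name puts the
    oldest first (e.g. a zero-padded counter or a UTC timestamp).
    """
    if keep <= 0:
        raise ValueError("keep must be positive")
    runs = sorted(p for p in workspace_log_dir.iterdir() if p.is_dir())
    for old in runs[:-keep]:
        shutil.rmtree(old)
    return [p.name for p in runs[-keep:]]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for i in range(1, 8):  # simulate 7 runs of one workspace
        (root / f"run-{i:04d}").mkdir()
    kept = prune_runs(root, keep=5)
    remaining = sorted(p.name for p in root.iterdir())
```

With `keep=1` the same helper would cover the reduced "only the latest run" scope.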

Now it looks like we're back to defining what we're trying to achieve.
So, a few questions:
  - is the consumer only the user?
  - do we want to provide the logs from the full workspace lifetime or only
from startup?
  - do we want to provide the logs when the workspace fails to start?
  - do we want to provide the logs when the workspace crashes at some point
in the running phase?
  - do we want to keep the log history of previous runs (last 5)?
  - do we want to keep the log history only for previous crashes (last 5)?
  - the same questions apply to file logs inside the containers and to the
containers' stdout

As a starting point that covers the workspace start failure scenario
and gives the regular user a piece of information that is otherwise
not obvious how to get, we can suggest this:

- is the consumer only the user? - yes
- do we want to provide the logs from the full workspace lifetime or only
from startup? - startup
- do we want to provide the logs when the workspace fails to start? - it
does not depend on whether the start was successful or not; the startup
logs are provided either way.
- do we want to provide the logs when the workspace crashes at some point
in the running phase? - no
- do we want to keep the log history of previous runs (last 5)? - no,
only the latest.
- do we want to keep the log history only for previous crashes (last 5)? - no
- the same for file logs inside the containers and the containers'
stdout? - track only the containers' stdout.

That should be enough to diagnose the majority of the failures that we
can expect today.

More comments/questions inline.


On Wed, Feb 5, 2020 at 8:42 AM Mario Loriedo <mario.loriedo@xxxxxxxxx> wrote:
>
> Hi Michal,
>
> Thanks for this analysis and sorry for the late reply.
>
> I don't think that building a log collecting system from scratch is the right approach (it's painful). What about going through the kube native direction you mentioned in your first mail? Grafana loki or fluentd are projects that may solve our problem.

We were targeting the user, so a cluster-wide logging solution is mostly
out of the game. Also, changing the cluster configuration and introducing
a new, possibly huge component is imho a no-go. If a cluster admin wants
to do something like this, we shouldn't block them, but that's all.

>
> Another aspect is that querying the workspaces logs is more an admin user story than a user one. Tools like elastic search and grafana provide a good UX for that, I would NOT build a Che UI component and a new wsmaster API for that. As for monitoring the logging collection should be optional and an admin could choose to activate it if he wants.

I was not thinking about building a log analysis tool in the Che UI.
The idea was more like a simple download button that would give you a
zip with all the logs of your workspace runs (last 5). That's all.
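
To make the "download button" idea concrete, here is a minimal sketch of the bundling step only. The `zip_logs` helper, the run ids and the container names are hypothetical, and it assumes the logs are already available as text keyed by run and container.

```python
import io
import zipfile


def zip_logs(runs: dict, last: int = 5) -> bytes:
    """Bundle the newest `last` runs into an in-memory zip archive.

    `runs` maps a run id to {container name: log text}; run ids are
    assumed to sort chronologically.
    """
    if last <= 0:
        raise ValueError("last must be positive")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for run_id in sorted(runs)[-last:]:
            for container, text in sorted(runs[run_id].items()):
                archive.writestr(f"{run_id}/{container}.log", text)
    return buffer.getvalue()


# Seven simulated runs with two containers each (names are made up).
example = {
    f"run-{i:04d}": {"theia": f"start {i}\n", "plugin-broker": "ok\n"}
    for i in range(1, 8)
}
payload = zip_logs(example)
with zipfile.ZipFile(io.BytesIO(payload)) as archive:
    names = archive.namelist()
    newest_theia = archive.read("run-0007/theia.log").decode("utf-8")
```

The returned bytes could be served as a single download; nothing in the sketch needs a log analysis UI.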

>
> But anyway, although the admin scenario is important, I believe the original problem we were trying to solve was more a user problem. We want to make it easy for a user to troubleshoot a workspace that:
>
> - fails to start
> - is not behaving correctly (i.e. a LS doesn't work as expected)
>
> We have already made some good progress on troubleshooting (better messages) but there are still some cases where it's hard to figure out what's going on. For those cases, providing the logs to the user would help. But I am not sure that persisting the logs is necessary:

How could we provide the logs of a crashed workspace without some level
of persistence?

>
> - when an error happens at workspaces start we should provide: wsmaster logs, kubernetes events, containers status and logs from the workspace pod and from the plugin broker.

Should we really give wsmaster logs to the user?

> - at any time when a workspace is running a user should be able to see/tail or download all the logs (theia, LS and other plugins) via a specific command within theia

All file logs are already accessible via the component's terminal. I think
the containers' stdout logs are currently out of reach from theia and I'm
not sure if we need che-server for it.
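
For reference, container stdout/stderr is served by the Kubernetes API through the pod `log` subresource, so whichever component ends up fetching it needs credentials that are allowed to read `pods/log` in the workspace namespace. A minimal sketch of building that request path follows; the `pod_log_path` helper and the example namespace, pod and container names are hypothetical, while the path and the `container`, `follow` and `tailLines` query parameters are the ones defined by the Kubernetes API.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def pod_log_path(
    namespace: str,
    pod: str,
    container: str,
    follow: bool = False,
    tail_lines: Optional[int] = None,
) -> str:
    """Build the API path for reading one container's log in a pod."""
    query = {"container": container}
    if follow:
        query["follow"] = "true"  # keep the stream open, like `tail -f`
    if tail_lines is not None:
        query["tailLines"] = str(tail_lines)
    return (
        f"/api/v1/namespaces/{quote(namespace, safe='')}"
        f"/pods/{quote(pod, safe='')}/log?{urlencode(query)}"
    )


path = pod_log_path("user-che", "workspace123", "theia-ide", follow=True, tail_lines=100)
print(path)
```

The sketch only builds the path; authentication and the actual HTTP call are left out on purpose.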

>
>
> On Tue, Feb 4, 2020 at 11:54 AM Michal Vala <mvala@xxxxxxxxxx> wrote:
>>
>> fix: global collector is without the rice, of course... facepalm,
>> clipboard went crazy or what...
>>
>> On Tue, Feb 4, 2020 at 11:10 AM Michal Vala <mvala@xxxxxxxxxx> wrote:
>> >
>> > Hello team,
>> > we've run into trouble with the implementation of writing the
>> > containers' stdout logs to files. It is quite unfortunate as we've
>> > spent some time analyzing the feature, but that's life.
>> >
>> > Original idea was to modify the command of the image so it would
>> > redirect the output into the file, something like `<command> | tee
>> > c1.log`. However, it is very hard to impossible to achieve that. My
>> > idea was to pass the `args` to the container command. This does not
>> > work and I think that it's caused by arguments being passed in quotes
>> > under the hood so it became something like `<command> '| tee c1.log'`,
>> > which does not do what we want. To actually update the command, we
>> > would need to have full image pulled and then somehow inspect it to
>> > get the original command and update it. This would mean very high
>> > intervention in current workspace startup logic with uncertain result
>> > and high risk.
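
The quoting problem described above can be reproduced outside Kubernetes. Container `command` and `args` are handed to the process as plain argv entries with no shell in between, so a pipe is never interpreted. The sketch below imitates that with a stand-in entrypoint that only prints its arguments; `ENTRYPOINT` and `run_without_shell` are hypothetical names used for the illustration.

```python
import subprocess
import sys

# Stand-in for an image entrypoint: a process that just reports its argv.
ENTRYPOINT = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]


def run_without_shell(extra_args):
    """Start the entrypoint with extra argv entries and no shell in between."""
    done = subprocess.run(
        ENTRYPOINT + list(extra_args),
        capture_output=True,
        text=True,
        check=True,
    )
    return done.stdout.strip()


# The pipe and the tee invocation arrive as one literal string;
# nothing is piped and no c1.log file is written.
seen_by_entrypoint = run_without_shell(["| tee c1.log"])
print(seen_by_entrypoint)
```

Getting a real pipe would mean replacing the command with a shell wrapper around the original one, which is exactly why the original image command has to be known first.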
>> >
>> > So where to go next? We have a few ideas:
>> >
>> > # Namespace log collector component (I have a working prototype of this)
>> >   - will run in extra pod in the namespace of the workspace
>> >   - will watch for workspace pods and, when one exists and is
>> > running, start following the logs of all its containers and write
>> > them to files
>> >   - one instance per namespace
>> >   - lifecycle managed by che-server (can scale down when no workspace
>> > is running and scale up before first workspace start)
>> > pros:
>> >   - should be quite gentle with hw resources (TODO: measure),
>> > especially with many workspaces in the same namespace
>> >   - outlive the workspace lifetime, so we should be able to get all the logs
>> >   - logs could be provided to the backend within the same component
>> >   - should be possible to manage file logs from inside the containers
>> > with this component
>> > cons:
>> >   - needs extra PVC for logs XOR use workspace's PVC with limitation
>> > that all workspaces will need to run on one node and the logic will
>> > have to be more complex to reflect different Che PVC strategies
>> >   - for "namespace per workspace" or "only one workspace per user"
>> > scenarios, same hw requirements as a sidecar collector
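
The "follow the logs of all containers and write them to files" part of such a collector could look roughly like the sketch below. It is not the prototype mentioned above: each log stream is modelled as a plain iterable of lines standing in for a followed Kubernetes log stream, and `follow_to_file`, `collect` and the `<pod>/<container>.log` layout are assumptions made for the illustration.

```python
import tempfile
import threading
from pathlib import Path
from typing import Dict, Iterable, Tuple


def follow_to_file(lines: Iterable[str], target: Path) -> None:
    """Append every line of one container's stream to its own log file."""
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as out:
        for line in lines:
            out.write(line.rstrip("\n") + "\n")
            out.flush()  # keep the file usable while the stream is still open


def collect(streams: Dict[Tuple[str, str], Iterable[str]], log_root: Path) -> None:
    """Follow each (pod, container) stream in its own thread until all end."""
    workers = [
        threading.Thread(
            target=follow_to_file,
            args=(lines, log_root / pod / f"{container}.log"),
        )
        for (pod, container), lines in streams.items()
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    collect(
        {
            ("workspace-pod", "theia"): ["starting", "ready"],
            ("workspace-pod", "plugin-broker"): ["done"],
        },
        root,
    )
    collected = {
        p.relative_to(root).as_posix(): p.read_text(encoding="utf-8")
        for p in sorted(root.rglob("*.log"))
    }
```

One thread per container keeps the sketch short; a real collector would also have to handle pods appearing and disappearing while it runs.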
>> >
>> >
>> > # Workspace log collector sidecar
>> >   - will run in the workspace pod as a sidecar
>> >   - will follow all the container logs of the workspace and write them to PVC
>> > pros:
>> >   - no issues with PVC access from multiple pods
>> >   - same lifecycle as the workspace, so it's easier to deploy with
>> > current server logic ("just" add another sidecar)
>> >   - easiest to get file logs from inside the containers as we're in the same pod
>> > cons:
>> >   - same lifecycle as the workspace, so we're not sure we get all the
>> > logs before collector is killed
>> >   - extra hw resources consumed per each workspace
>> >   - we will need another component to send the logs to the backend, as
>> > we can't rely on the workspace pod managing it in time when the
>> > workspace crashes
>> >
>> >
>> > # Global che-server log collector
>> >   - che-server will watch and follow the logs of all workspaces and
>> > write them to PVC/Database/?
>> > pros:
>> >   - no extra hw resources per workspace/namespace
>> >   - logs are collected directly to the place where they can be
>> > requested so not much extra coding needed to make them accessible on
>> > server API
>> > cons:
>> >   - higher network traffic workspace ⇔ che-server
>> >   - keep the connection to all workspaces open all the time
>> >   - higher hw requirements on che-server
>> >   - hard to impossible to get file logs from inside the containers,
>> > probably will need another component that will run on-exit inside
>> > workspace's namespace
>> >
>> >
>> >
>> > An important question here is how hard a requirement it is to get the
>> > file logs from inside the containers (e.g. language servers). This
>> > can be an important factor in deciding which way to go.
>> >
>> >
>> > Thanks!
>> > Michal
>> >
>> >
>> > On Thu, Jan 23, 2020 at 5:04 AM Michal Vala <mvala@xxxxxxxxxx> wrote:
>> > >
>> > > Hello team,
>> > >
>> > > we're currently working on improving diagnosis capabilities[1] of workspaces, to
>> > > be more concrete, how to get all logs from the workspace[2]. We're in the phase of
>> > > investigating options and prototyping and we've come up with several variants
>> > > of how to achieve the goal. We would like to know your opinion and new ideas.
>> > >
>> > > Requirements:
>> > >   - collect all logs of all containers from the workspace
>> > >   - stdout/err as well as file logs inside the container
>> > >   - keep history of last 5 runs of the workspace
>> > >   - collect logs of crashed workspace
>> > >   - make logs easily accessible to the user (rest API + dashboard view)
>> > >
>> > >
>> > > I've split the effort into two sections:
>> > >
>> > >   ### How to collect:
>> > >
>> > >     # log everything to files to mounted PV
>> > >       - just mount PV and log everything there
>> > >       - pros
>> > >         - not much extra overhead, only write stdout/err to the file
>> > > and mount PV
>> > >         - don't need extra hw resources (memory/cpu)
>> > >       - cons
>> > >         - we might need to override `command` of all containers. They will
>> > >           have to run with extra parameters to write stdout/err to the file.
>> > >           Something like `<command> 2>&1 | tee ws.log`
>> > >
>> > >     # workspace collector sidecar (kubernetes/client-go app?)
>> > >       - pros
>> > >         - per workspace
>> > >         - dynamic and powerful
>> > >       - cons
>> > >         - very custom solution and might be hard to manage/maintain
>> > >         - unknown performance and hw resources requirements
>> > >         - hard when ws crash
>> > >         - need more memory per workspace, even if user does not use it and
>> > >           everything works as expected
>> > >
>> > >     # watch and collect from master
>> > >       - pros
>> > >         - easy to grab logs and events
>> > >         - easy to access archived logs
>> > >       - cons
>> > >         - only container's stderr/out
>> > >         - keep the connection to ws
>> > >         - more network traffic
>> > >         - increase memory footprint of master
>> > >
>> > >     # kubernetes native
>> > >       - change the logging backend of kubernetes [3]
>> > >       - pros
>> > >         - standard k8s way, "googleable"
>> > >       - cons
>> > >         - depends on kubernetes deployment
>> > >         - needs extra cluster component/configuration
>> > >         - only stdout/err of containers
>> > >
>> > >     # push logs directly from containers to logging backend
>> > >       - cons
>> > >         - customize all components to log to the backend
>> > >         - performance and hw resources overhead
>> > >
>> > >     # collect on workspace exit
>> > >       - mount PV and log there. When workspace exits, start collector pod that
>> > >           grabs the logs and "archive" them.
>> > >       - pros
>> > >         - not much extra overhead
>> > >       - cons
>> > >         - don't have logs of running workspace
>> > >         - custom collector pod
>> > >
>> > >
>> > >   ### Where to store and how to access:
>> > >
>> > >     # Workspace PV
>> > >       - pros
>> > >         - easy to set quota per user
>> > >       - cons
>> > >         - harder to access (need to start some pod at workspace's namespace)
>> > >         - lost when delete namespace
>> > >
>> > >     # Che PV
>> > >       - pros
>> > >         - easier to access
>> > >       - cons
>> > >         - harder to set quota per user
>> > >         - harder to scale and manage
>> > >         - possible performance bottleneck
>> > >
>> > >     # PostgreSQL
>> > >       - pros
>> > >         - the easiest to access
>> > >       - cons
>> > >         - harder to set quota per user
>> > >         - harder to scale and manage
>> > >         - possible performance bottleneck
>> > >
>> > >
>> > > There is one remaining and very important question we have not investigated
>> > > much. We need to somehow configure all plugins/editors and other components, to
>> > > tell where they have all log files that should be collected. Otherwise, we
>> > > would not be able to find the logs on containers. We would need to
>> > > handle that in
>> > > plugin's `meta.yaml` as well as in the devfile.
>> > >
>> > > What's next?
>> > >   We would like to investigate and prototype following solution:
>> > >     - collect all ws logs into files and store in PV in the workspace
>> > >     - watch ws events from master and on exit, start the collector pod that will
>> > >       collect all the logs and pass them to the backend. Logs backend
>> > > is something
>> > >       to be done. It might be only PV dedicated to archiving log, or some new
>> > >       service, or Che master.
>> > >     - prototype new Che master API to access the logs. If we store
>> > > them in workspace's PV,
>> > >       start the collector pod on demand to access the logs.
>> > >
>> > >
>> > > We would very much welcome any opinions or ideas.
>> > >
>> > >
>> > > [1] - https://github.com/eclipse/che/issues/15047
>> > > [2] - https://github.com/eclipse/che/issues/15134
>> > > [3] - https://kubernetes.io/docs/concepts/cluster-administration/logging/#sidecar-container-with-a-logging-agent
>> >
>> >
>> >
>> > --
>> > Michal Vala
>> > Software Engineer, Eclipse Che
>> > Red Hat Czech
>>
>>
>>
>> --
>> Michal Vala
>> Software Engineer, Eclipse Che
>> Red Hat Czech
>>
>> _______________________________________________
>> che-dev mailing list
>> che-dev@xxxxxxxxxxx
>> To change your delivery options, retrieve your password, or unsubscribe from this list, visit
>> https://www.eclipse.org/mailman/listinfo/che-dev
>



-- 
Michal Vala
Software Engineer, Eclipse Che
Red Hat Czech