
Re: [leshan-dev] Leshan cluster

Sorry for the delay, I'm on vacation.

On Mon, May 9, 2016 at 10:47 AM, Maier Daniel (INST/ECS4) <Daniel.Maier@xxxxxxxxxxxx> wrote:

Hi,

 

I agree with you that the concept might work for scaling and blue/green deployment. However, with very high ACK timeouts and many retries, shutting down a node can take a really long time.


Once you have disabled this node for new connections on the load balancer, you just have to wait about 5 minutes for all the pending CoAP transactions to finish.
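For reference, the ~5 minute figure roughly corresponds to CoAP's EXCHANGE_LIFETIME with the default transmission parameters from RFC 7252 (with the much higher ACK timeouts Daniel mentions, the wait grows accordingly). A quick back-of-the-envelope calculation:

    public class ExchangeLifetime {
        public static void main(String[] args) {
            // Default CoAP transmission parameters (RFC 7252, section 4.8)
            double ackTimeout = 2.0;        // ACK_TIMEOUT, seconds
            double ackRandomFactor = 1.5;   // ACK_RANDOM_FACTOR
            int maxRetransmit = 4;          // MAX_RETRANSMIT
            double maxLatency = 100.0;      // MAX_LATENCY, seconds
            double processingDelay = ackTimeout;

            // MAX_TRANSMIT_SPAN = ACK_TIMEOUT * (2^MAX_RETRANSMIT - 1) * ACK_RANDOM_FACTOR
            double maxTransmitSpan = ackTimeout * (Math.pow(2, maxRetransmit) - 1) * ackRandomFactor;
            // EXCHANGE_LIFETIME = MAX_TRANSMIT_SPAN + 2 * MAX_LATENCY + PROCESSING_DELAY
            double exchangeLifetime = maxTransmitSpan + 2 * maxLatency + processingDelay;

            System.out.println(exchangeLifetime + " s"); // 247 s with the defaults, a bit over 4 minutes
        }
    }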
 

But for blue/green deployment I still have an open question: when we shut down, for example, environment “green”, we need to finish all outstanding requests on this environment. In the meantime we start environment “blue” for new requests. Responses from devices with outstanding requests on “green” still get routed to “green”. If we now get a new request from the northbound side, we would handle it on environment “blue”. But if this device still has requests pending on “green”, the response to the new request also gets routed to “green”. Did I get something wrong here? Or is this a problem? Would you also route requests from the northbound side to “green” as long as there are open requests for this device (with a high request frequency, “green” would never get out of work)?


Yes, that's an issue, especially for the DTLS state I think, and the more I think about it the more I believe we need to move DTLS termination to a DTLS proxy if we want to limit this impact.
Another solution would be to force the device to register again (there is an executable resource for that in /1, the Registration Update Trigger, I think).
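For illustration, a minimal sketch of triggering that from the server side by executing /1/0/8 (Registration Update Trigger, assuming instance 0 of the Server object). Class and package names follow the Leshan 1.x API; older releases use Client instead of Registration and a slightly different send() signature, so treat this as a sketch, not the exact code:

    import org.eclipse.leshan.core.request.ExecuteRequest;
    import org.eclipse.leshan.core.response.ExecuteResponse;
    import org.eclipse.leshan.server.californium.LeshanServer;
    import org.eclipse.leshan.server.registration.Registration;

    public class RegistrationUpdateTrigger {
        /** Asks the device to send a Registration Update by executing /1/0/8. */
        public static boolean triggerUpdate(LeshanServer server, Registration reg)
                throws InterruptedException {
            // null response means the request timed out (device unreachable)
            ExecuteResponse resp = server.send(reg, new ExecuteRequest("/1/0/8"), 5000L);
            return resp != null && resp.isSuccess();
        }
    }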

 

I also agree with you that handling node crashes would be a very high effort. If we want to behave as if it were one single node, we have to consider the reliability layer, deduplication, congestion control, and request/response matching. We would also have to store the whole requests and refactor the response callback mechanism. I also think such an implementation would lead to a lot of communication overhead between the nodes.

 

Nevertheless, as a user of some kind of northbound API (e.g. realized by a REST interface or messaging interface on top of Leshan), I would expect that if I send a request to a clustered Leshan instance, no requests are lost even if one node crashes. But perhaps this kind of scenario must be handled by the northbound API implementation itself.


Yes, I think so; you need to add some retries on both sides (device side for registration and northbound side for transactions). Today we use LWM2M on wireless networks and we need retries at multiple layers, because sometimes the network just goes crazy or the device loses power. I don't think it's worth the trouble to try to make the device communication server 100% reliable when we have so many other sources of errors and transaction failures.
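As an illustration of the northbound-side retry, a minimal generic sketch (the actual call that sends the LWM2M request and waits for its response is passed in as a supplier; nothing here is part of Leshan's API):

    import java.util.concurrent.TimeUnit;
    import java.util.function.Supplier;

    public final class Retry {
        /** Retries the given call with exponential backoff; returns null if every attempt fails. */
        public static <T> T withBackoff(Supplier<T> call, int maxAttempts, long initialDelayMs)
                throws InterruptedException {
            long delay = initialDelayMs;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                T result = call.get();           // e.g. send a read/write and wait for the response
                if (result != null) {
                    return result;               // got a response, stop retrying
                }
                if (attempt < maxAttempts) {
                    TimeUnit.MILLISECONDS.sleep(delay);
                    delay *= 2;                  // back off to avoid hammering a flaky network
                }
            }
            return null;                         // the caller decides what "lost" means
        }
    }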
 

One possibility could be that this northbound implementation stores all requests and just re-schedules them to Leshan if something goes wrong. One drawback of such a solution would be that requests are very likely to be sent twice when a node crashes. And the client cannot detect these duplicates because the re-sent request has a new MID.
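For example (purely illustrative, not an existing Leshan component), such a northbound journal could look roughly like this; in a real deployment the map would live in a shared store such as Redis rather than the JVM heap:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Illustrative sketch of a northbound request journal. */
    public class RequestJournal {
        // requestId -> opaque request payload still waiting for a response
        private final Map<String, byte[]> pending = new ConcurrentHashMap<>();

        public void recordPending(String requestId, byte[] request) {
            pending.put(requestId, request);
        }

        public void markCompleted(String requestId) {
            pending.remove(requestId);
        }

        /** Called when a Leshan node is detected as dead: hand the orphaned requests back
         *  to the dispatcher. Duplicates are possible, since the crashed node may already
         *  have sent some of them, and with a new MID the device cannot deduplicate them. */
        public Map<String, byte[]> drainForRedispatch() {
            Map<String, byte[]> copy = new ConcurrentHashMap<>(pending);
            pending.clear();
            return copy;
        }
    }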

 
Yeah, that's an idea; also, being as RESTful as possible on the device side helps to handle ordering and duplication (an idempotent resource tolerates a duplicated write). Anyway, if you want 100% reliability and 0% duplication, you won't be able to do it with a distributed system and you won't be able to scale horizontally.

 

What do you think, does it make sense to implement such crash-safe clustering in Leshan, or is this application-specific? Perhaps there is also a solution in between.


I think you should complete the wiki pages with your ideas; also, the whole northbound interface is still to be designed, and I wouldn't mind some help there :)

 

Kind regards

Daniel

 

 

From: leshan-dev-bounces@xxxxxxxxxxx [mailto:leshan-dev-bounces@xxxxxxxxxxx] On behalf of Julien Vermillard
Sent: Wednesday, May 4, 2016 12:02
To: leshan developer discussions
Subject: Re: [leshan-dev] Leshan cluster

 

Hi Daniel,

Yes, the first idea is to add and remove nodes in the cluster for scaling up/down, and to do blue/green deployments for new versions (a bit like http://martinfowler.com/bliki/BlueGreenDeployment.html).

It does not provide resuming requests that are lost when a node crashes.


When you remove a node, you need to remove it from the load balancer's list of active nodes so no new connections are sent to it. Then wait something like 5 minutes to be sure all the CoAP transactions currently being processed are done, and then kill the machine.

If the node crashes, the CoAP transactions are not resumed on another node. It's probably doable with some effort, but it probably means serializing more Californium state into Redis and making more Californium changes, and yes, it probably has some performance impact.


I don't think storing only the token ID can solve the problem: how do you resume retransmission on a new node in case the response sent by the client was lost and the server needs to retransmit the request?


--
Julien Vermillard

 

On Mon, May 2, 2016 at 5:29 PM, Maier Daniel (INST/ECS4) <Daniel.Maier@xxxxxxxxxxxx> wrote:

Hi,

 

I have read https://github.com/eclipse/leshan/wiki/Cluster.

 

As I understand it, the plan is to store token IDs only for observe requests and not for requests like read, write, etc. This would have the consequence that if one cluster node goes down, all requests that are still in progress on that node will be lost. The user (some application) never notices that his request was lost because he is not notified in any way (well, this would also not be possible with the current callback design even if we stored the request). Is this intended as a tradeoff between performance and reliability? Is the user himself responsible for retrying requests that got no response for a while?
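For context, a minimal sketch of what "storing the token IDs" for observe could look like with a shared Redis store (Jedis is used purely for illustration here; the actual cluster code on the wiki may be shaped differently):

    import redis.clients.jedis.Jedis;

    /** Illustrative sketch: share observe tokens so any node can match incoming notifications. */
    public class ObserveTokenStore {
        private final Jedis jedis;

        public ObserveTokenStore(Jedis jedis) {
            this.jedis = jedis;
        }

        /** Remember which LWM2M path an observe token belongs to, per endpoint. */
        public void put(String endpoint, byte[] token, String resourcePath) {
            jedis.hset("OBS:" + endpoint, bytesToHex(token), resourcePath);
        }

        /** Any node receiving a notification can resolve the token back to the observation. */
        public String resolve(String endpoint, byte[] token) {
            return jedis.hget("OBS:" + endpoint, bytesToHex(token));
        }

        private static String bytesToHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }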

 

I think it would be useful to also store the token IDs of other requests, i.e. another cluster node could handle the response even if the request issuer went down (we still have the problem that another node also needs to take over responsibility for the retries). But to implement this we would need a different design of the request callbacks in order to propagate the response to a listener on another cluster node.
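One conceivable way to propagate the response to a listener on another node would be a shared handoff in Redis, for example a per-token list; again just a sketch of the idea, not an existing Leshan mechanism:

    import java.util.List;
    import redis.clients.jedis.Jedis;

    /** Illustrative sketch: hand responses between cluster nodes through a Redis list. */
    public class ClusterResponseRelay {

        /** The node that actually received the response pushes it onto a per-token list. */
        public static void publishResponse(Jedis jedis, String tokenHex, String responseJson) {
            jedis.rpush("RESP:" + tokenHex, responseJson);
        }

        /** The node that owns the pending callback blocks until the response shows up
         *  (or the timeout expires), wherever in the cluster it was received. */
        public static String awaitResponse(Jedis jedis, String tokenHex, int timeoutSeconds) {
            List<String> result = jedis.blpop(timeoutSeconds, "RESP:" + tokenHex);
            return result == null ? null : result.get(1); // blpop returns [key, value]
        }
    }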

 

What do you think, is this approach over the top?

 

Thanks a lot

Daniel

 

From: leshan-dev-bounces@xxxxxxxxxxx [mailto:leshan-dev-bounces@xxxxxxxxxxx] On behalf of Julien Vermillard
Sent: Monday, February 15, 2016 14:59
To: leshan developer discussions
Subject: Re: [leshan-dev] Leshan cluster

 

Hi Paul,

Neither HAProxy nor nginx supports UDP or CoAP.

This could solve part of the scalability problem by offloading the DTLS processing from the Leshan server.
It could be a good idea, but right now the biggest challenge we have is the token matching for observe, and this solution won't have an impact on that :(

As a second step that could be interesting, for example using a more CPU-optimised AES/TLS implementation like the one in OpenSSL and building a CoAP reverse proxy to use as a load balancer.
Now, since we want two-way authentication using per-endpoint credentials (PSK and RPK), we need to pass the result of the authentication back to the backend server.
In the HTTP world this is done by adding an HTTP header; we could imagine sending it as a CoAP option. So we would need to build a CoAP reverse proxy able to modify the CoAP messages.
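To make the idea concrete, a minimal sketch with Californium of stamping the authenticated identity onto the forwarded request as a custom CoAP option. The option number 65000 is an arbitrary placeholder from the experimental range, not a registered option, and the proxy itself is left out:

    import java.nio.charset.StandardCharsets;
    import org.eclipse.californium.core.coap.Option;
    import org.eclipse.californium.core.coap.Request;

    public class AuthOptionStamper {
        // Arbitrary number from the experimental/private option range; pick your own.
        private static final int AUTHENTICATED_IDENTITY_OPTION = 65000;

        /** The proxy adds the DTLS-authenticated identity before forwarding to the backend. */
        public static void stamp(Request forwardedRequest, String authenticatedIdentity) {
            byte[] value = authenticatedIdentity.getBytes(StandardCharsets.UTF_8);
            forwardedRequest.getOptions().addOption(new Option(AUTHENTICATED_IDENTITY_OPTION, value));
        }
    }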

But I see it more as a performance/infrastructure optimisation.


--
Julien Vermillard

 

On Fri, Feb 12, 2016 at 6:28 PM, <szego@xxxxxx> wrote:

 

Regarding security and the proposed network infrastructure, has anyone considered how we might support DTLS termination at the load-balancer layer? I’m thinking along the lines of how HTTP load balancers can handle TLS termination (e.g. HAProxy, nginx) out in the DMZ.

 

I’m keen to see that whatever we arrive at doesn’t preclude this.

 

Regards, Paul.

 

On 11 Feb 2016, at 7:10 AM, Julien Vermillard <jvermillard@xxxxxxxxx> wrote:

 

Hi,
Following the discussion we had during the last meeting:
https://github.com/eclipse/leshan/wiki/Cluster

Please feel free to comment/edit

--
Julien Vermillard

_______________________________________________
leshan-dev mailing list
leshan-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/leshan-dev


