Re: [paho-dev] MQTT C Client - Segmentation fault
Hi Ian
Just at the same time I filed a bug, and have now marked it as a
duplicate of yours :-).
Thanks a lot for looking into this and fixing it so quickly. This is
a really cool, fast feedback loop.
Regards
Franz
On Fri, Oct 17, 2014 at 11:56 AM, Ian Craggs
<icraggs@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Franz,
>
> I've raised a bug - it's 447672. I have also implemented a proposed fix,
> which I'll be adding to the development branch very soon.
>
> Ian
>
>
> On 10/16/2014 09:08 PM, Franz Schnyder wrote:
>
> I'm still working on implementing an MQTT-SN gateway using the Paho 'MQTT
> C Client for Posix and Windows', and I have now hit another problem. This
> time I think it is a concurrency bug in the MQTT.C library:
>
> My gateway ran for more than a week and then crashed with a segmentation
> fault. My guess is that the segmentation fault is due to a race
> condition caused by a missing lock (mutex) around s.cur_clientsds inside
> Socket.c.
>
> Below I provide the information and reasoning behind my guess:
>
> Info on the segmentation fault:
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0xb4509460 (LWP 17845)]
> 0x00080750 in Socket_getReadySocket (more_work=0, tp=0xb4508d88)
> at source/lib/Mqtt.C/src/Socket.c:278
>
> the code there looks like this:
>
> 274     if (s.cur_clientsds == NULL)
> 275         rc = 0;
> 276     else
> 277     {
> 278         rc = *((int*)(s.cur_clientsds->content));
> 279         ListNextElement(s.clientsds, &s.cur_clientsds);
> 280     }
>
> the backtrace of the crash is:
>
> (gdb) bt
> #0 0x00080750 in Socket_getReadySocket (more_work=0, tp=0xb4508d88)
> at source/lib/Mqtt.C/src/Socket.c:278
> #1 0x00077518 in MQTTClient_cycle (sock=0xb4508dc4, timeout=1000,
> rc=0xb4508dc8)
> at source/lib/Mqtt.C/src/MQTTClient.c:1547
> #2 0x00074a90 in MQTTClient_run (n=0x12a9b4)
> at source/lib/Mqtt.C/src/MQTTClient.c:483
> #3 0xb6e40bfc in start_thread (arg=0xb4509460)
> at pthread_create.c:306
> #4 0xb6dd5968 in ?? ()
> at ../ports/sysdeps/unix/sysv/linux/arm/nptl/../clone.S:116
>
> and the variables are:
>
> (gdb) p s
> $1 = {rset = {__fds_bits = {2048, 0 <repeats 31 times>}}, rset_saved = {
> __fds_bits = {1024, 0 <repeats 31 times>}}, maxfdp1 = 11,
> clientsds = 0x12a774, cur_clientsds = 0x0, connect_pending = 0x12a804,
> write_pending = 0x12a894, pending_wset = {__fds_bits = {
> 0 <repeats 32 times>}}}
> (gdb) p s.cur_clientsds
> $2 = (ListElement *) 0x0
>
> The segmentation fault occurs because s.cur_clientsds is NULL when line
> 278 is executed. Since line 278 is inside the else branch of the
> (s.cur_clientsds == NULL) check, I guess that s.cur_clientsds must
> have been changed by another thread between the check (line 274) and
> the access (line 278). So I searched for where s.cur_clientsds is
> changed and found that the Socket_close() function advances it, possibly to NULL:
>
> void Socket_close(int socket)
> {
>     FUNC_ENTRY;
>     Socket_close_only(socket);
>     FD_CLR(socket, &(s.rset_saved));
>     if (FD_ISSET(socket, &(s.pending_wset)))
>         FD_CLR(socket, &(s.pending_wset));
>     if (s.cur_clientsds != NULL && *(int*)(s.cur_clientsds->content) == socket)
>         s.cur_clientsds = s.cur_clientsds->next;
>     ...
>
> Therefore s.cur_clientsds is accessed and changed by multiple threads:
> by the library's 'worker thread' during MQTTClient_run(), and by a
> client thread when the client calls MQTTClient_disconnect(), since that
> calls Socket_close(). So I checked the call trees for mutex locks:
> Socket_close() is called inside a Thread_lock_mutex(mqttclient_mutex),
> but Socket_getReadySocket() is not inside any mutex lock, since
> MQTTClient_cycle() is explicitly called outside of the
> mqttclient_mutex:
>
> Thread_unlock_mutex(mqttclient_mutex);
> pack = MQTTClient_cycle(&sock, timeout, &rc);  /* ==> Socket_getReadySocket() */
> Thread_lock_mutex(mqttclient_mutex);
>
> Therefore I think my guess is correct: a race condition, due to
> missing synchronization on s.cur_clientsds, caused the segmentation
> fault.
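>
> To make the race concrete: an interleaving like the following (purely
> illustrative, not observed directly) would produce exactly the crash
> shown above:
>
> /* worker thread, inside Socket_getReadySocket():                       */
> if (s.cur_clientsds == NULL)              /* false at this instant      */
>     rc = 0;
> else
> {
>     /* client thread calls MQTTClient_disconnect() -> Socket_close(),   */
>     /* which executes s.cur_clientsds = s.cur_clientsds->next, and the  */
>     /* next element happens to be NULL                                  */
>     rc = *((int*)(s.cur_clientsds->content));  /* NULL dereference: SIGSEGV */
> }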
>
> My quick solution would be to introduce a new socket_mutex inside
> Socket.c to protect all accesses to:
>
> /**
>  * Structure to hold all socket data for the module
>  */
> Sockets s;
>
> but I would need to invest some more time to fully grasp the locking
> strategy of the library. So: should I invest the time to understand the
> internals and introduce this mutex to create a patch? Or is it not
> worth my effort because one of the committers can quickly confirm and
> fix this problem?
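>
> For illustration, here is a minimal sketch of the kind of change I have
> in mind. The socket_mutex name and the plain pthread calls are just
> placeholders; a real patch would presumably use the library's own
> Thread_lock_mutex()/Thread_unlock_mutex() helpers and fit its overall
> locking strategy:
>
> #include <pthread.h>
>
> /* hypothetical module-level mutex guarding the Sockets structure s */
> static pthread_mutex_t socket_mutex = PTHREAD_MUTEX_INITIALIZER;
>
> /* in Socket_getReadySocket(): */
> pthread_mutex_lock(&socket_mutex);
> if (s.cur_clientsds == NULL)
>     rc = 0;
> else
> {
>     rc = *((int*)(s.cur_clientsds->content));
>     ListNextElement(s.clientsds, &s.cur_clientsds);
> }
> pthread_mutex_unlock(&socket_mutex);
>
> /* ...and symmetrically in Socket_close(), around the test-and-update */
> /* of s.cur_clientsds shown above.                                    */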
>
>
> --
> Ian Craggs
> icraggs@xxxxxxxxxx IBM United Kingdom
> Paho Project Lead; Committer on Mosquitto