A tale of two libcs September 25, 2020 on Drew DeVault's blog

I received a bug report from Debian today, who had fed some garbage into scdoc, and it gave them a SIGSEGV back. Diving into this problem gave me a good opportunity to draw a comparison between musl libc and glibc. Let’s start with the stack trace:

==26267==ERROR: AddressSanitizer: SEGV on unknown address 0x7f9925764184
(pc 0x0000004c5d4d bp 0x000000000002 sp 0x7ffe7f8574d0 T0)
==26267==The signal is caused by a READ memory access.
    0 0x4c5d4d in parse_text /scdoc/src/main.c:223:61
    1 0x4c476c in parse_document /scdoc/src/main.c
    2 0x4c3544 in main /scdoc/src/main.c:763:2
    3 0x7f99252ab0b2 in __libc_start_main
/build/glibc-YYA7BZ/glibc-2.31/csu/../csu/libc-start.c:308:16
    4 0x41b3fd in _start (/scdoc/scdoc+0x41b3fd)

And if we pull up that line of code, we find…

if (!isalnum(last) || ((p->flags & FORMAT_UNDERLINE) && !isalnum(next))) {

Hint: p is a valid pointer. “last” and “next” are both uint32_t. The segfault happens in the second call to isalnum. And, the key: it can only be reproduced on glibc, not on musl libc. If you did a double-take, you’re not alone. There’s nothing here which could have caused a segfault.

Since it was narrowed down to glibc, I pulled up the source code and went digging for the isalnum implementation, expecting some stupid bullshit. But before I get into their stupid bullshit, of which I can assure you there is a lot, let’s briefly review the happy version. This is what the musl libc isalnum implementation looks like:

int isalnum(int c)
{
	return isalpha(c) || isdigit(c);
}

int isalpha(int c)
{
	return ((unsigned)c|32)-'a' < 26;
}

int isdigit(int c)
{
	return (unsigned)c-'0' < 10;
}

As expected, for any value of c, isalnum will never segfault. Because why the fuck would isalnum segfault? Okay, now, let’s compare this to the glibc implementation. When opening this header, you’re greeted with the typical GNU bullshit, but let’s trudge through and grep for isalnum.

The first result is this:

enum
{
  _ISupper = _ISbit (0),        /* UPPERCASE.  */
  _ISlower = _ISbit (1),        /* lowercase.  */
  // ...
  _ISalnum = _ISbit (11)        /* Alphanumeric.  */
};

This looks like an implementation detail, let’s move on.

__exctype (isalnum);

But what’s __exctype? Back up the file a few lines…

#define __exctype(name) extern int name (int) __THROW

Okay, apparently that’s just the prototype. Not sure why they felt the need to write a macro for that. Next search result…

#if !defined __NO_CTYPE
# ifdef __isctype_f
__isctype_f (alnum)
// ...

Okay, this looks useful. What is __isctype_f? Back up the file now…

#ifndef __cplusplus
# define __isctype(c, type) \
  ((*__ctype_b_loc ())[(int) (c)] & (unsigned short int) type)
#elif defined __USE_EXTERN_INLINES
# define __isctype_f(type) \
  __extern_inline int                                                         \
  is##type (int __c) __THROW                                                  \
  {                                                                           \
    return (*__ctype_b_loc ())[(int) (__c)] & (unsigned short int) _IS##type; \
  }
#endif

Oh…. oh dear. It’s okay, we’ll work through this together. Let’s see, __isctype_f is some kind of inline function… wait, this is the else branch of #ifndef __cplusplus. Dead end. Where the fuck is isalnum actually defined? Grep again… okay… here we are?

#if !defined __NO_CTYPE
# ifdef __isctype_f
__isctype_f (alnum)
// ...
# elif defined __isctype
# define isalnum(c)     __isctype((c), _ISalnum) // <- this is it

Hey, there’s that implementation detail from earlier! Remember this?

enum
{
  _ISupper = _ISbit (0),        /* UPPERCASE.  */
  _ISlower = _ISbit (1),        /* lowercase.  */
  // ...
  _ISalnum = _ISbit (11)        /* Alphanumeric.  */
};

Let’s suss out that macro real quick:

# include <bits/endian.h>
# if __BYTE_ORDER == __BIG_ENDIAN
#  define _ISbit(bit)   (1 << (bit))
# else /* __BYTE_ORDER == __LITTLE_ENDIAN */
#  define _ISbit(bit)   ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8))
# endif

Oh, for fuck’s sake. Whatever, let’s move on and just assume this is a magic number. The other macro is __isctype, which is similar to the __isctype_f we were just looking at a moment ago. Let’s go look at that ifndef __cplusplus branch again:

#ifndef __cplusplus
# define __isctype(c, type) \
  ((*__ctype_b_loc ())[(int) (c)] & (unsigned short int) type)
#elif defined __USE_EXTERN_INLINES
// ...
#endif

Well, at least we have a pointer dereference now, that could explain the segfault. What’s __ctype_b_loc?

/* These are defined in ctype-info.c.
   The declarations here must match those in localeinfo.h.

   In the thread-specific locale model (see `uselocale' in <locale.h>)
   we cannot use global variables for these as was done in the past.
   Instead, the following accessor functions return the address of
   each variable, which is local to the current thread if multithreaded.

   These point into arrays of 384, so they can be indexed by any `unsigned
   char' value [0,255]; by EOF (-1); or by any `signed char' value
   [-128,-1).  ISO C requires that the ctype functions work for `unsigned
   char' values and for EOF; we also support negative `signed char' values
   for broken old programs.  The case conversion arrays are of `int's
   rather than `unsigned char's because tolower (EOF) must be EOF, which
   doesn't fit into an `unsigned char'.  But today more important is that
   the arrays are also used for multi-byte character sets.  */
extern const unsigned short int **__ctype_b_loc (void)
     __THROW __attribute__ ((__const__));
extern const __int32_t **__ctype_tolower_loc (void)
     __THROW __attribute__ ((__const__));
extern const __int32_t **__ctype_toupper_loc (void)
     __THROW __attribute__ ((__const__));

That is just so, super cool of you, glibc. I just love dealing with locales. Anyway, my segfaulted process is sitting in gdb, and equipped with all of this information I wrote the following monstrosity:

(gdb) print ((unsigned int **(*)(void))__ctype_b_loc)()[next]
Cannot access memory at address 0x11dfa68

Segfault found. Reading that comment again, we see “ISO C requires that the ctype functions work for ‘unsigned char’ values and for EOF”. If we cross-reference that with the specification:

In all cases [of functions defined by ctype.h,] the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF.

So the fix is obvious at this point. Okay, fine, my bad. My code is wrong. I apparently cannot just hand a UCS-32 codepoint to isalnum and expect it to tell me if it’s between 0x30-0x39, 0x41-0x5A, or 0x61-0x7A.

But, I’m going to go out on a limb here: maybe isalnum should never cause a program to segfault no matter what input you give it. Maybe because the spec says you can does not mean you should. Maybe, just maybe, the behavior of this function should not depend on five macros, whether or not you’re using a C++ compiler, the endianness of your machine, a look-up table, thread-local storage, and two pointer dereferences.

Here’s the musl version as a quick reminder:

int isalnum(int c)
{
	return isalpha(c) || isdigit(c);
}

int isalpha(int c)
{
	return ((unsigned)c|32)-'a' < 26;
}

int isdigit(int c)
{
	return (unsigned)c-'0' < 10;
}

Bye!

⇒ This article is also available on gemini.

Have a comment on one of my posts? Start a discussion in my public inbox by sending an email to ~sircmpwn/public-inbox@lists.sr.ht [mailing list etiquette]

Articles from blogs I read Generated by openring

Status update, August 2020

Hi! Regardless of the intense heat I’ve been exposed to this last month, I’ve still been able to get some stuff done (although having to move out to another room which isn’t right under the roof). I’ve worked a lot on IRC-related projects. I’ve added a znc-i…

via emersion 2020-08-19 00:00:00 +0200 +0200

What's cooking on Sourcehut? August 2020

Another month passes and we find ourselves writing (or reading) this status update on a quiet, rainy Sunday morning. Today our userbase numbers 16,683 members strong, up 580 from last month. Please extend a kind welcome to our new colleagues! Thanks for read…

via Blogs on Sourcehut 2020-08-16 00:00:00 +0000 +0000

Go 1.15 is released

Today the Go team is very happy to announce the release of Go 1.15. You can get it from the download page. Some of the highlights include: Substantial improvements to the Go linker Improved allocation for small objects at high core coun…

via The Go Programming Language Blog 2020-08-11 11:00:00 +0000 +0000

North Pacific Logbook

The passage from Japan (Shimoda) to Canada (Victoria) took 51 days, and it was the hardest thing we've ever done. We decided to keep a logbook, to better remember it and so it can help others who wish to make this trip.Continue Reading

via Hundred Rabbits 2020-07-31 00:00:00 +0000 GMT