man sscanf: %d is deprecated in C or glibc?


26

I was just reading the glibc sscanf man page (from the Linux man-pages package) and I found the following:

The following conversion specifiers are available:
(…)

d    Deprecated. Matches an optionally signed decimal integer; the
next pointer must be a pointer to int.

i    Deprecated. Matches an optionally signed integer; the next
pointer must be a pointer to int. The integer is read in base
16 if it begins with 0x or 0X, in base 8 if it begins with 0,
and in base 10 otherwise. Only characters that correspond to
the base are used.

o    Deprecated. Matches an unsigned octal integer; the next pointer
must be a pointer to unsigned int.

(…)

How come %d is deprecated? It seems that all int specifiers are deprecated.

What does it mean and what is there to replace them?

14

  • 2

    man 3 sscanf does not indicate deprecation for my toolchain. You should cite yours.

    – Jeff Holt

    2 days ago

  • 5

    There is an explanation in the "BUGS" subsection of the man page

    – Eugene Sh.

    2 days ago

  • 3

    The reasoning is given in the BUGS section of the Linux manpage @DanielWalker linked.

    – Shawn

    2 days ago

  • 7

    They're using the dictionary definition of "deprecate", which means "express disapproval of". This is not how it's usually used in software documentation, which is for warnings of obsolete features that are planned to be removed. It's the opinion of the man page author, not from the language specification.

    – Barmar

    2 days ago


  • 2

    @Barmar They're using the dictionary definition of "deprecate", which means "express disapproval of". This is not how it's usually used in software documentation… How very Microsoft of them. 😉

    – Andrew Henle

    2 days ago


6 Answers


34

How come %d is deprecated? It seems that all int specifiers are deprecated.

They are not deprecated in the sense that that term is ordinarily used in software documentation. There is no plan for their removal from the language and there are no direct replacements. The ISO committee responsible for maintaining the language standard has not expressed any opinion that they should be avoided, though there are indeed workarounds available to avoid their use.

The deprecation notices on some Linux manual pages that you are asking about constitute an inappropriate liberty taken by the maintainer of that version of the documentation. It is explained in the BUGS section of the same page:

Numeric conversion specifiers

Use of the numeric conversion specifiers produces Undefined
Behavior for invalid input. See C11 7.21.6.2/10
⟨https://port70.net/%7Ensz/c/c11/n1570.html#7.21.6.2p10⟩. This is
a bug in the ISO C standard, and not an inherent design issue
with the API. However, current implementations are not safe from
that bug, so it is not recommended to use them. Instead,
programs should use functions such as strtol(3) to parse numeric
input. This manual page deprecates use of the numeric conversion
specifiers until they are fixed by ISO C.

The manual page maintainer is both unfortunately opinionated and atypically aggressive. It is a somewhat controversial opinion that it constitutes a bug in the standard for the affected functions to have undefined behavior for invalid input. It is a valid opinion that this is a good reason to avoid numeric conversion specifiers, but that author is not empowered to deprecate the functions in the sense that readers of the manual page would typically understand. The conventional approach to a situation like this would be to add references to the BUGS section at appropriate places in the manual text, possibly even with a brief explanatory note. Deprecation labels are not that, no matter how they are explained elsewhere in the document.

With that said, the scanf-family functions are overall difficult to use correctly. Some around here are prone to recommend avoiding them entirely, and that should certainly be considered. If you do avoid them, then that moots the issue.
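For readers wondering what the strtol(3) route recommended by the BUGS section looks like in practice, here is a minimal sketch. The helper name parse_int and its 0/-1 return convention are my own assumptions, not anything from the man page or the standard. It shows the checks usually needed: that some digits were consumed, that the whole token was read, and that the value fits in int.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (not from the man page): parse a whole string as a
   base-10 int. Returns 0 on success, -1 on any error; *out is written
   only on success. */
int parse_int(const char *s, int *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s)                     /* no digits were consumed */
        return -1;
    if (*end != '\0')                 /* trailing characters after the number */
        return -1;
    if (errno == ERANGE)              /* value does not fit in long */
        return -1;
    if (v < INT_MIN || v > INT_MAX)   /* fits in long but not in int */
        return -1;
    *out = (int)v;
    return 0;
}
```

Note that strtol skips leading whitespace, so " 42" is accepted by this sketch; reject that separately if it matters for your input format.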

14

  • 3

    I'm just as frustrated with the manpages' maintainer's deprecation-happy attitude as you are, but I do not think there is any reasonable counterargument to the claim that 7.21.6.2p10's rule "if the result of the conversion cannot be represented in the object, the behavior is undefined" is a design defect in the standard. The only reason I haven't filed a DR is that I consider *scanf unfit for purpose anyway, for reasons that are much harder to fix.

    – zwol

    yesterday

  • 3

    At least every occurrence of "Deprecated." should have mentioned the reason and suggested alternative: "Deprecated, prefer strtol." or "Deprecated, see BUGS."!

    – Bergi

    yesterday

  • 3

    I understand the maintainer's attitude. Because the input is typically not under the control of the programmer, the UB is a potential opportunity for an exploit or at least denial of service triggered with malicious input. But many programs process input from known sources, and then this is not an issue. I'm also skeptical of the suggestion to replace a well-tested and versatile tool like scanf with one's own code. strtol is not perfectly trivial to use properly either (what's again the condition that the token was entirely read?) and there still is the issue of tokenizing etc.

    – Peter – Reinstate Monica

    22 hours ago

  • 3

    Maybe Don't use with untrusted input; see BUGS.. Cc: @zwol

    – alx – recommends codidact

    8 hours ago


  • 2

    @alx-recommendscodidact Thanks for listening and reacting! I think it will now cause much less confusion like the one which triggered this post (and less head-shaking by experienced users 😉 ).

    – Peter – Reinstate Monica

    2 hours ago


18

This is explained in the BUGS section of the man page:

Numeric conversion specifiers
Use of the numeric conversion specifiers produces Undefined
Behavior for invalid input. See C11 7.21.6.2/10
⟨https://port70.net/%7Ensz/c/c11/n1570.html#7.21.6.2p10⟩. This is
a bug in the ISO C standard, and not an inherent design issue
with the API. However, current implementations are not safe from
that bug, so it is not recommended to use them. Instead,
programs should use functions such as strtol(3) to parse numeric
input. This manual page deprecates use of the numeric conversion
specifiers until they are fixed by ISO C.

So it’s not deprecated by the C language specification; the author of the man page is using this notation to indicate that they’re not safe to use.

However, this is only a problem in practice if the input being read might not contain validly formatted data. If you’re reading a file that is formatted reliably, you can use these specifiers safely.

This actually seems to be an inconsistency in the language spec, because it also says that the function returns the number of valid conversions (or EOF if an input failure occurs before the first conversion). It makes no sense to say that a conversion failure is undefined behavior and also say what it returns in that case, and most implementations return the value properly.

The man-page author is being overly pedantic in recommending against these specifiers, in my opinion.
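To make the return-value rule concrete, here is a small sketch. The wrapper name try_scan_int is hypothetical. It only covers the well-defined cases (a successful conversion, a matching failure, and an input failure), since an out-of-range value is exactly the case the standard leaves undefined.

```c
#include <stdio.h>

/* Hypothetical wrapper: run one "%d" conversion on a string and pass
   sscanf's result through.
   1   -> one int was converted and stored in *out
   0   -> matching failure (the text is not a decimal integer)
   EOF -> input failure (the string ended before any conversion) */
int try_scan_int(const char *s, int *out)
{
    return sscanf(s, "%d", out);
}
```

Nothing in the return value distinguishes "converted correctly" from "converted but out of range", which is the gap the BUGS note is about.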

14

  • 5

    What's doubly-bad about putting this in a man page? The authors of the man page are the authors of the implementation they're complaining about. The C standard does not preclude the glibc authors from defining the behavior of their own implementation.

    – Andrew Henle

    2 days ago

  • 7

    They're complaining about "Use of the numeric conversion specifiers produces Undefined Behavior for invalid input." That's a "bug" glibc devs are free to fix – no one is stopping them from defining the behavior for their implementation. GCC, for example, has -fwrapv that defines the behavior of signed integer overflow – behavior that otherwise, were the "logic" of this man page followed – would "deprecate" all integer operations in C. "They could overflow and cause undefined behavior!!!" Would a compiler that "deprecates" every use of + between integral arguments be sane?

    – Andrew Henle

    2 days ago


  • 4

    @AndrewHenle They're not complaining about an implementation. They say it's a bug in the ISO C specification, because of the inconsistency of saying that it's undefined and then saying that it returns the number of valid conversions.

    – Barmar

    2 days ago

  • 2

    That's asinine. There's no contradiction in returning the number of valid conversions and "this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined". Glibc devs can implement what they want. Where's the contradiction in signed integer overflow results in undefined behavior and 6.5.6 Additive operators, paragraph 5's "The result of the binary + operator is the sum of the operands".

    – Andrew Henle

    2 days ago


  • 3

    @AndrewHenle: I wouldn't be surprised if intent of the man page is to warn that these aren't safely portable, even if glibc did check for overflow. (In practice I'm sure at worst the behaviour on integer overflow in glibc scanf is wrapping, since they compile to asm that has to work for non-overflowing cases, and the conversion loops are simple total = total*base + digit unless they check for overflow like in some other parts of glibc, such as in handling %12d conversions for printf. Parsing the 12 does check for overflow, making it unfortunately kinda slow for the common small case.)

    – Peter Cordes

    yesterday


5

This notice in the man page is for the benefit of people trying to write portable programs.
Since there has been speculation about what glibc itself does in this case, I decided to check.

The glibc source code actually avoids signed-overflow UB, at least in the conversion function scanf("%d") uses. At worst you could say the conversion result is undefined with glibc, but not the behaviour of the whole program. int on GNU systems doesn’t have trap values (it’s 2’s complement) so this can’t make your program crash or misbehave, other than perhaps not having a numeric value that matches what you might get from other ways of parsing the string. e.g. if your code looked at the last decimal digit as well as using sscanf to convert, you could have -1 even though the last decimal digit was even.

glibc sets errno to ERANGE after a scanf integer conversion that overflowed long or unsigned long, for conversions of long or narrower.
(%lld on a 32-bit system would only check for overflow of long long.)


I checked with this test program:

#include <stdio.h>

int main(){
        int tmp = 0xcccccccc;
        int conv_result = scanf("%d", &tmp);
        printf("successful conversions = %d,  result = %d = %#x\n",
                                       conv_result, tmp, (unsigned)tmp);
}

With input that fits in a long (64-bit on x86-64 GNU/Linux), we get that value truncated to int.
With larger input, glibc detects overflow and produces -1 (actually LONG_MIN or LONG_MAX according to the sign, in this case LONG_MAX which gets truncated to -1 when narrowing to int).

For example it converts 1111111111111111111111111111111 as -1, but 1111111111111111111 as 734294471 = 0x2bc471c7. See it on Godbolt with 2 executors that feed stdin with those inputs. It treats this as a successful conversion either way, scanf returning 1, e.g.

successful conversions = 1,  result = -1 = 0xffffffff
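The two results above are what you get from keeping only the low 32 bits of the 64-bit value. As a quick check of that arithmetic (a sketch only; low32 is a name I made up, and it uses a conversion to unsigned, which is well-defined modulo 2^32, rather than whatever glibc does internally):

```c
#include <stdint.h>

/* Hypothetical helper: keep only the low 32 bits of a 64-bit value,
   returned as an unsigned bit pattern. Converting to an unsigned type
   is defined as reduction modulo 2^32, so there is no UB here. */
uint32_t low32(int64_t v)
{
    return (uint32_t)v;
}
```

low32(1111111111111111111) is 0x2bc471c7, and low32(INT64_MAX) is 0xffffffff, which reads as -1 when viewed as a 32-bit two's complement int.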

I used GDB to single-step into scanf with glibc 2.38-7 on my Arch GNU/Linux system (letting debuginfod fetch the library source code, very helpful). It eventually reached __strtol_l (https://codebrowser.dev/glibc/glibc/stdlib/strtol_l.c.html#215) after a bunch of stdio overhead and copying characters one at a time into a tmp buffer, checking the base each time to see if it should be checking for hex or base-10 digits. Yikes, not efficient.

https://codebrowser.dev/glibc/glibc/stdlib/strtol_l.c.html#466 is the actual part of that function which checks for overflow with something like total >= ULONG_MAX/10 and the trailing decimal digit of ULONG_MAX against the new digit being converted, before doing the total = total*base + digit.

// glibc/stdlib/strtol_l.c
INT
INTERNAL (__strtol_l) (const STRING_TYPE *nptr, STRING_TYPE **endptr,
               int base, int group, locale_t loc)
{
...
    if (c >= L_('0') && c <= L_('9'))
      c -= L_('0');
...  // check for grouping characters like ' if enabled
    else if (ISALPHA (c))
      c = TOUPPER (c) - L_('A') + 10;
    else
      break;

// my comments added:
// c is the new digit converted to integer in the [0,base) range
// i is the total to be returned
    if ((int) c >= base)
      break;
    /* Check for overflow.  */
    if (i > cutoff || (i == cutoff && c > cutlim))   // cutoff and cutlim were set from a lookup table according to base
      overflow = 1;
    else
      {
      use_long:             // goto label from a loop using narrower types, if LONG isn't the same size as long
        i *= (unsigned LONG int) base;
        i += c;
      }
    }

...
  if (__glibc_unlikely (overflow))
    {
      __set_errno (ERANGE);
#if UNSIGNED
      return STRTOL_ULONG_MAX;
#else
      return negative ? STRTOL_LONG_MIN : STRTOL_LONG_MAX;
#endif
    }
...

(Yes, the loop could skip overflowing digits and still process a later smaller digit, but the later code doesn’t use i at all if overflow is set.)
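Stripped of the locale, grouping and multi-base handling, the cutoff/cutlim idea can be sketched on its own. This is not glibc's code: accumulate_decimal is a hypothetical name, it only handles unsigned base-10 digits, and it computes the limits directly instead of using a lookup table.

```c
#include <limits.h>

/* Hypothetical helper, not glibc code: accumulate decimal digits into an
   unsigned long and report overflow instead of wrapping.
   Returns 0 on success, -1 if the value would exceed ULONG_MAX. */
int accumulate_decimal(const char *digits, unsigned long *out)
{
    const unsigned long base = 10;
    const unsigned long cutoff = ULONG_MAX / base; /* largest total that can still be multiplied by base */
    const unsigned long cutlim = ULONG_MAX % base; /* largest digit allowed when total == cutoff */
    unsigned long total = 0;

    for (; *digits >= '0' && *digits <= '9'; digits++) {
        unsigned long c = (unsigned long)(*digits - '0');
        /* Check before multiplying, so the multiplication never wraps. */
        if (total > cutoff || (total == cutoff && c > cutlim))
            return -1;
        total = total * base + c;
    }
    *out = total;
    return 0;
}
```

The check happens before the multiply-and-add, which is why the real loop never needs to inspect a wrapped value.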


5

We can all see that man7 does indeed list it as deprecated, but no-one here is answering the pertinent question that was asked of "why".

How come %d is deprecated? It seems that all int specifiers are deprecated.

The man pages describe the state of a current POSIX distribution. Thus each system may have its own set of man pages, and the documentation on one can differ from another. Ideally you’d consult your local man page with man sscanf. However, the online man pages, e.g. at man7, are convenient. But note that they’re describing a system that isn’t yours, or perhaps even an idealised system that doesn’t exist.

You should always be wary about reading the man pages for a system that you aren’t programming for, as they can document older or newer versions of the same interface.

In this instance, man7 is hosting the man pages as used by the Linux Kernel team and the GNU lib c team. This particular change, marking the sscanf integer specifiers as deprecated, was made in a15d34326c581eab10 a year ago and is contained in the released man-pages-6.02. The latest change, adding the BUGS note, was made in 1f9949d11f499e5758f7e21 and is contained in man-pages 6.03. Whether that change ends up in your distribution’s man pages is another matter.

The discussion surrounding this is actually about ERANGE, and you can follow that in a few places, e.g.

Someone even asks the same question as OP. The response can be seen at From: Alejandro Colomar @ 2023-01-20 13:12 UTC. Some snippets:

Should it
really be deprecated?

While the interface of sscanf(3) numeric conversions is not mis-designed and
could be fixed, it is not correctly implemented, nor even standardized.

I think it’s correct to deprecate unless there’s a clear effort to fix it.

Is the undefined behavior here a real world issue
anywhere, or is this just a theoretical issue based on interpretation of the C
standard?

All implementations of sscanf(3) produce Undefined Behavior (UB), AFAIK. How
much you consider UB to be a real-world issue differs for each programmer, but I
tend to consider all UB to be as bad as nasal demons. I’m not saying UB
shouldn’t exist, just that you shouldn’t invoke it. And a function that is used
for scanning user input is one of those places where you really want to avoid
invoking UB.

One common aspect of man page documentation is that they draw a distinction between the POSIX compatible interface and the interface as used by their system. Both are available on man7.org:

You’ll notice the 3p version doesn’t list %d as deprecated. Therefore %d is only deprecated on the systems documented by man7.org.

If you wish to stop using scanf (and sscanf, fscanf), then there’s a handy guide available

1

  • Good point to distinguish Posix man pages from the system ones. I didn't even know the Posix ones exist.

    – Peter – Reinstate Monica

    2 hours ago


4

As pointed out in the comments (thanks to @JeffHolt, @Eugene-sh, @DanielWalker, @Barmar), the answer is indeed in the BUGS section:

BUGS
   Numeric conversion specifiers
       Use of the numeric conversion specifiers produces Undefined
       Behavior for invalid input.  See C11 7.21.6.2/10 
       ⟨https://port70.net/%7Ensz/c/c11/n1570.html#7.21.6.2p10⟩.  This is
       a bug in the ISO C standard, and not an inherent design issue
       with the API.  However, current implementations are not safe from
       that bug, so it is not recommended to use them.  Instead,
       programs should use functions such as strtol(3) to parse numeric
       input.  This manual page deprecates use of the numeric conversion
       specifiers until they are fixed by ISO C.

I do agree that "deprecate" here means "express disapproval of" (as per @Barmar’s comment).


3

To echo everyone else, this use of "deprecated" is weird. They really mean "not recommended", not "no longer supported".

Here’s the issue the author of the man page is complaining about:

Assume the code

int x;
printf( "Gimme a number: " );
if ( scanf( "%d", &x ) == 1 )
  do_something_with( x );
else
  { /* handle input error */ }

and the input

12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

This is a syntactically valid decimal integer constant:

6.4.4.1 Integer constants

integer-constant :
    decimal-constant integer-suffix(opt)
    octal-constant integer-suffix(opt)
    hexadecimal-constant integer-suffix(opt)

decimal-constant :
    nonzero-digit
    decimal-constant digit

nonzero-digit: one of
    1 2 3 4 5 6 7 8 9

digit: one of
    0 1 2 3 4 5 6 7 8 9

and the scanf function will match the longest sequence of characters that satisfies the %d conversion:

7.21.6.2 The fscanf function

9 An input item is read from the stream, unless the specification includes an n specifier. An
input item is defined as the longest sequence of input characters which does not exceed
any specified field width and which is, or is a prefix of, a matching input sequence.

The first character, if any, after the input item remains unread. If the length of the input
item is zero, the execution of the directive fails; this condition is a matching failure unless
end-of-file, an encoding error, or a read error prevented input from the stream, in which
case it is an input failure.

No field width is specified, so that entire input will be converted and assigned to x and scanf will return 1 to indicate success; the problem is that input will overflow and result in undefined behavior.

Using a %d or %i or %o (or %s or pretty much any conversion specifier) without an explicit field width opens you up to accepting input that could lead to numeric overflow or worse.
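As a sketch of what an explicit field width buys you (scan_int_bounded is a hypothetical name, and the width 9 assumes int is at least 32 bits, which the C standard does not guarantee):

```c
#include <stdio.h>

/* Hypothetical helper: convert at most 9 characters (sign included) with
   %9d. Assuming int is at least 32 bits, any 9-character decimal input
   is at most 999999999 in magnitude and therefore representable. */
int scan_int_bounded(const char *s, int *out)
{
    return sscanf(s, "%9d", out);
}
```

Given "12345678901234567890", this stores 123456789 and returns 1; the remaining digits are simply left unread, so the caller still has to decide whether leftover input is an error.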

This is one of those areas where C has no blade guards and will cut you if you aren’t careful. The optional bounds-checking version (scanf_s) only makes sure none of the arguments are NULL; it doesn’t check for numeric overflow.

*scanf is only really appropriate if you know your input is well-behaved. If you can’t guarantee your input is well-behaved, then you shouldn’t use *scanf at all; instead, use fgets to read input as text and perform some basic sanity checks for length and content before attempting to do any conversions.
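A minimal sketch of that approach, under my own assumptions about what "sanity checks" means here: the line must end in a newline (so fgets did not truncate it), must contain only an optional sign followed by digits, and must fit in int. The name line_to_int is hypothetical.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: validate one line as returned by fgets, then
   convert it. Returns 0 on success, -1 if the line is truncated, empty,
   malformed, or out of range for int. */
int line_to_int(const char *line, int *out)
{
    size_t len = strlen(line);
    size_t i = 0;
    char *end;
    long v;

    if (len == 0 || line[len - 1] != '\n')  /* no newline: line did not fit in the buffer */
        return -1;
    len--;                                  /* ignore the newline from here on */
    if (line[i] == '+' || line[i] == '-')
        i++;
    if (i == len)                           /* empty line, or a sign with no digits */
        return -1;
    for (size_t j = i; j < len; j++)
        if (!isdigit((unsigned char)line[j]))
            return -1;

    errno = 0;
    v = strtol(line, &end, 10);             /* stops at the newline */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -1;
    *out = (int)v;
    return 0;
}
```

Typical use would be something like: char buf[64]; if (fgets(buf, sizeof buf, stdin) && line_to_int(buf, &x) == 0) do_something_with(x);. A final line without a trailing newline is rejected by this sketch, which may or may not be what you want.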


