I was just reading the glibc sscanf man page (from the Linux man-pages package) and I found the following:
The following conversion specifiers are available:
(…)
d
Deprecated. Matches an optionally signed decimal integer; the
next pointer must be a pointer to int.
i
Deprecated. Matches an optionally signed integer; the next
pointer must be a pointer to int. The integer is read in base
16 if it begins with 0x or 0X, in base 8 if it begins with 0,
and in base 10 otherwise. Only characters that correspond to
the base are used.
o
Deprecated. Matches an unsigned octal integer; the next pointer
must be a pointer to unsigned int.
(…)
How come %d is deprecated? It seems that all int specifiers are deprecated.
What does it mean and what is there to replace them?
6 Answers
How come %d is deprecated? It seems that all int specifiers are deprecated.
They are not deprecated in the sense that that term is ordinarily used in software documentation. There is no plan for their removal from the language and there are no direct replacements. The ISO committee responsible for maintaining the language standard has not expressed any opinion that they should be avoided, though there are indeed workarounds available to avoid their use.
The deprecation notices on some Linux manual pages that you are asking about constitute an inappropriate liberty taken by the maintainer of that version of the documentation. It is explained in the BUGS section of the same page:
Numeric conversion specifiers
Use of the numeric conversion specifiers produces Undefined
Behavior for invalid input. See C11 7.21.6.2/10
⟨https://port70.net/%7Ensz/c/c11/n1570.html#7.21.6.2p10⟩. This is
a bug in the ISO C standard, and not an inherent design issue
with the API. However, current implementations are not safe from
that bug, so it is not recommended to use them. Instead,
programs should use functions such as strtol(3) to parse numeric
input. This manual page deprecates use of the numeric conversion
specifiers until they are fixed by ISO C.
The manual page maintainer is both unfortunately opinionated and atypically aggressive. It is a somewhat controversial opinion that it constitutes a bug in the standard for the affected functions to have undefined behavior for invalid input. It is a valid opinion that that is a good reason to avoid numeric conversion specifiers, but that author is not empowered to deprecate the functions in the sense that readers of the manual page would typically understand. The conventional approach to a situation like this would be to add references to the BUGS section at appropriate places in the manual text, possibly even with a brief explanatory note. Deprecation labels are not that, no matter how they are explained elsewhere in the document.
With that said, the scanf-family functions are overall difficult to use correctly. Some around here are prone to recommend avoiding them entirely, and that should certainly be considered. If you do avoid them, then that moots the issue.
- I'm just as frustrated with the manpages' maintainer's deprecation-happy attitude as you are, but I do not think there is any reasonable counterargument to the claim that 7.21.6.2p10's rule "if the result of the conversion cannot be represented in the object, the behavior is undefined" is a design defect in the standard. The only reason I haven't filed a DR is that I consider *scanf unfit for purpose anyway, for reasons that are much harder to fix. – zwol, yesterday
- At least every occurrence of "Deprecated." should have mentioned the reason and suggested alternative: "Deprecated, prefer strtol." or "Deprecated, see BUGS."! – Bergi, yesterday
- I understand the maintainer's attitude. Because the input is typically not under the control of the programmer, the UB is a potential opportunity for an exploit or at least denial of service triggered with malicious input. But many programs process input from known sources, and then this is not an issue. I'm also skeptical of the suggestion to replace a well-tested and versatile tool like scanf with one's own code. strtol is not perfectly trivial to use properly either (what's again the condition that the token was entirely read?) and there still is the issue of tokenizing etc. – Peter – Reinstate Monica, 22 hours ago
- Maybe "Don't use with untrusted input; see BUGS.". Cc: @zwol – alx – recommends codidact, 8 hours ago
- @alx-recommendscodidact Thanks for listening and reacting! I think it will now cause much less confusion like the one which triggered this post (and less head-shaking by experienced users 😉 ). – Peter – Reinstate Monica, 2 hours ago
This is explained in the BUGS section of the man page:
Numeric conversion specifiers
Use of the numeric conversion specifiers produces Undefined
Behavior for invalid input. See C11 7.21.6.2/10
⟨https://port70.net/%7Ensz/c/c11/n1570.html#7.21.6.2p10⟩. This is
a bug in the ISO C standard, and not an inherent design issue
with the API. However, current implementations are not safe from
that bug, so it is not recommended to use them. Instead,
programs should use functions such as strtol(3) to parse numeric
input. This manual page deprecates use of the numeric conversion
specifiers until they are fixed by ISO C.
So it’s not deprecated by the C language specification; the author of the man page is using this notation to indicate that the specifiers are not safe to use.
However, this is only a problem in practice if the input being read might not contain validly formatted data. If you’re reading a file that is formatted reliably, you can use these specifiers safely.
This actually seems to be an inconsistency in the language spec, because it also says that the function returns the number of valid conversions (or EOF if an input failure occurs before the first conversion). It makes no sense to say that a conversion failure is undefined behavior and also say what it returns in that case, and most implementations return the value properly.
The man-page author is being overly pedantic in recommending against these specifiers, in my opinion.
- What's doubly-bad about putting this in a man page? The authors of the man page are the authors of the implementation they're complaining about. The C standard does not preclude the glibc authors from defining the behavior of their own implementation. – Andrew Henle, 2 days ago
- They're complaining about "Use of the numeric conversion specifiers produces Undefined Behavior for invalid input." That's a "bug" glibc devs are free to fix – no one is stopping them from defining the behavior for their implementation. GCC, for example, has -fwrapv that defines the behavior of signed integer overflow – behavior that otherwise, were the "logic" of this man page followed – would "deprecate" all integer operations in C. "They could overflow and cause undefined behavior!!!" Would a compiler that "deprecates" every use of + between integral arguments be sane? – Andrew Henle, 2 days ago
- @AndrewHenle They're not complaining about an implementation. They say it's a bug in the ISO C specification, because of the inconsistency of saying that it's undefined and then saying that it returns the number of valid conversions. – Barmar, 2 days ago
- That's asinine. There's no contradiction in returning the number of valid conversions and "this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined". Glibc devs can implement what they want. Where's the contradiction in signed integer overflow resulting in undefined behavior and 6.5.6 Additive operators, paragraph 5's "The result of the binary + operator is the sum of the operands"? – Andrew Henle, 2 days ago
- @AndrewHenle: I wouldn't be surprised if the intent of the man page is to warn that these aren't safely portable, even if glibc did check for overflow. (In practice I'm sure at worst the behaviour on integer overflow in glibc scanf is wrapping, since they compile to asm that has to work for non-overflowing cases, and the conversion loops are simple total = total*base + digit unless they check for overflow like in some other parts of glibc, such as in handling %12d conversions for printf. Parsing the 12 does check for overflow, making it unfortunately kinda slow for the common small case.) – Peter Cordes, yesterday
This notice in the man page is for the benefit of people trying to write portable programs.
Since there has been speculation about what glibc itself does in this case, I decided to check.
The glibc source code actually avoids signed-overflow UB, at least in the conversion function that scanf("%d") uses. At worst you could say the conversion result is undefined with glibc, but not the behaviour of the whole program. int on GNU systems doesn’t have trap values (it’s 2’s complement), so this can’t make your program crash or misbehave, other than perhaps not having a numeric value that matches what you might get from other ways of parsing the string. E.g. if your code looked at the last decimal digit as well as using sscanf to convert, you could have -1 even though the last decimal digit was even.
You get errno == ERANGE after a glibc scanf integer conversion that overflowed long or unsigned long, for conversions of long or narrower. (%lld on a 32-bit system would only check for overflow of long long.)
I checked with this test program:
#include <stdio.h>

int main(void)
{
    int tmp = 0xcccccccc;   // recognizable bit pattern as the initial value
    int conv_result = scanf("%d", &tmp);
    printf("successful conversions = %d, result = %d = %#x\n",
           conv_result, tmp, (unsigned)tmp);
}
With input that fits in a long (64-bit on x86-64 GNU/Linux), we get that value truncated to int.
With larger input, glibc detects overflow and produces -1 (actually LONG_MIN or LONG_MAX according to the sign, in this case LONG_MAX which gets truncated to -1 when narrowing to int).
For example it converts 1111111111111111111111111111111 as -1, but 1111111111111111111 as 734294471 = 0x2bc471c7. See it on Godbolt with 2 executors that feed stdin with those inputs. It treats this as a successful conversion either way, scanf returning 1, e.g.
successful conversions = 1, result = -1 = 0xffffffff
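Both results above are simply the low 32 bits of the 64-bit value that the long conversion produced. A minimal sketch of that arithmetic, assuming a 64-bit long and a 32-bit two's complement int as on x86-64 GNU/Linux; low32 and show_narrowing are names made up for this illustration, not anything from glibc:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: keep only the low 32 bits of a 64-bit value.
   Conversion to an unsigned type is defined as reduction modulo 2^32. */
uint32_t low32(int64_t v)
{
    return (uint32_t)v;
}

/* Reproduce the two bit patterns quoted in the text. */
void show_narrowing(void)
{
    int64_t fits_in_long = INT64_C(1111111111111111111); /* 0x0f6b75ab2bc471c7 */
    int64_t clamped      = INT64_MAX;                    /* LONG_MAX if long is 64-bit */

    assert(low32(fits_in_long) == UINT32_C(0x2bc471c7)); /* 734294471 */
    assert(low32(clamped)      == UINT32_C(0xffffffff)); /* reads as -1 in a 32-bit int */

    printf("%#lx %#lx\n",
           (unsigned long)low32(fits_in_long),
           (unsigned long)low32(clamped));
}
```

Calling show_narrowing() checks both values; the second one is why any input that overflows long shows up as -1 in the int.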
I used GDB to single-step into scanf with glibc 2.38-7 on my Arch GNU/Linux system (letting debuginfod fetch the library source code, very helpful). It eventually reached __strtol_l (https://codebrowser.dev/glibc/glibc/stdlib/strtol_l.c.html#215) after a bunch of stdio overhead and copying characters one at a time into a tmp buffer, checking the base each time to see if it should be checking for hex or base-10 digits. Yikes, not efficient.
https://codebrowser.dev/glibc/glibc/stdlib/strtol_l.c.html#466 is the actual part of that function which checks for overflow with something like total >= ULONG_MAX/10 and the trailing decimal digit of ULONG_MAX against the new digit being converted, before doing the total = total*base + digit.
// glibc/stdlib/strtol_l.c
INT
INTERNAL (__strtol_l) (const STRING_TYPE *nptr, STRING_TYPE **endptr,
                       int base, int group, locale_t loc)
{
...
      if (c >= L_('0') && c <= L_('9'))
        c -= L_('0');
... // check for grouping characters like ' if enabled
      else if (ISALPHA (c))
        c = TOUPPER (c) - L_('A') + 10;
      else
        break;
      // my comments added:
      // c is the new digit converted to integer in the [0,base) range
      // i is the total to be returned
      if ((int) c >= base)
        break;
      /* Check for overflow. */
      if (i > cutoff || (i == cutoff && c > cutlim)) // cutoff and cutlim were set from a lookup table according to base
        overflow = 1;
      else
        {
        use_long: // goto label from a loop using narrower types, if LONG isn't the same size as long
          i *= (unsigned LONG int) base;
          i += c;
        }
    }
...
  if (__glibc_unlikely (overflow))
    {
      __set_errno (ERANGE);
#if UNSIGNED
      return STRTOL_ULONG_MAX;
#else
      return negative ? STRTOL_LONG_MIN : STRTOL_LONG_MAX;
#endif
    }
...
(Yes, the loop could skip overflowing digits and still process a later smaller digit, but the later code doesn’t use i at all if overflow is set.)
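The cutoff/cutlim test in that excerpt can be tried in isolation. The following is a simplified sketch of the same idea for unsigned decimal input only, not glibc's code: parse_ulong_dec is a hypothetical name, and cutoff and cutlim are computed directly from ULONG_MAX here, whereas the excerpt's comment says glibc takes them from a lookup table.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper, not glibc code: parse leading decimal digits of s,
   detecting overflow before the multiply-and-add would wrap.
   Stops at the first non-digit, much like a %lu conversion would.
   Returns 0 on success, -1 if there are no digits, -2 on overflow. */
int parse_ulong_dec(const char *s, unsigned long *out)
{
    const unsigned long base   = 10;
    const unsigned long cutoff = ULONG_MAX / base; /* largest total that may still be multiplied */
    const unsigned long cutlim = ULONG_MAX % base; /* largest digit allowed when total == cutoff */
    unsigned long total = 0;
    size_t ndigits = 0;

    assert(s != NULL && out != NULL);

    for (; *s >= '0' && *s <= '9'; s++, ndigits++) {
        unsigned long digit = (unsigned long)(*s - '0');

        /* total*base + digit would exceed ULONG_MAX: report it instead of wrapping */
        if (total > cutoff || (total == cutoff && digit > cutlim))
            return -2;
        total = total * base + digit;
    }
    if (ndigits == 0)
        return -1;
    *out = total;
    return 0;
}
```

Unlike the glibc loop, this sketch returns as soon as overflow is detected rather than setting a flag and continuing to consume digits; which is preferable depends on whether the caller needs to know where the number ended.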
We can all see that man7 does indeed list it as deprecated, but no-one here is answering the pertinent question that was asked of "why".
How come %d is deprecated? It seem that all int specifiers are deprecated.
The man pages describe the state of a current POSIX distribution. Thus each system may have its own set of man pages, and the documentation on one can differ from another. Ideally you’d consult your local man page with man sscanf. However, the online man pages, e.g. at man7, are convenient. But note that they’re describing a system that isn’t yours, or perhaps even an idealised system that doesn’t exist.
You should always be wary about reading the man pages for a system that you aren’t programming for, as they can document older or newer versions of the same interface.
In this instance, man7 is hosting the man pages as used by the Linux Kernel team and the GNU lib c team. This particular change, marking the sscanf integer specifiers as deprecated, was made in a15d34326c581eab10 a year ago and is contained in the released man-pages-6.02. The latest change, adding the BUGS note, was made in 1f9949d11f499e5758f7e21 and is contained in man-pages 6.03. Whether that change ends up in your distribution’s man pages is another matter.
The discussion surrounding this is actually about ERANGE, and you can follow that in a few places, e.g.
- https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=487254
- https://bugzilla.kernel.org/show_bug.cgi?id=61511
- https://lore.kernel.org/linux-man/[email protected]/T/#u
Someone even asks the same question as OP. The response can be seen at From: Alejandro Colomar @ 2023-01-20 13:12 UTC. Some snippets:
Should it
really be deprecated?
While the interface of sscanf(3) numeric conversions is not mis-designed and
could be fixed, it is not correctly implemented, nor even standardized.
I think it’s correct to deprecate unless there’s a clear effort to fix it.
Is the undefined behavior here a real world issue
anywhere, or is this just a theoretical issue based on interpretation of the C
standard?
All implementations of sscanf(3) produce Undefined Behavior (UB), AFAIK. How
much you consider UB to be a real-world issue differs for each programmer, but I
tend to consider all UB to be as bad as nasal demons. I’m not saying UB
shouldn’t exist, just that you shouldn’t invoke it. And a function that is used
for scanning user input is one of those places where you really want to avoid
invoking UB.
One common aspect of man page documentation is that they draw a distinction between the POSIX compatible interface and the interface as used by their system. Both are available on man7.org:
- https://man7.org/linux/man-pages/man3/sscanf.3.html
- https://man7.org/linux/man-pages/man3/fscanf.3p.html
You’ll notice the 3p version doesn’t list %d as deprecated. Therefore %d is only deprecated on the systems documented by man7.org.
If you wish to stop using scanf (and sscanf, fscanf), then there’s a handy guide available
- Good point to distinguish Posix man pages from the system ones. I didn't even know the Posix ones exist. – Peter – Reinstate Monica, 2 hours ago
As pointed out in the comments (thanks to @JeffHolt, @Eugene-sh, @DanielWalker, @Barmar), the answer is indeed in the BUGS section:
BUGS
Numeric conversion specifiers
Use of the numeric conversion specifiers produces Undefined
Behavior for invalid input. See C11 7.21.6.2/10
⟨https://port70.net/%7Ensz/c/c11/n1570.html#7.21.6.2p10⟩. This is
a bug in the ISO C standard, and not an inherent design issue
with the API. However, current implementations are not safe from
that bug, so it is not recommended to use them. Instead,
programs should use functions such as strtol(3) to parse numeric
input. This manual page deprecates use of the numeric conversion
specifiers until they are fixed by ISO C.
I do agree that "deprecate" here means "express disapproval of" (as per @Barmar’s comment).
To echo everyone else, this use of "deprecated" is weird. They really mean "not recommended", not "no longer supported".
Here’s the issue the author of the man page is complaining about:
Assume the code
int x;
printf( "Gimme a number: " );
if ( scanf( "%d", &x ) == 1 )
  do_something_with( x );
else
{
  // handle input error
}
and the input
12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
This is a syntactically valid decimal integer constant:
6.4.4.1 Integer constants
…
integer-constant:
    decimal-constant integer-suffix(opt)
    octal-constant integer-suffix(opt)
    hexadecimal-constant integer-suffix(opt)
decimal-constant:
    nonzero-digit
    decimal-constant digit
nonzero-digit: one of
    1 2 3 4 5 6 7 8 9
digit: one of
    0 1 2 3 4 5 6 7 8 9
and the scanf function will match the longest sequence of characters that satisfies the %d conversion:
7.21.6.2 The fscanf function
…
9 An input item is read from the stream, unless the specification includes an n specifier. An
input item is defined as the longest sequence of input characters which does not exceed
any specified field width and which is, or is a prefix of, a matching input sequence.
The first character, if any, after the input item remains unread. If the length of the input
item is zero, the execution of the directive fails; this condition is a matching failure unless
end-of-file, an encoding error, or a read error prevented input from the stream, in which
case it is an input failure.
No field width is specified, so that entire input will be converted and assigned to x and scanf will return 1 to indicate success; the problem is that the input will overflow and result in undefined behavior.
Using a %d or %i or %o (or %s or pretty much any conversion specifier) without an explicit field width opens you up to accepting input that could lead to numeric overflow or worse.
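As a rough illustration of what an explicit field width buys you, here is a sketch that limits %d to nine characters, so the converted value can never exceed 999999999. It assumes INT_MAX is at least that large (true for the usual 32-bit int); scan_int_width9 is a hypothetical name used only for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

#if INT_MAX < 999999999
#error "this sketch assumes int can hold any 9-digit decimal value"
#endif

/* Hypothetical helper: convert at most 9 characters (sign included) of a
   decimal integer from s. Returns 0 on success, -1 on a matching failure. */
int scan_int_width9(const char *s, int *out)
{
    assert(s != NULL && out != NULL);

    /* The width caps the input item at 9 characters, so the result
       always fits in int and the overflow case cannot arise. */
    return sscanf(s, "%9d", out) == 1 ? 0 : -1;
}
```

With the long input shown earlier, this stores 123456789 and leaves the remaining digits unread. So the width prevents the overflow but does not tell you that the input was too long; that still has to be checked separately.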
This is one of those areas where C has no blade guards and will cut you if you aren’t careful. The optional bounds-checking version (scanf_s) only makes sure none of the arguments are NULL; it doesn’t check for numeric overflow.
*scanf is only really appropriate if you know your input is well-behaved. If you can’t guarantee your input is well-behaved, then you shouldn’t use *scanf at all; instead, use fgets to read input as text and perform some basic sanity checks for length and content before attempting to do any conversions.
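One possible shape for that fgets-then-convert approach, using strtol as the BUGS section suggests. This is a sketch under assumptions rather than a definitive recipe: parse_int and read_int_line are hypothetical names, the 64-byte line limit is arbitrary, and the policy of rejecting trailing non-whitespace is just one reasonable choice.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a whole string to int.
   Returns 0 on success, -1 on failure (*out is left untouched). */
int parse_int(const char *s, int *out)
{
    char *end;
    long v;

    assert(s != NULL && out != NULL);

    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s)                       /* no digits were consumed */
        return -1;
    if (errno == ERANGE)                /* value does not fit in long */
        return -1;
    if (v < INT_MIN || v > INT_MAX)     /* fits in long but not in int */
        return -1;
    while (isspace((unsigned char)*end))
        end++;                          /* tolerate trailing whitespace/newline */
    if (*end != '\0')                   /* trailing junk such as "12abc" */
        return -1;

    *out = (int)v;
    return 0;
}

/* Hypothetical helper: read one line from in and convert it.
   Over-long lines are rejected; the rest of such a line stays unread. */
int read_int_line(FILE *in, int *out)
{
    char line[64];

    if (fgets(line, sizeof line, in) == NULL)
        return -1;                      /* EOF or read error */
    if (strchr(line, '\n') == NULL && !feof(in))
        return -1;                      /* line did not fit in the buffer */
    return parse_int(line, out);
}
```

A caller would use something like read_int_line(stdin, &x) and treat any nonzero return as invalid input. This also answers the question raised in a comment above about knowing the token was read entirely: that is the check on *end after skipping whitespace.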
man 3 sscanf does not indicate deprecation for my toolchain. You should cite yours.
2 days ago
There is an explanation in the "BUGS" subsection of the man page
2 days ago
The reasoning is given in the BUGS section of the Linux manpage @DanielWalker linked.
2 days ago
They're using the dictionary definition of "deprecate", which means "express disapproval of". This is not how it's usually used in software documentation, which is for warnings of obsolete features that are planned to be removed. It's the opinion of the man page author, not from the language specification.
2 days ago
@Barmar They're using the dictionary definition of "deprecate", which means "express disapproval of". This is not how it's usually used in software documentation… How very Microsoft of them. 😉
2 days ago