Age | Commit message | Author | Lines |
|
previously (before and after rewrite), spurious escaping of path
separators as \/ was not treated the same as /, but rather got split
as an unpaired \ at the end of the fnmatch pattern and an unescaped /,
resulting in a mismatch/error.
for the case of \/ as part of the maximal literal prefix, remove the
explicit rejection of it and move the handling of / below escape
processing.
for the case of \/ after a proper glob pattern, it's hard to parse the
pattern, so don't. instead cheat and count repetitions of \ prior to
the already-found / character. if there are an odd number, the last is
escaping the /, so back up the split position by one. now the
char clobbered by null termination is variable, so save it and restore
as needed.
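
the backslash-counting trick described above can be sketched as follows. this is a minimal illustration, not the code from the commit; the function name and the index-based interface are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical helper (not from the commit): pat[slash] is an
 * already-located '/'. Returns the index at which to split the pattern
 * so that a backslash escaping the slash stays with the slash instead
 * of dangling at the end of the preceding component. */
size_t split_before_slash(const char *pat, size_t slash)
{
    size_t backslashes = 0;

    /* Count the run of consecutive backslashes directly before the '/'. */
    while (backslashes < slash && pat[slash - 1 - backslashes] == '\\')
        backslashes++;

    /* Odd run: the last backslash escapes the '/', so back up by one.
     * Even run: the backslashes pair up and the '/' is unescaped. */
    return (backslashes % 2) ? slash - 1 : slash;
}
```

a caller would then save the character at the returned index, overwrite it with a null byte for the fnmatch call, and restore it afterwards.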
|
|
this code has been long overdue for a rewrite, but the immediate cause
that necessitated it was total failure to see past unreadable path
components. for example, A/B/* would fail to match anything, even
though it should succeed, when both A and A/B are searchable but only
A/B is readable. this problem was both caught in conformance testing
and impacted users.
the old glob implementation insisted on searching the listing of each
path component for a match, even if the next component was a literal.
it also used considerable stack space, up to length of the pattern,
per recursion level, and relied on an artificial bound of the pattern
length by PATH_MAX, which was incorrect because a pattern can be much
longer than PATH_MAX while having matches shorter (for example, with
necessarily long bracket expressions, or with redundancy).
in the new implementation, each level of recursion starts by consuming
the maximal literal (possibly escaped-literal) path prefix remaining
in the pattern, and only opening a directory to read when there is a
proper glob pattern in the next path component. it then recurses into
each matching entry. the top-level glob function provides automatic
storage (up to PATH_MAX) for construction of candidate/result strings,
and allocates a duplicate of the pattern that can be modified in-place
with temporary null-termination to pass to fnmatch. this allocation is
not a big deal since glob already has to perform allocation, and has
to link free to clean up if it experiences an allocation failure or
other error after some results have already been allocated.
care is taken to use the d_type field from iterated dirents when
possible; stat is called only when there are literal path components
past the last proper-glob component, or when needed to disambiguate
symlinks for the purpose of GLOB_MARK.
one peculiarity with the new implementation is the manner in which the
error handling callback will be called. if attempting to match */B/C/D
where a directory A exists that is inaccessible, the error reported
will be a stat error for A/B/C/D rather than (previous and wrong
implementation) an opendir error for A, or (likely on other
implementations) a stat error for A/B. such behavior does not seem to
be non-conforming, but if it turns out to be undesirable for any
reason, backtracking could be done on error to report the first
component producing it.
also, redundant slashes are no longer normalized, but preserved as
they appear in the pattern; this is probably more correct, and falls
out naturally from the algorithm used. since trailing slashes (which
force all matches to be directories) are preserved as well, the
behavior of GLOB_MARK has been adjusted not to append an additional
slash to results that already end in slash.
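
the adjusted GLOB_MARK behavior can be pictured with a small sketch. the helper below is hypothetical (its name and buffer interface are assumptions, not the commit's code); it only shows the rule of not appending a second slash.

```c
#include <stddef.h>

/* Hypothetical helper: mark a directory result with a trailing '/'
 * unless it already ends in one. buf holds a string of length len and
 * must have room for at least one more byte plus the terminator.
 * Returns the resulting length. */
size_t mark_dir(char *buf, size_t len)
{
    if (len && buf[len - 1] == '/')
        return len;          /* already ends in slash: leave as-is */
    buf[len] = '/';
    buf[len + 1] = '\0';
    return len + 1;
}
```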
|
|
len was already passed as an argument, so don't use strcat, and use
memcpy instead of strcpy.
|
|
|
|
the LFS64 macro was not self-documenting and barely saved any
characters. simply use weak_alias directly so that it's clear what's
being done, and doesn't depend on a header to provide a strange macro.
|
|
|
|
commit 8c4be3e2209d2a1d3874b8bc2b474668fcbbbac6 was written to
preclude the GLOB_PERIOD extension from matching these directory
entries, but also precluded literal matches.
adjust the check that excludes . and .. to check whether the
GLOB_PERIOD flag is in effect, so that it cannot alter behavior in
cases governed by the standard, and also don't exclude . or .. in any
case where normal glob behavior (fnmatch's FNM_PERIOD flag) would have
included one or both of them (patterns such as ".*").
it's still not clear whether this is the preferred behavior for
GLOB_PERIOD, but at least it's clear that it can no longer break
applications which are not relying on quirks of a nonstandard feature.
|
|
GLOB_PERIOD is a gnu extension, and GNU glob does not seem to honor it
except in the last path component. it's not clear whether this is a bug
or intentional, but it seems reasonable that it should exclude the
special entries . and .. when walking.
changes based on report and analysis by Julien Ramseier.
|
|
the check to prevent matching empty string wrongly blocked matching
of "/" due to checking emptiness after stripping leading slashes
rather than checking the full original argument string.
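
a quick way to check the behavior described above is to glob the root directory itself. this is only a sketch; the function name is made up for the example, and it assumes a system where "/" exists.

```c
#include <glob.h>
#include <string.h>

/* Returns 1 if glob() reports exactly the single match "/" for the
 * pattern "/", and 0 otherwise (including when glob() fails). */
int glob_matches_root(void)
{
    glob_t g;
    int ok = 0;

    if (glob("/", 0, NULL, &g) == 0) {
        ok = g.gl_pathc == 1 && strcmp(g.gl_pathv[0], "/") == 0;
        globfree(&g);
    }
    return ok;
}
```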
simplified from patch by Julien Ramseier.
|
|
With REG_NEWLINE, POSIX says:
"A <newline> in string shall not be matched by a period outside
a bracket expression or by any form of a non-matching list"
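
The quoted rule can be observed with a small test helper. This is a sketch using only the standard regcomp/regexec interface; the helper name is an assumption made for the example.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 on match, 0 on no match, -1 if the ERE fails to compile.
 * cflags is OR-ed into the compile flags (e.g. REG_NEWLINE). */
int ere_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int r;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | cflags) != 0)
        return -1;
    r = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return r == 0;
}
```

With REG_NEWLINE, neither `a.b` nor `a[^x]b` should match the string "a\nb"; without the flag, both should.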
|
|
the fix in commit c3edc06d1e1360f3570db9155d6b318ae0d0f0f7 for
CVE-2016-8859 used gotos to exit on overflow conditions, but the code
in that error path assumed the buffer pointer was valid or null. thus,
the conditions which previously led to under-allocation and buffer
overflow could instead lead to an invalid pointer being passed to
free.
|
|
commit 0dc99ac413d8bc054a2e95578475c7122455eee8 added input length
checking to avoid unsafe VLA allocation, but put it in the wrong
place, before the glob_t structure was zeroed out. while POSIX isn't
clear on whether it's permitted to call globfree after glob failed
with GLOB_NOSPACE, making it safe is clearly better than letting
uninitialized pointers get passed to free in non-conforming callers.
while we're fixing this, change strlen check to the idiomatic strnlen
version to avoid unbounded input scanning before returning an error.
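
the strnlen idiom mentioned above looks roughly like this. it is a sketch only: the function name and the caller-supplied limit are assumptions, and the actual bound used by glob is not shown here.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <string.h>

/* Returns 1 if s is longer than max bytes, 0 otherwise. At most
 * max+1 bytes of s are examined, so an over-long (or hostile) input
 * is rejected without scanning all of it. */
int exceeds_limit(const char *s, size_t max)
{
    if (max == SIZE_MAX)
        return 0;            /* no string can exceed SIZE_MAX bytes */
    return strnlen(s, max + 1) > max;
}
```

the equivalent strlen(s) > max gives the same answer but reads the whole string first.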
|
|
In BRE, ^ is an anchor at the beginning of an expression, optionally
it may be an anchor at the beginning of a subexpression and must be
treated as a literal otherwise.
Previously musl treated ^ in subexpressions as literal, but at least
glibc and gnu sed treat it as an anchor and that's the more useful
behaviour: it can always be escaped to get back the literal meaning.
Same for $ at the end of a subexpression.
Portable BRE should not rely on this, but there are sed commands in
build scripts which do.
This changes the meaning of the BREs:
\(^a\)
\(a\|^b\)
\(a$\)
\(a$\|b\)
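
The changed meaning of these BREs can be checked with a short helper. This is a sketch; the helper name is an assumption, and the results in the usage note hold only on implementations that treat ^ and $ inside subexpressions as anchors, which the standard leaves optional.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the BRE matches anywhere in text, 0 if it does not,
 * and -1 if the pattern fails to compile. */
int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int r;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    r = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return r == 0;
}
```

With anchor semantics, `\(^a\)` matches "ab" but not "ba", and `\(a$\)` matches "ba" but not "ab".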
|
|
we inherited from TRE regexec code that's utterly wrong with respect
to the integer types it's using. while it doesn't appear that
compilers are producing unsafe output, signed integer overflows seem
to happen, and regexec fails to find matches past offset INT_MAX.
this patch fixes the type of all variables/fields used to store
offsets in the string from int to regoff_t. after the changes, basic
testing showed that regexec can now find matches past 2GB (INT_MAX)
and past 4GB on x86_64, and code generation is unchanged on i386.
|
|
most of the possible overflows were already ruled out in practice by
regcomp having already succeeded performing larger allocations.
however at least the num_states*num_tags multiplication can clearly
overflow in practice. for safety, check them all, and use the proper
type, size_t, rather than int.
also improve comments, use calloc in place of malloc+memset, and
remove bogus casts.
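
the kind of check described above can be sketched like this. the function and the int element type are hypothetical, chosen only to illustrate guarding a num_states*num_tags product before allocating.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a zeroed num_states x num_tags table
 * of ints. Returns NULL if the element count would overflow size_t
 * or if the allocation itself fails. */
int *alloc_tag_table(size_t num_states, size_t num_tags)
{
    /* Reject the product before computing it. */
    if (num_tags && num_states > SIZE_MAX / num_tags)
        return NULL;

    /* calloc zeroes the memory and checks count*size for overflow. */
    return calloc(num_states * num_tags, sizeof(int));
}
```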
|
|
the num_submatches field of some ast nodes was not initialized in
tre_add_tag_{left,right}, but was accessed later.
this was a benign bug since the uninitialized values were never used
(these values are created during tre_add_tags and copied around during
tre_expand_ast where they are also used in computations, but nothing
in the final tnfa depends on them).
|
|
This is a workaround to treat * as literal * at the start of a BRE.
Ideally ^ would be treated as an anchor at the start of any BRE
subexpression and similarly $ would be an anchor at the end of any
subexpression. This is not required by the standard and hard to do
with the current code, but it's the existing practice. If it is
changed, * should be treated as literal after such anchor as well.
|
|
commit 7eaa76fc2e7993582989d3838b1ac32dd8abac09 made * invalid at
the start of a BRE subexpression, but it should be accepted as
literal * there according to the standard.
This patch does not fix subexpressions starting with ^*.
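
A small check of the literal-* behavior, as a sketch: the helper name is an assumption, and only the standard regcomp/regexec calls are used.

```c
#include <regex.h>

/* Returns the start offset of the first match of the BRE in text,
 * -1 if there is no match, or -2 if the pattern does not compile. */
long bre_match_start(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    long pos = -1;

    if (regcomp(&re, pattern, 0) != 0)
        return -2;
    if (regexec(&re, text, 1, m, 0) == 0)
        pos = (long)m[0].rm_so;
    regfree(&re);
    return pos;
}
```

When a leading * is taken as a literal, both `*a` and `\(*a\)` find the text "*a" at offset 1 in "b*a".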
|
|
the 10k-element stack limit is increased to 1000k, otherwise tnfa creation
fails for reasonably sized patterns: a single literal char can add 7 elements
to this stack, so regcomp of a 1500 char long pattern (with only literal
chars) fails with REG_ESPACE. (the new limit allows fewer than about 150k
chars; this arbitrary limit allows most command line regex usage.)
ideally there would be no upper bound: regcomp dynamically reallocates
this buffer, every reallocation checks for allocation failure and at
the end this stack is freed so there is no reason for special bound.
however that may have an unwanted effect on regcomp and regexec runtime
so this is a conservative change.
|
|
|
|
These are undefined escape sequences by the standard, but often
used in sed scripts.
|
|
The goto logic was hard to follow and modify. This is
in preparation for the BRE \+ and \? support.
|
|
The standard does not define semantics for \| in BRE, but some code
depends on it meaning alternation. Empty alternative expression is
allowed to be consistent with ERE.
Based on a patch by Rob Landley.
|
|
Previously repetitions were accepted after empty expressions like
in (*|?)|{2}, but in BRE the handling of * and \{\} were not
consistent: they were accepted as literals in some cases and
repetitions in others.
It is better to treat repetitions after an empty expression as an
error (this is allowed by the standard, and glibc mostly does the
same). This is hard to do consistently with the current logic so
the new rule is:
Reject repetitions after empty expressions, except after assertions
^*, $? and empty groups ()+ and never treat them as literals.
Empty alternation (|a) is undefined by the standard, but it can be
useful so that should be accepted.
|
|
This should not change the meaning of the code, just make the intent
clearer: advancing position is tied to adding a new literal.
|
|
The error code of an allocating function was not checked in tre_add_tag.
|
|
this patch makes the functions which work directly on multibyte
characters treat the high bytes as individual abstract code units
rather than as multibyte sequences when MB_CUR_MAX is 1. since
MB_CUR_MAX is presently defined as a constant 4, all of the new code
added is dead code, and optimizing compilers' code generation should
not be affected at all. a future commit will activate the new code.
as abstract code units, bytes 0x80 to 0xff are represented by wchar_t
values 0xdf80 to 0xdfff, at the end of the surrogates range. this
ensures that they will never be misinterpreted as Unicode characters,
and that all wctype functions return false for these "characters"
without needing locale-specific logic. a high range outside of Unicode
such as 0x7fffff80 to 0x7fffffff was also considered, but since C11's
char16_t also needs to be able to represent conversions of these
bytes, the surrogate range was the natural choice.
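
the mapping described above can be written out as a short sketch. the function names are assumptions made for the example and this is not the code added by the commit; it only reproduces the stated ranges.

```c
#include <wchar.h>

/* Map a byte to a wchar_t under the scheme described above: ASCII
 * bytes map to themselves, and high bytes 0x80..0xff map to the
 * abstract code units 0xdf80..0xdfff. */
wchar_t byte_to_codeunit(unsigned char b)
{
    return b < 0x80 ? (wchar_t)b : (wchar_t)(0xdf00 | b);
}

/* Inverse mapping: returns the byte value, or -1 if wc does not
 * correspond to a single byte under this scheme. */
int codeunit_to_byte(wchar_t wc)
{
    if (wc >= 0 && wc < 0x80)
        return (int)wc;
    if (wc >= 0xdf80 && wc <= 0xdfff)
        return (int)(wc & 0xff);
    return -1;
}
```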
|
|
Internally regcomp needs to copy some iteration nodes before
translating the AST into TNFA representation.
Literal nodes were not copied correctly: the class type and list
of negated class types were not copied so classes were ignored
(in the non-negated case an ignored char class caused the literal
to match everything).
This affects iterations when the upper bound is finite, larger
than one or the lower bound is larger than one. So eg. the EREs
[[:digit:]]{2}
[^[:space:]ab]{1,4}
were treated as
.{2}
[^ab]{1,4}
The fix is done with minimal source modification to copy the
necessary fields, but the AST preparation and node handling
code of tre will need to be cleaned up for clarity.
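
The two example EREs can be exercised with a short helper. This is a sketch; the helper name is an assumption, and the ^ and $ anchors are added here only to force the whole string to be consumed.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the ERE matches text, 0 if it does not, and -1 if the
 * pattern fails to compile. */
int ere_test(const char *pattern, const char *text)
{
    regex_t re;
    int r;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    r = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return r == 0;
}
```

With the classes copied correctly, `^[[:digit:]]{2}$` rejects "ab" and `^[^[:space:]ab]{1,4}$` rejects "x y"; under the old behavior both strings would have been accepted.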
|
|
The valid BRE backref tokens are \1 .. \9, and 0 is not a special
character either so \0 is undefined by the standard.
Such undefined escaped characters are treated as literal characters
currently, following existing practice, so \0 is the same as 0.
|
|
one of the features of ERE is that it's actually a regular language
and does not admit expressions which cannot be matched in linear time.
introduction of \n backref support into regcomp's ERE parsing was
unintentional.
|
|
the regex parser handles the (undefined) case of an unexpected byte
following a backslash as a literal. however, instead of correctly
decoding a character, it was treating the byte value itself as a
character. this was not only semantically unjustified, but turned out
to be dangerous on archs where plain char is signed: bytes in the
range 252-255 alias the internal codes -4 through -1 used for special
types of literal nodes in the AST.
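
the aliasing hazard can be shown in a couple of lines. this sketch illustrates only why a raw char must not be used as a character code; it does not reproduce the multibyte decoding that the fix performs, and the function name is made up for the example.

```c
/* Convert a pattern byte to a non-negative value. Where plain char is
 * signed, using c directly as an int yields -4..-1 for the bytes
 * 252..255, which would collide with negative sentinel codes. */
int byte_value(char c)
{
    return (unsigned char)c;
}
```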
|
|
|
|
The new code is a bit simpler and the generated code is about 1KB
smaller (on i386). The basic design was kept including internal
interfaces, TNFA generation was not touched.
The old tre parser had various issues:
[^aa-z]
negated overlapping ranges in a bracket expression were handled
incorrectly (eg [^aa-z] was handled as [^a] instead of [^a-z])
a{,2}
missing lower bound in a counted repetition should be an error,
but it was accepted with broken semantics: a{,2} was treated as
a{0,3}, the new parser rejects it
a{999,}
large min count was not rejected (a{5000,} failed with REG_ESPACE
due to reaching a stack limit), the new parser enforces the
RE_DUP_MAX limit
\xff
regcomp used to accept a pattern with illegal sequences in it
(treated them as empty expression so p\xffq matched pq) the new
parser rejects such patterns with REG_BADPAT or REG_ERANGE
[^b-fD-H] with REG_ICASE
old parser turned this into [^b-fB-F] because of the negated
overlapping range issue (see above), the new parser treats it
as [^b-hB-H], POSIX seems to require [^d-fD-F], but practical
implementations do case-folding first and negate the character
set later instead of the other way around. (Supporting the posix
way efficiently would require significant changes so it was left
as is, it is unclear if any application actually expects the
posix behaviour, this issue is raised on the austingroup tracker:
http://austingroupbugs.net/view.php?id=872 ).
another case-insensitive matching issue is that unicode case
folding rules can group more than two characters together while
towupper and towlower can only work for a pair of upper and
lower case characters, this is a limitation of POSIX so it is
not fixed.
invalid bracket and brace expressions may return different error
codes now (REG_ERANGE instead of REG_EBRACK or REG_BADBR instead
of REG_EBRACE) otherwise the new parser should be compatible with
the old one.
regcomp should be able to handle arbitrary pattern input if the
pattern length is limited, the only exception is the use of large
repetition counts (eg. (a{255}){255}) which require an exponential
amount of memory and there is no easy workaround.
|
|
|
|
for LC_MESSAGES, translation of strerror and similar literal message
functions is supported. for messages in other places (particularly the
dynamic linker) that use format strings, translation is not yet
supported. in order to make it possible and safe, such messages will
need to be refactored to separate the textual content from the format.
for LC_TIME, the day and month names and strftime-style format strings
provided by nl_langinfo are supported for translation. however there
may be limitations, as some of the original C-locale nl_langinfo
strings are non-unique and thus perhaps non-suitable as keys.
overall, the locale support activated by this commit should not be
seen as complete and polished but as a basis for beginning to test
locale functionality and implement locales.
|
|
per POSIX, the nmatch and pmatch arguments are ignored when the regex
was compiled with REG_NOSUB.
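
a sketch of how to observe this: pass a pmatch array filled with a sentinel and check that it is left alone. the function name and the sentinel value are assumptions made for the example.

```c
#include <regex.h>

/* Returns 1 if regexec() left the caller's pmatch entry untouched for
 * a regex compiled with REG_NOSUB, 0 if the entry was modified, and
 * -1 if compilation or matching failed. */
int nosub_leaves_pmatch_alone(void)
{
    regex_t re;
    regmatch_t m[1] = { { .rm_so = -2, .rm_eo = -2 } };  /* sentinel */
    int r;

    if (regcomp(&re, "b", REG_NOSUB) != 0)
        return -1;
    r = regexec(&re, "abc", 1, m, 0);
    regfree(&re);
    if (r != 0)
        return -1;
    return m[0].rm_so == -2 && m[0].rm_eo == -2;
}
```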
|
|
|
|
previously this flag was defined and accepted as a no-op, possibly
breaking some software that uses it. given the choice to remove the
definition and possibly break applications that were already working,
or simply implement the feature, the latter turned out to be easy
enough to make the decision easy.
in the case where the FNM_PATHNAME flag is also set, this
implementation is clean and essentially optimal. otherwise, it's an
inefficient "brute force" implementation. at some point, when cleaning
up and refactoring this code, I may add a more direct code path for
handling FNM_LEADING_DIR in the non-FNM_PATHNAME case, but at this
point my main interest is avoiding introducing new bugs in the code
that implements the standard fnmatch features specified by POSIX.
|
|
the FNM_PATHNAME logic for advancing by /-delimited components was
incorrect when the / character was escaped (i.e. \/), and a final \ at
the end of pattern was not handled correctly.
|
|
a '/' in the pattern could be incorrectly matched against the
terminating null byte in the string causing arbitrarily long
sequence of out-of-bounds access in fnmatch("/","",FNM_PATHNAME)
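
the expected results for this case and a few related ones can be checked with a thin wrapper. this is a sketch; the wrapper name is an assumption and only the standard fnmatch call is used.

```c
#include <fnmatch.h>

/* Returns 1 if string s matches pattern pat under FNM_PATHNAME
 * rules, 0 otherwise. */
int path_match(const char *pat, const char *s)
{
    return fnmatch(pat, s, FNM_PATHNAME) == 0;
}
```

the pattern "/" must not match the empty string, while it does match "/"; an escaped slash in the pattern still matches a literal slash, and * does not match across one.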
|
|
sizeof had incorrect argument in a few places, the size was always
large enough so the issue was not critical.
|
|
it's not clear to me at the moment whether the code that was removed
(and which is now being re-added) is needed, but it's far from being a
no-op, and i don't want to risk breaking regex in this release.
|
|
some structs and functions had reference to the params
feature of tre that is not used by the code anymore
|
|
pos_start local variable is not used in tre_tnfa_run_backtrack
|
|
to deal with the fact that the public headers may be used with pre-c99
compilers, __restrict is used in place of restrict, and defined
appropriately for any supported compiler. we also avoid the form
[restrict] since older versions of gcc rejected it due to a bug in the
original c99 standard, and instead use the form *restrict.
|
|
TRE has a broken assumption that wchar_t is signed, which is a sane
expectation, but not required by the standard, and false on ARM's ABI.
i leave tre_char_t as wchar_t for now, since a pointer to it is
directly passed to functions that need pointer to wchar_t. it does not
seem to break anything. and since the maximum unicode scalar value is
0x10ffff, just use that explicitly rather than using the max value of
any particular C type.
|
|
these are cruft from the original code which used an explicit string
length rather than null termination. i blindly converted all the
checks to null terminator checks, without noticing that in several
cases, the subsequent switch statement would automatically handle the
null byte correctly.
|
|
i don't understand why this has to be conditional on being in BRE
mode, but enabling this code unconditionally breaks a huge number of
ERE test cases.
|
|
|
|
|