Things to do for GNU grep

  Copyright (C) 1992, 1997-2002, 2004-2022 Free Software Foundation, Inc.

  Copying and distribution of this file, with or without modification,
  are permitted in any medium without royalty provided the copyright
  notice and this notice are preserved.

===============
Short term work
===============

See where we are with UTF-8 performance.

Merge Debian patches that seem relevant.

Go through patches in Savannah.

Fix --directories=read.

Write better Texinfo documentation for grep.  The manual page would be a
good place to start, but Info documents are also supposed to contain a
tutorial and examples.

Some tests in tests/spencer2.tests should have failed!  Need to filter out
some bugs in dfa.[ch]/regex.[ch].

Multithreading?

GNU grep originally did 32-bit arithmetic.  Although it has moved to
64-bit on 64-bit platforms by using types like ptrdiff_t and size_t,
this conversion has not been entirely systematic and should be checked.

Lazy dynamic linking of the PCRE library.

Check FreeBSD’s integration of zgrep (-Z) and bzgrep (-J) in one
binary.  Is there a possibility of doing even better by automatically
checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
0x9D for compress, and 0x42 0x5A 0x68 for bzip2)?  Once what to do with
the PCRE library is decided, do the same for libz and libbz2.


===================
Matching algorithms
===================

Take a look at these and consider opportunities for merging or cloning:

   -- http://osrd.org/projects/grep/global-regular-expression-print-tools-grep-variants
   -- ja-grep’s mlb2 patch (Japanese grep)
      <http://distcache.freebsd.org/ports-distfiles/grep-2.4.2-mlb2.patch.gz>
   -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
      <http://www.mt.cs.keio.ac.jp/person/narita/lv/>;
   -- cgrep (Context grep) <https://awgn.github.io/cgrep/>
      seems like nice work;
   -- sgrep (Struct grep) <https://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
   -- agrep (Approximate grep) <https://www.tgries.de/agrep/>,
      from glimpse;
   -- nr-grep (Nondeterministic reverse grep)
      <https://www.dcc.uchile.cl/~gnavarro/software/>;
   -- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
   -- freegrep <https://github.com/howardjp/freegrep>;

Check some new algorithms for matching.  See, for example, Faro &
Lecroq (cited in kwset.c).

Fix the DFA matcher to never use exponential space.  (Fortunately, these
cases are rare.)


============================
Standards: POSIX and Unicode
============================

For POSIX compliance issues, see POSIX 1003.1.

Current support for the POSIX [= =] and [. .] constructs is limited to
platforms whose regular expression matchers are sufficiently
compatible with the GNU C library so that the --without-included-regex
option of ‘configure’ is in effect.  Extend this support to non-glibc
platforms, where --with-included-regex is in effect, by modifying the
included version of the regex code to defer to the native version when
handling [= =] and [. .].

For Unicode, interesting things to check include the Unicode Standard
<https://www.unicode.org/standard/standard.html> and the Unicode Technical
Standard #18 (<https://www.unicode.org/reports/tr18/> “Unicode Regular
Expressions”).  Talk to Bruno Haible who’s maintaining GNU libunistring.
See also Unicode Standard Annex #15 (<https://www.unicode.org/reports/tr15/>
“Unicode Normalization Forms”), already implemented by GNU libunistring.

In particular, --ignore-case needs to be evaluated against the standards.
We may want to deviate from POSIX if Unicode provides better or clearer
semantics.

POSIX and --ignore-case
-----------------------

For this issue, interesting things to check in POSIX include the
Open Group Base Specifications, Chapter “Regular Expressions”, in
particular Section “Regular Expression General Requirements” and its
paragraph about caseless matching (this may not have been fully
thought through, and the text may be self-contradictory
[specifically: “of either data or patterns” versus all the rest]).
See:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02

In particular, consider the following with POSIX’s approach to case
folding in mind.  Assume a non-Turkic locale with a character
repertoire reduced to the following various forms of “LATIN LETTER I”:

  0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
  0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
  0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;\
    LATIN CAPITAL LETTER I DOT;;;0069;
  0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049

UTF-8 octet lengths differ between U+0049 (0x49) and U+0069 (0x69)
versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1).  This implies that
whole UTF-8 strings cannot be case-converted in place, using the same
memory buffer, and that the needed octet-size of the new buffer cannot
merely be guessed (although there’s a simple upper bound of four times
the size of the input, as the longest UTF-8 encoding of any character
is four bytes).

We have

  lc(I) = i, uc(I) = I
  lc(i) = i, uc(i) = I
  lc(İ) = i, uc(İ) = İ
  lc(ı) = ı, uc(ı) = I

where lc() and uc() denote lower-case and upper-case conversions.

There are several candidate --ignore-case logics.  Using the

  if (lc(input_wchar) == lc(pattern_wchar))

logic leads to the following matches:

    \in  I  i  İ  ı
  pat\   ----------
   I  |  Y  Y  Y  n
   i  |  Y  Y  Y  n
   İ  |  Y  Y  Y  n
   ı  |  n  n  n  Y

There is a lack of symmetry between CAPITAL and SMALL LETTERs with
this.  Using the

  if (uc(input_wchar) == uc(pattern_wchar))

logic (which is what GNU grep currently does, although this is neither
documented nor guaranteed in the future) leads to the following
matches:

    \in  I  i  İ  ı
  pat\   ----------
   I  |  Y  Y  n  Y
   i  |  Y  Y  n  Y
   İ  |  n  n  Y  n
   ı  |  Y  Y  n  Y

There is a lack of symmetry between CAPITAL and SMALL LETTERs with
this.

Using the

  if (lc(input_wchar) == lc(pattern_wchar)
      || uc(input_wchar) == uc(pattern_wchar))

logic leads to the following matches:

    \in  I  i  İ  ı
  pat\   ----------
   I  |  Y  Y  Y  Y
   i  |  Y  Y  Y  Y
   İ  |  Y  Y  Y  n
   ı  |  Y  Y  n  Y

There is some elegance and symmetry with this.  But there are
potentially two conversions to be made per input character.  If the
pattern is pre-converted, two copies of it need to be kept and used in
a mutually coherent fashion.

Using the

  if (input_wchar  == pattern_wchar
      || lc(input_wchar) == pattern_wchar
      || uc(input_wchar) == pattern_wchar)

logic (a plausible interpretation of POSIX) leads to the following
matches:

    \in  I  i  İ  ı
  pat\   ----------
   I  |  Y  Y  n  Y
   i  |  Y  Y  Y  n
   İ  |  n  n  Y  n
   ı  |  n  n  n  Y

There is a different CAPITAL/SMALL symmetry with this.  But there’s
also a loss of pattern/input symmetry that’s unique to it.  Also there
are potentially two conversions to be made per input character.

Using the

  if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))

logic leads to the following matches:

    \in  I  i  İ  ı
  pat\   ----------
   I  |  Y  Y  Y  Y
   i  |  Y  Y  Y  Y
   İ  |  Y  Y  Y  Y
   ı  |  Y  Y  Y  Y

This shows total symmetry and transitivity (at least in this example
analysis).  There are two conversions to be made per input character,
but support could be added for having a single straight mapping
performing a composition of the two conversions.

Any optimization in the implementation of each logic must not change
its basic semantics.


Unicode and --ignore-case
-------------------------

For this issue, interesting things to check in Unicode include:

  - The Unicode Standard, Chapter 3
    (<https://www.unicode.org/versions/Unicode9.0.0/ch03.pdf>
    “Conformance”), Section 3.13 (“Default Case Algorithms”) and the
    toCasefold() case conversion operation.

  - The Unicode Standard, Chapter 4
    (<https://www.unicode.org/versions/Unicode9.0.0/ch04.pdf>
    “Character Properties”), Section 4.2 (“Case”) and
    the <https://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
    SpecialCasing.txt and
    <https://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
    CaseFolding.txt files.

  - The Unicode Standard, Chapter 5
    (<https://www.unicode.org/versions/Unicode9.0.0/ch05.pdf>
    “Implementation Guidelines”), Section 5.18 (“Case Mappings”),
    Subsection “Caseless Matching”.

  - The Unicode case charts <https://www.unicode.org/charts/case/>.

Unicode uses the

  if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))

logic for caseless matching.  Consider the “LATIN LETTER I” example
mentioned above.  In a non-Turkic locale, simple case folding yields

  toCasefold_simple(U+0049) = U+0069
  toCasefold_simple(U+0069) = U+0069
  toCasefold_simple(U+0130) = U+0130
  toCasefold_simple(U+0131) = U+0131

which leads to the following matches:

    \in  I  i  İ  ı
  pat\   ----------
   I  |  Y  Y  n  n
   i  |  Y  Y  n  n
   İ  |  n  n  Y  n
   ı  |  n  n  n  Y

This is different from anything so far!

In a non-Turkic locale, full case folding yields

  toCasefold_full(U+0049) = U+0069
  toCasefold_full(U+0069) = U+0069
  toCasefold_full(U+0130) = <U+0069, U+0307>
  toCasefold_full(U+0131) = U+0131

with

  0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;

which leads to the following matches:

    \in  I  i  İ  ı
  pat\   ----------
   I  |  Y  Y  *  n
   i  |  Y  Y  *  n
   İ  |  n  n  Y  n
   ı  |  n  n  n  Y

This is just sad!

Having toCasefold(U+0131), simple or full, map to itself instead of
U+0069 is in contradiction with the rules of Section 5.18 of the
Unicode Standard since toUpperCase(U+0131) is U+0049.  Same thing for
toCasefold_simple(U+0130) since toLowerCase(U+0130) is U+0069.  The
justification for the weird toCasefold_full(U+0130) mapping is
unknown; it doesn’t even make sense to add a dot (U+0307) to a letter
that already has one (U+0069).  It would have been so simple to put
them all in the same equivalence class!

Also consider the following problem with Unicode’s approach to case
folding in mind.  Assume that we want to perform

  echo 'AßBC' | grep -i 'Sb'

which corresponds to

  input:    U+0041 U+00DF U+0042 U+0043 U+000A
  pattern:  U+0053 U+0062

Following CaseFolding.txt, applying the toCasefold() transformation to
these yields

  input:    U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
  pattern:                U+0073 U+0062

so, according to this approach, the input should match the pattern.
As long as the original input line is to be reported to the user as a
whole, there is no problem (from the user’s point-of-view;
implementation is complicated by this).

However, consider both these GNU extensions:

  echo 'AßBC' | grep -i --only-matching 'Sb'
  echo 'AßBC' | grep -i --color=always  'Sb'

What is to be reported in these cases, since the match begins in the
*middle* of the original input character ‘ß’?

Unicode’s toCasefold() cannot be implemented in terms of POSIX’s
towctrans() since that can only return a single wint_t value per input
wint_t value.
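
The ‘ß’ problem above can be made concrete with a small sketch.  This
is not grep code: struct folded, fold_one(), fold_string() and
find_folded() are hypothetical names, and fold_one() is a deliberately
tiny stand-in for CaseFolding.txt that knows only ASCII letters and
U+00DF.  Because one input character may fold to two code points
(which a towctrans()-style one-to-one mapping cannot express), each
folded code point records which input character it came from and its
position inside that character's expansion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One folded code point, plus where it came from in the original input.  */
struct folded
{
  uint32_t cp;   /* folded code point */
  size_t src;    /* index of the originating input character */
  size_t part;   /* position inside that character's expansion (0, 1, ...) */
};

/* Hypothetical stand-in for CaseFolding.txt: ASCII letters and
   U+00DF -> <U+0073, U+0073> only.  Returns the number of code points
   written to OUT (1 or 2).  */
static size_t
fold_one (uint32_t cp, uint32_t out[2])
{
  if (cp == 0x00DF)
    {
      out[0] = out[1] = 0x0073;
      return 2;
    }
  out[0] = ('A' <= cp && cp <= 'Z') ? cp + ('a' - 'A') : cp;
  return 1;
}

/* Fold SRC[0..N) into DST (capacity CAP), recording provenance.
   Returns the folded length.  */
static size_t
fold_string (const uint32_t *src, size_t n, struct folded *dst, size_t cap)
{
  size_t k = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint32_t tmp[2];
      size_t m = fold_one (src[i], tmp);
      for (size_t j = 0; j < m; j++)
        {
          assert (k < cap);
          dst[k].cp = tmp[j];
          dst[k].src = i;
          dst[k].part = j;
          k++;
        }
    }
  return k;
}

/* Naive search for PAT[0..M) in IN[0..N), comparing folded code points.
   Returns the folded index of the first match, or (size_t) -1.  */
static size_t
find_folded (const struct folded *in, size_t n,
             const struct folded *pat, size_t m)
{
  for (size_t i = 0; i + m <= n; i++)
    {
      size_t j = 0;
      while (j < m && in[i + j].cp == pat[j].cp)
        j++;
      if (j == m)
        return i;
    }
  return (size_t) -1;
}
```

With the input U+0041 U+00DF U+0042 U+0043 U+000A and the pattern
U+0053 U+0062, find_folded() returns folded index 2, and that entry
has src == 1 and part == 1: the match starts at the second folded code
point of ‘ß’.  A nonzero part at either end of a match is exactly the
case where --only-matching and --color have no whole input character
to report; what to do then is still the open question above.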