Source Code Printouts

Notepad++, a text editor for Microsoft Windows, is often used to print electronic text documents for use as exhibits when questioning deponents and trial witnesses.

The Notepad++ dialog box used to configure how source code documents are printed is accessed by Settings > Preferences > Print. The settings listed below are the ones I have successfully used on multiple cases.

  • Check the “Print line number” box. Even with this box checked, Notepad++ will print line numbers only if they are also displayed on the screen, which requires checking the box accessed by Settings > Preferences > Margins/Borders/Edge > Line Number > Display.
  • Choose “Black on white” from the Color Options to avoid hard-to-read colored text, which becomes even harder to read if the printout is scanned and re-printed, especially in greyscale.
  • Choose 0 for each Margin Setting to maximize content on the page since most cases limit the number of pages you can print.
  • In the Left part of the Header, put the “Full file path name” Variable, which is identified by the string: “$(FULL_CURRENT_PATH)”.
  • For the Header, since the “Full file path name” is sometimes very long, use a small, condensed font. The smallest, most condensed font that I have found to be still legible in hardcopy printouts is “Times New Roman” at size “9”, neither Bold nor Italic.
    • This setting is very important because Notepad++ neither wraps text in headers and footers nor warns you if it truncates header or footer text that is too long. This is particularly a problem since the part of the path it will truncate is some or all of the file name.
    • I suggest you ensure the root of the source code tree being cited is as short as possible; for example, “C:\Review\…” and not a folder which will use up the limited number of characters you have available in the header of your printouts. An example of an unnecessarily long root folder name would be: “C:\Review installed January 12, 2024\…”.
    • You might consider printing in landscape mode instead of portrait mode, but this is a trade-off. Landscape mode accommodates longer headers and footers and results in fewer wrapped source code lines, but it will probably print fewer lines per page, which might push you against your page limit.
    • I have found that Notepad++ will wrap long lines for its printouts even if those same long lines are not wrapped on the screen.
    • For the body of the document, a Notepad++ printout will use the same font name and font size used for the screen. The smallest, most condensed fixed-width font that I have found to be still legible in hardcopy printouts is “Courier New” at size “9”, without Bold, Italic, or Underline; see Settings > Style Configurator > Select theme : Default (stylers.xml) > Language : Global Styles ; Style : Global override > Font Style > Font name and Font size.
  • In the Left part of the Footer, put the confidentiality notice pertaining to source code, which can be found in the Protective Order. A confidentiality notice that I have found in multiple protective orders is: “HIGHLY CONFIDENTIAL – SOURCE CODE”, but reference the Protective Order for your case.
  • In the Middle part of the Footer, put the “Page” Variable, which is identified by the string: “$(CURRENT_PRINTING_PAGE)”
    • The page number is put in the middle part of the footer to leave room in the right part of the footer for the Bates number that the attorneys’ IT team will stamp on each page.
  • For the Footer, I use the same font name and font size as in the Header.

Warning

Notepad++ will print the text highlights that appear on the screen at the time of printing, if those highlights resulted from a text search or from a double-click on a word. For example, if you find a search term in the source code and then print that source code file, the highlighting of that search term will appear in the printout. This can give the other side a hint as to what you might consider to be important!

Further, any highlights will be printed in the color in which they appear on the screen, even if you have specified Settings > Preferences > Print > Color Options > Black on white. Therefore, if you print to PDF, those highlights will be in color in the PDF. However, if the IT staff that prints that PDF to paper uses a printer that can only print greyscale (or can print color, but is configured to print only greyscale), then the highlight will probably obscure the text beneath it on the printed page, possibly enough to make that text illegible.

Cite Lines

Patents, deposition testimony transcripts, and trial testimony transcripts provide easy and accurate ways to cite a single line or a range of lines.

However, citing lines in electronic text documents, like source code, can be tricky. For example, electronic text documents have the following characteristics:

  • A line can be terminated with:
    • a pair of “carriage return” and “line feed” characters (convention on Windows),
    • a single “line feed” character (convention on Linux/UNIX/macOS), or
    • a single “carriage return” character (convention on Mac OS prior to Mac OS X).
  • The last line might not be terminated with any of the above line termination characters.
  • Depending upon how the electronic text document is rendered, a long line might wrap to two or more additional lines.

I have seen source code files where some lines are terminated with a pair of “carriage return” and “line feed” characters and other lines in the same file are terminated with a single “line feed” character.

I have seen a source code file in which a line contained multiple instances of a single “carriage return” character, not as part of line termination, but embedded in a character string in the source code.
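
A quick way to find out whether a file has these characteristics is to count each kind of terminator in its raw bytes. The following is a minimal sketch, not a tool I used on a case; the function name `count_terminators` and the sample bytes are made up for illustration.

```python
def count_terminators(data: bytes) -> dict:
    """Count each kind of line terminator in the raw bytes of a file."""
    crlf = data.count(b"\r\n")
    # Every CR+LF pair also contains one CR and one LF, so subtract the pairs
    lone_lf = data.count(b"\n") - crlf
    lone_cr = data.count(b"\r") - crlf
    return {
        "CR+LF": crlf,
        "LF": lone_lf,
        "CR": lone_cr,
        "last_line_terminated": data.endswith((b"\n", b"\r")),
    }


# Hypothetical file: mixed terminators, a CR embedded in a string, no final terminator
sample = b"int a;\r\nint b;\nchar *s = \"x\ry\";\r\nint c;"
print(count_terminators(sample))
# prints {'CR+LF': 2, 'LF': 1, 'CR': 1, 'last_line_terminated': False}
```

To check a real file, read it in binary mode (for example, `open(path, "rb").read()`) so that no line-ending translation happens before the count.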

Source code citations must match source code printouts

Often citations in the expert report that refer to lines in an electronic text source code document must be cross-referenced to the printed copy of that document that is used as an exhibit when questioning a deponent or trial witness. Therefore, the citations made by the forensic software expert for inclusion in the expert report must match the line numbers in the printed copy of that document.

For example, Notepad++, a text editor for Microsoft Windows, is often used to print electronic text documents for use as exhibits when questioning deponents and trial witnesses; see the following post: Source Code Printouts.

By default, Notepad++ will separate lines at every occurrence of:

  • a pair of “carriage return” and “line feed” characters,
  • a single “line feed” character (not preceded by a “carriage return” character), and
  • a single “carriage return” character (not followed by a “line feed” character)

Therefore, in the example I gave above, Notepad++ printed multiple extra lines for the source code file that contained multiple single “carriage return” characters. I had to adjust my citations to the lines that followed that string to match what Notepad++ printed, even though the Integrated Development Environment (IDE) I was using showed a single line for the line containing the multiple “carriage return” characters.
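
The shift in line numbers can be reproduced with a short sketch. It models only the three-terminator rule described above, not the actual implementation of Notepad++; the function name `numbered_lines` and the sample source text are made up for illustration.

```python
import re

# CR+LF, a lone LF, and a lone CR each end a line.
# The alternation tries CR+LF first so that a pair is never counted as two breaks.
LINE_BREAK = re.compile(r"\r\n|\n|\r")


def numbered_lines(text: str) -> list:
    """Return (line number, line text) pairs under the three-terminator rule."""
    pieces = LINE_BREAK.split(text)
    if pieces and pieces[-1] == "":
        pieces.pop()  # nothing follows the final terminator
    return list(enumerate(pieces, start=1))


# Hypothetical file: the second line embeds two CR characters in a string literal
source = 'int a;\nchar *s = "x\ry\rz";\nint b;\n'

for number, line in numbered_lines(source):
    print(number, repr(line))
# prints:
# 1 'int a;'
# 2 'char *s = "x'
# 3 'y'
# 4 'z";'
# 5 'int b;'
```

A tool that breaks lines only at “line feed” characters would cite `int b;` as line 3 of this file, whereas the three-terminator rule places it on line 5.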

In one case, the only “IDE” that was installed on the review computer was Notepad++, so I had no conflict between the tool I used to identify source code citations and the tool I used to generate the source code printouts!

In another case, Notepad++ was not available on the review computer, but Cygwin was, so I had to prepare source code printouts using “cat -n”.

Line termination details

Of the 7 individual Unicode characters and 1 pair of Unicode characters that might be interpreted as line terminators, the default configuration of Notepad++ identifies a new line at every occurrence of a pair of “carriage return” and “line feed” characters, every occurrence of a single “line feed” character (not preceded by a “carriage return” character), and every occurrence of a single “carriage return” character (not followed by a “line feed” character).

Even though you might expect it, Notepad++ does not break a new line at occurrences of the “vertical tab”, “form feed”, “next line”, “line separator”, or “paragraph separator” characters.

LF:	Line Feed, U+000A
VT:	Vertical Tab, U+000B
FF:	Form Feed, U+000C
CR:	Carriage Return, U+000D
NEL:	Next Line, U+0085
LS:	Line Separator, U+2028
PS:	Paragraph Separator, U+2029
CR+LF:	Carriage Return, U+000D followed by Line Feed, U+000A

Here is an example of what Notepad++ prints when a file consists of lines which contain each of the above characters or pair of characters. Note the extra lines for the LF, CR, PS, and CR+LF characters or pair of characters.

Source code line counts must match source code printouts

Another type of analysis that involves lines is counting the lines in a file. When line counts are used to justify damage calculations, a forensic software expert must be exact.

Even if being inexact changes your side’s damage calculations by only an insignificant dollar amount, any difference between the lines counted for damage calculations and the number of lines that the triers-of-fact can see on a printed page will call your processes into question.

The way that tools like Notepad++ identify line endings when printing text files may not match the way that the following batch tools identify line endings: “wc -l”, “cat -n”, “grep -n”, “nl”, “grep -c”, the “NR” variable of “awk”, “sed”, and “cloc”.

For example, most of these batch tools do not consider a single “carriage return” character to be a line ending the way Notepad++ does. Therefore, always cross-check the line counts generated by these tools with the line counts on the printouts of those files.

As another example, “wc -l” will not count the last line if it is not followed by a line termination. In particular, it counts only the number of “line feed” characters in the file, which may not be the number of lines if the file does not use a “line feed” character in its line terminators.
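
The difference between the two counting conventions can be sketched as follows. The function names are hypothetical. The second function assumes that a final unterminated line counts as a line and that nothing is counted after a final terminator; check that assumption against your own printouts.

```python
import re


def wc_l_count(data: bytes) -> int:
    """Mimic the counting rule of "wc -l": count LF bytes only."""
    return data.count(b"\n")


def three_terminator_count(data: bytes) -> int:
    """Count lines where CR+LF, a lone LF, and a lone CR each end a line."""
    breaks = len(re.findall(rb"\r\n|\n|\r", data))
    ends_terminated = data.endswith((b"\n", b"\r"))
    # Assumption: an unterminated final line still counts as one line
    return breaks if ends_terminated or not data else breaks + 1


for sample in (b"a\nb\n", b"a\nb", b"a\rb\r", b"a\r\nb\r\n"):
    print(repr(sample), wc_l_count(sample), three_terminator_count(sample))
# prints:
# b'a\nb\n' 2 2
# b'a\nb' 1 2
# b'a\rb\r' 0 2
# b'a\r\nb\r\n' 2 2
```

The second and third samples are the cases where the two conventions disagree: a missing final terminator, and a file whose terminators contain no “line feed” character.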

Further, any purpose-built tools which count lines, like Python or Perl scripts you might write, need to count lines based on the same conventions used by the tool you use to generate the printouts. In particular, Python, in its attempt to be cross-platform, accepts different line-ending combinations depending on the “newline” parameter of its “open” function.
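
The sketch below illustrates how much the result depends on the way a Python script reads and splits the file. It writes a small temporary file (the name `sample.txt` and all variable names are illustrative) and reads it back with three settings of “newline”, then compares `str.splitlines()` with `bytes.splitlines()`.

```python
import os
import tempfile

raw = b"a\rb\r\nc\n"  # a lone CR, a CR+LF pair, and a lone LF

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Default (newline=None): CR, LF, and CR+LF each end a line and are translated to "\n"
    with open(path, encoding="utf-8") as f:
        translated = f.readlines()

    # newline="": the same three terminators end a line but are returned untranslated
    with open(path, encoding="utf-8", newline="") as f:
        untranslated = f.readlines()

    # newline="\n": only LF ends a line, so the lone CR stays inside the first line
    with open(path, encoding="utf-8", newline="\n") as f:
        lf_only = f.readlines()

print(translated)    # prints ['a\n', 'b\n', 'c\n']
print(untranslated)  # prints ['a\r', 'b\r\n', 'c\n']
print(lf_only)       # prints ['a\rb\r\n', 'c\n']

# str.splitlines() also breaks at FF, LS, and several other characters;
# bytes.splitlines() breaks only at CR, LF, and CR+LF.
text = "a\x0cb\u2028c\nd"
print(len(text.splitlines()))                  # prints 4
print(len(text.encode("utf-8").splitlines()))  # prints 2
```

Of these, the default text mode, `newline=""`, and `bytes.splitlines()` follow the same three-terminator convention described above for Notepad++, while `newline="\n"` and `str.splitlines()` do not.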

Conclusion

While it is easy for the triers-of-fact to identify lines and count lines on source code printouts you generate for exhibits used when questioning a deponent or trial witness, be careful to correlate those lines and line counts with the tools and processes you use to cite to and count those same source code lines.

The Ties that Bind

We often think of a folder tree as a set of independent files and folders, where deleting files and folders from that tree has no effect on the contents of the files and folders that remain.

However, to save space, some file systems support symbolic links, which create dependencies between otherwise independent files and folders. 

A symbolic link is a file whose only content is the name or path of another file or folder, called the target of the symbolic link. By default, when asked to open a symbolic link, most file processing tools will recognize the file as a symbolic link, read the target, and open the target file or folder. One advantage of a symbolic link is that the operating system does not have to store two copies of the same file.

However, there is a risk when using symbolic links. The operating system does not change a symbolic link if the target of that link has been renamed, moved, deleted, or never existed in the first place. A symbolic link whose target does not exist is called a broken link.
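
A minimal sketch of how a link becomes broken is shown below. It assumes a POSIX-like system on which the current user may create symbolic links; the helper name `is_broken_link` and the file names are made up for illustration.

```python
import os
import tempfile


def is_broken_link(path: str) -> bool:
    """True when path is a symbolic link whose target does not exist."""
    # os.path.islink() examines the link itself; os.path.exists() follows it to the target
    return os.path.islink(path) and not os.path.exists(path)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "target.txt")
    link = os.path.join(tmp, "link.txt")
    with open(target, "w") as f:
        f.write("data\n")
    os.symlink("target.txt", link)  # relative target, resolved from the link's folder

    before = is_broken_link(link)
    os.remove(target)               # the operating system leaves the link untouched
    after = is_broken_link(link)
    still_present = os.path.lexists(link)
    stored_target = os.readlink(link)

print(before, after, still_present, stored_target)
# prints False True True target.txt
```

After the target is deleted, the link still exists and still stores the same target string; only an attempt to follow it fails.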

The target of a symbolic link can be one of the following:

  • The absolute path of a file or folder

            For example, the target can be “/usr/include/stdio.h” or “/usr/include”.

  • The name of a file or folder, relative to the folder containing the symbolic link:

            For example, the target can be “stdio.h” or “fldr”, and if the symbolic link is in the folder whose path is, say, “/usr/include”, then the target would resolve to “/usr/include/stdio.h” or “/usr/include/fldr”.

  • A relative path to a file or folder, relative to the folder containing the symbolic link:

            For example, the target can be “fldr/stdio.h” or “fldr/subfldr”, and if the symbolic link is in the folder whose path is, say, “/usr/include”, then the target would resolve to “/usr/include/fldr/stdio.h” or “/usr/include/fldr/subfldr”.

            For a further example, the target can be “../fldr/stdio.h” or “../fldr/subfldr”, and if the symbolic link is in the folder whose path is, say, “/usr/include”, then the target would resolve to “/usr/include/../fldr/stdio.h” or “/usr/include/../fldr/subfldr”, which would be the same as “/usr/fldr/stdio.h” and “/usr/fldr/subfldr”, respectively.
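
The three kinds of target can be resolved with a few lines of path arithmetic. This is a sketch under stated assumptions: the function name `resolve_target` and the link name “mylink” are hypothetical, and `posixpath.normpath` collapses “..” purely lexically, which can differ from what the operating system does when an intermediate folder is itself a symbolic link.

```python
import posixpath


def resolve_target(link_path: str, target: str) -> str:
    """Return the path a symbolic link's target refers to, computed lexically."""
    if posixpath.isabs(target):
        return posixpath.normpath(target)
    # A relative target is interpreted from the folder that contains the link
    link_folder = posixpath.dirname(link_path)
    return posixpath.normpath(posixpath.join(link_folder, target))


link = "/usr/include/mylink"  # hypothetical link inside "/usr/include"
for target in ("/usr/include/stdio.h", "stdio.h", "fldr/stdio.h", "../fldr/stdio.h"):
    print(target, "->", resolve_target(link, target))
# prints:
# /usr/include/stdio.h -> /usr/include/stdio.h
# stdio.h -> /usr/include/stdio.h
# fldr/stdio.h -> /usr/include/fldr/stdio.h
# ../fldr/stdio.h -> /usr/fldr/stdio.h
```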

Because of the fragile dependencies that exist when symbolic links are used, it is possible for the engineer who assembles a production to unknowingly break one or more symbolic links. This will happen when a symbolic link is included in the production, its target exists on the file system from which the production was copied, but the target is not included in the production and the target does not exist on the review computer to which the production has been copied. 

What can be worse is when the target is not included in the production, but the review computer happens to have a file or folder at the target’s path. That file or folder was on the review computer before the production was copied there, and thus did not come from the production.

Here are some example scenarios: 

Scenario A: Say the production contains a symbolic link whose target is a relative path deeper in the folder tree: “include/file.h”. If the engineer who assembled the production pruned the production by removing the folder named “include”, then any attempt to access that symbolic link will fail.

Scenario B: Say the production contains a symbolic link whose target is a relative path that contains one or more references to a parent folder: “../../include/file.h”. If the engineer who assembled the production did not include the grandparent folder in the production, then:

  • If the review computer does not have a file whose path is “../../include/file.h”, relative to the folder containing the symbolic link, then any attempt to access that symbolic link will fail.
  • If the review computer does have a file whose path is “../../include/file.h”, relative to the folder containing the symbolic link, then any attempt to access that symbolic link will access that file, which is probably not part of the production!

Scenario C: Say the production contains a symbolic link whose target is the absolute path: “/usr/include/stdio.h”. If the engineer who assembled the production did not include “/usr”, “/usr/include”, or “/usr/include/stdio.h” in the production, then:

  • If the review computer does not have a file whose path is “/usr/include/stdio.h”, then any attempt to access that symbolic link will fail.
  • If the review computer does have a file whose path is “/usr/include/stdio.h”, then any attempt to access that symbolic link will access the file whose path is “/usr/include/stdio.h”, a file that was not part of the production!

Therefore, I created the script below, which performs the following checks. Given the path of the folder that is the root of the production, for each symbolic link in the production:

  1. If the target of the symbolic link is an absolute path and the root of the production is not the root of the file system (i.e., “/”), then flag that symbolic link as broken; else determine whether that target exists in the production.
  2. If the target of the symbolic link is a file name or folder name and that file name or folder name does not exist in the same folder as the symbolic link, then flag that symbolic link as broken.
  3. If the target of the symbolic link is a relative path deeper in the folder tree and that relative path does not exist relative to the folder containing the symbolic link, then flag that symbolic link as broken.
  4. If the target of the symbolic link is a relative path that contains one or more references to a parent folder and those one or more references to a parent folder reference a folder outside of the production, then flag that symbolic link as broken; else determine whether that target exists in the production.
#!/usr/bin/env python3

"""
Copyright 2020-2021 Stairstep Consulting LLC. All rights reserved.

Creative Commons Attribution-ShareAlike 4.0 International Public
License

By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution-ShareAlike 4.0 International Public License ("Public
License"). To the extent this Public License may be interpreted as a
contract, You are granted the Licensed Rights in consideration of Your
acceptance of these terms and conditions, and the Licensor grants You
such rights in consideration of benefits the Licensor receives from
making the Licensed Material available under these terms and
conditions.


Section 1 -- Definitions.

  a. Adapted Material means material subject to Copyright and Similar
     Rights that is derived from or based upon the Licensed Material
     and in which the Licensed Material is translated, altered,
     arranged, transformed, or otherwise modified in a manner requiring
     permission under the Copyright and Similar Rights held by the
     Licensor. For purposes of this Public License, where the Licensed
     Material is a musical work, performance, or sound recording,
     Adapted Material is always produced where the Licensed Material is
     synched in timed relation with a moving image.

  b. Adapter's License means the license You apply to Your Copyright
     and Similar Rights in Your contributions to Adapted Material in
     accordance with the terms and conditions of this Public License.

  c. BY-SA Compatible License means a license listed at
     creativecommons.org/compatiblelicenses, approved by Creative
     Commons as essentially the equivalent of this Public License.

  d. Copyright and Similar Rights means copyright and/or similar rights
     closely related to copyright including, without limitation,
     performance, broadcast, sound recording, and Sui Generis Database
     Rights, without regard to how the rights are labeled or
     categorized. For purposes of this Public License, the rights
     specified in Section 2(b)(1)-(2) are not Copyright and Similar
     Rights.

  e. Effective Technological Measures means those measures that, in the
     absence of proper authority, may not be circumvented under laws
     fulfilling obligations under Article 11 of the WIPO Copyright
     Treaty adopted on December 20, 1996, and/or similar international
     agreements.

  f. Exceptions and Limitations means fair use, fair dealing, and/or
     any other exception or limitation to Copyright and Similar Rights
     that applies to Your use of the Licensed Material.

  g. License Elements means the license attributes listed in the name
     of a Creative Commons Public License. The License Elements of this
     Public License are Attribution and ShareAlike.

  h. Licensed Material means the artistic or literary work, database,
     or other material to which the Licensor applied this Public
     License.

  i. Licensed Rights means the rights granted to You subject to the
     terms and conditions of this Public License, which are limited to
     all Copyright and Similar Rights that apply to Your use of the
     Licensed Material and that the Licensor has authority to license.

  j. Licensor means the individual(s) or entity(ies) granting rights
     under this Public License.

  k. Share means to provide material to the public by any means or
     process that requires permission under the Licensed Rights, such
     as reproduction, public display, public performance, distribution,
     dissemination, communication, or importation, and to make material
     available to the public including in ways that members of the
     public may access the material from a place and at a time
     individually chosen by them.

  l. Sui Generis Database Rights means rights other than copyright
     resulting from Directive 96/9/EC of the European Parliament and of
     the Council of 11 March 1996 on the legal protection of databases,
     as amended and/or succeeded, as well as other essentially
     equivalent rights anywhere in the world.

  m. You means the individual or entity exercising the Licensed Rights
     under this Public License. Your has a corresponding meaning.


Section 2 -- Scope.

  a. License grant.

       1. Subject to the terms and conditions of this Public License,
          the Licensor hereby grants You a worldwide, royalty-free,
          non-sublicensable, non-exclusive, irrevocable license to
          exercise the Licensed Rights in the Licensed Material to:

            a. reproduce and Share the Licensed Material, in whole or
               in part; and

            b. produce, reproduce, and Share Adapted Material.

       2. Exceptions and Limitations. For the avoidance of doubt, where
          Exceptions and Limitations apply to Your use, this Public
          License does not apply, and You do not need to comply with
          its terms and conditions.

       3. Term. The term of this Public License is specified in Section
          6(a).

       4. Media and formats; technical modifications allowed. The
          Licensor authorizes You to exercise the Licensed Rights in
          all media and formats whether now known or hereafter created,
          and to make technical modifications necessary to do so. The
          Licensor waives and/or agrees not to assert any right or
          authority to forbid You from making technical modifications
          necessary to exercise the Licensed Rights, including
          technical modifications necessary to circumvent Effective
          Technological Measures. For purposes of this Public License,
          simply making modifications authorized by this Section 2(a)
          (4) never produces Adapted Material.

       5. Downstream recipients.

            a. Offer from the Licensor -- Licensed Material. Every
               recipient of the Licensed Material automatically
               receives an offer from the Licensor to exercise the
               Licensed Rights under the terms and conditions of this
               Public License.

            b. Additional offer from the Licensor -- Adapted Material.
               Every recipient of Adapted Material from You
               automatically receives an offer from the Licensor to
               exercise the Licensed Rights in the Adapted Material
               under the conditions of the Adapter's License You apply.

            c. No downstream restrictions. You may not offer or impose
               any additional or different terms or conditions on, or
               apply any Effective Technological Measures to, the
               Licensed Material if doing so restricts exercise of the
               Licensed Rights by any recipient of the Licensed
               Material.

       6. No endorsement. Nothing in this Public License constitutes or
          may be construed as permission to assert or imply that You
          are, or that Your use of the Licensed Material is, connected
          with, or sponsored, endorsed, or granted official status by,
          the Licensor or others designated to receive attribution as
          provided in Section 3(a)(1)(A)(i).

  b. Other rights.

       1. Moral rights, such as the right of integrity, are not
          licensed under this Public License, nor are publicity,
          privacy, and/or other similar personality rights; however, to
          the extent possible, the Licensor waives and/or agrees not to
          assert any such rights held by the Licensor to the limited
          extent necessary to allow You to exercise the Licensed
          Rights, but not otherwise.

       2. Patent and trademark rights are not licensed under this
          Public License.

       3. To the extent possible, the Licensor waives any right to
          collect royalties from You for the exercise of the Licensed
          Rights, whether directly or through a collecting society
          under any voluntary or waivable statutory or compulsory
          licensing scheme. In all other cases the Licensor expressly
          reserves any right to collect such royalties.


Section 3 -- License Conditions.

Your exercise of the Licensed Rights is expressly made subject to the
following conditions.

  a. Attribution.

       1. If You Share the Licensed Material (including in modified
          form), You must:

            a. retain the following if it is supplied by the Licensor
               with the Licensed Material:

                 i. identification of the creator(s) of the Licensed
                    Material and any others designated to receive
                    attribution, in any reasonable manner requested by
                    the Licensor (including by pseudonym if
                    designated);

                ii. a copyright notice;

               iii. a notice that refers to this Public License;

                iv. a notice that refers to the disclaimer of
                    warranties;

                 v. a URI or hyperlink to the Licensed Material to the
                    extent reasonably practicable;

            b. indicate if You modified the Licensed Material and
               retain an indication of any previous modifications; and

            c. indicate the Licensed Material is licensed under this
               Public License, and include the text of, or the URI or
               hyperlink to, this Public License.

       2. You may satisfy the conditions in Section 3(a)(1) in any
          reasonable manner based on the medium, means, and context in
          which You Share the Licensed Material. For example, it may be
          reasonable to satisfy the conditions by providing a URI or
          hyperlink to a resource that includes the required
          information.

       3. If requested by the Licensor, You must remove any of the
          information required by Section 3(a)(1)(A) to the extent
          reasonably practicable.

  b. ShareAlike.

     In addition to the conditions in Section 3(a), if You Share
     Adapted Material You produce, the following conditions also apply.

       1. The Adapter's License You apply must be a Creative Commons
          license with the same License Elements, this version or
          later, or a BY-SA Compatible License.

       2. You must include the text of, or the URI or hyperlink to, the
          Adapter's License You apply. You may satisfy this condition
          in any reasonable manner based on the medium, means, and
          context in which You Share Adapted Material.

       3. You may not offer or impose any additional or different terms
          or conditions on, or apply any Effective Technological
          Measures to, Adapted Material that restrict exercise of the
          rights granted under the Adapter's License You apply.


Section 4 -- Sui Generis Database Rights.

Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:

  a. for the avoidance of doubt, Section 2(a)(1) grants You the right
     to extract, reuse, reproduce, and Share all or a substantial
     portion of the contents of the database;

  b. if You include all or a substantial portion of the database
     contents in a database in which You have Sui Generis Database
     Rights, then the database in which You have Sui Generis Database
     Rights (but not its individual contents) is Adapted Material,

     including for purposes of Section 3(b); and
  c. You must comply with the conditions in Section 3(a) if You Share
     all or a substantial portion of the contents of the database.

For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.


Section 5 -- Disclaimer of Warranties and Limitation of Liability.

  a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
     EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
     AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
     ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
     IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
     WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
     PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
     ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
     KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
     ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.

  b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
     TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
     NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
     INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
     COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
     USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
     ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
     DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
     IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.

  c. The disclaimer of warranties and limitation of liability provided
     above shall be interpreted in a manner that, to the extent
     possible, most closely approximates an absolute disclaimer and
     waiver of all liability.


Section 6 -- Term and Termination.

  a. This Public License applies for the term of the Copyright and
     Similar Rights licensed here. However, if You fail to comply with
     this Public License, then Your rights under this Public License
     terminate automatically.

  b. Where Your right to use the Licensed Material has terminated under
     Section 6(a), it reinstates:

       1. automatically as of the date the violation is cured, provided
          it is cured within 30 days of Your discovery of the
          violation; or

       2. upon express reinstatement by the Licensor.

     For the avoidance of doubt, this Section 6(b) does not affect any
     right the Licensor may have to seek remedies for Your violations
     of this Public License.

  c. For the avoidance of doubt, the Licensor may also offer the
     Licensed Material under separate terms or conditions or stop
     distributing the Licensed Material at any time; however, doing so
     will not terminate this Public License.

  d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
     License.


Section 7 -- Other Terms and Conditions.

  a. The Licensor shall not be bound by any additional or different
     terms or conditions communicated by You unless expressly agreed.

  b. Any arrangements, understandings, or agreements regarding the
     Licensed Material not stated herein are separate from and
     independent of the terms and conditions of this Public License.


Section 8 -- Interpretation.

  a. For the avoidance of doubt, this Public License does not, and
     shall not be interpreted to, reduce, limit, restrict, or impose
     conditions on any use of the Licensed Material that could lawfully
     be made without permission under this Public License.

  b. To the extent possible, if any provision of this Public License is
     deemed unenforceable, it shall be automatically reformed to the
     minimum extent necessary to make it enforceable. If the provision
     cannot be reformed, it shall be severed from this Public License
     without affecting the enforceability of the remaining terms and
     conditions.

  c. No term or condition of this Public License will be waived and no
     failure to comply consented to unless expressly agreed to by the
     Licensor.

  d. Nothing in this Public License constitutes or may be interpreted
     as a limitation upon, or waiver of, any privileges and immunities
     that apply to the Licensor or You, including from the legal
     processes of any jurisdiction or authority.
"""


import sys
import os
import shutil
import enum


class Action(enum.Enum):
    # Does not use 0 or 1 to avoid matching with False or True if the caller mistakenly passes a Boolean
    PRINT = 2
    REMOVE = 3


def broken_symbolic_links_tree(root_folder_path: str, action=Action.PRINT):
    """
    Recursively traverses the specified folder and takes action on symbolic links that are either broken or whose
    targets are not under the specified root folder path
    :param root_folder_path: Path of a folder
    :param action: Prints (default) or removes broken symbolic links
    :return: None
    """
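    for folder, sub_folders, files in os.walk(root_folder_path):
        # Symbolic links to existing folders are reported by os.walk() as folders, not files, so check both lists
        for file in sub_folders + files: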
            source_path = os.path.join(folder, file)
            if os.path.islink(source_path):
                target = os.readlink(source_path)
                if os.path.isabs(target):
                    target_path = target
                else:
                    target_path = os.path.join(os.path.dirname(source_path), target)
                if not target_path_under_root_path(target_path, root_folder_path) or not os.path.exists(target_path):
                    if action == Action.REMOVE:
                        os.remove(source_path)
                    else:
                        print(source_path, '->', target)


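def target_path_under_root_path(target_path, root_path):
    # Compare whole path components so that, e.g., "/a/bc" is not treated as being under "/a/b"
    target_path_normalized = os.path.normcase(os.path.abspath(target_path))
    root_path_normalized = os.path.normcase(os.path.abspath(root_path))
    try:
        return os.path.commonpath([target_path_normalized, root_path_normalized]) == root_path_normalized
    except ValueError:
        # Raised, for example, when the two paths are on different Windows drives
        return False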


def _broken_symbolic_links_main():
    """
    Recursively traverses the specified folder(s) and takes action on symbolic links that are either broken or whose
    targets are not under their specified root folder path
    :usage: broken_symbolic_links [-r] folder_path ...
                -r          Remove files which are broken symbolic links
                default     Print files which are broken symbolic links
    """
    start_of_path_args = 1
    action = Action.PRINT
    if len(sys.argv) >= 3:
        if sys.argv[1] == '-r':
            start_of_path_args = 2
            action = Action.REMOVE
    for root_folder_path in sys.argv[start_of_path_args:]:
        broken_symbolic_links_tree(root_folder_path, action)


if __name__ == '__main__':
    _broken_symbolic_links_main()
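
Deciding whether a link target lies under the root folder depends on comparing whole path components rather than raw characters. The following is a minimal sketch of that comparison; the helper name is_under_root and the example paths are hypothetical and are not part of the script above.

```python
import os


def is_under_root(target_path: str, root_path: str) -> bool:
    """Hypothetical helper: True if target_path is root_path itself or lies beneath it."""
    target = os.path.normcase(os.path.abspath(target_path))
    root = os.path.normcase(os.path.abspath(root_path))
    try:
        # commonpath() compares whole path components, not individual characters
        return os.path.commonpath([target, root]) == root
    except ValueError:
        # Raised, for example, when the two paths are on different Windows drives
        return False


if __name__ == '__main__':
    print(is_under_root('/src/review/lib/a.c', '/src/review'))
    print(is_under_root('/src/review-old/a.c', '/src/review'))
    # For comparison, a character-prefix test on the same sibling folder
    print('/src/review-old/a.c'.startswith('/src/review'))
```

A plain character-prefix test would treat “/src/review-old” as being under “/src/review”, which is why the sketch compares path components instead.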

Unboxing the Unboxed

In many of my cases, either the source code production on the review computer comprises one or more archive files (e.g., ZIP files) or the production is a folder tree that contains one or more archive files.

Some archive files, when unarchived, expand to a folder tree that contains more archive files, some of which in turn expand to a folder tree that contains even more archive files, and so on.

Therefore, in order to expose all the files in the production, there are two important steps to take:

  1. Identify all archive files
  2. Recursively unarchive each identified archive file
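
The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's standard zipfile module instead of 7-Zip, so it handles ZIP-format archives only, and the names unarchive_zips_recursively and build_demo_tree are hypothetical.

```python
import os
import tempfile
import zipfile


def unarchive_zips_recursively(root: str) -> int:
    """Hypothetical helper: expands every ZIP-format file under root, pass after pass; returns how many were expanded."""
    expanded = set()
    found_new_archive = True
    while found_new_archive:
        found_new_archive = False
        for folder, _, files in os.walk(root):
            for name in files:
                path = os.path.join(folder, name)
                # Step 1: identify an archive by its content, not by its file extension
                if path not in expanded and zipfile.is_zipfile(path):
                    # Step 2: unarchive next to the archive; a later pass scans the new folder
                    with zipfile.ZipFile(path) as archive:
                        archive.extractall(path + '-unarchived')
                    expanded.add(path)
                    found_new_archive = True
    return len(expanded)


def build_demo_tree(root: str) -> None:
    """Creates outer.zip, which contains inner.zip, which contains hello.txt."""
    inner_path = os.path.join(root, 'inner.zip')
    with zipfile.ZipFile(inner_path, 'w') as inner:
        inner.writestr('hello.txt', 'hello')
    with zipfile.ZipFile(os.path.join(root, 'outer.zip'), 'w') as outer:
        outer.write(inner_path, 'inner.zip')
    os.remove(inner_path)


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as demo_root:
        build_demo_tree(demo_root)
        print(unarchive_zips_recursively(demo_root))
```

The loop repeats until a pass finds no archive that has not already been expanded, because each expansion can expose archives that were not visible in the previous pass.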

The first step might seem easy: find all the files with a “.zip” file extension. However, many file types besides “.zip” are archives that contain one or more files and folders.

In my experience, the 7-Zip tool unarchives the largest number of archive file types: more than 100. For example, even though the following file extensions might be thought to identify opaque binary file types, they are actually archives which contain other files and folders:

  • .docx is a modern Microsoft Word file that is an archive of files which comprise the content of the document and metadata about the document. Below is an example of the folder tree created after unarchiving a file called Doc.docx, where the contents of the Microsoft Word document are in the file “Doc\word\document.xml”, and the contents of a Microsoft Excel document embedded within the Microsoft Word file are in the file “Doc\word\embeddings\Worksheet.xlsx”:
        Doc
        ├── [Content_Types].xml
        ├── _rels
        ├── docProps
        │   ├── app.xml
        │   └── core.xml
        └── word
            ├── _rels
            │   └── document.xml.rels
            ├── document.xml
            ├── embeddings
            │   └── Worksheet.xlsx
            ├── fontTable.xml
            ├── settings.xml
            ├── styles.xml
            ├── theme
            │   └── theme1.xml
            └── webSettings.xml
  • .pptx is a modern Microsoft PowerPoint file that is an archive of files which comprise the content of the slides and metadata about the slides.
  • .xlsx is a modern Microsoft Excel file that is an archive of files which comprise the content of the spreadsheet and metadata about the spreadsheet.
  • .jar is a Java ARchive file that is an archive of Java class files.
  • .apk is an Android Package Kit file that is an archive of files used to implement an Android app.
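
The file types listed above are ZIP-based containers, so they can be recognized by their content regardless of their extension. The following is a minimal sketch using Python's standard zipfile module; the helper name zip_members is hypothetical, and the demo file is only a stand-in ZIP container, not a valid Word document.

```python
import os
import tempfile
import zipfile
from typing import List


def zip_members(path: str) -> List[str]:
    """Hypothetical helper: returns the member names if the file is a ZIP container, else an empty list."""
    # is_zipfile() inspects the file's contents, so the file extension is irrelevant
    if not zipfile.is_zipfile(path):
        return []
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as folder:
        # Stand-in for a Word file: a ZIP container holding two of the entries shown in the tree above
        docx_path = os.path.join(folder, 'Doc.docx')
        with zipfile.ZipFile(docx_path, 'w') as archive:
            archive.writestr('[Content_Types].xml', '<Types/>')
            archive.writestr('word/document.xml', '<document/>')
        print(zip_members(docx_path))
```

This content-based check covers only ZIP-format containers; other archive formats would need their own detection or a tool such as 7-Zip.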

The most popular archive file type that 7-Zip does not unarchive unless 7-Zip plugins are installed is the Roshal ARchive file type (i.e., .rar).

The following is a script which uses 7-Zip (for almost all archives) and unar (for .rar archives) to recursively unarchive the files in a production. The documentation for the script is contained in the comments in the script itself.

#!/usr/bin/env python3

"""
Copyright 2020-2021 Stairstep Consulting LLC. All rights reserved.

Creative Commons Attribution-ShareAlike 4.0 International Public
License

By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution-ShareAlike 4.0 International Public License ("Public
License"). To the extent this Public License may be interpreted as a
contract, You are granted the Licensed Rights in consideration of Your
acceptance of these terms and conditions, and the Licensor grants You
such rights in consideration of benefits the Licensor receives from
making the Licensed Material available under these terms and
conditions.


Section 1 -- Definitions.

  a. Adapted Material means material subject to Copyright and Similar
     Rights that is derived from or based upon the Licensed Material
     and in which the Licensed Material is translated, altered,
     arranged, transformed, or otherwise modified in a manner requiring
     permission under the Copyright and Similar Rights held by the
     Licensor. For purposes of this Public License, where the Licensed
     Material is a musical work, performance, or sound recording,
     Adapted Material is always produced where the Licensed Material is
     synched in timed relation with a moving image.

  b. Adapter's License means the license You apply to Your Copyright
     and Similar Rights in Your contributions to Adapted Material in
     accordance with the terms and conditions of this Public License.

  c. BY-SA Compatible License means a license listed at
     creativecommons.org/compatiblelicenses, approved by Creative
     Commons as essentially the equivalent of this Public License.

  d. Copyright and Similar Rights means copyright and/or similar rights
     closely related to copyright including, without limitation,
     performance, broadcast, sound recording, and Sui Generis Database
     Rights, without regard to how the rights are labeled or
     categorized. For purposes of this Public License, the rights
     specified in Section 2(b)(1)-(2) are not Copyright and Similar
     Rights.

  e. Effective Technological Measures means those measures that, in the
     absence of proper authority, may not be circumvented under laws
     fulfilling obligations under Article 11 of the WIPO Copyright
     Treaty adopted on December 20, 1996, and/or similar international
     agreements.

  f. Exceptions and Limitations means fair use, fair dealing, and/or
     any other exception or limitation to Copyright and Similar Rights
     that applies to Your use of the Licensed Material.

  g. License Elements means the license attributes listed in the name
     of a Creative Commons Public License. The License Elements of this
     Public License are Attribution and ShareAlike.

  h. Licensed Material means the artistic or literary work, database,
     or other material to which the Licensor applied this Public
     License.

  i. Licensed Rights means the rights granted to You subject to the
     terms and conditions of this Public License, which are limited to
     all Copyright and Similar Rights that apply to Your use of the
     Licensed Material and that the Licensor has authority to license.

  j. Licensor means the individual(s) or entity(ies) granting rights
     under this Public License.

  k. Share means to provide material to the public by any means or
     process that requires permission under the Licensed Rights, such
     as reproduction, public display, public performance, distribution,
     dissemination, communication, or importation, and to make material
     available to the public including in ways that members of the
     public may access the material from a place and at a time
     individually chosen by them.

  l. Sui Generis Database Rights means rights other than copyright
     resulting from Directive 96/9/EC of the European Parliament and of
     the Council of 11 March 1996 on the legal protection of databases,
     as amended and/or succeeded, as well as other essentially
     equivalent rights anywhere in the world.

  m. You means the individual or entity exercising the Licensed Rights
     under this Public License. Your has a corresponding meaning.


Section 2 -- Scope.

  a. License grant.

       1. Subject to the terms and conditions of this Public License,
          the Licensor hereby grants You a worldwide, royalty-free,
          non-sublicensable, non-exclusive, irrevocable license to
          exercise the Licensed Rights in the Licensed Material to:

            a. reproduce and Share the Licensed Material, in whole or
               in part; and

            b. produce, reproduce, and Share Adapted Material.

       2. Exceptions and Limitations. For the avoidance of doubt, where
          Exceptions and Limitations apply to Your use, this Public
          License does not apply, and You do not need to comply with
          its terms and conditions.

       3. Term. The term of this Public License is specified in Section
          6(a).

       4. Media and formats; technical modifications allowed. The
          Licensor authorizes You to exercise the Licensed Rights in
          all media and formats whether now known or hereafter created,
          and to make technical modifications necessary to do so. The
          Licensor waives and/or agrees not to assert any right or
          authority to forbid You from making technical modifications
          necessary to exercise the Licensed Rights, including
          technical modifications necessary to circumvent Effective
          Technological Measures. For purposes of this Public License,
          simply making modifications authorized by this Section 2(a)
          (4) never produces Adapted Material.

       5. Downstream recipients.

            a. Offer from the Licensor -- Licensed Material. Every
               recipient of the Licensed Material automatically
               receives an offer from the Licensor to exercise the
               Licensed Rights under the terms and conditions of this
               Public License.

            b. Additional offer from the Licensor -- Adapted Material.
               Every recipient of Adapted Material from You
               automatically receives an offer from the Licensor to
               exercise the Licensed Rights in the Adapted Material
               under the conditions of the Adapter's License You apply.

            c. No downstream restrictions. You may not offer or impose
               any additional or different terms or conditions on, or
               apply any Effective Technological Measures to, the
               Licensed Material if doing so restricts exercise of the
               Licensed Rights by any recipient of the Licensed
               Material.

       6. No endorsement. Nothing in this Public License constitutes or
          may be construed as permission to assert or imply that You
          are, or that Your use of the Licensed Material is, connected
          with, or sponsored, endorsed, or granted official status by,
          the Licensor or others designated to receive attribution as
          provided in Section 3(a)(1)(A)(i).

  b. Other rights.

       1. Moral rights, such as the right of integrity, are not
          licensed under this Public License, nor are publicity,
          privacy, and/or other similar personality rights; however, to
          the extent possible, the Licensor waives and/or agrees not to
          assert any such rights held by the Licensor to the limited
          extent necessary to allow You to exercise the Licensed
          Rights, but not otherwise.

       2. Patent and trademark rights are not licensed under this
          Public License.

       3. To the extent possible, the Licensor waives any right to
          collect royalties from You for the exercise of the Licensed
          Rights, whether directly or through a collecting society
          under any voluntary or waivable statutory or compulsory
          licensing scheme. In all other cases the Licensor expressly
          reserves any right to collect such royalties.


Section 3 -- License Conditions.

Your exercise of the Licensed Rights is expressly made subject to the
following conditions.

  a. Attribution.

       1. If You Share the Licensed Material (including in modified
          form), You must:

            a. retain the following if it is supplied by the Licensor
               with the Licensed Material:

                 i. identification of the creator(s) of the Licensed
                    Material and any others designated to receive
                    attribution, in any reasonable manner requested by
                    the Licensor (including by pseudonym if
                    designated);

                ii. a copyright notice;

               iii. a notice that refers to this Public License;

                iv. a notice that refers to the disclaimer of
                    warranties;

                 v. a URI or hyperlink to the Licensed Material to the
                    extent reasonably practicable;

            b. indicate if You modified the Licensed Material and
               retain an indication of any previous modifications; and

            c. indicate the Licensed Material is licensed under this
               Public License, and include the text of, or the URI or
               hyperlink to, this Public License.

       2. You may satisfy the conditions in Section 3(a)(1) in any
          reasonable manner based on the medium, means, and context in
          which You Share the Licensed Material. For example, it may be
          reasonable to satisfy the conditions by providing a URI or
          hyperlink to a resource that includes the required
          information.

       3. If requested by the Licensor, You must remove any of the
          information required by Section 3(a)(1)(A) to the extent
          reasonably practicable.

  b. ShareAlike.

     In addition to the conditions in Section 3(a), if You Share
     Adapted Material You produce, the following conditions also apply.

       1. The Adapter's License You apply must be a Creative Commons
          license with the same License Elements, this version or
          later, or a BY-SA Compatible License.

       2. You must include the text of, or the URI or hyperlink to, the
          Adapter's License You apply. You may satisfy this condition
          in any reasonable manner based on the medium, means, and
          context in which You Share Adapted Material.

       3. You may not offer or impose any additional or different terms
          or conditions on, or apply any Effective Technological
          Measures to, Adapted Material that restrict exercise of the
          rights granted under the Adapter's License You apply.


Section 4 -- Sui Generis Database Rights.

Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:

  a. for the avoidance of doubt, Section 2(a)(1) grants You the right
     to extract, reuse, reproduce, and Share all or a substantial
     portion of the contents of the database;

  b. if You include all or a substantial portion of the database
     contents in a database in which You have Sui Generis Database
     Rights, then the database in which You have Sui Generis Database
     Rights (but not its individual contents) is Adapted Material,

     including for purposes of Section 3(b); and
  c. You must comply with the conditions in Section 3(a) if You Share
     all or a substantial portion of the contents of the database.

For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.


Section 5 -- Disclaimer of Warranties and Limitation of Liability.

  a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
     EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
     AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
     ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
     IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
     WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
     PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
     ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
     KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
     ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.

  b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
     TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
     NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
     INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
     COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
     USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
     ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
     DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
     IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.

  c. The disclaimer of warranties and limitation of liability provided
     above shall be interpreted in a manner that, to the extent
     possible, most closely approximates an absolute disclaimer and
     waiver of all liability.


Section 6 -- Term and Termination.

  a. This Public License applies for the term of the Copyright and
     Similar Rights licensed here. However, if You fail to comply with
     this Public License, then Your rights under this Public License
     terminate automatically.

  b. Where Your right to use the Licensed Material has terminated under
     Section 6(a), it reinstates:

       1. automatically as of the date the violation is cured, provided
          it is cured within 30 days of Your discovery of the
          violation; or

       2. upon express reinstatement by the Licensor.

     For the avoidance of doubt, this Section 6(b) does not affect any
     right the Licensor may have to seek remedies for Your violations
     of this Public License.

  c. For the avoidance of doubt, the Licensor may also offer the
     Licensed Material under separate terms or conditions or stop
     distributing the Licensed Material at any time; however, doing so
     will not terminate this Public License.

  d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
     License.


Section 7 -- Other Terms and Conditions.

  a. The Licensor shall not be bound by any additional or different
     terms or conditions communicated by You unless expressly agreed.

  b. Any arrangements, understandings, or agreements regarding the
     Licensed Material not stated herein are separate from and
     independent of the terms and conditions of this Public License.


Section 8 -- Interpretation.

  a. For the avoidance of doubt, this Public License does not, and
     shall not be interpreted to, reduce, limit, restrict, or impose
     conditions on any use of the Licensed Material that could lawfully
     be made without permission under this Public License.

  b. To the extent possible, if any provision of this Public License is
     deemed unenforceable, it shall be automatically reformed to the
     minimum extent necessary to make it enforceable. If the provision
     cannot be reformed, it shall be severed from this Public License
     without affecting the enforceability of the remaining terms and
     conditions.

  c. No term or condition of this Public License will be waived and no
     failure to comply consented to unless expressly agreed to by the
     Licensor.

  d. Nothing in this Public License constitutes or may be interpreted
     as a limitation upon, or waiver of, any privileges and immunities
     that apply to the Licensor or You, including from the legal
     processes of any jurisdiction or authority.
"""


import sys
import os
import subprocess


# file types to not attempt to unarchive because their unarchiving fails, at least for your production
fail_extensions = []

# file names to not attempt to unarchive because their unarchiving fails, at least for your production
fail_files = []

# file types to not attempt to unarchive because you don't need to see their unarchived contents
unnecessary_extensions = []

# file names to not attempt to unarchive because you don't need to see their unarchived contents
unnecessary_files = []

# If True, extensions to be unarchived will be parsed from output of "7z i"
do_parse_include_extensions = True

# file types to unarchive even if "7z i" output is not parsed or even if these file types are not output by "7z i"
force_extensions = []

# file names to unarchive even if "7z i" output is not parsed or even if these files' types are not output by "7z i"
force_files = []


fail_extensions = [x.lower() for x in fail_extensions]
fail_files = [x.lower() for x in fail_files]
unnecessary_extensions = [x.lower() for x in unnecessary_extensions]
unnecessary_files = [x.lower() for x in unnecessary_files]
force_extensions = [x.lower() for x in force_extensions]
force_files = [x.lower() for x in force_files]

exclude_extensions = set(fail_extensions + unnecessary_extensions)
exclude_files = set(fail_files + unnecessary_files)

include_extensions = set(force_extensions)
include_files = set(force_files)

# temporary suffix added to end of each archive file for which unarchiving has been attempted to avoid a second attempt
attempted_unarchiving_file_suffix = '-attempted_unarchiving'

# suffix added to the end of the folder created to hold the contents of an unarchived archive file
unarchived_folder_suffix = '-unarchived'


def parse_include_extensions() -> None:
    """
    Parses the output of the "7z i" command to determine what extensions 7z says it can process.
    At the end of this file is an example of the "7z i" output.
    Adds to the set in the include_extensions global variable.
    :return: None
    """
    if do_parse_include_extensions:
        start_of_formats_section = False
        for line in os.popen('7z i').readlines():
            line = line.rstrip()
            # Parses the "7z i" section between the line that begins with "Formats:" and the next blank line
            if start_of_formats_section:
                if not line:
                    # End of "Formats:" section
                    break
                # Ignore characters before start of the blank-separated extensions and create list of tokens that follow
                tokens = line[26:].split(' ')
                # Not all tokens at the end of the list might be extensions, so prune non-extensions from right-to-left
                add_remaining_tokens_as_extensions = False
                for index, token in enumerate(reversed(tokens)):
                    if len(token) > 2 or index == len(tokens) - 1:
                        if not token.startswith('offset=') and not token.startswith('\\x') and not token == '(~.swf)':
                            add_remaining_tokens_as_extensions = True
                    # Some extensions are specified as just, for example, "abc" and some are specified as "(.abc)";
                    #   normalize each to ".abc"
                    if add_remaining_tokens_as_extensions:
                        if token.startswith('(.') and token.endswith(')'):
                            token = token[1:-1]
                        else:
                            token = '.' + token
                        include_extensions.add(token)
            if line.startswith('Formats:'):
                start_of_formats_section = True
    if include_extensions or include_files:
        formatted_list = ' '.join([('*' + x) for x in sorted(include_extensions)] + sorted(include_files))
        print(f'\nWill attempt to unarchive these files: {formatted_list}')
    else:
        print(f'Will attempt to unarchive all files')
    if exclude_extensions or exclude_files:
        formatted_list = ' '.join([('*' + x) for x in sorted(exclude_extensions)] + sorted(exclude_files))
        print(f'\n...except will not attempt to unarchive these files: {formatted_list}')


def initial_linking(source_path: str, target_root: str) -> None:
    """
    Clones the files and folders in the source_path to the target_root folder, using hard links if available
    :param source_path: Path of file that might be an archive or of folder that might contain one or more archives
    :param target_root: Path of folder that will contain the recursively unarchived folder tree
    :return: None
    """
    print(f'\nLinking...')
    if os.path.isdir(source_path):
        source_path = source_path if not source_path.endswith(os.path.sep) else source_path[:-1]
        for source_folder, source_sub_folders, source_files in os.walk(source_path):
            for source_file in source_files:
                process_file(source_path, source_folder[len(source_path)+1:], source_file, target_root)
            for source_sub_folder in source_sub_folders:
                process_folder(source_path, source_folder[len(source_path)+1:], source_sub_folder, target_root)
    else:
        process_file(os.path.dirname(source_path), '', os.path.basename(source_path), target_root)


def process_file(source_root: str, source_rel: str, source_file_name: str, target_root: str) -> None:
    target_path = os.path.join(target_root, source_rel)
    os.makedirs(target_path, exist_ok=True)
    os.link(os.path.join(source_root, source_rel, source_file_name), os.path.join(target_path, source_file_name))


def process_folder(__source_root: str, source_rel: str, source_folder_name: str, target_root: str) -> None:
    target_path = os.path.join(target_root, source_rel, source_folder_name)
    os.makedirs(target_path, exist_ok=True)


def unarchive_recursively_passes(target_root: str, passwords: [str]) -> None:
    """
    Pass through the entire production unarchiving archive files. If any archive files were unarchived in a pass, do
    another pass since the prior unarchiving pass could have expanded to a folder tree that contains more archive files.
    Stop when a pass finds no archive files.
    :param target_root: Target folder
    :param passwords: List of passwords to try for each archive.
    :return: None
    """
    perform_another_pass = True
    pass_number = 1
    while perform_another_pass:
        print(f'\nUnarchiving... (pass {pass_number})')
        perform_another_pass = unarchive_pass(target_root, passwords)
        pass_number += 1
    print(f'None')


def unarchive_pass(target_root: str, passwords: [str]) -> bool:
    """
    Looks for archive files using the following order of precedence:
        ignore archive files for which unarchiving has already been attempted.
        ignore files that match either the file extension or file name exclusion lists.
        if either a file extension or file name inclusion list is specified, attempt to unarchive each file that matches
        either the file extension or file name inclusion list.
        if neither a file extension nor file name inclusion list is specified, attempt to unarchive all files (yes, even
        files like "*.txt" files, because some files' extensions hide the fact that they are actually archives).
    :param target_root: path where the results go
    :param passwords: list of passwords to try for each unarchiving attempt
    :return: False, if no unarchiving was attempted on this pass; True, if at least one unarchiving was attempted on this
    pass, implying that an unarchiving in this pass might have exposed additional archive files to be processed in the
    next pass.
    """
    perform_another_pass = False
    for folder, __, files in os.walk(target_root):
        for file in files:
            if not file.endswith(attempted_unarchiving_file_suffix):
                _base, extension = os.path.splitext(file)
                extension = extension.lower()
                # The exclusion and inclusion lists were lowercased above, so compare against the lowercased file name
                file_lower = file.lower()
                if extension not in exclude_extensions and file_lower not in exclude_files:
                    if (not include_extensions and not include_files) or extension in include_extensions or file_lower in include_files:
                        archive_path = os.path.join(folder, file)
                        print(f"{archive_path[len(target_root) + len('/'):]}")
                        unarchive_file(archive_path, extension, passwords)
                        perform_another_pass = True
    return perform_another_pass


def unarchive_file(archive_path: str, extension: str, passwords: [str]) -> None:
    """
    If a file has already been identified as being an archive (by the caller of this function), then call the external
    command to do the unarchiving.
    :param archive_path: Path of the archive file to be unarchived
    :param extension: The extension of the archive file
    :param passwords: The list of zero or more passwords to apply to this, and every other, archive
    :return: None
    """
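    unarchived_folder_path = archive_path + unarchived_folder_suffix
    # The commands below always take a password argument, so use a placeholder if none was supplied; this is done
    # on a local list so that the caller's list is not modified
    passwords_to_try = passwords if passwords else ['password']
    result = None
    for password in passwords_to_try:
        # '-f' (unar) and '-aoa' (7z) overwrite files left behind by an earlier attempt that used a wrong password
        if extension.lower() == '.rar':
            command = ['unar', '-q', '-f', '-p', password, '-o', unarchived_folder_path, archive_path]
        else:
            command = ['7z', 'x', archive_path, '-aoa', '-bso0', '-bsp0', f'-p{password}', f'-o{unarchived_folder_path}']
        result = subprocess.run(command, capture_output=True)
        if result.returncode == 0:
            os.rename(archive_path, archive_path + attempted_unarchiving_file_suffix)
            return
        # A non-zero return code might be due to a wrong password, so try the next password, if any
    print(result.stderr.decode('utf-8', errors='replace'), end='')
    print(f'Could not unarchive file: {archive_path}')
    sys.exit(1)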


def remove_attempted_unarchiving_file_suffixes(target_root: str) -> None:
    """
    To avoid unarchiving an archive more than once, an already unarchived archive is given a unique suffix to be
    identified as already having been processed. After all recursive unarchiving passes have completed, this function
    removes those suffixes.
    :param target_root: Target folder
    :return: None
    """
    print(f'\nRemoving Attempted Unarchiving File Suffixes...')
    paths = []
    for folder, __, files in os.walk(target_root):
        for file in files:
            if file.endswith(attempted_unarchiving_file_suffix):
                paths.append(os.path.join(folder, file))
    for path in sorted(paths, reverse=True):
        os.rename(path, path[:-len(attempted_unarchiving_file_suffix)])


def identify_longest_paths(target_root: str) -> None:
    """
    Prints the longest paths under the target_root. Necessary because, by default, Windows does not allow a path longer
    than 260 characters, including the drive designation (e.g., "C:\\") and a null byte at the end of the path. For
    example, if "C:\\Review\\" is prepended to every path in the target_root, then the maximum length of a path relative
    to the target_root is 260 - length("C:\\Review\\") - 1 (for the null byte) = 249
    :param target_root: The root of the target folder
    :return: None
    """
    print('\nIdentifying Longest Paths without Prefix (>= 260 characters with prefix is too long)...')
    longest_path_len = 0
    longest_paths = []
    for folder, _, files in os.walk(target_root):
        for file in files:
            path = os.path.join(folder, file)[len(target_root) + len('/'):]
            path_len = len(path)
            if path_len > longest_path_len:
                longest_path_len = path_len
                longest_paths = [path]
            elif path_len == longest_path_len:
                longest_paths.append(path)
    for longest_path in sorted(longest_paths):
        print(f'{longest_path_len}: {longest_path}')


def unarchive_recursively(source_path: str, target_root: str, passwords: [str]) -> None:
    """
    Recursively unarchives any archive files found in source_path
    :param source_path: Path of the file or folder that might contain one or more archive files
    :param target_root: Path of the folder to contain the recursively unarchived folder tree
    :param passwords: List of zero or more passwords to apply to each archive
    :return: None
    """
    if source_path.endswith('/'):
        source_path = source_path[:-1]

    if target_root.endswith('/'):
        target_root = target_root[:-1]

    if not os.path.exists(source_path):
        print(f'Source Path Does Not Exist: {source_path}')
        exit(1)

    if os.path.exists(target_root):
        print(f'Target Path Exists: {target_root}')
        exit(1)

    parse_include_extensions()
    initial_linking(source_path, target_root)
    unarchive_recursively_passes(target_root, passwords)
    remove_attempted_unarchiving_file_suffixes(target_root)
    identify_longest_paths(target_root)


def _unarchive_recursively_main():
    """
    :usage: unarchive_recursively source_file_or_folder target_folder [password...]
                source_file_or_folder   A file or a folder.
                                        Trivially, if this is a file which is not an archive, the target folder will
                                        contain the source file, and if this is a folder that does not contain any
                                        archive files, the target folder will contain the source folder.
                target_folder           The recursively unarchived source is put under the target folder.
                                        The target folder is populated with hard-links of all files from the source.
                [password...]           One or more passwords can be specified. For each archive, each specified
                                        password will be tried until one succeeds. This allows you to unarchive multiple
                                        nested archive files that use different passwords with only one invocation of
                                        this utility. If no password is specified, the default is "password".

    Dependencies
        If even one archive file in the source is a .rar file, then the unar utility must be in your search path
        (for macOS use: brew install unar).

        If even one archive file in the source is not a .rar file, then the 7z utility must be in your search path
        (for macOS use: brew install p7zip).

    Preferences
        It may be the case that files which this utility considers to be archive files, you don't feel you need to
        unarchive, for whatever reason. You can prevent unarchiving files with specific extensions and/or files with
        specific names by listing them in the Python variables "unnecessary_extensions" and "unnecessary_files" at the
        top of this script.

        By default, this script determines which files are archives based on the extensions output by calling the "7z i"
        command. This command lists over 100 extensions. If you want to only unarchive files with a smaller number of
        specific extensions and/or files with specific names, then set the Python variable "do_parse_include_extensions"
        at the top of this file to False and list only the extensions you want to unarchive in the Python variable
        "force_extensions" and/or the file names you want to unarchive in the Python variable "force_files", both at the
        top of this script. Even if you allow the default so that this script determines which files are archives based
        on the extensions output by calling the "7z i" command, you can still list file extensions and file names in the
        "force_extensions" and "force_files" variables to unarchive files that would not otherwise be unarchived by default.

    Troubleshooting
        If this utility hangs, it might be waiting for the password of an archive file which you did not specify because
        you did not know the archive file requires a password.

        It may be the case that files which this utility considers to be archive files are not really archive files,
        or are corrupted archive files. You can prevent unarchiving files with specific extensions and/or files with
        specific names by listing them in the Python variables "fail_extensions" and "fail_files" at the top of this
        script.

    Paths that might be too long for Microsoft Windows are identified
        By default, the Microsoft Windows operating system does not allow the path of a file to be longer than 260
        characters. This limit includes the drive designation (e.g., "C:\\") and includes the null byte at the end of the
        path. For example, if all the files in the production are under the folder "C:\\Review\\", then the maximum path
        following "C:\\Review\\" will be 249 characters; that is, 260 - length(“C:\\Review\\”) - 1.

        While a folder tree itself can reach the 260 character path limit, archives embedded in other archives can
        quickly reach or exceed the 260 character path limit as they are recursively unarchived. Therefore, this script
        identifies the longest paths.

    Already unarchived archive files are avoided
        It is possible that the same folder in the production might contain both an archive file and a folder containing
        the unarchived contents of that archive file. For example, I have seen folders in productions which contain
        something like the following:
            foo.zip
            foo\\

        In those situations, the folder named “foo\\” contained the unarchived contents of the archive file named
        “foo.zip”. However, the folder named “foo\\” is not guaranteed to contain the unarchived contents of the archive
        file named “foo.zip”.

        For this reason, when unarchiving a file named “foo.zip”, this script will unarchive its contents into a folder
        called “foo.zip-unarchived\\”, with the expectation that there will not already be a folder named
        “foo.zip-unarchived\\”. Therefore, in the above example, after this script is run, there will be the following:
            foo.zip
            foo\\
            foo.zip-unarchived\\

        If the contents of the folder named “foo\\” are indeed the unarchived contents of the archive file named
        “foo.zip”, then the contents of the folder named “foo.zip-unarchived\\” will be a duplicate of the contents of
        the folder named “foo\\”.
    """
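    if len(sys.argv) < 3:
        print('usage: unarchive_recursively source_file_or_folder target_folder [password...]')
        sys.exit(1)
    passwords = sys.argv[3:]  # Empty list if no password was specified
    unarchive_recursively(sys.argv[1], sys.argv[2], passwords)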


if __name__ == '__main__':
    _unarchive_recursively_main()


"""
The following is an example of the output of "7z i" that is parsed by the parse_include_extensions() function...

% 7z i

7-Zip [64] 17.03 : Copyright (c) 1999-2020 Igor Pavlov : 2017-08-28
p7zip Version 17.03 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,10 CPUs x64)


Libs:
 0  /usr/local/Cellar/p7zip/17.03/lib/p7zip/7z.dll

Formats:
 0 C   F         7z       7z            7 z BC AF ' 1C
 0               APM      apm           E R
 0               Ar       ar a deb lib  ! < a r c h > 0A
 0               Arj      arj           ` EA
 0 CK            bzip2    bz2 bzip2 tbz2 (.tar) tbz (.tar) B Z h
 0     F         Cab      cab           M S C F 00 00 00 00
 0               Chm      chm chi chq chw I T S F 03 00 00 00 ` 00 00 00
 0     F         Hxs      hxs hxi hxr hxq hxw lit I T O L I T L S 01 00 00 00 ( 00 00 00
 0               Compound msi msp doc xls ppt D0 CF 11 E0 A1 B1 1A E1
 0      M        Cpio     cpio          0 7 0 7 0  ||  C7 q  ||  q C7
 0               CramFS   cramfs        offset=16 C o m p r e s s e d 20 R O M F S
 0       G  B    Dmg      dmg           k o l y 00 00 00 04 00 00 02 00
 0           E   ELF      elf            E L F
 0               Ext      ext ext2 ext3 ext4 img offset=1080 S EF
 0               FAT      fat img       offset=510 U AA
 0               FLV      flv           F L V 01
 0 CK            gzip     gz gzip tgz (.tar) tpz (.tar) apk (.tar) 1F 8B 08
 0               GPT      gpt mbr       offset=512 E F I 20 P A R T 00 00 01 00
 0      M        HFS      hfs hfsx      offset=1024 H + 00 04  ||  H X 00 05
 0        O      IHex     ihex          
 0               Iso      iso img       offset=32769 C D 0 0 1
 0               Lzh      lzh lha       offset=2 - l h
 0  K     O      lzma     lzma          
 0  K            lzma86   lzma86        
 0      M    E   MachO    macho         CE FA ED FE  ||  CF FA ED FE  ||  FE ED FA CE  ||  FE ED FA CF
 0         P     MBR      mbr           
 0               MsLZ     mslz          S Z D D 88 F0 ' 3 A
 0      M        Mub      mub           CA FE BA BE 00 00 00  ||  B9 FA F1 0E
 0     F G       Nsis     nsis          offset=4 EF BE AD DE N u l l s o f t I n s t
 0               NTFS     ntfs img      offset=3 N T F S 20 20 20 20 00
 0           E   PE       exe dll sys   M Z
 0           E   TE       te            V Z
 0               Ppmd     pmd           8F AF AC 84
 0               QCOW     qcow qcow2 qcow2c Q F I FB 00 00 00
 0     F         Rar      rar r00       R a r ! 1A 07 00
 0     F         Rar5     rar r00       R a r ! 1A 07 01 00
 0               Rpm      rpm           ED AB EE DB
 0               Split    001           
 0      M        SquashFS squashfs      h s q s  ||  s q s h  ||  s h s q  ||  q s h s
 0 C    M        SWFc     swf (~.swf)   C W S  ||  Z W S
 0  K            SWF      swf           F W S
 0 C      O   LH tar      tar ova       offset=257 u s t a r
 0        O      Udf      udf iso img   offset=32768 01 C D 0 0 1
 0     FM        UEFIc    scap          BD 86 f ; v 0D 0 @ B7 0E B5 Q 9E / C5 A0  ||  8B A6 < J # w FB H 80 = W 8C C1 FE C4 M  ||  B9 82 91 S B5 AB 91 C B6 9A E3 A9 C F7 / CC
 0     FM        UEFIf    uefif         offset=16 D9 T 93 z h 04 J D 81 CE 0B F6 17 D8 90 DF  ||  x E5 8C 8C = 8A 1C O 99 5 89 a 85 C3 - D3
 0               VDI      vdi           offset=64  10 DA BE
 0       G       VHD      vhd           c o n e c t i x 00 00
 0               VMDK     vmdk          K D M V
 0 C SN       LH wim      wim swm esd ppkg M S W I M 00 00 00
 0               Xar      xar pkg xip   x a r ! 00 1C
 0 CK            xz       xz txz (.tar) FD 7 z X Z 00
 0               Z        z taz (.tar)  1F 9D
 0 C   FMG       zip      zip z01 zipx jar xpi odt ods docx xlsx epub ipa apk appx P K 03 04  ||  P K 05 06  ||  P K 06 06  ||  P K 07 08 P K  ||  P K 0 0 P K
 0 CK            zstd     zst tzstd (.tar) 0 x F D 2 F B 5 2 2 . . 2 8 00
 0 CK            lz4      lz4 tlz4 (.tar) 0 x 1 8 4 D 2 2 0 4 00
 0 CK            lz5      lz5 tlz5 (.tar) 0 x 1 8 4 D 2 2 0 5 00
 0 CK            lizard   liz tliz (.tar) 0 x 1 8 4 D 2 2 0 6 00

Codecs:
 0  ED    40202 BZip2
 0 4ED  303011B BCJ2
 0  ED  3030103 BCJ
 0  ED  3030205 PPC
 0  ED  3030401 IA64
 0  ED  3030501 ARM
 0  ED  3030701 ARMT
 0  ED  3030805 SPARC
 0  ED    20302 Swap2
 0  ED    20304 Swap4
 0  ED        0 Copy
 0  ED    40109 Deflate64
 0  ED    40108 Deflate
 0  ED        3 Delta
 0  ED       21 LZMA2
 0  ED       21 FLZMA2
 0  ED    30101 LZMA
 0  ED    30401 PPMD
 0  ED  6F10701 7zAES
 0  ED  6F00181 AES256CBC
 0  ED  4F71101 ZSTD
 0  ED  4F71104 LZ4
 0  ED  4F71102 BROTLI
 0  ED  4F71105 LZ5
 0  ED  4F71106 LIZARD

Hashers:
 0   32      202 BLAKE2sp
 0    4        1 CRC32
 0   20      201 SHA1
 0   32        A SHA256
 0    8        4 CRC64
 0   16      205 MD2
 0   16      206 MD4
 0   16      207 MD5
 0   48      208 SHA384
 0   64      209 SHA512
 0    4      203 XXH32
 0    8      204 XXH64
"""

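The path-length arithmetic described in the script's usage notes can be checked with a few lines of Python. This is only a sketch and is not part of the script above; the names WINDOWS_MAX_PATH and max_relative_path_length are assumptions made for this illustration.

```python
# Default Windows limit, which counts the drive designation and the terminating null byte
WINDOWS_MAX_PATH = 260


def max_relative_path_length(prefix: str) -> int:
    """Return the longest path that still fits under the limit once prefix is prepended."""
    return WINDOWS_MAX_PATH - len(prefix) - 1  # 1 for the terminating null byte


print(max_relative_path_length('C:\\Review\\'))  # prints 249
```

Comparing this number with the longest path length reported by the script suggests whether the unarchived tree is likely to fit under the chosen review folder.
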
Thumbs Down to the Desktop Services Store

In some of my cases, I am asked to count files in a production that have particular characteristics. Sometimes, the attorneys only want an approximation of the file count, but other times, the exact file counts are important, such that being off-by-one could call my analysis process into question during the deposition.

Counting files that have a particular extension (e.g., .pdf, .cpp, .swift) leaves little ambiguity. However, counting files that do not have such a distinct characteristic can lead to ambiguity if the production contains hidden files or hidden folders of files.
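
As a rough illustration of the unambiguous case, counting by extension might look like the sketch below. The function name count_files_by_extension is hypothetical, and the sketch assumes that extensions should be compared without regard to case.

```python
import os
from collections import Counter


def count_files_by_extension(root_path: str) -> Counter:
    """Count the files under root_path, keyed by lower-cased extension ('' if none)."""
    counts = Counter()
    for _folder, _subfolders, files in os.walk(root_path):
        for file in files:
            _base, extension = os.path.splitext(file)
            counts[extension.lower()] += 1
    return counts
```

With this sketch, count_files_by_extension(root)['.pdf'] would give the number of PDF files. Files without an extension, including dot-files such as “.DS_Store”, are all counted under the empty string, which is one place where hidden files can inflate a count.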

Disclaimer: Depending upon your particular analysis, hidden files might be important evidence. However, my experience is that hidden files can erroneously increase your count of files or, at the very least, are distracting false positives that appear in your file search results.

On Windows, the File Explorer and other Windows file processing tools usually ignore hidden files created by Windows; for example, files with the name “Thumbs.db” and folders with the name “$RECYCLE.BIN”.

On macOS, the Finder and other macOS file processing tools usually ignore hidden files created by macOS; for example, files with the name “.DS_Store” and folders with the name “.Trashes”.

However, on a Windows review computer, I have analyzed productions that either originated from a macOS computer or were copied by a macOS computer sometime between their origin computer and the review computer. In these cases, files with the name “.DS_Store” and folders with the name “.Trashes” were not ignored by the Windows File Explorer and other Windows file processing tools. Likewise, on a macOS review computer, files with the name “Thumbs.db” and folders with the name “$RECYCLE.BIN” were not ignored by the macOS Finder and other macOS file processing tools.
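
To see how much these files affect a count, one could count the same tree twice, once with and once without them. The following is a minimal sketch that assumes only the four names mentioned above and matches them by exact case; the names count_files, HIDDEN_FILE_NAMES, and HIDDEN_FOLDER_NAMES are hypothetical and are not taken from the script that follows.

```python
import os

# Only the four names mentioned in the text; a real production may contain others
HIDDEN_FILE_NAMES = {'Thumbs.db', '.DS_Store'}
HIDDEN_FOLDER_NAMES = {'$RECYCLE.BIN', '.Trashes'}


def count_files(root_path: str, skip_hidden: bool) -> int:
    """Count files under root_path, optionally skipping the hidden names above."""
    count = 0
    for _folder, subfolders, files in os.walk(root_path):
        if skip_hidden:
            # Pruning subfolders in place stops os.walk from descending into them
            subfolders[:] = [name for name in subfolders if name not in HIDDEN_FOLDER_NAMES]
            files = [name for name in files if name not in HIDDEN_FILE_NAMES]
        count += len(files)
    return count
```

If count_files(root, skip_hidden=False) and count_files(root, skip_hidden=True) differ, the production contains at least some of these hidden files, and the difference is the amount by which they would change the count.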

I created the following script to identify or remove files that are “detritus” (definition: waste material or trash, especially left after a particular event). Again, note that files considered detritus for one litigation may be important evidence for another litigation. The documentation for this script is in the comments in the script itself.

#!/usr/bin/env python3

"""
Copyright 2020-2021 Stairstep Consulting LLC. All rights reserved.

Creative Commons Attribution-ShareAlike 4.0 International Public
License

By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution-ShareAlike 4.0 International Public License ("Public
License"). To the extent this Public License may be interpreted as a
contract, You are granted the Licensed Rights in consideration of Your
acceptance of these terms and conditions, and the Licensor grants You
such rights in consideration of benefits the Licensor receives from
making the Licensed Material available under these terms and
conditions.


Section 1 -- Definitions.

  a. Adapted Material means material subject to Copyright and Similar
     Rights that is derived from or based upon the Licensed Material
     and in which the Licensed Material is translated, altered,
     arranged, transformed, or otherwise modified in a manner requiring
     permission under the Copyright and Similar Rights held by the
     Licensor. For purposes of this Public License, where the Licensed
     Material is a musical work, performance, or sound recording,
     Adapted Material is always produced where the Licensed Material is
     synched in timed relation with a moving image.

  b. Adapter's License means the license You apply to Your Copyright
     and Similar Rights in Your contributions to Adapted Material in
     accordance with the terms and conditions of this Public License.

  c. BY-SA Compatible License means a license listed at
     creativecommons.org/compatiblelicenses, approved by Creative
     Commons as essentially the equivalent of this Public License.

  d. Copyright and Similar Rights means copyright and/or similar rights
     closely related to copyright including, without limitation,
     performance, broadcast, sound recording, and Sui Generis Database
     Rights, without regard to how the rights are labeled or
     categorized. For purposes of this Public License, the rights
     specified in Section 2(b)(1)-(2) are not Copyright and Similar
     Rights.

  e. Effective Technological Measures means those measures that, in the
     absence of proper authority, may not be circumvented under laws
     fulfilling obligations under Article 11 of the WIPO Copyright
     Treaty adopted on December 20, 1996, and/or similar international
     agreements.

  f. Exceptions and Limitations means fair use, fair dealing, and/or
     any other exception or limitation to Copyright and Similar Rights
     that applies to Your use of the Licensed Material.

  g. License Elements means the license attributes listed in the name
     of a Creative Commons Public License. The License Elements of this
     Public License are Attribution and ShareAlike.

  h. Licensed Material means the artistic or literary work, database,
     or other material to which the Licensor applied this Public
     License.

  i. Licensed Rights means the rights granted to You subject to the
     terms and conditions of this Public License, which are limited to
     all Copyright and Similar Rights that apply to Your use of the
     Licensed Material and that the Licensor has authority to license.

  j. Licensor means the individual(s) or entity(ies) granting rights
     under this Public License.

  k. Share means to provide material to the public by any means or
     process that requires permission under the Licensed Rights, such
     as reproduction, public display, public performance, distribution,
     dissemination, communication, or importation, and to make material
     available to the public including in ways that members of the
     public may access the material from a place and at a time
     individually chosen by them.

  l. Sui Generis Database Rights means rights other than copyright
     resulting from Directive 96/9/EC of the European Parliament and of
     the Council of 11 March 1996 on the legal protection of databases,
     as amended and/or succeeded, as well as other essentially
     equivalent rights anywhere in the world.

  m. You means the individual or entity exercising the Licensed Rights
     under this Public License. Your has a corresponding meaning.


Section 2 -- Scope.

  a. License grant.

       1. Subject to the terms and conditions of this Public License,
          the Licensor hereby grants You a worldwide, royalty-free,
          non-sublicensable, non-exclusive, irrevocable license to
          exercise the Licensed Rights in the Licensed Material to:

            a. reproduce and Share the Licensed Material, in whole or
               in part; and

            b. produce, reproduce, and Share Adapted Material.

       2. Exceptions and Limitations. For the avoidance of doubt, where
          Exceptions and Limitations apply to Your use, this Public
          License does not apply, and You do not need to comply with
          its terms and conditions.

       3. Term. The term of this Public License is specified in Section
          6(a).

       4. Media and formats; technical modifications allowed. The
          Licensor authorizes You to exercise the Licensed Rights in
          all media and formats whether now known or hereafter created,
          and to make technical modifications necessary to do so. The
          Licensor waives and/or agrees not to assert any right or
          authority to forbid You from making technical modifications
          necessary to exercise the Licensed Rights, including
          technical modifications necessary to circumvent Effective
          Technological Measures. For purposes of this Public License,
          simply making modifications authorized by this Section 2(a)
          (4) never produces Adapted Material.

       5. Downstream recipients.

            a. Offer from the Licensor -- Licensed Material. Every
               recipient of the Licensed Material automatically
               receives an offer from the Licensor to exercise the
               Licensed Rights under the terms and conditions of this
               Public License.

            b. Additional offer from the Licensor -- Adapted Material.
               Every recipient of Adapted Material from You
               automatically receives an offer from the Licensor to
               exercise the Licensed Rights in the Adapted Material
               under the conditions of the Adapter's License You apply.

            c. No downstream restrictions. You may not offer or impose
               any additional or different terms or conditions on, or
               apply any Effective Technological Measures to, the
               Licensed Material if doing so restricts exercise of the
               Licensed Rights by any recipient of the Licensed
               Material.

       6. No endorsement. Nothing in this Public License constitutes or
          may be construed as permission to assert or imply that You
          are, or that Your use of the Licensed Material is, connected
          with, or sponsored, endorsed, or granted official status by,
          the Licensor or others designated to receive attribution as
          provided in Section 3(a)(1)(A)(i).

  b. Other rights.

       1. Moral rights, such as the right of integrity, are not
          licensed under this Public License, nor are publicity,
          privacy, and/or other similar personality rights; however, to
          the extent possible, the Licensor waives and/or agrees not to
          assert any such rights held by the Licensor to the limited
          extent necessary to allow You to exercise the Licensed
          Rights, but not otherwise.

       2. Patent and trademark rights are not licensed under this
          Public License.

       3. To the extent possible, the Licensor waives any right to
          collect royalties from You for the exercise of the Licensed
          Rights, whether directly or through a collecting society
          under any voluntary or waivable statutory or compulsory
          licensing scheme. In all other cases the Licensor expressly
          reserves any right to collect such royalties.


Section 3 -- License Conditions.

Your exercise of the Licensed Rights is expressly made subject to the
following conditions.

  a. Attribution.

       1. If You Share the Licensed Material (including in modified
          form), You must:

            a. retain the following if it is supplied by the Licensor
               with the Licensed Material:

                 i. identification of the creator(s) of the Licensed
                    Material and any others designated to receive
                    attribution, in any reasonable manner requested by
                    the Licensor (including by pseudonym if
                    designated);

                ii. a copyright notice;

               iii. a notice that refers to this Public License;

                iv. a notice that refers to the disclaimer of
                    warranties;

                 v. a URI or hyperlink to the Licensed Material to the
                    extent reasonably practicable;

            b. indicate if You modified the Licensed Material and
               retain an indication of any previous modifications; and

            c. indicate the Licensed Material is licensed under this
               Public License, and include the text of, or the URI or
               hyperlink to, this Public License.

       2. You may satisfy the conditions in Section 3(a)(1) in any
          reasonable manner based on the medium, means, and context in
          which You Share the Licensed Material. For example, it may be
          reasonable to satisfy the conditions by providing a URI or
          hyperlink to a resource that includes the required
          information.

       3. If requested by the Licensor, You must remove any of the
          information required by Section 3(a)(1)(A) to the extent
          reasonably practicable.

  b. ShareAlike.

     In addition to the conditions in Section 3(a), if You Share
     Adapted Material You produce, the following conditions also apply.

       1. The Adapter's License You apply must be a Creative Commons
          license with the same License Elements, this version or
          later, or a BY-SA Compatible License.

       2. You must include the text of, or the URI or hyperlink to, the
          Adapter's License You apply. You may satisfy this condition
          in any reasonable manner based on the medium, means, and
          context in which You Share Adapted Material.

       3. You may not offer or impose any additional or different terms
          or conditions on, or apply any Effective Technological
          Measures to, Adapted Material that restrict exercise of the
          rights granted under the Adapter's License You apply.


Section 4 -- Sui Generis Database Rights.

Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:

  a. for the avoidance of doubt, Section 2(a)(1) grants You the right
     to extract, reuse, reproduce, and Share all or a substantial
     portion of the contents of the database;

  b. if You include all or a substantial portion of the database
     contents in a database in which You have Sui Generis Database
     Rights, then the database in which You have Sui Generis Database
     Rights (but not its individual contents) is Adapted Material,

     including for purposes of Section 3(b); and
  c. You must comply with the conditions in Section 3(a) if You Share
     all or a substantial portion of the contents of the database.

For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.


Section 5 -- Disclaimer of Warranties and Limitation of Liability.

  a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
     EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
     AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
     ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
     IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
     WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
     PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
     ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
     KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
     ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.

  b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
     TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
     NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
     INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
     COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
     USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
     ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
     DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
     IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.

  c. The disclaimer of warranties and limitation of liability provided
     above shall be interpreted in a manner that, to the extent
     possible, most closely approximates an absolute disclaimer and
     waiver of all liability.


Section 6 -- Term and Termination.

  a. This Public License applies for the term of the Copyright and
     Similar Rights licensed here. However, if You fail to comply with
     this Public License, then Your rights under this Public License
     terminate automatically.

  b. Where Your right to use the Licensed Material has terminated under
     Section 6(a), it reinstates:

       1. automatically as of the date the violation is cured, provided
          it is cured within 30 days of Your discovery of the
          violation; or

       2. upon express reinstatement by the Licensor.

     For the avoidance of doubt, this Section 6(b) does not affect any
     right the Licensor may have to seek remedies for Your violations
     of this Public License.

  c. For the avoidance of doubt, the Licensor may also offer the
     Licensed Material under separate terms or conditions or stop
     distributing the Licensed Material at any time; however, doing so
     will not terminate this Public License.

  d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
     License.


Section 7 -- Other Terms and Conditions.

  a. The Licensor shall not be bound by any additional or different
     terms or conditions communicated by You unless expressly agreed.

  b. Any arrangements, understandings, or agreements regarding the
     Licensed Material not stated herein are separate from and
     independent of the terms and conditions of this Public License.


Section 8 -- Interpretation.

  a. For the avoidance of doubt, this Public License does not, and
     shall not be interpreted to, reduce, limit, restrict, or impose
     conditions on any use of the Licensed Material that could lawfully
     be made without permission under this Public License.

  b. To the extent possible, if any provision of this Public License is
     deemed unenforceable, it shall be automatically reformed to the
     minimum extent necessary to make it enforceable. If the provision
     cannot be reformed, it shall be severed from this Public License
     without affecting the enforceability of the remaining terms and
     conditions.

  c. No term or condition of this Public License will be waived and no
     failure to comply consented to unless expressly agreed to by the
     Licensor.

  d. Nothing in this Public License constitutes or may be interpreted
     as a limitation upon, or waiver of, any privileges and immunities
     that apply to the Licensor or You, including from the legal
     processes of any jurisdiction or authority.
"""


import sys
import os
import shutil
import enum


class Action(enum.Enum):
    # Does not use 0 or 1 to avoid matching with False or True if the caller mistakenly passes a Boolean
    PRINT = 2
    REMOVE = 3


def detritus_tree(root_path: str, action=Action.PRINT):
    """
    Recursively traverses the specified folder and takes action on files and folders whose names match known detritus files or folders
    :param root_path: Path of a folder
    :param action: Prints (default) or removes files and folders whose names match known detritus files or folders
    :return: None
    """
    files_and_folders_to_be_removed = {}
    for folder, __, files in os.walk(root_path):
        if detritus_folder_name(folder):
            files_and_folders_to_be_removed[folder] = 'folder'
        else:
            for file in files:
                if detritus_file_name(file):
                    files_and_folders_to_be_removed[os.path.join(folder, file)] = 'file'

    # Remove files and folders in reverse order so as not to remove an ancestor folder tree before removing a descendant file or folder tree
    for path, kind in sorted(files_and_folders_to_be_removed.items(), reverse=True):
        if action == Action.REMOVE:
            if kind == 'file':
                os.remove(path)
            else:
                shutil.rmtree(path)
        else:
            print(path)


def detritus(path: str) -> bool:
    """
    Determines whether the name of the specified file or folder matches a known detritus file or folder
    :param path: Path of a file or folder
    :return: True if the name of the file or folder specified by path matches a known detritus file or folder, else False
    """
    if os.path.isdir(path):
        return detritus_folder_name(path)
    elif os.path.isfile(path):
        return detritus_file_name(path)
    return False


def detritus_folder_name(folder_path: str) -> bool:
    """
    Determines whether the name of the specified folder matches a known detritus folder
    :param folder_path: Path of a folder
    :return: True if the name of the folder specified by folder_path matches a known detritus folder, else False
    """
    folder_name = os.path.basename(folder_path)
    detritus_folder_names = ('.fseventsd', '.Trashes', '.TemporaryItems', '.AppleDouble', '__MACOSX', '$RECYCLE.BIN', 'System Volume Information', '.idea')
    detritus_folder_name_prefixes = ('.com.apple.TimeMachine', '.Spotlight-V', '.DocumentRevisions-V')
    return folder_name in detritus_folder_names or folder_name.startswith(detritus_folder_name_prefixes)


def detritus_file_name(file_path: str) -> bool:
    """
    Determines whether the name of the specified file matches a known detritus file
    :param file_path: Path of a file
    :return: True if the name of the file specified by file_path matches a known detritus file, else False
    """
    file_name = os.path.basename(file_path)
    detritus_file_names = ('.DS_Store', '.AppleSingle', '.localized', '.nosync', 'Thumbs.db', 'Desktop.ini')
    detritus_file_name_prefixes = ('._', '~$')
    return file_name in detritus_file_names or file_name.startswith(detritus_file_name_prefixes)


def _detritus_main():
    """
    Recursively traverses the specified folder(s) and takes action on files and folders whose names match known detritus files or folders
    :usage: detritus [-r] folder_path ...
                -r          Remove files and folders whose names match known detritus files or folders
                default     Print files and folders whose names match known detritus files or folders
    """
    start_of_path_args = 1
    action = Action.PRINT
    if len(sys.argv) >= 3:
        if sys.argv[1] == '-r':
            start_of_path_args = 2
            action = Action.REMOVE
    for root_path in sys.argv[start_of_path_args:]:
        detritus_tree(root_path, action)


if __name__ == '__main__':
    _detritus_main()

The Workers’ Place in History

In one of my cases, the dates on which a set of inter-dependent files were created and modified were particularly important to the matter.

However, the creation and modification dates of files in a production are often not made clear to the recipient of the production. The production could have been copied from a source code escrow deposit, copied from a system backup, copied from an unspecified branch or label from an unspecified version control system, copied from a single developer’s computer, or copied from a copy made for a prior litigation.

Further, each file’s creation date and modification date, as recorded by the file system on the review computer, might not reflect the dates on which the developer created or modified that file. A file’s creation date and modification date sometimes only reflect the date on which the production was copied onto the review computer!
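As a rough illustration of how copying alone can reset a timestamp, the following Python sketch compares a plain copy with a copy that attempts to preserve timestamps. It is not part of the scripts in this post; the function names and file names are hypothetical, and it only examines the modification date, because the way a creation date is exposed differs between operating systems.

```python
import datetime
import os
import shutil
import tempfile


def modified_utc(path):
    """Return the file system's modification time for path as a UTC datetime."""
    return datetime.datetime.fromtimestamp(os.stat(path).st_mtime, tz=datetime.timezone.utc)


def copy_both_ways(original, folder):
    """Copy original into folder twice: once without and once with timestamp preservation.
    :return: tuple of (plain copy path, preserving copy path)
    """
    plain = os.path.join(folder, 'plain_copy.txt')
    preserved = os.path.join(folder, 'preserved_copy.txt')
    shutil.copy(original, plain)       # copies contents and permission bits only
    shutil.copy2(original, preserved)  # also attempts to preserve timestamps
    return plain, preserved


def demo():
    """Create a file dated 2000-01-01, copy it both ways, and return the three modification times."""
    with tempfile.TemporaryDirectory() as folder:
        original = os.path.join(folder, 'original.txt')
        with open(original, 'w') as f:
            f.write('int main(void) { return 0; }\n')
        # Pretend the developer last modified the file on 2000-01-01 (UTC)
        old = datetime.datetime(2000, 1, 1, tzinfo=datetime.timezone.utc).timestamp()
        os.utime(original, (old, old))
        plain, preserved = copy_both_ways(original, folder)
        return modified_utc(original), modified_utc(plain), modified_utc(preserved)
```

Calling demo() should show the plain copy carrying the date of the copy rather than the date of the original. Whether a given production preserved its timestamps depends on the tool that was used to copy it.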

For this particular case, I received a source code production with thousands of files whose creation and modification dates were indeed the date the production was copied onto the review computer.

During my analysis of this production, the contents of some source code files appeared to be inconsistent with each other. For example, functions referenced in one file were not defined in any other file of the production. This caused me to doubt that the produced source code files were all from the same time range. 

Fortunately, the Git repositories (i.e., repos) for the production were also produced. Git is one of several popular version control systems; others include Subversion, Mercurial, and Perforce. For each file, or set of files, submitted (i.e., committed) by a developer, a version control system logs metadata about that commit.

At a minimum, such logs contain the contents of the file, date and time of the commit, the user ID of the developer who made the commit, and whatever comment was made by that developer to describe why the commit was made and what was added, removed, or changed from the prior commit of that file.
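For Git, this metadata can be read with the "git log" command. The sketch below is one possible way to collect it in Python, assuming that git is installed and that the folder passed to read_log is a Git working tree. The function names and the dictionary keys are hypothetical; the format placeholders (%H, %an, %aI, %s, %x1f) are standard "git log" pretty-format placeholders.

```python
import datetime
import subprocess

FIELD_SEPARATOR = '\x1f'
# %H = commit ID, %an = author name, %aI = author date (strict ISO 8601), %s = first line of the comment
LOG_FORMAT = '--format=%H%x1f%an%x1f%aI%x1f%s'


def parse_log(text):
    """Parse the output of: git log --all <LOG_FORMAT>
    :return: list of dicts, one per commit, with the date normalized to UTC
    """
    commits = []
    # Split on newline only, so that unusual characters in a comment do not start a new record
    for line in text.split('\n'):
        if not line:
            continue
        commit_id, author, date, subject = line.split(FIELD_SEPARATOR, 3)
        when = datetime.datetime.fromisoformat(date).astimezone(datetime.timezone.utc)
        commits.append({'id': commit_id, 'author': author, 'date_utc': when, 'subject': subject})
    return commits


def read_log(repo_path):
    """Run git in repo_path and return the parsed commits (assumes git is installed)."""
    result = subprocess.run(['git', '-C', repo_path, 'log', '--all', LOG_FORMAT],
                            capture_output=True, check=True)
    return parse_log(result.stdout.decode('utf-8', errors='replace'))
```

Normalizing every date to UTC makes commits by developers in different timezones directly comparable.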

In the jargon of the Git version control system, the copy of a file that the developer is editing on their local system is called the “working copy” of that file. A directory of working copies is called a “working directory” and a tree of working directories is called a “working tree”.

Often, the working copy of a file matches the version of that file that was most recently committed to the version control system. Other times, it is an edited copy of the most recently committed version that the developer intends to commit later.

However, there are no rules that require a working copy of a file to match, or to be derived from, the most recently committed version of that file. Further, a working copy is not required to have any relationship with any committed files; that is, a file on a local system might not yet have been added to the version control system or might never be intended to be added to the version control system.

In my case, I found that many of the files in the inconsistent set were working copies that matched commits made on different dates. Using the metadata in the Git version control system, I was able to extract versions of those files that were consistent with each other as of a particular date.
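The idea of picking, for each file, the version that was current on a particular date can be sketched as follows. This is a simplified illustration and not the method used by the script in this post: the function name is hypothetical, and it assumes a single line of history per file, ignoring branches and deletions.

```python
import bisect
import datetime


def version_as_of(history, cutoff):
    """Return the commit ID of the newest version committed at or before cutoff.
    :param history: list of (commit datetime in UTC, commit ID) tuples for one file, sorted oldest first
    :param cutoff: timezone-aware datetime
    :return: commit ID, or None if the file had not yet been committed at cutoff
    """
    dates = [when for when, __ in history]
    # bisect_right returns the number of commits dated at or before the cutoff
    index = bisect.bisect_right(dates, cutoff)
    return history[index - 1][1] if index else None
```

Applying the same cutoff to every file in the set yields versions that all existed together on that date.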

Instead of using the produced working tree as the source of truth for my analysis, I used the commit history as the source of truth for each file. For each file, I then noted which copy in the commit history matched the working copy of that file, if any.
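The matching step can be reduced to comparing hash sums. The following is a minimal sketch, assuming git is installed; the function names are hypothetical, and committed_bytes relies on the standard "git show <commit>:<path>" syntax, where the path is relative to the root of the repo.

```python
import hashlib
import subprocess


def md5_of(data):
    """Return the MD5 hash sum of a bytes object as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def committed_bytes(repo_path, commit_id, relative_path):
    """Return the contents of relative_path as committed in commit_id (assumes git is installed)."""
    result = subprocess.run(['git', '-C', repo_path, 'show', f'{commit_id}:{relative_path}'],
                            capture_output=True, check=True)
    return result.stdout


def matching_commits(working_bytes, committed_versions):
    """Return the commit IDs whose committed contents have the same MD5 hash sum as the working copy.
    :param working_bytes: contents of the working copy
    :param committed_versions: iterable of (commit ID, bytes) pairs
    """
    working_md5 = md5_of(working_bytes)
    return [commit_id for commit_id, data in committed_versions if md5_of(data) == working_md5]
```

A working copy can match more than one commit, or none at all. One caveat: if Git was configured to convert line endings on checkout, a working copy can differ byte-for-byte from the committed version even though no developer edited it.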

To do this for the thousands of files in the production, I created the following script, which correlates and tags all files in the working tree with all versions of all files that have been committed to the Git version control system. See the documentation in the comments in the script itself. Note: this script is likely to fill your disk with many, many files.

#!/usr/bin/env python3

"""
Copyright 2020-2021 Stairstep Consulting LLC. All rights reserved.

Creative Commons Attribution-ShareAlike 4.0 International Public
License

By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution-ShareAlike 4.0 International Public License ("Public
License"). To the extent this Public License may be interpreted as a
contract, You are granted the Licensed Rights in consideration of Your
acceptance of these terms and conditions, and the Licensor grants You
such rights in consideration of benefits the Licensor receives from
making the Licensed Material available under these terms and
conditions.


Section 1 -- Definitions.

  a. Adapted Material means material subject to Copyright and Similar
     Rights that is derived from or based upon the Licensed Material
     and in which the Licensed Material is translated, altered,
     arranged, transformed, or otherwise modified in a manner requiring
     permission under the Copyright and Similar Rights held by the
     Licensor. For purposes of this Public License, where the Licensed
     Material is a musical work, performance, or sound recording,
     Adapted Material is always produced where the Licensed Material is
     synched in timed relation with a moving image.

  b. Adapter's License means the license You apply to Your Copyright
     and Similar Rights in Your contributions to Adapted Material in
     accordance with the terms and conditions of this Public License.

  c. BY-SA Compatible License means a license listed at
     creativecommons.org/compatiblelicenses, approved by Creative
     Commons as essentially the equivalent of this Public License.

  d. Copyright and Similar Rights means copyright and/or similar rights
     closely related to copyright including, without limitation,
     performance, broadcast, sound recording, and Sui Generis Database
     Rights, without regard to how the rights are labeled or
     categorized. For purposes of this Public License, the rights
     specified in Section 2(b)(1)-(2) are not Copyright and Similar
     Rights.

  e. Effective Technological Measures means those measures that, in the
     absence of proper authority, may not be circumvented under laws
     fulfilling obligations under Article 11 of the WIPO Copyright
     Treaty adopted on December 20, 1996, and/or similar international
     agreements.

  f. Exceptions and Limitations means fair use, fair dealing, and/or
     any other exception or limitation to Copyright and Similar Rights
     that applies to Your use of the Licensed Material.

  g. License Elements means the license attributes listed in the name
     of a Creative Commons Public License. The License Elements of this
     Public License are Attribution and ShareAlike.

  h. Licensed Material means the artistic or literary work, database,
     or other material to which the Licensor applied this Public
     License.

  i. Licensed Rights means the rights granted to You subject to the
     terms and conditions of this Public License, which are limited to
     all Copyright and Similar Rights that apply to Your use of the
     Licensed Material and that the Licensor has authority to license.

  j. Licensor means the individual(s) or entity(ies) granting rights
     under this Public License.

  k. Share means to provide material to the public by any means or
     process that requires permission under the Licensed Rights, such
     as reproduction, public display, public performance, distribution,
     dissemination, communication, or importation, and to make material
     available to the public including in ways that members of the
     public may access the material from a place and at a time
     individually chosen by them.

  l. Sui Generis Database Rights means rights other than copyright
     resulting from Directive 96/9/EC of the European Parliament and of
     the Council of 11 March 1996 on the legal protection of databases,
     as amended and/or succeeded, as well as other essentially
     equivalent rights anywhere in the world.

  m. You means the individual or entity exercising the Licensed Rights
     under this Public License. Your has a corresponding meaning.


Section 2 -- Scope.

  a. License grant.

       1. Subject to the terms and conditions of this Public License,
          the Licensor hereby grants You a worldwide, royalty-free,
          non-sublicensable, non-exclusive, irrevocable license to
          exercise the Licensed Rights in the Licensed Material to:

            a. reproduce and Share the Licensed Material, in whole or
               in part; and

            b. produce, reproduce, and Share Adapted Material.

       2. Exceptions and Limitations. For the avoidance of doubt, where
          Exceptions and Limitations apply to Your use, this Public
          License does not apply, and You do not need to comply with
          its terms and conditions.

       3. Term. The term of this Public License is specified in Section
          6(a).

       4. Media and formats; technical modifications allowed. The
          Licensor authorizes You to exercise the Licensed Rights in
          all media and formats whether now known or hereafter created,
          and to make technical modifications necessary to do so. The
          Licensor waives and/or agrees not to assert any right or
          authority to forbid You from making technical modifications
          necessary to exercise the Licensed Rights, including
          technical modifications necessary to circumvent Effective
          Technological Measures. For purposes of this Public License,
          simply making modifications authorized by this Section 2(a)
          (4) never produces Adapted Material.

       5. Downstream recipients.

            a. Offer from the Licensor -- Licensed Material. Every
               recipient of the Licensed Material automatically
               receives an offer from the Licensor to exercise the
               Licensed Rights under the terms and conditions of this
               Public License.

            b. Additional offer from the Licensor -- Adapted Material.
               Every recipient of Adapted Material from You
               automatically receives an offer from the Licensor to
               exercise the Licensed Rights in the Adapted Material
               under the conditions of the Adapter's License You apply.

            c. No downstream restrictions. You may not offer or impose
               any additional or different terms or conditions on, or
               apply any Effective Technological Measures to, the
               Licensed Material if doing so restricts exercise of the
               Licensed Rights by any recipient of the Licensed
               Material.

       6. No endorsement. Nothing in this Public License constitutes or
          may be construed as permission to assert or imply that You
          are, or that Your use of the Licensed Material is, connected
          with, or sponsored, endorsed, or granted official status by,
          the Licensor or others designated to receive attribution as
          provided in Section 3(a)(1)(A)(i).

  b. Other rights.

       1. Moral rights, such as the right of integrity, are not
          licensed under this Public License, nor are publicity,
          privacy, and/or other similar personality rights; however, to
          the extent possible, the Licensor waives and/or agrees not to
          assert any such rights held by the Licensor to the limited
          extent necessary to allow You to exercise the Licensed
          Rights, but not otherwise.

       2. Patent and trademark rights are not licensed under this
          Public License.

       3. To the extent possible, the Licensor waives any right to
          collect royalties from You for the exercise of the Licensed
          Rights, whether directly or through a collecting society
          under any voluntary or waivable statutory or compulsory
          licensing scheme. In all other cases the Licensor expressly
          reserves any right to collect such royalties.


Section 3 -- License Conditions.

Your exercise of the Licensed Rights is expressly made subject to the
following conditions.

  a. Attribution.

       1. If You Share the Licensed Material (including in modified
          form), You must:

            a. retain the following if it is supplied by the Licensor
               with the Licensed Material:

                 i. identification of the creator(s) of the Licensed
                    Material and any others designated to receive
                    attribution, in any reasonable manner requested by
                    the Licensor (including by pseudonym if
                    designated);

                ii. a copyright notice;

               iii. a notice that refers to this Public License;

                iv. a notice that refers to the disclaimer of
                    warranties;

                 v. a URI or hyperlink to the Licensed Material to the
                    extent reasonably practicable;

            b. indicate if You modified the Licensed Material and
               retain an indication of any previous modifications; and

            c. indicate the Licensed Material is licensed under this
               Public License, and include the text of, or the URI or
               hyperlink to, this Public License.

       2. You may satisfy the conditions in Section 3(a)(1) in any
          reasonable manner based on the medium, means, and context in
          which You Share the Licensed Material. For example, it may be
          reasonable to satisfy the conditions by providing a URI or
          hyperlink to a resource that includes the required
          information.

       3. If requested by the Licensor, You must remove any of the
          information required by Section 3(a)(1)(A) to the extent
          reasonably practicable.

  b. ShareAlike.

     In addition to the conditions in Section 3(a), if You Share
     Adapted Material You produce, the following conditions also apply.

       1. The Adapter's License You apply must be a Creative Commons
          license with the same License Elements, this version or
          later, or a BY-SA Compatible License.

       2. You must include the text of, or the URI or hyperlink to, the
          Adapter's License You apply. You may satisfy this condition
          in any reasonable manner based on the medium, means, and
          context in which You Share Adapted Material.

       3. You may not offer or impose any additional or different terms
          or conditions on, or apply any Effective Technological
          Measures to, Adapted Material that restrict exercise of the
          rights granted under the Adapter's License You apply.


Section 4 -- Sui Generis Database Rights.

Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:

  a. for the avoidance of doubt, Section 2(a)(1) grants You the right
     to extract, reuse, reproduce, and Share all or a substantial
     portion of the contents of the database;

  b. if You include all or a substantial portion of the database
     contents in a database in which You have Sui Generis Database
     Rights, then the database in which You have Sui Generis Database
     Rights (but not its individual contents) is Adapted Material,

     including for purposes of Section 3(b); and
  c. You must comply with the conditions in Section 3(a) if You Share
     all or a substantial portion of the contents of the database.

For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.


Section 5 -- Disclaimer of Warranties and Limitation of Liability.

  a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
     EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
     AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
     ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
     IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
     WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
     PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
     ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
     KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
     ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.

  b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
     TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
     NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
     INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
     COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
     USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
     ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
     DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
     IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.

  c. The disclaimer of warranties and limitation of liability provided
     above shall be interpreted in a manner that, to the extent
     possible, most closely approximates an absolute disclaimer and
     waiver of all liability.


Section 6 -- Term and Termination.

  a. This Public License applies for the term of the Copyright and
     Similar Rights licensed here. However, if You fail to comply with
     this Public License, then Your rights under this Public License
     terminate automatically.

  b. Where Your right to use the Licensed Material has terminated under
     Section 6(a), it reinstates:

       1. automatically as of the date the violation is cured, provided
          it is cured within 30 days of Your discovery of the
          violation; or

       2. upon express reinstatement by the Licensor.

     For the avoidance of doubt, this Section 6(b) does not affect any
     right the Licensor may have to seek remedies for Your violations
     of this Public License.

  c. For the avoidance of doubt, the Licensor may also offer the
     Licensed Material under separate terms or conditions or stop
     distributing the Licensed Material at any time; however, doing so
     will not terminate this Public License.

  d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
     License.


Section 7 -- Other Terms and Conditions.

  a. The Licensor shall not be bound by any additional or different
     terms or conditions communicated by You unless expressly agreed.

  b. Any arrangements, understandings, or agreements regarding the
     Licensed Material not stated herein are separate from and
     independent of the terms and conditions of this Public License.


Section 8 -- Interpretation.

  a. For the avoidance of doubt, this Public License does not, and
     shall not be interpreted to, reduce, limit, restrict, or impose
     conditions on any use of the Licensed Material that could lawfully
     be made without permission under this Public License.

  b. To the extent possible, if any provision of this Public License is
     deemed unenforceable, it shall be automatically reformed to the
     minimum extent necessary to make it enforceable. If the provision
     cannot be reformed, it shall be severed from this Public License
     without affecting the enforceability of the remaining terms and
     conditions.

  c. No term or condition of this Public License will be waived and no
     failure to comply consented to unless expressly agreed to by the
     Licensor.

  d. Nothing in this Public License constitutes or may be interpreted
     as a limitation upon, or waiver of, any privileges and immunities
     that apply to the Licensor or You, including from the legal
     processes of any jurisdiction or authority.
"""

import sys
import datetime
import hashlib
import os
import re
import subprocess
import functools

tag = '#'
tag_working = f'{tag}working'
tag_newest = f'{tag}newest'
tag_oldest = f'{tag}oldest'
tag_only = f'{tag}only'
tag_deleted = f'{tag}deleted'


@functools.lru_cache
def md5_from_file_contents(file_path):
    h = hashlib.new('md5')
    chunk_size = 1024 * 1024
    with open(file_path, 'rb') as f:
        chunk = f.read(chunk_size)
        while chunk:
            h.update(chunk)
            chunk = f.read(chunk_size)
    md5 = h.hexdigest()
    return md5


def minimum_commit_prefix_length(commits):
    """ Helps avoid using unnecessarily large commit IDs
    :param commits: list of all commit IDs in the .git repository folder
    :return: the fewest number of characters at the beginning of each commit ID where each commit ID will be unique,
    with a minimum of 4 characters
    """

    # Include the full commit ID length, in case only the complete IDs are unique
    for prefix_length in range(4, len(commits[0]) + 1):
        unique_commits = set()
        collision = False
        for commit in commits:
            prefix = commit[:prefix_length]
            if prefix in unique_commits:
                collision = True
                break
            else:
                unique_commits.add(prefix)
        if not collision:
            return prefix_length
    raise ValueError('git commit identifiers are not unique')


def link_working_directory(case_root, project_root, target_root):
    """Recreates the project_root folder tree structure in the target root folder by creating hard links for each file,
    tags each file with the "working" tag, and calculates the MD5 hash sum for each file for later use in matching
    working copy versions to committed versions.
    :param case_root: argv[1]
    :param project_root: argv[2]
    :param target_root: argv[3]
    :return: dict mapping the lowercase version of each target file path to its MD5 hash sum
    """

    md5s = {}
    for folder, __, files in os.walk(project_root):
        for file in files:
            if folder.endswith('/.git') or '/.git/' in folder:
                continue
            # Extract the relative path from folder, starting at the case_root, to apply to the target_root
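            # Anchor the pattern so that only the leading case_root is stripped, not a later repeat of the same text
            target_rel = re.sub('^' + re.escape(case_root) + '/?', '', folder)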
            target_folder = os.path.join(target_root, target_rel)
            target_path = os.path.join(target_folder, f'{tag_working}{tag}{file}')
            os.makedirs(target_folder, exist_ok=True)
            os.link(os.path.join(folder, file), target_path)
            md5s[target_path.lower()] = md5_from_file_contents(target_path)
    return md5s


def tag_newest_oldest_or_only_commit(target_root):
    """Add tags to the newest and oldest (or only) distinct committed versions for all files that have at least one
    distinct committed version, as indicated by files which have already been tagged with a commit datetime and commit
    ID.
    :param target_root: argv[3]
    :return: None
    """

    tag_commit_datetime_regex = tag + r'[0-9]{4}[0-9]{2}[0-9]{2}T[0-9]{6}Z'
    tag_commit_id_regex = tag + r'[0-9a-f]{4,}'
    tag_working_or_tag_deleted_regex = f'{tag_working}|{tag_deleted}|'
    tag_filename_regex = f'{tag}.*'
    tags_compiled_regex = re.compile(f'^({tag_commit_datetime_regex})({tag_commit_id_regex})({tag_working_or_tag_deleted_regex})({tag_filename_regex})$')
    for folder, __, files in os.walk(target_root):
        names_newest = {}
        names_oldest = {}
        # The files MUST be in sorted order, because that will sort them by commit datetime which is vital for determining
        #     oldest and newest commit
        for file in sorted(files):
            if search := tags_compiled_regex.search(file):
                tag_commit_datetime = search.group(1)
                tag_commit_id = search.group(2)
                tag_working_or_tag_deleted = search.group(3)
                tag_filename = search.group(4)
                values = folder, file, tag_commit_datetime, tag_commit_id, tag_working_or_tag_deleted
                if tag_working_or_tag_deleted != tag_deleted:
                    if tag_filename not in names_oldest:
                        names_oldest[tag_filename] = values
                    names_newest[tag_filename] = values
        for tag_filename, values in names_newest.items():
            folder, file, tag_commit_datetime, tag_commit_id, tag_working_or_tag_deleted = values
            if tag_filename not in names_oldest or names_oldest[tag_filename] != values:
                os.rename(os.path.join(folder, file), os.path.join(folder, f'{tag_commit_datetime}{tag_commit_id}{tag_newest}{tag_working_or_tag_deleted}{tag_filename}'))
            else:
                os.rename(os.path.join(folder, file), os.path.join(folder, f'{tag_commit_datetime}{tag_commit_id}{tag_only}{tag_working_or_tag_deleted}{tag_filename}'))
        for tag_filename, values in names_oldest.items():
            folder, file, tag_commit_datetime, tag_commit_id, tag_working_or_tag_deleted = values
            if tag_filename not in names_newest or names_newest[tag_filename] != values:
                os.rename(os.path.join(folder, file), os.path.join(folder, f'{tag_commit_datetime}{tag_commit_id}{tag_oldest}{tag_working_or_tag_deleted}{tag_filename}'))


def lock_down(target_root):
    """ Change permission on every folder and every file in the target root folder to disallow addition, removals, or
    changes.
    :param target_root: argv[3]
    :return: None
    """

    for folder, __, files in os.walk(target_root):
        for file in files:
            os.chmod(os.path.join(folder, file), 0o444)
        os.chmod(folder, 0o555)


def recreate_git_commits(case_root, project_rel, target_root):
    """ The main method.
    :param case_root: argv[1]
    :param project_rel: argv[2]
    :param target_root: argv[3]
    :return: None
    """

    project_root = os.path.join(case_root, project_rel)

    # Link all files in working directory to new target directory,
    # preface each file with "working" tag;
    # Calculate md5 of each file for later matching against
    # committed file versions.
    print('Recreating working copy file versions in target folder tree...')
    md5s = link_working_directory(case_root, project_root, target_root)

    # Find each committed file and copy to target directory,
    # preface each file with commit datetime and commit id,
    # match committed file with working copy file if their
    # md5s match.

    timezone_compiled_regex = re.compile(r' ([-+])([0-9][0-9])([0-9][0-9])$')

    # Recursively process each '.git' folder in the project root folder tree
    for folder, sub_folders, __ in os.walk(project_root):
        for sub_folder in sub_folders:
            if sub_folder == '.git':
                # For each '.git' folder, find all commits
                git_root = folder
                git_rel = git_root[len(case_root)+1:]
                print(f'Creating and tagging committed versions from: {os.path.join(git_root, sub_folder)}')
                os.chdir(git_root)
                result = subprocess.run(['git', 'log', '--all'], capture_output=True)
                assert result.returncode == 0
                commit_prefix = 'commit '
                commits = []
                for log_line in result.stdout.decode('utf-8').splitlines():
                    if log_line.startswith(commit_prefix):
                        commit = log_line[len(commit_prefix):]
                        commits.append(commit)
                minimum_length = minimum_commit_prefix_length(commits)
                
                # For each commit, find all files
                for commit in commits:
                    result = subprocess.run(['git', 'show', '--name-only', commit], capture_output=True)
                    assert result.returncode == 0
                    merge_prefix = 'Merge:'
                    date_prefix = 'Date:'
                    commit_datetime = None
                    relative_paths = []
                    blank_line = False
                    merge = False
                    for show_line in result.stdout.decode('utf-8').splitlines():
                        if not blank_line:
                            if not show_line:
                                blank_line = True
                            elif show_line.startswith(merge_prefix):
                                merge = True
                                break
                            elif show_line.startswith(date_prefix):
                                commit_datetime = show_line[len(date_prefix):].strip()
                        else:
                            if show_line and not show_line.startswith(' '):
                                relative_paths.append(show_line)
                    if not merge:
                        assert commit_datetime

                        # Normalize all commit datetimes to ISO format in the UTC timezone
                        timezone = timezone_compiled_regex.search(commit_datetime)
                        assert timezone
                        timezone_direction = -1 if timezone.group(1) == '+' else 1
                        timezone_hour_offset = int(timezone.group(2))
                        timezone_minute_offset = int(timezone.group(3))
                        assert timezone_hour_offset < 24 and timezone_minute_offset < 60
                        commit_datetime_without_timezone = commit_datetime[:-len(' +0000')]
                        t = datetime.datetime.strptime(commit_datetime_without_timezone, '%c')
                        t = t + timezone_direction * datetime.timedelta(hours=timezone_hour_offset,
                                                                        minutes=timezone_minute_offset)
                        tag_commit_datetime = tag + t.isoformat().replace('-', '').replace(':', '') + 'Z'

                        # Use minimum possible commit ID
                        tag_commit_id = f'{tag}{commit[:minimum_length]}'

                        # for each file in the commit, get its contents
                        for relative_path in relative_paths:
                            # git show writes the contents of the file in the commit to stdout
                            result = subprocess.run(['git', 'show', f'{commit}:{relative_path}'], capture_output=True)
                            # git show uses return code 128 to indicate file deleted by the commit
                            assert result.returncode in (0, 128)
                            tag_maybe_deleted = tag_deleted if result.returncode == 128 else ''
                            relative_path = relative_path.strip('"')

                            parent, filename = os.path.split(relative_path)
                            tag_filename = f'{tag}{filename}'

                            # Calculate MD5 hash sum for contents of file in the commit
                            h = hashlib.new('md5')
                            h.update(result.stdout)
                            md5 = h.hexdigest()

                            target_folder = os.path.join(target_root, git_rel, parent)
                            target_path = os.path.join(target_folder, f'{tag_working}{tag_maybe_deleted}{tag_filename}')
                            os.makedirs(target_folder, exist_ok=True)

                            if os.path.isfile(target_path) and md5 == md5s[target_path.lower()]:
                                # This committed version is the same as the working copy version, so just add committed version commit tags to the working copy version file name
                                new_target_path = os.path.join(target_root, git_rel, parent, f'{tag_commit_datetime}{tag_commit_id}{tag_working}{tag_maybe_deleted}{tag_filename}')
                                os.rename(target_path, new_target_path)
                            else:
                                # This committed version is not the same as the working copy version, create a new file for the committed version
                                new_target_path = os.path.join(target_root, git_rel, parent, f'{tag_commit_datetime}{tag_commit_id}{tag_maybe_deleted}{tag_filename}')
                                with open(new_target_path, 'wb') as w:
                                    w.write(result.stdout)

    print('Tagging newest and oldest (or only) committed versions...')
    tag_newest_oldest_or_only_commit(target_root)
    print('Locking down target folder tree...')
    lock_down(target_root)


if __name__ == '__main__':
    """
    argv[1] = absolute folder path for the case 
    argv[2] = relative folder path under argv[1] that contains the .git repository folder (or folders) to be processed
    argv[3] = absolute folder path that will contain the output of this utility
    
    For example, to process the following .git repository folder,
        '/Users/username/Documents/casename/sourcecodefolder/production1/.git', specify the following arguments:
    
         argv[1] = '/Users/username/Documents/casename'
         argv[2] = 'sourcecodefolder/production1'
         argv[3] = '/Users/username/Documents/casename/Committed Versions in sourcecodefolder-production1'
     
    This utility will recursively process all .git repository folders it finds in the folder tree specified in argv[2].
                
    The output folder tree in argv[3] will have the same structure as the folder tree specified in argv[2].
    
    The output folder tree in argv[3] will contain a copy of all the files in the folder tree specified in argv[2]. Per
    Git terminology, these are called the "working copy" versions.
    
    Additionally, the output folder tree in argv[3] will contain a copy of all distinct committed versions of all files
    from all commits from all .git repository folders, even files that have been deleted via "git rm". To avoid
    redundancy, a committed version of a file appears in the output folder tree only if it is the oldest (or only)
    committed version or its contents are different from its most recently distinct committed version. That is, a file
    may be included in 100 commits, but if it has only changed, say, twice in those 100 commits, this utility will
    output only the original copy of the file and its two changed versions.
    
    This will result in multiple versions of the file in its output folder, one copy for each distinct version.
    To distinguish these versions, each copy keeps the original file name but is prefixed with a different
    combination of "tags". Each tag and the original file name are preceded by the '#' character. Here are the
    possible tags:
    
    A working copy version contains the tag:
    
        #working
    
    Each distinct committed version contains the two tags,
    
        commit datetime in ISO format in UTC timezone; e.g., #20140718T160140Z
        commit ID; e.g., #f48a2a9e
         
    The commit ID contains only as many characters from the beginning of the full commit ID as are necessary to
    distinguish the commit ID from the other commits in the same .git repository folder.
    
    The newest distinct committed version contains the tag:
    
        #newest
    
    The oldest distinct committed version contains the tag:
    
        #oldest
    
    If there is only one distinct committed version, it contains the tag:
    
        #only
        
    If the file has been deleted via "git rm", this utility creates an empty file to memorialize the deletion. Along with
    the tags for the commit datetime and the commit ID, such a file contains the tag:
    
        #deleted
        
    A common scenario is a file which has three or more distinct committed versions, where the working copy version 
    matches the newest distinct committed version. In this scenario, the output folder will contain the following 
    versions of our example file, README.txt:
    
        For the oldest distinct committed version, an example file name is:
        
            #20200403T173955Z#503cd058#oldest#README.txt

        For each distinct committed version that is neither the newest nor oldest distinct committed version, some
        example file names are:
        
            #20200610T003255Z#06fda5da#README.txt
            #20200709T001011Z#8ac696ab#README.txt
            ...

        For the newest distinct committed version that matches the working copy version, an example file name is:
        
            #20201117T134113Z#e8b08035#newest#working#README.txt
        
    In addition to this common scenario, below are other scenarios: 
    
        If the above README.txt file is later deleted via "git rm", the above example files will be followed by an empty
        file containing the "#deleted" tag which matches the commit that deleted the file:
        
            #20200403T173955Z#503cd058#oldest#README.txt
            #20200610T003255Z#06fda5da#README.txt
            #20200709T001011Z#8ac696ab#README.txt
            ...
            #20201117T134113Z#e8b08035#newest#README.txt
            #20210804T204541Z#b50d9684#deleted#README.txt

        In the above scenario, since README.txt has been deleted via "git rm", it is unlikely that there will be a working
        copy version of README.txt, but there could be. The above scenario assumes there is no working copy version of 
        README.txt, and thus there is no "#working" tag in the newest distinct committed version.
        
        If there is only one distinct committed version and that version matches the working copy version, an example
        file name is:
        
            #20210804T204541Z#b50d9684#only#working#README.txt
        
        It is possible that no distinct committed version has a matching working copy version. In this scenario, no
        distinct committed version file name would contain the "#working" tag.
        
        It is possible that the working copy version matches a distinct committed version other than the newest
        distinct committed version. In this scenario, the "#working" tag will only appear in the file name of the
        distinct committed version that matches the working copy version.
    
        It is possible that a file has a working copy version but no committed version at all. In this scenario,
        there will be only one version of the file and that version will contain only the "#working" tag.
    """
    recreate_git_commits(sys.argv[1], sys.argv[2], sys.argv[3])
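
The script above normalizes each commit datetime to UTC by parsing the timezone offset with a regular expression and applying it by hand. As a hedged alternative, the sketch below lets Python's `%z` directive parse the offset instead. The function name `git_date_to_utc_tag` is hypothetical and is not part of the script. The format string assumes Git's default date layout and an English locale, since `%a` and `%b` match locale-specific names.

```python
from datetime import datetime, timezone


def git_date_to_utc_tag(git_date):
    """Convert a default-format Git date to a compact UTC timestamp.

    Hypothetical helper: assumes an English locale and Git's default
    date layout, e.g. 'Fri Jul 18 12:01:40 2014 -0400'.
    """
    # %z parses offsets such as +0000 or -0400 into an aware datetime.
    parsed = datetime.strptime(git_date, '%a %b %d %H:%M:%S %Y %z')
    # Convert to UTC, then format without separators and with a 'Z' suffix.
    return parsed.astimezone(timezone.utc).strftime('%Y%m%dT%H%M%SZ')


print(git_date_to_utc_tag('Fri Jul 18 12:01:40 2014 -0400'))
```

With the sample input, this prints 20140718T160140Z, which has the same shape as the commit datetime tag described in the docstring above, without the leading '#'.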

Get the Git

Summary

If the source code owner has not already produced the Git repository (or repositories) for the produced source code, the following is an argument for requesting that production.

When engaged by a client to review source code, it has been a pleasure to be provided with source code that contains sufficient programmer comments.

However, the frequent and clichéd programmer resistance to commenting their source code makes it more difficult and time-consuming for a reviewer to understand how the source code works and why it works the way it does.

Further, even when comments are provided in the source code, the comments sometimes only answer the “who” and “when” questions. For example, “Who authored the source code and when did they author it?” Such information is often valuable for the case, but does little to help understand the source code.

Better are source code comments that answer the more nuanced “what”, “how”, and “why” questions; that is, questions which give insight into the high-level relationships between functionality and data that assist in source code tracing.

Source code inspection tools can help stitch together the low-level relationships between source code methods, data, and classes. However, determining the high-level relationships between classes and modules without source code comments often takes more work hours and calendar days than the client expects.

There are too few processes which require a programmer to add comments to their source code, and since their source code is sometimes only briefly reviewed by their peers, a programmer’s resistance to commenting their source code is often not challenged.

However, it is almost always the case that programmers are required to contribute their drafts and final source code to a version control system. Git is one such version control system.

By default, Git requires each programmer to enter a comment for each commit that they make to a repository. While a Git commit comment can be uninformative, Git commit comments are often reviewed by the other programmers working on the same project, or by other programmers who are dependent upon the project. Therefore, since a programmer’s mandated Git commit comments are broadly viewed, there is peer pressure for each programmer to make substantive Git commit comments.

Therefore, if a Git repository and the source code controlled by that repository are produced, the Git commit comments can give the source code reviewer insights that are unavailable when programmers do not put similar comments in the source code itself.

Further, Git commit comments provide history that even source code with sufficient comments sometimes does not. For example, often source code comments only describe the version of the source code presently in the file, whereas Git commit comments will also describe previous versions of each file. Sometimes the contents of these previous versions are crucial to the case itself, but even if not, the Git commit comments made for these previous versions of a file can give insight into understanding the produced version of the file.

Details

Each folder tree whose file versions are controlled by Git contains a sub-folder named “.git”. The files and sub-folders under a “.git” folder are called a Git repository (also called a “repo”).

It is possible, and often likely, that a complex source code folder tree contains more than one Git repo. A complex software project is often divided into sub-projects, each with its own sub-folder tree. This is often the case when multiple groups are working on different parts of the same project. In this situation, it is often the case that each group will control the versions of only the files in their sub-project and there will be one “.git” folder for each significant sub-folder tree that is produced. Each “.git” folder is very much part of the folder tree for a given project or sub-project. However, the source code owner often prunes or empties all “.git” folders before they produce their source code.
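
As a minimal sketch of how a reviewer might check which sub-projects of a production still contain a Git repo, the following walks a folder tree and reports every folder that holds a “.git” entry. The function name `find_git_repos` is hypothetical. The sketch also counts a “.git” file, which Git uses in place of a folder for submodules and linked worktrees. It does not check whether a “.git” folder has been emptied.

```python
import os


def find_git_repos(root):
    """Return the sorted folders under root that directly contain a '.git' entry."""
    repos = []
    for folder, sub_folders, files in os.walk(root):
        if '.git' in sub_folders or '.git' in files:
            repos.append(folder)
        if '.git' in sub_folders:
            # Do not descend into the repository database itself.
            sub_folders.remove('.git')
    return sorted(repos)
```

Calling `find_git_repos` on the root of a production returns one folder per repo found, which can be compared against the list of sub-projects that were expected to be under version control.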

Git and other Distributed Version Control Systems

This article uses Git as an example for three reasons: (1) Git has been the most prevalent version control system I have seen used for source code, (2) a Git repo can be easily produced by the source code owner, and (3) it is easy for the source code owner to install on a review computer a tool to navigate the file histories in the produced Git repo.

Git is a distributed version control system and as such, a Git repo is fully contained in a single folder tree; that is, the folder tree whose name is “.git”. Producing a “.git” folder tree is as easy as producing any other folder tree.

On Windows, it is easy to install either a GUI tool like Git GUI or a command-line tool like Git BASH; neither of these example tools requires a commercial license. On Linux and macOS, a command-line Git tool is often pre-installed or can be installed with the operating system’s standard developer tools or package manager. There are several other easy-to-install and non-commercial GUI and command-line tools on Windows, Linux, and macOS that navigate the file histories in a Git repo.
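
To illustrate how little tooling is needed to navigate a file history, here is a minimal sketch that asks the command-line Git tool for the commits that touched one file and returns the commit ID, date, author, and first line of the commit comment for each. The names `parse_log` and `file_history` are hypothetical. The sketch assumes the `git` command is on the PATH and that `repo_folder` is the folder containing the “.git” folder.

```python
import subprocess

# One commit per line: full ID, author date, author name, and subject, tab-separated.
LOG_FORMAT = '--pretty=format:%H%x09%ad%x09%an%x09%s'


def parse_log(text):
    """Split tab-separated log lines into (commit, date, author, subject) tuples."""
    entries = []
    for line in text.splitlines():
        if line:
            # Split at most 3 times so a tab inside the subject is preserved.
            commit, date, author, subject = line.split('\t', 3)
            entries.append((commit, date, author, subject))
    return entries


def file_history(repo_folder, relative_path):
    """Return the commits, newest first, that touched relative_path."""
    result = subprocess.run(
        ['git', 'log', '--all', '--date=iso-strict', LOG_FORMAT, '--', relative_path],
        cwd=repo_folder, capture_output=True, check=True)
    return parse_log(result.stdout.decode('utf-8', errors='replace'))
```

For example, `file_history('/path/to/repo', 'README.txt')` would list each commit comment recorded for README.txt; the path shown is a placeholder.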

Subversion and other Centralized Version Control Systems

While it is easy to produce a Git repo and easy to install a tool to navigate a Git repo, it is often difficult to produce a repo and install a tool to navigate that repo for a centralized version control system. Centralized version control systems, like Subversion, require client software, server software, and a tool-specific database containing the version control information. The complexity of installing and configuring these components on a review computer has often been met with resistance by the source code owner.

Conclusion

Almost all software projects are managed by using some version control system. Negotiating with the source code owner to produce all version control repositories that are associated with the produced source code might provide insightful comments that are often lacking in the source code files. Comments from a version control system can make source code reviews more time-efficient and accurate.

Source of Source Code

The word “source” in the term “source code file” implies that the file contains the source/origin of the instructions in that code file.

Most code files are indeed the source/origin of the instructions in that code file. That is, for such files, a human enters and edits the instructions in the source code file using a text editor.

For almost all computer software, the instructions in a source code file go through one or more conversions between when the human enters and edits those instructions and when those instructions are implemented by the computer.

For example, for computer software that originates with C++ code files, even though this is not seen by the programmer by default, the C++ code file is converted to an Assembly code file which is then converted to an Object code file before it is implemented by the computer.

In this scenario, the Assembly code file contains instructions that implement the same functioning and logic as the C++ code file, and is still called an Assembly source code file, even though the Assembly code file is not the source/origin of the instructions. Indeed, there is some software in which a human enters and edits the instructions in an Assembly code file using a text editor. Therefore, Assembly code can be either source code or target code.

In this scenario, a production that contains Assembly code generated from C++ code, but does not contain the original C++ code, would not be a source code production.

Further, even code files that are almost always entered and edited by humans are sometimes generated by other software.

For example, a C++ code file can be generated by the ProtoBuf tool from a file in the .proto format. Therefore, while C++ code files are almost always referred to as source code files, the origin/source of the instructions in this scenario is the file in the .proto format. In this scenario, a production that contains C++ code generated from .proto files, but does not contain the original .proto files, would not be a source code production.
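
Generated code files often announce themselves. For example, files emitted by the ProtoBuf compiler typically begin with a comment stating that the file was generated and should not be edited. The following is a minimal heuristic sketch, not a definitive test, that flags a code file whose first few lines contain such a marker. The function name `looks_generated` and the marker list are my own assumptions, and a file without any marker may still be generated.

```python
# Lower-case phrases that commonly appear in the header of generated files.
# This list is an assumption for illustration and is far from exhaustive.
GENERATED_MARKERS = ('generated by', 'do not edit', 'autogenerated', 'auto-generated')


def looks_generated(text, max_lines=10):
    """Return True if any marker appears in the first max_lines lines of text."""
    header = '\n'.join(text.splitlines()[:max_lines]).lower()
    return any(marker in header for marker in GENERATED_MARKERS)
```

A file flagged this way is a prompt to ask where the file it was generated from is, and whether that file was included in the production.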

Artificial Intelligence as the source of code

Some modern AI tools can generate code. For example, I gave ChatGPT 3.5 the prompt, “Please generate a collection of C++ functions which implement a double-linked list.” In response to this prompt, ChatGPT 3.5 generated the C++ code at the end of this post.

Since ChatGPT 3.5 learned from one or more fragments of C++ code that existed somewhere on the Internet in 2021, what is the source of the code at the end of this post?

  • Is the source simply the generated code below?
  • Is the source the fragments of practically anonymous code used to train ChatGPT 3.5?
  • Is the source my prompt?

There are not yet legally tested definitions of the source of code generated by AI agents.

#include <iostream>

template <typename T>
class Node {
public:
    T data;
    Node* next;
    Node* prev;

    Node(T value) : data(value), next(nullptr), prev(nullptr) {}
};

template <typename T>
class DoublyLinkedList {
private:
    Node<T>* head;
    Node<T>* tail;

public:
    DoublyLinkedList() : head(nullptr), tail(nullptr) {}

    ~DoublyLinkedList() {
        clear();
    }

    void insertFront(T value) {
        Node<T>* newNode = new Node<T>(value);
        if (!head) {
            head = tail = newNode;
        } else {
            newNode->next = head;
            head->prev = newNode;
            head = newNode;
        }
    }

    void insertEnd(T value) {
        Node<T>* newNode = new Node<T>(value);
        if (!tail) {
            head = tail = newNode;
        } else {
            newNode->prev = tail;
            tail->next = newNode;
            tail = newNode;
        }
    }

    void remove(T value) {
        Node<T>* current = head;

        while (current && current->data != value) {
            current = current->next;
        }

        if (!current) {
            std::cout << "Element not found in the list.\n";
            return;
        }

        if (current->prev) {
            current->prev->next = current->next;
        } else {
            head = current->next;
        }

        if (current->next) {
            current->next->prev = current->prev;
        } else {
            tail = current->prev;
        }

        delete current;
    }

    void displayForward() {
        Node<T>* current = head;
        while (current) {
            std::cout << current->data << " ";
            current = current->next;
        }
        std::cout << std::endl;
    }

    void displayBackward() {
        Node<T>* current = tail;
        while (current) {
            std::cout << current->data << " ";
            current = current->prev;
        }
        std::cout << std::endl;
    }

    void clear() {
        while (head) {
            Node<T>* temp = head;
            head = head->next;
            delete temp;
        }
        tail = nullptr;
    }
};

int main() {
    DoublyLinkedList<int> myList;

    myList.insertFront(3);
    myList.insertFront(2);
    myList.insertEnd(4);
    myList.displayForward(); // Output: 2 3 4
    myList.displayBackward(); // Output: 4 3 2

    myList.remove(3);
    myList.displayForward(); // Output: 2 4

    myList.clear(); // Clean up memory

    return 0;
}

Scope of Source Code

Source code is often defined to be the files that contain instructions written in computer programming languages like Java, C++, Swift, etc. However, I have found that almost every commercial software project comprises other types of files as well.

Therefore, an alternate definition of source code could include all the files authored by humans which express the functioning and logic to be implemented by a computer.

With this alternate definition, computer source code would include instructions written in programming languages plus the other files used by those instructions to express the functioning and logic to be implemented by a computer. This often includes configuration files, interface files, markdown files, and data files. 

Indeed, configuration, interfaces, markdown, and data can be represented in instructions written in programming languages. However, representing configuration, interfaces, markdown, and data in their own files that are not written in a programming language provides modularity and clarity of purpose.

The fact that configuration, interfaces, markdown, and data are sometimes not represented in instructions written in programming languages does not make them less important to the functioning and logic to be implemented by the computer. As such, this alternate definition of source code includes instructions written in programming languages as well as configuration files, interface files, markdown files, and data files.

Source Code Comments not in Source Code

From the dawn of computer programming, source code has been more efficiently and accurately understood if the author provides a human language description of the source code.

As such, I have sometimes been asked to provide the number and density of comments that appear in the source code as a measure of whether the produced source code is sufficient for expert analysis. Tools like SciTools’ “Understand” product and the open source “cloc” tool measure comments inside source code written in a variety of computer programming languages.
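
To make the idea of comment density concrete, here is a minimal sketch that counts comment lines and code lines in C-style source text and reports the fraction of non-blank lines that are comments. It is an illustration only and not how the tools above work: the function name `comment_density` is hypothetical, a line counts as a comment only if it starts with “//” or lies inside a “/* … */” block, and a trailing comment on a code line counts as code.

```python
def comment_density(source):
    """Return (comment_lines, code_lines, density) for C-style source text."""
    comment_lines = 0
    code_lines = 0
    in_block = False
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if in_block:
            comment_lines += 1
            if '*/' in line:
                in_block = False
        elif line.startswith('//'):
            comment_lines += 1
        elif line.startswith('/*'):
            comment_lines += 1
            # Stay in block mode unless the comment closes on the same line.
            if '*/' not in line[2:]:
                in_block = True
        else:
            code_lines += 1
    total = comment_lines + code_lines
    density = comment_lines / total if total else 0.0
    return comment_lines, code_lines, density
```

A low density from even a rough count like this is a signal to look for descriptions of the source code elsewhere.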

Sadly, such measurements have often shown an insufficient number or density of comments in the source code.

However, the following are other potential locations that might contain descriptions of the source code…

Internal Architecture and Design Specifications

Architecture and design specifications are the obvious first type of documents in which to look for source code descriptions, but these sometimes contain descriptions that are too high-level.

Wiki Comments

Most modern software projects maintain one or more wikis that are used by employees to collaborate on various issues. The conversation threads in these wikis can give insight into the source code and are often for:

  • Architectural and Design issues; using tools like Confluence, Trello and Notion
  • Product issues; using tools like Jira and Front

Commit Comments

Almost all modern software projects use one or more source code version management tools (e.g., git and svn). The logged commit comments made by authors can give insight into the changes made to source code files.

Source Code Story Arc

An under-appreciated advantage of the conversation threads in wikis and logs of commit comments is they provide a story arc for the source code. 

Each single version of source code provides only a snapshot in time. Since a distinct start date and end date are often relevant to a case, only the versions of source code that were released in that date range are produced. However, the conversation threads and commit comments made before the start date and after the end date can sometimes provide insights into the produced source code.
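
As a minimal sketch of this idea, the following groups commits by whether their dates fall before, within, or after the relevant date range, so that the comments made outside the range can be read alongside those made inside it. The function name `partition_by_window` and the sample commit IDs are hypothetical. The sketch assumes each commit date is an ISO 8601 string with a UTC offset and that the start and end dates are timezone-aware.

```python
from datetime import datetime, timezone


def partition_by_window(commits, start, end):
    """Group (commit_id, iso_date) pairs relative to the window [start, end]."""
    buckets = {'before': [], 'within': [], 'after': []}
    for commit_id, iso_date in commits:
        when = datetime.fromisoformat(iso_date)
        if when < start:
            buckets['before'].append(commit_id)
        elif when > end:
            buckets['after'].append(commit_id)
        else:
            buckets['within'].append(commit_id)
    return buckets


# Hypothetical commits and date range, for illustration only.
sample_commits = [
    ('aaa111', '2020-04-03T17:39:55+00:00'),
    ('bbb222', '2020-07-09T00:10:11+00:00'),
    ('ccc333', '2021-08-04T20:45:41+00:00'),
]
window_start = datetime(2020, 6, 1, tzinfo=timezone.utc)
window_end = datetime(2020, 12, 31, tzinfo=timezone.utc)
print(partition_by_window(sample_commits, window_start, window_end))
```

With the sample data, the first commit falls before the window, the second within it, and the third after it.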

Source Code GPS

For one of my cases, I requested architectural diagrams and documentation associated with a particularly complex source code production.

The attorneys were concerned my request implied that my expert analysis would only paraphrase the diagrams and documentation rather than describe the source code. Understandably, they engaged me to read and understand source code. Those attorneys had enough technical background that they could read and understand such diagrams and documentation on their own.

We saw eye-to-eye after I explained that with little to no narrative comments from the authors in the source code itself, it is often more efficient to trace source code when one has a road map to help better understand the terminology, acronyms, and structure that are often described in diagrams and documentation.

However, we did agree that documentation can have one or more of the following flaws:

  • Documentation sometimes applies to a version of the product that is different than the produced source code.
  • Documentation sometimes contains mistakes such that it does not accurately describe the produced source code.
  • Internal documentation sometimes describes intended features that were not implemented in the produced source code.

Even a GPS navigation system can have similar flaws: its map can be outdated, its map can contain errors, and its map can show roads that are currently inaccessible because they are in the process of being repaired. Therefore, looking at the actual road is certainly necessary when driving. 

Likewise, basing one’s expert analysis on the actual source code is necessary, but if the architectural diagrams and documentation for that source code can help better navigate that source code, all the better for the expert’s accuracy and efficiency.

Truthy and Falsey Values

A Boolean data type is a data type that can represent two values: true or false.

In the C programming language, there is no built-in (a.k.a. primitive) Boolean data type prior to the C99 version of the language. Further, even programmers who use the C99 (or a later) version of the C programming language may choose not to use the built-in Boolean data type for compatibility with pre-C99 versions or out of habit. Therefore, C programmers who do not use a Boolean data type for a Boolean value often use an integer data type instead.

Because an integer data type can represent more than two values, the C programmer who uses an integer data type to behave as a Boolean data type must account for the possibility that a variable defined as an integer data type might contain values other than the integer values the programmer chooses to represent “true” and “false”, respectively.

The majority of the C programs I have analyzed which use an integer to represent a Boolean data type use the following method. This method takes advantage of the following convention in the C programming language:

When the C programming language tests an integer value in a context which requires the test to return “true” or “false” (e.g., in the expressions tested by “if” and “while” statements), the integer 0 is interpreted to be “false” and all other integers are interpreted to be “true”. In such a scenario, the integer 0 is called “falsey” and all other integers are called “truthy”. This convention of the C programming language applies regardless of the size of the integer and regardless of whether the integer is defined to be signed or unsigned. For example, signed integers less than 0 and signed and unsigned integers greater than 0 are interpreted to be “true” because they are not the integer 0.

Therefore, taking advantage of this convention, a C programmer will represent “false” as 0, and (often) a C programmer will represent “true” as 1, but they are not limited to 1 as the only “truthy” value. 

Note that C programmers often abstract the integers representing “true” and “false” by using C pre-processor macros, so that the actual values they choose to represent “true” and “false” are not exposed in the body of the source code, like the following:

#define TRUE 1
#define FALSE 0

int x = FALSE;
int y = TRUE;

Here are some examples of using integers in the Boolean context of an “if” statement:

int x = 0;
if (x) {
	/* will not be executed because x is not “truthy” */
} else {
	/* will be executed because x is “falsey” */
}

int x = 1;
if (x) {
	/* will be executed because x is “truthy” */
} else {
	/* will not be executed because x is not “falsey” */
}

int x = 72;
if (x) {
	/* will be executed because x is “truthy” */
} else {
	/* will not be executed because x is not “falsey” */
}

int x = -1;
if (x) {
	/* will be executed because x is “truthy” */
} else {
	/* will not be executed because x is not “falsey” */
}

int x = -29;
if (x) {
	/* will be executed because x is “truthy” */
} else {
	/* will not be executed because x is not “falsey” */
}

If the C programmer does not want to rely on this convention of “truthy” and “falsey” interpretation of integers in a Boolean context, the C programmer can explicitly test the value of the integer with a comparison operator. For example, each of the above examples can be re-written using the “x != 0” comparison (read as, x does not equal 0), as in the following:

int x = 0;
if (x != 0) {
	/* will not be executed because x != 0 is false */
} else {
	/* will be executed because x != 0 is false */
}

int x = 1;
if (x != 0) {
	/* will be executed because x != 0 is true */
} else {
	/* will not be executed because x != 0 is true */
}

int x = 72;
if (x != 0) {
	/* will be executed because x != 0 is true */
} else {
	/* will not be executed because x != 0 is true */
}

int x = -1;
if (x != 0) {
	/* will be executed because x != 0 is true */
} else {
	/* will not be executed because x != 0 is true */
}

int x = -29;
if (x != 0) {
	/* will be executed because x != 0 is true */
} else {
	/* will not be executed because x != 0 is true */
}

Product Composition Risk Management

When I first heard the term Software Composition Analysis (SCA), I was excited to hear of a new vision for what was thought of as only an open source discovery tool. I knew the vendors in this new SCA space were thinking more deeply about the problems faced by product owners than just generating a bill of materials which detailed the open source code used by and distributed with their proprietary code.

However, after thinking about the broad spectrum of what SCA vendors are actually doing, I came to realize that the only word in that market categorization which is fully applicable is: composition. Both the words software and analysis are far too narrow for the work being done by SCA vendors.

Software, Firmware, and Webware

Even while they have been benefiting from this market category, SCA vendors have been processing not only their customers’ desktop and server software, but also their mobile application software, device firmware, and webware written with open web APIs. Just being positioned as servicing “software” limits the perception of the wide variety of intellectual property delivery and deployment models SCA vendors process daily.

Risk Detection, Assessment, and Mitigation

Merriam-Webster defines analysis to be a “separation of a whole into its component parts”. Not only is this redundant with the word composition, but SCA vendors have gone beyond simply identifying open source components.

SCA users have consistently received more than a bill of open source materials. They have achieved well-defined business outcomes that have minimized risk around security, data privacy, operations, license compliance, and terms of use compliance.

Product Composition Risk Management

Therefore, to represent the actual scope of benefits provided by SCA vendors, the category “Product Composition Risk Management” is more appropriate.

A modern digital product is composed of one’s own proprietary code, code from commercial and non-commercial providers, and services from web service providers. The word product is not limited to software, firmware, mobile, or web development; it encompasses all modes of digital product composition which use all types of intellectual property.

There is risk in composing one’s product only from one’s own proprietary code, which is why that code is measured against multiple non-functional requirements. However, composing one’s product from intellectual property owned by others creates an inherent risk that is much greater. You don’t know the care with which that IP was created and don’t know the resources available to maintain it.

SCA vendors not only identify open source risk; they also assess that risk and provide mitigation alternatives for their customers.

So, while the SCA market categorization served its purpose for a few years, it is time to acknowledge the greater benefits that SCA vendors bring to a customer’s entire supply chain.

Data Privacy Requires Data Security, Just Ask Equifax

The following post was originally published here by Black Duck Software…

The EU’s General Data Protection Regulation (GDPR) will be enforced starting May 25, 2018. One of its goals is to better align data privacy with data security, as depicted in this simple Venn diagram:

That is, you can have data security without data privacy, but you can’t have data privacy without data security.

Equifax has painfully come to this same conclusion, and well before the May 25, 2018 date.

A Little History on Data Privacy Principles

Many years ago, Equifax could have successfully argued that they had complied with data privacy requirements because they had not sold consumers’ data without those consumers’ permission. That was how low the bar was set when data privacy first became an issue.

Even as long ago as 1995, one of the data privacy principles in Directive 95/46/EC required appropriate security controls when handling private data. However, data privacy had focused only on issues of consumer consent and intentional disclosure of private data; that is, until Equifax clarified for us last week that this is not enough.

Behind the Equifax Breach: A Deep Dive Into Apache Struts CVE-2017-5638

GDPR: New Requirements for Security Controls

Just like with Directive 95/46/EC, one of the data privacy principles of the GDPR requires similar security controls, but the important requirement that GDPR adds is that companies must provide evidence of those security controls.

Certainly, GDPR regulators will want to see evidence of security controls, but even companies that are not directly targets of regulators will be required to produce such evidence to their customers if any company downstream in their supply chain perceives themselves to be a target of regulators. Evidence of security controls will be a condition of doing business.

The Equifax breach makes clear in a visceral way what the GDPR will make clear through regulations: the consequences to the private individual are just as damaging, if not more, when their private data is breached compared to when it is sold to an unauthorized party. Just ask the 140 million individuals in Equifax’s database.

David Znidarsic is the founder and president of Stairstep Consulting, where he provides intellectual property consultation services ranging from IP forensics, M&A diligence, information security management, open source usage management, and license management. Learn more about David and Stairstep Consulting at www.stairstepconsulting.com

Compliant? Sure, But With What?

The following post was originally published here by Black Duck Software…

The term compliance is used more and more in business. Some job titles even include the term: VP of Compliance, Compliance Officer, Compliance Manager. Usually these roles have focused on the legal and operational requirements imposed by external groups like licensors and regulatory agencies.

While abiding by such external requirements is the cost of doing business, you give up control of your business or product development by only following the requirements of others and not establishing your own policies and complying with them.

Limited Scope

Let’s look at how the term “compliance” has been used to limit the scope of open source governance.

Open source compliance has been narrowly interpreted to mean that one must abide by the open source author’s license terms. Indeed, that will always be a requirement, but consider that an open source author’s work is replacing the work of one of your own software engineers.

If the only hurdle to cross before using open source is to be compliant with the author’s license terms, that is like saying you fully trust all the code developed by one of your software engineers if and only if your management meets its legal requirements during the hiring and employment of that engineer!

A Question of Trust?

While that seems preposterous, in practice, you probably impose many more requirements on the work product of your own engineers than on the work product of open source authors. Is it your intention to trust open source authors more than your own employees? The assumptions you might be making are:

(a) every open source project is staffed by many more development, testing, and maintenance engineers than your company can deploy to solve the same problem, and

(b) those engineers know and have fixed all security vulnerabilities.

However, www.openhub.com shows that these assumptions might be true for some open source projects, but not all. Therefore, unless your product teams perform the appropriate due diligence, they won’t know whether their assumptions are valid.

Explore projects in OpenHub

Open source management best practices require organizations to know the open source in their code in order to reduce risks, tighten policies, and monitor and audit for compliance and policy violations. Automating identification of all open source in use allows development and license teams to quickly gain visibility into any known open source security vulnerabilities as well as compliance issues, define and enforce open source use and risk policies, and continuously monitor for newly disclosed vulnerabilities.

Best Technology Stack Transcends Language

In Entrepreneur.com, Rahul Varshneya observes that a technology stack is often chosen by the same software or firmware developer who will be responsible for writing code in that stack’s programming language.

Who would be brave or foolish enough to recommend themselves out of a job by choosing a stack which requires expertise in a language they do not understand? Mr. Varshneya warns you to use an evaluator unbiased towards programming language.

This is because the programming language should only be one of the criteria when choosing a technology stack. However, even if an unbiased evaluator chooses a stack that meets the current and future technical needs of your company and uses the correct programming language, they can still make a wrong choice if the technology stack supplier is not right for your company.

Often evaluators choose a technology stack containing non-commercial software components that have been developed by open source authors. The additional challenge is to choose these open source “suppliers” based on your non-functional requirements.

Does your evaluator consider the security vulnerabilities that have been disclosed for each component of the stack they choose? Do they know if anyone is working on that open source component? Even if enough people are working on the open source component, how active are they? Are they making fixes, making scalability improvements, and plugging security and data privacy holes that you would expect from your own developers, or are they only adding fun-to-develop features?

Make sure you and your evaluator choose your open source technology stack suppliers based on all the same criteria you would apply if you were to hire an employee or outsourcer to develop those components for you.

We are all now in a Regulated Industry

For many years, a small minority of companies were considered to be in a regulated industry: medical, financial, automotive, etc. Those of us not in one of those industries looked at those companies from afar with envy and pity: how are they able to produce what they produce under the weight of those regulations?

Starting May 25, 2018, we will all be in a regulated industry. Those companies who do business in the EU and UK (and thus process data identifying their citizens) will be required to comply with the General Data Protection Regulation.

The data privacy principles espoused by the GDPR are not much different than those in the Directive 95/46/EC from 1995. However, the EU has concluded that nicely asking companies for 22 years to abide by those directives has not achieved the data privacy they require for their citizens. Therefore, creating the GDPR has given teeth to regulators in the EU and UK to enforce their data privacy principles and thus brings us all into a regulated industry.

Web APIs are the New Open Source Software

If you are relaxing because you have your open source usage under control, beware. There is another increasingly common type of ungoverned third-party code that your engineers are using in your products: Web APIs.

There are many Web APIs published that, like open source software, are free of cost, readily available, provide great value, but are not free of obligations or risks. For example, https://www.programmableweb.com/api/keystroke-resolver is a Web API for mapping keystrokes from one type of keyboard to another. Perhaps useful, but what is this open source service doing with those keystrokes? Retaining them (if so, in what country)? Selling them? Marketing to your customers based on them?

Sometimes Web APIs are available to you as part of your license for a commercial software product or service. For example, you can build your own web applications using DocuSign’s published Web APIs. Use of those APIs is covered by your DocuSign license and access to them is only available to holders of an API key issued by DocuSign to paid licensees. However, even these commercial Web APIs have pitfalls for the products and services that use them.

Mistaken Assumptions About Web APIs | Non-Commercial Web APIs | Commercial Web APIs
API terms of use will remain same | Maybe Not | Probably
API implementation will remain same | No | No
API interface will remain same | Maybe Not | Probably
API will process data locally | No | No
API will be hosted in same legal jurisdiction | Maybe Not | Maybe Not
API will be available 100% of time | No | No
API has an SLA | No | Maybe Not

The Web API author’s ability to change the API instantaneously is good if they fix bugs and security vulnerabilities. But it is bad if they just as instantaneously introduce new bugs and vulnerabilities, and bad if they change the functionality or interface in a way that breaks your application. You have no control over whether or not you use those daily changes because you’re always using their current implementation.

Even if the Web API uses strong encryption for data in transit between your application and their server, some of this data might be personally identifiable information. That information is not only sent over a public network, but may even be sent to another country.

Here is an example of a Web API. The current weather at a particular latitude and longitude can be found using the following URL (visit it yourself to see the results):

https://api.weatherbit.io/v2.0/current?lat=48.8583701&lon=2.2922873&key=876daf42ac7f4488956caf9011a83212

If I were a French citizen visiting a web page that uses the weatherbit.io Web API to find out the weather at my current location, my latitude and longitude would be sent to their server in New Jersey, USA. That is certainly a data privacy concern.

To take it a step further, what Web APIs hosted by yet other parties might weatherbit.io be calling to map the latitude and longitude to my time zone? to my city? to my state? to my country?

This is another example of the newest technology being adopted by organizations before management knows about it or can govern it. This is what happened with Shadow IT. Then Shadow Engineering emerged when software developers started using open source without permission from their management or procurement departments. Now, shadow web development via Web APIs is an increasingly common way for programmers to efficiently build web applications. Today, a web application is a composition of proprietary code, outsourced code, open source code, and open source online services accessed via Web APIs. You must understand and manage the provenance of each of these components.

Assume Every Application is an On-Premises Application

We feel the need to label applications as either on-premises or cloud.

We try to assure ourselves that an application categorized as on-premises will not send or receive data over a public network, and an application categorized as cloud will not install client resources.

However, the reality is that most applications categorized as cloud require resources to be installed on the client, and sometimes install those resources silently.

This is usually because browsers and HTML aren’t powerful enough to drive the complexity required by those applications.

Therefore, applications categorized as cloud sometimes require native browser plugins, agents, or beacons. Sometimes they require native applications that supplement the browser client, like update utilities, upload utilities, etc. Sometimes the only client is a native application, as is the case with mobile apps.

Installing any of these requires explicit action on the part of IT or the user, but these installations are often overlooked as requirements because the application is categorized as “cloud”.

Cookies, web storage, and JavaScript are examples of client-side resources installed without explicit IT or user action. Web storage is becoming more prevalent and harder to manage. It started with local shared objects (aka Flash cookies) and it continues to expand via standards like IndexedDB and proprietary client-side storage methods used by Internet service providers.

So if prevention or knowledge of an application’s required client-side installations is important to you, you need to do a technical analysis of what is and what is not installed; don’t rely on marketing materials and naïve categorizations. In the absence of such an analysis, assume every application you use requires some type of client-side installation.

Assume Every Application is a Cloud Application

We feel the need to label applications as either on-premises or cloud.

We try to assure ourselves that an application categorized as on-premises will not send or receive data over a public network, and an application categorized as cloud will not install client resources.

However, the reality is that most applications categorized as on-premises send data to and receive data from the Internet.

This is usually because most applications rely on highly dynamic content that must be installed and then frequently updated on the client device or computer.

Certainly most mobile applications are just thick native clients that access one or more on-line services. Just look at the apps on your phone and tablet and guess which features, if any, of each of those apps will work if you don’t have a data connection.

Desktop and server applications also often need cloud services to function: zip code to city lookups pass your location to an Internet service; desktop publishing templates, clip art, and help system content are now all accessed remotely; and some applications even “outsource” complex computations to cloud services, sending your data outside your organization.

So if prevention or knowledge of an application’s online access is important to you, you need to do a technical analysis of what is and what is not accessed; don’t rely on marketing materials and naïve categorizations. In the absence of such an analysis, assume every application you use is sending data to and receiving data from the Internet.