pyxx.files.TextFile

class pyxx.files.TextFile(path: str | Path | None = None, comment_chars: Tuple[str, ...] | str | None = None)

Bases: File

Base class for processing text files

This class can be used to represent text files (that is, files with a series of ASCII-based characters as content, that can be opened and read with an editor such as Notepad). It provides the capability to read/write text files and perform processing operations such as removing commented lines.

Attributes

comment_chars

A tuple of all characters considered to denote comments

contents

A reference to a list containing the (potentially modified) file content of each line of the file

line_ending

The character(s) used to denote the end of lines in the text file

raw_contents

A copy of the raw file content

trailing_newline

Whether the original file had a newline at the end of the file

Methods

__init__([path, comment_chars])

Define a text file

clean_contents([remove_comments, ...])

Clean contents in-place

overwrite([prologue, epilogue, line_ending])

Write data in contents to the file specified by path

parse()

Parses the data in contents and stores it in class attributes

read([path, parse])

Read file from disk

set_contents(contents, trailing_newline[, ...])

Add data to the contents list

update_contents()

Updates the contents list based on object attributes

write(output_file[, write_mode, ...])

Write file to disk

Inherited Attributes

hashes

A copy of the dictionary containing any file hashes previously computed for the file specified by the path attribute

path

Path describing the location of the file on the disk

Inherited Methods

clear_file_hashes()

Clears any stored file hashes

compute_file_hashes([hash_functions, store])

Computes hashes of the file specified by the path attribute

has_changed()

Returns whether the file specified by the path attribute has changed since the last time file hashes were computed

set_read_metadata([path])

Configures metadata related to file to be read from disk

store_file_hashes([hash_functions])

Computes and stores hashes of the file specified by the path attribute

track_new_file(path[, hash_functions])

Shortcut for simultaneously modifying the path attribute and storing file hashes

__init__(path: str | Path | None = None, comment_chars: Tuple[str, ...] | str | None = None) → None

Define a text file

Creates an object that represents and can be used to process a text file.

Parameters:
  • path (str or pathlib.Path, optional) – Location of the text file in the file system (default is None)

  • comment_chars (tuple or str, optional) – Character(s) considered to represent comments in the text file (default is None, which considers no characters to denote comments in the file)

Notes

Passing an empty string ('') or empty tuple (()) as the comment_chars argument is equivalent to passing None (or not providing this argument) – in all these cases, the file will be considered to have no characters denoting comments.
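The equivalence described in the note can be illustrated with a small sketch. The helper name normalize_comment_chars is hypothetical and is not part of pyxx; it only shows one way the three "empty" inputs could collapse to the same value, and it assumes a non-empty string is treated as a single comment marker.

```python
from typing import Optional, Tuple, Union


def normalize_comment_chars(
    comment_chars: Union[Tuple[str, ...], str, None],
) -> Optional[Tuple[str, ...]]:
    """Hypothetical helper, not part of pyxx: collapse "empty" inputs to None."""
    if comment_chars is None or len(comment_chars) == 0:
        return None                  # None, '' and () all mean "no comments"
    if isinstance(comment_chars, str):
        return (comment_chars,)      # assumption: one string is one marker
    return tuple(comment_chars)


print(normalize_comment_chars(None))         # None
print(normalize_comment_chars(''))           # None
print(normalize_comment_chars(()))           # None
print(normalize_comment_chars('#'))          # ('#',)
print(normalize_comment_chars(('#', '//')))  # ('#', '//')
```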

property comment_chars: Tuple[str, ...] | None

A tuple of all characters considered to denote comments

property contents: List[str]

A reference to a list containing the (potentially modified) file content of each line of the file

Warning

This attribute returns the list by reference. This means that if you set a variable equal to this reference, then editing this variable will edit the contents attribute (e.g., if you set my_content = MyTextFile.contents, then editing my_content will change the content stored in MyTextFile).

Notes

To replace the stored contents, do not assign to this attribute directly (i.e., don’t use code similar to MyTextFile.contents = ['line1', 'line2', 'line3']). Instead, use the set_contents() method, as it offers greater control over whether the contents are passed by reference or value.
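The by-reference behavior described in the warning can be reproduced with plain Python lists. The class AliasDemoFile below is a minimal stand-in written for this illustration, not pyxx.files.TextFile; it only assumes that the property returns the internal list object itself.

```python
from typing import List


class AliasDemoFile:
    """Minimal stand-in (not pyxx.files.TextFile) returning a list reference."""

    def __init__(self) -> None:
        self._contents: List[str] = ['line1', 'line2']

    @property
    def contents(self) -> List[str]:
        return self._contents        # same list object, no copy is made


my_file = AliasDemoFile()

alias = my_file.contents             # refers to the internal list
alias.append('line3')                # also changes my_file.contents

snapshot = list(my_file.contents)    # independent shallow copy
snapshot.append('line4')             # does not change my_file.contents

print(my_file.contents)              # ['line1', 'line2', 'line3']
```

Taking a copy with list(...) is the usual way to work on the lines without touching the object.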

property line_ending: str | Tuple[str, ...]

The character(s) used to denote the end of lines in the text file

This property only applies to files that were read using the read() method. After reading a file, this property stores the line ending(s) used in the file. Lines in text files can be terminated with '\n' (LF), '\r\n' (CRLF), '\r', or a combination of these characters (potentially with different line endings on different lines).

After reading a file, this property stores either a string (if every line of the file uses the same line ending) or a tuple containing all line endings encountered throughout the file.
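A rough sketch of how such a string-or-tuple value could be derived is shown below. The function detect_line_endings is hypothetical and is not how pyxx is known to do it; the ordering of the tuple and the result for text without any line ending are assumptions.

```python
import re
from typing import Tuple, Union


def detect_line_endings(text: str) -> Union[str, Tuple[str, ...]]:
    """Hypothetical helper: report the line ending(s) found in `text`."""
    # '\r\n' is listed first so a CRLF pair is not split into '\r' and '\n'
    found = re.findall(r'\r\n|\r|\n', text)
    unique = tuple(dict.fromkeys(found))   # de-duplicate, keep first-seen order
    if len(unique) == 1:
        return unique[0]                   # one consistent ending: a string
    return unique                          # mixed endings (or empty if none)


print(repr(detect_line_endings('a\nb\n')))         # '\n'
print(repr(detect_line_endings('a\r\nb\nc\r')))    # ('\r\n', '\n', '\r')
```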

property raw_contents: List[str] | None

A copy of the raw file content

If the file was read using the read() method, this attribute stores the original, unaltered contents of each line of the input file, and it returns a copy of this list of lines. If the file was not read with the read() method, this attribute stores a value of None.

property trailing_newline: bool

Whether the original file had a newline at the end of the file

clean_contents(remove_comments: bool = False, skip_full_line_comments: bool = False, strip: bool = False, concat_lines: bool = False, remove_blank_lines: bool = False) → None

Clean contents in-place

Cleans contents (removing comments, blank lines, etc.) based on user-defined rules. Modifications are made in-place (i.e., the resulting content is stored in contents).

Parameters:
  • remove_comments (bool, optional) – Whether to remove comments from the file (default is False)

  • skip_full_line_comments (bool, optional) – Whether to skip removing comments where the comment is the only text on a line. Only applies if remove_comments is True (default is False)

  • strip (bool, optional) – Whether to strip leading and trailing whitespace from each line (default is False)

  • concat_lines (bool, optional) – Whether to concatenate lines ending with a backslash with the following line (default is False)

  • remove_blank_lines (bool, optional) – Whether to remove lines that contain no content after other cleaning operations have completed (default is False)
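To make the effect of each flag concrete, here is a plain-Python sketch of the cleaning steps. The function clean_lines is a hypothetical stand-in, not the pyxx implementation; the order in which the steps run, and the treatment of whitespace-only lines as blank, are assumptions.

```python
from typing import List, Optional, Sequence


def clean_lines(
    lines: List[str],
    comment_chars: Sequence[str] = (),
    remove_comments: bool = False,
    skip_full_line_comments: bool = False,
    strip: bool = False,
    concat_lines: bool = False,
    remove_blank_lines: bool = False,
) -> None:
    """Hypothetical stand-in for the cleaning steps; edits `lines` in place."""
    cleaned = list(lines)

    if remove_comments and comment_chars:
        for i, line in enumerate(cleaned):
            hits = [line.find(c) for c in comment_chars if c in line]
            if not hits:
                continue
            start = min(hits)                        # earliest comment marker
            full_line = line[:start].strip() == ''   # nothing but the comment
            if not (full_line and skip_full_line_comments):
                cleaned[i] = line[:start]

    if strip:
        cleaned = [line.strip() for line in cleaned]

    if concat_lines:
        merged: List[str] = []
        pending: Optional[str] = None    # text waiting for its continuation
        for line in cleaned:
            current = line if pending is None else pending + line
            if current.endswith('\\'):
                pending = current[:-1]   # drop the backslash, keep collecting
            else:
                merged.append(current)
                pending = None
        if pending is not None:
            merged.append(pending)       # final line ended with a backslash
        cleaned = merged

    if remove_blank_lines:
        cleaned = [line for line in cleaned if line.strip() != '']

    lines[:] = cleaned   # slice assignment keeps the caller's list object


lines = ['x = 1  # set x', '# full-line comment', '', 'a \\', 'b']
clean_lines(lines, ('#',), remove_comments=True, strip=True,
            concat_lines=True, remove_blank_lines=True)
print(lines)   # ['x = 1', 'a b']
```

With skip_full_line_comments=True, the same sketch would leave a line such as '# header' untouched while still trimming trailing comments from other lines.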

overwrite(prologue: str = '', epilogue: str | None = None, line_ending: str = '\n') → None

Write data in contents to the file specified by path

Writes the lines of content in the contents attribute to the (previously-defined) file specified by the path attribute, suppressing warnings before overwriting the file. This is useful for cases when the file contents are manually populated and it is desired to “dump” them to a file. This method is also useful if a file’s contents need to be updated periodically based on the results of another process.

Parameters:
  • prologue (str, optional) – Content written at beginning of file (default is '')

  • epilogue (str, optional) – Content written at end of file (default is to use the value of the line_ending argument if trailing_newline is True and '' otherwise)

  • line_ending (str, optional) – String written at the end of each line when writing file content (default is '\n')
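The interaction between prologue, epilogue, line_ending and trailing_newline can be sketched as a single string-assembly step. The helper compose_file_text is hypothetical; in particular, joining the lines with line_ending and then appending the epilogue is an assumption about how the pieces fit together, not a statement of what pyxx does internally.

```python
from typing import List, Optional


def compose_file_text(
    contents: List[str],
    trailing_newline: bool,
    prologue: str = '',
    epilogue: Optional[str] = None,
    line_ending: str = '\n',
) -> str:
    """Hypothetical helper showing one plausible assembly order."""
    if epilogue is None:
        # Documented default: line_ending if trailing_newline, else ''
        epilogue = line_ending if trailing_newline else ''
    # Assumption: line_ending separates lines and the epilogue closes the file
    return prologue + line_ending.join(contents) + epilogue


print(repr(compose_file_text(['a', 'b'], trailing_newline=True)))   # 'a\nb\n'
print(repr(compose_file_text(['a', 'b'], trailing_newline=False)))  # 'a\nb'
```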

parse() → None

Parses the data in contents and stores it in class attributes

This method by default does nothing. However, it is intended that subclasses of TextFile should override this method and define file-specific behavior in this method for extracting data from the file and storing it in custom object attributes.

For example, if defining a CSV-parser, the parse() method might parse data from the file and store it as a NumPy array.
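The override pattern might look like the following sketch. ParseBaseStandIn is a minimal stand-in for the base class (it is not pyxx.files.TextFile), and KeyValueFile with its settings attribute is a hypothetical subclass chosen to keep the example free of third-party dependencies.

```python
from typing import Dict, List


class ParseBaseStandIn:
    """Minimal stand-in for the base class (not pyxx.files.TextFile)."""

    def __init__(self) -> None:
        self.contents: List[str] = []

    def parse(self) -> None:
        """Base behavior: do nothing."""


class KeyValueFile(ParseBaseStandIn):
    """Hypothetical subclass that parses 'key=value' lines."""

    def __init__(self) -> None:
        super().__init__()
        self.settings: Dict[str, str] = {}

    def parse(self) -> None:
        self.settings = {}
        for line in self.contents:
            if '=' not in line:
                continue                     # ignore lines with no separator
            key, value = line.split('=', 1)
            self.settings[key.strip()] = value.strip()


config = KeyValueFile()
config.contents = ['name = demo', 'mode=fast', 'not a setting']
config.parse()
print(config.settings)   # {'name': 'demo', 'mode': 'fast'}
```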

clear_file_hashes() → None

Clears any stored file hashes

compute_file_hashes(hash_functions: tuple | str = ('md5', 'sha256'), store: bool = False) → Dict[str, str]

Computes hashes of the file specified by the path attribute

Computes and returns the hashes of the file specified by the path attribute, with the option to populate the hashes dictionary with their values.

Parameters:
  • hash_functions (tuple or str, optional) – Tuple of strings (or individual string) specifying which hash(es) to compute. Any hash functions supported by hashlib can be used. Default is ('md5', 'sha256')

  • store (bool, optional) – Whether to store the computed hashes in the hashes dictionary (default is False)

Returns:

A dictionary containing the file hashes specified by hash_functions

Return type:

dict

See also

pyxx.files.compute_file_hash

Function used to compute file hashes

Notes

Prior to calling this method, the path attribute must be defined. To simultaneously set the path attribute and store file hashes, use track_new_file().
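For readers unfamiliar with hashlib, the kind of dictionary this method returns can be approximated with the standard library alone. The function compute_hashes below is a hypothetical helper, not pyxx.files.compute_file_hash; it assumes the dictionary is keyed by hash name and holds hexadecimal digests.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Tuple, Union


def compute_hashes(
    path: Path,
    hash_functions: Union[Tuple[str, ...], str] = ('md5', 'sha256'),
) -> Dict[str, str]:
    """Hypothetical helper: hex digests of a file, keyed by hash name."""
    if isinstance(hash_functions, str):
        hash_functions = (hash_functions,)   # accept a single name as well
    data = path.read_bytes()   # fine for small files; read in chunks otherwise
    return {name: hashlib.new(name, data).hexdigest() for name in hash_functions}


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.txt'
    sample.write_bytes(b'hello\n')
    hashes = compute_hashes(sample)
    sha_only = compute_hashes(sample, 'sha256')

print(sorted(hashes))   # ['md5', 'sha256']
```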

has_changed() → bool

Returns whether the file specified by the path attribute has changed since the last time file hashes were computed

Returns:

Whether file has changed since the last time file hashes were computed

Return type:

bool
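Conceptually, a change check of this kind compares a stored digest with a freshly computed one. The sketch below uses only hashlib and a temporary file; the helper sha256_of is hypothetical, and restricting the comparison to a single SHA-256 digest is a simplification for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hypothetical helper: SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    tracked = Path(tmp) / 'data.txt'
    tracked.write_text('first version\n')

    stored = sha256_of(tracked)                 # digest recorded earlier
    unchanged = sha256_of(tracked) != stored    # False: nothing was edited

    tracked.write_text('second version\n')
    changed = sha256_of(tracked) != stored      # True: content now differs

print(unchanged, changed)   # False True
```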

property hashes: Dict[str, str]

A copy of the dictionary containing any file hashes previously computed for the file specified by the path attribute

property path: Path | None

Path describing the location of the file on the disk

Assigning a value to this attribute (regardless of whether it matches the current value or is a different path) will save the value as a pathlib.Path and will automatically clear any saved file hashes.
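The coupling between path assignment and hash clearing can be pictured with a property setter. PathDemoFile is a minimal stand-in written for this illustration, not pyxx.files.File; writing to the private _hashes dictionary is done here only to simulate previously stored hashes.

```python
from pathlib import Path
from typing import Dict, Optional, Union


class PathDemoFile:
    """Minimal stand-in (not pyxx.files.File) for the path/hashes coupling."""

    def __init__(self) -> None:
        self._path: Optional[Path] = None
        self._hashes: Dict[str, str] = {}

    @property
    def path(self) -> Optional[Path]:
        return self._path

    @path.setter
    def path(self, value: Union[str, Path]) -> None:
        self._path = Path(value)   # always stored as pathlib.Path
        self._hashes.clear()       # cleared even if the path did not change

    @property
    def hashes(self) -> Dict[str, str]:
        return dict(self._hashes)  # callers receive a copy


f = PathDemoFile()
f.path = 'notes.txt'
f._hashes['md5'] = 'placeholder-digest'   # pretend hashes were stored
f.path = 'notes.txt'                      # same path, hashes still cleared
print(f.path, f.hashes)                   # notes.txt {}
```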

read(path: str | Path | None = None, parse: bool = True) → None

Read file from disk

Calling this method reads the file specified by the path attribute from the disk, populating contents and raw_contents. Additionally, the file hashes stored in the hashes attribute are updated (to make it easier to check if the file has been modified later).

Parameters:
  • path (str or pathlib.Path, optional) – Location of the text file in the file system (default is None)

  • parse (bool, optional) – Whether to call the parse() method after reading the file (default is True)

set_read_metadata(path: str | Path | None = None) → None

Configures metadata related to file to be read from disk

This method performs several pre-processing steps to prepare to read a file from the disk:

  1. Sets the path attribute. If the path argument was provided, the attribute is set to this value; otherwise, the existing value stored in the path attribute is used (or an error is thrown if not defined).

  2. Verifies that the file specified by the path attribute exists.

  3. Stores the hashes for the file.

It is advised that this method be called prior to reading any file.

Parameters:

path (str or pathlib.Path, optional) – Location of the file in the file system (default is None)

Raises:
  • AttributeError – If both the path argument and the existing path attribute are None

  • FileNotFoundError – If the file specified by path (after completing Step 1 above) does not exist
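The three steps and the two documented exceptions can be sketched as follows. ReadableFileStandIn is a minimal stand-in, not pyxx.files.File; storing only a SHA-256 digest and the wording of the error messages are assumptions made to keep the example short.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Optional, Union


class ReadableFileStandIn:
    """Minimal stand-in (not pyxx.files.File) for the three steps."""

    def __init__(self, path: Union[str, Path, None] = None) -> None:
        self.path: Optional[Path] = None if path is None else Path(path)
        self.hashes: Dict[str, str] = {}

    def set_read_metadata(self, path: Union[str, Path, None] = None) -> None:
        # Step 1: prefer the argument, else fall back to the attribute
        if path is not None:
            self.path = Path(path)
        if self.path is None:
            raise AttributeError('no path argument and no path attribute')
        # Step 2: the file must exist
        if not self.path.is_file():
            raise FileNotFoundError(f'file not found: {self.path}')
        # Step 3: store hashes (only SHA-256 here, to keep the sketch short)
        digest = hashlib.sha256(self.path.read_bytes()).hexdigest()
        self.hashes = {'sha256': digest}


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / 'input.txt'
    existing.write_text('data\n')
    missing_path = Path(tmp) / 'missing.txt'

    reader = ReadableFileStandIn()
    reader.set_read_metadata(existing)

print(sorted(reader.hashes))   # ['sha256']
```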

store_file_hashes(hash_functions: tuple | str = ('md5', 'sha256')) → None

Computes and stores hashes of the file specified by the path attribute

Computes given hashes of the file specified by the path attribute and populates the hashes dictionary with their values.

Parameters:

hash_functions (tuple or str, optional) – Tuple of strings (or individual string) specifying which hash(es) to compute. Any hash functions supported by hashlib can be used. Default is ('md5', 'sha256')

See also

pyxx.files.compute_file_hash

Function used to compute file hashes

track_new_file

Use this method if you want to store file hashes but the path attribute isn’t yet defined

Notes

Prior to calling this method, the path attribute must be defined. To simultaneously set the path attribute and store file hashes, use track_new_file().

track_new_file(path: str | Path, hash_functions: tuple | str = ('md5', 'sha256')) → None

Shortcut for simultaneously modifying the path attribute and storing file hashes

This method functions as a “shortcut,” both modifying the path attribute and storing an optionally user-specified list of file hashes in the hashes attribute. The intention of this method is that if a File instance is tracking a given file, and the user wants to switch to tracking another file, this provides a convenient way to do so with a single line of code.

Parameters:
  • path (str or pathlib.Path) – File that the object is to represent

  • hash_functions (tuple or str, optional) – Tuple of strings (or individual string) specifying which hash(es) to compute. Any hash functions supported by hashlib can be used. Default is ('md5', 'sha256')

See also

pyxx.files.compute_file_hash

Function used to compute file hashes

set_contents(contents: List[str], trailing_newline: bool, pass_by_reference: bool = False) → None

Add data to the contents list

Allows users to manually fill the contents list with user-defined content. The input list must be a list of strings, and the user can optionally choose whether to pass the input by reference or value.

Parameters:
  • contents (list) – List of strings which are to be assigned to the contents list

  • trailing_newline (bool) – Whether the contents being added represent a file with a trailing newline (because the file wasn’t read, the object has no way to determine whether the file has a trailing newline, so users must provide this information)

  • pass_by_reference (bool, optional) – Whether to pass the contents argument by reference (default is False)

Notes

When contents is passed by reference, subsequent changes made to the original contents object are reflected in the contents attribute. When it is passed by value, a copy of the contents argument is made, so changing the object outside the class instance does not affect the contents attribute.
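The difference between the two modes can be demonstrated with a short sketch. ContentsStandIn is a minimal stand-in, not pyxx.files.TextFile; it assumes that "by value" means a shallow copy of the list and omits any input validation.

```python
from typing import List


class ContentsStandIn:
    """Minimal stand-in (not pyxx.files.TextFile) for the two passing modes."""

    def __init__(self) -> None:
        self.contents: List[str] = []
        self.trailing_newline: bool = False

    def set_contents(self, contents: List[str], trailing_newline: bool,
                     pass_by_reference: bool = False) -> None:
        # By reference: keep the caller's list; by value: store a copy
        self.contents = contents if pass_by_reference else list(contents)
        self.trailing_newline = trailing_newline


source = ['a', 'b']

by_value = ContentsStandIn()
by_value.set_contents(source, trailing_newline=True)

by_reference = ContentsStandIn()
by_reference.set_contents(source, trailing_newline=True, pass_by_reference=True)

source.append('c')              # edit the original list afterwards

print(by_value.contents)        # ['a', 'b']
print(by_reference.contents)    # ['a', 'b', 'c']
```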

update_contents() → None

Updates the contents list based on object attributes

This method by default does nothing. However, it is intended that subclasses of TextFile should override this method and define file-specific behavior in this method for converting custom object attributes to lines of text in the file, and storing these data in contents.

For example, if defining a CSV-parser, the class might have an attribute that stores numerical data in a NumPy array, and the update_contents() method might convert the data in this array to comma-separated strings and store them in contents.
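A dependency-free version of that idea is sketched below, using nested lists instead of a NumPy array. UpdateBaseStandIn is a minimal stand-in for the base class (not pyxx.files.TextFile), and TableFile with its rows attribute is a hypothetical subclass.

```python
from typing import List


class UpdateBaseStandIn:
    """Minimal stand-in for the base class (not pyxx.files.TextFile)."""

    def __init__(self) -> None:
        self.contents: List[str] = []

    def update_contents(self) -> None:
        """Base behavior: do nothing."""


class TableFile(UpdateBaseStandIn):
    """Hypothetical subclass that serializes rows of numbers as CSV lines."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[float]] = []

    def update_contents(self) -> None:
        # One comma-separated string per row, stored back into contents
        self.contents[:] = [','.join(str(v) for v in row) for row in self.rows]


table = TableFile()
table.rows = [[1.0, 2.5], [3.0, 4.0]]
table.update_contents()
print(table.contents)   # ['1.0,2.5', '3.0,4.0']
```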

write(output_file: str | Path, write_mode: str = 'w', warn_before_overwrite: bool = True, prologue: str = '', epilogue: str | None = None, line_ending: str = '\n', update_contents: bool = True) → None

Write file to disk

Calling this method writes the file contents stored in contents to the disk.

Parameters:
  • output_file (str or pathlib.Path) – Output file to which to write content

  • write_mode (str, optional) – Any mode (such as 'w' or 'a') for the built-in open() function for writing files (default is 'w')

  • warn_before_overwrite (bool, optional) – Whether to throw an error if output_file already exists (default is True)

  • prologue (str, optional) – Content written at beginning of file (default is '')

  • epilogue (str, optional) – Content written at end of file (default is to use the value of the line_ending argument if trailing_newline is True and '' otherwise)

  • line_ending (str, optional) – String written at the end of each line when writing file content (default is '\n')

  • update_contents (bool, optional) – Whether to call the update_contents() method before writing the file (default is True)
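The overwrite guard and the write_mode argument can be illustrated with the built-in open() function. The helper write_lines is hypothetical and is not the pyxx implementation; raising FileExistsError is an assumption, since the exception type is not stated here, and the sketch always ends the file with line_ending for simplicity.

```python
import tempfile
from pathlib import Path
from typing import List


def write_lines(output_file: Path, lines: List[str], write_mode: str = 'w',
                warn_before_overwrite: bool = True,
                line_ending: str = '\n') -> None:
    """Hypothetical helper; the exception type is an assumption."""
    if warn_before_overwrite and output_file.exists():
        raise FileExistsError(f'refusing to overwrite {output_file}')
    # newline='' stops Python from translating '\n' while writing
    with open(output_file, write_mode, newline='') as stream:
        stream.write(line_ending.join(lines) + line_ending)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'out.txt'
    write_lines(target, ['a', 'b'])            # file does not exist yet

    try:
        write_lines(target, ['c'])             # guard is active by default
        blocked = False
    except FileExistsError:
        blocked = True

    # Disable the guard and append instead of replacing the file
    write_lines(target, ['c'], write_mode='a', warn_before_overwrite=False)
    final_text = target.read_bytes().decode()

print(blocked, repr(final_text))   # True 'a\nb\nc\n'
```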