New Upstream Release - python-srsly

Ready changes

Summary

Merged new upstream version: 2.4.6 (was: 2.4.5).
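
The main functional change in this release is a pair of gzipped JSONL helpers, `srsly.write_gzip_jsonl` and `srsly.read_gzip_jsonl` (see the `srsly/_json_api.py` hunks in the diff). The following is a minimal standard-library sketch of the behaviour those hunks describe, not the upstream implementation: the `*_sketch` function names are hypothetical, and `json` stands in for the `ujson` backend that srsly uses.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Any, Iterable, Iterator


def write_gzip_jsonl_sketch(path: Path, lines: Iterable[Any], append: bool = False) -> None:
    """Write one JSON document per line to a gzipped file (illustrative stand-in)."""
    mode = "ab" if append else "wb"
    with gzip.open(path, mode) as f:
        if append:
            # Mirrors append_new_line=True in the diff: a newline precedes appended lines.
            f.write(b"\n")
        for line in lines:
            f.write((json.dumps(line, separators=(",", ":")) + "\n").encode("utf-8"))


def read_gzip_jsonl_sketch(path: Path) -> Iterator[Any]:
    """Lazily yield one parsed object per non-blank line."""
    with gzip.open(path, "rb") as f:
        for raw in f:
            text = raw.decode("utf-8").strip()
            if text:  # blank lines are skipped, as the upstream docstring states
                yield json.loads(text)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.jsonl.gz"
    records = [{"hello": "world"}, {"test": 123}]
    write_gzip_jsonl_sketch(target, records)
    write_gzip_jsonl_sketch(target, records, append=True)
    loaded = list(read_gzip_jsonl_sketch(target))
```

With srsly 2.4.6 installed, the corresponding calls would be `srsly.write_gzip_jsonl(path, lines, append=True)` and `srsly.read_gzip_jsonl(path)`. The reader is a generator, so wrap it in `list(...)` when the full contents are needed at once.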

Diff

diff --git a/README.md b/README.md
index 051ae45..70cc285 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ serialization utilities we need in a single binary wheel. Currently supports **J
 Serialization is hard, especially across Python versions and multiple platforms.
 After dealing with many subtle bugs over the years (encodings, locales, large
 files) our libraries like [spaCy](https://github.com/explosion/spaCy) and
-[Prodigy](https://prodi.gy) have steadily grown a number of utility functions to
+[Prodigy](https://prodi.gy) had steadily grown a number of utility functions to
 wrap the multiple serialization formats we need to support (especially `json`,
 `msgpack` and `pickle`). These wrapping functions ended up duplicated across our
 codebases, so we wanted to put them in one place.
@@ -45,8 +45,8 @@ wheel.
 `setuptools` and `wheel` are up to date.
 
 ```bash
-pip install -U pip setuptools wheel
-pip install srsly
+python -m pip install -U pip setuptools wheel
+python -m pip install srsly
 ```
 
 Or from conda via conda-forge:
@@ -56,13 +56,38 @@ conda install -c conda-forge srsly
 ```
 
 Alternatively, you can also compile the library from source. You'll need to make
-sure that you have a development environment consisting of a Python distribution
+sure that you have a development environment with a Python distribution
 including header files, a compiler (XCode command-line tools on macOS / OS X or
-Visual C++ build tools on Windows), pip, virtualenv and git installed.
+Visual C++ build tools on Windows), pip and git installed.
+
+Install from source:
 
 ```bash
-pip install -r requirements.txt  # install development dependencies
-python setup.py build_ext --inplace  # compile the library
+# clone the repo
+git clone https://github.com/explosion/srsly
+cd srsly
+
+# create a virtual environment
+python -m venv .env
+source .env/bin/activate
+
+# update pip
+python -m pip install -U pip setuptools wheel
+
+# compile and install from source
+python -m pip install .
+```
+
+For developers, install requirements separately and then install in editable
+mode without build isolation:
+
+```bash
+# install in editable mode
+python -m pip install -r requirements.txt
+python -m pip install --no-build-isolation --editable .
+
+# run test suite
+python -m pytest --pyargs srsly
 ```
 
 ## API
@@ -111,11 +136,11 @@ data = {"foo": "bar", "baz": 123}
 srsly.write_json("/path/to/file.json", data)
 ```
 
-| Argument   | Type         | Description                                            |
-| ---------- | ------------ | ------------------------------------------------------ |
-| `path` | str / `Path` | The file path or `"-"` to write to stdout.             |
-| `data`     | -            | The JSON-serializable data to output.                  |
-| `indent`   | int          | Number of spaces used to indent JSON. Defaults to `2`. |
+| Argument | Type         | Description                                            |
+| -------- | ------------ | ------------------------------------------------------ |
+| `path`   | str / `Path` | The file path or `"-"` to write to stdout.             |
+| `data`   | -            | The JSON-serializable data to output.                  |
+| `indent` | int          | Number of spaces used to indent JSON. Defaults to `2`. |
 
 #### <kbd>function</kbd> `srsly.read_json`
 
@@ -127,7 +152,7 @@ data = srsly.read_json("/path/to/file.json")
 
 | Argument    | Type         | Description                                |
 | ----------- | ------------ | ------------------------------------------ |
-| `path`  | str / `Path` | The file path or `"-"` to read from stdin. |
+| `path`      | str / `Path` | The file path or `"-"` to read from stdin. |
 | **RETURNS** | dict / list  | The loaded JSON content.                   |
 
 #### <kbd>function</kbd> `srsly.write_gzip_json`
@@ -139,11 +164,27 @@ data = {"foo": "bar", "baz": 123}
 srsly.write_gzip_json("/path/to/file.json.gz", data)
 ```
 
-| Argument   | Type         | Description                                            |
-| ---------- | ------------ | ------------------------------------------------------ |
-| `path` | str / `Path` | The file path.                                         |
-| `data`     | -            | The JSON-serializable data to output.                  |
-| `indent`   | int          | Number of spaces used to indent JSON. Defaults to `2`. |
+| Argument | Type         | Description                                            |
+| -------- | ------------ | ------------------------------------------------------ |
+| `path`   | str / `Path` | The file path.                                         |
+| `data`   | -            | The JSON-serializable data to output.                  |
+| `indent` | int          | Number of spaces used to indent JSON. Defaults to `2`. |
+
+#### <kbd>function</kbd> `srsly.write_gzip_jsonl`
+
+Create a gzipped JSONL file and dump contents.
+
+```python
+data = [{"foo": "bar"}, {"baz": 123}]
+srsly.write_gzip_jsonl("/path/to/file.jsonl.gz", data)
+```
+
+| Argument          | Type         | Description                                                                                                                                                                                                             |
+| ----------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `path`            | str / `Path` | The file path.                                                                                                                                                                                                          |
+| `lines`           | -            | The JSON-serializable contents of each line.                                                                                                                                                                            |
+| `append`          | bool         | Whether or not to append to the location. Appending to .gz files is generally not recommended, as it doesn't allow the algorithm to take advantage of all data when compressing - files may hence be poorly compressed. |
+| `append_new_line` | bool         | Whether or not to write a new line before appending to the file.                                                                                                                                                        |
 
 #### <kbd>function</kbd> `srsly.read_gzip_json`
 
@@ -155,9 +196,22 @@ data = srsly.read_gzip_json("/path/to/file.json.gz")
 
 | Argument    | Type         | Description              |
 | ----------- | ------------ | ------------------------ |
-| `path`  | str / `Path` | The file path.           |
+| `path`      | str / `Path` | The file path.           |
 | **RETURNS** | dict / list  | The loaded JSON content. |
 
+#### <kbd>function</kbd> `srsly.read_gzip_jsonl`
+
+Load gzipped JSONL from a file.
+
+```python
+data = srsly.read_gzip_jsonl("/path/to/file.jsonl.gz")
+```
+
+| Argument    | Type         | Description               |
+| ----------- | ------------ | ------------------------- |
+| `path`      | str / `Path` | The file path.            |
+| **RETURNS** | dict / list  | The loaded JSONL content. |
+
 #### <kbd>function</kbd> `srsly.write_jsonl`
 
 Create a JSONL file (newline-delimited JSON) and dump contents line by line, or
@@ -170,7 +224,7 @@ srsly.write_jsonl("/path/to/file.jsonl", data)
 
 | Argument          | Type         | Description                                                                                                            |
 | ----------------- | ------------ | ---------------------------------------------------------------------------------------------------------------------- |
-| `path`        | str / `Path` | The file path or `"-"` to write to stdout.                                                                             |
+| `path`            | str / `Path` | The file path or `"-"` to write to stdout.                                                                             |
 | `lines`           | iterable     | The JSON-serializable lines.                                                                                           |
 | `append`          | bool         | Append to an existing file. Will open it in `"a"` mode and insert a newline before writing lines. Defaults to `False`. |
 | `append_new_line` | bool         | Defines whether a new line should first be written when appending to an existing file. Defaults to `True`.             |
@@ -186,7 +240,7 @@ data = srsly.read_jsonl("/path/to/file.jsonl")
 
 | Argument   | Type       | Description                                                          |
 | ---------- | ---------- | -------------------------------------------------------------------- |
-| `path` | str / Path | The file path or `"-"` to read from stdin.                           |
+| `path`     | str / Path | The file path or `"-"` to read from stdin.                           |
 | `skip`     | bool       | Skip broken lines and don't raise `ValueError`. Defaults to `False`. |
 | **YIELDS** | -          | The loaded JSON contents of each line.                               |
 
@@ -247,10 +301,10 @@ data = {"foo": "bar", "baz": 123}
 srsly.write_msgpack("/path/to/file.msg", data)
 ```
 
-| Argument   | Type         | Description            |
-| ---------- | ------------ | ---------------------- |
-| `path` | str / `Path` | The file path.         |
-| `data`     | -            | The data to serialize. |
+| Argument | Type         | Description            |
+| -------- | ------------ | ---------------------- |
+| `path`   | str / `Path` | The file path.         |
+| `data`   | -            | The data to serialize. |
 
 #### <kbd>function</kbd> `srsly.read_msgpack`
 
@@ -262,7 +316,7 @@ data = srsly.read_msgpack("/path/to/file.msg")
 
 | Argument    | Type         | Description                                                                             |
 | ----------- | ------------ | --------------------------------------------------------------------------------------- |
-| `path`  | str / `Path` | The file path.                                                                          |
+| `path`      | str / `Path` | The file path.                                                                          |
 | `use_list`  | bool         | Don't use tuples instead of lists. Can make deserialization slower. Defaults to `True`. |
 | **RETURNS** | -            | The loaded and deserialized content.                                                    |
 
@@ -318,7 +372,7 @@ yaml_string = srsly.yaml_dumps(data)
 | ----------------- | ---- | ------------------------------------------ |
 | `data`            | -    | The JSON-serializable data to output.      |
 | `indent_mapping`  | int  | Mapping indentation. Defaults to `2`.      |
-| `indent_sequence` | int  | Sequence indentation. Defaults to `4`.      |
+| `indent_sequence` | int  | Sequence indentation. Defaults to `4`.     |
 | `indent_offset`   | int  | Indentation offset. Defaults to `2`.       |
 | `sort_keys`       | bool | Sort dictionary keys. Defaults to `False`. |
 | **RETURNS**       | str  | The serialized string.                     |
@@ -348,10 +402,10 @@ srsly.write_yaml("/path/to/file.yml", data)
 
 | Argument          | Type         | Description                                |
 | ----------------- | ------------ | ------------------------------------------ |
-| `path`        | str / `Path` | The file path or `"-"` to write to stdout. |
+| `path`            | str / `Path` | The file path or `"-"` to write to stdout. |
 | `data`            | -            | The JSON-serializable data to output.      |
 | `indent_mapping`  | int          | Mapping indentation. Defaults to `2`.      |
-| `indent_sequence` | int          | Sequence indentation. Defaults to `4`.      |
+| `indent_sequence` | int          | Sequence indentation. Defaults to `4`.     |
 | `indent_offset`   | int          | Indentation offset. Defaults to `2`.       |
 | `sort_keys`       | bool         | Sort dictionary keys. Defaults to `False`. |
 
@@ -365,7 +419,7 @@ data = srsly.read_yaml("/path/to/file.yml")
 
 | Argument    | Type         | Description                                |
 | ----------- | ------------ | ------------------------------------------ |
-| `path`  | str / `Path` | The file path or `"-"` to read from stdin. |
+| `path`      | str / `Path` | The file path or `"-"` to read from stdin. |
 | **RETURNS** | dict / list  | The loaded YAML content.                   |
 
 #### <kbd>function</kbd> `srsly.is_yaml_serializable`
diff --git a/azure-pipelines.yml b/azure-pipelines.yml
index c7c6887..0ca7259 100644
--- a/azure-pipelines.yml
+++ b/azure-pipelines.yml
@@ -3,6 +3,10 @@ trigger:
   branches:
     include:
     - '*'
+pr:
+  paths:
+    exclude:
+      - "*.md"
 
 jobs:
 
@@ -10,7 +14,7 @@ jobs:
   strategy:
     matrix:
       Python36Linux:
-        imageName: 'ubuntu-latest'
+        imageName: 'ubuntu-20.04'
         python.version: '3.6'
       Python36Windows:
         imageName: 'windows-2019'
@@ -44,13 +48,13 @@ jobs:
         python.version: '3.10'
       Python311Linux:
         imageName: 'ubuntu-latest'
-        python.version: '3.11.0-rc.2'
+        python.version: '3.11'
       Python311Windows:
         imageName: 'windows-latest'
-        python.version: '3.11.0-rc.2'
+        python.version: '3.11'
       Python311Mac:
         imageName: 'macos-latest'
-        python.version: '3.11.0-rc.2'
+        python.version: '3.11'
     maxParallel: 4
   pool:
     vmImage: $(imageName)
diff --git a/debian/changelog b/debian/changelog
index 0ef2588..816396d 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,10 +1,14 @@
-python-srsly (2.4.5-2) UNRELEASED; urgency=medium
+python-srsly (2.4.6-1) UNRELEASED; urgency=medium
 
+  [ Andreas Tille ]
   * Run build-time test properly
   * Standards-Version: 4.6.2 (routine-update)
   * Testsuite: autopkgtest-pkg-python (routine-update)
 
- -- Andreas Tille <tille@debian.org>  Sat, 14 Jan 2023 08:03:33 +0100
+  [ Debian Janitor ]
+  * New upstream release.
+
+ -- Andreas Tille <tille@debian.org>  Fri, 10 Mar 2023 22:35:08 -0000
 
 python-srsly (2.4.5-1) unstable; urgency=medium
 
diff --git a/debian/patches/reorder_setup.py b/debian/patches/reorder_setup.py
index 78e8dc7..a39a485 100644
--- a/debian/patches/reorder_setup.py
+++ b/debian/patches/reorder_setup.py
@@ -2,8 +2,10 @@ Author: Andreas Tille <tille@debian.org>
 Last-Update: Tue, 29 Nov 2022 16:09:55 +0100
 Description: Fix sequence of imports
 
---- a/setup.py
-+++ b/setup.py
+Index: python-srsly.git/setup.py
+===================================================================
+--- python-srsly.git.orig/setup.py
++++ python-srsly.git/setup.py
 @@ -1,8 +1,8 @@
  #!/usr/bin/env python
  import sys
diff --git a/srsly/__init__.py b/srsly/__init__.py
index 245286b..0c94d82 100644
--- a/srsly/__init__.py
+++ b/srsly/__init__.py
@@ -1,4 +1,5 @@
 from ._json_api import read_json, read_gzip_json, write_json, write_gzip_json
+from ._json_api import read_gzip_jsonl, write_gzip_jsonl
 from ._json_api import read_jsonl, write_jsonl
 from ._json_api import json_dumps, json_loads, is_json_serializable
 from ._msgpack_api import read_msgpack, write_msgpack, msgpack_dumps, msgpack_loads
diff --git a/srsly/_json_api.py b/srsly/_json_api.py
index 900e42b..24d25fd 100644
--- a/srsly/_json_api.py
+++ b/srsly/_json_api.py
@@ -1,4 +1,4 @@
-from typing import Union, Iterable, Sequence, Any, Optional
+from typing import Union, Iterable, Sequence, Any, Optional, Iterator
 import sys
 import json as _builtin_json
 import gzip
@@ -56,14 +56,27 @@ def read_json(path: FilePath) -> JSONOutput:
 def read_gzip_json(path: FilePath) -> JSONOutput:
     """Load JSON from a gzipped file.
 
-        location (FilePath): The file path.
-        RETURNS (JSONOutput): The loaded JSON content.
+    location (FilePath): The file path.
+    RETURNS (JSONOutput): The loaded JSON content.
     """
     file_path = force_string(path)
     with gzip.open(file_path, "r") as f:
         return ujson.load(f)
 
 
+def read_gzip_jsonl(path: FilePath, skip: bool = False) -> Iterator[JSONOutput]:
+    """Read a gzipped .jsonl file and yield contents line by line.
+    Blank lines will always be skipped.
+
+    path (FilePath): The file path.
+    skip (bool): Skip broken lines and don't raise ValueError.
+    YIELDS (JSONOutput): The unpacked, deserialized Python objects.
+    """
+    with gzip.open(force_path(path), "r") as f:
+        for line in _yield_json_lines(f, skip=skip):
+            yield line
+
+
 def write_json(path: FilePath, data: JSONInput, indent: int = 2) -> None:
     """Create a .json file and dump contents or write to standard
     output.
@@ -94,6 +107,30 @@ def write_gzip_json(path: FilePath, data: JSONInput, indent: int = 2) -> None:
         f.write(json_data.encode("utf-8"))
 
 
+def write_gzip_jsonl(
+    path: FilePath,
+    lines: Iterable[JSONInput],
+    append: bool = False,
+    append_new_line: bool = True,
+) -> None:
+    """Create a .jsonl.gz file and dump contents.
+
+    path (FilePath): The file path.
+    lines (Iterable[JSONInput]): The JSON-serializable contents of each line.
+    append (bool): Whether or not to append to the location. Appending to .gz files is generally not recommended, as it
+        doesn't allow the algorithm to take advantage of all data when compressing - files may hence be poorly
+        compressed.
+    append_new_line (bool): Whether or not to write a new line before appending
+        to the file.
+    """
+    mode = "a" if append else "w"
+    file_path = force_path(path, require_exists=False)
+    with gzip.open(file_path, mode=mode) as f:
+        if append and append_new_line:
+            f.write("\n".encode("utf-8"))
+        f.writelines([(json_dumps(line) + "\n").encode("utf-8") for line in lines])
+
+
 def read_jsonl(path: FilePath, skip: bool = False) -> Iterable[JSONOutput]:
     """Read a .jsonl file or standard input and yield contents line by line.
     Blank lines will always be skipped.
diff --git a/srsly/about.py b/srsly/about.py
index 71c852b..c9e914f 100644
--- a/srsly/about.py
+++ b/srsly/about.py
@@ -1 +1 @@
-__version__ = "2.4.5"
+__version__ = "2.4.6"
diff --git a/srsly/tests/cloudpickle/cloudpickle_test.py b/srsly/tests/cloudpickle/cloudpickle_test.py
index fe0cf39..b293c53 100644
--- a/srsly/tests/cloudpickle/cloudpickle_test.py
+++ b/srsly/tests/cloudpickle/cloudpickle_test.py
@@ -870,8 +870,12 @@ class CloudPickleTest(unittest.TestCase):
 
 
     @pytest.mark.skipif(
-        platform.machine() == "aarch64" and sys.version_info[:2] >= (3, 10),
-        reason="Fails on aarch64 + python 3.10+ in cibuildwheel, currently unable to replicate failure elsewhere")
+        (platform.machine() == "aarch64" and sys.version_info[:2] >= (3, 10))
+            or platform.python_implementation() == "PyPy"
+            or (sys.version_info[:2] == (3, 10) and sys.version_info >= (3, 10, 8))
+            # Skipping tests on 3.11 due to https://github.com/cloudpipe/cloudpickle/pull/486.
+            or sys.version_info[:2] == (3, 11),
+        reason="Fails on aarch64 + python 3.10+ in cibuildwheel, currently unable to replicate failure elsewhere; fails sometimes for pypy on conda-forge; fails for python 3.10.8+ and 3.11")
     def test_builtin_classmethod(self):
         obj = 1.5  # float object
 
@@ -1470,6 +1474,7 @@ class CloudPickleTest(unittest.TestCase):
                 finally:
                     sys.modules.pop("_faulty_module", None)
 
+    @pytest.mark.skip(reason="fails for pytest v7.2.0")
     def test_dynamic_pytest_module(self):
         # Test case for pull request https://github.com/cloudpipe/cloudpickle/pull/116
         import py
@@ -1567,6 +1572,8 @@ class CloudPickleTest(unittest.TestCase):
         assert isinstance(depickled_t2, MyTuple)
         assert depickled_t2 == t2
 
+    @pytest.mark.skipif(platform.python_implementation() == "PyPy",
+        reason="fails sometimes for pypy on conda-forge")
     def test_interactively_defined_function(self):
         # Check that callables defined in the __main__ module of a Python
         # script (or jupyter kernel) can be pickled / unpickled / executed.
diff --git a/srsly/tests/test_json_api.py b/srsly/tests/test_json_api.py
index dc23952..89ce400 100644
--- a/srsly/tests/test_json_api.py
+++ b/srsly/tests/test_json_api.py
@@ -4,7 +4,14 @@ from pathlib import Path
 import gzip
 import numpy
 
-from .._json_api import read_json, write_json, read_jsonl, write_jsonl
+from .._json_api import (
+    read_json,
+    write_json,
+    read_jsonl,
+    write_jsonl,
+    read_gzip_jsonl,
+    write_gzip_jsonl,
+)
 from .._json_api import write_gzip_json, json_dumps, is_json_serializable
 from .._json_api import json_loads
 from ..util import force_string
@@ -204,3 +211,54 @@ def test_unsupported_type_error():
     f = numpy.float32()
     with pytest.raises(TypeError):
         s = json_dumps(f)
+
+
+def test_write_jsonl_gzip():
+    """Tests writing data to a gzipped .jsonl file."""
+    data = [{"hello": "world"}, {"test": 123}]
+    expected = ['{"hello":"world"}\n', '{"test":123}\n']
+
+    with make_tempdir() as temp_dir:
+        file_path = temp_dir / "tmp.json"
+        write_gzip_jsonl(file_path, data)
+        with gzip.open(file_path, "r") as f:
+            assert [line.decode("utf8") for line in f.readlines()] == expected
+
+
+def test_write_jsonl_gzip_append():
+    """Tests appending data to a gzipped .jsonl file."""
+    data = [{"hello": "world"}, {"test": 123}]
+    expected = [
+        '{"hello":"world"}\n',
+        '{"test":123}\n',
+        "\n",
+        '{"hello":"world"}\n',
+        '{"test":123}\n',
+    ]
+    with make_tempdir() as temp_dir:
+        file_path = temp_dir / "tmp.json"
+        write_gzip_jsonl(file_path, data)
+        write_gzip_jsonl(file_path, data, append=True)
+        with gzip.open(file_path, "r") as f:
+            assert [line.decode("utf8") for line in f.readlines()] == expected
+
+
+def test_read_jsonl_gzip():
+    """Tests reading data from a gzipped .jsonl file."""
+    file_contents = [{"hello": "world"}, {"test": 123}]
+    with make_tempdir() as temp_dir:
+        file_path = temp_dir / "tmp.json"
+        with gzip.open(file_path, "w") as f:
+            f.writelines(
+                [(json_dumps(line) + "\n").encode("utf-8") for line in file_contents]
+            )
+        assert file_path.exists()
+        data = read_gzip_jsonl(file_path)
+        # Make sure this returns a generator, not just a list
+        assert not hasattr(data, "__len__")
+        data = list(data)
+    assert len(data) == 2
+    assert len(data[0]) == 1
+    assert len(data[1]) == 1
+    assert data[0]["hello"] == "world"
+    assert data[1]["test"] == 123
diff --git a/srsly/util.py b/srsly/util.py
index 575d3b5..c43120c 100644
--- a/srsly/util.py
+++ b/srsly/util.py
@@ -10,8 +10,8 @@ FilePath = Union[str, Path]
 JSONOutput = Union[str, int, float, bool, None, Dict[str, Any], List[Any]]
 JSONOutputBin = Union[bytes, str, int, float, bool, None, Dict[str, Any], List[Any]]
 # For input, we also accept tuples, ordered dicts etc.
-JSONInput = Union[str, int, float, bool, None, Dict[str, Any], List[Any], Tuple[Any], OrderedDict]
-JSONInputBin = Union[bytes, str, int, float, bool, None, Dict[str, Any], List[Any], Tuple[Any], OrderedDict]
+JSONInput = Union[str, int, float, bool, None, Dict[str, Any], List[Any], Tuple[Any, ...], OrderedDict]
+JSONInputBin = Union[bytes, str, int, float, bool, None, Dict[str, Any], List[Any], Tuple[Any, ...], OrderedDict]
 YAMLInput = JSONInput
 YAMLOutput = JSONOutput
 # fmt: on
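
The `srsly/util.py` hunk above replaces `Tuple[Any]` with `Tuple[Any, ...]` in the `JSONInput` and `JSONInputBin` aliases. In Python's `typing` module, `Tuple[Any]` describes a tuple of exactly one element, while `Tuple[Any, ...]` describes a tuple of any length, which is presumably what the aliases were meant to accept. The short sketch below shows the difference through introspection; the helper name `is_variable_length` is hypothetical and exists only for illustration.

```python
from typing import Any, Tuple, get_args

one_element = Tuple[Any]      # exactly one element
any_length = Tuple[Any, ...]  # zero or more elements


def is_variable_length(tp: Any) -> bool:
    """Return True if a Tuple[...] annotation uses the `X, ...` form."""
    args = get_args(tp)
    return len(args) == 2 and args[1] is Ellipsis


old_is_variable = is_variable_length(one_element)
new_is_variable = is_variable_length(any_length)
```

A static type checker would reject a two-element tuple passed where `Tuple[Any]` is expected, so the change mainly affects users who type-check code that calls srsly.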
