chore(deps): bump nltk from 3.8.1 to 3.9.1 #43

Merged
Aaron ("AJ") Steers (aaronsteers) merged 17 commits into main from dependabot/pip/nltk-3.9 on Nov 18, 2024

Conversation

@dependabot
Contributor

@dependabot dependabot Bot commented on behalf of github Nov 13, 2024

Bumps nltk from 3.8.1 to 3.9.

Changelog

Sourced from nltk's changelog.

Version 3.9.1 2024-08-19

  • Fixed bug that prevented wordnet from loading

Version 3.9 2024-08-18

  • Fix security vulnerability CVE-2024-39705 (breaking change)
  • Replace pickled models (punkt, chunker, taggers) by new pickle-free "_tab" packages
  • No longer sort Wordnet synsets and relations (sort in calling function when required)
  • Only strip the last suffix in Wordnet Morphy, thus restricting synsets() results
  • Add Python 3.12 support
  • Many other minor fixes

Thanks to the following contributors to 3.8.2: Tom Aarsen, Cat Lee Ball, Veralara Bernhard, Carlos Brandt, Konstantin Chernyshev, Michael Higgins, Eric Kafe, Vivek Kalyan, David Lukes, Rob Malouf, purificant, Alex Rudnick, Liling Tan, Akihiro Yamazaki.

Version 3.8.1 2023-01-02

  • Resolve RCE vulnerability in localhost WordNet Browser (#3100)
  • Remove unused tool scripts (#3099)
  • Resolve XSS vulnerability in localhost WordNet Browser (#3096)
  • Add Python 3.11 support (#3090)

Thanks to the following contributors to 3.8.1: Francis Bond, John Vandenberg, Tom Aarsen

Version 3.8 2022-12-12

  • Refactor dispersion plot (#3082)
  • Provide type hints for LazyCorpusLoader variables (#3081)
  • Throw warning when LanguageModel is initialized with incorrect vocabulary (#3080)
  • Fix WordNet's all_synsets() function (#3078)
  • Resolve TreebankWordDetokenizer inconsistency with end-of-string contractions (#3070)
  • Support both iso639-3 codes and BCP-47 language tags (#3060)
  • Avoid DeprecationWarning in Regexp tokenizer (#3055)
  • Fix many doctests, add doctests to CI (#3054, #3050, #3048)
  • Fix bool field not being read in VerbNet (#3044)
  • Greatly improve time efficiency of SyllableTokenizer when tokenizing numbers (#3042)
  • Fix encodings of Polish udhr corpus reader (#3038)
  • Allow TweetTokenizer to tokenize emoji flag sequences (#3034)
  • Prevent LazyModule from increasing the size of nltk.__dict__ (#3033)
  • Fix CoreNLPServer non-default port issue (#3031)
  • Add "acion" suffix to the Spanish SnowballStemmer (#3030)
  • Allow loading WordNet without OMW (#3026)
  • Use input() in nltk.chat.chatbot() for Jupyter support (#3022)
  • Fix edit_distance_align() in distance.py (#3017)
  • Tackle performance and accuracy regression of sentence tokenizer since NLTK 3.6.6 (#3014)
  • Add the Iota operator to semantic logic (#3010)
  • Resolve critical errors in WordNet app (#3008)
  • Resolve critical error in CHILDES Corpus (#2998)
  • Make WordNet information_content() accept adjective satellites (#2995)

... (truncated)

Commits

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    You can disable automated security fix PRs for this repo from the Security Alerts page.

Bumps [nltk](https://github.com/nltk/nltk) from 3.8.1 to 3.9.
- [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
- [Commits](nltk/nltk@3.8.1...3.9)

---
updated-dependencies:
- dependency-name: nltk
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot Bot added the chore label Nov 13, 2024
@github-actions github-actions Bot added the dependencies Pull requests that update a dependency file label Nov 13, 2024
@aaronsteers Aaron ("AJ") Steers (aaronsteers) changed the title chore(deps): bump nltk from 3.8.1 to 3.9 chore(deps): bump nltk from 3.8.1 to 3.9.1 Nov 13, 2024
@aaronsteers
Member

Aaron ("AJ") Steers (aaronsteers) commented Nov 13, 2024

Natik Gadzhi (@natikgadzhi) - fyi, this one is tricky. I fixed some obvious things but next steps are not clear. Currently raising some kind of failure during schema discovery.

fwiw nltk 3.9 is a no-go; it had a bug that was fixed in 3.9.1.
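
A guard like the following could keep the broken release out of an environment. This is a minimal sketch, not code from this PR: the helper names (`parse_version`, `is_acceptable_nltk`) and the two constants are assumptions made for illustration, and it only handles plain dotted numeric versions.

```python
from typing import Tuple

# Assumed for this sketch: 3.9.1 is the first acceptable release,
# and 3.9 (i.e. 3.9.0) is the release called out as a no-go above.
MINIMUM_VERSION = (3, 9, 1)
KNOWN_BAD = {(3, 9, 0)}


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a plain dotted version such as '3.9' or '3.9.1' into a 3-tuple."""
    parts = [int(piece) for piece in text.strip().split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unsupported version string: {text!r}")
    # Pad short versions so that '3.9' compares equal to '3.9.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def is_acceptable_nltk(text: str) -> bool:
    """Return True when the version is new enough and not known to be broken."""
    version = parse_version(text)
    return version >= MINIMUM_VERSION and version not in KNOWN_BAD
```

With these assumptions, `is_acceptable_nltk("3.9.1")` is true, while both `"3.9"` and `"3.8.1"` are rejected.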

@dependabot @github
Contributor Author

dependabot Bot commented on behalf of github Nov 13, 2024

A newer version of nltk exists, but since this PR has been edited by someone other than Dependabot I haven't updated it. You'll get a PR for the updated version as normal once this PR is merged.

@natikgadzhi
Contributor

Aaron ("AJ") Steers (@aaronsteers) Aldo Gonzalez (@aldogonzalez8) alright, this is a security concern, so we should wrap this up and fix the tests that are failing. I hope to wrap this up by end of week, but if this bleeds into the next week, I will make a line for this and assign it to Aldo.

You got this?

@aaronsteers
Member

Aaron ("AJ") Steers (aaronsteers) commented Nov 13, 2024

Natik Gadzhi (@natikgadzhi) and Aldo Gonzalez (@aldogonzalez8) - Summarizing here what I'm finding...

1 - With the version bump alone, we start getting zero records from unstructured sources:

image

2 - The unstructured library leverages nltk (with no version constraints) so I tried to bump that to the latest version as well, with the hope that this would fix the issue - on the theory that maybe there's an incompatibility of versions and some part of the parsing is failing silently.

3 - Bumping the unstructured library causes some breaking changes, most of which I can resolve. However, I can't test whether this actually fixes the problem, because another library called python-magic (aka magic or libmagic), used in file type inference, relies on a C library that I don't have on my machine, and I don't want our build process to take a hard dependency on pre-installing it. Unstructured attempts to check for the presence of this library so it can fall back to other methods if needed, but in the latest version the check doesn't work.

4 - I opened an issue below to see if Unstructured can fix the libmagic dependency check. One thing I didn't try was to simply import the constant and override it with False to see whether that would allow Unstructured to fall back to the desired behavior.
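
The untried override from item 4 might look roughly like this. It is only a sketch of the general pattern: `fake_filetype`, `LIBMAGIC_AVAILABLE`, `detect`, and `override_flag` are placeholder names built from a stand-in module, and they may not match Unstructured's real module layout or constant.

```python
import types
from contextlib import contextmanager

# Stand-in for a third-party module that exposes an availability flag.
# Both the module name and the attribute name are hypothetical.
fake_filetype = types.ModuleType("fake_filetype")
fake_filetype.LIBMAGIC_AVAILABLE = True


def detect(module) -> str:
    """Choose a detection strategy by reading the flag at call time."""
    return "libmagic" if module.LIBMAGIC_AVAILABLE else "fallback"


@contextmanager
def override_flag(module, name, value):
    """Temporarily replace a module-level attribute, restoring it afterwards."""
    original = getattr(module, name)
    setattr(module, name, value)
    try:
        yield
    finally:
        setattr(module, name, original)


# Force the fallback path for the duration of the block only.
with override_flag(fake_filetype, "LIBMAGIC_AVAILABLE", False):
    strategy_during_override = detect(fake_filetype)
```

One caveat with this pattern: it only works if the library reads the flag from the module each time it is needed. Code that copied the value with `from module import FLAG` keeps its own binding and would not see the override.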

With all the above, I don't actually know if bumping Unstructured will solve the issue. I reverted the Unstructured version bump and was about to see if I could pin down the reason why we are getting zero records. Aldo Gonzalez (@aldogonzalez8), I'm going to task switch over to SDM for a bit, but I can come back - or I can pair with you to get you up to speed. Hopefully the above context is helpful.
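
While pinning down the zero-records problem, a wrapper that turns an empty result into a loud failure can help separate "parsed nothing" from "nothing to parse". The sketch below is an assumption-laden illustration, not CDK code: `require_records` and `NoRecordsError` are hypothetical names.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


class NoRecordsError(RuntimeError):
    """Raised when a source that was expected to yield records yields none."""


def require_records(records: Iterable[T], source: str) -> Iterator[T]:
    """Pass records through unchanged, raising if the stream turns out empty."""
    count = 0
    for record in records:
        count += 1
        yield record
    # Only reached after the underlying iterable is exhausted.
    if count == 0:
        raise NoRecordsError(f"{source!r} produced zero records")
```

Because this is a generator, the error surfaces only once the caller has consumed the whole stream, for example via `list(require_records(parsed, "sample.pdf"))`.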

Comment thread airbyte_cdk/sources/file_based/file_types/unstructured_parser.py Outdated
@aldogonzalez8
Contributor

Aldo Gonzalez (aldogonzalez8) commented Nov 14, 2024

Ok, these two commits seem to fix the failing unit tests. Things are still running here, but previously I was getting errors faster, so I'm optimistic:

Commit 1

Commit 2

@aldogonzalez8
Contributor

Aldo Gonzalez (aldogonzalez8) commented Nov 14, 2024

Aaron ("AJ") Steers (@aaronsteers) the changes seem to make the tests pass:

image

@dependabot @github
Contributor Author

dependabot Bot commented on behalf of github Nov 18, 2024

OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting @dependabot ignore this major version or @dependabot ignore this minor version. You can also ignore all major, minor, or patch releases for a dependency by adding an ignore condition with the desired update_types to your config file.

If you change your mind, just re-open this PR and I'll resolve any conflicts on it.

@dependabot dependabot Bot deleted the dependabot/pip/nltk-3.9 branch November 18, 2024 17:54
@aaronsteers Aaron ("AJ") Steers (aaronsteers) restored the dependabot/pip/nltk-3.9 branch November 18, 2024 18:17

APPROVED

Labels

chore dependencies Pull requests that update a dependency file

3 participants