-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes #20: Added basic summarize functionality #66
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not take on a python dependency if we don't need to. There are a few dozen text summarization modules on npm, let's try and use those.
Next, we can get rid of express here. We don't need a web server to do this, just a function interface. You can model your code on what @jatinAroraGit is doing in #46. Basically, we need to export an async
run
function that accepts some text, and gives back the result via a Promise
.
Also, we should be able to include a test now that the Jest code is in place. You can either do that in this PR, or file another bug for someone to work on that adds one. The test should, at the least, call your function with some snippet of text, and expect to get back results of the form we'll expect at runtime.
55a724d
to
7c2a1bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to run npm eslint-fix
and rerun npm test
to see if all errors have been corrected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is taking shape! I left some more comments.
summarizer/summarize.js
Outdated
const { SummarizerManager } = require('node-summarizer'); | ||
|
||
module.exports = async function (text, numSentences = 3) { | ||
const Summarizer = new SummarizerManager(text, numSentences); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const summarizer = ...
Use lower case to start a variable name, only use a capital for a Constructor function name.
summarizer/summarize.test.js
Outdated
const fs = require('fs'); | ||
const summarize = require('./summarize'); | ||
|
||
describe('summarize tests...', () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to convert this to Jest format (i.e., test()
vs. describe()
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unable to nest tests in Jest, I use describe to define a scope just in case I wish to add other types of test cases into the file. It isn't really needed at the moment so I could just remove describe,.
summarizer/summarize.test.js
Outdated
const summarize = require('./summarize'); | ||
|
||
describe('summarize tests...', () => { | ||
const text = fs.readFileSync('./summarizer/text.txt', 'utf-8'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should go in a before
test setup call. You can also drop the readFileSync
and just use readFile
, since we can use async calls in a test setup function.
summarizer/summarize.test.js
Outdated
|
||
it('summarize returns string', async () => { | ||
data = await summarize(text, 5); | ||
expect(typeof data).toBe('string'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also test for the exact summary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a good idea, but I'm am not certain I have the best approach to it. I think I could compare how similar the machine summary is to a human summary and have it pass if it meets a threshold, but this requires a good human summary for each blog input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is: run your code once, get the summary it produces, see if you think it makes sense. If it does, hard code that into your test, so we know in the future if it breaks for this given blog post text. This is really a regression test, to guard against future problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I added the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really close.
.stylelintrc.js
Outdated
@@ -6,7 +6,8 @@ module.exports = { | |||
"LICENSE", | |||
"*.*", | |||
"images/*.*", | |||
"!*.css" | |||
"!*.css", | |||
"summarizer/*.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change makes me think we need to file a new issue to adjust how our stylelint
works, such that it only looks at files in a particular directory, maybe src/frontend/styles/*
or something, and then we don't collide with random files like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Posted as issue #162.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I closed the issue as duplicate. Did not see the issue you posted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you know, this is being fixed in #151, so why don't you reset your changes to this file, so you don't have merge conflicts with that:
git checkout master .stylelintrc.js
And you can go back to what it was. It will fail Travis, but we can ignore. By the way, you might want to add another review to #151 so we can land it.
summarizer/summarize.test.js
Outdated
@@ -0,0 +1,23 @@ | |||
const fs = require('fs').promises; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this into test/
where the other tests are.
3f58669
to
0234cda
Compare
.stylelintrc.js
Outdated
@@ -6,7 +6,8 @@ module.exports = { | |||
"LICENSE", | |||
"*.*", | |||
"images/*.*", | |||
"!*.css" | |||
"!*.css", | |||
"summarizer/*.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you know, this is being fixed in #151, so why don't you reset your changes to this file, so you don't have merge conflicts with that:
git checkout master .stylelintrc.js
And you can go back to what it was. It will fail Travis, but we can ignore. By the way, you might want to add another review to #151 so we can land it.
test/summarize.test.js
Outdated
let expected; | ||
|
||
beforeAll(async () => { | ||
inputData = await fs.readFile('./summarizer/input.txt', 'utf-8'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move these files into test/
, and/or just inline the contents of the files into this test file so we don't even have to read from disk.
test/summarize.test.js
Outdated
return expect(predicted).toBe(expected); | ||
}); | ||
|
||
test('summarize returns string', async () => expect(typeof predicted).toBe('string')); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests can be run in different order, so don't count on predicted
to exist here. I'd re-calculate this in each test so you know it's good. Try not to have test()
functions depend on shared state. It can lead to random failures.
test/summarize.test.js
Outdated
|
||
test('summarize returns string', async () => expect(typeof predicted).toBe('string')); | ||
|
||
test('summarize returns correct number of lines', () => expect(predicted.split('.').length).toBe(6)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does predicted
look like that you have to call split? I don't follow the logic here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Predicted' should be returning 5 sentences since I pass 5 sentences as a parameter for summarize. So I split 'predicted' in order to get the number of sentences. The module I use requires that a number of sentences be specified to produce a summary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Defining a sentence as "ends with a .
" could be problematic, if a sentence includes something like "I'm working on a file named foo.js in my course." Probably doing /\. {1,2}/
for "period followed by 1 or 2 spaces" would be more accurate. That still leaves out sentences that end with a ;
. Anyway, this gets into the weed. If you want to tighten up your split()
to deal with the periods too, that's fine. Otherwise, we'll go with what you have here.
0f22498
to
06c7742
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, a few little things, and I think this is probably good.
test/summarize.test.js
Outdated
|
||
test('summarize returns string', async () => expect(typeof predicted).toBe('string')); | ||
|
||
test('summarize returns correct number of lines', () => expect(predicted.split('.').length).toBe(6)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Defining a sentence as "ends with a .
" could be problematic, if a sentence includes something like "I'm working on a file named foo.js in my course." Probably doing /\. {1,2}/
for "period followed by 1 or 2 spaces" would be more accurate. That still leaves out sentences that end with a ;
. Anyway, this gets into the weed. If you want to tighten up your split()
to deal with the periods too, that's fine. Otherwise, we'll go with what you have here.
src/summarize.js
Outdated
module.exports = async function (text, numSentences = 3) { | ||
const summarizer = new SummarizerManager(text, numSentences); | ||
return summarizer.getSummaryByRank().then( | ||
(summaryObject) => Promise.resolve(summaryObject.summary), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(summaryObject) => summaryObject.summary
is all you need here
53199ad
to
f304ac6
Compare
Looks like your regex isn't passing this test anymore. I think it's fine if you want to revert to what you had, so we can land this and improve it later. |
I'm not sure why, but the TextRank algorithm in node-summarizer throws out inconsistent results. Sometimes it fails the test but once its rerun it might pass, or just fail again. |
So from what I read on the node-summarizer repo, looks like TextRank might not always throw out the same result, but it should still be generating the correct number of sentences each time. |
Fascinating. OK, let's adjust the tests to match the expectations of the library, and include some comments in your tests/code that indicate what your research has uncovered about how it varies with each run. |
Added expected summary test.
be43208
to
a45dc1d
Compare
Co-authored-by: Renovate Bot <bot@renovateapp.com>
* Initial work * Docs and bodyParser * fix: correct typo and add missing quotation mark * Move router onto Satellite instance * Use built-in body parsing with Express * Add tests for body parsers * Add GitHub CI + README badge * Fix package-lock.json sync with package.json * Fix README build badge URL * Add release workflow * 1.0.1 * Use standard tag format for npm with v prefix * 1.0.2 * Release yaml fix * 1.0.3 * Explicitly set package public for npm publish * 1.0.4 * Remove private field completely in order to publish to npm * 1.0.5 * Switch org name to @senecacdot * 1.0.6 * Update lock file * 1.0.7 * Expose Router from package * 1.1.0 * Switch from new Router() to Router() * 1.1.1 * Use env variables to start apm monitoring sooner * 1.2.0 * Switch to ELASTIC_APM_SERVER_URL, better 404 reporting, refactor Router() * 1.3.0 * Add beforeParsers and beforeRouter options with tests * 1.4.0 * Add pino-colada for debug logging * Improve logging, use ELASTIC_APM_SERVICE_NAME env var, add router option to ctor * 1.5.0 * Remove pino-tiny dep * Fix logger picking logic on startup * 1.5.1 * Document healthCheck and add more tests * 1.5.2 * Add default favicon support * 1.6.0 * Update README install instructions, deps * 1.6.1 * Add JWT validation, tests, and update docs * 1.7.0 * Ported Hash to Satellite * Removed Redundant Code, Added Comment Block, Fixed Import * Re-add crypto * Add test for req.user * Refactor into src/, breakup middleware authenticate vs. authorize, remove favicon * 1.8.0 * Init Prettier Commit * Adds the createError module for use in Telescope (#5) * Fixed CreateError module, use http-errors * removed merge conflict errors * Added Docs in README Co-authored-by: David Humphrey <david.andrew.humphrey@gmail.com> * Finish prettier integration * Fix workflows * Prettier for jest.config.js * Update deps, fix prettier-check on windows * 1.8.5 * Specify main entry point in package.json * 1.8.6 * Add .husky directory and pre-commit hook * Fix #1930 * Support credentials for HTTPS vs. HTTP server * 1.8.7 * Don't install Husky on postinstall * 1.8.8 * Add support for generating a service token * 1.9.0 * Updated redis and added ping test * Updated redis export * Fixed Redis test case * 1.10.0 * isAuthorized() always takes a function with req, user params * 1.11.0 * 1.12.0 * Initial Elastic client code * Updated elastic contructor, add initial tests * Fixed elastic search client, mock elastic connection * Updated README.md with Elastic() info * 1.13.0 * Add shutDown() to allow killing connections * 1.14.0 * Add automatic, graceful shutdown for Redis and Elastic clients * Update deps for 1.14.0 * 1.15.0 * Initial exported Fetch() function to Satellite * Updated exports to require node-fetch instead of it being in a separate file * Updated spelling to 'fetch' and updated tests to use nock * Removed done() from tests and moved node-fetch to be a dependency vs dev-dependency * Update lock file * 1.16.0 * chore: include nodejs 16 in the CI build matrix * feat: add auto-opt-out of FLoC * 1.17.0 * Add eslint to satellite * adding and configuring eslint * Fixing linting errors * configuring anti-trojan-source plugin * adding lint to pre-commit hook * Removing ts from eslint run and removing comment * removing unused function validateAuthorizationOptions * Adding no-unused-vars override to test.js and removing the override from config * Adding ESLint to CI runs * Integrating eslint with the release workflow * Fix Dependencies: * Remove elastic-apm-node * Remove @elastic/ecs-pino-format * Update Jest (To fix deprecated dependencies) * * Update pino * Switch from express-pino-logger to pino-http * Standardize Dependencies * Configure Renovate (#23) * Configure renovate bot Co-authored-by: Renovate Bot <bot@renovateapp.com> Co-authored-by: Duke Manh <manhducdkcb@gmail.com> * chore(deps): update dependency pretty-quick to v3.1.3 * fix(deps): update dependency express to v4.17.2 * fix(deps): update dependency node-fetch to v2.6.7 * fix(deps): update dependency @elastic/elasticsearch-mock to v0.3.1 * fix(deps): update dependency http-errors to v1.8.1 * chore(deps): update dependency eslint to v8.7.0 * chore(deps): update dependency eslint-plugin-anti-trojan-source to v1.1.0 * chore(deps): update dependency eslint-plugin-jest to v25.7.0 * chore(deps): update dependency husky to v5.2.0 * Switch npm to pnpm * Release v1.18.0 * 1.20.0 * 1.21.0 * Use --no-git-checks with pnpm publish to avoid failure on CI * 1.22.0 * remove pre-commit * Release v1.23.0 * fix(deps): update dependency pino-pretty to v7.5.1 * fix(deps): update dependency pino to v7.6.5 * chore(deps): update dependency eslint to v8.8.0 * chore(deps): update dependency nock to v13.2.2 * bump prettier to v2.5.1 and run prettier on entire tree * fix(deps): update dependency @elastic/elasticsearch to v7.16.0 * fix(deps): update dependency @godaddy/terminus to v4.10.2 (#48) Co-authored-by: Renovate Bot <bot@renovateapp.com> * fix(deps): update dependency express-jwt to v6.1.0 * fix(deps): update dependency ioredis to v4.28.3 (#50) Co-authored-by: Renovate Bot <bot@renovateapp.com> * fix(deps): update dependency ioredis-mock to v5.9.1 * fix-renovate-bot * Refactoring elastic.js so mock is exported for tests - Adding tests for mock Elastic() - Added mock Elastic() description in README.md * Release v1.24.0 * chore(deps): update dependency nock to v13.2.4 * fix(deps): update dependency ioredis to v4.28.4 (#55) Co-authored-by: Renovate Bot <bot@renovateapp.com> * chore(deps): update dependency jest to v27.5.0 (#56) Co-authored-by: Renovate Bot <bot@renovateapp.com> * fix(deps): update dependency ioredis to v4.28.5 * fix(deps): update dependency @elastic/elasticsearch to v7.17.0 * chore(deps): update dependency jest to v27.5.1 (#59) Co-authored-by: Renovate Bot <bot@renovateapp.com> * chore(deps): update dependency eslint to v8.9.0 (#60) Co-authored-by: Renovate Bot <bot@renovateapp.com> * changed all uses of SECRET -> JWT_SECRET * Release v1.25.0 * Adding more tests for createError * fix(deps): update dependency express to v4.17.3 * fix(deps): update dependency pino to v7.8.0 (#66) Co-authored-by: Renovate Bot <bot@renovateapp.com> * adding ES error cases to createError * Release v.1.26.0 * fix(deps): update dependency express-jwt to v6.1.1 * chore(deps): update dependency eslint to v8.10.0 * fix(deps): update dependency ioredis-mock to v7 * fix(deps): update dependency ioredis-mock to v7.1.0 * fix(deps): update dependency pino-pretty to v7.5.3 * chore(deps): update dependency husky to v7 * chore(deps): update dependency eslint-plugin-jest to v26 * fix(deps): update dependency helmet to v5 * set default values for status and argToSend so they're not undefined * Delete unneded config files from Satellite repo Co-authored-by: David Humphrey <david.andrew.humphrey@gmail.com> Co-authored-by: Josue <josue.quilon-barrios@senecacollege.ca> Co-authored-by: Metropass <moho472@gmail.com> Co-authored-by: Mo <58116522+Metropass@users.noreply.github.com> Co-authored-by: Abdulbasid Guled <guled.basid@gmail.com> Co-authored-by: Josue <manekenpix@fastmail.com> Co-authored-by: dhillonks <kunwarvir@hotmail.com> Co-authored-by: Kevan-Y <58233223+Kevan-Y@users.noreply.github.com> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> Co-authored-by: Renovate Bot <bot@renovateapp.com> Co-authored-by: Duke Manh <manhducdkcb@gmail.com> Co-authored-by: Cindy Le <cindyledev@gmail.com> Co-authored-by: AmasiaNalbandian <amasia.nalbandian@mitel.com> Co-authored-by: Amasia <77639637+AmasiaNalbandian@users.noreply.github.com> Co-authored-by: rclee91 <32626950+rclee91@users.noreply.github.com> Co-authored-by: Jia Hua Zou <jiahua.zou1@gmail.com> Co-authored-by: Joel Azwar <joel_azwar@yahoo.com> Co-authored-by: Anatoliy Serputoff <65831678+aserputov@users.noreply.github.com> Co-authored-by: tpmai <thienphuoc.0108@gmail.com>
Fixes #20
Added a summarizer that uses gensim's TextRank Summarizer in python.
I included code that allows it to be called from the node.js framework just to test it and to make it easier for whoever wants to call the function.
It isn't good, but it's a starting point. I will continue working on a document summary model.