Added Noun phrase count #13

Closed · wants to merge 89 commits into from

Conversation

@ritikjain51 commented Oct 6, 2020

Added noun phrase counting.
Solved the counting issue #14.
Updated the logic with emoji decoding for more robust noun phrase extraction (see the sketch below).
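A rough sketch of the emoji-decoding idea (illustrative only, not the PR's actual code; the decode_emojis name and the use of the emoji package are assumptions):

    import emoji

    def decode_emojis(text: str) -> str:
        # Convert emoji to text aliases, e.g. "good job 👍" -> "good job :thumbs_up:",
        # so the tokenizer and POS tagger see plain words instead of raw symbols.
        return emoji.demojize(text)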

@neomatrix369 self-requested a review October 6, 2020 11:38
@neomatrix369 added the enhancement label Oct 6, 2020
@neomatrix369 added this to Doing in NLP Profiler via automation Oct 6, 2020
@neomatrix369 (Owner) left a comment

In general, LGTM. A very good and simple implementation, and I think we can build more on top of it.

Thank you for your first contribution; please review my comments and respond.

Have you also run the test coverage shell script? Could you please post a screenshot of the results so we can see whether coverage is at 100%?

I will soon have CI/CD in place for pull requests, but for now this would help.

Review threads (outdated, now resolved) on nlp_profiler/granular_features.py and nlp_profiler/noun_phase_count.py
@neomatrix369 (Owner) commented Oct 6, 2020

@ritikjain51 there are also acceptance tests and notebooks to update with your new change; the notebooks are under the notebook folder. Also, look for tests under the slow-tests folder. This is how you run the test coverage shell script:

cd [root of the project folder]
./test-coverage.sh "tests slow-tests"

This will generate a test run report and a test coverage report.
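Roughly, the script presumably wraps pytest with the pytest-cov plugin; a hedged Python equivalent (assuming pytest and pytest-cov are installed, and that these are the options the script uses):

    import pytest

    # Run both test folders with coverage measured over the nlp_profiler package,
    # reporting which lines are missing coverage.
    pytest.main(["--cov=nlp_profiler", "--cov-report=term-missing",
                 "tests", "slow-tests"])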

@neomatrix369 added the hacktoberfest-accepted and granular feature(s) labels Oct 6, 2020
@neomatrix369 (Owner)

@ritikjain51 please also add a bit of description explaining the background: what you are doing, what the outcome is, and what you are using to get the results. This will help me and others reading this PR.

As you can see from #15, you have opened up a potential improvement, something I knew about but hadn't prioritised.

@neomatrix369 added the hacktoberfest label and removed the hacktoberfest-accepted label Oct 6, 2020
@neomatrix369 (Owner)

@ritikjain51 could you please also add a description to the PR? This will help me and others who look at it later.

You may also have missed re-running the other notebook (nlp_profiler.ipynb); please take a look. You can skip nlp_profiler-large-dataset.ipynb for now.

Thanks

@neomatrix369 (Owner)

@ritikjain51 thanks for your patience and for following through with the code review.

I will write up the developer guide soon so there is a standard way to do things. There are also a couple of things to be automated, which will also help.

@neomatrix369 (Owner) commented Oct 8, 2020

@ritikjain51 your PR is out of sync with master; you may need to rebase onto master.

Are you using nbdime or another Jupyter notebook version-control plugin? It can help you understand the differences.

@ritikjain51 (Author)
[five screenshots of the test run attached]

Three cases are failing because I don't have the datasets to validate them.
Those three cases are in test_apply_text_profiling.py.

@neomatrix369 (Owner) commented Oct 8, 2020

> ...
>
> Three cases are failing because I don't have the datasets to validate them. Those three cases are in test_apply_text_profiling.py.

The dataset is generated in the test itself; have a read of the failing tests, there is a method that creates the dataset.

These failures are normal: if you look closely, the acceptance test is trying to compare the old results with your new results (which have one extra column), so it will fail. I suggest you generate the expected datasets and save them in https://github.com/neomatrix369/nlp_profiler/tree/master/tests/acceptance_tests/data.

It's easy to generate the new test data, but do it only when you know the new dataset is correct in every respect. For each of the test cases in https://github.com/neomatrix369/nlp_profiler/blob/master/tests/acceptance_tests/test_apply_text_profiling.py, save the results generated as actual_dataframe using this line:

    def test_case():  # for all the relevant test cases
        ...
        actual_dataframe.to_csv(csv_filename, index=False)  # remove this line once you have finished generating the csv file
        ...

Compare the old results with the new ones using git diff; in your case, only your new column should change, not any other aspect of the .csv files.

Let me know if you don't follow this step.

I hope to make this easier in future releases so that we can generate our test data more easily, but generating test data always involves some manual work.
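For illustration, a hedged sketch of the regeneration step described above (the input text and csv path are placeholders, and the import path may differ after refactoring):

    import pandas as pd
    from nlp_profiler.core import apply_text_profiling

    # Build a small source dataframe like the tests do (placeholder text).
    source_dataframe = pd.DataFrame({'text': ["Here is a sentence with a few noun phrases."]})

    # Profile it, then write the result out as the new expected dataset.
    actual_dataframe = apply_text_profiling(source_dataframe, 'text')
    actual_dataframe.to_csv('tests/acceptance_tests/data/expected_profiled_dataframe.csv',
                            index=False)  # remove once the csv has been generated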

@ritikjain51 (Author)
[five screenshots of the test run attached]
> Three cases are failing because I don't have the datasets to validate them. Those three cases are in test_apply_text_profiling.py.

> The dataset is generated in the test itself; have a read of the failing tests, there is a method that creates the dataset.

In the test code, we are fetching data from the expected data path location, which is not available to me.

@neomatrix369 (Owner) commented Oct 8, 2020

@ritikjain51

Not true; the failing tests in https://github.com/neomatrix369/nlp_profiler/blob/master/tests/acceptance_tests/test_apply_text_profiling.py create their own dataset and also use the data in the folder https://github.com/neomatrix369/nlp_profiler/tree/master/tests/acceptance_tests/data. Please look again.

There is no external data used here.

@ritikjain51 (Author)

> @ritikjain51
>
> Not true; the failing tests in https://github.com/neomatrix369/nlp_profiler/blob/master/tests/acceptance_tests/test_apply_text_profiling.py create their own dataset and also use the data in the folder https://github.com/neomatrix369/nlp_profiler/tree/master/tests/acceptance_tests/data. Please look again.
>
> There is no external data used here.

Uploading image.png…

@ritikjain51 (Author) left a comment

Merge conflict solved locally.

@ritikjain51 (Author)

These issues are there because you have made structural changes in your code.

@ritikjain51 (Author)

Thanks @neomatrix369 for your guidance. I am closing this pull request; it's getting difficult to understand the issues.
Hope you understand.
Regards,
Ritik Jain

@ritikjain51 deleted the addNounPhrase branch October 18, 2020 10:22
@neomatrix369 (Owner) commented Oct 18, 2020

@ritikjain51 the master branch did move on Friday as I was refactoring the code (implementation); sorry about that, but it had to be done for better readability and maintainability.

The merge conflicts were mainly due to changes in the data and notebook files.

It is normal in a project for a branch to fall out of step with master as development continues. I see you have also deleted your original changes; someone might have been able to help you fix them.

PyCharm and VSCode have good git functionality to help with the process, though walking you through it is outside the scope of this project.

You can always create a new fork and a new branch with the changes and regenerate the notebooks; it's up to you. Otherwise I, or someone else, might implement this functionality separately.

@neomatrix369 (Owner) commented Oct 18, 2020

Your tests were failing on CI/CD due to the absence of this line:

nltk.download('averaged_perceptron_tagger')

in the noun_phase_count.py module (see 08e54c4#diff-071796e721082b14e9eb7c22c771ca452fe1f83652dc4bcce6d1c599409a6c41R10)

The CI/CD logs did indicate that.
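For context, here is a minimal sketch of why that download matters; the chunk grammar and function name below are illustrative assumptions, not the actual contents of noun_phase_count.py:

    import nltk

    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')  # the missing line

    # A simple noun-phrase chunk grammar: optional determiner,
    # any number of adjectives, then one or more nouns.
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

    def count_noun_phrases(text: str) -> int:
        tagged = nltk.pos_tag(nltk.word_tokenize(text))  # needs the tagger model
        tree = chunker.parse(tagged)
        return sum(1 for subtree in tree.subtrees() if subtree.label() == "NP")

Without the download, nltk.pos_tag raises a LookupError on a fresh machine, which is exactly what a CI/CD runner is.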

After adding this line, the tests pass:

tests/acceptance_tests/test_apply_text_profiling.py .....                                                                [  2%]
tests/granular/test_alphanumeric.py ........                                                                             [  7%]
tests/granular/test_chars_and_spaces.py ......                                                                           [ 10%]
tests/granular/test_dates.py .........                                                                                   [ 15%]
tests/granular/test_duplicates.py .........                                                                              [ 21%]
tests/granular/test_emojis.py ........                                                                                   [ 25%]
tests/granular/test_non_alphanumeric.py ........                                                                         [ 30%]
tests/granular/test_nounphase.py .........                                                                               [ 35%]
tests/granular/test_numbers.py ........                                                                                  [ 39%]
tests/granular/test_punctuations.py ........                                                                             [ 44%]
tests/granular/test_sentences.py ..................                                                                      [ 54%]
tests/granular/test_stop_words.py ........                                                                               [ 59%]
tests/granular/test_words.py ........                                                                                    [ 63%]
tests/high_level/test_grammar_check.py ..........                                                                        [ 69%]
tests/high_level/test_sentiment_polarity.py ................                                                             [ 78%]
tests/high_level/test_sentiment_subjectivity.py ................                                                         [ 87%]
tests/high_level/test_spelling_check.py ..................                                                               [ 97%]
slow-tests/acceptance_tests/test_apply_text_profiling.py .                                                               [ 98%]
slow-tests/performance_tests/test_perf_grammar_check.py .                                                                [ 98%]
slow-tests/performance_tests/test_perf_noun_phase.py .                                                                   [ 99%]
slow-tests/performance_tests/test_perf_spelling_check.py .                                                               [100%]

I would also suggest using the PR page, i.e. #13, to read and respond to messages; a lot of the information on this page helps with the development process.

@neomatrix369 (Owner)

@ritikjain51
Since you put so much time and effort into this work, I have restored the branch for you; see https://github.com/neomatrix369/nlp_profiler/tree/addNounPhraseCount. Your commits are preserved, and I have added some fixes to it.

Also, please have a look at the CI/CD run: https://github.com/neomatrix369/nlp_profiler/runs/1271145643?check_suite_focus=true

Please do a thorough check to see if anything has been missed. I am happy for you to create a new PR, as the community values your hard work; let's coordinate this time and try to stay in sync with master.

Again, if you are unfamiliar with something, it's best to say so and ask for help; that's the only way for us to come forward and step in.

Labels: enhancement (New feature or request), granular feature(s) (Low-level/granular feature(s)), hacktoberfest (Part of Hacktoberfest 2020, https://hacktoberfest.digitalocean.com)
Projects: NLP Profiler (Done)
Development: successfully merging this pull request may close the issue "Add phrase counts or parts-of-speech token counts after extracting entities from a sentence"
3 participants