Commit
1 parent bf83eda, commit 07b38fb
Showing 2 changed files with 131 additions and 3 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,127 @@ | ||
|
||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" | ||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> | ||
|
||
<html xmlns="http://www.w3.org/1999/xhtml"> | ||
<head> | ||
<meta http-equiv="X-UA-Compatible" content="IE=Edge" /> | ||
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> | ||
<title>Stop Words — PyCantonese 2.2.0 documentation</title> | ||
<link rel="stylesheet" href="_static/sphinxdoc.css" type="text/css" /> | ||
<link rel="stylesheet" href="_static/pygments.css" type="text/css" /> | ||
<script type="text/javascript" id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script> | ||
<script type="text/javascript" src="_static/jquery.js"></script> | ||
<script type="text/javascript" src="_static/underscore.js"></script> | ||
<script type="text/javascript" src="_static/doctools.js"></script> | ||
<link rel="index" title="Index" href="genindex.html" /> | ||
<link rel="search" title="Search" href="search.html" /> | ||
<link rel="next" title="Corpus Reader Methods" href="reader.html" /> | ||
<link rel="prev" title="Corpus Data" href="data.html" /> | ||
</head><body> | ||
<div class="related" role="navigation" aria-label="related navigation"> | ||
<h3>Navigation</h3> | ||
<ul> | ||
<li class="right" style="margin-right: 10px"> | ||
<a href="genindex.html" title="General Index" | ||
accesskey="I">index</a></li> | ||
<li class="right" > | ||
<a href="reader.html" title="Corpus Reader Methods" | ||
accesskey="N">next</a> |</li> | ||
<li class="right" > | ||
<a href="data.html" title="Corpus Data" | ||
accesskey="P">previous</a> |</li> | ||
<li class="nav-item nav-item-0"><a href="index.html">PyCantonese 2.2.0 documentation</a> »</li> | ||
</ul> | ||
</div> | ||
<div class="sphinxsidebar" role="navigation" aria-label="main navigation"> | ||
<div class="sphinxsidebarwrapper"> | ||
<h4>Previous topic</h4> | ||
<p class="topless"><a href="data.html" | ||
title="previous chapter">Corpus Data</a></p> | ||
<h4>Next topic</h4> | ||
<p class="topless"><a href="reader.html" | ||
title="next chapter">Corpus Reader Methods</a></p> | ||
<div id="searchbox" style="display: none" role="search"> | ||
<h3>Quick search</h3> | ||
<div class="searchformwrapper"> | ||
<form class="search" action="search.html" method="get"> | ||
<input type="text" name="q" /> | ||
<input type="submit" value="Go" /> | ||
<input type="hidden" name="check_keywords" value="yes" /> | ||
<input type="hidden" name="area" value="default" /> | ||
</form> | ||
</div> | ||
</div> | ||
<script type="text/javascript">$('#searchbox').show(0);</script> | ||
</div> | ||
</div> | ||
|
||
<div class="document"> | ||
<div class="documentwrapper"> | ||
<div class="bodywrapper"> | ||
<div class="body" role="main"> | ||
|
||
<div class="section" id="stop-words"> | ||
<span id="id1"></span><h1>Stop Words<a class="headerlink" href="#stop-words" title="Permalink to this headline">¶</a></h1> | ||
<p>Many natural language processing tasks require filtering out | ||
stop words; in English, typical examples are function words such as | ||
pronouns and determiners. PyCantonese provides the function <code class="docutils literal notranslate"><span class="pre">stop_words()</span></code>, | ||
which returns a set of about 100 Cantonese stop words:</p> | ||
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">pycantonese</span> <span class="kn">as</span> <span class="nn">pc</span> | ||
<span class="gp">>>> </span><span class="n">stop_words</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">stop_words</span><span class="p">()</span> | ||
<span class="gp">>>> </span><span class="nb">len</span><span class="p">(</span><span class="n">stop_words</span><span class="p">)</span> | ||
<span class="go">104</span> | ||
<span class="gp">>>> </span><span class="n">stop_words</span> | ||
<span class="go">{'一啲', '一定', '不如', '不過', ...}</span> | ||
</pre></div> | ||
</div> | ||
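<p>As a minimal sketch of the filtering step itself, the set can be used for plain membership tests. The <code class="docutils literal notranslate"><span class="pre">stop_words</span></code> set below is a hard-coded stand-in holding only the four entries visible in the output above, and the token list is a made-up, already-segmented example; with the library installed, the set would come from <code class="docutils literal notranslate"><span class="pre">pc.stop_words()</span></code> instead.</p>

```python
# Stand-in for the set returned by pc.stop_words(): only the four entries
# shown in the truncated output above (an assumption for illustration).
stop_words = {'一啲', '一定', '不如', '不過'}

# Hypothetical, already-segmented tokens; not taken from any corpus.
tokens = ['不過', '香港', '一定', '好玩']

# Keep only the tokens that are not in the stop word set.
content_tokens = [token for token in tokens if token not in stop_words]
print(content_tokens)  # ['香港', '好玩']
```

<p>Because the stop words are held in a set, each membership test takes constant time on average, so the filter scales linearly with the number of tokens.</p>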
<p>Depending on your use case, you may want to add stop words to the default set | ||
or remove some from it. | ||
The <code class="docutils literal notranslate"><span class="pre">stop_words()</span></code> function accepts the optional arguments <code class="docutils literal notranslate"><span class="pre">add</span></code> and | ||
<code class="docutils literal notranslate"><span class="pre">remove</span></code>.</p> | ||
<p><code class="docutils literal notranslate"><span class="pre">add</span></code> can be either a string (e.g., treat <code class="docutils literal notranslate"><span class="pre">'香港'</span></code> as a stop word if your | ||
data is all about Hong Kong) or an iterable of strings:</p> | ||
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="gp">>>> </span><span class="kn">import</span> <span class="nn">pycantonese</span> <span class="kn">as</span> <span class="nn">pc</span> | ||
<span class="gp">>>> </span><span class="n">stop_words_1</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">stop_words</span><span class="p">(</span><span class="n">add</span><span class="o">=</span><span class="s1">'香港'</span><span class="p">)</span> | ||
<span class="gp">>>> </span><span class="nb">len</span><span class="p">(</span><span class="n">stop_words_1</span><span class="p">)</span> | ||
<span class="go">105</span> | ||
<span class="gp">>>> </span><span class="s1">'香港'</span> <span class="ow">in</span> <span class="n">stop_words_1</span> | ||
<span class="go">True</span> | ||
<span class="gp">>>> </span><span class="n">stop_words_2</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">stop_words</span><span class="p">(</span><span class="n">add</span><span class="o">=</span><span class="p">[</span><span class="s1">'香港島'</span><span class="p">,</span> <span class="s1">'九龍'</span><span class="p">,</span> <span class="s1">'新界'</span><span class="p">])</span> | ||
<span class="gp">>>> </span><span class="nb">len</span><span class="p">(</span><span class="n">stop_words_2</span><span class="p">)</span> | ||
<span class="go">107</span> | ||
<span class="gp">>>> </span><span class="p">{</span><span class="s1">'香港島'</span><span class="p">,</span> <span class="s1">'九龍'</span><span class="p">,</span> <span class="s1">'新界'</span><span class="p">}</span><span class="o">.</span><span class="n">issubset</span><span class="p">(</span><span class="n">stop_words_2</span><span class="p">)</span> | ||
<span class="go">True</span> | ||
</pre></div> | ||
</div> | ||
<p>Similarly, the <code class="docutils literal notranslate"><span class="pre">remove</span></code> argument takes either a string or an iterable | ||
of strings.</p> | ||
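<p>The string-or-iterable behaviour described above can be illustrated with a small stand-in. The function <code class="docutils literal notranslate"><span class="pre">stop_words_sketch()</span></code>, its helper, and the four-word default set are all hypothetical and are not PyCantonese's actual implementation; they only show why a string has to be handled separately (a string is itself an iterable of characters) and what a removal might look like. With the library itself, the corresponding call would presumably be <code class="docutils literal notranslate"><span class="pre">pc.stop_words(remove='一啲')</span></code>.</p>

```python
# Hypothetical stand-in for the default stop words: only the four entries
# shown in the truncated output earlier on this page.
DEFAULT_STOP_WORDS = frozenset({'一啲', '一定', '不如', '不過'})


def _as_set(arg):
    """Normalize None, a single string, or an iterable of strings to a set."""
    if arg is None:
        return set()
    if isinstance(arg, str):
        # Wrap the string so it is not split into individual characters.
        return {arg}
    return set(arg)


def stop_words_sketch(add=None, remove=None):
    """Illustrative only: apply additions first, then removals."""
    return (set(DEFAULT_STOP_WORDS) | _as_set(add)) - _as_set(remove)


without_one = stop_words_sketch(remove='一啲')
without_two = stop_words_sketch(remove=['一啲', '一定'])
print(len(without_one), len(without_two))  # 3 2
print('一啲' in without_one)  # False
```

<p>In this sketch a word passed to both arguments ends up removed, because removals are applied last; whether the library resolves that case the same way is not stated here.</p>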
</div> | ||
|
||
|
||
</div> | ||
</div> | ||
</div> | ||
<div class="clearer"></div> | ||
</div> | ||
<div class="related" role="navigation" aria-label="related navigation"> | ||
<h3>Navigation</h3> | ||
<ul> | ||
<li class="right" style="margin-right: 10px"> | ||
<a href="genindex.html" title="General Index" | ||
>index</a></li> | ||
<li class="right" > | ||
<a href="reader.html" title="Corpus Reader Methods" | ||
>next</a> |</li> | ||
<li class="right" > | ||
<a href="data.html" title="Corpus Data" | ||
>previous</a> |</li> | ||
<li class="nav-item nav-item-0"><a href="index.html">PyCantonese 2.2.0 documentation</a> »</li> | ||
</ul> | ||
</div> | ||
<div class="footer" role="contentinfo"> | ||
© Copyright 2014-2018, Jackson L. Lee | Documentation last updated on June 30, 2018. | ||
Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.7.5. | ||
</div> | ||
</body> | ||
</html> |