
My PhD thesis with all its source files, including all .tex files and images created, as well as the slides of my defense.


tca19/phd-thesis


                Improving methods to learn word representations
                ===============================================
                for efficient semantic similarities computations
                ================================================

ABOUT
	This repository contains the PhD thesis  of  Julien  Tissier,  entitled,

	"Improving  methods  to  learn  word   representations   for   efficient
	semantic similarities computations"

	It also contains all the source materials used to  produce  the  thesis,
	including the LaTeX .tex source files, the images and  their  respective
	source files to generate or modify them  (either  the  LibreOffice  Draw
	source  or  the  Python  code)  and  the  slides  of  the  PhD  defense.

CONTENT
	This repository is composed of:
	  - chapters/: this folder contains all the chapters of the  thesis,  as
	    ".tex" source files. There are 10 chapters (from 00-introduction.tex
	    to  09-software.tex),  a  cover   page   (000-garde.tex)   and   the
	    bibliography (99-bibliography.bib).
	  - images/: this folder contains all the  images  used  in  the  thesis
	    (i.e.  with the \includegraphics{} command in the .tex files) either
	    as PNG or PDF.
	  - images-code/: this folder contains the Python code used to  generate
	    some plots or illustration images of the thesis with the  Matplotlib
	    library.
	  - images-src/:  this   folder  contains  the  source  files  of   some
	    illustration images used in the thesis, as  LibreOffice  Draw  files
	    (.odg).
	  - PhD-Defense-Julien-Tissier.pdf: the defense presentation as PDF,  48
	    slides.
	  - PhD-Thesis-Julien-Tissier.pdf: the thesis as PDF, 127 pages.
	  - makefile: used to generate the thesis from source  files.   Use  the
	    command `make` at the root of this repository to  produce  it.   You
	    will  need  the  following  tools:  make,   pdflatex   and   bibtex.
	  - phd-thesis.tex: the main .tex  file,  containing  all  the  LaTeX
	    packages to use and the different chapters to include.
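
	Before running `make`, it can help to check that the three tools  listed
	above are installed.  The snippet below is only a convenience sketch and
	is not part of this repository; the  function  name  `missing_tools`  is
	hypothetical.

```python
import shutil

# Tools the README lists as required to build the thesis.
REQUIRED_TOOLS = ["make", "pdflatex", "bibtex"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All tools found; run `make` at the root of the repository.")
```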

SUMMARY
	Many natural language processing applications rely  on  word  embeddings
	(also called word representations) to achieve state-of-the-art  results.
	These numerical representations  of  the  language  should  encode  both
	syntactic and semantic information to perform well in downstream  tasks.
	However, common models (word2vec, GloVe) learn them from generic corpora
	such as Wikipedia and  therefore  lack  specific  semantic  information.
	Moreover, storing them requires a large amount  of  memory  because  the
	number of representations to save can be on the order of a million.

	The topic of my thesis is to develop new learning algorithms  that  both
	improve the semantic information encoded within the representations  and
	reduce the memory space required to store them and to use  them  in  NLP
	tasks.

	The first part of  my  work  is  to  improve  the  semantic  information
	contained in word embeddings.  I developed dict2vec, a model  that  uses
	additional information from online lexical  dictionaries  when  learning
	word representations.  The dict2vec word embeddings perform ∼15%  better
	than  the  embeddings  learned  by  other  models   on   word   semantic
	similarity tasks.
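
	A common way to score the semantic similarity of two words from  their
	embeddings is the cosine of the angle between the two vectors.  The code
	below is a minimal sketch of that measure, not the evaluation code of
	the thesis; the three-dimensional vectors are made-up toy values, not
	real dict2vec embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (assumed values, for illustration only).
car = [0.9, 0.1, 0.3]
automobile = [0.8, 0.2, 0.4]
banana = [-0.2, 0.9, 0.1]

if __name__ == "__main__":
    print("car / automobile:", cosine_similarity(car, automobile))
    print("car / banana:    ", cosine_similarity(car, banana))
```

	With these toy values, "car" scores higher with "automobile" than with
	"banana", which is the behaviour expected from  embeddings  that  encode
	semantic information.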

	The second part of  my  work  is  to  reduce  the  memory  size  of  the
	embeddings.  I developed an architecture  based  on  an  autoencoder  to
	transform commonly used real-valued embeddings into  binary  embeddings,
	reducing their size in memory by 97% with a loss of only ∼2% in accuracy
	in downstream NLP tasks.
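
	The arithmetic behind such a  reduction  can  be  sketched  as  follows.
	The figures used here (300 dimensions, 32-bit floats, 256-bit codes) are
	assumptions chosen for illustration, and the code does not reproduce the
	autoencoder itself.  It only compares storage sizes and shows  how  two
	binary codes can be compared with bitwise operations.

```python
def size_reduction(dim, bits_per_value, code_bits):
    """Fraction of memory saved when a real-valued vector of `dim` values,
    each stored on `bits_per_value` bits, is replaced by a binary code of
    `code_bits` bits."""
    original_bits = dim * bits_per_value
    return 1 - code_bits / original_bits


def hamming_similarity(a, b, n_bits):
    """Share of identical bits between two `n_bits`-bit codes stored as
    Python integers (XOR marks the differing bits)."""
    differing = bin(a ^ b).count("1")
    return 1 - differing / n_bits


if __name__ == "__main__":
    # Assumed setting: 300 x 32-bit floats versus one 256-bit code.
    print("memory saved:", round(size_reduction(300, 32, 256), 4))
    print("similarity:  ", hamming_similarity(0b1011, 0b1001, 4))
```

	Under these assumptions, a vector of 9600 bits is replaced by a code  of
	256 bits, which is a reduction of about 97%.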

AUTHOR
	Written  by  Julien  Tissier  <30314448+tca19@users.noreply.github.com>.

COPYRIGHT
	This thesis and all the files in this repository are licensed under  the
	"Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
	Public License".  By using or downloading this repository, you agree to:
	  1. NonCommercial - You may not use the material for commercial
	     purposes.
	  2. Attribution - You must give appropriate credit, provide a link to
	     the licensor, and indicate if changes were made. You may do so in
	     any reasonable manner, but not in any way that suggests the
	     licensor endorses you or your use.
	  3. ShareAlike - If you remix, transform, or build upon the material,
	     you must distribute your contributions under the same license as the
	     original.
	  4. No additional restrictions - You may not apply legal terms or
	     technological measures that legally restrict others from doing
	     anything the license permits.

	For more details, see https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode