Greek Parliament Debates to Open Linked Data

This repository contains the code and resources for the "Greek Parliament Debates to Open Linked Data" diploma thesis. The project aims to convert Greek Parliament debates from text files (in Word or TXT format) to XML files based on the LegalDocML standard. These XML files are then transformed into RDF triples and uploaded to Apache Fuseki for further analysis and querying using SPARQL. Additionally, the LegalDocML XML files are converted to XML files based on the TEI schema provided by the ParlaMint repository.

Project Structure

The repository is structured as follows:

akn_to_tei/: Directory containing the code and resources for converting LegalDocML XML to TEI XML.
antlr4_grammar/: This directory contains the ANTLR4 grammar file used for parsing the text files and generating the XML output.
check_system_stats/: This directory contains code for generating statistics about the database and files.
lda_topic_modeling/: This directory contains the code for Latent Dirichlet Allocation (LDA) topic modelling.
text_to_akn_xml/: This directory contains the Python code for converting the text files to LegalDocML XML format.
xml_akn_files/: Directory to store the generated LegalDocML XML files.
xml_tei_files/: Directory to store the generated TEI XML files.
xml_to_rdf/: This directory contains the code for transforming the XML files into RDF triples.
sparql_queries.txt: This file provides a collection of example SPARQL queries that can be executed against the RDF data in Apache Fuseki.
debates_papanikolaou_present.pdf: Presentation slides.
diploma_debates_papanikolaou_ioannis.pdf: The diploma thesis (in Greek).
requirements.txt: This file lists the Python dependencies required to run the project.

Sample

You can check a representative sample of the Greek Parliament debate held on June 8, 2018, which has been converted into both XML/LegalDocML and TEI formats.

The files are es20180608000648.docx.xml and es20180608000648.docx_tei.xml respectively.

*The RDF data for this debate is not stored in a single file; it is spread across the RDF files.
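
To get a quick look at what a LegalDocML (Akoma Ntoso) debate file contains, the speeches can be listed with the Python standard library. The following is a minimal sketch, not code from this repository. It assumes the file uses the Akoma Ntoso 3.0 namespace and marks each intervention with a `speech` element carrying a `by` attribute, a `from` child and `p` paragraphs; the function name `list_speeches` and the embedded example document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Akoma Ntoso 3.0 namespace (assumed; check the xmlns on the root element of your file).
AKN_NS = {"akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}


def list_speeches(xml_source):
    """Return (speaker_ref, speaker_label, text) for every <speech> element.

    xml_source may be a str or the raw bytes of an XML file.
    """
    root = ET.fromstring(xml_source)
    speeches = []
    for speech in root.iterfind(".//akn:speech", AKN_NS):
        speaker_ref = speech.get("by", "")
        label = speech.findtext("akn:from", default="", namespaces=AKN_NS).strip()
        # Join the text of each direct <p> child, including any inline markup.
        paragraphs = [
            "".join(p.itertext()).strip()
            for p in speech.iterfind("akn:p", AKN_NS)
        ]
        speeches.append((speaker_ref, label, " ".join(paragraphs)))
    return speeches


# Tiny hand-written document for illustration only; not taken from the repository.
EXAMPLE = """<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <debate>
    <debateBody>
      <debateSection name="example">
        <speech by="#speaker1">
          <from>SPEAKER ONE</from>
          <p>First paragraph.</p>
          <p>Second paragraph.</p>
        </speech>
      </debateSection>
    </debateBody>
  </debate>
</akomaNtoso>"""

if __name__ == "__main__":
    for ref, label, text in list_speeches(EXAMPLE):
        print(ref, label, text)
```

To try it on the sample, pass the bytes of es20180608000648.docx.xml to `list_speeches`. If nothing is returned, the namespace or element names in the file probably differ from the assumptions above.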

Requirements

To run the code in this repository, you will need the following dependencies:

Python 3.10
ANTLR4 Python Runtime
Apache Jena Fuseki

Please make sure to install these dependencies before running the code. You can use the following command to install the required Python packages:

pip install -r requirements.txt

Dataset

The dataset used in this project is available at https://github.com/john-papani/diploma_dataset. It contains the raw text files of Greek Parliament debates in either Word or TXT format, which serve as the source material for the conversion process.

Getting Started

To get started with the project, follow these steps:

Clone the repository to your local machine using the following command:
```
git clone https://github.com/john-papani/diploma
```
Navigate to the project directory:
```
cd diploma
```
Install the required Python packages:
```
pip install -r requirements.txt
```
Run the conversion script to convert the text files to XML:
```
python text_to_akn_xml/convert_to_xml.py
```
This script will process the text files and generate corresponding XML files based on the LegalDocML standard.
Once you have the XML files, run the RDF conversion script to transform them into RDF triples:
```
python xml_to_rdf/create_rdf_speech_debate.py
```
and
```
python xml_to_rdf/create_rdf_members_policalFunction.py
```
These scripts will generate RDF files based on the XML files.
Upload the generated RDF files to Apache Fuseki.
With the RDF data in Fuseki, you can now execute SPARQL queries to analyze and retrieve information from the Greek Parliament debates.
If you want to create TEI files from the LegalDocML XML files, navigate to the akn_to_tei directory and run the following command:
```
python create_tei_from_akn.py
```
This script will generate TEI XML files based on the LegalDocML XML files.
If you want to create LDA results, navigate to the lda_topic_modeling directory and run the following command:
```
python lda.py
```
This script will generate, per year, the word cloud images in wordcloud_img/ and the results of the topic modelling process in results/.
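
The Fuseki upload and query steps above can also be done over HTTP instead of through the web interface. The sketch below is one possible way to do it with the Python standard library, not the method used in the thesis. It assumes a local Fuseki server on port 3030, a dataset named `debates` (a hypothetical name), and the usual `data` (Graph Store Protocol) and `sparql` (query) service endpoints. The content type must match the serialization of the generated RDF files, for example `text/turtle` for Turtle or `application/rdf+xml` for RDF/XML.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumptions: a local Fuseki server and a dataset named "debates".
FUSEKI_URL = "http://localhost:3030"
DATASET = "debates"  # hypothetical dataset name


def build_upload_request(rdf_bytes, content_type="text/turtle"):
    """Build a Graph Store Protocol POST that adds triples to the default graph."""
    return Request(
        f"{FUSEKI_URL}/{DATASET}/data?default",
        data=rdf_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )


def build_query_request(sparql):
    """Build a SPARQL Protocol POST that asks for results as JSON."""
    body = urlencode({"query": sparql}).encode("utf-8")
    return Request(
        f"{FUSEKI_URL}/{DATASET}/sparql",
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def send(request):
    """Send a prepared request and return the response body as text."""
    with urlopen(request) as response:
        return response.read().decode("utf-8")


# Example usage (needs a running Fuseki instance; the file name is a placeholder):
#   send(build_upload_request(open("example.ttl", "rb").read()))
#   print(send(build_query_request("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }")))
```

Building the requests separately from sending them makes it possible to check the URL and headers before contacting the server. The queries in sparql_queries.txt can be passed to `build_query_request` as plain strings.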

Acknowledgements

The ANTLR4 library: https://github.com/antlr/antlr4
Apache Jena Fuseki: https://jena.apache.org/documentation/fuseki2
OASIS LegalDocumentML (LegalDocML) TC: https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=legaldocml
lxml, for processing XML and HTML with Python: https://lxml.de
cobalt, a lightweight Python library for working with Akoma Ntoso (LegalDocML) documents: https://github.com/laws-africa/cobalt
RDFLib, a pure Python package for working with RDF: https://rdflib.readthedocs.io/en/stable/
Saxon XSLT: https://www.saxonica.com/saxon-c/index.xml
pyLDAvis, a Python library for interactive topic model visualization (a port of the R LDAvis package): https://github.com/bmabey/pyLDAvis

Usage Guidelines

Contribution: If you find issues with the project or have improvements to suggest, feel free to open an issue or create a pull request.
Attribution: If you use this project in your research or applications, please provide appropriate attribution to this repository.
Data Integrity: While efforts have been made to ensure the accuracy of the data, please note that no dataset is perfect. Verify the data according to your use case requirements.
