This repository contains two Java projects developed for the CSCI 5408 course: Data Management and Warehousing Analytics Projects.
- Created a lightweight multi-user database system using Java.
- Features multi-user authentication, a query processor, and single-transaction management.
This project solves three problems related to data extraction and processing. It includes algorithms for cleaning news title data and performing sentiment analysis with a Bag of Words (BoW) model. The cleaned data is pushed to a MongoDB Atlas cloud cluster.
- The function starts by taking the fileName and MongoCollection as parameters.
- It creates a File object using the provided file name and gets its absolute path.
- It parses the XML file with the Jsoup library's XML parser and stores the result in an org.jsoup.nodes.Document object named document.
- It initializes an empty List named newsDocuments to store the MongoDB documents that will be created from the XML file.
- The function iterates over each "REUTERS" element in the parsed XML document.
- For each "REUTERS" element, it extracts the text content of the "TEXT > TITLE" and "TEXT > BODY" elements.
- The extracted text is passed through the cleanText method, which removes non-alphanumeric characters.
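The cleaning step above can be sketched as follows. This is a minimal illustration, not the repository's actual cleanText implementation: the class name CleanTextSketch is hypothetical, and the sketch assumes that whitespace is preserved (so words stay separated) and that "alphanumeric" means ASCII letters and digits.

```java
import java.util.regex.Pattern;

class CleanTextSketch {
    // Assumption: everything except ASCII letters, digits, and whitespace is removed.
    private static final Pattern NON_ALPHANUMERIC = Pattern.compile("[^A-Za-z0-9\\s]");
    // Runs of whitespace left behind by removed characters are collapsed to one space.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    static String cleanText(String text) {
        if (text == null) {
            return "";
        }
        String stripped = NON_ALPHANUMERIC.matcher(text).replaceAll("");
        return WHITESPACE_RUN.matcher(stripped).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Prints: Cocoa prices rise again
        System.out.println(cleanText("Cocoa prices <rise> -- again!"));
    }
}
```

Removing characters outright (rather than replacing them with a space) joins tokens such as "3.5" into "35"; whether that is acceptable depends on how the titles are used downstream.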
Steps followed: the UniqueWordCount.py file can be found in the repository inside the Problem1B folder.
- Implemented the task using the positive and negative word files from the same author. The algorithm performs sentiment analysis on news titles stored in a MongoDB collection, scoring each title by the positive and negative words it contains. The results are then inserted into a MySQL table for further analysis.
- Connect to MongoDB:
  - Established a connection to MongoDB using the provided connection string, database name, and collection name.
  - Fetched the desired collection (newsCollection) from the MongoDB database.
- Retrieve News Titles:
  - Queried the MongoDB collection to retrieve the news titles.
  - Stored the news titles in a list (newsTitles).
- Read Positive and Negative Words:
  - Read positive and negative words from files (positive-words.txt and negative-words.txt).
  - Stored positive and negative words in separate lists (positiveWords and negativeWords).
- Sentiment Analysis Loop:
  - Iterated over each news title in the newsTitles list.
- Build Bag of Words (BoW):
  - Built a Bag of Words (BoW) for each news title.
  - Used a map (bow) to store word frequencies in the news title.
- Score Calculation:
  - Compared each word in the BoW to the positive and negative words.
  - Calculated a sentiment score based on the frequency of positive and negative words.
- Determine Polarity:
  - Determined the polarity of the news title based on the calculated score.
- Store Results:
  - Stored the news title, matched words, sentiment score, and polarity in a map (result).
  - Added each result map to a list (results).
- Insert into MySQL Table:
  - Established a connection to a MySQL database using the JDBC URL, username, and password.
  - Inserted each result into a table named NewsAnalysis with columns NewsTitle, MatchedWords, Score, and Polarity.
  - Used a prepared statement to efficiently insert multiple records.
- Close Connections:
  - Closed the MongoDB connection after processing all news titles.
  - Closed the MySQL connection after inserting all results into the table.
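The Bag of Words, scoring, and polarity steps can be sketched as below. This is an illustration under stated assumptions, not the repository's code: the class SentimentSketch and its method names are hypothetical, the scoring rule (add a word's frequency for a positive match, subtract it for a negative match) and the polarity labels "positive", "negative", and "neutral" are assumed, and the word lists are held in sets rather than lists for faster lookup. The MongoDB and MySQL steps are omitted so the sketch runs with the standard library alone.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class SentimentSketch {

    /** Builds a bag of words: each lower-cased word mapped to its frequency in the title. */
    static Map<String, Integer> buildBagOfWords(String title) {
        Map<String, Integer> bow = new LinkedHashMap<>();
        for (String word : title.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                bow.merge(word, 1, Integer::sum);
            }
        }
        return bow;
    }

    /** Scores one title and returns a result map keyed by the NewsAnalysis column names. */
    static Map<String, Object> analyze(String title, Set<String> positiveWords, Set<String> negativeWords) {
        int score = 0;
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : buildBagOfWords(title).entrySet()) {
            if (positiveWords.contains(entry.getKey())) {
                score += entry.getValue();   // assumed rule: +frequency per positive word
                matched.add(entry.getKey());
            } else if (negativeWords.contains(entry.getKey())) {
                score -= entry.getValue();   // assumed rule: -frequency per negative word
                matched.add(entry.getKey());
            }
        }
        // Assumed labels: the sign of the score decides the polarity.
        String polarity = score > 0 ? "positive" : score < 0 ? "negative" : "neutral";

        Map<String, Object> result = new LinkedHashMap<>();
        result.put("NewsTitle", title);
        result.put("MatchedWords", String.join(", ", matched));
        result.put("Score", score);
        result.put("Polarity", polarity);
        return result;
    }

    public static void main(String[] args) {
        Set<String> positiveWords = Set.of("gain", "strong");
        Set<String> negativeWords = Set.of("loss", "weak");
        System.out.println(analyze("Strong gain offsets weak exports", positiveWords, negativeWords));
    }
}
```

For the sample title, two positive matches (strong, gain) and one negative match (weak) give a score of 1 and a polarity of "positive". Each result map could then be bound to a prepared statement for the NewsAnalysis table, as described in the steps above.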
- Sumit Savaliya (sumit.savaliya@dal.ca)