The data consists of previously scraped data used in another study, found here. Specifically, this study uses data from 2012 to 2021 from the Singaporean Hansard. The main files of interest are in hansard_full 2, which lists folders containing data from the different parliament sittings - 11th refers to the 11th parliament, 12th to the 12th parliament, etc. final_df.csv is the final working csv file while format_hansard.py and formatted_to_txt.py are py files used to prepare the data for replication. These py files were adapted from here.
The below image features a heat map that highlights the divergence between different years with respect to the use of the term 'minister'. The score ranges from 0-2, with higher values indicating greater divergences and vice versa. For example, between 2012 -2014, 2015-2019, and 2020-2021, the use of the term 'minister' had little change in those groups. This makes sense as these years coincide with the individual parliament sessions (12th, 13th, and 14th).
The below heatmap features how the word 'riot' evolved over the years. Interestingly, there was an increased difference in the use of the word between 2013 and 2014 from 2012 to 2013, with a score of 1.1. This suggests that the way riot was used changed from 2012 to 2014, potentially capturing a change from when the Little India riot happened in December, 2013.
The heatmap below illustrates the change in how the term 'migrantworker' was used. Prior to the riot (2012-2013), it seems that the use of the term was stable with the divergence between the two years at 0.85. However, this changes between 2013-2014 and interestingly, looking at the first two columns of the graph shows that the two years were consistently different from the years following the riot. This could point toward a significant and long-lasting shift in the way the workers were talked about. This is additionally suggested by the cluster of values below 1 from 2014 onwards, indicating little differentiation from each other.
Relevance: These findings validate that the method used can capture high-level patterns that coincide with real-world temporal trends. The 'riot' and 'migrantworker' heatmaps will be especially useful starting points for more qualitative inspections of the data. The heatmaps can also be compared with/validated with KL divergence heatmaps to further illustrate the significance of the events.
Save hansard_full 2 into your chosen directory - record this directory; you'll need it
It is recommended you either work from the Data_preprocessing.ipynb file or run the following functions in a python kernel.
pip install -r requirements.txt
import Data_preprocessing
files = scan_folder(_hansard_full 2 directory_)
word_tokenize(word_list)
migrant_worker(y)
normalizeTokens(word_list)
clean(df)
savepath = "wherever you want to save your yearly dataframes *make sure it's a directory! and ends with a /"
create_df(files, savepath)
path = "wherever you saved your yearly dataframes"
savepath = "wherever you wanna save your final df"
create_final_df(path, savepath)
def sorted_df(df, category):
cats = sorted(set(df[category]))
new_df = pd.DataFrame(index = cats, columns = ['normalized'])
for cat in tqdm(cats):
#This can take a while
print("Embedding {}".format(cat), end = '\r')
subsetDF = df[df[category] == cat]
vocabulary = subsetDF['normalized_KL'].sum()
print(new_df)
new_df.loc[cat]['normalized'] = vocabulary
return new_df
KL = sorted_df(df, 'year')
#Initializing Divergence code
def kl_divergence(X, Y):
P = X.copy()
Q = Y.copy()
P.columns = ['P']
Q.columns = ['Q']
df = Q.join(P).fillna(0)
p = df.iloc[:,1]
q = df.iloc[:,0]
D_kl = scipy.stats.entropy(p, q)
return D_kl
def chi2_divergence(X,Y):
P = X.copy()
Q = Y.copy()
P.columns = ['P']
Q.columns = ['Q']
df = Q.join(P).fillna(0)
p = df.iloc[:,1]
q = df.iloc[:,0]
return scipy.stats.chisquare(p, q).statistic
def Divergence(corpus1, corpus2, difference="KL"):
"""Difference parameter can equal KL, Chi2, or Wass"""
freqP = nltk.FreqDist(corpus1)
P = pd.DataFrame(list(freqP.values()), columns = ['frequency'], index = list(freqP.keys()))
freqQ = nltk.FreqDist(corpus2)
Q = pd.DataFrame(list(freqQ.values()), columns = ['frequency'], index = list(freqQ.keys()))
if difference == "KL":
return kl_divergence(P, Q)
elif difference == "Chi2":
return chi2_divergence(P, Q)
elif difference == "KS":
try:
return scipy.stats.ks_2samp(P['frequency'], Q['frequency']).statistic
except:
return scipy.stats.ks_2samp(P['frequency'], Q['frequency'])
elif difference == "Wasserstein":
try:
return scipy.stats.wasserstein_distance(P['frequency'], Q['frequency'], u_weights=None, v_weights=None).statistic
except:
return scipy.stats.wasserstein_distance(P['frequency'], Q['frequency'], u_weights=None, v_weights=None)
L = []
for p in tqdm(corpora):
l = []
for q in corpora:
l.append(Divergence(p,q, difference = 'KL'))
L.append(l)
M = np.array(L)
fig = plt.figure()
div = pd.DataFrame(M, columns = fileids, index = fileids)
ax = sns.heatmap(div, annot=True)
ax.set_xlabel("Starting year")
ax.set_ylabel("Final year")
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
ax.set_title("Yearly KL-Divergence")
plt.show()
calc_syn0norm(model)
smart_procrustes_align_gensim(base_embed, other_embed, words=None)
intersection_align_gensim(m1, m2, words=None)
import data_analyze_embeddings
df = "path to final_df created in preprocessing"
rawEmbeddings, comparedEmbeddings = compareModels(df, 'year', sort = True)
word = "word of interest"
embeddingsDict = comparedEmbeddings
pltDF = getDivergenceDF(targetWord, comparedEmbeddings)
fig, ax = plt.subplots(figsize = (10, 7))
seaborn.heatmap(pltDF, ax = ax, annot = True) #set annot True for a lot more information
ax.set_xlabel("Starting year")
ax.set_ylabel("Final year")
ax.set_ylabel("Final year")
ax.set_title("Yearly linguistic change for: '{}'".format(targetWord))
plt.show()
# Projections
morality_words = pd.read_csv('/home/valalvern/models/morality_pairs.csv')
word = 'migrantworker'
moral_words = [x for x in morality_words['moral_eng'] if pd.isnull(x) == False]
immoral_words = [x for x in morality_words['immoral_eng'] if pd.isnull(x) == False]
safe = ['secure', 'safe', 'sound', 'harmless', 'innocuous', 'benign', 'wholesome', 'mild', 'guard', 'shield', 'shielded', 'guarded', 'secured']
unsafe = ['insecure', 'unsafe', 'unsound', 'threat', 'harmful', 'hostile', 'vulnerable', 'reckless', 'dangerous', 'threatening', 'risky', 'prevarious', 'unpredictable']
def project_year(embeddingsDict, migrant):
'''
construct affluence, morality, and status dimensions, and
project 'democracy' on these dimensions for each years
'''
m = []
s = []
cats = sorted(set(embeddingsDict.keys()))
for catOuter in cats:
print(catOuter)
#embeddings_aligned[catOuter] = [embeddings_raw[catOuter]]
#for embed in embeddingsDict[catOuter][1:]:
embed = embeddingsDict[catOuter][0]
security = dimension(embed, safe, unsafe)
morality = dimension(embed, moral_words, immoral_words)
try:
m.append(cosine_similarity(embed.wv[migrant].reshape(1, -1), morality.reshape(1,-1))[0][0])
s.append(cosine_similarity(embed.wv[migrant].reshape(1, -1), security.reshape(1, -1))[0][0])
except:
print(catOuter)
projection_df = pd.DataFrame({'morality' : m, 'security': s}, index = [i for i in cats]
)
return projection_df
def plot_projection(projection, title):
#plt.plot(projection_df_safety, label='safety')
plt.plot(projection['morality'], label = 'morality')
plt.plot(projection['security'], label = 'security')
plt.title('{} Projection'.format(title))
plt.xlabel('Year')
plt.ylabel('Projection')
plt.legend(loc=3)
If you use this repository for a scientific publication, we would appreciate it if you cited the Zenodo DOI (see the "Cite as" section on our Zenodo page for more details).