Dictionaries

Gabor Szarnyas edited this page May 30, 2021 · 1 revision

browsersDic.txt

Contains the name of each browser and its probability. Used to assign a browser to each person according to browser popularity. Used by BrowserDictionary.java.

Sample:

Chrome              0.279    
Internet Explorer   0.232    
Firefox             0.422    
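
A minimal Python sketch of how such a file could be read and sampled. It is not the code in BrowserDictionary.java; the function names `parse_browsers` and `pick_browser` are hypothetical. It assumes the probability is the last whitespace-separated field on each line, since browser names can contain spaces. The three sample weights sum to 0.933 rather than 1, so the sketch scales the draw by the total weight.

```python
import bisect
import itertools
import random

SAMPLE = """\
Chrome              0.279
Internet Explorer   0.232
Firefox             0.422
"""

def parse_browsers(text):
    """Return (names, weights); the weight is assumed to be the last field."""
    names, weights = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split once from the right so "Internet Explorer" stays intact.
        name, weight = line.rsplit(None, 1)
        names.append(name.strip())
        weights.append(float(weight))
    return names, weights

def pick_browser(names, weights, rng):
    """Draw one name with probability proportional to its weight."""
    cumulative = list(itertools.accumulate(weights))
    point = rng.random() * cumulative[-1]
    index = bisect.bisect_right(cumulative, point)
    return names[min(index, len(names) - 1)]

names, weights = parse_browsers(SAMPLE)
print(pick_browser(names, weights, random.Random(42)))
```

Passing the random generator in as a parameter keeps the draw reproducible for a fixed seed.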

companiesByCountry.txt

Contains a country and the name of a company in that country. Used to give each person a workplace in their home country, if one is available. Used by CompanyDictionary.

Sample:

Afghanistan  Kam Air   
Afghanistan  Balkh Airlines   
Afghanistan  Khyber Afghan Airlines   
Afghanistan  MarcoPolo Airways   
Afghanistan  Pamir Airways   
Afghanistan  Bakhtar Afghan Airlines   
Afghanistan  Safi Airways    
Albania      Ada Air    

countryAbbrMapping.txt

Contains a country abbreviation and the name of the country it refers to. Used to link countries to IP addresses (see the zone files described under idzones.txt). Used by IPAddressDictionary.java.

Sample:

ac   United Kingdom academic institutions   
ad   Andorra   
ae   United Arab Emirates   
af   Afghanistan   

dicCelebritiesByCountry.txt

Contains the countryId, the celebrityId and the celebrity's cumulative probability of popularity within that country. Used to assign each person a celebrity from their own country, if one is available. Used by TagDictionary.java.

Sample:

0 0 0.27328605200945627   
0 1 0.4884160756501182   
0 2 0.649645390070922    
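
Because the third column is cumulative, a uniform draw can be mapped to a celebrity with a binary search. The following Python sketch illustrates the idea and is not taken from TagDictionary.java. The names `load_celebrities` and `pick_celebrity` are hypothetical, and the sketch assumes that rows for one country appear in increasing order of cumulative probability.

```python
import bisect
from collections import defaultdict

SAMPLE = """\
0 0 0.27328605200945627
0 1 0.4884160756501182
0 2 0.649645390070922
"""

def load_celebrities(text):
    """Group rows by countryId into (celebrityIds, cumulative bounds)."""
    table = defaultdict(lambda: ([], []))
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        ids, bounds = table[int(fields[0])]
        ids.append(int(fields[1]))
        bounds.append(float(fields[2]))
    return table

def pick_celebrity(table, country_id, u):
    """Return the first celebrity whose cumulative bound exceeds u.

    u is a uniform draw in [0, 1). Returns None when the country has no
    celebrities or u lies beyond the last bound that was loaded.
    """
    if country_id not in table:
        return None
    ids, bounds = table[country_id]
    index = bisect.bisect_right(bounds, u)
    return ids[index] if index < len(ids) else None

table = load_celebrities(SAMPLE)
print(pick_celebrity(table, 0, 0.3))
```

The sample above is only an excerpt, so draws above 0.6496 fall outside the loaded rows and return None here.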

dicLocation.txt

Contains the continent name, the country name, the latitude, the longitude, the population and the cumulative probability of the population. Used to create the region-country hierarchy and to distribute user nationalities according to the population data. Used by LocationDictionary.java.

Sample:

Asia Afghanistan 35 69 15500000 0.0028010447    
Africa Algeria   37  3 29100867 0.008059937    
Africa Angola    -9 13  5646177 0.0090802721    
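
The last column accumulates across rows, so each country's own share of the population is the difference between its value and the previous row's. A small Python sketch of that arithmetic follows. The `Location` record and both function names are hypothetical. The sketch assumes that fields are whitespace-separated, that country names contain no spaces, and that the first row shown is the first row of the file.

```python
from typing import NamedTuple

class Location(NamedTuple):
    continent: str
    country: str
    latitude: float
    longitude: float
    population: int
    cumulative: float

SAMPLE = """\
Asia Afghanistan 35 69 15500000 0.0028010447
Africa Algeria   37  3 29100867 0.008059937
Africa Angola    -9 13  5646177 0.0090802721
"""

def parse_locations(text):
    """Parse six whitespace-separated fields per line into Location rows."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        if len(f) != 6:
            continue  # skip blank or malformed lines
        rows.append(Location(f[0], f[1], float(f[2]), float(f[3]),
                             int(f[4]), float(f[5])))
    return rows

def population_shares(rows):
    """Turn cumulative probabilities back into per-country shares."""
    shares = {}
    previous = 0.0
    for row in rows:
        shares[row.country] = row.cumulative - previous
        previous = row.cumulative
    return shares

shares = population_shares(parse_locations(SAMPLE))
for country, share in shares.items():
    print(country, share)
```

For the sample rows, the recovered shares are roughly proportional to the population column, as expected for a population-weighted distribution.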

dicTopic.txt

Contains the tagId, the tagClassId, the tag name and the tag's foaf:name. Used during serialization to assign names to the tags and to write each tag's basic class. Used by TagDictionary.java.

Sample:

0	349	    Hamid_Karzai	Hamid Karzai    
1	211	            Rumi	Jalal ad-Dīn Muhammad Rumi    
2	98	Mahmud_of_Ghazni	Yamīn al-Dawlah Abul-Qāṣim Maḥmūd Ibn Sebük Tegīn    
3	336	Abbas_I_of_Persia	Shah ‘Abbās I    

email.txt

Contains the email domain name and its probability for the most popular domains, and only the name for the rest. Used to assign email domains to users. Used by EmailDictionary.java.

Sample:

gmail.com   0.45    
gmx.com     0.20    
yahoo.com   0.18    
hotmail.com 0.07    
zoho.com    0.06     
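
Since only the popular domains carry a probability, a reader has to handle lines with one field as well as lines with two. The Python sketch below shows one way to do so and is not the logic of EmailDictionary.java. The function names are hypothetical. How the remaining probability mass is shared among the name-only domains is not stated on this page, so the sketch assumes a uniform split.

```python
SAMPLE = """\
gmail.com   0.45
gmx.com     0.20
yahoo.com   0.18
hotmail.com 0.07
zoho.com    0.06
"""

def parse_email_domains(text):
    """Return (weighted, unweighted): a dict of domain -> probability,
    and a list of domains that appear without a probability."""
    weighted, unweighted = {}, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) >= 2:
            weighted[fields[0]] = float(fields[1])
        else:
            unweighted.append(fields[0])
    return weighted, unweighted

def pick_domain(weighted, unweighted, u, v):
    """Pick a domain from two uniform draws u and v in [0, 1).

    u selects among the weighted domains. If u falls beyond their total
    mass, v selects uniformly among the name-only domains (an assumption).
    """
    running = 0.0
    for domain, probability in weighted.items():
        running += probability
        if u < running:
            return domain
    if unweighted:
        index = min(int(v * len(unweighted)), len(unweighted) - 1)
        return unweighted[index]
    return None

weighted, unweighted = parse_email_domains(SAMPLE)
print(pick_domain(weighted, unweighted, 0.5, 0.0))
```

The five sample probabilities sum to 0.96, which leaves 0.04 for the domains listed by name only.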

givennameByCountryBirthPlace.txt

Contains the country name, the first name, the gender, the birthdate period and an unused number. Used to assign a first name to the user according to gender and age. Used by NamesDictionary.java.

Sample:

Abkhazia  Diana            0  0  1    
Abkhazia  Maya             0  0  1    
Abkhazia  Diana Gurtskaya  1  0  1    
Abkhazia  Diana            0  1  1    

institutesCityByCountry.txt

Contains the country name, the university name and the city of that university. Used to create the country->city hierarchy and to assign the user a university from the same country. Used by OrganizationsDictionary.java (all data) and LocationDictionary.java (the country->city data).

Sample:

Aland_Islands  Aland University of Applied Sciences  Mariehamn    
Abkhazia       Abkhazian State University            Sukhumi    
Afghanistan    Paktia University                     Gardez    
Afghanistan    Baghlan University                    Puli_Khumri   

languagesByCountry.txt

[Work In Progress] Contains the name of the country and a list of language data: the ISO 639-1 code, a * if it is an official language, and the percentage of speakers (0 if unknown). Used to assign the user languages spoken in their country. Used by LanguageDictionary.java.

Sample:

Aruba                es 12.6  en 7.7  nl * 5.8     
Antigua and Barbuda  en * 0    
United_Arab_Emirates ar * 0  fa 0  en 0  hi 0  ur 0     
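
Each line holds a variable number of language entries, and country names may contain spaces, so a plain whitespace split is not enough. The Python sketch below uses a heuristic that is not taken from LanguageDictionary.java. It assumes an entry is a two-letter lowercase code, an optional `*`, and a number, and that everything before the first entry is the country name. The name `parse_language_line` is hypothetical.

```python
import re

# A language entry: two-letter code, optional "*" marker, then a number.
ENTRY = re.compile(r"\b([a-z]{2})\s+(\*\s+)?(\d+(?:\.\d+)?)")

def parse_language_line(line):
    """Return (country, [(code, is_official, percentage), ...])."""
    matches = list(ENTRY.finditer(line))
    if not matches:
        raise ValueError("no language entries found in: %r" % line)
    # Everything before the first entry is treated as the country name.
    country = line[:matches[0].start()].strip()
    languages = [
        (m.group(1), m.group(2) is not None, float(m.group(3)))
        for m in matches
    ]
    return country, languages

for sample in (
    "Aruba                es 12.6  en 7.7  nl * 5.8",
    "Antigua and Barbuda  en * 0",
    "United_Arab_Emirates ar * 0  fa 0  en 0  hi 0  ur 0",
):
    print(parse_language_line(sample))
```

The heuristic would misread a country name that ends in a two-letter lowercase word followed by a number, so a real reader may prefer an explicit field separator if the file has one.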

popularPlacesByCountry.txt

Contains the country name, the location name, the location name with spaces, latitude and longitude. Used by PopularPlacesDictionary.java.

Sample:

Afghanistan  Ab-Kol    Ab-Kol    36.22000122070312  68.5     
Afghanistan  Ab_Bazan  Ab Bazan  36.93333435058594  69.94999694824219     
Afghanistan  Ab_Daw    Ab Daw    36.25              71.16666412353516     
Afghanistan  Ab_Gaj    Ab Gaj    36.98333358764648  72.69999694824219    

smartPhonesProviders.txt

Contains the names of smartphone providers. Used by UserAgentDictionary.java.

Sample:

IPhone    
IPad    
HTC    
Samsung    
LG    

surnameByCountryBirthPlace.txt.freq.sort

Contains the number of occurrences of the last name, the country name and the last name. Used to assign a surname to the user. Used by NamesDictionary.java.

Sample:

2,Abkhazia,Gurtskaya    
1,Abkhazia,Kopitseva    
1,Adjara,Vashalomidze    
7,Afghanistan,Zaland    
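
Unlike the cumulative files above, this one stores raw counts, which can serve directly as sampling weights. The Python sketch below is an illustration rather than the NamesDictionary.java logic. The function names are hypothetical, and it assumes that a surname is drawn in proportion to its count within the person's country.

```python
import csv
import io
import random
from collections import defaultdict

SAMPLE = """\
2,Abkhazia,Gurtskaya
1,Abkhazia,Kopitseva
1,Adjara,Vashalomidze
7,Afghanistan,Zaland
"""

def load_surnames(text):
    """Group rows by country into (surnames, counts)."""
    by_country = defaultdict(lambda: ([], []))
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        names, counts = by_country[row[1].strip()]
        names.append(row[2].strip())
        counts.append(int(row[0]))
    return by_country

def pick_surname(by_country, country, rng):
    """Draw a surname weighted by its count; None for unknown countries."""
    if country not in by_country:
        return None
    names, counts = by_country[country]
    return rng.choices(names, weights=counts, k=1)[0]

surnames = load_surnames(SAMPLE)
print(pick_surname(surnames, "Abkhazia", random.Random(7)))
```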

tagClasses.txt

Contains the tagClassId, the name and the rdf label. Used to serialize the name and label of the tagClasses. Used by TagDictionary.java

Sample:

0	Thing	thing
1	BasketballLeague	basketball league
2	LunarCrater	lunar crater
3	MilitaryPerson	military person
4	AutomobileEngine	automobile engine

tagHierarchy.txt

Contains the base tagClassId and the parent tagClassId. Used to build the tag hierarchy in the serialize process. Used by TagDictionary.java

Sample:

19	179    
136	338     
173	211     
230	149     
305	0     
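
Each row is a child-to-parent edge, so the hierarchy can be held in a dictionary and walked upwards from any tag class. The Python sketch below shows that structure and is not the TagDictionary.java implementation. The names `load_parents` and `ancestors` are hypothetical, and the cycle guard is a defensive addition of this sketch.

```python
SAMPLE = """\
19 179
136 338
173 211
230 149
305 0
"""

def load_parents(text):
    """Map each tagClassId to its parent tagClassId."""
    parents = {}
    for line in text.splitlines():
        fields = line.split()  # handles tabs and spaces alike
        if len(fields) != 2:
            continue
        parents[int(fields[0])] = int(fields[1])
    return parents

def ancestors(parents, tag_class_id):
    """Return the chain of parents, nearest first, stopping at a class
    that has no recorded parent or when a cycle is detected."""
    chain = []
    seen = {tag_class_id}
    current = tag_class_id
    while current in parents:
        current = parents[current]
        if current in seen:
            break  # defensive: avoid looping on malformed data
        seen.add(current)
        chain.append(current)
    return chain

parents = load_parents(SAMPLE)
print(ancestors(parents, 305))
```

With only the sample rows loaded, every chain has length one, because none of the parents shown has a row of its own in the excerpt.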

tagText.txt

Contains the tagId and a text. Used to assign text to posts and comments according to their tags. Used by TagDictionary.java.

Sample:

0  Hamid Karzai, GCMG (Pashto: حامد کرزی, Hāmid Karzay; born 24 December 1957) is the 12th and …   
1  Jalāl ad-Dīn Muḥammad Balkhī, also known as Jalāl ad-Dīn Muḥammad Rūmī and …    
2  Mahmud of Ghazni, actually Yamīn ad-Dawlah Abdul-Qāṣim …    

topicMatrixId.txt

Contains topic id 1, topic id 2, the cumulative percentage for topic 1 and the number of texts in which topic 1 and topic 2 appear together. Used to select a list of tags correlated with the user's main interest. Used by TagDictionary.java.

Sample:

2909 4870 0.0 8.0    
2909 4871 2.392072671167751E-4 8.0    
2909 4872 4.784145342335503E-4 2.0    

topuniversities.txt

Not used, but not deleted yet. Contains the name of the university, the country and a cumulative percentage.

Sample:

University of Cambridge  United_Kingdom  100    
Harvard University       United_States   99.18     
Yale University          United_States   98.68     

idzones.txt

The folder resources/ipaddrByCountries contains a list of files named XX.zone, where XX is a valid country abbreviation listed in the countryAbbrMapping.txt dictionary. Each file contains a list of IP address ranges from that country.

Sample of ad.zone (Andorra):

85.94.160.0/19    
91.187.64.0/19     
109.111.96.0/19     
194.158.64.0/19
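
The entries in a zone file are CIDR blocks, which Python's standard `ipaddress` module can parse directly. The sketch below shows one way to draw an address for a country and is not the IPAddressDictionary.java logic. The function names are hypothetical, and weighting the blocks by their size is an assumption of this sketch.

```python
import ipaddress
import random

SAMPLE = """\
85.94.160.0/19
91.187.64.0/19
109.111.96.0/19
194.158.64.0/19
"""

def load_zone(text):
    """Parse one CIDR block per non-empty line."""
    return [
        ipaddress.ip_network(line.strip())
        for line in text.splitlines()
        if line.strip()
    ]

def random_address(networks, rng):
    """Pick a block weighted by its size, then an address inside it."""
    sizes = [network.num_addresses for network in networks]
    network = rng.choices(networks, weights=sizes, k=1)[0]
    offset = rng.randrange(network.num_addresses)
    return network[offset]

zone = load_zone(SAMPLE)
print(random_address(zone, random.Random(1)))
```

Each /19 block in the sample covers 8192 addresses, so all four blocks are equally likely under this weighting.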