Is a News?
This web application automatically identifies news URLs based on their content and a predefined URL database, which can be useful for downstream applications such as online misinformation detection and news domain identification. To classify URLs, the tool uses a machine learning model that combines a lookup of known news and non-news domains with a content-based classifier trained on a labelled dataset of more than 20000 URLs.
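As a rough illustration of how a domain lookup and a content-based classifier might be combined, here is a minimal Python sketch. It is not the tool's actual implementation: the domain sets, the function names, the keyword-based stand-in classifier, and the choice to check the lookup before the classifier are all assumptions made for this example.

```python
from urllib.parse import urlparse

# Hypothetical lookup tables; the tool's real domain database is not shown here.
NEWS_DOMAINS = {"bbc.com", "reuters.com"}
NON_NEWS_DOMAINS = {"github.com"}


def domain_of(url: str) -> str:
    """Return the host of a URL, lowercased and without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def toy_content_classifier(text: str) -> bool:
    """Stand-in for the trained content-based classifier (an assumption)."""
    cues = ("reported", "according to", "officials said", "breaking")
    lowered = text.lower()
    return sum(cue in lowered for cue in cues) >= 2


def is_news(url: str, page_text: str) -> bool:
    """Check the domain lookup first, then fall back to the content classifier."""
    domain = domain_of(url)
    if domain in NEWS_DOMAINS:
        return True
    if domain in NON_NEWS_DOMAINS:
        return False
    return toy_content_classifier(page_text)
```

In this sketch, URLs from known domains are decided by the lookup alone, and only unknown domains reach the content classifier.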
To access:
Please access the web application via this link
This is an open-source tool. If you have found it useful for your research, please let me know how you used it.
O*NET Knowledge Database
This dataset contains occupational data crawled from the O*NET website. It covers 1110 occupations, with details such as each occupation's summary, tasks, activities, and interest profiles.
Files Description:
occupation_dict.json - lists the 1110 occupations available in O*NET
Each occupation has its own JSON file, stored in the more_info/ directory under the name of that occupation. Each JSON file is a dictionary giving the occupation's details: occupation summary; related occupations; tasks; technology skills; tools used; knowledge; skills; abilities; work activities; interests; work styles; work values; education; job zone; and detailed work activities.
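The files described above could be read along these lines. This is a minimal sketch, assuming the per-occupation files are named `<occupation name>.json` and that the download was extracted into a single root folder. The function names are hypothetical, and the internal structure of occupation_dict.json is not assumed.

```python
import json
from pathlib import Path


def load_occupation_index(root: Path):
    """Load occupation_dict.json, which lists the available occupations."""
    with (root / "occupation_dict.json").open(encoding="utf-8") as f:
        return json.load(f)


def load_occupation(root: Path, name: str) -> dict:
    """Load one occupation's details from more_info/<name>.json (assumed naming)."""
    path = root / "more_info" / f"{name}.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

Printing `sorted(load_occupation(root, name))` for one occupation is a quick way to see the exact key names used in the files before relying on them.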
Download:
Please click here to download this dataset.
If you use this dataset, please cite the following paper:
Amila Silva, Pei-Chi Lo and Ee-Peng Lim, JPLink: On Linking Jobs to Vocational Interest Types, In Proceedings of the 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2020), pp. 220-232, 2020
Singapore Personal Value Dataset
This resource includes two anonymized datasets, one collected from 125 Facebook users ('facebook_dataset.json') and one from 85308 Twitter users ('twitter_dataset.json') in Singapore. Both datasets are in JSON format; each entry in the JSON list corresponds to one user of that social network.
Files Description:
facebook_dataset.json - Each user entry in the Facebook dataset has four keys: '_id' (unique identifier for the user); 'values' (ground truth labels for the personal values, derived from a survey); 'posts_scores' (LIWC/S-LIWC scores calculated from the user's Facebook posts); and 'profile_scores' (LIWC/S-LIWC scores calculated from the user's Facebook profile information)
twitter_dataset.json - Each user entry in the Twitter dataset has three keys: '_id' (unique identifier for the user); 'values' (weak value labels predicted by a personal values prediction engine); and 'posts_scores' (LIWC/S-LIWC scores calculated from the user's Twitter posts)
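A minimal sketch of loading either file and checking that every user entry carries the keys listed above. The key names come from the descriptions above. The function name `load_users` is hypothetical, and the sketch assumes only that each file is a JSON list of dictionaries.

```python
import json
from pathlib import Path

# Keys documented for each dataset.
FACEBOOK_KEYS = {"_id", "values", "posts_scores", "profile_scores"}
TWITTER_KEYS = {"_id", "values", "posts_scores"}


def load_users(path, expected_keys):
    """Load a JSON list of user entries and verify the documented keys exist."""
    with Path(path).open(encoding="utf-8") as f:
        users = json.load(f)
    for user in users:
        missing = expected_keys - set(user)
        if missing:
            raise ValueError(
                f"user {user.get('_id')!r} is missing keys: {sorted(missing)}"
            )
    return users
```

For example, `load_users("facebook_dataset.json", FACEBOOK_KEYS)` returns the list of Facebook user entries, and `load_users("twitter_dataset.json", TWITTER_KEYS)` does the same for Twitter.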
Download:
Please click here to download this dataset.
If you use this dataset, please cite the following paper:
Amila Silva, Pei-Chi Lo and Ee-Peng Lim, On Predicting Personal Values of Social Media Users using Community-Specific Language Features and Personal Value Correlation, In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM 2021)