Hours | Coral | Cristal | Vivaldi 1 | Vivaldi 2
---|---|---|---|---
07:30 | Registration | | |
08:30 - 10:00 | MSNDS 2018 | MaHIN 2018 | Tutorial 1 – P1 | Tutorial 2 – P1
10:30 - 12:00 | MSNDS 2018 | SNAST 2018 | Tutorial 1 – P2 | Tutorial 2 – P2
13:30 - 15:30 | SNAA 2018 | SNAST 2018 | Tutorial 3 – P1 | Tutorial 4
16:00 - 17:20 | SI 2018 | PhD Forum full papers session and feedback to students; Posters/Demos madness session for main conference, symposiums and workshops | (16:00-18:00) Tutorial 3 – P2 | (16:00-19:00) Tutorial 5 – P1, P2, P3
17:20 - 19:00 | DYNO 2018 | | |
19:30 - 21:30 | Reception (posters and demos) (Vivaldi+Garden+Rossini) | | |
Duration | Tutorial Title | Instructor
---|---|---
3 Hours | Tutorial 1 - Wikimedia Public (Research) Resources | Diego Saez-Trumper (Wikimedia Foundation)
3 Hours | Tutorial 2 - Generative models of online discussion threads | Pablo Aragon, Vicenç Gómez, Alberto Lumbreras, and Andreas Kaltenbrunner (Universitat Pompeu Fabra; Eurecat, Centre Tecnològic de Catalunya)
3 ½ Hours | Tutorial 3 - Decide: Python Software Program for the Analysis of Collective Decisions | Frans N. Stokman, Jacob Dijkstra, Jelmer Draaijer (University of Groningen, The Netherlands); Marcel van Assen (University of Tilburg and Utrecht University, The Netherlands)
2 Hours | Tutorial 4 - Basics of Privacy on Social Networks | Julian Salas (Internet Interdisciplinary Institute (IN3), Universitat Oberta de Catalunya (UOC), and Center for Cybersecurity Research of Catalonia (CYBERCAT), Barcelona, Spain)
2 Hours | Tutorial 5 - Information Diffusion on Social Networks: From the Traditional Setting to the Future Blockchain-Based | My T. Thai (Computer & Information Science & Engineering Department, University of Florida)
Hours | Verdi | Vivaldi 1 | Vivaldi 2 | Coral | Cristal
---|---|---|---|---|---
07:30 | Registration | | | |
09:00 - 09:30 | Opening Session for ASONAM | | | |
09:30 - 10:30 | Plenary 1 (Springer SNAM Keynote): Aidong Zhang, The State University of New York, USA (Verdi). Network Modeling, Fusion and Analysis with Applications. Chair: Chandan Reddy | | | |
11:00 - 12:40 | 1A: Community Detection and Characterization (Session Chair: Roberto Interdonato) | 1B: Network Structure Analysis I (Session Chair: Rami Puzis) | 1C: Violence I (Session Chair: Gizem Korkmaz) | FOSINT-SI S1 (Session Chair: Andrew Park) | Multidisciplinary S1
14:00 - 15:00 | Plenary 2: Emma Spiro, University of Washington, USA (Verdi). Online Social Networks to Support Personal Health and Wellness. Chair: Jon Rokne | | | |
15:00 - 16:00 | 2A: Modeling I | 2B: Segregation | 2C: Network Structure | FOSINT-SI S2 | Multidisciplinary S2
16:30 - 18:10 | 3A: Recommendation | 3B: Data Quality | 3C: Modeling Social Bots | FOSINT-SI S3 | Multidisciplinary S3
Hours | Verdi | Vivaldi 1 | Vivaldi 2 | Coral | Cristal
---|---|---|---|---|---
07:30 | Registration | | | |
08:30 - 09:30 | Plenary 3: Christoph Stadtfeld, ETH Zürich, Switzerland (Verdi). The Micro-Macro Link in Social Networks. Chair: Ulrik Brandes | | | |
09:30 - 10:30 | 4A: Damage Characterization | 4B: Wikipedia Analysis | 4C: Multiplexity | FAB-S1 | Multidisciplinary S4
11:00 - 12:40 | 5A: Online Behavior | 5B: Misinformation I | 5C: Modeling II | FAB-S2 | Industrial Track S1
14:00 - 16:00 | 6A: Misinformation I | 6B: Opinions and Reviews | 6C: Social Platforms | FAB-S3 | Industrial Track S2
16:30 - 18:00 | Panel Discussion: Are We Unfit for Social Media? Chair: Andrea Tagarelli | | | |
19:30 - 22:00 | Conference Dinner | | | |
Hours | Verdi | Vivaldi 1 | Vivaldi 2 | Coral | Cristal
---|---|---|---|---|---
07:30 | Registration | | | |
08:30 - 09:30 | Plenary 4: George Karypis, University of Minnesota, USA (Verdi). Learning Analytics - Improving Higher Education. Chair: Andrea Tagarelli | | | |
09:40 - 10:30 | 7A: Violence II | 7B: Collectives | HIBIBI-S1 | SAO 2018 |
11:00 - 12:40 | 8A: Location | 8B: Dynamics | 8C: Embeddings and Learning | FAB-S4 | SAO 2018
12:40 - 14:00 | Lunch | | | |
14:00 - 16:00 | 9A: Ranking and Centrality | 9B: News and Politics | 9C: Predictive Modeling | FAB-S5 | HIBIBI-S2
16:10 - 16:30 | Farewell | | | |
Understanding overlapping community structures is crucial for network analysis and prediction. The Affiliation Graph Model (AGM) is one of the most popular models for explaining densely overlapping community structures. In this paper, we thoroughly re-investigate the assumptions made by the AGM model on real datasets. We find that the AGM model is not sufficient to explain several empirical behaviors observed in popular real-world networks. To our surprise, all our experimental results can be explained by a parameter-free hypothesis, leading to much simpler modeling than AGM, which has many parameters. Based on these findings, we propose a parameter-free Jaccard-based Affiliation Graph (JAG) model, which models the probability of an edge as a network-specific constant times the Jaccard similarity between the community sets associated with the individuals. Our modeling is significantly simpler than AGM, and it eliminates the need to associate a parameter, the probability value, with each community. Furthermore, the JAG model naturally explains why (and in fact when) overlapping communities are densely connected. Based on these observations, we propose a new community-driven friendship-formation process, which mathematically recovers the JAG model. JAG is the first model that points towards a direct causal relationship between tight connections in a given community and the number of overlapping communities inside it. Thus, *the most effective way to bring a community together is to form more sub-communities within it.* The community detection algorithm based on our modeling is significantly simpler than existing link-analysis-based methods while achieving state-of-the-art accuracy on six real-world network datasets.
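A minimal sketch of the edge-probability rule described in the abstract above: the probability of an edge equals a single network-specific constant times the Jaccard similarity of the two nodes' community sets. The function and variable names here are illustrative assumptions, not the authors' implementation.

```python
# P(edge u-v) = c * Jaccard(C_u, C_v), where C_u is the set of communities of u
# and c is one network-specific constant (no per-community parameters).
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two community sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def jag_edge_probability(c: float, comms_u: set, comms_v: set) -> float:
    """Edge probability under the (assumed) JAG rule."""
    return c * jaccard(comms_u, comms_v)

# Example: two users sharing 1 of 3 distinct communities, with c = 0.5
print(jag_edge_probability(0.5, {"A", "B"}, {"B", "C"}))  # 0.5 * 1/3 ≈ 0.167
```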
In this work, we focus on the problem of local community detection with edge uncertainty. We use an estimator to cope with the intrinsic uncertainty of the problem. Then we illustrate with an example that periphery nodes tend to be grouped into their neighbor communities in uncertain networks, and we propose a new measure K to address this problem. Due to the very limited publicly available uncertain network datasets, we also put forward a way to generate uncertain networks. Finally, we evaluate our algorithm using existing ground truth as well as based on common metrics to show the effectiveness of our proposed approach.
Metagenomics is an important field in biology in which an environmental sample is sequenced to study the genomic content of the species present in it. The data obtained from sequencing is a mixture of DNA fragments from the several species present in the sample, so an important step in the analysis is to group together the DNA fragments originating from the same species or genus. In this paper we present an approach named ProxiClust, in which we show how community detection methods can be used to handle this task. The large size of the dataset poses a challenge for traditional data mining techniques given their computational and memory complexity. We aim to achieve scalability through a deterministic approach that converts the data from point clouds into a graph and leverages community detection on it to identify groups. First, the relevant pairwise relationships between DNA fragments are extracted by building proximity graphs on the data, so that storing the complete distance matrix in memory can be avoided. Groups are then identified on the graph by leveraging community detection methods. We perform an exploratory study to examine the properties of several approaches within this framework and exhibit specific instances of the approach that perform comparably with state-of-the-art binning methods.
Microblogging social media (mainly represented by Twitter) focuses on fast, open, real-time communication using short messages between users and their followers. These platforms generate large amounts of content, and community finding techniques are an attractive alternative for organising it. However, there is no clear agreement in the literature on a definition of user community for the microblogging use case, leading to unreliable ground-truth data and evaluation. In this work, we differentiate between functional and structural definitions of communities for microblogging. A functional community groups its users by a common independent social function, e.g. fans of the same football team, while in a structural community the members depend exclusively on their connectivity in a network, e.g. modularity. We build and characterise eight types of functional communities to be used as user-labelled ground truth and five types of live user interaction networks from Twitter. We then evaluate thirteen popular structural community definitions using five different Twitter datasets, exploring their goodness and robustness for detecting the functional ground truth under different perturbation strategies. Our results show that definitions based on internal connectivity, e.g. Triangle Participation Ratio, Fraction Over Median Degree or Conductance, work best for the Twitter use case and are very robust. On the other hand, classic scores such as Modularity are limited and do not fit very well due to the sparsity and noise of microblogging.
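One of the internal-connectivity scores named in the abstract above, conductance, can be computed directly from a graph and a candidate community. The sketch below assumes an undirected networkx graph and is only an illustration of the standard definition, not the paper's evaluation code.

```python
# conductance(S) = cut(S, V\S) / min(vol(S), vol(V\S)), where vol() sums node degrees.
import networkx as nx

def conductance(G: nx.Graph, community) -> float:
    S = set(community)
    cut = sum(1 for u, v in G.edges() if (u in S) != (v in S))  # edges crossing the boundary
    vol_S = sum(d for _, d in G.degree(S))
    vol_rest = sum(d for _, d in G.degree()) - vol_S
    denom = min(vol_S, vol_rest)
    return cut / denom if denom > 0 else 0.0

G = nx.karate_club_graph()
print(conductance(G, range(0, 17)))  # lower values indicate better-separated communities
```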
The Jordan center of a graph is defined as a vertex whose maximum distance to other nodes in the graph is minimal, and it finds applications in facility location and source detection problems. We study properties of the Jordan center in the case of random growing trees. In particular, we consider a regular tree graph on which an infection starts from a root node and then spreads along the edges of the graph according to various random spread models. For the Independent Cascade (IC) model and the discrete Susceptible Infected (SI) model, both of which are discrete time models, we show that as the infected subgraph grows with time, the Jordan center persists on a single vertex after a finite number of timesteps.
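For small graphs, the Jordan center defined in the abstract above (the vertex whose maximum distance to all other nodes is minimal) can be computed directly; networkx exposes the same notion as the graph "center". The tree used below is just a stand-in example.

```python
# Jordan center = vertex (or vertices) of minimum eccentricity.
import networkx as nx

G = nx.balanced_tree(r=2, h=3)          # small regular tree as a stand-in example
ecc = nx.eccentricity(G)                # max shortest-path distance from each node
jordan_center = [v for v, e in ecc.items() if e == min(ecc.values())]
print(jordan_center)                    # equivalently: nx.center(G)
```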
Some of the well-known streaming algorithms for estimating the number of triangles in a graph stream work as follows: sample a single triangle with high enough probability and repeat this basic step to obtain a global triangle count. For example, the algorithm due to Buriol et al. (PODS 2006) uniformly at random picks a single vertex v and a single edge e and checks whether the two cross edges that connect v to e appear in the stream. Similarly, the neighborhood sampling algorithm (PVLDB 2013) attempts to sample a triangle by randomly choosing a single vertex v and a single neighbor u of v, and waits for a third edge that completes the triangle. In both algorithms, the basic sampling step is repeated multiple times to obtain an estimate for the global triangle count in the input graph stream. In this work, we propose a multi-sampling variant of these algorithms: in the case of Buriol et al.'s algorithm, instead of randomly choosing a single vertex and edge, we randomly sample multiple vertices and multiple edges and collect the cross edges that connect sampled vertices to sampled edges. In the case of the neighborhood sampling algorithm, we randomly pick multiple edges and multiple neighbors of them. We provide a theoretical analysis of these algorithms and prove that the new algorithms improve upon the known space and accuracy bounds. We experimentally show that these algorithms outperform well-known triangle counting streaming algorithms.
Social networks evolve over time, that is, new contacts appear and old contacts may disappear. They can be modeled as temporal graphs where interactions between vertices (people) are represented by time-stamped edges. One of the most fundamental problems in social network analysis is community detection, and one of the most basic primitives to model a community is a clique. Addressing the problem of finding communities in temporal networks, Viard et al. [TCS 2016] introduced delta-cliques as a natural temporal version of cliques. Himmel et al. [SNAM 2017] showed how to adapt the well-known Bron-Kerbosch algorithm for enumerating static cliques to enumerate delta-cliques. We continue this work and improve and extend this algorithm to enumerate temporal k-plexes. We define a delta-k-plex as a set of vertices with a lifetime, where during the lifetime each vertex has an edge to all but at most k-1 vertices at least once within any consecutive delta+1 time steps. We develop an algorithm for enumerating all maximal delta-k-plexes and perform experiments on real-world networks that demonstrate the practical feasibility of our approach. In particular, for the special case of delta-1-plexes (that is, delta-cliques), we observe that our algorithm is significantly faster than the previous algorithm by Himmel et al. at enumerating delta-cliques.
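A formal paraphrase of the delta-k-plex definition stated in the abstract above (this is a restatement of the abstract's wording, not the paper's verbatim definition):

```latex
% A vertex set S with lifetime [a, b] is a \delta-k-plex if, within every window of
% \delta+1 consecutive time steps of the lifetime, each vertex has an edge to all but
% at most k-1 of the other members:
\[
\forall v \in S,\; \forall t \in [a,\, b-\delta]:\quad
\bigl|\{\, w \in S \setminus \{v\} \;:\; \exists\, t' \in [t,\, t+\delta],\ (\{v,w\}, t') \in E \,\}\bigr| \;\geq\; |S| - k .
\]
```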
Networks or graphs provide a natural and generic way for modeling rich structured data. Recent research on graph analysis has been focused on representation learning, of which the goal is to encode the network structures into distributed embedding vectors, so as to enable various downstream applications through off-the-shelf machine learning. However, existing methods mostly focus on node-level embedding, which is insufficient for subgraph analysis. Moreover, their leverage of network structures through path sampling or neighborhood preserving is implicit and coarse. Network motifs allow graph analysis in a finer granularity, but existing methods based on motif matching are limited to enumerated simple motifs and do not leverage node labels and supervision. In this paper, we develop NEST, a novel hierarchical network embedding method combining motif filtering and convolutional neural networks. Motif-based filtering enables NEST to capture exact small structures within networks, and convolution over the filtered embedding allows it to fully explore complex substructures and their combinations. NEST can be trivially applied to any domain and provide insight into particular network functional blocks. Extensive experiments on protein function prediction, drug toxicity prediction and social network community identification have demonstrated its effectiveness and efficiency.
Cyberbullying is a serious threat to both the short and long-term well-being of social media users. Addressing this problem in online environments demands the ability to automatically detect cyberbullying and to identify the roles that participants assume in social interactions. As cyberbullying occurs within online communities, it is also vital to understand the group dynamics that support bullying behavior. To this end, we propose a socio-linguistic model which jointly detects cyberbullying content in messages, discovers latent text categories, identifies participant roles and exploits social interactions. While our method makes use of content that is labeled as bullying, it does not require category, role or relationship labels. Furthermore, as bullying labels are often subjective, noisy and inconsistent, an important contribution of our paper is effective methods for leveraging inconsistent labels. Rather than discard inconsistent labels, we evaluate different methods for learning from them, demonstrating that incorporating uncertainty allows for better generalization. Our proposed socio-linguistic model achieves an 18% improvement over state-of-the-art methods.
Religious hate speech in the Arabic Twittersphere is a notable problem that requires developing automated tools to detect messages that use inflammatory sectarian language to promote hatred and violence against people on the basis of religious affiliation. Distinguishing hate speech from other profane and vulgar language is quite a challenging task that requires deep linguistic analysis. The richness of the Arabic morphology and the limited available resources for the Arabic language make this task even more challenging. To the best of our knowledge, this paper is the first to address the problem of identifying speech promoting religious hatred in the Arabic Twitter. In this work, we describe how we created the first publicly available Arabic dataset annotated for the task of religious hate speech detection and the first Arabic lexicon consisting of terms commonly found in religious discussions along with scores representing their polarity and strength. We then developed various classification models using lexicon-based, n-gram-based, and deep-learning-based approaches. A detailed comparison of the performance of different models on a completely new unseen dataset is then presented. We find that a simple Recurrent Neural Network (RNN) architecture with Gated Recurrent Units (GRU) and pre-trained word embeddings can adequately detect religious hate speech with 0.84 Area Under the Receiver Operating Characteristic curve (AUROC).
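The best-performing architecture reported in the abstract above (a GRU over pre-trained word embeddings with a sigmoid output) can be sketched as follows. Vocabulary size, sequence length, and layer sizes are placeholder assumptions, and the embedding layer here is randomly initialized; the paper uses pre-trained Arabic word embeddings.

```python
# Sketch of a GRU text classifier of the kind described in the abstract.
import tensorflow as tf

vocab_size, embed_dim, max_len = 50_000, 300, 50

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(vocab_size, embed_dim),  # in practice: initialize from pre-trained vectors
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),    # hate speech vs. not
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auroc")])
model.summary()
```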
Mass gatherings often underlie civil disobedience activities and as such run the risk of turning violent, causing damage to both property and people. While civil unrest is a rather common phenomenon, only a small subset of them involve crowds turning violent. How can we distinguish which events are likely to lead to violence? Using articles gathered from thousands of online news sources, we study a two-level multi-instance learning formulation, CrowdForecaster, tailored to forecast violent crowd behavior, specifically violent protests. Using data from five countries in Latin America, we demonstrate not just the predictive utility of our approach, but also its effectiveness in discovering triggering factors, especially in uncovering how and when crowd behavior begets violence.
Interventions to reduce violence among homeless youth are difficult to implement due to the complex nature of violence. However, a peer-based intervention approach would likely be worthwhile, as it has been shown that individuals who interact with more violent individuals are more likely to be violent, suggesting a contagious nature of violence. We propose the Uncertain Voter Model to represent the complex process of diffusion of violence over a social network, which captures uncertainties in the links and in the time over which the diffusion of violence takes place. Assuming this model, we define the Violence Minimization problem, where the task is to select a predefined number of individuals for intervention so that the expected number of violent individuals in the network is minimized over a given time-frame. We extend the problem to a probabilistic setting, where the success probability of converting an individual into a non-violent one is a function of the number of ``units'' of intervention performed on them. We provide algorithms for finding the optimal intervention strategies for both scenarios. We demonstrate that our algorithms perform significantly better than interventions based on popular centrality measures in terms of reducing violence.
The task of points-of-interest (POI) recommendations has become an essential feature in location-based social networks (LBSNs) with the significant growth of shared data on LBSNs. However it remains a challenging problem, because the decision process of a user choosing to visit a POI depends on numerous factors. The high level of sparsity of the data in LBSNs makes the POI recommendation problem even more challenging, especially for large geographical areas and worldwide datasets. Moreover, in this context the mobility behavior of the users is very heterogeneous, ranging from urban to worldwide mobility. In this paper, we explore the impact of spatial clustering on the recommendation quality. The proposed approach combines spatial clustering with users' influences. It is based on a Poisson factorization model built on an implicit social network, inferred from the geographical mobility patterns. We conduct a comprehensive performance evaluation of our approach on the YFCC dataset (a very large-scale real-world dataset). The experiments show that our approach achieves a significantly superior recommendation quality compared to other state-of-the-art recommendation techniques.
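The core Poisson factorization likelihood referenced in the abstract above, in its standard form; the paper's spatial-clustering and social-influence extensions are not reproduced here.

```latex
% Standard Poisson factorization with Gamma priors (generic notation):
\[
y_{ui} \sim \mathrm{Poisson}\!\left(\theta_u^{\top} \beta_i\right), \qquad
\theta_{uk} \sim \mathrm{Gamma}(a, b), \quad \beta_{ik} \sim \mathrm{Gamma}(c, d),
\]
% where $y_{ui}$ is the implicit feedback (e.g. check-in count) of user $u$ at POI $i$,
% $\theta_u$ the latent user preferences, and $\beta_i$ the latent POI attributes.
```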
Topic discovery has witnessed a significant growth as a field of data mining at large. In particular, time-evolving topic discovery, where the evolution of a topic is taken into account has been instrumental in understanding the historical context of an emerging topic in a dynamic corpus. Traditionally, time-evolving topic discovery approaches have focused on this notion of time. However, especially in settings where content is contributed by a community or a crowd, an orthogonal notion of time is the one that pertains to the level of expertise of the content creator: the more experienced the creator, the more advanced the topic will be. In this paper, we propose a novel time-evolving topic discovery method which, in addition to the extracted topics, is able to identify the evolution of that topic over time, as well as the level of difficulty of that topic, as it is inferred by the level of expertise of its main contributors. Our method is based on a novel formulation of Constrained Coupled Matrix-Tensor Factorization, which adopts constraints that are well motivated for, and, as we demonstrate, are necessary for high-quality topic discovery. We qualitatively evaluate our approach using real data from the Physics Stack Exchange forum, and we were able to identify topics of varying levels of difficulty which can be linked to external events, such as the announcement of gravitational waves by the LIGO lab. We provide a quantitative evaluation of our method by conducting a user study where experts were asked to judge the coherence and quality of the extracted topics. Finally, our proposed method has implications for automatic curriculum design using the extracted topics, where the notion of the level of difficulty is necessary for the proper modeling of prerequisites and advanced concepts. In addition, we conducted experiments on more datasets which can be found in our supplementary materials online.
We extend Schelling’s segregation model to a flexible social network configuration. Agents belong to two groups: they remove and add relationships when more than half of their neighbors are from the other group. We find that the original segregation gap between the intolerance threshold and the resulting segregation level is maintained in a network setting. When comparing different agents' strategies, we find that the less selective behaviors lead to the most segregated final networks. The initial network topology does not seem to affect the results.
Community detection on social media has attracted considerable attention for many years. However, existing methods do not reveal the relations between communities. Communities can form alliances or engage in antagonisms due to various factors, e.g., shared or conflicting goals and values. Uncovering such relations can provide better insights to understand communities and the structure of social media. According to social science findings, the attitudes that members from different communities express towards each other are largely shaped by their community membership. Hence, we hypothesize that inter-community attitudes expressed among users in social media have the potential to reflect their inter-community relations. Therefore, we first validate this hypothesis in the context of social media. Then, inspired by the hypothesis, we develop a framework to detect communities and their relations by jointly modeling users' attitudes and social interactions. We present experimental results using three real-world social media datasets to demonstrate the efficacy of our framework.
The identification of critical nodes in a graph is a fundamental task in network analysis. Centrality measures are commonly used for this purpose. These methods rely on two assumptions that restrict their applicability. First, they only depend on the topology of the network and do not consider the activity over the network. Second, they assume the entire network is available. However, in many applications, it is the underlying activity of the network, such as interactions and communications, that makes a node critical, and it is hard to collect the entire network topology when the network is vast and autonomous. We propose a new measure, Active Betweenness Cardinality, where the importance of a node is based not on the static structure but on the active utilization of the network. We show how this metric can be computed efficiently using only local information for a given node and how we can locate the critical nodes by using only a few nodes. We also show how this metric can be used to monitor a network and identify node failures. We evaluate our metric and algorithms on real-world networks and show the effectiveness of the proposed methods.
Given two graphs, network alignment asks for a potentially partial mapping between the vertices of the two graphs. This arises in many applications where data from different sources need to be integrated. Recent graph aligners use the global structure of input graphs and additional information given for the edges and vertices. We present SiNA, an efficient, shared memory parallel implementation of such an aligner. Our experimental evaluations on a 32-core shared memory machine showed that SiNA scales well for aligning large real-world graphs: SiNA can achieve up to 28.5 times speedup and can reduce the total execution time of a graph alignment problem with 2M vertices and 100M edges from 4.5 hours to under 10 minutes. To the best of our knowledge, SiNA is the first parallel aligner that uses global structure and vertex and edge attributes to handle large graphs.
The public expects a prompt response from emergency services to address requests for help posted on social media. However, the information overload of social media experienced by these organizations, coupled with their limited human resources, challenges them to timely identify and prioritize critical requests. This is particularly acute in crisis situations where any delay may have a severe impact on the effectiveness of the response. While social media has been extensively studied during crises, there is limited work on formally characterizing serviceable help requests and automatically prioritizing them for a timely response. In this paper, we present a formal model of serviceability called Social-EOC (Social Emergency Operations Center), which describes the elements of a serviceable message posted in social media that can be expressed as a request. We also describe a system for the discovery and ranking of highly serviceable requests, based on the proposed serviceability model. We validate the model for emergency services, by performing an evaluation based on real-world data from six crises, with ground truth provided by emergency management practitioners. Our experiments demonstrate that features based on the serviceability model improve the performance of discovering and ranking (nDCG up to 25%) service requests over different baselines. In the light of these experiments, the application of the serviceability model could reduce the cognitive load on emergency operation center personnel, in filtering and ranking public requests at scale.
Most traditional recommender systems perform well only when sufficient user-item interactions, such as purchase records or ratings, have been obtained in advance, yielding poor performance in the scenario of sparse interactions. Addressing this problem, we propose a neural network based recommendation framework which is fed by user/item's original tags as well as the expanded tags from social context. By embedding the latent correlations between tags into distributed feature representations, our model uncovers the implicit relationships between users and items sufficiently, and thus exhibits superior performance regardless of whether sufficient user-item interactions have been obtained. Furthermore, our framework can be extended to link prediction in networks, since recommending an item to a user can be recognized as predicting a link between them. Extensive experiments on two real recommendation tasks, i.e., Weibo followship recommendation and Douban movie recommendation, justify our framework's superiority over state-of-the-art methods.
In this paper, we propose an autonomous and adaptive recommendation system that relies on the user's mood and implicit feedback to recommend songs without any prior knowledge about the user's preferences. Our method autonomously builds a latent factor model from the available online data of many users (a generic song map per mood) based on the associations extracted between user, song, user mood and song emotion. It uses a combination of the Reinforcement Learning (RL) framework and the Page-Hinkley (PH) test to personalize the general song map for each mood according to the user's implicit reward. We conduct a series of tests using the LiveJournal two-million (LJ2M) dataset to show the effect of mood in music recommendation and how the proposed solution can improve the performance of music recommendation over time compared to other conventional solutions in terms of hit rate and F1 score.
"Mobile Applications (or Apps) are becoming more and more popular in recent years, which has attracted increasing attention on mobile App recommendations. The majority of existing App recommendation algorithms focus on mining App functionality or user usage data for discovering user preferences
"Social media sensing has emerged as a new application paradigm to collect observations from online social media users about the physical environment. A fundamental problem in social media sensing applications lies in estimating the evolving truth of the measured variables and the reliability of data sources without knowing either of them a priori. This problem is referred to as dynamic truth discovery. Two major limitations exist in current truth discovery solutions: i) existing solutions cannot effectively address the missing truth problem where the measured variables do not have any reported measurements from the data sources
Text-based media possess a wealth of insights that can be mined to understand perceptions and actions. Researchers and public officials can use these data to inform development policy and humanitarian action. An important step in analyzing text-based databases, such as social media, is the creation of taxonomies which are used to filter information relevant to topics of interest. We worked with thousands of online volunteers to translate 2,137 keywords or phrases in English into formal or vernacular expressions in 29 different languages, with the aim of understanding human responses to natural disasters, as well as developing corpora for under-studied languages (non-English and non-EU languages). In processing the data set, we faced the challenge of selecting a set of quality translations for each language. This paper aims to estimate the quality of the crowdsourced translations produced by non-professional translators. It presents an extensive empirical study using 91 features from the 29 language corpora to describe (a) translators, (b) source expressions, and (c) translated expressions. Our results show that our approach, exploring two regression models and two supervised learning methods, produces better results than a baseline approach with a commonly used metric, namely peer-review scores.
This paper compares several imputation methods for missing data in network analysis on a diverse set of simulated networks under several missing data mechanisms. Previous work has highlighted the biases in descriptive statistics of networks introduced by missing data. The results of the current study indicate that the default methods (analysis of available cases and null-tie imputation) do not perform well with moderate or large amounts of missing data. The results further indicate that multiple imputation using sophisticated imputation models based on exponential random graph models (ERGMs) leads to acceptable biases even under large amounts of missing data.
"Over the past two decades, online social networks have attracted a great deal of attention from researchers. However, before one can gain insight into the behavior or structure of a network, one must first collect appropriate data. Data collection poses several challenges, such as API or bandwidth limits, which require the data collector to carefully consider which queries to make. Many network crawling methods have been proposed
"This work presents an in-depth forensic analysis of a large-scale spam attack launched by one of the largest Twitter botnets reported in academic literature. The Bursty botnet contains over 500,000
"The Posting schedule reveals characteristic patterns of users on social media. Motivated by this knowledge, several researchers have modeled posting schedules and argued that deviation from the model indicates bot or spammer characteristics. It is true that circadian rhythms induce regularity in human posting behavior
According to the Centers for Disease Control and Prevention, hundreds of thousands of people initiate smoking each year, and millions live with smoking-related diseases in the United States. Many tobacco users discuss their opinions, habits and preferences on social media. This work conceptualizes a framework for targeted health interventions to inform tobacco users about the consequences of tobacco use. We designed a Twitter bot named Notobot (short for No-Tobacco Bot) that leverages machine learning to identify users posting pro-tobacco tweets and select individualized interventions to curb their tobacco use. We searched the Twitter feed for tobacco-related keywords and phrases, and trained a convolutional neural network using over 4,000 tweets manually labeled as either pro-tobacco or not pro-tobacco. This model achieved a 90% accuracy rate on the training set and 74% on test data. Users posting pro-tobacco tweets were matched with former smokers with similar interests who posted anti-tobacco tweets. Algorithmic matching, leveraging the power of peer influence, allows for the systematic delivery of personalized interventions based on real anti-tobacco tweets from former smokers. Experimental evaluation suggested that our system would perform well if deployed.
Online social networks (OSNs) such as Twitter and Facebook constitute an open space for developers to create sophisticated machines that imitate human users by automating their social network activities. Socialbots have become so capable that they can be turned into influential users. Previous studies use fundamental functions, such as posting a tweet or creating links to a specific target group, to investigate the infiltration ability of these accounts. Our study analyzes the role of automated chatting in bots' infiltration. Compared with the state of the art of this kind, our analysis reveals that the chat functionality improved the Klout score and follow ratio by about 24% and 123%, respectively. Also, the advanced communication skills contribute to more message interactions between socialbots and other accounts. Based on our empirical study, conversational socialbots infiltrate the Twittersphere more successfully.
Traditional post-disaster assessment of damage heavily relies on expensive GIS data, especially remote sensing image data. In recent years, social media has become a rich source of disaster information that may be useful in assessing damage at a lower cost. Such information includes text (e.g., tweets) or images posted by eyewitnesses of a disaster. Most of the existing research explores the use of text in identifying situational awareness information useful for disaster response teams. The use of social media images to assess disaster damage is limited. In this paper, we propose a novel approach, based on convolutional neural networks and class activation maps, to locate damage in a disaster image and to quantify the degree of the damage. Our proposed approach enables the use of social network images for post-disaster damage assessment, and provides an inexpensive and feasible alternative to the more expensive GIS approach.
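The class activation map used in the localization step described above, written in its standard form (Zhou et al., 2016); the notation is the generic one, not the paper's.

```latex
% Class activation map for class c at spatial location (x, y):
\[
M_c(x, y) \;=\; \sum_{k} w_k^{c}\, f_k(x, y),
\]
% where $f_k(x,y)$ is the activation of feature map $k$ at $(x,y)$ and $w_k^{c}$ is the
% weight of feature map $k$ for class $c$ in the final layer; regions with large $M_c$
% indicate where the network attends when predicting damage of class $c$.
```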
Retrieving relevant information from social media based on specific requirements has become a focus area for researchers. In this paper, we propose a framework for online retrieval of tweets providing information about possible infrastructure damages caused due to earthquakes and use the same to determine a damage score for the possibly affected locations. Identifying such tweets would not only provide a holistic view of the affected areas but would also help in taking necessary relief actions. Existing works on this topic fail to effectively capture the semantic variation in the tweets, possibly due to poor content quality, thereby providing scopes for further improvement in the mechanisms involved. Our proposed technique relies on a novel split-query based mechanism along with a pseudo-relevance feedback approach to identify the relevant tweets. The pseudo-relevance feedback approach expands on an initial set of seed tweets obtained using a semi-automatic query generation mechanism that couples topic based clustering with human annotation. Empirical validation of our proposed method on a manually annotated ground truth data reveals a considerable improvement in precision, recall and mean average precision over several baseline methods.
"We define network-based indicators of diversity of Wikipedia teams and users. A team of Wikipedia users writting an article is said to be diverse if its members typically edit different articles. An individual user is said to be diverse (i.e., is a ""jack of all trades"") if she contributes to articles that are not normally co-edited by the same users. For both indicators we propose a model-based normalization in which we compare observed values to expected values in a random graph model that preserves expected degrees of users and articles. We show, using data on all articles of the English-language edition of Wikipedia, that diverse teams tend to write high-quality articles but articles written by teams of users with high individual diversity tend to be of low quality. These findings are robust with respect to several alternative explanations for article quality. We also show that the proposed model-based normalization of network indicators outperforms an ad-hoc normalization via cosine similarity."
Networks such as social networks, airplane networks, and citation networks are ubiquitous. To apply advanced machine learning algorithms to network data, low-dimensional and continuous representations are desired. To achieve this goal, many network embedding methods have been proposed recently. The majority of existing methods exploit local information, i.e., local connections between nodes, to learn the representations, while neglecting global information (or node status), which has been proven to boost numerous network mining tasks such as link prediction and social recommendation. In this paper, we study the problem of preserving local and global information for network embedding. In particular, we introduce an approach to capture global information and propose a network embedding framework LOG, which can coherently model Local and Global information. Experiments demonstrate the effectiveness of the proposed framework.
"We propose a new model for the study of resilience of coevolving multiplex scale-free networks. Our network model, called \emph{preferential interdependent networks}, is a novel continuum over scale-free networks parameterized by their correlation $\rho, 0 \leq \rho \leq 1$. Our failure and recovery model ties the propensity of a node, both to fail and to assist in recovery, to its importance. We show, analytically, that our network model can achieve any $\gamma, 2 \leq \gamma \leq 3$ for the exponent of the power law of the degree distribution
The most-followed Twitter users and their pairwise relationships form a sub-graph of all Twitter users that we call the Twitter elite network. The connectivity patterns and influence (in terms of reply and retweet activity) among these elite users illustrate how the “important” users connect and interact with one another on Twitter. Such an elite-focused view also provides valuable information about the structure of the Twitter network as a whole. This paper presents the first detailed characterization of the top-10K Twitter elite network. We describe a new technique to efficiently and accurately capture the Twitter elite network along with social attributes of individual elite accounts. We show that a sufficiently large elite network is typically composed of 15-20 resilient and socially cohesive communities representing “socially meaningful” components of the elite network. We then characterize the community-level structure of the elite network in terms of bias in directed pairwise connectivity and relative reachability. We demonstrate that both the retweet and reply activity between elite users are effectively contained within individual elite communities. Finally, we illustrate that a majority of the elite friends of regular Twitter users tend to belong to a single elite community. This finding offers a promising criterion to group regular users into ”shadow partitions” based on their association with elite communities. We show that the level of overall inter-connectivity between shadow partitions mirrors the inter-connectivity of the elite communities. This suggests that these shadow partitions can be viewed as extensions of their corresponding elite communities.
"Twitter has increasingly become a popular platform to share news and user opinion. A tweet is considered to be important if it receives high number of affirmative reactions from other Twitter users via Retweets. Retweet count is thus considered as a surrogate measure for positive crowd-sourced reactions – high number of retweets of a tweet not only help the tweet being broadcasted, but also aid in making its topic trending. This in turn bolsters the social reputation of the author of the tweet. Since social reputation/impact of users/tweets influences many decisions (such as promoting brands, advertisement etc.), several blackmarket syndicates have actively been engaged in producing fake retweets in a collusive manner. Users who want to boost the impact of their tweets approach the blackmarket services, and gain retweets for their own tweets by retweeting other customers’ tweets. Thus they become customers of blackmarket syndicates and engage in fake activities. Interestingly, these customers are neither bots, nor even fake users – they are usually normal human beings
We characterize the Twitter networks of both major presidential candidates, Donald Trump and Hillary Clinton, with various American hate groups defined by the US Southern Poverty Law Center (SPLC). We further examined the Twitter networks for Bernie Sanders, Ted Cruz, and Paul Ryan, for 9 weeks around the 2016 election (4 weeks prior to the election and 4 weeks post-election). By carefully accounting for the observed heterogeneity in the Twitter activity levels across individuals under the null hypothesis of apathetic retweeting that is formalized as a random network model based on the directed, multi-edged, self-looped, configuration model, our data revealed via a generalized Fisher's exact test that there were significantly many Twitter accounts linked to SPLC-defined hate groups belonging to seven ideologies (Anti-Government, Anti-Immigrant, Anti-LGBT, Anti-Muslim, Alt-Right, Neo-Nazi, and White-Nationalist) and also to @realDonaldTrump relative to the accounts of the other four politicians. The exact hypothesis test uses Apache Spark's distributed sort and join algorithms to produce independent samples in a fully scalable way from the null model.
Delay discounting and its relation to human decision making is a hot topic in economics and behavior science since pitting the demands of long-term goals against short-term desires is among the most difficult tasks in human decision making. Previously, small-scale studies based on questionnaires were used to analyze an individual's delay discounting rate (DDR) and its relation to his/her real-world behavior such as substance abuse, credit card default, and pathological gambling. In this research, we employ large-scale social media analytics to study DDR and its relation to people's social media behavior (e.g., Facebook Likes). We also build computational models to automatically infer DDR from Social Media Likes. Since the predicting feature space is very large (e.g., millions of entities that a Facebook user can like) and the size of the delay discounting ground truth dataset is relatively small (e.g., a few thousand people), we focus on studying the impact of different unsupervised feature learning methods on predicting performance. Our results demonstrate the significant role unsupervised feature learning plays in this task.
Until recently, social media was seen to promote democratic discourse on social and political issues. However, this powerful communication platform has come under scrutiny for allowing hostile actors to exploit online discussions in an attempt to manipulate public opinion. A case in point is the ongoing U.S. Congress investigation of Russian interference in the 2016 U.S. election campaign, with Russia accused of, among other things, using trolls (malicious accounts created for the purpose of manipulation) and bots (automated accounts) to spread misinformation and politically biased information. In this study, we explore the effects of this manipulation campaign, taking a closer look at users who re-shared the posts produced on Twitter by the Russian troll accounts publicly disclosed by the U.S. Congress investigation. We collected a dataset with over 43 million elections-related posts shared on Twitter between September 16 and November 9, 2016 by about 5.7 million distinct users. This dataset includes accounts associated with the identified Russian trolls. We use label propagation to infer the users’ ideology based on the news sources they shared, allowing us to classify a large number of them as liberal or conservative with precision and recall above 90%. Conservatives retweeted Russian trolls significantly more often than liberals and produced 36 times more tweets. Additionally, most of the troll content originated in, and was shared by users from, Southern states. Using state-of-the-art bot detection techniques, we estimated that about 4.9% and 6.2% of liberal and conservative users respectively were bots. Text analysis of the content shared by trolls reveals that they had a mostly conservative, pro-Trump agenda. Although an ideologically broad swath of Twitter users was exposed to Russian trolls in the period leading up to the 2016 U.S. Presidential election, it was mainly conservatives who helped amplify their message.
While online social networks (OSNs) have become an important platform for information exchange, the abuse of OSNs to spread misinformation has become a significant threat to our society. To restrain the propagation of misinformation in its early stages, we study the Distance-constrained Misinformation Combat under Uncertainty problem, which aims to both reduce the spread of misinformation and enhance the spread of correct information within a given propagation distance. The problem formulation considers the competitive diffusion of misinformation and correct information. It also accounts for the uncertainty in identifying initial misinformation adopters. For competitive propagation with majority-threshold activation, we propose a solution based on stochastic programming and provide an upper bound in the presence of uncertainty. We propose an efficient Combat Seed Selection algorithm to tackle general-threshold activation, in which we define a measure, “effectiveness”, to evaluate the contribution of nodes to the fight against misinformation. Through extensive experiments, we validate that our algorithm outputs high-quality solutions with very fast computation.
The problem of automatic detection of fake news in social media, e.g., on Twitter, has recently drawn some attention. Although, from a technical perspective, it can be regarded as a straightforward, binary classification problem, the major challenge is the collection of large enough training corpora, since manual annotation of tweets as fake or non-fake news is an expensive and tedious endeavor. In this paper, we discuss a weakly supervised approach, which automatically collects a large-scale, but very noisy training dataset comprising hundreds of thousands of tweets. During collection, we automatically label tweets by their source, i.e., trustworthy or untrustworthy source, and train a classifier on this dataset. We then use that classifier for a different classification target, i.e., the classification of fake and non-fake tweets. Although the labels are not accurate according to the new classification target (not all tweets by an untrustworthy source need to be fake news, and vice versa), we show that, despite this noisy and inaccurate dataset, it is possible to detect fake news with an F1 score of up to 0.9.
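A minimal sketch of the weak-supervision idea described above: tweets are labeled only by whether their source is trustworthy, a classifier is trained on those noisy labels, and it is then applied to the fake/non-fake target task. The vectorizer and classifier choices, and the toy examples, are illustrative assumptions rather than the paper's exact setup.

```python
# Train on source-derived (weak) labels, apply to the fake-news target task.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_tweets = ["breaking: miracle cure found", "city council approves new budget"]
source_labels = [1, 0]   # 1 = posted by an untrustworthy source, 0 = trustworthy source

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(train_tweets, source_labels)        # trained on noisy, source-level labels only

new_tweets = ["aliens spotted over city hall"]
print(clf.predict(new_tweets))              # used as a proxy fake / non-fake prediction
```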
In this paper, we tackle the problem of fake news detection in social media by exploiting the presence of echo chamber communities (communities sharing the same beliefs) that exist within the social network of the users. By modeling the echo chambers as closely-connected communities within the social network, we represent a news article as a 3-mode tensor of the structure $<$News, User, Community$>$ and propose a tensor factorization based method to encode the news article in a latent embedding space preserving the community structure. We also propose an extension of the above method, which jointly models the community and content information of the news article through a coupled matrix-tensor factorization framework. We empirically demonstrate the efficacy of our method for the task of Fake News Detection on two real-world datasets.
Random graph models are important constructs for data analytic applications as well as pure mathematical developments, as they provide capabilities for network synthesis and principled analysis. Several models have been developed with the aim of faithfully preserving important graph metrics and substructures. With the goal of capturing degree distribution, clustering coefficient, and communities in a single random graph model, we propose a new model to address shortcomings in a progression of network modeling capabilities. The Block Two-Level Erdos-Renyi (BTER) model of Seshadhri et al., designed to allow prescription of expected degree and clustering coefficient distributions, neglects community modeling, while the Generalized BTER (GBTER) model of Bridges et al., designed to add community modeling capabilities to BTER, struggles to faithfully represent all three characteristics simultaneously. In this work, we fit BTER and two GBTER configurations to several real-world networks and compare the results with those of our new model, the Extended GBTER (EGBTER) model. Our results support that EGBTER adds community-modeling flexibility to BTER, while retaining a satisfactory level of accuracy in terms of degree and clustering coefficient. Our insights and empirical testing of previous models as well as the new model are novel contributions to the literature.
"Social network data are complex and dependent data. At the macro-level, social networks often exhibit clustering in the sense that social networks consist of communities
The United States is becoming increasingly politically divided. In addition to polarization between the two major political parties, there is also divisiveness in intra-party dynamics. In this paper, we attempt to understand these intra-party divisions by using an exponential random graph model (ERGM) to compute a political cohesion metric that quantifies the strength of cohesion within a party at a given point in time. The analysis is applied to the 105th through 113th congressional sessions of the House of Representatives. We find that the Republican party not only generally exhibits stronger intra-party cohesion, but also, when voting patterns are broken out by topic, has a higher and more consistent cohesion factor compared to the Democratic Party.
Online job search and talent procurement have given rise to challenging matching and search problems in the e-recruitment domain. Existing systems perform direct keyword matching of technical skills, which can miss a closely matching candidate simply because they do not list the exact skills. This yields substandard results that ignore the relationships between technical skills. In an attempt to improve relevancy, this paper proposes a semantic similarity measure between IT skills using a knowledge-based approach. The approach builds an ontology using DBpedia and uses it to derive a similarity score using feature-based similarity measures. The proposed approach performs better than the Resumatcher system in finding the similarity between skills.
Authenticity and reliability of the information spread over the cyberspace is becoming increasingly important. This is especially important in e-commerce since potential customers check reviews and customer feedbacks online before making a purchasing decision. Although this information is easily accessible through related websites, lack of verification of the authenticity of these reviews raises concerns about their reliability. Besides, fraudulent users disseminate misinformation to deceive people into acting against their interest. So, detection of fake and unreliable reviews is a crucial problem that must be addressed by the security researchers. Here we propose a spam review detection framework that incorporates knowledge extracted from the textual content of the reviews with information obtained by exploiting the underlying reviewer-product network structure. In the proposed framework, first, feature vectors are learned for each review, reviewer and product by utilizing state-of-the-art algorithms developed for learning document and node embeddings, and then these are fed into a classifier to identify opinion spam. The effectiveness of our framework over existing techniques on detecting spam reviews is demonstrated in three different data sets containing online reviews. The experimental results obtained confirm that combining representations learned from reviewer-product network and textual review data significantly improves the detection of spam reviews.
Social media, e.g. Twitter, has become a widely used medium for the exchange of information, but it has also become a valuable tool for hackers to spread misinformation through compromised accounts. Hence, detecting compromised accounts is a necessary step toward a safe and secure social media environment. Nevertheless, detecting compromised accounts faces several challenges. First, social media activities of users are temporally correlated which plays an important role in compromised account detection. Second, data associated with social media accounts is inherently sparse. Finally, social contagions where multiple accounts become compromised, take advantage of the user connectivity to propagate their attack. Thus, how to represent each user’s network features for compromised account detection is an additional challenge. To address these challenges, we propose an End-to-End Compromised Account Detection framework (E2ECAD). E2ECAD effectively captures temporal correlations via an LSTM (Long Short-Term Memory) network. Further, it addresses the sparsity problem by defining and employing a user context representation. Meanwhile, informative network-related features are modeled efficiently. To verify the working of the framework, we construct a real-world dataset of compromised accounts on Twitter and conduct extensive experiments. The results of experiments show that E2ECAD outperforms the state of the art compromised account detection algorithms.
Fake news may be intentionally created to promote economic, political and social interests, and can lead to negative impacts on human beliefs and decisions. Hence, detection of fake news is an emerging problem that has become extremely prevalent during the last few years. Most existing works on this topic focus on manual feature extraction and supervised classification models leveraging a large number of labeled (fake or real) articles. In contrast, we focus on content-based detection of fake news articles, while assuming that we have a small amount of labels, made available by manual fact-checkers or automated sources. We argue this is a more realistic setting in the presence of massive amounts of content, most of which cannot be easily fact-checked. So, we represent collections of news articles as multi-dimensional tensors, leverage tensor decomposition to derive concise article embeddings that capture spatial/contextual information about each news article, and use those embeddings to create an article-by-article graph on which we propagate limited labels. Results on real-world datasets show that our method performs on par with or better than existing fully supervised models, in that we achieve better detection accuracy using fewer labels. In particular, our proposed method achieves 75.43% accuracy using only 30% of the labels of a public dataset, while an SVM-based classifier achieves 67.43%. Furthermore, our method achieves 70.92% accuracy on a large dataset using only 2% of the labels.
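The semi-supervised step described above (propagating a few known labels over an article-similarity graph built from article embeddings) can be sketched with scikit-learn's LabelSpreading over a kNN graph. The random embeddings below stand in for the tensor-decomposition output, and the kernel choice is an illustration, not the authors' exact propagation scheme.

```python
# Propagate a small set of labels over an article-similarity (kNN) graph.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 32))      # placeholder article embeddings
labels = np.full(500, -1)                    # -1 marks unlabeled articles
labels[:15] = rng.integers(0, 2, size=15)    # a few fact-checked labels (0 = real, 1 = fake)

model = LabelSpreading(kernel="knn", n_neighbors=10)
model.fit(embeddings, labels)
predicted = model.transduction_              # inferred labels for all articles
print(predicted[:20])
```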
The systematic linking of explicitly observed phrases within a document to entities of a knowledge base has already been explored in a process known as entity linking. The objective of this paper, however, is to identify and link those entities that are not mentioned but are implied within a document, more specifically within a tweet. This process is referred to as implicit entity linking. Unlike prior work that builds a representation for each entity based on its related content in the knowledge base, we propose to perform implicit entity linking by determining how a tweet is related to user-generated content posted online, and as such indirectly perform entity linking. We formulate this problem as an ad-hoc document retrieval process where the input query is the tweet that needs to be implicitly linked, and the document space is the set of user-generated content related to the entities of the knowledge base. We systematically compare our work with the state-of-the-art baseline and show that our method is able to provide statistically significant improvements.
In this paper we study the problem of domain-specific related entity finding on highly-heterogeneous knowledge graphs where the task is to find related entities with respect to a query entity. As we are operating in the context of knowledge graphs, our solutions will need to be able to deal with heterogeneous data with multiple objects and a high number of relationship types, and be able to leverage direct and indirect connections between entities. We propose two novel graph-based related entity finding methods: one based on learning to rank and the other based on subgraph propagation in a Bayesian framework. We perform contrastive experiments using a publicly available knowledge graph and show that both our proposed models manage to outperform a strong baseline based on supervised random walks.
During the past few years, sellers have increasingly offered discounted or free products to selected reviewers of e-commerce platforms in exchange for their reviews. Such incentivized (and often very positive) reviews can improve the rating of a product which in turn sways other users' opinions about the product. Despite their importance, the prevalence, characteristics, and the influence of incentivized reviews in a major e-commerce platform have not been systematically and quantitatively studied. This paper examines the problem of detecting and characterizing incentivized reviews in two primary categories of Amazon products. We describe a new method to identify Explicitly Incentivized Reviews (EIRs) and then collect a few datasets to capture an extensive collection of EIRs along with their associated products and reviewers. We show that the key features of EIRs and normal reviews exhibit different characteristics. Furthermore, we illustrate how the prevalence of EIRs has evolved and been affected by Amazon's ban. Our examination of the temporal pattern of submitted reviews for sample products reveals promotional campaigns by the corresponding sellers and their effectiveness in attracting other users. Finally, we demonstrate that a classifier that is trained by EIRs (without explicit keywords) and normal reviews can accurately detect other EIRs as well as implicitly incentivized reviews. Overall, this analysis sheds an insightful light on the impact of EIRs on Amazon products and users.
Helpful reviews play a pivotal role in recommending desirable goods and accelerating the purchase decisions of customers in e-commerce services. Given a large proportion of product reviews with unknown helpfulness/unhelpfulness, research on the automatic identification of helpful reviews has drawn much attention in recent years. However, state-of-the-art approaches still rely heavily on extracting heuristic text features from reviews with domain-specific knowledge. In this paper, we introduce a multi-task neural learning (MTNL) architecture for identifying helpful reviews. The end-to-end neural architecture learns to reconstruct effective features from the raw input of words and even characters, and the multi-task learning paradigm helps to make more accurate predictions of helpful reviews based on a secondary task that fits the star ratings of the reviews. We also build two datasets containing helpful/unhelpful reviews from different product categories on Amazon, and compare the performance of MTNL with several mainstream methods on both datasets. Experimental results confirm that MTNL outperforms the state-of-the-art approaches by a significant margin.
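For readers who want the multi-task idea in code, here is a minimal sketch (not the paper's MTNL architecture): a shared encoder feeds a helpfulness head and an auxiliary star-rating head, and the two losses are simply summed. The bag-of-words encoder, dimensions and toy batch are assumptions.

```python
# Toy sketch: a shared encoder with a helpfulness head (main task)
# and a star-rating head (auxiliary task); the losses are summed.
import torch
import torch.nn as nn

class MultiTaskReviewModel(nn.Module):
    def __init__(self, vocab=1000, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab, emb)   # simple bag-of-words encoder
        self.shared = nn.Sequential(nn.Linear(emb, hidden), nn.ReLU())
        self.helpful_head = nn.Linear(hidden, 1)   # main task: helpful or not
        self.rating_head = nn.Linear(hidden, 5)    # auxiliary task: 1-5 stars

    def forward(self, token_ids):
        h = self.shared(self.embed(token_ids))
        return self.helpful_head(h).squeeze(-1), self.rating_head(h)

model = MultiTaskReviewModel()
tokens = torch.randint(0, 1000, (8, 40))           # 8 reviews, 40 token ids each
helpful = torch.randint(0, 2, (8,)).float()
stars = torch.randint(0, 5, (8,))

help_logit, star_logit = model(tokens)
loss = nn.BCEWithLogitsLoss()(help_logit, helpful) + nn.CrossEntropyLoss()(star_logit, stars)
loss.backward()
print(float(loss))
```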
The rapid growth of Web 2.0 and the wide popularity of social media have brought the challenge of digesting and understanding large amounts of user-generated text. Automatically finding contradictions in user opinionated text is a potential solution to help the sense-making and decision-making processes based on those user opinions. However, the problem of contradiction detection is understudied in the social media analysis field. This study presents a computational approach to detecting contradictions in user opinionated text. Specifically, a typology of contradictions was proposed, and then state-of-the-art deep learning models were adopted and enhanced by three methods of incorporating sentiment analysis. The enhanced models were evaluated on Amazon's customer reviews. The best model was selected and applied to a collection of tweets from Twitter to demonstrate its usefulness in understanding contradiction semantically and quantitatively in a large amount of user opinionated text.
"Online crowdfunding platforms have given creators new opportunities to obtain funding. Despite the popularity and success of many projects on the platforms, the quality of crowdfunded products in the market (e.g., Amazon) was not statistically and scientifically evaluated yet. To fill the gap, in this paper, we (i) compare crowdfunded products with traditional products in terms of their ratings in the largest e-commerce market, Amazon
"Social media has been widely adopted by online users to share their opinions. Among users in signed networks, two types of opinions can be expressed. They can directly specify opinions to others via establishing positive or negative links
"With the increasing popularity of online video sharing platforms (such as YouTube and Twitch), the detection of content that infringes copyright has emerged as a new critical problem in online social media. In contrast to the traditional copyright detection problem that studies the static content (e.g., music, films, digital documents), this paper focuses on a much more challenging problem: one in which the content of interest is from live videos. We found that the state-of-the-art commercial copyright infringement detection systems, such as the ContentID from YouTube, did not solve this problem well: large amounts of copyright-infringing videos bypass the detector while many legal videos are taken down by mistake. In addressing the copyright infringement detection problem for live videos, we identify several critical challenges: i) live streams are generated in real-time and the original copyright content from the owner may not be accessible
Social Live Stream Services (SLSS) exploit a new level of social interaction. One of the main challenges in these services is how to detect and prevent deviant behaviors that violate community guidelines. In this work, we focus on adult content production and consumption in two widely used SLSS, namely Live.me and Loops Live, which have millions of users producing massive amounts of video content on a daily basis. We use a pre-trained deep learning model to identify broadcasters of adult content. Our results indicate that moderation systems in place are highly ineffective in suspending the accounts of such users. We create two large datasets by crawling the social graphs of these platforms, which we analyze to identify characterizing traits of adult content producers and consumers, and discover interesting patterns of relationships among them, evident in both networks.
"The ways of communication and social interactions are changing. Web users are becoming increasingly engaged with Online Social Networks (OSN), which has a significant impact on the relationship mechanisms between individuals and communities. Most OSN platforms have strict policies regarding data access, harming its usage in psychological and social phenomena studies
"This paper presents an analysis of social experiences around wine consumption through the lens of Vivino, a social network for wine enthusiasts with over 26 million users worldwide. We compare users' perceptions of various wine types and regional styles across both New and Old World wines, examining them across price ranges, vintages, regions, varietals, and blends. We find that ratings provided by Vivino users are not biased by cost
Peatland fire and haze events in Southeast Asia are disasters with trans-boundary implications, having increased in recent years along with rapid deforestation, land clearing and severe dry seasons. Aerosols are emitted in high concentrations from the fires, which degrade air quality and reduce visibility, in turn causing economic, social, health, and environmental problems. During haze events, it is critical for public authorities to have timely information about affected populations. Currently, Indonesian disaster management authorities manage forest and peatland fire and haze events based on satellite data and sensors, and they are looking for more real-time information in order to better protect vulnerable populations and the environment. This paper explores visibility information extracted from photos shared on social media to improve forecasting performance for haze severity. Our results show that visibility information can improve forecast accuracy over a baseline approach with common features, namely data from satellites and ground air quality sensors. Furthermore, by using social media photos, our model adds a near real-time property to the forecast, with the potential to improve disaster management and mitigation.
Gender-based violence is a serious concern in recent times. Due to the social stigma attached to these assaults, victims rarely come forward, and implementing policy measures to prevent sexual violence is constrained by the lack of crime statistics. However, the recent outcry on the Twitter platform allows us to address this concern. Sexual assaults occur at workplaces, public places, educational institutes and also at home, and policy-level approaches and awareness campaigns for these assaults would not be similar. We therefore want to identify the risk factors associated with these sexual assaults. We extracted 0.7 million tweets during the #MeToo social media movement and employ deep learning techniques to classify the reported incidents of sexual violence. We observe that sexual assault by a family member in the victim's own home is a more serious concern than harassment by a stranger in public places. This study reveals that assaults by a known person are more prevalent than assaults by unknown strangers.
"Cyberbullying has emerged as a large-scale societal problem that demands accurate methods for its detection in an effort to mitigate its detrimental consequences. While automated, data-driven techniques for analyzing and detecting cyberbullying incidents have been developed, the scalability of existing approaches has largely been ignored. At the same time, the complexities underlying cyberbullying behavior (e.g., social context and changing language) make the automatic identification of ""the best subset of features"" to use challenging. We address this gap by formulating cyberbullying detection as a sequential hypothesis testing problem. Based on this formulation, we propose a novel algorithm to drastically reduce the number of features used in classification. We demonstrate the utility, scalability and responsiveness of our approach using a real-world dataset from Instagram, the online social media platform with the highest percentage of users reporting experiencing cyberbullying. Our approach improves recall by a staggering 700%, while at the same time reducing the average number of features by up to 99.82% compared to state-of-the-art supervised cyberbullying detection methods, learning approaches that require weak supervision, and traditional offline feature selection and dimensionality reduction techniques."
In this paper, we introduce homophily to a game-theoretic model of collective action (e.g., protests) on Facebook and study the effect of homophily in individuals' willingness to participate in collective action, i.e., their thresholds, on the emergence and spread of collective action. We use a real Facebook network and conduct computational experiments to study contagion dynamics (the size and the speed of diffusion) with respect to the level of homophily.
"Abduction is an inference approach that uses data and observations to identify plausible (and preferably, best) explanations for phenomena. Applications of abduction (e.g., robotics, genetics, image understanding) have largely been devoid of human behavior. Here, we devise and execute an iterative abductive analysis process that is driven by the social sciences: behaviors and interactions among groups of human subjects. One goal is to understand intra-group cooperation and its effect on fostering collective identity. We build an online game platform
Generating friend recommendations in location-based social networks is a challenging task, as we have to learn how different contextual factors influence users' behavior to form social relationships. For example, the contextual information of users' check-in behavior at common locations and users' activities in nearby regions may impact users' relationships. In this paper we propose a deep pairwise learning model, namely FDPL. Our model first learns low-dimensional latent embeddings of users' social relationships by jointly factorizing them with the available contextual information based on a multi-view learning strategy. In addition, to account for the fact that the contextual information is non-linearly correlated with users' social relationships, we design a deep pairwise learning architecture based on a Bayesian personalized ranking strategy. We learn non-linear deep representations of the computed low-dimensional latent embeddings by formulating the top-k friend recommendation task in location-based social networks as a ranking task in our deep pairwise learning strategy. Our experiments on three real-world location-based social networks from Brightkite, Gowalla and Foursquare show that the proposed FDPL model significantly outperforms other state-of-the-art methods. Finally, we evaluate the impact of contextual information on our model and experimentally show that it is a key factor in boosting friend recommendation accuracy in location-based social networks.
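As a small illustration of the pairwise ranking ingredient (a sketch, not the FDPL implementation), the snippet below computes a Bayesian personalized ranking loss that scores observed friendships above sampled non-friends; contextual check-in features could be concatenated to the embeddings in the same spirit. Sizes and samples are toy assumptions.

```python
# Toy BPR loss: observed friendships should score higher than sampled non-friends.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_users, dim = 50, 16
user_emb = nn.Embedding(n_users, dim)

u = torch.randint(0, n_users, (32,))        # anchor users
pos = torch.randint(0, n_users, (32,))      # observed friends
neg = torch.randint(0, n_users, (32,))      # sampled non-friends

score_pos = (user_emb(u) * user_emb(pos)).sum(dim=1)
score_neg = (user_emb(u) * user_emb(neg)).sum(dim=1)
bpr_loss = -F.logsigmoid(score_pos - score_neg).mean()
bpr_loss.backward()
print(float(bpr_loss))
```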
Detecting local events (e.g., protests, accidents) in real-time is an important task needed by a wide spectrum of real-world applications. In recent years, with the proliferation of social media platforms, we can access massive geo-tagged social messages, which can serve as a precious resource for timely local event detection. However, existing local event detection methods either suffer from unsatisfactory performances or need intensive annotations. These limitations make existing methods impractical for large-scale applications. Through the analysis of real-world datasets, we found that the informativeness level of social media users, which is neglected by existing work, plays a highly critical role in distilling event-related information from noisy social media contexts. Motivated by this finding, we propose an unsupervised framework, named LEDetect, to estimate the informativeness level of social media users and leverage the power of highly informative users for local event detection. Experiments on a large-scale real-world dataset show that the proposed LEDetect model can improve the performance of event detection compared with the state-of-the-art unsupervised approach. Also, we use case studies to show that the events discovered by the proposed model are of high quality and the extracted highly informative users are reasonable.
We analyze fifteen Twitter user geolocation models and two baselines comparing how they are evaluated. Our results demonstrate that the choice of effectiveness metric can have a substantial impact on the conclusions drawn from an experiment. We show that for general evaluations, a range of metrics should be reported to ensure that a complete picture of system effectiveness is conveyed.
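The point about metrics can be made concrete with a toy evaluation helper; the metric names used here (mean error, median error, accuracy within 161 km) are common choices in this literature, and the coordinates below are made up.

```python
# Toy evaluation helper: the same predictions can look quite different under
# mean error, median error and accuracy-within-161-km.
import math

def haversine_km(p, q):
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = math.sin((lat2 - lat1) / 2) ** 2 + \
        math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))

def report(predicted, gold):
    errs = sorted(haversine_km(p, g) for p, g in zip(predicted, gold))
    return {"mean_km": sum(errs) / len(errs),
            "median_km": errs[len(errs) // 2],
            "acc@161": sum(e <= 161 for e in errs) / len(errs)}

gold = [(40.71, -74.00), (51.51, -0.13), (35.68, 139.69)]
pred = [(40.70, -73.95), (48.85, 2.35), (1.35, 103.82)]   # made-up predictions
print(report(pred, gold))
```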
"Online Social Network (OSN) communities serve as different platforms for multiple users’ interaction – people behaving diversely among distinctive communities – such as entertainment, global and local discussion communities. However, attribute identification among online discussion communities remain largely unexplored.
Leadership is an essential part of collective decision making and organization in social animals, including humans. In nature, leadership is dynamic and varies with context or temporal factors. Understanding the dynamics of leadership, such as how leaders change, emerge, or converge, allows scientists to gain more insight into group decision-making and collective behavior in general. However, given only data on individual activities, it is challenging to infer these dynamic leadership events. In this paper, we focus on mining and modeling frequent patterns of leadership dynamics. We formalize a new computational problem, Mining Patterns of Leadership Dynamics, and propose a framework as a solution to this problem. Our framework can be used to address several questions regarding the leadership dynamics of group movement. We use the leadership inference framework mFLICA to infer the time series of leaders from movement datasets, and then propose an approach to mine and model frequent patterns of leadership dynamics. We evaluate our framework's performance on several simulated datasets, as well as on a real-world dataset of baboon movement, to demonstrate the application of our framework. There are no existing methods that address this problem; thus, we modify and extend the existing leadership inference framework to provide a non-trivial baseline. Our framework performs better than this baseline on all datasets. Moreover, we also propose a method to perform statistical significance tests, comparing inferred frequent patterns of leadership dynamics with our proposed null hypotheses. Our framework opens opportunities for scientists to generate scientific hypotheses about the dynamics of leadership in movement data that can be tested statistically.
This paper examines the problem of adaptive influence maximization in social networks. As adaptive decision making is a time-critical task, a realistic feedback model, called myopic, has been considered. In this direction, we propose the myopic adaptive greedy policy, which is guaranteed to provide a (1 - 1/e)-approximation of the optimal policy under a variant of the independent cascade diffusion model. This strategy maximizes an alternative utility function that has been proven to be adaptive monotone and adaptive submodular. The proposed utility function considers the cumulative number of active nodes over time, instead of the total number of active nodes at the end of the diffusion. Our empirical analysis on real-world social networks reveals the benefits of the proposed myopic strategy, validating our theoretical results.
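For intuition, the sketch below runs the classic non-adaptive greedy seeding under an independent cascade model with Monte Carlo spread estimation; it is a simplification for illustration only and does not implement the paper's adaptive myopic policy. Graph, propagation probability and simulation counts are arbitrary.

```python
# Toy greedy seeding under an independent cascade model (non-adaptive,
# Monte Carlo spread estimation); a simplification for illustration only.
import random
import networkx as nx

def ic_spread(graph, seeds, p=0.1, runs=100):
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            newly = {v for u in frontier for v in graph.neighbors(u)
                     if v not in active and random.random() < p}
            active |= newly
            frontier = list(newly)
        total += len(active)
    return total / runs

def greedy_seeds(graph, k=3):
    seeds = []
    for _ in range(k):
        # Add the node with the largest simulated marginal gain in spread.
        best = max((n for n in graph if n not in seeds),
                   key=lambda n: ic_spread(graph, seeds + [n]))
        seeds.append(best)
    return seeds

g = nx.barabasi_albert_graph(100, 2, seed=1)
print(greedy_seeds(g))
```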
Research shared on digital social media has enabled us to measure the impact of academic entities beyond the conventional bibliometric community. We explore diffusion-based metrics to measure the influence of academic entities on social media using a 2-layered graph, where the first layer is the graph between academic and social media entities and the second layer is the graph between social media entities. We employ heat diffusion algorithms to measure the social impact of academic entities and evaluate them by (i) predicting links between academic entities and social media and (ii) suggesting memes for the academic entities. Our analysis of predicting links between scientists and social media entities showed an AUC-ROC score of 0.73 and an AUC-PR score of 0.30. Similarly, predicting links between scientific publications and social media entities showed an AUC-ROC score of 0.80 and an AUC-PR score of 0.19. Our approach also provides reasonable social media entity (meme) suggestions for scientific publications.
Simulating the behavior of economic agents fosters the analysis of interconnected markets' dynamics. Here, we extend the state-of-the-art by adding realistic details to simulating economic exchange networks. To this end, we use our economic network simulation framework TrEcSim, which is designed to support the following real-life features: specific complex network topologies, evolution of economic agent roles, dynamic creation of new economic agents, diversity in product types, dynamic evolution of product prices, and investment decisions at agent-level. Using TrEcSim, we simulate and determine the point at which the networks (having different topology types) transition from being a topocratic system to becoming a meritocratic one. Simulation also allows for analyzing the dynamic evolution of producers and middlemen distribution in the economic exchange network. Moreover, we gain valuable insight regarding the distribution of payoff for each agent-role in various economic exchange networks, as follows: when producers are assigned randomly to topological positions, the payoff distribution within the producers category is fat-tailed (only a handful of producers benefit from an increased payoff), while the payoff of the middlemen category closely resembles a normal (Gaussian) distribution. However, when the topological positions of producers are assigned preferentially, the payoff distributions of the two role categories reverse.
Social media has become a valuable tool for hackers to disseminate misleading content through compromised accounts. Detecting compromised accounts, however, is challenging due to the noisy nature of social media posts and the difficulty of acquiring sufficient labeled data that can effectively capture a wide variety of compromised tweets from different types of hackers (spammers, vandals, cybercriminals, revenge hackers, etc.). To address these challenges, this work presents CADET, a multi-view learning framework that employs nonlinear autoencoders to learn feature embeddings from multiple views, such as the tweets' content, source, location, and timing information, and then projects the embedded features into a common lower-rank feature representation. Suspicious user accounts are detected based on their reconstruction errors in the shared subspace. Our empirical results show the superiority of CADET compared to several existing representative approaches when applied to a real-world Twitter dataset.
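A single-view simplification of the reconstruction-error idea is sketched below (CADET itself uses multiple views and a shared lower-rank projection): an autoencoder is trained on ordinary account features, and accounts with unusually large reconstruction error are flagged. All data here is synthetic.

```python
# Toy anomaly scoring by reconstruction error: train an autoencoder on normal
# account features and flag accounts the model cannot reconstruct well.
import torch
import torch.nn as nn

feat_dim = 20
auto = nn.Sequential(nn.Linear(feat_dim, 8), nn.ReLU(), nn.Linear(8, feat_dim))
opt = torch.optim.Adam(auto.parameters(), lr=1e-2)

normal = torch.randn(256, feat_dim)                 # synthetic "normal account" features
for _ in range(200):
    opt.zero_grad()
    loss = nn.MSELoss()(auto(normal), normal)
    loss.backward()
    opt.step()

test = torch.cat([torch.randn(5, feat_dim), torch.randn(5, feat_dim) + 4.0])
errors = ((auto(test) - test) ** 2).mean(dim=1)     # larger error -> more suspicious
print(errors.detach().numpy().round(2))
```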
"Social media has become an inevitable part of individuals personal and business lives. Its benefits come with various negative consequences. One major concern is the prevalence of detrimental online behavior on social media, such as online harassment and cyberbullying. In this study, we aim to address the computational challenges associated with harassment detection in social media by developing a machine learning framework with three distinguishing characteristics. (1) It uses minimal supervision in the form of expert-provided key phrases that are indicative of bullying or non-bullying. (2) It detects harassment with an ensemble of two learners that co-train one another
Recent advances in the field of network representation learning are mostly attributed to the application of the skip-gram model in the context of graphs. State-of-the-art analogues of the skip-gram model in graphs define a notion of neighbourhood and aim to find the vector representation of a node that maximizes the likelihood of preserving this neighbourhood. In this paper, we take a drastic departure from the existing notion of the neighbourhood of a node by utilizing the idea of coreness. More specifically, we utilize the well-established idea that nodes with similar core numbers play equivalent roles in the network and hence induce a novel and organic notion of neighbourhood. Based on this idea, we propose core2vec, a new algorithmic framework for learning a low-dimensional continuous feature mapping for a node. Consequently, nodes having similar core numbers are relatively closer in the vector space that we learn. We further demonstrate the effectiveness of core2vec by comparing word similarity scores obtained by our method, where the node representations are drawn from standard word association graphs, against scores computed by other state-of-the-art network representation techniques like node2vec, DeepWalk and LINE. Our method consistently outperforms these existing techniques, in some cases achieving improvements as high as 46% on certain ground-truth word similarity datasets. We make all code used in this paper available in the public domain: https://github.com/Sam131112/Core2vec_test.
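A loose sketch of the coreness-based neighbourhood idea (not the released core2vec code): random walks that prefer neighbours with similar core numbers are fed to a skip-gram model, so nodes with similar coreness land close together in the embedding space. The graph, walk length and hyperparameters are toy choices.

```python
# Toy coreness-biased random walks fed to a skip-gram model.
import random
import networkx as nx
from gensim.models import Word2Vec

g = nx.karate_club_graph()
core = nx.core_number(g)                              # k-core number of every node

def walk(start, length=10):
    path, node = [start], start
    for _ in range(length - 1):
        nbrs = list(g.neighbors(node))
        # Prefer neighbours whose core number is close to the current node's.
        weights = [1.0 / (1 + abs(core[node] - core[v])) for v in nbrs]
        node = random.choices(nbrs, weights=weights)[0]
        path.append(node)
    return [str(v) for v in path]

walks = [walk(n) for n in g.nodes() for _ in range(10)]
emb = Word2Vec(walks, vector_size=16, window=3, min_count=1, sg=1, epochs=5)
print(emb.wv.most_similar("0", topn=3))               # nodes embedded near node 0
```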
Graph representations have grown increasingly popular in recent years. Existing embedding approaches explicitly encode network structure. Despite their good performance in downstream processes (e.g., node classification), there is still room for improvement in different aspects, like effectiveness. In this paper, we propose t-PNE, a method that addresses this limitation. Contrary to baseline methods, which generally learn explicit node representations by solely using an adjacency matrix, t-PNE exploits a multi-view information graph—the adjacency matrix represents the first view, and a nearest neighbor adjacency, computed over the node features, is the second view—in order to learn explicit and implicit node representations, using the Canonical Polyadic (a.k.a. CP) decomposition. We argue that the implicit and the explicit mapping from a higher-dimensional to a lower-dimensional vector space is the key to learning more useful and highly predictive representations. Extensive experiments show that t-PNE drastically outperforms baseline methods by up to 158.6% with respect to Micro-F1, in several multi-label classification problems.
Graph pattern matching has been widely used in a large spectrum of real applications. In this context, different models, along with their appropriate algorithms, have been proposed. However, a major drawback of existing models is their limited ability to find meaningful matches, resulting in a number of failing queries. In this paper we introduce a new model for graph pattern matching that allows the relaxation of queries in order to avoid the empty-answer problem. We then develop an efficient algorithm based on optimization strategies for computing the top-k matches according to our model. Our experimental evaluation on four real datasets demonstrates both the effectiveness and the efficiency of our approach.
Due to the emergence of several nutrition-related mobile applications and websites in recent years, as well as the massive amount of crowd-sourced nutrition data, searching for and finding relevant results has become increasingly difficult for users. This problem becomes even more challenging when dealing with crowd-sourced food names that are noisy and not well structured. Because food names are short in length, it is difficult to apply existing methods and achieve optimal matching quality. Despite several recent studies on nutrition data, these challenges remain. In this paper, we propose a novel learning-to-rank framework for crowd-sourced food names that has significant real-world applications, including food search and food recommendation. In particular, we propose a deep learning based, multi-modal learning-to-rank model that leverages the text describing a food name and the numerical values that represent its nutritional information. To this end, we also introduce a novel type of loss function, which extends the standard triplet hinge loss to a multi-modal scenario. The proposed model is flexible and supports various data types as well as an arbitrary number of modalities. The effectiveness of our proposed model is demonstrated through several experiments on real data, consisting of more than six million instances.
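The multi-modal triplet idea can be sketched as follows, under our own assumptions about encoders and dimensions (this is not the paper's model): each food item is encoded from a text vector and a nutrition vector, the two encodings are fused, and a triplet hinge loss pulls relevant items closer to the query than irrelevant ones.

```python
# Toy multi-modal encoder with a triplet hinge loss over (query, relevant, irrelevant).
import torch
import torch.nn as nn

class FoodEncoder(nn.Module):
    def __init__(self, text_dim=50, nutr_dim=10, out_dim=16):
        super().__init__()
        self.text = nn.Linear(text_dim, out_dim)      # text-modality encoder
        self.nutr = nn.Linear(nutr_dim, out_dim)      # nutrition-modality encoder
        self.fuse = nn.Linear(2 * out_dim, out_dim)   # fuse the two modalities

    def forward(self, text_vec, nutr_vec):
        return self.fuse(torch.cat([self.text(text_vec), self.nutr(nutr_vec)], dim=-1))

enc = FoodEncoder()
batch = 8
anchor = enc(torch.randn(batch, 50), torch.randn(batch, 10))   # query food
pos = enc(torch.randn(batch, 50), torch.randn(batch, 10))      # relevant match
neg = enc(torch.randn(batch, 50), torch.randn(batch, 10))      # irrelevant item
loss = nn.TripletMarginLoss(margin=1.0)(anchor, pos, neg)
loss.backward()
print(float(loss))
```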
This paper aims to identify the factors that affect the impact of Open Source Software (OSS), measured by number of downloads and citations, with a case study of R packages. We generate the dependency and contributor networks of the packages using data collected from Depsy.org, and develop statistical models that use the network characteristics, as well as author and package attributes. We find that there are common network and package attributes that are important in determining both the number of downloads and citations of a package, including degree, closeness and betweenness centralities, as well as package attributes such as number of authors and number of commits.
The moderation of content in many social media systems, such as Twitter and Facebook, motivated the emergence of a new social network system that promotes free speech, named Gab. Soon after its launch, Gab was removed from the Google Play Store for violating the company's hate speech policy and was rejected by Apple for similar reasons. In this paper we characterize Gab, aiming to understand who the users who joined it are and what kind of content they share on this system. Our findings show that Gab is a very politically oriented system that hosts banned users from other social networks, some of them banned due to possible cases of hate speech and association with extremism. We provide the first measurement of news dissemination inside a right-leaning echo chamber, investigating a social media platform where readers are rarely exposed to content that cuts across ideological lines, but rather are fed content that reinforces their current political or social views.
Political campaigns have frequently used online social networks as an important environment to present candidates' ideas, their activities, and their electoral plans if elected. Some users are more politically engaged than others; for example, we can observe intense political debates on Twitter, especially during major campaigns. In this context, this paper presents a characterization of politically engaged user groups on Twitter during the 2016 US Presidential Campaign. Using a rich dataset with 23 million tweets, 115 thousand user profiles and their contact network, collected from January 2016 to November 2016, we identified four politically engaged user groups: advocates for each of the two main candidates, political bots, and regular users. We characterize how Twitter users behave during a political campaign through language pattern analysis of tweets, which users gain more popularity during the campaign, and how tweets from each candidate may have affected their mood variation, as expressed by the messages they share.
In the context of community detection in online social media, a lot of effort has been put into the definition of sophisticated network clustering algorithms and much less on the equally crucial process of obtaining high-quality input data. User-interaction data explicitly provided by social media platforms has largely been used as the main source of data because of its easy accessibility. However, this data does not capture a fundamental and much more frequent type of participatory behavior where users do not explicitly mention others but direct their messages to an invisible audience following a common hashtag. In the context of multiplex community detection, we show how to construct an additional data layer about user participation that does not rely on explicit interactions between users, and how this layer can be used to find different types of communities in the context of Twitter political communication.
The rising popularity of social media has radically changed the way news content is propagated, including interactive attempts with new dimensions. To date, traditional news media such as newspapers, television and radio have already adapted their activities to online news media by utilizing social media, blogs, websites, etc. This paper provides some insight into the social media presence of worldwide popular news media outlets. Although these large news media propagate content extensively via social media environments, very little is known about the news item producers, providers and consumers in the news media community on social media. To better understand these interactions, this work analyzes news items in two large social media platforms, Twitter and Facebook. To that end, we collected all published posts on Twitter and Facebook from 48 news media outlets to perform descriptive and predictive analyses using a dataset of 152K tweets and 80K Facebook posts. We explored the set of news media that originate content by themselves on social media, those that distribute their news items to other news media, and those that consume news content from other news media and/or share replicas. We propose a predictive model to increase news media popularity among readers based on the number of posts, number of followers and number of interactions performed within the news media community. The results show that news media should disseminate their own content and publish it first on social media in order to become popular and attract more attention to their news items from readers.
Traditionally, news media organizations used to publish only a few editions of their printed newspapers, and all subscribers of a particular edition received the same information broadcast by the media organization. The advent of personalized news recommendations has completely changed this simpler news landscape. Such recommendations effectively produce numerous personalized editions of a single newspaper, consisting of only the stories recommended to a particular reader. Although prior works have considered the news coverage of different newspapers, due to the difficulty of knowing what news is recommended to whom, there has been no prior study of the coverage of information in different personalized news editions. Moreover, the evolution of the effects of personalization on recommended news stories is also unexplored. In this work, we make the first attempt to investigate these issues. By collecting extensive data from New York Times personalized recommendations, we compare the information coverage in different personalized editions and investigate how they evolve over time. We observe that the coverage of news stories recommended to different readers is considerably different, and these differences further change with time. We believe that our work will be an important addition to the growing literature on algorithmic auditing and transparency.
Link prediction is the problem of inferring new relationships among nodes in a network that are likely to occur in the near future. Classical approaches mainly consider neighborhood structure similarity when linking nodes. However, we may also want to take into account whether the two nodes are already indirectly interacting and whether they will benefit from the link by having an active interaction over time. For instance, it is better to link two nodes u and v if we know that these two nodes will interact in the social network in the future, rather than suggesting v', who will never interact with u. In this paper, we deal with a new variant of the link prediction problem: given a pair of indirectly interacting nodes, predict whether or not they will form a link in the future. We propose a solution to this problem that leverages the predicted duration of their interaction, and propose two supervised learning approaches to predict how long two nodes will interact in a network. Given a set of network-based predictors, the basic approach consists of learning a binary classifier to predict whether or not an observed indirect interaction will last in the future. The second and more fine-grained approach consists of estimating how long the interaction will last by modeling the problem via survival analysis or as a regression task. Once the duration is estimated, new links are predicted in descending order of the estimated durations. Experimental results on the Facebook Network and Wall Interaction dataset show that our more fine-grained approach performs the best, with an AUROC of 0.85, and clearly beats a link prediction model that does not consider the interaction duration and is based only on network properties.
We address the problem of detecting expressions of moral values in tweets using content analysis. This is a particularly challenging problem because moral values are often only implicitly signaled in language, and tweets contain little contextual information due to length constraints. To address these obstacles, we present a novel approach to automatically acquire background knowledge from an external knowledge base to enrich input texts and thus improve moral value prediction. By combining basic textual features with background knowledge, our overall context-aware framework achieves performance comparable to a single human annotator. Our approach obtains 13.3% absolute F-score gains compared to our baseline model that only uses textual features.
Network representation learning algorithms seek to embed the nodes of a network into a lower-dimensional feature space such that nodes that are in close proximity to each other share a similar representation. In this paper, we investigate the effectiveness of using network representation learning algorithms for link prediction problems. Specifically, we demonstrate the limitations of existing algorithms in terms of their ability to accurately predict links between nodes that are in the same or different communities and nodes that have low degrees. We also show that incorporating node attribute information can help alleviate this problem and compare three different approaches to integrating this information with network representation learning for link prediction problems. Using five real-world network datasets, we demonstrate the efficacy of one such approach, called SPIN, that can effectively combine the link structure with node attribute information and predict links between nodes in the same and different communities without favoring high-degree nodes.
In this paper, we develop statistical models to predict a person’s involvement in a criminal incident using criminal case records from the Albuquerque Police Department (APD). We generate a bipartite graph of criminals and cases as well as a criminal network, where an edge between two people means that they were involved in at least one case together. We use the characteristics of the individuals and the cases, and the structural properties of the networks to predict the edges in the bipartite graph. We show that adding network features to a baseline model improves the fit and the predictive performance of the models.
With the rapid growth in urban transit networks in recent years, detecting service disruptions in a timely manner is a problem of increased interest to service providers. Transit agencies are seeking to move beyond traditional customer questionnaires and manual service inspections to leveraging open source indicators like social media for detecting emerging transit events. In this paper, we leverage Twitter data for early detection of metro service disruptions. Inspired by the multi-task learning framework, we propose the Metro Disruption Detection Model, which captures the semantic similarity between transit lines in Twitter space. We propose novel constraints on feature semantic similarity that exploit prior knowledge about the spatial connectivity and shared tracks of the metro network. An algorithm based on the alternating direction method of multipliers (ADMM) framework is developed to solve the proposed model. We run extensive experiments and comparisons to other models with real-world Twitter data and transit disruption records from the Washington Metropolitan Area Transit Authority (WMATA) to justify the efficacy of our model.
"Existing works on local community detection in social networks focus on finding one single community a few seed members are most likely to be in. In this work, we address a much harder problem of multiple local community detection and propose a Nonnegative Matrix Factorization algorithm for finding multiple local communities for a single seed chosen randomly in multiple ground truth communities. The number of detected communities for the seed is determined automatically by the algorithm. We first apply a Breadth-First Search to sample the input graph up to several levels depending on the network density. We then use Nonnegative Matrix Factorization on the adjacency matrix of the sampled subgraph to estimate the number of communities, and then cluster the nodes of the subgraph into communities. Our proposed method differs from the existing NMF-based community detection methods as it does not use ""argmax"" function to assign nodes to communities. Our method has been evaluated on real-world networks and shows good accuracy as evaluated by the F1 score when comparing with the state-of-the-art local community detection algorithm."
Electronic commerce plays a dominant role in consumer economics and has garnered a lot of research attention. Understanding consumer market dynamics based on product popularity is crucial for business intelligence. This work explores the temporal dynamics of online marketing. We introduce a new popularity index based on Amazon: Product Popularity based on Sales Review Volume (PPSRV). We explore and evaluate sequential deep learning models to obtain time series embeddings that can predict product popularity. We further characterize popularity competition between similar products and extend our model to popularity prediction in a competitive environment. Experimental results on large-scale reviews demonstrate the effectiveness of our approach.
"Society's reliance on social media as a primary source of news has spawned a renewed focus on the spread of misinformation. In this work, we identify the differences in how social media users and accounts identified as bots react to news sources of varying credibility, regardless of the veracity of the content those sources have shared. We analyze bot and human responses annotated using a fine-grained model that labels responses as being an answer, appreciation, agreement, disagreement, an elaboration, humor, or a negative reaction. We present key findings of our analysis into the prevalence of bots, the variety and speed of bot and human reactions, and the disparity in authorship of reaction tweets between these two sub-populations. We observe that bots are responsible for 9-15% of the reactions to sources of any given type but comprise only 7-10% of users responsible for reaction-tweets
The aim of this paper is to present methods to systematically analyze individual and group behavioral patterns observed in community-driven discussion platforms like Reddit, where users exchange information and views on various topics of current interest. We conduct this study by analyzing the statistical behavior of posts and modeling user interactions around them. We have chosen Reddit as an example, since it has grown exponentially from a small community to one of the biggest social network platforms in recent times. Due to its large user base and popularity, a variety of behavior is present among users in terms of their activity. Our study provides interesting insights about the large number of inactive posts which fail to gather attention despite their authors exhibiting Cyborg-like behavior to draw attention. We also present interesting insights about short-lived but extremely active posts emulating a phenomenon like Mayfly Buzz. Further, we present methods to find the nature of activity around highly active posts to determine the presence of Limelight hogging activity, if any. We analyzed over 2 million posts and more than 7 million user responses to them during the whole of 2008, and over 63 million posts and over 608 million user responses to them from August 2014 to July 2015, amounting to two one-year periods, in order to understand how the social media space has evolved over the years.
Given a corpus of employee peer reviews from a large corporation where each review is structured into pros and cons, what are the prevalent traits that employees talk about? How can we describe the performance of an employee with just a few sentences that help us interpret what their work is praised and criticized for? What is the best way to summarize an employee's reviews, while preserving the content and sentiment as well as possible? In this work, we study a large collection of corporation-wide employee peer reviews from a technology enterprise. Motivated by the challenges we outline in our analysis of employee review data, our work makes three main contributions in the domain of people analytics: (a) Sentiment-Aspect Model: we introduce a stylized log-linear model that identifies the hidden aspects and sentiment within an employee peer review corpus, (b) Interpretable Sentiment-Aspect Representations: we produce a vector space embedding for each employee, containing an overall sentiment score per aspect, and (c) Summarization of Employee Peer Reviews: we summarize an employee's peer reviews with just a few sentences which reflect the most prevalent traits and associated sentiment for the employee as much as possible. We show that our model can use the structure present in the dataset as supervision to discover meaningful latent traits and sentiment embodied in the reviews. Our employee vector representations provide a compact, interpretable overview of their evaluation. The review summaries extracted provide text that explains the professional performance of an employee in a succinct and objectively quantifiable way. We also show how to use our techniques for people analytics tasks such as the analysis of thematic differences between departments, regions, and genders.
Sociological theories of career success provide fundamental principles for the analysis of social links to identify patterns that facilitate career development. Some theories (e.g. Granovetter's Strength of Weak Ties Theory and Burt's Structural Hole Theory) have shown that certain types of social ties provide career advantage to individuals by facilitating access to unique information and connecting them with a diverse range of others in different social cliques. The assessment of link types and the prediction of new links in external social networks such as Facebook and Twitter have been studied extensively. However, this has not been addressed in enterprise social networks, and especially not the prediction of weak ties in the context of employee career development. In this paper, we address this problem by proposing an Enterprise Weak Ties Recommendation (EWTR) framework which leverages enterprise social networks, employee collaboration activity streams and the organizational chart. We formulate weak ties recommendation as a link prediction problem. However, unlike generic link prediction work, we first validate the explicit enterprise social network against a set of heterogeneous collaboration networks and show that this assessment improves the explicit network's effectiveness in predicting new links. Furthermore, we leverage the assessed social network for weak ties prediction by optimizing the link prediction methods using organizational chart information. We demonstrate that this optimization improves prediction accuracy in terms of AUC and average precision, and our characterization of weak ties to a certain extent aligns with Granovetter's and Burt's seminal studies.
This paper proposes a novel system which utilizes information from social network services to suggest food venues to users based on crowd preferences. To recommend an appropriate food venue for each crowd preference, the system ranks food venues in each region using an improved collaborative filtering method based on the differences between locations and languages in geo-tagged tweets. A key feature of the proposed system is the ability to suggest food venues in regions where very few geo-tagged tweets are available in a specific language by using weighted similarity based on others' preferences. To implement the system, more than 26 million tweets from European countries were collected and analyzed, covering 6 languages and 7 regions. Afterwards, we provide an evaluation of the venue rankings proposed by the system based on 89 French speakers in 7 European countries.
The exponential growth in the usage of smart devices, such as smartphones and interconnected wearables, creates a huge amount of information to manage and many research and business opportunities. Such smart devices have become a useful tool for user movement recognition, since they are equipped with different types of sensors and processors that can process sensor data and extract useful knowledge. Taking advantage of the GPS sensor, they can collect the timestamped geographical coordinates of the user, which can then be used to extract the geographical location and movement of the user. Our work takes this analysis one step further and attempts to identify the user's behavior and habits based on the analysis of the user's location data. This type of information can be valuable for many other domains, such as recommender systems and targeted/personalized advertising. In this paper, we present a methodology for analyzing user location information in order to identify user habits. To achieve this, we analyze the user's GPS logs provided through their Google location history, find locations where the user usually spends more time, and, after identifying the user's frequently preferred transportation types and trajectories, find what type of places the user visits on a regular basis (such as cinemas, restaurants, gyms, bars, etc.) and extract the habits the user is most likely to have.
We present the results of a preliminary study to test the hypothesis that it is possible to automatically identify opinions, in the form of conviction narratives, as they emerge in text data, and to measure and monitor how actors in the online news media influence others in the media to adopt similar narratives to their own. Narratives are represented in the form of sentiment that online news sources express about various topics. Our results suggest that there is evidence of specific news sources acting as opinion leaders, determining the narratives that others in the online media adopt.
Nowadays, the demand for carpooling systems is increasing due to the need to reduce traffic congestion, save fuel costs, decrease pollution, etc. Carpooling services depend on combining in one car different passengers who are willing to go to the same place at a specific time. In this paper, a novel framework that utilizes trip profiles and the semantics of places (points of interest) is proposed. Users' trips are distinguished into routine trips and occasional trips. For occasional trips, the user is offered a similar destination based on the semantics of the destination, such that the new location is within an accepted range or on the route with respect to drivers and other passengers. The proposed framework is applied to a real dataset of New York taxi trips. Two techniques have been applied: one based on route matching and the other on machine learning. The results show that the proposed framework outperforms a traditional carpooling system by reducing the total number of trips by 22.3% with 3 passengers per car and by 26% with 4 passengers using route matching, while the total number of trips is reduced by 66% with 3 passengers per car and by 74% with 4 passengers when the machine learning technique is applied to the same dataset.
Trust has been described as an intrinsic component of any social relation. Trust mainly refers to a measure of confidence that an entity will behave in an expected manner. Academic social networking sites enable researchers to communicate and share publications. This paper aims to rank both researchers and their productivity in terms of the scientific papers they publish. A trust model is proposed that utilizes the metadata of researchers and their papers, extracted from academic social networks, in order to produce two trust values, one for a researcher and another for a scientific paper. The utilized metadata for researchers includes total publications, total work citations, followers, and h-index. Propagation of the trust score using top co-authors is also considered for authors. The metadata of papers consists of the calculated author score, paper citations, and reference citations. Each of these individual factors is assigned a specific weight based on user preference, and the AHP ranking method is applied. Individual metadata are aggregated into a collective value by considering aggregation weights for each feature and applying the AHP ranking method. Experiments show that the proposed model provides high accuracy when compared against ground truth data from Google Scholar and the global h-index.
The proliferation of smartphones has led researchers towards using them as an observational tool in psychological science. However, there has been little effort towards protecting user privacy in these analyses. The overarching question of our work is: given a set of sensitive user features, what is the minimum amount of information required to group similar users? Our contributions are twofold. First, we introduce privacy surfaces that combine sensitive user data at different levels of temporal granularity. Second, we introduce MIMiS, an unsupervised privacy-aware framework that clusters users into homogeneous groups with respect to their temporal signature. In addition, we explore the trade-off between intrusiveness and prediction accuracy. We extensively evaluate MIMiS on real data across a variety of privacy surfaces. MIMiS identified groups that are highly homogeneous with respect to their mental health scores and their academic performance.
Automated social bots are reported to account for a large share of the activity on social media sites such as Twitter. In this short paper, we study the information-foraging behaviors of social media users, including bots. We present a preliminary investigation which compares the behaviors of a set of suspected bots with non-automated accounts. To do so, we measure the distance between word distributions on a daily basis. We posit that this methodology provides a quantitative measure of behavior, which allows for more rigorous descriptions of bot behaviors that move beyond the assumption of bots as a monolithic category.
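As a concrete (toy) version of the measurement, the snippet below builds a word distribution per day and compares consecutive days with the Jensen-Shannon distance; highly repetitive, bot-like accounts tend to show smaller day-to-day distances. The example tweets and preprocessing are our own assumptions.

```python
# Toy measurement: Jensen-Shannon distance between one account's word
# distributions on two consecutive days.
from collections import Counter
from scipy.spatial.distance import jensenshannon

def word_dist(tweets, vocab):
    counts = Counter(w for t in tweets for w in t.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocab]

day1 = ["buy cheap followers now", "buy cheap followers now click"]
day2 = ["buy cheap followers now", "click now buy followers"]
vocab = sorted({w for t in day1 + day2 for w in t.lower().split()})
print(round(jensenshannon(word_dist(day1, vocab), word_dist(day2, vocab)), 3))
```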
In this paper, we introduce the novel problem of discovering influence hierarchy, i.e., organizing influential users in a social network into different levels according to their potential for spreading influence. We present a novel approach to discovering the influence hierarchy that utilizes the temporal aspect and flow direction of interactions among users. The influence hierarchy has the potential to visualize the information flow of the network and identify different roles such as creators, information disseminators, emerging leaders and active followers. It is highly applicable in several domains such as sociology, marketing, political science and disaster management.
While microblogging-based Online Social Networks have become an attractive data source in emergency situations, overcoming information overload is still not trivial. We propose a framework which integrates natural language processing and clustering techniques in order to produce a ranking of relevant tweets based on their informativeness. Experiments on four Twitter collections in two languages (English and French) proved the significance of our approach.
This paper presents an observational study of lexical propagation across online social networking platforms. By focusing on the highly followed @dog_rates Twitter account, we explore how a popular account's unique style of language propagates outside of the account's immediate follower community within Twitter. Initial results show a strong relationship between the prevalence of this account's language-specific features and the account's followership and popularity. Expanding this research across platforms, we demonstrate consistency in these results outside Twitter, as the @dog_rates vernacular shows a similarly strong relationship between use on Reddit and the account's followership over time.
"The growth of social media has created an open web where people freely share their opinion and even discuss sensitive subjects in online forums. Forums such as Reddit help support seekers by serving as a portal for open discussions for various stigmatized subjects such as rape. This paper investigates the potential roles of online forums and if such forums provide intended resources to the people who seek support. Specifically, the open nature of forums allows us to study how online users respond to seeker’s queries or needs
Nowadays, online video platforms mostly recommend related videos by analyzing user-driven data such as viewing patterns, rather than the content of the videos. However, content is more important than any other element when videos aim to deliver knowledge. Therefore, we have developed a web application which recommends related TED lecture videos to users, considering the content of the videos as captured in their transcripts. TED Talk Recommender constructs a network for recommending videos with similar content and provides a user interface. Our demo system is available at http://dmserver6.kaist.ac.kr:24673/.
Decide Madrid is the civic technology platform of the Madrid City Council which allows users to create and support online petitions. Despite its initial success, the platform is encountering problems with the growth of petition signing, because petitions fall far short of the minimum number of supporting votes they must gather. Previous analyses have suggested that this problem is produced by the interface: a paginated list of petitions which applies a non-optimal ranking algorithm. For this reason, we present an interactive system for the discovery of topics and petitions. This approach leads us to reflect on the usefulness of data visualization techniques for addressing relevant societal challenges.
In this paper we present the second version of our multilingual tweet classification tool. ClassStrength v2 classifies tweets into 14 categories (Sports, Music, News & Politics, etc.) using a distant supervision approach. The new version extends the initial set of five languages to ten (English, French, German, Chinese, Japanese, Arabic, Russian, Spanish, Portuguese and Polish). In addition, the classification models for each language are automatically updated every month to allow accurate classification over time. Our experiments showed that the larger the time gap between the tweet and the data used for training the model, the worse the performance, which motivated the creation of an adaptive version of ClassStrength that has its models updated periodically.
Mobile Network Operators (MNOs) are eager to learn more about the complaint behaviour of their subscribers. In this demo, we study a topic modeling approach for extracting relevant problems experienced by subscribers of MNOs in Turkey and visualize the topic distributions using the LDAvis data analytics tool. For building topic models using Latent Dirichlet Allocation (LDA), we built a customer complaint text dataset of subscriber complaints for each MNO from Turkey's largest customer complaint website. The proposed analysis tool can be used as a customer complaint analysis service by MNOs in Turkey to gain more insight. We have also validated our generated topic model using another dataset obtained from Turkey's largest online community website. Our results indicate similar and dissimilar topics of complaints as well as some of the distinctive problems of MNOs in Turkey based on their subscribers' experiences and feedback.
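As a rough illustration of the kind of LDA pipeline described above (not the authors' exact setup, and without the LDAvis visualization step), a scikit-learn sketch on placeholder complaint texts might look as follows.

```python
# Illustrative LDA topic modeling on complaint texts with scikit-learn.
# The complaint corpus below is a stand-in, not the authors' dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

complaints = [
    "billing error on my last invoice",
    "no network coverage in my neighborhood",
    "slow mobile internet during evenings",
]  # placeholder complaint texts

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(complaints)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)            # per-document topic distributions

terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):   # top words per topic
    top = comp.argsort()[-5:][::-1]
    print(f"topic {k}:", [terms[i] for i in top])
```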
During the last decade, a variety of social networks and applications has been developed, providing users with a wide range of functionalities. Thanks to these functionalities, they have become a vital part of the daily life of many people. As a result, a great volume of data has been created, and because the functionalities differ, the resulting datasets differ in nature and schema. This paper introduces GeoTeGra, a system that aims to reveal non-obvious knowledge by connecting datasets that derive from multiple heterogeneous sources. GeoTeGra is a scalable framework for comparing different machine learning algorithms in terms of scalability and effectiveness in finding semantic similarities between entities. Our system is based on distributed storage and parallel map-reduce manipulation for the fast retrieval of information from multi-class feature representations.
We demonstrate a machine learning and artificial intelligence method, lexical link analysis (LLA), to discover high-value information from big data. In this paper, high-value information refers to information that has the potential to grow its value over time. LLA is an unsupervised learning method that does not require manually labeled training data. New value metrics are defined based on a game-theoretic framework for LLA. In this paper, we show the value metrics generated from LLA in a use case of analyzing business news. We show that the results from LLA are validated and correlated with the ground truth. We show that, by using game theory, the high-value information selected by LLA reaches a Nash equilibrium by superpositioning popular and anomalous information, while at the same time generating high social welfare, and therefore contains higher intrinsic value.
Combating copyright-infringing multimedia content has become a critical undertaking for online video sharing platforms such as YouTube and Twitch. In contrast to the traditional copyright detection problem that studies static content (e.g., music, films, digital documents), the proposed system focuses on a much more challenging problem: detecting copyright infringements in live video streams. This is motivated by the observation that a large number of copyright-infringing videos bypass the detector while many legal videos are taken down by mistake. In this paper, we present an end-to-end system dedicated to combating copyright infringements in live video streams. The system to be demonstrated consists of 1) a web front-end for user interaction and customized video queries, 2) a scalable, real-time video crawling system that can collect video metadata, live chat messages, and the visual content of live video streams on video sharing platforms, and 3) a novel supervised copyright detection engine that leverages the audience's live chat messages to detect copyright infringement in live videos.
There is an increasing amount of information posted on the Web, especially on social media during real-world events. Likewise, a vast amount of information and opinions about humanitarian issues is posted on social media. Mining such data can provide timely knowledge to inform disaster resource allocation, i.e., who needs what and where, as well as policies for humanitarian causes. However, information overload is a key challenge for organizations in leveraging this big data resource. We present an interactive, user-feedback-based streaming analytics system, 'CitizenHelper-Adaptive', to mine social media, news, and other public Web data streams for emergency services and humanitarian organizations. The system aims to collect, organize, and visualize vast amounts of data across various user- and content-based information attributes using adaptive machine learning models, such as intent classification models that continuously identify requests for help or offers of help during disasters. This demonstration shows the first application of transfer-active learning methods to time-critical events, when abundant labeled data are available from past events but sufficient labeled data are scarce for the ongoing event. The proposed system provides a user interface to solicit expert feedback on instances predicted by pretrained models and actively learns to improve the models for efficient information processing and organization. Finally, the system regularly updates the predicted information categories in the visualization dashboard. We will demo the CitizenHelper-Adaptive system on case studies covering both mass emergency events and humanitarian topics such as gender violence, using datasets of more than 50 million Twitter messages and news streams collected between 2016 and 2018.
"Providing easy and hassle-free product returns have become a norm for e-commerce companies. However, this flexibility on the part of the customer causes the respective e-commerce companies to incur heavy losses because of the delivery logistics involved and the eventual lower resale value of the product returned. In this paper, we consider data from one of the leading Indian e-commerce companies and investigate the problem of product returns across different lifestyle verticals. One of the striking observations from our measurements is that most of the returns take place for apparels/garments and the major reason for the return as cited by the customers is the ""size/fit"" issue. Here we develop, based on past purchase/return data, a model that given a user, a brand and a size of the product can predict whether the user is going to eventually return the product. The methodological novelty of our model is that it combines concepts from network science and machine learning to make the predictions. Across three different major verticals of various sizes, we obtain overall F-score improvements between 10% - 25% over a naive baseline where the clusters are obtained using simple random walk with restarts."
This paper proposes a statistical framework to automatically identify anomalous nodes in static networks. In our approach, we first associate with each node a neighborhood cohesiveness feature vector such that each element of this vector corresponds to a score quantifying the node's neighborhood connectivity, as estimated by a specific similarity measure. Next, based on the estimated node feature vectors, we view the task of identifying anomalous nodes from a mixture modeling perspective, based on which we elaborate a statistical approach that exploits the Dirichlet distribution to automatically identify anomalies. The suitability of the proposed method is illustrated through experiments on both synthesized and real networks.
Standard sentiment analysis techniques usually rely either on sets of rules based on semantic and affective information or on machine learning approaches whose quality heavily depends on the size and significance of a training set of pre-labeled text samples. In many situations, this labeling needs to be performed by hand, potentially limiting the size of the training set. In order to address this issue, in this work we propose a methodology to retrieve text samples from Twitter and automatically label them. Additionally, we also tackle the situation in which the base rates of positive and negative sentiment samples in the training and test sets are biased with respect to the system in which the classifier is intended to be applied.
This paper delineates the spatial characteristics of key Christian churches in the Taipei metropolitan area from the 1930s to the 2010s. It compares and analyzes the transformation of spatial configuration corresponding to different sects and time periods. The dataset contains the spatial networks of 13 Christian churches, including single and cluster building types of the Presbyterian Church, the Chinese Baptist Convention and the Taiwan Lutheran Church. Applying measures from social network analysis, it attempts to understand the differences and similarities of spatial networks, especially among churches of the same sect or same era, and to compare them with the prototype case. In other words, this paper illustrates the transformation of the spatial organization of Christian churches in Taipei, Taiwan, during the past 80 years.
Precision medicine, which refers to a new treatment and prevention approach based on understanding individual genes, environment and lifestyle, has emerged as a new healthcare paradigm expected to lead future medicine. With the rapid progress of fourth-industrial-revolution technologies such as big data, artificial intelligence, and the internet of things, attention is being drawn to the opportunities and challenges of precision medicine. There is currently no comprehensive scientometric overview of precision medicine, and this study aims to provide an overview of the research and development trends in precision medicine through bibliometric network analysis. A total of 7,324 articles were retrieved and analyzed using scientometric analysis tools such as the KnowledgeMatrix Plus, Gephi and VOSviewer software. In particular, each nation's research activities, their relative global positions, and international research collaboration in this field have also been analyzed and identified by scientometric methods through network and co-word analysis and visualization maps.
Many social processes, such as applications for bank credit, insurance or social services, are digitized as Big Data hosting hidden social networks. While social learning is fundamental for human intelligence, convolutional neural networks and deep learning extend the foundation of artificial intelligence. Deep Probabilistic Learning is a multidisciplinary approach to probabilistic machine learning introduced for Big Data analysis and hidden social network mining. It is presented in this paper along with experimental outcomes in fraud detection related to facsimile problems.
Gaussian graphical models (GGMs) are probabilistic tools of choice for analyzing conditional dependencies between variables in complex networked systems such as social networks, sensor networks, financial markets, etc. Finding changepoints in the structural evolution of a GGM is therefore essential to detecting anomalies in the underlying system modeled by the GGM. In order to detect structural anomalies in a GGM, we consider the problem of estimating changes in the precision matrix of the corresponding multivariate Gaussian distribution. We take a two-step approach to solving this problem: (i) estimating a background precision matrix using system observations from the past without any anomalies, and (ii) estimating a foreground precision matrix using a sliding temporal window during anomaly monitoring. Our primary contribution is in estimating the foreground precision using a novel contrastive inverse covariance estimation procedure. In order to accurately learn only the structural changes to the GGM, we maximize a penalized log-likelihood where the penalty is the l1 norm of the difference between the foreground precision being estimated and the already learned background precision. We suitably modify the alternating direction method of multipliers (ADMM) algorithm for sparse inverse covariance estimation to perform contrastive estimation of the foreground precision matrix. Our results on simulated GGM data show significant improvement in precision and recall for detecting structural changes to the GGM, compared to a non-contrastive sliding window baseline.
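Reading the description above, the contrastive estimation step appears to minimize a graphical-lasso-style objective with an l1 contrast penalty; the sketch below computes that assumed objective, −log det Θ_f + tr(S_f Θ_f) + λ‖Θ_f − Θ_b‖1, in NumPy, while the actual ADMM updates are not reproduced.

```python
# Sketch of the assumed contrastive penalized objective: Gaussian negative
# log-likelihood plus an l1 penalty on the difference between foreground and
# background precision matrices. The exact formulation and ADMM solver used
# by the authors may differ.
import numpy as np

def contrastive_objective(theta_fg, S_fg, theta_bg, lam):
    """Penalized negative log-likelihood for a candidate foreground precision.

    theta_fg : candidate foreground precision matrix
    S_fg     : sample covariance from the sliding monitoring window
    theta_bg : previously learned background precision matrix
    lam      : weight of the l1 contrast penalty
    """
    sign, logdet = np.linalg.slogdet(theta_fg)
    if sign <= 0:
        return np.inf                                   # not positive definite
    fit = -logdet + np.trace(S_fg @ theta_fg)           # Gaussian fit term
    penalty = lam * np.abs(theta_fg - theta_bg).sum()   # contrast sparsity term
    return fit + penalty
```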
Social interactions can be both positive and negative, and occur at various spatial and temporal scales. Negative interactions such as conflicts are often influenced by political, economic and social pre-conditions. The signatures of conflicts can be mapped and studied in the form of complex social networks. Using publicly available large digital databases of media records, we construct networks of actors involved in conflicts by aggregating the events over time. We then study the spatio-temporal dynamics and network topology of conflicts, which can provide important insights into the engaging individuals, groups, establishments and sometimes nations, pointing at their long-range effects over space and time. Network analyses of the empirical data reveal certain statistical regularities, which can be reproduced using agent-based models. The fat tails of actor mentions and network degree distributions indicate dominant roles of the influential actors and groups, which over time form part of a giant connected component. Targeted removal of actors may help prevent unruly conflict events. Inspired by the empirical findings, we also propose a model of interacting actors that can reproduce the most important features of our datasets.
Diffusion models are powerful tools for understanding the spread of diverse content such as information, opinions and ideas through social networks. Although these models have been successfully used to study spreading dynamics such as viral marketing, there are many real scenarios (e.g. vaccination, evacuation) that require a more complex model. Hence, we propose a new hybrid framework that combines diffusion modelling with cognitive agent modelling. The hybrid, generic framework is grounded in BDI (Belief-Desire-Intention), an advanced, efficient cognitive agent framework. We demonstrate our framework on a wildfire evacuation case study consisting of 5,000 agents. We then compare and analyse the diffusion outcomes of our model against two baseline models, the standard Linear Threshold (LT) model and a slightly modified version of the LT model, across 17 different input configurations. The results show (statistically) significant differences with the baselines for the majority of the configurations, highlighting the need for cognitive agents in diffusion modelling. The framework presented here provides the basis for modelling complex reasoning to capture diffusion phenomena in complex and dynamic social systems.
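For context, the standard Linear Threshold baseline mentioned above can be simulated in a few lines; the sketch below uses uniform edge weights and random thresholds on a toy directed graph and does not include the BDI-based cognitive layer.

```python
# Minimal sketch of the standard Linear Threshold (LT) baseline, not the
# BDI-based hybrid framework: a node activates once the total weight of its
# active in-neighbours reaches its random threshold. Uniform weights equal
# to 1/in-degree are an illustrative choice.
import random
import networkx as nx

def linear_threshold(G, seeds, max_steps=50):
    thresholds = {v: random.random() for v in G}
    active = set(seeds)
    for _ in range(max_steps):
        newly = set()
        for v in G:
            if v in active or G.in_degree(v) == 0:
                continue
            weight = sum(1.0 / G.in_degree(v)
                         for u in G.predecessors(v) if u in active)
            if weight >= thresholds[v]:
                newly.add(v)
        if not newly:
            break
        active |= newly
    return active

G = nx.gnp_random_graph(100, 0.05, directed=True, seed=7)
seeds = random.sample(list(G), 5)
print(len(linear_threshold(G, seeds)))   # number of activated nodes
```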
Sophisticated data science techniques have recently been applied to social network data to study social phenomena and people. Recognizing that social psychology research has witnessed a renewed interest in the notion of wisdom, with an emphasis on its contextual dimensions, this study looks at the expression of wisdom in Twitter messages. Specifically, it examines the relation between wisdom in adversity and cultural influences using Twitter data from the tragic Japanese tsunami of 2011. The study employs natural language processing and data science to detect the expression of wisdom. Two categories for wisdom in adversity are used: recognition of uncertainty and change, and cognitive empathy. Data processing is applied to 1,000 annotated tweets and extended to 43,436 tweets. The results show that it is viable to study wisdom in context using social networking site data. This short paper discusses some of the findings.
In the Korean entertainment industry, there exists 'photaku', a particular group of fans who take photos and videos of their favorite celebrities and distribute them online. To understand the structure of their activities, we used a multi-method approach combining in-depth interviews, offline observation, and social media analysis. Our findings demonstrate that photaku engage in four types of unique activities: one main activity, distributing photos and videos, and three types of secondary activities: strengthening their competitiveness, supporting celebrities, and monetization.
This work proposes a model to manage votes in a distributed network. Each node votes for one of the alternatives. The vote is shared with the neighbors, and the process repeats until it converges and all the nodes have the final result. If all nodes behave properly, all nodes know the final result of the voting. Nevertheless, cheating nodes may exist that manipulate the votes. The proposal includes a mechanism to detect when one or several nodes cheat during the diffusion process and to correct the obtained values.
In this work we develop a model that detects the potential development of scientific disciplines in universities. The model is based on the position of the disciplines developed by universities, superimposed by complement on a historical knowledge network. We have observed that in our case study (WoS publications of five Chilean universities during the period 2008-2015), the model correctly forecast up to 78% of the scientific disciplines that a university went on to develop, so it is offered as a valid tool to guide universities in the development of other areas of knowledge.
The Panama Papers represent a big set of relationships between people, companies, and organizations that had affairs with the Panamanian offshore law firm Mossack Fonseca, often related to money laundering. In this paper, we address for the first time the problem of searching the Panama Papers for people and companies that may be involved in illegal acts. We use a collection of international blacklists of sanctioned people and organizations as ground truth for bad entities, and we propose a new ranking algorithm, named Suspiciousness Rank Back and Forth (SRBF), that leverages this ground truth to assign a degree of suspiciousness to each entity in the Panama Papers. We experimentally show that our algorithm achieves an AUROC of 0.85 and an Area Under the Recall Curve of 0.87 and outperforms existing techniques.
As modern societies become more dependent on IT services, the potential impact of both adversarial cyberattacks and non-adversarial service management mistakes grows. This calls for better cyber situational awareness: decision-makers need to know what is going on. The main focus of this paper is to examine the information elements that need to be collected and included in a common operational picture in order for stakeholders to acquire cyber situational awareness. This problem is addressed through a survey conducted among the participants of a national information assurance exercise conducted in Sweden. Most participants were government officials and employees of commercial companies that operate critical infrastructure. The results give insight into which information elements are perceived as useful, which can be contributed to and required from other organizations, which roles and stakeholders would benefit from certain information, and how the organizations work with creating cyber common operational pictures today. Among the findings, it is noteworthy that adversarial behavior is not perceived as interesting, and that the respondents in general focus solely on their own organization.
"To combat the evolving Android malware attacks, systems using machine learning techniques have been successfully deployed for Android malware detection. In these systems, based on different feature representations, various kinds of classifiers are constructed to detect Android malware. Unfortunately, as classifiers become more widely deployed, the incentive for defeating them increases. In this paper, we first extract a set of features from the Android applications (apps) and represent them as binary feature vectors
Micro-blogging sites provide a wealth of resources during disaster events in the form of short texts. Correct classification of those short texts into various actionable classes can be of great help in shaping the means to rescue people in disaster-affected places. The classification of short texts poses a challenging problem because the texts are usually short and very noisy, and finding good features that can distinguish these texts into different classes is time-consuming, tedious and often requires a lot of domain knowledge. We propose a deep learning based model to classify tweets into different actionable classes such as resource need and availability, activities of various NGOs, etc. Our model requires no domain knowledge and can be used in any disaster scenario with little to no modification.
Recently, several journalistic accounts have suggested that Twitter is becoming a bellwether for mis- and dis-information due to the pervasiveness of bots. These bots are either automated or semi-automated. Understanding the intent and usage of these bots has piqued the scientific curiosity among researchers. To that effect, in this study, we analyze the role of bots in two distinct categories of real-world events, i.e., natural disasters and sports. We collected over 1.2 million tweets that were generated by nearly 800,000 users for Hurricane Harvey, Hurricane Irma, Hurricane Maria, and Mexico Earthquake. We corroborate our analysis by examining bots that engaged with the 2018 Winter Olympics. We collected over 1.4 million tweets generated by nearly 700,000 users based on the hashtags #Olympics2018 and #PyeongChang2018. We examined the social and communication network of bots and humans for the aforementioned events. Our results show distinctive patterns in the network structures of bots when compared with that of humans. Content analysis of the tweets further revealed that bots used hashtags more uniformly than humans, across all the events.
Online hacking forums have been used as communities where users, possibly cybercriminals, can learn and exchange knowledge, and purchase the tools and information necessary to commit various offences such as hacking, credit card/identity fraud, money laundering, and even cyberattacks on infrastructure. Monitoring these forums and identifying key players are important when investigating emergent threats and developing efficient disruption strategies. The literature shows a lack of studies regarding users' cross-forum activity. This paper presents an analysis of forum users' cross-posting across three hacking forums, including user overlap among different hacking communities/forums, and identifies user roles based on the type of posts and their frequencies. This allows us to assess the impact of users and forums in terms of cybercrime victimization.
Facebook's News Feed personalization algorithm has a significant daily impact on the lifestyle, mood and opinion of millions of Internet users. Nonetheless, the behavior of such algorithms usually lacks transparency, motivating measurements, modeling and analysis in order to understand and improve their properties. In this paper, we propose a reproducible methodology encompassing measurements and an analytical model to capture the visibility of publishers over a News Feed. First, measurements are used to parameterize and validate the expressive power of the proposed model. Then, we conduct a what-if analysis to assess the visibility bias incurred by users against a baseline derived from the model. Our results indicate that a significant bias exists and that it is more prominent at the top positions of the News Feed. In addition, we found that the bias is non-negligible even for users that are deliberately set as neutral with respect to their political views.
Unknown landscape identification is the problem of identifying an unknown landscape from a set of already provided landscape images that are considered to be known. The aim of this work is to extract the intrinsic semantics of landscape images in order to automatically generalize concepts like a stadium, roads, a parking lot, etc., and use these concepts to identify unknown landscapes. This problem can be easily extended to many security applications. We propose two effective semi-supervised novelty detection approaches for the unknown landscape identification problem using Convolutional Neural Network (CNN) transfer learning. This is based on the use of pre-trained CNNs (i.e., trained on large datasets) containing general image knowledge that we transfer to our domain. Our best AUROC and average precision scores for the identification problem are 0.96 and 0.94, respectively. In addition, we statistically prove that our semi-supervised methods outperform the baseline.
This paper reports on an ongoing development of a tool for extracting structured information on events a given target entity participated in from massive collections of textual documents and anchoring these events on a timescale. An overview of the current version of the tool and the underlying timeline extraction process is given. Some evaluation figures that reflect system output quality are provided too. The paper will be accompanied by a live demo of the timeline extraction tool.
Automated social media bots have existed almost as long as the social media platforms they inhabit. Although efforts to detect and characterize these autonomous agents have long existed, they have redoubled in recent months following the sophisticated deployment of bots by state and non-state actors. This research studies the differences between human and bot social communication networks by conducting a snowball account data collection, and then evaluates features derived from this communication network in several bot detection machine learning models.
With the rapid development of technology and software, social media has become a necessity in our daily lives, as it is a way for people to keep in touch with friends and share current events. Some of the most popular social media platforms include Facebook, Instagram, Snapchat, and Twitter. Finding the most compatible person to befriend on social media can be a challenge, as most of the people recommended to a user by social media are people they are already friends with or already follow. However, when users are looking for friends, the real concern is whether they share the same interests or hobbies and whether they often interact with one another. In this paper, we propose a friend recommendation algorithm revolving around music interests and interactions in social media.
Online shopping has shown a constant increase in recent years, and as a result the study of user behavior through clickstreams has again attracted the interest of the research community. This growth requires novel approaches to clickstream analytics, since the volume of products available online and the corresponding transactions is huge. In this paper, a sequential frequent itemsets detection methodology (SAFID) is adopted to solve a clickstream analytics problem by analyzing a composite dataset which simulates the monthly traffic of the Amazon U.S. online retail shop. It is shown that the methodology can perform the analysis very efficiently on a simple desktop computer and detect all the frequently-bought-together products, which can provide valuable knowledge to marketers of online retail stores. The methodology can further be improved to handle larger datasets by considering a cloud computing environment.
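To illustrate the underlying frequent-itemset idea (not the SAFID methodology itself), a minimal Apriori example on toy transactions with mlxtend might look as follows.

```python
# Generic frequent-itemset mining sketch (Apriori via mlxtend) to illustrate
# the "frequently bought together" idea; the transactions are toy data and
# this is not the SAFID methodology described in the paper.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [
    ["laptop", "mouse", "sleeve"],
    ["laptop", "mouse"],
    ["phone", "case"],
    ["laptop", "sleeve"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
print(itemsets)   # itemsets appearing in at least half of the baskets
```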
We present a study that examines the complex interactions that may exist in transactional databases among the items that constitute the products or purchases present in consumers' market baskets. Using simulated and real databases, we show that it is possible to reconstruct certain characteristics of aggregated purchasing behavior with only second-order interactions, using the inverse Ising problem. While the Ising model is well known, its application in this context is novel and has the potential to reveal useful information to the retail manager.
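A common first approximation to the inverse Ising problem is naive mean-field inversion, where pairwise couplings are read off from the inverse of the connected correlation matrix; the sketch below applies it to synthetic basket data and is not necessarily the estimator used in the study.

```python
# Naive mean-field inverse Ising sketch: estimate pairwise couplings as minus
# the inverse of the connected correlation matrix of +/-1 item indicators.
# The basket data is synthetic and the study's estimator may differ.
import numpy as np

rng = np.random.default_rng(0)
baskets = rng.integers(0, 2, size=(5000, 10))   # rows: baskets, cols: items
spins = 2 * baskets - 1                          # map {0,1} -> {-1,+1}

C = np.cov(spins, rowvar=False)                  # connected correlations
J = -np.linalg.inv(C)                            # mean-field couplings
np.fill_diagonal(J, 0.0)                         # keep only pairwise terms
print(J.round(2))
```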
This study explored the interactive resources of learning in a Flipped Classroom on a university campus. While a growing number of university campuses encourage Flipped Classroom teaching strategies to enhance students' learning responsibility and encourage them to become more active learners, this study argues that students with fewer social network resources might not benefit from the advantages of these strategies. Investigating the formation and evolution of the social network behind peer learning requires understanding the interplay of the social forces of social selection and social influence in order to develop effective strategies. Among studies of curriculum and instruction, few focus on the function of peer learning, and none have investigated these interactions using dynamic social network analysis. This research applies the stochastic actor-based model to model these two social forces of knowledge sharing in peer-mediated learning. Drawing on the literature on knowledge construction and social influence, it aims to better understand network dynamics in a Flipped Classroom by discovering how relationships are created and by identifying individual and contextual attributes that facilitate spontaneous knowledge contribution.
In this work we argue for the use of Twitter as a Palantir for predicting and revealing the political orientation of users in election campaigns. Specifically, our study aims at revealing the political orientation of a Twitter user in the context of the 2016 Italian Constitutional Referendum. After having collected and processed over 1,200,000 tweets, we classified them as YES-oriented, NO-oriented or UNCERTAIN by exploiting the Naive Bayes Multinomial text classification algorithm. We found that Twitter is used massively for political deliberation, and that just by counting the messages related to a political party we can reveal the election result. Moreover, the words used in the YES-party and NO-party tweets are in line with the language used by politicians in the real world, which strongly supports our analysis. Furthermore, our approach can be applied to different scenarios in which it is necessary to distinguish between two main classes and another UNCERTAIN class. Finally, the results obtained by our classification methodology are very promising and encourage us to continue investigating this topic, deriving suggestions for further research.
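For reference, a minimal version of the Multinomial Naive Bayes classification step named above can be put together with scikit-learn; the example tweets are invented placeholders and the paper's feature engineering is omitted.

```python
# Minimal Multinomial Naive Bayes text classification sketch with the same
# three-class labels (YES / NO / UNCERTAIN); the tweets are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["vote yes for reform",
          "I will vote no",
          "still undecided about the referendum"]
labels = ["YES", "NO", "UNCERTAIN"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(tweets, labels)
print(clf.predict(["definitely voting no"]))
```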
In this paper, we consider a "generalized" fractional program in order to solve a popularity optimization problem in which a source of content controls the topics of her content and the rate at which posts are sent to a timeline. The objective of the source is to maximize its overall popularity in an Online Social Network (OSN). We propose an efficient algorithm that converges to the optimal solution of the popularity maximization problem.
k-truss decomposition of a graph is a method to discover cohesive subgraphs and to study the hierarchical structure among them. The existing algorithms for computing k-truss of today’s massive networks mainly focus on reducing the runtime using parallel computation on a powerful multi-core server. Our focus, by contrast, is to investigate the feasibility of computing the k-truss on a single consumer-grade machine within a reasonable amount of time. We engineer two efficient k-truss decomposition algorithms: the edge-peeling algorithm proposed by J. Wang and J. Cheng and the asynchronous h-index-updating algorithm proposed by A. E. Sariyuce, C. Seshadhri, and A. Pinar. We reduce their memory usage significantly by optimizing the underlying data structures and by using WebGraph, an efficient framework for graph compression. With our optimized implementation, we show that we can efficiently compute k-truss decomposition of large networks (e.g., a graph with 1.2 billion edges) on a single consumer-grade machine.
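For orientation, the k-truss itself can be computed with networkx's built-in in-memory routine, as sketched below; the paper's contribution is the memory-optimized implementation on top of WebGraph, which is not reproduced here.

```python
# For orientation only: networkx's built-in, in-memory k-truss routine.
# For k = 4, every edge of the result lies in at least k - 2 = 2 triangles.
import networkx as nx

G = nx.karate_club_graph()
T4 = nx.k_truss(G, 4)
print(T4.number_of_nodes(), T4.number_of_edges())
```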
Machine learning has seen tremendous growth in recent years thanks to two key advances in technology: massive data generation and highly parallel accelerator architectures. The rate at which data is being generated is exploding across multiple domains, including medical research, environmental science, web search, and e-commerce. Many of these advances have benefited from emergent web-based applications and improvements in data storage and sensing technologies. Innovations in parallel accelerator hardware, such as GPUs, have made it possible to process massive amounts of data in a timely fashion. Given this advanced data acquisition technology and hardware, machine learning researchers are equipped to generate and sift through much larger and more complex datasets quickly. In this work, we focus on accelerating Kernel Dimension Alternative Clustering algorithms using GPUs. We conduct a thorough performance analysis using both synthetic and real-world datasets, while also varying both the structure of the data and the size of the datasets. Our GPU implementation reduces execution time from minutes to seconds, which enables us to develop a web-based application for users to interactively view alternative clustering solutions.
Online job search and talent procurement have given rise to challenging matching and search problems in the e-recruitment domain. Existing systems perform direct keyword matching of technical skills, which can miss a closely matching candidate on account of them not having the exact skills. This produces substandard results that ignore the relationships between technical skills. In an attempt to improve relevancy, this paper proposes a semantic similarity measure between IT skills using a knowledge-based approach. The approach builds an ontology using DBpedia and uses it to derive a similarity score using feature-based similarity measures. The proposed approach performs better than the Resumatcher system in finding the similarity between skills.
We apply transfer learning techniques to create topically and/or stylistically biased natural language models from small data samples, given generic long short-term memory (LSTM) language models trained on larger data sets. Although LSTM language models are powerful tools with wide-ranging applications, they require enormous amounts of data and time to train. Thus, we build general purpose language models that take advantage of large standing corpora and computational resources proactively, allowing us to build more specialized analytical tools from smaller data sets on demand. We show that it is possible to construct a language model from a small, focused corpus by first training an LSTM language model on a large corpus (e.g., the text from English Wikipedia) and then retraining only the internal transition model parameters on the smaller corpus. We also show that a single general language model can be reused through transfer learning to create many distinct special purpose language models quickly with modest amounts of data.
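The retraining step described above might be sketched as follows in PyTorch: load a pretrained LSTM language model, freeze the embedding and output layers, and update only the recurrent parameters on the small corpus. The architecture, sizes, and checkpoint path are placeholders rather than the authors' exact configuration.

```python
# Sketch of the transfer step: retrain only the recurrent (transition)
# parameters of a pretrained LSTM language model on a small, focused corpus.
# Layer sizes and the checkpoint path are hypothetical placeholders.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        out, _ = self.lstm(self.embed(tokens))
        return self.decoder(out)

model = LSTMLanguageModel(vocab_size=30000)
# model.load_state_dict(torch.load("general_purpose_lm.pt"))  # hypothetical checkpoint

for p in model.embed.parameters():
    p.requires_grad = False        # keep the general-purpose embeddings fixed
for p in model.decoder.parameters():
    p.requires_grad = False        # keep the general-purpose output layer fixed

optimizer = torch.optim.Adam(model.lstm.parameters(), lr=1e-4)
# ...then run the usual cross-entropy training loop on the small, focused corpus.
```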
Researchers and scientists read articles to improve their studies, yet they spend too much time struggling to find the articles they are looking for. The purpose of an article recommendation system is to reduce the time spent and to present related articles they are not aware of. Classic article recommendation systems do not consider the user's information; they show the same results, in the same order, to every researcher. In this study, an article recommendation system that takes into account the researcher's field of work and their previous articles is presented. One of the most important innovations of this work is the use of TF-IDF and cosine similarity to make article recommendations that take the user's past articles into consideration. As a result, users have been recommended articles, and the method we present has produced better results than equivalent methods according to the F-measure criterion.
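A minimal sketch of the TF-IDF plus cosine-similarity core of such a recommender, with placeholder texts standing in for the researcher's past articles and the candidate pool, might look like this.

```python
# TF-IDF + cosine similarity sketch: rank candidate articles by similarity
# to a researcher's own past papers. All texts are placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

own_papers = ["graph clustering on social networks",
              "community detection at scale"]
candidates = ["deep learning for images",
              "scalable community detection algorithms"]

vec = TfidfVectorizer()
X = vec.fit_transform(own_papers + candidates)
profile = np.asarray(X[: len(own_papers)].mean(axis=0))   # researcher profile vector
scores = cosine_similarity(profile, X[len(own_papers):]).ravel()
print(sorted(zip(candidates, scores), key=lambda t: -t[1]))
```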
Advances in technology in the current era of big data have led to the high-velocity generation of high volumes of a wide variety of valuable data of different veracity. As rich sources of big data, social networks consist of users (or social entities) who are often linked by some interdependency such as 'following' relationships. Given that these big social networks keep growing, there are situations in which an individual user (or business) wants to find those frequently followed groups of social entities so that they can follow the same groups. Discovering these frequently followed groups can be challenging because social networks are usually big (with lots of users/social entities) but can be very sparse (with most users knowing only some, but not all, users/social entities in a social network). In this paper, we present a few social network mining algorithms that use compressed models to mine these very big but sparse social networks for groups of frequently followed social entities. Evaluation results show the practicality of our algorithms in efficiently mining 'following' patterns from big but very sparse social networks.
One of the most attractive problems in social network analysis is link prediction. Social networks' user growth is largely supported by data-driven friend recommendations, which are provided by link predictors. Previously, we studied new features to improve prediction accuracy in Location-Based Social Networks (LBSNs), where users share temporal location information through check-in interactions. In this paper, we focus on the efficiency of link predictors, as the speed of prediction is as critical as its accuracy in LBSNs. Extraction time costs and prediction accuracy of individual LBSN features are mined to pick a feature subset that achieves faster link prediction without sacrificing accuracy.
Event detection is a popular research problem, aiming to detect events from online data sources with the least possible delay. Most previous work focuses on analyzing textual content such as social media postings to detect happenings. In this work, we consider event detection as a change detection problem in network structure, and propose a method that detects change in the community structure extracted from a communication network. We study three versions of the method based on different change models. Experimental analysis on a benchmark data set reveals that change in community structure can be used as an indication of an event.
The use of internet news sites increases day by day. The internet has gone beyond institutions and organizations providing a different kind of service, and there have been attempts to provide services only through the internet and thus to earn money. It can also be an organization or an individual who opens an account on a social network and derives financial gain from these accounts. Financial gain on the internet generally increases in parallel with the number of people visiting a site or reading its content. Clickbait is a technique in which a reader's curiosity is manipulated in order to generate more page views on a website, usually by writing exaggerated and unrealistic headlines. In this study, headlines and subheadings for news articles were collected. Clickbait headlines identified by TF-IDF were summarized by examining the content of the clickbait news items with ontology-based text feature extraction. The summarized version is shown to the user without the need to click on the news item. In this study, news from four news sites with Turkish and English content was examined. This study is the first Turkish study on clickbait detection. For the English tests, results are given in comparison with equivalent algorithms and explained in detail.
Dynamic community detection is an important technique for the study of network evolution. However, most existing dynamic community detection algorithms are time-consuming on large-scale networks, while most current parallel community detection algorithms are static and ignore the changes of the network over time. In this paper, we propose a novel parallel algorithm based on incremental vertices, called PICD, which is able to process large-scale dynamic networks. In the PICD algorithm, the revised Parallel Weighted Community Clustering (PWCC) metric is conducive to convenient calculation and is more sensitive to community structure than other metrics. The PICD approach consists of two main steps. First, it identifies the incremental vertices in the dynamic network. Second, it maximizes the PWCC of the entire network by merely adjusting the community membership of the incremental vertices, capturing the community structure with high quality. The results of experiments on both synthetic and real-world networks demonstrate that the PICD algorithm achieves higher accuracy and efficiency, and performs more stably than most of the baseline methods. The experiments also show that the running time of PICD grows almost linearly with the network scale.
Causal inference and analytics play a critical role in public health and disease prevention. Through mining of large patient datasets, it is possible to identify opportunities for intervention and to determine the effectiveness of treatment. There are currently many methods to analyze and learn causal relationships in large patient datasets, as well as specific causal studies in epidemiology that define specific relationships among symptoms and treatments. This paper introduces a novel methodology to utilize causal knowledge to extend and improve a standard hierarchical medical ontology. First, we obtain the hierarchical structure of the patient symptom variables based on the Medical Dictionary for Regulatory Activities Terminology (MedDRA). Then, we learn a Causal Bayesian Network (CBN) using Max-Min Hill-Climbing (MMHC), a hybrid constraint- and score-based learning algorithm, on the pre-existing National Institute of Mental Health (NIMH) Sequenced Treatment Alternatives to Relieve Depression (STAR*D) patient dataset. Finally, we use the causal links discovered in the CBN to evolve the ontology and its hierarchy.
"Chronic pain is a very common problem worldwide and helping people coping with it is fundamental for improving their quality of life. Since smartphones are available anywhere and anytime for all users, the present work proposes the development of an App that helps users to change their mood when facing low back and cervical pain. The App will drive the user thorough several screens that will help him or her to challenge their negative thoughts for more positive ones. This process will be driven thorough some messages and questions proposed by reserachers with expertise on health and pain management, but also thorough messages and questions proposed by users themselves. The main contributions of this work are: 1) using an App to face pain thorugh a process of cognitive restructuring
Improving seasonal influenza forecasting by combining official data sources with web search and social media is a recent research topic which can enhance the situational awareness of healthcare organizations when monitoring the outbreak of seasonal flu. In this paper, a prediction model based on autoregression is proposed that combines data coming from the official influenza surveillance system with data from web search and social media regarding influenza. The model is evaluated on the two influenza seasons 2016-2017 and 2017-2018, restricted to Italy. The results show that by using Web-based social data, like Google search queries and tweets, we can obtain accurate weekly influenza predictions up to four weeks in advance. The proposed approach improves real-time influenza forecasts compared to traditional surveillance systems based on data from sentinel doctors: the prediction error is reduced by up to 47%, while the Pearson's correlation is improved by about 24%.
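As a rough sketch of the modelling idea (not the paper's exact model), next-week incidence can be regressed on lagged surveillance values plus exogenous web-search and tweet signals; the data below is synthetic.

```python
# Autoregression-with-exogenous-signals sketch: predict next week's incidence
# from lagged surveillance values plus search-volume and tweet-count features.
# The data is synthetic and the paper's exact lag structure is not reproduced.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
weeks = 120
ili = rng.gamma(2.0, 2.0, size=weeks)            # surveillance incidence
search = ili + rng.normal(0, 0.5, size=weeks)    # correlated web-search signal
tweets = ili + rng.normal(0, 0.8, size=weeks)    # correlated tweet counts

lags = 4
X = np.column_stack([ili[i:weeks - lags + i] for i in range(lags)] +
                    [search[lags - 1:weeks - 1], tweets[lags - 1:weeks - 1]])
y = ili[lags:]

model = LinearRegression().fit(X[:-10], y[:-10])
print(model.predict(X[-10:]))                    # out-of-sample weekly forecasts
```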
"This research analyzes the scientific information sharing behaviors on Twitter. Over an eleven-month period, we collected tweets related to the controversy over the supposed linkage between the MMR vaccine and autism. We examined the usage pattern of scientific information resources by both sides of the ongoing debate. Then, we explored how each side uses scientific evidence in the vaccine debate. To achieve this goal, we analyzed the usage of scientific and non-scientific URLs by both polarized opinions. A domain network, which connects domains shared by the same user, was generated based on the URLs ""tweeted"" by users engaging in the debate in order to understand the nature of different domains and how they relate to each other. Our results showed that people with anti-vaccine attitudes linked many times to the same URL while people with pro-vaccine attitudes are linked to fewer overall sources but from a wider range of resources and they provided fewer total links compared to anti-vaccine. Moreover, our results showed that vocal journalists have a huge impact on users’ opinions. This study has the potential to improve understanding about the ways in which health information is disseminated via social media by understanding the way scientific evidence is referenced in the discussions of controversial health issues. Monitoring scientific evidence usage on social media can uncover concerns and misconceptions related to the usage of these types of evidence."
Machine (learning)-based techniques have made substantial advances recently, and there is a general suggestion that they will drive major changes in health care within a few years. Yet, we all suffer from the lack of precise comparative studies on the accuracy of machine-based interpretations of medical data. To fill this gap, in this paper we investigate the efficacy of using an automated mood analysis methodology to understand how patients react to the prescription of different kinds of prenatal diagnostic tests (invasive vs. non-invasive) and to the corresponding outcomes, based on conversations developed on Reddit. Our study essentially provides answers to research questions concerning: i) the popularity of prenatal diagnosis, ii) patients' sentiment about different prenatal tests, iii) the existence of a cause-effect relationship between prenatal testing and patients' mood, and iv) the type of dialogues held by patients and physicians on this topic. Nonetheless, a general result emerging from our research is that a machine-based decision loop still needs human involvement for now, at least to alleviate the tension between empirical data and their correct medical interpretation.
The aim of the study was to use eye-tracking data to predict students’ concentration on both combined “photo and text” and “text only” HIV campaign messages. Eye-tracking was used to measure attention allocation processes for HIV campaigns. Each study participant completed a post-eye-tracking questionnaire to determine the recall of HIV messages and the relationship between eye-tracking measures and cognitive processing of HIV messages. In total, 60 students were randomly selected from the Westville and Howard campuses of the University of KwaZulu-Natal, Durban, South Africa. A multivariate analysis of variance (MANOVA) test, used to determine the relationship of participants’ age, sex and educational level with gaze parameters, indicated that age had a statistically significant effect on measures of search and processing. Using Tukey’s pair-wise post-hoc test, age was found to significantly affect the average saccade length and fixation/saccade ratio. The higher fixation/saccade ratio for male students meant that males had more processing and less search activity than females. Mean fixation attention heat-maps showed that younger students at high school paid attention only to combined “photo and text” messages, while university students, in addition to fixating on combined “photo and text” messages, also paid significant attention to “text only” messages.
Ruptured intracranial aneurysms are associated with a high rate of mortality and disability due to the difficulty of predicting rupture and the complexity of the condition itself. Clinical narratives such as progress summaries and radiological reports contain key biomarkers, medical signs, and symptoms. Applying ontology-based information extraction to clinical narratives to extract important evidence, and subsequently using machine learning, can help build decision support tools for complex decisions such as predicting aneurysm rupture. To the best of our knowledge, no prior work extracts clinical features from clinical narratives to predict the rupture of intracranial/brain aneurysms (BAs). While no single factor individually determines the risk of rupture of a BA, it is important to consider the combined impact of these aspects to understand the rupture probability of the aneurysm. In this paper, we explore the impact of size as a relative factor in saccular aneurysms with respect to location, gender and symptomatic/asymptomatic aspects of BAs. Our study involves descriptive and inferential statistical analysis of features extracted from retrospective electronic health records (EHRs) using natural language processing (NLP) and ontology-based information extraction techniques. Our analysis shows that size alone is not the sole contributor to rupture, but the combination of size, location and patient gender can influence aneurysm rupture. Our results also show the interesting insight that, at the same vasculature location, the average size of ruptured aneurysms in females is always smaller than that in males.
MicroRNAs (miRNAs) regulate gene expression and serve various functions in biological processes, so the disruption of miRNA function can lead to diseases and abnormal phenotypes. Environmental Factors (EFs) such as drugs, radiation, cigarette smoke and alcohol have important negative effects on human health because they interact at the molecular level with human organisms. miRNAs are among the molecules interacting with EFs, and the interactions between the two affect diseases. In this study, we have developed a model based on the KATZ measure to find new potential associations between miRNAs and EFs using Gaussian interaction profile kernel similarity.
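A KATZ-style score sums walks of all lengths with a decay factor; on a block adjacency matrix over miRNAs and EFs this reduces to (I − βA)⁻¹ − I. The NumPy sketch below uses synthetic associations and omits the Gaussian interaction profile kernel similarity blocks.

```python
# Illustrative KATZ-style scoring on a small synthetic miRNA-EF association
# network: walks of all lengths are summed with decay beta via
# (I - beta*A)^(-1) - I. Similarity blocks (e.g., Gaussian interaction
# profile kernels) are left as zeros for brevity.
import numpy as np

n_mirna, n_ef, beta = 4, 3, 0.05
rng = np.random.default_rng(2)
B = rng.integers(0, 2, size=(n_mirna, n_ef))          # known miRNA-EF associations

A = np.block([[np.zeros((n_mirna, n_mirna)), B],      # block adjacency over
              [B.T, np.zeros((n_ef, n_ef))]])         # miRNAs and EFs

n = n_mirna + n_ef
S = np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)   # KATZ scores
candidate_scores = S[:n_mirna, n_mirna:]              # predicted miRNA-EF links
print(candidate_scores.round(3))
```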
Computing shortest-path distances between nodes lies at the heart of many graph algorithms and applications. Traditional exact methods such as breadth-first search (BFS) do not scale to today’s massive, rapidly evolving networks. Approximation methods are therefore required to enable scalable graph processing with a significant speedup. In this paper, we utilize vector embeddings learnt by deep learning techniques to approximate shortest-path distances in large graphs. We show that a feedforward neural network fed with embeddings can approximate distances with relatively low distortion error. The suggested method is evaluated on the Facebook, BlogCatalog, YouTube and Flickr social networks.
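A minimal sketch of the idea under simplifying assumptions: the deep embeddings are replaced here by spectral embeddings, and a small feedforward network (scikit-learn's MLPRegressor) is trained on sampled BFS distances; the paper's actual embeddings and architecture may differ.

```python
import networkx as nx
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.neural_network import MLPRegressor

G = nx.barabasi_albert_graph(500, 3, seed=1)
nodes = list(G.nodes())

# Node embeddings (stand-in for embeddings learnt by deep models).
emb = SpectralEmbedding(n_components=16, affinity="precomputed").fit_transform(
    nx.to_numpy_array(G))

# Sample node pairs and compute their exact BFS distances as training targets.
rng = np.random.default_rng(0)
pairs = rng.choice(nodes, size=(3000, 2))
X = np.hstack([emb[pairs[:, 0]], emb[pairs[:, 1]]])
y = np.array([nx.shortest_path_length(G, int(u), int(v)) for u, v in pairs])

# Feedforward network approximating the distance from the two embeddings.
mlp = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(X[:2500], y[:2500])
pred = mlp.predict(X[2500:])
print("mean absolute error:", np.abs(pred - y[2500:]).mean())
```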
This study investigated government management of and public involvement in social media pages to summarize the internal management, daily routines, and perceptions of government agencies toward social media management. Furthermore, this study analyzed the driving factors of the intention to use, actual usage behaviors, and suggestions and expectations of the public regarding their involvement in the social media pages operated by government agencies. Subsequently, to identify the key factors in government social media management and directions for improvement, a comparison was made between the management practice of government agencies’ social media and people’s experiences in using them. Through participant observations, in-depth interviews, and other data collection methods, this study examined examples in Facebook—the social media platform most frequently used by Taiwanese people and government agencies—to explore these research questions. The government has invested considerable resources in social media management, attracting a large number of people to follow and participate in the official social media operated by various government agencies. However, whether the stickiness of these social media can be maintained, whether they continue to attract more participants, and whether they can evolve with social media platforms as people’s needs and expectations change remain to be investigated.
Digital social networks and social media have attracted increased attention and have become important in the political life of a country. Politicians, media and citizens are visible online and share their opinions immediately on these platforms. During the last days before an election, a lot of activity is observed on these media, giving a picture of the key issues and people. In this paper, we study the case of the French primary election through the Twitter microscope over a period of 10 days preceding the election. We apply social network analysis to uncover the key topics and influencers for the left and right wings. A set of six graphs is constructed to analyze, separately and conjointly, the opposing factions by means of mention and hashtag graphs.
"Robo-Advisors has been growing attraction from the financial industry for offering financial services by using algorithms and acting as like human advisors to support investors making investment decisions. During the investment planning stage, portfolio optimization plays a crucial role, especially for the medium and long-term investors, in determining the allocation weight of assets to achieve the balance between investors expectation return and risk tolerance. The literature on the topic of portfolio optimization has been offering plenty of theoretical and practical guidance for implementing the theory
There is no agreed definition of social capital in the literature. However, one interpretation is that it refers to those resources embedded in an individual's social network offering benefits to that individual in relation to achieving goals and facilitating actions. This can be viewed as a resource-based interpretation of social capital aimed at the level of individuals. In this paper, we propose a family of social capital measures in line with this interpretation. Our measures are designed for a model of social networks based on weighted and attributed graphs, and cover four dimensions of social capital: (i) access to resources, (ii) access to superiors, (iii) homogeneity of ties, and (iv) heterogeneity of ties. We demonstrate the real-world application of our measures by exploring an illustrative use case in the form of a workplace social network.
Research on information diffusion within complex networks tends to focus on effective ways to maximize its reach and dynamics. Most strategies are based on seeding nodes according to their potential for social influence. The presented study shows how seeding can be supported by changes in the target users’ motivation to spread the content, thus modifying the propagation probabilities. The allocation of propagation probabilities to nodes takes the form of a spraying process following a given probability distribution, projected from the nodes’ rankings. The results show how different spraying strategies affect diffusion compared with the commonly used uniform distribution. Apart from the performance analysis, the empirical study shows to what extent the seeding of nodes with high centrality measures can be compensated by seeding nodes which are ranked lower but have higher motivation and propagation probabilities.
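A minimal sketch, not the paper's exact protocol: propagation probabilities are "sprayed" over nodes in proportion to their degree rank, and an independent-cascade spread from a small high-degree seed set is simulated. The graph, ranking and probability range are illustrative assumptions.

```python
import random
import networkx as nx

def spray_probabilities(G, p_min=0.01, p_max=0.2):
    """Assign each node an activation probability projected from its degree rank."""
    ranked = sorted(G.nodes(), key=G.degree, reverse=True)
    n = len(ranked)
    return {v: p_max - (p_max - p_min) * i / (n - 1) for i, v in enumerate(ranked)}

def independent_cascade(G, seeds, node_prob, rng):
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in G.neighbors(u):
                if v not in active and rng.random() < node_prob[v]:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

rng = random.Random(1)
G = nx.watts_strogatz_graph(1000, 8, 0.1, seed=1)
probs = spray_probabilities(G)
seeds = sorted(G.nodes(), key=G.degree, reverse=True)[:10]   # high-degree seeding
print("activated nodes:", len(independent_cascade(G, seeds, probs, rng)))
```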
This paper proposes an effective way to discover and memorize new English vocabulary based on both semantic and phonetic associations. The proposed method aims to automatically find the words most strongly associated with a given target word. Semantic association is measured by the cosine similarity of two word vectors, and phonetic association is measured by the longest common subsequence of the phonetic symbol strings of two words. Finally, the method was implemented as a web application.
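A minimal sketch of the two association scores described, using toy word vectors and hand-written phonetic strings; in practice real embeddings (e.g. word2vec) and a pronunciation dictionary would be substituted, and the normalization of the LCS score is an assumption.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def lcs_length(a, b):
    """Longest common subsequence of two phonetic-symbol strings."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

# Toy data: 3-dimensional "word vectors" and IPA-like phonetic strings.
vec = {"cat": np.array([0.9, 0.1, 0.0]), "kitten": np.array([0.8, 0.2, 0.1])}
phon = {"cat": "kæt", "kitten": "kɪtən"}

semantic = cosine_similarity(vec["cat"], vec["kitten"])
phonetic = lcs_length(phon["cat"], phon["kitten"]) / max(len(phon["cat"]), len(phon["kitten"]))
print(f"semantic={semantic:.3f}, phonetic={phonetic:.3f}")
```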
One of the most interesting problems in network analysis is community detection, i.e. the partitioning of nodes into communities with many edges connecting nodes of the same community and comparatively few edges connecting nodes of different communities. We introduce a new quality measure, called ‘inclusion’, to evaluate a partitioning of an undirected and unweighted graph into communities. This quality measure evaluates how well each node is ‘included’ in its community by considering both its existent and its non-existent edges. We have implemented a strategy that maximizes the inclusion criterion by moving a single node at a time to another community. We also consider inclusion as a criterion for evaluating partitions produced by spectral clustering. In our experimental study, the inclusion criterion is compared with the widely used modularity criterion, providing improved community detection results without requiring the a priori specification of the number of communities.
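A minimal sketch of the single-node-move local search described. The paper's exact definition of 'inclusion' is not reproduced here; the per-node score below (existing edges kept inside the community, non-existing edges kept outside) is only an illustrative assumption standing in for it.

```python
import networkx as nx

def inclusion(G, part):
    """Assumed stand-in for the inclusion criterion: averages, over all nodes, how
    many of a node's edges fall inside its community and how many non-edges fall outside."""
    n = G.number_of_nodes()
    total = 0.0
    for v in G:
        comm = {u for u in G if part[u] == part[v]} - {v}
        nbrs = set(G.neighbors(v))
        inside_edges = len(nbrs & comm) / max(len(nbrs), 1)
        outside_nonedges = 1.0 - len(comm - nbrs) / max(n - 1 - len(nbrs), 1)
        total += 0.5 * (inside_edges + outside_nonedges)
    return total / n

def local_moves(G, part, max_iter=20):
    """Greedily move one node at a time to the community that maximizes the criterion."""
    labels = set(part.values())
    for _ in range(max_iter):
        improved = False
        for v in G:
            old = part[v]
            best, best_score = old, inclusion(G, part)
            for c in labels:
                if c == old:
                    continue
                part[v] = c
                score = inclusion(G, part)
                if score > best_score:
                    best, best_score = c, score
            part[v] = best
            if best != old:
                improved = True
        if not improved:
            break
    return part

G = nx.karate_club_graph()
init = {v: v % 2 for v in G}          # arbitrary two-community start
print(local_moves(G, dict(init)))
```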
"With the upcoming of the AI era, AI combines with Fintech (Financial Technology
Although many new methods aiming to improve the performance of link prediction have been proposed in recent years, there is still no widely accepted benchmark for evaluating and comparing them. In this paper, we propose LPBenchmark, a solution towards a fair and effective benchmark for link prediction. LPBenchmark offers a suite of well-selected datasets covering the major research fields in link prediction without redundancy. These datasets are selected from widely adopted open-access dataset collections by performing the AHC (Adapted Hierarchical Clustering) and DNFS (Deepest Node First Selection) algorithms. LPBenchmark measures the difficulty of each selected dataset through the OSR (Optimal Subset Regression) algorithm, which makes it possible to fairly compare the experimental performance of two methods evaluated on different datasets. Moreover, LPBenchmark includes three APIs, allowing researchers to obtain the largest connected components of a dataset, modify a dataset based on node degree, and construct subgraphs based on node clustering coefficients. After presenting all the characteristics and functionalities of LPBenchmark, we conduct a comprehensive evaluation of several classic and newly proposed link prediction methods using LPBenchmark. Results show that LPBenchmark is not only capable of fairly comparing each method's overall performance, but can also reveal each method's advantages and limitations on different types of networks.
Online underground forums have been widely used by cybercriminals to trade illicit products, resources and services, and have played a central role in the cybercriminal ecosystem. Unfortunately, due to the number of forums, their size, and the expertise required, it is infeasible to perform manual exploration to understand their behavioral processes. In this paper, we propose a novel framework named iDetector to automate the analysis of underground forums for the detection of cybercrime-suspected threads. To detect whether given threads are cybercrime-suspected, iDetector not only analyzes the content of the threads but also utilizes the relations among threads, users, replies, and topics. To model this kind of rich semantic relationship (i.e., thread-user, thread-reply, thread-topic, reply-user and reply-topic relations), we introduce a structured heterogeneous information network (HIN) for representation, which can be composed of different types of entities and relations. To capture the complex relationships (e.g., two threads are relevant if they were posted by the same user and discuss the same topic), we use a meta-structure based approach to characterize the semantic relatedness over threads. As different meta-structures depict the relatedness over threads from different views, we then build a classifier using Laplacian scores to aggregate the similarities formulated by different meta-structures and make predictions. To the best of our knowledge, this is the first work to use a structured HIN to automate underground forum analysis. Comprehensive experiments on real data collections from underground forums (e.g., Hack Forums) validate the effectiveness of our developed system iDetector in cybercrime-suspected thread detection in comparison with alternative methods.
When a terror-related event occurs, there is a surge of traffic on social media comprising informative messages, emotional outbursts, helpful safety tips, and rumors. It is important to understand the behavior manifested on social media sites to gain a better understanding of how to govern and manage in a time of crisis. We undertook a detailed study of Twitter during two recent terror-related events: the Manchester attacks and the Las Vegas shooting. We analyze the tweets during these periods using (a) sentiment analysis, (b) topic analysis, and (c) fake news detection. Our analysis demonstrates the spectrum of emotions evinced in reaction and the way those reactions spread over the event timeline. With respect to topic analysis, we also find “echo chambers”, groups of people interested in similar aspects of the event. Encouraged by our results on these two event datasets, the paper seeks to enable a holistic analysis of social media messages in a time of crisis.
This paper presents a methodology for automatically detecting the presence of hate speech within the terrorist argument. Hate speech can be used by a terrorist group as a means of judging possible targets’ guilt and deciding on their punishment, of making people accept acts of terror, or even as propaganda for attracting new members. In this paper, we examine both the ideology expressed and the practices employed by the Revolutionary Organization 17 November (hereafter 17N), which operated in Greece between 1975 and 2002. Within this line of thought, we focus on the ideological justification, ethical standing and deployment of the terrorist operations as presented in the communiqués published by 17N, emphasizing the use of hate speech as a means of justifying their choices and actions, as well as a way of reaching out to the Greek people. To decide how the automatic classification should be performed, we experimented with different text-analysis techniques such as critical discourse analysis and content analysis; based on the preliminary results of these techniques, we propose a classification algorithm that assigns the communiqués to three categories depending on the presence of hate speech. The methodology was tested on the full dataset of communiqués and the corresponding results are discussed.
YouTube, since its inception in 2005, has grown to become the largest online video sharing website. Its massive user base uploads videos and generates discussion by commenting on them. Lately, YouTube, akin to other social media sites, has become a vehicle for spreading fake news, propaganda, conspiracy theories, and radicalizing content. However, the lack of effective image and video processing techniques has hindered research on YouTube. In this paper, we advocate the use of metadata to identify such malicious behaviors. Specifically, we analyze the metadata of videos (comments, commenters) to study a YouTube channel that was pushing content promoting conspiracy theories regarding World War 3. Identifying signals that could be used to detect such deviant content (e.g., videos, comments) can help in stemming the spread of disinformation. We collected over 4,145 videos along with 16,493 comments from YouTube. We analyze user engagement to assess the reach of the channel and apply social network analysis techniques to identify inorganic behavior.
How can we identify individuals at risk of being drawn into online sex work? The spread of online communication removes transaction costs and enables a greater number of people to become involved in illicit activities, including the online sex trade. As a result, social media platforms often serve as a springboard for criminal careers, posing a significant risk to the economy, public health and trust. Detecting deviant behaviors online is limited by the poor availability of ground-truth data and machine learning tools. Unlike prior work which focuses exclusively on either qualitative or quantitative methods, in this paper we combine covert online ethnography with semi-supervised learning methodologies, using data from a popular European adult forum. We obtained risk assessments for 78 users through covert online ethnography, and set out to build a machine learning model that can predict the risk factor for the other 28,832 users. Results show that a combination-based approach in which all features are used yields the most accurate results.
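A minimal sketch of the semi-supervised step under stated assumptions: a small set of ethnographically risk-assessed users provides the labels, all other users are unlabelled (-1), and scikit-learn's LabelSpreading propagates the risk label over a synthetic feature space; the real features, labels and user counts differ.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
n_labeled, n_unlabeled, n_features = 78, 2000, 12   # smaller stand-in for 28,832 users

X = rng.normal(size=(n_labeled + n_unlabeled, n_features))   # synthetic user features
y = np.full(n_labeled + n_unlabeled, -1)                      # -1 marks unlabelled users
y[:n_labeled] = rng.integers(0, 2, n_labeled)                 # 0 = low risk, 1 = at risk

model = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2)
model.fit(X, y)

predicted_risk = model.transduction_[n_labeled:]   # inferred labels for unlabelled users
print("predicted at-risk users:", int(predicted_risk.sum()))
```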
Disasters cause tremendous damage to human lives and property. The United Nations International Strategy for Disaster Reduction (UNISDR) recognizes that behavioral change in society is needed to significantly reduce disaster losses. Understanding human behavior during disasters can inform decisions on how to prepare for disasters and how to act properly and respond strategically during and after a calamity. This study aims to understand human behavior during disasters through agent-based modeling and social network analysis. eBayanihan, a disaster management platform that uses crowdsourcing to gather disaster-related information, was used to capture behavior during a simulated disaster event. Survey data were also used for disaster behavior modeling. The generated disaster behavior models and social network centrality measures computed using ORA-Netscenes show that specific agents in the network can play an important role during disaster risk reduction and management (DRRM) operations.
This paper presents PS0, an ontological framework and a methodology for improving physical security and insider threat detection. PS0 can facilitate forensic data analysis and proactively mitigate insider threats by leveraging rule-based anomaly detection, which in many cases can detect employee deviations from organizational security policies. In addition, PS0 can be considered a security provenance solution because of its ability to fully reconstruct attack patterns. Provenance graphs can be further analyzed to identify deceptive actions and overcome analytical mistakes that can result in poor decision-making, such as false attribution. Moreover, the information can be used to enrich the available intelligence (about intrusion attempts) to form use cases that detect and remediate limitations in the system, such as loosely coupled provenance graphs, which in many cases indicate weaknesses in the physical security architecture. Ultimately, validation of the framework through use cases demonstrates that PS0 can improve an organization’s security posture in terms of physical security and insider threat detection.
Inferring locations from user texts on social media platforms is a non-trivial and challenging problem related to public safety. We propose a novel non-uniform grid-based approach for location inference from Twitter messages using Quadtree spatial partitions. The proposed algorithm uses natural language processing (NLP) for semantic understanding and incorporates cosine similarity and Jaccard similarity measures for feature vector extraction and dimensionality reduction. We chose Twitter as our experimental social media platform due to its popularity and effectiveness for the dissemination of news and stories about recent events happening around the world. Our approach is the first of its kind to make location inference from tweets using Quadtree spatial partitions and NLP in hybrid word-vector representations. The proposed algorithm achieved significant classification accuracy and outperformed state-of-the-art grid-based content-only location inference methods by up to 24% in correctly predicting tweet locations within a 161 km radius and by 300 km in median error distance on benchmark datasets.
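A minimal sketch of the non-uniform Quadtree partition used as the label space: cells covering (lat, lon) points are recursively split until each leaf holds at most `capacity` points, so dense regions get finer cells. The capacity and depth thresholds are illustrative assumptions, and the NLP feature pipeline is omitted.

```python
import random

class QuadTree:
    def __init__(self, bounds, capacity=50, depth=0, max_depth=12):
        self.bounds, self.capacity = bounds, capacity   # (min_lat, min_lon, max_lat, max_lon)
        self.depth, self.max_depth = depth, max_depth
        self.points, self.children = [], None

    def insert(self, lat, lon):
        if self.children is not None:
            return self._child(lat, lon).insert(lat, lon)
        self.points.append((lat, lon))
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            self._split()

    def _split(self):
        lat0, lon0, lat1, lon1 = self.bounds
        mlat, mlon = (lat0 + lat1) / 2, (lon0 + lon1) / 2
        self.children = [QuadTree(b, self.capacity, self.depth + 1, self.max_depth)
                         for b in ((lat0, lon0, mlat, mlon), (lat0, mlon, mlat, lon1),
                                   (mlat, lon0, lat1, mlon), (mlat, mlon, lat1, lon1))]
        for lat, lon in self.points:       # redistribute points to the new leaves
            self._child(lat, lon).insert(lat, lon)
        self.points = []

    def _child(self, lat, lon):
        lat0, lon0, lat1, lon1 = self.bounds
        mlat, mlon = (lat0 + lat1) / 2, (lon0 + lon1) / 2
        return self.children[2 * (lat >= mlat) + (lon >= mlon)]

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

random.seed(0)
tree = QuadTree((-90.0, -180.0, 90.0, 180.0), capacity=20)
for _ in range(1000):                      # synthetic geotagged tweets
    tree.insert(random.uniform(-90, 90), random.uniform(-180, 180))
print("leaf cells (location classes):", len(tree.leaves()))
```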
Ear recognition has advantages in identifying non-cooperative individuals in unconstrained environments. Ear detection is a major step within the ear recognition algorithmic process. While conventional approaches for ear detection have been used in the past, Faster Region-based Convolutional Neural Network (Faster R-CNN) based detection methods have recently achieved superior detection performance in various benchmark studies, including those on face detection. In this work, we propose an ear detection system that uses Faster R-CNN. The training of the system is performed in two stages: first, an AlexNet model is trained to classify ear vs. non-ear segments; second, the unified Region Proposal Network (RPN), which shares convolutional features with the AlexNet, is trained for ear detection. The proposed system operates in real time and achieves a 98% detection rate on a test set composed of data coming from different ear datasets. In addition, the system's ear detection performance remains high even when the test images come from uncontrolled settings with a wide variety of images in terms of image quality, illumination and ear occlusion.
"Supercomputing environments are becoming the norm for daily use. However, their complex infrastructure makes troubleshooting and monitoring failures extremely difficult. This is because these infrastructures contain thousands of nodes representing various applications and processors. To address these concerns, we propose a real-time reliability analysis framework for high performance computing (HPC) environments where the contributions are three-fold. First, an improved data network extrapolation (DNE) methodology is proposed as a pre-processing module. This component incorporates the system failure information (i.e. job, fault, and error log files) and performs robust job-based failure accounting for sequential and parallel jobs. This element also performs cross-referencing to compute task-based failure accounting, where the assumption is made that tasks are comprised of either one or more jobs. Next, a reliability characterization and analysis (RCA) schema is proposed that takes the failure information from the DNE process to perform survival analyses on each individual node in addition to the entire reliability infrastructure. This is coupled with a failure metrics characterization (FMC) schema that estimates the failure metrics such as the mean time to failure (MTTF) as well as the hazard rate. Additionally, a comparative analysis is made between the Log-Normal or Weibull distributions in terms of modeling job and task-based failure activity. Empirical analysis using the Structural Simulation Toolkit (SST) illustrate the promise of this approach in terms of characterizing, monitoring, and troubleshooting failure behavior. The results of this work can aide systems administrators the dynamic tools to pinpoint and monitor failure behavior
Recommender systems aim to suggest relevant items to users from among a large number of available items. They have been successfully applied in various industries, such as e-commerce, education and digital health. Clustering approaches can help recommender systems group users into appropriate clusters, which serve as neighborhoods in the prediction process. Although user preferences vary over time, traditional clustering approaches fail to consider this important factor. To address this problem, a social recommender system based on a temporal clustering approach is proposed in this paper. Specifically, the proposed method considers the temporal information of the ratings users provide on items as well as the social information among users. Experimental results on a benchmark dataset show that the quality of recommendations produced by the proposed method is significantly higher than that of state-of-the-art methods in terms of both accuracy and coverage metrics.
We demonstrate a machine learning and artificial intelligence method, lexical link analysis (LLA), to discover high-value information from big data. In this paper, high-value information refers to information that has the potential to grow in value over time. LLA is an unsupervised learning method that does not require manually labeled training data. New value metrics are defined based on a game-theoretic framework for LLA. We show the value metrics generated by LLA in a use case of analyzing business news, and validate and correlate the results from LLA with the ground truth. We show that, by using game theory, the high-value information selected by LLA reaches a Nash equilibrium by superpositioning popular and anomalous information while generating high social welfare, and therefore contains higher intrinsic value.
"With the increasing popularity of online video sharing platforms (such as YouTube and Twitch), the detection of content that infringes copyright has emerged as a new critical problem in online social media. In contrast to the traditional copyright detection problem that studies the static content (e.g., music, films, digital documents), this paper focuses on a much more challenging problem: one in which the content of interest is from live videos. We found that the state-of-the-art commercial copyright infringement detection systems, such as the ContentID from YouTube, did not solve this problem well: large amounts of copyright-infringing videos bypass the detector while many legal videos are taken down by mistake. In addressing the copyright infringement detection problem for live videos, we identify several critical challenges: i) live streams are generated in real-time and the original copyright content from the owner may not be accessible
An information network is represented as a graph where nodes represent entities and edges represent interactions between nodes. There can be multiple types of nodes and edges in such networks, giving rise to homogeneous, multi-relational and heterogeneous networks. The link prediction problem is defined as predicting edges that are more likely to be formed in the network at a future time. Many measures have been proposed in the literature for homogeneous networks, but extensions of many of these measures to heterogeneous networks are not available. Further, the measures need to be redefined in order to utilize the weight and time information available with the interactions. In this work, along with a logical grouping of the measures into topological, probabilistic and linear algebraic measures for all types of networks, we fill the gaps by defining measures wherever they are not available in the literature. An empirical evaluation of each of these measures in different types of networks on the DBLP benchmark dataset is presented. An overall improvement of 12% in prediction accuracy is observed when temporal and heterogeneous information is efficiently utilized.
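A minimal sketch of classical topological link prediction measures on a homogeneous, unweighted graph, using NetworkX built-ins; the paper's weighted, temporal and heterogeneous extensions are not reproduced here, and the candidate pairs are arbitrary non-edges chosen for illustration.

```python
import networkx as nx

G = nx.karate_club_graph()
candidate_pairs = [(0, 9), (1, 16), (5, 33)]   # example non-edges to score

for scorer in (nx.jaccard_coefficient,
               nx.adamic_adar_index,
               nx.preferential_attachment,
               nx.resource_allocation_index):
    scores = {(u, v): s for u, v, s in scorer(G, candidate_pairs)}
    print(scorer.__name__, {k: round(v, 3) for k, v in scores.items()})
```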
We study community detection in criminal networks and address the problem caused by intentionally hidden edges, which hinder the performance of community detection. We use link prediction to demonstrate how the community structure of a network can be better identified by augmenting it with edges, and show that this method delivers better-quality communities for real-life drug trafficking networks. We also discuss the limitations of the approach and the importance of community detection for the investigation of criminal networks.
Decision makers use partial information about networks to guide their decisions, yet when they act, they act in the real network, i.e. the ground truth. Therefore, a way of comparing the partial information to the ground truth is required. We introduce a statistical measure that compares the network obtained from the partially observed information with the ground truth and which can, of course, be applied to the comparison of any two networks. As a first step, in the current research we restrict ourselves to networks of the same size to introduce such a method, which can be generalized to networks of different sizes. We perform a mathematical analysis on the random graph, and then apply our methodology to synthetic networks generated using five different generating models. We conclude with a statistical hypothesis test to decide whether two graphs are correlated or not.
Social networks have become a popular way for internet users to interact with friends and family, read news, and discuss events. Users spend more time on well-known social platforms (e.g., Facebook, Twitter, etc.), storing and sharing their personal information. This information, together with the opportunity to contact thousands of users, attracts the interest of malicious users, who exploit the implicit trust relationships between users to achieve their malicious aims, for example, creating malicious links within posts/tweets, spreading fake news, or sending unsolicited messages to legitimate users. In this paper, we investigate the nature of spam users on Twitter with the goal of improving existing spam detection mechanisms. For detecting Twitter spammers, we make use of several new features which are more effective and robust than existing ones (e.g., number of followings/followers, etc.). We evaluate the proposed set of features using popular machine learning classification algorithms, namely k-Nearest Neighbor (k-NN), Decision Tree (DT), Naive Bayes (NB), Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), and eXtreme Gradient Boosting (XGBoost). The performance of these classifiers is evaluated and compared using different evaluation metrics, and we compare our proposed approach with four recent state-of-the-art approaches. The experimental results show that the proposed set of features performs better than existing state-of-the-art approaches.
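A minimal sketch of the classifier comparison on synthetic, imbalanced feature vectors rather than the real Twitter features; XGBoost is omitted here and scikit-learn's GradientBoostingClassifier is used as a boosting stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in for account features; the minority class plays the spammer role.
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.9, 0.1],
                           random_state=0)

classifiers = {
    "k-NN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
for name, clf in classifiers.items():
    f1 = cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
    print(f"{name:20s} F1 = {f1:.3f}")
```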
We introduce a flexible method for determining the hierarchical structure of a network based on the theory of isospectral network reductions. To illustrate the usefulness of this approach, we apply our procedure to the Southern Women Data Set, one of the most studied of all social networks. We find that these techniques provide new information that is consistent in a number of ways with previous results regarding this network but that is also complementary to these earlier findings.
The organization of any society often takes on a complex structure in which many different communities can be identified that include or overlap one another. A significant question is how a user's local activities affect their global position and facilitate gaining position in other local communities. We discuss the problem of how a local group supports a user in achieving a significant global position. In this paper we present a model of the system containing descriptions of users and roles, as well as the results of analyses using data gathered from the blog portal Salon24. We describe relationships between the global roles played by users in the whole social portal and their local roles in identified communities. Additionally, other useful characteristics of the users of such portals are presented, such as the number of groups to which users with given roles belong and the total numbers of global and local roles.
Due to the anonymous and remote nature of electronic commerce (e-commerce), reviews of products and vendors left by previous customers have become an integral part of most online transactions. Reviews may influence customers' purchase decisions since e-commerce websites/services do not allow customers to validate and inspect products in-store. In this paper, we analyze data from two Bitcoin marketplaces, comprising transactions between marketplace users and the ratings of those transactions given by those users. In this analysis we create a synthetic network model with topological properties similar to those of the interaction networks of both marketplaces. The results of our analysis show an interesting phenomenon in which user ratings, which range from -10 to 10, converge to a value of approximately two as the number of a user’s transactions increases. Finally, we suggest future work on our synthetic model to improve its agreement with the transaction networks in order to better understand how reviews influence user decisions on transactions.
Previous work has shown that selectivity based on opinions and attribute values is an important tie-formation mechanism in human social networks. Less well known is how selectivity influences the formation and composition of whole groups in which interactions extend beyond dyads. To address this question, we use data from the NetSense study, consisting of a multi-layer (nomination, communication, co-location) network of university students. We examine how group formation differs from tie formation in terms of the role of selectivity based on opinions and attributes. In addition, we show how levels of such selectivity vary between groups formed to meet different needs.
To study the detailed effects of social media consumption on personal opinion dynamics, we gather self-reported survey data on the volume of different media types an individual must consume before forming or changing their opinion on a subject. We then use frequent pattern mining to analyze the data for common groupings of responses with respect to various media types, sources, and contexts. We show that, in general, individuals tend to perceive their behavior as consistent across many variations in these parameters. Further detail, however, reveals common parameter groupings that indicate response changes, as well as significant groups of individuals who tend to be consistently more easily swayed than the average participant.
This work involves agent-based simulation of bootstrap percolation on hyperbolic networks. Our goal is to identify influential nodes in a network which might inhibit the percolation process. Our motivation, given a small-scale random seeding of an activity in a network, is to identify the most influential nodes for inhibiting the spread of that activity amongst the general population of agents. This might model obstructing the spread of fake news in an online social network, or cascades of panic selling in a network of mutual funds driven by rumour propagation. Hyperbolic networks typically display power-law degree distributions, high clustering and skewed centrality distributions. We introduce a form of immunity into the networks, targeting nodes of high centrality and low clustering to be immune to the percolation process, and then compare outcomes with standard bootstrap percolation and with random selection of immune nodes. We generally observe that targeting nodes of high degree delays percolation but, for our chosen graph centralisation measures, a high degree of skew in the distribution of local node centrality values bears some correlation with an increased inhibitory impact on percolation.
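A minimal sketch of bootstrap percolation with immunized nodes. The graph generator, activation threshold and immunization rule (high degree combined with low local clustering) are illustrative stand-ins for the hyperbolic networks and centrality-based targeting studied in the paper.

```python
import random
import networkx as nx

def bootstrap_percolation(G, seeds, immune, threshold=2):
    """Activate a node once it has at least `threshold` active neighbors; immune nodes never activate."""
    active = set(seeds) - set(immune)
    changed = True
    while changed:
        changed = False
        for v in G:
            if v in active or v in immune:
                continue
            if sum(1 for u in G.neighbors(v) if u in active) >= threshold:
                active.add(v)
                changed = True
    return active

random.seed(0)
G = nx.powerlaw_cluster_graph(2000, 4, 0.3, seed=0)   # power-law degrees with clustering
clustering = nx.clustering(G)

# Immunize nodes with high degree and low local clustering.
score = {v: G.degree(v) * (1 - clustering[v]) for v in G}
immune = sorted(G, key=score.get, reverse=True)[:50]

seeds = random.sample(list(G.nodes()), 40)
baseline = bootstrap_percolation(G, seeds, immune=[])
inhibited = bootstrap_percolation(G, seeds, immune=immune)
print(f"activated: {len(baseline)} without immunity, {len(inhibited)} with immunity")
```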
In this paper, we analyze the message exchange patterns that emerge when social bots and human users communicate via Twitter. In particular, we use a multilayer network to analyze the emergence of the corresponding representative and statistically significant sub-graphs (so-called motifs). Our analysis is based on two recent riot events, namely the Philadelphia Superbowl 2018 riots and the 2017 G20 riots in Hamburg (Germany). We found that in these two events message exchanges between humans form characteristic and re-occurring communication patterns. In contrast, message exchanges including bots occur rather sporadically and do not follow a particular statistically significant pattern.
A well-known result in social science states that, although people maintain a large number of social relationships, their interactions are in practice concentrated on a small portion of their neighbors. According to Dunbar's hypothesis, this phenomenon is due to limited human cognitive capacity for handling too many social relationships, together with limited time for socializing. These constraints result in an organization of people's ego networks into four or five groups depending on the strength of the interactions, the so-called Dunbar's circles. The verification of Dunbar's hypothesis and the identification and characterization of social circles are still open questions, although researchers have found evidence of its validity on different social networks, from small offline networks to online social networks such as Twitter and Facebook. In addition, little is known about the semantic aspects of the circles, i.e. who the members of each circle are. In this paper, we address these issues by analyzing a mobile phone graph where people's interactions are expressed by both voice calls and text messages. We first compare two methods for the identification of the circles, which rely on the different definitions of tie strength proposed in the literature. Both methods confirm the subdivision of ego networks into four or five circles, each characterized by a specific interaction strength. Then, we validate Dunbar's hypothesis and, by leveraging some powerful features of the mobile phone dataset in use, provide a first semantic characterization of social circles. We show, for instance, that people maintain relationships with the members of the closest circles by combining calls and texts. In addition, by detecting the home and work locations of each individual, we highlight that a semantic aspect, such as the role of the alters, impacts the ego-network circles. In particular, family members and workmates are located in different circles: the former mainly form the closest circle, i.e. the support clique, while the latter are distributed among the outer circles.
User timelines in Online Social Media (OSM) remain filled with a significant amount of information received from followees. Given that content posted by a followee is not under the user’s control, this information may not always be relevant. If there is a large presence of less relevant content, a user may end up overlooking relevant content, which is undesirable. To address this issue, in the first part of our work we propose suitable metrics to characterize the user-followee relationship. We find that most users choose their followees primarily because of the content that they post (content-conscious behavior, measured by content similarity scores). For a small number of followees, a high degree of social engagement (likes and shares), irrespective of the content they post, is observed (user-conscious behavior, measured by user affinity scores). We evaluate our proposed approach on 26,516 followees across 100 random users on Twitter who have cumulatively posted 234,403 tweets. We find that, on average, for 60% of their followees, users exhibit a very low degree of content similarity and social engagement. These findings motivate the second part of our work, where we develop a Followee Management Nudge (FMN) through a browser extension (plugin) that helps users remain informed about their relationship with each of their followees. In particular, the FMN nudges a user with a list of followees with whom they have least (or never) engaged in the past and who also exhibit very low similarity in terms of content, thereby helping the user make an informed decision (say, by unfollowing some of these followees). Results from a preliminary controlled lab study show that 62.5% of participants find the nudge to be quite useful.
Basketball is an inherently social sport, which implies that social dynamics within a team may influence the team’s performance on the court. As NBA players use social media, it may be possible to study the social structure of a team by examining the relationships that form within social media networks. This paper investigates the relationship between publicly available online social networks and quantitative performance data. It is hypothesized that network centrality measures for an NBA team’s network will correlate with measurable performance metrics such as win percentage, points differential and assists per play. The hypothesis is tested using exponential random graph models (ERGM) and investigating correlation between network and performance variables. The results show that there are league-wide trends correlating certain network measures with game performance, and also quantifies the effects of various player attributes on network formation.
Sociological theories of career success provide fundamental principles for analyzing social networks to identify patterns that facilitate career development. Structural Hole Theory argues that certain network structures provide advantages to individuals by enabling them to access unique information from different parts of the network. The structural advantages of social networks in workplace settings have not been studied sufficiently for the purpose of employees' career development. In this paper, we address this challenge by proposing a Social Capital-Driven Career Development framework which leverages enterprise collaboration activity streams to assess employees' social capital across organizational hierarchy levels. We demonstrate that our framework can enable employees to reflect on their social network structure from the perspective of information benefits for progressing their career from one hierarchy level to the immediate next level in their respective business units.
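A minimal sketch of structural-hole indicators on a toy workplace-style graph, using NetworkX's built-in Burt measures (constraint and effective size); the paper's full social-capital framework and hierarchy levels are not modelled, and the names and edges are purely illustrative.

```python
import networkx as nx

# Toy collaboration network: edges stand in for activity-stream interactions.
G = nx.Graph([("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
              ("bob", "carol"), ("dave", "erin"), ("erin", "frank"),
              ("alice", "erin")])

constraint = nx.constraint(G)          # lower constraint -> more brokerage across holes
effective_size = nx.effective_size(G)  # larger -> more non-redundant contacts

for v in G:
    print(f"{v:6s} constraint={constraint[v]:.2f} effective_size={effective_size[v]:.2f}")
```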
Network science fosters our understanding of many natural and man-made complex systems by providing models for evolution, emergence and reliability. Existing studies on network robustness mainly focus on targeted node attacks, where the topology shrinks until the network loses its basic functionality. Here, we bring a novel aspect into perspective. First, we investigate the fragility of networks under edge attacks instead of node attacks – a highly relevant scenario in dynamical social systems where, because of frequent interactions, edges are dynamically created or destroyed. Second, we propose an intuitive edge repair model to respond after edge attacks. Accordingly, we study different strategies for targeted edge attack and edge repair and find that real-world networks are generally more robust than synthetic ones, that high-betweenness and high-degree nodes present structural weak spots for targeted attacks, and that repairing edges incident to high-degree nodes first offers the most effective response to an attack. Finally, we find evidence of a behaviour we tag as topological antifragility in networks that actually grow stronger when under attack, due to the triggered edge repair strategy.
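A minimal sketch of one attack/repair round under stated assumptions: remove the edges with highest edge betweenness, then add the same number of new edges incident to high-degree nodes, tracking the largest connected component as a simple robustness proxy; the paper's strategies and metrics may differ.

```python
import networkx as nx

def attack_edges(G, k):
    """Remove the k edges with highest edge betweenness (targeted edge attack)."""
    eb = nx.edge_betweenness_centrality(G)
    targets = sorted(eb, key=eb.get, reverse=True)[:k]
    G.remove_edges_from(targets)

def repair_edges(G, k):
    """Add k new edges between high-degree nodes (hub-oriented repair)."""
    hubs = sorted(G, key=G.degree, reverse=True)
    added = 0
    for u in hubs:
        for v in hubs:
            if added == k:
                return
            if u != v and not G.has_edge(u, v):
                G.add_edge(u, v)
                added += 1

G = nx.barabasi_albert_graph(300, 3, seed=2)
lcc = lambda g: len(max(nx.connected_components(g), key=len))
print("before attack:", lcc(G))
attack_edges(G, 30)
print("after attack: ", lcc(G))
repair_edges(G, 30)
print("after repair: ", lcc(G))
```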
Economic and convenience benefits of interconnectivity drive the current explosive emergence and growth of networked systems. However, as recent catastrophic contagious failures in numerous large-scale networked infrastructures have demonstrated, these benefits of interconnectivity are inherently associated with various risks, including the risk of undesirable contagion. Current research on network formation by contagion-risk-averse agents, which analyzes Nash or some other game-theoretic equilibrium notion of the corresponding game, suffers from the interrelated problems of intractability and oversimplification. We argue that these problems can be alleviated with a dynamic view, which assumes logit responses by strategic agents with utilities quantifying multiple competing incentives. While this approach naturally incorporates the practically critical assumption of bounded rationality, it also allows for leveraging a vast body of results on network formation, e.g., preferential attachment in growing networks.
In many real-life applications it is crucial to be able, given a collection of link states of a network in a certain time period, to accurately predict the link state of the network at a future time. This is known as dynamic link prediction, which, compared to its static counterpart, is more complex, as capturing the temporal characteristics is a non-trivial task. This explains why the majority of today's research in network representation learning still focuses on the static setting, ignoring temporal information. In this work, we focus on one such case and aim to extend node2vec, a representation learning method successfully applied to static link prediction, to a dynamic setup. The extended method is applied and validated on several real-life networks with different properties. Results show that taking the dynamic aspect into account outperforms the static approach. Additionally, based on the network properties, recommendations are given for the node2vec parameters.
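A minimal snapshot-based sketch of this setup, not the paper's extension itself: it assumes the `node2vec` PyPI package (a gensim-backed implementation), trains embeddings on an earlier synthetic snapshot, and fits a logistic regression on Hadamard products of node embeddings to predict which links appear in the next snapshot.

```python
import networkx as nx
import numpy as np
from node2vec import Node2Vec
from sklearn.linear_model import LogisticRegression

G_t0 = nx.barabasi_albert_graph(300, 3, seed=0)   # snapshot at time t
G_t1 = nx.barabasi_albert_graph(300, 4, seed=0)   # synthetic stand-in for snapshot t+1

# Train node2vec embeddings on the earlier snapshot only.
model = Node2Vec(G_t0, dimensions=32, walk_length=20, num_walks=50,
                 workers=1).fit(window=5, min_count=1)
vec = lambda v: model.wv[str(v)]      # the package stores node ids as strings

edges_t0 = {tuple(sorted(e)) for e in G_t0.edges()}
edges_t1 = {tuple(sorted(e)) for e in G_t1.edges()}
positives = list(edges_t1 - edges_t0)              # links appearing at t+1

rng = np.random.default_rng(0)
negatives = []
while len(negatives) < len(positives):             # sample an equal number of non-links
    u, v = map(int, rng.integers(0, 300, 2))
    if u != v and tuple(sorted((u, v))) not in edges_t0 | edges_t1:
        negatives.append((u, v))

X = np.array([vec(u) * vec(v) for u, v in positives + negatives])  # Hadamard operator
y = np.array([1] * len(positives) + [0] * len(negatives))
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))
```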