ForgetMeNot: Exploring Accidental Linkability of Users
A case study on GitHub, StackOverflow and Twitter
Online Social Networks (OSNs) have experienced exponential growth in recent years and now have a major presence in both the personal and professional lives of a large segment of the world's population. Any content or personal information that an individual shares on a social network can be traced back to them even after it has been removed. It is rightly said that 'nothing is ever lost on the internet'.
Unsurprisingly, OSNs are subject to serious privacy and security risks. Due to the amount of personally identifiable information shared by users on OSNs and the lack of adequate privacy settings, it becomes possible to aggregate information about users by linking their profiles across several online social networks.
For our project, we decided to explore this accidental linkability by exploiting privacy and security issues associated with OSNs and behavioural tendencies of users.
We started by accessing publicly available data, trying to collect personally identifiable information (PII) on users, followed by varied attempts to link them to accounts on different OSNs.
Initial Data Collection
GitHub
When setting up Git for the first time on a laptop/machine, a user is required to set a name and an email before being allowed to make a commit. These are generally set with a command similar to git config --global user.email "email@example.com". Users often do not consider the repercussions this may have years later. For example, when pushing commits to your public repository on GitHub, this name and email are attached (in plaintext) to each commit.
Don't take our word for it. Test it out!
- Clone any public repository (e.g. git clone https://github.com/Daksh/process-github-daily-dumps.git)
- Navigate inside this repository using a terminal (e.g. cd process-github-daily-dumps)
- Check the commits using git log
Voila! You will find the email address set by the user alongside each commit.
We decided to use this interesting observation to gauge the magnitude of this possibly unintentional breach of privacy. To operationalize this experiment, we downloaded daily dumps of GitHub data for May and June 2019. Each dump was about 5-7 GB on average in its zipped form. Due to computational restrictions, processing the data was a challenge for us. To overcome this difficulty, we automated the following steps of the process:
- Download the daily dump (.zip)
- Extract the .bson files from the zip
- Import select .bson files into the Mongo database
- Run a .py script to extract information from the Mongo database
- Cleanup
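The steps above can be sketched roughly as follows. This is a minimal sketch: the dump URL pattern, archive layout, and the extract_info.py script name are illustrative placeholders, not the exact names we used.

```python
import subprocess
from datetime import date, timedelta

# Hypothetical dump URL pattern; the real source and naming scheme may differ.
DUMP_URL = "http://ghtorrent.org/downloads/mongo-dump-{d}.tar.gz"

def commands_for(day):
    """Build the shell commands that process one daily dump."""
    d = day.isoformat()
    archive = f"dump-{d}.tar.gz"
    return [
        ["wget", "-O", archive, DUMP_URL.format(d=d)],          # 1. download the daily dump
        ["tar", "-xzf", archive, "dump/github/commits.bson"],   # 2. extract select .bson files
        ["mongorestore", "--db", "github",                      # 3. import into Mongo
         "dump/github/commits.bson"],
        ["python", "extract_info.py", "--date", d],             # 4. run the extraction script
        ["rm", "-rf", archive, "dump"],                         # 5. cleanup
    ]

def process_range(start, end, dry_run=True):
    """Run the pipeline for every day in [start, end]."""
    day = start
    while day <= end:
        for cmd in commands_for(day):
            if dry_run:
                print(" ".join(cmd))
            else:
                subprocess.run(cmd, check=True)
        day += timedelta(days=1)
```

Driving the external tools from one script means a single crashed day can be retried without redoing the whole range.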
We have publicly released this code on GitHub at https://github.com/Daksh/process-github-daily-dumps.
The following information on the users was collected from the above-mentioned data:
- GitHub Profile
- Name
- Organisation
- Avatar
- Site Admin Status
StackOverflow
We downloaded all data made publicly available by StackOverflow and extracted the following information on the users:
- StackOverflow Profile
- Name
- Avatar
Using the Collected Data
From the data collected above, we decided to make use of users' email IDs to find the social media platforms to which their accounts could be linked. We started with an exploration of the information we could get for a user, using their email, on platforms like LinkedIn, Twitter, Facebook, and Instagram.
LinkedIn Sales API
We came across an old Chrome extension, Rapportive, which would return the LinkedIn profile for a given email ID, provided that a LinkedIn account associated with that email existed. This extension was eventually acquired by LinkedIn and integrated into their Sales API. Looking at the documentation of that API, we found that the request URL for the Sales API was of the form:
https://www.linkedin.com/sales/gmail/profile/viewByEmail/
We opened this URL directly in our browser after suffixing it with a valid email:
https://www.linkedin.com/sales/gmail/profile/viewByEmail/abc@gmail.com
The resulting webpage looked like the following:
We found that this worked for every email id which was associated with a LinkedIn profile.
We decided to proceed by scraping data from the page returned by this URL. However, we encountered another challenge: for the above request to work, we needed to be logged into a LinkedIn account, which a programmatic scraper does not have by default.
To overcome this, we used Burp Suite - it acts as a Man in the Middle (MITM) between our web browser and the internet. We used it to analyze the GET request made to the above URL.
We could now repeat the same request with the same headers but different payload (email) to get user information. As only authenticated LinkedIn users are allowed to use this API, we needed authentication tokens (from headers) to automate the whole process which was made possible through the use of Burp Suite.
From the output of Burp Suite, we scraped the entire HTML and parsed it to collect more information. In addition to their LinkedIn Profiles, we also collected the following information:
- Job Title
- Organization
- Location
- Other links (Twitter, Facebook etc.)
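Replaying the captured request can be sketched as below. The cookie and CSRF header values are placeholders for the session values Burp Suite revealed, and the Sales API itself has since been sunset, so this is illustrative only.

```python
from urllib.request import Request, urlopen

BASE = "https://www.linkedin.com/sales/gmail/profile/viewByEmail/"

# Placeholder session values copied from the Burp Suite capture of a logged-in browser.
HEADERS = {
    "Cookie": 'li_at=<session-cookie-from-burp>; JSESSIONID="ajax:123"',
    "Csrf-Token": "ajax:123",
    "User-Agent": "Mozilla/5.0",
}

def profile_url(email):
    """Build the viewByEmail URL for a given email address."""
    return BASE + email

def fetch_profile_html(email):
    """Replay the authenticated GET request and return the raw HTML for parsing."""
    req = Request(profile_url(email), headers=HEADERS)
    with urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")
```

The returned HTML can then be fed to any parser to pull out the job title, organization, location, and linked profiles.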
Shortly after we started collecting data using this methodology, the LinkedIn Sales API feature was sunset. Due to this and API rate restrictions, we could only collect the above-mentioned information for a subset of the users in our dataset.
Syncing Google Contacts
To try to link accounts on more online platforms, we decided to make use of the Sync Contacts feature that exists on popular OSNs such as Facebook, Instagram and Twitter.
We made VCFs (virtual contact files) containing names and email IDs for a significant proportion of users in our data and added them to a newly created dummy Google account. The first difficulty we faced here was that, although Google Contacts has a stated limit of 25k contacts, it only allowed importing a limited number of contacts at a time.
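Generating the VCFs can be sketched as follows. The field layout follows the vCard 3.0 format; the batch size is an illustrative value chosen to stay under the per-import limit, not the exact number we used.

```python
def vcard(name, email):
    """Render one vCard 3.0 entry for a (name, email) pair."""
    return "\r\n".join([
        "BEGIN:VCARD",
        "VERSION:3.0",
        f"FN:{name}",
        f"EMAIL;TYPE=INTERNET:{email}",
        "END:VCARD",
    ]) + "\r\n"

def write_vcf_batches(users, batch_size=1000, prefix="contacts"):
    """Split (name, email) pairs into .vcf files small enough for Google Contacts to import."""
    paths = []
    for i in range(0, len(users), batch_size):
        path = f"{prefix}_{i // batch_size}.vcf"
        with open(path, "w", newline="") as f:
            for name, email in users[i:i + batch_size]:
                f.write(vcard(name, email))
        paths.append(path)
    return paths
```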
When we tried syncing these contacts on Facebook, Instagram, and Twitter, we ran into the following problems:
- Facebook - After syncing the contacts, we did get suggestions for people in our contact list. However, they were mixed with other random people from outside our contacts. Another challenge was that the results page only loaded a limited number of profiles at a time and required scrolling to load more, which we could not automate.
- Instagram - Instagram previously had a feature where one could sync their Google contacts and follow them on Instagram, but that feature no longer exists. Even after syncing the contacts, the suggestions were very limited and mixed with unrelated accounts.
- Twitter - Despite successfully syncing the contacts, Twitter could not find the profiles of people in our contact list or suggest their Twitter accounts to us.
As a result, we were not successful in this approach.
Analysis
We made word clouds to see the most prominent job titles of users and the organizations they worked at:
We also plotted the locations of the users on a map to further profile our demographic.
Finding linkability on Twitter
Recently, we came across an interesting trend on Twitter where users post tweets asking for their followers' profiles on other social networks such as Instagram, Soundcloud, Paypal etc., so they can add them as friends on that platform, promote their content, or even donate money to them (buying them lunch, etc.).
Data Collection
We decided to collect this data by running search queries of the form "drop your <platform_name>" on Twitter. We considered the following platforms for collection of this data:
- Snapchat
- Spotify
- Soundcloud
- Venmo
- Paypal
We used Twint, an advanced Twitter scraping tool written in Python, to retrieve tweets matching such queries.
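The queries and the Twint search can be sketched as below. The Config field names follow Twint's documented API; the output filename is an illustrative choice.

```python
PLATFORMS = ["snapchat", "spotify", "soundcloud", "venmo", "paypal"]

def build_queries(platforms=PLATFORMS):
    """Build one quoted 'drop your <platform>' search phrase per platform."""
    return [f'"drop your {p}"' for p in platforms]

def scrape(query, output="tweets.csv"):
    """Run a Twint search for one query and append the results to a CSV."""
    import twint  # third-party; pip install twint
    c = twint.Config()
    c.Search = query
    c.Store_csv = True
    c.Output = output
    twint.run.Search(c)
```

Quoting the phrase keeps Twitter's search from matching the words individually, which would flood the results with noise.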
Interestingly, we noticed that the activity of posting such tweets had increased since lockdowns commenced due to COVID-19. We performed a time-series analysis on all tweets matching our queries to confirm this.
<timeseries_graph>
<timeseries_CF_graph>
As visible from the graph, the slope increases after the commencement of lockdown periods in the countries most affected by COVID-19, suggesting higher social media usage in those countries due to reduced in-person interaction. We also plotted the locations of these users:
Once such tweets were retrieved, we decided to use the Twitter API to get the replies to them. Here we ran into another roadblock: the API itself does not have an endpoint for querying the replies to a tweet. We worked around this by querying the mentions of the users who posted the original tweets and checking whether those mentions were replies to the tweets we had initially stored. However, the mentions endpoint only returns recent tweets, so we were restricted to scraping tweets from the past 24 hours.
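The workaround reduces to a filtering step over the mentions feed, sketched below. The field name mirrors the Twitter API v1.1 tweet object, where each reply carries the ID of the tweet it answers in in_reply_to_status_id.

```python
def filter_replies(mentions, stored_tweet_ids):
    """Keep only mention tweets that are direct replies to tweets we stored.

    `mentions` is a list of tweet dicts as returned by a v1.1
    mentions-style endpoint; tweets that are not replies simply
    lack (or null out) the in_reply_to_status_id field.
    """
    stored = set(stored_tweet_ids)
    return [m for m in mentions
            if m.get("in_reply_to_status_id") in stored]
```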
We continued collecting this data for a period of 5 days and got good results. Following is a collage of the snapcodes we collected from the images embedded in replies to the tweets:
We parsed the collected replies for possible links to the users' accounts on these platforms. While parsing, we also collected any URLs that users had linked in their profiles or bios, to enable linking to more platforms. This process yielded the following distribution of linked platforms:
<platforms_graph>
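Parsing the replies for platform links can be sketched with a simple regex over the reply text. The domain list matches the platforms above but is illustrative rather than exhaustive; real tweets also required expanding t.co short links first.

```python
import re

# Canonical domains for the platforms we looked for.
PLATFORM_DOMAINS = {
    "snapchat": "snapchat.com",
    "spotify": "open.spotify.com",
    "soundcloud": "soundcloud.com",
    "venmo": "venmo.com",
    "paypal": "paypal.me",
}

LINK_RE = re.compile(
    r"https?://(?:www\.)?("
    + "|".join(re.escape(d) for d in PLATFORM_DOMAINS.values())
    + r")/\S+"
)

def extract_platform_links(text):
    """Return (platform, url) pairs found in one reply's text."""
    found = []
    for m in LINK_RE.finditer(text):
        domain = m.group(1)
        platform = next(p for p, d in PLATFORM_DOMAINS.items() if d == domain)
        found.append((platform, m.group(0)))
    return found
```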
Insights
We plotted the follower and following counts of the users who replied to such tweets:
- Top 5% of the Users have 10.5k or more followers.
- Top 25% of the Users have 1.5k or more followers.
- The proportion of users with a low number of followers is greater than that of users with a high number of followers.
We found an interesting platform, Linktree, linked from the Twitter profiles of such users. Linktree allows users to curate a one-stop page linking all their content. Here, we noticed users adding all their social media links so that their audience can access them in one place.
We found no previous research on this platform, and read online that it has roughly 3 million users. We feel this platform might be ideal for future research on linkability, since it imposes no limits on scraping and exposes the links to multiple accounts of each user.
API Keys
We noticed that many StackOverflow and GitHub users forget to remove API keys for their social network accounts, such as Twitter and Slack, from the code they share. To confirm this, we manually searched GitHub and found several lists of Twitter API keys. After curating them, we obtained roughly 300 unique Twitter API keys, of which around 200 were valid and working. This problem is interesting because of its security and privacy implications: such keys can be used not only to bypass rate limits but, in some cases, to give a stranger complete access to another user's Twitter account, which would be a gross breach of security and privacy. Hence, we decided to extract such API keys from the GitHub commits and StackOverflow data we had downloaded, to explore to what extent these keys could be used to manipulate a user's account.
Initially, we used a regex for Slack API keys to find such keys in the GitHub commits and StackOverflow data. Once we had these API keys, we needed a way to test and use them. We set up a terminal client written in Go, which was a challenge in itself since it was a completely new language that we had never worked with. We then used a terminal-based library that took a Slack API key as input and opened a terminal interface for that user's account. Once this interface was open, we had complete control over the account and could even send messages from it, a gross breach of security and privacy that we were able to demonstrate.
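The key search can be sketched as below. The pattern follows Slack's publicly documented xox-prefixed token format; the exact regex we used may have differed.

```python
import re

# Slack tokens start with xox followed by a type letter
# (b = bot, p = user, a / r / s = app, refresh, signing-related).
SLACK_TOKEN_RE = re.compile(r"xox[baprs]-[0-9A-Za-z-]{10,}")

def find_slack_tokens(text):
    """Return the unique candidate Slack tokens found in a blob of text."""
    return sorted(set(SLACK_TOKEN_RE.findall(text)))
```

Candidates found this way still have to be validated against the Slack API, since many published keys are already revoked.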