With the proliferation of social networks in the 21st century, an individual’s privacy and security are
increasingly at risk. Any content or personal information that an individual shares on a social network
can be traced back to him/her even after it has been removed. It is rightly said that ‘Nothing is ever lost
on the internet’. Alongside the problems of online privacy and security, the potential linkability of
several online social network accounts to a single user has recently become a popular research question.


Hence, for our project, we decided to explore this accidental linkability by exploiting the internet’s privacy
and security issues. Our plan was to access publicly available data, collect as much personally identifiable
information (PII) as possible for an individual, and then attempt to link him/her to accounts on different
social network platforms.


Initial Data Collection 
To get a starting point, we downloaded all the publicly available data for StackOverflow, along with
GitHub’s daily data dumps for May and June 2019. Each GitHub daily dump averaged about 5 GB in its
zipped form. Since we were working on our laptops, computational restrictions made processing the data
a challenge. To overcome this difficulty, we automated a pipeline in which every downloaded file was
unzipped, the .bson file from the unzipped contents was loaded into a MongoDB database, and finally a
Python script exported the data we needed to a .csv file.
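A sketch of this pipeline is shown below. The dump file layout, database name, collection name, and export script are assumptions for illustration; actually running the commands would require mongorestore and a local MongoDB instance.

```python
from pathlib import Path

def process_dump(archive: Path, db: str = "github_dump") -> list:
    """Build the shell commands for one daily dump: unzip it, restore the
    .bson into MongoDB, then export the needed fields to CSV.
    Returns the commands so they can be inspected or executed."""
    bson = archive.with_suffix(".bson")   # assumed name of the file inside the archive
    csv = archive.with_suffix(".csv")
    return [
        f"gunzip -k {archive}",                                 # unzip the daily dump
        f"mongorestore --db {db} --collection commits {bson}",  # load the .bson into MongoDB
        f"python export_commits.py --db {db} --out {csv}",      # hypothetical export script
    ]

# Each command could then be run with subprocess.run(cmd, shell=True, check=True).
```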
We collected information about different users from the ‘About me’ section of StackOverflow profiles and
from the commits in the GitHub data.
The following information was collected from the above-mentioned data -
  1. Name of the user
  2. Organisation the user is associated with
  3. Email addresses of the user
  4. Phone number of the user
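As an illustration, candidate emails and phone numbers can be pulled out of the raw ‘About me’ or commit text with simple regular expressions. The patterns below are rough sketches, not the exact ones we used, and are far from exhaustive:

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Very loose phone pattern: optional country code, then ten digits with separators.
PHONE_RE = re.compile(r"\+?\d{1,3}[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

def extract_pii(text: str) -> dict:
    """Pull candidate emails and phone numbers out of free text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }
```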


Using the Collected Data
These days, most users register on social networks using their email address, their phone number, or both.
We wanted to explore how many additional profiles of a user we could find with the help of his/her email
address. We started by exploring the information we could retrieve for a user via their email on platforms
such as LinkedIn, Twitter, Facebook, Instagram, etc.


LinkedIn Sales API
While searching for options on LinkedIn, we came across a Chrome extension called Rapportive, which
returned a LinkedIn profile for a given email address, provided a LinkedIn profile was linked to that email.
Rapportive was eventually acquired by LinkedIn, and the feature was integrated into LinkedIn’s Sales API.
The curl request for the Sales API was of the form
https://www.linkedin.com/sales/gmail/profile/viewByEmail/
We opened this URL directly in the browser with a valid email address appended at the end -
https://www.linkedin.com/sales/gmail/profile/viewByEmail/{email_address}.
This worked for every email address with an associated LinkedIn profile. The challenge was then to scrape
data from the returned page. We tried to use normal scrapers, but there was another problem: the request
only worked while logged into our own LinkedIn profile, which was not feasible with a normal scraper.
To overcome this, we used Burp Suite, which acts as a man-in-the-middle (MITM) proxy between the web
browser and the internet. We used it to analyze the GET request issued when visiting
"https://www.linkedin.com/sales/gmail/profile/viewByEmail/{email_address}". We could then replay the
same request with the same headers but a different payload (email) to retrieve user information. Since
only authenticated LinkedIn users are allowed to use this API, we needed the authentication tokens
(from the headers) to automate the whole process, which Burp Suite made possible.
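A minimal sketch of the replayed lookup is below. The cookie name and header set are assumptions based on how LinkedIn sessions typically work; the actual token values came from the request captured in Burp Suite, and the endpoint has since been sunset.

```python
def build_lookup(email: str, li_at: str):
    """Build the URL and headers to replay the captured GET request.
    li_at stands in for the LinkedIn session cookie taken from Burp Suite."""
    url = f"https://www.linkedin.com/sales/gmail/profile/viewByEmail/{email}"
    headers = {
        "Cookie": f"li_at={li_at}",   # authentication token from the captured headers
        "User-Agent": "Mozilla/5.0",  # mimic the original browser request
    }
    return url, headers

# The request itself could then be replayed with, e.g.,
#   requests.get(url, headers=headers)
# swapping in a different email on each call.
```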
We then scraped the entire HTML of the output and parsed it to collect more information.
In addition to the information collected above, we obtained the following -
  1. LinkedIn profile of the user
  2. Job title of the user
  3. Location as given on the user’s profile
  4. Other links listed in the user’s profile
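Parsing the returned HTML can be sketched with the standard library. The class names matched below are purely illustrative; the real page markup (now sunset) had to be inspected to pick the right selectors.

```python
from html.parser import HTMLParser

class ProfileParser(HTMLParser):
    """Collect text from elements whose class mentions 'profile', plus all
    outbound links - a rough stand-in for the real field extraction."""
    def __init__(self):
        super().__init__()
        self._capture = False
        self.fields = []   # text of profile-related elements (name, job, location)
        self.links = []    # other links listed on the page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "profile" in attrs.get("class", ""):
            self._capture = True          # grab the next text node
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_data(self, data):
        if self._capture and data.strip():
            self.fields.append(data.strip())
            self._capture = False
```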
One other challenge we faced was that shortly after we discovered this feature, it was sunset. As a result,
we were only able to collect the above-mentioned information for a subset of the users.


Syncing Google Contacts 
We also wanted to try to link more accounts and PII to the information collected so far about the users.
Hence, we made a new Google account where we added VCF (virtual contact file) cards with the user’s
names and their email id to the contacts of this account. The difficulty we faced here was Google
contacts took only a limited number of contacts even though the limit is 25,000.
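Generating the VCF cards is straightforward; a minimal vCard 3.0 entry per user looks like the following (names and emails are placeholders):

```python
def make_vcard(name: str, email: str) -> str:
    """Emit one minimal vCard 3.0 entry for import into Google Contacts."""
    return "\r\n".join([
        "BEGIN:VCARD",
        "VERSION:3.0",
        f"FN:{name}",                    # formatted display name
        f"EMAIL;TYPE=INTERNET:{email}",  # the address we want platforms to match against
        "END:VCARD",
    ]) + "\r\n"

def make_vcf(users) -> str:
    """Concatenate entries into a single .vcf file body for bulk import."""
    return "".join(make_vcard(name, email) for name, email in users)
```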
The idea was to sync these Google contacts with the contacts of platforms like Facebook, Instagram,
and Twitter. This method did not work as well as we had hoped, and the following were the problems
we faced -

Facebook - After syncing the contacts, we did get friend suggestions, but these were mixed: they
contained people from the contact list along with other random people. Another obstacle was that the
results page only loaded more Facebook profile links for the people in the contacts as the page was
scrolled, so there was no straightforward way to scrape the data.
Instagram - Instagram previously had a feature to sync Google contacts, but that feature no longer exists.
Twitter - We successfully synced the contacts, but Twitter could not find profiles for those people.


Finding linkability on Twitter
We came across another very interesting aspect of Twitter through which we could link Twitter accounts
to other social networks. Many Twitter posts ask people for their Snapchat, Spotify, etc. account IDs so
that users can connect with each other. Hence, we aimed to collect these different accounts for different
Twitter users. We started by using Twint, an advanced Twitter scraping tool written in Python that scrapes
tweets without using Twitter’s API, to get tweets containing the keywords “drop your {platform_name}”.
The following platforms were considered -
  1. Instagram
  2. Snapchat
  3. Spotify
  4. Soundcloud
  5. Venmo
  6. Paypal
Once such tweets were retrieved, we used the Twitter API to get their replies. This was quite challenging
since the API itself does not support fetching replies directly, so we ran a Python script every day to
scrape tweets from the previous 24 hours. The script searched for tweets that mentioned the user who
had posted the original tweet and whose in_reply_to field matched the tweet ID of the original tweet.
Another difficulty we faced while implementing this was that text data can be very noisy, so the data had
to be pre-processed. These replies eventually gave us different social network profiles for various Twitter
users.
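The daily reply-matching step can be sketched as follows, assuming each scraped tweet is represented as a dict exposing the fields we relied on (the in_reply_to status ID, the text, and the author); the field names here are illustrative:

```python
def find_replies(original: dict, candidates: list) -> list:
    """Keep only tweets that mention the original author AND whose
    in_reply_to field matches the original tweet's ID."""
    mention = "@" + original["username"]
    return [
        t for t in candidates
        if t.get("in_reply_to") == original["id"] and mention in t["text"]
    ]
```

The replies that survive both checks are the ones carrying the requested Snapchat, Spotify, etc. handles, which the noisy-text pre-processing then extracts.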

API Keys 
We noticed that many StackOverflow and GitHub users forget to remove API keys for their social
network accounts, such as Twitter and Slack, from their posts and commits. Hence, we decided to extract
such API keys from the GitHub commits and the StackOverflow data we had downloaded, to explore to
what extent these keys could be used to manipulate a user’s account.


Initially, we used a regex matching Slack API keys to find such keys in the GitHub commits and the
StackOverflow data.
Once we had these API keys, we needed a way to test and use them. Hence, we set up a terminal client
written in Go, which was a challenge in itself since Go was a completely new language that we had never
worked with. We then used a terminal-based library that took a Slack API key as input and opened a
terminal interface for that user’s account. Once this interface was open, we had complete control over
the user’s account and could even send messages from it. This was a gross breach of security and
privacy that we were able to exploit.

Comments