Counting Presidential Tweets

May 07, 2019 · 12 min read

Last reviewed May 07, 2019

Some Context

The Brazilian president Jair Bolsonaro is very active on social media, specifically Twitter. This behavior, which has been disrupting “traditional media”, is controversial, to say the least. It is also debatable whether the president himself runs the account or whether his son, who adopts a more aggressive style, has been tweeting.

This big dilemma (contains irony) caught the attention of a college professor who allegedly counted 6,452 tweets, discovered that most of them came from an iPhone and, Sherlock Holmes-style, declared that Bolsonaro’s son (who apparently owns an iPhone) must be the one piloting the account (it is said that the president owns a Samsung, even though he has been seen with an Apple device several times).

This big puzzle (contains irony) sparked an idea: can one count tweets in an efficient manner? It is unclear how the professor did it (most likely a poor undergrad was given this noble task), but I like to think that I am saving humans’ time (that is what computers are for, after all).

On my Twitter account I gave a sneak peek of what we are going to see in this post.

So now it is time to elaborate.

Opening the Toolbox

To scrape the tweets, I used puppeteer, an awesome headless browser solution built by the Google Chrome folks. It has a large community, so finding out how to log in to Twitter is as easy as this less-than-20-line snippet.

Puppeteer is a Node library, so make sure you have Node installed on your machine. Also note that all the code for this tutorial is available on GitHub.

The Plan

Before starting to get technical, let us establish a baseline plan for our end goal. We want to:

  1. Navigate to the target account (in our case, @jairbolsonaro).
  2. Open the most recent tweet.
  3. Scrape the medium used to tweet (e.g. Twitter for iPhone, Twitter for Android, etc).
  4. Go back and open the next tweet.
  5. Repeat 3 and 4 until we analyze all tweets.
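
Once step 5 finishes, the end result is just a tally per client. Here is a minimal sketch of that last step, assuming the scraped values end up in a plain array of strings. The function name and the sample data are made up for illustration and are not part of the original code.

```javascript
// Count how many tweets came from each client.
// `sources` is assumed to be an array of strings such as "Twitter for iPhone".
function tallySources(sources) {
  const counts = new Map();
  for (const source of sources) {
    counts.set(source, (counts.get(source) || 0) + 1);
  }
  return counts;
}

// Made-up sample data, for illustration only.
const sample = [
  "Twitter for iPhone",
  "Twitter for Android",
  "Twitter for iPhone",
];

for (const [source, total] of tallySources(sample)) {
  console.log(source, total);
}
```

A Map keeps the clients in the order they were first seen, which makes the printed summary easy to compare against the timeline.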

It is simple but has some gotchas. As we are going to see, Twitter does its best to prevent us from spamming its servers (and you can get rate-limited). It also tries to load tweets efficiently, so only a handful of tweets are “visible” at any one time (i.e. to avoid overloading your browser, the top tweets are replaced as you keep scrolling down).

With the preamble out of the way, let’s finally get our hands on some code.

First Steps

Setting up puppeteer is as easy as:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  // Forward the browser's console messages to the terminal
  page.on('console', consoleObj => console.log(consoleObj.text()));
  await page.setViewport({ width: 1280, height: 800 });
})();

I like to launch puppeteer with the headless: false option when I’m actively developing so I can see it visually working, but feel free to omit the config if you want.

The page.on('console', ...) line is a cool trick that sends the output of the browser's console.log straight to the terminal. It will be useful during the development phase.

Next, we need to log in to Twitter. This shouldn’t be difficult, and you can follow the example I mentioned above. There is an extra at the end of the article that makes this process interactive, i.e., asks for the username, password, and target account.

(async () => {
  ...
  // Make sure you change to your credentials
  const username = "webquintanilha";
  const pwd = "mysecretpwd";

  // Login
  await page.goto('https://twitter.com/');
  await page.waitForSelector('.StaticLoggedOutHomePage-cell > .StaticLoggedOutHomePage-login > .LoginForm > .LoginForm-username > .text-input');
  await page.type('.StaticLoggedOutHomePage-cell > .StaticLoggedOutHomePage-login > .LoginForm > .LoginForm-username > .text-input', username);
  await page.type('.StaticLoggedOutHomePage-cell > .StaticLoggedOutHomePage-login > .LoginForm > .LoginForm-password > .text-input', pwd);
  await page.click('.StaticLoggedOutHomePage-content > .StaticLoggedOutHomePage-cell > .StaticLoggedOutHomePage-login > .LoginForm > .EdgeButton');
  await page.waitForNavigation();
})();

Once you are logged in, it is important to switch to the new version of Twitter, since the medium used to tweet is only shown there. Note that this might change in the future (the new version may become the default). In any case, the solution here is to click on settings and select the Try the new Twitter option.

// Open new Twitter
await page.waitForSelector('#timeline');
await page.click('#user-dropdown-toggle');
await page.waitForSelector('.enable-rweb-link');
await page.click('.enable-rweb-link');

To finish the first step all you need to do is go to your target account. Make sure you go to their profile with replies, so you don’t miss anything.

const target = "jairbolsonaro";
await page.waitForNavigation({ waitUntil: 'networkidle2' });
await page.goto(
  `https://twitter.com/${target}/with_replies`, 
  { waitUntil: 'networkidle2' });

The waitUntil option determines when the navigation is considered to have succeeded. networkidle2 means the navigation is considered finished once there have been no more than 2 network connections for at least 500 ms. All of this can be checked in the API docs.

Save and run your code to confirm you are in the right territory! Now things get more interesting.

Fetching Tweets

Let us now fetch and analyze every tweet. Notice that the new version of Twitter has a unique URL for each tweet, which makes our life way easier. All we need to do is go tweet by tweet: click on it, wait for the new URL to open, get the relevant data, rinse and repeat.

But first, let’s define some control variables:

const data = [];
const urls = [];
await page.evaluate(() => {
  window.i = 0;
});

The data array will hold our values and urls will hold every inspected tweet, since we need to avoid counting the same tweet twice.

The new piece here, page.evaluate, runs a function inside the page, giving us access to the regular DOM (as if you were writing JavaScript in the browser console). We use it to define an iterator for walking the list of tweets. From now on, window.i is available in the browser context.

We need the iterator because Twitter only shows a subset of the tweets. As you scroll down, this list changes and we will need to go over it again. The bad news is that there will be a lot of overlap.

Suppose that the following is your visible timeline, i.e., the tweets loaded from the server.

  1. Tweet A
  2. Tweet B
  3. Tweet C

We can get the total number of visible tweets with a selector like:

await page.evaluate(() => {
  const visibleTweets = document.querySelectorAll("div[data-testid=tweet]").length;
  console.log("Total visible tweets", visibleTweets);
});

And in our example visibleTweets is 3. Using our iterator, we can do something like:

while ( true ) {
  await page.evaluate(() => {
    ...
    const tweet = document.querySelectorAll("div[data-testid=tweet]")[window.i];
    tweet.click();
    window.i++;
  });

  /* DO SOMETHING IN THE PAGE */
  await page.goBack();
}

This will work for the first 3 posts, but after that window.i will run past the end of the list and our code will crash (tweet will be undefined). So we need to handle the case where more tweets have to be fetched.

We can do this by scrolling down a bit and waiting for a while (to make sure the tweets are loaded). This will also be handy to avoid being throttled by Twitter.

Here’s a complete version:

while ( true ) {
  await page.evaluate(async () => {
    ...
    const tweet = document.querySelectorAll("div[data-testid=tweet]")[window.i];
    if ( typeof tweet === "undefined" ) {
      window.i = 0;
      window.scrollBy(0, 1500);
      const sleep = ms => { return new Promise(resolve => setTimeout(resolve, ms)); }
      await sleep(5000);
      document.querySelectorAll("div[data-testid=tweet]")[0].click();
    } else {
      tweet.click();
    }
    window.i++;
  });

  /* DO SOMETHING IN THE PAGE */
  await page.goBack();
}

A couple of important things are going on here:

  • First, notice that page.evaluate now receives an async function, which is necessary in order to sleep for 5 seconds.
  • When we run past the visible tweets, we reset our counter.
  • We scroll an arbitrary number of pixels. By trial and error, I found that 1500px gave me the best results (you don’t want to scroll too little and fetch only a few posts, nor scroll too much and lose tweets along the way).
  • We sleep 5 seconds, which I consider safe, again by trial and error. You can experiment with smaller delays such as 4 seconds, but 1-2s will definitely get the Twitter police after you.
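
The index bookkeeping described in these bullets can be isolated into a small pure function, which makes the reset rule easier to reason about outside the browser. This is only a sketch: nextStep and its return shape are hypothetical names and are not part of the scraper itself.

```javascript
// Decide what the loop should do next, given the current index and the
// number of tweets currently present in the DOM.
function nextStep(index, visibleCount) {
  if (index >= visibleCount) {
    // Ran past the visible tweets: scroll and wait, click the first
    // tweet of the refreshed list, and continue from index 1.
    return { scroll: true, clickIndex: 0, nextIndex: 1 };
  }
  // Normal case: click the current tweet and advance by one.
  return { scroll: false, clickIndex: index, nextIndex: index + 1 };
}

console.log(nextStep(2, 3)); // last visible tweet, no scrolling needed
console.log(nextStep(3, 3)); // past the end, so scroll and restart
```

This mirrors the branch in the loop above: the undefined check on tweet corresponds to index >= visibleCount, and the trailing window.i++ corresponds to nextIndex.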

Now one important caveat of this approach: when we fetch new posts, remember that some of the previously visible ones will remain visible. So now we get a situation like:

  1. Tweet B
  2. Tweet C
  3. Tweet D
  4. Tweet E
  5. Tweet F

While the bottom 3 tweets are new, B and C will overlap and be evaluated again. We will need to account for that in the next section.
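
To make the overlap concrete, here is a small simulation of the two timelines above. It is a sketch under stated assumptions: tweets are represented by single letters, and a Set is used for the bookkeeping instead of the urls array defined earlier. The collectNew name is hypothetical.

```javascript
// Two successive "windows" of visible tweets, as in the example above.
const firstWindow = ["A", "B", "C"];
const secondWindow = ["B", "C", "D", "E", "F"];

// Return only the tweets that were not seen before, and remember them.
function collectNew(seen, visible) {
  const fresh = [];
  for (const id of visible) {
    if (!seen.has(id)) {
      seen.add(id);
      fresh.push(id);
    }
  }
  return fresh;
}

const seen = new Set();
console.log(collectNew(seen, firstWindow));  // every tweet is new
console.log(collectNew(seen, secondWindow)); // B and C are skipped
console.log(seen.size);                      // distinct tweets so far
```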

Scraping the Data

Recall that we are in a while loop which redirects us to a specific URL.

while ( true ) {
  await page.evaluate(async () => {
    ...
  });
  /* 
    WE ARE IN THE TWEET PAGE
    DO SOMETHING AND RETURN 
  */
  await page.goBack();
}

The first thing is to check which URL we are in. We can use page.url() for that.

There are 3 possible outcomes for a URL:

  1. The URL was already evaluated;
  2. The URL does not belong to the target (i.e. current tweet is a retweet of another user);
  3. The URL is a new tweet yet to be scraped.
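
Before wiring this into the loop, the three outcomes can be expressed as a small pure function that is easy to test in isolation. This is a hedged sketch, not the code used in the post: classifyUrl and its labels are hypothetical, a Set stands in for the urls array, and the status URLs are invented for illustration.

```javascript
// Classify a tweet URL into one of the three outcomes listed above.
// `seenUrls` is assumed to be a Set of URLs that were already processed.
function classifyUrl(url, target, seenUrls) {
  if (seenUrls.has(url)) {
    return "repeated";
  }
  if (!url.startsWith(`https://twitter.com/${target}/`)) {
    return "foreign";
  }
  return "new";
}

// Made-up URLs, for illustration only.
const seenUrls = new Set(["https://twitter.com/jairbolsonaro/status/1"]);
const target = "jairbolsonaro";

console.log(classifyUrl("https://twitter.com/jairbolsonaro/status/1", target, seenUrls));
console.log(classifyUrl("https://twitter.com/someoneelse/status/2", target, seenUrls));
console.log(classifyUrl("https://twitter.com/jairbolsonaro/status/3", target, seenUrls));
```

Using startsWith is slightly stricter than an indexOf check against -1, because the prefix has to appear at the very beginning of the URL.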

Here’s how we can summarize it:

while ( true ) {
  await page.evaluate(async () => {
    ...
  });

  const url = page.url();
  if ( urls.indexOf(url) > -1 ) {
    console.log("Ooops, repeated URL", url);
  } else if ( url.indexOf(`https://twitter.com/${target}/`) === -1 ) {
    console.log("Ooops, this does not belong to the target", url)