# Using Tweepy to Fetch Large Data Sets
Read the following code, be sure you understand what it does, then adapt it to your needs.

Much of the code here is adapted from [Tweepy Best Practices](http://www.nirg.net/using-tweepy.html).

# Let tweepy manage rate limiting for you

Instead of calling time.sleep yourself, let tweepy handle rate limits by changing the following settings on the API:

- Set wait_on_rate_limit to True so that tweepy automatically waits whenever a rate limit is reached
- Also set wait_on_rate_limit_notify to True so that the API prints a message telling you that it is waiting for the rate limit to reset (this option exists in Tweepy 3.x; Tweepy 4 removed it and logs a warning instead)

The API configuration will now look like this:

In [None]:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Configure the API to handle unexpected connection errors

Set the following options on the API configuration:

- Set retry_errors to the HTTP status codes that tweepy should retry on, such as 500 and 503
- Set retry_count and retry_delay to control how many times tweepy retries and how many seconds it waits between retries

The added configurations look like this:

In [None]:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True,
                 # these are the new configurations
                 retry_count=3, # retry 3 times
                 retry_delay=5, # wait 5 seconds between retries
                 retry_errors=set([401, 404, 500, 503]) # the errors that you retry on
                )
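
To make the meaning of retry_count, retry_delay and retry_errors concrete, here is a minimal pure-Python sketch of a retry loop. It is not tweepy's actual implementation; `call_with_retries`, `FakeHTTPError` and `flaky` are hypothetical names made up for this illustration.

```python
import time


class FakeHTTPError(Exception):
    """Hypothetical error carrying an HTTP status code (not part of tweepy)."""

    def __init__(self, status_code):
        super().__init__("HTTP error %d" % status_code)
        self.status_code = status_code


def call_with_retries(func, retry_count=3, retry_delay=5,
                      retry_errors=frozenset({500, 503}), sleep=time.sleep):
    """Call func() once, then retry up to retry_count more times.

    A retry happens only when the error's status code is in retry_errors.
    """
    attempt = 0
    while True:
        try:
            return func()
        except FakeHTTPError as err:
            # give up on non-retryable codes or when the retries are used up
            if err.status_code not in retry_errors or attempt >= retry_count:
                raise
            attempt += 1
            sleep(retry_delay)


# usage: a fake request that fails twice with 503 and then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeHTTPError(503)
    return "ok"


result = call_with_retries(flaky, retry_delay=0)
print(result, "after", calls["n"], "calls")
```

A status code that is not listed in retry_errors is raised immediately, which is why it is worth thinking about which codes are really temporary before adding them to the set.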

# Store progress, and print enough information to pick up where you stopped

Twitter uses a cursor to continue fetching data where it left off.

If the cursor value is -1, fetching starts from the beginning.

If you print the value of next_cursor, you can use it to continue where you left off after you stop.

To do that, keep the cursor in a variable, and always print the next_cursor value on error.

Remember to always store the data you receive.
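
Instead of copying the printed next_cursor by hand, you could also write it to a small file and read it back on the next run. The following is only a sketch of that idea; `save_cursor`, `load_cursor` and the file name `followers_cursor.json` are hypothetical and not part of tweepy.

```python
import json
import os
import tempfile


def save_cursor(path, next_cursor):
    """Write the cursor value to a small JSON checkpoint file."""
    with open(path, "w") as f:
        json.dump({"next_cursor": next_cursor}, f)


def load_cursor(path, default=-1):
    """Return the saved cursor, or -1 (start from the beginning) if none exists."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)["next_cursor"]


# usage: a temporary directory is used here so the example leaves your files alone
checkpoint = os.path.join(tempfile.mkdtemp(), "followers_cursor.json")
first = load_cursor(checkpoint)        # no checkpoint yet, so start at -1
save_cursor(checkpoint, 1234567890)    # made-up cursor value
resumed = load_cursor(checkpoint)
print(first, resumed)
```

The value returned by `load_cursor` would then be passed as the cursor when the fetch starts again.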

Here is what the final code looks like:

In [4]:
import pandas as pd
import tweepy

# let's add the first key
consumer_key = "**HIDDEN**"
consumer_secret = "**HIDDEN**"
access_token = "**HIDDEN**"
access_token_secret = "**HIDDEN**"
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)


# you can add more keys here, I will use only one

# configuring the api instance
api = tweepy.API(auth, # the keys list
                 wait_on_rate_limit=True, # will manage rate limiting
                 wait_on_rate_limit_notify=True, # will notify me of rate limiting
                 # these are the new configurations
                 retry_count=3, # retry 3 times
                 retry_delay=5, # wait 5 seconds between retries
                 retry_errors=set([401, 404, 500, 503]) # the errors that you retry on
                )


# put the cursor in a variable
# you can set a limit on the number of pages, for example 5 pages:
#followers_cursor = tweepy.Cursor(api.followers, screen_name="zainkuwait").pages(5)


# but I'm going to fetch all the pages

# notice that next_cursor is set to -1
# change this value to the value printed in the error message
# for next_cursor to continue where you left off
next_cursor = -1 
followers_cursor = tweepy.Cursor(api.followers, screen_name="zainkuwait", cursor=next_cursor).pages()

# this will hold the dataframe
follower_df = None

# now start fetching pages:
try:
    for page in followers_cursor:
        # fetch the data that you want from the page
        follower_data_list = [
            {
                # Remember, User is an object, not a dictionary
                "screen_name": x.screen_name,
                "location": x.location,
                "description": x.description,
                "protected": x.protected,
                "followers_count": x.followers_count,
                "favourites_count": x.favourites_count,
                "statuses_count": x.statuses_count,
                "created_at": x.created_at,
            } for x in page
        ]
        # create a dataframe and append new data to it
        if follower_df is None:
            follower_df = pd.DataFrame(follower_data_list)
        else:
            # DataFrame.append was removed in pandas 2.0; pd.concat works in all versions
            follower_df = pd.concat([follower_df, pd.DataFrame(follower_data_list)], ignore_index=True)
# in case of any error, print the next_cursor so you can resume later
# catching KeyboardInterrupt means that even if you stop the cell,
# you still save the data and get the next_cursor
except (Exception, KeyboardInterrupt) as e:
    print("error is ", e)
    print("next_cursor is ", followers_cursor.next_cursor)

# don't forget to store the dataframe after the loop is done,
# so we do it outside the try/except
# (a DataFrame has no single truth value, so compare against None explicitly)
if follower_df is not None:
    follower_df.to_csv("zain_followers.csv")

Rate limit reached. Sleeping for: 687
error is  
next_cursor is  -1
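
The code above writes the CSV only once, at the end. As an alternative sketch, each page could be appended to the file as soon as it arrives, so that a crash loses at most the current page. This version uses only the standard csv module; `append_page`, `FIELDS` and the sample rows are made up for illustration.

```python
import csv
import os
import tempfile

FIELDS = ["screen_name", "followers_count"]


def append_page(path, rows, fieldnames=FIELDS):
    """Append one page of rows to a CSV file, writing the header only once."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# usage with made-up data standing in for two pages of followers
out_path = os.path.join(tempfile.mkdtemp(), "followers.csv")
append_page(out_path, [{"screen_name": "alice", "followers_count": 10}])
append_page(out_path, [{"screen_name": "bob", "followers_count": 20},
                       {"screen_name": "carol", "followers_count": 30}])

with open(out_path, newline="") as f:
    saved = list(csv.DictReader(f))
print(len(saved), "rows saved")
```

In the fetch loop, the list built from each page would be passed to `append_page` in place of growing the dataframe in memory.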
