Screaming Frog on Google Cloud


Screaming Frog SEO Spider is one of the few tools no SEO can do without. It's unquestionably a superb crawler, until you start feeling its limitations. Does that situation sound familiar? Like running out of RAM when using Screaming Frog SEO Spider? And not being able to put more RAM into the computer to satisfy Screaming Frog SEO Spider’s needs when crawling really big sites either? There is a solution: Run Screaming Frog SEO Spider in the cloud, and more specifically on Google Cloud infrastructure.

  1. Screaming Frog SEO Spider
  2. Google Compute Engine
    1. Installing the Google Cloud SDK
    2. Google Cloud Console
    3. Running Your First Google Compute Engine Instance
  3. Setting up Your Screaming Frog Instance
    1. Setting up Startup Scripts
    2. Installing Screaming Frog SEO Spider
    3. Installing Screaming Frog Log Analyser
    4. Connecting through VNC
    5. Backing up Your Instance
  4. Using Your Preconfigured Instance
  5. Updating Your Preconfigured Instance
  6. Adding Extra Disk Space
  7. Transferring Data Through SSH
  8. Storing Your Data in Google Cloud Storage
  9. Keeping Command Line Processes Running
  10. Conclusion
  11. Frequently Asked Questions & Troubleshooting

Screaming Frog SEO Spider

For most SEOs, Screaming Frog SEO Spider is THE crawler to crawl and analyse a website from an SEO perspective. Developed by an agency in the United Kingdom called Screaming Frog, Screaming Frog SEO Spider is also one of the few quality crawlers - if not the only one - that works on Debian-based systems like Ubuntu. Screaming Frog SEO Spider can be used for many different purposes, such as on-page analysis, and checking your backlink profile.


The biggest challenge with Screaming Frog SEO Spider is that the program needs a lot of RAM to crawl big sites or large lists of URLs. Although the development team are working on several fixes, you can utilise Google infrastructure with lots of RAM to run Screaming Frog SEO Spider. This is where Google Compute Engine comes in.


Google Compute Engine

In case you are not aware, Google allows you to run large-scale workloads on virtual machines powered by Google infrastructure. This service is called Google Compute Engine, and it is a great opportunity to rent computing resources on a pay-per-use basis. Google Compute Engine is still very much under development, but it is already a serious competitor to the Amazon Cloud and Windows Azure Cloud services. Its pricing is often as competitive as, or even cheaper than, that of the already-established Amazon Cloud service, making it a great alternative.

And to be honest, as an SEO Consultant, former Google Support Engineer and Google Search Quality team member, I love the notion of using Google hardware and infrastructure to crawl websites for me. So let’s get started by setting up your new Google Compute Engine instance with the Screaming Frog SEO Spider.

Installing the Google Cloud SDK

To get started, you first need to install the Google Cloud SDK locally on your computer. This process is rather lengthy, but only needs to be done once. As it is much easier to run this from a Unix-like system, such as Ubuntu or macOS, this article only covers the steps for Linux machines. If you prefer to install the Google Cloud SDK on a local Windows machine, follow Google's installation steps for Windows instead.

First open a terminal and make sure that you have installed curl. You can test this by typing “curl” into the terminal and seeing if it suggests that you install it. If needed and assuming you run Ubuntu, install curl by executing the following command:

sudo apt-get install curl

After you have verified that curl is installed, run the following command in the terminal (from the home directory). This will download and install the Google Cloud SDK:

curl https://sdk.cloud.google.com | bash

The next step is to let the terminal know that Google Cloud SDK has been installed. You have two options, the first of which is the more straightforward: close the terminal and then reopen it. Alternatively, you can also execute the following command (which will avoid you having to restart the terminal):

exec -l $SHELL

Once you have done this, you can verify that the Google Cloud SDK is installed by the following command:

gcloud version

If this does not return an error, and gives you an output with a version number and a list of different tools installed, then the Google Cloud SDK is installed and works on your system.

Verifying the installed version of Google Cloud SDK.
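The checks described above (is curl installed, is gcloud available) can be sketched as a small script. This is only a minimal illustration: `check_tool` is a hypothetical helper name and not part of the Google Cloud SDK.

```shell
#!/usr/bin/env bash
# Report whether a command-line tool is available on the PATH.
# check_tool is a hypothetical helper name, not part of the Google Cloud SDK.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
    return 1
  fi
}

# curl is needed for the installer; gcloud should exist once the SDK is installed.
for tool in curl gcloud; do
  check_tool "$tool" || true
done
```

If gcloud is reported as missing right after the installation, reopening the terminal as described above is the first thing to try.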

The next step is to authenticate your computer with the Google Cloud services. This will allow you to send commands to the Google Cloud to manage different Google Cloud services. In the terminal, execute the following command:

gcloud auth login

Depending on whether you have a browser installed on your local computer, a new browser window will open and ask you to give permission for Google Cloud SDK to access your Google account (you may be asked to log into your Google account first). Alternatively, you may also be asked to copy-paste a link into a browser from the terminal, and complete the process that way. Once accepted, you should see a confirmation: “You are now authenticated with the Google Cloud SDK.”

In the meantime, in the terminal you will be asked to enter the Google Cloud project ID. Just press Enter for now, as we will get back to this in a minute. If everything went well, then you will have received a message in the terminal that you are logged in with your Google account.

Google Cloud Console

The next step is to go to the Google Cloud Console for developers. This is the website for managing the Google Cloud services, such as Google Compute Engine.

To get started, a new project needs to be created, so click the red button at the top of the page that says “Create Project”. A pop-up appears where you can give the project a name and define a Project ID. The Project Name is not that important, as it is only used in the Google Cloud Console. Just fill in here “Screaming Frog”.

The Project ID is very important, as it uniquely identifies the project among all Google Cloud Services users. So if you don’t go with the default suggestions from Google (click the little arrow on the right side of the input field for more suggestions from Google), it may take some effort to find an available and unique Project ID. For this project I am going with “screaming-frog-om”.

Creating a new Google Cloud project.

Once you click the Create button, you will be redirected to the permissions overview page of the project - in this case https://console.cloud.google.com/home/dashboard?project=screaming-frog-om (note that the Project ID is used in the URL here).

Now comes an important step: we need to enable billing. Google Compute Engine does not have any free quotas, so in order to use Google Compute Engine, we need to enable billing. Most likely you will receive a free trial credit to utilize in the first 60 days. Find out more information about the pricing of Google Compute Engine here.

Enable Billing.

To enable billing, click on Settings in the left sidebar of the project overview page. One of the first options you should now see is Billing. Have your credit card ready, then click “Enable billing”. Select the right country, enter your address, tax information (if applicable), phone number and name, and continue to enter your credit card data. Once you have completed this step, you are ready to start using Google Compute Engine.

Tip: Once you have enabled billing and started using the Google Cloud services, you will see a link on the overview page of the project (in the right upper corner) to more details of the estimated charges for the present month.

All the steps until now have been about setting up your computer and the project. Most of these steps you don’t need to repeat unless you change computer or want to set up new projects.

Are you ready to start the first virtual machine on Google infrastructure?

Running Your First Google Compute Engine Instance

Open the terminal again, and execute the following command (replace <project-id> with your unique Project ID - for this article that would be "screaming-frog-om"):

gcloud config set project <project-id>

Everything you do now with the Google Cloud SDK will be executed as part of screaming-frog-om, which includes the billing.

Next, we need to set a default zone by executing the following command (this article uses "us-central1-f"; substitute another zone name if you prefer):

gcloud config set compute/zone us-central1-f

Be aware that European zones tend to be slightly more expensive than US zones. Setting a default zone is optional; the same can be accomplished by adding:

--zone "<ZONE_NAME>"

to almost every gcloud command in this article. When not set using "gcloud config", you can use this added command line parameter to run multiple Google Compute Engine instances utilizing many CPUs in multiple zones.
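As a rough sketch of the multi-zone idea, the loop below builds one create command per zone using the --zone parameter. The helper name `create_in_zones`, the DRY_RUN switch and the instance naming scheme are assumptions made for this illustration; with DRY_RUN=1 (the default here) the commands are only printed, so nothing is started or billed.

```shell
#!/usr/bin/env bash
# Print (or run) one "instances create" command per zone.
# DRY_RUN=1 (the default in this sketch) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

create_in_zones() {
  local name_prefix="$1"; shift
  local zone cmd
  for zone in "$@"; do
    cmd="gcloud compute instances create ${name_prefix}-${zone} --zone ${zone}"
    if [ "$DRY_RUN" = "1" ]; then
      echo "$cmd"
    else
      $cmd
    fi
  done
}

# The zone names are examples; `gcloud compute zones list` shows the valid ones.
create_in_zones screaming-frog us-central1-f europe-west1-b
```

Review the printed commands first, then rerun with DRY_RUN=0 if they look right.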

To confirm that you don’t have anything running, execute the following command in the terminal:

gcloud compute instances list

List all active Google Compute Engine instances.

All good so far. Now you can create a new instance by executing the following command:

gcloud compute instances create screaming-frog-test

The observant reader may have noticed that I omitted the “-om” part in the previous command. This is because “screaming-frog-test” is another unique identifier for the instance within the project “screaming-frog-om”.

For this stage you can go with the default options, which use the debian-8 image and the default minimum machine type. The instance is now being set up. Once this has completed, you can run the following command to log in using SSH (command line):

gcloud compute ssh screaming-frog-test

Note: You may be asked to set up the SSH keys. Just follow the instructions.

Setting up your SSH keys.

Congratulations! You are now connected to the virtual machine on Google infrastructure. You can confirm this by going to the project in the Google Cloud Console, or executing the following command in the terminal on your local machine:

gcloud compute instances list

Just for this stage, let’s shut the instance down again. Assuming that you are still connected to the instance using SSH, execute the following command in the terminal:

exit

This will log you out and close the connection between your computer and the instance. Now go to the Google Cloud Console, select the project, select Compute Engine, select VM Instances, click on the screaming-frog-test link, and go to the bottom of the page. Here you can click the Delete button to delete the instance. When you click this, don’t forget to also delete the boot disk, screaming-frog-test.

Alternatively, to delete the instance and its boot disk from the command line, execute the following command in the terminal:

gcloud compute instances delete screaming-frog-test --delete-disks boot

You will be asked to confirm that you want to delete the instance and the boot disk. After you confirm this, Google Compute Engine will try to delete the virtual machine instance and the boot disk. You can again confirm that the instance is shut down (and therefore not accruing any cost) by executing the following command on your local machine:

gcloud compute instances list

Note: Sometimes Google Cloud services may have some lag, and the commands may time out in the terminal. If this happens, you can review the deletion progress in the Google Cloud Console.


Setting up Your Screaming Frog Instance

Now that the Google Cloud project is set up, and you know the basic commands to work with Google Compute Engine instances, it is time to set up an instance with Screaming Frog SEO Spider.

First we create a new instance. You have the choice of doing this in the Google Cloud Console or by executing the following command in the terminal.

I am using “screaming-frog” as the unique identifier for the instance:

Creating an instance on Google Compute Engine, using the Google Cloud Console.

For this guide I will continue with the command line.

gcloud compute instances create screaming-frog --machine-type n1-standard-8

Whichever method you use, choose a machine type with enough RAM (I tend to go for n1-standard-8, as in the command above; check out your machine type options here) and make sure the debian-8 image is used (most important!).

After the instance is up and running, SSH into the instance with the following command:

gcloud compute ssh screaming-frog

Now that you are logged into the instance, you need to switch to root by executing the following command in the terminal:

sudo -s

Now that you are in root, you need to update the software packages. Execute the following command:

apt-get update && apt-get upgrade

Then execute the following command to install the necessary programs:

apt-get install tightvncserver xfce4 xfce4-goodies xdg-utils software-properties-common python-software-properties enchant geoclue-2.0 gstreamer1.0-plugins-good gstreamer1.0-x hunspell-en-us libaa1 libavc1394-0 libcaca0 libdv4 libenchant1c2a libflac8 libharfbuzz-icu0 libhunspell-1.3-0 libiec61883-0 libjack-jackd2-0   libjavascriptcoregtk-3.0-0 libjim0.75 libmbim-glib4 libmbim-proxy libmm-glib0 libnl-3-200 libnl-genl-3-200 libopus0 libpcsclite1 libqmi-glib1 libqmi-proxy libraw1394-11 libsamplerate0 libshout3 libspeex1 libwavpack1 libwebkitgtk-3.0-0 libwebkitgtk-3.0-common libwebp5 libxslt1.1 modemmanager usb-modeswitch usb-modeswitch-data wpasupplicant zenity zenity-common curl

This will take a few minutes, and will install a VNC server and a minimalistic Graphical User Interface that uses very little resources. When asked for the keyboard configuration, just choose the default (use Tab on your keyboard to navigate to the “OK” option and Enter to execute).

At this point it may also be handy to execute the following command to avoid future warnings about locales not set:

dpkg-reconfigure locales

You will be asked to select a locale. The easiest (but also the most time-consuming) option is to select “All Locales” (use Tab to navigate to the “OK” option) and then select “None” as the default locale for the system environment. This process may take a few minutes to complete and is completely optional.

Once that process is completed, you need to add another user, named “vnc”, to the system by executing the following command:

adduser vnc

When prompted, enter a secure password of eight characters. You can skip all the other values by just pressing Enter for the default. Choose Y to confirm that the information is correct.

Now you need to set up a new password for the user. First switch to the new user by executing the following command:

su vnc

and then execute the following command:

vncpasswd

When prompted, say (N)o to the question of whether you would like to enter a view-only password. I recommend you use the same eight-character password as the one you chose when creating the user.

This process will create a new directory in the home directory of the vnc user (/home/vnc/), and set a new password that will later be used to make a VNC connection to the instance. Keep in mind that this password cannot be more than eight characters long.

Setting up Startup Scripts

Now that the VNC user has been set up, a few startup scripts need to be installed that will run the VNC server every time the instance gets started and/or rebooted. First change back to the root user by typing the following command:

exit

Now download the first startup script by executing the following command:

wget https://online.marketing/files/vncserver -O /etc/init.d/vncserver

Then download the second startup script by executing the following command:

wget https://online.marketing/files/xstartup -O /home/vnc/.vnc/xstartup

Now that the startup scripts have been downloaded and installed, you can make the VNCserver work by executing the following commands:

chown -R vnc: /home/vnc/.vnc && chmod +x /home/vnc/.vnc/xstartup && sed -i 's/allowed_users.*/allowed_users=anybody/g' /etc/X11/Xwrapper.config && chmod +x /etc/init.d/vncserver

Now reboot the instance by executing the following command:

reboot

The SSH connection will be closed at this time. It may take a minute or two, but then access the instance through SSH again by executing the following command:

gcloud compute ssh screaming-frog

and switch again to the root user by executing the following command:

sudo -s

Now let’s start the VNC service by executing the following two commands:

update-rc.d vncserver defaults && service vncserver start

Congratulations, you can now use any VNC-capable program to access the instance using a VNC connection.

Installing Screaming Frog SEO Spider

Before connecting through VNC, let's finish the installation process by installing Screaming Frog SEO Spider and the Oracle Java library by executing the following commands:

echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" | tee /etc/apt/sources.list.d/webupd8team-java.list && echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" | tee -a /etc/apt/sources.list.d/webupd8team-java.list && apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EEA14886 && 
apt-get update && apt-get install oracle-java8-installer

When prompted, select OK and use the arrow keys to select YES. Next, set Oracle Java as the default Java library:

apt-get install oracle-java8-set-default

To confirm that Oracle Java 8 has been successfully installed and made the default Java library, execute the following command:

java -version

and something like the following is returned:

java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)

Now that the Oracle Java library is installed, we need to add the "ttf-mscorefonts-installer" package before we can install the latest version of Screaming Frog SEO Spider.

add-apt-repository "deb http://http.debian.net/debian jessie main contrib non-free" && apt-get update && apt-get install ttf-mscorefonts-installer

Now execute the following command to download Screaming Frog SEO Spider:

wget https://www.screamingfrog.co.uk/products/seo-spider/screamingfrogseospider_6.2_all.deb

Then install Screaming Frog SEO Spider for all users by executing the following command:

dpkg -i screamingfrogseospider_6.2_all.deb

This may throw up an error due to a dependency called “zenity”. This can be solved by entering the following command:

apt-get -f install

After which the following is returned (in a list):

Setting up screamingfrogseospider (6.2) ...

Screaming Frog SEO Spider is now installed!

Note: A newer version of Screaming Frog for Ubuntu may have been released by the time you read this article. If that is the case, go to the website of Screaming Frog SEO Spider to find the new URL to the latest Ubuntu version.

Installing Screaming Frog Log Analyser

Now that all the necessary dependencies are installed in the steps above, it is also possible and quite easy to install Screaming Frog Log Analyser - although this is optional and depends on having a valid license.

wget https://www.screamingfrog.co.uk/products/log-file-analyser/screamingfroglogfileanalyser_1.6_all.deb

Then install Screaming Frog Log Analyser for all users by executing the following command:

dpkg -i screamingfroglogfileanalyser_1.6_all.deb

Screaming Frog Log Analyser is now installed!

Note: A newer version of Screaming Frog Log Analyser for Ubuntu may have been released by the time you read this article. If that is the case, go to the website of Screaming Frog Log Analyser to find the new URL to the latest Ubuntu version.

Connecting through VNC

Let's log out of the instance again.

exit
exit

Before you continue in the terminal, the firewall rules of the instance need to be updated. To do this, go to the Google Cloud Console in your browser, select the project, select Networking in the left sidebar.

Selecting Networking in the Google Cloud Project.

Click on the “Firewall rules” link in the menu. Here you need to add a new rule. Click on the “Create a firewall rule” button, and use the following details to fill in the form:

Name: vnc
Source IP ranges:  <YOUR-IP-ADDRESS>
Allowed protocols & ports: tcp:5800,5900-5909

Replace <YOUR-IP-ADDRESS> with your IP address. If needed, you can find your IP address here. Use the defaults for the remaining fields and click the blue “Create” button. A new firewall rule will be created that will allow you to access the VNC server.
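The same firewall rule can presumably also be created from the command line. The sketch below only prints the gcloud command so you can review it before running it. `vnc_firewall_cmd` is a hypothetical helper name, the /32 suffix (a single IPv4 address) is an assumption about how you want to restrict access, and the IP address shown is a documentation placeholder.

```shell
#!/usr/bin/env bash
# Build the gcloud command that mirrors the "vnc" firewall rule from the form above.
# vnc_firewall_cmd is a hypothetical helper; it only prints the command.
vnc_firewall_cmd() {
  local ip="$1"
  # /32 restricts the rule to a single IPv4 address.
  echo "gcloud compute firewall-rules create vnc --allow tcp:5800,tcp:5900-5909 --source-ranges ${ip}/32"
}

# 203.0.113.10 is a documentation-only placeholder; use your own IP address.
vnc_firewall_cmd 203.0.113.10
```

Note that on the command line each port range carries its own protocol prefix, whereas the Console form accepts "tcp:5800,5900-5909".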

Now let's connect to the VNC server. To find out which IP address you need to use to access the instance, make sure you are logged out of the SSH connection and execute the following command on your local machine to get the external IP address listed in the table:

gcloud compute instances list

To connect to the instance through VNC on Ubuntu, using the program Remmina, start a new connection and update the <INSTANCE-IP-ADDRESS> and <YOUR-VNC-PASSWORD> as illustrated in the example below and click "Connect":

hostname: <INSTANCE-IP-ADDRESS>:5901
password: <YOUR-VNC-PASSWORD>

Example of configuring your VNC connection.

If needed, you can install Remmina on your local Ubuntu computer by executing the following command on your local machine:

sudo apt-get install remmina

Once the VNC connection has been established, a pop-up can be seen on the desktop. Select the “Use default config” option.

Next create a shortcut on the desktop for Screaming Frog SEO Spider. Right-click on the background of the desktop, select “Create Launcher” and use the following details to fill in the pop-up:

Name: Screaming Frog
Command: screamingfrogseospider %f

and click the Create button. A new shortcut will be created, and can be found on the desktop. Just double-click this shortcut to start Screaming Frog SEO Spider. When asked, choose "Mark as Executable". At this point, it will help to enter the licence information before closing the program again.

Note: When typing this name you may get a suggestion to select “Create Launcher Screaming Frog SE...”. If this happens, select this option. After selecting the suggestion, and before clicking the “Create” button, you can also click the icon button to select the standard Screaming Frog SEO Spider icon for the launcher.

Example of creating a new launcher and choosing the suggestion from Debian.

Optional: If you have also installed Screaming Frog Log Analyser, repeat the step above and also create a launcher for Screaming Frog Log Analyser. After creating the launcher, again just double-click this shortcut to start Screaming Frog Log Analyser and enter the license data before closing the program again. When asked, choose "Mark as Executable".

You are now almost done with the installation process. There is one more step, which is to adapt the allocated memory to the instance RAM we have available. This process will only work if you have started Screaming Frog SEO Spider at least once through the VNC. Go back to the terminal and connect through SSH again with the following command:

gcloud compute ssh screaming-frog

Now switch to the root user with the following commands:

sudo -s

Now open the Screaming Frog SEO Spider configuration file for allocated memory with the following command:

pico /home/vnc/.screamingfrogseospider

Depending on the type of machine you have chosen for the instance, change the number 512 to a number close to the maximum RAM of the instance. For example, if using the n1-standard-8 then the available RAM is 30GB - in this case update the number 512 to 30000. Close the file and save the changes by pressing Ctrl-X, and answer Y(es) to the question whether or not to save the modified buffer. In the future, whenever the type of machine of the instance is smaller or bigger, be sure to update this number to a size just below the available RAM.
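To pick a number "just below the available RAM" without looking up the machine type, you could derive it from the instance itself. This is a minimal sketch: `suggest_memory_mb` is a hypothetical helper, and the 1024 MB reserved for the operating system and VNC is an assumed safety margin, not a value from Screaming Frog.

```shell
#!/usr/bin/env bash
# Suggest a memory value (in MB) just below the RAM of the machine.
# The second argument is an assumed safety margin in MB (default 1024).
suggest_memory_mb() {
  local total_kb="$1"
  local reserve_mb="${2:-1024}"
  echo $(( total_kb / 1024 - reserve_mb ))
}

# On Linux, MemTotal in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
  total_kb="$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
  echo "Suggested value: $(suggest_memory_mb "$total_kb")"
fi
```

Run this on the instance and enter the suggested value in place of 512 in the configuration file.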

This is also a good time to start the Screaming Frog SEO Spider and change the configuration, e.g. User Agent, Speed, Crawl settings, etc. After you are done, be sure to set your current configuration to default.

At this point VNC and Screaming Frog SEO Spider are set up. To make sure you don’t have to repeat all of these steps each time you want to start another instance with Screaming Frog SEO Spider and VNC, let’s exit from our connection and save everything as a preconfigured backup image.

exit
exit

Backing up Your Instance

To back up your instance as a preconfigured image, we need to first deactivate the auto-deletion process for the disk and then delete the instance with the following commands:

gcloud compute instances set-disk-auto-delete <INSTANCE-NAME> --no-auto-delete --disk <DISK-NAME>

This command will prevent the main disk from being deleted when the instance is deleted. Replace <INSTANCE-NAME> and <DISK-NAME> with the correct values. Most of the time the <INSTANCE-NAME> and <DISK-NAME> are the same, and in this case both are "screaming-frog":

gcloud compute instances set-disk-auto-delete screaming-frog --no-auto-delete --disk screaming-frog

In the Google Cloud Console or using the following command on the command line, you can verify that the disk is indeed not set to automatic deletion.

gcloud compute instances describe screaming-frog

Example of checking the auto-deletion status of the main disk using Google Cloud SDK.

Now let's shut down and delete the instance with the following command:

gcloud compute instances delete screaming-frog

Now let's add the disk as a preconfigured image to the Images collection of the Google Cloud Project by executing the following command in the terminal on your local machine:

gcloud compute images create <IMAGE-NAME> --source-disk <DISK-NAME>

In this example, this translates into the following command:

gcloud compute images create screaming-frog-image --source-disk screaming-frog

Once this process has been completed, the back-up image is safely stored in the Google Cloud Project and will be accessible within the project the next time you create a new instance. This can be confirmed by checking if the image is listed when executing the following command:

gcloud compute images list

Listing available images using Google Cloud SDK.

If needed, the image can also be deleted again using the Google Cloud Console or by executing the following command:

gcloud compute images delete screaming-frog-image

Alternatively, to verify that the creation of the back-up image was successful, go to the Google Cloud Console, select the project, select Compute Engine, select Images on the left sidebar and here you should see the image “screaming-frog-image” in the list of available images.

Tip: When naming the preconfigured images you create, try including the date in the name, for example: screaming-frog-20160823. This will help if you need to create new images when upgrading the software.
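A dated image name like the one in the example can be generated instead of typed by hand. In this small sketch, `dated_image_name` is a hypothetical helper, and the final command is only printed so you can check it before running it.

```shell
#!/usr/bin/env bash
# Build an image name with today's date appended, e.g. screaming-frog-20160823.
dated_image_name() {
  local prefix="${1:-screaming-frog}"
  echo "${prefix}-$(date +%Y%m%d)"
}

IMAGE_NAME="$(dated_image_name)"
# Printed rather than executed, so the command can be reviewed first.
echo "gcloud compute images create ${IMAGE_NAME} --source-disk screaming-frog"
```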

At this point everything is configured and saved, so the current disk can be deleted by using the following command:

gcloud compute disks delete screaming-frog

This will delete the disk, avoiding any additional cost (except for storage of the preconfigured image) until the next time you need to use Screaming Frog SEO Spider on Google Compute Engine.


Using Your Preconfigured Instance

When you are ready to use Screaming Frog SEO Spider on Google Compute Engine, open the terminal on your computer and execute the following command:

gcloud compute instances create screaming-frog --machine-type n1-standard-8 --image screaming-frog-image

This command will start up a new instance, named "screaming-frog", with machine type n1-standard-8 (30GB RAM) and using the preconfigured image "screaming-frog-image".

Now every time you need a new instance of Screaming Frog up and running, just execute this command and note the external IP address assigned to the instance. Optionally, when selecting a larger machine type, SSH into the instance and update the Screaming Frog memory file with the following commands:

gcloud compute ssh screaming-frog
sudo -s
su vnc
pico /home/vnc/.screamingfrogseospider

And save the changes by pressing Ctrl-X, and answer Y(es) to the question whether or not to save the modified buffer.

Next, start up the VNC program, such as Remmina, and connect to the instance using the external IP address on port 5901, and the eight-character password you previously set.

Start Screaming Frog SEO Spider and start crawling…


Updating Your Preconfigured Instance

When you want to save a new preconfigured image with an updated Screaming Frog SEO Spider, just update the Debian system software and Screaming Frog through the command line by running the following commands:

gcloud compute ssh screaming-frog
sudo -s
apt-get update && apt-get upgrade
wget https://www.screamingfrog.co.uk/products/seo-spider/screamingfrogseospider_6.X_all.deb
dpkg -i screamingfrogseospider_6.X_all.deb
exit

Note: replace the name of the Screaming Frog SEO Spider package with the most up to date file name.

Now deactivate the auto-deletion process for the disk with the following command:

gcloud compute instances set-disk-auto-delete screaming-frog --no-auto-delete --disk screaming-frog

And delete the instance with the following command:

gcloud compute instances delete screaming-frog

Next add the disk as a preconfigured image to the Images collection of the Google Cloud Project by executing the following command in the terminal on your local machine:

gcloud compute images create screaming-frog-image --source-disk screaming-frog

Tip: When naming the preconfigured images you create, try including the date in the name, for example: screaming-frog-20160823.

And verify the new image has been saved with the following command:

gcloud compute images list

And delete the main disk with the following command:

gcloud compute disks delete screaming-frog

Going forward you can start up a new instance using the new image name.
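The sequence of commands in this section can be chained into one function. This is a sketch under a few assumptions: the helper names `run` and `save_image` and the DRY_RUN switch are made up for this example, and the boot disk is assumed to have the same name as the instance, as it does in this article. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/usr/bin/env bash
# The image workflow from this section as one function.
# DRY_RUN=1 (the default in this sketch) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

save_image() {
  local instance="$1" image="$2"
  # Assumes the boot disk has the same name as the instance, as in this article.
  run gcloud compute instances set-disk-auto-delete "$instance" --no-auto-delete --disk "$instance" &&
  run gcloud compute instances delete "$instance" &&
  run gcloud compute images create "$image" --source-disk "$instance" &&
  run gcloud compute images list &&
  run gcloud compute disks delete "$instance"
}

save_image screaming-frog screaming-frog-image
```

Because the steps are joined with &&, a failure at any step stops the chain, so the disk is not deleted unless the image was created.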


Adding Extra Disk Space

When you start using the instance, you will soon notice that the maximum space on the default instance is only ten gigabytes, of which approximately one gigabyte is used for the installation of the operating system, Screaming Frog, VNC and its dependencies. If you need more disk space, you can add it using a second persistent disk.

The simplest way of adding disk space to your Google Compute Engine instance is to go to the Google Cloud Console in your browser, select your project, then Compute Engine, then Disks, and click on the red “Create Disk” link at the top of the page. In the form that appears, select the same Zone as your instance (this is extremely important, as you will not be able to attach the disk to an instance running in a different zone) and select a blank disk as Source Type. For size, I recommend sticking to the default 500 gigabyte disk unless you already know in advance that you need more. You will be surprised how quickly a second disk fills up.

Creating a new persistent blank disk.

After you have created the disk, attach the disk to your instance with the following command in the terminal (replace <DISK-NAME> with the name chosen in the previous step, in this example "disk-1"):

gcloud compute instances attach-disk screaming-frog --disk <DISK-NAME>
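
If you prefer to stay in the terminal, creating and attaching the disk can be combined in one step. This is a rough sketch rather than the method used in this article: the function name create_and_attach_disk is made up for this example, and the zone in the usage line is only a placeholder that you need to replace with the zone of your own instance.

```shell
# create_and_attach_disk DISK_NAME ZONE [SIZE]
# Creates a blank persistent disk and attaches it to the screaming-frog
# instance. SIZE defaults to 500GB, the same default as in the Console form.
# (Hypothetical helper for illustration.)
create_and_attach_disk() {
    disk_name="$1"
    zone="$2"
    size="${3:-500GB}"

    # The disk must be created in the same zone as the instance.
    gcloud compute disks create "$disk_name" --size "$size" --zone "$zone" &&
    gcloud compute instances attach-disk screaming-frog --disk "$disk_name" --zone "$zone"
}

# Usage (the zone is an assumption, use the zone of your instance):
#   create_and_attach_disk disk-1 europe-west1-b
```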

After this you will see the disk attached to your instance (if not try rebooting the instance) and it can be accessed through the command line. Next log into your instance through SSH and switch to the root user by executing the following commands:

gcloud compute ssh screaming-frog

Switch to root user.

sudo -s

Now identify the disk designation by executing the following command:

fdisk -l

You will most likely see a message stating that the second disk (e.g. /dev/sdb) does not have a valid partition table.

Check the status of attached disks using fdisk.

We solve that by adding a partition table to the new disk with the following command:

fdisk /dev/sdb

When prompted, type ‘n’, choose ‘p’ and just press Enter for the defaults after this. When finished, type ‘w’ to write the new partition table to the disk and exit fdisk (typing ‘q’ would quit without saving your changes).

Create a new partition table on a disk using fdisk.

The next step is to format the new partition (e.g. /dev/sdb1) by executing the following command:

mkfs.ext3 /dev/sdb1

Now the disk can be mounted to the instance so it can be accessed through the command line and/or the file explorer in VNC and used to store data. Execute the following commands to mount the disk:

mkdir /mnt/disk1
mount /dev/sdb1 /mnt/disk1
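
To confirm that the mount worked, you can ask the system how much space is available at the mount point. A minimal sketch, assuming the /mnt/disk1 mount point used above; the function name show_disk_space is hypothetical.

```shell
# show_disk_space [PATH]
# Prints the size, used and available space of the filesystem that holds
# PATH in human-readable units. PATH defaults to /mnt/disk1.
# (Hypothetical helper for illustration.)
show_disk_space() {
    df -h "${1:-/mnt/disk1}"
}

# Usage:
#   show_disk_space
```

If the size reported is close to the size of the disk you created, the new disk is mounted. If it shows roughly ten gigabytes instead, you are still looking at the main disk of the instance.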

To make sure that all users can access the disk and the contents of the disk, the rights to the disk need to be updated by executing the following command:

chmod 777 /mnt/disk1

When adding additional files to the disk, you may need to keep updating the rights of your files so that other users (e.g. vnc) can also read and write to these files. This can be done by navigating to the relevant directory through the command line and executing the following command:

chmod 777 *

Note: It is generally considered a bad idea to apply “chmod 777” to files on any remote server. However, since our instance is meant to be temporary and is only accessible through VNC and SSH, I consider it to be a lower risk and it makes things easier. If the instance is meant to run for longer periods of time, I recommend exploring other options, such as using user groups instead.
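
As a rough idea of what the user group option could look like, here is a sketch that gives one group access to the disk instead of everybody. It is not something I set up for this article: the function name share_dir_with_group and the group name crawlers are hypothetical, and only the vnc user comes from the setup above.

```shell
# share_dir_with_group DIR GROUP
# Hands DIR and everything in it to GROUP, gives the group read and write
# access, and sets the setgid bit on directories so that files created
# later inherit the group. (Hypothetical helper for illustration.)
share_dir_with_group() {
    dir="$1"
    group="$2"

    chgrp -R "$group" "$dir" &&
    chmod -R g+rwX "$dir" &&
    find "$dir" -type d -exec chmod g+s {} +
}

# Usage, as the root user ("crawlers" is a made-up group name):
#   groupadd crawlers
#   usermod -aG crawlers vnc
#   share_dir_with_group /mnt/disk1 crawlers
```

Depending on the umask of the user creating new files, these may still be created without group write permission, in which case the function can simply be run again.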

Now the disk can be used to store data from Screaming Frog SEO Spider crawls and more.

Be aware that the disk operates independently of the instance. When the instance is deleted, the disk will not be deleted automatically. This makes it possible to preserve the disk (with all stored data) in the Google cloud and re-attach it in the future by mounting it on a new instance. Keep in mind that there are costs involved in keeping the additional disk.


Transferring Data Through SSH

Data can also be transferred between your local machine and the Google Compute Engine instance through SSH, using the copy-files command of the gcloud tool.

On your local machine you can upload data to the Google Compute Engine instance by executing the following command:

gcloud compute copy-files /home/user/local-file instance-name:/home/remote-user/remote-file

For example, to upload a zip file to the second disk attached to the screaming-frog instance, the following command works:

gcloud compute copy-files /home/fili/local.zip screaming-frog:/mnt/disk1/

Alternatively, data can also be downloaded from the instance through SSH by executing the following command:

gcloud compute copy-files instance-name:/mnt/disk1/remote-file /home/user/

Again, to download a zip file from the second disk attached to the screaming-frog instance to the local computer, the following command works:

gcloud compute copy-files screaming-frog:/mnt/disk1/remote.zip /home/fili/
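
In more recent releases of the Google Cloud SDK, copy-files has been deprecated in favour of gcloud compute scp, which takes the same source and destination arguments. If copy-files is not available in your version, the transfers above can be sketched as follows. The two function names are hypothetical wrappers for this example; only gcloud compute scp itself is part of the SDK.

```shell
# upload_to_instance LOCAL_PATH INSTANCE REMOTE_PATH
# Copies a local file to the instance over SSH.
# (Hypothetical wrapper around gcloud compute scp.)
upload_to_instance() {
    gcloud compute scp "$1" "$2:$3"
}

# download_from_instance INSTANCE REMOTE_PATH LOCAL_PATH
# Copies a file from the instance to the local machine over SSH.
# (Hypothetical wrapper around gcloud compute scp.)
download_from_instance() {
    gcloud compute scp "$1:$2" "$3"
}

# Usage, with the same paths as in the examples above:
#   upload_to_instance /home/fili/local.zip screaming-frog /mnt/disk1/
#   download_from_instance screaming-frog /mnt/disk1/remote.zip /home/fili/
```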

More information about the copy-files command of gcloud tool (and other commands) can be found here.


Storing Your Data in Google Cloud Storage

Downloading the data to your local computer using SSH has the advantage that you can work on it locally. However, if you delete your extra disk, you will need to upload the data again the next time you want to use it on an instance. The alternative solution is to store the data in Google Cloud Storage.

The benefits are that Google Cloud Storage is a lot cheaper than disk space on Google Compute Engine, that you can use the command line to initiate uploads and downloads, and that you can make resources publicly available if you wish. The bad news is that you cannot edit files in place once they are stored on Google Cloud Storage; you can only upload (and potentially overwrite or delete) a file or download it.

Make sure you are logged into your instance using SSH:

gcloud compute ssh screaming-frog

First authenticate and configure your Google Cloud Storage access with the following command:

gsutil config

Follow the instructions: open a new browser window with the provided URL, accept the permission request, then copy the authorisation code shown there and paste it back into the terminal. Then enter the Project ID, in this case screaming-frog-om, and hit Enter.

Next, create a new bucket in Google Cloud Storage with a unique name by executing the following command:

gsutil mb gs://<BUCKET-NAME>

Note: Replace <BUCKET-NAME> with a name that is unique across all Google Cloud Storage buckets. Be aware that you may need to be creative to find an available name.

Using the VNC connection, export the relevant reports in Screaming Frog and store the CSV files in a subdirectory. Now navigate to the subdirectory in the terminal with the data you want to upload. Next issue the following command to upload file(s) to Google Cloud Storage:

gsutil -m cp *.csv gs://<BUCKET-NAME>/

Note: in the example listed above we upload all CSV files in the current subdirectory to Google Cloud Storage.

If you need to download the data again on a new instance or to your local computer (make sure you have gsutil installed locally) issue the following command:

gsutil -m cp gs://<BUCKET-NAME>/* .

Be sure to check out the documentation to utilize gsutil to the max. To create subdirectories in your bucket or make files publicly available, check out the Google Cloud Storage interface.
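
Subdirectories can also be created from the command line, because Google Cloud Storage treats everything after the bucket name as part of the object name: uploading to gs://<BUCKET-NAME>/20160823/ makes the files show up in a 20160823 folder. The following is a sketch under that assumption; the function name upload_reports and the idea of one folder per date are my own for this example.

```shell
# upload_reports BUCKET [PREFIX]
# Uploads all CSV files in the current directory to gs://BUCKET/PREFIX/.
# PREFIX defaults to today's date, for example 20160823.
# (Hypothetical helper for illustration.)
upload_reports() {
    bucket="$1"
    prefix="${2:-$(date +%Y%m%d)}"

    # -m runs the copy operations in parallel.
    gsutil -m cp ./*.csv "gs://$bucket/$prefix/"
}

# Usage, from the directory that holds the exported reports:
#   upload_reports <BUCKET-NAME>
```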


Keeping Command Line Processes Running

The instance can also be used for processing large data files or running other processes on the command line. When your SSH connection to the instance drops, for example through loss of a network connection, the processes you are running in the command line are shut down and data may be lost. This can easily be solved by installing and using tmux.

First connect to the instance through SSH and install tmux by executing the following commands:

gcloud compute ssh screaming-frog
sudo -s
apt-get install tmux
exit

Now tmux can be started with the following command, which opens a new command line shell within the existing command line shell:

tmux

Now any command executed in the tmux shell will continue running when you are disconnected. When needed, leave the tmux window by pressing ‘Ctrl-B’ and then ‘d’ on your keyboard. This will detach the tmux shell from the shell window. You can get back into the tmux shell window, for example to see the status of your processes, by executing the following command:

tmux attach
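
When you run more than one job, it can help to give each tmux session a name. A small sketch, assuming a tmux version that supports the -A option of new-session; the function name crawl_session and the default session name crawl are made up for this example.

```shell
# crawl_session [NAME]
# Attaches to the tmux session called NAME if it already exists, and
# creates it first if it does not. NAME defaults to "crawl".
# (Hypothetical helper for illustration.)
crawl_session() {
    session="${1:-crawl}"
    tmux new-session -A -s "$session"
}

# Usage:
#   crawl_session            # open or re-open the "crawl" session
#   crawl_session logfiles   # a second, separate session
#   tmux ls                  # list all sessions that are still running
```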

More information about tmux can be found here and here and here.


Conclusion

Although getting started with Google Compute Engine may seem intimidating at first, I discovered a lot of benefits (speed and pure computing power) by using Google Compute Engine for processing large data files and crawling URLs. I highly recommend learning more and experimenting with Google Cloud services, especially as an SEO consultant.


Frequently Asked Questions & Troubleshooting

What is the maximum amount of URLs you can crawl with this setup?

There is still a limit but this limit is much higher than on a standard computer. For example, crawling one million URLs is not a problem. Having said that, Screaming Frog can still improve things and they are actively working on this. And when this goes live, maybe crawling one million URLs on Compute Engine can then become 10 million URLs or more…

After I set up billing, I type "gcloud compute instances list" in the terminal but it states API rate limit exceeded.

Check out this community post.

When I’m trying to update the memory for Screaming Frog at /home/vnc/.screamingfrogseospider I find that the document is blank.

Make sure that you start the Screaming Frog program at least once before trying to edit the memory settings. The most likely reason your memory settings file is blank is that the file does not exist yet, because the program has not been started yet.

I am trying to make a VNC connection to the instance from a Windows machine but I don't have Remmina. What should I use?

Try RealVNC.

How much does Google cloud cost? Is it an hourly fee?

At Google Cloud you pay per minute (after the minimum of the first 10 minutes). Be aware that as long as any instance is running, you are incurring cost and you are liable to pay this. However when debating which Cloud platform to utilize you are likely to find that most of the time Google Cloud is cheaper than most common alternatives such as AWS or Azure.

Can I repeat most of the instructions above on my own Virtual Private Server?

Yes, you can adapt the instructions above to your own VPS. The main reason why I went with Google Cloud is because I can expand it as much or as little as I want and I can save custom OS images in the cloud. Last but not least, the processing power is awesome and I have a chance to run my stuff on Google servers. It may seem like a complicated system at first, but once you use it for one thing like Screaming Frog you will soon use it for other things as well.

Does this set up work within a team environment?

Yes, if you share the VNC login and password (and whitelist the right IP addresses) you can share VNC access of the instance with the rest of your team. If you want to share instance management, be sure to set the right permissions in the Google Cloud Console IAM interface.

I am getting "Unable to locate package" error messages when I am trying to install software.

Run the following command as the root user:

apt-get update

I am trying to install this on Windows using CygWin. However I keep getting errors.

I will not be able to help you much as I don’t use Windows at all. As an alternative to CygWin, you can also try to install Ubuntu within your Windows installation (just like you would install any other program within Windows).

How many URI/s per second are you getting as mine seems to be 20 which seems quite low?

The crawl rate depends completely on the setup you are crawling. If you are crawling a list of URLs from different servers then you can go as high up as 200 threads per second (as I have done easily). However if you are crawling the same website, the server may not be able to respond to such a high thread rate and may slow down significantly. Be sure to play around with the speed configuration in Screaming Frog and be nice to the websites you crawl.

To ensure we don’t get charged when not in use, do we just turn off Google Cloud on the dashboard?

Yes, delete any instances or disks you set up. If you don’t have any instances running or storage space used (this includes Google Cloud Storage), you will not be charged for Google Cloud services.

Where do you check to see if there are any instances currently running?

You can verify if you have any instances running by executing the following command in the terminal:

gcloud compute instances list

Or by visiting the Google Cloud Console interface. For more information see the documentation.
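
Since disks, images and buckets are billed as well, it can be useful to check everything in one go. The following is only a sketch of such a check: the function name list_billable_resources is hypothetical, and it assumes gcloud and gsutil are installed and configured for your project.

```shell
# list_billable_resources
# Lists what is still present in the project: running instances,
# persistent disks, custom images and Cloud Storage buckets.
# (Hypothetical helper for illustration.)
list_billable_resources() {
    echo "Running instances:"
    gcloud compute instances list --filter="status=RUNNING"

    echo "Persistent disks:"
    gcloud compute disks list

    echo "Custom images:"
    gcloud compute images list --no-standard-images

    echo "Cloud Storage buckets:"
    gsutil ls
}

# Usage:
#   list_billable_resources
```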

I had some errors while setting locales because I was connecting via SSH from a Spanish computer

Modify the VM files /etc/default/locale and /etc/environment with pico as follows:

LANG="es_ES.UTF-8"
LC_ALL="es_ES.UTF-8"
LANGUAGE="es_ES"

Note: Update the language settings to your LOCALE if different than Spanish.

Help, I still have a problem with the instructions mentioned in this article...

Drop me a line and be specific/descriptive.


Final Note: A previous German version of this article was published in the 25th anniversary print edition of the German magazine Website Boosting on behalf of SearchBrothers.com. If you read German and are interested in SEO, I highly recommend you check out this magazine.
