Capturing Web-Pages With C

Have you ever wondered how Google generates those website “thumbnails” on its search page? We’re going to show you how.

The goal of this tutorial is to capture whole web-pages to a JPEG using Awesomium 1.6.2 and C.

To help speed things along, we’ll skip over some of the basics and start with the following boilerplate code:

#include <Awesomium/awesomium_capi.h>
#include <string.h>

#define URL "http://www.apple.com/ipodtouch/"

int main()
{
    // Create the WebCore with the default options
    awe_webcore_initialize_default();
	
    // Create a new WebView to load our page
    awe_webview* webView = awe_webcore_create_webview(1024,
                                                      768,
                                                      false);
	
    // Create our URL string
    awe_string* url_str = awe_string_create_from_ascii(URL,
                                                       strlen(URL));
	
    // Load the URL into our WebView instance
    awe_webview_load_url(webView,
                         url_str,
                         awe_string_empty(),
                         awe_string_empty(),
                         awe_string_empty());
	
    // Destroy our URL string
    awe_string_destroy(url_str);
	
    // Wait for WebView to finish loading the page
    while(awe_webview_is_loading_page(webView))
        awe_webcore_update();
	
    // Destroy our WebView instance
    awe_webview_destroy(webView);
	
    // Destroy our WebCore instance
    awe_webcore_shutdown();
 
    return 0;
}

That bit of code isn’t very exciting: it just creates a WebView, loads a URL into it, waits for it to finish loading, and then shuts it all down.

Rendering to an Image

You’ll notice that if you run that bit of code, you won’t see anything. That’s because Awesomium is a windowless web-page renderer: if you want to actually see a web-page on your screen, you’ll need to display it yourself. That’s a good thing, since it gives you the freedom to display it any way you want.

Let’s render our WebView to a pixel buffer (basically a bucket that will store our pixels in memory) and then save that buffer to a JPEG file:

	// Wait for WebView to finish loading the page
	while(awe_webview_is_loading_page(webView))
		awe_webcore_update();
	
	// Render our WebView to a buffer
	const awe_renderbuffer* buffer = awe_webview_render(webView);

	// Make sure our buffer is not NULL; awe_webview_render will
	// return NULL if the WebView process has crashed.
	if(buffer != NULL)
	{
		// Create our filename string
		awe_string* file_str =
			awe_string_create_from_ascii("./result.jpg",
			                             strlen("./result.jpg"));

		// Save our RenderBuffer directly to a JPEG image
		awe_renderbuffer_save_to_jpeg(buffer, file_str, 90);

		// Destroy our filename string
		awe_string_destroy(file_str);
	}

	// Destroy our WebView instance
	awe_webview_destroy(webView);

Now if you run the code, you’ll find the captured page saved as result.jpg in your working directory.

Hey, not bad for just a couple lines of code!

Optimizing the Code with Sleep

Most of the work done in Awesomium occurs in a separate child process and on other background threads, so our application would be more efficient (and use less CPU) if our main thread slept while waiting for the page to load.

Since the call to “sleep” is different on each platform, let’s add the following headers and macro to the top of our code:

#include <Awesomium/awesomium_capi.h>
#include <string.h>
#if defined(_WIN32)
#include <windows.h>
#else
#include <unistd.h>
#endif

#define URL "http://www.apple.com/ipodtouch/"
#define SLEEP_MS    50

int main()
{

Now let’s create a helper function, “updateCore()”, that sleeps for a bit every time we update the WebCore. Place this bit of code above your main function:

#define URL "http://www.apple.com/ipodtouch/"
#define SLEEP_MS    50

void updateCore()
{
    // Sleep a little bit so we don't consume too much CPU 
    // while waiting for the page to finish loading.
#if defined(_WIN32)
    Sleep(SLEEP_MS);
#else
    usleep(SLEEP_MS * 1000);
#endif
	
    // Update the WebCore.
    awe_webcore_update();
}

int main()
{

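A side note on the POSIX branch: usleep() was marked obsolescent in POSIX.1-2001 and removed in POSIX.1-2008, so stricter toolchains may not declare it. Here’s a minimal sketch of an alternative built on nanosleep(). The names “ms_to_timespec” and “sleep_ms” are our own inventions for this example and are not part of Awesomium.

```c
// Must come before any #include so that <time.h> declares nanosleep()
// when compiling in strict ISO C mode.
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <time.h>
#if defined(_WIN32)
#include <windows.h>
#endif

// Hypothetical helper: split a millisecond count into
// whole seconds plus leftover nanoseconds.
struct timespec ms_to_timespec(long ms)
{
    struct timespec ts;
    assert(ms >= 0);
    ts.tv_sec  = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    return ts;
}

// Hypothetical helper: sleep for roughly `ms` milliseconds.
// Returns 0 on success, -1 if the sleep failed or was interrupted.
int sleep_ms(long ms)
{
#if defined(_WIN32)
    Sleep((DWORD)ms);
    return 0;
#else
    struct timespec ts = ms_to_timespec(ms);
    return nanosleep(&ts, NULL);
#endif
}
```

With something like this in place, the body of updateCore() could call sleep_ms(SLEEP_MS) on every platform instead of branching on _WIN32 itself.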
Alright, let’s use our helper function in place of all calls to “awe_webcore_update”:

	// Destroy our URL string
	awe_string_destroy(url_str);
	
	// Wait for WebView to finish loading the page
	while(awe_webview_is_loading_page(webView))
		updateCore();
	
	// Render our WebView to a buffer
	const awe_renderbuffer* buffer = awe_webview_render(webView);

Very nice work, my friend! Now your code doesn’t hog the CPU while the page loads.
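One thing our wait loop doesn’t handle is a page that never finishes loading: the while loop would simply spin forever. Here’s a minimal sketch of a polling loop with a time budget. It is deliberately independent of Awesomium so that it compiles on its own; “wait_with_timeout”, the two callback types, and the “fake_page” stand-in are all hypothetical names made up for this example. In the real application, the pending check would wrap awe_webview_is_loading_page and the tick would call updateCore().

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical callback types: `pending` reports whether work is still
// outstanding, `tick` performs one sleep-and-update step.
typedef bool (*pending_fn)(void* ctx);
typedef void (*tick_fn)(void* ctx);

// Poll until pending() returns false or the time budget is used up.
// Elapsed time is approximated by counting steps of step_ms each.
// Returns true if the work finished, false if we gave up.
bool wait_with_timeout(pending_fn pending, tick_fn tick, void* ctx,
                       int step_ms, int timeout_ms)
{
    int waited_ms = 0;
    assert(step_ms > 0);

    while (pending(ctx)) {
        if (waited_ms >= timeout_ms)
            return false;
        tick(ctx);
        waited_ms += step_ms;
    }
    return true;
}

// Demo stand-in for a WebView: it "finishes loading" after a fixed
// number of ticks.
typedef struct {
    int ticks_left;
} fake_page;

bool fake_is_loading(void* ctx)
{
    return ((fake_page*)ctx)->ticks_left > 0;
}

void fake_tick(void* ctx)
{
    ((fake_page*)ctx)->ticks_left--;
}
```

A caller would check the return value and skip the render step (or log an error) when the budget runs out, rather than capturing a half-loaded page.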

Capturing the Whole Page

“But wait!” you ask. “That web-page is longer than that. You’re missing at least 60% of the page in your image. How do we make it render the ENTIRE page?”

Well, doing that is pretty straightforward. Let’s modify our code so that it renders the ENTIRE page. Intuitively, we know that we’ll want to resize our WebView so that it fits the whole page exactly.

Getting the Whole Page Size

To make this happen, we need to determine the scrollable dimensions of the web-page. We can use the following two API functions to figure this out:


// Request the scrollable page dimensions and scroll position.
// You'll need to bind a callback to retrieve the result later.
void awe_webview_request_scroll_data(awe_webview* webview,
                                     const awe_string* frame_name);

// Bind a callback to be notified of a response to
// awe_webview_request_scroll_data
void awe_webview_set_callback_get_scroll_data(
                            awe_webview* webview,
                            void (*callback)(awe_webview* caller,
                                             int contentWidth,
                                             int contentHeight,
                                             int preferredWidth,
                                             int scrollX,
                                             int scrollY));

We can put these API functions to good use in our application. Place the following code after the while loop that waits for the page to load:

	// Wait for WebView to finish loading the page
	while(awe_webview_is_loading_page(webView))
		updateCore();
	
	// Bind a callback function to handle the results
	awe_webview_set_callback_get_scroll_data(webView, 
											 onGetScrollData);
	
	// Request the scrollable dimensions for the main frame
	// of the web-page
	awe_webview_request_scroll_data(webView, awe_string_empty());

	// Wait for onGetScrollData callback to be called
	while(!gotPageDimensions)
		updateCore();
	
	// Render our WebView to a buffer
	const awe_renderbuffer* buffer = awe_webview_render(webView);

Now we just need to define our callback function “onGetScrollData” to handle the response. Let’s create it, along with a boolean “gotPageDimensions”, above the “updateCore()” function:

#include <Awesomium/awesomium_capi.h>
#include <string.h>
#if defined(_WIN32)
#include <windows.h>
#else
#include <unistd.h>
#endif

#define URL "http://www.apple.com/ipodtouch/"
#define SLEEP_MS    50

bool gotPageDimensions = false;

void onGetScrollData(awe_webview* caller,
					 int contentWidth,
					 int contentHeight,
					 int preferredWidth,
					 int scrollX,
					 int scrollY)
{
	// Use the page dimensions for something here
	
	gotPageDimensions = true;
}

void updateCore()
{
    // Sleep a little bit so we don't consume too much CPU 
    // while waiting for the page to finish loading.

Looking good!
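As one possible use for the values the callback receives, here’s a minimal sketch that records the reported size and rejects obviously bogus values. The “page_size” struct and the “storePageSize” helper are hypothetical names invented for this example, not part of the Awesomium API. The idea is that onGetScrollData could pass contentWidth and contentHeight to a helper like this and only set gotPageDimensions once the size has been accepted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical container for the page dimensions we care about.
typedef struct {
    int  width;
    int  height;
    bool valid;
} page_size;

// Record the reported content size, ignoring zero or negative values.
// Returns true if the dimensions were accepted and stored.
bool storePageSize(page_size* out, int contentWidth, int contentHeight)
{
    assert(out != NULL);

    if (contentWidth <= 0 || contentHeight <= 0)
        return false;

    out->width  = contentWidth;
    out->height = contentHeight;
    out->valid  = true;
    return true;
}
```

Keeping the size in a struct also means the main function can inspect it after the wait loop, for example to log the page dimensions before rendering.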

Resizing the WebView

Alright, so far we’ve got a way to get the dimensions of the page. Now we just need to use that information to resize our WebView. Let’s do that now inside our “onGetScrollData” callback:

void onGetScrollData(awe_webview* caller,
					 int contentWidth,
					 int contentHeight,
					 int preferredWidth,
					 int scrollX,
					 int scrollY)
{
	// Begin resizing the WebView to the width and height
	// of our content
	awe_webview_resize(caller, contentWidth, contentHeight, 
					   true, 1000);
	
	gotPageDimensions = true;
}

Like most of the API in Awesomium, the call to “resize” is asynchronous (it is not guaranteed to be completed immediately). We need to make sure that the WebView has finished resizing and repainting the web-page before we can render it. Add the following code before the render code:

	// Wait for onGetScrollData callback to be called
	while(!gotPageDimensions)
		updateCore();
	
	// Wait for the WebView to finish resizing
	while(awe_webview_is_resizing(webView))
		updateCore();
	
	// Render our WebView to a buffer
	const awe_renderbuffer* buffer = awe_webview_render(webView);

Awesome! Now when you run your code, result.jpg should contain a render of the whole page.

Getting Rid of Scrollbars

Sometimes you’ll find that a site always displays scrollbars (despite your best efforts at resizing the WebView). Let’s figure out how to get rid of them.

The WebCore is super-configurable: one of its options lets you define the global CSS for all WebViews. Let’s set some CSS that will make all our scrollbars invisible.

Add this macro to the top of your application:

#define SCROLLBAR_CSS   "::-webkit-scrollbar { width: 0px; height: 0px; } "

Now let’s replace our call to “awe_webcore_initialize_default” near the top of our main function with the following block of code:

int main()
{
	// Create our CSS string
	awe_string* custom_css_str = awe_string_create_from_ascii(
										  SCROLLBAR_CSS,
										  strlen(SCROLLBAR_CSS));
	
    // Create our WebCore singleton with our custom CSS
    awe_webcore_initialize(false, true, false, awe_string_empty(), 
						   awe_string_empty(), awe_string_empty(), 
						   AWE_LL_NORMAL, false, 
						   awe_string_empty(), true, 
						   awe_string_empty(), awe_string_empty(), 
						   awe_string_empty(), awe_string_empty(), 
						   awe_string_empty(), awe_string_empty(), 
						   false, 0, false, false, custom_css_str);
	
	// Destroy our CSS string
	awe_string_destroy(custom_css_str);
	
    // Create a new WebView to load our page
    awe_webview* webView = awe_webcore_create_webview(1024,
                                                      768,
                                                      false);

For more information about re-styling scrollbars with WebKit, please check out this article.

Rendering Super-Long Pages to Multiple Images

Some web-pages are just way too long to render all-at-once (due to RAM limitations). For those websites, it makes more sense to render the page to multiple images.
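To get a feel for the numbers, here’s a rough back-of-the-envelope sketch. It assumes the render buffer uses 4 bytes per pixel (32-bit color), which is an assumption on our part and worth checking against the Awesomium documentation. The function names and the idea of a fixed byte budget are hypothetical, made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Assumption: 32-bit color, i.e. 4 bytes for every pixel.
#define BYTES_PER_PIXEL 4u

// Estimate how many bytes a width x height pixel buffer would need.
size_t buffer_bytes(int width, int height)
{
    assert(width >= 0 && height >= 0);
    return (size_t)width * (size_t)height * BYTES_PER_PIXEL;
}

// Returns true if a buffer of the given size fits in budget_bytes.
bool fits_in_budget(int width, int height, size_t budget_bytes)
{
    return buffer_bytes(width, height) <= budget_bytes;
}
```

Under that assumption, our 1024x768 WebView needs about 3 MB, while a page that is 1024 pixels wide and 20,000 pixels tall needs roughly 82 MB for a single buffer.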

We don’t have enough time in this tutorial to go in-depth on this topic, but the basic idea is to render a section of the page, save it to an image, scroll down, and repeat.
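The bookkeeping behind that loop can be sketched without touching Awesomium at all. The snippet below only works out how many sections a page needs and where each one starts; “page_tile”, “tile_count”, and “tile_at” are hypothetical names invented for this example, and the actual scrolling, rendering, and saving are left out.

```c
#include <assert.h>

// Hypothetical description of one section of the page.
typedef struct {
    int scroll_y;  // vertical offset where this section starts
    int height;    // number of pixel rows in this section
} page_tile;

// Number of sections needed to cover content_height rows when each
// section is at most tile_height rows tall (rounds up).
int tile_count(int content_height, int tile_height)
{
    assert(tile_height > 0);
    if (content_height <= 0)
        return 0;
    return (content_height + tile_height - 1) / tile_height;
}

// Offset and height of the section at `index` (0-based).
// The last section may be shorter than tile_height.
page_tile tile_at(int index, int content_height, int tile_height)
{
    page_tile tile;
    int remaining;

    assert(index >= 0);
    assert(index < tile_count(content_height, tile_height));

    tile.scroll_y = index * tile_height;
    remaining = content_height - tile.scroll_y;
    tile.height = remaining < tile_height ? remaining : tile_height;
    return tile;
}
```

For a page 2500 pixels tall and a 768-pixel-tall WebView, this gives four sections, the last one only 196 rows tall. In practice a page usually can’t scroll past its bottom edge, so the final section may need to be cropped from the lower part of the last render.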

I’ve posted a full source-code example on how to do this here: https://gist.github.com/1112719/

Conclusion

You should now have a pretty good idea how to capture web-pages using the Awesomium C API.

For more information: http://support.awesomium.com

If you liked this tutorial or have any suggestions, let me know below!
