Fetch Page Module

The Fetch Page module is an excellent way to “scrape” data from web pages in Yahoo Pipes without knowing PHP or any other programming language. Just plug a URL into the Fetch Page module and it will attempt to scrape the page and display the results in the debugger pane. This module is extremely useful for grabbing ever-changing content from websites that don’t offer RSS feeds. Once your page is displayed in the debugger pane, you’ll still have some work to do to make it show up properly in Pipes.

Tip! To use the Fetch Page module successfully, you’ll need an understanding of HTML so you know how to parse the data.

Tip! The Fetch Page module only works with HTML pages under 200k, and only if the website allows robots to crawl it.
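If you’d like to check those two conditions yourself before wiring up a Pipe, here’s a minimal Python sketch. It isn’t part of Pipes: the function names, the `ExampleBot` user agent, the sample robots.txt text, and reading “200k” as 200 * 1024 bytes are all my own assumptions.

```python
from urllib.robotparser import RobotFileParser

# Assumption: "200k" is read here as 200 * 1024 bytes.
MAX_PAGE_BYTES = 200 * 1024


def robots_allow(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def small_enough(page_bytes, limit=MAX_PAGE_BYTES):
    """Return True if the page body is under the size limit."""
    return len(page_bytes) < limit


# Made-up robots.txt content, for illustration only.
sample_robots = "User-agent: *\nDisallow: /private/\n"

print(robots_allow(sample_robots, "ExampleBot", "http://example.com/tickets/"))
print(robots_allow(sample_robots, "ExampleBot", "http://example.com/private/x"))
print(small_enough(b"<html></html>"))
```

In real use you would feed in the site’s actual robots.txt and the downloaded page body instead of the sample strings.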

First, enter the URL of the page you want to fetch. If that’s all you enter, Yahoo Pipes will return the whole page. From there you can “cut” the content between two points to display only a certain part of the page, and/or split the content into separate items using a delimiter. If your data is in table form, you could use a row tag such as “<tr>” as the delimiter to break it up and place each row of table data in a new element.
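To make the two options concrete, here’s a rough Python sketch of what “cut” and “split” amount to on a plain string. It’s an illustration only, not how Pipes is implemented: the function names `cut_content` and `split_items`, the made-up page, and the choice to keep the start marker in the result are all assumptions.

```python
def cut_content(page, start, end):
    """Return the text from the first `start` marker up to (not including) `end`."""
    begin = page.find(start)
    if begin == -1:
        raise ValueError("start marker not found: " + repr(start))
    finish = page.find(end, begin + len(start))
    if finish == -1:
        raise ValueError("end marker not found: " + repr(end))
    return page[begin:finish]


def split_items(content, delimiter):
    """Split on `delimiter` and drop pieces that are empty after trimming."""
    return [piece.strip() for piece in content.split(delimiter) if piece.strip()]


# Made-up page, for illustration only.
page = (
    "<html><body><h1>Prices</h1>"
    "<table><tr><td>Row A</td></tr><tr><td>Row B</td></tr></table>"
    "<p>Footer</p></body></html>"
)

table = cut_content(page, "<table>", "</table>")
rows = split_items(table, "<tr>")
for row in rows:
    print(row)
```

In this sketch the first piece is just the leftover start marker (“<table>”), followed by one piece per table row. Leftovers like that are normal and can be cleaned up afterwards.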

In this example, let’s take a look at Stubhub ticket prices for the men’s Olympic gold medal hockey game at Canada Hockey Place in Vancouver, BC.

http://www.stubhub.com/winter-games-ice-hockey-tickets/winter-games-ice-hockey-2-28-2010-736672/?isGen2Event=1

Here’s how the original page looks:

[Screenshot: the original Stubhub page]

Yup, just a little expensive!

Now let’s plug it into our Fetch Page module. For now, we’ll just enter the URL without cutting or splitting the content.

[Screenshot: the Fetch Page module with only the URL entered]

And here’s how it looks in the Pipes debugger when we drill down the results:

[Screenshot: the results drilled down in the Pipes debugger]

As you can see, the page loses all of its formatting. This is completely normal. Pages will look different in the Pipes debugger.

Now let’s take a look at the HTML source code of the page in the debugger. (Note that code other than HTML will not load.)

To view the source code, just click on “source” in the debugger pane.

[Screenshot: the “source” link in the debugger pane]

The source code is what you’ll really want to focus on with the Fetch Page module. This is what the first part of it looks like:

[Screenshot: the first part of the page source in the debugger]

As of right now, if you were to view the output of this Pipe, it wouldn’t load properly because titles, descriptions and/or links haven’t been assigned. Depending on the results you’re looking for, you’ll need to combine the Fetch Page module with other modules to mash up your data. For now, we just wanted to look at how Pipes interprets an HTML page.

For our second example, let’s learn how to “cut the content” from our page.

Because we’re only concerned about the ticket prices for the gold medal hockey game, we don’t care about the other elements of the page. All we want is the ticket listings.

[Screenshot: the ticket listings section of the page]

We want to find the “beginning” of the part we want to cut, and the “end”. In this case, the “beginning” would be around “16 ticket listings found…” and the end would be “…Page 1 of 1”.

OK, time to start thinking like a programmer.

I can’t stress enough how important it is to be familiar with the page you want to work on. In this particular case, I know there’s a lot of “dynamic” content (content that’s always changing) on Stubhub: the number of ticket listings, the prices, sections and number of tickets, and the number of pages of listings will all change. It’s very important to know this before I start cutting content. If I tell the Fetch Page module to cut our content starting with “16 ticket listings”, it will work for now. But as soon as the number of listings increases or decreases, the page will no longer contain “16 ticket listings”, the module will keep looking for a string that isn’t there, and the result will be an error.

The same goes for the ending content we want to cut, “Page 1 of 1”. If more tickets get added, a new page will be added and the text will read “Page 1 of 2”. So if we set that as our ending content, it will work for now, but as soon as that text changes to “Page 1 of 2”, it will no longer work.

I do, however, know that the words “ticket listings found” will not change on the page; only the number in front of them will. So this is a good string to use for our “beginning cut content”. The page could say “20 ticket listings found” or “100 ticket listings found” and our string will still work.

I also know that the “Page 1 of” part of our ending cut content will always stay the same. The page could say “Page 1 of 5” or “Page 1 of 1”; either way it contains “Page 1 of”, so our string will still work.
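Here’s a small Python sketch of that reasoning. It compares markers that include the changing numbers against markers that leave them out. The two page snapshots are made up, and `cut_content` is my own stand-in for what the module does, not Pipes code.

```python
def cut_content(page, start, end):
    """Return text from `start` up to `end`, or None if either marker is missing."""
    begin = page.find(start)
    if begin == -1:
        return None
    finish = page.find(end, begin + len(start))
    if finish == -1:
        return None
    return page[begin:finish]


# Two made-up snapshots of the same page, taken at different times.
today = "Header 16 ticket listings found | A | B | Page 1 of 1 Footer"
later = "Header 23 ticket listings found | A | B | C | Page 1 of 2 Footer"

# Brittle markers: they contain numbers that change.
print(cut_content(today, "16 ticket listings", "Page 1 of 1") is not None)  # True
print(cut_content(later, "16 ticket listings", "Page 1 of 1") is not None)  # False

# Stable markers: only the text that stays the same.
print(cut_content(today, "ticket listings found", "Page 1 of"))
print(cut_content(later, "ticket listings found", "Page 1 of"))
```

The brittle markers only match the first snapshot, while the stable ones keep working on both.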

[Screenshot: the Fetch Page module with the cut content fields filled in]

Tip! The “cut content from” and “split using delimiter” attributes are case sensitive.
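Python’s substring search is case sensitive too, so it makes a handy stand-in for seeing what this tip means in practice. The `has_marker` helper and the sample text are made up for illustration.

```python
def has_marker(page, marker):
    """Case-sensitive check: True only if `marker` appears exactly as typed."""
    return page.find(marker) != -1


page = "16 ticket listings found ... Page 1 of 1"

print(has_marker(page, "ticket listings found"))  # True
print(has_marker(page, "Ticket Listings Found"))  # False
```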

Let’s take a look at our newly cut content:

[Screenshot: the newly cut content in the debugger]

Now let’s take a look at the “split using delimiter” attribute. We want to split up the ticket listings so each one is its own item. To do so, we need to find a common string that separates each listing.

[Screenshot]

Yikes. That looks a little overwhelming, but it’s really not. We’re going to clean it up and assign each ticket listing to its own item.

In “HTML” view, I can see that the last thing on each line is “View Details”. But I prefer to take the delimiter from the source view because it’s more technically correct. In the source view, “View Details” appears together with its surrounding HTML markup.
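To see why the source view can give a cleaner delimiter, here’s a small Python sketch. The markup is invented for illustration (I’m assuming the “View Details” text sits inside a link that ends with `</a>`), and `split_items` is my own helper, not something from Pipes.

```python
def split_items(content, delimiter):
    """Split on `delimiter` and drop pieces that are blank."""
    return [piece for piece in content.split(delimiter) if piece.strip()]


# Made-up source for two listings; the real page markup will differ.
source = (
    '<td>Section 101</td><td>2 tickets</td><a href="/listing/1">View Details</a>'
    '<td>Section 305</td><td>4 tickets</td><a href="/listing/2">View Details</a>'
)

# Delimiter taken from the rendered text only.
text_only = split_items(source, "View Details")

# Delimiter taken from the source, including the closing tag.
with_tag = split_items(source, "View Details</a>")

print(len(text_only))  # 3
print(len(with_tag))   # 2
print(text_only[1][:4])
print(with_tag[1][:4])
```

With the text-only delimiter, a stray closing tag gets pushed to the front of the next item and a third, useless item appears at the end. Including the tag in the delimiter gives exactly one item per listing.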

[Screenshot]

Let’s take a look at how our content looks in HTML view after we split it up using our delimiter (results drilled down):

[Screenshot: the split results in HTML view, drilled down]

And in source view:

[Screenshot: the split results in source view]

Notice the URLs for each ticket listing hiding in there? That might be a good place to add an affiliate link if you are a part of the Stubhub affiliate program. ;)

Much better! There’s still work to be done to assign the data to titles, descriptions and links, but these jobs are for other modules.
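As a rough idea of what that later step involves, here’s a Python sketch that turns one split item into a title and a link. Pipes would do this with its own modules, so treat it purely as an illustration. The `ItemParser` class, the `to_item` helper, and the sample fragment are all made up.

```python
from html.parser import HTMLParser


class ItemParser(HTMLParser):
    """Collect the first link and all visible text from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.link = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Keep only the first href we come across.
        if tag == "a" and self.link is None:
            self.link = dict(attrs).get("href")

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def to_item(fragment):
    """Build a simple item with a title and a link from one split fragment."""
    parser = ItemParser()
    parser.feed(fragment)
    parser.close()
    return {"title": " ".join(parser.text_parts), "link": parser.link}


# Made-up fragment, shaped like one item after splitting.
fragment = '<td>Section 101</td><td>2 tickets</td><a href="/listing/1">'
print(to_item(fragment))
```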

The first item in our results includes some unwanted info (“ticket listings found|Help”), but we can get rid of that using the Regex Module.
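For a feel of the kind of pattern involved, here’s a minimal Python sketch using the `re` module. The sample item text is made up, and the pattern is my own guess at what would match the leftover, so adjust it to whatever your first item really contains.

```python
import re

# Matches the leftover text at the very start of an item, allowing for
# optional whitespace around the "|" separator.
LEFTOVER = re.compile(r"^\s*ticket listings found\s*\|\s*Help\s*")


def strip_leftover(item):
    """Remove the leftover marker text from the start of an item, if present."""
    return LEFTOVER.sub("", item, count=1)


# Made-up items, for illustration only.
first_item = "ticket listings found|Help Section 101 2 tickets"
other_item = "Section 305 4 tickets"

print(strip_leftover(first_item))  # Section 101 2 tickets
print(strip_leftover(other_item))  # Section 305 4 tickets
```

Items that don’t start with the leftover text pass through unchanged, so the same rule can safely be applied to every item.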

The results will be different for every page you use. I chose Stubhub because I’m familiar with the layout and I know what content will change and what will stay the same. If I weren’t, I would have no idea what content to keep and what to cut out. When you use the Fetch Page module, I suggest starting with a page you know very well and playing around with the “cut content from” and “split using delimiter” options to see what results you get.
