Wednesday, 29 May 2013

Scraping Yahoo! Search with Web::Scraper


Scraping websites is usually pretty boring and annoying work, but for some reason the need for it keeps coming back. Tatsuhiko Miyagawa comes to the rescue! His Web::Scraper makes scraping the web easy and fast.

Since the documentation is scarce (there are the POD and the slides of a presentation I missed), I'll post this blog entry in which I'll show how to effectively scrape Yahoo! Search.

First we'll define what we want to see. We're going to run a query for 'Perl'. From that page, we want to fetch the following things:

    title (the linked text)
    url (the actual link)
    description (the text beneath the link)

So let's start our first little script:

use strict;
use warnings;
use Data::Dumper;
use URI;
use Web::Scraper;

my $yahoo = scraper {
   process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
   process "div.yschabstr", 'description' => "TEXT";

   result 'description', 'title', 'url';
};

print Dumper $yahoo->scrape(URI->new("http://search.yahoo.com/search?p=Perl"));

Now what happens here? The important stuff can be found in the process statements. Basically, you may translate those lines to "Fetch an A element with the CSS class named 'yschttl', put its text in 'title' and its href value in 'url'. Then fetch the text of the div with the class named 'yschabstr' and put that in 'description'."

The result looks something like this:

$VAR1 = {
          'url' => 'http://www.perl.com/',
          'title' => 'Perl.com',
          'description' => 'Central resource for Perl developers. It contains
 the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited
by Clay Irving.'
        };

Fun and a good start, but hey, do we really get only one result for a query on 'Perl'? No way! We need a loop!

The slides tell you to append '[]' to the key, to enable looping. The process lines then look like this:

   process "a.yschttl", 'title[]' => 'TEXT', 'url[]' => '@href';
   process "div.yschabstr", 'description[]' => "TEXT";

And when we run it now, the result looks like this:

$VAR1 = {
          'url' => [
                     'http://www.perl.com/',
                     'http://www.perl.org/',
                     'http://www.perl.com/download.csp',
                   ...
                   ],
          'title' => [
                       'Perl.com',
                       'Perl Mongers',
                       'Getting Perl',
                     ...
                     ],
          'description' => [
                             'Central resource for Perl developers. It contains
the Perl Language, edited by Tom Christiansen, and the Perl Reference, edited by
 Clay Irving.',
                             'Nonprofit organization, established to support the
 Perl community.',
                             'Instructions on downloading a Perl interpreter for
 your computer platform. ... On CPAN, you will find Perl source in the /src
directory. ...',
                           ...
                           ]
        };

That looks a lot better! We now get all the search results and could loop through the different arrays to get the right title with the right url. But still we shouldn't be satisfied, for we don't want three arrays, we want one array of hashes! For that we need a little trickery: we need another process line! All the stuff we grab is already located in a big ordered list (the OL element), so let's find that one first, and for each list element (LI) find our title, url and description. For this we won't use the CSS selectors, but we'll go for the XPath selectors (heck, we can do both, so why not?).

To grab an XPath, I really suggest Firebug, a Firefox add-on. With its easy point-and-click interface, you can grab the path within seconds.

use strict;
use warnings;
use Data::Dumper;
use URI;
use Web::Scraper;

my $yahoo = scraper {
   process "/html/body/div[5]/div/div/div[2]/ol/li", 'results[]' => scraper {
      process "a.yschttl", 'title' => 'TEXT', 'url' => '@href';
      process "div.yschabstr", 'description' => "TEXT";

      result 'description', 'title', 'url';
   };
   result 'results';
};

print Dumper $yahoo->scrape( URI->new("http://search.yahoo.com/search?p=Perl") );

You see that we switched our title, url and description fields back to the old notation (without []), for we don't want to loop those fields. We've moved the looping a step higher, namely to the li elements. Then we open another scraper which will dump the hashes into the results array (note the '[]' in 'results[]').

The result is exactly what we wanted:

$VAR1 = [
          {
            'url' => 'http://www.perl.com/',
            'title' => 'Perl.com',
            'description' => 'Central resource for Perl developers. It
contains the Perl Language, edited by Tom Christiansen, and the Perl Reference,
edited by Clay Irving.'
          },
          {
            'url' => 'http://www.perl.org/',
            'title' => 'Perl Mongers',
            'description' => 'Nonprofit organization, established to support
the Perl community.'
          },
          {
            'url' => 'http://www.perl.com/download.csp',
            'title' => 'Getting Perl',
            'description' => 'Instructions on downloading a Perl interpreter
for your computer platform. ... On CPAN, you will find Perl source in the /src
directory. ...'
          },
...
        ];

Again Tatsuhiko impresses me with a Perl module. Well done! Very well done!


Source: http://menno.b10m.net/blog/blosxom.cgi/perl/scraping-yahoo-search-with-web-scraper.html

Sunday, 26 May 2013

Data Mining And Importance to Achieve Competitive Edge in Business

What is data mining? And why is it so important in business? These are simple yet complicated questions to answer; below is a brief overview to help you understand data and web mining services.

In general terms, data mining can be described as retrieving useful information or knowledge from various sources, analyzing it from different perspectives, and summarizing it into valuable information that can be used to increase revenue, cut costs, or gather competitive intelligence on a business or product. Data extraction is of great importance in the business world because it helps a business harness the power of accurate information, providing a competitive edge. Many business firms and companies have their own data warehouses to help them collect, organize and mine information such as transactional data, purchase data and so on.

However, maintaining mining services and a warehouse on premises is neither affordable nor very cost-effective, even though extracting information is a need for every business these days. Fortunately, many companies provide accurate and effective data and web mining solutions at a reasonable price.

Outsourced data extraction services are offered at affordable rates and are available for a wide range of data mining solutions:

• Extracting business data
• Gathering data sets
• Digging information out of data sets
• Website data mining
• Stock market information
• Statistical information
• Information classification
• Information regression
• Structured data analysis
• Online data mining to gather product details
• Gathering prices
• Gathering product specifications
• Gathering images

Outsourcing web mining and data gathering solutions has proven effective for cutting costs and increasing productivity at affordable rates. Benefits of data mining services include:

• Clear understanding of customers, services or products
• Lower marketing costs
• Exact information on sales and transactions
• Detection of beneficial patterns
• Minimized risk and increased ROI
• Detection of new markets
• Clear understanding of business problems and goals

Accurate data mining solutions could prove to be an effective way to cut costs by concentrating effort in the right place.

Source: http://ezinearticles.com/?Data-Mining-And-Importance-to-Achieve-Competitive-Edge-in-Business&id=5771888

Saturday, 18 May 2013

Scrape Yahoo Finance

#!/bin/sh
#
# scrape_yahoo.sh
#
# This script pulls data from yahoo using a NYSE index list generated
# from scrape_nyse.sh.  It iterates through the list, saving data from each
# security in one big log and seperate per-security log files.  This stuff
# should go into a database, sooner or later.

DEBUG=0
PATH=$PATH:/usr/local/bin
BASEDIR=/home/stockh/stockharmony.com/api/scripts/
SYMBOLS_FILE=${BASEDIR}../data/nyse_index_symbols.txt
LAST_SYMBOL=`tail -1 $SYMBOLS_FILE`
# the SYMBOL string used in a request to yahoo
SYMBOLS=''
# took out "6t" (the URL)
ARG_STRING=`tr -d '\n' < ${BASEDIR}../data/yahoo_arg_string_custom.txt`
YAHOO_URL='http://finance.yahoo.com/d/quotes.csv?s='
COUNT=0
# how many symbols should we query in one request?
GET_SYMBOLS=15
SLEEP_TIME=60
BEGIN_TIME=`date +%Y%m%d%H%M%S | tr -d '\n'`
DATA_FILE=${BASEDIR}../data/yahoo-finance-${BEGIN_TIME}.csv
TMP_FILE=/tmp/yahoo-data
LOGFILE=${BASEDIR}../logs/scraping
SECS_DIR=${BASEDIR}../data/securities/
GET_YAHOO=0
SELF=`basename $0`
PIDFILE=${BASEDIR}../logs/${SELF}.pid
ERROR_STRING='default error string'

# define functions first, put in include file later

# send sms msg, only once
send_sms_msg () {
if [ "$1" ]; then STRING=$1; fi
if [ $SMS_SENT ]; then
return 0
else
echo $STRING | mailx -s 'stockh error' 1234567789@cingularme.com
SMS_SENT=1
fi
}

log_error () {
# capture the caller's exit status before the test below clobbers $?
RC=$?
if [ "$1" ]; then
STRING="OOPS: return code $RC because $1"
else
STRING="OOPS: return code $RC"
fi
date >> $LOGFILE
echo $STRING >> $LOGFILE
}

log_normal () {
if [ "$1" ]; then
STRING="OK: $1"
else
STRING="OK: seems ok"
fi
date >> $LOGFILE
echo $STRING >> $LOGFILE
}

save_symbol_data () {
# save once in main file
cat $TMP_FILE >> $DATA_FILE 2>> $LOGFILE
# grep for each symbol and save in separate files
for j in $SYMBOLS; do
echo -n ${GOT_WHEN}, >> ${SECS_DIR}$j.csv
match=","$j","
grep -i $match $TMP_FILE >> ${SECS_DIR}$j.csv 2>> $LOGFILE
done
}

# Check whether any symbols have changed in TMP_FILE; fetch that data too and append to TMP_FILE.
fetch_changed_symbol_data () {

# pull the new symbol out of the "Ticker symbol has changed to:" notice
TMP=`grep '"Ticker symbol has changed to:' $TMP_FILE | sed 's|.*changed to: <a href="/q?s=\(.*\)">.*|\1|'`
TMP=`echo $TMP | tr -d '\n'`
# strip leading whitespace
TMP=`echo $TMP | sed 's/^[[:space:]]*//'`
NEW=$TMP
if [ "$NEW" ]; then
log_normal "got changed symbols: $NEW"
# save new syms to global var $SYMBOLS
SYMBOLS="${SYMBOLS} ${NEW}"
TMP=`echo $NEW | tr ' ' '+'`
URL="${YAHOO_URL}${TMP}&f=$ARG_STRING"
# append data to tmp file
lynx -dump $URL >> $TMP_FILE 2>> $LOGFILE
# report errors
if [ $? = 0 ]; then
if [ $DEBUG ]; then log_normal "$URL" ; fi
else
log_error "lynx failed getting changed symbols $URL"
send_sms_msg "lynx failed getting $NEW"
fi
fi

}

# take a space-separated list of symbols, query yahoo and save to TMP_FILE
fetch_and_save_symbol_data () {

# replace space with + for URL
TMP=`echo $SYMBOLS | tr ' ' '+'`
URL="${YAHOO_URL}${TMP}&f=$ARG_STRING"
# clobber TMP_FILE with new data
lynx -dump $URL > $TMP_FILE 2>> $LOGFILE
if [ $? = 0 ]; then
if [ $DEBUG ]; then log_normal "$URL" ; fi
GOT_WHEN=`date +%Y%m%d%H%M%S | tr -d '\n'`
fetch_changed_symbol_data
# regardless of fetch_changed_symbol_data always save symbol data at
# this point
save_symbol_data
else
log_error "lynx failed getting $URL"
send_sms_msg "lynx failed on $TMP"
fi
}

if [ -f $PIDFILE ]; then
send_sms_msg "$SELF exiting, PID exists"
log_error "$SELF exiting, PID exists"
exit 1
fi

echo $$ > $PIDFILE 2>> $LOGFILE

# save $COUNT amount of symbols in $SYMBOLS then call functions
for i in `cat $SYMBOLS_FILE`; do

COUNT=`expr $COUNT + 1`

if [ "$SYMBOLS" ]; then
SYMBOLS="$SYMBOLS $i"
else
SYMBOLS=$i
fi

if [ $COUNT = $GET_SYMBOLS ]; then
GET_YAHOO=1
elif [ $i = $LAST_SYMBOL ]; then
GET_YAHOO=1
fi

if [ $GET_YAHOO = 1 ]; then
if [ $DEBUG ]; then log_normal "SYMBOLS are $SYMBOLS"; fi
fetch_and_save_symbol_data
sleep $SLEEP_TIME
SYMBOLS=''
COUNT=0
GET_YAHOO=0;
fi

done

echo $SELF started at $BEGIN_TIME >> $LOGFILE  2>&1
echo $SELF finished on `date` >> $LOGFILE 2>&1
wc -l $DATA_FILE >> $LOGFILE 2>&1
wc -l $SYMBOLS_FILE >> $LOGFILE 2>&1
rm -f $PIDFILE >> $LOGFILE 2>&1

RC=$?
if [ $RC = 0 ]; then
exit 0
else
send_sms_msg "could not remove PIDFILE. rm returned $RC"
exit 1
fi

Source: http://www.snippetsmania.com/scrape-yahoo-finance/

Wednesday, 15 May 2013

Yahoo! Finance - Business Finance, Stock Market, Quotes, News

At Yahoo! Finance, you get free stock quotes, up to date news, portfolio management resources, international market data, message boards, and mortgage rates that help you manage your financial life.

Quotes are real-time for NASDAQ, NYSE, and NYSEAmex when available. See also delay times for other exchanges. Quotes and other information supplied by independent providers identified on the Yahoo! Finance partner page. Quotes are updated automatically, but will be turned off after 25 minutes of inactivity. Quotes are delayed at least 15 minutes. All information provided "as is" for informational purposes only, not intended for trading purposes or advice. Neither Yahoo! nor any of its independent providers is liable for any informational errors, incompleteness, or delays, or for any actions taken in reliance on information contained herein. By accessing the Yahoo! site, you agree not to redistribute the information found therein.

Python Beautiful Soup Example: Yahoo Finance Scraper

Python offers a lot of powerful and easy to use tools for scraping websites. One of Python’s useful modules to scrape websites is known as Beautiful Soup.

In this example we’ll provide you with a Beautiful Soup example, known as a ‘web scraper’. This will get data from a Yahoo Finance page about stock options. It’s alright if you don’t know anything about stock options; the most important thing is that the website has a table of information we’d like to use in our program. Below is a listing for Apple Computer stock options.

This code retrieves the Yahoo Finance HTML and returns a file-like object.
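The retrieval code itself is missing from this copy of the article, but it was likely a short urllib call along these lines. This is a sketch: the URL pattern and the function and variable names are assumptions, not taken from the original.

```python
from urllib.request import urlopen

# Hypothetical address of the AAPL options page; the exact URL is an assumption.
url = 'http://finance.yahoo.com/q/op?s=AAPL'

def fetch_options_page(url):
    """Return the file-like HTTP response object for the given URL."""
    return urlopen(url)

# optionsPage = fetch_options_page(url)   # network call, left commented out here
# html = optionsPage.read()               # .read() yields the raw HTML
```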

If you go to the page we opened with Python and use your browser’s “view source” command, you’ll see that it’s a large, complicated HTML file. It will be Python’s job to simplify and extract the useful data using the BeautifulSoup module. BeautifulSoup is an external module, so you’ll have to install it. If you haven’t installed BeautifulSoup already, you can get it here.

Beautiful Soup Example: Loading a Page

The following code will load the page into BeautifulSoup:
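The loading snippet is also missing here, so this is a minimal sketch. The inline HTML fragment is a made-up stand-in for the fetched page, shaped to match the table columns the article describes below; Yahoo's real markup was far larger.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row options table standing in for optionsPage.read():
# the first cell carries class="yfnc_h" plus nowrap, the other six only the class.
html = (
    '<table><tr>'
    '<td class="yfnc_h" nowrap="nowrap">AAPL130328C00350000</td>'
    '<td class="yfnc_h">349.51</td><td class="yfnc_h">12.87</td>'
    '<td class="yfnc_h">350.30</td><td class="yfnc_h">352.50</td>'
    '<td class="yfnc_h">97</td><td class="yfnc_h">82</td>'
    '</tr></table>'
)

# parse the HTML into a navigable tree
soup = BeautifulSoup(html, 'html.parser')
```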

Beautiful Soup Example: Searching

Now we can start trying to extract information from the page source (HTML). We can see that the options have pretty unique-looking names in the “symbol” column, something like AAPL130328C00350000. The symbols might be slightly different by the time you read this, but we can solve the problem by using BeautifulSoup to search the document for this unique string.

Let’s search the soup variable for this particular option (you may have to substitute a different symbol, just get one from the webpage):
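The search snippet is missing from this copy; against a hypothetical one-row fragment (an assumption, not Yahoo's real markup) it is a one-line find:

```python
from bs4 import BeautifulSoup

# made-up stand-in for the real page
html = (
    '<table><tr>'
    '<td class="yfnc_h" nowrap="nowrap">AAPL130328C00350000</td>'
    '<td class="yfnc_h">349.51</td><td class="yfnc_h">12.87</td>'
    '<td class="yfnc_h">350.30</td><td class="yfnc_h">352.50</td>'
    '<td class="yfnc_h">97</td><td class="yfnc_h">82</td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')

# exact-match search for the option symbol's text node
symbol = soup.find(text='AAPL130328C00350000')
print(symbol)  # -> AAPL130328C00350000
```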

This result isn’t very useful yet. It’s just a unicode string (that’s what the ‘u’ means) of what we searched for. However, BeautifulSoup returns things in a tree format, so we can find the context in which this text occurs by asking for its parent node like so:
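A sketch of that step, again against the hypothetical fragment (the sample HTML is an assumption):

```python
from bs4 import BeautifulSoup

html = (
    '<table><tr>'
    '<td class="yfnc_h" nowrap="nowrap">AAPL130328C00350000</td>'
    '<td class="yfnc_h">349.51</td><td class="yfnc_h">12.87</td>'
    '<td class="yfnc_h">350.30</td><td class="yfnc_h">352.50</td>'
    '<td class="yfnc_h">97</td><td class="yfnc_h">82</td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')

symbol = soup.find(text='AAPL130328C00350000')
cell = symbol.parent   # the enclosing <td> -- one cell, not the whole row
```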

We don’t see all the information from the table. Let’s try the next level higher.
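Going one more level up the tree reaches the table row. Sketched against the same hypothetical fragment:

```python
from bs4 import BeautifulSoup

html = (
    '<table><tr>'
    '<td class="yfnc_h" nowrap="nowrap">AAPL130328C00350000</td>'
    '<td class="yfnc_h">349.51</td><td class="yfnc_h">12.87</td>'
    '<td class="yfnc_h">350.30</td><td class="yfnc_h">352.50</td>'
    '<td class="yfnc_h">97</td><td class="yfnc_h">82</td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')

# two .parent hops: text node -> <td> -> <tr>, the whole row of data
row = soup.find(text='AAPL130328C00350000').parent.parent
```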

Bingo. It’s still a little messy, but you can see all of the data that we need is there. If you ignore all the stuff in brackets, you can see that this is just the data from one row.

This code is a little dense, so let’s take it apart piece by piece. The code is a list comprehension within a list comprehension. Let’s look at the inner one first:
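The inner expression is the findAll call described next; a sketch against the hypothetical one-row fragment (the sample HTML is an assumption):

```python
from bs4 import BeautifulSoup

html = (
    '<table><tr>'
    '<td class="yfnc_h" nowrap="nowrap">AAPL130328C00350000</td>'
    '<td class="yfnc_h">349.51</td><td class="yfnc_h">12.87</td>'
    '<td class="yfnc_h">350.30</td><td class="yfnc_h">352.50</td>'
    '<td class="yfnc_h">97</td><td class="yfnc_h">82</td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')

# one match per table row: the td that has BOTH class="yfnc_h" and nowrap
cells = soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': 'nowrap'})
```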

This uses BeautifulSoup’s findAll function to get all of the HTML elements with a td tag, a class of yfnc_h and a nowrap of nowrap. We chose this because it’s a unique element in every table entry.

If we had just gotten td’s with the class yfnc_h we would have gotten seven elements per table entry. Another thing to note is that we have to wrap the attributes in a dictionary because class is one of Python’s reserved words. From the table above it would return this:
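Against the hypothetical one-row fragment, the class-only query indeed picks up all seven cells:

```python
from bs4 import BeautifulSoup

html = (
    '<table><tr>'
    '<td class="yfnc_h" nowrap="nowrap">AAPL130328C00350000</td>'
    '<td class="yfnc_h">349.51</td><td class="yfnc_h">12.87</td>'
    '<td class="yfnc_h">350.30</td><td class="yfnc_h">352.50</td>'
    '<td class="yfnc_h">97</td><td class="yfnc_h">82</td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')

# without the nowrap filter we get every cell in the row, seven per entry
all_cells = soup.findAll('td', attrs={'class': 'yfnc_h'})
print(len(all_cells))  # -> 7
```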

We need to get one level higher and then get the text from all of the child nodes of this node’s parent. That’s what this code does:
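Putting the pieces together, the nested comprehension probably looked like this sketch, again run against the made-up one-row fragment rather than Yahoo's real page:

```python
from bs4 import BeautifulSoup

html = (
    '<table><tr>'
    '<td class="yfnc_h" nowrap="nowrap">AAPL130328C00350000</td>'
    '<td class="yfnc_h">349.51</td><td class="yfnc_h">12.87</td>'
    '<td class="yfnc_h">350.30</td><td class="yfnc_h">352.50</td>'
    '<td class="yfnc_h">97</td><td class="yfnc_h">82</td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')

# outer comprehension: one unique td per row; inner: the text of every
# cell in that td's parent row
optionsTable = [
    [x.text for x in y.parent.contents]
    for y in soup.findAll('td', attrs={'class': 'yfnc_h', 'nowrap': 'nowrap'})
]
```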

This works, but you should be careful if this is code you plan to reuse frequently. If Yahoo changed the way they format their HTML, this could stop working. If you plan to use code like this in an automated way, it would be best to wrap it in a try/except block and validate the output.

This is only a simple Beautiful Soup example, and gives you an idea of what you can do with HTML and XML parsing in Python. You can find the Beautiful Soup documentation here. You’ll find a lot more tools for searching and validating HTML documents.

Source: http://pythoncentral.org/python-beautiful-soup-example-yahoo-finance-scraper/