Home Web Scraping 101 with Python
Post
Cancel
Preview Image

Web Scraping 101 with Python

Miniature Calendar

This mini-project started from another design project called Miniature Calendar. It’s a beautiful creative project, which creates real world scenes with small scaled objects. The author of the project posts 1 creative idea per day in a chronological order. The only way to get access and acquire these images is by visiting project social media pages or the project official website.

I was just thinking a way to download these photos into my local image repository and enjoy as wallpaper and so on. So, why not analyze the site structure and try to scrap these images for inspiration. I think the author might not mind for personal use of these images since these images have been sharing across the social medias.

Web Scraping 101

After researching about half an hour, I decided to use two major Python libraries for this mini-projects which are BeautifulSoup and urllib2. These two libraries become a de-facto standard for web scraping with Python. I also decided to use argparse library to handle arguments to download images for today or images of the past whole month.

The following listing is sort of a breakdown plan for to achieve this web scrapper:

  1. Analyze website information structure.
  2. Extract the images from the server through “src” of HTML img tags.
  3. Create a function to ask,store and retrieve folder path to store the images.
  4. Use argument parser to ask user to download today image or images from the entire past month.
  5. Some error handling.

The final python script of this project can be see at scrap_miniature.

What are we looking for?

The first thing to do it to learn a bit about these 2 Python libraries; BeautifulSoup and urllib2. The documentations are pretty straightforward to start working on it. I browse the website and use Chrome developer tools to examine the location of the img tag. At the end of this process, I implemented a Python function to download images from the website. Urllib2 opens URLs and retrieve HTML content which is parsed with bs4 to lxml format. HTML elements can be extracted by using class name, attributes and also by HTML tags. In this case, I extract the img tags with attribute of height to 1080. Finally, the image path is downloaded with ‘urlretrieve’ function of urllib.

1
2
3
4
5
6
7
8
9
10
11
12
13
# function that scrap src of images
def get_miniature(url,filename):
    try:
        html = urlopen(url).read()
        soup = BeautifulSoup(html, "lxml")
        outpath = os.path.join(out_folder, filename)
        # This is the html component we are looking for
        # <p><img alt="" class="alignnone size-full wp-image-8908" height="1080" src="http://miniature-calendar.com/wp-content/uploads/2016/02/160202tue.jpg" title="Sand shot" width="1080"/></p>    
        img = soup.find(attrs={"height" : "1080"})    
        print img["src"]+ " is downloading..."
        urlretrieve(img["src"], outpath)
    except:
        print "URL don't exist..skip!"

Download images

Two functions are created to download images by today time and also loop for the whole month and download in a single batch. Since the image path is in timestamp format(yymmdd), it is possible to implement the path by manipulating Python datetime object.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
#download today image
def daily():        
    today = str(time.strftime("%y%m%d",local)) 
    absurl = baseurl+today     # concat baseURL with yymmmdd folderr today
    file_ = today+"_min.jpg"    
    get_miniature(absurl,file_)    

# download photos of the whole month 
def monthly(yy,mm):
    # get folder structure like http://miniature-calendar.com/160126/ (baseurl/yymmdd/)
    # iterate for whole month
    for i in range(0,calendar.monthrange(yy,mm)[1]):
        day = str(yy)+str(format(mm,'02'))+str(format(i+1,'02'))
        absurl = baseurl+day    # concat baseURL with yymmdd and each days in month
        file_ = day+"_min.jpg"        
        get_miniature(absurl,file_)

Default Download Location

Also, it might be nice to have a feature to select a location to download the images. 3 functions are implemented to ask the user preference and save it to a flat file at home directory. Later on,the script will refer the flat file,”.outpath”,as the path to save images. To change the path by simply removing the file and run the program again.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
#create folder and write file path to flat file
def mkdir_(path):
    out_file = open(".outpath",'w')
    if (os.path.isdir(os.path.expanduser(path))):
        print "Folder already exists"
        out_file.write(os.path.expanduser(path))
    else:   
        os.mkdir(os.path.expanduser(path))
        out_file.write(os.path.expanduser(path))
    out_file.close()

# ask user the file path
def out_path():    
    print "folder path to download images (default:~/Downloads/miniature-calendar/)"
    file_dir = raw_input("> ")    
    if file_dir == "":
        mkdir_("~/Downloads/miniature-calendar/")        
    else:
        mkdir_(file_dir)                

# read and validate output file location
def read_file_path():
    outpathfile = open(os.path.expanduser(".outpath"), 'r')
    file_dir = outpathfile.read()
    if (os.path.isdir(os.path.expanduser(file_dir))):    
        return file_dir
    else:
        print "Invalid file path!"
        print "Folder created at this path: "+ os.path.expanduser(file_dir)
        os.mkdir(os.path.expanduser(file_dir))
        return file_dir

Argument Parser for better interaction

Argparse is a Python module that can handle program argument nicely. For this script, it accepts 2 optional arguments to download images daily or monthly.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
def ParseCommandLine():
    parser = argparse.ArgumentParser('Download miniature-calendar images.["-d" for today photo.."-m mmyy" to archive photos of entire month.]')
    parser.add_argument('-d','--daily',action="store_true", help='Download today image')
    parser.add_argument('-m','--monthly', help='Download for the whole month (mmyy) e.g. "0216" for February 2016')    
    theArgs = parser.parse_args()
    return theArgs

def main(args):
    out_folder = read_file_path()
    if args.daily:            
        print "Today image is downloading to " + out_folder
        daily()
        print ".........Done........."
    elif args.monthly:            
        if len(args.monthly) <4:
            print "pls make sure you input is in this format:mmyy"
        else:
            mm = int(args.monthly[0]+args.monthly[1])
            yy = int(args.monthly[2]+args.monthly[3])
            print "Images are downloading to " + out_folder
            monthly(yy,mm)
            print ".........Done........."
    else:
        print "try again with -h."

if __name__ == "__main__":    
    # check output filepath
    if os.path.isfile(os.path.expanduser(".outpath")) is False:
        out_path()
        print "pls run the program again..."
    else:
        args = ParseCommandLine()        
        main(args)

Usage

A short usage suggestion can be found at Github Repo.

This post is licensed under CC BY 4.0 by the author.

CloudReady ChromeOS Review

Digital Forensic Resources