Simple Python Site Scraper Tutorial

Requirements:
  • A basic understanding of HTML
  • A method of viewing browser cookies

Grabbing the site code:
import urllib2
try:
    read = urllib2.urlopen("http://www.examplesite.com").read()
    print read
except urllib2.URLError as e:
    print e
Be careful running that code since it may take ages to print out the site data.
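Note that urllib2 is Python 2 only; in Python 3 it was merged into urllib.request. A minimal sketch of the same fetch in Python 3 syntax (the URL is the tutorial's placeholder):

```python
import urllib.request

def fetch(url):
    # Read the raw bytes and decode to text for printing.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# print(fetch("http://www.examplesite.com")[:500])  # print only the first 500 chars
```

Slicing the result before printing also avoids dumping an entire page to the console.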

Grabbing specific data using regex:
import urllib2, re
try:
    read = urllib2.urlopen("http://www.examplesite.com").read()
    images = re.findall(r'<img src="(.*?)"',str(read))
    for image in images:
        print image
except urllib2.URLError as e:
    print e
What the regex is doing:
code = '<img src="something.png" anything="anything" />'

regex = re.findall(r'<img src="(.*?)"',str(code))
The (.*?) in the regex is a non-greedy capture group, and re.findall returns whatever each group matched. (With multiple capture groups, findall returns a list of tuples instead of a list of strings.)
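A quick demonstration of that behavior on a made-up snippet of HTML:

```python
import re

html = '<img src="a.png" alt="cat" /> <img src="b.jpg" alt="dog" />'

# One capture group: findall returns a list of strings.
singles = re.findall(r'<img src="(.*?)"', html)
# Two capture groups: findall returns a list of tuples instead.
pairs = re.findall(r'<img src="(.*?)" alt="(.*?)"', html)

print(singles)  # ['a.png', 'b.jpg']
print(pairs)    # [('a.png', 'cat'), ('b.jpg', 'dog')]
```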

What if the site blocks bot traffic?
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36')]
urllib2.install_opener(opener)
try:
    read = urllib2.urlopen("http://www.examplesite.com").read()
    print read
except urllib2.URLError as e:
    print e
Essentially we're just sending a user-agent value along with our traffic. This bypasses most anti-bot methods that I've come across in the past.

What if the site requires specific cookie values?
import urllib2
cookievals = [["cookie_1","1"],["cookie_2","2"]]
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36')]
opener.addheaders.append(('Cookie', '; '.join('{}={}'.format(k,v) for k,v in cookievals)))
urllib2.install_opener(opener)
try:
    read = urllib2.urlopen("http://www.examplesite.com").read()
    print read
except urllib2.URLError as e:
    print e
So the example has the following cookie values: cookie_1=1 and cookie_2=2.
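The join expression builds a standard Cookie header string out of those pairs; run on its own it looks like this:

```python
cookievals = [["cookie_1", "1"], ["cookie_2", "2"]]

# Build the Cookie header value the same way the snippet above does:
# each pair becomes "name=value", joined with "; ".
header = '; '.join('{}={}'.format(k, v) for k, v in cookievals)
print(header)  # cookie_1=1; cookie_2=2
```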

What if I want to use proxies?
import urllib2
proxy = "127.0.0.1:8080"  # replace with your proxy's address:port
handler = urllib2.ProxyHandler({'http': proxy})
opener = urllib2.build_opener(handler)
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36')]
urllib2.install_opener(opener)
try:
    read = urllib2.urlopen("http://www.examplesite.com").read()
    print read
except urllib2.URLError as e:
    print e

What if I want to download the images?
import urllib2, re, urlparse
try:
    read = urllib2.urlopen("http://www.examplesite.com").read()
    images = re.findall(r'<img src="(.*?)"',str(read))
    for image in images:
        # src values are often relative, so resolve them against the page URL first.
        url = urlparse.urljoin("http://www.examplesite.com", image)
        filename = url.split('/')[-1]
        with open(filename,'wb') as f:
            f.write(urllib2.urlopen(url).read())
except urllib2.URLError as e:
    print e
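One catch: img src values scraped out of a page are often relative paths, which urlopen can't fetch directly. A small sketch of resolving them with urljoin before downloading (Python 3 syntax here, where it lives in urllib.parse; the URLs are made up):

```python
from urllib.parse import urljoin  # in Python 2 this is urlparse.urljoin

base = "http://www.examplesite.com/page.html"

# A relative src is resolved against the page URL; an absolute src passes through.
print(urljoin(base, "/img/logo.png"))            # http://www.examplesite.com/img/logo.png
print(urljoin(base, "http://cdn.example.com/x.png"))  # http://cdn.example.com/x.png

# The filename is just the last path segment, same as the split in the code above.
print("http://cdn.example.com/x.png".split('/')[-1])  # x.png
```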

I simply copied over my Nulled thread to here and removed some comments, NSFW content, and the spoiler tags.