
View Full Version : Any good Python coders here? Especially with regard to web scraping?


mjcrossuk
2024-09-13, 09:12 AM
I'm trying to scrape the torrent details from TTD to put into a database which I would make available online, in the event that TTD closes down.

Please see this thread for more details: http://www.thetradersden.org/forums/showthread.php?t=202461

I've not used the requests package before, and I'm having problems.

I'm starting from this page: http://www.thetradersden.org/forums/archive/index.php/f-11.html, then going into the various list pages for some of the categories (eg Audio, Audio Inactive, Audio Pulled). The next level pages list torrents, 250 per page, each with a link to the torrent detail thread eg http://www.thetradersden.org/forums/archive/index.php/f-12.html is page 1 of 147 listing Active Audio torrents.

Sometimes my scraping code retrieves an index page, but other times the request returns a 200 status with an empty response.

Trying to retrieve a torrent thread, eg http://www.thetradersden.org/forums/archive/index.php/t-203252.html, always gives a 200 status and an empty response.

If the request were failing, I'd expect a non-200 status code, but I never get one.

Could it be authentication? Caching? Is the TTD backend blocking scraping of torrent detail threads?

TIA

I realise that web scraping these pages may not be the best way to get the info; a better way would be a dump/extract of the backend database(s). If the scraping does work, I would be mindful of NOT scraping 100,000+ pages quickly.
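For reference, here is a stripped-down sketch of the kind of fetch loop I mean (the browser-style User-Agent header is just a guess at a workaround, not something I've confirmed helps, and the delay is there so I'm not hammering the server):

```python
# Sketch of the fetch loop described above. The User-Agent is an assumption:
# some server setups serve empty/stub pages to the default "python-requests"
# agent string, so a browser-like one may behave differently.
import time
import requests

BASE = "http://www.thetradersden.org/forums/archive/index.php"

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) "
        "Gecko/20100101 Firefox/115.0"
    )
}

def thread_url(thread_id):
    """Build the archive URL for one torrent detail thread, eg t-203252.html."""
    return f"{BASE}/t-{thread_id}.html"

def fetch(url, session, delay=2.0):
    """Fetch one page, raise on a non-200 status, then sleep so that
    100,000+ pages are not requested in a tight loop."""
    resp = session.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    time.sleep(delay)
    return resp.text

# Usage (network access required):
#     with requests.Session() as session:
#         html = fetch(thread_url(203252), session)
#         print(len(html))  # an empty body here would point at server-side blocking
```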

brentter
2024-09-25, 11:26 AM
What's happening is that your scraper is trying to grab the page data before it has loaded, because there's a JS file that loads before the content.
For example: http://www.thetradersden.org/forums/archive/index.php/t-203252.html
If you pull up the dev tools in your browser (Console tab), this is the error:

Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content. markup.js:250:53

Clicking on the markup.js file highlights:

try {
  // If we didn't wait for the document to load, we want to force a layout update
  // to ensure the anonymous content will be rendered (see Bug 1580394).
  const forceSynchronousLayoutUpdate = !this.waitForDocumentToLoad;
  this._content = this.anonymousContentDocument.insertAnonymousContent(
    forceSynchronousLayoutUpdate
  );

The page hasn't been rendered. So there's a page there, hence the 200, just nothing on it yet. Adding a wait timer might fix it, or you can switch to a tool like Selenium, which works well with JavaScript and has a built-in option to wait until a certain page element has loaded:


from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Path to your WebDriver executable (e.g., chromedriver)
driver_path = '/path/to/chromedriver'

# Initialize WebDriver (Chrome in this case; Selenium 4 passes the
# driver path via a Service object rather than executable_path)
driver = webdriver.Chrome(service=Service(driver_path))

try:
    # Open the web page
    driver.get('http://example.com')

    # Wait until a specific element is loaded (adjust the selector as needed)
    wait = WebDriverWait(driver, 10)  # Time out after 10 seconds
    element = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'your-target-element-selector'))
    )

    # Now that we are sure the content has loaded, get the page source
    html_content = driver.page_source

    # You can now parse html_content with BeautifulSoup or any other parser

finally:
    driver.quit()  # Close the browser when done



This example drives Chrome btw, and it's obviously just a single-page scrape.
https://selenium-python.readthedocs.io/waits.html

mjcrossuk
2024-09-25, 11:52 AM
Thanks so much for your reply.

I will study what you've suggested, and see what happens.

Would you be willing to engage in a direct conversation via email?

all the best,

Mike

brentter
2024-09-25, 12:45 PM
Yes, actually, after I closed the browser (I'm at work atm) I realized that I could probably just scrape it all tonight for you.
I will DM you my email.