web-scraping know-how
#1
I am looking for help on how to build a web-scraping solution. Any help is welcome: I am not very familiar with this topic, but I need it for an application that is part of a comprehensive university assignment.

Are there any web-scraping specialists here who could give me an overview of which techniques or methods are best practice, or which tools I should look at (ready-made tools are less useful for me, as I have to build it on my own)?

Many thanks to the admins for accepting me into this forum.
#2
(04-18-2018, 01:57 PM)westberlin Wrote:  I am looking for help on how to build a web-scraping solution. Any help is welcome: I am not very familiar with this topic, but I need it for an application that is part of a comprehensive university assignment.

Are there any web-scraping specialists here who could give me an overview of which techniques or methods are best practice, or which tools I should look at (ready-made tools are less useful for me, as I have to build it on my own)?

Many thanks to the admins for accepting me into this forum.

Selenium is the easiest route to get started: https://www.seleniumhq.org/
#3
Hey bro! Do you know programming? If you do, you can build bots in C# with the WebBrowser control, or in JavaScript with tools like PhantomJS. If you don't, you can use a scraping tool like UBot, or a Chrome or Firefox add-on from their extension stores.
#4
I can help you code your own custom scraper with Python.
#5
If you want to roll your own thing, look into Scrapy and Beautiful Soup for Python.
#6
Python and BS4, mate, that's all you need.
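For anyone new to this, here is a rough sketch of what "parse static HTML and pull out the parts you want" looks like. To keep it runnable without installing anything, it uses the standard library's html.parser as a stand-in for Beautiful Soup (BS4 would make the same job shorter). The sample page, the LinkCollector class, and the extract_links helper are all made up for illustration; a real scraper would download the HTML first.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would fetch this HTML over HTTP.
SAMPLE_HTML = """
<html><body>
  <h1>Example shop</h1>
  <a href="/item/1">First item</a>
  <a href="/item/2">Second item</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> in the document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text chunks seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

This only works when the data is already present in the HTML the server sends back.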
#7
Did you see this:
https://raidforums.com/Thread-Advanced-W...s-Chromium
A good summary and starting point for web-scraping.
#8
Simply use Selenium.
#9
(05-01-2018, 09:39 PM)tit Wrote:  python and BS4 mate all you need

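Unless the shit you scrape is generated by JavaScript. Then you need a browser that actually runs the scripts (for example, Selenium driving a headless browser), because Python requests only downloads the raw HTML and won't execute JavaScript, so the data never shows up. DM me for help. I've got lots of experience in scraping and Cloudflare bypass.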
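To make the JavaScript point concrete, here is a small sketch that needs no network and no browser. Both pages are invented for illustration: one has the price in the HTML itself, the other fills it in with a script after load. The ElementText class and the text_of helper are hypothetical names, and the standard library's html.parser stands in for whatever parser you use. Any parser that does not execute scripts sees the same thing.

```python
from html.parser import HTMLParser

# Hypothetical page where the price is part of the HTML the server sends.
STATIC_PAGE = '<html><body><div id="price">19.99</div></body></html>'

# Hypothetical page where the price is inserted by JavaScript after load.
JS_PAGE = """
<html><body>
  <div id="price"></div>
  <script>
    document.getElementById("price").textContent = "19.99";
  </script>
</body></html>
"""


class ElementText(HTMLParser):
    """Collects the text inside the element with a given id.

    Simplification: assumes no void tags (such as <br>) inside the target.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def text_of(html, element_id):
    """Return the text inside the element with this id, or '' if empty."""
    parser = ElementText(element_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


if __name__ == "__main__":
    print(repr(text_of(STATIC_PAGE, "price")))
    print(repr(text_of(JS_PAGE, "price")))
```

The parser finds the price in the static page but gets an empty string from the second one, because the script is never executed. That gap is what a real browser engine fills.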
#10
Selenium is painfully slow. Try PhantomJS or ghostjs. They run headless and work better than Selenium.

Python + bs4 + Scrapy works great if your target sites are simple HTML (no fancy JS).
#11
(08-23-2018, 09:44 PM)darksh33p Wrote:  
(05-01-2018, 09:39 PM)tit Wrote:  python and BS4 mate all you need

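Unless the shit you scrape is generated by JavaScript. Then you need a browser that actually runs the scripts (for example, Selenium driving a headless browser), because Python requests only downloads the raw HTML and won't execute JavaScript, so the data never shows up. DM me for help. I've got lots of experience in scraping and Cloudflare bypass.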

Content loaded by JavaScript really does mess everything up.
#12
If you want to build a big, scalable application, you can use a headless browser to load the page you want and extract the information you need. Puppeteer, a Node.js library that drives headless Chrome, is one example, and it is easy to maintain. You can use CSS selectors to pick out the information you need and to automate navigation, login, etc., and any kind of DB you like to store the extracted data. There are some introductory examples, but for Puppeteer you need to know JS (of course, there are headless-browser tools for other languages too).
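On the "any kind of DB" part, here is a minimal sketch of the storage side only, using SQLite from the Python standard library. The table layout, the field names (url, title, price), and the helpers open_store and save_items are assumptions made up for this example. The extraction step (Puppeteer, Selenium, or anything else) is assumed to hand over a list of dicts.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        " url TEXT PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " price REAL)"
    )
    return conn


def save_items(conn, items):
    """Store scraped records, keeping exactly one row per URL.

    INSERT OR REPLACE overwrites the old row when a page is scraped again.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO items (url, title, price)"
            " VALUES (:url, :title, :price)",
            items,
        )


if __name__ == "__main__":
    conn = open_store()
    # Hypothetical records, as an extraction step might produce them.
    save_items(conn, [
        {"url": "https://example.com/a", "title": "First item", "price": 19.99},
        {"url": "https://example.com/b", "title": "Second item", "price": None},
    ])
    for row in conn.execute("SELECT url, title, price FROM items ORDER BY url"):
        print(row)
```

Keying the table on the URL is just one possible choice. It makes re-running the scraper safe, at the cost of losing the history of older values.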
 

