Daniel - August 9, 2021
Spoofing user agents is a must if you want to scrape websites successfully. In this post, we list the best user agents for scraping and show how to use them to protect your scraper bots from web server bans.
A user-agent is a string of text included in the headers of requests sent to web servers. A web server uses the details in the user agent to identify the device type, operating system version, and browser making the request.
Example: Windows 10 with Google Chrome
user_agent_desktop = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' \
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 ' \
    'Safari/537.36'
This user-agent string tells the web server you are browsing with Chrome on a 64-bit Windows 10 device. The server uses this information to tailor its response to your device type, operating system, and browser. User-agent strings follow the format below:
User-Agent: Mozilla/<version> (<system-information>) <platform> (<platform-details>) <extensions>
Web servers can distinguish browsers, web scrapers, download managers, spambots, and so on because each has a distinctive user-agent string. For that reason, most antibot websites can identify and ban a web scraper based on its user-agent string.
To work around this, web scrapers, spambots, and download managers present fake user-agent strings borrowed from popular browsers, giving them a legitimate-looking identity. This practice is known as user-agent spoofing.
Therefore, changing or spoofing your user agent is essential for scraping data successfully from antibot websites.
Here is a list of top PC-based user agents:
Windows 10 / Edge browser: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246
Windows 7 / Chrome browser: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36
Mac OS X 10 / Safari browser: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9
Linux PC / Firefox browser: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1
Chrome OS / Chrome browser: Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36
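For convenience, these strings can be collected into a Python list, which the rotation technique later in this post will draw from. A minimal sketch (the USER_AGENTS name is our own choice):

USER_AGENTS = [
    # Windows 10 / Edge
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
    # Windows 7 / Chrome
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36",
    # Mac OS X 10 / Safari
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9",
    # Linux PC / Firefox
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
    # Chrome OS / Chrome
    "Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36",
]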
For data scraping, the best user agents are strings belonging to a real browser. To change your web scraper's user agent with Python Requests, copy the user-agent string of a well-known browser (Firefox, Chrome, Edge, Opera, etc.) and put it in a dict under the key 'User-Agent', e.g.
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}
To test if your web scraper is sending the right headers, send a request to HTTPBin in the format below:
import requests
from pprint import pprint

r = requests.get('http://httpbin.org/headers', headers=headers)
pprint(r.json())  # HTTPBin echoes back the headers it received
However, some web servers can still detect that your request comes from a bot, because the following browser headers are missing (you can see this in the HTTPBin response as well):
Accept (Requests sends a generic */* instead of a browser's Accept value)
Accept-Language
Dnt
Upgrade-Insecure-Requests
Thus, for successful scraping, your headers dict should also include the headers above; for example:
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    "Dnt": "1",
    "Host": "httpbin.org",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
}
Re-send the same HTTPBin request as before to confirm that every header now appears in the echoed response.
Following the steps above will successfully spoof your user agent as that of a real browser. However, the web server can still ban your scraper if it sends a volume of requests per minute that would be humanly impossible. Hence, to prevent an IP address ban, you should rotate your user agent, using rotating proxies together with a list of user-agent strings belonging to real browsers.
To rotate a user agent using proxies (a sketch follows this list):
Gather a collection of user-agent strings from popular web browsers
Create a Python list of those user-agent strings
Program your web scraper to pick a user-agent string from the list at random
Replace the exit IP address using rotating proxies
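Here is a minimal sketch of that loop, reusing the USER_AGENTS list built earlier; the proxy endpoint and credentials are placeholders, not a real ProxyRack gateway:

import random
import requests

# Placeholder rotating-proxy gateway; substitute your provider's real endpoint.
PROXIES = {
    "http": "http://username:password@rotating-proxy.example.com:8000",
    "https": "http://username:password@rotating-proxy.example.com:8000",
}

def fetch(url):
    # Pick a different user-agent string at random for each request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    # The rotating proxy swaps the exit IP address on every request.
    return requests.get(url, headers=headers, proxies=PROXIES, timeout=10)

r = fetch("http://httpbin.org/headers")
print(r.json())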
For the best result, you should source rotating proxies from ProxyRack because we have a large collection of rotating proxies worldwide.
Also, rotate each user agent together with the full set of headers associated with it, as in the examples above, to prevent the web server from identifying your web scraper as a bot.
Note: ignore any header starting with 'X-' in the HTTPBin response; those headers are added by HTTPBin's load balancer, not sent by your scraper.
Successfully scraping websites with web scrapers depends on how well you spoof user agents and on the type of proxies you use. Therefore, you should:
Make sure you're using the right user-agent string for the set of headers you're sending
Arrange your headers in the same order as the browser whose user-agent string you're spoofing; websites using sophisticated antibot tools can detect a bot whose headers arrive in an unnatural order.
Add a Referer header to your requests to make them look more authentic
Don't keep cookies or log into the website you're scraping, so the web server can't identify you from past activity (a sketch combining these last two points follows this list).
Get premium rotating proxies from ProxyRack, especially if you intend to scrape a large volume of data.
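To illustrate the Referer and cookie points above, here is a hedged sketch (the Referer value and target URL are illustrative only) that adds a Referer header and uses a fresh session per request so no cookies persist between visits:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    # A plausible Referer makes the request look like an ordinary click-through.
    "Referer": "https://www.google.com/",
}

# A fresh Session per request means no cookies carry over between visits.
with requests.Session() as session:
    r = session.get("http://httpbin.org/headers", headers=headers, timeout=10)
    print(r.status_code)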