In the Spring/Summer semester of 2018 I took Astronomy 111 at Washtenaw Community College to finish my Natural Sciences requirement at EMU. Being in that class reminded me of the great website apod.nasa.gov. I hoped to find that NASA had set up an Instagram account to cross-post each day's image. Much to my surprise, I was unable to find any such account. Around the same time I was attempting to teach myself Python, so naturally I thought it would be a great experiment and learning experience to write my own Python bot that would post the content from the APOD site every morning.
Python has many modules designed to help with web scraping. The most popular is probably Beautiful Soup, but I went with lxml, one of the technologies underlying Beautiful Soup. These frameworks let you access the data on a web page through the HTML tags that give the page its structure. Every element on the page sits somewhere in a tree of HTML tags and can be reached simply by knowing its location. This process is made easy because any modern browser can inspect an element and show you exactly which tag you have selected. So the first thing I did was right-click the image file on the APOD site and then click Inspect Element.
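To make the tag-tree idea concrete, here is a minimal sketch of navigating a page with lxml. The snippet of HTML is made up for illustration, not APOD's actual markup:

```python
from lxml import html

# A toy page: a paragraph containing a link that wraps an image.
page = html.fromstring('<html><body><p><a href="image/big.jpg">'
                       '<img src="image/small.jpg"></a></p></body></html>')

# Walk the tree by tag position, the same kind of absolute path
# a browser's "Copy XPath" gives you.
link = page.xpath('/html/body/p/a')[0]
print(link.attrib['href'])  # image/big.jpg
```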

When I inspected the element I noticed that there was one filename in the <a> tag and a second one in the <img> tag nested inside of it. An <a> tag creates a hyperlink to a different page, and its href attribute holds the address it links to. The <img> in this case is what you click on to follow that link, and sure enough, if you click on that picture of the black hole on the APOD home page it takes you to an even higher-resolution version of the image. The path to that full-resolution version is in the href attribute of the <a> tag.

Setting up the Environment
#!/usr/bin/python
from PIL import Image
import requests
from lxml import html
import urllib.request
import sys
from InstagramAPI import InstagramAPI
The first thing needed to use lxml or the Instagram API is to download and install all of the required modules. Using pipenv, I created a new virtual environment and within it installed Pillow, lxml, requests, and InstagramAPI. Without these modules installed, Python will error out when trying to execute the code. A couple of other modules also need to be imported for the script to run properly: urllib and sys are part of the Python standard library and can be imported without being installed first. Once all the required modules are imported, they are ready to be used in the project.
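For reference, the setup looks roughly like this. The package names are as I recall them; in particular, InstagramAPI is an unofficial package and may need to be installed from its GitHub repository instead of PyPI:

```shell
# Create the virtual environment and install the third-party modules
pipenv install Pillow lxml requests InstagramAPI
# Drop into the environment to run the script
pipenv shell
```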
Diving into the Code
apod = requests.get('http://apod.nasa.gov')
et = html.fromstring(apod.text)
image = et.xpath('/html/body/center[1]/p[2]/a')
imagePath = image.pop().attrib['href']
apodPath = "http://apod.nasa.gov/"+imagePath
print(apodPath)
urllib.request.urlretrieve(apodPath,'./apod.jpg')
Knowing that we want to grab the high-res version of the image, we need to extract the value of that href attribute. It is not a full URL but a relative path on the server to the file we need. You can see in the code above that I open an HTTP request to download the web page. Then I take the text of the page and use it to create an HTML object that I can inspect. The third line is where I use lxml to extract the specific tag I am looking for. You can right-click the element you want in the inspector and click Copy -> XPath to grab the path to that tag. When I do that on the <a> with the path to the high-res image, it returns:
/html/body/center[1]/p[2]/a
From there I read the href attribute and save it to a variable called imagePath. Finally, I append that path to the site's URL to create the complete URL for the image file we want to download.
Once we have created the full URL we can use urllib to download that file and save it locally as apod.jpg. Next we will need to manipulate the image so that Instagram accepts it.
img = Image.open('./apod.jpg')
if not img.size[0] == img.size[1]:
    longer_side = max(img.size)
    horizontal_padding = (longer_side - img.size[0]) // 2
    vertical_padding = (longer_side - img.size[1]) // 2
    imgNew = img.crop(
        (
            -horizontal_padding,
            -vertical_padding,
            img.size[0] + horizontal_padding,
            img.size[1] + vertical_padding
        )
    )
    imgNew.save("./apodPadded.jpg")
else:
    img.save("./apodPadded.jpg")
In this snippet of code we use the Pillow module to pad the image out to a square, by cropping beyond its edges, so that Instagram will not reject it. After opening the file that we just saved, the image is checked to see if it is already a square. If not, we figure out which side is the longer of the two by running the max() function against the size attribute of the image. The size attribute is a tuple with two values, the width and the height.
The next part seems a little more confusing, but really it just checks each side against the length of the longest side and divides the difference by two. If the side being checked is the longest one, its padding value will be 0. The reason I take half of the difference between the lengths is that we want the image centered when we add the padding to make it a square, so half must be added to either side. Once we have values for how much padding to add on each side, we pass that information as a tuple to the crop() function, saving the returned object as imgNew.
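To make the arithmetic concrete, here is the crop-box calculation pulled out into a standalone function, worked through for a hypothetical 1000x800 landscape image:

```python
def padded_box(width, height):
    # Crop box that pads the shorter side out to a square.
    # Negative coordinates tell Pillow's crop() to extend past the edge.
    longer_side = max(width, height)
    horizontal_padding = (longer_side - width) // 2
    vertical_padding = (longer_side - height) // 2
    return (-horizontal_padding,
            -vertical_padding,
            width + horizontal_padding,
            height + vertical_padding)

# A 1000x800 image needs 100 pixels of padding on the top and bottom:
box = padded_box(1000, 800)
print(box)  # (0, -100, 1000, 900) -> a 1000x1000 square
```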

Processing the Text
explanation = et.xpath('/html/body/p[1]')[0]
description = ' '.join(explanation.text_content().split())
description = description + '\n#nasa #space #esa #apod #astronomy #astrophotography\n'
To create the description and image credit we need to pull the text from below the image. We go back to the HTML object created earlier and grab the entire block of text in the first <p> tag of the <body>. The text_content() function allows us to grab all of the text from the paragraph while ignoring all of the <a> tags and their attributes. We also want to use the split() function on the string to break up the text on all of the wacky newline and tab characters hidden throughout. Then I join all those pieces back together with a simple space and append the hashtags, #nasa, #space, #esa, #apod, #astronomy, and #astrophotography.
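The split()/join() combination is a standard trick for normalizing whitespace. A quick illustration with a made-up sample string:

```python
# Scraped text often carries stray newlines and tabs from the page layout.
raw = "Explanation: What does\na black hole\t\tlook like?"

# split() with no argument breaks on any run of whitespace,
# so joining with a single space flattens everything cleanly.
clean = ' '.join(raw.split())
print(clean)  # Explanation: What does a black hole look like?
```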

Getting the title and image credit proved to be a bit more difficult. The image credit varies: sometimes it reads “Image Credit:” and other times “Image Credit & Copyright:”. On top of that, there can be any number of different credits listed, and some are hyperlinks as well. This makes it difficult to grab in a boilerplate way.
credit = et.xpath('/html/body/center[2]')[0].text_content()
creditLines = credit.splitlines()
for i,line in enumerate(creditLines):
    if 'Credit' in line:
        if 'Copy' in creditLines[i+1]:
            credit = creditLines[i+2:]
        else:
            credit = creditLines[i+1:]
credit = [ele.strip() for ele in credit if ele != '']
creditString = '📷: '+ ' '.join(credit)
title = et.xpath('/html/body/center[2]/b[1]')[0].text.strip() + '\n'
description = title + description + creditString
print(description)
As in every other case, the first thing to do is grab the element and its text_content(). This actually gives us both the title and the entire image credit. Since the string is full of newline characters, I break it up into a list of lines. With the segments broken up, each one is checked to see if it contains the word “Credit”. Once the index of the “Image Credit:” segment is found, there is a quick check to see whether the next segment is “Copyright:” or not. Once it is determined which segments the actual credit consists of, a slice of the original list is created.
The next issue to address was the fact that some of the segments were just empty strings, so when they were all joined with spaces there would be spots with extra spaces for no reason. Using a list comprehension, I stripped every string that was not empty and placed the results into the credit variable. With the formatting taken care of, the list is joined back together and appended to the ‘📷: ‘ string. The title is then grabbed and combined with the description and the image credit to form the full caption for Instagram.
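To see the slicing logic in isolation, here is a sketch run against a made-up credit block. The sample lines mimic the shape of APOD's credit text but are not real scraped output:

```python
# A fabricated stand-in for text_content() of the credit block.
sample = """M87's Central Black Hole
Image Credit:
Event Horizon Telescope Collaboration
"""

lines = sample.splitlines()
credit = []
for i, line in enumerate(lines):
    if 'Credit' in line:
        # Skip one extra segment when the line after "Credit" is "Copyright:"
        if 'Copy' in lines[i + 1]:
            credit = lines[i + 2:]
        else:
            credit = lines[i + 1:]

# Drop empty segments and stray whitespace before joining.
credit = [ele.strip() for ele in credit if ele != '']
print('📷: ' + ' '.join(credit))  # 📷: Event Horizon Telescope Collaboration
```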
Bringing it all Together
#!/usr/bin/python
from PIL import Image
import requests
from lxml import html
import urllib.request
import sys
from InstagramAPI import InstagramAPI

apod = requests.get('http://apod.nasa.gov')
et = html.fromstring(apod.text)
image = et.xpath('/html/body/center[1]/p[2]/a')
imagePath = image.pop().attrib['href']
apodPath = "http://apod.nasa.gov/"+imagePath
print(apodPath)
urllib.request.urlretrieve(apodPath,'./apod.jpg')

img = Image.open('./apod.jpg')
if not img.size[0] == img.size[1]:
    longer_side = max(img.size)
    horizontal_padding = (longer_side - img.size[0]) // 2
    vertical_padding = (longer_side - img.size[1]) // 2
    imgNew = img.crop(
        (
            -horizontal_padding,
            -vertical_padding,
            img.size[0] + horizontal_padding,
            img.size[1] + vertical_padding
        )
    )
    imgNew.save("./apodPadded.jpg")
else:
    img.save("./apodPadded.jpg")

explanation = et.xpath('/html/body/p[1]')[0]
description = ' '.join(explanation.text_content().split())
description = description + '\n#nasa #space #esa #apod #astronomy #astrophotography\n'

credit = et.xpath('/html/body/center[2]')[0].text_content()
creditLines = credit.splitlines()
for i,line in enumerate(creditLines):
    if 'Credit' in line:
        if 'Copy' in creditLines[i+1]:
            credit = creditLines[i+2:]
        else:
            credit = creditLines[i+1:]
credit = [ele.strip() for ele in credit if ele != '']
creditString = '📷: '+ ' '.join(credit)
title = et.xpath('/html/body/center[2]/b[1]')[0].text.strip() + '\n'
description = title + description + creditString
print(description)

try:
    InstagramAPI = InstagramAPI("username", "password")
    InstagramAPI.login() # login
    photo_path = './apodPadded.jpg'
    InstagramAPI.uploadPhoto(photo_path, caption=description)
except Exception:
    print("Could not post to Instagram: " + str(sys.exc_info()[0]))
With the image successfully cropped to a square and the description properly formatted, we create an InstagramAPI object with the desired username and password. After calling the login() function, I run uploadPhoto() with the path to the cropped photo and the caption as parameters. That is it, the photo is posted.

I have a small VPS (this wordpress site also runs on it) and I created a cronjob that runs every morning at 7am which calls the script and uploads the picture.
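The crontab entry looks something like the line below. The script path is hypothetical; adjust it to wherever the bot actually lives on your machine:

```shell
# Run the APOD bot every morning at 7:00
0 7 * * * /usr/bin/python /home/user/apod-bot/apod.py
```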
This was a great project. It was really rewarding to build exactly the tool I was looking for, and I learned a lot about web scraping and some other cool Python libraries like Pillow.
Thanks for reading,
Raul
Make sure you follow the ‘pod: https://www.instagram.com/nasapod/
Also, all the code is hosted on GitHub
