Scraper for Taiwan's Weather Data

October 23, 2019

A collaborator I work with was tasked with analyzing a data set published by Taiwan which contains weather data from each of their 800+ weather stations across every region of the country. Unfortunately there was no way to download the full data set, only a per-station, per-month download, which would mean downloading ~15 years × 12 months × 800 stations = over 140,000 individual files by hand. Normally, scraping data like this is pretty straightforward: find the base URL, spend a few minutes compiling the parameters, write a script to iterate through each combination of URL parameters, and call it good.

The parameters

I right-clicked on the ‘Get CSV’ link and copied the URL. This seemed straightforward enough (at least I thought so, more on that later…), so I moved straight to the parameters, which looked like this:

station=544321&
stname=%25E9%259E%258D%25E9%2583%25A8&
datepicker=2019-10

station and datepicker seemed straightforward as well; the issue was getting a list of station IDs to pass in. Luckily, I quickly stumbled across a local variable on the page that was an object of { station id: [Mandarin text, station name, more Mandarin, a digit?] }. I didn’t want to dig too deeply into any of these values other than the station ID, which I grabbed using Object.keys(stList).
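As a sketch of what that looked like (the entries here are illustrative, not real station data), grabbing the IDs was a single call:

```javascript
// Hypothetical shape of the page's stList variable — the real values are
// Mandarin strings scraped from the site; these entries are made up.
const stList = {
  "544321": ["鞍部", "Anbu", "臺北市", "1"],
  "544322": ["竹子湖", "Zhuzihu", "臺北市", "1"]
};

// The object's keys are the station IDs.
const stationIds = Object.keys(stList);
// stationIds → ["544321", "544322"]
```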

Now that I had the station, I moved on to trying to figure out the garbled mess of stname. After a bit of trial and error I figured out that it wasn’t just an encoded string, it was actually encoded twice.

>>> decodeURI("%25E9%259E%258D%25E9%2583%25A8")
"%E9%9E%8D%E9%83%A8"
>>> decodeURI(decodeURI("%25E9%259E%258D%25E9%2583%25A8"))
"鞍部"
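Going the other direction works the same way: encoding the name twice reproduces the doubly-encoded stname parameter from the URL. A sketch using encodeURIComponent, which escapes the % signs produced by the first pass:

```javascript
// First pass: percent-encode the UTF-8 bytes of the station name.
const once = encodeURIComponent("鞍部");
// once → "%E9%9E%8D%E9%83%A8"

// Second pass: each "%" from the first pass becomes "%25".
const twice = encodeURIComponent(once);
// twice → "%25E9%259E%258D%25E9%2583%25A8"
```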

Perfect. A list of 800 two-to-three-character codes in a language I did not understand. After a bit of trial and error, I figured out this was the first piece of Mandarin in stList, so I now knew the shape: { station id: [Mandarin station name, English station name, unknown Mandarin, unknown digit] }, which was enough to start writing the script to download those files. I exported the data:

const stationList = Object.keys(stList).map((id) => ({
  id: id,
  stname: stList[id][0].trim()
}));

const stringified = JSON.stringify(stationList);

The URL

With all of the data available, it was now just a matter of iterating through each combination and building the URL.

const dateRange = [2016, 2018];

// This loop lives inside an async function, so await/return are valid here.
for (let i = 0; i < stations.length; i++) {
  const station = stations[i];

  for (let year = dateRange[0]; year <= dateRange[1]; year++) {
    for (let month = 1; month <= 12; month++) {
      // Zero-pad the month so the datepicker matches the "2019-10" format.
      const strMonth = String(month).padStart(2, "0");
      const date = [year, strMonth].join("-");

      const req = API.get("/month-controller", {
        params: {
          station: station.id,
          // The API object will encode this a second time
          stname: encodeURI(station.stname),
          datepicker: date
        }
      });

      const resp = await req;
      return; // Return early, we only need to test one response.
    }
  }
}

Once I sent the first request, though, I realized I had made a major oversight. The URL on the ‘Get CSV’ button was the same URL that returned the HTML page; it wasn’t a separate endpoint at all, just a button with an onclick handler. That click handler parsed the HTML already on the page, built a CSV file, and served it to your browser. I still needed the URL to fetch the raw HTML, but now I also had to parse that HTML and build the CSV by hand. I leveraged a couple of existing libraries to help with this: DOMParser (from xmldom) and node-table-to-csv. These libraries solved the problem very simply, if somewhat resource-heavily.

const DOMParser = require("xmldom").DOMParser;
const tableToCsv = require("node-table-to-csv");
const parser = new DOMParser();

// Parse the raw HTML response, pull out the data table,
// serialize it back to an HTML string, and convert that to CSV.
const doc = parser.parseFromString(resp.data, "text/html");
const table = doc.getElementById("MyTable");
const tableHTML = table.toString();
const csv = tableToCsv(tableHTML);

The data

A CSV string isn’t too useful on its own, so I dumped the data into a set of folders. There were so many stations that I decided to separate the data a bit further, adding a region to the nesting: region -> station -> {year}-{month}.csv. The region of the station turned out to be the second Mandarin string in the stList array, so I adjusted the parameter scraper to include it.

const stationList = Object.keys(stList).map((id) => ({
  id: id,
  stname: stList[id][0].trim(),
  region: stList[id][2].trim()
}));

const groupedStationList = stationList.reduce((out, item) => {
  const { region, ...rest } = item;
  if (!out[region]) {
    out[region] = [];
  }
  out[region].push(rest);
  return out;
}, {});
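With a couple of illustrative stations, the reducer yields one array per region (the values here are made up):

```javascript
const stationList = [
  { id: "544321", stname: "鞍部", region: "臺北市" },
  { id: "544322", stname: "竹子湖", region: "臺北市" }
];

// Same reducer as above: split off the region key and bucket the rest under it.
const groupedStationList = stationList.reduce((out, item) => {
  const { region, ...rest } = item;
  if (!out[region]) {
    out[region] = [];
  }
  out[region].push(rest);
  return out;
}, {});
// groupedStationList → { "臺北市": [{ id: "544321", stname: "鞍部" },
//                                   { id: "544322", stname: "竹子湖" }] }
```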

Exporting the data was as simple as building the folder path and dumping the CSV string.

// regionID and s (the current station) come from the surrounding loops.
const folderPath = path.join("output", regionID, s.id);
fs.mkdirSync(folderPath, { recursive: true });

const filePath = path.join(folderPath, `${date}.csv`);
fs.writeFileSync(filePath, csv);

Things I learned

Unique window properties

I pulled this section out into its own article… It’s incredibly useful to be able to find variables and functions on a webpage in order to figure out how its logic actually works.

Link

Heavy code

Even though some of that code is extremely gross (parsing an HTML file, selecting an element, dumping it to a string, then re-parsing it and dumping to CSV), it took a lot less time to get it working and get the data into the “client’s” hands than doing it “properly” would have. Even though the code is inefficient, it was entirely appropriate for the task at hand.


Josh Manning

Written by Josh Manning who lives and works in Manhattan, KS.