In Openrefine
Hands-on: basic transformations
Getting Started
We'll use a subset 4 of Raleigh Building Permits data
- Launch the Open-Refine icon from your reckoner (find and double-click the jewel icon.)
- Installations / Commencement / End instructions
- Owen Stephens'south helpful video illustrating installation
- Remember: The User Interface for Refine is Chrome or Firefox
- If your default browser is one of these, Refine will auto-launch to http://127.0.0.1:3333
- If your default browser is IE, you'll demand to open the following URL http://127.0.0.1:3333 in Chrome or Firefox
-
https://raw.githubusercontent.com/libjohn/openrefine/master/data/subset-RBP-narrow.csv
- Click
- Select: Columns are separated past "commas (CSV)"
- Change the Project Proper name to
Raleigh Building Permitsand click Create Projection >> (summit-correct)
Shutting Down OpenRefine
It's IMPORTANT to properly shutdown the application. OpenRefine will automatically relieve your projection as you transform your data. However, in my feel your last operation may have to be manually saved by following the procedures below…
Windows: Control-C Mac: Click the OR app in the doc, invoke Quit
Note: It is possible, but not guaranteed, to lose data if you don't follow the rather unintuitive shutdown procedures. Ameliorate prophylactic than sorry.
Facets & Cluster
Facet & Cluster
Goal ane: Create a facet of authorized work. Cluster & Merge types of authorized work.
-
- How many rows are in this dataset?
21,982 rows
- Slide the bottom column navigation bar to the right
- notice the authorized_work cavalcade
- click the column header:
-
- How many facets are at that place?
7633 choices
- Y'all may see a program warning, a dialog-box error message screen shot prompting you to fix the maximum number of choices show in the text facet. If so, take the default and proceed.
- In the Facet box, click
count.-
- What is the 4th most popular type of authorized work?
SCREEN PORCH
-
- How many permits are recorded?
233
-
- To observe spelling clusters, click the
Clusterbutton in the facet box - Click the
Select Allbutton, then theMerge Selected & Re-Clusterbutton, to merge all terms (accepting the default: Method = "key collision" ; Keying Part = "fingerprint") - Repeat previous step using the "ngram-fingerprint" Keying Function, so shut the Cluster & Edit dialog box
-
- How many SCREEN PORCH facets now exist?
238 - Compare to your answer in #4 above
Chemical compound Facets
- Select the DECK Facet.
-
- How many matching rows match the Deck Facet?
340 matching rows
-
- Select the SCREEN PORCH facet combined with the DECK Facet. (Hover your mouse over the facet, click
include)-
- Now how many matching rows exist?
578 matching rows
-
- On the "land_use_code" column, make a text facet and limit to "SINGLE Family"
- On the "county" column, make a text facet and limit to Durham Canton (DURH)
-
- How many "Unmarried Family" homes received permits in Durham County for Screen Porches or Decks?
iii
-
- Is the authorized work for Screen Porches, Decks, or both?
both
-
- Click the "Remove All" button to remove all text facets.
-
- How many matching rows are in the dataset now?
21,982
-
Mass Editing
It'due south of import to understand OpenRefine was designed to transform data in majority. It is possible to edit single data cells but it is not as convenient as some other, more WYSWIG, tools. This practice will help you lot acquire how to attain these kinds of mass data transformations
- Make a Text facet on the work_type_description column
- In that location are two facets for new buildings: "NEW BUILDING" and "New Edifice".
-
- How many "NEW BUILDING" rows exist? (click to reveal respond)
3 matching rows
-
- how many "New Building" rows exist?
9,668 matching rows
-
- Select "NEW BUILDING" facet, limiting to 3 matching rows. To the right of the "NEW Edifice" facet, hover your mouse over the "edit" feature; click "edit" and alter the text to championship case: "New Edifice" ; click Apply
-
- How many "New Building" rows exist now?
9,671 matching rows
-
- Mass edit "OTHER" & "Other" and then they take the same value
- Mass edit "ALTERATIONS/REPAIRS" and "Alterations/repairs" so they have the aforementioned value
- Click "Remove All" to remove the facet window
Split information in cells
-
- Separator =
(> OK
- Separator =
-
- Separator =
,(i.e. accept default and click) > OK
- Separator =
-
-
breadth
-
-
-
longitude
-
(more information transformation could exist done, merely let'due south move on for now…)
Concatenate cells together
-
- New column proper noun =
Full Clarification - Expression =
value + cells["proposed_work"].value
- New column proper noun =
The last step adds two columns together, but the preview screen is difficult to read. Make it readable by using the next expression instead …
- Expression =
value + " sq ft. " + cells["proposed_work"].value> OK
Search & Supercede, Plus More
Looking at the breadth and longitude cells, one column appears in light-green text (indicating OpenRefine considers information those cells equally numbers) and ane cavalcade appears in black with a endmost parenthesis in the last position. Convert both columns to text, trim leading and abaft spaces, then find and supplant the parenthesis
Convert Data Types
Remove Whitespace
Search & Replace
Search and Supervene upon is normally performed as a information transformation using the following function: value.search("old text","new text"). In the example below we replace a closing parenthesis with zip, effectively removing the abaft parenthesis. The case may appear strange since the supervene upon part exists inside a gear up of parenthesis. Retrieve the text y'all are replacing is idnetified within the first set of quotation marks You will identify as the replacing text within the second set of quotation marks. I've draw red circles effectually the role, every bit well as the before and after text preview to clarify how the procedure will work.
-
- Expression =
value.replace(")","")
- Expression =
Web Scraping
Select a subset
We want to gather the FIPS code for a subset of the data. The government server returns information in a JSON format so we'll parse the data later on nosotros retrieve it. Start we'll subset our dataset for expediency. This limits our waiting fourth dimension during the workshop.
-
- expression =
value.slice(6,10)
- expression =
- select the "2014" facet
- select the "3 Flavor ROOM" facet
You should at present accept 6 matching rows.
API
At present let's fetch the information from an API made available via the National Broadband Map. This API returns a FIPS code if we give it a county name (or in this instance, even a partial county name.)
- fetch JSON data from the National Broadband Map. Nosotros'll use the API documentation for Geography by Name API which returns Census geography for a geography proper noun (eastward.g. Durham)
- Documentation
- The documentation informs u.s. that the format of the URL we desire to construct is as follows: http://www.broadbandmap.gov/broadbandmap/census/county/durh?format=json
- Notice the data values in the "county" column. All we do is construct a URL which calls the value of the cells from each row of the "country" column
- New cavalcade proper noun =
JSON data - Throttle filibuster =
2000 - Expression =
'https://world wide web.broadbandmap.gov/broadbandmap/demography/canton/'+value+'?format=json' - OK
- Expect for the results. If you express to the matching rows in the select subset section this will only take a few seconds.
- Documentation
Parse
Now parse the value of the JSON data "fips" element; call the "fips" key when traversing the "county" objects from the Results set.
-
- New cavalcade name =
FIPS Code
- expression =
value.parseJson().Results.canton[0].fips
- Annotation the foursquare-bracket (
[0]) notation in theParseJson()role denotes and identifies the starting time array element. It's the first element because in OpenRefine counting begins with nothing (e.g. 0,one,2,three,4,5). The county array in the example below consists of only 1 value chemical element (consisting of iv, named key/value pairs; of which fips is 1 key). Since the JSON note indicates county is an array, in this case of quantity 1, we identify that get-go element of the array by the number '0'. Run into the JSON example beneath
- New cavalcade name =
JSON Data Instance
JSON 5 is JavaScript Object Notation a information wrapper. The API, in this case, returns the data in a JSON format.
{ "condition": "OK", "responseTime": xiv, "message": [ ], "Results": { "county": [ { "geographyType": "COUNTY2010", "stateFips": "37", "fips": "37063", "proper noun": "Durham" } ] } } Source: https://libjohn.github.io/openrefine/start.html
0 Response to "In Openrefine"
Post a Comment