File systems

Whether you run Windows, Mac OS X, or some flavor of Linux, you've no doubt had experience with at least some of the following concepts, though you may not be familiar with all of the terminology, and not all of you will have seen the same subset of these concepts. To ensure that we're all on the same page, here are some things you'll need to know about file systems. A file system is software that manages how information is stored on a storage device such as a hard drive. Information is stored in files, which are containers in which a sequence of bytes are stored. Each byte is effectively a sequence of eight "digits" that are all either 1 or 0; each of these is called a bit. The bytes in each file are interpreted differently depending on what kind of file it is (i.e., what the program consuming the file expects it to contain, such as text, an image, a song, or a video). Each file has a filename that can be used to identify it. While some operating systems treat files differently depending on their names — most often based on the extension, which is the part of the filename that follows the last dot (e.g., .doc in the filename invoice.doc) — there is not necessarily a relationship between the name of a file and the kind of data it contains. Side note: If you're running Windows on your own machine, have you completed Assignment #0 yet? In particular, Step 1 of the installation instructions for Windows is not14-1-14 ICS 32 Winter 2014, Project #1: Begin the Begin www.ics.uci.edu/~thornton/ics32/ProjectGuide/Project1/ 3/8 to be overlooked; it's going to be really important in the context of this project! Windows, by default, hides filename extensions in its user interface, but we're going to need to see the filenames for what they are. If you haven't done this, DO IT NOW! You'll be glad you did — and not just in your work on this project. Files are arranged into directories (also called folders, but we'll settle on one terminology and call them "directories"). Like files, directories have names. Directories can contain not only files but other directories, which can, in turn, contain other directories, and so on. A directory inside of another directory is generally called a subdirectory. The rules that govern directories are, more or less, the same as the rules that govern the directories inside of them, except for one special directory called the root, which forms a kind of starting point for a search; the root directory is the one directory on a device that is not inside another directory. More than one file or directory on a device can have the same name; they are instead identified more precisely by a path, which indicates the sequence of directories, starting from the root, one would have to descend into in order to find a file or directory. While the concept of a path transcends operating systems, they are written slightly differently on different ones. Instead of storing a file in a directory, you can also store a symbolic link that points to another file stored in some other directory, which allows the same file to appear in more than one directory. Similarly, a directory can contain a symbolic link to another directory (and this can get a little bit strange, such as directories that link back to their parent or even to themselves) This project won't require you to handle symbolic links, though you're welcome to take a crack at it if you're so inclined. Do be aware, though, that it may not be easily possible to create them on Windows — it can be done, but requires tweaking security settings, and would require administrative access to your machine, which means you wouldn't be able to do it in the ICS labs. If most of these terms seem unfamiliar, it's worth doing a bit of research and experimentation to be sure you understand what they are before you proceed too far through this project. In addition to being useful in the context of the problem at hand, this is good practical knowledge that you'll need if you want to be successful as a programmer. The program The program you'll be writing for this project is one that can search for files in a directory (and all of its subdirectories, and their subdirectories, and so on) with interesting characteristics and take some kind of action on those files. Both the notion of interesting characteristics and taking action will be configurable and can each work in a few different ways, but the core act of finding all files will always be the same. The user interface When executed, your program should interact with the user by printing to the console and reading input from the console (e.g., using the print() and input() functions). It should do the following: Ask the user to specify the path to the directory in which the search for files should be rooted. For example, on Windows, a user might type C:\Program Files or D:\Python33\Lib, in which case only files in or "underneath" the chosen directory will be found. If the specified directory does not exist, print an error message and ask the user to specify another directory. Ask the user which of these search characteristics to use in deciding whether files are "interesting" and should have action taken on them. 14-1-14 ICS 32 Winter 2014, Project #1: Begin the Begin www.ics.uci.edu/~thornton/ics32/ProjectGuide/Project1/ 4/8 Search by Name. Files whose names exactly match a particular name. If the user chooses this characteristic, ask what filename they'd like to search for. For example, if the user specifies boo.jpg, only files whose names are exactly boo.jpg will be considered interesting. Search by Name Ending. Files whose names end with a particular string. If the user chooses this characteristic, ask what ending they'd like to search for. For example, if the user specifies .py, all files whose names end in .py will be considered interesting. Search by Size. Files whose size, measured in bytes, exceeds a specified threshold. If the user chooses this characteristic, ask what that threshold should be. For example, if the user specifies 2097151, files whose sizes are at least 2,097,152 bytes will be considered interesting. Ask the user which of these actions should be taken on each of the interesting files. Always print the file's path to the console, on its own line of output, before taking the action, so it will be clear what files, if any, were affected. Print Path Only. Print the file's path to the console, just as you will always do, but don't otherwise do anything with it. Print First Line of Text. Open the file, assuming that it's a text file, read the first line of text from the file and print that text to the console. (If the file does not contain text, it's fine if this choice prints unreadable garbage or even crashes your program; we'll only test this feature on text files.) Copy File. Make a copy of the file and store it in the same directory where the original resides, but the copy should have .dup (short for "duplicate") appended to the filename. For example, if the interesting file is C:\pictures\boo.jpg, you could copy it to C:\pictures\boo.jpg.dup. Touch File. "Touch" the file, which means to modify its last modified timestamp to be the current date/time. Whenever the user is asked to specify input, remove spaces from the beginning and end of whatever was typed before doing anything. If, after removing the spaces, the input is empty, does not match an available choice, or is the wrong type (e.g., a number is expected and the user types alex), print an error message and ask the user to try again. Your program should not crash as a result of incorrect input from the user. Running the search and taking action Your search should look for files in the directory where the user asked for the search to be rooted, in any subdirectories of that directory, in any of their subdirectories, and so on, for as deep as the directory structure goes. Outside of the occurrence of symbolic links, which we're ignoring for the purposes of this project, directory structures are hierarchical (i.e., directories have subdirectories inside of them, and those subdirectories have the same structure as their "parents"). In general, the order in which you take action on files is not important, so long as every interesting file has the action taken on it exactly once. Handling failure You'll want to handle failure carefully in this program. In general, your program should not crash just because one activity fails; it should instead note the failure (by printing a readable message to the console) and continue, if possible. For example: 14-1-14 ICS 32 Winter 2014, Project #1: Begin the Begin www.ics.uci.edu/~thornton/ics32/ProjectGuide/Project1/ 5/8 If, during the search, accessing some directory fails, the search should still continue attempting to access other directories. If taking action on an interesting file fails, the program should continue processing additional interesting files. The one scenario where it's fine for your program to misbehave (or even crash) is when printing the first line of text from a file that is not a text file (e.g., an image or a video). It turns out that detecting whether a file contains only text is not nearly as trivial as it sounds, so we'll avoid that problem here. Organization of your program Your program should be written in a single Python module (i.e., a single .py file). You can feel free to name the file anything you'd like, but running your module (e.g., by loading it in IDLE and pressing F5 or selecting Run Module from the Run menu) should result in your program being executed and your user interface being displayed, but importing your module manually using an import statement should not. The way to differentiate between these scenarios is to write an if statement, outside of any function (and usually at the bottom of your module), that looks like this: if __name__ == '__main__': the_things_you_want_to_happen_only_when_the_module_is_run The expression __name__ == '__main__' will only return True within a module that was originally executed using F5 in IDLE; it will return False in any other module (e.g., in a module that has been imported instead). A word about documentation As you likely did in ICS 31, you are required to include type annotations and docstrings on every one of your functions. This will not only make your design clear to us when we're grading your project, but will also clarify your own design for yourself; details like this will help you to read and understand your own program. For example, if you were to write a function find_max that finds the maximum integer in a list of integers, you might start the function this way: def find_max(numbers: [int]) - int: '''Finds and returns the maximum number in a list of numbers''' ... Other aspects of how this project (and others) will be graded are described on the front page of the Project Guide. An important design goal As you write your program, you'll want to be on the lookout for complex functions that can be made simpler by breaking them up into multiple smaller functions. Taking complexity and putting it into a function with a well-chosen name and clearly-named parameters is a great way to tame that complexity; this is the first step in building an ability to write significantly larger programs than you've written so far, so be sure that you're getting some practice in taking that step in this project. One of the criteria we'll use in grading your project is to assess the quality of its design; in this project, the largest factor affecting quality is the extent to which functions were broken up when they are long or cumbersome. 14-1-14 ICS 32 Winter 2014, Project #1: Begin the Begin www.ics.uci.edu/~thornton/ics32/ProjectGuide/Project1/ 6/8 A good rule of thumb is to consider what you would say if I asked you "What does this function do?" If the answer is more than a sentence or two, it's probably doing too much, and might better be implemented as multiple functions that each do part of the job. (One of the good things about writing docstrings in your functions is that it forces you to test this; if your docstring needs to be especially long, your function is probably too complex.) The Python Standard Library documentation At python.org, you'll find a comprehensive set of documentation about the Python language and its standard library. Two good places to start are these pages: Python 3.3.3 documentation Python 3.3.3 Standard Library documentation You'll probably need to spend a fair amount of time, especially early in your work on this project, looking at the Standard Library documentation and assessing what functions in that library might be able to assist you in your solution. The Standard Library documentation is mostly broken down by module. It's tough sometimes to know where to look, as there are so many modules, so we'll sometimes try to point you in the right direction. For this project, you'll want to pay special attention to (at least) these modules: os, os.path, and shutil, though you might also find useful tools in other modules. Feel free to poke around and see what's available; if it's in the Standard Library, you can feel free to use it, except for one limitation pointed out below. Note, too, that part of understanding what's available in the Standard Library is careful, targeted experimentation. If you're not sure what a library function does just by reading its documentation, you can fire up IDLE and experiment with it; one of the nice things about an interpreted programming language like Python is that this experimentation can be done in the interpreter, without having to make changes to your program until you're more sure that you've found library functions that will help. For the most part, this kind of experimentation is harmless — if things don't work the way you expect, you'll see an error message or behavior that doesn't match what you expect — though you do need to be a little bit careful calling functions that do potentially destructive things like deleting files. A note about versioning As you're reading through Python's documentation, be sure that you're reading the right version of it. Near the top of each page is a drop-down list that allows you to select a version. Be sure you're reading the documentation for version 3.3.3. (In particular, if search engines lead you to Python documentation, you'll find that they quite often lead you to the documentation for older versions of Python than 3.3.3, which is relatively new. But you can usually just select "3.3.3" in the drop-down list in order to bring up the corresponding page of the documentation for our version of Python.) Limitations on what you can use For the most part, if you find a function that can assist in your solution, you can feel free to use it; note that some care is sometimes required in ensuring that the function you find is actually a good fit. There are two functions in the Python Standard Library that are off-limits for your use in this project: os.walk and os.fwalk, both of which perform a full traversal of files in a directory structure, which is one of the skills I'd like you to build in this project. (In general, it's safe to assume that you can use anything I don't specifically call out as being off-limits.) 14-1-14 ICS 32 Winter 2014, Project #1: Begin the Begin www.ics.uci.edu/~thornton/ics32/ProjectGuide/Project1/ 7/8 The value of working incrementally You may be accustomed to solving relatively small, mostly self-contained problems that you can handle all at once. This program, while not giant by real-world standards, consists of more moving parts than you might be used to writing, so you will likely find that attempting to write this program all at once will lead you astray, even if an all-inclusive, everything-at-once approach worked well for you in ICS 31. A better approach for this project — and one that will increase in importance this quarter, as the problems we solve get larger and more complex — is to look for stable ground as often as you can find it. Rather than trying to write the entire program, write some small part of it that you understand well, then find a way to verify that the part you wrote works as you expected (e.g., by writing a function and then calling it in the Python interpreter manually). When it does, you've reached stable ground, and you're ready to choose the next small step you should take, again taking care to choose something that you'll be able to verify after you're done with it. Ideally, you'll find yourself reaching stable ground quite often — every few minutes, if things are going well — and this will help you to feel confident that you're making progress. If you find yourself stuck on one problem and have no idea how to move further, find another positive step you can take. For example, if you're not sure how to find all of the files in a directory structure, write a function that simply returns a hard-coded list of file paths and use that temporarily, then move on and work on something else, eventually working your way back to the places that were causing you trouble. The goal is always to be making some kind of progress, and it's not necessary to write the program in a particular order (e.g., the order in which things happen in the user interface). Also, when you reach what you believe to be stable ground, it's not a bad idea to make a copy of your Python module before proceeding. That way, if your next step doesn't go as you planned, you maintain the option of "rolling back" to the previous, stable version and trying again. It also gives you something stable to turn in if you find that the deadline arrives and you're not finished; it's always better to turn in a partial program that does something correctly, rather than one that doesn't run. Don't feel like you need to keep every stable version, but it's not a bad idea to at least have the most recent one. (One of the practical skills you'll need to start thinking about, if you haven't already, is staying organized. If you have a couple of versions of a file and find that you're often not sure which is which, then you need a better organizational scheme — better file names, better directory names, or whatever helps it to be more obvious to you.) So, what steps should I take? There are a variety of ways to build this program from beginning to end, so don't go looking for the "perfect way" to do it. Find a small step you can take, take it, and then find another. Of course, there are missteps you might take along the way, but the best way to learn how to write programs this way is simply to do it; you'll learn as much from your missteps as you will from the ones that work out well. As an example, though, think about the fact that the program is built around the core notion of "Find all of the unique files in a directory, its subdirectories, their subdirectories, and so on, and return a list of their paths." That sounds like a pretty good step to start with, except that it's actually a bigger step than it sounds like. So you could start even smaller: write a function that finds the files in a directory but ignores subdirectories. Once you can do that, add handling for one level of subdirectories (but assume they have no subdirectories inside of them). Then consider unbounded subdirectory depth and handle that scenario. Whenever you're working on something and you feel you've bitten off more than you can chew, think 14-1-14 ICS 32 Winter 2014, Project #1: Begin the Begin www.ics.uci.edu/~thornton/ics32/ProjectGuide/Project1/ 8/8 about ways to break the problem into smaller ones; eventually, you'll be left with a problem you can think through and complete. If you're not sure what step to take next, feel free to ask us; we'll help you find a task that will lead you to stable ground. Testing Testing this program is going to seem cumbersome, because it will require creating directory structures and files in various configurations, then running the program to see how it behaves. You might find that it's worth your time to automate some scenarios, by writing short Python functions (perhaps in a separate module) that create directories and files in interesting combinations. You won't likely find that it will require writing a lot of code, but it will pay you back (and then some!) as you test your program. It's quite common in real-world software development to write code whose sole use is as a development tool; it's not part of the program, per se, and will not be given to users of the program, but is strictly meant to make it easier to build the program. You'll be well-served to explore ways to use Python to lighten your testing burden; if there are five interesting scenarios you've identified, you should write five Python functions that can set them up for you automatically, so you can get them back any time you need to test them. If it takes a half-hour to write the test programs, but it takes five minutes to set up the tests manually, as soon as you've set up the tests six times, the time spent writing the test program will begin paying you back. Computers automate tasks that are otherwise cumbersome, and testing certainly falls into that category. (We'll see this theme repeated throughout the course.)
Powered by